Sora is a text-to-video model developed by the U.S.-based artificial intelligence research organization OpenAI. It can generate videos from descriptive prompts, extend existing videos forwards or backwards in time, and generate videos from still images. As of February 2024, it is unreleased and not available to the public.
Summary
Compared with other video-generation AI systems, Sora is impressive because it can create videos up to a minute long from text at high resolution and fidelity, supports diverse video formats and multimodal input, exhibits strong real-world interaction, and maintains 3D consistency and long-term consistency of characters and scenes.
Sora uses a video compression network to compress an input image or video into a low-dimensional representation. This is similar to standardizing photos of different sizes and resolutions without losing what makes each one distinct.
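The idea of mapping videos into one lower-dimensional latent space can be sketched as follows. This is a toy stand-in, not OpenAI's actual network: the real compression is done by a learned autoencoder, while here average pooling fakes it, and all tensor shapes are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for Sora's video compression network. In practice this is a
# learned autoencoder; here spatial average pooling fakes "compression" so
# the idea stays runnable. All sizes are illustrative assumptions.
def compress(video, factor=4):
    """Map a (time, H, W, C) video to a lower-dimensional latent by
    averaging non-overlapping factor x factor spatial blocks."""
    t, h, w, c = video.shape
    blocks = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))   # shape: (time, H/factor, W/factor, C)

video = np.random.rand(16, 128, 128, 3)   # a 16-frame toy "video"
latent = compress(video)
print(latent.shape)  # (16, 32, 32, 3)
```

Videos of different resolutions would all be mapped into the same kind of latent grid, which is what makes the later patch-based processing uniform.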
Sora then breaks the compressed data down into spacetime patches. Each patch contains a portion of the temporal and spatial information, making the data easier to process and store.
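Cutting a latent video into spacetime patches can be illustrated with a few lines of array reshaping. The latent shape and patch sizes below are illustrative assumptions, not Sora's real values.

```python
import numpy as np

# Hypothetical latent video: 8 frames of a 32x32 latent grid with 4 channels.
latent = np.random.rand(8, 32, 32, 4)   # (time, height, width, channels)

def to_spacetime_patches(x, pt=2, ph=8, pw=8):
    """Cut a latent video into non-overlapping spacetime patches.

    Each patch spans `pt` frames and a `ph` x `pw` spatial region, so it
    carries both temporal and spatial information, flattened to one token.
    """
    t, h, w, c = x.shape
    x = x.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    # Group the patch-index axes together, then flatten each patch.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * c)

patches = to_spacetime_patches(latent)
print(patches.shape)  # (4 * 4 * 4, 2 * 8 * 8 * 4) = (64, 512)
```

Each row is one spacetime patch, the video analogue of the image patches used by Vision Transformers.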
Sora uses a Diffusion Transformer (DiT) to generate video from text prompts. The process starts with a video resembling pure random noise. Guided by the prompt and by patterns learned from large collections of videos and images, Sora gradually removes the noise, refining the video step by step until it closely matches the text description.
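The denoising loop can be caricatured as follows. In the real model a Transformer trained on video data predicts the noise at each step; here a hypothetical `predict_clean` stand-in simply returns a fixed target latent so the loop stays runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend latent that "matches the prompt" -- an illustrative assumption.
target = np.ones((4, 8, 8))

def predict_clean(noisy, step):
    # Placeholder for the learned model: a real DiT would condition on the
    # text prompt and the current noise level.
    return target

x = rng.standard_normal((4, 8, 8))    # start from pure random noise
for step in range(100):
    # Take a small step from the noisy sample toward the model's estimate.
    x = 0.9 * x + 0.1 * predict_clean(x, step)

# After many steps the sample has converged near the target.
print(np.abs(x - target).max())  # close to zero
```

The key point the sketch preserves is the iterative structure: generation is not a single forward pass but many small refinements of an initially random video.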