Unlocking the Power: Unveiling the Secrets of Sora
MAY 21ST, 2024

Since the advent of ChatGPT at the end of 2022, artificial intelligence has once again become the focal point of global attention, with AI based on large language models (LLMs) emerging as the "darling" of the field. Over the past year, we have witnessed rapid progress in the areas of text-to-text and text-to-image AI, while the development in the text-to-video domain has been relatively slow. However, at the beginning of 2024, OpenAI dropped another “bombshell”—the text-to-video model Sora, completing the final piece of the content creation puzzle with AI.  

A year ago, a video of Will Smith eating noodles went viral on social media. In the footage, his face was contorted and his features distorted as he ate spaghetti in an unnatural manner. This unflattering clip was a reminder that AI video generation was still taking its first steps.

Smith Eating Noodles—Generated by AI

Just one year later, a video generated by Sora depicting "a stylish woman walks down a Tokyo street" once again set social media ablaze. In March, OpenAI collaborated with artists from around the world to officially launch a series of surreal, tradition-defying short films made with Sora. The short film "Air Head," created by director Walter Woodman's team with Sora, features exquisite, realistic visuals and imaginative, free-spirited content. It is fair to say that Sora has "trounced" mainstream AI video models such as Gen-2, Pika, and Stable Video Diffusion since its debut.

"Air Head"

The evolution of AI far exceeds expectations, and it is easy to foresee that existing industries, including short video, gaming, film, television, and advertising, will be reshaped in the near future. The arrival of Sora seems to bring us one step closer to building world models.

What gives Sora such formidable power? What technologies does it employ? Drawing on the official technical report and numerous related sources, this article explains the technology behind Sora and the keys to its success.

What Core Problem Does Sora Address?

In a nutshell, the challenge Sora faces is how to transform various types of visual data into a unified representation method that can be trained holistically. 

Why pursue unified training? To answer this, let's first understand the mainstream AI video generation approaches before Sora. 

Pre-Sora Era AI Video Generation Methods

  • Expanding Based on Single Frame Image Content

Expanding based on a single frame means predicting the next frame based on the content of the current frame, with each frame being a continuation of the previous one, thus forming a continuous video stream (the essence of video is a series of images displayed consecutively). 

In this process, an image is first generated from a text description, and then a video is generated based on the image. However, there is a problem with this approach: the randomness inherent in generating images from text is amplified when generating videos from those images, resulting in low controllability and stability of the final video.

  • Direct Training on Entire Video Segments

Since frame-by-frame extrapolation does not yield good results, the approach shifts to training on entire video segments.

Here, a few-second video clip is typically selected, and the model is informed of the content displayed in the video. After extensive training, the AI can learn to generate video segments similar in style to the training data. The downside of this approach is that the AI learns content in fragments, making it difficult to produce long and continuous videos. 

One might ask: why not train on longer videos? The main reason is that video data is much larger than text or images, and GPU memory is limited, which rules out training on longer clips. Under these constraints, the AI's knowledge base is extremely limited, and when given content it has "never seen," the generated results are often unsatisfactory.

Therefore, to break through the bottleneck of AI video, these core issues must be addressed. 

Challenges in Video Model Training

Video data comes in various forms, from landscape to portrait, from 240p to 4K, with different aspect ratios and resolutions, making the video attributes highly diverse. The complexity and diversity of the data pose significant challenges for AI training, leading to poor model performance. This is why it is necessary to unify the representation of these video data first. 

Sora's core mission is to find a method to transform various types of visual data into a unified representation, enabling all video data to be effectively trained within a unified framework. 

Sora: A Milestone on the Path to AGI

"Our mission is to ensure that artificial general intelligence benefits all of humanity." (OpenAI)

OpenAI's goal has always been clear—to achieve artificial general intelligence (AGI). So, what significance does the birth of Sora have for achieving OpenAI's goal?

To achieve AGI, models must understand the world. Looking at OpenAI's trajectory: the initial GPT models let AI understand text (one dimension: length), then the DALL·E models let AI understand images (two dimensions: length and width), and now the Sora model lets AI understand video (three dimensions: length, width, and time).

By understanding text, images, and videos together, AI can progressively understand the world. Sora is a vanguard on OpenAI's path to AGI. It is not just a video generation model; as its technical report [1] is titled, it is a step toward "video generation models as world simulators."

OpenPie's vision coincides with OpenAI's goal. OpenPie believes that modeling human society and individual intelligence with a small number of symbols and computational models laid the foundation for early AI, but further gains depend on larger volumes of data and greater computing power. When we cannot construct groundbreaking new models, we can seek out more data and apply more compute to improve model accuracy, exchanging data and computing power for model capability and driving innovation in data computing systems. In the Large Model Data Computing System PieDataCS released by OpenPie, AI mathematical models, data, and computing are seamlessly integrated and mutually reinforcing as never before, becoming a new productive force driving high-quality social development [2].

The Secret of Sora

Sora is not the first text-to-video model to be released, so why has it caused such a sensation? What secrets lie behind it? Sora's training process in one sentence: a visual encoder compresses raw video into a latent space and decomposes it into spacetime patches; these spacetime patches, together with text conditioning, are trained and generated by a diffusion transformer; and the generated spacetime patches are finally mapped back to pixel space by the corresponding visual decoder.

Video Compression Network

Sora first transforms raw video data into low-dimensional latent-space features. The video data we watch daily is far too large for a model to process directly and must first be transformed into a low-dimensional representation. Here, OpenAI draws on a classic paper: Latent Diffusion Models [3].

The core idea of this paper is to distill the original image into a latent-space feature that retains the image's key information while greatly compressing the amount of data.

OpenAI has likely upgraded the paper's variational autoencoder (VAE), originally designed for images, to handle video data. In this way, Sora can transform large amounts of raw video into low-dimensional latent-space features that distill each video's key content.
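To make the idea of latent compression concrete, here is a toy sketch in plain Python. A fixed average-pooling step stands in for the learned visual encoder; a real VAE learns a far more expressive mapping, and all names here are illustrative, not OpenAI's.

```python
# Toy stand-in for a learned visual encoder: average-pool each frame to a
# smaller "latent" grid. This only illustrates the compression involved.

def average_pool(frame, k):
    """Downsample a 2D grid (list of lists) by a factor of k via mean pooling."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w - w % k, k)
        ]
        for i in range(0, h - h % k, k)
    ]

def encode_video(video, k=8):
    """Compress every frame; the result plays the role of the video's latent."""
    return [average_pool(frame, k) for frame in video]

# A 4-frame, 64x64 single-channel clip shrinks to 4 frames of 8x8 latents:
clip = [[[1.0] * 64 for _ in range(64)] for _ in range(4)]
latent = encode_video(clip, k=8)
# 64*64 = 4096 values per frame become 8*8 = 64: a 64x data reduction.
```

The decoder in the real system performs the inverse mapping, from latent features back to pixels.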

Spacetime Patches

To conduct large-scale AI video training, the basic unit of training data must first be defined. In LLMs, the basic unit of training is the Token [4]. OpenAI draws inspiration from the success of ChatGPT: the Token mechanism elegantly unifies different forms of text—code, mathematical symbols, and various natural languages. Can Sora find its own "Token"? 

Thanks to the research of predecessors, Sora has finally found the answer—Patch.

  • Vision Transformer (ViT)

What is a patch? A patch can be understood as an image block. When the image to be processed is large, training on it directly is impractical. The Vision Transformer [5] paper therefore proposes a method: divide the original image into equally sized image blocks (patches), serialize these patches, and add positional information (position embeddings). This turns a complex image into the sequence form most familiar to the Transformer architecture, whose self-attention mechanism captures the relationships between patches and ultimately understands the content of the entire image.

The Structure of ViT[5]
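The patchification step can be sketched in a few lines of plain Python. Here `patchify` is an illustrative stand-in for ViT's patch-embedding layer, and a bare integer index stands in for a learned position embedding.

```python
# Minimal sketch of ViT-style patchification: split an HxW image into
# non-overlapping PxP patches, flatten each, and attach its position index.

def patchify(image, p):
    """Return a list of (position, flat_patch) pairs for a 2D image."""
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "image must divide evenly into patches"
    patches = []
    pos = 0
    for i in range(0, h, p):
        for j in range(0, w, p):
            flat = [image[i + di][j + dj] for di in range(p) for dj in range(p)]
            patches.append((pos, flat))  # index stands in for a position embedding
            pos += 1
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # a 4x4 "image"
seq = patchify(image, 2)  # four 2x2 patches -> a sequence of length 4
```

The resulting sequence is exactly the kind of input a Transformer's self-attention can consume.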

A video can be seen as a sequence of images distributed along the time axis. Sora therefore adds a temporal dimension, upgrading static image patches into spacetime patches. Each spacetime patch contains both temporal and spatial information: it represents not only a small spatial region of the video but also how that region changes over a period of time.

With patches in place, spacetime patches at different positions within a single frame capture spatial correlations, while spacetime patches at the same position across consecutive frames capture temporal correlations. No patch is an isolated element; each is closely connected with its surroundings. In this way, Sora can understand and generate video content with rich spatial detail and temporal dynamics.

Decompose Sequence Frames into Spacetime Patches
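Extending the earlier patchification idea with a temporal axis can be sketched as follows: a T x H x W clip is cut into t x p x p "tubes," each covering a small spatial region over a short span of frames. The layout is illustrative, not OpenAI's actual scheme.

```python
# Sketch of spacetime patchification: video[frame][row][col] is split into
# flattened t*p*p spacetime patches (a small region over a few frames).

def spacetime_patchify(video, t, p):
    """Split a clip into flattened t x p x p spacetime patches."""
    T, h, w = len(video), len(video[0]), len(video[0][0])
    patches = []
    for f in range(0, T - T % t, t):          # temporal slots
        for i in range(0, h - h % p, p):      # spatial rows
            for j in range(0, w - w % p, p):  # spatial cols
                patches.append([
                    video[f + df][i + di][j + dj]
                    for df in range(t) for di in range(p) for dj in range(p)
                ])
    return patches

# 4 frames of a 4x4 clip where every pixel equals its frame index:
clip = [[[frame] * 4 for _ in range(4)] for frame in range(4)]
patches = spacetime_patchify(clip, t=2, p=2)
# 2 temporal slots x 4 spatial slots = 8 patches of 2*2*2 = 8 values each.
```

Each patch mixes values from two consecutive frames, which is exactly what lets the model see motion as well as appearance.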

  • Native Resolution (NaViT)

However, the ViT model has a significant drawback: the input image must be resized to a fixed (typically square) resolution, and every patch must be the same fixed size. Real-world videos, however, are wide or tall, not square.

So OpenAI adopted another solution: the "Patch n' Pack" technique from NaViT [6], which can process inputs of any resolution and aspect ratio.

This technique breaks content with different aspect ratios and resolutions into patches whose sizes can be adjusted as needed. Patches from different images can then be flexibly packed into the same sequence for unified training. In addition, redundant patches can be discarded based on image similarity, greatly reducing training cost and speeding up training.

Patch n' Pack
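A toy version of the packing idea, assuming a simple greedy strategy: patch sequences from images of different sizes are packed into fixed-length rows, with an example id per slot so attention can later be masked per image. The function name and strategy are illustrative; real NaViT also drops redundant patches and builds the attention masks.

```python
# Toy "Patch n' Pack": greedily pack per-image patch counts into rows of a
# fixed sequence length, recording which image each slot belongs to.

def pack(sequences, seq_len, pad_id=-1):
    """sequences: list of (image_id, n_patches). Returns rows of slot ids."""
    rows, current = [], []
    for img_id, n_patches in sequences:
        assert n_patches <= seq_len, "one image's patches must fit in a row"
        if len(current) + n_patches > seq_len:
            current += [pad_id] * (seq_len - len(current))  # pad the row out
            rows.append(current)
            current = []
        current += [img_id] * n_patches  # each slot remembers its image
    if current:
        current += [pad_id] * (seq_len - len(current))
        rows.append(current)
    return rows

# Three images with 3, 4, and 2 patches packed into rows of length 6:
rows = pack([(0, 3), (1, 4), (2, 2)], seq_len=6)
# -> [[0, 0, 0, -1, -1, -1], [1, 1, 1, 1, 2, 2]]
```

Because rows mix patches from different images, a per-image attention mask (derived from the slot ids) keeps each image attending only to itself.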

This is also why Sora can generate videos with different resolutions and aspect ratios. Training at the native aspect ratio improves the composition and framing of the output: cropping inevitably discards information, making it easier for the model to misread what the original footage intended to show and to produce videos that capture only part of the subject.

Spacetime patches play a role similar to tokens in LLMs: they are the basic units that make up a video. When we compress and decompose a video into a series of spacetime patches, we are converting continuous visual information into discrete units the model can process. They are the foundation of the model's learning and generation.

Text Understanding

We have now seen how Sora transforms raw video into trainable spacetime patches. However, before training can begin, one more problem must be solved: telling the model what each video is about.

To train a text-to-video model, a correspondence between text and video must be established. Training requires a large number of videos with matching text descriptions, but manually annotated descriptions are low-quality and inconsistent, which hurts training results. OpenAI therefore drew on the re-captioning technique from its own DALL·E 3 [7] and applied it to video.

Specifically, OpenAI first trained a highly descriptive captioner model to generate text captions for every video in the training set. These captions are then paired with the spacetime patches described earlier during training, allowing Sora to associate text descriptions with video patches.

In addition, OpenAI uses GPT to expand users' brief prompts into more detailed descriptive captions before sending them to the video model, enabling Sora to generate high-quality videos that accurately follow user prompts.

Video Training and Generation

The official technical report [1] states it plainly: Sora is a diffusion transformer, that is, a diffusion model with a Transformer as its backbone network.

  • Diffusion Transformer (DiT)

The concept of diffusion comes from a physical process: a drop of ink in water gradually spreads over time. Diffusion moves from low entropy to high entropy, and you can watch the ink disperse from a single drop throughout the water.

Inspired by this process, the diffusion model was born. It is the archetypal image-generation model; Stable Diffusion and Midjourney are both based on it. Its basic principle is to gradually add noise to an original image until it becomes pure noise, and then to reverse the process, denoising step by step to restore the image. By learning from a large number of these reversals, the model ultimately learns to generate a specific image from noise.
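The forward (noising) half of this process is easy to sketch, assuming a simple variance-preserving step; the schedule and constants here are illustrative, not any particular model's.

```python
# Sketch of forward diffusion: each step shrinks the signal slightly and
# mixes in fresh Gaussian noise, so after many steps the data is nearly
# pure noise. The model is trained to run this process in reverse.

import random

def add_noise(x, alpha, rng):
    """One variance-preserving diffusion step on a flat list of values."""
    return [alpha * v + (1 - alpha ** 2) ** 0.5 * rng.gauss(0, 1) for v in x]

rng = random.Random(0)   # seeded for reproducibility
x = [1.0] * 8            # a tiny "patch" of signal
for _ in range(1000):    # after many steps, x is statistically pure noise
    x = add_noise(x, alpha=0.98, rng=rng)
# The original signal's contribution has decayed by 0.98**1000 (~1.7e-9).
```

Learning the reverse direction, predicting the clean signal from a noisy one, is the hard part, and that is what the diffusion transformer is trained to do.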

According to the report, Sora likely replaces the U-Net backbone of the original diffusion model with a Transformer. Deep-learning experience shows that, compared with U-Net, the Transformer architecture scales better with parameter count: as parameters increase, its performance gains are more pronounced.

The Structure of DiT[8]

Through a process similar to that of the diffusion model, during training the model is given noisy patches (along with text prompts and other conditioning information) and learns, through repeated noising and denoising, to predict the original patches.

Predict The Original Patches From Noisy Patches
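The training objective described above can be sketched as a toy regression problem: noise a clean patch, ask the model for the clean patch back, and penalize the squared error. `model` here is a trivial stand-in for the diffusion transformer, and the names are illustrative.

```python
# Toy sketch of the training objective: given a noised patch (plus any
# conditioning), predict the clean patch and score it with mean squared error.

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def training_step(model, clean_patch, noise, alpha=0.7):
    """Noise the patch, ask the model to recover it, return the loss."""
    noisy = [alpha * c + (1 - alpha ** 2) ** 0.5 * n
             for c, n in zip(clean_patch, noise)]
    pred = model(noisy)            # predict the clean patch from the noisy one
    return mse(pred, clean_patch)  # the quantity minimized during training

identity = lambda x: x             # a (bad) model that just returns its input
loss = training_step(identity, clean_patch=[1.0, 0.0], noise=[0.5, -0.5])
```

A real training loop would backpropagate this loss through the transformer's parameters; the identity "model" merely shows the objective being computed.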

  • Video Generation 

Finally, let's summarize the entire process by which Sora generates videos from text. 

When a user inputs a text description, Sora first calls a model to expand it into a standard video description, then generates initial spacetime patches from noise based on that description. Sora then continually infers and generates subsequent spacetime patches from the existing patches and the text conditions (similar to how GPT predicts the next token from existing tokens). Finally, the corresponding decoder maps the generated latent representation back to pixel space, forming a video.
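These end-to-end steps can be condensed into an illustrative pipeline. Every function below is a placeholder for a large learned model (prompt expansion, diffusion sampling, decoding); none of the names or internals are OpenAI's.

```python
# Illustrative text-to-video pipeline; each stage is a trivial stand-in.

def expand_prompt(prompt):
    """GPT-style re-captioning: turn a short prompt into a detailed one."""
    return f"A detailed, cinematic description of: {prompt}"

def denoise(patches, caption, steps=50):
    """Diffusion sampling: iteratively refine noisy spacetime patches."""
    for _ in range(steps):
        patches = [0.9 * p for p in patches]  # placeholder for one denoise step
    return patches

def decode(patches):
    """Map latent spacetime patches back to pixel space."""
    return [round(p, 6) for p in patches]

def text_to_video(prompt, n_patches=4):
    caption = expand_prompt(prompt)
    noisy = [1.0] * n_patches        # start from pure noise
    latent = denoise(noisy, caption)
    return decode(latent)

video = text_to_video("a stylish woman walks down a Tokyo street")
```

The structure mirrors the article's summary: expand the prompt, sample patches under text conditioning, then decode latents to pixels.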

Potential of Data Computing

Reading Sora's technical report, we find that Sora has not achieved a single major technical breakthrough; rather, it integrates prior research very well. After all, no technology emerges out of nowhere. The key to Sora's success is the accumulation of computing power and data.

Sora showed a clear scaling effect during training. The figure below shows that, with fixed inputs and seed, sample quality improves markedly as training compute increases.

Base Compute vs 4x Compute vs 32x Compute

In addition, by learning from a large amount of data, Sora also exhibits some unexpected abilities.

  • 3D Consistency: Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, the characters and scene elements move consistently through three-dimensional space. 

  • Long-range Coherence And Object Permanence: In long shots, characters, animals, and objects can maintain a consistent appearance even after being occluded or leaving the frame. 

  • Interacting With The World: Sora can simulate simple actions that affect the state of the world. For example, a painter can leave new strokes on a canvas that persist over time.

  • Simulating Digital Worlds: Sora can also simulate game videos, such as "Minecraft." 

These capabilities require no explicit inductive biases for 3D, objects, and so on; they are purely phenomena of scale.


The success of Sora once again proves the effectiveness of the "brute force works miracles" approach: continuously expanding model scale directly improves performance, and this relies heavily on large volumes of high-quality data and ultra-large-scale computing power; neither data nor compute can be lacking.

Since its establishment, OpenPie has defined its mission as "Data Computing for New Discoveries," with the goal of creating an "infinite model game." Its Large Model Data Computing System, PieDataCS, rebuilds data storage and computing on cloud-native technology, providing one storage layer for multiple data computation engines (including PieCloudDB Database, PieCloudVector, and PieCloudML), making AI models larger and faster and upgrading the data system to the era of large models.

In PieDataCS, everything in the world can be digitized into data. Data trains initial models, and the trained models become computational rules that are fed back into the data computing system, iterating indefinitely to explore AI intelligence. In the future, OpenPie will continue to explore data computing, strengthen its core R&D capabilities, work with industry partners on best practices for the data-element industry, and promote intelligent decision-making.

