An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran1,2, Ilija Radosavovic2, Rahul Ravishankar2, Yossi Gandelsman1,2, Christoph Feichtenhofer1, Jitendra Malik1,2

1 Meta AI, FAIR, 2 UC Berkeley

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate.


Introduction

In a paper published in 1951, Shannon, having just published the foundational papers of information theory, proposed a “guessing game” of next word prediction to estimate the entropy of English. Nearly 70 years later, training a high-capacity transformer network on this task provided the generative pre-training backbone for Large Language Models.

Less well known is the fact that in 1954, Fred Attneave proposed an analog of Shannon’s task for images. To quote: “We may divide the picture into arbitrarily small elements which we “transmit” to a subject (S) in a cumulative sequence, having them guess at the color of each successive element until they are correct. This method of analysis resembles the scanning process used in television and facsimile systems and accomplishes the like purpose of transforming two spatial dimensions into a single sequence in time”.

In this paper, we empirically study autoregressive pre-training from videos. To perform our empirical study, we construct a family of autoregressive video models which we call Toto. We treat videos as sequences of visual tokens and train a causal transformer, based on the LLaMA architecture, on the next-token prediction task. We use a dVAE to tokenize frames into discrete tokens. Treating videos as sequences of tokens enables us to jointly train on videos and images in a unified format. We construct a diverse dataset of videos and images comprising over 1 trillion visual tokens. Our models are first pre-trained on this data and then evaluated on downstream tasks. We extract visual representations using attention pooling from relevant layers of the model.

Overall Framework: Starting with images and video frames from a collection of datasets, we tokenize each frame/image into discrete visual tokens independently. We pre-train the transformer by predicting the next visual token, with a context length of 4K tokens spanning images or video frames. Once trained, we take the intermediate representations and evaluate them on various tasks.
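To make the tokenization step concrete, below is a minimal sketch of how a clip might be mapped to a single training sequence. The tokenizer interface (`tokenizer.encode`) and the 16x16 per-frame token grid are illustrative assumptions, not the released implementation; the paper uses a dVAE with an 8k vocabulary and a 4k-token context (16 frames).

```python
import torch

TOKENS_PER_FRAME = 256   # assumed 16x16 grid of discrete codes per frame
FRAMES_PER_CLIP = 16     # 16 * 256 = 4096-token context
VOCAB_SIZE = 8192        # dVAE codebook size

def frames_to_sequence(frames, tokenizer) -> torch.Tensor:
    """Tokenize each frame independently and flatten in raster-scan order.

    frames: (T, 3, H, W) video frames, or T == 1 for a single image.
    tokenizer: any discrete tokenizer whose `encode` maps a frame to a
        (TOKENS_PER_FRAME,) LongTensor of code indices (hypothetical API).
    Returns a 1D LongTensor of length T * TOKENS_PER_FRAME.
    """
    tokens = [tokenizer.encode(frame) for frame in frames]  # T x (256,)
    return torch.cat(tokens, dim=0)  # raster order within and across frames
```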

Pre-training

Given a large collection of images and videos, we tokenize each of them into a 1D sequence using raster-scan ordering. This produces a dataset of token sequences $\{x^j_1, x^j_2, x^j_3, ..., x^j_n\}$, where $j$ indexes a sample (either a video or an image) and $n$ is the number of tokens in that sample. We model the density $p(x)$ as:

$$p(x^j) = \prod_{i=1}^{n} p(x^j_i | x^j_{i-1}, x^j_{i-2}, ..., x^j_{1}, \theta)$$

Here, $\theta$ denotes the model parameters, which are optimized by minimizing the negative log-likelihood loss:

$$\mathcal{L}_{\text{pre-train}} = \mathop{\mathbb{E}}_{x^j \sim X} -\log p(x^j).$$
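For readers who prefer code, a minimal sketch of this objective is given below, assuming `model` is a causal (decoder-only) transformer that maps (B, L) token ids to (B, L, vocab_size) logits; this illustrates the loss rather than the released training code.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over a batch of visual token sequences.

    tokens: (B, L) LongTensor of discrete visual token ids, L <= 4096.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position
    logits = model(inputs)                            # (B, L-1, vocab_size)
    # Average negative log-likelihood of the next token at every position.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```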

Using this loss, we pre-train models of different sizes on over one trillion visual tokens generated from images and videos. The figure shows the training loss of three differently sized models with 120M, 280M, and 1.1B parameters.

Training Loss Curves: We show the training loss curves for the base, large, and 1B models trained with dVAE tokens (vocabulary size of 8k) and a context length of 4k tokens (equivalent to 16 images or video frames).

Compute Optimal Scaling

We study the scaling behavior of Toto using $\mu$-Parameterization. First, we train a series of models, a1-a6, with linearly increasing hidden size and number of layers, using the VQGAN tokenizer. We then tune the learning rate for these models with $\mu$-Parameterization. The analysis yields an optimal learning rate of $2^{-7}$ for all model widths.

Once we find the optimal learning rate, we train the a1-a6 models on our data mixture, as described in the datasets table. The figure shows the loss versus compute of Toto models, revealing a clear power-law relationship between compute and validation loss. Based on these experiments, Toto follows the power law:

$$L(C) = 7.32 \cdot C^{-0.0378}$$

Interestingly, the corresponding power-law relationship for GPT-3 is:

$$L(C) = 2.57 \cdot C^{-0.0480}$$

While the two fits are not directly comparable, the scaling exponent indicates how much the loss decreases for additional compute. This shows that visual next-token prediction models such as Toto scale with compute, but at a slower rate than language-only models.
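As a rough illustration of what the exponents imply (the multiplicative constants are not comparable because the tokenizers and token distributions differ), the snippet below computes the expected loss reduction per 10x increase in compute under each fit.

```python
# Effect of a 10x compute increase under each fitted power law
# L(C) = a * C^b: the ratio L(10C) / L(C) = 10^b depends only on b.
for name, b in [("toto", -0.0378), ("GPT-3", -0.0480)]:
    ratio = 10 ** b
    print(f"{name}: 10x compute -> loss x {ratio:.3f} "
          f"({(1 - ratio) * 100:.1f}% reduction)")
# Prints approximately:
#   toto:  10x compute -> loss x 0.917 (8.3% reduction)
#   GPT-3: 10x compute -> loss x 0.895 (10.5% reduction)
```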


Experiments

ImageNet Results: We compare discriminative and generative models on the ImageNet recognition task. While performing comparably to other generative models, our model achieves the highest accuracy among autoregressive models. All models are evaluated with linear probing.

K400 Results: We compare discriminative and generative models on the Kinetics-400 action recognition task. While performing comparably to other generative models, our models are the first to show competitive performance on K400 with autoregressive pre-training, and they scale with larger model sizes.

Ego4D Results: Our model achieves mean-average precision comparable to previous work. We compare our method with FRCNN+Rnd, FRCNN+SF, Hiera, StillFast, VideoMAE, and MAE-ST.

Probing Across Layers, Models, and Tasks: We study the behavior of our models across multiple layers and tasks. For image classification, action recognition, and object tracking, all models behave similarly and peak around 50% of model depth. This behavior is observed across all model sizes. Robot tasks show a different behavior: the middle layers perform well at picking the object, and the last layers perform as well as the middle layers.
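As a sketch of how such layer-wise probes can be implemented, the module below pools frozen intermediate features with a single learnable query via cross-attention and feeds the pooled vector to a linear classifier. The probe hyper-parameters (e.g., 8 heads) are assumptions for illustration, not the exact probe used in the paper.

```python
import torch
import torch.nn as nn

class AttentionPoolingProbe(nn.Module):
    """Attention-pooling probe over frozen features from one transformer layer."""

    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learnable query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """features: (B, L, dim) frozen tokens from an intermediate layer."""
        q = self.query.expand(features.size(0), -1, -1)   # (B, 1, dim)
        pooled, _ = self.attn(q, features, features)      # cross-attention pooling
        return self.head(pooled.squeeze(1))               # (B, num_classes) logits
```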

Real-world Deployment: We show an example episode of our policy performing the cube-picking task on a Franka robot in the real world. We use toto-base to run the robot in real time; despite being a small model, Toto achieves about a 63% success rate in the real-world setting.

Robot Manipulation Results: We compare MAE-base and toto-base pre-trained models on robot manipulation. We evaluate each model by its mean success rate over training steps. Toto learns these tasks faster than MAE across two robots and two tasks.

Semi-Supervised Tracking: We follow the protocol in STC, starting with the ground-truth segmentation mask and propagating the labels using features computed by toto-large. The mask can be propagated for up to 60 frames without losing much information.
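A hedged sketch of this kind of label propagation is shown below: labels from previously processed frames are transferred to the current frame according to top-k feature-space affinities. The shapes, top-k value, and temperature are illustrative choices; see STC for the exact protocol.

```python
import torch
import torch.nn.functional as F

def propagate_labels(ref_feats, ref_labels, cur_feats, k=10, tau=0.07):
    """Nearest-neighbor label propagation in feature space.

    ref_feats:  (N, D) patch features from already-labelled frames.
    ref_labels: (N, C) one-hot / soft mask labels for those patches.
    cur_feats:  (M, D) patch features of the current frame.
    Returns (M, C) soft labels for the current frame.
    """
    ref = F.normalize(ref_feats, dim=-1)
    cur = F.normalize(cur_feats, dim=-1)
    affinity = cur @ ref.t() / tau                 # (M, N) scaled cosine similarity
    topk, idx = affinity.topk(k, dim=-1)           # keep k nearest reference patches
    weights = topk.softmax(dim=-1)                 # (M, k) propagation weights
    return (weights.unsqueeze(-1) * ref_labels[idx]).sum(dim=1)
```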

Limitations

In this work, we introduced Toto, a family of models for generative pre-training from videos. Despite its competitive performance, this approach has limitations. A significant limitation stems from the use of internet videos, which, unlike carefully curated datasets, introduces challenges related to data quality and diversity. This variance in data quality can impact model performance, especially when compared to models trained on more curated datasets.

Another limitation is the use of a tokenizer: learning is not end-to-end, and the representation and generation quality is bounded by the quality of the tokenizer. With quantized vectors this quality is quite limited, and further exploration is needed to build a universal visual tokenizer. Another fundamental limitation comes from training on videos with a next-token prediction objective: the redundancy across video frames can hurt the quality of the learned representations.

Additionally, our exploration of design choices is based on ImageNet classification. While the resulting configuration transfers to most of the tasks considered in this paper, it may not be optimal for many other tasks.

Furthermore, we have not yet fully assessed our method's effectiveness on dense prediction tasks, fine-grained recognition, or comprehending complex temporal dynamics over extended time frames. These areas represent key opportunities for further research aimed at broadening the applicability of generatively pre-trained models.

Conclusion

We present Toto, an approach for generative pre-training from videos. We build on prior work on generative pre-training from images and make architectural improvements to enable scaling to videos, including the use of quantized patch embeddings and relative position information. We curate a large video dataset and conduct a large-scale empirical study across a range of diverse tasks, including image recognition, video classification, object tracking, trajectory prediction, and robotic manipulation.

We perform extensive ablation studies to understand different design choices and compare our approach to strong baselines across different tasks. We find that, despite minimal inductive biases, our approach achieves competitive performance across all tasks. Finally, we studied the scaling behavior of visual next-token prediction models and showed that they scale with compute, but at a slower rate than text-based next-token prediction models.

Acknowledgments

We thank Andrea Madotto, Po-Yao (Bernie) Huang, and Shiry Ginosar for helpful discussions. We are grateful to Ronghang Hu and Xinlei Chen for their help with the TPU setup and code bases. We also thank Baifeng Shi for helping us with robot evaluations. We thank Valentin Gabeur and Neerja Thakkar for their valuable feedback on the paper.

Citation

@article{autoregressive,
    title={Autoregressive Pre-training from Videos},
    author={Jathushan Rajasegaran and Ilija Radosavovic and Rahul Ravishankar and Yossi Gandelsman and Christoph Feichtenhofer and Jitendra Malik},
    year={2024}
}