Yannic Kilcher
BERT and GPT-2/3 have shown the enormous power of using generative models as pre-training for classification tasks. However, for images, pre-training is usually done with supervised or self-supervised objectives. This paper investigates how far you can get when applying the principles from the world of NLP to the world of images.
OUTLINE:
0:00 – Intro & Overview
2:50 – Generative Models for Pretraining
4:50 – Pretraining for Visual Tasks
7:40 – Model Architecture
15:15 – Linear Probe Experiments
24:15 – Fine-Tuning Experiments
30:25 – Conclusion & Comments
Paper:
https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf
Blog: https://openai.com/blog/image-gpt/
Code: https://github.com/openai/image-gpt
Abstract:
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
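The abstract's core recipe — predicting pixels auto-regressively with a sequence Transformer — can be sketched in a few lines. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' code (that lives in the repo linked above); the 512-token vocabulary mirrors the paper's reduced color palette, while the model sizes and the class name `TinyImageGPT` are made up for the example.

```python
# Minimal sketch of next-pixel pre-training: flatten each low-resolution image
# into a 1D sequence of color-palette tokens and train a causal Transformer
# to predict the next token. All sizes here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyImageGPT(nn.Module):
    def __init__(self, vocab=512, seq_len=32 * 32, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):  # ids: (B, T) pixel tokens in raster order
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=ids.device), 1)
        h = self.blocks(x, mask=causal)          # each position only attends backwards
        return self.head(h)                      # (B, T, vocab) next-pixel logits

def next_pixel_loss(model, ids):
    logits = model(ids[:, :-1])                  # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
```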
Authors: Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Hmm, I don't think your comment about linear probing after fine-tuning is likely to help much. If I understand correctly, the linear probe accuracy at the last layer should re-discover the fine-tuning result (the 99% accuracy). It seems pretty unlikely (though not impossible) that removing later layers would help, unless you think the model is going to add too much noise in these layers and destroy the signal from previous layers.
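Since several comments bring up linear probing, here is a rough sketch of what probing a frozen pre-trained model at a chosen layer could look like. It assumes a hypothetical `hidden_states` accessor returning per-layer activations and a 256-dimensional feature size; both are placeholders, not the paper's actual interface.

```python
# Linear probe sketch: freeze the pre-trained model, average-pool the hidden
# states of one chosen layer over the pixel sequence, and fit a single linear
# classifier on top. `hidden_states` is a hypothetical accessor.
import torch
import torch.nn as nn
import torch.nn.functional as F

def probe_features(model, ids, layer_idx):
    with torch.no_grad():                             # the backbone stays frozen
        feats = model.hidden_states(ids)[layer_idx]   # (B, T, d) activations
    return feats.mean(dim=1)                          # average over the sequence -> (B, d)

probe = nn.Linear(256, 10)                            # feature dim -> number of classes
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(model, ids, labels, layer_idx):
    logits = probe(probe_features(model, ids, layer_idx))
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```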
I think you received the paper a week earlier than anyone else, cause you're so fast XD
Amazing Mate.
Man! You're so fast!
Thanks Yannic for the insights! This paper came out recently
Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge
https://arxiv.org/abs/2006.06609
So BERT is an autoencoder objective, so the only difference here compared to people trying autoencoders ("back in the day!") for semi-supervised learning is self-attention and lots more data? Pretty nuts. I guess the fact that this uses the autoregressive GPT objective rather than the autoencoder objective is also something.
I was surprised when the paper had just come out and you already made a vid on it. Pro YouTuber move… btw great explanation, love your content!
Great work!
With "rolled-out" pixels, the last known pixel always has relationships to pixels at each fixed distance away. E.g. given a 32×32 image, the pixel at -1 distance from the pixel to be predicted has similar relationship to the pixel at -32 distance (-1 vertically before "roll-out"). -2 is similar to -64, etc. But with language, there's no repeating 32-word pattern, and there's never a similar relationship between two words at two fixed distances away (maybe in poetry!). Is that fact build into the model before training, or is that a type of "image grammar" that's learned by lower layers?
I would like to see this done with sparse attention using the row and column for queries and keys. Maybe then you don't have to downsize the images so much.
So the quote "What I cannot create, I do not understand" also holds a bit for neural networks =).
Any thoughts on why they didn't use recent work on sparser transformers to deal with long sequence length?
No offence to Henry AI Labs, but your explanations are very easy to follow and you have a nice, patient flow when explaining the paper (I have been looking for this for the past few months).
Kudos brother, you bought yourself a subscriber today 🙂
This is so amazing, it's fucked up. I'm glad I went to Uni to learn Computer Science 4 years ago (at age 38). This is stuff I can now get into more easily.
I'm a dummy who isn't good at computers, how do I use this program?
it's more gooder.
Did it figure out by itself that cats can hold a sheet of paper in their paws? Or are those kinds of images in the dataset?
oh… here it is. :O Thank you!
So does random cropping induce a non-localised storage patch of weights (in effect providing contrastive weight spaces), which can then combine in a 'holographic manner' to contribute towards an answer?
31:00 You could use a discriminator from a GAN, and I think that's the most common practice, but it wouldn't be pixel by pixel. Autoregressive models can also use convolutions, though (e.g. PixelCNN). They just kind of use half of a filter because they can't see what's ahead, as that would be cheating 😛
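For readers curious about the "half of a filter" remark, here is a minimal sketch of a PixelCNN-style masked convolution (illustrative only, and not part of this paper's method): the kernel weights covering the current pixel and everything after it in raster order are zeroed, so the convolution only ever sees pixels above and to the left.

```python
# PixelCNN-style "mask A" convolution: zero the weights at the current pixel
# and at every later position in raster order, so predictions never peek ahead.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        k = self.kernel_size[0]
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2:] = 0        # centre pixel and those to its right
        mask[:, :, k // 2 + 1:, :] = 0         # every row below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask          # keep only the "visible" half of the filter
        return super().forward(x)

conv = MaskedConv2d(3, 16, kernel_size=5, padding=2)   # toy layer on an RGB image
out = conv(torch.randn(1, 3, 32, 32))                   # (1, 16, 32, 32)
```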
Could this be used for compression by only storing the pixel if it's different from what's expected?
Yannic, it would be great if you walked through a PyTorch implementation along with the paper reading.
What if you train this stuff using memes?
👏👏👏👏👏👏👍👌
2:20
The first one of the generated images is so cute.
I want it.