Yannic Kilcher
#gpt3 #openai #gpt-3
How far can you go with ONLY language modeling? Can a large enough language model perform NLP tasks out of the box? OpenAI take on these and other questions by training a transformer that is an order of magnitude larger than anything that has ever been built before, and the results are astounding.
OUTLINE:
0:00 – Intro & Overview
1:20 – Language Models
2:45 – Language Modeling Datasets
3:20 – Model Size
5:35 – Transformer Models
7:25 – Fine Tuning
10:15 – In-Context Learning
17:15 – Start of Experimental Results
19:10 – Question Answering
23:10 – What I think is happening
28:50 – Translation
31:30 – Winograd Schemes
33:00 – Commonsense Reasoning
37:00 – Reading Comprehension
37:30 – SuperGLUE
40:40 – NLI
41:40 – Arithmetic Expressions
48:30 – Word Unscrambling
50:30 – SAT Analogies
52:10 – News Article Generation
58:10 – Made-up Words
1:01:10 – Training Set Contamination
1:03:10 – Task Examples
https://arxiv.org/abs/2005.14165
https://github.com/openai/gpt-3
Abstract:
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
does it get sarcasm and can it tell a joke?
Glad to see someone has worse handwriting than me xD
Excellent review of the paper, Yannic – much appreciated!
Thank you
What device are you using to view this and draw?
I am the 66,666th viewer, and I understood that there is nothing for me to do here
Yannic,
I really like the summary, pretty comprehensive, but I can't be the only one who cringes at "Oh it's not reasoning, it's just statistical relationships between words"? Like holy shit, GPT-3 shows truly incredible results and we really should dive into how it does that, instead of discrediting GPT-3 as "only" doing statistical stuff. This is the whole "as soon as it works, no one calls it AI anymore" thing again.
Good overview, but would have been better with a shorter section of "GPT3 is not reasoning, it's just copy-pasting and remembering addition and multiplication tables" (like a lot of actual human beings do…)
But what we all want to know is:
Why did the Methodist and Baptist churches split?
47:14
I get that you are trying to debunk some of the hype around GPT-3 … but credit where credit is due – even if it's doing some complex "look up", this is what most other models are also doing; this one does it better!
Thanks so much for your videos! One question: Why would you be impressed if "Good English Output" had been part of the model output and not the question?
Language modelling alone won't enable the NN to perceive the logic.
You keep saying pattern matching. I think you should approach these papers with less bias. It lets you see things from many perspectives.
This is just another case of shifting the goal post.
You gave it something to learn from but you expect it not to use similar wordings?
I mean, do you expect it to be completely original every single time? Maybe come up with new words and new sentence structures cos _originality_.
The fact is, even as humans, you reuse words and phrases that you have heard/read/seen before in certain contexts when it comes up again.
The goal should be to hit the perfect spot of generalisation without over-generalising or over-fitting.
IMHO, the biggest catch is that the binary poetry tests were done on humans who are not English authorities or experts. We need a model that learns correct English, not just one that fools an average human!
Seeing that the number of authors runs into the dozens, they should have come clean on the experiments and done a thorough analysis of the "data points" already being in the training data. Though, the work is still really good.
The review of GPT-3, along with a push in subscriptions owing to the recent popular paper reviews such as ResNet, Word2Vec, etc. (plus years of hard work), has made @Yannic an overnight star 🙂
The most important part is in the 13th minute: is this algorithm really trying to understand what's going on, with a real human-like understanding of the task to learn? Understanding. Are the billions of calculations, the huge compute power, and the tons of layers really generating some kind of deep, true understanding? That's the Holy Grail of A.I. Cheers.
I did a calculation that might be of some interest (but quickly, so it needs to be checked and thought through).
The dataset, if I understand correctly, is something like 450 billion "tokens". Does that mean characters, or something close to that?
The model has 175 billion parameters (let's say 175 billion bytes).
The best text compression is around 90% (it divides the size by 10). Interpreted that way, it seems a compressed version of the whole dataset fits comfortably in the model parameters and leaves a huge number of parameters for interpolation logic.
NB: this calculation should be done more rigorously: what is a token, what size is a single parameter, etc.
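For anyone who wants to check the estimate above, here is a minimal Python sketch of the same back-of-the-envelope arithmetic. The 450 billion tokens, 175 billion parameters and 10x compression figures are taken from the comment as stated, not verified. The bytes-per-token and bytes-per-parameter values are pure assumptions (the comment leaves both open), so the loop simply tries a few plausible ones, and the function names are made up for illustration.

```python
# Rough size comparison: compressed training text vs. model parameters.
# All figures are assumptions from the comment above, not verified facts.

NUM_TOKENS = 450 * 10**9      # dataset size in tokens, as quoted in the comment
NUM_PARAMS = 175 * 10**9      # model parameter count
COMPRESSION_RATIO = 10        # "divides the size by 10", as quoted in the comment


def compressed_corpus_bytes(num_tokens, bytes_per_token, compression_ratio):
    """Estimated size of the corpus after text compression, in bytes."""
    return num_tokens * bytes_per_token / compression_ratio


def model_bytes(num_params, bytes_per_param):
    """Estimated storage taken by the model parameters, in bytes."""
    return num_params * bytes_per_param


if __name__ == "__main__":
    # Assumed values: 1 byte/token if a token is a single character,
    # 4 bytes/token if a token is a few characters long.
    for bytes_per_token in (1, 4):
        # Assumed values: 1 byte/param as in the comment, 2 for half
        # precision, 4 for single precision.
        for bytes_per_param in (1, 2, 4):
            corpus = compressed_corpus_bytes(
                NUM_TOKENS, bytes_per_token, COMPRESSION_RATIO
            )
            model = model_bytes(NUM_PARAMS, bytes_per_param)
            print(
                f"{bytes_per_token} B/token, {bytes_per_param} B/param: "
                f"corpus ~{corpus / 1e9:.0f} GB, model ~{model / 1e9:.0f} GB, "
                f"model/corpus = {model / corpus:.2f}"
            )
```

Whether the compressed data "fits" depends heavily on those two assumptions, which is exactly the open question in the NB.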
Amazing content. Thank you so much for the intuition that really helped.
Although I took short breaks, I made it to the end. You did a good job, sir.
I like the "methodist" article check, but the arithmetic one is weak: a table, when converted to text (and if converted to text, maybe they discarded the tables), may come out very badly. The plus sign/addition word might not appear in the final text close to the numbers. If GPT was able to "learn/memorize" arithmetic from an HTML table converted to plain text, that would be real intelligence 🙂 Further research is needed to understand what it understood.
I might have missed something, but could it be that the model has all the training data represented internally? See this at around 37 min: https://m.youtube.com/watch?v=u4alGiomYP4 and also https://m.youtube.com/watch?v=fKk9KhGRBdI at 5 min, where the claim is made that less structure will be required with more data. Any thoughts?
28:32 If you watch the Numberphile talk on GPT-3, the example he uses is maths with 2 and 3 digits. It ‘learns’ to do maths beyond the training samples.
..all you did here was interpolate the paper data.
(I'm snarky but I loved the vid.)