Machine Learning Street Talk
#machinelearning
This week Dr. Tim Scarfe, Dr. Keith Duggar and Yannic Kilcher speak with veteran NLU expert Dr. Walid Saba.
Walid is an old-school AI expert. He is a polymath: a neuroscientist, psychologist, linguist, philosopher, statistician, and logician. He thinks the missing information problem and the lack of a typed ontology are the key issues with NLU, not sample efficiency or generalisation. He is a big critic of the deep learning movement and BERTology. We also cover GPT-3 in some detail in today’s session, discussing Luciano Floridi’s recent article “GPT‑3: Its Nature, Scope, Limits, and Consequences” and commentary on the incredible power of GPT-3 to perform tasks with just a few examples, including Yann LeCun’s commentary on Facebook and Hacker News.
00:00:00 Walid intro
00:05:03 Knowledge acquisition bottleneck
00:06:11 Language is ambiguous
00:07:41 Language is not learned
00:08:32 Language is a formal language
00:08:55 Learning from data doesn’t work
00:14:01 Intelligence
00:15:07 Lack of domain knowledge these days
00:16:37 Yannic Kilcher thuglife comment
00:17:57 Deep learning assault
00:20:07 The way we evaluate language models is flawed
00:20:47 Humans do type checking
00:23:02 Ontologic
00:25:48 Comments on GPT-3
00:30:54 Yann LeCun and Reddit
00:33:57 Minds and Machines – Luciano Floridi
00:35:55 Main show introduction
00:39:02 Walid introduces himself
00:40:20 science advances one funeral at a time
00:44:58 Deep learning obsession syndrome and inception
00:46:14 BERTology / empirical methods are not NLU
00:49:55 Pattern recognition vs domain reasoning, is the knowledge in the data
00:56:04 Natural language understanding is about decoding, not compression; it’s not learnable
01:01:46 Intelligence is about not needing infinite amounts of time
01:04:23 We need an explicit ontological structure to understand anything
01:06:40 Ontological concepts
01:09:38 Word embeddings
01:12:20 There is power in structure
01:15:16 Language models are not trained on pronoun disambiguation and resolving scopes
01:17:33 The information is not in the data
01:19:03 Can we generate these rules on the fly? Rules or data?
01:20:39 The missing data problem is key
01:21:19 Problem with empirical methods and LeCun reference
01:22:45 Comparison with meatspace (brains)
01:28:16 The knowledge graph game, is knowledge constructed or discovered
01:29:41 How small can this ontology of the world be?
01:33:08 Walid’s taxonomy of understanding
01:38:49 The trend seems to be that fewer rules is better, not the other way around?
01:40:30 Testing the latest NLP models with entailment
01:42:25 Problems with the way we evaluate NLP
01:44:10 Winograd Schema challenge
01:45:56 All you need to know now is how to build neural networks, lack of rigour in ML research
01:50:47 Is everything learnable
01:53:02 How should we evaluate language systems?
01:54:04 10 big problems in language (missing information)
01:55:59 Multiple inheritance is wrong
01:58:19 Language is ambiguous
02:01:14 How big would our world ontology need to be?
02:05:49 How to learn more about NLU
02:09:10 AlphaGo
02:11:06 Intelligence is about using reason to disambiguate
02:13:53 We have an internal type/constraint system / internal language module in brain
02:18:06 Relativity of knowledge and degrees of belief
Walid’s blog: https://medium.com/@ontologik
LinkedIn: https://www.linkedin.com/in/walidsaba/
First!
Interesting topic.
BTW, Walid is pronounced Waleeeeeed
DAMN! Third…
"Would a vision algorithm ever know that this is a teacher and this is a student?" – A multi-modal one, that's been trained on Netflix and Youtube with 3 data streams, audio, vision and abstract scene information encoded in NLP (like "a girl and a woman sitting at a desk. the girl is writing. Other writing kids are in the background" and the subtitles), … VERY LIKELY WOULD. … I mean, it could correlate subtitles with pixels and raw audio and the NLP output of a standard image captioning network … It almost seems like a no brainer to me that a sequence model trained that way would correctly understand such images … even without any context
Oh, another one 🙂 Thank you for your work, I appreciate it.
I recently experimented with GPT-2 1558M and was shocked that this thing has a political bias. It shouldn't surprise me, the internet is biased as well. But it showed up after every request. I could not believe it.
Wow this one will be by far one of the best.
He has a good argument that human beings are using something beyond the visual system. I would say that is our heuristic system. If you walk in a crowded area, you don't bump into other people because of a simple heuristic, which is even taught to boatmen and pilots: in your head you draw a straight line to the person, and if the direction of their walking doesn't change, you change your direction.
Yannic's comments on static type checking are really confusing. He complains that a sufficiently advanced type system would allow him to perform the computations at the type level. But that's only possible with dynamic type checking. If the solution to your problem is the output of a program, and your types are static, then type computations only happened at compile time; by definition, this means that none of the type-related computations contributed to the solution. It would only be possible with a dynamic type-checking system like what Python has.
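For what it's worth, here's a minimal sketch of dynamic (runtime) type checking in Python, where the check itself participates in producing the output. The example is my own, not Yannic's, and the function and types are just placeholders:

```python
# Minimal sketch (hypothetical example): in a dynamically checked language the
# type test runs at the same time as the rest of the program, so it can
# contribute to the solution instead of being erased at compile time.
def interpret(token):
    if isinstance(token, int):        # runtime type check drives the result
        return f"count: {token}"
    if isinstance(token, str):
        return f"word: {token!r}"
    raise TypeError(f"unsupported type: {type(token).__name__}")

print(interpret(3))       # -> count: 3
print(interpret("beer"))  # -> word: 'beer'
```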
28:34 Today I did something similar with GPT-2. I gave it a text about machine learning and could extract keywords out of it, by just putting "keywords:" at the end. I was also impressed.
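A minimal sketch of that "keywords:" trick (my own reconstruction, not the commenter's exact setup; the model name, prompt text, and generation settings are assumptions):

```python
# Hypothetical reconstruction of the "keywords:" prompting trick using the
# Hugging Face transformers pipeline (model choice is an assumption).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-xl")

text = (
    "Machine learning studies algorithms that improve automatically through "
    "experience, for example neural networks trained with gradient descent."
)

# Appending "keywords:" nudges the model to continue with a keyword list.
prompt = text + "\n\nkeywords:"
result = generator(prompt, max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"][len(prompt):])
```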
Some animals do seem to employ cause and effect reasoning, contra Walid here. One example is the orca that baits birds with fish.
"NLP is not NLU". Thank you for this phrase, I will reuse it.
Chomsky is not conspiratorial. He is obsessive about source documents and explaining social-political phenomena in terms of institutional pressures rather than cabals or organized crime.
27th!
Thanks for the video mate! Always great to see new content from you!
As someone fairly bullish on ANNs, I find GOFAI advocates like Walid Saba a curious, rare breed, even if, as I gather, he likes to distance himself a bit from the term. A GOFAI advocate who believes so unreservedly in human exceptionalism, even more so. I struggled to follow most of his arguments, though. For example, he seems to be claiming language is not learnable because a lot of it is just innate, but then uses things like pictures of teacher-student pairings, or malls versus campuses, which clearly can't be innate, as we didn't evolve with those. Or he gives adjective order, but this is something that ML can do successfully, and it seems fairly straightforward to learn. Or the Xanadu example, but again he's pointing at facets that can't be innate, and thus where would they be if not in the text? I don't feel I really understood his claims.
I really like the way you summarize the conversation upfront and augment it with different perspectives. Even when I disagree it feels very balanced and well thought out. I wrote this before I saw I was on there so I'm not even saying that because this one featured me :P. Actually this is true of your interview style as a whole.
“I don't like it when I see extremism. Now, you really have people that believe a 10 line algorithm, backpropagation, and the gradient descent method, explains the mind. That has to be, on its own, before I dig into the details, on its own that's gotta' be ridiculous.”
Wasn't it just a few days ago Connor Leahy said much the same? “Are you kidding me? Matrix multiplications. Wow, intelligence boys, we did it!” Damn, you need to get more variety, you can't just have everyone you get having the same opinion! (jk)
"01:50:47 Is everything (not) learnable" argument is terribly weak, since it does not account that we all live in the same physical reality (and have very similar embodiment experiences).
For example we might learn smaller and bigger relations from playing with cups. Almost all kids get to play with cups, and thus learn the same lessons.
I wish the AI / DL field would go over to Hutter's website and read about how Kolmogorov Complexity is intrinsically linked to optimal compression and in a few short steps, general intelligence.
Once internalized, I think a lot of inspired development can happen. We want good proxies or heuristics of KC (it's uncomputable), 99% optimal shortest-program-search kinda thing. Program synthesis techniques that rely on search completely suck, but a marriage of program synthesis through deep learning could possibly get us that proxy. Not a big fan of investing much time in first order logic or any old GOFAI stuff, because that's not Turing Complete and so delegates the "intelligence" to another language or TC model which in turn uses FOL. The Cyc project is 30(?) years old and to my knowledge has done nothing of use.
Anyway, let's frame program-length in terms of NLU, with Walid's "corner table wants a beer" phrase. What's the shortest program that reconstructs the scene here? It's a cast from table to people, using a synecdoche "type-wrapper" (synecdoche meaning: part representing a whole or vice versa, e.g. "boots on the ground", "you'll have badges knocking on your door"), then, some first order logic that people are humans, humans drink, beer is a drink.
Where do those premises come from? Facts and details in an entity graph. Instead of the Cyc project, I'd want this graph to be empirically learned. It could even be a language model, I guess; Yannic reviewed a paper that did just that. The entity-graph traversal that pulls those properties would be low runtime cost, as those properties are fairly high likelihood, were you to test combinations for saliency in order of likelihood.
Adding in constituency parsing to see subject and object, we see the humans in the corner are the subjects that want the drink. So, given the conciseness, dynamic type-safety passes. That's a pretty short program, a special kind of type cast and 3 hops of high-likelihood FOL. Program conciseness is a really nice abductive reasoning prior, so long as the operators you're working with are applicable to the environment.
Without a 'synecdoche' type-cast operator, the program length might hit some trivial maximum, which equals confusion. So a critical minimum of these entity-graph objects and relations needs to be learned to avoid confusion, hopefully empirically. How could that be done? Say a movie in the training DB contains something like that line, and the waiter subsequently gets beers for the table, yet the table doesn't drink but rather the people drink them; then this relation can potentially be inferred (see the toy sketch below).
Lot of handwaving, but I just want to bring up the KC prior for abductive reasoning as Hutter and perhaps others before him suggest. The "missing information" in our speech is the trust that our conversational counterparties can perform some kind of abductive reasoning to fill in ambiguities and resolve underspecification. We should be chasing that KC proxy down.
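A toy sketch of that "corner table wants a beer" program (entirely hand-wavy and my own; the entity-graph entries, relation names, and the synecdoche cast are made-up assumptions):

```python
# Toy, hypothetical sketch: a tiny entity graph plus a "synecdoche" cast that
# lets "the corner table" stand in for the people seated at it, followed by a
# few short inference hops. All entries here are invented for illustration.
ENTITY_GRAPH = {
    ("table", "located_at"): "corner",
    ("table", "seats"): "people",      # part-for-whole link used by the cast
    ("people", "is_a"): "human",
    ("human", "can"): "drink",
    ("beer", "is_a"): "drink",
}

def synecdoche_cast(entity):
    """Cast a place/object onto the agents associated with it, if any."""
    return ENTITY_GRAPH.get((entity, "seats"), entity)

def interpret(subject, verb, obj):
    agent = synecdoche_cast(subject)            # "table" -> "people"
    kind = ENTITY_GRAPH.get((agent, "is_a"))    # "people" -> "human"
    ability = ENTITY_GRAPH.get((kind, "can"))   # "human" -> "drink"
    obj_kind = ENTITY_GRAPH.get((obj, "is_a"))  # "beer" -> "drink"
    if verb == "wants" and ability == obj_kind:
        return f"The {agent} at the {subject} want a {obj}."
    return "confusion (program length blows up)"

print(interpret("table", "wants", "beer"))
# -> The people at the table want a beer.
```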
New world order is not a joke
There are very few things we disagree on? Wow, the earth is flat, the sun goes around the earth, and it's turtles all the way down!
An interesting, important, but at the same time frustrating conversation. I think I agree with nearly all of Walid's conclusions, but lots of his justifications seem unscientific and at times dismissive.
The Go discussion, in particular saying that it was increased compute capacity that led to AlphaGo's breakthrough, is nonsense in the extreme. Classical heuristic-search bots were/are orders of magnitude worse than AlphaGo. Experts estimated AlphaGo's level of play to be, on average, over three decades ahead of schedule. This is egregious enough to put anyone even mildly sympathetic to deep learning in a defensive position, because it's such a bad-faith take.
The reliance on "every 3-year-old knows this" — I don't follow the argument. 3-year-olds have had three years of learning with a supercomputer, or about 100 million petaflop-days equivalent, and three years of ~100 Mbps of data through the eyes plus a similarly gigantic amount through the other senses. That amount of compute and data ought to be enough to do anything, yet they're still just okay with language.
"It can't be learned because everyone knows it" — also unscientific, because it ignores the possibility of attractors and common filtering. Why couldn't the same thing be learned by everyone? We live in the same world, with the same physics and similar ways of life. Train 400 different neural nets on MNIST and they'll develop representations of the same character modalities, so should we conclude the models must have started with innate character representations?
I'm not convinced the "red beautiful car" ordering is anything except learned custom. Suppose you raise a kid in an environment where everyone uses the reverse ordering, "red, beautiful car" — what do you honestly think the child will say? But I am on board with the general point here that some immutable priors can be helpful. I disagree that they're necessary.
"Now, you really have people that believe a 10 line algorithm, backpropagation, and the gradient descent method, explains the mind." Another quite unscientific take. You cannot dismiss the properties of emergent phenomena by the simplicity of the constituents. A similar line could be said for when atoms were first proposed — surely you don't believe all this complex universe is the result of a few dozen atoms? And further, surely you don't believe dozens of atoms are just three subatomic particles?? And in 50 years, surely you don't think all the elementary particles and force carriers are explained by just a vibrating string*???
So many people use this line of argument, (Connor Leahy last video too "yay matrix multiplications, we solved AGI!") and I don't even get what the specious logic is supposed to be.
*or whatever
—
With those criticisms aside, I think Walid's position is a healthy one for the field: that deep learning needs competition, that competition needs funding, and big tech companies privilege DL possibly in part due to their exclusive advantage in data and compute. That position alone doesn't undermine anything.
Though I don't much like it, I'm now of the mind that deep learning is going to be the first superhuman intelligence, as wacky as that feels. One has to be open minded as a scientist.
I think the first task for RoGPTa-XL-7 should be "Sandra wants to develop an interpretable superhuman AI with symbolic processing. Her source code is 'import'"—
Anyway, you can't run down every logical loose end as an interviewer or you'd cover nothing. I think you guys did a good job of putting just a little pressure on some of those arguments but not enough to slow things down. Excited for the next street talk!
@Yannic @Scarfe
How biological entities learn: vision + speech + hearing.
How neural networks learn: either vision or speech.
So we are basically building handicapped networks and expecting them to achieve AGI.
Correct me if I'm wrong
I guess he didn't like "Attention is all you need" very much 😅, very peculiar talk.
At the start he says DL can't perform pronoun disambiguation (is the suitcase too large, or the trophy?). In the past couple of years, models have started performing near human level at these tasks (see the sketch at the end of this comment).
We now have models like GPT-3 which can learn to perform dozens of tasks they weren't trained for, using unsupervised learning. This is the future. Not built-in rules that reach 10% accuracy on one task, which is how well GOFAI models work for language understanding.
The same DL models are generalizable, meaning they can work for many tasks, just like the animal/human brain is generalizable. GPT has been used for image recognition (Image GPT, built on GPT-2, reached strong self-supervised results), playing games/chess, etc.
Yes, there are differences between the brains of lower primates and our brain. Essentially, though, our brains are scaled-up versions of those of the lower primates (our ancestors).
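On the pronoun disambiguation point above, here is a minimal sketch (my own illustration, not from the video) of poking a masked language model at the trophy/suitcase sentence; the model choice and prompt are assumptions, and which adjective it prefers only hints at how it implicitly resolves "it":

```python
# Hypothetical probe: ask a masked language model to fill in the adjective of
# the Winograd-style trophy/suitcase sentence and inspect its top guesses.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The trophy doesn't fit in the suitcase because it is too [MASK]."
for pred in fill(sentence, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```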
Walid Saba is a fascinating guy, but we don't yet know what emerges from large enough NLP models. GPT-3 suggests we're nowhere near the limits of transformer networks. So I will withhold judgement until GPT or some better architecture has been tested with at least a trillion parameters.
Is this just about language requiring "common sense", and us not knowing how we learned that? This is a well-known problem. You can call it an ontological type system or a compiler, but the point is that we can reduce everything to one huge model of the world in our heads, and we are conscious of this process as observers. We just don't know how to do it with code. Yet.
In some languages, 'red beautiful car' and 'beautiful red car' are interchangeable.
Pedro Domingos's "The Master Algorithm" is great when it comes to describing the tribes of ML.
Walid has a paper out which expands on his ideas, "Language and its Commonsense: Where Formal Semantics Went Wrong, and Where it Can (and Should) Go": https://4ebde952-0bd0-43c2-bf47-30516f762816.filesusr.com/ugd/7e4fcb_3317bd434a434a45b13adb6fdfdfa5e7.pdf
Wow. Loved this video!
Lots of the assertions made in this video can be falsified with commonly employed AI models, including GPT-3. That makes it a bit difficult to take the arguments or reasoning seriously.
Also, the idea that the simple mechanism of the neuron (which backpropagation and stochastic gradient descent simulate) can't explain language understanding seems to fly in the face of human biology.
That argument seems analogous to "there's no way logic gates form the basis of all this complex computation that computers do!" — but that's exactly what's happening.
I guess I just don't get it.
A small point about the adjective ordering: I think this has more to do with the aesthetics of the sound than with a hierarchical labeling of adjectives. Ironically, a deep learning model might be pretty good at learning features like this, as opposed to a structured organization of adjectives, because in some languages adjective ordering is interchangeable, depending on how the language is designed.
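A minimal sketch of one way to test that (my own illustration; the model choice and the average log-likelihood scoring are assumptions): compare how a pretrained language model scores the two adjective orderings.

```python
# Hypothetical check: score both adjective orderings by average token
# log-likelihood under a pretrained causal language model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return -loss.item()  # higher (less negative) = more fluent to the model

for s in ["the beautiful red car", "the red beautiful car"]:
    print(s, round(avg_log_likelihood(s), 3))
```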