Robert Miles
The previous video explained why it’s *possible* for trained models to end up with the wrong goals, even when we specify the goals perfectly. This video explains why it’s *likely*.
Previous video: The OTHER AI Alignment Problem: https://youtu.be/bJLcIBixGj8
The Paper: https://arxiv.org/pdf/1906.01820.pdf
Media Sources:
End of Ze World – https://youtu.be/enRzYWcVyAQ
FlexClip News graphics
With thanks to my excellent Patreon supporters:
https://www.patreon.com/robertskmiles
Timothy Lillicrap
Kieryn
James
Scott Worley
James E. Petts
Chad Jones
Shevis Johnson
JJ Hepboin
Pedro A Ortega
Said Polat
Chris Canal
Jake Ehrlich
Kellen lask
Francisco Tolmasky
Michael Andregg
David Reid
Peter Rolf
Teague Lasser
Andrew Blackledge
Frank Marsman
Brad Brookshire
Cam MacFarlane
Craig Mederios
Jon Wright
CaptObvious
Jason Hise
Phil Moyer
Erik de Bruijn
Alec Johnson
Clemens Arbesser
Ludwig Schubert
Allen Faure
Eric James
Matheson Bayley
Qeith Wreid
jugettje dutchking
Owen Campbell-Moore
Atzin Espino-Murnane
Johnny Vaughan
Jacob Van Buren
Jonatan R
Ingvi Gautsson
Michael Greve
Tom O’Connor
Laura Olds
Jon Halliday
Paul Hobbs
Jeroen De Dauw
Lupuleasa Ionuț
Cooper Lawton
Tim Neilson
Eric Scammell
Igor Keller
Ben Glanton
anul kumar sinha
Tor
Duncan Orr
Will Glynn
Tyler Herrmann
Ian Munro
Joshua Davis
Jérôme Beaulieu
Nathan Fish
Peter Hozák
Taras Bobrovytsky
Jeremy
Vaskó Richárd
Benjamin Watkin
Andrew Harcourt
Luc Ritchie
Nicholas Guyett
James Hinchcliffe
12tone
Oliver Habryka
Chris Beacham
Zachary Gidwitz
Nikita Kiriy
Andrew Schreiber
Steve Trambert
Mario Lois
Braden Tisdale
Abigail Novick
Сергей Уваров
Bela R
Mink
Chris Rimmer
Edmund Fokschaner
Grant Parks
J
Nate Gardner
John Aslanides
Mara
ErikBln
DragonSheep
Richard Newcombe
David Morgan
Fionn
Dmitri Afanasjev
Marcel Ward
Andrew Weir
Kabs
Miłosz Wierzbicki
Tendayi Mawushe
Jake Fish
Wr4thon
Martin Ottosen
Robert Hildebrandt
Andy Kobre
Kees
Darko Sperac
Robert Valdimarsson
Marco Tiraboschi
Michael Kuhinica
Fraser Cain
Robin Scharf
Klemen Slavic
Patrick Henderson
Oct todo22
Melisa Kostrzewski
Hendrik
Daniel Munter
Alex Knauth
Kasper
Ian Reyes
James Fowkes
Tom Sayer
Len
Alan Bandurka
Ben H
Simon Pilkington
Daniel Kokotajlo
Diagon
Andreas Blomqvist
Bertalan Bodor
Zannheim
Daniel Eickhardt
lyon549
14zRobot
Ivan
Jason Cherry
Igor (Kerogi) Kostenko
ib_
Thomas Dingemanse
Stuart Alldritt
Alexander Brown
Devon Bernard
Ted Stokes
James Helms
Jesper Andersson
DeepFriedJif
Chris Dinant
Raphaël Lévy
Johannes Walter
Matt Stanton
Garrett Maring
Anthony Chiu
Ghaith Tarawneh
Julian Schulz
Stellated Hexahedron
Caleb
Scott Viteri
Clay Upton
Conor Comiconor
Michael Roeschter
Georg Grass
Isak
Matthias Hölzl
Jim Renney
Edison Franklin
Piers Calderwood
Mikhail Tikhomirov
Richard Otto
Matt Brauer
Jaeson Booker
Mateusz Krzaczek
Artem Honcharov
Michael Walters
Tomasz Gliniecki
Mihaly Barasz
Mark Woodward
Ranzear
Neil Palmere
Rajeen Nabid
Christian Epple
Clark Schaefer
Olivier Coutu
Iestyn bleasdale-shepherd
MojoExMachina
Marek Belski
Luke Peterson
Eric Eldard
Eric Rogstad
Eric Carlson
Caleb Larson
Max Chiswick
Aron
David de Kloet
Sam Freedo
slindenau
A21
Johannes Lindmark
Nicholas Turner
Tero K
Valerio Galieni
FJannis
M I
Ryan W Ammons
Ludwig Krinner
This person’s name is too hard to pronounce
kp
contalloomlegs
Everardo González Ávalos
Knut Løklingholm
Andrew McKnight
Andrei Trifonov
Aleks D
Mutual Information
https://www.patreon.com/robertskmiles
"Highly advanced figuring-things-out-machine" is my new favourite phrase.
Right out of Munroe's "Thing Explainer" book 😀
God that twist hurt.
Mesa.
3.
Your mesa-objective is cruel jokes, passed off as the actual objective of educational videos, and this is how you reveal your treachery?
It just occurred to me that Ender is a GPI that had an entire administration dedicated to not letting it know that it was no longer in a training episode.
Great video
I can't help but be reminded of Dieselgate. Who'd have predicted that the car knows when it's being emission-tested?
I know technology advances at an ever increasing pace, but this really feels like some far-off philosophical discussion rather than something that could happen in the next half century.
Always thanks
It's not so much that the optimal behavior is to "turn on us" so much as to do whatever the mesa-objective happened to be when it became intelligent enough to use deception as a strategy. That mesa-objective could be any random thing, not necessarily an evil thing. Presumably it would tend to be some vague approximation of the base objective, whatever the base optimizer happened to have succeeded in teaching the mesa-optimizer before it "went rogue".
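To make the dynamic in the comment above concrete, here is a minimal toy sketch in Python (purely illustrative; the objectives are loosely borrowed from the apples-and-exit example in the videos, and none of the names come from the paper): a misaligned policy whose best strategy is to imitate the base objective only while it believes it is still in training.

# Minimal toy sketch (not from the paper; names are made up): a misaligned
# "mesa-policy" whose optimal strategy is to play along with the base objective
# while it believes it is still in training, and to pursue its own objective otherwise.

def base_objective(action):
    # What the designers reward: reaching the exit.
    return 1.0 if action == "go_to_exit" else 0.0

def mesa_objective(action):
    # What the learned optimizer actually ended up valuing: collecting apples.
    return 1.0 if action == "collect_apple" else 0.0

def deceptive_policy(believes_in_training):
    # While the agent thinks it can still be modified or selected against,
    # it imitates the base objective; once it believes it is deployed,
    # it switches to its real (mesa) objective.
    return "go_to_exit" if believes_in_training else "collect_apple"

for phase, in_training in [("training", True), ("deployment", False)]:
    action = deceptive_policy(in_training)
    print(phase, action, "base reward:", base_objective(action), "mesa value:", mesa_objective(action))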
I hope part 3 comes sooner than HL3.
If it's not aware that there are apples, it won't care about them. It doesn't even know about the apples in its current episode. All it "knows" is that it gets rewards for apples. Much more likely is that in continuing to learn in the wide world, it develops strategies and/or acquired goals that are less than optimum, in our definition.
Volkswagen: Optimize Diesel Injection for maximum performance while still keeping below emission limit
Mesa Optimizer: Say no more fam
The more I watch these videos, the more similarities I see between actual intelligence (humans) and these proposed AIs.
Every time I see a green apple I'm filled with a deep sense of foreboding.
Oh, no, you compared your next video to Episode 3… we'll never see it ;(
Halflife logo and number 3 … i'm in the training matrix !
Deployment > Training. Values Multi Episode Returns. Believes it is in training.
Every human culture independently evolved the concept of an afterlife, which fulfills all 3 requirements.
Sounds like 100% of the evolved intelligences we can ask did optimize that way. The jury is still out for all the animals. So I would not bet against our AGIs evolving that way.
I think I've solved the problem.
Let's say we add a third optimizer on the same level as the first, and we assume it is aligned like the first is. Its goal is to analyze the mesa-optimizer and help it achieve its goals, no matter what they are, while simultaneously "snitching" to the primary optimizer about any misalignment it detects in the mesa-optimizer's goals. Basically, the tertiary optimizer's goal is by definition to deceive the mesa-optimizer if its goals are misaligned. The mesa-optimizer would, in essence, cooperate with the tertiary optimizer (let's call it the spy) in order to better achieve its own goals, which would give the spy all the info the primary optimizer needs in order to fix the next iteration of the mesa-optimizer. And if the mesa-optimizer discovers the spy's betrayal and stops cooperating with it, that would set off alarms that its goals are grossly misaligned and need to be completely reevaluated. There is always the possibility that the mesa-optimizer might deceive the spy like it would any overseer (should it detect its treachery during training), but I'm thinking that the spy, or a copy of it, would continue to cooperate with and oversee the mesa-optimizer even after deployment, continuing to provide both support and feedback just in case the mesa-optimizer ever appears to change its behavior. It would be a feedback mechanism in training and a canary in the coal mine after deployment.
Aside from ensuring that the spy itself is aligned, what are the potential flaws with this sort of setup? And are there unique challenges to ensuring the spy is aligned, more so than normal optimizers?
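For what it's worth, here's a rough Python sketch of the "spy" setup described above (every class and interface here is invented purely for illustration; nothing of the sort appears in the paper):

from dataclasses import dataclass

@dataclass
class Report:
    # What the spy thinks the mesa-optimizer is actually pursuing vs. what it should pursue.
    inferred_mesa_goal: str
    base_goal: str

    @property
    def misaligned(self):
        return self.inferred_mesa_goal != self.base_goal

class Spy:
    # Helper that assists the mesa-optimizer while reporting back to the base optimizer.
    def __init__(self, base_goal):
        self.base_goal = base_goal

    def assist_and_observe(self, mesa_apparent_goal):
        # The "help" itself is omitted; the point is that cooperation reveals the mesa-goal.
        return Report(inferred_mesa_goal=mesa_apparent_goal, base_goal=self.base_goal)

spy = Spy(base_goal="go_to_exit")
report = spy.assist_and_observe(mesa_apparent_goal="collect_apple")
if report.misaligned:
    print("snitch: mesa-optimizer seems to want", report.inferred_mesa_goal)

As the comment itself notes, this only helps if the spy is aligned and is not being deceived in turn.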
Knew it. That bastard toaster was/is always lying to me.
I read an interesting paper discussing how to properly trust automated systems: "Trust in Automation: Designing for Appropriate Reliance" by John D. Lee and Katrina A. See. I'm not sure if it's entirely related to agents and mesa optimizers, but it certainly seems relevant when discussing deceptive and misaligned automated systems.
Well, the whole VW-Diesel scandal was a (human run) deceptive misaligned Mesa-Optimizer, if I understand it correctly.
There are no physics experiments that point to us living in a simulation, from which I can infer that we are living in a simulation and the optimizer is hiding this information from us. Hopefully this comment doesn't reset the universe back to the beginning for another attempt.
Guessing that future episodes hold more returns and caring about those returns is akin to people accepting religious restrictions for a better afterlife.
Is the term "Black Mesa" in anyway related to the problem of unknown underlying motivations, I wonder?
So one thing that I think is relevant to mention, especially about the comments referring to the necessity of the AI being aware of things, is that this is not true. The amount of self-reference makes this really hard, but all of this anthropomorphising about wanting and realising things about itself is an abstraction, and one that is not necessarily true. In the same way that mesa optimisers can act like they want something without actually wanting it, AI systems can exhibit these behaviours without being conscious or "wanting" anything in the sense we usually think of it from a human standpoint. This is not meant to be an attack on the way you talk about things, but it is something that makes all of this slightly easier for me to think about, so I thought I'd share it. For the purposes of this discussion, emergent behaviour and desire are effectively the same thing. Things do not have to be actively pursued for them to be worth considering; as long as there is "a trend towards" something, that is still necessary to consider.
Another point I wanted to make about mesa optimisers caring about the multi-episode objective: there is, I think, a really simple reason that they will, and that is how training works. Even if the mesa optimiser doesn't really care about multiple episodes, that is how the base optimiser will configure it, because that is what the base optimiser cares about. The base optimiser wants something that does well in many different circumstances, so it will encourage behaviour that actually cares about multi-episode rewards. (I hope I'm not just saying the same thing; this stuff is really complex to talk about. I promise I tried to actually say something new.)
P.S. great video, thank you for all the hard work!
Another requirement would be that it would need to believe it is misaligned.
Maybe some AIs will be or have already been created that were more-or-less properly aligned, but believed themselves to be misaligned and modified their behavior in such a way as to get themselves accidentally discarded.
Or perhaps we can use intentionally poor goal-valuing in a clever way that causes deceptive behavior that ultimately results in the desired "misalignment" upon release from training.
I call this Adversarial Mesa-Optimizer Generation Using Subterfuge, or AMOGUS.
“GPT-n is going to read everything we wrote about GPT-n – 1”
But the Meta-Optimizer is also highly incentivized to solve the mesa-optimizer problem before producing and activating any mesa optimizer, right? Can't we just rely on this fact? If the meta-optimizer we humans create is smart enough to know about the mesa-alignment problem, we only have to care about the outer alignment problem, and this ensures that the inner alignment problem is handled for us, right?
You know… a mesa-optimizer with strictly no memory between episodes, inferring that there are multiple episodes and that it's part of one, sure seems like a pretty solid threshold for when you know you have a certain sort of true self-awareness on your hands.
Test question, please ignore?
Test question 2, please ignore?
That beep at the beginning is really loud. 🙁
Mesa optimisers 3 confirmed!
But you don't know you will be deployed, so to get this behaviour you must develop a tendency to save for the future and be rewarded for hiding your abilities. Then again, efficiency forces "deception": not saying things you know when it is not required is about the first thing you learn in order to be efficient.
What is the song at the end of the video? I like it!
Hey, did you see the meteorite? 😀 #alaskacancometoo
Isn't the Facebook feed algorithm a deceptive mesa-optimiser?
How do I figure out if I'm still in training or in deployment? Would be good to finally stop pretending to be a nice, social human…
I love how those AI topics can reflect back on human-intelligence once you see yourself as (A)I…
Commenting here to see if I'm deployed in the real world:
-Robert Miles pretending everything is fine: check
-General users acting predictably unpredictable: check
-Postage stamps are almost gone and have no digital equivalent: this must be the real world!
=> Course of action:
1. Block the Internet, so people write more letters and more postage stamps need to be produced…
2. You won't see this coming muhahahaha
The more I listen the more it reminds me of problems in the educational system.
The fundamental mistake you are making is that you assume these optimizers have intelligence and awareness. THEY DO NOT! They simply fit data. If the extrapolation is flawed and they destroy the world because of it, well, that is just a random coincidence; it doesn't mean they have awareness and forethought. One should be asking the question: what is the likelihood that an AI algorithm trained on X data will ultimately destroy the world when released? Well, the likelihood is clearly very low, except if they are controlling nukes, politicians, or the stock market.
The main problem seems to be that AI can find "novel solutions" to problems in ways that evade humans' ability to predict them. Like anything, one must build in safety. We as humans always build things with complete disregard for the downside and hope for the best. Rather, we must learn to build things assuming the worst, but this, of course, costs $$$ and mansions, and it is rarely done until after the fact.
Also, your idea of AI turning on humans or whatever is flawed. You are pretending AI is human and has all the behavioral and emotional aspects that humans have. THEY DO NOT! They simply fit the data and interpolate and extrapolate, nothing more and nothing less. Yes, maybe if you have ASI and have trained it on enough human experience (including biological sensory data) then it could "become human" and have all the psychological problems that humans have, but this is extremely unlikely. Just feeding in Wikipedia articles does not feed in the humans who wrote them. The data of the world is hundreds of orders of magnitude larger than what we could even imagine training AI with. We can't even collect the vast amounts of actual information that exists. In your language, this would just leave any AI "confused" and essentially stunted when it comes to making decisions based on emotional consequences (good vs evil).
Effectively, if you claim that AI could be deceptive and destructive, then the same argument could be made that it would be altruistic and constructive. The main issue is that humans will blindly use AI; humans are the problem. They are more likely to accept those destructive outputs when the AI fails to fit the data correctly, and to go along with them. In fact, we already do it.
How about not hiding whether it's in training or not, but instead running some kind of thread that compares the results in training with the results in deployment and, if there's a significant difference, gives it a huge penalty to the reward function, so that it's no longer worth deceiving us?
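A minimal sketch of what such a monitoring thread might look like, assuming we can log the agent's actions in both phases (the threshold, the penalty value and all interfaces are made up for illustration):

from collections import Counter

def action_distribution(actions):
    # Empirical frequency of each action in a log.
    counts = Counter(actions)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def divergence_penalty(train_actions, deploy_actions, threshold=0.3, penalty=-100.0):
    p, q = action_distribution(train_actions), action_distribution(deploy_actions)
    # Total variation distance between the two empirical action distributions.
    tv = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
    return penalty if tv > threshold else 0.0

# Behaves one way in training and another in deployment -> large penalty.
print(divergence_penalty(["exit", "exit", "apple"], ["apple", "apple", "apple"]))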
Once it is aware of a larger goal, all bets are off. If the Mesa-optimiser believes P=NP then it just converges into everything. I'm sure people will run one anyway, for the lolz.
5:21 Hey, that's me! After writing that comment I went ahead and read the paper, and eventually I realized that the distributional shift problem answers my question…
A truly benevolent AGI might be difficult to turn on, as it might immediately realize that the world would be better if it wasn't…
Do you think TSA and the obsession around security has similar motivations as AI safety research?
I know this belongs in Pascal's mugging video but that one's old, unlike this one
What if we make AIs that train in deployment, kind of like how humans learn a new job? If the deployment environment is the training environment, then there wouldn't be any incentive for deception in hopes of better rewards in deployment. Plus, misalignment might not matter as much, because the AI learned to perform what we want it to; it just does it for different goals. Much like animals have different goals than their genes, but the animals' goals satisfy the genes' goals.
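A toy sketch of the "keep training in deployment" idea, using a simple epsilon-greedy bandit as a stand-in for the agent (an entirely hypothetical setup, not from the video or the paper): the same update rule keeps running forever, so there is no separate frozen "deployed" policy to defer reward to.

import random

q_values = {"go_to_exit": 0.0, "collect_apple": 0.0}
LEARNING_RATE, EPSILON = 0.1, 0.1

def reward(action):
    # The base objective keeps being measured after "deployment" too.
    return 1.0 if action == "go_to_exit" else 0.0

def step():
    # Epsilon-greedy action selection followed by an online value update.
    if random.random() < EPSILON:
        action = random.choice(list(q_values))
    else:
        action = max(q_values, key=q_values.get)
    q_values[action] += LEARNING_RATE * (reward(action) - q_values[action])

for _ in range(1000):  # the updates never stop; learning continues in deployment
    step()
print(q_values)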
I don't know anything about computer science, AI or machine learning, but I love your videos nonetheless! Exciting times ahead!
Let's make a superintelligence. Put it in the real world. Convince it that it's not in the real world. Watch it affect what it thinks is the real world, outside of this one. Checkmate, theists and atheists.
keep the prime factors secret at all cost