Yannic Kilcher
#decisiontransformer #reinforcementlearning #transformer
Proper credit assignment over long timespans is a fundamental problem in reinforcement learning. Even methods designed to combat this problem, such as TD-learning, quickly reach their limits when rewards are sparse or noisy. This paper reframes offline reinforcement learning as a pure sequence modeling problem, with the actions being sampled conditioned on the given history and desired future rewards. This allows the authors to use recent advances in sequence modeling using Transformers and achieve competitive results in Offline RL benchmarks.
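To make the conditioning concrete, here is a minimal, hypothetical sketch (not the authors' implementation; `model` and its `predict_next_action` method are placeholders) of how a trajectory becomes a (return-to-go, state, action) sequence and how an action would be sampled at test time:

```python
# Minimal sketch of Decision-Transformer-style conditioning.
# `model` and `predict_next_action` are placeholders, not the official API.
import numpy as np

def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end of the episode (undiscounted)
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(states, actions, rewards):
    # Interleave tokens as (return-to-go, state, action) per timestep
    rtg = returns_to_go(np.asarray(rewards, dtype=float))
    seq = []
    for R, s, a in zip(rtg, states, actions):
        seq += [("return", R), ("state", s), ("action", a)]
    return seq

def act(model, context, desired_return, state):
    # At test time: condition on the *desired* return and the current state,
    # then let the causally masked Transformer predict the next action.
    context = context + [("return", desired_return), ("state", state)]
    return model.predict_next_action(context)
```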
OUTLINE:
0:00 – Intro & Overview
4:15 – Offline Reinforcement Learning
10:10 – Transformers in RL
14:25 – Value Functions and Temporal Difference Learning
20:25 – Sequence Modeling and Reward-to-go
27:20 – Why this is ideal for offline RL
31:30 – The context length problem
34:35 – Toy example: Shortest path from random walks
41:00 – Discount factors
45:50 – Experimental Results
49:25 – Do you need to know the best possible reward?
52:15 – Key-to-door toy experiment
56:00 – Comments & Conclusion
Paper: https://arxiv.org/abs/2106.01345
Website: https://sites.google.com/berkeley.edu/decision-transformer
Code: https://github.com/kzl/decision-transformer
Trajectory Transformer: https://trajectory-transformer.github.io/
Upside-Down RL: https://arxiv.org/abs/1912.02875
Abstract:
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content 🙂
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Sequence modeling, Transformer, memory model, back to RL… kind of feels like researchers are running in circles here…
general intelligence can be achieved by maximizing the Schmids that are Hubed.
Calling this reinforcement learning is a stretch. This is more akin to imitation learning as it is modeling a group of agents. RL is not a modeling problem.
The fact that conditioning on the past works better probably means the problem is non-Markovian with respect to the state representation chosen initially for the task. Conditioning on past states and rewards (and actions, why not) enriches the states and allows the model to better discriminate the best action. It is limited in terms of context size, but much richer than classic RL, where the system is supposed to be Markovian and a single state is all you get.
Also, credit assignment happens regardless of the context size, because the reward is propagated backwards in time as the agent encounters states that are close enough.
If that were not the case, more classic RL models would be even worse than this model, because they only update a single state-action value rather than this rich (and smoothed) state representation.
It is because value = current reward + future reward that the reward is progressively propagated back. (You maximize non-discounted rewards by defining a value function with discounted future rewards, so the series converges over an infinite horizon; see the Bellman backup written out below.)
Also interesting: in the planning-as-inference literature, you also condition on the "optimality" of your action, similarly to conditioning on the reward, although the value of the reward does not matter, only that it is the optimal trajectory.
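For reference, the recursion being described here is the standard Bellman backup (textbook notation, not taken from the paper or the video):

V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \right]

Applying this backup repeatedly is what propagates reward information backwards to earlier states.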
There is another new and interesting paper by Sergey Levine's group called Trajectory Transformer.
I would imagine that by discount factor they were referring to gamma. Since Q-learning is a TD(0) algorithm, there is no lambda to tune. One good intuition for the meaning/purpose of a discount factor is that it is a proxy for the likelihood your agent will survive to reach a future reward. It's more about tuning how far back it can look for credit assignment, which affects how stable the learning process is.
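For context, this is the standard Q-learning update being referred to (textbook form, not from the paper); only the discount factor \gamma appears, there is no \lambda:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]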
0:55 I read the causal transformer as casual transformer.
11:40 Can we call it "model-only reinforcement learning"?
32:50 I don't think it is safe to say that all Q-learning/RNN-based learning will be able to incorporate information from that far back in the past into the current decision. It can, but it is not guaranteed, and in practice it might forget.
50:20 I think this "any reward" thing can be quite useful in developing AI for video games. We don't want a computer opponent to play the hardest it can; the human player should be able to dial down the difficulty.
This paper just throws SARSA into a transformer? That's it?
Why doesn't Schmidhuber like transformers? Or does he?
😂 "I realized some of you youngsters might not acutally know what an LSTM is"
https://youtu.be/-buULmf7dec?t=698
Hi, great video as always.
I have a problem with the term "offline RL": not every policy-learning algorithm is reinforcement learning.
The main problem that RL tries to solve is not credit assignment but the exploration vs. exploitation tradeoff.
If there is no exploration, it is not RL.
Great videos as always! Which tool/software do you use to get an infinite PDF canvas to draw on?
I have a question: can you add a link to the second paper?
The reason to discount future rewards is that it's not smart enough (43:30)
Since the paper implies that a prior is used on the data to essentially extract the highest-reward trajectories, I'm skeptical that it would work well on problems with nondeterministic dynamics. For example, for a case where a particular action has a 50/50 chance of producing a reward of 100 or a reward of -100, the bad trajectories would be thrown out and the model would learn that the state-action pair in question leads to a reward of 100, when in fact on average it leads to a reward of 0. A different action for that state that always gives a reward of 90 would be a better prediction for "action that leads to a reward of 100." Or am I misunderstanding?
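A quick sanity check of this concern, as an illustrative sketch (the numbers come from the comment above, not from the paper):

```python
# Illustrative only: compares the true expected returns of the two actions
# described in the comment (50/50 chance of +100/-100 vs. a guaranteed 90).
import random

random.seed(0)
n = 100_000
risky = [100 if random.random() < 0.5 else -100 for _ in range(n)]
safe = [90] * n

print(sum(risky) / n)  # ~0  -> true expected return of the risky action
print(sum(safe) / n)   # 90  -> the safe action is better in expectation

# If only the trajectories where the risky action happened to pay off are kept
# (or the model is conditioned on a return of 100), the learner effectively
# sees only the +100 outcomes and can come to prefer the risky action.
```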
If you are watching this video, I'm sure you'd like the Introduction to Reinforcement Learning videos from David Silver. https://www.youtube.com/watch?v=2pWv7GOvuf0
Perhaps this was already pointed out, and I apologise for sounding overly rigorous!
At 17:50 you start describing the fundamental intuition of temporal difference learning by saying "Q^{\pi}(s) = r + Q^{\pi}(s')". Which is great, but that's the value function V(s), not the state-action value function Q(s, a), which also takes an action in its function signature (see the definitions below). For the purpose of your explanation it doesn't really matter.
But I'll leave this comment here just in case. Keep up the amazing work.
And congrats on your recent graduation, Dr. Kilcher 😀
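For completeness, the two functions being contrasted, in standard textbook notation (G_t denotes the return from time t; not taken from the paper):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right]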
Dang it! I had this idea. Well, at least I still have one more idea. Hopefully no one scoops me before I can finish my PhD. Fingers crossed!
Discount factors are not a design choice. You need them for infinite horizon, otherwise the sum blows up and the expectation is undefined. Here they don't "need" them because they're working with a finite horizon formulation.
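In standard notation (not from the paper): with bounded rewards |r_t| \le r_{\max} and \gamma < 1, the infinite-horizon discounted return satisfies

\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \;\le\; \frac{r_{\max}}{1-\gamma},

so the expectation is well defined; without discounting, the sum can indeed diverge.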
"I realize, some of you youngsters don't know what an LSTM actually is" ow boy, am I getting old now?
It feels like the notion of "reward" was confused with "return". The discount factor is just gamma; lambda only shows up in the λ-return.
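For reference (textbook notation, not from the paper): \gamma discounts rewards inside the return, while \lambda only appears when mixing n-step returns G_t^{(n)} into the \lambda-return used by TD(\lambda):

G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}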
Scary to think that there might be "youngsters" who watch these videos and do not know what an LSTM is. I love living in a time with this pace of innovation.
Very interesting, and thank you for the video. I like your critical thoughts. I would never dare to question a Berkeley, Facebook, Google paper ^^
Couldn't one combine this with the TD idea by adding the value V to the sequence?
I think you're mistaking online/offline RL for on-policy/off-policy RL
I guess this is one step backwards from artificial general intelligence, and since someone has already made the Schmidhuber joke… 😂😂
You threw in Schmidhuber's 2019 paper, but it's also interesting to note how this approach goes back to Hutter 2005, with General Reinforcement Learning as Solomonoff Induction + Utility Theory.