Yannic Kilcher
#decisiontransformer #reinforcementlearning #transformer
Proper credit assignment over long timespans is a fundamental problem in reinforcement learning. Even methods designed to combat this problem, such as TD-learning, quickly reach their limits when rewards are sparse or noisy. This paper reframes offline reinforcement learning as a pure sequence modeling problem, with the actions being sampled conditioned on the given history and desired future rewards. This allows the authors to use recent advances in sequence modeling using Transformers and achieve competitive results in Offline RL benchmarks.
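To make the conditioning concrete, here is a minimal, hypothetical sketch (not the authors' implementation; `model` and its `predict_next_action` method are placeholders) of how a trajectory becomes a (return-to-go, state, action) sequence and how an action would be sampled at test time:

```python
# Minimal sketch of Decision-Transformer-style conditioning.
# `model` and `predict_next_action` are placeholders, not the official API.
import numpy as np

def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end of the episode (undiscounted)
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(states, actions, rewards):
    # Interleave tokens as (return-to-go, state, action) per timestep
    rtg = returns_to_go(np.asarray(rewards, dtype=float))
    seq = []
    for R, s, a in zip(rtg, states, actions):
        seq += [("return", R), ("state", s), ("action", a)]
    return seq

def act(model, context, desired_return, state):
    # At test time: condition on the *desired* return and the current state,
    # then let the causally masked Transformer predict the next action.
    context = context + [("return", desired_return), ("state", state)]
    return model.predict_next_action(context)
```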
OUTLINE:
0:00 – Intro & Overview
4:15 – Offline Reinforcement Learning
10:10 – Transformers in RL
14:25 – Value Functions and Temporal Difference Learning
20:25 – Sequence Modeling and Reward-to-go
27:20 – Why this is ideal for offline RL
31:30 – The context length problem
34:35 – Toy example: Shortest path from random walks
41:00 – Discount factors
45:50 – Experimental Results
49:25 – Do you need to know the best possible reward?
52:15 – Key-to-door toy experiment
56:00 – Comments & Conclusion
Paper: https://arxiv.org/abs/2106.01345
Website: https://sites.google.com/berkeley.edu/decision-transformer
Code: https://github.com/kzl/decision-transformer
Trajectory Transformer: https://trajectory-transformer.github.io/
Upside-Down RL: https://arxiv.org/abs/1912.02875
Abstract:
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content 🙂
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Sequence modeling, Transformer, memory model, back to RL… kind of feels like researchers are running in circles here…
general intelligence can be achieved by maximizing the Schmids that are Hubed.
Calling this reinforcement learning is a stretch. This is more akin to imitation learning as it is modeling a group of agents. RL is not a modeling problem.
The fact that conditioning on the past works better probably means the problem is non-Markovian with respect to the state representation chosen initially for the task. Conditioning on past states and rewards (and actions, why not) enriches the states and allows the model to better discriminate the best action. It is limited in terms of context size, but much richer than classic RL, where the system is supposed to be Markovian and a single state is all you get.
Also, credit assignment happens regardless of the context size, because the reward is propagated backwards in time as the agent encounters states that are close enough.
If that were not the case, more classic RL models would be even worse than this model, because they only update a single state-action value rather than this rich (and smoothed) state representation.
It is because value = current reward + future reward that the reward is progressively propagated back. (You maximize non-discounted rewards by defining a value function with discounted future rewards, so the series converges over an infinite horizon; see the Bellman backup written out below.)
Also interesting: in the planning-as-inference literature, you also condition on the "optimality" of your action, similarly to conditioning on the reward, although the value of the reward does not matter, only that it is the optimal trajectory.
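For reference, the recursion being described here is the standard Bellman backup (textbook notation, not taken from the paper or the video):

V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \right]

Applying this backup repeatedly is what propagates reward information backwards to earlier states.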
There is another new and interesting paper by Sergey Levine's group called Trajectory Transformer.
I would imagine that by discount factor they were referring to gamma. Since Q-learning is a TD(0) algorithm, there is no lambda to tune. One good intuition for the meaning/purpose of a discount factor is that it is a proxy for the likelihood your agent will survive to reach a future reward. It's more about tuning how far back it can look for credit assignment, which affects how stable the learning process is.
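For context, this is the standard Q-learning update being referred to (textbook form, not from the paper); only the discount factor \gamma appears, there is no \lambda:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]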
0:55 I read the causal transformer as casual transformer.
11:40 Can we call it "model-only reinforcement learning"?
32:50 I don't think it is safe to say that all Q-learning/RNN-based learning will be able to incorporate information from that far back in the past into the current decision. It can, but it is not guaranteed, and in practice it might forget.
50:20 I think this "any reward" thing can be quite useful in developing AI for video games. We don't want a computer opponent to play the hardest it can; the human player should be able to dial down the difficulty.
This paper just throws SARSA into a transformer? That's it?
Why doesn't Schmidhuber like transformers? Or does he?
😂 "I realized some of you youngsters might not acutally know what an LSTM is"
https://youtu.be/-buULmf7dec?t=698
Hi, great video as always.
I have a problem with the term "offline RL": not every policy-learning algorithm is reinforcement learning.
The main problem that RL tries to solve is not credit assignment but the exploration vs. exploitation tradeoff.
If there is no exploration, it is not RL.
Great videos as always! Which tool/software do you use to get an infinite PDF canvas to draw on?
I have a question: can you add a link to the second paper?
The reason to discount future rewards is that it's not smart enough (43:30)
Since the paper implies that a prior is used on the data to essentially extract the highest-reward trajectories, I'm skeptical that it would work well on problems with nondeterministic dynamics. For example, for a case where a particular action has a 50/50 chance of producing a reward of 100 or a reward of -100, the bad trajectories would be thrown out and the model would learn that the state-action pair in question leads to a reward of 100, when in fact on average it leads to a reward of 0. A different action for that state that always gives a reward of 90 would be a better prediction for "action that leads to a reward of 100." Or am I misunderstanding?
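A quick sanity check of this concern, as an illustrative sketch (the numbers come from the comment above, not from the paper):

```python
# Illustrative only: compares the true expected returns of the two actions
# described in the comment (50/50 chance of +100/-100 vs. a guaranteed 90).
import random

random.seed(0)
n = 100_000
risky = [100 if random.random() < 0.5 else -100 for _ in range(n)]
safe = [90] * n

print(sum(risky) / n)  # ~0  -> true expected return of the risky action
print(sum(safe) / n)   # 90  -> the safe action is better in expectation

# If only the trajectories where the risky action happened to pay off are kept
# (or the model is conditioned on a return of 100), the learner effectively
# sees only the +100 outcomes and can come to prefer the risky action.
```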
If you are watching this video, I'm sure you'd like the Introduction to Reinforcement Learning videos from David Silver. https://www.youtube.com/watch?v=2pWv7GOvuf0
Perhaps this was already pointed out, and I apologise for sounding overly rigorous!
At 17:50 you start describing the fundamental intuition of temporal difference learning by saying "Q^{\pi}(s) = r + Q^{\pi}(s')". Which is great, but that's the value function V(s), not the state-action value function Q(s, a), which also takes an action in its function signature (see the definitions below). For the purpose of your explanation it doesn't really matter.
But I'll leave this comment here just in case. Keep up the amazing work.
And congrats on your recent graduation, Dr. Kilcher 😀
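For completeness, the two functions being contrasted, in standard textbook notation (G_t denotes the return from time t; not taken from the paper):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right]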
Dang it! I had this idea. Well, at least I still have one more idea. Hopefully no one scoops me before I can finish my PhD. Fingers crossed!
Discount factors are not a design choice. You need them for infinite horizon, otherwise the sum blows up and the expectation is undefined. Here they don't "need" them because they're working with a finite horizon formulation.
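In standard notation (not from the paper): with bounded rewards |r_t| \le r_{\max} and \gamma < 1, the infinite-horizon discounted return satisfies

\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \;\le\; \frac{r_{\max}}{1-\gamma},

so the expectation is well defined; without discounting, the sum can indeed diverge.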
"I realize, some of you youngsters don't know what an LSTM actually is" ow boy, am I getting old now?
It feels like the notion of "reward" was confused with "return". The discount factor is just gamma; lambda only shows up in the λ-return.
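For reference (textbook notation, not from the paper): \gamma discounts rewards inside the return, while \lambda only appears when mixing n-step returns G_t^{(n)} into the \lambda-return used by TD(\lambda):

G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}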
Scary to think that there might be "youngsters" who watch these videos and do not know what an LSTM is. I love living in a time with this pace of innovation.
Very interesting, and thank you for the video. I like your critical thoughts. I would never dare to question a Berkeley, Facebook, Google paper ^^
Couldn't one combine this with the TD idea by adding the value V to the sequence?
I think you're mistaking online/offline RL for on-policy/off-policy RL
I guess this is one step backwards from artificial general intelligence, and since someone has already made the Schmidhuber joke… 😂😂
You threw in Schmidhuber's 2019 paper, but it's also interesting to note how this approach goes back to Hutter 2005, with General Reinforcement Learning as Solomonoff Induction + Utility Theory.