deeplizard
Welcome back to this series on reinforcement learning! In this video, we'll continue our discussion of deep Q-networks. Before we can move on to discussing exactly how a DQN is trained, we're first going to explain the concepts of experience replay and replay memory, which are utilized during the training process. So, let's get to it!
Jürgen Schmidhuber interview: https://youtu.be/zK_x3Ba2l5Q
DEEPLIZARD COMMUNITY RESOURCES
Hey, we're Chris and Mandy, the creators of deeplizard!
CHECK OUT OUR VLOG:
https://www.youtube.com/channel/UC9cBIteC3u7Ee6bzeOcl_Og
Check out the blog post and other resources for this video:
https://deeplizard.com/learn/video/Bcuj2fTH4_4
DOWNLOAD ACCESS TO CODE FILES
Available for members of the deeplizard hivemind:
https://www.patreon.com/posts/27743395
Support collective intelligence, join the deeplizard hivemind:
https://deeplizard.com/hivemind
Support collective intelligence, create a quiz question for this video:
https://deeplizard.com/create-quiz-question
Boost collective intelligence by sharing this video on social media!
Special thanks to the following polymaths of the deeplizard hivemind:
Prash
Follow deeplizard:
Our vlog: https://www.youtube.com/channel/UC9cBIteC3u7Ee6bzeOcl_Og
Twitter: https://twitter.com/deeplizard
Facebook: https://www.facebook.com/Deeplizard-145413762948316
Patreon: https://www.patreon.com/deeplizard
YouTube: https://www.youtube.com/deeplizard
Instagram: https://www.instagram.com/deeplizard/
Deep Learning with deeplizard:
Fundamental Concepts – https://deeplizard.com/learn/video/gZmobeGL0Yg
Beginner Code – https://deeplizard.com/learn/video/RznKVRTFkBY
Advanced Code – https://deeplizard.com/learn/video/v5cngxo4mIg
Advanced Deep RL – https://deeplizard.com/learn/video/nyjbcRQ-uQ8
Other Courses:
Data Science – https://deeplizard.com/learn/video/d11chG7Z-xk
Trading – https://deeplizard.com/learn/video/ZpfCK_uHL9Y
Check out products deeplizard recommends on Amazon:
https://www.amazon.com/shop/deeplizard
Get a FREE 30-day Audible trial and 2 FREE audio books using deeplizard's link:
https://amzn.to/2yoqWRn
deeplizard uses music by Kevin MacLeod
https://www.youtube.com/channel/UCSZXFhRIx6b0dFX3xS8L1yQ
http://incompetech.com/
Please use the knowledge gained from deeplizard content for good, not evil.
A question about replay memory — We start by picking a size N for the memory capacity (in your video, N was chosen to be 6). It wasn't fully explained, but is it correct to assume that at each time step we store the data from that step, and then, once capacity is full, we make room for the data from the next step by releasing the memory from the oldest data? In other words, replay memory will always hold data from the last N steps? Thanks!
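That first-in-first-out picture matches the common way replay memory is implemented. Below is a minimal sketch of a fixed-capacity memory that always keeps only the experiences from the last N steps; the class and method names are made up for illustration, not taken from the deeplizard code:

import random
from collections import deque, namedtuple

# One experience: the state and action at time t, plus the reward and state at t+1.
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state'])

class ReplayMemory:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry once capacity is reached,
        # so the memory always holds the experiences from the last N steps.
        self.memory = deque(maxlen=capacity)

    def push(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        # Random sampling (rather than replaying in order) is what breaks the
        # correlation between consecutive experiences.
        return random.sample(list(self.memory), batch_size)

memory = ReplayMemory(capacity=6)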
Keep up the amazing work !Β Love from India .
Hey there Deeplizard — I'm back with another question. (I'm having to watch this one multiple times — there's a lot to unpack.) Ok, so experience at time t is defined as the state at time t, plus the action at time t, plus the reward at t+1, and the state at t+1. I'm not sure I understand why the reward term in this definition is the reward at the next time step rather than the reward (if any) at the current time step. Thanks in advance for any help!
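One way to see the indexing: in the MDP convention the series uses, the reward labeled R_{t+1} is the environment's response to taking action A_t in state S_t, so it carries the next time index even though it belongs to the current state-action pair. A toy sketch of a single step, using a made-up stand-in environment (nothing here is the deeplizard code):

import random

class ToyEnv:
    # A hypothetical two-state environment, just for illustration.
    def reset(self):
        return 0                                    # this is S_t
    def step(self, action):
        next_state = 1                              # S_{t+1}
        reward = 1.0 if action == 'right' else 0.0  # R_{t+1}: consequence of A_t in S_t
        return next_state, reward

env = ToyEnv()
state = env.reset()                        # S_t
action = random.choice(['left', 'right'])  # A_t, chosen while in S_t
next_state, reward = env.step(action)      # the environment answers with R_{t+1} and S_{t+1}

# The stored experience is e_t = (S_t, A_t, R_{t+1}, S_{t+1}): the reward is the one
# produced by this state-action pair, it just arrives one time step later.
experience = (state, action, reward, next_state)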
Your videos show a lot of game playing at the end of the videos. I have a hard time understanding this in the context of Markov Decision Processes. In many games the environment is adversarial. So you have to keep from being predictable. How does the agent learn this? For example, what would happen if you try to train an agent to play "rock-paper-scissors"?
Great explanation!
Also, it would be great if you increased the audio volume in your videos. It's so low.
I just finished all the RL series and found out that this video was recorded just a few days ago! Can't wait for the next video!
Good Job…
Eagerly waiting for policy gradient and actor critic
Please upload em…
Awesome video. Please upload the next tutorial. We are waiting. Thank you!
Christmas came early. Loving this series!
Awesome videos, eagerly waiting for the next one!
What I learned:
1. Replay memory: we store the agent's experiences at each time step, including s_t, a_t, r_{t+1}, and s_{t+1}. We store the last N steps. (It will get clearer once I see the implementation.)
2. Experience replay: gaining experience and sampling from the replay memory.
3. Why: to break the correlation between consecutive samples.
4. Getting the big picture of how it all combines together. Still a little confusing; can't wait to try it in code. I think I may try training directly from the sequence to see the result. Maybe.
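For point 4, here is a rough skeleton of how the pieces fit together during training; every name and number below is a placeholder for illustration, not the code from the series:

import random
from collections import deque

memory = deque(maxlen=1000)   # replay memory holding the last N experiences
batch_size = 4

for t in range(20):
    # Pretend interaction with an environment: made-up state/action/reward values.
    state, action = t, random.choice([0, 1])
    reward, next_state = random.random(), t + 1

    # Store the experience e_t = (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory.
    memory.append((state, action, reward, next_state))

    # Once enough experiences are stored, sample a random batch and train on it.
    if len(memory) >= batch_size:
        batch = random.sample(list(memory), batch_size)
        # ...compute the DQN loss from this batch and update the network here...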
I am confused. Shouldn't the experience tuple e(t) be defined as (s(t), a(t), r(t), s(t+1))? I thought we would be storing the reward corresponding to the current state-action pair and not the next one.
What I really don't get is how we can say 'this is the reward you will get from doing action A'. I've been following the series, and I just don't get how you can move along on the frozen ice, get 0 reward for that action, but somehow learn stuff. I mean, if you are at the second-to-last action, sure, I get it: you can win the 100 reward. But isn't that because the agent knows which state comes next if it moves, say, left?
For Breakout, do we somehow program the thing to understand what the next state is likely to be when it presses left or right? It's like I get all the follow-up material, but I just cannot get past this starting roadblock in my understanding. I've been following the blog and everything, and I don't get it.
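The usual resolution is that the agent does not need to predict the next state: in Q-learning, the update target bootstraps from the value of whatever state it actually lands in, so a zero-reward move on the ice can still change Q once the states near the goal have value. A toy update with made-up numbers:

alpha, gamma = 0.1, 0.9   # learning rate and discount factor

# Suppose the agent has already learned that the best move from the next state is worth 100.
q_next_best = 100.0       # max over actions a' of Q(s_{t+1}, a')

q_current = 0.0           # current estimate of Q(s_t, a_t)
reward = 0.0              # the move itself gives no reward

# Q-learning target: r + gamma * max_a' Q(s_{t+1}, a')
target = reward + gamma * q_next_best              # 0 + 0.9 * 100 = 90
q_current = q_current + alpha * (target - q_current)

print(q_current)   # 9.0: the estimate moved even though the immediate reward was 0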
At 5:02, watch the ball float through the bricks. Someone's got a bug in their Breakout!
If an experience tuple does not contain a "q-value", and random samples are taken from the replay memory, is exploration vs exploitation really necessary? Can't we just explore randomly?
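A note on how these usually relate: epsilon-greedy exploration (from earlier in the series) decides which actions get taken, and therefore which experiences ever reach the replay memory, while random sampling from memory only decides which stored experiences the network trains on next. A tiny epsilon-greedy sketch with made-up Q-values:

import random

q_values = {'left': 0.2, 'right': 0.9}   # made-up Q-values for one state
epsilon = 0.1                            # exploration rate

if random.random() < epsilon:
    # Explore: take a random action so new kinds of experiences reach replay memory.
    action = random.choice(list(q_values))
else:
    # Exploit: take the action with the highest current Q-value estimate.
    action = max(q_values, key=q_values.get)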
I have a question. Why do we take random samples from the experience memory to train the neural network? What I read somewhere else is that we NEED the sequence, because we want the network to learn what sequence of actions will fail and what sequence of actions will succeed. If we break the sequence, the network won't be able to learn that.
Can you help me with this?