
Illustrated Guide to Transformers Neural Network: A step by step explanation



The A.I. Hacker – Michael Phi

Transformers are all the rage nowadays, but how do they work? This video demystifies the novel neural network architecture with a step-by-step explanation and illustrations of how transformers work.

CORRECTIONS:
The sine and cosine functions actually alternate across the embedding dimensions, and their arguments depend on the time steps!
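
The correction refers to the positional encoding: sine and cosine alternate across the embedding dimensions, while their arguments vary with the time step (position). Below is a minimal sketch of the formula from "Attention Is All You Need", assuming only NumPy; it is not code from the video.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: one row per time step."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                       # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # cosine on odd dimensions
    return pe

# Each row is added to the token embedding at that time step.
print(positional_encoding(seq_len=4, d_model=8).round(3))
```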

⭐ Play and Experiment With the Latest AI Technologies at https://grandline.ai

Hugging Face Write with Transformers
https://transformer.huggingface.co/



35 thoughts on “Illustrated Guide to Transformers Neural Network: A step by step explanation”
  1. Is there a small mistake in the graphical explanation at 4:45? Could you clarify this?

     Yes, the graphic at that point is inconsistent with how positional encodings actually work. In a correct implementation, the sine and cosine alternate across successive dimensions of the encoding vector (even dimensions use sine, odd dimensions use cosine), not across successive time steps. Every position uses the same formula; what changes from one time step to the next is the position value fed into it, so each time step gets its own distinct encoding vector.

  2. This is more of a description than an explanation. Simply describing a diagram and naming the blocks is not necessarily helpful. Anyone else here completely confused, don't dismay: I'm a software engineer with experience in some other machine learning algorithms, and I couldn't make much sense of any of this.

  3. Amazing. I still don’t really understand how the Q, K and V values are calculated, but I learnt a lot more about this seminal paper here than from other explanations — thank you! 🙏

  4. Hi, could you please explain more about the part at 13:07 (about step 6)? You said 'the process matches the encoder input to the decoder input, allowing the decoder to decide which encoder input is relevant to put focus on'. Do you mean that there is more than one encoder whose outputs are fed into the decoder in parallel? (See the cross-attention sketch after the comments.)

  5. Why does the decoder select the token with the maximum probability instead of randomly sampling a token from the probability distribution? (See the decoding sketch after the comments.)

Comments are closed.
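
On comments 3 and 4: Q, K and V are not looked up anywhere; they are learned linear projections of the input vectors, and there is a single encoder stack. In the decoder's second attention sub-layer ("step 6"), the queries come from the decoder while the keys and values come from that one encoder's output sequence, which is how the decoder decides which input positions to focus on. Here is a hedged NumPy sketch with made-up toy dimensions, not the video's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_input, kv_input, W_q, W_k, W_v):
    Q = q_input @ W_q                        # queries projected from one sequence
    K = kv_input @ W_k                       # keys projected from (possibly another) sequence
    V = kv_input @ W_v                       # values come from the same sequence as the keys
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product
    weights = softmax(scores, axis=-1)       # attention distribution per query
    return weights @ V                       # weighted sum of values

# Toy "encoder output" (5 positions) and "decoder states" (3 positions).
encoder_out = rng.normal(size=(5, d_model))
decoder_states = rng.normal(size=(3, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Self-attention: Q, K and V all come from the same sequence.
self_attn = attention(decoder_states, decoder_states, W_q, W_k, W_v)
# Encoder-decoder (cross-) attention: Q from the decoder, K and V from the encoder output.
cross_attn = attention(decoder_states, encoder_out, W_q, W_k, W_v)
print(self_attn.shape, cross_attn.shape)     # (3, 8) (3, 8)
```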
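
On comment 5: taking the arg-max (greedy decoding) is just one decoding strategy, used in the video for simplicity. Sampling from the output distribution, or beam search, are equally valid and often give more varied output. A toy sketch with a made-up probability vector, not the video's model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "hello", "world", "transformers"]
probs = np.array([0.05, 0.15, 0.20, 0.60])    # decoder softmax output for one step

greedy_token = vocab[int(np.argmax(probs))]             # deterministic: always "transformers"
sampled_token = vocab[rng.choice(len(vocab), p=probs)]  # random draw following the distribution
print(greedy_token, sampled_token)
```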
