Jay Alammar
Canonical Correlation Analysis (CCA) is one of the methods used to explore deep neural networks. Methods like CKA and SVCCA give us insight into how a neural network processes its inputs, typically by serving as similarity measures between activation matrices. In this video, we look at a number of papers that compare different neural networks to one another, as well as papers that compare the representations learned by the various layers of a single network.
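As a minimal sketch of the core idea (my illustration, not code from the video), here is linear CKA used to compare two activation matrices, following the formulation in Kornblith et al.'s "Similarity of Neural Network Representations Revisited" (linked below). The array shapes and the function name are assumptions for illustration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (examples, neurons)."""
    # Center each feature (column) so both matrices have zero mean.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, ord="fro") *
                    np.linalg.norm(Y.T @ Y, ord="fro"))

# Hypothetical usage: activations of the same 500 inputs from two layers.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((500, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # a random rotation
acts_b = acts_a @ Q
print(linear_cka(acts_a, acts_b))  # ~1.0: CKA is invariant to rotations
```

A CKA score near 1 means the two layers represent the inputs similarly even when individual neurons differ; this invariance to rotation is one reason such measures are preferred over comparing neurons one-to-one.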
Contents:
Introduction (0:00)
Correlation (0:54)
How CCA is used to compare representations (2:50)
SVCCA and Computer Vision models (4:40)
Examining NLP language models with SVCCA: LSTM (9:01)
PWCCA – Projection Weighted Canonical Correlation Analysis (10:22)
How multilingual BERT represents different languages (10:43)
CKA: Centered Kernel Alignment (15:25)
BERT, GPT2, ELMo similarity analysis with CKA (16:07)
Convnets, Resnets, deep nets and wide nets (17:35)
Conclusion (18:59)
Explainable AI Cheat Sheet: https://ex.pegg.io/
1) Explainable AI Intro: https://www.youtube.com/watch?v=Yg3q5x7yDeM&t=0s
2) Neural Activations & Dataset Examples: https://www.youtube.com/watch?v=y0-ISRhL4Ks
3) Probing Classifiers: A Gentle Intro (Explainable AI for Deep Learning): https://www.youtube.com/watch?v=HJn-OTNLnoE
-----
Papers:
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
https://arxiv.org/pdf/1706.05806.pdf
Understanding Learning Dynamics Of Language Models with SVCCA
https://arxiv.org/pdf/1811.00225.pdf
Insights on representational similarity in neural networks with canonical correlation
https://arxiv.org/pdf/1806.05759.pdf
BERT is Not an Interlingua and the Bias of Tokenization
https://www.aclweb.org/anthology/D19-6106.pdf
Similarity of Neural Network Representations Revisited
http://proceedings.mlr.press/v97/kornblith19a/kornblith19a.pdf
Similarity Analysis of Contextual Word Representation Models
https://arxiv.org/pdf/2005.01172.pdf
Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
https://arxiv.org/pdf/2010.15327.pdf
-----
Twitter: https://twitter.com/JayAlammar
Blog: https://jalammar.github.io/
Mailing List: http://eepurl.com/gl0BHL
-----
More videos by Jay:
The Narrated Transformer Language Model
https://youtu.be/-QH8fRhqFHM
Jay’s Visual Intro to AI
https://www.youtube.com/watch?v=mSTCzNgDJy4
How GPT-3 Works – Easily Explained with Animations
https://www.youtube.com/watch?v=MQnJZuBGmSQ
Up and Down the Ladder of Abstraction [interactive article by Bret Victor, 2011]
https://www.youtube.com/watch?v=1S6zFOzee78
The Unreasonable Effectiveness of RNNs (Article and Visualization Commentary) [2015 article]
https://www.youtube.com/watch?v=o9LEWynwr6g
In my short book on the Walsh-Hadamard transform I explain (for some reason) that ReLU is a switch: f(x) = x is "connected", f(x) = 0 is "disconnected". A light switch in your house is binary (on or off), yet it connects and disconnects a continuously variable AC voltage signal (a sine wave).
In your house you decide the switch state; in a ReLU unit, the switch state is decided by the predicate x <= 0 (disconnect when true, connect otherwise).
I can say that, but I don't think people get it. It is too much of an alteration in viewpoint, or they don't see the implications.
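A quick sketch of that switch view (my illustration, not from the book): ReLU computes a binary gate from the sign of x and applies it to the signal x itself.

```python
import numpy as np

def relu_as_switch(x):
    gate = x > 0        # switch state: on when x > 0, off when x <= 0
    return gate * x     # connected passes x through; disconnected outputs 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu_as_switch(x))   # [0.  0.  0.  0.5 2. ]
print(np.maximum(x, 0.0))  # identical to the standard ReLU
```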