Backpropagation as simple as possible, but no simpler. Perhaps the most misunderstood part of neural networks, backpropagation of errors is the key step that allows ANNs to learn. In this video, I give the derivation and thought process behind backpropagation using high-school-level calculus.
Supporting Code and Equations:
https://github.com/stephencwelch/Neural-Networks-Demystified
In this series, we will build and train a complete Artificial Neural Network in Python. New videos every other Friday.
Part 1: Data + Architecture
Part 2: Forward Propagation
Part 3: Gradient Descent
Part 4: Backpropagation
Part 5: Numerical Gradient Checking
Part 6: Training
Part 7: Overfitting, Testing, and Regularization
@stephencwelch

Text Comments (622)

Welch Labs (4 months ago)

Correction at 1:05 - derivative should read: 2(3x+2x^2)(3+4x). Thanks to Siddharth for pointing this out!
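
A quick numerical check of the corrected derivative, as a sketch. It assumes the function being differentiated at 1:05 is (3x + 2x^2)^2, which is inferred from the correction rather than stated in it; the function names are made up for illustration.

```python
def f(x):
    # Assumed example function from the video: (3x + 2x^2)^2
    return (3 * x + 2 * x ** 2) ** 2

def df_chain_rule(x):
    # Outer derivative 2 * (inner), times inner derivative (3 + 4x)
    return 2 * (3 * x + 2 * x ** 2) * (3 + 4 * x)

def df_numeric(x, h=1e-6):
    # Central finite difference, used only to sanity-check the formula
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.5
analytic = df_chain_rule(x0)   # 2 * 9 * 9 = 162.0
numeric = df_numeric(x0)
print(analytic, numeric)
```

The two values should agree to several decimal places; a wrong inner derivative would show up as a clearly different number.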

Siddharth Agrawal (1 month ago)

Also the sum rule and chain rule, both have that mistake, (should be 3+4x) but no biggie 😉

Siddharth Agrawal (1 month ago)

lol im also siddharth, wassup siddharth

Carlos Nexus (13 days ago)

This is the video I've been looking for... I really needed to see how it is written in matrix and derivatives form. Backpropagation is amazing!

Өлзиймөнх Болдоо (18 days ago)

o__O

Jason Bear (18 days ago)

I'd be able to watch all the videos in one sitting but have to break it up in days just because of the guitar. Like others have said, great videos with the exception of the guitar.

Sujay Alaspure (27 days ago)

At 4:43, shouldn't a2 be a 3x2 matrix? How can it be 3x3?

Kien Pham (1 month ago)

In 4:30, you said that delta(3) = -(y - y^) * f'(z(3)). But according to Andrew's ML course in Coursera, delta(3) should be equal to a(3) - y, which is -(y -y^) ; there's no f'(z(3)) term since L = 3. I did do the programming assignment for week 5, but when i multiplied the g'(z(3)) term, it didn't work. Please explain this to me or correct me if I misunderstood it, thanks!
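
Both formulas can be correct, because they belong to different cost functions. With a sigmoid output and the squared-error cost used in this series, the output error keeps the f'(z(3)) factor. With a sigmoid output and the cross-entropy (logistic) cost, that factor cancels and the error reduces to y^ - y. The sketch below checks both numerically for a single example. It assumes the Coursera assignment uses the cross-entropy cost; the function names and sample values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(z, y):
    # Cost used in this series: 1/2 * (y - y_hat)^2
    return 0.5 * (y - sigmoid(z)) ** 2

def cross_entropy(z, y):
    # Logistic cost (assumed to be the one in the Coursera assignment)
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def numeric_slope(cost, z, y, h=1e-6):
    # dJ/dz(3) by central finite difference
    return (cost(z + h, y) - cost(z - h, y)) / (2 * h)

z, y = 0.3, 1.0
y_hat = sigmoid(z)
f_prime = y_hat * (1 - y_hat)

delta_squared_error = -(y - y_hat) * f_prime   # keeps f'(z)
delta_cross_entropy = y_hat - y                # f'(z) has cancelled

print(delta_squared_error, numeric_slope(squared_error, z, y))
print(delta_cross_entropy, numeric_slope(cross_entropy, z, y))
```

Multiplying by g'(z(3)) in a cross-entropy setup applies the factor twice, which would explain why it "didn't work".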

Freak (1 month ago)

I have trouble understanding where the operation at 4:45 comes from.
Why does d(3).a(2) become transpose(a(2)).d(3)?
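
One way to see the transpose: for each training example, the gradient with respect to W(2) is that example's hidden activations scaled by its output error, and the total gradient is the sum over examples. Multiplying transpose(a(2)) by d(3) performs exactly that sum. A minimal sketch with random placeholder values (the shapes follow the 3-example, 3-hidden-unit, 1-output network discussed here):

```python
import numpy as np

rng = np.random.default_rng(0)
a2 = rng.random((3, 3))       # rows = examples, columns = hidden units
delta3 = rng.random((3, 1))   # one output error per example

# Add up the per-example gradients one at a time
dJdW2_loop = np.zeros((3, 1))
for i in range(a2.shape[0]):
    dJdW2_loop += np.outer(a2[i], delta3[i])

# The transpose does the same multiply-and-sum in one step
dJdW2_matrix = a2.T @ delta3

print(np.allclose(dJdW2_loop, dJdW2_matrix))
```

The result has the same shape as W(2), which is also why the order of the product has to change: (3x1)(3x3) does not line up, while (3x3)(3x1) does.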

Otman faatih (1 month ago)

Why W2^T and not W2 at the derivation of J by W1?
And thanks to anyone who answers my question.

Freak (1 month ago)

That's what I'm wondering too.

Timothy Wamalwa (1 month ago)

I still get lost when it gets to the part about differentiating the sigmoid function. What rule have you applied? Quotient, reciprocal?
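
Either rule works. Writing the sigmoid as f(z) = (1 + e^-z)^-1 and applying the power rule together with the chain rule gives f'(z) = -(1 + e^-z)^-2 * (-e^-z) = e^-z / (1 + e^-z)^2, which can be rewritten as f(z)(1 - f(z)). The sketch below compares both forms against a finite-difference slope; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime_chain_rule(z):
    # d/dz (1 + e^-z)^-1 = -(1 + e^-z)^-2 * (-e^-z)
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

def sigmoid_prime_compact(z):
    # Equivalent form: f(z) * (1 - f(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_slope(z, h=1e-6):
    return (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

checks = [
    (sigmoid_prime_chain_rule(z), sigmoid_prime_compact(z), numeric_slope(z))
    for z in (-2.0, 0.0, 0.5, 3.0)
]
for row in checks:
    print(row)
```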

lord Lancelot name (1 month ago)

I watched this a 100 times for 1 hour at 2x 😀

lord Lancelot name (1 month ago)

Best video I have seen on backprop

JGregs (1 month ago)

Why dZ3dW2 not a 3x1 matrix of the hiddenLayer activations? How did you end up with a 3x3?

James Gregorie (1 month ago)

If our output was more than 1, would we use matrix multiplication instead of element-wise multiplication for calculating delta3?
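
With more than one output, the step for delta3 stays element-wise: each output unit of each example has its own error and its own f'(z(3)) value, so delta3 simply gains a column per output. The matrix product is still used only for transpose(a(2)) times delta3. A small sketch, assuming sigmoid outputs and a squared-error cost summed over all outputs; the sizes and random values are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_examples, n_hidden, n_outputs = 3, 3, 2
rng = np.random.default_rng(0)
a2 = rng.random((n_examples, n_hidden))
W2 = rng.standard_normal((n_hidden, n_outputs))
y = rng.random((n_examples, n_outputs))

z3 = a2 @ W2
y_hat = sigmoid(z3)

# Element-wise: one error per (example, output) pair
delta3 = -(y - y_hat) * y_hat * (1 - y_hat)
# Matrix product: sums over examples, result matches W2's shape
dJdW2 = a2.T @ delta3

print(delta3.shape, dJdW2.shape)
```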

Edmund52 (1 month ago)

At the end you compare cost1 to cost2 (showing it increases) and compare cost2 to cost3 (showing that it decreases). Should you not compare cost1 to cost3 since cost1 is the original? Im just learning this so I could be completely wrong, just need it explaining to me please.
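
A sketch of what the three costs show, under the assumption that the demo first moves the weights along the gradient (uphill) and then against it (downhill). Here both steps start from the same weights, so cost3 can be compared directly with cost1. The input values, weights, and step size are illustrative, not taken from the video's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, W1, W2):
    y_hat = sigmoid(sigmoid(X @ W1) @ W2)
    return 0.5 * np.sum((y - y_hat) ** 2)

def cost_prime(X, y, W1, W2):
    a2 = sigmoid(X @ W1)
    y_hat = sigmoid(a2 @ W2)
    delta3 = -(y - y_hat) * y_hat * (1 - y_hat)
    delta2 = (delta3 @ W2.T) * a2 * (1 - a2)
    return X.T @ delta2, a2.T @ delta3

# Assumed sample data, scaled to the 0..1 range
X = np.array([[3, 5], [5, 1], [10, 2]], dtype=float)
X = X / X.max(axis=0)
y = np.array([[75], [82], [93]], dtype=float) / 100

# Fixed illustrative weights so the run is repeatable
W1 = np.array([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]])
W2 = np.array([[0.7], [-0.8], [0.9]])

step = 0.01
dJdW1, dJdW2 = cost_prime(X, y, W1, W2)

cost1 = cost(X, y, W1, W2)                                # starting point
cost2 = cost(X, y, W1 + step * dJdW1, W2 + step * dJdW2)  # along the gradient
cost3 = cost(X, y, W1 - step * dJdW1, W2 - step * dJdW2)  # against the gradient

print(cost1, cost2, cost3)
```

The gradient points in the direction of increasing cost, so adding it makes things worse and subtracting it makes things better. That is also why flipping a sign inside the gradient and flipping the sign of the update have the same effect.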

Beautiful Poetry Love and Life (2 months ago)

Where do you guys learn all these... Are you genius or I am a dummy ... Not able to understand these equations at all.... :( :( .....

Boleiragem (2 months ago)

Where did you code the costFunction? I've watched all the videos, however I didn't find the code for this part.

Shuyi Teng (2 months ago)

By far the most clear and concise explanation (mathematically) about back propagation!

InsiderMiner (3 months ago)

anyone know for sure what is meant at 4:45 by the notation he uses for the 3x3 matrix with elements noted like asuper(2)sub(2,3)? super(2) means from layer 2 but what are the subscripts? I did not find this defined in the video. I know that one of them is the node # and the other example #, and in fact we can deduce when he takes the transpose and moves this matrix to the left side of the expression that the subscript order is example,node , but does he define this anywhere?

Apoorv Patne (3 months ago)

Can anybody help with me the relation at 3:42 ? What's up with a1 and why is it the slope?

Porter Gardiner (4 months ago)

lol just got out of algebra II barely understand any of this

Luis M (4 months ago)

Well explained! Thanks!

Welch Labs (4 months ago)

Thanks for watching!

Sergio Silva (4 months ago)

Great explanation! Thanks!

Toru Matsuki (4 months ago)

Hi guys. Near the end, when doing dz(2)/dW(1), doesn't this derivative have to have the same shape as W(1)? But W(1) is a 2x3 I believe. He says the result is X or X transpose. X is in fact also 3x2. But that seems to be a coincidence. X changes per number of examples. In this case he used 3 examples but it could have been 4 examples. Then X would be 4x2. What am I missing here?
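
The matching size of X here is indeed a coincidence. X is only one factor of the gradient: the full expression is transpose(X) times delta2, and the number of examples is the dimension that gets summed away in that product. A quick shape check with placeholder arrays of ones:

```python
import numpy as np

n_inputs, n_hidden = 2, 3
shapes = {}
for n_examples in (3, 4, 100):
    X = np.ones((n_examples, n_inputs))
    delta2 = np.ones((n_examples, n_hidden))
    # (n_inputs, n_examples) @ (n_examples, n_hidden) -> (n_inputs, n_hidden)
    dJdW1 = X.T @ delta2
    shapes[n_examples] = dJdW1.shape

print(shapes)
```

Whatever the number of examples, the result is 2x3, the same shape as W(1).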

Denis Candido (4 months ago)

I saw this video at the beginning of my ML studies one year ago, and only now are backpropagation's derivatives making sense in my head. Still far from 100% understanding when it comes to LSTMs tho...

Scorpion (5 months ago)

The happy music is mocking my ignorance and inability to understand this

Vipul Petkar (5 months ago)

Im completely lost between 6:00 and 7:00

sam dalton (5 months ago)

As a non calculus student I was doing ok...until I got to backpropagation :/

Mike Lezhnin (5 months ago)

ffs, I was following (copying) the python code in this series, and here I got an error of some dimensions of matrices not fitting, I was like "mb the author is wrong", so I started looking for the stuff, and after an hour of searching I found, that in my code
y = np.array(([75,82,93]), dtype = float)
and here
y = np.array(([[75],[82],[93]]), dtype = float)
so my y is actually transposed... damn
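
The two definitions differ in shape, not just orientation: the first is a 1-D array of shape (3,), the second a 3x1 column. NumPy broadcasting then turns an innocent subtraction into a 3x3 matrix, which is where dimension errors like this tend to come from. A short sketch (the zero-filled `y_hat` stands in for a network output):

```python
import numpy as np

y_flat = np.array([75, 82, 93], dtype=float)        # shape (3,)
y_col = np.array([[75], [82], [93]], dtype=float)   # shape (3, 1)
y_hat = np.zeros((3, 1))                            # placeholder prediction

diff_flat = y_flat - y_hat   # broadcasts to shape (3, 3)
diff_col = y_col - y_hat     # stays (3, 1)

# One way to repair the flat version
y_fixed = y_flat.reshape(-1, 1)

print(diff_flat.shape, diff_col.shape, y_fixed.shape)
```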

Amir Ghamarian (5 months ago)

Respect!

Armando Mendivil (5 months ago)

I think would be W^(2) = [W11^(2), W12^(2), W13^(2) ] instead of W^(2) = [W11^(2), W21^(2), W13^(2) ] https://youtu.be/GlcnxUlrtek?t=27s

Bryan Lee Williams (5 months ago)

This was extremely complicated and too hard to understand at the speed you went.

Welch Labs (5 months ago)

Yeah, this video definitely crams a lot of calc into a short time period.

Haim Rubinshtein (5 months ago)

Beautiful

Welch Labs (5 months ago)

Thanks for watching!

Zoronoa01 (5 months ago)

Very good content please keep up the good work

Michiel Albracht (5 months ago)

Really, one of the best video series about this topic!

John Bond (5 months ago)

It's the best video about backpropagation i've ever seen. Thank you very much! :)

Welch Labs (5 months ago)

Thanks for watching!

UncommonReality (5 months ago)

While I do enjoy this a lot, it'd be of great help if you were a little more rigorous with your notation. At approximately 7:09 I tried to write my code using the equations, but the equations don't specify which products are matrix products and which are element-wise multiplications. I understand that many of your viewers might know Python, but for those of us who don't (I'm using your videos to code in MATLAB), some notation (even if not rigorously mathematical) to distinguish the products would prevent unnecessary inconveniences. Great videos btw, the best series on neural nets I have watched yet.
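
For readers with the same question, here is a sketch of the gradient computation with every product labelled. In NumPy, `@` is a matrix product and `*` is element-wise; in MATLAB the corresponding operators are `*` and `.*`. This is a reconstruction from the equations discussed in these comments, not the repository's code, and the input values and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(X, W1, W2):
    z2 = X @ W1           # matrix product: (n, 2) @ (2, 3) -> (n, 3)
    a2 = sigmoid(z2)      # element-wise
    z3 = a2 @ W2          # matrix product: (n, 3) @ (3, 1) -> (n, 1)
    y_hat = sigmoid(z3)   # element-wise
    return z2, a2, z3, y_hat

def cost(X, y, W1, W2):
    y_hat = forward(X, W1, W2)[3]
    return 0.5 * np.sum((y - y_hat) ** 2)

def cost_prime(X, y, W1, W2):
    z2, a2, z3, y_hat = forward(X, W1, W2)
    delta3 = -(y - y_hat) * sigmoid_prime(z3)      # element-wise -> (n, 1)
    dJdW2 = a2.T @ delta3                          # matrix product -> (3, 1)
    delta2 = (delta3 @ W2.T) * sigmoid_prime(z2)   # matrix, then element-wise -> (n, 3)
    dJdW1 = X.T @ delta2                           # matrix product -> (2, 3)
    return dJdW1, dJdW2

# Assumed sample data and fixed illustrative weights
X = np.array([[3, 5], [5, 1], [10, 2]], dtype=float)
X = X / X.max(axis=0)
y = np.array([[75], [82], [93]], dtype=float) / 100
W1 = np.array([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]])
W2 = np.array([[0.7], [-0.8], [0.9]])

dJdW1, dJdW2 = cost_prime(X, y, W1, W2)

# Finite-difference check on a single weight
h = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += h
W1_minus[0, 0] -= h
numeric = (cost(X, y, W1_plus, W2) - cost(X, y, W1_minus, W2)) / (2 * h)

print(dJdW1[0, 0], numeric)
```

A useful rule of thumb: the products that multiply by f'(z) are element-wise, and the products that involve a transposed matrix are matrix products.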

UncommonReality (5 months ago)

Btw I'm seeing a lot of complaints about the maths, but if it wasn't for your step by step deduction I'd have never found my mistake.

Pro Odermonicon (5 months ago)

4:47
Who else saw that arrow and thought it was a smiley face at first

John Meighan (6 months ago)

Why is it a dot product between self.a2.T and Delta3? Isn't a2 a matrix and delta3 a column vector?

Carl Oliva (6 months ago)

I dont understand

Denis Timchuk (6 months ago)

After this lesson 99% NS learners give up))It's really hard for me at first time...

2002budokan (6 months ago)

Why you struggle to squeeze a 70 mins subject into 7 mins video?

Eyan Towni (6 months ago)

3Blue1Brown has a great series of Neural Networks but this series by Welch Labs not only explain you the stuff but also shows you how it's coded. However, I wish the maths behind this was done more slowly so people didn't have to pause the video 100 times in just a minute. I guess doing complex maths in real time is much easier to understand than fast paced animations. No wonder why I always fail to grab detailed stuff presented in most of minutephysics videos.

陳郁夫 (6 months ago)

awesome

rampage14x13 (6 months ago)

Song name please?

oki lol (6 months ago)

you don't need this math for a neural network, wtf

chaswards (7 months ago)

what is this sorcery?!

Eroid (7 months ago)

4:23, Um I think I've found a mistake. -(y-yhat) should not turn to (-y-yhat) but (-y+yhat) right?

Jacob Freeze (7 months ago)

The background music is annoying.

Yunfei Chen (7 months ago)

this is starting to turn into a calculus lecture lolz!!!!

Mathu D (8 months ago)

Fantastic explanation. Step.by step.. but the music is very distracting especially while trying to understand this High level math..

Larry Key (8 months ago)

159 ppl didn't make it to calculus 4

himansu odedra (8 months ago)

hi at 7.20 we now have cost 1 and cost 2. what is the difference between the two? could someone provide some clarity on this and why this is printed out?

Ayush srivastava (8 months ago)

why you multiplied dj/dw with a scalar ?

wat da fock (9 months ago)

fuck the music

WahranRai (9 months ago)

Shitty animation !!!!!!!

Nijat Shukurov (9 months ago)

Please, someone write the cost function; I could not find it

♫♪Ludwig van Beethoven♪♫ (9 months ago)

My brain farted after first 30 seconds then decided to go and read comments

Paul Graffan (9 months ago)

at 4:41 shouldn't a(2) be a 2x3 matrix rather than 3x3 ? If not I can't get my mind around what is the third row corresponding to ? Any one ? thanks

Andre Rocha (9 months ago)

If it were 3 Layers, would it be:
dj/dw³ = transpose(a³).delta⁴;
dj/dw² = transpose(a³).delta⁴ .transpose(w³).f'(z³)
dj/dw³ = transpose(x).transpose(w²).transpose(w³).delta⁴.f'(z³).f'(z²)

Ryan Holmes (9 months ago)

Hi! I really love this video series! Good work at conveying a complex problem simply with creative graphics and descriptions. In 7:07 you mention building deeper neural networks by repeating something. What exactly are we repeating?

Welch Labs (9 months ago)

Good question! Instead of one hidden layer, we repeat multiple hidden layers - making a deep network.

Steven Hof (9 months ago)

Stephen, please don't listen to the haters. Your videos are as enjoyable as they are thorough and by far the best neural net tutorials I've come across. I'm so thankful to have stumbled across your content. Please keep the videos coming. I'd definitely pay for this level of instruction

Welch Labs (9 months ago)

Thanks for watching!

Steven Hof (9 months ago)

Why oh why did I not find this earlier than the night before my Data Mining midterm!!!!!

Divyanshu Gupta (10 months ago)

function [pred,t1,t2,t3,a1,a2,a3,b1,b2,b3] = grDnn(X,y,fX,f2,f3,K)
%neural network with 2 hidden layers
%t1,t2,t3 are thetas for every layer and b1,b2,b3 are biases
n = size(X,1);
Delta1 = zeros(fX,f2);
Db1 = zeros(1,f2);
Delta2 = zeros(f2,f3);
Db2 = zeros(1,f3);
Delta3 = zeros(f3,K);
Db3 = zeros(1,K);
t1 = rand(fX,f2)*(2*.01) - .01;
t2 = rand(f2,f3)*(2*.01) - .01;
t3 = rand(f3,K)*(2*.01) - .01;
pred = zeros(n,K);
b1 = ones(1,f2);
b2 = ones(1,f3);
b3 = ones(1,K);
%Forward Propagation
wb = waitbar(0,'Iterating...');
for o = 1:2
    for i = 1:n
        waitbar(i/n);
        a1 = X(i,:);
        z2 = a1*t1 + b1;
        a2 = (1 + exp(-z2)).^(-1);
        z3 = a2*t2 + b2;
        a3 = (1 + exp(-z3)).^(-1);
        z4 = a3*t3 + b3;
        pred(i,:) = (1 + exp(-z4)).^(-1);
        %Backward Propagation
        d4 = (pred(i,:) - y(i,:));
        d3 = ((d4)*(t3')).*(a3.*(1-a3));
        d2 = ((d3)*(t2')).*(a2.*(1-a2));
        Delta1 = Delta1 + (a1')*d2;
        Db1 = Db1 + d2;
        Delta2 = Delta2 + (a2')*d3;
        Db2 = Db2 + d3;
        Delta3 = Delta3 + (a3')*d4;
        Db3 = Db3 + d4;
        for l = 1:100
            t1 = t1*(.999) - .001*(Delta1/n);
            b1 = b1*(.999) - .001*(Db1/n);
            t2 = t2*(.999) - .001*(Delta2/n);
            b2 = b2*(.999) - .001*(Db2/n);
            t3 = t3*(.999) - .001*(Delta3/n);
            b3 = b3*(.999) - .001*(Db3/n);
        end
    end
end
delete(wb);
end
%I can't seem to understand the fault, is it the matrix multiplication, because the code does run successfully but when I test the t1,t2,t3 with some testing examples, the prediction for all examples are exactly same and are equal to the predicted vector for the last trained example.
% please help, i am stuck here for over a month now, thanks!!

Erich König (10 months ago)

Super cool video ! I love it how carefully you explain things! Best NN videos ever!

polic72andDrD3ath (10 months ago)

4:28 Why is the first operation to scalar multiply rather than matrix multiply? I see that it works in the end, but what is the mathematical reason for it? especially when matrix multiplication is the next step.

Yusuf Hafeez (10 months ago)

I'm 16 and I've been fascinated by machine learning and next year in school as part of our exam we need to use Python to code a project. I was thinking of doing something with machine learning but holy crap this stuff is hard. 90% of it went straight through me. If I wanted to code a neural network, would I need to know how all of this works?

Yusuf Hafeez (10 months ago)

Welch Labs thanks for letting me know, I would some day in the future want to know what all this means.

Welch Labs (10 months ago)

Awesome! That's ok - you don't have to get all of this at all - a library like Tensorflow will take care of most of this for you! Best of luck!

Timothy Wamalwa (10 months ago)

Nice but at some point the math was brutal

Roland Harangozo (10 months ago)

I've implemented it according to the explanation in python but somehow it does not work as expected. Does anybody have any idea where it is incorrect?
https://github.com/rharangozo/machine-learning/blob/master/ann/ann.py

Vladimir Bobyr (10 months ago)

I have done the calculations myself and arrived at the same answer when no bias is used. When I try to do the same calculation with an added bias, the dimensions for my first weight do not match with the calculated gradient (off by one). How would this equation need to be adjusted for bias?
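
One common way to add a bias without disturbing the weight shapes is to keep it as a separate row vector per layer. The weight gradients stay exactly as before, and each bias gradient is the corresponding delta summed over the examples. An "off by one" mismatch typically appears when the bias is instead folded into W(1) as an extra row while the gradient is still computed from X without the matching column of ones. The sketch below assumes the separate-vector approach; all values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    a2 = sigmoid(X @ W1 + b1)       # b1 is broadcast across examples
    y_hat = sigmoid(a2 @ W2 + b2)
    return a2, y_hat

def cost(X, y, W1, b1, W2, b2):
    return 0.5 * np.sum((y - forward(X, W1, b1, W2, b2)[1]) ** 2)

def gradients(X, y, W1, b1, W2, b2):
    a2, y_hat = forward(X, W1, b1, W2, b2)
    delta3 = -(y - y_hat) * y_hat * (1 - y_hat)
    delta2 = (delta3 @ W2.T) * a2 * (1 - a2)
    dJdW2 = a2.T @ delta3
    dJdb2 = delta3.sum(axis=0, keepdims=True)   # sum over examples -> (1, 1)
    dJdW1 = X.T @ delta2
    dJdb1 = delta2.sum(axis=0, keepdims=True)   # sum over examples -> (1, 3)
    return dJdW1, dJdb1, dJdW2, dJdb2

X = np.array([[0.3, 1.0], [0.5, 0.2], [1.0, 0.4]])
y = np.array([[0.75], [0.82], [0.93]])
W1 = np.array([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]])
b1 = np.array([[0.1, -0.1, 0.2]])
W2 = np.array([[0.7], [-0.8], [0.9]])
b2 = np.array([[0.05]])

dJdW1, dJdb1, dJdW2, dJdb2 = gradients(X, y, W1, b1, W2, b2)

# Finite-difference check on one bias entry
h = 1e-5
b1_plus, b1_minus = b1.copy(), b1.copy()
b1_plus[0, 1] += h
b1_minus[0, 1] -= h
numeric = (cost(X, y, W1, b1_plus, W2, b2) - cost(X, y, W1, b1_minus, W2, b2)) / (2 * h)

print(dJdb1[0, 1], numeric)
```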

Minh Pham (11 months ago)

Yes, I am fascinated by how easy you make it look, yet still understandable from a visual perspective. Keep up the good work.

minsoo kim (11 months ago)

haha..... I'm undergraduate junior and this is too hard for me. But I'll do my best.

Jose Kj (11 months ago)

sir,which is the editor used here

Welch Labs (11 months ago)

Adobe Premiere - thanks for watching!

Mayur Kulkarni (11 months ago)

While I took some time digesting it, I believe I can finally understand it. And trust me this is the best resource on internet to understand backprop for a beginner with only rudimentary understanding of calculus. There's a welter of resources on Internet, most of which are so full of notations it's easy to get lost. Thanks for the upload, it took me a whole day but I could finally derive the backprop equations by myself. For those who're struggling, pause the video and try to derive the equation by yourself it ain't that hard. Cheers!

Welch Labs (11 months ago)

Awesome, thanks for watching!

Tamal Das (11 months ago)

while computing dj_dw2 how the matrix a2 became transpose of a2?

Dappernaut (11 months ago)

1:02, that's not the right derivative

Sharpchain_1 (11 months ago)

When i try to run the costPrime function i get this error ValueError: shapes (1,1) and (3,2) not aligned: 1 (dim 1) != 3 (dim 0)

alexander bergkvist (1 year ago)

You are an excellent teacher ^^ Thanks for doing this!

Cankan Kanli (1 year ago)

What are the input weights to get better prediction ?? Plz helpp project tomorrow !! Just need the first weights before hidden layer !!. :/

willie (1 year ago)

after thinking about this for one hour, i was able to understand it

Welch Labs (1 year ago)

Awesome!

Bhavesh Suhagia (1 year ago)

Can someone please explain why we transpose W_2 when calculating d_2. I understand why we do a_2.T or X.T but having difficulty understanding the need for W_2.T
d_3 * W_2.T will give us a 3x3 matrix since d_3 is [3x1] and w_2.T is [1x3]. And all the math works out but looking for some intuition.
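
Some intuition, as a sketch: since z(3) = a(2) W(2), the sensitivity of the output to hidden unit j is simply the weight W(2)_j. So the error passed back to hidden unit j for example i is d(3)_i times W(2)_j. That table of products (examples down the rows, hidden units across the columns) is an outer product, and the transpose is only what makes a 3x1 times 1x3 multiplication produce it. The numbers below are made up for illustration.

```python
import numpy as np

delta3 = np.array([[0.1], [-0.2], [0.3]])   # one output error per example
W2 = np.array([[0.5], [-1.0], [2.0]])       # one weight per hidden unit

# Matrix form used in the video: (3, 1) @ (1, 3) -> (3, 3)
backpropagated = delta3 @ W2.T

# Same thing written entry by entry: error_i * weight_j
by_hand = np.array(
    [[delta3[i, 0] * W2[j, 0] for j in range(3)] for i in range(3)]
)

print(backpropagated)
```

Each entry is then multiplied element-wise by f'(z(2)) to give d(2).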

Daniel Boyle (1 year ago)

This is by far the best explanation I have come across. It helped me to thoroughly understand the math behind backpropagation! Thank you

Christopher Mislan (1 year ago)

Yo I'm in trouble for my test man.

Rongzmee (1 year ago)

You are so amazing and talented teacher. These really "demystified" nn for me. Plz I'm dying to see more of these demystify tutorials of other topics! Plz don't stop making these awesome videos!

Welch Labs (1 year ago)

Thanks for watching!

test (1 year ago)

here's a more useful example/step-by-step explanation.
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Aniekan Umoren (1 year ago)

AFTER REWATCHING THIS DOZENS OF TIMES, I'VE FINALLY GOT IT!! Thank-you soo much +Welch Labs for your amazing video

Welch Labs (1 year ago)

Wow, awesome!

Sirius Black (1 year ago)

"neural networks demystified", where all of the comments say this video is incomprehensible
as for me, i was lost by 2:20, when you randomly turn the derivative into not a derivative, then bring in the chain rule on what is essentially a single term function with two variables where you simply assert one of those variables is 0 and then you add the second variable's derivative
also the dy/dx notation is incredibly confusing to me because it randomly keeps disappearing and doesn't appear to be actually used in calculating it ever
oh, and of course y and yhat aren't numbers, they're vectors

Michael Cornman (1 year ago)

Thank you for making such good quality videos. I recreated a similar code myself, but I was noticing that my gradients were driving my cost up. I might have missed something, but I fixed it by taking out the negative sign in delta3. For anyone with a similar problem, look into that.

Andre Rocha (1 year ago)

If it were 3 Layers, would it be:
dj/dw³ = transpose(a³).delta⁴;
dj/dw² = transpose(a³).delta⁴ .transpose(w³).f'(z³)
dj/dw³ = transpose(x).transpose(w²).transpose(w³).delta⁴.f'(z³).f'(z²)
I think this would be it, but it's kind of a head-scratcher this one.
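
For a network with three weight matrices the pattern is the same recursion applied once more. In this comment's notation: delta⁴ is the output error, dJ/dW³ = transpose(a³).delta⁴, then delta³ = (delta⁴.transpose(W³)) times f'(z³) element-wise, dJ/dW² = transpose(a²).delta³, then delta² = (delta³.transpose(W²)) times f'(z²) element-wise, and dJ/dW¹ = transpose(X).delta². Each gradient uses the activations feeding into that weight matrix. The sketch below writes this as a loop over any number of layers and checks it against finite differences; it assumes sigmoid activations throughout and a squared-error cost, and the layer sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights):
    # activations[0] is the input; activations[k + 1] is the output of layer k
    activations = [X]
    for W in weights:
        activations.append(sigmoid(activations[-1] @ W))
    return activations

def cost(X, y, weights):
    return 0.5 * np.sum((y - forward(X, weights)[-1]) ** 2)

def gradients(X, y, weights):
    activations = forward(X, weights)
    y_hat = activations[-1]
    delta = -(y - y_hat) * y_hat * (1 - y_hat)   # output-layer error
    grads = [None] * len(weights)
    for layer in reversed(range(len(weights))):
        # Gradient uses the activations that feed into this weight matrix
        grads[layer] = activations[layer].T @ delta
        if layer > 0:
            a = activations[layer]
            # Pass the error back one layer: matrix product, then element-wise
            delta = (delta @ weights[layer].T) * a * (1 - a)
    return grads

rng = np.random.default_rng(0)
X = rng.random((3, 2))
y = rng.random((3, 1))
weights = [
    rng.standard_normal((2, 3)),
    rng.standard_normal((3, 3)),
    rng.standard_normal((3, 1)),
]

grads = gradients(X, y, weights)

# Finite-difference check on the first entry of every weight matrix
h = 1e-5
errors = []
for k in range(len(weights)):
    plus = [W.copy() for W in weights]
    minus = [W.copy() for W in weights]
    plus[k][0, 0] += h
    minus[k][0, 0] -= h
    numeric = (cost(X, y, plus) - cost(X, y, minus)) / (2 * h)
    errors.append(abs(numeric - grads[k][0, 0]))

print(errors)
```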

Andre Rocha (1 year ago)

It's a question by the way. xD

Ze Cheng (1 year ago)

Thanks for this nice video. However, I think there is an error at 4:43. a(2) is actually a column vector [a(2)_1 a(2)_2 a(2)_3]^T duplicated three time, i.e., [[a(2)_1 a(2)_2 a(2)_3]^T, [a(2)_1 a(2)_2 a(2)_3]^T, [a(2)_1 a(2)_2 a(2)_3]^T]

TB6943 (1 year ago)

Thank you for the video but using papers to explain and puting a music don't really help!

Olga R. I. (1 year ago)

Such music makes it all look like very dramatic

cooolway (1 year ago)

I honestly thought I was getting neural networks until this video mystified it for me.

larsalee (1 year ago)

Nice video. I know calculus from earlier, this video made the math and intuition of back-propagation understandable for me. Thanks for a great video.

Solanich (1 year ago)

at 4:45 why do you transpose and move the a matrix?

Dániel Vásárhelyi (1 year ago)

why is 'a' chosen as slope at 3:44? why isn't it 'Wij', like around 6:15

Robert Krutsch (1 year ago)

this is like the most fucked up video I have seen. Why the fuck don't you write and explain like a normal human without bullshit snippets..

Rolysent Paredes (1 year ago)

Hello, I just would like to know if you accept ANN project development. Thank you....

l ryu (1 year ago)

"Now what's cool here..." Duuuuuuuude what?

Darby Kidwell (1 year ago)

I had a really hard time understanding how the backpropagation was calculated in this video (or really any resource that I could find), so I thought I'd post a comment that may help others understand it better:
I felt like the transpose trick is kind of a cheat that the video did. You can't really take the derivative of the matrix multiplication like he did (or at least not that I'm aware of). If you actually work out the derivative of the cost function, you'll notice that the fully worked out derivative cost function happens to be mathematically equivalent to the equation he mentions (with the transpose).
For example, take the J = SUM(1/2 (y - y^)^2). Assume you have one equation. Since we have one equation, we can say that our J = 1/2 (y-y^)^2
Now start plugging in values for y^. So, for example, you might see that J = 1/2 ( y - f(w(2)_1 * a(2)_1 + w(2)_2 * a(2)_2 + w(2)_3 * a(2)_3 ) )^2. Here our function f(x) is our sigmoid function, but it could be any activation function you choose. For dJ/dW(2) this is as far down as you need to go.
Now, take the derivative of the equation I mentioned with respect to w(2)_1, w(2)_2, w(2)_3.
For dJ/dW(2)_1, you can say this = (y^-y) * f'(z(3)) * a(2)_1
For dJ/dW(2)_2, you can say this = (y^-y) * f'(z(3)) * a(2)_2
For dJ/dW(2)_3, you can say this = (y^-y) * f'(z(3)) * a(2)_3
Notice for, d(3), for one equation, it is a 1x1 matrix and a(2) is a 1x3 matrix. Multiplying a(2)^T * d(3) gives us the above result. Therefore you can see he simply did a short cut in the math.
You can now expand out the above concept to dJ/dW(1) or you could add multiple equations and prove that the transpose equation holds for many equations and many weight layers.

Ani H. (1 year ago)

OMG ... LITERALLY THE BEST EXPLANATION... LOVED THIS PLAYLIST ... BETTER THAN ANDREW NGs COURSERA COURSE ... :P

