Backpropagation as simple as possible, but no simpler. Perhaps the most misunderstood part of neural networks, backpropagation of errors is the key step that allows ANNs to learn. In this video, I give the derivation and thought process behind backpropagation using high-school-level calculus.
Supporting Code and Equations:
In this series, we will build and train a complete Artificial Neural Network in Python. New videos every other Friday.
Part 1: Data + Architecture
Part 2: Forward Propagation
Part 3: Gradient Descent
Part 4: Backpropagation
Part 5: Numerical Gradient Checking
Part 6: Training
Part 7: Overfitting, Testing, and Regularization
At 4:30, you said that delta(3) = -(y - y^) * f'(z(3)). But according to Andrew's ML course on Coursera, delta(3) should be equal to a(3) - y, which is -(y - y^); there's no f'(z(3)) term since L = 3. I did do the programming assignment for week 5, but when I multiplied in the g'(z(3)) term, it didn't work. Please explain this to me or correct me if I misunderstood, thanks!
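A possible explanation (an assumption about the two setups, not something stated in the video): the difference comes from the cost function. Andrew's course uses the cross-entropy (logistic) cost, while this series uses squared error; with a sigmoid output, f'(z) = f(z)(1 - f(z)), so the cross-entropy cost cancels the f'(z(3)) factor and the squared-error cost does not:

\delta^{(3)} = \frac{\partial}{\partial z^{(3)}}\,\tfrac{1}{2}(y-\hat{y})^2 = -(y-\hat{y})\,f'(z^{(3)}) \quad \text{(squared error, this series)}

\frac{\partial}{\partial \hat{y}}\Big[-y\log\hat{y}-(1-y)\log(1-\hat{y})\Big] = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}, \qquad f'(z^{(3)}) = \hat{y}(1-\hat{y}) \;\Rightarrow\; \delta^{(3)} = \hat{y}-y = a^{(3)}-y \quad \text{(cross-entropy)}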
At the end you compare cost1 to cost2 (showing that it increases) and compare cost2 to cost3 (showing that it decreases). Shouldn't you compare cost1 to cost3, since cost1 is the original? I'm just learning this so I could be completely wrong; I just need it explained to me, please.
Anyone know for sure what is meant at 4:45 by the notation he uses for the 3x3 matrix with elements written like a^(2)_(2,3)? The superscript (2) means "from layer 2", but what are the subscripts? I did not find this defined in the video. I know that one of them is the node number and the other the example number, and in fact we can deduce, when he takes the transpose and moves this matrix to the left side of the expression, that the subscript order is (example, node); but does he define this anywhere?
Hi guys. Near the end, when computing dz(2)/dW(1), doesn't this derivative have to have the same shape as W(1)? But W(1) is 2x3, I believe. He says the result is X, or X transposed. X happens to be 3x2 as well, but that seems to be a coincidence: X changes with the number of examples. In this case he used 3 examples, but it could have been 4; then X would be 4x2. What am I missing here?
FFS, I was following (copying) the Python code in this series, and here I got an error about the dimensions of some matrices not matching. I thought "maybe the author is wrong", so I started looking into it, and after an hour of searching I found that my code had
y = np.array(([75,82,93]), dtype = float)
instead of
y = np.array(([75],[82],[93]), dtype = float)
so my y was actually transposed... damn
While I do enjoy this a lot, it would be a great help if you were a little more rigorous with your notation. At approximately 7:09 I tried to write my code from the equations, but the equations don't specify which products are matrix products and which are element-wise multiplications. I understand that many of your viewers might know Python, but for those of us who don't (I'm using your videos to code in MATLAB), some notation (even if not rigorously mathematical) to distinguish the products would prevent unnecessary inconvenience. Great videos btw, the best series on neural nets I have watched yet.
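For anyone stuck on the same point, here is a minimal sketch of the video's gradient equations with each product labelled. It assumes the setup from the earlier parts of the series (2 inputs, 3 hidden units, 1 output, sigmoid activations, squared-error cost); the toy data and variable names are illustrative, so check them against the actual supporting code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return np.exp(-z) / (1.0 + np.exp(-z))**2

# Toy data in the spirit of the series: 3 examples, 2 inputs, 1 output
X = np.array([[3., 5.], [5., 1.], [10., 2.]]) / 10.0
y = np.array([[75.], [82.], [93.]]) / 100.0
W1 = np.random.randn(2, 3)
W2 = np.random.randn(3, 1)

# Forward pass
z2 = np.dot(X, W1)      # matrix product: (3x2)(2x3) -> 3x3
a2 = sigmoid(z2)        # element-wise
z3 = np.dot(a2, W2)     # matrix product: (3x3)(3x1) -> 3x1
yHat = sigmoid(z3)      # element-wise

# Backpropagation: np.dot is a matrix product, * and np.multiply are element-wise
delta3 = np.multiply(-(y - yHat), sigmoid_prime(z3))   # element-wise, 3x1
dJdW2 = np.dot(a2.T, delta3)                           # matrix product, 3x1 (same shape as W2)
delta2 = np.dot(delta3, W2.T) * sigmoid_prime(z2)      # matrix product, then element-wise, 3x3
dJdW1 = np.dot(X.T, delta2)                            # matrix product, 2x3 (same shape as W1)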
3Blue1Brown has a great series on neural networks, but this series by Welch Labs not only explains the material, it also shows you how it's coded. However, I wish the maths were done more slowly, so people didn't have to pause the video 100 times in a single minute. I guess complex maths worked through in real time is much easier to understand than fast-paced animations; no wonder I always fail to grasp the detailed stuff presented in most minutephysics videos.
Hi! I really love this video series! Good work conveying a complex problem simply, with creative graphics and descriptions. At 7:07 you mention building deeper neural networks by repeating something. What exactly are we repeating?
Stephen, please don't listen to the haters. Your videos are as enjoyable as they are thorough and by far the best neural net tutorials I've come across. I'm so thankful to have stumbled across your content. Please keep the videos coming. I'd definitely pay for this level of instruction
function [pred,t1,t2,t3,a1,a2,a3,b1,b2,b3] = grDnn(X,y,fX,f2,f3,K)
% Neural network with 2 hidden layers.
% t1, t2, t3 are the thetas (weights) for each layer; b1, b2, b3 are the biases.
n = size(X,1);
% Gradient accumulators for the weights and biases
Delta1 = zeros(fX,f2);
Db1 = zeros(1,f2);
Delta2 = zeros(f2,f3);
Db2 = zeros(1,f3);
Delta3 = zeros(f3,K);
Db3 = zeros(1,K);
% Random weight initialisation in [-0.01, 0.01]; biases start at 1
t1 = rand(fX,f2)*(2*.01) - .01;
t2 = rand(f2,f3)*(2*.01) - .01;
t3 = rand(f3,K)*(2*.01) - .01;
pred = zeros(n,K);
b1 = ones(1,f2);
b2 = ones(1,f3);
b3 = ones(1,K);
wb = waitbar(0,'Iterating...');
for o = 1:2                              % passes over the training set
    for i = 1:n                          % loop over training examples
        % Forward pass (sigmoid activations)
        a1 = X(i,:);
        z2 = a1*t1 + b1;
        a2 = (1 + exp(-z2)).^(-1);
        z3 = a2*t2 + b2;
        a3 = (1 + exp(-z3)).^(-1);
        z4 = a3*t3 + b3;
        pred(i,:) = (1 + exp(-z4)).^(-1);
        % Backward pass: output error and hidden-layer deltas
        d4 = (pred(i,:) - y(i,:));
        d3 = ((d4)*(t3')).*(a3.*(1-a3));
        d2 = ((d3)*(t2')).*(a2.*(1-a2));
        % Accumulate the gradients
        Delta1 = Delta1 + (a1')*d2;
        Db1 = Db1 + d2;
        Delta2 = Delta2 + (a2')*d3;
        Db2 = Db2 + d3;
        Delta3 = Delta3 + (a3')*d4;
        Db3 = Db3 + d4;
        % Gradient-descent updates with weight decay
        for l = 1:100
            t1 = t1*(.999) - .001*(Delta1/n);
            b1 = b1*(.999) - .001*(Db1/n);
            t2 = t2*(.999) - .001*(Delta2/n);
            b2 = b2*(.999) - .001*(Db2/n);
            t3 = t3*(.999) - .001*(Delta3/n);
            b3 = b3*(.999) - .001*(Db3/n);
        end
    end
    waitbar(o/2,wb);
end
close(wb);
end
%I can't seem to understand the fault. Is it the matrix multiplication? The code does run successfully, but when I test t1, t2, t3 on some testing examples, the prediction for every example is exactly the same and equal to the predicted vector for the last trained example.
% Please help, I have been stuck here for over a month now, thanks!!
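One possible culprit (a guess; nothing in the thread confirms it): the 100-iteration update loop sits inside the per-example loop, the Delta/Db accumulators are never reset between passes, and the repeated 0.999 decay keeps shrinking the weights, which can push every prediction toward the same vector. A minimal sketch of the usual batch-gradient-descent structure, written in Python with made-up names (a single sigmoid layer, just to show the loop structure): accumulate over all examples, update once per pass, reset the accumulators.

import numpy as np

def train(X, y, W, b, epochs=1000, lr=0.001):
    """Toy sketch: accumulate gradients over ALL examples, then update ONCE per epoch."""
    n = X.shape[0]
    for _ in range(epochs):
        dW = np.zeros_like(W)          # reset the accumulators every epoch
        db = np.zeros_like(b)
        for i in range(n):             # accumulate over the whole training set
            a = 1.0 / (1.0 + np.exp(-(X[i:i+1] @ W + b)))
            d = a - y[i:i+1]           # output-layer delta
            dW += X[i:i+1].T @ d
            db += d
        W -= lr * dW / n               # one update per epoch, no extra inner loop
        b -= lr * db / n
    return W, b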
4:28 Why is the first operation a scalar (element-wise) multiply rather than a matrix multiply? I see that it works out in the end, but what is the mathematical reason for it, especially when matrix multiplication is the very next step?
I'm 16 and I've been fascinated by machine learning, and next year in school, as part of our exam, we need to use Python to code a project. I was thinking of doing something with machine learning, but holy crap this stuff is hard; 90% of it went straight over my head. If I wanted to code a neural network, would I need to know how all of this works?
I've implemented it according to the explanation in Python, but somehow it does not work as expected. Does anybody have any idea where it is incorrect?
I have done the calculations myself and arrived at the same answer when no bias is used. When I try to do the same calculation with an added bias, the dimensions of my first weight matrix do not match the calculated gradient (off by one). How would this equation need to be adjusted for bias?
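One common way to handle this (an assumption about the setup, since the calculation isn't shown): keep the bias out of the weight matrix and give it its own gradient, so dJ/dW(1) keeps the shape of W(1) and the bias gradient is just the deltas summed over examples. A small sketch with illustrative shapes (X and delta2 follow the 3-example network from the video; the bias b1 is an addition, not part of the series' network):

import numpy as np

# Shapes: X is (n x 2), delta2 is (n x 3), W1 is (2 x 3), b1 is (1 x 3)
n = 5
X = np.random.randn(n, 2)
delta2 = np.random.randn(n, 3)

dJdW1 = np.dot(X.T, delta2)                    # (2 x 3), same shape as W1
dJdb1 = np.sum(delta2, axis=0, keepdims=True)  # (1 x 3), same shape as b1; bias gradient kept separate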
While it took me some time to digest, I believe I can finally understand it. And trust me, this is the best resource on the Internet for a beginner with only a rudimentary understanding of calculus to understand backprop. There's a welter of resources on the Internet, most of which are so full of notation that it's easy to get lost. Thanks for the upload; it took me a whole day, but I could finally derive the backprop equations by myself. For those who're struggling, pause the video and try to derive the equations yourself; it ain't that hard. Cheers!
Can someone please explain why we transpose W_2 when calculating d_2? I understand why we do a_2.T or X.T, but I'm having difficulty understanding the need for W_2.T.
d_3 * W_2.T will give us a 3x3 matrix, since d_3 is [3x1] and W_2.T is [1x3]. All the math works out, but I'm looking for some intuition.
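One way to see it, sticking with those shapes (d_3 is 3x1, W_2 is 3x1, f'(z_2) is 3x3): multiplying by W_2.T routes each example's output error back to every hidden unit, weighted by that unit's connection to the output, which is the chain-rule factor dz_3/da_2. A small shape check with placeholder values:

import numpy as np

d3 = np.random.randn(3, 1)         # one output-error value per example
W2 = np.random.randn(3, 1)         # hidden (3) -> output (1) weights
fprime_z2 = np.random.randn(3, 3)  # stand-in for f'(z2): one value per example, per hidden unit

backprop_error = np.dot(d3, W2.T)  # (3x1)(1x3) -> 3x3: each example's error at each hidden unit
d2 = backprop_error * fprime_z2    # element-wise; still 3x3
print(d2.shape)                    # (3, 3)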
"neural networks demystified", where all of the comments say this video is incomprehensible
as for me, i was lost by 2:20, when you randomly turn the derivative into not a derivative, then bring in the chain rule on what is essentially a single term function with two variables where you simply assert one of those variables is 0 and then you add the second variable's derivative
also the dy/dx notation is incredibly confusing to me because it randomly keeps disappearing and doesn't appear to be actually used in calculating it ever
oh, and of course y and yhat aren't numbers, they're vectors
Thank you for making such good quality videos. I recreated similar code myself, but I noticed that my gradients were driving my cost up. I might have missed something, but I fixed it by taking out the negative sign in delta3. For anyone with a similar problem, look into that.
If it were 3 layers, would it be:
dJ/dW³ = transpose(a³) . delta⁴
dJ/dW² = transpose(a²) . delta³, where delta³ = (delta⁴ . transpose(W³)) .* f'(z³)
dJ/dW¹ = transpose(X) . delta², where delta² = (delta³ . transpose(W²)) .* f'(z²)
I think this would be it, but it's kind of a head-scratcher, this one.
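Here is a sketch of that three-weight-layer case in Python, extending the same pattern one more step per extra layer (random toy data and illustrative names, not the series' code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return np.exp(-z) / (1.0 + np.exp(-z))**2

# Toy shapes: 4 examples, 2 inputs, two hidden layers of 3 units, 1 output
X = np.random.rand(4, 2)
y = np.random.rand(4, 1)
W1, W2, W3 = np.random.randn(2, 3), np.random.randn(3, 3), np.random.randn(3, 1)

# Forward pass
z2 = np.dot(X, W1);  a2 = sigmoid(z2)
z3 = np.dot(a2, W2); a3 = sigmoid(z3)
z4 = np.dot(a3, W3); yHat = sigmoid(z4)

# Backward pass: the same delta/gradient pattern, repeated once more
delta4 = -(y - yHat) * sigmoid_prime(z4)
dJdW3 = np.dot(a3.T, delta4)
delta3 = np.dot(delta4, W3.T) * sigmoid_prime(z3)
dJdW2 = np.dot(a2.T, delta3)
delta2 = np.dot(delta3, W2.T) * sigmoid_prime(z2)
dJdW1 = np.dot(X.T, delta2)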
Thanks for this nice video. However, I think there is an error at 4:43: a(2) is actually the column vector [a(2)_1 a(2)_2 a(2)_3]^T duplicated three times, i.e. [[a(2)_1 a(2)_2 a(2)_3]^T, [a(2)_1 a(2)_2 a(2)_3]^T, [a(2)_1 a(2)_2 a(2)_3]^T].
I had a really hard time understanding how backpropagation was calculated in this video (or really in any resource I could find), so I thought I'd post a comment that may help others understand it better:
I felt like the transpose trick is kind of a cheat on the video's part. You can't really take the derivative of the matrix multiplication the way he did (or at least not that I'm aware of). If you actually work out the derivative of the cost function, you'll notice that the fully worked-out derivative happens to be mathematically equivalent to the equation he gives (with the transpose).
For example, take J = SUM(1/2 (y - y^)^2) and assume you have one training example. With one example, we can say that J = 1/2 (y - y^)^2.
Now start plugging in values for y^. For example, you might see that J = 1/2 (y - f(w(2)_1 * a(2)_1 + w(2)_2 * a(2)_2 + w(2)_3 * a(2)_3))^2. Here the function f is the sigmoid, but it could be any activation function you choose. For dJ/dW(2) this is as far down as you need to go.
Now take the derivative of that expression with respect to w(2)_1, w(2)_2, and w(2)_3:
For dJ/dW(2)_1, you can say this = (y^-y) * f'(z(3)) * a(2)_1
For dJ/dW(2)_2, you can say this = (y^-y) * f'(z(3)) * a(2)_2
For dJ/dW(2)_3, you can say this = (y^-y) * f'(z(3)) * a(2)_3
Notice that, for a single example, d(3) is a 1x1 matrix and a(2) is a 1x3 matrix, and multiplying a(2)^T * d(3) gives exactly the results above. So you can see he simply took a shortcut in the math.
You can now extend the same idea to dJ/dW(1), or add multiple examples and prove that the transposed form holds for many examples and many weight layers.
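A quick numerical check of that argument in Python (all values here are illustrative): the three per-weight partials computed one at a time match the single expression a(2)^T * d(3).

import numpy as np

def f(z):  # sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    return np.exp(-z) / (1.0 + np.exp(-z))**2

# One training example: a2 is 1x3, W2 is 3x1, y is 1x1
a2 = np.array([[0.2, 0.7, 0.4]])
W2 = np.array([[0.5], [-0.3], [0.8]])
y = np.array([[0.75]])

z3 = np.dot(a2, W2)
yHat = f(z3)

# Per-weight partials, exactly as written out above: (y^ - y) * f'(z(3)) * a(2)_j
d3 = (yHat - y) * f_prime(z3)                                      # 1x1
per_weight = np.array([[d3[0, 0] * a2[0, j]] for j in range(3)])   # 3x1, one partial per weight

# The "transpose trick": a(2)^T * d(3)
matrix_form = np.dot(a2.T, d3)                                     # 3x1

print(np.allclose(per_weight, matrix_form))                        # True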