WEBVTT

1
00:00:00.200 --> 00:00:03.720
<v Speaker 1>So if you ask an artificial intelligence to write a

2
00:00:03.839 --> 00:00:06.559
<v Speaker 1>Shakespearean sonnet about, I don't know, a toaster.

3
00:00:06.480 --> 00:00:08.800
<v Speaker 2>Right, it just does it in like three seconds.

4
00:00:08.400 --> 00:00:13.919
<v Speaker 1>Flat, exactly. It's so incredibly fast and honestly so convincingly

5
00:00:14.039 --> 00:00:16.079
<v Speaker 1>human that it's really easy to just throw our hands

6
00:00:16.120 --> 00:00:18.480
<v Speaker 1>up and say, well, the computer is just thinking.

7
00:00:18.359 --> 00:00:19.399
<v Speaker 2>Yeah, that's the illusion.

8
00:00:19.519 --> 00:00:22.640
<v Speaker 1>But here is the secret that the tech world sort

9
00:00:22.640 --> 00:00:27.920
<v Speaker 1>of, you know, glides right past. The AI doesn't actually

10
00:00:27.960 --> 00:00:28.839
<v Speaker 1>know what a toaster is.

11
00:00:29.239 --> 00:00:30.039
<v Speaker 2>No, not at all.

12
00:00:30.079 --> 00:00:32.640
<v Speaker 1>It doesn't know what a poem is. It's not experiencing

13
00:00:32.719 --> 00:00:37.560
<v Speaker 1>this burst of creative genius underneath the hood. It is

14
00:00:37.640 --> 00:00:41.200
<v Speaker 1>really just doing a massive amount of incredibly fast, very

15
00:00:41.240 --> 00:00:42.679
<v Speaker 1>boring accounting.

16
00:00:42.320 --> 00:00:43.920
<v Speaker 2>Which is exactly what we're going to get into.

17
00:00:44.079 --> 00:00:47.679
<v Speaker 1>Right. So today, for you, our listener, we're opening that ledger.

18
00:00:47.679 --> 00:00:50.119
<v Speaker 1>We are taking you on a custom tailored deep dive

19
00:00:50.479 --> 00:00:55.000
<v Speaker 1>to totally demystify how artificial intelligence actually, you know, learns.

20
00:00:55.119 --> 00:00:59.240
<v Speaker 2>Yeah, no magic, no impenetrable labyrinths, just the raw mechanics.

21
00:00:59.479 --> 00:00:59.960
<v Speaker 1>The mechanics.

22
00:01:00.079 --> 00:01:02.880
<v Speaker 2>Because I mean, we appreciate the result of a neural network, right,

23
00:01:03.359 --> 00:01:06.640
<v Speaker 2>we rarely understand the underlying chemistry of how it actually

24
00:01:06.680 --> 00:01:07.040
<v Speaker 2>got there.

25
00:01:07.159 --> 00:01:09.719
<v Speaker 1>It's totally a black box for most people. Exactly.

26
00:01:10.040 --> 00:01:12.480
<v Speaker 2>So our guide for pulling back the curtain today is

27
00:01:12.519 --> 00:01:16.159
<v Speaker 2>this fantastic book by Seth Weidman. It's called Deep Learning

28
00:01:16.200 --> 00:01:20.439
<v Speaker 2>from Scratch: Building with Python from First Principles. A great resource,

29
00:01:20.599 --> 00:01:22.519
<v Speaker 2>really is. And what we're going to do is take

30
00:01:22.599 --> 00:01:26.719
<v Speaker 2>all that intimidating jargon, you know, the algorithms, the calculus,

31
00:01:26.760 --> 00:01:29.640
<v Speaker 2>the hyperparameters, scary stuff, all of it, and we're

32
00:01:29.680 --> 00:01:31.480
<v Speaker 2>going to strip it all the way down to the

33
00:01:31.560 --> 00:01:32.959
<v Speaker 2>foundational floorboards.

34
00:01:33.239 --> 00:01:36.599
<v Speaker 1>So we're going to use simple math, visual diagrams, and

35
00:01:36.640 --> 00:01:39.120
<v Speaker 1>some basic code to show you that deep learning is

36
00:01:39.239 --> 00:01:43.200
<v Speaker 1>really just a highly scaled assembly line of very very

37
00:01:43.239 --> 00:01:44.879
<v Speaker 1>simple mathematical factories.

38
00:01:44.959 --> 00:01:46.079
<v Speaker 2>That's a great way to put it.

39
00:01:46.159 --> 00:01:48.760
<v Speaker 1>But before we actually get to building that assembly line,

40
00:01:48.799 --> 00:01:50.959
<v Speaker 1>we need to talk about why learning this stuff in

41
00:01:51.000 --> 00:01:53.760
<v Speaker 1>the first place is normally such a complete nightmare. Oh

42
00:01:53.799 --> 00:01:55.680
<v Speaker 1>it really is, because if you try to read a

43
00:01:55.719 --> 00:01:59.439
<v Speaker 1>standard academic paper on neural networks, it often feels like

44
00:01:59.480 --> 00:02:02.840
<v Speaker 1>you're trying to read ancient Greek. Why is the entry

45
00:02:02.879 --> 00:02:04.000
<v Speaker 1>point so brutal?

46
00:02:04.719 --> 00:02:07.200
<v Speaker 2>Well, Weidman tackles this right out of the gate. He

47
00:02:07.319 --> 00:02:10.159
<v Speaker 2>uses that old parable of the blind men and the elephant.

48
00:02:10.599 --> 00:02:11.520
<v Speaker 1>Oh right, sure.

49
00:02:11.680 --> 00:02:14.199
<v Speaker 2>So you have a group of blind men who encounter

50
00:02:14.240 --> 00:02:17.439
<v Speaker 2>an elephant for the first time. One touches the trunk

51
00:02:17.840 --> 00:02:20.360
<v Speaker 2>and says, oh, an elephant is like a thick snake, right,

52
00:02:20.439 --> 00:02:22.360
<v Speaker 2>Another touches the ear and says, no, it's a fan.

53
00:02:22.680 --> 00:02:24.759
<v Speaker 2>Another grabs the leg and says it's a tree trunk.

54
00:02:25.120 --> 00:02:30.319
<v Speaker 1>And they're all kind of right but also completely wrong. Exactly.

55
00:02:30.759 --> 00:02:35.319
<v Speaker 2>They are all describing a correct, isolated part, but none

56
00:02:35.360 --> 00:02:38.400
<v Speaker 2>of them are describing the whole animal. And deep learning

57
00:02:38.439 --> 00:02:41.120
<v Speaker 2>resources have historically done the exact same thing.

58
00:02:41.240 --> 00:02:43.400
<v Speaker 1>Okay, that makes a lot of sense because, like, if

59
00:02:43.400 --> 00:02:46.439
<v Speaker 1>you want to learn a standard computer science concept, say

60
00:02:46.800 --> 00:02:50.080
<v Speaker 1>how a search algorithm works, the resources out there are

61
00:02:50.159 --> 00:02:53.240
<v Speaker 1>usually holistic. Like, a good textbook gives you a plain

62
00:02:53.319 --> 00:02:56.879
<v Speaker 1>English explanation, then they give you a whiteboard diagram, then

63
00:02:56.919 --> 00:02:59.120
<v Speaker 1>the math, and finally the pseudo code so you can

64
00:02:59.120 --> 00:03:01.400
<v Speaker 1>actually build it. You get the whole elephant, right, you

65
00:03:01.439 --> 00:03:05.080
<v Speaker 1>get the whole elephant. But AI resources fracture this, don't they?

66
00:03:05.039 --> 00:03:09.599
<v Speaker 2>Completely. The field sort of fractured into two really extreme camps.

67
00:03:09.919 --> 00:03:15.039
<v Speaker 2>On one side, you have these highly conceptual, incredibly dense

68
00:03:15.199 --> 00:03:18.360
<v Speaker 1>math textbooks. The ancient Greek. Exactly.

69
00:03:18.719 --> 00:03:22.800
<v Speaker 2>Weidman points to Ian Goodfellow's famous Deep Learning book, and look,

70
00:03:22.840 --> 00:03:26.840
<v Speaker 2>it's an absolute masterpiece. Sure, but if you aren't already

71
00:03:26.879 --> 00:03:30.800
<v Speaker 2>fluent in advanced calculus and linear algebra, you're going to

72
00:03:30.879 --> 00:03:33.120
<v Speaker 2>hit a brick wall on, like, page ten. It's just

73
00:03:33.159 --> 00:03:34.879
<v Speaker 2>a sea of abstract equations.

74
00:03:34.960 --> 00:03:37.960
<v Speaker 1>So what's the other extreme then, Because if I don't

75
00:03:37.960 --> 00:03:40.280
<v Speaker 1>want to drown in calculus, where do people usually go?

76
00:03:40.439 --> 00:03:44.960
<v Speaker 2>Well, they go to the highly practical, code heavy tutorials.

77
00:03:45.439 --> 00:03:47.599
<v Speaker 2>So you might look up the documentation for a modern

78
00:03:47.680 --> 00:03:51.439
<v Speaker 2>library like PyTorch okay, and you just copy a block

79
00:03:51.439 --> 00:03:53.800
<v Speaker 2>of Python code, you paste it, you run it, and

80
00:03:53.840 --> 00:03:55.960
<v Speaker 2>you watch this number on your screen called the loss

81
00:03:56.039 --> 00:03:57.639
<v Speaker 2>value start to go down.

82
00:03:57.520 --> 00:03:58.879
<v Speaker 1>Which means it's working right.

83
00:03:59.080 --> 00:04:02.879
<v Speaker 2>Technically, yes, the network is learning, but the tutorial never

84
00:04:02.919 --> 00:04:05.800
<v Speaker 2>actually stops to explain the why. It's like you're driving

85
00:04:05.840 --> 00:04:08.520
<v Speaker 2>a sports car but you have zero clue how the

86
00:04:08.520 --> 00:04:10.159
<v Speaker 2>internal combustion engine works.

87
00:04:09.960 --> 00:04:13.400
<v Speaker 1>Which is, I'm guessing, where Weidman's approach comes in. He argues,

88
00:04:13.439 --> 00:04:15.960
<v Speaker 1>you have to merge these perspectives.

89
00:04:15.439 --> 00:04:19.519
<v Speaker 2>Yes, exactly. His core thesis is that to truly understand

90
00:04:19.519 --> 00:04:22.240
<v Speaker 2>neural networks, you have to hold multiple mental models in

91
00:04:22.279 --> 00:04:23.560
<v Speaker 2>your head simultaneously.

92
00:04:23.759 --> 00:04:24.879
<v Speaker 1>Okay, what does that look like?

93
00:04:25.000 --> 00:04:27.079
<v Speaker 2>Well, you have to look at a neural network and

94
00:04:27.120 --> 00:04:30.079
<v Speaker 2>see it as a mathematical function, but at the exact

95
00:04:30.160 --> 00:04:32.519
<v Speaker 2>same time, you have to see it as a computational

96
00:04:32.600 --> 00:04:36.240
<v Speaker 2>graph where data physically flows from left to right. Got it.

97
00:04:36.439 --> 00:04:38.120
<v Speaker 2>You also have to see it as a series of

98
00:04:38.240 --> 00:04:41.959
<v Speaker 2>layered neurons, and finally, you have to understand it conceptually

99
00:04:42.240 --> 00:04:44.720
<v Speaker 2>as a universal function approximator.

100
00:04:44.839 --> 00:04:49.480
<v Speaker 1>Wait, hold on, a universal function approximator. Yeah, that sounds

101
00:04:49.560 --> 00:04:53.839
<v Speaker 1>like a fancy blender from a late night infomercial or something.

102
00:04:53.879 --> 00:04:54.879
<v Speaker 1>What does that actually mean?

103
00:04:55.040 --> 00:04:58.560
<v Speaker 2>I know it sounds super intimidating, but it just means

104
00:04:58.560 --> 00:05:02.120
<v Speaker 2>a machine that can mold itself to approximate literally

105
00:05:02.199 --> 00:05:05.480
<v Speaker 2>any pattern in the universe, provided it has enough parts.

106
00:05:05.680 --> 00:05:09.439
<v Speaker 2>Any pattern, pretty much, whether the pattern is predicting tomorrow's weather,

107
00:05:09.920 --> 00:05:13.279
<v Speaker 2>or recognizing a cat in a photo, or translating English

108
00:05:13.319 --> 00:05:16.639
<v Speaker 2>to French. If there's a logical relationship between the input

109
00:05:16.720 --> 00:05:20.240
<v Speaker 2>and the output, a neural network can approximate it. That's wild,

110
00:05:20.600 --> 00:05:22.959
<v Speaker 2>it is, But you only realize how it does that

111
00:05:23.040 --> 00:05:25.439
<v Speaker 2>if you force yourself to see the math, the diagram

112
00:05:25.519 --> 00:05:27.199
<v Speaker 2>and the code side by side.

113
00:05:27.160 --> 00:05:29.959
<v Speaker 1>And I guess that's why Weidman forces the reader to build

114
00:05:30.000 --> 00:05:34.399
<v Speaker 1>these networks from scratch in Python, using just, like, basic arrays

115
00:05:34.040 --> 00:05:35.079
<v Speaker 2>in NumPy. Exactly.

116
00:05:35.160 --> 00:05:37.439
<v Speaker 1>It's not because you're trying to build the fastest AI

117
00:05:37.519 --> 00:05:40.879
<v Speaker 1>in the world. It's purely an exercise in solidifying your

118
00:05:40.920 --> 00:05:42.079
<v Speaker 1>understanding of those models.

119
00:05:42.079 --> 00:05:42.519
<v Speaker 2>Spot on.

120
00:05:42.920 --> 00:05:46.279
<v Speaker 1>So let's start doing exactly that for the listener. Let's

121
00:05:46.319 --> 00:05:49.600
<v Speaker 1>abandon the complex terminology. We need to start at the

122
00:05:49.920 --> 00:05:54.439
<v Speaker 1>absolute foundation of all machine learning, the mathematical function and

123
00:05:54.480 --> 00:05:55.480
<v Speaker 1>the derivative. Right.

124
00:05:55.959 --> 00:05:58.360
<v Speaker 2>And usually when we learn about functions in high school,

125
00:05:58.399 --> 00:06:01.959
<v Speaker 2>we use the Cartesian plane, you know, René Descartes' classic

126
00:06:02.360 --> 00:06:02.839
<v Speaker 2>x and y

127
00:06:02.920 --> 00:06:04.720
<v Speaker 1>axes. The good old graph paper.

128
00:06:04.920 --> 00:06:07.879
<v Speaker 2>Exactly, you plot some points, you draw a curved line

129
00:06:07.920 --> 00:06:10.680
<v Speaker 2>through them, and that's fine for basic geometry, but it's

130
00:06:10.680 --> 00:06:13.560
<v Speaker 2>actually a terrible mental model for deep learning.

131
00:06:13.720 --> 00:06:17.279
<v Speaker 1>Yeah, drawing parabolas isn't going to help us build an AI. Instead,

132
00:06:17.319 --> 00:06:20.600
<v Speaker 1>Weidman tells us to visualize a function as a mini factory,

133
00:06:20.920 --> 00:06:23.800
<v Speaker 1>just a physical box sitting on a table. Inputs go

134
00:06:23.879 --> 00:06:27.079
<v Speaker 1>into the box on a conveyor belt. The factory has

135
00:06:27.120 --> 00:06:30.959
<v Speaker 1>some internal strict rules that it applies to whatever comes in,

136
00:06:31.519 --> 00:06:34.800
<v Speaker 1>and then a transformed output comes out the other side precisely.

137
00:06:35.120 --> 00:06:37.800
<v Speaker 2>So let's say the factory is a square function. Okay,

138
00:06:38.040 --> 00:06:40.959
<v Speaker 2>you send the number two into the factory. The factory's

139
00:06:41.040 --> 00:06:44.600
<v Speaker 2>internal rule is to multiply the input by itself, so

140
00:06:44.680 --> 00:06:47.360
<v Speaker 2>out comes the number four. You send in a three, out comes

141
00:06:47.360 --> 00:06:49.800
<v Speaker 2>a nine. It's just a simple, predictable machine.

142
00:06:49.920 --> 00:06:52.240
<v Speaker 1>Okay. So if the function is just a factory box,

143
00:06:52.360 --> 00:06:55.319
<v Speaker 1>what is a derivative? Because just hearing the word derivative

144
00:06:55.480 --> 00:06:58.560
<v Speaker 1>definitely triggers some traumatic math flashbacks for me.

145
00:06:58.639 --> 00:07:01.800
<v Speaker 2>Oh, for sure. But let's stick with our factory visualization. Yeah,

146
00:07:01.920 --> 00:07:04.800
<v Speaker 2>imagine there is a physical string connecting the input of

147
00:07:04.800 --> 00:07:06.959
<v Speaker 2>the factory to the output of the factory, a string.

148
00:07:07.160 --> 00:07:07.439
<v Speaker 1>Okay.

149
00:07:07.519 --> 00:07:10.680
<v Speaker 2>The derivative is simply asking a very practical question. If

150
00:07:10.720 --> 00:07:14.079
<v Speaker 2>you pull on the input string by a very very

151
00:07:14.079 --> 00:07:17.040
<v Speaker 2>small amount, a tiny nudge like point zero zero zero one,

152
00:07:17.560 --> 00:07:20.720
<v Speaker 2>by what multiple does the output string move?

153
00:07:20.839 --> 00:07:23.240
<v Speaker 1>Ah, okay. So it's kind of like adjusting an analog

154
00:07:23.600 --> 00:07:26.959
<v Speaker 1>volume knob on an old stereo. If I nudge the

155
00:07:27.000 --> 00:07:30.560
<v Speaker 1>input dial just a tiny fraction of a millimeter, how

156
00:07:30.639 --> 00:07:33.480
<v Speaker 1>much louder does the music actually get? Like, does a

157
00:07:33.519 --> 00:07:37.360
<v Speaker 1>tiny nudge on the input cause a massive blown speaker

158
00:07:37.399 --> 00:07:40.759
<v Speaker 1>spike in the output, or does it barely move the

159
00:07:40.759 --> 00:07:41.399
<v Speaker 1>needle at all?

160
00:07:41.680 --> 00:07:43.680
<v Speaker 2>Exactly. You're measuring the rate of change.
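
NOTE
A minimal Python sketch of the "tiny nudge" idea described above, assuming the square function as the factory.
The names square and nudge_derivative are illustrative and are not taken from the book.
It nudges the input slightly in both directions (a central difference) and reports the multiple by which the output moves.
```python
def square(x):
    # The "factory": multiply the input by itself.
    return x * x
def nudge_derivative(func, x, nudge=0.0001):
    # Push the input slightly down and slightly up, then compare
    # how far the output moved to how far the input moved.
    return (func(x + nudge) - func(x - nudge)) / (2 * nudge)
slope_at_3 = nudge_derivative(square, 3.0)
print(round(slope_at_3, 4))  # 6.0
```
Near an input of 3, the output of the square factory moves about six times as far as the input does.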

161
00:07:43.959 --> 00:07:47.319
<v Speaker 1>But okay, why is this tiny nudge so crucial? Like

162
00:07:47.480 --> 00:07:50.120
<v Speaker 1>why does an artificial intelligence care so much about this

163
00:07:50.160 --> 00:07:50.759
<v Speaker 1>little string?

164
00:07:51.199 --> 00:07:56.040
<v Speaker 2>Because this rate of change, knowing exactly how the input

165
00:07:56.079 --> 00:08:00.879
<v Speaker 2>affects the output is the literal engine of machine learning. Yes,

166
00:08:01.319 --> 00:08:03.720
<v Speaker 2>it is how the model knows how to correct its

167
00:08:03.759 --> 00:08:06.800
<v Speaker 2>own errors. Think about it. If an AI makes a

168
00:08:06.800 --> 00:08:09.560
<v Speaker 2>prediction and that prediction is wrong, it needs to know

169
00:08:09.560 --> 00:08:10.399
<v Speaker 2>how to fix it right.

170
00:08:10.439 --> 00:08:11.160
<v Speaker 1>It has to adjust.

171
00:08:11.199 --> 00:08:13.639
<v Speaker 2>And if the AI knows exactly how a tiny nudge

172
00:08:13.680 --> 00:08:16.399
<v Speaker 2>to its internal settings will affect the final outcome, it

173
00:08:16.439 --> 00:08:19.040
<v Speaker 2>knows exactly which dials to turn and in which direction

174
00:08:19.360 --> 00:08:21.959
<v Speaker 2>to get a better result next time. The derivative is

175
00:08:21.959 --> 00:08:24.040
<v Speaker 2>basically the compass pointing toward the correct answer.

176
00:08:24.160 --> 00:08:26.360
<v Speaker 1>I see. Okay, so we have a single mini factory.

177
00:08:26.480 --> 00:08:28.839
<v Speaker 1>You nudge the input, you watch the output change, you

178
00:08:28.879 --> 00:08:32.000
<v Speaker 1>adjust the dial. That makes sense, but predicting a housing

179
00:08:32.039 --> 00:08:35.399
<v Speaker 1>price or writing a poem takes way more than one

180
00:08:35.399 --> 00:08:39.320
<v Speaker 1>mathematical step. Real data doesn't just go through one simple rule.

181
00:08:39.639 --> 00:08:41.840
<v Speaker 1>So how do these boxes actually talk? To each other

182
00:08:41.840 --> 00:08:43.000
<v Speaker 1>without losing all the data.

183
00:08:43.600 --> 00:08:46.120
<v Speaker 2>So this brings us to the concept of nested functions.

184
00:08:46.799 --> 00:08:49.879
<v Speaker 2>In deep learning. You almost never have just one factory.

185
00:08:50.200 --> 00:08:52.279
<v Speaker 1>You have a chain of them, an assembly line.

186
00:08:52.399 --> 00:08:55.840
<v Speaker 2>Exactly an assembly line. The output conveyor belt of factory

187
00:08:55.879 --> 00:08:58.960
<v Speaker 2>one feeds directly into the input conveyor belt of factory

188
00:08:59.000 --> 00:09:03.039
<v Speaker 2>two. Factory one transforms the raw data, passes it to factory two,

189
00:09:03.120 --> 00:09:04.919
<v Speaker 2>which transforms it again, and so on.

190
00:09:05.039 --> 00:09:07.399
<v Speaker 1>Okay, but if I nudge the input at the very

191
00:09:07.440 --> 00:09:10.440
<v Speaker 1>beginning of the assembly line, that ripple has to travel

192
00:09:10.440 --> 00:09:13.399
<v Speaker 1>through every single factory to reach the end. How do

193
00:09:13.440 --> 00:09:16.039
<v Speaker 1>we track that string across ten different boxes.

194
00:09:16.120 --> 00:09:18.759
<v Speaker 2>We use what might be the single most important mathematical

195
00:09:18.840 --> 00:09:22.039
<v Speaker 2>rule in all of deep learning, the chain rule from calculus.

196
00:09:22.120 --> 00:09:26.240
<v Speaker 2>The chain rule. Yes, and again, Weidman demystifies this beautifully

197
00:09:26.600 --> 00:09:27.799
<v Speaker 2>using the factory boxes.

198
00:09:28.000 --> 00:09:29.720
<v Speaker 1>Okay, let's trace the string. Then, let's say we have

199
00:09:29.720 --> 00:09:31.759
<v Speaker 1>two boxes. We pull the string on the input to

200
00:09:31.799 --> 00:09:34.679
<v Speaker 1>box one. We observe that its output changes by a

201
00:09:34.679 --> 00:09:37.200
<v Speaker 1>factor of three, so three is the multiplier. Right.

202
00:09:37.559 --> 00:09:40.440
<v Speaker 1>That output is now the input for box two. And

203
00:09:40.519 --> 00:09:43.000
<v Speaker 1>we already know that if we tweak the input of

204
00:09:43.039 --> 00:09:46.159
<v Speaker 1>box two, its output changes by a factor of say

205
00:09:46.320 --> 00:09:47.440
<v Speaker 1>minus two units.

206
00:09:47.679 --> 00:09:51.159
<v Speaker 2>Perfect setup. So to find the total change across the

207
00:09:51.279 --> 00:09:53.639
<v Speaker 2>entire chain from the very first input to the very

208
00:09:53.679 --> 00:09:56.480
<v Speaker 2>last output, the chain rule says, we simply multiply those

209
00:09:56.559 --> 00:09:59.240
<v Speaker 2>rates of change together. Just multiply them, just multiply them.

210
00:10:00.200 --> 00:10:02.840
<v Speaker 2>Box one changes things by a factor of three, box two changes

211
00:10:02.879 --> 00:10:05.200
<v Speaker 2>things by a factor of negative two. The total change

212
00:10:05.200 --> 00:10:08.320
<v Speaker 2>across the whole chain is three multiplied by negative two,

213
00:10:08.759 --> 00:10:10.080
<v Speaker 2>which equals negative six.
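
NOTE
A small Python sketch of the two-box example above, assuming each box is a simple linear factory.
The functions box_one and box_two are hypothetical stand-ins chosen so that their rates of change are 3 and -2.
It multiplies the two rates, as the chain rule says, and checks the result against an actual one-unit nudge.
```python
def box_one(x):
    # Hypothetical factory whose output moves 3 units per unit of input.
    return 3.0 * x
def box_two(u):
    # Hypothetical factory whose output moves -2 units per unit of input.
    return -2.0 * u
def assembly_line(x):
    # The output of box one is the input of box two.
    return box_two(box_one(x))
rate_one, rate_two = 3.0, -2.0
chain_rule_rate = rate_one * rate_two
# Check against a direct one-unit nudge at the start of the line.
observed_shift = assembly_line(1.0 + 1.0) - assembly_line(1.0)
print(chain_rule_rate, observed_shift)  # -6.0 -6.0
```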

214
00:10:10.200 --> 00:10:12.360
<v Speaker 1>Oh wow, so a one unit nudge at the start

215
00:10:12.519 --> 00:10:14.720
<v Speaker 1>creates a negative six unit shift at the very end.

216
00:10:14.919 --> 00:10:15.320
<v Speaker 2>Exactly.

217
00:10:15.399 --> 00:10:18.720
<v Speaker 1>But wait, practically speaking, if I'm actually coding this assembly line,

218
00:10:18.879 --> 00:10:20.399
<v Speaker 1>how does the system know those numbers? Like,

219
00:10:20.440 --> 00:10:20.519
<v Speaker 2>do

220
00:10:20.600 --> 00:10:22.159
<v Speaker 1>I have to run the data all the way forward

221
00:10:22.200 --> 00:10:24.279
<v Speaker 1>to get an answer, and then somehow trace my steps

222
00:10:24.279 --> 00:10:26.480
<v Speaker 1>all the way backward to figure out the chain rule math?

223
00:10:26.720 --> 00:10:29.759
<v Speaker 2>That is exactly what you have to do. To code

224
00:10:29.759 --> 00:10:33.639
<v Speaker 2>this from scratch. Your system has to make two distinct passes.

225
00:10:34.240 --> 00:10:37.440
<v Speaker 2>First is the forward pass. Okay, forward, You feed your

226
00:10:37.440 --> 00:10:39.919
<v Speaker 2>initial data into the first factory and you let it

227
00:10:40.000 --> 00:10:42.679
<v Speaker 2>run all the way down the assembly line. But here's

228
00:10:42.679 --> 00:10:46.159
<v Speaker 2>the catch. As the data moves forward, the system has

229
00:10:46.200 --> 00:10:49.679
<v Speaker 2>to save all the intermediate quantities at every single step.

230
00:10:50.120 --> 00:10:53.039
<v Speaker 2>It has to keep a meticulous record of what happened

231
00:10:53.080 --> 00:10:53.960
<v Speaker 2>inside each box.

232
00:10:54.159 --> 00:10:56.039
<v Speaker 1>Why does it need to save all that? If it reaches

233
00:10:56.080 --> 00:10:58.480
<v Speaker 1>the end and gets an answer, hasn't it done its job?

234
00:10:58.639 --> 00:11:02.279
<v Speaker 2>Because of the second step, the backward pass. Once the

235
00:11:02.360 --> 00:11:04.919
<v Speaker 2>data reaches the end and the network spits out a prediction,

236
00:11:05.440 --> 00:11:08.159
<v Speaker 2>you compare that prediction to the correct answer to see

237
00:11:08.159 --> 00:11:11.360
<v Speaker 2>how wrong you were. Then you run backward down the

238
00:11:11.399 --> 00:11:14.240
<v Speaker 2>assembly line. You use all those intermediate records you saved

239
00:11:14.320 --> 00:11:17.320
<v Speaker 2>during the forward pass to calculate the derivatives, the strings.

240
00:11:17.320 --> 00:11:20.840
<v Speaker 2>Going backward, you calculate box two's string, then multiply it

241
00:11:20.879 --> 00:11:23.440
<v Speaker 2>by box one's string using the chain rule, all the

242
00:11:23.440 --> 00:11:24.360
<v Speaker 2>way back to the start.
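
NOTE
A rough sketch of the two passes described above, assuming a two-box line where both boxes are square functions.
The names forward, backward, and saved are illustrative, and the comparison against a correct answer is left out to keep it short.
The forward pass records each intermediate value; the backward pass reuses those records to multiply the rates from the last box to the first.
```python
def forward(x):
    # Forward pass: run the input through two square "factories",
    # saving every intermediate quantity for later.
    u = x * x          # output of box one
    out = u * u        # output of box two
    saved = {"x": x, "u": u}
    return out, saved
def backward(saved):
    # Backward pass: start at the last box and work toward the first.
    d_box_two = 2 * saved["u"]    # rate of change of u*u with respect to u
    d_box_one = 2 * saved["x"]    # rate of change of x*x with respect to x
    return d_box_two * d_box_one  # chain rule: multiply the rates
out, saved = forward(3.0)
grad = backward(saved)
print(out, grad)  # 81.0 108.0
```
Without the saved value of u, the backward pass could not compute box two's rate, which is why the forward pass keeps records.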

243
00:11:24.480 --> 00:11:27.639
<v Speaker 1>I'm not going to lie. That sounds incredibly tedious to

244
00:11:27.679 --> 00:11:30.960
<v Speaker 1>code by hand, keeping track of every single variable, saving

245
00:11:30.960 --> 00:11:34.799
<v Speaker 1>it all in memory, running backward, multiplying the strings. It

246
00:11:34.879 --> 00:11:37.519
<v Speaker 1>sounds like an absolute nightmare of bookkeeping.

247
00:11:37.600 --> 00:11:40.639
<v Speaker 2>It is. It's a massive bookkeeping operation. Yeah, and this

248
00:11:40.799 --> 00:11:44.120
<v Speaker 2>is exactly why modern deep learning libraries like PyTorch are

249
00:11:44.159 --> 00:11:45.240
<v Speaker 2>so popular

250
00:11:44.840 --> 00:11:47.000
<v Speaker 1>today. Because they do it for you. Exactly.

251
00:11:47.360 --> 00:11:51.960
<v Speaker 2>They use something called automatic differentiation. They handle all that

252
00:11:52.039 --> 00:11:56.080
<v Speaker 2>tedious forward and backward bookkeeping completely invisibly. You just define

253
00:11:56.080 --> 00:11:58.600
<v Speaker 2>the factories and the library does the calculus for you.

254
00:11:58.919 --> 00:12:01.080
<v Speaker 1>But Weidman forces you to do it by hand anyway, right?

255
00:12:01.200 --> 00:12:03.759
<v Speaker 1>He does, because if you just rely on PyTorch, you're

256
00:12:03.799 --> 00:12:06.360
<v Speaker 1>back to being a blind man touching the elephant. You

257
00:12:06.360 --> 00:12:08.360
<v Speaker 1>don't see the whole process. Exactly.

258
00:12:08.720 --> 00:12:11.519
<v Speaker 2>By coding the forward and backward passes from scratch in Python,

259
00:12:11.879 --> 00:12:15.080
<v Speaker 2>you actually see the mechanics. You realize that learning isn't consciousness.

260
00:12:15.320 --> 00:12:19.000
<v Speaker 2>It's literally just a series of multipliers passed backward down

261
00:12:19.000 --> 00:12:19.879
<v Speaker 2>an assembly line.

262
00:12:19.919 --> 00:12:22.120
<v Speaker 1>Okay, I'm with you on the strings and the assembly line.

263
00:12:22.120 --> 00:12:26.600
<v Speaker 1>But single numbers are great for theory. Reality is messy.

264
00:12:26.879 --> 00:12:27.440
<v Speaker 2>Very messy.

265
00:12:27.639 --> 00:12:30.279
<v Speaker 1>If I want an AI to predict a housing price,

266
00:12:30.840 --> 00:12:34.360
<v Speaker 1>I'm not just feeding it a single number. A house

267
00:12:34.399 --> 00:12:38.519
<v Speaker 1>has dozens of features, square footage, number of bedrooms, age

268
00:12:38.559 --> 00:12:41.799
<v Speaker 1>of the roof, proximity to a highway. So how do

269
00:12:41.840 --> 00:12:44.919
<v Speaker 1>we pull a string on a massive spreadsheet of information?

270
00:12:45.200 --> 00:12:47.879
<v Speaker 2>This is where we scale up to matrices and supervised

271
00:12:47.960 --> 00:12:51.679
<v Speaker 2>learning. Supervised learning is just finding relationships between

272
00:12:51.759 --> 00:12:54.840
<v Speaker 2>characteristics that have already been measured, Okay, and to process

273
00:12:54.840 --> 00:12:57.600
<v Speaker 2>all those characteristics, we can't use single numbers. We have

274
00:12:57.639 --> 00:13:00.559
<v Speaker 2>to stack the data into grids, which in NumPy

275
00:13:00.759 --> 00:13:03.480
<v Speaker 2>are called ndarrays, or n-dimensional arrays.

276
00:13:03.639 --> 00:13:06.600
<v Speaker 1>Right. So, if you visualize a spreadsheet, the columns are

277
00:13:06.600 --> 00:13:10.720
<v Speaker 1>the features like bedrooms, square footage, and every specific house

278
00:13:10.759 --> 00:13:12.279
<v Speaker 1>you are evaluating becomes a row.

279
00:13:12.519 --> 00:13:12.679
<v Speaker 2>Yep.

280
00:13:13.000 --> 00:13:15.080
<v Speaker 1>So a two x two grid might be two houses

281
00:13:15.159 --> 00:13:16.799
<v Speaker 1>each with two features. Exactly.

282
00:13:16.960 --> 00:13:19.480
<v Speaker 2>Now, when this grid of data enters the first factory,

283
00:13:19.720 --> 00:13:22.480
<v Speaker 2>the model needs a way to evaluate it. It performs

284
00:13:22.480 --> 00:13:23.840
<v Speaker 2>what's called a weighted sum.

285
00:13:23.960 --> 00:13:24.679
<v Speaker 1>A weighted sum.

286
00:13:24.799 --> 00:13:27.080
<v Speaker 2>Right. It looks at the features and decides how important

287
00:13:27.120 --> 00:13:30.000
<v Speaker 2>each one is. Does the square footage matter more than

288
00:13:30.039 --> 00:13:33.000
<v Speaker 2>the age of the roof? It assigns a mathematical weight

289
00:13:33.080 --> 00:13:33.759
<v Speaker 2>to each feature.

290
00:13:33.960 --> 00:13:37.759
<v Speaker 1>Okay, let me guess how this works mathematically. If I

291
00:13:37.919 --> 00:13:40.879
<v Speaker 1>have a column for bedrooms and a weight that says

292
00:13:40.960 --> 00:13:44.519
<v Speaker 1>bedrooms are very important, is the factory just doing a

293
00:13:44.559 --> 00:13:46.559
<v Speaker 1>dot product like matching them up?

294
00:13:46.759 --> 00:13:49.360
<v Speaker 2>Yes. Think of a dot product as a matching game.

295
00:13:50.039 --> 00:13:52.720
<v Speaker 2>The factory lines up the house's features in one hand

296
00:13:53.039 --> 00:13:55.080
<v Speaker 2>and its internal priority weights in the other.

297
00:13:55.320 --> 00:13:55.679
<v Speaker 1>Okay.

298
00:13:56.000 --> 00:13:59.360
<v Speaker 2>It matches the bedrooms to the bedroom weight, multiplies them together,

299
00:14:00.039 --> 00:14:03.440
<v Speaker 2>matches the square footage to the square footage weight, multiplies them.

300
00:14:03.879 --> 00:14:06.919
<v Speaker 2>Then it throws all those paired results into one single

301
00:14:06.919 --> 00:14:08.480
<v Speaker 2>bucket and adds them up.
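
NOTE
A short NumPy sketch of the weighted sum described above, assuming two houses with two features each.
The feature values and weights are made up for illustration; a trained model would learn its weights from data.
np.dot pairs each feature with its weight, multiplies the pairs, and adds them into one number per house.
```python
import numpy as np
# Two houses (rows), two features (columns): bedrooms, square footage.
houses = np.array([[3.0, 1500.0],
                   [4.0, 3000.0]])
# Made-up weights, one per feature.
weights = np.array([10.0, 0.5])
# Match each feature to its weight, multiply, and sum per house.
weighted_sums = np.dot(houses, weights)
print(weighted_sums.tolist())  # [780.0, 1540.0]
```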

302
00:14:08.879 --> 00:14:11.279
<v Speaker 1>That's the sum, right, But if you keep multiplying features

303
00:14:11.279 --> 00:14:14.200
<v Speaker 1>by weights, that bucket is going to overflow real fast.

304
00:14:14.320 --> 00:14:16.720
<v Speaker 1>I mean, a three thousand square foot house multiplied by

305
00:14:16.759 --> 00:14:19.480
<v Speaker 1>a heavy weight becomes a massive number. Do we just let

306
00:14:19.519 --> 00:14:21.240
<v Speaker 1>the numbers get infinitely large?

307
00:14:21.440 --> 00:14:23.799
<v Speaker 2>We can't, which is why we usually feed that bucket

308
00:14:23.799 --> 00:14:27.799
<v Speaker 2>into another factory right afterward, typically something called a sigmoid function.

309
00:14:28.039 --> 00:14:30.000
<v Speaker 1>A sigmoid function, we haven't covered that one. What's that?

310
00:14:30.200 --> 00:14:32.360
<v Speaker 2>A sigmoid function is basically a squishing factory.

311
00:14:32.399 --> 00:14:33.480
<v Speaker 1>A squishing factory.

312
00:14:33.639 --> 00:14:36.840
<v Speaker 2>Yeah, it takes whatever wild massive number comes out of

313
00:14:36.840 --> 00:14:40.080
<v Speaker 2>the weighted sum, and it brutally compresses it into a

314
00:14:40.120 --> 00:14:42.200
<v Speaker 2>manageable decimal between zero and one.
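
NOTE
A minimal sketch of the "squishing factory", assuming the standard logistic sigmoid, 1 / (1 + e^(-x)).
The sample scores are arbitrary and only show that negative, zero, and positive inputs all land between zero and one.
```python
import numpy as np
def sigmoid(x):
    # Squash any real number into the range between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))
raw_scores = np.array([-5.0, 0.0, 5.0])
squashed = sigmoid(raw_scores)
print(squashed)
```
A raw score of zero maps to exactly 0.5, and larger scores move toward 1 without ever passing it.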

315
00:14:42.320 --> 00:14:42.519
<v Speaker 1>Oh, I see.

316
00:14:42.559 --> 00:14:44.759
<v Speaker 2>This is incredibly useful if you just want

317
00:14:44.759 --> 00:14:47.240
<v Speaker 2>the network to give you a probability, like a point

318
00:14:47.320 --> 00:14:49.559
<v Speaker 2>eight chance that the house is a good buy, rather than

319
00:14:49.679 --> 00:14:51.480
<v Speaker 2>outputting a raw score of four million.

320
00:14:51.559 --> 00:14:54.799
<v Speaker 1>Okay, so our assembly line is now take the matrix

321
00:14:54.879 --> 00:14:58.240
<v Speaker 1>of houses, match them with weights, sum them up, and

322
00:14:58.279 --> 00:15:01.159
<v Speaker 1>then squish them through a sigmoid factory to get a probability.

323
00:15:01.279 --> 00:15:01.759
<v Speaker 2>You got it.

324
00:15:02.080 --> 00:15:04.799
<v Speaker 1>I get the forward pass. But here's where my brain

325
00:15:04.960 --> 00:15:08.639
<v Speaker 1>completely breaks. To do the backward pass, we have to

326
00:15:08.679 --> 00:15:12.480
<v Speaker 1>pull the string to correct the errors. How on earth

327
00:15:12.519 --> 00:15:15.279
<v Speaker 1>do you track the derivative of a giant grid of

328
00:15:15.320 --> 00:15:19.519
<v Speaker 1>interacting numbers? Every row and column is interacting with every weight.

329
00:15:19.919 --> 00:15:23.720
<v Speaker 1>The calculus must just explode into absolute chaos.

330
00:15:23.480 --> 00:15:26.759
<v Speaker 2>You would think so. Tracking every single string individually across a massive

331
00:15:26.759 --> 00:15:30.440
<v Speaker 2>matrix would be impossible, right? But the math looks incredibly

332
00:15:30.480 --> 00:15:34.600
<v Speaker 2>messy on a whiteboard, while the resulting code is brilliantly,

333
00:15:34.960 --> 00:15:38.879
<v Speaker 2>shockingly clean. It's a magical property of linear algebra.

334
00:15:39.000 --> 00:15:41.639
<v Speaker 1>Whoa, whoa, stop right there. Time out. You literally

335
00:15:41.759 --> 00:15:45.120
<v Speaker 1>cannot start this deep dive by promising no magic and

336
00:15:45.159 --> 00:15:47.759
<v Speaker 1>then tell me the math relies on a magical property.

337
00:15:48.200 --> 00:15:52.519
<v Speaker 1>That is totally cheating. Explain it. Why does the matrix

338
00:15:52.559 --> 00:15:53.919
<v Speaker 1>math clean up so nicely?

339
00:15:54.120 --> 00:15:57.159
<v Speaker 2>Fair catch. Okay, you're right, no magic. It comes down

340
00:15:57.200 --> 00:15:58.759
<v Speaker 2>to something called matrix transposition.

341
00:15:58.840 --> 00:15:59.960
<v Speaker 1>Matrix transposition.

342
00:16:00.039 --> 00:16:02.840
<v Speaker 2>Yes. When you need to compute the backward pass, the

343
00:16:02.919 --> 00:16:07.200
<v Speaker 2>gradient for a giant grid of weights, the chain rule

344
00:16:07.279 --> 00:16:10.080
<v Speaker 2>dictates that you don't actually have to calculate a million

345
00:16:10.120 --> 00:16:13.879
<v Speaker 2>individual strings. Instead, you take the input matrix, and you

346
00:16:13.879 --> 00:16:16.240
<v Speaker 2>simply transpose it. You flip it on its side.

347
00:16:16.039 --> 00:16:18.879
<v Speaker 1>Meaning the rows become columns and the columns become rows.
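
NOTE
A tiny illustration of "rows become columns" in plain Python. The helper name transpose and the sample grid are assumptions made for this sketch.
```python
def transpose(M):
    """Return a new matrix whose rows are the columns of M."""
    return [list(col) for col in zip(*M)]
X = [[1, 2, 3],
     [4, 5, 6]]
print(transpose(X))  # [[1, 4], [2, 5], [3, 6]]
```
Transposing twice gives back the original grid, which is one way to check the helper.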

348
00:16:19.320 --> 00:16:22.840
<v Speaker 2>Exactly. And why does this work mechanically? Think of the

349
00:16:22.840 --> 00:16:26.399
<v Speaker 2>forward pass like a river flowing downstream, splitting into hundreds

350
00:16:26.440 --> 00:16:29.360
<v Speaker 2>of tiny branches. Those are your data points interacting with weights.

351
00:16:29.480 --> 00:16:30.519
<v Speaker 1>Okay, I picture it.

352
00:16:30.639 --> 00:16:32.519
<v Speaker 2>If you want to send an error signal back up

353
00:16:32.559 --> 00:16:34.679
<v Speaker 2>the river to the exact source that caused it, you

354
00:16:34.759 --> 00:16:37.600
<v Speaker 2>just reverse the map. Flipping the matrix on its side

355
00:16:37.720 --> 00:16:40.879
<v Speaker 2>perfectly reroutes the error signals backward along the exact

356
00:16:40.879 --> 00:16:43.600
<v Speaker 2>same mathematical paths the data used to travel forward.

357
00:16:43.799 --> 00:16:46.960
<v Speaker 1>Oh wow, So you aren't doing entirely new chaotic math

358
00:16:47.039 --> 00:16:49.279
<v Speaker 1>to go backward, not at all. You're just taking the

359
00:16:49.279 --> 00:16:52.240
<v Speaker 1>infrastructure you built going forward, turning it sideways, and letting

360
00:16:52.279 --> 00:16:54.399
<v Speaker 1>the error flow back to the correct weight.

361
00:16:54.559 --> 00:16:58.799
<v Speaker 2>Precisely. Because of how matrix transposes work out mathematically, this

362
00:16:59.039 --> 00:17:03.159
<v Speaker 2>incredibly complex web of interacting data collapses into a

363
00:17:03.200 --> 00:17:06.880
<v Speaker 2>few incredibly simple lines of Python code during the backward pass.

364
00:17:07.119 --> 00:17:08.720
<v Speaker 2>It scales perfectly.
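
NOTE
One way to see the "few simple lines" claim: for a weighted sum N = X @ W, the chain rule gives the weight gradient as the transposed input times the error signal, dL/dW = X^T @ dL/dN. The sketch below assumes a toy loss (the sum of the sigmoid outputs) and hypothetical data, and checks the result against a numerical nudge. All helper names are illustrative.
```python
import math
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
def transpose(M):
    return [list(col) for col in zip(*M)]
def loss(X, W):
    """Toy loss (an assumption): the sum of sigmoid(X @ W) over all rows."""
    return sum(sigmoid(n[0]) for n in matmul(X, W))
def weight_gradient(X, W):
    """Backward pass: chain rule through the sigmoid, then through X @ W."""
    N = matmul(X, W)  # the weighted sums from the forward pass
    dL_dN = [[sigmoid(n[0]) * (1.0 - sigmoid(n[0]))] for n in N]  # sigmoid'(N)
    return matmul(transpose(X), dL_dN)  # dL/dW = X^T @ dL/dN
X = [[3.0, 1.2], [2.0, 0.8], [5.0, 2.5]]  # hypothetical inputs (3 rows, 2 features)
W = [[0.4], [-0.5]]  # hypothetical weights (2 x 1)
grad = weight_gradient(X, W)
# Sanity check: nudge the first weight and compare with the analytic gradient.
eps = 1e-6
W_nudged = [[W[0][0] + eps], [W[1][0]]]
numeric = (loss(X, W_nudged) - loss(X, W)) / eps
print(grad[0][0], numeric)
```
The two printed numbers should agree closely, which suggests the transposed product is doing the per-weight bookkeeping in a single step.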

365
00:17:08.839 --> 00:17:11.079
<v Speaker 1>That is wild. So it doesn't matter if I'm feeding

366
00:17:11.440 --> 00:17:14.279
<v Speaker 1>the factory a single two x two grid or a

367
00:17:14.319 --> 00:17:17.559
<v Speaker 1>massive matrix with a million rows representing every house in

368
00:17:17.559 --> 00:17:20.680
<v Speaker 1>the country. The logic of the assembly line stays exactly

369
00:17:20.680 --> 00:17:21.279
<v Speaker 1>the same.

370
00:17:21.200 --> 00:17:22.079
<v Speaker 2>Exactly the same.

371
00:17:22.160 --> 00:17:25.079
<v Speaker 1>The forward pass runs the matching game and squishes the numbers.

372
00:17:25.519 --> 00:17:27.880
<v Speaker 1>The backward pass flips the map on its side to

373
00:17:27.960 --> 00:17:31.359
<v Speaker 1>route the blame, runs the chain rule, and updates the weights.
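
NOTE
The full loop recapped here (forward pass, backward pass, weight update) might look roughly like this in plain Python. The mean squared error loss, the learning rate, the tiny data set, and the function names are all assumptions for illustration; the episode does not specify them.
```python
import math
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def forward(X, w):
    """Weighted sum per row, then the sigmoid squish."""
    return [sigmoid(sum(x_i * w_i for x_i, w_i in zip(row, w))) for row in X]
def mse(p, y):
    return sum((p_i - y_i) ** 2 for p_i, y_i in zip(p, y)) / len(y)
def train_step(X, y, w, lr):
    """One forward pass, one backward pass, one weight update."""
    p = forward(X, w)
    n = len(y)
    # Chain rule: dL/dp = 2(p - y)/n, and dp/dN = p(1 - p).
    dL_dN = [2.0 * (p_i - y_i) / n * p_i * (1.0 - p_i) for p_i, y_i in zip(p, y)]
    # Flip the map: each column of X (a row of X^T) is dotted with the error signal.
    grad = [sum(x * d for x, d in zip(col, dL_dN)) for col in zip(*X)]
    return [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
# Hypothetical data: 4 houses, 2 scaled features each; label 1.0 means "good buy".
X = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
y = [1.0, 1.0, 0.0, 0.0]
w = [0.0, 0.0]
loss_before = mse(forward(X, w), y)
for _ in range(200):
    w = train_step(X, y, w, lr=0.5)
loss_after = mse(forward(X, w), y)
print(loss_before, loss_after)
```
With all weights at zero every prediction starts at 0.5, so the loss begins at 0.25 and should shrink as the updates accumulate.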

374
00:17:31.799 --> 00:17:34.000
<v Speaker 2>And that is why we can have AI models today

375
00:17:34.039 --> 00:17:38.319
<v Speaker 2>with billions or even trillions of parameters. The fundamental architecture,

376
00:17:38.400 --> 00:17:41.839
<v Speaker 2>the mini factory, the chain rule, the matrix transposes. It's

377
00:17:42.039 --> 00:17:45.440
<v Speaker 2>infinitely scalable. You just need more powerful computers to run

378
00:17:45.480 --> 00:17:46.559
<v Speaker 2>the assembly line faster.

379
00:17:46.839 --> 00:17:49.480
<v Speaker 1>Okay, let's bring this all together for you, the listener.

380
00:17:50.200 --> 00:17:52.720
<v Speaker 1>We started this deep dive staring at a hidden circuitry

381
00:17:52.759 --> 00:17:56.359
<v Speaker 1>that everyone just assumes is impenetrable, but by looking through

382
00:17:56.400 --> 00:17:59.240
<v Speaker 1>the lens of Seth Weidman's work, we've stripped it down.

383
00:17:59.599 --> 00:18:03.240
<v Speaker 1>We've seen that deep learning is an assembly line of mini factories.

384
00:18:03.640 --> 00:18:07.359
<v Speaker 1>We have inputs that flow forward through nested functions, matching

385
00:18:07.400 --> 00:18:10.160
<v Speaker 1>features to weights. We save our math as we go.

386
00:18:10.880 --> 00:18:13.680
<v Speaker 1>Then we compare our final answer, and we pull the

387
00:18:13.720 --> 00:18:17.559
<v Speaker 1>strings backward, flipping the matrices to calculate exactly how to

388
00:18:17.599 --> 00:18:20.759
<v Speaker 1>adjust our internal dials. It's not a brain, it's just

389
00:18:21.079 --> 00:18:25.720
<v Speaker 1>very fast, very elegant bookkeeping. You now have the foundational

390
00:18:25.759 --> 00:18:28.640
<v Speaker 1>mental model for how machines actually learn.

391
00:18:29.119 --> 00:18:31.960
<v Speaker 2>It's a very empowering realization, honestly, to finally see the

392
00:18:32.000 --> 00:18:35.279
<v Speaker 2>gears turning. But this actually raises a really important question,

393
00:18:35.319 --> 00:18:36.960
<v Speaker 2>and it's the thought I want to leave you with today.

394
00:18:37.359 --> 00:18:39.079
<v Speaker 2>It goes all the way back to the very first

395
00:18:39.079 --> 00:18:42.200
<v Speaker 2>step of this entire process: supervised learning.

396
00:18:42.359 --> 00:18:44.119
<v Speaker 1>You mean, like setting up the grid of numbers in

397
00:18:44.119 --> 00:18:44.680
<v Speaker 1>the first place.

398
00:18:44.839 --> 00:18:48.519
<v Speaker 2>Right. Weidman points out that to use these beautiful mathematical

399
00:18:48.519 --> 00:18:52.440
<v Speaker 2>assembly lines, we have to translate the messy, ambiguous real

400
00:18:52.480 --> 00:18:54.279
<v Speaker 2>world into precise numbers.

401
00:18:54.440 --> 00:18:54.839
<v Speaker 1>Yeah.

402
00:18:54.880 --> 00:18:58.440
<v Speaker 2>In our example, we chose price to perfectly represent a

403
00:18:58.480 --> 00:19:01.759
<v Speaker 2>house's value. The market decides the price, so mathematically that works.

404
00:19:02.240 --> 00:19:05.200
<v Speaker 2>But what happens when we try to force incredibly complex

405
00:19:05.279 --> 00:19:09.559
<v Speaker 2>human concepts into a single numeric matrix just to make

406
00:19:09.599 --> 00:19:10.279
<v Speaker 2>the math run?

407
00:19:10.559 --> 00:19:13.079
<v Speaker 1>So wait, if we're building a hiring algorithm, we have

408
00:19:13.160 --> 00:19:17.319
<v Speaker 1>to somehow turn a concept like desirability or work ethic

409
00:19:17.920 --> 00:19:20.279
<v Speaker 1>into a column on a spreadsheet?

410
00:19:20.440 --> 00:19:22.440
<v Speaker 2>Exactly. Or if we're building a loan approval model, we have

411
00:19:22.440 --> 00:19:26.039
<v Speaker 2>to quantify something like reliability. We have to convert nuanced,

412
00:19:26.079 --> 00:19:30.039
<v Speaker 2>deeply human ideas into cold, hard numbers so our mini factories

413
00:19:30.039 --> 00:19:31.000
<v Speaker 2>can actually process them.

414
00:19:31.079 --> 00:19:33.079
<v Speaker 1>So if we use something like say, zip codes to

415
00:19:33.079 --> 00:19:35.400
<v Speaker 1>help predict loan defaults, the math might work perfectly on

416
00:19:35.440 --> 00:19:38.039
<v Speaker 1>the assembly line, but we've accidentally built a machine that

417
00:19:38.079 --> 00:19:41.880
<v Speaker 1>mathematically justifies redlining. The bias isn't in the AI's brain.

418
00:19:42.240 --> 00:19:44.799
<v Speaker 1>The bias is baked into the columns of the spreadsheet

419
00:19:44.839 --> 00:19:46.680
<v Speaker 1>before the forward pass even starts.

420
00:19:47.000 --> 00:19:51.039
<v Speaker 2>Precisely, if the factory only knows what we put on

421
00:19:51.039 --> 00:19:54.480
<v Speaker 2>the conveyor belt, then the very first step of deep learning,

422
00:19:54.559 --> 00:19:59.319
<v Speaker 2>choosing which numbers represent reality, might actually be its biggest vulnerability.

423
00:19:59.480 --> 00:20:00.400
<v Speaker 1>That's terrifying.

424
00:20:00.440 --> 00:20:04.440
<v Speaker 2>Actually, it's a huge issue. Are our models objectively learning

425
00:20:04.480 --> 00:20:07.599
<v Speaker 2>the truth about the world or are they just efficiently

426
00:20:07.799 --> 00:20:12.240
<v Speaker 2>mathematically learning the biases we hardcoded into the system the

427
00:20:12.240 --> 00:20:13.680
<v Speaker 2>moment we decided what to measure?

428
00:20:13.839 --> 00:20:17.000
<v Speaker 1>That completely flips the script. We spent this whole time

429
00:20:17.400 --> 00:20:20.519
<v Speaker 1>demystifying the machinery inside the factory. We learned how the

430
00:20:20.519 --> 00:20:23.279
<v Speaker 1>boxes work, how the chain rule connects them. But maybe

431
00:20:23.319 --> 00:20:25.559
<v Speaker 1>the real question we should be asking isn't how the

432
00:20:25.559 --> 00:20:27.240
<v Speaker 1>factory processes the materials.

433
00:20:27.519 --> 00:20:30.119
<v Speaker 2>Who is deciding what materials are allowed on the conveyor

434
00:20:30.119 --> 00:20:30.880
<v Speaker 2>belt in the first place?

435
00:20:31.039 --> 00:20:33.440
<v Speaker 1>Exactly. Something for you to ponder until next time.
