WEBVTT

1
00:00:00.000 --> 00:00:02.560
<v Speaker 1>All right, everyone, welcome back to the Deep Dive. Today,

2
00:00:02.600 --> 00:00:05.919
<v Speaker 1>we're tackling a really big topic, one that shapes so

3
00:00:06.040 --> 00:00:08.960
<v Speaker 1>much of our modern world, neural networks and deep learning.

4
00:00:09.560 --> 00:00:12.320
<v Speaker 1>Our mission, as always, is to cut through the complexity,

5
00:00:12.359 --> 00:00:15.400
<v Speaker 1>pull out the most important bits for you, and give

6
00:00:15.400 --> 00:00:19.679
<v Speaker 1>you that shortcut to being truly well informed. We're diving

7
00:00:19.679 --> 00:00:23.800
<v Speaker 1>into this fantastic textbook by Charu C. Aggarwal, Neural Networks and

8
00:00:23.879 --> 00:00:27.800
<v Speaker 1>Deep Learning. Seriously, this thing is packed. It covers everything

9
00:00:27.960 --> 00:00:31.359
<v Speaker 1>from the foundational ideas all the way to super cutting-edge

10
00:00:31.399 --> 00:00:33.880
<v Speaker 1>applications, and it even touches on some architectures that

11
00:00:33.920 --> 00:00:36.880
<v Speaker 1>have been sort of, well, forgotten. So our goal is

12
00:00:36.920 --> 00:00:39.119
<v Speaker 1>to make sense of this intricate world, give you a

13
00:00:39.159 --> 00:00:42.240
<v Speaker 1>clear picture of what these powerful technologies are and maybe

14
00:00:42.280 --> 00:00:44.399
<v Speaker 1>more importantly, why they matter so much.

15
00:00:44.560 --> 00:00:47.119
<v Speaker 2>Yeah, and what's really fascinating, I think, is just tracing

16
00:00:47.600 --> 00:00:49.840
<v Speaker 2>how this field has evolved over time. We're going to

17
00:00:50.039 --> 00:00:53.960
<v Speaker 2>explore not just what these networks are, but how they

18
00:00:53.960 --> 00:00:56.439
<v Speaker 2>actually learn, why they work the way they do, which

19
00:00:56.479 --> 00:00:58.920
<v Speaker 2>can be pretty powerful, and of course the incredible ways

20
00:00:58.920 --> 00:01:02.439
<v Speaker 2>they're being used today. I mean, everything from self-driving

21
00:01:02.479 --> 00:01:06.719
<v Speaker 2>cars to generating creative art. It's kind of amazing. We

22
00:01:06.760 --> 00:01:09.719
<v Speaker 2>really want you to walk away with those aha moments,

23
00:01:09.760 --> 00:01:13.120
<v Speaker 2>you know, yeah, feeling like you've really unlocked some profound insights.

24
00:01:13.319 --> 00:01:17.000
<v Speaker 1>Okay, so let's unpack this right from the start. When

25
00:01:17.040 --> 00:01:19.599
<v Speaker 1>we hear neural networks, the first thing that pops into

26
00:01:19.640 --> 00:01:22.640
<v Speaker 1>mind is, well, the human brain. But how close is

27
00:01:22.680 --> 00:01:23.840
<v Speaker 1>that comparison, really?

28
00:01:24.359 --> 00:01:26.319
<v Speaker 2>That's a great question. The book points out that, yeah,

29
00:01:26.319 --> 00:01:29.400
<v Speaker 2>there was definitely initial biological inspiration. You know, things like

30
00:01:29.439 --> 00:01:33.040
<v Speaker 2>convolutional neural networks or CNNs were partly inspired by Hubel

31
00:01:33.079 --> 00:01:37.319
<v Speaker 2>and Wiesel's work on how the cat's visual cortex processes images.

32
00:01:37.680 --> 00:01:40.640
<v Speaker 2>But this is important. The book also mentions that comparison

33
00:01:40.680 --> 00:01:44.560
<v Speaker 2>is often criticized. It's seen as maybe a poor caricature,

34
00:01:44.719 --> 00:01:47.200
<v Speaker 2>a really simplified version of the actual brain.

35
00:01:47.319 --> 00:01:50.159
<v Speaker 1>Okay, so a loose inspiration, not a direct copy.

36
00:01:49.879 --> 00:01:54.040
<v Speaker 2>Exactly, though neuroscience principles have certainly been useful along the way.

37
00:01:54.319 --> 00:01:56.519
<v Speaker 1>But here's something I found really interesting in the book.

38
00:01:57.200 --> 00:02:01.519
<v Speaker 1>At their core, these networks aren't some completely alien tech.

39
00:02:01.959 --> 00:02:06.959
<v Speaker 1>They're actually built from like basic computational units inspired by

40
00:02:07.000 --> 00:02:10.639
<v Speaker 1>algorithms we already knew from traditional machine learning, things like

41
00:02:10.879 --> 00:02:13.360
<v Speaker 1>least squares regression or logistic regression.

42
00:02:13.479 --> 00:02:13.560
<v Speaker 2>Right.

43
00:02:13.840 --> 00:02:17.000
<v Speaker 1>The power comes from how they combine tons of these

44
00:02:17.000 --> 00:02:17.879
<v Speaker 1>simple units, right?

45
00:02:17.960 --> 00:02:21.360
<v Speaker 2>Precisely. They learn how to connect these basic building blocks

46
00:02:21.439 --> 00:02:25.960
<v Speaker 2>in really intricate ways, all working together to minimize the

47
00:02:25.960 --> 00:02:28.840
<v Speaker 2>prediction error. It's kind of like building something really complex

48
00:02:28.879 --> 00:02:33.759
<v Speaker 2>and amazing, like a cathedral, but using very simple, powerful bricks.

49
00:02:33.960 --> 00:02:36.120
<v Speaker 1>Okay, so what does that basic brick look like, then?

50
00:02:36.400 --> 00:02:39.280
<v Speaker 2>Good question. We can start with the simplest one, really,

51
00:02:39.719 --> 00:02:43.479
<v Speaker 2>the perceptron. Imagine it as a tiny decision maker. It

52
00:02:43.520 --> 00:02:47.680
<v Speaker 2>takes in different features, pieces of information. Each feature gets

53
00:02:47.719 --> 00:02:50.520
<v Speaker 2>multiplied by a weight, which basically says how important it is.

54
00:02:50.919 --> 00:02:54.360
<v Speaker 2>Then it sums all those weighted features up, and finally

55
00:02:54.479 --> 00:02:58.080
<v Speaker 2>that sum goes through something called an activation function. Think

56
00:02:58.120 --> 00:03:00.879
<v Speaker 2>of it like a gate, which produces the final output,

57
00:03:01.000 --> 00:03:04.520
<v Speaker 2>often, say, a class label like cat or dog.

58
00:03:04.520 --> 00:03:06.800
<v Speaker 1>And I read about a bias neuron. What's that?

59
00:03:07.080 --> 00:03:09.840
<v Speaker 2>Ah, right, that's sort of a neat trick. It's like

60
00:03:09.919 --> 00:03:12.159
<v Speaker 2>adding a constant offset to the sum before it hits

61
00:03:12.199 --> 00:03:14.520
<v Speaker 2>the activation function. You can do that by having a

62
00:03:14.560 --> 00:03:16.960
<v Speaker 2>special input that always has a value of one with

63
00:03:17.080 --> 00:03:19.240
<v Speaker 2>its own weight. It just gives the model a bit

64
00:03:19.280 --> 00:03:20.120
<v Speaker 2>more flexibility.
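
NOTE
A minimal sketch of the perceptron just described, assuming a sign activation and
labels of +1 and -1. The function name, the feature values, and the weights are
hypothetical and chosen only for illustration; they are not taken from the book.
```python
def perceptron_predict(features, weights, bias_weight):
    """Weighted sum of the features plus a bias term, passed through a sign activation."""
    # The bias is modelled as an extra input that is always 1, with its own weight.
    total = sum(w * x for w, x in zip(weights, features)) + bias_weight * 1.0
    # Sign activation: +1 for one class, -1 for the other.
    return 1 if total >= 0 else -1
# Hypothetical example with two features and hand-picked weights.
print(perceptron_predict([2.0, 1.0], [0.5, -1.0], 0.25))  # weighted sum is 0.25
print(perceptron_predict([0.0, 1.0], [0.5, -1.0], 0.25))  # weighted sum is -0.75
```
Changing only bias_weight shifts the decision threshold without touching the
feature weights, which is the extra flexibility mentioned above.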

65
00:03:20.240 --> 00:03:22.759
<v Speaker 1>Gotcha. And these activation functions, you said they're like gates.

66
00:03:22.800 --> 00:03:23.840
<v Speaker 1>Why are they so important?

67
00:03:24.000 --> 00:03:28.520
<v Speaker 2>They're absolutely crucial. They introduce nonlinearity. Without them, stacking layers

68
00:03:28.520 --> 00:03:30.719
<v Speaker 2>wouldn't actually add much power. It would still just be

69
00:03:30.759 --> 00:03:34.199
<v Speaker 2>a linear model. Early on, people used a simple sign function,

70
00:03:34.400 --> 00:03:36.960
<v Speaker 2>just outputting plus one or minus one, basically yes or no,

71
00:03:37.479 --> 00:03:39.960
<v Speaker 2>but that's hard to train mathematically because it's not smooth,

72
00:03:40.159 --> 00:03:44.439
<v Speaker 2>not differentiable. Okay, so functions like sigmoid and tanh became popular.

73
00:03:44.479 --> 00:03:47.199
<v Speaker 2>Sigmoid squishes the output between zero and one, which is

74
00:03:47.199 --> 00:03:51.159
<v Speaker 2>great for probabilities. Tanh is similarly S-shaped, but squishes

75
00:03:51.199 --> 00:03:52.400
<v Speaker 2>between negative one and one.

76
00:03:52.599 --> 00:03:54.680
<v Speaker 1>But the book mentioned something else has taken over now

77
00:03:54.879 --> 00:03:55.639
<v Speaker 1>ReLU.

78
00:03:56.000 --> 00:03:59.439
<v Speaker 2>Yes, and this is a really key aha moment for

79
00:03:59.520 --> 00:04:03.360
<v Speaker 2>understanding modern deep learning. ReLU, which stands for

80
00:04:03.400 --> 00:04:07.400
<v Speaker 2>rectified linear unit, sounds fancy, but it's incredibly simple. It's

81
00:04:07.400 --> 00:04:09.919
<v Speaker 2>just max(v, 0), so if the input is negative,

82
00:04:09.919 --> 00:04:12.680
<v Speaker 2>the output is zero. Otherwise the output is just the input. Yeah,

83
00:04:12.759 --> 00:04:18.639
<v Speaker 2>deceptively simple. ReLU and variations like hard tanh have largely replaced

84
00:04:18.639 --> 00:04:22.759
<v Speaker 2>sigmoid and soft tanh. Why? Because they're piecewise linear.

85
00:04:23.240 --> 00:04:26.360
<v Speaker 2>This makes the math, specifically the gradients used in training

86
00:04:26.560 --> 00:04:29.600
<v Speaker 2>much much easier to handle. They suffer way less from

87
00:04:29.639 --> 00:04:32.639
<v Speaker 2>a huge problem called vanishing gradients, which we should definitely

88
00:04:32.680 --> 00:04:35.439
<v Speaker 2>talk more about. This change was fundamental in allowing us

89
00:04:35.480 --> 00:04:37.439
<v Speaker 2>to train much much deeper networks.
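
NOTE
A small sketch of the activation functions named above, written with the Python
standard library only. The function names are illustrative; hard tanh is taken
here to mean clipping the input to the range [-1, 1].
```python
import math
def sigmoid(v):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))
def tanh(v):
    """S-shaped like the sigmoid, but squashes into the range (-1, 1)."""
    return math.tanh(v)
def relu(v):
    """max(v, 0): zero for negative inputs, the input itself otherwise."""
    return max(v, 0.0)
def hard_tanh(v):
    """Piecewise-linear stand-in for tanh: clip the input to [-1, 1]."""
    return max(min(v, 1.0), -1.0)
for v in (-2.0, 0.0, 2.0):
    print(v, round(sigmoid(v), 3), round(tanh(v), 3), relu(v), hard_tanh(v))
```
ReLU and hard tanh are piecewise linear, so their slopes are either 0 or 1,
which is what keeps the gradient computation simple.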

90
00:04:37.600 --> 00:04:40.639
<v Speaker 1>Okay, so we have these basic units, perceptrons, with these

91
00:04:40.639 --> 00:04:45.040
<v Speaker 1>crucial nonlinear activation functions like ReLU. How do we go

92
00:04:45.079 --> 00:04:47.000
<v Speaker 1>from that to, well, deep learning?

93
00:04:47.160 --> 00:04:50.160
<v Speaker 2>Right. Connecting it to the bigger picture. It's fascinating actually

94
00:04:50.240 --> 00:04:52.639
<v Speaker 2>that many traditional machine learning models, the ones that people

95
00:04:52.720 --> 00:04:55.759
<v Speaker 2>use for decades, can be seen as shallow neural networks.

96
00:04:55.920 --> 00:04:59.199
<v Speaker 2>Think about least squares regression, logistic regression, even support vector

97
00:04:59.279 --> 00:05:02.160
<v Speaker 2>machines, SVMs. You could represent all of them as

98
00:05:02.199 --> 00:05:05.240
<v Speaker 2>simple neural architectures, maybe just one or two layers deep.

99
00:05:05.399 --> 00:05:06.839
<v Speaker 1>Really? Like, SVMs too?

100
00:05:07.000 --> 00:05:10.279
<v Speaker 2>Yeah. The main differences often just boil down to

101
00:05:10.319 --> 00:05:13.399
<v Speaker 2>the specific loss function they're trying to minimize, and maybe

102
00:05:13.439 --> 00:05:17.079
<v Speaker 2>the activation function in the output layer. For example, logistic

103
00:05:17.120 --> 00:05:21.319
<v Speaker 2>regression for binary classification uses that sigmoid function we mentioned

104
00:05:21.480 --> 00:05:25.680
<v Speaker 2>to output a probability. Its loss function comes from maximizing

105
00:05:25.680 --> 00:05:29.160
<v Speaker 2>the likelihood of the data. The book also contrasts the

106
00:05:29.160 --> 00:05:32.560
<v Speaker 2>original perceptron learning rule, which can be a bit unstable,

107
00:05:32.800 --> 00:05:36.519
<v Speaker 2>with the Hinge loss used by SVMs, which provides better stability.

108
00:05:36.800 --> 00:05:38.759
<v Speaker 2>It shows this kind of shared ancestry.
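
NOTE
A hedged sketch of the shared ancestry just described: one linear unit, three loss
functions. It assumes labels y of +1 or -1 and a raw score s equal to the weighted
sum of the features. The function names and numbers are illustrative assumptions.
The logistic loss is the negative log-likelihood when the sigmoid of y times s is
read as the probability of the correct class.
```python
import math
def score(features, weights):
    """Raw output of a single linear unit (no activation)."""
    return sum(w * x for w, x in zip(weights, features))
def perceptron_loss(y, s):
    """Perceptron criterion: penalizes only misclassified points."""
    return max(0.0, -y * s)
def hinge_loss(y, s):
    """SVM hinge loss: also penalizes correct points that fall inside the margin."""
    return max(0.0, 1.0 - y * s)
def logistic_loss(y, s):
    """Negative log-likelihood of logistic regression for labels in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * s))
# Hypothetical point that is classified correctly, but only barely.
s = score([1.0, 2.0], [0.25, 0.1])
for loss in (perceptron_loss, hinge_loss, logistic_loss):
    print(loss.__name__, round(loss(1, s), 4))
```
The barely-correct point costs nothing under the perceptron criterion but still
incurs hinge loss, one way to see why the margin-based loss gives steadier updates.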

109
00:05:38.879 --> 00:05:41.439
<v Speaker 1>Okay, so those are the shallow ones. But the real magic,

110
00:05:41.600 --> 00:05:44.199
<v Speaker 1>the deep in deep learning, that comes from adding more

111
00:05:44.240 --> 00:05:45.920
<v Speaker 1>layers, right? Stacking them up.

112
00:05:46.079 --> 00:05:49.120
<v Speaker 2>Exactly. That's where the power really scales up. Multilayer

113
00:05:49.199 --> 00:05:52.439
<v Speaker 2>neural networks introduce what we call hidden layers. These are

114
00:05:52.519 --> 00:05:56.079
<v Speaker 2>layers of computation sandwiched between the input and the final output.

115
00:05:56.439 --> 00:06:01.040
<v Speaker 2>You don't directly see their results, hence hidden. Typically, information

116
00:06:01.120 --> 00:06:03.639
<v Speaker 2>flows forward through these layers, one feeding into the next.

117
00:06:03.839 --> 00:06:05.639
<v Speaker 2>We call these feedforward networks.

118
00:06:05.720 --> 00:06:08.360
<v Speaker 1>And what happens inside those hidden layers.

119
00:06:08.160 --> 00:06:11.079
<v Speaker 2>This is where the concept of hierarchical feature engineering comes in.

120
00:06:11.199 --> 00:06:14.480
<v Speaker 2>It's a really powerful idea. Imagine you feed an image

121
00:06:14.519 --> 00:06:17.879
<v Speaker 2>into the network. The first hidden layer might learn to

122
00:06:17.920 --> 00:06:24.199
<v Speaker 2>detect very simple, primitive characteristics, things like horizontal lines, vertical lines,

123
00:06:24.199 --> 00:06:25.720
<v Speaker 2>maybe simple curves or edges.

124
00:06:25.879 --> 00:06:27.240
<v Speaker 1>Okay, basic stuff, right.

125
00:06:27.639 --> 00:06:30.800
<v Speaker 2>Then the next hidden layer takes those simple features as

126
00:06:30.839 --> 00:06:34.000
<v Speaker 2>its input and learns to combine them into slightly more

127
00:06:34.000 --> 00:06:38.519
<v Speaker 2>complex shapes or patterns, maybe corners, circles, simple textures.

128
00:06:38.240 --> 00:06:40.519
<v Speaker 1>Ah, building blocks.

129
00:06:40.199 --> 00:06:44.240
<v Speaker 2>Exactly. And as you go deeper, subsequent layers combine those

130
00:06:44.240 --> 00:06:49.160
<v Speaker 2>features into even more complex, semantically significant characteristics. So maybe

131
00:06:49.199 --> 00:06:51.720
<v Speaker 2>a later layer recognizes combinations that look like an eye

132
00:06:51.839 --> 00:06:55.639
<v Speaker 2>or a wheel, or in the book's example, hexagons or honeycombs.

133
00:06:55.920 --> 00:06:58.600
<v Speaker 2>By the time the information reaches the final layers, it's

134
00:06:58.639 --> 00:07:01.680
<v Speaker 2>represented in a way that makes classification much easier. The

135
00:07:01.720 --> 00:07:03.759
<v Speaker 2>network has learned to see the important patterns.

136
00:07:03.839 --> 00:07:06.120
<v Speaker 1>That makes a lot of sense. It's like learning progressively

137
00:07:06.199 --> 00:07:08.240
<v Speaker 1>more abstract concepts.

138
00:07:07.920 --> 00:07:12.319
<v Speaker 2>Precisely, and a key advantage here is flexibility. You can

139
00:07:12.319 --> 00:07:16.639
<v Speaker 2>adjust the model's complexity, its learning capacity by just adding

140
00:07:16.800 --> 00:07:19.720
<v Speaker 2>or removing neurons or entire layers, depending on how much

141
00:07:19.759 --> 00:07:22.600
<v Speaker 2>data you have or the computational resources available.

142
00:07:22.639 --> 00:07:25.199
<v Speaker 1>That brings up another point from the book, the AI winters.

143
00:07:25.600 --> 00:07:27.800
<v Speaker 1>Why did it take so long for neural networks to

144
00:07:27.839 --> 00:07:30.120
<v Speaker 1>really take off if the ideas were around earlier.

145
00:07:30.240 --> 00:07:33.439
<v Speaker 2>Yeah, that's another aha moment. The core concepts for many

146
00:07:33.480 --> 00:07:37.639
<v Speaker 2>of these networks existed decades ago, but they were held back.

147
00:07:38.160 --> 00:07:42.079
<v Speaker 2>The book really emphasizes that the crucial factors were the

148
00:07:42.199 --> 00:07:46.000
<v Speaker 2>massive increase in data availability, the big data, and the

149
00:07:46.079 --> 00:07:48.800
<v Speaker 2>parallel explosion in computational power, especially with GPUs.

150
00:07:48.519 --> 00:07:50.800
<v Speaker 1>GPUs? The graphics cards?

151
00:07:50.519 --> 00:07:52.839
<v Speaker 2>Exactly. They happen to be incredibly good at the kind

152
00:07:52.879 --> 00:07:56.439
<v Speaker 2>of parallel matrix multiplications that neural networks rely on. So

153
00:07:56.480 --> 00:07:58.600
<v Speaker 2>it was really after maybe 2010, 2011, when

154
00:07:58.600 --> 00:08:01.160
<v Speaker 2>we finally had enough data and enough computing power that

155
00:08:01.199 --> 00:08:05.279
<v Speaker 2>these deeper, more complex models could finally be trained effectively

156
00:08:05.560 --> 00:08:08.279
<v Speaker 2>and show what they were capable of. The resources caught

157
00:08:08.360 --> 00:08:09.720
<v Speaker 2>up with the ideas.

158
00:08:09.519 --> 00:08:13.079
<v Speaker 1>Right. Okay, so training these deep networks sounds like a beast.

159
00:08:13.480 --> 00:08:17.480
<v Speaker 1>How does that learning part actually happen? You mentioned gradients before.

160
00:08:17.680 --> 00:08:20.600
<v Speaker 2>Yeah. The core algorithm, the engine driving the learning, is

161
00:08:20.639 --> 00:08:24.240
<v Speaker 2>called backpropagation. It's essentially a clever way to figure

162
00:08:24.240 --> 00:08:28.240
<v Speaker 2>out how much each connection, each weight in the network

163
00:08:28.319 --> 00:08:31.279
<v Speaker 2>contributed to the overall error on a given training example.

164
00:08:31.759 --> 00:08:35.600
<v Speaker 2>It works in two phases. First, there's a forward pass. You

165
00:08:35.639 --> 00:08:38.360
<v Speaker 2>feed the input data through the network layer by layer

166
00:08:38.639 --> 00:08:41.360
<v Speaker 2>until you get an output. Then you compare that output

167
00:08:41.399 --> 00:08:45.080
<v Speaker 2>to the correct answer and calculate the error or loss.

168
00:08:45.279 --> 00:08:46.399
<v Speaker 1>Okay, see how wrong it was.

169
00:08:46.240 --> 00:08:50.399
<v Speaker 2>Exactly. Then comes the backward pass. Using calculus,

170
00:08:50.399 --> 00:08:53.799
<v Speaker 2>specifically the chain rule, backpropagation calculates the gradient of

171
00:08:53.840 --> 00:08:56.159
<v Speaker 2>the loss with respect to each weight. It figures out

172
00:08:56.200 --> 00:08:59.120
<v Speaker 2>how changing each weight would affect the error. This gradient

173
00:08:59.159 --> 00:09:02.559
<v Speaker 2>information is then propagated backward through the network layer by layer.

174
00:09:02.639 --> 00:09:06.159
<v Speaker 2>It's like assigning blame or credit for the error

175
00:09:06.200 --> 00:09:07.919
<v Speaker 2>back to the connections that caused it.

176
00:09:07.600 --> 00:09:10.159
<v Speaker 1>And then you use that information to adjust the

177
00:09:10.159 --> 00:09:11.440
<v Speaker 1>weights?

178
00:09:12.080 --> 00:09:14.519
<v Speaker 2>Precisely. The most common method is stochastic gradient descent,

179
00:09:14.679 --> 00:09:18.159
<v Speaker 2>or SGD. Instead of calculating the error over the entire

180
00:09:18.279 --> 00:09:21.759
<v Speaker 2>massive data set, which would be incredibly slow, SGD takes

181
00:09:21.759 --> 00:09:24.600
<v Speaker 2>a single training example or maybe a small batch of them,

182
00:09:24.919 --> 00:09:27.840
<v Speaker 2>calculates the gradients and makes a small adjustment to the

183
00:09:27.840 --> 00:09:30.600
<v Speaker 2>weights in the direction that reduces the error. Then it

184
00:09:30.600 --> 00:09:33.879
<v Speaker 2>moves to the next example or batch. It's stochastic because

185
00:09:33.919 --> 00:09:36.679
<v Speaker 2>each update is based on just a small sample, making

186
00:09:36.720 --> 00:09:39.360
<v Speaker 2>it a bit noisy, but much much faster overall.
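
NOTE
A toy sketch of the forward pass, backward pass, and SGD update described above.
It assumes a deliberately tiny network with one tanh hidden unit and one linear
output, a squared-error loss, and one training example per update. The network
shape, the learning rate, and the toy data are all assumptions made for
illustration; they are not the book's example.
```python
import math
import random
def forward(x, w1, w2):
    """Forward pass: hidden activation h, then the output."""
    h = math.tanh(w1 * x)
    return h, w2 * h
def backward(x, y, w1, w2):
    """Backward pass: chain rule from the loss 0.5 * (out - y)**2 back to each weight."""
    h, out = forward(x, w1, w2)
    d_out = out - y                    # dLoss/dOut
    grad_w2 = d_out * h                # dLoss/dW2
    d_h = d_out * w2                   # error passed back to the hidden unit
    grad_w1 = d_h * (1.0 - h * h) * x  # tanh'(z) = 1 - tanh(z)**2
    return grad_w1, grad_w2
def mean_loss(data, w1, w2):
    return sum(0.5 * (forward(x, w1, w2)[1] - y) ** 2 for x, y in data) / len(data)
def sgd(data, w1, w2, lr=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: one small step per training example."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            g1, g2 = backward(x, y, w1, w2)
            w1 -= lr * g1
            w2 -= lr * g2
    return w1, w2
# Toy data generated from a hypothetical target function y = 2 * tanh(0.5 * x).
data = [(k / 2.0, 2.0 * math.tanh(0.5 * k / 2.0)) for k in range(-4, 5)]
w1, w2 = sgd(data, w1=0.1, w2=0.1)
print(mean_loss(data, 0.1, 0.1), mean_loss(data, w1, w2))
```
The two printed numbers are the average loss before and after training. Real
frameworks compute the backward pass automatically, but the blame assignment
follows the same chain-rule pattern.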

187
00:09:39.480 --> 00:09:42.000
<v Speaker 1>Okay, that makes sense, but you mentioned a problem earlier,

188
00:09:42.159 --> 00:09:46.919
<v Speaker 1>something about gradients, uh, vanishing and exploding gradients. That sounds bad.

189
00:09:46.960 --> 00:09:48.080
<v Speaker 1>What's going on there?

190
00:09:48.120 --> 00:09:50.399
<v Speaker 2>Right. This is a huge challenge, especially when you start building

191
00:09:50.399 --> 00:09:54.759
<v Speaker 2>really deep networks. It's a stability issue. Remember how backpropagation

192
00:09:54.919 --> 00:09:58.320
<v Speaker 2>uses the chain rule? That involves multiplying many numbers

193
00:09:58.320 --> 00:10:01.120
<v Speaker 2>together as you go backward through the layers. If those

194
00:10:01.240 --> 00:10:04.480
<v Speaker 2>numbers, related to the derivatives of the activation functions, are

195
00:10:04.519 --> 00:10:08.200
<v Speaker 2>consistently less than one, their product can become incredibly tiny,

196
00:10:08.320 --> 00:10:11.120
<v Speaker 2>almost zero by the time it reaches the early layers.

197
00:10:11.399 --> 00:10:14.840
<v Speaker 2>That's the vanishing gradient problem. The signal just fades away.

198
00:10:14.480 --> 00:10:17.480
<v Speaker 1>So the early layers stop learning effectively?

199
00:10:17.600 --> 00:10:21.159
<v Speaker 2>Yes, they don't get useful information about how to adjust

200
00:10:21.159 --> 00:10:25.840
<v Speaker 2>their weights. Conversely, if those numbers are consistently greater than one,

201
00:10:25.960 --> 00:10:30.000
<v Speaker 2>their product can blow up, becoming astronomically large. That's the

202
00:10:30.080 --> 00:10:34.279
<v Speaker 2>exploding gradient problem. The updates become huge and unstable and

203
00:10:34.320 --> 00:10:35.360
<v Speaker 2>the network diverges.
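
NOTE
A back-of-the-envelope sketch of the vanishing and exploding gradient effect. It
assumes, as a simplification, that every layer contributes the same multiplicative
factor to the chain-rule product; real networks have a different factor at each
layer. The function name is hypothetical.
```python
def backprop_factor(per_layer_factor, depth):
    """Product of `depth` identical per-layer factors, a stand-in for the chain-rule product."""
    product = 1.0
    for _ in range(depth):
        product *= per_layer_factor
    return product
# 0.25 is the largest value the sigmoid's derivative can take.
# 1.5 is an arbitrary factor greater than one, chosen for illustration.
for depth in (5, 20, 50):
    print(depth, backprop_factor(0.25, depth), backprop_factor(1.5, depth))
```
With a factor of 0.25 the product shrinks toward zero as depth grows, and with a
factor of 1.5 it grows without bound, which mirrors the two failure modes above.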

204
00:10:35.679 --> 00:10:38.519
<v Speaker 1>Yikes. Okay, so how do we fix that? How do

205
00:10:38.559 --> 00:10:40.360
<v Speaker 1>we train these deep things reliably?

206
00:10:40.720 --> 00:10:44.200
<v Speaker 2>Well, thankfully, researchers have developed a whole toolkit of techniques

207
00:10:44.240 --> 00:10:47.519
<v Speaker 2>to combat these issues and also to prevent another big

208
00:10:47.559 --> 00:10:48.960
<v Speaker 2>problem, overfitting.

209
00:10:49.039 --> 00:10:52.120
<v Speaker 1>Overfitting, that's when the model just memorizes the training data, right,

210
00:10:52.240 --> 00:10:53.879
<v Speaker 1>but doesn't work well on new stuff?

211
00:10:53.960 --> 00:10:58.519
<v Speaker 2>Exactly. It fails to generalize. So first we have regularization techniques.

212
00:10:58.919 --> 00:11:01.120
<v Speaker 2>Think of these as ways to impose discipline on the

213
00:11:01.159 --> 00:11:04.720
<v Speaker 2>network during training. Weight decay, using L1 or L2

214
00:11:04.759 --> 00:11:07.679
<v Speaker 2>penalties, is common. It adds a cost to having

215
00:11:07.759 --> 00:11:11.360
<v Speaker 2>large weights, encouraging the network to find simpler solutions that

216
00:11:11.360 --> 00:11:15.240
<v Speaker 2>are less likely to overfit. Another simple but effective one

217
00:11:15.399 --> 00:11:18.720
<v Speaker 2>is early stopping. You monitor the network's performance on a

218
00:11:18.720 --> 00:11:21.919
<v Speaker 2>separate data set, a validation set that it doesn't train on.

219
00:11:22.480 --> 00:11:25.039
<v Speaker 2>When the error on that validation set starts to increase,

220
00:11:25.080 --> 00:11:27.240
<v Speaker 2>even if the training error is still decreasing, you just

221
00:11:27.240 --> 00:11:29.679
<v Speaker 2>stop training. The model is starting to overfit.
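
NOTE
A minimal sketch of the early stopping rule described above, run over a list of
validation losses recorded once per epoch. The patience parameter (how many
non-improving epochs to tolerate before stopping) is an added assumption, as are
the function name and the loss values.
```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch with the best validation loss seen before stopping."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss keeps rising: stop training
    return best_epoch
# Hypothetical validation losses: they fall, then rise as overfitting sets in.
print(early_stopping_epoch([0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.65]))
```
In practice the weights saved at the returned epoch are the ones kept.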

222
00:11:29.799 --> 00:11:31.399
<v Speaker 1>Makes sense, stop before it gets worse.

223
00:11:31.559 --> 00:11:34.639
<v Speaker 2>Right. Then there are techniques aimed more directly at the

224
00:11:34.679 --> 00:11:38.279
<v Speaker 2>learning dynamics. Dropout is a really clever one. During training,

225
00:11:38.320 --> 00:11:40.960
<v Speaker 2>for each input or mini-batch, you randomly drop

226
00:11:41.000 --> 00:11:44.799
<v Speaker 2>out, temporarily set to zero, a certain percentage of the neurons

227
00:11:44.799 --> 00:11:45.919
<v Speaker 2>in the hidden layers.

228
00:11:45.720 --> 00:11:46.799
<v Speaker 1>Just switch them off randomly?

229
00:11:47.080 --> 00:11:51.080
<v Speaker 2>Yep. This forces other neurons to learn more robust features

230
00:11:51.279 --> 00:11:54.120
<v Speaker 2>because they can't rely too much on any single other

231
00:11:54.200 --> 00:11:57.519
<v Speaker 2>neuron always being there. It's like training a team where

232
00:11:57.559 --> 00:12:01.320
<v Speaker 2>players might randomly be unavailable. Everyone has to be more versatile.

233
00:12:01.679 --> 00:12:05.879
<v Speaker 2>It acts like training many different smaller networks simultaneously. And

234
00:12:06.039 --> 00:12:10.320
<v Speaker 2>batch normalization is another lifesaver, especially for very deep networks.

235
00:12:10.960 --> 00:12:14.159
<v Speaker 2>It normalizes the activations within each mini-batch during training,

236
00:12:14.360 --> 00:12:17.879
<v Speaker 2>basically rescaling them to have a consistent mean and variance.

237
00:12:17.559 --> 00:12:19.679
<v Speaker 1>Like tuning the signal?

238
00:12:19.799 --> 00:12:21.759
<v Speaker 2>Kind of, yeah. It helps keep the signals flowing through the network in

239
00:12:21.799 --> 00:12:24.519
<v Speaker 2>a healthy range, preventing them from becoming too large or

240
00:12:24.559 --> 00:12:28.000
<v Speaker 2>too small, which helps stabilize training and allows for faster learning.

241
00:12:28.440 --> 00:12:31.120
<v Speaker 1>Okay, wow, that's a lot of tricks. Anything else?
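
NOTE
A rough sketch of two of the tricks just covered, dropout and batch normalization,
applied to a plain list of activations. Two assumptions to flag: the dropout here
is the "inverted" variant, which rescales the surviving activations during
training, and the batch normalization omits the learnable scale and shift that
real layers add afterwards. Function names and values are illustrative.
```python
import random
def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the survivors."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
def batch_normalize(values, eps=1e-5):
    """Rescale one mini-batch of activations to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]
rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng))
print(batch_normalize([1.0, 2.0, 3.0, 4.0]))
```
Dropout is applied only during training. At prediction time all neurons stay on.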

242
00:12:31.440 --> 00:12:34.919
<v Speaker 2>Oh, yeah, we also have adaptive learning rate methods. Instead

243
00:12:34.960 --> 00:12:37.720
<v Speaker 2>of using one fixed learning rate for the entire network,

244
00:12:38.120 --> 00:12:42.639
<v Speaker 2>algorithms like AdaGrad, RMSProp, and the very popular Adam

245
00:12:43.120 --> 00:12:46.720
<v Speaker 2>dynamically adjust the learning rate for each parameter individually. They

246
00:12:46.759 --> 00:12:49.519
<v Speaker 2>can speed up learning for slow parameters and slow it

247
00:12:49.519 --> 00:12:53.200
<v Speaker 2>down for fast ones, helping convergence. Weight initialization is also

248
00:12:53.279 --> 00:12:56.720
<v Speaker 2>surprisingly important. If you start all weights at zero, all

249
00:12:56.720 --> 00:12:58.840
<v Speaker 2>neurons in a layer will learn the exact same thing,

250
00:12:59.080 --> 00:13:03.519
<v Speaker 2>So you need randomized initialization like Xavier, or Glorot, initialization

251
00:13:03.919 --> 00:13:06.399
<v Speaker 2>to break that symmetry and get things going. And finally,

252
00:13:06.480 --> 00:13:09.799
<v Speaker 2>especially for things like images, data augmentation is huge. You

253
00:13:09.879 --> 00:13:12.639
<v Speaker 2>create more training data by applying random transformations to your

254
00:13:12.679 --> 00:13:16.399
<v Speaker 2>existing data, rotating images, shifting them, changing brightness, stuff like that.

255
00:13:16.480 --> 00:13:18.399
<v Speaker 2>It makes the model more robust to variations.
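
NOTE
A compact sketch of two ideas from this stretch: Glorot (Xavier) uniform
initialization and a single-parameter Adam update. The hyperparameter defaults
follow the values commonly quoted for Adam, while the toy objective
f(w) = (w - 3)**2 and the learning rate of 0.05 are assumptions chosen only to
make the demonstration short.
```python
import math
import random
def glorot_uniform(fan_in, fan_out, rng):
    """Random weights drawn uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
print(glorot_uniform(3, 2, random.Random(0)))
# Minimize the toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t, lr=0.05)
print(w)
```
Dividing by the running root-mean-square of the gradient is what gives each
parameter its own effective learning rate.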

256
00:13:18.600 --> 00:13:21.240
<v Speaker 1>That's quite a toolbox. So putting it all together, what

257
00:13:21.279 --> 00:13:24.679
<v Speaker 1>does deploying these models actually involve in practice.

258
00:13:24.120 --> 00:13:27.159
<v Speaker 2>Well, it means a lot of careful hyperparameter tuning,

259
00:13:27.639 --> 00:13:30.600
<v Speaker 2>finding the right learning rate, the right amount of regularization,

260
00:13:31.039 --> 00:13:34.399
<v Speaker 2>the best network architecture. That often involves experimenting and using

261
00:13:34.399 --> 00:13:37.639
<v Speaker 2>those validation sets to see what works best. The book

262
00:13:37.720 --> 00:13:39.759
<v Speaker 2>mentions that for the huge data sets we have today,

263
00:13:39.799 --> 00:13:42.519
<v Speaker 2>people might use splits like ninety eight percent for training,

264
00:13:42.879 --> 00:13:46.120
<v Speaker 2>one percent for validation, and one percent for final testing,

265
00:13:46.360 --> 00:13:48.600
<v Speaker 2>which is different from older rules of thumb for smaller

266
00:13:48.679 --> 00:13:49.279
<v Speaker 2>data sets.

267
00:13:49.600 --> 00:13:51.120
<v Speaker 1>And you mentioned GPUs earlier.

268
00:13:51.360 --> 00:13:54.919
<v Speaker 2>Absolutely critical. Training these models involves tons and tons of

269
00:13:54.919 --> 00:13:59.120
<v Speaker 2>matrix multiplications. GPUs are designed for parallel processing and have

270
00:13:59.200 --> 00:14:02.480
<v Speaker 2>high memory bandwidth, making them orders of magnitude faster

271
00:14:02.559 --> 00:14:05.799
<v Speaker 2>than traditional CPUs for this kind of work. Training deep

272
00:14:05.840 --> 00:14:10.519
<v Speaker 2>models without GPUs would be practically impossible or at least incredibly slow.

273
00:14:09.960 --> 00:14:12.000
<v Speaker 1>And sometimes you need more than one GPU?

274
00:14:11.840 --> 00:14:15.159
<v Speaker 2>For really big models or data sets, yes. You

275
00:14:15.240 --> 00:14:18.039
<v Speaker 2>might use data parallelism where you split the data across

276
00:14:18.120 --> 00:14:21.039
<v Speaker 2>multiple GPUs, each training a copy of the model, or

277
00:14:21.080 --> 00:14:24.120
<v Speaker 2>even model parallelism where different parts of the neural network

278
00:14:24.120 --> 00:14:27.159
<v Speaker 2>itself are spread across different GPUs because the whole model

279
00:14:27.200 --> 00:14:28.039
<v Speaker 2>is too big to fit on one.

280
00:14:27.960 --> 00:14:30.440
<v Speaker 1>Okay, that gives a much clearer picture of

281
00:14:30.480 --> 00:14:33.720
<v Speaker 1>the training process and challenges. So we've got the basics,

282
00:14:33.919 --> 00:14:37.519
<v Speaker 1>the depth, the training. Now let's dive into some specific

283
00:14:37.799 --> 00:14:40.960
<v Speaker 1>types of networks. The book talks about architectures designed for

284
00:14:40.960 --> 00:14:42.080
<v Speaker 1>different kinds of data.

285
00:14:42.200 --> 00:14:46.600
<v Speaker 2>Right, exactly. Neural networks are incredibly versatile, partly because we

286
00:14:46.639 --> 00:14:50.639
<v Speaker 2>can design specialized architectures. Let's start with probably the most

287
00:14:50.639 --> 00:14:54.960
<v Speaker 2>famous one for images, convolutional neural networks or CNNs.

288
00:14:54.840 --> 00:14:57.080
<v Speaker 1>Right, the ones inspired by the visual cortex?

289
00:14:57.240 --> 00:15:00.399
<v Speaker 2>Loosely, yes. The key idea in CNNs is how

290
00:15:00.399 --> 00:15:03.679
<v Speaker 2>they process spatial data like images. They typically work with

291
00:15:03.759 --> 00:15:07.960
<v Speaker 2>layers that have three dimensions: height, width, and depth. Depth

292
00:15:08.000 --> 00:15:10.399
<v Speaker 2>here refers to the number of channels like red, green,

293
00:15:10.480 --> 00:15:12.840
<v Speaker 2>blue in the input or different feature maps in the

294
00:15:12.879 --> 00:15:16.320
<v Speaker 2>hidden layers. The core operation is the convolution. You have

295
00:15:16.360 --> 00:15:18.799
<v Speaker 2>these small filters. You can think of them as pattern detectors.

296
00:15:19.039 --> 00:15:21.840
<v Speaker 2>Maybe one looks for vertical edges, another for horizontal edges,

297
00:15:21.879 --> 00:15:25.559
<v Speaker 2>another for a specific texture. These filters slide across the input

298
00:15:25.600 --> 00:15:28.000
<v Speaker 2>image or the feature map from the previous layer and

299
00:15:28.039 --> 00:15:32.360
<v Speaker 2>compute activations. Where the filter finds its specific pattern, it

300
00:15:32.399 --> 00:15:35.320
<v Speaker 2>produces a strong activation in the output feature map.

301
00:15:35.279 --> 00:15:38.159
<v Speaker 1>So each filter creates its own map, highlighting where it

302
00:15:38.240 --> 00:15:38.799
<v Speaker 1>found its pattern.

303
00:15:38.639 --> 00:15:42.840
<v Speaker 2>Precisely. And a key aspect is parameter sharing. The

304
00:15:42.879 --> 00:15:45.919
<v Speaker 2>same filter is used across the entire image, which makes

305
00:15:45.919 --> 00:15:49.879
<v Speaker 2>CNNs efficient and helps them recognize patterns regardless of where

306
00:15:49.879 --> 00:15:53.519
<v Speaker 2>they appear. These convolutional layers are usually paired with ReLU

307
00:15:53.639 --> 00:15:57.279
<v Speaker 2>activations and then often followed by pooling layers. Max

308
00:15:57.320 --> 00:16:00.720
<v Speaker 2>pooling is common. It downsamples the feature map, making it

309
00:16:00.759 --> 00:16:04.200
<v Speaker 2>smaller by taking the maximum value in small regions. This

310
00:16:04.279 --> 00:16:07.320
<v Speaker 2>helps reduce computation and makes the learned features more robust

311
00:16:07.360 --> 00:16:09.879
<v Speaker 2>to small shifts or distortions.
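
NOTE
Editor's sketch, not from the book: ReLU followed by max pooling on a toy 4x4 feature map. The function names and the input values are hypothetical, and the pooling assumes non-overlapping 2x2 regions.
```python
def relu(feature_map):
    # Replace every negative activation with zero.
    return [[max(0, v) for v in row] for row in feature_map]
def max_pool(feature_map, size=2):
    # Keep only the largest value in each non-overlapping size x size region.
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        pooled.append(row)
    return pooled
activations = [[1, -3, 2, 0], [0, 5, -1, -2], [-4, 2, 0, 1], [3, -1, 6, -5]]
pooled = max_pool(relu(activations))
print(pooled)
```
The 4x4 map shrinks to 2x2, so later layers have a quarter as many values to process.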

312
00:16:09.320 --> 00:16:11.320
<v Speaker 1>And these are the networks behind image recognition?

313
00:16:11.000 --> 00:16:16.799
<v Speaker 2>Absolutely. Image classification, object detection: CNNs have driven huge

314
00:16:16.840 --> 00:16:20.120
<v Speaker 2>breakthroughs there. The book mentions some landmark architectures that came

315
00:16:20.159 --> 00:16:23.039
<v Speaker 2>out of research and competitions. There was AlexNet, which

316
00:16:23.039 --> 00:16:25.559
<v Speaker 2>really kicked off the deep learning revolution in images around

317
00:16:25.559 --> 00:16:29.200
<v Speaker 2>twenty twelve. Then ZFNet improved on it. GoogLeNet

318
00:16:29.200 --> 00:16:32.559
<v Speaker 2>introduced these clever inception modules that process features at different

319
00:16:32.559 --> 00:16:36.720
<v Speaker 2>scales simultaneously and reduce the number of parameters, and ResNet,

320
00:16:36.799 --> 00:16:40.919
<v Speaker 2>or residual networks, introduced skip connections. Skip connections, yeah. They

321
00:16:40.919 --> 00:16:43.679
<v Speaker 2>allowed the gradient information to flow more easily through very

322
00:16:43.679 --> 00:16:47.720
<v Speaker 2>deep networks by creating shortcuts, essentially letting the signal bypass

323
00:16:47.759 --> 00:16:51.720
<v Speaker 2>some layers. This allowed researchers to train networks with hundreds,

324
00:16:51.759 --> 00:16:53.120
<v Speaker 2>even over one thousand layers.
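
NOTE
Editor's sketch, not from the book: a toy comparison of 50 stacked plain layers against 50 residual blocks. The layer function, the weights, and the input vector are hypothetical. The example only shows the forward signal; the same shortcut path is what lets gradients flow back more easily during training.
```python
def layer(x, weight):
    # Toy "layer": scale every element, then apply ReLU.
    return [max(0.0, weight * v) for v in x]
def residual_block(x, weight):
    # Skip connection: add the block's input back onto the layer's output.
    return [xi + fi for xi, fi in zip(x, layer(x, weight))]
x = [1.0, -2.0, 3.0]
plain = x
skipped = x
for _ in range(50):
    plain = layer(plain, 0.5)                # signal shrinks at every layer
    skipped = residual_block(skipped, 0.0)   # layer contributes nothing; input still passes
print(plain)
print(skipped)
```
In the plain stack the signal has all but vanished after 50 layers, while the residual stack still carries the original input through the shortcuts.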

325
00:16:53.200 --> 00:16:57.120
<v Speaker 1>Wow. Okay, so CNNs are for images. What about data

326
00:16:57.159 --> 00:17:00.080
<v Speaker 1>that comes in sequences like text or speech where the

327
00:17:00.159 --> 00:17:01.039
<v Speaker 1>order is critical?

328
00:17:01.159 --> 00:17:04.519
<v Speaker 2>That's the domain of recurrent neural networks or RNNs. Their

329
00:17:04.519 --> 00:17:08.000
<v Speaker 2>defining feature is a kind of memory. They process sequences

330
00:17:08.000 --> 00:17:11.200
<v Speaker 2>step by step, and at each step the output depends

331
00:17:11.279 --> 00:17:13.680
<v Speaker 2>not only on the current input, but also on a

332
00:17:13.720 --> 00:17:17.079
<v Speaker 2>hidden state that summarizes information from previous steps.

333
00:17:16.720 --> 00:17:18.759
<v Speaker 1>So they remember what came before.

334
00:17:19.359 --> 00:17:22.400
<v Speaker 2>In a sense, yes. You can visualize an RNN as

335
00:17:22.400 --> 00:17:25.319
<v Speaker 2>having a loop. The hidden state from one time step

336
00:17:25.359 --> 00:17:27.519
<v Speaker 2>feeds back into the network at the next time step.

337
00:17:28.119 --> 00:17:30.400
<v Speaker 2>A useful way to think about it, especially for training,

338
00:17:30.480 --> 00:17:34.160
<v Speaker 2>is to unfurl or unroll this loop over time. It

339
00:17:34.200 --> 00:17:37.039
<v Speaker 2>looks like a very deep feed forward network, but with

340
00:17:37.079 --> 00:17:40.079
<v Speaker 2>a crucial difference. The same set of weights is used

341
00:17:40.119 --> 00:17:43.000
<v Speaker 2>at every single time step. This weight sharing is key

342
00:17:43.000 --> 00:17:45.200
<v Speaker 2>for learning patterns that apply across the sequence.
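
NOTE
Editor's sketch, not from the book: an RNN unrolled over a short sequence, using scalar inputs and a scalar hidden state for readability. The function name rnn_forward, the tanh activation, and the weight values are assumptions chosen for illustration.
```python
import math
def rnn_forward(inputs, w_x, w_h, h0=0.0):
    # The same two weights (w_x, w_h) are reused at every time step.
    h = h0
    states = []
    for x in inputs:
        # New hidden state depends on the current input and the previous hidden state.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5)
print(states)
```
The inputs at the second and third steps are zero, yet the hidden state stays positive: it is still carrying a fading trace of the first input.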

343
00:17:45.240 --> 00:17:46.759
<v Speaker 1>What's a typical use case?

344
00:17:46.839 --> 00:17:49.799
<v Speaker 2>Language modeling is a classic one: predicting the next word

345
00:17:49.839 --> 00:17:53.039
<v Speaker 2>in a sentence. The book mentions a cool example by

346
00:17:53.119 --> 00:17:56.839
<v Speaker 2>Andrej Karpathy, who trained an RNN character by character on

347
00:17:56.920 --> 00:18:00.559
<v Speaker 2>Shakespeare's plays. After just a few training iterations, it

348
00:18:00.599 --> 00:18:04.440
<v Speaker 2>produced complete gibberish, but after many more iterations it started

349
00:18:04.519 --> 00:18:09.839
<v Speaker 2>generating text that looked syntactically like Shakespeare, correctly spelled words, punctuation,

350
00:18:10.319 --> 00:18:13.720
<v Speaker 2>line breaks. Even though the meaning was nonsensical, it showed

351
00:18:13.720 --> 00:18:16.160
<v Speaker 2>the RNN was learning the structure of the language.

352
00:18:16.400 --> 00:18:19.319
<v Speaker 1>That's pretty cool. But do RNNs have issues too, like

353
00:18:19.359 --> 00:18:20.480
<v Speaker 1>the gradient problems?

354
00:18:20.519 --> 00:18:24.240
<v Speaker 2>Oh, definitely. Those vanishing and exploding gradients we talked about

355
00:18:24.240 --> 00:18:27.640
<v Speaker 2>are a major problem for basic RNNs, especially when dealing

356
00:18:27.640 --> 00:18:31.200
<v Speaker 2>with long sequences. Trying to propagate information over many time

357
00:18:31.240 --> 00:18:34.200
<v Speaker 2>steps is difficult. This led to the development of more

358
00:18:34.240 --> 00:18:37.960
<v Speaker 2>sophisticated recurrent units, most famously the long short term memory

359
00:18:38.400 --> 00:18:39.880
<v Speaker 2>or LSTM.

360
00:18:39.920 --> 00:18:40.960
<v Speaker 1>LSTM, heard of that one.

361
00:18:41.160 --> 00:18:43.519
<v Speaker 2>Yeah, LSTMs are a type of RNN cell designed

362
00:18:43.559 --> 00:18:47.200
<v Speaker 2>specifically to combat the vanishing gradient problem and capture long

363
00:18:47.279 --> 00:18:52.440
<v Speaker 2>range dependencies. They have internal mechanisms called gates: an input gate,

364
00:18:52.720 --> 00:18:55.640
<v Speaker 2>a forget gate, and an output gate, and a separate

365
00:18:55.680 --> 00:18:58.400
<v Speaker 2>cell state that acts like a conveyor belt for information.

366
00:18:59.279 --> 00:19:01.799
<v Speaker 2>These gates learn to control what information is added to

367
00:19:01.880 --> 00:19:05.119
<v Speaker 2>the cell state, what's removed, and what affects the output

368
00:19:05.160 --> 00:19:08.319
<v Speaker 2>at each step. It allows them to maintain important information

369
00:19:08.440 --> 00:19:12.519
<v Speaker 2>over much longer periods. More recently, things like layer normalization

370
00:19:12.839 --> 00:19:14.599
<v Speaker 2>have also helped improve RNN stability.

371
00:19:14.400 --> 00:19:19.039
<v Speaker 1>So LSTMs are better at remembering long-term patterns?
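
NOTE
Editor's sketch, not from the book: one step of an LSTM cell with scalar state, using the commonly taught gate equations. The function name lstm_step, the parameter layout, and the parameter values are hypothetical. The values are chosen so the forget gate stays almost fully open and the input gate almost closed, to show the cell state carrying a value across many steps.
```python
import math
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
def lstm_step(x, h_prev, c_prev, p):
    # p maps a gate name to (input weight, recurrent weight, bias); all scalars here.
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate values that could be written
    c = f * c_prev + i * g            # the "conveyor belt": mostly additive update
    h = o * math.tanh(c)
    return h, c
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.9] * 10:
    h, c = lstm_step(x, h, c, params)
print(c)
```
After 30 steps of unrelated inputs the cell state is still close to its starting value of 1.0. In a trained network the gate values are learned rather than fixed like this.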

372
00:19:18.599 --> 00:19:21.440
<v Speaker 2>Much better, generally speaking, and they've been crucial for many

373
00:19:21.440 --> 00:19:25.759
<v Speaker 2>applications: machine translation, often using an encoder-decoder structure where

374
00:19:25.759 --> 00:19:28.119
<v Speaker 2>one RNN reads the input sequence and another generates the

375
00:19:28.119 --> 00:19:32.759
<v Speaker 2>output sequence. Google Translate uses this heavily. Also building conversational

376
00:19:32.759 --> 00:19:36.240
<v Speaker 2>AI systems, chatbots, doing things like named entity recognition in

377
00:19:36.319 --> 00:19:40.759
<v Speaker 2>text, like identifying names or locations, and even powering recommender systems.

378
00:19:40.960 --> 00:19:46.039
<v Speaker 1>Okay, CNNs for space, RNNs and LSTMs for time or sequence.

379
00:19:46.920 --> 00:19:50.799
<v Speaker 1>What if the goal is different, like compressing data or

380
00:19:50.920 --> 00:19:52.720
<v Speaker 1>finding a new way to represent it.

381
00:19:52.720 --> 00:19:55.720
<v Speaker 2>That's where autoencoders come into play. The fundamental idea

382
00:19:55.920 --> 00:19:58.920
<v Speaker 2>is pretty elegant. An autoencoder is a neural network

383
00:19:58.960 --> 00:20:00.599
<v Speaker 2>trained to reconstruct its own input.

384
00:20:00.880 --> 00:20:03.440
<v Speaker 1>Reconstruct its input. What's the point of that?

385
00:20:03.839 --> 00:20:06.720
<v Speaker 2>Ah, the trick is in the middle. The network usually

386
00:20:06.759 --> 00:20:09.960
<v Speaker 2>has a bottleneck layer, a hidden layer with fewer neurons

387
00:20:10.000 --> 00:20:13.920
<v Speaker 2>than the input or output layers. To successfully reconstruct the input,

388
00:20:14.000 --> 00:20:16.839
<v Speaker 2>the network is forced to learn a compressed representation, a

389
00:20:16.880 --> 00:20:19.839
<v Speaker 2>sort of code, in that bottleneck layer. It has to

390
00:20:19.839 --> 00:20:22.079
<v Speaker 2>figure out the most essential features of the data to

391
00:20:22.119 --> 00:20:24.759
<v Speaker 2>squeeze it through the bottleneck and then reconstruct it. They're

392
00:20:24.759 --> 00:20:26.400
<v Speaker 2>sometimes called replicator networks.

393
00:20:25.920 --> 00:20:30.720
<v Speaker 1>So it's learning a compressed version, like dimensionality reduction?

394
00:20:31.079 --> 00:20:35.079
<v Speaker 2>Exactly. Basic autoencoders with a linear activation function essentially learn

395
00:20:35.160 --> 00:20:39.799
<v Speaker 2>the same subspace as principal component analysis PCA, but the

396
00:20:39.839 --> 00:20:42.400
<v Speaker 2>real power comes when you make them deep autoencoders

397
00:20:42.720 --> 00:20:47.200
<v Speaker 2>with multiple hidden layers and nonlinear activation functions like ReLU.

398
00:20:47.519 --> 00:20:51.359
<v Speaker 2>These can learn much more complex nonlinear transformations of the data,

399
00:20:51.400 --> 00:20:55.359
<v Speaker 2>effectively disentangling data that might lie on a complicated manifold.

400
00:20:54.960 --> 00:20:56.440
<v Speaker 1>Better than something like PCA, then?

401
00:20:56.519 --> 00:21:00.680
<v Speaker 2>For complex nonlinear structures, often yes. The book notes that they

402
00:21:00.720 --> 00:21:04.039
<v Speaker 2>can provide better class separation than linear methods, and while

403
00:21:04.119 --> 00:21:07.559
<v Speaker 2>something like t-SNE is great for visualization, autoencoders are

404
00:21:07.599 --> 00:21:10.319
<v Speaker 2>generally better if you need to apply the learned transformation

405
00:21:10.559 --> 00:21:12.200
<v Speaker 2>to new unseen data points.
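
NOTE
Editor's sketch, not from the book: a tiny linear autoencoder with a one-number bottleneck, trained by plain gradient descent on 2-D points that all lie along the direction (1, 2). It assumes tied weights (the same vector w encodes and decodes), squared reconstruction error, and a hand-picked learning rate; all names and values are hypothetical. For this data the first principal component also points along (1, 2), so the learned w should line up with it.
```python
data = [(t * 1.0, t * 2.0) for t in (-2.0, -1.0, 1.0, 2.0)]
w = [0.5, 0.5]   # arbitrary starting weights
lr = 0.001
for _ in range(300):
    grad = [0.0, 0.0]
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]             # encode: squeeze the point into one number
        r = [x[0] - z * w[0], x[1] - z * w[1]]    # reconstruction error after decoding z * w
        rw = r[0] * w[0] + r[1] * w[1]
        for j in range(2):
            # Gradient of the squared reconstruction error with respect to w[j].
            grad[j] += -2.0 * (z * r[j] + rw * x[j])
    w = [w[j] - lr * grad[j] for j in range(2)]
print(w)
```
Because the encoder is just a learned function, a new unseen point can be compressed by computing w[0] * x[0] + w[1] * x[1], with no refitting.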

406
00:21:12.400 --> 00:21:14.759
<v Speaker 1>Are there different kinds of autoencoders?

407
00:21:14.640 --> 00:21:19.880
<v Speaker 2>Yes, several interesting variants. Sparse autoencoders add a penalty

408
00:21:19.920 --> 00:21:23.559
<v Speaker 2>to encourage most hidden units to be inactive, outputting zero,

409
00:21:23.839 --> 00:21:29.240
<v Speaker 2>leading to sparse representations. Denoising autoencoders are trained to

410
00:21:29.279 --> 00:21:32.599
<v Speaker 2>reconstruct the original clean input from a version that has

411
00:21:32.640 --> 00:21:36.319
<v Speaker 2>been artificially corrupted with noise. This forces them to learn

412
00:21:36.480 --> 00:21:41.039
<v Speaker 2>robust features that aren't sensitive to noise. And variational

413
00:21:41.119 --> 00:21:45.480
<v Speaker 2>autoencoders, or VAEs, are a more probabilistic take. They learn

414
00:21:45.519 --> 00:21:48.119
<v Speaker 2>a distribution in the bottleneck layer, which is really useful

415
00:21:48.119 --> 00:21:50.759
<v Speaker 2>for generating new data samples that look similar to the

416
00:21:50.799 --> 00:21:51.519
<v Speaker 2>training data.

417
00:21:51.720 --> 00:21:53.400
<v Speaker 1>Before we jump to the really cutting edge stuff, the

418
00:21:53.400 --> 00:21:57.079
<v Speaker 1>book mentions some forgotten architectures, ones that were important historically.

419
00:21:57.160 --> 00:21:59.640
<v Speaker 2>Yeah, it's good to acknowledge the stepping stones. Radial basis

420
00:21:59.640 --> 00:22:02.920
<v Speaker 2>function networks, or RBF networks, for example. They typically have

421
00:22:02.960 --> 00:22:06.000
<v Speaker 2>a hidden layer where neurons compute the similarity of the

422
00:22:06.039 --> 00:22:09.400
<v Speaker 2>input to certain prototype vectors. This makes them related to

423
00:22:09.440 --> 00:22:13.039
<v Speaker 2>methods like kernel machines or k-nearest neighbors. And restricted

424
00:22:13.039 --> 00:22:16.920
<v Speaker 2>Boltzmann machines, RBMs. These were quite important for a while,

425
00:22:17.039 --> 00:22:21.039
<v Speaker 2>especially for pre training deep networks, before modern techniques made

426
00:22:21.119 --> 00:22:25.079
<v Speaker 2>end to end training feasible. RBMs are energy based models,

427
00:22:25.119 --> 00:22:28.960
<v Speaker 2>borrowing ideas from statistical physics. They're good at learning patterns,

428
00:22:29.160 --> 00:22:32.079
<v Speaker 2>especially in binary data, and can be stacked to form

429
00:22:32.440 --> 00:22:36.279
<v Speaker 2>deep belief networks. They were used for tasks like collaborative

430
00:22:36.279 --> 00:22:40.599
<v Speaker 2>filtering and initializing deeper networks. While less common as primary

431
00:22:40.640 --> 00:22:42.880
<v Speaker 2>models now, their ideas were influential.

432
00:22:43.599 --> 00:22:46.279
<v Speaker 1>It's fascinating how many different ways there are to structure

433
00:22:46.319 --> 00:22:49.480
<v Speaker 1>these networks. Okay, so beyond just learning from data, networks

434
00:22:49.480 --> 00:22:53.400
<v Speaker 1>are now doing things that seem more intelligent, like making

435
00:22:53.440 --> 00:22:55.519
<v Speaker 1>decisions or even creating new things.

436
00:22:55.559 --> 00:22:58.720
<v Speaker 2>Absolutely, this takes us into areas like deep reinforcement learning

437
00:22:58.880 --> 00:23:02.759
<v Speaker 2>or deep RL. Here the learning paradigm shifts. Instead of

438
00:23:02.799 --> 00:23:06.200
<v Speaker 2>learning from labeled examples, as in supervised learning, the agent learns through

439
00:23:06.480 --> 00:23:10.000
<v Speaker 2>reward-guided trial and error, much like how humans or animals learn.

440
00:23:09.680 --> 00:23:11.440
<v Speaker 1>Trial and error. How does that work?

441
00:23:11.680 --> 00:23:15.079
<v Speaker 2>The agent takes actions in an environment. These actions change

442
00:23:15.119 --> 00:23:17.720
<v Speaker 2>the state of the environment and potentially lead to rewards

443
00:23:17.759 --> 00:23:20.880
<v Speaker 2>or penalties. The goal is to learn a policy, a

444
00:23:20.920 --> 00:23:25.640
<v Speaker 2>strategy for choosing actions that maximizes the total cumulative reward

445
00:23:25.680 --> 00:23:30.119
<v Speaker 2>over time. It involves balancing exploration, trying new things to

446
00:23:30.160 --> 00:23:33.599
<v Speaker 2>see what happens, and exploitation, sticking with actions known to

447
00:23:33.680 --> 00:23:37.960
<v Speaker 2>yield good rewards. The classic multi armed bandit problem is

448
00:23:37.960 --> 00:23:39.799
<v Speaker 2>a simple illustration of this trade off.
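
NOTE
Editor's sketch, not from the book: a three-armed bandit played with an epsilon-greedy rule, one common way to balance exploration and exploitation. The function name run_bandit, the payout probabilities, epsilon = 0.1, and the fixed seed are all assumptions made for illustration.
```python
import random
def run_bandit(true_probs, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_probs)        # how often each arm was pulled
    estimates = [0.0] * len(true_probs)   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_probs))   # explore: pick a random arm
        else:
            arm = max(range(len(true_probs)), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the average reward observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(estimates, counts)
```
With epsilon set to zero the agent can lock onto whichever arm happened to pay first; the small amount of random exploration is what lets it find the arm that really pays best.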

449
00:23:40.079 --> 00:23:42.640
<v Speaker 1>So it learns by doing.

450
00:23:43.119 --> 00:23:45.839
<v Speaker 2>Exactly. Think of learning to play a game like tic-tac-toe.

451
00:23:45.920 --> 00:23:48.720
<v Speaker 2>The RL agent might start by making random moves. When

452
00:23:48.720 --> 00:23:51.279
<v Speaker 2>it eventually wins, the sequence of moves leading to that

453
00:23:51.319 --> 00:23:55.880
<v Speaker 2>win gets positively reinforced. Moves leading to losses get negatively reinforced.

454
00:23:56.559 --> 00:23:58.720
<v Speaker 2>Over many games, it learns the value of different board

455
00:23:58.759 --> 00:24:01.920
<v Speaker 2>positions and actions. The deep part comes from using deep

456
00:24:01.920 --> 00:24:04.519
<v Speaker 2>neural networks to represent the policy or to estimate the

457
00:24:04.599 --> 00:24:07.920
<v Speaker 2>value of states and actions, especially in complex environments with

458
00:24:08.000 --> 00:24:10.839
<v Speaker 2>huge state spaces like video games or robotics.

459
00:24:10.960 --> 00:24:12.359
<v Speaker 1>And this is what was used in AlphaGo?

460
00:24:12.759 --> 00:24:16.119
<v Speaker 2>Yes, that's a prime example and a real aha moment.

461
00:24:17.039 --> 00:24:20.680
<v Speaker 2>AlphaGo and later AlphaZero used deep RL to master

462
00:24:20.799 --> 00:24:24.039
<v Speaker 2>Go and chess. What was remarkable wasn't just that they

463
00:24:24.039 --> 00:24:26.559
<v Speaker 2>beat the best humans, but how they did it. They

464
00:24:26.680 --> 00:24:29.759
<v Speaker 2>used deep networks to learn patterns and evaluate board positions

465
00:24:29.759 --> 00:24:32.880
<v Speaker 2>from scratch, just by playing millions of games against themselves.

466
00:24:33.359 --> 00:24:37.519
<v Speaker 2>AlphaZero discovered strategies and made moves, like sacrificing material

467
00:24:37.599 --> 00:24:41.079
<v Speaker 2>for positional advantage in chess, that were novel and sometimes

468
00:24:41.079 --> 00:24:45.279
<v Speaker 2>counterintuitive even to human grandmasters. It demonstrated an ability

469
00:24:45.359 --> 00:24:49.160
<v Speaker 2>to discover knowledge autonomously through experience, which is a hallmark

470
00:24:49.200 --> 00:24:49.720
<v Speaker 2>of RL.

471
00:24:50.200 --> 00:24:52.880
<v Speaker 1>That's incredible. It's not just following rules, it's finding new

472
00:24:52.920 --> 00:24:53.839
<v Speaker 1>ones.

473
00:24:54.319 --> 00:24:57.559
<v Speaker 2>Precisely. The core mechanisms often involve things like Q-learning, learning

474
00:24:57.640 --> 00:25:00.559
<v Speaker 2>the expected future reward or quality of taking an action

475
00:25:00.640 --> 00:25:03.559
<v Speaker 2>in a state, or policy gradients, directly learning the policy

476
00:25:03.559 --> 00:25:06.920
<v Speaker 2>function that maps states to probabilities of actions. Besides games,

477
00:25:06.960 --> 00:25:09.640
<v Speaker 2>deep RL is being applied to robot control, like learning to

478
00:25:09.680 --> 00:25:13.680
<v Speaker 2>walk or grasp objects, optimizing complex systems, and even potentially

479
00:25:13.759 --> 00:25:16.799
<v Speaker 2>training conversational agents that can negotiate or complete tasks.
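
NOTE
Editor's sketch, not from the book: the standard tabular Q-learning update on a hypothetical two-state toy problem. From "start", moving right reaches the goal and earns a reward of 1, while moving left stays put and earns nothing. The state names, learning rate alpha, and discount gamma are assumptions. Deep RL replaces the table q with a neural network.
```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    # Best value available from the next state (0.0 if it is terminal).
    best_next = max(q[next_state].values()) if q[next_state] else 0.0
    # Move the estimate toward: immediate reward + discounted future value.
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
q = {"start": {"left": 0.0, "right": 0.0}, "goal": {}}
for _ in range(20):
    q_update(q, "start", "right", 1.0, "goal")
    q_update(q, "start", "left", 0.0, "start")
print(q["start"])
```
The value for "right" converges toward 1.0, and "left" settles below it because its only payoff is the discounted chance to go right later. A greedy policy simply picks the action with the larger Q-value.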

480
00:25:17.079 --> 00:25:21.000
<v Speaker 1>Okay, that's learning by doing. What about generating completely new

481
00:25:21.079 --> 00:25:24.200
<v Speaker 1>stuff like those realistic but fake images you hear about.

482
00:25:24.359 --> 00:25:27.880
<v Speaker 2>Ah, that's the territory of generative adversarial networks, or GANs.

483
00:25:28.279 --> 00:25:30.319
<v Speaker 2>This is another really clever idea. The book uses a

484
00:25:30.359 --> 00:25:33.400
<v Speaker 2>great analogy. It's like a game between a counterfeiter and

485
00:25:33.440 --> 00:25:37.480
<v Speaker 2>the police. You have two networks. The generator, the counterfeiter,

486
00:25:37.960 --> 00:25:42.200
<v Speaker 2>tries to create fake data, say images of faces, that looks realistic.

487
00:25:42.240 --> 00:25:45.160
<v Speaker 2>It starts by taking random noise as input and transforming it.

488
00:25:45.599 --> 00:25:50.319
<v Speaker 2>The discriminator, the police, is trained to distinguish between real data,

489
00:25:50.680 --> 00:25:53.039
<v Speaker 2>actual face images from a data set, and the fake

490
00:25:53.119 --> 00:25:54.039
<v Speaker 2>data produced by the generator.

491
00:25:53.920 --> 00:25:56.319
<v Speaker 1>So they're fighting each other?

492
00:25:56.559 --> 00:25:59.440
<v Speaker 2>Exactly. They train in an adversarial loop. The generator gets better

493
00:25:59.440 --> 00:26:02.640
<v Speaker 2>at fooling the discriminator, and the discriminator gets better at

494
00:26:02.720 --> 00:26:07.519
<v Speaker 2>spotting the fakes. The process continues until ideally, the generator

495
00:26:07.519 --> 00:26:10.440
<v Speaker 2>produces fakes that are so good the discriminator can't tell

496
00:26:10.440 --> 00:26:13.359
<v Speaker 2>them apart from real data anymore. Its accuracy is around

497
00:26:13.359 --> 00:26:15.960
<v Speaker 2>fifty percent. It's framed as a minimax game, reaching a

498
00:26:16.039 --> 00:26:17.640
<v Speaker 2>kind of equilibrium.
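
NOTE
Editor's note, not a quotation from the book: the minimax game mentioned above is usually written as the objective below, in the standard GAN formulation. The symbols are the conventional ones and are assumed here: D is the discriminator, G the generator, p_data the real-data distribution, and p_z the noise distribution fed to the generator.
```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```
The discriminator tries to maximize V by scoring real samples near 1 and generated samples near 0, while the generator tries to minimize it. At the ideal equilibrium the generated distribution matches the real one, the best possible discriminator outputs D(x) = 1/2 everywhere, and its accuracy is therefore about fifty percent.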

499
00:26:17.039 --> 00:26:19.279
<v Speaker 1>And this creates realistic images?

500
00:26:19.160 --> 00:26:23.000
<v Speaker 2>Often stunningly realistic ones, yeah. But here's another aha moment:

501
00:26:23.160 --> 00:26:26.839
<v Speaker 2>conditional GANs. These allow you to provide some context or

502
00:26:26.880 --> 00:26:30.279
<v Speaker 2>condition to the generator, so instead of just generating any

503
00:26:30.359 --> 00:26:32.839
<v Speaker 2>random face, you could ask it to generate a face

504
00:26:32.880 --> 00:26:37.160
<v Speaker 2>based on attributes like smiling, wearing glasses, or even generate

505
00:26:37.200 --> 00:26:40.000
<v Speaker 2>an image based on a text description. The book gives

506
00:26:40.039 --> 00:26:42.880
<v Speaker 2>examples like converting black and white photos to color, or

507
00:26:42.920 --> 00:26:46.720
<v Speaker 2>creating different plausible photographs based on a simple sketch like

508
00:26:46.720 --> 00:26:48.039
<v Speaker 2>a police sketch of a suspect.

509
00:26:48.079 --> 00:26:48.640
<v Speaker 1>Wow.

510
00:26:48.759 --> 00:26:52.720
<v Speaker 2>What's amazing here is the level of artistry or creativity involved.

511
00:26:52.920 --> 00:26:56.359
<v Speaker 2>The GAN isn't just reconstructing something. It's filling in missing

512
00:26:56.400 --> 00:26:59.880
<v Speaker 2>information in a way that is plausible and aesthetically coherent.

513
00:27:00.200 --> 00:27:02.880
<v Speaker 2>It's extrapolating realistically from limited context.

514
00:27:02.960 --> 00:27:05.759
<v Speaker 1>That's bordering on creative. Okay, one last area: models that

515
00:27:05.799 --> 00:27:07.440
<v Speaker 1>can focus or have memory.

516
00:27:08.319 --> 00:27:12.359
<v Speaker 2>Two important concepts there are attention mechanisms and neural Turing machines.

517
00:27:12.759 --> 00:27:16.200
<v Speaker 2>Attention is inspired by how we humans focus our cognitive resources.

518
00:27:16.680 --> 00:27:19.400
<v Speaker 2>Instead of treating all parts of the input equally, attention

519
00:27:19.440 --> 00:27:22.680
<v Speaker 2>mechanisms allow a model to dynamically focus on specific portions

520
00:27:22.720 --> 00:27:24.640
<v Speaker 2>of the data that are relevant to the task at hand.

521
00:27:24.200 --> 00:27:26.559
<v Speaker 1>Like paying attention to the

522
00:27:26.599 --> 00:27:27.960
<v Speaker 1>important words.

523
00:27:27.839 --> 00:27:32.359
<v Speaker 2>Exactly. In machine translation, when generating a target word, the

524
00:27:32.400 --> 00:27:36.119
<v Speaker 2>attention mechanism might focus heavily on the corresponding source words.
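
NOTE
Editor's sketch, not from the book: one common form of attention, where dot-product scores between a query and a set of keys are turned into weights with a softmax and used to average the values. The function name, the toy vectors, and the choice of dot-product scoring are assumptions; the discussion above does not commit to a particular scoring function.
```python
import math
def attention(query, keys, values):
    # Score each source position by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax (shifted by the max score for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the values.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
weights, context = attention([4.0, 0.0], keys, values)
print(weights, context)
```
The query matches the first and third keys, so nearly all of the weight goes to those two positions and the second is almost ignored. In translation, the positions would be source words and the query would come from the decoder state.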

525
00:27:36.799 --> 00:27:40.319
<v Speaker 2>In image captioning, as the model generates the caption word

526
00:27:40.359 --> 00:27:43.319
<v Speaker 2>by word, the attention might shift to different regions of

527
00:27:43.359 --> 00:27:46.519
<v Speaker 2>the image relevant to the word being generated: "A dog,"

528
00:27:46.640 --> 00:27:50.039
<v Speaker 2>focus on the dog; "catches a frisbee," focus on the frisbee. It's

529
00:27:50.079 --> 00:27:53.079
<v Speaker 2>made a big difference in sequence-to-sequence tasks. And

530
00:27:53.119 --> 00:27:56.599
<v Speaker 2>then neural Turing machines, or NTMs. These are really fascinating

531
00:27:56.680 --> 00:27:59.160
<v Speaker 2>because they try to bridge the gap between neural networks

532
00:27:59.240 --> 00:28:03.759
<v Speaker 2>and traditional computers. Most neural networks intertwine computation and memory. The

533
00:28:03.799 --> 00:28:07.640
<v Speaker 2>network state is its memory, and it's often transient. NTMs

534
00:28:07.680 --> 00:28:10.720
<v Speaker 2>introduce an external persistent memory component like the tape of

535
00:28:10.759 --> 00:28:13.519
<v Speaker 2>a Turing machine, that the neural network controller can learn

536
00:28:13.519 --> 00:28:14.240
<v Speaker 2>to read from and write to.

537
00:28:14.039 --> 00:28:16.599
<v Speaker 1>Separating memory and processing.

538
00:28:17.000 --> 00:28:20.640
<v Speaker 2>Yes, this separation potentially allows them to learn to simulate

539
00:28:20.640 --> 00:28:24.680
<v Speaker 2>algorithms just from examples. The book mentions the possibility of

540
00:28:24.720 --> 00:28:27.680
<v Speaker 2>an NTM learning to sort a list of numbers simply

541
00:28:27.720 --> 00:28:30.599
<v Speaker 2>by seeing many examples of scrambled lists and their sorted

542
00:28:30.720 --> 00:28:34.839
<v Speaker 2>versions without being explicitly programmed with a sorting algorithm. They

543
00:28:34.880 --> 00:28:37.599
<v Speaker 2>represent a step towards models that can learn more general

544
00:28:37.599 --> 00:28:41.559
<v Speaker 2>computational processes closer to how a programmable computer works, but

545
00:28:41.680 --> 00:28:42.759
<v Speaker 2>learned through optimization.

546
00:28:43.079 --> 00:28:45.720
<v Speaker 1>Wow. Okay, so stepping back from all this detail, what

547
00:28:45.799 --> 00:28:48.319
<v Speaker 1>does it all mean? We've gone from these simple perceptron

548
00:28:48.359 --> 00:28:51.119
<v Speaker 1>bricks all the way to systems that can play Go,

549
00:28:51.640 --> 00:28:54.319
<v Speaker 1>generate art, maybe even learn algorithms.

550
00:28:54.640 --> 00:28:57.039
<v Speaker 2>It's really quite a journey. It's a testament I think

551
00:28:57.240 --> 00:29:01.559
<v Speaker 2>to the power of combining relatively simple computational ideas, scaling

552
00:29:01.559 --> 00:29:04.599
<v Speaker 2>them up with massive data and compute, and developing clever

553
00:29:04.680 --> 00:29:08.599
<v Speaker 2>ways to train them effectively. The adaptability is just astounding.

554
00:29:08.599 --> 00:29:10.799
<v Speaker 2>How different architectures like CNNs, RNNs,

555
00:29:10.799 --> 00:29:14.359
<v Speaker 2>and transformers now can be tailored to unlock insights from

556
00:29:14.480 --> 00:29:17.680
<v Speaker 2>wildly different kinds of data. It provides this high level

557
00:29:17.680 --> 00:29:20.079
<v Speaker 2>way to build systems that learn complex patterns.

558
00:29:20.400 --> 00:29:24.160
<v Speaker 1>This deep dive into Aggarwal's book really highlights how far

559
00:29:24.240 --> 00:29:26.960
<v Speaker 1>the field has come and how fast it's still moving.

560
00:29:27.319 --> 00:29:30.200
<v Speaker 1>From getting the basic training to work, dealing with vanishing

561
00:29:30.240 --> 00:29:35.200
<v Speaker 1>gradients, to building these incredibly specialized and capable systems.

562
00:29:34.759 --> 00:29:37.960
<v Speaker 2>Absolutely. And considering how systems like AlphaGo seem to discover

563
00:29:38.000 --> 00:29:42.559
<v Speaker 2>strategies or how GANs can generate novel creative outputs, it

564
00:29:42.599 --> 00:29:45.559
<v Speaker 2>really does raise a fascinating question, doesn't it. If these

565
00:29:45.599 --> 00:29:49.079
<v Speaker 2>networks can find complex solutions and exhibit something akin to

566
00:29:49.119 --> 00:29:53.480
<v Speaker 2>creativity on their own, driven by data and optimization, what

567
00:29:53.640 --> 00:29:56.680
<v Speaker 2>new forms of intelligence or problem solving, maybe even things

568
00:29:56.680 --> 00:29:59.359
<v Speaker 2>we haven't conceived of, might they unlock in the future.

569
00:29:59.440 --> 00:30:01.960
<v Speaker 1>That is definitely something to mull over. What might they

570
00:30:01.960 --> 00:30:04.759
<v Speaker 1>discover that we, with our human biases, might miss?

571
00:30:04.880 --> 00:30:07.440
<v Speaker 2>Exactly. A thought to keep you company until our

572
00:30:07.440 --> 00:30:08.200
<v Speaker 2>next deep dive.
