WEBVTT

1
00:00:00.120 --> 00:00:02.839
<v Speaker 1>Welcome to the deep dive. Our mission is pretty simple.

2
00:00:03.319 --> 00:00:06.879
<v Speaker 1>You give us the source material and we jump right

3
00:00:06.919 --> 00:00:09.679
<v Speaker 1>in to pull out the essential knowledge, basically giving you

4
00:00:09.720 --> 00:00:11.599
<v Speaker 1>a shortcut to getting up to speed on a topic.

5
00:00:11.759 --> 00:00:15.080
<v Speaker 2>Exactly. We dig into the pages, the research, whatever you

6
00:00:15.119 --> 00:00:17.719
<v Speaker 2>send us, and we extract those key insights, maybe some

7
00:00:17.760 --> 00:00:20.320
<v Speaker 2>surprising details, and the stuff you can actually use.

8
00:00:20.559 --> 00:00:24.519
<v Speaker 1>And for this deep dive, we're tackling excerpts from Applied

9
00:00:24.640 --> 00:00:28.440
<v Speaker 1>Deep Learning: A Case-Based Approach to Understanding Deep Neural

10
00:00:28.480 --> 00:00:30.879
<v Speaker 1>Networks by Umberto Michelucci.

11
00:00:30.760 --> 00:00:33.320
<v Speaker 2>Right, and this source material it really gets into the

12
00:00:33.399 --> 00:00:37.439
<v Speaker 2>nuts and bolts how neural networks function, how they actually learn,

13
00:00:37.799 --> 00:00:40.640
<v Speaker 2>and then the practical side like building and training them

14
00:00:40.759 --> 00:00:42.200
<v Speaker 2>using tools like TensorFlow.

15
00:00:42.600 --> 00:00:45.039
<v Speaker 1>So our goal here is simple: walk you through these

16
00:00:45.119 --> 00:00:49.520
<v Speaker 1>core deep learning ideas straight from this applied angle, making

17
00:00:49.520 --> 00:00:52.079
<v Speaker 1>it hopefully clear and digestible. Let's get started.

18
00:00:52.159 --> 00:00:54.000
<v Speaker 2>Okay, let's start where the book does, really, at the

19
00:00:54.000 --> 00:00:56.600
<v Speaker 2>foundation: computational graphs.

20
00:00:56.560 --> 00:00:58.399
<v Speaker 1>Right, but before you even talk about neurons, the book

21
00:00:58.439 --> 00:01:01.600
<v Speaker 1>sets it up with this idea. A computational graph is well,

22
00:01:01.640 --> 00:01:04.079
<v Speaker 1>it's just a way to map out and organize calculations, right?

23
00:01:04.120 --> 00:01:05.879
<v Speaker 1>You define the steps, how the data flows.

24
00:01:05.680 --> 00:01:09.719
<v Speaker 2>And that's precisely how libraries like TensorFlow work. You

25
00:01:11.200 --> 00:01:16.040
<v Speaker 2>build this graph defining all the operations: additions, multiplications, activation functions, whatever.

26
00:01:16.319 --> 00:01:19.879
<v Speaker 2>Then TensorFlow takes that graph and runs it, executing everything

27
00:01:20.000 --> 00:01:20.640
<v Speaker 2>very efficiently.

28
00:01:20.920 --> 00:01:23.879
<v Speaker 1>Okay, so what are the basic building blocks for these

29
00:01:23.920 --> 00:01:26.319
<v Speaker 1>graphs in TensorFlow, according to the source?

30
00:01:26.239 --> 00:01:28.480
<v Speaker 2>Well, the most fundamental thing is the tensor itself. You can

31
00:01:28.519 --> 00:01:31.560
<v Speaker 2>think of it basically as a multidimensional array, very much

32
00:01:31.599 --> 00:01:35.519
<v Speaker 2>like NumPy arrays, actually, and its rank just tells you

33
00:01:35.560 --> 00:01:38.040
<v Speaker 2>how many dimensions it has. Rank zero is a single number,

34
00:01:38.120 --> 00:01:40.239
<v Speaker 2>rank one is a vector, rank two a matrix, and

35
00:01:40.280 --> 00:01:40.640
<v Speaker 2>so on.

36
00:01:40.799 --> 00:01:44.519
<v Speaker 1>Got it. Multidimensional arrays holding the data. But what about

37
00:01:44.519 --> 00:01:47.799
<v Speaker 1>the pieces that do the work or change during training?

38
00:01:47.920 --> 00:01:50.159
<v Speaker 2>Ah, right. So you have different kinds of nodes. There's

39
00:01:50.319 --> 00:01:53.879
<v Speaker 2>tf.Variable. These are for parameters that the network

40
00:01:53.879 --> 00:01:57.000
<v Speaker 2>needs to update as it learns. Think weights and biases,

41
00:01:57.040 --> 00:01:58.159
<v Speaker 2>the classic examples.

42
00:01:58.359 --> 00:02:02.200
<v Speaker 1>They vary. Makes sense, they vary. What about tf.placeholder, then?

43
00:02:02.519 --> 00:02:05.120
<v Speaker 2>Placeholders are different. They're like entry points into the graph.

44
00:02:05.319 --> 00:02:08.400
<v Speaker 2>You use them to feed in data from outside when

45
00:02:08.439 --> 00:02:11.159
<v Speaker 2>you actually run the calculation. They hold values that are

46
00:02:11.199 --> 00:02:14.800
<v Speaker 2>fixed during one run, but you might change them between runs.

47
00:02:14.840 --> 00:02:17.879
<v Speaker 1>Like feeding in a batch of training data, or maybe

48
00:02:17.919 --> 00:02:18.800
<v Speaker 1>the learning rate.

49
00:02:18.800 --> 00:02:22.479
<v Speaker 2>Exactly. Input data batches are the prime example, or maybe

50
00:02:22.479 --> 00:02:24.000
<v Speaker 2>a learning rate you're setting manually.

51
00:02:24.159 --> 00:02:28.080
<v Speaker 1>Okay, so variables change during a run, placeholders get new

52
00:02:28.159 --> 00:02:31.919
<v Speaker 1>values between runs, and tf.constant?

53
00:02:31.960 --> 00:02:33.800
<v Speaker 2>That sounds easy. It's just for a value that stays

54
00:02:33.840 --> 00:02:35.800
<v Speaker 2>the same, always, never changes.

55
00:02:35.560 --> 00:02:38.599
<v Speaker 1>And TensorFlow runs this whole graph thing using something called

56
00:02:38.599 --> 00:02:39.120
<v Speaker 1>a session.

57
00:02:39.360 --> 00:02:42.479
<v Speaker 2>That's right. You define the graph structure first, then you

58
00:02:42.520 --> 00:02:45.479
<v Speaker 2>need a TensorFlow session to actually execute the operations in

59
00:02:45.479 --> 00:02:48.280
<v Speaker 2>that graph. The book makes a distinction between session.run

60
00:02:48.360 --> 00:02:51.759
<v Speaker 2>and tensor.eval. session.run lets you execute

61
00:02:51.800 --> 00:02:54.639
<v Speaker 2>specific nodes you list, whereas eval is more like a

62
00:02:54.639 --> 00:02:57.319
<v Speaker 2>shortcut you call directly on a tensor or variable. It

63
00:02:57.400 --> 00:03:00.120
<v Speaker 2>just runs that specific thing within the current session and

64
00:03:00.560 --> 00:03:01.639
<v Speaker 2>gives you its value back.

65
00:03:01.840 --> 00:03:05.400
<v Speaker 1>So we've got the structure for doing calculations. Now, what does

66
00:03:05.439 --> 00:03:09.159
<v Speaker 1>the book say is the smallest unit in deep learning?

67
00:03:09.319 --> 00:03:12.639
<v Speaker 2>That would be the single neuron. It's the fundamental building block,

68
00:03:12.719 --> 00:03:16.360
<v Speaker 2>kind of inspired by biological neurons, but much simpler. As

69
00:03:16.400 --> 00:03:19.199
<v Speaker 2>the book describes it, it takes several numerical inputs, your data

70
00:03:19.240 --> 00:03:23.400
<v Speaker 2>features usually, does some processing, and spits out a single number.

71
00:03:23.759 --> 00:03:28.479
<v Speaker 1>And that processing involves multiplying inputs by weights, adding a bias,

72
00:03:29.360 --> 00:03:30.919
<v Speaker 1>and then hitting it with an activation function.

73
00:03:31.039 --> 00:03:34.120
<v Speaker 2>Precisely. The weights give importance to different inputs, the bias

74
00:03:34.240 --> 00:03:37.680
<v Speaker 2>shifts the result, and that activation function, that's super important.

75
00:03:38.080 --> 00:03:42.080
<v Speaker 2>It introduces nonlinearity because if you just stacked layers of

76
00:03:42.120 --> 00:03:45.039
<v Speaker 2>linear operations, the whole network would still just be doing

77
00:03:45.080 --> 00:03:49.199
<v Speaker 2>a linear transformation. Nonlinearity lets it learn complex stuff.

78
00:03:49.680 --> 00:03:52.000
<v Speaker 1>What activation functions does the book focus on?

79
00:03:52.280 --> 00:03:55.240
<v Speaker 2>It covers some key ones. There's the sigmoid function, squashes

80
00:03:55.280 --> 00:03:58.000
<v Speaker 2>everything between zero and one, historically used a lot for

81
00:03:58.319 --> 00:04:02.199
<v Speaker 2>like binary classification. Then the identity function, which is just

82
00:04:02.360 --> 00:04:06.080
<v Speaker 2>linear, output equals input, and the really common one now,

83
00:04:06.240 --> 00:04:10.240
<v Speaker 2>ReLU, the rectified linear unit, where the output is

84
00:04:10.360 --> 00:04:13.319
<v Speaker 2>just the input if it's positive and zero if it's negative.

85
00:04:13.599 --> 00:04:15.240
<v Speaker 2>Simple but works very well.
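
NOTE
Editorial sketch, not taken from the book: a single neuron as the speakers
describe it (weighted sum of inputs, plus a bias, passed through an activation).
The function names and sample numbers below are illustrative assumptions.
```python
import math
def sigmoid(z: float) -> float:
    # Squashes any real input into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
def identity(z: float) -> float:
    # Output equals input.
    return z
def relu(z: float) -> float:
    # Input if positive, zero otherwise.
    return max(0.0, z)
def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus the bias, then the activation function.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
# Hypothetical sample values: z = 0.5*1.0 + (-1.0)*2.0 + 0.25 = -1.25
out_identity = neuron([1.0, 2.0], [0.5, -1.0], 0.25, identity)
out_relu = neuron([1.0, 2.0], [0.5, -1.0], 0.25, relu)
out_sigmoid = neuron([1.0, 2.0], [0.5, -1.0], 0.25, sigmoid)
```
The same weighted sum gives three different outputs depending only on the
activation chosen.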

86
00:04:15.439 --> 00:04:17.720
<v Speaker 1>Now, the book points out a really practical, kind of

87
00:04:17.720 --> 00:04:21.079
<v Speaker 1>tricky thing with sigmoid, doesn't it? Something that can cause problems?

88
00:04:21.360 --> 00:04:27.560
<v Speaker 2>Ah, yeah, this is a classic theory versus practice issue. Mathematically,

89
00:04:27.680 --> 00:04:30.759
<v Speaker 2>sigmoid gets really close to zero or one, but never

90
00:04:30.839 --> 00:04:35.079
<v Speaker 2>quite touches them. But computers use floating point numbers. So

91
00:04:35.480 --> 00:04:39.240
<v Speaker 2>for really big positive or negative inputs, the result can

92
00:04:39.279 --> 00:04:41.439
<v Speaker 2>actually get rounded to exactly zero or one.

93
00:04:41.680 --> 00:04:43.759
<v Speaker 1>And why is that a problem? Where does it bite you?

94
00:04:44.079 --> 00:04:47.279
<v Speaker 2>It bites you when you calculate the cost function. Especially

95
00:04:47.279 --> 00:04:50.399
<v Speaker 2>in classification, you often need the logarithm of the output

96
00:04:50.519 --> 00:04:52.839
<v Speaker 2>or log of one minus the output. If the output is exactly zero

97
00:04:52.920 --> 00:04:57.319
<v Speaker 2>or one, you're trying to calculate log zero, which is undefined.

98
00:04:57.279 --> 00:05:00.519
<v Speaker 1>Leading to those NaN values, not a number.

99
00:05:00.600 --> 00:05:02.759
<v Speaker 2>Exactly. You see NaN popping up in your training loss, that's

100
00:05:02.800 --> 00:05:05.519
<v Speaker 2>often a clue, could be the sigmoid issue, maybe related

101
00:05:05.519 --> 00:05:08.160
<v Speaker 2>to data scaling or initial weights being too large. It's

102
00:05:08.160 --> 00:05:09.120
<v Speaker 2>a debugging flag.
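
NOTE
Editorial sketch, not taken from the book: how a saturated sigmoid breaks the
log terms of a cross-entropy cost, and one possible guard. Clipping with a
small eps is an assumed workaround here, and the names are illustrative.
```python
import math
def sigmoid(z: float) -> float:
    # Stable form: exp() is only ever called on a non-positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
def cross_entropy(y_true: float, y_hat: float, eps: float = 1e-12) -> float:
    # Assumed guard: keep y_hat inside [eps, 1 - eps] so log() never sees zero.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y_true * math.log(y_hat) + (1.0 - y_true) * math.log(1.0 - y_hat))
# In 64-bit floats a large input rounds the sigmoid to exactly 1.0,
# so log(1 - output) would be log(0) without the clip.
saturated = sigmoid(40.0)
loss = cross_entropy(0.0, saturated)
```
Without the clip, math.log(1.0 - saturated) raises an error in plain Python;
array libraries typically produce inf or NaN instead.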

103
00:05:09.319 --> 00:05:13.519
<v Speaker 1>That's a super useful tip, watch for NaNs. Another practical

104
00:05:13.519 --> 00:05:17.240
<v Speaker 1>point the book makes is about speed, right? Computational efficiency.

105
00:05:17.560 --> 00:05:21.439
<v Speaker 2>Absolutely. It has this great comparison. It shows implementing something

106
00:05:21.600 --> 00:05:25.759
<v Speaker 2>like ReLU using NumPy's built-in matrix operations versus

107
00:05:25.920 --> 00:05:28.000
<v Speaker 2>just writing a standard Python for loop.

108
00:05:28.360 --> 00:05:31.399
<v Speaker 1>I think I remember seeing that graphic. The difference was huge,

109
00:05:31.439 --> 00:05:32.319
<v Speaker 1>wasn't it?

110
00:05:32.399 --> 00:05:34.879
<v Speaker 2>Massive. Something like one hundred times faster in their example for

111
00:05:34.959 --> 00:05:37.879
<v Speaker 2>a big array. Wow. And it really drives home why

112
00:05:37.879 --> 00:05:41.600
<v Speaker 2>we use libraries like NumPy or TensorFlow. They push these

113
00:05:41.600 --> 00:05:45.240
<v Speaker 2>operations down to low-level code like C and use vectorization.

114
00:05:45.720 --> 00:05:48.480
<v Speaker 2>They process chunks of data all at once, which is

115
00:05:48.680 --> 00:05:52.879
<v Speaker 2>way faster than Python looping through element by element. Understanding

116
00:05:52.920 --> 00:05:55.920
<v Speaker 2>that efficiency is key to why deep learning scales.

117
00:05:56.160 --> 00:05:58.680
<v Speaker 1>Okay, so we have the neuron, the basic unit, but

118
00:05:58.759 --> 00:06:02.920
<v Speaker 1>how does it, or a whole network of them, actually learn anything?

119
00:06:02.959 --> 00:06:05.800
<v Speaker 2>Right. So, learning here means finding the best possible values for

120
00:06:05.839 --> 00:06:09.040
<v Speaker 2>the network's parameters, the weights and biases. Best means the

121
00:06:09.120 --> 00:06:12.000
<v Speaker 2>values that make the network's predictions match the true answers

122
00:06:12.040 --> 00:06:13.360
<v Speaker 2>as closely as possible.

123
00:06:13.600 --> 00:06:15.920
<v Speaker 1>You measure that closeness using the cost function.

124
00:06:16.079 --> 00:06:18.560
<v Speaker 2>Precisely, the cost function gives you a number that says

125
00:06:18.600 --> 00:06:22.480
<v Speaker 2>how wrong the network is. Lower cost means better performance

126
00:06:22.480 --> 00:06:23.639
<v Speaker 2>on the training data.

127
00:06:23.680 --> 00:06:27.199
<v Speaker 1>And the main algorithm for lowering that cost is gradient

128
00:06:27.240 --> 00:06:28.120
<v Speaker 1>descent?

129
00:06:28.160 --> 00:06:31.360
<v Speaker 2>Yep. Gradient descent is the workhorse. It works by calculating

130
00:06:31.439 --> 00:06:35.040
<v Speaker 2>the gradient, basically the slope of the cost function, with

131
00:06:35.120 --> 00:06:38.279
<v Speaker 2>respect to each weight and bias. Then it adjusts the

132
00:06:38.319 --> 00:06:42.519
<v Speaker 2>parameters slightly in the opposite direction of the gradient. It's

133
00:06:42.519 --> 00:06:45.360
<v Speaker 2>like taking a small step downhill on the cost landscape,

134
00:06:45.639 --> 00:06:47.920
<v Speaker 2>always trying to find the lowest point.

135
00:06:47.879 --> 00:06:49.000
<v Speaker 1>And the size of that step, that's the learning rate?

136
00:06:49.360 --> 00:06:52.560
<v Speaker 2>Exactly. The learning rate, often written

137
00:06:52.600 --> 00:06:56.920
<v Speaker 2>as gamma or alpha, is a really critical hyperparameter. It

138
00:06:57.000 --> 00:06:59.560
<v Speaker 2>dictates how big a step you take downhill each time.

139
00:07:00.000 --> 00:07:03.759
<v Speaker 2>A small learning rate means tiny, maybe cautious steps. Large

140
00:07:03.839 --> 00:07:07.120
<v Speaker 2>learning rate means big, bold steps, which sounds good.

141
00:07:07.199 --> 00:07:10.040
<v Speaker 1>But that's where the quirks come in, as the book

142
00:07:10.079 --> 00:07:11.839
<v Speaker 1>puts it. What happens if it's too big?

143
00:07:11.920 --> 00:07:14.800
<v Speaker 2>If it's too big, you can overshoot the minimum point

144
00:07:14.920 --> 00:07:17.079
<v Speaker 2>in the cost landscape. You jump right over it. You

145
00:07:17.120 --> 00:07:19.959
<v Speaker 2>might end up bouncing back and forth, oscillating around the minimum,

146
00:07:20.040 --> 00:07:23.079
<v Speaker 2>or even flying off entirely and diverging. The cost gets

147
00:07:23.079 --> 00:07:24.079
<v Speaker 2>worse instead of better.
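
NOTE
Editorial sketch, not taken from the book: gradient descent on a toy cost
J(w) = w**2 to show the effect of the learning rate. The cost function,
starting point, and the two rates are illustrative assumptions.
```python
def gradient_descent(lr: float, w0: float = 1.0, steps: int = 20) -> float:
    # The gradient of J(w) = w**2 is 2*w; step against it, scaled by lr.
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w
# Small rate: each step multiplies w by 0.8, so it shrinks towards the minimum at 0.
w_small = gradient_descent(lr=0.1)
# Too-large rate: each step multiplies w by -1.2, so it overshoots,
# flips sign every step, and moves further away.
w_large = gradient_descent(lr=1.1)
```
With the larger rate the parameter ends up further from the minimum than
where it started, which is the divergence described above.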

148
00:07:24.319 --> 00:07:26.720
<v Speaker 1>Yeah, I can picture that like rolling a ball down

149
00:07:26.759 --> 00:07:28.680
<v Speaker 1>a hill too fast and it rolls right across the

150
00:07:28.759 --> 00:07:30.000
<v Speaker 1>valley and up the other side.

151
00:07:30.079 --> 00:07:32.959
<v Speaker 2>That's a good analogy. Finding that just-right learning rate

152
00:07:33.040 --> 00:07:35.480
<v Speaker 2>is often one of the first big challenges when you're

153
00:07:35.560 --> 00:07:36.319
<v Speaker 2>training a network.

154
00:07:36.439 --> 00:07:42.040
<v Speaker 1>Okay, so individual neurons learn by minimizing cost with gradient descent,

155
00:07:42.759 --> 00:07:45.399
<v Speaker 1>but the real power comes when you connect lots of

156
00:07:45.399 --> 00:07:47.920
<v Speaker 1>them together in feed forward neural networks.

157
00:07:47.959 --> 00:07:50.519
<v Speaker 2>That's right. You arrange neurons in layers. You have an

158
00:07:50.519 --> 00:07:53.160
<v Speaker 2>input layer, one or more hidden layers in the middle,

159
00:07:53.240 --> 00:07:56.279
<v Speaker 2>and then an output layer. And in a standard fully

160
00:07:56.279 --> 00:08:00.519
<v Speaker 2>connected network, every neuron in one layer passes its output

161
00:08:00.639 --> 00:08:03.040
<v Speaker 2>to every neuron in the very next layer.

162
00:08:03.000 --> 00:08:05.920
<v Speaker 1>And the calculations just flow forward layer by layer, which is

163
00:08:05.920 --> 00:08:09.000
<v Speaker 1>why those matrix operations we talked about are so useful, right?

164
00:08:09.040 --> 00:08:10.360
<v Speaker 1>Processing a whole layer at once.

165
00:08:10.240 --> 00:08:14.360
<v Speaker 2>Exactly. That equation, Z = WX + B, that's not just

166
00:08:14.399 --> 00:08:16.920
<v Speaker 2>one neuron. W is a matrix of all weights for

167
00:08:16.959 --> 00:08:19.920
<v Speaker 2>the layer, X is a matrix of all inputs or

168
00:08:19.959 --> 00:08:23.000
<v Speaker 2>previous layer outputs for a whole batch, and B is

169
00:08:23.040 --> 00:08:25.759
<v Speaker 2>the bias vector. It calculates everything for the layer in

170
00:08:25.800 --> 00:08:27.720
<v Speaker 2>one go. Super efficient.

171
00:08:27.800 --> 00:08:30.879
<v Speaker 1>But building these bigger, deeper networks introduces a huge challenge. The

172
00:08:30.879 --> 00:08:33.120
<v Speaker 1>book really digs into overfitting.

173
00:08:33.360 --> 00:08:37.000
<v Speaker 2>Oh yeah. Overfitting is a constant concern. It's when your

174
00:08:37.000 --> 00:08:40.080
<v Speaker 2>model gets too good at the training data. It doesn't

175
00:08:40.120 --> 00:08:43.279
<v Speaker 2>just learn the underlying patterns, it starts memorizing the specific

176
00:08:43.360 --> 00:08:46.279
<v Speaker 2>training examples, including all the random noise and quirks.

177
00:08:46.200 --> 00:08:48.720
<v Speaker 1>So it aces the practice test but fails the real exam.

178
00:08:48.600 --> 00:08:52.000
<v Speaker 2>Perfect analogy. It performs great on data it's seen,

179
00:08:52.080 --> 00:08:54.960
<v Speaker 2>but poorly on new unseen data because it didn't learn

180
00:08:54.960 --> 00:08:55.919
<v Speaker 2>the general rules.

181
00:08:56.000 --> 00:08:59.519
<v Speaker 1>And the opposite is underfitting or high bias, where the

182
00:08:59.519 --> 00:09:02.080
<v Speaker 1>model is too simple it can't even capture the training

183
00:09:02.159 --> 00:09:03.080
<v Speaker 1>data patterns well.

184
00:09:03.159 --> 00:09:06.159
<v Speaker 2>Right, and the book stresses that the very first step

185
00:09:06.320 --> 00:09:09.799
<v Speaker 2>in fighting overfitting is being able to spot it, which

186
00:09:09.879 --> 00:09:11.480
<v Speaker 2>means you have to split your data

187
00:09:11.440 --> 00:09:14.840
<v Speaker 1>into a training set and a development set, or

188
00:09:14.879 --> 00:09:16.440
<v Speaker 1>a validation set.

189
00:09:16.480 --> 00:09:18.840
<v Speaker 2>Exactly. You train the model only on the training set, but

190
00:09:19.120 --> 00:09:22.000
<v Speaker 2>periodically you check its performance on the development (dev) set,

191
00:09:22.000 --> 00:09:23.159
<v Speaker 2>which it hasn't been trained on.

192
00:09:23.320 --> 00:09:25.799
<v Speaker 1>And if the training error keeps going down but the

193
00:09:25.840 --> 00:09:28.600
<v Speaker 1>dev error stops improving or starts going up?

194
00:09:28.440 --> 00:09:30.879
<v Speaker 2>Bingo, that's your alarm bell. The model is starting

195
00:09:30.879 --> 00:09:33.679
<v Speaker 2>to overfit the training data. The dev set acts like

196
00:09:33.720 --> 00:09:35.080
<v Speaker 2>your early warning system.

197
00:09:35.399 --> 00:09:38.720
<v Speaker 1>Now, going back to gradient descent for training these networks,

198
00:09:38.840 --> 00:09:40.000
<v Speaker 1>there are different flavors of it.

199
00:09:39.960 --> 00:09:43.200
<v Speaker 2>Yes, because using the entire data set for every

200
00:09:43.240 --> 00:09:46.279
<v Speaker 2>single weight update, that's called batch gradient descent, can be

201
00:09:46.320 --> 00:09:49.879
<v Speaker 2>incredibly slow and memory intensive for large data sets. Batch

202
00:09:49.919 --> 00:09:53.600
<v Speaker 2>GD gives you a very accurate gradient estimate, but the

203
00:09:53.720 --> 00:09:55.399
<v Speaker 2>updates are infrequent.

204
00:09:55.080 --> 00:09:58.279
<v Speaker 1>So the alternative is stochastic gradient descent or SGD.

205
00:09:58.559 --> 00:10:01.759
<v Speaker 2>Right, SGD goes to the other extreme. It updates the

206
00:10:01.759 --> 00:10:04.799
<v Speaker 2>weights after looking at just one training example. This makes

207
00:10:04.799 --> 00:10:07.919
<v Speaker 2>the updates very fast and frequent, but also very noisy

208
00:10:08.000 --> 00:10:12.080
<v Speaker 2>or stochastic. The path towards the minimum jumps around a lot.

209
00:10:12.399 --> 00:10:15.840
<v Speaker 2>That noise can sometimes help it escape shallow local minima, though.

210
00:10:15.759 --> 00:10:18.120
<v Speaker 1>And the most common approach sits in the middle:

211
00:10:18.200 --> 00:10:20.120
<v Speaker 1>mini-batch gradient descent.

212
00:10:20.480 --> 00:10:22.799
<v Speaker 2>Exactly. This is what people usually mean by SGD nowadays, even

213
00:10:22.799 --> 00:10:25.799
<v Speaker 2>though it's technically mini-batch. You calculate the gradient and update

214
00:10:25.840 --> 00:10:28.120
<v Speaker 2>the weights based on a small batch, maybe thirty-two,

215
00:10:28.320 --> 00:10:31.679
<v Speaker 2>sixty-four, a hundred and twenty-eight examples. It's a compromise.

216
00:10:31.960 --> 00:10:35.320
<v Speaker 2>You get smoother convergence than pure SGD, but much faster

217
00:10:35.399 --> 00:10:39.320
<v Speaker 2>updates than batch GD. It leverages matrix operations efficiently.

218
00:10:39.120 --> 00:10:41.360
<v Speaker 1>And that mini batch size is another one of those

219
00:10:41.399 --> 00:10:42.559
<v Speaker 1>hyperparameters you have to choose.

220
00:10:42.519 --> 00:10:46.600
<v Speaker 2>Yep, and the book clarifies terminology. An iteration is

221
00:10:46.679 --> 00:10:49.879
<v Speaker 2>usually one pass through a mini-batch and one weight update,

222
00:10:50.320 --> 00:10:53.639
<v Speaker 2>and an epoch is one full pass through the entire training

223
00:10:53.720 --> 00:10:56.320
<v Speaker 2>data set, so many iterations per epoch.
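
NOTE
Editorial sketch, not taken from the book: the iteration and epoch terminology
as arithmetic. The data set size (1000) and batch size (64) are illustrative
assumptions, as are the helper names.
```python
import math
def iterations_per_epoch(n_examples: int, batch_size: int) -> int:
    # One iteration = one mini-batch = one weight update.
    return math.ceil(n_examples / batch_size)
def mini_batches(data, batch_size):
    # Yield consecutive slices; the last one may be smaller than batch_size.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]
data = list(range(1000))
batches = list(mini_batches(data, 64))
n_iterations = iterations_per_epoch(len(data), 64)
```
One epoch over these 1000 examples is 16 iterations: 15 full batches of 64
and a final batch of 40.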

224
00:10:56.399 --> 00:11:00.360
<v Speaker 1>The book also mentions something about starting weights, weight initialization,

225
00:11:00.480 --> 00:11:01.919
<v Speaker 1>being important.

226
00:11:02.120 --> 00:11:04.759
<v Speaker 2>Very important. It's not just setting them to zero, which causes problems.

227
00:11:04.960 --> 00:11:08.320
<v Speaker 2>How you initialize them can seriously affect how quickly or

228
00:11:08.399 --> 00:11:12.039
<v Speaker 2>even if the network trains successfully. Bad initialization can lead

229
00:11:12.080 --> 00:11:16.240
<v Speaker 2>to exploding gradients getting huge, or vanishing gradients getting tiny,

230
00:11:16.799 --> 00:11:18.559
<v Speaker 2>or those nan values again.

231
00:11:18.799 --> 00:11:20.000
<v Speaker 1>So what does the book suggest?

232
00:11:20.399 --> 00:11:23.639
<v Speaker 2>It often uses something like tf.truncated_normal with a

233
00:11:23.679 --> 00:11:27.480
<v Speaker 2>small standard deviation, maybe zero point one. This draws initial

234
00:11:27.480 --> 00:11:30.840
<v Speaker 2>weights from a normal distribution but cuts off extreme values.

235
00:11:31.480 --> 00:11:33.960
<v Speaker 2>The idea is to start with small random weights to

236
00:11:34.000 --> 00:11:36.320
<v Speaker 2>break symmetry but avoid large values initially.

237
00:11:36.360 --> 00:11:40.759
<v Speaker 1>Let's talk architecture. Why are deeper networks, like with

238
00:11:40.879 --> 00:11:44.440
<v Speaker 1>multiple hidden layers, often better than just one really wide

239
00:11:44.480 --> 00:11:45.080
<v Speaker 1>hidden layer?

240
00:11:45.320 --> 00:11:48.840
<v Speaker 2>Well, empirically, deeper networks often seem to need fewer neurons

241
00:11:48.840 --> 00:11:51.440
<v Speaker 2>in total to get the same level of performance as

242
00:11:51.480 --> 00:11:55.159
<v Speaker 2>a very wide but shallow network. But perhaps more importantly,

243
00:11:55.240 --> 00:11:59.159
<v Speaker 2>they often generalize better. The thinking is that layers learn

244
00:11:59.279 --> 00:12:02.480
<v Speaker 2>features hierarchically. How so? Like, the first layer might

245
00:12:02.559 --> 00:12:05.960
<v Speaker 2>learn simple things like edges or corners from pixels, the

246
00:12:06.000 --> 00:12:09.440
<v Speaker 2>next layer combines those into shapes, the layer after that

247
00:12:09.480 --> 00:12:13.000
<v Speaker 2>combines shapes into objects, and so on. It builds up complexity.

248
00:12:13.279 --> 00:12:16.480
<v Speaker 1>So potentially a more sophisticated understanding of the data. But

249
00:12:16.519 --> 00:12:19.399
<v Speaker 1>the book is clear, right? There's no magic formula for

250
00:12:19.480 --> 00:12:21.000
<v Speaker 1>the number of layers or neurons.

251
00:12:21.279 --> 00:12:24.559
<v Speaker 2>Absolutely not. It's very much problem dependent. Finding the right

252
00:12:24.679 --> 00:12:28.759
<v Speaker 2>architecture usually involves a lot of trial and error experimentation,

253
00:12:28.960 --> 00:12:32.399
<v Speaker 2>maybe drawing on architectures known to work well for similar problems.

254
00:12:32.559 --> 00:12:36.559
<v Speaker 1>Okay, we've got network structure, learning algorithms, how to spot overfitting.

255
00:12:37.080 --> 00:12:41.440
<v Speaker 1>What about making the training itself better, faster, more reliable?

256
00:12:41.600 --> 00:12:44.399
<v Speaker 2>One key area is tweaking the learning rate during training

257
00:12:44.759 --> 00:12:48.399
<v Speaker 2>instead of just fixing it. Using learning rate decay is common.

258
00:12:48.159 --> 00:12:50.559
<v Speaker 1>So starting higher and then reducing it over time.

259
00:12:50.879 --> 00:12:54.440
<v Speaker 2>Exactly, you might start with a relatively large learning rate

260
00:12:54.559 --> 00:12:57.240
<v Speaker 2>to make quick progress when you're far from the solution.

261
00:12:58.120 --> 00:13:00.399
<v Speaker 2>Then as the training goes on and you get closer

262
00:13:00.399 --> 00:13:03.440
<v Speaker 2>to the minimum, you gradually decrease the learning rate to

263
00:13:03.519 --> 00:13:07.399
<v Speaker 2>take smaller, finer steps. This helps avoid that oscillation we

264
00:13:07.440 --> 00:13:10.039
<v Speaker 2>talked about and allows for more precise convergence.

265
00:13:10.480 --> 00:13:11.919
<v Speaker 1>What are common ways to decay it?

266
00:13:12.519 --> 00:13:15.200
<v Speaker 2>The book mentions things like inverse time decay or

267
00:13:15.279 --> 00:13:19.399
<v Speaker 2>exponential decay, where the rate decreases smoothly over training iterations.

268
00:13:19.480 --> 00:13:22.039
<v Speaker 2>It's usually tied to the iteration count, not just the

269
00:13:22.080 --> 00:13:22.919
<v Speaker 2>epoch count.
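
NOTE
Editorial sketch, not taken from the book: common textbook forms of the two
decay schedules named above, both driven by the iteration count. The parameter
names (lr0, decay_rate, decay_steps) and sample values are assumptions, and
the book's exact formulas may differ in detail.
```python
def inverse_time_decay(lr0: float, decay_rate: float, step: int) -> float:
    # The rate falls off like 1 / (1 + decay_rate * step).
    return lr0 / (1.0 + decay_rate * step)
def exponential_decay(lr0: float, decay_rate: float, step: int, decay_steps: int) -> float:
    # The rate is multiplied by decay_rate once every decay_steps iterations.
    return lr0 * decay_rate ** (step / decay_steps)
# Hypothetical schedule: start at 0.1 and sample the rate every 100 iterations.
inverse_schedule = [inverse_time_decay(0.1, 0.01, s) for s in range(0, 500, 100)]
exponential_schedule = [exponential_decay(0.1, 0.5, s, 100) for s in range(0, 500, 100)]
```
Both schedules start at the initial rate and shrink smoothly, so early steps
are large and later steps are finer.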

270
00:13:23.000 --> 00:13:27.000
<v Speaker 1>And then there are fancier optimization algorithms beyond just basic

271
00:13:27.000 --> 00:13:28.360
<v Speaker 1>gradient descent with decay.

272
00:13:28.440 --> 00:13:31.000
<v Speaker 2>Oh yes, these aim to speed up training and make

273
00:13:31.039 --> 00:13:33.559
<v Speaker 2>it more robust. Many of them rely on the idea

274
00:13:33.600 --> 00:13:35.440
<v Speaker 2>of exponentially weighted averages.

275
00:13:35.600 --> 00:13:37.120
<v Speaker 1>Okay, what's the intuition there.

276
00:13:37.279 --> 00:13:40.159
<v Speaker 2>Instead of just using the gradient from the current mini batch,

277
00:13:40.200 --> 00:13:42.879
<v Speaker 2>which can be noisy, these methods keep a running average

278
00:13:42.879 --> 00:13:46.200
<v Speaker 2>of recent gradients. This average smooths out the noise and

279
00:13:46.240 --> 00:13:49.159
<v Speaker 2>gives a better estimate of the true downhill direction. It

280
00:13:49.240 --> 00:13:52.200
<v Speaker 2>helps the optimizer build up momentum to get through flat

281
00:13:52.240 --> 00:13:55.519
<v Speaker 2>regions or damp down oscillations in narrow valleys of the

282
00:13:55.559 --> 00:13:56.200
<v Speaker 2>cost function.
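
NOTE
Editorial sketch, not taken from the book: an exponentially weighted average
smoothing a noisy sequence of gradients. The beta value and the alternating
sample gradients are illustrative assumptions; no bias correction is applied.
```python
def exponentially_weighted_average(values, beta: float = 0.9):
    # v_t = beta * v_(t-1) + (1 - beta) * g_t, starting from v_0 = 0.
    v = 0.0
    averaged = []
    for g in values:
        v = beta * v + (1.0 - beta) * g
        averaged.append(v)
    return averaged
# Hypothetical noisy gradients that bounce between 1.5 and 0.5 (true mean 1.0).
noisy = [1.5, 0.5] * 20
smooth = exponentially_weighted_average(noisy)
raw_spread = max(noisy[-10:]) - min(noisy[-10:])
smooth_spread = max(smooth[-10:]) - min(smooth[-10:])
```
The averaged sequence settles close to the underlying mean and swings far
less than the raw gradients, which is the smoothing the speakers describe.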

283
00:13:56.519 --> 00:13:58.120
<v Speaker 1>So it's like smoothing out the bumps in the road,

284
00:13:58.559 --> 00:14:02.360
<v Speaker 1>and that leads to optimizers like momentum, RMSProp, Adam?

285
00:14:02.159 --> 00:14:05.519
<v Speaker 2>Exactly those. Momentum adds a fraction of the previous

286
00:14:05.639 --> 00:14:09.039
<v Speaker 2>update step to the current one. RMSProp adapts the learning

287
00:14:09.080 --> 00:14:12.039
<v Speaker 2>rate for each parameter individually based on the average size

288
00:14:12.039 --> 00:14:15.120
<v Speaker 2>of recent gradients for that parameter, and Adam, as the

289
00:14:15.159 --> 00:14:18.000
<v Speaker 2>source suggests, kind of combines the ideas of momentum and

290
00:14:18.159 --> 00:14:22.080
<v Speaker 2>RMSProp. It's often the default go-to optimizer because

291
00:14:22.120 --> 00:14:24.279
<v Speaker 2>it tends to work well across a wide range of

292
00:14:24.320 --> 00:14:28.279
<v Speaker 2>problems with relatively little tuning.

293
00:14:28.279 --> 00:14:30.879
<v Speaker 1>Usually faster and better, the book says. Now, let's circle back to fighting overfitting.

294
00:14:31.039 --> 00:14:34.480
<v Speaker 1>We mentioned the train/dev split. What about techniques built

295
00:14:34.480 --> 00:14:37.960
<v Speaker 1>into the training process itself? Regularization?

296
00:14:38.080 --> 00:14:42.200
<v Speaker 2>Right. Regularization methods are specifically designed to prevent overfitting and help

297
00:14:42.240 --> 00:14:45.039
<v Speaker 2>the model generalize better to data it hasn't seen before.

298
00:14:45.159 --> 00:14:48.039
<v Speaker 1>The book talks about L2 and L1 regularization.

299
00:14:48.519 --> 00:14:49.240
<v Speaker 1>What's the difference?

300
00:14:49.480 --> 00:14:52.440
<v Speaker 2>Both work by adding a penalty term to the cost function.

301
00:14:53.240 --> 00:14:56.159
<v Speaker 2>This penalty is based on the size of the network's weights.

302
00:14:56.679 --> 00:15:00.440
<v Speaker 2>L2 regularization, sometimes called weight decay, adds a penalty

303
00:15:00.480 --> 00:15:03.159
<v Speaker 2>proportional to the sum of the squares of all the weights.

304
00:15:04.000 --> 00:15:07.600
<v Speaker 2>It pushes weights towards zero, but not usually exactly zero.

305
00:15:08.159 --> 00:15:11.919
<v Speaker 2>It encourages smaller weights overall, making the model simpler. And

306
00:15:12.240 --> 00:15:15.919
<v Speaker 2>L1 regularization adds a penalty proportional to the

307
00:15:15.919 --> 00:15:19.039
<v Speaker 2>sum of the absolute values of the weights. It also

308
00:15:19.120 --> 00:15:22.200
<v Speaker 2>pushes weights towards zero, but because of the math involved

309
00:15:22.360 --> 00:15:24.799
<v Speaker 2>the shape of the penalty function, it tends to make

310
00:15:24.879 --> 00:15:27.039
<v Speaker 2>many weights exactly zero.

311
00:15:26.960 --> 00:15:30.600
<v Speaker 1>So it leads to sparser models where some connections are effectively

312
00:15:30.600 --> 00:15:31.720
<v Speaker 1>turned off.

313
00:15:32.240 --> 00:15:34.840
<v Speaker 2>Exactly. L1 can be useful for feature selection in a

314
00:15:34.879 --> 00:15:37.960
<v Speaker 2>way because it zeros out weights for less important inputs.
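
NOTE
Editorial sketch, not taken from the book: the two penalty terms as described,
added on top of a base cost. The strength lam, the sample weights, and the
base cost are illustrative assumptions; texts differ on extra scaling factors.
```python
def l2_penalty(weights, lam: float) -> float:
    # Proportional to the sum of squared weights.
    return lam * sum(w * w for w in weights)
def l1_penalty(weights, lam: float) -> float:
    # Proportional to the sum of absolute weights.
    return lam * sum(abs(w) for w in weights)
# Hypothetical weights and base cost.
weights = [0.5, -2.0, 0.0]
base_cost = 1.0
cost_l2 = base_cost + l2_penalty(weights, lam=0.1)
cost_l1 = base_cost + l1_penalty(weights, lam=0.1)
```
The large weight (-2.0) dominates the L2 term because it is squared, while
the L1 term charges every weight in proportion to its size.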

315
00:15:38.000 --> 00:15:40.320
<v Speaker 1>Then there's dropout, which sounds completely different.

316
00:15:40.440 --> 00:15:42.879
<v Speaker 2>It is quite different. Yeah, dropout is a very clever

317
00:15:43.000 --> 00:15:46.360
<v Speaker 2>and widely used technique. During each training iteration, you randomly

318
00:15:46.440 --> 00:15:49.840
<v Speaker 2>drop out, that is, temporarily remove, a fraction of the neurons in

319
00:15:49.879 --> 00:15:51.039
<v Speaker 2>certain layers.

320
00:15:50.840 --> 00:15:52.559
<v Speaker 1>Just randomly ignore them for that update.

321
00:15:52.840 --> 00:15:56.320
<v Speaker 2>Yep, for that one mini batch calculation, those neurons and

322
00:15:56.360 --> 00:16:00.480
<v Speaker 2>their connections are just gone. In the next iteration, a

323
00:16:00.519 --> 00:16:02.279
<v Speaker 2>different random set might be dropped.

324
00:16:02.320 --> 00:16:03.039
<v Speaker 1>How does that help?

325
00:16:03.279 --> 00:16:06.080
<v Speaker 2>It prevents the network from becoming too reliant on any

326
00:16:06.120 --> 00:16:10.399
<v Speaker 2>single neuron or specific pathway. Since any neuron might disappear,

327
00:16:10.480 --> 00:16:14.519
<v Speaker 2>the network is forced to learn more robust, redundant representations.

328
00:16:14.919 --> 00:16:17.679
<v Speaker 2>It's kind of like training a large ensemble of slightly

329
00:16:17.720 --> 00:16:19.240
<v Speaker 2>different networks all at once.
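
NOTE
Editor's sketch of dropout on one layer's activations for a single mini-batch. It assumes the common "inverted dropout" variant, which rescales the surviving units by 1 / keep_prob; the transcript does not say which variant the book uses, and dropout_forward and keep_prob are illustrative names.
```python
import numpy as np
def dropout_forward(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask
rng = np.random.default_rng(0)
activations = np.ones((4, 5))  # hypothetical layer output for a mini-batch of 4
dropped, mask = dropout_forward(activations, keep_prob=0.8, rng=rng)
print(dropped)
```
Calling the function again draws a new mask, so a different random set of units is removed on the next iteration. At evaluation time the layer would be used without any mask.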

330
00:16:19.320 --> 00:16:22.399
<v Speaker 1>Yeah, that makes sense. Forces redundancy. The source notes that

331
00:16:22.480 --> 00:16:24.360
<v Speaker 1>it can make the training cost jump around a bit more, though.

332
00:16:24.159 --> 00:16:27.360
<v Speaker 2>Yes, because the network structure is

333
00:16:27.600 --> 00:16:31.320
<v Speaker 2>literally changing slightly on every iteration due to the randomness.

334
00:16:31.840 --> 00:16:34.799
<v Speaker 2>So the training metric might look a bit noisier, but

335
00:16:34.879 --> 00:16:37.720
<v Speaker 2>it often leads to much better generalization on the dev

336
00:16:37.799 --> 00:16:38.600
<v Speaker 2>and test sets.

337
00:16:38.840 --> 00:16:42.840
<v Speaker 1>Okay, so we've trained our model, applied regularization. How do

338
00:16:42.919 --> 00:16:46.679
<v Speaker 1>we really know if it's any good? Evaluation seems critical.

339
00:16:46.440 --> 00:16:50.000
<v Speaker 2>Absolutely crucial, and just looking at training error isn't enough.

340
00:16:50.200 --> 00:16:53.399
<v Speaker 2>The book brings up human-level performance, HLP, and Bayes

341
00:16:53.559 --> 00:16:57.559
<v Speaker 2>error. For tasks humans are good at, like recognizing

342
00:16:57.559 --> 00:17:01.639
<v Speaker 2>images or transcribing speech. HLP can be a practical estimate

343
00:17:01.679 --> 00:17:05.519
<v Speaker 2>for the theoretical best possible error, the Bayes error. Bayes

344
00:17:05.640 --> 00:17:09.599
<v Speaker 2>error is the irreducible error rate. No model, however perfect,

345
00:17:09.720 --> 00:17:12.359
<v Speaker 2>could do better due to inherent ambiguity or noise in

346
00:17:12.400 --> 00:17:13.279
<v Speaker 2>the data itself.

347
00:17:13.400 --> 00:17:15.759
<v Speaker 1>So knowing the HLP gives you a target, like what's

348
00:17:15.799 --> 00:17:16.839
<v Speaker 1>potentially achievable?

349
00:17:16.960 --> 00:17:20.799
<v Speaker 2>Exactly. If human error on a task is, say, one percent,

350
00:17:20.839 --> 00:17:23.240
<v Speaker 2>and your model has ten percent error, you know there's

351
00:17:23.319 --> 00:17:25.200
<v Speaker 2>likely a lot of room for improvement. If your model

352
00:17:25.240 --> 00:17:27.400
<v Speaker 2>is at one point five percent, maybe you're getting close

353
00:17:27.440 --> 00:17:30.799
<v Speaker 2>to the limit. The book uses MNIST digit recognition, where

354
00:17:30.960 --> 00:17:33.319
<v Speaker 2>HLP is cited around zero point two percent error.

355
00:17:33.359 --> 00:17:36.799
<v Speaker 1>Okay, HLP or Bayes error is the theoretical floor. How do

356
00:17:36.839 --> 00:17:39.480
<v Speaker 1>we diagnose our model's specific shortcomings?

357
00:17:39.519 --> 00:17:42.319
<v Speaker 2>The book introduces a simple framework called the metric analysis

358
00:17:42.319 --> 00:17:45.359
<v Speaker 2>diagram, or MAD. It helps you pinpoint where the error

359
00:17:45.400 --> 00:17:47.079
<v Speaker 2>is coming from by looking at different gaps.

360
00:17:47.119 --> 00:17:48.279
<v Speaker 1>Let's walk through those gaps.

361
00:17:48.400 --> 00:17:53.559
<v Speaker 2>Okay. First gap: bias, or sometimes avoidable bias. This is

362
00:17:53.599 --> 00:17:57.160
<v Speaker 2>the difference between the Bayes error or HLP and your

363
00:17:57.240 --> 00:17:59.880
<v Speaker 2>training error. If this gap is large, it means your

364
00:17:59.839 --> 00:18:03.519
<v Speaker 2>model isn't even fitting the training data well. It's likely

365
00:18:03.559 --> 00:18:08.160
<v Speaker 2>too simple, underfitting, or the training algorithm itself isn't finding

366
00:18:08.160 --> 00:18:08.839
<v Speaker 2>a good solution.

367
00:18:09.039 --> 00:18:11.640
<v Speaker 1>Okay, so bias is about performance on data it's already

368
00:18:11.680 --> 00:18:15.039
<v Speaker 1>seen relative to the best possible. What's the next gap?

369
00:18:15.400 --> 00:18:18.240
<v Speaker 2>Variance. This is the difference between your training error and

370
00:18:18.279 --> 00:18:20.720
<v Speaker 2>your development set error. If your training error is low

371
00:18:20.880 --> 00:18:23.319
<v Speaker 2>but your dev error is much higher, that's a classic

372
00:18:23.359 --> 00:18:27.079
<v Speaker 2>sign of overfitting. The model learned the training data specifics,

373
00:18:27.079 --> 00:18:29.680
<v Speaker 2>but isn't generalizing. High variance.

374
00:18:29.640 --> 00:18:31.839
<v Speaker 1>And there's potentially a third gap mentioned.

375
00:18:31.559 --> 00:18:33.839
<v Speaker 2>Yes, overfitting on the dev set. This is the gap

376
00:18:33.880 --> 00:18:36.279
<v Speaker 2>between your dev set error and your error on a completely

377
00:18:36.279 --> 00:18:39.880
<v Speaker 2>separate test set. If you tune your hyperparameters extensively based

378
00:18:39.880 --> 00:18:43.079
<v Speaker 2>on the dev set results, you might inadvertently make your model

379
00:18:43.079 --> 00:18:45.799
<v Speaker 2>perform well specifically on that dev set, but it might

380
00:18:45.839 --> 00:18:48.240
<v Speaker 2>not generalize as well to totally new data.

381
00:18:48.279 --> 00:18:50.799
<v Speaker 1>Ah, so you've sort of used up the dev set

382
00:18:50.880 --> 00:18:54.599
<v Speaker 1>for unbiased evaluation by tuning on it too much. That's

383
00:18:54.640 --> 00:18:56.839
<v Speaker 1>why you need that final untouched test set.

384
00:18:56.400 --> 00:18:59.119
<v Speaker 2>Precisely. Keep the test set sacred until the

385
00:18:59.240 --> 00:19:01.599
<v Speaker 2>very end for a final honest assessment.
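
NOTE
Editor's sketch of the three gaps in the metric analysis diagram as just described. The function name mad_gaps and all four error rates are hypothetical numbers chosen for illustration, with HLP standing in for the Bayes error.
```python
def mad_gaps(bayes_error, train_error, dev_error, test_error):
    """Return the three gaps discussed above as a dict of error differences."""
    return {
        "bias": train_error - bayes_error,         # not fitting the training data
        "variance": dev_error - train_error,       # overfitting the training data
        "dev_overfitting": test_error - dev_error, # tuned too hard on the dev set
    }
# Hypothetical error rates: HLP 1%, train 3%, dev 8%, test 9%.
gaps = mad_gaps(bayes_error=0.01, train_error=0.03, dev_error=0.08, test_error=0.09)
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```
With these made-up numbers the variance gap dominates, which would point towards regularization or more training data rather than a bigger model.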

386
00:19:01.759 --> 00:19:04.799
<v Speaker 1>This all really highlights how crucial that initial data split

387
00:19:04.920 --> 00:19:07.000
<v Speaker 1>is: train, dev, test.

388
00:19:07.039 --> 00:19:11.480
<v Speaker 2>And the book emphasizes a critical point. Your dev

389
00:19:11.519 --> 00:19:14.599
<v Speaker 2>and test sets must reflect the real-world data distribution

390
00:19:14.680 --> 00:19:16.720
<v Speaker 2>your model will actually see.

391
00:19:16.480 --> 00:19:18.559
<v Speaker 1>What kinds of problems happen if they don't?

392
00:19:18.599 --> 00:19:21.759
<v Speaker 2>Well, a big one is unbalanced classes. The book mentions examples

393
00:19:21.839 --> 00:19:25.279
<v Speaker 2>like detecting rare fraud or maybe identifying only certain digits

394
00:19:25.279 --> 00:19:28.079
<v Speaker 2>in MNIST. If, say, fraud is only point one

395
00:19:28.079 --> 00:19:29.880
<v Speaker 2>percent of your real data, but your dev set is

396
00:19:29.920 --> 00:19:33.079
<v Speaker 2>balanced fifty-fifty, your dev set accuracy won't tell you

397
00:19:33.119 --> 00:19:36.200
<v Speaker 2>how the model does on the real skewed distribution.

398
00:19:36.079 --> 00:19:38.559
<v Speaker 1>Right, because getting ninety nine point nine percent accuracy by just

399
00:19:38.599 --> 00:19:41.720
<v Speaker 1>always predicting not fraud would look great on the real data,

400
00:19:41.759 --> 00:19:44.759
<v Speaker 1>but terrible on the balanced dev set, or vice versa.
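
NOTE
Editor's worked example of the arithmetic behind this exchange: a classifier that always answers "not fraud" is scored on a skewed set and on a balanced set. The function name is illustrative; the 0.1 percent and fifty-fifty figures are the ones mentioned in the conversation.
```python
def accuracy_always_negative(positive_fraction):
    """Accuracy of a classifier that predicts the negative class for every example."""
    return 1.0 - positive_fraction
real_world = accuracy_always_negative(0.001)  # 0.1% fraud in the real data
balanced_dev = accuracy_always_negative(0.5)  # fifty-fifty dev set
print(real_world, balanced_dev)
```
The same useless model scores 99.9 percent on one distribution and 50 percent on the other, so accuracy on a mismatched dev set says little about real-world behaviour.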

401
00:19:44.799 --> 00:19:48.039
<v Speaker 2>Exactly. So, especially with unbalanced data, the book stresses looking

402
00:19:48.119 --> 00:19:51.920
<v Speaker 2>beyond plain accuracy. You need metrics like the confusion matrix.

403
00:19:51.559 --> 00:19:55.400
<v Speaker 1>Which shows true positives, false positives, true negatives, false negatives.

404
00:19:55.720 --> 00:19:59.400
<v Speaker 2>Right. And from that you calculate precision, how many of the

405
00:19:59.440 --> 00:20:03.119
<v Speaker 2>positive predictions were actually positive, and recall, how many of

406
00:20:03.160 --> 00:20:07.079
<v Speaker 2>the actual positives did you find? And often the F

407
00:20:07.160 --> 00:20:09.880
<v Speaker 2>one score, which combines precision and recall into one number.

408
00:20:09.799 --> 00:20:12.920
<v Speaker 1>That gives you a much more nuanced picture of performance. What

409
00:20:12.960 --> 00:20:15.480
<v Speaker 1>about when the training data itself is just different from

410
00:20:15.519 --> 00:20:16.759
<v Speaker 1>the evaluation data?
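
NOTE
Editor's sketch of precision, recall, and the F1 score as defined a few cues earlier, computed from confusion-matrix counts. The function name and the counts (8 true positives, 2 false positives, 12 false negatives) are hypothetical.
```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
# Hypothetical fraud detector: finds 8 of 20 real frauds and raises 2 false alarms.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=12)
print(precision, recall, f1)
```
F1 is the harmonic mean of the two, so it stays low whenever either precision or recall is low.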

411
00:20:16.839 --> 00:20:20.279
<v Speaker 2>That's another major challenge. Maybe you trained on high quality images,

412
00:20:20.319 --> 00:20:23.240
<v Speaker 2>but you need to evaluate on blurry phone pictures, or

413
00:20:23.279 --> 00:20:27.079
<v Speaker 2>trained on data from one country but are deploying in another. Performance

414
00:20:27.079 --> 00:20:30.039
<v Speaker 2>will almost certainly drop if the distributions don't match. You

415
00:20:30.119 --> 00:20:32.240
<v Speaker 2>need to be aware of that potential mismatch.

416
00:20:32.640 --> 00:20:35.720
<v Speaker 1>For situations with smaller data sets, the book brings up

417
00:20:35.799 --> 00:20:37.400
<v Speaker 1>k-fold cross-validation.

418
00:20:37.880 --> 00:20:41.359
<v Speaker 2>Yes, it's a really useful technique when you can't afford large,

419
00:20:41.400 --> 00:20:44.519
<v Speaker 2>separate dev and test sets. You split your data into

420
00:20:44.680 --> 00:20:47.359
<v Speaker 2>say five or ten folds. Then you train the model

421
00:20:47.400 --> 00:20:50.039
<v Speaker 2>five or ten times. Each time, you hold out one

422
00:20:50.039 --> 00:20:52.960
<v Speaker 2>fold for validation and train on the remaining folds. Then

423
00:20:53.000 --> 00:20:55.440
<v Speaker 2>you average the validation performance across all the folds.

424
00:20:55.279 --> 00:20:58.559
<v Speaker 1>So you get a more robust estimate of performance,

425
00:20:58.960 --> 00:21:01.440
<v Speaker 1>less dependent on one specific split.

426
00:21:01.480 --> 00:21:04.319
<v Speaker 2>Exactly. It gives a better sense of generalization and helps check

427
00:21:04.359 --> 00:21:06.640
<v Speaker 2>for overfitting, especially with limited data.
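
NOTE
Editor's sketch of the k-fold procedure described above, using only the standard library. The names k_fold_indices, cross_validate, and train_and_score are hypothetical; train_and_score stands for whatever routine trains a model on the training indices and returns a validation metric.
```python
import random
def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
def cross_validate(n_samples, k, train_and_score):
    """Average the validation score returned by train_and_score over all folds."""
    scores = [train_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
# Placeholder scorer: a real one would train on `train` and evaluate on `val`.
mean_score = cross_validate(10, 5, lambda train, val: len(val) / 10)
print(mean_score)
```
Every sample lands in the validation fold exactly once, which is why the averaged score depends less on any single split.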

428
00:21:06.799 --> 00:21:10.440
<v Speaker 1>Okay, evaluation tells us what's wrong. To fix things, we

429
00:21:10.599 --> 00:21:14.920
<v Speaker 1>often need to adjust those settings we don't learn directly,

430
00:21:15.200 --> 00:21:17.000
<v Speaker 1>the hyperparameters.

431
00:21:16.480 --> 00:21:20.079
<v Speaker 2>Hyperparameter tuning. It's about finding the best values for

432
00:21:20.160 --> 00:21:23.079
<v Speaker 2>things like the learning rate, the number of layers, neurons

433
00:21:23.079 --> 00:21:26.759
<v Speaker 2>per layer, which optimizer to use, the strength of regularization

434
00:21:27.000 --> 00:21:30.000
<v Speaker 2>like that L two penalty, the mini batch size, how

435
00:21:30.079 --> 00:21:33.920
<v Speaker 2>many epochs to train for, the weight initialization method. The

436
00:21:33.960 --> 00:21:35.000
<v Speaker 2>list goes on.

437
00:21:35.079 --> 00:21:37.559
<v Speaker 1>And the book frames this as trying to optimize a black

438
00:21:37.599 --> 00:21:39.240
<v Speaker 1>box function. What does that mean?

439
00:21:39.640 --> 00:21:42.160
<v Speaker 2>It means you can't just calculate a derivative to find

440
00:21:42.200 --> 00:21:45.519
<v Speaker 2>the best setting. The function takes hyperparameters as input

441
00:21:45.799 --> 00:21:49.079
<v Speaker 2>and its output is the model's performance, like dev set accuracy

442
00:21:49.160 --> 00:21:53.160
<v Speaker 2>after training. But evaluating that function, actually training the network

443
00:21:53.200 --> 00:21:56.720
<v Speaker 2>with those settings, is computationally expensive, often taking hours or days,

444
00:21:57.200 --> 00:21:59.960
<v Speaker 2>and you have potentially many hyperparameters. So the search

445
00:22:00.039 --> 00:22:00.880
<v Speaker 2>space is huge.

446
00:22:01.160 --> 00:22:04.079
<v Speaker 1>So what are the basic strategies for searching this space?

447
00:22:04.400 --> 00:22:08.359
<v Speaker 2>The simplest are grid search and random search. Grid search

448
00:22:08.440 --> 00:22:11.839
<v Speaker 2>is systematic. You define a grid of possible values for

449
00:22:11.920 --> 00:22:15.599
<v Speaker 2>each hyperparameter and try every single combination.

450
00:22:15.359 --> 00:22:18.720
<v Speaker 1>Which sounds thorough, but the book warns about the curse of

451
00:22:18.759 --> 00:22:20.039
<v Speaker 1>dimensionality, right?

452
00:22:19.960 --> 00:22:23.680
<v Speaker 2>Absolutely. If you have even just a few hyperparameters

453
00:22:23.720 --> 00:22:27.200
<v Speaker 2>with several values each, the total number of combinations explodes.

454
00:22:27.480 --> 00:22:29.920
<v Speaker 2>It becomes computationally infeasible very quickly.

455
00:22:29.799 --> 00:22:31.759
<v Speaker 1>So, random search?

456
00:22:31.920 --> 00:22:35.000
<v Speaker 2>Random search often works better in practice, especially in high

457
00:22:35.039 --> 00:22:38.559
<v Speaker 2>dimensional spaces. You define ranges for your hyperparameters, and

458
00:22:38.599 --> 00:22:41.920
<v Speaker 2>then you just sample random combinations within those ranges. The

459
00:22:41.960 --> 00:22:44.960
<v Speaker 2>insight is that usually only a few hyperparameters really

460
00:22:44.960 --> 00:22:48.160
<v Speaker 2>dominate performance. Random search has a better chance of landing

461
00:22:48.160 --> 00:22:50.880
<v Speaker 2>on good values for those important ones compared to grid search,

462
00:22:50.920 --> 00:22:53.799
<v Speaker 2>which wastes a lot of time testing combinations where unimportant

463
00:22:53.839 --> 00:22:54.720
<v Speaker 2>parameters vary.
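
NOTE
Editor's sketch contrasting the two strategies with the same budget of nine trials over two hyperparameters. The value lists and variable names are hypothetical; the learning rate is drawn on a log scale, which the speakers discuss shortly after this point.
```python
import itertools
import random
lr_grid = [0.001, 0.01, 0.1]
batch_grid = [32, 64, 128]
# Grid search: every combination, so 9 trials but only 3 distinct learning rates.
grid_trials = list(itertools.product(lr_grid, batch_grid))
# Random search: the same 9 trials, each with a freshly sampled learning rate.
rng = random.Random(0)
random_trials = [(10 ** rng.uniform(-3, -1), rng.choice(batch_grid)) for _ in range(9)]
distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
# The grid also grows exponentially: 6 hyperparameters with 5 values each.
n_combinations = 5 ** 6
print(distinct_lr_grid, distinct_lr_random, n_combinations)
```
If the learning rate is the hyperparameter that dominates performance, random search has explored nine values of it where the grid explored three, for the same training cost.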

464
00:22:55.079 --> 00:22:57.240
<v Speaker 1>The book makes a really key point about how to

465
00:22:57.279 --> 00:23:01.000
<v Speaker 1>sample certain parameters like the learning rate, not linearly.

466
00:23:01.440 --> 00:23:04.680
<v Speaker 2>Yes, this is crucial. For parameters like learning rates or

467
00:23:04.759 --> 00:23:08.039
<v Speaker 2>regularization strengths that often work best across different orders of

468
00:23:08.079 --> 00:23:11.559
<v Speaker 2>magnitude, like 0.1, 0.01,

469
00:23:11.599 --> 00:23:14.880
<v Speaker 2>0.001, sampling them on a logarithmic scale

470
00:23:14.920 --> 00:23:17.319
<v Speaker 2>is much more effective. Why is that? If you sample

471
00:23:17.359 --> 00:23:20.440
<v Speaker 2>learning rate linearly between, say, 0.0001

472
00:23:20.440 --> 00:23:22.519
<v Speaker 2>and 0.1, most of your samples will be clustered

473
00:23:22.559 --> 00:23:25.519
<v Speaker 2>up near 0.1. You'll barely test the smaller values.

474
00:23:25.640 --> 00:23:28.119
<v Speaker 1>Ah, because the range 0.01 to 0.1

475
00:23:28.160 --> 00:23:31.400
<v Speaker 1>is much wider than 0.0001 to

476
00:23:31.480 --> 00:23:33.799
<v Speaker 1>0.001 on a linear scale.

477
00:23:33.880 --> 00:23:36.839
<v Speaker 2>Right, But if you sample uniformly on a log scale,

478
00:23:36.880 --> 00:23:39.839
<v Speaker 2>maybe by sampling an exponent r uniformly between minus four

479
00:23:39.960 --> 00:23:42.359
<v Speaker 2>and minus one and using ten to the power r as your learning rate,

480
00:23:42.680 --> 00:23:45.079
<v Speaker 2>you distribute your search effort much more evenly across those

481
00:23:45.119 --> 00:23:47.680
<v Speaker 2>critical orders of magnitude. You're just as likely to test

482
00:23:47.759 --> 00:23:50.480
<v Speaker 2>values around 0.001 as values around

483
00:23:50.559 --> 00:23:51.000
<v Speaker 2>0.01.
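
NOTE
Editor's sketch of the sampling trick just described: draw an exponent r uniformly between -4 and -1 and use 10 to the power r as the learning rate, then compare with plain linear sampling over the same range. The helper names are hypothetical.
```python
import random
def sample_learning_rate(rng, low_exp=-4, high_exp=-1):
    """Sample a learning rate uniformly on a log scale between 10**low_exp and 10**high_exp."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r
def fraction_below(samples, threshold):
    """Fraction of samples that are smaller than threshold."""
    return sum(s < threshold for s in samples) / len(samples)
rng = random.Random(0)
log_samples = [sample_learning_rate(rng) for _ in range(10000)]
linear_samples = [rng.uniform(1e-4, 1e-1) for _ in range(10000)]
log_fraction = fraction_below(log_samples, 1e-2)        # about two thirds
linear_fraction = fraction_below(linear_samples, 1e-2)  # about one tenth
print(log_fraction, linear_fraction)
```
On the log scale roughly two of the three decades fall below 0.01, so about two thirds of the samples do as well; with linear sampling only about ten percent of the trials ever test those smaller values.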

484
00:23:51.119 --> 00:23:53.359
<v Speaker 1>That makes a lot of sense for finding those sweet spots. Yeah.

485
00:23:53.359 --> 00:23:55.640
<v Speaker 1>Does the book mention more sophisticated tuning methods?

486
00:23:55.880 --> 00:24:00.000
<v Speaker 2>It briefly touches on things like Bayesian optimization. These are smarter

487
00:24:00.000 --> 00:24:03.880
<v Speaker 2>search strategies. They build a probabilistic model, like a

488
00:24:03.880 --> 00:24:07.799
<v Speaker 2>Gaussian process, of how hyperparameters relate to performance based

489
00:24:07.799 --> 00:24:10.799
<v Speaker 2>on the trials run so far. Then they use that

490
00:24:10.880 --> 00:24:14.759
<v Speaker 2>model to intelligently decide which combination of hyperparameters to

491
00:24:14.839 --> 00:24:19.880
<v Speaker 2>try next, balancing exploring areas they're uncertain about versus exploiting

492
00:24:19.920 --> 00:24:22.160
<v Speaker 2>areas that already look promising.

493
00:24:21.759 --> 00:24:24.680
<v Speaker 1>Trying to learn the black box function to optimize it faster.

494
00:24:24.839 --> 00:24:28.039
<v Speaker 2>That's the basic idea. Yeah, more complex, but potentially much

495
00:24:28.039 --> 00:24:31.400
<v Speaker 2>more efficient than random search if evaluations are very expensive.

496
00:24:31.720 --> 00:24:35.200
<v Speaker 1>Now we mostly talked about standard fully connected networks, but

497
00:24:35.240 --> 00:24:38.680
<v Speaker 1>the book also covers specialized architectures for specific data types.

498
00:24:38.960 --> 00:24:41.920
<v Speaker 2>Right, because fully connected networks treat every input feature

499
00:24:41.960 --> 00:24:45.079
<v Speaker 2>the same and don't account for spatial or sequential structure

500
00:24:45.119 --> 00:24:47.720
<v Speaker 2>in the data. That's not always ideal.

501
00:24:47.519 --> 00:24:50.359
<v Speaker 1>Which leads us to convolutional neural networks, or CNNs.

502
00:24:50.759 --> 00:24:54.279
<v Speaker 2>CNNs are king for grid-like data, especially images. Their

503
00:24:54.319 --> 00:24:57.720
<v Speaker 2>core operation is the convolution. You have these small filters

504
00:24:57.799 --> 00:25:01.559
<v Speaker 2>called kernels that slide across the input image. Each kernel

505
00:25:01.640 --> 00:25:05.559
<v Speaker 2>is designed, or rather learned, to detect a specific local

506
00:25:05.599 --> 00:25:09.119
<v Speaker 2>pattern or feature, like an edge, a corner, or a texture.
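
NOTE
Editor's sketch of a kernel sliding across an image, with no padding and a stride of one. The hand-written vertical-edge kernel and the 5 by 5 test image are hypothetical; in a real CNN the kernel values are learned. Like most deep learning frameworks, the loop computes a cross-correlation (the kernel is not flipped).
```python
import numpy as np
def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
# Test image: two dark columns on the left, three bright columns on the right.
image = np.zeros((5, 5))
image[:, 2:] = 1.0
vertical_edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)
response = convolve2d_valid(image, vertical_edge_kernel)
print(response)
```
The response is nonzero only where the 3 by 3 window straddles the dark-to-bright boundary and zero over the uniform region, which is the sense in which a kernel "detects" a local pattern.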

507
00:25:09.480 --> 00:25:12.720
<v Speaker 1>The book had those examples of simple kernels detecting horizontal

508
00:25:12.759 --> 00:25:14.359
<v Speaker 1>or vertical lines, right?

509
00:25:14.640 --> 00:25:18.960
<v Speaker 2>Exactly. The network learns hierarchies of these features. CNNs also typically

510
00:25:19.079 --> 00:25:21.000
<v Speaker 2>use pooling layers like max pooling.

511
00:25:21.119 --> 00:25:22.839
<v Speaker 1>What do pooling layers do?

512
00:25:22.759 --> 00:25:26.160
<v Speaker 2>They reduce the size, the spatial dimensions, of the feature maps

513
00:25:26.160 --> 00:25:29.039
<v Speaker 2>coming out of the convolutional layers. This makes the network

514
00:25:29.079 --> 00:25:33.519
<v Speaker 2>computationally cheaper and importantly makes the learned features more robust

515
00:25:33.559 --> 00:25:36.079
<v Speaker 2>to small shifts or distortions in the input image.
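
NOTE
Editor's sketch of max pooling on a single feature map, assuming non-overlapping square windows (window size equal to stride). The function name max_pool2d and the 4 by 4 input are hypothetical.
```python
import numpy as np
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop any ragged edge
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool2d(feature_map)
print(pooled)
```
A 4 by 4 map becomes 2 by 2, and a strong activation that shifts by one pixel inside its window leaves the pooled output unchanged, which is the robustness to small shifts mentioned above.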

516
00:25:36.160 --> 00:25:39.799
<v Speaker 1>Okay, so CNNs for grids like images. What about sequences

517
00:25:39.880 --> 00:25:42.279
<v Speaker 1>like text or time series data?

518
00:25:42.440 --> 00:25:46.000
<v Speaker 2>For that, you have recurrent neural networks, or RNNs. They're

519
00:25:46.000 --> 00:25:49.599
<v Speaker 2>designed specifically for sequential data where the order matters. The

520
00:25:49.680 --> 00:25:52.079
<v Speaker 2>key idea in an RNN is that it has a memory,

521
00:25:52.160 --> 00:25:54.279
<v Speaker 2>a hidden state that gets updated at each step in

522
00:25:54.319 --> 00:25:57.599
<v Speaker 2>the sequence and carries information from previous steps forward, so

523
00:25:57.640 --> 00:25:58.039
<v Speaker 2>it can

524
00:25:57.880 --> 00:26:00.839
<v Speaker 1>remember what happened earlier in the sentence or time series

525
00:26:01.240 --> 00:26:02.400
<v Speaker 1>to help process the current element.

526
00:26:02.319 --> 00:26:06.519
<v Speaker 2>Precisely. This allows RNNs to capture dependencies

527
00:26:06.599 --> 00:26:11.440
<v Speaker 2>and context over time. The book mentions applications like speech recognition,

528
00:26:11.880 --> 00:26:15.920
<v Speaker 2>machine translation, or even generating captions for images by processing

529
00:26:16.039 --> 00:26:17.400
<v Speaker 2>image features sequentially.
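
NOTE
Editor's sketch of the hidden-state update in a basic recurrent network. It assumes the simplest ("vanilla") form with a tanh activation, which the transcript does not specify; the weight names W_xh, W_hh, b_h and the random inputs are hypothetical.
```python
import numpy as np
def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state (the memory).
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)
rng = np.random.default_rng(0)
inputs = rng.normal(size=(6, 3))      # a sequence of 6 steps with 3 features each
W_xh = rng.normal(size=(4, 3)) * 0.5  # input-to-hidden weights
W_hh = rng.normal(size=(4, 4)) * 0.5  # hidden-to-hidden weights
b_h = np.zeros(4)
states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)
```
Because each state is computed from the previous one, changing the first element of the sequence changes every later state, which is how earlier context reaches the current step.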

530
00:26:17.720 --> 00:26:20.680
<v Speaker 1>Very cool. It really shows how the architecture needs to

531
00:26:20.799 --> 00:26:23.799
<v Speaker 1>match the data structure, definitely. Now, to kind of tie

532
00:26:23.799 --> 00:26:27.079
<v Speaker 1>this all together, the book uses a real world research

533
00:26:27.119 --> 00:26:30.680
<v Speaker 1>example and also emphasizes understanding the fundamentals.

534
00:26:30.720 --> 00:26:33.960
<v Speaker 2>Yeah. It includes this interesting project where they use neural networks

535
00:26:34.000 --> 00:26:38.039
<v Speaker 2>for calibrating an oxygen sensor. The traditional approach involved complex

536
00:26:38.119 --> 00:26:42.160
<v Speaker 2>nonlinear physics equations. Instead, they just collected data, sensor readings

537
00:26:42.160 --> 00:26:46.279
<v Speaker 2>and corresponding known oxygen concentrations, and trained a neural network

538
00:26:46.359 --> 00:26:48.720
<v Speaker 2>to learn that mapping directly.

539
00:26:48.480 --> 00:26:51.200
<v Speaker 1>Letting the network figure out the complex relationship from the data itself.

540
00:26:51.279 --> 00:26:54.000
<v Speaker 2>A great example of using deep learning for a

541
00:26:54.079 --> 00:26:56.920
<v Speaker 2>practical regression problem. And then, to really drive home the

542
00:26:56.960 --> 00:27:00.680
<v Speaker 2>fundamentals, the book has you build logistic regression from scratch.

543
00:27:00.599 --> 00:27:03.880
<v Speaker 1>Using just NumPy? No high-level libraries?

544
00:27:03.640 --> 00:27:07.039
<v Speaker 2>None of the deep learning framework magic. You have

545
00:27:07.119 --> 00:27:10.599
<v Speaker 2>to manually code the sigmoid function, calculate the cost function,

546
00:27:11.240 --> 00:27:14.640
<v Speaker 2>figure out the gradient, the derivatives, and implement the gradient

547
00:27:14.680 --> 00:27:16.279
<v Speaker 2>descent update loop yourself.
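
NOTE
Editor's sketch of the four pieces listed above (sigmoid, cost, gradient, update loop) in plain NumPy. This is not the book's code: the function names, the learning rate, the iteration count, and the synthetic two-feature dataset are all assumptions made for illustration.
```python
import numpy as np
def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
def cost(y, y_hat):
    """Mean cross-entropy between labels y and predicted probabilities y_hat."""
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
def train_logistic_regression(X, y, learning_rate=0.1, n_iter=1000):
    """Fit weights w and bias b with batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y            # derivative of the cost w.r.t. z
        w -= learning_rate * (X.T @ error) / n_samples
        b -= learning_rate * np.mean(error)
    return w, b
# Synthetic, linearly separable data: the label is 1 when the two features sum to more than 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = train_logistic_regression(X, y)
final_cost = cost(y, sigmoid(X @ w + b))
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(final_cost, accuracy)
```
Writing the gradient line by hand is the part a framework would normally derive automatically, which is the point the speakers go on to make.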

548
00:27:16.599 --> 00:27:19.720
<v Speaker 1>Why put someone through that pain when TensorFlow exists.

549
00:27:19.960 --> 00:27:22.000
<v Speaker 2>The book's point, and it's a good one, is that

550
00:27:22.119 --> 00:27:25.599
<v Speaker 2>doing it manually gives you a much much deeper understanding

551
00:27:25.640 --> 00:27:29.359
<v Speaker 2>and appreciation for what the frameworks are handling automatically. It

552
00:27:29.440 --> 00:27:32.559
<v Speaker 2>forces you to grapple with the math and the algorithms directly.

553
00:27:33.000 --> 00:27:36.000
<v Speaker 2>The book really believes that understanding how it works under

554
00:27:36.000 --> 00:27:39.960
<v Speaker 2>the hood is essential if you want to effectively debug, optimize,

555
00:27:40.079 --> 00:27:42.799
<v Speaker 2>or even just intelligently use these powerful tools.

556
00:27:43.119 --> 00:27:44.759
<v Speaker 1>It pushes back on the idea that you could just

557
00:27:44.799 --> 00:27:46.200
<v Speaker 1>call functions without knowing what they do.

558
00:27:46.000 --> 00:27:50.119
<v Speaker 2>Exactly. It argues that true effectiveness requires

559
00:27:50.160 --> 00:27:53.200
<v Speaker 2>that deeper grasp, especially as models get more complex.

560
00:27:53.519 --> 00:27:56.079
<v Speaker 1>Wow. Okay, we have really covered a ton of ground here.

561
00:27:56.119 --> 00:27:59.160
<v Speaker 1>Following the source material, we started with computational graphs and

562
00:27:59.240 --> 00:27:59.960
<v Speaker 1>the basic neuron.

563
00:28:00.319 --> 00:28:03.200
<v Speaker 2>Moved into how they learn with cost functions and gradient descent,

564
00:28:03.599 --> 00:28:06.160
<v Speaker 2>including the different variations like mini batch.

565
00:28:06.039 --> 00:28:09.599
<v Speaker 1>Scaled up to feed forward networks, tackled overfitting with train

566
00:28:09.680 --> 00:28:14.000
<v Speaker 1>dev splits, and regularization techniques like L two, L one,

567
00:28:14.160 --> 00:28:14.839
<v Speaker 1>and dropout.

568
00:28:15.000 --> 00:28:18.960
<v Speaker 2>Looked at optimization strategies from learning rate decay to advanced

569
00:28:18.960 --> 00:28:20.240
<v Speaker 2>optimizers like Adam.

570
00:28:20.559 --> 00:28:25.559
<v Speaker 1>Dived into evaluating models properly using things like HLP, the MAD framework,

571
00:28:25.920 --> 00:28:30.480
<v Speaker 1>precision, recall, F one, and thinking about data splits and distributions.

572
00:28:30.599 --> 00:28:34.039
<v Speaker 2>Talked about the challenge of hyperparameter tuning using random search

573
00:28:34.200 --> 00:28:36.039
<v Speaker 2>sampling on log scales.

574
00:28:35.839 --> 00:28:40.079
<v Speaker 1>And even touched on specialized architectures like CNNs for images

575
00:28:40.319 --> 00:28:41.920
<v Speaker 1>and RNNs for sequences.

576
00:28:42.039 --> 00:28:45.079
<v Speaker 2>Plus that practical oxygen sensor example and the value of

577
00:28:45.079 --> 00:28:47.079
<v Speaker 2>building from the ground up with NumPy.

578
00:28:47.200 --> 00:28:49.519
<v Speaker 1>Yeah, we've really tried to pull out the core concepts,

579
00:28:49.519 --> 00:28:52.519
<v Speaker 1>the practical advice, and those interesting details from the Michelucci

580
00:28:52.559 --> 00:28:55.279
<v Speaker 1>book excerpts you provided. The aim was to give you

581
00:28:55.319 --> 00:28:57.000
<v Speaker 1>that solid overview.

582
00:28:56.799 --> 00:28:59.519
<v Speaker 2>A shortcut to being well informed on these deep learning

583
00:28:59.519 --> 00:29:01.759
<v Speaker 2>fundamentals, based directly on the source.

584
00:29:02.079 --> 00:29:04.160
<v Speaker 1>So here's a final thought to leave you with, connecting

585
00:29:04.240 --> 00:29:06.680
<v Speaker 1>some of these threads. We've seen how these models can

586
00:29:06.720 --> 00:29:10.519
<v Speaker 1>get incredibly good at complex tasks, sometimes hitting or even

587
00:29:10.559 --> 00:29:14.119
<v Speaker 1>beating human-level performance, like that point two percent error

588
00:29:14.160 --> 00:29:18.599
<v Speaker 1>on MNIST mentioned. But as the book strongly emphasizes, really

589
00:29:18.680 --> 00:29:22.519
<v Speaker 1>mastering these tools, truly understanding them, debugging them, pushing their

590
00:29:22.559 --> 00:29:26.759
<v Speaker 1>limits requires a serious grasp of the underlying math and algorithms.

591
00:29:27.119 --> 00:29:30.000
<v Speaker 1>It kind of challenges that data-science-for-everyone narrative.

592
00:29:30.039 --> 00:29:33.240
<v Speaker 2>Right. So the question is, as these models become even more

593
00:29:33.279 --> 00:29:37.000
<v Speaker 2>powerful and more accessible through libraries, does the barrier to

594
00:29:37.039 --> 00:29:41.079
<v Speaker 2>true mastery actually get higher? Does it require more fundamental understanding,

595
00:29:41.160 --> 00:29:44.079
<v Speaker 2>not less, to use them effectively and responsibly?

596
00:29:44.680 --> 00:29:46.920
<v Speaker 1>Something to think about as you continue your own journey

597
00:29:46.920 --> 00:29:47.640
<v Speaker 1>with deep learning.
