WEBVTT

1
00:00:00.080 --> 00:00:02.480
<v Speaker 1>Have you ever stopped to wonder how your phone can

2
00:00:02.560 --> 00:00:06.839
<v Speaker 1>instantly recognize a face, or you know, how some AI

3
00:00:07.040 --> 00:00:11.080
<v Speaker 1>generates those incredibly realistic images seemingly out of thin air?

4
00:00:11.400 --> 00:00:13.480
<v Speaker 2>Right, it feels like magic sometimes.

5
00:00:13.080 --> 00:00:16.199
<v Speaker 1>It really does. So today we're pulling back the curtain

6
00:00:16.480 --> 00:00:21.079
<v Speaker 1>on that magic behind AI-powered products. Our mission in

7
00:00:21.160 --> 00:00:23.399
<v Speaker 1>this deep dive is to really get into the core

8
00:00:23.480 --> 00:00:27.760
<v Speaker 1>concepts of neural networks and TensorFlow two point zero. We've

9
00:00:27.760 --> 00:00:30.480
<v Speaker 1>gone through Hands-On Neural Networks with TensorFlow two point

10
00:00:30.519 --> 00:00:34.240
<v Speaker 1>zero by Paolo Galeone, great resource, absolutely, and we've

11
00:00:34.320 --> 00:00:38.119
<v Speaker 1>extracted the most important sort of nuggets of knowledge to

12
00:00:38.119 --> 00:00:40.840
<v Speaker 1>give you a shortcut to being genuinely well informed about

13
00:00:40.840 --> 00:00:41.960
<v Speaker 1>this fascinating field.

14
00:00:42.079 --> 00:00:44.960
<v Speaker 2>Yeah, and we'll explore the very foundations of machine learning,

15
00:00:45.000 --> 00:00:48.000
<v Speaker 2>really unveil the inner workings of neural networks, and maybe

16
00:00:48.039 --> 00:00:51.520
<v Speaker 2>demystify how frameworks like TensorFlow two point zero make building

17
00:00:51.520 --> 00:00:55.759
<v Speaker 2>these systems, well, manageable. You'll hopefully gain an intuitive grasp

18
00:00:55.799 --> 00:00:58.479
<v Speaker 2>of not just what these things are, but more importantly,

19
00:00:58.520 --> 00:01:00.560
<v Speaker 2>why they've become so incredibly important.

20
00:01:00.679 --> 00:01:03.000
<v Speaker 1>Okay, let's unpack this right at the start, then. What

21
00:01:03.119 --> 00:01:06.599
<v Speaker 1>exactly is machine learning, fundamentally? So, at its

22
00:01:06.480 --> 00:01:11.680
<v Speaker 2>heart, machine learning is a branch of artificial intelligence. The

23
00:01:11.760 --> 00:01:14.959
<v Speaker 2>core idea is we define algorithms that learn a model

24
00:01:15.000 --> 00:01:16.280
<v Speaker 2>directly from data.

25
00:01:16.359 --> 00:01:18.959
<v Speaker 1>Learning from data, not explicit rules. Exactly.

26
00:01:19.280 --> 00:01:24.439
<v Speaker 2>The goal is to automatically extract meaningful information, insights, patterns.

27
00:01:24.319 --> 00:01:26.879
<v Speaker 1>And the applications are just everywhere now, aren't they? You

28
00:01:26.920 --> 00:01:29.079
<v Speaker 1>probably use them constantly without even thinking about it.

29
00:01:29.120 --> 00:01:34.480
<v Speaker 2>Oh, absolutely, they're countless, and yeah, probably in daily use. Think

30
00:01:34.519 --> 00:01:36.599
<v Speaker 2>about face detection in your smartphone camera.

31
00:01:36.719 --> 00:01:37.760
<v Speaker 1>Yep, use that all the time.

32
00:01:38.000 --> 00:01:42.560
<v Speaker 2>Predictive maintenance in factories, medical image analysis, which

33
00:01:42.359 --> 00:01:44.599
<v Speaker 1>is huge, helping doctors see

34
00:01:44.359 --> 00:01:49.400
<v Speaker 2>things. Precisely. Time series forecasting in finance. Autonomous driving, obviously a

35
00:01:49.680 --> 00:01:53.120
<v Speaker 2>big one, text comprehension, even those recommendation systems telling you

36
00:01:53.159 --> 00:01:53.959
<v Speaker 2>what to watch next.

37
00:01:54.040 --> 00:01:57.200
<v Speaker 1>Guilty. Okay. The source calls the data set the most

38
00:01:57.239 --> 00:02:01.359
<v Speaker 1>critical part of the ML pipeline. Why is the quality

39
00:02:01.359 --> 00:02:03.719
<v Speaker 1>and structure so, so important here?

40
00:02:03.920 --> 00:02:07.280
<v Speaker 2>Because everything hinges on it. The model's success lives or

41
00:02:07.319 --> 00:02:10.560
<v Speaker 2>dies by the data. It's like building a house, right?

42
00:02:10.919 --> 00:02:15.120
<v Speaker 2>If your materials, your bricks are bad, the house won't stand,

43
00:02:15.199 --> 00:02:18.719
<v Speaker 2>no matter how good the architect is. Makes sense. So

44
00:02:19.080 --> 00:02:23.719
<v Speaker 2>take face detection. It's trained on thousands, maybe millions of

45
00:02:23.840 --> 00:02:28.680
<v Speaker 2>labeled examples, faces marked as faces. The more high quality,

46
00:02:28.719 --> 00:02:32.520
<v Speaker 2>diverse data we have, the better the algorithm performs. And

47
00:02:32.560 --> 00:02:35.840
<v Speaker 2>this leads us straight to this crucial practice of splitting

48
00:02:35.960 --> 00:02:39.919
<v Speaker 2>data. We put it into three distinct, disjoint parts. There is

49
00:02:39.960 --> 00:02:42.280
<v Speaker 2>a training set that's what the model actually learns from,

50
00:02:42.919 --> 00:02:45.759
<v Speaker 2>then a validation set. We use that during training to

51
00:02:45.800 --> 00:02:50.120
<v Speaker 2>measure performance and importantly tune things called hyperparameters. Think

52
00:02:50.120 --> 00:02:51.719
<v Speaker 2>of them as settings for the learning.

53
00:02:51.479 --> 00:02:53.680
<v Speaker 3>process. Like knobs to adjust? Exactly.

54
00:02:53.960 --> 00:02:56.680
<v Speaker 2>And finally, the test set. This is sacred. It's completely

55
00:02:56.759 --> 00:02:59.960
<v Speaker 2>untouched until the very end for the final evaluations.

56
00:03:00.000 --> 00:03:02.080
<v Speaker 1>That's the real test of how it'll do in the wild.

57
00:03:02.280 --> 00:03:05.520
<v Speaker 2>Precisely, it ensures we get an unbiased look at real

58
00:03:05.560 --> 00:03:08.080
<v Speaker 2>world performance, our ultimate reality check.
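
NOTE
Editor's sketch, not from the source: one way the three-way split described above
could look in plain Python. The function name, the 80/10/10 fractions and the seed
are illustrative assumptions; the speakers only say the three parts must be disjoint.
```python
import random
def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then cut into three disjoint parts: train, validation, test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]                # the model learns from this part
    val = shuffled[n_train:n_train + n_val]   # used to tune hyperparameters
    test = shuffled[n_train + n_val:]         # untouched until the final evaluation
    return train, val, test
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```
Because the slices never overlap, no example can leak from the test set into training.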

59
00:03:08.280 --> 00:03:11.080
<v Speaker 1>We often hear about n-dimensional spaces in machine learning.

60
00:03:11.199 --> 00:03:13.719
<v Speaker 1>It sounds pretty abstract. What does that actually mean for

61
00:03:13.800 --> 00:03:14.439
<v Speaker 1>our data?

62
00:03:14.639 --> 00:03:18.120
<v Speaker 2>Yeah, it can sound a bit theoretical, But imagine each

63
00:03:18.240 --> 00:03:21.039
<v Speaker 2>example in your data set, like an image or maybe

64
00:03:21.080 --> 00:03:24.120
<v Speaker 2>sensor readings, as just a single point plotted in some

65
00:03:24.280 --> 00:03:27.240
<v Speaker 2>geometric space. The n just refers to the number of

66
00:03:27.240 --> 00:03:29.639
<v Speaker 2>features or attributes that describe that point.

67
00:03:29.840 --> 00:03:32.639
<v Speaker 1>Ah, okay, so more features, more dimensions. Got it.

68
00:03:33.120 --> 00:03:36.520
<v Speaker 2>So that Fashion-MNIST image example? Yeah, twenty eight

69
00:03:36.560 --> 00:03:39.639
<v Speaker 2>by twenty eight pixels. That's seven hundred and eighty four attributes.

70
00:03:39.719 --> 00:03:42.199
<v Speaker 2>So each image is a point in a seven hundred

71
00:03:42.199 --> 00:03:43.759
<v Speaker 2>and eighty four dimensional space.

72
00:03:44.039 --> 00:03:46.599
<v Speaker 1>Wow. Okay, that's impossible to picture.

73
00:03:46.319 --> 00:03:48.879
<v Speaker 2>It's totally impossible for us, which is why understanding this

74
00:03:48.960 --> 00:03:52.680
<v Speaker 2>concept is key. It helps us grasp why high dimensions

75
00:03:52.680 --> 00:03:56.319
<v Speaker 2>can be tricky, the curse of dimensionality, and it's why

76
00:03:56.400 --> 00:03:59.560
<v Speaker 2>techniques like dimensionality reduction are so vital not just for

77
00:03:59.639 --> 00:04:01.800
<v Speaker 2>visualization but for making models work well.

78
00:04:01.879 --> 00:04:04.680
<v Speaker 1>Okay, so machine learning tasks, they usually fall into three

79
00:04:04.719 --> 00:04:09.159
<v Speaker 1>main buckets: supervised, unsupervised, and semi-supervised learning. What's the

80
00:04:09.240 --> 00:04:09.800
<v Speaker 1>key difference?

81
00:04:09.879 --> 00:04:12.879
<v Speaker 2>The absolute key distinction. The main thing is the presence

82
00:04:13.000 --> 00:04:14.599
<v Speaker 2>or absence of labels in.

83
00:04:14.560 --> 00:04:17.920
<v Speaker 1>the data. Labels meaning the answers, sort of?

84
00:04:18.000 --> 00:04:22.480
<v Speaker 2>Yeah. Supervised learning uses labeled data. You have inputs and

85
00:04:22.519 --> 00:04:26.319
<v Speaker 2>you have the desired outputs like images labeled cat or dog.

86
00:04:27.120 --> 00:04:28.519
<v Speaker 2>The model learns the mapping.

87
00:04:28.800 --> 00:04:29.879
<v Speaker 1>Okay, that's straightforward.

88
00:04:30.000 --> 00:04:34.000
<v Speaker 2>Unsupervised learning deals with unlabeled data. The goal there is

89
00:04:34.040 --> 00:04:37.160
<v Speaker 2>to find hidden patterns or structures without being told what to.

90
00:04:37.120 --> 00:04:39.720
<v Speaker 1>Look for, like finding groups of similar.

91
00:04:39.319 --> 00:04:44.079
<v Speaker 2>customers. Exactly, or detecting weird transactions for fraud detection, where

92
00:04:44.079 --> 00:04:45.519
<v Speaker 2>you don't have fraud labels.

93
00:04:45.240 --> 00:04:48.279
<v Speaker 3>beforehand. And semi-supervised? That's a hybrid.

94
00:04:48.519 --> 00:04:51.639
<v Speaker 2>It cleverly uses a mix of labeled and unlabeled data.

95
00:04:51.879 --> 00:04:55.199
<v Speaker 2>Or sometimes it handles situations where maybe all your examples belong to

96
00:04:55.279 --> 00:04:59.800
<v Speaker 2>the same class, which supervised methods alone can't really handle effectively.

97
00:05:00.079 --> 00:05:02.920
<v Speaker 1>Makes sense. So, once we've built a model using one

98
00:05:02.920 --> 00:05:05.000
<v Speaker 1>of these approaches, how do we know if it's actually

99
00:05:05.000 --> 00:05:06.639
<v Speaker 1>any good? What are the key metrics?

100
00:05:06.920 --> 00:05:11.680
<v Speaker 2>Ah, metrics. They're fundamental, absolutely critical for evaluating how good

101
00:05:11.680 --> 00:05:15.000
<v Speaker 2>our model is. Accuracy is the most common one for classification.

102
00:05:15.319 --> 00:05:17.360
<v Speaker 1>Just the percentage it gets right? Yep.

103
00:05:17.399 --> 00:05:21.879
<v Speaker 2>Simple proportion of correct predictions. However, and this is a

104
00:05:21.879 --> 00:05:25.879
<v Speaker 2>big however, accuracy can be super misleading, especially on unbalanced

105
00:05:25.959 --> 00:05:26.480
<v Speaker 2>data sets.

106
00:05:26.519 --> 00:05:26.839
<v Speaker 1>How so?

107
00:05:27.319 --> 00:05:29.959
<v Speaker 2>Well, imagine eighty percent of your data is class A

108
00:05:30.279 --> 00:05:33.639
<v Speaker 2>and only twenty percent is class B. A lazy model

109
00:05:33.720 --> 00:05:36.040
<v Speaker 2>could just predict class A every single time and.

110
00:05:36.000 --> 00:05:37.879
<v Speaker 1>It would look eighty percent accurate.

111
00:05:37.680 --> 00:05:41.600
<v Speaker 2>Exactly, eighty percent accuracy, but it's completely useless because it

112
00:05:41.639 --> 00:05:44.480
<v Speaker 2>never finds class B. Not a good classifier at all.

113
00:05:44.639 --> 00:05:48.480
<v Speaker 1>Okay, point taken. So if accuracy can fool us, what

114
00:05:48.560 --> 00:05:50.639
<v Speaker 1>are the better alternatives? What else do we look at?

115
00:05:50.680 --> 00:05:53.399
<v Speaker 2>We rely on a whole suite of other, more nuanced

116
00:05:53.439 --> 00:05:58.079
<v Speaker 2>metrics for classification. Things like precision, which tells us, out

117
00:05:58.120 --> 00:06:00.519
<v Speaker 2>of all the times the model predicted positive, how many

118
00:06:00.560 --> 00:06:01.519
<v Speaker 2>were actually.

119
00:06:01.160 --> 00:06:04.480
<v Speaker 1>correct. Like how many emails flagged as spam were spam.

120
00:06:04.839 --> 00:06:07.879
<v Speaker 2>Then there's recall that asks, out of all the actual

121
00:06:07.920 --> 00:06:10.800
<v Speaker 2>positive cases that existed, how many did our model find.

122
00:06:10.879 --> 00:06:12.600
<v Speaker 1>Making sure we don't miss important.

123
00:06:12.199 --> 00:06:15.439
<v Speaker 2>stuff. Exactly. Like in medical diagnosis, recall is often crucial.

124
00:06:15.519 --> 00:06:17.600
<v Speaker 2>You don't want to miss a positive case, right? The

125
00:06:17.680 --> 00:06:20.240
<v Speaker 2>F one score is great because it's the harmonic mean

126
00:06:20.279 --> 00:06:23.079
<v Speaker 2>of precision and recall, balancing.

127
00:06:22.600 --> 00:06:24.519
<v Speaker 1>both. The combined score? Yep.
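
NOTE
Editor's sketch, not from the source: the eighty/twenty "lazy model" example worked
through in plain Python. The function name and the choice of class B as the positive
class are illustrative assumptions; the formulas are the standard definitions of
accuracy, precision, recall and F1 (harmonic mean of precision and recall).
```python
def classification_metrics(y_true, y_pred, positive="B"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when the model never predicts the positive class.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
y_true = ["A"] * 80 + ["B"] * 20
lazy_predictions = ["A"] * 100   # always predicts the majority class
lazy_metrics = classification_metrics(y_true, lazy_predictions)
print(lazy_metrics)
```
The lazy model scores 0.8 accuracy but zero recall and zero F1 on class B, which is
the failure the speakers describe.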

128
00:06:25.040 --> 00:06:28.759
<v Speaker 2>And for binary classification just two classes, the area under

129
00:06:28.800 --> 00:06:31.399
<v Speaker 2>the ROC curve, AUC, is really useful.

130
00:06:31.519 --> 00:06:32.519
<v Speaker 1>ROC curve? Yeah.

131
00:06:32.560 --> 00:06:35.240
<v Speaker 2>It shows the trade off between how well the model

132
00:06:35.360 --> 00:06:39.240
<v Speaker 2>finds true positives (sensitivity) and how well it avoids false

133
00:06:39.279 --> 00:06:43.600
<v Speaker 2>positives (specificity) across different thresholds. It gives a really good overall

134
00:06:43.279 --> 00:06:46.480
<v Speaker 1>picture. Okay. And for regression, predicting numbers?

135
00:06:46.639 --> 00:06:49.759
<v Speaker 2>For regression, we look at things like mean absolute error

136
00:06:49.959 --> 00:06:52.639
<v Speaker 2>MAE, just the average size of the errors, and mean

137
00:06:52.720 --> 00:06:57.319
<v Speaker 2>squared error, MSE, which penalizes larger errors more heavily. They

138
00:06:57.319 --> 00:07:00.120
<v Speaker 2>tell us how close our predictions are to the actual values.
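
NOTE
Editor's sketch, not from the source: mean absolute error and mean squared error
computed by hand. The sample values are made up for illustration only.
```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
def mse(y_true, y_pred):
    """Mean squared error: squaring makes large errors count more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 3.0]   # absolute errors: 1, 0, 2, 4
print(mae(y_true, y_pred), mse(y_true, y_pred))
```
The single error of 4 contributes 16 of the 21 squared-error units, which is the
"penalizes larger errors" effect mentioned above.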

139
00:07:00.160 --> 00:07:02.480
<v Speaker 1>Okay, so we know how to measure success. Let's talk

140
00:07:02.480 --> 00:07:05.319
<v Speaker 1>about the models themselves, especially the stars of the show

141
00:07:05.519 --> 00:07:07.639
<v Speaker 1>neural networks. How are they actually defined?

142
00:07:07.839 --> 00:07:11.360
<v Speaker 2>Right, neural networks. Dr. Robert Hecht-Nielsen, one of the pioneers,

143
00:07:11.480 --> 00:07:15.079
<v Speaker 2>defined a neural network as basically a computing system made

144
00:07:15.160 --> 00:07:18.399
<v Speaker 2>up of a number of simple, highly interconnected processing elements.

145
00:07:18.439 --> 00:07:20.920
<v Speaker 1>Simple elements, but lots of connections.

146
00:07:20.759 --> 00:07:25.439
<v Speaker 2>Which process information by their dynamic state response to external inputs.

147
00:07:26.279 --> 00:07:28.240
<v Speaker 2>More intuitively, you can just think of them as a

148
00:07:28.279 --> 00:07:31.560
<v Speaker 2>computational model loosely inspired by how our brains work.

149
00:07:31.759 --> 00:07:35.839
<v Speaker 1>Loosely inspired. So they're modeled after biological neurons. But it's

150
00:07:35.839 --> 00:07:37.120
<v Speaker 1>not an exact copy.

151
00:07:36.959 --> 00:07:39.560
<v Speaker 2>Right? Oh, absolutely not. It's a very coarse inspiration. We

152
00:07:39.639 --> 00:07:44.240
<v Speaker 2>borrow terms like dendrites for inputs, synapses for the connection weights

153
00:07:43.920 --> 00:07:46.160
<v Speaker 1>that learn. The things that change during training.

154
00:07:45.959 --> 00:07:50.360
<v Speaker 2>Exactly, and a nucleus which is basically this nonlinear activation

155
00:07:50.480 --> 00:07:54.519
<v Speaker 2>function that determines if the neuron fires. But the biological

156
00:07:54.560 --> 00:07:56.800
<v Speaker 2>reality is far, far more complex.

157
00:07:57.279 --> 00:08:01.199
<v Speaker 1>So why do these artificial neurons need that nonlinear activation function?

158
00:08:01.319 --> 00:08:02.800
<v Speaker 1>What does the nonlinearity do?

159
00:08:03.319 --> 00:08:05.839
<v Speaker 2>That's key. Think about a single neuron without it. It

160
00:08:05.839 --> 00:08:08.199
<v Speaker 2>can basically only draw a straight line or a flat

161
00:08:08.240 --> 00:08:11.360
<v Speaker 2>plane in higher dimensions, a hyperplane to separate data.

162
00:08:11.560 --> 00:08:13.839
<v Speaker 1>Okay, like separating red dots from blue dots with.

163
00:08:13.800 --> 00:08:16.079
<v Speaker 2>one line. Exactly. But what if the dots are all

164
00:08:16.120 --> 00:08:19.360
<v Speaker 2>mixed up in a complex pattern? That straight line isn't enough.

165
00:08:19.759 --> 00:08:23.680
<v Speaker 2>The nonlinear activation function lets the neuron create a curved boundary,

166
00:08:23.920 --> 00:08:24.879
<v Speaker 2>a hypersurface.

167
00:08:25.360 --> 00:08:27.839
<v Speaker 1>Ah, so we can learn more complex separations.

168
00:08:27.879 --> 00:08:31.399
<v Speaker 2>Precisely, it allows the neuron to capture much more complex

169
00:08:31.480 --> 00:08:34.919
<v Speaker 2>relationships in the data, things that aren't just linearly separable.

170
00:08:35.240 --> 00:08:38.000
<v Speaker 1>And is that why we need multi layered neural networks

171
00:08:38.720 --> 00:08:40.840
<v Speaker 1>to handle even more complex stuff?

172
00:08:41.000 --> 00:08:44.799
<v Speaker 2>Yes, exactly, if one curved boundary isn't enough, adding more

173
00:08:44.879 --> 00:08:48.799
<v Speaker 2>layers allows the network to combine and transform these learned boundaries,

174
00:08:49.120 --> 00:08:51.960
<v Speaker 2>creating incredibly intricate decision regions.
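
NOTE
Editor's sketch, not from the source: XOR is a classic pattern that no single
straight line separates, yet two layers of simple nonlinear units handle it. The
weights below are hand-picked for illustration, not learned, and the step
activation is an assumption chosen for readability.
```python
def step(z):
    """A simple nonlinear activation: the unit fires (1) only if its input is positive."""
    return 1 if z > 0 else 0
def xor_net(x1, x2):
    # Hidden layer: each unit draws one straight-line boundary.
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Output layer: combines the two boundaries into a region neither could form alone.
    return step(h_or - h_and - 0.5)
xor_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```
Each hidden unit is still just a line; the second layer is what turns two lines
into the "between the lines" region that XOR needs.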

175
00:08:52.000 --> 00:08:54.519
<v Speaker 1>So layers build on layers to create complexity.

176
00:08:54.679 --> 00:08:58.240
<v Speaker 2>Right, It enables the network to learn these remarkably complex

177
00:08:58.399 --> 00:09:03.120
<v Speaker 2>classification boundaries needed for real-world problems. In fact, these

178
00:09:03.159 --> 00:09:07.840
<v Speaker 2>standard feed forward networks are called universal function approximators.

179
00:09:07.159 --> 00:09:09.279
<v Speaker 1>Meaning they can learn anything? Pretty much.

180
00:09:09.279 --> 00:09:12.440
<v Speaker 2>In theory, if a relationship exists between inputs and outputs,

181
00:09:12.720 --> 00:09:16.200
<v Speaker 2>a sufficiently large and well trained neural network can approximate

182
00:09:16.240 --> 00:09:18.159
<v Speaker 2>that function, no matter how complex it is.

183
00:09:18.399 --> 00:09:21.879
<v Speaker 1>Wow, that's powerful. What's a major advantage of neural networks

184
00:09:21.919 --> 00:09:25.399
<v Speaker 1>compared to other, maybe more traditional machine learning models.

185
00:09:25.519 --> 00:09:28.879
<v Speaker 2>One huge advantage is their ability to act as feature

186
00:09:28.559 --> 00:09:30.960
<v Speaker 1>extractors. Feature extractors, meaning?

187
00:09:30.879 --> 00:09:34.360
<v Speaker 2>So many traditional ML models need you to carefully

188
00:09:34.399 --> 00:09:39.320
<v Speaker 2>preprocess the data and manually engineer meaningful features first, like

189
00:09:39.799 --> 00:09:44.200
<v Speaker 2>calculating specific ratios or identifying certain shapes beforehand.

190
00:09:44.320 --> 00:09:45.679
<v Speaker 1>A lot of human effort upfront.

191
00:09:45.960 --> 00:09:49.919
<v Speaker 2>Right. Neural networks, especially deep ones with the right architecture,

192
00:09:50.159 --> 00:09:53.480
<v Speaker 2>can often learn these important features directly from the raw

193
00:09:53.519 --> 00:09:57.200
<v Speaker 2>input data themselves. They figure out what's important.

194
00:09:56.799 --> 00:09:58.559
<v Speaker 1>on their own. So they kind of learn how to

195
00:09:58.600 --> 00:10:00.519
<v Speaker 1>see the important patterns? Exactly.

196
00:10:00.840 --> 00:10:04.679
<v Speaker 2>That automatic feature extraction is incredibly powerful and saves a

197
00:10:04.759 --> 00:10:06.120
<v Speaker 2>ton of manual work.

198
00:10:06.240 --> 00:10:09.480
<v Speaker 1>Okay, this feature extraction is amazing. So we have these

199
00:10:09.519 --> 00:10:13.159
<v Speaker 1>powerful networks, how do we actually teach them? What does

200
00:10:13.279 --> 00:10:14.919
<v Speaker 1>training really involve? Right.

201
00:10:14.960 --> 00:10:18.679
<v Speaker 2>Training. So, training a model like this means iteratively updating

202
00:10:18.720 --> 00:10:21.399
<v Speaker 2>its internal parameters, those connection weights and biases we

203
00:10:21.399 --> 00:10:22.799
<v Speaker 1>mentioned. Adjusting the connections.

204
00:10:22.879 --> 00:10:25.480
<v Speaker 2>Yep, we adjust them to find the configuration that best

205
00:10:25.480 --> 00:10:28.519
<v Speaker 2>solves the problem, the one that minimizes errors. And we

206
00:10:28.600 --> 00:10:30.960
<v Speaker 2>measure error using a loss function.

207
00:10:30.840 --> 00:10:32.720
<v Speaker 1>A score for how wrong the model is.

208
00:10:33.120 --> 00:10:36.279
<v Speaker 2>Pretty much, it measures the difference between the model's predictions

209
00:10:36.360 --> 00:10:39.600
<v Speaker 2>and the actual right answers. The tricky part is that

210
00:10:39.639 --> 00:10:44.080
<v Speaker 2>the landscape defined by this loss function is usually incredibly complex,

211
00:10:44.399 --> 00:10:45.159
<v Speaker 2>lots of hills and

212
00:10:45.200 --> 00:10:48.799
<v Speaker 1>valleys. So we can't just instantly find the lowest point,

213
00:10:48.840 --> 00:10:50.879
<v Speaker 1>the best solution. We have to kind of search for it.

214
00:10:51.120 --> 00:10:54.080
<v Speaker 2>You've got it. We can't just jump there. We use

215
00:10:54.120 --> 00:10:58.120
<v Speaker 2>an iterative method, and the main technique, the absolute workhorse

216
00:10:58.639 --> 00:10:59.639
<v Speaker 2>is gradient descent.

217
00:11:00.080 --> 00:11:02.000
<v Speaker 1>Gradient descent. Okay, how does that work?

218
00:11:02.200 --> 00:11:05.639
<v Speaker 2>Imagine that loss landscape again, like mountains and valleys. You

219
00:11:05.679 --> 00:11:07.799
<v Speaker 2>want to get to the lowest point in a valley.

220
00:11:08.000 --> 00:11:12.279
<v Speaker 2>Gradient descent calculates the slope at your current position. The

221
00:11:12.320 --> 00:11:15.759
<v Speaker 2>gradient. The direction of steepest slope? Exactly. It tells you

222
00:11:15.799 --> 00:11:18.120
<v Speaker 2>which way is downhill, so you take a small step in

223
00:11:18.159 --> 00:11:22.960
<v Speaker 2>that direction, recalculate the slope, and repeat step by step downhill.

224
00:11:23.000 --> 00:11:25.360
<v Speaker 1>And the learning rate, that controls how big those steps

225
00:11:25.120 --> 00:11:28.600
<v Speaker 2>are. Precisely, it's a critical hyperparameter. It regulates the

226
00:11:28.639 --> 00:11:31.519
<v Speaker 2>size of each step down the slope. Choosing the right

227
00:11:31.600 --> 00:11:34.799
<v Speaker 2>learning rate is, well, it's often called more of an

228
00:11:34.879 --> 00:11:35.679
<v Speaker 2>art than a science.

229
00:11:35.919 --> 00:11:36.759
<v Speaker 1>Tricky to get right.

230
00:11:37.039 --> 00:11:39.440
<v Speaker 2>Yeah, Too large a step and you might overshoot the

231
00:11:39.480 --> 00:11:41.799
<v Speaker 2>valley bottom and bounce around or even climb back up.

232
00:11:42.039 --> 00:11:44.000
<v Speaker 2>Too small and training takes.

233
00:11:43.799 --> 00:11:45.639
<v Speaker 1>forever. Crawling towards the solution.

234
00:11:45.919 --> 00:11:49.519
<v Speaker 2>Right, So developers often use strategies where the learning rate

235
00:11:49.639 --> 00:11:53.240
<v Speaker 2>changes during training, maybe starting larger and getting smaller over time.
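
NOTE
Editor's sketch, not from the source: gradient descent on a one-parameter toy loss,
(w - 3) squared, to show the learning-rate trade-off described above. The loss, the
starting point and the three learning rates are illustrative assumptions.
```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w
def toy_grad(w):
    # Derivative of the toy loss (w - 3) ** 2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)
too_small = gradient_descent(toy_grad, w0=0.0, learning_rate=0.001, steps=50)
reasonable = gradient_descent(toy_grad, w0=0.0, learning_rate=0.1, steps=50)
too_large = gradient_descent(toy_grad, w0=0.0, learning_rate=1.1, steps=50)
print(too_small, reasonable, too_large)
```
In this toy setting the small rate has barely left the start after 50 steps, the
middle rate lands very close to 3, and the large rate overshoots further on every
step and moves away from the minimum.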

236
00:11:53.320 --> 00:11:56.240
<v Speaker 1>And there are different flavors of gradient descent right, depending

237
00:11:56.240 --> 00:11:58.159
<v Speaker 1>on how much data you use for each step.

238
00:11:58.320 --> 00:12:02.440
<v Speaker 2>Indeed, there's batch gradient descent that uses the entire

239
00:12:02.519 --> 00:12:05.759
<v Speaker 2>data set to calculate the gradient for each single step.

240
00:12:06.000 --> 00:12:07.480
<v Speaker 1>Sounds accurate, but slow.

241
00:12:07.559 --> 00:12:11.039
<v Speaker 2>Very accurate direction, but totally impractical for the huge data

242
00:12:11.039 --> 00:12:15.440
<v Speaker 2>sets we use today. Then there's stochastic gradient descent SGD

243
00:12:15.679 --> 00:12:18.120
<v Speaker 2>that uses just one single example.

244
00:12:17.679 --> 00:12:18.399
<v Speaker 3>For each update.

245
00:12:18.600 --> 00:12:20.399
<v Speaker 1>Much faster, but maybe noisy.

246
00:12:20.559 --> 00:12:23.919
<v Speaker 2>Exactly, faster updates, but the path can be really erratic.

247
00:12:24.039 --> 00:12:27.120
<v Speaker 2>The industry standard really is mini-batch gradient descent.

248
00:12:27.159 --> 00:12:28.480
<v Speaker 1>The best of both worlds.

249
00:12:28.639 --> 00:12:32.440
<v Speaker 2>Pretty much. It uses small subsets, or mini-batches, of data

250
00:12:32.480 --> 00:12:35.799
<v Speaker 2>for each update. It's a great compromise. Faster than batch,

251
00:12:35.879 --> 00:12:37.600
<v Speaker 2>more stable than pure SGD.
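
NOTE
Editor's sketch, not from the source: one loop that covers all three flavors by
changing batch_size. A batch size equal to the data set size is batch gradient
descent, a batch size of 1 is stochastic gradient descent, and anything in between
is mini-batch. The model (y is roughly w times x), data and settings are illustrative.
```python
import random
def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by minimizing mean squared error with mini-batch updates."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)   # a fresh example order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.0 * x for x in xs]   # noise-free data generated with w = 2
fitted = {size: minibatch_gd(xs, ys, batch_size=size) for size in (1, 2, 6)}
print(fitted)
```
On this tiny noise-free problem every batch size recovers w close to 2; the
practical differences the speakers mention (cost per step, noisiness of the path)
only show up on larger, noisier data.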

252
00:12:38.000 --> 00:12:43.440
<v Speaker 1>Okay, Now beyond basic gradient descent, there are more advanced

253
00:12:43.639 --> 00:12:45.600
<v Speaker 1>optimization algorithms. What do they add?

254
00:12:45.879 --> 00:12:49.799
<v Speaker 2>They significantly improve training, making it faster and often leading

255
00:12:49.840 --> 00:12:53.679
<v Speaker 2>to better results. A classic is momentum. Like in physics?

256
00:12:54.039 --> 00:12:57.279
<v Speaker 2>Kind of. It helps the optimization process gain momentum as

257
00:12:57.279 --> 00:13:00.840
<v Speaker 2>it goes downhill, smoothing out oscillations and helping it power

258
00:13:00.879 --> 00:13:03.000
<v Speaker 2>through small bumps or flat areas.

259
00:13:02.679 --> 00:13:04.399
<v Speaker 1>faster. So it doesn't get stuck easily.
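
NOTE
Editor's sketch, not from the source: the classical momentum update on the same
kind of toy loss, (w - 3) squared. The velocity term carries over part of the
previous step; beta = 0.9 and the other settings are common but illustrative choices.
```python
def momentum_descent(grad, w0, learning_rate=0.1, beta=0.9, steps=300):
    """Gradient descent with a velocity that accumulates past gradients."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Keep a fraction of the old velocity, then add the new downhill push.
        velocity = beta * velocity - learning_rate * grad(w)
        w += velocity
    return w
def toy_grad(w):
    # Derivative of the toy loss (w - 3) ** 2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)
w_momentum = momentum_descent(toy_grad, w0=0.0)
print(w_momentum)
```
Because the velocity persists when the local gradient is small, the update keeps
moving across flat stretches where plain gradient descent would slow to a crawl.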

260
00:13:04.559 --> 00:13:08.120
<v Speaker 2>Right. And then there's Adam, adaptive moment estimation. This one

261
00:13:08.159 --> 00:13:11.559
<v Speaker 2>is hugely popular. It's an adaptive learning rate method.

262
00:13:11.879 --> 00:13:15.080
<v Speaker 2>It actually maintains a separate learning rate for each individual

263
00:13:15.120 --> 00:13:16.240
<v Speaker 2>parameter in the network.

264
00:13:16.279 --> 00:13:19.159
<v Speaker 1>Wow, okay, Tailored step sizes, Yeah.

265
00:13:19.279 --> 00:13:22.039
<v Speaker 2>It adapts each step size using running averages of that

266
00:13:22.080 --> 00:13:25.799
<v Speaker 2>parameter's recent gradients and squared gradients. It often converges much

267
00:13:25.840 --> 00:13:29.039
<v Speaker 2>faster and works well across a wide range of problems.

268
00:13:29.360 --> 00:13:31.080
<v Speaker 2>Many people start with Adam. And
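
NOTE
Editor's sketch, not from the source: the Adam update rule for a single parameter,
following the commonly published form (running averages of the gradient and the
squared gradient, with bias correction). The toy loss and hyperparameter values
are illustrative assumptions; real frameworks apply this per parameter.
```python
import math
def adam_minimize(grad, w0, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # running average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        # The effective step size shrinks when recent gradients have been large.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w
def toy_grad(w):
    # Derivative of the toy loss (w - 3) ** 2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)
w_adam = adam_minimize(toy_grad, w0=0.0)
print(w_adam)
```
With many parameters, m and v are kept separately for each one, which is what
gives every parameter its own tailored step size.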

269
00:13:31.000 --> 00:13:35.679
<v Speaker 1>all these complex gradient calculations, finding the slope for potentially

270
00:13:35.759 --> 00:13:41.759
<v Speaker 1>millions of parameters, that's handled by backpropagation and automatic differentiation?

271
00:13:41.679 --> 00:13:44.879
<v Speaker 2>Correct. Those are the engines that make training feasible.

272
00:13:44.919 --> 00:13:48.799
<v Speaker 2>Backpropagation is the algorithm for efficiently calculating all those gradients,

273
00:13:48.879 --> 00:13:53.840
<v Speaker 2>layer by layer, working backward from the loss. Automatic differentiation

274
00:13:53.960 --> 00:13:58.120
<v Speaker 2>is the underlying mechanism that frameworks use to compute derivatives

275
00:13:57.679 --> 00:14:01.240
<v Speaker 1>automatically. So they handle the heavy calculus lifting? Exactly.

276
00:14:01.440 --> 00:14:04.360
<v Speaker 2>They represent the network's math as a computational graph and

277
00:14:04.399 --> 00:14:07.200
<v Speaker 2>efficiently figure out how changes in each weight affect the

278
00:14:07.240 --> 00:14:09.840
<v Speaker 2>final loss thousands or millions of times.
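
NOTE
Editor's sketch, not from the source: a toy reverse-mode automatic differentiation
engine for scalars, to make the "computational graph" idea concrete. The Value
class and its attribute names are hypothetical; real frameworks are far more
general and efficient.
```python
class Value:
    """A scalar that records how it was computed so gradients can flow backward."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # the graph nodes this value was built from
        self._local_grads = local_grads  # derivative of this value w.r.t. each parent
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))
    def backward(self):
        # Order the graph so every node is processed after all nodes that use it.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        # Walk backward from the loss, applying the chain rule at each node.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad
w, x, b = Value(2.0), Value(3.0), Value(1.0)
error = w * x + b + Value(-10.0)   # prediction (7) minus target (10)
loss = error * error               # squared error
loss.backward()
print(loss.data, w.grad, b.grad)
```
Here the loss is 9.0, and the backward pass reports how a change in the weight or
the bias would change that loss, without anyone writing the derivative by hand.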

279
00:14:10.159 --> 00:14:14.320
<v Speaker 1>Okay, all this theory is fantastic: neural networks, training, optimizers.

280
00:14:14.840 --> 00:14:17.279
<v Speaker 1>But how do we actually build and train these systems

281
00:14:17.279 --> 00:14:20.279
<v Speaker 1>in practice? That's where frameworks like TensorFlow come in, right,

282
00:14:20.759 --> 00:14:23.480
<v Speaker 1>And the source mentions a big shift from TensorFlow one

283
00:14:23.480 --> 00:14:24.600
<v Speaker 1>point x to two point zero.

284
00:14:24.600 --> 00:14:27.200
<v Speaker 2>Oh, absolutely. TensorFlow is key, and yes, the shift from

285
00:14:27.240 --> 00:14:29.360
<v Speaker 2>one point x to two point zero was massive, a

286
00:14:29.399 --> 00:14:30.879
<v Speaker 2>really big deal for usability.

287
00:14:30.919 --> 00:14:32.720
<v Speaker 1>What was the old way like in one point x?

288
00:14:32.720 --> 00:14:35.519
<v Speaker 2>In TensorFlow one point x, you had this two-stage process.

289
00:14:35.559 --> 00:14:38.840
<v Speaker 2>First you had to define a static computational graph, like

290
00:14:39.000 --> 00:14:41.200
<v Speaker 2>drawing a complete blueprint of all the math

291
00:14:41.039 --> 00:14:43.320
<v Speaker 1>operations. Laying it all out beforehand.

292
00:14:43.039 --> 00:14:46.480
<v Speaker 2>Exactly, and then you'd execute that graph separately using something

293
00:14:46.480 --> 00:14:50.039
<v Speaker 2>called a tf.Session. It was powerful, for sure, but

294
00:14:50.159 --> 00:14:53.960
<v Speaker 2>it felt less like Python, more like Python was just

295
00:14:54.120 --> 00:14:58.960
<v Speaker 2>controlling a separate C++ engine. Debugging was notoriously

296
00:14:58.200 --> 00:15:01.039
<v Speaker 1>painful. Right, I remember hearing that. TensorFlow two point oh

297
00:15:01.240 --> 00:15:03.159
<v Speaker 1>changed this dramatically? Hugely.

298
00:15:03.320 --> 00:15:07.279
<v Speaker 2>TensorFlow two point zero embraced eager execution by default. Eager

299
00:15:07.320 --> 00:15:10.960
<v Speaker 2>execution meaning operations run immediately just like regular Python code.

300
00:15:11.000 --> 00:15:15.320
<v Speaker 2>You define something, it runs, No separate session execution step needed.

301
00:15:15.600 --> 00:15:17.759
<v Speaker 3>Ah, much more interactive. Way more interactive.

302
00:15:17.759 --> 00:15:21.120
<v Speaker 2>It made debugging vastly simpler, and the whole development process

303
00:15:21.159 --> 00:15:25.080
<v Speaker 2>feel much more natural, much more Pythonic, and crucially, TF

304
00:15:25.120 --> 00:15:28.720
<v Speaker 2>two point zero adopted Keras as its official high level API.

305
00:15:29.000 --> 00:15:30.639
<v Speaker 1>Keras. I've heard that name a lot. Yeah.

306
00:15:30.720 --> 00:15:34.320
<v Speaker 2>Keras is basically a specification and interface for defining and

307
00:15:34.399 --> 00:15:38.759
<v Speaker 2>training models. tf.keras is TensorFlow's complete implementation of it.

308
00:15:38.759 --> 00:15:42.159
<v Speaker 2>It makes building complex models much more straightforward.

309
00:15:41.879 --> 00:15:44.159
<v Speaker 1>So Keras kind of hides some of that lower level

310
00:15:44.320 --> 00:15:46.799
<v Speaker 1>graph complexity for you. Lets you focus on the layers

311
00:15:46.799 --> 00:15:48.320
<v Speaker 1>and the architecture.

312
00:15:48.080 --> 00:15:51.480
<v Speaker 2>You've got it. With TF two point oh and Keras,

313
00:15:51.559 --> 00:15:56.120
<v Speaker 2>you're mostly thinking in terms of Python objects, layers, models,

314
00:15:56.120 --> 00:15:59.759
<v Speaker 2>not manually managing graphs and sessions. Keras handles a lot

315
00:15:59.799 --> 00:16:01.879
<v Speaker 2>of that complexity under the hood, but you.

316
00:16:01.879 --> 00:16:03.559
<v Speaker 1>Still get the performance benefits.

317
00:16:03.720 --> 00:16:06.879
<v Speaker 2>Yes, because for performance critical parts you can use the

318
00:16:07.000 --> 00:16:10.080
<v Speaker 2>@tf.function decorator. This is part of something

319
00:16:10.120 --> 00:16:14.039
<v Speaker 2>called AutoGraph. It automatically converts your Python code back into

320
00:16:14.080 --> 00:16:16.480
<v Speaker 2>a high performance TensorFlow graph behind the scenes.
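
NOTE
A small sketch contrasting eager execution with the @tf.function decorator, assuming
TensorFlow 2.x is installed. The function name matmul_graph and the matrix values are
illustrative choices, not taken from the episode.
```python
import tensorflow as tf
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
# Eager execution: the operation runs immediately and the value is available right away.
eager_result = tf.matmul(x, x)
# tf.function traces the Python function into a TensorFlow graph on its first call.
@tf.function
def matmul_graph(a, b):
    return tf.matmul(a, b)
graph_result = matmul_graph(x, x)
print(eager_result.numpy())
print(graph_result.numpy())
```
Both calls compute the same product; the decorated version is the one that runs as a graph.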

321
00:16:16.759 --> 00:16:21.159
<v Speaker 1>So best of both worlds: easy development, fast execution. Exactly.

322
00:16:21.679 --> 00:16:24.879
<v Speaker 2>Especially helpful for really deep or complex models where graph

323
00:16:24.879 --> 00:16:26.120
<v Speaker 2>performance matters most.

324
00:16:26.480 --> 00:16:29.679
<v Speaker 1>Now, getting data into these models efficiently, that can be

325
00:16:29.720 --> 00:16:32.399
<v Speaker 1>a real bottleneck, right? Yeah, especially with huge datasets.

326
00:16:32.399 --> 00:16:33.600
<v Speaker 1>How does TensorFlow help there?

327
00:16:33.720 --> 00:16:37.240
<v Speaker 2>That's where the tf.data.Dataset object is absolutely brilliant.

328
00:16:37.320 --> 00:16:41.080
<v Speaker 2>It's an API designed specifically for building highly efficient input pipelines.

329
00:16:41.200 --> 00:16:44.200
<v Speaker 1>Input pipelines, like assembly lines for data? Kind of, yeah.

330
00:16:44.240 --> 00:16:47.960
<v Speaker 2>It handles everything: extracting raw data from wherever it lives,

331
00:16:48.039 --> 00:16:52.679
<v Speaker 2>transforming it, maybe resizing images, applying data augmentation, batching it up,

332
00:16:52.720 --> 00:16:54.799
<v Speaker 2>and then loading it efficiently for the model. It's like

333
00:16:54.840 --> 00:16:58.039
<v Speaker 2>a specialized ETL process for machine learning.

334
00:16:57.879 --> 00:17:00.720
<v Speaker 1>ETL extract, transform load.

335
00:17:00.799 --> 00:17:05.519
<v Speaker 2>Right, and crucially, tf.data offers key performance optimizations,

336
00:17:05.720 --> 00:17:09.480
<v Speaker 2>things like prefetching. Prefetching lets the data preparation on

337
00:17:09.519 --> 00:17:12.079
<v Speaker 2>the CPU happen at the same time as the model

338
00:17:12.119 --> 00:17:15.000
<v Speaker 2>training on the GPU or TPU, so the GPU isn't

339
00:17:15.039 --> 00:17:16.960
<v Speaker 2>sitting idle waiting for the next batch.

340
00:17:16.839 --> 00:17:19.359
<v Speaker 1>Keeping the expensive hardware busy.

341
00:17:18.960 --> 00:17:20.960
<v Speaker 3>Exactly. And caching.

342
00:17:21.160 --> 00:17:23.359
<v Speaker 2>It can store the processed data in memory or on

343
00:17:23.480 --> 00:17:25.720
<v Speaker 2>disk after the first pass through the dataset, the

344
00:17:25.759 --> 00:17:27.480
<v Speaker 2>first epoch.

345
00:17:26.920 --> 00:17:30.119
<v Speaker 1>So subsequent epochs are much faster, no slow disk reading.

346
00:17:30.000 --> 00:17:33.480
<v Speaker 2>Precisely. These optimizations can make a massive difference in

347
00:17:33.519 --> 00:17:35.920
<v Speaker 2>training time, especially with large data sets.
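
NOTE
A hedged sketch of the extract, transform, load steps described above, assuming
TensorFlow 2.4 or later (tf.data.AUTOTUNE; older 2.x releases expose it as
tf.data.experimental.AUTOTUNE). The ten toy values, the scaling step, and the batch
size are invented for illustration.
```python
import tensorflow as tf
# Extract: ten toy (feature, label) pairs held in memory.
features = tf.range(10, dtype=tf.float32)
labels = features * 2.0
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    # Transform: scale each feature; a real pipeline might resize or augment images here.
    .map(lambda x, y: (x / 10.0, y), num_parallel_calls=tf.data.AUTOTUNE)
    # Cache the transformed elements so later epochs skip the map step.
    .cache()
    .shuffle(buffer_size=10, seed=0)
    .batch(4)
    # Load: prepare the next batch while the current one is being consumed.
    .prefetch(tf.data.AUTOTUNE)
)
batch_shapes = [tuple(x.shape) for x, _ in dataset]
print(batch_shapes)
```
Placing cache before shuffle keeps the cached data fixed while the order still changes each epoch.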

348
00:17:36.160 --> 00:17:38.680
<v Speaker 1>Is there a way to manage the whole ML pipeline

349
00:17:38.720 --> 00:17:41.720
<v Speaker 1>sort of end to end within TensorFlow beyond just the

350
00:17:41.799 --> 00:17:42.720
<v Speaker 1>data part. Yes.

351
00:17:42.960 --> 00:17:46.200
<v Speaker 2>For a more structured approach, TensorFlow offers the

352
00:17:46.400 --> 00:17:49.759
<v Speaker 2>tf.estimator API. Think of it as a higher level framework

353
00:17:49.799 --> 00:17:53.920
<v Speaker 2>that encapsulates a lot of the standard, often repetitive parts

354
00:17:53.960 --> 00:17:55.200
<v Speaker 2>of an ML workflow.

355
00:17:55.480 --> 00:17:56.839
<v Speaker 1>What kind of repetitive parts?

356
00:17:56.400 --> 00:18:01.480
<v Speaker 2>Things like building the graph, correctly initializing variables, handling

357
00:18:01.519 --> 00:18:06.079
<v Speaker 2>the data loading loop, dealing with exceptions gracefully, creating checkpoints

358
00:18:06.119 --> 00:18:07.119
<v Speaker 2>to save your progress.

359
00:18:07.240 --> 00:18:09.200
<v Speaker 1>Ah, so you don't have to write all that boilerplate

360
00:18:09.200 --> 00:18:10.160
<v Speaker 1>code yourself.

361
00:18:10.039 --> 00:18:14.920
<v Speaker 2>Exactly. It also handles saving summaries for visualization tools like TensorBoard.

362
00:18:15.319 --> 00:18:19.559
<v Speaker 2>It really simplifies development and enforces good practices, especially useful

363
00:18:19.599 --> 00:18:22.599
<v Speaker 2>when you're scaling up to run on multiple machines or devices.

364
00:18:22.839 --> 00:18:24.480
<v Speaker 1>Okay, let's shift gears a bit and talk about some

365
00:18:24.480 --> 00:18:28.319
<v Speaker 1>advanced applications. Image classification is common, but what if you

366
00:18:28.400 --> 00:18:31.319
<v Speaker 1>don't have a ton of labeled images for your specific task?

367
00:18:31.440 --> 00:18:32.079
<v Speaker 3>Great question.

368
00:18:32.640 --> 00:18:36.000
<v Speaker 2>That's where transfer learning comes in, and it's incredibly powerful.

369
00:18:36.039 --> 00:18:38.039
<v Speaker 2>It's a huge time and resource saver.

370
00:18:38.279 --> 00:18:40.400
<v Speaker 1>Transfer learning. Transferring knowledge?

371
00:18:39.839 --> 00:18:43.759
<v Speaker 2>Exactly. Instead of training a big, complex convolutional neural

372
00:18:43.759 --> 00:18:47.440
<v Speaker 2>network from scratch, which needs massive datasets like ImageNet,

373
00:18:47.200 --> 00:18:49.240
<v Speaker 1>Which has millions of images.

374
00:18:48.920 --> 00:18:52.000
<v Speaker 2>Right, over fifteen million, yeah, across thousands of categories. Most

375
00:18:52.079 --> 00:18:55.240
<v Speaker 2>people don't have that kind of data for their specific problem.

376
00:18:55.640 --> 00:18:59.160
<v Speaker 2>So with transfer learning, we reuse parts of a model

377
00:18:59.319 --> 00:19:02.039
<v Speaker 2>that was already trained on ImageNet or a similar

378
00:19:02.240 --> 00:19:03.480
<v Speaker 2>large data set.

379
00:19:03.799 --> 00:19:07.440
<v Speaker 1>So we're basically borrowing a pre trained brain that already

380
00:19:07.440 --> 00:19:11.519
<v Speaker 1>knows how to see general features in images like edges, textures,

381
00:19:11.559 --> 00:19:12.279
<v Speaker 1>basic shapes.

382
00:19:12.359 --> 00:19:16.119
<v Speaker 2>It's a perfect analogy. It's already learned those fundamental visual patterns,

383
00:19:16.640 --> 00:19:19.160
<v Speaker 2>so we can take that pre trained part often called

384
00:19:19.160 --> 00:19:22.200
<v Speaker 2>the base model or feature extractor, like layers from a

385
00:19:22.240 --> 00:19:25.160
<v Speaker 2>model called Inception v3. Okay. Freeze its weights so

386
00:19:25.200 --> 00:19:28.279
<v Speaker 2>they don't change, and just add a new small classification

387
00:19:28.440 --> 00:19:30.640
<v Speaker 2>layer on top that we train on our own smaller

388
00:19:30.720 --> 00:19:31.160
<v Speaker 2>data set.

389
00:19:31.279 --> 00:19:33.359
<v Speaker 1>Ah, so you only train the last little bit. Right.

390
00:19:33.720 --> 00:19:37.559
<v Speaker 2>This dramatically speeds up training and really helps prevent overfitting,

391
00:19:37.640 --> 00:19:40.359
<v Speaker 2>especially when you have limited data, because the bulk of

392
00:19:40.400 --> 00:19:43.559
<v Speaker 2>the model already understands images well. And TensorFlow Hub

393
00:19:43.559 --> 00:19:46.119
<v Speaker 2>makes this super easy, as you often just need the

394
00:19:46.359 --> 00:19:48.920
<v Speaker 2>URL of the pre-trained model on TF Hub and

395
00:19:48.960 --> 00:19:51.559
<v Speaker 2>you can load it directly as a Keras layer. It's

396
00:19:51.599 --> 00:19:52.680
<v Speaker 2>incredibly convenient.

397
00:19:53.240 --> 00:19:56.200
<v Speaker 1>That's fantastic. But what if our new data set is

398
00:19:56.559 --> 00:20:01.119
<v Speaker 1>maybe similar but not exactly like ImageNet, or maybe we

399
00:20:01.160 --> 00:20:03.400
<v Speaker 1>do have a decent amount of new data. Is freezing

400
00:20:03.400 --> 00:20:05.160
<v Speaker 1>the whole base model always best?

401
00:20:05.359 --> 00:20:08.240
<v Speaker 2>Good point. In those cases, fine-tuning might be a better approach.

402
00:20:07.960 --> 00:20:10.319
<v Speaker 1>Fine-tuning, so not keeping

403
00:20:10.400 --> 00:20:12.079
<v Speaker 1>it completely frozen? Exactly.

404
00:20:12.680 --> 00:20:16.119
<v Speaker 2>Instead of keeping the pre-trained weights totally fixed, we

405
00:20:16.200 --> 00:20:18.559
<v Speaker 2>allow some of them, usually in the later layers of

406
00:20:18.599 --> 00:20:21.680
<v Speaker 2>the base model, to be updated slightly during training on

407
00:20:21.720 --> 00:20:24.000
<v Speaker 2>our new data. So you let it adapt a bit

408
00:20:24.039 --> 00:20:28.480
<v Speaker 2>more? Precisely. You continue the backpropagation process, but typically

409
00:20:28.519 --> 00:20:30.799
<v Speaker 2>with a very small learning rate because the weights are

410
00:20:30.839 --> 00:20:34.440
<v Speaker 2>already pretty good. This lets the network specialize its learned

411
00:20:34.480 --> 00:20:36.839
<v Speaker 2>features more towards the specifics of your data set.

412
00:20:36.839 --> 00:20:38.720
<v Speaker 1>Okay, so it requires a bit more compute.

413
00:20:38.880 --> 00:20:42.119
<v Speaker 2>It does require more computation than just feature extraction. The

414
00:20:42.200 --> 00:20:45.319
<v Speaker 2>choice between just using features or fine tuning depends on

415
00:20:45.640 --> 00:20:48.759
<v Speaker 2>your hardware, how much data you have, and how similar

416
00:20:48.799 --> 00:20:51.400
<v Speaker 2>your data is to what the model was originally trained on.
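
NOTE
A sketch of the two modes discussed above, feature extraction and fine-tuning, assuming
TensorFlow 2.x. To stay small and offline, a tiny untrained network stands in for the
pre-trained base (hypothetical); a real workflow would load pre-trained weights, for
example via tf.keras.applications.InceptionV3(weights="imagenet", include_top=False).
The 32x32 input, the five classes, and the 1e-5 learning rate are illustrative values.
```python
import tensorflow as tf
# Stand-in for a pre-trained feature extractor (NOT actually pre-trained).
first_conv = tf.keras.layers.Conv2D(8, 3, activation="relu")
last_conv = tf.keras.layers.Conv2D(16, 3, activation="relu")
base = tf.keras.Sequential(
    [tf.keras.Input(shape=(32, 32, 3)), first_conv, last_conv, tf.keras.layers.GlobalAveragePooling2D()],
    name="base",
)
# Feature extraction: freeze the whole base and train only a new classification head.
base.trainable = False
model = tf.keras.Sequential([base, tf.keras.layers.Dense(5, activation="softmax")])
_ = model(tf.zeros((1, 32, 32, 3)))  # build the weights with a dummy batch
frozen_trainable = len(model.trainable_weights)  # head kernel + head bias
# Fine-tuning: unfreeze the base, keep its earliest layer frozen, use a small learning rate.
base.trainable = True
first_conv.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="sparse_categorical_crossentropy")
finetune_trainable = len(model.trainable_weights)  # last conv + head
print(frozen_trainable, finetune_trainable)
```
Recompiling after changing the trainable flags is what makes the new setting take effect for training.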

417
00:20:51.480 --> 00:20:55.440
<v Speaker 1>Right, trade-offs. Okay, moving beyond just classifying whole images,

418
00:20:55.640 --> 00:20:58.160
<v Speaker 1>let's talk about object detection. How is that different?

419
00:20:58.440 --> 00:21:02.319
<v Speaker 2>Object detection is a step up in complexity. Image classification

420
00:21:02.440 --> 00:21:05.039
<v Speaker 2>just gives you one label for the whole image.

421
00:21:05.319 --> 00:21:05.680
<v Speaker 1>Cat.

422
00:21:06.279 --> 00:21:10.599
<v Speaker 2>Object detection does two things simultaneously. It localizes objects by

423
00:21:10.680 --> 00:21:14.920
<v Speaker 2>drawing a bounding box around them, predicting the x, y coordinates, width, and height.

424
00:21:14.599 --> 00:21:16.519
<v Speaker 1>Pinpointing where it is. Yep.

425
00:21:16.519 --> 00:21:19.319
<v Speaker 2>And then it classifies what's inside that box, so cat

426
00:21:19.359 --> 00:21:22.559
<v Speaker 2>at coordinates one hundred, fifty, eighty, sixty. This is

427
00:21:22.640 --> 00:21:24.960
<v Speaker 2>absolutely crucial for things like self driving cars.

428
00:21:25.039 --> 00:21:26.880
<v Speaker 1>Yeah, you need to know where the pedestrians and other

429
00:21:26.920 --> 00:21:28.480
<v Speaker 1>cars are, not just that they exist somewhere.

430
00:21:28.519 --> 00:21:29.599
<v Speaker 3>Exactly.

431
00:21:29.880 --> 00:21:33.160
<v Speaker 2>It needs to identify and pinpoint multiple objects in a

432
00:21:33.160 --> 00:21:37.759
<v Speaker 2>busy scene. Interestingly, it treats the bounding box prediction the

433
00:21:37.799 --> 00:21:43.319
<v Speaker 2>localization part as a regression problem, predicting those continuous coordinate values. Ah, okay,

434
00:21:43.720 --> 00:21:47.400
<v Speaker 2>predicting numbers for the box. How do we measure how

435
00:21:47.440 --> 00:21:50.599
<v Speaker 2>well these detectors are doing? Is accuracy still the main thing?

436
00:21:51.279 --> 00:21:54.119
<v Speaker 1>Not really. For the localization part, a key metric here

437
00:21:54.240 --> 00:21:56.359
<v Speaker 1>is intersection over union, or IoU.

438
00:21:56.319 --> 00:21:58.359
<v Speaker 2>Intersection over union.

439
00:21:58.519 --> 00:22:01.119
<v Speaker 1>Yeah, imagine the box your model predicted and the actual

440
00:22:01.160 --> 00:22:05.559
<v Speaker 1>correct box, the ground truth. IoU is the ratio of

441
00:22:05.599 --> 00:22:08.839
<v Speaker 1>the area where they overlap divided by the total area they cover together.

442
00:22:08.680 --> 00:22:10.880
<v Speaker 2>So how much they agree on the location.

443
00:22:11.400 --> 00:22:14.839
<v Speaker 1>Pretty much. A higher IoU means a better match. Often

444
00:22:14.880 --> 00:22:18.599
<v Speaker 1>an IoU greater than point five is considered a decent detection.
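
NOTE
The ratio just described can be sketched in plain Python. This assumes boxes are given as
(x, y, width, height) with (x, y) the top-left corner, which is one common convention and
not something the episode specifies. The two example boxes are made-up values.
```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the overlap rectangle, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Area covered by either box: sum of both areas minus the doubly counted overlap.
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0
predicted = (100, 50, 80, 60)
ground_truth = (110, 60, 80, 60)
score = iou(predicted, ground_truth)
print(round(score, 3), score > 0.5)
```
Identical boxes give a score of one and boxes that do not touch give zero.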

445
00:22:18.519 --> 00:22:22.079
<v Speaker 2>And for the overall performance, especially with multiple object types?

446
00:22:22.160 --> 00:22:26.079
<v Speaker 1>For that we often use mean average precision, mAP. It basically

447
00:22:26.119 --> 00:22:29.000
<v Speaker 1>averages the average-precision score across all the different object classes

448
00:22:29.039 --> 00:22:32.240
<v Speaker 1>the detector is trying to find, often calculated at different

449
00:22:32.240 --> 00:22:36.319
<v Speaker 1>IoU thresholds. It gives a single number summarizing the overall quality.

450
00:22:36.400 --> 00:22:39.839
<v Speaker 2>Okay, mAP. And I've heard of things like YOLO or SSD.

451
00:22:40.319 --> 00:22:41.880
<v Speaker 2>What are anchor based detectors?

452
00:22:42.039 --> 00:22:45.160
<v Speaker 1>Right. YOLO, You Only Look Once, and SSD, Single Shot

453
00:22:45.240 --> 00:22:48.160
<v Speaker 1>MultiBox Detector, are really popular and efficient state of the

454
00:22:48.240 --> 00:22:50.480
<v Speaker 1>art methods. They're anchor-based. Anchor boxes?

455
00:22:50.720 --> 00:22:54.799
<v Speaker 2>Yeah. They effectively overlay a dense grid of pre defined

456
00:22:54.880 --> 00:22:59.160
<v Speaker 2>default boxes, the anchors, of various sizes and aspect ratios,

457
00:22:59.200 --> 00:23:00.720
<v Speaker 2>all over the input image.

458
00:23:00.400 --> 00:23:02.440
<v Speaker 1>Like potential object locations and shapes?

459
00:23:01.920 --> 00:23:03.119
<v Speaker 3>Exactly.

460
00:23:03.480 --> 00:23:06.200
<v Speaker 2>Then in a single forward pass through the network, they

461
00:23:06.240 --> 00:23:09.960
<v Speaker 2>predict adjustments to these anchor boxes to better fit actual

462
00:23:10.039 --> 00:23:13.519
<v Speaker 2>objects and also classify what's in each adjusted box. It's

463
00:23:13.559 --> 00:23:16.319
<v Speaker 2>incredibly efficient for detecting multiple objects fast.
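
NOTE
A simplified sketch of the anchor grid idea, in plain Python. It only generates the
default boxes; the network's predicted adjustments are not modeled. The function name
make_anchors, the 300 pixel image, the 3x3 grid, and the scales and aspect ratios are
all hypothetical, and real YOLO or SSD implementations differ in detail.
```python
import itertools
def make_anchors(image_size, grid_size, scales, aspect_ratios):
    """Return (center_x, center_y, width, height) anchors for every grid cell."""
    step = image_size / grid_size
    anchors = []
    for row, col in itertools.product(range(grid_size), repeat=2):
        # Anchors are centered in their grid cell.
        cx, cy = (col + 0.5) * step, (row + 0.5) * step
        for scale, ratio in itertools.product(scales, aspect_ratios):
            # ratio is width / height; the box area stays scale * scale.
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors
anchors = make_anchors(image_size=300, grid_size=3, scales=[64, 128], aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors), anchors[0])
```
Nine cells times two scales times three aspect ratios gives the dense set of candidate boxes.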

464
00:23:16.559 --> 00:23:19.319
<v Speaker 1>Very cool. Okay, let's talk about something that really feels

465
00:23:19.359 --> 00:23:25.519
<v Speaker 1>cutting edge. Generative adversarial networks, or GANs. How on earth

466
00:23:25.559 --> 00:23:28.799
<v Speaker 1>do they work? That adversarial training sounds intense.

467
00:23:28.519 --> 00:23:32.839
<v Speaker 2>It is fascinating. GANs involve this adversarial game, a competition

468
00:23:32.920 --> 00:23:36.480
<v Speaker 2>between two neural networks, a generator and a discriminator.

469
00:23:36.559 --> 00:23:37.920
<v Speaker 1>Okay, two players. What do they do?

470
00:23:38.160 --> 00:23:41.000
<v Speaker 2>The generator's job is to learn the underlying patterns in

471
00:23:41.039 --> 00:23:44.039
<v Speaker 2>the real data, and then create new fake data, like

472
00:23:44.079 --> 00:23:47.480
<v Speaker 2>synthetic images that look as realistic as possible. Its goal

473
00:23:47.559 --> 00:23:50.480
<v Speaker 2>is to fool the discriminator. And the discriminator? The discriminator

474
00:23:50.519 --> 00:23:53.559
<v Speaker 2>acts like a detective. Its job is to look at

475
00:23:53.559 --> 00:23:56.079
<v Speaker 2>an example, either a real one from the data set

476
00:23:56.200 --> 00:23:59.640
<v Speaker 2>or a fake one from the generator, and decide is

477
00:23:59.680 --> 00:24:03.519
<v Speaker 2>this real or fake? It's basically a binary classifier.

478
00:24:03.119 --> 00:24:06.119
<v Speaker 1>So they're locked in this constant battle, the generator trying

479
00:24:06.160 --> 00:24:08.920
<v Speaker 1>to trick the discriminator and the discriminator trying to catch

480
00:24:08.960 --> 00:24:10.000
<v Speaker 1>the fakes. Exactly.

481
00:24:10.079 --> 00:24:12.839
<v Speaker 2>It's a minimax game. The generator gets better by

482
00:24:12.920 --> 00:24:16.559
<v Speaker 2>learning from the discriminator's mistakes. When the discriminator gets fooled,

483
00:24:17.000 --> 00:24:20.000
<v Speaker 2>the generator knows it did something right. The discriminator gets

484
00:24:20.039 --> 00:24:22.880
<v Speaker 2>better by learning to spot the generator's improving fakes.

485
00:24:22.440 --> 00:24:24.519
<v Speaker 1>And this continues until?

486
00:24:24.480 --> 00:24:27.759
<v Speaker 2>It continues until, ideally, the generator gets so good at creating

487
00:24:27.799 --> 00:24:30.799
<v Speaker 2>realistic fakes that the discriminator can no longer tell the difference.

488
00:24:31.000 --> 00:24:34.880
<v Speaker 2>It's essentially just guessing randomly, like fifty-fifty accuracy.

489
00:24:34.759 --> 00:24:37.920
<v Speaker 1>At which point the generator has really learned the essence of the real data.

490
00:24:37.799 --> 00:24:39.160
<v Speaker 3>Precisely.

491
00:24:39.400 --> 00:24:42.079
<v Speaker 2>It has learned to capture the underlying distribution of the

492
00:24:42.119 --> 00:24:47.240
<v Speaker 2>training data incredibly well, allowing it to generate novel, convincing samples.
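
NOTE
A plain Python sketch of the two competing objectives, using binary cross-entropy on the
discriminator's output probability. The function names are hypothetical, and the generator
loss shown is the commonly used non-saturating variant, which is an assumption and not
something stated in the episode.
```python
import math
def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))
def discriminator_loss(d_real, d_fake):
    # The discriminator wants real samples scored as 1 and generated samples as 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)
def generator_loss(d_fake):
    # The generator wants its samples scored as real (label 1).
    return bce(d_fake, 1.0)
# A confident, correct discriminator has a low loss while the generator's loss is high.
early = (discriminator_loss(0.9, 0.1), generator_loss(0.1))
# At the fifty-fifty point the discriminator outputs 0.5 for everything.
balanced = (discriminator_loss(0.5, 0.5), generator_loss(0.5))
print(early, balanced)
```
At the balanced point the discriminator loss equals two times the natural log of two, the value of pure guessing.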

493
00:24:47.519 --> 00:24:51.759
<v Speaker 1>Amazing. What are some surprising applications? Is it just making cool pictures?

494
00:24:52.160 --> 00:24:55.720
<v Speaker 2>Generating realistic images, art, or music is definitely a big part,

495
00:24:56.079 --> 00:25:00.200
<v Speaker 2>but the underlying idea is powerful. GANs are great for

496
00:25:00.240 --> 00:25:03.720
<v Speaker 2>things like anomaly detection. If something doesn't fit the learned distribution,

497
00:25:04.039 --> 00:25:07.119
<v Speaker 2>it's likely an anomaly. But the really mind blowing stuff

498
00:25:07.160 --> 00:25:08.640
<v Speaker 2>comes with conditional GANs.

499
00:25:08.640 --> 00:25:11.759
<v Speaker 1>Conditional, meaning you give them some instructions? Sort of.

500
00:25:11.839 --> 00:25:15.400
<v Speaker 2>You provide some extra information, a condition, along with the

501
00:25:15.480 --> 00:25:18.519
<v Speaker 2>random input. This could be a class label, generate a

502
00:25:18.559 --> 00:25:21.680
<v Speaker 2>picture of a cat, or maybe a sketch, or even

503
00:25:21.680 --> 00:25:27.160
<v Speaker 2>a semantic map labeling regions like sky, road, building.

504
00:25:26.799 --> 00:25:30.400
<v Speaker 1>And the GAN generates an image matching that condition? Exactly.

505
00:25:30.799 --> 00:25:33.480
<v Speaker 2>Think about generating a realistic street scene just from those

506
00:25:33.519 --> 00:25:37.319
<v Speaker 2>simple semantic labels, or automatically colorizing a black and white

507
00:25:37.359 --> 00:25:40.759
<v Speaker 2>photo based on learned color patterns, or image super resolution

508
00:25:40.920 --> 00:25:43.240
<v Speaker 2>creating a high res image from a low res one.

509
00:25:43.680 --> 00:25:46.119
<v Speaker 2>These are often specialized types of conditional GANs.

510
00:25:46.440 --> 00:25:51.000
<v Speaker 1>That's incredible. Okay, one more advanced topic: semantic segmentation. How's

511
00:25:51.039 --> 00:25:53.720
<v Speaker 1>that different from object detection with its bounding boxes?

512
00:25:53.960 --> 00:25:58.720
<v Speaker 2>Semantic segmentation goes a step further in granularity. Object detection

513
00:25:58.960 --> 00:26:02.640
<v Speaker 2>gives you a box around an object. Semantic segmentation assigns

514
00:26:02.680 --> 00:26:05.640
<v Speaker 2>a class label to every single pixel in the input image.

515
00:26:05.759 --> 00:26:06.799
<v Speaker 1>Every pixel. Wow.

516
00:26:06.960 --> 00:26:09.640
<v Speaker 2>Yeah, so instead of just a box around the car,

517
00:26:09.759 --> 00:26:12.319
<v Speaker 2>it colors all the pixels belonging to the car blue,

518
00:26:12.640 --> 00:26:14.839
<v Speaker 2>all the pixels belonging to the road gray, all the

519
00:26:14.839 --> 00:26:17.000
<v Speaker 2>pixels belonging to the sky light blue.

520
00:26:16.759 --> 00:26:18.720
<v Speaker 3>And so on. So you get the exact shape and

521
00:26:18.720 --> 00:26:19.880
<v Speaker 3>boundaries? Exactly.

522
00:26:20.519 --> 00:26:24.279
<v Speaker 2>This fine-grained understanding is vital for tasks where precise

523
00:26:24.400 --> 00:26:28.759
<v Speaker 2>shape matters. Think about medical imaging: precisely outlining a tumor

524
00:26:28.920 --> 00:26:30.240
<v Speaker 2>or blood vessels.

525
00:26:30.000 --> 00:26:32.960
<v Speaker 1>Or for self driving cars, knowing exactly where the drivable

526
00:26:33.079 --> 00:26:35.920
<v Speaker 1>road surface is, pixel by pixel.

527
00:26:35.960 --> 00:26:40.119
<v Speaker 2>Precisely that. It often involves network architectures with upsampling layers, sometimes

528
00:26:40.200 --> 00:26:43.759
<v Speaker 2>called deconvolution or transposed convolution to get back to the

529
00:26:43.799 --> 00:26:47.319
<v Speaker 2>original image resolution and make those pixel level predictions.
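
NOTE
A minimal sketch of upsampling with a transposed convolution followed by a per-pixel class
choice, assuming TensorFlow 2.x. The random 16x16 feature map stands in for the output of
an encoder, and the three classes are hypothetical; a real segmentation network would
stack several such layers and be trained on labeled masks.
```python
import tensorflow as tf
num_classes = 3  # hypothetical classes, for example car, road, sky
# Stand-in for a downsampled feature map: 1 image, 16x16 positions, 8 channels.
features = tf.random.normal((1, 16, 16, 8))
# A stride-2 transposed convolution with "same" padding doubles height and width.
upsample = tf.keras.layers.Conv2DTranspose(filters=num_classes, kernel_size=3, strides=2, padding="same")
logits = upsample(features)
# One class index per pixel: this is the segmentation map.
class_map = tf.argmax(logits, axis=-1)
print(logits.shape, class_map.shape)
```
The layer here is untrained, so the class map is meaningless; only the shapes illustrate the idea.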

530
00:26:47.519 --> 00:26:50.359
<v Speaker 1>Okay, so we've covered building and training these amazing models,

531
00:26:50.680 --> 00:26:54.279
<v Speaker 1>from basic classification up to GANs and segmentation. How do

532
00:26:54.279 --> 00:26:55.920
<v Speaker 1>we actually get them out of the lab, out of

533
00:26:55.920 --> 00:27:00.640
<v Speaker 1>our Python notebooks and into real world applications? Deployment, right?

534
00:27:00.720 --> 00:27:03.559
<v Speaker 2>Deployment is key. You've trained this great model, now what?

535
00:27:04.279 --> 00:27:07.400
<v Speaker 2>TensorFlow's standard way to package a trained model is the

536
00:27:07.440 --> 00:27:08.519
<v Speaker 2>SavedModel format.

537
00:27:08.640 --> 00:27:09.960
<v Speaker 1>SavedModel. What does that contain?

538
00:27:10.119 --> 00:27:13.319
<v Speaker 2>It's designed to be self contained and portable. It bundles

539
00:27:13.359 --> 00:27:18.319
<v Speaker 2>everything together, the model's architecture, the learned weights, variables, any

540
00:27:18.440 --> 00:27:22.880
<v Speaker 2>necessary assets or auxiliary files, and importantly, a compiled representation

541
00:27:22.960 --> 00:27:26.640
<v Speaker 2>of the computations. It's the universal format for sharing and

542
00:27:26.680 --> 00:27:27.880
<v Speaker 2>deploying TF models.

543
00:27:27.640 --> 00:27:31.400
<v Speaker 1>And the source mentions it's language agnostic. What does

544
00:27:31.440 --> 00:27:33.200
<v Speaker 1>that mean in practice? That sounds really useful.

545
00:27:33.240 --> 00:27:35.440
<v Speaker 2>It's a huge advantage. It means you're not tied to

546
00:27:35.480 --> 00:27:37.759
<v Speaker 2>Python for running the model, even if you trained it

547
00:27:37.759 --> 00:27:41.000
<v Speaker 2>in Python. The SavedModel can be loaded and executed

548
00:27:41.000 --> 00:27:44.880
<v Speaker 2>by TensorFlow libraries in other languages as well. For example,

549
00:27:45.000 --> 00:27:48.480
<v Speaker 2>TensorFlow.js lets you load and run SavedModels directly

550
00:27:48.519 --> 00:27:50.480
<v Speaker 2>in JavaScript, right in a web browser or a

551
00:27:50.680 --> 00:27:54.119
<v Speaker 2>Node.js back end. Super powerful for web apps. AI in

552
00:27:54.160 --> 00:27:57.880
<v Speaker 2>the browser? Exactly. And there are bindings for other languages too,

553
00:27:57.960 --> 00:28:00.559
<v Speaker 2>like Go, which is popular for back-end services

554
00:28:00.559 --> 00:28:03.119
<v Speaker 2>and data centers and the cloud. There are C++

555
00:28:03.119 --> 00:28:05.799
<v Speaker 2>bindings, Java and Swift bindings too.

556
00:28:05.400 --> 00:28:07.920
<v Speaker 1>So you can integrate the model into almost any

557
00:28:07.960 --> 00:28:09.160
<v Speaker 1>existing software stack.

558
00:28:09.240 --> 00:28:12.440
<v Speaker 2>Pretty much. This flexibility is crucial for getting these powerful

559
00:28:12.480 --> 00:28:15.480
<v Speaker 2>models out of the research phase and into production systems

560
00:28:15.480 --> 00:28:17.079
<v Speaker 2>where they can actually have an impact.
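
NOTE
A small sketch of exporting and reloading in the SavedModel format from Python, assuming
TensorFlow 2.x. The Doubler module and its double method are hypothetical stand-ins for a
trained model, and a temporary directory is used as the export path.
```python
import tempfile
import tensorflow as tf
class Doubler(tf.Module):
    # The input signature fixes the exported function's interface: a 1-D float32 tensor.
    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def double(self, x):
        return x * 2.0
export_dir = tempfile.mkdtemp()
# Writes the serialized graph and variables to the directory.
tf.saved_model.save(Doubler(), export_dir)
# The original Python class is not needed to load and run the exported function.
restored = tf.saved_model.load(export_dir)
result = restored.double(tf.constant([1.0, 2.0])).numpy().tolist()
print(result)
```
The same exported directory is what serving tools and the other language bindings would consume.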

561
00:28:17.240 --> 00:28:20.119
<v Speaker 1>Wow, what an incredible deep dive that was. We've really

562
00:28:20.200 --> 00:28:24.119
<v Speaker 1>journeyed from the absolute basics the idea of machines learning

563
00:28:24.160 --> 00:28:25.480
<v Speaker 1>from data.

564
00:28:25.119 --> 00:28:29.119
<v Speaker 2>Yeah, through the brain inspired structure of neural networks.

565
00:28:28.839 --> 00:28:34.279
<v Speaker 1>The whole complex dance of training them with gradient descent, optimizers, backpropagation.

566
00:28:33.680 --> 00:28:36.319
<v Speaker 2>And then seeing how frameworks like TensorFlow two point zero,

567
00:28:36.440 --> 00:28:40.720
<v Speaker 2>especially with Keras, make all that theory practical: building, training,

568
00:28:40.880 --> 00:28:42.839
<v Speaker 2>optimizing data pipelines.

569
00:28:42.480 --> 00:28:47.279
<v Speaker 1>Right, enabling us to tackle really sophisticated tasks transfer learning,

570
00:28:47.319 --> 00:28:51.880
<v Speaker 1>object detection, even generating data with GANs or doing pixel

571
00:28:51.920 --> 00:28:53.279
<v Speaker 1>level semantic segmentation.

572
00:28:53.640 --> 00:28:57.680
<v Speaker 2>Absolutely. And that SavedModel format, making deployment across different

573
00:28:57.680 --> 00:29:01.319
<v Speaker 2>platforms possible, really closes the loop. It's a testament to how

574
00:29:01.319 --> 00:29:02.759
<v Speaker 2>mature these tools have become.

575
00:29:02.920 --> 00:29:06.200
<v Speaker 1>Definitely. Hopefully this deep dive has given you, our listener,

576
00:29:06.319 --> 00:29:09.440
<v Speaker 1>a real shortcut to being well informed and maybe spark

577
00:29:09.519 --> 00:29:11.319
<v Speaker 1>some curiosity to dig even deeper.

578
00:29:11.599 --> 00:29:13.839
<v Speaker 2>Yeah, the field moves so fast, there's always more to learn.

579
00:29:13.559 --> 00:29:16.400
<v Speaker 1>So the question we leave you with

580
00:29:16.559 --> 00:29:21.319
<v Speaker 1>is: armed with this understanding, what surprising patterns or insights

581
00:29:21.400 --> 00:29:23.519
<v Speaker 1>might you uncover next? What could you build
