WEBVTT

1
00:00:00.080 --> 00:00:04.799
<v Speaker 1>Imagine your smartphone unlocking just by glancing at your face,

2
00:00:05.679 --> 00:00:09.759
<v Speaker 1>or maybe a robotic arm in a factory meticulously inspecting

3
00:00:09.800 --> 00:00:12.560
<v Speaker 1>products, catching defects you or I might totally miss.

4
00:00:12.679 --> 00:00:15.880
<v Speaker 2>Yeah, it really feels like artificial intelligence has somehow gained

5
00:00:15.919 --> 00:00:16.640
<v Speaker 2>the gift of sight.

6
00:00:16.839 --> 00:00:19.760
<v Speaker 1>It does, but how do computers actually do that? How

7
00:00:19.760 --> 00:00:23.920
<v Speaker 1>do they get this incredible ability to see and interpret

8
00:00:24.079 --> 00:00:25.160
<v Speaker 1>the visual world?

9
00:00:25.559 --> 00:00:29.399
<v Speaker 2>Well, it's quite a journey really, from just raw data

10
00:00:28.760 --> 00:00:32.960
<v Speaker 2>to pretty profound insights, and it's all built on this

11
00:00:33.039 --> 00:00:36.640
<v Speaker 2>fascinating intersection of computer vision and artificial neural networks.

12
00:00:36.840 --> 00:00:38.439
<v Speaker 1>It sounds like sci-fi, but it's happening now.

13
00:00:38.479 --> 00:00:41.200
<v Speaker 2>It absolutely is. A rapidly evolving reality.

14
00:00:41.280 --> 00:00:43.560
<v Speaker 1>And that's exactly what we're diving into today. We're drawing

15
00:00:43.600 --> 00:00:46.320
<v Speaker 1>from a really comprehensive guide on building these kinds of

16
00:00:46.359 --> 00:00:50.520
<v Speaker 1>powerful AI systems. Our mission basically is to pull back

17
00:00:50.520 --> 00:00:55.000
<v Speaker 1>the curtain a bit, demystify how computers perceive, process, and

18
00:00:55.399 --> 00:00:58.320
<v Speaker 1>ultimately make sense of images and video.

19
00:00:58.159 --> 00:01:01.640
<v Speaker 2>We'll trace that whole path from the simplest element, the pixel,

20
00:01:01.960 --> 00:01:05.439
<v Speaker 2>all the way up to these super complex AI architectures.

21
00:01:05.079 --> 00:01:11.560
<v Speaker 1>Exactly. Things like object tracking, face recognition. It's a surprising journey.

22
00:01:11.480 --> 00:01:14.200
<v Speaker 2>And you'll hopefully get a clear understanding of the mechanisms,

23
00:01:14.239 --> 00:01:17.760
<v Speaker 2>the innovations behind it all. How these intelligent eyes actually work.

24
00:01:17.959 --> 00:01:19.920
<v Speaker 1>Okay, so let's kick things off right at the beginning.

25
00:01:20.239 --> 00:01:23.280
<v Speaker 1>How does a computer even see an image? We know

26
00:01:23.319 --> 00:01:26.079
<v Speaker 1>they're digital pixels and all that, but how does it

27
00:01:26.159 --> 00:01:26.959
<v Speaker 1>interpret them?

28
00:01:27.159 --> 00:01:30.560
<v Speaker 2>Right. So, at its core, a digital image is just

29
00:01:30.599 --> 00:01:34.000
<v Speaker 2>a grid, a grid of pixels. For something simple like

30
00:01:34.040 --> 00:01:37.480
<v Speaker 2>a grayscale image, each pixel is just one number, usually

31
00:01:37.560 --> 00:01:38.920
<v Speaker 2>between zero and two fifty five.

32
00:01:39.120 --> 00:01:41.599
<v Speaker 1>Zero for black, two hundred and fifty five for white.

33
00:01:41.760 --> 00:01:44.359
<v Speaker 2>Exactly. Zero is black, two fifty five is white, and

34
00:01:44.439 --> 00:01:46.359
<v Speaker 2>everything in between is just a shade of gray. It's

35
00:01:46.400 --> 00:01:47.959
<v Speaker 2>literally just a matrix of numbers.

36
00:01:48.079 --> 00:01:50.760
<v Speaker 1>Okay, simple enough for black and white. But what about color?

37
00:01:50.799 --> 00:01:53.000
<v Speaker 1>How do you get all the richness of color from numbers?

38
00:01:53.079 --> 00:01:56.280
<v Speaker 2>Ah, that's where models like RGB come in. Red, green, blue.

39
00:01:56.319 --> 00:01:58.799
<v Speaker 2>Instead of one number per pixel, you get three, a

40
00:01:58.799 --> 00:02:00.159
<v Speaker 2>little bundle, a tuple.

41
00:01:59.840 --> 00:02:02.120
<v Speaker 1>Basically, so each pixel has a red value, a

42
00:02:02.120 --> 00:02:02.959
<v Speaker 1>green value, and a blue value.

43
00:02:02.959 --> 00:02:06.000
<v Speaker 2>Precisely. Each one also ranging from zero to

44
00:02:06.000 --> 00:02:09.879
<v Speaker 2>two hundred and fifty five. So, like, zero, zero, zero: no color,

45
00:02:09.919 --> 00:02:12.960
<v Speaker 2>that's black. Okay. Two fifty five, zero, zero would be

46
00:02:13.280 --> 00:02:15.159
<v Speaker 2>pure red. Pure red. So what do you think

47
00:02:15.360 --> 00:02:17.960
<v Speaker 2>zero, zero, two fifty five would be? Or two fifty five,

48
00:02:18.159 --> 00:02:19.080
<v Speaker 2>two fifty five, two fifty five?

49
00:02:19.080 --> 00:02:21.840
<v Speaker 1>Okay, following that logic, zero zero two fifty five must

50
00:02:21.879 --> 00:02:24.080
<v Speaker 1>be pure blue, and if all three are maxed out

51
00:02:24.080 --> 00:02:27.280
<v Speaker 1>at two fifty five, that's gotta be white, right? Combining all the light.

52
00:02:27.159 --> 00:02:30.080
<v Speaker 2>You got it. Pure white. It's actually quite elegant,

53
00:02:30.120 --> 00:02:32.800
<v Speaker 2>isn't it, how these simple number combinations create this huge

54
00:02:32.919 --> 00:02:33.719
<v Speaker 2>range of colors.
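
NOTE Editor's sketch, not taken from the guide the speakers discuss: a
minimal pure-Python picture of the grayscale matrix and the RGB tuples
described above. Nested lists stand in for real image arrays, and the
names NAMES and describe are made up for this illustration.
```python
# Assumption: images are plain nested lists; real libraries use arrays.
# A 2x3 grayscale image: one number per pixel, 0 = black, 255 = white.
gray = [
    [0, 128, 255],
    [64, 192, 32],
]
# A 1x4 RGB image: one (red, green, blue) tuple per pixel.
rgb = [[(0, 0, 0), (255, 0, 0), (0, 0, 255), (255, 255, 255)]]
# Hypothetical lookup table covering only the pure colours from the dialogue.
NAMES = {
    (0, 0, 0): "black",
    (255, 0, 0): "red",
    (0, 255, 0): "green",
    (0, 0, 255): "blue",
    (255, 255, 255): "white",
}
def describe(pixel):
    """Return a colour name for the few pure colours listed in NAMES."""
    return NAMES.get(pixel, "mixed")
labels = [describe(p) for p in rgb[0]]
```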

55
00:02:33.759 --> 00:02:34.360
<v Speaker 1>It really is.

56
00:02:34.520 --> 00:02:37.599
<v Speaker 2>And you know, once the computer can represent an image

57
00:02:38.000 --> 00:02:40.479
<v Speaker 2>as these numbers, then it gets the power to manipulate

58
00:02:40.520 --> 00:02:42.159
<v Speaker 2>them in loads of ways.

59
00:02:42.280 --> 00:02:43.759
<v Speaker 1>Right. This is where we get into the sort of

60
00:02:44.000 --> 00:02:48.400
<v Speaker 1>digital darkroom idea. Basic stuff like resizing, moving things around.

61
00:02:48.159 --> 00:02:54.759
<v Speaker 2>Yep, resizing, translation, rotation, flipping, cropping, standard geometric things,

62
00:02:54.879 --> 00:02:57.479
<v Speaker 2>and you mentioned resizing issues. Some methods are better

63
00:02:57.520 --> 00:03:00.439
<v Speaker 2>than others, like bicubic interpolation, which usually gives you a

64
00:03:00.479 --> 00:03:04.199
<v Speaker 2>smoother and nicer looking result compared to simpler ones like bilinear.

65
00:03:04.439 --> 00:03:07.800
<v Speaker 1>Okay, that makes sense. But beyond just moving blocks of pixels,

66
00:03:07.879 --> 00:03:09.639
<v Speaker 1>what about changing the pixels themselves? You mentioned arithmetic.

67
00:03:09.639 --> 00:03:13.719
<v Speaker 2>Image arithmetic and bitwise operations. These are

68
00:03:13.719 --> 00:03:18.800
<v Speaker 2>more pixel level manipulations. Think about adding a number to

69
00:03:18.919 --> 00:03:21.319
<v Speaker 2>every pixel value, or subtracting one.

70
00:03:20.759 --> 00:03:24.000
<v Speaker 1>So, like brightening or darkening the whole image?

71
00:03:24.039 --> 00:03:26.719
<v Speaker 2>Exactly. And if a calculation pushes a pixel value above two

72
00:03:26.759 --> 00:03:29.800
<v Speaker 2>fifty five or below zero, it usually just gets clipped,

73
00:03:29.840 --> 00:03:32.639
<v Speaker 2>stuck at the max or min value. Stops things getting weird.

74
00:03:32.360 --> 00:03:35.319
<v Speaker 1>Prevents crazy colors appearing out of nowhere, right?
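
NOTE Editor's sketch, not from the guide: the brighten-or-darken
arithmetic with clipping, written in plain Python. The function name
adjust_brightness and the sample values are assumptions for
illustration. Some array libraries wrap around instead of clipping,
so the clamp is written out explicitly here.
```python
def adjust_brightness(image, delta):
    """Add delta to every pixel and clip the result to the 0..255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]
image = [[10, 100, 250], [0, 200, 255]]
brighter = adjust_brightness(image, 50)
darker = adjust_brightness(image, -50)
```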

75
00:03:35.199 --> 00:03:39.479
<v Speaker 2>And then you have bitwise operations: AND, OR, XOR,

76
00:03:39.680 --> 00:03:42.639
<v Speaker 2>and NOT. These are really powerful for things like masking.

77
00:03:42.800 --> 00:03:45.159
<v Speaker 1>Masking? Like cutting out a shape?

78
00:03:45.000 --> 00:03:47.479
<v Speaker 2>Kind of. Imagine you have a black and white image,

79
00:03:47.520 --> 00:03:50.520
<v Speaker 2>like a stencil. You can use a bitwise AND operation

80
00:03:50.639 --> 00:03:53.520
<v Speaker 2>between that mask and your main image. It essentially keeps

81
00:03:53.520 --> 00:03:55.560
<v Speaker 2>only the parts of the main image where the mask

82
00:03:55.680 --> 00:03:58.639
<v Speaker 2>is white. It's like a digital cutout, very precise control.
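
NOTE Editor's sketch, not from the guide: masking with a bitwise AND
on a tiny grayscale image. It assumes the mask holds only 0 (black)
and 255 (white); apply_mask is a made-up helper name.
```python
def apply_mask(image, mask):
    """Bitwise AND each pixel with its mask value (255 keeps, 0 blanks)."""
    return [
        [p & m for p, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]
image = [[12, 200, 37], [90, 255, 140], [66, 18, 251]]
# A plus-shaped stencil: white where the image should survive.
mask = [[0, 255, 0], [255, 255, 255], [0, 255, 0]]
cutout = apply_mask(image, mask)
```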

83
00:03:58.800 --> 00:04:01.319
<v Speaker 1>Ah, I see. So you can isolate specific parts of

84
00:04:01.319 --> 00:04:02.400
<v Speaker 1>an image very cleanly.

85
00:04:02.520 --> 00:04:05.199
<v Speaker 2>Yep. And we also use other operations for cleaning things

86
00:04:05.240 --> 00:04:07.319
<v Speaker 2>up or highlighting details.

87
00:04:07.240 --> 00:04:10.400
<v Speaker 1>Like blurring to reduce noise or smooth things out?

88
00:04:10.680 --> 00:04:13.800
<v Speaker 2>Exactly. Techniques like Gaussian blur and median blur are common for smoothing.

89
00:04:14.159 --> 00:04:15.680
<v Speaker 2>And then on the flip side, if you want to

90
00:04:15.719 --> 00:04:18.560
<v Speaker 2>find edges, the outlines of objects,

91
00:04:18.160 --> 00:04:21.519
<v Speaker 1>you'd use edge detection filters, like Sobel?

92
00:04:20.959 --> 00:04:24.680
<v Speaker 2>Sobel, sure. Yeah, these filters are designed to spot sharp

93
00:04:24.800 --> 00:04:28.480
<v Speaker 2>changes in pixel intensity, which usually happen at edges. It

94
00:04:28.480 --> 00:04:30.920
<v Speaker 2>helps the computer see the skeleton of objects.
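
NOTE Editor's sketch, not from the guide: a Sobel filter applied by
hand to a 4x4 grayscale image, to show how a sharp intensity change
becomes a large gradient magnitude. Border pixels are simply skipped,
which is a simplification; sobel_magnitude is a made-up name.
```python
import math
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
def sobel_magnitude(image):
    """Gradient magnitude at interior pixels; border pixels stay 0.0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    value = image[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * value
                    gy += SOBEL_Y[j][i] * value
            out[y][x] = math.hypot(gx, gy)
    return out
# Dark left half, bright right half: one vertical edge down the middle.
image = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(image)
flat = sobel_magnitude([[7] * 4 for _ in range(4)])
```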

95
00:04:31.040 --> 00:04:33.920
<v Speaker 1>And what about just simplifying things down to black and white?

96
00:04:34.040 --> 00:04:38.240
<v Speaker 2>That's binarization. Things like adaptive thresholding or Otsu's method are

97
00:04:38.279 --> 00:04:41.360
<v Speaker 2>clever ways to turn a grayscale image into just black

98
00:04:41.399 --> 00:04:44.560
<v Speaker 2>and white pixels, which can be really useful for certain tasks.
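
NOTE Editor's sketch, not from the guide: Otsu's method on a flat
list of 8-bit pixel values. It picks the threshold that maximises the
between-class variance, then binarizes. The sample pixels are invented,
and adaptive thresholding (the other method mentioned) is not shown.
```python
def otsu_threshold(pixels):
    """Return the gray level t that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        # Pixels with value <= t form the background class.
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        weight_fg = total - weight_bg
        if weight_bg == 0 or weight_fg == 0:
            continue
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_var:
            best_t, best_var = t, variance
    return best_t
pixels = [10, 12, 11, 13, 200, 210, 205, 198]
t = otsu_threshold(pixels)
binary = [255 if p > t else 0 for p in pixels]
```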

99
00:04:45.040 --> 00:04:48.639
<v Speaker 1>Okay, so we've got the basics: breaking images into pixels,

100
00:04:48.680 --> 00:04:52.560
<v Speaker 1>manipulating them. But just seeing pixels isn't understanding, right? The

101
00:04:52.600 --> 00:04:56.079
<v Speaker 1>computer needs to extract actual meaning. How does it learn

102
00:04:56.120 --> 00:04:58.920
<v Speaker 1>to pick out the important stuff, the meaningful features.

103
00:04:59.319 --> 00:05:02.079
<v Speaker 2>That is absolutely the core challenge, and it's addressed by

104
00:05:02.079 --> 00:05:05.480
<v Speaker 2>the computer vision pipeline. It's a sequence. First, you ingest

105
00:05:05.480 --> 00:05:08.399
<v Speaker 2>the image, get the data in, then you process it,

106
00:05:08.480 --> 00:05:11.040
<v Speaker 2>maybe clean it up like we just discussed. Then comes

107
00:05:11.079 --> 00:05:13.160
<v Speaker 2>the crucial step, feature extraction.

108
00:05:13.480 --> 00:05:15.040
<v Speaker 1>Feature extraction, that's the key.

109
00:05:15.040 --> 00:05:17.920
<v Speaker 2>That's where the magic starts, really. It's how the computer

110
00:05:18.040 --> 00:05:22.360
<v Speaker 2>moves beyond just raw pixel values to identify characteristics that

111
00:05:22.399 --> 00:05:25.600
<v Speaker 2>actually mean something, like the curve of an edge, a

112
00:05:25.639 --> 00:05:28.439
<v Speaker 2>specific texture, the corner of an object.

113
00:05:28.639 --> 00:05:31.959
<v Speaker 1>So we're looking for features that are discriminating, things that

114
00:05:32.079 --> 00:05:34.759
<v Speaker 1>help tell one object from another?

115
00:05:34.800 --> 00:05:37.920
<v Speaker 2>Exactly. They need to be discriminating, identifiable across different images of

116
00:05:37.920 --> 00:05:41.160
<v Speaker 2>the same object. And ideally you need lots of examples

117
00:05:41.160 --> 00:05:42.959
<v Speaker 2>to establish those patterns reliably.

118
00:05:43.319 --> 00:05:46.000
<v Speaker 1>And how does the computer store these features once it

119
00:05:46.040 --> 00:05:46.600
<v Speaker 1>finds them?

120
00:05:47.040 --> 00:05:50.439
<v Speaker 2>Typically, these extracted features are represented as a feature vector.

121
00:05:50.959 --> 00:05:53.720
<v Speaker 2>It sounds fancy, but it's basically just a list of numbers,

122
00:05:53.839 --> 00:05:55.160
<v Speaker 2>a one dimensional array.

123
00:05:55.000 --> 00:05:57.600
<v Speaker 1>Okay, a list of numbers representing the important bits of

124
00:05:57.600 --> 00:05:58.079
<v Speaker 1>the image.

125
00:05:58.160 --> 00:06:01.839
<v Speaker 2>Yeah. And here's the sort of aha moment. For

126
00:06:01.959 --> 00:06:05.079
<v Speaker 2>a simple grayscale image, you could just string all the

127
00:06:05.079 --> 00:06:08.839
<v Speaker 2>pixel values together into one massive vector. That is a

128
00:06:08.839 --> 00:06:10.040
<v Speaker 2>feature vector technically.
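
NOTE Editor's sketch, not from the guide: stringing the rows of a
grayscale image together into one flat feature vector. The helper
name flatten and the sample image are assumptions for illustration.
```python
def flatten(image):
    """Join the rows of a grayscale image into a one-dimensional list."""
    return [pixel for row in image for pixel in row]
image = [[0, 50, 100], [150, 200, 255]]
feature_vector = flatten(image)
```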

129
00:06:10.079 --> 00:06:12.759
<v Speaker 1>Wow, okay, so you're boiling down the whole image into

130
00:06:12.800 --> 00:06:15.759
<v Speaker 1>this single numerical signature. That makes it easier for a

131
00:06:15.800 --> 00:06:18.680
<v Speaker 1>machine learning algorithm to chew on, I guess.

132
00:06:19.240 --> 00:06:22.720
<v Speaker 2>Precisely. And what's really powerful about modern deep learning, especially convolutional

133
00:06:22.720 --> 00:06:25.439
<v Speaker 2>neural networks, or CNNs, is that they can actually learn to

134
00:06:25.439 --> 00:06:28.759
<v Speaker 2>extract these features automatically. The network figures out the best

135
00:06:28.759 --> 00:06:30.319
<v Speaker 2>features itself during training.

136
00:06:30.399 --> 00:06:34.639
<v Speaker 1>That's a huge advantage. Less manual work, potentially better features.

137
00:06:34.199 --> 00:06:38.319
<v Speaker 2>Definitely. But even before deep learning, or alongside it, there

138
00:06:38.360 --> 00:06:42.720
<v Speaker 2>are some really clever advanced feature extraction techniques.

139
00:06:42.680 --> 00:06:45.920
<v Speaker 1>Like what? You mentioned histograms, GLCM, HOG?

140
00:06:46.199 --> 00:06:48.560
<v Speaker 2>Right. Histograms are a good starting point, just counting how many

141
00:06:48.560 --> 00:06:51.920
<v Speaker 2>pixels have certain intensity values, but you can do more,

142
00:06:52.360 --> 00:06:57.639
<v Speaker 2>like histogram equalization, which spreads out the intensities to improve contrast, makes details pop.

143
00:06:57.439 --> 00:07:00.160
<v Speaker 1>Okay, and GLCM?

144
00:07:00.600 --> 00:07:05.160
<v Speaker 2>GLCM stands for gray-level co-occurrence matrix. It's fantastic

145
00:07:05.199 --> 00:07:08.079
<v Speaker 2>for analyzing texture. It looks at how often pairs of

146
00:07:08.079 --> 00:07:10.759
<v Speaker 2>pixel values appear together in a certain spatial relationship.

147
00:07:11.000 --> 00:07:12.879
<v Speaker 1>It tells you about the texture, like if it's smooth

148
00:07:13.000 --> 00:07:14.920
<v Speaker 1>or rough or patterned?

149
00:07:14.959 --> 00:07:19.360
<v Speaker 2>Exactly. It gives you statistics like contrast, correlation, energy, homogeneity, all

150
00:07:19.399 --> 00:07:20.360
<v Speaker 2>describing the texture.
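
NOTE Editor's sketch, not from the guide: a gray-level co-occurrence
matrix for one offset (each pixel paired with its right-hand
neighbour) plus the contrast statistic. The 4-level image is a toy
example, the matrix is left non-symmetric for simplicity, and the
function names are made up.
```python
def glcm(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            ny, nx = y + dy, x + dx
            if 0 <= ny < len(image) and 0 <= nx < len(row):
                matrix[value][image[ny][nx]] += 1
    return matrix
def contrast(matrix):
    """Sum of p(i, j) * (i - j)^2 over the normalised matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(
        count / total * (i - j) ** 2
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
image = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 2, 2, 2], [2, 2, 3, 3]]
counts = glcm(image, levels=4)
texture_contrast = contrast(counts)
```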

151
00:07:20.399 --> 00:07:25.480
<v Speaker 1>Cool. And HOG, histograms of oriented gradients? Sounds complex.

152
00:07:25.360 --> 00:07:28.959
<v Speaker 2>The idea is pretty neat, actually. HOGs focus on object

153
00:07:29.040 --> 00:07:33.240
<v Speaker 2>shape and appearance. They look at how image brightness changes,

154
00:07:33.279 --> 00:07:36.000
<v Speaker 2>the gradients, and in which directions these changes point.

155
00:07:36.240 --> 00:07:38.279
<v Speaker 1>So it's capturing edge information?

156
00:07:38.079 --> 00:07:41.040
<v Speaker 2>Sort of, yeah, edge directions. It breaks the image into

157
00:07:41.040 --> 00:07:45.360
<v Speaker 2>small cells, calculates histograms of these gradient directions within each cell,

158
00:07:45.639 --> 00:07:48.759
<v Speaker 2>and then groups cells into blocks to normalize them. Things

159
00:07:48.839 --> 00:07:52.079
<v Speaker 2>like the number of orientations, pixels per cell, cells per block

160
00:07:52.120 --> 00:07:55.120
<v Speaker 2>are parameters you set. It's good at describing shape even

161
00:07:55.120 --> 00:07:56.040
<v Speaker 2>if lighting changes.
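
NOTE Editor's sketch, not from the guide: a heavily simplified HOG
step for a single cell. It builds a magnitude-weighted histogram of
unsigned gradient directions and L2-normalises it. Real HOG normalises
over blocks of several cells; normalising one cell on its own is an
assumption made here to keep the example short.
```python
import math
def cell_histogram(cell, bins=9):
    """Histogram of gradient directions (0..180 degrees) for one cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist
def l2_normalise(hist, eps=1e-6):
    """Scale the histogram to unit length so overall brightness cancels out."""
    norm = math.sqrt(sum(v * v for v in hist) + eps ** 2)
    return [v / norm for v in hist]
# Brightness rises left to right, so every gradient points along 0 degrees.
cell = [[10 * x for x in range(6)] for _ in range(6)]
hog_cell = l2_normalise(cell_histogram(cell))
# The same cell under doubled lighting gives (almost) the same descriptor.
brighter_cell = [[2 * v for v in row] for row in cell]
hog_brighter = l2_normalise(cell_histogram(brighter_cell))
```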

162
00:07:56.199 --> 00:08:00.439
<v Speaker 1>Robust. Okay, and LBP, local binary patterns?

163
00:08:00.439 --> 00:08:03.240
<v Speaker 2>LBP is great for finer texture details. It works by

164
00:08:03.279 --> 00:08:06.319
<v Speaker 2>comparing each pixel to its neighbors. If a neighbor is brighter,

165
00:08:06.360 --> 00:08:09.160
<v Speaker 2>you write down a one, if darker, a zero. This

166
00:08:09.199 --> 00:08:11.879
<v Speaker 2>creates a binary number for each pixel's neighborhood.

167
00:08:11.560 --> 00:08:13.680
<v Speaker 1>A unique code for the local texture.

168
00:08:13.839 --> 00:08:16.360
<v Speaker 2>Pretty much. Yeah, and there are enhanced versions that can

169
00:08:16.360 --> 00:08:20.199
<v Speaker 2>look at different sized neighborhoods or are rotation invariant, meaning

170
00:08:20.240 --> 00:08:22.680
<v Speaker 2>the texture feature doesn't change if the image is rotated.
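
NOTE Editor's sketch, not from the guide: the basic 8-neighbour local
binary pattern for one pixel. Two conventions here are assumptions: a
neighbour that ties with the centre counts as 1, and bits are read
clockwise from the top-left corner. The rotation-invariant variants
mentioned above are not shown.
```python
# Neighbour offsets (dy, dx), clockwise from the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
def lbp_code(image, y, x):
    """8-bit local binary pattern for the interior pixel at (y, x)."""
    centre = image[y][x]
    code = 0
    for dy, dx in OFFSETS:
        bit = 1 if image[y + dy][x + dx] >= centre else 0
        code = (code << 1) | bit
    return code
patch = [[90, 80, 20], [70, 50, 10], [60, 55, 30]]
code = lbp_code(patch, 1, 1)
```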

171
00:08:22.920 --> 00:08:25.959
<v Speaker 1>So many ways to describe an image numerically. But having

172
00:08:26.000 --> 00:08:28.720
<v Speaker 1>all these features isn't the end goal. The computer has

173
00:08:28.759 --> 00:08:31.040
<v Speaker 1>to learn from them, right. How do we prep for that?

174
00:08:31.480 --> 00:08:33.840
<v Speaker 2>Right. So, you might have extracted tons of features, maybe

175
00:08:33.840 --> 00:08:37.759
<v Speaker 2>too many. That's where feature selection comes in. You use methods,

176
00:08:38.080 --> 00:08:42.120
<v Speaker 2>filter, wrapper, or embedded techniques, to pick out the most impactful

177
00:08:42.120 --> 00:08:44.600
<v Speaker 2>features for your specific task. Get rid of the noise.

178
00:08:44.559 --> 00:08:46.799
<v Speaker 1>Focus on what matters.

179
00:08:47.240 --> 00:08:50.360
<v Speaker 2>Exactly. Then you move to model training. You take your selected

180
00:08:50.360 --> 00:08:53.639
<v Speaker 2>feature set, your training data, and feed it to a

181
00:08:53.679 --> 00:08:57.559
<v Speaker 2>machine learning algorithm. The algorithm learns the patterns in those

182
00:08:57.559 --> 00:08:59.480
<v Speaker 2>features and creates a model.

183
00:08:59.720 --> 00:09:02.320
<v Speaker 1>And this is where supervised learning comes in again using

184
00:09:02.440 --> 00:09:03.440
<v Speaker 1>labeled data.

185
00:09:03.559 --> 00:09:07.000
<v Speaker 2>Yes. For the kinds of computer vision tasks we're focusing on,

186
00:09:07.080 --> 00:09:11.080
<v Speaker 2>like classification or detection, we typically use supervised learning. We

187
00:09:11.120 --> 00:09:14.279
<v Speaker 2>show the algorithm examples, images with features, and tell it

188
00:09:14.320 --> 00:09:16.559
<v Speaker 2>the correct answer, the label, like this is a cat,

189
00:09:16.639 --> 00:09:20.360
<v Speaker 2>this is a dog. And unsupervised learning? That's about finding

190
00:09:20.399 --> 00:09:23.759
<v Speaker 2>patterns in data without labels. Sometimes you might use it first,

191
00:09:23.799 --> 00:09:26.919
<v Speaker 2>maybe to help group images or even automatically generate potential

192
00:09:26.960 --> 00:09:30.799
<v Speaker 2>labels that you then refine for supervised learning. But supervised

193
00:09:30.879 --> 00:09:33.480
<v Speaker 2>is key for building these predictive vision models.

194
00:09:33.759 --> 00:09:37.360
<v Speaker 1>Okay, let's get into the real brains behind this. Deep

195
00:09:37.440 --> 00:09:41.360
<v Speaker 1>learning and artificial neural networks, ANNs. We always

196
00:09:41.360 --> 00:09:44.080
<v Speaker 1>hear they're inspired by the human brain. How close is

197
00:09:44.120 --> 00:09:45.639
<v Speaker 1>that analogy, really?

198
00:09:45.679 --> 00:09:49.360
<v Speaker 2>It's a useful starting point. Think of a single artificial neuron as

199
00:09:49.399 --> 00:09:53.840
<v Speaker 2>a highly simplified model of a biological one. It receives inputs,

200
00:09:54.200 --> 00:09:57.960
<v Speaker 2>multiplies them by certain weights which represent the connection strength,

201
00:09:57.960 --> 00:10:00.279
<v Speaker 2>sums them up, and then applies a function to

202
00:10:00.320 --> 00:10:01.120
<v Speaker 2>produce an output.

203
00:10:01.200 --> 00:10:03.000
<v Speaker 1>The simplest version is the perceptron.

204
00:10:03.320 --> 00:10:07.919
<v Speaker 2>Right. A single perceptron can model basic linear relationships like

205
00:10:08.000 --> 00:10:10.840
<v Speaker 2>drawing a straight line to separate two groups of data points.
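
NOTE Editor's sketch, not from the guide: a single perceptron with a
step activation. The weights and bias are hand-picked for illustration
(nothing is trained here); they place the decision line at
x + y = 1.5, which separates the point (1, 1) from the other three.
```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, then a step function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0
# Hypothetical hand-picked parameters: decision line x + y = 1.5.
weights, bias = [1.0, 1.0], -1.5
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(p, weights, bias) for p in points]
```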

206
00:10:11.080 --> 00:10:13.200
<v Speaker 1>But the real world isn't usually that simple, is it.

207
00:10:13.279 --> 00:10:15.799
<v Speaker 1>Things are messy, nonlinear.

208
00:10:15.399 --> 00:10:18.799
<v Speaker 2>Exactly, and that's why we need deep learning, which typically

209
00:10:18.960 --> 00:10:23.679
<v Speaker 2>uses multilayer perceptrons or MLPs. By stacking layers of these neurons,

210
00:10:23.679 --> 00:10:27.759
<v Speaker 2>the network can learn incredibly complex nonlinear patterns. That's absolutely

211
00:10:27.879 --> 00:10:30.840
<v Speaker 2>essential for tackling real world computer vision problems.

212
00:10:30.919 --> 00:10:33.240
<v Speaker 1>So what does the structure, the anatomy, of one of

213
00:10:33.240 --> 00:10:34.799
<v Speaker 1>these deep learning models look like?

214
00:10:34.440 --> 00:10:37.720
<v Speaker 2>Well, you've got an input layer where the

215
00:10:37.879 --> 00:10:41.120
<v Speaker 2>data, like our image feature vector, comes in. Then you

216
00:10:41.159 --> 00:10:43.039
<v Speaker 2>have one or more hidden layers. This is where the

217
00:10:43.080 --> 00:10:47.360
<v Speaker 2>real heavy lifting and the learning happens. The network figures

218
00:10:47.399 --> 00:10:50.720
<v Speaker 2>out intermediate representations here, and finally an output layer that

219
00:10:50.720 --> 00:10:53.360
<v Speaker 2>gives you the final result. Maybe it's a probability for

220
00:10:53.399 --> 00:10:56.200
<v Speaker 2>each class, like eighty percent chance it's a cat, twenty

221
00:10:56.240 --> 00:11:00.240
<v Speaker 2>percent dog. The network learns by adjusting the weights on

222
00:11:00.320 --> 00:11:03.399
<v Speaker 2>all the connections between neurons in these layers. There are

223
00:11:03.399 --> 00:11:06.559
<v Speaker 2>also bias nodes that add another adjustable parameter.

224
00:11:06.919 --> 00:11:10.440
<v Speaker 1>Okay, weights determine connection strength, But how does an individual

225
00:11:10.480 --> 00:11:13.399
<v Speaker 1>neuron decide whether to fire or what value to pass on?

226
00:11:13.759 --> 00:11:15.200
<v Speaker 1>You mentioned activation functions.

227
00:11:15.279 --> 00:11:19.159
<v Speaker 2>Yes, activation functions are critical. They introduce the nonlinearity we need.

228
00:11:19.679 --> 00:11:23.000
<v Speaker 2>After a neuron sums its weighted inputs, the activation function

229
00:11:23.080 --> 00:11:25.960
<v Speaker 2>processes that sum to produce the neuron's final output.

230
00:11:26.039 --> 00:11:26.840
<v Speaker 1>What kinds are there?

231
00:11:27.039 --> 00:11:29.600
<v Speaker 2>There are several common ones. Sigmoid used to be popular, squashing

232
00:11:29.720 --> 00:11:33.639
<v Speaker 2>values between zero and one. ReLU, rectified linear unit, is

233
00:11:33.799 --> 00:11:37.320
<v Speaker 2>very widely used now. It's simple, computationally efficient, outputs the

234
00:11:37.360 --> 00:11:39.360
<v Speaker 2>input if positive and zero otherwise.

235
00:11:39.159 --> 00:11:41.759
<v Speaker 1>ReLU sounds almost too simple.

236
00:11:41.919 --> 00:11:45.120
<v Speaker 2>It works surprisingly well, and there are variants like leaky

237
00:11:45.200 --> 00:11:49.519
<v Speaker 2>ReLU, ELU, SELU that try to address some minor potential

238
00:11:49.519 --> 00:11:53.519
<v Speaker 2>issues with ReLU. And for the output layer in classification tasks,

239
00:11:53.799 --> 00:11:57.480
<v Speaker 2>softmax is key. Why softmax? Because it takes the raw

240
00:11:57.559 --> 00:12:00.919
<v Speaker 2>outputs for each class and turns them into probabilities that

241
00:12:01.000 --> 00:12:03.039
<v Speaker 2>all add up to one. So you get that nice

242
00:12:03.120 --> 00:12:06.440
<v Speaker 2>interpretable eighty percent cat, twenty percent dog output.
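
NOTE Editor's sketch, not from the guide: the activation functions
named above, in plain Python. The leaky ReLU slope of 0.01 is a
common default chosen here as an assumption, and the raw scores fed
to softmax are picked so the result lands on the 80/20 cat-versus-dog
split used in the conversation.
```python
import math
def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
def relu(x):
    """Pass positive inputs through unchanged; output zero otherwise."""
    return max(0.0, x)
def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of zero."""
    return x if x > 0 else slope * x
def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    peak = max(scores)
    # Subtracting the largest score keeps exp() from overflowing.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
# Raw scores for (cat, dog); log(4) versus 0 gives a 4-to-1 ratio.
probs = softmax([math.log(4.0), 0.0])
```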

243
00:12:06.600 --> 00:12:09.240
<v Speaker 1>Got it. So the network has its structure, its neurons,

244
00:12:09.240 --> 00:12:12.000
<v Speaker 1>its activation functions. How does it actually learn? How does

245
00:12:12.000 --> 00:12:13.720
<v Speaker 1>it get better? Is it trial and error?

246
00:12:13.759 --> 00:12:15.840
<v Speaker 2>It's a guided trial and error, you could say. The

247
00:12:15.879 --> 00:12:19.440
<v Speaker 2>process starts with feed forward. Your input data flows through

248
00:12:19.440 --> 00:12:22.799
<v Speaker 2>the network layer by layer, activating neurons until it produces

249
00:12:22.799 --> 00:12:23.919
<v Speaker 2>an output, a prediction.

250
00:12:24.159 --> 00:12:25.600
<v Speaker 1>Okay, the first guess.

251
00:12:26.120 --> 00:12:28.080
<v Speaker 2>Right. Then you need to measure how wrong that guess was.

252
00:12:28.159 --> 00:12:30.840
<v Speaker 2>That's where error functions or loss functions come in. They

253
00:12:30.919 --> 00:12:34.039
<v Speaker 2>calculate the difference between the network's prediction and the actual

254
00:12:34.039 --> 00:12:36.679
<v Speaker 2>correct answer, the ground truth. What kinds of loss functions?

255
00:12:36.919 --> 00:12:40.559
<v Speaker 2>Depends on the task. For regression, predicting a continuous value,

256
00:12:40.960 --> 00:12:46.240
<v Speaker 2>mean squared error, MSE, is common. For binary classification, cat versus

257
00:12:46.279 --> 00:12:50.919
<v Speaker 2>dog, binary cross entropy. For classifying among multiple classes, digits

258
00:12:51.000 --> 00:12:54.279
<v Speaker 2>zero to nine, categorical cross entropy is standard.
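
NOTE Editor's sketch, not from the guide: the three loss functions
just listed, in plain Python. It assumes predicted probabilities lie
strictly between 0 and 1 (no clipping is done), and the sample numbers
are invented.
```python
import math
def mean_squared_error(y_true, y_pred):
    """Average squared difference, for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
def binary_cross_entropy(y_true, p_pred):
    """y_true holds 0/1 labels; p_pred is the predicted probability of 1."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)
def categorical_cross_entropy(one_hot, probs):
    """One sample: minus the log of the probability given to the true class."""
    return -sum(math.log(p) for t, p in zip(one_hot, probs) if t == 1)
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
bce = binary_cross_entropy([1, 0], [0.8, 0.2])
cce = categorical_cross_entropy([0, 1, 0], [0.1, 0.7, 0.2])
```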

259
00:12:54.480 --> 00:12:57.440
<v Speaker 1>So you calculate the error. Then what? How does the

260
00:12:57.440 --> 00:12:59.679
<v Speaker 1>network use that error information?

261
00:13:00.639 --> 00:13:03.679
<v Speaker 2>That's the job of optimization algorithms. Their goal is to

262
00:13:03.720 --> 00:13:06.159
<v Speaker 2>adjust the network's weights in a way that minimizes the

263
00:13:06.200 --> 00:13:09.840
<v Speaker 2>loss function. The most fundamental one is gradient descent, or

264
00:13:09.879 --> 00:13:12.960
<v Speaker 2>more commonly, stochastic gradient descent, SGD.

265
00:13:13.120 --> 00:13:15.799
<v Speaker 1>Stochastic gradient descent. How does that work?

266
00:13:15.960 --> 00:13:18.120
<v Speaker 2>Instead of calculating the error over the entire data set

267
00:13:18.159 --> 00:13:21.200
<v Speaker 2>at once, which is slow, SGD uses small random

268
00:13:21.240 --> 00:13:24.559
<v Speaker 2>subsets called mini-batches. It calculates the error for a batch,

269
00:13:24.639 --> 00:13:26.559
<v Speaker 2>figures out which way to adjust the weights to reduce

270
00:13:26.600 --> 00:13:28.919
<v Speaker 2>that error, that's the gradient part, and takes a small

271
00:13:28.919 --> 00:13:29.919
<v Speaker 2>step in that direction.

272
00:13:29.840 --> 00:13:32.399
<v Speaker 1>And the size of that step is the learning rate?

273
00:13:32.600 --> 00:13:35.600
<v Speaker 2>Exactly. The learning rate is a crucial hyperparameter. Too big

274
00:13:36.240 --> 00:13:39.559
<v Speaker 2>and you might overshoot the minimum error, too small and

275
00:13:39.639 --> 00:13:43.960
<v Speaker 2>learning takes forever. SGD often includes momentum too, which helps

276
00:13:43.960 --> 00:13:47.279
<v Speaker 2>smooth out the updates and speed up convergence, especially if

277
00:13:47.320 --> 00:13:48.720
<v Speaker 2>the error landscape is uneven.
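
NOTE Editor's sketch, not from the guide: mini-batch SGD with momentum
fitting a single weight w so that y is approximately w * x. The
learning rate, momentum, batch size, epoch count and seed are
arbitrary assumptions picked so this toy problem converges; real
networks update many weights the same way.
```python
import random
def sgd_fit(xs, ys, learning_rate=0.05, momentum=0.9, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x by mini-batch SGD with momentum on mean squared error."""
    rng = random.Random(seed)
    w, velocity = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            # Momentum blends the previous step with the new gradient step.
            velocity = momentum * velocity - learning_rate * grad
            w += velocity
    return w
xs = [i / 10 for i in range(1, 21)]
ys = [2.0 * x for x in xs]
w = sgd_fit(xs, ys)
```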

278
00:13:48.879 --> 00:13:50.799
<v Speaker 1>This is making sense. Let's try to ground it. The

279
00:13:50.840 --> 00:13:55.919
<v Speaker 1>classic example: classifying handwritten digits, zero through nine. How would

280
00:13:55.960 --> 00:13:57.600
<v Speaker 1>you actually build a model for that?

281
00:13:57.879 --> 00:14:00.600
<v Speaker 2>Yeah, that's the MNIST data set, the hello world

282
00:14:00.679 --> 00:14:04.159
<v Speaker 2>of deep learning. It makes it really concrete. Using a

283
00:14:04.200 --> 00:14:06.559
<v Speaker 2>library like Keras, which is often used with TensorFlow, makes

284
00:14:06.559 --> 00:14:09.080
<v Speaker 2>it much simpler. How so? Keras gives you building blocks.

285
00:14:09.320 --> 00:14:11.960
<v Speaker 2>You define your model layer by layer, maybe an input

286
00:14:12.000 --> 00:14:14.720
<v Speaker 2>layer matching the image size, a couple hidden layers with

287
00:14:14.919 --> 00:14:18.879
<v Speaker 2>ReLU activations, and an output layer. Then you compile the model,

288
00:14:18.960 --> 00:14:21.960
<v Speaker 2>telling it which optimizer, like SGD, and loss function, like

289
00:14:22.000 --> 00:14:23.559
<v Speaker 2>categorical cross entropy, to use.

290
00:14:23.279 --> 00:14:24.399
<v Speaker 1>And then you train it?

291
00:14:25.000 --> 00:14:28.200
<v Speaker 2>You call model dot fit, feeding it the training images

292
00:14:28.360 --> 00:14:31.840
<v Speaker 2>and their labels, the actual digits. It iterates through the data,

293
00:14:31.919 --> 00:14:35.960
<v Speaker 2>adjusting weights. After training, you can use model dot evaluate

294
00:14:36.240 --> 00:14:39.159
<v Speaker 2>on data it hasn't seen before to check performance, and

295
00:14:39.240 --> 00:14:42.639
<v Speaker 2>model dot predict to classify new unseen digits.
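
NOTE Editor's sketch, not from the guide: the define, compile, fit,
evaluate, predict workflow just described, using Keras from
TensorFlow. The hidden layer sizes (128 and 64), the epoch count and
the batch size are assumptions, and the caller is assumed to supply
images flattened to 784 values with one-hot labels. TensorFlow is
imported inside the function so the sketch loads without it installed.
```python
def build_mnist_mlp():
    """Build and compile a small MLP for 28x28 digits flattened to 784 values."""
    from tensorflow import keras
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        # Ten output neurons, one per digit, with softmax probabilities.
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
def train_and_score(model, x_train, y_train, x_test, y_test):
    """x_* have shape (n, 784); y_* are one-hot arrays of shape (n, 10)."""
    model.fit(x_train, y_train, epochs=5, batch_size=32)
    loss, accuracy = model.evaluate(x_test, y_test)
    # The highest probability wins: argmax picks the predicted digit.
    predicted_digits = model.predict(x_test).argmax(axis=1)
    return loss, accuracy, predicted_digits
```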

296
00:14:42.799 --> 00:14:46.559
<v Speaker 1>And that output layer for digits zero to nine, it would

297
00:14:46.600 --> 00:14:48.759
<v Speaker 1>have ten neurons, right? One for each digit.

298
00:14:48.559 --> 00:14:51.960
<v Speaker 2>Exactly. Ten neurons, usually with a softmax activation, so each

299
00:14:52.000 --> 00:14:54.840
<v Speaker 2>one outputs the probability that the input image is that

300
00:14:54.879 --> 00:14:57.240
<v Speaker 2>specific digit. The highest probability wins.

301
00:14:57.279 --> 00:14:59.159
<v Speaker 1>Okay, so you've trained it, but how do you know

302
00:14:59.200 --> 00:15:01.559
<v Speaker 1>if it's actually any good? How do you evaluate it properly?

303
00:15:01.720 --> 00:15:03.879
<v Speaker 2>That's super important. You need to watch out for two

304
00:15:03.919 --> 00:15:06.159
<v Speaker 2>main problems, overfitting and underfitting.

305
00:15:06.360 --> 00:15:09.559
<v Speaker 1>Overfitting is when it memorizes the training data too well.

306
00:15:09.639 --> 00:15:11.840
<v Speaker 2>Yeah, it gets great results on the data it trained on,

307
00:15:11.919 --> 00:15:15.360
<v Speaker 2>but fails badly on new, unseen data. It hasn't learned

308
00:15:15.399 --> 00:15:18.759
<v Speaker 2>the general patterns. Underfitting is the opposite. The model is

309
00:15:18.799 --> 00:15:21.200
<v Speaker 2>too simple. It hasn't even learned the training data well enough.

310
00:15:21.320 --> 00:15:23.759
<v Speaker 1>So how do you measure performance beyond just looking at

311
00:15:23.759 --> 00:15:24.240
<v Speaker 1>the loss.

312
00:15:24.679 --> 00:15:28.879
<v Speaker 2>We use specific evaluation metrics. Accuracy is the most basic,

313
00:15:29.279 --> 00:15:32.600
<v Speaker 2>what percentage did it get right overall? But often that's

314
00:15:32.639 --> 00:15:35.440
<v Speaker 2>not enough. We look at things like precision and recall.

315
00:15:35.759 --> 00:15:37.600
<v Speaker 1>Precision and recall. Remind me?

316
00:15:37.360 --> 00:15:40.200
<v Speaker 2>Precision asks: of all the times the model predicted,

317
00:15:40.320 --> 00:15:44.720
<v Speaker 2>say, digit seven, how many were actually sevens? Recall asks:

318
00:15:45.000 --> 00:15:47.600
<v Speaker 2>of all the actual sevens in the dataset, how

319
00:15:47.600 --> 00:15:50.519
<v Speaker 2>many did the model correctly identify?

320
00:15:50.559 --> 00:15:52.559
<v Speaker 1>Ah, okay. Different perspectives on correctness.

321
00:15:52.519 --> 00:15:54.919
<v Speaker 2>Right, and the F1 score combines precision and recall into a

322
00:15:54.960 --> 00:15:58.320
<v Speaker 2>single number, giving a balanced view. You might also look

323
00:15:58.360 --> 00:16:02.639
<v Speaker 2>at true positive rate negative rate. Depends on the specifics.
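
NOTE
Editor's sketch of precision, recall, and F1 for a single class, as described above.
The label lists are invented for illustration; the function name and its arguments are hypothetical.
```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # correct sevens
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # wrongly called seven
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # sevens that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
#
# Made-up ground truth and predictions for eight digit images.
y_true = [7, 7, 7, 1, 1, 2, 7, 9]
y_pred = [7, 1, 7, 7, 1, 2, 7, 9]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=7)
```
In this toy data the model predicts seven four times and is right three times, and it finds three of the four real sevens.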

324
00:16:02.120 --> 00:16:04.840
<v Speaker 1>And if the metrics aren't great, you tweak things?

325
00:16:04.960 --> 00:16:08.080
<v Speaker 2>Exactly. That's hyperparameter tuning. You adjust things like the learning rate,

326
00:16:08.080 --> 00:16:10.159
<v Speaker 2>the number of layers, and the number of neurons per layer.

327
00:16:10.200 --> 00:16:13.519
<v Speaker 2>Maybe try different optimizers or activation functions until you get

328
00:16:13.519 --> 00:16:15.440
<v Speaker 2>the best performance on your validation data.

329
00:16:15.519 --> 00:16:18.000
<v Speaker 1>And once you're happy, you can save the trained model.

330
00:16:18.200 --> 00:16:22.320
<v Speaker 2>Yep. You can save the model's architecture and its learned weights,

331
00:16:22.720 --> 00:16:25.559
<v Speaker 2>often into a single file, like an .h5

332
00:16:25.639 --> 00:16:28.679
<v Speaker 2>file in Keras/TensorFlow. Then you can load it back later

333
00:16:28.720 --> 00:16:33.000
<v Speaker 2>instantly, without retraining, to make predictions or even fine-tune

334
00:16:33.039 --> 00:16:34.480
<v Speaker 2>it further with more data.

335
00:16:34.879 --> 00:16:38.200
<v Speaker 1>So far, we've mostly talked about classification, saying this image

336
00:16:38.240 --> 00:16:41.320
<v Speaker 1>contains a cat, But what about finding where the cat is,

337
00:16:42.000 --> 00:16:44.559
<v Speaker 1>or finding multiple objects like a cat and a dog

338
00:16:44.639 --> 00:16:47.960
<v Speaker 1>in the same picture and drawing boxes around them. That's

339
00:16:48.039 --> 00:16:49.360
<v Speaker 1>object detection, isn't it?

340
00:16:49.559 --> 00:16:52.480
<v Speaker 2>That's exactly right. Object detection takes it a step further

341
00:16:52.519 --> 00:16:55.679
<v Speaker 2>than classification. It needs to both identify what objects are

342
00:16:55.679 --> 00:16:59.440
<v Speaker 2>present and localize them, usually by predicting bounding boxes around them.

343
00:16:59.440 --> 00:17:01.159
<v Speaker 1>And how do you measure how good those bounding

344
00:17:01.159 --> 00:17:01.759
<v Speaker 1>boxes are?

345
00:17:02.000 --> 00:17:05.480
<v Speaker 2>The standard metric is IOU or intersection over union. You

346
00:17:05.519 --> 00:17:08.559
<v Speaker 2>compare the predicted bounding box with the true ground truth box.

347
00:17:09.039 --> 00:17:12.640
<v Speaker 2>IOU measures the overlap area divided by the area of the union of the two boxes.

348
00:17:13.039 --> 00:17:14.799
<v Speaker 2>Higher IOU means a better prediction.
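
NOTE
Editor's sketch of intersection over union for two axis-aligned boxes.
It assumes boxes are given as (x_min, y_min, x_max, y_max); that layout and the sample coordinates are illustrative choices.
```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection  # count the overlap only once
    return intersection / union if union > 0 else 0.0
#
predicted = (50, 50, 150, 150)       # hypothetical predicted box
ground_truth = (100, 100, 200, 200)  # hypothetical ground truth box
score = iou(predicted, ground_truth)
```
A perfect prediction scores 1.0 and boxes that do not touch score 0.0; the two boxes above overlap in one quarter of each box, which gives one seventh.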

349
00:17:15.160 --> 00:17:18.480
<v Speaker 1>It feels like object detection has evolved incredibly fast. I

350
00:17:18.480 --> 00:17:20.880
<v Speaker 1>remember early models being quite slow.

351
00:17:21.119 --> 00:17:25.200
<v Speaker 2>Oh, definitely. Early approaches like R-CNN, the region-based convolutional neural

352
00:17:25.240 --> 00:17:29.680
<v Speaker 2>network, were groundbreaking but slow. They first proposed potential regions

353
00:17:29.720 --> 00:17:32.119
<v Speaker 2>in the image and then ran a classifier on each region.

354
00:17:31.960 --> 00:17:35.200
<v Speaker 1>So lots of repeated computation.

355
00:17:34.920 --> 00:17:38.319
<v Speaker 2>Exactly. Then came improvements like Fast R-CNN and Faster R-CNN,

356
00:17:38.440 --> 00:17:42.640
<v Speaker 2>which cleverly shared computations and introduced a region proposal network

357
00:17:42.680 --> 00:17:44.839
<v Speaker 2>to speed things up dramatically.

358
00:17:44.319 --> 00:17:45.839
<v Speaker 1>And Mask R-CNN?

359
00:17:46.079 --> 00:17:49.960
<v Speaker 2>Mask R-CNN was a really neat extension of Faster R-CNN.

360
00:17:50.119 --> 00:17:52.759
<v Speaker 2>Not only did it detect objects and draw boxes, but it

361
00:17:52.799 --> 00:17:56.319
<v Speaker 2>also predicted a pixel level mask for each object, essentially

362
00:17:56.359 --> 00:17:59.920
<v Speaker 2>outlining its exact shape. It could even estimate human poses.

363
00:18:00.119 --> 00:18:02.880
<v Speaker 1>But the real speed revolution came with single shot detectors

364
00:18:02.960 --> 00:18:04.319
<v Speaker 1>right? SSD and YOLO.

365
00:18:04.400 --> 00:18:08.119
<v Speaker 2>Absolutely. SSD, the Single Shot MultiBox Detector, and YOLO, You Only

366
00:18:08.119 --> 00:18:11.359
<v Speaker 2>Look Once, changed the game for real-time detection. Instead

367
00:18:11.359 --> 00:18:14.400
<v Speaker 2>of proposing regions first, they try to detect objects directly

368
00:18:14.440 --> 00:18:15.839
<v Speaker 2>in a single pass through the network.

369
00:18:15.880 --> 00:18:17.119
<v Speaker 1>How does SSD work, roughly?

370
00:18:17.359 --> 00:18:21.160
<v Speaker 2>SSD uses a set of predefined default boxes

371
00:18:21.599 --> 00:18:24.759
<v Speaker 2>of different sizes and aspect ratios at various locations in

372
00:18:24.799 --> 00:18:27.960
<v Speaker 2>the feature maps extracted by the network. It predicts offsets

373
00:18:28.000 --> 00:18:30.960
<v Speaker 2>to adjust these boxes and confidence scores for each object

374
00:18:31.000 --> 00:18:34.119
<v Speaker 2>class directly from these feature maps. It uses techniques like

375
00:18:34.240 --> 00:18:38.160
<v Speaker 2>data augmentation and non-maximum suppression to improve accuracy and efficiency.

376
00:18:38.359 --> 00:18:40.799
<v Speaker 1>And YOLO, You Only Look Once?

377
00:18:41.000 --> 00:18:44.279
<v Speaker 2>Great name. YOLO is famous for its speed. It divides

378
00:18:44.319 --> 00:18:46.920
<v Speaker 2>the input image into a grid. For each grid cell,

379
00:18:46.960 --> 00:18:49.960
<v Speaker 2>it predicts bounding boxes, confidence scores for those boxes (how

380
00:18:50.039 --> 00:18:53.200
<v Speaker 2>likely they contain an object), and class probabilities, all in

381
00:18:53.200 --> 00:18:53.599
<v Speaker 2>one go.

382
00:18:53.680 --> 00:18:55.599
<v Speaker 1>And it got faster and better with new versions.

383
00:18:55.640 --> 00:18:59.240
<v Speaker 2>Yeah, YOLOv2 used a network called Darknet-19, and

384
00:18:59.759 --> 00:19:03.359
<v Speaker 2>YOLOv3 used the deeper Darknet-53, improving accuracy

385
00:19:03.359 --> 00:19:07.039
<v Speaker 2>while maintaining impressive speed. These single shot detectors made real

386
00:19:07.079 --> 00:19:09.039
<v Speaker 2>time object detection on video feasible.

387
00:19:09.160 --> 00:19:12.359
<v Speaker 1>Okay, so detection finds objects in a single frame, but

388
00:19:12.400 --> 00:19:14.640
<v Speaker 1>what about video? How do you follow a specific object

389
00:19:14.680 --> 00:19:17.119
<v Speaker 1>from one frame to the next? That's object tracking, right?

390
00:19:17.559 --> 00:19:21.799
<v Speaker 2>Object tracking builds on detection. You detect objects in each frame,

391
00:19:22.200 --> 00:19:24.200
<v Speaker 2>but then you need a way to link detections of

392
00:19:24.240 --> 00:19:27.680
<v Speaker 2>the same object across frames, maintaining its unique identity.

393
00:19:27.880 --> 00:19:29.359
<v Speaker 1>How do you do that linkage? How do you know

394
00:19:29.400 --> 00:19:31.680
<v Speaker 1>the car detected now is the same car detected a

395
00:19:31.720 --> 00:19:32.240
<v Speaker 1>second ago?

396
00:19:32.400 --> 00:19:36.240
<v Speaker 2>There are various methods. One interesting technique involves image hashing

397
00:19:36.680 --> 00:19:38.000
<v Speaker 2>like difference hashing, or dHash.

398
00:19:38.119 --> 00:19:41.519
<v Speaker 1>Hashing? Like creating a fingerprint?

399
00:19:41.640 --> 00:19:45.200
<v Speaker 2>Exactly. dHash generates a compact fingerprint, or hash value, for an

400
00:19:45.200 --> 00:19:48.960
<v Speaker 2>image patch, like the detected object, based on differences between

401
00:19:48.960 --> 00:19:52.039
<v Speaker 2>adjacent pixels. It's very fast to compute.

402
00:19:52.160 --> 00:19:54.279
<v Speaker 1>So each detected object gets a hash. Then what?

403
00:19:54.960 --> 00:19:57.359
<v Speaker 2>Then you compare the dHash of a newly

404
00:19:57.400 --> 00:20:00.119
<v Speaker 2>detected object in the current frame with the dHashes of

405
00:20:00.160 --> 00:20:03.279
<v Speaker 2>objects tracked in the previous frame. The comparison is done

406
00:20:03.400 --> 00:20:04.400
<v Speaker 2>using Hamming distance.

407
00:20:04.599 --> 00:20:07.559
<v Speaker 1>Hamming distance? That just counts how many bits are different

408
00:20:07.599 --> 00:20:08.559
<v Speaker 1>between two hashes?

409
00:20:08.680 --> 00:20:11.960
<v Speaker 2>Precisely. A low Hamming distance between two dHashes means

410
00:20:12.000 --> 00:20:14.359
<v Speaker 2>the image patches are very similar. So if a new

411
00:20:14.400 --> 00:20:17.839
<v Speaker 2>detection's hash is very close to a previously tracked object's hash,

412
00:20:18.119 --> 00:20:20.359
<v Speaker 2>you can confidently say it's the same object and update

413
00:20:20.400 --> 00:20:20.880
<v Speaker 2>its track.
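
NOTE
Editor's sketch of difference hashing and Hamming distance, assuming each patch is already grayscale and shrunk to a tiny grid of brightness values.
A common setup resizes to 9 columns by 8 rows for a 64-bit hash; the 3 by 2 patches below are toy values, and the resizing step is not shown.
```python
def dhash(gray_rows):
    """Build a hash with one bit per adjacent pixel pair: 1 if the left pixel is brighter."""
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits
#
def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")
#
# Hypothetical patches: the same object in two frames, then a different object.
patch_previous = [[10, 20, 15], [30, 25, 40]]
patch_current = [[12, 22, 14], [31, 24, 39]]
patch_other = [[50, 10, 30], [5, 60, 20]]
same_distance = hamming_distance(dhash(patch_previous), dhash(patch_current))
other_distance = hamming_distance(dhash(patch_previous), dhash(patch_other))
```
Small brightness changes leave the hash untouched, while the unrelated patch differs in every bit. A tracker would pick a distance threshold by experiment.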

414
00:20:21.119 --> 00:20:23.519
<v Speaker 1>That's clever. A simple comparison tells you if it's the

415
00:20:23.559 --> 00:20:24.000
<v Speaker 1>same thing.

416
00:20:24.079 --> 00:20:27.319
<v Speaker 2>Yeah, it's efficient, and you can integrate this tracking logic

417
00:20:27.440 --> 00:20:30.880
<v Speaker 2>with, say, a web framework like Flask to visualize the

418
00:20:30.920 --> 00:20:33.240
<v Speaker 2>tracks on a live video stream in your browser.

419
00:20:33.799 --> 00:20:37.519
<v Speaker 1>Cool. Now, let's narrow down to a really specific but

420
00:20:37.839 --> 00:20:43.680
<v Speaker 1>huge application: face recognition. Is that just another object detection problem?

421
00:20:43.720 --> 00:20:46.480
<v Speaker 2>It starts like one. You first need to detect the face,

422
00:20:46.640 --> 00:20:49.920
<v Speaker 2>but then it goes further into identification. The core idea

423
00:20:49.960 --> 00:20:53.480
<v Speaker 2>is to create a unique numerical representation for each face,

424
00:20:53.759 --> 00:20:58.400
<v Speaker 2>often called a facial footprint, or more technically, a face embedding.

425
00:20:58.000 --> 00:21:00.559
<v Speaker 1>An embedding? Like the feature vectors we talked about?

426
00:21:00.400 --> 00:21:03.640
<v Speaker 2>A very similar concept, yes. It's a compact vector, typically one

427
00:21:03.720 --> 00:21:07.079
<v Speaker 2>twenty eight dimensional, derived from key facial features, maybe around

428
00:21:07.119 --> 00:21:09.519
<v Speaker 2>eighty nodal points, like the corners of the eyes, tip

429
00:21:09.559 --> 00:21:12.160
<v Speaker 2>of the nose, et cetera. This vector captures the unique

430
00:21:12.240 --> 00:21:13.799
<v Speaker 2>characteristics of that specific face.

431
00:21:13.839 --> 00:21:15.519
<v Speaker 1>And how are these embeddings generated?

432
00:21:15.880 --> 00:21:18.759
<v Speaker 2>Deep neural networks are key here, particularly models like FaceNet,

433
00:21:18.799 --> 00:21:21.480
<v Speaker 2>developed by Google. FaceNet is designed specifically

434
00:21:21.480 --> 00:21:24.359
<v Speaker 2>to take a face image and directly output this highly

435
00:21:24.400 --> 00:21:27.039
<v Speaker 2>discriminating one hundred and twenty eight dimensional embedding.

436
00:21:27.200 --> 00:21:30.680
<v Speaker 1>So FaceNet learns to create good embeddings?

437
00:21:31.279 --> 00:21:34.400
<v Speaker 2>Exactly. It's trained using a clever method involving a triplet loss function.

438
00:21:35.200 --> 00:21:38.039
<v Speaker 2>The network is shown three images at a time: an

439
00:21:38.079 --> 00:21:42.359
<v Speaker 2>anchor image (a person's face), a positive image (another picture

440
00:21:42.359 --> 00:21:44.880
<v Speaker 2>of the same person), and a negative image (a picture

441
00:21:44.920 --> 00:21:47.400
<v Speaker 2>of a different person). And the goal? The goal is

442
00:21:47.440 --> 00:21:50.039
<v Speaker 2>to learn embeddings such that the distance between the anchor

443
00:21:50.079 --> 00:21:53.440
<v Speaker 2>and positive embeddings is small, while the distance between the

444
00:21:53.480 --> 00:21:57.079
<v Speaker 2>anchor and negative embeddings is large. This forces the network

445
00:21:57.119 --> 00:21:59.759
<v Speaker 2>to create embeddings that cluster faces of the same person

446
00:21:59.799 --> 00:22:02.680
<v Speaker 2>together and push faces of different people far

447
00:22:02.759 --> 00:22:05.039
<v Speaker 2>apart in that one hundred and twenty eight dimensional space.
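
NOTE
Editor's sketch of a triplet loss on toy 3-dimensional embeddings, standing in for the 128-dimensional ones discussed above.
It uses a common formulation with squared Euclidean distance and a margin; the margin of 0.2 and all vectors are illustrative assumptions.
```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))
#
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by at least the margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
#
anchor = [0.1, 0.9, 0.2]         # a person's face
positive = [0.2, 0.8, 0.2]       # another picture of the same person
easy_negative = [0.9, 0.1, 0.7]  # a different person, already far away
hard_negative = [0.1, 0.8, 0.3]  # a different person, embedded too close
easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```
The easy triplet contributes no loss, while the hard one is penalized, which is what pushes different people apart during training.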

448
00:22:05.400 --> 00:22:08.039
<v Speaker 1>Fascinating. So once you have these embeddings, you can compare

449
00:22:08.119 --> 00:22:09.920
<v Speaker 1>them to recognize people.

450
00:22:10.039 --> 00:22:12.640
<v Speaker 2>Yep. For face verification (is this the same person?), you

451
00:22:12.720 --> 00:22:16.039
<v Speaker 2>just compare the embeddings of two faces. For recognition (who

452
00:22:16.079 --> 00:22:18.440
<v Speaker 2>is this person?), you compare the new face's embedding

453
00:22:18.480 --> 00:22:22.480
<v Speaker 2>against a database of known embeddings. Different FaceNet architectures,

454
00:22:22.640 --> 00:22:25.240
<v Speaker 2>often based on models like Inception, offer trade-offs

455
00:22:25.279 --> 00:22:29.279
<v Speaker 2>between computational cost, measured in FLOPS, floating-point operations

456
00:22:29.279 --> 00:22:31.200
<v Speaker 2>per second and the accuracy of the embeddings.
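
NOTE
Editor's sketch of verification and recognition by comparing embeddings, again with toy 3-dimensional vectors.
The names, the tiny database, and the distance threshold of 0.6 are all hypothetical; a real system would tune the threshold on validation data.
```python
import math
#
def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
#
def same_person(emb_a, emb_b, threshold=0.6):
    """Verification: are two embeddings close enough to be the same face?"""
    return distance(emb_a, emb_b) < threshold
#
def identify(query, known, threshold=0.6):
    """Recognition: return the closest known name, or None if nobody is close enough."""
    name, emb = min(known.items(), key=lambda item: distance(query, item[1]))
    return name if distance(query, emb) < threshold else None
#
known = {"alice": [0.1, 0.9, 0.2], "bob": [0.8, 0.1, 0.5]}
match = identify([0.15, 0.85, 0.25], known)
stranger = identify([5.0, 5.0, 5.0], known)
```
The first query lands next to the stored embedding for alice; the second is far from everyone and is rejected.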

457
00:22:31.279 --> 00:22:34.880
<v Speaker 1>Okay, this tech is clearly powerful, but let's talk about

458
00:22:34.920 --> 00:22:39.200
<v Speaker 1>the real world. Where is computer vision really making a difference?

459
00:22:39.240 --> 00:22:40.279
<v Speaker 1>Moving beyond the lab?

460
00:22:40.400 --> 00:22:45.000
<v Speaker 2>Oh, absolutely. One huge area is industrial manufacturing. Think about

461
00:22:45.079 --> 00:22:49.599
<v Speaker 2>quality control. Real time defect detection using computer vision is

462
00:22:49.640 --> 00:22:53.880
<v Speaker 2>replacing slow, inconsistent and often expensive manual inspection.

463
00:22:54.119 --> 00:22:55.799
<v Speaker 1>Can you give an example?

464
00:22:55.720 --> 00:22:59.440
<v Speaker 2>Sure. Consider steel production. There's a dataset called NEU-DET with

465
00:22:59.559 --> 00:23:02.960
<v Speaker 2>images of steel surfaces showing various defects, things like crazing,

466
00:23:03.039 --> 00:23:07.119
<v Speaker 2>inclusions, patches, pitted surfaces, rolled-in scale, and scratches.

467
00:23:06.759 --> 00:23:10.319
<v Speaker 1>Things a human might miss or classify inconsistently.

468
00:23:10.519 --> 00:23:13.960
<v Speaker 2>Exactly. A trained computer vision system can scan these surfaces continuously

469
00:23:14.039 --> 00:23:17.480
<v Speaker 2>and reliably identify these defects much faster and often more

470
00:23:17.480 --> 00:23:20.319
<v Speaker 2>accurately than a person could, especially over long shifts. It

471
00:23:20.400 --> 00:23:23.319
<v Speaker 2>leads to better quality control, less waste, and lower costs.

472
00:23:23.559 --> 00:23:26.000
<v Speaker 1>And building such a system requires good data, right? You

473
00:23:26.039 --> 00:23:28.839
<v Speaker 1>need labeled examples of these defects.

474
00:23:29.119 --> 00:23:33.079
<v Speaker 2>Absolutely crucial. You need tools for annotation. Microsoft's VoTT, the Visual Object

475
00:23:33.119 --> 00:23:35.640
<v Speaker 2>Tagging Tool, is a good example. It lets you draw bounding

476
00:23:35.640 --> 00:23:38.920
<v Speaker 2>boxes around defects in images and assign labels, creating the

477
00:23:38.960 --> 00:23:41.519
<v Speaker 2>ground truth data needed to train the detection models.

478
00:23:41.720 --> 00:23:45.119
<v Speaker 1>This sounds like it involves massive data sets, complex models.

479
00:23:45.519 --> 00:23:48.960
<v Speaker 1>Training must be a huge undertaking, probably not something you

480
00:23:49.000 --> 00:23:50.400
<v Speaker 1>do on your laptop.

481
00:23:50.079 --> 00:23:52.759
<v Speaker 2>Definitely not for state-of-the-art models. Training these

482
00:23:52.759 --> 00:23:57.759
<v Speaker 2>deep learning vision models requires enormous computational resources. We're talking

483
00:23:57.839 --> 00:24:02.119
<v Speaker 2>large data sets, often multiple high end GPUs working in parallel,

484
00:24:02.359 --> 00:24:05.400
<v Speaker 2>and training times that can stretch from hours to days,

485
00:24:05.519 --> 00:24:06.240
<v Speaker 2>even weeks.

486
00:24:06.480 --> 00:24:09.279
<v Speaker 1>So this is where cloud computing really comes into its own.

487
00:24:09.400 --> 00:24:13.759
<v Speaker 2>Precisely. Cloud platforms like Google Cloud Platform (GCP) or Microsoft

488
00:24:13.759 --> 00:24:17.000
<v Speaker 2>Azure provide the scalable infrastructure you need. You can rent

489
00:24:17.039 --> 00:24:20.799
<v Speaker 2>powerful virtual machines with multiple GPUs, access vast storage for

490
00:24:20.880 --> 00:24:24.000
<v Speaker 2>your data sets, and leverage specialized machine learning services.

491
00:24:24.039 --> 00:24:26.039
<v Speaker 1>And when you have all that power, you need ways

492
00:24:26.079 --> 00:24:29.119
<v Speaker 1>to use it efficiently, right, like training across multiple machines

493
00:24:29.200 --> 00:24:30.119
<v Speaker 1>or GPUs at once.

494
00:24:30.440 --> 00:24:34.240
<v Speaker 2>Yes, that's called distributed training. It's essential for handling these

495
00:24:34.359 --> 00:24:38.599
<v Speaker 2>large models and data sets in a reasonable timeframe. There

496
00:24:38.599 --> 00:24:42.039
<v Speaker 2>are a couple of main strategies. One is data parallelism.

497
00:24:42.640 --> 00:24:45.759
<v Speaker 2>You replicate the model on multiple GPUs or machines, but

498
00:24:45.960 --> 00:24:49.680
<v Speaker 2>you split the data batch among them. Each replica processes

499
00:24:49.720 --> 00:24:53.480
<v Speaker 2>its part of the batch, calculates updates (gradients), and then

500
00:24:53.640 --> 00:24:56.559
<v Speaker 2>these updates are somehow combined to update the main model.

501
00:24:56.599 --> 00:24:57.559
<v Speaker 1>How are they combined?

502
00:24:57.839 --> 00:25:00.799
<v Speaker 2>It can be synchronous where all workers wait and aggregate

503
00:25:00.839 --> 00:25:04.720
<v Speaker 2>gradients together at each step, ensuring consistency. Or it could

504
00:25:04.720 --> 00:25:08.839
<v Speaker 2>be asynchronous, where workers update a central model independently, which can

505
00:25:08.880 --> 00:25:11.960
<v Speaker 2>sometimes be faster but potentially less stable.
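
NOTE
Editor's sketch of synchronous data parallelism, simulated in plain Python with two pretend workers.
The one-weight model y = w * x, the data, and the learning rate are made up; real frameworks perform the averaging step across GPUs or machines.
```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
#
def synchronous_step(w, shards, learning_rate=0.01):
    """Every worker computes a gradient on its shard, then all apply the same averaged update."""
    grads = [gradient(w, shard) for shard in shards]  # one gradient per simulated worker
    avg_grad = sum(grads) / len(grads)  # the aggregation step: wait for all, then average
    return w - learning_rate * avg_grad
#
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # points on y = 2x
shards = [batch[:2], batch[2:]]  # split the batch between two workers
w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards)
```
With equal-sized shards the averaged gradient matches the full-batch gradient, so the replicas stay consistent and the weight converges toward 2.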

506
00:25:12.119 --> 00:25:14.759
<v Speaker 1>Okay, that's data parallelism. What's the other strategy?

507
00:25:14.839 --> 00:25:18.079
<v Speaker 2>Model parallelism. This is used when the model itself is

508
00:25:18.119 --> 00:25:21.000
<v Speaker 2>too massive to fit into the memory of a single GPU.

509
00:25:21.599 --> 00:25:24.720
<v Speaker 2>You actually split the model across different devices, with different

510
00:25:24.799 --> 00:25:29.079
<v Speaker 2>layers residing on different GPUs. Data flows between them during computation.

511
00:25:29.519 --> 00:25:33.599
<v Speaker 2>It's generally more complex to implement than data parallelism.

512
00:25:33.119 --> 00:25:36.640
<v Speaker 1>And frameworks like TensorFlow provide tools to manage this distribution.

513
00:25:37.039 --> 00:25:41.240
<v Speaker 2>They do. TensorFlow has built-in tf.distribute.Strategy

514
00:25:41.279 --> 00:25:45.720
<v Speaker 2>options. MirroredStrategy is common for using multiple GPUs

515
00:25:45.759 --> 00:25:50.799
<v Speaker 2>on one machine. It handles the data parallelism synchronously.

516
00:25:50.799 --> 00:25:54.640
<v Speaker 2>MultiWorkerMirroredStrategy extends this across multiple machines. For really

517
00:25:54.720 --> 00:25:58.240
<v Speaker 2>large scale, ParameterServerStrategy uses dedicated servers just to

518
00:25:58.319 --> 00:26:01.279
<v Speaker 2>hold and update the model's parameters, while worker nodes do

519
00:26:01.400 --> 00:26:02.000
<v Speaker 2>the computation.

520
00:26:02.240 --> 00:26:04.799
<v Speaker 1>So the tools are there to help manage this complexity.

521
00:26:04.880 --> 00:26:08.640
<v Speaker 2>Yes, and there are also libraries like Horovod, developed by Uber,

522
00:26:08.880 --> 00:26:12.000
<v Speaker 2>which are specifically designed to make distributed deep learning training

523
00:26:12.240 --> 00:26:16.440
<v Speaker 2>easier and more efficient, often integrating well with TensorFlow, PyTorch,

524
00:26:16.680 --> 00:26:17.680
<v Speaker 2>and cloud environments.

526
00:26:19.599 --> 00:26:22.640
<v Speaker 2>Wow. Okay, that was quite the journey. We've really taken

527
00:26:22.680 --> 00:26:26.319
<v Speaker 2>a deep dive today, haven't we? From understanding the absolute basics,

528
00:26:26.359 --> 00:26:28.519
<v Speaker 2>like what a pixel even is to a computer.

529
00:26:28.640 --> 00:26:32.039
<v Speaker 1>Yeah, through manipulating them, extracting meaningful features,

530
00:26:31.720 --> 00:26:34.400
<v Speaker 2>getting into the brains with neural networks and deep learning, how

531
00:26:34.400 --> 00:26:34.839
<v Speaker 2>they learn.

532
00:26:34.799 --> 00:26:37.799
<v Speaker 1>And then looking at these really advanced applications like detecting objects,

533
00:26:37.839 --> 00:26:42.000
<v Speaker 1>tracking them across video frames, even recognizing individual faces. It's

534
00:26:42.000 --> 00:26:44.000
<v Speaker 1>amazing how it builds up. And as you said, this

535
00:26:44.079 --> 00:26:45.160
<v Speaker 1>isn't just cool tech in a lab.

536
00:26:45.000 --> 00:26:47.799
<v Speaker 2>Not at all. Computer vision powered by

537
00:26:47.799 --> 00:26:53.880
<v Speaker 2>deep learning is genuinely transforming industries: manufacturing, security, healthcare, retail.

538
00:26:54.160 --> 00:26:56.519
<v Speaker 2>It's opening up totally new ways for us to interact

539
00:26:56.559 --> 00:26:58.359
<v Speaker 2>with machines and the world.

540
00:26:58.519 --> 00:27:01.119
<v Speaker 1>It really is. So as we wrap up, here's something

541
00:27:01.160 --> 00:27:05.359
<v Speaker 1>to think about. These AIs are getting more and more sophisticated.

542
00:27:05.559 --> 00:27:09.519
<v Speaker 1>They're not just seeing, they're starting to interpret, understand context,

543
00:27:09.640 --> 00:27:10.559
<v Speaker 1>maybe even predict.

544
00:27:11.000 --> 00:27:12.799
<v Speaker 2>The capabilities are growing exponentially.

545
00:27:12.920 --> 00:27:16.559
<v Speaker 1>So what happens next as these intelligent eyes become even

546
00:27:16.599 --> 00:27:19.720
<v Speaker 1>more pervasive? What new frontiers will they unlock? And maybe

547
00:27:19.720 --> 00:27:23.279
<v Speaker 1>more importantly, what new questions, perhaps ethical ones, will we

548
00:27:23.319 --> 00:27:27.039
<v Speaker 1>need to grapple with as AI's ability to see starts

549
00:27:27.039 --> 00:27:28.880
<v Speaker 1>to rival or even exceed our own?
