WEBVTT 1 00:00:00.080 --> 00:00:04.799 Imagine your smartphone unlocking just by glancing at your face, 2 00:00:05.679 --> 00:00:09.759 or maybe a robotic arm in a factory meticulously inspecting 3 00:00:09.800 --> 00:00:12.560 products, catching defects you or I might totally miss. 4 00:00:12.679 --> 00:00:15.880 Yeah, it really feels like artificial intelligence has somehow gained 5 00:00:15.919 --> 00:00:16.640 the gift of sight. 6 00:00:16.839 --> 00:00:19.760 It does, but how do computers actually do that? How 7 00:00:19.760 --> 00:00:23.920 do they get this incredible ability to see and interpret 8 00:00:24.079 --> 00:00:25.160 the visual world? 9 00:00:25.559 --> 00:00:29.399 Well, it's quite a journey really, from just raw data 10 00:00:28.760 --> 00:00:32.960 to pretty profound insights, and it's all built on this 11 00:00:33.039 --> 00:00:36.640 fascinating intersection of computer vision and artificial neural networks. 12 00:00:36.840 --> 00:00:38.439 It sounds like sci-fi, but it's happening now. 13 00:00:38.479 --> 00:00:41.200 It absolutely is. A rapidly evolving reality. 14 00:00:41.280 --> 00:00:43.560 And that's exactly what we're diving into today. We're drawing 15 00:00:43.600 --> 00:00:46.320 from a really comprehensive guide on building these kinds of 16 00:00:46.359 --> 00:00:50.520 powerful AI systems. Our mission, basically, is to pull back 17 00:00:50.520 --> 00:00:55.000 the curtain a bit, demystify how computers perceive, process, and 18 00:00:55.399 --> 00:00:58.320 ultimately make sense of images and video. 19 00:00:58.159 --> 00:01:01.640 We'll trace that whole path from the simplest element, the pixel, 20 00:01:01.960 --> 00:01:05.439 all the way up to these super complex AI architectures. 21 00:01:05.079 --> 00:01:11.560 Exactly. Things like object tracking, face recognition. It's a surprising journey.
22 00:01:11.480 --> 00:01:14.200 And you'll hopefully get a clear understanding of the mechanisms, 23 00:01:14.239 --> 00:01:17.760 the innovations behind it all. How these intelligent eyes actually work. 24 00:01:17.959 --> 00:01:19.920 Okay, so let's kick things off right at the beginning. 25 00:01:20.239 --> 00:01:23.280 How does a computer even see an image? We know 26 00:01:23.319 --> 00:01:26.079 they're digital, pixels and all that, but how does it 27 00:01:26.159 --> 00:01:26.959 interpret them? 28 00:01:27.159 --> 00:01:30.560 Right. So, at its core, a digital image is just 29 00:01:30.599 --> 00:01:34.000 a grid, a grid of pixels. For something simple like 30 00:01:34.040 --> 00:01:37.480 a grayscale image, each pixel is just one number, usually 31 00:01:37.560 --> 00:01:38.920 between zero and two fifty-five. 32 00:01:39.120 --> 00:01:41.599 Zero for black, two hundred and fifty-five for white. 33 00:01:41.760 --> 00:01:44.359 Exactly. Zero is black, two fifty-five is white, and 34 00:01:44.439 --> 00:01:46.359 everything in between is just a shade of gray. It's 35 00:01:46.400 --> 00:01:47.959 literally just a matrix of numbers. 36 00:01:48.079 --> 00:01:50.760 Okay, simple enough for black and white. But what about color? 37 00:01:50.799 --> 00:01:53.000 How do you get all the richness of color from numbers? 38 00:01:53.079 --> 00:01:56.280 Ah, that's where models like RGB come in. Red, green, blue. 39 00:01:56.319 --> 00:01:58.799 Instead of one number per pixel, you get three, a 40 00:01:58.799 --> 00:02:00.159 little bundle, a tuple, 41 00:01:59.840 --> 00:02:02.120 basically. So each pixel has a red value, a 42 00:02:02.120 --> 00:02:02.959 green value, and a 43 00:02:02.959 --> 00:02:06.000 blue value. Precisely. Each one also ranging from zero to 44 00:02:06.000 --> 00:02:09.879 two hundred and fifty-five. So, like, zero, zero, zero is no color, 45 00:02:09.919 --> 00:02:12.960 that's black.
Okay, so two fifty-five, zero, zero would be 46 00:02:13.280 --> 00:02:15.159 pure red. Pure red. So what do you think 47 00:02:15.360 --> 00:02:17.960 zero, zero, two fifty-five would be, or two fifty-five, 48 00:02:18.159 --> 00:02:19.080 two fifty-five, two fifty-five? 49 00:02:19.080 --> 00:02:21.840 Okay, following that logic, zero, zero, two fifty-five must 50 00:02:21.879 --> 00:02:24.080 be pure blue, and if all three are maxed out 51 00:02:24.080 --> 00:02:27.280 at two fifty-five, that's gotta be white, right? Combining all 52 00:02:27.159 --> 00:02:30.080 the light. You got it, pure white. It's actually quite elegant, 53 00:02:30.120 --> 00:02:32.800 isn't it? How these simple number combinations create this huge 54 00:02:32.919 --> 00:02:33.719 range of colors. 55 00:02:33.759 --> 00:02:34.360 It really is. 56 00:02:34.520 --> 00:02:37.599 And you know, once the computer can represent an image 57 00:02:38.000 --> 00:02:40.479 as these numbers, then it gets the power to manipulate 58 00:02:40.520 --> 00:02:42.159 them in loads of ways. 59 00:02:42.280 --> 00:02:43.759 Right. This is where we get into the sort of 60 00:02:44.000 --> 00:02:48.400 digital darkroom idea. Basic stuff like resizing, moving things 61 00:02:48.159 --> 00:02:54.759 around. Yep: resizing, translation, rotation, flipping, cropping, standard geometric things. 62 00:02:54.879 --> 00:02:57.479 And you mentioned resizing. It's true, some methods are better 63 00:02:57.520 --> 00:03:00.439 than others. Bicubic interpolation usually gives you a 64 00:03:00.479 --> 00:03:04.199 smoother, nicer-looking result compared to simpler ones like bilinear. 65 00:03:04.439 --> 00:03:07.800 Okay, that makes sense. But beyond just moving blocks of pixels, 66 00:03:07.879 --> 00:03:09.639 what about changing the pixels themselves? 67 00:03:09.639 --> 00:03:13.719 You mentioned arithmetic. Image arithmetic and bitwise operations: these are 68 00:03:13.719 --> 00:03:18.800 more pixel-level manipulations.
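To make the "image is just numbers" idea concrete, here is a minimal sketch in plain Python. The tiny images and the helper names `flip_horizontal` and `crop` are made up for illustration; a real project would more likely reach for an image library, which is an assumption about tooling the conversation does not spell out.

```python
# A tiny 2x3 grayscale image: one number per pixel, 0 = black, 255 = white.
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# The same idea in colour: each pixel is an (R, G, B) tuple, each channel 0-255.
BLACK = (0, 0, 0)
RED = (255, 0, 0)
BLUE = (0, 0, 255)
WHITE = (255, 255, 255)
rgb = [
    [BLACK, RED, BLUE],
    [WHITE, RED, BLACK],
]


def flip_horizontal(image):
    """Mirror an image left to right by reversing every row."""
    return [list(reversed(row)) for row in image]


def crop(image, top, left, height, width):
    """Keep the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


flipped = flip_horizontal(gray)   # geometric operation on the grayscale grid
window = crop(rgb, 0, 1, 2, 2)    # right-hand 2x2 block of the colour image
```

Flipping and cropping only move pixels around; none of the stored values change, which is what separates these geometric operations from pixel-level arithmetic.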
Think about adding a number to 69 00:03:18.919 --> 00:03:21.319 every pixel value, or subtracting 70 00:03:20.759 --> 00:03:24.000 one. So like brightening or darkening the whole image? Exactly. 71 00:03:24.039 --> 00:03:26.719 And if a calculation pushes a pixel value above two 72 00:03:26.759 --> 00:03:29.800 fifty-five or below zero, it usually just gets clipped, 73 00:03:29.840 --> 00:03:32.639 stuck at the max or min value. Stops things getting 74 00:03:32.360 --> 00:03:35.319 weird. Prevents crazy colors appearing out of nowhere. Right. 75 00:03:35.199 --> 00:03:39.479 And then you have bitwise operations: AND, OR, NOT, 76 00:03:39.680 --> 00:03:42.639 XOR. These are really powerful for things like masking. 77 00:03:42.800 --> 00:03:45.159 Masking, like cutting out a shape? 78 00:03:45.000 --> 00:03:47.479 Kind of. Imagine you have a black and white image, 79 00:03:47.520 --> 00:03:50.520 like a stencil. You can use a bitwise AND operation 80 00:03:50.639 --> 00:03:53.520 between that mask and your main image. It essentially keeps 81 00:03:53.520 --> 00:03:55.560 only the parts of the main image where the mask 82 00:03:55.680 --> 00:03:58.639 is white. It's like a digital cutout, very precise control. 83 00:03:58.800 --> 00:04:01.319 Ah, I see, so you can isolate specific parts of 84 00:04:01.319 --> 00:04:02.400 an image very cleanly. 85 00:04:02.520 --> 00:04:05.199 Yep. And we also use other operations for cleaning things 86 00:04:05.240 --> 00:04:07.319 up or highlighting details, like 87 00:04:07.240 --> 00:04:10.400 blurring to reduce noise or smooth things out. Exactly. 88 00:04:10.680 --> 00:04:13.800 Techniques like Gaussian blur and median blur are common for smoothing. 89 00:04:14.159 --> 00:04:15.680 And then on the flip side, if you want to 90 00:04:15.719 --> 00:04:18.560 find edges, the outlines of objects, 91 00:04:18.160 --> 00:04:21.519 you'd use edge detection filters like Sobel.
92 00:04:20.959 --> 00:04:24.680 Sobel, sure. Yeah, these filters are designed to spot sharp 93 00:04:24.800 --> 00:04:28.480 changes in pixel intensity, which usually happen at edges. It 94 00:04:28.480 --> 00:04:30.920 helps the computer see the skeleton of objects. 95 00:04:31.040 --> 00:04:33.920 And what about just simplifying things down to black and white? 96 00:04:34.040 --> 00:04:38.240 That's binarization. Things like adaptive thresholding or Otsu's method are 97 00:04:38.279 --> 00:04:41.360 clever ways to turn a grayscale image into just black 98 00:04:41.399 --> 00:04:44.560 and white pixels, which can be really useful for certain tasks. 99 00:04:45.040 --> 00:04:48.639 Okay, so we've got the basics: breaking images into pixels, 100 00:04:48.680 --> 00:04:52.560 manipulating them. But just seeing pixels isn't understanding, right? The 101 00:04:52.600 --> 00:04:56.079 computer needs to extract actual meaning. How does it learn 102 00:04:56.120 --> 00:04:58.920 to pick out the important stuff, the meaningful features? 103 00:04:59.319 --> 00:05:02.079 That is absolutely the core challenge, and it's addressed by 104 00:05:02.079 --> 00:05:05.480 the computer vision pipeline. It's a sequence. First, you ingest 105 00:05:05.480 --> 00:05:08.399 the image, get the data in, then you process it, 106 00:05:08.480 --> 00:05:11.040 maybe clean it up like we just discussed. Then comes 107 00:05:11.079 --> 00:05:13.160 the crucial step, feature extraction. 108 00:05:13.480 --> 00:05:15.040 Feature extraction, that's the key. 109 00:05:15.040 --> 00:05:17.920 That's where the magic starts, really. It's how the computer 110 00:05:18.040 --> 00:05:22.360 moves beyond just raw pixel values to identify characteristics that 111 00:05:22.399 --> 00:05:25.600 actually mean something, like the curve of an edge, a 112 00:05:25.639 --> 00:05:28.439 specific texture, the corner of an object.
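The pixel-level operations from a moment ago (brightening with clipping, masking with a bitwise AND, and binarization) can be sketched in a few lines of plain Python. Note the assumptions: the function names are invented for this example, the mask is assumed to hold only 0 or 255, and `binarize` uses a single fixed threshold, which is simpler than the adaptive and Otsu methods named in the conversation.

```python
def add_clipped(image, delta):
    """Add delta to every pixel, clipping results to the 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def apply_mask(image, mask):
    """Bitwise AND each pixel with its mask pixel (mask values: 0 or 255).

    255 is 0b11111111, so AND keeps the pixel; 0 wipes it to black.
    """
    return [
        [p & m for p, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


def binarize(image, threshold):
    """Fixed-threshold binarization: brighter than threshold -> 255, else 0."""
    return [[255 if p > threshold else 0 for p in row] for row in image]


image = [[10, 200], [250, 90]]
mask = [[255, 0], [0, 255]]

brighter = add_clipped(image, 60)    # 200 + 60 and 250 + 60 both stick at 255
darker = add_clipped(image, -100)    # 10 - 100 and 90 - 100 both stick at 0
cutout = apply_mask(image, mask)     # keeps only pixels where the mask is white
black_white = binarize(image, 127)
```

Clipping is a design choice rather than the only option: wrapping around modulo 256 would also keep values in range, but it turns a slightly-too-bright pixel into a nearly black one, which is the "weird" behavior the hosts mention.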
113 00:05:28.639 --> 00:05:31.959 So we're looking for features that are discriminating, things that 114 00:05:32.079 --> 00:05:34.759 help tell one object from another? Exactly. 115 00:05:34.800 --> 00:05:37.920 They need to be discriminating, identifiable across different images of 116 00:05:37.920 --> 00:05:41.160 the same object. And ideally you need lots of examples 117 00:05:41.160 --> 00:05:42.959 to establish those patterns reliably. 118 00:05:43.319 --> 00:05:46.000 And how does the computer store these features once it 119 00:05:46.040 --> 00:05:46.600 finds them? 120 00:05:47.040 --> 00:05:50.439 Typically, these extracted features are represented as a feature vector. 121 00:05:50.959 --> 00:05:53.720 It sounds fancy, but it's basically just a list of numbers, 122 00:05:53.839 --> 00:05:55.160 a one-dimensional array. 123 00:05:55.000 --> 00:05:57.600 Okay, a list of numbers representing the important bits of 124 00:05:57.600 --> 00:05:58.079 the image. 125 00:05:58.160 --> 00:06:01.839 Yeah. And here's the sort of aha moment. For 126 00:06:01.959 --> 00:06:05.079 a simple grayscale image, you could just string all the 127 00:06:05.079 --> 00:06:08.839 pixel values together into one massive vector. That is a 128 00:06:08.839 --> 00:06:10.040 feature vector, technically. 129 00:06:10.079 --> 00:06:12.759 Wow, okay, so you're boiling down the whole image into 130 00:06:12.800 --> 00:06:15.759 this single numerical signature. That makes it easier for a 131 00:06:15.800 --> 00:06:18.680 machine learning algorithm to chew on, I guess. Precisely. 132 00:06:19.240 --> 00:06:22.720 And what's really powerful about modern deep learning, especially convolutional 133 00:06:22.720 --> 00:06:25.439 neural networks, or CNNs, is that they can actually learn to 134 00:06:25.439 --> 00:06:28.759 extract these features automatically. The network figures out the best 135 00:06:28.759 --> 00:06:30.319 features itself during training.
136 00:06:30.399 --> 00:06:34.639 That's a huge advantage. Less manual work, potentially better features. 137 00:06:34.199 --> 00:06:38.319 Definitely. But even before deep learning, or alongside it, there 138 00:06:38.360 --> 00:06:42.720 are some really clever advanced feature extraction techniques. Like 139 00:06:42.680 --> 00:06:45.920 what? You mentioned histograms, GLCM, HOG. Right. 140 00:06:46.199 --> 00:06:48.560 Histograms are a good starting point, just counting how many 141 00:06:48.560 --> 00:06:51.920 pixels have certain intensity values, but you can do more, 142 00:06:52.360 --> 00:06:57.639 like histogram equalization, which spreads out the intensities to improve contrast. Makes 143 00:06:57.439 --> 00:07:00.160 details pop. Okay, and GLCM? 144 00:07:00.600 --> 00:07:05.160 GLCM stands for gray-level co-occurrence matrix. It's fantastic 145 00:07:05.199 --> 00:07:08.079 for analyzing texture. It looks at how often pairs of 146 00:07:08.079 --> 00:07:10.759 pixel values appear together in a certain spatial relationship. 147 00:07:11.000 --> 00:07:12.879 It tells you about the texture, like if it's smooth 148 00:07:13.000 --> 00:07:14.920 or rough or patterned. Exactly. 149 00:07:14.959 --> 00:07:19.360 It gives you statistics like contrast, correlation, energy, homogeneity, all 150 00:07:19.399 --> 00:07:20.360 describing the texture. 151 00:07:20.399 --> 00:07:25.480 Cool. And HOG, histograms of oriented gradients? Sounds complex. 152 00:07:25.360 --> 00:07:28.959 The idea is pretty neat, actually. HOG features focus on object 153 00:07:29.040 --> 00:07:33.240 shape and appearance. They look at how image brightness changes, 154 00:07:33.279 --> 00:07:36.000 the gradients, and in which directions these changes point. 155 00:07:36.240 --> 00:07:38.279 So it's capturing edge information. 156 00:07:38.079 --> 00:07:41.040 Sort of, yeah. Edge directions.
It breaks the image into 157 00:07:41.040 --> 00:07:45.360 small cells, calculates histograms of these gradient directions within each cell, 158 00:07:45.639 --> 00:07:48.759 and then groups cells into blocks to normalize them. Things 159 00:07:48.839 --> 00:07:52.079 like the number of orientations, pixels per cell, cells per block 160 00:07:52.120 --> 00:07:55.120 are parameters you set. It's good at describing shape even 161 00:07:55.120 --> 00:07:56.040 if lighting changes. 162 00:07:56.199 --> 00:08:00.439 Robust. Okay, and LBP, local binary patterns? 163 00:08:00.439 --> 00:08:03.240 LBP is great for finer texture details. It works by 164 00:08:03.279 --> 00:08:06.319 comparing each pixel to its neighbors. If a neighbor is brighter, 165 00:08:06.360 --> 00:08:09.160 you write down a one, if darker, a zero. This 166 00:08:09.199 --> 00:08:11.879 creates a binary number for each pixel's neighborhood, 167 00:08:11.560 --> 00:08:13.680 a unique code for the local texture. 168 00:08:13.839 --> 00:08:16.360 Pretty much. Yeah, and there are enhanced versions that can 169 00:08:16.360 --> 00:08:20.199 look at different sized neighborhoods or are rotation invariant, meaning 170 00:08:20.240 --> 00:08:22.680 the texture feature doesn't change if the image is rotated. 171 00:08:22.920 --> 00:08:25.959 So many ways to describe an image numerically. But having 172 00:08:26.000 --> 00:08:28.720 all these features isn't the end goal. The computer has 173 00:08:28.759 --> 00:08:31.040 to learn from them, right? How do we prep for that? 174 00:08:31.480 --> 00:08:33.840 Right. So, you might have extracted tons of features, maybe 175 00:08:33.840 --> 00:08:37.759 too many. That's where feature selection comes in. You use methods, 176 00:08:38.080 --> 00:08:42.120 filter, wrapper, embedded techniques, to pick out the most impactful 177 00:08:42.120 --> 00:08:44.600 features for your specific task. Get rid of the 178 00:08:44.559 --> 00:08:46.799 noise. Focus on what matters. Exactly.
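The local binary pattern idea described a little earlier can be sketched as follows. This is a basic 8-neighbour version under a few stated assumptions: neighbours are read clockwise starting at the top-left, a neighbour that ties with the center counts as a one, and border pixels are skipped. The names `lbp_code` and `lbp_histogram` are illustrative, not taken from any library.

```python
# Neighbour offsets (row, col), clockwise starting at the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Return the 8-bit LBP code of the interior pixel at row r, column c."""
    center = image[r][c]
    code = 0
    for dr, dc in OFFSETS:
        # 1 if the neighbour is at least as bright as the center, else 0.
        bit = 1 if image[r + dr][c + dc] >= center else 0
        code = (code << 1) | bit  # append the bit, first neighbour = top bit
    return code


def lbp_histogram(image):
    """Count how often each of the 256 codes occurs over interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
code = lbp_code(patch, 1, 1)   # bits 0,1,0,1,1,0,0,0 -> 0b01011000
hist = lbp_histogram(patch)    # a 256-number feature vector for the patch
```

The histogram, not the individual codes, is what would typically be handed to a classifier as the texture feature vector.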
179 00:08:47.240 --> 00:08:50.360 Then you move to model training. You take your selected 180 00:08:50.360 --> 00:08:53.639 feature set, your training data, and feed them to a 181 00:08:53.679 --> 00:08:57.559 machine learning algorithm. The algorithm learns the patterns in those 182 00:08:57.559 --> 00:08:59.480 features and creates a model. 183 00:08:59.720 --> 00:09:02.320 And this is where supervised learning comes in again, using 184 00:09:02.440 --> 00:09:03.440 labeled data. 185 00:09:03.559 --> 00:09:07.000 Yes. For the kinds of computer vision tasks we're focusing on, 186 00:09:07.080 --> 00:09:11.080 like classification or detection, we typically use supervised learning. We 187 00:09:11.120 --> 00:09:14.279 show the algorithm examples, images with features, and tell it 188 00:09:14.320 --> 00:09:16.559 the correct answer, the label, like this is a cat, 189 00:09:16.639 --> 00:09:20.360 this is a dog. And unsupervised learning? That's about finding 190 00:09:20.399 --> 00:09:23.759 patterns in data without labels. Sometimes you might use it first, 191 00:09:23.799 --> 00:09:26.919 maybe to help group images or even automatically generate potential 192 00:09:26.960 --> 00:09:30.799 labels that you then refine for supervised learning. But supervised 193 00:09:30.879 --> 00:09:33.480 is key for building these predictive vision models. 194 00:09:33.759 --> 00:09:37.360 Okay, let's get into the real brains behind this. Deep 195 00:09:37.440 --> 00:09:41.360 learning and artificial neural networks, ANNs. We always 196 00:09:41.360 --> 00:09:44.080 hear they're inspired by the human brain. How close is 197 00:09:44.120 --> 00:09:45.639 that analogy, really? It's a 198 00:09:45.679 --> 00:09:49.360 useful starting point. Think of a single artificial neuron as 199 00:09:49.399 --> 00:09:53.840 a highly simplified model of a biological one.
It receives inputs, 200 00:09:54.200 --> 00:09:57.960 multiplies them by certain weights, which represent the connection strength, 201 00:09:57.960 --> 00:10:00.279 sums them up, and then applies a function to 202 00:10:00.320 --> 00:10:01.120 produce an output. 203 00:10:01.200 --> 00:10:03.000 The simplest version is the perceptron. 204 00:10:03.320 --> 00:10:07.919 Right. A single perceptron can model basic linear relationships, like 205 00:10:08.000 --> 00:10:10.840 drawing a straight line to separate two groups of data points. 206 00:10:11.080 --> 00:10:13.200 But the real world isn't usually that simple, is it? 207 00:10:13.279 --> 00:10:15.799 Things are messy, nonlinear. 208 00:10:15.399 --> 00:10:18.799 Exactly, and that's why we need deep learning, which typically 209 00:10:18.960 --> 00:10:23.679 uses multilayer perceptrons, or MLPs. By stacking layers of these neurons, 210 00:10:23.679 --> 00:10:27.759 the network can learn incredibly complex nonlinear patterns. That's absolutely 211 00:10:27.879 --> 00:10:30.840 essential for tackling real world computer vision problems. 212 00:10:30.919 --> 00:10:33.240 So what does the structure, the anatomy of one of 213 00:10:33.240 --> 00:10:34.799 these deep learning models, 214 00:10:34.440 --> 00:10:37.720 look like? Well, you've got an input layer where the 215 00:10:37.879 --> 00:10:41.120 data, like our image feature vector, comes in. Then you 216 00:10:41.159 --> 00:10:43.039 have one or more hidden layers. This is where the 217 00:10:43.080 --> 00:10:47.360 real heavy lifting and the learning happens. The network figures 218 00:10:47.399 --> 00:10:50.720 out intermediate representations here. And finally an output layer that 219 00:10:50.720 --> 00:10:53.360 gives you the final result. Maybe it's a probability for 220 00:10:53.399 --> 00:10:56.200 each class, like eighty percent chance it's a cat, twenty 221 00:10:56.240 --> 00:11:00.240 percent dog.
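A single artificial neuron as just described (weighted inputs, a sum, then a function) fits in a few lines. This is only a sketch: the step activation, the extra `bias` offset, and the hand-picked weights are assumptions chosen so the neuron behaves like a logical AND, and nothing here is learned from data.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


# Hand-picked (not trained) parameters: the line x1 + x2 = 1.5 separates
# the point (1, 1) from the other three corners of the unit square.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

truth_table = {
    (a, b): perceptron([a, b], AND_WEIGHTS, AND_BIAS)
    for a in (0, 1)
    for b in (0, 1)
}
```

AND works because one straight line can separate its outputs. XOR has no such line, which is a classic illustration of why stacking neurons into layers is needed for nonlinear patterns.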
The network learns by adjusting the weights on 222 00:11:00.320 --> 00:11:03.399 all the connections between neurons in these layers. There are 223 00:11:03.399 --> 00:11:06.559 also bias nodes that add another adjustable parameter. 224 00:11:06.919 --> 00:11:10.440 Okay, weights determine connection strength. But how does an individual 225 00:11:10.480 --> 00:11:13.399 neuron decide whether to fire or what value to pass on? 226 00:11:13.759 --> 00:11:15.200 You mentioned activation functions. 227 00:11:15.279 --> 00:11:19.159 Yes, activation functions are critical. They introduce the nonlinearity we need. 228 00:11:19.679 --> 00:11:23.000 After a neuron sums its weighted inputs, the activation function 229 00:11:23.080 --> 00:11:25.960 processes that sum to produce the neuron's final output. 230 00:11:26.039 --> 00:11:26.840 What kinds are there? 231 00:11:27.039 --> 00:11:29.600 There are several common ones. Sigmoid used to be popular, squashing 232 00:11:29.720 --> 00:11:33.639 values between zero and one. ReLU, rectified linear unit, is 233 00:11:33.799 --> 00:11:37.320 very widely used now. It's simple, computationally efficient: it outputs the 234 00:11:37.360 --> 00:11:39.360 input if positive, and zero 235 00:11:39.159 --> 00:11:41.759 otherwise. ReLU sounds almost too simple. 236 00:11:41.919 --> 00:11:45.120 It works surprisingly well, and there are variants like leaky 237 00:11:45.200 --> 00:11:49.519 ReLU, ELU, SELU that try to address some minor potential 238 00:11:49.519 --> 00:11:53.519 issues with ReLU. And for the output layer, in classification tasks, 239 00:11:53.799 --> 00:11:57.480 softmax is key. Why softmax? Because it takes the raw 240 00:11:57.559 --> 00:12:00.919 outputs for each class and turns them into probabilities that 241 00:12:01.000 --> 00:12:03.039 all add up to one. So you get that nice 242 00:12:03.120 --> 00:12:06.440 interpretable eighty percent cat, twenty percent dog output. 243 00:12:06.600 --> 00:12:09.240 Got it.
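For reference, the activation functions named here are short enough to write out. This is a minimal sketch using only the standard library; the leaky ReLU slope of 0.01 is a common default rather than something stated in the conversation, and subtracting the maximum score inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive inputs through unchanged, output zero otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of zero."""
    return x if x > 0 else slope * x


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # keeps exp() from overflowing on large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores of log(4) and 0 for "cat" and "dog" differ by a factor of 4
# after exponentiation, so they come out as roughly 80% / 20%.
cat_dog = softmax([math.log(4), 0.0])
```

Softmax only cares about differences between scores: adding the same constant to every score leaves the probabilities unchanged, which is why the shift is safe.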
So the network has its structure, its neurons, 244 00:12:09.240 --> 00:12:12.000 its activation functions. How does it actually learn? How does 245 00:12:12.000 --> 00:12:13.720 it get better? Is it trial and error? 246 00:12:13.759 --> 00:12:15.840 It's a guided trial and error, you could say. The 247 00:12:15.879 --> 00:12:19.440 process starts with feed forward. Your input data flows through 248 00:12:19.440 --> 00:12:22.799 the network layer by layer, activating neurons until it produces 249 00:12:22.799 --> 00:12:23.919 an output, a prediction. 250 00:12:24.159 --> 00:12:25.600 Okay, the first guess. Right. 251 00:12:26.120 --> 00:12:28.080 Then you need to measure how wrong that guess was. 252 00:12:28.159 --> 00:12:30.840 That's where error functions, or loss functions, come in. They 253 00:12:30.919 --> 00:12:34.039 calculate the difference between the network's prediction and the actual 254 00:12:34.039 --> 00:12:36.679 correct answer, the ground truth. What kinds of loss functions? 255 00:12:36.919 --> 00:12:40.559 Depends on the task. For regression, predicting a continuous value, 256 00:12:40.960 --> 00:12:46.240 mean squared error, MSE, is common. For binary classification, cat 257 00:12:46.279 --> 00:12:50.919 versus dog, binary cross entropy. For classifying among multiple classes, digits 258 00:12:51.000 --> 00:12:54.279 zero to nine, categorical cross entropy is standard. 259 00:12:54.480 --> 00:12:57.440 So you calculate the error, then what? How does the 260 00:12:57.440 --> 00:12:59.679 network use that error information? 261 00:13:00.639 --> 00:13:03.679 That's the job of optimization algorithms. Their goal is to 262 00:13:03.720 --> 00:13:06.159 adjust the network's weights in a way that minimizes the 263 00:13:06.200 --> 00:13:09.840 loss function. The most fundamental one is gradient descent, or 264 00:13:09.879 --> 00:13:12.960 more commonly, stochastic gradient descent, SGD. 265 00:13:13.120 --> 00:13:15.799 Stochastic gradient descent. How does that work?
266 00:13:15.960 --> 00:13:18.120 Instead of calculating the error over the entire dataset 267 00:13:18.159 --> 00:13:21.200 at once, which is slow, SGD uses small random 268 00:13:21.240 --> 00:13:24.559 subsets called mini-batches. It calculates the error for a batch, 269 00:13:24.639 --> 00:13:26.559 figures out which way to adjust the weights to reduce 270 00:13:26.600 --> 00:13:28.919 that error, that's the gradient part, and takes a small 271 00:13:28.919 --> 00:13:29.919 step in that direction. And 272 00:13:29.840 --> 00:13:32.399 the size of that step is the learning rate? Exactly. 273 00:13:32.600 --> 00:13:35.600 The learning rate is a crucial hyperparameter. Too big 274 00:13:36.240 --> 00:13:39.559 and you might overshoot the minimum error, too small and 275 00:13:39.639 --> 00:13:43.960 learning takes forever. SGD often includes momentum too, which helps 276 00:13:43.960 --> 00:13:47.279 smooth out the updates and speed up convergence, especially if 277 00:13:47.320 --> 00:13:48.720 the error landscape is uneven. 278 00:13:48.879 --> 00:13:50.799 This is making sense. Let's try to ground it. The 279 00:13:50.840 --> 00:13:55.919 classic example: classifying handwritten digits, zero through nine. How would 280 00:13:55.960 --> 00:13:57.600 you actually build a model for that? 281 00:13:57.879 --> 00:14:00.600 Yeah, that's the MNIST dataset, the hello world 282 00:14:00.679 --> 00:14:04.159 of deep learning. It makes it really concrete. Using a 283 00:14:04.200 --> 00:14:06.559 library like Keras, which is often used with TensorFlow, makes 284 00:14:06.559 --> 00:14:09.080 it much simpler. How so? Keras gives you building blocks. 285 00:14:09.320 --> 00:14:11.960 You define your model layer by layer, maybe an input 286 00:14:12.000 --> 00:14:14.720 layer matching the image size, a couple of hidden layers with 287 00:14:14.919 --> 00:14:18.879 ReLU activations, and an output layer.
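The mini-batch update with a learning rate and momentum, as described a moment ago, can be sketched on a deliberately tiny problem. Everything specific here is an assumption made for illustration: the model is a straight line `w * x + b` rather than a neural network, the loss is mean squared error, the data are eight noise-free points on `y = 3x + 1`, and the hyperparameter values are just ones that behave well on this toy example.

```python
import random


def sgd_momentum_fit(xs, ys, lr=0.05, momentum=0.9, batch_size=4,
                     epochs=300, seed=0):
    """Fit y = w * x + b by mini-batch SGD with momentum on the MSE loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0        # parameters being learned
    vw, vb = 0.0, 0.0      # momentum "velocity" for each parameter
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]          # prediction error
                grad_w += 2 * err * xs[i] / len(batch)  # d(MSE)/dw
                grad_b += 2 * err / len(batch)          # d(MSE)/db
            # Momentum blends the previous step with the new gradient step.
            vw = momentum * vw - lr * grad_w
            vb = momentum * vb - lr * grad_b
            w += vw
            b += vb
    return w, b


xs = [i / 7 for i in range(8)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_momentum_fit(xs, ys)   # should land close to w = 3, b = 1
```

In a real network the gradients come from backpropagation through many layers instead of two hand-written derivative lines, but the update rule applied to each weight has this same shape.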
Then you compile the model, 288 00:14:18.960 --> 00:14:21.960 telling it which optimizer, like SGD, and loss function, like 289 00:14:22.000 --> 00:14:23.559 categorical cross entropy, 290 00:14:23.279 --> 00:14:24.399 to use. And then you train it? 291 00:14:25.000 --> 00:14:28.200 You call model dot fit, feeding it the training images 292 00:14:28.360 --> 00:14:31.840 and their labels, the actual digits. It iterates over the data, 293 00:14:31.919 --> 00:14:35.960 adjusting weights. After training, you can use model dot evaluate 294 00:14:36.240 --> 00:14:39.159 on data it hasn't seen before to check performance, and 295 00:14:39.240 --> 00:14:42.639 model dot predict to classify new unseen digits. 296 00:14:42.799 --> 00:14:46.559 And that output layer, for digits zero to nine, it would 297 00:14:46.600 --> 00:14:48.759 have ten neurons, right? One for each digit. 298 00:14:48.559 --> 00:14:51.960 Exactly. Ten neurons, usually with the softmax activation, so each 299 00:14:52.000 --> 00:14:54.840 one outputs the probability that the input image is that 300 00:14:54.879 --> 00:14:57.240 specific digit. The highest probability wins. 301 00:14:57.279 --> 00:14:59.159 Okay, so you've trained it, but how do you know 302 00:14:59.200 --> 00:15:01.559 if it's actually any good? How do you evaluate it properly? 303 00:15:01.720 --> 00:15:03.879 That's super important. You need to watch out for two 304 00:15:03.919 --> 00:15:06.159 main problems, overfitting and underfitting. 305 00:15:06.360 --> 00:15:09.559 Overfitting is when it memorizes the training data too well? 306 00:15:09.639 --> 00:15:11.840 Yeah, it gets great results on the data it trained on, 307 00:15:11.919 --> 00:15:15.360 but fails badly on new unseen data. It hasn't learned 308 00:15:15.399 --> 00:15:18.759 the general patterns. Underfitting is the opposite. The model is 309 00:15:18.799 --> 00:15:21.200 too simple. It hasn't even learned the training data well enough.
310 00:15:21.320 --> 00:15:23.759 So how do you measure performance beyond just looking at 311 00:15:23.759 --> 00:15:24.240 the loss? 312 00:15:24.679 --> 00:15:28.879 We use specific evaluation metrics. Accuracy is the most basic: 313 00:15:29.279 --> 00:15:32.600 what percentage did it get right overall? But often that's 314 00:15:32.639 --> 00:15:35.440 not enough. We look at things like precision and recall. 315 00:15:35.759 --> 00:15:37.600 Precision and recall, remind 316 00:15:37.360 --> 00:15:40.200 me? Precision asks: of all the times the model predicted, 317 00:15:40.320 --> 00:15:44.720 say, digit seven, how many were actually sevens? Recall asks: 318 00:15:45.000 --> 00:15:47.600 of all the actual sevens in the dataset, how 319 00:15:47.600 --> 00:15:50.519 many did the model correctly identify? Ah, okay. 320 00:15:50.559 --> 00:15:52.559 Different perspectives on correctness. Right, and 321 00:15:52.519 --> 00:15:54.919 the F1 score combines precision and recall into a 322 00:15:54.960 --> 00:15:58.320 single number, giving a balanced view. You might also look 323 00:15:58.360 --> 00:16:02.639 at true positive rate, negative rate. Depends on the specifics. 324 00:16:02.120 --> 00:16:04.840 And if the metrics aren't great, you tweak things? Exactly. 325 00:16:04.960 --> 00:16:08.080 That's hyperparameter tuning. You adjust things like the learning rate, 326 00:16:08.080 --> 00:16:10.159 the number of layers, and the number of neurons per layer. 327 00:16:10.200 --> 00:16:13.519 Maybe try different optimizers or activation functions until you get 328 00:16:13.519 --> 00:16:15.440 the best performance on your validation data. 329 00:16:15.519 --> 00:16:18.000 And once you're happy, you can save the trained model? 330 00:16:18.200 --> 00:16:22.320 Yep. You can save the model's architecture and its learned weights, 331 00:16:22.720 --> 00:16:25.559 often into a single file, like a dot H5 332 00:16:25.639 --> 00:16:28.679 file in Keras/TensorFlow.
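The accuracy, precision, recall, and F1 definitions given above translate directly into code. This is a small sketch with made-up labels; the function name and the choice to return 0.0 when a denominator is zero are assumptions for the example.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical digit labels: four real sevens, two ones, two threes.
y_true = [7, 7, 7, 7, 1, 1, 3, 3]
y_pred = [7, 7, 1, 1, 7, 1, 3, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=7)
```

Here the model predicted "seven" three times and was right twice (precision 2/3), but it only found two of the four real sevens (recall 1/2), which shows how the two numbers can disagree even when overall accuracy looks passable.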
Then you can load it back later, 333 00:16:28.720 --> 00:16:33.000 instantly, without retraining, to make predictions or even fine-tune 334 00:16:33.039 --> 00:16:34.480 it further with more data. 335 00:16:34.879 --> 00:16:38.200 So far, we've mostly talked about classification, saying this image 336 00:16:38.240 --> 00:16:41.320 contains a cat. But what about finding where the cat is, 337 00:16:42.000 --> 00:16:44.559 or finding multiple objects, like a cat and a dog 338 00:16:44.639 --> 00:16:47.960 in the same picture, and drawing boxes around them? That's 339 00:16:48.039 --> 00:16:49.360 object detection, isn't it? 340 00:16:49.559 --> 00:16:52.480 That's exactly right. Object detection takes it a step further 341 00:16:52.519 --> 00:16:55.679 than classification. It needs to both identify what objects are 342 00:16:55.679 --> 00:16:59.440 present and localize them, usually by predicting bounding boxes around them. 343 00:16:59.440 --> 00:17:01.159 And how do you measure how good those bounding 344 00:17:01.159 --> 00:17:01.759 boxes are? 345 00:17:02.000 --> 00:17:05.480 The standard metric is IoU, or intersection over union. You 346 00:17:05.519 --> 00:17:08.559 compare the predicted bounding box with the true ground truth box. 347 00:17:09.039 --> 00:17:12.640 IoU measures the overlap area divided by the total combined area. 348 00:17:13.039 --> 00:17:14.799 Higher IoU means a better prediction. 349 00:17:15.160 --> 00:17:18.480 It feels like object detection has evolved incredibly fast. I 350 00:17:18.480 --> 00:17:20.880 remember early models being quite slow. 351 00:17:21.119 --> 00:17:25.200 Oh, definitely. Early approaches like R-CNN, region-based convolutional neural 352 00:17:25.240 --> 00:17:29.680 network, were groundbreaking, but slow. They first proposed potential regions 353 00:17:29.720 --> 00:17:32.119 in the image and then ran a classifier on each 354 00:17:31.960 --> 00:17:35.200 region. So lots of repeated computation? Exactly.
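The intersection over union metric mentioned above is easy to compute for axis-aligned boxes. The sketch below assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates with `x1 < x2` and `y1 < y2`; that box format is a convention chosen for this example, and detection libraries differ on it.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes miss).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection  # overlap counted only once
    return intersection / union if union > 0 else 0.0


predicted = (5, 5, 15, 15)
ground_truth = (0, 0, 10, 10)
score = iou(predicted, ground_truth)   # 25 / 175, about 0.143
```

A perfect prediction scores 1.0 and boxes that do not touch score 0.0. Where to draw the line for a "correct" detection is a choice made per benchmark, with 0.5 being a commonly used threshold.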
355 00:17:34.920 --> 00:17:38.319 Then came improvements like Fast R-CNN and Faster R-CNN, 356 00:17:38.440 --> 00:17:42.640 which cleverly shared computations and introduced a region proposal network 357 00:17:42.680 --> 00:17:44.839 to speed things up dramatically. 358 00:17:44.319 --> 00:17:45.839 And Mask R-CNN? 359 00:17:46.079 --> 00:17:49.960 Mask R-CNN was a really neat extension of Faster R-CNN. 360 00:17:50.119 --> 00:17:52.759 Not only did it detect objects and draw boxes, but it 361 00:17:52.799 --> 00:17:56.319 also predicted a pixel-level mask for each object, essentially 362 00:17:56.359 --> 00:17:59.920 outlining its exact shape. You could even estimate human poses. 363 00:18:00.119 --> 00:18:02.880 But the real speed revolution came with single shot detectors, 364 00:18:02.960 --> 00:18:04.319 right? SSD and YOLO. 365 00:18:04.400 --> 00:18:08.119 Absolutely. SSD, Single Shot MultiBox Detector, and YOLO, You Only 366 00:18:08.119 --> 00:18:11.359 Look Once, changed the game for real-time detection. Instead 367 00:18:11.359 --> 00:18:14.400 of proposing regions first, they try to detect objects directly 368 00:18:14.440 --> 00:18:15.839 in a single pass through the network. 369 00:18:15.880 --> 00:18:17.119 How does SSD work? 370 00:18:17.359 --> 00:18:21.160 Roughly, SSD uses a set of predefined default boxes 371 00:18:21.599 --> 00:18:24.759 of different sizes and aspect ratios at various locations in 372 00:18:24.799 --> 00:18:27.960 the feature maps extracted by the network. It predicts offsets 373 00:18:28.000 --> 00:18:30.960 to adjust these boxes and confidence scores for each object 374 00:18:31.000 --> 00:18:34.119 class directly from these feature maps. It uses techniques like 375 00:18:34.240 --> 00:18:38.160 data augmentation and non-maximum suppression to improve accuracy and efficiency. 376 00:18:38.359 --> 00:18:40.799 And YOLO, You Only Look Once. 377 00:18:41.000 --> 00:18:44.279 Great name. YOLO is famous for its speed.
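Non-maximum suppression, mentioned above as one of the techniques SSD relies on, removes duplicate boxes that cover the same object. Below is a minimal greedy version for a single class. The box format `(x1, y1, x2, y2)`, the 0.5 overlap threshold, and the function names are assumptions for this sketch, not details of SSD itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)   # box 1 overlaps box 0 too much
```

The first two boxes overlap with an IoU of about 0.68, so only the more confident one survives, while the third box sits elsewhere in the image and is kept.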
It divides 378 00:18:44.319 --> 00:18:46.920 the input image into a grid. For each grid cell, 379 00:18:46.960 --> 00:18:49.960 it predicts bounding boxes, confidence scores for those boxes, how 380 00:18:50.039 --> 00:18:53.200 likely they contain an object, and class probabilities, all in 381 00:18:53.200 --> 00:18:53.599 one go. 382 00:18:53.680 --> 00:18:55.599 And it got faster and better with new versions. 383 00:18:55.640 --> 00:18:59.240 Yeah, YOLOv2 used a network called Darknet-19, and 384 00:18:59.759 --> 00:19:03.359 YOLOv3 used the deeper Darknet-53, improving accuracy 385 00:19:03.359 --> 00:19:07.039 while maintaining impressive speed. These single shot detectors made real-time 386 00:19:07.079 --> 00:19:09.039 object detection on video feasible. 387 00:19:09.160 --> 00:19:12.359 Okay, so detection finds objects in a single frame, but 388 00:19:12.400 --> 00:19:14.640 what about video? How do you follow a specific object 389 00:19:14.680 --> 00:19:17.119 from one frame to the next? That's object tracking, right? 390 00:19:17.559 --> 00:19:21.799 Object tracking builds on detection. You detect objects in each frame, 391 00:19:22.200 --> 00:19:24.200 but then you need a way to link detections of 392 00:19:24.240 --> 00:19:27.680 the same object across frames, maintaining its unique identity. 393 00:19:27.880 --> 00:19:29.359 How do you do that linkage? How do you know 394 00:19:29.400 --> 00:19:31.680 the car detected now is the same car detected a 395 00:19:31.720 --> 00:19:32.240 second ago? 396 00:19:32.400 --> 00:19:36.240 There are various methods. One interesting technique involves image hashing, 397 00:19:36.680 --> 00:19:38.000 like difference hashing, or dHash. 398 00:19:38.119 --> 00:19:41.519 Hashing, like creating a fingerprint? Exactly.
399 00:19:41.640 --> 00:19:45.200 dHash generates a compact fingerprint, or hash value, for an 400 00:19:45.200 --> 00:19:48.960 image patch, like the detected object, based on differences between 401 00:19:48.960 --> 00:19:52.039