WEBVTT 1 00:00:00.080 --> 00:00:02.960 The world of AI and machine learning is just exploding, 2 00:00:03.040 --> 00:00:04.879 isn't it? And if you're a coder looking in, you 3 00:00:04.919 --> 00:00:06.879 might be thinking, okay, how do I actually get started 4 00:00:06.919 --> 00:00:09.240 here without needing a PhD? 5 00:00:09.560 --> 00:00:12.640 That's a really common question, and honestly, the tools available 6 00:00:12.679 --> 00:00:16.440 now have made it much more accessible for developers to jump 7 00:00:16.199 --> 00:00:19.199 in. Absolutely, and that's really our mission for this deep dive. 8 00:00:19.280 --> 00:00:23.440 We're digging into key parts of Laurence Moroney's book AI 9 00:00:23.600 --> 00:00:26.440 and Machine Learning for Coders. We want to pull out 10 00:00:26.480 --> 00:00:27.760 the most important bits for you. 11 00:00:28.039 --> 00:00:30.600 Think of it as a practical starting point. We're aiming 12 00:00:30.600 --> 00:00:33.479 to give you the essentials on deep learning, computer vision, 13 00:00:33.640 --> 00:00:38.119 and NLP, basically focusing on how you can use TensorFlow 14 00:00:38.159 --> 00:00:39.079 to tackle these things. 15 00:00:39.280 --> 00:00:42.079 Yeah, the perspective here is crucial. It's tailored for you, 16 00:00:42.560 --> 00:00:44.799 the coder. It's about the tools you can grab now 17 00:00:44.840 --> 00:00:47.399 and the problems you can start solving. 18 00:00:47.439 --> 00:00:50.079 Right, just like the book intends: equipping you to become 19 00:00:50.119 --> 00:00:53.439 an ML developer by focusing on actually doing it, not 20 00:00:53.479 --> 00:00:56.079 getting bogged down in just the theory or the super 21 00:00:56.119 --> 00:00:56.960 complex math. 22 00:00:57.280 --> 00:01:00.759 And people have really responded to that approachmon called it 23 00:01:00.799 --> 00:01:04.120 the much-needed practical starting point. 
Soufu mentioned how it 24 00:01:04.159 --> 00:01:07.120 teaches the key building blocks so you can code AI 25 00:01:07.239 --> 00:01:09.840 for PCs, mobile, the browser. 26 00:01:10.159 --> 00:01:14.079 Yeah, Laurence Moroney's vision was clearly about empowering developers, and, 27 00:01:15.000 --> 00:01:17.799 like Andrew Ng says in the foreword, great adventures await you. 28 00:01:17.879 --> 00:01:18.920 It's an exciting space. 29 00:01:19.200 --> 00:01:22.400 Okay, so let's unpack this. Where does machine learning really 30 00:01:22.439 --> 00:01:27.400 differ from, say, traditional programming, and what's the main platform 31 00:01:26.959 --> 00:01:27.879 we'll be talking about? 32 00:01:28.000 --> 00:01:30.760 Well, the fundamental difference is a kind of flip in thinking. 33 00:01:31.159 --> 00:01:35.239 In traditional programming, you write explicit rules, the rules act on data, 34 00:01:35.439 --> 00:01:36.400 and you get answers. 35 00:01:36.799 --> 00:01:40.359 Like coding a game like Breakout. You specifically write the logic: 36 00:01:40.920 --> 00:01:45.000 how the ball moves, what happens when it hits a brick, scoring, paddle misses. 37 00:01:45.200 --> 00:01:47.319 It's all defined rules. Exactly. 38 00:01:47.599 --> 00:01:50.640 Or think about activity detection from a wearable. You might 39 00:01:50.640 --> 00:01:53.040 write rules like, okay, if speed is over x, it's running. 40 00:01:53.120 --> 00:01:56.000 If it's between y and x, it's walking. You define 41 00:01:56.000 --> 00:01:57.359 the logic based on the data. 42 00:01:57.400 --> 00:02:01.000 But that approach hits the ceiling pretty quickly, right? What 43 00:02:01.079 --> 00:02:03.480 if you wanted to detect something way more complex, like 44 00:02:03.719 --> 00:02:05.280 golfing? Precisely. 45 00:02:05.760 --> 00:02:08.159 How do you write rules for that? 
The mix of 46 00:02:08.240 --> 00:02:13.719 swinging, pausing, walking. It gets incredibly hard, maybe even impossible, 47 00:02:14.000 --> 00:02:18.599 to define robust rules that cover every single variation by hand. 48 00:02:18.840 --> 00:02:21.719 So the old "rules act on data to give answers" 49 00:02:21.759 --> 00:02:24.599 method breaks down when the rules are just too fuzzy 50 00:02:24.680 --> 00:02:27.840 or complex to write yourself. And that's where ML comes in. 51 00:02:28.000 --> 00:02:30.280 Right. With machine learning, you kind of flip it. You 52 00:02:30.360 --> 00:02:32.879 provide the data and you provide the answers. We call 53 00:02:32.919 --> 00:02:36.520 those labels. Then the machine learning algorithm figures out the 54 00:02:36.560 --> 00:02:38.960 rules or patterns connecting the data to those answers. 55 00:02:39.080 --> 00:02:39.639 Ah, okay. 56 00:02:39.680 --> 00:02:42.120 So for the activity example, you'd give it sensor data 57 00:02:42.120 --> 00:02:45.719 from someone walking, running, biking, and golfing, and you'd label 58 00:02:45.759 --> 00:02:48.319 those chunks of data: this part is walking, this part 59 00:02:48.360 --> 00:02:48.800 is golfing. 60 00:02:48.879 --> 00:02:51.759 Yep. And the ML algorithm looks at all that labeled 61 00:02:51.800 --> 00:02:55.520 sensor data, maybe acceleration, rotation, time, whatever, and it learns 62 00:02:55.520 --> 00:02:58.800 the underlying patterns that distinguish golfing from the others. It 63 00:02:58.840 --> 00:03:01.919 derives the complex rules you couldn't realistically write. That's the 64 00:03:01.960 --> 00:03:03.919 core shift. It's pretty powerful. And 65 00:03:03.879 --> 00:03:06.280 the platform that's really designed to put this power into 66 00:03:06.319 --> 00:03:07.960 coders' hands is TensorFlow. 67 00:03:08.080 --> 00:03:11.879 TensorFlow, yeah, it's this huge open source platform for building 68 00:03:11.879 --> 00:03:14.719 and using ML models. 
Its real value, I think, is 69 00:03:14.719 --> 00:03:17.400 that it handles a lot of the underlying complexity. It 70 00:03:17.520 --> 00:03:20.639 implements common algorithms, common patterns, so you, 71 00:03:20.719 --> 00:03:23.560 the coder, can focus more on the actual problem you're 72 00:03:23.599 --> 00:03:27.800 trying to solve with ML, and less on, say, implementing backpropagation from 73 00:03:27.639 --> 00:03:31.280 scratch. Exactly, focus on the scenario. And TensorFlow is built 74 00:03:31.280 --> 00:03:33.680 to be flexible. You can deploy the models you create 75 00:03:33.719 --> 00:03:37.360 almost anywhere: web, cloud, mobile apps on Android or iOS, 76 00:03:37.400 --> 00:03:38.919 even tiny embedded systems. 77 00:03:39.199 --> 00:03:43.080 And how do you typically work with it? Python install, IDEs? 78 00:03:42.800 --> 00:03:44.800 All of the above, really. You can pip install it 79 00:03:44.800 --> 00:03:49.639 in Python, use it in IDEs like PyCharm, or a 80 00:03:49.879 --> 00:03:52.800 really popular way is using cloud environments like Google Colab 81 00:03:53.000 --> 00:03:56.039 that give you access to GPUs and TPUs without needing 82 00:03:56.080 --> 00:03:56.960 the hardware yourself. 83 00:03:57.120 --> 00:04:00.919 Okay, let's drill down. What does the simplest possible example 84 00:04:00.960 --> 00:04:03.879 of this learning look like? Like the basic building 85 00:04:03.599 --> 00:04:07.520 block? Right. So imagine teaching a network a really simple 86 00:04:07.680 --> 00:04:11.400 linear relationship, like y equals 2x minus 1. You'd give 87 00:04:11.400 --> 00:04:14.360 examples: if x is 1, y is 1; if x is 2, 88 00:04:14.400 --> 00:04:16.360 y is 3; if x is 3, y is 5; and so 89 00:04:16.399 --> 00:04:19.560 on. Okay. A tiny neural network, even one with just 90 00:04:19.600 --> 00:04:22.399 a single neuron, can learn this. 
It does it by 91 00:04:22.439 --> 00:04:27.040 adjusting two internal values: a weight it multiplies the input 92 00:04:27.279 --> 00:04:29.199 x by, and a bias it adds. 93 00:04:29.600 --> 00:04:31.600 So it's basically learning the 2 and the minus 1 94 00:04:32.000 --> 00:04:34.920 from our equation. Yeah. Those learned numbers, the weight and bias, 95 00:04:35.000 --> 00:04:36.399 those are the parameters 96 00:04:35.879 --> 00:04:38.199 of the network. Exactly. And that's a really important distinction 97 00:04:38.279 --> 00:04:42.680 for coders getting into ML. Parameters, weights and biases, are 98 00:04:42.720 --> 00:04:45.680 what the network learns from the data. They're different from 99 00:04:45.399 --> 00:04:48.639 hyperparameters. Right. Hyperparameters are the knobs you turn before training starts, 100 00:04:48.639 --> 00:04:51.560 aren't they? Things that control the learning process itself. 101 00:04:51.680 --> 00:04:55.000 Yeah, things like the learning rate, how quickly it adjusts weights, 102 00:04:55.079 --> 00:04:56.879 or the number of neurons you decide to put in 103 00:04:56.920 --> 00:04:59.600 a layer, or how many epochs, meaning how many times 104 00:04:59.600 --> 00:05:02.040 it sees the whole data set. You experiment with these 105 00:05:02.079 --> 00:05:06.120 to get better results. And neurons also usually have something 106 00:05:06.160 --> 00:05:08.720 called an activation function. It's like a little function that 107 00:05:08.759 --> 00:05:13.000 processes the neuron's output. A common one is ReLU, rectified 108 00:05:13.040 --> 00:05:15.680 linear unit. It basically just passes the value through if 109 00:05:15.720 --> 00:05:20.079 it's positive and outputs zero otherwise. This adds nonlinearity, which 110 00:05:20.120 --> 00:05:22.519 is crucial for learning anything beyond simple lines. 111 00:05:22.600 --> 00:05:25.160 Okay, makes sense. 
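The single-neuron idea discussed here can be sketched in a few lines. This is a minimal illustration in plain Python rather than the TensorFlow code the book uses. The training points, the learning rate of 0.05, and the 500 epochs are assumptions picked for the sketch, not values from the episode.

```python
# Minimal sketch: one "neuron" (a weight w and a bias b) fitted to
# y = 2x - 1 with plain gradient descent on mean squared error.
# Pure Python stand-in; in practice a framework does this for you.

xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]      # the data
ys = [2.0 * x - 1.0 for x in xs]          # the labels ("answers")

# Hyperparameters: chosen by us before training (assumed values).
learning_rate = 0.05
epochs = 500

# Parameters: learned from the data, starting from zero.
w, b = 0.0, 0.0


def relu(v):
    """Rectified linear unit: pass positive values through, else 0."""
    return v if v > 0 else 0.0


# Note: the linear fit below needs no activation function; relu is
# defined only to show what the activation mentioned above computes.
for _ in range(epochs):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y           # prediction minus label
        grad_w += 2 * error * x / len(xs)
        grad_b += 2 * error / len(xs)
    w -= learning_rate * grad_w           # nudge parameters downhill
    b -= learning_rate * grad_b

prediction = w * 10.0 + b                 # should be close to 19
```

After training, `w` ends up near 2 and `b` near -1, which is the "2 and the minus 1" from the equation.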
We've got the basic shift to learning, 112 00:05:25.360 --> 00:05:28.439 the platform, the simplest building block. Let's apply this to 113 00:05:28.480 --> 00:05:31.839 a huge area: computer vision, making machines 114 00:05:31.959 --> 00:05:34.959 see. So at its core, an image is just a 115 00:05:35.000 --> 00:05:38.240 grid of numbers, right? Pixels. A small grayscale image, 116 00:05:38.279 --> 00:05:41.000 like from the Fashion MNIST data set, might be twenty 117 00:05:41.000 --> 00:05:43.959 eight by twenty eight pixels. Each pixel has a value, 118 00:05:44.040 --> 00:05:46.399 say zero to 255, for how bright it 119 00:05:46.319 --> 00:05:48.680 is. And color images just have more numbers per pixel, 120 00:05:48.759 --> 00:05:50.519 usually three, for red, green, and blue. 121 00:05:50.680 --> 00:05:51.319 Three channels. 122 00:05:51.439 --> 00:05:51.600 Yeah. 123 00:05:51.639 --> 00:05:54.160 Now if you try to feed those raw pixel values 124 00:05:54.199 --> 00:05:57.240 directly into that simple single-neuron network we just talked about, 125 00:05:57.319 --> 00:05:59.720 or even a basic multilayer network, well, it would 126 00:05:59.800 --> 00:06:03.920 really struggle, because a simple network doesn't understand spatial structure. 127 00:06:04.120 --> 00:06:06.680 It just sees a flat list of numbers. It might 128 00:06:06.959 --> 00:06:10.439 learn to recognize, say, a sneaker, if it's exactly like 129 00:06:10.480 --> 00:06:13.160 the ones in the training data, same position, same angle. 130 00:06:13.319 --> 00:06:16.240 Ah, but if you show it the sneaker slightly rotated, 131 00:06:16.920 --> 00:06:19.319 or maybe a different type of shoe, like a high heel, it 132 00:06:19.319 --> 00:06:22.879 might completely fail. It hasn't learned the features that make 133 00:06:23.000 --> 00:06:26.720 something a sneaker. It just memorized specific pixel patterns in 134 00:06:26.759 --> 00:06:27.759 specific locations. 
135 00:06:27.800 --> 00:06:28.800 Okay, so that's the problem 136 00:06:28.839 --> 00:06:31.720 convolutional neural networks, or CNNs, are designed 137 00:06:31.360 --> 00:06:35.079 to solve. Exactly. CNNs are built to automatically find and 138 00:06:35.160 --> 00:06:40.680 learn hierarchical features in images, things like edges, textures, shapes, objects, 139 00:06:40.920 --> 00:06:42.879 regardless of where they appear in the image. 140 00:06:43.160 --> 00:06:44.800 How do they do that? What are the core ideas? 141 00:06:44.879 --> 00:06:49.480 Two main operations: convolutions and pooling. Convolutions use small filters, 142 00:06:49.920 --> 00:06:52.360 like maybe a three-by-three grid of weights, that 143 00:06:52.439 --> 00:06:55.639 slide across the image. Each filter is trained to detect 144 00:06:55.720 --> 00:06:58.519 a specific local pattern, maybe a vertical edge or a 145 00:06:58.560 --> 00:07:00.160 certain curve or texture. 146 00:07:00.439 --> 00:07:03.120 So the filter scans the image and produces a sort 147 00:07:03.120 --> 00:07:04.720 of map showing where it found that 148 00:07:04.639 --> 00:07:09.959 pattern. Precisely. And applying a filter reduces the image dimensions slightly. 149 00:07:10.600 --> 00:07:12.720 A 3x3 filter on a 28x28 150 00:07:12.759 --> 00:07:15.079 image gives you a 26x26 151 00:07:15.120 --> 00:07:16.000 output map. 152 00:07:16.120 --> 00:07:17.120 Okay, and pooling? 153 00:07:17.240 --> 00:07:20.079 Pooling layers then reduce the size of these feature maps, 154 00:07:20.160 --> 00:07:23.720 making the representation smaller and more manageable while keeping the 155 00:07:23.720 --> 00:07:27.759 most important information. A common type is max pooling. You 156 00:07:27.839 --> 00:07:29.800 might take a 2x2 area and just keep 157 00:07:29.800 --> 00:07:32.759 the maximum value, throwing away the other three. 
This halves 158 00:07:32.800 --> 00:07:35.480 the dimensions but keeps the strongest signal for that feature 159 00:07:35.519 --> 00:07:36.199 in that region. 160 00:07:36.360 --> 00:07:39.160 And by stacking these convolution and pooling layers, 161 00:07:39.000 --> 00:07:43.000 the network builds up understanding. Early layers find simple edges. 162 00:07:43.160 --> 00:07:46.759 Middle layers combine edges into corners and textures. Deeper layers 163 00:07:46.800 --> 00:07:49.519 combine those into parts of objects, and then whole objects. 164 00:07:49.959 --> 00:07:51.720 The key thing for you as a coder is that 165 00:07:51.759 --> 00:07:54.800 the CNN automates this feature extraction. You don't have to 166 00:07:54.839 --> 00:07:56.519 hand-code edge detectors anymore. 167 00:07:56.639 --> 00:07:57.879 Right. Let's make it concrete. 168 00:07:58.000 --> 00:07:59.920 The book uses a horse-or-human classifier 169 00:08:00.000 --> 00:08:00.319 example. 170 00:08:00.399 --> 00:08:03.519 Yeah, that data set uses bigger color images, maybe three 171 00:08:03.639 --> 00:08:06.319 hundred by three hundred pixels with three color channels. So 172 00:08:06.360 --> 00:08:08.800 the input shape is three hundred by three hundred by three. 173 00:08:08.920 --> 00:08:12.040 And since it's just two classes, horse or human, it's 174 00:08:12.160 --> 00:08:15.480 binary classification. You can use just one neuron in the 175 00:08:15.519 --> 00:08:16.480 final output layer. 176 00:08:16.560 --> 00:08:19.600 You can, and you typically attach a sigmoid activation function 177 00:08:19.680 --> 00:08:23.120 to it. Sigmoid squashes any input value into a range 178 00:08:23.160 --> 00:08:26.600 between zero and one, perfect for a probability. You can interpret 179 00:08:26.720 --> 00:08:29.920 the output as, say, the probability that the image is 180 00:08:29.920 --> 00:08:30.360 a human. 
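The shape arithmetic from this part of the conversation (a 3x3 filter turning 28x28 into 26x26, then 2x2 max pooling halving that) can be checked with a small plain-Python sketch. The edge-detecting filter values and the synthetic half-dark, half-bright image are assumptions made for the illustration. A real CNN learns its filter weights instead of having them hand-written.

```python
import math


def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1).

    This is the cross-correlation that deep learning libraries
    call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            max(fmap[2 * r][2 * c], fmap[2 * r][2 * c + 1],
                fmap[2 * r + 1][2 * c], fmap[2 * r + 1][2 * c + 1])
            for c in range(w)
        ]
        for r in range(h)
    ]


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic 28x28 grayscale image: dark left half, bright right half.
image = [[0 if c < 14 else 255 for c in range(28)] for _ in range(28)]

# Hand-written vertical-edge filter (an assumption for illustration).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d_valid(image, vertical_edge)   # 26 x 26
pooled = max_pool_2x2(feature_map)                 # 13 x 13
```

The feature map is zero everywhere except along the dark-to-bright boundary, and pooling keeps that strong response while halving each dimension.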
181 00:08:30.600 --> 00:08:33.919 The source mentioned a specific failure case, though, where a 182 00:08:34.000 --> 00:08:36.840 model trained on this data set saw a picture of 183 00:08:36.919 --> 00:08:39.720 just, like, the top half of a person and classified 184 00:08:39.759 --> 00:08:42.399 it as a horse. Why would that happen? It seems 185 00:08:42.399 --> 00:08:44.080 like a common beginner frustration. 186 00:08:44.519 --> 00:08:47.279 It often boils down to the training data and overfitting. 187 00:08:47.519 --> 00:08:50.200 If your training set mostly contains full-body pictures of 188 00:08:50.279 --> 00:08:53.440 humans standing up and maybe horses in profile, the model 189 00:08:53.519 --> 00:08:57.080 learns those specific views. When it sees something unusual, like 190 00:08:57.200 --> 00:08:59.519 only the upper body of a person, perhaps in a pose 191 00:08:59.559 --> 00:09:02.200 it hasn't seen, it might latch onto some features, maybe texture, 192 00:09:02.240 --> 00:09:05.039 maybe background, that it learned were associated with horses in 193 00:09:05.080 --> 00:09:08.320 the training data. It hasn't generalized the concept of human 194 00:09:08.399 --> 00:09:10.240 well enough outside the examples it saw. 195 00:09:10.480 --> 00:09:14.799 Okay, so this overfitting, doing great on training data but 196 00:09:14.919 --> 00:09:18.000 failing on new stuff. How do you fight that, especially 197 00:09:18.000 --> 00:09:19.200 if you don't have tons of data? 198 00:09:19.759 --> 00:09:23.320 Several really good techniques. One is image augmentation. It's clever. 199 00:09:23.960 --> 00:09:26.759 During training, you don't just feed the network your original images. 200 00:09:26.879 --> 00:09:29.960 You apply random transformations on the fly: maybe rotate the 201 00:09:30.000 --> 00:09:33.120 image slightly, zoom in or out a bit, shift it 202 00:09:33.159 --> 00:09:35.159 horizontally or vertically, flip it. 
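As a rough illustration of on-the-fly augmentation, here is a plain-Python sketch that randomly flips and shifts a tiny image stored as a list of rows. The helper names, the 50% flip chance, and the one-pixel shift limit are all assumptions for the example. Rotation and zoom are left out to keep it short, and a real pipeline would use the framework's own augmentation tools.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels`, filling the gap with `fill`."""
    if pixels <= 0:
        return [row[:] for row in image]
    return [[fill] * pixels + row[:-pixels] for row in image]


def augment(image, rng, max_shift=1):
    """Return a randomly transformed copy; the original is untouched."""
    out = [row[:] for row in image]
    if rng.random() < 0.5:                       # flip half the time
        out = horizontal_flip(out)
    out = shift_right(out, rng.randint(0, max_shift))
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
augmented = augment(image, random.Random(42))    # same shape, new pixels
```

Calling `augment` once per image per epoch means the network rarely sees exactly the same pixels twice, even though the underlying data set has not grown.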
203 00:09:35.320 --> 00:09:38.720 Ah, so you're essentially creating slightly modified versions of your 204 00:09:38.720 --> 00:09:42.120 existing images, making the data set seem bigger and more varied. 205 00:09:42.279 --> 00:09:44.759 Exactly. The model learns that a horse is still a 206 00:09:44.799 --> 00:09:48.159 horse even if it's tilted or slightly zoomed. It becomes 207 00:09:48.159 --> 00:09:52.159 more robust. TensorFlow's ImageDataGenerator makes this super easy 208 00:09:52.200 --> 00:09:52.679 to set up. 209 00:09:52.799 --> 00:09:53.200 What else? 210 00:09:53.399 --> 00:09:57.200 Another huge one is transfer learning. This is incredibly powerful. 211 00:09:57.240 --> 00:09:59.960 You might only have a few thousand horse and human images, 212 00:10:00.559 --> 00:10:03.799 but other people have trained enormous models, like MobileNet 213 00:10:03.879 --> 00:10:08.039 or Inception, on millions of images covering, say, one thousand 214 00:10:08.120 --> 00:10:11.559 different categories in the ImageNet data set. Those massive models 215 00:10:11.600 --> 00:10:15.759 have already learned really, really good general-purpose feature extractors 216 00:10:15.799 --> 00:10:21.159 in their early convolutional layers. They know how to detect edges, textures, shapes, 217 00:10:21.679 --> 00:10:26.279 basic object parts, things useful for recognizing any image. 218 00:10:26.480 --> 00:10:30.120 So with transfer learning, you take those pretrained layers. 219 00:10:29.840 --> 00:10:32.799 Yep, you basically chop off the original final layers that 220 00:10:32.840 --> 00:10:35.879 were specific to the one thousand ImageNet classes. 
You 221 00:10:36.279 --> 00:10:38.840 freeze the weights of the early layers so they don't change, 222 00:10:39.159 --> 00:10:41.799 and you add your own new classification layers on top, 223 00:10:41.879 --> 00:10:44.440 maybe just a couple of layers ending in that single 224 00:10:44.480 --> 00:10:47.559 sigmoid neuron for your horse-or-human task. 225 00:10:47.200 --> 00:10:49.519 And you only train your new small layers using your 226 00:10:49.519 --> 00:10:50.360 smaller data set. 227 00:10:50.519 --> 00:10:54.240 Mostly, yes. Or sometimes you fine-tune by unfreezing the 228 00:10:54.279 --> 00:10:56.679 last few pretrained layers and training them a tiny 229 00:10:56.720 --> 00:10:59.360 bit too. But the point is you're leveraging all the 230 00:10:59.399 --> 00:11:03.000 knowledge learned from the giant data set for your specific problem. 231 00:11:03.279 --> 00:11:06.279 It's a massive shortcut. TensorFlow Hub is a great place 232 00:11:06.279 --> 00:11:07.799 to find these pretrained models. 233 00:11:08.039 --> 00:11:10.480 That makes a lot of sense. Any other tricks for overfitting? 234 00:11:10.559 --> 00:11:14.720 Dropout regularization is another common one. During training, for each 235 00:11:14.759 --> 00:11:18.759 batch of data, you randomly drop out, meaning temporarily ignore, 236 00:11:18.879 --> 00:11:20.840 a certain percentage of the neurons in a layer. 237 00:11:20.879 --> 00:11:21.919 Wait, you just turn them off? 238 00:11:21.960 --> 00:11:25.240 Why? It sounds weird, but it forces the network to 239 00:11:25.320 --> 00:11:29.759 learn more robust and redundant representations. It prevents any single 240 00:11:29.840 --> 00:11:33.559 neuron from becoming overly specialized or critical for making predictions 241 00:11:33.559 --> 00:11:36.399 based on the training data. It encourages the network to 242 00:11:36.440 --> 00:11:39.840 distribute the learning. 
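The dropout idea described here can be sketched in plain Python. The version below is the common "inverted dropout" variant, which also scales the surviving values so the expected total stays the same. The 20% rate and the all-ones layer output are assumptions chosen for the illustration, since the conversation only says "a certain percentage".

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout on a list of neuron outputs.

    During training, each value is zeroed with probability `rate`
    and the survivors are scaled by 1 / (1 - rate). At inference
    time the values pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                 # seeded so the run is repeatable
layer_output = [1.0] * 10000           # stand-in for a layer's outputs

dropped = dropout(layer_output, rate=0.2, training=True, rng=rng)
at_inference = dropout(layer_output, rate=0.2, training=False, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)   # roughly 0.2
```

Because a different random subset is silenced on every batch, no single neuron can be relied on, which is the "distribute the learning" effect mentioned above.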
You often see the training accuracy and 243 00:11:39.879 --> 00:11:43.360 the validation accuracy, performance on data held back from training, 244 00:11:43.720 --> 00:11:46.240 stay much closer together when you use dropout. 245 00:11:46.320 --> 00:11:48.320 Okay, got it. So that's a good overview for images. 246 00:11:48.559 --> 00:11:52.200 What about the other big area, text, natural language processing? 247 00:11:52.480 --> 00:11:54.480 How do machines start to understand language? 248 00:11:54.720 --> 00:11:58.480 Well, like images, text needs to be converted into numbers first. 249 00:11:58.840 --> 00:12:01.639 The initial step is usually tokenization. 250 00:12:01.399 --> 00:12:04.519 Breaking sentences down into words or maybe even parts of words, 251 00:12:04.799 --> 00:12:07.840 and giving each unique piece a number ID, like "the" 252 00:12:07.919 --> 00:12:09.840 is one, "cat" is two, "sat" 253 00:12:09.679 --> 00:12:13.679 is three. Exactly. You build a vocabulary, a mapping from 254 00:12:13.720 --> 00:12:18.200 words to integer tokens. Good tokenizers handle things like punctuation. 255 00:12:18.320 --> 00:12:22.200 Maybe "today?" just becomes the token for "today". And crucially, 256 00:12:22.679 --> 00:12:25.159 you need a plan for words that weren't in your 257 00:12:25.200 --> 00:12:28.559 training vocabulary: an out-of-vocabulary, or OOV, token. 258 00:12:28.759 --> 00:12:32.240 Once you have tokens, you turn sentences into sequences of 259 00:12:32.279 --> 00:12:35.240 these numbers. "The cat sat" might become the sequence one, 260 00:12:35.320 --> 00:12:36.159 two, three, right? 261 00:12:36.480 --> 00:12:39.919 And because neural networks usually need fixed-size inputs, you 262 00:12:39.960 --> 00:12:42.200 have to make all your sequences the same length. You 263 00:12:42.279 --> 00:12:45.960 either pad shorter sequences, usually with zeros, or you 264 00:12:45.960 --> 00:12:46.799 truncate longer ones. 
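The tokenize, sequence, and pad steps just described can be sketched in plain Python. The function names are hypothetical, and two design choices are assumptions made so the numbers match the spoken example: the OOV token gets the ID after the last known word (libraries may reserve a different ID), and padding and truncation both happen at the end of the sequence.

```python
import string


def clean(sentence):
    """Lowercase, strip punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def build_vocab(sentences):
    """Map each unique word to an integer ID, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in clean(sentence):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_sequence(sentence, vocab, oov_id):
    """Turn a sentence into token IDs, using oov_id for unknown words."""
    return [vocab.get(word, oov_id) for word in clean(sentence)]


def pad(sequence, max_len, pad_value=0):
    """Pad with zeros at the end, or truncate from the end."""
    return (sequence + [pad_value] * max_len)[:max_len]


corpus = ["The cat sat.", "Is it sunny today?"]
vocab = build_vocab(corpus)
oov_id = len(vocab) + 1                  # assumed choice for the OOV ID

seq_known = to_sequence("The cat sat", vocab, oov_id)      # [1, 2, 3]
seq_unknown = to_sequence("The dog sat", vocab, oov_id)    # "dog" is OOV
padded = pad(seq_known, 5)
truncated = pad(seq_known, 2)
```

Here "today?" and "today" map to the same token because punctuation is stripped before lookup, and "dog" falls back to the OOV ID because it never appeared in the corpus.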
265 00:12:46.879 --> 00:12:47.759 How do you pick the length? 266 00:12:48.080 --> 00:12:50.679 You typically look at the distribution of sentence lengths in 267 00:12:50.720 --> 00:12:53.440 your data. Maybe ninety-five percent of your sentences are 268 00:12:53.519 --> 00:12:56.120 shorter than, say, eighty-five words, so you might pick 269 00:12:56.159 --> 00:12:58.879 eighty-five as your max length to minimize padding while 270 00:12:58.919 --> 00:13:00.799 capturing most sentences fully. 271 00:13:00.559 --> 00:13:03.759 And sometimes you clean the text first, remove HTML, maybe 272 00:13:03.799 --> 00:13:04.399 common words. 273 00:13:04.639 --> 00:13:09.320 Yeah, preprocessing is often important: removing HTML tags, maybe converting 274 00:13:09.320 --> 00:13:13.320 to lowercase. Sometimes you remove stop words, common words like 275 00:13:13.519 --> 00:13:16.080 "is", "it", "the" that might not carry 276 00:13:16.120 --> 00:13:19.799 much specific meaning for your task. So "Is it sunny 277 00:13:19.840 --> 00:13:23.200 today?" might become tokens for just "sunny today". 278 00:13:23.480 --> 00:13:26.480 Okay, so we have sequences of numbers. Yeah. But just 279 00:13:26.519 --> 00:13:29.720 assigning arbitrary IDs like one, two, three doesn't tell the 280 00:13:29.759 --> 00:13:33.559 model anything about meaning, right? "Cat", two, isn't inherently related 281 00:13:33.559 --> 00:13:35.200 to "dog", maybe fifty. Exactly. 282 00:13:35.240 --> 00:13:37.600 That's where embeddings come in. This is a really key 283 00:13:37.639 --> 00:13:38.519 concept in NLP. 284 00:13:38.759 --> 00:13:38.879 Right. 285 00:13:39.120 --> 00:13:42.240 Embeddings represent words not just as single numbers, but as 286 00:13:42.559 --> 00:13:46.080 vectors, lists of numbers, in a multidimensional space. Think 287 00:13:46.120 --> 00:13:49.320 of it like giving each word coordinates on a complex map. And 
288 00:13:49.320 --> 00:13:51.600 the idea is that words with similar meanings end up 289 00:13:51.600 --> 00:13:52.600 closer together 290 00:13:52.360 --> 00:13:52.879 on this map. 291 00:13:52.919 --> 00:13:56.039 Precisely. "King" and "queen" might have similar vectors. "Walking" and 292 00:13:56.120 --> 00:13:59.679 "running" might be close. The relationships between words are captured 293 00:13:59.679 --> 00:14:02.159 by their relative positions in this embedding space. 294 00:14:02.240 --> 00:14:04.519 The book uses a cool example with Pride and Prejudice 295 00:14:04.559 --> 00:14:07.639 characters, right? Plotting them based on learned dimensions like gender 296 00:14:07.720 --> 00:14:08.399 or nobility. 297 00:14:08.639 --> 00:14:11.440 Yeah, that's a great way to visualize it. Mr. Darcy 298 00:14:11.480 --> 00:14:14.360 and Elizabeth Bennet might be positioned based on these learned 299 00:14:14.360 --> 00:14:18.320 semantic features. The key is that the network learns these 300 00:14:18.399 --> 00:14:23.559 vector representations. The dimensions aren't predefined. They emerge from how 301 00:14:23.600 --> 00:14:26.000 words are used together in the training text. 302 00:14:26.360 --> 00:14:29.120 So the network figures out that "king" is used in 303 00:14:29.200 --> 00:14:33.679 similar contexts to "queen", but maybe also related to "man", 304 00:14:34.000 --> 00:14:35.679 while "queen" is related to "woman". 305 00:14:35.919 --> 00:14:38.480 Right, and you can even optimize the number of dimensions 306 00:14:38.559 --> 00:14:41.159 in your embedding vectors. A rule of thumb is maybe 307 00:14:41.200 --> 00:14:44.120 the fourth root of your vocabulary size. So for a 308 00:14:44.120 --> 00:14:47.279 few thousand words, maybe seven or eight dimensions is enough 309 00:14:47.440 --> 00:14:50.320 instead of, say, sixteen or thirty-two. It trains faster 310 00:14:50.440 --> 00:14:51.759 without losing too much meaning. 
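Two ideas from this stretch are easy to check numerically: the fourth-root rule of thumb for embedding size, and "similar words sit close together". The sketch below uses plain Python. The three-dimensional vectors are made-up numbers for illustration only, not learned embeddings, and cosine similarity is one common closeness measure rather than something the episode specifies.

```python
import math


def suggested_embedding_dim(vocab_size):
    """Rule of thumb from the discussion: fourth root of vocabulary size."""
    return round(vocab_size ** 0.25)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dim_small = suggested_embedding_dim(2000)   # about 6.7, rounds to 7
dim_large = suggested_embedding_dim(4000)   # about 7.95, rounds to 8

# Hypothetical hand-written vectors; a real model would learn these.
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.75, 0.30]
banana = [-0.20, 0.10, 0.90]

royal_similarity = cosine_similarity(king, queen)     # close to 1
fruit_similarity = cosine_similarity(king, banana)    # close to 0
```

A vocabulary of a few thousand words lands at seven or eight dimensions, which matches the figure quoted in the conversation.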
311 00:14:52.159 --> 00:14:55.000 But if you just average the embedding vectors for all 312 00:14:55.039 --> 00:14:57.440 words in a sentence, you lose word order, don't you? 313 00:14:57.480 --> 00:14:58.600 It becomes a bag of words. 314 00:14:58.799 --> 00:15:01.960 That's a major limitation of simple embedding approaches. Word order 315 00:15:02.039 --> 00:15:04.960 is critical in language. "Dog bites man" versus "man bites 316 00:15:05.000 --> 00:15:07.480 dog": totally different meanings, same bag of words. So 317 00:15:07.480 --> 00:15:10.440 to handle sequence and context, we need something more sophisticated, 318 00:15:10.720 --> 00:15:13.320 like recurrent neural networks, RNNs. Exactly. 319 00:15:13.720 --> 00:15:16.480 RNNs are designed from the ground up for sequential data. 320 00:15:16.759 --> 00:15:18.919 They have a kind of internal memory or state that 321 00:15:18.960 --> 00:15:21.919 gets updated as they process each word or token in 322 00:15:21.960 --> 00:15:26.000 a sequence. This state carries context from previous words forward. 323 00:15:26.519 --> 00:15:29.799 But you mentioned simple RNNs can struggle with long sentences. 324 00:15:30.039 --> 00:15:33.159 They might forget important context from the beginning. 325 00:15:33.480 --> 00:15:37.039 Yeah, that's the vanishing gradient problem. Essentially, the influence of 326 00:15:37.080 --> 00:15:39.960 early words can fade out over long sequences. If you 327 00:15:39.960 --> 00:15:42.440 have a sentence like "I grew up in France, so 328 00:15:42.559 --> 00:15:45.840 I speak fluent ...", the word "France" early on is key 329 00:15:45.879 --> 00:15:49.000 to predicting "French" at the end. A simple RNN might 330 00:15:49.080 --> 00:15:49.919 lose that connection. 331 00:15:50.279 --> 00:15:53.440 Okay, and that's why LSTMs, long short-term memory networks, 332 00:15:53.440 --> 00:15:54.279 were developed, right? 333 00:15:54.480 --> 00:15:57.559 LSTMs are a special type of RNN. 
They have internal 334 00:15:57.600 --> 00:16:02.360 mechanisms called gates that explicitly control what information to remember, 335 00:16:02.559 --> 00:16:05.440 what to forget, and what to output. This makes them 336 00:16:05.519 --> 00:16:09.039 much, much better at capturing long-range dependencies in sequences. 337 00:16:09.320 --> 00:16:12.639 And then there are bidirectional LSTMs. How do they improve things? 338 00:16:12.960 --> 00:16:16.840 So a standard LSTM reads the sequence from start to end. 339 00:16:17.320 --> 00:16:21.840 A bidirectional LSTM has two LSTMs. One reads forwards, the 340 00:16:21.840 --> 00:16:25.399 other reads backwards. Then it combines their outputs at each step. 341 00:16:25.559 --> 00:16:27.840 Ah, so it gets context from both 342 00:16:27.679 --> 00:16:32.080 directions. Exactly. For understanding language, this is often really powerful. 343 00:16:32.159 --> 00:16:36.639 Think about sentiment analysis. Sometimes the keyword determining the sentiment 344 00:16:36.720 --> 00:16:40.559 comes late in the sentence. Or for predicting a missing word, 345 00:16:40.840 --> 00:16:43.240 knowing the words that come after it is just as 346 00:16:43.240 --> 00:16:45.039 important as knowing the words before it. 347 00:16:45.039 --> 00:16:47.960 Like that "I lived in [country]" Gaelic example, right? Seeing 348 00:16:48.000 --> 00:16:49.679 "Gaelic" later helps figure out 349 00:16:49.519 --> 00:16:53.200 the country. Precisely. The backward pass provides that future context, 350 00:16:53.360 --> 00:16:56.879 and you can feed pretrained embeddings like GloVe vectors 351 00:16:56.919 --> 00:16:59.080 into these LSTMs to give them a head start on 352 00:16:59.240 --> 00:16:59.840 word meaning. 353 00:17:00.080 --> 00:17:02.360 Okay, so we can use these models to understand text. 354 00:17:02.720 --> 00:17:04.880 What about generating text? How does that work? 355 00:17:05.440 --> 00:17:08.599 The core idea is pretty straightforward. 
Actually, you train a 356 00:17:08.640 --> 00:17:11.279 model to predict the next word in a sequence given 357 00:17:11.319 --> 00:17:12.200 the preceding words. 358 00:17:12.400 --> 00:17:14.720 So if your training data has the sentence "the quick 359 00:17:14.759 --> 00:17:19.640 brown fox", you'd create training examples like: input "the", label 360 00:17:19.720 --> 00:17:22.599 "quick"; input "the quick", label "brown"; input "the quick brown", 361 00:17:22.680 --> 00:17:23.880 label "fox". Exactly. 362 00:17:24.160 --> 00:17:27.400 You slide a window across your text corpus, creating these 363 00:17:27.440 --> 00:17:31.559 input sequences and their corresponding next-word labels. The labels 364 00:17:31.559 --> 00:17:34.240 are usually one-hot encoded: a vector of zeros with 365 00:17:34.279 --> 00:17:37.440 a single one at the index corresponding to the correct 366 00:17:37.640 --> 00:17:39.640 next word in your vocabulary. And the 367 00:17:39.599 --> 00:17:42.559 model architecture would be similar: embedding layer, then maybe an 368 00:17:42.640 --> 00:17:45.079 LSTM or bidirectional LSTM. 369 00:17:44.720 --> 00:17:48.359 Yep, very common. Then to generate text, you start with 370 00:17:48.440 --> 00:17:51.480 a seed text, maybe a word or a phrase. You 371 00:17:51.559 --> 00:17:55.200 feed that seed sequence into your trained model. It outputs 372 00:17:55.200 --> 00:17:58.359 a probability distribution over all the words in your vocabulary 373 00:17:58.359 --> 00:17:59.960 for what the next word is most likely to be. 374 00:18:00.559 --> 00:18:03.440 You pick a word based on those probabilities, maybe the 375 00:18:03.440 --> 00:18:04.319 most likely one. 376 00:18:04.319 --> 00:18:06.960 Usually, yeah, or sometimes you sample from the distribution to 377 00:18:06.960 --> 00:18:10.079 get more variety. Then you append that predicted word to 378 00:18:10.160 --> 00:18:13.400 your seed text. Now you have a slightly longer sequence. And 
379 00:18:13.359 --> 00:18:15.279 you feed that new sequence back into the model to 380 00:18:15.319 --> 00:18:17.240 predict the next word, and repeat. 381 00:18:17.480 --> 00:18:20.400 That's the loop: keep feeding the growing sequence back in, 382 00:18:20.599 --> 00:18:22.680 predicting the next word, appending it. 383 00:18:22.720 --> 00:18:25.799 The book had an example using song lyrics, right? Starting 384 00:18:25.839 --> 00:18:28.839 with "in the town of Athy". Yeah, 385 00:18:28.440 --> 00:18:31.759 and the model correctly predicted "one", which was the actual 386 00:18:31.799 --> 00:18:33.720 next word in the song it was trained on. And 387 00:18:33.839 --> 00:18:37.440 using different seeds like "sweet Jeremy saw Dublin" produced other 388 00:18:37.519 --> 00:18:40.839 plausible next words based on the patterns learned from the lyrics. 389 00:18:41.359 --> 00:18:44.359 Though it's fair to say these simple generation models can 390 00:18:44.400 --> 00:18:47.759 often start repeating themselves or outputting stuff that doesn't make 391 00:18:47.839 --> 00:18:48.799 much sense after a while. 392 00:18:48.799 --> 00:18:52.799 No, definitely, they can descend into gibberish quite quickly. Getting 393 00:18:52.799 --> 00:18:56.440 coherent long-form text generation is much harder. It often 394 00:18:56.440 --> 00:19:01.240 involves more complex architectures, maybe stacking multiple LSTMs, careful tuning 395 00:19:01.279 --> 00:19:05.960 of hyperparameters, and more sophisticated sampling strategies. The generated 396 00:19:05.960 --> 00:19:08.359 Shakespearean text in the book is an example of getting 397 00:19:08.400 --> 00:19:11.799 something more structured, even if parts are nonsensical. Right. 398 00:19:11.920 --> 00:19:15.640 Okay, images, text. What about data where the sequence is 399 00:19:15.720 --> 00:19:19.880 time itself, time series data like weather or stock prices? 
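Before moving on, the next-word training setup and the generation loop from the text discussion can be sketched in plain Python. The function names are hypothetical, and the "model" is only a lookup table of which word followed which, an assumed stand-in for the trained embedding-plus-LSTM network described in the episode.

```python
def next_word_examples(tokens):
    """Slide across the tokens: each prefix predicts the word after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(vocab_size)]


def generate(seed, predict_next, steps):
    """Repeatedly predict a word and append it to the growing sequence."""
    words = list(seed)
    for _ in range(steps):
        words.append(predict_next(words))
    return words


tokens = "the quick brown fox".split()
examples = next_word_examples(tokens)

# Toy stand-in for a trained model: remember which word followed which.
follower = {a: b for a, b in zip(tokens, tokens[1:])}


def predict_next(words):
    """Look up the word that followed the last word during 'training'."""
    return follower.get(words[-1], tokens[0])


generated = generate(["the"], predict_next, steps=3)
label_vector = one_hot(tokens.index("brown"), len(tokens))
```

A real model would return a probability distribution over the vocabulary at each step, and the loop would either take the most likely word or sample from that distribution.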
400 00:19:20.079 --> 00:19:24.640 Yeah, time series data is everywhere, weather forecasts, stock market trends, 401 00:19:24.680 --> 00:19:28.400