WEBVTT 1 00:00:00.080 --> 00:00:02.480 Have you ever stopped to wonder how your phone can 2 00:00:02.560 --> 00:00:06.839 instantly recognize a face, or, you know, how some AI 3 00:00:07.040 --> 00:00:11.080 generates those incredibly realistic images seemingly out of thin air? 4 00:00:11.400 --> 00:00:13.480 Right, it feels like magic sometimes. 5 00:00:13.080 --> 00:00:16.199 It really does. So today we're pulling back the curtain 6 00:00:16.480 --> 00:00:21.079 on that magic behind AI-powered products. Our mission in 7 00:00:21.160 --> 00:00:23.399 this deep dive is to really get into the core 8 00:00:23.480 --> 00:00:27.760 concepts of neural networks and TensorFlow 2.0. We've 9 00:00:27.760 --> 00:00:30.480 gone through Hands-On Neural Networks with TensorFlow 10 00:00:30.519 --> 00:00:34.240 2.0 by Paolo Galeone. Great resource. Absolutely. And we've 11 00:00:34.320 --> 00:00:38.119 extracted the most important sort of nuggets of knowledge to 12 00:00:38.119 --> 00:00:40.840 give you a shortcut to being genuinely well informed about 13 00:00:40.840 --> 00:00:41.960 this fascinating field. 14 00:00:42.079 --> 00:00:44.960 Yeah, and we'll explore the very foundations of machine learning, 15 00:00:45.000 --> 00:00:48.000 really unveil the inner workings of neural networks, and maybe 16 00:00:48.039 --> 00:00:51.520 demystify how frameworks like TensorFlow 2.0 make building 17 00:00:51.520 --> 00:00:55.759 these systems, well, manageable. You'll hopefully gain an intuitive grasp 18 00:00:55.799 --> 00:00:58.479 of not just what these things are, but more importantly, 19 00:00:58.520 --> 00:01:00.560 why they've become so incredibly important. 20 00:01:00.679 --> 00:01:03.000 Okay, let's unpack this right at the start, then. What 21 00:01:03.119 --> 00:01:06.599 exactly is machine learning, fundamentally? So, at its 22 00:01:06.480 --> 00:01:11.680 heart, machine learning is a branch of artificial intelligence. 
The 23 00:01:11.760 --> 00:01:14.959 core idea is we define algorithms that learn a model 24 00:01:15.000 --> 00:01:16.280 directly from data. 25 00:01:16.359 --> 00:01:18.959 Learning from data, not explicit rules. Exactly. 26 00:01:19.280 --> 00:01:24.439 The goal is to automatically extract meaningful information: insights, patterns. 27 00:01:24.319 --> 00:01:26.879 And the applications are just everywhere now, aren't they? You 28 00:01:26.920 --> 00:01:29.079 probably use them constantly without even thinking about it. 29 00:01:29.120 --> 00:01:34.480 Oh, absolutely, they're countless, and yeah, probably daily use. Think 30 00:01:34.519 --> 00:01:36.599 about face detection in your smartphone camera. 31 00:01:36.719 --> 00:01:37.760 Yep, use that all the time. 32 00:01:38.000 --> 00:01:42.560 Predictive maintenance in factories, medical image analysis, which 33 00:01:42.359 --> 00:01:44.599 is huge, helping doctors see 34 00:01:44.359 --> 00:01:49.400 things. Precisely. Time series forecasting in finance. Autonomous driving, obviously a 35 00:01:49.680 --> 00:01:53.120 big one. Text comprehension, even those recommendation systems telling you 36 00:01:53.159 --> 00:01:53.959 what to watch next. 37 00:01:54.040 --> 00:01:57.200 Guilty. Okay. The source calls the data set the most 38 00:01:57.239 --> 00:02:01.359 critical part of the ML pipeline. Why is the quality 39 00:02:01.359 --> 00:02:03.719 and structure so, so important here? 40 00:02:03.920 --> 00:02:07.280 Because everything hinges on it. The model's success lives or 41 00:02:07.319 --> 00:02:10.560 dies by the data. It's like building a house, right? 42 00:02:10.919 --> 00:02:15.120 If your materials, your bricks, are bad, the house won't stand, 43 00:02:15.199 --> 00:02:18.719 no matter how good the architect is. Makes sense. So 44 00:02:19.080 --> 00:02:23.719 take face detection. It's trained on thousands, maybe millions of 45 00:02:23.840 --> 00:02:28.680 labeled examples, faces marked as faces. 
The more high quality, 46 00:02:28.719 --> 00:02:32.520 diverse data we have, the better the algorithm performs. And 47 00:02:32.560 --> 00:02:35.840 this leads us straight to this crucial practice of splitting 48 00:02:35.960 --> 00:02:39.919 data into three distinct, disjoint parts. There is 49 00:02:39.960 --> 00:02:42.280 a training set, that's what the model actually learns from, 50 00:02:42.919 --> 00:02:45.759 then a validation set. We use that during training to 51 00:02:45.800 --> 00:02:50.120 measure performance and, importantly, tune things called hyperparameters. Think 52 00:02:50.120 --> 00:02:51.719 of them as settings for the learning 53 00:02:51.479 --> 00:02:53.680 process. Like knobs to adjust? Exactly. 54 00:02:53.960 --> 00:02:56.680 And finally, the test set. This is sacred. It's completely 55 00:02:56.759 --> 00:02:59.960 untouched until the very end, for the final evaluations. 56 00:03:00.000 --> 00:03:02.080 That's the real test of how it'll do in the wild. 57 00:03:02.280 --> 00:03:05.520 Precisely, it ensures we get an unbiased look at real 58 00:03:05.560 --> 00:03:08.080 world performance, our ultimate reality check. 59 00:03:08.280 --> 00:03:11.080 We often hear about n-dimensional spaces in machine learning. 60 00:03:11.199 --> 00:03:13.719 It sounds pretty abstract. What does that actually mean for 61 00:03:13.800 --> 00:03:14.439 our data? 62 00:03:14.639 --> 00:03:18.120 Yeah, it can sound a bit theoretical. But imagine each 63 00:03:18.240 --> 00:03:21.039 example in your data set, like an image or maybe 64 00:03:21.080 --> 00:03:24.120 sensor readings, as just a single point plotted in some 65 00:03:24.280 --> 00:03:27.240 geometric space. The n just refers to the number of 66 00:03:27.240 --> 00:03:29.639 features or attributes that describe that point. 67 00:03:29.840 --> 00:03:32.639 Ah, okay, so more features, more dimensions. Got it. 68 00:03:33.120 --> 00:03:36.520 So that Fashion-MNIST image example? 
Yeah, twenty-eight 69 00:03:36.560 --> 00:03:39.639 by twenty-eight pixels. That's seven hundred and eighty-four attributes. 70 00:03:39.719 --> 00:03:42.199 So each image is a point in a seven hundred 71 00:03:42.199 --> 00:03:43.759 and eighty-four dimensional space. 72 00:03:44.039 --> 00:03:46.599 Wow. Okay, that's impossible to picture. 73 00:03:46.319 --> 00:03:48.879 Totally impossible for us, which is why understanding this 74 00:03:48.960 --> 00:03:52.680 concept is key. It helps us grasp why high dimensions 75 00:03:52.680 --> 00:03:56.319 can be tricky, the curse of dimensionality, and it's why 76 00:03:56.400 --> 00:03:59.560 techniques like dimensionality reduction are so vital, not just for 77 00:03:59.639 --> 00:04:01.800 visualization but for making models work well. 78 00:04:01.879 --> 00:04:04.680 Okay, so machine learning tasks, they usually fall into three 79 00:04:04.719 --> 00:04:09.159 main buckets: supervised, unsupervised, and semi-supervised learning. What's the 80 00:04:09.240 --> 00:04:09.800 key difference? 81 00:04:09.879 --> 00:04:12.879 The absolute key distinction, the main thing, is the presence 82 00:04:13.000 --> 00:04:14.599 or absence of labels in 83 00:04:14.560 --> 00:04:17.920 the data. Labels meaning the answers, sort of? 84 00:04:18.000 --> 00:04:22.480 Yeah. Supervised learning uses labeled data. You have inputs and 85 00:04:22.519 --> 00:04:26.319 you have the desired outputs, like images labeled cat or dog. 86 00:04:27.120 --> 00:04:28.519 The model learns the mapping. 87 00:04:28.800 --> 00:04:29.879 Okay, that's straightforward. 88 00:04:30.000 --> 00:04:34.000 Unsupervised learning deals with unlabeled data. The goal there is 89 00:04:34.040 --> 00:04:37.160 to find hidden patterns or structures without being told what to 90 00:04:37.120 --> 00:04:39.720 look for. Like finding groups of similar 91 00:04:39.319 --> 00:04:44.079 customers? Exactly, or detecting weird transactions. 
For fraud detection, where 92 00:04:44.079 --> 00:04:45.519 you don't have fraud labels 93 00:04:45.240 --> 00:04:48.279 beforehand. And semi-supervised, that's a hybrid? 94 00:04:48.519 --> 00:04:51.639 It cleverly uses a mix of labeled and unlabeled data, 95 00:04:51.879 --> 00:04:55.199 or sometimes situations where maybe all your examples belong to 96 00:04:55.279 --> 00:04:59.800 the same class, which supervised methods alone can't really handle effectively. 97 00:05:00.079 --> 00:05:02.920 Makes sense. So, once we've built a model using one 98 00:05:02.920 --> 00:05:05.000 of these approaches, how do we know if it's actually 99 00:05:05.000 --> 00:05:06.639 any good? What are the key metrics? 100 00:05:06.920 --> 00:05:11.680 Ah, metrics. They're fundamental, absolutely critical for evaluating how good 101 00:05:11.680 --> 00:05:15.000 our model is. Accuracy is the most common one for classification. 102 00:05:15.319 --> 00:05:17.360 Just the percentage it gets right? Yep. 103 00:05:17.399 --> 00:05:21.879 Simple proportion of correct predictions. However, and this is a 104 00:05:21.879 --> 00:05:25.879 big however, accuracy can be super misleading, especially on unbalanced 105 00:05:25.959 --> 00:05:26.480 data sets. 106 00:05:26.519 --> 00:05:26.839 Why so? 107 00:05:27.319 --> 00:05:29.959 Well, imagine eighty percent of your data is class A 108 00:05:30.279 --> 00:05:33.639 and only twenty percent is class B. A lazy model 109 00:05:33.720 --> 00:05:36.040 could just predict class A every single time, and 110 00:05:36.000 --> 00:05:37.879 it would look eighty percent accurate. 111 00:05:37.680 --> 00:05:41.600 Exactly, eighty percent accuracy. But it's completely useless because it 112 00:05:41.639 --> 00:05:44.480 never finds class B. Not a good classifier at all. 113 00:05:44.639 --> 00:05:48.480 Okay, point taken. So if accuracy can fool us, what 114 00:05:48.560 --> 00:05:50.639 are the better alternatives? What else do we look at? 
115 00:05:50.680 --> 00:05:53.399 We rely on a whole suite of other, more nuanced 116 00:05:53.439 --> 00:05:58.079 metrics for classification, things like precision. That tells us, out 117 00:05:58.120 --> 00:06:00.519 of all the times the model predicted positive, how many 118 00:06:00.560 --> 00:06:01.519 were actually 119 00:06:01.160 --> 00:06:04.480 correct. Like how many emails flagged as spam were spam? 120 00:06:04.839 --> 00:06:07.879 Then there's recall. That asks, out of all the actual 121 00:06:07.920 --> 00:06:10.800 positive cases that existed, how many did our model find? 122 00:06:10.879 --> 00:06:12.600 Making sure we don't miss important 123 00:06:12.199 --> 00:06:15.439 stuff. Exactly. Like in medical diagnosis, recall is often crucial. 124 00:06:15.519 --> 00:06:17.600 You don't want to miss a positive case, right? The 125 00:06:17.680 --> 00:06:20.240 F1 score is great because it's the harmonic mean 126 00:06:20.279 --> 00:06:23.079 of precision and recall, balancing 127 00:06:22.600 --> 00:06:24.519 both. The combined score. Yep. 128 00:06:25.040 --> 00:06:28.759 And for binary classification, just two classes, the area under 129 00:06:28.800 --> 00:06:31.399 the ROC curve, AUC, is really useful. 130 00:06:31.519 --> 00:06:32.519 ROC curve? Yeah. 131 00:06:32.560 --> 00:06:35.240 It shows the trade-off between how well the model 132 00:06:35.360 --> 00:06:39.240 finds true positives (sensitivity) and how well it avoids false 133 00:06:39.279 --> 00:06:43.600 positives (specificity) across different thresholds. It gives a really good overall 134 00:06:43.279 --> 00:06:46.480 picture. Okay. And for regression, predicting numbers? 135 00:06:46.639 --> 00:06:49.759 For regression, we look at things like mean absolute error, 136 00:06:49.959 --> 00:06:52.639 MAE, just the average size of the errors, and mean 137 00:06:52.720 --> 00:06:57.319 squared error, MSE, which penalizes larger errors more heavily. 
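The metrics discussed so far can be sketched in a few lines of plain Python. This is only an illustration, not code from the book: the function names `classification_metrics` and `regression_errors` are made up here, class B is arbitrarily encoded as the positive label `1`, and the data reproduces the 80/20 "lazy model" scenario from the conversation.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_errors(y_true, y_pred):
    """Mean absolute error (MAE) and mean squared error (MSE)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse


# 80 examples of class A (encoded 0) and 20 of class B (encoded 1).
labels = [0] * 80 + [1] * 20
lazy_predictions = [0] * 100  # the lazy model always predicts class A
lazy = classification_metrics(labels, lazy_predictions)
print(lazy)

# Toy regression: two perfect predictions and one that is off by 3.
mae, mse = regression_errors([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
print(mae, mse)
```

The lazy model scores high on accuracy while its recall and F1 for class B are zero, which is the failure mode described above. In the toy regression, the single large miss dominates MSE far more than MAE.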
They 138 00:06:57.319 --> 00:07:00.120 tell us how close our predictions are to the actual values. 139 00:07:00.160 --> 00:07:02.480 Okay, so we know how to measure success. Let's talk 140 00:07:02.480 --> 00:07:05.319 about the models themselves, especially the stars of the show, 141 00:07:05.519 --> 00:07:07.639 neural networks. How are they actually defined? 142 00:07:07.839 --> 00:07:11.360 Great. Neural networks. Dr. Robert Hecht-Nielsen, one of the pioneers, 143 00:07:11.480 --> 00:07:15.079 defined a neural network as basically a computing system made 144 00:07:15.160 --> 00:07:18.399 up of a number of simple, highly interconnected processing elements. 145 00:07:18.439 --> 00:07:20.920 Simple elements, but lots of connections. 146 00:07:20.759 --> 00:07:25.439 Which process information by their dynamic state response to external inputs. 147 00:07:26.279 --> 00:07:28.240 More intuitively, you can just think of them as a 148 00:07:28.279 --> 00:07:31.560 computational model loosely inspired by how our brains work. 149 00:07:31.759 --> 00:07:35.839 Loosely inspired. So they're modeled after biological neurons, but it's 150 00:07:35.839 --> 00:07:37.120 not an exact copy, 151 00:07:36.959 --> 00:07:39.560 right? Oh, absolutely not. It's a very coarse inspiration. We 152 00:07:39.639 --> 00:07:44.240 borrow terms like dendrites for inputs, synapses for the connection weights 153 00:07:43.920 --> 00:07:46.160 that learn. The things that change during training? 154 00:07:45.959 --> 00:07:50.360 Exactly, and a nucleus, which is basically this nonlinear activation 155 00:07:50.480 --> 00:07:54.519 function that determines if the neuron fires. But the biological 156 00:07:54.560 --> 00:07:56.800 reality is far, far more complex. 157 00:07:57.279 --> 00:08:01.199 So why do these artificial neurons need that nonlinear activation function? 158 00:08:01.319 --> 00:08:02.800 What does the nonlinearity do? 159 00:08:03.319 --> 00:08:05.839 That's key. 
Think about a single neuron without it. It 160 00:08:05.839 --> 00:08:08.199 can basically only draw a straight line, or a flat 161 00:08:08.240 --> 00:08:11.360 plane, or in higher dimensions a hyperplane, to separate data. 162 00:08:11.560 --> 00:08:13.839 Okay, like separating red dots from blue dots with 163 00:08:13.800 --> 00:08:16.079 one line. Exactly. But what if the dots are all 164 00:08:16.120 --> 00:08:19.360 mixed up in a complex pattern? That straight line isn't enough. 165 00:08:19.759 --> 00:08:23.680 The nonlinear activation function lets the neuron create a curved boundary, 166 00:08:23.920 --> 00:08:24.879 a hypersurface. 167 00:08:25.360 --> 00:08:27.839 Ah, so we can learn more complex separations. 168 00:08:27.879 --> 00:08:31.399 Precisely, it allows the neuron to capture much more complex 169 00:08:31.480 --> 00:08:34.919 relationships in the data, things that aren't just linearly separable. 170 00:08:35.240 --> 00:08:38.000 And is that why we need multi-layered neural networks, 171 00:08:38.720 --> 00:08:40.840 to handle even more complex stuff? 172 00:08:41.000 --> 00:08:44.799 Yes, exactly. If one curved boundary isn't enough, adding more 173 00:08:44.879 --> 00:08:48.799 layers allows the network to combine and transform these learned boundaries, 174 00:08:49.120 --> 00:08:51.960 creating incredibly intricate decision regions. 175 00:08:52.000 --> 00:08:54.519 So layers build on layers to create complexity. 176 00:08:54.679 --> 00:08:58.240 Right. It enables the network to learn the remarkably complex 177 00:08:58.399 --> 00:09:03.120 classification boundaries needed for real world problems. In fact, these 178 00:09:03.159 --> 00:09:07.840 standard feed-forward networks are called universal function approximators. 179 00:09:07.159 --> 00:09:09.279 Meaning they can learn anything, pretty much? 
180 00:09:09.279 --> 00:09:12.440 In theory, if a relationship exists between inputs and outputs, 181 00:09:12.720 --> 00:09:16.200 a sufficiently large and well trained neural network can approximate 182 00:09:16.240 --> 00:09:18.159 that function, no matter how complex it is. 183 00:09:18.399 --> 00:09:21.879 Wow, that's powerful. What's a major advantage of neural networks 184 00:09:21.919 --> 00:09:25.399 compared to other, maybe more traditional machine learning models? 185 00:09:25.519 --> 00:09:28.879 One huge advantage is their ability to act as feature 186 00:09:28.559 --> 00:09:30.960 extractors. Feature extractors, meaning? 187 00:09:30.879 --> 00:09:34.360 So many traditional ML models need you to carefully 188 00:09:34.399 --> 00:09:39.320 preprocess the data and manually engineer meaningful features first, like 189 00:09:39.799 --> 00:09:44.200 calculating specific ratios or identifying certain shapes beforehand. 190 00:09:44.320 --> 00:09:45.679 A lot of human effort upfront. 191 00:09:45.960 --> 00:09:49.919 Right. Neural networks, especially deep ones with the right architecture, 192 00:09:50.159 --> 00:09:53.480 can often learn these important features directly from the raw 193 00:09:53.519 --> 00:09:57.200 input data themselves. They figure out what's important 194 00:09:56.799 --> 00:09:58.559 on their own. So they kind of learn how to 195 00:09:58.600 --> 00:10:00.519 see the important patterns? Exactly. 196 00:10:00.840 --> 00:10:04.679 That automatic feature extraction is incredibly powerful and saves a 197 00:10:04.759 --> 00:10:06.120 ton of manual work. 198 00:10:06.240 --> 00:10:09.480 Okay, this feature extraction is amazing. So we have these 199 00:10:09.519 --> 00:10:13.159 powerful networks. How do we actually teach them? What does 200 00:10:13.279 --> 00:10:14.919 training really involve? Right. 
201 00:10:14.960 --> 00:10:18.679 Training. So, training a model like this means iteratively updating 202 00:10:18.720 --> 00:10:21.399 its internal parameters, those connection weights and biases we 203 00:10:21.399 --> 00:10:22.799 mentioned. Adjusting the connections. 204 00:10:22.879 --> 00:10:25.480 Yep, we adjust them to find the configuration that best 205 00:10:25.480 --> 00:10:28.519 solves the problem, the one that minimizes errors. And we 206 00:10:28.600 --> 00:10:30.960 measure error using a loss function. 207 00:10:30.840 --> 00:10:32.720 A score for how wrong the model is. 208 00:10:33.120 --> 00:10:36.279 Pretty much. It measures the difference between the model's predictions 209 00:10:36.360 --> 00:10:39.600 and the actual right answers. The tricky part is that 210 00:10:39.639 --> 00:10:44.080 the landscape defined by this loss function is usually incredibly complex, 211 00:10:44.399 --> 00:10:45.159 lots of hills and 212 00:10:45.200 --> 00:10:48.799 valleys. So we can't just instantly find the lowest point, 213 00:10:48.840 --> 00:10:50.879 the best solution. We have to kind of search for it. 214 00:10:51.120 --> 00:10:54.080 You've got it. We can't just jump there. We use 215 00:10:54.120 --> 00:10:58.120 an iterative method, and the main technique, the absolute workhorse, 216 00:10:58.639 --> 00:10:59.639 is gradient descent. 217 00:11:00.080 --> 00:11:02.000 Gradient descent. Okay, how does that work? 218 00:11:02.200 --> 00:11:05.639 Imagine that loss landscape again, like mountains and valleys. You 219 00:11:05.679 --> 00:11:07.799 want to get to the lowest point in a valley. 220 00:11:08.000 --> 00:11:12.279 Gradient descent calculates the slope at your current position, the 221 00:11:12.320 --> 00:11:15.759 gradient. The direction of steepest slope? Exactly. 
It tells you 222 00:11:15.799 --> 00:11:18.120 which way is downhill, so you take a small step in 223 00:11:18.159 --> 00:11:22.960 that direction, recalculate the slope, and repeat. Step by step downhill. 224 00:11:23.000 --> 00:11:25.360 And the learning rate, that controls how big those steps 225 00:11:25.120 --> 00:11:28.600 are? Precisely. It's a critical hyperparameter. It regulates the 226 00:11:28.639 --> 00:11:31.519 size of each step down the slope. Choosing the right 227 00:11:31.600 --> 00:11:34.799 learning rate is, well, it's often called more of an 228 00:11:34.879 --> 00:11:35.679 art than a science. 229 00:11:35.919 --> 00:11:36.759 Tricky to get right. 230 00:11:37.039 --> 00:11:39.440 Yeah. Too large a step and you might overshoot the 231 00:11:39.480 --> 00:11:41.799 valley bottom and bounce around or even climb back up. 232 00:11:42.039 --> 00:11:44.000 Too small and training takes 233 00:11:43.799 --> 00:11:45.639 forever. Crawling towards the solution. 234 00:11:45.919 --> 00:11:49.519 Right. So developers often use strategies where the learning rate 235 00:11:49.639 --> 00:11:53.240 changes during training, maybe starting larger and getting smaller over time. 236 00:11:53.320 --> 00:11:56.240 And there are different flavors of gradient descent, right, depending 237 00:11:56.240 --> 00:11:58.159 on how much data you use for each step? 238 00:11:58.320 --> 00:12:02.440 Indeed. There's batch gradient descent, which uses the entire 239 00:12:02.519 --> 00:12:05.759 data set to calculate the gradient for each single step. 240 00:12:06.000 --> 00:12:07.480 Sounds accurate, but slow. 241 00:12:07.559 --> 00:12:11.039 Very accurate direction, but totally impractical for the huge data 242 00:12:11.039 --> 00:12:15.440 sets we use today. Then there's stochastic gradient descent, SGD, 243 00:12:15.679 --> 00:12:18.120 which uses just one single example 244 00:12:17.679 --> 00:12:18.399 for each update. 
245 00:12:18.600 --> 00:12:20.399 Much faster, but maybe noisy. 246 00:12:20.559 --> 00:12:23.919 Exactly. Faster updates, but the path can be really erratic. 247 00:12:24.039 --> 00:12:27.120 The industry standard really is mini-batch gradient descent. 248 00:12:27.159 --> 00:12:28.480 The best of both worlds. 249 00:12:28.639 --> 00:12:32.440 Pretty much. It uses small subsets, or mini-batches, of data 250 00:12:32.480 --> 00:12:35.799 for each update. It's a great compromise: faster than batch, 251 00:12:35.879 --> 00:12:37.600 more stable than pure SGD. 252 00:12:38.000 --> 00:12:43.440 Okay. Now, beyond basic gradient descent, there are more advanced 253 00:12:43.639 --> 00:12:45.600 optimization algorithms. What do they add? 254 00:12:45.879 --> 00:12:49.799 They significantly improve training, making it faster and often leading 255 00:12:49.840 --> 00:12:53.679 to better results. A classic is momentum. Like in physics? 256 00:12:54.039 --> 00:12:57.279 Kind of. It helps the optimization process gain momentum as 257 00:12:57.279 --> 00:13:00.840 it goes downhill, smoothing out oscillations and helping it power 258 00:13:00.879 --> 00:13:03.000 through small bumps or flat areas 259 00:13:02.679 --> 00:13:04.399 faster. So it doesn't get stuck easily. 260 00:13:04.559 --> 00:13:08.120 Right. And then there's Adam, adaptive moment estimation. This one 261 00:13:08.159 --> 00:13:11.559 is hugely popular. It's an adaptive learning rate method. 262 00:13:11.879 --> 00:13:15.080 It actually maintains a separate learning rate for each individual 263 00:13:15.120 --> 00:13:16.240 parameter in the network. 264 00:13:16.279 --> 00:13:19.159 Wow, okay. Tailored step sizes. Yeah. 265 00:13:19.279 --> 00:13:22.039 It adapts the step size based on how frequently a 266 00:13:22.080 --> 00:13:25.799 feature associated with that parameter occurs. It often converges much 267 00:13:25.840 --> 00:13:29.039 faster and works well across a wide range of problems. 
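As a rough illustration of the ideas just described, here is a sketch of mini-batch gradient descent with classical momentum, fitting a straight line in plain Python. Everything in it is an assumption made for the example: the function name `fit_line_minibatch_momentum`, the toy data generated from y = 2x + 1, and the hyperparameter values. It is not how TensorFlow implements its optimizers, and it does not cover Adam.

```python
import random


def fit_line_minibatch_momentum(xs, ys, lr=0.05, momentum=0.9,
                                batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by minimizing mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # parameters to learn
    vw, vb = 0.0, 0.0    # momentum "velocity" for each parameter
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE loss, averaged over this mini-batch only.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Keep a fraction of the previous step, then step downhill.
            vw = momentum * vw - lr * gw
            vb = momentum * vb - lr * gb
            w += vw
            b += vb
    return w, b


# Noise-free toy data on the line y = 2x + 1.
xs = [i / 10 for i in range(-10, 11)]
ys = [2 * x + 1 for x in xs]
w, b = fit_line_minibatch_momentum(xs, ys)
print(w, b)
```

Setting `batch_size=len(xs)` turns this into batch gradient descent, and `batch_size=1` into pure SGD, so the three flavors differ only in how many examples feed each gradient estimate. The learning rate `lr` is the step-size knob from the discussion above.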
268 00:13:29.360 --> 00:13:31.080 Many people start with Adam. And 269 00:13:31.000 --> 00:13:35.679 all these complex gradient calculations, finding the slope for potentially 270 00:13:35.759 --> 00:13:41.759 millions of parameters, that's handled by backpropagation and automatic differentiation? 271 00:13:41.679 --> 00:13:44.879 Correct. Those are the engines that make training feasible. 272 00:13:44.919 --> 00:13:48.799 Backpropagation is the algorithm for efficiently calculating all those gradients, 273 00:13:48.879 --> 00:13:53.840 layer by layer, working backward from the loss. Automatic differentiation 274 00:13:53.960 --> 00:13:58.120 is the underlying mechanism that frameworks use to compute derivatives 275 00:13:57.679 --> 00:14:01.240 automatically. So they handle the heavy calculus lifting? Exactly. 276 00:14:01.440 --> 00:14:04.360 They represent the network's math as a computational graph and 277 00:14:04.399 --> 00:14:07.200 efficiently figure out how changes in each weight affect the 278 00:14:07.240 --> 00:14:09.840 final loss, thousands or millions of times. 279 00:14:10.159 --> 00:14:14.320 Okay, all this theory is fantastic: neural networks, training, optimizers. 280 00:14:14.840 --> 00:14:17.279 But how do we actually build and train these systems 281 00:14:17.279 --> 00:14:20.279 in practice? That's where frameworks like TensorFlow come in, right? 282 00:14:20.759 --> 00:14:23.480 And the source mentions a big shift from TensorFlow 1.x 283 00:14:23.480 --> 00:14:24.600 to 2.0. 284 00:14:24.600 --> 00:14:27.200 Oh, absolutely. TensorFlow is key, and yes, the shift from 285 00:14:27.240 --> 00:14:29.360 1.x to 2.0 was massive, a 286 00:14:29.399 --> 00:14:30.879 really big deal for usability. 287 00:14:30.919 --> 00:14:32.720 What was the old way like, in 1.x? 288 00:14:32.720 --> 00:14:35.519 In TensorFlow 1.x, you had this two-stage process. 
289 00:14:35.559 --> 00:14:38.840 First you had to define a static computational graph, like 290 00:14:39.000 --> 00:14:41.200 drawing a complete blueprint of all the math 291 00:14:41.039 --> 00:14:43.320 operations. Laying it all out beforehand. 292 00:14:43.039 --> 00:14:46.480 Exactly, and then you'd execute that graph separately using something 293 00:14:46.480 --> 00:14:50.039 called a tf.Session. It was powerful, for sure, but 294 00:14:50.159 --> 00:14:53.960 it felt less like Python, more like Python was just 295 00:14:54.120 --> 00:14:58.960 controlling a separate C++ engine. Debugging was notoriously 296 00:14:58.200 --> 00:15:01.039 painful. Right, I remember hearing that. TensorFlow 2.0 297 00:15:01.240 --> 00:15:03.159 changed this dramatically? Hugely. 298 00:15:03.320 --> 00:15:07.279 TensorFlow 2.0 embraced eager execution by default. Eager 299 00:15:07.320 --> 00:15:10.960 execution meaning operations run immediately, just like regular Python code. 300 00:15:11.000 --> 00:15:15.320 You define something, it runs. No separate session execution step needed. 301 00:15:15.600 --> 00:15:17.759 Ah, much more interactive. Way more interactive. 302 00:15:17.759 --> 00:15:21.120 It made debugging vastly simpler, and the whole development process 303 00:15:21.159 --> 00:15:25.080 feel much more natural, much more Pythonic. And crucially, TF 304 00:15:25.120 --> 00:15:28.720 2.0 adopted Keras as its official high-level API. 305 00:15:29.000 --> 00:15:30.639 Keras. I've heard that name a lot. Yeah. 306 00:15:30.720 --> 00:15:34.320 Keras is basically a specification and interface for defining and 307 00:15:34.399 --> 00:15:38.759 training models. tf.keras is TensorFlow's complete implementation of it. 308 00:15:38.759 --> 00:15:42.159 It makes building complex models much more straightforward. 
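To make the contrast with the session-based workflow concrete, here is a minimal sketch assuming TensorFlow 2.x is installed. The first half shows eager execution: the matrix product is computed as soon as the line runs. The second half defines a small `tf.keras` classifier; the layer sizes and the 28x28 input shape (chosen to match the Fashion-MNIST example mentioned earlier) are illustrative assumptions, not something taken from the book.

```python
import tensorflow as tf

# Eager execution: this runs immediately and yields concrete values,
# with no graph-building step and no session.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
product = tf.matmul(x, x)
print(product.numpy())

# A small tf.keras classifier for 28x28 grayscale images, ten classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),                       # 28 * 28 = 784 features
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

Training would then be a call to `model.fit` with image and label arrays; it is left out here so the sketch runs without downloading any data.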
309 00:15:41.879 --> 00:15:44.159 So Keras kind of hides some of that lower-level 310 00:15:44.320 --> 00:15:46.799 graph complexity for you. Lets you focus on the layers 311 00:15:46.799 --> 00:15:48.320 and the architecture. 312 00:15:48.080 --> 00:15:51.480 You've got it. With TF 2.0 and Keras, 313 00:15:51.559 --> 00:15:56.120 you're mostly thinking in terms of Python objects, layers, models, 314 00:15:56.120 --> 00:15:59.759 not manually managing graphs and sessions. Keras handles a lot 315 00:15:59.799 --> 00:16:01.879 of that complexity under the hood. But you 316 00:16:01.879 --> 00:16:03.559 still get the performance benefits? 317 00:16:03.720 --> 00:16:06.879 Yes, because for performance-critical parts you can use the 318 00:16:07.000 --> 00:16:10.080 @tf.function decorator. This is part of something 319 00:16:10.120 --> 00:16:14.039 called AutoGraph. It automatically converts your Python code back into 320 00:16:14.080 --> 00:16:16.480 a high-performance TensorFlow graph behind the scenes. 321 00:16:16.759 --> 00:16:21.159 So best of both worlds: easy development, fast execution. Exactly. 322 00:16:21.679 --> 00:16:24.879 Especially helpful for really deep or complex models where graph 323 00:16:24.879 --> 00:16:26.120 performance matters most. 324 00:16:26.480 --> 00:16:29.679 Now, getting data into these models efficiently, that can be 325 00:16:29.720 --> 00:16:32.399 a real bottleneck, right? Yeah, especially with huge data sets. 326 00:16:32.399 --> 00:16:33.600 How does TensorFlow help there? 327 00:16:33.720 --> 00:16:37.240 That's where the tf.data.Dataset object is absolutely brilliant. 328 00:16:37.320 --> 00:16:41.080 It's an API designed specifically for building highly efficient input pipelines. 329 00:16:41.200 --> 00:16:44.200 Input pipelines, like assembly lines for data? Kind of. Yeah. 
330 00:16:44.240 --> 00:16:47.960 It handles everything: extracting raw data from wherever it lives, 331 00:16:48.039 --> 00:16:52.679 transforming it, maybe resizing images, applying data augmentation, batching it up, 332 00:16:52.720 --> 00:16:54.799 and then loading it efficiently for the model. It's like 333 00:16:54.840 --> 00:16:58.039 a specialized ETL process for machine learning. 334 00:16:57.879 --> 00:17:00.720 ETL: extract, transform, load. 335 00:17:00.799 --> 00:17:05.519 Right, and crucially, tf.data offers key performance optimizations, 336 00:17:05.720 --> 00:17:09.480 things like prefetching. Prefetching? It lets the data preparation on 337 00:17:09.519 --> 00:17:12.079 the CPU happen at the same time as the model 338 00:17:12.119 --> 00:17:15.000 training on the GPU or TPU, so the GPU isn't 339 00:17:15.039 --> 00:17:16.960 sitting idle waiting for the next 340 00:17:16.839 --> 00:17:19.359 batch. Keeping the expensive hardware 341 00:17:18.960 --> 00:17:20.960 busy. Exactly. And caching? 342 00:17:21.160 --> 00:17:23.359 It can store the processed data in memory or on 343 00:17:23.480 --> 00:17:25.720 disk after the first pass through the data set, the 344 00:17:25.759 --> 00:17:27.480 first epoch. 345 00:17:26.920 --> 00:17:30.119 So subsequent epochs are much faster. No slow disk 346 00:17:30.000 --> 00:17:33.480 reading. Precisely. These optimizations can make a massive difference in 347 00:17:33.519 --> 00:17:35.920 training time, especially with large data sets. 348 00:17:36.160 --> 00:17:38.680 Is there a way to manage the whole ML pipeline, 349 00:17:38.720 --> 00:17:41.720 sort of end to end, within TensorFlow, beyond just the 350 00:17:41.799 --> 00:17:42.720 data part? Yes. 351 00:17:42.960 --> 00:17:46.200 For a more structured approach, TensorFlow offers the tf.estimator 352 00:17:46.400 --> 00:17:49.759 API. 
Think of it as a higher-level framework 353 00:17:49.799 --> 00:17:53.920 that encapsulates a lot of the standard, often repetitive parts 354 00:17:53.960 --> 00:17:55.200 of an ML workflow. 355 00:17:55.480 --> 00:17:56.839 What kind of repetitive 356 00:17:56.400 --> 00:18:01.480 parts? Things like building the graph, correctly initializing variables, handling 357 00:18:01.519 --> 00:18:06.079 the data loading loop, dealing with exceptions gracefully, creating checkpoints 358 00:18:06.119 --> 00:18:07.119 to save your progress. 359 00:18:07.240 --> 00:18:09.200 Ah, so you don't have to write all that boilerplate 360 00:18:09.200 --> 00:18:10.160 code yourself. 361 00:18:10.039 --> 00:18:14.920 Exactly. It also handles saving summaries for visualization tools like TensorBoard. 362 00:18:15.319 --> 00:18:19.559 It really simplifies development and enforces good practices, especially useful 363 00:18:19.599 --> 00:18:22.599 when you're scaling up to run on multiple machines or devices. 364 00:18:22.839 --> 00:18:24.480 Okay, let's shift gears a bit and talk about some 365 00:18:24.480 --> 00:18:28.319 advanced applications. Image classification is common, but what if you 366 00:18:28.400 --> 00:18:31.319 don't have a ton of labeled images for your specific task? 367 00:18:31.440 --> 00:18:32.079 Great question. 368 00:18:32.640 --> 00:18:36.000 That's where transfer learning comes in, and it's incredibly powerful. 369 00:18:36.039 --> 00:18:38.039 It's a huge time and resource saver. 370 00:18:38.279 --> 00:18:40.400 Transfer learning. Transferring 371 00:18:39.839 --> 00:18:43.759 knowledge? Exactly. Instead of training a big, complex convolutional neural 372 00:18:43.759 --> 00:18:47.440 network from scratch, which needs massive data sets like 373 00:18:47.200 --> 00:18:49.240 ImageNet, which has millions of images. 374 00:18:48.920 --> 00:18:52.000 Right, over fifteen million, yeah, across thousands of categories. 
Most 375 00:18:52.079 --> 00:18:55.240 people don't have that kind of data for their specific problem. 376 00:18:55.640 --> 00:18:59.160 So with transfer learning, we reuse parts of a model 377 00:18:59.319 --> 00:19:02.039 that was already trained on ImageNet or a similar 378 00:19:02.240 --> 00:19:03.480 large data set. 379 00:19:03.799 --> 00:19:07.440 So we're basically borrowing a pre-trained brain that already 380 00:19:07.440 --> 00:19:11.519 knows how to see general features in images, like edges, textures, 381 00:19:11.559 --> 00:19:12.279 basic shapes. 382 00:19:12.359 --> 00:19:16.119 It's a perfect analogy. It's already learned those fundamental visual patterns, 383 00:19:16.640 --> 00:19:19.160 so we can take that pre-trained part, often called 384 00:19:19.160 --> 00:19:22.200 the base model or feature extractor, like layers from a 385 00:19:22.240 --> 00:19:25.160 model called Inception v3. Okay. Freeze its weights so 386 00:19:25.200 --> 00:19:28.279 they don't change, and just add a new small classification 387 00:19:28.440 --> 00:19:30.640 layer on top that we train on our own smaller 388 00:19:30.720 --> 00:19:31.160 data set. 389 00:19:31.279 --> 00:19:33.359 Ah, so you only train the last little bit? Right. 390 00:19:33.720 --> 00:19:37.559 This dramatically speeds up training and really helps prevent overfitting, 391 00:19:37.640 --> 00:19:40.359 especially when you have limited data, because the bulk of 392 00:19:40.400 --> 00:19:43.559 the model already understands images well. And TensorFlow Hub 393 00:19:43.559 --> 00:19:46.119 makes this super easy, as you often just need the 394 00:19:46.359 --> 00:19:48.920 URL of the pre-trained model on TF Hub and 395 00:19:48.960 --> 00:19:51.559 you can load it directly as a Keras layer. It's 396 00:19:51.599 --> 00:19:52.680 incredibly convenient. 397 00:19:53.240 --> 00:19:56.200 That's fantastic. 
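The freeze-the-base, train-a-new-head recipe described above might look roughly like this in `tf.keras`. This is a sketch under assumptions, not the book's code: `build_transfer_model` is a made-up helper name, the head (global average pooling plus one dense layer) is just one common choice, and the commented usage swaps in `tf.keras.applications.InceptionV3` for the TensorFlow Hub route mentioned in the conversation, with a hypothetical five-class task.

```python
import tensorflow as tf


def build_transfer_model(base_model, num_classes):
    """Freeze a pre-trained feature extractor and add a small trainable head."""
    base_model.trainable = False  # frozen: its weights will not be updated
    inputs = tf.keras.Input(shape=base_model.input_shape[1:])
    # training=False keeps layers such as batch normalization in inference mode.
    features = base_model(inputs, training=False)
    pooled = tf.keras.layers.GlobalAveragePooling2D()(features)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)
    return tf.keras.Model(inputs, outputs)


# Hypothetical usage with ImageNet weights (downloaded on first use):
# base = tf.keras.applications.InceptionV3(
#     weights="imagenet", include_top=False, input_shape=(299, 299, 3))
# model = build_transfer_model(base, num_classes=5)
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Only the final dense layer's kernel and bias are trainable in the resulting model, which is why training on a small data set is fast and less prone to overfitting.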
But what if our new data set is 398 00:19:56.559 --> 00:20:01.119 maybe similar but not exactly like ImageNet, or maybe we 399 00:20:01.160 --> 00:20:03.400 do have a decent amount of new data? Is freezing 400 00:20:03.400 --> 00:20:05.160 the whole base model always best? 401 00:20:05.359 --> 00:20:08.240