WEBVTT 1 00:00:00.120 --> 00:00:04.080 Welcome to the deep dive. Today we're taking a shortcut, 2 00:00:04.120 --> 00:00:07.360 really, to understanding deep learning. It's everywhere, right? Really is. 3 00:00:07.519 --> 00:00:10.240 So we've got some excerpts here from Deep Learning with Python, 4 00:00:10.800 --> 00:00:14.000 and basically our mission is to pull out the core ideas: 5 00:00:14.119 --> 00:00:17.920 what's it doing, how does it work fundamentally, and maybe 6 00:00:18.000 --> 00:00:18.760 where it's headed. 7 00:00:18.920 --> 00:00:22.600 Yeah, getting the essence without, you know, needing a PhD 8 00:00:22.679 --> 00:00:26.160 in maths. Exactly, avoid the overwhelm. And the book itself 9 00:00:26.239 --> 00:00:29.160 really tries to make it accessible. It pushes back against 10 00:00:29.160 --> 00:00:32.359 this idea that deep learning is some kind of, like, 11 00:00:32.479 --> 00:00:37.240 dark art. It highlights how Python and TensorFlow 2 specifically, 12 00:00:37.560 --> 00:00:41.320 plus the Keras community, how all that has made it 13 00:00:41.359 --> 00:00:42.640 practical for way more people. 14 00:00:42.759 --> 00:00:45.560 Right. So we want to give you, listening, a clear 15 00:00:45.679 --> 00:00:47.679 sense of what it's good for. Its limits too. 16 00:00:47.719 --> 00:00:50.560 Right, absolutely, and the sort of standard steps people take 17 00:00:50.600 --> 00:00:52.840 to solve problems with it, you know, from computer vision 18 00:00:52.840 --> 00:00:53.719 to language stuff. 19 00:00:53.840 --> 00:00:55.600 And the author talks about how fast it's all 20 00:00:55.479 --> 00:00:56.880 moving. Oh, incredibly fast. Yeah. 21 00:00:56.960 --> 00:00:58.840 Yeah. So for you listening, if you want to get 22 00:00:58.840 --> 00:01:01.439 a handle on complex topics pretty quickly, maybe for work, 23 00:01:01.479 --> 00:01:04.040 maybe just because you're curious, this is aimed at giving 24 00:01:04.079 --> 00:01:06.000 you that solid foundation. 
25 00:01:05.719 --> 00:01:08.400 Core ideas, the impact. 26 00:01:07.920 --> 00:01:11.840 We're looking for those aha moments, keeping it focused, so 27 00:01:11.879 --> 00:01:14.480 you walk away feeling like, okay, I get the big picture 28 00:01:14.519 --> 00:01:16.480 now. Sounds good. Where should we start? 29 00:01:16.599 --> 00:01:19.719 Okay, let's unpack it. What is deep learning, and how 30 00:01:19.719 --> 00:01:22.359 does it fit in with, you know, AI and machine learning? 31 00:01:22.400 --> 00:01:24.280 Those terms get thrown around a lot. They do. 32 00:01:24.359 --> 00:01:28.519 It's a good starting point. So AI, artificial intelligence, in 33 00:01:28.560 --> 00:01:32.319 the really broad sense, is just automating tasks that usually 34 00:01:32.359 --> 00:01:33.599 need human smarts. 35 00:01:33.760 --> 00:01:34.079 Okay. 36 00:01:34.159 --> 00:01:37.079 It's a huge field actually, and older than many people think. 37 00:01:37.519 --> 00:01:41.319 Early AI, sometimes called symbolic AI, was very different. 38 00:01:41.840 --> 00:01:44.879 It was more about programmers writing down tons and tons 39 00:01:44.879 --> 00:01:48.519 of rules by hand, building these big knowledge databases. The 40 00:01:48.560 --> 00:01:51.680 computer wasn't really learning from experience in the way we 41 00:01:51.719 --> 00:01:52.280 think of now. 42 00:01:52.400 --> 00:01:56.599 Ah, so no actual learning, just following pre-written instructions, 43 00:01:56.599 --> 00:01:58.920 basically like those old chess programs. 44 00:01:58.680 --> 00:02:02.000 Exactly like that, just rules. Machine learning then emerged as 45 00:02:02.040 --> 00:02:05.040 its own thing, where the focus shifted. The idea became: 46 00:02:05.599 --> 00:02:09.159 can we build programs, models, that learn from data? 47 00:02:09.280 --> 00:02:10.879 Okay, that sounds more familiar. 48 00:02:10.960 --> 00:02:13.960 Yeah. 
The model finds patterns in the data itself, makes predictions, 49 00:02:13.960 --> 00:02:18.759 makes decisions, all without programmers explicitly telling it every single rule. 50 00:02:18.879 --> 00:02:21.000 Right, and deep learning? Where does that fit? 51 00:02:21.199 --> 00:02:25.599 Deep learning is a subfield within machine learning. Its defining 52 00:02:25.680 --> 00:02:29.719 characteristic is using these multi-stage ways of learning representations 53 00:02:29.719 --> 00:02:30.319 of the data. 54 00:02:30.400 --> 00:02:31.159 Multi-stage? 55 00:02:31.280 --> 00:02:34.520 Yeah, think of processing the data through many layers. Each 56 00:02:34.639 --> 00:02:38.439 layer learns to represent the data in a slightly more complex, 57 00:02:38.520 --> 00:02:41.159 more useful way, based on the layer before it. 58 00:02:41.360 --> 00:02:44.360 Okay, so breaking it down, not trying to learn everything 59 00:02:44.400 --> 00:02:47.360 in one giant leap. The book uses three figures, right, 60 00:02:47.479 --> 00:02:48.719 to explain how it works? 61 00:02:48.800 --> 00:02:50.840 It does. Yeah, it's a good way to picture it. First, 62 00:02:50.960 --> 00:02:54.960 the basic idea: deep learning maps inputs to targets. It 63 00:02:55.039 --> 00:02:57.199 learns this mapping by just looking at lots and lots 64 00:02:57.240 --> 00:02:57.719 of examples. 65 00:02:57.719 --> 00:03:00.280 Well, show it cat pictures and dog pictures. 66 00:03:00.240 --> 00:03:02.879 Exactly, and tell it which is which. That's the input, 67 00:03:03.400 --> 00:03:08.000 the image, and the target, the label: cat or dog. Second, 68 00:03:08.360 --> 00:03:11.719 this mapping isn't direct. The data flows through a deep 69 00:03:11.800 --> 00:03:15.520 sequence of simple transformations. The layers you mentioned. Precisely. These 70 00:03:15.599 --> 00:03:18.400 layers are like steps in an assembly line. Each one 71 00:03:18.439 --> 00:03:21.199 does something relatively simple to the data it receives. 
And 72 00:03:21.319 --> 00:03:25.560 the third point, crucially: these transformations, these operations the layers 73 00:03:25.599 --> 00:03:29.240 perform, they aren't hand-coded by a programmer. The model 74 00:03:29.319 --> 00:03:32.479 learns what transformations are useful by seeing all those examples 75 00:03:32.520 --> 00:03:33.120 during training. 76 00:03:33.199 --> 00:03:36.479 Okay, learned transformations. That feels like the core of it, 77 00:03:36.520 --> 00:03:39.800 doesn't it? It figures out what features matter on its own. 78 00:03:39.879 --> 00:03:40.479 That's the magic. 79 00:03:40.560 --> 00:03:43.360 Yeah. And this figuring out happens in what the book 80 00:03:43.400 --> 00:03:46.159 calls the training loop. Can you walk us through that? 81 00:03:46.199 --> 00:03:46.919 What's happening there? 82 00:03:46.960 --> 00:03:50.599 Okay, the training loop. So when you first create a network, 83 00:03:51.199 --> 00:03:54.199 its internal settings, these numbers called weights, are just 84 00:03:54.240 --> 00:03:56.680 set randomly, small random numbers 85 00:03:56.400 --> 00:03:59.520 usually. So it knows nothing, basically. Its first guesses are 86 00:03:59.520 --> 00:04:00.120 wild. 87 00:04:00.319 --> 00:04:02.879 Pretty much guaranteed to be wrong, yeah, which means it 88 00:04:02.879 --> 00:04:06.199 will have a high loss score. The loss is just 89 00:04:06.240 --> 00:04:08.919 a number that measures how far off the network's predictions 90 00:04:08.919 --> 00:04:12.800 are from the actual targets. High loss means very wrong. 91 00:04:12.719 --> 00:04:14.759 Like static on a radio you haven't tuned 92 00:04:14.520 --> 00:04:17.399 yet. Good analogy, lots of static initially. But then for 93 00:04:17.439 --> 00:04:19.639 each example you show it during training, 94 00:04:19.519 --> 00:04:21.480 Like one cat picture? Right. 95 00:04:21.360 --> 00:04:24.160 It makes a prediction. 
Yeah, it calculates the loss, how 96 00:04:24.199 --> 00:04:26.560 wrong it was for that picture. And then comes the 97 00:04:26.560 --> 00:04:30.800 clever part, using calculus, specifically the gradient. 98 00:04:30.480 --> 00:04:33.439 Gradient sounds technical. It is a 99 00:04:33.319 --> 00:04:36.480 bit, but think of it like this. The gradient tells 100 00:04:36.519 --> 00:04:39.360 you the direction of steepest increase in the loss, like 101 00:04:39.720 --> 00:04:41.000 which way is more wrong. 102 00:04:41.319 --> 00:04:41.720 Okay. 103 00:04:42.000 --> 00:04:45.519 So the optimization algorithm, usually some form of gradient descent, 104 00:04:46.079 --> 00:04:48.879 takes that information and adjusts the weights slightly in the 105 00:04:48.879 --> 00:04:51.839 opposite direction, the direction that would have made the loss 106 00:04:51.879 --> 00:04:53.959 a tiny bit smaller for that one example. 107 00:04:54.160 --> 00:04:57.800 Ah, so it nudges the weights downhill, towards less error. 108 00:04:58.000 --> 00:05:00.959 Exactly, it takes a small step downhill on the error landscape. 109 00:05:00.959 --> 00:05:04.480 And it does this over and over, for every example? Yep. 110 00:05:04.680 --> 00:05:08.319 For every example in your training data, usually in small batches, 111 00:05:08.839 --> 00:05:11.399 and you repeat this process over the entire data set 112 00:05:11.519 --> 00:05:14.519 multiple times. Each full pass through the data set is 113 00:05:14.560 --> 00:05:17.000 called an epoch. Epoch, got it. And as you go 114 00:05:17.040 --> 00:05:20.240 through more and more epochs, tweaking the weights after each batch, 115 00:05:20.360 --> 00:05:22.920 the overall loss score gradually goes 116 00:05:22.800 --> 00:05:24.439 down. The static clears up. 117 00:05:24.600 --> 00:05:27.439 Right. 
A well-trained network is one where the loss 118 00:05:27.480 --> 00:05:31.560 is very low, meaning its predictions are consistently close to 119 00:05:31.600 --> 00:05:32.959 the actual target values. 120 00:05:33.079 --> 00:05:35.800 Okay, that makes a lot of sense. Learning from mistakes, 121 00:05:36.000 --> 00:05:39.480 step by tiny step. Now, the book also mentions other 122 00:05:39.560 --> 00:05:43.920 machine learning algorithms, kind of for context: logistic regression, SVMs, 123 00:05:44.120 --> 00:05:46.040 random forests. Why bring those up? 124 00:05:46.639 --> 00:05:48.959 It helps to see where deep learning fits in the 125 00:05:48.959 --> 00:05:52.079 broader picture. These are really important tools in what you 126 00:05:52.160 --> 00:05:55.959 might call classical or shallow machine learning. Shallow? Yeah, generally 127 00:05:56.000 --> 00:05:59.000 meaning they don't have that deep, multi-layered structure for 128 00:05:59.160 --> 00:06:03.040 learning representations. Logistic regression, for instance, it's pretty simple, but 129 00:06:03.079 --> 00:06:06.040 still super useful for classification. It's often the first thing 130 00:06:06.040 --> 00:06:07.959 you try, like the hello world 131 00:06:07.759 --> 00:06:11.399 of ML. Okay, and SVMs, support vector machines? 132 00:06:12.160 --> 00:06:15.360 SVMs try to find the best possible boundary, like a 133 00:06:15.399 --> 00:06:18.439 line or a plane, to separate different classes in your data. 134 00:06:18.800 --> 00:06:21.560 They have this neat mathematical trick called the kernel 135 00:06:21.240 --> 00:06:23.319 trick. Ooh, kernel trick sounds 136 00:06:23.360 --> 00:06:27.079 fancy. It kind of is. It lets SVMs handle complex 137 00:06:27.360 --> 00:06:32.240 nonlinear separations without explicitly calculating coordinates in a super high- 138 00:06:32.240 --> 00:06:35.040 dimensional space. It's computationally clever. 139 00:06:35.160 --> 00:06:39.160 Hmm. 
Interesting. Maybe for another deep dive. What about random 140 00:06:39.199 --> 00:06:40.720 forests and gradient boosting? 141 00:06:40.959 --> 00:06:44.680 Both are ensemble methods. They combine predictions from many simpler models. 142 00:06:45.120 --> 00:06:48.279 Random forests build lots of decision trees on different subsets 143 00:06:48.319 --> 00:06:51.240 of the data and features, then average their outputs or 144 00:06:51.279 --> 00:06:52.360 take a majority 145 00:06:52.079 --> 00:06:54.720 vote. Like wisdom of the crowd, but for trees. 146 00:06:54.560 --> 00:06:58.160 Sort of, yeah. They're often really strong performers, very robust. 147 00:06:58.480 --> 00:07:01.800 Gradient boosting machines are also ensemble methods, but they build 148 00:07:01.839 --> 00:07:05.079 trees sequentially. Each new tree tries to correct the errors 149 00:07:05.079 --> 00:07:06.279 made by the trees that came before 150 00:07:06.319 --> 00:07:09.040 it. Oh, interesting, like building on previous mistakes. 151 00:07:08.680 --> 00:07:14.560 Exactly, and they often slightly outperform random forests, though they 152 00:07:14.560 --> 00:07:17.279 can be a bit more sensitive to tuning. But again, 153 00:07:17.319 --> 00:07:19.759 these are generally considered shallow compared 154 00:07:19.360 --> 00:07:21.920 to deep learning. Right, they don't have that automatic, multi- 155 00:07:22.000 --> 00:07:25.639 layered feature learning. So that brings us back: what is 156 00:07:25.680 --> 00:07:29.680 it about deep learning that's so transformative? What's the key difference? 157 00:07:29.879 --> 00:07:33.199 I think the biggest thing is its ability to learn 158 00:07:33.360 --> 00:07:38.079 all the layers of representation jointly, simultaneously. Jointly? As opposed... 159 00:07:38.160 --> 00:07:41.160 As opposed to traditional approaches where you might have separate steps. 160 00:07:41.319 --> 00:07:44.720 Like, first you'd manually engineer some features from the 
161 00:07:44.759 --> 00:07:47.879 raw data, like counting specific words in text or finding 162 00:07:48.000 --> 00:07:49.319 edges in an image. Exactly. 163 00:07:49.399 --> 00:07:52.040 You'd do that feature engineering, and then you'd feed those 164 00:07:52.079 --> 00:07:55.480 engineered features into a classifier like an SVM or a 165 00:07:55.519 --> 00:07:57.959 logistic regression. Deep learning does it all in one go. 166 00:07:58.040 --> 00:08:00.480 The network learns the best features and how to classify 167 00:08:00.560 --> 00:08:01.639 based on them, all together. 168 00:08:01.800 --> 00:08:05.319 Ah, okay, and why is learning them jointly so powerful? 169 00:08:05.519 --> 00:08:07.839 Because the features can adapt to each other during learning. 170 00:08:08.399 --> 00:08:11.800 If one layer starts extracting a slightly different, maybe better 171 00:08:12.079 --> 00:08:16.079 type of feature, the layers above it can adjust automatically 172 00:08:16.240 --> 00:08:19.199 to make use of that improved representation. It's much more 173 00:08:19.240 --> 00:08:20.399 dynamic and integrated. 174 00:08:20.720 --> 00:08:24.519 So the features themselves evolve during training to be optimal 175 00:08:24.600 --> 00:08:25.160 for the task. 176 00:08:25.279 --> 00:08:27.519 That's a great way to put it. This allows deep 177 00:08:27.600 --> 00:08:32.200 learning to learn really complex, abstract concepts by breaking them down. 178 00:08:32.440 --> 00:08:34.919 You start with simple features at the bottom layers, like 179 00:08:35.080 --> 00:08:37.679 edges or textures in an image, and as you go 180 00:08:37.759 --> 00:08:40.799 up through the layers, the network combines these to learn 181 00:08:40.840 --> 00:08:44.480 more complex things, like object parts and eventually whole objects. 182 00:08:44.639 --> 00:08:47.720 Like building complex ideas from simpler blocks. That makes sense. 
183 00:08:48.120 --> 00:08:50.320 The book also touches on the pace of progress and 184 00:08:50.440 --> 00:08:54.320 mentions a sort of explosive phase. Where are we now, 185 00:08:54.360 --> 00:08:55.279 according to the author? 186 00:08:55.360 --> 00:08:57.840 Yeah, the author reflects on that period, maybe around 187 00:08:57.960 --> 00:09:02.600 2017, 2018, especially with transformer models revolutionizing language tasks. 188 00:09:03.200 --> 00:09:06.320 It felt like huge breakthroughs were happening constantly, like the 189 00:09:06.360 --> 00:09:07.720 steep early part of an S 190 00:09:07.720 --> 00:09:11.080 curve. Exponential growth, almost. Almost. 191 00:09:11.360 --> 00:09:13.639 But the feeling, at least when the book was written 192 00:09:13.639 --> 00:09:15.840 around 2021, was that we're probably in the 193 00:09:15.879 --> 00:09:19.159 second half of that S curve now. Meaning? Meaning progress 194 00:09:19.200 --> 00:09:23.159 is still definitely happening, and it's significant, but maybe the 195 00:09:23.200 --> 00:09:27.200 era of those absolutely fundamental, paradigm-shifting discoveries every few 196 00:09:27.200 --> 00:09:29.240 months is slowing down a bit. 197 00:09:29.519 --> 00:09:32.159 So more refinement, building on the existing foundations. 198 00:09:32.240 --> 00:09:36.960 That's the sense, yeah. More incremental, but still powerful progress, finding 199 00:09:37.000 --> 00:09:39.840 new ways to apply these incredibly strong foundations that have 200 00:09:39.840 --> 00:09:40.120 been laid. 201 00:09:40.200 --> 00:09:44.000 Okay, interesting perspective. Still moving fast, but maybe maturing. All right, 202 00:09:44.080 --> 00:09:46.480 let's get into the real nuts and bolts, the components. 203 00:09:46.559 --> 00:09:49.559 The book starts with tensors. What are they? Why are 204 00:09:49.639 --> 00:09:50.679 they the starting point? 
205 00:09:51.000 --> 00:09:55.080 Tensors are basically the containers for data in neural networks. 206 00:09:55.120 --> 00:09:57.639 You can think of them as generalizations of vectors and 207 00:09:57.679 --> 00:09:59.879 matrices to potentially higher dimensions. 208 00:10:00.159 --> 00:10:02.840 So like a number is a tensor, a list of 209 00:10:02.919 --> 00:10:04.080 numbers, a table? 210 00:10:03.919 --> 00:10:06.759 Exactly. A single number is a scalar, or rank-0 211 00:10:06.960 --> 00:10:09.600 tensor. A list of numbers, a vector, is a 212 00:10:09.639 --> 00:10:12.720 rank-1 tensor. A table of numbers, a matrix, is 213 00:10:12.720 --> 00:10:14.120 a rank-2 tensor. And you 214 00:10:14.080 --> 00:10:16.519 can have rank-3, rank-4, and so on. 215 00:10:16.639 --> 00:10:18.840 Yep, the rank just tells you how many axes or 216 00:10:18.879 --> 00:10:20.120 dimensions the tensor has. 217 00:10:20.240 --> 00:10:23.000 What defines a tensor then, besides the data 218 00:10:22.799 --> 00:10:26.320 itself? Two key things: its shape and its data type, 219 00:10:26.600 --> 00:10:28.960 or dtype. The shape tells you how many elements 220 00:10:29.000 --> 00:10:31.159 are along each axis, like a matrix might be shape 221 00:10:31.320 --> 00:10:33.559 (3, 5). The dtype tells you what kind of 222 00:10:33.639 --> 00:10:36.440 numbers are inside, like 32-bit floating point numbers 223 00:10:36.519 --> 00:10:37.159 or integers. 224 00:10:37.240 --> 00:10:41.000 Okay, can you give examples of real world data as tensors? Sure. 225 00:10:41.279 --> 00:10:44.960 Simple tabular data, like customer info, age, income, whatever, could 226 00:10:44.960 --> 00:10:48.440 be a rank-2 tensor: rows are customers, columns are features. 
Right. 227 00:10:48.679 --> 00:10:51.679 Time series data, like daily stock prices for several stocks, 228 00:10:51.759 --> 00:10:55.519 might be rank-3: stocks, time steps, features like open, 229 00:10:55.559 --> 00:10:59.559 high, low, close. Images are typically rank-4: number of images, 230 00:10:59.639 --> 00:11:02.480 height, width, color channels, usually three for RGB. 231 00:11:02.720 --> 00:11:04.200 More dimensions for images. 232 00:11:04.840 --> 00:11:07.360 Video adds another dimension for time or frames, making it 233 00:11:07.399 --> 00:11:10.200 rank-5: number of videos, frames, height, width, channels. 234 00:11:10.360 --> 00:11:13.039 Okay, I see how tensors provide this flexible structure for 235 00:11:13.080 --> 00:11:16.120 all sorts of data. So if tensors hold the data, 236 00:11:16.159 --> 00:11:19.919 what are the tensor operations the book mentions? The gears? 237 00:11:20.159 --> 00:11:22.879 These are the mathematical operations that the layers perform on 238 00:11:22.919 --> 00:11:26.440 the tensors. They're the calculations that transform the data 239 00:11:26.559 --> 00:11:28.120 as it flows through the network. 240 00:11:28.279 --> 00:11:29.600 Like, what kind of operations? 241 00:11:29.679 --> 00:11:33.200 Well, there are simple element-wise operations, where you do 242 00:11:33.240 --> 00:11:35.879 the same thing, like add, multiply, or apply a function, 243 00:11:36.399 --> 00:11:40.000 to each individual number in the tensor. There's broadcasting, which 244 00:11:40.039 --> 00:11:43.360 is a set of rules allowing operations between tensors of 245 00:11:43.639 --> 00:11:47.480 different but compatible shapes. It's very useful. The tensor product, 246 00:11:47.559 --> 00:11:50.559 or dot product, is absolutely fundamental. It's a core operation 247 00:11:50.639 --> 00:11:55.279 in linear algebra and used constantly in dense layers. And reshaping, 248 00:11:55.320 --> 00:11:58.399 which changes the tensor's shape without changing its contents. 
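NOTE A minimal sketch of the tensor ideas described above (rank, shape, dtype, element-wise operations, broadcasting, the dot product, and reshaping). It assumes NumPy as the array library; the variable names and the 8-image, 28x28 batch are illustrative choices, not values taken from the book or the episode.

```python
import numpy as np

# Rank-0 (scalar), rank-1 (vector), and rank-2 (matrix) tensors.
scalar = np.array(12.0)
vector = np.array([1.0, 2.0, 3.0])
matrix = np.zeros((3, 5), dtype=np.float32)  # shape (3, 5), 32-bit floats

# Hypothetical batch of 8 RGB images, 28x28 pixels: a rank-4 tensor
# laid out as (images, height, width, color channels).
images = np.zeros((8, 28, 28, 3), dtype=np.uint8)

# ndim is the rank: the number of axes.
ranks = [t.ndim for t in (scalar, vector, matrix, images)]

x = np.array([[1.0, -2.0, 3.0],
              [-4.0, 5.0, -6.0]])

# Element-wise operation: the same function applied to every entry.
clipped = np.maximum(x, 0.0)

# Broadcasting: the shape-(3,) bias is reused for each row of the (2, 3) matrix.
bias = np.array([10.0, 20.0, 30.0])
shifted = x + bias

# Tensor (dot) product: (2, 3) times (3, 4) gives (2, 4).
w = np.ones((3, 4))
projected = x @ w

# Reshaping: the same six values in a new shape.
reshaped = x.reshape(3, 2)
```

The same operations exist under similar names in TensorFlow; NumPy is used here only to keep the sketch small.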
249 00:11:58.519 --> 00:12:03.200 The book also has this geometric interpretation: deep learning as 250 00:12:03.360 --> 00:12:07.240 untangling data manifolds. That sounds abstract. 251 00:12:07.600 --> 00:12:09.799 It is a bit abstract, but it's a powerful way 252 00:12:09.840 --> 00:12:14.440 to think about it. Imagine your raw data points, maybe 253 00:12:14.600 --> 00:12:18.360 images of handwritten digits, are all jumbled together in a 254 00:12:18.399 --> 00:12:21.600 high-dimensional space, like a crumpled piece of paper. 255 00:12:21.759 --> 00:12:22.960 Okay, a messy blob. 256 00:12:23.120 --> 00:12:27.279 Right, a data manifold. Each layer in a deep network 257 00:12:27.440 --> 00:12:31.720 applies a transformation, a tensor operation, that essentially tries to 258 00:12:31.799 --> 00:12:35.399 uncrumple that paper a little bit. It stretches, rotates, and 259 00:12:35.440 --> 00:12:38.080 folds the space the data lives in, trying to make 260 00:12:38.080 --> 00:12:41.559 the different categories, the different digits in this example, more 261 00:12:41.600 --> 00:12:42.440 easily separable. 262 00:12:42.960 --> 00:12:46.000 So layer by layer, it's smoothing out the crumpled paper 263 00:12:46.080 --> 00:12:49.080 until the digits written on different parts are clearly distinct. 264 00:12:48.759 --> 00:12:53.279 Exactly, untangling the manifold. After enough layers, ideally the different 265 00:12:53.279 --> 00:12:55.639 classes of data will be nicely separated, maybe even by 266 00:12:55.639 --> 00:12:56.320 simple planes. 267 00:12:56.480 --> 00:12:59.159 That's a great visual. Okay, so tensors are data, operations 268 00:12:59.200 --> 00:13:02.919 manipulate them geometrically. The next piece is layers. What are they, 269 00:13:02.919 --> 00:13:05.600 fundamentally? Layers are the building blocks you stack together to 270 00:13:05.639 --> 00:13:07.919 create a deep learning model. 
You can think of them 271 00:13:07.919 --> 00:13:10.840 as modules that process data. They take one or more 272 00:13:10.840 --> 00:13:13.159 tensors as input and spit out one or more tensors 273 00:13:13.200 --> 00:13:14.240 as output. 274 00:13:14.000 --> 00:13:16.039 And they perform those tensor operations we 275 00:13:16.039 --> 00:13:19.879 just talked about. Precisely. Some layers are stateless: their output 276 00:13:19.919 --> 00:13:22.559 just depends on the current input. Others have internal state. 277 00:13:23.039 --> 00:13:25.440 This state consists of the layer's weights. 278 00:13:25.279 --> 00:13:27.000 The things that get learned during training. 279 00:13:27.320 --> 00:13:30.679 Exactly. The weights are themselves tensors, and they contain the 280 00:13:30.720 --> 00:13:33.600 knowledge the layer has learned. They get updated during training 281 00:13:33.679 --> 00:13:34.559 via gradient descent. 282 00:13:34.960 --> 00:13:37.279 And we use different types of layers for different 283 00:13:37.080 --> 00:13:41.320 data, right? Yes, absolutely. Dense layers, also called fully connected 284 00:13:41.360 --> 00:13:45.720 layers, are common for vector data. Convolutional layers, like 285 00:13:45.840 --> 00:13:49.320 Conv2D, are the stars for image data. Recurrent layers, 286 00:13:49.360 --> 00:13:52.840 like LSTMs or GRUs, are designed for sequential data like 287 00:13:52.879 --> 00:13:55.679 text or time series. You choose layers suited to your 288 00:13:55.759 --> 00:13:56.399 data structure. 289 00:13:56.480 --> 00:13:58.440 Let's zoom in on dense layers for a second. What's 290 00:13:58.480 --> 00:14:00.919 the core operation they do, and what's the deal with 291 00:14:01.080 --> 00:14:03.320 activation functions like ReLU? 
292 00:14:03.639 --> 00:14:07.480 Okay. A dense layer performs what's mathematically called an affine transform. 293 00:14:07.879 --> 00:14:11.120 It takes the input vector, multiplies it by a weight matrix, 294 00:14:11.200 --> 00:14:13.840 that's a tensor product, and then adds a bias vector. 295 00:14:14.240 --> 00:14:17.519 It's basically output = dot(W, input) + 296 00:14:17.399 --> 00:14:20.360 b. A linear transformation plus an offset. 297 00:14:20.519 --> 00:14:23.600 Correct. Now, here's a really important point. If you just 298 00:14:23.639 --> 00:14:26.720 stack a bunch of these dense layers together doing only 299 00:14:26.759 --> 00:14:30.720 these affine transforms, the whole stack is mathematically equivalent to 300 00:14:30.840 --> 00:14:34.039 just one single affine transform. You haven't actually gained any 301 00:14:34.080 --> 00:14:37.200 expressive power beyond a simple linear model, no matter how 302 00:14:37.279 --> 00:14:38.240 many layers you add. 303 00:14:38.360 --> 00:14:41.679 Whoa. Okay, so stacking linear operations just gives you another 304 00:14:41.720 --> 00:14:43.600 linear operation. That seems limiting. 305 00:14:43.679 --> 00:14:46.519 It is. That's why we need activation functions. They introduce 306 00:14:46.679 --> 00:14:50.159 nonlinearity into the network after the affine transform in 307 00:14:50.200 --> 00:14:50.720 each layer. 308 00:14:50.879 --> 00:14:52.840 Nonlinearity. Why is that crucial? 309 00:14:53.320 --> 00:14:57.440 Because most real world relationships are nonlinear. If your 310 00:14:57.480 --> 00:15:00.279 network can only model linear functions, it's going to fail 311 00:15:00.360 --> 00:15:04.759 on most interesting problems. Activation functions break that linearity. 312 00:15:04.279 --> 00:15:07.679 And ReLU is a common one? Rectified linear unit. 313 00:15:07.759 --> 00:15:12.240 Very common and incredibly simple. It just computes max(x, 0). 
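NOTE A small worked check of the claim above that stacked affine layers collapse into a single affine layer, and that a ReLU in between breaks the collapse. It assumes NumPy; the weights, biases, and input are hand-picked illustrative numbers, and the helper names dense and relu are hypothetical, not the Keras API.

```python
import numpy as np

def dense(x, w, b):
    """Affine transform of a dense layer: dot(w, x) + b."""
    return w @ x + b

def relu(x):
    """Rectified linear unit: max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)

# Hand-picked weights for a 2 -> 2 -> 1 network (illustrative values only).
w1 = np.array([[1.0, 2.0],
               [-1.0, 1.0]])
b1 = np.array([0.5, -0.5])
w2 = np.array([[2.0, -1.0]])
b2 = np.array([1.0])
x = np.array([1.0, -1.0])

# Two affine layers stacked with no activation in between.
stacked = dense(dense(x, w1, b1), w2, b2)

# The same mapping written as ONE affine layer:
# w2 @ (w1 @ x + b1) + b2 == (w2 @ w1) @ x + (w2 @ b1 + b2)
w_merged = w2 @ w1
b_merged = w2 @ b1 + b2
collapsed = dense(x, w_merged, b_merged)

# With ReLU between the layers, the single merged layer no longer matches
# for this input, because the negative hidden values are zeroed out.
with_relu = dense(relu(dense(x, w1, b1)), w2, b2)

print(stacked, collapsed, with_relu)
```

For this input the hidden values are both negative, so the ReLU version keeps only the final bias while the purely affine stack does not.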
314 00:15:12.559 --> 00:15:15.080 So if the input x is positive, it passes it 315 00:15:15.120 --> 00:15:18.320 through unchanged. If it's negative, it outputs zero. 316 00:15:18.480 --> 00:15:20.960 That's it? That little kink at zero is enough? 317 00:15:21.320 --> 00:15:25.440 It seems simple, but stacking layers with these ReLU activations 318 00:15:25.639 --> 00:15:31.200 allows the network to approximate arbitrarily complex nonlinear functions. It's 319 00:15:31.240 --> 00:15:33.120 what gives deep networks their power. 320 00:15:33.240 --> 00:15:37.840 Okay, ReLU: simple function, massive impact, because it adds nonlinearity. 321 00:15:38.399 --> 00:15:38.799 Got it. 322 00:15:39.480 --> 00:15:43.480 Now, these layers have weight matrices. You said they're initialized 323 00:15:43.559 --> 00:15:44.639 randomly? Yep, 324 00:15:44.519 --> 00:15:47.080 usually with small random values. If you started them all 325 00:15:47.120 --> 00:15:51.279 at zero, they wouldn't learn properly. Randomness breaks the symmetry. 326 00:15:50.879 --> 00:15:53.240 And the whole point of training is to adjust these 327 00:15:53.320 --> 00:15:54.480 random weights. 328 00:15:54.120 --> 00:15:57.000 Exactly, to adjust them based on the feedback signal, the loss, 329 00:15:57.360 --> 00:16:01.399 so that the network's overall transformation from input to output performs 330 00:16:01.399 --> 00:16:05.039 the task correctly. The learned weights encode the solution. And 331 00:16:04.919 --> 00:16:08.519 that adjustment mechanism is gradient-based optimization. Let's break that 332 00:16:08.559 --> 00:16:08.960 down. Right. 333 00:16:09.000 --> 00:16:11.759 This is the engine driving the learning. The core idea 334 00:16:12.039 --> 00:16:14.279 is to use the gradient of the loss function... 335 00:16:14.120 --> 00:16:15.960 The direction of steepest descent. 336 00:16:15.759 --> 00:16:18.200 ...to figure out how to change the weights to decrease 337 00:16:18.240 --> 00:16:20.519 the loss. 
We want to go downhill on that loss 338 00:16:20.600 --> 00:16:21.200 landscape we 339 00:16:21.159 --> 00:16:23.840 talked about. Okay, and how does it actually take the steps? 340 00:16:24.159 --> 00:16:29.320 A common algorithm is stochastic gradient descent, or SGD. Stochastic 341 00:16:29.399 --> 00:16:31.960 just means it uses small random batches of the training 342 00:16:32.039 --> 00:16:35.159 data to estimate the gradient at each step, rather than 343 00:16:35.159 --> 00:16:36.919 the whole data set, which would be very slow. 344 00:16:37.039 --> 00:16:39.559 So it gets a noisy estimate of the downhill direction 345 00:16:39.600 --> 00:16:40.960 from a small sample. 346 00:16:40.799 --> 00:16:43.919 Exactly, and it takes a small step in that estimated 347 00:16:43.919 --> 00:16:47.039 downhill direction, updating the weights. The size of that step 348 00:16:47.120 --> 00:16:49.120 is controlled by the learning rate. 349 00:16:49.240 --> 00:16:50.279 Ah, the learning rate. 350 00:16:50.320 --> 00:16:53.240 That sounds important. It's critical. Too big and you might 351 00:16:53.279 --> 00:16:57.200 overshoot the minimum or bounce around wildly. Too small and 352 00:16:57.240 --> 00:16:59.799 training will take forever, or you might get stuck easily. 353 00:17:00.120 --> 00:17:01.600 Finding a good learning rate is key. 354 00:17:01.960 --> 00:17:05.000 And the loss function itself, that's what defines the landscape 355 00:17:05.000 --> 00:17:07.240 we're descending. It tells us how wrong we are. 356 00:17:07.440 --> 00:17:11.519 Precisely, it quantifies the mismatch between the network's predictions and 357 00:17:11.559 --> 00:17:15.160 the true target values. Different tasks need different loss functions, 358 00:17:15.519 --> 00:17:17.599 but the goal is always to minimize it. 359 00:17:17.880 --> 00:17:22.160 Now, this descent, can it get stuck? The book mentions 360 00:17:22.279 --> 00:17:24.119 local versus global minima. 
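NOTE A minimal sketch of the training loop and mini-batch SGD described above: random starting weight, a loss that measures the mismatch, a gradient estimated from a small batch, a step against the gradient scaled by the learning rate, repeated over several epochs. It is an illustration under stated assumptions, not the book's code: the model is a one-weight line y = w * x + b, the loss is mean squared error, and the function name sgd_fit, the learning rate, the batch size, and the toy data are all hypothetical choices.

```python
import random

def sgd_fit(xs, ys, learning_rate=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), 0.0   # small random starting weight
    indices = list(range(len(xs)))
    history = []                         # full-data loss after each epoch
    for _ in range(epochs):              # one epoch = one full pass over the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error, estimated on this batch only.
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i] / len(batch)
                grad_b += 2 * error / len(batch)
            # Step against the gradient; the learning rate sets the step size.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        loss = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(loss)
    return w, b, history

# Toy data on a noiseless line with slope 3 and intercept 1.
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, history = sgd_fit(xs, ys)
print(w, b, history[0], history[-1])
```

Raising learning_rate far enough makes the updates overshoot and the loss grow instead of shrink, which is the "too big" failure mode mentioned in the conversation.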
361 00:17:24.480 --> 00:17:28.440 Yes, that's a potential issue. The loss landscape for deep 362 00:17:28.440 --> 00:17:32.680 networks can be very complex, with many valleys. SGD might 363 00:17:32.720 --> 00:17:36.400 find the bottom of a small nearby valley, a local minimum, 364 00:17:36.680 --> 00:17:39.880 but miss a much deeper valley elsewhere, the global minimum. 365 00:17:39.920 --> 00:17:42.759 So it finds a solution, but maybe not the best 366 00:17:42.799 --> 00:17:43.559 possible one. 367 00:17:43.680 --> 00:17:48.200 Potentially, yes, although in practice, for very high-dimensional problems 368 00:17:48.240 --> 00:17:52.000 in deep learning, many local minima are often quite good anyway. 369 00:17:52.720 --> 00:17:55.559 But techniques like momentum can help. Momentum? 370 00:17:55.599 --> 00:17:56.279 How does that help? 371 00:17:56.519 --> 00:17:59.079 Momentum adds a sort of inertia to the update step. 372 00:17:59.359 --> 00:18:01.759 It considers the direction of previous steps, not just the 373 00:18:01.759 --> 00:18:04.839 current gradient. This can help the optimizer roll through small 374 00:18:04.880 --> 00:18:08.480 local minima or navigate flat regions more effectively. 375 00:18:08.079 --> 00:18:09.799 Like giving it a push to get over little bumps. 376 00:18:09.839 --> 00:18:13.920 Yeah. Cool, okay. And you mentioned backpropagation earlier as 377 00:18:14.000 --> 00:18:16.039 the way to calculate these gradients efficiently. 378 00:18:16.240 --> 00:18:20.359 Yes. Backpropagation is the algorithm that makes training deep networks feasible. 379 00:18:20.720 --> 00:18:24.079 It's a clever application of the chain rule from calculus. 380 00:18:23.640 --> 00:18:25.200 Chain rule, right, for nested functions. 381 00:18:25.440 --> 00:18:29.079 Exactly. A deep network is just a long chain of 382 00:18:29.119 --> 00:18:33.400 nested functions, the layers. 
Backpropagation starts with the final loss 383 00:18:33.640 --> 00:18:36.799 and works backward through the network, layer by layer. Why 384 00:18:36.839 --> 00:18:40.400 backward? Because it efficiently calculates how much each weight in 385 00:18:40.440 --> 00:18:43.960 the network contributed to the final error by reusing calculations 386 00:18:44.000 --> 00:18:46.720 from later layers. It figures out the gradient of the 387 00:18:46.759 --> 00:18:50.200 loss with respect to every single weight in the network. 388 00:18:50.400 --> 00:18:54.079 Wow, without having to recalculate everything from scratch for each 389 00:18:54.119 --> 00:18:55.200 weight? Precisely. 390 00:18:55.319 --> 00:19:00.240 It's computationally very efficient, and modern frameworks like TensorFlow 391 00:19:00.559 --> 00:19:02.880 have automatic differentiation tools built 392 00:19:02.640 --> 00:19:05.039 in. Like GradientTape in TensorFlow? Exactly. 393 00:19:05.240 --> 00:19:07.880 You define your network's forward pass, how the data flows through, 394 00:19:07.920 --> 00:19:11.640 and TensorFlow, using tools like GradientTape, automatically figures out 395 00:19:11.640 --> 00:19:14.720 how to compute the gradients needed for backpropagation. It handles 396 00:19:14.720 --> 00:19:15.799 all that calculus for you. 397 00:19:15.920 --> 00:19:19.559 That's amazing, takes away a huge mathematical burden. Okay, so 398 00:19:19.599 --> 00:19:23.720 we have tensors, operations, layers, activation functions, and this gradient 399 00:19:23.759 --> 00:19:27.160 descent engine powered by backpropagation. Let's talk about Keras. The 400 00:19:27.160 --> 00:19:30.319 book focuses on it heavily. What is Keras? 401 00:19:30.240 --> 00:19:34.119