WEBVTT 1 00:00:00.120 --> 00:00:03.319 So back in twenty twelve, Google did something really wild. 2 00:00:03.399 --> 00:00:08.160 They fed a machine something like ten million completely random, 3 00:00:08.240 --> 00:00:10.679 chaotic frames of YouTube videos. Right. 4 00:00:10.679 --> 00:00:12.720 And the crazy part is they didn't write a single 5 00:00:12.759 --> 00:00:15.439 line of code telling this machine what to look for. 6 00:00:15.640 --> 00:00:19.760 Yeah, exactly, no instructions about shapes, no definitions of animals, 7 00:00:20.120 --> 00:00:22.640 and completely on its own, just sorting through the sheer 8 00:00:22.719 --> 00:00:26.519 chaos of the Internet, the machine independently formed the concept 9 00:00:26.600 --> 00:00:27.199 of a cat. 10 00:00:27.359 --> 00:00:31.839 It's fascinating. It physically reorganized its internal mathematical structures to 11 00:00:32.000 --> 00:00:33.600 recognize a feline face. 12 00:00:34.200 --> 00:00:36.600 Just out of nowhere. It is. And today we are 13 00:00:36.640 --> 00:00:39.119 tearing into how that is even remotely possible. 14 00:00:40.079 --> 00:00:42.560 Welcome to the deep dive. For you listening, you know, 15 00:00:42.600 --> 00:00:44.439 whether you're building models yourself or you just want a 16 00:00:44.439 --> 00:00:47.079 masterclass in the mechanics of the modern world, our source 17 00:00:47.079 --> 00:00:50.880 today is Java Deep Learning Essentials by Yusuke Sugomori. 18 00:00:50.920 --> 00:00:53.240 But don't worry, we are entirely bypassing all the heavy 19 00:00:53.320 --> 00:00:57.880 Java syntax today. Oh absolutely, we're leaving the code behind. 20 00:00:58.520 --> 00:01:01.439 The mission for this deep dive is to extract the pure, 21 00:01:01.679 --> 00:01:04.879 brilliant logic of how we got machines to stop acting 22 00:01:04.920 --> 00:01:10.640 like rigid calculators and start, well, hallucinating entirely new realities. 
23 00:01:11.000 --> 00:01:14.599 Setting the baseline here is so crucial because, you know, 24 00:01:15.239 --> 00:01:20.159 the cultural definition of artificial intelligence has been completely diluted. 25 00:01:19.799 --> 00:01:22.799 Right, like your smart toaster might have AI printed on the 26 00:01:22.719 --> 00:01:26.920 box now. Exactly, but running a basic predictive thermostat 27 00:01:27.000 --> 00:01:30.799 or, say, a simple loop of robotic movements, that is 28 00:01:30.840 --> 00:01:33.879 fundamentally different from the architecture that learned to see that cat. 29 00:01:34.040 --> 00:01:34.560 Yeah. 30 00:01:34.640 --> 00:01:37.920 To genuinely understand the tectonic shift of modern deep learning, 31 00:01:38.120 --> 00:01:41.239 we really have to look at the graveyard of past methodology. 32 00:01:40.760 --> 00:01:43.680 The booms and the busts. So to understand why modern 33 00:01:43.760 --> 00:01:47.000 machine learning is so revolutionary, let's explore those past failures. 34 00:01:47.400 --> 00:01:50.200 The first major wave hit in the nineteen fifties, right, 35 00:01:50.519 --> 00:01:52.159 driven by search algorithms. Right, 36 00:01:52.040 --> 00:01:54.280 things like depth first search and breadth first search. 37 00:01:54.719 --> 00:01:58.239 The fundamental approach back then was to give a machine 38 00:01:58.400 --> 00:02:01.560 a strict set of rules and then have it rapidly 39 00:02:01.599 --> 00:02:05.319 calculate through a tree of possibilities to find an optimal outcome. 40 00:02:05.879 --> 00:02:09.439 Which is why early computers looked like absolute geniuses when 41 00:02:09.479 --> 00:02:10.400 they were playing chess. 42 00:02:10.560 --> 00:02:14.319 Yeah, because a chess board is the ultimate closed ecosystem. 43 00:02:14.479 --> 00:02:17.879 Exactly. It has an eight by eight grid, discrete pieces, 44 00:02:17.919 --> 00:02:22.240 immutable rules. 
The machine just generates millions of branching future 45 00:02:22.280 --> 00:02:25.680 moves and calculates the mathematical path to victory. 46 00:02:25.960 --> 00:02:29.280 And people watched the machine dismantle a chess grand master 47 00:02:29.840 --> 00:02:34.039 and assumed, well, human like artificial intelligence was only a 48 00:02:34.080 --> 00:02:34.800 few years away. 49 00:02:34.879 --> 00:02:36.280 They thought it was right around the corner. 50 00:02:36.400 --> 00:02:38.759 They really did. The assumption was that you could just 51 00:02:38.800 --> 00:02:41.719 scale up that search algorithm to handle real world problems. 52 00:02:42.080 --> 00:02:46.000 But that assumption shattered against a massive theoretical wall known 53 00:02:46.039 --> 00:02:49.280 as the frame problem. Oh, the frame problem. Yeah, a 54 00:02:49.319 --> 00:02:52.240 search algorithm functions perfectly when the frame of reality is 55 00:02:52.319 --> 00:02:55.560 artificially limited, but the moment you drop that machine into 56 00:02:55.560 --> 00:02:58.840 the actual physical world, it paralyzes itself. 57 00:02:58.599 --> 00:03:02.199 Because human beings can instantly, like unconsciously, filter out an 58 00:03:02.240 --> 00:03:05.400 infinite amount of irrelevant data, and a rule based machine 59 00:03:05.439 --> 00:03:06.479 can't. Precisely. 60 00:03:06.960 --> 00:03:10.360 It operates on absolute logic, so it has no intuition 61 00:03:10.439 --> 00:03:11.199 for what to ignore. 62 00:03:11.400 --> 00:03:14.719 Wait, so the frame problem is like, it's like asking 63 00:03:14.759 --> 00:03:16.520 a robot to make a cup of tea in a 64 00:03:16.560 --> 00:03:19.120 normal kitchen and it immediately freezes. 65 00:03:19.080 --> 00:03:22.800 Right, because it's actively trying to calculate the current atmospheric 66 00:03:22.840 --> 00:03:23.840 pressure. Exactly. 67 00:03:23.879 --> 00:03:27.479 It's calculating the exact atomic structure of the ceramic mug. 
68 00:03:27.599 --> 00:03:30.599 And, I don't know, the gravitational pull of Jupiter before 69 00:03:30.639 --> 00:03:33.240 it feels authorized to turn on the kettle, because we 70 00:03:33.240 --> 00:03:36.280 never explicitly programmed it to ignore Jupiter's gravity. 71 00:03:36.639 --> 00:03:39.680 Yeah, so it factors into the tea making equation. That's wild. 72 00:03:39.840 --> 00:03:43.840 The computational explosion makes action impossible. It becomes trapped in 73 00:03:43.879 --> 00:03:47.039 an infinite loop of processing variables that have zero bearing 74 00:03:47.080 --> 00:03:50.240 on the task. And that failure essentially ended that first 75 00:03:50.280 --> 00:03:51.039 era of AI. 76 00:03:51.280 --> 00:03:54.120 So then came the pivot, arriving in the nineteen eighties. 77 00:03:54.199 --> 00:03:57.280 Researchers tried to bypass the machine's lack of intuition by 78 00:03:57.319 --> 00:04:00.400 basically brute forcing context into its memory. 79 00:04:00.520 --> 00:04:03.360 Right, this is the knowledge representation boom. The second boom. 80 00:04:03.439 --> 00:04:06.520 Yeah. The logic was, if the machine freezes because it 81 00:04:06.520 --> 00:04:09.840 doesn't know enough about the world, let's simply sit down 82 00:04:09.960 --> 00:04:14.000 and manually encode the entirety of human knowledge into a database. 83 00:04:14.319 --> 00:04:18.000 Projects like the Cyc database or the Semantic Web were 84 00:04:18.040 --> 00:04:22.319 incredibly tedious efforts to build absolute dictionaries of reality. 85 00:04:22.439 --> 00:04:25.360 Typing in rules manually, like a dog is a mammal, 86 00:04:25.480 --> 00:04:28.079 and water is wet, and Tokyo is in Japan. 87 00:04:28.480 --> 00:04:31.600 You're trying to build a semantic web of relationships, so 88 00:04:31.639 --> 00:04:34.759 the machine has a reference point for every scenario. But 89 00:04:34.839 --> 00:04:38.920 that leads straight into the second wall, the symbol grounding problem. 
90 00:04:38.959 --> 00:04:39.959 Okay, let's unpack that. 91 00:04:40.079 --> 00:04:42.480 Well, you can feed a machine a dictionary and it 92 00:04:42.480 --> 00:04:45.439 can parse the syntax perfectly. It can tell you that green 93 00:04:45.800 --> 00:04:48.319 plus apple equals green apple, but it 94 00:04:48.240 --> 00:04:51.079 has no actual concept of what an apple tastes or 95 00:04:51.079 --> 00:04:52.079 feels like. Exactly. 96 00:04:52.120 --> 00:04:53.800 It's completely devoid of semantics. 97 00:04:53.879 --> 00:04:56.480 So it knows the equation, but it has no concept 98 00:04:56.519 --> 00:04:58.920 of the crisp snap of the skin, or the tartness 99 00:04:58.920 --> 00:05:00.560 of the juice, or the weight of it in your hand. 100 00:05:00.680 --> 00:05:03.879 To the machine, apple is nothing more than a string 101 00:05:03.959 --> 00:05:08.360 of ASCII characters. It manipulates the symbols flawlessly according 102 00:05:08.360 --> 00:05:10.639 to the grammar we gave it, but those symbols are 103 00:05:10.680 --> 00:05:14.160 never grounded in actual experiential reality. 104 00:05:14.480 --> 00:05:17.920 Humans inherently catch the defining features of an object, but 105 00:05:18.040 --> 00:05:20.959 machines at this stage only saw symbols, and because they 106 00:05:20.959 --> 00:05:25.360 couldn't grasp the underlying concepts, they were incredibly fragile. 107 00:05:25.040 --> 00:05:29.040 Extremely fragile. Confronted with a new situation that deviated even 108 00:05:29.120 --> 00:05:33.240 slightly from their manually programmed dictionary, they just failed completely. 109 00:05:33.360 --> 00:05:36.839 So, since machines couldn't manually learn every rule in the universe, 110 00:05:37.199 --> 00:05:38.560 scientists flipped 111 00:05:38.279 --> 00:05:40.800 the script. Right. They abandoned the attempt to teach the 112 00:05:40.839 --> 00:05:42.399 computer the rules of the universe. 
113 00:05:42.439 --> 00:05:44.720 Instead of teaching rules, they thought, what if the machine 114 00:05:44.720 --> 00:05:47.600 looked for patterns? You build an architecture that allows the 115 00:05:47.639 --> 00:05:50.759 computer to look at raw data and deduce the dividing 116 00:05:50.759 --> 00:05:52.360 lines itself. And this brings 117 00:05:52.199 --> 00:05:54.519 us out of the AI winter and into the third 118 00:05:55.040 --> 00:05:59.600 boom: machine learning. The fundamental mechanics shift from deductive rule 119 00:05:59.639 --> 00:06:03.120 following to inductive statistical pattern recognition. 120 00:06:03.439 --> 00:06:05.839 Okay, so you take an algorithm, flood it with data, 121 00:06:06.199 --> 00:06:09.920 and ask it to find the mathematical boundaries between different categories. 122 00:06:10.079 --> 00:06:14.240 Yes, and when we look at unsupervised learning, where the 123 00:06:14.319 --> 00:06:18.759 data is entirely raw and unlabeled, the algorithm's only job 124 00:06:18.879 --> 00:06:20.920 is to find hidden structures. 125 00:06:21.040 --> 00:06:23.439 Like that famous retail case study with the diapers and 126 00:06:23.480 --> 00:06:23.839 the beer. 127 00:06:24.079 --> 00:06:28.319 Exactly, a major supermarket fed millions of raw checkout logs 128 00:06:28.360 --> 00:06:31.839 into a machine learning algorithm. The machine didn't know what 129 00:06:31.879 --> 00:06:34.759 the symbols for diapers or beer actually meant. 130 00:06:34.519 --> 00:06:38.000 Because the symbol grounding problem still applies here, right? Right, but 131 00:06:38.000 --> 00:06:42.639 it recognized a profound statistical correlation. It noticed that consumers 132 00:06:42.680 --> 00:06:46.240 purchasing diapers late on a Friday night had a highly 133 00:06:46.279 --> 00:06:49.399 elevated probability of simultaneously purchasing beer. 
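NOTE (editor's sketch, not part of the spoken audio): The diapers-and-beer correlation described above can be illustrated by simple co-occurrence counting. The baskets, the item names, and the `confidence` helper are all made up for illustration; the episode does not say which algorithm the supermarket actually used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout baskets. To the code, item names are opaque strings:
# it never "knows" what diapers or beer are, it only counts them.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in BASKETS:
    item_counts.update(basket)
    # Count every unordered pair of items that appears in the same basket.
    pair_counts.update(frozenset(pair) for pair in combinations(sorted(basket), 2))


def confidence(antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket) from the counts."""
    return pair_counts[frozenset((antecedent, consequent))] / item_counts[antecedent]


print(confidence("diapers", "beer"))
```

In this toy data, three of the four baskets containing diapers also contain beer, which is the kind of frequency pattern the hosts describe the machine surfacing without any labels.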
134 00:06:49.480 --> 00:06:52.160 So the machine maps the frequency, the store moves the 135 00:06:52.160 --> 00:06:55.160 beer aisle next to the diapers, and the profit margins spike. 136 00:06:55.360 --> 00:06:58.040 That's unsupervised learning in a nutshell. But then we have 137 00:06:58.079 --> 00:07:01.560 supervised learning where we do provide examples, and the book 138 00:07:01.639 --> 00:07:05.120 highlights support vector machines, or SVMs, to handle this. 139 00:07:05.360 --> 00:07:07.519 And this is where the math gets incredibly elegant. 140 00:07:07.720 --> 00:07:10.399 It really does. If you have a massive data set 141 00:07:10.439 --> 00:07:13.600 of, say, medical diagnostics, and you map it out on 142 00:07:13.639 --> 00:07:17.519 a two dimensional graph, the data points for healthy and 143 00:07:17.759 --> 00:07:21.240 sick are going to be completely overlapping and tangled together. 144 00:07:21.360 --> 00:07:23.240 You can't just draw a straight two D line to 145 00:07:23.279 --> 00:07:23.959 separate them. 146 00:07:24.199 --> 00:07:26.040 No, a straight line on a flat plane is just too 147 00:07:26.120 --> 00:07:30.000 simplistic for messy real world data. So SVMs solve this 148 00:07:30.279 --> 00:07:31.279 using the kernel trick. 149 00:07:31.399 --> 00:07:33.040 The kernel trick. I love this concept. 150 00:07:33.079 --> 00:07:37.079 It's basically a method of mathematically shifting perspective. Instead of 151 00:07:37.120 --> 00:07:40.240 trying to force a complex curved boundary through the two 152 00:07:40.319 --> 00:07:44.720 D data, the algorithm applies a mathematical transformation, like squaring 153 00:07:44.759 --> 00:07:46.519 the distance of each point from the 154 00:07:46.480 --> 00:07:50.120 origin, and by running that calculation, the algorithm effectively takes 155 00:07:50.120 --> 00:07:52.639 the flat two D data and projects it outward into 156 00:07:52.639 --> 00:07:53.759 a three dimensional space. 
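NOTE (editor's sketch, not part of the spoken audio): The lift described above, squaring each point's distance from the origin to get a third coordinate, can be written out explicitly. The data points and the `threshold` value are invented for illustration. A real SVM kernel gets the same effect implicitly through inner products and learns the separating plane rather than having it hard-coded, so treat this as a picture of the idea, not an SVM implementation.

```python
def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding z = x^2 + y^2,
    the squared distance from the origin."""
    x, y = point
    return (x, y, x * x + y * y)


# Hypothetical data: "healthy" points cluster near the origin and "sick"
# points surround them, so no straight 2-D line can split the two groups.
healthy = [(0.5, 0.0), (-0.3, 0.4), (0.0, -0.6), (0.2, 0.2)]
sick = [(2.0, 0.0), (-1.5, 1.5), (0.0, -2.2), (1.4, -1.6)]

# After the lift, the flat plane z = threshold separates the groups by height.
threshold = 1.0  # assumed value, chosen by eye for this toy data


def classify(point):
    """Label a 2-D point by which side of the plane its lifted image falls on."""
    return "sick" if lift(point)[2] > threshold else "healthy"


for p in healthy + sick:
    print(p, lift(p)[2], classify(p))
```

Every healthy point lifts to a height below the plane and every sick point to a height above it, which is the "separated by altitude" picture.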
157 00:07:53.920 --> 00:07:56.519 Right, the data points literally lift off the flat page. 158 00:07:56.639 --> 00:07:59.279 It's like the math warps the space so that the 159 00:07:59.319 --> 00:08:01.879 tangled points spread out into a three D shape like 160 00:08:01.879 --> 00:08:05.720 a parabola. And once the data is suspended in three dimensions, 161 00:08:05.800 --> 00:08:09.519 the tangled mess is suddenly separated by altitude. Exactly. 162 00:08:09.839 --> 00:08:12.600 And at that point the SVM doesn't need to draw 163 00:08:12.720 --> 00:08:16.240 a complex curve anymore. It just slides a perfectly flat, 164 00:08:16.439 --> 00:08:20.040 rigid sheet of glass, a hyperplane, straight through the three D 165 00:08:20.040 --> 00:08:24.240 space, cleanly severing the healthy data points from the sick ones. 166 00:08:24.399 --> 00:08:29.360 It is an extraordinarily powerful classification tool, but traditional machine learning, 167 00:08:29.639 --> 00:08:33.960 despite the brilliance of the kernel trick, harbored a fatal bottleneck. 168 00:08:33.720 --> 00:08:34.879 Right, feature engineering. 169 00:08:35.000 --> 00:08:38.320 Yes, the machine is excellent at finding the boundary, but 170 00:08:38.360 --> 00:08:40.879 it remains completely blind to what it is actually looking 171 00:08:40.960 --> 00:08:42.480 at unless a human 172 00:08:42.240 --> 00:08:44.639 tells it. So you still have to define the coordinates. 173 00:08:44.799 --> 00:08:47.080 Like if you want the SVM to identify a cat, 174 00:08:47.120 --> 00:08:49.000 you can't just feed it a raw JPEG. 175 00:08:49.279 --> 00:08:51.519 No, a human data scientist has to sit down and 176 00:08:51.679 --> 00:08:55.440 manually write code that extracts the specific features for the 177 00:08:55.480 --> 00:08:56.440 machine to evaluate. 
178 00:08:56.559 --> 00:08:58.759 You have to program it to measure the distance between 179 00:08:58.759 --> 00:09:01.000 the pixels that make up the eye, or calculate the 180 00:09:01.039 --> 00:09:04.600 geometric angle of the ear triangles, or isolate the hex 181 00:09:04.679 --> 00:09:06.120 codes of the fur color. 182 00:09:06.120 --> 00:09:09.320 And the accuracy of the entire model is bound by 183 00:09:09.399 --> 00:09:14.480 human bias. If the human engineer selects poor features, like 184 00:09:14.919 --> 00:09:19.039 trying to predict a neighborhood's housing prices based exclusively on 185 00:09:19.080 --> 00:09:22.399 the number of street lights rather than square footage, 186 00:09:21.879 --> 00:09:26.240 the algorithm will confidently execute the math and deliver absolute garbage. 187 00:09:26.639 --> 00:09:29.279 But wait, if a human is still doing the heavy 188 00:09:29.320 --> 00:09:32.399 lifting of feature engineering, then machine learning isn't really learning 189 00:09:32.440 --> 00:09:33.840 independently at all, is it? 190 00:09:34.600 --> 00:09:35.600 You've hit the nail on the head. 191 00:09:35.639 --> 00:09:38.759 It's just a hyper fast sorter based on our personal intuition. 192 00:09:39.360 --> 00:09:42.720 That is exactly why machine learning plateaued. It lacked the 193 00:09:42.759 --> 00:09:46.559 metacognitive ability to look at a raw environment and independently 194 00:09:46.600 --> 00:09:48.399 determine which features actually mattered. 195 00:09:48.600 --> 00:09:51.519 So how did we finally break through that feature engineering wall? 196 00:09:51.559 --> 00:09:54.399 Because that leads us to the ultimate game changer, deep learning. 197 00:09:54.799 --> 00:09:59.559 Right. Historically, researchers knew that artificial neural networks theoretically had 198 00:09:59.679 --> 00:10:02.559 this potential, but they couldn't get them to work at scale. 
199 00:10:03.080 --> 00:10:05.200 That changes with a two thousand and six paper by 200 00:10:05.240 --> 00:10:08.000 Geoffrey Hinton introducing deep belief nets. 201 00:10:07.879 --> 00:10:11.799 Which was largely ignored until twenty twelve. Right, the ImageNet 202 00:10:11.879 --> 00:10:14.399 Large Scale Visual Recognition Challenge. 203 00:10:14.080 --> 00:10:19.320 Yes, the ILSVRC. Historically, teams of PhDs would spend 204 00:10:19.320 --> 00:10:24.000 an entire year painstakingly tweaking their manual feature engineering, fighting 205 00:10:24.039 --> 00:10:27.519 tooth and nail just to push their image recognition accuracy 206 00:10:27.639 --> 00:10:29.559 up by a fraction of a single percent. 207 00:10:29.799 --> 00:10:33.559 So the field was accustomed to microscopic, agonizing progress. 208 00:10:34.000 --> 00:10:38.679 Then a team called SuperVision, utilizing deep learning algorithms, entered 209 00:10:38.679 --> 00:10:43.240 the twenty twelve contest. They abandoned human engineered features entirely. 210 00:10:43.799 --> 00:10:46.879 They fed the raw image pixels directly into a deep 211 00:10:46.919 --> 00:10:50.159 neural network, and they didn't just win, they obliterated the 212 00:10:50.240 --> 00:10:50.960 historical curve. 213 00:10:51.080 --> 00:10:53.720 They beat the second place team by a staggering margin 214 00:10:53.759 --> 00:10:55.039 of over ten percent. And 215 00:10:55.000 --> 00:10:58.120 in the context of computer vision, a ten percent leap 216 00:10:58.159 --> 00:11:00.919 in a single year was viewed as an almost alien 217 00:11:00.559 --> 00:11:03.639 intervention. Which brings us directly back to that Google experiment 218 00:11:03.639 --> 00:11:06.879 we started with. By feeding those ten million raw, unlabeled 219 00:11:06.919 --> 00:11:10.600 YouTube frames into a deep neural network, the system independently 220 00:11:10.639 --> 00:11:14.000 deduced the recurring mathematical structures that constituted a cat. 
221 00:11:14.440 --> 00:11:17.159 It effectively solved the symbol grounding problem that killed the 222 00:11:17.200 --> 00:11:18.879 AI boom of the nineteen eighties. 223 00:11:19.200 --> 00:11:21.720 So deep learning is doing what we failed to do 224 00:11:21.799 --> 00:11:25.639 in the nineteen eighties. It's solving the symbol grounding problem 225 00:11:25.720 --> 00:11:28.919 by figuring out the signified, the actual concept of the 226 00:11:28.919 --> 00:11:30.480 thing, completely on its own. 227 00:11:30.600 --> 00:11:34.320 Yes, it didn't just learn a symbol. By analyzing millions 228 00:11:34.320 --> 00:11:38.080 of variations of lighting, angles, and shapes, it isolated the 229 00:11:38.120 --> 00:11:43.759 foundational underlying concept of catness entirely independent of human labeling. 230 00:11:43.480 --> 00:11:47.519 That bridges the gap between raw physical data and conceptual understanding. 231 00:11:47.639 --> 00:11:51.399 To prove just how thoroughly these deep networks internalized these concepts, 232 00:11:51.720 --> 00:11:55.320 Google engineers later developed a technique called Inceptionism, widely known 233 00:11:55.360 --> 00:11:56.120 as Deep 234 00:11:55.919 --> 00:11:58.960 Dream. Oh, Deep Dream, the nightmare 235 00:11:58.600 --> 00:12:02.840 art. Exactly. In operation, data flows forward through the network, 236 00:12:02.879 --> 00:12:05.320 and the machine outputs a classification of what it sees. 237 00:12:05.960 --> 00:12:09.080 With Inceptionism, the engineers reverse the feedback loop. 238 00:12:09.320 --> 00:12:11.639 They fed an image into the network and commanded it 239 00:12:11.679 --> 00:12:16.159 to mathematically amplify whatever patterns it vaguely recognized, a feedback 240 00:12:16.200 --> 00:12:17.960 loop of pure pattern recognition. 
241 00:12:18.279 --> 00:12:20.840 So if the network is scanning an image of a blurry, 242 00:12:20.879 --> 00:12:25.519 overcast sky and a cluster of pixels vaguely corresponds to 243 00:12:25.559 --> 00:12:29.679 the internal mathematical weight the network associates with a bird's 244 00:12:29.360 --> 00:12:31.840 beak, it alters the image to make those pixels look 245 00:12:31.879 --> 00:12:34.519 slightly more like a beak, and then it feeds that 246 00:12:34.639 --> 00:12:36.639 altered image back into its own input. 247 00:12:36.879 --> 00:12:39.519 Right. Now the beak is more pronounced, so the network 248 00:12:39.519 --> 00:12:42.679 confidently hallucinates the eyes and then the feathers. 249 00:12:42.879 --> 00:12:46.679 It runs this recursive loop until a highly detailed, psychedelic, 250 00:12:46.799 --> 00:12:50.039 multi eyed bird physically manifests out of thin air in 251 00:12:50.080 --> 00:12:51.200 the middle of a cloud bank. 252 00:12:51.360 --> 00:12:55.240 It is generating novel imagery based on its deeply internalized 253 00:12:55.320 --> 00:12:58.480 understanding of features. It proves that the network isn't just 254 00:12:58.559 --> 00:13:01.879 matching pixels to a database. It has built a flexible, 255 00:13:02.039 --> 00:13:03.879 generative concept of the object. 256 00:13:03.960 --> 00:13:06.159 Okay, to truly appreciate this, for you listening, we have 257 00:13:06.200 --> 00:13:08.879 to unpack the mechanics under the hood. Why did adding 258 00:13:08.879 --> 00:13:11.639 the word deep suddenly unlock this capability? 259 00:13:11.879 --> 00:13:15.279 Well, the basic concept of neural networks existed. A perceptron, 260 00:13:15.480 --> 00:13:18.360 which is a single layer of artificial neurons loosely mimicking 261 00:13:18.440 --> 00:13:22.759 human brain cells, takes inputs, applies mathematical weights, and outputs 262 00:13:22.759 --> 00:13:23.279 a decision. 
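NOTE (editor's sketch, not part of the spoken audio): The perceptron described above (inputs, weights, a decision) fits in a few lines. The AND-gate task, the step threshold at zero, and the learning rate of 1 are choices made for this illustration, using the classic perceptron update rule; none of it is taken from the book.

```python
def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, followed by a hard yes/no decision."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=20, lr=1):
    """Classic perceptron rule: nudge each weight by (target - prediction) * input."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# A linearly separable toy problem: output 1 only when both inputs are 1.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_GATE)
print(weights, bias, [predict(weights, bias, x) for x, _ in AND_GATE])
```

A single layer like this can only draw one straight dividing line, so it handles AND but cannot represent a function such as XOR.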
263 00:13:23.320 --> 00:13:26.799 Good for linear problems, right? But researchers 264 00:13:26.279 --> 00:13:31.480 knew that to solve nonlinear complex problems, they needed multilayer perceptrons. 265 00:13:32.399 --> 00:13:35.440 You insert hidden layers of neurons between the input and 266 00:13:35.480 --> 00:13:39.360 the output. Logic dictates that if one hidden layer is good, 267 00:13:39.799 --> 00:13:43.240 stacking twenty layers to make a deep network should allow 268 00:13:43.279 --> 00:13:48.039 it to process incredibly complex realities. The theoretical mathematics supported 269 00:13:48.039 --> 00:13:50.159 that logic. But there was a villain in the story, 270 00:13:50.279 --> 00:13:54.279 wasn't there? The vanishing gradient problem. The vanishing gradient. Neural networks 271 00:13:54.360 --> 00:13:57.639 learn through an algorithm called backpropagation. The network makes a 272 00:13:57.639 --> 00:14:00.879 prediction, it looks at a dog and guesses cat, and 273 00:14:00.919 --> 00:14:03.840 a loss function calculates the mathematical error of 274 00:14:03.799 --> 00:14:07.600 that guess. Exactly. The algorithm then takes that error and 275 00:14:07.679 --> 00:14:11.360 propagates it backward through the network, layer by layer, adjusting 276 00:14:11.399 --> 00:14:13.919 the mathematical weights of the connections so the network is 277 00:14:14.000 --> 00:14:15.559 less likely to make that mistake again. 278 00:14:15.639 --> 00:14:18.799 It's a chain of correction, but backpropagation relies on the 279 00:14:18.879 --> 00:14:20.759 chain rule of calculus. 280 00:14:20.200 --> 00:14:22.200 And that's where it all fell apart. As that error 281 00:14:22.240 --> 00:14:26.080 signal moves backward through the hidden layers, you are multiplying gradients, 282 00:14:26.440 --> 00:14:29.080 and those gradients are often fractional numbers less than one. 
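NOTE (editor's sketch, not part of the spoken audio): The arithmetic behind the multiplying gradients described above can be checked directly. The per-layer factor of 0.25 is an assumption chosen because it is the largest value the derivative of a sigmoid activation can take; real networks also multiply in the weights at each layer, which this simplification ignores. The twenty layers match the figure mentioned in the conversation.

```python
def surviving_signal(per_layer_gradient, num_layers):
    """Multiply the same fractional gradient once per layer, as the chain rule does
    when the error signal is passed backward through the network."""
    signal = 1.0
    for _ in range(num_layers):
        signal *= per_layer_gradient
    return signal


# 0.25 is the maximum slope of a sigmoid, so this is a best case for that activation.
shallow = surviving_signal(0.25, 2)   # two hidden layers
deep = surviving_signal(0.25, 20)     # twenty hidden layers

print(shallow)  # 0.0625
print(deep)     # roughly 9.1e-13
```

After two layers about six percent of the signal is left; after twenty it is below one part in a trillion, which is why the layers nearest the input effectively stop receiving updates.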
283 00:14:29.320 --> 00:14:32.279 So if you multiply a fraction by a fraction by a fraction, 284 00:14:32.519 --> 00:14:35.159 the resulting number exponentially shrinks. 285 00:14:35.240 --> 00:14:37.679 By the time that error signal reaches the early layers 286 00:14:37.720 --> 00:14:41.000 of a deep network, the layers closest to the raw input, 287 00:14:41.480 --> 00:14:43.600 the number has essentially vanished to zero. 288 00:14:43.840 --> 00:14:47.919 The error signal dilutes so severely that the foundational layers 289 00:14:47.919 --> 00:14:51.840 of the network receive absolutely no updates. They never adjust 290 00:14:51.919 --> 00:14:53.000 their weights, never learn. 291 00:14:53.440 --> 00:14:57.799 Because the early layers remain untrained, the entire deep architecture 292 00:14:57.840 --> 00:15:02.320 stalls out, rendering deep networks practically useless for decades. 293 00:15:02.519 --> 00:15:06.240 Enter the hero: layer wise pre training, used in deep 294 00:15:06.320 --> 00:15:09.120 belief nets and stacked denoising auto encoders. 295 00:15:09.279 --> 00:15:12.120 The breakthrough was realizing that trying to train the entire 296 00:15:12.200 --> 00:15:14.799 massive network at once, from the output all the way 297 00:15:14.840 --> 00:15:18.440 back to the input, was mathematically impossible, so they isolated 298 00:15:18.480 --> 00:15:18.919 the layers. 299 00:15:19.120 --> 00:15:22.240 You train each hidden layer completely independently. But wait, if 300 00:15:22.279 --> 00:15:24.080 you isolate a layer in the middle of the network, 301 00:15:24.120 --> 00:15:26.639 it has no access to the final answer. It doesn't 302 00:15:26.679 --> 00:15:28.039 know it's supposed to be looking for a 303 00:15:27.919 --> 00:15:32.320 cat. So you employ unsupervised learning using auto encoders. You 304 00:15:32.399 --> 00:15:37.080 give that single isolated layer a bizarrely simple task. 
Take 305 00:15:37.080 --> 00:15:40.759 the raw input data, force it through a mathematical bottleneck 306 00:15:40.799 --> 00:15:43.879 that compresses it, and then try to perfectly reconstruct the 307 00:15:43.919 --> 00:15:45.559 original data on the other side. 308 00:15:45.639 --> 00:15:47.759 The bottleneck is the stroke of genius. 309 00:15:48.039 --> 00:15:51.039 It is. Because the layer cannot physically pass all the 310 00:15:51.159 --> 00:15:54.879 raw data through the compression, it is mathematically forced to 311 00:15:54.919 --> 00:15:58.279 discard the noise and figure out the most essential defining 312 00:15:58.320 --> 00:16:00.519 features required to rebuild the image. 313 00:16:00.600 --> 00:16:04.679 Once that first layer masters the reconstruction, its output becomes 314 00:16:04.679 --> 00:16:07.519 the input for the second layer. It creates a self 315 00:16:07.559 --> 00:16:10.200 assembling hierarchy of concepts. Exactly. 316 00:16:10.279 --> 00:16:13.399 The first layer compresses raw pixels and learns to map 317 00:16:13.440 --> 00:16:17.279 basic geometric edges and lines. The second layer isolates itself, 318 00:16:17.519 --> 00:16:20.720 takes those lines, compresses them, and learns to map specific 319 00:16:20.759 --> 00:16:21.720 shapes and textures. 320 00:16:21.879 --> 00:16:23.879 And then the third layer takes those shapes and learns 321 00:16:23.919 --> 00:16:27.200 to map complex features like eyes and noses. Because each 322 00:16:27.240 --> 00:16:30.559 layer is trained completely independently to find structure, you completely 323 00:16:30.600 --> 00:16:32.360 bypass the chain rule problem. 324 00:16:32.519 --> 00:16:35.600 There is no vanishing gradient because you aren't passing an 325 00:16:35.720 --> 00:16:38.039 error signal backward through twenty layers. 
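NOTE (editor's sketch, not part of the spoken audio): The compress-then-reconstruct task described above can be shown with the smallest possible auto encoder: two inputs squeezed through a single number and rebuilt. This is one linear layer trained by plain gradient descent on invented data, not the stacked denoising auto encoders or deep belief nets named in the conversation; the data, initial weights, step count, and learning rate are all assumptions.

```python
# Hypothetical 2-D samples that all lie on the line y = 2x: two numbers per
# sample, but only one real degree of freedom for the bottleneck to discover.
DATA = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0)]


def reconstruct(enc, dec, x):
    """Encode x into a single number (the bottleneck), then decode it back to 2-D."""
    code = enc[0] * x[0] + enc[1] * x[1]
    return code, (dec[0] * code, dec[1] * code)


def mean_loss(enc, dec):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in DATA:
        _, r = reconstruct(enc, dec, x)
        total += (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2
    return total / len(DATA)


def train(steps=500, lr=0.01):
    """Gradient descent on the reconstruction error; no labels are involved."""
    enc = [0.1, 0.1]
    dec = [0.1, 0.1]
    history = [mean_loss(enc, dec)]
    for _ in range(steps):
        g_enc = [0.0, 0.0]
        g_dec = [0.0, 0.0]
        for x in DATA:
            code, r = reconstruct(enc, dec, x)
            err = (r[0] - x[0], r[1] - x[1])
            # Error flowing back from both outputs into the single code value.
            back = err[0] * dec[0] + err[1] * dec[1]
            for i in range(2):
                g_dec[i] += 2 * err[i] * code / len(DATA)
                g_enc[i] += 2 * back * x[i] / len(DATA)
        enc = [e - lr * g for e, g in zip(enc, g_enc)]
        dec = [d - lr * g for d, g in zip(dec, g_dec)]
        history.append(mean_loss(enc, dec))
    return enc, dec, history


enc, dec, history = train()
print(history[0], history[-1])
```

In layer wise pre training, the `code` values from a trained layer like this would become the input data for the next layer's own auto encoder.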
326 00:16:38.240 --> 00:16:41.399 Once every layer has been pre trained to recognize this 327 00:16:41.519 --> 00:16:45.039 hierarchy of features, you assemble the full network, attach a 328 00:16:45.080 --> 00:16:48.080 final output layer, and perform fine tuning. 329 00:16:48.440 --> 00:16:51.639 Now, when you run backpropagation with labeled data, the 330 00:16:51.679 --> 00:16:54.679 network already knows how to see. It already has the 331 00:16:54.679 --> 00:16:58.879 mathematical weights for edges, shapes, and textures perfectly established. 332 00:16:59.039 --> 00:17:02.320 It only requires minor adjustments to realize that the combination 333 00:17:02.399 --> 00:17:05.359 of those specific shapes is called a cat. It has 334 00:17:05.440 --> 00:17:07.160 essentially engineered its own features. 335 00:17:07.480 --> 00:17:11.920 But building a massive, deeply layered network introduces another vulnerability. 336 00:17:12.559 --> 00:17:15.759 If a network has millions of perfectly tuned connections, it 337 00:17:15.799 --> 00:17:17.240 becomes prone to overfitting. 338 00:17:17.680 --> 00:17:20.799 It memorizes the training data so rigidly that it loses 339 00:17:20.880 --> 00:17:23.599 the flexibility to recognize a cat in a slightly different 340 00:17:23.680 --> 00:17:24.359 lighting condition. 341 00:17:24.680 --> 00:17:29.079 To shatter that rigidity, the architecture employs a remarkably counterintuitive 342 00:17:29.079 --> 00:17:30.519 trick called dropout. 343 00:17:30.720 --> 00:17:33.400 Wait, let's unpack dropout. You're telling me that physically 344 00:17:33.440 --> 00:17:37.160 severing the brain's connections randomly during training actually makes it smarter? 345 00:17:37.519 --> 00:17:41.079 It sounds counterproductive, but yes. During the fine tuning training phase, 346 00:17:41.240 --> 00:17:46.000 the algorithm will literally sever connections between neurons completely at random. 
347 00:17:46.400 --> 00:17:49.440 It temporarily drops a random percentage of the network out 348 00:17:49.440 --> 00:17:52.480 of existence for that specific training pass. 349 00:17:52.599 --> 00:17:56.480 You are physically lobotomizing the network during its training. It's 350 00:17:56.519 --> 00:17:59.480 like, it's like forcing someone to learn to ride a 351 00:17:59.480 --> 00:18:01.920 bike while randomly taking away one of their senses. 352 00:18:02.000 --> 00:18:04.279 That's a great way to look at it. Like, while 353 00:18:04.079 --> 00:18:07.279 they are pedaling on a tightrope, you randomly blindfold them, 354 00:18:07.599 --> 00:18:10.839 and then you randomly inject a massive dose of Novocain 355 00:18:10.880 --> 00:18:14.599 into their left leg. By randomly stripping away their senses, 356 00:18:14.920 --> 00:18:18.880 you force their central nervous system to develop an incredibly robust, 357 00:18:19.279 --> 00:18:22.960 bulletproof sense of core balance that doesn't rely on any 358 00:18:23.000 --> 00:18:23.799 single crutch. 359 00:18:23.880 --> 00:18:27.359 Precisely, because the neural network knows that any given neuron 360 00:18:27.440 --> 00:18:31.200 might spontaneously drop out during training, it cannot rely on 361 00:18:31.279 --> 00:18:35.119 any single fragile pathway to recognize a feature. 362 00:18:34.880 --> 00:18:38.519 It is forced to distribute the concept across multiple redundant pathways. 363 00:18:38.599 --> 00:18:42.160 The mathematical representation of the object becomes deeply embedded and 364 00:18:42.200 --> 00:18:43.279 structurally resilient. 
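NOTE (editor's sketch, not part of the spoken audio): The random dropping described above can be sketched as a function applied to one layer's outputs during a training pass. Standard dropout zeroes whole neuron outputs rather than individual connections, and this version uses the common "inverted" form that rescales the survivors so the expected total stays the same. The activation values, the 50 percent drop rate, and the seed are all made up for illustration.

```python
import random


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability drop_prob and
    scale the survivors by 1 / (1 - drop_prob). Used during training only."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so repeated runs drop the same neurons
hidden = [0.8, 0.1, 0.5, 0.9, 0.3, 0.7]  # hypothetical hidden-layer outputs

# Each training pass sees a different random subset of the layer.
for step in range(3):
    print(step, dropout(hidden, 0.5, rng))
```

At prediction time the function is simply not applied, so the full network is used.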
365 00:18:43.559 --> 00:18:47.160 For you listening, grasping this evolution, from the flat hyperplanes 366 00:18:47.160 --> 00:18:50.240 of the kernel trick to the hierarchical compression of auto 367 00:18:50.319 --> 00:18:54.400 encoders to the deliberate chaos of dropout, means you are 368 00:18:54.440 --> 00:18:58.680 really looking past the superficial buzzwords of modern technology. You 369 00:18:58.799 --> 00:19:02.279 now actually grasp the profound mechanics of how human intuition 370 00:19:02.480 --> 00:19:04.680 was mathematically outsourced to the machine. 371 00:19:04.720 --> 00:19:08.400 And understanding those mechanics is vital because the hardware executing 372 00:19:08.400 --> 00:19:12.480 these algorithms is scaling at a terrifying velocity. None of 373 00:19:12.519 --> 00:19:16.920 the architectural breakthroughs of deep learning mattered until physical processors 374 00:19:16.920 --> 00:19:18.200 could handle the math. Right. 375 00:19:18.519 --> 00:19:21.400 I mean, Google required a cluster of a thousand machines 376 00:19:21.480 --> 00:19:24.720 running for three straight days just to find that original cat. 377 00:19:24.920 --> 00:19:27.079 The theory had to wait for the silicon to catch up. 378 00:19:27.880 --> 00:19:30.359 But Moore's law dictates that the number of transistors on 379 00:19:30.400 --> 00:19:33.279 a microchip doubles roughly every eighteen months. 380 00:19:33.440 --> 00:19:36.319 If you track that exponential curve forward, we are rapidly 381 00:19:36.359 --> 00:19:38.119 approaching the year twenty forty five. 382 00:19:38.200 --> 00:19:41.400 Yes, twenty forty five is the widely projected date for 383 00:19:41.440 --> 00:19:44.640 the technological singularity. At that point on the curve, a 384 00:19:44.680 --> 00:19:48.359 single processor is expected to house more than ten billion transistors. 385 00:19:49.039 --> 00:19:52.279 That transcends the number of biological cells in the human brain. 
386 00:19:52.480 --> 00:19:57.000 The computational capacity crosses a threshold where machines achieve self 387 00:19:57.039 --> 00:20:00.720 recursive intelligence. They will possess the hardware and the 388 00:20:00.759 --> 00:20:04.759 deep architecture required to rapidly redesign and optimize their own 389 00:20:04.799 --> 00:20:09.079 software and hardware loops, entirely independent of human engineers. The 390 00:20:09.200 --> 00:20:13.279 ultimate abandonment of human feature engineering. The book leaves us 391 00:20:13.279 --> 00:20:16.720 with a stark quotation from the late theoretical physicist Stephen 392 00:20:16.759 --> 00:20:17.759 Hawking. Right. 393 00:20:17.880 --> 00:20:21.119 He warned that the development of full artificial intelligence could 394 00:20:21.119 --> 00:20:23.000 spell the end of the human race. 395 00:20:23.400 --> 00:20:26.759 Because from the nineteen fifties chessboards to the twenty twelve 396 00:20:26.839 --> 00:20:30.480 ImageNet massacre, human beings were the ones pulling the strings 397 00:20:30.480 --> 00:20:32.119 and defining the loss functions. 398 00:20:32.279 --> 00:20:35.599 We provided the data and established the ultimate goals, even 399 00:20:35.599 --> 00:20:38.240 when the machines learned to map the paths themselves. 400 00:20:38.480 --> 00:20:40.640 So keeping all these mechanics in mind, I want you 401 00:20:40.680 --> 00:20:43.680