WEBVTT 1 00:00:00.120 --> 00:00:03.279 So if you were online back in the late nineteen nineties, 2 00:00:03.759 --> 00:00:08.400 you probably remember that quiet war raging in our inbox. 3 00:00:08.480 --> 00:00:13.320 Oh yeah, the sheer volume of junk mail was just unbelievable. 4 00:00:12.640 --> 00:00:14.400 Right, it was a nightmare. And if you were a 5 00:00:14.400 --> 00:00:16.600 programmer trying to stop it, I mean, you were probably 6 00:00:16.399 --> 00:00:18.280 losing your mind, absolutely losing it. 7 00:00:18.280 --> 00:00:21.800 Because you'd write this rigid rule, right, like if an 8 00:00:21.879 --> 00:00:25.640 email contains the number four and the letter U, you send 9 00:00:25.679 --> 00:00:26.359 it straight to the 10 00:00:26.320 --> 00:00:27.920 trash, and that would work perfectly. 11 00:00:28.039 --> 00:00:30.519 Yeah, for about a day. Exactly, just a day. Then 12 00:00:30.559 --> 00:00:32.799 the spammers would realize what you did and they'd start 13 00:00:32.799 --> 00:00:35.359 spelling it out, like F-O-R, space, 14 00:00:35.200 --> 00:00:37.600 Y-O-U, and suddenly your defense is totally 15 00:00:37.359 --> 00:00:40.640 useless, completely useless. You had to go back, write a 16 00:00:40.679 --> 00:00:44.240 new rule, deploy it again. It was this endless, exhausting 17 00:00:44.359 --> 00:00:48.399 game of whack-a-mole, and mathematically humans were just 18 00:00:48.679 --> 00:00:50.079 destined to lose that game. 19 00:00:50.280 --> 00:00:52.399 We were. But then, you know, we stopped trying to 20 00:00:52.399 --> 00:00:54.799 write the rules. We decided to let the machine write them 21 00:00:54.719 --> 00:00:57.039 instead, which is just wild to think about. 22 00:00:57.320 --> 00:01:00.799 It was a profound turning point in the history of technology. 
23 00:01:01.320 --> 00:01:06.079 We abandoned the arrogance of trying to anticipate every possible 24 00:01:06.159 --> 00:01:09.079 variation of a problem and instead built systems that could 25 00:01:09.079 --> 00:01:09.760 actually adapt. 26 00:01:10.120 --> 00:01:13.000 And that adaptation is exactly what we're exploring today. So 27 00:01:13.079 --> 00:01:15.040 welcome to the Deep Dive. If you're listening to this, 28 00:01:15.319 --> 00:01:17.480 you are what we like to call the learner. That's right. 29 00:01:17.599 --> 00:01:19.959 Whether you're prepping for a high-stakes meeting, trying to 30 00:01:20.000 --> 00:01:22.560 catch up on where the tech landscape is heading, or 31 00:01:22.560 --> 00:01:25.640 you're just, you know, insanely curious about the mechanics of 32 00:01:25.640 --> 00:01:27.879 the digital world, you are in the right place. 33 00:01:28.159 --> 00:01:28.879 Glad you're here. 34 00:01:29.040 --> 00:01:33.439 Today, we're cracking open Aurélien Géron's foundational text Hands-On 35 00:01:33.519 --> 00:01:36.319 Machine Learning, and we are skipping all the sci-fi 36 00:01:36.359 --> 00:01:40.040 Hollywood hype. Thank goodness. Yeah, no killer robots, no 37 00:01:40.159 --> 00:01:42.959 Skynet today. We're just looking under the hood. Our mission 38 00:01:42.959 --> 00:01:45.599 here is to break down exactly what machine learning actually is, 39 00:01:46.000 --> 00:01:51.560 how these systems physically learn, and why they sometimes fail spectacularly. 40 00:01:51.799 --> 00:01:54.599 And it's vital to start with that spam filter example 41 00:01:54.640 --> 00:01:58.799 you mentioned, because it just perfectly illustrates the mechanical difference 42 00:01:58.799 --> 00:02:03.920 between traditional programming and machine learning. 
In traditional programming, a 43 00:02:04.040 --> 00:02:08.479 human analyzes a problem, discovers the pattern, writes a hard 44 00:02:08.520 --> 00:02:12.879 coded rule, and evaluates the output. It is incredibly brittle. 45 00:02:12.960 --> 00:02:14.479 Brittle is the perfect word for it. 46 00:02:14.520 --> 00:02:17.919 If the environment changes by even one pixel or one keystroke, 47 00:02:18.240 --> 00:02:19.439 the program just breaks. 48 00:02:19.560 --> 00:02:22.520 Okay, so let's unpack this for the listener. Traditional programming 49 00:02:22.599 --> 00:02:26.479 is basically like giving a chef a rigid, unchangeable recipe. 50 00:02:26.599 --> 00:02:27.199 Yeah, exactly. 51 00:02:27.240 --> 00:02:30.120 If they're missing a single ingredient, or if the oven 52 00:02:30.199 --> 00:02:32.080 is just slightly too hot, they just crash and burn. 53 00:02:32.120 --> 00:02:33.080 They don't know how to adapt. 54 00:02:33.120 --> 00:02:33.639 They're stuck. 55 00:02:33.960 --> 00:02:37.240 But machine learning is entirely different. It's like giving a 56 00:02:37.319 --> 00:02:40.759 chef a thousand slightly different cakes and having them guess 57 00:02:40.800 --> 00:02:43.000 the recipe by changing one ingredient at a time. 58 00:02:43.199 --> 00:02:44.680 I love that analogy. Right, like, 59 00:02:44.639 --> 00:02:48.240 too salty? Next time, lower the salt. Too dry? Add 60 00:02:48.319 --> 00:02:52.159 some water. It repeats this optimization loop thousands of times 61 00:02:52.240 --> 00:02:55.520 until the cake is perfect. It figures out the recipe itself. 62 00:02:55.680 --> 00:02:59.199 What's fascinating here is how we formally define that optimization loop. 63 00:03:00.080 --> 00:03:02.919 In nineteen ninety seven, Tom Mitchell gave us this brilliant 64 00:03:02.960 --> 00:03:06.199 engineering definition that we actually still rely on today. 65 00:03:06.400 --> 00:03:07.680 Oh right, the E, T, P. 
66 00:03:08.159 --> 00:03:11.840 Yes, he said, a computer program learns from experience, which 67 00:03:11.840 --> 00:03:15.120 we call E, with respect to some task T and 68 00:03:15.159 --> 00:03:18.960 some performance measure P. Okay. Crucially, the system is only 69 00:03:19.000 --> 00:03:22.800 actually learning if its performance on the task improves with 70 00:03:22.879 --> 00:03:24.400 the experience. So, 71 00:03:24.479 --> 00:03:26.639 mapping that onto our spam filter for a second, the 72 00:03:26.719 --> 00:03:28.800 task T is flagging the junk mail. 73 00:03:29.000 --> 00:03:29.400 Correct. 74 00:03:29.520 --> 00:03:32.520 The experience E is the training data, right, those massive 75 00:03:32.599 --> 00:03:37.199 piles of spam and normal emails, which data scientists playfully 76 00:03:36.759 --> 00:03:38.400 call ham. Right, spam and ham. 77 00:03:38.759 --> 00:03:41.439 And the performance measure P is the accuracy rate, like 78 00:03:41.479 --> 00:03:43.840 what percentage of the emails did it actually put in 79 00:03:43.840 --> 00:03:45.319 the right folder? Exactly. 80 00:03:45.360 --> 00:03:48.599 And if that percentage goes up as it processes more emails, boom, 81 00:03:48.680 --> 00:03:49.319 it is learning. 82 00:03:49.439 --> 00:03:50.039 It's learning. 83 00:03:50.280 --> 00:03:53.719 And this framework is essential because there are certain problems 84 00:03:54.039 --> 00:03:57.759 where human hard coding just completely, totally fails. Think about 85 00:03:57.759 --> 00:04:00.520 speech recognition. Oh, man. If I ask you to write 86 00:04:00.520 --> 00:04:04.360 a traditional program to detect the word two, you know, 87 00:04:04.439 --> 00:04:06.199 the number two, how do you do it? 88 00:04:06.520 --> 00:04:07.840 I wouldn't even know where to start. 
89 00:04:08.319 --> 00:04:10.439 You might try to hard code a rule looking for 90 00:04:10.479 --> 00:04:13.319 a specific high-frequency sound wave for the letter T. 91 00:04:14.599 --> 00:04:17.360 But how do you mathematically account for a child's voice 92 00:04:17.920 --> 00:04:19.360 versus an adult's? 93 00:04:19.319 --> 00:04:22.959 Right, or like a British accent versus a Southern drawl. Exactly. 94 00:04:23.399 --> 00:04:26.680 What if there's wind noise in the background? The sheer 95 00:04:26.759 --> 00:04:30.839 number of variations approaches infinity. You simply cannot write enough 96 00:04:30.920 --> 00:04:34.079 if-then statements to cover it all. The system must 97 00:04:34.160 --> 00:04:35.079 learn by example. 98 00:04:35.199 --> 00:04:37.800 So if the system requires examples to learn, that kind 99 00:04:37.800 --> 00:04:41.199 of brings up a massive logistical problem. How exactly do 100 00:04:41.240 --> 00:04:42.800 we feed it those examples? 101 00:04:42.920 --> 00:04:43.759 That's the big question. 102 00:04:43.920 --> 00:04:46.199 Are we just dumping raw data into a hard drive 103 00:04:46.240 --> 00:04:49.000 and hoping for the best? Because the material breaks this 104 00:04:49.120 --> 00:04:53.360 down into the different levels of human supervision required during training. 105 00:04:53.279 --> 00:04:56.879 Right, the data doesn't just magically organize itself, sadly. The 106 00:04:56.879 --> 00:04:59.920 most common approach is supervised learning. This is where the 107 00:05:00.120 --> 00:05:02.800 machine basically has a teacher. You don't just feed the 108 00:05:02.920 --> 00:05:06.439 algorithm raw data. You feed it data that already includes 109 00:05:06.480 --> 00:05:08.800 the desired solutions, which we call labels. 
110 00:05:09.079 --> 00:05:11.680 So the spam filter is supervised because you're handing the 111 00:05:11.720 --> 00:05:14.600 machine a stack of emails that a human has explicitly 112 00:05:14.639 --> 00:05:18.160 stamped as spam or ham. You're giving it the answer 113 00:05:18.240 --> 00:05:19.480 key to study from. 114 00:05:19.720 --> 00:05:22.040 Yes, the answer key is crucial here. 115 00:05:21.879 --> 00:05:24.319 And the text points out this works really well for 116 00:05:24.519 --> 00:05:29.959 predicting categories, which is called classification, and predicting numeric values, 117 00:05:29.959 --> 00:05:33.360 which is called regression. Right, like predicting a car's price 118 00:05:33.439 --> 00:05:36.759 based on its mileage. You feed it thousands of examples 119 00:05:36.759 --> 00:05:39.240 of cars where you already know the final sale price. 120 00:05:39.519 --> 00:05:43.040 But the reality is labeled data is a huge luxury. 121 00:05:43.319 --> 00:05:45.639 Most data in the real world just doesn't come with 122 00:05:45.680 --> 00:05:48.639 a neat little answer key. Right, it's just raw. Exactly. 123 00:05:48.800 --> 00:05:50.600 And that's where unsupervised learning comes in. 124 00:05:51.000 --> 00:05:51.240 Here, 125 00:05:51.360 --> 00:05:54.199 the system is essentially just an observer. You feed it 126 00:05:54.279 --> 00:05:57.120 a mountain of completely unlabeled data, and it has to 127 00:05:57.160 --> 00:05:59.839 figure out the underlying structure all on its 128 00:05:59.759 --> 00:06:02.959 own. Which honestly sounds like magic. How does an algorithm 129 00:06:03.040 --> 00:06:05.560 learn anything if you literally don't tell it what to 130 00:06:05.560 --> 00:06:05.959 look for? 131 00:06:06.279 --> 00:06:10.399 It does it by measuring distances in multidimensional space. Take clustering, 132 00:06:10.399 --> 00:06:13.160 for example. Let's say you have a massive data set 133 00:06:13.199 --> 00:06:15.839 of visitors to your blog. 
Okay. You have absolutely no 134 00:06:15.959 --> 00:06:19.240 idea who they are, but the algorithm plots every visitor 135 00:06:19.279 --> 00:06:22.240 on a mathematical graph. Maybe one axis is the time 136 00:06:22.279 --> 00:06:24.680 of day they visit. Another axis is the length of 137 00:06:24.720 --> 00:06:26.959 the articles they read, another is the topic. 138 00:06:27.120 --> 00:06:27.680 Oh, I see. 139 00:06:28.079 --> 00:06:31.319 Suddenly it notices that a huge cluster of data points 140 00:06:31.519 --> 00:06:36.279 are physically very close together in this mathematical space. It realizes, hey, 141 00:06:37.000 --> 00:06:40.480 forty percent of these users always read long-form sci 142 00:06:40.480 --> 00:06:44.120 fi posts on Saturday nights. Wow. It didn't know what 143 00:06:44.160 --> 00:06:47.360 sci fi or Saturday meant emotionally. It just calculated that 144 00:06:47.399 --> 00:06:49.240 those behaviors clustered tightly together. 145 00:06:49.399 --> 00:06:50.079 That's wild. 146 00:06:50.319 --> 00:06:52.600 That's also how we do anomaly detection. Yeah. If a credit 147 00:06:52.600 --> 00:06:56.279 card transaction lands way outside the normal behavioral cluster, the 148 00:06:56.319 --> 00:06:57.600 system flags it as fraud. 149 00:06:57.759 --> 00:07:00.439 Okay, so we have the teacher for supervised and the 150 00:07:00.480 --> 00:07:03.279 observer for unsupervised. But then there is a hybrid, right, 151 00:07:03.399 --> 00:07:06.639 semi-supervised learning. Yes, exactly. And the perfect example of 152 00:07:06.680 --> 00:07:09.000 this is something almost everyone listening has in their pocket 153 00:07:09.079 --> 00:07:10.439 right now: Google Photos. 154 00:07:10.639 --> 00:07:11.759 Oh, such a good example. 155 00:07:11.759 --> 00:07:15.360 When you upload a thousand family photos, the unsupervised part 156 00:07:15.399 --> 00:07:19.160 of the algorithm kicks in. 
First, it mathematically analyzes the 157 00:07:19.160 --> 00:07:22.800 pixels and clusters them, noticing that the exact same face 158 00:07:23.120 --> 00:07:26.120 appears in fifty different pictures. Right. It doesn't know who 159 00:07:26.120 --> 00:07:28.879 that face belongs to, but it knows it's the same object. 160 00:07:29.160 --> 00:07:31.600 Then it turns to you. It asks you to label 161 00:07:31.720 --> 00:07:35.560 just one photo. You type in Mom, and instantly it 162 00:07:35.680 --> 00:07:40.079 propagates that supervised label across the entire unsupervised cluster. 163 00:07:40.879 --> 00:07:43.160 It is incredibly, incredibly efficient. 164 00:07:43.199 --> 00:07:45.279 Well wait, let me push back on this for a second. Sure. 165 00:07:45.439 --> 00:07:48.480 Because what does this all mean for us humans? If 166 00:07:48.560 --> 00:07:52.120 the semi-supervised systems are doing all the heavy algorithmic 167 00:07:52.199 --> 00:07:56.839 lifting of clustering the data in multidimensional space, are 168 00:07:56.839 --> 00:07:58.800 we basically just acting as 169 00:07:58.800 --> 00:08:00.519 cheap labor? 170 00:08:00.680 --> 00:08:03.480 Like, are we just the final manual cog in the 171 00:08:03.519 --> 00:08:05.360 machine providing the text tags? 172 00:08:05.680 --> 00:08:07.879 If we connect this to the bigger picture, you'll see 173 00:08:07.920 --> 00:08:11.720 it's actually a profound economic solution. You have to understand 174 00:08:11.720 --> 00:08:15.040 that labeling data is the single biggest bottleneck in all 175 00:08:15.040 --> 00:08:17.959 of machine learning. Paying humans to sit in a room 176 00:08:18.120 --> 00:08:22.720 and manually tag a million individual photos is prohibitively expensive 177 00:08:22.759 --> 00:08:27.519 and agonizingly slow. Semi-supervised learning isn't about using humans 178 00:08:27.519 --> 00:08:31.480 as cheap labor. 
It's an elegant compromise between machine scalability 179 00:08:31.759 --> 00:08:35.799 and human context. Ah, I get it. The algorithm does 180 00:08:35.840 --> 00:08:39.919 what it does best, processing and sorting raw pixels at 181 00:08:39.919 --> 00:08:42.759 a scale a human mind just couldn't fathom, and the 182 00:08:42.840 --> 00:08:45.840 human does what they do best, which is providing the semantic, 183 00:08:46.200 --> 00:08:50.080 emotional, or factual context in a single keystroke. 184 00:08:50.320 --> 00:08:52.759 I see, so it's really a partnership. Now, for the 185 00:08:52.799 --> 00:08:54.600 sake of being thorough, we have to mention the final 186 00:08:54.639 --> 00:08:58.720 training category here, reinforcement learning. Yes, this is a totally 187 00:08:58.720 --> 00:09:01.080 different beast. There's no labeled answer key, and it's not 188 00:09:01.120 --> 00:09:04.799 just observing clusters. Here, the learning system is called an agent, 189 00:09:04.840 --> 00:09:06.200 and it's placed into an environment. 190 00:09:06.279 --> 00:09:08.519 Think of it like training a dog. Okay. The agent 191 00:09:08.559 --> 00:09:11.559 performs an action, observes the result, and gets either a 192 00:09:11.600 --> 00:09:15.600 reward or a penalty. Over millions of iterations, it constantly 193 00:09:15.679 --> 00:09:19.679 updates what's called its policy. Policy, right. The internal strategy 194 00:09:19.879 --> 00:09:23.120 it uses to decide what action will yield the highest reward 195 00:09:23.159 --> 00:09:27.320 over time. This is how DeepMind's AlphaGo conquered the 196 00:09:27.360 --> 00:09:31.279 world champion at the incredibly complex board game Go. Oh wow. 197 00:09:31.799 --> 00:09:35.080 It didn't just study past games. It played millions of 198 00:09:35.120 --> 00:09:39.440 games against itself, constantly tweaking its policy based on whether 199 00:09:39.480 --> 00:09:41.639 an action led to a win or a loss. 
200 00:09:41.799 --> 00:09:43.879 Okay, so whether you train it with an answer key, 201 00:09:44.159 --> 00:09:47.440 or by clustering unlabeled data, or by letting it play 202 00:09:47.440 --> 00:09:49.879 a million games of Go, we eventually end up with 203 00:09:49.879 --> 00:09:50.639 a trained system. 204 00:09:50.759 --> 00:09:51.080 We do. 205 00:09:51.320 --> 00:09:54.840 But here's the multimillion dollar question: how does it actually 206 00:09:54.879 --> 00:09:56.639 make a prediction on a piece of data it has 207 00:09:56.919 --> 00:10:00.679 literally never seen before? How do we move from memorizing 208 00:10:00.720 --> 00:10:03.960 the past to actually generalizing to the unknown future? 209 00:10:04.399 --> 00:10:06.440 To answer that, we first have to look at the plumbing, 210 00:10:06.759 --> 00:10:08.919 like how is the system digesting data on a day 211 00:10:08.919 --> 00:10:11.080 to day basis? Is it a batch learner or an 212 00:10:11.159 --> 00:10:14.240 online learner? Right. In batch learning, the system trains offline 213 00:10:14.399 --> 00:10:18.440 using all the available data at once. It's computationally heavy. 214 00:10:18.519 --> 00:10:20.360 If you want a batch system to learn about a 215 00:10:20.399 --> 00:10:22.639 new type of spam that appeared this morning, you can't 216 00:10:22.679 --> 00:10:24.360 just teach it the new trick. You have to start 217 00:10:24.360 --> 00:10:26.440 over. Exactly. You have to shut it down, mix the 218 00:10:26.480 --> 00:10:29.720 new data with the millions of old emails, and retrain 219 00:10:29.799 --> 00:10:31.840 the entire model from scratch, 220 00:10:31.639 --> 00:10:35.919 which is wildly inefficient if you're dealing with fast-changing environments. 221 00:10:36.480 --> 00:10:39.480 And that's why online learning is so crucial. 
Yes, instead 222 00:10:39.480 --> 00:10:43.919 of massive offline dumps, you feed the data to the 223 00:10:43.960 --> 00:10:47.679 system incrementally, either one by one or in small groups 224 00:10:47.720 --> 00:10:51.159 called mini batches. It learns on the fly. Very nimble. 225 00:10:51.679 --> 00:10:54.360 And the text highlights a critical mechanism here called the 226 00:10:54.440 --> 00:10:55.080 learning rate. 227 00:10:55.279 --> 00:10:58.600 The learning rate is just a mathematical parameter that controls 228 00:10:58.639 --> 00:11:03.080 how aggressively the algorithm updates its internal rules when 229 00:11:03.080 --> 00:11:04.000 it sees new data. 230 00:11:04.039 --> 00:11:06.159 So think of it like two different types of stock traders. 231 00:11:06.279 --> 00:11:08.960 A trader with a high learning rate is highly reactive. 232 00:11:09.039 --> 00:11:12.720 Right, they see one bad quarterly report and immediately dump 233 00:11:12.759 --> 00:11:16.039 all their shares, completely forgetting the company's ten year history 234 00:11:16.080 --> 00:11:20.080 of success. They adapt fast, but they're volatile. Very volatile. 235 00:11:20.240 --> 00:11:22.919 But a trader with a low learning rate is stubborn. 236 00:11:23.360 --> 00:11:25.879 They rely heavily on the ten year historical average and 237 00:11:26.080 --> 00:11:30.200 barely react to today's news. They are stable, but they 238 00:11:30.279 --> 00:11:33.360 might miss a sudden market crash. The algorithm has to 239 00:11:33.399 --> 00:11:36.679 balance that exact same tension mathematically. 240 00:11:36.000 --> 00:11:39.840 Precisely. Now, regardless of the plumbing, whether you use batch 241 00:11:40.000 --> 00:11:43.360 or online learning, the algorithm needs a fundamental strategy to 242 00:11:43.480 --> 00:11:47.360 generalize to a new unseen piece of data. 
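The two-trader picture of the learning rate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the update rule is a plain running-estimate update (an exponential moving average), not an algorithm taken from the book, and the function name `online_estimate` and the `reports` numbers are made up for the example.

```python
def online_estimate(stream, learning_rate, start):
    """Update a single running estimate one observation at a time."""
    estimate = start
    for value in stream:
        # Move the estimate toward the new value by a fraction
        # controlled by the learning rate (0 = ignore, 1 = replace).
        estimate += learning_rate * (value - estimate)
    return estimate


# Hypothetical stream: ten steady quarterly reports, then one bad one.
reports = [100.0] * 10 + [40.0]

# High learning rate: reacts sharply to the single bad report.
reactive = online_estimate(reports, learning_rate=0.9, start=100.0)

# Low learning rate: barely moves away from the long-run history.
stubborn = online_estimate(reports, learning_rate=0.05, start=100.0)

print(reactive, stubborn)
```

With these invented numbers the reactive estimate drops most of the way toward the bad report, while the stubborn one stays close to the long-run level, which is the same stability versus adaptability tension the traders illustrate.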
Okay, and 243 00:11:47.399 --> 00:11:50.759 there are two primary mechanisms for this, instance-based learning 244 00:11:51.159 --> 00:11:52.120 and model-based learning. 245 00:11:52.159 --> 00:11:52.960 Let's break those down. 246 00:11:53.000 --> 00:11:57.159 Instance-based learning is essentially memorization. The algorithm stores the 247 00:11:57.320 --> 00:12:01.279 entire training data set. When a new email arrives, it calculates 248 00:12:01.320 --> 00:12:04.759 a mathematical distance, a similarity measure, between the new email 249 00:12:04.879 --> 00:12:06.000 and the ones it's memorized. 250 00:12:06.039 --> 00:12:07.360 So it's comparing. Yes. 251 00:12:07.759 --> 00:12:10.519 For example, it might literally count the number of matching words. 252 00:12:11.320 --> 00:12:14.039 If the new email shares eighty percent of its vocabulary 253 00:12:14.039 --> 00:12:17.440 with a known spam email, the algorithm says, it's close 254 00:12:17.519 --> 00:12:18.519 enough: spam. 255 00:12:18.799 --> 00:12:21.120 Here's where it gets really interesting for you, the learner. 256 00:12:21.879 --> 00:12:25.200 Instance-based learning is basically like a student who memorizes 257 00:12:25.360 --> 00:12:28.039 every single practice question before the physics final. 258 00:12:28.279 --> 00:12:28.799 Exactly. 259 00:12:28.879 --> 00:12:31.480 If the exam question is identical to the practice, they 260 00:12:31.519 --> 00:12:34.840 totally ace it. If it's slightly reworded, they might still 261 00:12:34.879 --> 00:12:38.879 guess right by noticing the similarities. But model-based learning 262 00:12:38.919 --> 00:12:43.600 is entirely different. It's like actually learning the underlying physics formula. 263 00:12:43.639 --> 00:12:45.639 Once you build the formula, the model, you can just 264 00:12:45.720 --> 00:12:47.960 throw the practice tests away. You can solve any new 265 00:12:48.039 --> 00:12:48.960 question they throw at you. 
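The word-counting idea behind instance-based learning can be made concrete with a small Python sketch. It is a minimal illustration, assuming a simple word-overlap similarity and a nearest-neighbor lookup; the helper names `word_overlap` and `classify` and the four memorized emails are hypothetical, and real spam filters use far richer features.

```python
def word_overlap(a: str, b: str) -> float:
    """Fraction of the distinct words in `a` that also appear in `b`."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a:
        return 0.0
    return len(words_a & words_b) / len(words_a)


def classify(new_email: str, memorized: list) -> str:
    """Return the label of the most similar memorized email."""
    # Nothing is "learned" ahead of time: every stored example is
    # compared against the new email at prediction time.
    _, best_label = max(
        memorized, key=lambda pair: word_overlap(new_email, pair[0])
    )
    return best_label


# Hypothetical, hand-made training set of (text, label) pairs.
memorized = [
    ("win a free prize now click here", "spam"),
    ("cheap pills free shipping click now", "spam"),
    ("lunch meeting moved to noon tomorrow", "ham"),
    ("here are the notes from the meeting", "ham"),
]

print(classify("click here to win a free prize", memorized))
```

Note that the whole training set has to be kept around, which is exactly the difference from the model-based approach, where the examples can be discarded once the model is built.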
266 00:12:49.320 --> 00:12:53.080 Let's make that concrete. The material uses a fantastic real 267 00:12:53.120 --> 00:12:58.840 world example comparing the OECD Better Life Index with IMF GDP data. 268 00:12:58.919 --> 00:13:00.279 Oh I love this part. 269 00:13:00.480 --> 00:13:04.159 Suppose you plot countries on a graph. The horizontal axis 270 00:13:04.200 --> 00:13:06.679 is GDP per capita, meaning how rich the country is. 271 00:13:07.120 --> 00:13:10.679 The vertical axis is life satisfaction, how happy the citizens are. 272 00:13:10.799 --> 00:13:12.559 When you look at the dots, it's a bit scattered, 273 00:13:12.879 --> 00:13:16.440 but you can definitely see a general upward trend. As 274 00:13:16.519 --> 00:13:19.960 money goes up, happiness tends to go up. So the 275 00:13:20.039 --> 00:13:23.440 algorithm decides to build a linear model. It draws a 276 00:13:23.480 --> 00:13:26.320 straight line right through the middle of those scattered dots. 277 00:13:26.240 --> 00:13:29.240 And that straight line is defined by parameters. Right, just 278 00:13:29.320 --> 00:13:32.039 like back in high school algebra, y equals mx plus 279 00:13:32.039 --> 00:13:35.360 b. Exactly like that. The algorithm basically has dials it 280 00:13:35.399 --> 00:13:38.519 can turn. It can change the intercept, where the line starts, 281 00:13:38.559 --> 00:13:41.120 and it can change the slope, how steep the line 282 00:13:40.879 --> 00:13:43.919 is. Exactly. But this raises an important question. How does 283 00:13:43.919 --> 00:13:46.320 the algorithm actually know if the line it drew is 284 00:13:46.360 --> 00:13:46.759 any good? 285 00:13:46.840 --> 00:13:47.960 Right? Who's grading it? 286 00:13:48.600 --> 00:13:52.279 This is the very heart of how machines learn. The 287 00:13:52.360 --> 00:13:56.279 algorithm uses a cost function. 
The cost function measures the 288 00:13:56.320 --> 00:13:59.799 literal physical distance on the graph between the model's straight 289 00:13:59.799 --> 00:14:02.879 line and the actual data dots. Okay. If the line 290 00:14:02.919 --> 00:14:05.200 is drawn too low, the gap between the line and 291 00:14:05.240 --> 00:14:07.519 the dots is large and the cost is high. 292 00:14:07.639 --> 00:14:11.240 So the algorithm's entire purpose in life is to minimize 293 00:14:11.279 --> 00:14:14.200 that cost function. It turns the dial to adjust the 294 00:14:14.200 --> 00:14:17.200 slope of the line. Then it recalculates the distances. Did 295 00:14:17.200 --> 00:14:19.960 the gap get smaller? Yes? Turn the dial a bit more. 296 00:14:20.480 --> 00:14:22.720 Did the gap get bigger? Whoops, turned it too far, 297 00:14:22.840 --> 00:14:27.360 turn it back. It is just a relentless mathematical optimization problem. 298 00:14:27.519 --> 00:14:30.120 Find the exact slope and height where the line is 299 00:14:30.120 --> 00:14:32.519 as close to all the dots as physically 300 00:14:32.120 --> 00:14:35.679 possible. And once that optimization is done, you have your model. 301 00:14:35.960 --> 00:14:38.159 If a brand new country emerges tomorrow, you don't need 302 00:14:38.200 --> 00:14:41.360 to look at historical instances. You just plug their GDP 303 00:14:41.519 --> 00:14:44.440 into your perfectly sloped line and it spits out a 304 00:14:44.480 --> 00:14:46.120 predicted life satisfaction score. 305 00:14:46.399 --> 00:14:49.080 But hold on a second. If learning is literally just 306 00:14:49.120 --> 00:14:52.559 turning dials to minimize a mathematical cost function, why do 307 00:14:52.639 --> 00:14:56.720 these models still make embarrassing, catastrophic, or even dangerous mistakes 308 00:14:56.759 --> 00:14:57.399 in the real world? 309 00:14:57.559 --> 00:14:59.679 Yeah, it's a huge problem. 310 00:14:59.360 --> 00:15:00.600 Because the math is objective. 
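The dial-turning loop described above can be sketched as gradient descent on a straight line. This is a hedged illustration rather than the book's own code: it assumes mean squared error as the cost function, uses invented toy numbers in place of the real OECD and IMF data, and the names `cost` and `fit` are hypothetical.

```python
# Hypothetical toy data, invented for illustration only:
# GDP per capita (tens of thousands of USD) and life satisfaction.
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
satisfaction = [5.1, 5.6, 6.0, 6.6, 7.0]


def cost(slope, intercept):
    """Mean squared vertical gap between the line and the data points."""
    gaps = [(slope * x + intercept - y) ** 2 for x, y in zip(gdp, satisfaction)]
    return sum(gaps) / len(gaps)


def fit(learning_rate=0.02, steps=5000):
    """Nudge the two dials step by step in the direction that lowers the cost."""
    slope, intercept = 0.0, 0.0
    n = len(gdp)
    for _ in range(steps):
        residuals = [slope * x + intercept - y for x, y in zip(gdp, satisfaction)]
        # Gradients of the mean squared error with respect to each dial.
        grad_slope = 2 * sum(r * x for r, x in zip(residuals, gdp)) / n
        grad_intercept = 2 * sum(residuals) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


slope, intercept = fit()

# Predict for a hypothetical new country with GDP per capita of 3.5.
prediction = slope * 3.5 + intercept
print(slope, intercept, prediction)
```

On this toy data the loop settles at a slope of about 0.48 and an intercept of about 4.62. Once those two numbers are found, the data points are no longer needed to make a prediction.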
311 00:15:00.720 --> 00:15:00.919 Right. 312 00:15:01.279 --> 00:15:03.679 This brings us to the absolute core of the issue, 313 00:15:04.200 --> 00:15:06.840 the Achilles heel of everything we've talked about so far, 314 00:15:07.399 --> 00:15:10.120 the garbage in, garbage out dilemma. You can have the 315 00:15:10.159 --> 00:15:13.159 most elegant optimization loop on the planet, but if you 316 00:15:13.240 --> 00:15:17.120 suffer from bad data or a bad algorithm, you are doomed. 317 00:15:17.879 --> 00:15:20.960 Let's start with bad data, specifically the raw quantity of it. 318 00:15:21.120 --> 00:15:24.799 The sheer volume of data required is just staggering. There 319 00:15:24.840 --> 00:15:27.720 is a landmark two thousand and one paper by Microsoft 320 00:15:27.759 --> 00:15:31.440 researchers Michele Banko and Eric Brill that actually proved this. Right. 321 00:15:31.879 --> 00:15:34.840 They took a highly complex natural language problem and they 322 00:15:34.879 --> 00:15:38.279 tested several very different machine learning algorithms on it. Some 323 00:15:38.399 --> 00:15:42.000 highly sophisticated, some fairly basic. They found that as long 324 00:15:42.039 --> 00:15:44.799 as they fed the algorithms enough data, all of them 325 00:15:44.840 --> 00:15:48.759 performed almost identically well. Peter Norvig later coined a phrase 326 00:15:48.799 --> 00:15:51.840 for this, the unreasonable effectiveness of data. 327 00:15:52.200 --> 00:15:54.559 The unreasonable effectiveness of data. 328 00:15:54.679 --> 00:15:57.000 I love that. It was a paradigm shift. It suggested 329 00:15:57.039 --> 00:15:59.799 that complex logic often loses to simple logic backed by 330 00:16:00.000 --> 00:16:00.960 mountains of experience. 
331 00:16:01.000 --> 00:16:03.120 Okay, wait, though. If that two thousand and one Microsoft 332 00:16:03.159 --> 00:16:06.320 paper proved that giving a mediocre algorithm a billion data 333 00:16:06.360 --> 00:16:10.080 points makes it perform brilliantly, why on earth are Silicon 334 00:16:10.159 --> 00:16:13.440 Valley companies paying millions of dollars to AI researchers? 335 00:16:13.600 --> 00:16:14.240 Good question. 336 00:16:14.440 --> 00:16:18.960 Why not just fire the algorithm development team, save the cash, 337 00:16:19.000 --> 00:16:21.960 and just buy more server space to hoard more data, 338 00:16:22.159 --> 00:16:24.440 just, you know, brute force the problem? 339 00:16:24.559 --> 00:16:27.000 It is a totally tempting thought. But you have to 340 00:16:27.039 --> 00:16:31.320 ground this in the realities of the physical world. Yes, 341 00:16:31.679 --> 00:16:35.720 for massive tasks like global image recognition or large language models, 342 00:16:36.000 --> 00:16:39.480 tech giants can brute force it with endless data. But 343 00:16:39.600 --> 00:16:43.679 for ninety nine percent of real world applications, massive data 344 00:16:43.840 --> 00:16:47.480 simply doesn't exist. If you are a hospital trying to 345 00:16:47.480 --> 00:16:51.240 predict a rare genetic disease, you don't have billions of patients. 346 00:16:51.279 --> 00:16:52.360 You might have a few hundred. 347 00:16:52.480 --> 00:16:53.120 That makes sense. 348 00:16:53.159 --> 00:16:55.799 If you're a mid-sized retailer optimizing your supply chain, 349 00:16:56.039 --> 00:16:59.279 you have limited, noisy data. You can't fire the algorithm 350 00:16:59.320 --> 00:17:02.440 team because getting extra data is either physically impossible or 351 00:17:02.480 --> 00:17:06.440 prohibitively expensive. You need brilliant algorithms that can extract maximum 352 00:17:06.480 --> 00:17:07.640 signal from minimal noise. 
353 00:17:07.799 --> 00:17:10.000 And it's not just about the quantity of the data. 354 00:17:10.240 --> 00:17:14.519 The quality is arguably more dangerous. Your data absolutely must be 355 00:17:14.440 --> 00:17:15.880 representative. Oh, without a doubt. 356 00:17:15.960 --> 00:17:18.599 If your training data doesn't perfectly mirror the real world, 357 00:17:19.119 --> 00:17:23.640 your algorithm will learn the wrong lessons with absolute mathematical certainty. 358 00:17:24.319 --> 00:17:27.440 And the text highlights one of the greatest cautionary tales 359 00:17:27.480 --> 00:17:32.000 in statistics for this, the nineteen thirty six Literary Digest 360 00:17:31.640 --> 00:17:33.680 poll. Such a classic example. 361 00:17:33.400 --> 00:17:36.720 This magazine wanted to predict the US presidential election between 362 00:17:36.720 --> 00:17:40.119 Alf Landon and Franklin D. Roosevelt, so they did 363 00:17:40.119 --> 00:17:43.000 what any data enthusiast would do. They went massive. They 364 00:17:43.039 --> 00:17:47.039 sent out ten million surveys and they got two point 365 00:17:47.039 --> 00:17:51.279 four million responses back. It was an astronomically large data set, 366 00:17:51.559 --> 00:17:55.200 and based on that data, they predicted Landon would crush Roosevelt, 367 00:17:55.359 --> 00:17:57.599 taking fifty seven percent of the vote. 368 00:17:57.359 --> 00:18:00.759 And yet Roosevelt won in a landslide with sixty two 369 00:18:00.799 --> 00:18:03.319 percent of the vote. The prediction wasn't just slightly off, 370 00:18:03.400 --> 00:18:05.160 it was completely inverted. Exactly. 371 00:18:05.400 --> 00:18:08.359 And the reason why is a textbook case of sampling bias. 372 00:18:08.720 --> 00:18:11.119 To get the ten million addresses to send the polls 373 00:18:11.160 --> 00:18:15.680 to, the magazine used telephone directories, club membership lists, and magazine 374 00:18:15.720 --> 00:18:16.599 subscriber lists. 
375 00:18:16.960 --> 00:18:18.200 I see where this is going. 376 00:18:18.279 --> 00:18:20.599 Right, because you have to think about the environment of 377 00:18:20.680 --> 00:18:23.640 nineteen thirty six. Who actually had a telephone in the 378 00:18:23.680 --> 00:18:27.559 middle of the Great Depression? Wealthier people. Wealthier people, and 379 00:18:27.680 --> 00:18:31.680 wealthier people tended to lean Republican. So their massive data 380 00:18:31.720 --> 00:18:35.799 set completely excluded the working class. The algorithm of their 381 00:18:35.839 --> 00:18:39.640 poll wasn't flawed. The data it ingested was poisoned from 382 00:18:39.680 --> 00:18:42.000 the start. Garbage in, garbage 383 00:18:41.599 --> 00:18:44.920 out. And that is a failure of data. But we 384 00:18:44.960 --> 00:18:47.599 also have to examine the failure of the algorithm itself. 385 00:18:48.079 --> 00:18:50.920 The most insidious trap in machine learning is a concept 386 00:18:50.920 --> 00:18:52.000 called overfitting. 387 00:18:52.119 --> 00:18:52.880 Overfitting. 388 00:18:53.119 --> 00:18:56.480 This happens when the algorithm performs flawlessly on the training 389 00:18:56.559 --> 00:19:00.000 data but fails entirely when it faces the real world. 390 00:19:00.039 --> 00:19:02.319 I really love the analogy used for this. Imagine you're 391 00:19:02.359 --> 00:19:05.359 a tourist visiting a foreign country for the very first time. Okay, 392 00:19:05.519 --> 00:19:07.920 you get into a taxi and the driver blatantly rips 393 00:19:07.920 --> 00:19:10.720 you off. If you conclude that every single taxi driver 394 00:19:10.839 --> 00:19:13.920 in the entire country is a thief, you are overfitting. Yes. 395 00:19:14.200 --> 00:19:16.839 You took a tiny, noisy anomaly in your personal data 396 00:19:16.880 --> 00:19:19.720 set and drew a massive, sweeping rule from it. 397 00:19:20.000 --> 00:19:24.119 Mathematically, overfitting happens when a model is just too complex. 
398 00:19:24.880 --> 00:19:28.279 We talked about turning dials earlier. In data science, we 399 00:19:28.319 --> 00:19:30.200 call those dials degrees of freedom. 400 00:19:30.279 --> 00:19:31.279 Degrees of freedom. 401 00:19:31.359 --> 00:19:33.960