WEBVTT 1 00:00:00.080 --> 00:00:02.960 Okay, so have you ever felt like you're just drowning 2 00:00:02.960 --> 00:00:06.440 in information, you know, for a project, or maybe getting 3 00:00:06.440 --> 00:00:08.320 ready for a meeting, or even just trying to learn 4 00:00:08.359 --> 00:00:10.640 something new, and you just wish someone could kind of 5 00:00:10.640 --> 00:00:11.519 boil it all down? 6 00:00:11.720 --> 00:00:14.080 Yeah, just give you the essentials, right? What really 7 00:00:13.919 --> 00:00:16.800 matters. Exactly. And maybe, you know, throw in a few 8 00:00:16.839 --> 00:00:20.039 surprising bits to keep it interesting. Well, if that sounds 9 00:00:20.039 --> 00:00:23.199 like you, you are definitely in the right place, because 10 00:00:23.239 --> 00:00:24.839 that's what we do here on the Deep Dive. We're 11 00:00:24.879 --> 00:00:28.399 sort of your shortcut to getting properly informed, and today 12 00:00:28.640 --> 00:00:32.039 we're taking a deep dive into your source material. These 13 00:00:32.079 --> 00:00:36.280 are excerpts from Easily Practical Machine Learning Algorithms with Python 14 00:00:36.759 --> 00:00:38.479 by Dr. Darren Thomas. 15 00:00:38.640 --> 00:00:42.960 Yeah, and this isn't like your standard dense textbook. The author, 16 00:00:43.000 --> 00:00:46.880 Dr. Thomas, he's got a PhD, loads of teaching experience. 17 00:00:46.479 --> 00:00:50.159 And get this, a background in saxophone performance. Right, but 18 00:00:50.200 --> 00:00:52.920 his passion for machine learning led him to use these 19 00:00:52.920 --> 00:00:56.840 algorithms in education. He's a lecturer now at Asia Pacific 20 00:00:56.920 --> 00:00:58.560 International University. And 21 00:00:58.479 --> 00:01:01.159 the book's aim, which is really, really key for us today, 22 00:01:01.439 --> 00:01:05.239 is to be simple, easy to follow, a kind of 23 00:01:05.400 --> 00:01:09.599 condensed guide for actually using these algorithms with Python. Exactly.
24 00:01:09.840 --> 00:01:12.719 He even says his goal was always to show what 25 00:01:12.760 --> 00:01:14.920 to do, rather than talk a lot about how to 26 00:01:14.959 --> 00:01:17.680 do it. So less heavy theory, more hands-on application. 27 00:01:17.840 --> 00:01:22.400 Okay, so, important note, then: the book, and this deep 28 00:01:22.400 --> 00:01:24.959 dive too, sort of assumes you've already got some background 29 00:01:25.000 --> 00:01:27.079 in Python, maybe data science, stats. 30 00:01:27.200 --> 00:01:30.159 Yeah, it's more for folks looking to build on existing skills, 31 00:01:30.680 --> 00:01:33.439 maybe not for absolute beginners to data science itself. 32 00:01:33.599 --> 00:01:36.480 Right. So our mission today: we want to unpack some 33 00:01:36.560 --> 00:01:39.719 of the most common machine learning algorithms. We'll look at 34 00:01:39.760 --> 00:01:41.840 classification, that's predicting 35 00:01:41.640 --> 00:01:43.879 categories, like spam or not spam. 36 00:01:43.680 --> 00:01:49.159 Exactly, and numeric prediction, predicting continuous values like, say, house prices. 37 00:01:49.719 --> 00:01:54.400 We'll get into how they're used, their surprising upsides, their challenges, 38 00:01:54.840 --> 00:01:57.280 and crucially, how you actually figure out if the model's 39 00:01:57.280 --> 00:01:58.560 any good and how to make it better. 40 00:01:58.640 --> 00:02:00.640 Yeah, judging and improving them. Key. 41 00:02:00.560 --> 00:02:02.760 Totally. So, ready to jump in? Let's unpack this. 42 00:02:02.920 --> 00:02:04.000 Let's do it. Okay. 43 00:02:04.040 --> 00:02:08.479 First up, decision trees. You can think of these as 44 00:02:08.520 --> 00:02:11.000 like the foundation for a lot of predictive stuff. 45 00:02:11.280 --> 00:02:13.919 Right. At its heart, a decision tree is just a 46 00:02:13.960 --> 00:02:17.680 way to classify things or predict numbers by splitting your 47 00:02:17.719 --> 00:02:20.599 data up.
Splitting it how? It basically keeps dividing the 48 00:02:20.639 --> 00:02:23.479 sample into smaller and smaller groups, trying to make each 49 00:02:23.520 --> 00:02:26.319 little group as similar as possible inside. 50 00:02:26.159 --> 00:02:28.719 So it looks like a tree visually, like a flowchart? 51 00:02:28.520 --> 00:02:30.400 Exactly, like a flowchart. You start at the 52 00:02:30.400 --> 00:02:32.719 top, the root node. That's your first big decision point. 53 00:02:33.039 --> 00:02:36.599 Then you follow branches down through more decision nodes, making more 54 00:02:36.479 --> 00:02:39.120 splits until you hit the end, the leaf nodes. 55 00:02:39.240 --> 00:02:41.840 Yep, the leaf nodes. That's where you get your final prediction. 56 00:02:42.000 --> 00:02:43.800 And how does it decide where to split? 57 00:02:44.120 --> 00:02:48.599 Well, for classification, it often uses something called entropy. High 58 00:02:48.680 --> 00:02:51.439 entropy means things are really mixed up. The tree tries 59 00:02:51.479 --> 00:02:55.360 to make splits that reduce that entropy, creating purer groups. 60 00:02:55.240 --> 00:02:59.800 Lower entropy, more homogeneous. Got it. And for predicting numbers? 61 00:03:00.080 --> 00:03:04.919 Prediction, that uses metrics like mean squared error, MSE, or 62 00:03:04.919 --> 00:03:06.879 maybe R-squared to guide the splits. 63 00:03:07.199 --> 00:03:10.000 Okay, so what's great about them? Why start here? 64 00:03:10.240 --> 00:03:13.520 Well, a big plus is flexibility. They handle missing data 65 00:03:13.560 --> 00:03:16.639 pretty well. They don't really care if your data isn't, 66 00:03:16.879 --> 00:03:19.879 you know, perfectly normal, a nice bell curve. Right, and 67 00:03:19.919 --> 00:03:22.159 you don't even have to use all your variables. Plus, 68 00:03:22.240 --> 00:03:25.360 and this is a big one, they're relatively easy to interpret.
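As an illustration of the entropy idea the hosts describe, here is a minimal sketch in plain Python. It is not taken from the book; the function names and the toy alive/dead labels are assumptions made up for this example. It scores one candidate split by how much it lowers entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent group minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical toy labels: a perfectly mixed parent group and one candidate split.
parent = ["alive"] * 5 + ["dead"] * 5
left = ["alive"] * 4 + ["dead"] * 1
right = ["alive"] * 1 + ["dead"] * 4

print(entropy(parent))                        # 1.0 bit: a 50/50 mix is as impure as it gets
print(information_gain(parent, left, right))  # roughly 0.28: the split makes both groups purer
```

A tree builder would compute this gain for many candidate splits and keep the one with the largest value.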
69 00:03:25.599 --> 00:03:28.120 Ah, so you don't need a math PhD to figure 70 00:03:28.120 --> 00:03:28.879 out what it's doing. 71 00:03:29.240 --> 00:03:32.759 Pretty much. You can literally trace a path down the 72 00:03:32.800 --> 00:03:34.840 tree and see the reasoning. That 73 00:03:34.879 --> 00:03:38.560 transparency sounds really useful, especially if you need to explain 74 00:03:38.759 --> 00:03:40.199 why a prediction was made. 75 00:03:40.319 --> 00:03:44.599 Absolutely. Imagine telling someone why their loan was denied. Showing 76 00:03:44.639 --> 00:03:48.800 them a simple tree is way easier than explaining, you know, 77 00:03:48.960 --> 00:03:51.960 complex equations from some other models. Builds trust. 78 00:03:52.400 --> 00:03:55.639 Okay, but there's always a catch, right? What's the downside? 79 00:03:55.680 --> 00:03:57.840 The main one is that they can get really complex 80 00:03:57.840 --> 00:04:00.840 if you let them grow too much, really deep trees, 81 00:04:00.919 --> 00:04:03.319 and that leads to, that often leads to, overfitting. 82 00:04:03.439 --> 00:04:06.360 Overfitting, like it learns the training data too well? 83 00:04:06.360 --> 00:04:09.560 Exactly, it fits the specific sample data perfectly, maybe even 84 00:04:09.599 --> 00:04:12.800 the noise. But then it can't generalize well to new 85 00:04:12.840 --> 00:04:14.520 data it hasn't seen before. 86 00:04:14.319 --> 00:04:18.120 Like memorizing answers instead of understanding the concept. Perfect analogy. 87 00:04:18.879 --> 00:04:22.120 And of course, a super complex tree, even though it's visual, 88 00:04:22.199 --> 00:04:24.000 can still be hard to explain easily. 89 00:04:24.040 --> 00:04:26.720 Okay, let's make it concrete. The source had an example, right, 90 00:04:26.759 --> 00:04:27.839 a cancer data 91 00:04:27.639 --> 00:04:31.480 set? Yeah, predicting health status, alive or dead.
The model 92 00:04:31.560 --> 00:04:34.839 mainly used variables like time and age for its splits. 93 00:04:35.079 --> 00:04:36.199 Then how did it perform? 94 00:04:36.399 --> 00:04:38.720 It got seventy eight percent accuracy on the data it 95 00:04:38.759 --> 00:04:42.079 trained on. Okay. But then on the unseen test data 96 00:04:42.120 --> 00:04:44.120 it dropped slightly to seventy three percent. 97 00:04:44.439 --> 00:04:47.199 That drop, five percent, is that bad or expected? 98 00:04:47.360 --> 00:04:50.839 That's actually pretty common and often expected. It shows it's generalizing okay, 99 00:04:51.040 --> 00:04:54.079 still performing decently on new stuff. It learned patterns and 100 00:04:54.120 --> 00:04:54.639 applied them. 101 00:04:54.759 --> 00:04:58.240 Right. Now, what about using it for numeric prediction, same 102 00:04:58.319 --> 00:05:00.639 data set, but predicting age? Yep. 103 00:05:01.000 --> 00:05:03.399 So here, instead of looking at purity like with Gini, 104 00:05:03.759 --> 00:05:07.000 the tree uses MSE, mean squared error, in the nodes, 105 00:05:07.639 --> 00:05:10.720 and the leaves predict the average age for that group. 106 00:05:10.519 --> 00:05:12.279 And which variables were important there? 107 00:05:12.399 --> 00:05:15.120 The source mentioned ph.karno and meal.cal. 108 00:05:14.959 --> 00:05:16.480 And the results? 109 00:05:16.879 --> 00:05:20.160 Well, the correlation between the actual age and predicted age 110 00:05:20.240 --> 00:05:22.680 was okay on the training data, about point five to 111 00:05:22.720 --> 00:05:26.480 four. Moderate. Moderate, yeah. But on the test data it 112 00:05:26.560 --> 00:05:28.399 dropped way down to point one eight. 113 00:05:28.480 --> 00:05:31.360 Ouch, big drop. What about the error, the
114 00:05:31.399 --> 00:05:34.160 MSE? MSE was sixty one point eight on the training set, 115 00:05:34.240 --> 00:05:36.839 but jumped up to eighty five point twenty four on 116 00:05:36.920 --> 00:05:37.600 the test set. 117 00:05:37.720 --> 00:05:39.199 So again, that drop tells us? 118 00:05:39.319 --> 00:05:42.040 It tells us while it learned something, it really struggled 119 00:05:42.079 --> 00:05:45.879 to generalize the age prediction to new people. Highlights that 120 00:05:45.959 --> 00:05:48.120 overfitting risk with single trees, which 121 00:05:48.000 --> 00:05:50.439 leads us perfectly into the next one, random forest. This 122 00:05:50.519 --> 00:05:50.959 sounds cool. 123 00:05:51.040 --> 00:05:53.279 Yeah, this is where it gets really interesting. Random forest 124 00:05:53.360 --> 00:05:56.920 tackles that overfitting problem head on. How? Instead of building 125 00:05:56.959 --> 00:06:00.120 just one decision tree, it builds hundreds, maybe even thousands 126 00:06:00.160 --> 00:06:00.399 of them. 127 00:06:00.439 --> 00:06:02.480 Wow, okay. Where does the random part come 128 00:06:02.519 --> 00:06:05.839 in? Two places. First, each tree is built using a 129 00:06:05.920 --> 00:06:09.079 different random sample of your data drawn with replacement, called 130 00:06:09.079 --> 00:06:12.519 bootstrapping. Second, at each split point in a tree, 131 00:06:12.600 --> 00:06:15.439 it only considers a random subset of your available features. 132 00:06:15.600 --> 00:06:18.040 So not every tree sees all the data, and not 133 00:06:18.160 --> 00:06:20.560 every split considers all the factors. Exactly. 134 00:06:20.800 --> 00:06:23.959 And the idea is you get lots of slightly different trees, 135 00:06:24.079 --> 00:06:26.720 none of them perfect, but hopefully their errors are kind 136 00:06:26.759 --> 00:06:28.279 of random and cancel each other out.
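The two sources of randomness just described can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names and the feature list are hypothetical, and a real random forest library does considerably more than this.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and others are left out."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, k, rng):
    """Pick k distinct features to be considered at a single split."""
    return rng.sample(features, k)

rng = random.Random(0)                           # seeded so the sketch is repeatable
rows = list(range(10))                           # stand-in for ten training rows
features = ["age", "income", "sex", "illness"]   # hypothetical feature names

sample = bootstrap_sample(rows, rng)             # data for one tree
subset = random_feature_subset(features, 2, rng) # candidates for one split

print(sample)
print(subset)
```

Each tree in the forest would get its own bootstrap sample, and each split inside a tree would draw a fresh feature subset.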
137 00:06:28.360 --> 00:06:30.560 So how does it make a final prediction with all 138 00:06:30.600 --> 00:06:31.279 those trees? 139 00:06:31.759 --> 00:06:35.000 It's pretty democratic, actually. For classification, it's just a majority 140 00:06:35.079 --> 00:06:38.480 vote: whichever prediction most trees make 141 00:06:38.680 --> 00:06:42.199 wins. Simple enough. And for predicting numbers? 142 00:06:41.920 --> 00:06:44.639 It just averages the predictions from all the individual trees. 143 00:06:44.800 --> 00:06:47.279 And this whole wisdom of the crowd thing really helps 144 00:06:47.279 --> 00:06:48.879 with overfitting? Massively. 145 00:06:49.519 --> 00:06:53.800 That aggregation step makes the model much more robust and 146 00:06:53.959 --> 00:06:57.959 way less prone to overfitting compared to a single complex 147 00:06:58.000 --> 00:06:59.120 decision tree. So 148 00:06:59.160 --> 00:07:03.000 the benefits seem clear: less overfitting, works well even if 149 00:07:03.040 --> 00:07:05.800 you don't have tons of data, handles missing values. 150 00:07:06.120 --> 00:07:09.439 Sounds great. It often is. It's a very popular, reliable 151 00:07:09.480 --> 00:07:10.720 algorithm for those reasons. 152 00:07:10.800 --> 00:07:13.959 Okay, but the drawback? You mentioned transparency with decision trees, 153 00:07:14.079 --> 00:07:14.720 what about here? 154 00:07:15.040 --> 00:07:18.920 Oh yeah, that's the main trade-off. With potentially thousands 155 00:07:18.959 --> 00:07:21.839 of trees, you can't just visualize it like a single 156 00:07:21.839 --> 00:07:22.560 flowchart 157 00:07:22.399 --> 00:07:25.040 anymore. So it becomes a black box? Pretty much. 158 00:07:25.480 --> 00:07:27.480 You know what goes in, you know what comes out, 159 00:07:27.560 --> 00:07:31.480 but explaining exactly how it arrived at that specific prediction 160 00:07:31.759 --> 00:07:32.639 is really hard. 161 00:07:32.879 --> 00:07:33.959 That's a problem when?
162 00:07:34.040 --> 00:07:36.680 When you need to explain the why, like we said, 163 00:07:36.879 --> 00:07:41.040 loan applications, medical diagnoses. If you can't explain the reasoning, 164 00:07:41.399 --> 00:07:44.680 it can cause issues with trust or even regulations that 165 00:07:44.720 --> 00:07:46.680 demand transparency. 166 00:07:46.360 --> 00:07:49.160 Right, so that lack of interpretability is a real consideration. 167 00:07:50.040 --> 00:07:52.680 You might get great predictions but lose the explanation. 168 00:07:52.759 --> 00:07:54.240 It's a definite trade-off you have to weigh. 169 00:07:54.360 --> 00:07:57.240 Let's look at the example, the DoctorAUS data set, 170 00:07:57.319 --> 00:07:58.199 predicting gender. 171 00:07:58.399 --> 00:08:01.439 Right, the source noted income and age came out as 172 00:08:01.480 --> 00:08:05.360 strong predictors, pointing out the known differences in salaries and 173 00:08:05.439 --> 00:08:08.680 life expectancy. And the performance? Impressive on the training data, 174 00:08:08.959 --> 00:08:12.839 ninety three percent accuracy. Wow. But then quite a big 175 00:08:12.920 --> 00:08:15.560 drop on the test data, down to sixty six percent. 176 00:08:15.600 --> 00:08:18.120 Ooh, that's a nearly thirty percent drop. What does that 177 00:08:18.160 --> 00:08:18.600 tell us? 178 00:08:18.920 --> 00:08:21.279 It tells us that while the model learned the training 179 00:08:21.360 --> 00:08:26.319 data extremely well, almost perfectly, it really struggled to generalize 180 00:08:26.360 --> 00:08:26.879 that learning. 181 00:08:27.319 --> 00:08:30.839 So even random forest isn't immune to some overfitting, or 182 00:08:31.120 --> 00:08:33.639 maybe the training data just wasn't fully representative. 183 00:08:33.879 --> 00:08:36.960 Could be either or both.
It's a stark reminder that 184 00:08:37.080 --> 00:08:40.679 high training accuracy is nice, but test accuracy is what 185 00:08:40.879 --> 00:08:42.799 really counts for real-world use. 186 00:08:43.039 --> 00:08:46.600 Okay, and the numeric prediction example, predicting income from the 187 00:08:46.639 --> 00:08:48.240 same DoctorAUS data 188 00:08:48.039 --> 00:08:50.799 set? Here, age was the most important variable by far. 189 00:08:51.039 --> 00:08:53.320 Makes intuitive sense, right? Yeah, people often earn more as 190 00:08:53.320 --> 00:08:53.840 they get older. 191 00:08:53.879 --> 00:08:55.519 Sure. And the numbers? 192 00:08:55.200 --> 00:08:58.360 Strong correlation on the training data, point eight three, but 193 00:08:58.399 --> 00:09:00.279 again a drop on the test data, down to point 194 00:09:00.320 --> 00:09:00.720 four to eight. 195 00:09:00.919 --> 00:09:03.679 Still a decent drop. And the error, MSE? 196 00:09:04.159 --> 00:09:06.720 MSE was low on the training set, point zero four, 197 00:09:06.919 --> 00:09:08.559 and higher on the test set, point one to one. 198 00:09:08.639 --> 00:09:11.360 So similar story. Good at learning the training data, but 199 00:09:11.440 --> 00:09:14.279 only moderately good at generalizing the income prediction. 200 00:09:14.399 --> 00:09:17.480 Yeah, pretty much. It captures the relationship it sees, but 201 00:09:17.559 --> 00:09:20.200 applying it to new unseen individuals is where the real 202 00:09:20.240 --> 00:09:20.759 test lies. 203 00:09:20.840 --> 00:09:24.480 Okay, moving on: k-nearest neighbor, or kNN. This 204 00:09:24.519 --> 00:09:26.679 one sounds neighborly, huh? 205 00:09:26.360 --> 00:09:29.720 Yeah, it's actually quite intuitive. The core idea is predicting 206 00:09:29.759 --> 00:09:31.159 by proximity. Like 207 00:09:31.200 --> 00:09:34.000 your example of walking into a classroom of twelve year olds. 208 00:09:34.159 --> 00:09:36.559 If another kid walks in, you guess they're also twelve.
209 00:09:37.000 --> 00:09:40.960 Exactly like that, yeah. kNN looks at an unknown data 210 00:09:41.000 --> 00:09:43.840 point and finds the k known data points that are 211 00:09:43.879 --> 00:09:45.799 nearest to it in the feature space. 212 00:09:45.759 --> 00:09:47.559 And nearest is usually measured by? 213 00:09:47.679 --> 00:09:51.679 Typically Euclidean distance, just the straight-line distance between points 214 00:09:52.080 --> 00:09:53.519 in that multidimensional space. 215 00:09:53.679 --> 00:09:55.879 So K is just how many neighbors you look at, 216 00:09:55.919 --> 00:09:58.600 like the three nearest or five nearest? 217 00:09:58.720 --> 00:10:01.120 Yep, K is the number of neighbors you consider. 218 00:10:01.200 --> 00:10:03.879 And how do those neighbors make the prediction? For classification, 219 00:10:03.960 --> 00:10:06.639 they vote. That's why K is usually an odd number, 220 00:10:06.679 --> 00:10:10.240 to avoid ties. For numeric prediction, it's just the average 221 00:10:10.240 --> 00:10:11.360 of the neighbors' values. 222 00:10:11.440 --> 00:10:12.200 What's it good for? 223 00:10:12.360 --> 00:10:14.879 It's pretty good with nonlinear data, where the boundary 224 00:10:14.919 --> 00:10:18.480 isn't a straight line. And it's nonparametric, meaning 225 00:10:18.720 --> 00:10:21.519 it doesn't make strong assumptions about how your data is distributed. 226 00:10:21.919 --> 00:10:22.679 Makes it flexible. 227 00:10:22.960 --> 00:10:25.039 Sounds simple enough. Any hidden traps? 228 00:10:25.120 --> 00:10:29.240 Well, it's sometimes called a lazy learning algorithm. Lazy because 229 00:10:29.240 --> 00:10:32.399 it doesn't really build a model during training. It basically 230 00:10:32.480 --> 00:10:35.799 just stores all the training data. The real work happens 231 00:10:35.799 --> 00:10:37.320 only when you ask for a prediction. 232 00:10:37.799 --> 00:10:40.399 Ah.
So it doesn't give you much insight into why 233 00:10:40.519 --> 00:10:41.559 variables are important? 234 00:10:41.759 --> 00:10:45.799 Not really, no abstraction. And because it stores everything, it 235 00:10:45.840 --> 00:10:49.240 can struggle with really large data sets. Needs a lot 236 00:10:49.240 --> 00:10:49.799 of memory. 237 00:10:50.120 --> 00:10:53.399 And there was another crucial point, something about scale. 238 00:10:54.159 --> 00:10:58.639 Yes, critically important for kNN: it is scale sensitive. 239 00:10:58.799 --> 00:11:02.320 Okay, break that down. What does scale sensitive mean, practically? 240 00:11:02.639 --> 00:11:05.519 Imagine you have age, maybe zero to one hundred, and 241 00:11:05.559 --> 00:11:08.519 another variable like owns car, which is just zero or one. 242 00:11:09.039 --> 00:11:12.919 When kNN calculates distance, the age difference will totally swamp 243 00:11:12.960 --> 00:11:14.879 the owns-car difference, just because the numbers are so 244 00:11:14.960 --> 00:11:15.519 much bigger. 245 00:11:15.600 --> 00:11:18.840 So age will have way more influence on who's considered nearest. 246 00:11:18.639 --> 00:11:21.360 Exactly, even if owning a car is actually super important 247 00:11:21.360 --> 00:11:23.399 for the prediction. Yeah, so you have to scale your 248 00:11:23.440 --> 00:11:24.039 data first. 249 00:11:24.120 --> 00:11:24.919 How do you scale it? 250 00:11:25.200 --> 00:11:28.080 Common ways are min-max scaling, where you squish everything 251 00:11:28.120 --> 00:11:30.840 into a zero to one range, or standardization, where you 252 00:11:30.879 --> 00:11:33.759 give variables a mean of zero and standard deviation of one. 253 00:11:34.000 --> 00:11:35.679 It puts everything on a level playing 254 00:11:35.480 --> 00:11:38.559 field. Right, so age isn't shouting louder than the other variables. 255 00:11:38.679 --> 00:11:41.200 Makes sense.
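To make the scale-sensitivity point concrete, here is a small sketch in plain Python. The four people and their values are invented for illustration, and the helper names are assumptions rather than anything from the book. It shows how the nearest neighbor of person 0 changes once age is min-max scaled.

```python
import math
import statistics

def min_max_scale(values):
    """Squash values into the 0..1 range (assumes the values are not all identical)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population standard deviation)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: age in years and a 0/1 flag for car ownership.
ages = [20, 25, 60, 100]
owns_car = [0, 1, 0, 1]

raw = list(zip(ages, owns_car))
scaled = list(zip(min_max_scale(ages), owns_car))

# Raw: the five-year age gap is small next to the forty-year gap, so person 1 looks nearest.
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))            # about 5.10 vs 40.0
# Scaled: car ownership now carries comparable weight, so person 2 is nearest.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))  # about 1.002 vs 0.5
```

Standardization would be applied the same way, by passing `standardize(ages)` in place of `min_max_scale(ages)`.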
The example used the turnout data set to 256 00:11:41.240 --> 00:11:42.919 predict if someone voted. Yeah, 257 00:11:42.759 --> 00:11:46.360 and they specifically mentioned scaling the data first. The model 258 00:11:46.399 --> 00:11:50.000 got almost eighty percent accuracy on training, and importantly, it 259 00:11:50.080 --> 00:11:51.799 held up really well on the test data too. 260 00:11:51.960 --> 00:11:55.480 That consistency is good, right? Suggests it's generalizing. 261 00:11:54.960 --> 00:11:57.200 Exactly what you want to see, not just memorizing. 262 00:11:57.399 --> 00:12:01.480 And for predicting income from that same turnout data? 263 00:12:00.600 --> 00:12:03.000 The correlation was zer point six y three on training, 264 00:12:03.159 --> 00:12:05.360 dropped a bit to point four to eight on test, 265 00:12:05.559 --> 00:12:08.080 but the MSE values were very close, point zero two 266 00:12:08.120 --> 00:12:11.200 to one for training, point zero two nine for testing. 267 00:12:11.480 --> 00:12:14.559 So again, even if the correlation isn't super strong, the 268 00:12:14.639 --> 00:12:18.120 similar error rates suggest it's performing consistently on new data. 269 00:12:18.320 --> 00:12:21.399 Yep, indicates stable performance, which is often more important than 270 00:12:21.480 --> 00:12:23.279 hitting the absolute highest correlation number. 271 00:12:23.480 --> 00:12:28.679 Okay, next algorithm: support vector machines, SVM. This sounds a 272 00:12:28.679 --> 00:12:29.480 bit more complex. 273 00:12:29.720 --> 00:12:32.639 It combines ideas from kNN and linear models, 274 00:12:32.919 --> 00:12:36.039 but with a really clever twist. For classification, 275 00:12:36.240 --> 00:12:38.559 its main goal is to find the best dividing line, 276 00:12:38.799 --> 00:12:41.399 or plane, or hyperplane in higher dimensions. 277 00:12:41.600 --> 00:12:43.559 Hyperplane, fancy word for boundary. 278 00:12:43.879 --> 00:12:47.360 Right.
It wants the boundary that creates the biggest possible 279 00:12:47.440 --> 00:12:50.759 gap, or margin, between the different classes. 280 00:12:50.320 --> 00:12:52.879 A bigger buffer zone. Exactly. 281 00:12:52.639 --> 00:12:54.720 And the data points that sit right on the edge 282 00:12:54.720 --> 00:12:57.240 of that margin, those are the support vectors. They're the 283 00:12:57.240 --> 00:12:59.120 critical ones that actually define the boundary. 284 00:12:59.279 --> 00:13:01.679 Interesting. Does it always have to be a straight line? 285 00:13:01.720 --> 00:13:03.080 What if the data is all mixed up? 286 00:13:03.399 --> 00:13:07.279 Good question. It prefers straight lines, but it has tricks. 287 00:13:07.919 --> 00:13:11.399 It can allow some misclassifications using a slack variable, a 288 00:13:11.399 --> 00:13:15.320 bit of wiggle room, or, for really messy nonlinear data, 289 00:13:15.399 --> 00:13:16.679 it uses the kernel trick. 290 00:13:16.919 --> 00:13:19.679 The kernel trick, sounds like magic. It kind of is. 291 00:13:19.840 --> 00:13:22.639 It projects the data into a much higher dimensional space where, 292 00:13:22.919 --> 00:13:26.639 hopefully, a simple linear boundary can separate the classes. 293 00:13:26.600 --> 00:13:29.480 Like unfolding a crumpled paper to separate dots. I like 294 00:13:29.519 --> 00:13:33.679 that analogy. So SVMs are flexible, can handle messy data. 295 00:13:33.840 --> 00:13:36.320 Very flexible, yeah. I think they can be incredibly accurate even 296 00:13:36.360 --> 00:13:37.440 on complex problems. 297 00:13:37.440 --> 00:13:41.600 Okay, sounds powerful. Downside? Is it another black box? 298 00:13:41.720 --> 00:13:45.360 Often, yes, especially with those kernel tricks and higher dimensions. 299 00:13:45.600 --> 00:13:49.279 Explaining why an SVM made a particular decision gets very abstract, very 300 00:13:49.159 --> 00:13:51.759 quickly. Right, hard to explain to the boss.
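One way to see the kernel trick is with a tiny worked example. The sketch below uses a degree-2 polynomial kernel, chosen here only because its feature map is short enough to write out; the kernels the transcript goes on to name are linear and RBF. The points and function names are assumptions for illustration.

```python
import math

def phi(x):
    """Explicit map from 1-D to 3-D whose dot product equals (a*b + 1)**2."""
    return (x * x, math.sqrt(2) * x, 1.0)

def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed without ever building phi."""
    return (a * b + 1) ** 2

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

# Hypothetical 1-D data: one class sits in the middle, the other on both sides,
# so no single cut point on the number line separates them.
inner = [-1.0, 0.0, 1.0]
outer = [-3.0, 3.0]

# In the mapped space, the first coordinate (x squared) separates them with one threshold.
print([phi(x)[0] for x in inner])   # [1.0, 0.0, 1.0]
print([phi(x)[0] for x in outer])   # [9.0, 9.0]

# The "trick": the kernel gives the same number as mapping first and then taking the dot product.
print(poly_kernel(2.0, -3.0), dot(phi(2.0), phi(-3.0)))   # both about 25
```

An SVM only ever needs those dot products, which is why it can work in a high-dimensional space without constructing it.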
301 00:13:51.600 --> 00:13:55.240 Can be. Also, choosing the right kernel isn't always obvious, 302 00:13:55.720 --> 00:13:57.919 and like kNN, it's scale sensitive. You need 303 00:13:57.960 --> 00:13:59.000 to rescale your data. 304 00:13:59.120 --> 00:14:03.679 Got it, scaling again. The example was predicting mortgage status, 305 00:14:03.759 --> 00:14:05.759 yes or no, from the working hours data 306 00:14:05.600 --> 00:14:08.000 set? Correct, and data prep was key. They combined some 307 00:14:08.080 --> 00:14:11.480 child-related variables and rescaled everything. Then they compared two 308 00:14:11.519 --> 00:14:15.639 common kernels, linear and RBF, radial basis function. 309 00:14:15.799 --> 00:14:16.440 How did they do? 310 00:14:16.559 --> 00:14:19.679 The linear kernel hit eighty seven percent accuracy on training 311 00:14:20.200 --> 00:14:24.039 and impressively held that exact same accuracy on the test data. 312 00:14:24.080 --> 00:14:26.279 Wow, perfect generalization in that case. 313 00:14:26.360 --> 00:14:29.200 Fantastic result. The RBF kernel was just slightly lower. And 314 00:14:29.159 --> 00:14:33.039 for SVM regression, predicting education level from the same data set? 315 00:14:33.320 --> 00:14:36.919 Yeah, again comparing linear and RBF kernels, the linear one 316 00:14:36.960 --> 00:14:40.480 showed a pretty weak correlation, only point three to eight 317 00:14:40.559 --> 00:14:43.519 on training and point four zero on test. That doesn't 318 00:14:43.559 --> 00:14:47.080 sound great, but the MSE values were low and very 319 00:14:47.080 --> 00:14:50.159 stable, point zero one five eight nine on training, point 320 00:14:50.240 --> 00:14:51.919 zero one eight three to two on testing. 321 00:14:52.320 --> 00:14:56.879 So weak relationship overall, but the model makes consistent predictions? Exactly.
322 00:14:57.360 --> 00:15:00.279 Suggests it's generalizing well, even if it's not explaining a huge 323 00:15:00.320 --> 00:15:03.360 amount of the variance. The source also mentioned outliers might 324 00:15:03.399 --> 00:15:05.519 be affecting the correlation metric more here. 325 00:15:05.679 --> 00:15:08.440 Interesting. So metrics can sometimes tell slightly different stories. 326 00:15:08.440 --> 00:15:10.840 Definitely, you need to look at them together. All right. 327 00:15:10.960 --> 00:15:15.879 Artificial neural networks, ANNs, the brain-inspired ones? Kind of. 328 00:15:16.480 --> 00:15:20.320 They were initially inspired by biological neurons, yeah. You have inputs, 329 00:15:20.320 --> 00:15:22.200 some processing happens, and you get outputs. 330 00:15:22.279 --> 00:15:24.000 And deep learning is just when you have lots of 331 00:15:24.080 --> 00:15:25.440 layers of these neurons. 332 00:15:25.200 --> 00:15:28.080 Multiple hidden layers. Yes, that's the essence of deep learning. 333 00:15:28.159 --> 00:15:30.559 So how do they actually work? Simply put? Think 334 00:15:30.399 --> 00:15:33.399 of inputs like signals arriving. Each input gets a weight, 335 00:15:33.519 --> 00:15:35.799 how important it is. They get summed up and then 336 00:15:35.840 --> 00:15:37.440 hit an activation function. Like 337 00:15:37.399 --> 00:15:39.159 the neuron deciding whether to fire? 338 00:15:39.799 --> 00:15:42.879 Sort of, yeah. That function decides if the signal is 339 00:15:42.879 --> 00:15:45.840 strong enough to pass on to the next layer. Usually 340 00:15:45.879 --> 00:15:49.120 the information flows one way, feed forward. It's a whole 341 00:15:49.159 --> 00:15:51.879 cascade of these simple weighted sums and activations. 342 00:15:51.960 --> 00:15:54.600 And the big advantage? Why all the hype? They 343 00:15:54.519 --> 00:15:57.960 really shine with massive amounts of data.
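The weighted-sum-then-activation cascade described above can be sketched as a tiny feed-forward pass in plain Python. The weights, inputs, bias terms, and the choice of a sigmoid activation are all assumptions for illustration; the transcript does not say which activation or network shape the book uses.

```python
import math

def sigmoid(z):
    """Squash any real number into the 0..1 range (one common activation function)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """One layer: every neuron sees the same inputs but has its own weights and bias."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical numbers: three inputs, a hidden layer of two neurons, one output neuron.
x = [0.5, -1.0, 2.0]
hidden = layer(x, [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0]], [0.0])

print(hidden)   # two activations, each between 0 and 1
print(output)   # one activation between 0 and 1
```

Training would adjust the weights and biases to reduce prediction error; this sketch only shows the forward direction.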
Given enough data, 344 00:15:58.000 --> 00:16:02.000 they can learn incredibly complex, subtle patterns that other algorithms 345 00:16:02.080 --> 00:16:03.080 might miss entirely. 346 00:16:03.279 --> 00:16:08.039 So flexibility is huge. Image recognition, self-driving cars? 347 00:16:07.919 --> 00:16:11.200 Exactly, they power a lot of cutting-edge AI tasks. 348 00:16:11.360 --> 00:16:13.840 But the catch? They need tons of data? 349 00:16:13.879 --> 00:16:18.799 Typically, yes, massive data sets for optimal performance, and training 350 00:16:18.840 --> 00:16:20.919 them can take a lot of computing power and time. 351 00:16:21.200 --> 00:16:24.559 Plus sometimes simpler networks can struggle to converge, meaning they 352 00:16:24.559 --> 00:16:25.840 don't actually learn effectively. 353 00:16:25.919 --> 00:16:29.200 Okay, example time: predicting union membership from the wages data set. 354 00:16:29.279 --> 00:16:32.559 Right, and a key step here was turning categorical variables 355 00:16:32.840 --> 00:16:35.200 like occupation into dummy 356 00:16:35.000 --> 00:16:37.200 variables. Numerical flags, basically. 357 00:16:37.000 --> 00:16:40.000 Yep, zeros and ones the network can understand. The model 358 00:16:40.039 --> 00:16:43.600 achieved a solid seventy percent accuracy, and importantly, it was 359 00:16:43.639 --> 00:16:46.000 consistent between the training and test data. 360 00:16:46.399 --> 00:16:50.200 Good stability again. And the regression example, predicting wages? 361 00:16:50.360 --> 00:16:52.879 Also from the wages data set. It showed a moderate 362 00:16:52.919 --> 00:16:56.240 correlation, point five to six on training, very close, point 363 00:16:56.240 --> 00:16:59.240 five to four, on test, and the MSE values were 364 00:16:59.240 --> 00:17:00.960 also very close between the two 365 00:17:00.840 --> 00:17:04.839 sets. So reasonably good generalization there too. Consistent, if not 366 00:17:05.000 --> 00:17:06.160 spectacular, prediction.
367 00:17:06.319 --> 00:17:08.599 Looks like it. Reliable performance on new data. 368 00:17:09.119 --> 00:17:12.000 Now for something completely different: k-means. You said this 369 00:17:12.039 --> 00:17:13.319 one's unsupervised learning? 370 00:17:13.400 --> 00:17:16.880 That's right. This is fascinating because, unlike everything else we've discussed, 371 00:17:17.279 --> 00:17:20.480 there's no right answer or target variable we're trying 372 00:17:20.319 --> 00:17:23.960 to predict. No gender, no income, no voted label? Exactly. 373 00:17:24.359 --> 00:17:27.440 K-means isn't trying to predict anything specific. It's just trying 374 00:17:27.440 --> 00:17:30.759 to find natural groupings or clusters within the data itself, 375 00:17:30.839 --> 00:17:31.920 based on similarity. 376 00:17:32.079 --> 00:17:33.400 How does it find these groups? 377 00:17:33.480 --> 00:17:36.480 It starts by randomly guessing the locations of K cluster 378 00:17:36.599 --> 00:17:38.559 centers, or centroids. K 379 00:17:38.680 --> 00:17:41.079 being the number of clusters you think are in the data. 380 00:17:41.160 --> 00:17:45.720 Precisely. Then it assigns each data point to its nearest centroid. 381 00:17:46.559 --> 00:17:49.960 After that, it recalculates the position of each centroid to 382 00:17:50.000 --> 00:17:52.559 be the actual center of all the points assigned to it. 383 00:17:52.440 --> 00:17:55.559 And it repeats that? Assign points, move centers? Yep. 384 00:17:55.640 --> 00:17:59.160 It iterates back and forth, assign points, update centroids, until 385 00:17:59.160 --> 00:18:02.440 the centroids stop moving much, until things stabilize. And 386 00:18:02.400 --> 00:18:05.200 the researcher has to decide on K, the number of clusters? 387 00:18:05.240 --> 00:18:06.599 That sounds tricky. It can be. 388 00:18:06.759 --> 00:18:07.640 It's a key challenge. 389 00:18:07.680 --> 00:18:09.359 So what's the big benefit of doing this?
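The assign-then-update loop described above can be sketched in plain Python as follows. This is an illustrative version under assumptions: the function names, the optional `init` parameter, the choice to pick random data points as starting centroids, and the toy 2-D points are not from the book.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, k, init=None, seed=0, max_iter=100, tol=1e-9):
    """Alternate between assigning points and moving centroids until they stop moving."""
    rng = random.Random(seed)
    # Assumption: the random starting guess is k of the data points themselves.
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # An empty cluster keeps its old centroid instead of being averaged.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical toy data: two visibly separate blobs.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Because the result depends on the starting guess, it is common to run the procedure several times with different seeds and compare the clusterings.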
390 00:18:09.839 --> 00:18:13.920 It's fantastic for exploring your data, for segmentation, finding hidden 391 00:18:13.960 --> 00:18:17.160 patterns you didn't even know existed, understanding what makes different 392 00:18:17.200 --> 00:18:18.640 subgroups within your data 393 00:18:18.599 --> 00:18:21.319 distinct. Discovering natural segments? Exactly. 394 00:18:21.640 --> 00:18:25.559 But the drawbacks stem from that unsupervised nature. Since there's 395 00:18:25.559 --> 00:18:28.839 no right answer, evaluating how good the clusters are is 396 00:18:28.880 --> 00:18:31.359 more subjective, relying on the researcher's interpretation. 397 00:18:31.559 --> 00:18:34.400 Let me guess, scale sensitive? 398 00:18:34.000 --> 00:18:37.799 You got it. Requires data normalization or scaling, just like 399 00:18:37.960 --> 00:18:43.200 kNN and SVM, because it relies on distance calculations. And yeah, 400 00:18:43.279 --> 00:18:44.519 choosing the right K is tough. 401 00:18:44.759 --> 00:18:46.200