WEBVTT 1 00:00:00.080 --> 00:00:03.919 Okay, let's unpack this. Today we're embarking on a deep 2 00:00:03.960 --> 00:00:09.640 dive into the fascinating, sometimes, maybe often, hyped, but fundamentally 3 00:00:09.679 --> 00:00:11.080 important world of data science. 4 00:00:11.199 --> 00:00:13.000 Definitely hyped at times, right? But... 5 00:00:13.000 --> 00:00:16.480 Our mission here is really to demystify what data science 6 00:00:16.519 --> 00:00:19.320 truly is. We want to explore its core processes, the 7 00:00:19.399 --> 00:00:22.519 essential tools, and also confront some of the, well, the 8 00:00:22.559 --> 00:00:26.280 critical real-world challenges that come with working with data. 9 00:00:25.960 --> 00:00:27.600 And the ethical ones too. They're huge. 10 00:00:27.640 --> 00:00:31.800 Absolutely. We'll be drawing our insights primarily from Rachel Schutt's 11 00:00:31.800 --> 00:00:35.119 pioneering book Doing Data Science: Straight Talk from the Frontline, 12 00:00:35.159 --> 00:00:38.000 which came out of her course at Columbia 13 00:00:37.560 --> 00:00:40.079 University, a really groundbreaking course back then. 14 00:00:40.240 --> 00:00:42.880 Exactly. So by the end of this deep dive you 15 00:00:42.880 --> 00:00:46.000 should have a much clearer understanding of this field, hopefully 16 00:00:46.000 --> 00:00:48.159 equipped with the knowledge to, you know, cut through the 17 00:00:48.200 --> 00:00:51.439 noise and see why it's relevant. So let's dive straight 18 00:00:51.479 --> 00:00:55.079 into the heart of it. What is data science? It's 19 00:00:55.119 --> 00:00:58.240 a question that even the pioneers of the field really 20 00:00:58.320 --> 00:00:58.840 grappled with. 21 00:00:58.960 --> 00:01:00.520 They really did. And
22 00:01:00.320 --> 00:01:03.280 Schutt's Introduction to Data Science course at Columbia, I think 23 00:01:03.280 --> 00:01:06.840 it started fall 2012, that really acted as an 24 00:01:06.879 --> 00:01:08.519 incubator for this whole idea. 25 00:01:08.680 --> 00:01:11.040 Yeah, it was a starting point, and Cathy O'Neil from 26 00:01:11.000 --> 00:01:14.959 the mathbabe.org blog, she was instrumental in bringing 27 00:01:14.959 --> 00:01:19.200 these ideas out, specifically pushing back against all the marketing hype. 28 00:01:19.239 --> 00:01:20.640 There was a lot of hype back then. 29 00:01:20.760 --> 00:01:24.400 Oh yeah. A crucial point here is that initial sort 30 00:01:24.400 --> 00:01:28.159 of bewilderment around it. The term was just vague. People 31 00:01:28.200 --> 00:01:30.840 were throwing around phrases like "masters of the universe" for 32 00:01:30.959 --> 00:01:32.400 data scientists. 33 00:01:31.920 --> 00:01:34.560 Right, which must have annoyed some statisticians. 34 00:01:34.599 --> 00:01:36.640 You can imagine they felt like, hey, that's our field, 35 00:01:36.719 --> 00:01:40.599 the science of data. Identity theft, almost. But the core 36 00:01:40.760 --> 00:01:42.519 argument in the book, and I think it holds up, 37 00:01:42.640 --> 00:01:45.599 is that data science isn't just rebranding. 38 00:01:45.120 --> 00:01:46.799 Not just a new buzzword. Exactly. 39 00:01:47.040 --> 00:01:49.959 It's genuinely a new idea, maybe still a bit fragile 40 00:01:50.000 --> 00:01:54.359 or evolving, but it uniquely combines foundations from statistics and 41 00:01:54.400 --> 00:01:57.920 computer science. Plus it has this distinct process tied to it. 42 00:01:58.200 --> 00:02:00.319 And part of that newness, I think, came from this 43 00:02:00.400 --> 00:02:04.680 idea of datafication. Kenneth Cukier and Viktor Mayer-Schönberger talked 44 00:02:04.719 --> 00:02:07.120 about this in Foreign Affairs, maybe mid two
45 00:02:07.040 --> 00:02:09.159 thousand thirteen, "The Rise of Big Data." Yeah. 46 00:02:09.240 --> 00:02:13.439 They defined datafication as basically taking all aspects of life 47 00:02:13.479 --> 00:02:15.159 and turning them into data. 48 00:02:14.919 --> 00:02:15.879 Which sounds huge. 49 00:02:16.000 --> 00:02:20.240 It is. Think about it: Google Glass datafying your gaze, Twitter 50 00:02:20.319 --> 00:02:23.879 turning stray thoughts into data points, LinkedIn mapping out your 51 00:02:23.879 --> 00:02:28.800 professional life. Everything becomes potentially quantifiable. 52 00:02:28.120 --> 00:02:31.560 Which immediately makes you ask, okay, who is this "we" 53 00:02:32.080 --> 00:02:34.840 doing the datafying, and what kind of value are they 54 00:02:34.879 --> 00:02:39.319 actually creating? Often it's, well, modelers, entrepreneurs. 55 00:02:38.800 --> 00:02:41.439 Looking for efficiency, automation. Pretty much. 56 00:02:41.560 --> 00:02:44.120 Yeah. And what's really striking is that so much of 57 00:02:44.120 --> 00:02:46.800 this wasn't bubbling up in academia initially. It was happening 58 00:02:47.039 --> 00:02:50.400 in industry, in tech companies. That's quite different from how 59 00:02:50.400 --> 00:02:51.800 statistics traditionally developed. 60 00:02:52.039 --> 00:02:55.000 So if it's this broad new thing happening in industry, 61 00:02:55.639 --> 00:02:58.840 what does a data scientist actually look like? Schutt had 62 00:02:58.919 --> 00:03:00.240 this interesting exercise for 63 00:03:00.159 --> 00:03:02.159 her students, the self-profiling. Yeah. 64 00:03:02.039 --> 00:03:06.680 Right, rate yourself on computer science, math, stats, machine learning, 65 00:03:06.800 --> 00:03:11.800 domain expertise, communication, viz (data visualization), and... 66 00:03:11.759 --> 00:03:14.199 The results were all over the place.
It showed 67 00:03:14.240 --> 00:03:17.479 pretty clearly that, you know, no single person is going 68 00:03:17.520 --> 00:03:19.120 to be brilliant at all of those things. 69 00:03:19.199 --> 00:03:24.120 Yeah, unicorns are rare. Exactly, which led to this idea: 70 00:03:24.159 --> 00:03:27.639 maybe it's more useful to define a data science team 71 00:03:28.039 --> 00:03:30.759 than one perfect data scientist. 72 00:03:30.439 --> 00:03:32.840 Makes sense. Like that Josh Wills quote. Oh 73 00:03:32.840 --> 00:03:35.879 yeah, the classic: "person who is better at statistics than 74 00:03:35.919 --> 00:03:39.520 any software engineer, and better at software engineering than any statistician." 75 00:03:39.639 --> 00:03:40.879 That captures it pretty well. 76 00:03:41.039 --> 00:03:45.479 It does. Fundamentally, a data scientist extracts meaning, interprets data. 77 00:03:45.840 --> 00:03:48.879 They need tools from stats and machine learning, sure, but 78 00:03:49.199 --> 00:03:53.639 also, crucially, human intuition. And let's be honest, a huge 79 00:03:53.639 --> 00:03:57.240 part of the job is just collecting, cleaning, and munging data. Yeah, 80 00:03:57.280 --> 00:03:59.759 wrestling with it, because real-world data is just 81 00:04:00.000 --> 00:04:02.280 inherently messy. Always. Okay, 82 00:04:02.039 --> 00:04:05.240 so, moving from the who to the how: how does 83 00:04:05.280 --> 00:04:07.840 this actually get done? We hear "big data" all the time, 84 00:04:07.919 --> 00:04:08.719 but it's kind of a 85 00:04:08.680 --> 00:04:10.159 vague term. It really is. 86 00:04:10.319 --> 00:04:10.560 Yeah. 87 00:04:10.599 --> 00:04:14.560 The book breaks it down nicely, though. Three parts. One, 88 00:04:14.919 --> 00:04:19.000 it's a set of technologies.
Two, it's potentially a revolution 89 00:04:19.160 --> 00:04:21.839 in how we measure things. And three, a point of view, 90 00:04:21.879 --> 00:04:25.079 really a philosophy, about how decisions are going to be 91 00:04:25.120 --> 00:04:27.439 made in the future based on data. 92 00:04:27.120 --> 00:04:30.680 Right, and connecting big data back to basic stats, like 93 00:04:31.360 --> 00:04:32.879 populations and samples. 94 00:04:33.000 --> 00:04:37.040 That seems important. Oh, absolutely critical. There's this dangerous assumption 95 00:04:37.160 --> 00:04:42.040 sometimes with big data that N equals all. You know, you have 96 00:04:42.079 --> 00:04:42.399 all the 97 00:04:42.399 --> 00:04:44.000 data. But you never really do. 98 00:04:44.040 --> 00:04:46.920 You pretty much never do. There's always something missing, some 99 00:04:47.079 --> 00:04:50.279 context you don't have. Kate Crawford's talk on the Hurricane 100 00:04:50.279 --> 00:04:52.439 Sandy tweets is such a powerful example of this. 101 00:04:52.680 --> 00:04:53.519 What was the gist of that? 102 00:04:53.800 --> 00:04:55.920 Well, looking at the tweets, you might think New Yorkers 103 00:04:55.920 --> 00:04:58.720 were just casually shopping before the storm and partying after. 104 00:04:59.399 --> 00:05:01.560 But that's because they were the ones tweeting heavily. 105 00:05:01.920 --> 00:05:03.879 Ah, so it missed the people really 106 00:05:03.759 --> 00:05:08.319 affected. Exactly, coastal New Jerseyans whose homes were being destroyed. 107 00:05:08.639 --> 00:05:12.000 They weren't tweeting about their grocery runs. It just shows 108 00:05:12.000 --> 00:05:16.240 how subjective the whole process is. You, the data scientist, 109 00:05:16.519 --> 00:05:19.879 are turning the world into data. It's not objective. 110 00:05:20.000 --> 00:05:21.800 Data doesn't just speak for itself. 111 00:05:22.079 --> 00:05:24.720 Never. Be very skeptical if someone claims it does.
112 00:05:25.120 --> 00:05:28.519 Okay, so data is subjective. We need context. Then we 113 00:05:28.560 --> 00:05:31.839 get to modeling. This sounds like where the magic 114 00:05:31.519 --> 00:05:34.240 happens. Or the hard work. Maybe both. And when we 115 00:05:34.279 --> 00:05:36.759 say model here, we don't mean like a database schema. 116 00:05:36.959 --> 00:05:39.199 We mean a statistical model. 117 00:05:38.920 --> 00:05:40.160 Like a mathematical function. 118 00:05:40.360 --> 00:05:43.000 Yeah, one that tries to capture the uncertainty, the randomness 119 00:05:43.000 --> 00:05:46.279 in how the data was generated. And building these, it's 120 00:05:46.319 --> 00:05:49.120 definitely part art, part science. Textbooks don't really give you 121 00:05:49.160 --> 00:05:51.319 a step-by-step guide. You have to make assumptions, 122 00:05:51.399 --> 00:05:53.839 a lot of assumptions about reality. But yeah, we'll get 123 00:05:53.879 --> 00:05:54.639 into how that works. 124 00:05:54.720 --> 00:05:57.399 And you mentioned a big pitfall here: overfitting. 125 00:05:57.879 --> 00:06:02.480 Yes, get ready to hear about overfitting a lot, possibly 126 00:06:02.560 --> 00:06:03.839 until you have nightmares 127 00:06:03.439 --> 00:06:05.199 about it. Okay, okay. So what is it? 128 00:06:05.199 --> 00:06:07.480 It's when your model gets too good at explaining the 129 00:06:07.560 --> 00:06:10.480 specific data you trained it on, including all the random 130 00:06:10.519 --> 00:06:12.600 noise and quirks in that sample. 131 00:06:12.399 --> 00:06:14.800 So it learns the noise, not the signal. 132 00:06:14.519 --> 00:06:18.240 Precisely, and then it fails, often badly, when you try 133 00:06:18.240 --> 00:06:20.959 to use it on new, unseen data. It hasn't learned 134 00:06:20.959 --> 00:06:23.199 the general pattern, just the specifics of the test it 135 00:06:23.240 --> 00:06:24.759 studied for, so to speak. 136 00:06:24.519 --> 00:06:26.759 Right, it can't generalize.
So before we even get to 137 00:06:26.800 --> 00:06:29.120 complex models, what's the first step? 138 00:06:29.000 --> 00:06:34.240 Exploratory data analysis, EDA. Yeah, it's absolutely fundamental. 139 00:06:33.759 --> 00:06:36.560 And that's more than just plotting things? Oh, much more. 140 00:06:36.720 --> 00:06:40.160 It's a mindset. It's about getting intuition, understanding the shape 141 00:06:40.160 --> 00:06:42.160 of your data, feeling how it connects back to the 142 00:06:42.199 --> 00:06:43.680 real-world process that created it. 143 00:06:43.720 --> 00:06:45.839 So what does it help you do, practically? Well, 144 00:06:45.959 --> 00:06:50.199 gain intuition, obviously. Make comparisons, do basic sanity checks: is 145 00:06:50.240 --> 00:06:52.839 the data on the right scale, is it in the right format? 146 00:06:53.079 --> 00:06:58.199 Spot missing values or crazy outliers, summarize things, even debug 147 00:06:58.240 --> 00:06:59.839 how the data was logged in the first place. 148 00:07:00.000 --> 00:07:02.399 Okay, like the example with the New York Times ad 149 00:07:02.480 --> 00:07:07.279 data, nyt1.csv through nyt31.csv. 150 00:07:07.399 --> 00:07:11.560 Exactly. The students had to plot distributions of ad impressions 151 00:07:12.040 --> 00:07:15.199 and click-through rates, the CTR, for different age groups, 152 00:07:15.759 --> 00:07:20.720 and segment users by whether they clicked or not, using 153 00:07:20.879 --> 00:07:23.879 R in that case. It forces you to really look 154 00:07:23.879 --> 00:07:24.759 at the data first. 155 00:07:24.879 --> 00:07:27.759 And this whole process, it kind of mirrors the scientific method, 156 00:07:27.800 --> 00:07:28.279 doesn't it? 157 00:07:28.279 --> 00:07:32.199 It really does. You ask a question, you research, explore 158 00:07:32.279 --> 00:07:35.240 the data, you form a hypothesis, you test it, build 159 00:07:35.279 --> 00:07:37.959 a model, analyze the results, communicate them.
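NOTE A minimal sketch of the click-through-rate exercise described in the cues above, written in Python rather than the R the students used. The rows, the column choice (age, impressions, clicks) and the age buckets are hypothetical stand-ins for the real files, chosen only to show the sanity check and the per-group CTR calculation.
```python
from collections import defaultdict
# Hypothetical rows in the spirit of the ad data: (age, impressions, clicks).
ROWS = [(17, 4, 0), (25, 5, 1), (31, 3, 0), (46, 6, 1), (52, 2, 0), (70, 8, 2), (33, 0, 0)]
# Assumed half-open age buckets [lo, hi) with display labels.
BUCKETS = [(0, 18, "<18"), (18, 35, "18-34"), (35, 55, "35-54"), (55, 200, "55+")]
def age_bucket(age):
    """Return the label of the bucket that contains age."""
    for lo, hi, label in BUCKETS:
        if lo <= age < hi:
            return label
    raise ValueError(f"age out of range: {age}")
def ctr_by_bucket(rows):
    """Sum impressions and clicks per bucket, then compute CTR = clicks / impressions."""
    totals = defaultdict(lambda: [0, 0])  # label maps to [impressions, clicks]
    for age, impressions, clicks in rows:
        if impressions == 0:  # sanity check: CTR is undefined without impressions
            continue
        bucket_totals = totals[age_bucket(age)]
        bucket_totals[0] += impressions
        bucket_totals[1] += clicks
    return {label: c / i for label, (i, c) in totals.items()}
if __name__ == "__main__":
    for label, ctr in ctr_by_bucket(ROWS).items():
        print(f"{label}: CTR = {ctr:.3f}")
```
Aggregating clicks and impressions before dividing, rather than averaging per-user ratios, keeps users with very few impressions from dominating a bucket.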
160 00:07:37.959 --> 00:07:38.759 But with a twist. 161 00:07:39.000 --> 00:07:41.639 Yeah, the big difference is the feedback loop. When you 162 00:07:41.680 --> 00:07:43.959 build a data product like a spam filter or a 163 00:07:43.959 --> 00:07:47.399 recommendation engine, it goes out into the world, people use it, 164 00:07:47.800 --> 00:07:49.800 their interactions generate more data. 165 00:07:49.600 --> 00:07:51.040 Which feeds back into the system. 166 00:07:51.240 --> 00:07:53.800 Right, it's a dynamic cycle. It's not like predicting the weather, 167 00:07:53.839 --> 00:07:57.240 where your forecast doesn't actually change tomorrow's weather. Here the 168 00:07:57.279 --> 00:08:00.680 model influences the world, which generates new data for the model. And 169 00:08:00.639 --> 00:08:02.480 the data scientist is involved all the 170 00:08:02.399 --> 00:08:05.680 way through. Absolutely, from deciding what data to even collect, 171 00:08:06.079 --> 00:08:09.439 to asking the first questions, planning the attack, and yeah, 172 00:08:09.519 --> 00:08:10.319 writing the code. 173 00:08:10.480 --> 00:08:12.720 The RealDirect case study sounds like a good example 174 00:08:12.759 --> 00:08:14.839 of this, using data in real estate. 175 00:08:14.959 --> 00:08:17.680 Yeah, Doug Perlson's company. Yeah, the traditional real estate broker 176 00:08:17.759 --> 00:08:21.959 system was, well, broken in terms of data. Brokers guarded 177 00:08:21.959 --> 00:08:25.240 their info fiercely. Public data was months out of date. 178 00:08:25.759 --> 00:08:27.000 So what did RealDirect do? 179 00:08:27.639 --> 00:08:31.000 They hired agents who pooled their knowledge, used data-driven tips, 180 00:08:31.040 --> 00:08:34.840 built real-time recommendations, tried to get live feeds on searches, offers, 181 00:08:34.919 --> 00:08:36.279 closing times. 182 00:08:36.720 --> 00:08:40.159 All that stuff. And the business model reflected that efficiency.
183 00:08:39.639 --> 00:08:43.000 Right, a subscription model plus lower commission, because the data 184 00:08:43.000 --> 00:08:46.440 supposedly made things more efficient. The exercise for the students 185 00:08:46.519 --> 00:08:50.399 was literally: okay, you're advising the CEO, define a data strategy. 186 00:08:50.519 --> 00:08:52.159 What data do you need, where do you get it, 187 00:08:52.279 --> 00:08:54.559 how do you clean it, explore it, summarize it? It puts 188 00:08:54.600 --> 00:08:55.159 it all together. 189 00:08:55.360 --> 00:08:59.480 Okay, let's shift gears to the algorithms, the engines driving this. 190 00:09:00.120 --> 00:09:03.000 Machine learning versus statistical modeling: always confusing. 191 00:09:03.360 --> 00:09:06.919 It is confusing because there's so much overlap. ML algorithms, 192 00:09:07.000 --> 00:09:12.039 mostly from computer science, do prediction, classification, clustering. Statistical modeling, 193 00:09:12.080 --> 00:09:15.759 from stats environments, does, well, prediction, classification, clustering. 194 00:09:15.919 --> 00:09:17.279 So what's the real difference, then? 195 00:09:17.519 --> 00:09:21.519 Often it's about the goal and the origin. Many ML algorithms, 196 00:09:21.639 --> 00:09:26.679 especially the ones driving AI, image recognition, speech, recommenders, they 197 00:09:26.679 --> 00:09:30.000 weren't typically part of a core stats curriculum, and crucially, 198 00:09:30.440 --> 00:09:33.679 they're often not designed to help you infer the underlying why. 199 00:09:34.120 --> 00:09:36.159 They just want the best prediction. Exactly. 200 00:09:36.360 --> 00:09:40.320 Maximum accuracy is usually the goal, whereas statistical modeling often 201 00:09:40.320 --> 00:09:45.080 puts more emphasis on understanding the relationships, the uncertainty. But honestly, 202 00:09:45.679 --> 00:09:48.879 good data scientists use both. They know when each approach 203 00:09:48.960 --> 00:09:49.720 is more valuable.
204 00:09:49.919 --> 00:09:52.679 Right, and the warning you mentioned: don't be a hammer 205 00:09:52.720 --> 00:09:53.440 looking for a nail. 206 00:09:53.759 --> 00:09:56.720 Precisely, don't just grab the algorithm you know best and 207 00:09:56.759 --> 00:10:00.279 force it onto the problem. First, understand the problem context, 208 00:10:00.360 --> 00:10:03.759 figure out its mathematical structure, then see which algorithms fit. 209 00:10:04.000 --> 00:10:04.480 Makes sense. 210 00:10:04.720 --> 00:10:06.960 Let's start with a classic: linear regression. 211 00:10:07.200 --> 00:10:10.799 Ah, yes, your bread and butter for predicting a continuous 212 00:10:10.840 --> 00:10:14.639 outcome, like price or temperature, using one or more predictors. 213 00:10:14.679 --> 00:10:17.960 We usually start thinking about simple lines, like y = 214 00:10:17.960 --> 00:10:19.919 25x. Deterministic. 215 00:10:20.320 --> 00:10:23.799 Right. But the key mental shift is moving to stochastic functions, 216 00:10:24.399 --> 00:10:28.960 acknowledging that there's randomness, uncertainty. The line represents the average trend, 217 00:10:29.320 --> 00:10:31.440 but the points will scatter around it. And 218 00:10:31.399 --> 00:10:33.519 how do you find the best line? 219 00:10:33.759 --> 00:10:36.320 You minimize the distance between the points and the line, 220 00:10:36.840 --> 00:10:39.960 specifically the sum of the squared vertical distances. That's least squares, 221 00:10:40.039 --> 00:10:41.000 which also minimizes the mean squared 222 00:10:40.759 --> 00:10:43.159 error. And you evaluate it with things like p-values. 223 00:10:43.320 --> 00:10:46.879 Yeah, p-values help you test if your predictors actually 224 00:10:46.919 --> 00:10:51.759 have a statistically significant effect. Are their coefficients likely different 225 00:10:51.759 --> 00:10:55.960 from zero? You can add more predictors.
That's multiple linear regression, 226 00:10:56.480 --> 00:10:59.360 which then raises the question of feature selection: which predictors 227 00:10:59.360 --> 00:10:59.919 matter most? 228 00:11:00.080 --> 00:11:02.679 And simulating data can help understand 229 00:11:02.240 --> 00:11:05.799 this. Oh, hugely useful, especially in learning. You create fake 230 00:11:05.879 --> 00:11:08.440 data where you know the true relationship, then you see 231 00:11:08.480 --> 00:11:11.080 if your model can recover it. How does sample size affect things? 232 00:11:11.120 --> 00:11:15.039 What happens if you add irrelevant variables? It builds intuition. 233 00:11:15.240 --> 00:11:18.559 Okay, what about classifying things? Finding similar items? 234 00:11:18.639 --> 00:11:21.360 That sounds like k-nearest neighbors, or kNN. 235 00:11:21.039 --> 00:11:22.679 Right, kNN. How does that work? 236 00:11:22.840 --> 00:11:26.720 The idea is simple. To classify a new unlabeled item, 237 00:11:26.960 --> 00:11:30.159 you look at its k closest neighbors in a data set where 238 00:11:30.200 --> 00:11:32.720 you do have labels. Then you assign the class that's most 239 00:11:32.759 --> 00:11:36.919 common among those neighbors. Examples could be anything: classifying emails 240 00:11:36.919 --> 00:11:40.600 as spam or not spam based on similar emails, assessing credit 241 00:11:40.679 --> 00:11:44.320 risk based on similar applicants, recommending restaurants based on what 242 00:11:44.399 --> 00:11:47.440 similar users like. Find the neighbors. And the key choices 243 00:11:47.480 --> 00:11:50.240 are two main things. First, how do you define closest? 244 00:11:50.480 --> 00:11:53.600 You need a distance metric. Euclidean is common for points, 245 00:11:53.799 --> 00:11:57.960 cosine for text, Hamming for strings, Manhattan for grid-like paths. 246 00:11:58.759 --> 00:12:01.559 Depends on the data. Second, choosing k: how many neighbors 247 00:12:01.600 --> 00:12:05.600 do you consult? One, five, twenty?
That's a tuning parameter. 248 00:12:05.639 --> 00:12:08.440 And this is where it gets interesting. The curse of dimensionality. 249 00:12:08.600 --> 00:12:11.240 Ah yes, the curse. kNN works great in low dimensions, 250 00:12:11.559 --> 00:12:14.240 like recognizing handwritten digits, where pixels in a 251 00:12:14.279 --> 00:12:17.480 256-dimensional space have a natural closeness. But 252 00:12:17.759 --> 00:12:20.679 imagine text data with thousands of dimensions, 253 00:12:20.639 --> 00:12:23.120 words. Things get spread out. Exactly. 254 00:12:23.200 --> 00:12:25.799 In high dimensions, everything is kind of far away from 255 00:12:25.799 --> 00:12:29.559 everything else. Your nearest neighbors might not be very similar 256 00:12:29.600 --> 00:12:33.120 at all in a meaningful sense. kNN breaks down. 257 00:12:33.519 --> 00:12:35.679 That's why it's usually bad for spam filtering. 258 00:12:35.799 --> 00:12:38.159 Good point. Other kNN pitfalls? 259 00:12:38.360 --> 00:12:41.600 Definitely need to scale your variables. If income is in dollars 260 00:12:41.600 --> 00:12:45.080 and age is in years, income will dominate the distance calculation 261 00:12:45.200 --> 00:12:49.360 unless you scale them. And overfitting is a risk, especially 262 00:12:49.360 --> 00:12:52.279 if k = 1. Then you're just copying the label of 263 00:12:52.320 --> 00:12:55.559 the single closest point, which might be noise. Correlated features 264 00:12:55.600 --> 00:12:56.960 can also distort distances. 265 00:12:57.120 --> 00:13:00.639 Okay, so kNN needs labels. What if you don't have labels, 266 00:13:00.639 --> 00:13:02.840 but you suspect there are groups in your data? 267 00:13:02.799 --> 00:13:06.080 Then you're talking about unsupervised learning, and k-means clustering 268 00:13:06.159 --> 00:13:07.240 is a common technique there. 269 00:13:07.320 --> 00:13:11.240 Unsupervised. So the algorithm finds the groups itself? Precisely.
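NOTE A minimal sketch of the kNN scaling pitfall from the cues above, assuming Euclidean distance and a plain majority vote. The applicants, labels and query are hypothetical, picked so that income in dollars swamps age in years until both features are standardized.
```python
import math
from collections import Counter
# Hypothetical labeled applicants: (income in dollars, age in years) and a risk label.
TRAIN_X = [(30000, 25), (30500, 60), (90000, 30), (90500, 62)]
TRAIN_Y = ["high", "low", "high", "low"]
QUERY = (30100, 61)
def standardize(points):
    """Return (scaled points, transform) using each feature's mean and standard deviation."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0 for c, m in zip(cols, means)]
    def transform(p):
        return tuple((v - m) / s for v, m, s in zip(p, means, stds))
    return [transform(p) for p in points], transform
def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda xy: math.dist(xy[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
SCALED_X, to_scaled = standardize(TRAIN_X)
# Unscaled, the 100-dollar income gap outweighs the 36-year age gap.
raw_label = knn_predict(TRAIN_X, TRAIN_Y, QUERY, k=1)
# After scaling, the neighbor of similar age is the closest one.
scaled_label = knn_predict(SCALED_X, TRAIN_Y, to_scaled(QUERY), k=1)
```
With these numbers the unscaled and scaled votes disagree, which is the point of the scaling warning; the same `knn_predict` also accepts a larger k to soften the k = 1 overfitting risk.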
270 00:13:11.320 --> 00:13:13.559 You tell it how many clusters, k, you think exist, 271 00:13:14.080 --> 00:13:17.120 and the algorithm iteratively assigns points to the nearest cluster 272 00:13:17.200 --> 00:13:21.960 center, the centroid, and then recalculates the centroids until things stabilize. 273 00:13:22.000 --> 00:13:22.840 Why would you do that? 274 00:13:23.120 --> 00:13:25.639 Lots of reasons. Maybe you want to segment users for 275 00:13:25.679 --> 00:13:30.320 different marketing or product experiences, or build separate predictive models 276 00:13:30.360 --> 00:13:33.919 for distinct customer groups. k-means helps you discover those 277 00:13:33.960 --> 00:13:36.679 groups automatically instead of you trying to define them with 278 00:13:36.799 --> 00:13:38.480 arbitrary rules or thresholds. 279 00:13:38.559 --> 00:13:42.240 So it automates finding clusters in, like, many dimensions. 280 00:13:42.399 --> 00:13:46.399 Yeah, that's the power, but it has its quirks. Choosing 281 00:13:46.399 --> 00:13:49.240 the right k is often more art than science, and 282 00:13:49.360 --> 00:13:52.200 sometimes the algorithm can get stuck in a suboptimal solution 283 00:13:52.519 --> 00:13:53.919 depending on where it starts. 284 00:13:54.360 --> 00:13:55.600 Is it an old algorithm? 285 00:13:55.639 --> 00:13:58.279 The basic idea goes back to the fifties, Steinhaus and 286 00:13:58.360 --> 00:14:02.399 Lloyd; MacQueen coined the term k-means in '67. There 287 00:14:02.399 --> 00:14:04.879 are newer versions like k-means++ from 288 00:14:04.879 --> 00:14:07.720 2007 that try to start the algorithm off better. 289 00:14:07.799 --> 00:14:10.919 Okay, so we said kNN isn't great for spam filtering 290 00:14:10.960 --> 00:14:13.440 because of high dimensions. What does work well? That 291 00:14:13.360 --> 00:14:17.919 brings us to naive Bayes, a surprisingly effective probabilistic 292 00:14:17.240 --> 00:14:19.080 approach based on Bayes' law.
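NOTE A minimal sketch of the k-means loop described a few cues earlier (assign each point to its nearest centroid, recalculate the centroids, stop when they stabilize). The points and the fixed starting centroids are hypothetical; a real run would usually pick starting centroids at random or with a k-means++ style seeding, which this sketch does not implement.
```python
import math
def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its points (kept as-is if empty).
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_clusters = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
```
Because the result depends on the starting centroids, running the loop from several different starts and keeping the tightest clustering is a common way to reduce the risk of a suboptimal solution.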
293 00:14:19.080 --> 00:14:21.960 Exactly. Remember Bayes' law from stats? P(A|B) = P(B|A) P(A) / P(B). 294 00:14:22.559 --> 00:14:26.679 Vaguely, like the disease testing example: probability you're sick given 295 00:14:26.720 --> 00:14:27.600 a positive test. 296 00:14:27.879 --> 00:14:30.279 That's the one. We apply the same logic to spam. 297 00:14:31.240 --> 00:14:34.320 What's the probability an email is spam, given that it contains 298 00:14:34.320 --> 00:14:37.120 the word "viagra"? P(spam|word) = P(word|spam) P(spam) / P(word). 299 00:14:37.080 --> 00:14:38.759 Makes sense. What's the naive part? 300 00:14:39.000 --> 00:14:41.559 The naive assumption is that the words in the email 301 00:14:41.720 --> 00:14:43.240 appear independently of each 302 00:14:43.120 --> 00:14:46.720 other. Which isn't true, right? "Free" and "viagra" probably appear 303 00:14:46.759 --> 00:14:49.960 together more often than by chance in spam. 304 00:14:50.120 --> 00:14:55.000 Totally untrue. But the simplification makes the math tractable, and 305 00:14:55.120 --> 00:14:59.279 surprisingly it often works really well in practice, especially for text. 306 00:14:59.200 --> 00:15:01.440 Any pitfalls with just counting words? 307 00:15:01.559 --> 00:15:04.399 Oh yeah. If a word like "viagra" only appeared in 308 00:15:04.440 --> 00:15:07.159 spam in your training data, the model might assign a 309 00:15:07.200 --> 00:15:09.639 one hundred percent probability of spam if it sees that 310 00:15:09.679 --> 00:15:12.799 word again. It's overfitting. Also, what if you see a 311 00:15:12.840 --> 00:15:15.679 word you've never seen before? The probability would be zero, 312 00:15:16.240 --> 00:15:17.559 which messes up the calculation. 313 00:15:17.679 --> 00:15:18.559 So how do you fix that?
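NOTE A minimal sketch of the single-word spam calculation from the cues above, using Bayes' law with P(word) expanded over spam and non-spam. The email counts are hypothetical, and the alpha pseudo-count parameter is an assumption of this sketch showing one common remedy (additive smoothing) for the 100 percent and zero-count pitfalls just raised.
```python
def p_word_given_class(n_with_word, n_class, alpha=0.0):
    """Estimate P(word present | class); alpha > 0 adds a pseudo-count to both outcomes."""
    return (n_with_word + alpha) / (n_class + 2 * alpha)
def p_spam_given_word(spam_with, n_spam, ham_with, n_ham, alpha=0.0):
    """Bayes' law for one word: P(spam | word) = P(word | spam) P(spam) / P(word)."""
    prior_spam = n_spam / (n_spam + n_ham)
    prior_ham = 1.0 - prior_spam
    numerator = p_word_given_class(spam_with, n_spam, alpha) * prior_spam
    # P(word) is the sum over both classes of P(word | class) * P(class).
    evidence = numerator + p_word_given_class(ham_with, n_ham, alpha) * prior_ham
    return numerator / evidence
# Hypothetical training counts: the word shows up in 20 of 100 spam emails and 0 of 900 others.
unsmoothed = p_spam_given_word(20, 100, 0, 900)
smoothed = p_spam_given_word(20, 100, 0, 900, alpha=1.0)
```
Without smoothing the estimate is a hard certainty, and a word seen in neither class would make the denominator zero; with alpha = 1 the estimate stays high but leaves room for doubt.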
314 00:15:18.840 --> 00:15:22.879 With Laplace smoothing, sometimes called additive smoothing. You basically add 315 00:15:22.879 --> 00:15:25.600 a small pseudo-count to every word count, pretending you've 316 00:15:25.639 --> 00:15:27.840 seen each word at least once, or a fraction of 317 00:15:27.879 --> 00:15:31.159 a time. It prevents zero probabilities and generally makes the 318 00:15:31.240 --> 00:15:32.279 estimates more robust. 319 00:15:32.480 --> 00:15:35.919 And this was used in that NYT article classification exercise. 320 00:15:36.000 --> 00:15:39.679 Yes, exactly, Jake's exercise. Download two thousand articles from different 321 00:15:39.720 --> 00:15:43.840 sections (arts, business, sports, et cetera) using the API. Train 322 00:15:43.919 --> 00:15:47.960 a naive Bayes model, specifically Bernoulli naive Bayes here, to 323 00:15:48.000 --> 00:15:52.440 classify them, tune the smoothing parameters, evaluate with a confusion matrix. 324 00:15:52.480 --> 00:15:55.559 See which words were most indicative of each section. Great 325 00:15:55.639 --> 00:15:56.559 hands-on example. 326 00:15:56.720 --> 00:15:59.960 Cool. Okay, another big one: logistic regression. How's that different 327 00:16:00.159 --> 00:16:01.440 from linear regression? 328 00:16:01.639 --> 00:16:05.480 Linear regression predicts a continuous value, right, like a house price. 329 00:16:05.919 --> 00:16:09.600 Logistic regression predicts the probability of a binary outcome, something 330 00:16:09.600 --> 00:16:11.200 that's either yes or no, zero or 331 00:16:11.200 --> 00:16:13.440 one. Like, will a user click an ad? Is this 332 00:16:13.519 --> 00:16:15.559 email spam? Will this customer churn? 333 00:16:15.679 --> 00:16:17.600 Exactly, those kinds of things. Binary outcomes. 334 00:16:17.720 --> 00:16:20.080 And how does it predict a probability? Doesn't a linear 335 00:16:20.120 --> 00:16:21.480 model output any number?
336 00:16:21.639 --> 00:16:24.000 It starts with a linear combination of features, just like 337 00:16:24.080 --> 00:16:27.519 linear regression: alpha plus beta times x. But then it feeds that 338 00:16:27.600 --> 00:16:31.440 result through a special function called the logistic function or sigmoid 339 00:16:30.960 --> 00:16:33.360 function. The S-shaped curve? That's the one. 340 00:16:33.240 --> 00:16:38.000 P(t) = 1 / (1 + e^(-t)). This function squishes any input value 341 00:16:38.080 --> 00:16:41.360 into an output between zero and one, perfect for representing 342 00:16:41.360 --> 00:16:42.159 a probability. 343 00:16:42.440 --> 00:16:45.159 So alpha and beta still mean something? Yep. 344 00:16:45.799 --> 00:16:48.919 Alpha is related to the baseline probability, the overall 345 00:16:48.919 --> 00:16:51.720 odds, and the betas are the weights for each feature, 346 00:16:52.159 --> 00:16:54.639 telling you how much each feature changes the log odds 347 00:16:54.639 --> 00:16:55.120 of the outcome. 348 00:16:55.279 --> 00:16:57.960 How do you find the best alpha and betas? 349 00:16:58.480 --> 00:17:02.200 Usually with maximum likelihood estimation: you find the parameters that 350 00:17:02.240 --> 00:17:06.119 make the observed data most probable. This often involves optimization 351 00:17:06.160 --> 00:17:10.000 algorithms like Newton's method or, especially for huge data sets, 352 00:17:10.200 --> 00:17:12.240 stochastic gradient descent, SGD. 353 00:17:12.720 --> 00:17:14.319 SGD sounds familiar. 354 00:17:14.039 --> 00:17:16.279 Very common in large-scale machine learning. It updates the 355 00:17:16.319 --> 00:17:19.000 parameters using just one data point or a small batch 356 00:17:19.000 --> 00:17:21.759 at a time, making it efficient for massive data sets, 357 00:17:21.880 --> 00:17:25.279 especially sparse ones. Tools like Mahout or Vowpal Wabbit use 358 00:17:25.279 --> 00:17:25.799 it heavily.
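NOTE A minimal sketch of the logistic regression recipe in the cues above: a linear score alpha + beta * x, squashed by the sigmoid, with the parameters fitted by one-point-at-a-time stochastic gradient updates on the log-likelihood. The data, learning rate, epoch count and seed are hypothetical choices for illustration, not settings from the book or from any named tool.
```python
import math
import random
def sigmoid(t):
    """Logistic function: maps any real t to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))
def sgd_logistic(data, lr=0.1, epochs=200, seed=0):
    """Fit P(y = 1 | x) = sigmoid(alpha + beta * x) with single-example gradient steps."""
    rng = random.Random(seed)
    rows = list(data)
    alpha, beta = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the examples in a fresh random order each epoch
        for x, y in rows:
            err = y - sigmoid(alpha + beta * x)  # log-likelihood gradient with respect to the score
            alpha += lr * err
            beta += lr * err * x
    return alpha, beta
# Hypothetical (feature, clicked) pairs with some overlap between the two classes.
DATA = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 1), (2.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]
alpha_hat, beta_hat = sgd_logistic(DATA)
p_low = sigmoid(alpha_hat + beta_hat * 0.5)
p_high = sigmoid(alpha_hat + beta_hat * 4.0)
```
A positive fitted beta means each unit increase in x raises the log odds of a click by beta, which is the interpretation given for the betas above.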
359 00:17:26.039 --> 00:17:29.480 Now, evaluating these models. You said accuracy isn't always great. 360 00:17:29.440 --> 00:17:32.079 Right, especially with imbalanced classes. If only one percent of 361 00:17:32.119 --> 00:17:35.480 emails are spam, a model predicting "not spam" one hundred 362 00:17:35.480 --> 00:17:37.799 percent of the time is ninety-nine percent accurate, but useless. 363 00:17:37.880 --> 00:17:39.039 So what should we use instead? 364 00:17:39.359 --> 00:17:43.519 Look at precision: of the times you predicted spam, how 365 00:17:43.559 --> 00:17:46.559 often were you right? And recall: of all the actual 366 00:17:46.559 --> 00:17:49.720 spam, how much did you catch? Often there's a trade-off. 367 00:17:49.559 --> 00:17:51.359 And F-score? AUC? 368 00:17:51.720 --> 00:17:55.160 F-score tries to combine precision and recall into one number. 369 00:17:55.599 --> 00:17:58.839 AUC, area under the ROC curve, is really good because 370 00:17:58.839 --> 00:18:02.960 it measures performance across all possible thresholds and isn't thrown 371 00:18:02.960 --> 00:18:05.920 off by imbalanced classes. It's base-rate invariant. 372 00:18:06.200 --> 00:18:09.599 But even these metrics might not capture the real goal. Exactly. 373 00:18:10.200 --> 00:18:13.079 Your model might have great AUC, but does it actually 374 00:18:13.079 --> 00:18:16.880 increase revenue or user engagement? That's where A/B testing comes in, 375 00:18:17.000 --> 00:18:18.880 the gold standard for real-world impact. 376 00:18:19.000 --> 00:18:21.519 Yeah, run a controlled experiment: show the old system to 377 00:18:21.559 --> 00:18:23.960 group A, the new model to group B, and measure 378 00:18:24.000 --> 00:18:27.000 the actual business outcome you care about. Google's paper on 379 00:18:27.079 --> 00:18:28.680 experimentation really drives this
380 00:18:28.759 --> 00:18:32.480 home. And Media 6 Degrees, M6D, used logistic 381 00:18:32.519 --> 00:18:34.160 regression for predicting ad clicks. 382 00:18:34.240 --> 00:18:37.759 Yeah, a classic application: user-level conversion prediction, highly scalable 383 00:18:37.759 --> 00:18:39.119 and effective for binary outcomes. 384 00:18:39.160 --> 00:18:41.680 So okay, let's get into some real-world messiness. Timestamps 385 00:18:41.759 --> 00:18:45.079 seem simple, but you said they're tricky. Oh, they are. 386 00:18:45.240 --> 00:18:48.279 You get tons of timestamped event data: user clicks, 387 00:18:48.559 --> 00:18:52.200 check-ins, sensor readings. That's big data right there. But 388 00:18:52.240 --> 00:18:56.519 they introduce subtle problems. Like what? The biggest is causality. 389 00:18:57.160 --> 00:18:59.680 You cannot use information from the future to predict the 390 00:18:59.680 --> 00:19:03.559 present or past. Sounds obvious, but it's easy to accidentally 391 00:19:03.599 --> 00:19:06.559 leak future information into your training data if you're not 392 00:19:06.640 --> 00:19:07.880 careful with timestamps. 393 00:19:08.079 --> 00:19:10.599 Ah, the time travel problem. Exactly. 394 00:19:10.960 --> 00:19:13.839 You also have to be super careful distinguishing in-sample 395 00:19:13.920 --> 00:19:17.640 training data from out-of-sample testing data based on time, 396 00:19:18.440 --> 00:19:20.880 and often you need running estimates, like a running average, 397 00:19:20.920 --> 00:19:24.200 not a single average calculated over all past data, because 398 00:19:24.240 --> 00:19:27.480 the world changes. Like in finance? Finance is a great example. 399 00:19:27.720 --> 00:19:30.680 They often use log returns instead of simple percentage returns 400 00:19:31.000 --> 00:19:34.599 because log returns handle compounding better and are more symmetric. 401 00:19:35.160 --> 00:19:38.880