WEBVTT 1 00:00:00.080 --> 00:00:02.960 What if I told you that adding your most highly educated, 2 00:00:04.080 --> 00:00:07.799 highly active users to a data set could mathematically make 3 00:00:07.839 --> 00:00:11.560 your entire user base look, you know, substantially less sociable? 4 00:00:11.640 --> 00:00:14.439 Yeah, I mean it completely sounds like a broken algorithm, right, 5 00:00:14.759 --> 00:00:18.559 but it's actually this fundamental statistical trap that catches data 6 00:00:18.559 --> 00:00:21.359 teams off guard literally every single 7 00:00:21.120 --> 00:00:24.920 day. Right, and today we are dismantling those traps. We're 8 00:00:24.920 --> 00:00:27.679 doing a deep dive into the core concepts of data science, 9 00:00:28.160 --> 00:00:32.079 pulling directly from Joel Grus's Data Science from Scratch, and 10 00:00:32.119 --> 00:00:35.799 the mission here is straightforward. We are stepping entirely away 11 00:00:35.840 --> 00:00:38.320 from those prepackaged software libraries. 12 00:00:37.960 --> 00:00:41.759 Exactly, because relying purely on high level abstractions, you know, 13 00:00:41.799 --> 00:00:45.119 just typing import pandas and calling a dot mean function, 14 00:00:45.759 --> 00:00:47.880 it creates this really dangerous blind spot. 15 00:00:47.960 --> 00:00:49.240 You're just trusting a black box. 16 00:00:49.399 --> 00:00:52.119 Right. When you treat your analytical tools as black boxes, 17 00:00:52.159 --> 00:00:55.000 you lose the ability to actually interrogate the underlying assumptions. 18 00:00:55.039 --> 00:00:57.119 I mean, you end up optimizing for algorithms you don't 19 00:00:57.119 --> 00:00:58.399 fully understand. 20 00:00:57.960 --> 00:01:03.159 Which inevitably leads to very confident, mathematically sound, and completely 21 00:01:03.200 --> 00:01:04.400 incorrect conclusions. 22 00:01:04.640 --> 00:01:08.040 Yes, the worst kind of incorrect conclusions.
23 00:01:08.439 --> 00:01:11.159 So to ground this deep dive for you, the listener, 24 00:01:11.599 --> 00:01:15.719 we're placing you in a very specific hypothetical scenario today. 25 00:01:16.000 --> 00:01:18.359 You have just been brought in as the founding data 26 00:01:18.400 --> 00:01:22.280 scientist at a startup called DataSciencester, which is, you know, 27 00:01:22.359 --> 00:01:25.400 a social network tailored entirely for data professionals. 28 00:01:25.480 --> 00:01:27.200 Sounds like a very niche market. 29 00:01:27.159 --> 00:01:30.840 Very niche, but the point is there is no legacy infrastructure. 30 00:01:31.159 --> 00:01:35.280 You are tasked with building the analytical pipeline completely from scratch. 31 00:01:35.040 --> 00:01:38.000 Which means before you can analyze a single user's behavior, 32 00:01:38.120 --> 00:01:41.000 you have to actually choose your architecture. And the source 33 00:01:41.040 --> 00:01:43.920 material strongly advocates for Python. 34 00:01:43.840 --> 00:01:47.200 But not just because it's popular, right? The rationale goes 35 00:01:47.280 --> 00:01:50.519 way beyond its ecosystem of data tools. 36 00:01:50.599 --> 00:01:54.120 Oh, absolutely, the argument is rooted entirely in Python's core 37 00:01:54.159 --> 00:01:56.719 design philosophy. It goes back to the Zen of Python, 38 00:01:56.799 --> 00:02:00.760 which basically dictates that, you know, explicit is better than implicit. 39 00:02:00.920 --> 00:02:03.239 Right, and we see this manifested most clearly in how 40 00:02:03.319 --> 00:02:06.599 Python enforces structural readability through white space. 41 00:02:07.079 --> 00:02:09.159 Yes, the white space rule. I mean, if you look 42 00:02:09.159 --> 00:02:11.680 at C++ or Java, the scope of a 43 00:02:11.719 --> 00:02:14.439 function is defined by curly braces. 44 00:02:14.319 --> 00:02:17.599 Which the compiler basically just ignores.
Yeah, right, like it 45 00:02:17.639 --> 00:02:19.879 ignores the indentation completely. Exactly. 46 00:02:20.080 --> 00:02:24.599 You can have this incredibly complex, deeply nested logic crammed 47 00:02:24.639 --> 00:02:27.439 onto a single line of text and the machine will 48 00:02:27.439 --> 00:02:30.319 parse it perfectly, even if it is completely illegible to 49 00:02:30.360 --> 00:02:33.120 the next engineer who has to inherit your code base. 50 00:02:33.240 --> 00:02:33.400 Right. 51 00:02:33.680 --> 00:02:37.560 But Python removes that option entirely. The visual structure of 52 00:02:37.560 --> 00:02:41.400 the code must match the logical structure, or the interpreter 53 00:02:41.439 --> 00:02:43.919 will just throw an indentation error and refuse to run. 54 00:02:44.039 --> 00:02:46.599 It's honestly like a Marie Kondo approach to coding. 55 00:02:46.759 --> 00:02:48.039 A Marie Kondo approach? 56 00:02:48.120 --> 00:02:50.159 Yeah, like with a curly brace language, you can take 57 00:02:50.199 --> 00:02:53.439 this absolute disaster of a messy room, shove all the 58 00:02:53.520 --> 00:02:56.719 tangled logic into a closet, slam the compiler door shut, 59 00:02:56.800 --> 00:02:57.639 and it just runs. 60 00:02:57.759 --> 00:02:59.280 That's, yeah, that's a great way to put it. 61 00:02:59.319 --> 00:03:02.719 But Python. Python forces you to organize the closet so 62 00:03:02.719 --> 00:03:04.800 the structure is visible the second you open the file. 63 00:03:05.199 --> 00:03:08.360 Everyone can see exactly what sparks joy or what causes 64 00:03:08.400 --> 00:03:09.319 a fatal crash. 65 00:03:09.520 --> 00:03:13.199 That's exactly it. And this emphasis on explicit structure has 66 00:03:13.240 --> 00:03:16.319 really only become more critical with the shift to Python 3, 67 00:03:16.719 --> 00:03:19.479 especially when we start talking about the adoption of type 68 00:03:19.520 --> 00:03:22.240 annotations in these big data pipelines.
69 00:03:22.439 --> 00:03:25.879 Right, because Python natively is dynamically typed. Yes. 70 00:03:26.120 --> 00:03:29.159 Meaning a variable can hold an integer and then literally 71 00:03:29.199 --> 00:03:31.280 in the next line of code it can be reassigned 72 00:03:31.280 --> 00:03:32.560 to a string, which 73 00:03:32.319 --> 00:03:33.960 is great if you're just writing a quick script. 74 00:03:34.000 --> 00:03:37.080 Sure, in a localized script, that flexibility speeds up development, 75 00:03:37.560 --> 00:03:40.639 but in a massive data ingestion pipeline, it is a 76 00:03:40.680 --> 00:03:42.240 massive liability because you 77 00:03:42.159 --> 00:03:44.479 don't know what data is actually flowing 78 00:03:44.080 --> 00:03:47.599 through the pipe. Exactly. Let's say your pipeline is pulling 79 00:03:47.719 --> 00:03:52.360 user engagement metrics for DataSciencester and some localized 80 00:03:52.400 --> 00:03:55.400 anomaly introduces a string, maybe it's like a text-based 81 00:03:55.479 --> 00:03:58.800 NaN or just a null character, into a feature set 82 00:03:58.840 --> 00:04:01.520 that is mathematically expecting floating point numbers. 83 00:04:01.639 --> 00:04:02.520 Oh right. 84 00:04:02.960 --> 00:04:05.800 In a purely dynamic setup, the pipeline might not even 85 00:04:05.879 --> 00:04:09.520 crash immediately. It might just perform a silent type coercion 86 00:04:09.680 --> 00:04:12.800 or propagate that null value all the way through your 87 00:04:12.800 --> 00:04:14.919 downstream transformations. 88 00:04:14.599 --> 00:04:17.839 Which ultimately just corrupts the training data for whatever 89 00:04:17.839 --> 00:04:19.560 machine learning model you're building. 90 00:04:19.319 --> 00:04:22.240 Exactly, and you wouldn't even realize the error until the 91 00:04:22.279 --> 00:04:25.639 model's accuracy mysteriously degraded in production weeks later. 92 00:04:25.920 --> 00:04:29.160 Wow.
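The failure mode described here, a stray string landing in a list that should hold only floats, can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function name `mean_minutes` and the sample values are hypothetical. Note that Python does not enforce annotations at runtime, so the sketch pairs the annotation (for a static checker such as mypy) with an explicit runtime guard.

```python
from typing import List


def mean_minutes(minutes: List[float]) -> float:
    """Average daily minutes; every element is expected to be a real number."""
    for value in minutes:
        # Annotations alone are not checked by the interpreter,
        # so reject non-numeric values explicitly (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"expected a number, got {value!r}")
    return sum(minutes) / len(minutes)


print(mean_minutes([12.5, 30.0, 47.5]))  # 30.0

try:
    # A static type checker would be expected to flag this call before it
    # runs, because "nan" is a str, not a float.
    mean_minutes([12.5, "nan", 30.0])
except TypeError as error:
    print("rejected:", error)
```

The annotation documents the contract and lets tooling catch violations early; the runtime check covers data that only arrives when the pipeline is actually running.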
So by using type annotations, you're basically establishing a 93 00:04:29.199 --> 00:04:31.160 strict contract for your functions. 94 00:04:31.199 --> 00:04:36.480 Precisely. You declare explicitly upfront that a specific ingestion module 95 00:04:36.639 --> 00:04:39.959 expects an integer and returns a float, period, and 96 00:04:39.920 --> 00:04:43.079 then static type checkers can actually analyze the codebase before 97 00:04:43.079 --> 00:04:45.800 it even runs, flagging any potential violations. 98 00:04:45.879 --> 00:04:49.079 It completely shifts the paradigm. You go from basically crossing 99 00:04:49.079 --> 00:04:52.399 your fingers and hoping the data conforms to your expectations to 100 00:04:52.639 --> 00:04:56.959 architecting a system that mathematically guarantees the data types before 101 00:04:57.000 --> 00:04:58.399 the processing even begins. 102 00:04:58.600 --> 00:05:02.040 So you're building transparency and resilience right into the foundation 103 00:05:02.160 --> 00:05:03.160 of DataSciencester. 104 00:05:03.319 --> 00:05:06.079 You have to, because once you have a type safe 105 00:05:06.120 --> 00:05:09.879 pipeline aggregating all this clean user data, the immediate next 106 00:05:09.920 --> 00:05:13.639 step in the analytical life cycle is exploratory data analysis. 107 00:05:13.759 --> 00:05:16.399 Right. You want to visualize the distributions to spot the 108 00:05:16.439 --> 00:05:17.720 broader trends. 109 00:05:17.560 --> 00:05:21.920 Which introduces a completely different class of vulnerability into your workflow. 110 00:05:22.079 --> 00:05:25.879 Ah. Yes, because you've moved from the strict, unforgiving logic 111 00:05:25.920 --> 00:05:30.199 of the compiler to the highly subjective translation of data 112 00:05:30.240 --> 00:05:31.040 into pixels.
113 00:05:31.160 --> 00:05:34.839 Yes, human visual processing hardware has all these built in heuristics, 114 00:05:34.879 --> 00:05:37.519 and those heuristics are remarkably easy to exploit. 115 00:05:37.759 --> 00:05:41.560 The source material actually highlights this using matplotlib, specifically 116 00:05:41.560 --> 00:05:44.399 looking at how you can manipulate the axis in bar charts. 117 00:05:44.519 --> 00:05:45.720 It's such a classic trap. 118 00:05:45.920 --> 00:05:48.839 Right. Let's say you were presenting platform growth to the 119 00:05:48.920 --> 00:05:52.040 DataSciencester board of directors, and in twenty seventeen 120 00:05:52.079 --> 00:05:55.560 the platform was mentioned five hundred times. Then in twenty 121 00:05:55.600 --> 00:05:58.680 eighteen it was mentioned five hundred and five times. 122 00:05:58.519 --> 00:06:01.439 Which is, let's be honest, a fractional increase. It's barely a 123 00:06:01.439 --> 00:06:03.279 blip in the actual volume. 124 00:06:03.000 --> 00:06:04.920 Right, but if you construct a bar chart for the 125 00:06:04.920 --> 00:06:07.199 board and you set the axis to start at four 126 00:06:07.360 --> 00:06:10.279 ninety nine and end at five oh six, you dramatically 127 00:06:10.360 --> 00:06:12.720 alter the visual narrative. You really do, because the bar 128 00:06:12.800 --> 00:06:14.879 for twenty seventeen sits at a value of one unit 129 00:06:14.879 --> 00:06:18.040 above the baseline, but the bar for twenty eighteen rises 130 00:06:18.079 --> 00:06:19.839 to six units above the baseline. Yep. 131 00:06:20.639 --> 00:06:23.720 Visually, that twenty eighteen bar is taking up six times 132 00:06:23.759 --> 00:06:25.920 the physical space on the screen. Exactly. 133 00:06:26.000 --> 00:06:29.680 It looks like this towering, exponential six hundred percent increase, 134 00:06:30.319 --> 00:06:32.560 even though the underlying data barely even moved.
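The arithmetic behind the truncated-axis effect is easy to verify. The sketch below is illustrative only (the helper name `bar_height_ratio` is made up for this example); it compares the drawn heights of the two bars when the value axis starts at 499 versus when it starts at 0, using the mention counts quoted in the conversation.

```python
def bar_height_ratio(first: float, second: float, baseline: float) -> float:
    """Ratio of drawn bar heights when the value axis starts at `baseline`."""
    return (second - baseline) / (first - baseline)


mentions_2017, mentions_2018 = 500, 505

# Axis starting at 499: the bars are 1 and 6 units tall.
print(bar_height_ratio(mentions_2017, mentions_2018, baseline=499))  # 6.0

# Axis starting at 0: the bars are 500 and 505 units tall.
print(bar_height_ratio(mentions_2017, mentions_2018, baseline=0))  # 1.01
```

With a zero baseline the second bar is 1 percent taller, which matches the real change in the data; with the truncated baseline it is drawn six times taller.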
135 00:06:32.720 --> 00:06:36.879 And this is exactly where understanding cognitive psychology intersects with 136 00:06:37.000 --> 00:06:42.000 data science. Our visual cortex processes different geometric shapes using 137 00:06:42.079 --> 00:06:44.399 completely different underlying rules. 138 00:06:45.160 --> 00:06:47.240 Wait, let me push back on this rule for a second. 139 00:06:47.319 --> 00:06:51.079 Sure. Isn't zooming in on the axis just, I don't 140 00:06:51.079 --> 00:06:54.800 know, a helpful way to highlight the relevant detail? If 141 00:06:54.839 --> 00:06:57.480 I'm tracking a metric from five hundred to five oh five, 142 00:06:58.000 --> 00:06:59.920 why wouldn't I want to zoom in to show that 143 00:07:00.040 --> 00:07:01.000 specific variance? 144 00:07:01.319 --> 00:07:03.160 Well, it depends on the chart you're using. If we 145 00:07:03.199 --> 00:07:05.519 look at a line chart, we are evaluating the angle 146 00:07:05.560 --> 00:07:08.519 of the slope. The cognitive focus is on the trajectory 147 00:07:08.519 --> 00:07:11.120 and the rate of change over time. Okay. So zooming 148 00:07:11.160 --> 00:07:14.160 in on a line chart to expose localized volatility, say 149 00:07:14.279 --> 00:07:17.560 tracking minute by minute stock fluctuations between one hundred and one 150 00:07:17.600 --> 00:07:21.040 hundred and five dollars, is completely analytically valid. The slope 151 00:07:21.079 --> 00:07:24.399 still remains a true representation of the localized variance. 152 00:07:24.199 --> 00:07:25.759 Right, because I'm just looking at the angle of the 153 00:07:25.800 --> 00:07:26.399 line going 154 00:07:26.319 --> 00:07:29.399 up and down. Exactly. But the visual processing mechanism for 155 00:07:29.439 --> 00:07:32.560 a bar chart is fundamentally different. With a bar chart, 156 00:07:32.639 --> 00:07:35.879 the human brain instinctively equates the value of the data 157 00:07:35.879 --> 00:07:38.839 point with the total two dimensional area
158 00:07:38.560 --> 00:07:41.319 of the bar, like the actual amount of ink printed 159 00:07:41.360 --> 00:07:41.879 on the page. 160 00:07:41.959 --> 00:07:45.519 Yes, the literal amount of ink. So by truncating the 161 00:07:45.560 --> 00:07:48.680 axis and starting at four ninety nine, you are divorcing 162 00:07:48.680 --> 00:07:51.040 the area of the bar from its mathematical value. 163 00:07:51.160 --> 00:07:52.600 Oh wow, I see what you mean. 164 00:07:52.720 --> 00:07:55.920 Yeah, you are asking the viewer's brain to process a 165 00:07:55.920 --> 00:07:59.399 physical shape that is six times larger, while expecting them 166 00:07:59.439 --> 00:08:02.720 to override their own visual instincts by reading the tiny 167 00:08:02.800 --> 00:08:04.439 numbers printed on the axis. 168 00:08:04.639 --> 00:08:06.480 It's a total cognitive mismatch. 169 00:08:06.680 --> 00:08:09.519 Exactly, a non zero axis on a bar chart basically 170 00:08:09.560 --> 00:08:12.759 mathematically lies to the viewer's visual cortex, and 171 00:08:12.720 --> 00:08:14.839 the book points out that the same kind of distortion 172 00:08:14.959 --> 00:08:16.959 applies to variance in scatter plots too. 173 00:08:17.040 --> 00:08:17.720 Oh absolutely. 174 00:08:17.800 --> 00:08:20.079 Like if we map out user test scores with test 175 00:08:20.120 --> 00:08:22.399 one on the x axis and test two on the y axis, 176 00:08:22.759 --> 00:08:26.079 the scaling of those axes defines the perceived standard deviation. 177 00:08:26.439 --> 00:08:30.920 If your plotting library automatically scales the x axis to 178 00:08:30.959 --> 00:08:33.559 cover a twenty point spread, but then it stretches the y 179 00:08:33.600 --> 00:08:36.000 axis to cover a forty point spread just to fill 180 00:08:36.080 --> 00:08:36.600 up the screen. 181 00:08:36.759 --> 00:08:39.919 Then the visual density of your clusters is completely compromised.
182 00:08:40.080 --> 00:08:42.960 Right, the data along the y axis will appear to have 183 00:08:43.039 --> 00:08:47.200 significantly higher variance simply because the pixels are stretched further apart. 184 00:08:47.480 --> 00:08:50.519 You have to force comparable axes to maintain the integrity 185 00:08:50.519 --> 00:08:51.200 of the distribution. 186 00:08:51.480 --> 00:08:54.320 You do, but you know, visualization is really just a 187 00:08:54.360 --> 00:08:57.840 tool for spotting aggregate trends, and to truly understand the 188 00:08:57.879 --> 00:09:01.360 mechanics of a social platform like DataSciencester, aggregate 189 00:09:01.399 --> 00:09:02.440 trends aren't enough. 190 00:09:02.840 --> 00:09:04.960 No, the executive team wants to know who the key 191 00:09:05.039 --> 00:09:07.480 influencers are. They want to find the nodes with the 192 00:09:07.519 --> 00:09:09.159 highest degree centrality, right. 193 00:09:09.039 --> 00:09:12.480 Which means we have to analyze the topology of the network 194 00:09:12.200 --> 00:09:16.399 itself, and calculating degree centrality basically just means counting who 195 00:09:16.480 --> 00:09:20.320 has the most friends. But doing that requires analyzing the edges, 196 00:09:20.440 --> 00:09:22.759 the connections between the users, and in a 197 00:09:22.799 --> 00:09:26.440 raw data format, this usually exists as an edge list. 198 00:09:26.519 --> 00:09:28.720 You know, user zero is friends with user one, user 199 00:09:28.799 --> 00:09:30.799 zero is friends with user two, user one is friends with 200 00:09:30.879 --> 00:09:32.480 user three, and so on, which 201 00:09:32.320 --> 00:09:34.919 is fine for a tiny data set. Iterating through a 202 00:09:34.960 --> 00:09:38.720 short list to count a specific user's connections is trivial, sure, 203 00:09:39.039 --> 00:09:39.679 but as
204 00:09:39.559 --> 00:09:44.200 the platform scales to, say, millions of users, the computational 205 00:09:44.240 --> 00:09:48.039 complexity of that search becomes a massive bottleneck. We are 206 00:09:48.080 --> 00:09:50.320 talking about big O notation 207 00:09:50.000 --> 00:09:51.519 here. Right, the dreaded big O. 208 00:09:51.840 --> 00:09:55.639 Exactly. Searching an unstructured edge list requires an O of 209 00:09:55.840 --> 00:09:58.039 n operation where n is the number 210 00:09:57.759 --> 00:10:00.799 of edges, meaning to find all connections for user one, 211 00:10:00.879 --> 00:10:05.120 the algorithm literally has to traverse the entire list, evaluating 212 00:10:05.200 --> 00:10:07.639 every single pair to see if user one is present. 213 00:10:07.720 --> 00:10:10.720 It is computationally expensive, and it scales terribly as the 214 00:10:10.759 --> 00:10:15.000 network grows, which is why data scientists transition into linear algebra. 215 00:10:15.360 --> 00:10:18.399 They translate the network structure by representing the connections as 216 00:10:18.440 --> 00:10:19.639 an adjacency matrix. 217 00:10:19.679 --> 00:10:22.120 Okay, I like to use an analogy for this efficiency jump. 218 00:10:22.159 --> 00:10:24.720 It's here. The edge list is like an old school Rolodex. 219 00:10:25.120 --> 00:10:27.080 If you want to know who user one knows, you 220 00:10:27.159 --> 00:10:29.279 have to flip through every single card in the entire 221 00:10:29.320 --> 00:10:32.919 box to check. Right, very slow. But the matrix is 222 00:10:32.960 --> 00:10:37.200 like a giant wall size pegboard. The rows represent every 223 00:10:37.320 --> 00:10:40.759 user and the columns represent those exact same users. If 224 00:10:40.879 --> 00:10:43.240 user A is friends with user B, you stick 225 00:10:43.279 --> 00:10:45.720 a peg in the intersecting cell, basically a one. If 226 00:10:45.720 --> 00:10:48.440 there's no connection, the cell is empty, a zero.
So 227 00:10:48.480 --> 00:10:52.279 you've taken this slow sequential list and transformed it into 228 00:10:52.320 --> 00:10:55.360 a dense structural grid of binary states. 229 00:10:55.519 --> 00:10:59.399 I love that pegboard analogy, and the performance implications of 230 00:10:59.399 --> 00:11:03.279 that transformation are profound. When we restructure the data into 231 00:11:03.279 --> 00:11:06.840 a matrix, we change the algorithmic complexity of finding a 232 00:11:06.960 --> 00:11:09.960 user's connections from an O of n sequential search to 233 00:11:10.039 --> 00:11:11.759 an O of one constant time lookup. 234 00:11:11.799 --> 00:11:13.320 O of one, so it's instantaneous. 235 00:11:13.399 --> 00:11:15.919 Exactly. If you need to know user five's connections, the 236 00:11:15.960 --> 00:11:18.840 system doesn't search at all. It just jumps directly to 237 00:11:18.879 --> 00:11:21.200 the memory address of row five and retrieves the background, 238 00:11:21.200 --> 00:11:24.919 which also aligns perfectly with modern hardware architecture. Yes it does. 239 00:11:25.480 --> 00:11:28.519 Traversing a linked list or an edge list often means 240 00:11:28.600 --> 00:11:32.679 jumping around to different non contiguous blocks of memory. 241 00:11:32.320 --> 00:11:34.679 Which causes cache misses and slows down 242 00:11:34.519 --> 00:11:39.200 the processing. Exactly. But a matrix stores these values in 243 00:11:39.360 --> 00:11:43.480 contiguous memory blocks, and that contiguous memory layout allows you 244 00:11:43.559 --> 00:11:48.679 to leverage SIMD operations: single instruction, multiple data. 245 00:11:48.799 --> 00:11:53.159 Because modern CPUs and particularly GPUs are explicitly designed to 246 00:11:53.159 --> 00:11:56.840 perform parallel math operations on contiguous arrays of numbers.
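The Rolodex-versus-pegboard comparison can be sketched with plain Python lists. This is an illustrative example, not the book's own code; the function names are hypothetical, and the three edges are the friendships mentioned earlier in the conversation (0 with 1, 0 with 2, 1 with 3). One caveat worth stating: the constant-time step is reaching a row or checking a single cell. Listing everyone in a row still takes time proportional to the number of users, and a dense matrix needs memory proportional to the square of the number of users.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def friends_from_edges(edges: List[Edge], user: int) -> List[int]:
    """Rolodex approach: scan every edge, O(number of edges)."""
    found = []
    for a, b in edges:
        if a == user:
            found.append(b)
        elif b == user:
            found.append(a)
    return found


def build_adjacency_matrix(edges: List[Edge], n_users: int) -> List[List[int]]:
    """Pegboard approach: cell [a][b] is 1 when a and b are friends."""
    matrix = [[0] * n_users for _ in range(n_users)]
    for a, b in edges:
        matrix[a][b] = 1
        matrix[b][a] = 1  # friendship is mutual, so the matrix is symmetric
    return matrix


edges = [(0, 1), (0, 2), (1, 3)]
matrix = build_adjacency_matrix(edges, n_users=4)

print(friends_from_edges(edges, 1))  # [0, 3]
print(matrix[1])                     # [1, 0, 0, 1]
print(matrix[0][2] == 1)             # True: a single cell lookup
print(sum(matrix[0]))                # 2: degree of user 0
```

Summing a row gives the degree count the executives asked for, without touching any edge that does not involve that user.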
247 00:11:57.039 --> 00:11:59.679 Right. So, by representing the social network as a matrix, 248 00:11:59.720 --> 00:12:02.440 you can utilize parallel processing to calculate the eigenvalues 249 00:12:02.519 --> 00:12:03.799 of the matrix, which 250 00:12:03.600 --> 00:12:05.000 gives you eigenvector centrality. 251 00:12:05.080 --> 00:12:08.480 Yes, a far more sophisticated metric that doesn't just measure 252 00:12:08.480 --> 00:12:11.360 how many friends a user has, but how influential those 253 00:12:11.399 --> 00:12:15.200 friends actually are. The way you structure your data fundamentally 254 00:12:15.240 --> 00:12:17.679 dictates the analytical power you can bring to bear. 255 00:12:17.960 --> 00:12:20.440 Okay, so let's take stock. You have built a type 256 00:12:20.440 --> 00:12:23.679 safe pipeline, you are forcing comparable axes on your charts 257 00:12:23.720 --> 00:12:26.960 to avoid those visual distortions, and you are utilizing GPU 258 00:12:27.000 --> 00:12:31.120 accelerated matrix operations to map network topology in constant time. 259 00:12:31.360 --> 00:12:33.000 The architecture is rock solid. 260 00:12:33.240 --> 00:12:36.440 It is. But then the VP of Growth knocks on 261 00:12:36.519 --> 00:12:39.679 your door. They want you to build a statistical profile 262 00:12:39.960 --> 00:12:41.559 of the typical user's behavior. 263 00:12:41.679 --> 00:12:42.440 Of course they do. 264 00:12:42.559 --> 00:12:44.679 They want to correlate the number of friends a user 265 00:12:44.759 --> 00:12:47.000 has with the number of daily minutes they spend on 266 00:12:47.000 --> 00:12:51.480 the platform, and this introduces us to the fragility of 267 00:12:51.679 --> 00:12:53.279 standard statistical metrics. 268 00:12:53.320 --> 00:12:56.759 Oh absolutely, when you're summarizing distributions the traditional mean, the 269 00:12:56.840 --> 00:12:58.720 average, is notoriously brittle. 270 00:12:59.039 --> 00:13:03.679 Yeah.
The book has this great classic example about university graduate 271 00:13:03.320 --> 00:13:08.000 salaries. The UNC geography major, yes. In the mid nineteen eighties, 272 00:13:08.039 --> 00:13:10.399 the major at the University of North Carolina with the 273 00:13:10.519 --> 00:13:13.039 highest mean starting salary was geography. 274 00:13:13.200 --> 00:13:16.960 And it wasn't because the market suddenly deeply valued cartography. 275 00:13:17.080 --> 00:13:19.879 No, it was solely because a single graduate named Michael 276 00:13:19.919 --> 00:13:21.120 Jordan entered the NBA. 277 00:13:21.399 --> 00:13:24.039 Right. Because the mean is calculated by summing all the 278 00:13:24.120 --> 00:13:26.919 values and dividing by the count, it distributes the weight 279 00:13:26.960 --> 00:13:29.279 of every value equally across the data set. 280 00:13:29.120 --> 00:13:33.000 Which means a massive multi million dollar outlier pulls the 281 00:13:33.159 --> 00:13:36.960 entire mathematical center of gravity toward itself. It completely obscures 282 00:13:37.000 --> 00:13:38.440 the typical distribution of the data. 283 00:13:38.720 --> 00:13:42.559 Whereas the median, by contrast, just relies on positional rank. 284 00:13:42.720 --> 00:13:46.759 It isolates the middle value and renders those extreme tails irrelevant. 285 00:13:46.879 --> 00:13:50.960 Exactly, and this vulnerability to outliers, it extends directly into 286 00:13:50.960 --> 00:13:52.440 how we measure correlation 287 00:13:52.120 --> 00:13:56.360 too. Right, like Pearson's correlation coefficient, which evaluates the linear 288 00:13:56.399 --> 00:13:58.200 relationship between two variables. 289 00:13:58.320 --> 00:14:02.720 Yes, but the underlying math for Pearson's relies on calculating covariance, 290 00:14:03.320 --> 00:14:07.080 and covariance involves multiplying the deviations of each data point 291 00:14:07.120 --> 00:14:07.720 from the mean.
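The mean-versus-median contrast drawn in the salary anecdote can be checked with the standard library's `statistics` module. The salary figures below are invented for illustration only; they are not real data from the book or from any university.

```python
from statistics import mean, median

# Hypothetical starting salaries for a small graduating class.
typical_salaries = [22_000, 24_000, 25_000, 26_000, 28_000]

# The same class plus one extreme earner (an assumed figure).
with_outlier = typical_salaries + [1_000_000]

print(mean(typical_salaries), median(typical_salaries))  # 25000 25000
print(mean(with_outlier), median(with_outlier))          # 187500 25500.0
```

One extra value moves the mean by 162,500 while the median shifts by only 500, which is the sense in which the median depends on rank rather than magnitude.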
292 00:14:08.360 --> 00:14:11.840 Okay. And because you are multiplying those deviations, a massive 293 00:14:11.879 --> 00:14:15.519 outlier doesn't just, like, slightly skew the result. It mathematically 294 00:14:15.559 --> 00:14:16.919 dominates the entire calculation. 295 00:14:17.039 --> 00:14:20.600 It completely takes over. Let's return to the VP's hypothesis, 296 00:14:21.000 --> 00:14:24.879 more friends equals more time spent on DataSciencester. You 297 00:14:25.000 --> 00:14:28.240 run the correlation and the coefficient comes back incredibly weak. 298 00:14:28.720 --> 00:14:31.799 The data basically suggests there is no relationship, right. 299 00:14:31.679 --> 00:14:34.480 But then you actually plot the data on a scatterplot 300 00:14:35.039 --> 00:14:38.120 and you see this massive dense cluster showing a very 301 00:14:38.159 --> 00:14:41.159 clear positive trend, and then way off in the corner, 302 00:14:41.559 --> 00:14:44.480 one single data point sitting completely isolated on the far 303 00:14:44.559 --> 00:14:45.399 edges of the plot. 304 00:14:45.600 --> 00:14:49.480 And upon investigation, that single point is an internal test 305 00:14:49.519 --> 00:14:50.559 account. Exactly. 306 00:14:50.679 --> 00:14:52.559 A developer just gave it one hundred friends, but it 307 00:14:52.600 --> 00:14:54.840 only logs one minute of activity a day. 308 00:14:54.919 --> 00:14:58.240 So that single test account has such an extreme deviation 309 00:14:58.360 --> 00:15:01.440 from the mean on both axes that when those deviations 310 00:15:01.480 --> 00:15:05.080 are multiplied together in the covariance formula, it just violently 311 00:15:05.200 --> 00:15:08.279 yanks the line of best fit away from the actual 312 00:15:08.440 --> 00:15:09.120 user cluster. 313 00:15:09.320 --> 00:15:12.399 Yes, and the moment you drop that single test account 314 00:15:12.440 --> 00:15:16.559 from the matrix, the correlation coefficient jumps up.
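The effect of that one test account can be reproduced with a from-scratch Pearson correlation. This is a minimal sketch: the ten "real" users below are made-up illustrative values with a clear upward trend, and only the outlier (one hundred friends, one minute a day) comes from the scenario described. The helper names are hypothetical.

```python
import math
from typing import List


def covariance(xs: List[float], ys: List[float]) -> float:
    """Sample covariance: average product of deviations from each mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Assumed cluster of ordinary users: more friends, more minutes.
friends = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
minutes = [8, 11, 13, 18, 20, 22, 27, 29, 31, 36]

r_without = correlation(friends, minutes)

# Add the test account from the scenario: 100 friends, 1 minute a day.
r_with = correlation(friends + [100], minutes + [1])

print(f"without outlier: {r_without:.2f}")
print(f"with outlier:    {r_with:.2f}")
```

With these assumed values the single extra point is enough to turn a strongly positive coefficient negative, because its product of deviations outweighs the contributions of all ten ordinary users combined.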
The underlying 315 00:15:16.600 --> 00:15:19.679 truth was there all along. It was just masked by 316 00:15:19.720 --> 00:15:22.480 the mathematical weight of a single anomaly. 317 00:15:22.039 --> 00:15:25.279 Which brings us to the absolute most insidious statistical trap 318 00:15:25.320 --> 00:15:27.840 in data science: Simpson's paradox. 319 00:15:27.919 --> 00:15:28.720 Oh, my favorite. 320 00:15:28.759 --> 00:15:32.080 Outliers are easy to spot if you just visualize the distribution, right, 321 00:15:32.559 --> 00:15:36.000 but Simpson's paradox hides entirely within the aggregate structure of 322 00:15:36.039 --> 00:15:36.840 the data itself. 323 00:15:37.000 --> 00:15:39.919 It does. It occurs when a clear trend appears in 324 00:15:40.000 --> 00:15:43.840 multiple distinct groups of data, but then completely disappears or 325 00:15:43.879 --> 00:15:45.919 even reverses when those groups are combined. 326 00:15:46.159 --> 00:15:49.840 So let's apply this to DataSciencester. Say you 327 00:15:49.879 --> 00:15:53.440 are analyzing regional engagement to see which coast is friendlier. 328 00:15:54.000 --> 00:15:55.960 You calculate the overall average connections. 329 00:15:56.080 --> 00:15:57.240 Okay, let's look at the numbers. 330 00:15:57.399 --> 00:15:59.960 The West Coast user base averages eight point two friends 331 00:16:00.159 --> 00:16:03.320 per user. The East Coast user base averages six point 332 00:16:03.360 --> 00:16:07.279 five friends. So the aggregate data heavily favors the West Coast. 333 00:16:07.399 --> 00:16:08.600 Really, but then 334 00:16:08.639 --> 00:16:12.320 you introduce a confounding variable. You stratify the data based 335 00:16:12.360 --> 00:16:16.600 on educational background: users with a PhD and users without 336 00:16:16.600 --> 00:16:17.159 a PhD.
337 00:16:17.279 --> 00:16:20.759 So you isolate the PhD subgroup and suddenly the East 338 00:16:20.799 --> 00:16:24.080 Coast data scientists average significantly more friends than the West 339 00:16:24.120 --> 00:16:24.919 Coast PhDs. 340 00:16:25.000 --> 00:16:27.879 Okay, so the East Coast wins the PhD demographic, right. 341 00:16:28.200 --> 00:16:31.320 Then you isolate the non PhD subgroup, and once again, 342 00:16:31.399 --> 00:16:34.039 the East Coast data scientists average more friends than the 343 00:16:34.039 --> 00:16:35.559 West Coast non PhDs. 344 00:16:35.759 --> 00:16:36.639 Wait, stop right there. 345 00:16:36.679 --> 00:16:37.120 What's wrong? 346 00:16:37.240 --> 00:16:39.879 How is it mathematically possible for the East Coast to 347 00:16:39.960 --> 00:16:44.039 win in both individual subcategories but lose in the total overall? 348 00:16:44.639 --> 00:16:47.639 I mean, that feels like it violates basic arithmetic. 349 00:16:47.840 --> 00:16:49.759 It really does feel like magic. But it comes down 350 00:16:49.799 --> 00:16:52.600 to unequal weighting in the denominators of those subsets. 351 00:16:52.840 --> 00:16:53.960 Okay, break that down for me. 352 00:16:54.200 --> 00:16:57.799 The paradox is driven by the distribution of the confounding variable, 353 00:16:57.799 --> 00:17:01.799 in this case, the PhD. So look at the underlying topology 354 00:17:01.840 --> 00:17:06.759 of the users across the entire platform. Users with PhDs 355 00:17:06.880 --> 00:17:10.839 simply have fewer connections. They average around three friends. 356 00:17:10.440 --> 00:17:12.160 Which, they're busy doing research, right? 357 00:17:12.599 --> 00:17:16.440 While users without PhDs are highly active, averaging around ten 358 00:17:16.480 --> 00:17:17.720 to thirteen friends. 359 00:17:18.119 --> 00:17:22.759 So a PhD basically acts as a massive downward weight 360 00:17:22.920 --> 00:17:23.920 on a group's average.
361 00:17:23.960 --> 00:17:27.720 Precisely. Now, look at the regional distribution. The East Coast 362 00:17:27.799 --> 00:17:31.079 user base is heavily saturated with PhDs. 363 00:17:30.640 --> 00:17:32.680 Because of all the universities and research hubs. 364 00:17:32.799 --> 00:17:37.039 Exactly, they have a massive concentration of these low connection users, 365 00:17:37.079 --> 00:17:41.480 pulling their overall denominator down. The West Coast user base, however, 366 00:17:41.599 --> 00:17:44.039 is overwhelmingly composed of non 367 00:17:43.880 --> 00:17:46.119 PhDs, part of culture, right. Right, so 368 00:17:46.079 --> 00:17:49.079 when you aggregate the data, the sheer volume of highly 369 00:17:49.119 --> 00:17:52.799 connected non PhDs on the West Coast mathematically drowns out 370 00:17:52.839 --> 00:17:55.680 the East Coast's higher performance within the individual tiers. 371 00:17:55.759 --> 00:18:00.000 Wow, so the regional bucketing completely masks the educational weighting. Completely. 372 00:18:00.480 --> 00:18:02.960 If you hadn't joined the network table with the educational 373 00:18:03.000 --> 00:18:05.960 background table, you would have delivered a presentation to the 374 00:18:06.000 --> 00:18:10.400 board concluding that West Coast users are inherently more sociable. 375 00:18:10.200 --> 00:18:12.759 And you would have optimized millions of dollars in marketing 376 00:18:12.799 --> 00:18:17.759 campaigns around that assumption, fully backed by mathematically flawless yet 377 00:18:17.799 --> 00:18:20.559 factually entirely backwards data. 378 00:18:20.680 --> 00:18:23.039 And this, right here, is the core lesson of doing 379 00:18:23.119 --> 00:18:26.599 data science from scratch. It forces you to recognize that 380 00:18:26.640 --> 00:18:30.279 statistical tools are not objective arbiters of truth. 381 00:18:30.640 --> 00:18:32.160 No, they are mathematical lenses.
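The reversal described in this exchange is just weighted-average arithmetic, and it can be reproduced directly. In the sketch below, the group sizes and per-group averages are illustrative assumptions chosen so that the regional totals land on the 8.2 and 6.5 figures quoted in the conversation; the conversation itself only gives rough subgroup averages. The helper name `overall_average` is hypothetical.

```python
from typing import Dict, Tuple

# (member count, average friends) per subgroup. Assumed illustrative values.
Groups = Dict[str, Tuple[int, float]]

west: Groups = {"phd": (35, 3.1), "no_phd": (66, 10.9)}
east: Groups = {"phd": (70, 3.2), "no_phd": (33, 13.4)}


def overall_average(groups: Groups) -> float:
    """Weighted mean of subgroup averages, weighted by subgroup size."""
    total_friends = sum(count * avg for count, avg in groups.values())
    total_members = sum(count for count, _ in groups.values())
    return total_friends / total_members


# Within each education tier, the East Coast average is higher.
for tier in ("phd", "no_phd"):
    print(tier, "east:", east[tier][1], "west:", west[tier][1])

# Combined, the ordering flips because the mix of tiers differs by coast.
print("west overall:", round(overall_average(west), 1))
print("east overall:", round(overall_average(east), 1))
```

Under these assumed counts about two thirds of East Coast users sit in the low-connection tier versus about one third on the West Coast, so the East Coast total is dragged toward the lower subgroup average even though it leads in both tiers.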
382 00:18:32.400 --> 00:18:36.119 Exactly. When we calculate a correlation or an aggregate mean, 383 00:18:36.559 --> 00:18:40.920 the foundational, unspoken assumption is always ceteris paribus, all else 384 00:18:40.960 --> 00:18:45.480 being equal. We assume the underlying distributions are uniform. Simpson's 385 00:18:45.559 --> 00:18:48.160 paradox proves how lethal that assumption can be to a 386 00:18:48.200 --> 00:18:48.880 business model. 387 00:18:49.359 --> 00:18:52.519 The infrastructure of data science really requires rigor at every 388 00:18:52.599 --> 00:18:55.079 single layer of the stack. I mean, it demands type 389 00:18:55.119 --> 00:18:58.680 safe ingestion to prevent silent pipeline corruption. It demands a 390 00:18:58.720 --> 00:19:02.480 physiological understanding of how end users process the geometry of 391 00:19:02.480 --> 00:19:07.640 a visualization. It requires structuring memory into matrices to unlock 392 00:19:07.680 --> 00:19:13.440 computational scale. And it requires a deep, almost paranoid skepticism 393 00:19:13.519 --> 00:19:14.720 of aggregated metrics. 394 00:19:14.759 --> 00:19:16.279 It does, and I want to leave you with a 395 00:19:16.319 --> 00:19:19.680 final thought to apply outside the boundaries of DataSciencester. 396 00:19:19.920 --> 00:19:20.480 Let's hear it. 397 00:19:20.920 --> 00:19:25.839 Every single day you are bombarded with viral statistics, algorithmic recommendations, 398 00:19:26.119 --> 00:19:29.839 and definitive correlations in the news. Every single one of 399 00:19:29.839 --> 00:19:33.680 those metrics was aggregated by someone making an assumption about uniformity. 400 00:19:34.359 --> 00:19:37.240 Knowing what you know now about Simpson's paradox, ask yourself, 401 00:19:37.799 --> 00:19:41.759