WEBVTT

1
00:00:00.080 --> 00:00:02.960
<v Speaker 1>What if I told you that adding your most highly educated,

2
00:00:04.080 --> 00:00:07.799
<v Speaker 1>highly active users to a data set could mathematically make

3
00:00:07.839 --> 00:00:11.560
<v Speaker 1>your entire user base look, you know, substantially less sociable?

4
00:00:11.640 --> 00:00:14.439
<v Speaker 2>Yeah, I mean it completely sounds like a broken algorithm, right,

5
00:00:14.759 --> 00:00:18.559
<v Speaker 2>but it's actually this fundamental statistical trap that catches data

6
00:00:18.559 --> 00:00:21.359
<v Speaker 2>teams off guard literally every single day.

7
00:00:21.120 --> 00:00:24.920
<v Speaker 1>Right, and today we are dismantling those traps. We're

8
00:00:24.920 --> 00:00:27.679
<v Speaker 1>doing a deep dive into the core concepts of data science,

9
00:00:28.160 --> 00:00:32.079
<v Speaker 1>pulling directly from Joel Grus's Data Science from Scratch, and

10
00:00:32.119 --> 00:00:35.799
<v Speaker 1>the mission here is straightforward. We are stepping entirely away

11
00:00:35.840 --> 00:00:38.320
<v Speaker 1>from those prepackaged software libraries.

12
00:00:37.960 --> 00:00:41.759
<v Speaker 2>Exactly because relying purely on high level abstractions, you know,

13
00:00:41.799 --> 00:00:45.119
<v Speaker 2>just typing import pandas and calling a dot mean function,

14
00:00:45.759 --> 00:00:47.880
<v Speaker 2>it creates this really dangerous blind spot.

15
00:00:47.960 --> 00:00:49.240
<v Speaker 1>You're just trusting a black box.

16
00:00:49.399 --> 00:00:52.119
<v Speaker 2>Right. When you treat your analytical tools as black boxes,

17
00:00:52.159 --> 00:00:55.000
<v Speaker 2>you lose the ability to actually interrogate the underlying assumptions.

18
00:00:55.039 --> 00:00:57.119
<v Speaker 2>I mean, you end up optimizing for algorithms you don't

19
00:00:57.119 --> 00:00:58.399
<v Speaker 2>fully understand.

20
00:00:57.960 --> 00:01:03.159
<v Speaker 1>Which inevitably leads to very confident, mathematically sound, and completely

21
00:01:03.200 --> 00:01:04.400
<v Speaker 1>incorrect conclusions.

22
00:01:04.640 --> 00:01:08.040
<v Speaker 2>Yes, the worst kind of incorrect conclusions.

23
00:01:08.439 --> 00:01:11.159
<v Speaker 1>So to ground this deep dive for you, the listener,

24
00:01:11.599 --> 00:01:15.719
<v Speaker 1>we're placing you in a very specific hypothetical scenario today.

25
00:01:16.000 --> 00:01:18.359
<v Speaker 1>You have just been brought in as the founding data

26
00:01:18.400 --> 00:01:22.280
<v Speaker 1>scientist at a startup called DataSciencester, which is, you know,

27
00:01:22.359 --> 00:01:25.400
<v Speaker 1>a social network tailored entirely for data professionals.

28
00:01:25.480 --> 00:01:27.200
<v Speaker 2>Sounds like a very niche market.

29
00:01:27.159 --> 00:01:30.840
<v Speaker 1>Very niche. But the point is there is no legacy infrastructure.

30
00:01:31.159 --> 00:01:35.280
<v Speaker 1>You are tasked with building the analytical pipeline completely from scratch.

31
00:01:35.040 --> 00:01:38.000
<v Speaker 2>Which means before you can analyze a single user's behavior,

32
00:01:38.120 --> 00:01:41.000
<v Speaker 2>you have to actually choose your architecture. And the source

33
00:01:41.040 --> 00:01:43.920
<v Speaker 2>material strongly advocates for Python.

34
00:01:43.840 --> 00:01:47.200
<v Speaker 1>But not just because it's popular, right, The rationale goes

35
00:01:47.280 --> 00:01:50.519
<v Speaker 1>way beyond its ecosystem of data tools.

36
00:01:50.599 --> 00:01:54.120
<v Speaker 2>Oh, absolutely. The argument is rooted entirely in Python's core

37
00:01:54.159 --> 00:01:56.719
<v Speaker 2>design philosophy. It goes back to the Zen of Python,

38
00:01:56.799 --> 00:02:00.760
<v Speaker 2>which basically dictates that, you know, explicit is better than implicit.

39
00:02:00.920 --> 00:02:03.239
<v Speaker 1>Right, And we see this manifested most clearly in how

40
00:02:03.319 --> 00:02:06.599
<v Speaker 1>Python enforces structural readability through whitespace.

41
00:02:07.079 --> 00:02:09.159
<v Speaker 2>Yes, the whitespace rule. I mean, if you look

42
00:02:09.159 --> 00:02:11.680
<v Speaker 2>at C++ or Java, the scope of a

43
00:02:11.719 --> 00:02:14.439
<v Speaker 2>function is defined by curly braces.

44
00:02:14.319 --> 00:02:17.599
<v Speaker 1>Whereas the compiler basically just ignores the layout. Right, like, it

45
00:02:17.639 --> 00:02:19.879
<v Speaker 1>ignores the indentation completely.

46
00:02:20.080 --> 00:02:24.599
<v Speaker 2>You can have this incredibly complex, deeply nested logic crammed

47
00:02:24.639 --> 00:02:27.439
<v Speaker 2>onto a single line of text and the machine will

48
00:02:27.439 --> 00:02:30.319
<v Speaker 2>parse it perfectly, even if it is completely illegible to

49
00:02:30.360 --> 00:02:33.120
<v Speaker 2>the next engineer who has to inherit your code base.

50
00:02:33.240 --> 00:02:33.400
<v Speaker 1>Right.

51
00:02:33.680 --> 00:02:37.560
<v Speaker 2>But Python removes that option entirely. The visual structure of

52
00:02:37.560 --> 00:02:41.400
<v Speaker 2>the code must match the logical structure, or the interpreter

53
00:02:41.439 --> 00:02:43.919
<v Speaker 2>will just throw an indentation error and refuse to run.

54
00:02:44.039 --> 00:02:46.599
<v Speaker 1>It's honestly like a Marie Kondo approach to coding.

55
00:02:46.759 --> 00:02:48.039
<v Speaker 2>A Marie Kondo approach?

56
00:02:48.120 --> 00:02:50.159
<v Speaker 1>Yeah, like with a curly brace language, you can take

57
00:02:50.199 --> 00:02:53.439
<v Speaker 1>this absolute disaster of a messy room, shove all the

58
00:02:53.520 --> 00:02:56.719
<v Speaker 1>tangled logic into a closet, slam the compiler door shut,

59
00:02:56.800 --> 00:02:57.639
<v Speaker 1>and it just runs.

60
00:02:57.759 --> 00:02:59.280
<v Speaker 2>That's, yeah, that's a great way to put it.

61
00:02:59.319 --> 00:03:02.719
<v Speaker 1>But Python. Python forces you to organize the closet so

62
00:03:02.719 --> 00:03:04.800
<v Speaker 1>the structure is visible the second you open the file.

63
00:03:05.199 --> 00:03:08.360
<v Speaker 1>Everyone can see exactly what sparks joy or what causes

64
00:03:08.400 --> 00:03:09.319
<v Speaker 1>a fatal crash.

65
00:03:09.520 --> 00:03:13.199
<v Speaker 2>That's exactly it. And this emphasis on explicit structure has

66
00:03:13.240 --> 00:03:16.319
<v Speaker 2>really only become more critical with the shift to Python 3,

67
00:03:16.719 --> 00:03:19.479
<v Speaker 2>especially when we start talking about the adoption of type

68
00:03:19.520 --> 00:03:22.240
<v Speaker 2>annotations in these big data pipelines.

69
00:03:22.439 --> 00:03:25.879
<v Speaker 1>Right, because Python natively is dynamically typed.

70
00:03:26.120 --> 00:03:29.159
<v Speaker 2>Yes, meaning a variable can hold an integer and then literally

71
00:03:29.199 --> 00:03:31.280
<v Speaker 2>in the next line of code it can be reassigned

72
00:03:31.280 --> 00:03:32.560
<v Speaker 2>to a string.

73
00:03:32.319 --> 00:03:33.960
<v Speaker 1>Which is great if you're just writing a quick script.

74
00:03:34.000 --> 00:03:37.080
<v Speaker 2>Sure, in a localized script, that flexibility speeds up development,

75
00:03:37.560 --> 00:03:40.639
<v Speaker 2>But in a massive data ingestion pipeline, it is a

76
00:03:40.680 --> 00:03:42.240
<v Speaker 2>massive liability.

77
00:03:42.159 --> 00:03:44.479
<v Speaker 1>Because you don't know what data is actually flowing through the pipe.

78
00:03:44.080 --> 00:03:47.599
<v Speaker 2>Exactly. Let's say your pipeline is pulling

79
00:03:47.719 --> 00:03:52.360
<v Speaker 2>user engagement metrics for DataSciencester, and some localized

80
00:03:52.400 --> 00:03:55.400
<v Speaker 2>anomaly introduces a string, maybe it's like a text-based

81
00:03:55.479 --> 00:03:58.800
<v Speaker 2>NaN or just a null character, into a feature set

82
00:03:58.840 --> 00:04:01.520
<v Speaker 2>that is mathematically expecting floating-point numbers.

83
00:04:01.639 --> 00:04:02.520
<v Speaker 1>Oh right.

84
00:04:02.960 --> 00:04:05.800
<v Speaker 2>In a purely dynamic setup, the pipeline might not even

85
00:04:05.879 --> 00:04:09.520
<v Speaker 2>crash immediately. It might just perform a silent type coercion

86
00:04:09.680 --> 00:04:12.800
<v Speaker 2>or propagate that null value all the way through your

87
00:04:12.800 --> 00:04:14.919
<v Speaker 2>downstream transformations.

88
00:04:14.599 --> 00:04:17.839
<v Speaker 1>Which ultimately just corrupts the training data for whatever

89
00:04:17.839 --> 00:04:19.560
<v Speaker 1>machine learning model you're building.

90
00:04:19.319 --> 00:04:22.240
<v Speaker 2>Exactly, and you wouldn't even realize the error until the

91
00:04:22.279 --> 00:04:25.639
<v Speaker 2>model's accuracy mysteriously degraded in production weeks later.

92
00:04:25.920 --> 00:04:29.160
<v Speaker 1>Wow. So by using type annotations, you're basically establishing a

93
00:04:29.199 --> 00:04:31.160
<v Speaker 1>strict contract for your functions.

94
00:04:31.199 --> 00:04:36.480
<v Speaker 2>Precisely. You declare explicitly up front that a specific ingestion module

95
00:04:36.639 --> 00:04:39.959
<v Speaker 2>expects an integer and returns a float, period.

96
00:04:39.920 --> 00:04:43.079
<v Speaker 1>And then static type checkers can actually analyze the codebase before

97
00:04:43.079 --> 00:04:45.800
<v Speaker 1>it even runs, flagging any potential violations.

98
00:04:45.879 --> 00:04:49.079
<v Speaker 2>It completely shifts the paradigm. You go from basically crossing

99
00:04:49.079 --> 00:04:52.399
<v Speaker 2>your fingers and hoping the data conforms to your expectations to

100
00:04:52.639 --> 00:04:56.959
<v Speaker 2>architecting a system that mathematically guarantees the data types before

101
00:04:57.000 --> 00:04:58.399
<v Speaker 2>the processing even begins.

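NOTE
Editor's sketch, not taken from the episode or the book: one way the "contract" idea in the preceding cues might look in Python. The names mean_minutes and validate_minutes are hypothetical. Annotations are not enforced while the program runs; a static checker such as mypy reads them beforehand, so this sketch also assumes an explicit runtime guard for data arriving from outside.
```python
from typing import List
# Hypothetical ingestion helper; the name and signature are illustrative only.
def mean_minutes(daily_minutes: List[float]) -> float:
    """Average daily minutes; the annotation declares floats in, float out."""
    return sum(daily_minutes) / len(daily_minutes)
# Annotations do nothing at runtime, so this guard rejects a stray string
# such as "NaN" before it can reach the arithmetic above.
def validate_minutes(raw: list) -> List[float]:
    """Return the values as floats, or raise TypeError on a non-number."""
    for value in raw:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"expected a number, got {value!r}")
    return [float(value) for value in raw]
clean = validate_minutes([12.5, 30, 7.25])
average = mean_minutes(clean)
```
Under these assumptions, validate_minutes([1.0, "NaN"]) raises TypeError at the boundary instead of letting the string travel downstream.
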
102
00:04:58.600 --> 00:05:02.040
<v Speaker 1>So you're building transparency and resilience right into the foundation

103
00:05:02.160 --> 00:05:03.160
<v Speaker 1>of DataSciencester.

104
00:05:03.319 --> 00:05:06.079
<v Speaker 2>You have to, because once you have a type safe

105
00:05:06.120 --> 00:05:09.879
<v Speaker 2>pipeline aggregating all this clean user data, the immediate next

106
00:05:09.920 --> 00:05:13.639
<v Speaker 2>step in the analytical life cycle is exploratory data analysis.

107
00:05:13.759 --> 00:05:16.399
<v Speaker 1>Right. You want to visualize the distributions to spot the

108
00:05:16.439 --> 00:05:17.720
<v Speaker 1>broader trends.

109
00:05:17.560 --> 00:05:21.920
<v Speaker 2>Which introduces a completely different class of vulnerability into your workflow.

110
00:05:22.079 --> 00:05:25.879
<v Speaker 1>Ah. Yes, because you've moved from the strict, unforgiving logic

111
00:05:25.920 --> 00:05:30.199
<v Speaker 1>of the compiler to the highly subjective translation of data

112
00:05:30.240 --> 00:05:31.040
<v Speaker 1>into pixels.

113
00:05:31.160 --> 00:05:34.839
<v Speaker 2>Yes, human visual processing hardware has all these built in heuristics,

114
00:05:34.879 --> 00:05:37.519
<v Speaker 2>and those heuristics are remarkably easy to exploit.

115
00:05:37.759 --> 00:05:41.560
<v Speaker 1>The source material actually highlights this using matplotlib, specifically

116
00:05:41.560 --> 00:05:44.399
<v Speaker 1>looking at how you can manipulate the axis in bar charts.

117
00:05:44.519 --> 00:05:45.720
<v Speaker 2>It's such a classic trap.

118
00:05:45.920 --> 00:05:48.839
<v Speaker 1>Right. Let's say you were presenting platform growth to the

119
00:05:48.920 --> 00:05:52.040
<v Speaker 1>DataSciencester board of directors, and in twenty seventeen

120
00:05:52.079 --> 00:05:55.560
<v Speaker 1>the platform was mentioned five hundred times. Then in twenty

121
00:05:55.600 --> 00:05:58.680
<v Speaker 1>eighteen it was mentioned five hundred and five times.

122
00:05:58.519 --> 00:06:01.439
<v Speaker 2>Which is, let's be honest, a fractional increase. It's barely a

123
00:06:01.439 --> 00:06:03.279
<v Speaker 2>blip in the actual volume.

124
00:06:03.000 --> 00:06:04.920
<v Speaker 1>Right, But if you construct a bar chart for the

125
00:06:04.920 --> 00:06:07.199
<v Speaker 1>board and you set the axis to start at four

126
00:06:07.360 --> 00:06:10.279
<v Speaker 1>ninety nine and end at five oh six, you dramatically

127
00:06:10.360 --> 00:06:12.720
<v Speaker 1>alter the visual narrative. You really do, because the bar

128
00:06:12.800 --> 00:06:14.879
<v Speaker 1>for twenty seventeen sits at a value of one unit

129
00:06:14.879 --> 00:06:18.040
<v Speaker 1>above the baseline, but the bar for twenty eighteen rises

130
00:06:18.079 --> 00:06:19.839
<v Speaker 1>to six units above the baseline. Yep.

131
00:06:20.639 --> 00:06:23.720
<v Speaker 2>Visually, that twenty eighteen bar is taking up six times

132
00:06:23.759 --> 00:06:25.920
<v Speaker 2>the physical space on the screen.

133
00:06:26.000 --> 00:06:29.680
<v Speaker 1>Exactly. It looks like this towering sixfold increase,

134
00:06:30.319 --> 00:06:32.560
<v Speaker 1>even though the underlying data barely even moved.

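NOTE
Editor's sketch, assuming only the numbers quoted in the episode (500 and 505 mentions, an axis starting at 499): the arithmetic behind the distortion, without drawing a chart. The helper name drawn_height_ratio is hypothetical.
```python
def drawn_height_ratio(earlier: float, later: float, baseline: float) -> float:
    """Ratio of the two bars' drawn heights when the axis starts at baseline."""
    return (later - baseline) / (earlier - baseline)
mentions_2017, mentions_2018 = 500, 505
# Axis truncated at 499: the bars are 1 and 6 units tall.
truncated = drawn_height_ratio(mentions_2017, mentions_2018, baseline=499)
# Axis starting at zero: the bars are 500 and 505 units tall.
honest = drawn_height_ratio(mentions_2017, mentions_2018, baseline=0)
# The change in the underlying data, as a percentage.
actual_growth_pct = (mentions_2018 - mentions_2017) / mentions_2017 * 100
```
The truncated axis draws the later bar six times as tall, while the zero-based axis draws it about one percent taller, which matches the real change.
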
135
00:06:32.720 --> 00:06:36.879
<v Speaker 2>And this is exactly where understanding cognitive psychology intersects with

136
00:06:37.000 --> 00:06:42.000
<v Speaker 2>data science. Our visual cortex processes different geometric shapes using

137
00:06:42.079 --> 00:06:44.399
<v Speaker 2>completely different underlying rules.

138
00:06:45.160 --> 00:06:47.240
<v Speaker 1>Wait, let me push back on this rule for a second.

139
00:06:47.319 --> 00:06:51.079
<v Speaker 1>Isn't zooming in on the axis just, I don't

140
00:06:51.079 --> 00:06:54.800
<v Speaker 1>know, a helpful way to highlight the relevant detail? If

141
00:06:54.839 --> 00:06:57.480
<v Speaker 1>I'm tracking a metric from five hundred to five oh five,

142
00:06:58.000 --> 00:06:59.920
<v Speaker 1>why wouldn't I want to zoom in to show that

143
00:07:00.040 --> 00:07:01.000
<v Speaker 1>specific variance?

144
00:07:01.319 --> 00:07:03.160
<v Speaker 2>Well, it depends on the chart you're using. If we

145
00:07:03.199 --> 00:07:05.519
<v Speaker 2>look at a line chart, we are evaluating the angle

146
00:07:05.560 --> 00:07:08.519
<v Speaker 2>of the slope. The cognitive focus is on the trajectory

147
00:07:08.519 --> 00:07:11.120
<v Speaker 2>and the rate of change over time. Okay, So zooming

148
00:07:11.160 --> 00:07:14.160
<v Speaker 2>in on a line chart to expose localized volatility, say

149
00:07:14.279 --> 00:07:17.560
<v Speaker 2>tracking minute-by-minute stock fluctuations between one hundred and one

150
00:07:17.600 --> 00:07:21.040
<v Speaker 2>hundred and five dollars, is completely analytically valid. The slope

151
00:07:21.079 --> 00:07:24.399
<v Speaker 2>still remains a true representation of the localized variance.

152
00:07:24.199 --> 00:07:25.759
<v Speaker 1>Right because I'm just looking at the angle of the

153
00:07:25.800 --> 00:07:26.399
<v Speaker 1>line going up and down.

154
00:07:26.319 --> 00:07:29.399
<v Speaker 2>Exactly. But the visual processing mechanism for

155
00:07:29.439 --> 00:07:32.560
<v Speaker 2>a bar chart is fundamentally different. With a bar chart,

156
00:07:32.639 --> 00:07:35.879
<v Speaker 2>the human brain instinctively equates the value of the data

157
00:07:35.879 --> 00:07:38.839
<v Speaker 2>point with the total two-dimensional area of the bar.

158
00:07:38.560 --> 00:07:41.319
<v Speaker 1>Like the actual amount of ink printed

159
00:07:41.360 --> 00:07:41.879
<v Speaker 1>on the page.

160
00:07:41.959 --> 00:07:45.519
<v Speaker 2>Yes, the literal amount of ink. So by truncating the

161
00:07:45.560 --> 00:07:48.680
<v Speaker 2>axis and starting at four ninety nine, you are divorcing

162
00:07:48.680 --> 00:07:51.040
<v Speaker 2>the area of the bar from its mathematical value.

163
00:07:51.160 --> 00:07:52.600
<v Speaker 1>Oh wow, I see what you mean.

164
00:07:52.720 --> 00:07:55.920
<v Speaker 2>Yeah, you are asking the viewer's brain to process a

165
00:07:55.920 --> 00:07:59.399
<v Speaker 2>physical shape that is six times larger, while expecting them

166
00:07:59.439 --> 00:08:02.720
<v Speaker 2>to override their own visual instincts by reading the tiny

167
00:08:02.800 --> 00:08:04.439
<v Speaker 2>numbers printed on the axis.

168
00:08:04.639 --> 00:08:06.480
<v Speaker 1>It's a total cognitive mismatch.

169
00:08:06.680 --> 00:08:09.519
<v Speaker 2>Exactly. A non-zero axis on a bar chart basically

170
00:08:09.560 --> 00:08:12.759
<v Speaker 2>mathematically lies to the viewer's visual cortex.

171
00:08:12.720 --> 00:08:14.839
<v Speaker 1>And the book points out that the same kind of distortion

172
00:08:14.959 --> 00:08:16.959
<v Speaker 1>applies to variance in scatter plots too.

173
00:08:17.040 --> 00:08:17.720
<v Speaker 2>Oh absolutely.

174
00:08:17.800 --> 00:08:20.079
<v Speaker 1>Like if we map out user test scores with test

175
00:08:20.120 --> 00:08:22.399
<v Speaker 1>one on the x axis and test two on the y axis,

176
00:08:22.759 --> 00:08:26.079
<v Speaker 1>the scaling of those axes defines the perceived standard deviation.

177
00:08:26.439 --> 00:08:30.920
<v Speaker 1>If your plotting library automatically scales the x axis to

178
00:08:30.959 --> 00:08:33.559
<v Speaker 1>cover a twenty point spread, but then it stretches the

179
00:08:33.600 --> 00:08:36.000
<v Speaker 1>y axis to cover a forty point spread just to fill

180
00:08:36.080 --> 00:08:36.600
<v Speaker 1>up the screen.

181
00:08:36.759 --> 00:08:39.919
<v Speaker 2>Then the visual density of your clusters is completely compromised.

182
00:08:40.080 --> 00:08:42.960
<v Speaker 1>Right, the data along the y axis will appear to have

183
00:08:43.039 --> 00:08:47.200
<v Speaker 1>significantly higher variance simply because the pixels are stretched further apart.

184
00:08:47.480 --> 00:08:50.519
<v Speaker 1>You have to force comparable axes to maintain the integrity

185
00:08:50.519 --> 00:08:51.200
<v Speaker 1>of the distribution.

186
00:08:51.480 --> 00:08:54.320
<v Speaker 2>You do, but you know, visualization is really just a

187
00:08:54.360 --> 00:08:57.840
<v Speaker 2>tool for spotting aggregate trends, and to truly understand the

188
00:08:57.879 --> 00:09:01.360
<v Speaker 2>mechanics of a social platform like DataSciencester, aggregate

189
00:09:01.399 --> 00:09:02.440
<v Speaker 2>trends aren't enough.

190
00:09:02.840 --> 00:09:04.960
<v Speaker 1>No, the executive team wants to know who the key

191
00:09:05.039 --> 00:09:07.480
<v Speaker 1>influencers are. They want to find the nodes with the

192
00:09:07.519 --> 00:09:09.159
<v Speaker 1>highest degree centrality, right.

193
00:09:09.039 --> 00:09:12.480
<v Speaker 2>Which means we have to analyze the topology of the network itself.

194
00:09:12.200 --> 00:09:16.399
<v Speaker 1>And calculating degree centrality basically just means counting who

195
00:09:16.480 --> 00:09:20.320
<v Speaker 1>has the most friends. But doing that requires analyzing the edges,

196
00:09:20.440 --> 00:09:22.759
<v Speaker 1>the connections between the users.

197
00:09:22.799 --> 00:09:26.440
<v Speaker 2>And in a raw data format, this usually exists as an edge list.

198
00:09:26.519 --> 00:09:28.720
<v Speaker 2>You know, user zero is friends with user one, user

199
00:09:28.799 --> 00:09:30.799
<v Speaker 2>zero is friends with user two, user one is friends with

200
00:09:30.879 --> 00:09:32.480
<v Speaker 2>user three, and so on.

201
00:09:32.320 --> 00:09:34.919
<v Speaker 1>Which is fine for a tiny data set. Iterating through a

202
00:09:34.960 --> 00:09:38.720
<v Speaker 1>short list to count a specific user's connections is trivial, sure,

203
00:09:39.039 --> 00:09:39.679
<v Speaker 1>but as

204
00:09:39.559 --> 00:09:44.200
<v Speaker 2>the platform scales to, say, millions of users, the computational

205
00:09:44.240 --> 00:09:48.039
<v Speaker 2>complexity of that search becomes a massive bottleneck. We are

206
00:09:48.080 --> 00:09:50.320
<v Speaker 2>talking about big O notation here.

207
00:09:50.000 --> 00:09:51.519
<v Speaker 1>Right, the dreaded big O.

208
00:09:51.840 --> 00:09:55.639
<v Speaker 2>Exactly. Searching an unstructured edge list requires an O of

209
00:09:55.840 --> 00:09:58.039
<v Speaker 2>n operation, where n is the number of edges.

210
00:09:57.759 --> 00:10:00.799
<v Speaker 1>Meaning, to find all connections for user one,

211
00:10:00.879 --> 00:10:05.120
<v Speaker 1>the algorithm literally has to traverse the entire list, evaluating

212
00:10:05.200 --> 00:10:07.639
<v Speaker 1>every single pair to see if User one is present.

213
00:10:07.720 --> 00:10:10.720
<v Speaker 2>It is computationally expensive, and it scales terribly as the

214
00:10:10.759 --> 00:10:15.000
<v Speaker 2>network grows, which is why data scientists transition into linear algebra.

215
00:10:15.360 --> 00:10:18.399
<v Speaker 2>They translate the network structure by representing the connections as

216
00:10:18.440 --> 00:10:19.639
<v Speaker 2>an adjacency matrix.

217
00:10:19.679 --> 00:10:22.120
<v Speaker 1>Okay, I like to use an analogy for this efficiency jump

218
00:10:22.159 --> 00:10:24.720
<v Speaker 1>here. The edge list is like an old-school Rolodex.

219
00:10:25.120 --> 00:10:27.080
<v Speaker 1>If you want to know who User one knows, you

220
00:10:27.159 --> 00:10:29.279
<v Speaker 1>have to flip through every single card in the entire

221
00:10:29.320 --> 00:10:32.919
<v Speaker 1>box to check. Very slow. But the matrix is

222
00:10:32.960 --> 00:10:37.200
<v Speaker 1>like a giant wall-sized pegboard. The rows represent every

223
00:10:37.320 --> 00:10:40.759
<v Speaker 1>user and the columns represent those exact same users. If

224
00:10:40.879 --> 00:10:43.240
<v Speaker 1>user A is friends with user B, you stick

225
00:10:43.279 --> 00:10:45.720
<v Speaker 1>a peg in the intersecting cell, basically a one. If

226
00:10:45.720 --> 00:10:48.440
<v Speaker 1>there's no connection, the cell is empty, a zero. So

227
00:10:48.480 --> 00:10:52.279
<v Speaker 1>you've taken this slow sequential list and transformed it into

228
00:10:52.320 --> 00:10:55.360
<v Speaker 1>a dense structural grid of binary states.

229
00:10:55.519 --> 00:10:59.399
<v Speaker 2>I love that pegboard analogy, and the performance implications of

230
00:10:59.399 --> 00:11:03.279
<v Speaker 2>that transformation are profound. When we restructure the data into

231
00:11:03.279 --> 00:11:06.840
<v Speaker 2>a matrix, we change the algorithmic complexity of finding a

232
00:11:06.960 --> 00:11:09.960
<v Speaker 2>user's connections from an O of n sequential search to

233
00:11:10.039 --> 00:11:11.759
<v Speaker 2>an O of one constant-time lookup.

234
00:11:11.799 --> 00:11:13.320
<v Speaker 1>O of one, so it's instantaneous.

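NOTE
Editor's sketch of the Rolodex versus pegboard comparison, using plain Python lists rather than any particular library. Only the three friendships read out in the episode are used; num_users and the function names are assumptions. One nuance worth hedging: in this sketch the constant-time step is checking a single pair, while listing all of one user's friends still reads a whole row.
```python
from typing import List, Tuple
# The three friendships read out in the episode; a real network has far more.
edges: List[Tuple[int, int]] = [(0, 1), (0, 2), (1, 3)]
num_users = 4
def friends_from_edge_list(user: int) -> List[int]:
    """Scan every edge: the work grows with the number of edges."""
    found = []
    for a, b in edges:
        if a == user:
            found.append(b)
        elif b == user:
            found.append(a)
    return found
# Build the "pegboard": matrix[i][j] is 1 when users i and j are friends.
matrix = [[0] * num_users for _ in range(num_users)]
for a, b in edges:
    matrix[a][b] = 1
    matrix[b][a] = 1
def are_friends(i: int, j: int) -> bool:
    """Single cell lookup: constant time."""
    return matrix[i][j] == 1
def degree(user: int) -> int:
    """Degree centrality as a row sum: the work grows with the number of users."""
    return sum(matrix[user])
```
The trade-off in this layout is memory: the grid needs one cell per pair of users, whether or not they are connected.
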
235
00:11:13.399 --> 00:11:15.919
<v Speaker 2>Exactly. If you need to know user five's connections, the

236
00:11:15.960 --> 00:11:18.840
<v Speaker 2>system doesn't search at all. It just jumps directly to

237
00:11:18.879 --> 00:11:21.200
<v Speaker 2>the memory address of row five and retrieves the row,

238
00:11:21.200 --> 00:11:24.919
<v Speaker 2>which also aligns perfectly with modern hardware architecture. Yes it does.

239
00:11:25.480 --> 00:11:28.519
<v Speaker 2>Traversing a linked list or an edge list often means

240
00:11:28.600 --> 00:11:32.679
<v Speaker 2>jumping around to different non-contiguous blocks of memory.

241
00:11:32.320 --> 00:11:34.679
<v Speaker 1>Which causes cache misses and slows down the processing.

242
00:11:34.519 --> 00:11:39.200
<v Speaker 2>Exactly. But a matrix stores these values in

243
00:11:39.360 --> 00:11:43.480
<v Speaker 2>contiguous memory blocks, and that contiguous memory layout allows you

244
00:11:43.559 --> 00:11:48.679
<v Speaker 2>to leverage SIMD operations: single instruction, multiple data.

245
00:11:48.799 --> 00:11:53.159
<v Speaker 1>Because modern CPUs and particularly GPUs are explicitly designed to

246
00:11:53.159 --> 00:11:56.840
<v Speaker 1>perform parallel math operations on contiguous arrays of numbers.

247
00:11:57.039 --> 00:11:59.679
<v Speaker 2>Right, So, by representing the social network as a matrix,

248
00:11:59.720 --> 00:12:02.440
<v Speaker 2>you can utilize parallel processing to calculate the

249
00:12:02.519 --> 00:12:03.799
<v Speaker 2>eigenvalues of the matrix.

250
00:12:03.600 --> 00:12:05.000
<v Speaker 1>Which gives you eigenvector centrality.

251
00:12:05.080 --> 00:12:08.480
<v Speaker 2>Yes, a far more sophisticated metric that doesn't just measure

252
00:12:08.480 --> 00:12:11.360
<v Speaker 2>how many friends a user has, but how influential those

253
00:12:11.399 --> 00:12:15.200
<v Speaker 2>friends actually are. The way you structure your data fundamentally

254
00:12:15.240 --> 00:12:17.679
<v Speaker 2>dictates the analytical power you can bring to bear.

255
00:12:17.960 --> 00:12:20.440
<v Speaker 1>Okay, so let's take stock. You have built a type

256
00:12:20.440 --> 00:12:23.679
<v Speaker 1>safe pipeline, you are forcing comparable axes on your charts

257
00:12:23.720 --> 00:12:26.960
<v Speaker 1>to avoid those visual distortions, and you are utilizing GPU

258
00:12:27.000 --> 00:12:31.120
<v Speaker 1>accelerated matrix operations to map network topology in constant time.

259
00:12:31.360 --> 00:12:33.000
<v Speaker 2>The architecture is rock solid.

260
00:12:33.240 --> 00:12:36.440
<v Speaker 1>It is. But then the VP of Growth knocks on

261
00:12:36.519 --> 00:12:39.679
<v Speaker 1>your door. They want you to build a statistical profile

262
00:12:39.960 --> 00:12:41.559
<v Speaker 1>of the typical user's behavior.

263
00:12:41.679 --> 00:12:42.440
<v Speaker 2>Of course they do.

264
00:12:42.559 --> 00:12:44.679
<v Speaker 1>They want to correlate the number of friends a user

265
00:12:44.759 --> 00:12:47.000
<v Speaker 1>has with the number of daily minutes they spend on

266
00:12:47.000 --> 00:12:51.480
<v Speaker 1>the platform, and this introduces us to the fragility of

267
00:12:51.679 --> 00:12:53.279
<v Speaker 1>standard statistical metrics.

268
00:12:53.320 --> 00:12:56.759
<v Speaker 2>Oh, absolutely. When you're summarizing distributions, the traditional mean, the

269
00:12:56.840 --> 00:12:58.720
<v Speaker 2>average is notoriously brittle.

270
00:12:59.039 --> 00:13:03.679
<v Speaker 1>Yeah. The book has this great classic example about university graduate salaries.

271
00:13:03.320 --> 00:13:08.000
<v Speaker 2>The UNC geography major. Yes, in the mid nineteen eighties,

272
00:13:08.039 --> 00:13:10.399
<v Speaker 2>the major at the University of North Carolina with the

273
00:13:10.519 --> 00:13:13.039
<v Speaker 2>highest mean starting salary was geography.

274
00:13:13.200 --> 00:13:16.960
<v Speaker 1>And it wasn't because the market suddenly deeply valued cartography.

275
00:13:17.080 --> 00:13:19.879
<v Speaker 2>No, it was solely because a single graduate named Michael

276
00:13:19.919 --> 00:13:21.120
<v Speaker 2>Jordan entered the NBA.

277
00:13:21.399 --> 00:13:24.039
<v Speaker 1>Right. Because the mean is calculated by summing all the

278
00:13:24.120 --> 00:13:26.919
<v Speaker 1>values and dividing by the count, it distributes the weight

279
00:13:26.960 --> 00:13:29.279
<v Speaker 1>of every value equally across the data set.

280
00:13:29.120 --> 00:13:33.000
<v Speaker 2>Which means a massive multi million dollar outlier pulls the

281
00:13:33.159 --> 00:13:36.960
<v Speaker 2>entire mathematical center of gravity toward itself. It completely obscures

282
00:13:37.000 --> 00:13:38.440
<v Speaker 2>the typical distribution of the data.

283
00:13:38.720 --> 00:13:42.559
<v Speaker 1>Whereas the median, by contrast, just relies on positional rank.

284
00:13:42.720 --> 00:13:46.759
<v Speaker 1>It isolates the middle value and renders those extreme tails irrelevant.

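NOTE
Editor's sketch of the mean versus median point, using Python's standard statistics module. The salary figures are invented for illustration and do not come from the book or from any real graduating class.
```python
from statistics import mean, median
# Made-up starting salaries in dollars; the last entry stands in for a single
# professional-athlete outlier.
salaries = [22_000, 24_000, 25_000, 26_000, 27_000, 1_000_000]
# The mean sums everything, so the outlier drags it far above a typical salary.
average_salary = mean(salaries)
# The median depends only on rank, so the outlier barely matters.
middle_salary = median(salaries)
```
With these assumed numbers the median stays in the mid twenty thousands while the mean lands well into six figures.
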
285
00:13:46.879 --> 00:13:50.960
<v Speaker 2>Exactly, and this vulnerability to outliers extends directly into

286
00:13:50.960 --> 00:13:52.440
<v Speaker 2>how we measure correlation too.

287
00:13:52.120 --> 00:13:56.360
<v Speaker 1>Right, like Pearson's correlation coefficient, which evaluates the linear

288
00:13:56.399 --> 00:13:58.200
<v Speaker 1>relationship between two variables.

289
00:13:58.320 --> 00:14:02.720
<v Speaker 2>Yes, but the underlying math for Pearson's relies on calculating covariance,

290
00:14:03.320 --> 00:14:07.080
<v Speaker 2>and covariance involves multiplying the deviations of each data point

291
00:14:07.120 --> 00:14:07.720
<v Speaker 2>from the mean.

292
00:14:08.360 --> 00:14:11.840
<v Speaker 1>Okay, And because you are multiplying those deviations, a massive

293
00:14:11.879 --> 00:14:15.519
<v Speaker 1>outlier doesn't just like slightly skew the result. It mathematically

294
00:14:15.559 --> 00:14:16.919
<v Speaker 1>dominates the entire calculation.

295
00:14:17.039 --> 00:14:20.600
<v Speaker 2>It completely takes over. Let's return to the VP's hypothesis,

296
00:14:21.000 --> 00:14:24.879
<v Speaker 2>more friends equals more time spent on DataSciencester. You

297
00:14:25.000 --> 00:14:28.240
<v Speaker 2>run the correlation and the coefficient comes back incredibly weak.

298
00:14:28.720 --> 00:14:31.799
<v Speaker 1>The data basically suggests there is no relationship, right.

299
00:14:31.679 --> 00:14:34.480
<v Speaker 2>But then you actually plot the data on a scatterplot

300
00:14:35.039 --> 00:14:38.120
<v Speaker 2>and you see this massive dense cluster showing a very

301
00:14:38.159 --> 00:14:41.159
<v Speaker 2>clear positive trend, and then way off in the corner,

302
00:14:41.559 --> 00:14:44.480
<v Speaker 2>one single data point sitting completely isolated on the far

303
00:14:44.559 --> 00:14:45.399
<v Speaker 2>edges of the plot.

304
00:14:45.600 --> 00:14:49.480
<v Speaker 1>And upon investigation, that single point is an internal test

305
00:14:49.519 --> 00:14:50.559
<v Speaker 1>account exactly.

306
00:14:50.679 --> 00:14:52.559
<v Speaker 2>A developer just gave it one hundred friends, but it

307
00:14:52.600 --> 00:14:54.840
<v Speaker 2>only logs one minute of activity a day.

308
00:14:54.919 --> 00:14:58.240
<v Speaker 1>So that single test account has such an extreme deviation

309
00:14:58.360 --> 00:15:01.440
<v Speaker 1>from the mean on both axes that when those deviations

310
00:15:01.480 --> 00:15:05.080
<v Speaker 1>are multiplied together in the covariance formula, it just violently

311
00:15:05.200 --> 00:15:08.279
<v Speaker 1>yanks the line of best fit away from the actual

312
00:15:08.440 --> 00:15:09.120
<v Speaker 1>user cluster.

313
00:15:09.320 --> 00:15:12.399
<v Speaker 2>Yes, and the moment you drop that single test account

314
00:15:12.440 --> 00:15:16.559
<v Speaker 2>from the matrix, the correlation coefficient jumps up. The underlying

315
00:15:16.600 --> 00:15:19.679
<v Speaker 2>truth was there all along. It was just masked by

316
00:15:19.720 --> 00:15:22.480
<v Speaker 2>the mathematical weight of a single anomaly.

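NOTE
Editor's sketch of how one extreme point can dominate Pearson's coefficient, written from scratch in the spirit of the episode. The ten "ordinary" users are invented data; only the test account (one hundred friends, one minute a day) comes from the discussion. The function name pearson is an assumption.
```python
from math import sqrt
from typing import List
def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson's r: sum of deviation products over the product of spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    covariance_sum = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    spread = sqrt(sum(dx * dx for dx in dev_x) * sum(dy * dy for dy in dev_y))
    return covariance_sum / spread
# Invented data: ten ordinary users with a clear upward trend.
friends = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
minutes = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
r_clean = pearson(friends, minutes)
# Add the test account from the episode: 100 friends, 1 minute a day.
r_with_outlier = pearson(friends + [100], minutes + [1])
```
In this toy data the single test account is enough to pull the coefficient from strongly positive to below zero; real data would usually show a less dramatic shift.
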
317
00:15:22.039 --> 00:15:25.279
<v Speaker 1>Which brings us to the absolute most insidious statistical trap

318
00:15:25.320 --> 00:15:27.840
<v Speaker 1>in data science. Simpson's paradox.

319
00:15:27.919 --> 00:15:28.720
<v Speaker 2>Oh, my favorite.

320
00:15:28.759 --> 00:15:32.080
<v Speaker 1>Outliers are easy to spot if you just visualize the distribution, right,

321
00:15:32.559 --> 00:15:36.000
<v Speaker 1>But Simpson's paradox hides entirely within the aggregate structure of

322
00:15:36.039 --> 00:15:36.840
<v Speaker 1>the data itself.

323
00:15:37.000 --> 00:15:39.919
<v Speaker 2>It does. It occurs when a clear trend appears in

324
00:15:40.000 --> 00:15:43.840
<v Speaker 2>multiple distinct groups of data, but then completely disappears or

325
00:15:43.879 --> 00:15:45.919
<v Speaker 2>even reverses when those groups are combined.

326
00:15:46.159 --> 00:15:49.840
<v Speaker 1>So let's apply this to DataSciencester. Say you

327
00:15:49.879 --> 00:15:53.440
<v Speaker 1>are analyzing regional engagement to see which coast is friendlier.

328
00:15:54.000 --> 00:15:55.960
<v Speaker 1>You calculate the overall average connections.

329
00:15:56.080 --> 00:15:57.240
<v Speaker 2>Okay, let's look at the numbers.

330
00:15:57.399 --> 00:15:59.960
<v Speaker 1>The West Coast user base averages eight point two friends

331
00:16:00.159 --> 00:16:03.320
<v Speaker 1>per user. The East Coast user base averages six point

332
00:16:03.360 --> 00:16:07.279
<v Speaker 1>five friends. So the aggregate data heavily favors the West coast.

333
00:16:07.399 --> 00:16:08.600
<v Speaker 2>Really?

334
00:16:08.639 --> 00:16:12.320
<v Speaker 1>But then you introduce a confounding variable. You stratify the data based

335
00:16:12.360 --> 00:16:16.600
<v Speaker 1>on educational background: users with a PhD and users without

336
00:16:16.600 --> 00:16:17.159
<v Speaker 1>a PhD.

337
00:16:17.279 --> 00:16:20.759
<v Speaker 2>So you isolate the PhD subgroup and suddenly the East

338
00:16:20.799 --> 00:16:24.080
<v Speaker 2>Coast data scientists average significantly more friends than the West

339
00:16:24.120 --> 00:16:24.919
<v Speaker 2>Coast PhDs.

340
00:16:25.000 --> 00:16:27.879
<v Speaker 1>Okay, so the East Coast wins the PhD demographic.

341
00:16:28.200 --> 00:16:31.320
<v Speaker 2>Right. Then you isolate the non-PhD subgroup, and once again,

342
00:16:31.399 --> 00:16:34.039
<v Speaker 2>the East Coast data scientists average more friends than the

343
00:16:34.039 --> 00:16:35.559
<v Speaker 2>West Coast non-PhDs.

344
00:16:35.759 --> 00:16:36.639
<v Speaker 1>Wait, stop right there.

345
00:16:36.679 --> 00:16:37.120
<v Speaker 2>What's wrong?

346
00:16:37.240 --> 00:16:39.879
<v Speaker 1>How is it mathematically possible for the East Coast to

347
00:16:39.960 --> 00:16:44.039
<v Speaker 1>win in both individual subcategories but lose the total overall?

348
00:16:44.639 --> 00:16:47.639
<v Speaker 1>I mean, that feels like it violates basic arithmetic.

349
00:16:47.840 --> 00:16:49.759
<v Speaker 2>It really does feel like magic. But it comes down

350
00:16:49.799 --> 00:16:52.600
<v Speaker 2>to unequal weighting in the denominators of those subsets.

351
00:16:52.840 --> 00:16:53.960
<v Speaker 1>Okay, break that down for me.

352
00:16:54.200 --> 00:16:57.799
<v Speaker 2>The paradox is driven by the distribution of the confounding variable,

353
00:16:57.799 --> 00:17:01.799
<v Speaker 2>in this case, the PhD. So look at the underlying topology

354
00:17:01.840 --> 00:17:06.759
<v Speaker 2>of the users across the entire platform. Users with PhDs

355
00:17:06.880 --> 00:17:10.839
<v Speaker 2>simply have fewer connections. They average around three friends.

356
00:17:10.440 --> 00:17:12.160
<v Speaker 1>Which, I mean, they're busy doing research, right?

357
00:17:12.599 --> 00:17:16.440
<v Speaker 2>While users without PhDs are highly active, averaging around ten

358
00:17:16.480 --> 00:17:17.720
<v Speaker 2>to thirteen friends.

359
00:17:18.119 --> 00:17:22.759
<v Speaker 1>So a PhD basically acts as a massive downward weight

360
00:17:22.920 --> 00:17:23.920
<v Speaker 1>on a group's average.

361
00:17:23.960 --> 00:17:27.720
<v Speaker 2>Precisely, now, look at the regional distribution. The East Coast

362
00:17:27.799 --> 00:17:31.079
<v Speaker 2>user base is heavily saturated with PhDs.

363
00:17:30.640 --> 00:17:32.680
<v Speaker 1>Because of all the universities and research hubs.

364
00:17:32.799 --> 00:17:37.039
<v Speaker 2>Exactly, they have a massive concentration of these low connection users,

365
00:17:37.079 --> 00:17:41.480
<v Speaker 2>pulling their overall average down. The West Coast user base, however,

366
00:17:41.599 --> 00:17:44.039
<v Speaker 2>is overwhelmingly composed of non-PhDs.

367
00:17:43.880 --> 00:17:46.119
<v Speaker 1>Part of the culture, right?

368
00:17:46.079 --> 00:17:49.079
<v Speaker 2>Right. So when you aggregate the data, the sheer volume of highly

369
00:17:49.119 --> 00:17:52.799
<v Speaker 2>connected non-PhDs on the West Coast mathematically drowns out

370
00:17:52.839 --> 00:17:55.680
<v Speaker 2>the East Coast's higher performance within the individual tiers.

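NOTE
A rough sketch of the weighting arithmetic behind the reversal, in plain Python.
The subgroup member counts and subgroup averages are illustrative assumptions, not figures stated in the episode.
They were chosen so the overall averages come out near the quoted 8.2 (West) and 6.5 (East).
```python
# (members, average friends) per subgroup; all four pairs are assumed values
groups = {
    "West": {"PhD": (35, 3.1), "no PhD": (66, 10.9)},
    "East": {"PhD": (70, 3.2), "no PhD": (33, 13.4)},
}
def overall_average(subgroups):
    # Weighted mean: each subgroup average counts in proportion to its size
    total_members = sum(n for n, _ in subgroups.values())
    total_friends = sum(n * avg for n, avg in subgroups.values())
    return total_friends / total_members
west = overall_average(groups["West"])
east = overall_average(groups["East"])
# East leads inside each education tier, yet trails once the tiers are pooled
for tier in ("PhD", "no PhD"):
    print(tier, "West:", groups["West"][tier][1], "East:", groups["East"][tier][1])
print("overall", "West:", round(west, 1), "East:", round(east, 1))
```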
371
00:17:55.759 --> 00:18:00.000
<v Speaker 1>Wow, so the regional bucketing completely masks the educational weighting.

372
00:18:00.480 --> 00:18:02.960
<v Speaker 2>Completely. If you hadn't joined the network table with the educational

373
00:18:03.000 --> 00:18:05.960
<v Speaker 2>background table, you would have delivered a presentation to the

374
00:18:06.000 --> 00:18:10.400
<v Speaker 2>board concluding that West Coast users are inherently more sociable.

375
00:18:10.200 --> 00:18:12.759
<v Speaker 1>And you would have optimized millions of dollars in marketing

376
00:18:12.799 --> 00:18:17.759
<v Speaker 1>campaigns around that assumption, fully backed by mathematically flawless yet

377
00:18:17.799 --> 00:18:20.559
<v Speaker 1>factually entirely backwards data.

378
00:18:20.680 --> 00:18:23.039
<v Speaker 2>And this, right here is the core lesson of doing

379
00:18:23.119 --> 00:18:26.599
<v Speaker 2>data science from scratch. It forces you to recognize that

380
00:18:26.640 --> 00:18:30.279
<v Speaker 2>statistical tools are not objective arbiters of truth.

381
00:18:30.640 --> 00:18:32.160
<v Speaker 1>No, they are mathematical lenses.

382
00:18:32.400 --> 00:18:36.119
<v Speaker 2>Exactly. When we calculate a correlation or an aggregate mean,

383
00:18:36.559 --> 00:18:40.920
<v Speaker 2>the foundational, unspoken assumption is always ceteris paribus, all else

384
00:18:40.960 --> 00:18:45.480
<v Speaker 2>being equal. We assume the underlying distributions are uniform. Simpson's

385
00:18:45.559 --> 00:18:48.160
<v Speaker 2>paradox proves how lethal that assumption can be to a

386
00:18:48.200 --> 00:18:48.880
<v Speaker 2>business model.

387
00:18:49.359 --> 00:18:52.519
<v Speaker 1>The infrastructure of data science really requires rigor at every

388
00:18:52.599 --> 00:18:55.079
<v Speaker 1>single layer of the stack. I mean, it demands type

389
00:18:55.119 --> 00:18:58.680
<v Speaker 1>safe ingestion to prevent silent pipeline corruption. It demands a

390
00:18:58.720 --> 00:19:02.480
<v Speaker 1>physiological understanding of how end users process the geometry of

391
00:19:02.480 --> 00:19:07.640
<v Speaker 1>a visualization. It requires structuring memory into matrices to unlock

392
00:19:07.680 --> 00:19:13.440
<v Speaker 1>computational scale. And it requires a deep, almost paranoid skepticism

393
00:19:13.519 --> 00:19:14.720
<v Speaker 1>of aggregated metrics.

394
00:19:14.759 --> 00:19:16.279
<v Speaker 2>It does, and I want to leave you with a

395
00:19:16.319 --> 00:19:19.680
<v Speaker 2>final thought to apply outside the boundaries of DataSciencester.

396
00:19:19.920 --> 00:19:20.480
<v Speaker 1>Let's hear it.

397
00:19:20.920 --> 00:19:25.839
<v Speaker 2>Every single day you are bombarded with viral statistics, algorithmic recommendations,

398
00:19:26.119 --> 00:19:29.839
<v Speaker 2>and definitive correlations in the news. Every single one of

399
00:19:29.839 --> 00:19:33.680
<v Speaker 2>those metrics was aggregated by someone making an assumption about uniformity.

400
00:19:34.359 --> 00:19:37.240
<v Speaker 2>Knowing what you know now about Simpson's paradox, ask yourself,

401
00:19:37.799 --> 00:19:41.759
<v Speaker 2>in the infinitely complex overlapping matrices of human behavior, is

402
00:19:41.839 --> 00:19:45.200
<v Speaker 2>all else ever truly equal? When you see a definitive

403
00:19:45.240 --> 00:19:49.640
<v Speaker 2>trend tomorrow, what confounding variables, what invisible PhDs are lurking

404
00:19:49.759 --> 00:19:52.440
<v Speaker 2>just beneath the surface, waiting to flip the narrative?

405
00:19:52.559 --> 00:19:56.119
<v Speaker 1>Wow. It definitely changes how you interpret the next dashboard

406
00:19:56.119 --> 00:19:58.119
<v Speaker 1>you look at, the next headline you read. Well,

407
00:19:58.119 --> 00:20:00.440
<v Speaker 1>we'll leave it there for today's deep dive into the

408
00:20:00.480 --> 00:20:05.119
<v Speaker 1>underlying mechanics of data science. Remember to always verify your axes,

409
00:20:05.480 --> 00:20:08.160
<v Speaker 1>check your distributions for outliers, and we will catch you

410
00:20:08.200 --> 00:20:09.119
<v Speaker 1>on the next deep dive.
