WEBVTT

1
00:00:00.200 --> 00:00:05.599
<v Speaker 1>Imagine you're managing a nationwide campaign, right, Okay, you have

2
00:00:05.679 --> 00:00:09.160
<v Speaker 1>this massive database, like a million potential voters, but you

3
00:00:09.240 --> 00:00:11.320
<v Speaker 1>only have the time and the budget to knock on

4
00:00:11.759 --> 00:00:13.640
<v Speaker 1>I don't know, ten thousand doors.

5
00:00:13.160 --> 00:00:15.839
<v Speaker 2>Right, extremely limited resources.

6
00:00:15.359 --> 00:00:18.879
<v Speaker 1>Exactly. So how do you mathematically guarantee that you're knocking

7
00:00:18.920 --> 00:00:22.399
<v Speaker 1>on the exact right ones? Or take a massive

8
00:00:22.440 --> 00:00:23.600
<v Speaker 1>retailer like Target.

9
00:00:23.679 --> 00:00:25.480
<v Speaker 2>Oh yeah, the famous pregnancy example.

10
00:00:25.719 --> 00:00:28.719
<v Speaker 1>Yes, how do they know a customer is pregnant and

11
00:00:28.960 --> 00:00:32.240
<v Speaker 1>can start sending them coupons for baby clothes before that

12
00:00:32.280 --> 00:00:34.119
<v Speaker 1>person has even told their own family.

13
00:00:34.280 --> 00:00:38.560
<v Speaker 2>I mean it sounds like corporate espionage or literal mind reading,

14
00:00:38.679 --> 00:00:41.640
<v Speaker 2>it really does, But it's actually just recognizing patterns in

15
00:00:41.920 --> 00:00:46.159
<v Speaker 2>you know, seemingly mundane data like unscented lotion purchases suddenly

16
00:00:46.159 --> 00:00:49.759
<v Speaker 2>correlating with a second trimester, or, in the case of

17
00:00:49.759 --> 00:00:53.119
<v Speaker 2>the twenty twelve Obama campaign, finding the exact combination of

18
00:00:53.159 --> 00:00:57.520
<v Speaker 2>demographic data and past voting behavior that signals that a

19
00:00:57.520 --> 00:00:59.479
<v Speaker 2>person just needs a very slight nudge to actually show

20
00:00:59.520 --> 00:01:00.640
<v Speaker 2>up at the polls.

21
00:01:00.479 --> 00:01:02.679
<v Speaker 1>Which is wild, right, because we see the end results of

22
00:01:02.679 --> 00:01:05.840
<v Speaker 1>these models every single day. You tap a glossy button

23
00:01:05.840 --> 00:01:09.280
<v Speaker 1>on a smartphone, a little progress bar spins, and an

24
00:01:09.319 --> 00:01:13.040
<v Speaker 1>app confidently tells you who to date, what to buy,

25
00:01:13.200 --> 00:01:14.519
<v Speaker 1>what movie to watch next.

26
00:01:14.640 --> 00:01:16.760
<v Speaker 2>Yeah, it's curated reality.

27
00:01:16.799 --> 00:01:20.719
<v Speaker 1>Exactly. There's this expectation of seamless magic. We like things to

28
00:01:20.719 --> 00:01:23.200
<v Speaker 1>be hidden behind a sleek interface.

29
00:01:22.760 --> 00:01:25.400
<v Speaker 2>Which is exactly why so many people get a massive

30
00:01:25.439 --> 00:01:28.400
<v Speaker 2>shock when they actually step into the discipline of data

31
00:01:28.400 --> 00:01:32.000
<v Speaker 2>science itself. Oh for sure, because that glossy interface it's

32
00:01:32.159 --> 00:01:35.079
<v Speaker 2>entirely stripped away. You are suddenly looking at a landscape

33
00:01:35.079 --> 00:01:39.840
<v Speaker 2>that is raw, messy and honestly totally unforgiving. Yeah, and

34
00:01:39.920 --> 00:01:44.959
<v Speaker 2>a concerning number of practitioners today are they're overly reliant

35
00:01:44.959 --> 00:01:47.319
<v Speaker 2>on shiny, prepackaged libraries.

36
00:01:46.959 --> 00:01:48.040
<v Speaker 1>The black boxes.

37
00:01:48.239 --> 00:01:50.560
<v Speaker 2>Exactly. They plug data into a black box, Yeah, and

38
00:01:50.599 --> 00:01:53.239
<v Speaker 2>they just trust whatever comes out, without having any idea

39
00:01:53.239 --> 00:01:55.000
<v Speaker 2>how the underlying engine actually functions.

40
00:01:55.120 --> 00:01:57.760
<v Speaker 1>Well, that is the exact trap we are dismantling today.

41
00:01:58.159 --> 00:02:00.640
<v Speaker 1>Welcome to our deep dive into Joel Grus's book

42
00:02:00.840 --> 00:02:03.840
<v Speaker 1>Data Science from Scratch: First Principles with Python.

43
00:02:03.959 --> 00:02:05.480
<v Speaker 2>It's a fantastic text.

44
00:02:05.400 --> 00:02:08.280
<v Speaker 1>It really is. And our mission for this deep dive

45
00:02:08.479 --> 00:02:13.280
<v Speaker 1>is to basically throw out the automated frameworks. No massive

46
00:02:13.360 --> 00:02:16.879
<v Speaker 1>black box libraries doing the heavy lifting for us. We

47
00:02:16.919 --> 00:02:18.479
<v Speaker 1>are going to look at the raw mechanics.

48
00:02:18.360 --> 00:02:20.080
<v Speaker 2>Right, down to the studs.

49
00:02:20.240 --> 00:02:22.800
<v Speaker 1>Exactly. By the end of our conversation, you are going

50
00:02:22.840 --> 00:02:27.159
<v Speaker 1>to understand how fundamental algorithms process information. We're moving from

51
00:02:27.280 --> 00:02:31.120
<v Speaker 1>basic data structures into the mechanics of visualization, all the

52
00:02:31.159 --> 00:02:34.000
<v Speaker 1>way down to the bare metal of linear algebra.

53
00:02:34.159 --> 00:02:38.319
<v Speaker 2>Because data science isn't about memorizing syntax to import a

54
00:02:38.360 --> 00:02:41.800
<v Speaker 2>machine learning library, right, It's about answering questions that no

55
00:02:41.840 --> 00:02:44.479
<v Speaker 2>one has even thought to ask yet. Yeah, and those

56
00:02:44.520 --> 00:02:48.639
<v Speaker 2>answers are buried in this massive glut of everyday information.

57
00:02:49.120 --> 00:02:51.840
<v Speaker 2>You can only extract them if you fundamentally understand the

58
00:02:51.879 --> 00:02:52.719
<v Speaker 2>tools you're holding.

59
00:02:52.840 --> 00:02:55.080
<v Speaker 1>Okay, let's unpack this because before we look at a

60
00:02:55.120 --> 00:02:57.800
<v Speaker 1>single line of actual code, the book sets up this

61
00:02:57.919 --> 00:03:01.800
<v Speaker 1>brilliant hypothetical sandbox for us to play in the startup. Yeah,

62
00:03:01.879 --> 00:03:04.680
<v Speaker 1>you've just been hired at DataSciencester, which is

63
00:03:04.719 --> 00:03:08.280
<v Speaker 1>a fictional social network built exclusively for data scientists.

64
00:03:08.479 --> 00:03:12.240
<v Speaker 2>I mean, it's a great framing device. So on your

65
00:03:12.319 --> 00:03:15.759
<v Speaker 2>very first day, the VP of networking drops a massive

66
00:03:15.840 --> 00:03:18.159
<v Speaker 2>data dump on your desk as they do, right, and

67
00:03:18.199 --> 00:03:22.000
<v Speaker 2>gives you your first assignment identify the key connectors among

68
00:03:22.039 --> 00:03:25.919
<v Speaker 2>the users. But this data isn't sitting in some neat,

69
00:03:26.240 --> 00:03:27.599
<v Speaker 2>easily searchable database.

70
00:03:27.680 --> 00:03:28.159
<v Speaker 1>No it's not.

71
00:03:28.439 --> 00:03:31.120
<v Speaker 2>It is raw Python data structures. You get a list

72
00:03:31.120 --> 00:03:34.479
<v Speaker 2>of users where each person is represented by just a

73
00:03:34.479 --> 00:03:35.280
<v Speaker 2>simple dictionary.

74
00:03:35.400 --> 00:03:37.000
<v Speaker 1>Right. So I'm looking at this list in the book,

75
00:03:37.039 --> 00:03:40.400
<v Speaker 1>and it's just basic pairs. User ID zero is named Hero,

76
00:03:40.639 --> 00:03:43.680
<v Speaker 1>User ID one is Dunn, just raw text, yep. And

77
00:03:43.759 --> 00:03:46.879
<v Speaker 1>alongside that you get the friendship data. But it's not

78
00:03:46.960 --> 00:03:49.439
<v Speaker 1>a visual web of connections. It's just a list of

79
00:03:49.479 --> 00:03:50.560
<v Speaker 1>tuples.

80
00:03:50.599 --> 00:03:53.319
<v Speaker 2>Exactly. And a tuple is just an immutable or unchangeable sequence

81
00:03:53.319 --> 00:03:55.879
<v Speaker 2>of elements. In this case, it's just a pair of IDs. Okay,

82
00:03:56.240 --> 00:03:58.599
<v Speaker 2>So if Hero is friends with Dunn, the data

83
00:03:58.719 --> 00:04:02.319
<v Speaker 2>just shows a tuple parenthesis zero comma one parenthesis. That

84
00:04:02.400 --> 00:04:04.840
<v Speaker 2>is your entire social graph. It's entirely abstract.

85
00:04:05.120 --> 00:04:08.680
<v Speaker 1>So if I'm tasked with finding the most important person

86
00:04:08.759 --> 00:04:12.120
<v Speaker 1>in this abstract graph, my immediate instinct is to just

87
00:04:13.319 --> 00:04:15.520
<v Speaker 1>I don't know, count who has the most tuples with their

88
00:04:15.560 --> 00:04:17.079
<v Speaker 1>ID in it, just do a raw headcount.

89
00:04:17.160 --> 00:04:19.480
<v Speaker 2>And that is the concept of degree centrality. You're simply

90
00:04:19.519 --> 00:04:23.160
<v Speaker 2>asking who has the most direct connections, you sum them up,

91
00:04:23.600 --> 00:04:26.160
<v Speaker 2>sort the list, and whoever is at the top is

92
00:04:26.639 --> 00:04:28.800
<v Speaker 2>theoretically your most central figure.

93
00:04:28.839 --> 00:04:30.879
<v Speaker 1>But I'm looking at that and thinking that's just high

94
00:04:30.879 --> 00:04:34.000
<v Speaker 1>school popularity pretty much. Yeah, it's measuring who sits with

95
00:04:34.040 --> 00:04:37.360
<v Speaker 1>the most people at lunch. But does having a high

96
00:04:37.480 --> 00:04:40.319
<v Speaker 1>raw headcount actually make you the most critical node in

97
00:04:40.360 --> 00:04:41.360
<v Speaker 1>a flow of information?

98
00:04:41.639 --> 00:04:45.279
<v Speaker 2>It rarely does, and Grus highlights this exact flaw by

99
00:04:45.319 --> 00:04:48.319
<v Speaker 2>introducing an anomaly in the DataSciencester network.

100
00:04:47.800 --> 00:04:50.480
<v Speaker 1>Right, the Dunn and Thor situation.

101
00:04:50.639 --> 00:04:53.240
<v Speaker 2>Exactly. So, if you look at the raw numbers, the user Dunn,

102
00:04:53.480 --> 00:04:57.399
<v Speaker 2>who is ID one, has three direct friends, but another

103
00:04:57.480 --> 00:05:00.680
<v Speaker 2>user Thor ID four, only has two. So if you

104
00:05:00.720 --> 00:05:04.519
<v Speaker 2>blindly run a degree centrality algorithm, it ranks Dunn as

105
00:05:04.600 --> 00:05:05.560
<v Speaker 2>more important than Thor.

106
00:05:05.759 --> 00:05:08.240
<v Speaker 1>Wait, but if I actually map out these connections visually,

107
00:05:09.079 --> 00:05:11.560
<v Speaker 1>Thor is sitting dead in the middle of a chasm.

108
00:05:11.800 --> 00:05:16.319
<v Speaker 1>He's the only link bridging two completely separate, isolated clusters

109
00:05:16.319 --> 00:05:16.879
<v Speaker 1>of users.

110
00:05:16.959 --> 00:05:20.639
<v Speaker 2>Precisely. If Dunn leaves the network, his three friends can

111
00:05:20.680 --> 00:05:23.560
<v Speaker 2>still talk to each other through other paths. Whoa. But

112
00:05:23.560 --> 00:05:25.879
<v Speaker 2>if Thor leaves the network, it literally breaks in half.

113
00:05:26.000 --> 00:05:30.959
<v Speaker 2>The flow of information stops entirely. Thor is a bottleneck. Intuitively,

114
00:05:31.040 --> 00:05:33.439
<v Speaker 2>that makes him far more critical to the network than

115
00:05:33.439 --> 00:05:36.199
<v Speaker 2>Dunn despite having fewer direct friends.
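
NOTE
A minimal sketch of the degree-versus-bridge point made above, assuming Python 3.
The friendship pairs are illustrative data chosen to match the counts mentioned
(Dunn, ID 1, with three friends; Thor, ID 4, with two). The helper names
degree_counts and stays_connected_without are hypothetical, not from the book.
```python
from collections import Counter, defaultdict, deque
# Illustrative friendship tuples (an assumption for this sketch).
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
def degree_counts(pairs):
    # Each tuple adds one connection to both of its endpoints.
    return Counter(uid for pair in pairs for uid in pair)
def stays_connected_without(pairs, removed):
    # Rebuild the graph without one user, then breadth-first search it.
    neighbors = defaultdict(set)
    for a, b in pairs:
        if removed not in (a, b):
            neighbors[a].add(b)
            neighbors[b].add(a)
    remaining = {uid for pair in pairs for uid in pair} - {removed}
    start = next(iter(remaining))
    seen, queue = {start}, deque([start])
    while queue:
        for friend in neighbors[queue.popleft()]:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    # True only if every remaining user was reachable from the start user.
    return seen == remaining
degrees = degree_counts(friendships)
dunn_is_safe_to_remove = stays_connected_without(friendships, 1)
thor_is_safe_to_remove = stays_connected_without(friendships, 4)
```
Counting alone ranks user 1 above user 4, yet in this data only removing user 4 splits the graph.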

116
00:05:36.360 --> 00:05:40.000
<v Speaker 1>So if I just used a prepackaged function that calculates popularity,

117
00:05:40.199 --> 00:05:41.800
<v Speaker 1>I would have handed my boss the wrong name on

118
00:05:41.920 --> 00:05:42.319
<v Speaker 1>day one.

119
00:05:42.480 --> 00:05:45.079
<v Speaker 2>You would have And that proves why you can't blindly

120
00:05:45.120 --> 00:05:47.639
<v Speaker 2>trust a default metric. You have to look at the

121
00:05:47.639 --> 00:05:49.959
<v Speaker 2>structure of the data. You have to build functions that

122
00:05:50.040 --> 00:05:53.040
<v Speaker 2>look at, say, mutual friends or shared interests. And to

123
00:05:53.040 --> 00:05:55.560
<v Speaker 2>build those functions you have to use the language, in

124
00:05:55.560 --> 00:05:59.120
<v Speaker 2>this case Python. But you don't just write Python. The

125
00:05:59.120 --> 00:06:01.000
<v Speaker 2>book stresses writing Pythonic code.

126
00:06:00.639 --> 00:06:02.879
<v Speaker 1>Which is a whole different mindset. And I noticed

127
00:06:02.879 --> 00:06:06.399
<v Speaker 1>the book adheres specifically to Python two point seven, which,

128
00:06:06.439 --> 00:06:09.399
<v Speaker 1>well, it introduces some serious quirks into how we handle data.

129
00:06:09.639 --> 00:06:11.000
<v Speaker 2>Oh the division quirk.

130
00:06:11.199 --> 00:06:14.199
<v Speaker 1>Yes, if I open Python two point seven and type

131
00:06:14.199 --> 00:06:16.480
<v Speaker 1>five divided by two, it doesn't give me two point five.

132
00:06:16.480 --> 00:06:17.279
<v Speaker 1>It spits out two.

133
00:06:17.680 --> 00:06:21.120
<v Speaker 2>Because of how it handles integer division, it truncates the

134
00:06:21.160 --> 00:06:25.439
<v Speaker 2>decimal entirely unless you explicitly tell the environment to use

135
00:06:25.519 --> 00:06:26.839
<v Speaker 2>floating point math.

136
00:06:26.959 --> 00:06:28.839
<v Speaker 1>Which seems so counterintuitive.

137
00:06:29.000 --> 00:06:31.639
<v Speaker 2>It is. You literally have to type from __future__ import

138
00:06:31.639 --> 00:06:35.000
<v Speaker 2>division to make basic math behave the way a human expects

139
00:06:35.000 --> 00:06:38.000
<v Speaker 2>it to. It's I mean, it's a harsh reminder that

140
00:06:38.040 --> 00:06:40.920
<v Speaker 2>the underlying environment shapes the reality of your data.
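
NOTE
A small sketch of the division behavior discussed above. It assumes Python 3,
where the slash already performs true division; the double slash reproduces
the integer result that Python 2.7 gives for 5 / 2 without the __future__ import.
```python
true_quotient = 5 / 2            # true division keeps the fractional part
floor_quotient = 5 // 2          # floor division drops it
negative_floor = -5 // 2         # floors toward negative infinity, not toward zero
```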

141
00:06:41.079 --> 00:06:43.800
<v Speaker 1>That feels incredibly rigid, but honestly not as rigid as

142
00:06:43.839 --> 00:06:46.439
<v Speaker 1>white space formatting. Oh yeah, I'm used to seeing code

143
00:06:46.439 --> 00:06:50.160
<v Speaker 1>wrapped in curly braces or having explicit end statements, but

144
00:06:50.240 --> 00:06:52.360
<v Speaker 1>Python relies entirely on indentation.

145
00:06:52.720 --> 00:06:55.399
<v Speaker 2>Right. It forces you to write readable code. There's no

146
00:06:55.519 --> 00:06:58.800
<v Speaker 2>visual clutter. But the trade off is that it's strictly enforced.

147
00:06:59.079 --> 00:07:01.120
<v Speaker 1>Meaning if you mess up a space.

148
00:07:01.160 --> 00:07:04.560
<v Speaker 2>If you're copying and pasting a block of logic to analyze, say,

149
00:07:04.720 --> 00:07:08.160
<v Speaker 2>user salaries, and you have one accidental space, the entire

150
00:07:08.160 --> 00:07:12.399
<v Speaker 2>script crashes. It forces discipline, but the reward for that

151
00:07:12.519 --> 00:07:16.680
<v Speaker 2>discipline is access to incredibly elegant tools right out of

152
00:07:16.720 --> 00:07:19.199
<v Speaker 2>the box, things like defaultdict and Counter.

153
00:07:19.319 --> 00:07:21.560
<v Speaker 1>Okay, let's start there, because I want to understand the

154
00:07:21.600 --> 00:07:24.800
<v Speaker 1>mechanics of why these are so vital. Let's say my

155
00:07:24.920 --> 00:07:28.560
<v Speaker 1>next task at DataSciencester is to figure out what

156
00:07:28.759 --> 00:07:29.720
<v Speaker 1>our users care about.

157
00:07:29.600 --> 00:07:31.279
<v Speaker 2>Okay, finding their interests?

158
00:07:31.639 --> 00:07:34.800
<v Speaker 1>Right, I want to count how many times specific words

159
00:07:34.800 --> 00:07:37.959
<v Speaker 1>show up in their profile bios, words like Hadoop or

160
00:07:38.240 --> 00:07:41.519
<v Speaker 1>scikit-learn. If I use a standard Python dictionary to

161
00:07:41.600 --> 00:07:44.399
<v Speaker 1>keep a running tally, how does that actually execute?

162
00:07:44.680 --> 00:07:48.079
<v Speaker 2>It's incredibly clunky. A standard dictionary throws a literal error

163
00:07:48.399 --> 00:07:50.720
<v Speaker 2>if you try to modify a key that doesn't exist yet.

164
00:07:50.759 --> 00:07:53.360
<v Speaker 2>Oh really yeah, So as your program reads the word

165
00:07:53.399 --> 00:07:55.519
<v Speaker 2>hadoop for the very first time, you have to write

166
00:07:55.600 --> 00:07:58.399
<v Speaker 2>logic that says, check if hadoop is in the dictionary.

167
00:07:58.560 --> 00:08:00.959
<v Speaker 2>It isn't. Okay, create the key hadoop and set the

168
00:08:01.040 --> 00:08:01.600
<v Speaker 2>value to one.

169
00:08:01.759 --> 00:08:03.000
<v Speaker 1>That sounds exhausting.

170
00:08:03.199 --> 00:08:05.720
<v Speaker 2>It is. Oh look, the next word is hadoop again.

171
00:08:05.800 --> 00:08:10.160
<v Speaker 2>Check if it exists, yes, okay, increment the value by one.

172
00:08:10.240 --> 00:08:12.959
<v Speaker 1>You are writing multiple lines of repetitive safety checks just

173
00:08:12.959 --> 00:08:15.879
<v Speaker 1>to count words. That seems like a massive waste of

174
00:08:15.959 --> 00:08:18.680
<v Speaker 1>processing time and just human effort.

175
00:08:18.839 --> 00:08:22.160
<v Speaker 2>It totally is. Enter defaultdict. It intercepts that missing

176
00:08:22.240 --> 00:08:25.160
<v Speaker 2>key error. If you ask it to modify hadoop and

177
00:08:25.199 --> 00:08:29.000
<v Speaker 2>hadoop isn't there, it gracefully creates it on the fly,

178
00:08:29.759 --> 00:08:33.320
<v Speaker 2>assigns it a default value like zero, and then lets

179
00:08:33.320 --> 00:08:36.399
<v Speaker 2>your code increment it. It removes all the boilerplate safety

180
00:08:36.480 --> 00:08:37.279
<v Speaker 2>checks.

181
00:08:37.320 --> 00:08:40.000
<v Speaker 1>That is so much cleaner, and then Counter takes that a

182
00:08:40.039 --> 00:08:40.679
<v Speaker 1>step further, right?

183
00:08:40.799 --> 00:08:44.320
<v Speaker 2>Yes, Counter is a dict subclass designed specifically for this

184
00:08:44.440 --> 00:08:47.399
<v Speaker 2>exact problem. You just hand it a raw list of a

185
00:08:47.440 --> 00:08:50.919
<v Speaker 2>million words, and in one single line of code, it

186
00:08:50.960 --> 00:08:54.279
<v Speaker 2>absorbs the list and spits out a dictionary of frequencies.
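
NOTE
A minimal sketch comparing the three counting approaches described above,
assuming Python 3. The word list is hypothetical sample data.
```python
from collections import Counter, defaultdict
words = ["hadoop", "python", "hadoop", "scikit-learn", "python", "hadoop"]
# Plain dict: every update needs an explicit existence check.
plain = {}
for word in words:
    if word in plain:
        plain[word] += 1
    else:
        plain[word] = 1
# defaultdict(int): a missing key is created with the default value 0.
with_default = defaultdict(int)
for word in words:
    with_default[word] += 1
# Counter: the whole tally in one call.
counted = Counter(words)
```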

187
00:08:54.360 --> 00:08:57.320
<v Speaker 1>It transforms a multi step, error prone loop into a

188
00:08:57.320 --> 00:09:01.399
<v Speaker 1>single elegant command. Exactly. Oh wait, you just casually mentioned

189
00:09:01.399 --> 00:09:03.600
<v Speaker 1>a list of a million words. When we're dealing with

190
00:09:03.720 --> 00:09:06.600
<v Speaker 1>data at scale, moving a million items around has to

191
00:09:06.600 --> 00:09:09.480
<v Speaker 1>cause problems, huge problems. Like if I try to assign

192
00:09:09.559 --> 00:09:12.039
<v Speaker 1>a list of a million numbers to a variable, Python

193
00:09:12.120 --> 00:09:15.120
<v Speaker 1>is literally allocating memory for all one million of those

194
00:09:15.200 --> 00:09:17.759
<v Speaker 1>numbers instantly, right?

195
00:09:17.679 --> 00:09:20.960
<v Speaker 2>Correct. And if your data set is large enough, your machine will completely choke.

196
00:09:21.440 --> 00:09:24.840
<v Speaker 2>It runs out of RAM and crashes. This is where

197
00:09:24.840 --> 00:09:27.759
<v Speaker 2>the book introduces the concept of lazy evaluation.

198
00:09:27.440 --> 00:09:29.679
<v Speaker 1>Using generators?

199
00:09:29.720 --> 00:09:32.360
<v Speaker 2>Specifically using generators.

200
00:09:31.840 --> 00:09:34.039
<v Speaker 1>I always picture this well, it's like watching a movie.

201
00:09:34.080 --> 00:09:37.399
<v Speaker 1>If I use a standard list, it's like downloading a

202
00:09:37.480 --> 00:09:40.559
<v Speaker 1>massive four K movie file to my hard drive. I

203
00:09:40.600 --> 00:09:42.960
<v Speaker 1>can't watch a single second of it until the entire

204
00:09:42.960 --> 00:09:45.080
<v Speaker 1>one hundred gigabyte file is sitting in my memory.

205
00:09:45.240 --> 00:09:46.679
<v Speaker 2>That's a perfect analogy.

206
00:09:46.679 --> 00:09:48.840
<v Speaker 1>But a generator is like streaming it. I'm just pulling the

207
00:09:48.879 --> 00:09:51.159
<v Speaker 1>exact frame I need, exactly when I need it, and

208
00:09:51.200 --> 00:09:53.240
<v Speaker 1>then discarding it to make room for the next frame.

209
00:09:53.399 --> 00:09:56.360
<v Speaker 2>That is an excellent way to visualize it. A generator

210
00:09:56.440 --> 00:09:59.720
<v Speaker 2>yields values one at a time. It pauses its state,

211
00:10:00.200 --> 00:10:02.519
<v Speaker 2>waits for you to ask for the next value, and

212
00:10:02.639 --> 00:10:05.480
<v Speaker 2>only computes what is strictly necessary in that moment.

213
00:10:05.320 --> 00:10:09.200
<v Speaker 1>Which is incredibly memory efficient. Extremely. Okay, but if streaming

214
00:10:09.320 --> 00:10:11.679
<v Speaker 1>is so much lighter on my system, why am I

215
00:10:11.720 --> 00:10:14.200
<v Speaker 1>ever downloading the file? Like, why wouldn't I use a

216
00:10:14.240 --> 00:10:17.120
<v Speaker 1>generator for absolutely everything in my data pipeline?

217
00:10:17.200 --> 00:10:20.639
<v Speaker 2>Ah, because a generator is ephemeral. You can only iterate

218
00:10:20.679 --> 00:10:21.279
<v Speaker 2>through it once.

219
00:10:21.559 --> 00:10:23.919
<v Speaker 1>Wait, really, once I read it, it's just gone.

220
00:10:24.279 --> 00:10:27.399
<v Speaker 2>Exactly. If you have a massive data set of user

221
00:10:27.399 --> 00:10:30.000
<v Speaker 2>interactions and you need to loop through it to find

222
00:10:30.000 --> 00:10:32.279
<v Speaker 2>the average, and then loop through it again to find

223
00:10:32.279 --> 00:10:35.960
<v Speaker 2>the standard deviation, a generator will be completely exhausted after

224
00:10:36.000 --> 00:10:36.720
<v Speaker 2>that first pass.

225
00:10:37.039 --> 00:10:39.320
<v Speaker 1>Oh wow, I didn't realize that.

226
00:10:39.440 --> 00:10:41.799
<v Speaker 2>Yeah, you would have to recompute the entire stream from

227
00:10:41.879 --> 00:10:45.519
<v Speaker 2>scratch for the second pass. So the engineering challenge is

228
00:10:45.559 --> 00:10:49.200
<v Speaker 2>constantly balancing memory efficiency against how many times you actually

229
00:10:49.200 --> 00:10:51.240
<v Speaker 2>need to interact with that specific data set.
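
NOTE
A minimal sketch of the single-pass behavior described above, assuming Python 3.
The generator squares_up_to is a hypothetical stand-in for a large data stream.
```python
def squares_up_to(n):
    # Yields one value at a time instead of building the whole list in memory.
    for i in range(n):
        yield i * i
stream = squares_up_to(5)
first_pass = sum(stream)     # consumes every value the generator produces
second_pass = sum(stream)    # the generator is now exhausted, so nothing is left
# Materializing a list costs memory but allows any number of passes.
as_list = list(squares_up_to(5))
list_first_pass = sum(as_list)
list_second_pass = sum(as_list)
```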

230
00:10:51.440 --> 00:10:55.960
<v Speaker 1>Okay, all of this, the dictionaries, the generators, it's beautiful logic.

231
00:10:56.320 --> 00:10:58.840
<v Speaker 1>But all of these elegant Python tools are really just

232
00:10:59.039 --> 00:11:01.799
<v Speaker 1>for us, the developers. Sure. If I take a defaultdict

233
00:11:01.799 --> 00:11:04.080
<v Speaker 1>full of raw frequencies and drop it on the

234
00:11:04.159 --> 00:11:07.039
<v Speaker 1>VP of networking's desk, their eyes are going to glaze over.

235
00:11:07.240 --> 00:11:10.360
<v Speaker 1>You have to translate those numbers into a visual space.

236
00:11:10.159 --> 00:11:13.159
<v Speaker 2>Which brings us to the visual translation layer, matplotlib.

237
00:11:13.559 --> 00:11:16.919
<v Speaker 1>The book touches on the standard toolkit, you know, bar

238
00:11:17.120 --> 00:11:21.600
<v Speaker 1>charts for buckets of discrete data, line charts for continuous trends,

239
00:11:21.919 --> 00:11:25.279
<v Speaker 1>scatterplots for pairing two variables together to see if they correlate.

240
00:11:25.320 --> 00:11:27.879
<v Speaker 1>The basics, right, but we don't need to dwell on

241
00:11:27.919 --> 00:11:30.960
<v Speaker 1>what a line chart is. What's interesting is that the

242
00:11:31.000 --> 00:11:33.879
<v Speaker 1>book spends a very specific amount of time warning the

243
00:11:33.919 --> 00:11:36.120
<v Speaker 1>reader about the mechanics of visual deception.

244
00:11:36.360 --> 00:11:39.000
<v Speaker 2>Yes, the trap of the misleading y-axis. Yeah, when

245
00:11:39.039 --> 00:11:41.879
<v Speaker 2>you're translating raw numbers into a visual space, you are

246
00:11:41.960 --> 00:11:45.559
<v Speaker 2>essentially creating a narrative, and that narrative is incredibly easy

247
00:11:45.559 --> 00:11:46.200
<v Speaker 2>to manipulate.

248
00:11:46.440 --> 00:11:49.559
<v Speaker 1>Here's where it gets really interesting. The example Grus uses

249
00:11:49.600 --> 00:11:52.720
<v Speaker 1>is brilliant. Let's say we are tracking how many times

250
00:11:52.720 --> 00:11:55.879
<v Speaker 1>the phrase data science is mentioned on user profiles. Yep.

251
00:11:55.960 --> 00:11:58.120
<v Speaker 1>The data shows that in twenty thirteen there were five

252
00:11:58.200 --> 00:12:01.360
<v Speaker 1>hundred mentions. In twenty fourteen there were five hundred and

253
00:12:01.440 --> 00:12:02.240
<v Speaker 1>five mentions.

254
00:12:02.320 --> 00:12:04.480
<v Speaker 2>That is an increase of exactly five. Right.

255
00:12:04.679 --> 00:12:06.799
<v Speaker 1>So if you plot those two bars on a chart

256
00:12:06.960 --> 00:12:09.600
<v Speaker 1>and you start your y-axis at zero, the difference

257
00:12:09.639 --> 00:12:12.679
<v Speaker 1>between five hundred and five oh five is basically imperceptible.

258
00:12:12.799 --> 00:12:15.480
<v Speaker 2>It looks like a flatline. The narrative there is growth

259
00:12:15.480 --> 00:12:16.919
<v Speaker 2>has completely stagnated.

260
00:12:17.080 --> 00:12:18.919
<v Speaker 1>But what if I want my boss to think I'm

261
00:12:18.919 --> 00:12:22.360
<v Speaker 1>doing an amazing job growing the platform. I go into

262
00:12:22.360 --> 00:12:25.120
<v Speaker 1>my plotting tool and I manually force the axis to

263
00:12:25.120 --> 00:12:27.080
<v Speaker 1>start at four ninety nine and end at five oh six.

264
00:12:26.879 --> 00:12:30.679
<v Speaker 2>And suddenly the twenty fourteen bar is towering

265
00:12:30.720 --> 00:12:35.279
<v Speaker 2>over the twenty thirteen bar. It visually implies this massive

266
00:12:35.720 --> 00:12:37.080
<v Speaker 2>explosive increase.
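
NOTE
A small arithmetic sketch of why cropping the y-axis exaggerates the difference,
assuming Python 3. It uses the mention counts and axis limits quoted above;
bar_height_fraction is a hypothetical helper, and no plotting library is needed.
In matplotlib the same choice would be made by setting the axis limits explicitly,
for example with plt.axis.
```python
def bar_height_fraction(value, y_min, y_max):
    # Share of the chart's vertical extent that a bar drawn up to value occupies.
    return (value - y_min) / (y_max - y_min)
mentions = {2013: 500, 2014: 505}
# Axis starting at zero: the two bars are almost the same height.
honest = {year: bar_height_fraction(count, 0, 506) for year, count in mentions.items()}
# Axis cropped to 499..506: the same data looks like explosive growth.
cropped = {year: bar_height_fraction(count, 499, 506) for year, count in mentions.items()}
honest_ratio = honest[2014] / honest[2013]      # about 1.01
cropped_ratio = cropped[2014] / cropped[2013]   # about 6
```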

267
00:12:37.480 --> 00:12:40.240
<v Speaker 1>It's the visual equivalent of taking a photo of a puddle,

268
00:12:40.519 --> 00:12:42.879
<v Speaker 1>cropping it incredibly tight so you can't see the edges,

269
00:12:43.039 --> 00:12:45.360
<v Speaker 1>and trying to convince the viewer it's the ocean.

270
00:12:45.559 --> 00:12:46.600
<v Speaker 2>That's exactly what it is.

271
00:12:46.799 --> 00:12:49.159
<v Speaker 1>If it is that easy to lie with data. How

272
00:12:49.159 --> 00:12:52.000
<v Speaker 1>can anyone trust a chart in a corporate presentation.

273
00:12:52.240 --> 00:12:56.039
<v Speaker 2>Well, what's fascinating here is this vulnerability is exactly why

274
00:12:56.080 --> 00:12:58.000
<v Speaker 2>the premise of the book is so vital. How so?

275
00:12:58.440 --> 00:13:02.480
<v Speaker 2>When you use black box visualization libraries, they often auto

276
00:13:02.519 --> 00:13:05.600
<v Speaker 2>scale the axes based on the minimum and maximum values

277
00:13:05.600 --> 00:13:08.039
<v Speaker 2>of the data provided. Oh, I see. A tool might

278
00:13:08.080 --> 00:13:10.960
<v Speaker 2>automatically crop the axis to four ninety nine without any

279
00:13:11.000 --> 00:13:13.399
<v Speaker 2>malicious intent, just to save space on the screen.

280
00:13:13.840 --> 00:13:15.840
<v Speaker 1>So the software might accidentally lie to me.

281
00:13:15.799 --> 00:13:18.440
<v Speaker 2>Yes, or a human might do it intentionally to

282
00:13:18.480 --> 00:13:21.120
<v Speaker 2>sell you a false narrative. But when you learn

283
00:13:21.159 --> 00:13:24.679
<v Speaker 2>matplotlib from scratch and you physically write the code to

284
00:13:24.759 --> 00:13:28.039
<v Speaker 2>define the axis limits, you see how it works. Exactly.

285
00:13:28.080 --> 00:13:31.639
<v Speaker 2>You internalize the mechanics of how the visual is constructed.

286
00:13:31.919 --> 00:13:35.039
<v Speaker 2>You train your brain to instantly look at the axis

287
00:13:35.279 --> 00:13:37.679
<v Speaker 2>before you let the shape of the line influence your judgment.

288
00:13:38.399 --> 00:13:41.080
<v Speaker 2>You're basically immunizing yourself against bad data science.

289
00:13:41.120 --> 00:13:43.519
<v Speaker 1>I love that immunizing yourself, But I want to push

290
00:13:43.600 --> 00:13:46.200
<v Speaker 1>deeper into the actual geometry of what we're doing here.

291
00:13:46.240 --> 00:13:46.919
<v Speaker 2>Okay, let's do it.

292
00:13:46.960 --> 00:13:50.279
<v Speaker 1>When we take two variables, let's say we're plotting a

293
00:13:50.399 --> 00:13:53.360
<v Speaker 1>user's number of friends against the minutes they spend on

294
00:13:53.360 --> 00:13:56.000
<v Speaker 1>the site every day, and we put a dot on a

295
00:13:56.039 --> 00:13:59.440
<v Speaker 1>scatter plot, we are essentially placing a vector in a

296
00:13:59.480 --> 00:14:00.440
<v Speaker 1>mathematical space.

297
00:14:00.720 --> 00:14:04.720
<v Speaker 2>Correct. And to truly grasp how algorithms find patterns in

298
00:14:04.759 --> 00:14:07.919
<v Speaker 2>that space, we have to strip away the Python syntax,

299
00:14:08.240 --> 00:14:10.879
<v Speaker 2>strip away the visual charts, and look at the hidden

300
00:14:10.960 --> 00:14:12.720
<v Speaker 2>architecture that runs the entire discipline.

301
00:14:12.480 --> 00:14:15.639
<v Speaker 1>Which is linear algebra.

302
00:14:15.279 --> 00:14:16.679
<v Speaker 2>The dreaded linear algebra.

303
00:14:16.840 --> 00:14:18.840
<v Speaker 1>That is the phrase that makes half the room break out

304
00:14:18.840 --> 00:14:19.679
<v Speaker 1>in a cold sweat.

305
00:14:19.720 --> 00:14:20.399
<v Speaker 2>I know, I know.

306
00:14:20.679 --> 00:14:23.799
<v Speaker 1>But abstractly, vectors are just objects that can be added

307
00:14:23.799 --> 00:14:27.279
<v Speaker 1>together or multiplied by scalars to form new vectors.

308
00:14:27.639 --> 00:14:30.919
<v Speaker 2>And concretely, for our purposes anyway, they are simply points

309
00:14:30.960 --> 00:14:34.840
<v Speaker 2>in a finite dimensional space. Representing user data as vectors

310
00:14:35.200 --> 00:14:38.519
<v Speaker 2>is the foundational trick of machine learning. The text gives

311
00:14:38.559 --> 00:14:43.360
<v Speaker 2>a very grounding example a person's physical attributes. You have

312
00:14:43.399 --> 00:14:46.320
<v Speaker 2>a list of three numbers: seventy, one seventy, and forty.

313
00:14:46.320 --> 00:14:49.480
<v Speaker 1>Meaning seventy inches tall, one hundred and seventy pounds,

314
00:14:49.519 --> 00:14:50.519
<v Speaker 1>and forty years old.

315
00:14:50.279 --> 00:14:54.480
<v Speaker 2>Exactly. In Python it is just a list. But mathematically

316
00:14:54.519 --> 00:14:57.200
<v Speaker 2>it is a single coordinate in three dimensional space.

317
00:14:57.320 --> 00:14:59.840
<v Speaker 1>Okay, so I have this coordinate. But the book forces

318
00:14:59.879 --> 00:15:02.720
<v Speaker 1>us to actually do math with these lists from scratch.

319
00:15:02.960 --> 00:15:05.279
<v Speaker 1>If I want to add two user vectors together, I

320
00:15:05.320 --> 00:15:08.120
<v Speaker 1>can't just put a plus sign between two Python lists.

321
00:15:08.279 --> 00:15:10.159
<v Speaker 2>No, it doesn't work like that. Python would just

322
00:15:10.159 --> 00:15:11.960
<v Speaker 2>stick the two lists together end to end.

323
00:15:12.039 --> 00:15:14.440
<v Speaker 1>Right. So the book introduces the zip function. Walk me through

324
00:15:14.440 --> 00:15:15.399
<v Speaker 1>the mechanics of that.

325
00:15:15.600 --> 00:15:18.679
<v Speaker 2>Vectors must be added component wise. The first element of

326
00:15:18.759 --> 00:15:21.720
<v Speaker 2>vector A adds to the first element of vector B.

327
00:15:22.120 --> 00:15:24.480
<v Speaker 2>So if vector A is one two and vector B

328
00:15:24.679 --> 00:15:28.720
<v Speaker 2>is two one, the zip function acts like a physical zipper.

329
00:15:29.000 --> 00:15:31.559
<v Speaker 2>It takes the first element from both lists, the one

330
00:15:31.559 --> 00:15:34.480
<v Speaker 2>and the two, and binds them into a pair. Then

331
00:15:34.519 --> 00:15:36.279
<v Speaker 2>it takes the second elements, the two and the one,

332
00:15:36.480 --> 00:15:37.159
<v Speaker 2>and binds them.

333
00:15:37.240 --> 00:15:39.720
<v Speaker 1>And once they're paired up, the book uses a list

334
00:15:39.759 --> 00:15:43.600
<v Speaker 1>comprehension to iterate through those pairs, add them together, and

335
00:15:43.679 --> 00:15:46.039
<v Speaker 1>spit out a new vector: three, three.
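
NOTE
Editor's sketch of the zip-based addition the speakers describe, using their
example vectors [1, 2] and [2, 1]. The function name vector_add is an
assumption for illustration and is not quoted from the book. The sketch also
shows that "+" on two Python lists concatenates them instead of adding them.
```python
# Illustrative sketch; vector_add is a hypothetical name.
def vector_add(v, w):
    """Add two equal-length vectors component-wise."""
    # zip pairs up matching positions: (1, 2), then (2, 1)
    return [v_i + w_i for v_i, w_i in zip(v, w)]
concatenated = [1, 2] + [2, 1]       # list "+" joins the lists end to end
summed = vector_add([1, 2], [2, 1])  # component-wise sum
print(concatenated, summed)          # prints [1, 2, 2, 1] [3, 3]
```
Note that zip stops at the shorter input, so a real implementation would
probably want to check that both vectors have the same length first.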

336
00:15:46.559 --> 00:15:49.279
<v Speaker 2>From there you build a function for the dot product,

337
00:15:49.399 --> 00:15:52.279
<v Speaker 2>which is just multiplying those matching pairs and summing up

338
00:15:52.279 --> 00:15:52.720
<v Speaker 2>the results.
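
NOTE
Editor's sketch of the dot product as described here: multiply matching pairs,
then sum. The names dot and magnitude are assumptions for illustration. One
detail worth making explicit: a vector's magnitude is the square root of its
dot product with itself, not the dot product alone.
```python
import math
# Illustrative sketch; dot and magnitude are hypothetical names.
def dot(v, w):
    """Sum of the products of matching components."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))
def magnitude(v):
    """Length of v: square root of v dotted with itself."""
    return math.sqrt(dot(v, v))
print(dot([1, 2], [2, 1]))  # 1*2 + 2*1, prints 4
print(magnitude([3, 4]))    # sqrt(9 + 16), prints 5.0
```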

339
00:15:52.919 --> 00:15:55.360
<v Speaker 1>Let me stop you right there, because I understand the

340
00:15:55.399 --> 00:15:58.200
<v Speaker 1>mechanics of what you just described, But why are we

341
00:15:58.320 --> 00:16:01.200
<v Speaker 1>doing it? Why does an algorithm care about the sum

342
00:16:01.279 --> 00:16:03.759
<v Speaker 1>of multiplied components?

343
00:16:03.360 --> 00:16:06.279
<v Speaker 2>Because the dot product gives you the magnitude of a vector,

344
00:16:06.679 --> 00:16:09.639
<v Speaker 2>and more importantly, it allows you to calculate the angle

345
00:16:09.679 --> 00:16:10.759
<v Speaker 2>between two vectors.

346
00:16:10.840 --> 00:16:12.120
<v Speaker 1>And why does the angle matter?

347
00:16:12.279 --> 00:16:15.600
<v Speaker 2>This is how algorithms determine similarity. If you plot the

348
00:16:15.639 --> 00:16:18.840
<v Speaker 2>interests of two users as massive vectors, and you calculate

349
00:16:18.879 --> 00:16:21.440
<v Speaker 2>the cosine of the angle between them using the dot product,

350
00:16:21.679 --> 00:16:24.200
<v Speaker 2>the math tells you exactly how similar those two people are.

351
00:16:24.480 --> 00:16:27.519
<v Speaker 2>A small angle means their vectors are pointing in the

352
00:16:27.600 --> 00:16:31.159
<v Speaker 2>exact same direction. They like the exact same things. That

353
00:16:31.519 --> 00:16:34.360
<v Speaker 2>is literally how OkCupid knows who you should date.
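
NOTE
Editor's sketch of cosine similarity built from the dot product. The interest
vectors alice, bob, and carol are made-up data, and the function names are
assumptions. This illustrates the general technique only; it is not a
description of how any particular dating site or recommender is implemented.
```python
import math
# Illustrative sketch; all names and data below are hypothetical.
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))
def cosine_similarity(v, w):
    """Cosine of the angle between v and w (both must be non-zero)."""
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))
alice = [1, 1, 0, 1]  # interest scores across four topics
bob = [2, 2, 0, 2]    # same direction as alice, only longer
carol = [0, 0, 1, 0]  # shares no interests with alice
print(cosine_similarity(alice, bob))    # approximately 1: same direction
print(cosine_similarity(alice, carol))  # 0.0: perpendicular vectors
```
Because the result depends only on direction, bob's stronger scores do not
make him any less similar to alice.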

354
00:16:34.559 --> 00:16:37.519
<v Speaker 1>That is fascinating, But I have to be completely honest here.

355
00:16:38.120 --> 00:16:40.480
<v Speaker 1>I'm looking at the sheer amount of code it takes

356
00:16:40.519 --> 00:16:44.159
<v Speaker 1>to build a vector add function from scratch using zip

357
00:16:44.279 --> 00:16:49.919
<v Speaker 1>and list comprehensions. Highly optimized libraries like NumPy can execute

358
00:16:50.000 --> 00:16:53.240
<v Speaker 1>vector addition across millions of data points in a fraction

359
00:16:53.320 --> 00:16:54.080
<v Speaker 1>of a millisecond.

360
00:16:54.120 --> 00:16:55.399
<v Speaker 2>Well, absolutely. So Grus

361
00:16:55.480 --> 00:16:59.000
<v Speaker 1>forcing us to manually zip lists together feels a bit

362
00:16:59.080 --> 00:17:00.639
<v Speaker 1>like hazing. Why are we doing this?

363
00:17:01.120 --> 00:17:03.200
<v Speaker 2>It's a fair question, but think of it this way.

364
00:17:03.639 --> 00:17:07.400
<v Speaker 2>If you don't understand basic arithmetic, a pocket calculator seems

365
00:17:07.400 --> 00:17:09.839
<v Speaker 2>like a magic box. You punch in some numbers and

366
00:17:09.880 --> 00:17:12.160
<v Speaker 2>it spits out the truth, right. But if you accidentally

367
00:17:12.240 --> 00:17:15.039
<v Speaker 2>hit the division key instead of the multiplication key, the

368
00:17:15.119 --> 00:17:17.640
<v Speaker 2>calculator will spit out a completely absurd answer.

369
00:17:17.799 --> 00:17:20.799
<v Speaker 1>And because I don't intuitively understand the math, I wouldn't

370
00:17:20.799 --> 00:17:23.480
<v Speaker 1>even realize the answer is absurd. I'd just trust the

371
00:17:23.519 --> 00:17:25.000
<v Speaker 1>screen. Exactly.

372
00:17:25.880 --> 00:17:30.279
<v Speaker 2>Relying on NumPy without understanding linear algebra is the exact

373
00:17:30.279 --> 00:17:33.759
<v Speaker 2>same danger, but on a massive scale. When everything works,

374
00:17:34.039 --> 00:17:37.960
<v Speaker 2>you're fine. But what happens when you encounter the curse

375
00:17:38.000 --> 00:17:41.960
<v Speaker 2>of dimensionality? The what? The curse of dimensionality. What happens when

376
00:17:42.000 --> 00:17:44.799
<v Speaker 2>you're dealing with data in one hundred dimensions and the

377
00:17:44.880 --> 00:17:50.119
<v Speaker 2>distance between all your points mathematically approaches uniformity, breaking your

378
00:17:50.200 --> 00:17:51.000
<v Speaker 2>predictive model?
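
NOTE
Editor's sketch of the effect described here, under stated assumptions: points
are drawn uniformly at random from the unit cube, and the helper names and the
choice of 1000 pairs are illustrative only. As the dimension grows, the
closest pair of points ends up nearly as far apart as the average pair, so
"nearest" stops carrying much information.
```python
import math
import random
# Illustrative sketch; names, sample size, and seed are hypothetical choices.
def distance(v, w):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))
def random_point(dim, rng):
    return [rng.random() for _ in range(dim)]
def min_to_mean_ratio(dim, num_pairs=1000, seed=0):
    """Smallest pairwise distance divided by the average one."""
    rng = random.Random(seed)
    distances = [distance(random_point(dim, rng), random_point(dim, rng))
                 for _ in range(num_pairs)]
    return min(distances) / (sum(distances) / len(distances))
for dim in (2, 10, 100):
    # The ratio climbs toward 1 as the dimension increases.
    print(dim, round(min_to_mean_ratio(dim), 3))
```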

379
00:17:51.759 --> 00:17:53.559
<v Speaker 1>The black box isn't going to tell me why it broke.

380
00:17:53.680 --> 00:17:56.519
<v Speaker 2>No, it won't, And because you don't understand the arithmetic

381
00:17:56.559 --> 00:17:59.119
<v Speaker 2>of the vector space, you won't know how to fix it.

382
00:17:59.359 --> 00:18:02.599
<v Speaker 2>You won't realize your data is behaving absurdly. Building

383
00:18:02.599 --> 00:18:06.359
<v Speaker 2>it from scratch builds your mathematical intuition. It ensures you

384
00:18:06.400 --> 00:18:08.440
<v Speaker 2>remain the mechanic, not just a passenger.

385
00:18:08.640 --> 00:18:11.759
<v Speaker 1>That perfectly encapsulates this entire journey. I mean, we started

386
00:18:11.759 --> 00:18:15.680
<v Speaker 1>today looking at a raw, confusing data dump from DataSciencester.

387
00:18:16.119 --> 00:18:19.759
<v Speaker 1>We realized that default metrics like degree centrality can completely

388
00:18:19.799 --> 00:18:22.480
<v Speaker 1>blind us to the actual structural reality of a network.

389
00:18:22.599 --> 00:18:25.559
<v Speaker 2>We tamed that raw data using the strict elegance of

390
00:18:25.599 --> 00:18:29.599
<v Speaker 2>Python, learning the mechanical advantage of tools like defaultdict

391
00:18:29.680 --> 00:18:33.000
<v Speaker 2>to skip boilerplate logic, and lazy generators to keep our

392
00:18:33.039 --> 00:18:35.680
<v Speaker 2>systems from crashing under the weight of massive lists.

393
00:18:35.920 --> 00:18:38.640
<v Speaker 1>We took those insights and visualized them, digging into the

394
00:18:38.640 --> 00:18:41.799
<v Speaker 1>mechanics of the axes so we could actively defend ourselves

395
00:18:41.880 --> 00:18:46.480
<v Speaker 1>against manipulated narratives. Yes, and finally, we translated our users

396
00:18:46.519 --> 00:18:51.240
<v Speaker 1>into mathematical vectors, exploring the literal architecture of similarity that

397
00:18:51.319 --> 00:18:53.599
<v Speaker 1>powers every recommendation engine on Earth.

398
00:18:53.960 --> 00:18:57.200
<v Speaker 2>Knowing these first principles is your shield. Whether you are

399
00:18:57.240 --> 00:19:00.480
<v Speaker 2>analyzing your own company's user behavior, preparing for a high

400
00:19:00.519 --> 00:19:03.400
<v Speaker 2>level strategy meeting, or just trying to navigate a world

401
00:19:03.440 --> 00:19:08.079
<v Speaker 2>driven by algorithms, understanding the how protects you against the hype.

402
00:19:08.240 --> 00:19:09.920
<v Speaker 1>I want to leave you with a final thought to

403
00:19:10.000 --> 00:19:12.880
<v Speaker 1>chew on something that stretches beyond the scope of Python

404
00:19:12.920 --> 00:19:16.279
<v Speaker 1>two point seven. Okay, we've spent this entire deep dive

405
00:19:16.400 --> 00:19:19.920
<v Speaker 1>manually building these models so we can truly comprehend the

406
00:19:19.920 --> 00:19:22.680
<v Speaker 1>how and the why. When we talk about a three

407
00:19:22.720 --> 00:19:26.599
<v Speaker 1>dimensional vector of height, weight, and age, our human brains

408
00:19:26.640 --> 00:19:29.759
<v Speaker 1>can picture it. We can imagine that dot floating in

409
00:19:29.799 --> 00:19:30.319
<v Speaker 1>a room.

410
00:19:30.519 --> 00:19:32.839
<v Speaker 2>But the bleeding edge of this field isn't operating in

411
00:19:32.880 --> 00:19:34.119
<v Speaker 2>three dimensions. Exactly.

412
00:19:34.799 --> 00:19:38.079
<v Speaker 1>Modern neural networks and deep learning systems are analyzing patterns

413
00:19:38.119 --> 00:19:40.680
<v Speaker 1>across millions of dimensions simultaneously.

414
00:19:40.759 --> 00:19:41.640
<v Speaker 2>Yeah, it's unfathomable.

415
00:19:41.799 --> 00:19:45.480
<v Speaker 1>They are finding mathematical correlations in vector spaces that no

416
00:19:45.680 --> 00:19:49.720
<v Speaker 1>human mind can actually visualize or comprehend. So the provocative

417
00:19:49.799 --> 00:19:54.400
<v Speaker 1>question is this: as these self-learning algorithms become infinitely

418
00:19:54.440 --> 00:19:58.319
<v Speaker 1>more complex, will we eventually reach a threshold where even

419
00:19:58.359 --> 00:20:02.039
<v Speaker 1>the mechanics, the people who build systems from absolute scratch,

420
00:20:02.519 --> 00:20:06.279
<v Speaker 1>can no longer truly understand the why behind the insights

421
00:20:06.319 --> 00:20:07.359
<v Speaker 1>the machine produces?

422
00:20:07.480 --> 00:20:10.240
<v Speaker 2>We're moving from an era of tools we fully comprehend

423
00:20:10.640 --> 00:20:13.920
<v Speaker 2>to an era of intelligences we merely point in a direction.

424
00:20:14.039 --> 00:20:15.680
<v Speaker 1>We might end up right back where we started on

425
00:20:15.759 --> 00:20:20.640
<v Speaker 1>day one, staring at an unimaginably massive humming engine block

426
00:20:21.000 --> 00:20:24.279
<v Speaker 1>and having absolutely no idea why it's moving. Thank you

427
00:20:24.359 --> 00:20:26.680
<v Speaker 1>for joining us on this deep dive into the foundations

428
00:20:26.720 --> 00:20:27.519
<v Speaker 1>of data science.
