WEBVTT

1
00:00:00.080 --> 00:00:03.919
<v Speaker 1>Okay, let's unpack this today. We're embarking on a deep

2
00:00:03.960 --> 00:00:09.640
<v Speaker 1>dive into the fascinating, sometimes maybe often hyped, but fundamentally

3
00:00:09.679 --> 00:00:11.080
<v Speaker 1>important world of data science.

4
00:00:11.199 --> 00:00:13.000
<v Speaker 2>Definitely hyped at times, right?

5
00:00:13.000 --> 00:00:16.480
<v Speaker 1>But our mission here is really to demystify what data science

6
00:00:16.519 --> 00:00:19.320
<v Speaker 1>truly is. We want to explore its core processes, the

7
00:00:19.399 --> 00:00:22.519
<v Speaker 1>essential tools, and also confront some of the, well, the

8
00:00:22.559 --> 00:00:26.280
<v Speaker 1>critical real-world challenges that come with working with data.

9
00:00:25.960 --> 00:00:27.600
<v Speaker 2>And the ethical ones too. They're huge.

10
00:00:27.640 --> 00:00:31.800
<v Speaker 1>Absolutely. We'll be drawing our insights primarily from Rachel Schutt's

11
00:00:31.800 --> 00:00:35.119
<v Speaker 1>pioneering book, Doing Data Science: Straight Talk from the Frontline,

12
00:00:35.159 --> 00:00:38.000
<v Speaker 1>which came out of her course at Columbia University.

13
00:00:37.560 --> 00:00:40.079
<v Speaker 2>A really groundbreaking course back then.

14
00:00:40.240 --> 00:00:42.880
<v Speaker 1>Exactly. So by the end of this deep dive you

15
00:00:42.880 --> 00:00:46.000
<v Speaker 1>should have a much clearer understanding of this field, hopefully

16
00:00:46.000 --> 00:00:48.159
<v Speaker 1>equipped with the knowledge to, you know, cut through the

17
00:00:48.200 --> 00:00:51.439
<v Speaker 1>noise and see why it's relevant. So let's dive straight

18
00:00:51.479 --> 00:00:55.079
<v Speaker 1>into the heart of it. What is data science? It's

19
00:00:55.119 --> 00:00:58.240
<v Speaker 1>a question that even the pioneers of the field really

20
00:00:58.320 --> 00:00:58.840
<v Speaker 1>grappled with.

21
00:00:58.960 --> 00:01:00.520
<v Speaker 2>They really did.

22
00:01:00.320 --> 00:01:03.280
<v Speaker 1>Until Schutt's Introduction to Data Science course at Columbia. I think

23
00:01:03.280 --> 00:01:06.840
<v Speaker 1>it started in fall 2012. That really acted as an

24
00:01:06.879 --> 00:01:08.519
<v Speaker 1>incubator for this whole idea.

25
00:01:08.680 --> 00:01:11.040
<v Speaker 2>Yeah, it was a starting point, and Cathy O'Neil, from

26
00:01:11.000 --> 00:01:14.959
<v Speaker 2>the mathbabe.org blog, she was instrumental in bringing

27
00:01:14.959 --> 00:01:19.200
<v Speaker 2>these ideas out, specifically pushing back against all the marketing hype.

28
00:01:19.239 --> 00:01:20.640
<v Speaker 1>There was a lot of hype back then.

29
00:01:20.760 --> 00:01:24.400
<v Speaker 2>Oh yeah. A crucial point here is that initial sort

30
00:01:24.400 --> 00:01:28.159
<v Speaker 2>of bewilderment around it. The term was just vague. People

31
00:01:28.200 --> 00:01:30.840
<v Speaker 2>were throwing around phrases like masters of the universe for

32
00:01:30.959 --> 00:01:32.400
<v Speaker 2>data scientists.

33
00:01:31.920 --> 00:01:34.560
<v Speaker 1>Right, which must have annoyed some statisticians.

34
00:01:34.599 --> 00:01:36.640
<v Speaker 2>You can imagine they felt like, hey, that's our field.

35
00:01:36.719 --> 00:01:40.599
<v Speaker 2>The science of data. Identity theft, almost. But the core

36
00:01:40.760 --> 00:01:42.519
<v Speaker 2>argument in the book, and I think it holds up,

37
00:01:42.640 --> 00:01:45.599
<v Speaker 2>is that data science isn't just rebranding.

38
00:01:45.120 --> 00:01:46.799
<v Speaker 1>Not just a new buzzword.

39
00:01:47.040 --> 00:01:49.959
<v Speaker 2>Exactly. It's genuinely a new idea, maybe still a bit fragile

40
00:01:50.000 --> 00:01:54.359
<v Speaker 2>or evolving, but it uniquely combines foundations from statistics and

41
00:01:54.400 --> 00:01:57.920
<v Speaker 2>computer science. Plus it has this distinct process tied to it.

42
00:01:58.200 --> 00:02:00.319
<v Speaker 1>And part of that newness I think came from this

43
00:02:00.400 --> 00:02:04.680
<v Speaker 1>idea of datafication. Kenneth Cukier and Viktor Mayer-Schönberger talked

44
00:02:04.719 --> 00:02:07.120
<v Speaker 1>about this in Foreign Affairs, maybe mid-2013.

45
00:02:07.040 --> 00:02:09.159
<v Speaker 2>"The Rise of Big Data." Yeah.

46
00:02:09.240 --> 00:02:13.439
<v Speaker 1>They defined datafication as basically taking all aspects of life

47
00:02:13.479 --> 00:02:15.159
<v Speaker 1>and turning them into data.

48
00:02:14.919 --> 00:02:15.879
<v Speaker 2>Which sounds huge.

49
00:02:16.000 --> 00:02:20.240
<v Speaker 1>It is. Think about it: Google Glass datafying your gaze, Twitter

50
00:02:20.319 --> 00:02:23.879
<v Speaker 1>turning stray thoughts into data points, LinkedIn mapping out your

51
00:02:23.879 --> 00:02:28.800
<v Speaker 1>professional life. Everything becomes potentially quantifiable.

52
00:02:28.120 --> 00:02:31.560
<v Speaker 2>Which immediately makes you ask: okay, who is this "we"

53
00:02:32.080 --> 00:02:34.840
<v Speaker 2>doing the datafying and what kind of value are they

54
00:02:34.879 --> 00:02:39.319
<v Speaker 2>actually creating? Often it's, well, modelers, entrepreneurs.

55
00:02:38.800 --> 00:02:41.439
<v Speaker 1>Looking for efficiency, automation.

56
00:02:41.560 --> 00:02:44.120
<v Speaker 2>Pretty much, yeah. And what's really striking is that so much of

57
00:02:44.120 --> 00:02:46.800
<v Speaker 2>this wasn't bubbling up in academia initially. It was happening

58
00:02:47.039 --> 00:02:50.400
<v Speaker 2>in industry, in tech companies. That's quite different from how

59
00:02:50.400 --> 00:02:51.800
<v Speaker 2>statistics traditionally developed.

60
00:02:52.039 --> 00:02:55.000
<v Speaker 1>So if it's this broad new thing happening in industry,

61
00:02:55.639 --> 00:02:58.840
<v Speaker 1>what does a data scientist actually look like? Schutt had

62
00:02:58.919 --> 00:03:00.240
<v Speaker 1>this interesting exercise for her students.

63
00:03:00.159 --> 00:03:02.159
<v Speaker 2>The self-profiling, yeah.

64
00:03:02.039 --> 00:03:06.680
<v Speaker 1>Right, rate yourself on computer science, math, stats, machine learning,

65
00:03:06.800 --> 00:03:11.800
<v Speaker 1>domain expertise, communication, and viz, data visualization.

66
00:03:11.759 --> 00:03:14.199
<v Speaker 2>And the results were all over the place. It showed

67
00:03:14.240 --> 00:03:17.479
<v Speaker 2>pretty clearly that you know, no single person is going

68
00:03:17.520 --> 00:03:19.120
<v Speaker 2>to be brilliant at all of those things.

69
00:03:19.199 --> 00:03:24.120
<v Speaker 1>Yeah, unicorns are rare.

70
00:03:24.159 --> 00:03:27.639
<v Speaker 2>Exactly, which led to this idea: maybe it's more useful to define a data science team

71
00:03:28.039 --> 00:03:30.759
<v Speaker 2>than one perfect data scientist.

72
00:03:30.439 --> 00:03:32.840
<v Speaker 1>Makes sense. Like that Josh Wills quote.

73
00:03:32.840 --> 00:03:35.879
<v Speaker 2>Oh yeah, the classic: a person who is better at statistics than

74
00:03:35.919 --> 00:03:39.520
<v Speaker 2>any software engineer, and better at software engineering than any statistician.

75
00:03:39.639 --> 00:03:40.879
<v Speaker 1>That captures it pretty well.

76
00:03:41.039 --> 00:03:45.479
<v Speaker 2>It does. Fundamentally, a data scientist extracts meaning from and interprets data.

77
00:03:45.840 --> 00:03:48.879
<v Speaker 2>They need tools from stats and machine learning, sure, but

78
00:03:49.199 --> 00:03:53.639
<v Speaker 2>also crucially human intuition, and let's be honest, a huge

79
00:03:53.639 --> 00:03:57.240
<v Speaker 2>part of the job is just collecting, cleaning, and munging data. Yeah,

80
00:03:57.280 --> 00:03:59.759
<v Speaker 2>wrestling with it, because real-world data is just

81
00:04:00.000 --> 00:04:02.280
<v Speaker 2>inherently messy. Always.

82
00:04:02.039 --> 00:04:05.240
<v Speaker 1>Okay, so moving from the who to the how: how does

83
00:04:05.280 --> 00:04:07.840
<v Speaker 1>this actually get done? We hear big data all the time,

84
00:04:07.919 --> 00:04:08.719
<v Speaker 1>but it's kind of a vague term.

85
00:04:08.680 --> 00:04:10.159
<v Speaker 2>It really is.

86
00:04:10.319 --> 00:04:10.560
<v Speaker 1>Yeah.

87
00:04:10.599 --> 00:04:14.560
<v Speaker 2>The book breaks it down nicely though, three parts. One,

88
00:04:14.919 --> 00:04:19.000
<v Speaker 2>it's a set of technologies. Two, it's potentially a revolution

89
00:04:19.160 --> 00:04:21.839
<v Speaker 2>in how we measure things. And three, it's a point of view,

90
00:04:21.879 --> 00:04:25.079
<v Speaker 2>really a philosophy about how decisions are going to be

91
00:04:25.120 --> 00:04:27.439
<v Speaker 2>made in the future based on data.

92
00:04:27.120 --> 00:04:30.680
<v Speaker 1>Right, and connecting big data back to basic stats, like

93
00:04:31.360 --> 00:04:32.879
<v Speaker 1>populations and samples, that seems important.

94
00:04:33.000 --> 00:04:37.040
<v Speaker 2>Oh, absolutely critical. There's this dangerous assumption

95
00:04:37.160 --> 00:04:42.040
<v Speaker 2>sometimes with big data that N = all. You know, you have

96
00:04:42.079 --> 00:04:42.399
<v Speaker 2>all the data.

97
00:04:42.399 --> 00:04:44.000
<v Speaker 1>But you never really do.

98
00:04:44.040 --> 00:04:46.920
<v Speaker 2>You pretty much never do. There's always something missing, some

99
00:04:47.079 --> 00:04:50.279
<v Speaker 2>context you don't have. Kate Crawford's talk on the Hurricane

100
00:04:50.279 --> 00:04:52.439
<v Speaker 2>Sandy tweets is such a powerful example of this.

101
00:04:52.680 --> 00:04:53.519
<v Speaker 1>What was the gist of that?

102
00:04:53.800 --> 00:04:55.920
<v Speaker 2>Well, looking at the tweets, you might think New Yorkers

103
00:04:55.920 --> 00:04:58.720
<v Speaker 2>were just casually shopping before the storm and partying after.

104
00:04:59.399 --> 00:05:01.560
<v Speaker 2>But that's because they were the ones tweeting heavily.

105
00:05:01.920 --> 00:05:03.879
<v Speaker 1>Ah, so it missed the people really affected.

106
00:05:03.759 --> 00:05:08.319
<v Speaker 2>Exactly. Coastal New Jerseyans whose homes were being destroyed,

107
00:05:08.639 --> 00:05:12.000
<v Speaker 2>They weren't tweeting about their grocery runs. It just shows

108
00:05:12.000 --> 00:05:16.240
<v Speaker 2>how subjective the whole process is. You, the data scientist,

109
00:05:16.519 --> 00:05:19.879
<v Speaker 2>are turning the world into data. It's not objective.

110
00:05:20.000 --> 00:05:21.800
<v Speaker 1>Data doesn't just speak for itself.

111
00:05:22.079 --> 00:05:24.720
<v Speaker 2>Never. Be very skeptical if someone claims it does.

112
00:05:25.120 --> 00:05:28.519
<v Speaker 1>Okay, so data is subjective. We need context. Then we

113
00:05:28.560 --> 00:05:31.839
<v Speaker 1>get to modeling. This sounds like where the magic happens.

114
00:05:31.519 --> 00:05:34.240
<v Speaker 2>Or the hard work. Maybe both. And when we

115
00:05:34.279 --> 00:05:36.759
<v Speaker 2>say model here, we don't mean, like, a database schema.

116
00:05:36.959 --> 00:05:39.199
<v Speaker 2>We mean a statistical model.

117
00:05:38.920 --> 00:05:40.160
<v Speaker 1>Like a mathematical function.

118
00:05:40.360 --> 00:05:43.000
<v Speaker 2>Yeah, one that tries to capture the uncertainty, the randomness,

119
00:05:43.000 --> 00:05:46.279
<v Speaker 2>in how the data was generated. And building these, it's

120
00:05:46.319 --> 00:05:49.120
<v Speaker 2>definitely part art, part science. Textbooks don't really give you

121
00:05:49.160 --> 00:05:51.319
<v Speaker 2>a step by step guide. You have to make assumptions,

122
00:05:51.399 --> 00:05:53.839
<v Speaker 2>a lot of assumptions about reality. But yeah, we'll get

123
00:05:53.879 --> 00:05:54.639
<v Speaker 2>into how that works.

124
00:05:54.720 --> 00:05:57.399
<v Speaker 1>And you mentioned a big pitfall here overfitting.

125
00:05:57.879 --> 00:06:02.480
<v Speaker 2>Yes, get ready to hear about overfitting a lot, possibly

126
00:06:02.560 --> 00:06:03.839
<v Speaker 2>until you have nightmares about it.

127
00:06:03.439 --> 00:06:05.199
<v Speaker 1>Okay, okay. So what is it?

128
00:06:05.199 --> 00:06:07.480
<v Speaker 2>It's when your model gets too good at explaining the

129
00:06:07.560 --> 00:06:10.480
<v Speaker 2>specific data you trained it on, including all the random

130
00:06:10.519 --> 00:06:12.600
<v Speaker 2>noise and quirks in that sample.

131
00:06:12.399 --> 00:06:14.800
<v Speaker 1>So it learns the noise, not the signal.

132
00:06:14.519 --> 00:06:18.240
<v Speaker 2>Precisely, and then it fails, often badly, when you try

133
00:06:18.240 --> 00:06:20.959
<v Speaker 2>to use it on new unseen data. It hasn't learned

134
00:06:20.959 --> 00:06:23.199
<v Speaker 2>the general pattern, just the specifics of the test it's

135
00:06:23.240 --> 00:06:24.759
<v Speaker 2>studied for, so to speak.

136
00:06:24.519 --> 00:06:26.759
<v Speaker 1>Right, it can't generalize. So before we even get to

137
00:06:26.800 --> 00:06:29.120
<v Speaker 1>complex models, what's the first step.

138
00:06:29.000 --> 00:06:34.240
<v Speaker 2>Exploratory data analysis, EDA. Yeah, it's absolutely fundamental.

139
00:06:33.759 --> 00:06:36.560
<v Speaker 1>And that's more than just plotting things?

140
00:06:36.720 --> 00:06:40.160
<v Speaker 2>Oh, much more. It's a mindset. It's about getting intuition, understanding the shape

141
00:06:40.160 --> 00:06:42.160
<v Speaker 2>of your data, feeling how it connects back to the

142
00:06:42.199 --> 00:06:43.680
<v Speaker 2>real world process that created it.

143
00:06:43.720 --> 00:06:45.839
<v Speaker 1>So what does it help you do, practically?

144
00:06:45.959 --> 00:06:50.199
<v Speaker 2>Well, gain intuition, obviously. Make comparisons, do basic sanity checks: is

145
00:06:50.240 --> 00:06:52.839
<v Speaker 2>the data on the right scale, in the right format?

146
00:06:53.079 --> 00:06:58.199
<v Speaker 2>Spot missing values or crazy outliers, summarize things, even debug

147
00:06:58.240 --> 00:06:59.839
<v Speaker 2>how the data was logged in the first place.

148
00:07:00.000 --> 00:07:02.399
<v Speaker 1>Okay, like the example with the New York Times ad

149
00:07:02.480 --> 00:07:07.279
<v Speaker 1>data, nyt1.csv through nyt31.csv.

150
00:07:07.399 --> 00:07:11.560
<v Speaker 2>Exactly, the students had to plot distributions of ad impressions

151
00:07:12.040 --> 00:07:15.199
<v Speaker 2>and click-through rate, the CTR, for different age groups,

152
00:07:15.759 --> 00:07:20.720
<v Speaker 2>and segment users by whether they clicked or not using

153
00:07:20.879 --> 00:07:23.879
<v Speaker 2>R, in that case. It forces you to really look

154
00:07:23.879 --> 00:07:24.759
<v Speaker 2>at the data first.

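NOTE
A minimal sketch of the click-through-rate summary described above, written in Python rather than the R used in the course. The field names (age, impressions, clicks), the age brackets, and the sample rows are assumptions made up for illustration, not taken from the actual CSV files.
```python
from collections import defaultdict
# Hypothetical age brackets: (label, inclusive lower bound, inclusive upper bound).
AGE_GROUPS = [("<18", 0, 17), ("18-34", 18, 34), ("35-54", 35, 54), ("55+", 55, 200)]
def age_group(age):
    """Return the label of the bracket containing age."""
    for label, lo, hi in AGE_GROUPS:
        if lo <= age <= hi:
            return label
    raise ValueError(f"age out of range: {age}")
def ctr_by_age_group(rows):
    """Aggregate clicks / impressions per age bracket, skipping rows with no impressions."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for row in rows:
        if row["impressions"] == 0:
            continue  # CTR is undefined without impressions
        group = age_group(row["age"])
        clicks[group] += row["clicks"]
        impressions[group] += row["impressions"]
    return {g: clicks[g] / impressions[g] for g in impressions}
# Made-up rows standing in for one day of ad logs.
rows = [
    {"age": 16, "impressions": 4, "clicks": 1},
    {"age": 25, "impressions": 5, "clicks": 0},
    {"age": 30, "impressions": 5, "clicks": 1},
    {"age": 61, "impressions": 0, "clicks": 0},
]
rates = ctr_by_age_group(rows)
```
Rows with zero impressions are dropped before dividing, which is the kind of sanity check EDA tends to surface early.
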
155
00:07:24.879 --> 00:07:27.759
<v Speaker 1>And this whole process it kind of mirrors the scientific method,

156
00:07:27.800 --> 00:07:28.279
<v Speaker 1>doesn't it.

157
00:07:28.279 --> 00:07:32.199
<v Speaker 2>It really does. You ask a question, you research, explore

158
00:07:32.279 --> 00:07:35.240
<v Speaker 2>the data, you form a hypothesis, you test it, build

159
00:07:35.279 --> 00:07:37.959
<v Speaker 2>a model, analyze the results, communicate them.

160
00:07:37.959 --> 00:07:38.759
<v Speaker 1>But with a twist.

161
00:07:39.000 --> 00:07:41.639
<v Speaker 2>Yeah, The big difference is the feedback loop. When you

162
00:07:41.680 --> 00:07:43.959
<v Speaker 2>build a data product like a spam filter or a

163
00:07:43.959 --> 00:07:47.399
<v Speaker 2>recommendation engine, it goes out into the world, people use it,

164
00:07:47.800 --> 00:07:49.800
<v Speaker 2>and their interactions generate more data.

165
00:07:49.600 --> 00:07:51.040
<v Speaker 1>Which feeds back into the system.

166
00:07:51.240 --> 00:07:53.800
<v Speaker 2>Right, it's a dynamic cycle. It's not like predicting the weather,

167
00:07:53.839 --> 00:07:57.240
<v Speaker 2>where your forecast doesn't actually change tomorrow's weather. Here the

168
00:07:57.279 --> 00:08:00.680
<v Speaker 2>model influences the world, which generates new data for the model.

169
00:08:00.639 --> 00:08:02.480
<v Speaker 1>And the data scientist is involved all the way through?

170
00:08:02.399 --> 00:08:05.680
<v Speaker 2>Absolutely, from deciding what data to even collect,

171
00:08:06.079 --> 00:08:09.439
<v Speaker 2>to asking the first questions, planning the attack, and yeah,

172
00:08:09.519 --> 00:08:10.319
<v Speaker 2>writing the code.

173
00:08:10.480 --> 00:08:12.720
<v Speaker 1>The RealDirect case study sounds like a good example

174
00:08:12.759 --> 00:08:14.839
<v Speaker 1>of this using data in real estate.

175
00:08:14.959 --> 00:08:17.680
<v Speaker 2>Yeah, Doug Perlson's company. The traditional real estate broker

176
00:08:17.759 --> 00:08:21.959
<v Speaker 2>system was, well, broken in terms of data. Brokers guarded

177
00:08:21.959 --> 00:08:25.240
<v Speaker 2>their info fiercely. Public data was months out of date.

178
00:08:25.759 --> 00:08:27.000
<v Speaker 1>So what did RealDirect do?

179
00:08:27.639 --> 00:08:31.000
<v Speaker 2>They hired agents who pooled their knowledge, used data-driven tips,

180
00:08:31.040 --> 00:08:34.840
<v Speaker 2>built real time recommendations, tried to get live feeds on searches, offers,

181
00:08:34.919 --> 00:08:36.279
<v Speaker 2>closing times, all that stuff.

182
00:08:36.720 --> 00:08:40.159
<v Speaker 1>And the business model reflected that efficiency.

183
00:08:39.639 --> 00:08:43.000
<v Speaker 2>Right, a subscription model plus lower commission because the data

184
00:08:43.000 --> 00:08:46.440
<v Speaker 2>supposedly made things more efficient. The exercise for the students

185
00:08:46.519 --> 00:08:50.399
<v Speaker 2>was literally, okay, you're advising the CEO, define a data strategy.

186
00:08:50.519 --> 00:08:52.159
<v Speaker 2>What data do you need, where do you get it?

187
00:08:52.279 --> 00:08:54.559
<v Speaker 2>How do you clean it, explore it, summarize it, put

188
00:08:54.600 --> 00:08:55.159
<v Speaker 2>it all together.

189
00:08:55.360 --> 00:08:59.480
<v Speaker 1>Okay, let's shift gears to the algorithms, the engines driving this.

190
00:09:00.120 --> 00:09:03.000
<v Speaker 1>Machine learning versus statistical modeling, always confusing.

191
00:09:03.360 --> 00:09:06.919
<v Speaker 2>It is confusing, because there's so much overlap. ML algorithms,

192
00:09:07.000 --> 00:09:12.039
<v Speaker 2>mostly from computer science, do prediction, classification, clustering. Statistical modeling,

193
00:09:12.080 --> 00:09:15.759
<v Speaker 2>from stats departments, does, well, prediction, classification, clustering.

194
00:09:15.919 --> 00:09:17.279
<v Speaker 1>So what's the real difference? Then?

195
00:09:17.519 --> 00:09:21.519
<v Speaker 2>Often it's about the goal and the origin. Many ML algorithms,

196
00:09:21.639 --> 00:09:26.679
<v Speaker 2>especially the ones driving AI, image recognition, speech, recommenders, they

197
00:09:26.679 --> 00:09:30.000
<v Speaker 2>weren't typically part of a core stats curriculum, and crucially,

198
00:09:30.440 --> 00:09:33.679
<v Speaker 2>they're often not designed to help you infer the underlying why.

199
00:09:34.120 --> 00:09:36.159
<v Speaker 1>They just want the best prediction.

200
00:09:36.360 --> 00:09:40.320
<v Speaker 2>Exactly. Maximum accuracy is usually the goal, whereas statistical modeling often

201
00:09:40.320 --> 00:09:45.080
<v Speaker 2>puts more emphasis on understanding the relationships, the uncertainty. But honestly,

202
00:09:45.679 --> 00:09:48.879
<v Speaker 2>good data scientists use both. They know when each approach

203
00:09:48.960 --> 00:09:49.720
<v Speaker 2>is more valuable.

204
00:09:49.919 --> 00:09:52.679
<v Speaker 1>Right, and the warning you mentioned: don't be a hammer

205
00:09:52.720 --> 00:09:53.440
<v Speaker 1>looking for a nail.

206
00:09:53.759 --> 00:09:56.720
<v Speaker 2>Precisely, don't just grab the algorithm you know best and

207
00:09:56.759 --> 00:10:00.279
<v Speaker 2>force it onto the problem. First, understand the problem context,

208
00:10:00.360 --> 00:10:03.759
<v Speaker 2>figure out its mathematical structure, then see which algorithms fit

209
00:10:04.000 --> 00:10:04.480
<v Speaker 2>and make sense.

210
00:10:04.720 --> 00:10:06.960
<v Speaker 1>Let's start with a classic: linear regression.

211
00:10:07.200 --> 00:10:10.799
<v Speaker 2>Ah, yes, your bread and butter. For predicting a continuous

212
00:10:10.840 --> 00:10:14.639
<v Speaker 2>outcome like price or temperature, using one or more predictors.

213
00:10:14.679 --> 00:10:17.960
<v Speaker 1>We usually start by thinking about simple lines, like y = 25x,

214
00:10:17.960 --> 00:10:19.919
<v Speaker 1>deterministic.

215
00:10:20.320 --> 00:10:23.799
<v Speaker 2>Right. But the key mental shift is moving to stochastic functions,

216
00:10:24.399 --> 00:10:28.960
<v Speaker 2>acknowledging that there's randomness, uncertainty. The line represents the average trend,

217
00:10:29.320 --> 00:10:31.440
<v Speaker 2>but the points will scatter around it.

218
00:10:31.399 --> 00:10:33.519
<v Speaker 1>And how do you find the best line?

219
00:10:33.759 --> 00:10:36.320
<v Speaker 2>You minimize the distance between the points and the line,

220
00:10:36.840 --> 00:10:39.960
<v Speaker 2>specifically the sum of the squared vertical distances. Divide by n and that's the

221
00:10:40.039 --> 00:10:41.000
<v Speaker 2>mean squared error.

222
00:10:40.759 --> 00:10:43.159
<v Speaker 1>And you evaluate it with things like p-values.

223
00:10:43.320 --> 00:10:46.879
<v Speaker 2>Yeah, p-values help you test if your predictors actually

224
00:10:46.919 --> 00:10:51.759
<v Speaker 2>have a statistically significant effect. Are their coefficients likely different

225
00:10:51.759 --> 00:10:55.960
<v Speaker 2>from zero? You can add more predictors. That's multiple linear regression,

226
00:10:56.480 --> 00:10:59.360
<v Speaker 2>which then raises the question of feature selection: which predictors

227
00:10:59.360 --> 00:10:59.919
<v Speaker 2>matter most?

228
00:11:00.080 --> 00:11:02.679
<v Speaker 1>And simulating data can help understand this?

229
00:11:02.240 --> 00:11:05.799
<v Speaker 2>Oh, hugely useful, especially in learning. You create fake

230
00:11:05.879 --> 00:11:08.440
<v Speaker 2>data where you know the true relationship, then you see

231
00:11:08.480 --> 00:11:11.080
<v Speaker 2>if your model can recover it. How does sample size affect things?

232
00:11:11.120 --> 00:11:15.039
<v Speaker 2>What happens if you add irrelevant variables? It builds intuition.

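NOTE
A minimal sketch of the simulate-then-recover idea described above. The "true" relationship y = 2 + 5x, the noise level, and the sample sizes are all assumptions chosen for illustration. The fit is ordinary least squares for a single predictor, meaning it minimizes the sum of squared vertical distances.
```python
import random
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
def simulate(n, intercept, slope, noise_sd, rng):
    """Draw n points from y = intercept + slope*x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys
rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 10000):
    xs, ys = simulate(n, intercept=2.0, slope=5.0, noise_sd=3.0, rng=rng)
    a, b = fit_line(xs, ys)
    print(n, round(a, 2), round(b, 2))
```
The printed estimates should drift toward the assumed 2 and 5 as n grows, which is the intuition about sample size the speakers mention.
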
233
00:11:15.240 --> 00:11:18.559
<v Speaker 1>Okay, what about classifying things? Finding similar items?

234
00:11:18.639 --> 00:11:21.360
<v Speaker 2>That sounds like k-nearest neighbors, or k-NN.

235
00:11:21.039 --> 00:11:22.679
<v Speaker 1>Right, k-NN. How does that work?

236
00:11:22.840 --> 00:11:26.720
<v Speaker 2>The idea is simple. To classify a new unlabeled item,

237
00:11:26.960 --> 00:11:30.159
<v Speaker 2>you look at its k closest neighbors in a dataset where

238
00:11:30.200 --> 00:11:32.720
<v Speaker 2>you do have labels. Then you assign the class most

239
00:11:32.759 --> 00:11:36.919
<v Speaker 2>common among those neighbors. Examples could be anything: classifying emails

240
00:11:36.919 --> 00:11:40.600
<v Speaker 2>as spam or not spam based on similar emails, assessing credit

241
00:11:40.679 --> 00:11:44.320
<v Speaker 2>risk based on similar applicants, recommending restaurants based on what

242
00:11:44.399 --> 00:11:47.440
<v Speaker 2>similar users like. Find the neighbors. And the key choices

243
00:11:47.480 --> 00:11:50.240
<v Speaker 2>are two main things. First, how do you define closest?

244
00:11:50.480 --> 00:11:53.600
<v Speaker 2>You need a distance metric. Euclidean is common for points,

245
00:11:53.799 --> 00:11:57.960
<v Speaker 2>cosine for text, Hamming for strings, Manhattan for grid-like paths.

246
00:11:58.759 --> 00:12:01.559
<v Speaker 2>Depends on the data. Second, choosing k: how many neighbors

247
00:12:01.600 --> 00:12:05.600
<v Speaker 2>do you consult? One, five, twenty. That's a tuning parameter.

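NOTE
A minimal sketch of the k-NN classification steps described above, assuming Euclidean distance and a simple majority vote. The function names and the toy points are made up for illustration.
```python
import math
from collections import Counter
def euclidean(p, q):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
def knn_predict(train, query, k, distance=euclidean):
    """train is a list of (point, label); return the majority label among the k nearest points."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
# Toy labeled data: two loose groups in the plane.
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b"),
]
label_near_a = knn_predict(train, (1.2, 1.4), k=3)
label_near_b = knn_predict(train, (6.4, 6.2), k=3)
```
Swapping in another metric only means passing a different distance function. With two classes, an odd k avoids tied votes.
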
248
00:12:05.639 --> 00:12:08.440
<v Speaker 1>And this is where it gets interesting. The curse of dimensionality.

249
00:12:08.600 --> 00:12:11.240
<v Speaker 2>Ah, yes, the curse. k-NN works great in low dimensions,

250
00:12:11.559 --> 00:12:14.240
<v Speaker 2>like recognizing handwritten digits, where pixels in a

251
00:12:14.279 --> 00:12:17.480
<v Speaker 2>256-dimensional space have a natural closeness. But

252
00:12:17.759 --> 00:12:20.679
<v Speaker 2>imagine text data with thousands of dimensions: words.

253
00:12:20.639 --> 00:12:23.120
<v Speaker 1>Things get spread out.

254
00:12:23.200 --> 00:12:25.799
<v Speaker 2>Exactly. In high dimensions, everything is kind of far away from

255
00:12:25.799 --> 00:12:29.559
<v Speaker 2>everything else. Your nearest neighbors might not be very similar

256
00:12:29.600 --> 00:12:33.120
<v Speaker 2>at all in a meaningful sense. k-NN breaks down.

257
00:12:33.519 --> 00:12:35.679
<v Speaker 2>That's why it's usually bad for spam filtering.

258
00:12:35.799 --> 00:12:38.159
<v Speaker 1>Good point. Other k-NN pitfalls?

259
00:12:38.360 --> 00:12:41.600
<v Speaker 2>Definitely need to scale your variables. If income is in dollars,

260
00:12:41.600 --> 00:12:45.080
<v Speaker 2>and age is in years, income will dominate the distance calculation

261
00:12:45.200 --> 00:12:49.360
<v Speaker 2>unless you scale them, and overfitting is a risk, especially

262
00:12:49.360 --> 00:12:52.279
<v Speaker 2>if k = 1. Then you're just copying the label of

263
00:12:52.320 --> 00:12:55.559
<v Speaker 2>the single closest point, which might be noise. Correlated features

264
00:12:55.600 --> 00:12:56.960
<v Speaker 2>can also distort distances.

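NOTE
A minimal sketch of the scaling pitfall described above. The four hypothetical applicants and the choice of standardization (mean 0, standard deviation 1) are assumptions for illustration; other rescalings would make the same point.
```python
import math
def standardize(points):
    """Rescale each column to mean 0 and standard deviation 1 (population sd)."""
    columns = list(zip(*points))
    means = [sum(c) / len(c) for c in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds)) for p in points]
def nearest_index(points, query_index):
    """Index of the point closest to points[query_index], excluding itself."""
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: math.dist(points[i], query))
# Hypothetical applicants as (income in dollars, age in years).
people = [(50000, 25), (50500, 60), (52000, 26), (90000, 40)]
raw_neighbor = nearest_index(people, 0)
scaled_neighbor = nearest_index(standardize(people), 0)
```
On the raw numbers the 25-year-old's nearest neighbor is the 60-year-old with a similar income (index 1), because dollars swamp years. After standardizing, it is the 26-year-old (index 2).
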
265
00:12:57.120 --> 00:13:00.639
<v Speaker 1>Okay, so k-NN needs labels. What if you don't have labels,

266
00:13:00.639 --> 00:13:02.840
<v Speaker 1>but you suspect there are groups in your data.

267
00:13:02.799 --> 00:13:06.080
<v Speaker 2>Then you're talking about unsupervised learning, and k-means clustering

268
00:13:06.159 --> 00:13:07.240
<v Speaker 2>is a common technique there.

269
00:13:07.320 --> 00:13:11.240
<v Speaker 1>Unsupervised. So the algorithm finds the groups itself?

270
00:13:11.320 --> 00:13:13.559
<v Speaker 2>Precisely. You tell it how many clusters, k, you think exist,

271
00:13:14.080 --> 00:13:17.120
<v Speaker 2>and the algorithm iteratively assigns points to the nearest cluster

272
00:13:17.200 --> 00:13:21.960
<v Speaker 2>center, the centroid, and then recalculates the centroids until things stabilize.

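NOTE
A minimal sketch of the assign-then-recalculate loop described above (commonly called Lloyd's algorithm). The toy points and the hand-picked starting centroids are assumptions for illustration.
```python
import math
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: returns (centroids, assignments) once assignments stop changing."""
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # stable, so the centroids will not move either
        assignments = new_assignments
        # Update step: move each centroid to the mean of its points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Starting centroids are chosen by hand here; real uses pick them randomly or via k-means++.
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```
Different starting centroids can lead to different final clusters, which is the "stuck in a suboptimal solution" quirk.
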
273
00:13:22.000 --> 00:13:22.840
<v Speaker 1>Why would you do that?

274
00:13:23.120 --> 00:13:25.639
<v Speaker 2>Lots of reasons. Maybe you want to segment users for

275
00:13:25.679 --> 00:13:30.320
<v Speaker 2>different marketing or product experiences, or build separate predictive models

276
00:13:30.360 --> 00:13:33.919
<v Speaker 2>for distinct customer groups. k-means helps you discover those

277
00:13:33.960 --> 00:13:36.679
<v Speaker 2>groups automatically instead of you trying to define them with

278
00:13:36.799 --> 00:13:38.480
<v Speaker 2>arbitrary rules or thresholds.

279
00:13:38.559 --> 00:13:42.240
<v Speaker 1>So it automates finding clusters in like many dimensions.

280
00:13:42.399 --> 00:13:46.399
<v Speaker 2>Yeah, that's the power, but it has its quirks. Choosing

281
00:13:46.399 --> 00:13:49.240
<v Speaker 2>the right K is often more art than science, and

282
00:13:49.360 --> 00:13:52.200
<v Speaker 2>sometimes the algorithm can get stuck in a suboptimal solution

283
00:13:52.519 --> 00:13:53.919
<v Speaker 2>depending on where it starts.

284
00:13:54.360 --> 00:13:55.600
<v Speaker 1>Is it an old algorithm?

285
00:13:55.639 --> 00:13:58.279
<v Speaker 2>The basic idea goes back to the fifties, Steinhaus and

286
00:13:58.360 --> 00:14:02.399
<v Speaker 2>Lloyd. MacQueen coined the term k-means in '67. There

287
00:14:02.399 --> 00:14:04.879
<v Speaker 2>are newer versions, like k-means++ from

288
00:14:04.879 --> 00:14:07.720
<v Speaker 2>2007, that try to start the algorithm off better.

289
00:14:07.799 --> 00:14:10.919
<v Speaker 1>Okay, so we said k-NN isn't great for spam filtering

290
00:14:10.960 --> 00:14:13.440
<v Speaker 1>because of high dimensions. What does work well?

291
00:14:13.360 --> 00:14:17.919
<v Speaker 2>That brings us to naive Bayes, a surprisingly effective probabilistic approach.

292
00:14:17.240 --> 00:14:19.080
<v Speaker 1>Based on Bayes' law?

293
00:14:19.080 --> 00:14:21.960
<v Speaker 2>Exactly. Remember Bayes' law from stats? P(A|B) = P(B|A) P(A) / P(B).

294
00:14:22.559 --> 00:14:26.679
<v Speaker 1>Vaguely. Like the disease testing example: the probability you're sick given

295
00:14:26.720 --> 00:14:27.600
<v Speaker 1>a positive test.

296
00:14:27.879 --> 00:14:30.279
<v Speaker 2>That's the one. We apply the same logic to spam.

297
00:14:31.240 --> 00:14:34.320
<v Speaker 2>What's the probability an email is spam, given that it contains

298
00:14:34.320 --> 00:14:37.120
<v Speaker 2>the word "viagra"? P(spam|word) = P(word|spam) P(spam) / P(word).

299
00:14:37.080 --> 00:14:38.759
<v Speaker 1>Makes sense. What's the naive part?

300
00:14:39.000 --> 00:14:41.559
<v Speaker 2>The naive assumption is that the words in the email

301
00:14:41.720 --> 00:14:43.240
<v Speaker 2>appear independently of each other, given the class.

302
00:14:43.120 --> 00:14:46.720
<v Speaker 1>Which isn't true, right? "Free" and "viagra" probably appear

303
00:14:46.759 --> 00:14:49.960
<v Speaker 1>together more often than by chance in spam.

304
00:14:50.120 --> 00:14:55.000
<v Speaker 2>Totally untrue. But the simplification makes the math tractable, and

305
00:14:55.120 --> 00:14:59.279
<v Speaker 2>surprisingly it often works really well in practice, especially for text.

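NOTE
A short worked form of the spam calculation described above, as a sketch. The symbols w and w_1, ..., w_n for words in an email are notation introduced here for illustration. The second line is the "naive" independence assumption, applied within the spam class.
```latex
P(\text{spam} \mid w) = \frac{P(w \mid \text{spam})\, P(\text{spam})}{P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{not spam})\, P(\text{not spam})}
P(w_1, \dots, w_n \mid \text{spam}) \approx \prod_{i=1}^{n} P(w_i \mid \text{spam})
```
The denominator in the first line is just P(w) written out over the two classes.
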
306
00:14:59.200 --> 00:15:01.440
<v Speaker 1>Any pitfalls with just counting words?

307
00:15:01.559 --> 00:15:04.399
<v Speaker 2>Oh yeah, if a word like viagra only appeared in

308
00:15:04.440 --> 00:15:07.159
<v Speaker 2>spam in your training data, the model might assign a

309
00:15:07.200 --> 00:15:09.639
<v Speaker 2>one hundred percent probability of spam if it sees that

310
00:15:09.679 --> 00:15:12.799
<v Speaker 2>word again. It's overfitting. Also, what if you see a

311
00:15:12.840 --> 00:15:15.679
<v Speaker 2>word you've never seen before? The probability would be zero,

312
00:15:16.240 --> 00:15:17.559
<v Speaker 2>which messes up the calculation.

313
00:15:17.679 --> 00:15:18.559
<v Speaker 1>So how do you fix that?

314
00:15:18.840 --> 00:15:22.879
<v Speaker 2>With Laplace smoothing, sometimes called additive smoothing. You basically add

315
00:15:22.879 --> 00:15:25.600
<v Speaker 2>a small pseudo count to every word count, pretending you've

316
00:15:25.639 --> 00:15:27.840
<v Speaker 2>seen each word at least once or a fraction of

317
00:15:27.879 --> 00:15:31.159
<v Speaker 2>a time. It prevents zero probabilities and generally makes the

318
00:15:31.240 --> 00:15:32.279
<v Speaker 2>estimates more robust.

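NOTE
A minimal sketch of Laplace (additive) smoothing as described above, assuming a present/absent word feature so the denominator grows by 2 * alpha. The counts, the 0.5 prior, and the function names are made up for illustration.
```python
def word_given_class(count_with_word, class_total, alpha=1.0):
    """Smoothed estimate of P(word present | class) for a two-outcome (present/absent) feature."""
    return (count_with_word + alpha) / (class_total + 2 * alpha)
def posterior_spam(p_word_spam, p_word_ham, prior_spam=0.5):
    """Bayes' law for a single word: P(spam | word)."""
    numerator = p_word_spam * prior_spam
    return numerator / (numerator + p_word_ham * (1 - prior_spam))
# Made-up training counts: the word appears in 30 of 100 spam emails and 0 of 100 non-spam emails.
unsmoothed_ham = word_given_class(0, 100, alpha=0.0)
smoothed_ham = word_given_class(0, 100, alpha=1.0)
smoothed_spam = word_given_class(30, 100, alpha=1.0)
overconfident = posterior_spam(0.3, unsmoothed_ham)
tempered = posterior_spam(smoothed_spam, smoothed_ham)
```
Without smoothing the posterior is exactly 1.0, the overconfident case the speakers warn about. With alpha = 1 it becomes 31/32.
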
319
00:15:32.480 --> 00:15:35.919
<v Speaker 1>And this was used in that NYT article classification exercise.

320
00:15:36.000 --> 00:15:39.679
<v Speaker 2>Yes, exactly, Jake's exercise. Download 2,000 articles from different

321
00:15:39.720 --> 00:15:43.840
<v Speaker 2>sections, arts, business, sports, et cetera, using the API. Train

322
00:15:43.919 --> 00:15:47.960
<v Speaker 2>a naive base model specifically Bernoulli ni bays here to

323
00:15:48.000 --> 00:15:52.440
<v Speaker 2>classify them, tune the smoothing parameters, evaluate with a confusion matrix.

324
00:15:52.480 --> 00:15:55.559
<v Speaker 2>See which words were most indicative of each section. Great

325
00:15:55.639 --> 00:15:56.559
<v Speaker 2>hands on example.

326
00:15:56.720 --> 00:15:59.960
<v Speaker 1>Cool. Another big one: logistic regression. How's that different

327
00:16:00.159 --> 00:16:01.440
<v Speaker 1>from linear regression?

328
00:16:01.639 --> 00:16:05.480
<v Speaker 2>Linear regression predicts a continuous value, right, like a house price.

329
00:16:05.919 --> 00:16:09.600
<v Speaker 2>Logistic regression predicts the probability of a binary outcome, something

330
00:16:09.600 --> 00:16:11.200
<v Speaker 2>that's either yes or no, zero or one.

331
00:16:11.200 --> 00:16:13.440
<v Speaker 1>Like, will a user click an ad? Is this

332
00:16:13.519 --> 00:16:15.559
<v Speaker 1>email spam? Will this customer churn?

333
00:16:15.679 --> 00:16:17.600
<v Speaker 2>Exactly, those kinds of things. Binary outcomes.

334
00:16:17.720 --> 00:16:20.080
<v Speaker 1>And how does it predict a probability? Doesn't a linear

335
00:16:20.120 --> 00:16:21.480
<v Speaker 1>model output any number?

336
00:16:21.639 --> 00:16:24.000
<v Speaker 2>It starts with a linear combination of features, just like

337
00:16:24.080 --> 00:16:27.519
<v Speaker 2>linear regression: alpha plus beta x. But then it feeds that

338
00:16:27.600 --> 00:16:31.440
<v Speaker 2>result through a special function called the logistic function or sigmoid function.

339
00:16:30.960 --> 00:16:33.360
<v Speaker 1>The S-shaped curve?

340
00:16:33.240 --> 00:16:38.000
<v Speaker 2>That's the one: P of t equals one over one plus e to the minus t. This function squishes any input value

341
00:16:38.080 --> 00:16:41.360
<v Speaker 2>into an output between zero and one, perfect for representing

342
00:16:41.360 --> 00:16:42.159
<v Speaker 2>a probability.
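
NOTE
Editor's sketch of the logistic (sigmoid) step described above, assuming the
standard form 1 / (1 + e^(-t)). The function names and the example weights are
hypothetical; only alpha and beta come from the discussion.
```python
import math
def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), branched to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)
def predict_probability(alpha, betas, x):
    """Pass the linear score alpha + sum(beta_i * x_i) through the sigmoid."""
    score = alpha + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(score)
# Hypothetical weights and features: the two terms cancel, so the score is 0.
p = predict_probability(0.0, [1.0, -1.0], [2.0, 2.0])
```
A score of zero maps to a probability of one half; large positive or negative
scores approach one or zero without ever leaving that range.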

343
00:16:42.440 --> 00:16:45.159
<v Speaker 1>So alpha and beta still mean something?

344
00:16:45.799 --> 00:16:48.919
<v Speaker 2>Yep. Alpha is related to the baseline probability, the overall

345
00:16:48.919 --> 00:16:51.720
<v Speaker 2>odds, and the betas are the weights for each feature,

346
00:16:52.159 --> 00:16:54.639
<v Speaker 2>telling you how much each feature changes the log odds

347
00:16:54.639 --> 00:16:55.120
<v Speaker 2>of the outcome.

348
00:16:55.279 --> 00:16:57.960
<v Speaker 1>How do you find the best alpha and betas?

349
00:16:58.480 --> 00:17:02.200
<v Speaker 2>Usually with maximum likelihood estimation, you find the parameters that

350
00:17:02.240 --> 00:17:06.119
<v Speaker 2>make the observed data most probable. This often involves optimization

351
00:17:06.160 --> 00:17:10.000
<v Speaker 2>algorithms like Newton's method, or, especially for huge data sets,

352
00:17:10.200 --> 00:17:12.240
<v Speaker 2>stochastic gradient descent, SGD.

353
00:17:12.720 --> 00:17:14.319
<v Speaker 1>SGD sounds familiar.

354
00:17:14.039 --> 00:17:16.279
<v Speaker 2>Very common in large scale machine learning. It updates the

355
00:17:16.319 --> 00:17:19.000
<v Speaker 2>parameters using just one data point or a small batch

356
00:17:19.000 --> 00:17:21.759
<v Speaker 2>at a time, making it efficient for massive data sets,

357
00:17:21.880 --> 00:17:25.279
<v Speaker 2>especially sparse ones. Tools like Mahout or Vowpal Wabbit use

358
00:17:25.279 --> 00:17:25.799
<v Speaker 2>it heavily.
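
NOTE
Editor's sketch of stochastic gradient updates for logistic regression, one
data point per update as described above. This is a plain-Python illustration
under assumed settings (learning rate, epoch count, toy data), not how Mahout
or Vowpal Wabbit implement it.
```python
import math
import random
def sgd_logistic(data, epochs=200, lr=0.1, seed=0):
    """Fit alpha and betas by stochastic gradient ascent on the log-likelihood."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    alpha, betas = 0.0, [0.0] * n_features
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the points in a fresh random order each pass
        for x, y in rows:  # one data point per parameter update
            score = alpha + sum(b * xi for b, xi in zip(betas, x))
            p = 1.0 / (1.0 + math.exp(-score))
            error = y - p  # gradient of the log-likelihood with respect to the score
            alpha += lr * error
            betas = [b + lr * error * xi for b, xi in zip(betas, x)]
    return alpha, betas
# Hypothetical one-feature data: small x gives label 0, large x gives label 1.
data = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
alpha, betas = sgd_logistic(data)
```
Each update touches only the features of the current point, which is why the
approach suits sparse data: features that are zero contribute no change.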

359
00:17:26.039 --> 00:17:29.480
<v Speaker 1>Now, evaluating these models: you said accuracy isn't always great.

360
00:17:29.440 --> 00:17:32.079
<v Speaker 2>Right, especially with imbalanced classes. If only one percent of

361
00:17:32.119 --> 00:17:35.480
<v Speaker 2>emails are spam, a model predicting not spam one hundred

362
00:17:35.480 --> 00:17:37.799
<v Speaker 2>percent of the time is ninety nine percent accurate, but useless.

363
00:17:37.880 --> 00:17:39.039
<v Speaker 1>So what should we use instead?

364
00:17:39.359 --> 00:17:43.519
<v Speaker 2>Look at precision: of the times you predicted spam, how

365
00:17:43.559 --> 00:17:46.559
<v Speaker 2>often were you right? And recall: of all the actual

366
00:17:46.559 --> 00:17:49.720
<v Speaker 2>spam, how much did you catch? Often there's a trade-off.
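
NOTE
Editor's sketch of the one-percent-spam example above, assuming labels coded
as 1 for spam and 0 for not spam. Function names are illustrative, and
returning 0.0 when a ratio is undefined is a convention chosen here.
```python
def confusion_counts(y_true, y_pred):
    """Return true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn
def precision_recall_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted spam, how much was spam
    recall = tp / (tp + fn) if tp + fn else 0.0  # of actual spam, how much was caught
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy
y_true = [1] * 1 + [0] * 99  # one spam email in a hundred
always_not_spam = [0] * 100  # the useless model from the discussion
result = precision_recall_accuracy(y_true, always_not_spam)
```
Here result comes out as zero precision, zero recall, and 0.99 accuracy, which
is the gap between accuracy and usefulness that the speakers describe.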

367
00:17:49.559 --> 00:17:51.359
<v Speaker 1>And F-score? AUC?

368
00:17:51.720 --> 00:17:55.160
<v Speaker 2>F score tries to combine precision and recall into one number.

369
00:17:55.599 --> 00:17:58.839
<v Speaker 2>AUC, area under the ROC curve, is really good because

370
00:17:58.839 --> 00:18:02.960
<v Speaker 2>it measures performance across all possible thresholds and isn't thrown

371
00:18:02.960 --> 00:18:05.920
<v Speaker 2>off by imbalanced classes. It's base rate invariant.
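
NOTE
Editor's sketch of AUC using its ranking interpretation: the chance that a
randomly chosen positive is scored above a randomly chosen negative. The
pairwise loop is a simple illustration for small data, and the scores and
labels below are made up.
```python
def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
```
No threshold appears anywhere in the calculation, which is the sense in which
AUC summarizes performance across all thresholds at once.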

372
00:18:06.200 --> 00:18:09.599
<v Speaker 1>But even these metrics might not capture the real goal?

373
00:18:10.200 --> 00:18:13.079
<v Speaker 2>Exactly. Your model might have great AUC, but does it actually

374
00:18:13.079 --> 00:18:16.880
<v Speaker 2>increase revenue or user engagement? That's where A/B testing comes in,

375
00:18:17.000 --> 00:18:18.880
<v Speaker 2>the gold standard for real world impact.

376
00:18:19.000 --> 00:18:21.519
<v Speaker 1>Yeah, run a controlled experiment: show the old system to

377
00:18:21.559 --> 00:18:23.960
<v Speaker 1>group A, the new model to group B, and measure

378
00:18:24.000 --> 00:18:27.000
<v Speaker 1>the actual business outcome you care about. Google's paper on

379
00:18:27.079 --> 00:18:28.680
<v Speaker 1>experimentation really drives this home.

380
00:18:28.759 --> 00:18:32.480
<v Speaker 2>And Media6Degrees, M6D, used logistic

381
00:18:32.519 --> 00:18:34.160
<v Speaker 2>regression for predicting ad clicks.

382
00:18:34.240 --> 00:18:37.759
<v Speaker 1>Yeah, a classic application: user-level conversion prediction, highly scalable

383
00:18:37.759 --> 00:18:39.119
<v Speaker 1>and effective for binary outcomes.

384
00:18:39.160 --> 00:18:41.680
<v Speaker 2>So okay, let's get into some real-world messiness. Time

385
00:18:41.759 --> 00:18:45.079
<v Speaker 2>stamps seem simple, but you said they're tricky. Oh, they are.

386
00:18:45.240 --> 00:18:48.279
<v Speaker 2>You get tons of time stamped event data, user clicks,

387
00:18:48.559 --> 00:18:52.200
<v Speaker 2>check ins, sensor readings. That's big data right there. But

388
00:18:52.240 --> 00:18:56.519
<v Speaker 2>they introduce subtle problems. Like what? The biggest is causality.

389
00:18:57.160 --> 00:18:59.680
<v Speaker 2>You cannot use information from the future to predict the

390
00:18:59.680 --> 00:19:03.559
<v Speaker 2>present or past. Sounds obvious, but it's easy to accidentally

391
00:19:03.599 --> 00:19:06.559
<v Speaker 2>leak future information into your training data if you're not

392
00:19:06.640 --> 00:19:07.880
<v Speaker 2>careful with time stamps.

393
00:19:08.079 --> 00:19:10.599
<v Speaker 1>Ah, the time travel problem.

394
00:19:10.960 --> 00:19:13.839
<v Speaker 2>Exactly. You also have to be super careful distinguishing in-sample

395
00:19:13.920 --> 00:19:17.640
<v Speaker 2>training data from out-of-sample testing data based on time,

396
00:19:18.440 --> 00:19:20.880
<v Speaker 2>and often you need running estimates, like a running average,

397
00:19:20.920 --> 00:19:24.200
<v Speaker 2>not a single average calculated over all past data, because

398
00:19:24.240 --> 00:19:27.480
<v Speaker 2>the world changes. Like in finance? Finance is a great example.

399
00:19:27.720 --> 00:19:30.680
<v Speaker 2>They often use log returns instead of simple percentage returns

400
00:19:31.000 --> 00:19:34.599
<v Speaker 2>because log returns handle compounding better and are more symmetric.

401
00:19:35.160 --> 00:19:38.880
<v Speaker 2>Volatility is key, and techniques like exponential downweighting let you

402
00:19:38.920 --> 00:19:42.319
<v Speaker 2>calculate a volatility estimate that gives more weight to recent data,

403
00:19:42.599 --> 00:19:44.880
<v Speaker 2>efficiently updating with just the latest info.
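
NOTE
Editor's sketch of log returns and an exponentially downweighted volatility
estimate, as described above. The decay value 0.97, the function names, and
the prices are assumptions for illustration.
```python
import math
def log_returns(prices):
    """Log of each price divided by the previous price."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]
def ewma_volatility(returns, decay=0.97):
    """Running volatility: each update needs only the previous estimate and the newest return."""
    variance = returns[0] ** 2
    estimates = [math.sqrt(variance)]
    for r in returns[1:]:
        # Old information is shrunk by decay; the newest squared return gets weight 1 - decay.
        variance = decay * variance + (1.0 - decay) * r ** 2
        estimates.append(math.sqrt(variance))
    return estimates
# Hypothetical prices: up ten percent, then back to the start.
prices = [100.0, 110.0, 100.0]
r = log_returns(prices)
vol = ewma_volatility(r)
```
The two log returns sum to zero because the price ends where it began, while
the simple percentage moves (up 10 percent, down about 9.1 percent) do not;
that is the symmetry and compounding point made in the episode.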

404
00:19:44.799 --> 00:19:47.279
<v Speaker 1>And financial markets have weak signals?

405
00:19:47.039 --> 00:19:50.839
<v Speaker 2>Extremely weak. These days, so many algorithms are looking for

406
00:19:50.880 --> 00:19:54.400
<v Speaker 2>the same patterns that the signals get arbitraged away quickly.

407
00:19:55.000 --> 00:19:57.880
<v Speaker 2>You might aim for just a tiny positive correlation like

408
00:19:57.920 --> 00:20:01.240
<v Speaker 2>three percent over a day. Yet linear regression is still

409
00:20:01.319 --> 00:20:04.160
<v Speaker 2>used because it's robust even with all that noise.

410
00:20:04.119 --> 00:20:07.480
<v Speaker 1>And they use things like priors and regularization?

411
00:20:07.480 --> 00:20:12.720
<v Speaker 2>Yes, to keep the model stable. Priors incorporate existing beliefs, and regularization

412
00:20:12.880 --> 00:20:16.000
<v Speaker 2>adds penalties to stop coefficients from getting too large or

413
00:20:16.079 --> 00:20:20.400
<v Speaker 2>varying too wildly. It helps prevent overfitting in noisy high

414
00:20:20.440 --> 00:20:24.039
<v Speaker 2>dimensional data. Mathematically, it often simplifies to adding a term

415
00:20:24.079 --> 00:20:25.720
<v Speaker 2>to the covariance matrix.
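
NOTE
Editor's sketch of one common reading of "adding a term to the covariance
matrix", assuming a ridge (L2) penalty on linear regression. The symbols X
(feature matrix), y (targets), and lambda (penalty strength) are assumptions;
the episode does not name them.
```latex
\hat{\beta}_{\text{OLS}} = (X^{\top} X)^{-1} X^{\top} y
\qquad
\hat{\beta}_{\text{ridge}} = (X^{\top} X + \lambda I)^{-1} X^{\top} y, \quad \lambda > 0
```
With centered features, X^T X is proportional to the sample covariance matrix,
so the penalty shows up as lambda added along its diagonal, which shrinks the
coefficients and keeps the inverse well behaved.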

416
00:20:25.559 --> 00:20:27.799
<v Speaker 1>Which brings us to figuring out what data to even

417
00:20:27.839 --> 00:20:31.440
<v Speaker 1>put into the model. Feature engineering, the art of data science, some say.

418
00:20:31.480 --> 00:20:34.880
<v Speaker 2>It's about deciding which variables, or features, to use

419
00:20:35.000 --> 00:20:37.279
<v Speaker 2>and how to transform them to be most effective for

420
00:20:37.359 --> 00:20:42.200
<v Speaker 2>your model. Schutt's four quadrants idea highlights thinking about relevance, usefulness,

421
00:20:42.279 --> 00:20:45.160
<v Speaker 2>whether something's even logged. Your creativity is the limit.

422
00:20:45.279 --> 00:20:46.799
<v Speaker 1>How do you select the best features?

423
00:20:47.279 --> 00:20:52.400
<v Speaker 2>Several ways. Filters rank features individually by predictive power, like correlation,

424
00:20:52.799 --> 00:20:57.279
<v Speaker 2>but they miss interactions. Stepwise regression adds or removes

425
00:20:57.319 --> 00:21:00.400
<v Speaker 2>features one by one from a model, watching metrics like

426
00:21:00.559 --> 00:21:04.799
<v Speaker 2>R-squared or p-values, but careful, it can easily overfit.

427
00:21:05.000 --> 00:21:06.480
<v Speaker 1>What about embedded methods?

428
00:21:06.839 --> 00:21:09.720
<v Speaker 2>These methods build feature selection right into the model training.

429
00:21:09.880 --> 00:21:11.559
<v Speaker 2>Decision trees are a prime example.

430
00:21:11.640 --> 00:21:12.240
<v Speaker 1>How do they work?

431
00:21:12.759 --> 00:21:15.400
<v Speaker 2>They recursively split the data based on the feature that

432
00:21:15.440 --> 00:21:19.000
<v Speaker 2>provides the most information gain at each step, essentially creating

433
00:21:19.039 --> 00:21:22.000
<v Speaker 2>a flow chart to classify things. They're very interpretable. You

434
00:21:22.000 --> 00:21:23.160
<v Speaker 2>can see the rules.
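
NOTE
Editor's sketch of the information gain calculation a decision tree uses to
pick a split, assuming entropy measured in bits. The four rows below are made
up for illustration and are not the real Titanic data.
```python
import math
from collections import Counter
def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
# Hypothetical passengers: label 1 means survived.
rows = [{"sex": "female", "deck": "A"}, {"sex": "female", "deck": "B"}, {"sex": "male", "deck": "A"}, {"sex": "male", "deck": "B"}]
labels = [1, 1, 0, 0]
best = max(["sex", "deck"], key=lambda f: information_gain(rows, labels, f))
```
In this toy data the sex feature separates the labels perfectly and deck tells
you nothing, so the tree would split on sex first and then recurse on each
resulting group.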

435
00:21:22.839 --> 00:21:24.440
<v Speaker 1>Like the Titanic survival example?

436
00:21:24.519 --> 00:21:28.519
<v Speaker 2>Classic. Decision trees can figure out rules like: if female, survive;

437
00:21:28.960 --> 00:21:32.720
<v Speaker 2>if male and child, survive. They handle continuous variables by

438
00:21:32.720 --> 00:21:36.039
<v Speaker 2>finding optimal split points, but you often need to prune

439
00:21:36.039 --> 00:21:39.079
<v Speaker 2>the tree, cutting back branches to stop it from overfitting

440
00:21:39.119 --> 00:21:41.960
<v Speaker 2>the training data. And random forests, they take decision trees

441
00:21:42.000 --> 00:21:44.319
<v Speaker 2>to the next level. They build many decision trees on

442
00:21:44.359 --> 00:21:47.920
<v Speaker 2>different random subsets of the data and features, bagging, and

443
00:21:47.960 --> 00:21:51.400
<v Speaker 2>then average their predictions. Much more accurate and robust than

444
00:21:51.400 --> 00:21:55.279
<v Speaker 2>a single tree, but you lose that easy interpretability.

445
00:21:54.480 --> 00:21:59.000
<v Speaker 1>Now, when selecting features, that causation versus correlation issue seems

446
00:21:59.079 --> 00:22:00.880
<v Speaker 1>critical.

447
00:22:00.839 --> 00:22:04.119
<v Speaker 2>Huge. Is "user played ten times" a feature describing the user's behavior,

448
00:22:04.640 --> 00:22:07.920
<v Speaker 2>or is "you showed ten ads" a feature describing your action?

449
00:22:08.440 --> 00:22:11.319
<v Speaker 2>If you want actionable insights to change outcomes, you need

450
00:22:11.359 --> 00:22:13.400
<v Speaker 2>to focus on features you can actually control.

451
00:22:13.720 --> 00:22:16.400
<v Speaker 1>David Huffaker at Google talked about mixing methods.

452
00:22:16.519 --> 00:22:20.160
<v Speaker 2>Yeah, a hybrid approach, combining qualitative insights from small

453
00:22:20.279 --> 00:22:24.240
<v Speaker 2>user interviews with quantitative analysis of large scale log data.

454
00:22:24.599 --> 00:22:27.480
<v Speaker 2>The Google Plus circles feature apparently came from that kind

455
00:22:27.559 --> 00:22:31.519
<v Speaker 2>of mix: interviews sparking ideas, tested with big data, moving

456
00:22:31.519 --> 00:22:33.160
<v Speaker 2>from description to prediction.

457
00:22:32.960 --> 00:22:34.519
<v Speaker 1>Which also raises privacy concerns.

458
00:22:34.759 --> 00:22:38.440
<v Speaker 2>Absolutely, anytime you're dealing with human data. What are people

459
00:22:38.480 --> 00:22:43.200
<v Speaker 2>worried about? Identity theft, financial loss, creepy ads. The book

460
00:22:43.279 --> 00:22:48.960
<v Speaker 2>mentions ideas like clearer data flow diagrams, privacy controls, sensible defaults.

461
00:22:49.599 --> 00:22:50.880
<v Speaker 2>Big, thorny issues.

462
00:22:51.039 --> 00:22:53.920
<v Speaker 1>Let's talk visualization again. Mark Hansen framed it as more

463
00:22:53.960 --> 00:22:55.440
<v Speaker 1>than just nice plots.

464
00:22:55.319 --> 00:22:58.480
<v Speaker 2>Right, as an art and information discipline that actually changes how

465
00:22:58.519 --> 00:23:01.799
<v Speaker 2>we see things. Change the instruments, and you change the theory.

466
00:23:02.559 --> 00:23:06.400
<v Speaker 2>His examples (Million Dollar Blocks, Project Cascade, the NYT lobby

467
00:23:06.440 --> 00:23:11.279
<v Speaker 2>display) showed data as a powerful communication and exploration tool, sometimes even as art.

468
00:23:11.000 --> 00:23:13.759
<v Speaker 1>And Ian Wong at Square connected this

469
00:23:13.920 --> 00:23:14.799
<v Speaker 1>directly to fraud detection.

470
00:23:14.880 --> 00:23:19.119
<v Speaker 2>Exactly. Square deals with massive potential fraud. They used

471
00:23:19.200 --> 00:23:23.240
<v Speaker 2>machine learning heavily, but Wong stressed the importance of visualization alongside it.

472
00:23:23.279 --> 00:23:26.519
<v Speaker 2>It helps them understand why the model flags something, especially

473
00:23:26.559 --> 00:23:28.359
<v Speaker 2>with high class imbalance.

474
00:23:27.880 --> 00:23:30.680
<v Speaker 1>Where fraud is rare, so accuracy is useless.

475
00:23:30.680 --> 00:23:34.680
<v Speaker 2>Precisely. They focus on precision and recall. Wong's tips were

476
00:23:34.720 --> 00:23:38.640
<v Speaker 2>great too. Models aren't black boxes. Iterate quickly like experiments.

477
00:23:38.720 --> 00:23:40.880
<v Speaker 2>Keep your code clean and reusable.

478
00:23:40.359 --> 00:23:42.119
<v Speaker 1>And productionizing the models?

479
00:23:42.039 --> 00:23:45.519
<v Speaker 2>Making them work in real time, minimizing the gap between

480
00:23:45.519 --> 00:23:49.759
<v Speaker 2>offline tests and online reality. Visualization isn't just for building

481
00:23:49.759 --> 00:23:54.920
<v Speaker 2>the model. It helps the human operations team review transactions efficiently.

482
00:23:55.519 --> 00:23:58.680
<v Speaker 2>It augments their intelligence like an exoskeleton.

483
00:23:58.759 --> 00:24:02.599
<v Speaker 1>Okay, this is a really important distinction. Prediction versus causality.

484
00:24:02.920 --> 00:24:07.480
<v Speaker 2>Yes, this is where things get philosophically deep, maybe. Recommending

485
00:24:07.519 --> 00:24:10.480
<v Speaker 2>a book because you read a similar one, that's prediction.

486
00:24:11.400 --> 00:24:13.920
<v Speaker 2>Understanding what causes someone to buy a book, or get

487
00:24:13.960 --> 00:24:16.319
<v Speaker 2>sick, or click an ad, that's causality.

488
00:24:16.440 --> 00:24:18.759
<v Speaker 1>The methods might look similar sometimes?

489
00:24:18.720 --> 00:24:21.079
<v Speaker 2>They might use similar stats, but the intent is different, and that

490
00:24:21.160 --> 00:24:24.039
<v Speaker 2>changes everything about how you design your analysis and interpret

491
00:24:24.039 --> 00:24:24.559
<v Speaker 2>the results.

492
00:24:24.680 --> 00:24:27.400
<v Speaker 1>And observational data is tricky for causality?

493
00:24:27.839 --> 00:24:31.119
<v Speaker 2>Very. Ori Stitelman mentioned that OkCupid example: maybe

494
00:24:31.119 --> 00:24:33.880
<v Speaker 2>"beautiful" in emails correlated with responses, but did it cause

495
00:24:33.920 --> 00:24:37.039
<v Speaker 2>them, or were people writing "beautiful" also doing other things

496
00:24:37.039 --> 00:24:38.880
<v Speaker 2>differently? Confounders everywhere.

497
00:24:38.960 --> 00:24:40.119
<v Speaker 1>So the gold standard is?

498
00:24:40.279 --> 00:24:45.319
<v Speaker 2>Randomized controlled trials, RCTs: randomly assign people to treatment or control.

499
00:24:45.599 --> 00:24:49.640
<v Speaker 2>If the groups start out statistically identical thanks to randomization, any

500
00:24:49.680 --> 00:24:52.279
<v Speaker 2>difference in outcome can be attributed to the treatment. That's

501
00:24:52.319 --> 00:24:53.480
<v Speaker 2>causal inference.

502
00:24:53.319 --> 00:24:55.599
<v Speaker 1>And A/B tests are like RCTs for tech?

503
00:24:55.559 --> 00:24:59.640
<v Speaker 2>Pretty much. Much easier logistically than clinical trials, usually less

504
00:24:59.640 --> 00:25:03.559
<v Speaker 2>at stake, more control if you have the infrastructure. Randomly

505
00:25:03.599 --> 00:25:06.440
<v Speaker 2>show version A or version B of a web page,

506
00:25:06.880 --> 00:25:09.000
<v Speaker 2>measure the difference in clicks or conversions.

507
00:25:09.119 --> 00:25:11.720
<v Speaker 1>But you have to watch out for things like Simpson's paradox?

508
00:25:11.680 --> 00:25:15.279
<v Speaker 2>Oh yeah, a huge pitfall in observational data: a trend appears

509
00:25:15.279 --> 00:25:17.279
<v Speaker 2>in the overall data, but reverses when you break it

510
00:25:17.319 --> 00:25:19.480
<v Speaker 2>down into subgroups, so you can't just trust

511
00:25:19.519 --> 00:25:20.519
<v Speaker 2>aggregated numbers.
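
NOTE
Editor's sketch of Simpson's paradox with hypothetical counts, chosen only to
show the reversal. The group names "easy" and "hard" and the variants A and B
are made up; each pair is (successes, trials).
```python
def rate(successes, trials):
    return successes / trials
# A does better inside each subgroup, but A was mostly used on the hard cases.
counts = {"easy": {"A": (8, 10), "B": (70, 100)}, "hard": {"A": (30, 100), "B": (2, 10)}}
def overall(variant):
    """Success rate for one variant after pooling every subgroup together."""
    s = sum(counts[g][variant][0] for g in counts)
    n = sum(counts[g][variant][1] for g in counts)
    return rate(s, n)
a_wins_each_subgroup = all(rate(*counts[g]["A"]) > rate(*counts[g]["B"]) for g in counts)
b_wins_overall = overall("B") > overall("A")
```
Both flags come out true: A leads 0.8 to 0.7 and 0.3 to 0.2 within the
subgroups, yet B leads roughly 0.65 to 0.35 once the groups are pooled,
because the subgroup sizes differ so much between variants.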

512
00:25:20.759 --> 00:25:24.960
<v Speaker 1>David Madigan's work sounds intense, using the Rubin causal model.

513
00:25:24.839 --> 00:25:27.519
<v Speaker 2>Right, a formal framework to think about what you can

514
00:25:27.599 --> 00:25:32.240
<v Speaker 2>and can't know from observational studies. His OMOP project analyzed

515
00:25:32.440 --> 00:25:34.440
<v Speaker 2>huge medical databases.

516
00:25:33.880 --> 00:25:35.079
<v Speaker 1>Two hundred million people.

517
00:25:35.240 --> 00:25:37.839
<v Speaker 2>Yeah, and the shocking finding was that for many medical

518
00:25:37.920 --> 00:25:41.359
<v Speaker 2>questions using the same database but different standard analysis methods,

519
00:25:41.599 --> 00:25:44.720
<v Speaker 2>you could get opposite conclusions, like whether a drug caused

520
00:25:44.759 --> 00:25:45.319
<v Speaker 2>cancer or not.

521
00:25:45.559 --> 00:25:49.359
<v Speaker 1>Wow. That really underlines the challenge of inferring cause from

522
00:25:49.359 --> 00:25:50.039
<v Speaker 1>observed data.

523
00:25:50.119 --> 00:25:53.200
<v Speaker 2>It really does. Choices in analysis matter profoundly.

524
00:25:53.359 --> 00:25:57.640
<v Speaker 1>Another scary problem: data leakage. Claudia Perlich called it

525
00:25:57.680 --> 00:25:58.759
<v Speaker 1>a huge problem.

526
00:25:58.839 --> 00:26:01.920
<v Speaker 2>It is, both in competitions and the real world. It's

527
00:26:01.920 --> 00:26:05.799
<v Speaker 2>when your model accidentally learns from information it wouldn't have

528
00:26:06.400 --> 00:26:09.680
<v Speaker 2>in a real prediction scenario. It learns the noise for signal.

529
00:26:09.839 --> 00:26:10.720
<v Speaker 1>How does that happen?

530
00:26:10.839 --> 00:26:14.519
<v Speaker 2>Could be implicit future information, like using diagnosis codes assigned

531
00:26:14.559 --> 00:26:17.440
<v Speaker 2>after a prediction should have been made, or non random sampling.

532
00:26:17.720 --> 00:26:21.759
<v Speaker 2>Perlich gave examples like patient IDs accidentally encoding clinic location,

533
00:26:22.079 --> 00:26:26.359
<v Speaker 2>making cancer prediction trivial, or pneumonia models exploiting diagnosis codes.

534
00:26:26.440 --> 00:26:27.519
<v Speaker 1>So how do you avoid it? Specific advice?

535
00:26:27.720 --> 00:26:33.079
<v Speaker 2>Yes, very specific. Strict temporal cutoff just before the event: drop everything

536
00:26:33.119 --> 00:26:36.559
<v Speaker 2>learned after it, even by microseconds. Timestamp

537
00:26:36.640 --> 00:26:38.880
<v Speaker 2>everything based on when it was known, not just when

538
00:26:38.880 --> 00:26:42.519
<v Speaker 2>it happened. Start clean with raw data, and most importantly,

539
00:26:42.839 --> 00:26:44.400
<v Speaker 2>understand how the data was generated.
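
NOTE
Editor's sketch of a strict temporal cutoff, assuming every record carries a
known_at timestamp for when the value became known. The field names, the
dates, and the function name are hypothetical.
```python
from datetime import datetime
def features_known_before(records, prediction_time):
    """Keep only records whose known_at timestamp is strictly before the prediction time."""
    return [r for r in records if r["known_at"] < prediction_time]
# Hypothetical patient records: the diagnosis code is entered after the prediction is due.
records = [
    {"name": "prior_visits", "value": 3, "known_at": datetime(2013, 1, 5, 9, 0)},
    {"name": "diagnosis_code", "value": "X", "known_at": datetime(2013, 1, 9, 14, 0)},
]
prediction_time = datetime(2013, 1, 7, 0, 0)
usable = features_known_before(records, prediction_time)
```
Filtering on when a value was known, not on when the underlying event
happened, is what keeps a late-entered diagnosis code out of the training
features.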

540
00:26:44.519 --> 00:26:48.400
<v Speaker 1>Also important is model calibration. Are the probabilities right?

541
00:26:48.920 --> 00:26:51.359
<v Speaker 2>Crucial. A model might be good at ranking risks, but are

542
00:26:51.400 --> 00:26:55.599
<v Speaker 2>its predicted probabilities accurate? Like when it says seventy percent chance,

543
00:26:55.640 --> 00:26:59.160
<v Speaker 2>does it happen seventy percent of the time? Unpruned decision

544
00:26:59.200 --> 00:27:02.640
<v Speaker 2>trees, she showed, often give probabilities that are too extreme,

545
00:27:03.079 --> 00:27:05.599
<v Speaker 2>while logistic regression tends to be better calibrated.
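
NOTE
Editor's sketch of a simple calibration check, assuming equal-width
probability bins. The bin count, the function name, and the example
predictions are illustrative assumptions.
```python
def calibration_table(probs, outcomes, n_bins=5):
    """Per bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # a probability of 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:  # skip bins with no predictions
            mean_p = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            table.append((round(mean_p, 3), round(observed, 3), len(b)))
    return table
# Hypothetical model that says "seventy percent" ten times; the event happens seven times.
probs = [0.7] * 10
outcomes = [1] * 7 + [0] * 3
table = calibration_table(probs, outcomes)
```
A well calibrated model has the first two numbers in each row close together;
overly extreme probabilities show up as rows where predictions near 0 or 1 are
matched by observed rates pulled back toward the middle.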

546
00:27:05.720 --> 00:27:10.079
<v Speaker 1>Okay, last big area: data engineering, scale and complexity. MapReduce

547
00:27:10.079 --> 00:27:12.880
<v Speaker 1>and Hadoop: do data scientists need to know this?

548
00:27:12.920 --> 00:27:17.119
<v Speaker 2>Often not directly coding MapReduce jobs anymore, with

549
00:27:17.240 --> 00:27:20.680
<v Speaker 2>newer tools abstracting it, but understanding the concepts is still

550
00:27:20.720 --> 00:27:21.440
<v Speaker 2>really valuable.

551
00:27:21.720 --> 00:27:24.440
<v Speaker 1>Why was MapReduce developed? What problem does it solve?

552
00:27:24.599 --> 00:27:27.720
<v Speaker 2>Think about counting word frequencies in terabytes of text. You

553
00:27:27.759 --> 00:27:30.480
<v Speaker 2>can't load it all into memory on one machine, and

554
00:27:30.559 --> 00:27:33.279
<v Speaker 2>even if you could split it, combining the counts back together

555
00:27:33.279 --> 00:27:36.599
<v Speaker 2>creates a bottleneck, the fan-in problem. MapReduce

556
00:27:36.640 --> 00:27:39.720
<v Speaker 2>provides a framework for distributing the work across many machines,

557
00:27:39.960 --> 00:27:42.359
<v Speaker 2>handling failures automatically and managing the complexity.

558
00:27:42.359 --> 00:27:44.279
<v Speaker 1>But it can be unintuitive?

559
00:27:44.559 --> 00:27:48.640
<v Speaker 2>Converting some algorithms into map (process chunks) and reduce (aggregate

560
00:27:48.680 --> 00:27:52.279
<v Speaker 2>results) steps isn't always obvious, and how evenly the data

561
00:27:52.319 --> 00:27:55.920
<v Speaker 2>is spread across the machines is critical for performance.
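
NOTE
Editor's sketch of the word count example in map and reduce form. It runs in
one process to show the shape of the computation only; it does not use Hadoop
or any real MapReduce API, and all function names are made up.
```python
from collections import defaultdict
def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]
def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped
def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)
# Hypothetical chunks standing in for pieces of a much larger corpus.
chunks = ["data science is work", "data is messy"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```
In a real cluster each chunk would be mapped on a different machine and each
key reduced wherever the shuffle sends it, so one very common word can
overload a single reducer, which is the data-skew concern raised above.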

562
00:27:55.839 --> 00:27:59.960
<v Speaker 1>And Josh Wills's big data economic law?

563
00:28:00.359 --> 00:28:03.680
<v Speaker 2>Love that one. No single record is that valuable, but every record, the

564
00:28:03.720 --> 00:28:07.759
<v Speaker 2>whole collection is incredibly valuable. The aggregate tells the story.

565
00:28:07.440 --> 00:28:09.880
<v Speaker 1>And tools like Pregel and Mahout build on this?

566
00:28:10.079 --> 00:28:13.359
<v Speaker 2>Yeah, Pregel's for large-scale graph processing, Mahout for machine

567
00:28:13.440 --> 00:28:17.480
<v Speaker 2>learning algorithms implemented on top of Hadoop MapReduce, handling

568
00:28:17.480 --> 00:28:18.000
<v Speaker 2>the scale.

569
00:28:18.079 --> 00:28:21.200
<v Speaker 1>So wrapping up this deep dive, we've covered a lot.

570
00:28:21.279 --> 00:28:23.240
<v Speaker 1>The goal was really to show what it's like to

571
00:28:23.240 --> 00:28:25.720
<v Speaker 1>be a data scientist and also how some of this

572
00:28:25.839 --> 00:28:26.759
<v Speaker 1>work gets done.

573
00:28:26.839 --> 00:28:30.680
<v Speaker 2>From defining the field to specific algorithms and those tough

574
00:28:30.839 --> 00:28:31.599
<v Speaker 2>real-world challenges.

575
00:28:31.480 --> 00:28:35.200
<v Speaker 1>Right, and revisiting that core question: what is data science?

576
00:28:35.960 --> 00:28:39.559
<v Speaker 1>The book suggests it's maybe a set of best practices

577
00:28:39.640 --> 00:28:43.799
<v Speaker 1>used in tech companies tackling problems with data, sometimes scientifically.

578
00:28:43.759 --> 00:28:46.319
<v Speaker 2>But also, always be wary of the hype. It's not magic,

579
00:28:46.359 --> 00:28:46.960
<v Speaker 2>it's work.

580
00:28:47.200 --> 00:28:50.880
<v Speaker 1>It's a process, absolutely. And thinking about the future, the

581
00:28:50.920 --> 00:28:53.680
<v Speaker 1>hope for next-gen data scientists: it's not just about

582
00:28:53.680 --> 00:28:55.000
<v Speaker 1>technical skill or salary.

583
00:28:55.119 --> 00:28:58.599
<v Speaker 2>No, it's about being good problem solvers, asking the right questions,

584
00:28:59.000 --> 00:29:01.839
<v Speaker 2>thinking deeply about design, process, ethics.

585
00:29:02.119 --> 00:29:05.000
<v Speaker 1>Like Jeff Hammerbacher's famous quote about the best minds working

586
00:29:05.000 --> 00:29:08.799
<v Speaker 1>on clicking ads: "That sucks." The aspiration is to use

587
00:29:08.839 --> 00:29:13.000
<v Speaker 1>these powerful tools responsibly to make things better, not worse.

588
00:29:12.920 --> 00:29:16.640
<v Speaker 2>Which comes back to critical thinking, always remembering data does

589
00:29:16.680 --> 00:29:21.119
<v Speaker 2>not speak for itself. It needs interpretation, criticism, evaluation. It

590
00:29:21.200 --> 00:29:26.240
<v Speaker 2>requires dealing with messy, incomplete, often inconclusive data. It's a

591
00:29:26.319 --> 00:29:28.599
<v Speaker 2>human process full of judgment calls.

592
00:29:28.799 --> 00:29:31.799
<v Speaker 1>So for you, our listener, as you maybe explore data

593
00:29:31.839 --> 00:29:35.400
<v Speaker 1>science yourself, what does all this mean? It's complex, it's evolving,

594
00:29:35.440 --> 00:29:38.240
<v Speaker 1>it has huge potential and pitfalls. You've heard the practicalities,

595
00:29:38.279 --> 00:29:40.720
<v Speaker 1>the ethical tightropes, the human effort involved.

596
00:29:40.799 --> 00:29:42.839
<v Speaker 2>It's not just about the algorithms.

597
00:29:42.519 --> 00:29:45.559
<v Speaker 1>Definitely not. So here's something to think about as you

598
00:29:45.599 --> 00:29:49.519
<v Speaker 1>continue your journey. What ethical questions will you inevitably face

599
00:29:49.559 --> 00:29:52.440
<v Speaker 1>when you start applying these powerful tools. How will you

600
00:29:52.519 --> 00:29:55.400
<v Speaker 1>handle that tension between what data can show you and

601
00:29:55.440 --> 00:29:57.920
<v Speaker 1>what should be done with that knowledge, especially when different

602
00:29:57.920 --> 00:30:01.200
<v Speaker 1>people have different interests. Ultimately, how will you try to

603
00:30:01.279 --> 00:30:03.599
<v Speaker 1>ensure that your work, your insights are used to make

604
00:30:03.680 --> 00:30:06.039
<v Speaker 1>the world genuinely better and not just, you know, to

605
00:30:06.119 --> 00:30:06.920
<v Speaker 1>make people click
