WEBVTT

1
00:00:00.160 --> 00:00:03.160
<v Speaker 1>Welcome to the deep dive. You know, we hear so

2
00:00:03.319 --> 00:00:06.120
<v Speaker 1>much about the theory of data science, all the algorithms

3
00:00:06.120 --> 00:00:07.960
<v Speaker 1>in math. But today we're going to try and pull

4
00:00:08.000 --> 00:00:10.119
<v Speaker 1>back the curtain a bit, look at the real world.

5
00:00:10.679 --> 00:00:13.519
<v Speaker 1>And our guide for this is The Practitioner's Guide to

6
00:00:13.599 --> 00:00:17.039
<v Speaker 1>Data Science by Hui Lin and Ming Li. It feels less like

7
00:00:17.079 --> 00:00:21.600
<v Speaker 1>a textbook honestly, and more like an insider's view of

8
00:00:21.640 --> 00:00:24.359
<v Speaker 1>what it really takes to do data science day to day.

9
00:00:24.679 --> 00:00:26.719
<v Speaker 2>That's exactly right. What really jumped out at me was

10
00:00:26.760 --> 00:00:29.920
<v Speaker 2>how practical it is, you know, how grounded it is.

11
00:00:30.280 --> 00:00:32.320
<v Speaker 2>The authors, they don't just give you the what. They

12
00:00:32.359 --> 00:00:35.399
<v Speaker 2>really dig into the how, things like the soft skills

13
00:00:35.399 --> 00:00:38.159
<v Speaker 2>which often get missed, right, and the whole context of

14
00:00:38.200 --> 00:00:40.880
<v Speaker 2>the big data cloud environment. Yeah, that's huge. It's really

15
00:00:40.880 --> 00:00:44.920
<v Speaker 2>about, well, navigating the messiness of actual data projects. Okay.

16
00:00:44.960 --> 00:00:46.840
<v Speaker 1>Yeah, and it seems like they really push for hands

17
00:00:46.880 --> 00:00:48.799
<v Speaker 1>on learning. I just appreciate that they've got these R

18
00:00:48.840 --> 00:00:51.240
<v Speaker 1>and Python code notebooks all ready to go. You can

19
00:00:51.240 --> 00:00:54.920
<v Speaker 1>grab them on GitHub the links http LA three three

20
00:00:55.079 --> 00:00:57.719
<v Speaker 1>seven CD four's and they basically say, hey, get your

21
00:00:57.759 --> 00:00:59.560
<v Speaker 1>hands dirty, take this code, use your own data to

22
00:00:59.600 --> 00:01:01.200
<v Speaker 1>try it on problems.

23
00:01:01.200 --> 00:01:04.599
<v Speaker 2>Yeah, make it tangible, and that focus on reproducibility using

24
00:01:04.680 --> 00:01:07.239
<v Speaker 2>things like Google Colab. Yeah, that's so important. Now

25
00:01:07.280 --> 00:01:10.239
<v Speaker 2>it's not just about following steps, it's about giving you

26
00:01:10.280 --> 00:01:13.359
<v Speaker 2>the power to take these techniques and actually build something,

27
00:01:13.760 --> 00:01:16.640
<v Speaker 2>apply them to whatever challenges you're facing. It makes data

28
00:01:16.640 --> 00:01:19.200
<v Speaker 2>science feel like a real tool you can use, not

29
00:01:19.319 --> 00:01:20.280
<v Speaker 2>just concepts.

30
00:01:21.560 --> 00:01:23.120
<v Speaker 1>The book kicks off with a bit of history too,

31
00:01:23.159 --> 00:01:26.879
<v Speaker 1>which I've found pretty useful just for context. It traces

32
00:01:26.879 --> 00:01:32.519
<v Speaker 1>things from the early days like least squares, linear discriminant analysis,

33
00:01:32.959 --> 00:01:35.920
<v Speaker 1>the real foundations, all the way up to how cloud

34
00:01:35.959 --> 00:01:39.319
<v Speaker 1>computing just completely changed the game for data engineering and management.

35
00:01:39.359 --> 00:01:42.159
<v Speaker 1>It really shows how far we've come and fast.

36
00:01:42.200 --> 00:01:45.200
<v Speaker 2>Oh, definitely. Thinking about that evolution, the cloud, it's just

37
00:01:45.200 --> 00:01:47.680
<v Speaker 2>been massive, a total game changer. Suddenly you have access

38
00:01:47.719 --> 00:01:52.319
<v Speaker 2>to all this computing power, storage. It kind of democratized

39
00:01:52.400 --> 00:01:55.439
<v Speaker 2>working with huge data sets, you know, and that shifted

40
00:01:55.480 --> 00:01:57.760
<v Speaker 2>data engineering. It's less about physical boxes now and more

41
00:01:57.760 --> 00:02:01.079
<v Speaker 2>about orchestrating data pipelines up in the cloud. It's a fundamental shift.

42
00:02:01.280 --> 00:02:04.159
<v Speaker 1>Okay, So the authors then break down data science roles.

43
00:02:04.280 --> 00:02:09.520
<v Speaker 1>They talk about three main skill tracks: engineering, analysis, and modeling/inference.

44
00:02:10.240 --> 00:02:13.680
<v Speaker 1>So for engineering, it's about building the infrastructure right, the

45
00:02:13.759 --> 00:02:18.240
<v Speaker 1>data pipelines, automated collection, managing the data itself, the plumbing

46
00:02:18.319 --> 00:02:19.159
<v Speaker 1>basically right.

47
00:02:19.599 --> 00:02:22.520
<v Speaker 2>And it's so critical. You need that solid engineering foundation.

48
00:02:22.919 --> 00:02:25.240
<v Speaker 2>Everything else is built on top of it. If your

49
00:02:25.319 --> 00:02:29.919
<v Speaker 2>data infrastructure isn't reliable, well, the analysts and modelers, they

50
00:02:29.960 --> 00:02:33.240
<v Speaker 2>just can't do their jobs properly. It's often the unseen work,

51
00:02:33.439 --> 00:02:35.960
<v Speaker 2>but it's absolutely essential for getting good outcomes.

52
00:02:36.159 --> 00:02:38.360
<v Speaker 1>Then there's the analysis track. This sounds like it's really

53
00:02:38.360 --> 00:02:41.479
<v Speaker 1>about understanding the business side. What's the question, what's the

54
00:02:41.560 --> 00:02:44.879
<v Speaker 1>data telling us, and then translating that business need into

55
00:02:44.919 --> 00:02:48.000
<v Speaker 1>a data problem you can actually solve. The book really

56
00:02:48.080 --> 00:02:50.960
<v Speaker 1>hits hard on domain knowledge and communication skills here.

57
00:02:51.000 --> 00:02:55.120
<v Speaker 2>Exactly. Asking the right questions, understanding the business context is crucial.

58
00:02:55.520 --> 00:02:58.479
<v Speaker 2>The analyst is like a translator, you know, bridging the

59
00:02:58.520 --> 00:03:00.800
<v Speaker 2>gap between the business folks who have the problems and

60
00:03:00.840 --> 00:03:04.759
<v Speaker 2>the data scientists who might have solutions. It's definitely not

61
00:03:04.840 --> 00:03:08.000
<v Speaker 2>just about crunching numbers. It's about insights that lead to

62
00:03:08.120 --> 00:03:09.080
<v Speaker 2>actual decisions.

63
00:03:09.240 --> 00:03:13.280
<v Speaker 1>Okay, and finally, modeling/inference. This is where we get

64
00:03:13.280 --> 00:03:17.879
<v Speaker 1>into applying all the different learning methods. Supervised learning like

65
00:03:18.759 --> 00:03:23.520
<v Speaker 1>regression and classification for predictions, but also unsupervised learning for

66
00:03:23.680 --> 00:03:26.960
<v Speaker 1>finding patterns, and even causal inference trying to figure out

67
00:03:27.000 --> 00:03:27.639
<v Speaker 1>cause and effect.

68
00:03:27.759 --> 00:03:30.080
<v Speaker 2>Yeah, and the range of tools here is fascinating. You

69
00:03:30.080 --> 00:03:33.599
<v Speaker 2>can forecast trends, categorize things, try to understand why something

70
00:03:33.639 --> 00:03:36.319
<v Speaker 2>is happening. Each technique gives you a different lens on

71
00:03:36.360 --> 00:03:38.639
<v Speaker 2>the data. A good practitioner knows which tool to pull

72
00:03:38.680 --> 00:03:39.319
<v Speaker 2>out for which job.

73
00:03:39.199 --> 00:03:41.800
<v Speaker 1>Which brings us to the kinds of questions data

74
00:03:41.840 --> 00:03:48.439
<v Speaker 1>science can actually answer: prediction, classification, optimization, like forecasting sales, spotting fraud,

75
00:03:48.560 --> 00:03:51.919
<v Speaker 1>finding efficient routes. But, and I think this is really important:

76
00:03:51.960 --> 00:03:55.919
<v Speaker 1>The book also points out the limitations. It's not magic, right.

77
00:03:56.439 --> 00:04:01.759
<v Speaker 2>Being honest about that builds trust, manages expectations. Sometimes the

78
00:04:01.840 --> 00:04:04.479
<v Speaker 2>data just isn't there, or maybe the problem isn't really

79
00:04:04.520 --> 00:04:07.560
<v Speaker 2>a data problem at all. Knowing what data science can't

80
00:04:07.560 --> 00:04:10.000
<v Speaker 2>do is just as important as knowing what it can.

81
00:04:10.280 --> 00:04:12.680
<v Speaker 1>They also talk a bit about team structure, like should

82
00:04:12.680 --> 00:04:15.479
<v Speaker 1>you build your own team or outsource, and they stress

83
00:04:15.520 --> 00:04:18.720
<v Speaker 1>how vital collaboration is across different departments. It seems like

84
00:04:18.759 --> 00:04:21.920
<v Speaker 1>a data scientist working alone probably won't get very far.

85
00:04:22.079 --> 00:04:25.480
<v Speaker 2>Oh absolutely not. Data science is inherently collaborative. It has

86
00:04:25.519 --> 00:04:28.399
<v Speaker 2>to be. You need domain experts to frame the problem right,

87
00:04:28.439 --> 00:04:31.800
<v Speaker 2>You need engineering support, you need buy in from leaders

88
00:04:31.800 --> 00:04:34.079
<v Speaker 2>so the insights actually get used. It doesn't matter if

89
00:04:34.079 --> 00:04:37.120
<v Speaker 2>the team is internal or external. Those connections are fundamental.

90
00:04:37.399 --> 00:04:39.879
<v Speaker 1>Now, the book introduces this idea I found really neat,

91
00:04:39.879 --> 00:04:43.279
<v Speaker 1>the three pillars of knowledge for a data scientist. First,

92
00:04:43.360 --> 00:04:47.639
<v Speaker 1>the core analytics stuff, stats, machine learning techniques, the tools. Second,

93
00:04:47.720 --> 00:04:52.079
<v Speaker 1>domain knowledge plus collaboration, communication, leadership skills, and the third

94
00:04:52.120 --> 00:04:55.199
<v Speaker 1>pillar that's big data management and the IT skills for

95
00:04:55.240 --> 00:04:56.399
<v Speaker 1>the modern cloud world.

96
00:04:57.560 --> 00:05:01.240
<v Speaker 2>Those three pillars really capture how multifaceted data science

97
00:05:01.319 --> 00:05:03.480
<v Speaker 2>is now. You can't just be good at one or two.

98
00:05:03.839 --> 00:05:05.920
<v Speaker 2>You really need a solid base in all three to

99
00:05:06.000 --> 00:05:11.120
<v Speaker 2>handle complex real world projects. Think about say, predicting customer churn.

100
00:05:11.680 --> 00:05:14.399
<v Speaker 2>You need the analytics chops for the model, but you

101
00:05:14.439 --> 00:05:17.399
<v Speaker 2>also need the domain knowledge to know which customer behaviors matter,

102
00:05:18.959 --> 00:05:21.240
<v Speaker 2>and you need the IT skills to actually get and

103
00:05:21.319 --> 00:05:22.839
<v Speaker 2>process all that data from the cloud.

104
00:05:23.000 --> 00:05:25.519
<v Speaker 1>Okay, then the book gets into the actual project cycle.

105
00:05:25.920 --> 00:05:29.519
<v Speaker 1>It breaks projects down by type, like offline training with offline application,

106
00:05:29.680 --> 00:05:33.680
<v Speaker 1>offline training with online application, online training with online application. It's interesting

107
00:05:33.680 --> 00:05:36.399
<v Speaker 1>how the tech needs and the business value change depending

108
00:05:36.439 --> 00:05:38.120
<v Speaker 1>on whether it's real time or batch.

109
00:05:38.319 --> 00:05:41.160
<v Speaker 2>Yeah, that's a really practical way to categorize them. Knowing

110
00:05:41.240 --> 00:05:44.000
<v Speaker 2>upfront if you need a weekly report versus say a

111
00:05:44.079 --> 00:05:47.639
<v Speaker 2>real time recommendation engine on a website, well that changes everything.

112
00:05:47.680 --> 00:05:50.839
<v Speaker 2>How you get data, how you prep it, model, test, deploy.

113
00:05:51.560 --> 00:05:52.399
<v Speaker 2>It all follows from that.

114
00:05:52.680 --> 00:05:55.519
<v Speaker 1>And the book really hammers home the importance of those

115
00:05:55.560 --> 00:06:00.360
<v Speaker 1>early stages: problem formulation and project planning. They stress using

116
00:06:00.439 --> 00:06:03.480
<v Speaker 1>data in the planning, really understanding the business value and

117
00:06:03.600 --> 00:06:06.600
<v Speaker 1>why data scientists have to be involved early. It helps

118
00:06:06.639 --> 00:06:11.800
<v Speaker 1>avoid solving completely the wrong problem or setting totally unrealistic timelines.

119
00:06:12.120 --> 00:06:15.000
<v Speaker 2>That's exactly where projects can go off the rails, right

120
00:06:15.040 --> 00:06:18.040
<v Speaker 2>at the start. Spending that time up front to clearly

121
00:06:18.040 --> 00:06:21.399
<v Speaker 2>define the business problem, figure out the desired outcome, make

122
00:06:21.439 --> 00:06:26.519
<v Speaker 2>a realistic plan. It's foundational. Data scientists bring that unique perspective.

123
00:06:26.560 --> 00:06:29.399
<v Speaker 2>They understand the business and what's actually possible with the data.

124
00:06:29.439 --> 00:06:31.680
<v Speaker 1>And when they talk about project modeling, it's described as

125
00:06:31.800 --> 00:06:34.959
<v Speaker 1>very iterative, not just picking a model. It involves all

126
00:06:35.000 --> 00:06:39.240
<v Speaker 1>that hard work, data cleaning, wrangling, exploratory analysis to really

127
00:06:39.279 --> 00:06:42.560
<v Speaker 1>get the data. Then translating the business problem into stats

128
00:06:42.639 --> 00:06:45.720
<v Speaker 1>or machine learning terms. It's rarely finding the perfect model

129
00:06:45.759 --> 00:06:46.360
<v Speaker 1>first try.

130
00:06:46.600 --> 00:06:49.759
<v Speaker 2>That iterative nature is totally key. It's just how it works.

131
00:06:49.920 --> 00:06:53.079
<v Speaker 2>You break down big problems into smaller analytical questions, apply

132
00:06:53.199 --> 00:06:57.040
<v Speaker 2>different methods. You need feedback loops, communication. You've got to

133
00:06:57.040 --> 00:06:58.839
<v Speaker 2>be willing to learn and adjust as you go.

134
00:06:59.560 --> 00:07:02.399
<v Speaker 1>Finally, in the intro section, they flag a couple of super

135
00:07:02.399 --> 00:07:05.439
<v Speaker 1>common mistakes. We mentioned solving the wrong problem, but the

136
00:07:05.480 --> 00:07:09.560
<v Speaker 1>second one is underestimating timelines. They say that data exploration

137
00:07:09.720 --> 00:07:13.600
<v Speaker 1>and prep, the unglamorous stuff, can eat up like sixty

138
00:07:13.639 --> 00:07:15.759
<v Speaker 1>to eighty percent of the total project time.

139
00:07:15.920 --> 00:07:18.839
<v Speaker 2>Whoa, yeah, that number really hits home, doesn't it? It

140
00:07:18.920 --> 00:07:21.399
<v Speaker 2>highlights all that hidden effort needed just to get raw,

141
00:07:21.519 --> 00:07:24.279
<v Speaker 2>messy data ready for modeling. If you don't budget time

142
00:07:24.319 --> 00:07:27.000
<v Speaker 2>for that wrangling and exploring properly, your project's almost certainly

143
00:07:27.040 --> 00:07:31.079
<v Speaker 2>going to hit delays or worse, you build on shaky data.

144
00:07:31.160 --> 00:07:33.360
<v Speaker 1>Okay, let's shift gears a bit and dig into some

145
00:07:33.399 --> 00:07:37.839
<v Speaker 1>of the more technical details. Starting with data preprocessing. The

146
00:07:37.879 --> 00:07:40.079
<v Speaker 1>book spends a good amount of time here, and well,

147
00:07:40.240 --> 00:07:42.720
<v Speaker 1>like we just said, raw data is rarely model ready.

148
00:07:43.279 --> 00:07:46.800
<v Speaker 1>Data cleaning is usually step one, right? Finding and dealing

149
00:07:46.800 --> 00:07:50.800
<v Speaker 1>with weird stuff: negative ages, percentages over one hundred. The

150
00:07:50.839 --> 00:07:53.800
<v Speaker 1>book talks about different strategies like just deleting those rows

151
00:07:54.000 --> 00:07:57.480
<v Speaker 1>or maybe treating them as missing values and imputing them later.

152
00:07:58.040 --> 00:08:00.800
<v Speaker 2>And that's a strategic choice. When do you delete

153
00:08:00.879 --> 00:08:04.360
<v Speaker 2>versus impute? The book suggests if your data set's big

154
00:08:04.439 --> 00:08:07.720
<v Speaker 2>enough and the bad data seems random, maybe deletion is okay,

155
00:08:08.240 --> 00:08:11.319
<v Speaker 2>but imputation lets you keep more data. They cover simple

156
00:08:11.360 --> 00:08:15.120
<v Speaker 2>methods, mean, median, mode, and more complex ones like k-nearest neighbors.

157
00:08:15.360 --> 00:08:18.399
<v Speaker 2>They even point to the impute function in R's imputeMissings

158
00:08:18.399 --> 00:08:19.399
<v Speaker 2>package for that.

159
00:08:19.360 --> 00:08:22.600
<v Speaker 1>Right, and that leads straight into missing values generally,

160
00:08:22.639 --> 00:08:25.560
<v Speaker 1>which are just everywhere in real data. Again, imputation is key.

161
00:08:25.800 --> 00:08:29.000
<v Speaker 1>They detail the basic methods, kNN, even mention maybe using

162
00:08:29.079 --> 00:08:31.519
<v Speaker 1>bagging trees for imputation sometimes.

163
00:08:31.199 --> 00:08:33.799
<v Speaker 2>Yeah, and the imputation method you choose, it can actually

164
00:08:33.840 --> 00:08:37.159
<v Speaker 2>affect your model's performance down the line. Like the book says,

165
00:08:37.519 --> 00:08:41.879
<v Speaker 2>simple mean imputation ignores relationships between variables and can kind

166
00:08:41.879 --> 00:08:45.039
<v Speaker 2>of distort things, especially if lots of data is missing.

167
00:08:45.960 --> 00:08:48.440
<v Speaker 2>More advanced methods try to use those relationships to make

168
00:08:48.480 --> 00:08:49.159
<v Speaker 2>better guesses.

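NOTE
A minimal Python sketch of the simple median imputation discussed here.
It is not taken from the book, which works in R (for example with
imputeMissings); the function name and the toy ages are illustrative
assumptions. Like mean imputation, it ignores relationships between
variables, which is the weakness the speakers point out.
```python
from statistics import median
def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]
ages = [23, None, 31, 45, None, 38]  # toy data, None marks a missing value
print(impute_median(ages))
```
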
169
00:08:49.679 --> 00:08:53.240
<v Speaker 1>Centering and scaling also get covered, basically getting all your

170
00:08:53.279 --> 00:08:57.200
<v Speaker 1>variables onto a similar scale. The book mentions preProcess

171
00:08:57.440 --> 00:09:01.480
<v Speaker 1>in R's caret package using center and scale, and this is super

172
00:09:01.519 --> 00:09:04.320
<v Speaker 1>important for lots of algorithms that are sensitive to how

173
00:09:04.360 --> 00:09:08.360
<v Speaker 1>big the numbers are. Like imagine comparing height in centimeters

174
00:09:08.679 --> 00:09:12.519
<v Speaker 1>and income in thousands of dollars, totally different scales. Right,

175
00:09:12.960 --> 00:09:15.320
<v Speaker 1>some algorithms might just focus on the income because the

176
00:09:15.399 --> 00:09:15.960
<v Speaker 1>numbers are bigger.

177
00:09:15.879 --> 00:09:20.399
<v Speaker 2>Exactly. Algorithms like gradient descent, which trains so many models,

178
00:09:20.519 --> 00:09:23.240
<v Speaker 2>they just work better, converge faster when features are on

179
00:09:23.279 --> 00:09:26.679
<v Speaker 2>a similar scale. It stops variables with big ranges from

180
00:09:26.759 --> 00:09:29.240
<v Speaker 2>just dominating the learning process unfairly.

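NOTE
A minimal Python sketch of centering and scaling, assuming the usual
recipe of subtracting the mean and dividing by the sample standard
deviation. The function name and the toy heights and incomes are
illustrative assumptions, not code from the book, which uses R.
```python
from statistics import mean, stdev
def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]              # toy data
incomes_usd = [30000.0, 45000.0, 60000.0, 75000.0, 90000.0]   # toy data
# After standardizing, both features sit on the same unitless scale.
print(standardize(heights_cm))
print(standardize(incomes_usd))
```
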
181
00:09:29.519 --> 00:09:33.759
<v Speaker 1>Okay, next up: skewness and outliers. The book talks about

182
00:09:33.840 --> 00:09:38.279
<v Speaker 1>using visualizations, box plots, histograms to spot these, and also

183
00:09:38.720 --> 00:09:42.480
<v Speaker 1>statistical methods like Z scores or the modified Z score

184
00:09:42.559 --> 00:09:45.799
<v Speaker 1>using the mad function in R. Finding these is important

185
00:09:45.840 --> 00:09:48.600
<v Speaker 1>because they can really mess up certain models, like one

186
00:09:48.720 --> 00:09:51.799
<v Speaker 1>huge income could totally skew the average and mislead a

187
00:09:51.879 --> 00:09:52.559
<v Speaker 1>linear model.

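NOTE
A minimal Python sketch of the modified z-score idea mentioned here:
distance from the median, scaled by the median absolute deviation (MAD).
The 0.6745 constant and the 3.5 cutoff are a commonly cited rule of
thumb and are assumptions in this sketch, as are the names and the toy
incomes. R's mad() applies a 1.4826 factor by default, which is the
reciprocal of 0.6745. The sketch assumes the MAD is not zero.
```python
from statistics import median
def modified_z_scores(values):
    """0.6745 * (x - median) / MAD, with MAD the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [0.6745 * (v - med) / mad for v in values]
incomes = [42, 45, 47, 50, 52, 55, 400]  # toy data in thousands; 400 stands out
flagged = [x for x, z in zip(incomes, modified_z_scores(incomes)) if abs(z) > 3.5]
print(flagged)
```
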
188
00:09:52.720 --> 00:09:55.799
<v Speaker 2>Yeah, and knowing how different models react to outliers is key.

189
00:09:56.399 --> 00:10:01.559
<v Speaker 2>Linear regression, logistic regression: pretty sensitive. Tree-based models: usually more

190
00:10:01.679 --> 00:10:04.840
<v Speaker 2>robust. And the book rightly says outliers aren't always errors.

191
00:10:04.840 --> 00:10:07.879
<v Speaker 2>They could be real, just unusual, so deciding what to

192
00:10:07.919 --> 00:10:11.440
<v Speaker 2>do, remove, transform, leave alone, needs thought, maybe domain knowledge.

193
00:10:11.519 --> 00:10:14.080
<v Speaker 2>They mention transformations like spatialSign in R that can

194
00:10:14.159 --> 00:10:16.639
<v Speaker 2>kind of dampen the influence of outliers without removing them.

195
00:10:16.879 --> 00:10:19.720
<v Speaker 1>Collinearity is another big one, when your predictor variables are

196
00:10:19.799 --> 00:10:22.720
<v Speaker 1>highly correlated with each other. The book points to findCorrelation

197
00:10:22.759 --> 00:10:26.679
<v Speaker 1>in caret for finding these. If predictors are too correlated,

198
00:10:26.720 --> 00:10:30.279
<v Speaker 1>it makes model coefficients unstable and hard to interpret, like

199
00:10:30.360 --> 00:10:33.279
<v Speaker 1>trying to separate the effect of Facebook ad spend from Instagram

200
00:10:33.320 --> 00:10:35.759
<v Speaker 1>ad spend if they always move together.

201
00:10:36.519 --> 00:10:40.519
<v Speaker 2>Precisely. High multicollinearity inflates the variance of coefficient estimates in

202
00:10:40.600 --> 00:10:43.600
<v Speaker 2>linear models, makes it hard to see the independent effect

203
00:10:43.600 --> 00:10:46.600
<v Speaker 2>of each variable, so you might remove one variable, combine them,

204
00:10:46.919 --> 00:10:49.320
<v Speaker 2>or use dimensionality reduction techniques to handle it.

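NOTE
A minimal Python sketch of checking two predictors for collinearity with
the Pearson correlation. It is not the algorithm behind caret's
findCorrelation; the 0.75 cutoff, the function name, and the toy ad
spend figures are illustrative assumptions.
```python
from math import sqrt
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
facebook_spend = [10, 12, 15, 18, 20, 25]   # toy data
instagram_spend = [5, 6, 8, 9, 10, 13]      # toy data, moves almost in lockstep
r = pearson(facebook_spend, instagram_spend)
print(round(r, 3), "highly correlated" if abs(r) > 0.75 else "ok")
```
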
205
00:10:49.360 --> 00:10:52.720
<v Speaker 1>They also cover sparse variables, predictors that barely change across

206
00:10:52.759 --> 00:10:56.679
<v Speaker 1>the data set, very low variance. nearZeroVar in

207
00:10:56.759 --> 00:11:02.120
<v Speaker 1>caret helps find these based on unique values and frequency ratios. Basically,

208
00:11:02.159 --> 00:11:04.679
<v Speaker 1>if a variable is almost constant, it's not really helping

209
00:11:04.679 --> 00:11:05.919
<v Speaker 1>your model tell things apart.

210
00:11:06.080 --> 00:11:08.960
<v Speaker 2>Yeah, they're just not adding much information. Removing them can

211
00:11:09.000 --> 00:11:12.279
<v Speaker 2>simplify the model, make it more stable, maybe train faster

212
00:11:12.639 --> 00:11:14.000
<v Speaker 2>without really hurting performance.

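NOTE
A minimal Python sketch of the two checks described here: the frequency
ratio of the most common value to the runner-up, and the percentage of
unique values. The cutoffs (95/5 and 10 percent) are meant to mirror the
defaults commonly documented for caret's nearZeroVar, but treat them,
the function name, and the toy columns as assumptions.
```python
from collections import Counter
def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a predictor that is (almost) constant across the data set."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a single distinct value carries no information
    freq_ratio = counts[0][1] / counts[1][1]           # most common vs. runner-up
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut
mostly_zero = [0] * 98 + [1] * 2     # toy data: ratio 49, 2 percent unique
balanced = [0] * 50 + [1] * 50       # toy data: ratio 1
print(near_zero_variance(mostly_zero), near_zero_variance(balanced))
```
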
213
00:11:14.240 --> 00:11:19.039
<v Speaker 1>And the last preprocessing step mentioned is re-encoding dummy variables.

214
00:11:19.360 --> 00:11:22.639
<v Speaker 1>That's just converting categorical things like colors, product types into

215
00:11:22.720 --> 00:11:26.879
<v Speaker 1>numbers, usually binary, zeros and ones, so algorithms can understand them.

216
00:11:27.039 --> 00:11:31.279
<v Speaker 2>Fundamental step for categorical data. It creates those binary dummy variables

217
00:11:31.279 --> 00:11:33.120
<v Speaker 2>so the model can treat each category as its own

218
00:11:33.159 --> 00:11:35.360
<v Speaker 2>feature and learn its relationship to the outcome.

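NOTE
A minimal Python sketch of dummy (one-hot) encoding as described here.
The function name and the toy colors are illustrative assumptions. For
linear models, one level is usually dropped to avoid perfect
collinearity among the indicator columns; this sketch keeps all levels.
```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}
colors = ["red", "blue", "red", "green"]  # toy data
for name, column in one_hot(colors).items():
    print(name, column)
```
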
219
00:11:35.639 --> 00:11:40.080
<v Speaker 1>Okay, shifting now to data wrangling. The book really highlights

220
00:11:40.120 --> 00:11:44.039
<v Speaker 1>R's dplyr package for manipulating data. They go through functions

221
00:11:44.080 --> 00:11:50.200
<v Speaker 1>like select for picking columns, filter for rows, arrange for sorting,

222
00:11:50.360 --> 00:11:54.039
<v Speaker 1>mutate for making new variables, and summarize with group_by

223
00:11:54.080 --> 00:11:57.519
<v Speaker 1>for calculating stats across groups. They even give a customer

224
00:11:57.559 --> 00:12:00.840
<v Speaker 1>segmentation example showing how you'd use these to summarize metrics for

225
00:12:00.840 --> 00:12:04.840
<v Speaker 1>different customer types, like average age, spending, transaction counts for

226
00:12:05.039 --> 00:12:08.320
<v Speaker 1>say conspicuous versus price conscious customers.

227
00:12:08.559 --> 00:12:11.840
<v Speaker 2>Oh yeah, dplyr really changed the game for data manipulation

228
00:12:11.919 --> 00:12:15.080
<v Speaker 2>in R. The syntax is just so intuitive, makes common tasks

229
00:12:15.159 --> 00:12:18.399
<v Speaker 2>much clearer and more efficient. That customer segmentation example is

230
00:12:18.440 --> 00:12:20.679
<v Speaker 2>great, shows exactly how you use these tools to pull

231
00:12:20.679 --> 00:12:22.960
<v Speaker 2>out meaningful insights about different groups in your data.

232
00:12:23.039 --> 00:12:24.639
<v Speaker 1>They do give a quick nod to base R

233
00:12:24.679 --> 00:12:29.039
<v Speaker 1>functions too, like apply, lapply, sapply, acknowledging that while dplyr

234
00:12:29.120 --> 00:12:31.840
<v Speaker 1>is great, sometimes you need the flexibility of the base

235
00:12:31.879 --> 00:12:33.360
<v Speaker 1>functions for trickier stuff.

236
00:12:33.480 --> 00:12:36.000
<v Speaker 2>Right. dplyr streamlines a lot, but base R gives you

237
00:12:36.039 --> 00:12:39.360
<v Speaker 2>that fine-grained control for maybe more complex or custom operations.

238
00:12:39.600 --> 00:12:41.080
<v Speaker 2>It's good to know both, really.

239
00:12:41.039 --> 00:12:43.159
<v Speaker 1>All right, let's talk model tuning. The book starts with the

240
00:12:43.240 --> 00:12:47.559
<v Speaker 1>classic variance bias tradeoff, the idea that a really complex

241
00:12:47.639 --> 00:12:51.120
<v Speaker 1>model might fit your training data perfectly, low bias, but

242
00:12:51.240 --> 00:12:53.360
<v Speaker 1>then it fails on new data because it learned the

243
00:12:53.399 --> 00:12:57.519
<v Speaker 1>noise, high variance, overfitting, while a too-simple model won't

244
00:12:57.519 --> 00:13:01.000
<v Speaker 1>even capture the basic patterns, high bias, underfitting. Tuning is

245
00:13:01.039 --> 00:13:03.000
<v Speaker 1>finding that balance for good generalization.

246
00:13:03.279 --> 00:13:05.720
<v Speaker 2>Yeah, that's like machine learning one oh one, isn't it?

247
00:13:06.159 --> 00:13:08.720
<v Speaker 2>But absolutely crucial. You want the model to learn the

248
00:13:08.759 --> 00:13:13.360
<v Speaker 2>real signal, not the random noise. Overfitting is like memorizing

249
00:13:13.440 --> 00:13:17.360
<v Speaker 2>test answers. Great for that test, useless otherwise. Underfitting is

250
00:13:17.440 --> 00:13:19.000
<v Speaker 2>like not studying at all.

251
00:13:19.080 --> 00:13:21.639
<v Speaker 1>And data splitting and resampling are the main tools for managing

252
00:13:21.679 --> 00:13:24.080
<v Speaker 1>this trade-off. The book talks about the basic train-test

253
00:13:24.200 --> 00:13:27.720
<v Speaker 1>split, build on training data, evaluate on unseen test data.

254
00:13:28.159 --> 00:13:31.840
<v Speaker 1>It also mentions a fancier technique, maximum dissimilarity sampling, using

255
00:13:31.879 --> 00:13:34.440
<v Speaker 1>maxDissim in caret. The goal there is to make the

256
00:13:34.480 --> 00:13:39.320
<v Speaker 1>test set really diverse, covering more possibilities. Simple random splitting

257
00:13:39.360 --> 00:13:42.279
<v Speaker 1>can sometimes give you unrepresentative train or test sets just

258
00:13:42.320 --> 00:13:47.240
<v Speaker 1>by luck, which biases your performance estimate. Maximum dissimilarity sampling

259
00:13:47.320 --> 00:13:49.759
<v Speaker 1>tries to build a test set that really spans the

260
00:13:49.840 --> 00:13:52.759
<v Speaker 1>range of your data, giving a more robust evaluation of

261
00:13:52.840 --> 00:13:55.480
<v Speaker 1>how the model might do in the wild. Then you

262
00:13:55.519 --> 00:13:59.240
<v Speaker 1>have resampling methods for getting more stable performance estimates, especially

263
00:13:59.240 --> 00:14:03.120
<v Speaker 1>with limited data. The book covers cross validation like k-fold

264
00:14:03.120 --> 00:14:06.960
<v Speaker 1>and bootstrapping. With k-fold, you split data into k parts,

265
00:14:07.279 --> 00:14:09.919
<v Speaker 1>train on k minus one, test on the last one, repeat

266
00:14:09.960 --> 00:14:13.519
<v Speaker 1>k times and average the results. Gives a more reliable picture.

267
00:14:13.919 --> 00:14:17.240
<v Speaker 2>Yeah, resampling is invaluable for confidence in your performance metrics.

268
00:14:17.600 --> 00:14:20.679
<v Speaker 2>Cross validation avoids the risk of getting a misleading score

269
00:14:20.960 --> 00:14:24.399
<v Speaker 2>just from one lucky or unlucky train-test split. Bootstrapping

270
00:14:24.440 --> 00:14:27.759
<v Speaker 2>involves resampling with replacement to create lots of simulated data

271
00:14:27.799 --> 00:14:30.639
<v Speaker 2>sets, then training and testing on those. It gives you

272
00:14:30.679 --> 00:14:33.240
<v Speaker 2>a sense of the stability and uncertainty around your metrics.

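NOTE
A minimal Python sketch of the k-fold splitting described here: shuffle
the row indices, cut them into k parts, and hold out each part once
while training on the other k minus one. The function name, the seed,
and the sizes are illustrative assumptions; model fitting and score
averaging are left out.
```python
import random
def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
for train, test in k_fold_indices(n=10, k=5):
    print(len(train), len(test))   # every row is held out exactly once
```
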
273
00:14:33.240 --> 00:14:36.200
<v Speaker 1>So how do we actually measure performance? The book says

274
00:14:36.240 --> 00:14:40.679
<v Speaker 1>it depends if it's regression or classification. For regression predicting numbers,

275
00:14:40.759 --> 00:14:44.919
<v Speaker 1>common metrics are RMSE, root mean squared error, tells you the

276
00:14:44.960 --> 00:14:48.360
<v Speaker 1>average error size, and R squared, the proportion of variance explained.

277
00:14:48.759 --> 00:14:51.879
<v Speaker 1>Though they caution that high R squared isn't everything, and

278
00:14:51.919 --> 00:14:57.080
<v Speaker 1>they mention adjusted R squared, which penalizes extra unhelpful predictors.

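NOTE
A minimal Python sketch of the three regression metrics named here,
using their standard textbook formulas. The function names, the toy
values, and the choice of p = 2 predictors are illustrative assumptions.
```python
from math import sqrt
def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
def r_squared(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
def adjusted_r_squared(r2, n, p):
    """Penalize R squared for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]   # toy data
y_pred = [2.5, 5.5, 7.0, 8.0, 11.5]   # toy predictions
r2 = r_squared(y_true, y_pred)
print(rmse(y_true, y_pred), r2, adjusted_r_squared(r2, n=5, p=2))
```
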
279
00:14:57.159 --> 00:14:59.679
<v Speaker 2>Right. RMSE is nice because it's in the same units

280
00:14:59.679 --> 00:15:02.720
<v Speaker 2>as your target variable, easy to grasp. R squared tells

281
00:15:02.720 --> 00:15:04.639
<v Speaker 2>you how much better your model is than just guessing

282
00:15:04.679 --> 00:15:07.240
<v Speaker 2>the average, but yeah, doesn't guarantee it's a good model

283
00:15:07.399 --> 00:15:11.759
<v Speaker 2>or will generalize. Adjusted R squared pushes towards simpler models,

284
00:15:11.759 --> 00:15:12.639
<v Speaker 2>which is often good.

285
00:15:12.399 --> 00:15:16.080
<v Speaker 1>For classification, predicting categories, the book gets into the

286
00:15:16.120 --> 00:15:19.879
<v Speaker 1>confusion matrix, true positives, false positives, etc., and metrics like

287
00:15:20.080 --> 00:15:24.080
<v Speaker 1>accuracy, specificity, finding the true negatives, and the Kappa statistic.

288
00:15:24.480 --> 00:15:28.600
<v Speaker 1>Kappa, using Kappa.test in R's fmsb package, measures

289
00:15:28.639 --> 00:15:30.519
<v Speaker 1>agreement beyond what you'd expect by chance.

290
00:15:31.039 --> 00:15:33.679
<v Speaker 2>Useful, yeah. The confusion matrix breaks it all down, not

291
00:15:33.759 --> 00:15:35.919
<v Speaker 2>just if the model was right, but how it was wrong.

292
00:15:36.879 --> 00:15:40.320
<v Speaker 2>Simple accuracy can be really misleading with unbalanced classes. If

293
00:15:40.399 --> 00:15:43.399
<v Speaker 2>ninety nine percent are negative, a dumb model predicting negative

294
00:15:43.399 --> 00:15:47.399
<v Speaker 2>all the time gets ninety nine percent accuracy. Specificity, Kappa,

295
00:15:47.440 --> 00:15:49.799
<v Speaker 2>they give a much better picture, especially Kappa accounting for

296
00:15:49.879 --> 00:15:50.559
<v Speaker 2>chance agreement.

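NOTE
A minimal Python sketch of the point made here, using the standard
definition of Cohen's kappa: observed agreement minus chance agreement,
divided by one minus chance agreement. The function name and the counts
are illustrative assumptions. With 99 percent negatives and a model that
always predicts negative, accuracy is 0.99 but kappa is 0.
```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Accuracy and Cohen's kappa from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    if expected == 1:
        return observed, 0.0  # kappa is undefined here; this sketch reports 0
    return observed, (observed - expected) / (1 - expected)
# Toy case from the discussion: 990 negatives, 10 positives, always "negative".
print(accuracy_and_kappa(tp=0, fp=0, fn=10, tn=990))
```
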
297
00:15:50.799 --> 00:15:54.200
<v Speaker 1>They also cover ROC curves and AUC area under the

298
00:15:54.240 --> 00:15:58.120
<v Speaker 1>curve, using pROC::roc in R. That helps evaluate

299
00:15:58.159 --> 00:16:02.120
<v Speaker 1>classifiers across different thresholds, and gain and lift charts, which

300
00:16:02.120 --> 00:16:04.799
<v Speaker 1>are more business focused. They show how much better your

301
00:16:04.840 --> 00:16:08.000
<v Speaker 1>model is at finding positive cases compared to just random selection.

302
00:16:08.480 --> 00:16:12.759
<v Speaker 2>For marketing campaigns. ROC curves visualize that trade-off between

303
00:16:12.799 --> 00:16:15.960
<v Speaker 2>finding true positives and avoiding false positives as you change

304
00:16:15.960 --> 00:16:19.039
<v Speaker 2>the decision threshold. Higher AUC is generally better. Gain and

305
00:16:19.080 --> 00:16:21.399
<v Speaker 2>lift charts translate that into business terms: how much more

306
00:16:21.440 --> 00:16:23.720
<v Speaker 2>efficiently can you reach your target audience using the model?

307
00:16:24.080 --> 00:16:25.279
<v Speaker 2>Very practical. Okay.

308
00:16:25.320 --> 00:16:28.279
<v Speaker 1>Finally, the book walks through a bunch of different regression models.

309
00:16:28.639 --> 00:16:32.440
<v Speaker 1>Starts with the basics, ordinary least squares (OLS) linear regression,

310
00:16:32.919 --> 00:16:37.919
<v Speaker 1>covers its assumptions (linearity, independence, constant error variance, normal residuals),

311
00:16:37.960 --> 00:16:40.919
<v Speaker 1>and diagnostic plots to check them. Then moves to things

312
00:16:40.960 --> 00:16:45.919
<v Speaker 1>like principal component regression (PCR) and partial least squares (PLS)

313
00:16:46.159 --> 00:16:49.159
<v Speaker 1>for handling many possibly correlated predictors.

314
00:16:49.519 --> 00:16:53.279
<v Speaker 2>Understanding those OLS assumptions is so important for trusting the results.

315
00:16:53.559 --> 00:16:56.639
<v Speaker 2>If they're violated, your coefficients and predictions might be off.

316
00:16:57.000 --> 00:17:00.840
<v Speaker 2>Diagnostics help check that. PCR and PLS are great dimensionality

317
00:17:00.840 --> 00:17:06.079
<v Speaker 2>reduction tools, especially when multicollinearity makes standard linear regression unstable.

318
00:17:06.200 --> 00:17:10.160
<v Speaker 1>It also covers regularization methods: ridge, LASSO, and elastic net.

319
00:17:10.359 --> 00:17:14.519
<v Speaker 1>These shrink coefficients to prevent overfitting and handle collinearity, and LASSO

320
00:17:14.920 --> 00:17:17.920
<v Speaker 1>can even do feature selection by zeroing out some coefficients

321
00:17:18.359 --> 00:17:19.799
<v Speaker 1>(it mentions enet and glmnet).

322
00:17:19.839 --> 00:17:22.799
<v Speaker 2>From there, yeah, regularization is super powerful for building more

323
00:17:22.880 --> 00:17:26.000
<v Speaker 2>robust models, especially with lots of features. Ridge shrinks everything

324
00:17:26.000 --> 00:17:29.079
<v Speaker 2>towards zero; LASSO can force some coefficients to zero, doing

325
00:17:29.119 --> 00:17:31.920
<v Speaker 2>automatic feature selection. Elastic net is a mix of both.

326
00:17:32.279 --> 00:17:36.440
<v Speaker 1>Then tree-based methods get introduced: decision trees plus ensembles

327
00:17:36.480 --> 00:17:40.680
<v Speaker 1>like bagging (treebag in caret), random forests (rf in caret),

328
00:17:41.039 --> 00:17:44.559
<v Speaker 1>and gradient boosted machines (gbm in caret). These are good

329
00:17:44.559 --> 00:17:47.839
<v Speaker 1>for nonlinear patterns and handling different data types. Touches on

330
00:17:47.880 --> 00:17:52.599
<v Speaker 1>splitting criteria (Gini, information gain) and pruning to avoid overfitting.

331
00:17:52.920 --> 00:17:56.880
<v Speaker 2>Trees are incredibly versatile, often top performers, great at finding

332
00:17:56.880 --> 00:18:00.559
<v Speaker 2>complex patterns without needing tons of feature engineering. Ensembles like

333
00:18:00.599 --> 00:18:03.279
<v Speaker 2>random forests and gradient boosting combine many trees to get

334
00:18:03.279 --> 00:18:07.000
<v Speaker 2>even better, more stable predictions. Understanding splitting and pruning is

335
00:18:07.079 --> 00:18:08.400
<v Speaker 2>key to making them work well.

336
00:18:08.480 --> 00:18:11.200
<v Speaker 1>And lastly, a quick intro to deep learning: feedforward

337
00:18:11.279 --> 00:18:16.880
<v Speaker 1>neural networks (FFNNs), convolutional neural networks (CNNs) for images, recurrent

338
00:18:16.920 --> 00:18:21.279
<v Speaker 1>neural networks (RNNs) for sequences like text. Briefly covers applications,

339
00:18:21.319 --> 00:18:26.039
<v Speaker 1>components like neurons, activation functions (sigmoid, ReLU), layers, optimization

340
00:18:26.279 --> 00:18:30.599
<v Speaker 1>(gradient descent, Adam), regularization (dropout). Points to the keras package

341
00:18:30.640 --> 00:18:31.000
<v Speaker 1>in R.

342
00:18:31.079 --> 00:18:35.279
<v Speaker 2>Deep learning has had amazing success, especially with images, language, speech.

343
00:18:35.880 --> 00:18:39.119
<v Speaker 2>The book just gives a taste, but hits the core concepts,

344
00:18:39.359 --> 00:18:41.839
<v Speaker 2>the building blocks, how they learn, how to control them.

345
00:18:42.079 --> 00:18:44.759
<v Speaker 2>It's a huge field, but that's a good starting point.

346
00:18:44.960 --> 00:18:47.720
<v Speaker 1>So wrapping up this deep dive on the Practitioner's Guide

347
00:18:47.720 --> 00:18:52.559
<v Speaker 1>to Data Science, it really feels like a valuable bridge. Yeah, definitely.

348
00:18:52.640 --> 00:18:55.720
<v Speaker 1>And as you, our listener, think about all this, consider

349
00:18:56.000 --> 00:18:59.079
<v Speaker 1>how these ideas might apply to what you're working on

350
00:18:59.160 --> 00:19:01.839
<v Speaker 1>or learning about. Maybe you're prepping for a meeting, trying

351
00:19:01.839 --> 00:19:04.960
<v Speaker 1>to understand a new area, or just curious. That ability

352
00:19:05.079 --> 00:19:08.680
<v Speaker 1>to work effectively with data, it's just becoming so critical everywhere.

353
00:19:08.799 --> 00:19:12.559
<v Speaker 1>Perhaps you're looking at customer behavior or analyzing trends for research.

354
00:19:12.599 --> 00:19:15.240
<v Speaker 1>The kinds of practical steps and thinking outlined in this

355
00:19:15.279 --> 00:19:17.759
<v Speaker 1>book offer a really solid way to approach those kinds

356
00:19:17.799 --> 00:19:19.440
<v Speaker 1>of challenges. Something to think about.
