WEBVTT

1
00:00:00.080 --> 00:00:02.279
<v Speaker 1>Welcome, curious minds, to another deep dive.

2
00:00:02.520 --> 00:00:03.040
<v Speaker 2>Hello.

3
00:00:04.000 --> 00:00:09.439
<v Speaker 1>Imagine having just one single reliable place you could quickly

4
00:00:09.560 --> 00:00:12.320
<v Speaker 1>check whenever some complex data science term pops up.

5
00:00:12.439 --> 00:00:14.720
<v Speaker 2>Yeah, instead of drowning in search results.

6
00:00:14.439 --> 00:00:17.920
<v Speaker 1>Exactly, saving you hours, maybe, of sifting through stuff that

7
00:00:18.000 --> 00:00:21.760
<v Speaker 1>might not even be right. Well, today we're doing just that.

8
00:00:22.120 --> 00:00:26.760
<v Speaker 1>We're cracking open the Data Scientist Pocket Guide by Mohammed Sabri.

9
00:00:27.640 --> 00:00:30.559
<v Speaker 1>It's a resource really designed to cut through all that

10
00:00:30.679 --> 00:00:34.320
<v Speaker 1>noise and hopefully give you clear, reliable answers.

11
00:00:34.399 --> 00:00:35.000
<v Speaker 2>It's useful.

12
00:00:35.240 --> 00:00:38.159
<v Speaker 1>Our mission today, then, is to extract the most important

13
00:00:38.200 --> 00:00:41.520
<v Speaker 1>sort of nuggets of knowledge from this guide. We'll focus

14
00:00:41.600 --> 00:00:45.840
<v Speaker 1>on key concepts, tackle some frequently asked questions in machine learning,

15
00:00:45.880 --> 00:00:48.320
<v Speaker 1>deep learning, the big ones, the big ones. Yeah, think

16
00:00:48.359 --> 00:00:51.320
<v Speaker 1>of this as your personal tour through a really practical glossary,

17
00:00:51.880 --> 00:00:54.560
<v Speaker 1>helping you grasp not just what things are, but

18
00:00:54.560 --> 00:00:56.320
<v Speaker 2>why they matter, how they fit together.

19
00:00:56.200 --> 00:00:58.359
<v Speaker 1>Exactly, the bigger picture.

20
00:00:58.479 --> 00:01:01.039
<v Speaker 2>And what's really compelling, I think, is why he wrote it.

21
00:01:01.039 --> 00:01:03.840
<v Speaker 2>It came from his own experience, his own frustrations early on.

22
00:01:03.960 --> 00:01:07.519
<v Speaker 2>Oh, interesting. Yeah, he saw the struggle, especially for beginners,

23
00:01:07.560 --> 00:01:13.159
<v Speaker 2>trying to find quick, reliable, clear explanations for fundamental concepts.

24
00:01:13.599 --> 00:01:17.640
<v Speaker 2>He actually says, answers to my questions were not always reliable.

25
00:01:17.400 --> 00:01:19.840
<v Speaker 1>Right, I can relate to that.

26
00:01:19.840 --> 00:01:22.159
<v Speaker 2>And some concepts are hard to understand. It created a

27
00:01:22.159 --> 00:01:23.400
<v Speaker 2>real barrier. You know.

28
00:01:23.640 --> 00:01:25.799
<v Speaker 1>That's such a common experience, isn't it? It sounds like he

29
00:01:25.879 --> 00:01:28.079
<v Speaker 1>wasn't just like compiling facts. He was trying to solve

30
00:01:28.079 --> 00:01:30.319
<v Speaker 1>a real pain point he knew others had.

31
00:01:30.480 --> 00:01:33.640
<v Speaker 2>Precisely, he wanted to create what he calls a first

32
00:01:33.680 --> 00:01:36.879
<v Speaker 2>of a kind dictionary or glossary that regroups the most

33
00:01:36.879 --> 00:01:39.920
<v Speaker 2>popular terms, really aiming to make the day to day

34
00:01:39.959 --> 00:01:41.439
<v Speaker 2>work easier, more enriching even.

35
00:01:41.519 --> 00:01:44.840
<v Speaker 1>Okay, so if you've ever felt that sense of

36
00:01:44.879 --> 00:01:48.480
<v Speaker 1>overwhelm, just the sheer volume of info, or got lost

37
00:01:48.719 --> 00:01:51.719
<v Speaker 1>trying to figure out which explanation to trust, this deep

38
00:01:51.799 --> 00:01:55.000
<v Speaker 1>dive should be really helpful. Yeah, hopefully. Mohammed describes those

39
00:01:55.040 --> 00:01:58.400
<v Speaker 1>early frustrations quite vividly, you know, having to go on

40
00:01:58.439 --> 00:02:02.159
<v Speaker 1>search engines and use various sources just to understand one concept,

41
00:02:02.640 --> 00:02:04.760
<v Speaker 1>finding it time consuming, and as you said, the answers

42
00:02:04.799 --> 00:02:05.560
<v Speaker 1>weren't always reliable.

43
00:02:05.359 --> 00:02:09.240
<v Speaker 2>Right. And he points out something key. A lot

44
00:02:09.240 --> 00:02:11.919
<v Speaker 2>of books focus heavily on the coding.

45
00:02:11.719 --> 00:02:13.919
<v Speaker 1>Which is essential, obviously.

46
00:02:14.159 --> 00:02:19.120
<v Speaker 2>Of course, but they often miss understanding the logic and the mechanism behind

47
00:02:19.159 --> 00:02:20.000
<v Speaker 2>each concept.

48
00:02:20.199 --> 00:02:23.120
<v Speaker 1>That raises a really important question, then: why is that

49
00:02:23.159 --> 00:02:27.240
<v Speaker 1>conceptual understanding so critical, even if you're a great coder?

50
00:02:27.840 --> 00:02:31.680
<v Speaker 2>Well, the guide really emphasizes this. Without that foundation, it's

51
00:02:32.000 --> 00:02:34.919
<v Speaker 2>hard for him to provide good results and explain his work.

52
00:02:35.439 --> 00:02:39.080
<v Speaker 2>The explanation for it, exactly. You can run the code, sure,

53
00:02:39.360 --> 00:02:41.680
<v Speaker 2>but do you know why it works, what the output

54
00:02:41.759 --> 00:02:44.759
<v Speaker 2>really means, how to fix it when it breaks? That's

55
00:02:44.800 --> 00:02:47.479
<v Speaker 2>the conceptual piece. Got it. So the book's goal, it's

56
00:02:47.479 --> 00:02:49.800
<v Speaker 2>pretty ambitious, actually, is to be a kind of data

57
00:02:49.840 --> 00:02:53.759
<v Speaker 2>science bible, a quick reference for solid definitions.

58
00:02:53.199 --> 00:02:56.719
<v Speaker 1>A bible, huh? So, given that focus on quick reference,

59
00:02:56.800 --> 00:02:59.120
<v Speaker 1>quick answers, I'm guessing this isn't a book you read

60
00:02:59.199 --> 00:03:00.840
<v Speaker 1>cover to cover like a novel.

61
00:03:01.000 --> 00:03:04.520
<v Speaker 2>No, absolutely not. He's very clear about that. The objective

62
00:03:04.680 --> 00:03:07.120
<v Speaker 2>is not to be read all at once. Right. It's

63
00:03:07.199 --> 00:03:09.039
<v Speaker 2>meant to be a resource you dip into. You know,

64
00:03:09.080 --> 00:03:10.840
<v Speaker 2>you have a question, you look it up. It's designed

65
00:03:10.879 --> 00:03:13.439
<v Speaker 2>for nonlinear reading.

66
00:03:13.280 --> 00:03:14.199
<v Speaker 1>So you can jump around.

67
00:03:14.280 --> 00:03:16.400
<v Speaker 2>Yeah, start to read wherever you want and jump to

68
00:03:16.439 --> 00:03:18.599
<v Speaker 2>any chapter, whatever you need at that moment.

69
00:03:18.800 --> 00:03:22.199
<v Speaker 1>Okay, that makes perfect sense. It's about targeted learning, getting

70
00:03:22.240 --> 00:03:25.400
<v Speaker 1>unstuck quickly without wading through dense theory.

71
00:03:25.479 --> 00:03:28.479
<v Speaker 2>Exactly. It fits that practical engineering mindset.

72
00:03:28.680 --> 00:03:30.840
<v Speaker 1>Right. So the book's structure reflects that too. It's got this

73
00:03:30.879 --> 00:03:35.719
<v Speaker 1>big alphabetical definition section and then a dedicated FAQ section.

74
00:03:35.840 --> 00:03:37.879
<v Speaker 2>Yeah, the FAQs are really interesting.

75
00:03:38.000 --> 00:03:41.039
<v Speaker 1>That's where we find some really actionable stuff, those distinctions

76
00:03:41.080 --> 00:03:44.360
<v Speaker 1>that often, you know, trip people up. Let's start with

77
00:03:44.439 --> 00:03:49.439
<v Speaker 1>a big one, deep learning versus traditional machine learning. When

78
00:03:49.439 --> 00:03:51.199
<v Speaker 1>do you actually need deep learning?

79
00:03:51.599 --> 00:03:54.719
<v Speaker 2>Okay, yeah, that's a common question. The guide suggests it

80
00:03:54.759 --> 00:03:58.560
<v Speaker 2>really shines in well two main scenarios where traditional methods

81
00:03:58.639 --> 00:03:59.199
<v Speaker 2>might struggle.

82
00:03:59.360 --> 00:03:59.680
<v Speaker 1>Okay.

83
00:04:00.680 --> 00:04:03.719
<v Speaker 2>In case it is hard to extract features from the data,

84
00:04:03.360 --> 00:04:07.080
<v Speaker 2>meaning deep learning models can often learn the important features

85
00:04:07.159 --> 00:04:11.479
<v Speaker 2>automatically, directly from raw data. Think pixels in an image

86
00:04:11.719 --> 00:04:13.520
<v Speaker 2>or raw audio waveforms.

87
00:04:13.680 --> 00:04:15.719
<v Speaker 1>So you don't need as much manual feature engineering.

88
00:04:15.400 --> 00:04:19.480
<v Speaker 2>Exactly, which can save a ton of effort, especially

89
00:04:19.519 --> 00:04:21.839
<v Speaker 2>with complex unstructured data.

90
00:04:21.920 --> 00:04:23.399
<v Speaker 1>Okay, that's one. What's the second?

91
00:04:23.439 --> 00:04:25.759
<v Speaker 2>The second, and it often goes hand in hand, is

92
00:04:26.120 --> 00:04:28.800
<v Speaker 2>in case we have a large amount of data. Scale.

93
00:04:29.399 --> 00:04:33.480
<v Speaker 2>With massive data sets, deep learning models often keep improving

94
00:04:33.519 --> 00:04:35.839
<v Speaker 2>with more data. They can learn better and show a

95
00:04:35.839 --> 00:04:40.480
<v Speaker 2>better performance, where traditional algorithms might plateau or even struggle

96
00:04:40.519 --> 00:04:41.160
<v Speaker 2>to scale.

97
00:04:41.279 --> 00:04:45.920
<v Speaker 1>So if you're dealing with that raw complex data, images, video, language,

98
00:04:46.000 --> 00:04:48.360
<v Speaker 1>or you just have enormous amounts of data, deep learning

99
00:04:48.399 --> 00:04:49.920
<v Speaker 1>is probably the way to go.

100
00:04:50.040 --> 00:04:53.600
<v Speaker 2>Generally, yes. It becomes a much more powerful tool in those situations.

101
00:04:53.680 --> 00:04:56.560
<v Speaker 1>Okay. But even with the right model, you still need

102
00:04:56.600 --> 00:04:59.800
<v Speaker 1>to know if it's actually working well, right? And understand its mistakes.

103
00:04:59.680 --> 00:05:01.920
<v Speaker 2>Absolutely critical.

104
00:05:01.720 --> 00:05:04.079
<v Speaker 1>Which brings us to another fundamental concept, one that trips up a

105
00:05:04.079 --> 00:05:06.800
<v Speaker 1>lot of people: Type I and Type II errors.

106
00:05:06.839 --> 00:05:10.800
<v Speaker 2>Ah, yes. False positives and false negatives. Core statistics, but

107
00:05:10.959 --> 00:05:12.560
<v Speaker 2>vital in ML evaluation.

108
00:05:12.759 --> 00:05:13.800
<v Speaker 1>So break it down for us. Type I?

109
00:05:13.920 --> 00:05:19.079
<v Speaker 2>Okay, Type I error, sometimes called alpha error

110
00:05:19.160 --> 00:05:22.680
<v Speaker 2>or a false positive. This happens when the researcher rejects

111
00:05:22.720 --> 00:05:26.240
<v Speaker 2>the null hypothesis when it is actually true in the population.

112
00:05:26.279 --> 00:05:28.839
<v Speaker 1>So you conclude something is happening when it actually isn't.

113
00:05:29.079 --> 00:05:32.120
<v Speaker 2>Exactly. Like a medical test saying someone has a disease

114
00:05:32.120 --> 00:05:35.720
<v Speaker 2>when they're healthy, or a spam filter blocking an important

115
00:05:35.720 --> 00:05:40.040
<v Speaker 2>email. You rejected the truth: healthy, not spam.

116
00:05:39.600 --> 00:05:42.079
<v Speaker 1>Got it, false alarm. And Type II?

117
00:05:42.120 --> 00:05:45.560
<v Speaker 2>Type II error, or beta error, a false negative. This is

118
00:05:45.600 --> 00:05:48.680
<v Speaker 2>the opposite. It's committed when the researcher does not reject

119
00:05:48.680 --> 00:05:51.519
<v Speaker 2>the null hypothesis when it is actually false in the population.

120
00:05:52.040 --> 00:05:53.920
<v Speaker 1>So you miss something that is happening.

121
00:05:53.600 --> 00:05:57.120
<v Speaker 2>Precisely, missing an actual effect. Think of a medical test

122
00:05:57.279 --> 00:05:58.600
<v Speaker 2>failing to detect a disease someone actually has.

123
00:05:58.639 --> 00:06:03.160
<v Speaker 1>Or a fraud detection system letting a fraudulent transaction through.

124
00:06:02.639 --> 00:06:06.000
<v Speaker 2>Exactly. That's a classic example. You accepted something false,

125
00:06:06.040 --> 00:06:07.680
<v Speaker 2>that the transaction is fine, as true.

126
00:06:08.120 --> 00:06:11.759
<v Speaker 1>Understanding the difference here seems crucial because the cost of

127
00:06:11.800 --> 00:06:14.319
<v Speaker 1>each error type can be wildly different.

128
00:06:14.040 --> 00:06:17.680
<v Speaker 2>Right, hugely different. Think about that medical test example. A

129
00:06:17.720 --> 00:06:22.519
<v Speaker 2>false positive, Type I, leads to anxiety, maybe unnecessary

130
00:06:22.600 --> 00:06:25.680
<v Speaker 2>follow-up tests. Annoying, potentially costly.

131
00:06:25.120 --> 00:06:27.279
<v Speaker 1>But a false negative?

132
00:06:27.040 --> 00:06:29.759
<v Speaker 2>A false negative, Type II, in that context means a

133
00:06:29.800 --> 00:06:33.079
<v Speaker 2>sick person doesn't get treatment. The consequences could be far,

134
00:06:33.199 --> 00:06:33.759
<v Speaker 2>far worse.

135
00:06:34.079 --> 00:06:35.800
<v Speaker 1>So when you build a model, you have to decide

136
00:06:35.839 --> 00:06:39.439
<v Speaker 1>which type of error is more critical to avoid for

137
00:06:39.560 --> 00:06:40.600
<v Speaker 1>your specific problem.

138
00:06:40.639 --> 00:06:43.720
<v Speaker 2>Absolutely. It's not just about overall accuracy; it's about the

139
00:06:43.759 --> 00:06:47.439
<v Speaker 2>real world impact of the specific mistakes your model makes.

140
00:06:47.800 --> 00:06:50.720
<v Speaker 2>You often have to tune models to minimize one type

141
00:06:50.720 --> 00:06:53.079
<v Speaker 2>of error, even if it slightly increases the other.
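
NOTE
A minimal sketch, not taken from the guide, of the trade-off just described: lowering a decision threshold removes Type II errors (false negatives) at the cost of more Type I errors (false positives). The labels, scores, thresholds and the helper name confusion_counts are hypothetical and invented for illustration.
```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1  # Type I error: false alarm
        elif predicted == 0 and truth == 1:
            fn += 1  # Type II error: missed case
        else:
            tn += 1
    return tp, fp, fn, tn
# Hypothetical labels (1 = positive) and model scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.55, 0.1, 0.8]
strict = confusion_counts(y_true, scores, threshold=0.5)
lenient = confusion_counts(y_true, scores, threshold=0.3)
print("threshold 0.5 (tp, fp, fn, tn):", strict)
print("threshold 0.3 (tp, fp, fn, tn):", lenient)
```
With these made-up numbers the lower threshold misses no positives but raises more false alarms, which is the kind of choice a medical screening model might accept.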

142
00:06:53.279 --> 00:06:58.240
<v Speaker 1>Okay, that really clarifies why just looking at accuracy isn't enough. Now,

143
00:06:58.279 --> 00:07:02.519
<v Speaker 1>speaking of practical challenges: missing data. Every data scientist runs

144
00:07:02.519 --> 00:07:03.199
<v Speaker 1>into this, right?

145
00:07:03.120 --> 00:07:06.399
<v Speaker 2>Oh, constantly. It's pretty much unavoidable in real-world data sets.

146
00:07:06.480 --> 00:07:07.920
<v Speaker 1>And why is it such a big deal? Why can't

147
00:07:07.959 --> 00:07:08.639
<v Speaker 1>we just ignore it?

148
00:07:09.079 --> 00:07:12.319
<v Speaker 2>Well, because many algorithms are based on statistical methods which

149
00:07:12.319 --> 00:07:14.959
<v Speaker 2>are supposed to receive a complete data set as input.

150
00:07:15.040 --> 00:07:15.839
<v Speaker 2>They just aren't designed for gaps.

151
00:07:15.600 --> 00:07:17.040
<v Speaker 1>So they break?

152
00:07:17.240 --> 00:07:21.199
<v Speaker 2>They might break completely, just refuse to run, or maybe worse,

153
00:07:21.240 --> 00:07:26.040
<v Speaker 2>they run, but give you a poor predictive model. Garbage in,

154
00:07:26.079 --> 00:07:27.639
<v Speaker 2>garbage out, essentially.

155
00:07:27.399 --> 00:07:29.480
<v Speaker 1>Okay, so we have to handle it. What are the

156
00:07:29.519 --> 00:07:32.199
<v Speaker 1>main ways, according to the guide?

157
00:07:32.120 --> 00:07:36.040
<v Speaker 2>It outlines two main strategies. First, you can simply remove the

158
00:07:36.079 --> 00:07:41.519
<v Speaker 2>missing data, usually by deleting the observations, the lines, which

159
00:07:41.560 --> 00:07:43.560
<v Speaker 2>contain at least one missing feature.

160
00:07:43.680 --> 00:07:44.519
<v Speaker 1>Just drop a whole row?

161
00:07:44.639 --> 00:07:47.839
<v Speaker 2>Yeah, it's simple, it's quick, But the downside is you

162
00:07:47.920 --> 00:07:51.439
<v Speaker 2>might lose a lot of valuable information, especially if missingness

163
00:07:51.480 --> 00:07:54.879
<v Speaker 2>isn't totally random, or if many rows have gaps.

164
00:07:55.040 --> 00:07:57.240
<v Speaker 1>Right, you could be throwing away perfectly good data in

165
00:07:57.319 --> 00:07:59.040
<v Speaker 1>other columns. What's the alternative?

166
00:07:59.120 --> 00:08:04.199
<v Speaker 2>The alternative is imputation, replacing the missing values with artificial values,

167
00:08:04.680 --> 00:08:05.600
<v Speaker 2>filling in the gaps.

168
00:08:05.680 --> 00:08:06.360
<v Speaker 1>How do you do that? Just guess?

169
00:08:06.639 --> 00:08:09.519
<v Speaker 2>Well, not quite guess. You can use simple

170
00:08:09.519 --> 00:08:13.040
<v Speaker 2>statistical methods like replacing missing numerical values with the mean

171
00:08:13.160 --> 00:08:15.759
<v Speaker 2>or mode of that column. Or you can use more

172
00:08:15.759 --> 00:08:20.600
<v Speaker 2>sophisticated techniques, like using regression: building a small model to

173
00:08:20.680 --> 00:08:23.639
<v Speaker 2>predict what the missing value likely would have been based

174
00:08:23.680 --> 00:08:24.959
<v Speaker 2>on the other features in that row.
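
NOTE
A minimal sketch, not taken from the guide, of the two strategies just described: deleting rows that contain a missing feature, and imputing a column's mean. The rows and the helper names drop_incomplete and impute_mean are hypothetical and invented for illustration; regression-based imputation is left out to keep the example short.
```python
from statistics import mean
# Hypothetical rows invented for illustration; None marks a missing value.
rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": None},
    {"age": 50, "income": 70000},
]
def drop_incomplete(rows):
    """Strategy 1: delete every row that has at least one missing feature."""
    return [r for r in rows if all(v is not None for v in r.values())]
def impute_mean(rows, column):
    """Strategy 2: replace missing values in one column with that column's mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]
dropped = drop_incomplete(rows)
imputed = impute_mean(impute_mean(rows, "age"), "income")
print(len(dropped), "rows survive deletion;", len(imputed), "rows survive imputation")
```
Deletion keeps only the complete rows, while imputation keeps every row but fills the gaps with values that were never observed, so the column's distribution is worth checking afterwards.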

175
00:08:25.240 --> 00:08:28.480
<v Speaker 1>Ah, interesting. Using the other data to inform the replacement.

176
00:08:28.480 --> 00:08:32.919
<v Speaker 2>Exactly. But there's a really important caveat here. Whatever

177
00:08:33.000 --> 00:08:36.080
<v Speaker 2>method you use, the replacements should not lead to a

178
00:08:36.120 --> 00:08:39.879
<v Speaker 2>significant change in the distribution and composition of the data set.

179
00:08:40.200 --> 00:08:43.519
<v Speaker 2>Meaning you want to fill the gaps without fundamentally changing

180
00:08:43.519 --> 00:08:46.440
<v Speaker 2>the story the data tells. You don't want to introduce

181
00:08:46.519 --> 00:08:51.639
<v Speaker 2>unintended biases or distort relationships between variables. It requires careful thought.

182
00:08:51.840 --> 00:08:54.600
<v Speaker 1>So it's about repairing the data set carefully, making it

183
00:08:54.679 --> 00:08:57.919
<v Speaker 1>usable for algorithms without messing up the underlying patterns.

184
00:08:58.039 --> 00:09:00.840
<v Speaker 2>That's the goal. Make it robust, maintain integrity.

185
00:09:01.039 --> 00:09:05.519
<v Speaker 1>Okay, so data is clean, model's built. Now the evaluation

186
00:09:05.600 --> 00:09:08.200
<v Speaker 1>part again, how do we actually measure performance?

187
00:09:08.480 --> 00:09:12.720
<v Speaker 2>Right, evaluation. It's iterative; often you cycle back. The guide

188
00:09:12.759 --> 00:09:14.600
<v Speaker 2>says you need to use what's called a metric.

189
00:09:14.639 --> 00:09:17.759
<v Speaker 2>This could be visual like a plot, or mathematical a number.

190
00:09:17.559 --> 00:09:18.399
<v Speaker 1>And you just pick one?

191
00:09:18.639 --> 00:09:21.799
<v Speaker 2>No. Crucially, the choice of metric is entirely based on

192
00:09:21.879 --> 00:09:23.960
<v Speaker 2>the type of problem that we are trying to solve.

193
00:09:23.840 --> 00:09:25.639
<v Speaker 1>Like we discussed with Type I and Type II errors.

194
00:09:25.559 --> 00:09:29.159
<v Speaker 2>Exactly. The metric needs to align with the actual goal.

195
00:09:29.399 --> 00:09:32.159
<v Speaker 2>For classification problems, putting things into categories, you have options.

196
00:09:32.200 --> 00:09:34.159
<v Speaker 1>Okay.

197
00:09:33.919 --> 00:09:37.559
<v Speaker 2>Like area under the curve, AUC, which looks at how well

198
00:09:37.600 --> 00:09:41.960
<v Speaker 2>the model distinguishes classes, the confusion matrix, which breaks down

199
00:09:41.960 --> 00:09:45.399
<v Speaker 2>the types of correct and incorrect predictions.

200
00:09:44.879 --> 00:09:46.960
<v Speaker 1>True positives, false negatives.

201
00:09:47.600 --> 00:09:52.240
<v Speaker 2>Then there's basic accuracy; recall, how many actual positives did

202
00:09:52.240 --> 00:09:55.480
<v Speaker 2>we find; precision, of the ones we predicted positive, how

203
00:09:55.519 --> 00:09:58.639
<v Speaker 2>many were right; and the F1 score, which balances

204
00:09:58.679 --> 00:09:59.840
<v Speaker 2>precision and recall.
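
NOTE
A minimal sketch, not taken from the guide, of the classification metrics just listed, computed from confusion-matrix counts. The counts and the helper name classification_metrics are hypothetical; they are chosen so that accuracy looks strong while recall is weak.
```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
# Hypothetical imbalanced case: 10 real positives among 1000 samples.
metrics = classification_metrics(tp=2, fp=1, fn=8, tn=989)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
In this invented case most predictions are correct overall, yet only 2 of the 10 real positives are found, which is why a single score can mislead.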

205
00:10:00.080 --> 00:10:03.679
<v Speaker 1>Okay, lots of options for classification. What about regression, predicting

206
00:10:03.679 --> 00:10:04.120
<v Speaker 1>a number?

207
00:10:04.320 --> 00:10:07.120
<v Speaker 2>For regression, you're looking at how close your predictions are

208
00:10:07.159 --> 00:10:10.279
<v Speaker 2>to the actual values. So metrics include mean squared error,

209
00:10:10.399 --> 00:10:14.559
<v Speaker 2>MSE; root mean squared error, RMSE; mean absolute error, MAE;

210
00:10:15.039 --> 00:10:18.720
<v Speaker 2>and the coefficient of determination, or R-squared, and its

211
00:10:18.759 --> 00:10:20.639
<v Speaker 2>cousin, adjusted R-squared.

212
00:10:20.480 --> 00:10:22.679
<v Speaker 1>Sounds like you need to know what each metric tells you.
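
NOTE
A minimal sketch, not taken from the guide, of the regression metrics named above: MSE, RMSE, MAE and R-squared. The values and the helper name regression_metrics are hypothetical; adjusted R-squared is omitted because it also needs the number of predictors.
```python
from math import sqrt
def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot  # assumes y_true is not constant
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae, "r2": r2}
# Hypothetical actual and predicted values, invented for illustration.
result = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```
MSE and RMSE weight the largest miss more heavily than MAE does, so reporting them together shows whether a few large errors dominate.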

213
00:10:22.759 --> 00:10:27.279
<v Speaker 2>Definitely, and the guide strongly advises using multiple evaluation metrics

214
00:10:27.279 --> 00:10:31.240
<v Speaker 2>for the same project. Why? Because each evaluation metric is

215
00:10:31.360 --> 00:10:33.519
<v Speaker 2>unique and has its own strength.

216
00:10:33.240 --> 00:10:36.720
<v Speaker 1>So one metric might look good, but another might reveal

217
00:10:36.840 --> 00:10:37.399
<v Speaker 1>a weakness.

218
00:10:37.639 --> 00:10:41.799
<v Speaker 2>Precisely, relying on just one number can be misleading. Looking

219
00:10:41.879 --> 00:10:45.600
<v Speaker 2>at several gives you a much more rounded, robust understanding

220
00:10:45.919 --> 00:10:48.080
<v Speaker 2>of how your model is really performing.

221
00:10:48.519 --> 00:10:52.200
<v Speaker 1>That's a key takeaway. Don't just chase one score, look

222
00:10:52.240 --> 00:10:54.200
<v Speaker 1>at the whole picture. All right, let's zoom out again.

223
00:10:54.639 --> 00:10:57.799
<v Speaker 1>Meta questions. Here's a big one. When can you actually

224
00:10:57.879 --> 00:10:59.759
<v Speaker 1>say you did a good job on a project? Is

225
00:10:59.759 --> 00:11:01.120
<v Speaker 1>it just about the metrics?

226
00:11:01.480 --> 00:11:03.679
<v Speaker 2>Ah, that's a great question, and the answer, according to

227
00:11:03.720 --> 00:11:06.600
<v Speaker 2>the guide, is definitely not just about the metrics. It

228
00:11:06.639 --> 00:11:10.120
<v Speaker 2>suggests that a data scientist should not be a perfectionist. Instead,

229
00:11:10.200 --> 00:11:13.279
<v Speaker 2>think like an engineer solving a practical problem. Focus on

230
00:11:13.320 --> 00:11:15.519
<v Speaker 2>the best outcome in the shortest amount of time.

231
00:11:15.639 --> 00:11:16.799
<v Speaker 1>So efficiency matters.

232
00:11:17.240 --> 00:11:21.720
<v Speaker 2>Speed, yeah. Iteration, yes, exactly. It mentions an agile style

233
00:11:21.720 --> 00:11:24.600
<v Speaker 2>where the idea is to deliver a result fast and iterate to

234
00:11:24.600 --> 00:11:29.000
<v Speaker 2>improve the work. Get something working then make it better. Critically,

235
00:11:29.279 --> 00:11:32.360
<v Speaker 2>a good result in accuracy doesn't necessarily mean that your

236
00:11:32.440 --> 00:11:32.840
<v Speaker 2>job is good.

237
00:11:32.840 --> 00:11:35.200
<v Speaker 1>Especially for hard problems.

238
00:11:35.000 --> 00:11:38.639
<v Speaker 2>Especially for hard problems where maybe due to the data itself,

239
00:11:38.679 --> 00:11:42.000
<v Speaker 2>it is almost impossible to get good accuracy.

240
00:11:42.320 --> 00:11:46.399
<v Speaker 1>So what should the focus be then, if not just accuracy?

241
00:11:46.039 --> 00:11:49.399
<v Speaker 2>The focus should shift to the logic and reasoning behind

242
00:11:49.440 --> 00:11:52.720
<v Speaker 2>the work instead of focusing on the accuracy. Did you

243
00:11:52.759 --> 00:11:56.919
<v Speaker 2>follow a sound process? Can you justify your choices? Did

244
00:11:56.960 --> 00:12:00.480
<v Speaker 2>you address the business problem effectively even if the model

245
00:12:00.679 --> 00:12:01.399
<v Speaker 2>isn't perfect?

246
00:12:01.600 --> 00:12:04.519
<v Speaker 1>That's a really important perspective. It's about the methodology, the

247
00:12:04.519 --> 00:12:08.159
<v Speaker 1>critical thinking, the practical impact, not just chasing a percentage

248
00:12:08.159 --> 00:12:09.039
<v Speaker 1>point.

249
00:12:09.159 --> 00:12:13.600
<v Speaker 2>Right. Sound work, continuous improvement, clear communication of limitations. That's often

250
00:12:13.639 --> 00:12:16.399
<v Speaker 2>more valuable than hitting an arbitrary accuracy target.

251
00:12:16.480 --> 00:12:20.519
<v Speaker 1>Okay, that leads nicely to another practical question: data transformation.

252
00:12:20.639 --> 00:12:23.360
<v Speaker 1>We know it's important, but how much time should we

253
00:12:23.399 --> 00:12:24.279
<v Speaker 1>really be spending on it?

254
00:12:24.320 --> 00:12:26.480
<v Speaker 2>This is another fantastic point from the guide, and the

255
00:12:26.519 --> 00:12:29.480
<v Speaker 2>emphasis is quite strong. It says data transformation is the

256
00:12:29.480 --> 00:12:31.679
<v Speaker 2>most important step in a data science project.

257
00:12:31.399 --> 00:12:33.559
<v Speaker 1>The most important? More than modeling?

258
00:12:33.879 --> 00:12:37.039
<v Speaker 2>That's the claim. It even states, the more time that

259
00:12:37.159 --> 00:12:40.759
<v Speaker 2>is spent on data transformation, the higher is the model performance.

260
00:12:41.159 --> 00:12:43.080
<v Speaker 1>Wow. Why is it that critical?

261
00:12:43.279 --> 00:12:46.320
<v Speaker 2>Because, as the guide puts it, a machine learning model

262
00:12:46.399 --> 00:12:49.000
<v Speaker 2>is very sensitive to the format of the input data

263
00:12:49.320 --> 00:12:52.000
<v Speaker 2>and the nature of the input data. Garbage in, garbage

264
00:12:52.000 --> 00:12:55.720
<v Speaker 2>out applies here too, but also slightly messy data in,

265
00:12:56.200 --> 00:13:01.240
<v Speaker 2>slightly messy results out. Good data transformation will value the

266
00:13:01.279 --> 00:13:04.840
<v Speaker 2>input data more, essentially making it easier for the model

267
00:13:04.960 --> 00:13:08.159
<v Speaker 2>to find the key variables to use for training. It

268
00:13:08.200 --> 00:13:10.559
<v Speaker 2>prepares the data optimally for the algorithm.

269
00:13:10.720 --> 00:13:12.440
<v Speaker 1>Can you give some examples of transformation?

270
00:13:12.799 --> 00:13:16.639
<v Speaker 2>Sure. Things like applying a natural logarithm to a continuous target

271
00:13:16.720 --> 00:13:20.240
<v Speaker 2>variable if it's heavily skewed, using one hot encoding for

272
00:13:20.320 --> 00:13:25.320
<v Speaker 2>categorical variables, turning categories like red, blue, green into separate binary columns.

273
00:13:24.879 --> 00:13:27.039
<v Speaker 1>Right, so the model can understand them.

274
00:13:26.639 --> 00:13:31.320
<v Speaker 2>Exactly. Or binning transformation, grouping continuous numbers into ranges.

275
00:13:31.480 --> 00:13:34.080
<v Speaker 2>These aren't just busy work. They directly help the model

276
00:13:34.159 --> 00:13:34.759
<v Speaker 2>learn better.
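
NOTE
A minimal sketch, not taken from the guide, of the three transformations just mentioned: a natural-log transform, one-hot encoding and binning. The values, bin edges, labels and the helper names one_hot and bin_value are hypothetical and invented for illustration.
```python
from math import log
def one_hot(value, categories):
    """Turn one categorical value into a list of 0/1 flags, one per category."""
    return [1 if value == c else 0 for c in categories]
def bin_value(x, edges, labels):
    """Return the label of the first range whose upper edge is >= x, else the last label."""
    for edge, label in zip(edges, labels):
        if x <= edge:
            return label
    return labels[-1]
# Hypothetical values invented for illustration.
colors = ["red", "blue", "green"]
encoded = one_hot("blue", colors)
log_price = log(250000.0)  # natural log compresses a skewed, strictly positive target
age_group = bin_value(37, edges=[17, 64], labels=["minor", "adult", "senior"])
print(encoded, round(log_price, 3), age_group)
```
Each helper handles one value at a time to keep the idea visible; on real data sets the same transformations are usually applied to whole columns with a data library.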

277
00:13:35.039 --> 00:13:38.000
<v Speaker 1>So the real secret sauce isn't just the fancy algorithm.

278
00:13:38.240 --> 00:13:41.960
<v Speaker 2>Often, no. The secret resides in data transformation and how

279
00:13:42.000 --> 00:13:45.080
<v Speaker 2>well it is performed. It's the foundation. Get that right

280
00:13:45.200 --> 00:13:47.480
<v Speaker 2>and your model has a much better chance of success.

281
00:13:47.759 --> 00:13:53.519
<v Speaker 1>That's incredibly insightful, the unsung hero of model performance. Okay,

282
00:13:53.559 --> 00:13:55.799
<v Speaker 1>one last fascinating nugget I wanted to pull out, this

283
00:13:55.840 --> 00:13:58.960
<v Speaker 1>one from the definition section: automation bias. What's that?

284
00:13:59.440 --> 00:14:03.480
<v Speaker 2>Ah, automation bias. This occurs when a human

285
00:14:03.519 --> 00:14:07.240
<v Speaker 2>decision maker favors recommendations made by an automated system over

286
00:14:07.320 --> 00:14:10.600
<v Speaker 2>a non-automated system, even if the automated system is wrong,

287
00:14:10.720 --> 00:14:14.000
<v Speaker 2>even if the automated system provides an error. Yes, it

288
00:14:14.039 --> 00:14:18.279
<v Speaker 2>stems from overtrusting the machine learning model, perhaps just because

289
00:14:18.279 --> 00:14:19.879
<v Speaker 2>it seems complex or objective.

290
00:14:20.039 --> 00:14:22.519
<v Speaker 1>That's actually a bit worrying, isn't it? As AI gets

291
00:14:22.559 --> 00:14:25.759
<v Speaker 1>more embedded in decision making, we just blindly trust the machine?

292
00:14:25.799 --> 00:14:27.960
<v Speaker 2>It's a real risk. We see a recommendation from a

293
00:14:28.000 --> 00:14:32.399
<v Speaker 2>sophisticated algorithm and our critical thinking might just switch off.

294
00:14:32.639 --> 00:14:34.240
<v Speaker 2>We assume the machine knows best.

295
00:14:34.440 --> 00:14:36.919
<v Speaker 1>How do we guard against that? As people building these

296
00:14:36.960 --> 00:14:38.320
<v Speaker 1>systems or even just using them?

297
00:14:38.080 --> 00:14:41.919
<v Speaker 2>That's the challenge. The guide doesn't explicitly state solutions,

298
00:14:42.240 --> 00:14:46.360
<v Speaker 2>but it implies the need for awareness. Maybe designing systems

299
00:14:46.360 --> 00:14:50.759
<v Speaker 2>with checks and balances, requiring human oversight for critical decisions,

300
00:14:51.240 --> 00:14:54.440
<v Speaker 2>ensuring transparency so people can question the output.

301
00:14:54.879 --> 00:14:57.360
<v Speaker 1>So the human element isn't just about feeding data in,

302
00:14:57.879 --> 00:15:02.240
<v Speaker 1>it's about maintaining that critical oversight throughout the process. Don't

303
00:15:02.279 --> 00:15:03.519
<v Speaker 1>just accept the output.

304
00:15:03.360 --> 00:15:08.519
<v Speaker 2>Exactly. Active, critical engagement. Don't blindly follow the automated advice,

305
00:15:08.759 --> 00:15:10.559
<v Speaker 2>especially when the stakes are high.

306
00:15:10.600 --> 00:15:13.000
<v Speaker 1>Wow. Okay, we've covered a lot, from when to use

307
00:15:13.039 --> 00:15:16.159
<v Speaker 1>deep learning, to the nuances of Type I and Type II errors,

308
00:15:16.559 --> 00:15:20.720
<v Speaker 1>handling missing data, the crucial role of evaluation metrics, the

309
00:15:20.759 --> 00:15:23.480
<v Speaker 1>surprising importance of data transformation.

310
00:15:23.120 --> 00:15:25.720
<v Speaker 2>And even that subtle trap of automation bias.

311
00:15:25.879 --> 00:15:28.759
<v Speaker 1>It really drives home that understanding the why and the

312
00:15:28.799 --> 00:15:32.120
<v Speaker 1>how, the logic and mechanism, is just as vital as

313
00:15:32.320 --> 00:15:33.120
<v Speaker 1>writing the code.

314
00:15:33.240 --> 00:15:36.559
<v Speaker 2>Absolutely, this whole deep dive into the Data Scientist Pocket

315
00:15:36.559 --> 00:15:39.799
<v Speaker 2>Guide really reinforces the idea that knowledge is most valuable

316
00:15:39.840 --> 00:15:43.840
<v Speaker 2>when understood and applied. It's about building that solid conceptual foundation.

317
00:15:44.240 --> 00:15:49.759
<v Speaker 1>So for you, the listener navigating this complex field, what's

318
00:15:49.799 --> 00:15:50.759
<v Speaker 1>the key message here?

319
00:15:51.279 --> 00:15:53.679
<v Speaker 2>I think it's that becoming a good data scientist is

320
00:15:53.720 --> 00:15:57.440
<v Speaker 2>a journey. It really takes continuously learning new techniques and

321
00:15:57.519 --> 00:15:58.600
<v Speaker 2>updating your knowledge.

322
00:15:58.679 --> 00:16:00.000
<v Speaker 1>It's not a one and done thing.

323
00:16:00.200 --> 00:16:05.159
<v Speaker 2>Definitely not. It demands discipline and autonomy, and maybe most importantly,

324
00:16:05.240 --> 00:16:09.799
<v Speaker 2>the ability to question assumptions, to seek out reliable understanding

325
00:16:09.799 --> 00:16:12.679
<v Speaker 2>like this guide aims to provide and always push for

326
00:16:12.720 --> 00:16:15.159
<v Speaker 2>that deeper insight, don't just scratch the surface.

327
00:16:15.360 --> 00:16:18.639
<v Speaker 1>So the final thought perhaps is while the models get smarter,

328
00:16:18.919 --> 00:16:21.919
<v Speaker 1>our own critical thinking and deep understanding remain the most

329
00:16:22.000 --> 00:16:25.120
<v Speaker 1>valuable assets we bring to the table. Don't automate your

330
00:16:25.120 --> 00:16:25.720
<v Speaker 1>own judgment.

331
00:16:25.879 --> 00:16:28.120
<v Speaker 2>Well put. Keep questioning, keep learning.
