WEBVTT

1
00:00:00.080 --> 00:00:02.960
<v Speaker 1>Okay, so have you ever felt like you're just drowning

2
00:00:02.960 --> 00:00:06.440
<v Speaker 1>in information, you know, for a project, or maybe getting

3
00:00:06.440 --> 00:00:08.320
<v Speaker 1>ready for a meeting, or even just trying to learn

4
00:00:08.359 --> 00:00:10.640
<v Speaker 1>something new, and you just wish someone could kind of

5
00:00:10.640 --> 00:00:11.519
<v Speaker 1>boil it all down.

6
00:00:11.720 --> 00:00:14.080
<v Speaker 2>Yeah, just give you the essentials, right, what really matters.

7
00:00:13.919 --> 00:00:16.800
<v Speaker 1>Exactly, and maybe, you know, throw in a few

8
00:00:16.839 --> 00:00:20.039
<v Speaker 1>surprising bits to keep it interesting. Well, if that sounds

9
00:00:20.039 --> 00:00:23.199
<v Speaker 1>like you, you are definitely in the right place, because

10
00:00:23.239 --> 00:00:24.839
<v Speaker 1>that's what we do here on the Deep Dive. We're

11
00:00:24.879 --> 00:00:28.399
<v Speaker 1>sort of your shortcut to getting properly informed, and today

12
00:00:28.640 --> 00:00:32.039
<v Speaker 1>we're taking a deep dive into your source material. These

13
00:00:32.079 --> 00:00:36.280
<v Speaker 1>are excerpts from Easily Practical Machine Learning Algorithms with Python

14
00:00:36.759 --> 00:00:38.479
<v Speaker 1>by Dr. Darren Thomas.

15
00:00:38.640 --> 00:00:42.960
<v Speaker 2>Yeah, and this isn't like your standard dense textbook. The author,

16
00:00:43.000 --> 00:00:46.880
<v Speaker 2>Dr. Thomas, he's got a PhD, loads of teaching experience.

17
00:00:46.479 --> 00:00:50.159
<v Speaker 1>And get this, a background in saxophone performance.

18
00:00:50.200 --> 00:00:52.920
<v Speaker 2>Right, but his passion for machine learning led him to use these

19
00:00:52.920 --> 00:00:56.840
<v Speaker 2>algorithms in education. He's a lecturer now at Asia Pacific

20
00:00:56.920 --> 00:00:58.560
<v Speaker 2>International University.

21
00:00:58.479 --> 00:01:01.159
<v Speaker 1>And the book's aim, which is really, really key for us today,

22
00:01:01.439 --> 00:01:05.239
<v Speaker 1>is to be simple, easy to follow, a kind of

23
00:01:05.400 --> 00:01:09.599
<v Speaker 1>condensed guide for actually using these algorithms with Python.

24
00:01:09.840 --> 00:01:12.719
<v Speaker 2>Exactly. He even says his goal was always to show what

25
00:01:12.760 --> 00:01:14.920
<v Speaker 2>to do, rather than talk a lot about how to

26
00:01:14.959 --> 00:01:17.680
<v Speaker 2>do it. So less heavy theory, more hands-on application.

27
00:01:17.840 --> 00:01:22.400
<v Speaker 1>Okay, so, important note then: the book, and this deep

28
00:01:22.400 --> 00:01:24.959
<v Speaker 1>dive too, sort of assume you've already got some background

29
00:01:25.000 --> 00:01:27.079
<v Speaker 1>in Python, maybe data science, stats.

30
00:01:27.200 --> 00:01:30.159
<v Speaker 2>Yeah, it's more for folks looking to build on existing skills,

31
00:01:30.680 --> 00:01:33.439
<v Speaker 2>maybe not for absolute beginners to data science itself.

32
00:01:33.599 --> 00:01:36.480
<v Speaker 1>Right, so our mission today: we want to unpack some

33
00:01:36.560 --> 00:01:39.719
<v Speaker 1>of the most common machine learning algorithms. We'll look at

34
00:01:39.760 --> 00:01:41.840
<v Speaker 1>classification, that's predicting

35
00:01:41.640 --> 00:01:43.879
<v Speaker 2>Categories, like spam or not spam.

36
00:01:43.680 --> 00:01:49.159
<v Speaker 1>Exactly, and numeric prediction, predicting continuous values like, say, house prices.

37
00:01:49.719 --> 00:01:54.400
<v Speaker 1>We'll get into how they're used, their surprising upsides, their challenges,

38
00:01:54.840 --> 00:01:57.280
<v Speaker 1>and crucially, how you actually figure out if the model's

39
00:01:57.280 --> 00:01:58.560
<v Speaker 1>any good and how to make it better.

40
00:01:58.640 --> 00:02:00.640
<v Speaker 2>Yeah, judging and improving them. Key.

41
00:02:00.560 --> 00:02:02.760
<v Speaker 1>Totally. So, ready to jump in? Let's unpack this.

42
00:02:02.920 --> 00:02:04.000
<v Speaker 2>Let's do it.

43
00:02:04.040 --> 00:02:08.479
<v Speaker 1>Okay, first up, decision trees. You can think of these as

44
00:02:08.520 --> 00:02:11.000
<v Speaker 1>like the foundation for a lot of predictive stuff.

45
00:02:11.280 --> 00:02:13.919
<v Speaker 2>Right. At its heart, a decision tree is just a

46
00:02:13.960 --> 00:02:17.680
<v Speaker 2>way to classify things or predict numbers by splitting your

47
00:02:17.719 --> 00:02:20.599
<v Speaker 2>data up. Splitting it how? It basically keeps dividing the

48
00:02:20.639 --> 00:02:23.479
<v Speaker 2>sample into smaller and smaller groups, trying to make each

49
00:02:23.520 --> 00:02:26.319
<v Speaker 2>little group as similar as possible inside.

50
00:02:26.159 --> 00:02:28.719
<v Speaker 1>So it looks like a tree visually, like a flowchart?

51
00:02:28.520 --> 00:02:30.400
<v Speaker 2>Exactly, like a flowchart. You start at the

52
00:02:30.400 --> 00:02:32.719
<v Speaker 2>top, the root node. That's your first big decision point.

53
00:02:33.039 --> 00:02:36.599
<v Speaker 2>Then you follow branches down through more decision nodes, making more splits.

54
00:02:36.479 --> 00:02:39.120
<v Speaker 1>Until you hit the end, the leaf nodes?

55
00:02:39.240 --> 00:02:41.840
<v Speaker 2>Yep, the leaf nodes. That's where you get your final prediction.

56
00:02:42.000 --> 00:02:43.800
<v Speaker 1>And how does it decide where to split?

57
00:02:44.120 --> 00:02:48.599
<v Speaker 2>Well, for classification, it often uses something called entropy. High

58
00:02:48.680 --> 00:02:51.439
<v Speaker 2>entropy means things are really mixed up. The tree tries

59
00:02:51.479 --> 00:02:55.360
<v Speaker 2>to make splits that reduce that entropy, creating purer groups.
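
NOTE
Editor's sketch, not from the source: a minimal, stdlib-only Python illustration of the entropy idea just described. The labels and the helper names entropy and split_entropy are made up for illustration; this is one plausible way to score a split, not the book's code.
```python
import math
from collections import Counter
def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())
def split_entropy(left, right):
    """Size-weighted average entropy of the two groups a split produces."""
    total = len(left) + len(right)
    return (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
# Hypothetical labels: a perfectly mixed parent node.
parent = ["alive", "alive", "dead", "dead"]
mixed = entropy(parent)
# A split that isolates each class leaves two pure groups.
pure = split_entropy(["alive", "alive"], ["dead", "dead"])
print(mixed, pure)
```
A tree-growing routine would compare candidate splits and prefer the one whose weighted entropy falls furthest below the parent's.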

60
00:02:55.240 --> 00:02:59.800
<v Speaker 1>Lower entropy, more homogeneous, got it. And for predicting numbers?

61
00:03:00.080 --> 00:03:04.919
<v Speaker 2>Numeric prediction, that uses metrics like mean squared error, MSE, or

62
00:03:04.919 --> 00:03:06.879
<v Speaker 2>maybe R-squared to guide the splits.
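
NOTE
Editor's sketch, not from the source: how a numeric-prediction tree might pick a split by mean squared error. The toy data and the helper names mse and best_threshold are assumptions for illustration; real libraries use faster, more careful versions of the same idea.
```python
def mse(values):
    """Mean squared error of values around their own mean (what a leaf would predict)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
def best_threshold(xs, ys):
    """Try each midpoint between sorted x values; return (threshold, weighted MSE) of the best split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no midpoint between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
# Hypothetical toy data: one feature, one numeric target.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, score = best_threshold(xs, ys)
print(threshold, round(score, 3))
```
Each resulting group would then predict its own average, which matches the leaf behavior described for the age example.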

63
00:03:07.199 --> 00:03:10.000
<v Speaker 1>Okay, so what's great about them? Why start here?

64
00:03:10.240 --> 00:03:13.520
<v Speaker 2>Well, a big plus is flexibility. They handle missing data

65
00:03:13.560 --> 00:03:16.639
<v Speaker 2>pretty well. They don't really care if your data isn't,

66
00:03:16.879 --> 00:03:19.879
<v Speaker 2>you know, perfectly normal, a nice bell curve. Right, and

67
00:03:19.919 --> 00:03:22.159
<v Speaker 2>you don't even have to use all your variables. Plus,

68
00:03:22.240 --> 00:03:25.360
<v Speaker 2>and this is a big one, they're relatively easy to interpret.

69
00:03:25.599 --> 00:03:28.120
<v Speaker 1>Ah, so you don't need a math PhD to figure

70
00:03:28.120 --> 00:03:28.879
<v Speaker 1>out what it's doing.

71
00:03:29.240 --> 00:03:32.759
<v Speaker 2>Pretty much. You can literally trace a path down the

72
00:03:32.800 --> 00:03:34.840
<v Speaker 2>tree and see the reasoning.

73
00:03:34.879 --> 00:03:38.560
<v Speaker 1>That transparency sounds really useful, especially if you need to explain

74
00:03:38.759 --> 00:03:40.199
<v Speaker 1>why a prediction was made.

75
00:03:40.319 --> 00:03:44.599
<v Speaker 2>Absolutely. Imagine telling someone why their loan was denied. Showing

76
00:03:44.639 --> 00:03:48.800
<v Speaker 2>them a simple tree is way easier than explaining, you know,

77
00:03:48.960 --> 00:03:51.960
<v Speaker 2>complex equations from some other models. Builds trust.

78
00:03:52.400 --> 00:03:55.639
<v Speaker 1>Okay, but there's always a catch, right, what's the downside?

79
00:03:55.680 --> 00:03:57.840
<v Speaker 2>The main one is that they can get really complex

80
00:03:57.840 --> 00:04:00.840
<v Speaker 2>if you let them grow too much, really deep trees,

81
00:04:00.919 --> 00:04:03.319
<v Speaker 2>and that leads to, that often leads to overfitting.

82
00:04:03.439 --> 00:04:06.360
<v Speaker 1>Overfitting, like it learns the training data too well?

83
00:04:06.360 --> 00:04:09.560
<v Speaker 2>Exactly, it fits the specific sample data perfectly, maybe even

84
00:04:09.599 --> 00:04:12.800
<v Speaker 2>the noise. But then it can't generalize well to new

85
00:04:12.840 --> 00:04:14.520
<v Speaker 2>data it hasn't seen before.

86
00:04:14.319 --> 00:04:18.120
<v Speaker 1>Like memorizing answers instead of understanding the concept.

87
00:04:18.879 --> 00:04:22.120
<v Speaker 2>Perfect analogy. And of course, a super complex tree, even though it's visual,

88
00:04:22.199 --> 00:04:24.000
<v Speaker 2>can still be hard to explain easily.

89
00:04:24.040 --> 00:04:26.720
<v Speaker 1>Okay, let's make it concrete. The source had an example, right,

90
00:04:26.759 --> 00:04:27.839
<v Speaker 1>a cancer data set?

91
00:04:27.639 --> 00:04:31.480
<v Speaker 2>Yeah, predicting health status, alive or dead. The model

92
00:04:31.560 --> 00:04:34.839
<v Speaker 2>mainly used variables like time and age for its splits.

93
00:04:35.079 --> 00:04:36.199
<v Speaker 1>And how did it perform?

94
00:04:36.399 --> 00:04:38.720
<v Speaker 2>It got seventy eight percent accuracy on the data it

95
00:04:38.759 --> 00:04:42.079
<v Speaker 2>trained on, okay, but then on the unseen test data

96
00:04:42.120 --> 00:04:44.120
<v Speaker 2>it dropped slightly to seventy three percent.

97
00:04:44.439 --> 00:04:47.199
<v Speaker 1>That drop, five percent, is that bad or expected?

98
00:04:47.360 --> 00:04:50.839
<v Speaker 2>That's actually pretty common and often expected. It shows it's generalizing okay,

99
00:04:51.040 --> 00:04:54.079
<v Speaker 2>still performing decently on new stuff. It learned patterns and

100
00:04:54.120 --> 00:04:54.639
<v Speaker 2>applied them.

101
00:04:54.759 --> 00:04:58.240
<v Speaker 1>Right. Now, what about using it for numeric prediction? Same

102
00:04:58.319 --> 00:05:00.639
<v Speaker 1>data set, but predicting age?

103
00:05:01.000 --> 00:05:03.399
<v Speaker 2>Yep. So here, instead of looking at purity like with Gini,

104
00:05:03.759 --> 00:05:07.000
<v Speaker 2>the tree uses MSE, mean squared error, in the nodes,

105
00:05:07.639 --> 00:05:10.720
<v Speaker 2>and the leaves predict the average age for that group.

106
00:05:10.519 --> 00:05:12.279
<v Speaker 1>And which variables were important there?

107
00:05:12.399 --> 00:05:15.120
<v Speaker 2>The source mentioned ph.karno and meal.cal.

108
00:05:14.959 --> 00:05:16.480
<v Speaker 1>And the results?

109
00:05:16.879 --> 00:05:20.160
<v Speaker 2>Well, the correlation between the actual age and predicted age

110
00:05:20.240 --> 00:05:22.680
<v Speaker 2>was okay on the training data, about point five two

111
00:05:22.720 --> 00:05:26.480
<v Speaker 2>four. Moderate. Moderate, yeah. But on the test data it

112
00:05:26.560 --> 00:05:28.399
<v Speaker 2>dropped way down to point one eight.

113
00:05:28.480 --> 00:05:31.360
<v Speaker 1>Ouch, big drop. What about the error, the MSE?

114
00:05:31.399 --> 00:05:34.160
<v Speaker 2>MSE was sixty one point eight on the training set,

115
00:05:34.240 --> 00:05:36.839
<v Speaker 2>but jumped up to eighty five point twenty four on

116
00:05:36.920 --> 00:05:37.600
<v Speaker 2>the test set.
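
NOTE
Editor's sketch, not from the source: the two metrics quoted in these examples, MSE and the correlation between actual and predicted values, written out in plain Python. The numbers below are invented, not the book's data, and Pearson correlation is assumed to be the correlation meant.
```python
import math
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
def pearson_r(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)
# Hypothetical ages and model predictions.
actual = [50, 60, 70, 80]
predicted = [55, 58, 72, 75]
print(mean_squared_error(actual, predicted), pearson_r(actual, predicted))
```
Running both functions on the training rows and again on held-out test rows gives the kind of train-versus-test comparison the speakers are describing.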

117
00:05:37.720 --> 00:05:39.199
<v Speaker 1>So again, that drop tells us?

118
00:05:39.319 --> 00:05:42.040
<v Speaker 2>It tells us while it learned something, it really struggled

119
00:05:42.079 --> 00:05:45.879
<v Speaker 2>to generalize the age prediction to new people. Highlights that

120
00:05:45.959 --> 00:05:48.120
<v Speaker 2>overfitting risk with single trees.

121
00:05:48.000 --> 00:05:50.439
<v Speaker 1>Which leads us perfectly into the next one, random forest. This

122
00:05:50.519 --> 00:05:50.959
<v Speaker 1>sounds cool.

123
00:05:51.040 --> 00:05:53.279
<v Speaker 2>Yeah, this is where it gets really interesting. Random Forest

124
00:05:53.360 --> 00:05:56.920
<v Speaker 2>tackles that overfitting problem head on. How? Instead of building

125
00:05:56.959 --> 00:06:00.120
<v Speaker 2>just one decision tree, it builds hundreds, maybe even thousands

126
00:06:00.160 --> 00:06:00.399
<v Speaker 2>of them.

127
00:06:00.439 --> 00:06:02.480
<v Speaker 1>Wow, okay. Where does the random part come in?

128
00:06:02.519 --> 00:06:05.839
<v Speaker 2>Two places. First, each tree is built using a

129
00:06:05.920 --> 00:06:09.079
<v Speaker 2>different random sample of your data, drawn with replacement, called

130
00:06:09.079 --> 00:06:12.519
<v Speaker 2>bootstrapping. Second, at each split point in a tree,

131
00:06:12.600 --> 00:06:15.439
<v Speaker 2>it only considers a random subset of your available features.

132
00:06:15.600 --> 00:06:18.040
<v Speaker 1>So not every tree sees all the data, and not

133
00:06:18.160 --> 00:06:20.560
<v Speaker 1>every split considers all the factors.

134
00:06:20.800 --> 00:06:23.959
<v Speaker 2>Exactly. And the idea is you get lots of slightly different trees,

135
00:06:24.079 --> 00:06:26.720
<v Speaker 2>none of them perfect, but hopefully their errors are kind

136
00:06:26.759 --> 00:06:28.279
<v Speaker 2>of random and cancel each other out.

137
00:06:28.360 --> 00:06:30.560
<v Speaker 1>So how does it make a final prediction with all

138
00:06:30.600 --> 00:06:31.279
<v Speaker 1>those trees?

139
00:06:31.759 --> 00:06:35.000
<v Speaker 2>It's pretty democratic, actually. For classification, it's just a majority

140
00:06:35.079 --> 00:06:38.480
<v Speaker 2>vote. Whichever prediction most trees make wins.

141
00:06:38.680 --> 00:06:42.199
<v Speaker 1>Simple enough. And for predicting numbers?

142
00:06:41.920 --> 00:06:44.639
<v Speaker 2>It just averages the predictions from all the individual trees.
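
NOTE
Editor's sketch, not from the source: the two sources of randomness and the two aggregation rules just described, in stdlib Python. The feature names, the seed, and the helper names are hypothetical, and building the individual trees is left out.
```python
import random
from collections import Counter
def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]
def random_feature_subset(features, k, rng):
    """Pick k candidate features to consider at a single split."""
    return rng.sample(features, k)
def majority_vote(predictions):
    """Classification: the label predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]
def average(predictions):
    """Numeric prediction: the mean of the individual tree outputs."""
    return sum(predictions) / len(predictions)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-ins for ten training rows
sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(["age", "income", "sex", "illness"], 2, rng)
print(sample, subset)
print(majority_vote(["male", "female", "male"]), average([0.4, 0.6, 0.5]))
```
In a full forest, each tree would be trained on its own bootstrap sample, with a fresh feature subset drawn at every split.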

143
00:06:44.800 --> 00:06:47.279
<v Speaker 1>And this whole wisdom of the crowd thing really helps

144
00:06:47.279 --> 00:06:48.879
<v Speaker 1>with overfitting?

145
00:06:49.519 --> 00:06:53.800
<v Speaker 2>Massively. That aggregation step makes the model much more robust and

146
00:06:53.959 --> 00:06:57.959
<v Speaker 2>way less prone to overfitting compared to a single complex

147
00:06:58.000 --> 00:06:59.120
<v Speaker 2>decision tree.

148
00:06:59.160 --> 00:07:03.000
<v Speaker 1>So the benefits seem clear: less overfitting, works well even if

149
00:07:03.040 --> 00:07:05.800
<v Speaker 1>you don't have tons of data, handles missing values. Sounds great.

150
00:07:06.120 --> 00:07:09.439
<v Speaker 2>It often is. It's a very popular, reliable

151
00:07:09.480 --> 00:07:10.720
<v Speaker 2>algorithm for those reasons.

152
00:07:10.800 --> 00:07:13.959
<v Speaker 1>Okay, but the drawback? You mentioned transparency with decision trees,

153
00:07:14.079 --> 00:07:14.720
<v Speaker 1>what about here?

154
00:07:15.040 --> 00:07:18.920
<v Speaker 2>Oh yeah, that's the main trade-off. With potentially thousands

155
00:07:18.959 --> 00:07:21.839
<v Speaker 2>of trees, you can't just visualize it like a single

156
00:07:21.839 --> 00:07:22.560
<v Speaker 2>flowchart anymore.

157
00:07:22.399 --> 00:07:25.040
<v Speaker 1>So it becomes a black box?

158
00:07:25.480 --> 00:07:27.480
<v Speaker 2>Pretty much. You know what goes in, you know what comes out,

159
00:07:27.560 --> 00:07:31.480
<v Speaker 2>but explaining exactly how it arrived at that specific prediction

160
00:07:31.759 --> 00:07:32.639
<v Speaker 2>is really hard.

161
00:07:32.879 --> 00:07:33.959
<v Speaker 1>That's a problem when?

162
00:07:34.040 --> 00:07:36.680
<v Speaker 2>When you need to explain the why, like we said,

163
00:07:36.879 --> 00:07:41.040
<v Speaker 2>loan applications, medical diagnoses, if you can't explain the reasoning,

164
00:07:41.399 --> 00:07:44.680
<v Speaker 2>it can cause issues with trust or even regulations that

165
00:07:44.720 --> 00:07:46.680
<v Speaker 2>demand transparency.

166
00:07:46.360 --> 00:07:49.160
<v Speaker 1>Right, so that lack of interpretability is a real consideration.

167
00:07:50.040 --> 00:07:52.680
<v Speaker 1>You might get great predictions but lose the explanation.

168
00:07:52.759 --> 00:07:54.240
<v Speaker 2>It's a definite trade off you have to weigh.

169
00:07:54.360 --> 00:07:57.240
<v Speaker 1>Let's look at the example, the DoctorAUS data set,

170
00:07:57.319 --> 00:07:58.199
<v Speaker 1>predicting gender.

171
00:07:58.399 --> 00:08:01.439
<v Speaker 2>Right, the source noted income and age came out as

172
00:08:01.480 --> 00:08:05.360
<v Speaker 2>strong predictors, pointing out the known differences in salaries and

173
00:08:05.439 --> 00:08:08.680
<v Speaker 2>life expectancy. And the performance? Impressive on the training data,

174
00:08:08.959 --> 00:08:12.839
<v Speaker 2>ninety three percent accuracy. Wow, but then quite a big

175
00:08:12.920 --> 00:08:15.560
<v Speaker 2>drop on the test data, down to sixty six percent.

176
00:08:15.600 --> 00:08:18.120
<v Speaker 1>Ooh, that's a nearly thirty percent drop. What does that

177
00:08:18.160 --> 00:08:18.600
<v Speaker 1>tell us?

178
00:08:18.920 --> 00:08:21.279
<v Speaker 2>It tells us that while the model learned the training

179
00:08:21.360 --> 00:08:26.319
<v Speaker 2>data extremely well, almost perfectly, it really struggled to generalize

180
00:08:26.360 --> 00:08:26.879
<v Speaker 2>that learning.

181
00:08:27.319 --> 00:08:30.839
<v Speaker 1>So even random forest isn't immune to some overfitting. Or

182
00:08:31.120 --> 00:08:33.639
<v Speaker 1>maybe the training data just wasn't fully representative.

183
00:08:33.879 --> 00:08:36.960
<v Speaker 2>Could be either or both. It's a stark reminder that

184
00:08:37.080 --> 00:08:40.679
<v Speaker 2>high training accuracy is nice, but test accuracy is what

185
00:08:40.879 --> 00:08:42.799
<v Speaker 2>really counts for real world use.

186
00:08:43.039 --> 00:08:46.600
<v Speaker 1>Okay, and the numeric prediction example, predicting income from the

187
00:08:46.639 --> 00:08:48.240
<v Speaker 1>same DoctorAUS data set?

188
00:08:48.039 --> 00:08:50.799
<v Speaker 2>Here age was the most important variable by far,

189
00:08:51.039 --> 00:08:53.320
<v Speaker 2>makes intuitive sense, right? Yeah, people often earn more as

190
00:08:53.320 --> 00:08:53.840
<v Speaker 2>they get older.

191
00:08:53.879 --> 00:08:55.519
<v Speaker 1>Sure. And the numbers?

192
00:08:55.200 --> 00:08:58.360
<v Speaker 2>Strong correlation on the training data, point eight three, but

193
00:08:58.399 --> 00:09:00.279
<v Speaker 2>again a drop on the test data, down to point

194
00:09:00.320 --> 00:09:00.720
<v Speaker 2>four two eight.

195
00:09:00.919 --> 00:09:03.679
<v Speaker 1>Still a decent drop. And the error, MSE?

196
00:09:04.159 --> 00:09:06.720
<v Speaker 2>MSE was low on the training set, point zero four,

197
00:09:06.919 --> 00:09:08.559
<v Speaker 2>and higher on the test set, point one two one.

198
00:09:08.639 --> 00:09:11.360
<v Speaker 1>So similar story. Good at learning the training data, but

199
00:09:11.440 --> 00:09:14.279
<v Speaker 1>only moderately good at generalizing the income prediction.

200
00:09:14.399 --> 00:09:17.480
<v Speaker 2>Yeah, pretty much. It captures the relationship it sees, but

201
00:09:17.559 --> 00:09:20.200
<v Speaker 2>applying it to new unseen individuals is where the real

202
00:09:20.240 --> 00:09:20.759
<v Speaker 2>test lies.

203
00:09:20.840 --> 00:09:24.480
<v Speaker 1>Okay, moving on: k-nearest neighbors, or kNN. This

204
00:09:24.519 --> 00:09:26.679
<v Speaker 1>one sounds neighborly, huh?

205
00:09:26.360 --> 00:09:29.720
<v Speaker 2>Yeah, it's actually quite intuitive. The core idea is predicting

206
00:09:29.759 --> 00:09:31.159
<v Speaker 2>by proximity.

207
00:09:31.200 --> 00:09:34.000
<v Speaker 1>Like your example of walking into a classroom of twelve year olds.

208
00:09:34.159 --> 00:09:36.559
<v Speaker 1>If another kid walks in, you guess they're also twelve.

209
00:09:37.000 --> 00:09:40.960
<v Speaker 2>Exactly like that. Yeah, kNN looks at an unknown data

210
00:09:41.000 --> 00:09:43.840
<v Speaker 2>point and finds the k known data points that are

211
00:09:43.879 --> 00:09:45.799
<v Speaker 2>nearest to it in the feature space.

212
00:09:45.759 --> 00:09:47.559
<v Speaker 1>And nearest is usually measured by?

213
00:09:47.679 --> 00:09:51.679
<v Speaker 2>Typically Euclidean distance, just the straight line distance between points

214
00:09:52.080 --> 00:09:53.519
<v Speaker 2>in that multi dimensional space.

215
00:09:53.679 --> 00:09:55.879
<v Speaker 1>So K is just how many neighbors you look at

216
00:09:55.919 --> 00:09:58.600
<v Speaker 1>like the three nearest or five nearest?

217
00:09:58.720 --> 00:10:01.120
<v Speaker 2>Yep, K is the number of neighbors you consider.

218
00:10:01.200 --> 00:10:03.879
<v Speaker 1>And how do those neighbors make the prediction?

219
00:10:03.960 --> 00:10:06.639
<v Speaker 2>For classification, they vote. That's why K is usually an odd number

220
00:10:06.679 --> 00:10:10.240
<v Speaker 2>to avoid ties. For numeric prediction, it's just the average

221
00:10:10.240 --> 00:10:11.360
<v Speaker 2>of the neighbors' values.
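
NOTE
Editor's sketch, not from the source: kNN as described here, using Euclidean distance, a vote for classification and an average for numeric prediction. The points, labels, and helper names are hypothetical, and the features are assumed to be scaled already.
```python
import math
from collections import Counter
def euclidean(a, b):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda row: euclidean(row[0], query))[:k]
def knn_classify(train, query, k=3):
    """Majority vote among the k nearest labels."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]
def knn_regress(train, query, k=3):
    """Average of the k nearest numeric targets."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)
# Hypothetical, already-scaled training points.
voters = [((0.1, 0.2), "voted"), ((0.2, 0.1), "voted"), ((0.15, 0.15), "voted"),
          ((0.9, 0.8), "stayed home"), ((0.8, 0.9), "stayed home")]
incomes = [((0.1,), 2.0), ((0.2,), 3.0), ((0.3,), 4.0), ((0.9,), 10.0)]
print(knn_classify(voters, (0.0, 0.0)), knn_regress(incomes, (0.2,)))
```
Note that nothing is fitted ahead of time: every prediction sorts the stored training rows by distance, which is the lazy-learning behavior discussed in the transcript.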

222
00:10:11.440 --> 00:10:12.200
<v Speaker 1>What's it good for?

223
00:10:12.360 --> 00:10:14.879
<v Speaker 2>It's pretty good with non linear data where the boundary

224
00:10:14.919 --> 00:10:18.480
<v Speaker 2>isn't a straight line. And it's non parametric, meaning

225
00:10:18.720 --> 00:10:21.519
<v Speaker 2>it doesn't make strong assumptions about how your data is distributed.

226
00:10:21.919 --> 00:10:22.679
<v Speaker 2>Makes it flexible.

227
00:10:22.960 --> 00:10:25.039
<v Speaker 1>Sounds simple enough. Any hidden traps?

228
00:10:25.120 --> 00:10:29.240
<v Speaker 2>Well, it's sometimes called a lazy learning algorithm. Lazy because

229
00:10:29.240 --> 00:10:32.399
<v Speaker 2>it doesn't really build a model during training. It basically

230
00:10:32.480 --> 00:10:35.799
<v Speaker 2>just stores all the training data. The real work happens

231
00:10:35.799 --> 00:10:37.320
<v Speaker 2>only when you ask for a prediction.

232
00:10:37.799 --> 00:10:40.399
<v Speaker 1>Ah. So it doesn't give you much insight into why

233
00:10:40.519 --> 00:10:41.559
<v Speaker 1>variables are important.

234
00:10:41.759 --> 00:10:45.799
<v Speaker 2>Not really, no abstraction. And because it stores everything, it

235
00:10:45.840 --> 00:10:49.240
<v Speaker 2>can struggle with really large data sets. Needs a lot

236
00:10:49.240 --> 00:10:49.799
<v Speaker 2>of memory.

237
00:10:50.120 --> 00:10:53.399
<v Speaker 1>And there was another crucial point, something about scale?

238
00:10:54.159 --> 00:10:58.639
<v Speaker 2>Yes, critically important for kNN: it is scale sensitive.

239
00:10:58.799 --> 00:11:02.320
<v Speaker 1>Okay, break that down. What does scale sensitive mean, practically?

240
00:11:02.639 --> 00:11:05.519
<v Speaker 2>Imagine you have age maybe zero to one hundred, and

241
00:11:05.559 --> 00:11:08.519
<v Speaker 2>another variable like owns car, which is just zero or one.

242
00:11:09.039 --> 00:11:12.919
<v Speaker 2>When kNN calculates distance, the age difference will totally swamp

243
00:11:12.960 --> 00:11:14.879
<v Speaker 2>the owns car difference, just because the numbers are so

244
00:11:14.960 --> 00:11:15.519
<v Speaker 2>much bigger.

245
00:11:15.600 --> 00:11:18.840
<v Speaker 1>So age will have way more influence on who's considered nearest.

246
00:11:18.639 --> 00:11:21.360
<v Speaker 2>Exactly, even if owning a car is actually super important

247
00:11:21.360 --> 00:11:23.399
<v Speaker 2>for the prediction. Yeah, so you have to scale your

248
00:11:23.440 --> 00:11:24.039
<v Speaker 2>data first.

249
00:11:24.120 --> 00:11:24.919
<v Speaker 1>How do you scale it?

250
00:11:25.200 --> 00:11:28.080
<v Speaker 2>Common ways are min-max scaling, where you squish everything

251
00:11:28.120 --> 00:11:30.840
<v Speaker 2>into a zero to one range, or standardization, where you

252
00:11:30.879 --> 00:11:33.759
<v Speaker 2>give variables a mean of zero and standard deviation of one.
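
NOTE
Editor's sketch, not from the source: min-max scaling and standardization in stdlib Python, plus a before-and-after distance to show why scale matters for kNN. The three people and the owns_car variable are invented, and both scalers assume the values are not all identical.
```python
import math
def min_max_scale(values):
    """Rescale values to the range 0 to 1 (assumes max is greater than min)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]
def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
# Hypothetical people described by (age, owns_car).
ages = [20, 60, 100]
owns_car = [0, 1, 0]
raw = list(zip(ages, owns_car))
scaled = list(zip(min_max_scale(ages), owns_car))
# Before scaling the age gap dominates the distance; afterwards car ownership counts too.
print(euclidean(raw[0], raw[1]), euclidean(scaled[0], scaled[1]))
```
In practice the scaling parameters are usually computed on the training rows only and then reused for the test rows, so the test data does not leak into the model.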

253
00:11:34.000 --> 00:11:35.679
<v Speaker 2>It puts everything on a level playing field.

254
00:11:35.480 --> 00:11:38.559
<v Speaker 1>Right, so age isn't shouting louder than the other variables.

255
00:11:38.679 --> 00:11:41.200
<v Speaker 1>Makes sense. The example used the turnout data set to

256
00:11:41.240 --> 00:11:42.919
<v Speaker 1>predict if someone voted.

257
00:11:42.759 --> 00:11:46.360
<v Speaker 2>Yeah, and they specifically mentioned scaling the data first. The model

258
00:11:46.399 --> 00:11:50.000
<v Speaker 2>got almost eighty percent accuracy on training, and importantly, it

259
00:11:50.080 --> 00:11:51.799
<v Speaker 2>held up really well on the test data too.

260
00:11:51.960 --> 00:11:55.480
<v Speaker 1>That consistency is good, right? Suggests it's generalizing.

261
00:11:54.960 --> 00:11:57.200
<v Speaker 2>Exactly what you want to see, not just memorizing.

262
00:11:57.399 --> 00:12:01.480
<v Speaker 1>And for predicting income from that same turnout data?

263
00:12:00.600 --> 00:12:03.000
<v Speaker 2>The correlation was zero point six three on training,

264
00:12:03.159 --> 00:12:05.360
<v Speaker 2>dropped a bit to point four two eight on test,

265
00:12:05.559 --> 00:12:08.080
<v Speaker 2>but the MSE values were very close: point zero two

266
00:12:08.120 --> 00:12:11.200
<v Speaker 2>two one for training, point zero two nine for testing.

267
00:12:11.480 --> 00:12:14.559
<v Speaker 1>So again, even if the correlation isn't super strong, the

268
00:12:14.639 --> 00:12:18.120
<v Speaker 1>similar error rates suggest it's performing consistently on new data.

269
00:12:18.320 --> 00:12:21.399
<v Speaker 2>Yep, indicates stable performance, which is often more important than

270
00:12:21.480 --> 00:12:23.279
<v Speaker 2>hitting the absolute highest correlation number.

271
00:12:23.480 --> 00:12:28.679
<v Speaker 1>Okay, next algorithm: support vector machines, SVM. This sounds a

272
00:12:28.679 --> 00:12:29.480
<v Speaker 1>bit more complex.

273
00:12:29.720 --> 00:12:32.639
<v Speaker 2>It combines ideas from kNN and linear models,

274
00:12:32.919 --> 00:12:36.039
<v Speaker 2>but with a really clever twist for classification, which is

275
00:12:36.240 --> 00:12:38.559
<v Speaker 2>its main goal is to find the best dividing line

276
00:12:38.799 --> 00:12:41.399
<v Speaker 2>or plane or hyperplane in higher dimensions.

277
00:12:41.600 --> 00:12:43.559
<v Speaker 1>Hyperplane, fancy word for boundary?

278
00:12:43.879 --> 00:12:47.360
<v Speaker 2>Right. It wants the boundary that creates the biggest possible

279
00:12:47.440 --> 00:12:50.759
<v Speaker 2>gap or margin between the different classes.

280
00:12:50.320 --> 00:12:52.879
<v Speaker 1>A bigger buffer zone?

281
00:12:52.639 --> 00:12:54.720
<v Speaker 2>Exactly. And the data points that sit right on the edge

282
00:12:54.720 --> 00:12:57.240
<v Speaker 2>of that margin, those are the support vectors. They're the

283
00:12:57.240 --> 00:12:59.120
<v Speaker 2>critical ones that actually define the boundary.

284
00:12:59.279 --> 00:13:01.679
<v Speaker 1>Interesting. Does it always have to be a straight line?

285
00:13:01.720 --> 00:13:03.080
<v Speaker 1>What if the data is all mixed up?

286
00:13:03.399 --> 00:13:07.279
<v Speaker 2>Good question. It prefers straight lines, but it has tricks.

287
00:13:07.919 --> 00:13:11.399
<v Speaker 2>It can allow some misclassifications using a slack variable, a

288
00:13:11.399 --> 00:13:15.320
<v Speaker 2>bit of wiggle room. Or, for really messy nonlinear data,

289
00:13:15.399 --> 00:13:16.679
<v Speaker 2>it uses the kernel trick.

290
00:13:16.919 --> 00:13:19.679
<v Speaker 1>The kernel trick? Sounds like magic.

291
00:13:19.840 --> 00:13:22.639
<v Speaker 2>It kind of is. It projects the data into a much higher dimensional space where

292
00:13:22.919 --> 00:13:26.639
<v Speaker 2>hopefully a simple linear boundary can separate the classes.
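
NOTE
Editor's sketch, not from the source: a small illustration of the kernel idea. feature_map is an explicit degree-2 projection of 2-D points into 3-D, and poly_kernel computes the same dot product straight from the 2-D inputs, which is the trick. The points and helper names are hypothetical; rbf_kernel shows the usual RBF formula with an assumed gamma of 1.0.
```python
import math
def feature_map(point):
    """Explicit degree-2 map of a 2-D point into 3-D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
def poly_kernel(a, b):
    """Degree-2 polynomial kernel: the 3-D dot product, computed from the 2-D inputs."""
    return dot(a, b) ** 2
def rbf_kernel(a, b, gamma=1.0):
    """RBF (radial basis function) kernel: similarity decays with squared distance."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))
def radius_squared(point):
    """First plus third mapped coordinate, which equals x1^2 + x2^2."""
    mapped = feature_map(point)
    return mapped[0] + mapped[2]
# Hypothetical points: an inner cluster and an outer ring, not linearly separable in 2-D.
inner = [(0.5, 0.0), (0.0, -0.5)]
outer = [(2.0, 0.0), (0.0, 2.0)]
# In the mapped space a flat threshold on radius_squared separates the two groups.
print([radius_squared(p) for p in inner], [radius_squared(p) for p in outer])
print(poly_kernel((1.0, 2.0), (3.0, 4.0)), dot(feature_map((1.0, 2.0)), feature_map((3.0, 4.0))))
```
The point of the kernel function is that the high-dimensional coordinates never have to be built; an SVM only needs the pairwise similarities.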

293
00:13:26.600 --> 00:13:29.480
<v Speaker 1>Like unfolding a crumpled paper to separate dots. I like

294
00:13:29.519 --> 00:13:33.679
<v Speaker 1>that analogy. So SVMs are flexible, can handle messy data?

295
00:13:33.840 --> 00:13:36.320
<v Speaker 2>Very flexible, yeah, and they can be incredibly accurate even

296
00:13:36.360 --> 00:13:37.440
<v Speaker 2>on complex problems.

297
00:13:37.440 --> 00:13:41.600
<v Speaker 1>Okay, sounds powerful. Downside? Is it another black box?

298
00:13:41.720 --> 00:13:45.360
<v Speaker 2>Often, yes, especially with those kernel tricks and higher dimensions.

299
00:13:45.600 --> 00:13:49.279
<v Speaker 2>Explaining why an SVM made a particular decision gets very abstract, very quickly.

300
00:13:49.159 --> 00:13:51.759
<v Speaker 1>Right, hard to explain to the boss.

301
00:13:51.600 --> 00:13:55.240
<v Speaker 2>Can be. Also, choosing the right kernel isn't always obvious.

302
00:13:55.720 --> 00:13:57.919
<v Speaker 2>And like kNN, it's scale sensitive. You need

303
00:13:57.960 --> 00:13:59.000
<v Speaker 2>to rescale your data.

304
00:13:59.120 --> 00:14:03.679
<v Speaker 1>Got it, scaling again. The example was predicting mortgage status,

305
00:14:03.759 --> 00:14:05.759
<v Speaker 1>yes or no, from the working hours data set?

306
00:14:05.600 --> 00:14:08.000
<v Speaker 2>Correct, and data prep was key. They combined some

307
00:14:08.080 --> 00:14:11.480
<v Speaker 2>child related variables and rescaled everything. Then they compared two

308
00:14:11.519 --> 00:14:15.639
<v Speaker 2>common kernels, linear and RBF, radial basis function.

309
00:14:15.799 --> 00:14:16.440
<v Speaker 1>How'd they do?

310
00:14:16.559 --> 00:14:19.679
<v Speaker 2>The linear kernel hit eighty seven percent accuracy on training

311
00:14:20.200 --> 00:14:24.039
<v Speaker 2>and impressively held that exact same accuracy on the test data.

312
00:14:24.080 --> 00:14:26.279
<v Speaker 1>Wow, perfect generalization in that case.

313
00:14:26.360 --> 00:14:29.200
<v Speaker 2>Fantastic result. The RBF kernel was just slightly lower.

314
00:14:29.159 --> 00:14:33.039
<v Speaker 1>And for SVM regression, predicting education level from the same data set?

315
00:14:33.320 --> 00:14:36.919
<v Speaker 2>Yeah, again comparing linear and RBF kernels. The linear one

316
00:14:36.960 --> 00:14:40.480
<v Speaker 2>showed a pretty weak correlation, only point three two eight

317
00:14:40.559 --> 00:14:43.519
<v Speaker 2>on training and point four zero on test. That doesn't

318
00:14:43.559 --> 00:14:47.080
<v Speaker 2>sound great, but the MSE values were low and very

319
00:14:47.080 --> 00:14:50.159
<v Speaker 2>stable: point zero one five eight nine on training, point

320
00:14:50.240 --> 00:14:51.919
<v Speaker 2>zero one eight three two two on testing.

321
00:14:52.320 --> 00:14:56.879
<v Speaker 1>So weak relationship overall, but the model makes consistent predictions?

322
00:14:57.360 --> 00:15:00.279
<v Speaker 2>Exactly. Suggests it's generalizing well, even if it's not explaining a huge

323
00:15:00.320 --> 00:15:03.360
<v Speaker 2>amount of the variance. The source also mentioned outliers might

324
00:15:03.399 --> 00:15:05.519
<v Speaker 2>be affecting the correlation metric more here.

325
00:15:05.679 --> 00:15:08.440
<v Speaker 1>Interesting. So metrics can sometimes tell slightly different stories.

326
00:15:08.440 --> 00:15:10.840
<v Speaker 2>Definitely, you need to look at them together.

327
00:15:10.960 --> 00:15:15.879
<v Speaker 1>All right. Artificial neural networks, ANNs, the brain-inspired ones?

328
00:15:16.480 --> 00:15:20.320
<v Speaker 2>Kind of. They were initially inspired by biological neurons. Yeah. You have inputs,

329
00:15:20.320 --> 00:15:22.200
<v Speaker 2>some processing happens, and you get outputs.

330
00:15:22.279 --> 00:15:24.000
<v Speaker 1>And deep learning is just when you have lots of

331
00:15:24.080 --> 00:15:25.440
<v Speaker 1>layers of these neurons.

332
00:15:25.200 --> 00:15:28.080
<v Speaker 2>Multiple hidden layers. Yes, that's the essence of deep learning.

333
00:15:28.159 --> 00:15:30.559
<v Speaker 1>So how do they actually work? Simply put, think.

334
00:15:30.399 --> 00:15:33.399
<v Speaker 2>Of inputs like signals arriving. Each input gets a weight

335
00:15:33.519 --> 00:15:35.799
<v Speaker 2>how important it is. They get summed up and then

336
00:15:35.840 --> 00:15:37.440
<v Speaker 2>hit an activation function like.

337
00:15:37.399 --> 00:15:39.159
<v Speaker 1>The neuron deciding whether to fire.

338
00:15:39.799 --> 00:15:42.879
<v Speaker 2>Sort of yeah, that function decides if the signal is

339
00:15:42.879 --> 00:15:45.840
<v Speaker 2>strong enough to pass on to the next layer. Usually

340
00:15:45.879 --> 00:15:49.120
<v Speaker 2>the information flows one way, feed forward. It's a whole

341
00:15:49.159 --> 00:15:51.879
<v Speaker 2>cascade of these simple weighted sums and activations.
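
NOTE
Editorial sketch of the "weighted sums and activations" idea described above.
It assumes a sigmoid activation and made-up weights; the episode does not say
which activation or network shape the source used, so every name here is hypothetical.
```python
import math
def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)
def layer(inputs, weight_rows, biases):
    """One layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]
# Feed forward: the hidden layer's outputs become the output neuron's inputs.
hidden = layer([0.5, -1.0], [[0.8, 0.2], [-0.4, 0.9]], [0.0, 0.1])
output = neuron(hidden, [1.0, -1.0], 0.0)
```
Training would adjust the weights and biases; this sketch only shows the forward pass.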

342
00:15:51.960 --> 00:15:54.600
<v Speaker 1>And the big advantage why all the hype, They.

343
00:15:54.519 --> 00:15:57.960
<v Speaker 2>Really shine with massive amounts of data. Given enough data,

344
00:15:58.000 --> 00:16:02.000
<v Speaker 2>they can learn incredibly complex, subtle patterns that other algorithms

345
00:16:02.080 --> 00:16:03.080
<v Speaker 2>might miss entirely.

346
00:16:03.279 --> 00:16:08.039
<v Speaker 1>So flexibility is huge. Image recognition, self driving cars.

347
00:16:07.919 --> 00:16:11.200
<v Speaker 2>Exactly, they power a lot of cutting edge AI tasks.

348
00:16:11.360 --> 00:16:13.840
<v Speaker 1>But the catch they need tons of data.

349
00:16:13.879 --> 00:16:18.799
<v Speaker 2>Typically, yes, massive data sets for optimal performance, and training

350
00:16:18.840 --> 00:16:20.919
<v Speaker 2>them can take a lot of computing power and time.

351
00:16:21.200 --> 00:16:24.559
<v Speaker 2>Plus sometimes simpler networks can struggle to converge, meaning they

352
00:16:24.559 --> 00:16:25.840
<v Speaker 2>don't actually learn effectively.

353
00:16:25.919 --> 00:16:29.200
<v Speaker 1>Okay, example time predicting union membership from the wages data set.

354
00:16:29.279 --> 00:16:32.559
<v Speaker 2>Right, And a key step here was turning categorical variables

355
00:16:32.840 --> 00:16:35.200
<v Speaker 2>like occupation into dummy.

356
00:16:35.000 --> 00:16:37.200
<v Speaker 1>Variables numerical flags basically.

357
00:16:37.000 --> 00:16:40.000
<v Speaker 2>Yep, zeros and ones the network can understand. The model

358
00:16:40.039 --> 00:16:43.600
<v Speaker 2>achieved a solid seventy percent accuracy, and importantly, it was

359
00:16:43.639 --> 00:16:46.000
<v Speaker 2>consistent between the training and test data.
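
NOTE
Editorial sketch of turning a categorical variable into dummy variables,
as mentioned above. The occupation labels are hypothetical and not taken from
the wages data set; the helper name to_dummies is also an assumption.
```python
def to_dummies(values):
    """Turn category labels into rows of 0/1 flags, one column per distinct label."""
    columns = sorted(set(values))
    rows = [[1 if value == col else 0 for col in columns] for value in values]
    return columns, rows
# Hypothetical occupation labels.
columns, rows = to_dummies(["sales", "clerical", "sales", "management"])
```
Each row has exactly one flag set, so the network sees numbers instead of text labels.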

360
00:16:46.399 --> 00:16:50.200
<v Speaker 1>Good stability again and the regression example predicting wages.

361
00:16:50.360 --> 00:16:52.879
<v Speaker 2>Also from the wages data set, it showed a moderate

362
00:16:52.919 --> 00:16:56.240
<v Speaker 2>correlation, point five six on training, very close, point

363
00:16:56.240 --> 00:16:59.240
<v Speaker 2>five four on test, and the MSE values were

364
00:16:59.240 --> 00:17:00.960
<v Speaker 2>also very close between the two.

365
00:17:00.840 --> 00:17:04.839
<v Speaker 1>Sets, So reasonably good generalization there too. Consistent if not

366
00:17:05.000 --> 00:17:06.160
<v Speaker 1>spectacular prediction.

367
00:17:06.319 --> 00:17:08.599
<v Speaker 2>Looks like it reliable performance on new data.

368
00:17:09.119 --> 00:17:12.000
<v Speaker 1>Now for something completely different, K-means. You said this

369
00:17:12.039 --> 00:17:13.319
<v Speaker 1>one's unsupervised learning.

370
00:17:13.400 --> 00:17:16.880
<v Speaker 2>That's right. This is fascinating because, unlike everything else we've discussed,

371
00:17:17.279 --> 00:17:20.480
<v Speaker 2>there's no right answer or target variable we're trying.

372
00:17:20.319 --> 00:17:23.960
<v Speaker 1>To predict, no gender, no income, no voted label exactly.

373
00:17:24.359 --> 00:17:27.440
<v Speaker 2>K-means isn't trying to predict anything specific. It's just trying

374
00:17:27.440 --> 00:17:30.759
<v Speaker 2>to find natural groupings or clusters within the data itself,

375
00:17:30.839 --> 00:17:31.920
<v Speaker 2>based on similarity.

376
00:17:32.079 --> 00:17:33.400
<v Speaker 1>How does it find these groups?

377
00:17:33.480 --> 00:17:36.480
<v Speaker 2>It starts by randomly guessing the locations of K cluster

378
00:17:36.599 --> 00:17:38.559
<v Speaker 2>centers or centroids, K.

379
00:17:38.680 --> 00:17:41.079
<v Speaker 1>Being the number of clusters you think are in the data.

380
00:17:41.160 --> 00:17:45.720
<v Speaker 2>Precisely. Then it assigns each data point to its nearest centroid.

381
00:17:46.559 --> 00:17:49.960
<v Speaker 2>After that, it recalculates the position of each centroid to

382
00:17:50.000 --> 00:17:52.559
<v Speaker 2>be the actual center of all the points assigned to it.

383
00:17:52.440 --> 00:17:55.559
<v Speaker 1>And it repeats that. Assigned points move centers yep.

384
00:17:55.640 --> 00:17:59.160
<v Speaker 2>It iterates back and forth, assigned points update centroids until

385
00:17:59.160 --> 00:18:02.440
<v Speaker 2>the centroids stop moving much, until things stabilize, and.
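
NOTE
Editorial sketch of the assign-then-update loop described above, in plain Python.
Assumptions: squared Euclidean distance, initial centroids drawn from the data
points, and a toy data set with two obvious groups. None of this is the source's code.
```python
import random
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))
def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))
def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial guess: k of the data points
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = new_centroids
    return centroids, clusters
# Hypothetical toy data: two well-separated groups.
points = [(1.0, 2.0), (1.4, 2.8), (2.0, 4.0), (8.0, 16.0), (8.6, 17.2), (9.0, 18.0)]
centroids, clusters = k_means(points, k=2)
```
On data this cleanly separated the loop settles within a few passes; real data can
end in different clusterings depending on the starting guess.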

386
00:18:02.400 --> 00:18:05.200
<v Speaker 1>The researcher has to decide on K the number of clusters.

387
00:18:05.240 --> 00:18:06.599
<v Speaker 1>That sounds tricky, it can be.

388
00:18:06.759 --> 00:18:07.640
<v Speaker 2>It's a key challenge.

389
00:18:07.680 --> 00:18:09.359
<v Speaker 1>So what's the big benefit of doing this.

390
00:18:09.839 --> 00:18:13.920
<v Speaker 2>It's fantastic for exploring your data, for segmentation, finding hidden

391
00:18:13.960 --> 00:18:17.160
<v Speaker 2>patterns you didn't even know existed, understanding what makes different

392
00:18:17.200 --> 00:18:18.640
<v Speaker 2>subgroups within your data.

393
00:18:18.599 --> 00:18:21.319
<v Speaker 1>Distinct, discovering natural segments exactly.

394
00:18:21.640 --> 00:18:25.559
<v Speaker 2>But the drawbacks stem from that unsupervised nature. Since there's

395
00:18:25.559 --> 00:18:28.839
<v Speaker 2>no right answer, evaluating how good the clusters are is

396
00:18:28.880 --> 00:18:31.359
<v Speaker 2>more subjective, relying on the researcher's interpretation.

397
00:18:31.559 --> 00:18:34.400
<v Speaker 1>Let me guess, scale sensitive.

398
00:18:34.000 --> 00:18:37.799
<v Speaker 2>You got it. It requires data normalization or scaling, just like

399
00:18:37.960 --> 00:18:43.200
<v Speaker 2>kNN and SVM because it relies on distance calculations, and yeah,

400
00:18:43.279 --> 00:18:44.519
<v Speaker 2>choosing the right K is tough.

401
00:18:44.759 --> 00:18:46.200
<v Speaker 1>Are there ways to help choose K.

402
00:18:46.440 --> 00:18:48.880
<v Speaker 2>There are methods, yeah, like the elbow method, where you

403
00:18:48.920 --> 00:18:51.480
<v Speaker 2>plot a measure of cluster cohesion against different values of

404
00:18:51.559 --> 00:18:54.480
<v Speaker 2>K and look for an elbow point where adding more

405
00:18:54.519 --> 00:18:56.680
<v Speaker 2>clusters doesn't improve things much.

406
00:18:56.759 --> 00:19:03.079
<v Speaker 1>Okay, the example used ACT and SAT scores, age, education.

407
00:19:03.359 --> 00:19:07.519
<v Speaker 2>Yes, and they stressed normalizing the data first, because scores,

408
00:19:07.559 --> 00:19:10.559
<v Speaker 2>age and education level are all on different scales.
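
NOTE
Editorial sketch of putting variables on a common scale before a distance-based
method. It assumes z-score standardization; the episode only says "normalizing",
so the exact method used in the source is not known. The values are hypothetical.
```python
import statistics
def standardize(column):
    """Rescale one variable to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(column)
    spread = statistics.pstdev(column)
    return [(value - mean) / spread for value in column]
# Hypothetical values on very different scales.
ages = [18, 25, 40, 61]
sat_scores = [480, 610, 700, 550]
scaled_ages = standardize(ages)
scaled_scores = standardize(sat_scores)
```
After scaling, a difference of one unit means "one standard deviation" in either
variable, so neither dominates the distance calculation.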

409
00:19:10.720 --> 00:19:12.119
<v Speaker 1>Did the elbow method work?

410
00:19:12.240 --> 00:19:15.400
<v Speaker 2>It suggested K equals two, two clusters, but the author actually

411
00:19:15.400 --> 00:19:17.480
<v Speaker 2>decided to go with K equals three, looking for maybe a

412
00:19:17.480 --> 00:19:18.920
<v Speaker 2>bit more detail in the groupings.

413
00:19:18.960 --> 00:19:21.000
<v Speaker 1>And what did those three clusters reveal.

414
00:19:21.079 --> 00:19:25.079
<v Speaker 2>Cluster zero had the oldest students, most education, highest test

415
00:19:25.119 --> 00:19:28.400
<v Speaker 2>scores generally, Cluster one was kind of medium across the board.

416
00:19:28.400 --> 00:19:31.480
<v Speaker 2>It's second oldest, second highest education, and Cluster two was

417
00:19:31.559 --> 00:19:33.640
<v Speaker 2>much younger with weaker test performance.

418
00:19:33.720 --> 00:19:36.319
<v Speaker 1>Seems logical age and achievement grouping together.

419
00:19:36.759 --> 00:19:39.920
<v Speaker 2>Yeah, But here's the kicker, the really surprising part. When

420
00:19:39.920 --> 00:19:42.519
<v Speaker 2>they looked closer, what did they find? A really clear

421
00:19:42.559 --> 00:19:48.599
<v Speaker 2>separation by gender emerged. Cluster one, all males; cluster zero

422
00:19:48.599 --> 00:19:49.880
<v Speaker 2>and two all females.

423
00:19:49.960 --> 00:19:53.039
<v Speaker 1>WHOA, The algorithm wasn't told about gender, right, It just

424
00:19:53.079 --> 00:19:54.839
<v Speaker 1>found that pattern exactly.

425
00:19:54.839 --> 00:19:58.400
<v Speaker 2>It revealed this underlying structure, and it helped explain things

426
00:19:59.079 --> 00:20:02.359
<v Speaker 2>like why cluster one, all males, had particularly high

427
00:20:02.440 --> 00:20:05.799
<v Speaker 2>quantitative SAT scores, tying into statistical trends.

428
00:20:05.880 --> 00:20:08.880
<v Speaker 1>That's amazing. So K-means didn't just group by scores. It

429
00:20:09.000 --> 00:20:13.160
<v Speaker 1>uncovered this fundamental demographic split. What does that tell us? Generally?

430
00:20:13.839 --> 00:20:18.079
<v Speaker 2>It shows how these unsupervised methods can reveal really deep insights,

431
00:20:18.440 --> 00:20:21.839
<v Speaker 2>sometimes biases or strong correlations we weren't even looking for.

432
00:20:22.119 --> 00:20:24.400
<v Speaker 2>It forces you to ask why the data is structured

433
00:20:24.440 --> 00:20:24.759
<v Speaker 2>that way.

434
00:20:24.839 --> 00:20:27.160
<v Speaker 1>So the final interpretation was cluster.

435
00:20:26.920 --> 00:20:30.759
<v Speaker 2>Zero older educated women, cluster one young males, Cluster two

436
00:20:31.039 --> 00:20:33.839
<v Speaker 2>very young women, a much richer picture than just high medium,

437
00:20:33.920 --> 00:20:34.519
<v Speaker 2>low scores.

438
00:20:34.599 --> 00:20:38.039
<v Speaker 1>Incredible. Okay, so we've seen all these algorithms, but just

439
00:20:38.079 --> 00:20:40.000
<v Speaker 1>getting an answer isn't enough. We need to know if

440
00:20:40.000 --> 00:20:40.759
<v Speaker 1>the answer is good.

441
00:20:40.960 --> 00:20:44.759
<v Speaker 2>Assessing models absolutely critical, otherwise you're just generating numbers without

442
00:20:44.799 --> 00:20:45.720
<v Speaker 2>knowing if they're meaningful.

443
00:20:45.920 --> 00:20:49.480
<v Speaker 1>And for classification, the go to tool is the confusion matrix.

444
00:20:49.759 --> 00:20:52.400
<v Speaker 2>Definitely a fundamental starting point. It just lays out your

445
00:20:52.400 --> 00:20:55.039
<v Speaker 2>predictions against the actual truth in a simple table.

446
00:20:55.160 --> 00:20:57.279
<v Speaker 1>Remind us of the four boxes you've.

447
00:20:57.079 --> 00:21:00.519
<v Speaker 2>Got true negatives TN correctly saying some thing isn't there.

448
00:21:00.680 --> 00:21:04.759
<v Speaker 2>False negatives FN missing something that is there. Big problem

449
00:21:04.799 --> 00:21:05.920
<v Speaker 2>in medical tests.

450
00:21:05.599 --> 00:21:08.359
<v Speaker 1>For instance, right false positives.

451
00:21:07.880 --> 00:21:10.240
<v Speaker 2>FP, saying something is there when it isn't, like a

452
00:21:10.279 --> 00:21:12.880
<v Speaker 2>spam filter blocking a real email.

453
00:21:12.880 --> 00:21:14.559
<v Speaker 1>And true positives TP.

454
00:21:14.519 --> 00:21:17.640
<v Speaker 2>Correctly identifying something that is there the spam filter catching

455
00:21:17.720 --> 00:21:18.440
<v Speaker 2>actual spam.

456
00:21:18.599 --> 00:21:22.519
<v Speaker 1>So the cancer decision tree example had four TN, zero FN,

457
00:21:22.640 --> 00:21:25.720
<v Speaker 1>twenty eight FP, and eighty four TP. Lots of numbers.

458
00:21:26.240 --> 00:21:29.359
<v Speaker 1>If I'm building, say a system to detect faulty products

459
00:21:29.359 --> 00:21:32.440
<v Speaker 1>on an assembly line, which metrics should I care most about?

460
00:21:32.480 --> 00:21:35.400
<v Speaker 2>Ooh, good question. For faulty products, you probably care a

461
00:21:35.440 --> 00:21:39.319
<v Speaker 2>lot about recall, also called sensitivity. That's true positives divided

462
00:21:39.319 --> 00:21:42.400
<v Speaker 2>by all the actual positives TP plus FN. You want

463
00:21:42.440 --> 00:21:44.559
<v Speaker 2>to catch as many faulty products as possible, even if

464
00:21:44.599 --> 00:21:47.480
<v Speaker 2>you accidentally flag a few good ones. Missing a faulty

465
00:21:47.519 --> 00:21:50.000
<v Speaker 2>one could be costly or dangerous.

466
00:21:49.640 --> 00:21:52.920
<v Speaker 1>So recall is about minimizing misses. For the cancer example,

467
00:21:52.960 --> 00:21:55.519
<v Speaker 1>recall was point seventy six yes.

468
00:21:55.680 --> 00:21:59.200
<v Speaker 2>And precision, which is true positives divided by all the

469
00:21:59.240 --> 00:22:02.880
<v Speaker 2>ones the model predicted as positive, TP plus FP, was

470
00:22:02.920 --> 00:22:07.079
<v Speaker 2>also point seven six. Precision is about how trustworthy

471
00:22:07.079 --> 00:22:10.160
<v Speaker 2>a positive prediction is: when the model says it's cancer,

472
00:22:10.279 --> 00:22:12.359
<v Speaker 2>how often is it right? And the F measure, that

473
00:22:12.519 --> 00:22:15.640
<v Speaker 2>just combines precision and recall into one score also point

474
00:22:15.680 --> 00:22:18.559
<v Speaker 2>seventy six in this case because there were zero false negatives,

475
00:22:18.559 --> 00:22:19.400
<v Speaker 2>which is unusual.

476
00:22:19.440 --> 00:22:21.039
<v Speaker 1>What about the negative specificity?

477
00:22:21.160 --> 00:22:24.480
<v Speaker 2>Specificity is true negatives divided by all the actual negatives

478
00:22:24.599 --> 00:22:27.920
<v Speaker 2>TN plus FP. It measures how well the model identifies the

479
00:22:27.920 --> 00:22:30.599
<v Speaker 2>true negatives. For the cancer example, it was only point

480
00:22:30.640 --> 00:22:33.480
<v Speaker 2>one three, quite poor, meaning it wasn't good at correctly

481
00:22:33.519 --> 00:22:35.200
<v Speaker 2>identifying healthy people as healthy.

482
00:22:35.359 --> 00:22:38.559
<v Speaker 1>Okay, so different metrics matter depending on the goal. What

483
00:22:38.720 --> 00:22:40.200
<v Speaker 1>about plain old accuracy?

484
00:22:40.319 --> 00:22:44.039
<v Speaker 2>Accuracy is just TN plus TP divided by the total, all

485
00:22:44.079 --> 00:22:46.920
<v Speaker 2>the correct predictions. It was seventy six percent here. But

486
00:22:47.000 --> 00:22:50.079
<v Speaker 2>accuracy can be misleading, especially if one class is way

487
00:22:50.119 --> 00:22:50.880
<v Speaker 2>more common.

488
00:22:50.640 --> 00:22:53.000
<v Speaker 1>Than the other, like if ninety nine percent of emails

489
00:22:53.000 --> 00:22:56.559
<v Speaker 1>aren't spam. A model predicting not spam all the time

490
00:22:56.640 --> 00:22:59.920
<v Speaker 1>gets ninety nine percent accuracy, but is useless Exactly.

491
00:23:00.039 --> 00:23:02.279
<v Speaker 2>That's why you need these other metrics. An error is

492
00:23:02.400 --> 00:23:05.000
<v Speaker 2>just one minus accuracy, so twenty four percent here.

493
00:23:05.200 --> 00:23:07.240
<v Speaker 1>And finally, kappa.

494
00:23:06.799 --> 00:23:09.880
<v Speaker 2>Kappa measures accuracy but accounts for how much agreement you'd

495
00:23:09.880 --> 00:23:12.880
<v Speaker 2>expect just by chance, closer to one is better. Here

496
00:23:13.039 --> 00:23:15.880
<v Speaker 2>was point one seven one, which is pretty low.

497
00:23:16.279 --> 00:23:19.200
<v Speaker 2>Suggests the model's performance wasn't much better than random guessing

498
00:23:19.400 --> 00:23:20.920
<v Speaker 2>once you factor chance in. Wow.
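
NOTE
Editorial sketch computing the metrics discussed above from the four counts quoted
for the cancer decision tree, read here as TN=4, FN=0, FP=28, TP=84 (the TN reading
is an assumption from the audio). With these counts, accuracy (about 0.76),
specificity (about 0.13) and kappa (about 0.171) match the figures quoted; the
positive-class recall works out to 1.0 and precision to 0.75, so the 0.76 quoted
for recall, precision and F measure looks like an averaged value rather than a
positive-class one.
```python
def classification_metrics(tn, fn, fp, tp):
    total = tn + fn + fp + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn)        # sensitivity: share of actual positives caught
    precision = tp / (tp + fp)     # share of positive predictions that were right
    specificity = tn / (tn + fp)   # share of actual negatives correctly identified
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }
# Counts as heard in the episode (assumed reading).
metrics = classification_metrics(tn=4, fn=0, fp=28, tp=84)
```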

499
00:23:21.079 --> 00:23:23.559
<v Speaker 1>Okay, so looking at all of them gives a much

500
00:23:23.599 --> 00:23:24.359
<v Speaker 1>fuller picture.

501
00:23:24.440 --> 00:23:24.920
<v Speaker 2>Definitely.

502
00:23:25.359 --> 00:23:29.599
<v Speaker 1>Now the ROC curve, this has a cool backstory. WWII.

503
00:23:29.799 --> 00:23:32.160
<v Speaker 2>Yeah. Radar engineers used it to figure out if a

504
00:23:32.160 --> 00:23:35.279
<v Speaker 2>blip was an enemy plane a true positive or just

505
00:23:35.519 --> 00:23:37.680
<v Speaker 2>noise, a false positive. High-stakes stuff.

506
00:23:37.799 --> 00:23:39.000
<v Speaker 1>Now we use it for models.

507
00:23:39.279 --> 00:23:41.960
<v Speaker 2>Yep. It plots the true positive rate against the false

508
00:23:41.960 --> 00:23:45.519
<v Speaker 2>positive rate at various thresholds. Ideally, you want the curve

509
00:23:45.599 --> 00:23:48.799
<v Speaker 2>to shoot up towards the top left corner high true positives,

510
00:23:48.880 --> 00:23:49.960
<v Speaker 2>low false positives.

511
00:23:50.000 --> 00:23:52.279
<v Speaker 1>A diagonal line is bad like random guessing.

512
00:23:52.519 --> 00:23:56.400
<v Speaker 2>Exactly the area under this curve AUC gives a single number,

513
00:23:56.519 --> 00:23:59.160
<v Speaker 2>zero to one. The cancer model had an AUC of

514
00:23:59.240 --> 00:24:02.839
<v Speaker 2>zero point seven nine, which is generally considered acceptable. Maybe

515
00:24:02.839 --> 00:24:03.720
<v Speaker 2>fairly good. Okay.
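
NOTE
Editorial sketch of what the AUC number means. It uses the pairwise definition:
the chance that a randomly chosen positive case gets a higher score than a randomly
chosen negative one, which equals the area under the ROC curve. The labels and
scores are hypothetical, not the cancer model's output.
```python
def auc(labels, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
# Hypothetical true labels and model scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]
area = auc(labels, scores)
```
A value of 0.5 corresponds to the diagonal line (random guessing) and 1.0 to a
model that ranks every positive above every negative.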

516
00:24:03.759 --> 00:24:08.039
<v Speaker 1>Another key technique cross validation. This is for checking generalizability.

517
00:24:08.240 --> 00:24:10.640
<v Speaker 2>Right. Instead of just one train test split, you divide

518
00:24:10.680 --> 00:24:14.319
<v Speaker 2>your training data into say five or ten folds like slices. Yeah.

519
00:24:14.759 --> 00:24:17.240
<v Speaker 2>Then you train the model five times. Each time, you

520
00:24:17.279 --> 00:24:19.599
<v Speaker 2>train on four folds and test on the one fold

521
00:24:19.720 --> 00:24:20.400
<v Speaker 2>left out.

522
00:24:20.359 --> 00:24:23.079
<v Speaker 1>Using a different slice for testing each time exactly.

523
00:24:23.559 --> 00:24:26.079
<v Speaker 2>Then you average the results from those five tests. It

524
00:24:26.119 --> 00:24:28.119
<v Speaker 2>gives you a much better idea of how stable the

525
00:24:28.160 --> 00:24:30.640
<v Speaker 2>performance is and how well it's likely to perform on

526
00:24:30.759 --> 00:24:31.960
<v Speaker 2>genuinely new data.

527
00:24:32.000 --> 00:24:34.200
<v Speaker 1>And the real test set, the one you held back

528
00:24:34.240 --> 00:24:34.920
<v Speaker 1>at the start.

529
00:24:35.160 --> 00:24:38.400
<v Speaker 2>Crucially, you only touch that once, right at the very end,

530
00:24:38.720 --> 00:24:41.000
<v Speaker 2>after you've done all your model selection and tuning, using

531
00:24:41.000 --> 00:24:43.759
<v Speaker 2>cross validation on the training data. Don't peek.

532
00:24:43.880 --> 00:24:47.000
<v Speaker 1>Got it? For the cancer decision tree, cross validation showed

533
00:24:47.000 --> 00:24:50.880
<v Speaker 1>about seventy three percent accuracy, but with a standard deviation

534
00:24:51.160 --> 00:24:51.960
<v Speaker 1>of eleven percent.

535
00:24:52.119 --> 00:24:54.160
<v Speaker 2>Yeah, that standard deviation tells you there was quite a

536
00:24:54.200 --> 00:24:57.640
<v Speaker 2>bit of variability in performance across the different folds. Maybe

537
00:24:57.839 --> 00:25:00.759
<v Speaker 2>due to the small sample size making the folds quite different

538
00:25:00.759 --> 00:25:01.240
<v Speaker 2>from each other.
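
NOTE
Editorial sketch of the k-fold procedure described above: split the training rows
into folds, hold each fold out once, then report the mean and standard deviation
of the scores. Assumptions: interleaved folds without shuffling, and a hypothetical
stand-in scoring function in place of a real model.
```python
import statistics
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each sample lands in exactly one test fold."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
def cross_validate(n_samples, k, fit_and_score):
    """Call fit_and_score(train, test) once per fold; return mean and standard deviation."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return statistics.fmean(scores), statistics.stdev(scores)
# Hypothetical stand-in for "train on four folds, score on the fifth".
fake_scores = iter([0.70, 0.85, 0.60, 0.75, 0.75])
mean_acc, sd_acc = cross_validate(10, 5, lambda train, test: next(fake_scores))
```
A large standard deviation relative to the mean is the instability the speakers
describe; the held-back test set is not involved at any point here.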

539
00:25:01.319 --> 00:25:06.240
<v Speaker 1>Okay, shifting to assessing regression models predicting numbers. No confusion

540
00:25:06.240 --> 00:25:07.240
<v Speaker 1>matrix here, right.

541
00:25:07.200 --> 00:25:10.119
<v Speaker 2>Right, we use different metrics. You'd start by comparing basic stats,

542
00:25:10.200 --> 00:25:14.000
<v Speaker 2>mean, standard deviation, quartiles of the actual values versus

543
00:25:14.000 --> 00:25:17.759
<v Speaker 2>your predicted values. Ideally they should look pretty similar.

544
00:25:17.640 --> 00:25:20.160
<v Speaker 1>Like for the age prediction the means were close, but

545
00:25:20.200 --> 00:25:21.640
<v Speaker 1>standard deviations.

546
00:25:21.160 --> 00:25:23.599
<v Speaker 2>Differed a bit. Yeah, exactly. Then you look at the

547
00:25:23.599 --> 00:25:27.680
<v Speaker 2>correlation between actual and predicted for age that was point

548
00:25:27.680 --> 00:25:29.000
<v Speaker 2>five four on the training set.

549
00:25:29.119 --> 00:25:31.400
<v Speaker 1>We also saw mean squared error, MSE.

550
00:25:31.839 --> 00:25:35.079
<v Speaker 2>Lower is better, lower is better, Yes, for age, it

551
00:25:35.160 --> 00:25:37.720
<v Speaker 2>was sixty one point eight on train eighty five point

552
00:25:37.799 --> 00:25:40.400
<v Speaker 2>twenty four on test. The difference shows that drop in

553
00:25:40.440 --> 00:25:43.519
<v Speaker 2>performance on unseen data. And R squared. R squared tells

554
00:25:43.559 --> 00:25:45.920
<v Speaker 2>you how much of the variation in the outcome variable

555
00:25:45.960 --> 00:25:49.880
<v Speaker 2>your model explains. Ranges from zero to one. For age prediction,

556
00:25:49.920 --> 00:25:52.559
<v Speaker 2>that was point twenty nine, which the source called not exciting.

557
00:25:52.839 --> 00:25:55.319
<v Speaker 2>It means the model only explained about twenty nine percent

558
00:25:55.359 --> 00:25:56.480
<v Speaker 2>of the variation in age.
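
NOTE
Editorial sketch of the two regression metrics just described, using the standard
definitions. The actual and predicted values are hypothetical, not the age
predictions from the source.
```python
def mse(actual, predicted):
    """Mean squared error: average squared gap between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
def r_squared(actual, predicted):
    """Share of the variation in the actual values that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    total = sum((a - mean_actual) ** 2 for a in actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - residual / total
# Hypothetical values.
actual = [20, 30, 40, 50]
predicted = [25, 28, 41, 44]
error = mse(actual, predicted)
fit = r_squared(actual, predicted)
```
Computed this way, R squared can fall below zero on held-out data when the
predictions do worse than always guessing the mean.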

559
00:25:56.519 --> 00:25:58.160
<v Speaker 1>And cross validation applies here too.

560
00:25:58.359 --> 00:26:01.519
<v Speaker 2>Absolutely, you'd cross validate metrics like MSE and R two

561
00:26:01.799 --> 00:26:05.559
<v Speaker 2>for that SVM regression predicting education. The cross validated MSE

562
00:26:05.880 --> 00:26:08.519
<v Speaker 2>was similar to the original test MSE, which is good,

563
00:26:08.599 --> 00:26:11.119
<v Speaker 2>But the R two, the cross validated R two had

564
00:26:11.119 --> 00:26:13.960
<v Speaker 2>a high standard deviation point one nine. So I guess

565
00:26:14.000 --> 00:26:16.480
<v Speaker 2>the model's ability to explain the variance wasn't very stable

566
00:26:16.519 --> 00:26:19.359
<v Speaker 2>across different subsets of the data. Again, maybe sample size

567
00:26:19.359 --> 00:26:21.160
<v Speaker 2>issues or just a complex relationship.

568
00:26:21.240 --> 00:26:25.119
<v Speaker 1>Okay, we've built models, we've assessed them now making them better.

569
00:26:25.640 --> 00:26:29.000
<v Speaker 1>We know we can get more data, change variables, switch algorithms,

570
00:26:29.440 --> 00:26:31.160
<v Speaker 1>But what about tuning.

571
00:26:31.119 --> 00:26:34.960
<v Speaker 2>Ah hyper parameter tuning. This is really interesting. It's about

572
00:26:35.160 --> 00:26:38.799
<v Speaker 2>tweaking the settings of the algorithm itself before it starts learning.

573
00:26:39.440 --> 00:26:42.440
<v Speaker 1>So not changing the data, but changing how the algorithm

574
00:26:42.519 --> 00:26:44.119
<v Speaker 1>learns from the data exactly.

575
00:26:44.200 --> 00:26:47.000
<v Speaker 2>Things like the K in kNN, or the C

576
00:26:47.200 --> 00:26:50.160
<v Speaker 2>penalty term in SVM, or how deep you let a decision

577
00:26:50.200 --> 00:26:52.640
<v Speaker 2>tree grow. These are hyper.

578
00:26:52.359 --> 00:26:54.000
<v Speaker 1>Parameters, and how do you tune them?

579
00:26:54.400 --> 00:26:58.240
<v Speaker 2>Guessing more systematic than that, Usually you define a grid

580
00:26:58.440 --> 00:27:01.279
<v Speaker 2>of possible values for the hyper parameters you want to tune,

581
00:27:01.480 --> 00:27:04.599
<v Speaker 2>like try K values from one to twenty, try different distance.

582
00:27:04.279 --> 00:27:06.920
<v Speaker 1>Metrics, and the computer tries all the combinations.

583
00:27:06.440 --> 00:27:09.559
<v Speaker 2>YEP, often using cross validation within the grid search. It

584
00:27:09.599 --> 00:27:12.319
<v Speaker 2>runs models for all combinations and tells you which set

585
00:27:12.319 --> 00:27:15.440
<v Speaker 2>of hyper parameters gave the best average performance on the

586
00:27:15.480 --> 00:27:18.240
<v Speaker 2>cross validation folds. It's a bit of trial and error

587
00:27:18.359 --> 00:27:19.200
<v Speaker 2>guided by data.
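
NOTE
Editorial sketch of the grid idea described above: list the candidate values, try
every combination, keep the best. The grid and the scoring function are
hypothetical; in practice the score would be the mean cross-validated performance
of a model fitted with those settings.
```python
from itertools import product
def grid_search(grid, score):
    """Try every combination in the grid and return the best-scoring one."""
    names = list(grid)
    candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
    best = max(candidates, key=score)
    return best, len(candidates)
# Hypothetical grid for a kNN model.
grid = {
    "k": range(1, 21),
    "weights": ["uniform", "distance"],
    "metric": ["manhattan", "euclidean"],
}
# Hypothetical stand-in for "mean cross-validated accuracy of this setting".
def toy_score(params):
    return -abs(params["k"] - 7) + (0.5 if params["weights"] == "distance" else 0.0)
best, n_models = grid_search(grid, toy_score)
```
The number of models is the product of the list lengths (here 20 x 2 x 2), which
is why grids grow expensive quickly.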

588
00:27:19.519 --> 00:27:22.759
<v Speaker 1>Let's look at the kNN tuning example. Original model predicting

589
00:27:22.839 --> 00:27:25.160
<v Speaker 1>voted was around seventy nine percent accurate.

590
00:27:25.480 --> 00:27:29.440
<v Speaker 2>Right, they tuned K one thirteen, the weights, uniform versus

591
00:27:29.480 --> 00:27:33.240
<v Speaker 2>distance based, and the distance metric, Manhattan versus Minkowski, Euclidean.

592
00:27:33.799 --> 00:27:36.759
<v Speaker 2>That created ninety six different kNN models to test.

593
00:27:36.920 --> 00:27:39.039
<v Speaker 1>Wow, and the best one.

594
00:27:38.839 --> 00:27:42.160
<v Speaker 2>The best combination found through cross validation, achieved seventy four

595
00:27:42.200 --> 00:27:43.240
<v Speaker 2>percent accuracy.

596
00:27:43.240 --> 00:27:45.759
<v Speaker 1>That's lower than the original seventy nine percent.

597
00:27:45.519 --> 00:27:49.200
<v Speaker 2>On cross validation. Yes, but here's the crucial part. When

598
00:27:49.240 --> 00:27:52.480
<v Speaker 2>they tested that tuned model back on its own training data,

599
00:27:52.559 --> 00:27:55.640
<v Speaker 2>its accuracy jumped to ninety nine point seven percent.

600
00:27:55.799 --> 00:27:59.359
<v Speaker 1>WHOA nearly perfect. But that sounds suspicious.

601
00:27:59.359 --> 00:28:02.119
<v Speaker 2>Really suspicious. That huge gap between the near perfect

602
00:28:02.160 --> 00:28:05.920
<v Speaker 2>training accuracy and the seventy four percent cross validated accuracy

603
00:28:05.960 --> 00:28:10.759
<v Speaker 2>screams overfitting. The tuning process found settings that basically memorize

604
00:28:10.759 --> 00:28:11.480
<v Speaker 2>the training data.

605
00:28:11.559 --> 00:28:13.400
<v Speaker 1>So this is a massive warning sign about just looking

606
00:28:13.440 --> 00:28:15.519
<v Speaker 1>at training accuracy, especially after tuning.

607
00:28:15.599 --> 00:28:19.200
<v Speaker 2>Absolutely reinforces why cross validation and that final untouched test

608
00:28:19.240 --> 00:28:20.519
<v Speaker 2>set are non negotiable.

609
00:28:20.559 --> 00:28:24.000
<v Speaker 1>Okay, and the SVM regression tuning, predicting education. Original MSE

610
00:28:24.160 --> 00:28:25.960
<v Speaker 1>was about point zero one five eight nine.

611
00:28:26.039 --> 00:28:29.039
<v Speaker 2>They tuned the C cost parameter, the kernel type, linear

612
00:28:29.119 --> 00:28:31.960
<v Speaker 2>versus RBF and degree for polynomial kernels and the result

613
00:28:32.160 --> 00:28:36.119
<v Speaker 2>the best combo nudged the MSE down slightly to point

614
00:28:36.240 --> 00:28:37.680
<v Speaker 2>zero one five three eight.

615
00:28:37.720 --> 00:28:39.559
<v Speaker 1>A tiny improvement worth it.

616
00:28:39.400 --> 00:28:42.759
<v Speaker 2>Depends entirely on the context. Sometimes shaving even a tiny

617
00:28:42.759 --> 00:28:46.079
<v Speaker 2>bit off the error can be hugely valuable. It shows

618
00:28:46.119 --> 00:28:48.519
<v Speaker 2>the potential, even if it's not always dramatic.

619
00:28:48.680 --> 00:28:52.920
<v Speaker 1>Finally, taking it one step further, combining algorithms the ensemble

620
00:28:52.920 --> 00:28:54.480
<v Speaker 1>approach or stacking.

621
00:28:54.680 --> 00:28:57.920
<v Speaker 2>Yes, stacking is a pretty sophisticated way to combine models.

622
00:28:58.000 --> 00:29:01.079
<v Speaker 2>The basic idea is you train several different models, like

623
00:29:01.119 --> 00:29:05.119
<v Speaker 2>a random forest, a SVM, a kNN okay. Then instead

624
00:29:05.119 --> 00:29:07.880
<v Speaker 2>of just averaging their outputs, you use their predictions as

625
00:29:07.920 --> 00:29:10.279
<v Speaker 2>inputs for a final meta model.

626
00:29:10.319 --> 00:29:12.799
<v Speaker 1>So a model that learns from the predictions of other.

627
00:29:12.680 --> 00:29:15.400
<v Speaker 2>Models exactly, it learns how to best combine the strengths

628
00:29:15.400 --> 00:29:17.960
<v Speaker 2>and potentially correct the weaknesses of the base models to

629
00:29:18.000 --> 00:29:20.599
<v Speaker 2>make a final, hopefully better prediction.

630
00:29:20.359 --> 00:29:21.720
<v Speaker 1>Trying to create a supermodel.

631
00:29:21.759 --> 00:29:24.799
<v Speaker 2>That's the goal maximize strengths, minimize weaknesses.

632
00:29:24.960 --> 00:29:28.559
<v Speaker 1>The example used the doctor Aus data set again predicting gender,

633
00:29:28.640 --> 00:29:32.759
<v Speaker 1>combining random forest, SVM, and kNN YEP individually.

634
00:29:33.039 --> 00:29:36.039
<v Speaker 2>Cross validation showed random forest was the best performer on

635
00:29:36.119 --> 00:29:38.039
<v Speaker 2>its own, while SVM struggled a.

636
00:29:37.960 --> 00:29:40.839
<v Speaker 1>Bit, So did stacking them beat the random forest?

637
00:29:41.200 --> 00:29:46.319
<v Speaker 2>Interestingly? No, No, the initial stacked model actually performed worse

638
00:29:46.400 --> 00:29:48.359
<v Speaker 2>than just using the random forest by itself.

639
00:29:48.559 --> 00:29:50.960
<v Speaker 1>Huh So more complex isn't always better?

640
00:29:51.160 --> 00:29:55.240
<v Speaker 2>Definitely not automatically. Just throwing models together can sometimes add

641
00:29:55.279 --> 00:29:56.680
<v Speaker 2>noise or compound errors.

642
00:29:56.880 --> 00:29:59.359
<v Speaker 1>Did they try tuning the models within the stack?

643
00:30:00.119 --> 00:30:04.680
<v Speaker 2>They did. They tuned hyper parameters for the random forest, the SVM,

644
00:30:04.759 --> 00:30:07.319
<v Speaker 2>and the kNN within the ensemble.

645
00:30:06.920 --> 00:30:09.400
<v Speaker 1>Structure and the final result after all that work, The.

646
00:30:09.319 --> 00:30:13.480
<v Speaker 2>Final tuned ensemble model landed at about seventy percent accuracy

647
00:30:13.519 --> 00:30:15.279
<v Speaker 2>on both training and test data.

648
00:30:15.359 --> 00:30:19.440
<v Speaker 1>So after all that stacking and tuning, it didn't really

649
00:30:19.519 --> 00:30:22.839
<v Speaker 1>improve much, if at all, over the simpler single random

650
00:30:22.880 --> 00:30:23.519
<v Speaker 1>forest model.

651
00:30:23.559 --> 00:30:26.319
<v Speaker 2>Pretty much in this specific case, the added complexity didn't

652
00:30:26.359 --> 00:30:28.160
<v Speaker 2>yield significantly better performance.

653
00:30:28.240 --> 00:30:31.119
<v Speaker 1>That's a really important takeaway. A huge one. Wow. Okay,

654
00:30:31.119 --> 00:30:33.599
<v Speaker 1>we've covered a lot of ground today, from decision trees

655
00:30:33.680 --> 00:30:34.880
<v Speaker 1>kind of the building block,

656
00:30:34.720 --> 00:30:36.839
<v Speaker 2>to random forests, using the wisdom of the

657
00:30:36.799 --> 00:30:42.319
<v Speaker 1>crowd, kNN using proximity, SVM finding that optimal boundary,

658
00:30:42.440 --> 00:30:44.440
<v Speaker 1>ANNs sort of mimicking the brain, then

659
00:30:44.359 --> 00:30:47.640
<v Speaker 2>k-means finding hidden groups without even being told what

660
00:30:47.759 --> 00:30:48.279
<v Speaker 2>to look

661
00:30:48.119 --> 00:30:51.599
<v Speaker 1>for. And crucially, how to actually measure if these models

662
00:30:51.640 --> 00:30:55.799
<v Speaker 1>are any good, using things like the confusion matrix, ROC curves,

663
00:30:55.880 --> 00:30:57.240
<v Speaker 1>cross-validation.

664
00:30:57.039 --> 00:31:00.519
<v Speaker 2>And how to potentially improve them through careful hyperparameter

665
00:31:00.599 --> 00:31:02.680
<v Speaker 2>tuning and even stacking them together.

666
00:31:02.839 --> 00:31:05.960
<v Speaker 1>And remember, listener, these aren't just, you know, abstract ideas.

667
00:31:06.079 --> 00:31:08.240
<v Speaker 1>This deep dive should give you a real foundation for

668
00:31:08.359 --> 00:31:10.599
<v Speaker 1>understanding how these tools work. Yeah.

669
00:31:10.400 --> 00:31:14.480
<v Speaker 2>Their quirks, their strengths, their weaknesses, and importantly, how to

670
00:31:14.519 --> 00:31:17.920
<v Speaker 2>think critically about their results and choose the right approach

671
00:31:18.000 --> 00:31:19.400
<v Speaker 2>for your specific problem.

672
00:31:19.680 --> 00:31:22.519
<v Speaker 1>And that final point from the ensemble example really sticks

673
00:31:22.559 --> 00:31:26.160
<v Speaker 1>with me. As the source material concludes, and this is powerful:

674
00:31:26.720 --> 00:31:29.599
<v Speaker 1>Complexity is not a cure-all for better performance, and

675
00:31:30.119 --> 00:31:31.960
<v Speaker 1>simple is almost always better.

676
00:31:32.079 --> 00:31:33.880
<v Speaker 2>It's a fantastic lesson, it really is.

677
00:31:33.920 --> 00:31:35.640
<v Speaker 1>And it leaves us with a question for you to

678
00:31:35.720 --> 00:31:38.680
<v Speaker 1>maybe mull over. In whatever you're working on, whether it's

679
00:31:38.759 --> 00:31:41.720
<v Speaker 1>data or just life, where might you be

680
00:31:41.799 --> 00:31:45.880
<v Speaker 1>overcomplicating things? Where could a simpler approach actually lead to clearer,

681
00:31:46.000 --> 00:31:47.640
<v Speaker 1>more powerful insights?

682
00:31:47.680 --> 00:31:48.519
<v Speaker 2>Something to think about.
