WEBVTT

1
00:00:00.280 --> 00:00:02.839
<v Speaker 1>Imagine a world running on data that looks, well, kind

2
00:00:02.839 --> 00:00:06.679
<v Speaker 1>of like a spreadsheet. We're talking banks, insurance, retail, government.

3
00:00:06.879 --> 00:00:09.119
<v Speaker 1>This isn't the really flashy stuff like AI making pictures

4
00:00:09.240 --> 00:00:13.839
<v Speaker 1>or music. This is the absolute bedrock of our digital economy.

5
00:00:14.240 --> 00:00:16.800
<v Speaker 1>So today we're doing a deep dive into this fascinating

6
00:00:16.879 --> 00:00:21.399
<v Speaker 1>and honestly often overlooked world of machine learning for tabular data.

7
00:00:21.920 --> 00:00:24.079
<v Speaker 1>Our goal here is to cut through some of the noise,

8
00:00:24.280 --> 00:00:27.079
<v Speaker 1>give you a shortcut to understanding what tabular data actually is,

9
00:00:27.480 --> 00:00:30.440
<v Speaker 1>why it matters so much, and how we apply the

10
00:00:30.480 --> 00:00:33.240
<v Speaker 1>well, the most powerful ML techniques to it. We'll touch

11
00:00:33.280 --> 00:00:36.960
<v Speaker 1>on everything from cleaning up messy data, pitting classic ML

12
00:00:37.039 --> 00:00:40.399
<v Speaker 1>against deep learning, and even deploying these models out into

13
00:00:40.399 --> 00:00:42.880
<v Speaker 1>the real world. You might find some surprising facts, maybe

14
00:00:42.880 --> 00:00:45.479
<v Speaker 1>a few aha moments. Okay, let's get into it. So

15
00:00:45.520 --> 00:00:48.759
<v Speaker 1>let's start right at the beginning. What exactly is tabular data?

16
00:00:48.759 --> 00:00:51.280
<v Speaker 1>Think of just a simple table, maybe one listing

17
00:00:51.320 --> 00:00:54.359
<v Speaker 1>currencies for different countries. Each row that's an observation, like

18
00:00:54.399 --> 00:00:57.159
<v Speaker 1>all the details for Australia's currency, and then each column

19
00:00:57.200 --> 00:00:59.840
<v Speaker 1>is a feature, things like currency name or units per u.

20
00:01:00.600 --> 00:01:03.479
<v Speaker 2>Pretty straightforward, right? Yeah, exactly. It's a format everyone gets,

21
00:01:03.479 --> 00:01:05.920
<v Speaker 2>which is why it's so fundamental. And while it might

22
00:01:05.959 --> 00:01:08.719
<v Speaker 2>only be say, ten percent or so of all the

23
00:01:08.760 --> 00:01:12.200
<v Speaker 2>digital data out there, by some estimates that ten percent

24
00:01:12.280 --> 00:01:16.000
<v Speaker 2>is absolutely critical. Its structure, that row-and-column set

25
00:01:16.079 --> 00:01:19.920
<v Speaker 2>up makes it super easy to input, retrieve, manage, analyze.

26
00:01:20.040 --> 00:01:24.920
<v Speaker 2>It really is the lifeblood for countless businesses, spreadsheets, huge databases,

27
00:01:24.959 --> 00:01:25.439
<v Speaker 2>you name it.

28
00:01:25.599 --> 00:01:27.599
<v Speaker 1>Okay, so here's something that's always kind of puzzled me.

29
00:01:28.079 --> 00:01:31.840
<v Speaker 1>If it's so fundamental, why haven't deep learning models completely

30
00:01:31.879 --> 00:01:33.799
<v Speaker 1>taken over this space? I mean, the way they have

31
00:01:33.920 --> 00:01:37.760
<v Speaker 1>for images or audio or text. What's the key difference there?

32
00:01:37.840 --> 00:01:39.879
<v Speaker 2>That is a really great question. The main thing is

33
00:01:39.879 --> 00:01:43.359
<v Speaker 2>that tabular data has that typical matrix shape: rows, columns,

34
00:01:43.560 --> 00:01:47.400
<v Speaker 2>pretty distinct. It's not like unstructured data, audio waves, pixels, text,

35
00:01:47.400 --> 00:01:50.319
<v Speaker 2>which is much more, well, unordered and varied. And because

36
00:01:50.359 --> 00:01:52.719
<v Speaker 2>of that unique structure, tabular data comes with its own

37
00:01:52.760 --> 00:01:56.439
<v Speaker 2>set of, let's call them, pathologies: common problems you absolutely

38
00:01:56.439 --> 00:01:59.519
<v Speaker 2>have to fix before you can do any serious analysis.

39
00:01:59.560 --> 00:02:01.359
<v Speaker 1>Pathologies. Okay, what kind of problems are we talking about?

40
00:02:01.519 --> 00:02:05.120
<v Speaker 2>Well, first off, you often find constant or quasi-constant

41
00:02:05.120 --> 00:02:08.759
<v Speaker 2>columns, features that just don't change much or at all.

42
00:02:09.360 --> 00:02:12.240
<v Speaker 2>They offer almost no information to a model. Then there

43
00:02:12.280 --> 00:02:15.560
<v Speaker 2>are duplicated and highly collinear features, so information that's

44
00:02:15.599 --> 00:02:18.680
<v Speaker 2>either just copied or it's so similar it's basically saying

45
00:02:18.719 --> 00:02:21.879
<v Speaker 2>the same thing twice. With linear models especially, this can

46
00:02:21.919 --> 00:02:25.120
<v Speaker 2>cause real conceptual misunderstandings. Makes it hard to figure out

47
00:02:25.159 --> 00:02:26.960
<v Speaker 2>what's actually driving a prediction.

48
00:02:26.919 --> 00:02:29.759
<v Speaker 1>Right, like having two columns that both measure basically the

49
00:02:29.759 --> 00:02:30.719
<v Speaker 1>same temperature scale.

50
00:02:30.759 --> 00:02:34.000
<v Speaker 2>Yeah, redundant, exactly. Then you have irrelevant features, stuff that

51
00:02:34.120 --> 00:02:37.800
<v Speaker 2>just doesn't help predict what you want to predict, and

52
00:02:38.159 --> 00:02:41.319
<v Speaker 2>the big one: missing data. This is crucial because some

53
00:02:41.599 --> 00:02:44.719
<v Speaker 2>ML algorithms just flat out won't run if there are gaps,

54
00:02:45.000 --> 00:02:47.919
<v Speaker 2>and these gaps aren't always random. Sometimes data is missing

55
00:02:47.960 --> 00:02:51.360
<v Speaker 2>completely at random, sometimes just at random, or sometimes it's

56
00:02:51.360 --> 00:02:53.960
<v Speaker 2>missing not at random. Think about it. A missing review

57
00:02:54.039 --> 00:02:57.319
<v Speaker 2>score might actually mean there are no reviews, which is

58
00:02:57.439 --> 00:02:58.719
<v Speaker 2>itself information.

59
00:02:58.400 --> 00:03:00.599
<v Speaker 1>All right, that's a subtle but important distinction.

60
00:03:00.759 --> 00:03:05.199
<v Speaker 2>Definitely. We also deal with rare categories, features with tons

61
00:03:05.280 --> 00:03:08.159
<v Speaker 2>of unique values, or values that show up super infrequently,

62
00:03:08.639 --> 00:03:12.400
<v Speaker 2>hard for models to learn from those, and my personal favorite,

63
00:03:12.479 --> 00:03:15.599
<v Speaker 2>just plain errors in the data. You know, misspellings, like

64
00:03:15.639 --> 00:03:19.719
<v Speaker 2>a misspelled variant in place of Toyota. Hmm, this isn't just a typo, cosmetically;

65
00:03:19.800 --> 00:03:22.879
<v Speaker 2>it splits what should be one category into multiple noisy ones.

66
00:03:22.960 --> 00:03:24.159
<v Speaker 2>Really confuses the model.

67
00:03:24.240 --> 00:03:25.240
<v Speaker 1>It sounds like a minefield.

68
00:03:25.520 --> 00:03:27.879
<v Speaker 2>It can be. The key insight here is like a

69
00:03:27.919 --> 00:03:30.560
<v Speaker 2>slightly blurry pixel in an image might just make it

70
00:03:30.639 --> 00:03:35.120
<v Speaker 2>less clear, but a single misspelled category or an unhandled

71
00:03:35.120 --> 00:03:38.280
<v Speaker 2>missing value in a table, that can fundamentally mislead your

72
00:03:38.280 --> 00:03:42.319
<v Speaker 2>model, force it to make decisions based on completely wrong information.

73
00:03:42.719 --> 00:03:44.520
<v Speaker 2>It's like having a great map, but with a few

74
00:03:44.599 --> 00:03:47.080
<v Speaker 2>key cities just randomly renamed. You can't navigate.

75
00:03:47.319 --> 00:03:52.000
<v Speaker 1>Huh, the bane of every data scientist's existence. It's like

76
00:03:52.039 --> 00:03:54.039
<v Speaker 1>trying to find Waldo, but half the time he's a misspelled

77
00:03:54.080 --> 00:03:57.599
<v Speaker 1>Waldo, completely messing up your search algorithm. So it really

78
00:03:57.639 --> 00:04:00.240
<v Speaker 1>sounds like, forget the fancy algorithms for a second, the real

79
00:04:00.280 --> 00:04:03.319
<v Speaker 1>hard work, maybe the biggest challenge, is just understanding and

80
00:04:03.319 --> 00:04:06.680
<v Speaker 1>prepping your data. Is that where something like exploratory data

81
00:04:06.719 --> 00:04:10.080
<v Speaker 1>analysis, EDA, comes in? Is it indispensable?

82
00:04:10.360 --> 00:04:15.159
<v Speaker 2>Absolutely, one hundred percent. Getting reliable insights starts with good EDA.

83
00:04:15.560 --> 00:04:18.120
<v Speaker 2>And it's not just about making pretty charts, though that helps.

84
00:04:18.439 --> 00:04:22.600
<v Speaker 2>It's really about systematically spotting and fixing these pathologies before

85
00:04:22.600 --> 00:04:26.360
<v Speaker 2>they wreck your model downstream. We use tools like histograms,

86
00:04:26.439 --> 00:04:29.199
<v Speaker 2>box plots, things like that to actually see how the

87
00:04:29.279 --> 00:04:33.040
<v Speaker 2>data is distributed. You know, spot things like heavy tails,

88
00:04:33.199 --> 00:04:37.000
<v Speaker 2>extreme values in prices, maybe, which can seriously skew your results.

89
00:04:37.079 --> 00:04:39.199
<v Speaker 1>Okay, when you find those extremes, like maybe a house

90
00:04:39.240 --> 00:04:41.720
<v Speaker 1>listed for ten billion dollars by mistake, do you just

91
00:04:41.759 --> 00:04:45.480
<v Speaker 1>delete it? You mentioned winsorizing. How does that work

92
00:04:45.519 --> 00:04:48.439
<v Speaker 1>and why is it often better than just tossing the

93
00:04:48.519 --> 00:04:50.279
<v Speaker 1>data point or letting it mess everything up?

94
00:04:50.480 --> 00:04:54.480
<v Speaker 2>Right, good question. Winsorizing basically means if a value is

95
00:04:54.600 --> 00:04:57.040
<v Speaker 2>way out there, maybe beyond the top one percent

96
00:04:57.120 --> 00:04:59.279
<v Speaker 2>or bottom one percent of your data range, you just

97
00:04:59.360 --> 00:05:01.800
<v Speaker 2>cap it, replace it with the value at that one

98
00:05:01.839 --> 00:05:03.720
<v Speaker 2>percent or ninety-nine percent mark. So you keep the

99
00:05:03.759 --> 00:05:07.839
<v Speaker 2>data point, but you prevent that extreme outlier, maybe a

100
00:05:08.040 --> 00:05:11.560
<v Speaker 2>data entry error, from having this huge disproportionate influence on

101
00:05:11.600 --> 00:05:15.600
<v Speaker 2>your model. Similarly, for those categorical features with tons of

102
00:05:15.680 --> 00:05:19.160
<v Speaker 2>unique labels, what we call high-cardinality features, we can aggregate

103
00:05:19.199 --> 00:05:22.160
<v Speaker 2>the really rare categories, group all those one-off values

104
00:05:22.160 --> 00:05:24.879
<v Speaker 2>into a single 'other' category. This simplifies things for the

105
00:05:24.920 --> 00:05:27.959
<v Speaker 2>model. And honestly, the unsung hero that makes a lot

106
00:05:27.959 --> 00:05:30.680
<v Speaker 2>of this data wrangling possible is the pandas DataFrame

107
00:05:30.920 --> 00:05:34.879
<v Speaker 2>in Python. It's just incredibly flexible and efficient for managing

108
00:05:34.920 --> 00:05:36.680
<v Speaker 2>and manipulating tabular data.
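
NOTE
Editor's sketch, not spoken in the episode: a minimal illustration of the winsorizing and rare-category grouping just described. It uses only the Python standard library rather than pandas so it stays self-contained. The helper names (winsorize, group_rare), the nearest-rank percentile rule, and the sample values are all assumptions made for illustration.
```python
from collections import Counter
def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Cap values at the lower/upper percentiles (simple nearest-rank rule)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    # Every row is kept; only the extremes are pulled in to the caps.
    return [min(max(v, lo), hi) for v in values]
def group_rare(labels, min_count=2, other="other"):
    """Replace labels seen fewer than min_count times with one shared bucket."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]
# Hypothetical data: 99 plausible prices plus one data-entry error.
prices = list(range(1, 100)) + [10_000_000_000]
capped = winsorize(prices)
makes = ["Toyota", "Toyota", "Honda", "Honda", "Lada", "Tesla"]
grouped = group_rare(makes)
```
With a pandas DataFrame the same idea is usually written with quantile and clip on a column, and with value_counts for the rare labels.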

109
00:05:36.759 --> 00:05:39.600
<v Speaker 1>Okay, pandas, got it. So let's shift gears a bit.

110
00:05:39.639 --> 00:05:43.160
<v Speaker 1>There's this ongoing debate in the data science world, right

111
00:05:43.199 --> 00:05:46.399
<v Speaker 1>when you're tackling these tabular data problems, what's better: classical

112
00:05:46.439 --> 00:05:48.920
<v Speaker 1>machine learning techniques or deep learning? Maybe we can unpack

113
00:05:48.920 --> 00:05:51.519
<v Speaker 1>this using that Airbnb example you mentioned, predicting listing prices

114
00:05:51.560 --> 00:05:52.199
<v Speaker 1>in New York City.

115
00:05:52.399 --> 00:05:54.439
<v Speaker 2>Yeah, that's a great way to look at it. We

116
00:05:54.480 --> 00:05:57.959
<v Speaker 2>can compare these two approaches. Let's say classical ML, represented

117
00:05:58.000 --> 00:06:01.560
<v Speaker 2>by XGBoost, a popular choice, and deep learning,

118
00:06:01.560 --> 00:06:05.639
<v Speaker 2>maybe using Keras, across a few key things. First, simplicity.

119
00:06:06.160 --> 00:06:09.399
<v Speaker 2>In that Airbnb case study, using XGBoost often meant

120
00:06:09.480 --> 00:06:12.600
<v Speaker 2>much simpler code to define and train the model, sometimes

121
00:06:12.639 --> 00:06:15.800
<v Speaker 2>literally just one line, following the standard scikit-learn pattern

122
00:06:15.879 --> 00:06:19.360
<v Speaker 2>lots of people know. Keras for deep learning usually needed

123
00:06:19.439 --> 00:06:21.519
<v Speaker 2>quite a few more lines, especially for defining all the

124
00:06:21.519 --> 00:06:24.519
<v Speaker 2>network layers and setting up things like efficient training callbacks.

125
00:06:24.600 --> 00:06:26.959
<v Speaker 1>Right, that definitely matters for day-to-day work. But

126
00:06:27.160 --> 00:06:30.360
<v Speaker 1>simplicity aside, what about understanding why the model makes a prediction?

127
00:06:30.879 --> 00:06:33.879
<v Speaker 1>You know, transparency and explainability. How do they stack up there?

128
00:06:34.079 --> 00:06:37.600
<v Speaker 2>That's a huge point. Classical models like decision trees, which

129
00:06:37.600 --> 00:06:40.480
<v Speaker 2>are kind of the building blocks for XGBoost, can

130
00:06:40.519 --> 00:06:44.560
<v Speaker 2>often be visualized or explained. You could, for example, show

131
00:06:44.600 --> 00:06:47.560
<v Speaker 2>a non specialist how a simple decision tree predicts how

132
00:06:47.600 --> 00:06:50.240
<v Speaker 2>long a property might stay on the market step by step.

133
00:06:50.720 --> 00:06:54.759
<v Speaker 2>Deep neural networks, well, they often rely on these analogies

134
00:06:54.800 --> 00:06:57.680
<v Speaker 2>to biological neurons, which are frankly a bit controversial and

135
00:06:57.759 --> 00:07:01.199
<v Speaker 2>don't really clarify how the model arrives at its decision internally.

136
00:07:01.279 --> 00:07:01.839
<v Speaker 2>It's more of a...

137
00:07:01.759 --> 00:07:04.480
<v Speaker 1>Black box. The black box problem.

138
00:07:04.160 --> 00:07:08.480
<v Speaker 2>Exactly, and related to that is feature importance: what actually

139
00:07:08.560 --> 00:07:11.839
<v Speaker 2>drove the prediction. XGBoost has built-in methods that

140
00:07:11.920 --> 00:07:14.959
<v Speaker 2>easily tell you which features had the biggest impact, like

141
00:07:15.000 --> 00:07:17.839
<v Speaker 2>for the Airbnb prices, room type might pop up as

142
00:07:17.839 --> 00:07:21.279
<v Speaker 2>the most important factor. Deep learning frameworks like Keras, they

143
00:07:21.279 --> 00:07:24.160
<v Speaker 2>don't usually have that built right in. You need external tools,

144
00:07:24.279 --> 00:07:26.600
<v Speaker 2>often more complex ones, to try and get that same

145
00:07:26.680 --> 00:07:27.360
<v Speaker 2>kind of insight.

146
00:07:27.480 --> 00:07:30.399
<v Speaker 1>So it sounds like for understanding and explaining, classical methods

147
00:07:30.399 --> 00:07:31.279
<v Speaker 1>often have an edge.

148
00:07:31.439 --> 00:07:33.879
<v Speaker 2>Often, yes. And if we look at the bigger picture,

149
00:07:33.959 --> 00:07:37.439
<v Speaker 2>like research trends, the amount of research specifically on deep

150
00:07:37.519 --> 00:07:40.360
<v Speaker 2>learning for tabular data is actually just a tiny fraction

151
00:07:40.439 --> 00:07:42.759
<v Speaker 2>of all deep learning research being published. There's just no

152
00:07:43.920 --> 00:07:47.279
<v Speaker 2>unambiguous winner yet in terms of raw predictive power on

153
00:07:47.360 --> 00:07:50.360
<v Speaker 2>tables. For tabular data, the jury is definitely still out.

154
00:07:50.800 --> 00:07:53.480
<v Speaker 1>Okay, that's really interesting. Why do you think that is?

155
00:07:53.480 --> 00:07:56.959
<v Speaker 1>Why hasn't deep learning just dominated here like it has elsewhere?

156
00:07:57.319 --> 00:08:00.519
<v Speaker 2>Well, one theory is that tabular data often already presents

157
00:08:00.560 --> 00:08:05.399
<v Speaker 2>features in a highly structured, kind of interpretable way, you know, price, location,

158
00:08:05.639 --> 00:08:09.680
<v Speaker 2>number of bedrooms. Deep learning's real superpower is often extracting

159
00:08:09.759 --> 00:08:13.959
<v Speaker 2>hierarchical features from raw unstructured stuff, like finding edges and

160
00:08:14.120 --> 00:08:17.480
<v Speaker 2>shapes, then objects in an image, or grammar patterns in text.

161
00:08:18.120 --> 00:08:21.959
<v Speaker 2>But with tabular data, that powerful automatic feature extraction might

162
00:08:22.000 --> 00:08:24.879
<v Speaker 2>not be the huge advantage it is elsewhere. The features

163
00:08:24.879 --> 00:08:27.759
<v Speaker 2>are often pretty meaningful already. In fact, sometimes deep learning

164
00:08:27.839 --> 00:08:31.040
<v Speaker 2>might even pick up on spurious correlations in tables, essentially

165
00:08:31.040 --> 00:08:34.279
<v Speaker 2>finding patterns in noise because it's so powerful at pattern finding.

166
00:08:34.639 --> 00:08:37.240
<v Speaker 1>Gotcha, So it might be too powerful in a way

167
00:08:37.519 --> 00:08:41.320
<v Speaker 1>for this kind of data sometimes. So if deep learning

168
00:08:41.360 --> 00:08:44.480
<v Speaker 1>isn't the clear reigning champ here, what is generally considered,

169
00:08:44.519 --> 00:08:47.000
<v Speaker 1>you know, state of the art for most tabular data

170
00:08:47.039 --> 00:08:48.120
<v Speaker 1>problems right now?

171
00:08:48.279 --> 00:08:51.639
<v Speaker 2>Right now, that title really belongs to gradient boosting decision

172
00:08:51.720 --> 00:08:56.080
<v Speaker 2>trees, or GBDTs. These models have really become the workhorses

173
00:08:56.159 --> 00:08:59.039
<v Speaker 2>for tabular data tasks.

174
00:08:59.360 --> 00:09:01.240
<v Speaker 1>GBDTs. Okay, how do they actually work? How do they get

175
00:09:01.279 --> 00:09:02.480
<v Speaker 1>such good predictions?

176
00:09:02.759 --> 00:09:05.639
<v Speaker 2>They're a really cool example of what's called an ensemble method,

177
00:09:05.720 --> 00:09:09.080
<v Speaker 2>basically getting multiple models to work together. But unlike some

178
00:09:09.080 --> 00:09:12.559
<v Speaker 2>other ensemble methods like random forests, where models are built independently,

179
00:09:13.159 --> 00:09:16.679
<v Speaker 2>GBDTs build models sequentially. Think of it like building the

180
00:09:16.720 --> 00:09:19.879
<v Speaker 2>prediction piece by piece, almost like a chain. Each new

181
00:09:19.879 --> 00:09:22.159
<v Speaker 2>tree model tries to correct the errors made by the

182
00:09:22.159 --> 00:09:25.320
<v Speaker 2>previous ones. So it's this iterative process of improvement. It

183
00:09:25.399 --> 00:09:27.600
<v Speaker 2>learns from the mistakes of the models that came before

184
00:09:27.600 --> 00:09:28.440
<v Speaker 2>it in the sequence.
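
NOTE
Editor's sketch, not spoken in the episode: a toy version of the sequential idea described above, where each new tree is fit to the errors left by the ones before it. It uses one-split trees (stumps) on a single feature with squared error, which is a large simplification of what XGBoost or LightGBM do. The function names, learning rate, and data are assumptions for illustration only.
```python
def fit_stump(xs, residuals):
    """Pick the threshold whose two-sided means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm
def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each trained on the current residual errors."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps, preds, history = [], [base] * len(ys), []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    predict = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    return predict, history
# Hypothetical data with a jump between x=4 and x=5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.8, 5.1]
predict, history = boost(xs, ys)
```
The history list holds the training error after each round, so it shows the piece-by-piece refinement the speakers describe: it never goes up from one round to the next.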

185
00:09:28.559 --> 00:09:31.159
<v Speaker 1>Ah. Okay, So it's like a team where each member

186
00:09:31.399 --> 00:09:32.679
<v Speaker 1>learns from the last one's attempt.

187
00:09:32.519 --> 00:09:37.000
<v Speaker 2>Precisely. Not just averaging independent guesses, but actively refining

188
00:09:37.000 --> 00:09:37.519
<v Speaker 2>the prediction.

189
00:09:37.960 --> 00:09:41.200
<v Speaker 1>And you mentioned two big names leading the GBDT charge,

190
00:09:42.000 --> 00:09:46.039
<v Speaker 1>XGBoost and LightGBM. They got famous through competitions, right?

191
00:09:45.559 --> 00:09:48.799
<v Speaker 2>That's right. They really gained prominence by winning or

192
00:09:48.840 --> 00:09:53.000
<v Speaker 2>performing incredibly well in data science competitions like Kaggle's Higgs

193
00:09:53.039 --> 00:09:55.919
<v Speaker 2>Boson Machine Learning Challenge years ago. That really put them on the map.

194
00:09:55.840 --> 00:09:57.879
<v Speaker 1>So what makes them so good? Is

195
00:09:57.919 --> 00:09:59.879
<v Speaker 1>it just the boosting idea, or is there more to it?

196
00:10:00.039 --> 00:10:02.519
<v Speaker 2>There's definitely more to it. They achieve their speed and

197
00:10:02.559 --> 00:10:06.960
<v Speaker 2>accuracy through some really clever technical innovations. XGBoost, for instance,

198
00:10:07.039 --> 00:10:09.159
<v Speaker 2>uses smart ways to find the best splits in the

199
00:10:09.240 --> 00:10:13.440
<v Speaker 2>data very quickly, like histogram splitting and a unique weighted

200
00:10:13.519 --> 00:10:18.159
<v Speaker 2>quantile sketch. LightGBM uses techniques like leaf-wise tree growth.

201
00:10:18.559 --> 00:10:21.480
<v Speaker 2>Instead of building the tree level by level symmetrically, it

202
00:10:21.600 --> 00:10:24.320
<v Speaker 2>focuses its effort on the nodes, the leaves, where it

203
00:10:24.320 --> 00:10:26.519
<v Speaker 2>can reduce the error the most. This can lead to

204
00:10:26.600 --> 00:10:30.399
<v Speaker 2>faster training and smaller trees. LightGBM also uses smart

205
00:10:30.440 --> 00:10:34.200
<v Speaker 2>sampling, like gradient-based one-side sampling, or GOSS, to

206
00:10:34.240 --> 00:10:36.480
<v Speaker 2>focus on the data points that are harder to predict,

207
00:10:36.799 --> 00:10:40.440
<v Speaker 2>and exclusive feature bundling, EFB, to kind of group sparse

208
00:10:40.440 --> 00:10:41.639
<v Speaker 2>features together efficiently.

209
00:10:41.879 --> 00:10:44.759
<v Speaker 1>Wow, okay, that sounds pretty sophisticated under the hood.

210
00:10:44.879 --> 00:10:47.960
<v Speaker 2>It is. Think of LightGBM like a really efficient data assistant.

211
00:10:48.320 --> 00:10:51.679
<v Speaker 2>It knows exactly where the most important information is likely

212
00:10:51.759 --> 00:10:55.480
<v Speaker 2>to be and how to summarize things without losing crucial details.

213
00:10:55.519 --> 00:10:56.840
<v Speaker 2>That makes it fast.

214
00:10:57.080 --> 00:10:59.360
<v Speaker 1>And you mentioned something earlier that really caught my attention.

215
00:11:00.080 --> 00:11:04.120
<v Speaker 1>They handle missing data automatically. That sounds almost too good

216
00:11:04.120 --> 00:11:06.320
<v Speaker 1>to be true. How does that work? Does it mean

217
00:11:06.360 --> 00:11:08.679
<v Speaker 1>we could just be a bit lazier with cleaning our

218
00:11:08.720 --> 00:11:09.639
<v Speaker 1>data if we use these?

219
00:11:09.480 --> 00:11:12.000
<v Speaker 2>Huh, well, it is a huge advantage. It's not

220
00:11:12.039 --> 00:11:15.120
<v Speaker 2>really about being lazy, though. Both XGBoost and

221
00:11:15.159 --> 00:11:18.600
<v Speaker 2>LightGBM have this built-in capability where, at each split

222
00:11:18.639 --> 00:11:21.320
<v Speaker 2>point in a tree, they learn which direction, left or

223
00:11:21.360 --> 00:11:24.960
<v Speaker 2>right branch, missing values should go to minimize the overall

224
00:11:25.080 --> 00:11:28.480
<v Speaker 2>error, the loss function. So the model itself learns the

225
00:11:28.519 --> 00:11:31.279
<v Speaker 2>best way to handle those gaps based on the data patterns.
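
NOTE
Editor's sketch, not spoken in the episode: a simplified picture of learning a default direction for missing values at one split. It tries sending the missing rows left, then right, and keeps whichever gives the lower squared error. The real libraries score candidate splits with gradient statistics rather than this plain squared error, and the function name, threshold, and data here are assumptions for illustration.
```python
def best_missing_direction(values, targets, threshold):
    """Return 'left' or 'right': the branch where missing rows hurt the loss least."""
    def sse(group):
        if not group:
            return 0.0
        mean = sum(group) / len(group)
        return sum((t - mean) ** 2 for t in group)
    left = [t for v, t in zip(values, targets) if v is not None and v <= threshold]
    right = [t for v, t in zip(values, targets) if v is not None and v > threshold]
    missing = [t for v, t in zip(values, targets) if v is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"
# Hypothetical listings: a missing review score tends to go with a low price.
review_scores = [4.8, 4.9, None, 2.1, None, 2.5]
prices = [200.0, 210.0, 90.0, 100.0, 95.0, 110.0]
direction = best_missing_direction(review_scores, prices, threshold=3.0)
```
In this toy data the listings without a review score are priced like the low-score ones, so the missing rows are routed to the left branch.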

226
00:11:31.559 --> 00:11:34.639
<v Speaker 2>It's quite robust. And another key practical thing they both

227
00:11:34.720 --> 00:11:38.240
<v Speaker 2>do is early stopping. They watch performance on a separate

228
00:11:38.279 --> 00:11:42.240
<v Speaker 2>validation data set during training, and if the performance stops

229
00:11:42.240 --> 00:11:45.080
<v Speaker 2>improving for a certain number of rounds, they just stop training.

230
00:11:45.720 --> 00:11:48.559
<v Speaker 2>This is crucial to prevent overfitting, making sure the model

231
00:11:48.600 --> 00:11:51.159
<v Speaker 2>works well on new data, not just the data it

232
00:11:51.200 --> 00:11:51.720
<v Speaker 2>was trained on.
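
NOTE
Editor's sketch, not spoken in the episode: the early-stopping rule described above, reduced to its bookkeeping. Given the validation loss after each boosting round, it stops once the loss has failed to improve for a set number of rounds. The function name, the patience value, and the loss numbers are assumptions for illustration; the libraries expose this behaviour through their own training options.
```python
def early_stopping_round(val_losses, patience=3):
    """Return (best_round, stopped_at) for a sequence of validation losses."""
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return best_round, i
    return best_round, len(val_losses) - 1
# Hypothetical validation losses, one per boosting round.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49, 0.48]
best_round, stopped_at = early_stopping_round(losses, patience=3)
```
Note the trade-off the patience setting controls: here training stops at round 6 and keeps round 3, so the later, slightly better rounds are never reached.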

233
00:11:51.840 --> 00:11:54.000
<v Speaker 1>Okay, that makes sense, prevents it from just memorizing the

234
00:11:54.039 --> 00:11:57.000
<v Speaker 1>training set. So we have these two powerhouses, XGBoost

235
00:11:57.000 --> 00:11:59.200
<v Speaker 1>and LightGBM. How do you actually choose between them

236
00:11:59.240 --> 00:12:00.000
<v Speaker 1>for a specific problem?

237
00:12:00.679 --> 00:12:03.519
<v Speaker 2>Yeah, that's a common question. Based on the sources we

238
00:12:03.559 --> 00:12:05.720
<v Speaker 2>looked at and general experience in the field, there are

239
00:12:05.720 --> 00:12:09.320
<v Speaker 2>some general guidelines. LightGBM often tends to perform better

240
00:12:09.440 --> 00:12:11.720
<v Speaker 2>or at least train faster, when you have really large

241
00:12:11.720 --> 00:12:15.039
<v Speaker 2>amounts of data. Its leaf-wise growth is very efficient then,

242
00:12:15.519 --> 00:12:18.200
<v Speaker 2>but that same leaf-wise growth can sometimes cause it to

243
00:12:18.240 --> 00:12:21.960
<v Speaker 2>overfit a bit more easily on smaller data sets. XGBoost,

244
00:12:21.960 --> 00:12:24.639
<v Speaker 2>on the other hand, is often considered slightly more robust,

245
00:12:24.879 --> 00:12:28.039
<v Speaker 2>maybe builds more stable models, especially on smaller data samples.

246
00:12:28.639 --> 00:12:32.360
<v Speaker 2>Speed-wise, LightGBM is typically faster on CPUs, but

247
00:12:32.559 --> 00:12:35.519
<v Speaker 2>XGBoost is often seen as more scalable for distributed

248
00:12:35.519 --> 00:12:40.000
<v Speaker 2>computing and has had perhaps slightly more mature GPU support historically,

249
00:12:40.360 --> 00:12:42.559
<v Speaker 2>though LightGBM is catching up fast there too.

250
00:12:42.919 --> 00:12:44.600
<v Speaker 1>Okay, so it depends on the scale of your data,

251
00:12:44.639 --> 00:12:47.639
<v Speaker 1>maybe your hardware. Interesting trade offs. So it seems like

252
00:12:47.679 --> 00:12:51.480
<v Speaker 1>gradient boosting is incredibly powerful for tables. But deep learning

253
00:12:51.480 --> 00:12:54.360
<v Speaker 1>isn't completely out of the picture, right? And stepping back,

254
00:12:54.440 --> 00:12:57.360
<v Speaker 1>getting any model, boosting or deep learning, actually working in

255
00:12:57.399 --> 00:13:00.159
<v Speaker 1>the real world that involves a lot more than just

256
00:13:00.240 --> 00:13:01.200
<v Speaker 1>hitting train, doesn't it?

257
00:13:01.480 --> 00:13:05.200
<v Speaker 2>Oh, absolutely, far more. And yes, deep learning still has

258
00:13:05.200 --> 00:13:10.080
<v Speaker 2>a role. While classical ML, especially GBDTs, often performs very

259
00:13:10.120 --> 00:13:14.039
<v Speaker 2>competitively or even better on many tabular tasks, frameworks like

260
00:13:14.159 --> 00:13:17.960
<v Speaker 2>Keras, built on TensorFlow, and fastai, which is built on PyTorch,

261
00:13:18.039 --> 00:13:22.840
<v Speaker 2>are definitely making inroads. They often incorporate sophisticated preprocessing layers

262
00:13:22.919 --> 00:13:26.039
<v Speaker 2>right into the deep learning model itself, handling data transformations

263
00:13:26.039 --> 00:13:28.000
<v Speaker 2>efficiently within the network architecture.

264
00:13:28.120 --> 00:13:31.200
<v Speaker 1>Right. And once you've picked your approach, say XGBoost

265
00:13:31.279 --> 00:13:33.720
<v Speaker 1>or a Keras model, you need to tune it, right?

266
00:13:33.840 --> 00:13:37.039
<v Speaker 1>Make it perform its best. That's where hyperparameter optimization

267
00:13:37.120 --> 00:13:39.919
<v Speaker 1>comes in, finding those perfect settings.

268
00:13:39.919 --> 00:13:42.440
<v Speaker 2>Exactly. You need to find the ideal settings, the hyperparameters,

269
00:13:42.440 --> 00:13:45.200
<v Speaker 2>for your specific model and data, and there are several

270
00:13:45.240 --> 00:13:47.759
<v Speaker 2>ways to do that. There's the classic grid search, which

271
00:13:47.799 --> 00:13:52.279
<v Speaker 2>is exhaustive. It literally tries every single combination of parameter

272
00:13:52.399 --> 00:13:54.799
<v Speaker 2>values you give it, like trying every key on a

273
00:13:54.840 --> 00:13:59.639
<v Speaker 2>giant keychain. Then there's random search. You just randomly sample combinations. Surprisingly,

274
00:13:59.759 --> 00:14:01.919
<v Speaker 2>this often works just as well or even better than

275
00:14:01.960 --> 00:14:05.080
<v Speaker 2>grid search, especially if only a few hyperparameters really matter.

276
00:14:05.120 --> 00:14:06.399
<v Speaker 2>It's often much more efficient.
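
NOTE
Editor's sketch, not spoken in the episode: one way to see why random search can beat grid search when only a few hyperparameters matter. With the same budget of nine trials, a 3-by-3 grid tests only three distinct values of the important setting, while random sampling tests nine. The value ranges, the seed, and the two-setting setup are assumptions for illustration.
```python
import itertools
import random
grid_values = [0.01, 0.1, 1.0]
# Each trial is (important_setting, unimportant_setting).
grid_trials = list(itertools.product(grid_values, repeat=2))
rng = random.Random(42)
random_trials = [
    (rng.uniform(0.01, 1.0), rng.uniform(0.01, 1.0))
    for _ in range(len(grid_trials))
]
# Count how many different values of the important setting each method tried.
distinct_grid = len({important for important, _ in grid_trials})
distinct_random = len({important for important, _ in random_trials})
```
Both methods run nine trials, but the grid repeats each value of the important setting three times while varying only the setting that does not matter.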

277
00:14:06.799 --> 00:14:10.799
<v Speaker 1>Randomly trying things works better? That seems counterintuitive.

278
00:14:10.720 --> 00:14:13.159
<v Speaker 2>It does, but imagine you have ten settings, but only two

279
00:14:13.320 --> 00:14:16.879
<v Speaker 2>really impact performance. Grid search spends most of its time

280
00:14:16.919 --> 00:14:20.080
<v Speaker 2>trying useless combinations of the other eight. Random search has

281
00:14:20.120 --> 00:14:22.279
<v Speaker 2>a better chance of hitting good values for the important

282
00:14:22.320 --> 00:14:26.159
<v Speaker 2>two much faster. Then you have smarter methods. Successive

283
00:14:26.240 --> 00:14:29.080
<v Speaker 2>halving is like running a tournament. You start many models

284
00:14:29.120 --> 00:14:32.360
<v Speaker 2>with few resources, quickly discard the bad ones and give

285
00:14:32.440 --> 00:14:36.080
<v Speaker 2>more resources to the promising candidates. And then there's Bayesian

286
00:14:36.080 --> 00:14:40.159
<v Speaker 2>optimization using tools like Optuna. This is really clever. It

287
00:14:40.200 --> 00:14:43.000
<v Speaker 2>builds a statistical model of how the hyperparameters seem

288
00:14:43.039 --> 00:14:46.679
<v Speaker 2>to affect performance, and uses that model to intelligently decide

289
00:14:46.679 --> 00:14:50.120
<v Speaker 2>which combinations to try next. It's an informed search, much

290
00:14:50.159 --> 00:14:52.919
<v Speaker 2>more efficient than just randomly guessing or trying everything.
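
NOTE
Editor's sketch, not spoken in the episode: the successive-halving tournament described above, in a few lines. Many randomly sampled configurations start with a small budget, the worse half is discarded each round, and the survivors get a doubled budget. The toy_loss function is a hypothetical stand-in for real validation loss, and the names, the halving factor, and the sampled range are assumptions for illustration.
```python
import random
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while multiplying the budget."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Lower loss is better, so the best configs sort to the front.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = scored[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], budget
def toy_loss(config, budget):
    """Hypothetical validation loss after `budget` rounds of training."""
    return (config["learning_rate"] - 0.1) ** 2 + 1.0 / budget
rng = random.Random(0)
candidates = [{"learning_rate": rng.uniform(0.001, 1.0)} for _ in range(16)]
best, final_budget = successive_halving(candidates, toy_loss)
```
Sixteen candidates shrink to eight, four, two, then one, so most of the compute goes to the configurations that looked promising early.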

291
00:14:53.039 --> 00:14:57.000
<v Speaker 1>Okay, Bayesian optimization sounds powerful. So you've trained your model,

292
00:14:57.039 --> 00:14:59.720
<v Speaker 1>you've tuned it. Now the really hard part: getting it

293
00:14:59.720 --> 00:15:02.559
<v Speaker 1>out of the lab and actually used. This is where

294
00:15:02.600 --> 00:15:05.080
<v Speaker 1>MLOps, machine learning operations, becomes essential, right?

295
00:15:04.799 --> 00:15:08.679
<v Speaker 2>Absolutely critical. MLOps is huge. And why is it

296
00:15:08.720 --> 00:15:11.440
<v Speaker 2>so crucial? Well, first, just running your trained model on

297
00:15:11.519 --> 00:15:15.519
<v Speaker 2>some new unseen data points before deploying is vital. This

298
00:15:15.559 --> 00:15:19.440
<v Speaker 2>helps detect things like data leakage. That's when somehow information

299
00:15:19.480 --> 00:15:21.879
<v Speaker 2>from the future or even from the target variable you're

300
00:15:21.879 --> 00:15:24.799
<v Speaker 2>trying to predict, accidentally sneaks into your training data. This

301
00:15:24.879 --> 00:15:27.240
<v Speaker 2>makes your model look amazing during development, but then it

302
00:15:27.279 --> 00:15:30.440
<v Speaker 2>completely fails in the real world because that leaked information

303
00:15:30.519 --> 00:15:34.879
<v Speaker 2>isn't available. Then MLUPS practices help catch this ah.

304
00:15:34.559 --> 00:15:38.360
<v Speaker 1>The dreaded data leakage, like predicting stock prices using tomorrow's

305
00:15:38.399 --> 00:15:40.320
<v Speaker 1>closing price somehow.</v> <v Speaker 2>Precisely.</v>
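
NOTE
Illustrative sketch, not from the speakers: one common guard against the "tomorrow's closing price" kind of leakage is to split time-ordered rows chronologically, so the training set never contains anything dated on or after the evaluation window. The rows, the field names, and the helper chronological_split are hypothetical and the prices are made up.
```python
from datetime import date
def chronological_split(rows, cutoff):
    """Return (train, test): rows dated strictly before cutoff, and rows on or after it."""
    ordered = sorted(rows, key=lambda r: r["day"])
    train = [r for r in ordered if r["day"] < cutoff]
    test = [r for r in ordered if r["day"] >= cutoff]
    return train, test
# Hypothetical daily closing prices (made-up numbers, deliberately out of order).
rows = [
    {"day": date(2024, 1, 3), "close": 101.5},
    {"day": date(2024, 1, 1), "close": 100.0},
    {"day": date(2024, 1, 5), "close": 103.0},
    {"day": date(2024, 1, 2), "close": 100.8},
    {"day": date(2024, 1, 4), "close": 102.2},
]
train, test = chronological_split(rows, cutoff=date(2024, 1, 4))
# Sanity check: every training row is older than every test row.
assert max(r["day"] for r in train) < min(r["day"] for r in test)
```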

306
00:15:40.759 --> 00:15:43.879
<v Speaker 2>MLOps also helps you validate the model's actual performance in

307
00:15:43.919 --> 00:15:47.399
<v Speaker 2>a scenario that mimics production. We saw examples of maybe

308
00:15:47.440 --> 00:15:50.159
<v Speaker 2>doing a basic web deployment with something simple like Flask,

309
00:15:50.240 --> 00:15:53.480
<v Speaker 2>which is great for demos, but for real-world, reliable

310
00:15:53.480 --> 00:15:56.960
<v Speaker 2>applications you almost always need the robustness and scalability of

311
00:15:56.960 --> 00:16:01.720
<v Speaker 2>public cloud platforms like Google Cloud, AWS, or Azure. These clouds offer

312
00:16:01.759 --> 00:16:06.279
<v Speaker 2>comprehensive MLOps environments. They handle things like model monitoring, tracking

313
00:16:06.320 --> 00:16:09.159
<v Speaker 2>accuracy over time to see if it degrades. They ensure

314
00:16:09.200 --> 00:16:12.440
<v Speaker 2>resiliency and uptime so your service stays available, and they

315
00:16:12.440 --> 00:16:14.360
<v Speaker 2>support sophisticated ML pipelines.

316
00:16:14.639 --> 00:16:16.879
<v Speaker 1>Okay, tell me more about the ML pipeline. That sounds

317
00:16:16.879 --> 00:16:18.960
<v Speaker 1>like the real engine behind MLOps. What does it

318
00:16:19.000 --> 00:16:19.759
<v Speaker 1>actually automate?

319
00:16:20.120 --> 00:16:23.039
<v Speaker 2>It really is a game changer. An ML pipeline is

320
00:16:23.120 --> 00:16:26.559
<v Speaker 2>essentially a coded, automated workflow. It takes you all the

321
00:16:26.600 --> 00:16:29.200
<v Speaker 2>way from the raw input data right through to a deployed,

322
00:16:29.320 --> 00:16:32.879
<v Speaker 2>monitored model. It automates the data cleanup, the feature engineering,

323
00:16:32.919 --> 00:16:36.039
<v Speaker 2>the model training, the evaluation, the tuning, the deployment, the

324
00:16:36.080 --> 00:16:39.799
<v Speaker 2>whole nine yards. This ensures consistency. Every time you run

325
00:16:39.840 --> 00:16:42.120
<v Speaker 2>the pipeline, you get the same steps applied in the

326
00:16:42.120 --> 00:16:46.240
<v Speaker 2>same way. It ensures repeatability, and this is absolutely essential

327
00:16:46.240 --> 00:16:49.519
<v Speaker 2>in dynamic environments like real estate pricing, where the market

328
00:16:49.639 --> 00:16:52.600
<v Speaker 2>data changes constantly and you need to retrain and update

329
00:16:52.639 --> 00:16:54.519
<v Speaker 2>your models frequently and reliably.
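
NOTE
Illustrative sketch, not from the speakers: the "same steps applied in the same way" idea can be pictured as an ordered list of functions, each feeding its output to the next. Real MLOps pipelines on cloud platforms do far more (deployment, monitoring); the step names, the toy "model", and the listing data here are hypothetical and made up.
```python
from statistics import mean
def clean(rows):
    """Data cleanup: drop rows with a missing price."""
    return [r for r in rows if r["price"] is not None]
def add_features(rows):
    """Feature engineering: add price per square meter (a hypothetical feature)."""
    return [{**r, "price_per_m2": r["price"] / r["area_m2"]} for r in rows]
def train(rows):
    """Toy 'training': the model is just the average price per square meter."""
    return {"avg_price_per_m2": mean(r["price_per_m2"] for r in rows)}
def run_pipeline(raw_rows, steps):
    """Apply every step in order, passing each result to the next step."""
    result = raw_rows
    for step in steps:
        result = step(result)
    return result
STEPS = [clean, add_features, train]
# Made-up real estate listings; one has a missing price.
raw = [
    {"price": 300000, "area_m2": 100},
    {"price": None, "area_m2": 80},
    {"price": 200000, "area_m2": 50},
]
model = run_pipeline(raw, STEPS)
# Re-running on the same input repeats the same steps and gives the same result.
rerun = run_pipeline(raw, STEPS)
```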

330
00:16:54.720 --> 00:16:58.279
<v Speaker 1>That makes total sense. Automation and consistency are key for

331
00:16:58.399 --> 00:17:02.480
<v Speaker 1>anything real world. So, thinking about everything we've discussed, if

332
00:17:02.480 --> 00:17:06.240
<v Speaker 1>classical ML like GBDTs is strong and deep learning has

333
00:17:06.240 --> 00:17:09.039
<v Speaker 1>its place, it makes you wonder: can you actually combine

334
00:17:09.079 --> 00:17:10.480
<v Speaker 1>them, get the best of both worlds?

335
00:17:10.720 --> 00:17:13.880
<v Speaker 2>That's exactly what some of the most interesting recent work explores,

336
00:17:14.240 --> 00:17:17.000
<v Speaker 2>and the answer seems to be a definite yes. Going

337
00:17:17.039 --> 00:17:19.880
<v Speaker 2>back to that Tokyo Airbnb pricing problem mentioned in the

338
00:17:19.880 --> 00:17:23.920
<v Speaker 2>source material, they actually tried blending the predictions. They took

339
00:17:23.920 --> 00:17:27.359
<v Speaker 2>an optimized XGBoost model and a fine-tuned deep

340
00:17:27.440 --> 00:17:31.640
<v Speaker 2>learning model using fastai, and they found the best results,

341
00:17:31.759 --> 00:17:34.480
<v Speaker 2>the lowest prediction error, came from a fifty-fifty

342
00:17:34.599 --> 00:17:37.240
<v Speaker 2>ensemble, just averaging the predictions of the two models.
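
NOTE
Illustrative sketch, not from the speakers: a fifty-fifty ensemble is just the element-wise average of two models' predictions. The numbers below are made up and chosen so that the two models' errors partly cancel; they are not the Airbnb results discussed in the episode, and the helper names are hypothetical.
```python
from math import sqrt
def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction lists; weight_a=0.5 is the fifty-fifty blend."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]
# Made-up prices and predictions, for illustration only.
y_true = [100.0, 150.0, 200.0]
gbdt_pred = [110.0, 140.0, 210.0]
dl_pred = [94.0, 158.0, 196.0]
blended = blend(gbdt_pred, dl_pred)
gbdt_error = rmse(y_true, gbdt_pred)
dl_error = rmse(y_true, dl_pred)
blend_error = rmse(y_true, blended)
```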

343
00:17:37.240 --> 00:17:39.920
<v Speaker 1>Wow, a fifty-fifty split was optimal? Not leaning

344
00:17:39.920 --> 00:17:41.920
<v Speaker 1>more heavily on one or the other?

345
00:17:41.839 --> 00:17:45.079
<v Speaker 2>In that specific case, yes. It really challenges that narrative you

346
00:17:45.160 --> 00:17:47.960
<v Speaker 2>sometimes hear that deep learning is all you need for

347
00:17:48.039 --> 00:17:51.400
<v Speaker 2>tabular data. It seems that's often not true. Combining the

348
00:17:51.440 --> 00:17:55.400
<v Speaker 2>strengths of GBDTs, maybe their robustness with structured features and explainability,

349
00:17:55.799 --> 00:17:59.079
<v Speaker 2>with the potential pattern finding power of deep learning, that

350
00:17:59.160 --> 00:18:02.480
<v Speaker 2>collaborative approach is what yielded the best results. It really reinforces

351
00:18:02.519 --> 00:18:05.440
<v Speaker 2>that core idea, doesn't it? That knowledge is most valuable

352
00:18:05.440 --> 00:18:08.200
<v Speaker 2>when you understand it and can apply it creatively, and

353
00:18:08.240 --> 00:18:12.759
<v Speaker 2>that considering multiple perspectives, multiple approaches, usually leads to a richer,

354
00:18:12.839 --> 00:18:13.640
<v Speaker 2>better outcome.

355
00:18:13.839 --> 00:18:16.839
<v Speaker 1>Absolutely, a blend often works best. So what a journey.

356
00:18:17.240 --> 00:18:19.359
<v Speaker 1>You've just taken a deep dive with us into this, well,

357
00:18:19.400 --> 00:18:23.000
<v Speaker 1>surprisingly complex, but absolutely vital world of machine learning for tabular data.

358
00:18:23.640 --> 00:18:26.480
<v Speaker 1>We've gone from dealing with messy spreadsheets and weird data

359
00:18:26.559 --> 00:18:30.720
<v Speaker 1>quirks to pitting these powerful algorithms like XGBoost and

360
00:18:30.759 --> 00:18:33.440
<v Speaker 1>deep learning against each other and seeing how we actually

361
00:18:33.440 --> 00:18:36.880
<v Speaker 1>bring them to life with MLOps. It really makes you think, though,

362
00:18:37.440 --> 00:18:40.759
<v Speaker 1>if even these incredibly sophisticated machine learning models can get

363
00:18:40.799 --> 00:18:44.160
<v Speaker 1>tripped up by something as simple as a misspelling of Toyota,

364
00:18:44.720 --> 00:18:48.319
<v Speaker 1>or by subtle dependencies between rows, or that missing value

365
00:18:48.359 --> 00:18:51.279
<v Speaker 1>that actually means something, what does that really tell us

366
00:18:51.319 --> 00:18:55.000
<v Speaker 1>about the fundamental importance of truly understanding your data, getting

367
00:18:55.000 --> 00:18:58.200
<v Speaker 1>your hands dirty with it, exploring it, cleaning it before

368
00:18:58.240 --> 00:19:00.839
<v Speaker 1>you even think about pressing train? Maybe a thought worth

369
00:19:00.920 --> 00:19:03.880
<v Speaker 1>mulling over. We really hope this deep dive helps you

370
00:19:03.920 --> 00:19:06.799
<v Speaker 1>be even more informed and maybe more curious about the

371
00:19:06.839 --> 00:19:08.519
<v Speaker 1>data that powers so much of our world.
