WEBVTT 1 00:00:00.280 --> 00:00:02.839 Imagine a world running on data that looks, well, kind 2 00:00:02.839 --> 00:00:06.679 of like a spreadsheet. We're talking banks, insurance, retail, government. 3 00:00:06.879 --> 00:00:09.119 This isn't the really flashy stuff like AI making pictures 4 00:00:09.240 --> 00:00:13.839 or music. This is the absolute bedrock of our digital economy. 5 00:00:14.240 --> 00:00:16.800 So today we're doing a deep dive into this fascinating 6 00:00:16.879 --> 00:00:21.399 and honestly often overlooked world of machine learning for tabular data. 7 00:00:21.920 --> 00:00:24.079 Our goal here is to cut through some of the noise, 8 00:00:24.280 --> 00:00:27.079 give you a shortcut to understanding what tabular data actually is, 9 00:00:27.480 --> 00:00:30.440 why it matters so much, and how we apply the, 10 00:00:30.480 --> 00:00:33.240 well, the most powerful ML techniques to it. We'll touch 11 00:00:33.280 --> 00:00:36.960 on everything from cleaning up messy data, to pitting classic ML 12 00:00:37.039 --> 00:00:40.399 against deep learning, and even deploying these models out into 13 00:00:40.399 --> 00:00:42.880 the real world. You might find some surprising facts, maybe 14 00:00:42.880 --> 00:00:45.479 a few aha moments. Okay, let's get into it. So 15 00:00:45.520 --> 00:00:48.759 let's start right at the beginning. What exactly is tabular data? 16 00:00:48.759 --> 00:00:51.280 Think of just a simple table, maybe one listing 17 00:00:51.320 --> 00:00:54.359 currencies for different countries. Each row, that's an observation, like 18 00:00:54.399 --> 00:00:57.159 all the details for Australia's currency, and then each column 19 00:00:57.200 --> 00:00:59.840 is a feature, things like currency name or units per u. 20 00:01:00.600 --> 00:01:03.479 Pretty straightforward, right? Yeah, exactly. It's a format everyone gets, 21 00:01:03.479 --> 00:01:05.920 which is why it's so fundamental.
And while it might 22 00:01:05.959 --> 00:01:08.719 only be, say, ten percent or so of all the 23 00:01:08.760 --> 00:01:12.200 digital data out there by some estimates, that ten percent 24 00:01:12.280 --> 00:01:16.000 is absolutely critical. Its structure, that rows-and-columns 25 00:01:16.079 --> 00:01:19.920 setup, makes it super easy to input, retrieve, manage, analyze. 26 00:01:20.040 --> 00:01:24.920 It really is the lifeblood for countless businesses: spreadsheets, huge databases, 27 00:01:24.959 --> 00:01:25.439 you name it. 28 00:01:25.599 --> 00:01:27.599 Okay, so here's something that's always kind of puzzled me. 29 00:01:28.079 --> 00:01:31.840 If it's so fundamental, why haven't deep learning models completely 30 00:01:31.879 --> 00:01:33.799 taken over this space? I mean, the way they have 31 00:01:33.920 --> 00:01:37.760 for images or audio or text. What's the key difference there? 32 00:01:37.840 --> 00:01:39.879 That is a really great question. The main thing is 33 00:01:39.879 --> 00:01:43.359 that tabular data has that typical matrix shape: rows, columns, 34 00:01:43.560 --> 00:01:47.400 pretty distinct. It's not like unstructured data, audio waves, pixels, text, 35 00:01:47.400 --> 00:01:50.319 which is much more, well, unordered and varied. And because 36 00:01:50.359 --> 00:01:52.719 of that unique structure, tabular data comes with its own 37 00:01:52.760 --> 00:01:56.439 set of, let's call them pathologies, common problems you absolutely 38 00:01:56.439 --> 00:01:59.519 have to fix before you can do any serious analysis. Pathologies? 39 00:01:59.560 --> 00:02:01.359 Okay, what kind of problems are we talking about? 40 00:02:01.519 --> 00:02:05.120 Well, first off, you often find constant or quasi-constant 41 00:02:05.120 --> 00:02:08.759 columns, features that just don't change much or at all. 42 00:02:09.360 --> 00:02:12.240 They offer almost no information to a model.
Then there 43 00:02:12.280 --> 00:02:15.560 are duplicated and highly collinear features, so information that's 44 00:02:15.599 --> 00:02:18.680 either just copied or it's so similar it's basically saying 45 00:02:18.719 --> 00:02:21.879 the same thing twice. With linear models especially, this can 46 00:02:21.919 --> 00:02:25.120 cause real conceptual misunderstandings. Makes it hard to figure out 47 00:02:25.159 --> 00:02:26.960 what's actually driving a prediction. 48 00:02:26.919 --> 00:02:29.759 Right, like having two columns that both measure basically the 49 00:02:29.759 --> 00:02:30.719 same temperature scale. 50 00:02:30.759 --> 00:02:34.000 Yeah, redundant, exactly. Then you have irrelevant features, stuff that 51 00:02:34.120 --> 00:02:37.800 just doesn't help predict what you want to predict, and 52 00:02:38.159 --> 00:02:41.319 the big one, missing data. This is crucial because some 53 00:02:41.599 --> 00:02:44.719 ML algorithms just flat out won't run if there are gaps, 54 00:02:45.000 --> 00:02:47.919 and these gaps aren't always random. Sometimes data is missing 55 00:02:47.960 --> 00:02:51.360 completely at random, sometimes just at random, or sometimes it's 56 00:02:51.360 --> 00:02:53.960 missing not at random. Think about it. A missing review 57 00:02:54.039 --> 00:02:57.319 score might actually mean there are no reviews, which is 58 00:02:57.439 --> 00:02:58.719 itself information. 59 00:02:58.400 --> 00:03:00.599 All right, that's a subtle but important distinction. 60 00:03:00.759 --> 00:03:05.199 Definitely. We also deal with rare categories, features with tons 61 00:03:05.280 --> 00:03:08.159 of unique values, or values that show up super infrequently, 62 00:03:08.639 --> 00:03:12.400 hard for models to learn from those, and my personal favorite, 63 00:03:12.479 --> 00:03:15.599 just plain errors in the data. You know, misspellings, like 64 00:03:15.639 --> 00:03:19.719 a mistyped variant of Toyota instead of Toyota.
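A minimal sketch of how three of the pathologies just listed could be flagged, using only the Python standard library. The column names, the toy values, and the threshold are all hypothetical, chosen purely for illustration; they do not come from the episode.

```python
from collections import Counter
from math import sqrt


def quasi_constant_columns(table, threshold=0.95):
    """Return names of columns whose most common value covers >= threshold of rows."""
    flagged = []
    for name, values in table.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= threshold:
            flagged.append(name)
    return flagged


def pearson(xs, ys):
    """Pearson correlation; values near +1 or -1 suggest collinear features."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def missing_indicator(values):
    """Keep 'was this missing?' as its own 0/1 feature, since absence can carry information."""
    return [1 if v is None else 0 for v in values]


# Hypothetical toy table: one nearly constant column, plus the same
# temperature recorded on two scales (perfectly collinear).
table = {
    "country": ["US"] * 19 + ["CA"],
    "temp_c": [float(i) for i in range(20)],
    "temp_f": [i * 9 / 5 + 32 for i in range(20)],
}
review_score = [4.5, None, 3.0, None]  # None stands for "no reviews yet"

print(quasi_constant_columns(table, threshold=0.9))
print(pearson(table["temp_c"], table["temp_f"]))
print(missing_indicator(review_score))
```

In practice the same checks would usually be run on a pandas DataFrame; plain dictionaries are used here only to keep the sketch dependency-free.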
Hmm, this isn't just a cosmetic typo. 65 00:03:19.800 --> 00:03:22.879 It splits what should be one category into multiple noisy ones. 66 00:03:22.960 --> 00:03:24.159 Really confuses the model. 67 00:03:24.240 --> 00:03:25.240 It sounds like a minefield. 68 00:03:25.520 --> 00:03:27.879 It can be. The key insight here is, like, a 69 00:03:27.919 --> 00:03:30.560 slightly blurry pixel in an image might just make it 70 00:03:30.639 --> 00:03:35.120 less clear, but a single misspelled category or an unhandled 71 00:03:35.120 --> 00:03:38.280 missing value in a table, that can fundamentally mislead your 72 00:03:38.280 --> 00:03:42.319 model, force it to make decisions based on completely wrong information. 73 00:03:42.719 --> 00:03:44.520 It's like having a great map, but with a few 74 00:03:44.599 --> 00:03:47.080 key cities just randomly renamed. You can't navigate. 75 00:03:47.319 --> 00:03:52.000 Huh, the bane of every data scientist's existence. It's like 76 00:03:52.039 --> 00:03:54.039 trying to find Waldo, but half the time his name is 77 00:03:54.080 --> 00:03:57.599 misspelled, completely messing up your search algorithm. So it really 78 00:03:57.639 --> 00:04:00.240 sounds like, forget the fancy algorithms for a second, the real 79 00:04:00.280 --> 00:04:03.319 hard work, maybe the biggest challenge, is just understanding and 80 00:04:03.319 --> 00:04:06.680 prepping your data. Is that where something like exploratory data 81 00:04:06.719 --> 00:04:10.080 analysis, EDA, comes in? Is it indispensable? 82 00:04:10.360 --> 00:04:15.159 Absolutely, one hundred percent. Getting reliable insights starts with good EDA. 83 00:04:15.560 --> 00:04:18.120 And it's not just about making pretty charts, though that helps. 84 00:04:18.439 --> 00:04:22.600 It's really about systematically spotting and fixing these pathologies before 85 00:04:22.600 --> 00:04:26.360 they wreck your model downstream.
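One way the "misspelled category splits into several noisy ones" problem could be handled is fuzzy matching against a list of known-good labels. This is only a sketch: the canonical list, the misspelled inputs, and the 0.8 similarity cutoff are hypothetical, and `difflib` from the standard library is a swapped-in choice, not something the speakers name.

```python
from difflib import get_close_matches

# Hypothetical list of known-good category labels.
CANONICAL_MAKES = ["Toyota", "Honda", "Ford"]


def normalize_category(raw, canonical=CANONICAL_MAKES, cutoff=0.8):
    """Map a raw label to its closest canonical label, or keep it if nothing is close."""
    cleaned = raw.strip().title()  # fix stray whitespace and casing first
    matches = get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


# Made-up dirty values: three variants collapse into one category,
# while an unknown make is left alone instead of being forced into a bucket.
for raw in ["Toyotta", " toyota", "TOYOTA", "Tesla"]:
    print(repr(raw), "->", normalize_category(raw))
```

A cutoff that is too low will merge categories that really are different, so the mapping is worth reviewing by hand before it is applied to a full column.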
We use tools like histograms, 86 00:04:26.439 --> 00:04:29.199 box plots, things like that to actually see how the 87 00:04:29.279 --> 00:04:33.040 data is distributed. You know, spot things like heavy tails, 88 00:04:33.199 --> 00:04:37.000 extreme values in prices maybe, which can seriously skew your results. 89 00:04:37.079 --> 00:04:39.199 Okay, when you find those extremes, like maybe a house 90 00:04:39.240 --> 00:04:41.720 listed for ten billion dollars by mistake, you know, do you just 91 00:04:41.759 --> 00:04:45.480 delete it? You mentioned winsorizing. How does that work, 92 00:04:45.519 --> 00:04:48.439 and why is it often better than just tossing the 93 00:04:48.519 --> 00:04:50.279 data point or letting it mess everything up? 94 00:04:50.480 --> 00:04:54.480 Right, good question. Winsorizing basically means if a value is 95 00:04:54.600 --> 00:04:57.040 way out there, maybe beyond the top one percent 96 00:04:57.120 --> 00:04:59.279 or bottom one percent of your data range, you just 97 00:04:59.360 --> 00:05:01.800 cap it, replace it with the value at that one 98 00:05:01.839 --> 00:05:03.720 percent or ninety-nine percent mark. So you keep the 99 00:05:03.759 --> 00:05:07.839 data point, but you prevent that extreme outlier, maybe a 100 00:05:08.040 --> 00:05:11.560 data entry error, from having this huge disproportionate influence on 101 00:05:11.600 --> 00:05:15.600 your model. Similarly, for those categorical features with tons of 102 00:05:15.680 --> 00:05:19.160 unique labels, what we call high-cardinality features, we can aggregate 103 00:05:19.199 --> 00:05:22.160 the really rare categories, group all those one-off values 104 00:05:22.160 --> 00:05:24.879 into a single "other" category. This simplifies things for the 105 00:05:24.920 --> 00:05:27.959 model. And honestly, the unsung hero that makes a lot 106 00:05:27.959 --> 00:05:30.680 of this data wrangling possible is the pandas DataFrame 107 00:05:30.920 --> 00:05:34.879 in Python.
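The two clean-up steps described here, winsorizing and grouping rare categories, can be sketched in a few lines of plain Python. The nearest-rank percentile rule, the function names, and the toy values are assumptions made for illustration; a pandas-based version would more likely lean on `quantile`, `clip`, and `value_counts`.

```python
from collections import Counter


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, with q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[idx]


def winsorize(values, lower_q=1, upper_q=99):
    """Cap values below the lower percentile or above the upper one; no rows are dropped."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, lower_q), percentile(ordered, upper_q)
    return [min(max(v, lo), hi) for v in values]


def group_rare(labels, min_count=2, other="Other"):
    """Replace labels seen fewer than min_count times with a single catch-all label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


# Hypothetical prices with one data-entry error (a ten-billion-dollar listing).
prices = list(range(100, 200)) + [10_000_000_000]
capped = winsorize(prices)
print(max(prices), "->", max(capped))

# Hypothetical high-cardinality column with one-off values.
print(group_rare(["NYC", "NYC", "LA", "SF"]))
```

The row count is unchanged after winsorizing, which is the point made above: the observation stays in the data set, but its extreme value can no longer dominate the fit.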
It's just incredibly flexible and efficient for managing 108 00:05:34.920 --> 00:05:36.680 and manipulating tabular data. 109 00:05:36.759 --> 00:05:39.600 Okay, pandas, got it. So let's shift gears a bit. 110 00:05:39.639 --> 00:05:43.160 There's this ongoing debate in the data science world, right? 111 00:05:43.199 --> 00:05:46.399 When you're tackling these tabular data problems, what's better, classical 112 00:05:46.439 --> 00:05:48.920 machine learning techniques or deep learning? Maybe we can unpack 113 00:05:48.920 --> 00:05:51.519 this using that Airbnb example you mentioned, predicting listing prices 114 00:05:51.560 --> 00:05:52.199 in New York City. 115 00:05:52.399 --> 00:05:54.439 Yeah, that's a great way to look at it. We 116 00:05:54.480 --> 00:05:57.959 can compare these two approaches, let's say classical ML represented 117 00:05:58.000 --> 00:06:01.560 by XGBoost, a popular choice, and deep learning, maybe 118 00:06:01.560 --> 00:06:05.639 using Keras, across a few key things. First, simplicity. 119 00:06:06.160 --> 00:06:09.399 In that Airbnb case study, using XGBoost often meant 120 00:06:09.480 --> 00:06:12.600 much simpler code to define and train the model, sometimes 121 00:06:12.639 --> 00:06:15.800 literally just one line following the standard scikit-learn pattern 122 00:06:15.879 --> 00:06:19.360 lots of people know. Keras for deep learning usually needed 123 00:06:19.439 --> 00:06:21.519 quite a few more lines, especially for defining all the 124 00:06:21.519 --> 00:06:24.519 network layers and setting up things like efficient training callbacks. 125 00:06:24.600 --> 00:06:26.959 Right, that definitely matters for day-to-day work. But 126 00:06:27.160 --> 00:06:30.360 simplicity aside, what about understanding why the model makes a prediction? 127 00:06:30.879 --> 00:06:33.879 You know, transparency and explainability. How do they stack up there? 128 00:06:34.079 --> 00:06:37.600 That's a huge point.
Classical models like decision trees, which 129 00:06:37.600 --> 00:06:40.480 are kind of the building blocks for XGBoost, can 130 00:06:40.519 --> 00:06:44.560 often be visualized or explained. You could, for example, show 131 00:06:44.600 --> 00:06:47.560 a non-specialist how a simple decision tree predicts how 132 00:06:47.600 --> 00:06:50.240 long a property might stay on the market, step by step. 133 00:06:50.720 --> 00:06:54.759 Deep neural networks, well, they often rely on these analogies 134 00:06:54.800 --> 00:06:57.680 to biological neurons, which are frankly a bit controversial and 135 00:06:57.759 --> 00:07:01.199 don't really clarify how the model arrives at its decision internally. 136 00:07:01.279 --> 00:07:01.839 It's more of a 137 00:07:01.759 --> 00:07:04.480 black box. The black box problem. 138 00:07:04.160 --> 00:07:08.480 Exactly, and related to that is feature importance, what actually 139 00:07:08.560 --> 00:07:11.839 drove the prediction. XGBoost has built-in methods that 140 00:07:11.920 --> 00:07:14.959 easily tell you which features had the biggest impact. Like 141 00:07:15.000 --> 00:07:17.839 for the Airbnb prices, room type might pop up as 142 00:07:17.839 --> 00:07:21.279 the most important factor. Deep learning frameworks like Keras, they 143 00:07:21.279 --> 00:07:24.160 don't usually have that built right in. You need external tools, 144 00:07:24.279 --> 00:07:26.600 often more complex ones, to try and get that same 145 00:07:26.680 --> 00:07:27.360 kind of insight. 146 00:07:27.480 --> 00:07:30.399 So it sounds like for understanding and explaining, classical methods 147 00:07:30.399 --> 00:07:31.279 often have an edge.
148 00:07:31.439 --> 00:07:33.879 Often, yes. And if we look at the bigger picture, 149 00:07:33.959 --> 00:07:37.439 like research trends, the amount of research specifically on deep 150 00:07:37.519 --> 00:07:40.360 learning for tabular data is actually just a tiny fraction 151 00:07:40.439 --> 00:07:42.759 of all deep learning research being published. There's just no 152 00:07:43.920 --> 00:07:47.279 unambiguous winner yet in terms of raw predictive power on 153 00:07:47.360 --> 00:07:50.360 tables. For tabular data, the jury is definitely still out. 154 00:07:50.800 --> 00:07:53.480 Okay, that's really interesting. Why do you think that is? 155 00:07:53.480 --> 00:07:56.959 Why hasn't deep learning just dominated here like it has elsewhere? 156 00:07:57.319 --> 00:08:00.519 Well, one theory is that tabular data often already presents 157 00:08:00.560 --> 00:08:05.399 features in a highly structured, kind of interpretable way, you know, price, location, 158 00:08:05.639 --> 00:08:09.680 number of bedrooms. Deep learning's real superpower is often extracting 159 00:08:09.759 --> 00:08:13.959 hierarchical features from raw unstructured stuff, like finding edges and 160 00:08:14.120 --> 00:08:17.480 shapes, then objects in an image, or grammar patterns in text. 161 00:08:18.120 --> 00:08:21.959 But with tabular data, that powerful automatic feature extraction might 162 00:08:22.000 --> 00:08:24.879 not be the huge advantage it is elsewhere. The features 163 00:08:24.879 --> 00:08:27.759 are often pretty meaningful already. In fact, sometimes deep learning 164 00:08:27.839 --> 00:08:31.040 might even pick up on spurious correlations in tables, essentially 165 00:08:31.040 --> 00:08:34.279 finding patterns in noise because it's so powerful at pattern finding. 166 00:08:34.639 --> 00:08:37.240 Gotcha. So it might be too powerful, in a way, 167 00:08:37.519 --> 00:08:41.320 for this kind of data sometimes.
So if deep learning 168 00:08:41.360 --> 00:08:44.480 isn't the clear reigning champ here, what is generally considered, 169 00:08:44.519 --> 00:08:47.000 you know, state of the art for most tabular data 170 00:08:47.039 --> 00:08:48.120 problems right now? 171 00:08:48.279 --> 00:08:51.639 Right now, that title really belongs to gradient boosting decision 172 00:08:51.720 --> 00:08:56.080 trees, or GBDTs. These models have really become the workhorses 173 00:08:56.159 --> 00:08:59.039 for tabular data tasks. GBDTs. 174 00:08:59.360 --> 00:09:01.240 Okay, how do they actually work? How do they get 175 00:09:01.279 --> 00:09:02.480 such good predictions? 176 00:09:02.759 --> 00:09:05.639 They're a really cool example of what's called an ensemble method, 177 00:09:05.720 --> 00:09:09.080 basically getting multiple models to work together. But unlike some 178 00:09:09.080 --> 00:09:12.559 other ensemble methods like random forests, where models are built independently, 179 00:09:13.159 --> 00:09:16.679 GBDTs build models sequentially. Think of it like building the 180 00:09:16.720 --> 00:09:19.879 prediction piece by piece, almost like a chain. Each new 181 00:09:19.879 --> 00:09:22.159 tree model tries to correct the errors made by the 182 00:09:22.159 --> 00:09:25.320 previous ones. So it's this iterative process of improvement. It 183 00:09:25.399 --> 00:09:27.600 learns from the mistakes of the models that came before 184 00:09:27.600 --> 00:09:28.440 it in the sequence. 185 00:09:28.559 --> 00:09:31.159 Ah, okay. So it's like a team where each member 186 00:09:31.399 --> 00:09:32.679 learns from the last one's 187 00:09:32.519 --> 00:09:37.000 attempt. Precisely, not just averaging independent guesses, but actively refining 188 00:09:37.000 --> 00:09:37.519 the prediction. 189 00:09:37.960 --> 00:09:41.200 And you mentioned two big names leading the GBDT charge, 190 00:09:42.000 --> 00:09:46.039 XGBoost and LightGBM. They got famous through competitions.
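The "each new tree corrects the errors of the previous ones" idea can be shown with a from-scratch toy: gradient boosting for squared error, where every round fits a one-split tree (a stump) to the current residuals. This is a teaching sketch only. It is not how XGBoost or LightGBM are implemented, and the data, the learning rate, and all function names are made up.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up one-feature regression problem.
xs = list(range(10))
ys = [x * x for x in xs]
for rounds in (0, 1, 10, 50):
    print(rounds, "rounds -> training MSE", mse(fit_boosting(xs, ys, n_rounds=rounds), xs, ys))
```

Contrast this with a random forest, where each tree would be fit to the original targets independently and the results averaged; here every round depends on the running prediction left by the rounds before it.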
191 00:09:45.559 --> 00:09:48.799 Right, that's right. They really gained prominence by winning or 192 00:09:48.840 --> 00:09:53.000 performing incredibly well in data science competitions like Kaggle's Higgs 193 00:09:53.039 --> 00:09:55.919 Boson Machine Learning Challenge years ago. That really put them 194 00:09:55.840 --> 00:09:57.879 on the map. So what makes them so good? Is 195 00:09:57.919 --> 00:09:59.879 it just the boosting idea, or is there more to it? 196 00:10:00.039 --> 00:10:02.519 There's definitely more to it. They achieve their speed and 197 00:10:02.559 --> 00:10:06.960 accuracy through some really clever technical innovations. XGBoost, for instance, 198 00:10:07.039 --> 00:10:09.159 uses smart ways to find the best splits in the 199 00:10:09.240 --> 00:10:13.440 data very quickly, like histogram splitting and a unique weighted 200 00:10:13.519 --> 00:10:18.159 quantile sketch. LightGBM uses techniques like leaf-wise tree growth. 201 00:10:18.559 --> 00:10:21.480 Instead of building the tree level by level symmetrically, it 202 00:10:21.600 --> 00:10:24.320 focuses its effort on the nodes, the leaves, where it 203 00:10:24.320 --> 00:10:26.519 can reduce the error the most. This can lead to 204 00:10:26.600 --> 00:10:30.399 faster training and smaller trees. LightGBM also uses smart 205 00:10:30.440 --> 00:10:34.200 sampling, like gradient-based one-side sampling, or GOSS, to 206 00:10:34.240 --> 00:10:36.480 focus on the data points that are harder to predict, 207 00:10:36.799 --> 00:10:40.440 and exclusive feature bundling, EFB, to kind of group sparse 208 00:10:40.440 --> 00:10:41.639 features together efficiently. 209 00:10:41.879 --> 00:10:44.759 Wow, okay, that sounds pretty sophisticated under the hood. It is. 210 00:10:44.879 --> 00:10:47.960 Think of LightGBM like a really efficient data assistant.
211 00:10:48.320 --> 00:10:51.679 It knows exactly where the most important information is likely 212 00:10:51.759 --> 00:10:55.480 to be and how to summarize things without losing crucial details. 213 00:10:55.519 --> 00:10:56.840 That makes it fast. 214 00:10:57.080 --> 00:10:59.360 And you mentioned something earlier that really caught my attention. 215 00:11:00.080 --> 00:11:04.120 They handle missing data automatically. That sounds almost too good 216 00:11:04.120 --> 00:11:06.320 to be true. How does that work? Does it mean 217 00:11:06.360 --> 00:11:08.679 we could just be a bit lazier with cleaning our 218 00:11:08.720 --> 00:11:09.639 data if we use 219 00:11:09.480 --> 00:11:12.000 these? Huh, well, it is a huge advantage. It's not 220 00:11:12.039 --> 00:11:15.120 really about being lazy, though. Both XGBoost and 221 00:11:15.159 --> 00:11:18.600 LightGBM have this built-in capability where, at each split 222 00:11:18.639 --> 00:11:21.320 point in a tree, they learn which direction, left or 223 00:11:21.360 --> 00:11:24.960 right branch, missing values should go to minimize the overall 224 00:11:25.080 --> 00:11:28.480 error, the loss function. So the model itself learns the 225 00:11:28.519 --> 00:11:31.279 best way to handle those gaps based on the data patterns. 226 00:11:31.559 --> 00:11:34.639 It's quite robust. And another key practical thing they both 227 00:11:34.720 --> 00:11:38.240 do is early stopping. They watch performance on a separate 228 00:11:38.279 --> 00:11:42.240 validation data set during training, and if the performance stops 229 00:11:42.240 --> 00:11:45.080 improving for a certain number of rounds, they just stop training. 230 00:11:45.720 --> 00:11:48.559 This is crucial to prevent overfitting, making sure the model 231 00:11:48.600 --> 00:11:51.159 works well on new data, not just the data it 232 00:11:51.200 --> 00:11:51.720 was trained on.
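The "learn which branch missing values should take" idea can be illustrated for a single split. The sketch below tries both options, sending the rows with a missing feature left or right, and keeps whichever gives the lower squared error. The real libraries score splits with gradient statistics rather than raw squared error, so treat this as a simplified stand-in; the feature values, targets, and threshold are invented.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(xs, ys, threshold):
    """Decide whether rows with a missing x should follow the left or the right branch."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]

    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# Made-up example: review score (None = no reviews yet) versus nightly price.
# The listings without reviews are priced like the low-score listings.
review_scores = [1.0, 2.0, None, 4.0, 5.0, None]
prices = [10, 12, 11, 30, 32, 9]

direction, loss = best_default_direction(review_scores, prices, threshold=3.0)
print(direction, loss)
```

Because the direction is chosen from the training data, a gap that is "missing not at random" is routed to the branch whose rows it most resembles, which is why these models can exploit the fact that a value is absent.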
233 00:11:51.840 --> 00:11:54.000 Okay, that makes sense, prevents it from just memorizing the 234 00:11:54.039 --> 00:11:57.000 training set. So we have these two powerhouses, XGBoost 235 00:11:57.000 --> 00:11:59.200 and LightGBM. How do you actually choose between them 236 00:11:59.240 --> 00:12:00.000 for a specific project? 237 00:12:00.679 --> 00:12:03.519 Yeah, that's a common question. Based on the sources we 238 00:12:03.559 --> 00:12:05.720 looked at and general experience in the field, there are 239 00:12:05.720 --> 00:12:09.320 some general guidelines. LightGBM often tends to perform better, 240 00:12:09.440 --> 00:12:11.720 or at least train faster, when you have really large 241 00:12:11.720 --> 00:12:15.039 amounts of data. Its leaf-wise growth is very efficient then, 242 00:12:15.519 --> 00:12:18.200 but that same leaf-wise growth can sometimes cause it to 243 00:12:18.240 --> 00:12:21.960 overfit a bit more easily on smaller data sets. XGBoost, 244 00:12:21.960 --> 00:12:24.639 on the other hand, is often considered slightly more robust, 245 00:12:24.879 --> 00:12:28.039 maybe builds more stable models, especially on smaller data samples. 246 00:12:28.639 --> 00:12:32.360 Speed-wise, LightGBM is typically faster on CPUs, but 247 00:12:32.559 --> 00:12:35.519 XGBoost is often seen as more scalable for distributed 248 00:12:35.519 --> 00:12:40.000 computing and has had perhaps slightly more mature GPU support historically, 249 00:12:40.360 --> 00:12:42.559 though LightGBM is catching up fast there too. 250 00:12:42.919 --> 00:12:44.600 Okay, so it depends on the scale of your data, 251 00:12:44.639 --> 00:12:47.639 maybe your hardware. Interesting trade-offs. So it seems like 252 00:12:47.679 --> 00:12:51.480 gradient boosting is incredibly powerful for tables.
But deep learning 253 00:12:51.480 --> 00:12:54.360 isn't completely out of the picture, right? And stepping back, 254 00:12:54.440 --> 00:12:57.360 getting any model, boosting or deep learning, actually working in 255 00:12:57.399 --> 00:13:00.159 the real world, that involves a lot more than just 256 00:13:00.240 --> 00:13:01.200 hitting train, doesn't it? 257 00:13:01.480 --> 00:13:05.200 Oh, absolutely, far more. And yes, deep learning still has 258 00:13:05.200 --> 00:13:10.080 a role. While classical ML, especially GBDTs, often performs very 259 00:13:10.120 --> 00:13:14.039 competitively or even better on many tabular tasks, frameworks like 260 00:13:14.159 --> 00:13:17.960 Keras, built on TensorFlow, and fastai, which is built on PyTorch, 261 00:13:18.039 --> 00:13:22.840 are definitely making inroads. They often incorporate sophisticated preprocessing layers 262 00:13:22.919 --> 00:13:26.039 right into the deep learning model itself, handling data transformations 263 00:13:26.039 --> 00:13:28.000 efficiently within the network architecture. 264 00:13:28.120 --> 00:13:31.200 Right. And once you've picked your approach, say XGBoost 265 00:13:31.279 --> 00:13:33.720 or a Keras model, you need to tune it, right, 266 00:13:33.840 --> 00:13:37.039 make it perform its best. That's where hyperparameter optimization 267 00:13:37.120 --> 00:13:39.919 comes in, finding those perfect settings. Exactly. 268 00:13:39.919 --> 00:13:42.440 You need to find the ideal settings, the hyperparameters, 269 00:13:42.440 --> 00:13:45.200 for your specific model and data, and there are several 270 00:13:45.240 --> 00:13:47.759 ways to do that. There's the classic grid search, which 271 00:13:47.799 --> 00:13:52.279 is exhaustive. It literally tries every single combination of parameter 272 00:13:52.399 --> 00:13:54.799 values you give it, like trying every key on a 273 00:13:54.840 --> 00:13:59.639 giant keychain. Then there's random search. You just randomly sample combinations.
Surprisingly, 274 00:13:59.759 --> 00:14:01.919 this often works just as well or even better than 275 00:14:01.960 --> 00:14:05.080 grid search, especially if only a few hyperparameters really matter. 276 00:14:05.120 --> 00:14:06.399 It's often much more efficient. 277 00:14:06.799 --> 00:14:10.799 Randomly trying things works better? That seems counterintuitive. It 278 00:14:10.720 --> 00:14:13.159 does, but imagine you have ten settings, but only two 279 00:14:13.320 --> 00:14:16.879 really impact performance. Grid search spends most of its time 280 00:14:16.919 --> 00:14:20.080 trying useless combinations of the other eight. Random search has 281 00:14:20.120 --> 00:14:22.279 a better chance of hitting good values for the important 282 00:14:22.320 --> 00:14:26.159 two much faster. Then you have smarter methods. Successive 283 00:14:26.240 --> 00:14:29.080 halving is like running a tournament. You start many models 284 00:14:29.120 --> 00:14:32.360 with few resources, quickly discard the bad ones, and give 285 00:14:32.440 --> 00:14:36.080 more resources to the promising candidates. And then there's Bayesian 286 00:14:36.080 --> 00:14:40.159 optimization, using tools like Optuna. This is really clever. It 287 00:14:40.200 --> 00:14:43.000 builds a statistical model of how the hyperparameters seem 288 00:14:43.039 --> 00:14:46.679 to affect performance, and uses that model to intelligently decide 289 00:14:46.679 --> 00:14:50.120 which combinations to try next. It's an informed search, much 290 00:14:50.159 --> 00:14:52.919 more efficient than just randomly guessing or trying everything. 291 00:14:53.039 --> 00:14:57.000 Okay, Bayesian optimization sounds powerful. So you've trained your model, 292 00:14:57.039 --> 00:14:59.720 you've tuned it. Now the really hard part, getting it 293 00:14:59.720 --> 00:15:02.559 out of the lab and actually used. This is where 294 00:15:02.600 --> 00:15:05.080 MLOps, machine learning operations, becomes essential.
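The tournament picture of successive halving can be sketched as follows: candidates are drawn by random search, all of them are scored on a small budget, the worse half is dropped, and the budget doubles for the survivors. The `score` function is a hypothetical stand-in for "train with this budget and return validation loss"; in it only `learning_rate` matters, which mirrors the point above that often just a few hyperparameters drive performance.

```python
import random


def sample_config(rng):
    """Random search: draw each hyperparameter independently from its range."""
    return {
        "learning_rate": rng.uniform(0.01, 1.0),
        "max_depth": rng.randint(1, 10),
    }


def score(config, budget):
    """Hypothetical validation loss (lower is better) after training with `budget` rounds.

    Assumption for illustration: the best learning rate is 0.1, max_depth has
    no effect, and a larger budget lowers the loss for every candidate.
    """
    return (config["learning_rate"] - 0.1) ** 2 + 1.0 / budget


def successive_halving(configs, min_budget=1, eta=2):
    """Keep the best 1/eta of the candidates each round, multiplying the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: score(c, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
candidates = [sample_config(rng) for _ in range(16)]
print(successive_halving(candidates))
```

With 16 candidates and `eta=2`, the field shrinks 16, 8, 4, 2, 1, so only the strongest candidates ever receive the larger budgets. Real validation scores are noisy at small budgets, which is the main risk of this approach: a good configuration can be eliminated early.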
295 00:15:04.799 --> 00:15:08.679 Right, absolutely critical. MLOps is huge, and why is it 296 00:15:08.720 --> 00:15:11.440 so crucial? Well, first, just running your trained model on 297 00:15:11.519 --> 00:15:15.519 some new unseen data points before deploying is vital. This 298 00:15:15.559 --> 00:15:19.440 helps detect things like data leakage. That's when somehow information 299 00:15:19.480 --> 00:15:21.879 from the future, or even from the target variable you're 300 00:15:21.879 --> 00:15:24.799 trying to predict, accidentally sneaks into your training data. This 301 00:15:24.879 --> 00:15:27.240 makes your model look amazing during development, but then it 302 00:15:27.279 --> 00:15:30.440 completely fails in the real world because that leaked information 303 00:15:30.519 --> 00:15:34.879 isn't available then. MLOps practices help catch this. Ah, 304 00:15:34.559 --> 00:15:38.360 the dreaded data leakage, like predicting stock prices using tomorrow's 305 00:15:38.399 --> 00:15:40.320 closing price somehow. Precisely. 306 00:15:40.759 --> 00:15:43.879 MLOps also helps you validate the model's actual performance in 307 00:15:43.919 --> 00:15:47.399 a scenario that mimics production. We saw examples of maybe 308 00:15:47.440 --> 00:15:50.159 doing a basic web deployment with something simple like Flask, 309 00:15:50.240 --> 00:15:53.480 which is great for demos, but for real-world, reliable 310 00:15:53.480 --> 00:15:56.960 applications you almost always need the robustness and scalability of 311 00:15:56.960 --> 00:16:01.720 public cloud platforms like Google Cloud, AWS, or Azure. These clouds offer 312 00:16:01.759 --> 00:16:06.279 comprehensive MLOps environments. They handle things like model monitoring, tracking 313 00:16:06.320 --> 00:16:09.159 accuracy over time to see if it degrades. They ensure 314 00:16:09.200 --> 00:16:12.440 resiliency and uptime so your service stays available, and they 315 00:16:12.440 --> 00:16:14.360 support sophisticated ML pipelines.
316 00:16:14.639 --> 00:16:16.879 Okay, tell me more about the ML pipeline. That sounds 317 00:16:16.879 --> 00:16:18.960 like the real engine behind MLOps. What does it 318 00:16:19.000 --> 00:16:19.759 actually automate? 319 00:16:20.120 --> 00:16:23.039 It really is a game changer. An ML pipeline is 320 00:16:23.120 --> 00:16:26.559 essentially a coded, automated workflow. It takes you all the 321 00:16:26.600 --> 00:16:29.200 way from the raw input data right through to a deployed, 322 00:16:29.320 --> 00:16:32.879 monitored model. It automates the data cleanup, the feature engineering, 323 00:16:32.919 --> 00:16:36.039 the model training, the evaluation, the tuning, the deployment, the 324 00:16:36.080 --> 00:16:39.799 whole nine yards. This ensures consistency. Every time you run 325 00:16:39.840 --> 00:16:42.120 the pipeline, you get the same steps applied in the 326 00:16:42.120 --> 00:16:46.240 same way. It ensures repeatability, and this is absolutely essential 327 00:16:46.240 --> 00:16:49.519 in dynamic environments like real estate pricing, where the market 328 00:16:49.639 --> 00:16:52.600 data changes constantly and you need to retrain and update 329 00:16:52.639 --> 00:16:54.519 your models frequently and reliably. 330 00:16:54.720 --> 00:16:58.279 That makes total sense. Automation and consistency are key for 331 00:16:58.399 --> 00:17:02.480 anything real world. So, thinking about everything we've discussed, if 332 00:17:02.480 --> 00:17:06.240 classical ML like GBDTs is strong and deep learning has 333 00:17:06.240 --> 00:17:09.039 its place, it makes you wonder, can you actually combine 334 00:17:09.079 --> 00:17:10.480 them, get the best of both worlds? 335 00:17:10.720 --> 00:17:13.880 That's exactly what some of the most interesting recent work explores, 336 00:17:14.240 --> 00:17:17.000 and the answer seems to be a definite yes.
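The "coded, automated workflow" described above can be pictured as an ordered list of steps that is always executed in the same order. The sketch below is deliberately tiny: the step names, the row fields, and the mean-price "model" are hypothetical placeholders, and a production pipeline on a cloud platform would add tuning, deployment, and monitoring stages.

```python
def clean(state):
    """Drop rows with a missing target."""
    rows = [r for r in state["rows"] if r["price"] is not None]
    return {**state, "rows": rows}


def engineer(state):
    """Add a derived feature to every row."""
    rows = [{**r, "price_per_bedroom": r["price"] / r["bedrooms"]} for r in state["rows"]]
    return {**state, "rows": rows}


def train(state):
    """Placeholder 'model': predict the mean price for every listing."""
    prices = [r["price"] for r in state["rows"]]
    return {**state, "model": sum(prices) / len(prices)}


def evaluate(state):
    """Mean absolute error of the placeholder model on the cleaned rows."""
    errors = [abs(r["price"] - state["model"]) for r in state["rows"]]
    return {**state, "mae": sum(errors) / len(errors)}


PIPELINE = [clean, engineer, train, evaluate]


def run_pipeline(raw_rows, steps=PIPELINE):
    """Apply every step in order; the same input always goes through the same steps."""
    state = {"rows": raw_rows}
    for step in steps:
        state = step(state)
    return state


# Made-up listings; re-running on fresh data reuses exactly the same steps.
listings = [
    {"price": 100, "bedrooms": 1},
    {"price": None, "bedrooms": 2},
    {"price": 300, "bedrooms": 2},
]
result = run_pipeline(listings)
print(result["model"], result["mae"])
```

Because each step returns a new state instead of editing the input, running the pipeline twice on the same raw rows gives the same result, which is the repeatability property the speakers emphasize.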
Going 337 00:17:17.039 --> 00:17:19.880 back to that Tokyo Airbnb pricing problem mentioned in the 338 00:17:19.880 --> 00:17:23.920 source material, they actually tried blending the predictions. They took 339 00:17:23.920 --> 00:17:27.359 an optimized XGBoost model and a fine-tuned deep 340 00:17:27.440 --> 00:17:31.640 learning model using fastai, and they found the best results, 341 00:17:31.759 --> 00:17:34.480 the lowest prediction error, came from a fifty-fifty 342 00:17:34.599 --> 00:17:37.240 ensemble, just averaging the predictions of the two models. 343 00:17:37.240 --> 00:17:39.920 Huh, a fifty-fifty split was optimal, not leaning 344 00:17:39.920 --> 00:17:41.920 more heavily on one or the other? In 345 00:17:41.839 --> 00:17:45.079 that specific case, yes. It really challenges that narrative you 346 00:17:45.160 --> 00:17:47.960 sometimes hear that deep learning is all you need for 347 00:17:48.039 --> 00:17:51.400 tabular data. It seems that's often not true. Combining the 348 00:17:51.440 --> 00:17:55.400 strengths of GBDTs, maybe their robustness with structured features and explainability, 349 00:17:55.799 --> 00:17:59.079 with the potential pattern-finding power of deep learning, that 350 00:17:59.160 --> 00:18:02.480 collaborative approach yielded the best results. It really reinforces 351 00:18:02.519 --> 00:18:05.440 that core idea, doesn't it, that knowledge is most valuable 352 00:18:05.440 --> 00:18:08.200 when you understand it and can apply it creatively, and 353 00:18:08.240 --> 00:18:12.759 that considering multiple perspectives, multiple approaches, usually leads to a richer, 354 00:18:12.839 --> 00:18:13.640 better outcome. 355 00:18:13.839 --> 00:18:16.839 Absolutely, a blend often works best. So what a journey. 356 00:18:17.240 --> 00:18:19.359 You've just taken a deep dive with us into this, well, 357 00:18:19.400 --> 00:18:23.000 surprisingly complex but absolutely vital world of machine learning for tabular data.
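Blending two models' predictions, as described for the pricing example, amounts to a weighted average plus a sweep over candidate weights. The numbers below are invented to show the mechanism: the two prediction lists are constructed so that their errors point in opposite directions, which is the situation where averaging helps most. They are not the results from the case study.

```python
def rmse(preds, targets):
    """Root mean squared error."""
    return (sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)) ** 0.5


def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction lists; weight_a=0.5 is a plain mean."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def best_weight(preds_a, preds_b, targets, steps=10):
    """Try weights 0.0, 0.1, ..., 1.0 and return the one with the lowest RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(blend(preds_a, preds_b, w), targets))


# Made-up hold-out targets and predictions from two hypothetical models.
targets = [100, 200, 300, 400]
gbdt_preds = [110, 190, 310, 390]  # errors: +10, -10, +10, -10
deep_preds = [90, 210, 290, 410]   # errors: -10, +10, -10, +10

print("GBDT alone :", rmse(gbdt_preds, targets))
print("Deep alone :", rmse(deep_preds, targets))
print("50/50 blend:", rmse(blend(gbdt_preds, deep_preds), targets))
print("Best weight:", best_weight(gbdt_preds, deep_preds, targets))
```

The blend weight should be chosen on data that neither model was trained on; otherwise the sweep simply rewards whichever model overfit the training set more.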
358 00:18:23.640 --> 00:18:26.480 We've gone from dealing with messy spreadsheets and weird data 359 00:18:26.559 --> 00:18:30.720 quirks to pitting these powerful algorithms like XGBoost and 360 00:18:30.759 --> 00:18:33.440 deep learning against each other and seeing how we actually 361 00:18:33.440 --> 00:18:36.880 bring them to life with MLOps. It really makes you think, though, 362 00:18:37.440 --> 00:18:40.759 if even these incredibly sophisticated machine learning models can get 363 00:18:40.799 --> 00:18:44.160 tripped up by something as simple as a misspelling of Toyota, 364 00:18:44.720 --> 00:18:48.319 or by subtle dependencies between rows, or that missing value 365 00:18:48.359 --> 00:18:51.279 that actually means something, what does that really tell us 366 00:18:51.319 --> 00:18:55.000 about the fundamental importance of truly understanding your data, getting 367 00:18:55.000 --> 00:18:58.200 your hands dirty with it, exploring it, cleaning it before 368 00:18:58.240 --> 00:19:00.839 you even think about pressing train? Maybe a thought worth 369 00:19:00.920 --> 00:19:03.880 mulling over. We really hope this deep dive helps you 370 00:19:03.920 --> 00:19:06.799 be even more informed and maybe more curious about the 371 00:19:06.839 --> 00:19:08.519 data that powers so much of our world.