WEBVTT 1 00:00:00.160 --> 00:00:03.160 Welcome to the deep dive. You know, we hear so 2 00:00:03.319 --> 00:00:06.120 much about the theory of data science, all the algorithms 3 00:00:06.120 --> 00:00:07.960 and math. But today we're going to try and pull 4 00:00:08.000 --> 00:00:10.119 back the curtain a bit, look at the real world. 5 00:00:10.679 --> 00:00:13.519 And our guide for this is The Practitioner's Guide to 6 00:00:13.599 --> 00:00:17.039 Data Science by Hui Lin and Ming Li. It feels less like 7 00:00:17.079 --> 00:00:21.600 a textbook, honestly, and more like an insider's view of 8 00:00:21.640 --> 00:00:24.359 what it really takes to do data science day to day. 9 00:00:24.679 --> 00:00:26.719 That's exactly right. What really jumped out at me was 10 00:00:26.760 --> 00:00:29.920 how practical it is, you know, how grounded it is. 11 00:00:30.280 --> 00:00:32.320 The authors, they don't just give you the what. They 12 00:00:32.359 --> 00:00:35.399 really dig into the how, things like the soft skills 13 00:00:35.399 --> 00:00:38.159 which often get missed, right, and the whole context of 14 00:00:38.200 --> 00:00:40.880 the big data cloud environment. Yeah, that's huge. It's really 15 00:00:40.880 --> 00:00:44.920 about, well, navigating the messiness of actual data projects. Okay. 16 00:00:44.960 --> 00:00:46.840 Yeah, and it seems like they really push for hands 17 00:00:46.880 --> 00:00:48.799 on learning. I just appreciate that they've got these R 18 00:00:48.840 --> 00:00:51.240 and Python code notebooks all ready to go. You can 19 00:00:51.240 --> 00:00:54.920 grab them on GitHub the links http LA three three 20 00:00:55.079 --> 00:00:57.719 seven CD four's and they basically say, hey, get your 21 00:00:57.759 --> 00:00:59.560 hands dirty, take this code, use your own data to 22 00:00:59.600 --> 00:01:01.200 try it on problems. 
23 00:01:01.200 --> 00:01:04.599 Yeah, make it tangible, and that focus on reproducibility using 24 00:01:04.680 --> 00:01:07.239 things like Google Colab. Yeah, that's so important. Now 25 00:01:07.280 --> 00:01:10.239 it's not just about following steps, it's about giving you 26 00:01:10.280 --> 00:01:13.359 the power to take these techniques and actually build something, 27 00:01:13.760 --> 00:01:16.640 apply them to whatever challenges you're facing. It makes data 28 00:01:16.640 --> 00:01:19.200 science feel like a real tool you can use, not 29 00:01:19.319 --> 00:01:20.280 just concepts. 30 00:01:21.560 --> 00:01:23.120 The book kicks off with a bit of history too, 31 00:01:23.159 --> 00:01:26.879 which I found pretty useful just for context. It traces 32 00:01:26.879 --> 00:01:32.519 things from the early days, like least squares, linear discriminant analysis, 33 00:01:32.959 --> 00:01:35.920 the real foundations, all the way up to how cloud 34 00:01:35.959 --> 00:01:39.319 computing just completely changed the game for data engineering and management. 35 00:01:39.359 --> 00:01:42.159 It really shows how far we've come, and fast. 36 00:01:42.200 --> 00:01:45.200 Oh, definitely. Thinking about that evolution, the cloud, it's just 37 00:01:45.200 --> 00:01:47.680 been massive, a total game changer. Suddenly you have access 38 00:01:47.719 --> 00:01:52.319 to all this computing power, storage. It kind of democratized 39 00:01:52.400 --> 00:01:55.439 working with huge data sets, you know, and that shifted 40 00:01:55.480 --> 00:01:57.760 data engineering. It's less about physical boxes now and more 41 00:01:57.760 --> 00:02:01.079 about orchestrating data pipelines up in the cloud. It's a fundamental shift. 42 00:02:01.280 --> 00:02:04.159 Okay, so the authors then break down data science roles. 
43 00:02:04.280 --> 00:02:09.520 They talk about three main skill tracks: engineering, analysis, and modeling/inference. 44 00:02:10.240 --> 00:02:13.680 So for engineering, it's about building the infrastructure, right, the 45 00:02:13.759 --> 00:02:18.240 data pipelines, automated collection, managing the data itself, the plumbing, 46 00:02:18.319 --> 00:02:19.159 basically, right. 47 00:02:19.599 --> 00:02:22.520 And it's so critical. You need that solid engineering foundation. 48 00:02:22.919 --> 00:02:25.240 Everything else is built on top of it. If your 49 00:02:25.319 --> 00:02:29.919 data infrastructure isn't reliable, well, the analysts and modelers, they 50 00:02:29.960 --> 00:02:33.240 just can't do their jobs properly. It's often the unseen work, 51 00:02:33.439 --> 00:02:35.960 but it's absolutely essential for getting good outcomes. 52 00:02:36.159 --> 00:02:38.360 Then there's the analysis track. This sounds like it's really 53 00:02:38.360 --> 00:02:41.479 about understanding the business side. What's the question, what's the 54 00:02:41.560 --> 00:02:44.879 data telling us, and then translating that business need into 55 00:02:44.919 --> 00:02:48.000 a data problem you can actually solve. The book really 56 00:02:48.080 --> 00:02:50.960 hits hard on domain knowledge and communication skills here. 57 00:02:51.000 --> 00:02:55.120 Exactly. Asking the right questions, understanding the business context is crucial. 58 00:02:55.520 --> 00:02:58.479 The analyst is like a translator, you know, bridging the 59 00:02:58.520 --> 00:03:00.800 gap between the business folks who have the problems and 60 00:03:00.840 --> 00:03:04.759 the data scientists who might have solutions. It's definitely not 61 00:03:04.840 --> 00:03:08.000 just about crunching numbers. It's about insights that lead to 62 00:03:08.120 --> 00:03:09.080 actual decisions. 63 00:03:09.240 --> 00:03:13.280 Okay, and finally modeling/inference. 
This is where we get 64 00:03:13.280 --> 00:03:17.879 into applying all the different learning methods. Supervised learning like 65 00:03:18.759 --> 00:03:23.520 regression and classification for predictions, but also unsupervised learning for 66 00:03:23.680 --> 00:03:26.960 finding patterns, and even causal inference, trying to figure out 67 00:03:27.000 --> 00:03:27.639 cause and effect. 68 00:03:27.759 --> 00:03:30.080 Yeah, and the range of tools here is fascinating. You 69 00:03:30.080 --> 00:03:33.599 can forecast trends, categorize things, try to understand why something 70 00:03:33.639 --> 00:03:36.319 is happening. Each technique gives you a different lens on 71 00:03:36.360 --> 00:03:38.639 the data. A good practitioner knows which tool to pull 72 00:03:38.680 --> 00:03:39.319 out for which 73 00:03:39.199 --> 00:03:41.800 job. Which brings us to the kinds of questions data 74 00:03:41.840 --> 00:03:48.439 science can actually answer: prediction, classification, optimization, like forecasting sales, spotting fraud, 75 00:03:48.560 --> 00:03:51.919 finding efficient routes. But, and I think this is really important, 76 00:03:51.960 --> 00:03:55.919 the book also points out the limitations. It's not magic, right? 77 00:03:56.439 --> 00:04:01.759 Being honest about that builds trust, manages expectations. Sometimes the 78 00:04:01.840 --> 00:04:04.479 data just isn't there, or maybe the problem isn't really 79 00:04:04.520 --> 00:04:07.560 a data problem at all. Knowing what data science can't 80 00:04:07.560 --> 00:04:10.000 do is just as important as knowing what it can. 81 00:04:10.280 --> 00:04:12.680 They also talk a bit about team structure, like should 82 00:04:12.680 --> 00:04:15.479 you build your own team or outsource, and they stress 83 00:04:15.520 --> 00:04:18.720 how vital collaboration is across different departments. It seems like 84 00:04:18.759 --> 00:04:21.920 a data scientist working alone probably won't get very far. 
85 00:04:22.079 --> 00:04:25.480 Oh, absolutely not. Data science is inherently collaborative. It has 86 00:04:25.519 --> 00:04:28.399 to be. You need domain experts to frame the problem right. 87 00:04:28.439 --> 00:04:31.800 You need engineering support, you need buy-in from leaders 88 00:04:31.800 --> 00:04:34.079 so the insights actually get used. It doesn't matter if 89 00:04:34.079 --> 00:04:37.120 the team is internal or external. Those connections are fundamental. 90 00:04:37.399 --> 00:04:39.879 Now, the book introduces this idea I found really neat, 91 00:04:39.879 --> 00:04:43.279 the three pillars of knowledge for a data scientist. First, 92 00:04:43.360 --> 00:04:47.639 the core analytics stuff, stats, machine learning techniques, the tools. Second, 93 00:04:47.720 --> 00:04:52.079 domain knowledge plus collaboration, communication, leadership skills. And the third 94 00:04:52.120 --> 00:04:55.199 pillar, that's big data management and the IT skills for 95 00:04:55.240 --> 00:04:56.399 the modern cloud world. 96 00:04:57.560 --> 00:05:01.240 Those three pillars really capture how multifaceted data science 97 00:05:01.319 --> 00:05:03.480 is now. You can't just be good at one or two. 98 00:05:03.839 --> 00:05:05.920 You really need a solid base in all three to 99 00:05:06.000 --> 00:05:11.120 handle complex real world projects. Think about, say, predicting customer churn. 100 00:05:11.680 --> 00:05:14.399 You need the analytics chops for the model, but you 101 00:05:14.439 --> 00:05:17.399 also need the domain knowledge to know which customer behaviors matter, 102 00:05:18.959 --> 00:05:21.240 and you need the IT skills to actually get and 103 00:05:21.319 --> 00:05:22.839 process all that data from the cloud. 104 00:05:23.000 --> 00:05:25.519 Okay, then the book gets into the actual project cycle. 
105 00:05:25.920 --> 00:05:29.519 It breaks projects down by type, like offline training, offline application; 106 00:05:29.680 --> 00:05:33.680 offline training, online application; online training, online application. It's interesting 107 00:05:33.680 --> 00:05:36.399 how the tech needs and the business value change depending 108 00:05:36.439 --> 00:05:38.120 on whether it's real time or batch. 109 00:05:38.319 --> 00:05:41.160 Yeah, that's a really practical way to categorize them. Knowing 110 00:05:41.240 --> 00:05:44.000 upfront if you need a weekly report versus, say, a 111 00:05:44.079 --> 00:05:47.639 real time recommendation engine on a website, well, that changes everything. 112 00:05:47.680 --> 00:05:50.839 How you get data, how you prep it, model, test, deploy. 113 00:05:51.560 --> 00:05:52.399 It all follows from that. 114 00:05:52.680 --> 00:05:55.519 And the book really hammers home the importance of those 115 00:05:55.560 --> 00:06:00.360 early stages, problem formulation and project planning. They stress using 116 00:06:00.439 --> 00:06:03.480 data in the planning, really understanding the business value, and 117 00:06:03.600 --> 00:06:06.600 why data scientists have to be involved early. It helps 118 00:06:06.639 --> 00:06:11.800 avoid solving completely the wrong problem or setting totally unrealistic timelines. 119 00:06:12.120 --> 00:06:15.000 That's exactly where projects can go off the rails, right 120 00:06:15.040 --> 00:06:18.040 at the start. Spending that time up front to clearly 121 00:06:18.040 --> 00:06:21.399 define the business problem, figure out the desired outcome, make 122 00:06:21.439 --> 00:06:26.519 a realistic plan, it's foundational. Data scientists bring that unique perspective. 123 00:06:26.560 --> 00:06:29.399 They understand the business and what's actually possible with the data. 
124 00:06:29.439 --> 00:06:31.680 And when they talk about project modeling, it's described as 125 00:06:31.800 --> 00:06:34.959 very iterative, not just picking a model. It involves all 126 00:06:35.000 --> 00:06:39.240 that hard work, data cleaning, wrangling, exploratory analysis to really 127 00:06:39.279 --> 00:06:42.560 get the data, then translating the business problem into stats 128 00:06:42.639 --> 00:06:45.720 or machine learning terms. It's rarely finding the perfect model 129 00:06:45.759 --> 00:06:46.360 first try. 130 00:06:46.600 --> 00:06:49.759 That iterative nature is totally key. It's just how it works. 131 00:06:49.920 --> 00:06:53.079 You break down big problems into smaller analytical questions, apply 132 00:06:53.199 --> 00:06:57.040 different methods. You need feedback loops, communication. You've got to 133 00:06:57.040 --> 00:06:58.839 be willing to learn and adjust as you go. 134 00:06:59.560 --> 00:07:02.399 Finally, in the intro section, they flag a couple of super 135 00:07:02.399 --> 00:07:05.439 common mistakes. We mentioned solving the wrong problem, but the 136 00:07:05.480 --> 00:07:09.560 second one is underestimating timelines. They say that data exploration 137 00:07:09.720 --> 00:07:13.600 and prep, the unglamorous stuff, can eat up like sixty 138 00:07:13.639 --> 00:07:15.759 to eighty percent of the total project time. 139 00:07:15.920 --> 00:07:18.839 Whoa. Yeah, that number really hits home, doesn't it? It 140 00:07:18.920 --> 00:07:21.399 highlights all that hidden effort needed just to get raw, 141 00:07:21.519 --> 00:07:24.279 messy data ready for modeling. If you don't budget time 142 00:07:24.319 --> 00:07:27.000 for that wrangling and exploring properly, your project's almost certainly 143 00:07:27.040 --> 00:07:31.079 going to hit delays, or worse, you build on shaky data. 144 00:07:31.160 --> 00:07:33.360 Okay, let's shift gears a bit and dig into some 145 00:07:33.399 --> 00:07:37.839 of the more technical details. 
Starting with data preprocessing. The 146 00:07:37.879 --> 00:07:40.079 book spends a good amount of time here, and well, 147 00:07:40.240 --> 00:07:42.720 like we just said, raw data is rarely model ready. 148 00:07:43.279 --> 00:07:46.800 Data cleaning is usually step one, right, finding and dealing 149 00:07:46.800 --> 00:07:50.800 with weird stuff: negative ages, percentages over one hundred. The 150 00:07:50.839 --> 00:07:53.800 book talks about different strategies, like just deleting those rows 151 00:07:54.000 --> 00:07:57.480 or maybe treating them as missing values and imputing them later. 152 00:07:58.040 --> 00:08:00.800 And that's a strategic choice. When do you delete 153 00:08:00.879 --> 00:08:04.360 versus impute? The book suggests if your data set's big 154 00:08:04.439 --> 00:08:07.720 enough and the bad data seems random, maybe deletion is okay, 155 00:08:08.240 --> 00:08:11.319 but imputation lets you keep more data. They cover simple 156 00:08:11.360 --> 00:08:15.120 methods, mean, median, mode, and more complex ones like k-nearest neighbors. 157 00:08:15.360 --> 00:08:18.399 They even point to the impute function in R's 158 00:08:18.399 --> 00:08:19.399 imputeMissings package for 159 00:08:19.360 --> 00:08:22.600 that. Right, and that leads straight into missing values generally, 160 00:08:22.639 --> 00:08:25.560 which are just everywhere in real data. Again, imputation is key. 161 00:08:25.800 --> 00:08:29.000 They detail the basic methods, kNN, and even mention maybe using 162 00:08:29.079 --> 00:08:31.519 bagged trees for imputation sometimes. 163 00:08:31.199 --> 00:08:33.799 Yeah, and the imputation method you choose, it can actually 164 00:08:33.840 --> 00:08:37.159 affect your model's performance down the line. Like the book says, 165 00:08:37.519 --> 00:08:41.879 simple mean imputation ignores relationships between variables and can kind 166 00:08:41.879 --> 00:08:45.039 of distort things, especially if lots of data is missing. 
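To make the "treat impossible values as missing, then impute" strategy concrete, here is a minimal Python sketch using only the standard library. It is not the book's code (the book works with R's imputeMissings package); the function name, the made-up ages, and the upper cutoff of 120 are assumptions chosen for illustration.

```python
from statistics import median


def clean_and_impute(ages, max_age=120):
    """Flag impossible ages as missing, then fill every gap with the median.

    max_age=120 is an illustrative cutoff, not a value from the book.
    """
    # Step 1: anything outside [0, max_age] becomes None (i.e. missing).
    flagged = [a if a is not None and 0 <= a <= max_age else None for a in ages]
    # Step 2: compute the median from the values that survived.
    observed = [a for a in flagged if a is not None]
    fill = median(observed)
    # Step 3: replace each missing entry with that median.
    return [fill if a is None else a for a in flagged]


# Made-up ages: one negative value and one genuinely missing value.
ages = [34, -5, 51, None, 29, 42]
imputed = clean_and_impute(ages)
print(imputed)
```

Median imputation like this ignores relationships between variables, which is exactly the limitation the hosts describe; kNN or tree-based imputation would use the other columns to make the guess.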
167 00:08:45.960 --> 00:08:48.440 More advanced methods try to use those relationships to make 168 00:08:48.480 --> 00:08:49.159 better guesses. 169 00:08:49.679 --> 00:08:53.240 Centering and scaling also get covered, basically getting all your 170 00:08:53.279 --> 00:08:57.200 variables onto a similar scale. The book mentions preProcess 171 00:08:57.440 --> 00:09:01.480 in R's caret package using center, and this is super 172 00:09:01.519 --> 00:09:04.320 important for lots of algorithms that are sensitive to how 173 00:09:04.360 --> 00:09:08.360 big the numbers are. Like, imagine comparing height in centimeters 174 00:09:08.679 --> 00:09:12.519 and income in thousands of dollars, totally different scales, right? 175 00:09:12.960 --> 00:09:15.320 Some algorithms might just focus on the income because the 176 00:09:15.399 --> 00:09:15.960 numbers are 177 00:09:15.879 --> 00:09:20.399 bigger. Exactly. Algorithms like gradient descent, which trains so many models, 178 00:09:20.519 --> 00:09:23.240 they just work better, converge faster, when features are on 179 00:09:23.279 --> 00:09:26.679 a similar scale. It stops variables with big ranges from 180 00:09:26.759 --> 00:09:29.240 just dominating the learning process unfairly. 181 00:09:29.519 --> 00:09:33.759 Okay, next up, skewness and outliers. The book talks about 182 00:09:33.840 --> 00:09:38.279 using visualizations, box plots, histograms, to spot these, and also 183 00:09:38.720 --> 00:09:42.480 statistical methods like Z scores or the modified Z score 184 00:09:42.559 --> 00:09:45.799 using the mad function in R. Finding these is important 185 00:09:45.840 --> 00:09:48.600 because they can really mess up certain models, like one 186 00:09:48.720 --> 00:09:51.799 huge income could totally skew the average and mislead a 187 00:09:51.879 --> 00:09:52.559 linear model. 188 00:09:52.720 --> 00:09:55.799 Yeah, and knowing how different models react to outliers is key. 
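The modified Z score mentioned above can be sketched in a few lines of plain Python. This is an illustration, not the book's R code: it uses the raw median absolute deviation with the conventional 0.6745 constant, and the 3.5 cutoff is a commonly used rule of thumb that is assumed here rather than taken from the episode. The income figures are invented.

```python
from statistics import median


def modified_z_scores(values):
    """Modified Z score: 0.6745 * (x - median) / MAD, with MAD the raw
    median absolute deviation (no scaling constant applied)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Invented incomes in thousands; one value is far from the rest.
incomes = [42, 45, 47, 50, 52, 55, 400]
scores = modified_z_scores(incomes)

# 3.5 is an assumed, conventional threshold for flagging outliers.
outliers = [x for x, z in zip(incomes, scores) if abs(z) > 3.5]
print(outliers)
```

Because the median and MAD barely move when one extreme value is added, this score flags the large income without being distorted by it, which is the weakness of the ordinary mean-and-standard-deviation Z score.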
189 00:09:56.399 --> 00:10:01.559 Linear regression, logistic regression, pretty sensitive. Tree-based models, usually more 190 00:10:01.679 --> 00:10:04.840 robust. And the book rightly says outliers aren't always errors. 191 00:10:04.840 --> 00:10:07.879 They could be real, just unusual, so deciding what to 192 00:10:07.919 --> 00:10:11.440 do, remove, transform, leave alone, needs thought, maybe domain knowledge. 193 00:10:11.519 --> 00:10:14.080 They mention transformations like spatialSign in R that can 194 00:10:14.159 --> 00:10:16.639 kind of dampen the influence of outliers without removing them. 195 00:10:16.879 --> 00:10:19.720 Collinearity is another big one, when your predictor variables are 196 00:10:19.799 --> 00:10:22.720 highly correlated with each other. The book points to findCorrelation 197 00:10:22.759 --> 00:10:26.679 in caret for finding these. If predictors are too correlated, 198 00:10:26.720 --> 00:10:30.279 it makes model coefficients unstable and hard to interpret, like 199 00:10:30.360 --> 00:10:33.279 trying to separate the effect of Facebook ad spend from Instagram 200 00:10:33.320 --> 00:10:35.759 ad spend if they always move together. Precisely. 201 00:10:36.519 --> 00:10:40.519 High multicollinearity inflates the variance of coefficient estimates in 202 00:10:40.600 --> 00:10:43.600 linear models, makes it hard to see the independent effect 203 00:10:43.600 --> 00:10:46.600 of each variable, so you might remove one variable, combine them, 204 00:10:46.919 --> 00:10:49.320 or use dimensionality reduction techniques to handle it. 205 00:10:49.360 --> 00:10:52.720 They also cover sparse variables, predictors that barely change across 206 00:10:52.759 --> 00:10:56.679 the data set, very low variance. nearZeroVar in 207 00:10:56.759 --> 00:11:02.120 caret helps find these based on unique values and frequency ratios. 
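The "unique values and frequency ratios" check for near-constant predictors can be sketched as follows. This is a plain-Python illustration of the idea, not caret's implementation; the function name is hypothetical, and the two cutoffs (a frequency ratio of 95/5 and 10 percent unique values) are assumed to resemble caret's defaults and should be verified against its documentation.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a column looks almost constant.

    freq_ratio     = count of most common value / count of second most common
    percent_unique = distinct values as a percentage of all rows
    Both cutoffs are illustrative assumptions.
    """
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a strictly constant column carries no information
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut


# Made-up columns: one dominated by a single value, one reasonably balanced.
rare_flag = [0] * 98 + [1] * 2
balanced = [0] * 60 + [1] * 40
print(near_zero_variance(rare_flag), near_zero_variance(balanced))
```

Requiring both conditions avoids dropping a continuous variable that merely has one repeated value, since such a column would still have a high percentage of unique values.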
Basically, 208 00:11:02.159 --> 00:11:04.679 if a variable is almost constant, it's not really helping 209 00:11:04.679 --> 00:11:05.919 your model tell things apart. 210 00:11:06.080 --> 00:11:08.960 Yeah, they're just not adding much information. Removing them can 211 00:11:09.000 --> 00:11:12.279 simplify the model, make it more stable, maybe train faster, 212 00:11:12.639 --> 00:11:14.000 without really hurting performance. 213 00:11:14.240 --> 00:11:19.039 And the last preprocessing step mentioned is re-encoding dummy variables. 214 00:11:19.360 --> 00:11:22.639 That's just converting categorical things like colors, product types, into 215 00:11:22.720 --> 00:11:26.879 numbers, usually binary, zeros and ones, so algorithms can understand them. 216 00:11:27.039 --> 00:11:31.279 Fundamental step for categorical data. It creates those binary dummy variables 217 00:11:31.279 --> 00:11:33.120 so the model can treat each category as its own 218 00:11:33.159 --> 00:11:35.360 feature and learn its relationship to the outcome. 219 00:11:35.639 --> 00:11:40.080 Okay, shifting now to data wrangling. The book really highlights 220 00:11:40.120 --> 00:11:44.039 R's dplyr package for manipulating data. They go through functions 221 00:11:44.080 --> 00:11:50.200 like select for picking columns, filter for rows, arrange for sorting, 222 00:11:50.360 --> 00:11:54.039 mutate for making new variables, and summarize with group_by 223 00:11:54.080 --> 00:11:57.519 for calculating stats across groups. They even give a customer 224 00:11:57.559 --> 00:12:00.840 segmentation example showing how you'd use these to summarize metrics for 225 00:12:00.840 --> 00:12:04.840 different customer types, like average age, spending, transaction counts for, 226 00:12:05.039 --> 00:12:08.320 say, conspicuous versus price conscious customers. 
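The group-then-summarize pattern described for dplyr's group_by and summarize can be mirrored in plain Python. This sketch is not the book's segmentation example: the records, field names, and numbers below are invented purely to show the shape of the operation.

```python
from collections import defaultdict
from statistics import mean

# Invented customer records; segment labels echo the ones mentioned above.
customers = [
    {"segment": "Conspicuous", "age": 40, "spend": 900},
    {"segment": "Conspicuous", "age": 44, "spend": 1100},
    {"segment": "Price", "age": 30, "spend": 200},
    {"segment": "Price", "age": 34, "spend": 300},
]

# "group_by": bucket the rows by segment.
groups = defaultdict(list)
for row in customers:
    groups[row["segment"]].append(row)

# "summarize": one row of statistics per segment.
summary = {
    segment: {
        "n": len(rows),
        "avg_age": mean(r["age"] for r in rows),
        "avg_spend": mean(r["spend"] for r in rows),
    }
    for segment, rows in groups.items()
}

for segment, stats in summary.items():
    print(segment, stats)
```

In R the same result would come from a short dplyr pipeline; the point of the sketch is only that grouping and aggregation are two separate steps.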
227 00:12:08.559 --> 00:12:11.840 Oh yeah, dplyr really changed the game for data manipulation 228 00:12:11.919 --> 00:12:15.080 in R. The syntax is just so intuitive, makes common tasks 229 00:12:15.159 --> 00:12:18.399 much clearer and more efficient. That customer segmentation example is 230 00:12:18.440 --> 00:12:20.679 great, shows exactly how you use these tools to pull 231 00:12:20.679 --> 00:12:22.960 out meaningful insights about different groups in your data. 232 00:12:23.039 --> 00:12:24.639 They do give a quick nod to base R 233 00:12:24.679 --> 00:12:29.039 functions too, like apply, lapply, sapply, acknowledging that while dplyr 234 00:12:29.120 --> 00:12:31.840 is great, sometimes you need the flexibility of the base 235 00:12:31.879 --> 00:12:33.360 functions for trickier stuff. 236 00:12:33.480 --> 00:12:36.000 Right. dplyr streamlines a lot, but base R gives you 237 00:12:36.039 --> 00:12:39.360 that fine grained control for maybe more complex or custom operations. 238 00:12:39.600 --> 00:12:41.080 It's good to know both, really. All 239 00:12:41.039 --> 00:12:43.159 right, let's talk model tuning. The book starts with the 240 00:12:43.240 --> 00:12:47.559 classic bias-variance tradeoff, the idea that a really complex 241 00:12:47.639 --> 00:12:51.120 model might fit your training data perfectly, low bias, but 242 00:12:51.240 --> 00:12:53.360 then it fails on new data because it learned the 243 00:12:53.399 --> 00:12:57.519 noise, high variance, overfitting, while a too simple model won't 244 00:12:57.519 --> 00:13:01.000 even capture the basic patterns, high bias, underfitting. Tuning is 245 00:13:01.039 --> 00:13:03.000 finding that balance for good generalization. 246 00:13:03.279 --> 00:13:05.720 Yeah, that's like machine learning one oh one, isn't it? 247 00:13:06.159 --> 00:13:08.720 But absolutely crucial. You want the model to learn the 248 00:13:08.759 --> 00:13:13.360 real signal, not the random noise. 
Overfitting is like memorizing 249 00:13:13.440 --> 00:13:17.360 test answers. Great for that test, useless otherwise. Underfitting is 250 00:13:17.440 --> 00:13:19.000 like not studying at all. And 251 00:13:19.080 --> 00:13:21.639 data splitting and resampling are the main tools for managing 252 00:13:21.679 --> 00:13:24.080 this trade off. The book talks about the basic train/test 253 00:13:24.200 --> 00:13:27.720 split: build on training data, evaluate on unseen test data. 254 00:13:28.159 --> 00:13:31.840 It also mentions a fancier technique, maximum dissimilarity sampling, using 255 00:13:31.879 --> 00:13:34.440 maxDissim in caret. The goal there is to make the 256 00:13:34.480 --> 00:13:39.320 test set really diverse, covering more possibilities. Simple random splitting 257 00:13:39.360 --> 00:13:42.279 can sometimes give you unrepresentative train or test sets just 258 00:13:42.320 --> 00:13:47.240 by luck, which biases your performance estimate. Maximum dissimilarity sampling 259 00:13:47.320 --> 00:13:49.759 tries to build a test set that really spans the 260 00:13:49.840 --> 00:13:52.759 range of your data, giving a more robust evaluation of 261 00:13:52.840 --> 00:13:55.480 how the model might do in the wild. Then you 262 00:13:55.519 --> 00:13:59.240 have resampling methods for getting more stable performance estimates, especially 263 00:13:59.240 --> 00:14:03.120 with limited data. The book covers cross validation, like k-fold, 264 00:14:03.120 --> 00:14:06.960 and bootstrapping. With k-fold, you split data into k parts, 265 00:14:07.279 --> 00:14:09.919 train on k minus one, test on the remaining one, repeat 266 00:14:09.960 --> 00:14:13.519 k times and average the results. Gives a more reliable picture. 267 00:14:13.919 --> 00:14:17.240 Yeah, resampling is invaluable for confidence in your performance metrics. 
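The k-fold procedure just described (split into k parts, train on k minus one, test on the held-out part, repeat for each part) can be sketched as an index generator. The function name and the seed parameter are assumptions for illustration; in practice a library routine such as caret's resampling controls in R would handle this.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Row indices 0..n-1 are shuffled once, dealt into k folds, and each fold
    is held out exactly once while the other k-1 folds form the training set.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 10 rows, 5 folds: each round trains on 8 rows and tests on 2.
splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "rows used for training")
```

A model would be fit on each `train` set and scored on the matching `test` set, and the k scores averaged to give the cross-validated estimate.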
268 00:14:17.600 --> 00:14:20.679 Cross validation avoids the risk of getting a misleading score 269 00:14:20.960 --> 00:14:24.399 just from one lucky or unlucky train/test split. Bootstrapping 270 00:14:24.440 --> 00:14:27.759 involves resampling with replacement to create lots of simulated data 271 00:14:27.799 --> 00:14:30.639 sets, then training and testing on those. It gives you 272 00:14:30.679 --> 00:14:33.240 a sense of the stability and uncertainty around your metrics. 273 00:14:33.240 --> 00:14:36.200 So how do we actually measure performance? The book says 274 00:14:36.240 --> 00:14:40.679 it depends if it's regression or classification. For regression, predicting numbers, 275 00:14:40.759 --> 00:14:44.919 common metrics are RMSE, root mean squared error, which tells you the 276 00:14:44.960 --> 00:14:48.360 average error size, and R squared, the proportion of variance explained. 277 00:14:48.759 --> 00:14:51.879 Though they caution that high R squared isn't everything, and 278 00:14:51.919 --> 00:14:57.080 they mention adjusted R squared, which penalizes extra unhelpful predictors. 279 00:14:57.159 --> 00:14:59.679 Right, RMSE is nice because it's in the same units 280 00:14:59.679 --> 00:15:02.720 as your target variable, easy to grasp. R squared tells 281 00:15:02.720 --> 00:15:04.639 you how much better your model is than just guessing 282 00:15:04.679 --> 00:15:07.240 the average, but yeah, doesn't guarantee it's a good model 283 00:15:07.399 --> 00:15:11.759 or will generalize. Adjusted R squared pushes towards simpler models, 284 00:15:11.759 --> 00:15:12.639 which is often 285 00:15:12.399 --> 00:15:16.080 good. For classification, predicting categories, the book gets into the 286 00:15:16.120 --> 00:15:19.879 confusion matrix, true positives, false positives, etc., 
and metrics like 287 00:15:20.080 --> 00:15:24.080 accuracy, specificity, finding the true negatives, and the Kappa statistic. 288 00:15:24.480 --> 00:15:28.600 Kappa, using Kappa.test in R's fmsb package, measures 289 00:15:28.639 --> 00:15:30.519 agreement beyond what you'd expect by chance. 290 00:15:31.039 --> 00:15:33.679 Useful. Yeah, the confusion matrix breaks it all down, not 291 00:15:33.759 --> 00:15:35.919 just if the model was right, but how it was wrong. 292 00:15:36.879 --> 00:15:40.320 Simple accuracy can be really misleading with unbalanced classes. If 293 00:15:40.399 --> 00:15:43.399 ninety nine percent are negative, a dumb model predicting negative 294 00:15:43.399 --> 00:15:47.399 all the time gets ninety nine percent accuracy. Specificity, Kappa, 295 00:15:47.440 --> 00:15:49.799 they give a much better picture, especially Kappa, accounting for 296 00:15:49.879 --> 00:15:50.559 chance agreement. 297 00:15:50.799 --> 00:15:54.200 They also cover ROC curves and AUC, area under the 298 00:15:54.240 --> 00:15:58.120 curve, using roc from the pROC package in R. That helps evaluate 299 00:15:58.159 --> 00:16:02.120 classifiers across different thresholds. And gain and lift charts, which 300 00:16:02.120 --> 00:16:04.799 are more business focused. They show how much better your 301 00:16:04.840 --> 00:16:08.000 model is at finding positive cases compared to just random selection, 302 00:16:08.480 --> 00:16:12.759 for marketing campaigns. ROC curves visualize that trade off between 303 00:16:12.799 --> 00:16:15.960 finding true positives and avoiding false positives as you change 304 00:16:15.960 --> 00:16:19.039 the decision threshold. Higher AUC is generally better. Gain and 305 00:16:19.080 --> 00:16:21.399 lift charts translate that into business terms. How much more 306 00:16:21.440 --> 00:16:23.720 efficiently can you reach your target audience using the model? 307 00:16:24.080 --> 00:16:25.279 Very practical. Okay. 
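The "ninety nine percent negative" scenario can be worked through numerically. The sketch below computes accuracy, specificity, and Cohen's Kappa from a binary confusion matrix in plain Python; the function name and the 0/1 label coding are assumptions, and it is not a substitute for the R function named above.

```python
def binary_metrics(actual, predicted):
    """Accuracy, specificity and Cohen's Kappa for labels coded 0/1."""
    n = len(actual)
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    accuracy = (tp + tn) / n
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    return {"accuracy": accuracy, "specificity": specificity, "kappa": kappa}


# 99 negatives, 1 positive, and a model that always predicts negative.
actual = [0] * 99 + [1]
predicted = [0] * 100
metrics = binary_metrics(actual, predicted)
print(metrics)
```

Accuracy comes out at 0.99 while Kappa is 0, because the agreement expected by chance is also 0.99: the always-negative model adds nothing beyond what the class balance already gives.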
308 00:16:25.320 --> 00:16:28.279 Finally, the book walks through a bunch of different regression models. 309 00:16:28.639 --> 00:16:32.440 Starts with the basics, ordinary least squares, OLS, linear regression, 310 00:16:32.919 --> 00:16:37.919 covers its assumptions, linearity, independence, constant error variance, normal residuals, 311 00:16:37.960 --> 00:16:40.919 and diagnostic plots to check them. Then moves to things 312 00:16:40.960 --> 00:16:45.919 like principal component regression, PCR, and partial least squares, PLS, 313 00:16:46.159 --> 00:16:49.159 for handling many possibly correlated predictors. 314 00:16:49.519 --> 00:16:53.279 Understanding those OLS assumptions is so important for trusting the results. 315 00:16:53.559 --> 00:16:56.639 If they're violated, your coefficients and predictions might be off. 316 00:16:57.000 --> 00:17:00.840 Diagnostics help check that. PCR and PLS are great dimensionality 317 00:17:00.840 --> 00:17:06.079 reduction tools, especially when multicollinearity makes standard linear regression unstable. 318 00:17:06.200 --> 00:17:10.160 It also covers regularization methods, ridge, lasso, and elastic net. 319 00:17:10.359 --> 00:17:14.519 These shrink coefficients to prevent overfitting, handle collinearity, and lasso 320 00:17:14.920 --> 00:17:17.920 can even do feature selection by zeroing out some coefficients. 321 00:17:18.359 --> 00:17:19.799 Mentions enet and glmnet 322 00:17:19.839 --> 00:17:22.799 for that. Yeah, regularization is super powerful for building more 323 00:17:22.880 --> 00:17:26.000 robust models, especially with lots of features. Ridge shrinks everything 324 00:17:26.000 --> 00:17:29.079 towards zero, lasso can force some coefficients to zero, doing 325 00:17:29.119 --> 00:17:31.920 automatic feature selection. Elastic net is a mix of both. 326 00:17:32.279 --> 00:17:36.440 Then tree based methods get introduced. 
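The difference between ridge shrinking a coefficient and lasso zeroing it out can be seen in the simplest possible setting: one predictor and no intercept, where both penalized estimates have closed forms. This is a toy sketch under that assumption, not how glmnet fits models; the function names, data, and penalty values are invented for illustration.

```python
import math


def ols_slope(x, y):
    """Least-squares slope for y = b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def ridge_slope(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx


x = [1, 2, 3]
y = [2, 4, 6]  # exactly y = 2x, so the unpenalized slope is 2

for lam in (0, 14, 30):
    print(lam, ridge_slope(x, y, lam), lasso_slope(x, y, lam))
```

With a large enough penalty the lasso estimate is exactly zero while the ridge estimate is only smaller, which is why lasso is described as doing feature selection and ridge is not.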
Decision trees plus ensembles 327 00:17:36.480 --> 00:17:40.680 like bagging, treebag in caret, random forests, rf in caret, 328 00:17:41.039 --> 00:17:44.559 and gradient boosted machines, gbm in caret. These are good 329 00:17:44.559 --> 00:17:47.839 for nonlinear patterns and handling different data types. Touches on 330 00:17:47.880 --> 00:17:52.599 splitting criteria, Gini, information gain, and pruning to avoid overfitting. 331 00:17:52.920 --> 00:17:56.880 Trees are incredibly versatile. Often top performers, great at finding 332 00:17:56.880 --> 00:18:00.559 complex patterns without needing tons of feature engineering. Ensembles like 333 00:18:00.599 --> 00:18:03.279 random forests and gradient boosting combine many trees to get 334 00:18:03.279 --> 00:18:07.000 even better, more stable predictions. Understanding splitting and pruning is 335 00:18:07.079 --> 00:18:08.400 key to making them work well. 336 00:18:08.480 --> 00:18:11.200 And lastly, a quick intro to deep learning. Feed forward 337 00:18:11.279 --> 00:18:16.880 neural networks, FFNNs, convolutional neural networks, CNNs, for images, recurrent 338 00:18:16.920 --> 00:18:21.279 neural networks, RNNs, for sequences like text. Briefly covers applications, 339 00:18:21.319 --> 00:18:26.039 components like neurons, activation functions, sigmoid, ReLU, layers, optimization, 340 00:18:26.279 --> 00:18:30.599 gradient descent, Adam, regularization, dropout. Points to the keras package 341 00:18:30.640 --> 00:18:31.000 in R. 342 00:18:31.079 --> 00:18:35.279 Deep learning has had amazing success, especially with images, language, speech. 343 00:18:35.880 --> 00:18:39.119 The book just gives a taste, but hits the core concepts, 344 00:18:39.359 --> 00:18:41.839 the building blocks, how they learn, how to control them. 345 00:18:42.079 --> 00:18:44.759 It's a huge field, but that's a good starting point. 
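Information gain, one of the splitting criteria mentioned for decision trees, is the drop in entropy a split produces. The following is a minimal standard-library sketch with invented labels; the function names are hypothetical and real tree libraries compute this internally.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Invented node with two "yes" and two "no" labels.
parent = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly versus one that does nothing.
clean_split = information_gain(parent, ["yes", "yes"], ["no", "no"])
useless_split = information_gain(parent, ["yes", "no"], ["yes", "no"])
print(clean_split, useless_split)
```

A tree-growing algorithm would evaluate candidate splits like these and keep the one with the largest gain; pruning then removes splits whose gain does not justify the added complexity.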
346 00:18:44.960 --> 00:18:47.720 So wrapping up this deep dive on The Practitioner's Guide 347 00:18:47.720 --> 00:18:52.559 to Data Science, it really feels like a valuable bridge. Yeah, definitely. 348 00:18:52.640 --> 00:18:55.720 And as you, our listener, think about all this, consider 349 00:18:56.000 --> 00:18:59.079 how these ideas might apply to what you're working on 350 00:18:59.160 --> 00:19:01.839 or learning about. Maybe you're prepping for a meeting, trying 351 00:19:01.839 --> 00:19:04.960 to understand a new area, or just curious. That ability 352 00:19:05.079 --> 00:19:08.680 to work effectively with data, it's just becoming so critical everywhere. 353 00:19:08.799 --> 00:19:12.559 Perhaps you're looking at customer behavior or analyzing trends for research. 354 00:19:12.599 --> 00:19:15.240 The kinds of practical steps and thinking outlined in this 355 00:19:15.279 --> 00:19:17.759 book offer a really solid way to approach those kinds 356 00:19:17.799 --> 00:19:19.440 of challenges. Something to think about.