WEBVTT 1 00:00:00.080 --> 00:00:02.279 Welcome, curious minds, to another deep dive. 2 00:00:02.520 --> 00:00:03.040 Hello. 3 00:00:04.000 --> 00:00:09.439 Imagine having just one single reliable place you could quickly 4 00:00:09.560 --> 00:00:12.320 check whenever some complex data science term pops up. 5 00:00:12.439 --> 00:00:14.720 Yeah, instead of drowning in search results. 6 00:00:14.439 --> 00:00:17.920 Exactly, saving you hours, maybe, of sifting through stuff that 7 00:00:18.000 --> 00:00:21.760 might not even be right. Well, today we're doing just that. 8 00:00:22.120 --> 00:00:26.760 We're cracking open the Data Scientist Pocket Guide by Mohammed Sabri. 9 00:00:27.640 --> 00:00:30.559 It's a resource really designed to cut through all that 10 00:00:30.679 --> 00:00:34.320 noise and hopefully give you clear, reliable answers. 11 00:00:34.399 --> 00:00:35.000 It's useful. 12 00:00:35.240 --> 00:00:38.159 Our mission today, then, is to extract the most important 13 00:00:38.200 --> 00:00:41.520 sort of nuggets of knowledge from this guide. We'll focus 14 00:00:41.600 --> 00:00:45.840 on key concepts, tackle some frequently asked questions in machine learning, 15 00:00:45.880 --> 00:00:48.320 deep learning, the big ones. The big ones. Yeah, think 16 00:00:48.359 --> 00:00:51.320 of this as your personal tour through a really practical glossary, 17 00:00:51.880 --> 00:00:54.560 helping you grasp not just what things are, but 18 00:00:54.560 --> 00:00:56.320 why they matter, how they fit 19 00:00:56.200 --> 00:00:58.359 together. Exactly, the bigger picture. 20 00:00:58.479 --> 00:01:01.039 And what's really compelling, I think, is why he wrote it. 21 00:01:01.039 --> 00:01:03.840 It came from his own experience, his own frustrations early on. 22 00:01:03.960 --> 00:01:07.519 Oh, interesting. Yeah, he saw the struggle, especially for beginners, 23 00:01:07.560 --> 00:01:13.159 trying to find quick, reliable, clear explanations for fundamental concepts. 
24 00:01:13.599 --> 00:01:17.640 He actually says, answers to my questions were not always. 25 00:01:17.400 --> 00:01:19.840 Reliable, right, I can relate to that. 26 00:01:19.840 --> 00:01:22.159 And some concepts are hard to understand. It created a 27 00:01:22.159 --> 00:01:23.400 real barrier. You know. 28 00:01:23.640 --> 00:01:25.799 That's such a common experience, isn't it. It sounds like he 29 00:01:25.879 --> 00:01:28.079 wasn't just like compiling facts. He was trying to solve 30 00:01:28.079 --> 00:01:30.319 a real pain point he knew others had. 31 00:01:30.480 --> 00:01:33.640 Precisely, he wanted to create what he calls a first 32 00:01:33.680 --> 00:01:36.879 of a kind dictionary or glossary that regroups the most 33 00:01:36.879 --> 00:01:39.920 popular terms, really aiming to make the day to day 34 00:01:39.959 --> 00:01:41.439 work easier, more enriching. 35 00:01:41.519 --> 00:01:44.840 Even Okay, so if you've ever felt that sense of 36 00:01:44.879 --> 00:01:48.480 overwhelm just the sheer volume of info, or got lost 37 00:01:48.719 --> 00:01:51.719 trying to figure out which explanation to trust, this deep 38 00:01:51.799 --> 00:01:55.000 dive should be really helpful. Yeah, hopefully. Muhammad describes those 39 00:01:55.040 --> 00:01:58.400 early frustrations quite vividly, you know, having to go on 40 00:01:58.439 --> 00:02:02.159 search engines and use various sources just to understand one concept, 41 00:02:02.640 --> 00:02:04.760 finding it time consuming, and as you said, the answers 42 00:02:04.799 --> 00:02:05.560 weren't always. 43 00:02:05.359 --> 00:02:09.240 Reliable, right, And he points out something key. A lot 44 00:02:09.240 --> 00:02:11.919 of books focus heavily on the coding, which. 45 00:02:11.719 --> 00:02:13.919 Is essential obviously, of course, but. 46 00:02:14.159 --> 00:02:19.120 They often miss understanding the logic and the mechanism behind 47 00:02:19.159 --> 00:02:20.000 each concept. 
48 00:02:20.199 --> 00:02:23.120 That raises a really important question. Then why is that 49 00:02:23.159 --> 00:02:27.240 conceptual understanding so critical even if you're a great coder. 50 00:02:27.840 --> 00:02:31.680 Well, the guide really emphasizes this. Without that foundation, it's 51 00:02:32.000 --> 00:02:34.919 hard for him to provide good results and explain its work, 52 00:02:35.439 --> 00:02:39.080 the explanation for it exactly. You can run the code, sure, 53 00:02:39.360 --> 00:02:41.680 but do you know why it works, what the output 54 00:02:41.759 --> 00:02:44.759 really means, how to fix it when it breaks. That's 55 00:02:44.800 --> 00:02:47.479 the conceptual piece, got it. So the book's goal, it's 56 00:02:47.479 --> 00:02:49.800 pretty ambitious, actually, is to be a kind of data 57 00:02:49.840 --> 00:02:53.759 science bible, a quick reference for solid definitions. 58 00:02:53.199 --> 00:02:56.719 A bible. Huh. So, given that focus on quick reference, 59 00:02:56.800 --> 00:02:59.120 quick answers, I'm guessing this isn't a book you read 60 00:02:59.199 --> 00:03:00.840 cover to cover like a novel. 61 00:03:01.000 --> 00:03:04.520 No, absolutely not. He's very clear about that. The objective 62 00:03:04.680 --> 00:03:07.120 is not to be read all at once. Right, It's 63 00:03:07.199 --> 00:03:09.039 meant to be a resource you dip into. You know, 64 00:03:09.080 --> 00:03:10.840 you have a question, you look it up. It's designed 65 00:03:10.879 --> 00:03:13.439 for nonlinear reading. 66 00:03:13.280 --> 00:03:14.199 So you can jump around. 67 00:03:14.280 --> 00:03:16.400 Yeah, start to read wherever you want and jump to 68 00:03:16.439 --> 00:03:18.599 any chapter whatever you need at that moment. 69 00:03:18.800 --> 00:03:22.199 Okay, that makes perfect sense. It's about targeted learning getting 70 00:03:22.240 --> 00:03:25.400 unstuck quickly without wading through dense theory exactly. 
71 00:03:25.479 --> 00:03:28.479 It fits that practical engineering mindset, right. 72 00:03:28.680 --> 00:03:30.840 So the book's structure reflects that too. It's got this 73 00:03:30.879 --> 00:03:35.719 big alphabetical definition section and then a dedicated FAQ section. 74 00:03:35.840 --> 00:03:37.879 Yeah, the FAQs are really interesting. 75 00:03:38.000 --> 00:03:41.039 That's where we find some really actionable stuff, those distinctions 76 00:03:41.080 --> 00:03:44.360 that often, you know, trip people up. Let's start with 77 00:03:44.439 --> 00:03:49.439 a big one, deep learning versus traditional machine learning. When 78 00:03:49.439 --> 00:03:51.199 do you actually need deep learning? 79 00:03:51.599 --> 00:03:54.719 Okay, yeah, that's a common question. The guide suggests it 80 00:03:54.759 --> 00:03:58.560 really shines in, well, two main scenarios where traditional methods 81 00:03:58.639 --> 00:03:59.199 might struggle. 82 00:03:59.360 --> 00:03:59.680 Okay. 83 00:04:00.680 --> 00:04:03.719 In case it is hard to extract features from the data, 84 00:04:03.360 --> 00:04:07.080 meaning deep learning models can often learn the important features 85 00:04:07.159 --> 00:04:11.479 automatically, directly from raw data. Think pixels in an image 86 00:04:11.719 --> 00:04:13.520 or raw audio waveforms. 87 00:04:13.680 --> 00:04:15.719 So you don't need as much manual feature 88 00:04:15.400 --> 00:04:19.480 engineering. Exactly, which can save a ton of effort, especially 89 00:04:19.519 --> 00:04:21.839 with complex unstructured data. 90 00:04:21.920 --> 00:04:23.399 Okay, that's one. What's the second? 
91 00:04:23.439 --> 00:04:25.759 The second, and it often goes hand in hand, is 92 00:04:26.120 --> 00:04:28.800 in case we have a large amount of data. Scale. 93 00:04:29.399 --> 00:04:33.480 With massive data sets, deep learning models often keep improving 94 00:04:33.519 --> 00:04:35.839 with more data. They can learn better and show a 95 00:04:35.839 --> 00:04:40.480 better performance, where traditional algorithms might plateau or even struggle 96 00:04:40.519 --> 00:04:41.160 to scale. 97 00:04:41.279 --> 00:04:45.920 So if you're dealing with that raw complex data, images, video, language, 98 00:04:46.000 --> 00:04:48.360 or you just have enormous amounts of data, deep learning 99 00:04:48.399 --> 00:04:49.920 is probably the way to go. Generally, 100 00:04:50.040 --> 00:04:53.600 yes, it becomes a much more powerful tool in those situations. 101 00:04:53.680 --> 00:04:56.560 Okay, but even with the right model, you still need 102 00:04:56.600 --> 00:04:59.800 to know if it's actually working well, right, and understand its 103 00:04:59.680 --> 00:05:01.920 mistakes. Absolutely critical. Which brings 104 00:05:01.720 --> 00:05:04.079 us to another fundamental concept, one that trips up a 105 00:05:04.079 --> 00:05:06.800 lot of people. Type I and type II errors. 106 00:05:06.839 --> 00:05:10.800 Ah, yes, false positives and false negatives. Core statistics, but 107 00:05:10.959 --> 00:05:12.560 vital in ML evaluation. 108 00:05:12.759 --> 00:05:13.800 So break it down for us. 109 00:05:13.920 --> 00:05:19.079 Type I? Okay, a type I error, sometimes called alpha error 110 00:05:19.160 --> 00:05:22.680 or a false positive. This happens when the researcher rejects 111 00:05:22.720 --> 00:05:26.240 a null hypothesis that is actually true in the population. So 112 00:05:26.279 --> 00:05:28.839 you conclude something is happening when it actually isn't. 
113 00:05:29.079 --> 00:05:32.120 Exactly. Like a medical test saying someone has a disease 114 00:05:32.120 --> 00:05:35.720 when they're healthy, or a spam filter blocking an important 115 00:05:35.720 --> 00:05:40.040 email. You rejected the truth: healthy, not spam. 116 00:05:39.600 --> 00:05:42.079 Got it, false alarm. And type II? 117 00:05:42.120 --> 00:05:45.560 Type II error, or beta error, a false negative. This is 118 00:05:45.600 --> 00:05:48.680 the opposite. It's committed when the researcher does not reject 119 00:05:48.680 --> 00:05:51.519 a null hypothesis that is actually false in the population. 120 00:05:52.040 --> 00:05:53.920 So you miss something that is happening. 121 00:05:53.600 --> 00:05:57.120 Precisely, missing an actual effect. Think of a medical test 122 00:05:57.279 --> 00:05:58.600 failing to detect a disease 123 00:05:58.639 --> 00:06:03.160 someone actually has. Or a fraud detection system letting a fraudulent transaction 124 00:06:02.639 --> 00:06:06.000 slip through. Exactly, that's a classic example. You accepted something false, 125 00:06:06.040 --> 00:06:07.680 "the transaction is fine," as true. 126 00:06:08.120 --> 00:06:11.759 Understanding the difference here seems crucial because the cost of 127 00:06:11.800 --> 00:06:14.319 each error type can be wildly different. 128 00:06:14.040 --> 00:06:17.680 Right, hugely different. Think about that medical test example. A 129 00:06:17.720 --> 00:06:22.519 false positive, type I, leads to anxiety, maybe unnecessary follow-up 130 00:06:22.600 --> 00:06:25.680 tests. Annoying, potentially 131 00:06:25.120 --> 00:06:27.279 costly. But a false negative? 132 00:06:27.040 --> 00:06:29.759 A false negative, type II, in that context means a 133 00:06:29.800 --> 00:06:33.079 sick person doesn't get treatment. The consequences could be far, 134 00:06:33.199 --> 00:06:33.759 far worse. 
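To make the two error types concrete, here is a minimal sketch that counts them from a list of true labels and a list of predictions. The function name `error_counts` and the screening data are hypothetical, invented for illustration; they do not come from the guide.

```python
def error_counts(y_true, y_pred):
    """Count type I (false positive) and type II (false negative) errors.

    Labels are assumed to be 1 for "condition present" and 0 for "absent".
    """
    # Type I: predicted positive although the truth is negative (false alarm).
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Type II: predicted negative although the truth is positive (missed case).
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_positives, false_negatives


# Hypothetical screening results: 1 = has the disease, 0 = healthy.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

fp, fn = error_counts(y_true, y_pred)
print(f"type I (false positives): {fp}, type II (false negatives): {fn}")
```

Reporting the two counts separately, rather than folding them into one accuracy figure, is what lets you weigh the cost of a false alarm against the cost of a missed case.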
135 00:06:34.079 --> 00:06:35.800 So when you build a model, you have to decide 136 00:06:35.839 --> 00:06:39.439 which type of error is more critical to avoid for 137 00:06:39.560 --> 00:06:40.600 your specific problem. 138 00:06:40.639 --> 00:06:43.720 Absolutely, it's not just about overall accuracy, it's about the 139 00:06:43.759 --> 00:06:47.439 real-world impact of the specific mistakes your model makes. 140 00:06:47.800 --> 00:06:50.720 You often have to tune models to minimize one type 141 00:06:50.720 --> 00:06:53.079 of error, even if it slightly increases the other. 142 00:06:53.279 --> 00:06:58.240 Okay, that really clarifies why just looking at accuracy isn't enough. Now, 143 00:06:58.279 --> 00:07:02.519 speaking of practical challenges: missing data. Every data scientist runs 144 00:07:02.519 --> 00:07:03.199 into this, right? 145 00:07:03.120 --> 00:07:06.399 Oh, constantly. It's pretty much unavoidable in real-world data sets. 146 00:07:06.480 --> 00:07:07.920 And why is it such a big deal? Why can't 147 00:07:07.959 --> 00:07:08.639 we just ignore it? 148 00:07:09.079 --> 00:07:12.319 Well, because many algorithms are based on statistical methods which 149 00:07:12.319 --> 00:07:14.959 are supposed to receive a complete data set as input. 150 00:07:15.040 --> 00:07:15.839 They just aren't 151 00:07:15.600 --> 00:07:17.040 designed for gaps. So they break? 152 00:07:17.240 --> 00:07:21.199 They might break completely, just refuse to run, or maybe worse, 153 00:07:21.240 --> 00:07:26.040 they run, but give you a poor predictive model. Garbage in, 154 00:07:26.079 --> 00:07:27.639 garbage out, essentially. 155 00:07:27.399 --> 00:07:29.480 Okay, so we have to handle it. What are the 156 00:07:29.519 --> 00:07:32.199 main ways, according to the guide? It 157 00:07:32.120 --> 00:07:36.040 outlines two main strategies. 
First, you can simply remove the 158 00:07:36.079 --> 00:07:41.519 missing data, usually by deleting the observations the lines which 159 00:07:41.560 --> 00:07:43.560 contain at least one missing feature. 160 00:07:43.680 --> 00:07:44.519 Just drop a whole row. 161 00:07:44.639 --> 00:07:47.839 Yeah, it's simple, it's quick, But the downside is you 162 00:07:47.920 --> 00:07:51.439 might lose a lot of valuable information, especially if missingness 163 00:07:51.480 --> 00:07:54.879 isn't totally random, or if many rows have gaps. 164 00:07:55.040 --> 00:07:57.240 Right, you could be throwing away perfectly good data In 165 00:07:57.319 --> 00:07:59.040 other columns, what's the alternative? 166 00:07:59.120 --> 00:08:04.199 The alternative is imputation, replacing the missing values with artificial values, 167 00:08:04.680 --> 00:08:05.600 filling in the gaps. 168 00:08:05.680 --> 00:08:06.360 How do you do that? 169 00:08:06.639 --> 00:08:09.519 Just guess, well, not quite guess. You can use simple 170 00:08:09.519 --> 00:08:13.040 statistical methods like replacing missing numerical values with the mean 171 00:08:13.160 --> 00:08:15.759 or mode of that column. Or you can use more 172 00:08:15.759 --> 00:08:20.600 sophisticated techniques like using regression building a small model to 173 00:08:20.680 --> 00:08:23.639 predict what the missing value likely would have been based 174 00:08:23.680 --> 00:08:24.959 on the other features in that row. 175 00:08:25.240 --> 00:08:28.480 Ah interesting, using the other data to inform the. 176 00:08:28.480 --> 00:08:32.919 Replacement exactly, But there's a really important caveat here. Whatever 177 00:08:33.000 --> 00:08:36.080 method you use, the replacements should not lead to a 178 00:08:36.120 --> 00:08:39.879 significant change in the distribution and composition of the data set. 
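The two strategies described above, deleting incomplete rows and imputing with the column mean, can be sketched as follows. This is an illustration only: the rows, the column names, and the helper functions `drop_incomplete` and `impute_mean` are all hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean

# Hypothetical data set; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 58000},
]


def drop_incomplete(rows):
    """Strategy 1: delete every observation with at least one missing feature."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, column):
    """Strategy 2: replace missing values in one column with the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


complete = drop_incomplete(rows)
filled = impute_mean(impute_mean(rows, "age"), "income")

print(len(rows), "rows originally,", len(complete), "left after deletion")
print(filled)
```

Deletion halves this small data set, while imputation keeps every row. A reasonable follow-up check, in the spirit of the caveat above, is to compare the column mean and spread before and after filling.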
179 00:08:40.200 --> 00:08:43.519 Meaning you want to fill the gaps without fundamentally changing 180 00:08:43.519 --> 00:08:46.440 the story the data tells. You don't want to introduce 181 00:08:46.519 --> 00:08:51.639 unintended biases or distort relationships between variables. It requires careful thought. 182 00:08:51.840 --> 00:08:54.600 So it's about repairing the data set carefully, making it 183 00:08:54.679 --> 00:08:57.919 usable for algorithms without messing up the underlying patterns. 184 00:08:58.039 --> 00:09:00.840 That's the goal. Make it robust, maintain integrity. 185 00:09:01.039 --> 00:09:05.519 Okay, so data is clean, model's built. Now the evaluation 186 00:09:05.600 --> 00:09:08.200 part again, how do we actually measure performance? 187 00:09:08.480 --> 00:09:12.720 Right, evaluation. It's iterative. Often you cycle back. The guide 188 00:09:12.759 --> 00:09:14.600 says you need to use what's called a metric. 189 00:09:14.639 --> 00:09:17.759 This could be visual, like a plot, or mathematical, a number. 190 00:09:17.559 --> 00:09:18.399 And you just pick one? 191 00:09:18.639 --> 00:09:21.799 No. Crucially, the choice of metric is entirely based on 192 00:09:21.879 --> 00:09:23.960 the type of problem that we are trying to 193 00:09:23.840 --> 00:09:25.639 solve. Like we discussed with type I and type II 194 00:09:25.559 --> 00:09:29.159 errors. Exactly, the metric needs to align with the actual goal. 195 00:09:29.399 --> 00:09:32.159 For classification problems, putting things into categories, 196 00:09:32.200 --> 00:09:34.159 you have options like... Okay. 197 00:09:33.919 --> 00:09:37.559 Area under the curve, AUC, which looks at how well 198 00:09:37.600 --> 00:09:41.960 the model distinguishes classes, the confusion matrix, which breaks down 199 00:09:41.960 --> 00:09:45.399 the types of correct and incorrect predictions. 200 00:09:44.879 --> 00:09:46.960 True positives, false negatives. 
201 00:09:47.600 --> 00:09:52.240 Then there's basic accuracy; recall, how many actual positives did 202 00:09:52.240 --> 00:09:55.480 we find; precision, of the ones we predicted positive, how 203 00:09:55.519 --> 00:09:58.639 many were right; and the F1 score, which balances 204 00:09:58.679 --> 00:09:59.840 precision and recall. 205 00:10:00.080 --> 00:10:03.679 Okay, lots of options for classification. What about regression, predicting 206 00:10:03.679 --> 00:10:04.120 a number? 207 00:10:04.320 --> 00:10:07.120 For regression, you're looking at how close your predictions are 208 00:10:07.159 --> 00:10:10.279 to the actual values. So metrics include mean squared error, 209 00:10:10.399 --> 00:10:14.559 MSE, root mean squared error, RMSE, mean absolute error, MAE, 210 00:10:15.039 --> 00:10:18.720 and the coefficient of determination, or R squared, and its 211 00:10:18.759 --> 00:10:20.639 cousin, adjusted R squared. 212 00:10:20.480 --> 00:10:22.679 Sounds like you need to know what each metric tells you. 213 00:10:22.759 --> 00:10:27.279 Definitely, and the guide strongly advises using multiple evaluation metrics 214 00:10:27.279 --> 00:10:31.240 for the same project. Why? Because each evaluation metric is 215 00:10:31.360 --> 00:10:33.519 unique and has its own strength. 216 00:10:33.240 --> 00:10:36.720 So one metric might look good, but another might reveal 217 00:10:36.840 --> 00:10:37.399 a weakness. 218 00:10:37.639 --> 00:10:41.799 Precisely, relying on just one number can be misleading. Looking 219 00:10:41.879 --> 00:10:45.600 at several gives you a much more rounded, robust understanding 220 00:10:45.919 --> 00:10:48.080 of how your model is really performing. 221 00:10:48.519 --> 00:10:52.200 That's a key takeaway. Don't just chase one score, look 222 00:10:52.240 --> 00:10:54.200 at the whole picture. All right, let's zoom out again. 223 00:10:54.639 --> 00:10:57.799 Meta questions. Here's a big one. 
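The classification and regression metrics listed in the evaluation discussion can be written out from their standard definitions. The sketch below uses plain Python so that each formula stays visible; the function names and the sample data are hypothetical, chosen only for illustration.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R squared for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical predictions, for illustration only.
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(clf)
print(reg)
```

In this example accuracy is 0.75, while precision and recall are both about 0.67, a small instance of the point that one number alone can hide a weakness. In practice a library such as scikit-learn offers tested implementations of these metrics.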
When can you actually 224 00:10:57.879 --> 00:10:59.759 say you did a good job on a project? Is 225 00:10:59.759 --> 00:11:01.120 it just about the metrics? 226 00:11:01.480 --> 00:11:03.679 Ah, that's a great question, and the answer, according to 227 00:11:03.720 --> 00:11:06.600 the guide, is definitely not just about the metrics. It 228 00:11:06.639 --> 00:11:10.120 suggests that a data scientist should not be a perfectionist. Instead, 229 00:11:10.200 --> 00:11:13.279 think like an engineer solving a practical problem. Focus on 230 00:11:13.320 --> 00:11:15.519 the best outcome in the shortest amount of time. 231 00:11:15.639 --> 00:11:16.799 So efficiency matters. 232 00:11:17.240 --> 00:11:21.720 Speed, yeah, iteration. Yes, exactly. It mentions an agile style 233 00:11:21.720 --> 00:11:24.600 where the idea is to deliver a result fast and iterate to 234 00:11:24.600 --> 00:11:29.000 improve the work. Get something working, then make it better. Critically, 235 00:11:29.279 --> 00:11:32.360 a good result in accuracy doesn't necessarily mean that your 236 00:11:32.440 --> 00:11:32.840 job is 237 00:11:32.840 --> 00:11:35.200 good. Especially for hard problems. 238 00:11:35.000 --> 00:11:38.639 Especially for hard problems where, maybe due to the data itself, 239 00:11:38.679 --> 00:11:42.000 it is almost impossible to get good accuracy. 240 00:11:42.320 --> 00:11:46.399 So what should the focus be then, if not just accuracy? 241 00:11:46.039 --> 00:11:49.399 The focus should shift to the logic and reasoning behind 242 00:11:49.440 --> 00:11:52.720 the work instead of focusing on the accuracy. Did you 243 00:11:52.759 --> 00:11:56.919 follow a sound process? Can you justify your choices? Did 244 00:11:56.960 --> 00:12:00.480 you address the business problem effectively even if the model 245 00:12:00.679 --> 00:12:01.399 isn't perfect? 246 00:12:01.600 --> 00:12:04.519 That's a really important perspective. 
It's about the methodology, the 247 00:12:04.519 --> 00:12:08.159 critical thinking, the practical impact, not just chasing a percentage 248 00:12:08.159 --> 00:12:09.039 point right. 249 00:12:09.159 --> 00:12:13.600 Sound work, continuous improvement, clear communication of limitations. That's often 250 00:12:13.639 --> 00:12:16.399 more valuable than hitting an arbitrary accuracy target. 251 00:12:16.480 --> 00:12:20.519 Okay, that leads nicely to another practical question, data transformation. 252 00:12:20.639 --> 00:12:23.360 We know it's important, but how much time should we 253 00:12:23.399 --> 00:12:24.279 really be spending on it? 254 00:12:24.320 --> 00:12:26.480 This is another fantastic point from the guide, and the 255 00:12:26.519 --> 00:12:29.480 emphasis is quite strong. It says data transformation is the 256 00:12:29.480 --> 00:12:31.679 most important step in a data science. 257 00:12:31.399 --> 00:12:33.559 Project, the most important more than modeling. 258 00:12:33.879 --> 00:12:37.039 That's the claim. It even states, the more time that 259 00:12:37.159 --> 00:12:40.759 is spent on data transformation, the higher is the model performance. 260 00:12:41.159 --> 00:12:43.080 Wow. Why is it that critical? 261 00:12:43.279 --> 00:12:46.320 Because, as the guide puts it, a machine learning model 262 00:12:46.399 --> 00:12:49.000 is very sensitive to the format of the input data 263 00:12:49.320 --> 00:12:52.000 and the nature of the input data. Garbage in garbage 264 00:12:52.000 --> 00:12:55.720 out applies here too, but also slightly messy data in 265 00:12:56.200 --> 00:13:01.240 slightly messy results out. Good data transformation will value the 266 00:13:01.279 --> 00:13:04.840 input data more, essentially making it easier for the model 267 00:13:04.960 --> 00:13:08.159 to find the key variables to use for training. It 268 00:13:08.200 --> 00:13:10.559 prepares the data optimally for the algorithm. 
269 00:13:10.720 --> 00:13:12.440 Can you give some examples of transformation? 270 00:13:12.799 --> 00:13:16.639 Sure. Things like applying a natural logarithm to a continuous target 271 00:13:16.720 --> 00:13:20.240 variable if it's heavily skewed, using one-hot encoding for 272 00:13:20.320 --> 00:13:25.320 categorical variables, turning categories like red, blue, green into separate binary 273 00:13:24.879 --> 00:13:27.039 columns. Right, so the model can understand 274 00:13:26.639 --> 00:13:31.320 them. Exactly. Or binning transformation, grouping continuous numbers into ranges. 275 00:13:31.480 --> 00:13:34.080 These aren't just busy work. They directly help the model 276 00:13:34.159 --> 00:13:34.759 learn better. 277 00:13:35.039 --> 00:13:38.000 So the real secret sauce isn't just the fancy algorithm. 278 00:13:38.240 --> 00:13:41.960 Often, no. The secret resides in data transformation and how 279 00:13:42.000 --> 00:13:45.080 well it is performed. It's the foundation. Get that right 280 00:13:45.200 --> 00:13:47.480 and your model has a much better chance of success. 281 00:13:47.759 --> 00:13:53.519 That's incredibly insightful, the unsung hero of model performance. Okay, 282 00:13:53.559 --> 00:13:55.799 one last fascinating nugget. I wanted to pull out this 283 00:13:55.840 --> 00:13:58.960 one from the definition section: automation bias. 284 00:13:59.440 --> 00:14:03.480 What's that? Ah, automation bias. This occurs when a human 285 00:14:03.519 --> 00:14:07.240 decision maker favors recommendations made by an automated system over 286 00:14:07.320 --> 00:14:10.600 a non-automated system. Even if the automated system is wrong? 287 00:14:10.720 --> 00:14:14.000 Even if the automated system provides an error, yes. It 288 00:14:14.039 --> 00:14:18.279 stems from overtrusting the machine learning model, perhaps just because 289 00:14:18.279 --> 00:14:19.879 it seems complex or objective. 
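The three transformations named in the data transformation discussion (natural logarithm, one-hot encoding, binning) can be sketched in a few lines. The helper names, the bin edges, and the sample values below are assumptions made for illustration and are not taken from the guide.

```python
import math


def log_transform(values):
    """Natural log of strictly positive values, often used to reduce skew."""
    return [math.log(v) for v in values]


def one_hot(values):
    """Turn each category into a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


def bin_value(x, edges):
    """Return the index of the first bin whose upper edge is greater than x."""
    for i, upper in enumerate(edges):
        if x < upper:
            return i
    return len(edges)  # Values at or above the last edge go in the final bin.


# Hypothetical inputs.
prices = [1.0, 10.0, 100.0, 1000.0]
colors = ["red", "blue", "green", "red"]
ages = [12, 25, 47, 70]
age_edges = [18, 40, 65]  # bins: <18, 18-39, 40-64, 65+

print(log_transform(prices))
print(one_hot(colors))
print([bin_value(a, age_edges) for a in ages])
```

The log transform compresses the wide range of prices, the one-hot step gives every color its own binary column, and binning maps each age to a range index. Libraries such as pandas and scikit-learn provide equivalent, more complete tools.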
290 00:14:20.039 --> 00:14:22.519 That's actually a bit worrying, isn't it? As AI gets 291 00:14:22.559 --> 00:14:25.759 more embedded in decision making, we just blindly trust the machine. 292 00:14:25.799 --> 00:14:27.960 It's a real risk. We see a recommendation from a 293 00:14:28.000 --> 00:14:32.399 sophisticated algorithm and our critical thinking might just switch off. 294 00:14:32.639 --> 00:14:34.240 We assume the machine knows best. 295 00:14:34.440 --> 00:14:36.919 How do we guard against that, as people building these 296 00:14:36.960 --> 00:14:38.320 systems or even just using 297 00:14:38.080 --> 00:14:41.919 them? That's the challenge. The guide doesn't explicitly state solutions, 298 00:14:42.240 --> 00:14:46.360 but it implies the need for awareness. Maybe designing systems 299 00:14:46.360 --> 00:14:50.759 with checks and balances, requiring human oversight for critical decisions, 300 00:14:51.240 --> 00:14:54.440 ensuring transparency so people can question the output. 301 00:14:54.879 --> 00:14:57.360 So the human element isn't just about feeding data in, 302 00:14:57.879 --> 00:15:02.240 it's about maintaining that critical oversight throughout the process. Don't 303 00:15:02.279 --> 00:15:03.519 just accept the output. 304 00:15:03.360 --> 00:15:08.519 Exactly. Active, critical engagement. Don't blindly follow the automated advice, 305 00:15:08.759 --> 00:15:10.559 especially when the stakes are high. 306 00:15:10.600 --> 00:15:13.000 Wow. Okay, we've covered a lot, from when to use 307 00:15:13.039 --> 00:15:16.159 deep learning, to the nuances of type I and type II errors, 308 00:15:16.559 --> 00:15:20.720 handling missing data, the crucial role of evaluation metrics, the 309 00:15:20.759 --> 00:15:23.480 surprising importance of data transformation. 310 00:15:23.120 --> 00:15:25.720 And even that subtle trap of automation bias. 
311 00:15:25.879 --> 00:15:28.759 It really drives home that understanding the why and the 312 00:15:28.799 --> 00:15:32.120 how, the logic and mechanism, is just as vital as 313 00:15:32.320 --> 00:15:33.120 writing the code. 314 00:15:33.240 --> 00:15:36.559 Absolutely, this whole deep dive into the Data Scientist Pocket 315 00:15:36.559 --> 00:15:39.799 Guide really reinforces the idea that knowledge is most valuable 316 00:15:39.840 --> 00:15:43.840 when understood and applied. It's about building that solid conceptual foundation. 317 00:15:44.240 --> 00:15:49.759 So for you, the listener navigating this complex field, what's 318 00:15:49.799 --> 00:15:50.759 the key message here? 319 00:15:51.279 --> 00:15:53.679 I think it's that becoming a good data scientist is 320 00:15:53.720 --> 00:15:57.440 a journey. It really takes continuously learning new techniques and 321 00:15:57.519 --> 00:15:58.600 updating your knowledge. 322 00:15:58.679 --> 00:16:00.000 It's not a one-and-done thing. 323 00:16:00.200 --> 00:16:05.159 Definitely not. It demands discipline and autonomy, and maybe most importantly, 324 00:16:05.240 --> 00:16:09.799 the ability to question assumptions, to seek out reliable understanding 325 00:16:09.799 --> 00:16:12.679 like this guide aims to provide, and always push for 326 00:16:12.720 --> 00:16:15.159 that deeper insight. Don't just scratch the surface. 327 00:16:15.360 --> 00:16:18.639 So the final thought, perhaps, is that while the models get smarter, 328 00:16:18.919 --> 00:16:21.919 our own critical thinking and deep understanding remain the most 329 00:16:22.000 --> 00:16:25.120 valuable assets we bring to the table. Don't automate your 330 00:16:25.120 --> 00:16:25.720 own judgment. 331 00:16:25.879 --> 00:16:28.120 Well put. Keep questioning, keep learning.