WEBVTT 1 00:00:00.120 --> 00:00:02.640 Welcome back to the deep dive, where we unpack complex 2 00:00:02.759 --> 00:00:07.280 topics and bring you the essential insights. Today we're navigating 3 00:00:07.320 --> 00:00:10.359 the exciting world of machine learning on Amazon Web Services. 4 00:00:10.880 --> 00:00:13.039 Our mission for this deep dive is really to distill 5 00:00:13.080 --> 00:00:15.759 the core concepts of machine learning, walk you through its 6 00:00:15.839 --> 00:00:19.280 end-to-end life cycle, the whole process, and also 7 00:00:19.519 --> 00:00:24.000 highlight how AWS services provide a, well, powerful toolkit for 8 00:00:24.079 --> 00:00:27.760 every single stage. We're drawing our insights from the AWS 9 00:00:27.800 --> 00:00:32.359 Certified Machine Learning Specialty MLS-C01 certification guide, trying 10 00:00:32.359 --> 00:00:35.640 to pull out those aha moments and strategic takeaways. The 11 00:00:35.679 --> 00:00:38.079 goal is to make you feel truly well informed, whether 12 00:00:38.159 --> 00:00:41.359 you're strategizing for a meeting or just really curious 13 00:00:41.359 --> 00:00:42.039 about this field. 14 00:00:42.399 --> 00:00:44.280 Yeah, and what's truly valuable in this guide, and what 15 00:00:44.399 --> 00:00:46.920 we'll focus on today, is its ability to break down 16 00:00:46.960 --> 00:00:50.439 these complex ML ideas into actionable knowledge. We're going to 17 00:00:50.479 --> 00:00:53.399 try and give you a clear roadmap, basically, from foundational 18 00:00:53.439 --> 00:00:57.280 definitions right through to practical AWS applications, showing you how 19 00:00:57.280 --> 00:00:59.079 you might build truly intelligent solutions. 20 00:00:59.240 --> 00:01:00.280 Okay, let's dig into this, 21 00:01:00.320 --> 00:01:00.520 then. 22 00:01:00.920 --> 00:01:04.599 The guide starts by clarifying the relationship between artificial intelligence, 23 00:01:04.640 --> 00:01:08.040 machine learning, and deep learning.
I like the analogy they use. 24 00:01:08.079 --> 00:01:11.400 Think of it like a set of nested Russian dolls. Exactly. 25 00:01:11.840 --> 00:01:15.719 So at the outermost layer you've got artificial intelligence, AI. 26 00:01:16.000 --> 00:01:19.079 That's the really broad field, aiming to create machines that 27 00:01:19.159 --> 00:01:24.079 can do tasks mimicking human intelligence. Then, moving inward, machine 28 00:01:24.159 --> 00:01:27.519 learning, ML, is a key subset of AI. This is 29 00:01:27.560 --> 00:01:31.359 where systems learn from data. They identify patterns and make 30 00:01:31.400 --> 00:01:35.719 predictions without being explicitly programmed. It's about learning from experience, 31 00:01:35.760 --> 00:01:37.400 observing, adapting. 32 00:01:36.959 --> 00:01:39.319 Okay, learning from data, not rules. Precisely. 33 00:01:39.519 --> 00:01:42.319 And then at the very core you have deep learning, DL. 34 00:01:42.680 --> 00:01:45.799 That's an even more specialized subset of ML. Deep learning 35 00:01:45.920 --> 00:01:48.439 uses these multi-layered structures, you've probably heard of them, 36 00:01:48.560 --> 00:01:52.439 deep neural networks. They solve highly complex problems. They're powering 37 00:01:52.480 --> 00:01:53.840 a lot of the state-of-the-art stuff we 38 00:01:53.879 --> 00:01:57.319 see today, like language translation or facial recognition. 39 00:01:57.680 --> 00:02:00.920 So what this hierarchy really means for us, for you listening, 40 00:02:00.959 --> 00:02:04.519 is that we're witnessing this incredible evolution. It's fueled by, 41 00:02:04.760 --> 00:02:07.920 well, more computing power and just vast amounts of data 42 00:02:07.920 --> 00:02:11.319 being available now, and AI applications are becoming more powerful, 43 00:02:11.360 --> 00:02:14.680 more accessible, and really applicable across almost every industry.
44 00:02:15.840 --> 00:02:18.560 And when these systems learn, they generally fall into three 45 00:02:18.599 --> 00:02:22.159 main approaches, three ways of learning. The first is supervised learning. 46 00:02:22.960 --> 00:02:26.000 This relies on labeled data. So imagine you have a 47 00:02:26.080 --> 00:02:29.159 data set where every example has an answer already attached. 48 00:02:29.439 --> 00:02:30.479 That's your labeled 49 00:02:30.240 --> 00:02:33.479 data. Right, like inputs and the correct outputs. Exactly. 50 00:02:33.840 --> 00:02:37.680 So one common use for supervised learning is classification. Here 51 00:02:37.759 --> 00:02:40.879 the model predicts a category or class. For instance, the 52 00:02:40.919 --> 00:02:45.840 guide talks about classifying financial transactions. Is this fraudulent or legitimate? 53 00:02:46.039 --> 00:02:49.039 Based on features like amount, time of day, that sort 54 00:02:48.840 --> 00:02:50.439 of thing. Okay, putting things into buckets? 55 00:02:50.520 --> 00:02:53.639 Yeah. And the other key type is regression. The goal 56 00:02:53.719 --> 00:02:56.840 here is to predict a continuous numerical value. This could 57 00:02:56.840 --> 00:02:59.840 be forecasting sales figures for the next quarter, maybe, or 58 00:03:00.039 --> 00:03:01.280 predicting the optimal price for 59 00:03:01.280 --> 00:03:03.520 a product. Got it, predicting a number. 60 00:03:03.960 --> 00:03:08.280 Then there's unsupervised learning. This works with unlabeled data, so 61 00:03:08.360 --> 00:03:12.120 no answer is provided beforehand. Here, the system tries to 62 00:03:12.120 --> 00:03:14.280 find hidden patterns or structures on its own. 63 00:03:14.520 --> 00:03:16.520 Ah, okay, so finding patterns we didn't know 64 00:03:16.479 --> 00:03:19.639 were there. Exactly. A great example of this is clustering: 65 00:03:19.919 --> 00:03:22.879 you group similar data points together.
Think about segmenting your 66 00:03:22.879 --> 00:03:26.560 customer base based on their purchasing behavior, you know, to 67 00:03:26.599 --> 00:03:27.680 understand different 68 00:03:27.360 --> 00:03:29.560 market segments. Right, finding natural groupings. 69 00:03:30.599 --> 00:03:34.319 And finally, we have reinforcement learning. This is where a 70 00:03:34.360 --> 00:03:38.199 system learns by interacting with an environment. It gets rewards 71 00:03:38.199 --> 00:03:42.479 for good decisions and, well, penalties for poor ones. It's 72 00:03:42.520 --> 00:03:44.479 a bit like how we learn through trial and error. 73 00:03:44.560 --> 00:03:47.599 The guide mentions an example like an automated call center 74 00:03:47.639 --> 00:03:51.120 agent learning the best path to resolve customer queries by 75 00:03:51.120 --> 00:03:52.879 getting rewarded for good recommendations. 76 00:03:53.240 --> 00:03:57.319 Interesting, learning by doing, essentially. So this next point seems 77 00:03:57.360 --> 00:04:00.400 crucial because it's about how we actually use these. The 78 00:04:00.479 --> 00:04:05.479 approach you choose, supervised, unsupervised, or reinforcement, totally depends 79 00:04:05.520 --> 00:04:07.080 on your data and the problem you're trying to 80 00:04:07.000 --> 00:04:11.719 solve. Right, absolutely, it's fundamental. Do you have clearly labeled examples? 81 00:04:12.400 --> 00:04:15.719 Supervised is likely your path. Are you looking for hidden 82 00:04:15.759 --> 00:04:19.639 groups in just raw data? Unsupervised. Is it about learning 83 00:04:19.639 --> 00:04:23.800 through interaction and feedback? Reinforcement. The data and the goal 84 00:04:23.920 --> 00:04:24.800 dictate the method. 85 00:04:25.079 --> 00:04:28.360 Makes sense. Now, building effective ML models isn't just about 86 00:04:28.360 --> 00:04:31.680 picking one of those algorithms. It's a structured process.
The 87 00:04:31.720 --> 00:04:35.480 guide highlights something called CRISP-DM, the Cross-Industry Standard 88 00:04:35.480 --> 00:04:38.199 Process for Data Mining, as a blueprint for this. 89 00:04:38.439 --> 00:04:41.560 Yeah, CRISP-DM is really widely used. It provides a clear, 90 00:04:41.680 --> 00:04:46.040 iterative framework with six key phases. It starts with business understanding. 91 00:04:46.399 --> 00:04:49.120 This is all about clearly defining your project objectives, your 92 00:04:49.120 --> 00:04:52.600 success criteria, potential risks. It sounds obvious, but honestly, this 93 00:04:52.639 --> 00:04:54.720 is where many projects can go wrong if the problem 94 00:04:54.720 --> 00:04:55.639 isn't nailed down 95 00:04:55.480 --> 00:04:58.079 precisely. Right, knowing what you're actually trying to achieve. 96 00:04:58.439 --> 00:05:04.399 Then data understanding. This involves collecting, describing, exploring, checking the 97 00:05:04.480 --> 00:05:07.680 quality of your raw data. Data scientists need to be, 98 00:05:07.839 --> 00:05:11.279 well, super skeptical here, look for every nuance. Then comes 99 00:05:11.360 --> 00:05:15.480 data preparation, and this is often the most time-consuming phase, really. 100 00:05:15.519 --> 00:05:20.519 It involves selecting, cleaning, transforming, formatting the data for your chosen 101 00:05:20.279 --> 00:05:21.959 algorithm. Okay, getting the data ready. 102 00:05:22.160 --> 00:05:25.639 Following that is modeling. Here you select the appropriate algorithm, 103 00:05:25.959 --> 00:05:28.839 design your test approach, and train the model. You need 104 00:05:28.839 --> 00:05:32.120 to distinguish between parameters, which are learned from the data itself, 105 00:05:32.360 --> 00:05:34.920 and hyperparameters, which are like knobs you turn to 106 00:05:34.959 --> 00:05:38.839 control the learning process. The fifth phase is evaluation.
You 107 00:05:38.920 --> 00:05:42.920 review the model's performance against those initial business success criteria you 108 00:05:42.959 --> 00:05:44.720 defined. And if it's not good enough? 109 00:05:45.040 --> 00:05:48.399 That's the key. ML is iterative. It's a scientific process. 110 00:05:48.600 --> 00:05:51.000 If your model isn't cutting it, you loop back. Maybe 111 00:05:51.000 --> 00:05:53.439 you tune those hyperparameters, maybe you need more data, 112 00:05:53.560 --> 00:05:58.399 maybe you even need to rethink the business problem itself. Finally, deployment, 113 00:05:59.000 --> 00:06:02.519 getting your model into production. This involves creating pipelines for 114 00:06:02.560 --> 00:06:06.000 continuous training and inference and setting up monitoring to catch 115 00:06:06.560 --> 00:06:09.920 model drift. Model drift? Yeah, that's when a model's performance 116 00:06:09.959 --> 00:06:13.040 degrades over time because the real-world data or patterns change, 117 00:06:13.319 --> 00:06:16.920 so you need to monitor and potentially retrain. And you know, 118 00:06:16.959 --> 00:06:20.000 if we connect this back to the AWS certification, the 119 00:06:20.079 --> 00:06:24.560 four domains covered in the exam, data engineering, exploratory data analysis, modeling, 120 00:06:24.680 --> 00:06:27.879 and MLOps, right, they really map quite directly to 121 00:06:27.920 --> 00:06:30.279 these CRISP-DM stages. It's the complete life cycle. 122 00:06:30.360 --> 00:06:32.279 Okay, that framework makes a lot of sense. Now, you 123 00:06:32.319 --> 00:06:35.000 mentioned data preparation is often the most time-consuming part. 124 00:06:35.079 --> 00:06:37.839 The guide really stresses this too. It's the absolute foundation 125 00:06:38.000 --> 00:06:41.279 for any good model. Get the data wrong and, well, 126 00:06:41.319 --> 00:06:41.959 nothing else 127 00:06:41.800 --> 00:06:45.000 matters much. Absolutely. Garbage in, garbage out.
As they say. 128 00:06:45.560 --> 00:06:48.600 A critical first step is understanding your feature types, the 129 00:06:48.680 --> 00:06:51.040 kind of data you have. So you've got numerical data. 130 00:06:51.360 --> 00:06:54.800 This could be discrete, like countable items, number of clicks maybe, 131 00:06:55.120 --> 00:06:59.199 or continuous, measurements with potentially infinite values, like temperature or 132 00:06:59.160 --> 00:07:01.079 price. Numbers, discrete or continuous. 133 00:07:01.160 --> 00:07:04.319 Got it. Then you have categorical data. This describes qualities 134 00:07:04.399 --> 00:07:07.920 or labels. It can be nominal, labels without any inherent order, 135 00:07:08.480 --> 00:07:11.720 like colors or types of products, or ordinal, labels that 136 00:07:11.839 --> 00:07:15.079 do have a meaningful order, like low, medium, high, or 137 00:07:15.240 --> 00:07:16.040 education levels. 138 00:07:16.079 --> 00:07:18.439 Okay, categories with or without an order. 139 00:07:18.360 --> 00:07:22.600 Right, and categorical data, especially nominal, usually can't be fed 140 00:07:22.600 --> 00:07:26.680 directly into most algorithms. It needs transforming into numbers. For example, 141 00:07:26.759 --> 00:07:29.560 for that nominal data without order, like countries, we often 142 00:07:29.680 --> 00:07:32.360 use one-hot encoding. This creates a new binary column, 143 00:07:32.399 --> 00:07:35.720 a zero or one, for each category. It avoids accidentally 144 00:07:35.759 --> 00:07:39.040 implying that, say, country three is somehow greater than country two. 145 00:07:39.279 --> 00:07:42.680 Ah, avoids creating a false order. Exactly. Whereas for ordinal 146 00:07:42.759 --> 00:07:47.439 data, like those education levels, ordinal encoding preserves that inherent sequence.
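To make the one-hot versus ordinal distinction above concrete, here is a minimal sketch in plain Python. The helper names (`fit_one_hot`, `one_hot`, `ordinal`), the country values, and the education ordering are all illustrative assumptions, not something taken from the guide; a real project would more likely use a library encoder.

```python
# Illustrative sketch only: helper names and sample values are assumptions.

def fit_one_hot(values):
    """Learn the list of known categories (sorted for a stable column order)."""
    return sorted(set(values))

def one_hot(value, categories):
    """One 0/1 column per known category; an unseen value maps to all zeros."""
    return [1 if value == c else 0 for c in categories]

# Ordinal data: the order is written down by a person because it carries meaning.
EDUCATION_ORDER = ["high school", "bachelor", "master", "doctorate"]

def ordinal(value, order):
    """Encode a label as its position in the agreed order."""
    return order.index(value)

countries = ["Brazil", "Canada", "Brazil", "India"]
categories = fit_one_hot(countries)

# No country ends up "greater" than another: each gets its own column.
encoded = [one_hot(c, categories) for c in countries]

# Education levels keep their sequence: master (2) > bachelor (1).
level = ordinal("master", EDUCATION_ORDER)
```

The all-zeros row for an unseen category is just one possible design choice; raising an error is another reasonable one.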
147 00:07:48.839 --> 00:07:51.639 Now the crucial rule here, and this trips people up 148 00:07:51.720 --> 00:07:55.000 sometimes, is that any encoder you create must be fitted 149 00:07:55.120 --> 00:07:58.279 only on your training data. Then you use that same 150 00:07:58.360 --> 00:08:01.279 fitted encoder to transform your test data and any new 151 00:08:01.279 --> 00:08:04.959 production data. You never refit on test data; that introduces bias. 152 00:08:05.040 --> 00:08:06.959 Okay, fit on train, transform on test. 153 00:08:07.000 --> 00:08:09.439 Got it. Now, for numerical features, you often need to 154 00:08:09.439 --> 00:08:13.199 adjust their scale. Data normalization, for instance, might scale data 155 00:08:13.199 --> 00:08:16.279 to a range between zero and one. This is really vital 156 00:08:16.319 --> 00:08:19.120 for algorithms that are sensitive to the magnitude of numbers, 157 00:08:19.160 --> 00:08:21.480 like neural networks or k-nearest 158 00:08:21.040 --> 00:08:23.879 neighbors. So they don't overweight big numbers. Precisely. 159 00:08:24.199 --> 00:08:28.399 Alternatively, data standardization transforms data to have a mean of 160 00:08:28.480 --> 00:08:31.720 zero and a standard deviation of one. This is fantastic 161 00:08:31.759 --> 00:08:35.039 for identifying outliers, for example. And for features that are 162 00:08:35.039 --> 00:08:38.480 skewed, think income distributions, often bunched up at one end, 163 00:08:38.960 --> 00:08:42.559 logarithmic and power transformations like the Box-Cox method can 164 00:08:42.559 --> 00:08:45.559 make them more symmetrical, more like a bell curve, and 165 00:08:45.600 --> 00:08:49.360 that often significantly improves the performance of many algorithms, like 166 00:08:49.480 --> 00:08:50.440 linear regression. 167 00:08:50.639 --> 00:08:53.200 Wow, lots of ways to wrangle the data. What about 168 00:08:53.240 --> 00:08:55.440 problems like missing values?
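The "fit on train, transform on test" rule and the two scaling methods just described can be sketched as follows. This is a hedged, standard-library-only illustration: the function names and the sample numbers are assumptions, and it does not guard against a constant column (where `hi == lo` or the standard deviation is zero).

```python
from statistics import mean, pstdev

def fit_min_max(train):
    """Learn the range from the training data only."""
    return min(train), max(train)

def min_max_scale(x, lo, hi):
    """Normalization: training values land in [0, 1]."""
    return (x - lo) / (hi - lo)

def fit_standard(train):
    """Learn mean and (population) standard deviation from the training data only."""
    return mean(train), pstdev(train)

def standardize(x, mu, sigma):
    """Standardization: training values end up with mean 0 and std 1."""
    return (x - mu) / sigma

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

# Fit once, on the training split.
lo, hi = fit_min_max(train)
mu, sigma = fit_standard(train)

# Reuse the fitted statistics on the test split; never refit here.
scaled_test = [min_max_scale(x, lo, hi) for x in test]
z_test = [standardize(x, mu, sigma) for x in test]
z_train = [standardize(x, mu, sigma) for x in train]
```

Note that a test value outside the training range (50.0 here) scales to something above 1. That is expected when the statistics come from the training split alone.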
169 00:08:55.519 --> 00:08:57.440 Yeah, that's a common one. First, you have to try 170 00:08:57.480 --> 00:09:00.600 and understand why they're missing. Is it random, or 171 00:09:00.679 --> 00:09:03.960 is there a pattern? Options range from just listwise deletion, 172 00:09:04.080 --> 00:09:07.240 discarding rows or columns with missing data, but be careful, 173 00:09:07.279 --> 00:09:10.720 you might lose valuable information, to imputation, where you replace 174 00:09:10.799 --> 00:09:14.200 missing values. Simple imputation might use the mean, or the median, 175 00:09:14.200 --> 00:09:16.840 which is less sensitive to outliers, or the mode for 176 00:09:16.919 --> 00:09:20.519 categorical data, but you can get more sophisticated, even using 177 00:09:20.519 --> 00:09:22.960 other ML models to predict what the missing values should be. 178 00:09:23.120 --> 00:09:26.600 Okay, and outliers, those weird data points? 179 00:09:26.399 --> 00:09:30.759 So, another common hurdle. Outliers are data points significantly different 180 00:09:30.799 --> 00:09:34.639 from the rest. They can dramatically skew your model's understanding, 181 00:09:35.080 --> 00:09:38.480 like pulling a regression line way off course. Tools like 182 00:09:38.600 --> 00:09:42.000 z-scores or visualizing with box plots help detect them. 183 00:09:42.519 --> 00:09:45.120 Once found, you might remove them or maybe just flag 184 00:09:45.159 --> 00:09:46.759 them so your model knows they're unusual. 185 00:09:46.919 --> 00:09:50.039 Makes sense. And what if the data is, like, really unbalanced? 186 00:09:50.039 --> 00:09:51.639 You mentioned fraud detection earlier. 187 00:09:51.440 --> 00:09:54.840 Right, unbalanced data sets, very common. Say only one percent 188 00:09:54.879 --> 00:09:57.960 of your transactions are actually fraudulent.
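Looking back at the missing-value and outlier steps discussed a moment ago, the following sketch shows median imputation and z-score flagging using only the standard library. The helper names, the threshold of 3, and the sample amounts are assumptions chosen for illustration.

```python
from statistics import mean, median, pstdev

def fit_median(column):
    """Median of the observed (non-missing) values; missing is None here."""
    return median([v for v in column if v is not None])

def impute(column, fill):
    """Replace each missing value with the supplied fill value."""
    return [fill if v is None else v for v in column]

def flag_outliers(column, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(column), pstdev(column)
    return [abs(v - mu) / sigma > threshold for v in column]

# One missing amount, filled with the median of the observed values.
amounts = [10.0, 11.0, None, 13.0, 14.0]
fill = fit_median(amounts)
imputed = impute(amounts, fill)

# Twenty ordinary amounts plus one extreme value that stands far from the rest.
history = imputed * 4 + [500.0]
flags = flag_outliers(history)
```

With very small samples a z-score can never get large (for n points it is bounded by roughly the square root of n), so this check only becomes meaningful once there is a reasonable amount of data. Flagging rather than deleting keeps the option of letting the model see that a point was unusual.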
Your model might just 189 00:09:58.039 --> 00:10:01.159 learn to always predict "not fraud," because that's accurate ninety 190 00:10:01.200 --> 00:10:04.039 nine percent of the time, but it misses the important cases. 191 00:10:04.559 --> 00:10:07.080 So to address this, you can tune your algorithm, maybe 192 00:10:07.120 --> 00:10:09.399 tell it to pay more attention to the rare class using 193 00:10:09.480 --> 00:10:12.720 something like a class weight hyperparameter. Or you can resample 194 00:10:12.759 --> 00:10:16.279 your data. Either undersample the majority class, just use fewer 195 00:10:16.320 --> 00:10:20.120 examples of "not fraud," or oversample the minority class. A 196 00:10:20.159 --> 00:10:23.960 popular technique for oversampling is SMOTE, the Synthetic Minority 197 00:10:24.000 --> 00:10:28.159 Oversampling Technique. It intelligently creates new synthetic examples of 198 00:10:28.200 --> 00:10:29.759 the rare class to help balance things 199 00:10:29.639 --> 00:10:34.759 out. SMOTE, okay, creating fake but plausible examples? Kind 200 00:10:34.519 --> 00:10:38.080 of, yeah, based on the characteristics of the existing minority examples. 201 00:10:38.879 --> 00:10:42.799 And finally, preparing text data for ML, or natural language 202 00:10:42.799 --> 00:10:47.120 processing, NLP. This has evolved a lot. Older methods like 203 00:10:47.200 --> 00:10:51.679 bag of words, BoW, just count how often words appear. Simple, 204 00:10:51.720 --> 00:10:56.159 but it loses context. More advanced techniques like word embeddings, used 205 00:10:56.159 --> 00:10:59.480 in models like Word2Vec or GloVe, represent words 206 00:10:59.519 --> 00:11:03.320 as dense numerical vectors. What's fascinating here is these vectors 207 00:11:03.360 --> 00:11:06.960 capture semantic meaning. Words with similar meanings end up closer 208 00:11:07.000 --> 00:11:09.240 together in this multidimensional space. So
209 00:11:09.159 --> 00:11:12.000 the model understands relationships between words in 210 00:11:11.919 --> 00:11:14.759 a mathematical sense? Yes, it captures context and meaning much 211 00:11:14.799 --> 00:11:15.799 better than just counting. 212 00:11:16.080 --> 00:11:18.559 That's a really thorough look at data prep. It's clear 213 00:11:18.600 --> 00:11:22.600 it's critical and, well, often complex. But all this meticulously 214 00:11:22.679 --> 00:11:25.559 prepared information needs a robust place to live. You need 215 00:11:25.600 --> 00:11:28.480 to store it somewhere, and on AWS, that journey often 216 00:11:28.480 --> 00:11:31.399 begins with S3, right? Our digital warehouse. Where do 217 00:11:31.440 --> 00:11:32.879 we store all this data for ML? 218 00:11:33.000 --> 00:11:36.480 You're absolutely right. The storage choice is fundamental. Amazon S3, 219 00:11:36.559 --> 00:11:40.519 Simple Storage Service, is very often the starting point 220 00:11:40.799 --> 00:11:45.200 and the core. It's object storage, known for its incredible durability, 221 00:11:45.720 --> 00:11:49.200 designed for eleven nines of durability, which is just astronomical protection 222 00:11:49.279 --> 00:11:53.200 against data loss. It's highly scalable. You store objects, your 223 00:11:53.200 --> 00:11:56.200 files basically, within these things called buckets, which are specific 224 00:11:56.200 --> 00:12:00.159 to an AWS region, and S3 offers different storage classes. 225 00:12:00.480 --> 00:12:03.159 This lets you optimize costs based on how frequently you 226 00:12:03.200 --> 00:12:06.240 need to access the data. Data you access rarely can 227 00:12:06.279 --> 00:12:09.919 go into cheaper, colder storage. Plus, it has robust access 228 00:12:09.960 --> 00:12:12.480 control and encryption options to keep everything secure. OK.
229 00:12:12.720 --> 00:12:15.639 S3 for scalable, durable object storage. What about more 230 00:12:15.679 --> 00:12:17.840 structured data, like traditional databases? 231 00:12:18.000 --> 00:12:21.799 For that, Amazon Relational Database Service, RDS, is the managed service. 232 00:12:21.840 --> 00:12:25.120 It supports popular engines like MySQL, PostgreSQL, Oracle, etc. 233 00:12:25.879 --> 00:12:29.480 A key feature for reliability is Multi-AZ deployments. This 234 00:12:29.559 --> 00:12:32.360 automatically creates a synchronous standby copy of your database in 235 00:12:32.399 --> 00:12:35.440 a different Availability Zone, so if one AZ has an issue, 236 00:12:35.480 --> 00:12:36.759 it fails over automatically. 237 00:12:36.919 --> 00:12:39.440 Great for high availability, so it keeps running even if 238 00:12:39.440 --> 00:12:41.519 there's an outage in one place. Exactly. 239 00:12:41.559 --> 00:12:44.279 And for scaling read performance, especially for applications that do 240 00:12:44.320 --> 00:12:47.720 a lot of reading, you can use read replicas. These 241 00:12:47.720 --> 00:12:50.960 are asynchronously replicated copies of your main database. You can 242 00:12:50.960 --> 00:12:53.240 point your read-heavy traffic to them. You can even 243 00:12:53.240 --> 00:12:56.480 place them in different regions for global reach. This directly 244 00:12:56.519 --> 00:12:59.759 impacts your RPO, recovery point objective, how much data you 245 00:12:59.799 --> 00:13:03.559 might lose, and RTO, recovery time objective, how fast you recover. 246 00:13:04.440 --> 00:13:07.679 Multi-AZ and read replicas help you achieve low RPO 247 00:13:07.759 --> 00:13:08.679 and RTO. 248 00:13:08.440 --> 00:13:10.840 Makes sense. Availability and read scaling. 249 00:13:10.879 --> 00:13:14.519 Hm, and beyond S3 and RDS, AWS has specialized 250 00:13:14.519 --> 00:13:18.159 stores too.
Amazon Redshift is a data warehouse optimized for 251 00:13:18.200 --> 00:13:22.240 analyzing massive data sets using SQL, and Amazon DynamoDB is 252 00:13:22.279 --> 00:13:25.519 a fully managed NoSQL database for key-value and 253 00:13:25.600 --> 00:13:29.039 document data, where you need super fast, flexible access at 254 00:13:29.039 --> 00:13:29.919 really any scale. 255 00:13:30.120 --> 00:13:32.200 Okay, so a whole range of options. The key takeaway 256 00:13:32.240 --> 00:13:34.279 here seems to be it's not just about storing data, 257 00:13:34.279 --> 00:13:36.440 it's about choosing the right storage for the right kind 258 00:13:36.480 --> 00:13:40.559 of data, getting that optimal balance of availability, performance, security, 259 00:13:40.919 --> 00:13:44.240 and cost for your specific ML use 260 00:13:44.080 --> 00:13:46.840 case. Precisely, matching the tool to the job. 261 00:13:47.279 --> 00:13:50.120 So once our data is carefully stored and prepped, we 262 00:13:50.200 --> 00:13:52.759 often need to process it further, maybe transform it in 263 00:13:52.799 --> 00:13:55.519 bulk or analyze streams of it. The guide walks us 264 00:13:55.559 --> 00:13:58.840 through AWS services for both batch processing and real-time 265 00:13:58.879 --> 00:13:59.440 stuff. 266 00:13:59.559 --> 00:14:04.399 Yeah, for large-scale data transformation and movement, like ETL, extract, transform, 267 00:14:04.519 --> 00:14:09.159 load, AWS Glue is a really powerful, fully managed service. 268 00:14:09.200 --> 00:14:11.759 Its secret sauce is the Data Catalog. You can 269 00:14:11.799 --> 00:14:15.759 automatically crawl your data sources, figure out the schema, detect changes, 270 00:14:15.799 --> 00:14:19.240 and make it all queryable.
Then Glue's ETL jobs, which 271 00:14:19.320 --> 00:14:22.000 usually run on Apache Spark, do the heavy lifting 272 00:14:22.039 --> 00:14:25.480 of the actual data transformation, maybe copying and cleaning data 273 00:14:25.480 --> 00:14:27.360 from S3 into Redshift, for example. 274 00:14:27.480 --> 00:14:30.039 So Glue handles the whole ETL pipeline? 275 00:14:29.600 --> 00:14:32.240 Pretty much, in a serverless way. Now, if you just 276 00:14:32.279 --> 00:14:34.279 want to query data that's already sitting in S3 277 00:14:34.600 --> 00:14:37.879 without moving or transforming it first, Amazon Athena is amazing 278 00:14:37.919 --> 00:14:41.240 for this. It's serverless, interactive. You use standard SQL to query 279 00:14:41.279 --> 00:14:46.080 data directly in S3 across various formats, CSV, JSON, Parquet, ORC, 280 00:14:46.600 --> 00:14:49.200 no infrastructure to manage. It's incredibly fast for ad hoc 281 00:14:49.240 --> 00:14:50.440 analysis or quick 282 00:14:50.240 --> 00:14:53.720 exploration. Schema on read, right? You define the structure as 283 00:14:53.600 --> 00:14:57.639 you query it. Exactly. Now, for processing real-time streaming data, 284 00:14:57.960 --> 00:15:01.639 we turn to Amazon Kinesis. Kinesis Data Streams can capture 285 00:15:01.679 --> 00:15:04.639 and store huge amounts of data per second from loads 286 00:15:04.639 --> 00:15:09.200 of sources: website clicks, IoT sensors, financial transactions. You can 287 00:15:09.200 --> 00:15:11.919 then build applications to process this stream in real time. 288 00:15:12.320 --> 00:15:14.960 Then there's Kinesis Data Firehose. This is a fully 289 00:15:14.960 --> 00:15:17.759 managed service that takes that streaming data and automatically loads 290 00:15:17.799 --> 00:15:21.279 it into destinations like S3, Redshift, or analytics services.
291 00:15:21.519 --> 00:15:23.679 It can even transform the data on the fly using 292 00:15:23.720 --> 00:15:25.799 AWS Lambda before delivering it. 293 00:15:25.879 --> 00:15:28.399 So Firehose is more about getting the stream into 294 00:15:28.399 --> 00:15:30.159 storage or other services easily. 295 00:15:30.480 --> 00:15:33.639 Yeah, it simplifies the delivery part. And what about getting data 296 00:15:33.639 --> 00:15:38.039 from your own data centers into AWS? AWS Storage Gateway 297 00:15:38.039 --> 00:15:41.360 connects your on-premises software appliances to cloud storage using 298 00:15:41.399 --> 00:15:45.360 standard file or block protocols. For really massive data transfers 299 00:15:45.440 --> 00:15:48.240 where the Internet is too slow, you have the AWS 300 00:15:48.279 --> 00:15:51.639 Snow Family. These are physical devices like Snowball Edge, which 301 00:15:51.679 --> 00:15:54.679 is like a ruggedized suitcase computer, or even Snowmobile, a 302 00:15:54.720 --> 00:15:57.799 whole shipping container. You load data onto them locally, ship 303 00:15:57.840 --> 00:16:00.679 them to AWS, and they upload it securely, much faster 304 00:16:00.759 --> 00:16:02.039 for petabytes. A truck 305 00:16:01.840 --> 00:16:03.879 full of data, literally? Pretty much. 306 00:16:04.159 --> 00:16:07.519 And AWS DataSync is great for ongoing online data 307 00:16:07.519 --> 00:16:11.360 transfer between your on-premises storage and AWS services like 308 00:16:11.440 --> 00:16:15.360 S3 or EFS. Finally, for those really big computation-heavy 309 00:16:15.480 --> 00:16:17.919 batch jobs, things that might take hours or days 310 00:16:18.240 --> 00:16:21.799 or need massive resources beyond what Lambda offers, AWS Batch 311 00:16:21.840 --> 00:16:24.240 lets you schedule and run these efficiently.
It manages the 312 00:16:24.320 --> 00:16:28.480 job queues, provisions the right compute resources, like EC2 instances, 313 00:16:28.720 --> 00:16:29.919 and scales automatically. 314 00:16:30.120 --> 00:16:33.480 Okay. This really covers the spectrum, from analyzing static data 315 00:16:33.519 --> 00:16:37.080 with Athena and Glue to handling real-time streams with Kinesis, 316 00:16:37.399 --> 00:16:41.159 and even moving massive data sets physically. AWS seems to 317 00:16:41.159 --> 00:16:43.559 have a tool for almost every data processing need. 318 00:16:43.720 --> 00:16:45.639 It's a very comprehensive set of services. 319 00:16:45.759 --> 00:16:49.120 Now, before we dive headfirst into coding raw algorithms, the 320 00:16:49.200 --> 00:16:51.879 guide makes a point of highlighting AWS's out-of-the-box 321 00:16:51.919 --> 00:16:55.879 AI services. These seem designed to make advanced ML 322 00:16:55.960 --> 00:16:58.679 accessible even if you're not a deep learning expert. Right? 323 00:16:58.919 --> 00:17:00.960 No model building required? Exactly. 324 00:17:01.080 --> 00:17:04.599 These are pre-trained, managed services. You use them via 325 00:17:04.680 --> 00:17:08.799 simple API calls. They bring sophisticated AI capabilities directly into 326 00:17:08.799 --> 00:17:12.799 your applications with minimal fuss. For example, Amazon Rekognition provides 327 00:17:12.839 --> 00:17:16.759 powerful visual analysis. It can detect objects, people, faces, text 328 00:17:16.759 --> 00:17:20.240 in images and videos, even sentiment analysis on faces. Amazon 329 00:17:20.240 --> 00:17:24.759 Polly converts text into remarkably lifelike speech, loads of voices, languages, 330 00:17:24.880 --> 00:17:27.240 great for accessibility or creating voice interfaces. 331 00:17:27.319 --> 00:17:29.799 Polly speaks and Rekognition sees. Right. 332 00:17:29.880 --> 00:17:32.319 And Amazon Transcribe is the opposite of Polly.
It converts 333 00:17:32.319 --> 00:17:36.839 speech into text, excellent for transcribing audio, video calls, generating captions. 334 00:17:37.039 --> 00:17:41.160 It supports custom vocabularies too, for better accuracy in specific domains. 335 00:17:41.480 --> 00:17:46.000 Amazon Comprehend digs into unstructured text, think customer reviews, emails, 336 00:17:46.079 --> 00:17:50.319 social media feeds. It pulls out insights like sentiment, positive, negative, neutral, 337 00:17:50.519 --> 00:17:53.200 key phrases, entities, even topics. 338 00:17:52.880 --> 00:17:54.799 So Comprehend understands text. 339 00:17:55.480 --> 00:18:00.839 Amazon Translate provides high-quality, real-time language translation between languages. 340 00:18:01.119 --> 00:18:04.519 Amazon Textract is really interesting. It goes beyond basic OCR, 341 00:18:04.920 --> 00:18:08.480 optical character recognition. It understands the structure of documents, so 342 00:18:08.519 --> 00:18:11.000 it can extract data not just as raw text, but 343 00:18:11.039 --> 00:18:15.319 specifically from forms and tables, preserving their layout and relationships. 344 00:18:15.359 --> 00:18:17.039 Super useful for document 345 00:18:16.680 --> 00:18:19.880 processing. Wow, understanding forms and tables, not just text. 346 00:18:20.160 --> 00:18:23.720 Yeah, and finally, Amazon Lex. This is the engine that 347 00:18:23.759 --> 00:18:28.880 powers Amazon Alexa. It lets you build sophisticated conversational interfaces, chatbots, 348 00:18:29.160 --> 00:18:34.759 voice bots, using natural language understanding, NLU, and automatic speech recognition, ASR. 349 00:18:35.640 --> 00:18:40.039 You define the user's goals, intents, the information needed, slots, 350 00:18:40.079 --> 00:18:44.440 and sample phrases, utterances, and Lex handles the complex conversation flow.
351 00:18:45.039 --> 00:18:47.599 Okay, that's an incredible menu of ready-to-use AI. 352 00:18:47.839 --> 00:18:50.839 It really lowers the barrier to entry. But it begs the 353 00:18:50.920 --> 00:18:56.039 question for you listening: how do you decide? When should 354 00:18:56.079 --> 00:18:58.960 you use these powerful pre-built tools versus actually diving 355 00:18:59.000 --> 00:19:01.400 in and building a custom ML model from scratch? 356 00:19:01.759 --> 00:19:04.839 That's a really important strategic decision, and the answer often 357 00:19:04.880 --> 00:19:08.400 comes down to specificity and control. For common, well-defined 358 00:19:08.400 --> 00:19:13.279 tasks like general translation, sentiment analysis, standard object recognition in images, 359 00:19:13.599 --> 00:19:16.440 these managed services are often the fastest, easiest, and most 360 00:19:16.440 --> 00:19:19.880 cost-effective path. They're pre-trained by AWS on massive 361 00:19:19.960 --> 00:19:22.880 data sets, so you benefit from that expertise with minimal 362 00:19:22.880 --> 00:19:26.039 development effort. You don't need deep ML knowledge to integrate 363 00:19:26.079 --> 00:19:27.519 them via APIs. 364 00:19:27.160 --> 00:19:28.680 So use them for the standard stuff. 365 00:19:28.960 --> 00:19:32.839 Generally, yes. However, if your problem is highly specialized, maybe 366 00:19:32.839 --> 00:19:36.240 involves unique data types not covered by the services, or 367 00:19:36.279 --> 00:19:38.759 if you need fine-grained control over the model architecture 368 00:19:38.799 --> 00:19:41.759 or the training process or the specific performance trade-offs, 369 00:19:42.240 --> 00:19:45.240 that's when building a custom model, probably using a platform 370 00:19:45.279 --> 00:19:48.720 like Amazon SageMaker, becomes the better choice. It gives 371 00:19:48.759 --> 00:19:52.319 you full flexibility, but requires more ML expertise and effort. 372 00:19:52.480 --> 00:19:55.640 Got it.
Use managed services for speed and common tasks, 373 00:19:55.720 --> 00:19:59.359 build custom for unique needs and control. Okay, now let's 374 00:19:59.359 --> 00:20:01.519 go deeper into custom model building, into the heart 375 00:20:01.559 --> 00:20:05.359 of ML, the algorithms themselves. The guide outlines AWS's built 376 00:20:05.359 --> 00:20:08.359 in algorithms available in SageMaker, which are often optimized 377 00:20:08.359 --> 00:20:10.880 for the AWS environment. But first, maybe a quick word 378 00:20:10.920 --> 00:20:13.559 on ensemble models. The guide mentions these are pretty powerful. 379 00:20:13.759 --> 00:20:17.000 Yeah. Ensemble methods are a really important concept. The idea 380 00:20:17.039 --> 00:20:20.319 is to combine multiple individual ML models to get better 381 00:20:20.359 --> 00:20:23.599 predictive performance than any single model could achieve on its own. 382 00:20:24.119 --> 00:20:28.640 One main type is bagging, short for bootstrap aggregating. Like in 383 00:20:28.680 --> 00:20:32.640 a random forest algorithm, you train many models, usually decision trees, 384 00:20:32.920 --> 00:20:35.960 independently on different random samples of your data, and then 385 00:20:36.000 --> 00:20:39.680 you average their predictions for regression or take a majority 386 00:20:39.759 --> 00:20:42.799 vote for classification. It helps reduce variance. 387 00:20:43.039 --> 00:20:45.400 So wisdom of the crowd applied to models. 388 00:20:45.680 --> 00:20:49.440 Kinda, yeah. The other main type is boosting. Here models 389 00:20:49.440 --> 00:20:52.240 are trained sequentially. Each new model focuses on correcting the 390 00:20:52.319 --> 00:20:54.960 errors made by the previous ones. It builds a strong 391 00:20:55.039 --> 00:20:58.880 predictor iteratively. Algorithms like AdaBoost or the very popular 392 00:20:59.000 --> 00:21:02.920 XGBoost use this approach.
Boosting often leads to very high accuracy, 393 00:21:03.160 --> 00:21:04.880 but you need to be careful about overfitting. 394 00:21:05.079 --> 00:21:08.319 Okay, bagging is parallel, boosting is sequential. Makes sense. So 395 00:21:08.400 --> 00:21:10.920 what are some of the key built-in algorithms 396 00:21:10.960 --> 00:21:13.720 SageMaker offers for, say, supervised learning? 397 00:21:13.799 --> 00:21:17.119 Right. For supervised tasks with labeled data, SageMaker has 398 00:21:17.160 --> 00:21:20.680 several optimized algorithms. The Linear Learner algorithm is a good 399 00:21:20.680 --> 00:21:25.400 starting point. It's versatile, handling both regression (predicting numbers) and 400 00:21:25.480 --> 00:21:30.599 classification (predicting categories). It's great for understanding linear relationships and 401 00:21:30.720 --> 00:21:33.680