WEBVTT

1
00:00:00.080 --> 00:00:02.600
You've seen the headlines. You've definitely heard the buzz about AI,

2
00:00:03.160 --> 00:00:07.440
and maybe you've wondered how those really cool prototypes actually

3
00:00:07.440 --> 00:00:10.480
become real-world solutions.

4
00:00:09.919 --> 00:00:12.759
Right, the ones that deliver actual, tangible value.

5
00:00:12.919 --> 00:00:16.359
Exactly. What does it really take to move AI from

6
00:00:16.399 --> 00:00:19.320
just a concept, like a neat idea, to something that

7
00:00:19.359 --> 00:00:22.519
actually drives business results? That's what we're getting into today.

8
00:00:22.640 --> 00:00:24.600
Yeah, our mission here is to take you on a

9
00:00:24.640 --> 00:00:29.440
deep dive into, well, productionizing AI. We're using insights from

10
00:00:29.519 --> 00:00:34.240
Barry Walsh's guide on delivering AI B2B solutions,

11
00:00:34.240 --> 00:00:36.240
specifically with cloud and Python.

12
00:00:35.960 --> 00:00:37.920
And we want to cut through the jargon, right?

13
00:00:37.840 --> 00:00:42.000
Absolutely. Focus on the essential nuggets, you know: the core ecosystem,

14
00:00:42.039 --> 00:00:45.640
the practical steps, the best practices for building AI apps

15
00:00:45.840 --> 00:00:47.560
that actually work and succeed out there.

16
00:00:47.600 --> 00:00:50.039
Because this isn't just about the technology itself, is it?

17
00:00:50.039 --> 00:00:53.640
It's more about understanding that whole journey, how AI moves

18
00:00:53.640 --> 00:00:57.200
from, you know, pure hype to actual, concrete return on

19
00:00:57.280 --> 00:00:58.479
investment, ROI.

20
00:00:58.640 --> 00:00:59.280
That's the key.

21
00:00:59.359 --> 00:01:01.960
We want to show you why employers are so focused

22
00:01:02.000 --> 00:01:07.120
on these high-value AI solutions and how understanding this

23
00:01:07.200 --> 00:01:11.000
process can give you some real aha moments about the

24
00:01:11.000 --> 00:01:14.200
future of tech and business, and maybe your role in it too.

25
00:01:14.519 --> 00:01:17.840
So, okay, let's start maybe by demystifying the AI ecosystem

26
00:01:17.879 --> 00:01:21.719
a bit, and what we even mean by productionizing

27
00:01:20.920 --> 00:01:22.040
AI. Good idea.

28
00:01:22.280 --> 00:01:25.040
For years, AI felt like it was always just around

29
00:01:25.040 --> 00:01:28.560
the corner. But now, well, now it's actually delivering, and

30
00:01:28.760 --> 00:01:31.159
things like the pandemic definitely accelerated that.

31
00:01:31.280 --> 00:01:33.680
Saw that with chatbots, right? Suddenly they were everywhere for

32
00:01:33.680 --> 00:01:34.719
customer service.

33
00:01:34.519 --> 00:01:38.920
Exactly, or deep learning aiding healthcare diagnostics, even computer vision

34
00:01:39.000 --> 00:01:42.000
for, remember, the social distancing stuff? Oh yeah. These weren't

35
00:01:42.040 --> 00:01:44.640
just lab experiments. They became pretty critical tools, fast.

36
00:01:45.200 --> 00:01:48.040
And the really big shift, Walsh points this out, is

37
00:01:48.079 --> 00:01:52.680
moving beyond those cool but maybe isolated, standalone AI projects.

38
00:01:53.159 --> 00:01:56.920
Companies are now chasing this bigger enterprise AI vision, and

39
00:01:56.959 --> 00:01:59.519
the market for that, it's projected at, what, three hundred

40
00:01:59.519 --> 00:02:01.719
and forty-one billion dollars? Huge.

41
00:02:01.640 --> 00:02:04.799
It's massive. Yeah. So it's not just one-off solutions anymore.

42
00:02:04.799 --> 00:02:07.640
It's got to be a broader, integrated strategy.

43
00:02:07.200 --> 00:02:10.159
Which demands something different.
Right? This industrialization of

44
00:02:10.120 --> 00:02:15.439
AI. Exactly, that's the term. It means focusing on reusability, scalability, safety,

45
00:02:16.360 --> 00:02:19.520
and, this is key, building these things in right from

46
00:02:19.360 --> 00:02:21.879
the design phase. Not just an afterthought.

47
00:02:21.599 --> 00:02:25.280
Definitely not an afterthought. That way, AI becomes a reliable

48
00:02:25.360 --> 00:02:28.080
business asset, not just, you know, a science

49
00:02:27.759 --> 00:02:30.919
project. That makes complete sense. But okay, if it's such

50
00:02:30.960 --> 00:02:34.680
a clear benefit, what holds companies back? What are the

51
00:02:34.759 --> 00:02:38.680
hurdles they face trying to get this enterprise AI thing going?

52
00:02:38.800 --> 00:02:39.000
Well,

53
00:02:39.039 --> 00:02:41.240
you know, even with all the potential, a lot of

54
00:02:41.240 --> 00:02:43.919
companies do struggle. Sometimes it's just a lack of awareness

55
00:02:43.919 --> 00:02:45.240
of what AI can really

56
00:02:45.039 --> 00:02:47.000
do. Or they're stuck with old tools.

57
00:02:46.680 --> 00:02:51.080
That too. Legacy tools, maybe a general resistance to innovation.

58
00:02:51.199 --> 00:02:53.319
And of course you always have ethical concerns or worries

59
00:02:53.360 --> 00:02:57.120
about jobs. Understandable. But ultimately, the best way to look

60
00:02:57.159 --> 00:03:02.199
at it, the forward-thinking view, is augmented intelligence: AI

61
00:03:02.360 --> 00:03:06.400
designed to solve specific business problems, and often there's still

62
00:03:06.439 --> 00:03:07.080
a human in the

63
00:03:07.080 --> 00:03:09.680
loop. Right, guiding it, checking it. Exactly.

64
00:03:09.719 --> 00:03:13.560
It's about augmenting what humans can do, not necessarily replacing

65
00:03:13.599 --> 00:03:14.240
them entirely.

66
00:03:14.719 --> 00:03:18.719
That's a really important distinction.
Now, okay, core concepts. Sometimes

67
00:03:18.800 --> 00:03:22.240
terms get jumbled. What we're mostly using in business today,

68
00:03:22.280 --> 00:03:23.400
that's narrow AI.

69
00:03:23.639 --> 00:03:28.879
That's right. Machines designed for one single, specific task, like

70
00:03:28.919 --> 00:03:32.360
Google Translate. It's amazing at translation, but you can't ask

71
00:03:32.400 --> 00:03:33.319
it about the weather. Right.

72
00:03:33.360 --> 00:03:36.120
It doesn't have general smarts. The sci-fi version, the

73
00:03:36.120 --> 00:03:42.159
thinking machine, that's artificial general intelligence, AGI. Still mostly theory. Still

74
00:03:41.919 --> 00:03:45.319
very much theory. And within narrow AI, the key techniques

75
00:03:45.360 --> 00:03:49.400
we use are machine learning, ML, and deep learning, DL.

76
00:03:49.560 --> 00:03:51.840
And deep learning is a type of machine learning?

77
00:03:51.639 --> 00:03:54.520
Yeah, basically a more advanced subset. It's specifically built for

78
00:03:54.599 --> 00:03:58.879
tackling those really complex big data problems, often using structures

79
00:03:58.919 --> 00:03:59.960
we call neural networks.

80
00:04:00.199 --> 00:04:02.879
Okay, so if ML and DL are the techniques, how

81
00:04:02.919 --> 00:04:04.960
does data science fit in? Where does that sit?

82
00:04:05.319 --> 00:04:08.599
Good question. Data science is kind of the big umbrella.

83
00:04:09.120 --> 00:04:13.840
Think of it covering everything: the modeling, statistics, programming, plus,

84
00:04:14.039 --> 00:04:18.800
crucially, knowing the business domain, all aimed at extracting real

85
00:04:18.839 --> 00:04:23.800
insights and value from data. It's the whole toolkit you need, really.

86
00:04:23.519 --> 00:04:26.199
And honestly, none of this would scale, would it, without

87
00:04:26.199 --> 00:04:28.959
the cloud? Cloud computing is just fundamental for

88
00:04:29.040 --> 00:04:32.120
AI today. Absolutely essential.
It's the primary enabler.

89
00:04:32.160 --> 00:04:37.120
And when we talk cloud, everyone knows the big three: AWS, Microsoft Azure,

90
00:04:37.199 --> 00:04:39.279
Google Cloud Platform, GCP.

91
00:04:39.600 --> 00:04:41.720
Those are the main players for sure, though you also

92
00:04:41.759 --> 00:04:45.360
have others, like IBM Cloud and Heroku, doing significant work too.

93
00:04:45.560 --> 00:04:48.639
The key things the cloud provides for AI are basically

94
00:04:48.839 --> 00:04:50.199
storage and compute power.

95
00:04:50.319 --> 00:04:53.279
Right, that's it, storage and compute. And it's worth noting,

96
00:04:53.399 --> 00:04:57.480
deep learning projects, they need a lot of both. Huge overhead

97
00:04:57.120 --> 00:05:00.160
sometimes. But machine learning is maybe less demanding?

98
00:04:59.759 --> 00:05:03.279
Often, yeah. ML projects can frequently run with fewer resources.

99
00:05:03.439 --> 00:05:06.480
What's interesting, though, a trend we're seeing, is companies worried

100
00:05:06.480 --> 00:05:07.360
about vendor

101
00:05:07.160 --> 00:05:09.319
lock-in. Being stuck with one provider.

102
00:05:09.160 --> 00:05:12.199
Exactly, so they're moving towards multi-cloud or hybrid setups.

103
00:05:12.560 --> 00:05:15.040
Gives them more flexibility strategically. And

104
00:05:15.000 --> 00:05:17.920
how do they package these AI applications up to run

105
00:05:18.040 --> 00:05:22.079
reliably everywhere? That's containerization, isn't it? Like Docker. Spot

106
00:05:22.120 --> 00:05:23.600
on. Containerization.

107
00:05:23.759 --> 00:05:26.279
Docker's the big name there. It has pretty much become the

108
00:05:26.360 --> 00:05:30.839
standard way to productionize AI apps. Why is that? Well,

109
00:05:31.279 --> 00:05:34.959
think of containers like those standard shipping containers. They bundle

110
00:05:35.040 --> 00:05:39.680
everything the app needs: code, libraries, settings.
It makes them lightweight, portable,

111
00:05:39.959 --> 00:05:41.680
and they run in isolated environments.

112
00:05:41.759 --> 00:05:44.920
So, consistent deployment no matter the underlying

113
00:05:44.480 --> 00:05:48.839
system? Precisely. Really important for getting things working reliably in production.

114
00:05:49.120 --> 00:05:52.079
Okay, so we have the concepts, the cloud foundation. Now

115
00:05:52.160 --> 00:05:56.120
let's dig into the operational side, the blueprint. Barry Walsh

116
00:05:56.160 --> 00:05:59.560
really hammers this home: data strategy is paramount. He cites

117
00:05:59.560 --> 00:06:02.600
the numbers, something like eighty-nine percent of businesses struggle

118
00:06:02.639 --> 00:06:03.639
with data management.

119
00:06:03.759 --> 00:06:06.079
Yeah, it's a striking figure, isn't it? And his point

120
00:06:06.160 --> 00:06:09.800
is, without a solid data strategy, your analytics, your AI initiatives,

121
00:06:10.279 --> 00:06:12.000
they're likely doomed from the start.

122
00:06:12.240 --> 00:06:14.120
That's a sobering thought. It is.

123
00:06:14.360 --> 00:06:17.959
And to tackle that exact challenge, we have this methodology

124
00:06:18.000 --> 00:06:18.839
called DataOps.

125
00:06:19.160 --> 00:06:21.040
Okay, DataOps. What's that about?

126
00:06:21.240 --> 00:06:26.600
Basically, it merges ideas from DevOps, agile methods, and lean manufacturing

127
00:06:25.959 --> 00:06:28.519
principles. And data, obviously. Right.

128
00:06:28.480 --> 00:06:31.879
The core goal is to streamline your data pipelines, really

129
00:06:31.879 --> 00:06:35.800
boost data quality and reliability, shorten that innovation cycle time,

130
00:06:36.120 --> 00:06:36.759
cut down

131
00:06:36.560 --> 00:06:38.399
production errors, and improve collaboration.

132
00:06:38.600 --> 00:06:41.879
Yeah, often through things like self-service tools.
It fosters

133
00:06:41.920 --> 00:06:46.319
this culture of continuous improvement, encouraging experimentation, sort of lab

134
00:06:46.360 --> 00:06:48.480
based innovation with data.

135
00:06:48.600 --> 00:06:51.920
Okay, so DataOps handles the data flow and quality.

136
00:06:52.079 --> 00:06:54.240
Then you mentioned MLOps.

137
00:06:54.319 --> 00:06:57.680
Yes. Now, a really critical insight here is that AI

138
00:06:57.720 --> 00:07:01.800
and machine learning aren't just about code. Code plus data.

139
00:07:01.399 --> 00:07:04.480
And the data part is tricky because it changes. Exactly.

140
00:07:04.519 --> 00:07:05.199
Code development,

141
00:07:05.279 --> 00:07:07.160
we kind of know how to control that, but data

142
00:07:07.560 --> 00:07:09.959
is dynamic, it evolves on its own, and that's a

143
00:07:10.000 --> 00:07:13.600
massive challenge. There's this frustrating statistic that less than half

144
00:07:13.800 --> 00:07:15.959
of ML models actually make it into production.

145
00:07:16.079 --> 00:07:17.079
Wow, less than half?

146
00:07:17.360 --> 00:07:20.959
Yeah. Not great. It's not. And that's precisely where MLOps

147
00:07:20.959 --> 00:07:23.839
comes in. It applies all those proven best practices from

148
00:07:23.879 --> 00:07:27.560
DevOps and DataOps, but specifically to the entire machine

149
00:07:27.639 --> 00:07:28.480
learning life cycle.

150
00:07:28.600 --> 00:07:30.240
So from the very beginning?

151
00:07:29.959 --> 00:07:33.800
Yeah, data prep, model training, all the way through deployment, monitoring,

152
00:07:34.279 --> 00:07:39.720
and that vital continuous improvement loop. MLOps is really about bridging

153
00:07:39.759 --> 00:07:43.720
that gap, that painful gap between developing a model and

154
00:07:43.759 --> 00:07:46.000
actually using it effectively in the real world.

155
00:07:46.120 --> 00:07:48.519
And you mentioned agile methods are key in DataOps.

156
00:07:48.560 --> 00:07:49.439
Same for MLOps?

157
00:07:49.480 --> 00:07:53.480
Absolutely central. Agile fosters that collaboration and adaptability you need.

158
00:07:54.240 --> 00:07:56.920
AI project teams, ideally, are really diverse.

159
00:07:57.000 --> 00:07:57.959
Who's usually on them?

160
00:07:58.040 --> 00:08:01.759
You'll have business users, data architects, solution architects, data engineers,

161
00:08:01.839 --> 00:08:05.879
data scientists, ML engineers, IT operations folks. Everyone

162
00:08:05.560 --> 00:08:07.439
involved. And they work in sprints, typically?

163
00:08:07.519 --> 00:08:10.600
Yeah, development sprints or product sprints, usually two to four

164
00:08:10.600 --> 00:08:14.040
weeks long. This lets them prioritize and deliver features, fix

165
00:08:14.040 --> 00:08:15.639
bugs incrementally.

166
00:08:15.759 --> 00:08:16.800
The benefits seem clear,

167
00:08:16.879 --> 00:08:21.519
then. More flexibility, definitely. Faster delivery times too. And crucially,

168
00:08:21.680 --> 00:08:24.560
you reduce the risk of building something nobody actually wants

169
00:08:24.639 --> 00:08:28.519
or needs, because you're constantly adapting to changing requirements as

170
00:08:28.519 --> 00:08:29.120
you learn more.

171
00:08:29.480 --> 00:08:32.879
Collaboration sounds key. You mentioned tools like Git and GitHub.

172
00:08:33.080 --> 00:08:33.919
How do they fit in?

173
00:08:34.559 --> 00:08:39.360
They're huge. Version control, especially distributed version control systems, or

174
00:08:39.440 --> 00:08:43.960
DVCS, like Git, is just invaluable. It tracks and manages

175
00:08:44.080 --> 00:08:46.279
changes not just to source code, but also to your

176
00:08:46.360 --> 00:08:47.120
datasets as

177
00:08:47.000 --> 00:08:50.480
they evolve. Tracking data changes too? That's interesting. For AI,

178
00:08:50.759 --> 00:08:54.879
it's a massive benefit.
Git helps reduce development time, leads

179
00:08:54.879 --> 00:08:58.720
to much higher success rates for deployments, gives you transparent traceability.

180
00:08:58.960 --> 00:09:02.279
You know exactly who changed what, when. And critically, the

181
00:09:02.320 --> 00:09:03.519
ability to roll back

182
00:09:03.559 --> 00:09:05.919
to a previous version if something goes wrong?

183
00:09:05.759 --> 00:09:09.279
Exactly. Previous states of both code and data. The GitHub

184
00:09:09.279 --> 00:09:12.120
ecosystem ties it all together. You've got the command-line

185
00:09:12.120 --> 00:09:15.679
tool, Git itself, the cloud hosting on GitHub, and a

186
00:09:15.720 --> 00:09:18.919
desktop app too. Makes team collaboration much

187
00:09:18.720 --> 00:09:22.320
smoother. And then automating the whole release pipeline. That's CI/CD, right?

188
00:09:22.399 --> 00:09:24.600
Continuous integration, continuous delivery.

189
00:09:24.440 --> 00:09:28.919
That's the one. CI/CD automates the build, test, and deployment stages.

190
00:09:29.399 --> 00:09:33.039
It lets you push out software updates constantly, reliably, with

191
00:09:33.240 --> 00:09:34.639
minimal manual fuss.

192
00:09:34.759 --> 00:09:37.440
How does that apply specifically in the AI DataOps world?

193
00:09:37.559 --> 00:09:40.720
Well, the automation extends even further there. It includes orchestrating

194
00:09:40.759 --> 00:09:44.080
your data pipelines automatically. It can even integrate things like

195
00:09:44.159 --> 00:09:45.080
data drift detection.

196
00:09:45.240 --> 00:09:46.080
What's data drift?

197
00:09:46.200 --> 00:09:49.000
That's when the live data your model sees in production starts

198
00:09:49.039 --> 00:09:51.679
looking different from the data it was trained on, which

199
00:09:51.759 --> 00:09:53.159
can really mess up performance.

200
00:09:53.399 --> 00:09:55.720
Ah, so CI/CD can catch

201
00:09:55.480 --> 00:09:58.919
that? It can trigger alerts or even automated ML retraining.

202
00:09:59.159 --> 00:10:03.519
Tools like Jenkins make setting up these CI/CD environments relatively straightforward,

203
00:10:03.600 --> 00:10:06.519
using pipeline scripts. It just makes the whole end-to-end

204
00:10:06.519 --> 00:10:08.279
process much more efficient and robust.

205
00:10:08.480 --> 00:10:11.200
Okay, let's switch gears slightly and talk about the data itself,

206
00:10:11.240 --> 00:10:14.000
because we hear about the data deluge all the time.

207
00:10:14.039 --> 00:10:17.240
What was that prediction? One hundred and seventy-five zettabytes

208
00:10:17.759 --> 00:10:19.039
by twenty twenty-five?

209
00:10:18.919 --> 00:10:22.879
Something staggering like that. Yeah, a zettabyte is a trillion gigabytes.

210
00:10:23.000 --> 00:10:24.639
It's almost impossible to comprehend.

211
00:10:24.879 --> 00:10:28.159
But the real challenge, the insight for businesses, isn't just

212
00:10:28.240 --> 00:10:30.039
the volume, is it? It's the messiness.

213
00:10:30.120 --> 00:10:34.799
Absolutely. Raw data almost never comes ready to use. Companies

214
00:10:34.840 --> 00:10:37.879
often seriously underestimate the sheer effort needed to turn that

215
00:10:38.000 --> 00:10:40.759
raw stuff into clean, valuable data.

216
00:10:40.480 --> 00:10:41.759
And that's a critical step.

217
00:10:42.000 --> 00:10:44.600
It can absolutely make or break an AI project right

218
00:10:44.639 --> 00:10:47.639
at the start. Many initiatives stumble right there because they

219
00:10:47.720 --> 00:10:49.360
underestimate the cleaning and prep work.

220
00:10:49.440 --> 00:10:51.840
So how do you manage that influx? IBM has this

221
00:10:51.960 --> 00:10:53.240
concept, the AI Ladder.

222
00:10:53.399 --> 00:10:58.240
Yeah, it outlines a strategic sequence: collect, organize, analyze, and

223
00:10:58.279 --> 00:11:02.159
then infuse AI.
A key part of it is unifying

224
00:11:02.240 --> 00:11:05.879
your data, often across multiple clouds, maybe using a data lake.

225
00:11:06.159 --> 00:11:09.159
Unification is key for getting that complete picture. Totally.

226
00:11:09.200 --> 00:11:11.840
And this leads us to data pipelines. These are the

227
00:11:11.919 --> 00:11:13.519
automated workflows that move

228
00:11:13.399 --> 00:11:16.039
data around. Like plumbing for data? Kind

229
00:11:15.919 --> 00:11:19.559
of, yeah. An automated series of actions: extract or ingest data

230
00:11:19.600 --> 00:11:22.720
from sources, transform it so it's usable, and then load

231
00:11:22.759 --> 00:11:25.679
it into a data store for analysis. It's that classic

232
00:11:26.000 --> 00:11:29.000
extract, transform, load loop: ETL.

233
00:11:29.200 --> 00:11:30.559
Let's break that ETL down.

234
00:11:30.759 --> 00:11:34.080
Extraction, that's just pulling data from all sorts of places:

235
00:11:34.080 --> 00:11:39.480
text files, databases, websites, APIs, and increasingly using efficient formats

236
00:11:39.480 --> 00:11:42.879
like Parquet or Avro. Then transformation, that's the prep work,

237
00:11:43.279 --> 00:11:47.039
getting the data ready for whatever system comes next. Involves formatting,

238
00:11:47.240 --> 00:11:52.759
filtering out bad data, encoding things numerically, scaling values, normalizing,

239
00:11:53.080 --> 00:11:56.039
maybe splitting datasets. Lots of steps,

240
00:11:55.679 --> 00:11:58.200
potentially. And finally, loading? Just

241
00:11:58.200 --> 00:12:01.759
putting that cleaned, transformed data into its final destination, the

242
00:12:01.840 --> 00:12:02.399
data store.

243
00:12:02.440 --> 00:12:06.240
Okay, ETL. But then there's another term, data wrangling. How

244
00:12:06.279 --> 00:12:07.559
is that different from transformation?

245
00:12:07.759 --> 00:12:11.120
Good question. Data wrangling is maybe more active, more iterative.

246
00:12:11.720 --> 00:12:14.519
It happens after you've acquired the data, but before you

247
00:12:14.559 --> 00:12:18.360
start building models. Unlike, say, exploratory data analysis, where you're

248
00:12:18.360 --> 00:12:21.600
just looking, wrangling actively changes the data to make it

249
00:12:21.639 --> 00:12:22.799
suitable for ML or

250
00:12:22.759 --> 00:12:25.799
DL. So it's really shaping the data for the model? Exactly.

251
00:12:25.919 --> 00:12:29.519
It includes things like deciding how to handle missing values:

252
00:12:29.559 --> 00:12:33.320
do you drop the rows, fill them with the average, interpolate?

253
00:12:34.159 --> 00:12:38.440
It also means dealing with outliers, encoding categorical data, scaling

254
00:12:38.519 --> 00:12:43.080
numerical features. It's all about optimizing the data for the algorithm.

255
00:12:43.399 --> 00:12:45.759
Right, makes sense. So once the data is wrangled, you

256
00:12:45.799 --> 00:12:47.639
need somewhere to put it. You mentioned data lakes.

257
00:12:47.720 --> 00:12:50.200
Yep, data lakes are popular. They're basically a single large

258
00:12:50.240 --> 00:12:54.720
repository for all kinds of data: raw, structured, unstructured, semi-structured.

259
00:12:55.039 --> 00:12:58.399
Great for handling that variety, velocity, volume, veracity, the

260
00:12:58.480 --> 00:12:59.440
Vs of big data.

261
00:12:59.600 --> 00:13:00.440
But there's a catch?

262
00:13:00.519 --> 00:13:03.720
There is. Without really good cataloging and governance, a data

263
00:13:03.799 --> 00:13:06.559
lake can easily turn into a data swamp, just a

264
00:13:06.799 --> 00:13:08.200
mess of unusable data.

265
00:13:08.279 --> 00:13:11.240
Okay, so governance is crucial. What about data warehouses?

266
00:13:11.679 --> 00:13:17.159
Data warehouses are traditionally more structured: clean, organized data, mostly structured,

267
00:13:17.639 --> 00:13:20.399
often serving as the single source of truth for reporting

268
00:13:20.440 --> 00:13:23.639
and analytics, though modern ones are getting better at handling

269
00:13:23.720 --> 00:13:24.919
unstructured data too.

270
00:13:24.919 --> 00:13:25.799
And data marts?

271
00:13:25.840 --> 00:13:28.840
Those are usually smaller, focused subsets of a data warehouse,

272
00:13:29.120 --> 00:13:31.639
tailored for a specific department or analytical need.

273
00:13:31.840 --> 00:13:34.200
And there's a newer concept, lakehouse?

274
00:13:34.480 --> 00:13:36.600
Yeah. The lakehouse idea tries to blend the best of

275
00:13:36.639 --> 00:13:39.799
both: the flexibility and cheap storage of a data lake,

276
00:13:40.240 --> 00:13:42.799
combined with the data management and structure features of a

277
00:13:42.879 --> 00:13:45.159
data warehouse. Still evolving, but

278
00:13:45.200 --> 00:13:48.720
promising. And choosing between ETL and ELT, that's a strategic

279
00:13:48.759 --> 00:13:50.159
decision too, right? Definitely.

280
00:13:50.399 --> 00:13:55.120
ETL, extract, transform, load, is the classic way: transform the

281
00:13:55.200 --> 00:13:58.480
data before loading it. Often good for structured data or

282
00:13:58.600 --> 00:14:03.559
migrating to the cloud. ELT, extract, load, transform, flips it.

283
00:14:03.960 --> 00:14:07.480
You load the raw data into storage first, then transform it.

284
00:14:07.600 --> 00:14:10.600
This is really popular for data lakes and exploratory analysis.

285
00:14:10.639 --> 00:14:12.559
Why? More flexibility? Exactly.

286
00:14:13.159 --> 00:14:15.679
Data scientists can access the raw data and decide on

287
00:14:15.720 --> 00:14:18.559
transformations later, as they figure out what they need. Much

288
00:14:18.559 --> 00:14:20.639
more agile for exploration.

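NOTE Editor's illustration, not spoken in the episode. A minimal Python sketch of the ETL versus ELT ordering described just above, assuming an in-memory SQLite database stands in for the data store. The sample rows, the table names (sales_etl, sales_raw, sales_elt), and the helper functions are hypothetical and chosen only for this sketch.
```python
import sqlite3
# Invented sample records: (name, amount) arrive as messy strings.
RAW_ROWS = [("  Alice ", "120.50"), ("Bob ", "80"), ("Carol", "not-a-number")]
#
def clean(name, amount):
    """Transform step: trim the name and parse the amount (None if unparseable)."""
    try:
        return name.strip(), float(amount)
    except ValueError:
        return name.strip(), None
#
def run_etl(conn):
    """ETL: transform in Python first, then load only the cleaned rows."""
    cleaned = [clean(name, amount) for name, amount in RAW_ROWS]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales_etl VALUES (?, ?)",
        [row for row in cleaned if row[1] is not None],
    )
    return conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
#
def run_elt(conn):
    """ELT: load the raw strings untouched, then transform inside the store with SQL."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    # sales_raw stays available, so a different transformation can be tried later.
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT TRIM(name) AS name, CAST(amount AS REAL) AS amount "
        "FROM sales_raw WHERE amount GLOB '[0-9]*'"
    )
    return conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
#
if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_etl(conn))
    print(run_elt(conn))
```
In this sketch both paths end with the same cleaned rows; the difference is that the ELT path keeps the untouched sales_raw table in the store, which is the flexibility the speakers point to.
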
289
00:14:20.200 --> 00:14:24.600
Then the databases themselves. SQL versus NoSQL, quick rundown? Sure.

290
00:14:24.519 --> 00:14:28.480
SQL databases, think Microsoft SQL Server, MySQL, Postgres, are relational.

291
00:14:28.639 --> 00:14:31.679
They use predefined schemas, pretty structured. Great for complex

292
00:14:31.720 --> 00:14:34.799
analytical queries, what we call OLAP, using powerful joins.

293
00:14:34.919 --> 00:14:38.720
And NoSQL? NoSQL, like MongoDB, Cassandra, AWS

294
00:14:38.799 --> 00:14:43.720
DynamoDB, are non-relational, often schemaless or flexible-schema, designed

295
00:14:43.720 --> 00:14:47.440
for massive scale, high speed, and handling frequent changes, like

296
00:14:47.440 --> 00:14:51.200
in web apps. Often used for OLTP, transactional

297
00:14:50.639 --> 00:14:53.759
data. And what if you need insights right now from

298
00:14:53.840 --> 00:14:55.279
data that's constantly flowing?

299
00:14:55.440 --> 00:14:58.559
Ah, then you need stream processing and analytics. This is

300
00:14:58.559 --> 00:15:02.039
about querying data streams as they arrive, in near real time.

301
00:15:02.320 --> 00:15:04.159
Crucial when the data's value drops

302
00:15:03.919 --> 00:15:07.200
quickly. Like IoT sensor data or stock prices?

303
00:15:07.240 --> 00:15:12.960
Perfect examples. Tools like Apache Storm, Spark Streaming, Flink, Kafka Streams, AWS Kinesis.

304
00:15:13.200 --> 00:15:15.679
They're all built for this kind of high-speed, continuous

305
00:15:15.759 --> 00:15:19.399
querying and analysis, getting insights in milliseconds or seconds.

306
00:15:19.399 --> 00:15:23.039
Okay, we've got the data flowing, stored, prepped. Let's finally

307
00:15:23.080 --> 00:15:25.840
dive into the engine: machine learning and deep learning. Starting

308
00:15:25.840 --> 00:15:27.799
with ML. Supervised learning? Yeah,

309
00:15:27.600 --> 00:15:29.679
probably the most common type.
You train the model on

310
00:15:29.759 --> 00:15:31.679
data where you already know the right answer, the label.

311
00:15:31.759 --> 00:15:34.320
Like predicting customer churn? You have past data on who

312
00:15:34.399 --> 00:15:35.480
left. Exactly.

313
00:15:35.559 --> 00:15:40.120
That's a classification problem. Or forecasting customer revenue based on past spending.

314
00:15:40.159 --> 00:15:43.000
That's regression. Known inputs, known outputs.

315
00:15:43.159 --> 00:15:46.759
Then unsupervised learning. This is where it gets interesting.

316
00:15:47.080 --> 00:15:49.759
You don't have labels. The goal is to find hidden

317
00:15:49.799 --> 00:15:52.720
patterns or structures in the data itself.

318
00:15:52.440 --> 00:15:54.799
Like grouping similar customers together?

319
00:15:54.679 --> 00:15:58.600
Right, that's clustering. Another key technique is dimensionality reduction, like

320
00:15:58.639 --> 00:16:03.759
PCA, principal component analysis. It helps simplify massive datasets

321
00:16:03.759 --> 00:16:07.200
by finding the most important underlying features, making them easier

322
00:16:07.200 --> 00:16:07.679
to work with.

323
00:16:07.840 --> 00:16:10.399
And then there's reinforcement learning. That sounds different

324
00:16:10.159 --> 00:16:13.360
again. It is. It's about real-time learning. An agent

325
00:16:13.679 --> 00:16:16.960
learns by trial and error in an environment, getting rewards

326
00:16:17.000 --> 00:16:18.039
or penalties for its

327
00:16:17.879 --> 00:16:20.519
actions. Like training a robot to walk? Yeah,

328
00:16:20.320 --> 00:16:24.879
or game-playing AI. Google's search engine uses it. Autonomous vehicles

329
00:16:24.919 --> 00:16:30.440
rely heavily on it. Robotics. It's powering some really advanced applications.

330
00:16:29.919 --> 00:16:31.960
Okay. And building on ML, we get to deep learning.

331
00:16:32.039 --> 00:16:34.399
This is where the really complex stuff happens, right?

332
00:16:34.679 --> 00:16:39.519
Pretty much. DL extends ML using these things called artificial

333
00:16:39.519 --> 00:16:43.000
neural networks, ANNs, often with many, many hidden layers.

334
00:16:43.240 --> 00:16:45.240
They're loosely inspired by the brain's structure.

335
00:16:45.320 --> 00:16:48.360
Why is DL booming now? These ideas aren't brand

336
00:16:48.120 --> 00:16:51.200
new. True, the concepts have been around, but it's the

337
00:16:51.320 --> 00:16:55.200
massive leap in computing power, especially from GPUs, and huge

338
00:16:55.240 --> 00:16:58.879
amounts of data that have made training these deep networks feasible.

339
00:16:59.120 --> 00:17:02.639
Think milestones like IBM's Deep Blue beating Kasparov.

340
00:17:02.279 --> 00:17:04.160
Watson on Jeopardy.

341
00:17:03.799 --> 00:17:08.039
ImageNet breakthroughs, Google DeepMind's AlphaGo. These were all

342
00:17:08.079 --> 00:17:11.119
powered by, or pushed the limits of, deep learning and

343
00:17:11.160 --> 00:17:12.319
the hardware behind it.

344
00:17:12.240 --> 00:17:14.480
And inside DL, there are different kinds of neural networks?

345
00:17:14.480 --> 00:17:15.400
Oh yeah, whole zoo of them.

346
00:17:15.480 --> 00:17:15.720
Yeah.

347
00:17:15.720 --> 00:17:19.440
For image recognition, the standard is convolutional neural networks, CNNs.

348
00:17:19.680 --> 00:17:22.839
They're brilliant at picking out spatial hierarchies of features in images,

349
00:17:23.119 --> 00:17:25.680
treating them as multidimensional grids, or tensors.

350
00:17:25.720 --> 00:17:28.799
Tensors, right, like complex spreadsheets?

351
00:17:28.119 --> 00:17:31.960
Sort of, yeah, multidimensional arrays. Then for sequential data, time

352
00:17:32.079 --> 00:17:35.680
series, language, you have recurrent neural networks, RNNs, and

353
00:17:35.759 --> 00:17:39.839
a more powerful variant called LSTMs, long short-term memory models.

354
00:17:40.119 --> 00:17:42.599
They have memory of past inputs, essentially?

355
00:17:42.720 --> 00:17:45.920
Yes, they maintain a state that captures information from previous

356
00:17:45.920 --> 00:17:48.559
steps in the sequence. And then you get into networks

357
00:17:48.599 --> 00:17:51.839
that can even generate new data, like autoencoders or

358
00:17:51.920 --> 00:17:52.920
variational autoencoders.

359
00:17:52.759 --> 00:17:56.359
And the famous GANs, generative adversarial networks?

360
00:17:56.480 --> 00:17:59.519
That's them, the tech behind deepfakes. They learn patterns

361
00:17:59.559 --> 00:18:03.720
from input data and can create incredibly realistic synthetic images, text,

362
00:18:04.039 --> 00:18:04.920
even music.

363
00:18:04.920 --> 00:18:07.440
Wild stuff. What tools do people use to build these?

364
00:18:07.559 --> 00:18:11.000
Python is king here. The dominant frameworks are TensorFlow, which

365
00:18:11.039 --> 00:18:14.359
is Google's open-source library, and Keras, which is a

366
00:18:14.440 --> 00:18:17.119
high-level API that makes TensorFlow much easier to use.

367
00:18:17.319 --> 00:18:18.480
And the other big one?

368
00:18:18.440 --> 00:18:22.720
PyTorch, from Facebook. It's another very popular open-source framework,

369
00:18:22.920 --> 00:18:26.759
known for its flexibility and dynamic approach, often preferred in research.

370
00:18:26.960 --> 00:18:29.319
Building the model is one thing, but getting it to

371
00:18:29.359 --> 00:18:33.640
perform well, that's tuning, right? Sounds complex.

372
00:18:34.000 --> 00:18:37.279
It's definitely an iterative process, almost an art form.
Sometimes

373
00:18:37.839 --> 00:18:41.119
you need to choose the right activation functions, deciding how

374
00:18:41.160 --> 00:18:45.359
neurons fire, select appropriate loss functions to measure the model's error,

375
00:18:46.039 --> 00:18:50.079
and pick good optimization algorithms to adjust the model's internal

376
00:18:50.079 --> 00:18:51.839
weights to minimize that error.

377
00:18:52.160 --> 00:18:54.759
And then there are hyperparameters, like dials you can

378
00:18:54.759 --> 00:18:55.799
turn? Exactly.

379
00:18:56.240 --> 00:18:58.799
Think of them as settings outside the model that control

380
00:18:58.839 --> 00:19:02.119
the learning process itself. Things like the learning rate, how

381
00:19:02.119 --> 00:19:04.920
many examples you process at once, the batch size, and

382
00:19:05.000 --> 00:19:08.160
really important techniques like regularization, dropout.

383
00:19:08.240 --> 00:19:09.640
What does regularization do?

384
00:19:10.079 --> 00:19:12.880
It helps prevent the model from overfitting. That's where it

385
00:19:12.920 --> 00:19:15.920
learns the training data too well, including noise, and then

386
00:19:16.000 --> 00:19:19.839
fails to generalize to new, unseen data. Dropout is a

387
00:19:19.839 --> 00:19:22.640
common way to fight that. Fine-tuning these hyperparameters

388
00:19:22.759 --> 00:19:24.400
is critical for getting good results.

389
00:19:24.680 --> 00:19:27.480
You know, thinking back, we've seen so many cool AI

390
00:19:27.559 --> 00:19:30.440
prototypes over the years, but historically, didn't a lot

391
00:19:30.440 --> 00:19:32.839
of them just fail to make it into actual use?

392
00:19:33.119 --> 00:19:36.200
Sadly, yes, that was a common story, often due to

393
00:19:36.240 --> 00:19:38.799
those operational silos we talked about, or maybe relying too

394
00:19:38.839 --> 00:19:42.000
much on niche experts, models being too complex and code-heavy,

395
00:19:42.079 --> 00:19:44.480
poor integration. Lots of reasons.

396
00:19:44.519 --> 00:19:46.880
But it feels like that's changing now, like there's an

397
00:19:46.920 --> 00:19:48.519
automation revolution happening.

398
00:19:49.000 --> 00:19:51.079
I think that's fair to say, and that's where AutoML

399
00:19:51.160 --> 00:19:54.920
comes in, alongside no-code, low-code platforms.

400
00:19:54.519 --> 00:19:57.599
Okay, AutoML. Automating machine learning? Pretty much.

401
00:19:57.720 --> 00:20:00.680