WEBVTT 1 00:00:00.040 --> 00:00:03.200 Welcome to another deep dive for you, the learner listening 2 00:00:03.200 --> 00:00:06.599 in today. I want you to just imagine standing on 3 00:00:06.639 --> 00:00:10.119 the edge of a massive, wildly turbulent 4 00:00:09.640 --> 00:00:11.800 ocean, like a really chaotic one. 5 00:00:11.880 --> 00:00:15.439 Yeah, exactly. But we're looking at a global landscape generating 6 00:00:15.480 --> 00:00:17.399 over forty zettabytes 7 00:00:16.920 --> 00:00:20.320 of data, which is just an unfathomable number. It really is. 8 00:00:20.719 --> 00:00:25.280 And the modern business challenge isn't acquiring information anymore. 9 00:00:25.719 --> 00:00:28.800 The actual challenge is preventing your enterprise from drowning in 10 00:00:28.879 --> 00:00:32.359 these raw data swamps. You know, it's about figuring out 11 00:00:32.600 --> 00:00:36.399 how to build the industrial plumbing necessary to refine that 12 00:00:36.520 --> 00:00:40.640 total chaos into pure, actionable business assets. 13 00:00:40.759 --> 00:00:44.200 Because the physics of a forty-zettabyte landscape completely 14 00:00:44.280 --> 00:00:48.079 break traditional data models. Oh, for sure. Human cognition and, frankly, 15 00:00:48.240 --> 00:00:51.880 legacy server architectures just aren't built to natively comprehend 16 00:00:51.920 --> 00:00:53.159 or route that much throughput. 17 00:00:53.439 --> 00:00:55.359 No, they would just melt, pretty much. 18 00:00:55.520 --> 00:00:57.560 You can have the most valuable data on the planet 19 00:00:57.600 --> 00:01:00.560 sitting in your servers, but if your processing stack can't 20 00:01:00.840 --> 00:01:03.439 ingest it, structure it, and analyze it at scale, it 21 00:01:03.479 --> 00:01:06.359 actually becomes a massive liability 22 00:01:05.879 --> 00:01:07.400 instead of a competitive advantage. 23 00:01:07.920 --> 00:01:08.480 Exactly. 24 00:01:08.560 --> 00:01:12.480 Okay, let's unpack this today.
We are analyzing a foundational 25 00:01:12.519 --> 00:01:16.879 text to solve this exact problem, which is Practical Data 26 00:01:16.920 --> 00:01:19.840 Science by Andreas François Vermeulen. 27 00:01:20.120 --> 00:01:22.439 And this isn't just a theoretical text. 28 00:01:22.439 --> 00:01:26.239 No, not at all. It's an aggressive, really comprehensive guide 29 00:01:26.640 --> 00:01:28.959 to the entire enterprise technology stack. 30 00:01:29.359 --> 00:01:32.920 Yeah, the layered frameworks, the rigid business rules, all the 31 00:01:32.959 --> 00:01:36.959 stuff required to actually tame massive data sets out in 32 00:01:37.000 --> 00:01:37.439 the wild. 33 00:01:37.640 --> 00:01:42.640 Right, because what Vermeulen offers is essentially an architectural blueprint. 34 00:01:42.879 --> 00:01:45.159 We are moving way past the novelty of, you know, 35 00:01:45.319 --> 00:01:47.200 simple data science experiments 36 00:01:46.760 --> 00:01:49.239 on a laptop, like just running a quick Python script. 37 00:01:49.439 --> 00:01:52.719 Right. This text breaks down the mechanical reality of how 38 00:01:52.799 --> 00:01:57.959 data is stored, processed across distributed clusters, legally protected, and 39 00:01:58.079 --> 00:02:00.159 ultimately served up to an executive board 40 00:02:00.359 --> 00:02:02.280 to drive millions of dollars in decisions. 41 00:02:02.400 --> 00:02:03.840 Exactly. That's the end goal. 42 00:02:04.000 --> 00:02:05.760 So our mission for this deep dive is to give 43 00:02:05.799 --> 00:02:08.439 you a cohesive mental model of that entire 44 00:02:08.280 --> 00:02:09.960 journey from start to finish. 45 00:02:10.039 --> 00:02:14.039 Yeah, we'll track the data flowing from a wild unstructured 46 00:02:14.120 --> 00:02:17.759 lake through all that complex processing machinery, all the way 47 00:02:17.840 --> 00:02:18.879 up to business deployment. 48 00:02:19.000 --> 00:02:20.000 It's quite a journey.
49 00:02:20.159 --> 00:02:23.599 So let's start at the source, right, taming the wild reservoir. 50 00:02:24.560 --> 00:02:28.919 Vermeulen defines the data lake as a massive repository storing 51 00:02:29.000 --> 00:02:31.879 data in its native raw format. 52 00:02:31.719 --> 00:02:34.240 Which is crucial to understand. 53 00:02:33.759 --> 00:02:36.240 Right, because for anyone who has worked with legacy systems, 54 00:02:36.840 --> 00:02:40.479 we know the absolute friction of the old schema-on-write 55 00:02:40.319 --> 00:02:43.960 approach. Ugh, schema-on-write. It basically forces you into 56 00:02:44.000 --> 00:02:46.639 a rigid box before you even begin doing anything. 57 00:02:46.680 --> 00:02:47.919 You have to map everything out. 58 00:02:48.039 --> 00:02:50.680 Yeah, you have to spend months modeling the exact shape 59 00:02:50.759 --> 00:02:53.960 of your database tables, the data types, the relationships, all 60 00:02:54.000 --> 00:02:55.840 before a single byte is even loaded. 61 00:02:55.879 --> 00:02:58.240 And that rigidity causes massive bottlenecks. 62 00:02:58.360 --> 00:03:02.159 No, absolutely, because the moment a new unexpected data format 63 00:03:02.240 --> 00:03:04.560 arrives from an external vendor, what happens? 64 00:03:04.759 --> 00:03:08.000 The whole ingestion pipeline just breaks down, shatters. That's where 65 00:03:08.000 --> 00:03:11.439 the modern schema-on-read philosophy comes in. You bypass that 66 00:03:11.560 --> 00:03:15.360 initial bottleneck completely by loading the data into the lake 67 00:03:15.520 --> 00:03:16.400 exactly as it 68 00:03:16.360 --> 00:03:18.800 is, just raw and completely unstructured. 69 00:03:18.879 --> 00:03:22.840 Yeah, you only apply the organizational rules, the schema, at 70 00:03:22.840 --> 00:03:27.400 the exact computational moment you query the data. Yes, so
Yes, so 71 00:03:28.120 --> 00:03:32.360 is a data lake essentially a giant unfiltered natural reservoir, 72 00:03:33.120 --> 00:03:36.520 And schema on reed is like deciding whether you want 73 00:03:36.520 --> 00:03:40.039 to filter that water for drinking, farming, or swimming only 74 00:03:40.080 --> 00:03:41.960 at the exact moment you dip your bucket in. 75 00:03:42.120 --> 00:03:45.319 That is a perfect analogy. What's fascinating here is how 76 00:03:45.360 --> 00:03:50.000 that flexibility directly accelerates knowledge generation. How so well by 77 00:03:50.080 --> 00:03:54.360 keeping the leaf level atomic data perfectly intact, you preserve 78 00:03:54.599 --> 00:03:57.159 all the anomalies and the really subtle signals. 79 00:03:57.319 --> 00:03:59.520 Uh, because he didn't scrub them out of the start exactly. 80 00:04:00.000 --> 00:04:02.800 It's in an exploratory data science. The actual insights are 81 00:04:02.840 --> 00:04:04.599 hidden in the unstructured noise. 82 00:04:04.800 --> 00:04:05.000 Right. 83 00:04:05.159 --> 00:04:08.280 If you force data through a rigid schema on right 84 00:04:08.400 --> 00:04:11.599 filter right out ingestion, you strip out those anomalies because 85 00:04:11.599 --> 00:04:13.439 they just don't fit your predefined assumption. 86 00:04:13.639 --> 00:04:15.240 You lose what you didn't know you were looking for. 87 00:04:15.479 --> 00:04:20.720 Precisely, Schema on reed preserves those unknown variables for future models. 88 00:04:20.759 --> 00:04:24.360 But Vermilan makes it clear you can't just leave everything 89 00:04:24.399 --> 00:04:26.360 floating in a chaotic lake forever. 90 00:04:26.680 --> 00:04:27.959 No, that would be a disaster. 91 00:04:28.240 --> 00:04:31.480 Right enter the data vault, which is a hybrid modeling 92 00:04:31.519 --> 00:04:34.040 methodology created by Dan linst. 
93 00:04:33.759 --> 00:04:36.720 It because we do need structure for business reporting, but 94 00:04:36.800 --> 00:04:39.079 we want it without losing that agility. 95 00:04:39.560 --> 00:04:45.560 So the Data Vault achieves this using three core architectural components, right, hubs, links, 96 00:04:45.600 --> 00:04:46.399 and satellites. 97 00:04:46.600 --> 00:04:49.800 Yeah, the mechanical genius of the Data Vault is its modularity. 98 00:04:50.319 --> 00:04:53.199 Hubbs act as the immutable business keys. 99 00:04:53.079 --> 00:04:55.279 Like the absolute core identifiers right. 100 00:04:55.160 --> 00:04:58.399 Like a persistent customer ID it never changes, okay, and 101 00:04:58.439 --> 00:05:01.839 the links links handle the trends actional associations. They map 102 00:05:01.879 --> 00:05:05.199 how hubs interact without holding any descriptive data themselves. 103 00:05:05.279 --> 00:05:07.120 Got it, So where does the actual information go? 104 00:05:07.439 --> 00:05:10.680 All the volatile descriptive context is pushed into the satellites. 105 00:05:11.000 --> 00:05:13.720 So if the hub is the unchangeable concept of a 106 00:05:13.759 --> 00:05:16.639 specific customer and the link represents the fact that they 107 00:05:16.680 --> 00:05:18.720 interacted with a specific product. 108 00:05:18.439 --> 00:05:21.959 The satellite holds their current address, their income bracket, and 109 00:05:22.000 --> 00:05:23.279 the timestamp of the event. 110 00:05:23.600 --> 00:05:26.279 Wow, why split it up so aggressively like that? 111 00:05:26.800 --> 00:05:30.959 Because it isolates structural changes. Let's say your marketing department 112 00:05:31.040 --> 00:05:34.800 suddenly starts collecting a dozen new demographic metrics on customers. 113 00:05:35.160 --> 00:05:37.079 With a normal setup, you'd have to rebuild your core 114 00:05:37.120 --> 00:05:41.519 tables or alter existing schemas. 
But here you simply attach 115 00:05:41.600 --> 00:05:44.839 a brand new satellite to the existing hub. Oh wow, Yeah, 116 00:05:44.879 --> 00:05:49.279 it allows you to model incredibly complex, evolving enterprise environments 117 00:05:49.639 --> 00:05:54.879 while maintaining a completely auditible historical record of every single change. 118 00:05:54.920 --> 00:05:58.160 That's brilliant. Okay, so now we have a highly structured, 119 00:05:58.319 --> 00:06:02.160 scalable reservoir. But a reservoir is useless if you don't 120 00:06:02.160 --> 00:06:04.839 have the industrial machinery to pump and process the water. 121 00:06:05.199 --> 00:06:05.800 Very true. 122 00:06:06.079 --> 00:06:09.600 Let's move into the processing stack for Mulen outlines. At 123 00:06:09.639 --> 00:06:13.240 the absolute center of this arsenal is Apache Spark. 124 00:06:13.439 --> 00:06:17.879 Spark completely changed the paradigm for distributed cluster computing because 125 00:06:17.879 --> 00:06:21.240 it's so fast, because it's resilient. When you are analyzing 126 00:06:21.360 --> 00:06:26.439 terabytes of telemetry data, a single machine's memory will inevitably crash. 127 00:06:26.519 --> 00:06:27.959 It just can't hold the weight, right. 128 00:06:28.439 --> 00:06:33.240 Spark solves this by utilizing resilient distributed data sets or RDBs. 129 00:06:33.399 --> 00:06:34.279 Okay, what do those do? 130 00:06:34.800 --> 00:06:39.120 It basically shatters the massive data set into partitions, distributes 131 00:06:39.160 --> 00:06:42.839 them across thousands of worker nodes in a cluster, processes 132 00:06:42.879 --> 00:06:45.079 the math and memory all at the same time, and 133 00:06:45.160 --> 00:06:48.160 then aggregates the results back together seamlessly. 134 00:06:48.600 --> 00:06:53.000 That is, Wild and working alongside Spark is apatche Kofka. 
135 00:06:53.720 --> 00:06:56.839 If Spark is doing the heavy computational lifting, Kafka is 136 00:06:56.879 --> 00:06:59.560 handling the sheer velocity of the ingestion. Exactly. 137 00:06:59.680 --> 00:07:04.160 Kafka operates as a distributed publish-subscribe messaging system. Like 138 00:07:04.199 --> 00:07:07.920 a massive router. Yeah, imagine you have a global retail operation. 139 00:07:08.439 --> 00:07:12.839 You've got thousands of edge devices, website clicks, supply 140 00:07:12.600 --> 00:07:15.240 chain updates, generating millions of events per second. 141 00:07:15.360 --> 00:07:19.639 Right, Kafka ingests that entire stream. It guarantees fault-tolerant, 142 00:07:19.680 --> 00:07:21.839 real-time delivery to the processing 143 00:07:21.360 --> 00:07:23.439 core. So nothing gets lost? Exactly. 144 00:07:23.480 --> 00:07:26.240 It ensures no packets are dropped even if a downstream 145 00:07:26.319 --> 00:07:27.920 server briefly goes offline. 146 00:07:27.959 --> 00:07:30.240 Here's where it gets really interesting. If we look at 147 00:07:30.279 --> 00:07:32.839 the programming languages. Okay, we all know Python and R are 148 00:07:33.079 --> 00:07:36.759 the standard languages for data science. Sure, but if Python 149 00:07:36.839 --> 00:07:40.839 and R are the cognitive centers, like the brains running 150 00:07:40.839 --> 00:07:44.879 the logical models, are Kafka and Spark basically the central 151 00:07:44.879 --> 00:07:47.879 nervous system ensuring the signals actually travel through the giant 152 00:07:47.879 --> 00:07:49.600 corporate body without collapsing? 153 00:07:49.959 --> 00:07:53.360 That analogy perfectly maps to the technical architecture. 154 00:07:53.399 --> 00:07:53.920 Oh, awesome.
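To make the "nothing gets lost if a downstream server briefly goes offline" point concrete, here is a toy, in-memory sketch of the publish-subscribe idea. It is not Kafka's API. The class name `TopicLog` and its methods are invented for illustration, and the only assumption carried over is the general design of an append-only log where each consumer tracks its own read position.

```python
class TopicLog:
    """Append-only event log; each consumer remembers its own read offset."""

    def __init__(self):
        self._events = []    # every published event, in arrival order
        self._offsets = {}   # consumer id -> index of next unread event

    def publish(self, event):
        """Producers only ever append; they never wait on consumers."""
        self._events.append(event)

    def poll(self, consumer_id):
        """Return the events this consumer has not seen, then advance its offset."""
        start = self._offsets.get(consumer_id, 0)
        batch = self._events[start:]
        self._offsets[consumer_id] = len(self._events)
        return batch


clicks = TopicLog()
clicks.publish({"type": "click", "page": "/home"})
first_batch = clicks.poll("analytics")

# The "analytics" consumer is offline while two more events arrive.
clicks.publish({"type": "click", "page": "/cart"})
clicks.publish({"type": "click", "page": "/checkout"})

# When it comes back, it resumes from its saved offset and catches up.
catch_up = clicks.poll("analytics")
print(len(first_batch), len(catch_up))
```

Because the log keeps the events and the consumer keeps only a position, a brief outage delays delivery rather than dropping data.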
155 00:07:54.079 --> 00:07:58.240 Yeah, Python is exceptional for logical wrangling, right, but native 156 00:07:58.319 --> 00:08:02.240 pandas DataFrames are heavily constrained by single-machine memory 157 00:08:02.240 --> 00:08:06.560 limits. They max out. Exactly. And similarly, R is unmatched 158 00:08:06.560 --> 00:08:11.600 for statistical rigor. It creates complex visualizations with libraries like 159 00:08:11.439 --> 00:08:13.480 ggplot2, but you can't easily scale it. 160 00:08:13.680 --> 00:08:17.040 Right. To apply that statistical rigor to a forty-zettabyte 161 00:08:17.079 --> 00:08:18.240 ocean, you need 162 00:08:18.079 --> 00:08:20.000 a bridge, which is where the tools come in. 163 00:08:20.160 --> 00:08:23.600 Yeah, that's why Vermeulen highlights packages like sparklyr. It 164 00:08:23.639 --> 00:08:26.839 allows data scientists to write standard R code that executes 165 00:08:26.920 --> 00:08:30.199 natively across a massive Spark cluster. Oh, I see. The 166 00:08:30.279 --> 00:08:34.080 distributed tools free the analytical brains from their single-server skulls. 167 00:08:34.399 --> 00:08:36.639 That's a great way to put it. And we can't 168 00:08:36.679 --> 00:08:40.120 ignore the edge devices feeding the system either. The text 169 00:08:40.159 --> 00:08:44.960 specifically highlights MQTT, MQ Telemetry Transport. 170 00:08:44.600 --> 00:08:46.000 A really vital protocol. 171 00:08:46.120 --> 00:08:48.759 Yeah, because if you have an incredibly dense array of 172 00:08:48.799 --> 00:08:53.840 IoT sensors, say monitoring temperature fluctuations across a massive agricultural grid, 173 00:08:54.320 --> 00:08:58.000 standard HTTP protocols carry way too much header overhead. They're 174 00:08:58.039 --> 00:09:03.440 just too bulky. Right. MQTT uses a microscopic footprint. It's the 175 00:09:03.440 --> 00:09:07.679 perfect protocol to shoot continuous low-bandwidth telemetry data directly 176 00:09:07.720 --> 00:09:09.279 into your Kafka streams.
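The header-overhead claim can be made tangible by counting bytes. The sketch below hand-encodes a single MQTT PUBLISH packet at QoS 0, following the MQTT 3.1.1 packet layout as I understand it, and compares it with a deliberately minimal HTTP/1.1 POST carrying the same reading. The topic, path, and host names are hypothetical, and a real HTTP request would typically carry more headers than shown here.

```python
import struct


def encode_remaining_length(length: int) -> bytes:
    """MQTT variable-length integer: 7 data bits per byte, high bit = 'more'."""
    encoded = bytearray()
    while True:
        digit = length % 128
        length //= 128
        if length > 0:
            digit |= 0x80
        encoded.append(digit)
        if length == 0:
            return bytes(encoded)


def mqtt_publish_packet(topic: str, payload: bytes) -> bytes:
    """Encode a PUBLISH packet at QoS 0 (so no packet identifier is included)."""
    topic_bytes = topic.encode("utf-8")
    # Variable header: 2-byte big-endian topic length, then the topic itself.
    variable_header = struct.pack("!H", len(topic_bytes)) + topic_bytes
    body = variable_header + payload
    # Fixed header: 0x30 = PUBLISH, DUP 0, QoS 0, RETAIN 0.
    return bytes([0x30]) + encode_remaining_length(len(body)) + body


reading = b"21.5"  # hypothetical temperature reading
mqtt_packet = mqtt_publish_packet("farm/7/temp", reading)

http_request = (
    b"POST /sensors/7/temp HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 4\r\n"
    b"\r\n" + reading
)

print(len(mqtt_packet), len(http_request))
```

With these inputs the MQTT packet is 19 bytes in total, of which 4 are the reading itself. The HTTP request spends several times that on request line and headers alone, and that difference is repeated for every reading from every sensor.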
177 00:09:09.039 --> 00:09:13.200 And mastering that integration. Knowing how to capture lightweight MQTT 178 00:09:13.440 --> 00:09:17.600 signals at the edge, stream them flawlessly through Kafka, crunch 179 00:09:17.679 --> 00:09:20.480 the distributed math with Spark, and orchestrate it all with 180 00:09:20.519 --> 00:09:21.200 Python. 181 00:09:21.279 --> 00:09:22.080 That's the real trick. 182 00:09:22.240 --> 00:09:24.840 Yeah, that is the exact threshold that separates a local 183 00:09:24.879 --> 00:09:27.279 data analyst from an enterprise-grade data scientist. 184 00:09:27.440 --> 00:09:29.720 Okay, but having a garage full of state of the 185 00:09:29.840 --> 00:09:32.120 art tools doesn't mean you actually know how to build 186 00:09:32.120 --> 00:09:33.000 a functional car. 187 00:09:33.320 --> 00:09:34.440 No, it definitely doesn't. 188 00:09:34.600 --> 00:09:38.519 We have the stack, but we need a blueprint, which 189 00:09:38.519 --> 00:09:41.559 brings us to the processing frameworks required to manage these 190 00:09:41.559 --> 00:09:45.919 deployments without, you know, causing catastrophic failures. 191 00:09:46.080 --> 00:09:50.200 Because the industry graveyard is completely full of brilliant algorithms 192 00:09:50.200 --> 00:09:51.200 that died in production. 193 00:09:51.360 --> 00:09:52.600 Why do they die? 194 00:09:52.519 --> 00:09:56.039 Because there was no standardized engineering process. Vermeulen 195 00:09:56.159 --> 00:09:59.600 champions CRISP-DM, which stands for the Cross-Industry Standard Process 196 00:09:59.600 --> 00:10:02.360 for Data Mining. Right, it breaks the workflow into a 197 00:10:02.399 --> 00:10:08.960 really strict sequence: business understanding, data understanding, data preparation, modeling, evaluation, 198 00:10:09.399 --> 00:10:10.159 and deployment.
199 00:10:10.559 --> 00:10:13.399 It seems like jumping straight into modeling without the business 200 00:10:13.480 --> 00:10:16.960 understanding layer is exactly why so many data pilots fail 201 00:10:16.960 --> 00:10:18.440 when they hit the production floor. 202 00:10:18.240 --> 00:10:21.799 Oh, one hundred percent. And the text emphasizes that CRISP-DM 203 00:10:21.919 --> 00:10:23.759 is inherently cyclical, not linear. 204 00:10:23.919 --> 00:10:25.600 Right, you don't just march from step one to six 205 00:10:25.600 --> 00:10:26.480 and clock out for the day. 206 00:10:26.559 --> 00:10:29.879 Far from it. The cyclical nature is a defensive mechanism 207 00:10:29.919 --> 00:10:33.399 against bad assumptions. Well, you might spend weeks in the 208 00:10:33.399 --> 00:10:37.320 modeling phase only to hit the evaluation phase and realize 209 00:10:37.320 --> 00:10:39.679 your predictive accuracy is hovering around 210 00:10:39.480 --> 00:10:41.440 fifty percent, basically a coin toss. 211 00:10:41.559 --> 00:10:44.919 Right. That failure forces you back to data preparation to 212 00:10:45.000 --> 00:10:47.840 engineer new features, or sometimes all the way back to 213 00:10:47.919 --> 00:10:51.799 business understanding because the original problem was framed incorrectly. 214 00:10:51.919 --> 00:10:56.639 Wow. And to operationalize this cycle at scale, Vermeulen outlines 215 00:10:56.679 --> 00:10:59.919 a five-layer data science framework. Yes, he grounds the 216 00:11:00.360 --> 00:11:05.200 framework using a fictional corporate sandbox called VKHCG, the Vermeulen-Krennwallner-Hillman-Clark 217 00:11:05.240 --> 00:11:08.279 Group. It's quite a mouthful. It is, but 218 00:11:08.399 --> 00:11:12.519 it's a massive conglomerate with distinct subsidiaries handling IT networks, 219 00:11:12.759 --> 00:11:16.519 global billboard advertising, logistics, and forex trading.
220 00:11:16.919 --> 00:11:20.000 It serves as the perfect stress-test environment for the framework. 221 00:11:20.159 --> 00:11:22.559 So how do the five layers stack up to manage 222 00:11:22.559 --> 00:11:23.279 this complexity? 223 00:11:23.519 --> 00:11:26.200 At the apex is the business layer, which dictates the 224 00:11:26.240 --> 00:11:27.440 actual enterprise needs. 225 00:11:27.519 --> 00:11:27.799 Okay. 226 00:11:27.919 --> 00:11:30.480 Below that is the utility layer, which is a centralized 227 00:11:30.519 --> 00:11:31.960 vault for repeatable algorithms. 228 00:11:32.039 --> 00:11:32.279 Got it. 229 00:11:32.399 --> 00:11:36.399 Then the operational management layer handles scheduling and automated triggers. 230 00:11:36.320 --> 00:11:37.639 Like running the jobs, right? 231 00:11:38.120 --> 00:11:42.000 The audit, balance, and control layer strictly monitors data lineage 232 00:11:42.039 --> 00:11:45.879 and compliance. Super important. And finally, the functional layer at 233 00:11:45.879 --> 00:11:49.240 the bottom is where the actual algorithmic heavy lifting and 234 00:11:49.360 --> 00:11:51.399 data transformations execute. 235 00:11:51.639 --> 00:11:55.480 Looking at this architecture, it becomes painfully obvious why so 236 00:11:55.559 --> 00:11:58.960 many data pilots fail. Oh yeah. A data scientist will 237 00:11:58.960 --> 00:12:02.320 build a brilliant predictive model in a Jupyter notebook on 238 00:12:02.360 --> 00:12:03.919 their local machine, 239 00:12:03.480 --> 00:12:06.840 which is effectively operating purely in the functional layer. 240 00:12:06.679 --> 00:12:09.279 Exactly. But when they try to deploy it across an 241 00:12:09.399 --> 00:12:14.080 enterprise like VKHCG without the operational management layer to schedule 242 00:12:14.080 --> 00:12:17.919 the pipelines or the audit layer to monitor data drift, 243 00:12:17.840 --> 00:12:21.840 the model immediately fractures under real-world conditions. It just shatters.
Yeah, 244 00:12:22.440 --> 00:12:25.159 if we connect this to the bigger picture, the primary 245 00:12:25.279 --> 00:12:29.120 value of the five-layer framework isn't merely bureaucratic organization. 246 00:12:29.519 --> 00:12:30.240 What is it, then? 247 00:12:30.399 --> 00:12:34.279 It provides the architectural scaffolding required to transition a localized, 248 00:12:34.320 --> 00:12:38.919 fragile experiment into an automated, fault-tolerant production environment. Making 249 00:12:38.919 --> 00:12:43.480 it real. Exactly. A model without operational integration and continuous 250 00:12:43.519 --> 00:12:46.879 auditing is effectively useless to the broader enterprise. 251 00:12:47.080 --> 00:12:50.080 Speaking of the broader enterprise, let's look at the sheer 252 00:12:50.159 --> 00:12:53.440 logistical nightmare of a conglomerate like VKHCG. 253 00:12:53.519 --> 00:12:54.200 It's massive. 254 00:12:54.440 --> 00:12:58.159 Yeah, you have Krennwallner AG generating video files and high-res 255 00:12:58.240 --> 00:13:02.399 images from billboards. Clark Ltd is generating thousands of 256 00:13:02.480 --> 00:13:06.480 CSVs of forex trading data. Hillman Ltd is producing 257 00:13:06.600 --> 00:13:10.600 XML routing data. So much variety. Right, so how do 258 00:13:10.679 --> 00:13:14.519 these distinct layers and subsidiaries communicate without drowning in an 259 00:13:14.600 --> 00:13:16.240 endless sea of custom translation 260 00:13:16.320 --> 00:13:20.320 APIs? That integration bottleneck is solved by the utility layer, 261 00:13:20.480 --> 00:13:25.480 specifically through an architectural standard Vermeulen introduces called HORUS, 262 00:13:25.360 --> 00:13:28.879 which stands for the Homogeneous Ontology for Recursive Uniform Schema. 263 00:13:29.000 --> 00:13:29.480 That's the one. 264 00:13:29.519 --> 00:13:32.399 It's essentially a universal internal adapter.
Let's break down the 265 00:13:32.440 --> 00:13:35.120 actual mathematics of why this is necessary, because the technical 266 00:13:35.159 --> 00:13:38.279 debt of point-to-point integration is just staggering. 267 00:13:38.360 --> 00:13:39.720 It really is. Let's hear the math. 268 00:13:39.840 --> 00:13:42.600 Okay, if an enterprise has one hundred different data formats 269 00:13:43.080 --> 00:13:45.519 and you want any system to talk to any other system, 270 00:13:45.960 --> 00:13:49.360 you have to write direct converters for every single combination. 271 00:13:49.840 --> 00:13:52.720 That's one hundred times ninety-nine. You're looking at nearly 272 00:13:52.799 --> 00:13:58.480 ten thousand custom, brittle integration scripts just to maintain baseline communication. 273 00:13:58.759 --> 00:14:01.639 And every time an external vendor updates an API, 274 00:14:02.279 --> 00:14:06.279 dozens of those point-to-point scripts break simultaneously. 275 00:14:05.480 --> 00:14:07.639 Which is a nightmare for the engineers. 276 00:14:07.240 --> 00:14:10.919 Absolute nightmare. But by instituting HORUS as the central hub, 277 00:14:11.320 --> 00:14:14.399 you mandate that every incoming format is translated into the 278 00:14:14.399 --> 00:14:15.200 HORUS standard 279 00:14:15.240 --> 00:14:15.559 first. 280 00:14:15.720 --> 00:14:19.759 Okay. If a downstream system needs that data, it translates 281 00:14:19.759 --> 00:14:22.039 it from HORUS into its target format. 282 00:14:22.200 --> 00:14:24.320 Wait, wait, I want to push back on that architecture 283 00:14:24.360 --> 00:14:27.320 for a second. Sure. Isn't translating format A into HORUS 284 00:14:27.320 --> 00:14:30.000 and then HORUS into format B, aren't we just injecting 285 00:14:30.039 --> 00:14:33.919 a middleman into every single data pipeline? Doesn't that intermediate 286 00:14:33.919 --> 00:14:38.360 step add massive computational overhead and latency?
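To put numbers on that converter math, here is a small sketch. It counts one converter per ordered pair of formats for the point-to-point case, which is the "one hundred times ninety-nine" figure, and for comparison uses the usual hub-and-spoke count of two converters per format, one into the shared schema and one out of it. The function names are made up for this example.

```python
def point_to_point_converters(n_formats: int) -> int:
    """One direct converter for every ordered (source, target) pair."""
    return n_formats * (n_formats - 1)


def hub_and_spoke_converters(n_formats: int) -> int:
    """Each format needs one converter into the hub schema and one out."""
    return 2 * n_formats


n = 100
direct = point_to_point_converters(n)
hubbed = hub_and_spoke_converters(n)
reduction = 1 - hubbed / direct

print(f"point-to-point: {direct} converters")
print(f"hub-and-spoke:  {hubbed} converters")
print(f"reduction:      {reduction:.1%}")
```

For one hundred formats that is 9,900 direct converters against 200 through a hub, roughly a 98 percent reduction. The gap keeps widening as formats are added, because the first count grows quadratically and the second only linearly.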
Why is this 287 00:14:38.399 --> 00:14:39.879 actually faster in the long run? 288 00:14:40.039 --> 00:14:43.000 It's a really critical trade-off. Yes, you introduce a 289 00:14:43.039 --> 00:14:46.759 fractional computational cost by serializing and deserializing through an 290 00:14:46.759 --> 00:14:50.320 intermediate schema. There is a cost, but consider the alternative. 291 00:14:50.679 --> 00:14:53.639 By using a hub-and-spoke model, integrating one hundred 292 00:14:53.639 --> 00:14:58.279 formats only requires two hundred scripts, one to convert into HORUS 293 00:14:58.399 --> 00:15:00.200 and one to convert out, per format. 294 00:15:00.000 --> 00:15:01.519 That is a huge difference. 295 00:15:01.559 --> 00:15:05.159 It's a ninety-eight percent savings in development time. When 296 00:15:05.200 --> 00:15:07.639 format one oh one is introduced, you don't write one 297 00:15:07.720 --> 00:15:10.159 hundred new integrations, you write exactly two. 298 00:15:10.360 --> 00:15:12.279 Wow. Okay, that makes perfect sense. 299 00:15:12.320 --> 00:15:15.720 The microscopic increase in compute latency is heavily outweighed by 300 00:15:15.720 --> 00:15:18.840 the elimination of thousands of hours of developer maintenance and 301 00:15:18.919 --> 00:15:19.960 pipeline fragility. 302 00:15:20.200 --> 00:15:23.600 And HORUS isn't just for tabular data either. The text 303 00:15:23.600 --> 00:15:26.679 provides some wild examples of how the utility layer forces 304 00:15:26.759 --> 00:15:30.480 complex unstructured data into this homogeneous format. 305 00:15:30.559 --> 00:15:32.919 Yeah, the image extraction is crazy. 306 00:15:32.559 --> 00:15:35.080 It really is. Yeah, Vermeulen details an algorithm 307 00:15:35.080 --> 00:15:38.279 that takes a JPEG image of a dog named Angus, 308 00:15:38.399 --> 00:15:41.720 great name, and it extracts the exact red, green, blue, 309 00:15:41.720 --> 00:15:44.320 and alpha transparency values for every single
310 00:15:44.039 --> 00:15:45.759 pixel, just tearing the image apart. 311 00:15:46.000 --> 00:15:49.240 Yeah, and it flattens the entire visual into a massive 312 00:15:49.320 --> 00:15:53.039 data frame of raw numerical arrays. And he applies the 313 00:15:53.080 --> 00:15:57.519 exact same logic to MP4 video files, extracting frame 314 00:15:57.559 --> 00:15:58.519 by frame matrices. 315 00:15:58.960 --> 00:16:03.720 Right, because by mathematically flattening complex visual or audio data 316 00:16:03.759 --> 00:16:07.600 into a standardized HORUS structure, you allow standard machine learning 317 00:16:07.639 --> 00:16:09.600 libraries to process it, because 318 00:16:09.320 --> 00:16:11.960 they usually need tabular numerical inputs. 319 00:16:12.000 --> 00:16:14.879 Right, exactly. Now they can process a video file using 320 00:16:14.879 --> 00:16:18.360 the exact same underlying logic they would use to analyze 321 00:16:18.360 --> 00:16:19.600 a financial spreadsheet. 322 00:16:19.799 --> 00:16:21.159 That is mind-blowing. It 323 00:16:21.080 --> 00:16:23.879 is. And because it's stored in the utility layer, any 324 00:16:23.919 --> 00:16:27.480 engineer across the enterprise can call that verified image extraction 325 00:16:27.559 --> 00:16:30.720 algorithm without having to reinvent the mathematical wheel. 326 00:16:30.799 --> 00:16:33.919 Which brings us to the final and unequivocally most critical 327 00:16:34.039 --> 00:16:34.519 piece of 328 00:16:34.440 --> 00:16:35.960 the framework. The top of the pyramid. 329 00:16:36.039 --> 00:16:38.759 Right, we have the data lakes, the Spark clusters, the 330 00:16:38.799 --> 00:16:42.440 CRISP-DM blueprints, and the HORUS universal translators. But all 331 00:16:42.480 --> 00:16:46.000 of this flawless engineering is absolutely worthless if it solves 332 00:16:46.039 --> 00:16:47.120 the wrong human problems. 333 00:16:47.200 --> 00:16:47.960 Totally worthless.
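Circling back to the pixel-flattening idea for a moment, here is a rough sketch of what "turning an image into a table" can look like. It is not the book's code and it skips JPEG decoding entirely. It assumes the image has already been decoded into a grid of (red, green, blue, alpha) tuples, and the function and column names are chosen just for this illustration.

```python
def flatten_rgba(image):
    """Turn a height x width grid of (r, g, b, a) tuples into flat table rows."""
    rows = []
    for y, scanline in enumerate(image):
        for x, (red, green, blue, alpha) in enumerate(scanline):
            rows.append(
                {"x": x, "y": y, "red": red, "green": green,
                 "blue": blue, "alpha": alpha}
            )
    return rows


# A hypothetical 2 x 2 image: red, green, blue, and a fully transparent white.
tiny_image = [
    [(255, 0, 0, 255), (0, 255, 0, 255)],
    [(0, 0, 255, 255), (255, 255, 255, 0)],
]

table = flatten_rgba(tiny_image)
for row in table:
    print(row)
```

Each pixel becomes one row with plain numeric columns, which is the tabular shape most standard machine learning libraries expect. A video would follow the same pattern with an extra frame-number column.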
334 00:16:48.000 --> 00:16:50.759 We have to ascend to the top, the business layer. 335 00:16:51.159 --> 00:16:55.120 This is where non-technical functional requirements actually dictate the 336 00:16:55.200 --> 00:17:01.120 engineering parameters. Right. Vermeulen leans heavily on the MoSCoW prioritization method 337 00:17:01.159 --> 00:17:04.319 here. MoSCoW, that's must have, should have, could have, won't 338 00:17:04.359 --> 00:17:05.440 have. Exactly. 339 00:17:05.839 --> 00:17:11.359 It forces stakeholders to brutally separate mission-critical analytical needs 340 00:17:11.680 --> 00:17:14.319 from purely aspirational vanity metrics. 341 00:17:14.359 --> 00:17:16.000 And you have to do that before a single line of 342 00:17:16.000 --> 00:17:19.000 code is written. Precisely. Once those strict requirements are set, 343 00:17:19.359 --> 00:17:22.359 the business logic has to be modeled. The text introduces 344 00:17:22.400 --> 00:17:25.799 sun models, developed by Mark Whitehorn, to handle this mapping. 345 00:17:26.200 --> 00:17:29.799 Sun models provide a phenomenal way to separate business facts 346 00:17:29.839 --> 00:17:30.680 from context. 347 00:17:30.880 --> 00:17:31.559 How do they work? 348 00:17:32.119 --> 00:17:35.319 The center of the model represents the fact. That's a specific, 349 00:17:35.559 --> 00:17:37.799 undeniable event, like a financial transaction. 350 00:17:37.960 --> 00:17:38.839 Okay, that's the core. 351 00:17:39.039 --> 00:17:43.559 Right. Radiating outward are the dimensions. These are the contextual 352 00:17:43.599 --> 00:17:46.920 realities of that event, such as the customer's geographic location 353 00:17:47.160 --> 00:17:49.759 or the store's operating hours at the exact time of 354 00:17:49.799 --> 00:17:51.119 the transaction. 355 00:17:50.799 --> 00:17:55.039 And managing those dimensions over time is surprisingly complex, isn't it? Well?
Incredibly, 356 00:17:55.160 --> 00:18:00.440 the book highlights slowly changing dimensions, specifically SCD Type 2, 357 00:18:01.039 --> 00:18:05.279 which uses an effective date column. There's a brilliant historical 358 00:18:05.319 --> 00:18:08.319 example used to explain why this matters: the Dutch explorer. 359 00:18:08.400 --> 00:18:10.480 Really? Yes, tracking Dr. Jacob Roggeveen. 360 00:18:10.680 --> 00:18:14.079 Right, if you look at standard relational databases, they often 361 00:18:14.119 --> 00:18:16.640 default to what we call SCD Type 1, which is 362 00:18:16.839 --> 00:18:18.440 simple overwriting. 363 00:18:17.960 --> 00:18:20.279 Meaning they just replace the old data. Yeah. 364 00:18:20.440 --> 00:18:23.960 So, if Dr. Roggeveen moves from his home in Middelburg 365 00:18:24.079 --> 00:18:28.400 to Easter Island in seventeen twenty-two, an SCD Type 366 00:18:28.400 --> 00:18:30.720 1 system just overwrites his address 367 00:18:30.359 --> 00:18:32.000 field, which seems fine at first glance. 368 00:18:32.079 --> 00:18:35.319 But the problem is you've permanently destroyed your historical context. 369 00:18:35.480 --> 00:18:38.200 Right. But with SCD Type 2, you don't overwrite. No, 370 00:18:38.519 --> 00:18:40.240 you add a new row and you manage it with 371 00:18:40.240 --> 00:18:43.119 an effective date. You log that he resided in Middelburg 372 00:18:43.160 --> 00:18:45.400 with an end date of April fourth, seventeen twenty-two, 373 00:18:45.839 --> 00:18:48.440 and a new row shows him residing on Easter Island 374 00:18:48.480 --> 00:18:52.359 effective April fifth, seventeen twenty-two. Exactly. Why is maintaining 375 00:18:52.359 --> 00:18:55.519 that temporal timeline so critical for advanced data science? 376 00:18:55.240 --> 00:18:58.480 Because predictive machine learning models absolutely rely on point in 377 00:18:58.519 --> 00:19:03.519 time accuracy.
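The Type 2 idea described here, adding a row instead of overwriting and then asking what was true on a given date, can be sketched with a couple of dictionaries. The column names `effective_from` and `effective_to` are assumptions for this example rather than the book's schema, and `None` stands for a boundary the transcript doesn't give.

```python
from datetime import date

# Type 2 history: one row per version of the address, nothing overwritten.
address_history = [
    {"person": "Jacob Roggeveen", "address": "Middelburg",
     "effective_from": None, "effective_to": date(1722, 4, 4)},
    {"person": "Jacob Roggeveen", "address": "Easter Island",
     "effective_from": date(1722, 4, 5), "effective_to": None},
]


def address_as_of(history, person, when):
    """Return the address that was in effect for `person` on the date `when`."""
    for row in history:
        if row["person"] != person:
            continue
        starts_ok = row["effective_from"] is None or row["effective_from"] <= when
        ends_ok = row["effective_to"] is None or when <= row["effective_to"]
        if starts_ok and ends_ok:
            return row["address"]
    return None


before_move = address_as_of(address_history, "Jacob Roggeveen", date(1722, 4, 4))
after_move = address_as_of(address_history, "Jacob Roggeveen", date(1722, 4, 5))
print(before_move, "->", after_move)
```

A Type 1 table would hold only the second row, so any question about April fourth would silently get the Easter Island answer.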
Say your algorithm is analyzing why certain customer 378 00:19:03.559 --> 00:19:07.559 segments canceled their subscriptions five years ago. Right, it needs 379 00:19:07.599 --> 00:19:12.240 to evaluate the geographic and demographic dimensions of those customers 380 00:19:12.279 --> 00:19:14.680 as they existed five years ago, not who they are 381 00:19:14.720 --> 00:19:20.359 today. Exactly. If your database has overwritten their historical addresses 382 00:19:20.400 --> 00:19:23.799 with their current ones, your training data is contaminated with 383 00:19:23.839 --> 00:19:25.000 future knowledge, 384 00:19:24.640 --> 00:19:27.960 which completely invalidates the model's predictive power. It ruins the 385 00:19:28.000 --> 00:19:31.039 whole thing. And the strictness required in the data models 386 00:19:31.119 --> 00:19:34.119 must also be applied to the human language driving them. 387 00:19:34.559 --> 00:19:37.680 The text offers a brutal warning about the danger of 388 00:19:37.799 --> 00:19:40.640 weak words in the business layer's requirements. 389 00:19:40.720 --> 00:19:45.240 Oh yes, business analysts frequently write non-functional requirements stating 390 00:19:45.279 --> 00:19:48.839 a dashboard must be user-friendly or a streaming pipeline 391 00:19:48.880 --> 00:19:50.559 must operate seamlessly. 392 00:19:50.200 --> 00:19:51.759 Which sounds good in the meeting. 393 00:19:51.599 --> 00:19:55.920 Sure, but from an engineering perspective, those words are poison, because 394 00:19:55.599 --> 00:19:59.279 they are fundamentally untestable. You can't write a unit test 395 00:19:59.319 --> 00:20:04.000 for seamless. You have to define strict binary thresholds, like 396 00:20:04.440 --> 00:20:07.599 the Kafka stream will process fifty thousand events per second 397 00:20:07.920 --> 00:20:10.039 with latency under one hundred milliseconds.
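That threshold translates almost word for word into something a test can check. The sketch below is one hypothetical way to write it down. The class name and the measured figures are invented for illustration, and only the two thresholds come from the example just given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamRequirement:
    """A non-functional requirement stated as pass/fail thresholds."""
    min_events_per_second: int
    max_latency_ms: float

    def is_met(self, measured_events_per_second: float,
               measured_latency_ms: float) -> bool:
        """True only if throughput is at least the minimum and latency is under the cap."""
        return (measured_events_per_second >= self.min_events_per_second
                and measured_latency_ms < self.max_latency_ms)


# "Fifty thousand events per second with latency under one hundred milliseconds."
requirement = StreamRequirement(min_events_per_second=50_000, max_latency_ms=100.0)

# Hypothetical measurements from two load-test runs.
fast_run = requirement.is_met(62_000, 80.0)
slow_run = requirement.is_met(62_000, 140.0)
print(fast_run, slow_run)
```

There is no equivalent class for "seamless": without numbers there is nothing for `is_met` to compare against.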
398 00:20:10.160 --> 00:20:14.160 Yes, if you don't translate qualitative business desires into highly 399 00:20:14.240 --> 00:20:20.240 specific quantitative engineering parameters, expectations misalign, and enterprise scale projects 400 00:20:20.279 --> 00:20:21.759 fail right before deployment. 401 00:20:22.119 --> 00:20:24.960