WEBVTT

1
00:00:00.040 --> 00:00:03.200
<v Speaker 1>Welcome to another deep dive for you, the learner listening

2
00:00:03.200 --> 00:00:06.599
<v Speaker 1>in today. I want you to just imagine standing on

3
00:00:06.639 --> 00:00:10.119
<v Speaker 1>the edge of a massive, wildly turbulent ocean.

4
00:00:09.640 --> 00:00:11.800
<v Speaker 2>Like a really chaotic one.

5
00:00:11.880 --> 00:00:15.439
<v Speaker 1>Yeah, exactly. But we're looking at a global landscape generating

6
00:00:15.480 --> 00:00:17.399
<v Speaker 1>over forty zettabytes of data.

7
00:00:16.920 --> 00:00:20.320
<v Speaker 2>Which is just an unfathomable number. It really is.

8
00:00:20.719 --> 00:00:25.280
<v Speaker 1>And the modern business challenge isn't acquiring information anymore.

9
00:00:25.719 --> 00:00:28.800
<v Speaker 1>The actual challenge is preventing your enterprise from drowning in

10
00:00:28.879 --> 00:00:32.359
<v Speaker 1>these raw data swamps. You know, it's about figuring out

11
00:00:32.600 --> 00:00:36.399
<v Speaker 1>how to build the industrial plumbing necessary to refine that

12
00:00:36.520 --> 00:00:40.640
<v Speaker 1>total chaos into pure actionable business assets.

13
00:00:40.759 --> 00:00:44.200
<v Speaker 2>Because the physics of a forty-zettabyte landscape completely

14
00:00:44.280 --> 00:00:48.079
<v Speaker 2>break traditional data models. Oh, for sure. Human cognition and, frankly,

15
00:00:48.240 --> 00:00:51.880
<v Speaker 2>legacy server architectures just aren't built to natively comprehend

16
00:00:51.920 --> 00:00:53.159
<v Speaker 2>or route that much throughput.

17
00:00:53.439 --> 00:00:55.359
<v Speaker 1>No, they would just melt pretty much.

18
00:00:55.520 --> 00:00:57.560
<v Speaker 2>You can have the most valuable data on the planet

19
00:00:57.600 --> 00:01:00.560
<v Speaker 2>sitting in your servers, but if your processing stack can't

20
00:01:00.840 --> 00:01:03.439
<v Speaker 2>ingest it, structure it, and analyze it at scale, it

21
00:01:03.479 --> 00:01:06.359
<v Speaker 2>actually becomes a massive liability.

22
00:01:05.879 --> 00:01:07.400
<v Speaker 1>Instead of a competitive advantage.

23
00:01:07.920 --> 00:01:08.480
<v Speaker 2>Exactly.

24
00:01:08.560 --> 00:01:12.480
<v Speaker 1>Okay, let's unpack this today. We are analyzing a foundational

25
00:01:12.519 --> 00:01:16.879
<v Speaker 1>text to solve this exact problem, which is Practical Data

26
00:01:16.920 --> 00:01:19.840
<v Speaker 1>Science by Andreas François Vermeulen.

27
00:01:20.120 --> 00:01:22.439
<v Speaker 2>And this isn't just a theoretical text.

28
00:01:22.439 --> 00:01:26.239
<v Speaker 1>No, not at all. It's an aggressive, really comprehensive guide

29
00:01:26.640 --> 00:01:28.959
<v Speaker 1>to the entire enterprise technology stack.

30
00:01:29.359 --> 00:01:32.920
<v Speaker 2>Yeah, the layered frameworks, the rigid business rules, all the

31
00:01:32.959 --> 00:01:36.959
<v Speaker 2>stuff required to actually tame massive data sets out in

32
00:01:37.000 --> 00:01:37.439
<v Speaker 2>the wild.

33
00:01:37.640 --> 00:01:42.640
<v Speaker 1>Right, because what Vermeulen offers is essentially an architectural blueprint.

34
00:01:42.879 --> 00:01:45.159
<v Speaker 2>We are moving way past the novelty of, you know,

35
00:01:45.319 --> 00:01:47.200
<v Speaker 2>simple data science experiments on a laptop.

36
00:01:46.760 --> 00:01:49.239
<v Speaker 1>Like just running a quick Python script.

37
00:01:49.439 --> 00:01:52.719
<v Speaker 2>Right. This text breaks down the mechanical reality of how

38
00:01:52.799 --> 00:01:57.959
<v Speaker 2>data is stored, processed across distributed clusters, legally protected, and

39
00:01:58.079 --> 00:02:00.159
<v Speaker 2>ultimately served up to an executive board.

40
00:02:00.359 --> 00:02:02.280
<v Speaker 1>To drive millions of dollars in decisions.

41
00:02:02.400 --> 00:02:03.840
<v Speaker 2>Exactly. That's the end goal.

42
00:02:04.000 --> 00:02:05.760
<v Speaker 1>So our mission for this deep dive is to give

43
00:02:05.799 --> 00:02:08.439
<v Speaker 1>you a cohesive mental model of that entire journey.

44
00:02:08.280 --> 00:02:09.960
<v Speaker 2>From start to finish.

45
00:02:10.039 --> 00:02:14.039
<v Speaker 1>Yeah, we'll track the data flowing from a wild unstructured

46
00:02:14.120 --> 00:02:17.759
<v Speaker 1>lake through all that complex processing machinery, all the way

47
00:02:17.840 --> 00:02:18.879
<v Speaker 1>up to business deployment.

48
00:02:19.000 --> 00:02:20.000
<v Speaker 2>It's quite a journey.

49
00:02:20.159 --> 00:02:23.599
<v Speaker 1>So let's start at the source, right, taming the wild reservoir.

50
00:02:24.560 --> 00:02:28.919
<v Speaker 1>Vermeulen defines the data lake as a massive repository storing

51
00:02:29.000 --> 00:02:31.879
<v Speaker 1>data in its native raw format.

52
00:02:31.719 --> 00:02:34.240
<v Speaker 2>Which is crucial to understand.

53
00:02:33.759 --> 00:02:36.240
<v Speaker 1>Right because for anyone who has worked with legacy systems,

54
00:02:36.840 --> 00:02:40.479
<v Speaker 1>we know the absolute friction of the old schema-on-write approach.

55
00:02:40.319 --> 00:02:43.960
<v Speaker 2>Ugh, schema-on-write. It basically forces you into

56
00:02:44.000 --> 00:02:46.639
<v Speaker 2>a rigid box before you even begin doing anything.

57
00:02:46.680 --> 00:02:47.919
<v Speaker 1>You have to map everything out.

58
00:02:48.039 --> 00:02:50.680
<v Speaker 2>Yeah, you have to spend months modeling the exact shape

59
00:02:50.759 --> 00:02:53.960
<v Speaker 2>of your database tables, the data types, the relationships, all

60
00:02:54.000 --> 00:02:55.840
<v Speaker 2>before a single byte is even loaded.

61
00:02:55.879 --> 00:02:58.240
<v Speaker 1>And that rigidity causes massive bottlenecks.

62
00:02:58.360 --> 00:03:02.159
<v Speaker 2>No, absolutely, because the moment a new unexpected data format

63
00:03:02.240 --> 00:03:04.560
<v Speaker 2>arrives from an external vendor, what happens?

64
00:03:04.759 --> 00:03:08.000
<v Speaker 1>The whole ingestion pipeline just breaks down, shatters. That's where

65
00:03:08.000 --> 00:03:11.439
<v Speaker 1>the modern schema-on-read philosophy comes in. You bypass that

66
00:03:11.560 --> 00:03:15.360
<v Speaker 1>initial bottleneck completely by loading the data into the lake

67
00:03:15.520 --> 00:03:16.400
<v Speaker 1>exactly as it is.

68
00:03:16.360 --> 00:03:18.800
<v Speaker 2>Just raw and completely unstructured.

69
00:03:18.879 --> 00:03:22.840
<v Speaker 1>Yeah, you only apply the organizational rules, the schema, at

70
00:03:22.840 --> 00:03:27.400
<v Speaker 1>the exact computational moment you query the data. Yes. So

71
00:03:28.120 --> 00:03:32.360
<v Speaker 1>is a data lake essentially a giant, unfiltered natural reservoir,

72
00:03:33.120 --> 00:03:36.520
<v Speaker 1>and schema-on-read is like deciding whether you want

73
00:03:36.520 --> 00:03:40.039
<v Speaker 1>to filter that water for drinking, farming, or swimming only

74
00:03:40.080 --> 00:03:41.960
<v Speaker 1>at the exact moment you dip your bucket in?

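NOTE
A minimal sketch of the schema-on-read idea discussed above, not code from the book.
The sample records, field names, and the read_with_schema helper are illustrative assumptions.
Raw lines are stored untouched; each consumer supplies its own schema only when it reads.
```python
import json
# Hypothetical raw events, kept exactly as they arrived (no upfront schema).
RAW_LAKE = [
    '{"customer": "C1", "amount": "19.99", "note": "gift"}',
    '{"customer": "C2", "amount": "5", "unexpected_field": 42}',
]
def read_with_schema(raw_lines, schema):
    """Apply a schema (field name mapped to a converter) only at query time."""
    for line in raw_lines:
        record = json.loads(line)
        # Keep only the fields this reader asked for, converted to its types.
        yield {name: convert(record[name]) for name, convert in schema.items() if name in record}
# Two readers query the same raw data with different schemas.
billing_view = list(read_with_schema(RAW_LAKE, {"customer": str, "amount": float}))
audit_view = list(read_with_schema(RAW_LAKE, {"customer": str, "unexpected_field": int}))
```
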
75
00:03:42.120 --> 00:03:45.319
<v Speaker 2>That is a perfect analogy. What's fascinating here is how

76
00:03:45.360 --> 00:03:50.000
<v Speaker 2>that flexibility directly accelerates knowledge generation. How so? Well, by

77
00:03:50.080 --> 00:03:54.360
<v Speaker 2>keeping the leaf-level atomic data perfectly intact, you preserve

78
00:03:54.599 --> 00:03:57.159
<v Speaker 2>all the anomalies and the really subtle signals.

79
00:03:57.319 --> 00:03:59.520
<v Speaker 1>Ah, because you didn't scrub them out at the start. Exactly.

80
00:04:00.000 --> 00:04:02.800
<v Speaker 2>In exploratory data science, the actual insights are

81
00:04:02.840 --> 00:04:04.599
<v Speaker 2>hidden in the unstructured noise.

82
00:04:04.800 --> 00:04:05.000
<v Speaker 1>Right.

83
00:04:05.159 --> 00:04:08.280
<v Speaker 2>If you force data through a rigid schema-on-write

84
00:04:08.400 --> 00:04:11.599
<v Speaker 2>filter right at ingestion, you strip out those anomalies because

85
00:04:11.599 --> 00:04:13.439
<v Speaker 2>they just don't fit your predefined assumptions.

86
00:04:13.639 --> 00:04:15.240
<v Speaker 1>You lose what you didn't know you were looking for.

87
00:04:15.479 --> 00:04:20.720
<v Speaker 2>Precisely. Schema-on-read preserves those unknown variables for future models.

88
00:04:20.759 --> 00:04:24.360
<v Speaker 1>But Vermeulen makes it clear you can't just leave everything

89
00:04:24.399 --> 00:04:26.360
<v Speaker 1>floating in a chaotic lake forever.

90
00:04:26.680 --> 00:04:27.959
<v Speaker 2>No, that would be a disaster.

91
00:04:28.240 --> 00:04:31.480
<v Speaker 1>Right. Enter the Data Vault, which is a hybrid modeling

92
00:04:31.519 --> 00:04:34.040
<v Speaker 1>methodology created by Dan Linstedt.

93
00:04:33.759 --> 00:04:36.720
<v Speaker 2>Because we do need structure for business reporting, but

94
00:04:36.800 --> 00:04:39.079
<v Speaker 2>we want it without losing that agility.

95
00:04:39.560 --> 00:04:45.560
<v Speaker 1>So the Data Vault achieves this using three core architectural components, right, hubs, links,

96
00:04:45.600 --> 00:04:46.399
<v Speaker 1>and satellites.

97
00:04:46.600 --> 00:04:49.800
<v Speaker 2>Yeah, the mechanical genius of the Data Vault is its modularity.

98
00:04:50.319 --> 00:04:53.199
<v Speaker 2>Hubs act as the immutable business keys.

99
00:04:53.079 --> 00:04:55.279
<v Speaker 1>Like the absolute core identifiers, right?

100
00:04:55.160 --> 00:04:58.399
<v Speaker 2>Like a persistent customer ID. It never changes. Okay, and

101
00:04:58.439 --> 00:05:01.839
<v Speaker 2>the links? Links handle the transactional associations. They map

102
00:05:01.879 --> 00:05:05.199
<v Speaker 2>how hubs interact without holding any descriptive data themselves.

103
00:05:05.279 --> 00:05:07.120
<v Speaker 1>Got it, So where does the actual information go?

104
00:05:07.439 --> 00:05:10.680
<v Speaker 2>All the volatile descriptive context is pushed into the satellites.

105
00:05:11.000 --> 00:05:13.720
<v Speaker 1>So if the hub is the unchangeable concept of a

106
00:05:13.759 --> 00:05:16.639
<v Speaker 1>specific customer and the link represents the fact that they

107
00:05:16.680 --> 00:05:18.720
<v Speaker 1>interacted with a specific product.

108
00:05:18.439 --> 00:05:21.959
<v Speaker 2>The satellite holds their current address, their income bracket, and

109
00:05:22.000 --> 00:05:23.279
<v Speaker 2>the timestamp of the event.

110
00:05:23.600 --> 00:05:26.279
<v Speaker 1>Wow, why split it up so aggressively like that?

111
00:05:26.800 --> 00:05:30.959
<v Speaker 2>Because it isolates structural changes. Let's say your marketing department

112
00:05:31.040 --> 00:05:34.800
<v Speaker 2>suddenly starts collecting a dozen new demographic metrics on customers.

113
00:05:35.160 --> 00:05:37.079
<v Speaker 2>With a normal setup, you'd have to rebuild your core

114
00:05:37.120 --> 00:05:41.519
<v Speaker 2>tables or alter existing schemas. But here you simply attach

115
00:05:41.600 --> 00:05:44.839
<v Speaker 2>a brand new satellite to the existing hub. Oh, wow. Yeah,

116
00:05:44.879 --> 00:05:49.279
<v Speaker 2>it allows you to model incredibly complex, evolving enterprise environments

117
00:05:49.639 --> 00:05:54.879
<v Speaker 2>while maintaining a completely auditable historical record of every single change.

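NOTE
A minimal sketch of the hub, link, and satellite split described above, not code from the book.
The class layout, keys, dates, and attribute names are illustrative assumptions.
It shows new descriptive data arriving as an extra satellite while the hub and link stay untouched.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class Hub:
    business_key: str  # stable identifier, for example a customer ID
@dataclass(frozen=True)
class Link:
    hub_keys: tuple  # business keys of the hubs being associated
@dataclass(frozen=True)
class Satellite:
    parent_key: str  # business key of the hub this record describes
    load_time: str  # when this version of the description was recorded
    attributes: tuple  # descriptive (name, value) pairs
customer = Hub("CUST-001")
product = Hub("PROD-042")
purchase = Link((customer.business_key, product.business_key))
satellites = [Satellite("CUST-001", "2024-01-01", (("address", "1 Main St"),))]
# New demographic metrics arrive: append a satellite, leave the hub and link alone.
satellites.append(Satellite("CUST-001", "2024-06-01", (("income_bracket", "B"),)))
# The full history for the customer is every satellite attached to its hub key.
history = [s.attributes for s in satellites if s.parent_key == customer.business_key]
```
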
118
00:05:54.920 --> 00:05:58.160
<v Speaker 1>That's brilliant. Okay, so now we have a highly structured,

119
00:05:58.319 --> 00:06:02.160
<v Speaker 1>scalable reservoir. But a reservoir is useless if you don't

120
00:06:02.160 --> 00:06:04.839
<v Speaker 1>have the industrial machinery to pump and process the water.

121
00:06:05.199 --> 00:06:05.800
<v Speaker 2>Very true.

122
00:06:06.079 --> 00:06:09.600
<v Speaker 1>Let's move into the processing stack Vermeulen outlines. At

123
00:06:09.639 --> 00:06:13.240
<v Speaker 1>the absolute center of this arsenal is Apache Spark.

124
00:06:13.439 --> 00:06:17.879
<v Speaker 2>Spark completely changed the paradigm for distributed cluster computing.

125
00:06:17.879 --> 00:06:21.240
<v Speaker 2>Because it's so fast? Because it's resilient. When you are analyzing

126
00:06:21.360 --> 00:06:26.439
<v Speaker 2>terabytes of telemetry data, a single machine's memory will inevitably crash.

127
00:06:26.519 --> 00:06:27.959
<v Speaker 1>It just can't hold the weight, right.

128
00:06:28.439 --> 00:06:33.240
<v Speaker 2>Spark solves this by utilizing resilient distributed datasets, or RDDs.

129
00:06:33.399 --> 00:06:34.279
<v Speaker 1>Okay, what do those do?

130
00:06:34.800 --> 00:06:39.120
<v Speaker 2>It basically shatters the massive data set into partitions, distributes

131
00:06:39.160 --> 00:06:42.839
<v Speaker 2>them across thousands of worker nodes in a cluster, processes

132
00:06:42.879 --> 00:06:45.079
<v Speaker 2>the math in memory all at the same time, and

133
00:06:45.160 --> 00:06:48.160
<v Speaker 2>then aggregates the results back together seamlessly.

134
00:06:48.600 --> 00:06:53.000
<v Speaker 1>That is wild. And working alongside Spark is Apache Kafka.

135
00:06:53.720 --> 00:06:56.839
<v Speaker 1>If Spark is doing the heavy computational lifting, Kafka is

136
00:06:56.879 --> 00:06:59.560
<v Speaker 1>handling the sheer velocity of the ingestion.

137
00:06:59.680 --> 00:07:04.160
<v Speaker 2>Exactly. Kafka operates as a distributed publish-subscribe messaging system. Like

138
00:07:04.199 --> 00:07:07.920
<v Speaker 2>a massive router? Yeah. Imagine you have a global retail operation.

139
00:07:08.439 --> 00:07:12.839
<v Speaker 2>You've got thousands of edge devices, website clicks, supply chain updates.

140
00:07:12.600 --> 00:07:15.240
<v Speaker 1>Generating millions of events per second.

141
00:07:15.360 --> 00:07:19.639
<v Speaker 2>Right. Kafka ingests that entire stream. It guarantees fault-tolerant,

142
00:07:19.680 --> 00:07:21.839
<v Speaker 2>real-time delivery to the processing core.

143
00:07:21.360 --> 00:07:23.439
<v Speaker 1>So nothing gets lost.

144
00:07:23.480 --> 00:07:26.240
<v Speaker 2>Exactly. It ensures no packets are dropped even if a downstream

145
00:07:26.319 --> 00:07:27.920
<v Speaker 2>server briefly goes offline.

146
00:07:27.959 --> 00:07:30.240
<v Speaker 1>Here's where it gets really interesting. If we look at

147
00:07:30.279 --> 00:07:32.839
<v Speaker 1>the programming languages. Okay, we all know Python and R are

148
00:07:33.079 --> 00:07:36.759
<v Speaker 1>the standard languages for data science. Sure, but if Python

149
00:07:36.839 --> 00:07:40.839
<v Speaker 1>and R are the cognitive centers, like the brains running

150
00:07:40.839 --> 00:07:44.879
<v Speaker 1>the logical models, are Kafka and Spark basically the central

151
00:07:44.879 --> 00:07:47.879
<v Speaker 1>nervous system ensuring the signals actually travel through the giant

152
00:07:47.879 --> 00:07:49.600
<v Speaker 1>corporate body without collapsing?

153
00:07:49.959 --> 00:07:53.360
<v Speaker 2>That analogy perfectly maps to the technical architecture.

154
00:07:53.399 --> 00:07:53.920
<v Speaker 1>Oh, awesome.

155
00:07:54.079 --> 00:07:58.240
<v Speaker 2>Yeah, Python is exceptional for logical wrangling, right, but native

156
00:07:58.319 --> 00:08:02.240
<v Speaker 2>pandas DataFrames are heavily constrained by single-machine memory

157
00:08:02.240 --> 00:08:06.560
<v Speaker 2>limits. They max out. Exactly. And similarly, R is unmatched

158
00:08:06.560 --> 00:08:11.600
<v Speaker 2>for statistical rigor. It creates complex visualizations with libraries like ggplot2.

159
00:08:11.439 --> 00:08:13.480
<v Speaker 1>But you can't easily scale it.

160
00:08:13.680 --> 00:08:17.040
<v Speaker 2>Right. To apply that statistical rigor to a forty-zettabyte

161
00:08:17.079 --> 00:08:18.240
<v Speaker 2>ocean, you need a bridge.

162
00:08:18.079 --> 00:08:20.000
<v Speaker 1>Which is where the tools come in.

163
00:08:20.160 --> 00:08:23.600
<v Speaker 2>Yeah, that's why Vermeulen highlights packages like sparklyr. It

164
00:08:23.639 --> 00:08:26.839
<v Speaker 2>allows data scientists to write standard R code that executes

165
00:08:26.920 --> 00:08:30.199
<v Speaker 2>natively across a massive Spark cluster. Oh, I see. The

166
00:08:30.279 --> 00:08:34.080
<v Speaker 2>distributed tools free the analytical brains from their single server skulls.

167
00:08:34.399 --> 00:08:36.639
<v Speaker 1>That's a great way to put it. And we can't

168
00:08:36.679 --> 00:08:40.120
<v Speaker 1>ignore the edge devices feeding the system either. The text

169
00:08:40.159 --> 00:08:44.960
<v Speaker 1>specifically highlights MQTT, MQ Telemetry Transport.

170
00:08:44.600 --> 00:08:46.000
<v Speaker 2>A really vital protocol.

171
00:08:46.120 --> 00:08:48.759
<v Speaker 1>Yeah, because if you have an incredibly dense array of

172
00:08:48.799 --> 00:08:53.840
<v Speaker 1>IoT sensors, say monitoring temperature fluctuations across a massive agricultural grid,

173
00:08:54.320 --> 00:08:58.000
<v Speaker 1>standard HTTP protocols carry way too much header overhead. They're

174
00:08:58.039 --> 00:09:03.440
<v Speaker 1>just too bulky. Right. MQTT uses a microscopic footprint. It's the

175
00:09:03.440 --> 00:09:07.679
<v Speaker 1>perfect protocol to shoot continuous, low-bandwidth telemetry data directly

176
00:09:07.720 --> 00:09:09.279
<v Speaker 1>into your Kafka streams.

177
00:09:09.039 --> 00:09:13.200
<v Speaker 2>And mastering that integration, knowing how to capture lightweight MQTT

178
00:09:13.440 --> 00:09:17.600
<v Speaker 2>signals at the edge, stream them flawlessly through Kafka, crunch

179
00:09:17.679 --> 00:09:20.480
<v Speaker 2>the distributed math with Spark, and orchestrate it all with

180
00:09:20.519 --> 00:09:21.200
<v Speaker 2>Python.

181
00:09:21.279 --> 00:09:22.080
<v Speaker 1>That's the real trick.

182
00:09:22.240 --> 00:09:24.840
<v Speaker 2>Yeah, that is the exact threshold that separates a local

183
00:09:24.879 --> 00:09:27.279
<v Speaker 2>data analyst from an enterprise grade data scientist.

184
00:09:27.440 --> 00:09:29.720
<v Speaker 1>Okay, but having a garage full of state of the

185
00:09:29.840 --> 00:09:32.120
<v Speaker 1>art tools doesn't mean you actually know how to build

186
00:09:32.120 --> 00:09:33.000
<v Speaker 1>a functional car.

187
00:09:33.320 --> 00:09:34.440
<v Speaker 2>No, it definitely doesn't.

188
00:09:34.600 --> 00:09:38.519
<v Speaker 1>We have the stack, but we need a blueprint which

189
00:09:38.519 --> 00:09:41.559
<v Speaker 1>brings us to the processing frameworks required to manage these

190
00:09:41.559 --> 00:09:45.919
<v Speaker 1>deployments without, you know, causing catastrophic failures.

191
00:09:46.080 --> 00:09:50.200
<v Speaker 2>Because the industry graveyard is completely full of brilliant algorithms

192
00:09:50.200 --> 00:09:51.200
<v Speaker 2>that died in production.

193
00:09:51.360 --> 00:09:52.600
<v Speaker 1>Why do they die?

194
00:09:52.519 --> 00:09:56.039
<v Speaker 2>Because there was no standardized engineering process. Vermeulen

195
00:09:56.159 --> 00:09:59.600
<v Speaker 2>champions CRISP-DM, which stands for the Cross-Industry Standard Process

196
00:09:59.600 --> 00:10:02.360
<v Speaker 2>for Data Mining. Right. It breaks the workflow into a

197
00:10:02.399 --> 00:10:08.960
<v Speaker 2>really strict sequence: business understanding, data understanding, data preparation, modeling, evaluation,

198
00:10:09.399 --> 00:10:10.159
<v Speaker 2>and deployment.

199
00:10:10.559 --> 00:10:13.399
<v Speaker 1>It seems like jumping straight into modeling without the business

200
00:10:13.480 --> 00:10:16.960
<v Speaker 1>understanding layer is exactly why so many data pilots fail

201
00:10:16.960 --> 00:10:18.440
<v Speaker 1>when they hit the production floor.

202
00:10:18.240 --> 00:10:21.799
<v Speaker 2>Oh, one hundred percent. And the text emphasizes that CRISP-DM

203
00:10:21.919 --> 00:10:23.759
<v Speaker 2>is inherently cyclical, not linear.

204
00:10:23.919 --> 00:10:25.600
<v Speaker 1>Right. You don't just march from step one to six

205
00:10:25.600 --> 00:10:26.480
<v Speaker 1>and clock out for the day.

206
00:10:26.559 --> 00:10:29.879
<v Speaker 2>Far from it. The cyclical nature is a defensive mechanism

207
00:10:29.919 --> 00:10:33.399
<v Speaker 2>against bad assumptions. Well, you might spend weeks in the

208
00:10:33.399 --> 00:10:37.320
<v Speaker 2>modeling phase only to hit the evaluation phase and realize

209
00:10:37.320 --> 00:10:39.679
<v Speaker 2>your predictive accuracy is hovering around fifty percent.

210
00:10:39.480 --> 00:10:41.440
<v Speaker 1>Basically a coin toss.

211
00:10:41.559 --> 00:10:44.919
<v Speaker 2>Right, That failure forces you back to data preparation to

212
00:10:45.000 --> 00:10:47.840
<v Speaker 2>engineer new features, or sometimes all the way back to

213
00:10:47.919 --> 00:10:51.799
<v Speaker 2>business understanding because the original problem was framed incorrectly.

214
00:10:51.919 --> 00:10:56.639
<v Speaker 1>Wow. And to operationalize this cycle at scale, Vermeulen outlines

215
00:10:56.679 --> 00:10:59.919
<v Speaker 1>a five-layer data science framework. Yes. He grounds the

216
00:11:00.360 --> 00:11:05.200
<v Speaker 1>framework using a fictional corporate sandbox called VKHCG, the Vermeulen-Krennwallner-Hillman-Clark

217
00:11:05.240 --> 00:11:08.279
<v Speaker 1>Group. It's quite a mouthful. It is, but

218
00:11:08.399 --> 00:11:12.519
<v Speaker 1>it's a massive conglomerate with distinct subsidiaries handling IT networks,

219
00:11:12.759 --> 00:11:16.519
<v Speaker 1>global billboard advertising, logistics, and forex trading.

220
00:11:16.919 --> 00:11:20.000
<v Speaker 2>It serves as the perfect stress test environment for the framework.

221
00:11:20.159 --> 00:11:22.559
<v Speaker 1>So how do the five layers stack up to manage

222
00:11:22.559 --> 00:11:23.279
<v Speaker 1>this complexity?

223
00:11:23.519 --> 00:11:26.200
<v Speaker 2>At the apex is the business layer, which dictates the

224
00:11:26.240 --> 00:11:27.440
<v Speaker 2>actual enterprise needs.

225
00:11:27.519 --> 00:11:27.799
<v Speaker 1>Okay.

226
00:11:27.919 --> 00:11:30.480
<v Speaker 2>Below that is the utility layer, which is a centralized

227
00:11:30.519 --> 00:11:31.960
<v Speaker 2>vault for repeatable algorithms.

228
00:11:32.039 --> 00:11:32.279
<v Speaker 1>Got it.

229
00:11:32.399 --> 00:11:36.399
<v Speaker 2>Then the operational management layer handles scheduling and automated triggers.

230
00:11:36.320 --> 00:11:37.639
<v Speaker 1>Like running the jobs, right?

231
00:11:38.120 --> 00:11:42.000
<v Speaker 2>The audit, balance, and control layer strictly monitors data lineage

232
00:11:42.039 --> 00:11:45.879
<v Speaker 2>and compliance. Super important. And finally, the functional layer at

233
00:11:45.879 --> 00:11:49.240
<v Speaker 2>the bottom is where the actual algorithmic heavy lifting and

234
00:11:49.360 --> 00:11:51.399
<v Speaker 2>data transformations execute.

235
00:11:51.639 --> 00:11:55.480
<v Speaker 1>Looking at this architecture, it becomes painfully obvious why so

236
00:11:55.559 --> 00:11:58.960
<v Speaker 1>many data pilots fail. Oh, yeah. A data scientist will

237
00:11:58.960 --> 00:12:02.320
<v Speaker 1>build a brilliant predictive model in a Jupyter notebook on

238
00:12:02.360 --> 00:12:03.919
<v Speaker 1>their local machine.

239
00:12:03.480 --> 00:12:06.840
<v Speaker 2>Which is effectively operating purely in the functional layer.

240
00:12:06.679 --> 00:12:09.279
<v Speaker 1>Exactly, But when they try to deploy it across an

241
00:12:09.399 --> 00:12:14.080
<v Speaker 1>enterprise like VKHCG without the operational management layer to schedule

242
00:12:14.080 --> 00:12:17.919
<v Speaker 1>the pipelines or the audit layer to monitor data drift.

243
00:12:17.840 --> 00:12:21.840
<v Speaker 2>The model immediately fractures under real-world conditions. It just shatters. Yeah,

244
00:12:22.440 --> 00:12:25.159
<v Speaker 2>if we connect this to the bigger picture, the primary

245
00:12:25.279 --> 00:12:29.120
<v Speaker 2>value of the five layer framework isn't merely bureaucratic organization.

246
00:12:29.519 --> 00:12:30.240
<v Speaker 1>What is it? Then?

247
00:12:30.399 --> 00:12:34.279
<v Speaker 2>It provides the architectural scaffolding required to transition a localized,

248
00:12:34.320 --> 00:12:38.919
<v Speaker 2>fragile experiment into an automated, fault-tolerant production environment. Making

249
00:12:38.919 --> 00:12:43.480
<v Speaker 2>it real. Exactly. A model without operational integration and continuous

250
00:12:43.519 --> 00:12:46.879
<v Speaker 2>auditing is effectively useless to the broader enterprise.

251
00:12:47.080 --> 00:12:50.080
<v Speaker 1>Speaking of the broader enterprise, let's look at the sheer

252
00:12:50.159 --> 00:12:53.440
<v Speaker 1>logistical nightmare of a conglomerate like VKHCG.

253
00:12:53.519 --> 00:12:54.200
<v Speaker 2>It's massive.

254
00:12:54.440 --> 00:12:58.159
<v Speaker 1>Yeah, you have Krennwallner AG generating video files and

255
00:12:58.240 --> 00:13:02.399
<v Speaker 1>high-res images from billboards. Clark Ltd is generating thousands of

256
00:13:02.480 --> 00:13:06.480
<v Speaker 1>CSVs of forex trading data. Hillman Ltd is producing

257
00:13:06.600 --> 00:13:10.600
<v Speaker 1>XML routing data. So much variety. Right. So how do

258
00:13:10.679 --> 00:13:14.519
<v Speaker 1>these distinct layers and subsidiaries communicate without drowning in an

259
00:13:14.600 --> 00:13:16.240
<v Speaker 1>endless sea of custom translation APIs?

260
00:13:16.320 --> 00:13:20.320
<v Speaker 2>That integration bottleneck is solved by the utility layer,

261
00:13:20.480 --> 00:13:25.480
<v Speaker 2>specifically through an architectural standard Vermeulen introduces called HORUS.

262
00:13:25.360 --> 00:13:28.879
<v Speaker 1>Which stands for the Homogeneous Ontology for Recursive Uniform Schema.

263
00:13:29.000 --> 00:13:29.480
<v Speaker 2>That's the one.

264
00:13:29.519 --> 00:13:32.399
<v Speaker 1>It's essentially a universal internal adapter. Let's break down the

265
00:13:32.440 --> 00:13:35.120
<v Speaker 1>actual mathematics of why this is necessary, because the technical

266
00:13:35.159 --> 00:13:38.279
<v Speaker 1>debt of point to point integration is just staggering.

267
00:13:38.360 --> 00:13:39.720
<v Speaker 2>It really is. Let's hear the math.

268
00:13:39.840 --> 00:13:42.600
<v Speaker 1>Okay, if an enterprise has one hundred different data formats

269
00:13:43.080 --> 00:13:45.519
<v Speaker 1>and you want any system to talk to any other system,

270
00:13:45.960 --> 00:13:49.360
<v Speaker 1>you have to write direct converters for every single combination.

271
00:13:49.840 --> 00:13:52.720
<v Speaker 1>That's one hundred times ninety nine. You're looking at nearly

272
00:13:52.799 --> 00:13:58.480
<v Speaker 1>ten thousand custom brittle integration scripts just to maintain baseline communication.

273
00:13:58.759 --> 00:14:01.639
<v Speaker 2>And every time an external vendor updates an API,

274
00:14:02.279 --> 00:14:06.279
<v Speaker 2>dozens of those point-to-point scripts break simultaneously.

275
00:14:05.480 --> 00:14:07.639
<v Speaker 1>Which is a nightmare for the engineers.

276
00:14:07.240 --> 00:14:10.919
<v Speaker 2>Absolute nightmare. But by instituting HORUS as the central hub,

277
00:14:11.320 --> 00:14:14.399
<v Speaker 2>you mandate that every incoming format is translated into the

278
00:14:14.399 --> 00:14:15.200
<v Speaker 2>HORUS standard first.

279
00:14:15.240 --> 00:14:15.559
<v Speaker 1>Okay.

280
00:14:15.720 --> 00:14:19.759
<v Speaker 2>If a downstream system needs that data, it translates

281
00:14:19.759 --> 00:14:22.039
<v Speaker 2>it from HORUS into its target format.

282
00:14:22.200 --> 00:14:24.320
<v Speaker 1>Wait, wait, I want to push back on that architecture

283
00:14:24.360 --> 00:14:27.320
<v Speaker 1>for a second. Sure. Isn't translating format A into HORUS

284
00:14:27.320 --> 00:14:30.000
<v Speaker 1>and then HORUS into format B, aren't we just injecting

285
00:14:30.039 --> 00:14:33.919
<v Speaker 1>a middleman into every single data pipeline. Doesn't that intermediate

286
00:14:33.919 --> 00:14:38.360
<v Speaker 1>step add massive computational overhead and latency? Why is this

287
00:14:38.399 --> 00:14:39.879
<v Speaker 1>actually faster in the long run.

288
00:14:40.039 --> 00:14:43.000
<v Speaker 2>It's a really critical trade off. Yes, you introduce a

289
00:14:43.039 --> 00:14:46.759
<v Speaker 2>fractional computational cost by serializing and deserializing through an

290
00:14:46.759 --> 00:14:50.320
<v Speaker 2>intermediate schema. There is a cost, but consider the alternative.

291
00:14:50.679 --> 00:14:53.639
<v Speaker 2>By using a hub-and-spoke model, integrating one hundred

292
00:14:53.639 --> 00:14:58.279
<v Speaker 2>formats only requires two hundred scripts: one to convert into HORUS

293
00:14:58.399 --> 00:15:00.200
<v Speaker 2>and one to convert out, for each format.

294
00:15:00.000 --> 00:15:01.519
<v Speaker 1>That is a huge difference.

295
00:15:01.559 --> 00:15:05.159
<v Speaker 2>It's a ninety-eight percent savings in development time. When

296
00:15:05.200 --> 00:15:07.639
<v Speaker 2>format one-oh-one is introduced, you don't write one

297
00:15:07.720 --> 00:15:10.159
<v Speaker 2>hundred new integrations; you write exactly two.

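NOTE
A minimal sketch of the converter arithmetic quoted above, assuming one converter per ordered pair of formats.
The function names are illustrative and do not come from the book.
With 100 formats this gives 9,900 direct converters versus 200 through a shared hub format, roughly a 98 percent reduction.
```python
def point_to_point(n):
    """Converters needed when every format converts directly to every other."""
    return n * (n - 1)
def hub_and_spoke(n):
    """Converters needed with one shared intermediate format: one in, one out."""
    return 2 * n
n = 100
direct = point_to_point(n)
via_hub = hub_and_spoke(n)
# Fraction of converters avoided by routing everything through the hub format.
savings = 1 - via_hub / direct
# Adding one more format costs two new converters in the hub-and-spoke layout.
extra_for_new_format = hub_and_spoke(n + 1) - hub_and_spoke(n)
print(direct, via_hub, round(savings * 100, 1), extra_for_new_format)
```
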
298
00:15:10.360 --> 00:15:12.279
<v Speaker 1>Wow, Okay, that makes perfect sense.

299
00:15:12.320 --> 00:15:15.720
<v Speaker 2>The microscopic increase in compute latency is heavily outweighed by

300
00:15:15.720 --> 00:15:18.840
<v Speaker 2>the elimination of thousands of hours of developer maintenance and

301
00:15:18.919 --> 00:15:19.960
<v Speaker 2>pipeline fragility.

302
00:15:20.200 --> 00:15:23.600
<v Speaker 1>And HORUS isn't just for tabular data either. The text

303
00:15:23.600 --> 00:15:26.679
<v Speaker 1>provides some wild examples of how the utility layer forces

304
00:15:26.759 --> 00:15:30.480
<v Speaker 1>complex unstructured data into this homogeneous format.

305
00:15:30.559 --> 00:15:32.919
<v Speaker 2>Yeah, the image extraction is crazy.

306
00:15:32.559 --> 00:15:35.080
<v Speaker 1>It really is. Yeah, Vermeulen details an algorithm

307
00:15:35.080 --> 00:15:38.279
<v Speaker 1>that takes a JPEG image of a dog named Angus,

308
00:15:38.399 --> 00:15:41.720
<v Speaker 1>great name, and it extracts the exact red, green, blue,

309
00:15:41.720 --> 00:15:44.320
<v Speaker 1>and alpha transparency values for every single pixel.

310
00:15:44.039 --> 00:15:45.759
<v Speaker 2>Just tearing the image apart.

311
00:15:46.000 --> 00:15:49.240
<v Speaker 1>Yeah, and it flattens the entire visual into a massive

312
00:15:49.320 --> 00:15:53.039
<v Speaker 1>data frame of raw numerical arrays. And he applies the

313
00:15:53.080 --> 00:15:57.519
<v Speaker 1>exact same logic to MP4 video files, extracting frame

314
00:15:57.559 --> 00:15:58.519
<v Speaker 1>by frame matrices.

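NOTE
A minimal sketch of flattening an image into tabular rows, as described above; it is not the book's algorithm.
The tiny hard-coded 2x2 RGBA grid and the column names are illustrative assumptions.
A real pipeline would first decode the image file with an imaging library.
```python
# Hypothetical 2x2 image: each pixel is a (red, green, blue, alpha) tuple.
image = [
    [(255, 0, 0, 255), (0, 255, 0, 255)],
    [(0, 0, 255, 255), (255, 255, 255, 0)],
]
def flatten_pixels(pixels):
    """Turn a height-by-width grid of RGBA tuples into one flat row per pixel."""
    rows = []
    for y, line in enumerate(pixels):
        for x, (r, g, b, a) in enumerate(line):
            rows.append({"x": x, "y": y, "red": r, "green": g, "blue": b, "alpha": a})
    return rows
# The result is plain tabular data that generic numeric tooling can consume.
table = flatten_pixels(image)
```
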
315
00:15:58.960 --> 00:16:03.720
<v Speaker 2>Right, because by mathematically flattening complex visual or audio data

316
00:16:03.759 --> 00:16:07.600
<v Speaker 2>into a standardized HORUS structure, you allow standard machine learning

317
00:16:07.639 --> 00:16:09.600
<v Speaker 2>libraries to process it.

318
00:16:09.320 --> 00:16:11.960
<v Speaker 1>Because they usually need tabular numerical inputs.

319
00:16:12.000 --> 00:16:14.879
<v Speaker 2>Right, exactly. Now they can process a video file using

320
00:16:14.879 --> 00:16:18.360
<v Speaker 2>the exact same underlying logic they would use to analyze

321
00:16:18.360 --> 00:16:19.600
<v Speaker 2>a financial spreadsheet.
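
NOTE
Editor's sketch of the flattening idea described above, not the book's own code.
It assumes the picture has already been decoded into a grid of RGBA tuples;
flatten_image and tiny_image are hypothetical names used only for illustration.
Each pixel becomes one tabular row, which is the shape most ML libraries expect.
```python
from typing import Dict, List, Tuple
Pixel = Tuple[int, int, int, int]  # (red, green, blue, alpha), each 0-255
def flatten_image(pixels: List[List[Pixel]]) -> List[Dict[str, int]]:
    """Turn a height x width grid of RGBA pixels into one row per pixel."""
    rows: List[Dict[str, int]] = []
    for y, scanline in enumerate(pixels):
        for x, (r, g, b, a) in enumerate(scanline):
            rows.append({"x": x, "y": y, "red": r, "green": g, "blue": b, "alpha": a})
    return rows
# Hypothetical 2x2 image standing in for a decoded photo.
tiny_image = [
    [(255, 0, 0, 255), (0, 255, 0, 255)],
    [(0, 0, 255, 255), (255, 255, 255, 128)],
]
table = flatten_image(tiny_image)
```
A video would presumably add a frame index column and repeat the same step per frame.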

322
00:16:19.799 --> 00:16:21.159
<v Speaker 1>That is mind-blowing.

323
00:16:21.080 --> 00:16:23.879
<v Speaker 2>It is. And because it's stored in the utility layer, any

324
00:16:23.919 --> 00:16:27.480
<v Speaker 2>engineer across the enterprise can call that verified image extraction

325
00:16:27.559 --> 00:16:30.720
<v Speaker 2>algorithm without having to reinvent the mathematical wheel.

326
00:16:30.799 --> 00:16:33.919
<v Speaker 1>Which brings us to the final and unequivocally most critical

327
00:16:34.039 --> 00:16:34.519
<v Speaker 1>piece of.

328
00:16:34.440 --> 00:16:35.960
<v Speaker 2>The framework, the top of the pyramid.

329
00:16:36.039 --> 00:16:38.759
<v Speaker 1>Right, we have the data lakes, the Spark clusters, the

330
00:16:38.799 --> 00:16:42.440
<v Speaker 1>CRISP-DM blueprints, and the HORUS universal translators. But all

331
00:16:42.480 --> 00:16:46.000
<v Speaker 1>of this flawless engineering is absolutely worthless if it solves

332
00:16:46.039 --> 00:16:47.120
<v Speaker 1>the wrong human problems.

333
00:16:47.200 --> 00:16:47.960
<v Speaker 2>Totally worthless.

334
00:16:48.000 --> 00:16:50.759
<v Speaker 1>We have to ascend to the top the business layer.

335
00:16:51.159 --> 00:16:55.120
<v Speaker 2>This is where non-technical functional requirements actually dictate the

336
00:16:55.200 --> 00:17:01.120
<v Speaker 2>engineering parameters. Right. Vermeulen leans heavily on the MoSCoW prioritization method here.

337
00:17:01.159 --> 00:17:04.319
<v Speaker 1>MoSCoW, that's must have, should have, could have, won't

338
00:17:04.359 --> 00:17:05.440
<v Speaker 1>have. Exactly.

339
00:17:05.839 --> 00:17:11.359
<v Speaker 2>It forces stakeholders to brutally separate mission critical analytical needs

340
00:17:11.680 --> 00:17:14.319
<v Speaker 2>from purely aspirational vanity metrics.

341
00:17:14.359 --> 00:17:16.000
<v Speaker 1>And you have to do that before a single line of

342
00:17:16.000 --> 00:17:19.000
<v Speaker 1>code is written. Precisely. Once those strict requirements are set,

343
00:17:19.359 --> 00:17:22.359
<v Speaker 1>the business logic has to be modeled. The text introduces

344
00:17:22.400 --> 00:17:25.799
<v Speaker 1>sun models, developed by Mark Whitehorn, to handle this mapping.

345
00:17:26.200 --> 00:17:29.799
<v Speaker 2>Sun models provide a phenomenal way to separate business facts

346
00:17:29.839 --> 00:17:30.680
<v Speaker 2>from context.

347
00:17:30.880 --> 00:17:31.559
<v Speaker 1>How do they work.

348
00:17:32.119 --> 00:17:35.319
<v Speaker 2>The center of the model represents the fact. That's a specific,

349
00:17:35.559 --> 00:17:37.799
<v Speaker 2>undeniable event, like a financial transaction.

350
00:17:37.960 --> 00:17:38.839
<v Speaker 1>Okay, that's the core.

351
00:17:39.039 --> 00:17:43.559
<v Speaker 2>Right. Radiating outward are the dimensions. These are the contextual

352
00:17:43.599 --> 00:17:46.920
<v Speaker 2>realities of that event, such as the customer's geographic location

353
00:17:47.160 --> 00:17:49.759
<v Speaker 2>or the store's operating hours at the exact time of

354
00:17:49.799 --> 00:17:51.119
<v Speaker 2>the transaction.

355
00:17:50.799 --> 00:17:55.039
<v Speaker 1>And managing those dimensions over time is surprisingly complex, isn't it? Incredibly.

356
00:17:55.160 --> 00:18:00.440
<v Speaker 1>The book highlights slowly changing dimensions, specifically SCD type two,

357
00:18:01.039 --> 00:18:05.279
<v Speaker 1>which uses an effective date column. There's a brilliant historical

358
00:18:05.319 --> 00:18:08.319
<v Speaker 1>example used to explain why this matters: the Dutch explorer.

359
00:18:08.400 --> 00:18:10.480
<v Speaker 1>Really? Yes, tracking Dr. Jacob Roggeveen.

360
00:18:10.680 --> 00:18:14.079
<v Speaker 2>Right, if you look at standard relational databases, they often

361
00:18:14.119 --> 00:18:16.640
<v Speaker 2>default to what we call SCD type one, which is

362
00:18:16.839 --> 00:18:18.440
<v Speaker 2>simple overwriting.

363
00:18:17.960 --> 00:18:20.279
<v Speaker 1>Meaning they just replace the old data. Yeah.

364
00:18:20.440 --> 00:18:23.960
<v Speaker 2>So, if Dr. Roggeveen moves from his home in Middelburg

365
00:18:24.079 --> 00:18:28.400
<v Speaker 2>to Easter Island in seventeen twenty two, an SCD type

366
00:18:28.400 --> 00:18:30.720
<v Speaker 2>one system just overwrites his address.

367
00:18:30.359 --> 00:18:32.000
<v Speaker 1>Field, which seems fine at first glance.

368
00:18:32.079 --> 00:18:35.319
<v Speaker 2>But the problem is you've permanently destroyed your historical context.

369
00:18:35.480 --> 00:18:38.200
<v Speaker 1>Right. But with SCD type two, you don't overwrite. No,

370
00:18:38.519 --> 00:18:40.240
<v Speaker 1>you add a new row and you manage it with

371
00:18:40.240 --> 00:18:43.119
<v Speaker 1>an effective date. You log that he resided in Middelburg

372
00:18:43.160 --> 00:18:45.400
<v Speaker 1>with an end date of April fourth, seventeen twenty two,

373
00:18:45.839 --> 00:18:48.440
<v Speaker 1>and a new row shows him residing on Easter Island

374
00:18:48.480 --> 00:18:52.359
<v Speaker 1>effective April fifth, seventeen twenty two. Exactly. Why is maintaining

375
00:18:52.359 --> 00:18:55.519
<v Speaker 1>that temporal timeline so critical for advanced data science.

376
00:18:55.240 --> 00:18:58.480
<v Speaker 2>Because predictive machine learning models absolutely rely on point in

377
00:18:58.519 --> 00:19:03.519
<v Speaker 2>time accuracy. Say your algorithm is analyzing why certain customer

378
00:19:03.559 --> 00:19:07.559
<v Speaker 2>segments canceled their subscriptions five years ago. Right. It needs

379
00:19:07.599 --> 00:19:12.240
<v Speaker 2>to evaluate the geographic and demographic dimensions of those customers

380
00:19:12.279 --> 00:19:14.680
<v Speaker 2>as they existed five years ago, not who they are

381
00:19:14.720 --> 00:19:20.359
<v Speaker 2>today. Exactly. If your database has overwritten their historical addresses

382
00:19:20.400 --> 00:19:23.799
<v Speaker 2>with their current ones, your training data is contaminated with

383
00:19:23.839 --> 00:19:25.000
<v Speaker 2>future knowledge.
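
NOTE
Editor's sketch of the SCD type 2 pattern described above, under stated assumptions.
AddressVersion, record_move, and location_on are hypothetical names, and the
Middelburg start date is a placeholder that does not come from the episode.
The point is that a move closes the old row and appends a new one, so a
point-in-time lookup still returns the address that was valid on that day.
```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional
@dataclass
class AddressVersion:
    person: str
    location: str
    effective_from: date
    effective_to: Optional[date] = None  # None marks the current row
def record_move(history: List[AddressVersion], person: str, new_location: str, moved_on: date) -> None:
    """SCD type 2: end-date the open row, then append a new row (no overwrite)."""
    for row in history:
        if row.person == person and row.effective_to is None:
            row.effective_to = moved_on - timedelta(days=1)
    history.append(AddressVersion(person, new_location, moved_on))
def location_on(history: List[AddressVersion], person: str, day: date) -> Optional[str]:
    """Point-in-time lookup: the location that was valid on the given day."""
    for row in history:
        ends = row.effective_to or date.max  # open rows are valid indefinitely
        if row.person == person and row.effective_from <= day <= ends:
            return row.location
    return None
# Placeholder start date (assumption); the move date follows the transcript.
history = [AddressVersion("Jacob Roggeveen", "Middelburg", date(1700, 1, 1))]
record_move(history, "Jacob Roggeveen", "Easter Island", date(1722, 4, 5))
```
A type 1 design would keep only the Easter Island row, so a model trained on
1721 events would see an address that did not yet apply.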

384
00:19:24.640 --> 00:19:27.960
<v Speaker 1>Which completely invalidates the model's predictive power. It ruins the

385
00:19:28.000 --> 00:19:31.039
<v Speaker 1>whole thing. And the strictness required in the data models

386
00:19:31.119 --> 00:19:34.119
<v Speaker 1>must also be applied to the human language driving them.

387
00:19:34.559 --> 00:19:37.680
<v Speaker 1>The text offers a brutal warning about the danger of

388
00:19:37.799 --> 00:19:40.640
<v Speaker 1>weak words in the business layer's requirements.

389
00:19:40.720 --> 00:19:45.240
<v Speaker 2>Oh yes. Business analysts frequently write non-functional requirements stating

390
00:19:45.279 --> 00:19:48.839
<v Speaker 2>a dashboard must be user friendly or a streaming pipeline

391
00:19:48.880 --> 00:19:50.559
<v Speaker 2>must operate seamlessly.

392
00:19:50.200 --> 00:19:51.759
<v Speaker 1>Which sounds good in the meeting.

393
00:19:51.599 --> 00:19:55.920
<v Speaker 2>Sure, but from an engineering perspective, those words are poisoned because.

394
00:19:55.599 --> 00:19:59.279
<v Speaker 1>They are fundamentally untestable. You can't write a unit test

395
00:19:59.319 --> 00:20:04.000
<v Speaker 1>for seamless. You have to define strict binary thresholds like

396
00:20:04.440 --> 00:20:07.599
<v Speaker 1>the Kafka stream will process fifty thousand events per second

397
00:20:07.920 --> 00:20:10.039
<v Speaker 1>with latency under one hundred milliseconds.

398
00:20:10.160 --> 00:20:14.160
<v Speaker 2>Yes, if you don't translate qualitative business desires into highly

399
00:20:14.240 --> 00:20:20.240
<v Speaker 2>specific quantitative engineering parameters, expectations misalign, and enterprise scale projects

400
00:20:20.279 --> 00:20:21.759
<v Speaker 2>fail right before deployment.
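
NOTE
Editor's sketch of turning a weak word like "seamlessly" into a binary check.
StreamRequirement and meets are hypothetical names; the thresholds are the ones
quoted in the episode (fifty thousand events per second, latency under 100 ms).
The measured values would come from a load test, which is not shown here.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class StreamRequirement:
    min_events_per_second: int
    max_latency_ms: float  # latency must be strictly under this value
def meets(req: StreamRequirement, events_per_second: float, latency_ms: float) -> bool:
    """Return True only if both quantitative thresholds are satisfied."""
    return events_per_second >= req.min_events_per_second and latency_ms < req.max_latency_ms
kafka_requirement = StreamRequirement(min_events_per_second=50_000, max_latency_ms=100.0)
```
Because the result is a plain boolean, the requirement can sit in an automated
test suite instead of a meeting note.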

401
00:20:22.119 --> 00:20:24.960
<v Speaker 1>So what does this all mean? A data scientist could

402
00:20:24.960 --> 00:20:28.759
<v Speaker 1>write perfect Scala code, build a flawless Kafka stream, translate

403
00:20:28.839 --> 00:20:32.279
<v Speaker 1>every format perfectly through HORUS. But if a business analyst

404
00:20:32.279 --> 00:20:35.319
<v Speaker 1>writes the word seamlessly in the requirements or forgets to

405
00:20:35.359 --> 00:20:39.519
<v Speaker 1>properly design SCD type two dimensions, the whole multimillion-dollar

406
00:20:39.599 --> 00:20:42.480
<v Speaker 1>architecture collapses purely due to human ambiguity.

407
00:20:42.680 --> 00:20:45.440
<v Speaker 2>That is the uncompromising reality of data science.

408
00:20:45.480 --> 00:20:45.759
<v Speaker 1>Wow.

409
00:20:45.960 --> 00:20:48.480
<v Speaker 2>And that reality becomes legally perilous when we factor in

410
00:20:48.559 --> 00:20:52.160
<v Speaker 2>modern regulatory frameworks like GDPR in Europe or HIPAA in

411
00:20:52.200 --> 00:20:54.359
<v Speaker 2>the US, which Vermeulen addresses thoroughly.

412
00:20:54.480 --> 00:20:57.559
<v Speaker 1>Yeah, the right to be forgotten is a terrifying technical challenge.

413
00:20:57.720 --> 00:21:01.480
<v Speaker 1>Under GDPR, a consumer can legally demand that an enterprise eradicate

414
00:21:01.640 --> 00:21:03.279
<v Speaker 1>every trace of their personal.

415
00:21:03.039 --> 00:21:05.160
<v Speaker 2>Data, every single trace, which is huge.

416
00:21:05.480 --> 00:21:06.319
<v Speaker 1>Yeah.

417
00:21:06.680 --> 00:21:10.799
<v Speaker 2>This raises an important question regarding architectural accountability. Let's say

418
00:21:10.839 --> 00:21:14.119
<v Speaker 2>you have ingested massive amounts of unstructured data into a

419
00:21:14.240 --> 00:21:17.880
<v Speaker 2>schema-on-read data lake. Okay, but you neglected to

420
00:21:18.000 --> 00:21:22.079
<v Speaker 2>implement the audit, balance, and control layer to track exactly

421
00:21:22.119 --> 00:21:26.240
<v Speaker 2>how that specific user's data propagated through your HORUS translations

422
00:21:26.319 --> 00:21:29.400
<v Speaker 2>and into your downstream machine learning model. Oh man, you

423
00:21:29.480 --> 00:21:32.759
<v Speaker 2>simply cannot delete them because you can't find them, and.

424
00:21:32.880 --> 00:21:36.559
<v Speaker 1>The legal penalty for failing to comply can reach four

425
00:21:36.599 --> 00:21:38.400
<v Speaker 1>percent of your global corporate.

426
00:21:38.160 --> 00:21:42.160
<v Speaker 2>Turnover, which transforms data architecture from a back-office IT

427
00:21:42.480 --> 00:21:45.359
<v Speaker 2>function into a literal existential corporate threat.

428
00:21:45.440 --> 00:21:48.759
<v Speaker 1>It really does. Yeah, let's recap the intense journey we've

429
00:21:48.759 --> 00:21:51.759
<v Speaker 1>mapped out today for the listener. We bypassed the bottlenecks

430
00:21:51.759 --> 00:21:54.960
<v Speaker 1>of schema-on-write by utilizing a wild data lake.

431
00:21:55.440 --> 00:21:57.880
<v Speaker 1>Then we introduced the modularity of Data Vault hubs and

432
00:21:57.920 --> 00:22:01.440
<v Speaker 1>satellites to add structure without losing historical agility.

433
00:22:01.480 --> 00:22:04.960
<v Speaker 2>And we powered that storage with an industrial processing stack.

434
00:22:04.839 --> 00:22:07.319
<v Speaker 1>Leveraging Kafka for fault tolerant ingestion.

435
00:22:07.400 --> 00:22:11.279
<v Speaker 2>And Spark's distributed memory clusters to handle the immense scale

436
00:22:11.559 --> 00:22:14.920
<v Speaker 2>while bridging the analytical power of R and Python into

437
00:22:14.960 --> 00:22:16.160
<v Speaker 2>that environment exactly.

438
00:22:16.799 --> 00:22:19.880
<v Speaker 1>Then we contained that horsepower using the CRISP-DM

439
00:22:19.920 --> 00:22:23.039
<v Speaker 1>blueprint and Vermeulen's five-layer enterprise framework.

440
00:22:23.079 --> 00:22:25.880
<v Speaker 2>We routed around the nightmare of point to point integration

441
00:22:26.039 --> 00:22:29.680
<v Speaker 2>by funneling everything through the HORUS universal schema, and.

442
00:22:29.640 --> 00:22:34.240
<v Speaker 1>Ultimately we tethered every piece of this complex machinery to strict,

443
00:22:34.559 --> 00:22:40.279
<v Speaker 1>auditable, and legally compliant business layer requirements using MoSCoW prioritization

444
00:22:40.400 --> 00:22:41.880
<v Speaker 1>and point in time sun models.

445
00:22:42.000 --> 00:22:46.000
<v Speaker 2>It is an incredibly dense, tightly integrated ecosystem.

446
00:22:45.400 --> 00:22:46.000
<v Speaker 1>It really is.

447
00:22:46.279 --> 00:22:50.000
<v Speaker 2>But understanding how the flow of data mandates the existence

448
00:22:50.079 --> 00:22:53.400
<v Speaker 2>of each of these specific tools and layers is exactly

449
00:22:53.440 --> 00:22:57.200
<v Speaker 2>what separates a narrow programmer from a true systems architect.

450
00:22:57.400 --> 00:23:01.000
<v Speaker 1>For you listening, whether you're architecting these systems yourself or

451
00:23:01.079 --> 00:23:03.839
<v Speaker 1>simply preparing to lead a high level strategy meeting tomorrow,

452
00:23:04.160 --> 00:23:07.400
<v Speaker 1>understanding the mechanics of this stack gives you the vocabulary

453
00:23:07.400 --> 00:23:10.480
<v Speaker 1>to lead the data conversation. You now understand why the

454
00:23:10.519 --> 00:23:15.960
<v Speaker 1>structural plumbing, the audits, the hubs, the utility translations is equally,

455
00:23:15.960 --> 00:23:19.839
<v Speaker 1>if not more critical than the predictive algorithms.

456
00:23:19.319 --> 00:23:24.359
<v Speaker 2>Themselves, because without that robust infrastructure, the most sophisticated predictive

457
00:23:24.359 --> 00:23:26.599
<v Speaker 2>algorithm is just an isolated math.

458
00:23:26.440 --> 00:23:28.799
<v Speaker 1>Equation, It doesn't actually do anything right.

459
00:23:29.079 --> 00:23:32.799
<v Speaker 2>The stack is what bridges the gap between theoretical potential

460
00:23:33.160 --> 00:23:36.319
<v Speaker 2>and executable automated enterprise value.

461
00:23:36.440 --> 00:23:39.000
<v Speaker 1>I want to leave you with one final provocative thought

462
00:23:39.000 --> 00:23:43.880
<v Speaker 1>to ponder. As distributed processing frameworks like Spark continue to

463
00:23:43.880 --> 00:23:48.720
<v Speaker 1>integrate exponentially more powerful machine learning capabilities natively, how far

464
00:23:48.799 --> 00:23:50.000
<v Speaker 1>are we from a tipping point?

465
00:23:50.079 --> 00:23:51.359
<v Speaker 2>Oh that's a big question, right.

466
00:23:51.559 --> 00:23:54.160
<v Speaker 1>What happens when an AI doesn't just process the data lake,

467
00:23:54.319 --> 00:23:58.319
<v Speaker 1>but actively begins writing its own MoSCoW business requirements, dynamically

468
00:23:58.319 --> 00:24:01.440
<v Speaker 1>restructuring its own sun models, and effectively managing the human

469
00:24:01.440 --> 00:24:04.279
<v Speaker 1>business layer itself to optimize corporate outcomes?

470
00:24:04.519 --> 00:24:08.279
<v Speaker 2>It fundamentally upends the hierarchy. When the analytical tools become

471
00:24:08.359 --> 00:24:12.000
<v Speaker 2>capable of dictating the enterprise strategy, the frameworks we use

472
00:24:12.039 --> 00:24:14.039
<v Speaker 2>to govern them will have to evolve dramatically.

473
00:24:14.279 --> 00:24:16.599
<v Speaker 1>We're standing on the edge of an ever expanding forty

474
00:24:16.720 --> 00:24:20.559
<v Speaker 1>zettabyte ocean. It isn't just getting deeper, it's beginning to

475
00:24:20.599 --> 00:24:23.880
<v Speaker 1>analyze the tides. Thank you for joining us on this

476
00:24:24.000 --> 00:24:26.960
<v Speaker 1>deep dive. Keep exploring the depths of your own data
