WEBVTT

1
00:00:00.080 --> 00:00:01.720
<v Speaker 1>Okay, so you're trying to get a real grip on

2
00:00:01.760 --> 00:00:05.960
<v Speaker 1>something complex, right, and you want it fast, but without

3
00:00:05.960 --> 00:00:08.240
<v Speaker 1>getting totally buried in jargon and detail.

4
00:00:08.359 --> 00:00:10.279
<v Speaker 2>Yeah, information overload is real.

5
00:00:10.240 --> 00:00:13.759
<v Speaker 1>Exactly. So think of this as your shortcut. We're diving

6
00:00:13.800 --> 00:00:17.079
<v Speaker 1>deep into real-time analytics today, trying to give you

7
00:00:17.120 --> 00:00:20.640
<v Speaker 1>that core understanding, you know, without all the noise.

8
00:00:20.920 --> 00:00:23.160
<v Speaker 2>And we're basing this on the book Building Real-Time

9
00:00:23.199 --> 00:00:27.920
<v Speaker 2>Analytics Systems. It just came out in September 2023, first edition.

10
00:00:27.920 --> 00:00:31.079
<v Speaker 1>Right, and the book's goal seems pretty practical, just

11
00:00:31.120 --> 00:00:33.759
<v Speaker 1>helping you, the listener, get your job done if you're

12
00:00:33.799 --> 00:00:34.719
<v Speaker 1>working in this space.

13
00:00:34.960 --> 00:00:37.240
<v Speaker 2>Pretty much. It cuts through the theory to the

14
00:00:37.280 --> 00:00:38.039
<v Speaker 2>how-to.

15
00:00:38.479 --> 00:00:43.960
<v Speaker 1>Now, think back to maybe the early 2000s. Data analytics often

16
00:00:43.960 --> 00:00:46.479
<v Speaker 1>felt like something you did after everything else happened, you know,

17
00:00:46.679 --> 00:00:47.320
<v Speaker 1>batch processing.

18
00:00:47.439 --> 00:00:49.799
<v Speaker 2>Oh, definitely. Reports would tell you what happened yesterday or

19
00:00:49.920 --> 00:00:51.719
<v Speaker 2>last week. Hindsight, basically.

20
00:00:51.759 --> 00:00:53.799
<v Speaker 1>But things have really shifted, haven't they? There's this, like,

21
00:00:54.280 --> 00:00:57.920
<v Speaker 1>massive appetite now for knowing things the moment they happen

22
00:00:58.320 --> 00:00:58.920
<v Speaker 1>real time.

23
00:00:59.039 --> 00:01:02.399
<v Speaker 2>Absolutely. The book uses fraud detection as a great example.

24
00:01:02.399 --> 00:01:03.880
<v Speaker 2>Finding out about fraud hours later?

25
00:01:04.200 --> 00:01:07.079
<v Speaker 1>Uh-uh, too late, right? The money's probably gone.

26
00:01:07.200 --> 00:01:10.480
<v Speaker 2>Exactly. The real win is spotting it now, flagging it, maybe

27
00:01:10.519 --> 00:01:14.200
<v Speaker 2>blocking it instantly. That immediacy is key. It's not just

28
00:01:14.319 --> 00:01:18.200
<v Speaker 2>nice to have anymore. Often it's, well, essential.

29
00:01:18.879 --> 00:01:21.280
<v Speaker 1>And that brings us to this idea of streaming. It's

30
00:01:21.319 --> 00:01:24.400
<v Speaker 1>not about waiting for a whole file to finish downloading

31
00:01:24.480 --> 00:01:25.120
<v Speaker 1>or collecting.

32
00:01:25.280 --> 00:01:27.280
<v Speaker 2>No, not at all. Think of it more like a

33
00:01:28.599 --> 00:01:32.400
<v Speaker 2>continuous flow, a river of data that just keeps coming.

34
00:01:32.439 --> 00:01:33.439
<v Speaker 2>It never really ends.

35
00:01:33.680 --> 00:01:35.760
<v Speaker 1>And the crucial part is you can dip into that

36
00:01:35.840 --> 00:01:38.560
<v Speaker 1>river and act on what you see right then and there.

37
00:01:38.599 --> 00:01:41.920
<v Speaker 2>Precisely. A data stream, fundamentally, is just a series of

38
00:01:42.000 --> 00:01:45.120
<v Speaker 2>data points ordered by time. Each one represents some kind

39
00:01:45.120 --> 00:01:46.760
<v Speaker 2>of event or a change.

40
00:01:46.480 --> 00:01:48.760
<v Speaker 1>Like what? Give us an example.

41
00:01:48.840 --> 00:01:52.599
<v Speaker 2>Well, like every single purchase on an e-commerce site, or every

42
00:01:52.599 --> 00:01:55.359
<v Speaker 2>reading from an IoT sensor, maybe temperature or pressure. It's

43
00:01:55.400 --> 00:01:56.680
<v Speaker 2>like a constant pulse of information.

44
00:01:56.760 --> 00:01:58.760
<v Speaker 1>Okay, okay. And here's a point the book really stresses,

45
00:01:58.799 --> 00:02:02.120
<v Speaker 1>which I found fascinating. Events have a shelf life.

46
00:02:01.840 --> 00:02:04.040
<v Speaker 2>A very short one.

47
00:02:04.079 --> 00:02:07.359
<v Speaker 1>Sometimes their value can just, like, plummet super fast. Think

48
00:02:07.359 --> 00:02:10.479
<v Speaker 1>about an online shopping cart someone just abandoned.

49
00:02:10.360 --> 00:02:12.919
<v Speaker 2>Right. If you can ping them with an SMS or an email,

50
00:02:13.000 --> 00:02:15.120
<v Speaker 2>maybe with a little discount voucher, like, immediately,

51
00:02:15.319 --> 00:02:17.000
<v Speaker 1>you might get that sale back.

52
00:02:17.080 --> 00:02:19.319
<v Speaker 2>You've got a decent shot, yeah. But wait even just

53
00:02:19.360 --> 00:02:20.840
<v Speaker 2>a couple of hours, and they've moved on,

54
00:02:20.639 --> 00:02:24.039
<v Speaker 1>bought somewhere else, or just changed their mind.

55
00:02:24.240 --> 00:02:28.080
<v Speaker 2>Exactly. The timing, that immediate reaction, makes all the difference, and

56
00:02:28.120 --> 00:02:31.560
<v Speaker 2>that is the heart of real-time analytics, or RTA.

57
00:02:32.159 --> 00:02:35.479
<v Speaker 2>It's all about squeezing value from those events basically as

58
00:02:35.479 --> 00:02:36.199
<v Speaker 2>soon as they happen.

59
00:02:36.400 --> 00:02:39.719
<v Speaker 1>The book mentions soft real time. What's that about? Does

60
00:02:39.719 --> 00:02:41.240
<v Speaker 1>that mean it's not quite real time?

61
00:02:41.520 --> 00:02:43.840
<v Speaker 2>Well, yeah, kind of. It just acknowledges that, you know,

62
00:02:43.919 --> 00:02:48.120
<v Speaker 2>perfection is hard. There might be tiny delays, milliseconds, maybe

63
00:02:48.159 --> 00:02:52.680
<v Speaker 2>seconds, because of network latency or system hiccups. It's not instantaneous,

64
00:02:52.680 --> 00:02:53.840
<v Speaker 2>but it's very very close.

65
00:02:53.960 --> 00:02:57.080
<v Speaker 1>Okay, so practical real time. The big difference is compared

66
00:02:57.120 --> 00:03:00.080
<v Speaker 1>to batch processing, right?

67
00:02:59.759 --> 00:03:02.639
<v Speaker 2>Batch is where you collect data over time, maybe an hour, maybe

68
00:03:02.639 --> 00:03:04.879
<v Speaker 2>a day, put it in a big chunk, and then

69
00:03:04.919 --> 00:03:05.479
<v Speaker 2>analyze it.

70
00:03:05.560 --> 00:03:08.080
<v Speaker 1>We used to set up these artificial deadlines, didn't we?

71
00:03:08.319 --> 00:03:10.840
<v Speaker 1>Run the report at midnight for yesterday's data.

72
00:03:11.080 --> 00:03:14.120
<v Speaker 2>Yeah, those time boundaries. The problem is your analysis is

73
00:03:14.120 --> 00:03:17.879
<v Speaker 2>always looking backwards. You're getting insights about what was happening.

74
00:03:17.520 --> 00:03:19.599
<v Speaker 1>Which might be stale news by the time you get

75
00:03:19.639 --> 00:03:20.319
<v Speaker 1>it.

76
00:03:21.039 --> 00:03:23.039
<v Speaker 2>Totally. RTA aims to give you a view of the present,

77
00:03:23.199 --> 00:03:25.479
<v Speaker 2>so your decisions are actually relevant to now.

78
00:03:25.719 --> 00:03:28.240
<v Speaker 1>So our mission for this deep dive, drawing from the

79
00:03:28.240 --> 00:03:31.360
<v Speaker 1>book, is to really get into those core concepts and,

80
00:03:31.719 --> 00:03:35.479
<v Speaker 1>importantly, the benefits. What do you actually gain from this?

81
00:03:35.879 --> 00:03:39.360
<v Speaker 2>Let's talk benefits, then. Speed is obviously a big one.

82
00:03:39.759 --> 00:03:43.439
<v Speaker 2>The book argues it's often a decisive factor. Market leaders

83
00:03:43.439 --> 00:03:44.719
<v Speaker 2>tend to be faster.

84
00:03:44.680 --> 00:03:47.719
<v Speaker 1>Faster at understanding, faster at reacting.

85
00:03:47.360 --> 00:03:50.360
<v Speaker 2>Exactly, and RTA helps achieve that. For one thing, it

86
00:03:50.400 --> 00:03:53.960
<v Speaker 2>can actually open up totally new revenue streams. Well,

87
00:03:54.080 --> 00:03:56.719
<v Speaker 2>think about turning your real-time data itself into a

88
00:03:56.759 --> 00:04:00.520
<v Speaker 2>product, offering your end users, maybe customers, the ability to

89
00:04:00.599 --> 00:04:05.240
<v Speaker 2>query data with analytical capabilities almost live. They'd likely pay

90
00:04:05.319 --> 00:04:06.280
<v Speaker 2>for that kind of access.

91
00:04:06.400 --> 00:04:09.479
<v Speaker 1>Ah, interesting, So the insight itself becomes a premium service.

92
00:04:09.719 --> 00:04:12.479
<v Speaker 1>That makes sense. It's not just about making more money, though,

93
00:04:12.599 --> 00:04:15.080
<v Speaker 1>is it? The book talks about infrastructure costs too.

94
00:04:15.439 --> 00:04:18.920
<v Speaker 2>Yes, that's a really important one. Traditional batch systems often

95
00:04:18.959 --> 00:04:21.079
<v Speaker 2>tie storage and compute together very tightly.

96
00:04:21.240 --> 00:04:23.120
<v Speaker 1>Meaning if your data grows...

97
00:04:23.240 --> 00:04:26.399
<v Speaker 2>your costs for both storage and the processing power needed

98
00:04:26.439 --> 00:04:30.759
<v Speaker 2>can just explode, often exponentially. But with RTA, you're

99
00:04:30.879 --> 00:04:34.319
<v Speaker 2>processing data more incrementally as it arrives. It sort of

100
00:04:34.319 --> 00:04:37.000
<v Speaker 2>breaks that tight coupling. You don't necessarily need to store

101
00:04:37.079 --> 00:04:40.399
<v Speaker 2>everything forever just to process it later in huge batches.

102
00:04:40.120 --> 00:04:44.120
<v Speaker 1>So you avoid building those massive, expensive legacy systems

103
00:04:44.319 --> 00:04:45.279
<v Speaker 1>just for batch jobs.

104
00:04:45.399 --> 00:04:49.600
<v Speaker 2>Potentially, yes, significant cost savings are possible there. You're handling

105
00:04:49.680 --> 00:04:52.279
<v Speaker 2>smaller streams continuously.

106
00:04:51.800 --> 00:04:54.879
<v Speaker 1>Like managing a steady creek instead of building dams for

107
00:04:55.079 --> 00:04:59.160
<v Speaker 1>unpredictable floods. And what about us, the customers? How does

108
00:04:59.199 --> 00:05:00.680
<v Speaker 1>this improve customer experience?

109
00:05:00.839 --> 00:05:04.639
<v Speaker 2>Well, think about customer support. Traditionally it's reactive, right? You

110
00:05:04.639 --> 00:05:08.240
<v Speaker 2>have a problem, you call or email, they investigate.

111
00:05:07.720 --> 00:05:08.920
<v Speaker 1>And maybe fix it eventually.

112
00:05:09.199 --> 00:05:13.279
<v Speaker 2>Maybe. With RTA, companies can constantly monitor streams of data:

113
00:05:13.360 --> 00:05:17.560
<v Speaker 2>usage patterns, error logs, sensor data, looking for anomalies or

114
00:05:17.560 --> 00:05:18.560
<v Speaker 2>signs of trouble.

115
00:05:18.319 --> 00:05:21.000
<v Speaker 1>Ah, so they can spot problems before I even notice them.

116
00:05:21.279 --> 00:05:24.920
<v Speaker 2>That's the goal. They can potentially identify and even resolve

117
00:05:24.959 --> 00:05:29.759
<v Speaker 2>issues proactively, automatically: maybe reroute traffic, restart a service, or

118
00:05:29.800 --> 00:05:32.439
<v Speaker 2>even reach out to you before it becomes a major headache.

119
00:05:32.560 --> 00:05:37.519
<v Speaker 1>That sounds much better, moving from reactive firefighting to proactive care.

120
00:05:36.920 --> 00:05:41.000
<v Speaker 2>Exactly. It leads to much higher customer satisfaction. It

121
00:05:41.040 --> 00:05:43.160
<v Speaker 2>feels like the company is actually looking out for you.

122
00:05:43.319 --> 00:05:49.600
<v Speaker 1>Okay, So RTA sounds powerful but also complex. The book

123
00:05:49.680 --> 00:05:55.279
<v Speaker 1>introduces this term, the real-time analytics ecosystem, or stack.

124
00:05:55.399 --> 00:05:56.680
<v Speaker 1>What is that, in simple terms?

125
00:05:56.759 --> 00:06:00.319
<v Speaker 2>Yeah, you'll hear ecosystem, stack, streaming stack. They basically mean

126
00:06:00.319 --> 00:06:03.800
<v Speaker 2>the same thing. It's the whole collection of tools, technologies,

127
00:06:04.040 --> 00:06:06.439
<v Speaker 2>and the processes you use to get from those raw,

128
00:06:06.879 --> 00:06:08.160
<v Speaker 2>unending streams...

129
00:06:07.759 --> 00:06:09.639
<v Speaker 1>of data to actual insights you can use.

130
00:06:09.720 --> 00:06:12.959
<v Speaker 2>Precisely, it's the entire pipeline, all the components working together.

131
00:06:13.160 --> 00:06:15.759
<v Speaker 1>And why is understanding that whole picture important?

132
00:06:15.959 --> 00:06:19.199
<v Speaker 2>Well, if you're an architect designing these systems, or a developer

133
00:06:19.240 --> 00:06:22.079
<v Speaker 2>building the apps, or even an operator keeping it all running,

134
00:06:22.600 --> 00:06:24.800
<v Speaker 2>you need to understand how the pieces fit together.

135
00:06:24.639 --> 00:06:26.920
<v Speaker 1>To make the right choices about tools and how they connect.

136
00:06:27.279 --> 00:06:30.560
<v Speaker 2>Absolutely, it helps you build systems that are robust, scalable,

137
00:06:30.800 --> 00:06:33.360
<v Speaker 2>and actually deliver those real-time insights effectively.

138
00:06:33.480 --> 00:06:37.120
<v Speaker 1>Okay. Now, before diving into the modern stack, the book

139
00:06:37.199 --> 00:06:41.560
<v Speaker 1>briefly mentions something called the Lambda architecture. Sounds a bit

140
00:06:41.839 --> 00:06:43.120
<v Speaker 1>I don't know, dated.

141
00:06:43.000 --> 00:06:44.000
<v Speaker 2>It is a bit older. Yeah.

142
00:06:44.120 --> 00:06:44.319
<v Speaker 1>Yeah.

143
00:06:44.360 --> 00:06:46.720
<v Speaker 2>It was kind of an early attempt to deal with

144
00:06:47.240 --> 00:06:52.439
<v Speaker 2>having both real time needs and needing accurate historical analysis

145
00:06:52.480 --> 00:06:53.959
<v Speaker 2>on huge datasets.

146
00:06:53.720 --> 00:06:56.040
<v Speaker 1>Trying to do both at once?

147
00:06:56.199 --> 00:06:59.879
<v Speaker 2>Sort of. It had three layers: a big, slow batch layer for

148
00:07:00.120 --> 00:07:04.079
<v Speaker 2>processing all the historical data accurately, a fast speed layer

149
00:07:04.319 --> 00:07:08.079
<v Speaker 2>for handling the incoming real-time streams, providing quick, maybe

150
00:07:08.120 --> 00:07:10.839
<v Speaker 2>slightly less perfect answers; and the third layer, a serving

151
00:07:10.920 --> 00:07:12.959
<v Speaker 2>layer that would try to merge the results from both

152
00:07:12.959 --> 00:07:15.680
<v Speaker 2>the batch and speed layers when you actually queried the system.

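NOTE A minimal sketch of the query-time merge a Lambda-style serving layer performs, assuming both layers expose simple per-key counts. The view names, keys, and numbers are hypothetical and are not taken from the book.
```python
# Hypothetical precomputed views: page-view counts per page.
# batch_view: complete and accurate, but only up to the last batch run.
# speed_view: events that arrived since that batch run.
batch_view = {"home": 1000, "pricing": 250}
speed_view = {"home": 12, "checkout": 3}
def serve(page, batch, speed):
    # Serving layer: add the recent speed-layer delta to the batch total.
    return batch.get(page, 0) + speed.get(page, 0)
all_pages = sorted(set(batch_view) | set(speed_view))
merged = {page: serve(page, batch_view, speed_view) for page in all_pages}
print(merged)
```
The two counts can simply be added here only because the speed view is assumed to hold events the batch run has not absorbed yet; keeping that boundary right is part of the complexity discussed next.
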
153
00:07:15.759 --> 00:07:17.839
<v Speaker 1>Okay, so it tried to give you fast answers and

154
00:07:17.879 --> 00:07:20.720
<v Speaker 1>eventually correct complete answers. What was the upside?

155
00:07:20.839 --> 00:07:23.720
<v Speaker 2>The main benefit was that your original raw data was

156
00:07:23.839 --> 00:07:26.480
<v Speaker 2>kept safe and sound in the batch layer, so if

157
00:07:26.480 --> 00:07:28.600
<v Speaker 2>you messed up your processing logic or wanted to try

158
00:07:28.600 --> 00:07:29.879
<v Speaker 2>a new analysis.

159
00:07:29.360 --> 00:07:30.839
<v Speaker 1>You could always go back and rerun it on the

160
00:07:30.879 --> 00:07:35.319
<v Speaker 1>original data. Exactly, data immutability was a plus. But

161
00:07:36.680 --> 00:07:39.360
<v Speaker 1>I sense a but coming. The book implies it wasn't

162
00:07:39.360 --> 00:07:41.680
<v Speaker 1>the perfect solution. What were the drawbacks?

163
00:07:42.279 --> 00:07:45.079
<v Speaker 2>There were quite a few. Actually, First, it was complex.

164
00:07:45.480 --> 00:07:49.120
<v Speaker 2>You essentially had to build and maintain two separate data pipelines,

165
00:07:49.480 --> 00:07:52.480
<v Speaker 2>batch and speed. That's a lot of engineering effort.

166
00:07:52.120 --> 00:07:54.920
<v Speaker 1>Double the work, potentially double the problems.

167
00:07:55.439 --> 00:07:58.839
<v Speaker 2>Pretty much. Also, many early stream processors relied heavily on the JVM,

168
00:07:59.040 --> 00:08:01.519
<v Speaker 2>the Java Virtual Machine, which was fine if you were

169
00:08:01.519 --> 00:08:04.199
<v Speaker 2>a Java shop, but maybe less ideal otherwise.

170
00:08:04.360 --> 00:08:07.600
<v Speaker 1>Vendor lock-in or skill set mismatch.

171
00:08:07.759 --> 00:08:10.519
<v Speaker 2>Yeah, and maybe the biggest headache was often having to

172
00:08:10.519 --> 00:08:14.079
<v Speaker 2>write and maintain the same or very similar processing logic

173
00:08:14.160 --> 00:08:15.839
<v Speaker 2>in both the batch and the speed layers.

174
00:08:16.480 --> 00:08:19.560
<v Speaker 1>Duplication. That sounds like a nightmare for consistency and updates.

175
00:08:19.720 --> 00:08:20.319
<v Speaker 2>It really was.

176
00:08:20.639 --> 00:08:23.920
<v Speaker 2>Keeping them perfectly in sync was hard, leading to potential

177
00:08:23.920 --> 00:08:26.959
<v Speaker 2>inconsistencies in the final results. So yeah, lots of overhead

178
00:08:27.000 --> 00:08:27.600
<v Speaker 2>and complexity.

179
00:08:27.720 --> 00:08:30.000
<v Speaker 1>Okay, so Lambda served a purpose, but we've moved on.

180
00:08:30.399 --> 00:08:34.320
<v Speaker 1>What does a more modern real-time analytics stack look like?

181
00:08:34.360 --> 00:08:35.679
<v Speaker 1>What are the essential pieces?

182
00:08:36.080 --> 00:08:39.279
<v Speaker 2>Right. The contemporary approach is generally more streamlined. It typically

183
00:08:39.320 --> 00:08:41.080
<v Speaker 2>starts with event producers.

184
00:08:40.639 --> 00:08:42.679
<v Speaker 1>The things generating the data in the first place.

185
00:08:42.759 --> 00:08:46.639
<v Speaker 2>Exactly, systems that detect something happened, a state change, and

186
00:08:46.759 --> 00:08:50.159
<v Speaker 2>fire off an event. Like an order management system sees

187
00:08:50.200 --> 00:08:52.919
<v Speaker 2>a new order and generates an order-received event.

188
00:08:52.919 --> 00:08:57.559
<v Speaker 1>And that event contains the details, like order ID, customer info, items?

189
00:08:57.679 --> 00:09:00.879
<v Speaker 2>All the relevant data. And a key thing here, mentioned

190
00:09:00.919 --> 00:09:03.480
<v Speaker 2>in the book, is you really need to benchmark your

191
00:09:03.480 --> 00:09:07.440
<v Speaker 2>producers. Make sure they can actually handle the volume and

192
00:09:07.519 --> 00:09:12.159
<v Speaker 2>speed of events you expect without becoming a bottleneck. Scalability

193
00:09:12.200 --> 00:09:14.120
<v Speaker 2>and latency are critical right from the start.

194
00:09:14.200 --> 00:09:16.679
<v Speaker 1>Okay, makes sense. The source needs to keep up. Where

195
00:09:16.679 --> 00:09:17.879
<v Speaker 1>do those events go next?

196
00:09:18.120 --> 00:09:20.639
<v Speaker 2>They flow into the event streaming platform. This is like

197
00:09:20.679 --> 00:09:23.519
<v Speaker 2>the central highway or message bus for all your events.

198
00:09:23.519 --> 00:09:27.159
<v Speaker 2>The backbone, yeah, exactly. Its job is to ingest potentially

199
00:09:27.279 --> 00:09:31.000
<v Speaker 2>huge volumes of events, store them reliably, usually for some

200
00:09:31.039 --> 00:09:34.320
<v Speaker 2>configurable period, and deliver them to whatever needs to consume them.

201
00:09:34.399 --> 00:09:36.799
<v Speaker 2>Apache Kafka is probably the most well-known example here.

202
00:09:36.960 --> 00:09:39.519
<v Speaker 1>Right, Kafka comes up a lot. What makes a good

203
00:09:39.720 --> 00:09:40.720
<v Speaker 1>streaming platform?

204
00:09:41.000 --> 00:09:45.440
<v Speaker 2>Key things are scalability: can it handle growth? Fault tolerance:

205
00:09:45.440 --> 00:09:49.000
<v Speaker 2>does it lose data if a server fails? High throughput:

206
00:09:49.039 --> 00:09:53.159
<v Speaker 2>can it handle a massive continuous flow? And low latency:

207
00:09:53.519 --> 00:09:56.000
<v Speaker 2>how quickly does data get through?

208
00:09:56.679 --> 00:10:01.120
<v Speaker 1>Got it. So data producers feed events onto this Kafka-like highway. Then what?

209
00:10:02.200 --> 00:10:05.200
<v Speaker 2>Then you typically have a stream processing platform.

210
00:10:05.279 --> 00:10:07.240
<v Speaker 2>This is where the real-time analysis starts happening.

211
00:10:07.279 --> 00:10:08.759
<v Speaker 1>This is where the magic happens?

212
00:10:08.720 --> 00:10:10.399
<v Speaker 2>Well, some of it. This is where you take those raw

213
00:10:10.399 --> 00:10:13.639
<v Speaker 2>event streams and transform them, maybe enrich them by joining

214
00:10:13.639 --> 00:10:16.960
<v Speaker 2>them with other data streams or static data, filter them,

215
00:10:17.159 --> 00:10:22.000
<v Speaker 2>aggregate them, run calculations. Basically turn raw data into intermediate insights.

216
00:10:22.159 --> 00:10:23.879
<v Speaker 1>Can you give examples of tools here?

217
00:10:24.000 --> 00:10:26.799
<v Speaker 2>Sure. Popular ones include Apache Flink, which is a

218
00:10:26.799 --> 00:10:30.799
<v Speaker 2>powerful stream processing framework. There's also Apache Spark Streaming,

219
00:10:30.879 --> 00:10:34.559
<v Speaker 2>which extends the Spark batch engine for streaming, and Kafka Streams,

220
00:10:34.559 --> 00:10:36.559
<v Speaker 2>which is a library that lets you build stream processing

221
00:10:36.600 --> 00:10:38.440
<v Speaker 2>apps directly on top of Kafka.

222
00:10:38.600 --> 00:10:41.320
<v Speaker 1>What are the important features for these stream processors?

223
00:10:41.559 --> 00:10:44.639
<v Speaker 2>You need things like good state management because your analysis

224
00:10:44.679 --> 00:10:48.720
<v Speaker 2>often depends on past events; windowing capabilities for doing

225
00:10:48.720 --> 00:10:51.679
<v Speaker 2>calculations over specific time periods like the last five minutes.

226
00:10:52.159 --> 00:10:55.840
<v Speaker 2>Fault tolerance, obviously, so processing doesn't stop if something breaks,

227
00:10:56.840 --> 00:10:59.679
<v Speaker 2>and support for different data formats.

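NOTE A minimal sketch of the state management and windowing just described, assuming five-minute tumbling windows keyed by event type. The function name and the sample events are hypothetical. Real stream processors also handle late or out-of-order events and checkpoint their state, which this sketch does not attempt.
```python
from collections import defaultdict
WINDOW_SECONDS = 300  # five-minute tumbling windows
def tumbling_counts(events, window_seconds=WINDOW_SECONDS):
    # State: one running count per (window start, key) pair.
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each event to the start of the window it falls into.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)
# Hypothetical events as (epoch seconds, event type) pairs.
events = [(0, "order"), (120, "order"), (299, "refund"), (300, "order"), (610, "order")]
print(tumbling_counts(events))
```
The dictionary is the state: each new event updates a count that depends on the events that arrived before it in the same window.
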
228
00:11:00.039 --> 00:11:03.360
<v Speaker 1>Processing happens, insights are generated. How do we actually use

229
00:11:03.399 --> 00:11:04.399
<v Speaker 1>them or see them?

230
00:11:04.519 --> 00:11:08.120
<v Speaker 2>That's the final piece, usually the serving layer. This is

231
00:11:08.159 --> 00:11:10.399
<v Speaker 2>the system that stores the results of your real-time

232
00:11:10.440 --> 00:11:13.559
<v Speaker 2>processing and makes them available for fast querying.

233
00:11:13.919 --> 00:11:17.519
<v Speaker 1>This is what applications or dashboards actually talk to?

234
00:11:17.559 --> 00:11:20.559
<v Speaker 2>Exactly. It's the primary access point. Now, this serving layer could

235
00:11:20.600 --> 00:11:23.000
<v Speaker 2>be a few different types of systems. It

236
00:11:23.039 --> 00:11:26.679
<v Speaker 2>could be a fast key-value store, think MongoDB,

237
00:11:26.879 --> 00:11:29.919
<v Speaker 2>maybe Elasticsearch or Redis. These are great if you

238
00:11:29.960 --> 00:11:32.639
<v Speaker 2>primarily need to look up results based on a specific key,

239
00:11:32.840 --> 00:11:35.600
<v Speaker 2>like getting the current status for a particular user ID.

240
00:11:35.440 --> 00:11:37.559
<v Speaker 1>Quick lookups. What's the alternative?

241
00:11:37.679 --> 00:11:40.960
<v Speaker 2>The alternative, especially for more complex analytics, is a

242
00:11:41.039 --> 00:11:45.879
<v Speaker 2>real-time OLAP database. OLAP stands for online analytical processing.

243
00:11:45.480 --> 00:11:48.399
<v Speaker 1>Ah, okay, designed for analysis.

244
00:11:48.799 --> 00:11:52.600
<v Speaker 2>Right. Tools like Apache Pinot, Apache Druid, Rockset, or ClickHouse

245
00:11:52.879 --> 00:11:55.879
<v Speaker 2>fall into this category. They are built for slicing and

246
00:11:55.960 --> 00:12:00.240
<v Speaker 2>dicing data, running aggregations, filtering across lots of dimensions, much

247
00:12:00.279 --> 00:12:02.240
<v Speaker 2>more complex queries than just a key lookup.

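NOTE A rough sketch contrasting the two access patterns, assuming plain in-memory Python data in place of a real key-value store or OLAP database. All names and rows are hypothetical.
```python
from collections import defaultdict
# Hypothetical data: a key-value view and raw sales rows.
status_by_user = {"u1": "active", "u2": "suspended"}
sales = [
    {"region": "EU", "category": "books", "amount": 20.0},
    {"region": "EU", "category": "games", "amount": 35.0},
    {"region": "US", "category": "books", "amount": 15.0},
    {"region": "EU", "category": "books", "amount": 5.0},
]
def lookup(user_id):
    # Key-value access: one key in, one value out.
    return status_by_user.get(user_id)
def sales_by(rows, *dimensions):
    # OLAP-style access: group rows by the chosen dimensions and sum a measure.
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row["amount"]
    return dict(totals)
print(lookup("u2"))
print(sales_by(sales, "region", "category"))
```
The lookup touches a single entry, while the aggregation scans every row and can be regrouped by any combination of dimensions, which is the kind of query OLAP databases are built to answer quickly.
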
248
00:12:02.519 --> 00:12:04.759
<v Speaker 1>So if I want to see, say, sales trends by

249
00:12:04.799 --> 00:12:07.559
<v Speaker 1>region and product category for the last hour, I'd want

250
00:12:07.600 --> 00:12:08.799
<v Speaker 1>an OLAP database?

251
00:12:09.120 --> 00:12:12.600
<v Speaker 2>Generally, yes. That's where they shine. The crucial thing for

252
00:12:12.679 --> 00:12:16.000
<v Speaker 2>any serving layer in this context is speed. You need

253
00:12:16.120 --> 00:12:19.799
<v Speaker 2>really fast data ingestion. The results from the stream processor

254
00:12:19.840 --> 00:12:23.000
<v Speaker 2>need to show up almost instantly, and query latency needs

255
00:12:23.039 --> 00:12:24.440
<v Speaker 2>to be low, often in the milliseconds.

256
00:12:24.360 --> 00:12:25.759
<v Speaker 1>Milliseconds again. Wow.

257
00:12:25.799 --> 00:12:28.200
<v Speaker 2>Yeah, and it also needs to handle high concurrency,

258
00:12:28.200 --> 00:12:30.919
<v Speaker 2>potentially thousands or even hundreds of thousands of queries per

259
00:12:30.919 --> 00:12:32.200
<v Speaker 2>second depending on the application.

260
00:12:32.399 --> 00:12:34.919
<v Speaker 1>That's incredible scale. How do you choose between key value

261
00:12:34.919 --> 00:12:38.600
<v Speaker 1>and real-time OLAP beyond just the query type?

262
00:12:38.639 --> 00:12:41.240
<v Speaker 2>Well, query type is the main driver, but you also

263
00:12:41.279 --> 00:12:44.840
<v Speaker 2>look at how data gets in. Does it support direct

264
00:12:44.919 --> 00:12:48.679
<v Speaker 2>streaming ingestion from Kafka or Flink, or do you need

265
00:12:48.720 --> 00:12:51.639
<v Speaker 2>an extra step? How fast is that ingestion, really? Can

266
00:12:51.679 --> 00:12:54.279
<v Speaker 2>it handle your expected data volume and rate? Does it

267
00:12:54.360 --> 00:12:57.440
<v Speaker 2>need complex indexing or pre-aggregation to meet your query

268
00:12:57.480 --> 00:13:00.559
<v Speaker 2>speed goals? Lots to consider, definitely. And the book makes

269
00:13:00.600 --> 00:13:03.679
<v Speaker 2>a very sensible point. Don't just trust the marketing hype.

270
00:13:04.120 --> 00:13:07.320
<v Speaker 2>Do your own benchmarking with your own data and query patterns.

271
00:13:07.559 --> 00:13:10.120
<v Speaker 2>See what actually works best for your specific needs.

272
00:13:10.279 --> 00:13:13.720
<v Speaker 1>Test it yourself. Always good advice. Okay, so we have producers,

273
00:13:13.720 --> 00:13:17.000
<v Speaker 1>the streaming platform, the processor, the serving layer. How

274
00:13:17.000 --> 00:13:19.840
<v Speaker 1>do people, like actual users, see this stuff?

275
00:13:19.879 --> 00:13:22.519
<v Speaker 2>Ah, the front end. Good point. If your users are

276
00:13:22.679 --> 00:13:25.720
<v Speaker 2>internal, like data analysts or engineers, maybe they query the

277
00:13:25.759 --> 00:13:27.840
<v Speaker 2>serving layer directly using SQL or an API.

278
00:13:27.759 --> 00:13:32.320
<v Speaker 1>Okay, but for less technical users or external customers?

279
00:13:31.960 --> 00:13:34.399
<v Speaker 2>Then you'll likely need a user interface, a front end,

280
00:13:34.919 --> 00:13:36.120
<v Speaker 2>and you've got a few options here.

281
00:13:36.120 --> 00:13:36.600
<v Speaker 1>What are they?

282
00:13:36.679 --> 00:13:39.399
<v Speaker 2>You could go fully custom: build your own web application

283
00:13:39.759 --> 00:13:44.120
<v Speaker 2>using standard tools like React, Angular, or Vue.js. That gives you

284
00:13:44.159 --> 00:13:47.720
<v Speaker 2>total control over the look, feel, and functionality.

285
00:13:47.080 --> 00:13:49.080
<v Speaker 1>The high-effort, high-control option. What else?

286
00:13:49.399 --> 00:13:52.480
<v Speaker 2>Then there are low-code frameworks. Things like Streamlit or Plotly

287
00:13:52.559 --> 00:13:55.720
<v Speaker 2>Dash are popular, especially in the Python world. They let

288
00:13:55.720 --> 00:13:59.000
<v Speaker 2>you build interactive dashboards and web apps with much less

289
00:13:59.000 --> 00:14:00.200
<v Speaker 2>front-end coding effort.

290
00:14:00.360 --> 00:14:03.559
<v Speaker 1>Faster development, maybe less customization?

291
00:14:04.039 --> 00:14:10.000
<v Speaker 2>Generally, yes. And the third category is data visualization tools, like Apache Superset, Redash, Grafana.

292
00:14:10.440 --> 00:14:13.559
<v Speaker 2>These often provide drag and drop interfaces to build dashboards

293
00:14:13.559 --> 00:14:16.320
<v Speaker 2>directly on top of your data sources, often with no

294
00:14:16.399 --> 00:14:17.360
<v Speaker 2>coding required at all.

295
00:14:17.480 --> 00:14:19.240
<v Speaker 1>The quickest way to get a dashboard up.

296
00:14:19.159 --> 00:14:22.440
<v Speaker 2>Often, yes. So how you choose depends on a few things.

297
00:14:22.480 --> 00:14:24.759
<v Speaker 2>What's the front-end coding skill level of your team,

298
00:14:25.440 --> 00:14:28.399
<v Speaker 2>how much time do you realistically have, and who are

299
00:14:28.399 --> 00:14:32.639
<v Speaker 2>the users: internal experts, or external customers needing a polished experience?

300
00:14:32.840 --> 00:14:36.559
<v Speaker 1>A spectrum of choices, matching needs and resources. Makes sense.

301
00:14:36.879 --> 00:14:39.080
<v Speaker 1>Now, the book also notes that sometimes the lines between

302
00:14:39.120 --> 00:14:40.639
<v Speaker 1>these components get fuzzy.

303
00:14:40.960 --> 00:14:44.960
<v Speaker 2>Yeah, technology evolves and tools sometimes wear multiple hats. Apache

304
00:14:44.960 --> 00:14:48.559
<v Speaker 2>Pulsar is a good example. It's mainly an event streaming platform

305
00:14:48.679 --> 00:14:52.360
<v Speaker 2>like Kafka, but it also has built-in capabilities called

306
00:14:52.399 --> 00:14:56.039
<v Speaker 2>Pulsar Functions that let you do some lightweight stream processing

307
00:14:56.080 --> 00:14:57.720
<v Speaker 2>directly within Pulsar itself.

308
00:14:57.840 --> 00:15:01.799
<v Speaker 1>Ah, so the streaming platform is doing some processing tasks?

309
00:15:01.840 --> 00:15:04.279
<v Speaker 2>Exactly. It blurs the line a bit between the streaming platform

310
00:15:04.519 --> 00:15:07.240
<v Speaker 2>and the stream processing platform. It just shows that these

311
00:15:07.279 --> 00:15:10.960
<v Speaker 2>categories aren't always rigid silos. The landscape is pretty dynamic.

312
00:15:11.080 --> 00:15:14.039
<v Speaker 1>Okay, that's a great tour through the stack. So wrapping

313
00:15:14.120 --> 00:15:17.080
<v Speaker 1>up this main section, what's the big takeaway from the

314
00:15:17.120 --> 00:15:18.519
<v Speaker 1>book about building these systems?

315
00:15:18.639 --> 00:15:21.919
<v Speaker 2>I think the fundamental message is that embracing real-time

316
00:15:22.200 --> 00:15:26.559
<v Speaker 2>analytics isn't just a technical upgrade. It's a strategic move

317
00:15:26.919 --> 00:15:29.240
<v Speaker 2>that can give you a serious competitive edge.

318
00:15:28.799 --> 00:15:31.679
<v Speaker 1>By making you faster, more informed...

319
00:15:31.639 --> 00:15:35.840
<v Speaker 2>And ultimately making more accurate, relevant decisions because you're acting

320
00:15:35.840 --> 00:15:38.480
<v Speaker 2>on what's happening now, not what happened yesterday.

321
00:15:38.679 --> 00:15:41.960
<v Speaker 1>Fantastic, and this deep dive has been really an introduction.

322
00:15:42.480 --> 00:15:45.200
<v Speaker 1>We've touched on the core ideas, the benefits, those key

323
00:15:45.240 --> 00:15:48.159
<v Speaker 1>building blocks of the RTA stack, all pulled from the

324
00:15:48.200 --> 00:15:51.000
<v Speaker 1>insights in Building Real-Time Analytics Systems.

325
00:15:51.039 --> 00:15:53.919
<v Speaker 2>Absolutely, it just scratches the surface, but hopefully gives you

326
00:15:53.960 --> 00:15:54.840
<v Speaker 2>a solid foundation.

327
00:15:55.039 --> 00:15:57.799
<v Speaker 1>So a final thought for you, the listener, to chew on:

328
00:15:58.480 --> 00:16:02.039
<v Speaker 1>think about your own work, your own organization. Where could

329
00:16:02.080 --> 00:16:06.240
<v Speaker 1>real-time analytics unlock something new, maybe a new data

330
00:16:06.279 --> 00:16:09.679
<v Speaker 1>product or a way to significantly improve a process you

331
00:16:09.720 --> 00:16:10.279
<v Speaker 1>already have?

332
00:16:10.600 --> 00:16:14.200
<v Speaker 2>Yeah, ask yourself, what's the current shelf life of your data?

333
00:16:14.679 --> 00:16:17.440
<v Speaker 2>Is its value decaying rapidly? What can you gain by

334
00:16:17.519 --> 00:16:20.200
<v Speaker 2>acting on it immediately? And maybe think about that stack

335
00:16:20.200 --> 00:16:25.240
<v Speaker 2>we discussed: producers, streaming, processing, serving, front end. Which piece

336
00:16:25.320 --> 00:16:28.919
<v Speaker 2>might offer the biggest immediate win for your specific situation?

337
00:16:29.240 --> 00:16:32.440
<v Speaker 1>Something to definitely consider. Where could that immediate insight make

338
00:16:32.440 --> 00:16:33.279
<v Speaker 1>the biggest difference?
