WEBVTT 1 00:00:00.160 --> 00:00:03.640 Welcome to the deep dive, your shortcut to truly understanding 2 00:00:03.759 --> 00:00:09.519 complex topics. Today, we're plunging into Apache Kafka. It's really 3 00:00:09.560 --> 00:00:13.880 a foundational technology underpinning so much of the modern data world. 4 00:00:14.080 --> 00:00:16.519 Absolutely, it's everywhere, even if you don't see it directly. 5 00:00:16.640 --> 00:00:19.399 We've pulled together a stack of detailed sources for you, 6 00:00:19.480 --> 00:00:23.679 particularly some really great insights from Apache Kafka in Action, 7 00:00:24.440 --> 00:00:27.239 and our mission really is to unpack its full potential. 8 00:00:27.359 --> 00:00:29.640 Yeah, get beyond just the buzzwords. Exactly. 9 00:00:29.679 --> 00:00:33.560 We'll explore everything from its basic building blocks to how 10 00:00:33.600 --> 00:00:37.840 it ensures rock-solid reliability, which is critical, boosts performance 11 00:00:37.880 --> 00:00:41.719 to incredible levels, and fits into the most advanced enterprise systems. 12 00:00:42.039 --> 00:00:44.240 Think of it this way. If you've ever wondered how 13 00:00:44.240 --> 00:00:47.880 a massive online retailer processes millions of real-time orders, 14 00:00:48.320 --> 00:00:54.079 updates inventory instantly, or personalizes your shopping experience on the fly, Kafka 15 00:00:53.840 --> 00:00:55.479 is probably in the mix somewhere. It's often that 16 00:00:55.560 --> 00:00:56.520 silent powerhouse. 17 00:00:56.840 --> 00:01:00.719 Get ready for some serious aha moments, because this deep 18 00:01:00.759 --> 00:01:03.560 dive is your essential guide. We want you to not 19 00:01:03.719 --> 00:01:07.239 just know what Kafka is, but deeply understand how it 20 00:01:07.319 --> 00:01:10.319 works and why it's so critical for today's real-time 21 00:01:10.439 --> 00:01:11.400 data needs. 22 00:01:11.120 --> 00:01:13.680 And hopefully, without feeling overwhelmed by the jargon. 
23 00:01:13.439 --> 00:01:14.200 Let's unpack this. 24 00:01:14.560 --> 00:01:18.560 It's truly a system that transforms how organizations handle data. 25 00:01:19.159 --> 00:01:22.239 It enables that shift, you know, from waiting 26 00:01:21.879 --> 00:01:24.120 for daily reports, still a batch world, 27 00:01:24.000 --> 00:01:26.959 right, to getting instant insights and acting on them immediately. 28 00:01:27.079 --> 00:01:31.200 That real-time capability is absolutely where the magic happens. Okay, 29 00:01:31.280 --> 00:01:34.359 so for anyone looking to understand Kafka, where do we 30 00:01:34.480 --> 00:01:38.519 even begin? What are its absolute foundational elements? 31 00:01:38.560 --> 00:01:41.599 Okay. So at its core, Kafka works with messages, 32 00:01:42.159 --> 00:01:45.799 sometimes called records. Okay. These are essentially just byte arrays. 33 00:01:45.840 --> 00:01:49.439 Think of them like small data envelopes, and for efficiency, 34 00:01:49.560 --> 00:01:53.159 they're often grouped into batches before being sent. Saves overhead. 35 00:01:52.799 --> 00:01:55.840 Got it. Batches of messages, and these messages are organized 36 00:01:55.840 --> 00:01:58.519 into topics, like categories. Exactly. 37 00:01:59.040 --> 00:02:02.040 Think of a topic as a dedicated channel or category 38 00:02:02.079 --> 00:02:05.920 for bundling messages of a specific business type, much like 39 00:02:06.120 --> 00:02:09.159 tables in a database maybe, but really designed for a 40 00:02:09.199 --> 00:02:12.879 continuous stream of events. So, for that online retailer we mentioned, 41 00:02:12.919 --> 00:02:15.879 you might have a customer orders topic or maybe a 42 00:02:15.919 --> 00:02:17.599 product inventory updates topic. 43 00:02:17.800 --> 00:02:20.120 Right, so if I place an order, that becomes a 44 00:02:20.120 --> 00:02:23.879 message in the customer orders topic. 
Simple enough. But how 45 00:02:23.879 --> 00:02:27.560 does Kafka handle the sheer volume, millions, maybe billions of 46 00:02:27.599 --> 00:02:30.039 messages, and ensure it can scale? 47 00:02:30.240 --> 00:02:33.000 Ah, that's where partitions come in, and they are truly 48 00:02:33.479 --> 00:02:36.800 like the backbone of Kafka's performance and scalability. 49 00:02:36.879 --> 00:02:38.199 Partitions. Okay. Each 50 00:02:38.080 --> 00:02:41.240 topic is divided into one or more partitions. This division 51 00:02:41.319 --> 00:02:44.560 is what enables parallel processing, lots of things happening at once. 52 00:02:44.680 --> 00:02:46.439 Makes sense, divide and conquer. And to 53 00:02:46.439 --> 00:02:49.879 ensure high availability and fault tolerance, which is crucial, these 54 00:02:49.919 --> 00:02:53.080 partitions are replicated across different Kafka servers. The servers, which 55 00:02:53.120 --> 00:02:55.120 you call...? We call them brokers. So if one broker 56 00:02:55.159 --> 00:02:57.800 goes down, the data is safe and accessible on another one. 57 00:02:58.039 --> 00:02:58.599 No panic. 58 00:02:59.080 --> 00:03:03.639 Got it. Messages, topics, partitions on brokers. Okay, so you 59 00:03:03.719 --> 00:03:07.919 have producers sending messages and consumers receiving them. How do 60 00:03:07.960 --> 00:03:10.639 they interact with these partitions and brokers? 61 00:03:10.800 --> 00:03:14.400 Good question. Producers are the applications sending the messages, your 62 00:03:14.560 --> 00:03:18.000 order service maybe. They send them to the designated leader 63 00:03:18.120 --> 00:03:18.919 of a partition. 64 00:03:19.240 --> 00:03:22.599 Leader? One broker is in charge of that partition. Right, and 65 00:03:22.560 --> 00:03:26.120 the producer selects that partition using something called a partitioner. 66 00:03:26.560 --> 00:03:28.439 Often it's based on a message key, which we'll get to. 67 00:03:28.560 --> 00:03:28.840 Okay. 
68 00:03:29.080 --> 00:03:32.840 On the other side, consumers receive and process messages. They're 69 00:03:32.879 --> 00:03:35.800 quite flexible actually, they can read from multiple partitions, even 70 00:03:36.159 --> 00:03:37.319 multiple topics at once. 71 00:03:37.400 --> 00:03:39.360 And the brokers themselves, what's their main job? 72 00:03:39.680 --> 00:03:43.560 The brokers are the Kafka servers. They manage the storage, distribution, 73 00:03:43.759 --> 00:03:47.439 retrieval of messages, all that, and they share replicas and 74 00:03:47.520 --> 00:03:51.560 processing tasks pretty evenly among themselves. It's a distributed system. 75 00:03:51.680 --> 00:03:53.599 This sounds like a lot of moving parts all needing 76 00:03:53.639 --> 00:03:56.039 to coordinate. Who is the leader? Is this broker alive? 77 00:03:56.159 --> 00:03:58.960 How does Kafka manage that internal choreography? 78 00:03:59.240 --> 00:04:02.719 Right, that's the role of the coordination cluster. Historically this was 79 00:04:02.759 --> 00:04:05.719 Apache ZooKeeper, a whole separate system you had to manage. 80 00:04:05.840 --> 00:04:08.800 I remember, ZooKeeper could be complex. 81 00:04:08.879 --> 00:04:12.039 It could. But now Kafka is increasingly moving to KRaft, 82 00:04:12.759 --> 00:04:15.919 K-R-a-f-t. Okay. And this isn't just a name change. 83 00:04:15.960 --> 00:04:21.360 It's a pretty significant evolution. KRaft simplifies the entire Kafka 84 00:04:21.480 --> 00:04:25.920 architecture because it removes that external dependency on ZooKeeper. So 85 00:04:26.000 --> 00:04:30.040 Kafka manages itself more? Exactly, it becomes self-managing for 86 00:04:30.079 --> 00:04:35.079 these critical coordination tasks. 
Overseeing partition assignments, handling leader elections, 87 00:04:35.519 --> 00:04:39.160 continuously monitoring broker health. It means fewer moving parts for you 88 00:04:39.240 --> 00:04:42.439 to manage, which is a huge operational win, especially for 89 00:04:42.600 --> 00:04:43.879 large dynamic clusters. 90 00:04:43.920 --> 00:04:47.360 Okay, that gives us the basic anatomy: messages and topics 91 00:04:47.360 --> 00:04:50.959 split into partitions managed by brokers, with producers and consumers 92 00:04:51.040 --> 00:04:54.000 all coordinated by KRaft or ZooKeeper. But here's where it 93 00:04:54.040 --> 00:04:58.000 gets really interesting, and a bit, well, mind-bending for me initially. 94 00:04:58.319 --> 00:05:02.120 The sources describe Kafka's core nature as a distributed log. 95 00:05:02.360 --> 00:05:04.000 Yes, this is fundamental. 96 00:05:04.079 --> 00:05:05.920 Can you elaborate on that? Why is thinking of it 97 00:05:05.920 --> 00:05:07.199 as a log so important? 98 00:05:07.600 --> 00:05:11.639 Absolutely. What's truly fascinating here is that Kafka is fundamentally 99 00:05:11.759 --> 00:05:14.959 a distributed log. You need to sort of forget about 100 00:05:15.000 --> 00:05:17.199 it being just a message queue for a second. Okay. 101 00:05:17.240 --> 00:05:19.800 Think of it more like an immutable personal diary, or 102 00:05:20.120 --> 00:05:23.040 maybe better, the commit log of a database. It answers 103 00:05:23.079 --> 00:05:26.720 the question "what happened?" It focuses on the history of 104 00:05:26.759 --> 00:05:30.319 events rather than just "what is", which is the current state. 105 00:05:30.480 --> 00:05:32.079 Right, history versus snapshot. 106 00:05:32.160 --> 00:05:35.800 Precisely. So for our online retailer, it's not just "current 107 00:05:35.839 --> 00:05:38.920 inventory is fifty shirts." 
It's more like: a shirt was 108 00:05:38.959 --> 00:05:42.079 sold at 10:01 am, then another at 109 00:05:42.120 --> 00:05:44.959 10:02 am, then we received stock at 110 00:05:45.000 --> 00:05:47.160 10:05 am, the whole sequence. 111 00:05:47.480 --> 00:05:50.040 So it's about the sequence of actions, the journey, not 112 00:05:50.160 --> 00:05:52.959 just the final destination. What are the key properties of 113 00:05:53.000 --> 00:05:54.879 such a log that make it so powerful? 114 00:05:55.120 --> 00:05:59.279 Logs have distinct, crucial properties. First, order and sorting. Messages 115 00:05:59.319 --> 00:06:01.920 are always sorted by time within a partition, oldest entry 116 00:06:01.839 --> 00:06:03.160 at the beginning. Within a partition, got it. 117 00:06:03.399 --> 00:06:06.199 Second, writing and reading direction. You always append new entries 118 00:06:06.199 --> 00:06:08.079 to the end of the log, like adding to a diary, 119 00:06:08.439 --> 00:06:10.800 and you typically read from old to new, using what 120 00:06:10.920 --> 00:06:13.199 Kafka calls offsets to track your position. 121 00:06:13.360 --> 00:06:15.879 Offsets, like bookmarks? Kinda, yeah. 122 00:06:15.920 --> 00:06:19.800 And crucially, immutability. Once an entry is written, you can't 123 00:06:19.800 --> 00:06:22.399 easily change or remove it. It's like writing in permanent ink. 124 00:06:22.800 --> 00:06:27.360 That immutability has profound implications for data integrity, I imagine, 125 00:06:27.680 --> 00:06:30.519 and I've heard this concept of time travel mentioned with logs. 126 00:06:30.600 --> 00:06:32.439 How does that actually work, and why is it such 127 00:06:32.439 --> 00:06:33.160 a game changer? 128 00:06:33.480 --> 00:06:37.839 That's right, time travel. Because the log is immutable and ordered, 
129 00:06:38.240 --> 00:06:40.519 you can literally reconstruct the state of the world at 130 00:06:40.519 --> 00:06:43.959 any point in time by simply replaying the entries from 131 00:06:43.959 --> 00:06:46.800 the beginning of the log or from a specific offset. 132 00:06:47.360 --> 00:06:50.279 For our online retailer, this means you could replay all 133 00:06:50.399 --> 00:06:53.800 order-placed events from last year to reconstruct exactly how 134 00:06:53.800 --> 00:06:57.040 many items were sold during a specific promotion. Wow. Or 135 00:06:57.079 --> 00:06:59.759 even rebuild an entire system state, if a database was 136 00:06:59.759 --> 00:07:03.959 somehow lost, just from the Kafka log. This capability is 137 00:07:04.000 --> 00:07:08.120 really how Kafka helps businesses transition from traditional batch-oriented 138 00:07:08.160 --> 00:07:11.360 processing, you know, waiting for those overnight reports, 139 00:07:11.079 --> 00:07:12.959 right, the end-of-day summaries, to real- 140 00:07:12.800 --> 00:07:15.680 time data handling, getting instant, up-to-the-minute insights. 141 00:07:15.920 --> 00:07:19.600 That's incredibly powerful, replaying history. But if a log is 142 00:07:19.639 --> 00:07:22.519 conceptually simple, why does it need to be distributed? Why 143 00:07:22.560 --> 00:07:26.480 not just one gigantic, super-fast log on one machine? Yeah, 144 00:07:26.560 --> 00:07:30.160 good question. And this feels like where Kafka truly transforms 145 00:07:30.199 --> 00:07:33.959 from a concept into an industrial-strength powerhouse. Must have 146 00:07:34.040 --> 00:07:37.319 big implications for, like, data resilience. Exactly. 147 00:07:37.600 --> 00:07:41.560 The challenge with the single log is clear: speed, scalability 148 00:07:41.600 --> 00:07:45.920 and resilience. A single system, a single server, it's often 149 00:07:46.000 --> 00:07:50.759 just not reliable enough. Hardware fails, networks glitch, it's common. 
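The log properties and the replay idea described above (append at the end, read from an offset, never modify entries, rebuild state by replaying) can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not Kafka's client API: the `PartitionLog` class, the `replay_stock` helper, and the event fields `product` and `change` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class PartitionLog:
    """Toy in-memory stand-in for one partition's log (not the Kafka API)."""

    def __init__(self):
        self._entries = []  # append-only; entries are never modified in place

    def append(self, event: dict) -> int:
        """Append an event to the end of the log and return its offset."""
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, offset: int = 0):
        """Yield (offset, event) pairs from `offset` to the end, oldest first."""
        for position in range(offset, len(self._entries)):
            yield position, self._entries[position]


def replay_stock(log: PartitionLog, up_to_offset: int) -> dict:
    """Rebuild stock levels by replaying events up to and including an offset."""
    stock = defaultdict(int)
    for offset, event in log.read_from(0):
        if offset > up_to_offset:
            break
        stock[event["product"]] += event["change"]
    return dict(stock)


log = PartitionLog()
log.append({"product": "shirt", "change": +50})  # initial stock
log.append({"product": "shirt", "change": -1})   # sold at 10:01
log.append({"product": "shirt", "change": -1})   # sold at 10:02
log.append({"product": "shirt", "change": +10})  # stock received at 10:05

print(replay_stock(log, up_to_offset=2))  # {'shirt': 48}
print(replay_stock(log, up_to_offset=3))  # {'shirt': 58}
```

Because the entries are never rewritten, replaying up to an earlier offset always gives the same answer, which is the "time travel" property the speakers describe.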
150 00:07:51.160 --> 00:07:55.000 Kafka addresses this through horizontal scaling. Meaning, instead of buying bigger, 151 00:07:55.079 --> 00:07:58.600 more powerful servers, vertical scaling, you use more servers. When 152 00:07:58.639 --> 00:08:00.959 your existing brokers are getting busy, you just add another 153 00:08:00.959 --> 00:08:01.639 one to the cluster. 154 00:08:02.480 --> 00:08:04.360 Scale out, not up. Precisely. 155 00:08:04.680 --> 00:08:07.079 This is cheaper, much more flexible, and fits the reality 156 00:08:07.079 --> 00:08:09.480 that individual machines aren't perfectly reliable. 157 00:08:09.519 --> 00:08:13.199 So that's a fundamental architectural philosophy behind Kafka. Expect failures, 158 00:08:13.240 --> 00:08:14.040 build around them. 159 00:08:14.120 --> 00:08:17.399 It absolutely is. This approach is crucial because individual IT 160 00:08:17.639 --> 00:08:23.639 systems are seen as inherently unreliable. Horizontal scaling also enables parallelization. 161 00:08:23.279 --> 00:08:24.839 More work done at the same time. 162 00:08:24.759 --> 00:08:29.399 Right, allowing Kafka to process far more messages per unit 163 00:08:29.439 --> 00:08:32.679 of time than a single server ever could. And this 164 00:08:32.720 --> 00:08:36.960 works efficiently because in Kafka, data for one logical entity, 165 00:08:37.000 --> 00:08:41.320 say all events related to a specific product, or maybe 166 00:08:41.360 --> 00:08:44.120 a single customer's order history, can be kept in its 167 00:08:44.159 --> 00:08:45.720 own log partition. 168 00:08:45.759 --> 00:08:47.399 Using that message key you mentioned earlier. 169 00:08:47.440 --> 00:08:51.080 Exactly, using the key. This ensures correct ordering for that 170 00:08:51.159 --> 00:08:54.639 specific entity, even if the overall order across all products 171 00:08:54.679 --> 00:08:56.279 isn't strictly sequential globally. 172 00:08:56.360 --> 00:08:58.720 Okay, that makes sense, order within a context. 
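The idea of keeping one entity's events together via the message key can be sketched as a hash of the key taken modulo the partition count. This is a minimal illustration under stated assumptions: `choose_partition` is a hypothetical helper, and CRC32 is used only as a stand-in hash so the example runs with the standard library; the transcript does not say which hash Kafka's partitioner uses.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Assumption: CRC32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = {index: [] for index in range(NUM_PARTITIONS)}

updates = [
    (b"product-123", "price changed"),
    (b"product-456", "stock -5"),
    (b"product-123", "stock +10"),
]

# Each update is appended to the partition chosen by its key, so all updates
# for one product land in the same partition in the order they were sent.
for key, value in updates:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

target = choose_partition(b"product-123", NUM_PARTITIONS)
print([value for key, value in partitions[target] if key == b"product-123"])
```

Ordering is only preserved inside a partition: the two `product-123` updates keep their relative order, while nothing is promised about how they interleave with `product-456`.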
173 00:08:58.840 --> 00:09:01.399 So the implication for you, as someone listening and maybe 174 00:09:01.440 --> 00:09:04.519 trying to build robust systems, is that Kafka is designed 175 00:09:04.559 --> 00:09:07.159 from the ground up to be highly available and resilient 176 00:09:07.559 --> 00:09:10.399 even if parts of it fail. It distributes data and 177 00:09:10.480 --> 00:09:13.960 work across many machines. It's built for reliability in an 178 00:09:14.039 --> 00:09:14.960 unreliable world. 179 00:09:15.120 --> 00:09:18.120 That's a fantastic overview of the core architecture, very clear. 180 00:09:18.399 --> 00:09:20.840 Let's maybe talk about the messages themselves now. The source 181 00:09:20.879 --> 00:09:24.440 mentions Kafka's data agnosticism. What does that actually mean for 182 00:09:24.519 --> 00:09:25.799 what you can send through it? 183 00:09:25.799 --> 00:09:28.279 It means Kafka doesn't really care about the content of 184 00:09:28.320 --> 00:09:32.320 your messages. It treats all messages as raw byte arrays. 185 00:09:32.360 --> 00:09:34.039 Just sequences of bytes? Yep. 186 00:09:34.399 --> 00:09:37.960 This flexibility is a key design choice. It allows Kafka 187 00:09:38.039 --> 00:09:41.039 to handle any kind of data, whether it's JSON, Avro, 188 00:09:41.240 --> 00:09:45.679 Protobuf, plain text, whatever, regardless of its format or structure. 189 00:09:46.200 --> 00:09:48.559 It doesn't try to interpret the data, which is actually 190 00:09:48.600 --> 00:09:50.000 a big part of its high performance. 191 00:09:50.159 --> 00:09:51.120 Like a postal service. 192 00:09:51.240 --> 00:09:54.480 Exactly, like a postal service that delivers any package, big 193 00:09:54.559 --> 00:09:56.600 or small, without needing to know what's inside. 194 00:09:56.799 --> 00:09:59.960 That sounds incredibly flexible. But doesn't that mean it's completely 195 00:10:00.279 --> 00:10:03.720 agnostic to the meaning or structure of the data? 
What 196 00:10:03.759 --> 00:10:07.440 are the implications of that for, say, data governance or 197 00:10:07.919 --> 00:10:09.720 ensuring consistency down the line? 198 00:10:09.840 --> 00:10:12.960 Ah, you've hit on a crucial point there. While Kafka 199 00:10:13.000 --> 00:10:16.639 itself is agnostic, in practice it is optimized for many 200 00:10:16.720 --> 00:10:18.600 small, structured messages. 201 00:10:18.159 --> 00:10:19.000 Small and structured. 202 00:10:19.080 --> 00:10:22.120 Okay. The default maximum message size is only one megabyte. 203 00:10:22.200 --> 00:10:24.360 Oh, that's smaller than I might have thought. 204 00:10:24.320 --> 00:10:27.480 It is, and while you can technically adjust this, it's 205 00:10:27.559 --> 00:10:33.440 generally advised against. Larger messages can severely impact performance, disk space. 206 00:10:34.120 --> 00:10:36.960 It's just not what it's designed for. Kafka is built 207 00:10:36.960 --> 00:10:40.200 for high throughput of many small events, not for transferring 208 00:10:40.279 --> 00:10:43.399 large files like PDFs or big video files. I mean, 209 00:10:43.440 --> 00:10:46.440 look at LinkedIn. They famously used Kafka to process something 210 00:10:46.480 --> 00:10:49.879 like seven trillion messages a day across roughly one hundred 211 00:10:49.919 --> 00:10:51.440 clusters back in 2019. 212 00:10:51.480 --> 00:10:53.080 Seven trillion a day. 213 00:10:53.360 --> 00:10:56.720 That's an astonishing number of small messages. Really drives home 214 00:10:56.759 --> 00:10:57.720 the point. Definitely. 215 00:10:57.919 --> 00:11:00.279 So if most messages are small, what are the common 216 00:11:00.320 --> 00:11:03.039 types of messages you typically see in a real-world 217 00:11:03.159 --> 00:11:06.159 Kafka system? What patterns emerge in practice? 218 00:11:06.159 --> 00:11:08.799 Most systems use a mix of message types. 
You often 219 00:11:08.840 --> 00:11:12.559 see states. States? Yeah, messages that describe the complete current 220 00:11:12.600 --> 00:11:15.679 state of an object, like all the details for a product, 221 00:11:15.720 --> 00:11:18.720 its current price, stock level, description, everything. And if you 222 00:11:18.720 --> 00:11:20.960 only care about the latest state, Kafka has a feature 223 00:11:20.960 --> 00:11:24.000 called log compaction, which uses message keys to save space 224 00:11:24.000 --> 00:11:26.840 by keeping only the most recent version of a particular record. 225 00:11:27.080 --> 00:11:30.279 Okay, so state is the full picture right now. What else? 226 00:11:30.519 --> 00:11:33.440 Then there are deltas. These contain only the changes in state, 227 00:11:33.759 --> 00:11:36.799 like just a stock quantity adjustment of negative five because 228 00:11:36.799 --> 00:11:40.200 an item sold, or plus ten because stock arrived. 229 00:11:40.279 --> 00:11:42.799 Ah, just the change. That sounds way more efficient for 230 00:11:42.879 --> 00:11:43.519 data volume. 231 00:11:43.720 --> 00:11:47.600 It is, much smaller messages, but they're less useful on 232 00:11:47.639 --> 00:11:47.960 their own. 233 00:11:48.279 --> 00:11:50.759 How so? What are the challenges if you only have 234 00:11:50.879 --> 00:11:55.000 deltas and need to know, say, the product's total stock 235 00:11:55.120 --> 00:11:55.519 right now? 236 00:11:55.559 --> 00:11:57.919 That's a great question. If you only have deltas, you'd 237 00:11:57.919 --> 00:12:01.200 have to process all previous deltas for that product just 238 00:12:01.240 --> 00:12:04.679 to reconstruct the current state. That could be computationally intensive 239 00:12:04.679 --> 00:12:05.240 for the consumer. 240 00:12:05.360 --> 00:12:07.600 Right, you have to sum them all up. Exactly. 241 00:12:07.559 --> 00:12:10.720 Which is why events are often preferred. 
They describe what happened, 242 00:12:10.879 --> 00:12:14.399 but add context. Like, instead of just "minus five stock", 243 00:12:14.480 --> 00:12:17.240 the event might be an order-fulfilled event, which contains the 244 00:12:17.279 --> 00:12:21.200 stock change but also the order ID, customer ID, timestamp. 245 00:12:21.960 --> 00:12:27.159 More meaning. VAT adjustment or promotion started are other examples. Logs, 246 00:12:27.240 --> 00:12:29.600 in fact, are really just a special kind of event stream. 247 00:12:29.639 --> 00:12:32.320 Okay, events give context, and you mentioned one more type. 248 00:12:32.200 --> 00:12:35.600 Yes, commands. These are used to instruct other systems to 249 00:12:35.600 --> 00:12:40.120 perform actions, like a ship-this-order command or a process-payment command. 250 00:12:40.759 --> 00:12:43.639 Unlike events, where the sender often doesn't care who listens, 251 00:12:43.879 --> 00:12:47.320 commands usually require a response or a specific action from 252 00:12:47.320 --> 00:12:48.360 the recipient system. 253 00:12:48.399 --> 00:12:53.360 That distinction between events and commands feels important. Commands expect 254 00:12:53.360 --> 00:12:56.960 a reaction. Now, you mentioned messages aren't just a singular 255 00:12:57.039 --> 00:13:00.639 blob of data. They have structure. Break that down for 256 00:13:00.720 --> 00:13:03.639 us again. What are the parts of a Kafka message? 257 00:13:03.759 --> 00:13:07.559 Yes, a Kafka message, or record technically, is composed of 258 00:13:07.600 --> 00:13:12.039 a few key elements. First, the value. That's the primary payload, 259 00:13:12.080 --> 00:13:14.799 the actual information you want to convey, like the details 260 00:13:14.799 --> 00:13:15.679 of a customer order. 261 00:13:15.840 --> 00:13:17.600 Usually the biggest part, the core data. 
262 00:13:17.720 --> 00:13:21.039 And there's an optional key, which is incredibly important even 263 00:13:21.039 --> 00:13:24.200 though it's optional. So important? It's used to categorize messages, 264 00:13:24.279 --> 00:13:27.240 and critically, messages with the same key are guaranteed by 265 00:13:27.279 --> 00:13:28.799 Kafka to go to the same partition. 266 00:13:29.039 --> 00:13:32.000 Ah, so that's how you ensure order for a specific entity, 267 00:13:32.200 --> 00:13:34.519 like all updates for product 123. 268 00:13:34.720 --> 00:13:37.639 Exactly. If you send all updates for product 269 00:13:37.679 --> 00:13:40.159 123 with the key 123, they land in 270 00:13:40.200 --> 00:13:43.919 the same partition, in order. It guarantees their order relative 271 00:13:43.960 --> 00:13:46.759 to each other, at least from a single producer. The 272 00:13:46.879 --> 00:13:49.759 key is also essential for that log compaction feature we 273 00:13:49.879 --> 00:13:53.360 mentioned, where Kafka retains only the latest message for a 274 00:13:53.360 --> 00:13:56.360 given key, very useful for topics representing current state. 275 00:13:56.799 --> 00:13:59.480 So the key is crucial for ordering and compaction, not 276 00:13:59.639 --> 00:14:01.919 just as an ID. What else is in a message? 277 00:14:02.039 --> 00:14:04.799 There are also optional custom headers. These are meant for 278 00:14:05.080 --> 00:14:09.039 technical metadata, things like tracing IDs for distributed systems, maybe 279 00:14:09.080 --> 00:14:11.960 security tokens, stuff like that, not really for business data. 280 00:14:12.080 --> 00:14:14.440 Keep business data in the value? Generally, yes. 281 00:14:14.679 --> 00:14:17.440 And finally, there's a timestamp. This records the time the 282 00:14:17.440 --> 00:14:20.080 message was created by the producer, or potentially when it 283 00:14:20.120 --> 00:14:23.840 was appended to the broker log, depending on configuration. 
This 284 00:14:23.960 --> 00:14:27.799 timestamp is vital for many real-time analytics scenarios, especially 285 00:14:27.799 --> 00:14:30.480 when you start dealing with time windows in stream processing. 286 00:14:30.759 --> 00:14:33.519 Fascinating how much detail goes into what seems like a 287 00:14:33.559 --> 00:14:38.000 simple message. Loads of potential there. Now, let's pivot to 288 00:14:38.039 --> 00:14:43.080 something absolutely critical for any data system: reliability. How does 289 00:14:43.120 --> 00:14:45.840 Kafka build trust with your data? How does it ensure 290 00:14:45.879 --> 00:14:48.799 nothing gets lost or hopelessly out of order? 291 00:14:49.159 --> 00:14:52.799 Right. Reliability in Kafka is built on a few core pillars. First, 292 00:14:53.080 --> 00:14:55.080 replication, with leaders and followers. 293 00:14:55.200 --> 00:14:58.799 We touched on this, leaders and followers for partitions. Exactly. 294 00:14:58.879 --> 00:15:02.360 For each partition, one broker acts as the leader. It 295 00:15:02.399 --> 00:15:06.279 handles all the incoming produce requests and outgoing consumer requests 296 00:15:06.279 --> 00:15:09.080 for that partition. The other brokers holding replicas for that 297 00:15:09.159 --> 00:15:12.720 partition are called followers, and they just continuously replicate, or 298 00:15:12.759 --> 00:15:16.480 copy, new messages from that leader. This creates redundant copies 299 00:15:16.480 --> 00:15:18.480 of your data across different machines. 300 00:15:18.559 --> 00:15:22.200 That sounds incredibly robust, multiple copies. But what actually happens 301 00:15:22.200 --> 00:15:24.360 behind the scenes when a leader fails, let's say the 302 00:15:24.399 --> 00:15:28.159 machine crashes? Is the switch to a follower instantaneous? Are 303 00:15:28.200 --> 00:15:31.159 there any potential downsides or edge cases a listener should be 304 00:15:31.159 --> 00:15:35.600 aware of? Good question. 
When a leader fails, Kafka automatically 305 00:15:35.639 --> 00:15:38.360 detects this and elects a new leader from its set 306 00:15:38.399 --> 00:15:42.279 of in-sync replicas, or ISRs. These are followers that 307 00:15:42.279 --> 00:15:43.639 are caught up with the leader's 308 00:15:43.320 --> 00:15:46.039 log. ISRs, in-sync replicas. 309 00:15:45.679 --> 00:15:50.120 Yeah, or sometimes eligible leader replicas, ELRs, depending on the setup. 310 00:15:50.600 --> 00:15:54.159 The goal is to ensure the topic remains accessible. Producers 311 00:15:54.159 --> 00:15:57.480 and consumers are designed to automatically detect this change and 312 00:15:57.559 --> 00:16:00.639 switch to the new leader, usually with minimal interruption. We're 313 00:16:00.639 --> 00:16:02.720 talking milliseconds typically. 314 00:16:02.320 --> 00:16:04.080 Okay, so it's fast. Any downsides? 315 00:16:04.240 --> 00:16:07.159 The main downside is that during that brief election period, 316 00:16:07.480 --> 00:16:12.200 that specific partition might be temporarily unavailable for writing. Reading 317 00:16:12.279 --> 00:16:15.279 might still be possible from followers depending on config, but 318 00:16:15.399 --> 00:16:19.080 writes need the leader. Also, once the original preferred leader 319 00:16:19.120 --> 00:16:22.000 comes back online and catches up, Kafka often aims to 320 00:16:22.039 --> 00:16:25.360 reinstate it as leader. This helps rebalance the leadership load 321 00:16:25.399 --> 00:16:26.840 across the cluster over time. 322 00:16:27.360 --> 00:16:31.120 That's excellent to know. Automatic failover is key. So how 323 00:16:31.159 --> 00:16:34.960 do acknowledgements, or ACKs, play into this? How do producers 324 00:16:35.080 --> 00:16:38.519 know their messages are safely persisted across these replicas before 325 00:16:38.559 --> 00:16:39.000 they move on? 326 00:16:39.320 --> 00:16:43.519 ACKs? 
Are precisely how producers control the durability guarantee and 327 00:16:43.679 --> 00:16:47.799 ensure messages are safely persisted. There are three main strategies, 328 00:16:47.919 --> 00:16:51.720 controlled by the acks producer config. With acks=0, 329 00:16:51.799 --> 00:16:55.159 it's basically fire and forget. Send and hope? Pretty much. 330 00:16:55.559 --> 00:16:58.279 It gives the highest performance because the producer doesn't wait 331 00:16:58.320 --> 00:17:01.399 for any confirmation at all, but it offers the lowest 332 00:17:01.440 --> 00:17:05.799 reliability, comparable maybe to UDP networking. You could lose messages 333 00:17:05.839 --> 00:17:07.279 if the broker fails immediately. 334 00:17:07.400 --> 00:17:08.519 When would you ever use that? 335 00:17:08.880 --> 00:17:12.160 It's acceptable if some data loss is tolerable, maybe high- 336 00:17:12.240 --> 00:17:15.240 volume sensor data where only the latest reading matters and 337 00:17:15.279 --> 00:17:18.599 losing an occasional reading isn't catastrophic, like a temperature sensor 338 00:17:18.799 --> 00:17:19.960 in a non-critical system. 339 00:17:20.079 --> 00:17:22.759 Okay, what about acks=1? That sounds like a middle ground. 340 00:17:22.799 --> 00:17:25.839 It is. acks=1 means the producer gets a response, 341 00:17:26.000 --> 00:17:29.200 an acknowledgment, as soon as the leader broker successfully receives 342 00:17:29.200 --> 00:17:32.079 and writes the message to its local log. This offers 343 00:17:32.160 --> 00:17:34.920 much better latency than waiting for all replicas, but data 344 00:17:34.960 --> 00:17:37.960 loss is still possible. If the leader receives the message, 345 00:17:38.319 --> 00:17:40.920 sends the ack back to the producer, but then 346 00:17:41.200 --> 00:17:44.720 crashes before that message gets replicated to its followers, that 347 00:17:44.759 --> 00:17:45.599 message is lost. 
348 00:17:45.960 --> 00:17:48.920 Ah okay, so it's confirmed by the leader, but not 349 00:17:49.039 --> 00:17:50.599 guaranteed replicated 350 00:17:50.200 --> 00:17:53.160 yet. Exactly, which brings us to acks=all, or you can 351 00:17:53.160 --> 00:17:55.079 write it as -1. This has actually been the 352 00:17:55.079 --> 00:17:56.680 default setting since Kafka three 353 00:17:56.519 --> 00:17:57.799 point zero. The safest option? 354 00:17:58.200 --> 00:18:02.480 Yes, acks=all offers the highest reliability. With this setting, the 355 00:18:02.559 --> 00:18:05.279 leader waits until all of the current in-sync replicas, 356 00:18:05.640 --> 00:18:09.359 the ISRs, have successfully persisted the data to their logs before 357 00:18:09.400 --> 00:18:11.799 sending that final ACK back to the producer. 358 00:18:11.920 --> 00:18:14.240 So you know it's on multiple machines. Right. 359 00:18:14.559 --> 00:18:17.119 This is what you definitely want for critical data like 360 00:18:17.160 --> 00:18:21.279 those customer orders, financial transactions, anything you absolutely cannot lose. 361 00:18:21.359 --> 00:18:24.839 So for guaranteed delivery, acks=all is the gold standard. Is 362 00:18:24.880 --> 00:18:27.240 there a way to fine-tune exactly how many in-sync 363 00:18:27.319 --> 00:18:30.200 replicas need to acknowledge before the leader confirms? Maybe you 364 00:18:30.200 --> 00:18:31.880 don't need all of them, just a majority. 365 00:18:32.079 --> 00:18:35.519 Yes, absolutely. That's where min.insync.replicas comes in. 366 00:18:35.839 --> 00:18:38.519 It's a topic-level configuration setting that works hand in 367 00:18:38.559 --> 00:18:39.359 hand with acks=all. 368 00:18:39.559 --> 00:18:40.240 How does it work? 
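The three acknowledgment levels described above can be summarized as a small toy model. This is only a sketch of the behaviour as the speakers describe it, not client code: `replica_writes_awaited` is a hypothetical helper, and `in_sync_replicas` is assumed to be the current ISR count including the leader.

```python
def replica_writes_awaited(acks: str, in_sync_replicas: int) -> int:
    """Return how many replica writes happen before the producer gets its ack.

    Toy model: `acks` mirrors the producer setting discussed above and
    `in_sync_replicas` is the current ISR count, leader included.
    """
    if acks == "0":
        return 0  # fire and forget: the producer waits for no confirmation
    if acks == "1":
        return 1  # the leader has written the message to its own log
    if acks in ("all", "-1"):
        return in_sync_replicas  # every current in-sync replica has the message
    raise ValueError(f"unknown acks setting: {acks!r}")


for setting in ("0", "1", "all"):
    print(setting, replica_writes_awaited(setting, in_sync_replicas=3))
```

Read the other way around, the number also shows the exposure: with `acks=1` a single crash of the leader right after the ack can lose the message, while with `acks=all` every current in-sync replica already holds a copy.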
369 00:18:40.319 --> 00:18:43.519 It specifies the minimum number of ISRs, including the leader 370 00:18:43.559 --> 00:18:47.480 itself, that must acknowledge the write before the leader confirms 371 00:18:47.519 --> 00:18:49.680 receipt back to the producer. So if you have a 372 00:18:49.720 --> 00:18:53.000 replication factor of three and you set min.insync.replicas 373 00:18:53.079 --> 00:18:55.799 to two, then the write succeeds as long as 374 00:18:55.799 --> 00:18:58.880 the leader and at least one follower confirm it. If 375 00:18:58.880 --> 00:19:01.240 only the leader is available, well, the producer will get 376 00:19:01.240 --> 00:19:04.960 an error and can retry, preventing potential data loss if 377 00:19:05.000 --> 00:19:07.559 too many replicas are temporarily down or slow. 378 00:19:07.720 --> 00:19:10.640 That gives you really fine-grained control over the durability 379 00:19:10.720 --> 00:19:14.920 versus availability trade-off. Very useful. But what about guaranteeing 380 00:19:14.920 --> 00:19:17.960 messages are written exactly once and in the correct order, 381 00:19:18.079 --> 00:19:21.160 especially if a producer has to retry sending due to 382 00:19:21.200 --> 00:19:23.799 a temporary network issue or something? That sounds like a 383 00:19:23.799 --> 00:19:25.720 classic distributed systems headache. 384 00:19:25.759 --> 00:19:28.920 It is a tough problem, but Kafka has solutions for that. 385 00:19:29.000 --> 00:19:31.200 We turn to idempotence and transactions. 386 00:19:31.319 --> 00:19:34.920 Idempotence, meaning doing something multiple times has the same effect 387 00:19:34.960 --> 00:19:35.480 as doing 388 00:19:35.319 --> 00:19:39.039 it once. Precisely. By setting enable.idempotence=true on 389 00:19:39.079 --> 00:19:42.319 the producer, which is actually the default now too. 
Alongside 390 00:19:42.359 --> 00:19:45.079 acks=all, Kafka ensures that messages are written in the 391 00:19:45.079 --> 00:19:48.519 correct order within a partition and are present exactly once, 392 00:19:48.880 --> 00:19:50.440 even if the producer retries sending. 393 00:19:50.839 --> 00:19:53.000 How does it do that without much overhead? 394 00:19:53.240 --> 00:19:55.759 It uses sequence numbers assigned by the producer and tracked 395 00:19:55.759 --> 00:19:59.079 by the broker. The performance loss is negligible, maybe one 396 00:19:59.119 --> 00:20:02.640 percent or less, but the gain in data integrity is huge. 397 00:20:03.039 --> 00:20:05.960 Imagine if an order-placed message got duplicated because of 398 00:20:05.960 --> 00:20:08.039 a retry. Idempotence prevents that. 399 00:20:08.519 --> 00:20:11.759 Okay, so idempotence handles duplicates from producer retries. What 400 00:20:11.799 --> 00:20:13.440 about transactions? When do they come in? 401 00:20:13.599 --> 00:20:18.559