WEBVTT 1 00:00:00.080 --> 00:00:02.240 Welcome to the Deep Dive, where we plunge into a 2 00:00:02.279 --> 00:00:06.759 stack of information, research notes, you name it, to really 3 00:00:06.839 --> 00:00:09.759 pull out those key nuggets of knowledge. We give you 4 00:00:09.800 --> 00:00:12.640 a serious shortcut to being well informed. Today, we're doing 5 00:00:12.720 --> 00:00:15.240 a deep dive into something I think is incredibly practical, 6 00:00:15.359 --> 00:00:17.800 really powerful for anyone, you know, navigating the world of 7 00:00:17.800 --> 00:00:21.120 big data. It's excerpts from the Azure Databricks Cookbook: 8 00:00:21.199 --> 00:00:25.039 Accelerate and Scale Real-Time Analytics. Think of this as 9 00:00:25.039 --> 00:00:27.920 your, well, your essential guide to building solid, cutting-edge 10 00:00:27.960 --> 00:00:30.719 data solutions on Azure. Our mission for you today is 11 00:00:30.800 --> 00:00:33.560 to really distill the core components and the key strategies 12 00:00:33.600 --> 00:00:36.560 from this cookbook so you grasp not just what Azure 13 00:00:36.600 --> 00:00:38.920 Databricks can do, but really how it's applied, you know, 14 00:00:38.960 --> 00:00:41.679 in the real world to tackle today's data challenges. 15 00:00:41.920 --> 00:00:44.520 Yeah, and what makes the source material so, so compelling 16 00:00:44.640 --> 00:00:47.000 is really the background of the people involved, the authors, 17 00:00:47.359 --> 00:00:50.479 Phani Raj and Vinod Jaiswal. I mean, these are deeply 18 00:00:50.640 --> 00:00:54.960 experienced data architects and engineers at Microsoft. We're talking over a decade 19 00:00:54.960 --> 00:00:58.840 each, and specifically complex data warehouses, big data, real 20 00:00:58.880 --> 00:01:01.039 time solutions on Azure. It's their bread and butter. 21 00:01:01.439 --> 00:01:03.439 And then you've got the reviewers, Ankur Nayyar and 22 00:01:03.439 --> 00:01:05.959 Alan Bernardo Palacio. 
They add this whole other layer, you know, 23 00:01:06.040 --> 00:01:09.920 advanced data architecture, ML, scalable data pipelines. So this collective experience, 24 00:01:09.920 --> 00:01:11.959 it means what we're exploring isn't just theory, right, it's 25 00:01:12.000 --> 00:01:16.480 really grounded in hard-won practical know-how. You feel 26 00:01:16.519 --> 00:01:17.159 that reading it. 27 00:01:17.239 --> 00:01:19.799 Right. Okay, great point. So let's kick things off with 28 00:01:19.840 --> 00:01:23.239 the absolute basics, the fundamentals. If you're looking to get 29 00:01:23.239 --> 00:01:26.280 your hands dirty with Azure Databricks, the cookbook jumps 30 00:01:26.400 --> 00:01:29.719 right into creating the service, doesn't mess around. It walks 31 00:01:29.719 --> 00:01:32.760 you through setting up a workspace, like, directly in the 32 00:01:32.799 --> 00:01:36.159 Azure portal, and highlights these key decisions you make right 33 00:01:36.200 --> 00:01:39.239 at the start, like, for instance, the choice of VNet deployment. 34 00:01:39.319 --> 00:01:42.879 It shows selecting No initially, maybe as a simpler start, 35 00:01:43.200 --> 00:01:46.480 before you, you know, review and create the service in your 36 00:01:46.519 --> 00:01:50.040 resource group, like CookbookRG they use as an example. 37 00:01:49.719 --> 00:01:53.400 And that choice, the VNet deployment one, it's actually pretty significant, 38 00:01:53.439 --> 00:01:53.719 isn't it? 39 00:01:53.799 --> 00:01:54.000 Yeah. 40 00:01:54.040 --> 00:01:57.280 The book smartly brings up alternatives early, like using the 41 00:01:57.319 --> 00:02:01.200 Azure CLI for deployment, and that's just crucial for automation, right, 42 00:02:01.319 --> 00:02:05.359 especially if you're thinking infrastructure as code, maybe scripting repeatable setups 43 00:02:05.439 --> 00:02:08.719 or using DevOps pipelines. 
Imagine, like, standing up a whole 44 00:02:08.800 --> 00:02:11.319 Databricks environment with just one command. That's the kind 45 00:02:11.319 --> 00:02:12.400 of power it hints at. 46 00:02:12.639 --> 00:02:14.919 Exactly. Okay, so you've got the workspace up and running, 47 00:02:15.000 --> 00:02:19.759 what's next? Access control, obviously. The cookbook explains adding users and 48 00:02:19.840 --> 00:02:23.439 groups straight from the Databricks admin console. Simple enough, 49 00:02:23.599 --> 00:02:26.240 but they need to be in Azure Active Directory first, 50 00:02:26.319 --> 00:02:28.879 that's the prerequisite. Then you get to the core, really: 51 00:02:29.199 --> 00:02:32.159 creating and managing clusters. This is where the processing happens. 52 00:02:32.319 --> 00:02:34.039 You can spin them up from the UI, give it 53 00:02:34.080 --> 00:02:36.400 a name, pick a cluster mode like Standard, that's the 54 00:02:36.599 --> 00:02:40.120 recommended one for single users, and it wisely defaults to 55 00:02:40.199 --> 00:02:42.919 terminating after, what, 120 minutes of inactivity. 56 00:02:43.000 --> 00:02:45.599 Yeah, saves on costs. Smart default, right? 57 00:02:45.599 --> 00:02:48.280 And you pick your Spark version, the Databricks Runtime, 58 00:02:48.479 --> 00:02:51.159 like Spark 3.0.1, Runtime 7.4, 59 00:02:51.199 --> 00:02:52.439 in their examples. 60 00:02:52.879 --> 00:02:55.719 And this discussion around cluster modes, that's where you really 61 00:02:55.759 --> 00:02:59.879 start tailoring things, optimizing your setup. So beyond that interactive 62 00:03:00.120 --> 00:03:03.879 Standard mode, the book introduces job clusters, and these aren't 63 00:03:03.879 --> 00:03:06.919 just, like, a minor variation. It's a totally different approach for 64 00:03:07.000 --> 00:03:09.879 scheduled stuff, for automation. They spin up, run your notebook. 
65 00:03:09.919 --> 00:03:12.199 The job may be triggered by Data Factory. And then, here's 66 00:03:12.199 --> 00:03:14.520 the cool part: they automatically delete themselves when done. 67 00:03:14.599 --> 00:03:17.800 So for you, that means, well, potentially huge cost savings 68 00:03:17.800 --> 00:03:20.919 and super efficient resource use. You're only paying for compute 69 00:03:20.919 --> 00:03:22.439 when it's actually crunching numbers. 70 00:03:22.919 --> 00:03:26.000 Yeah, that auto delete is brilliant. And speaking of jobs, 71 00:03:26.159 --> 00:03:29.280 the cookbook makes that transition smooth, from playing around in 72 00:03:29.319 --> 00:03:32.960 notebooks interactively to automating them. It shows uploading a notebook, 73 00:03:32.960 --> 00:03:35.759 maybe a .dbc file, running the cells, then scheduling 74 00:03:35.759 --> 00:03:37.800 it as a job in Databricks. And here's that 75 00:03:37.879 --> 00:03:40.400 powerful bit again. You can configure the job to create 76 00:03:40.439 --> 00:03:43.919 a new on-demand job cluster just for that one task, 77 00:03:44.000 --> 00:03:47.439 really flexible. And then for anyone building integrations, you know, 78 00:03:47.439 --> 00:03:51.520 programmatically talking to Databricks, it covers authentication using PATs, 79 00:03:51.560 --> 00:03:54.360 personal access tokens, or Azure AD tokens to hit the 80 00:03:54.400 --> 00:03:58.120 REST APIs. It even shows connecting Power BI Desktop using a 81 00:03:58.159 --> 00:04:01.439 PAT to visualize data in Spark tables, like the MDA 82 00:04:01.439 --> 00:04:04.520 example they use. Okay, so environment's ready, clusters are figured out. 83 00:04:05.000 --> 00:04:06.960 Now the big question: how do you get your data 84 00:04:07.000 --> 00:04:10.439 in and out? 
The cookbook is really practical, step by 85 00:04:10.439 --> 00:04:14.520 step instructions for mounting storage, specifically ADLS Gen2, Azure 86 00:04:14.520 --> 00:04:17.000 Data Lake Storage Gen2, and also Azure 87 00:04:17.040 --> 00:04:20.560 Blob Storage, mounting them right into DBFS, the Databricks File System. 88 00:04:20.560 --> 00:04:23.079 It does involve registering an app in AAD, Azure 89 00:04:23.079 --> 00:04:26.199 Active Directory, to get those credentials: application ID, tenant ID 90 00:04:26.680 --> 00:04:31.199 and the client secret. And obviously those secrets need super 91 00:04:31.240 --> 00:04:33.279 careful handling, store them securely. 92 00:04:33.519 --> 00:04:37.240 Absolutely, and this mounting process, it's just such a game 93 00:04:37.319 --> 00:04:40.439 changer for making data accessible because it lets you treat 94 00:04:40.480 --> 00:04:43.199 your cloud storage almost like it's a local drive inside 95 00:04:43.240 --> 00:04:46.279 Databricks. It smooths out data access for your notebooks, 96 00:04:46.560 --> 00:04:51.279 makes interacting with potentially massive data sets feel, well, seamless, 97 00:04:51.519 --> 00:04:52.839 much less clunky. 98 00:04:52.519 --> 00:04:56.079 Definitely. So storage is mounted. Now the cookbook dives into 99 00:04:56.240 --> 00:04:59.240 reading and writing data: different formats, different services. They cover 100 00:04:59.360 --> 00:05:03.680 CSV and Parquet files in detail. You learn about Spark's schema inference, 101 00:05:03.920 --> 00:05:06.000 you know, where it tries to guess the data 102 00:05:05.680 --> 00:05:08.439 types, which can be okay, but sometimes... Right. 
103 00:05:08.480 --> 00:05:11.120 Sometimes it just sees everything as a string initially, so 104 00:05:11.480 --> 00:05:15.160 more importantly, it guides you through explicitly defining that schema 105 00:05:15.560 --> 00:05:18.879 using StructType, specifying things like IntegerType for a 106 00:05:18.920 --> 00:05:20.839 custkey column, making sure it's correct. 107 00:05:21.000 --> 00:05:24.279 And this is exactly where you unlock serious performance gains: 108 00:05:24.639 --> 00:05:30.240 that explicit schema definition, plus the format choice itself. Parquet 109 00:05:30.680 --> 00:05:33.600 being columnar isn't just about compression, though that's nice. Its 110 00:05:33.639 --> 00:05:38.680 real power comes from optimizations like column pruning and predicate pushdown. 111 00:05:39.000 --> 00:05:41.519 Think about it: your query only needs two columns out 112 00:05:41.519 --> 00:05:44.240 of fifty, Parquet lets Spark read only those two columns. 113 00:05:44.639 --> 00:05:46.639 Or if you have a filter, a WHERE clause, it 114 00:05:46.639 --> 00:05:49.199 can push that filter down to the storage level. Avoids 115 00:05:49.240 --> 00:05:52.319 reading tons of irrelevant data compared to reading whole rows 116 00:05:52.360 --> 00:05:55.120 in CSV or JSON. It's, well, it's night and day 117 00:05:55.120 --> 00:05:56.160 for big data queries. 118 00:05:56.360 --> 00:05:58.720 Huge difference, and the cookbook doesn't stop there. It covers 119 00:05:58.759 --> 00:06:02.360 processing JSON too, even complex nested JSON. Shows you the 120 00:06:02.399 --> 00:06:06.160 Spark functions like to_json and from_json. And then, beyond files, 121 00:06:06.160 --> 00:06:08.839 it talks about reading and writing to Azure SQL Database 122 00:06:08.879 --> 00:06:12.439 and also Azure Synapse Analytics, specifically the dedicated SQL 123 00:06:12.439 --> 00:06:13.920 pool, using the native connectors. 
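The column pruning and predicate pushdown idea described above can be made concrete with a toy sketch. This is plain Python, not the Parquet or Spark API: the sample records, the `custkey`/`segment`/`balance` field names, and both query functions are hypothetical, and the "cost" is just a count of values touched. Real Parquet readers skip data at the granularity of row groups using stored statistics rather than value by value, so treat this only as an intuition aid.

```python
# Toy illustration of column pruning and predicate pushdown.
# Plain Python only: this is NOT the Parquet or Spark API.

# Hypothetical sample data: the same three records in two layouts.
ROWS = [
    {"custkey": 1, "name": "Ann", "segment": "AUTO", "balance": 120.0},
    {"custkey": 2, "name": "Bob", "segment": "HOME", "balance": 80.5},
    {"custkey": 3, "name": "Cy", "segment": "AUTO", "balance": 42.0},
]
COLUMNS = {key: [row[key] for row in ROWS] for key in ROWS[0]}


def query_row_layout(rows, wanted, column, value):
    """Row layout (like CSV): every field of every row is read."""
    values_read = 0
    result = []
    for row in rows:
        values_read += len(row)  # the whole row has to be parsed
        if row[column] == value:
            result.append({name: row[name] for name in wanted})
    return result, values_read


def query_column_layout(columns, wanted, column, value):
    """Column layout (like Parquet): read the filter column, then only
    the wanted columns, and only at the matching positions."""
    filter_values = columns[column]
    values_read = len(filter_values)
    matches = [i for i, v in enumerate(filter_values) if v == value]
    result = []
    for i in matches:
        record = {}
        for name in wanted:
            record[name] = columns[name][i]
            values_read += 1
        result.append(record)
    return result, values_read


row_result, row_cost = query_row_layout(ROWS, ["custkey", "balance"], "segment", "AUTO")
col_result, col_cost = query_column_layout(COLUMNS, ["custkey", "balance"], "segment", "AUTO")
print(row_result == col_result, row_cost, col_cost)
```

Both layouts return the same answer; the difference is only in how many values had to be touched to get there, and that gap widens as tables get wider.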
124 00:06:14.120 --> 00:06:16.959 Yeah, and if you zoom out a bit, this ability 125 00:06:17.000 --> 00:06:21.000 to seamlessly integrate with all these services, Azure SQL, Synapse, 126 00:06:21.360 --> 00:06:24.680 even Cosmos DB, which also has a Spark connector for 127 00:06:24.720 --> 00:06:27.600 batch and streaming, that's what really cements Databricks as 128 00:06:27.600 --> 00:06:30.560 this central hub, this sort of nervous system for a 129 00:06:30.600 --> 00:06:33.759 modern data platform. It's all about bringing together your diverse 130 00:06:33.839 --> 00:06:36.000 data sources into one place for analysis. 131 00:06:36.160 --> 00:06:38.040 Okay, let's peek under the hood a bit. Ever 132 00:06:38.160 --> 00:06:40.839 wonder what Spark is actually doing when you run a query? 133 00:06:40.920 --> 00:06:44.120 It can feel like a black box sometimes. This cookbook 134 00:06:44.199 --> 00:06:48.759 pulls back that curtain, introduces the concepts: jobs, stages, tasks, 135 00:06:49.079 --> 00:06:50.160 how Spark breaks 136 00:06:49.920 --> 00:06:51.800 down the work. And the key visual here, the really 137 00:06:51.839 --> 00:06:55.639 insightful bit, is the directed acyclic graph, the DAG. 138 00:06:56.000 --> 00:06:58.600 Think of it like Spark's internal blueprint for your query. 139 00:06:58.839 --> 00:07:01.399 It shows exactly how it plans to execute it, and 140 00:07:01.439 --> 00:07:04.439 you can see this DAG in the Spark UI. It breaks 141 00:07:04.480 --> 00:07:07.560 down your whole application into these jobs, stages and tasks. 142 00:07:07.920 --> 00:07:11.439 So for you, the user, this is invaluable for debugging performance. 143 00:07:11.519 --> 00:07:14.040 Like, if you see one task taking way longer than 144 00:07:14.040 --> 00:07:16.279 all the others, that's often your first big clue. You 145 00:07:16.360 --> 00:07:18.959 might have data skew, where one partition has way more 146 00:07:19.040 --> 00:07:21.199 data than the others. 
The DAG helps you spot 147 00:07:20.920 --> 00:07:24.000 that. And the book cleverly links this back to schema 148 00:07:24.040 --> 00:07:27.480 definition. Shows how using that inferred schema we talked about, 149 00:07:27.560 --> 00:07:30.560 it might lead to a more complicated DAG, more tasks, 150 00:07:30.879 --> 00:07:34.800 whereas providing an explicit schema upfront can simplify things, potentially 151 00:07:34.839 --> 00:07:37.920 cutting down execution time quite a bit. Oh, and joins. We 152 00:07:37.959 --> 00:07:41.279 all do joins. The cookbook explains how Spark's optimizer is 153 00:07:41.319 --> 00:07:44.920 smart, choosing between different algorithms like sort-merge or broadcast 154 00:07:44.959 --> 00:07:47.120 hash joins, but it also shows how you can influence 155 00:07:47.160 --> 00:07:49.639 that choice using hints in your SQL or DataFrame code 156 00:07:49.639 --> 00:07:52.680 to suggest a specific join strategy if you know something about 157 00:07:52.439 --> 00:07:55.120 your data. Which leads to the million dollar question: how 158 00:07:55.120 --> 00:07:57.480 do you make your Spark apps faster? The cookbook gets into 159 00:07:57.519 --> 00:08:02.120 the nitty gritty: input partitions, shuffle partitions, output partitions. So 160 00:08:02.199 --> 00:08:06.800 Spark reads data from, say, HDFS or ADLS in blocks. By default, 161 00:08:07.000 --> 00:08:09.800 each block might become one partition, but you can tweak 162 00:08:09.800 --> 00:08:12.439 settings like spark.sql.files.maxPartitionBytes 163 00:08:12.519 --> 00:08:16.759 to control that initial partition size, which directly impacts parallelism. 164 00:08:16.920 --> 00:08:20.120 More, smaller partitions can mean more tasks running in parallel. 165 00:08:20.240 --> 00:08:22.879 And then there's spark.sql.shuffle.partitions. 166 00:08:23.000 --> 00:08:26.680 Shuffling data between stages is expensive, involves network traffic. 
This 167 00:08:26.759 --> 00:08:30.000 setting controls how many partitions are created after a shuffle. Now, 168 00:08:30.000 --> 00:08:32.279 the book is honest, there's no single magic number for 169 00:08:32.320 --> 00:08:34.440 shuffle partitions. It really depends on your cluster size, your 170 00:08:34.519 --> 00:08:38.279 data volume. But getting this reasonably right, tuning it, is absolutely 171 00:08:38.279 --> 00:08:40.279 critical for good performance. You have to experiment a 172 00:08:40.200 --> 00:08:44.000 bit. Makes sense. Okay, shifting gears a bit: real-time data. 173 00:08:44.120 --> 00:08:46.679 It's everywhere now. Azure Databricks handles this with 174 00:08:46.720 --> 00:08:51.279 Structured Streaming. The cookbook gives good examples, like reading streaming 175 00:08:51.320 --> 00:08:55.279 data from Kafka, or specifically Kafka-enabled Event Hubs in Azure, 176 00:08:55.919 --> 00:08:59.159 and even this clever trick: treating a simple folder full 177 00:08:59.200 --> 00:09:01.360 of JSON log files as if it were a live 178 00:09:01.399 --> 00:09:03.480 streaming source, which is pretty neat. 179 00:09:03.559 --> 00:09:06.000 Yeah, that folder trick is handy, but one of the 180 00:09:06.039 --> 00:09:09.840 inherent challenges with any streaming system is late data, right, 181 00:09:10.080 --> 00:09:13.039 data arriving out of order. The cookbook points out how 182 00:09:13.120 --> 00:09:17.039 Databricks Structured Streaming handles this pretty gracefully, automatically placing 183 00:09:17.120 --> 00:09:19.720 data into the correct time window. But this is where 184 00:09:19.759 --> 00:09:22.679 watermarking comes in. It's a crucial concept. You essentially 185 00:09:22.720 --> 00:09:25.159 tell Spark, hey, data can be late, but only up 186 00:09:25.200 --> 00:09:28.360 to this much late. Anything older than the watermark gets ignored. 
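To make the watermark rule just described a little more tangible, here is a minimal sketch in plain Python. It is not the Spark API: the `apply_watermark` function, the sample timestamps and the ten-minute threshold are all assumptions for illustration. It also simplifies by moving the watermark after every event, whereas Structured Streaming works in micro-batches; in real Spark code the threshold is declared with `DataFrame.withWatermark`.

```python
# Toy model of event-time watermarking. Plain Python, not the Spark API.
from datetime import datetime, timedelta


def apply_watermark(events, allowed_lateness):
    """Split events into (accepted, dropped) using a moving watermark.

    events: iterable of (event_time, payload) in arrival order.
    The watermark trails the latest event time seen so far by
    `allowed_lateness`; events older than the watermark are dropped.
    """
    max_event_time = None
    accepted, dropped = [], []
    for event_time, payload in events:
        too_late = (
            max_event_time is not None
            and event_time < max_event_time - allowed_lateness
        )
        if too_late:
            dropped.append(payload)
            continue
        accepted.append(payload)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return accepted, dropped


# Hypothetical arrival sequence: "c" and "d" arrive out of order.
base = datetime(2024, 1, 1, 12, 0)
arrivals = [
    (base, "a"),
    (base + timedelta(minutes=20), "b"),
    (base + timedelta(minutes=12), "c"),  # 8 minutes behind "b": kept
    (base + timedelta(minutes=5), "d"),   # 15 minutes behind "b": dropped
]
accepted, dropped = apply_watermark(arrivals, timedelta(minutes=10))
```

With a ten-minute allowance, the event that is eight minutes behind the newest one still counts, while the one that is fifteen minutes behind is discarded, which is what lets old aggregation state be cleaned up.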
187 00:09:28.759 --> 00:09:31.919 This stops Spark from having to constantly update old aggregated 188 00:09:31.960 --> 00:09:34.320 results from ages ago, keeps things 189 00:09:34.120 --> 00:09:38.399 manageable. Right, prevents infinite state. And the book details windowing 190 00:09:38.440 --> 00:09:43.080 for aggregations on streams, explains both types. Tumbling windows, those 191 00:09:43.120 --> 00:09:46.200 are fixed, non-overlapping blocks of time, like every five minutes, 192 00:09:46.519 --> 00:09:49.360 and then sliding windows. These overlap, like a ten-minute 193 00:09:49.399 --> 00:09:52.039 window that slides forward every five minutes. A single event 194 00:09:52.080 --> 00:09:56.600 can fall into multiple windows. It also clarifies offsets and checkpoints, 195 00:09:56.919 --> 00:10:00.799 especially for stateful streaming, where you're doing counts, sums, averages 196 00:10:00.840 --> 00:10:05.279 over time. Spark processes the stream in micro-batches. Checkpoints are 197 00:10:05.279 --> 00:10:07.120 how it remembers where it got up to, the last 198 00:10:07.120 --> 00:10:09.200 offset processed in the source. Exactly. 199 00:10:09.200 --> 00:10:11.840 So if a job fails and restarts, the checkpoint lets 200 00:10:11.840 --> 00:10:14.000 it pick up right where it left off, ensuring no 201 00:10:14.120 --> 00:10:17.320 data is missed or processed twice. It's key for fault 202 00:10:17.360 --> 00:10:18.600 tolerance and consistency. 203 00:10:19.039 --> 00:10:21.480 Okay, now this next part, I think many people would 204 00:10:21.480 --> 00:10:24.120 agree this is where things get really interesting: Delta Lake. 205 00:10:24.440 --> 00:10:27.200 The cookbook presents this open source storage layer which sits 206 00:10:27.279 --> 00:10:29.759 right on top of your cloud storage, like ADLS Gen2, 207 00:10:30.320 --> 00:10:33.120 and it positions Delta Lake as the solution, the answer 208 00:10:33.200 --> 00:10:36.840 to those classic data lake problems. 
No schema enforcement, no 209 00:10:36.919 --> 00:10:41.039 consistency guarantees, no ACID transactions, the data swamp problem. 210 00:10:41.200 --> 00:10:44.679 Oh, absolutely. Delta Lake is a genuine game changer, bringing 211 00:10:45.159 --> 00:10:51.480 ACID, atomicity, consistency, isolation, durability, those database-level guarantees, bringing 212 00:10:51.519 --> 00:10:54.960 them to the data lake. That's huge. Data lakes traditionally 213 00:10:55.039 --> 00:10:59.600 lack that, but Delta gives you reliable transactions plus schema enforcement. 214 00:10:59.639 --> 00:11:02.399 Like you said, it rejects data that doesn't fit the 215 00:11:02.399 --> 00:11:05.879 table's structure, but it also allows schema evolution, so you 216 00:11:05.919 --> 00:11:08.720 can change the schema over time as your data needs change. 217 00:11:08.919 --> 00:11:13.399 That's practical. And crucially, enabling update and delete operations 218 00:11:13.440 --> 00:11:16.360 directly on your data lake files. That was a massive 219 00:11:16.399 --> 00:11:18.799 pain point before Delta. Now it's straightforward. 220 00:11:19.120 --> 00:11:21.720 So the cookbook shows the basics, naturally: how to create 221 00:11:21.759 --> 00:11:24.159 Delta tables, read from them, write to them, saving 222 00:11:24.159 --> 00:11:27.360 DataFrames in Delta format. And it tackles concurrency, always a 223 00:11:27.360 --> 00:11:30.399 big issue in distributed systems. It explains how Delta uses 224 00:11:30.440 --> 00:11:34.200 optimistic concurrency control. Multiple jobs can try to write at 225 00:11:34.200 --> 00:11:37.879 the same time. Delta handles this by creating new table versions. 226 00:11:38.360 --> 00:11:41.000 If two jobs try to commit based on the same 227 00:11:41.399 --> 00:11:45.480 older version, only one succeeds, the other gets rejected. 
It 228 00:11:45.519 --> 00:11:49.080 even points out specific exceptions you might hit, like 229 00:11:49.159 --> 00:11:53.879 ConcurrentTransactionException or ConcurrentAppendException, especially with multiple streaming 230 00:11:53.919 --> 00:11:55.919 queries hitting the same table. Right, and the way 231 00:11:55.759 --> 00:11:58.399 it handles this is pretty neat. It doesn't use traditional 232 00:11:58.480 --> 00:12:02.039 database locks, which can cause bottlenecks. Instead, it makes sure 233 00:12:02.039 --> 00:12:06.240 that transactions trying to commit are processed mutually exclusively, one 234 00:12:06.279 --> 00:12:09.759 after the other. The first one wins, updates the transaction log, 235 00:12:09.840 --> 00:12:12.559 creates a new table version. The second one, seeing the 236 00:12:12.559 --> 00:12:16.879 table has changed underneath it, fails gracefully. Ensures integrity. And 237 00:12:16.919 --> 00:12:19.639 the book also notes that partitioning your Delta table smartly 238 00:12:19.679 --> 00:12:22.120 can really help reduce the chances of these conflicts in 239 00:12:22.120 --> 00:12:22.720 the first place. 240 00:12:22.879 --> 00:12:26.600 Good tip. Performance is always key too. The cookbook introduces 241 00:12:26.639 --> 00:12:31.000 OPTIMIZE and ZORDER. OPTIMIZE is about fixing the small file problem. 242 00:12:31.320 --> 00:12:34.799 It compacts lots of small data files into fewer, larger ones, 243 00:12:34.879 --> 00:12:38.399 much better for read performance. And ZORDER is even more advanced. 244 00:12:38.639 --> 00:12:41.679 It physically co-locates related data within the files based 245 00:12:41.720 --> 00:12:43.320 on columns you specify. Exactly. 246 00:12:43.360 --> 00:12:46.080 It's like multi-dimensional clustering. So when you query with 247 00:12:46.120 --> 00:12:49.320 filters on those Z-ordered columns, Spark can skip reading 248 00:12:49.440 --> 00:12:52.120 huge chunks of irrelevant data. Big speed up. 
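The optimistic concurrency behaviour discussed a moment ago, where two writers start from the same table version and only the first commit wins, can be sketched in a few lines. This is a toy model in plain Python, not Delta Lake's implementation or API: `VersionedTable`, `CommitConflict` and the in-memory log are hypothetical names. The sketch serializes only the short commit step and rejects any writer whose snapshot is stale, which is stricter than a real system needs to be when the two writers touched unrelated data.

```python
# Toy model of optimistic concurrency control on a versioned table.
# Plain Python; class and method names are hypothetical, not Delta Lake's API.
import threading


class CommitConflict(Exception):
    """Raised when the table moved on since the writer read it."""


class VersionedTable:
    def __init__(self):
        self._log = []  # one entry per committed version
        self._commit_lock = threading.Lock()

    @property
    def version(self):
        return len(self._log)

    def commit(self, read_version, changes):
        """Commit `changes` only if nobody committed after `read_version`."""
        with self._commit_lock:  # only the commit step is serialized
            if read_version != self.version:
                raise CommitConflict(
                    f"read version {read_version}, table is at {self.version}"
                )
            self._log.append(changes)
            return self.version


table = VersionedTable()
seen_by_a = table.version  # both writers read version 0
seen_by_b = table.version
table.commit(seen_by_a, "writer A rows")  # first commit wins
try:
    table.commit(seen_by_b, "writer B rows")
    b_succeeded = True
except CommitConflict:
    b_succeeded = False  # B has to re-read the table and retry
```

Writers never block each other while they do their work; the only cost of a conflict is that the loser has to re-read and try again, which is why reducing overlap between writers (for example through partitioning) pays off.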
249 00:12:52.639 --> 00:12:56.440 Delta tables also support constraints, like in databases. The cookbook 250 00:12:56.480 --> 00:12:59.679 mentions CHECK constraints, evaluating a boolean expression for each 251 00:12:59.799 --> 00:13:03.600 row, and standard NOT NULL constraints, and if you try 252 00:13:03.600 --> 00:13:06.519 to insert data that violates these you get an 253 00:13:06.559 --> 00:13:08.399 InvariantViolationException. Helps maintain 254 00:13:08.200 --> 00:13:11.360 data quality. But honestly, for you, the user, maybe one 255 00:13:11.360 --> 00:13:14.080 of the absolute coolest, most powerful features of Delta is 256 00:13:14.120 --> 00:13:18.039 the versioning and time travel. Every single change, every transaction 257 00:13:18.399 --> 00:13:21.399 is recorded in the Delta transaction log, these JSON files 258 00:13:21.440 --> 00:13:23.679 in the _delta_log folder. This means you have a 259 00:13:23.679 --> 00:13:26.759 complete history of your table. You could literally query the 260 00:13:26.799 --> 00:13:28.879 table as it was at a specific point in time 261 00:13:28.960 --> 00:13:32.799 or specific version number. Made a mistake, accidental delete, bad update? 262 00:13:33.120 --> 00:13:35.639 You can just query the previous version or even restore 263 00:13:35.679 --> 00:13:37.759 the table to that point. It's like a built-in 264 00:13:38.120 --> 00:13:40.960 undo button for your entire data lake. Invaluable. 265 00:13:41.120 --> 00:13:44.200 That time travel is amazing. Okay, so the cookbook takes 266 00:13:44.240 --> 00:13:47.679 all these individual pieces, the setup, storage, Spark, streaming, Delta, 267 00:13:47.720 --> 00:13:49.799 and ties them together. It presents an end-to-end 268 00:13:49.840 --> 00:13:53.639 solution, building near real-time analytics and a modern data warehouse. 269 00:13:53.960 --> 00:13:56.919 It shows ingesting data from all sorts of places. 
Azure 270 00:13:56.919 --> 00:14:00.559 Event Hubs for the streaming stuff, ADLS Gen2 for 271 00:14:00.639 --> 00:14:04.399 batch files, maybe Azure SQL Database for lookup tables or metadata. 272 00:14:04.480 --> 00:14:07.320 Yeah, and the core architecture they showcase is very much 273 00:14:07.320 --> 00:14:11.000 that lakehouse pattern we hear about. It's powerful. The 274 00:14:11.080 --> 00:14:14.200 idea is you process all this diverse data, maybe land 275 00:14:14.200 --> 00:14:17.679 structured stuff in Synapse Analytics using traditional fact and dimension 276 00:14:17.799 --> 00:14:20.440 tables for BI, but you also keep the raw and 277 00:14:20.519 --> 00:14:23.200 processed data in Delta Lake. Maybe put some results in 278 00:14:23.200 --> 00:14:26.519 Cosmos DB too, specifically to power those near real-time 279 00:14:26.600 --> 00:14:29.879 dashboards and applications. It blends the best of both worlds, 280 00:14:29.919 --> 00:14:32.759 the flexibility of a lake, the structure of a warehouse. 281 00:14:32.799 --> 00:14:35.480 Also, the book walks through a scenario simulating vehicle sensor 282 00:14:35.519 --> 00:14:39.320 data, JSON format, streaming into Event Hubs. Then Azure 283 00:14:39.320 --> 00:14:42.720 Databricks, using Spark Structured Streaming, picks it up, processes it, 284 00:14:43.039 --> 00:14:46.879 stores aggregated results in Delta tables. Maybe the raw, non- 285 00:14:46.919 --> 00:14:50.120 aggregated data goes off to Synapse and Cosmos DB as well. 
286 00:14:50.399 --> 00:14:53.279 It shows processing both streaming and batch data together, even 287 00:14:53.360 --> 00:14:55.840 joining the live stream with static lookup tables pulled from 288 00:14:55.840 --> 00:14:58.919 Azure SQL, and it explains the transformation stages using that 289 00:14:58.960 --> 00:15:03.120 medallion architecture: bronze for raw, silver for cleaned and enriched, gold 290 00:15:03.159 --> 00:15:06.519 for aggregated, business-ready data, all typically stored as Delta 291 00:15:06.559 --> 00:15:07.600 tables. Exactly. 292 00:15:07.639 --> 00:15:10.360 That bronze, silver, gold pattern is super common, provides 293 00:15:10.080 --> 00:15:13.399 great structure. And crucially, the cookbook shows you can build 294 00:15:13.519 --> 00:15:16.679 visualizations directly in a Databricks notebook for that near 295 00:15:16.759 --> 00:15:20.960 real-time view: define queries, whip up bar charts, pie charts, whatever, 296 00:15:21.320 --> 00:15:24.320 and pin them to a notebook dashboard, and that dashboard 297 00:15:24.360 --> 00:15:28.000 can automatically refresh as new data streams in. Pretty cool 298 00:15:28.039 --> 00:15:31.919 for quick operational views. But for more robust enterprise BI, 299 00:15:32.279 --> 00:15:36.200 it walks through connecting Power BI using the native Azure 300 00:15:36.200 --> 00:15:39.679 Databricks connector in Power BI Desktop. You just need the server 301 00:15:39.759 --> 00:15:42.960 host name and HTTP path details from your Databricks cluster, 302 00:15:43.440 --> 00:15:46.000 then you can directly query those Delta Lake tables. 
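As a rough picture of the bronze, silver, gold stages mentioned earlier in this stretch, here is a small sketch in plain Python. Nothing here uses Spark or Delta: the sensor field names (`vehicle_id`, `speed`), the sample readings, and the `to_silver` and `to_gold` helpers are all hypothetical, and in the cookbook's setting each stage would be a Delta table rather than a Python list.

```python
# Toy bronze/silver/gold pipeline in plain Python (no Spark or Delta involved).
# Field names and sample readings are hypothetical.
import json
from collections import defaultdict

# Bronze: raw events exactly as they arrived, including bad ones.
bronze = [
    '{"vehicle_id": "v1", "speed": 62.0}',
    '{"vehicle_id": "v1", "speed": 58.0}',
    '{"vehicle_id": "v2", "speed": null}',
    "not json at all",
    '{"vehicle_id": "v2", "speed": 40.0}',
]


def to_silver(raw_events):
    """Silver: parsed, validated records; malformed input is dropped."""
    clean = []
    for raw in raw_events:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or "vehicle_id" not in record:
            continue
        if isinstance(record.get("speed"), (int, float)):
            clean.append(
                {"vehicle_id": record["vehicle_id"], "speed": float(record["speed"])}
            )
    return clean


def to_gold(silver_records):
    """Gold: business-ready aggregate, here average speed per vehicle."""
    speeds = defaultdict(list)
    for record in silver_records:
        speeds[record["vehicle_id"]].append(record["speed"])
    return {vehicle: sum(values) / len(values) for vehicle, values in speeds.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the untouched bronze data around is what makes the later stages safe to rebuild: if a cleaning rule turns out to be wrong, silver and gold can be recomputed from the raw events.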
303 00:15:46.039 --> 00:15:49.480 So this direct connection is key, because Databricks' optimized 304 00:15:49.519 --> 00:15:53.279 engine working with Delta, combined with Power BI's native connector using 305 00:15:53.360 --> 00:15:56.559 efficient ODBC drivers, it means you can get really close 306 00:15:56.600 --> 00:15:59.879 to real-time insights in your Power BI reports without constantly 307 00:16:00.039 --> 00:16:02.919 hitting refresh manually. It is designed for that low latency 308 00:16:02.960 --> 00:16:05.759 experience, getting actionable intelligence fast. 309 00:16:06.080 --> 00:16:10.200 And finally, how do you automate this whole complex flow? Orchestration. 310 00:16:11.000 --> 00:16:14.639 The cookbook clearly shows using Azure Data Factory, ADF. ADF 311 00:16:14.679 --> 00:16:18.080 acts as that serverless ETL/ELT orchestrator. It 312 00:16:18.120 --> 00:16:21.000 can trigger your Databricks notebooks, run other Azure tasks, 313 00:16:21.080 --> 00:16:24.480 manage dependencies, handle failures, basically run the entire end to 314 00:16:24.559 --> 00:16:27.679 end pipeline reliably. Okay, we're covering a lot, but no 315 00:16:27.840 --> 00:16:31.000 modern data solution discussion is complete without talking DevOps and 316 00:16:31.000 --> 00:16:35.720 security. Absolutely critical. The cookbook dedicates good sections to CI/CD, 317 00:16:35.799 --> 00:16:39.679 continuous integration, continuous deployment, specifically for your Databricks notebooks 318 00:16:39.720 --> 00:16:40.799 using Azure DevOps. 319 00:16:40.840 --> 00:16:42.799 Yeah, and this is so important. It's not just about 320 00:16:42.799 --> 00:16:46.000 pushing code faster. It means proper source control for your notebooks. 
321 00:16:46.039 --> 00:16:49.519 Maybe in GitHub or Azure Repos, versioning everything, and 322 00:16:49.559 --> 00:16:53.000 then automating the deployment to different environments, dev, test, UAT, 323 00:16:53.240 --> 00:16:56.759 prod, through release pipelines. It reduces manual effort, reduces errors, 324 00:16:57.000 --> 00:17:00.240 ensures you have consistent, reliable deployments every single time. It's 325 00:17:00.240 --> 00:17:01.960 professionalizing your Databricks 326 00:17:01.679 --> 00:17:07.200 development. Absolutely. And then security, paramount. The book details understanding 327 00:17:07.240 --> 00:17:10.599 and setting up RBAC, role-based access control, and also 328 00:17:10.920 --> 00:17:15.440 ACLs, access control lists, within Azure, specifically for your ADLS Gen2 329 00:17:15.519 --> 00:17:19.319 storage. RBAC lets you grant broader permissions, like maybe 330 00:17:19.519 --> 00:17:22.359 Storage Blob Data Reader for a whole container or storage 331 00:17:22.359 --> 00:17:23.119 account. Right. 332 00:17:23.200 --> 00:17:25.880 RBAC is good for those broader strokes, but ACLs give 333 00:17:25.920 --> 00:17:29.000 you that really fine-grained control. You can set read, 334 00:17:29.400 --> 00:17:32.960 write, execute permissions on individual files and directories within the lake. 335 00:17:33.039 --> 00:17:35.000 This is essential if you have multiple teams sharing the 336 00:17:35.079 --> 00:17:37.480 lake or really sensitive data where you need to lock 337 00:17:37.519 --> 00:17:40.559 down access very tightly. You can grant access to specific 338 00:17:40.599 --> 00:17:43.920 AAD users or groups on specific folders, very granular. 339 00:17:44.119 --> 00:17:47.480 Another big security measure covered: deploying Databricks itself into 340 00:17:47.480 --> 00:17:50.680 your own Azure virtual network, a VNet. It explains provisioning 341 00:17:50.759 --> 00:17:54.480 Databricks workspaces within private and public subnets you control. 
342 00:17:54.599 --> 00:17:57.160 This isolates your Databricks environment and lets you securely 343 00:17:57.200 --> 00:18:01.039 access things like ADLS Gen2 using private endpoints, keeping traffic 344 00:18:01.079 --> 00:18:02.440 off the public internet. 345 00:18:02.240 --> 00:18:06.200 And managing secrets, always a headache. The integration with Azure 346 00:18:06.279 --> 00:18:09.559 Key Vault is highlighted. Key Vault becomes your central, super secure 347 00:18:09.599 --> 00:18:12.720 place to store things like storage account keys, database passwords, 348 00:18:12.839 --> 00:18:16.319 API keys. Your notebooks then fetch these secrets from Key Vault 349 00:18:16.319 --> 00:18:18.720 at runtime, rather than having them hard-coded in the 350 00:18:18.759 --> 00:18:23.200 notebook itself. Much, much more secure. Similarly, Azure App Configuration 351 00:18:23.319 --> 00:18:27.440 is mentioned for managing application settings centrally, keeping configurations separate 352 00:18:27.440 --> 00:18:30.559 from code. It can even reference secrets stored in Key Vault. 353 00:18:30.799 --> 00:18:35.160 And what about monitoring, troubleshooting? The cookbook covers setting up 354 00:18:35.200 --> 00:18:38.880 a Log Analytics workspace in Azure Monitor and integrating 355 00:18:38.880 --> 00:18:42.599 Databricks to send its logs there: Spark logs, cluster logs, audit logs. 356 00:18:42.839 --> 00:18:45.279 Then you can use KQL, the Kusto Query Language, to 357 00:18:45.359 --> 00:18:48.960 query all that telemetry data, find errors, track performance. You 358 00:18:48.960 --> 00:18:51.559 can even build dashboards in Azure Monitor to get a 359 00:18:51.640 --> 00:18:54.759 high-level view of the health across all your Azure services, 360 00:18:54.799 --> 00:18:58.440 including Databricks. And lastly, within Databricks itself, there's 361 00:18:58.480 --> 00:19:02.039 cluster access control. 
Admins can define who is allowed to 362 00:19:02.039 --> 00:19:06.359 create clusters and manage them. Plus, cluster visibility control, especially in 363 00:19:06.400 --> 00:19:10.240 premium workspaces, restricts who can even see certain clusters, which adds 364 00:19:10.240 --> 00:19:13.279 another layer of security and governance. 365 00:19:13.319 --> 00:19:16.160 Wow. Okay, that was a lot to unpack from just these excerpts, 366 00:19:16.240 --> 00:19:18.519 wasn't it? But hopefully you listening now have a really 367 00:19:18.839 --> 00:19:21.680 solid feel for the immense capabilities packed into Azure Databricks, 368 00:19:21.799 --> 00:19:24.839 especially for accelerating and scaling real-time analytics. From just 369 00:19:24.839 --> 00:19:27.400 setting up the core services, handling all sorts of data formats, 370 00:19:27.400 --> 00:19:30.559 optimizing Spark, dealing with streaming data, and then leveraging the 371 00:19:30.799 --> 00:19:32.640 frankly amazing power of Delta Lake, 372 00:19:32.720 --> 00:19:36.480 this cookbook really does lay out a comprehensive roadmap for 373 00:19:36.519 --> 00:19:38.680 anyone working with data on Azure today. 374 00:19:39.319 --> 00:19:42.000 Absolutely, and when you connect all those dots like we've 375 00:19:42.039 --> 00:19:45.000 tried to do, it's just clear that Databricks gives 376 00:19:45.039 --> 00:19:48.440 you this complete toolkit. You can build genuinely robust modern 377 00:19:48.480 --> 00:19:51.960 data warehouses, near-real-time analytical solutions. You've got the 378 00:19:52.039 --> 00:19:54.960 visualization built in or connected via Power BI. You've got 379 00:19:55.000 --> 00:19:58.519 the automation through ADF, and those critical security and DevOps 380 00:19:58.519 --> 00:20:02.079 integrations are covered. It really empowers you to build enterprise 381 00:20:02.079 --> 00:20:04.319 grade data platforms that can handle pretty much anything you 382 00:20:04.359 --> 00:20:04.920 throw at them.
383 00:20:05.039 --> 00:20:08.119 So what does this all mean for you, practically? Well, 384 00:20:08.119 --> 00:20:10.839 with these kinds of tools at your fingertips, managing complex, 385 00:20:10.960 --> 00:20:14.400 large-scale data systems on Azure is not just possible, 386 00:20:14.400 --> 00:20:17.119 it's highly optimized. It lets you move beyond just old 387 00:20:17.160 --> 00:20:20.640 school batch processing and really embrace real-time insights, get 388 00:20:20.680 --> 00:20:24.160 answers faster, all while making sure your data is reliable, consistent, 389 00:20:24.240 --> 00:20:26.279 and secure thanks to things like Delta Lake and the 390 00:20:26.319 --> 00:20:30.200 security features. So as you think about the sheer volume 391 00:20:30.240 --> 00:20:33.279 and speed of data being generated today, here's maybe a 392 00:20:33.279 --> 00:20:36.039 final thought for you to mull over. If Delta Lake 393 00:20:36.079 --> 00:20:39.839 can bring those database-like guarantees, ACID transactions, schema enforcement, 394 00:20:39.880 --> 00:20:42.480 to the inherent flexibility and scale of a data lake, 395 00:20:42.720 --> 00:20:45.400 does this fundamentally change how we should think about designing 396 00:20:45.400 --> 00:20:48.799 all our future data architectures? Does it push us firmly 397 00:20:48.839 --> 00:20:51.240 into that lakehouse paradigm as the default for almost every 398 00:20:51.279 --> 00:20:53.759 kind of data? And what new possibilities does that unlock 399 00:20:53.799 --> 00:20:55.559 for your next big data challenge?