WEBVTT

1
00:00:00.080 --> 00:00:02.240
<v Speaker 1>Welcome to the deep dive, where we plunge into a

2
00:00:02.279 --> 00:00:06.759
<v Speaker 1>stack of information, research notes, you name it, to really

3
00:00:06.839 --> 00:00:09.759
<v Speaker 1>pull out those key nuggets of knowledge. We give you

4
00:00:09.800 --> 00:00:12.640
<v Speaker 1>a serious shortcut to being well informed. Today, we're doing

5
00:00:12.720 --> 00:00:15.240
<v Speaker 1>a deep dive into something I think is incredibly practical,

6
00:00:15.359 --> 00:00:17.800
<v Speaker 1>really powerful for anyone, you know, navigating the world of

7
00:00:17.800 --> 00:00:21.120
<v Speaker 1>big data. It's excerpts from the Azure Databricks Cookbook:

8
00:00:21.199 --> 00:00:25.039
<v Speaker 1>Accelerate and Scale Real-Time Analytics. Think of this as

9
00:00:25.039 --> 00:00:27.920
<v Speaker 1>your, well, your essential guide to building solid, cutting-edge

10
00:00:27.960 --> 00:00:30.719
<v Speaker 1>data solutions on Azure. Our mission for you today is

11
00:00:30.800 --> 00:00:33.560
<v Speaker 1>to really distill the core components and the key strategies

12
00:00:33.600 --> 00:00:36.560
<v Speaker 1>from this cookbook, so you grasp not just what Azure

13
00:00:36.600 --> 00:00:38.920
<v Speaker 1>Databricks can do, but really how it's applied, you know,

14
00:00:38.960 --> 00:00:41.679
<v Speaker 1>in the real world to tackle today's data challenges.

15
00:00:41.920 --> 00:00:44.520
<v Speaker 2>Yeah, and what makes the source material so so compelling

16
00:00:44.640 --> 00:00:47.000
<v Speaker 2>is really the background of the people involved, the authors,

17
00:00:47.359 --> 00:00:50.479
<v Speaker 2>Phani Raj and Vinod Jaiswal. I mean, these are deeply

18
00:00:50.640 --> 00:00:54.960
<v Speaker 2>experienced data architects and engineers at Microsoft. We're talking over a decade

19
00:00:54.960 --> 00:00:58.840
<v Speaker 2>each, and specifically in complex data warehouses, big data, real

20
00:00:58.880 --> 00:01:01.039
<v Speaker 2>time solutions on Azure. It's their bread and butter.

21
00:01:01.439 --> 00:01:03.439
<v Speaker 2>And then you've got the reviewers, Ankur Nayyar and

22
00:01:03.439 --> 00:01:05.959
<v Speaker 2>Alan Bernardo Palacio. They add this whole other layer, you know:

23
00:01:06.040 --> 00:01:09.920
<v Speaker 2>advanced data architecture, ML, scalable data pipelines. So this collective experience,

24
00:01:09.920 --> 00:01:11.959
<v Speaker 2>it means what we're exploring isn't just theory, right, it's

25
00:01:12.000 --> 00:01:16.480
<v Speaker 2>really grounded in hard-won practical know-how. You feel

26
00:01:16.519 --> 00:01:17.159
<v Speaker 2>that reading it?

27
00:01:17.239 --> 00:01:19.799
<v Speaker 1>Right? Okay, great point. So let's kick things off with

28
00:01:19.840 --> 00:01:23.239
<v Speaker 1>the absolute basics, the fundamentals. If you're looking to get

29
00:01:23.239 --> 00:01:26.280
<v Speaker 1>your hands dirty with Azure Databricks, the cookbook jumps

30
00:01:26.400 --> 00:01:29.719
<v Speaker 1>right into creating the service, doesn't mess around. It walks

31
00:01:29.719 --> 00:01:32.760
<v Speaker 1>you through setting up a workspace like directly in the

32
00:01:32.799 --> 00:01:36.159
<v Speaker 1>Azure portal, and highlights these key decisions you make right

33
00:01:36.200 --> 00:01:39.239
<v Speaker 1>at the start, like, for instance, the choice of VNet deployment.

34
00:01:39.319 --> 00:01:42.879
<v Speaker 1>It shows selecting No initially, maybe as a simpler start,

35
00:01:43.200 --> 00:01:46.480
<v Speaker 1>before you, you know, review and create the service in your

36
00:01:46.519 --> 00:01:50.040
<v Speaker 1>resource group, like the CookbookRG one they use as an example.

37
00:01:49.719 --> 00:01:53.400
<v Speaker 2>And that choice, the VNet deployment one, it's actually pretty significant,

38
00:01:53.439 --> 00:01:53.719
<v Speaker 2>isn't it.

39
00:01:53.799 --> 00:01:54.000
<v Speaker 1>Yeah.

40
00:01:54.040 --> 00:01:57.280
<v Speaker 2>The book smartly brings up alternatives early, like using the

41
00:01:57.319 --> 00:02:01.200
<v Speaker 2>Azure CLI for deployment, and that's just crucial for automation, right,

42
00:02:01.319 --> 00:02:05.359
<v Speaker 2>especially if you're thinking infrastructure as code, maybe scripting repeatable setups.

43
00:02:05.439 --> 00:02:08.719
<v Speaker 2>Or using DevOps pipelines. Imagine, like, standing up a whole

44
00:02:08.800 --> 00:02:11.319
<v Speaker 2>Databricks environment with just one command. That's the kind

45
00:02:11.319 --> 00:02:12.400
<v Speaker 2>of power it hints at.

46
00:02:12.639 --> 00:02:14.919
<v Speaker 1>Exactly. Okay, so you've got the workspace up and running,

47
00:02:15.000 --> 00:02:19.759
<v Speaker 1>what's next? Access control, obviously. The cookbook explains adding users

48
00:02:19.840 --> 00:02:23.439
<v Speaker 1>and groups straight from the Databricks admin console. Simple enough,

49
00:02:23.599 --> 00:02:26.240
<v Speaker 1>but they need to be in Azure Active Directory first,

50
00:02:26.319 --> 00:02:28.879
<v Speaker 1>that's the prerequisite. Then you get to the core, really

51
00:02:29.199 --> 00:02:32.159
<v Speaker 1>creating and managing clusters. This is where the processing happens.

52
00:02:32.319 --> 00:02:34.039
<v Speaker 1>You can spin them up from the UI, give it

53
00:02:34.080 --> 00:02:36.400
<v Speaker 1>a name, pick a cluster mode like standard, that's the

54
00:02:36.599 --> 00:02:40.120
<v Speaker 1>recommended one for single users, and it wisely defaults to

55
00:02:40.199 --> 00:02:42.919
<v Speaker 1>terminating after, what, one hundred and twenty minutes of inactivity.

56
00:02:43.000 --> 00:02:45.599
<v Speaker 2>Yeah, saves on costs, smart default, right.

57
00:02:45.599 --> 00:02:48.280
<v Speaker 1>And you pick your Spark version, the Databricks Runtime,

58
00:02:48.479 --> 00:02:51.159
<v Speaker 1>like Spark three point zero point one, Runtime seven

59
00:02:51.199 --> 00:02:52.439
<v Speaker 1>point four in their examples.

60
00:02:52.879 --> 00:02:55.719
<v Speaker 3>And this discussion around cluster modes, that's where you really

61
00:02:55.759 --> 00:02:59.879
<v Speaker 3>start tailoring things, optimizing your setup. So beyond that interactive

62
00:03:00.120 --> 00:03:03.879
<v Speaker 3>standard mode, the book introduces job clusters, and these aren't

63
00:03:03.879 --> 00:03:06.919
<v Speaker 3>just, like, a minor variation. It's a totally different approach for

64
00:03:07.000 --> 00:03:09.879
<v Speaker 3>scheduled stuff, for automation. They spin up, run your notebook

65
00:03:09.919 --> 00:03:12.199
<v Speaker 3>job, maybe triggered by Data Factory, and then here's

66
00:03:12.199 --> 00:03:14.520
<v Speaker 3>the cool part: they automatically delete themselves when done.

67
00:03:14.599 --> 00:03:17.800
<v Speaker 3>So for you, that means well, potentially huge cost savings

68
00:03:17.800 --> 00:03:20.919
<v Speaker 3>and super efficient resource use. You're only paying for compute

69
00:03:20.919 --> 00:03:22.439
<v Speaker 3>when it's actually crunching numbers.

70
00:03:22.919 --> 00:03:26.000
<v Speaker 1>Yeah, that auto delete is brilliant. And speaking of jobs,

71
00:03:26.159 --> 00:03:29.280
<v Speaker 1>the cookbook makes that transition smooth, from playing around in

72
00:03:29.319 --> 00:03:32.960
<v Speaker 1>notebooks interactively to automating them. It shows uploading a notebook

73
00:03:32.960 --> 00:03:35.759
<v Speaker 1>maybe a .dbc file, running the cells, then scheduling

74
00:03:35.759 --> 00:03:37.800
<v Speaker 1>it as a job in Databricks. And here's that

75
00:03:37.879 --> 00:03:40.400
<v Speaker 1>powerful bit again. You can configure the job to create

76
00:03:40.439 --> 00:03:43.919
<v Speaker 1>a new on demand job cluster just for that one task,

77
00:03:44.000 --> 00:03:47.439
<v Speaker 1>really flexible. And then for anyone building integrations, you know,

78
00:03:47.439 --> 00:03:51.520
<v Speaker 1>programmatically talking to Databricks, it covers authentication using PATs,

79
00:03:51.560 --> 00:03:54.360
<v Speaker 1>personal access tokens, or Azure AD tokens to hit the

80
00:03:54.400 --> 00:03:58.120
<v Speaker 1>REST APIs. It even shows connecting Power BI Desktop using a

81
00:03:58.159 --> 00:04:01.439
<v Speaker 1>PAT to visualize data in Spark tables, like the MDA

82
00:04:01.439 --> 00:04:04.520
<v Speaker 1>example they use. Okay, so environment's ready, clusters are figured out.

83
00:04:05.000 --> 00:04:06.960
<v Speaker 1>Now the big question: how do you get your data

84
00:04:07.000 --> 00:04:10.439
<v Speaker 1>in and out? The cookbook is really practical, step by

85
00:04:10.439 --> 00:04:14.520
<v Speaker 1>step instructions for mounting storage, specifically ADLs GENT two as

86
00:04:14.520 --> 00:04:17.000
<v Speaker 1>your data link storage gent too, and also as your

87
00:04:17.040 --> 00:04:20.560
<v Speaker 1>blob storage, mounting them right into DDFS, the Data Bricks filesystem.

88
00:04:20.560 --> 00:04:23.079
<v Speaker 1>It does involve registering an APP and AAD as your

89
00:04:23.079 --> 00:04:26.199
<v Speaker 1>active directory to get those credentials application ID, tenant ID

90
00:04:26.680 --> 00:04:31.199
<v Speaker 1>and the client secret. And obviously those secrets need super

91
00:04:31.240 --> 00:04:33.279
<v Speaker 1>careful handling, store them securely.

92
00:04:33.519 --> 00:04:37.240
<v Speaker 2>Absolutely, and this mounting process it's just such a game

93
00:04:37.319 --> 00:04:40.439
<v Speaker 2>changer for making data accessible because it lets you treat

94
00:04:40.480 --> 00:04:43.199
<v Speaker 2>your cloud storage almost like it's a local drive inside

95
00:04:43.240 --> 00:04:46.279
<v Speaker 2>Databricks. It smooths out data access for your notebooks,

96
00:04:46.560 --> 00:04:51.279
<v Speaker 2>makes interacting with potentially massive data sets feel well, seamless,

97
00:04:51.519 --> 00:04:52.839
<v Speaker 2>much less clunky.

98
00:04:52.519 --> 00:04:56.079
<v Speaker 1>Definitely. So storage is mounted. Now the cookbook dives into

99
00:04:56.240 --> 00:04:59.240
<v Speaker 1>reading and writing data: different formats, different services. They cover

100
00:04:59.360 --> 00:05:03.680
<v Speaker 1>CSV and Parquet files in detail. You learn about Spark's schema inference.

101
00:05:03.920 --> 00:05:06.000
<v Speaker 1>You know, where it tries to guess the data types.

102
00:05:05.680 --> 00:05:08.439
<v Speaker 2>Which can be okay, but sometimes...

103
00:05:08.480 --> 00:05:11.120
<v Speaker 1>Right, sometimes it just sees everything as a string initially. So,

104
00:05:11.480 --> 00:05:15.160
<v Speaker 1>more importantly, it guides you through explicitly defining that schema

105
00:05:15.560 --> 00:05:18.879
<v Speaker 1>using StructType, specifying things like IntegerType for a

106
00:05:18.920 --> 00:05:20.839
<v Speaker 1>C_CUSTKEY column, making sure it's correct.

107
00:05:21.000 --> 00:05:24.279
<v Speaker 2>And this is exactly where you unlock serious performance gains:

108
00:05:24.639 --> 00:05:30.240
<v Speaker 2>that explicit schema definition, plus the format choice itself. Parquet

109
00:05:30.680 --> 00:05:33.600
<v Speaker 2>being columnar isn't just about compression, though that's nice. Its

110
00:05:33.639 --> 00:05:38.680
<v Speaker 2>real power comes from optimizations like column pruning and predicate pushdown.

111
00:05:39.000 --> 00:05:41.519
<v Speaker 2>Think about it, your query only needs two columns out

112
00:05:41.519 --> 00:05:44.240
<v Speaker 2>of fifty. Parquet lets Spark read only those two columns.

113
00:05:44.639 --> 00:05:46.639
<v Speaker 2>Or if you have a filter, a WHERE clause, it

114
00:05:46.639 --> 00:05:49.199
<v Speaker 2>can push that filter down to the storage level. Avoids

115
00:05:49.240 --> 00:05:52.319
<v Speaker 2>reading tons of irrelevant data. Compared to reading whole rows

116
00:05:52.360 --> 00:05:55.120
<v Speaker 2>in CSV or JSON, it's, well, it's night and day

117
00:05:55.120 --> 00:05:56.160
<v Speaker 2>for big data queries.
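
NOTE
Editor's sketch, not taken from the cookbook: a toy comparison of a row-oriented scan with a columnar scan, to make the column pruning idea above concrete.
It only counts how many values each layout has to touch. Real Parquet works with row groups, column chunks and min/max statistics, and the filter-first step here is only a loose stand-in for predicate pushdown.
All names (ROWS, COLUMNS, scan_row_store, scan_column_store) are hypothetical.
```python
# Toy model only; none of these names come from Spark or Parquet.
ROWS = [
    {"id": 1, "country": "NZ", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "NZ", "amount": 40},
]
# Columnar layout: one list per column instead of one record per row.
COLUMNS = {name: [row[name] for row in ROWS] for name in ROWS[0]}
def scan_row_store(rows, wanted, predicate_col, predicate):
    """Row format (CSV/JSON-like): every field of every row is read."""
    values_read = sum(len(row) for row in rows)
    result = [{c: row[c] for c in wanted} for row in rows if predicate(row[predicate_col])]
    return result, values_read
def scan_column_store(columns, wanted, predicate_col, predicate):
    """Columnar format: read the filter column, then only the wanted
    columns, and only at the positions that passed the filter."""
    filter_values = columns[predicate_col]
    values_read = len(filter_values)
    keep = [i for i, v in enumerate(filter_values) if predicate(v)]
    result = []
    for i in keep:
        record = {}
        for c in wanted:
            record[c] = columns[c][i]
            if c != predicate_col:
                values_read += 1
        result.append(record)
    return result, values_read
row_result, row_cost = scan_row_store(ROWS, ["amount"], "country", lambda v: v == "NZ")
col_result, col_cost = scan_column_store(COLUMNS, ["amount"], "country", lambda v: v == "NZ")
print(row_result == col_result, row_cost, col_cost)
```
Both scans return the same rows, but the columnar one touches fewer values, and the gap widens as the table gets wider.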

118
00:05:56.360 --> 00:05:58.720
<v Speaker 1>Huge difference, and the cookbook doesn't stop there. It covers

119
00:05:58.759 --> 00:06:02.360
<v Speaker 1>processing JSON too, even complex nested JSON. It shows you the

120
00:06:02.399 --> 00:06:06.160
<v Speaker 1>Spark functions like to_json and from_json. And then, beyond files,

121
00:06:06.160 --> 00:06:08.839
<v Speaker 1>it talks about reading and writing to Azure SQL Database

122
00:06:08.879 --> 00:06:12.439
<v Speaker 1>and also Azure Synapse Analytics, specifically the dedicated SQL

123
00:06:12.439 --> 00:06:13.920
<v Speaker 1>pool using the native connectors.

124
00:06:14.120 --> 00:06:16.959
<v Speaker 2>Yeah, and if you zoom out a bit, this ability

125
00:06:17.000 --> 00:06:21.000
<v Speaker 2>to seamlessly integrate with all these services, Azure SQL, Synapse,

126
00:06:21.360 --> 00:06:24.680
<v Speaker 2>even Cosmos DB, which also has a Spark connector for

127
00:06:24.720 --> 00:06:27.600
<v Speaker 2>batch and streaming, that's what really cements Databricks as

128
00:06:27.600 --> 00:06:30.560
<v Speaker 2>this central hub, this sort of nervous system for a

129
00:06:30.600 --> 00:06:33.759
<v Speaker 2>modern data platform. It's all about bringing together your diverse

130
00:06:33.839 --> 00:06:36.000
<v Speaker 2>data sources into one place for analysis.

131
00:06:36.160 --> 00:06:38.040
<v Speaker 1>Okay, let's peek under the hood a bit. Ever

132
00:06:38.160 --> 00:06:40.839
<v Speaker 1>wonder what Spark is actually doing when you run a query,

133
00:06:40.920 --> 00:06:44.120
<v Speaker 1>It can feel like a black box sometimes. This cookbook

134
00:06:44.199 --> 00:06:48.759
<v Speaker 1>pulls back that curtain, introduces the concepts: jobs, stages, tasks,

135
00:06:49.079 --> 00:06:50.160
<v Speaker 1>how Spark breaks down the work.

136
00:06:49.920 --> 00:06:51.800
<v Speaker 2>And the key visual here, the really

137
00:06:51.839 --> 00:06:55.639
<v Speaker 2>insightful bit, is the directed acyclic graph, the DAG.

138
00:06:56.000 --> 00:06:58.600
<v Speaker 2>Think of it like Spark's internal blueprint for your query.

139
00:06:58.839 --> 00:07:01.399
<v Speaker 2>It shows exactly how it plans to execute it, and

140
00:07:01.439 --> 00:07:04.439
<v Speaker 2>you can see this DAG in the Spark UI. It breaks

141
00:07:04.480 --> 00:07:07.560
<v Speaker 2>down your whole application into these jobs, stages and tasks.

142
00:07:07.920 --> 00:07:11.439
<v Speaker 2>So for you the user, this is invaluable for debugging performance,

143
00:07:11.519 --> 00:07:14.040
<v Speaker 2>Like if you see one task taking way longer than

144
00:07:14.040 --> 00:07:16.279
<v Speaker 2>all the others, that's often your first big clue. You

145
00:07:16.360 --> 00:07:18.959
<v Speaker 2>might have data skew where one partition has way more

146
00:07:19.040 --> 00:07:21.199
<v Speaker 2>data than the others. The DAG helps you spot that.

147
00:07:20.920 --> 00:07:24.000
<v Speaker 1>And the book cleverly links this back to schema

148
00:07:24.040 --> 00:07:27.480
<v Speaker 1>definition. It shows how using that inferred schema we talked about

149
00:07:27.560 --> 00:07:30.560
<v Speaker 1>might lead to a more complicated DAG, more tasks,

150
00:07:30.879 --> 00:07:34.800
<v Speaker 1>whereas providing an explicit schema upfront can simplify things, potentially

151
00:07:34.839 --> 00:07:37.920
<v Speaker 1>cutting down execution time quite a bit. Oh, and joins. We

152
00:07:37.959 --> 00:07:41.279
<v Speaker 1>all do joins. The cookbook explains how Spark's optimizer is

153
00:07:41.319 --> 00:07:44.920
<v Speaker 1>smart, choosing between different algorithms like sort-merge or broadcast

154
00:07:44.959 --> 00:07:47.120
<v Speaker 1>hash joins, but it also shows how you can influence

155
00:07:47.160 --> 00:07:49.639
<v Speaker 1>that choice using hints in your SQL or DataFrame code

156
00:07:49.639 --> 00:07:52.680
<v Speaker 1>to suggest a specific join strategy if you know something about your data.

157
00:07:52.439 --> 00:07:55.120
<v Speaker 2>Which leads to the million-dollar question: how

158
00:07:55.120 --> 00:07:57.480
<v Speaker 2>do you make your Spark apps faster? The cookbook gets into

159
00:07:57.519 --> 00:08:02.120
<v Speaker 2>the nitty-gritty: input partitions, shuffle partitions, output partitions. So

160
00:08:02.199 --> 00:08:06.800
<v Speaker 2>Spark reads data from, say, HDFS or ADLS in blocks. By default,

161
00:08:07.000 --> 00:08:09.800
<v Speaker 2>each block might become one partition, but you can tweak

162
00:08:09.800 --> 00:08:12.439
<v Speaker 2>settings like spark.sql.files.maxPartitionBytes

163
00:08:12.519 --> 00:08:16.759
<v Speaker 2>to control that initial partition size, which directly impacts parallelism.

164
00:08:16.920 --> 00:08:20.120
<v Speaker 2>More smaller partitions can mean more tasks running in parallel.

165
00:08:20.240 --> 00:08:22.879
<v Speaker 2>And then there's spark.sql.shuffle.partitions.

166
00:08:23.000 --> 00:08:26.680
<v Speaker 2>Shuffling data between stages is expensive, involves network traffic. This

167
00:08:26.759 --> 00:08:30.000
<v Speaker 2>setting controls how many partitions are created after a shuffle. Now,

168
00:08:30.000 --> 00:08:32.279
<v Speaker 2>the book is honest, there's no single magic number for

169
00:08:32.320 --> 00:08:34.440
<v Speaker 2>shuffle partitions. It really depends on your cluster size, your

170
00:08:34.519 --> 00:08:38.279
<v Speaker 2>data volume. But getting this reasonably right, tuning it, is absolutely

171
00:08:38.279 --> 00:08:40.279
<v Speaker 2>critical for good performance. You have to experiment a bit.

172
00:08:40.200 --> 00:08:44.000
<v Speaker 1>Makes sense. Okay, shifting gears a bit: real-time data.

173
00:08:44.120 --> 00:08:46.679
<v Speaker 1>It's everywhere now. Azure Databricks handles this with

174
00:08:46.720 --> 00:08:51.279
<v Speaker 1>Structured Streaming. The cookbook gives good examples, like reading streaming

175
00:08:51.320 --> 00:08:55.279
<v Speaker 1>data from Kafka, or specifically Kafka-enabled Event Hubs in Azure,

176
00:08:55.919 --> 00:08:59.159
<v Speaker 1>and even this clever trick treating a simple folder full

177
00:08:59.200 --> 00:09:01.360
<v Speaker 1>of JSON log files as if it were a live

178
00:09:01.399 --> 00:09:03.480
<v Speaker 1>streaming source, which is pretty neat.

179
00:09:03.559 --> 00:09:06.000
<v Speaker 2>Yeah, that folder trick is handy, but one of the

180
00:09:06.039 --> 00:09:09.840
<v Speaker 2>inherent challenges with any streaming system is late data, right,

181
00:09:10.080 --> 00:09:13.039
<v Speaker 2>data arriving out of order. The cookbook points out how

182
00:09:13.120 --> 00:09:17.039
<v Speaker 2>Databricks Structured Streaming handles this pretty gracefully, automatically placing

183
00:09:17.120 --> 00:09:19.720
<v Speaker 2>data into the correct time window. But this is where

184
00:09:19.759 --> 00:09:22.679
<v Speaker 2>watermarking comes in. It's a crucial concept. You essentially

185
00:09:22.720 --> 00:09:25.159
<v Speaker 2>tell Spark, hey, data can be late, but only up

186
00:09:25.200 --> 00:09:28.360
<v Speaker 2>to this much late. Anything older than the watermark gets ignored.

187
00:09:28.759 --> 00:09:31.919
<v Speaker 2>This stops Spark from having to constantly update old aggregated

188
00:09:31.960 --> 00:09:34.320
<v Speaker 2>results from ages ago, keeps things manageable.

189
00:09:34.120 --> 00:09:38.399
<v Speaker 1>Right, prevents infinite state. And the book details windowing

190
00:09:38.440 --> 00:09:43.080
<v Speaker 1>for aggregations on streams, explains both types. Tumbling windows, those

191
00:09:43.120 --> 00:09:46.200
<v Speaker 1>are fixed, non-overlapping blocks of time, like every five minutes,

192
00:09:46.519 --> 00:09:49.360
<v Speaker 1>and then sliding windows. These overlap, like a ten-minute

193
00:09:49.399 --> 00:09:52.039
<v Speaker 1>window that slides forward every five minutes. A single event

194
00:09:52.080 --> 00:09:56.600
<v Speaker 1>can fall into multiple windows. It also clarifies offsets and checkpoints,

195
00:09:56.919 --> 00:10:00.799
<v Speaker 1>especially for stateful streaming, where you're doing counts, sums, averages

196
00:10:00.840 --> 00:10:05.279
<v Speaker 1>over time. Spark processes the stream in micro-batches. Checkpoints are

197
00:10:05.279 --> 00:10:07.120
<v Speaker 1>how it remembers where it got up to the last

198
00:10:07.120 --> 00:10:09.200
<v Speaker 1>offset processed in the source.

199
00:10:09.200 --> 00:10:11.840
<v Speaker 2>Exactly. So if a job fails and restarts, the checkpoint lets

200
00:10:11.840 --> 00:10:14.000
<v Speaker 2>it pick up right where it left off, ensuring no

201
00:10:14.120 --> 00:10:17.320
<v Speaker 2>data is missed or processed twice. It's key for fault

202
00:10:17.360 --> 00:10:18.600
<v Speaker 2>tolerance and consistency.
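
NOTE
Editor's sketch, not from the cookbook: a plain-Python model of the tumbling and sliding windows and the watermark rule described above.
Times are plain integers standing in for minutes, and the names windows_for and is_too_late are made up for illustration.
Structured Streaming's real behaviour is richer (the watermark is tracked per query and compared against window ends), so treat this as a simplification.
```python
# Simplified sketch; times are plain integers (minutes). Names are illustrative.
def windows_for(event_time, size, slide):
    """Return every [start, end) window that contains event_time.
    With slide == size the windows tumble, so exactly one window matches."""
    windows = []
    start = (event_time // slide) * slide  # latest window start at or before the event
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)
def is_too_late(event_time, max_event_time_seen, allowed_lateness):
    """Watermark = latest event time seen so far minus the allowed lateness."""
    watermark = max_event_time_seen - allowed_lateness
    return event_time < watermark
print(windows_for(7, size=5, slide=5))    # tumbling: one window
print(windows_for(7, size=10, slide=5))   # sliding: two overlapping windows
print(is_too_late(3, max_event_time_seen=20, allowed_lateness=10))
```
An event at minute 7 lands in a single five-minute tumbling window but in two ten-minute windows that slide every five minutes, which is the overlap the hosts describe.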

203
00:10:19.039 --> 00:10:21.480
<v Speaker 1>Okay, now this next part I think many people would

204
00:10:21.480 --> 00:10:24.120
<v Speaker 1>agree this is where things get really interesting. Delta Lake.

205
00:10:24.440 --> 00:10:27.200
<v Speaker 1>The cookbook presents this open source storage layer which sits

206
00:10:27.279 --> 00:10:29.759
<v Speaker 1>right on top of your cloud storage, like ADLS Gen2,

207
00:10:30.320 --> 00:10:33.120
<v Speaker 1>and it positions Delta Lake as the solution, the answer

208
00:10:33.200 --> 00:10:36.840
<v Speaker 1>to those classic data lake problems. No schema enforcement, no

209
00:10:36.919 --> 00:10:41.039
<v Speaker 1>consistency guarantees, no ACID transactions, the data swamp problem.

210
00:10:41.200 --> 00:10:44.679
<v Speaker 2>Oh. Absolutely, Delta Lake is a genuine game changer, bringing

211
00:10:45.159 --> 00:10:51.480
<v Speaker 2>ACID (atomicity, consistency, isolation, durability), those database-level guarantees, bringing

212
00:10:51.519 --> 00:10:54.960
<v Speaker 2>them to the data lake. That's huge. Data lakes traditionally

213
00:10:55.039 --> 00:10:59.600
<v Speaker 2>lack that, but Delta gives you reliable transactions plus schema enforcement.

214
00:10:59.639 --> 00:11:02.399
<v Speaker 2>Like you said, it rejects data that doesn't fit the

215
00:11:02.399 --> 00:11:05.879
<v Speaker 2>table's structure, but it also allows schema evolution, so you

216
00:11:05.919 --> 00:11:08.720
<v Speaker 2>can change the schema over time as your data needs change.

217
00:11:08.919 --> 00:11:13.399
<v Speaker 2>That's practical. And, crucially, enabling update and delete operations

218
00:11:13.440 --> 00:11:16.360
<v Speaker 2>directly on your data lake files. That was a massive

219
00:11:16.399 --> 00:11:18.799
<v Speaker 2>pain point before Delta. Now it's straightforward.

220
00:11:19.120 --> 00:11:21.720
<v Speaker 1>So the cookbook shows the basics naturally how to create

221
00:11:21.759 --> 00:11:24.159
<v Speaker 1>Delta tables, read from them, write to them, saving data

222
00:11:24.159 --> 00:11:27.360
<v Speaker 1>frames in Delta format. And it tackles concurrency, always a

223
00:11:27.360 --> 00:11:30.399
<v Speaker 1>big issue in distributed systems. It explains how Delta uses

224
00:11:30.440 --> 00:11:34.200
<v Speaker 1>optimistic concurrency control. Multiple jobs can try to write at

225
00:11:34.200 --> 00:11:37.879
<v Speaker 1>the same time. Delta handles this by creating new table versions.

226
00:11:38.360 --> 00:11:41.000
<v Speaker 1>If two jobs try to commit based on the same

227
00:11:41.399 --> 00:11:45.480
<v Speaker 1>older version, only one succeeds, the other gets rejected. It

228
00:11:45.519 --> 00:11:49.080
<v Speaker 1>even points out specific exceptions you might hit, like

229
00:11:49.159 --> 00:11:53.879
<v Speaker 1>ConcurrentTransactionException or ConcurrentAppendException, especially with multiple streaming

230
00:11:53.919 --> 00:11:55.919
<v Speaker 1>queries hitting the same table.

231
00:11:55.759 --> 00:11:58.399
<v Speaker 2>Right, and the way it handles this is pretty neat. It doesn't use traditional

232
00:11:58.480 --> 00:12:02.039
<v Speaker 2>database locks, which can cause bottlenecks. Instead, it makes sure

233
00:12:02.039 --> 00:12:06.240
<v Speaker 2>that transactions trying to commit are processed mutually exclusively, one

234
00:12:06.279 --> 00:12:09.759
<v Speaker 2>after the other. The first one wins, updates the transaction log,

235
00:12:09.840 --> 00:12:12.559
<v Speaker 2>creates a new table version. The second one, seeing the

236
00:12:12.559 --> 00:12:16.879
<v Speaker 2>table has changed underneath it, fails gracefully. That ensures integrity. And

237
00:12:16.919 --> 00:12:19.639
<v Speaker 2>the book also notes that partitioning your Delta table smartly

238
00:12:19.679 --> 00:12:22.120
<v Speaker 2>can really help reduce the chances of these conflicts in

239
00:12:22.120 --> 00:12:22.720
<v Speaker 2>the first place.
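
NOTE
Editor's sketch, not from the cookbook: a toy version of the optimistic concurrency idea described above, where a writer's commit succeeds only if the table is still at the version that writer read.
ToyTable and CommitConflict are hypothetical names. Delta Lake's actual protocol is more nuanced: it checks whether the intervening commits really conflict, and can let non-conflicting writes through.
```python
class CommitConflict(Exception):
    """Raised when the table moved on since the writer read it."""
class ToyTable:
    def __init__(self):
        self.log = []  # one entry per committed version
    @property
    def version(self):
        return len(self.log)
    def commit(self, read_version, change):
        # Validate-then-write: succeed only if nobody committed in between.
        if read_version != self.version:
            raise CommitConflict(f"read v{read_version}, table is at v{self.version}")
        self.log.append(change)
        return self.version
table = ToyTable()
seen_by_a = table.version   # both writers start from version 0
seen_by_b = table.version
table.commit(seen_by_a, "job A: append rows")
try:
    table.commit(seen_by_b, "job B: append rows")
    outcome = "committed"
except CommitConflict:
    outcome = "rejected"
print(table.version, outcome)
```
No lock is held while the jobs do their work. The conflict only surfaces at commit time, and the losing job can re-read the table and retry.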

240
00:12:22.879 --> 00:12:26.600
<v Speaker 1>Good tip. Performance is always key too. The cookbook introduces

241
00:12:26.639 --> 00:12:31.000
<v Speaker 1>OPTIMIZE and ZORDER. OPTIMIZE is about fixing the small-file problem.

242
00:12:31.320 --> 00:12:34.799
<v Speaker 1>It compacts lots of small data files into fewer, larger ones,

243
00:12:34.879 --> 00:12:38.399
<v Speaker 1>much better for read performance. And ZORDER is even more advanced.

244
00:12:38.639 --> 00:12:41.679
<v Speaker 1>It physically co locates related data within the files based

245
00:12:41.720 --> 00:12:43.320
<v Speaker 1>on columns you specify.

246
00:12:43.360 --> 00:12:46.080
<v Speaker 2>Exactly. It's like multi-dimensional clustering. So when you query with

247
00:12:46.120 --> 00:12:49.320
<v Speaker 2>filters on those Z-ordered columns, Spark can skip reading

248
00:12:49.440 --> 00:12:52.120
<v Speaker 2>huge chunks of irrelevant data. Big speed-up.

249
00:12:52.639 --> 00:12:56.440
<v Speaker 1>Delta tables also support constraints like in databases. The cookbook

250
00:12:56.480 --> 00:12:59.679
<v Speaker 1>mentions CHECK constraints, evaluating a boolean expression for each

251
00:12:59.799 --> 00:13:03.600
<v Speaker 1>row, and standard NOT NULL constraints. And if you try

252
00:13:03.600 --> 00:13:06.519
<v Speaker 1>to insert data that violates these, you get an

253
00:13:06.559 --> 00:13:08.399
<v Speaker 1>InvariantViolationException. Helps maintain data quality.

254
00:13:08.200 --> 00:13:11.360
<v Speaker 2>But honestly, for you, the user, maybe one

255
00:13:11.360 --> 00:13:14.080
<v Speaker 2>of the absolute coolest, most powerful features of Delta is

256
00:13:14.120 --> 00:13:18.039
<v Speaker 2>the versioning and time travel. Every single change, every transaction

257
00:13:18.399 --> 00:13:21.399
<v Speaker 2>is recorded in the Delta transaction log, these JSON files

258
00:13:21.440 --> 00:13:23.679
<v Speaker 2>in the _delta_log folder. This means you have a

259
00:13:23.679 --> 00:13:26.759
<v Speaker 2>complete history of your table. You could literally query the

260
00:13:26.799 --> 00:13:28.879
<v Speaker 2>table as it was at a specific point in time

261
00:13:28.960 --> 00:13:32.799
<v Speaker 2>or a specific version number. Made a mistake? Accidental delete, bad update?

262
00:13:33.120 --> 00:13:35.639
<v Speaker 2>you can just query the previous version or even restore

263
00:13:35.679 --> 00:13:37.759
<v Speaker 2>the table to that point. It's like a built in

264
00:13:38.120 --> 00:13:40.960
<v Speaker 2>undo button for your entire data lake. Invaluable.
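
NOTE
Editor's sketch, not from the cookbook: a toy append-only commit log showing why recording every change makes time travel possible.
VersionedTable, commit and snapshot are invented names. Delta Lake's real log holds JSON actions that point at Parquet data files rather than at individual rows.
```python
class VersionedTable:
    """Toy append-only log: each commit is ("add", row) or ("remove", row)."""
    def __init__(self):
        self.log = []
    def commit(self, op, row):
        self.log.append((op, row))
        return len(self.log)  # version number after this commit
    def snapshot(self, version=None):
        """Replay the first `version` commits (all of them by default)."""
        upto = len(self.log) if version is None else version
        rows = []
        for op, row in self.log[:upto]:
            if op == "add":
                rows.append(row)
            elif op == "remove":
                rows.remove(row)
        return rows
orders = VersionedTable()
orders.commit("add", "order-1")
orders.commit("add", "order-2")
orders.commit("remove", "order-1")   # the accidental delete
print(orders.snapshot())             # current state
print(orders.snapshot(version=2))    # state before the delete
```
Because nothing in the log is overwritten, any earlier version can be rebuilt by replaying a prefix of it, which is the undo-button effect the hosts describe.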

265
00:13:41.120 --> 00:13:44.200
<v Speaker 1>That time travel is amazing. Okay, So the cookbook takes

266
00:13:44.240 --> 00:13:47.679
<v Speaker 1>all these individual pieces, the setup, storage, spark, streaming, Delta

267
00:13:47.720 --> 00:13:49.799
<v Speaker 1>and ties them together. It presents an end to end

268
00:13:49.840 --> 00:13:53.639
<v Speaker 1>solution building near real time analytics and a modern data warehouse.

269
00:13:53.960 --> 00:13:56.919
<v Speaker 1>It shows ingesting data from all sorts of places: Azure

270
00:13:56.919 --> 00:14:00.559
<v Speaker 1>Event Hubs for the streaming stuff, ADLS Gen2 for

271
00:14:00.639 --> 00:14:04.399
<v Speaker 1>batch files, maybe Azure SQL Database for lookup tables or metadata.

272
00:14:04.480 --> 00:14:07.320
<v Speaker 2>Yeah, and the core architecture they showcase is very much

273
00:14:07.320 --> 00:14:11.000
<v Speaker 2>that lakehouse pattern we hear about. It's powerful. The

274
00:14:11.080 --> 00:14:14.200
<v Speaker 2>idea is you process all this diverse data, maybe land

275
00:14:14.200 --> 00:14:17.679
<v Speaker 2>structured stuff in Synapse Analytics using traditional fact and dimension

276
00:14:17.799 --> 00:14:20.440
<v Speaker 2>tables for BI, but you also keep the raw and

277
00:14:20.519 --> 00:14:23.200
<v Speaker 2>processed data in Delta Lake. Maybe put some results in

278
00:14:23.200 --> 00:14:26.519
<v Speaker 2>Cosmos DB too, specifically to power those near real time

279
00:14:26.600 --> 00:14:29.879
<v Speaker 2>dashboards and applications. It blends the best of both worlds,

280
00:14:29.919 --> 00:14:32.759
<v Speaker 2>the flexibility of a lake, the structure of a warehouse.

281
00:14:32.799 --> 00:14:35.480
<v Speaker 1>Also, the book walks through a scenario simulating vehicle sensor

282
00:14:35.519 --> 00:14:39.320
<v Speaker 1>data, JSON format, streaming into Event Hubs. Then Azure

283
00:14:39.320 --> 00:14:42.720
<v Speaker 1>Databricks, using Spark Structured Streaming, picks it up, processes it,

284
00:14:43.039 --> 00:14:46.879
<v Speaker 1>stores aggregated results in Delta tables. Maybe the raw, non-

285
00:14:46.919 --> 00:14:50.120
<v Speaker 1>aggregated data goes off to Synapse and Cosmos DB as well.

286
00:14:50.399 --> 00:14:53.279
<v Speaker 1>It shows processing both streaming and batch data together, even

287
00:14:53.360 --> 00:14:55.840
<v Speaker 1>joining the live stream with static lookup tables pulled from

288
00:14:55.840 --> 00:14:58.919
<v Speaker 1>Azure SQL. And it explains the transformation stages using that

289
00:14:58.960 --> 00:15:03.120
<v Speaker 1>medallion architecture: bronze for raw, silver for cleaned and enriched, gold

290
00:15:03.159 --> 00:15:06.519
<v Speaker 1>for aggregated, business-ready data, all typically stored as Delta

291
00:15:06.559 --> 00:15:07.600
<v Speaker 1>tables.

292
00:15:07.639 --> 00:15:10.360
<v Speaker 2>Exactly. That bronze, silver, gold pattern is super common, provides great structure.

293
00:15:10.080 --> 00:15:13.399
<v Speaker 1>And crucially, the cookbook shows you can build

294
00:15:13.519 --> 00:15:16.679
<v Speaker 1>visualizations directly in a Databricks notebook for that near

295
00:15:16.759 --> 00:15:20.960
<v Speaker 1>real time view, define queries, whip up bar charts, pie charts, whatever,

296
00:15:21.320 --> 00:15:24.320
<v Speaker 1>and pin them to a notebook dashboard, and that dashboard

297
00:15:24.360 --> 00:15:28.000
<v Speaker 1>can automatically refresh as new data streams in. Pretty cool

298
00:15:28.039 --> 00:15:31.919
<v Speaker 1>for quick operational views. But for more robust enterprise BI,

299
00:15:32.279 --> 00:15:36.200
<v Speaker 1>it walks through connecting Power BI using the native Azure

300
00:15:36.200 --> 00:15:39.679
<v Speaker 1>Databricks connector in Power BI Desktop. You just need the server

301
00:15:39.759 --> 00:15:42.960
<v Speaker 1>hostname and HTTP path details from your Databricks cluster,

302
00:15:43.440 --> 00:15:46.000
<v Speaker 1>then you can directly query those Delta Lake tables.

303
00:15:46.039 --> 00:15:49.480
<v Speaker 2>So this direct connection is key because Databricks' optimized

304
00:15:49.519 --> 00:15:53.279
<v Speaker 2>engine working with Delta, combined with Power BI's native connector using

305
00:15:53.360 --> 00:15:56.559
<v Speaker 2>efficient ODBC drivers, it means you can get really close

306
00:15:56.600 --> 00:15:59.879
<v Speaker 2>to real-time insights in your Power BI reports without constantly

307
00:16:00.039 --> 00:16:02.919
<v Speaker 2>hitting refresh manually. It is designed for that low-latency

308
00:16:02.960 --> 00:16:05.759
<v Speaker 2>experience getting actionable intelligence fast.

309
00:16:06.080 --> 00:16:10.200
<v Speaker 1>And finally, how do you automate this whole complex flow? Orchestration.

310
00:16:11.000 --> 00:16:14.639
<v Speaker 1>The cookbook clearly shows using Azure Data Factory, ADF. ADF

311
00:16:14.679 --> 00:16:18.080
<v Speaker 1>acts as that serverless ETL/ELT orchestrator. It

312
00:16:18.120 --> 00:16:21.000
<v Speaker 1>can trigger your Databricks notebooks, run other Azure tasks,

313
00:16:21.080 --> 00:16:24.480
<v Speaker 1>manage dependencies, handle failures, basically run the entire end to

314
00:16:24.559 --> 00:16:27.679
<v Speaker 1>end pipeline reliably. Okay, we're covering a lot, but no

315
00:16:27.840 --> 00:16:31.000
<v Speaker 1>modern data solution discussion is complete without talking DevOps and

316
00:16:31.000 --> 00:16:35.720
<v Speaker 1>security. Absolutely critical. The cookbook dedicates good sections to CI/CD,

317
00:16:35.799 --> 00:16:39.679
<v Speaker 1>continuous integration and continuous deployment, specifically for your Databricks notebooks

318
00:16:39.720 --> 00:16:40.799
<v Speaker 1>using Azure DevOps.

319
00:16:40.840 --> 00:16:42.799
<v Speaker 2>Yeah, and this is so important. It's not just about

320
00:16:42.799 --> 00:16:46.000
<v Speaker 2>pushing code faster. It means proper source control for your notebooks.

321
00:16:46.039 --> 00:16:49.519
<v Speaker 2>Maybe in GitHub or Azure Repos, versioning everything, and

322
00:16:49.559 --> 00:16:53.000
<v Speaker 2>then automating the deployment to different environments, dev, test, UAT,

323
00:16:53.240 --> 00:16:56.759
<v Speaker 2>prod, through release pipelines. It reduces manual effort, reduces errors,

324
00:16:57.000 --> 00:17:00.240
<v Speaker 2>ensures you have consistent, reliable deployments every single time. It's

325
00:17:00.240 --> 00:17:01.960
<v Speaker 2>professionalizing your Databricks development.

326
00:17:01.679 --> 00:17:07.200
<v Speaker 1>Absolutely. And then security: paramount. The book details understanding

327
00:17:07.240 --> 00:17:10.599
<v Speaker 1>and setting up RBAC, role-based access control, and also

328
00:17:10.920 --> 00:17:15.440
<v Speaker 1>ACLs, access control lists, within Azure, specifically for your ADLS

329
00:17:15.519 --> 00:17:19.319
<v Speaker 1>Gen2 storage. RBAC lets you grant broader permissions, like maybe

330
00:17:19.519 --> 00:17:22.359
<v Speaker 1>Storage Blob Data Reader for a whole container or storage

331
00:17:22.359 --> 00:17:23.119
<v Speaker 1>account.

332
00:17:23.200 --> 00:17:25.880
<v Speaker 2>Right. RBAC is good for those broader strokes, but ACLs give

333
00:17:25.920 --> 00:17:29.000
<v Speaker 2>you that really fine-grained control. You can set read,

334
00:17:29.400 --> 00:17:32.960
<v Speaker 2>write, execute permissions on individual files and directories within the lake.

335
00:17:33.039 --> 00:17:35.000
<v Speaker 2>This is essential if you have multiple teams sharing the

336
00:17:35.079 --> 00:17:37.480
<v Speaker 2>lake or really sensitive data where you need to lock

337
00:17:37.519 --> 00:17:40.559
<v Speaker 2>down access very tightly. You can grant access to specific

338
00:17:40.599 --> 00:17:43.920
<v Speaker 2>AD users or groups on specific folders, very granular.

339
00:17:44.119 --> 00:17:47.480
<v Speaker 1>Another big security measure covered: deploying Databricks itself into

340
00:17:47.480 --> 00:17:50.680
<v Speaker 1>your own Azure virtual network, a VNet. It explains provisioning

341
00:17:50.759 --> 00:17:54.480
<v Speaker 1>Databricks workspaces within private and public subnets you control.

342
00:17:54.599 --> 00:17:57.160
<v Speaker 1>This isolates your Databricks environment and lets you securely

343
00:17:57.200 --> 00:18:01.039
<v Speaker 1>access things like ADLS Gen2 using private endpoints, keeping traffic

344
00:18:01.079 --> 00:18:02.440
<v Speaker 1>off the public Internet.

345
00:18:02.240 --> 00:18:06.200
<v Speaker 2>And managing secrets, always a headache. The integration with Azure

346
00:18:06.279 --> 00:18:09.559
<v Speaker 2>Key Vault is highlighted. Key Vault becomes your central, super secure

347
00:18:09.599 --> 00:18:12.720
<v Speaker 2>place to store things like storage account keys, database passwords,

348
00:18:12.839 --> 00:18:16.319
<v Speaker 2>API keys. Your notebooks then fetch these secrets from Key Vault

349
00:18:16.319 --> 00:18:18.720
<v Speaker 2>at runtime, rather than having them hard-coded in the

350
00:18:18.759 --> 00:18:23.200
<v Speaker 2>notebook itself. Much, much more secure. Similarly, Azure App Configuration

351
00:18:23.319 --> 00:18:27.440
<v Speaker 2>is mentioned for managing application settings centrally, keeping configurations separate

352
00:18:27.440 --> 00:18:30.559
<v Speaker 2>from code. It can even reference secrets stored in Key Vault.

353
00:18:30.799 --> 00:18:35.160
<v Speaker 1>And what about monitoring and troubleshooting? The cookbook covers setting up

354
00:18:35.200 --> 00:18:38.880
<v Speaker 1>a Log Analytics workspace and Azure Monitor and integrating

355
00:18:38.880 --> 00:18:42.599
<v Speaker 1>Databricks to send its logs there: Spark logs, cluster logs, audit logs.

356
00:18:42.839 --> 00:18:45.279
<v Speaker 1>Then you can use KQL, the Kusto Query Language, to

357
00:18:45.359 --> 00:18:48.960
<v Speaker 1>query all that telemetry data, find errors, track performance. You

358
00:18:48.960 --> 00:18:51.559
<v Speaker 1>can even build dashboards in Azure Monitor to get a

359
00:18:51.640 --> 00:18:54.759
<v Speaker 1>high level view of the health across all your Azure services,

360
00:18:54.799 --> 00:18:58.440
<v Speaker 1>including Databricks. And lastly, within Databricks itself, there's

361
00:18:58.480 --> 00:19:02.039
<v Speaker 1>cluster access control. Admins can define who is allowed to

362
00:19:02.039 --> 00:19:06.359
<v Speaker 1>create clusters and manage them. Plus, cluster visibility control, especially in

363
00:19:06.400 --> 00:19:10.240
<v Speaker 1>Premium workspaces, restricts who can even see certain clusters, adds

364
00:19:10.240 --> 00:19:13.279
<v Speaker 1>another layer of security and governance. Wow.

365
00:19:13.319 --> 00:19:16.160
<v Speaker 4>Okay, that was a lot to unpack from just these excerpts,

366
00:19:16.240 --> 00:19:18.519
<v Speaker 4>wasn't it. But hopefully you listening now have a really

367
00:19:18.839 --> 00:19:21.680
<v Speaker 4>solid feel for the immense capabilities packed into Azure Databricks,

368
00:19:21.799 --> 00:19:24.839
<v Speaker 4>especially for accelerating and scaling real-time analytics. From just

369
00:19:24.839 --> 00:19:27.400
<v Speaker 4>setting up the core services, handling all sorts of data formats,

370
00:19:27.400 --> 00:19:30.559
<v Speaker 4>optimizing Spark, dealing with streaming data, and then leveraging the

371
00:19:30.799 --> 00:19:32.640
<v Speaker 4>frankly amazing power of Delta Lake.

372
00:19:32.720 --> 00:19:36.480
<v Speaker 1>This cookbook really does lay out a comprehensive roadmap for

373
00:19:36.519 --> 00:19:38.680
<v Speaker 1>anyone working with data on Azure today.

374
00:19:39.319 --> 00:19:42.000
<v Speaker 5>Absolutely, and when you connect all those dots like we've

375
00:19:42.039 --> 00:19:45.000
<v Speaker 5>tried to do, it's just clear that Databricks gives

376
00:19:45.039 --> 00:19:48.440
<v Speaker 5>you this complete toolkit. You can build genuinely robust modern

377
00:19:48.480 --> 00:19:51.960
<v Speaker 5>data warehouses, near-real-time analytical solutions. You've got the

378
00:19:52.039 --> 00:19:54.960
<v Speaker 5>visualization built in or connected via Power BI. You've got

379
00:19:55.000 --> 00:19:58.519
<v Speaker 5>the automation through ADF, and those critical security and DevOps

380
00:19:58.519 --> 00:20:02.079
<v Speaker 5>integrations are covered. It really empowers you to build enterprise

381
00:20:02.079 --> 00:20:04.319
<v Speaker 5>grade data platforms that can handle pretty much anything you

382
00:20:04.359 --> 00:20:04.920
<v Speaker 5>throw at them.

383
00:20:05.039 --> 00:20:08.119
<v Speaker 1>So what does this all mean for you, practically? Well,

384
00:20:08.119 --> 00:20:10.839
<v Speaker 1>with these kinds of tools at your fingertips, managing complex,

385
00:20:10.960 --> 00:20:14.400
<v Speaker 1>large-scale data systems on Azure, it's not just possible,

386
00:20:14.400 --> 00:20:17.119
<v Speaker 1>it's highly optimized. It lets you move beyond just old

387
00:20:17.160 --> 00:20:20.640
<v Speaker 1>school batch processing and really embrace real time insights, get

388
00:20:20.680 --> 00:20:24.160
<v Speaker 1>answers faster, all while making sure your data is reliable, consistent,

389
00:20:24.240 --> 00:20:26.279
<v Speaker 1>and secure thanks to things like Delta Lake and the

390
00:20:26.319 --> 00:20:30.200
<v Speaker 1>security features. So as you think about the sheer volume

391
00:20:30.240 --> 00:20:33.279
<v Speaker 1>and speed of data being generated today, here's maybe a

392
00:20:33.279 --> 00:20:36.039
<v Speaker 1>final thought for you to mull over. If Delta Lake

393
00:20:36.079 --> 00:20:39.839
<v Speaker 1>can bring those database-like guarantees, ACID transactions, schema enforcement,

394
00:20:39.880 --> 00:20:42.480
<v Speaker 1>to the inherent flexibility and scale of a data lake,

395
00:20:42.720 --> 00:20:45.400
<v Speaker 1>does this fundamentally change how we should think about designing

396
00:20:45.400 --> 00:20:48.799
<v Speaker 1>all our future data architectures? Does it push us firmly

397
00:20:48.839 --> 00:20:51.240
<v Speaker 1>into that lakehouse paradigm as the default for almost every

398
00:20:51.279 --> 00:20:53.759
<v Speaker 1>kind of data? And what new possibilities does that unlock

399
00:20:53.799 --> 00:20:55.559
<v Speaker 1>for your next big data challenge?
