WEBVTT

1
00:00:00.080 --> 00:00:03.439
<v Speaker 1>Okay, let's unpack this. You've given us, well, quite a

2
00:00:03.439 --> 00:00:07.360
<v Speaker 1>stack of sources here, all focused on data virtualization, specifically

3
00:00:07.599 --> 00:00:09.640
<v Speaker 1>digging into Microsoft's PolyBase tool.

4
00:00:09.839 --> 00:00:13.080
<v Speaker 2>That's right. And our mission today really is to cut

5
00:00:13.080 --> 00:00:15.560
<v Speaker 2>through all that complexity for you. We want to show

6
00:00:15.560 --> 00:00:19.160
<v Speaker 2>you how this kind of technology tackles what's probably the

7
00:00:19.199 --> 00:00:24.320
<v Speaker 2>biggest challenge for modern businesses: analyzing these huge, scattered

8
00:00:24.399 --> 00:00:24.960
<v Speaker 2>data sets.

9
00:00:25.039 --> 00:00:28.920
<v Speaker 1>We're talking big data, IoT streams, data mining stuff. Yeah,

10
00:00:29.079 --> 00:00:30.039
<v Speaker 1>massive amounts.

11
00:00:29.920 --> 00:00:33.079
<v Speaker 2>Exactly, petabytes of it, and doing it quickly, affordably, and

12
00:00:33.119 --> 00:00:37.320
<v Speaker 2>crucially without making your team learn every single complex system

13
00:00:37.399 --> 00:00:38.759
<v Speaker 2>your company happens to use.

14
00:00:38.840 --> 00:00:41.039
<v Speaker 1>Yeah, it starts with just grappling with the sheer size,

15
00:00:41.039 --> 00:00:44.039
<v Speaker 1>doesn't it? That gravity of big data. You imagine your

16
00:00:44.119 --> 00:00:47.119
<v Speaker 1>data is like a small notebook, manageable, right? But then

17
00:00:47.159 --> 00:00:49.399
<v Speaker 1>it explodes. Suddenly it's not a notebook. It's like a

18
00:00:49.399 --> 00:00:52.719
<v Speaker 1>million books scattered everywhere. When you hit petabyte scale, your

19
00:00:52.719 --> 00:00:56.240
<v Speaker 1>normal ways of working just they can't keep up. They

20
00:00:56.240 --> 00:00:56.880
<v Speaker 1>get swamped.

21
00:00:57.119 --> 00:01:01.439
<v Speaker 2>Absolutely, the systems slow down, costs spiral, it's a mess.

22
00:01:01.880 --> 00:01:04.319
<v Speaker 2>The old problem was if you wanted to analyze that

23
00:01:04.439 --> 00:01:08.079
<v Speaker 2>giant library, you literally had to move every single book

24
00:01:08.239 --> 00:01:11.280
<v Speaker 2>onto one enormous table, your data warehouse, before you could

25
00:01:11.319 --> 00:01:12.400
<v Speaker 2>even start reading.

26
00:01:12.400 --> 00:01:14.840
<v Speaker 1>A huge, costly effort upfront. Totally.

27
00:01:14.920 --> 00:01:17.120
<v Speaker 2>So the whole point of virtualization is to find a

28
00:01:17.120 --> 00:01:21.400
<v Speaker 2>way to analyze these massive, sprawling data sets fast using

29
00:01:21.799 --> 00:01:26.200
<v Speaker 2>sensible resources, without that gigantic data moving phase first.

30
00:01:26.439 --> 00:01:28.560
<v Speaker 1>Okay, so that gets us right to the heart of

31
00:01:28.680 --> 00:01:32.359
<v Speaker 1>data virtualization. It's about solving that movement problem. But maybe

32
00:01:32.359 --> 00:01:34.280
<v Speaker 1>we should quickly touch on why the data is so

33
00:01:34.359 --> 00:01:36.920
<v Speaker 1>scattered to begin with. Yeah, why not just one big

34
00:01:36.959 --> 00:01:38.000
<v Speaker 1>system for everything?

35
00:01:38.359 --> 00:01:40.799
<v Speaker 2>Well, it really boils down to different tools for different jobs.

36
00:01:40.799 --> 00:01:45.959
<v Speaker 2>It's this conflict between, say, speed and data integrity. Relational databases,

37
00:01:46.000 --> 00:01:49.560
<v Speaker 2>things like SQL Server, they're fantastic for transactional stuff,

38
00:01:49.560 --> 00:01:50.879
<v Speaker 2>day-to-day business ops.

39
00:01:50.640 --> 00:01:54.239
<v Speaker 1>Updates, deletes, making sure the customer record is accurate.

40
00:01:54.280 --> 00:01:57.920
<v Speaker 2>Precisely. They guarantee that integrity, but that comes with overhead, especially

41
00:01:57.959 --> 00:02:01.799
<v Speaker 2>when you're just dumping massive amounts of sequential data like logs.

42
00:02:02.480 --> 00:02:06.120
<v Speaker 1>Right, for just raw logging or archiving old stuff, you'd

43
00:02:06.159 --> 00:02:09.759
<v Speaker 1>look at file systems, maybe HDFS, or cloud storage like

44
00:02:09.840 --> 00:02:10.560
<v Speaker 1>Azure blobs.

45
00:02:10.680 --> 00:02:12.840
<v Speaker 2>Yeah, they give you super fast reads and writes because

46
00:02:12.840 --> 00:02:16.560
<v Speaker 2>they kind of sacrifice that strict integrity and transaction control.

47
00:02:16.879 --> 00:02:19.759
<v Speaker 1>Great for archiving, yeah, terrible for updates, right?

48
00:02:19.800 --> 00:02:22.479
<v Speaker 2>I think one source mentioned updating a single customer record

49
00:02:22.520 --> 00:02:26.879
<v Speaker 2>in a file system could mean touching hundreds of different pieces. Slow,

50
00:02:27.759 --> 00:02:29.240
<v Speaker 2>costly. Exactly.

51
00:02:29.759 --> 00:02:33.520
<v Speaker 1>So now you've got this split. Your really valuable transactional

52
00:02:33.560 --> 00:02:37.199
<v Speaker 1>data lives in the relational system and this huge bulk

53
00:02:37.400 --> 00:02:41.680
<v Speaker 1>of sequential, maybe less frequently updated data is out in

54
00:02:41.719 --> 00:02:43.759
<v Speaker 1>the file system or the cloud.

55
00:02:43.719 --> 00:02:46.159
<v Speaker 2>And you need to join them together for analysis. That's where the

56
00:02:46.159 --> 00:02:46.800
<v Speaker 2>pain starts.

57
00:02:46.960 --> 00:02:50.240
<v Speaker 1>That's where the movement problem really bites. Think about that analogy

58
00:02:50.280 --> 00:02:52.319
<v Speaker 1>your sources used, computer A and computer B.

59
00:02:52.599 --> 00:02:56.560
<v Speaker 2>Ah, yes. Computer A has the million entries, the big archive.

60
00:02:56.520 --> 00:02:59.479
<v Speaker 1>And computer B has just one thousand entries, maybe the current

61
00:02:59.479 --> 00:03:03.840
<v Speaker 1>customers from the relational database. Okay, so historically the default

62
00:03:03.879 --> 00:03:06.360
<v Speaker 1>way to join these, you'd try to move all one

63
00:03:06.400 --> 00:03:07.919
<v Speaker 1>million entries from A over to B.

64
00:03:08.319 --> 00:03:11.680
<v Speaker 2>Yeah, instantly your network gets hammered. You're trying to ship

65
00:03:11.919 --> 00:03:14.280
<v Speaker 2>potentially petabytes across the wire.

66
00:03:14.120 --> 00:03:16.439
<v Speaker 1>And then poor computer B is struggling to even store

67
00:03:16.479 --> 00:03:17.960
<v Speaker 1>it all, let alone process it.

68
00:03:18.159 --> 00:03:23.599
<v Speaker 2>Exactly. Its CPU, memory, disk, everything is strained trying to

69
00:03:23.639 --> 00:03:26.000
<v Speaker 2>sift through all this data you mostly don't even need,

70
00:03:26.080 --> 00:03:31.080
<v Speaker 2>just to find maybe ten relevant records. It's just incredibly inefficient.

71
00:03:30.680 --> 00:03:32.479
<v Speaker 1>Slow, expensive, duplicates data.

72
00:03:32.599 --> 00:03:36.759
<v Speaker 2>Yeah, great. So data virtualization, and specifically the thinking behind

73
00:03:36.759 --> 00:03:40.319
<v Speaker 2>PolyBase, it just flips that whole idea on its head. So

74
00:03:40.719 --> 00:03:43.240
<v Speaker 2>the smart way, the efficient way, is to move the

75
00:03:43.280 --> 00:03:46.879
<v Speaker 2>small data, those thousand entries from B over to the

76
00:03:46.960 --> 00:03:49.360
<v Speaker 2>large environment on A.

77
00:03:48.759 --> 00:03:51.319
<v Speaker 1>Ah okay, send the query to where the data lives.

78
00:03:51.439 --> 00:03:54.439
<v Speaker 2>Precisely, you push the query logic, the filtering, the joining

79
00:03:54.680 --> 00:03:56.439
<v Speaker 2>down to the system that has the bulk of the

80
00:03:56.479 --> 00:03:59.159
<v Speaker 2>data and the resources to handle it, do the work there,

81
00:03:59.400 --> 00:04:02.919
<v Speaker 2>and then you only bring back the final small relevant

82
00:04:02.919 --> 00:04:03.719
<v Speaker 2>result set to B.

83
00:04:03.919 --> 00:04:07.240
<v Speaker 1>Got it. So no network saturation, no data duplication on

84
00:04:07.360 --> 00:04:09.360
<v Speaker 1>B, and you let the heavy duty system do the

85
00:04:09.360 --> 00:04:09.879
<v Speaker 1>heavy lifting.

86
00:04:09.960 --> 00:04:12.039
<v Speaker 2>You got it. It's a fundamental shift in where the

87
00:04:12.039 --> 00:04:13.120
<v Speaker 2>computation happens.

88
00:04:13.439 --> 00:04:19.000
<v Speaker 1>That efficiency gain is huge, and that leads us neatly

89
00:04:19.079 --> 00:04:21.759
<v Speaker 1>into PolyBase itself, because this is where that technical integration

90
00:04:21.800 --> 00:04:25.800
<v Speaker 1>gets really clever. What's the core promise of PolyBase for

91
00:04:25.879 --> 00:04:28.040
<v Speaker 1>someone using, say, SQL Server?

92
00:04:28.079 --> 00:04:31.720
<v Speaker 2>The promise is really about seamless power through familiarity. That's

93
00:04:31.759 --> 00:04:35.040
<v Speaker 2>the key. PolyBase lets you query pretty much any external

94
00:04:35.120 --> 00:04:39.000
<v Speaker 2>data source, HDFS, Azure blobs, even other databases, using the

95
00:04:39.040 --> 00:04:42.040
<v Speaker 2>tool you already know, SQL Server, and the language you

96
00:04:42.040 --> 00:04:43.199
<v Speaker 2>already know, T-SQL.

97
00:04:43.319 --> 00:04:45.920
<v Speaker 1>Okay, so I write my standard T-SQL query, yep.

98
00:04:46.120 --> 00:04:47.240
<v Speaker 2>But here's the clever bit.

99
00:04:47.360 --> 00:04:47.600
<v Speaker 1>Yep.

100
00:04:47.839 --> 00:04:51.120
<v Speaker 2>While you're writing familiar T-SQL, PolyBase is working behind

101
00:04:51.120 --> 00:04:54.680
<v Speaker 2>the scenes translating that query and leveraging the native capabilities

102
00:04:54.720 --> 00:04:58.480
<v Speaker 2>of that external system, especially things like its parallel processing

103
00:04:58.519 --> 00:05:00.680
<v Speaker 2>power or its optimized storage access.

104
00:05:00.879 --> 00:05:03.720
<v Speaker 1>So it's like a universal translator for data queries. You

105
00:05:03.759 --> 00:05:06.360
<v Speaker 1>speak T-SQL and PolyBase figures out how to ask

106
00:05:06.399 --> 00:05:08.079
<v Speaker 1>the question in Hadoop-speak or whatever.

107
00:05:08.199 --> 00:05:10.079
<v Speaker 2>That's a great way to put it. Yeah, it handles

108
00:05:10.079 --> 00:05:11.319
<v Speaker 2>that translation and execution.

109
00:05:11.600 --> 00:05:13.639
<v Speaker 1>Now, this wasn't an overnight thing. You mentioned it has

110
00:05:13.759 --> 00:05:17.319
<v Speaker 1>roots in Microsoft's earlier efforts, particularly with Parallel Data Warehouse.

111
00:05:17.680 --> 00:05:20.920
<v Speaker 2>Absolutely essential context. PolyBase was officially announced, I think it

112
00:05:20.959 --> 00:05:24.279
<v Speaker 2>was November twenty twelve, at the SQL PASS Summit. But

113
00:05:24.480 --> 00:05:28.279
<v Speaker 2>the underlying tech, the architecture, it really relied on the

114
00:05:28.319 --> 00:05:31.600
<v Speaker 2>groundwork laid by Parallel Data Warehouse or PDW, which came

115
00:05:31.600 --> 00:05:32.720
<v Speaker 2>out back in twenty ten.

116
00:05:32.879 --> 00:05:35.879
<v Speaker 1>And the sources really emphasize how quickly that PDW tech

117
00:05:35.920 --> 00:05:40.959
<v Speaker 1>evolved. PDW version two in twenty thirteen, the performance jump

118
00:05:41.079 --> 00:05:42.639
<v Speaker 1>was apparently staggering.

119
00:05:42.800 --> 00:05:46.079
<v Speaker 2>It was revolutionary. We're talking like one hundred times faster

120
00:05:46.279 --> 00:05:49.439
<v Speaker 2>query performance compared to V1. That's not incremental, that's

121
00:05:49.439 --> 00:05:50.839
<v Speaker 2>a different class of machine.

122
00:05:50.920 --> 00:05:51.480
<v Speaker 1>Wow.

123
00:05:51.560 --> 00:05:54.680
<v Speaker 2>And at the same time they slashed the price per petabyte.

124
00:05:54.759 --> 00:05:59.000
<v Speaker 2>It proved that this massively parallel processing or MPP architecture,

125
00:05:59.000 --> 00:06:01.439
<v Speaker 2>which is the foundation for PolyBase too, was really the

126
00:06:01.480 --> 00:06:05.600
<v Speaker 2>way forward for handling big data within a relational database context.

127
00:06:05.160 --> 00:06:08.199
<v Speaker 1>And that investment paid off. By SQL Server twenty sixteen,

128
00:06:08.240 --> 00:06:11.560
<v Speaker 1>PolyBase wasn't just a PDW thing. It was generally available

129
00:06:11.600 --> 00:06:13.120
<v Speaker 1>in standard SQL Server editions.

130
00:06:13.199 --> 00:06:16.040
<v Speaker 2>Yeah, that move really cemented it. Microsoft was clearly using

131
00:06:16.040 --> 00:06:18.920
<v Speaker 2>the same core codebase, bringing that big data query power

132
00:06:18.920 --> 00:06:20.879
<v Speaker 2>to its mainstream database product.

133
00:06:20.680 --> 00:06:23.639
<v Speaker 1>Which brings us nicely to the technical secret sauce, this

134
00:06:23.759 --> 00:06:25.319
<v Speaker 1>idea of pushdown computation.

135
00:06:25.600 --> 00:06:28.839
<v Speaker 2>Yes, this is critical to understanding why PolyBase is so

136
00:06:28.920 --> 00:06:30.519
<v Speaker 2>much better than the older ways.

137
00:06:30.800 --> 00:06:34.360
<v Speaker 1>We touched on the older ways failing, specifically SQL Server's

138
00:06:34.439 --> 00:06:37.639
<v Speaker 1>linked servers. Can you elaborate on how they fell short

139
00:06:37.639 --> 00:06:38.600
<v Speaker 1>with big data?

140
00:06:38.680 --> 00:06:43.240
<v Speaker 2>Sure, they were, let's just say, not very optimized for

141
00:06:43.279 --> 00:06:46.680
<v Speaker 2>remote filtering. If you wrote a standard query like SELECT

142
00:06:46.680 --> 00:06:48.920
<v Speaker 2>FROM my linked server's table WHERE

143
00:06:49.000 --> 00:06:52.480
<v Speaker 2>filter column equals ten, the linked server wouldn't push that WHERE

144
00:06:52.519 --> 00:06:54.759
<v Speaker 2>filter column equals ten part down to the remote system. It

145
00:06:54.759 --> 00:06:57.399
<v Speaker 2>would actually read the entire table from the remote source,

146
00:06:57.439 --> 00:06:57.879
<v Speaker 2>the whole thing.

147
00:06:57.720 --> 00:07:00.040
<v Speaker 1>Even if it was billions of rows?

148
00:06:59.839 --> 00:07:02.560
<v Speaker 2>The whole thing. Pulled it all across the network, then applied the

149
00:07:02.560 --> 00:07:04.879
<v Speaker 2>filter locally on your SQL Server instance.

150
00:07:04.959 --> 00:07:07.639
<v Speaker 1>Oh my goodness. So if I just wanted ten records,

151
00:07:07.680 --> 00:07:11.480
<v Speaker 1>I might still be pulling gigabytes or terabytes across the network first?

152
00:07:11.160 --> 00:07:16.160
<v Speaker 2>Precisely. A guaranteed performance killer for large data sets.

153
00:07:16.720 --> 00:07:19.879
<v Speaker 2>To actually force the filter to run remotely, you had

154
00:07:19.920 --> 00:07:23.120
<v Speaker 2>to jump through hoops using things like OPENQUERY, embedding

155
00:07:23.160 --> 00:07:26.600
<v Speaker 2>your remote query as a string. It was awkward, error prone,

156
00:07:26.639 --> 00:07:29.040
<v Speaker 2>and didn't scale well for complex logic.

157
00:07:29.120 --> 00:07:32.720
<v Speaker 1>Right. That sounds painful. So the magic, as the sources call it,

158
00:07:33.120 --> 00:07:37.560
<v Speaker 1>of predicate pushdown in PolyBase, it fixes that?

159
00:07:37.879 --> 00:07:41.920
<v Speaker 2>It fixes exactly that. PolyBase enables intelligent pushdown. The query

160
00:07:41.920 --> 00:07:44.360
<v Speaker 2>optimizer looks at your T-SQL query and figures out

161
00:07:44.399 --> 00:07:47.120
<v Speaker 2>which parts, the filtering predicates, maybe some joins, maybe

162
00:07:47.120 --> 00:07:50.759
<v Speaker 2>aggregations, can actually be executed on the external data source itself.

163
00:07:50.519 --> 00:07:52.399
<v Speaker 1>So it does the work remotely.

164
00:07:52.360 --> 00:07:55.560
<v Speaker 2>And then it only brings back the much smaller, prefiltered, maybe

165
00:07:55.600 --> 00:07:57.879
<v Speaker 2>even pre-aggregated result set. You only get the ten

166
00:07:57.920 --> 00:08:00.920
<v Speaker 2>customers you asked for. The billion stay put.

167
00:08:01.120 --> 00:08:03.680
<v Speaker 1>That must completely change the game for data warehousing.

168
00:08:03.759 --> 00:08:08.839
<v Speaker 2>Oh, massively. Think about traditional ETL: extract, transform, load. Huge,

169
00:08:08.920 --> 00:08:12.720
<v Speaker 2>complex processes, often running for hours overnight just to move

170
00:08:12.800 --> 00:08:15.040
<v Speaker 2>and reshape data before you can even query it.

171
00:08:15.079 --> 00:08:16.800
<v Speaker 1>Right, the daily or nightly load window.

172
00:08:17.120 --> 00:08:20.279
<v Speaker 2>PolyBase lets you potentially bypass a lot of that heavy lifting.

173
00:08:20.639 --> 00:08:23.439
<v Speaker 2>You don't necessarily need to physically load all the external

174
00:08:23.519 --> 00:08:26.600
<v Speaker 2>data into the warehouse first. You can query it in place.

175
00:08:26.959 --> 00:08:29.879
<v Speaker 2>You focus on the analysis, the queries, the calculations, not

176
00:08:29.920 --> 00:08:33.759
<v Speaker 2>the complex data plumbing, all through one connection point in

177
00:08:33.879 --> 00:08:37.720
<v Speaker 2>SQL Server. And I guess this pushdown benefit is amplified

178
00:08:37.720 --> 00:08:42.200
<v Speaker 2>by parallelism? Definitely. PolyBase, especially when you set up scale-out

179
00:08:42.240 --> 00:08:45.639
<v Speaker 2>groups with multiple SQL Server nodes, is designed for

180
00:08:45.759 --> 00:08:48.879
<v Speaker 2>parallel data transfer. It can read data from multiple nodes

181
00:08:48.879 --> 00:08:53.600
<v Speaker 2>in a Hadoop cluster or multiple partitions in cloud storage simultaneously.

182
00:08:52.960 --> 00:08:56.000
<v Speaker 1>So it's pulling data in parallel, processing parts of the

183
00:08:56.080 --> 00:08:58.559
<v Speaker 1>query in parallel on the remote system?

184
00:08:58.960 --> 00:09:01.960
<v Speaker 2>Exactly. That parallel operation capability is a hallmark of those

185
00:09:02.039 --> 00:09:05.639
<v Speaker 2>high-end MPP systems, and PolyBase brings that capability to SQL

186
00:09:05.639 --> 00:09:07.360
<v Speaker 2>Server interacting with external data.

187
00:09:07.399 --> 00:09:10.120
<v Speaker 1>Fantastic. Okay, let's broaden the view a bit. Section four:

188
00:09:10.679 --> 00:09:15.879
<v Speaker 1>PolyBase within the wider modern data ecosystem. Interoperability seems key here.

189
00:09:15.840 --> 00:09:18.200
<v Speaker 2>It really is its main strength. It has those native, highly

190
00:09:18.200 --> 00:09:21.919
<v Speaker 2>optimized connectors for the big ones, Hadoop, HDFS, and Azure

191
00:09:21.919 --> 00:09:25.279
<v Speaker 2>Blob Storage, using the WASB protocol.

192
00:09:25.519 --> 00:09:27.799
<v Speaker 1>WASB, that's Windows Azure Storage Blob?

193
00:09:27.720 --> 00:09:29.879
<v Speaker 2>Yep, the standard way to talk to Azure blobs for a while.

194
00:09:30.039 --> 00:09:32.960
<v Speaker 1>But what about everything else? The world isn't just Microsoft

195
00:09:32.960 --> 00:09:36.639
<v Speaker 1>and Hadoop. What if you need data from, say, Cassandra

196
00:09:37.159 --> 00:09:41.159
<v Speaker 1>or MongoDB, or even other relational systems like MySQL

197
00:09:41.240 --> 00:09:42.240
<v Speaker 1>or PostgreSQL?

198
00:09:42.399 --> 00:09:46.399
<v Speaker 2>Good question. For many of those other systems, PolyBase relies

199
00:09:46.480 --> 00:09:50.799
<v Speaker 2>on ODBC drivers, Open Database Connectivity. It's like a standard adapter.

200
00:09:51.039 --> 00:09:53.360
<v Speaker 1>Okay, ODBC. So you can still use T-SQL?

201
00:09:53.240 --> 00:09:55.639
<v Speaker 2>Yes, and that's the big win. You still

202
00:09:55.639 --> 00:09:58.759
<v Speaker 2>get to query those diverse sources using familiar T-SQL

203
00:09:58.799 --> 00:10:02.080
<v Speaker 2>from within SQL Server. Huge for developer productivity and ease

204
00:10:02.120 --> 00:10:02.600
<v Speaker 2>of adoption.

205
00:10:02.720 --> 00:10:04.559
<v Speaker 1>But there's always a but, isn't there? What's the

206
00:10:04.559 --> 00:10:05.440
<v Speaker 1>trade-off with ODBC?

207
00:10:05.639 --> 00:10:09.080
<v Speaker 2>Well, using a generic bridge like ODBC can sometimes introduce

208
00:10:09.080 --> 00:10:12.000
<v Speaker 2>a bit of overhead, and more importantly, it can sometimes

209
00:10:12.120 --> 00:10:15.200
<v Speaker 2>limit those powerful pushdown capabilities we just talked about.

210
00:10:15.320 --> 00:10:18.879
<v Speaker 1>Ah, so the intelligence might not always translate perfectly through

211
00:10:18.919 --> 00:10:19.919
<v Speaker 1>the ODBC layer.

212
00:10:20.240 --> 00:10:23.519
<v Speaker 2>Exactly. One of your sources had a perfect, if slightly

213
00:10:23.559 --> 00:10:28.240
<v Speaker 2>worrying example with MySQL. A simple COUNT aggregation failed to

214
00:10:28.240 --> 00:10:30.960
<v Speaker 2>push down because of some subtle difference in how

215
00:10:30.960 --> 00:10:32.840
<v Speaker 2>whitespace was handled by the ODBC driver.

216
00:10:33.000 --> 00:10:35.600
<v Speaker 1>Seriously? A COUNT query failed to push down?

217
00:10:35.799 --> 00:10:39.879
<v Speaker 2>Yeah, and the workaround? They had to explicitly disable

218
00:10:39.960 --> 00:10:42.720
<v Speaker 2>pushdown for that query, meaning it fell back to pulling

219
00:10:42.759 --> 00:10:44.000
<v Speaker 2>more data than necessary.

220
00:10:44.440 --> 00:10:47.720
<v Speaker 1>So while ODBC gives you broad connectivity, you might occasionally

221
00:10:47.759 --> 00:10:50.440
<v Speaker 1>lose some of that peak performance or intelligent pushdown

222
00:10:50.440 --> 00:10:52.039
<v Speaker 1>you get with the native connectors.

223
00:10:52.399 --> 00:10:55.559
<v Speaker 2>That's the trade-off, essentially. It highlights why systems with

224
00:10:55.679 --> 00:10:59.279
<v Speaker 2>truly native, deeply integrated connections often perform best.

225
00:10:59.440 --> 00:11:03.480
<v Speaker 1>Speaking of best performers, the sources bring up Teradata quite

226
00:11:03.519 --> 00:11:05.600
<v Speaker 1>a bit as a kind of gold standard in this

227
00:11:05.759 --> 00:11:06.600
<v Speaker 1>MPP world.

228
00:11:06.759 --> 00:11:10.000
<v Speaker 2>Yeah, Teradata is often seen as the benchmark, especially

229
00:11:10.000 --> 00:11:13.879
<v Speaker 2>for petabyte scale warehousing. Their architecture goes way back, nineteen

230
00:11:13.919 --> 00:11:17.600
<v Speaker 2>seventy nine. They really pioneered many of these MPP concepts

231
00:11:17.679 --> 00:11:20.120
<v Speaker 2>like shared-nothing architecture.

232
00:11:19.919 --> 00:11:22.679
<v Speaker 1>So they've been doing native pushdown and parallel data movement for

233
00:11:22.759 --> 00:11:24.120
<v Speaker 1>decades?

234
00:11:24.320 --> 00:11:28.240
<v Speaker 2>Pretty much. Their maturity and optimization are their big strengths. PolyBase is, in

235
00:11:28.279 --> 00:11:32.799
<v Speaker 2>many ways, bringing those proven MPP concepts, refined over years

236
00:11:32.840 --> 00:11:36.200
<v Speaker 2>by systems like Teradata, into the more mainstream SQL

237
00:11:36.240 --> 00:11:37.759
<v Speaker 2>Server ecosystem.

238
00:11:37.279 --> 00:11:40.480
<v Speaker 1>Makes sense. And shifting to the cloud, PolyBase is vital there too, right?

239
00:11:40.480 --> 00:11:41.639
<v Speaker 1>Connecting SQL Server to cloud storage?

240
00:11:41.519 --> 00:11:46.080
<v Speaker 2>Absolutely essential. We mentioned Azure blobs via WASB.

241
00:11:46.480 --> 00:11:49.200
<v Speaker 2>It also supports reading from Azure Data Lake Store, both

242
00:11:49.240 --> 00:11:50.399
<v Speaker 2>Gen one and Gen two.

243
00:11:50.559 --> 00:11:52.960
<v Speaker 1>Does it use the newer native protocols for ADLS

244
00:11:53.000 --> 00:11:53.679
<v Speaker 1>Gen two, like ABFS?

245
00:11:53.639 --> 00:11:58.519
<v Speaker 2>Often it still relies on the WASB protocol compatibility

246
00:11:58.600 --> 00:12:01.360
<v Speaker 2>layer even for Gen two, depending on the specific SQL

247
00:12:01.399 --> 00:12:05.559
<v Speaker 2>Server or Synapse version. The native ABFS support is getting better,

248
00:12:05.559 --> 00:12:07.639
<v Speaker 2>but WASB is often the fallback.

249
00:12:07.799 --> 00:12:10.159
<v Speaker 1>And another key point mentioned was read versus write right now.

250
00:12:10.240 --> 00:12:12.720
<v Speaker 2>Yes, that's an important current limitation to be aware of,

251
00:12:12.879 --> 00:12:17.200
<v Speaker 2>especially in cloud scenarios like Azure Synapse Analytics. PolyBase is

252
00:12:17.200 --> 00:12:22.080
<v Speaker 2>primarily fantastic for reading data from these external sources, Hadoop, ADLS, blobs,

253
00:12:22.480 --> 00:12:24.960
<v Speaker 2>but writing data back out to them via PolyBase is

254
00:12:24.960 --> 00:12:28.559
<v Speaker 2>often not supported or more limited. It's mainly a consumption,

255
00:12:28.720 --> 00:12:32.759
<v Speaker 2>a virtualization tool, not necessarily a two-way synchronization engine yet.

256
00:12:32.960 --> 00:12:36.080
<v Speaker 1>Okay, good clarification. So let's tie this all together. Why

257
00:12:36.080 --> 00:12:38.840
<v Speaker 1>should you, our listener, really care about this? What are

258
00:12:38.879 --> 00:12:41.559
<v Speaker 1>the killer real world use cases?

259
00:12:41.759 --> 00:12:43.759
<v Speaker 2>Well, there are two immediate ones that jump out, offering

260
00:12:43.840 --> 00:12:47.879
<v Speaker 2>huge savings and capabilities. First, aging and archiving.

261
00:12:47.759 --> 00:12:50.600
<v Speaker 1>Moving old data out of expensive databases?

262
00:12:50.639 --> 00:12:54.519
<v Speaker 2>Exactly. Think about old log files, transaction history older than, say,

263
00:12:54.799 --> 00:12:57.919
<v Speaker 2>five years, data you need to keep for compliance but

264
00:12:57.960 --> 00:13:00.519
<v Speaker 2>don't query often. You can set up partitioning in

265
00:13:00.600 --> 00:13:03.879
<v Speaker 2>SQL Server to automatically move those old partitions to cheaper

266
00:13:03.919 --> 00:13:07.279
<v Speaker 2>storage, HDFS, Azure Data Lake. And the beauty is

267
00:13:07.399 --> 00:13:10.159
<v Speaker 2>PolyBase makes that archived data still look like it's part

268
00:13:10.200 --> 00:13:13.360
<v Speaker 2>of the original table. Your legacy applications can query it

269
00:13:13.480 --> 00:13:16.960
<v Speaker 2>using the same T-SQL, no code changes needed, instant cost

270
00:13:17.039 --> 00:13:18.600
<v Speaker 2>savings on primary storage.

271
00:13:18.759 --> 00:13:21.240
<v Speaker 1>That's incredibly practical. Okay, what's the second big one?

272
00:13:21.279 --> 00:13:24.200
<v Speaker 2>The second one is maybe more transformational: creating those three

273
00:13:24.240 --> 00:13:26.559
<v Speaker 2>hundred and sixty degree customer views, especially for things like

274
00:13:26.600 --> 00:13:27.799
<v Speaker 2>AI and machine learning.

275
00:13:27.600 --> 00:13:29.440
<v Speaker 1>Combining different data types.

276
00:13:29.679 --> 00:13:33.759
<v Speaker 2>Right. Imagine joining your core customer data from your relational

277
00:13:33.840 --> 00:13:39.240
<v Speaker 2>database, names, addresses, purchase history, with massive unstructured or

278
00:13:39.240 --> 00:13:43.039
<v Speaker 2>semi-structured data streams. Right, what, like web clickstream data, social

279
00:13:43.080 --> 00:13:48.120
<v Speaker 2>media interactions, maybe sensor data from devices, even anonymized location data,

280
00:13:48.440 --> 00:13:52.240
<v Speaker 2>stuff that lives outside your traditional database. PolyBase lets you

281
00:13:52.279 --> 00:13:55.120
<v Speaker 2>bring all that disparate data together virtually. You can then

282
00:13:55.200 --> 00:13:58.279
<v Speaker 2>run ML models across that unified view to do really

283
00:13:58.360 --> 00:14:04.320
<v Speaker 2>powerful things customers with incredible accuracy, predict churn, detect fraud,

284
00:14:04.639 --> 00:14:08.080
<v Speaker 2>personalize offers, things you just couldn't do easily when the

285
00:14:08.159 --> 00:14:09.039
<v Speaker 2>data was siloed.

286
00:14:09.240 --> 00:14:11.480
<v Speaker 1>That opens up a lot of possibilities. Yeah, so wrapping

287
00:14:11.519 --> 00:14:13.200
<v Speaker 1>things up, then, what's the big takeaway here?

288
00:14:13.320 --> 00:14:16.240
<v Speaker 2>The big takeaway is that data virtualization, with tools like

289
00:14:16.279 --> 00:14:19.879
<v Speaker 2>PolyBase leading the charge in the Microsoft world, fundamentally changes

290
00:14:19.919 --> 00:14:21.639
<v Speaker 2>the role of the relational database.

291
00:14:21.799 --> 00:14:23.960
<v Speaker 1>It's not just a container anymore.

292
00:14:24.039 --> 00:14:27.000
<v Speaker 2>Exactly. It becomes more of a central hub and analytical control plane.

293
00:14:27.039 --> 00:14:30.679
<v Speaker 2>By giving you familiar T-SQL access to these vast,

294
00:14:30.919 --> 00:14:34.759
<v Speaker 2>varied external data sets and using clever tech like predicate

295
00:14:34.840 --> 00:14:38.279
<v Speaker 2>pushdown to do it efficiently. It saves potentially huge amounts

296
00:14:38.320 --> 00:14:39.559
<v Speaker 2>of time and money.

297
00:14:39.440 --> 00:14:43.039
<v Speaker 1>Less complex ETL, less need for specialized skills for every

298
00:14:43.039 --> 00:14:43.919
<v Speaker 1>single data source.

299
00:14:44.000 --> 00:14:47.080
<v Speaker 2>Precisely, it makes leveraging diverse data much more accessible.

300
00:14:47.159 --> 00:14:50.919
<v Speaker 1>Okay, a powerful shift. So we've seen how PolyBase makes

301
00:14:50.960 --> 00:14:54.159
<v Speaker 1>reading and linking data from dozens of sources much easier.

302
00:14:54.440 --> 00:14:58.200
<v Speaker 1>It makes dealing with massive static files almost trivial compared

303
00:14:58.240 --> 00:14:58.840
<v Speaker 1>to the old ways.

304
00:14:58.919 --> 00:15:01.039
<v Speaker 2>Yeah, the read side is pretty well tackled.

305
00:15:01.000 --> 00:15:03.399
<v Speaker 1>But you alluded to the difficulty of updating data in those distributed

306
00:15:03.440 --> 00:15:05.960
<v Speaker 1>file systems earlier, which leaves us with a final thought

307
00:15:06.000 --> 00:15:08.799
<v Speaker 1>for you to chew on. If PolyBase has made reading and

308
00:15:08.840 --> 00:15:12.639
<v Speaker 1>analyzing virtualized data so seamless, how long will it be

309
00:15:12.759 --> 00:15:16.559
<v Speaker 1>until that other major big data headache, the complexity and

310
00:15:16.639 --> 00:15:19.600
<v Speaker 1>cost of ensuring real time consistency and updates across all

311
00:15:19.639 --> 00:15:23.720
<v Speaker 1>these different virtualized sources, is also virtualized away just as elegantly?

312
00:15:24.720 --> 00:15:27.240
<v Speaker 1>When can we update that archive record as easily as

313
00:15:27.279 --> 00:15:28.039
<v Speaker 1>we can query it?

314
00:15:28.279 --> 00:15:31.000
<v Speaker 2>That's the multi billion dollar question, isn't it. How do

315
00:15:31.039 --> 00:15:34.279
<v Speaker 2>you handle distributed transactions and consistency at scale in a

316
00:15:34.320 --> 00:15:36.600
<v Speaker 2>virtualized world? That's the next frontier.
