WEBVTT 1 00:00:00.080 --> 00:00:03.439 Okay, let's unpack this. You've given us, well, quite a 2 00:00:03.439 --> 00:00:07.360 stack of sources here, all focused on data virtualization, specifically 3 00:00:07.599 --> 00:00:09.640 digging into Microsoft's PolyBase tool. 4 00:00:09.839 --> 00:00:13.080 That's right. And our mission today really is to cut 5 00:00:13.080 --> 00:00:15.560 through all that complexity for you. We want to show 6 00:00:15.560 --> 00:00:19.160 you how this kind of technology tackles what's probably the 7 00:00:19.199 --> 00:00:24.320 biggest challenge for modern businesses, yeah, analyzing these huge, scattered 8 00:00:24.399 --> 00:00:24.960 data sets. 9 00:00:25.039 --> 00:00:28.920 We're talking big data, IoT streams, data mining stuff. Yeah, 10 00:00:29.079 --> 00:00:30.039 massive amounts. 11 00:00:29.920 --> 00:00:33.079 Exactly, petabytes of it, and doing it quickly, affordably, and 12 00:00:33.119 --> 00:00:37.320 crucially without making your team learn every single complex system 13 00:00:37.399 --> 00:00:38.759 your company happens to use. 14 00:00:38.840 --> 00:00:41.039 Yeah, it starts with just grappling with the sheer size, 15 00:00:41.039 --> 00:00:44.039 doesn't it? That gravity of big data. You imagine your 16 00:00:44.119 --> 00:00:47.119 data is like a small notebook, manageable, right? But then 17 00:00:47.159 --> 00:00:49.399 it explodes. Suddenly it's not a notebook. It's like a 18 00:00:49.399 --> 00:00:52.719 million books scattered everywhere. When you hit petabyte scale, your 19 00:00:52.719 --> 00:00:56.240 normal ways of working just can't keep up. They 20 00:00:56.240 --> 00:00:56.880 get swamped. 21 00:00:57.119 --> 00:01:01.439 Absolutely, the systems slow down, costs spiral, it's a mess. 
22 00:01:01.880 --> 00:01:04.319 The old problem was, if you wanted to analyze that 23 00:01:04.439 --> 00:01:08.079 giant library, you literally had to move every single book 24 00:01:08.239 --> 00:01:11.280 onto one enormous table, your data warehouse, before you could 25 00:01:11.319 --> 00:01:12.400 even start reading. 26 00:01:12.400 --> 00:01:14.840 A huge, costly effort up front. Totally. 27 00:01:14.920 --> 00:01:17.120 So the whole point of virtualization is to find a 28 00:01:17.120 --> 00:01:21.400 way to analyze these massive, sprawling data sets fast, using 29 00:01:21.799 --> 00:01:26.200 sensible resources, without that gigantic data-moving phase first. 30 00:01:26.439 --> 00:01:28.560 Okay, so that gets us right to the heart of 31 00:01:28.680 --> 00:01:32.359 data virtualization. It's about solving that movement problem. But maybe 32 00:01:32.359 --> 00:01:34.280 we should quickly touch on why the data is so 33 00:01:34.359 --> 00:01:36.920 scattered to begin with. Yeah, why not just one big 34 00:01:36.959 --> 00:01:38.000 system for everything? 35 00:01:38.359 --> 00:01:40.799 Well, it really boils down to different tools for different jobs. 36 00:01:40.799 --> 00:01:45.959 It's this conflict between, say, speed and data integrity. Relational databases, 37 00:01:46.000 --> 00:01:49.560 things like SQL Server, they're fantastic for transactional stuff, day 38 00:01:49.560 --> 00:01:50.879 to day business ops. 39 00:01:50.640 --> 00:01:54.239 Updates, deletes, making sure the customer record is accurate. Precisely. 40 00:01:54.280 --> 00:01:57.920 They guarantee that integrity, but that comes with overhead, especially 41 00:01:57.959 --> 00:02:01.799 when you're just dumping massive amounts of sequential data like logs. 42 00:02:02.480 --> 00:02:06.120 Right, for just raw logging or archiving old stuff, you'd 43 00:02:06.159 --> 00:02:09.759 look at file systems, maybe HDFS, or cloud storage like 44 00:02:09.840 --> 00:02:10.560 Azure blobs. 
45 00:02:10.680 --> 00:02:12.840 Yeah, they give you super fast reads and writes because 46 00:02:12.840 --> 00:02:16.560 they kind of sacrifice that strict integrity and transaction control. 47 00:02:16.879 --> 00:02:19.759 Great for archiving, yeah, terrible for updates. Right. 48 00:02:19.800 --> 00:02:22.479 I think one source mentioned updating a single customer record 49 00:02:22.520 --> 00:02:26.879 in a file system could mean touching hundreds of different pieces. Slow, 50 00:02:27.759 --> 00:02:29.240 costly. Exactly. 51 00:02:29.759 --> 00:02:33.520 So now you've got this split. Your really valuable transactional 52 00:02:33.560 --> 00:02:37.199 data lives in the relational system, and this huge bulk 53 00:02:37.400 --> 00:02:41.680 of sequential, maybe less frequently updated data is out in 54 00:02:41.719 --> 00:02:43.759 the file system or the cloud. And you 55 00:02:43.719 --> 00:02:46.159 need to join them together for analysis. That's where the 56 00:02:46.159 --> 00:02:46.800 pain starts. 57 00:02:46.960 --> 00:02:50.240 That's where the movement problem really bites. Think about that analogy 58 00:02:50.280 --> 00:02:52.319 your sources used: computer A and computer B. 59 00:02:52.599 --> 00:02:56.560 Ah, yes, computer A has the million entries, the big archive. And 60 00:02:56.520 --> 00:02:59.479 computer B has just one thousand entries, maybe the current 61 00:02:59.479 --> 00:03:03.840 customers from the relational database. Okay, so historically the default 62 00:03:03.879 --> 00:03:06.360 way to join these, you'd try to move all one 63 00:03:06.400 --> 00:03:07.919 million entries from A over to B. 64 00:03:08.319 --> 00:03:11.680 Yeah, instantly your network gets hammered. You're trying to ship 65 00:03:11.919 --> 00:03:14.280 potentially petabytes across the wire. 66 00:03:14.120 --> 00:03:16.439 And then poor computer B is struggling to even store 67 00:03:16.479 --> 00:03:17.960 it all, let alone process it. 
68 00:03:18.159 --> 00:03:23.599 Exactly. Its CPU, memory, disk, everything is strained trying to 69 00:03:23.639 --> 00:03:26.000 sift through all this data you mostly don't even need, 70 00:03:26.080 --> 00:03:31.080 just to find maybe ten relevant records. It's just incredibly inefficient. 71 00:03:30.680 --> 00:03:32.479 Slow, expensive, duplicates data. 72 00:03:32.599 --> 00:03:36.759 Yeah, great. So data virtualization, and specifically the thinking behind 73 00:03:36.759 --> 00:03:40.319 PolyBase, it just flips that whole idea on its head. Also, 74 00:03:40.719 --> 00:03:43.240 the smart way, the efficient way, is to move the 75 00:03:43.280 --> 00:03:46.879 small data, those thousand entries from B, over to the 76 00:03:46.960 --> 00:03:49.360 large environment on A. 77 00:03:48.759 --> 00:03:51.319 Ah, okay, send the query to where the data lives. 78 00:03:51.439 --> 00:03:54.439 Precisely. You push the query logic, the filtering, the joining, 79 00:03:54.680 --> 00:03:56.439 down to the system that has the bulk of the 80 00:03:56.479 --> 00:03:59.159 data and the resources to handle it, do the work there, 81 00:03:59.400 --> 00:04:02.919 and then you only bring back the final, small, relevant 82 00:04:02.919 --> 00:04:03.719 result set to B. 83 00:04:03.919 --> 00:04:07.240 Got it. So no network saturation, no data duplication on 84 00:04:07.360 --> 00:04:09.360 B, and you let the heavy duty system do the 85 00:04:09.360 --> 00:04:09.879 heavy lifting. 86 00:04:09.960 --> 00:04:12.039 You got it. It's a fundamental shift in where the 87 00:04:12.039 --> 00:04:13.120 computation happens. 88 00:04:13.439 --> 00:04:19.000 That efficiency gain is huge, and that leads us neatly 89 00:04:19.079 --> 00:04:21.759 into PolyBase itself, because this is where that technical integration 90 00:04:21.800 --> 00:04:25.800 gets really clever. What's the core promise of PolyBase for 91 00:04:25.879 --> 00:04:28.040 someone using, say, SQL Server? 
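NOTE Editor's illustration, not spoken in the episode. A minimal Python sketch of the computer A and computer B analogy discussed above, assuming (hypothetically) that rows are plain integer ids and that exactly ten archive rows belong to current customers. The function names and the simple row-counting model are assumptions made for illustration only; this is not how PolyBase itself is implemented.
```python
# Hypothetical model: count how many rows cross the network per strategy.
ARCHIVE_ROWS = 1_000_000  # "computer A": the large archive
CUSTOMER_ROWS = 1_000     # "computer B": current customers
def move_big_to_small(archive, customers):
    # Old default: ship every archive row to B, then join on B.
    shipped = len(archive)
    wanted = set(customers)
    matches = [row for row in archive if row in wanted]
    return matches, shipped
def move_small_to_big(archive, customers):
    # Virtualization-style: ship B's small key list to A, join on A,
    # and send only the matching rows back to B.
    wanted = set(customers)
    matches = [row for row in archive if row in wanted]
    shipped = len(customers) + len(matches)
    return matches, shipped
# Archive ids 0..999_999; customer ids overlap the archive on ten values.
archive = list(range(ARCHIVE_ROWS))
customers = list(range(ARCHIVE_ROWS - 10, ARCHIVE_ROWS - 10 + CUSTOMER_ROWS))
old_matches, old_shipped = move_big_to_small(archive, customers)
new_matches, new_shipped = move_small_to_big(archive, customers)
print(old_shipped, new_shipped, len(new_matches))
```
Under these assumptions both strategies find the same ten rows, but the first ships 1,000,000 rows while the second ships 1,000 keys plus 10 results.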
92 00:04:28.079 --> 00:04:31.720 The promise is really about seamless power through familiarity. That's 93 00:04:31.759 --> 00:04:35.040 the key. PolyBase lets you query pretty much any external 94 00:04:35.120 --> 00:04:39.000 data source, HDFS, Azure blobs, even other databases, using the 95 00:04:39.040 --> 00:04:42.040 tool you already know, SQL Server, and the language you 96 00:04:42.040 --> 00:04:43.199 already know, T-SQL. 97 00:04:43.319 --> 00:04:45.920 Okay, so I write my standard T-SQL query, yep. 98 00:04:46.120 --> 00:04:47.240 But here's the clever bit. 99 00:04:47.360 --> 00:04:47.600 Yep. 100 00:04:47.839 --> 00:04:51.120 While you're writing familiar T-SQL, PolyBase is working behind 101 00:04:51.120 --> 00:04:54.680 the scenes, translating that query and leveraging the native capabilities 102 00:04:54.720 --> 00:04:58.480 of that external system, especially things like its parallel processing 103 00:04:58.519 --> 00:05:00.680 power or its optimized storage access. 104 00:05:00.879 --> 00:05:03.720 So it's like a universal translator for data queries. You 105 00:05:03.759 --> 00:05:06.360 speak T-SQL and PolyBase figures out how to ask 106 00:05:06.399 --> 00:05:08.079 the question in Hadoop-speak or whatever. 107 00:05:08.199 --> 00:05:10.079 That's a great way to put it. Yeah, it handles 108 00:05:10.079 --> 00:05:11.319 that translation and execution. 109 00:05:11.600 --> 00:05:13.639 Now, this wasn't an overnight thing. You mentioned it has 110 00:05:13.759 --> 00:05:17.319 roots in Microsoft's earlier efforts, particularly with Parallel Data Warehouse. 111 00:05:17.680 --> 00:05:20.920 Absolutely essential context. 
PolyBase was officially announced, I think it 112 00:05:20.959 --> 00:05:24.279 was November twenty twelve, at the SQL PASS Summit. But 113 00:05:24.480 --> 00:05:28.279 the underlying tech, the architecture, it really relied on the 114 00:05:28.319 --> 00:05:31.600 groundwork laid by Parallel Data Warehouse, or PDW, which came 115 00:05:31.600 --> 00:05:32.720 out back in twenty ten. 116 00:05:32.879 --> 00:05:35.879 And the sources really emphasize how quickly that PDW tech 117 00:05:35.920 --> 00:05:40.959 evolved. PDW version two in twenty thirteen, the performance jump 118 00:05:41.079 --> 00:05:42.639 was apparently staggering. 119 00:05:42.800 --> 00:05:46.079 It was revolutionary. We're talking like one hundred times faster 120 00:05:46.279 --> 00:05:49.439 query performance compared to v1. That's not incremental, that's 121 00:05:49.439 --> 00:05:50.839 a different class of machine. 122 00:05:50.920 --> 00:05:51.480 Wow. 123 00:05:51.560 --> 00:05:54.680 And at the same time they slashed the price per petabyte. 124 00:05:54.759 --> 00:05:59.000 It proved that this massively parallel processing, or MPP, architecture, 125 00:05:59.000 --> 00:06:01.439 which is the foundation for PolyBase too, was really the 126 00:06:01.480 --> 00:06:05.600 way forward for handling big data within a relational database context. 127 00:06:05.160 --> 00:06:08.199 And that investment paid off. By SQL Server twenty sixteen, 128 00:06:08.240 --> 00:06:11.560 PolyBase wasn't just a PDW thing. It was generally available 129 00:06:11.600 --> 00:06:13.120 in standard SQL Server editions. 130 00:06:13.199 --> 00:06:16.040 Yeah, that move really cemented it. Microsoft was clearly using 131 00:06:16.040 --> 00:06:18.920 the same core codebase, bringing that big data query power 132 00:06:18.920 --> 00:06:20.879 to its mainstream database product. 
133 00:06:20.680 --> 00:06:23.639 Which brings us nicely to the technical secret sauce, this 134 00:06:23.759 --> 00:06:25.319 idea of pushdown computation. 135 00:06:25.600 --> 00:06:28.839 Yes, this is critical to understanding why PolyBase is so 136 00:06:28.920 --> 00:06:30.519 much better than the older ways. 137 00:06:30.800 --> 00:06:34.360 We touched on the older ways failing, specifically SQL Server's 138 00:06:34.439 --> 00:06:37.639 linked servers. Can you elaborate on how they fell short 139 00:06:37.639 --> 00:06:38.600 with big data? 140 00:06:38.680 --> 00:06:43.240 Sure, they were, let's just say, not very optimized for 141 00:06:43.279 --> 00:06:46.680 remote filtering. If you wrote a standard query like select 142 00:06:46.680 --> 00:06:48.920 from a table on my linked server where 143 00:06:49.000 --> 00:06:52.480 filter column equals ten, the linked server wouldn't push that where 144 00:06:52.519 --> 00:06:54.759 filter column equals ten part down to the remote system. It 145 00:06:54.759 --> 00:06:57.399 would actually read the entire table from the remote source, 146 00:06:57.439 --> 00:06:57.879 the whole 147 00:06:57.720 --> 00:07:00.040 thing. Even if it was billions of rows? The whole 148 00:06:59.839 --> 00:07:02.560 thing. Pulled it all across the network, then applied the 149 00:07:02.560 --> 00:07:04.879 filter locally on your SQL Server instance. 150 00:07:04.959 --> 00:07:07.639 Oh my goodness. So if I just wanted ten records, 151 00:07:07.680 --> 00:07:11.480 I might still be pulling gigabytes or terabytes across the network 152 00:07:11.160 --> 00:07:16.160 first? Precisely, a guaranteed performance killer for large data sets. 153 00:07:16.720 --> 00:07:19.879 To actually force the filter to run remotely, you had 154 00:07:19.920 --> 00:07:23.120 to jump through hoops using things like OPENQUERY, embedding 155 00:07:23.160 --> 00:07:26.600 your remote query as a string. 
It was awkward, error prone, 156 00:07:26.639 --> 00:07:29.040 and didn't scale well for complex logic. Right. 157 00:07:29.120 --> 00:07:32.720 That sounds painful. So the magic, as the sources call it, 158 00:07:33.120 --> 00:07:37.560 of predicate pushdown in PolyBase, it fixes that? 159 00:07:37.879 --> 00:07:41.920 It fixes exactly that. PolyBase enables intelligent pushdown. The query 160 00:07:41.920 --> 00:07:44.360 optimizer looks at your T-SQL query and figures out 161 00:07:44.399 --> 00:07:47.120 which parts, the filtering predicates, maybe some joins, maybe 162 00:07:47.120 --> 00:07:50.759 aggregations, can actually be executed on the external data source 163 00:07:50.519 --> 00:07:52.399 itself. So it does the work remotely, and then 164 00:07:52.360 --> 00:07:55.560 it only brings back the much smaller pre-filtered, maybe 165 00:07:55.600 --> 00:07:57.879 even pre-aggregated result set. You only get the ten 166 00:07:57.920 --> 00:08:00.920 customers you asked for. The billions stay put. 167 00:08:01.120 --> 00:08:03.680 That must completely change the game for data warehousing. 168 00:08:03.759 --> 00:08:08.839 Oh, massively. Think about traditional ETL, extract, transform, load: huge, 169 00:08:08.920 --> 00:08:12.720 complex processes, often running for hours overnight just to move 170 00:08:12.800 --> 00:08:15.040 and reshape data before you can even query it. 171 00:08:15.079 --> 00:08:16.800 Right, the daily or nightly load window. 172 00:08:17.120 --> 00:08:20.279 PolyBase lets you potentially bypass a lot of that heavy lifting. 173 00:08:20.639 --> 00:08:23.439 You don't necessarily need to physically load all the external 174 00:08:23.519 --> 00:08:26.600 data into the warehouse first. You can query it in place. 
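NOTE Editor's illustration, not spoken in the episode. A minimal Python sketch of the contrast described above between pulling a whole remote table and filtering locally, versus pushing the filter or the aggregation down to the source. The RemoteTable class, its method names, and the filter_column value of 10 are hypothetical stand-ins chosen for this sketch; they are not PolyBase or linked-server APIs.
```python
class RemoteTable:
    """Hypothetical external source that counts the rows it sends back."""
    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0
    def scan(self):
        # No pushdown: every row crosses the network.
        self.rows_sent += len(self.rows)
        return list(self.rows)
    def scan_where(self, predicate):
        # Predicate pushdown: the filter runs at the source.
        result = [r for r in self.rows if predicate(r)]
        self.rows_sent += len(result)
        return result
    def count_where(self, predicate):
        # Aggregate pushdown: only a single number is sent back.
        self.rows_sent += 1
        return sum(1 for r in self.rows if predicate(r))
def wanted(row):
    # Assumed filter, mirroring "where filter column equals ten".
    return row["filter_column"] == 10
rows = [{"id": i, "filter_column": i % 1000} for i in range(100_000)]
# Strategy 1: pull everything, then filter locally.
pull_all = RemoteTable(rows)
local_result = [r for r in pull_all.scan() if wanted(r)]
# Strategy 2: push the filter to the source.
pushed = RemoteTable(rows)
remote_result = pushed.scan_where(wanted)
# Strategy 3: push the count itself to the source.
agg = RemoteTable(rows)
match_count = agg.count_where(wanted)
print(pull_all.rows_sent, pushed.rows_sent, agg.rows_sent, match_count)
```
All three strategies agree on the answer in this toy model; what differs is how many rows each one sends across the simulated network.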
175 00:08:26.959 --> 00:08:29.879 You focus on the analysis, the queries, the calculations, not 176 00:08:29.920 --> 00:08:33.759 the complex data plumbing, all through one connection point in 177 00:08:33.879 --> 00:08:37.720 SQL Server. And I guess this pushdown benefit is amplified 178 00:08:37.720 --> 00:08:42.200 by parallelism? Definitely. PolyBase, especially when you set up scale 179 00:08:42.240 --> 00:08:45.639 out groups with multiple SQL Server nodes, is designed for 180 00:08:45.759 --> 00:08:48.879 parallel data transfer. It can read data from multiple nodes 181 00:08:48.879 --> 00:08:53.600 in a Hadoop cluster or multiple partitions in cloud storage simultaneously. 182 00:08:52.960 --> 00:08:56.000 So it's pulling data in parallel, processing parts of the 183 00:08:56.080 --> 00:08:58.559 query in parallel on the remote system. Exactly. 184 00:08:58.960 --> 00:09:01.960 That parallel operation capability is a hallmark of those high 185 00:09:02.039 --> 00:09:05.639 end MPP systems, and PolyBase brings that capability to SQL 186 00:09:05.639 --> 00:09:07.360 Server interacting with external data. 187 00:09:07.399 --> 00:09:10.120 Fantastic. Okay, let's broaden the view a bit. Section four: 188 00:09:10.679 --> 00:09:15.879 PolyBase within the wider modern data ecosystem. Interoperability seems key here. It 189 00:09:15.840 --> 00:09:18.200 really is its main strength. It has those native, highly 190 00:09:18.200 --> 00:09:21.919 optimized connectors for the big ones, Hadoop HDFS and Azure 191 00:09:21.919 --> 00:09:25.279 Blob Storage using the WASB protocol. WASB, 192 00:09:25.519 --> 00:09:27.799 that's Windows Azure Storage Blob? Yep, 193 00:09:27.720 --> 00:09:29.879 the standard way to talk to Azure blobs for a while. 194 00:09:30.039 --> 00:09:32.960 But what about everything else? The world isn't just Microsoft 195 00:09:32.960 --> 00:09:36.639 and Hadoop. 
What if you need data from, say, Cassandra 196 00:09:37.159 --> 00:09:41.159 or MongoDB, or even other relational systems like MySQL 197 00:09:41.240 --> 00:09:42.240 or PostgreSQL? 198 00:09:42.399 --> 00:09:46.399 Good question. For many of those other systems, PolyBase relies 199 00:09:46.480 --> 00:09:50.799 on ODBC drivers, Open Database Connectivity. It's like a standard adapter. 200 00:09:51.039 --> 00:09:53.360 Okay, ODBC. So you can still use 201 00:09:53.240 --> 00:09:55.639 T-SQL? Yes, and that's the big win. You still 202 00:09:55.639 --> 00:09:58.759 get to query those diverse sources using familiar T-SQL 203 00:09:58.799 --> 00:10:02.080 from within SQL Server. Huge for developer productivity and ease 204 00:10:02.120 --> 00:10:02.600 of adoption. 205 00:10:02.720 --> 00:10:04.559 But there's always a but, isn't there? What's the trade 206 00:10:04.559 --> 00:10:05.440 off with ODBC? 207 00:10:05.639 --> 00:10:09.080 Well, using a generic bridge like ODBC can sometimes introduce 208 00:10:09.080 --> 00:10:12.000 a bit of overhead, and more importantly, it can sometimes 209 00:10:12.120 --> 00:10:15.200 limit those powerful pushdown capabilities we just talked about. 210 00:10:15.320 --> 00:10:18.879 Ah, so the intelligence might not always translate perfectly through 211 00:10:18.919 --> 00:10:19.919 the ODBC layer. 212 00:10:20.240 --> 00:10:23.519 Exactly. One of your sources had a perfect, if slightly 213 00:10:23.559 --> 00:10:28.240 worrying, example with MySQL. A simple count aggregation failed to 214 00:10:28.240 --> 00:10:30.960 push down because of some subtle difference in how white 215 00:10:30.960 --> 00:10:32.840 space was handled by the ODBC driver. 216 00:10:33.000 --> 00:10:35.600 Seriously, a count query failed to push down? 
217 00:10:35.799 --> 00:10:39.879 Yeah, and the workaround? They had to explicitly disable push 218 00:10:39.960 --> 00:10:42.720 down for that query, meaning it fell back to pulling 219 00:10:42.759 --> 00:10:44.000 more data than necessary. 220 00:10:44.440 --> 00:10:47.720 So while ODBC gives you broad connectivity, you might occasionally 221 00:10:47.759 --> 00:10:50.440 lose some of that peak performance or intelligent pushdown 222 00:10:50.440 --> 00:10:52.039 you get with the native connectors. 223 00:10:52.399 --> 00:10:55.559 That's the trade-off, essentially. It highlights why systems with 224 00:10:55.679 --> 00:10:59.279 truly native, deeply integrated connections often perform best. 225 00:10:59.440 --> 00:11:03.480 Speaking of best performers, the sources bring up Teradata quite 226 00:11:03.519 --> 00:11:05.600 a bit as a kind of gold standard in this 227 00:11:05.759 --> 00:11:06.600 MPP world. 228 00:11:06.759 --> 00:11:10.000 Yeah, Teradata is often seen as the benchmark, especially 229 00:11:10.000 --> 00:11:13.879 for petabyte scale warehousing. Their architecture goes way back, nineteen 230 00:11:13.919 --> 00:11:17.600 seventy nine. They really pioneered many of these MPP concepts 231 00:11:17.679 --> 00:11:20.120 like shared nothing architecture. So they've 232 00:11:19.919 --> 00:11:22.679 been doing native pushdown and parallel data movement for 233 00:11:22.759 --> 00:11:24.120 decades? Pretty much. 234 00:11:24.320 --> 00:11:28.240 Their maturity and optimization are their big strengths. PolyBase is, in 235 00:11:28.279 --> 00:11:32.799 many ways, bringing those proven MPP concepts, refined over years 236 00:11:32.840 --> 00:11:36.200 by systems like Teradata, into the more mainstream SQL 237 00:11:36.240 --> 00:11:37.759 Server ecosystem. 238 00:11:37.279 --> 00:11:40.480 Makes sense. And shifting to the cloud, PolyBase is vital there too, right? 239 00:11:40.480 --> 00:11:41.639 Connecting SQL Server 
240 00:11:41.519 --> 00:11:46.080 to cloud storage? Absolutely essential. We mentioned Azure blobs via WASB. 241 00:11:46.480 --> 00:11:49.200 It also supports reading from Azure Data Lake Store, both 242 00:11:49.240 --> 00:11:50.399 Gen one and Gen two. 243 00:11:50.559 --> 00:11:52.960 Does it use the newer native protocols for ADLS 244 00:11:53.000 --> 00:11:53.679 Gen two, like 245 00:11:53.639 --> 00:11:58.519 ABFS? Often it still relies on the WASB protocol compatibility 246 00:11:58.600 --> 00:12:01.360 layer even for Gen two, depending on the specific SQL 247 00:12:01.399 --> 00:12:05.559 Server or Synapse version. The native ABFS support is getting better, 248 00:12:05.559 --> 00:12:07.639 but WASB is often the fallback. 249 00:12:07.799 --> 00:12:10.159 And another key point mentioned was read versus write right now. 250 00:12:10.240 --> 00:12:12.720 Yes, that's an important current limitation to be aware of, 251 00:12:12.879 --> 00:12:17.200 especially in cloud scenarios like Azure Synapse Analytics. PolyBase is 252 00:12:17.200 --> 00:12:22.080 primarily fantastic for reading data from these external sources, Hadoop, ADLS, blobs, 253 00:12:22.480 --> 00:12:24.960 but writing data back out to them via PolyBase is 254 00:12:24.960 --> 00:12:28.559 often not supported or more limited. It's mainly a consumption, 255 00:12:28.720 --> 00:12:32.759 a virtualization tool, not necessarily a two way synchronization engine yet. 256 00:12:32.960 --> 00:12:36.080 Okay, good clarification. So let's tie this all together. Why 257 00:12:36.080 --> 00:12:38.840 should you, our listener, really care about this? What are 258 00:12:38.879 --> 00:12:41.559 the killer real world use cases? 259 00:12:41.759 --> 00:12:43.759 Well, there are two immediate ones that jump out, offering 260 00:12:43.840 --> 00:12:47.879 huge savings and capabilities. First, aging and archiving. 261 00:12:47.759 --> 00:12:50.600 Moving old data out of expensive databases? Exactly. 
262 00:12:50.639 --> 00:12:54.519 Think about old log files, transaction history older than, say, 263 00:12:54.799 --> 00:12:57.919 five years, data you need to keep for compliance but 264 00:12:57.960 --> 00:13:00.519 don't query often. You can set up partitioning in 265 00:13:00.600 --> 00:13:03.879 SQL Server to automatically move those old partitions to cheaper 266 00:13:03.919 --> 00:13:07.279 storage, HDFS, Azure Data Lake. And the beauty is 267 00:13:07.399 --> 00:13:10.159 PolyBase makes that archive data still look like it's part 268 00:13:10.200 --> 00:13:13.360 of the original table. Your legacy applications can query it 269 00:13:13.480 --> 00:13:16.960 using the same T-SQL, no code changes needed, instant cost 270 00:13:17.039 --> 00:13:18.600 savings on primary storage. 271 00:13:18.759 --> 00:13:21.240 That's incredibly practical. Okay, what's the second big one? 272 00:13:21.279 --> 00:13:24.200 The second one is maybe more transformational: creating those three 273 00:13:24.240 --> 00:13:26.559 hundred and sixty degree customer views, especially for things like 274 00:13:26.600 --> 00:13:27.799 AI and machine 275 00:13:27.600 --> 00:13:29.440 learning. Combining different data types? 276 00:13:29.679 --> 00:13:33.759 Right, imagine joining your core customer data from your relational 277 00:13:33.840 --> 00:13:39.240 database, names, addresses, purchase history, with massive unstructured or semi 278 00:13:39.240 --> 00:13:43.039 structured data streams. Right, what, like web clickstream data, social 279 00:13:43.080 --> 00:13:48.120 media interactions, maybe sensor data from devices, even anonymized location data, 280 00:13:48.440 --> 00:13:52.240 stuff that lives outside your traditional database. PolyBase lets you 281 00:13:52.279 --> 00:13:55.120 bring all that disparate data together virtually. 
You can then 282 00:13:55.200 --> 00:13:58.279 run ML models across that unified view to do really 283 00:13:58.360 --> 00:14:04.320 powerful things customers with incredible accuracy, predict churn, detect fraud, 284 00:14:04.639 --> 00:14:08.080 personalize offers, things you just couldn't do easily when the 285 00:14:08.159 --> 00:14:09.039 data was siloed. 286 00:14:09.240 --> 00:14:11.480 That opens up a lot of possibilities. Yeah, so wrapping 287 00:14:11.519 --> 00:14:13.200 things up, then, what's the big takeaway here? 288 00:14:13.320 --> 00:14:16.240 The big takeaway is that data virtualization, with tools like 289 00:14:16.279 --> 00:14:19.879 PolyBase leading the charge in the Microsoft world, fundamentally changes 290 00:14:19.919 --> 00:14:21.639 the role of the relational database. 291 00:14:21.799 --> 00:14:23.960 It's not just a container anymore. Exactly. 292 00:14:24.039 --> 00:14:27.000 It becomes more of a central hub and analytical control plane. 293 00:14:27.039 --> 00:14:30.679 By giving you familiar T-SQL access to these vast, 294 00:14:30.919 --> 00:14:34.759 varied external data sets and using clever tech like predicate 295 00:14:34.840 --> 00:14:38.279 pushdown to do it efficiently, it saves potentially huge amounts 296 00:14:38.320 --> 00:14:39.559 of time and money. 297 00:14:39.440 --> 00:14:43.039 Less complex ETL, less need for specialized skills for every 298 00:14:43.039 --> 00:14:43.919 single data source. 299 00:14:44.000 --> 00:14:47.080 Precisely, it makes leveraging diverse data much more accessible. 300 00:14:47.159 --> 00:14:50.919 Okay, a powerful shift. So we've seen how PolyBase makes 301 00:14:50.960 --> 00:14:54.159 reading and linking data from dozens of sources much easier. 302 00:14:54.440 --> 00:14:58.200 It makes dealing with massive static files almost trivial compared 303 00:14:58.240 --> 00:14:58.840 to the old ways. 304 00:14:58.919 --> 00:15:01.039 Yeah, the read side is pretty well tackled. But you 
305 00:15:01.000 --> 00:15:03.399 alluded to the difficulty of updating data in those distributed 306 00:15:03.440 --> 00:15:05.960 file systems earlier, which leaves us with a final thought 307 00:15:06.000 --> 00:15:08.799 for you to chew on. If PolyBase has made reading and 308 00:15:08.840 --> 00:15:12.639 analyzing virtualized data so seamless, how long will it be 309 00:15:12.759 --> 00:15:16.559 until that other major big data headache, the complexity and 310 00:15:16.639 --> 00:15:19.600 cost of ensuring real time consistency and updates across all 311 00:15:19.639 --> 00:15:23.720 these different virtualized sources, is also virtualized away just as elegantly? 312 00:15:24.720 --> 00:15:27.240 When can we update that archive record as easily as 313 00:15:27.279 --> 00:15:28.039 we can query it? 314 00:15:28.279 --> 00:15:31.000 That's the multi billion dollar question, isn't it? How do 315 00:15:31.039 --> 00:15:34.279 you handle distributed transactions and consistency at scale in a 316 00:15:34.320 --> 00:15:36.600 virtualized world? That's the next frontier.