WEBVTT 1 00:00:00.120 --> 00:00:03.680 Imagine, right, imagine a master chef, like a world-class chef, 2 00:00:03.759 --> 00:00:08.400 starring in this massive primetime thirty-minute cooking show. 3 00:00:08.480 --> 00:00:09.839 Okay, I'm picturing it. Right. 4 00:00:09.880 --> 00:00:12.679 So the camera rolls, the steak is perfectly seared, the 5 00:00:12.839 --> 00:00:16.399 sauce is just flawless, and the audience goes completely wild. 6 00:00:17.079 --> 00:00:20.239 But what you don't see, though, what's hidden off camera, 7 00:00:20.359 --> 00:00:23.399 is that this chef spent the previous eight hours standing 8 00:00:23.440 --> 00:00:27.160 in a back alley literally crying while peeling potatoes and 9 00:00:27.199 --> 00:00:29.679 chopping like ten thousand onions. 10 00:00:29.760 --> 00:00:31.800 Yeah, that sounds about right for the industry. 11 00:00:31.480 --> 00:00:34.320 It's crazy. And in the technology world, that chef is 12 00:00:34.359 --> 00:00:37.600 a data scientist. So today we are looking at how 13 00:00:37.719 --> 00:00:41.399 enterprise organizations are finally, you know, firing the potato peelers. 14 00:00:41.479 --> 00:00:43.719 It is a critical shift. Honestly, we are looking at 15 00:00:43.759 --> 00:00:47.359 a fundamental rewrite of the architecture. Like, the actual plumbing 16 00:00:47.399 --> 00:00:50.240 of predictive analytics is being completely rerouted at the 17 00:00:50.359 --> 00:00:54.840 enterprise level just to eliminate these massive systemic inefficiencies. 18 00:00:55.159 --> 00:00:58.039 Exactly. So, if you want a shortcut to understanding how 19 00:00:58.119 --> 00:01:02.200 big businesses are actually transforming their standard everyday databases 20 00:01:02.719 --> 00:01:07.799 into these automated predictive engines, well, this is it. 
Today's 21 00:01:07.840 --> 00:01:10.480 deep dive is based on excerpts from the book Data 22 00:01:10.519 --> 00:01:14.439 Science Using Oracle Data Miner and Oracle R Enterprise. 23 00:01:14.120 --> 00:01:15.879 Which is a fantastic resource, by the 24 00:01:15.879 --> 00:01:19.200 way. Oh, totally. We're exploring how bringing the math directly 25 00:01:19.200 --> 00:01:22.120 to the data solves honestly one of the single biggest 26 00:01:22.159 --> 00:01:26.159 bottlenecks in modern tech. So okay, let's unpack this because 27 00:01:26.200 --> 00:01:29.159 the way a lot of organizations currently execute data science 28 00:01:29.280 --> 00:01:31.760 is just fundamentally broken. It really is. 29 00:01:32.040 --> 00:01:34.879 What's fascinating here is that the real secret to effective 30 00:01:34.920 --> 00:01:39.000 data science at scale isn't necessarily about inventing a more 31 00:01:39.040 --> 00:01:42.879 complex neural network or a better algorithm. It's really about 32 00:01:42.920 --> 00:01:45.239 where those algorithms are physically executed. Right. 33 00:01:45.280 --> 00:01:46.680 The location matters. Exactly. 34 00:01:47.000 --> 00:01:50.200 Moving massive amounts of data across networks just to analyze 35 00:01:50.200 --> 00:01:53.760 it is a critical, costly mistake, and the infrastructure we 36 00:01:53.760 --> 00:01:57.280 are exploring today was built specifically to stop that movement entirely. 37 00:01:57.480 --> 00:02:00.400 So to understand the fix, I feel like we first 38 00:02:00.400 --> 00:02:04.239 have to understand why data scientists are stuck chopping those 39 00:02:04.280 --> 00:02:07.040 onions in the first place. Because if you look at 40 00:02:07.040 --> 00:02:13.680 standard industry frameworks like the CRISP-DM methodology, you'd probably assume 41 00:02:13.759 --> 00:02:17.400 the actual modeling, like the glamorous machine learning part, is 42 00:02:17.439 --> 00:02:18.400 where all the time goes. 
43 00:02:18.520 --> 00:02:21.479 Yeah, that is the assumption, but the reality is heavily, 44 00:02:21.520 --> 00:02:26.120 heavily skewed. In almost any enterprise deployment, data preparation takes 45 00:02:26.199 --> 00:02:28.800 up a staggering sixty to eighty percent of the total 46 00:02:28.840 --> 00:02:29.479 project effort. 47 00:02:29.639 --> 00:02:31.680 Sixty to eighty percent. That's insane. 48 00:02:32.000 --> 00:02:35.520 It is, because real-world data is inherently dirty, it's skewed, 49 00:02:35.560 --> 00:02:38.439 it's riddled with missing values. It's just a mess. Right. 50 00:02:38.680 --> 00:02:40.280 So if the vast majority of the job is just 51 00:02:40.360 --> 00:02:43.599 cleaning up missing variables and formatting timestamps, why is the 52 00:02:43.599 --> 00:02:46.080 title data scientist and not data janitor? 53 00:02:46.599 --> 00:02:49.719 Well, because that janitoring actually dictates the success or failure 54 00:02:49.719 --> 00:02:52.520 of the entire model. In predictive analytics, there is basically 55 00:02:52.560 --> 00:02:55.879 an ironclad rule. A simple regression model built on perfectly 56 00:02:55.919 --> 00:02:59.719 clean data will consistently outperform a highly sophisticated deep learning 57 00:02:59.719 --> 00:03:01.599 model that has been fed dirty data. 58 00:03:01.639 --> 00:03:05.719 Wow, really? So the clean data beats the complex math every 59 00:03:05.520 --> 00:03:08.719 single time. Because if you don't handle the anomalies, your 60 00:03:08.759 --> 00:03:13.280 model simply learns the noise. It just memorizes the mistakes. 61 00:03:13.039 --> 00:03:15.319 But spending all that time cleaning, I mean, that's the 62 00:03:15.360 --> 00:03:19.039 antithesis of business agility, right? Like, if a telecom company 63 00:03:19.080 --> 00:03:22.159 wants to predict customer churn this month, they can't afford 64 00:03:22.199 --> 00:03:25.680 to spend three weeks manually cleaning billing data first. 
65 00:03:25.599 --> 00:03:28.520 Exactly. And that naturally leads us to the data science 66 00:03:28.560 --> 00:03:32.560 automation pyramid from the source material. To move fast, you 67 00:03:32.560 --> 00:03:34.319 have to automate the data pipeline. 68 00:03:34.439 --> 00:03:36.879 Okay, so walk us through this pyramid. What's at the bottom? 69 00:03:36.960 --> 00:03:40.039 At the base, you have problem-specific automation. This is 70 00:03:40.080 --> 00:03:45.599 automating a single, rigid workflow, like maybe a monthly sales 71 00:03:45.639 --> 00:03:47.919 forecast that runs exactly the same way every time. 72 00:03:48.039 --> 00:03:50.000 Got it. Just basic scripting. Right. 73 00:03:50.319 --> 00:03:53.240 Then above that you have repetitive task automation, which is 74 00:03:53.280 --> 00:03:56.919 where you build generalized scripts to automatically handle missing values 75 00:03:57.039 --> 00:03:59.919 or transform columns across various different data sets. 76 00:04:00.039 --> 00:04:02.080 So that's taking away a lot of the manual janitor 77 00:04:02.159 --> 00:04:02.960 work. Exactly. 78 00:04:03.000 --> 00:04:05.159 It frees up the human to do actual science. 79 00:04:05.360 --> 00:04:07.319 Okay, so what's at the very top of the pyramid? 80 00:04:07.400 --> 00:04:11.400 Then, the automated statistician. Ooh, that sounds intense. It is. 81 00:04:11.879 --> 00:04:14.840 This is an environment where the system evaluates the underlying 82 00:04:14.919 --> 00:04:19.279 data structures, learns the patterns, and automatically selects the most 83 00:04:19.360 --> 00:04:23.240 suitable algorithm without requiring a human to manually tune the 84 00:04:23.319 --> 00:04:24.199 hyperparameters. 85 00:04:24.680 --> 00:04:26.959 Wait, getting to that top tier sounds incredible, but I 86 00:04:27.000 --> 00:04:30.680 mean, the glaring issue here is the friction of traditional architecture. 87 00:04:30.800 --> 00:04:30.959 Right. 
88 00:04:31.879 --> 00:04:35.120 Historically, when a data scientist wanted to run those repetitive 89 00:04:35.240 --> 00:04:38.920 data cleaning scripts, they were using client-side tools like 90 00:04:39.040 --> 00:04:42.639 Python or open-source R or SAS on their laptops. 91 00:04:42.759 --> 00:04:45.560 Yeah, which means they had to extract the data. Traditional 92 00:04:45.560 --> 00:04:49.160 analytical environments basically sit on a separate application server or 93 00:04:49.199 --> 00:04:52.720 on the data scientist's local machine. So to run your model, 94 00:04:52.800 --> 00:04:56.480 you have to query your central enterprise database, extract gigabytes 95 00:04:56.560 --> 00:04:59.399 or sometimes terabytes of data, push it over the network, 96 00:05:00.079 --> 00:05:02.839 load it into the memory of your analytical tool, process it, 97 00:05:03.079 --> 00:05:04.680 and then attempt to write the results back. 98 00:05:04.920 --> 00:05:08.199 It sounds exhausting just describing it. So, bringing it back to 99 00:05:08.240 --> 00:05:12.319 our chef analogy, traditional data science is like storing all 100 00:05:12.360 --> 00:05:15.680 your raw ingredients in this massive warehouse all the way 101 00:05:15.680 --> 00:05:18.800 across town. Yes, exactly. And every single time you want 102 00:05:18.800 --> 00:05:20.920 to test a new recipe, you literally have to drive 103 00:05:20.959 --> 00:05:23.720 a semi truck across the city, load up the ingredients, 104 00:05:23.920 --> 00:05:26.560 drive back to your kitchen, cook the meal, and then 105 00:05:26.639 --> 00:05:28.800 drive the leftovers back to the warehouse. 106 00:05:29.000 --> 00:05:33.319 It's catastrophic for efficiency. 
You hit network I/O bottlenecks immediately, 107 00:05:33.839 --> 00:05:38.600 you hit integration failures, and most importantly, client-based tools 108 00:05:38.759 --> 00:05:41.199 simply choke because they require the data set to be 109 00:05:41.240 --> 00:05:42.560 loaded into active RAM. 110 00:05:42.720 --> 00:05:44.240 Right. You can't just cram everything in there. 111 00:05:44.360 --> 00:05:47.639 No, you absolutely cannot load a two-terabyte customer table 112 00:05:47.680 --> 00:05:50.759 into the RAM of a standard application server. It will crash. 113 00:05:51.120 --> 00:05:53.720 Here's where it gets really interesting, though, because the solution 114 00:05:53.879 --> 00:05:58.839 presented in this architecture, specifically Oracle Advanced Analytics, basically just 115 00:05:59.120 --> 00:06:01.759 builds the kitchen inside the warehouse. Precisely. 116 00:06:02.040 --> 00:06:05.920 Oracle Advanced Analytics, or OAA, operates directly on top of 117 00:06:05.959 --> 00:06:09.160 the Oracle database kernel. The data never actually moves. 118 00:06:09.079 --> 00:06:11.399 So you're just cooking where the food is. Exactly. 119 00:06:11.759 --> 00:06:17.120 By eliminating data extraction, you effectively achieve zero latency in 120 00:06:17.160 --> 00:06:18.279 your data pipeline. 121 00:06:18.519 --> 00:06:20.680 And I imagine you bypass the memory limits of a 122 00:06:20.680 --> 00:06:25.360 local machine entirely, right, because the database is already optimized 123 00:06:25.399 --> 00:06:28.240 to query and process data directly from storage. 124 00:06:28.360 --> 00:06:31.560 Oh, absolutely. Plus you don't have to worry about security 125 00:06:31.560 --> 00:06:34.920 protocols breaking down over some open network connection. Right, 126 00:06:34.800 --> 00:06:37.680 because once the data leaves the database, you've kind of 127 00:06:37.720 --> 00:06:39.199 lost control over who sees it. 
128 00:06:39.279 --> 00:06:42.399 Security is a massive factor here. The data remains governed 129 00:06:42.399 --> 00:06:45.959 by the strict native security policies of the database itself. 130 00:06:46.639 --> 00:06:50.800 But from a purely performance standpoint, executing inside the kernel 131 00:06:50.959 --> 00:06:55.839 allows the algorithms to leverage Oracle's parallel processing capabilities. 132 00:06:55.319 --> 00:06:58.759 Meaning instead of one computer churning through the data row 133 00:06:58.800 --> 00:07:01.480 by row by row, the database can split the job 134 00:07:01.560 --> 00:07:05.319 across dozens of internal processors simultaneously. 135 00:07:04.680 --> 00:07:07.879 Exactly, and when predictive models run directly inside the kernel, 136 00:07:08.240 --> 00:07:11.800 the whole business posture shifts. You aren't extracting data to 137 00:07:11.879 --> 00:07:14.639 run some post-mortem analysis of what happened last quarter. 138 00:07:14.720 --> 00:07:16.920 Yeah, it's not looking backward anymore. 139 00:07:16.639 --> 00:07:20.360 Right. You are querying the database in real time to ask, 140 00:07:20.879 --> 00:07:24.720 what is the probability this specific transaction happening right now 141 00:07:25.279 --> 00:07:26.040 is fraudulent? 142 00:07:26.439 --> 00:07:29.399 Okay, so the kitchen is inside the warehouse, which is great, 143 00:07:29.639 --> 00:07:31.800 but we still have to do the sixty to eighty 144 00:07:31.800 --> 00:07:35.240 percent of the workload that involves data preparation, right? I mean, 145 00:07:35.279 --> 00:07:37.720 bringing the math to the data doesn't magically clean it. 146 00:07:38.319 --> 00:07:40.959 We still need the tools to handle anomalies. 147 00:07:41.199 --> 00:07:44.160 Yes we do, and that is handled through an in 148 00:07:44.279 --> 00:07:47.800 database PL/SQL package called DBMS_DATA_MINING_TRANSFORM. 149 00:07:47.920 --> 00:07:50.720 Okay, quite a mouthful. It is, yeah. 
150 00:07:50.399 --> 00:07:53.600 But basically this is the toolkit for managing that massive 151 00:07:53.680 --> 00:07:56.480 data preparation phase without ever leaving the database. 152 00:07:56.759 --> 00:08:00.439 So let's talk about how this toolkit actually works, specifically with outliers, 153 00:08:00.879 --> 00:08:03.439 because if you have an e-commerce platform and you're analyzing, 154 00:08:03.439 --> 00:08:06.439 say, average order value, one user buying a fifty 155 00:08:06.480 --> 00:08:10.040 thousand dollar watch is going to completely skew your distribution. 156 00:08:10.199 --> 00:08:11.279 Oh, absolutely. So 157 00:08:11.199 --> 00:08:14.240 the toolkit handles this using what they call winsorizing or trimming. 158 00:08:14.560 --> 00:08:18.560 Right, unhandled extreme values will drag the mean of your 159 00:08:18.639 --> 00:08:22.439 data so far from the median that any distance-based 160 00:08:22.519 --> 00:08:26.639 algorithm you use will generate completely erroneous clusters. It'll just 161 00:08:26.759 --> 00:08:27.519 ruin the model. 162 00:08:28.120 --> 00:08:31.079 So if you're using this PL/SQL package, how do these 163 00:08:31.120 --> 00:08:35.759 two methods, winsorizing and trimming, mechanically solve that watch problem? 164 00:08:35.919 --> 00:08:39.320 Well, trimming is the brute-force approach. It literally clips 165 00:08:39.840 --> 00:08:43.320 the extreme tail ends of your distribution, say the top 166 00:08:43.360 --> 00:08:46.679 one percent of values, and just sets them to NULL. 167 00:08:46.480 --> 00:08:48.679 Just delete them, basically. Effectively, 168 00:08:48.720 --> 00:08:51.840 yes. Winsorizing, on the other hand, is much more elegant. 169 00:08:52.000 --> 00:08:55.440 Instead of removing the data point entirely, it caps it. 
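A small Python sketch may make the trimming-versus-winsorizing distinction concrete. It is not the Oracle PL/SQL package: the function names (`tail_bounds`, `trim`, `winsorize`), the 10 percent tail fraction, and the order values are all assumptions chosen for illustration.

```python
def tail_bounds(values, tail_frac):
    """Return (low, high) cut-offs that leave tail_frac of the points in each tail."""
    ordered = sorted(v for v in values if v is not None)
    k = int(len(ordered) * tail_frac)  # number of points treated as extreme per tail
    return ordered[k], ordered[-k - 1]


def trim(values, tail_frac):
    """Trimming: anything outside the cut-offs becomes None (NULL in a database)."""
    low, high = tail_bounds(values, tail_frac)
    return [v if v is not None and low <= v <= high else None for v in values]


def winsorize(values, tail_frac):
    """Winsorizing: anything outside the cut-offs is pulled back to the nearest cut-off."""
    low, high = tail_bounds(values, tail_frac)
    return [None if v is None else min(max(v, low), high) for v in values]


# Hypothetical order values, including the fifty thousand dollar watch.
order_values = [20, 25, 30, 35, 40, 45, 50, 55, 60, 50000]

trimmed = trim(order_values, 0.10)
capped = winsorize(order_values, 0.10)

print(trimmed)  # [None, 25, 30, 35, 40, 45, 50, 55, 60, None]
print(capped)   # [25, 25, 30, 35, 40, 45, 50, 55, 60, 60]
```

In this toy data the raw mean is 5036, while the winsorized mean is 42.5, which is the "dragging the mean away from the median" effect in miniature.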
170 00:08:55.440 --> 00:08:59.320 It replaces those extreme tail values with a specified maximum parameter, 171 00:08:59.600 --> 00:09:02.679 pulling the outlier back into the acceptable edge of the distribution. 172 00:09:03.039 --> 00:09:05.799 Oh, I see. So winsorizing is like taking a person 173 00:09:05.840 --> 00:09:08.360 who's screaming through a megaphone in a crowded room and 174 00:09:08.399 --> 00:09:11.120 forcing them to just whisper, while trimming is just 175 00:09:11.159 --> 00:09:13.240 throwing them out of the building entirely so they don't 176 00:09:13.279 --> 00:09:14.039 ruin the party. 177 00:09:14.279 --> 00:09:16.840 That's a great way to visualize it. Yes, but dealing 178 00:09:16.879 --> 00:09:19.000 with outliers is just the first step. You also have 179 00:09:19.039 --> 00:09:22.159 to normalize the data before you apply the algorithm. 180 00:09:21.759 --> 00:09:25.559 Right, normalization, which is bringing variables to a uniform scale 181 00:09:25.720 --> 00:09:29.840 using min-max or z-score calculations. Because, and correct 182 00:09:29.879 --> 00:09:32.919 me if I'm wrong, if you feed an algorithm a 183 00:09:32.960 --> 00:09:36.759 customer's age, which is a two-digit number, alongside their 184 00:09:36.919 --> 00:09:40.720 annual income, which is a six-digit number, the geometry 185 00:09:40.720 --> 00:09:44.480 of the algorithm will mathematically assume the income is exponentially 186 00:09:44.480 --> 00:09:46.759 more important, simply because the number is larger. 187 00:09:46.799 --> 00:09:50.919 You nailed it. The algorithm operates on mathematical distance. If 188 00:09:50.960 --> 00:09:53.840 you don't scale the inputs to a uniform magnitude, your 189 00:09:53.879 --> 00:09:55.440 model is practically useless. 190 00:09:55.519 --> 00:09:57.840 It just gets confused by the big numbers. Exactly. 191 00:09:58.200 --> 00:10:02.080 But the toolkit goes beyond scaling. 
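The age-versus-income point can be checked with a few lines of Python. This is a sketch under stated assumptions, not the database's own implementation: the helper names (`min_max`, `z_score`, `distance`) and the four hypothetical customers are invented for illustration.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values on their mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def distance(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical customers: ages and annual incomes.
ages = [25, 35, 45, 55]
incomes = [40000, 80000, 120000, 160000]

# Raw distance between the first two customers is almost entirely income.
raw = distance((ages[0], incomes[0]), (ages[1], incomes[1]))

# After min-max scaling both features contribute equally to the distance.
scaled_ages, scaled_incomes = min_max(ages), min_max(incomes)
scaled = distance(
    (scaled_ages[0], scaled_incomes[0]),
    (scaled_ages[1], scaled_incomes[1]),
)

print(raw)
print(scaled)
print(z_score(incomes))
```

On the raw values the ten-year age gap is invisible next to the forty thousand dollar income gap; after scaling, each feature moves the distance by the same amount.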
It also performs complex binning, 192 00:10:02.559 --> 00:10:05.720 which is transforming continuous data into discrete categories. 193 00:10:05.840 --> 00:10:08.200 Yeah. The supervised binning feature is what really caught my 194 00:10:08.240 --> 00:10:10.480 eye in the source text, because instead of a human 195 00:10:10.639 --> 00:10:15.039 arbitrarily deciding that high income starts at exactly one hundred 196 00:10:15.039 --> 00:10:18.559 thousand dollars, supervised binning automates the logic. It does. 197 00:10:18.639 --> 00:10:20.960 It uses a decision tree algorithm under the hood, so 198 00:10:21.000 --> 00:10:25.159 the system analyzes the data's relationship to your target outcome, 199 00:10:25.559 --> 00:10:28.080 like whether a customer churned or not. If the decision 200 00:10:28.080 --> 00:10:32.120 tree determines that a massive spike in churn happens specifically when 201 00:10:32.159 --> 00:10:36.480 income drops below, let's say, sixty-four thousand three hundred dollars, 202 00:10:36.799 --> 00:10:38.919 it sets the bin boundary exactly there. 203 00:10:39.240 --> 00:10:39.759 Oh wow. 204 00:10:39.840 --> 00:10:42.720 Yeah, it lets the predictive power of the data dictate 205 00:10:42.799 --> 00:10:45.519 the categorization completely, removing human bias. 206 00:10:46.039 --> 00:10:48.519 Well, wait, if I'm just a standard database administrator or 207 00:10:48.559 --> 00:10:50.759 a business analyst running this, how do I know if 208 00:10:50.759 --> 00:10:53.879 the algorithm I want to use requires min-max scaling or 209 00:10:53.919 --> 00:10:58.200 a z-score or supervised binning? Like, I wouldn't 210 00:10:57.840 --> 00:11:01.000 know that. And you frequently don't need to know. Oracle 211 00:11:01.080 --> 00:11:05.519 utilizes a feature called Automatic Data Preparation, or ADP. Nice. 
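To illustrate the supervised binning idea, here is a sketch of a one-level decision tree (a "stump") that picks a bin boundary from the target labels. How Oracle's implementation chooses its splits is not described in the discussion, so the Gini criterion, the function names, and the income and churn values below are assumptions made for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the bin is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return the threshold that minimizes the weighted Gini impurity of the two bins."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can sit between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


# Hypothetical customers: annual income and whether they churned (1) or stayed (0).
incomes = [30000, 45000, 58000, 64000, 64600, 72000, 90000, 120000]
churned = [1, 1, 1, 1, 0, 0, 0, 0]

boundary = best_split(incomes, churned)
print(boundary)  # 64300.0
```

The boundary falls where the labels change, not at a round number a person might have picked by hand.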
212 00:11:05.639 --> 00:11:10.120 When enabled, ADP intercepts your request, evaluates the specific algorithm 213 00:11:10.159 --> 00:11:13.519 you've chosen, say a support vector machine, which strictly requires 214 00:11:13.559 --> 00:11:17.759 normalized inputs, and it automatically executes the correct mathematical transformations 215 00:11:17.759 --> 00:11:19.960 inside the kernel before running the model. 216 00:11:20.120 --> 00:11:22.799 That is so cool. It handles the prerequisites dynamically, so 217 00:11:22.840 --> 00:11:25.279 the data is prepped, the environment is secure, and we 218 00:11:25.360 --> 00:11:27.679 are finally ready for the top tier of that pyramid 219 00:11:27.759 --> 00:11:29.840 we talked about, the automated statistician. 220 00:11:30.200 --> 00:11:33.200 Yes, this is where we look at the DBMS_PREDICTIVE_ANALYTICS 221 00:11:33.240 --> 00:11:38.799 package. It contains three highly automated APIs: PREDICT, EXPLAIN, 222 00:11:39.080 --> 00:11:39.759 and PROFILE. 223 00:11:40.000 --> 00:11:43.960 Right. PREDICT automatically generates an outcome variable, EXPLAIN ranks the 224 00:11:43.960 --> 00:11:48.120 importance of the independent variables, and PROFILE extracts the core 225 00:11:48.200 --> 00:11:51.080 business rules the model found. You literally just pass it 226 00:11:51.159 --> 00:11:53.279 the table name and the target column and it does 227 00:11:53.320 --> 00:11:53.720 the rest. 228 00:11:54.000 --> 00:11:55.399 It really is that straightforward. 229 00:11:55.440 --> 00:11:57.440 So what does this all mean? It sounds like we 230 00:11:57.480 --> 00:12:02.360 are completely democratizing machine learning, allowing average SQL users to 231 00:12:02.399 --> 00:12:04.360 perform data science without knowing the math. 232 00:12:04.559 --> 00:12:08.600 It absolutely democratizes access. But this raises an important question. 233 00:12:09.000 --> 00:12:11.799 Is it safe to lower the barrier to entry that far? 
234 00:12:11.960 --> 00:12:13.159 That's a very fair point. 235 00:12:13.279 --> 00:12:16.200 When an analyst just presses a predict button, they are 236 00:12:16.279 --> 00:12:20.159 essentially trusting a black box. The system is making vast 237 00:12:20.159 --> 00:12:22.000 mathematical assumptions on their behalf. 238 00:12:22.120 --> 00:12:24.440 I agree, and frankly, I'm a bit skeptical. If you 239 00:12:24.559 --> 00:12:27.240 let an average user bypass the math and just blindly 240 00:12:27.240 --> 00:12:30.320 apply predictive models to their company's revenue data, aren't we 241 00:12:30.399 --> 00:12:34.519 just accelerating how fast they can make a catastrophic business decision? 242 00:12:34.759 --> 00:12:38.320 That is the inherent risk of democratization, without a doubt. 243 00:12:38.799 --> 00:12:41.759 If the user doesn't understand the underlying assumptions of the models, 244 00:12:42.399 --> 00:12:47.000 the results can be dangerous. Most powerful parametric machine learning 245 00:12:47.000 --> 00:12:51.200 models assume that your underlying data follows a normal bell 246 00:12:51.240 --> 00:12:54.759 curve distribution. Right. If you feed them highly skewed, non 247 00:12:54.799 --> 00:12:59.279 normal data, the predictions will be mathematically invalid, period. 248 00:12:59.039 --> 00:13:02.120 Which is why having statistical tests built directly into the 249 00:13:02.200 --> 00:13:04.480 database is so critical. I guess you don't have to 250 00:13:04.519 --> 00:13:07.960 blindly trust the black box. You can use native SQL 251 00:13:08.000 --> 00:13:11.559 functions like the Shapiro-Wilk test to evaluate the normality 252 00:13:11.559 --> 00:13:13.919 of your distribution right there in the query. Exactly. 253 00:13:13.960 --> 00:13:18.120 Shapiro-Wilk evaluates the null hypothesis that your sample came 254 00:13:18.120 --> 00:13:19.919 from a normally distributed population. 
255 00:13:20.320 --> 00:13:22.159 Okay, so if I run that SQL query and it 256 00:13:22.159 --> 00:13:23.480 returns a p-value 257 00:13:23.200 --> 00:13:26.360 of zero, you instantly know your data is non-normal. 258 00:13:26.840 --> 00:13:29.120 You can test your assumptions without having to extract the 259 00:13:29.200 --> 00:13:32.759 data to a specialized statistical software package. It's all right there. 260 00:13:33.000 --> 00:13:37.000 And the analytical capabilities of modern SQL don't stop there. 261 00:13:37.519 --> 00:13:40.600 The source material dives into functions like LAG, LEAD, and 262 00:13:40.639 --> 00:13:44.759 these really complex windowing functions. And these aren't just convenient syntax, 263 00:13:44.840 --> 00:13:47.159 they are massive performance lifesavers. 264 00:13:47.240 --> 00:13:50.200 They really are. LAG and LEAD allow you to access 265 00:13:50.320 --> 00:13:54.159 data from previous or subsequent rows in the exact same 266 00:13:54.240 --> 00:13:56.840 result set without having to use a clunky self-join. 267 00:13:57.200 --> 00:14:00.320 So like, if a retailer is calculating year-over-year 268 00:14:00.320 --> 00:14:03.200 sales growth across ten thousand stores, they don't have to 269 00:14:03.279 --> 00:14:06.919 pull millions of rows into a Python data frame just 270 00:14:06.960 --> 00:14:10.200 to calculate a rolling average. They can use a SQL 271 00:14:10.240 --> 00:14:13.519 windowing function to calculate that moving average directly on the 272 00:14:13.559 --> 00:14:14.840 storage disk. And 273 00:14:14.759 --> 00:14:18.240 by processing that rolling average inside the database using SQL, 274 00:14:18.559 --> 00:14:22.440 you are leveraging the internal optimizer. It completes the calculation 275 00:14:22.519 --> 00:14:24.919 in a fraction of the time and only returns the 276 00:14:24.960 --> 00:14:27.480 final aggregated insight to the application layer. 
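The LAG and moving-average pattern can be tried without an Oracle instance. The sketch below uses Python's built-in `sqlite3` module as a stand-in database (window functions need SQLite 3.25 or newer); the `sales` table, its columns, and the figures are hypothetical. The point it illustrates is that the year-over-year change and the rolling average are computed by the database engine, and only the finished rows reach the application.

```python
import sqlite3

# In-memory SQLite database standing in for the enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 2021, 100.0), (1, 2022, 120.0), (1, 2023, 150.0),
        (2, 2021, 200.0), (2, 2022, 180.0), (2, 2023, 210.0),
    ],
)

# LAG reads the previous year's row for the same store, so no self-join is needed.
# The AVG window averages the current year with the one before it.
query = """
SELECT store_id,
       year,
       revenue,
       revenue - LAG(revenue) OVER (PARTITION BY store_id ORDER BY year) AS yoy_change,
       AVG(revenue) OVER (PARTITION BY store_id ORDER BY year
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales
ORDER BY store_id, year
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first year for each store has no previous row, so its `yoy_change` comes back as `None` (SQL NULL).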
277 00:14:27.600 --> 00:14:30.559 Okay, so SQL is incredibly powerful. But let's play devil's 278 00:14:30.600 --> 00:14:31.279 advocate for a second. 279 00:14:31.399 --> 00:14:31.840 Let's do it. 280 00:14:32.000 --> 00:14:35.240 What if your company's lead data scientist is like a 281 00:14:35.320 --> 00:14:38.759 hardcore statistical researcher, you know the type. They spent eight 282 00:14:38.840 --> 00:14:42.039 years getting a PhD. They live and breathe the open 283 00:14:42.039 --> 00:14:45.759 source R programming language, and they rely on these massive, 284 00:14:45.919 --> 00:14:50.320 crowdsourced libraries of cutting-edge algorithms that standard SQL just 285 00:14:50.360 --> 00:14:51.720 doesn't natively support. 286 00:14:52.559 --> 00:14:55.320 Are they just forced to abandon R and write Oracle 287 00:14:55.399 --> 00:14:56.799 PL/SQL? Not at all. 288 00:14:57.240 --> 00:15:00.679 That exact friction is what Oracle R Enterprise, or ORE, 289 00:15:01.240 --> 00:15:02.399 was engineered to eliminate. 290 00:15:02.440 --> 00:15:04.720 Because open-source R has the same architectural flaw we 291 00:15:04.720 --> 00:15:07.080 talked about earlier. Right, it's entirely client-based. It has to 292 00:15:07.120 --> 00:15:09.720 load everything into the local laptop's RAM. Exactly. 293 00:15:09.840 --> 00:15:12.759 Open-source R is brilliant for innovation, but it is 294 00:15:12.799 --> 00:15:16.720 fundamentally incapable of handling true enterprise big data. If you 295 00:15:16.759 --> 00:15:19.559 try to run an advanced clustering algorithm on a billion 296 00:15:19.639 --> 00:15:22.720 rows of transaction data using open-source R, the memory 297 00:15:22.720 --> 00:15:25.000 limit will immediately crash the session. 298 00:15:24.879 --> 00:15:26.240 Just a blue screen of death, 299 00:15:26.279 --> 00:15:27.279 basically. Pretty much. 
300 00:15:27.360 --> 00:15:29.799 So how does ORE solve this without forcing that PhD 301 00:15:29.840 --> 00:15:32.399 data scientist to learn a whole new language? The source 302 00:15:32.440 --> 00:15:36.159 outlines a three-layer architecture to make R database-compatible. Right, 303 00:15:36.240 --> 00:15:38.720 so layer one is simply the client R engine. The 304 00:15:38.799 --> 00:15:41.799 data scientist sits at their laptop and writes standard R 305 00:15:41.879 --> 00:15:45.080 code in their normal IDE. They don't change their workflow 306 00:15:45.120 --> 00:15:45.360 at all. 307 00:15:45.519 --> 00:15:47.879 Okay, and layer two is where the magic happens. 308 00:15:47.919 --> 00:15:48.039 Right. 309 00:15:48.080 --> 00:15:50.120 The database has this transparency layer. 310 00:15:50.200 --> 00:15:53.440 Yes, this is the crucial translation mechanism. When the data 311 00:15:53.440 --> 00:15:55.960 scientist writes an R command to filter a data set 312 00:15:56.039 --> 00:16:00.519 or apply a transformation, the transparency layer intercepts that command. Oh, okay. 313 00:16:00.679 --> 00:16:02.799 It does not pull the data to the laptop. Instead, it 314 00:16:02.919 --> 00:16:07.639 dynamically translates the R syntax into a highly optimized SQL query. 315 00:16:08.200 --> 00:16:11.080 It maps the R data frames directly to Oracle tables 316 00:16:11.159 --> 00:16:11.639 or views. 317 00:16:11.799 --> 00:16:14.480 That is wild. So the data scientist thinks they are 318 00:16:14.519 --> 00:16:17.759 manipulating a local R data frame, but behind the scenes, 319 00:16:17.840 --> 00:16:21.399 Oracle is essentially spoofing the environment and executing a native 320 00:16:21.399 --> 00:16:23.120 SQL query on the server. Exactly. 321 00:16:23.120 --> 00:16:26.120 It's totally seamless. And then layer three consists of spawned 322 00:16:26.200 --> 00:16:29.840 R engines running directly on the database server itself. 
If 323 00:16:29.840 --> 00:16:32.600 the data scientist uses an ORE package like OREdm, 324 00:16:32.720 --> 00:16:36.200 which maps directly to Oracle Data Mining algorithms, the execution 325 00:16:36.240 --> 00:16:39.080 happens entirely inside the kernel using parallel processing. 326 00:16:39.360 --> 00:16:42.039 But wait, what if they are using a custom third 327 00:16:42.039 --> 00:16:45.919 party R package that Oracle doesn't natively map to? How 328 00:16:45.960 --> 00:16:47.440 do you keep the memory from crashing 329 00:16:47.480 --> 00:16:51.080 then? That's where functions like ore.rowApply come in. 330 00:16:51.559 --> 00:16:54.559 It allows the database server to partition the massive data 331 00:16:54.559 --> 00:16:58.480 set into manageable chunks, spawn multiple R engines directly on 332 00:16:58.519 --> 00:17:01.759 the server in parallel, feed the data chunks to those engines, 333 00:17:02.159 --> 00:17:03.879 and then reassemble the results at the end. 334 00:17:03.960 --> 00:17:05.079 Oh, that's incredibly smart. 335 00:17:05.160 --> 00:17:08.400 Yeah, you get the full analytical power of custom R 336 00:17:08.519 --> 00:17:11.720 packages without ever moving the data across the network or 337 00:17:11.759 --> 00:17:13.400 overwhelming a single machine's RAM. 338 00:17:13.680 --> 00:17:16.960 If we connect this to the bigger picture, this integration 339 00:17:17.240 --> 00:17:20.319 is the ultimate bridge. You are taking the rapid innovation 340 00:17:20.559 --> 00:17:23.799 and the massive crowdsourced brilliance of the open-source 341 00:17:23.920 --> 00:17:27.640 R community, and you're seamlessly plugging it directly into the 342 00:17:27.680 --> 00:17:31.759 heavy-duty, industrial-scale processing power of an enterprise database. 343 00:17:31.839 --> 00:17:34.400 You get the best of both worlds. 
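The partition, fan out, and reassemble pattern described for row-wise execution can be sketched in Python. This is an analogy only, not the ORE API: `row_apply`, `chunked`, and `flag_large` are hypothetical names, and a thread pool stands in for the R engines spawned on the database server.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, chunk_size):
    """Yield consecutive slices of rows, each at most chunk_size long."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def row_apply(rows, func, chunk_size, workers=4):
    """Apply func to each chunk in parallel and stitch the results back together in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(func, chunked(rows, chunk_size))
        return [item for part in partial_results for item in part]


def flag_large(chunk):
    """Hypothetical per-chunk routine standing in for a custom third-party function."""
    return [(txn_id, amount > 1000) for txn_id, amount in chunk]


# Hypothetical transactions: (id, amount).
transactions = [(i, i * 300) for i in range(10)]

flags = row_apply(transactions, flag_large, chunk_size=3)
print(flags)
```

Each worker only ever holds one chunk, which is the property that keeps a single engine's memory from being overwhelmed; `pool.map` returns results in submission order, so the reassembled output lines up with the input rows.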
The agility of 344 00:17:34.440 --> 00:17:38.480 open-source statistical libraries combined with the scalability, parallel execution, 345 00:17:38.599 --> 00:17:41.119 and strict security of a tier-one database. You just 346 00:17:41.160 --> 00:17:42.599 don't have to compromise anymore. 347 00:17:42.759 --> 00:17:46.240 This has been a really fascinating exploration. We've completely broken 348 00:17:46.319 --> 00:17:50.799 down why the traditional paradigm of data science, you know, 349 00:17:50.880 --> 00:17:53.039 extracting the data from the source and moving it to 350 00:17:53.039 --> 00:17:57.519 the math, is just a fragile, bottlenecked system. And we've 351 00:17:57.559 --> 00:18:00.799 seen how inverting that paradigm, bringing the math directly to the 352 00:18:00.880 --> 00:18:04.559 data through Oracle Data Miner, advanced SQL analytics, and that 353 00:18:04.680 --> 00:18:08.680 awesome transparency layer of Oracle R Enterprise, allows organizations to 354 00:18:08.759 --> 00:18:13.200 execute real-time, highly scalable predictions without moving a single 355 00:18:13.240 --> 00:18:14.119 byte of data. 356 00:18:14.240 --> 00:18:16.960 It permanently alters the velocity at which an enterprise can 357 00:18:17.000 --> 00:18:18.319 generate actionable foresight. 358 00:18:18.599 --> 00:18:22.519 It changes everything, and for you listening, understanding this architectural 359 00:18:22.559 --> 00:18:25.160 shift puts you at a massive advantage, whether you are 360 00:18:25.240 --> 00:18:27.839 architecting a back end, leading a business unit, or just 361 00:18:27.960 --> 00:18:30.680 tracking the evolution of AI. Knowing that data movement and 362 00:18:30.759 --> 00:18:33.839 data preparation are the true hidden constraints of machine learning 363 00:18:34.200 --> 00:18:37.640 really changes how you should evaluate every new tech solution 364 00:18:37.759 --> 00:18:38.440 on the market. 
365 00:18:38.720 --> 00:18:42.039 It absolutely should dictate your strategy. And I want to leave 366 00:18:42.079 --> 00:18:44.519 you with a final thought. Overall, we spent 367 00:18:44.519 --> 00:18:47.400 a lot of time discussing the top tier of that 368 00:18:47.480 --> 00:18:53.079 automation pyramid, the automated statistician. With tools actively translating code, 369 00:18:53.200 --> 00:18:57.720 automatically normalizing variables, and letting decision trees handle data prep, 370 00:18:58.400 --> 00:19:01.359 the mechanical friction of data science is vanishing. 371 00:19:01.400 --> 00:19:02.720 Yeah, it's getting so automated. 372 00:19:03.000 --> 00:19:05.960 So if this trend accelerates over the next decade, what 373 00:19:06.119 --> 00:19:09.720 happens to the human data scientist? Will the prestigious role 374 00:19:09.720 --> 00:19:12.960 of data scientist eventually pivot away from writing code entirely, 375 00:19:13.599 --> 00:19:17.480 transforming them into business strategists who simply understand how to 376 00:19:17.559 --> 00:19:19.880 ask a database the right strategic question? 377 00:19:20.319 --> 00:19:23.880 It's an incredible thought. From spending eight hours peeling potatoes 378 00:19:23.920 --> 00:19:26.400 to finally just sitting at the chef's table and designing 379 00:19:26.440 --> 00:19:28.599 the menu. Thanks for taking the deep dive with us.