WEBVTT

1
00:00:00.120 --> 00:00:03.680
<v Speaker 1>Imagine, right, imagine a master chef, like a world class chef,

2
00:00:03.759 --> 00:00:08.400
<v Speaker 1>starring in this massive primetime thirty minute cooking show.

3
00:00:08.480 --> 00:00:09.839
<v Speaker 2>Okay, I'm picturing it.

4
00:00:09.880 --> 00:00:12.679
<v Speaker 1>Right. So the camera rolls, the steak is perfectly seared, the

5
00:00:12.839 --> 00:00:16.399
<v Speaker 1>sauce is just flawless, and the audience goes completely wild.

6
00:00:17.079 --> 00:00:20.239
<v Speaker 1>But what you don't see, though, what's hidden off camera,

7
00:00:20.359 --> 00:00:23.399
<v Speaker 1>is that this chef spent the previous eight hours standing

8
00:00:23.440 --> 00:00:27.160
<v Speaker 1>in a back alley literally crying while peeling potatoes and

9
00:00:27.199 --> 00:00:29.679
<v Speaker 1>chopping like ten thousand onions.

10
00:00:29.760 --> 00:00:31.800
<v Speaker 2>Yeah, that sounds about right for the industry.

11
00:00:31.480 --> 00:00:34.320
<v Speaker 1>It's crazy. And in the technology world, that chef is

12
00:00:34.359 --> 00:00:37.600
<v Speaker 1>a data scientist. So today we are looking at how

13
00:00:37.719 --> 00:00:41.399
<v Speaker 1>enterprise organizations are finally, you know, firing the potato peelers.

14
00:00:41.479 --> 00:00:43.719
<v Speaker 2>It is a critical shift. Honestly, we are looking at

15
00:00:43.759 --> 00:00:47.359
<v Speaker 2>a fundamental rewrite of the architecture. Like, the actual plumbing

16
00:00:47.399 --> 00:00:50.240
<v Speaker 2>of predictive analytics is being completely rerouted at the

17
00:00:50.359 --> 00:00:54.840
<v Speaker 2>enterprise level just to eliminate these massive systemic inefficiencies.

18
00:00:55.159 --> 00:00:58.039
<v Speaker 1>Exactly. So, if you want a shortcut to understanding how

19
00:00:58.119 --> 00:01:02.200
<v Speaker 1>big businesses are actually transforming their standard everyday databases

20
00:01:02.719 --> 00:01:07.799
<v Speaker 1>into these automated predictive engines, well, this is it. Today's

21
00:01:07.840 --> 00:01:10.480
<v Speaker 1>deep dive is based on excerpts from the book Data

22
00:01:10.519 --> 00:01:14.439
<v Speaker 1>Science Using Oracle Data Miner and Oracle R Enterprise.

23
00:01:14.120 --> 00:01:15.879
<v Speaker 2>Which is a fantastic resource, by the way.

24
00:01:15.879 --> 00:01:19.200
<v Speaker 1>Oh, totally. We're exploring how bringing the math directly

25
00:01:19.200 --> 00:01:22.120
<v Speaker 1>to the data solves honestly one of the single biggest

26
00:01:22.159 --> 00:01:26.159
<v Speaker 1>bottlenecks in modern tech. So okay, let's unpack this because

27
00:01:26.200 --> 00:01:29.159
<v Speaker 1>the way a lot of organizations currently execute data science

28
00:01:29.280 --> 00:01:31.760
<v Speaker 1>is just fundamentally broken.

29
00:01:32.040 --> 00:01:34.879
<v Speaker 2>It really is. What's fascinating here is that the real secret to effective

30
00:01:34.920 --> 00:01:39.000
<v Speaker 2>data science at scale isn't necessarily about inventing a more

31
00:01:39.040 --> 00:01:42.879
<v Speaker 2>complex neural network or a better algorithm. It's really about

32
00:01:42.920 --> 00:01:45.239
<v Speaker 2>where those algorithms are physically executed.

33
00:01:45.280 --> 00:01:46.680
<v Speaker 1>Right. The location matters.

34
00:01:47.000 --> 00:01:50.200
<v Speaker 2>Exactly. Moving massive amounts of data across networks just to analyze

35
00:01:50.200 --> 00:01:53.760
<v Speaker 2>it is a critical, costly mistake, and the infrastructure we

36
00:01:53.760 --> 00:01:57.280
<v Speaker 2>we're exploring today was built specifically to stop that movement entirely.

37
00:01:57.480 --> 00:02:00.400
<v Speaker 1>So to understand the fix, I feel like we first

38
00:02:00.400 --> 00:02:04.239
<v Speaker 1>have to understand why data scientists are stuck chopping those

39
00:02:04.280 --> 00:02:07.040
<v Speaker 1>onions in the first place. Because if you look at

40
00:02:07.040 --> 00:02:13.680
<v Speaker 1>standard industry frameworks like the CRISP-DM methodology, you'd probably assume

41
00:02:13.759 --> 00:02:17.400
<v Speaker 1>the actual modeling, like the glamorous machine learning part, is

42
00:02:17.439 --> 00:02:18.400
<v Speaker 1>where all the time goes.

43
00:02:18.520 --> 00:02:21.479
<v Speaker 2>Yeah, that is the assumption, but the reality is heavily,

44
00:02:21.520 --> 00:02:26.120
<v Speaker 2>heavily skewed. In almost any enterprise deployment, data preparation takes

45
00:02:26.199 --> 00:02:28.800
<v Speaker 2>up a staggering sixty to eighty percent of the total

46
00:02:28.840 --> 00:02:29.479
<v Speaker 2>project effort.

47
00:02:29.639 --> 00:02:31.680
<v Speaker 1>Sixty to eighty percent. That's insane.

48
00:02:32.000 --> 00:02:35.520
<v Speaker 2>It is because real world data is inherently dirty, it's skewed,

49
00:02:35.560 --> 00:02:38.439
<v Speaker 2>it's riddled with missing values. It's just a mess.

50
00:02:38.680 --> 00:02:40.280
<v Speaker 1>Right. So if the vast majority of the job is just

51
00:02:40.360 --> 00:02:43.599
<v Speaker 1>cleaning up missing variables and formatting timestamps, why is the

52
00:02:43.599 --> 00:02:46.080
<v Speaker 1>title data scientist and not data janitor?

53
00:02:46.599 --> 00:02:49.719
<v Speaker 2>Well, because that janitoring actually dictates the success or failure

54
00:02:49.719 --> 00:02:52.520
<v Speaker 2>of the entire model. In predictive analytics, there is basically

55
00:02:52.560 --> 00:02:55.879
<v Speaker 2>an ironclad rule. A simple regression model built on perfectly

56
00:02:55.919 --> 00:02:59.719
<v Speaker 2>clean data will consistently outperform a highly sophisticated deep learning

57
00:02:59.719 --> 00:03:01.599
<v Speaker 2>model that has been fed dirty data.

58
00:03:01.639 --> 00:03:05.719
<v Speaker 1>Wow, really? So the clean data beats the complex math?

59
00:03:05.520 --> 00:03:08.719
<v Speaker 2>Every single time, because if you don't handle the anomalies, your

60
00:03:08.759 --> 00:03:13.280
<v Speaker 2>model simply learns the noise. It just memorizes the mistakes.

61
00:03:13.039 --> 00:03:15.319
<v Speaker 1>But spending all that time cleaning. I mean, that's the

62
00:03:15.360 --> 00:03:19.039
<v Speaker 1>antithesis of business agility, right? Like, if a telecom company

63
00:03:19.080 --> 00:03:22.159
<v Speaker 1>wants to predict customer churn this month, they can't afford

64
00:03:22.199 --> 00:03:25.680
<v Speaker 1>to spend three weeks manually cleaning billing data first.

65
00:03:25.599 --> 00:03:28.520
<v Speaker 2>Exactly. And that naturally leads us to the data science

66
00:03:28.560 --> 00:03:32.560
<v Speaker 2>automation pyramid from the source material. To move fast, you

67
00:03:32.560 --> 00:03:34.319
<v Speaker 2>have to automate the data pipeline.

68
00:03:34.439 --> 00:03:36.879
<v Speaker 1>Okay, So walk us through this pyramid. What's at the bottom.

69
00:03:36.960 --> 00:03:40.039
<v Speaker 2>At the base, you have problem specific automation. This is

70
00:03:40.080 --> 00:03:45.599
<v Speaker 2>automating a single, rigid workflow, like maybe a monthly sales

71
00:03:45.639 --> 00:03:47.919
<v Speaker 2>forecast that runs exactly the same way every time.

72
00:03:48.039 --> 00:03:50.000
<v Speaker 1>Got it. Just basic scripting.

73
00:03:50.319 --> 00:03:53.240
<v Speaker 2>Right. Then above that you have repetitive task automation, which is

74
00:03:53.280 --> 00:03:56.919
<v Speaker 2>where you build generalized scripts to automatically handle missing values

75
00:03:57.039 --> 00:03:59.919
<v Speaker 2>or transform columns across various different data sets.

76
00:04:00.039 --> 00:04:02.080
<v Speaker 1>So that's taking away a lot of the manual janitor

77
00:04:02.159 --> 00:04:02.960
<v Speaker 1>work.

78
00:04:03.000 --> 00:04:05.159
<v Speaker 2>Exactly. It frees up the human to do actual science.

79
00:04:05.360 --> 00:04:07.319
<v Speaker 1>Okay, so what's at the very top of the pyramid, then?

80
00:04:07.400 --> 00:04:11.400
<v Speaker 2>The automated statistician. Ooh, that sounds intense. It is.

81
00:04:11.879 --> 00:04:14.840
<v Speaker 2>This is an environment where the system evaluates the underlying

82
00:04:14.919 --> 00:04:19.279
<v Speaker 2>data structures, learns the patterns, and automatically selects the most

83
00:04:19.360 --> 00:04:23.240
<v Speaker 2>optimal algorithm without requiring a human to manually tune the

84
00:04:23.319 --> 00:04:24.199
<v Speaker 2>hyperparameters.

85
00:04:24.680 --> 00:04:26.959
<v Speaker 1>Wait, getting to that top tier sounds incredible, but I

86
00:04:27.000 --> 00:04:30.680
<v Speaker 1>mean the glaring issue here is the friction of traditional architecture.

87
00:04:30.800 --> 00:04:30.959
<v Speaker 2>Right.

88
00:04:31.879 --> 00:04:35.120
<v Speaker 1>Historically, when a data scientist wanted to run those repetitive

89
00:04:35.240 --> 00:04:38.920
<v Speaker 1>data cleaning scripts, they were using client side tools like

90
00:04:39.040 --> 00:04:42.639
<v Speaker 1>Python or open source R or SAS on their laptops.

91
00:04:42.759 --> 00:04:45.560
<v Speaker 2>Yeah, which means they had to extract the data. Traditional

92
00:04:45.560 --> 00:04:49.160
<v Speaker 2>analytical environments basically sit on a separate application server or

93
00:04:49.199 --> 00:04:52.720
<v Speaker 2>on the data scientist's local machine. So to run your model,

94
00:04:52.800 --> 00:04:56.480
<v Speaker 2>you have to query your central enterprise database, extract gigabytes

95
00:04:56.560 --> 00:04:59.399
<v Speaker 2>or sometimes terabytes of data, push it over the network,

96
00:05:00.079 --> 00:05:02.839
<v Speaker 2>load it into the memory of your analytical tool, process it,

97
00:05:03.079 --> 00:05:04.680
<v Speaker 2>and then attempt to write the results back.

98
00:05:04.920 --> 00:05:08.199
<v Speaker 1>It sounds exhausting just describing it. So, bringing it back to

99
00:05:08.240 --> 00:05:12.319
<v Speaker 1>our chef analogy, traditional data science is like storing all

100
00:05:12.360 --> 00:05:15.680
<v Speaker 1>your raw ingredients in this massive warehouse all the way

101
00:05:15.680 --> 00:05:18.800
<v Speaker 1>across town. Yes, exactly. And every single time you want

102
00:05:18.800 --> 00:05:20.920
<v Speaker 1>to test a new recipe, you literally have to drive

103
00:05:20.959 --> 00:05:23.720
<v Speaker 1>a semi truck across the city, load up the ingredients,

104
00:05:23.920 --> 00:05:26.560
<v Speaker 1>drive back to your kitchen, cook the meal, and then

105
00:05:26.639 --> 00:05:28.800
<v Speaker 1>drive the leftovers back to the warehouse.

106
00:05:29.000 --> 00:05:33.319
<v Speaker 2>It's catastrophic for efficiency. You hit network I/O bottlenecks immediately,

107
00:05:33.839 --> 00:05:38.600
<v Speaker 2>you hit integration failures, and most importantly, client based tools

108
00:05:38.759 --> 00:05:41.199
<v Speaker 2>simply choke because they require the data set to be

109
00:05:41.240 --> 00:05:42.560
<v Speaker 2>loaded into active RAM.

110
00:05:42.720 --> 00:05:44.240
<v Speaker 1>Right. You can't just cram everything in there.

111
00:05:44.360 --> 00:05:47.639
<v Speaker 2>No, you absolutely cannot load a two terabyte customer table

112
00:05:47.680 --> 00:05:50.759
<v Speaker 2>into the RAM of a standard application server. It will crash.

113
00:05:51.120 --> 00:05:53.720
<v Speaker 1>Here's where it gets really interesting, though, because the solution

114
00:05:53.879 --> 00:05:58.839
<v Speaker 1>presented in this architecture, specifically Oracle Advanced Analytics, basically just

115
00:05:59.120 --> 00:06:01.759
<v Speaker 1>builds the kitchen inside the warehouse.

116
00:06:02.040 --> 00:06:05.920
<v Speaker 2>Precisely. Oracle Advanced Analytics, or OAA, operates directly on top of

117
00:06:05.959 --> 00:06:09.160
<v Speaker 2>the Oracle database kernel. The data never actually moves.

118
00:06:09.079 --> 00:06:11.399
<v Speaker 1>So you're just cooking where the food is.

119
00:06:11.759 --> 00:06:17.120
<v Speaker 2>Exactly. By eliminating data extraction, you effectively achieve zero latency in

120
00:06:17.160 --> 00:06:18.279
<v Speaker 2>your data pipeline.

121
00:06:18.519 --> 00:06:20.680
<v Speaker 1>And I imagine you bypass the memory limits of a

122
00:06:20.680 --> 00:06:25.360
<v Speaker 1>local machine entirely, right? Because the database is already optimized

123
00:06:25.399 --> 00:06:28.240
<v Speaker 1>to query and process data directly from storage.

124
00:06:28.360 --> 00:06:31.560
<v Speaker 2>Oh, absolutely. Plus, you don't have to worry about security

125
00:06:31.560 --> 00:06:34.920
<v Speaker 2>protocols breaking down over some open network connection.

126
00:06:34.800 --> 00:06:37.680
<v Speaker 1>Right, because once the data leaves the database, you've kind of

127
00:06:37.720 --> 00:06:39.199
<v Speaker 1>lost control over who sees it.

128
00:06:39.279 --> 00:06:42.399
<v Speaker 2>Security is a massive factor here. The data remains governed

129
00:06:42.399 --> 00:06:45.959
<v Speaker 2>by the strict native security policies of the database itself.

130
00:06:46.639 --> 00:06:50.800
<v Speaker 2>But from a purely performance standpoint, executing inside the kernel

131
00:06:50.959 --> 00:06:55.839
<v Speaker 2>allows the algorithms to leverage Oracle's parallel processing capabilities.

132
00:06:55.319 --> 00:06:58.759
<v Speaker 1>Meaning instead of one computer churning through the data row

133
00:06:58.800 --> 00:07:01.480
<v Speaker 1>by row by row, the database can split the job

134
00:07:01.560 --> 00:07:05.319
<v Speaker 1>across dozens of internal processors simultaneously.

135
00:07:04.680 --> 00:07:07.879
<v Speaker 2>Exactly, and when predictive models run directly inside the kernel,

136
00:07:08.240 --> 00:07:11.800
<v Speaker 2>the whole business posture shifts. You aren't extracting data to

137
00:07:11.879 --> 00:07:14.639
<v Speaker 2>run some post mortem analysis of what happened last quarter.

138
00:07:14.720 --> 00:07:16.920
<v Speaker 1>Yeah, it's not looking backward anymore.

139
00:07:16.639 --> 00:07:20.360
<v Speaker 2>Right. You are querying the database in real time to ask,

140
00:07:20.879 --> 00:07:24.720
<v Speaker 2>what is the probability this specific transaction happening right now

141
00:07:25.279 --> 00:07:26.040
<v Speaker 2>is fraudulent?

142
00:07:26.439 --> 00:07:29.399
<v Speaker 1>Okay, So the kitchen is inside the warehouse, which is great,

143
00:07:29.639 --> 00:07:31.800
<v Speaker 1>but we still have to do the sixty to eighty

144
00:07:31.800 --> 00:07:35.240
<v Speaker 1>percent of the workload that involves data preparation, right? I mean,

145
00:07:35.279 --> 00:07:37.720
<v Speaker 1>bringing the math to the data doesn't magically clean it.

146
00:07:38.319 --> 00:07:40.959
<v Speaker 1>We still need the tools to handle anomalies.

147
00:07:41.199 --> 00:07:44.160
<v Speaker 2>Yes we do, and that is handled through an in

148
00:07:44.279 --> 00:07:47.800
<v Speaker 2>database PL/SQL package called DBMS_DATA_MINING_TRANSFORM.

149
00:07:47.920 --> 00:07:50.720
<v Speaker 1>Okay, quite a mouthful.

150
00:07:50.399 --> 00:07:53.600
<v Speaker 2>It is, yeah. But basically this is the toolkit for managing that massive

151
00:07:53.680 --> 00:07:56.480
<v Speaker 2>data preparation phase without ever leaving the database.

152
00:07:56.759 --> 00:08:00.439
<v Speaker 1>So let's talk about how this toolkit actually works, specifically with outliers.

153
00:08:00.879 --> 00:08:03.439
<v Speaker 1>Because if you have an e-commerce platform and you're analyzing,

154
00:08:03.439 --> 00:08:06.439
<v Speaker 1>say, average order value, one user buying a fifty

155
00:08:06.480 --> 00:08:10.040
<v Speaker 1>thousand dollar watch is going to completely skew your standard distribution.

156
00:08:10.199 --> 00:08:11.279
<v Speaker 2>Oh, absolutely.

157
00:08:11.199 --> 00:08:14.240
<v Speaker 1>So the toolkit handles this using what they call winsorizing or trimming.

158
00:08:14.560 --> 00:08:18.560
<v Speaker 2>Right, unhandled extreme values will drag the mean of your

159
00:08:18.639 --> 00:08:22.439
<v Speaker 2>data so far from the median that any distance based

160
00:08:22.519 --> 00:08:26.639
<v Speaker 2>algorithm you use will generate completely erroneous clusters. It'll just

161
00:08:26.759 --> 00:08:27.519
<v Speaker 2>ruin the model.

162
00:08:28.120 --> 00:08:31.079
<v Speaker 1>So if you're using this PL/SQL package, how do these

163
00:08:31.120 --> 00:08:35.759
<v Speaker 1>two methods winsorizing and trimming mechanically solve that watch problem.

164
00:08:35.919 --> 00:08:39.320
<v Speaker 2>Well, trimming is the brute force approach. It literally clips

165
00:08:39.840 --> 00:08:43.320
<v Speaker 2>the extreme tail ends of your distribution, say the top

166
00:08:43.360 --> 00:08:46.679
<v Speaker 2>one percent of values, and just sets them to NULL.

167
00:08:46.480 --> 00:08:48.679
<v Speaker 1>Just deletes them, basically.

168
00:08:48.720 --> 00:08:51.840
<v Speaker 2>Effectively, yes. Winsorizing, on the other hand, is much more elegant.

169
00:08:52.000 --> 00:08:55.440
<v Speaker 2>Instead of removing the data point entirely, it caps it.

170
00:08:55.440 --> 00:08:59.320
<v Speaker 2>It replaces those extreme tail values with a specified maximum parameter,

171
00:08:59.600 --> 00:09:02.679
<v Speaker 2>pulling the outlier back into the acceptable edge of the distribution.

172
00:09:03.039 --> 00:09:05.799
<v Speaker 1>Oh, I see. So winsorizing is like taking a person

173
00:09:05.840 --> 00:09:08.360
<v Speaker 1>who's screaming through a megaphone in a crowded room and

174
00:09:08.399 --> 00:09:11.120
<v Speaker 1>forcing them to just whisper, while trimming is just

175
00:09:11.159 --> 00:09:13.240
<v Speaker 1>throwing them out of the building entirely so they don't

176
00:09:13.279 --> 00:09:14.039
<v Speaker 1>ruin the party.
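
NOTE
Editorial sketch, not taken from the episode or the book: the trimming versus
winsorizing distinction described above, in plain Python. The function names,
the sample order values, and the choice to treat both tails symmetrically are
all assumptions made for illustration. This is not the PL/SQL package's API.
```python
def percentile_bounds(values, tail_fraction):
    """Return (low, high) cut points that leave tail_fraction of the rows in each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * tail_fraction)  # rows treated as extreme in each tail
    return ordered[k], ordered[len(ordered) - 1 - k]
#
def trim(values, tail_fraction):
    """Trimming: values outside the cut points become None (the analogue of NULL)."""
    low, high = percentile_bounds(values, tail_fraction)
    return [v if low <= v <= high else None for v in values]
#
def winsorize(values, tail_fraction):
    """Winsorizing: values outside the cut points are capped at the nearest cut point."""
    low, high = percentile_bounds(values, tail_fraction)
    return [min(max(v, low), high) for v in values]
#
# Hypothetical order values; the last one stands in for the expensive watch.
orders = [20, 35, 40, 45, 50, 55, 60, 70, 80, 50000]
trimmed = trim(orders, 0.1)      # extreme rows are blanked out
capped = winsorize(orders, 0.1)  # extreme rows are pulled back to the edge
print(trimmed)
print(capped)
```
Trimming loses the row's value entirely, while winsorizing keeps the row and
only limits its influence, which is the trade-off the speakers describe.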

177
00:09:14.279 --> 00:09:16.840
<v Speaker 2>That's a great way to visualize it. Yes, but dealing

178
00:09:16.879 --> 00:09:19.000
<v Speaker 2>with outliers is just the first step. You also have

179
00:09:19.039 --> 00:09:22.159
<v Speaker 2>to normalize the data before you apply the algorithm.

180
00:09:21.759 --> 00:09:25.559
<v Speaker 1>Right normalization, which is bringing variables to a uniform scale

181
00:09:25.720 --> 00:09:29.840
<v Speaker 1>using min max or z score calculations. Because and correct

182
00:09:29.879 --> 00:09:32.919
<v Speaker 1>me if I'm wrong. If you feed an algorithm a

183
00:09:32.960 --> 00:09:36.759
<v Speaker 1>customer's age, which is a two digit number, alongside their

184
00:09:36.919 --> 00:09:40.720
<v Speaker 1>annual income, which is a six digit number, the geometry

185
00:09:40.720 --> 00:09:44.480
<v Speaker 1>of the algorithm will mathematically assume the income is exponentially

186
00:09:44.480 --> 00:09:46.759
<v Speaker 1>more important, just simply because the integer is larger.

187
00:09:46.799 --> 00:09:50.919
<v Speaker 2>You nailed it. The algorithm operates on mathematical distance. If

188
00:09:50.960 --> 00:09:53.840
<v Speaker 2>you don't scale the inputs to a uniform magnitude, your

189
00:09:53.879 --> 00:09:55.440
<v Speaker 2>model is practically useless.
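
NOTE
Editorial sketch, not taken from the episode or the book: min-max and z-score
scaling of an age column and an income column, using only the Python standard
library. The helper names and sample values are hypothetical, and the code
assumes each column has at least two distinct values so no division by zero occurs.
```python
import statistics
#
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
#
def z_score_scale(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]
#
# Hypothetical customers: two-digit ages next to six-digit incomes.
ages = [25, 35, 45, 55]
incomes = [40000, 80000, 120000, 160000]
print(min_max_scale(ages), min_max_scale(incomes))
print(z_score_scale(ages), z_score_scale(incomes))
```
After either transformation the two columns sit on a comparable scale, so a
distance-based algorithm no longer weights income more heavily just because
its raw numbers are larger.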

190
00:09:55.519 --> 00:09:57.840
<v Speaker 1>It just gets confused by the big numbers.

191
00:09:58.200 --> 00:10:02.080
<v Speaker 2>Exactly. But the toolkit goes beyond scaling. It also performs complex binning,

192
00:10:02.559 --> 00:10:05.720
<v Speaker 2>which is transforming continuous data into discrete categories.

193
00:10:05.840 --> 00:10:08.200
<v Speaker 1>Yeah. The supervised binning feature is what really caught my

194
00:10:08.240 --> 00:10:10.480
<v Speaker 1>eye in the source text because instead of a human

195
00:10:10.639 --> 00:10:15.039
<v Speaker 1>arbitrarily deciding that high income starts at exactly one hundred

196
00:10:15.039 --> 00:10:18.559
<v Speaker 1>thousand dollars, supervised binning automates the logic.

197
00:10:18.639 --> 00:10:20.960
<v Speaker 2>It does. It uses a decision tree algorithm under the hood, so

198
00:10:21.000 --> 00:10:25.159
<v Speaker 2>the system analyzes the data's relationship to your target outcome,

199
00:10:25.559 --> 00:10:28.080
<v Speaker 2>like whether a customer churned or not. If the decision

200
00:10:28.080 --> 00:10:32.120
<v Speaker 2>tree determines that a massive spike in churn happens specifically when

201
00:10:32.159 --> 00:10:36.480
<v Speaker 2>income drops below, let's say, sixty-four thousand three hundred dollars,

202
00:10:36.799 --> 00:10:38.919
<v Speaker 2>it sets the bin boundary exactly there.

203
00:10:39.240 --> 00:10:39.759
<v Speaker 1>Oh wow.

204
00:10:39.840 --> 00:10:42.720
<v Speaker 2>Yeah, it lets the predictive power of the data dictate

205
00:10:42.799 --> 00:10:45.519
<v Speaker 2>the categorization completely, removing human bias.
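
NOTE
Editorial sketch, not taken from the episode or the book: the idea behind
supervised binning, reduced to a single decision-tree split chosen by Gini
impurity. The function names and the income and churn values are hypothetical,
and a real implementation would search for several boundaries, not just one.
```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)
#
def best_split(values, labels):
    """Return the threshold that minimizes the size-weighted impurity of both sides."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no boundary can be placed between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold
#
# Hypothetical customers: income, and whether the customer churned (1) or stayed (0).
incomes = [30000, 45000, 60000, 64000, 65000, 80000, 120000]
churned = [1, 1, 1, 1, 0, 0, 0]
boundary = best_split(incomes, churned)
print(boundary)
```
The bin boundary lands wherever the target outcome changes most sharply,
not at a round number picked by a person.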

206
00:10:46.039 --> 00:10:48.519
<v Speaker 1>Well, wait, if I'm just a standard database administrator or

207
00:10:48.559 --> 00:10:50.759
<v Speaker 1>a business analyst running this, how do I know if

208
00:10:50.759 --> 00:10:53.879
<v Speaker 1>the algorithm I want to use requires min-max scaling or

209
00:10:53.919 --> 00:10:58.200
<v Speaker 1>a z-score or supervised binning? Like, I wouldn't know that.

210
00:10:57.840 --> 00:11:01.000
<v Speaker 2>And you frequently don't need to know. Oracle

211
00:11:01.080 --> 00:11:05.519
<v Speaker 2>utilizes a feature called Automatic Data Preparation, or ADP. Nice.

212
00:11:05.639 --> 00:11:10.120
<v Speaker 2>When enabled, ADP intercepts your request, evaluates the specific algorithm

213
00:11:10.159 --> 00:11:13.519
<v Speaker 2>you've chosen, say a support vector machine, which strictly requires

214
00:11:13.559 --> 00:11:17.759
<v Speaker 2>normalized inputs, and it automatically executes the correct mathematical transformations

215
00:11:17.759 --> 00:11:19.960
<v Speaker 2>inside the kernel before running the model.

216
00:11:20.120 --> 00:11:22.799
<v Speaker 1>That is so cool. It handles the prerequisites dynamically, so

217
00:11:22.840 --> 00:11:25.279
<v Speaker 1>the data is prepped, the environment is secure, and we

218
00:11:25.360 --> 00:11:27.679
<v Speaker 1>are finally ready for the top tier of that pyramid

219
00:11:27.759 --> 00:11:29.840
<v Speaker 1>we talked about, the automated statistician.

220
00:11:30.200 --> 00:11:33.200
<v Speaker 2>Yes, this is where we look at the DBMS_PREDICTIVE_ANALYTICS

221
00:11:33.240 --> 00:11:38.799
<v Speaker 2>package. It contains three highly automated APIs: Predict, Explain,

222
00:11:39.080 --> 00:11:39.759
<v Speaker 2>and Profile.

223
00:11:40.000 --> 00:11:43.960
<v Speaker 1>Right. Predict automatically generates an outcome variable. Explain ranks the

224
00:11:43.960 --> 00:11:48.120
<v Speaker 1>importance of the independent variables, and Profile extracts the core

225
00:11:48.200 --> 00:11:51.080
<v Speaker 1>business rules the model found. You literally just pass it

226
00:11:51.159 --> 00:11:53.279
<v Speaker 1>the table name and the target column and it does

227
00:11:53.320 --> 00:11:53.720
<v Speaker 1>the rest.

228
00:11:54.000 --> 00:11:55.399
<v Speaker 2>It really is that straightforward.

229
00:11:55.440 --> 00:11:57.440
<v Speaker 1>So what does this all mean? It sounds like we

230
00:11:57.480 --> 00:12:02.360
<v Speaker 1>are completely democratizing machine learning, allowing average SQL users to

231
00:12:02.399 --> 00:12:04.360
<v Speaker 1>perform data science without knowing the math.

232
00:12:04.559 --> 00:12:08.600
<v Speaker 2>It absolutely democratizes access, but this raises an important question.

233
00:12:09.000 --> 00:12:11.799
<v Speaker 2>Is it safe to lower the barrier to entry that far?

234
00:12:11.960 --> 00:12:13.159
<v Speaker 1>That's a very fair point.

235
00:12:13.279 --> 00:12:16.200
<v Speaker 2>When an analyst just presses a predict button, they are

236
00:12:16.279 --> 00:12:20.159
<v Speaker 2>essentially trusting a black box. The system is making vast

237
00:12:20.159 --> 00:12:22.000
<v Speaker 2>mathematical assumptions on their behalf.

238
00:12:22.120 --> 00:12:24.440
<v Speaker 1>I agree, and frankly, I'm a bit skeptical. If you

239
00:12:24.559 --> 00:12:27.240
<v Speaker 1>let an average user bypass the math and just blindly

240
00:12:27.240 --> 00:12:30.320
<v Speaker 1>apply predictive models to their company's revenue data, aren't we

241
00:12:30.399 --> 00:12:34.519
<v Speaker 1>just accelerating how fast they can make a catastrophic business decision.

242
00:12:34.759 --> 00:12:38.320
<v Speaker 2>That is the inherent risk of democratization, without a doubt.

243
00:12:38.799 --> 00:12:41.759
<v Speaker 2>If the user doesn't understand the underlying assumptions of the models,

244
00:12:42.399 --> 00:12:47.000
<v Speaker 2>the results can be dangerous. Most powerful parametric machine learning

245
00:12:47.000 --> 00:12:51.200
<v Speaker 2>models assume that your underlying data follows a normal bell

246
00:12:51.240 --> 00:12:54.759
<v Speaker 2>curve distribution. Right. If you feed them highly skewed,

247
00:12:54.799 --> 00:12:59.279
<v Speaker 2>non-normal data, the predictions will be mathematically invalid, period.

248
00:12:59.039 --> 00:13:02.120
<v Speaker 1>Which is why having statistical tests built directly into the

249
00:13:02.200 --> 00:13:04.480
<v Speaker 1>database is so critical, I guess. You don't have to

250
00:13:04.519 --> 00:13:07.960
<v Speaker 1>blindly trust the black box. You can use native SQL

251
00:13:08.000 --> 00:13:11.559
<v Speaker 1>functions like the Shapiro-Wilk test to evaluate the normality

252
00:13:11.559 --> 00:13:13.919
<v Speaker 1>of your distribution right there in the query.

253
00:13:13.960 --> 00:13:18.120
<v Speaker 2>Exactly. Shapiro-Wilk evaluates the null hypothesis that your sample came

254
00:13:18.120 --> 00:13:19.919
<v Speaker 2>from a normally distributed population.

255
00:13:20.320 --> 00:13:22.159
<v Speaker 1>Okay, so if I run that SQL query and it

256
00:13:22.159 --> 00:13:23.480
<v Speaker 1>returns a p-value

257
00:13:23.200 --> 00:13:26.360
<v Speaker 2>of zero, you instantly know your data is non-normal.

258
00:13:26.840 --> 00:13:29.120
<v Speaker 2>You can test your assumptions without having to extract the

259
00:13:29.200 --> 00:13:32.759
<v Speaker 2>data to a specialized statistical software package. It's all right there.

260
00:13:33.000 --> 00:13:37.000
<v Speaker 1>And the analytical capabilities of modern SQL don't stop there.

261
00:13:37.519 --> 00:13:40.600
<v Speaker 1>The source material dives into functions like LAG, LEAD, and

262
00:13:40.639 --> 00:13:44.759
<v Speaker 1>these really complex windowing functions. And these aren't just convenient syntax,

263
00:13:44.840 --> 00:13:47.159
<v Speaker 1>they are massive performance life savers.

264
00:13:47.240 --> 00:13:50.200
<v Speaker 2>They really are. LAG and LEAD allow you to access

265
00:13:50.320 --> 00:13:54.159
<v Speaker 2>data from previous or subsequent rows in the exact same

266
00:13:54.240 --> 00:13:56.840
<v Speaker 2>result set without having to use a clunky self-join.

267
00:13:57.200 --> 00:14:00.320
<v Speaker 1>So, like, if a retailer is calculating year-over-year

268
00:14:00.320 --> 00:14:03.200
<v Speaker 1>sales growth across ten thousand stores, they don't have to

269
00:14:03.279 --> 00:14:06.919
<v Speaker 1>pull millions of rows into a Python data frame just

270
00:14:06.960 --> 00:14:10.200
<v Speaker 1>to calculate a rolling average. They can use a SQL

271
00:14:10.240 --> 00:14:13.519
<v Speaker 1>windowing function to calculate that moving average directly on the

272
00:14:13.559 --> 00:14:14.840
<v Speaker 1>storage disk.

273
00:14:14.759 --> 00:14:18.240
<v Speaker 2>And by processing that rolling average inside the database using SQL,

274
00:14:18.559 --> 00:14:22.440
<v Speaker 2>you are leveraging the internal optimizer. It completes the calculation

275
00:14:22.519 --> 00:14:24.919
<v Speaker 2>in a fraction of the time and only returns the

276
00:14:24.960 --> 00:14:27.480
<v Speaker 2>final aggregated insight to the application layer.
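
NOTE
Editorial sketch, not taken from the episode or the book: what LAG, LEAD, and a
moving-average window compute, written in plain Python so the row-offset logic
is visible. In the setup the speakers describe, this work would run as SQL
inside the database. The function names and sales figures are hypothetical.
```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier; `default` where no earlier row exists."""
    return [values[i - offset] if i >= offset else default for i in range(len(values))]
#
def lead(values, offset=1, default=None):
    """Value from `offset` rows later; `default` where no later row exists."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]
#
def moving_average(values, window):
    """Average of the current row and up to window - 1 preceding rows."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out
#
# Hypothetical yearly sales for one store, oldest first.
sales = [100, 120, 90, 150]
previous = lag(sales)
growth = [None if p is None else (s - p) / p for s, p in zip(sales, previous)]
print(previous, lead(sales))
print(growth)
print(moving_average(sales, 3))
```
Each row reads its neighbor by position, which is why no self-join is needed.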

277
00:14:27.600 --> 00:14:30.559
<v Speaker 1>Okay, so SQL is incredibly powerful. But let's play devil's

278
00:14:30.600 --> 00:14:31.279
<v Speaker 1>advocate for a second.

279
00:14:31.399 --> 00:14:31.840
<v Speaker 2>Let's do it.

280
00:14:32.000 --> 00:14:35.240
<v Speaker 1>What if your company's lead data scientist is like a

281
00:14:35.320 --> 00:14:38.759
<v Speaker 1>hardcore statistical researcher, you know the type. They spent eight

282
00:14:38.840 --> 00:14:42.039
<v Speaker 1>years getting a PhD. They live and breathe the open

283
00:14:42.039 --> 00:14:45.759
<v Speaker 1>source R programming language, and they rely on these massive,

284
00:14:45.919 --> 00:14:50.320
<v Speaker 1>crowdsourced libraries of cutting edge algorithms that standard SQL just

285
00:14:50.360 --> 00:14:51.720
<v Speaker 1>doesn't natively support.

286
00:14:52.559 --> 00:14:55.320
<v Speaker 2>Are they just forced to abandon R and write Oracle

287
00:14:55.399 --> 00:14:56.799
<v Speaker 2>PL/SQL? Not at all.

288
00:14:57.240 --> 00:15:00.679
<v Speaker 1>That exact friction is what Oracle R Enterprise, or ORE,

289
00:15:01.240 --> 00:15:02.399
<v Speaker 1>was engineered to eliminate.

290
00:15:02.440 --> 00:15:04.720
<v Speaker 2>Because open source R has the same architectural flaw we

291
00:15:04.720 --> 00:15:07.080
<v Speaker 2>talked about earlier, right? It's entirely client-based. It has to

292
00:15:07.120 --> 00:15:09.720
<v Speaker 2>load everything into the local laptop's RAM. Exactly.

293
00:15:09.840 --> 00:15:12.759
<v Speaker 1>Open source R is brilliant for innovation, but it is

294
00:15:12.799 --> 00:15:16.720
<v Speaker 1>fundamentally incapable of handling true enterprise big data. If you

295
00:15:16.759 --> 00:15:19.559
<v Speaker 1>try to run an advanced clustering algorithm on a billion

296
00:15:19.639 --> 00:15:22.720
<v Speaker 1>rows of transaction data using open source R, the memory

297
00:15:22.720 --> 00:15:25.000
<v Speaker 1>limit will immediately crash the session.

298
00:15:24.879 --> 00:15:26.240
<v Speaker 2>Just a blue screen of death, basically.

299
00:15:26.279 --> 00:15:27.279
<v Speaker 1>Pretty much.

300
00:15:27.360 --> 00:15:29.799
<v Speaker 2>So how does ORE solve this without forcing that PhD

301
00:15:29.840 --> 00:15:32.399
<v Speaker 2>data scientist to learn a whole new language. The source

302
00:15:32.440 --> 00:15:36.159
<v Speaker 2>outlines a three-layer architecture to make R database-compatible. Right.

303
00:15:36.240 --> 00:15:38.720
<v Speaker 2>So layer one is simply the client R engine. The

304
00:15:38.799 --> 00:15:41.799
<v Speaker 2>data scientist sits at their laptop and writes standard R

305
00:15:41.879 --> 00:15:45.080
<v Speaker 2>code in their normal IDE. They don't change their workflow

306
00:15:45.120 --> 00:15:45.360
<v Speaker 2>at all.

307
00:15:45.519 --> 00:15:47.879
<v Speaker 1>Okay, and layer two is where the magic happens.

308
00:15:47.919 --> 00:15:48.039
<v Speaker 2>Right.

309
00:15:48.080 --> 00:15:50.120
<v Speaker 1>The database has this transparency layer.

310
00:15:50.200 --> 00:15:53.440
<v Speaker 2>Yes, this is the crucial translation mechanism. When the data

311
00:15:53.440 --> 00:15:55.960
<v Speaker 2>scientist writes an OUR command to filter a data set

312
00:15:56.039 --> 00:16:00.519
<v Speaker 2>or apply a transformation, the transparency layer intercepts that command. Oh, okay.

313
00:16:00.679 --> 00:16:02.799
<v Speaker 2>It does not pull the data to the laptop. Instead, it

314
00:16:02.919 --> 00:16:07.639
<v Speaker 2>dynamically translates the R syntax into a highly optimized SQL query.

315
00:16:08.200 --> 00:16:11.080
<v Speaker 2>It maps the R data frames directly to Oracle tables

316
00:16:11.159 --> 00:16:11.639
<v Speaker 2>or views.

317
00:16:11.799 --> 00:16:14.480
<v Speaker 1>That is wild. So the data scientist thinks they are

318
00:16:14.519 --> 00:16:17.759
<v Speaker 1>manipulating a local R data frame, but behind the scenes,

319
00:16:17.840 --> 00:16:21.399
<v Speaker 1>Oracle is essentially spoofing the environment and executing a native

320
00:16:21.399 --> 00:16:23.120
<v Speaker 1>SQL query on the server. Exactly.

321
00:16:23.120 --> 00:16:26.120
<v Speaker 2>It's totally seamless. And then layer three consists of spawned

322
00:16:26.200 --> 00:16:29.840
<v Speaker 2>R engines running directly on the database server itself. If

323
00:16:29.840 --> 00:16:32.600
<v Speaker 2>the data scientist uses an ORE package like OREdm,

324
00:16:32.720 --> 00:16:36.200
<v Speaker 2>which maps directly to Oracle Data Mining algorithms, the execution

325
00:16:36.240 --> 00:16:39.080
<v Speaker 2>happens entirely inside the kernel using parallel processing.

326
00:16:39.360 --> 00:16:42.039
<v Speaker 1>But wait, what if they are using a custom third

327
00:16:42.039 --> 00:16:45.919
<v Speaker 1>party R package that Oracle doesn't natively map to. How

328
00:16:45.960 --> 00:16:47.440
<v Speaker 1>do you keep the memory from crashing?

329
00:16:47.480 --> 00:16:51.080
<v Speaker 2>That's where functions like ore.rowApply come in.

330
00:16:51.559 --> 00:16:54.559
<v Speaker 2>It allows the database server to partition the massive data

331
00:16:54.559 --> 00:16:58.480
<v Speaker 2>set into manageable chunks, spawn multiple R engines directly on

332
00:16:58.519 --> 00:17:01.759
<v Speaker 2>the server in parallel, feed the data chunks to those engines,

333
00:17:02.159 --> 00:17:03.879
<v Speaker 2>and then reassemble the results at the end.

334
00:17:03.960 --> 00:17:05.079
<v Speaker 1>Oh that's incredibly smart.

335
00:17:05.160 --> 00:17:08.400
<v Speaker 2>Yeah, you get the full analytical power of custom R

336
00:17:08.519 --> 00:17:11.720
<v Speaker 2>packages without ever moving the data across the network or

337
00:17:11.759 --> 00:17:13.400
<v Speaker 2>overwhelming a single machine's RAM.

338
00:17:13.680 --> 00:17:16.960
<v Speaker 1>If we connect this to the bigger picture, this integration

339
00:17:17.240 --> 00:17:20.319
<v Speaker 1>is the ultimate bridge. You are taking the rapid innovation

340
00:17:20.559 --> 00:17:23.799
<v Speaker 1>and the massive crowdsourced brilliance of the open source

341
00:17:23.920 --> 00:17:27.640
<v Speaker 1>R community, and you're seamlessly plugging it directly into the

342
00:17:27.680 --> 00:17:31.759
<v Speaker 1>heavy duty, industrial scale processing power of an enterprise database.

343
00:17:31.839 --> 00:17:34.400
<v Speaker 2>You get the best of both worlds. The agility of

344
00:17:34.440 --> 00:17:38.480
<v Speaker 2>open source statistical libraries combined with the scalability, parallel execution,

345
00:17:38.599 --> 00:17:41.119
<v Speaker 2>and strict security of a Tier one database. You just

346
00:17:41.160 --> 00:17:42.599
<v Speaker 2>don't have to compromise anymore.

347
00:17:42.759 --> 00:17:46.240
<v Speaker 1>This has been a really fascinating exploration. We've completely broken

348
00:17:46.319 --> 00:17:50.799
<v Speaker 1>down why the traditional paradigm of data science, you know,

349
00:17:50.880 --> 00:17:53.039
<v Speaker 1>extracting the data from the source and moving it to

350
00:17:53.039 --> 00:17:57.519
<v Speaker 1>the math, is just a fragile, bottlenecked system. And we've

351
00:17:57.559 --> 00:18:00.799
<v Speaker 1>seen how inverting that paradigm, bringing the math directly to the

352
00:18:00.880 --> 00:18:04.559
<v Speaker 1>data through Oracle Data Miner, advanced SQL analytics, and that

353
00:18:04.680 --> 00:18:08.680
<v Speaker 1>awesome transparency layer of Oracle R Enterprise, allows organizations to

354
00:18:08.759 --> 00:18:13.200
<v Speaker 1>execute real time, highly scalable predictions without moving a single

355
00:18:13.240 --> 00:18:14.119
<v Speaker 1>byte of data.

356
00:18:14.240 --> 00:18:16.960
<v Speaker 2>It permanently alters the velocity at which an enterprise can

357
00:18:17.000 --> 00:18:18.319
<v Speaker 2>generate actionable foresight.

358
00:18:18.599 --> 00:18:22.519
<v Speaker 1>It changes everything. And for you listening, understanding this architectural

359
00:18:22.559 --> 00:18:25.160
<v Speaker 1>shift puts you at a massive advantage, whether you are

360
00:18:25.240 --> 00:18:27.839
<v Speaker 1>architecting a back end, leading a business unit, or just

361
00:18:27.960 --> 00:18:30.680
<v Speaker 1>tracking the evolution of AI. Knowing that data movement and

362
00:18:30.759 --> 00:18:33.839
<v Speaker 1>data preparation are the true hidden constraints of machine learning

363
00:18:34.200 --> 00:18:37.640
<v Speaker 1>really changes how you should evaluate every new tech solution

364
00:18:37.759 --> 00:18:38.440
<v Speaker 1>on the market.

365
00:18:38.720 --> 00:18:42.039
<v Speaker 2>It absolutely should dictate your strategy. And I want to leave

366
00:18:42.079 --> 00:18:44.519
<v Speaker 2>you with a final thought to mull over. We spent

367
00:18:44.519 --> 00:18:47.400
<v Speaker 2>a lot of time discussing the top tier of that

368
00:18:47.480 --> 00:18:53.079
<v Speaker 2>automation pyramid, the automated statistician. With tools actively translating code,

369
00:18:53.200 --> 00:18:57.720
<v Speaker 2>automatically normalizing variables, and letting decision trees handle data prep,

370
00:18:58.400 --> 00:19:01.359
<v Speaker 2>the mechanical friction of data science is vanishing.

371
00:19:01.400 --> 00:19:02.720
<v Speaker 1>Yeah, it's getting so automated.

372
00:19:03.000 --> 00:19:05.960
<v Speaker 2>So if this trend accelerates over the next decade, what

373
00:19:06.119 --> 00:19:09.720
<v Speaker 2>happens to the human data scientist? Will the prestigious role

374
00:19:09.720 --> 00:19:12.960
<v Speaker 2>of data scientist eventually pivot away from writing code entirely,

375
00:19:13.599 --> 00:19:17.480
<v Speaker 2>transforming them into business strategists who simply understand how to

376
00:19:17.559 --> 00:19:19.880
<v Speaker 2>ask a database the right strategic question?

377
00:19:20.319 --> 00:19:23.880
<v Speaker 1>It's an incredible thought. From spending eight hours peeling potatoes

378
00:19:23.920 --> 00:19:26.400
<v Speaker 1>to finally just sitting at the chef's table and designing

379
00:19:26.440 --> 00:19:28.599
<v Speaker 1>the menu. Thanks for taking the deep dive with us.
