WEBVTT

1
00:00:00.120 --> 00:00:03.279
<v Speaker 1>So if you were online back in the late nineteen nineties,

2
00:00:03.759 --> 00:00:08.400
<v Speaker 1>you probably remember that quiet war raging in our inboxes.

3
00:00:08.480 --> 00:00:13.320
<v Speaker 2>Oh yeah, the sheer volume of junk mail was just unbelievable.

4
00:00:12.640 --> 00:00:14.400
<v Speaker 1>Right, it was a nightmare. And if you were a

5
00:00:14.400 --> 00:00:16.600
<v Speaker 1>programmer trying to stop it, I mean, you were probably

6
00:00:16.399 --> 00:00:18.280
<v Speaker 2>Losing your mind, absolutely losing it.

7
00:00:18.280 --> 00:00:21.800
<v Speaker 1>Because you'd write this rigid rule, right, like if an

8
00:00:21.879 --> 00:00:25.640
<v Speaker 1>email contains the number four and the letter U, you send

9
00:00:25.679 --> 00:00:26.359
<v Speaker 1>it straight to the

10
00:00:26.320 --> 00:00:27.920
<v Speaker 2>trash. And that would work perfectly.

11
00:00:28.039 --> 00:00:30.519
<v Speaker 1>Yeah, for about a day, exactly just a day. Then

12
00:00:30.559 --> 00:00:32.799
<v Speaker 1>the spammers would realize what you did and they'd start

13
00:00:32.799 --> 00:00:35.359
<v Speaker 1>spelling it out, like "for," space,

14
00:00:35.200 --> 00:00:37.600
<v Speaker 2>"you." And suddenly your defense is totally

15
00:00:37.359 --> 00:00:40.640
<v Speaker 1>Useless, completely useless. You had to go back, write a

16
00:00:40.679 --> 00:00:44.240
<v Speaker 1>new rule, deploy it again. It was this endless, exhausting

17
00:00:44.359 --> 00:00:48.399
<v Speaker 1>game of whack-a-mole, and mathematically humans were just

18
00:00:48.679 --> 00:00:50.079
<v Speaker 1>destined to lose that game.

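NOTE
A minimal sketch of the brittle hand-written rule described above, assuming a
filter that only looks for the literal string "4U". The function name and the
two sample emails are invented for illustration. It shows the rule catching the
original spelling and missing a trivial rewording.
```python
def rule_based_is_spam(email: str) -> bool:
    # Hard-coded pattern chosen by a human; nothing here adapts on its own.
    return "4u" in email.lower()
original = "Free credit 4U, click now"
evasion = "Free credit for you, click now"
print(rule_based_is_spam(original))  # True
print(rule_based_is_spam(evasion))   # False
```
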
19
00:00:50.280 --> 00:00:52.399
<v Speaker 2>We were. But then, you know, we stopped trying to

20
00:00:52.399 --> 00:00:54.799
<v Speaker 2>write the rules. We decided to let the machine write them.

21
00:00:54.719 --> 00:00:57.039
<v Speaker 1>Instead, which is just wild to think about.

22
00:00:57.320 --> 00:01:00.799
<v Speaker 2>It was a profound turning point in the history of technology.

23
00:01:01.320 --> 00:01:06.079
<v Speaker 2>We abandoned the arrogance of trying to anticipate every possible

24
00:01:06.159 --> 00:01:09.079
<v Speaker 2>variation of a problem and instead built systems that could

25
00:01:09.079 --> 00:01:09.760
<v Speaker 2>actually adapt.

26
00:01:10.120 --> 00:01:13.000
<v Speaker 1>And that adaptation is exactly what we're exploring today. So

27
00:01:13.079 --> 00:01:15.040
<v Speaker 1>welcome to the Deep Dive. If you're listening to this,

28
00:01:15.319 --> 00:01:17.480
<v Speaker 1>you are what we like to call the learner. That's right,

29
00:01:17.599 --> 00:01:19.959
<v Speaker 1>Whether you're prepping for a high stakes meeting, trying to

30
00:01:20.000 --> 00:01:22.560
<v Speaker 1>catch up on where the tech landscape is heading, or

31
00:01:22.560 --> 00:01:25.640
<v Speaker 1>you're just you know, insanely curious about the mechanics of

32
00:01:25.640 --> 00:01:27.879
<v Speaker 1>the digital world, you are in the right place.

33
00:01:28.159 --> 00:01:28.879
<v Speaker 2>Glad you're here.

34
00:01:29.040 --> 00:01:33.439
<v Speaker 1>Today, we're cracking open Aurélien Géron's foundational text Hands-On

35
00:01:33.519 --> 00:01:36.319
<v Speaker 1>Machine Learning, and we are skipping all the sci-fi,

36
00:01:36.359 --> 00:01:40.040
<v Speaker 1>Hollywood hype. Thank goodness. Yeah, no killer robots, no

37
00:01:40.159 --> 00:01:42.959
<v Speaker 1>Skynet today. We're just looking under the hood. Our mission

38
00:01:42.959 --> 00:01:45.599
<v Speaker 1>here is to break down exactly what machine learning actually is,

39
00:01:46.000 --> 00:01:51.560
<v Speaker 1>how these systems physically learn, and why they sometimes fail spectacularly.

40
00:01:51.799 --> 00:01:54.599
<v Speaker 2>And it's vital to start with that spam filter example

41
00:01:54.640 --> 00:01:58.799
<v Speaker 2>you mentioned, because it just perfectly illustrates the mechanical difference

42
00:01:58.799 --> 00:02:03.920
<v Speaker 2>between traditional programming and machine learning. In traditional programming, a

43
00:02:04.040 --> 00:02:08.479
<v Speaker 2>human analyzes a problem, discovers the pattern, writes a hard

44
00:02:08.520 --> 00:02:12.879
<v Speaker 2>coded rule, and evaluates the output. It is incredibly brittle.

45
00:02:12.960 --> 00:02:14.479
<v Speaker 1>Brittle is the perfect word for it.

46
00:02:14.520 --> 00:02:17.919
<v Speaker 2>If the environment changes by even one pixel or one keystroke,

47
00:02:18.240 --> 00:02:19.439
<v Speaker 2>the program just breaks.

48
00:02:19.560 --> 00:02:22.520
<v Speaker 1>Okay, so let's unpack this for the listener. Traditional programming

49
00:02:22.599 --> 00:02:26.479
<v Speaker 1>is basically like giving a chef a rigid, unchangeable recipe.

50
00:02:26.599 --> 00:02:27.199
<v Speaker 2>Yeah, exactly.

51
00:02:27.240 --> 00:02:30.120
<v Speaker 1>If they're missing a single ingredient, or if the oven

52
00:02:30.199 --> 00:02:32.080
<v Speaker 1>is just slightly too hot, they just crash and burn.

53
00:02:32.120 --> 00:02:33.080
<v Speaker 1>They don't know how to adapt.

54
00:02:33.120 --> 00:02:33.639
<v Speaker 2>They're stuck.

55
00:02:33.960 --> 00:02:37.240
<v Speaker 1>But machine learning is entirely different. It's like giving a

56
00:02:37.319 --> 00:02:40.759
<v Speaker 1>chef a thousand slightly different cakes and having them guess

57
00:02:40.800 --> 00:02:43.000
<v Speaker 1>the recipe by changing one ingredient at a time.

58
00:02:43.199 --> 00:02:44.680
<v Speaker 2>I love that analogy.

59
00:02:44.639 --> 00:02:48.240
<v Speaker 1>Right, like: too salty? Next time, lower the salt. Too dry? Add

60
00:02:48.319 --> 00:02:52.159
<v Speaker 1>some water. It repeats this optimization loop thousands of times

61
00:02:52.240 --> 00:02:55.520
<v Speaker 1>until the cake is perfect. It figures out the recipe itself.

62
00:02:55.680 --> 00:02:59.199
<v Speaker 2>What's fascinating here is how we formally define that optimization loop.

63
00:03:00.080 --> 00:03:02.919
<v Speaker 2>In nineteen ninety seven, Tom Mitchell gave us this brilliant

64
00:03:02.960 --> 00:03:06.199
<v Speaker 2>engineering definition that we actually still rely on today.

65
00:03:06.400 --> 00:03:07.680
<v Speaker 1>Oh right, the E, T, and P.

66
00:03:08.159 --> 00:03:11.840
<v Speaker 2>Yes, he said, a computer program learns from experience, which

67
00:03:11.840 --> 00:03:15.120
<v Speaker 2>we call E with respect to some task T and

68
00:03:15.159 --> 00:03:18.960
<v Speaker 2>some performance measure P. Okay. Crucially, the system is only

69
00:03:19.000 --> 00:03:22.800
<v Speaker 2>actually learning if its performance on the task improves with

70
00:03:22.879 --> 00:03:24.400
<v Speaker 2>the experience.

71
00:03:24.479 --> 00:03:26.639
<v Speaker 1>So, mapping that onto our spam filter for a second, the

72
00:03:26.719 --> 00:03:28.800
<v Speaker 1>task T is flagging the junk mail.

73
00:03:29.000 --> 00:03:29.400
<v Speaker 2>Correct.

74
00:03:29.520 --> 00:03:32.520
<v Speaker 1>The experience E is the training data, right, those massive

75
00:03:32.599 --> 00:03:37.199
<v Speaker 1>piles of spam and normal emails, which data scientists playfully

76
00:03:36.759 --> 00:03:38.400
<v Speaker 2>call ham. Right, spam and ham.

77
00:03:38.759 --> 00:03:41.439
<v Speaker 1>And the performance measure P is the accuracy rate, like

78
00:03:41.479 --> 00:03:43.840
<v Speaker 1>what percentage of the emails did it actually put in

79
00:03:43.840 --> 00:03:45.319
<v Speaker 1>the right folder?

80
00:03:45.360 --> 00:03:48.599
<v Speaker 2>Exactly. And if that percentage goes up as it processes more emails, boom,

81
00:03:48.680 --> 00:03:49.319
<v Speaker 2>it is learning.

82
00:03:49.439 --> 00:03:50.039
<v Speaker 1>It's learning.

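NOTE
A minimal sketch of the E, T, P framing above, assuming a toy word-voting
classifier rather than any real spam filter. All names and the tiny email lists
are invented for illustration. Experience E is the labeled training list, task T
is predict, and performance measure P is accuracy on emails held back from
training. P is measured before and after the classifier sees any examples.
```python
from collections import Counter
def train(examples):
    # Experience E: labeled emails. Each word votes +1 for spam, -1 for ham.
    votes = Counter()
    for text, label in examples:
        for word in text.lower().split():
            votes[word] += 1 if label == "spam" else -1
    return votes
def predict(votes, text):
    # Task T: flag an email as spam when its words lean toward spam overall.
    score = sum(votes[word] for word in text.lower().split())
    return "spam" if score > 0 else "ham"
def accuracy(votes, examples):
    # Performance measure P: fraction of emails put in the right folder.
    correct = sum(predict(votes, text) == label for text, label in examples)
    return correct / len(examples)
training_emails = [("win cash now", "spam"), ("meeting at noon", "ham")]
held_out_emails = [("win a prize", "spam"), ("noon meeting", "ham"), ("cash now", "spam"), ("lunch plans", "ham")]
p_before = accuracy(train([]), held_out_emails)
p_after = accuracy(train(training_emails), held_out_emails)
print(p_before, p_after)
```
With no experience the classifier calls everything ham, so P only rises once it
has seen labeled examples, which is the sense of "learning" in the definition.
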
83
00:03:50.280 --> 00:03:53.719
<v Speaker 2>And this framework is essential because there are certain problems

84
00:03:54.039 --> 00:03:57.759
<v Speaker 2>where human hard coding just completely totally fails. Think about

85
00:03:57.759 --> 00:04:00.520
<v Speaker 2>speech recognition. Oh, man, if I ask you to write

86
00:04:00.520 --> 00:04:04.360
<v Speaker 2>a traditional program to detect the word two, you know

87
00:04:04.439 --> 00:04:06.199
<v Speaker 2>the number two? How do you do it?

88
00:04:06.520 --> 00:04:07.840
<v Speaker 1>I wouldn't even know where to start.

89
00:04:08.319 --> 00:04:10.439
<v Speaker 2>You might try to hard code a rule looking for

90
00:04:10.479 --> 00:04:13.319
<v Speaker 2>a specific high frequency sound wave for the letter T,

91
00:04:14.599 --> 00:04:17.360
<v Speaker 2>But how do you mathematically account for a child's voice

92
00:04:17.920 --> 00:04:19.360
<v Speaker 2>versus an adult's?

93
00:04:19.319 --> 00:04:22.959
<v Speaker 1>Right, or like a British accent versus a Southern drawl? Exactly.

94
00:04:23.399 --> 00:04:26.680
<v Speaker 2>What if there's wind noise in the background. The sheer

95
00:04:26.759 --> 00:04:30.839
<v Speaker 2>number of variations approaches infinity. You simply cannot write enough

96
00:04:30.920 --> 00:04:34.079
<v Speaker 2>if-then statements to cover it all. The system must

97
00:04:34.160 --> 00:04:35.079
<v Speaker 2>learn by example.

98
00:04:35.199 --> 00:04:37.800
<v Speaker 1>So if the system requires examples to learn, that kind

99
00:04:37.800 --> 00:04:41.199
<v Speaker 1>of brings up a massive logistical problem. How exactly do

100
00:04:41.240 --> 00:04:42.800
<v Speaker 1>we feed it those examples?

101
00:04:42.920 --> 00:04:43.759
<v Speaker 2>That's the big question.

102
00:04:43.920 --> 00:04:46.199
<v Speaker 1>Are we just dumping raw data into a hard drive

103
00:04:46.240 --> 00:04:49.000
<v Speaker 1>and hoping for the best? Because the material breaks this

104
00:04:49.120 --> 00:04:53.360
<v Speaker 1>down into the different levels of human supervision required during training.

105
00:04:53.279 --> 00:04:56.879
<v Speaker 2>Right, the data doesn't just magically organize itself. Sadly, the

106
00:04:56.879 --> 00:04:59.920
<v Speaker 2>most common approach is supervised learning. This is where the

107
00:05:00.120 --> 00:05:02.800
<v Speaker 2>machine basically has a teacher. You don't just feed the

108
00:05:02.920 --> 00:05:06.439
<v Speaker 2>algorithm raw data. You feed it data that already includes

109
00:05:06.480 --> 00:05:08.800
<v Speaker 2>the desired solutions, which we call labels.

110
00:05:09.079 --> 00:05:11.680
<v Speaker 1>So the spam filter is supervised because you're handing the

111
00:05:11.720 --> 00:05:14.600
<v Speaker 1>machine a stack of emails that a human has explicitly

112
00:05:14.639 --> 00:05:18.160
<v Speaker 1>stamped as spam or ham. You're giving it the answer

113
00:05:18.240 --> 00:05:19.480
<v Speaker 1>key to study from.

114
00:05:19.720 --> 00:05:22.040
<v Speaker 2>Yes, the answer key is crucial here.

115
00:05:21.879 --> 00:05:24.319
<v Speaker 1>And the text points out this works really well for

116
00:05:24.519 --> 00:05:29.959
<v Speaker 1>predicting categories, which is called classification, and predicting numeric values,

117
00:05:29.959 --> 00:05:33.360
<v Speaker 1>which is called regression, right, like predicting a car's price

118
00:05:33.439 --> 00:05:36.759
<v Speaker 1>based on its mileage. You feed it thousands of examples

119
00:05:36.759 --> 00:05:39.240
<v Speaker 1>of cars where you already know the final sale price.

120
00:05:39.519 --> 00:05:43.040
<v Speaker 2>But the reality is labeled data is a huge luxury.

121
00:05:43.319 --> 00:05:45.639
<v Speaker 2>Most data in the real world just doesn't come with

122
00:05:45.680 --> 00:05:48.639
<v Speaker 2>a neat little answer key. Right, it's just raw. Exactly.

123
00:05:48.800 --> 00:05:50.600
<v Speaker 2>And that's where unsupervised learning comes in.

124
00:05:51.000 --> 00:05:51.240
<v Speaker 1>Here.

125
00:05:51.360 --> 00:05:54.199
<v Speaker 2>The system is essentially just an observer. You feed it

126
00:05:54.279 --> 00:05:57.120
<v Speaker 2>a mountain of completely unlabeled data, and it has to

127
00:05:57.160 --> 00:05:59.839
<v Speaker 2>figure out the underlying structure all on its

128
00:05:59.759 --> 00:06:02.959
<v Speaker 1>own. Which honestly sounds like magic. How does an algorithm

129
00:06:03.040 --> 00:06:05.560
<v Speaker 1>learn anything if you literally don't tell it what to

130
00:06:05.560 --> 00:06:05.959
<v Speaker 1>look for?

131
00:06:06.279 --> 00:06:10.399
<v Speaker 2>It does it by measuring distances in multidimensional space. Take clustering,

132
00:06:10.399 --> 00:06:13.160
<v Speaker 2>for example. Let's say you have a massive dataset

133
00:06:13.199 --> 00:06:15.839
<v Speaker 2>of visitors to your blog. Okay, you have absolutely no

134
00:06:15.959 --> 00:06:19.240
<v Speaker 2>idea who they are, but the algorithm plots every visitor

135
00:06:19.279 --> 00:06:22.240
<v Speaker 2>on a mathematical graph. Maybe one axis is the time

136
00:06:22.279 --> 00:06:24.680
<v Speaker 2>of day they visit. Another axis is the length of

137
00:06:24.720 --> 00:06:26.959
<v Speaker 2>the articles they read, another is the topic.

138
00:06:27.120 --> 00:06:27.680
<v Speaker 1>Oh, I see.

139
00:06:28.079 --> 00:06:31.319
<v Speaker 2>Suddenly it notices that a huge cluster of data points

140
00:06:31.519 --> 00:06:36.279
<v Speaker 2>are very close together in this mathematical space. It realizes, hey,

141
00:06:37.000 --> 00:06:40.480
<v Speaker 2>forty percent of these users always read long-form

142
00:06:40.480 --> 00:06:44.120
<v Speaker 2>sci-fi posts on Saturday nights. Wow. It didn't know what

143
00:06:44.160 --> 00:06:47.360
<v Speaker 2>sci-fi or Saturday meant emotionally. It just calculated that

144
00:06:47.399 --> 00:06:49.240
<v Speaker 2>those behaviors clustered tightly together.

145
00:06:49.399 --> 00:06:50.079
<v Speaker 1>That's wild.

146
00:06:50.319 --> 00:06:52.600
<v Speaker 2>That's also how we do anomaly detection. Yeah, if a credit

147
00:06:52.600 --> 00:06:56.279
<v Speaker 2>card transaction lands way outside the normal behavioral cluster, the

148
00:06:56.319 --> 00:06:57.600
<v Speaker 2>system flags it as fraud.

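NOTE
A minimal sketch of distance-based anomaly detection as described above,
assuming a single cluster of "normal" transactions. The two features, the sample
values and the cut-off rule are all invented for illustration. A real system
would scale the features and choose the threshold far more carefully.
```python
import math
# Hypothetical past transactions as (amount in dollars, hour of day).
normal = [(20.0, 12.0), (25.0, 13.0), (22.0, 11.0), (18.0, 12.0)]
# Centre of the "normal behaviour" cluster: the mean of each feature.
centroid = tuple(sum(axis) / len(normal) for axis in zip(*normal))
# Assumed cut-off: twice the largest distance any known point sits from the centre.
threshold = 2 * max(math.dist(p, centroid) for p in normal)
def is_anomaly(transaction) -> bool:
    # Flag a point that lands far outside the cluster.
    return math.dist(transaction, centroid) > threshold
print(is_anomaly((23.0, 12.0)), is_anomaly((900.0, 3.0)))
```
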
149
00:06:57.759 --> 00:07:00.439
<v Speaker 1>Okay, so we have the teacher for supervised and the

150
00:07:00.480 --> 00:07:03.279
<v Speaker 1>observer for unsupervised. But then there is a hybrid, right,

151
00:07:03.399 --> 00:07:06.639
<v Speaker 1>semi-supervised learning. Yes, exactly. And the perfect example of

152
00:07:06.680 --> 00:07:09.000
<v Speaker 1>this is something almost everyone listening has in their pocket

153
00:07:09.079 --> 00:07:10.439
<v Speaker 1>right now. Google Photos.

154
00:07:10.639 --> 00:07:11.759
<v Speaker 2>Oh, such a good example.

155
00:07:11.759 --> 00:07:15.360
<v Speaker 1>When you upload a thousand family photos, the unsupervised part

156
00:07:15.399 --> 00:07:19.160
<v Speaker 1>of the algorithm kicks in first. It mathematically analyzes the

157
00:07:19.160 --> 00:07:22.800
<v Speaker 1>pixels and clusters them, noticing that the exact same face

158
00:07:23.120 --> 00:07:26.120
<v Speaker 1>appears in fifty different pictures. Right. It doesn't know who

159
00:07:26.120 --> 00:07:28.879
<v Speaker 1>that face belongs to, but it knows it's the same object.

160
00:07:29.160 --> 00:07:31.600
<v Speaker 1>Then it turns to you. It asks you to label

161
00:07:31.720 --> 00:07:35.560
<v Speaker 1>just one photo. You type in "Mom," and instantly it

162
00:07:35.680 --> 00:07:40.079
<v Speaker 1>propagates that supervised label across the entire unsupervised cluster.

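NOTE
A minimal sketch of the label propagation step described above, assuming the
unsupervised clustering has already run. The photo ids, cluster ids and the
propagate helper are invented for illustration and are not how Google Photos is
actually implemented.
```python
# Hypothetical output of an unsupervised step: photo id mapped to face-cluster id.
photo_cluster = {"img1": 0, "img2": 0, "img3": 1, "img4": 0, "img5": 1}
# The single label a person typed for one photo.
human_labels = {"img2": "Mom"}
def propagate(photo_cluster, human_labels):
    # Give each cluster the name of any labeled photo inside it.
    cluster_name = {photo_cluster[p]: name for p, name in human_labels.items()}
    # Copy that name to every photo in the cluster; unlabeled clusters stay None.
    return {p: cluster_name.get(c) for p, c in photo_cluster.items()}
labels = propagate(photo_cluster, human_labels)
print(labels)
```
One typed label names three photos here, while the second cluster stays
unnamed until someone labels one of its photos.
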
163
00:07:40.879 --> 00:07:43.160
<v Speaker 2>It is incredibly, incredibly efficient.

164
00:07:43.199 --> 00:07:45.279
<v Speaker 1>Well wait, let me push back on this for a second. Sure,

165
00:07:45.439 --> 00:07:48.480
<v Speaker 1>because what does this all mean for us humans? If

166
00:07:48.560 --> 00:07:52.120
<v Speaker 1>the semi-supervised systems are doing all the heavy algorithmic

167
00:07:52.199 --> 00:07:56.839
<v Speaker 1>lifting of clustering the data in multidimensional space, are

168
00:07:56.839 --> 00:07:58.800
<v Speaker 1>we basically just acting as

169
00:07:58.800 --> 00:08:00.519
<v Speaker 2>cheap labor, in theory?

170
00:08:00.680 --> 00:08:03.480
<v Speaker 1>Like, are we just the final manual cog in the

171
00:08:03.519 --> 00:08:05.360
<v Speaker 1>machine providing the text tags?

172
00:08:05.680 --> 00:08:07.879
<v Speaker 2>If we connect this to the bigger picture, you'll see

173
00:08:07.920 --> 00:08:11.720
<v Speaker 2>it's actually a profound economic solution. You have to understand

174
00:08:11.720 --> 00:08:15.040
<v Speaker 2>that labeling data is the single biggest bottleneck in all

175
00:08:15.040 --> 00:08:17.959
<v Speaker 2>of machine learning. Paying humans to sit in a room

176
00:08:18.120 --> 00:08:22.720
<v Speaker 2>and manually tag a million individual photos is prohibitively expensive

177
00:08:22.759 --> 00:08:27.519
<v Speaker 2>and agonizingly slow. Semi-supervised learning isn't about using humans

178
00:08:27.519 --> 00:08:31.480
<v Speaker 2>as cheap labor. It's an elegant compromise between machine scalability

179
00:08:31.759 --> 00:08:35.799
<v Speaker 2>and human context. Ah, I get it. The algorithm does

180
00:08:35.840 --> 00:08:39.919
<v Speaker 2>what it does best, processing and sorting raw pixels at

181
00:08:39.919 --> 00:08:42.759
<v Speaker 2>a scale a human mind just couldn't fathom, and the

182
00:08:42.840 --> 00:08:45.840
<v Speaker 2>human does what they do best, which is providing the semantic,

183
00:08:46.200 --> 00:08:50.080
<v Speaker 2>emotional or factual context in a single keystroke.

184
00:08:50.320 --> 00:08:52.759
<v Speaker 1>I see, so it's really a partnership. Now, for the

185
00:08:52.799 --> 00:08:54.600
<v Speaker 1>sake of being thorough, we have to mention the final

186
00:08:54.639 --> 00:08:58.720
<v Speaker 1>training category here, reinforcement learning. Yes, this is a totally

187
00:08:58.720 --> 00:09:01.080
<v Speaker 1>different beast. There's no label, answer key, and it's not

188
00:09:01.120 --> 00:09:04.799
<v Speaker 1>just observing clusters here. The learning system is called an agent,

189
00:09:04.840 --> 00:09:06.200
<v Speaker 1>and it's placed into an environment.

190
00:09:06.279 --> 00:09:08.519
<v Speaker 2>Think of it like training a dog. Okay. The agent

191
00:09:08.559 --> 00:09:11.559
<v Speaker 2>performs an action, observes the result, and gets either a

192
00:09:11.600 --> 00:09:15.600
<v Speaker 2>reward or a penalty. Over millions of iterations, it constantly

193
00:09:15.679 --> 00:09:19.679
<v Speaker 2>updates what's called its policy. Policy, right. The internal strategy

194
00:09:19.879 --> 00:09:23.120
<v Speaker 2>it uses to decide what action will yield the highest reward

195
00:09:23.159 --> 00:09:27.320
<v Speaker 2>over time. This is how DeepMind's AlphaGo conquered the

196
00:09:27.360 --> 00:09:31.279
<v Speaker 2>world champion at the incredibly complex board game Go. Oh, wow.

197
00:09:31.799 --> 00:09:35.080
<v Speaker 2>It didn't just study past games. It played millions of

198
00:09:35.120 --> 00:09:39.440
<v Speaker 2>games against itself, constantly tweaking its policy based on whether

199
00:09:39.480 --> 00:09:41.639
<v Speaker 2>an action led to a win or a loss.

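NOTE
A minimal sketch of the reward-and-penalty loop described above, using the
dog-training picture. This is a toy action-value learner, not AlphaGo's method.
The environment, the two actions, the step size and the "try each action once,
then follow the best estimate" schedule are all assumptions made for
illustration.
```python
def reward_for(action: str) -> float:
    # Hypothetical environment: the trainer rewards "sit" and penalizes "bark".
    return 1.0 if action == "sit" else -1.0
def train_agent(actions, steps=20, step_size=0.5):
    value = {a: 0.0 for a in actions}  # estimated reward of each action
    for step in range(steps):
        if step < len(actions):
            action = actions[step]  # try every action once
        else:
            action = max(value, key=value.get)  # then follow the current policy
        reward = reward_for(action)
        # Nudge the estimate toward the reward that was actually observed.
        value[action] += step_size * (reward - value[action])
    return value
value = train_agent(["bark", "sit"])
policy = max(value, key=value.get)
print(policy)  # sit
```
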
200
00:09:41.799 --> 00:09:43.879
<v Speaker 1>Okay, So, whether you train it with an answer key,

201
00:09:44.159 --> 00:09:47.440
<v Speaker 1>or by clustering unlabeled data, or by letting it play

202
00:09:47.440 --> 00:09:49.879
<v Speaker 1>a million games of Go, we eventually end up with

203
00:09:49.879 --> 00:09:50.639
<v Speaker 1>a trained system.

204
00:09:50.759 --> 00:09:51.080
<v Speaker 2>We do.

205
00:09:51.320 --> 00:09:54.840
<v Speaker 1>But here's the multimillion-dollar question: how does it actually

206
00:09:54.879 --> 00:09:56.639
<v Speaker 1>make a prediction on a piece of data it has

207
00:09:56.919 --> 00:10:00.679
<v Speaker 1>literally never seen before? How do we move from memorizing

208
00:10:00.720 --> 00:10:03.960
<v Speaker 1>the past to actually generalizing to the unknown future.

209
00:10:04.399 --> 00:10:06.440
<v Speaker 2>To answer that, we first have to look at the plumbing,

210
00:10:06.759 --> 00:10:08.919
<v Speaker 2>like how is the system digesting data on a day

211
00:10:08.919 --> 00:10:11.080
<v Speaker 2>to day basis? Is it a batch learner or an

212
00:10:11.159 --> 00:10:14.240
<v Speaker 2>online learner? Right. In batch learning, the system trains offline

213
00:10:14.399 --> 00:10:18.440
<v Speaker 2>using all the available data at once. It's computationally heavy.

214
00:10:18.519 --> 00:10:20.360
<v Speaker 2>If you want a batch system to learn about a

215
00:10:20.399 --> 00:10:22.639
<v Speaker 2>new type of spam that appeared this morning, you can't

216
00:10:22.679 --> 00:10:24.360
<v Speaker 2>just teach it the new trick. You have to start

217
00:10:24.360 --> 00:10:26.440
<v Speaker 2>over. Exactly. You have to shut it down, mix the

218
00:10:26.480 --> 00:10:29.720
<v Speaker 2>new data with the millions of old emails, and retrain

219
00:10:29.799 --> 00:10:31.840
<v Speaker 2>the entire model from scratch.

220
00:10:31.639 --> 00:10:35.919
<v Speaker 1>Which is wildly inefficient if you're dealing with fast changing environments.

221
00:10:36.480 --> 00:10:39.480
<v Speaker 1>And that's why online learning is so crucial. Yes, instead

222
00:10:39.480 --> 00:10:43.919
<v Speaker 1>of massive offline dumps, you feed the data to the

223
00:10:43.960 --> 00:10:47.679
<v Speaker 1>system incrementally, either one by one or in small groups

224
00:10:47.720 --> 00:10:51.159
<v Speaker 1>called mini batches. It learns on the fly, very nimble,

225
00:10:51.679 --> 00:10:54.360
<v Speaker 1>and the text highlights a critical mechanism here called the

226
00:10:54.440 --> 00:10:55.080
<v Speaker 1>learning rate.

227
00:10:55.279 --> 00:10:58.600
<v Speaker 2>The learning rate is just a mathematical parameter that controls

228
00:10:58.639 --> 00:11:03.080
<v Speaker 2>how aggressively the algorithm updates its internal rules when

229
00:11:03.080 --> 00:11:04.000
<v Speaker 2>it sees new data.

230
00:11:04.039 --> 00:11:06.159
<v Speaker 1>So think of it like two different types of stock traders.

231
00:11:06.279 --> 00:11:08.960
<v Speaker 1>A trader with a high learning rate is highly reactive.

232
00:11:09.039 --> 00:11:12.720
<v Speaker 1>Right. They see one bad quarterly report and immediately dump

233
00:11:12.759 --> 00:11:16.039
<v Speaker 1>all their shares, completely forgetting the company's ten-year history

234
00:11:16.080 --> 00:11:20.080
<v Speaker 1>of success. They adapt fast, but they're volatile, very volatile.

235
00:11:20.240 --> 00:11:22.919
<v Speaker 1>But a trader with a low learning rate is stubborn.

236
00:11:23.360 --> 00:11:25.879
<v Speaker 1>They rely heavily on the ten-year historical average and

237
00:11:26.080 --> 00:11:30.200
<v Speaker 1>barely react to today's news. They are stable, but they

238
00:11:30.279 --> 00:11:33.360
<v Speaker 1>might miss a sudden market crash. The algorithm has to

239
00:11:33.399 --> 00:11:36.679
<v Speaker 1>balance that exact same tension mathematically.

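NOTE
A minimal sketch of how a learning rate trades stability against reactivity in
an online update, following the two-trader picture above. The price stream, the
two rates and the simple "move a fraction of the way toward the newest value"
update are assumptions made for illustration.
```python
def online_estimate(stream, learning_rate, start):
    # Process one observation at a time, moving part of the way toward each.
    estimate = start
    for observed in stream:
        estimate += learning_rate * (observed - estimate)
    return estimate
# Ten steady readings, then one sudden bad report.
prices = [100.0] * 10 + [50.0]
reactive = online_estimate(prices, learning_rate=0.9, start=100.0)
stubborn = online_estimate(prices, learning_rate=0.05, start=100.0)
print(reactive, stubborn)
```
The high rate lands near the newest value and almost forgets the history,
while the low rate barely moves away from the long-run level.
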
240
00:11:36.000 --> 00:11:39.840
<v Speaker 2>Precisely. Now, regardless of the plumbing, whether you use batch

241
00:11:40.000 --> 00:11:43.360
<v Speaker 2>or online learning, the algorithm needs a fundamental strategy to

242
00:11:43.480 --> 00:11:47.360
<v Speaker 2>generalize to a new unseen piece of data. Okay, and

243
00:11:47.399 --> 00:11:50.759
<v Speaker 2>there are two primary mechanisms for this: instance-based learning

244
00:11:51.159 --> 00:11:52.120
<v Speaker 2>and model-based learning.

245
00:11:52.159 --> 00:11:52.960
<v Speaker 1>Let's break those down.

246
00:11:53.000 --> 00:11:57.159
<v Speaker 2>Instance-based learning is essentially memorization. The algorithm stores the

247
00:11:57.320 --> 00:12:01.279
<v Speaker 2>entire training dataset. When a new email arrives, it calculates

248
00:12:01.320 --> 00:12:04.759
<v Speaker 2>a mathematical distance, a similarity measure, between the new email

249
00:12:04.879 --> 00:12:06.000
<v Speaker 2>and the ones it's memorized.

250
00:12:06.039 --> 00:12:07.360
<v Speaker 1>So it's comparing.

251
00:12:07.759 --> 00:12:10.519
<v Speaker 2>Yes. For example, it might literally count the number of matching words.

252
00:12:11.320 --> 00:12:14.039
<v Speaker 2>If the new email shares eighty percent of its vocabulary

253
00:12:14.039 --> 00:12:17.440
<v Speaker 2>with a known spam email, the algorithm says, it's close

254
00:12:17.519 --> 00:12:18.519
<v Speaker 2>enough: spam.

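NOTE
A minimal sketch of instance-based classification using the word-matching
similarity described above. The function names and the two memorized emails are
invented for illustration. The similarity here is the fraction of the new
email's distinct words that also appear in a stored email, and the prediction
is simply the label of the most similar stored example.
```python
def shared_word_fraction(a: str, b: str) -> float:
    # Similarity measure: share of the new email's distinct words found in the stored one.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a:
        return 0.0
    return len(words_a & words_b) / len(words_a)
def classify(new_email, memorized):
    # Return the label of the most similar memorized example.
    best_text, best_label = max(memorized, key=lambda ex: shared_word_fraction(new_email, ex[0]))
    return best_label
memorized = [("claim your free prize today", "spam"), ("notes from the budget meeting", "ham")]
similarity = shared_word_fraction("claim your free prize now", memorized[0][0])
print(similarity, classify("claim your free prize now", memorized))
```
Nothing is summarized into a formula: every prediction goes back to the stored
examples, which is why the whole training set has to be kept around.
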
255
00:12:18.799 --> 00:12:21.120
<v Speaker 1>Here's where it gets really interesting for you, the learner.

256
00:12:21.879 --> 00:12:25.200
<v Speaker 1>Instance-based learning is basically like a student who memorizes

257
00:12:25.360 --> 00:12:28.039
<v Speaker 1>every single practice question before the physics final.

258
00:12:28.279 --> 00:12:28.799
<v Speaker 2>Exactly.

259
00:12:28.879 --> 00:12:31.480
<v Speaker 1>If the exam question is identical to the practice, they

260
00:12:31.519 --> 00:12:34.840
<v Speaker 1>totally ace it. If it's slightly reworded, they might still

261
00:12:34.879 --> 00:12:38.879
<v Speaker 1>guess right by noticing the similarities. But model-based learning

262
00:12:38.919 --> 00:12:43.600
<v Speaker 1>is entirely different. It's like actually learning the underlying physics formula.

263
00:12:43.639 --> 00:12:45.639
<v Speaker 1>Once you build the formula the model, you can just

264
00:12:45.720 --> 00:12:47.960
<v Speaker 1>throw the practice tests away. You can solve any new

265
00:12:48.039 --> 00:12:48.960
<v Speaker 1>question they throw at you.

266
00:12:49.320 --> 00:12:53.080
<v Speaker 2>Let's make that concrete. The material uses a fantastic real

267
00:12:53.120 --> 00:12:58.840
<v Speaker 2>world example comparing the OECD Better Life Index with IMF GDP data.

268
00:12:58.919 --> 00:13:00.279
<v Speaker 1>Oh I love this part.

269
00:13:00.480 --> 00:13:04.159
<v Speaker 2>Suppose you plot countries on a graph. The horizontal axis

270
00:13:04.200 --> 00:13:06.679
<v Speaker 2>is GDP per capita, meaning how rich the country is.

271
00:13:07.120 --> 00:13:10.679
<v Speaker 2>The vertical axis is life satisfaction, how happy the citizens are.

272
00:13:10.799 --> 00:13:12.559
<v Speaker 2>When you look at the dots, it's a bit scattered

273
00:13:12.879 --> 00:13:16.440
<v Speaker 2>but you can definitely see a general upward trend. As

274
00:13:16.519 --> 00:13:19.960
<v Speaker 2>money goes up, happiness tends to go up. So the

275
00:13:20.039 --> 00:13:23.440
<v Speaker 2>algorithm decides to build a linear model. It draws a

276
00:13:23.480 --> 00:13:26.320
<v Speaker 2>straight line right through the middle of those scattered dots.

277
00:13:26.240 --> 00:13:29.240
<v Speaker 1>And that straight line is defined by parameters. Right, just

278
00:13:29.320 --> 00:13:32.039
<v Speaker 1>like back in high school algebra, y equals mx plus

279
00:13:32.039 --> 00:13:35.360
<v Speaker 1>b. Exactly like that. The algorithm basically has dials it

280
00:13:35.399 --> 00:13:38.519
<v Speaker 1>can turn. It can change the intercept, where the line starts,

281
00:13:38.559 --> 00:13:41.120
<v Speaker 1>and it can change the slope, how steep the line

282
00:13:40.879 --> 00:13:43.919
<v Speaker 2>is. Exactly. But this raises an important question. How does

283
00:13:43.919 --> 00:13:46.320
<v Speaker 2>the algorithm actually know if the line it drew is

284
00:13:46.360 --> 00:13:46.759
<v Speaker 2>any good?

285
00:13:46.840 --> 00:13:47.960
<v Speaker 1>Right? Who's grading it?

286
00:13:48.600 --> 00:13:52.279
<v Speaker 2>This is the very heart of how machines learn. The

287
00:13:52.360 --> 00:13:56.279
<v Speaker 2>algorithm uses a cost function. The cost function measures the

288
00:13:56.320 --> 00:13:59.799
<v Speaker 2>literal physical distance on the graph between the model's straight

289
00:13:59.799 --> 00:14:02.879
<v Speaker 2>line and the actual data dots. Okay, if the line

290
00:14:02.919 --> 00:14:05.200
<v Speaker 2>is drawn too low, the gap between the line and

291
00:14:05.240 --> 00:14:07.519
<v Speaker 2>the dots is large, the cost is high.

292
00:14:07.639 --> 00:14:11.240
<v Speaker 1>So the algorithm's entire purpose in life is to minimize

293
00:14:11.279 --> 00:14:14.200
<v Speaker 1>that cost function. It turns the dial to adjust the

294
00:14:14.200 --> 00:14:17.200
<v Speaker 1>slope of the line. Then it recalculates the distances. Did

295
00:14:17.200 --> 00:14:19.960
<v Speaker 1>the gap get smaller? Yes, turn the dial a bit more,

296
00:14:20.480 --> 00:14:22.720
<v Speaker 1>did the gap get bigger? Whoops, turned it too far,

297
00:14:22.840 --> 00:14:27.360
<v Speaker 1>turn it back. It is just a relentless mathematical optimization problem.

298
00:14:27.519 --> 00:14:30.120
<v Speaker 1>Find the exact slope and height where the line is

299
00:14:30.120 --> 00:14:32.519
<v Speaker 1>as close to all the dots as physically

300
00:14:32.120 --> 00:14:35.679
<v Speaker 2>possible. And once that optimization is done, you have your model.

301
00:14:35.960 --> 00:14:38.159
<v Speaker 2>If a brand new country emerges tomorrow, you don't need

302
00:14:38.200 --> 00:14:41.360
<v Speaker 2>to look at historical instances. You just plug their GDP

303
00:14:41.519 --> 00:14:44.440
<v Speaker 2>into your perfectly sloped line and it spits out a

304
00:14:44.480 --> 00:14:46.120
<v Speaker 2>predicted life satisfaction score.

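NOTE
A minimal sketch of fitting the straight line described above by minimizing a
cost function. The four data points are made up for illustration and are not
the real OECD or IMF figures. The cost is assumed to be the mean squared
vertical gap between line and points, and the "dial turning" is done with plain
gradient descent, which is one common choice rather than the only one. The
step count and learning rate are also assumptions.
```python
# Made-up illustration data, not the real OECD or IMF figures.
gdp = [1.0, 2.0, 3.0, 4.0]           # GDP per capita, in tens of thousands of dollars
satisfaction = [5.1, 5.4, 6.1, 6.4]  # life satisfaction score
def cost(slope, intercept):
    # Mean squared vertical gap between the line and the data points.
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(gdp, satisfaction)) / len(gdp)
def fit(steps=5000, learning_rate=0.05):
    slope, intercept = 0.0, 0.0
    n = len(gdp)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(gdp, satisfaction)]
        # Gradient of the cost with respect to each "dial".
        grad_slope = 2 * sum(e * x for e, x in zip(errors, gdp)) / n
        grad_intercept = 2 * sum(errors) / n
        # Turn each dial a small step in the direction that lowers the cost.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept
slope, intercept = fit()
predicted = slope * 2.5 + intercept  # prediction for a hypothetical new country
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```
Once slope and intercept are found, the training points are no longer needed to
make a prediction, which is the contrast with instance-based learning.
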
305
00:14:46.399 --> 00:14:49.080
<v Speaker 1>But hold on a second. If learning is literally just

306
00:14:49.120 --> 00:14:52.559
<v Speaker 1>turning dials to minimize a mathematical cost function, why do

307
00:14:52.639 --> 00:14:56.720
<v Speaker 1>these models still make embarrassing, catastrophic, or even dangerous mistakes

308
00:14:56.759 --> 00:14:57.399
<v Speaker 1>in the real world?

309
00:14:57.559 --> 00:14:59.679
<v Speaker 2>Yeah, it's a huge problem.

310
00:14:59.360 --> 00:15:00.600
<v Speaker 1>Because the math is objective.

311
00:15:00.720 --> 00:15:00.919
<v Speaker 2>Right.

312
00:15:01.279 --> 00:15:03.679
<v Speaker 1>This brings us to the absolute core of the issue,

313
00:15:04.200 --> 00:15:06.840
<v Speaker 1>the Achilles heel of everything we've talked about so far,

314
00:15:07.399 --> 00:15:10.120
<v Speaker 1>the garbage-in, garbage-out dilemma. You can have the

315
00:15:10.159 --> 00:15:13.159
<v Speaker 1>most elegant optimization loop on the planet, but if you

316
00:15:13.240 --> 00:15:17.120
<v Speaker 1>suffer from bad data or a bad algorithm, you are doomed.

317
00:15:17.879 --> 00:15:20.960
<v Speaker 1>Let's start with bad data, specifically the raw quantity of it.

318
00:15:21.120 --> 00:15:24.799
<v Speaker 2>The sheer volume of data required is just staggering. There

319
00:15:24.840 --> 00:15:27.720
<v Speaker 2>is a landmark two thousand and one paper by Microsoft

320
00:15:27.759 --> 00:15:31.440
<v Speaker 2>researchers Michele Banko and Eric Brill that actually demonstrated this. Right.

321
00:15:31.879 --> 00:15:34.840
<v Speaker 2>They took a highly complex natural language problem and they

322
00:15:34.879 --> 00:15:38.279
<v Speaker 2>tested several very different machine learning algorithms on it. Some

323
00:15:38.399 --> 00:15:42.000
<v Speaker 2>highly sophisticated, some fairly basic. They found that as long

324
00:15:42.039 --> 00:15:44.799
<v Speaker 2>as they fed the algorithms enough data, all of them

325
00:15:44.840 --> 00:15:48.759
<v Speaker 2>performed almost identically well. Peter Norvig and his coauthors later popularized a phrase

326
00:15:48.799 --> 00:15:51.840
<v Speaker 2>for this, the unreasonable effectiveness of data.

327
00:15:52.200 --> 00:15:54.559
<v Speaker 1>The unreasonable effectiveness of data.

328
00:15:54.679 --> 00:15:57.000
<v Speaker 2>I love that. It was a paradigm shift. It suggested

329
00:15:57.039 --> 00:15:59.799
<v Speaker 2>that complex logic often loses to simple logic backed by

330
00:16:00.000 --> 00:16:00.960
<v Speaker 2>mountains of experience.

331
00:16:01.000 --> 00:16:03.120
<v Speaker 1>Okay, wait, though. If that two thousand and one Microsoft

332
00:16:03.159 --> 00:16:06.320
<v Speaker 1>paper showed that giving a mediocre algorithm a billion data

333
00:16:06.360 --> 00:16:10.080
<v Speaker 1>points makes it perform brilliantly, why on earth are Silicon

334
00:16:10.159 --> 00:16:13.440
<v Speaker 1>Valley companies paying millions of dollars to AI researchers?

335
00:16:13.600 --> 00:16:14.240
<v Speaker 2>Good question.

336
00:16:14.440 --> 00:16:18.960
<v Speaker 1>Why not just fire the algorithm development team, save the cash,

337
00:16:19.000 --> 00:16:21.960
<v Speaker 1>and just buy more server space to hoard more data,

338
00:16:22.159 --> 00:16:24.440
<v Speaker 1>just, you know, brute-force the problem?

339
00:16:24.559 --> 00:16:27.000
<v Speaker 2>It is a totally tempting thought. But you have to

340
00:16:27.039 --> 00:16:31.320
<v Speaker 2>ground this in the realities of the physical world. Yes,

341
00:16:31.679 --> 00:16:35.720
<v Speaker 2>for massive tasks like global image recognition or large language models,

342
00:16:36.000 --> 00:16:39.480
<v Speaker 2>tech giants can brute-force it with endless data. But

343
00:16:39.600 --> 00:16:43.679
<v Speaker 2>for the vast majority of real-world applications, massive data

344
00:16:43.840 --> 00:16:47.480
<v Speaker 2>simply doesn't exist. If you are a hospital trying to

345
00:16:47.480 --> 00:16:51.240
<v Speaker 2>predict a rare genetic disease, you don't have billions of patients.

346
00:16:51.279 --> 00:16:52.360
<v Speaker 2>You might have a few hundred.

347
00:16:52.480 --> 00:16:53.120
<v Speaker 1>That makes sense.

348
00:16:53.159 --> 00:16:55.799
<v Speaker 2>If you're a mid-sized retailer optimizing your supply chain,

349
00:16:56.039 --> 00:16:59.279
<v Speaker 2>you have limited, noisy data. You can't fire the algorithm

350
00:16:59.320 --> 00:17:02.440
<v Speaker 2>team because getting extra data is either physically impossible or

351
00:17:02.480 --> 00:17:06.440
<v Speaker 2>prohibitively expensive. You need brilliant algorithms that can extract maximum

352
00:17:06.480 --> 00:17:07.640
<v Speaker 2>signal from minimal, noisy data.

353
00:17:07.799 --> 00:17:10.000
<v Speaker 1>And it's not just about the quantity of the data.

354
00:17:10.240 --> 00:17:14.519
<v Speaker 1>The quality is arguably more dangerous. Your data absolutely must be representative.

355
00:17:14.440 --> 00:17:15.880
<v Speaker 2>Oh, without a doubt.

356
00:17:15.960 --> 00:17:18.599
<v Speaker 1>If your training data doesn't mirror the real world,

357
00:17:19.119 --> 00:17:23.640
<v Speaker 1>your algorithm will very likely learn the wrong lessons.

358
00:17:24.319 --> 00:17:27.440
<v Speaker 1>And the text highlights one of the greatest cautionary tales

359
00:17:27.480 --> 00:17:32.000
<v Speaker 1>in statistics for this, the nineteen thirty-six Literary Digest poll.

360
00:17:31.640 --> 00:17:33.680
<v Speaker 2>Such a classic example.

361
00:17:33.400 --> 00:17:36.720
<v Speaker 1>This magazine wanted to predict the US presidential election between

362
00:17:36.720 --> 00:17:40.119
<v Speaker 1>Alf Landon and Franklin D. Roosevelt, so they did

363
00:17:40.119 --> 00:17:43.000
<v Speaker 1>what any data enthusiast would do. They went massive. They

364
00:17:43.039 --> 00:17:47.039
<v Speaker 1>sent out ten million surveys and they got two point

365
00:17:47.039 --> 00:17:51.279
<v Speaker 1>four million responses back. It was an astronomically large data set,

366
00:17:51.559 --> 00:17:55.200
<v Speaker 1>and based on that data, they predicted Landon would crush Roosevelt,

367
00:17:55.359 --> 00:17:57.599
<v Speaker 1>taking fifty-seven percent of the vote.

368
00:17:57.359 --> 00:18:00.759
<v Speaker 2>And yet Roosevelt won in a landslide with sixty-two

369
00:18:00.799 --> 00:18:03.319
<v Speaker 2>percent of the vote. The prediction wasn't just slightly off,

370
00:18:03.400 --> 00:18:05.160
<v Speaker 2>it was completely inverted.

371
00:18:05.400 --> 00:18:08.359
<v Speaker 1>Exactly. And the reason why is a textbook case of sampling bias.

372
00:18:08.720 --> 00:18:11.119
<v Speaker 1>To get the ten million addresses to send the polls

373
00:18:11.160 --> 00:18:15.680
<v Speaker 1>to, the magazine used telephone directories, club membership lists, and magazine

374
00:18:15.720 --> 00:18:16.599
<v Speaker 1>subscriber lists.

375
00:18:16.960 --> 00:18:18.200
<v Speaker 2>I see where this is going.

376
00:18:18.279 --> 00:18:20.599
<v Speaker 1>Right, because you have to think about the environment of

377
00:18:20.680 --> 00:18:23.640
<v Speaker 1>nineteen thirty-six. Who actually had a telephone in the

378
00:18:23.680 --> 00:18:27.559
<v Speaker 1>middle of the Great Depression? Wealthier people, and

379
00:18:27.680 --> 00:18:31.680
<v Speaker 1>wealthier people tended to lean Republican. So their massive data

380
00:18:31.720 --> 00:18:35.799
<v Speaker 1>set badly underrepresented the working class. The algorithm of their

381
00:18:35.839 --> 00:18:39.640
<v Speaker 1>poll wasn't flawed; the data it ingested was poisoned from

382
00:18:39.680 --> 00:18:42.000
<v Speaker 1>the start. Garbage in, garbage out.

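NOTE
A rough sketch of the sampling bias described above, in Python. Every number here (group sizes, support levels, and the hypothetical `reach` of the mailing lists) is invented for illustration and is not a historical estimate. It only shows how a poll can miss badly when one group is far easier to reach than another.
```python
from dataclasses import dataclass
@dataclass
class Group:
    name: str
    population_share: float  # fraction of all voters in this group
    landon_support: float    # fraction of the group voting for Landon
    reach: float             # chance a voter is on a list and mails the ballot back
# Landon's share if every voter were counted equally.
def true_share(groups):
    return sum(g.population_share * g.landon_support for g in groups)
# Landon's share among respondents, where easy-to-reach groups are over-counted.
def poll_estimate(groups):
    respondents = sum(g.population_share * g.reach for g in groups)
    landon = sum(g.population_share * g.reach * g.landon_support for g in groups)
    return landon / respondents
# Invented numbers for illustration only.
groups = [
    Group("on a phone, club, or subscriber list", 0.30, 0.60, 0.9),
    Group("on no list", 0.70, 0.25, 0.1),
]
print(f"true Landon share:   {true_share(groups):.3f}")
print(f"polled Landon share: {poll_estimate(groups):.3f}")
```
With these made-up inputs the poll puts Landon above half while his true share is well below it. Sample size never appears in the calculation, so mailing more ballots to the same lists would not move the estimate.
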
383
00:18:41.599 --> 00:18:44.920
<v Speaker 2>Out, and that is a failure of data. But we

384
00:18:44.960 --> 00:18:47.599
<v Speaker 2>also have to examine the failure of the algorithm itself.

385
00:18:48.079 --> 00:18:50.920
<v Speaker 2>The most insidious trap in machine learning is a concept

386
00:18:50.920 --> 00:18:52.000
<v Speaker 2>called overfitting.

387
00:18:52.119 --> 00:18:52.880
<v Speaker 1>Overfitting.

388
00:18:53.119 --> 00:18:56.480
<v Speaker 2>This happens when the algorithm performs flawlessly on the training

389
00:18:56.559 --> 00:19:00.000
<v Speaker 2>data but fails entirely when it faces the real world.

390
00:19:00.039 --> 00:19:02.319
<v Speaker 1>I really love the analogy used for this. Imagine you're

391
00:19:02.359 --> 00:19:05.359
<v Speaker 1>a tourist visiting a foreign country for the very first time. Okay,

392
00:19:05.519 --> 00:19:07.920
<v Speaker 1>you get into a taxi and the driver blatantly rips

393
00:19:07.920 --> 00:19:10.720
<v Speaker 1>you off. If you conclude that every single taxi driver

394
00:19:10.839 --> 00:19:13.920
<v Speaker 1>in the entire country is a thief, you are overfitting. Yes,

395
00:19:14.200 --> 00:19:16.839
<v Speaker 1>you took a tiny, noisy anomaly in your personal data

396
00:19:16.880 --> 00:19:19.720
<v Speaker 1>set and drew a massive, sweeping rule from it.

397
00:19:20.000 --> 00:19:24.119
<v Speaker 2>Mathematically, overfitting happens when a model is too complex for the data it has.

398
00:19:24.880 --> 00:19:28.279
<v Speaker 2>We talked about turning dials earlier. In data science, we

399
00:19:28.319 --> 00:19:30.200
<v Speaker 2>call those dials degrees of freedom.

400
00:19:30.279 --> 00:19:31.279
<v Speaker 1>Degrees of freedom. Got it.

401
00:19:31.359 --> 00:19:33.960
<v Speaker 2>If you give an algorithm one hundred different

402
00:19:34.000 --> 00:19:36.799
<v Speaker 2>dials to fit a small amount of data, it will

403
00:19:36.880 --> 00:19:40.519
<v Speaker 2>contort itself to connect every single dot perfectly, even the

404
00:19:40.559 --> 00:19:43.920
<v Speaker 2>outliers and the noise. For instance, if you feed it

405
00:19:44.000 --> 00:19:47.480
<v Speaker 2>a data set of countries to predict life satisfaction, and

406
00:19:47.519 --> 00:19:50.279
<v Speaker 2>your model has too many degrees of freedom, it might

407
00:19:50.319 --> 00:19:54.079
<v Speaker 2>notice a bizarre coincidence: countries with a W in their name,

408
00:19:54.160 --> 00:19:57.480
<v Speaker 2>like New Zealand, Norway, Sweden, and Switzerland happen to have

409
00:19:57.599 --> 00:19:58.640
<v Speaker 2>high life satisfaction.

410
00:19:58.839 --> 00:19:59.319
<v Speaker 1>Oh wow.

411
00:19:59.400 --> 00:20:01.839
<v Speaker 2>The algorithm will mathematically lock that in as a rule.

412
00:20:02.079 --> 00:20:05.480
<v Speaker 2>It will behave as if the letter W generates human happiness.

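NOTE
A minimal sketch of the "too many dials" problem, in Python. The data is hypothetical: the true trend is y = x with alternating noise of plus or minus one, and the names `fit_line` and `interpolate` are illustrative. A two-dial line is compared with a polynomial that gets one dial per data point and therefore passes through every noisy point exactly.
```python
# Least-squares line: two dials (slope and intercept).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx
# Lagrange polynomial through every training point: one dial per data point.
def interpolate(xs, ys, x):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total
# Hypothetical training data: y = x plus noise of +1 (even x) or -1 (odd x).
train_x = [0, 1, 2, 3, 4, 5]
train_y = [x + (1 if x % 2 == 0 else -1) for x in train_x]
slope, intercept = fit_line(train_x, train_y)
new_x, true_y = 6, 6  # an unseen point on the noiseless trend
line_pred = slope * new_x + intercept
poly_pred = interpolate(train_x, train_y, new_x)
print(f"line: {line_pred:.2f}, polynomial: {poly_pred:.2f}, truth: {true_y}")
```
On this data the line predicts about 5.4 for the unseen point, while the polynomial, which has zero error on the training points, predicts about -57. It memorized the noise instead of the trend.
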
413
00:20:05.079 --> 00:20:08.119
<v Speaker 1>Which is obviously absurd. The W is just pure noise.

414
00:20:08.799 --> 00:20:11.920
<v Speaker 1>So how do we actually stop the machine from memorizing

415
00:20:11.920 --> 00:20:12.359
<v Speaker 1>the noise?

416
00:20:12.440 --> 00:20:16.200
<v Speaker 2>By using a technique called regularization. Regularization is essentially a

417
00:20:16.200 --> 00:20:20.000
<v Speaker 2>mathematical penalty for complexity. It forces the model to be simpler.

418
00:20:20.079 --> 00:20:20.839
<v Speaker 1>How does it do that?

419
00:20:21.160 --> 00:20:23.839
<v Speaker 2>If the model has one hundred dials it could turn,

420
00:20:24.480 --> 00:20:28.440
<v Speaker 2>regularization applies a mathematical friction that says, I'm going to

421
00:20:28.480 --> 00:20:31.079
<v Speaker 2>penalize your cost function for every dial you use.

422
00:20:31.240 --> 00:20:32.119
<v Speaker 1>Ah, clever.

423
00:20:32.319 --> 00:20:34.920
<v Speaker 2>The algorithm realizes it can't use all the dials without

424
00:20:35.000 --> 00:20:38.079
<v Speaker 2>racking up huge penalties, so it snaps off ninety of

425
00:20:38.079 --> 00:20:41.680
<v Speaker 2>those dials and only uses the most important ten.

426
00:20:41.799 --> 00:20:45.279
<v Speaker 2>By restricting its degrees of freedom, you force

427
00:20:45.319 --> 00:20:47.559
<v Speaker 2>it to ignore the noisy data, like the letter W,

428
00:20:48.039 --> 00:20:52.319
<v Speaker 2>and focus only on the massive, undeniable underlying trends. You

429
00:20:52.440 --> 00:20:56.640
<v Speaker 2>intentionally make the model slightly worse on the training data

430
00:20:57.000 --> 00:20:59.240
<v Speaker 2>so that it can be much better at handling the

431
00:20:59.319 --> 00:21:00.200
<v Speaker 2>unknown future.

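NOTE
A small sketch of how a complexity penalty can switch dials off, in Python. It assumes an L1 (lasso-style) penalty and a simplified setting where each dial can be tuned independently; the conversation does not say which penalty is meant. The weights and feature names are made up. Each dial minimizes 0.5 * (w - raw)^2 + penalty * |w|, and the soft-threshold rule below is the exact minimizer of that cost.
```python
# Fit error for one dial plus the L1 penalty for using it.
def penalized_cost(w, raw, penalty):
    return 0.5 * (w - raw) ** 2 + penalty * abs(w)
# The w that minimizes penalized_cost: shrink toward zero, or snap to exactly zero.
def soft_threshold(raw, penalty):
    if raw > penalty:
        return raw - penalty
    if raw < -penalty:
        return raw + penalty
    return 0.0
# Hypothetical unregularized weights: two strong dials, three that mostly fit noise.
raw_weights = {"gdp": 4.0, "health": 2.5, "w_in_name": 0.3, "noise_a": -0.2, "noise_b": 0.1}
penalty = 0.5
regularized = {name: soft_threshold(w, penalty) for name, w in raw_weights.items()}
kept = [name for name, w in regularized.items() if w != 0.0]
print(regularized)
print("dials still in use:", kept)
```
With a penalty of 0.5, only `gdp` and `health` survive, shrunk to 3.5 and 2.0. The three weak dials, including the letter W, are set to exactly zero. A larger penalty would switch off more dials and eventually underfit.
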
432
00:21:00.319 --> 00:21:02.839
<v Speaker 1>Wow. So what does this all mean? Let's bring this

433
00:21:02.920 --> 00:21:08.480
<v Speaker 1>all together. Machine learning isn't magic. It is fundamentally about experience, task,

434
00:21:08.559 --> 00:21:12.119
<v Speaker 1>and performance. We've seen how algorithms learn, whether they're

435
00:21:12.160 --> 00:21:16.039
<v Speaker 1>relying on a teacher for labeled answers, exploring multidimensional clusters

436
00:21:16.039 --> 00:21:19.000
<v Speaker 1>as an observer, or playing a million games as an agent.

437
00:21:19.400 --> 00:21:22.400
<v Speaker 1>We've seen how they generalize, either by calculating the distance

438
00:21:22.400 --> 00:21:26.000
<v Speaker 1>to past instances or by turning mathematical dials to minimize

439
00:21:26.000 --> 00:21:28.880
<v Speaker 1>a cost function and build a model. And most importantly,

440
00:21:28.920 --> 00:21:33.359
<v Speaker 1>we've seen how these incredibly powerful systems are entirely at

441
00:21:33.359 --> 00:21:36.240
<v Speaker 1>the mercy of the data we feed them. The biggest

442
00:21:36.240 --> 00:21:38.640
<v Speaker 1>takeaway here for you, the learner, is that the next

443
00:21:38.680 --> 00:21:41.480
<v Speaker 1>time you interact with a smart algorithm in your daily life,

444
00:21:41.480 --> 00:21:44.960
<v Speaker 1>whether it's a loan approval, a resume screener, or a social

445
00:21:45.000 --> 00:21:48.359
<v Speaker 1>media feed, you really shouldn't ask how smart is the

446
00:21:48.440 --> 00:21:51.799
<v Speaker 1>math? You should be asking, what data was this trained on?

447
00:21:52.200 --> 00:21:55.640
<v Speaker 2>It is the defining question of our era, and if

448
00:21:55.640 --> 00:21:58.480
<v Speaker 2>we connect this to the bigger picture, it leads us

449
00:21:58.519 --> 00:22:01.319
<v Speaker 2>to something quite profound to consider. If a machine

450
00:22:01.400 --> 00:22:04.559
<v Speaker 2>learning model requires millions of examples of historical human behavior

451
00:22:04.599 --> 00:22:07.359
<v Speaker 2>to optimize its rules, and we know that our historical

452
00:22:07.440 --> 00:22:11.160
<v Speaker 2>data is riddled with sampling biases, blind spots, and flawed decisions,

453
00:22:11.559 --> 00:22:16.279
<v Speaker 2>does an AI eventually transcend our limitations, or, by mathematically

454
00:22:16.279 --> 00:22:19.599
<v Speaker 2>optimizing itself against our past, does it simply become a highly

455
00:22:19.640 --> 00:22:23.319
<v Speaker 2>efficient, automated mirror of our own historical prejudices?

456
00:22:23.680 --> 00:22:26.599
<v Speaker 1>Man, that is a fascinating thought to keep you up

457
00:22:26.640 --> 00:22:28.960
<v Speaker 1>at night. Thank you for joining us on this deep

458
00:22:29.000 --> 00:22:31.680
<v Speaker 1>dive into the true mechanics of the algorithms that are

459
00:22:31.799 --> 00:22:35.160
<v Speaker 1>quietly running our world. Keep questioning the data, keep learning,

460
00:22:35.480 --> 00:22:36.960
<v Speaker 1>and we will catch you next time.
