WEBVTT

1
00:00:00.120 --> 00:00:03.319
<v Speaker 1>So back in twenty twelve, Google did something really wild.

2
00:00:03.399 --> 00:00:08.160
<v Speaker 1>They fed a machine something like ten million completely random,

3
00:00:08.240 --> 00:00:10.679
<v Speaker 1>chaotic frames of YouTube videos. Right.

4
00:00:10.679 --> 00:00:12.720
<v Speaker 2>And the crazy part is they didn't write a single

5
00:00:12.759 --> 00:00:15.439
<v Speaker 2>line of code telling this machine what to look for.

6
00:00:15.640 --> 00:00:19.760
<v Speaker 1>Yeah, exactly, no instructions about shapes, no definitions of animals,

7
00:00:20.120 --> 00:00:22.640
<v Speaker 1>and completely on its own, just sorting through the sheer

8
00:00:22.719 --> 00:00:26.519
<v Speaker 1>chaos of the Internet. The machine independently formed the concept

9
00:00:26.600 --> 00:00:27.199
<v Speaker 1>of a cat.

10
00:00:27.359 --> 00:00:31.839
<v Speaker 2>It's fascinating. It physically reorganized its internal mathematical structures to

11
00:00:32.000 --> 00:00:33.600
<v Speaker 2>recognize a feline face.

12
00:00:34.200 --> 00:00:36.600
<v Speaker 1>Just out of nowhere. It is. And today we are

13
00:00:36.640 --> 00:00:39.119
<v Speaker 1>tearing into how that is even remotely possible.

14
00:00:40.079 --> 00:00:42.560
<v Speaker 2>Welcome to the Deep Dive. For you listening, you know,

15
00:00:42.600 --> 00:00:44.439
<v Speaker 2>whether you're building models yourself or you just want a

16
00:00:44.439 --> 00:00:47.079
<v Speaker 2>masterclass in the mechanics of the modern world. Our source

17
00:00:47.079 --> 00:00:50.880
<v Speaker 2>today is Java Deep Learning Essentials by Yusuke Sugomori.

18
00:00:50.920 --> 00:00:53.240
<v Speaker 1>But don't worry, we are entirely bypassing all the heavy

19
00:00:53.320 --> 00:00:57.880
<v Speaker 1>Java syntax today. Oh, absolutely, we're leaving the code behind.

20
00:00:58.520 --> 00:01:01.439
<v Speaker 1>The mission for this deep dive is to extract the pure,

21
00:01:01.679 --> 00:01:04.879
<v Speaker 1>brilliant logic of how we got machines to stop acting

22
00:01:04.920 --> 00:01:10.640
<v Speaker 1>like rigid calculators and start, well, hallucinating entirely new realities.

23
00:01:11.000 --> 00:01:14.599
<v Speaker 2>Setting the baseline here is so crucial because you know,

24
00:01:15.239 --> 00:01:20.159
<v Speaker 2>the cultural definition of artificial intelligence has been completely diluted.

25
00:01:19.799 --> 00:01:22.799
<v Speaker 1>Right, like your smart toaster might have AI printed on the box now.

26
00:01:22.719 --> 00:01:26.920
<v Speaker 2>Exactly, but running a basic predictive thermostat

27
00:01:27.000 --> 00:01:30.799
<v Speaker 2>or say, a simple loop of robotic movements that is

28
00:01:30.840 --> 00:01:33.879
<v Speaker 2>fundamentally different from the architecture that learned to see that cat.

29
00:01:34.040 --> 00:01:34.560
<v Speaker 1>Yeah.

30
00:01:34.640 --> 00:01:37.920
<v Speaker 2>To genuinely understand the tectonic shift of modern deep learning,

31
00:01:38.120 --> 00:01:41.239
<v Speaker 2>we really have to look at the graveyard of past methodology.

32
00:01:40.760 --> 00:01:43.680
<v Speaker 1>The booms and the busts. So to understand why modern

33
00:01:43.760 --> 00:01:47.000
<v Speaker 1>machine learning is so revolutionary, let's explore those past failures.

34
00:01:47.400 --> 00:01:50.200
<v Speaker 1>The first major wave hit in the nineteen fifties, right,

35
00:01:50.519 --> 00:01:52.159
<v Speaker 1>driven by search algorithms, right?

36
00:01:52.040 --> 00:01:54.280
<v Speaker 2>Things like depth first search and breadth first search.

37
00:01:54.719 --> 00:01:58.239
<v Speaker 1>The fundamental approach back then was to give a machine

38
00:01:58.400 --> 00:02:01.560
<v Speaker 1>a strict set of rules and then have it rapidly

39
00:02:01.599 --> 00:02:05.319
<v Speaker 1>calculate through a tree of possibilities to find an optimal outcome,

40
00:02:05.879 --> 00:02:09.439
<v Speaker 1>Which is why early computers look like absolute geniuses when

41
00:02:09.479 --> 00:02:10.400
<v Speaker 1>they were playing chess.

42
00:02:10.560 --> 00:02:14.319
<v Speaker 2>Yeah, because a chess board is the ultimate closed ecosystem.

43
00:02:14.479 --> 00:02:17.879
<v Speaker 1>Exactly. It has an eight by eight grid, discrete pieces,

44
00:02:17.919 --> 00:02:22.240
<v Speaker 1>immutable rules. The machine just generates millions of branching future

45
00:02:22.280 --> 00:02:25.680
<v Speaker 1>moves and calculates the mathematical path to victory.

46
00:02:25.960 --> 00:02:29.280
<v Speaker 2>And people watched the machine dismantle a chess grand master

47
00:02:29.840 --> 00:02:34.039
<v Speaker 2>and assumed, well, human like artificial intelligence was only a

48
00:02:34.080 --> 00:02:34.800
<v Speaker 2>few years away.

49
00:02:34.879 --> 00:02:36.280
<v Speaker 1>They thought it was right around the corner.

50
00:02:36.400 --> 00:02:38.759
<v Speaker 2>They really did. The assumption was that you could just

51
00:02:38.800 --> 00:02:41.719
<v Speaker 2>scale up that search algorithm to handle real world problems.

52
00:02:42.080 --> 00:02:46.000
<v Speaker 2>But that assumption shattered against a massive theoretical wall known

53
00:02:46.039 --> 00:02:49.280
<v Speaker 2>as the frame problem. Oh, the frame problem. Yeah, a

54
00:02:49.319 --> 00:02:52.240
<v Speaker 2>search algorithm functions perfectly when the frame of reality is

55
00:02:52.319 --> 00:02:55.560
<v Speaker 2>artificially limited, but the moment you drop that machine into

56
00:02:55.560 --> 00:02:58.840
<v Speaker 2>the actual physical world, it paralyzes itself.

57
00:02:58.599 --> 00:03:02.199
<v Speaker 1>Because human beings can instantly, like, unconsciously filter out an

58
00:03:02.240 --> 00:03:05.400
<v Speaker 1>infinite amount of irrelevant data, and a rule-based machine

59
00:03:05.439 --> 00:03:06.479
<v Speaker 1>can't.

60
00:03:06.960 --> 00:03:10.360
<v Speaker 2>Precisely. It operates on absolute logic, so it has no intuition

61
00:03:10.439 --> 00:03:11.199
<v Speaker 2>for what to ignore.

62
00:03:11.400 --> 00:03:14.719
<v Speaker 1>Wait, so the frame problem is like, it's like asking

63
00:03:14.759 --> 00:03:16.520
<v Speaker 1>a robot to make a cup of tea in a

64
00:03:16.560 --> 00:03:19.120
<v Speaker 1>normal kitchen and it immediately freezes.

65
00:03:19.080 --> 00:03:22.800
<v Speaker 2>Right, because it's actively trying to calculate the current atmospheric

66
00:03:22.840 --> 00:03:23.840
<v Speaker 2>pressure.

67
00:03:23.879 --> 00:03:27.479
<v Speaker 1>Exactly. It's calculating the exact atomic structure of the ceramic mug.

68
00:03:27.599 --> 00:03:30.599
<v Speaker 1>And, I don't know, the gravitational pull of Jupiter before

69
00:03:30.639 --> 00:03:33.240
<v Speaker 1>it feels authorized to turn on the kettle.

70
00:03:33.240 --> 00:03:36.280
<v Speaker 2>Because we never explicitly programmed it to ignore Jupiter's gravity.

71
00:03:36.639 --> 00:03:39.680
<v Speaker 1>Yeah, so it factors into the tea making equation. That's wild.

72
00:03:39.840 --> 00:03:43.840
<v Speaker 2>The computational explosion makes action impossible. It becomes trapped in

73
00:03:43.879 --> 00:03:47.039
<v Speaker 2>an infinite loop of processing variables that have zero bearing

74
00:03:47.080 --> 00:03:50.240
<v Speaker 2>on the task. And that failure essentially ended that first

75
00:03:50.280 --> 00:03:51.039
<v Speaker 2>era of AI.

76
00:03:51.280 --> 00:03:54.120
<v Speaker 1>So then came the pivot, arriving in the nineteen eighties.

77
00:03:54.199 --> 00:03:57.280
<v Speaker 1>Researchers tried to bypass the machine's lack of intuition by

78
00:03:57.319 --> 00:04:00.400
<v Speaker 1>basically brute forcing context into its memory.

79
00:04:00.520 --> 00:04:03.360
<v Speaker 2>Right, this is the knowledge representation boom. The second boom.

80
00:04:03.439 --> 00:04:06.520
<v Speaker 1>Yeah, The logic was, if the machine freezes because it

81
00:04:06.520 --> 00:04:09.840
<v Speaker 1>doesn't know enough about the world, let's simply sit down

82
00:04:09.960 --> 00:04:14.000
<v Speaker 1>and manually encode the entirety of human knowledge into a database.

83
00:04:14.319 --> 00:04:18.000
<v Speaker 2>Projects like the Cyc database or the Semantic Web were

84
00:04:18.040 --> 00:04:22.319
<v Speaker 2>incredibly tedious efforts to build absolute dictionaries of reality.

85
00:04:22.439 --> 00:04:25.360
<v Speaker 1>Typing in rules manually, like a dog is a mammal,

86
00:04:25.480 --> 00:04:28.079
<v Speaker 1>and water is wet, and Tokyo is in Japan.

87
00:04:28.480 --> 00:04:31.600
<v Speaker 2>You're trying to build a semantic web of relationships, so

88
00:04:31.639 --> 00:04:34.759
<v Speaker 2>the machine has a reference point for every scenario. But

89
00:04:34.839 --> 00:04:38.920
<v Speaker 2>that leads straight into the second wall, the symbol grounding problem.

90
00:04:38.959 --> 00:04:39.959
<v Speaker 1>Okay, let's unpack that.

91
00:04:40.079 --> 00:04:42.480
<v Speaker 2>Well, you can feed a machine a dictionary and it

92
00:04:42.480 --> 00:04:45.439
<v Speaker 2>can parse the syntax perfectly. It can tell you that green

93
00:04:45.800 --> 00:04:48.319
<v Speaker 2>plus apple equals green apple.

94
00:04:48.240 --> 00:04:51.079
<v Speaker 1>But it has no actual concept of what an apple tastes or

95
00:04:51.079 --> 00:04:52.079
<v Speaker 1>feels like.

96
00:04:52.120 --> 00:04:53.800
<v Speaker 2>Exactly. It's completely devoid of semantics.

97
00:04:53.879 --> 00:04:56.480
<v Speaker 1>So it knows the equation, but it has no concept

98
00:04:56.519 --> 00:04:58.920
<v Speaker 1>of the crisp snap of the skin, or the tartness

99
00:04:58.920 --> 00:05:00.560
<v Speaker 1>of the juice, or the weight of it in your hand.

100
00:05:00.680 --> 00:05:03.879
<v Speaker 2>To the machine, apple is nothing more than a string

101
00:05:03.959 --> 00:05:08.360
<v Speaker 2>of ASCII characters. It manipulates the symbols flawlessly according

102
00:05:08.360 --> 00:05:10.639
<v Speaker 2>to the grammar we gave it, but those symbols are

103
00:05:10.680 --> 00:05:14.160
<v Speaker 2>never grounded in actual experiential reality.

104
00:05:14.480 --> 00:05:17.920
<v Speaker 1>Humans inherently catch the defining features of an object, but

105
00:05:18.040 --> 00:05:20.959
<v Speaker 1>machines at that stage only saw symbols, and because they

106
00:05:20.959 --> 00:05:25.360
<v Speaker 1>couldn't grasp the underlying concepts, they were incredibly fragile.

107
00:05:25.040 --> 00:05:29.040
<v Speaker 2>Extremely fragile. Confronted with a new situation that deviated even

108
00:05:29.120 --> 00:05:33.240
<v Speaker 2>slightly from their manually programmed dictionary, they just failed completely.

109
00:05:33.360 --> 00:05:36.839
<v Speaker 1>So, since machines couldn't manually learn every rule in the universe,

110
00:05:37.199 --> 00:05:38.560
<v Speaker 1>scientists flipped the script.

111
00:05:38.279 --> 00:05:40.800
<v Speaker 2>Right. They abandoned the attempt to teach the

112
00:05:40.839 --> 00:05:42.399
<v Speaker 2>computer the rules of the universe.

113
00:05:42.439 --> 00:05:44.720
<v Speaker 1>Instead of teaching rules, they thought, what if the machine

114
00:05:44.720 --> 00:05:47.600
<v Speaker 1>looked for patterns? You build an architecture that allows the

115
00:05:47.639 --> 00:05:50.759
<v Speaker 1>computer to look at raw data and deduce the dividing

116
00:05:50.759 --> 00:05:52.360
<v Speaker 1>lines itself.

117
00:05:52.199 --> 00:05:54.519
<v Speaker 2>And this brings us out of the AI winter and into the third

118
00:05:55.040 --> 00:05:59.600
<v Speaker 2>boom: machine learning. The fundamental mechanics shift from deductive rule

119
00:05:59.639 --> 00:06:03.120
<v Speaker 2>following to inductive statistical pattern recognition.

120
00:06:03.439 --> 00:06:05.839
<v Speaker 1>Okay, so you take an algorithm, flood it with data,

121
00:06:06.199 --> 00:06:09.920
<v Speaker 1>and ask it to find the mathematical boundaries between different categories.

122
00:06:10.079 --> 00:06:14.240
<v Speaker 2>Yes, and when we look at unsupervised learning, where the

123
00:06:14.319 --> 00:06:18.759
<v Speaker 2>data is entirely raw and unlabeled, the algorithm's only job

124
00:06:18.879 --> 00:06:20.920
<v Speaker 2>is to find hidden structures.

125
00:06:21.040 --> 00:06:23.439
<v Speaker 1>Like that famous retail case study with the diapers and

126
00:06:23.480 --> 00:06:23.839
<v Speaker 1>the beer.

127
00:06:24.079 --> 00:06:28.319
<v Speaker 2>Exactly, a major supermarket fed millions of raw checkout logs

128
00:06:28.360 --> 00:06:31.839
<v Speaker 2>into a machine learning algorithm. The machine didn't know what

129
00:06:31.879 --> 00:06:34.759
<v Speaker 2>the symbols for diapers or beer actually meant.

130
00:06:34.519 --> 00:06:38.000
<v Speaker 1>Because the symbol grounding problem still applies here, right?

131
00:06:38.000 --> 00:06:42.639
<v Speaker 2>Right, but it recognized a profound statistical correlation. It noticed that consumers

132
00:06:42.680 --> 00:06:46.240
<v Speaker 2>purchasing diapers late on a Friday night had a highly

133
00:06:46.279 --> 00:06:49.399
<v Speaker 2>elevated probability of simultaneously purchasing beer.

134
00:06:49.480 --> 00:06:52.160
<v Speaker 1>So the machine maps the frequency, the store moves the

135
00:06:52.160 --> 00:06:55.160
<v Speaker 1>beer aisle next to the diapers, and the profit margins spike.

136
00:06:55.360 --> 00:06:58.040
<v Speaker 2>That's unsupervised learning in a nutshell. But then we have

137
00:06:58.079 --> 00:07:01.560
<v Speaker 2>supervised learning where we do provide examples, and the book

138
00:07:01.639 --> 00:07:05.120
<v Speaker 2>highlights support vector machines or SVMs to handle this.

139
00:07:05.360 --> 00:07:07.519
<v Speaker 1>And this is where the math gets incredibly elegant.

140
00:07:07.720 --> 00:07:10.399
<v Speaker 2>It really does. If you have a massive data set

141
00:07:10.439 --> 00:07:13.600
<v Speaker 2>of, say, medical diagnostics, and you map it out on

142
00:07:13.639 --> 00:07:17.519
<v Speaker 2>a two dimensional graph, the data points for healthy and

143
00:07:17.759 --> 00:07:21.240
<v Speaker 2>sick are going to be completely overlapping and tangled together.

144
00:07:21.360 --> 00:07:23.240
<v Speaker 1>You can't just draw a straight two D line to

145
00:07:23.279 --> 00:07:23.959
<v Speaker 1>separate them.

146
00:07:24.199 --> 00:07:26.040
<v Speaker 2>No, a straight line on a flat plane is just too

147
00:07:26.120 --> 00:07:30.000
<v Speaker 2>simplistic for messy real world data. So SVMs solve this

148
00:07:30.279 --> 00:07:31.279
<v Speaker 2>using the kernel trick.

149
00:07:31.399 --> 00:07:33.040
<v Speaker 1>The kernel trick. I love this concept.

150
00:07:33.079 --> 00:07:37.079
<v Speaker 2>It's basically a method of mathematically shifting perspective. Instead of

151
00:07:37.120 --> 00:07:40.240
<v Speaker 2>trying to force a complex curved boundary through the two

152
00:07:40.319 --> 00:07:44.720
<v Speaker 2>D data, the algorithm applies a mathematical transformation like squaring

153
00:07:44.759 --> 00:07:46.519
<v Speaker 2>the distance of each point from the origin.

154
00:07:46.480 --> 00:07:50.120
<v Speaker 1>And by running that calculation, the algorithm effectively takes

155
00:07:50.120 --> 00:07:52.639
<v Speaker 1>the flat two D data and projects it outward into

156
00:07:52.639 --> 00:07:53.759
<v Speaker 1>a three dimensional space.

157
00:07:53.920 --> 00:07:56.519
<v Speaker 2>Right, the data points literally lift off the flat page.

158
00:07:56.639 --> 00:07:59.279
<v Speaker 1>It's like the math warps the space so that the

159
00:07:59.319 --> 00:08:01.879
<v Speaker 1>tangled points spread out into a three D shape like

160
00:08:01.879 --> 00:08:05.720
<v Speaker 1>a paraboloid. And once the data is suspended in three dimensions,

161
00:08:05.800 --> 00:08:09.519
<v Speaker 1>the tangled mess is suddenly separated by altitude.

162
00:08:09.839 --> 00:08:12.600
<v Speaker 2>Exactly. And at that point the SVM doesn't need to draw

163
00:08:12.720 --> 00:08:16.240
<v Speaker 2>a complex curve anymore. It just slides a perfectly flat,

164
00:08:16.439 --> 00:08:20.040
<v Speaker 2>rigid sheet of glass, a hyperplane, straight through the three D space,

165
00:08:20.040 --> 00:08:24.240
<v Speaker 1>cleanly severing the healthy data points from the sick ones.
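
NOTE
Editorial sketch of the lifting idea described in the cues above. It is a toy illustration, not a trained SVM.
Assumptions: the data points are invented, the plane height of 0.5 is hand-picked, and the mapping is applied explicitly,
whereas a real kernel method would work with inner products and never build the 3D coordinates directly.
```python
# Hypothetical toy data: "healthy" points sit near the origin, "sick" points on a ring around them.
healthy = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2), (-0.1, -0.3)]
sick = [(1.0, 0.0), (0.0, -1.1), (-0.9, 0.5), (0.7, 0.8)]
def lift(point):
    """Map a 2D point to 3D by using its squared distance from the origin as the height."""
    x, y = point
    return (x, y, x * x + y * y)
def above_plane(point3d, height=0.5):
    """Report which side of the flat plane z = height the lifted point falls on."""
    return point3d[2] > height
healthy_sides = [above_plane(lift(p)) for p in healthy]
sick_sides = [above_plane(lift(p)) for p in sick]
print(healthy_sides)  # every healthy point stays below the plane
print(sick_sides)     # every sick point ends up above it
```
No straight line separates the two groups in the flat plane, but after the lift a single height threshold does.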

166
00:08:24.399 --> 00:08:29.360
<v Speaker 2>It is an extraordinarily powerful classification tool, but traditional machine learning,

167
00:08:29.639 --> 00:08:33.960
<v Speaker 2>despite the brilliance of the kernel trick, harbored a fatal bottleneck.

168
00:08:33.720 --> 00:08:34.879
<v Speaker 1>Right, feature engineering.

169
00:08:35.000 --> 00:08:38.320
<v Speaker 2>Yes, the machine is excellent at finding the boundary, but

170
00:08:38.360 --> 00:08:40.879
<v Speaker 2>it remains completely blind to what it is actually looking

171
00:08:40.960 --> 00:08:42.480
<v Speaker 2>at unless a human tells it.

172
00:08:42.240 --> 00:08:44.639
<v Speaker 1>So you still have to define the coordinates.

173
00:08:44.799 --> 00:08:47.080
<v Speaker 1>Like if you want the SVM to identify a cat,

174
00:08:47.120 --> 00:08:49.000
<v Speaker 1>you can't just feed it a raw jpeg.

175
00:08:49.279 --> 00:08:51.519
<v Speaker 2>No, a human data scientist has to sit down and

176
00:08:51.679 --> 00:08:55.440
<v Speaker 2>manually write code that extracts the specific features for the

177
00:08:55.480 --> 00:08:56.440
<v Speaker 2>machine to evaluate.

178
00:08:56.559 --> 00:08:58.759
<v Speaker 1>You have to program it to measure the distance between

179
00:08:58.759 --> 00:09:01.000
<v Speaker 1>the pixels that make up the eye, or calculate the

180
00:09:01.039 --> 00:09:04.600
<v Speaker 1>geometric angle of the ear triangles, or isolate the hex

181
00:09:04.679 --> 00:09:06.120
<v Speaker 1>codes of the fur color.

182
00:09:06.120 --> 00:09:09.320
<v Speaker 2>And the accuracy of the entire model is bound by

183
00:09:09.399 --> 00:09:14.480
<v Speaker 2>human bias. If the human engineer selects poor features, like

184
00:09:14.919 --> 00:09:19.039
<v Speaker 2>trying to predict a neighborhood's housing prices based exclusively on

185
00:09:19.080 --> 00:09:22.399
<v Speaker 2>the number of street lights rather than square footage.

186
00:09:21.879 --> 00:09:26.240
<v Speaker 1>The algorithm will confidently execute the math and deliver absolute garbage.

187
00:09:26.639 --> 00:09:29.279
<v Speaker 1>But wait, if a human is still doing the heavy

188
00:09:29.320 --> 00:09:32.399
<v Speaker 1>lifting of feature engineering, then machine learning isn't really learning

189
00:09:32.440 --> 00:09:33.840
<v Speaker 1>independently at all, is it.

190
00:09:34.600 --> 00:09:35.600
<v Speaker 2>You've hit the nail on the head.

191
00:09:35.639 --> 00:09:38.759
<v Speaker 1>It's just a hyper-fast sorter based on our personal intuition.

192
00:09:39.360 --> 00:09:42.720
<v Speaker 2>That is exactly why machine learning plateaued. It lacked the

193
00:09:42.759 --> 00:09:46.559
<v Speaker 2>metacognitive ability to look at a raw environment and independently

194
00:09:46.600 --> 00:09:48.399
<v Speaker 2>determine which features actually mattered.

195
00:09:48.600 --> 00:09:51.519
<v Speaker 1>So how did we finally break through that feature engineering wall?

196
00:09:51.559 --> 00:09:54.399
<v Speaker 1>Because that leads us to the ultimate game changer, deep learning.

197
00:09:54.799 --> 00:09:59.559
<v Speaker 2>Right. Historically, researchers knew that artificial neural networks theoretically had

198
00:09:59.679 --> 00:10:02.559
<v Speaker 2>this potential, but they couldn't get them to work at scale.

199
00:10:03.080 --> 00:10:05.200
<v Speaker 2>That changes with a two thousand and six paper by

200
00:10:05.240 --> 00:10:08.000
<v Speaker 2>Geoffrey Hinton introducing deep belief nets.

201
00:10:07.879 --> 00:10:11.799
<v Speaker 1>Which was largely ignored until twenty twelve, right? The ImageNet

202
00:10:11.879 --> 00:10:14.399
<v Speaker 1>Large Scale Visual Recognition Challenge.

203
00:10:14.080 --> 00:10:19.320
<v Speaker 2>Yes, the ILSVRC. Historically, teams of PhDs would spend

204
00:10:19.320 --> 00:10:24.000
<v Speaker 2>an entire year painstakingly tweaking their manual feature engineering, fighting

205
00:10:24.039 --> 00:10:27.519
<v Speaker 2>tooth and nail just to push their image recognition accuracy

206
00:10:27.639 --> 00:10:29.559
<v Speaker 2>up by a fraction of a single percent.

207
00:10:29.799 --> 00:10:33.559
<v Speaker 1>So the field was accustomed to microscopic, agonizing progress.

208
00:10:34.000 --> 00:10:38.679
<v Speaker 2>Then a team called SuperVision, utilizing deep learning algorithms, entered

209
00:10:38.679 --> 00:10:43.240
<v Speaker 2>the twenty twelve contest. They abandoned human engineered features entirely.

210
00:10:43.799 --> 00:10:46.879
<v Speaker 2>They fed the raw image pixels directly into a deep

211
00:10:46.919 --> 00:10:50.159
<v Speaker 2>neural network, and they didn't just win, they obliterated the

212
00:10:50.240 --> 00:10:50.960
<v Speaker 2>historical curve.

213
00:10:51.080 --> 00:10:53.720
<v Speaker 1>They beat the second place team by a staggering margin

214
00:10:53.759 --> 00:10:55.039
<v Speaker 1>of over ten percent.

215
00:10:55.000 --> 00:10:58.120
<v Speaker 2>And in the context of computer vision, a ten percent leap

216
00:10:58.159 --> 00:11:00.919
<v Speaker 2>in a single year was viewed as an almost alien intervention.

217
00:11:00.559 --> 00:11:03.639
<v Speaker 1>Which brings us directly back to that Google experiment

218
00:11:03.639 --> 00:11:06.879
<v Speaker 1>we started with. By feeding those ten million raw, unlabeled

219
00:11:06.919 --> 00:11:10.600
<v Speaker 1>YouTube frames into a deep neural network, the system independently

220
00:11:10.639 --> 00:11:14.000
<v Speaker 1>deduced the recurring mathematical structures that constituted a cat.

221
00:11:14.440 --> 00:11:17.159
<v Speaker 2>It effectively solved the symbol grounding problem that killed the

222
00:11:17.200 --> 00:11:18.879
<v Speaker 2>AI boom of the nineteen eighties.

223
00:11:19.200 --> 00:11:21.720
<v Speaker 1>So deep learning is doing what we failed to do

224
00:11:21.799 --> 00:11:25.639
<v Speaker 1>in the nineteen eighties. It's solving the symbol grounding problem

225
00:11:25.720 --> 00:11:28.919
<v Speaker 1>by figuring out the signified, the actual concept of the

226
00:11:28.919 --> 00:11:30.480
<v Speaker 1>thing completely on its own.

227
00:11:30.600 --> 00:11:34.320
<v Speaker 2>Yes, it didn't just learn a symbol. By analyzing millions

228
00:11:34.320 --> 00:11:38.080
<v Speaker 2>of variations of lighting, angles, and shapes. It isolated the

229
00:11:38.120 --> 00:11:43.759
<v Speaker 2>foundational underlying concept of catness entirely independent of human labeling.

230
00:11:43.480 --> 00:11:47.519
<v Speaker 1>That bridges the gap between raw physical data and conceptual understanding.

231
00:11:47.639 --> 00:11:51.399
<v Speaker 2>To prove just how thoroughly these deep networks internalized these concepts,

232
00:11:51.720 --> 00:11:55.320
<v Speaker 2>Google engineers later developed a technique called Inceptionism, widely known

233
00:11:55.360 --> 00:11:56.120
<v Speaker 2>as DeepDream.

234
00:11:55.919 --> 00:11:58.960
<v Speaker 1>Oh, DeepDream, the nightmare art.

235
00:11:58.600 --> 00:12:02.840
<v Speaker 2>Exactly. In operation, data flows forward through the network,

236
00:12:02.879 --> 00:12:05.320
<v Speaker 2>and the machine outputs a classification of what it sees.

237
00:12:05.960 --> 00:12:09.080
<v Speaker 2>With inceptionism, the engineers reverse the feedback loop.

238
00:12:09.320 --> 00:12:11.639
<v Speaker 1>They fed an image into the network and commanded it

239
00:12:11.679 --> 00:12:16.159
<v Speaker 1>to mathematically amplify whatever patterns it vaguely recognized, a feedback

240
00:12:16.200 --> 00:12:17.960
<v Speaker 1>loop of pure pattern recognition.

241
00:12:18.279 --> 00:12:20.840
<v Speaker 2>So if the network is scanning an image of a blurry,

242
00:12:20.879 --> 00:12:25.519
<v Speaker 2>overcast sky and a cluster of pixels vaguely corresponds to

243
00:12:25.559 --> 00:12:29.679
<v Speaker 2>the internal mathematical weight the network associates with a bird's beak,

244
00:12:29.360 --> 00:12:31.840
<v Speaker 1>it alters the image to make those pixels look

245
00:12:31.879 --> 00:12:34.519
<v Speaker 1>slightly more like a beak, and then it feeds that

246
00:12:34.639 --> 00:12:36.639
<v Speaker 1>altered image back into its own input.

247
00:12:36.879 --> 00:12:39.519
<v Speaker 2>Right. Now the beak is more pronounced, so the network

248
00:12:39.519 --> 00:12:42.679
<v Speaker 2>confidently hallucinates the eyes and then the feathers.

249
00:12:42.879 --> 00:12:46.679
<v Speaker 1>It runs this recursive loop until a highly detailed, psychedelic,

250
00:12:46.799 --> 00:12:50.039
<v Speaker 1>multi eyed bird physically manifests out of thin air in

251
00:12:50.080 --> 00:12:51.200
<v Speaker 1>the middle of a cloud bank.

252
00:12:51.360 --> 00:12:55.240
<v Speaker 2>It is generating novel imagery based on its deeply internalized

253
00:12:55.320 --> 00:12:58.480
<v Speaker 2>understanding of features. It proves that the network isn't just

254
00:12:58.559 --> 00:13:01.879
<v Speaker 2>matching pixels to a database. It has built a flexible,

255
00:13:02.039 --> 00:13:03.879
<v Speaker 2>generative concept of the object.

256
00:13:03.960 --> 00:13:06.159
<v Speaker 1>Okay, to truly appreciate this, for you listening, we have

257
00:13:06.200 --> 00:13:08.879
<v Speaker 1>to unpack the mechanics under the hood. Why did adding

258
00:13:08.879 --> 00:13:11.639
<v Speaker 1>the word deep suddenly unlock this capability?

259
00:13:11.879 --> 00:13:15.279
<v Speaker 2>Well, the basic concept of neural networks existed. A perceptron,

260
00:13:15.480 --> 00:13:18.360
<v Speaker 2>which is a single layer of artificial neurons loosely mimicking

261
00:13:18.440 --> 00:13:22.759
<v Speaker 2>human brain cells, takes inputs, applies mathematical weights, and outputs

262
00:13:22.759 --> 00:13:23.279
<v Speaker 2>a decision.
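
NOTE
Editorial sketch of the single perceptron just described: inputs, weights, and a yes/no decision.
Assumptions: the function name, the hand-picked weights, and the bias of -1.5 are illustrative and not from the book.
With these values the unit behaves like a logical AND, which is a linearly separable problem.
```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, followed by a hard threshold at zero."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0
# Hand-picked (hypothetical) parameters that make the unit fire only when both inputs are on.
weights = [1.0, 1.0]
bias = -1.5
cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(case, weights, bias) for case in cases]
print(outputs)  # [0, 0, 0, 1]
```
No choice of weights and bias makes this single unit compute XOR, which is one way to see why hidden layers are needed.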

263
00:13:23.320 --> 00:13:26.799
<v Speaker 1>Good for linear problems, right?

264
00:13:26.279 --> 00:13:31.480
<v Speaker 2>But researchers knew that to solve nonlinear, complex problems, they needed multilayer perceptrons.

265
00:13:32.399 --> 00:13:35.440
<v Speaker 2>You insert hidden layers of neurons between the input and

266
00:13:35.480 --> 00:13:39.360
<v Speaker 2>the output. Logic dictates that if one hidden layer is good,

267
00:13:39.799 --> 00:13:43.240
<v Speaker 2>stacking twenty layers to make a deep network should allow

268
00:13:43.279 --> 00:13:48.039
<v Speaker 2>it to process incredibly complex realities. The theoretical mathematics supported

269
00:13:48.039 --> 00:13:50.159
<v Speaker 2>that logic. But there was a villain in the story,

270
00:13:50.279 --> 00:13:54.279
<v Speaker 2>wasn't there? The vanishing gradient problem. The vanishing gradient. Neural networks

271
00:13:54.360 --> 00:13:57.639
<v Speaker 2>learn through an algorithm called backpropagation. The network makes a

272
00:13:57.639 --> 00:14:00.879
<v Speaker 2>prediction: it looks at a dog and guesses cat.

273
00:14:00.919 --> 00:14:03.840
<v Speaker 1>And a loss function calculates the mathematical error of that guess.

274
00:14:03.799 --> 00:14:07.600
<v Speaker 2>Exactly. The algorithm then takes that error and

275
00:14:07.679 --> 00:14:11.360
<v Speaker 2>propagates it backward through the network, layer by layer, adjusting

276
00:14:11.399 --> 00:14:13.919
<v Speaker 2>the mathematical weights of the connections so the network is

277
00:14:14.000 --> 00:14:15.559
<v Speaker 2>less likely to make that mistake again.

278
00:14:15.639 --> 00:14:18.799
<v Speaker 1>It's a chain of correction, but backpropagation relies on the

279
00:14:18.879 --> 00:14:20.759
<v Speaker 1>chain rule of calculus.

280
00:14:20.200 --> 00:14:22.200
<v Speaker 2>And that's where it all fell apart. As that error

281
00:14:22.240 --> 00:14:26.080
<v Speaker 2>signal moves backward through the hidden layers, you are multiplying gradients,

282
00:14:26.440 --> 00:14:29.080
<v Speaker 2>and those gradients are often fractional numbers less than one.

283
00:14:29.320 --> 00:14:32.279
<v Speaker 1>So if you multiply a fraction by a fraction by a fraction,

284
00:14:32.519 --> 00:14:35.159
<v Speaker 1>the resulting number exponentially shrinks.

285
00:14:35.240 --> 00:14:37.679
<v Speaker 2>By the time that error signal reaches the early layers

286
00:14:37.720 --> 00:14:41.000
<v Speaker 2>of a deep network, the layers closest to the raw input,

287
00:14:41.480 --> 00:14:43.600
<v Speaker 2>the number has essentially vanished to zero.

288
00:14:43.840 --> 00:14:47.919
<v Speaker 1>The error signal dilutes so severely that the foundational layers

289
00:14:47.919 --> 00:14:51.840
<v Speaker 1>of the network receive absolutely no updates. They never adjust

290
00:14:51.919 --> 00:14:53.000
<v Speaker 1>their weights, never learn.

291
00:14:53.440 --> 00:14:57.799
<v Speaker 2>Because the early layers remain untrained, the entire deep architecture

292
00:14:57.840 --> 00:15:02.320
<v Speaker 2>stalls out, rendering deep networks practically useless for decades.
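
NOTE
Editorial sketch of the shrinking error signal described above, under simplifying assumptions.
Assumptions: every layer uses a sigmoid activation (the speakers do not name one), every unit sits at the point
where the sigmoid slope is largest, and the weight terms of the chain rule are ignored, so this is a best case.
```python
import math
def sigmoid_derivative(z):
    """Slope of the sigmoid at z. It never exceeds 0.25, which it reaches at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)
local_gradient = sigmoid_derivative(0.0)  # 0.25, the largest slope a sigmoid can contribute
signal = 1.0
signals = []
for layer in range(20):
    # Chain rule: each layer the error passes back through multiplies it by that layer's local slope.
    signal *= local_gradient
    signals.append(signal)
print(signals[0])   # 0.25 after one layer
print(signals[-1])  # roughly 9e-13 after twenty layers
```
Even in this best case, twenty multiplications by 0.25 leave the earliest layer with an update about a trillion times smaller than the last layer's.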

293
00:15:02.519 --> 00:15:06.240
<v Speaker 1>Enter the hero: layer-wise pre-training, used in deep

294
00:15:06.320 --> 00:15:09.120
<v Speaker 1>belief nets and stacked denoising autoencoders.

295
00:15:09.279 --> 00:15:12.120
<v Speaker 2>The breakthrough was realizing that trying to train the entire

296
00:15:12.200 --> 00:15:14.799
<v Speaker 2>massive network at once from the output all the way

297
00:15:14.840 --> 00:15:18.440
<v Speaker 2>back to the input was mathematically impossible, so they isolated

298
00:15:18.480 --> 00:15:18.919
<v Speaker 2>the layers.

299
00:15:19.120 --> 00:15:22.240
<v Speaker 1>You train each hidden layer completely independently. But wait, if

300
00:15:22.279 --> 00:15:24.080
<v Speaker 1>you isolate a layer in the middle of the network,

301
00:15:24.120 --> 00:15:26.639
<v Speaker 1>it has no access to the final answer. It doesn't

302
00:15:26.679 --> 00:15:28.039
<v Speaker 1>know it's supposed to be looking for a cat.

303
00:15:27.919 --> 00:15:32.320
<v Speaker 2>So you employ unsupervised learning using autoencoders. You

304
00:15:32.399 --> 00:15:37.080
<v Speaker 2>give that single isolated layer a bizarrely simple task. Take

305
00:15:37.080 --> 00:15:40.759
<v Speaker 2>the raw input data, force it through a mathematical bottleneck

306
00:15:40.799 --> 00:15:43.879
<v Speaker 2>that compresses it, and then try to perfectly reconstruct the

307
00:15:43.919 --> 00:15:45.559
<v Speaker 2>original data on the other side.

308
00:15:45.639 --> 00:15:47.759
<v Speaker 1>The bottleneck is the stroke of genius.

309
00:15:48.039 --> 00:15:51.039
<v Speaker 2>It is. Because the layer cannot physically pass all the

310
00:15:51.159 --> 00:15:54.879
<v Speaker 2>raw data through the compression, it is mathematically forced to

311
00:15:54.919 --> 00:15:58.279
<v Speaker 2>discard the noise and figure out the most essential defining

312
00:15:58.320 --> 00:16:00.519
<v Speaker 2>features required to rebuild the image.
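
NOTE
Editorial sketch of the compress-then-reconstruct task described above, reduced to the smallest possible case.
Assumptions: a purely linear autoencoder with two inputs, one hidden number, and two outputs; invented data lying on
a line; plain full-batch gradient descent with a hand-picked learning rate. Real autoencoders are larger and nonlinear.
```python
# Hypothetical toy data: 2D points that all lie on the line x2 = 2 * x1.
data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = [0.1, 0.1]  # encoder weights: two inputs squeezed into one hidden number (the bottleneck)
v = [0.1, 0.1]  # decoder weights: one hidden number expanded back into two outputs
def reconstruction_loss():
    """Mean squared reconstruction error over the whole toy data set."""
    total = 0.0
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        total += sum((v[i] * h - x[i]) ** 2 for i in range(2))
    return total / len(data)
initial_loss = reconstruction_loss()
learning_rate = 0.05
for step in range(2000):
    grad_w = [0.0, 0.0]
    grad_v = [0.0, 0.0]
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        residual = [v[i] * h - x[i] for i in range(2)]
        # Gradient of the squared error with respect to the hidden value, via the decoder weights.
        grad_h = 2.0 * (residual[0] * v[0] + residual[1] * v[1])
        for i in range(2):
            grad_v[i] += 2.0 * residual[i] * h / len(data)
            grad_w[i] += grad_h * x[i] / len(data)
    for i in range(2):
        w[i] -= learning_rate * grad_w[i]
        v[i] -= learning_rate * grad_v[i]
final_loss = reconstruction_loss()
print(initial_loss, final_loss)
```
Nobody tells the network that the points lie on a line. Having only one hidden number to work with, it has to find that single direction in order to rebuild its input.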

313
00:16:00.600 --> 00:16:04.679
<v Speaker 1>Once that first layer masters the reconstruction, its output becomes

314
00:16:04.679 --> 00:16:07.519
<v Speaker 1>the input for the second layer. It creates a self-assembling

315
00:16:07.559 --> 00:16:10.200
<v Speaker 1>hierarchy of concepts.

316
00:16:10.279 --> 00:16:13.399
<v Speaker 2>Exactly. The first layer compresses raw pixels and learns to map

317
00:16:13.440 --> 00:16:17.279
<v Speaker 2>basic geometric edges and lines. The second layer isolates itself,

318
00:16:17.519 --> 00:16:20.720
<v Speaker 2>takes those lines, compresses them, and learns to map specific

319
00:16:20.759 --> 00:16:21.720
<v Speaker 2>shapes and textures.

320
00:16:21.879 --> 00:16:23.879
<v Speaker 1>And then the third layer takes those shapes and learns

321
00:16:23.919 --> 00:16:27.200
<v Speaker 1>to map complex features like eyes and noses. Because each

322
00:16:27.240 --> 00:16:30.559
<v Speaker 1>layer is trained completely independently to find structure, you completely

323
00:16:30.600 --> 00:16:32.360
<v Speaker 1>bypass the chain rule problem.

324
00:16:32.519 --> 00:16:35.600
<v Speaker 2>There is no vanishing gradient because you aren't passing an

325
00:16:35.720 --> 00:16:38.039
<v Speaker 2>error signal backward through twenty layers.

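NOTE
A rough sketch of the layer-wise pre-training idea from the cues above, not code from the book. It assumes a plain autoencoder with a sigmoid code and a linear decoder, trained by SGD on toy one-hot data. The name train_autoencoder, the layer sizes, and the learning rate are all hypothetical. Deep belief nets use RBMs, and denoising autoencoders corrupt their input first, both of which this sketch omits.
```python
import math
import random
def train_autoencoder(data, n_hidden, epochs=500, lr=0.1, seed=0):
    """Fit one autoencoder layer with plain SGD; return (encode, per-epoch losses)."""
    rng = random.Random(seed)
    n_in = len(data[0])
    # Encoder weights W (n_hidden x n_in), decoder weights V (n_in x n_hidden).
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b = [0.0] * n_hidden
    c = [0.0] * n_in
    def encode(x):
        # Sigmoid hidden code: the compressed view of the input.
        pre = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
        return [1.0 / (1.0 + math.exp(-z)) for z in pre]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = encode(x)
            # The linear decoder tries to rebuild the input from the code.
            y = [sum(v * hj for v, hj in zip(row, h)) + ci for row, ci in zip(V, c)]
            err = [yi - xi for yi, xi in zip(y, x)]
            total += 0.5 * sum(e * e for e in err)
            # Gradient of the squared error w.r.t. the hidden pre-activations.
            dh = [sum(err[i] * V[i][j] for i in range(n_in)) * h[j] * (1.0 - h[j])
                  for j in range(n_hidden)]
            for i in range(n_in):
                c[i] -= lr * err[i]
                for j in range(n_hidden):
                    V[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                b[j] -= lr * dh[j]
                for k in range(n_in):
                    W[j][k] -= lr * dh[j] * x[k]
        losses.append(total / len(data))
    return encode, losses
# Toy "raw input": eight one-hot patterns of width 8.
raw = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
# Layer 1 is trained alone, with the raw input as its own target (8 to 4 units).
encode1, losses1 = train_autoencoder(raw, n_hidden=4)
codes1 = [encode1(x) for x in raw]
# Layer 2 is trained alone on layer 1's codes (4 to 2 units).
encode2, losses2 = train_autoencoder(codes1, n_hidden=2, seed=1)
codes2 = [encode2(h) for h in codes1]
```
Each call only ever propagates error through one encoder and decoder pair, which is the point the speakers make about avoiding a long backward chain. Labels never appear anywhere in this stage.
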
326
00:16:38.240 --> 00:16:41.399
<v Speaker 1>Once every layer has been pre-trained to recognize this

327
00:16:41.519 --> 00:16:45.039
<v Speaker 1>hierarchy of features, you assemble the full network, attach a

328
00:16:45.080 --> 00:16:48.080
<v Speaker 1>final output layer, and perform fine-tuning.

329
00:16:48.440 --> 00:16:51.639
<v Speaker 2>Now, when you run backpropagation with labeled data, the

330
00:16:51.679 --> 00:16:54.679
<v Speaker 2>network already knows how to see. It already has the

331
00:16:54.679 --> 00:16:58.879
<v Speaker 2>mathematical weights for edges, shapes, and textures perfectly established.

332
00:16:59.039 --> 00:17:02.320
<v Speaker 1>It only requires minor adjustments to realize that the combination

333
00:17:02.399 --> 00:17:05.359
<v Speaker 1>of those specific shapes is called a cat. It has

334
00:17:05.440 --> 00:17:07.160
<v Speaker 1>essentially engineered its own features.

335
00:17:07.480 --> 00:17:11.920
<v Speaker 2>But building a massive, deeply layered network introduces another vulnerability.

336
00:17:12.559 --> 00:17:15.759
<v Speaker 2>If a network has millions of perfectly tuned connections, it

337
00:17:15.799 --> 00:17:17.240
<v Speaker 2>becomes prone to overfitting.

338
00:17:17.680 --> 00:17:20.799
<v Speaker 1>It memorizes the training data so rigidly that it loses

339
00:17:20.880 --> 00:17:23.599
<v Speaker 1>the flexibility to recognize a cat in a slightly different

340
00:17:23.680 --> 00:17:24.359
<v Speaker 1>lighting condition.

341
00:17:24.680 --> 00:17:29.079
<v Speaker 2>To shatter that rigidity, the architecture employs a remarkably counterintuitive

342
00:17:29.079 --> 00:17:30.519
<v Speaker 2>trick called dropout.

343
00:17:30.720 --> 00:17:33.400
<v Speaker 1>Wait, let's unpack dropout. You're telling me that physically

344
00:17:33.440 --> 00:17:37.160
<v Speaker 1>severing the brain's connections randomly during training actually makes it smarter.

345
00:17:37.519 --> 00:17:41.079
<v Speaker 2>It sounds counterproductive, but yes. During the fine-tuning training phase,

346
00:17:41.240 --> 00:17:46.000
<v Speaker 2>the algorithm will literally sever connections between neurons completely at random.

347
00:17:46.400 --> 00:17:49.440
<v Speaker 2>It temporarily drops a random percentage of the network out

348
00:17:49.440 --> 00:17:52.480
<v Speaker 2>of existence for that specific training pass.

349
00:17:52.599 --> 00:17:56.480
<v Speaker 1>You are physically lobotomizing the network during its training. It's

350
00:17:56.519 --> 00:17:59.480
<v Speaker 1>like forcing someone to learn to ride a

351
00:17:59.480 --> 00:18:01.920
<v Speaker 1>bike while randomly taking away one of their senses.

352
00:18:02.000 --> 00:18:04.279
<v Speaker 2>That's a great way to look at it.

353
00:18:04.079 --> 00:18:07.279
<v Speaker 1>Like, while they are pedaling on a tightrope, you randomly blindfold them,

354
00:18:07.599 --> 00:18:10.839
<v Speaker 1>and then you randomly inject a massive dose of novocaine

355
00:18:10.880 --> 00:18:14.599
<v Speaker 1>into their left leg. By randomly stripping away their senses,

356
00:18:14.920 --> 00:18:18.880
<v Speaker 1>you force their central nervous system to develop an incredibly robust,

357
00:18:19.279 --> 00:18:22.960
<v Speaker 1>bulletproof sense of core balance that doesn't rely on any

358
00:18:23.000 --> 00:18:23.799
<v Speaker 1>single crutch.

359
00:18:23.880 --> 00:18:27.359
<v Speaker 2>Precisely. Because the neural network knows that any given neuron

360
00:18:27.440 --> 00:18:31.200
<v Speaker 2>might spontaneously drop out during training, it cannot rely on

361
00:18:31.279 --> 00:18:35.119
<v Speaker 2>any single fragile pathway to recognize a feature.

362
00:18:34.880 --> 00:18:38.519
<v Speaker 1>It is forced to distribute the concept across multiple redundant pathways.

363
00:18:38.599 --> 00:18:42.160
<v Speaker 2>The mathematical representation of the object becomes deeply embedded and

364
00:18:42.200 --> 00:18:43.279
<v Speaker 2>structurally resilient.

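NOTE
A minimal sketch of dropout as described in the cues above, assuming the common "inverted dropout" variant, which may differ from the book's implementation. It zeroes whole units rather than individual connections. The function name dropout, the parameter p_drop, and the sample activations are hypothetical.
```python
import random
def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop during training."""
    if not training or p_drop == 0.0:
        # At test time every unit stays active and nothing is rescaled.
        return list(activations)
    keep = 1.0 - p_drop
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
rng = random.Random(0)
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
# A fresh random mask is drawn on every training pass.
train_pass = dropout(hidden, p_drop=0.5, rng=rng)
# With training=False the layer is passed through untouched.
test_pass = dropout(hidden, p_drop=0.5, rng=rng, training=False)
```
Because the mask changes on every pass, no downstream unit can count on any particular input being present, which is the redundancy argument the speakers make.
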
365
00:18:43.559 --> 00:18:47.160
<v Speaker 1>For you listening, grasping this evolution from the flat hyperplanes

366
00:18:47.160 --> 00:18:50.240
<v Speaker 1>of the kernel trick to the hierarchical compression of

367
00:18:50.319 --> 00:18:54.400
<v Speaker 1>autoencoders to the deliberate chaos of dropout means you are

368
00:18:54.440 --> 00:18:58.680
<v Speaker 1>really looking past the superficial buzzwords of modern technology. You

369
00:18:58.799 --> 00:19:02.279
<v Speaker 1>now actually grasp the profound mechanics of how human intuition

370
00:19:02.480 --> 00:19:04.680
<v Speaker 1>was mathematically outsourced to the machine.

371
00:19:04.720 --> 00:19:08.400
<v Speaker 2>And understanding those mechanics is vital because the hardware executing

372
00:19:08.400 --> 00:19:12.480
<v Speaker 2>these algorithms is scaling at a terrifying velocity. None of

373
00:19:12.519 --> 00:19:16.920
<v Speaker 2>the architectural breakthroughs of deep learning mattered until physical processors

374
00:19:16.920 --> 00:19:18.200
<v Speaker 2>could handle the math.

375
00:19:18.519 --> 00:19:21.400
<v Speaker 1>Right. I mean, Google required a cluster of a thousand machines

376
00:19:21.480 --> 00:19:24.720
<v Speaker 1>running for three straight days just to find that original cat.

377
00:19:24.920 --> 00:19:27.079
<v Speaker 2>The theory had to wait for the silicon to catch up.

378
00:19:27.880 --> 00:19:30.359
<v Speaker 2>But Moore's law dictates that the number of transistors on

379
00:19:30.400 --> 00:19:33.279
<v Speaker 2>a microchip doubles roughly every eighteen months.

380
00:19:33.440 --> 00:19:36.319
<v Speaker 1>If you track that exponential curve forward, we are rapidly

381
00:19:36.359 --> 00:19:38.119
<v Speaker 1>approaching the year twenty forty five.

382
00:19:38.200 --> 00:19:41.400
<v Speaker 2>Yes, twenty forty five is the widely projected date for

383
00:19:41.440 --> 00:19:44.640
<v Speaker 2>the technological singularity. At that point on the curve, a

384
00:19:44.680 --> 00:19:48.359
<v Speaker 2>single processor is expected to house more than ten billion transistors.

385
00:19:49.039 --> 00:19:52.279
<v Speaker 1>That transcends the number of biological cells in the human brain.

386
00:19:52.480 --> 00:19:57.000
<v Speaker 2>The computational capacity crosses a threshold where machines achieve self-recursive

387
00:19:57.039 --> 00:20:00.720
<v Speaker 2>intelligence. They will possess the hardware and the

388
00:20:00.759 --> 00:20:04.759
<v Speaker 2>deep architecture required to rapidly redesign and optimize their own

389
00:20:04.799 --> 00:20:09.079
<v Speaker 2>software and hardware loops, entirely independent of human engineers.

390
00:20:09.200 --> 00:20:13.279
<v Speaker 1>The ultimate abandonment of human feature engineering. The book leaves us

391
00:20:13.279 --> 00:20:16.720
<v Speaker 1>with a stark quotation from the late theoretical physicist Stephen

392
00:20:16.759 --> 00:20:17.759
<v Speaker 1>Hawking.

393
00:20:17.880 --> 00:20:21.119
<v Speaker 2>Right. He warned that the development of full artificial intelligence could

394
00:20:21.119 --> 00:20:23.000
<v Speaker 2>spell the end of the human race.

395
00:20:23.400 --> 00:20:26.759
<v Speaker 1>Because from the nineteen fifties chessboards to the twenty twelve

396
00:20:26.839 --> 00:20:30.480
<v Speaker 1>ImageNet massacre, human beings were the ones pulling the strings

397
00:20:30.480 --> 00:20:32.119
<v Speaker 1>and defining the loss functions.

398
00:20:32.279 --> 00:20:35.599
<v Speaker 2>We provided the data and established the ultimate goals, even

399
00:20:35.599 --> 00:20:38.240
<v Speaker 2>when the machines learned to map the paths themselves.

400
00:20:38.480 --> 00:20:40.640
<v Speaker 1>So, keeping all these mechanics in mind, I want you

401
00:20:40.680 --> 00:20:43.680
<v Speaker 1>to mull this over. The machines have already conquered the

402
00:20:43.720 --> 00:20:46.519
<v Speaker 1>frame problem by learning to filter out the noise of reality.

403
00:20:46.920 --> 00:20:50.359
<v Speaker 1>They have conquered the symbol grounding problem by internalizing the

404
00:20:50.359 --> 00:20:52.400
<v Speaker 1>structural concepts of physical objects.

405
00:20:52.680 --> 00:20:57.200
<v Speaker 2>They defeated the vanishing gradient to build deep hierarchical cognition.

406
00:20:57.480 --> 00:20:59.480
<v Speaker 1>If a machine can look at a cloudy sky and

407
00:20:59.519 --> 00:21:03.559
<v Speaker 1>recursively dream up a mathematically perfect multi-eyed nightmare bird

408
00:21:03.720 --> 00:21:06.559
<v Speaker 1>entirely on its own, what happens in twenty forty five?

409
00:21:07.160 --> 00:21:10.839
<v Speaker 1>What happens when an intelligence backed by ten billion transistors

410
00:21:10.920 --> 00:21:14.000
<v Speaker 1>starts defining its own loss functions, selecting its own features,

411
00:21:14.039 --> 00:21:16.640
<v Speaker 1>and optimizing for its own goals without ever needing to

412
00:21:16.640 --> 00:21:18.559
<v Speaker 1>tell us what they are? A question worth keeping in

413
00:21:18.559 --> 00:21:20.440
<v Speaker 1>mind the next time you see a machine generate a

414
00:21:20.480 --> 00:21:23.279
<v Speaker 1>masterpiece from thin air. Thanks for taking this deep dive.
