WEBVTT

1
00:00:00.080 --> 00:00:02.520
<v Speaker 1>You know, when we normally think about artificial intelligence learning

2
00:00:02.560 --> 00:00:08.199
<v Speaker 1>to see the world, there's this underlying expectation of neat,

3
00:00:08.560 --> 00:00:09.960
<v Speaker 1>orderly geometry, right?

4
00:00:10.039 --> 00:00:12.400
<v Speaker 2>Absolutely, everything has its specific place.

5
00:00:12.839 --> 00:00:14.480
<v Speaker 1>Yeah. And whether you're trying to catch up on the

6
00:00:14.560 --> 00:00:17.960
<v Speaker 1>latest tech trends or you're just insanely curious about how

7
00:00:18.039 --> 00:00:22.160
<v Speaker 1>machines actually perceive reality, you've probably heard of neural networks,

8
00:00:22.719 --> 00:00:26.519
<v Speaker 1>and traditional neural networks thrive on perfect grids. I mean,

9
00:00:26.559 --> 00:00:30.120
<v Speaker 1>you feed a computer a photograph and it basically just

10
00:00:30.160 --> 00:00:31.760
<v Speaker 1>sees a strict 2D grid of

11
00:00:31.839 --> 00:00:35.039
<v Speaker 2>pixels. Or you feed it a paragraph of text and

12
00:00:35.079 --> 00:00:37.520
<v Speaker 2>it sees a straight 1D line of words. It's

13
00:00:37.560 --> 00:00:40.119
<v Speaker 2>what computer scientists call the Euclidean domain.

14
00:00:39.560 --> 00:00:41.119
<v Speaker 1>Euclidean domain, yeah.

15
00:00:41.159 --> 00:00:44.920
<v Speaker 2>Yeah, it's basically the math of flat surfaces, straight lines,

16
00:00:45.039 --> 00:00:50.119
<v Speaker 2>and predictable localized structures. It's a world where every single

17
00:00:50.159 --> 00:00:53.439
<v Speaker 2>piece of data has a very specific, orderly neighborhood.

18
00:00:53.560 --> 00:00:55.359
<v Speaker 1>But then you step out of the computer and into

19
00:00:55.359 --> 00:00:57.840
<v Speaker 1>your actual life and the real world, like your social

20
00:00:57.880 --> 00:01:00.640
<v Speaker 1>network or the molecular structure of the coffee you drank

21
00:01:00.679 --> 00:01:03.200
<v Speaker 1>this morning, or even the chaotic flow of traffic you

22
00:01:03.240 --> 00:01:05.159
<v Speaker 1>sat in. It just doesn't fit into those neat

23
00:01:05.200 --> 00:01:05.840
<v Speaker 1>little boxes.

24
00:01:06.319 --> 00:01:08.760
<v Speaker 2>No, not at all. It's completely chaotic.

25
00:01:08.959 --> 00:01:11.879
<v Speaker 1>Exactly. Suddenly that pristine grid is gone and you are looking

26
00:01:11.959 --> 00:01:15.079
<v Speaker 1>at a landscape that is mathematically messy. It's a

27
00:01:15.120 --> 00:01:18.519
<v Speaker 1>non-Euclidean web of relationships. So today we are taking a

28
00:01:18.560 --> 00:01:21.480
<v Speaker 1>deep dive into a stack of highly technical notes from

29
00:01:21.480 --> 00:01:25.400
<v Speaker 1>the textbook Introduction to Graph Neural Networks by Zhiyuan Liu

30
00:01:25.480 --> 00:01:26.040
<v Speaker 1>and Jie Zhou.

31
00:01:26.359 --> 00:01:29.799
<v Speaker 2>It is a phenomenal text, but yeah, it's incredibly dense.

32
00:01:29.239 --> 00:01:31.840
<v Speaker 1>Super dense. So our mission today is to take

33
00:01:31.879 --> 00:01:35.519
<v Speaker 1>this really math-heavy computer science text and translate it

34
00:01:35.560 --> 00:01:38.920
<v Speaker 1>into something intuitive. We want to figure out exactly how

35
00:01:39.000 --> 00:01:43.200
<v Speaker 1>AI is finally learning to map the messy, interconnected web

36
00:01:43.239 --> 00:01:46.560
<v Speaker 1>of reality. Because to map that reality, the computer scientists

37
00:01:46.599 --> 00:01:50.359
<v Speaker 1>had to invent an entirely new architecture, the graph neural network.

38
00:01:50.519 --> 00:01:53.200
<v Speaker 2>And to really appreciate the scale of this paradigm shift,

39
00:01:53.239 --> 00:01:55.640
<v Speaker 2>we first have to look at what broke the old models.

40
00:01:55.439 --> 00:01:57.239
<v Speaker 1>Right, what went wrong?

41
00:01:57.599 --> 00:02:01.120
<v Speaker 2>Exactly. Traditional deep learning hit an absolute wall when it

42
00:02:01.120 --> 00:02:04.799
<v Speaker 2>tried to process anything that wasn't on a grid. Convolutional

43
00:02:04.879 --> 00:02:08.360
<v Speaker 2>neural networks, or CNNs, the architecture that basically

44
00:02:08.439 --> 00:02:12.639
<v Speaker 2>drove the entire modern image recognition boom, rely on

45
00:02:12.759 --> 00:02:17.560
<v Speaker 2>sliding a mathematical filter evenly across a predictable grid.

46
00:02:16.919 --> 00:02:20.199
<v Speaker 1>Kind of like a little square magnifying glass sliding

47
00:02:20.199 --> 00:02:20.879
<v Speaker 1>over pixels.

48
00:02:21.120 --> 00:02:24.159
<v Speaker 2>Yes, exactly like that. It slides over the image looking

49
00:02:24.199 --> 00:02:26.520
<v Speaker 2>at a neat 3x3 square of pixels at

50
00:02:26.560 --> 00:02:26.919
<v Speaker 2>a time.

51
00:02:27.120 --> 00:02:29.759
<v Speaker 1>Okay, let's unpack this for a second. If traditional AI

52
00:02:29.919 --> 00:02:33.759
<v Speaker 1>is like reading a perfectly formatted Excel spreadsheet or analyzing

53
00:02:33.759 --> 00:02:36.159
<v Speaker 1>a chessboard, a graph is more like looking at a

54
00:02:36.199 --> 00:02:37.560
<v Speaker 1>detective's messy corkboard.

55
00:02:37.639 --> 00:02:39.159
<v Speaker 2>Oh, I love that analogy.

56
00:02:39.159 --> 00:02:41.879
<v Speaker 1>Right? You know, the ones from the thrillers. Just chaotic pushpins with

57
00:02:41.919 --> 00:02:46.439
<v Speaker 1>red string tying dozens of unpredictable suspects, locations, and clues all together.

58
00:02:47.039 --> 00:02:50.599
<v Speaker 1>A CNN takes its neat little square magnifying glass, stares

59
00:02:50.639 --> 00:02:53.479
<v Speaker 1>at that tangled web of red string and just completely

60
00:02:53.560 --> 00:02:54.000
<v Speaker 1>gives up.

61
00:02:54.159 --> 00:02:57.680
<v Speaker 2>It completely breaks down, because on your detective's corkboard, one

62
00:02:57.719 --> 00:03:00.439
<v Speaker 2>clue might have two strings attached to it, and then

63
00:03:00.479 --> 00:03:03.360
<v Speaker 2>another clue right next to it might have five hundred

64
00:03:03.360 --> 00:03:05.560
<v Speaker 2>strings connecting it to everything else on the board.

65
00:03:05.639 --> 00:03:06.759
<v Speaker 1>Wow. Yeah.

66
00:03:06.759 --> 00:03:10.319
<v Speaker 2>So you can't slide a standard fixed-size 3x3 filter

67
00:03:10.439 --> 00:03:13.520
<v Speaker 2>over a spider web. The distance between the nodes isn't

68
00:03:13.560 --> 00:03:17.280
<v Speaker 2>a straight line anymore. The concept of up, down, left, right,

69
00:03:17.840 --> 00:03:21.400
<v Speaker 2>it just doesn't exist. It's purely about relationships and connections.

70
00:03:21.719 --> 00:03:24.439
<v Speaker 1>But computer scientists didn't just throw their hands up when

71
00:03:24.439 --> 00:03:27.039
<v Speaker 1>they saw the corkboard, right? Yeah, I was looking at

72
00:03:27.039 --> 00:03:31.039
<v Speaker 1>the early workarounds. The textbook mentions these things called network

73
00:03:31.080 --> 00:03:34.120
<v Speaker 1>embedding methods like DeepWalk and node2vec.

74
00:03:34.479 --> 00:03:36.560
<v Speaker 2>Yeah, the early attempts to solve the problem.

75
00:03:36.599 --> 00:03:39.039
<v Speaker 1>From what I gather, they tried to send virtual agents

76
00:03:39.080 --> 00:03:42.319
<v Speaker 1>walking randomly along the strings of the corkboard to map

77
00:03:42.360 --> 00:03:44.879
<v Speaker 1>it out, essentially trying to flatten the whole 3D

78
00:03:45.000 --> 00:03:47.680
<v Speaker 1>web into a simple flat list of numbers.

79
00:03:47.840 --> 00:03:49.479
<v Speaker 2>That's a great way to put it. They tried to

80
00:03:49.479 --> 00:03:52.599
<v Speaker 2>map the nodes into low-dimensional vectors using those random walks.

81
00:03:52.840 --> 00:03:55.879
<v Speaker 2>They were essentially trying to force a non-Euclidean graph

82
00:03:56.280 --> 00:03:58.520
<v Speaker 2>to behave like a Euclidean spreadsheet.

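NOTE
A minimal sketch of the random-walk step just described, assuming a toy corkboard graph; it is not taken from the textbook. The graph, the walk length of 5, and the helper name random_walk are illustrative assumptions. DeepWalk-style methods would then train a skip-gram model on walks like these to get one vector per node, a step omitted here.
```python
import random
# Hypothetical corkboard stored as an adjacency list: pin -> pins it is tied to.
graph = {
    "clue": ["suspect", "place"],
    "suspect": ["clue", "place"],
    "place": ["clue", "suspect", "alibi"],
    "alibi": ["place"],
}
# One walk: start at a pin and follow a uniformly random string at every step.
def random_walk(graph, start, length, rng):
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk
# Two walks of 5 pins from every starting pin; each walk is a flat sequence.
rng = random.Random(42)
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(2)]
for walk in walks:
    print(" ".join(walk))
```
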
83
00:03:58.599 --> 00:03:59.360
<v Speaker 1>But it didn't work.

84
00:03:59.639 --> 00:04:02.879
<v Speaker 2>No, it failed on a massive scale, and for two

85
00:04:03.199 --> 00:04:07.840
<v Speaker 2>really critical reasons. First, they didn't share computational parameters.

86
00:04:07.080 --> 00:04:09.240
<v Speaker 1>Which means what, exactly?

87
00:04:09.280 --> 00:04:12.159
<v Speaker 2>It means every single node you added to the graph required the

88
00:04:12.240 --> 00:04:15.039
<v Speaker 2>model to learn a brand new set of weights. So

89
00:04:15.159 --> 00:04:18.360
<v Speaker 2>if you were analyzing a social network with billions of users,

90
00:04:18.720 --> 00:04:23.360
<v Speaker 2>the computational cost just grew linearly until the machine choked.

91
00:04:23.439 --> 00:04:25.120
<v Speaker 2>It was a nightmare.

92
00:04:25.680 --> 00:04:28.079
<v Speaker 1>Oh man. And the second failure was about adapting to the unknown,

93
00:04:28.120 --> 00:04:29.040
<v Speaker 1>wasn't it?

94
00:04:29.079 --> 00:04:31.879
<v Speaker 2>Exactly. If you trained one of these early models on a

95
00:04:31.920 --> 00:04:35.360
<v Speaker 2>specific corkboard, and then I walked into the room and

96
00:04:35.519 --> 00:04:38.560
<v Speaker 2>pinned a brand new suspect to the board with new strings,

97
00:04:38.959 --> 00:04:40.759
<v Speaker 2>the model was totally blind to it.

98
00:04:40.879 --> 00:04:42.839
<v Speaker 1>Wait, really, it just couldn't see the new pin.

99
00:04:42.959 --> 00:04:45.240
<v Speaker 2>It couldn't process it at all. It just memorized the

100
00:04:45.279 --> 00:04:47.680
<v Speaker 2>specific board it was looking at, rather than learning how

101
00:04:47.720 --> 00:04:48.920
<v Speaker 2>to actually be a detective.

102
00:04:49.120 --> 00:04:53.040
<v Speaker 1>So to build an AI that actually learns, researchers realized

103
00:04:53.120 --> 00:04:56.680
<v Speaker 1>they had to understand the underlying structure of the graph mathematically,

104
00:04:57.040 --> 00:04:59.560
<v Speaker 1>rather than just trying to flatten it into a list.

105
00:05:00.079 --> 00:05:03.199
<v Speaker 2>Right. Part of understanding these non-Euclidean spaces relies on something

106
00:05:03.279 --> 00:05:04.920
<v Speaker 2>called the Laplacian matrix.

107
00:05:05.360 --> 00:05:08.560
<v Speaker 1>The text calls it the mathematical heartbeat of the graph.

108
00:05:08.920 --> 00:05:11.800
<v Speaker 1>I really love that phrasing, but visualizing a matrix is

109
00:05:11.800 --> 00:05:14.759
<v Speaker 1>always kind of tricky. If we think about the corkboard,

110
00:05:15.279 --> 00:05:18.319
<v Speaker 1>how does the Laplacian capture that chaotic shape?

111
00:05:18.680 --> 00:05:21.800
<v Speaker 2>Think about the tension in the strings. In graph theory,

112
00:05:21.959 --> 00:05:24.720
<v Speaker 2>you start with an adjacency matrix, which is basically just

113
00:05:24.800 --> 00:05:27.279
<v Speaker 2>a ledger showing which pushpins are connected to which.

114
00:05:27.360 --> 00:05:28.720
<v Speaker 1>Okay, simple ledger, right.

115
00:05:29.360 --> 00:05:32.360
<v Speaker 2>Then you have a degree matrix, which just simply counts

116
00:05:32.399 --> 00:05:35.639
<v Speaker 2>the total number of strings attached to each pushpin. The

117
00:05:35.759 --> 00:05:39.240
<v Speaker 2>Laplacian matrix is the mathematical difference between the two: the

118
00:05:39.279 --> 00:05:41.759
<v Speaker 2>degree matrix minus the adjacency matrix.

119
00:05:41.800 --> 00:05:44.639
<v Speaker 1>So it's subtracting the connections from the total strings.

120
00:05:44.319 --> 00:05:46.959
<v Speaker 2>Exactly, and by doing that it captures not just where

121
00:05:46.959 --> 00:05:49.560
<v Speaker 2>the pins are, but the potential energy and the flow

122
00:05:49.600 --> 00:05:53.879
<v Speaker 2>of information between them. It mathematically describes the overall shape

123
00:05:53.920 --> 00:05:55.639
<v Speaker 2>and structure of the entire web.

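NOTE
A small worked sketch of the degree-minus-adjacency construction just described. The 4-pin board and its strings are hypothetical, and plain Python lists stand in for a matrix library.
```python
# Hypothetical 4-pin corkboard; each pair is one undirected string between pins.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
n = 4
# Adjacency matrix A: the ledger of which pins are connected.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1
# Degree matrix D: the string count of each pin, placed on the diagonal.
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]
# Laplacian L = D - A, computed entry by entry.
L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
for row in L:
    print(row)
```
Every row of L sums to zero, because each pin's degree on the diagonal is cancelled by one -1 entry per attached string.
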
124
00:05:55.720 --> 00:05:59.480
<v Speaker 1>Wow. But so even with the Laplacian matrix acting as

125
00:05:59.519 --> 00:06:02.839
<v Speaker 1>this perfect map of the energy, early researchers still had

126
00:06:02.839 --> 00:06:05.839
<v Speaker 1>to figure out how to actually do convolution, right? They

127
00:06:05.959 --> 00:06:08.519
<v Speaker 1>still had to figure out how to slide that magnifying

128
00:06:08.560 --> 00:06:12.120
<v Speaker 1>glass over the strings to extract meaning.

129
00:06:11.920 --> 00:06:14.680
<v Speaker 2>They did. And this is where the textbook gets fascinating, because the

130
00:06:14.839 --> 00:06:18.680
<v Speaker 2>entire field of computer science literally split into two opposing

131
00:06:18.720 --> 00:06:20.199
<v Speaker 2>philosophical camps trying to solve this.

132
00:06:20.040 --> 00:06:22.199
<v Speaker 1>Spectral and spatial.

133
00:06:21.920 --> 00:06:26.759
<v Speaker 2>Exactly, the great divide in graph learning. The spectral approach

134
00:06:26.959 --> 00:06:31.000
<v Speaker 2>is heavily rooted in complex physics and signal processing. It

135
00:06:31.040 --> 00:06:34.240
<v Speaker 2>relies on the Fourier domain. So instead of looking at

136
00:06:34.319 --> 00:06:38.759
<v Speaker 2>individual pushpins, spectral models like the Spectral Network and ChebNet,

137
00:06:39.079 --> 00:06:41.639
<v Speaker 2>they look at the graph as a whole system of

138
00:06:41.720 --> 00:06:42.800
<v Speaker 2>vibrating signals.

139
00:06:43.199 --> 00:06:47.240
<v Speaker 1>So if spatial is looking at the individual pins, spectral

140
00:06:47.319 --> 00:06:49.759
<v Speaker 1>is like plucking the strings to see how the whole

141
00:06:49.759 --> 00:06:50.759
<v Speaker 1>board vibrates.

142
00:06:51.000 --> 00:06:52.399
<v Speaker 2>That is a perfect analogy.

143
00:06:52.519 --> 00:06:55.439
<v Speaker 1>Yes, they're looking at the overall structural frequencies of the

144
00:06:55.439 --> 00:06:58.199
<v Speaker 1>graph based on that Laplacian matrix we just talked about.

145
00:06:58.240 --> 00:07:01.560
<v Speaker 2>They look at the global resonance. But spectral methods run

146
00:07:01.600 --> 00:07:03.560
<v Speaker 2>into a massive real-world roadblock.

147
00:07:03.319 --> 00:07:05.680
<v Speaker 1>Because they're too rigid.

148
00:07:05.480 --> 00:07:09.319
<v Speaker 2>Exactly. Because the filters they build are mathematically tied to

149
00:07:09.360 --> 00:07:12.560
<v Speaker 2>the specific Laplacian matrix of that exact graph, they are

150
00:07:12.680 --> 00:07:17.079
<v Speaker 2>hyper-specialized. Imagine tuning a grand piano to sound perfect

151
00:07:17.160 --> 00:07:20.120
<v Speaker 2>in one specific concert hall. If you pick up that

152
00:07:20.199 --> 00:07:23.079
<v Speaker 2>piano and move it to a different room with different acoustics,

153
00:07:23.240 --> 00:07:25.639
<v Speaker 2>or in this case, a graph with a different structure,

154
00:07:26.199 --> 00:07:29.600
<v Speaker 2>your tuning just doesn't work anymore. The model completely fails

155
00:07:29.639 --> 00:07:32.279
<v Speaker 2>to generalize to new environments.

156
00:07:31.879 --> 00:07:35.000
<v Speaker 1>Which means we have to abandon the whole system approach.

157
00:07:35.120 --> 00:07:37.720
<v Speaker 1>If we want flexible AI, we have to pivot to

158
00:07:37.759 --> 00:07:41.120
<v Speaker 1>the other camp, the spatial approach. You do. And spatial

159
00:07:41.120 --> 00:07:44.680
<v Speaker 1>methods basically say, you know, forget the global frequencies of

160
00:07:44.720 --> 00:07:47.319
<v Speaker 1>the whole room, let's just zoom in and look at

161
00:07:47.319 --> 00:07:49.480
<v Speaker 1>our immediate neighbors on the corkboard. Right.

162
00:07:49.720 --> 00:07:53.360
<v Speaker 2>Spatial methods operate directly on the spatially close neighbors, but

163
00:07:53.439 --> 00:07:56.519
<v Speaker 2>then we run right back into the core problem. How

164
00:07:56.560 --> 00:08:00.240
<v Speaker 2>do you run a standard uniform filter over nodes that

165
00:08:00.319 --> 00:08:02.560
<v Speaker 2>all have a wildly different number of neighbors?

166
00:08:02.680 --> 00:08:05.879
<v Speaker 1>Right. And reading through the textbook's breakdown of early spatial models,

167
00:08:06.120 --> 00:08:09.519
<v Speaker 1>I hit one called PATCHY-SAN, and I have to

168
00:08:09.519 --> 00:08:11.439
<v Speaker 1>be honest, it felt like the researchers were just straight

169
00:08:11.519 --> 00:08:12.000
<v Speaker 1>up cheating.

170
00:08:12.120 --> 00:08:13.680
<v Speaker 2>A lot of people felt that way at the time.

171
00:08:13.879 --> 00:08:17.399
<v Speaker 1>Right. From what I understand, PATCHY-SAN forces chaos into

172
00:08:17.560 --> 00:08:20.879
<v Speaker 1>order by setting a totally arbitrary rule. It basically says,

173
00:08:21.240 --> 00:08:23.639
<v Speaker 1>I'm only going to look at exactly k neighbors for

174
00:08:23.720 --> 00:08:27.560
<v Speaker 1>every single node, no matter what. It extracts exactly k neighbors,

175
00:08:27.759 --> 00:08:30.839
<v Speaker 1>normalizes them, and then just runs a standard 1D CNN

176
00:08:30.879 --> 00:08:31.319
<v Speaker 1>over them.

177
00:08:31.439 --> 00:08:32.759
<v Speaker 2>That's exactly what it does.

178
00:08:32.879 --> 00:08:36.120
<v Speaker 1>Wait a second, though. If PATCHY-SAN forces a chaotic

179
00:08:36.159 --> 00:08:39.039
<v Speaker 1>web into a neat little sequence of exactly k neighbors,

180
00:08:39.240 --> 00:08:41.840
<v Speaker 1>aren't we just slicing off vital parts of the graph

181
00:08:41.960 --> 00:08:44.279
<v Speaker 1>just to make the math easier for the machine. Then

182
00:08:44.320 --> 00:08:46.639
<v Speaker 1>we are literally ignoring data.

183
00:08:46.679 --> 00:08:50.679
<v Speaker 2>You're not wrong. The researchers were prioritizing computational feasibility over

184
00:08:50.720 --> 00:08:54.799
<v Speaker 2>complete accuracy. They needed something that could actually run. But

185
00:08:54.879 --> 00:08:57.720
<v Speaker 2>that instinct you have that slicing off data is a

186
00:08:57.720 --> 00:09:00.840
<v Speaker 2>fundamental flaw. That is exactly why the field moved

187
00:09:00.840 --> 00:09:03.480
<v Speaker 2>away from rigid structures and developed GraphSAGE.

188
00:09:03.639 --> 00:09:06.000
<v Speaker 1>GraphSAGE. That's a huge one in the text.

189
00:09:06.120 --> 00:09:09.159
<v Speaker 2>It was a monumental leap forward because the creators of

190
00:09:09.200 --> 00:09:11.720
<v Speaker 2>GraphSAGE realized you don't need to force the graph into

191
00:09:11.720 --> 00:09:14.720
<v Speaker 2>a rigid shape. Instead of memorizing a fixed neighborhood of

192
00:09:14.759 --> 00:09:18.360
<v Speaker 2>exactly k nodes, GraphSAGE learns an inductive framework.

193
00:09:18.799 --> 00:09:23.240
<v Speaker 1>Inductive, meaning it learns the underlying rule of the puzzle,

194
00:09:23.639 --> 00:09:26.039
<v Speaker 1>not just the specific solution to one puzzle.

195
00:09:26.399 --> 00:09:30.720
<v Speaker 2>It learns the strategy. So GraphSAGE uniformly samples a

196
00:09:30.759 --> 00:09:33.919
<v Speaker 2>fixed size set of neighbors. But the brilliance is in

197
00:09:33.960 --> 00:09:37.000
<v Speaker 2>what it does next. It applies an aggregator function.

198
00:09:36.399 --> 00:09:38.679
<v Speaker 1>Like finding an average.

199
00:09:38.440 --> 00:09:42.360
<v Speaker 2>Exactly, like a mean aggregator that finds the mathematical average

200
00:09:42.360 --> 00:09:46.080
<v Speaker 2>of the features, or a pooling aggregator. It's not trying

201
00:09:46.120 --> 00:09:48.960
<v Speaker 2>to learn the specific nodes themselves. It's learning the function

202
00:09:49.080 --> 00:09:52.639
<v Speaker 2>of how to pull in feature information from whatever local

203
00:09:52.639 --> 00:09:54.240
<v Speaker 2>neighborhood happens to be around it.

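NOTE
A minimal sketch of the sample-then-aggregate step described above, not GraphSAGE's reference implementation. The toy graph, the 2-dimensional features, and the helper names are hypothetical. Sampling with replacement when a node has fewer than k neighbors is an assumption of this sketch, and the learned weight matrix and nonlinearity that follow the aggregation are omitted.
```python
import random
# Uniformly sample a fixed-size set of k neighbors for one node.
def sample_neighbors(graph, node, k, rng):
    nbrs = graph[node]
    if len(nbrs) >= k:
        return rng.sample(nbrs, k)
    # Assumption: fall back to sampling with replacement for low-degree nodes.
    return [rng.choice(nbrs) for _ in range(k)]
# Mean aggregator: element-wise average of the sampled neighbors' features.
def mean_aggregate(features, nodes):
    dim = len(features[nodes[0]])
    return [sum(features[v][d] for v in nodes) / len(nodes) for d in range(dim)]
# Hypothetical toy graph and features.
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [4.0, 2.0], "d": [2.0, 2.0]}
rng = random.Random(0)
sampled = sample_neighbors(graph, "a", 3, rng)
summary = mean_aggregate(features, sampled)
# An unseen node added later is handled by the same two functions, no retraining.
graph["e"] = ["b", "c"]
features["e"] = [9.0, 9.0]
new_summary = mean_aggregate(features, sample_neighbors(graph, "e", 2, rng))
print(summary, new_summary)
```
Nothing in the two functions is tied to a particular node, which is the sense in which the approach is inductive.
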
204
00:09:54.279 --> 00:09:56.720
<v Speaker 1>Oh wow. So because it learns the how, you can

205
00:09:56.759 --> 00:10:00.000
<v Speaker 1>take an entirely unseen node, drop it into the network tomorrow,

206
00:10:00.360 --> 00:10:03.039
<v Speaker 1>and the model intuitively knows how to process it based

207
00:10:03.080 --> 00:10:04.679
<v Speaker 1>on whatever new neighbors surround it.

208
00:10:04.720 --> 00:10:07.159
<v Speaker 2>Exactly. It finally learned how to be the detective. It

209
00:10:07.200 --> 00:10:09.039
<v Speaker 2>knows how to read the strings no matter what crazy

210
00:10:09.039 --> 00:10:10.039
<v Speaker 2>board you put in front of it.

211
00:10:10.120 --> 00:10:14.919
<v Speaker 1>That's incredible. But aggregating neighbors equally brings up another glaring

212
00:10:14.960 --> 00:10:19.080
<v Speaker 1>real world problem. In reality, not all relationships are created equal.

213
00:10:19.200 --> 00:10:19.879
<v Speaker 2>No, definitely not.

214
00:10:20.279 --> 00:10:23.120
<v Speaker 1>Think about your own life. If I ask my friends

215
00:10:23.159 --> 00:10:26.159
<v Speaker 1>for advice on buying a car, my friend who has

216
00:10:26.159 --> 00:10:28.879
<v Speaker 1>been a mechanic for twenty years matters a lot more

217
00:10:28.879 --> 00:10:30.960
<v Speaker 1>than my friend who rides a unicycle. I would hope

218
00:10:31.000 --> 00:10:35.759
<v Speaker 1>so. Right. But standard spatial aggregation, just averaging everyone together,

219
00:10:36.279 --> 00:10:40.120
<v Speaker 1>treats the mechanic and the unicycle rider as mathematically equal.

220
00:10:40.720 --> 00:10:43.639
<v Speaker 2>And this is where the architecture evolves to mirror human

221
00:10:43.720 --> 00:10:47.879
<v Speaker 2>cognition much more closely. We transition into adding memory and

222
00:10:47.919 --> 00:10:51.799
<v Speaker 2>attention to the graph. The textbook details graph recurrent networks

223
00:10:51.960 --> 00:10:56.480
<v Speaker 2>or GRNs, and graph attention networks, known as GATs.

224
00:10:57.039 --> 00:11:00.480
<v Speaker 1>Here's where it gets really interesting to me. Older graph

225
00:11:00.559 --> 00:11:03.039
<v Speaker 1>convolutional networks, the ones that just average all their neighbors,

226
00:11:03.399 --> 00:11:05.480
<v Speaker 1>are sort of like being in a loud cocktail party

227
00:11:05.480 --> 00:11:07.559
<v Speaker 1>where you try to listen to everyone in the room equally.

228
00:11:07.679 --> 00:11:10.120
<v Speaker 2>That sounds exhausting.

229
00:11:10.200 --> 00:11:13.879
<v Speaker 1>It is. You pull in so much overlapping chatter that it just creates a dull, useless hum.

230
00:11:14.360 --> 00:11:17.399
<v Speaker 1>But graph attention networks, the GATs, they put on

231
00:11:17.440 --> 00:11:20.240
<v Speaker 1>noise-canceling headphones and focus entirely on the one person with

232
00:11:20.320 --> 00:11:21.240
<v Speaker 1>the juicy gossip.

233
00:11:21.519 --> 00:11:24.360
<v Speaker 2>That's a great way to visualize the self-attention mechanism.

234
00:11:24.639 --> 00:11:28.440
<v Speaker 2>It is a brilliant piece of engineering. Basically, for every

235
00:11:28.440 --> 00:11:33.399
<v Speaker 2>single neighbor a node has, the model calculates an attention coefficient.

236
00:11:32.919 --> 00:11:35.360
<v Speaker 1>Using LeakyReLU and softmax equations, right?

237
00:11:35.480 --> 00:11:38.159
<v Speaker 2>Yes, the math gets heavy there, But to avoid the

238
00:11:38.159 --> 00:11:40.799
<v Speaker 2>heavy jargon, just think of it as a mathematical filter

239
00:11:40.919 --> 00:11:44.159
<v Speaker 2>that actively mutes the background noise and cranks up the

240
00:11:44.240 --> 00:11:47.600
<v Speaker 2>volume on the important signal. It runs the data through

241
00:11:47.639 --> 00:11:52.039
<v Speaker 2>a function that penalizes irrelevant information and then balances all

242
00:11:52.080 --> 00:11:54.799
<v Speaker 2>those individual attention scores out so they add up to

243
00:11:54.840 --> 00:11:56.600
<v Speaker 2>a clean one hundred percent.

244
00:11:56.759 --> 00:12:00.000
<v Speaker 1>Oh, I see. So this assigns specific weighted importance

245
00:12:00.279 --> 00:12:03.240
<v Speaker 1>to different neighbors. The mechanic gets an eighty five percent

246
00:12:03.240 --> 00:12:06.559
<v Speaker 1>attention score and the unicycle rider gets a two percent score.

247
00:12:06.679 --> 00:12:08.759
<v Speaker 2>Exactly. It learns who to trust.

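NOTE
A minimal sketch of the attention-coefficient step described above: a LeakyReLU score per neighbor, then a softmax so the weights sum to one. The feature vectors and the attention vector a are hypothetical numbers, the slope of 0.2 is an assumed setting, and the learned linear transform a real GAT layer applies to the features first is omitted.
```python
import math
# LeakyReLU keeps positive scores and shrinks negative ones instead of zeroing them.
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x
# Attention weights of one node over its neighbors.
def attention_weights(h_self, h_neighbors, a):
    # Raw score per neighbor: LeakyReLU(a . [h_self || h_neighbor]).
    scores = [
        leaky_relu(sum(w * x for w, x in zip(a, h_self + h_nbr)))
        for h_nbr in h_neighbors
    ]
    # Softmax (shifted by the max for numerical stability) normalizes the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
# Hypothetical 2-dimensional features and a fixed, not learned, attention vector.
h_self = [1.0, 0.0]
h_neighbors = [[3.0, 1.0], [0.0, -2.0], [0.5, 0.5]]
a = [0.5, 0.5, 1.0, 1.0]
alpha = attention_weights(h_self, h_neighbors, a)
print([round(x, 3) for x in alpha])
```
The first neighbor gets the largest raw score and therefore most of the weight, much like the mechanic in the car-buying example.
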
248
00:12:08.600 --> 00:12:11.600
<v Speaker 1>And the text also highlights multi-head attention. If we

249
00:12:11.639 --> 00:12:14.519
<v Speaker 1>stick to the cocktail party analogy, I assume that's like

250
00:12:14.559 --> 00:12:17.440
<v Speaker 1>sending five different friends into the party, each instructed to

251
00:12:17.440 --> 00:12:20.200
<v Speaker 1>listen for different kinds of gossip. Like, one listens for

252
00:12:20.279 --> 00:12:23.720
<v Speaker 1>financial news, one for relationship drama, and then they all

253
00:12:23.720 --> 00:12:25.080
<v Speaker 1>compare notes at the end of the night.

254
00:12:25.519 --> 00:12:29.679
<v Speaker 2>Yeah, that's spot on. Multi-head attention stabilizes the learning process

255
00:12:29.720 --> 00:12:35.159
<v Speaker 2>by running several independent attention mechanisms simultaneously and concatenating the results.

256
00:12:35.480 --> 00:12:38.080
<v Speaker 2>It ensures the model doesn't fixate on just one type

257
00:12:38.080 --> 00:12:39.480
<v Speaker 2>of relationship.

258
00:12:38.960 --> 00:12:40.440
<v Speaker 1>So it gets a well-rounded view. Right.

259
00:12:40.759 --> 00:12:43.080
<v Speaker 2>But beyond just focusing on the right neighbors in the

260
00:12:43.080 --> 00:12:46.240
<v Speaker 2>present moment, sometimes the network needs memory to understand the

261
00:12:46.240 --> 00:12:49.639
<v Speaker 2>broader context. This is where graph recurrent networks come in.

262
00:12:50.039 --> 00:12:53.720
<v Speaker 2>They heavily borrow memory gates like GRU and LSTM gates

263
00:12:53.960 --> 00:12:58.039
<v Speaker 2>from traditional sequence models to remember long-term dependencies and

264
00:12:58.120 --> 00:12:59.519
<v Speaker 2>forget irrelevant data.

265
00:13:00.039 --> 00:13:02.879
<v Speaker 1>The source highlighted a specific model for analyzing text called

266
00:13:02.919 --> 00:13:07.759
<v Speaker 1>the Sentence LSTM, or S-LSTM. This honestly blew my mind.

267
00:13:08.200 --> 00:13:11.039
<v Speaker 1>Normally text is just a straight line, but here they

268
00:13:11.039 --> 00:13:13.600
<v Speaker 1>take a sentence, turn the words into nodes on a graph,

269
00:13:13.679 --> 00:13:16.559
<v Speaker 1>so each word can look at its immediate neighbors. But then,

270
00:13:17.080 --> 00:13:19.639
<v Speaker 1>this is the crazy part. They add this genius thing

271
00:13:19.679 --> 00:13:20.840
<v Speaker 1>called a supernode.

272
00:13:21.120 --> 00:13:26.639
<v Speaker 2>Yes, the supernode solves a massive architectural bottleneck. If you

273
00:13:26.720 --> 00:13:30.279
<v Speaker 2>are analyzing a really long paragraph, a word needs to

274
00:13:30.360 --> 00:13:33.159
<v Speaker 2>understand the grammar of the words immediately next to it,

275
00:13:33.399 --> 00:13:35.759
<v Speaker 2>but it also needs to understand the overarching theme of

276
00:13:35.799 --> 00:13:37.240
<v Speaker 2>the whole text.

277
00:13:37.080 --> 00:13:39.759
<v Speaker 1>Right. Like if the text is a massive legal document, the

278
00:13:39.799 --> 00:13:41.600
<v Speaker 1>first word of the page and the last word of

279
00:13:41.639 --> 00:13:44.399
<v Speaker 1>the page might be hundreds of hops away from each other.

280
00:13:44.399 --> 00:13:47.679
<v Speaker 1>On a normal graph, the signal would totally degrade before

281
00:13:47.720 --> 00:13:49.320
<v Speaker 1>they ever communicated.

282
00:13:49.600 --> 00:13:53.799
<v Speaker 2>Exactly. The S-LSTM elegantly solves this by connecting every single word

283
00:13:53.840 --> 00:13:57.279
<v Speaker 2>node to its immediate neighbors, but also connecting every single

284
00:13:57.279 --> 00:13:59.600
<v Speaker 2>word to one overarching supernode.

285
00:13:59.639 --> 00:13:59.960
<v Speaker 1>Wow.

286
00:14:00.240 --> 00:14:03.320
<v Speaker 2>So the word nodes handle the local context, the immediate

287
00:14:03.320 --> 00:14:07.080
<v Speaker 2>grammar and phrasing. Meanwhile, the supernode acts as a central hub,

288
00:14:07.399 --> 00:14:11.159
<v Speaker 2>aggregating information from all the words simultaneously and feeding that

289
00:14:11.279 --> 00:14:13.679
<v Speaker 2>global context back down to the individual words.

290
00:14:13.960 --> 00:14:16.759
<v Speaker 1>It's like having a project manager who sees the entire

291
00:14:16.840 --> 00:14:20.480
<v Speaker 1>timeline of the construction project, while the individual workers only

292
00:14:20.480 --> 00:14:23.960
<v Speaker 1>see their daily tasks. The project manager constantly yells

293
00:14:24.000 --> 00:14:26.200
<v Speaker 1>down from the scaffolding to make sure everyone is actually

294
00:14:26.240 --> 00:14:27.320
<v Speaker 1>building the same house.

295
00:14:27.600 --> 00:14:30.360
<v Speaker 2>That is exactly what it does, and because it allows

296
00:14:30.360 --> 00:14:33.960
<v Speaker 2>information to flow so efficiently across the whole structure without

297
00:14:33.960 --> 00:14:39.159
<v Speaker 2>degrading over long distances, the S-LSTM has actually outperformed incredibly

298
00:14:39.240 --> 00:14:42.759
<v Speaker 2>powerful state-of-the-art sequence models like the Transformer

299
00:14:42.759 --> 00:14:44.679
<v Speaker 2>on certain text classification tasks.

300
00:14:44.720 --> 00:14:48.639
<v Speaker 1>That is wild. Okay, so if giving a graph neural

301
00:14:48.679 --> 00:14:53.159
<v Speaker 1>network memory, dynamic attention, and a project manager supernode makes

302
00:14:53.159 --> 00:14:56.600
<v Speaker 1>it this incredibly smart, the logical next step in computer

303
00:14:56.639 --> 00:15:00.360
<v Speaker 1>science is always the same: go deeper. Oh, always. Right,

304
00:15:00.399 --> 00:15:02.759
<v Speaker 1>if a two-layer graph neural network is good, a

305
00:15:02.799 --> 00:15:06.159
<v Speaker 1>fifty-layer network must be a superintelligence. Let's just stack

306
00:15:06.200 --> 00:15:07.759
<v Speaker 1>these aggregation layers to the moon.

307
00:15:07.879 --> 00:15:10.480
<v Speaker 2>And that is exactly what happened with convolutional neural networks

308
00:15:10.480 --> 00:15:13.440
<v Speaker 2>for images. Researchers went from models with just a few

309
00:15:13.519 --> 00:15:16.399
<v Speaker 2>layers to ResNet architectures with over one hundred layers, and

310
00:15:16.440 --> 00:15:18.360
<v Speaker 2>the performance just skyrocketed.

311
00:15:18.440 --> 00:15:21.279
<v Speaker 1>But with graphs, it's not that simple, is it?

312
00:15:21.440 --> 00:15:23.759
<v Speaker 2>Not at all. Doing that with graphs plunges you straight

313
00:15:23.799 --> 00:15:26.240
<v Speaker 2>into the biggest, most frustrating trap in graph learning.

314
00:15:26.120 --> 00:15:27.799
<v Speaker 1>The oversmoothing trap.

315
00:15:28.000 --> 00:15:32.200
<v Speaker 2>Yes, to understand why stacking layers destroys a graph, we

316
00:15:32.279 --> 00:15:35.480
<v Speaker 2>have to look back at the original vanilla GNN proposed

317
00:15:35.519 --> 00:15:39.000
<v Speaker 2>back in 2009. It was painfully inefficient

318
00:15:39.039 --> 00:15:42.320
<v Speaker 2>because it updated node states iteratively until it hit what

319
00:15:42.360 --> 00:15:45.320
<v Speaker 2>they called a fixed point. By the time the math

320
00:15:45.399 --> 00:15:48.080
<v Speaker 2>reached that fixed point, the representations of the nodes were

321
00:15:48.080 --> 00:15:49.440
<v Speaker 2>completely uninformative.

322
00:15:49.559 --> 00:15:52.200
<v Speaker 1>So what does this all mean for you, listening? Think

323
00:15:52.240 --> 00:15:55.960
<v Speaker 1>about a beautiful, diverse mosaic made of thousands of uniquely

324
00:15:56.000 --> 00:15:59.559
<v Speaker 1>colored tiles. If you constantly average the colors of all

325
00:15:59.600 --> 00:16:02.279
<v Speaker 1>your neighbors, and then in the next layer you average

326
00:16:02.279 --> 00:16:05.799
<v Speaker 1>the new colors of your neighbors' neighbors, it blends, right? Eventually,

327
00:16:05.879 --> 00:16:09.440
<v Speaker 1>that gorgeous mosaic just turns into a giant, muddy gray

328
00:16:09.519 --> 00:16:11.320
<v Speaker 1>blob. That is oversmoothing.

329
00:16:11.519 --> 00:16:15.320
<v Speaker 2>It's the mathematical homogenization of the data. By layer ten,

330
00:16:15.480 --> 00:16:17.600
<v Speaker 2>a node isn't just looking at its immediate friends. It's

331
00:16:17.639 --> 00:16:20.679
<v Speaker 2>looking at its friends of friends of friends, exponentially outward.

332
00:16:20.919 --> 00:16:23.440
<v Speaker 2>It's pulling in massive amounts of irrelevant noise from the

333
00:16:23.440 --> 00:16:26.559
<v Speaker 2>far edges of the graph, until every single node shares

334
00:16:26.600 --> 00:16:28.360
<v Speaker 2>the exact same average representation.

335
00:16:28.559 --> 00:16:31.159
<v Speaker 1>The network loses all its sharp edges. Exactly.

336
00:16:31.279 --> 00:16:33.440
<v Speaker 2>You lose the unique features that define the node in

337
00:16:33.480 --> 00:16:34.440
<v Speaker 2>the first place.

338
00:16:34.440 --> 00:16:37.639
<v Speaker 1>Which completely ruins the point of the graph. I mean,

339
00:16:37.679 --> 00:16:40.519
<v Speaker 1>if every node mathematically looks like a muddy gray blob,

340
00:16:40.879 --> 00:16:43.799
<v Speaker 1>the AI can't classify a cancer cell from a healthy cell,

341
00:16:44.039 --> 00:16:45.799
<v Speaker 1>or a bot account from a real user.

342
00:16:45.879 --> 00:16:46.679
<v Speaker 2>It becomes useless.

343
00:16:46.879 --> 00:16:48.799
<v Speaker 1>So if the problem is that we are averaging too

344
00:16:48.799 --> 00:16:51.679
<v Speaker 1>many neighbors over too many layers until it becomes a blob,

345
00:16:52.120 --> 00:16:55.080
<v Speaker 1>the logical solution has to be finding a way to

346
00:16:55.159 --> 00:16:58.159
<v Speaker 1>hit the brakes, right? Giving the network a way to

347
00:16:58.200 --> 00:17:00.120
<v Speaker 1>stop before it loses its identity.

348
00:17:00.519 --> 00:17:03.639
<v Speaker 2>And that realization led to the development of graph residual

349
00:17:03.639 --> 00:17:07.960
<v Speaker 2>networks or GRNs. One of the most brilliant solutions the

350
00:17:08.000 --> 00:17:12.079
<v Speaker 2>textbook covers is the Jump Knowledge network or JKN.

351
00:17:12.279 --> 00:17:13.599
<v Speaker 1>Oh this is fascinating.

352
00:17:13.640 --> 00:17:17.279
<v Speaker 2>The researchers behind JKN recognized that different nodes need different

353
00:17:17.319 --> 00:17:21.119
<v Speaker 2>receptive fields. A node sitting right in the dense, crowded

354
00:17:21.160 --> 00:17:23.680
<v Speaker 2>core of the social network might turn into a gray

355
00:17:23.720 --> 00:17:26.440
<v Speaker 2>blob after just two layers simply because it has so

356
00:17:26.480 --> 00:17:28.759
<v Speaker 2>many connections flooding it with data.

357
00:17:28.440 --> 00:17:30.400
<v Speaker 1>Right, too much gossip at the party. Exactly.

358
00:17:30.680 --> 00:17:33.839
<v Speaker 2>But a node out on the isolated quiet fringes might

359
00:17:33.920 --> 00:17:36.880
<v Speaker 2>actually need five or six layers of aggregation just to

360
00:17:36.920 --> 00:17:39.000
<v Speaker 2>gather enough context from the rest of the board to

361
00:17:39.000 --> 00:17:39.599
<v Speaker 2>be useful.

362
00:17:39.799 --> 00:17:42.160
<v Speaker 1>So it literally lets the node jump back through time

363
00:17:42.279 --> 00:17:43.279
<v Speaker 1>to a previous layer.

364
00:17:43.400 --> 00:17:46.039
<v Speaker 2>Yes, in the final layer of the network, the JKN

365
00:17:46.119 --> 00:17:51.039
<v Speaker 2>lets every single node adaptively select which intermediate layer's representation

366
00:17:51.240 --> 00:17:55.039
<v Speaker 2>was most useful for its specific situation. The dense core

367
00:17:55.119 --> 00:17:57.839
<v Speaker 2>node can choose to use its representation from layer two,

368
00:17:58.200 --> 00:18:01.359
<v Speaker 2>while the fringe node pulls from layer five. It preserves

369
00:18:01.400 --> 00:18:04.079
<v Speaker 2>the structural awareness of each node before it gets smoothed

370
00:18:04.119 --> 00:18:04.880
<v Speaker 2>out by the math.

371
00:18:05.440 --> 00:18:08.400
<v Speaker 1>That is incredibly clever. It's basically like giving each node

372
00:18:08.680 --> 00:18:11.799
<v Speaker 1>its own personalized stop button, like, okay, I've learned enough

373
00:18:11.799 --> 00:18:15.000
<v Speaker 1>about my surroundings, stop averaging before I lose who I am.

374
00:18:15.160 --> 00:18:17.039
<v Speaker 2>It's a very elegant solution.

375
00:18:16.960 --> 00:18:20.440
<v Speaker 1>And the text also details how researchers borrow tricks from those

376
00:18:20.480 --> 00:18:23.359
<v Speaker 1>massive image networks to build deep GCNs, right?

377
00:18:23.480 --> 00:18:27.799
<v Speaker 2>Yes, deep GCNs tackle both the vanishing gradient problem, which

378
00:18:27.799 --> 00:18:30.559
<v Speaker 2>is a mathematical decay that happens in all deep networks,

379
00:18:30.960 --> 00:18:35.680
<v Speaker 2>and oversmoothing. They use ResNet-style skip connections, which

380
00:18:35.839 --> 00:18:38.519
<v Speaker 2>literally take the raw matrix of data from a previous

381
00:18:38.599 --> 00:18:40.759
<v Speaker 2>layer and add it directly to the current one, keeping

382
00:18:40.759 --> 00:18:44.559
<v Speaker 2>the original signal alive. Just bypassing the blur. Exactly. But

383
00:18:44.640 --> 00:18:47.759
<v Speaker 2>the real breakthrough for preventing the gray blob in deep

384
00:18:47.839 --> 00:18:50.480
<v Speaker 2>GCNs is a technique called dilated kNN.

385
00:18:50.519 --> 00:18:54.279
<v Speaker 1>Dilated k-nearest neighbors. Now, if the problem is

386
00:18:54.319 --> 00:18:57.440
<v Speaker 1>pulling in too much dense noise from immediate neighbors, I'm

387
00:18:57.480 --> 00:19:01.640
<v Speaker 1>guessing dilation forces the network to, like, ignore the people

388
00:19:01.720 --> 00:19:03.559
<v Speaker 1>right next to it so it can look further away.

389
00:19:03.680 --> 00:19:06.880
<v Speaker 2>That's the core idea. It expands the receptive field without

390
00:19:06.880 --> 00:19:10.799
<v Speaker 2>adding pure noise. Instead of looking at every single immediate

391
00:19:10.799 --> 00:19:14.319
<v Speaker 2>neighbor in a dense cluster, the network calculates a wider

392
00:19:14.440 --> 00:19:18.640
<v Speaker 2>radius of nearest neighbors and then intentionally skips nodes at

393
00:19:18.640 --> 00:19:19.440
<v Speaker 2>a set interval.

394
00:19:19.480 --> 00:19:20.240
<v Speaker 1>Oh, I get it.

395
00:19:20.240 --> 00:19:22.680
<v Speaker 2>It dilates its view, grabbing a sample from further out

396
00:19:22.720 --> 00:19:25.119
<v Speaker 2>while ignoring the overwhelming density in between.

397
00:19:25.319 --> 00:19:29.319
<v Speaker 1>It's exactly like standing way back from a massive Impressionist painting.

398
00:19:29.960 --> 00:19:32.079
<v Speaker 1>If you press your nose to the canvas, you are

399
00:19:32.119 --> 00:19:35.039
<v Speaker 1>totally overwhelmed by the density of the brushstrokes. You can't

400
00:19:35.079 --> 00:19:37.079
<v Speaker 1>see anything. You have to zoom out to see the

401
00:19:37.119 --> 00:19:38.519
<v Speaker 1>broad context of the landscape.

402
00:19:38.559 --> 00:19:39.920
<v Speaker 2>That is exactly how it functions.

403
00:19:40.039 --> 00:19:44.160
<v Speaker 1>And because dilated kNN is intentionally skipping data points in between,

404
00:19:44.880 --> 00:19:47.880
<v Speaker 1>you aren't just averaging everything together into a blur. You

405
00:19:47.920 --> 00:19:50.400
<v Speaker 1>get the big picture without the overwhelming noise.

406
00:19:50.640 --> 00:19:54.640
<v Speaker 2>It elegantly preserves the high-frequency information, the sharp defining

407
00:19:54.680 --> 00:19:58.759
<v Speaker 2>details of the graph, while still gathering long-range global context.

408
00:19:59.319 --> 00:20:03.160
<v Speaker 2>By combining all these skip connections with dilated convolutions, researchers

409
00:20:03.160 --> 00:20:07.119
<v Speaker 2>were finally able to successfully build a massive fifty-six

410
00:20:07.240 --> 00:20:10.759
<v Speaker 2>layer graph convolutional network that didn't succumb to oversmoothing.

411
00:20:10.799 --> 00:20:12.799
<v Speaker 1>Fifty-six layers.

412
00:20:13.000 --> 00:20:16.039
<v Speaker 2>Yeah, it stayed incredibly sharp and perceptive at depths that

413
00:20:16.079 --> 00:20:18.359
<v Speaker 2>would have previously completely destroyed the data.

414
00:20:18.400 --> 00:20:20.680
<v Speaker 1>Well, we have covered a massive amount of ground today,

415
00:20:20.720 --> 00:20:24.440
<v Speaker 1>pulling some incredibly dense computer science down to Earth. Let's

416
00:20:24.480 --> 00:20:27.519
<v Speaker 1>recap this journey. We started with the realization that the

417
00:20:27.559 --> 00:20:29.839
<v Speaker 1>real world isn't a neat Excel spreadsheet.

418
00:20:30.000 --> 00:20:31.640
<v Speaker 2>It is definitely a corkboard.

419
00:20:31.240 --> 00:20:35.000
<v Speaker 1>A messy corkboard. Traditional AI failed on non-Euclidean graphs

420
00:20:35.039 --> 00:20:38.480
<v Speaker 1>because it relies on grids, and early network embeddings failed

421
00:20:38.480 --> 00:20:41.799
<v Speaker 1>because they were computationally impossible for huge networks and just

422
00:20:41.839 --> 00:20:43.920
<v Speaker 1>couldn't generalize to new data. Right.

423
00:20:43.960 --> 00:20:46.480
<v Speaker 2>And then we explored how the underlying energy of the graph,

424
00:20:46.519 --> 00:20:49.440
<v Speaker 2>which is captured by the Laplacian matrix, opened the door

425
00:20:49.480 --> 00:20:51.160
<v Speaker 2>to actual graph neural networks.

426
00:20:51.279 --> 00:20:53.960
<v Speaker 1>Yeah, and we saw the field split into spectral methods

427
00:20:54.240 --> 00:20:57.119
<v Speaker 1>which look at overall structural frequencies but fail to adapt

428
00:20:57.160 --> 00:21:00.160
<v Speaker 1>to new environments, and spatial methods, which zoom in to

429
00:21:00.200 --> 00:21:01.799
<v Speaker 1>operate directly on local neighbors.

430
00:21:01.839 --> 00:21:04.440
<v Speaker 2>Then we saw GraphSAGE crack the inductive problem by

431
00:21:04.559 --> 00:21:07.319
<v Speaker 2>learning the strategy of how to aggregate rather than just

432
00:21:07.480 --> 00:21:09.160
<v Speaker 2>memorizing a specific layout.

433
00:21:09.400 --> 00:21:13.559
<v Speaker 1>We gave the network memory and intense focus, using S-LSTM

434
00:21:13.640 --> 00:21:17.039
<v Speaker 1>supernodes to manage the big picture, and graph attention networks

435
00:21:17.039 --> 00:21:19.400
<v Speaker 1>to tune out the noise and focus on the important

436
00:21:19.440 --> 00:21:20.759
<v Speaker 1>gossip at the cocktail party.

437
00:21:20.839 --> 00:21:23.559
<v Speaker 2>And finally we confronted the limits of depth. We saw

438
00:21:23.599 --> 00:21:28.000
<v Speaker 2>how stacking too many layers creates an oversmoothed, muddy gray blob,

439
00:21:28.440 --> 00:21:32.480
<v Speaker 2>and how jump knowledge algorithms and dilated kNN allowed networks

440
00:21:32.519 --> 00:21:35.400
<v Speaker 2>to go dozens of layers deep while retaining the unique,

441
00:21:35.440 --> 00:21:37.319
<v Speaker 2>sharp identities of every node.

442
00:21:37.480 --> 00:21:40.400
<v Speaker 1>It's been an incredible deep dive, and for you listening,

443
00:21:40.480 --> 00:21:42.920
<v Speaker 1>whether you are actually building these models or just living

444
00:21:42.920 --> 00:21:45.359
<v Speaker 1>in the world governed by them, it is crucial to

445
00:21:45.400 --> 00:21:49.000
<v Speaker 1>remember that this isn't just abstract textbook math.

446
00:21:49.119 --> 00:21:51.759
<v Speaker 2>No, it has massive real world applications.

447
00:21:51.799 --> 00:21:55.720
<v Speaker 1>Absolutely, this architecture is the engine of the next decade

448
00:21:55.720 --> 00:22:00.440
<v Speaker 1>of discovery. Graph neural networks are the exact mathematical frameworks

449
00:22:00.480 --> 00:22:04.000
<v Speaker 1>that map your social circles to recommend friends or content.

450
00:22:04.519 --> 00:22:07.799
<v Speaker 1>They are modeling complex physical systems like city traffic or

451
00:22:07.839 --> 00:22:08.559
<v Speaker 1>weather patterns.

452
00:22:08.599 --> 00:22:12.480
<v Speaker 2>They are even analyzing the non-Euclidean molecular fingerprints of

453
00:22:12.680 --> 00:22:15.319
<v Speaker 2>compounds to discover new life-saving medicines.

454
00:22:15.640 --> 00:22:18.839
<v Speaker 1>It's truly everywhere. They are the fundamental lens through which

455
00:22:18.920 --> 00:22:23.000
<v Speaker 1>artificial intelligence is finally learning to understand the messy, interconnected

456
00:22:23.079 --> 00:22:24.359
<v Speaker 1>web of our actual lives.

457
00:22:24.400 --> 00:22:27.599
<v Speaker 2>They represent a profound shift in computer science. We are

458
00:22:27.640 --> 00:22:30.920
<v Speaker 2>moving from analyzing isolated data points in a vacuum to

459
00:22:31.119 --> 00:22:33.839
<v Speaker 2>analyzing the relationships between them. Because in the real world,

460
00:22:33.920 --> 00:22:38.319
<v Speaker 2>whether it is physics, biology, or society, relationships are everything.

461
00:22:38.079 --> 00:22:39.599
<v Speaker 1>Which brings me to a final thought I want to leave

462
00:22:39.599 --> 00:22:42.599
<v Speaker 1>you with today. The mechanics of the graph neural network

463
00:22:42.720 --> 00:22:47.519
<v Speaker 1>teach us a fascinating, almost philosophical lesson about reality. These

464
00:22:47.559 --> 00:22:53.119
<v Speaker 1>models prove mathematically that relationships fundamentally define identity. A node

465
00:22:53.240 --> 00:22:56.599
<v Speaker 1>only has meaning and only gains intelligence based on the

466
00:22:56.599 --> 00:23:00.359
<v Speaker 1>neighbors it connects to. But remember the oversmoothing trap. If

467
00:23:00.400 --> 00:23:02.279
<v Speaker 1>a node is forced to average the input of too

468
00:23:02.319 --> 00:23:06.359
<v Speaker 1>many neighbors layer after layer, it completely loses its unique features.

469
00:23:06.559 --> 00:23:09.440
<v Speaker 1>It turns into a muddy gray blob. In the computer,

470
00:23:09.680 --> 00:23:13.319
<v Speaker 1>it literally requires complex algorithms like jump knowledge and skip

471
00:23:13.359 --> 00:23:15.920
<v Speaker 1>connections just to force the node to remember its original

472
00:23:15.960 --> 00:23:19.920
<v Speaker 1>features and protect its identity from the overwhelming crowd. So

473
00:23:20.480 --> 00:23:23.599
<v Speaker 1>what does that mathematical reality say about our own human

474
00:23:23.640 --> 00:23:24.480
<v Speaker 1>social networks?

475
00:23:25.000 --> 00:23:25.960
<v Speaker 2>That's a scary thought.

476
00:23:26.279 --> 00:23:29.039
<v Speaker 1>Think about your own digital life in an age of

477
00:23:29.160 --> 00:23:32.240
<v Speaker 1>endless connectivity, where we are constantly exposed to the opinions,

478
00:23:32.279 --> 00:23:35.599
<v Speaker 1>tastes, and outrage of millions of people online. How many

479
00:23:35.640 --> 00:23:38.359
<v Speaker 1>hops away are you before your own thoughts and your

480
00:23:38.400 --> 00:23:41.559
<v Speaker 1>own individual opinions are just a mathematically smoothed out average

481
00:23:41.559 --> 00:23:45.240
<v Speaker 1>of your Internet feed? Have we oversmoothed ourselves? Are we

482
00:23:45.319 --> 00:23:48.160
<v Speaker 1>losing our sharp edges to the gray blob of the crowd?

483
00:23:48.279 --> 00:23:50.400
<v Speaker 1>Something to chew on until the next deep dive.
