WEBVTT 1 00:00:00.080 --> 00:00:02.520 You know, when we normally think about artificial intelligence learning 2 00:00:02.560 --> 00:00:08.199 to see the world, there's this underlying expectation of neat, 3 00:00:08.560 --> 00:00:09.960 orderly geometry. Right. 4 00:00:10.039 --> 00:00:12.400 Absolutely, everything has its specific place. 5 00:00:12.839 --> 00:00:14.480 Yeah. And whether you're trying to catch up on the 6 00:00:14.560 --> 00:00:17.960 latest tech trends or you're just insanely curious about how 7 00:00:18.039 --> 00:00:22.160 machines actually perceive reality, you've probably heard of neural networks, 8 00:00:22.719 --> 00:00:26.519 and traditional neural networks thrive on perfect grids. I mean, 9 00:00:26.559 --> 00:00:30.120 you feed a computer a photograph and it basically just 10 00:00:30.160 --> 00:00:31.760 sees a strict 2D grid of 11 00:00:31.839 --> 00:00:35.039 pixels, or you feed it a paragraph of text and 12 00:00:35.079 --> 00:00:37.520 it sees a straight 1D line of words. It's 13 00:00:37.560 --> 00:00:40.119 what computer scientists call the Euclidean 14 00:00:39.560 --> 00:00:41.119 domain. Euclidean domain. Yeah. 15 00:00:41.159 --> 00:00:44.920 Yeah, it's basically the math of flat surfaces, straight lines, 16 00:00:45.039 --> 00:00:50.119 and predictable localized structures. It's a world where every single 17 00:00:50.159 --> 00:00:53.439 piece of data has a very specific orderly neighborhood. 18 00:00:53.560 --> 00:00:55.359 But then you step out of the computer and into 19 00:00:55.359 --> 00:00:57.840 your actual life and the real world, like your social 20 00:00:57.880 --> 00:01:00.640 network or the molecular structure of the coffee you drank 21 00:01:00.679 --> 00:01:03.200 this morning, or even the chaotic flow of traffic you 22 00:01:03.240 --> 00:01:05.159 sat in. It just doesn't fit into those neat 23 00:01:05.200 --> 00:01:05.840 little boxes. 24 00:01:06.319 --> 00:01:08.760 No, not at all. 
It's completely chaotic. Exactly. 25 00:01:08.959 --> 00:01:11.879 Suddenly that pristine grid is gone and you are looking 26 00:01:11.959 --> 00:01:15.079 at a landscape that is mathematically messy. It's a non- 27 00:01:15.120 --> 00:01:18.519 Euclidean web of relationships. So today we are taking a 28 00:01:18.560 --> 00:01:21.480 deep dive into a stack of highly technical notes from 29 00:01:21.480 --> 00:01:25.400 the textbook Introduction to Graph Neural Networks by Zhiyuan Liu 30 00:01:25.480 --> 00:01:26.040 and Jie Zhou. 31 00:01:26.359 --> 00:01:29.799 It is a phenomenal text, but yeah, it's incredibly 32 00:01:29.239 --> 00:01:31.840 dense. Super dense. So our mission today is to take 33 00:01:31.879 --> 00:01:35.519 this really math-heavy computer science text and translate it 34 00:01:35.560 --> 00:01:38.920 into something intuitive. We want to figure out exactly how 35 00:01:39.000 --> 00:01:43.200 AI is finally learning to map the messy interconnected web 36 00:01:43.239 --> 00:01:46.560 of reality. Because to map that reality, the computer scientists 37 00:01:46.599 --> 00:01:50.359 had to invent an entirely new architecture, the graph neural network. 38 00:01:50.519 --> 00:01:53.200 And to really appreciate the scale of this paradigm shift, 39 00:01:53.239 --> 00:01:55.640 we first have to look at what broke the old 40 00:01:55.439 --> 00:01:57.239 models. Right, what went wrong? 41 00:01:57.599 --> 00:02:01.120 Exactly. Traditional deep learning hit an absolute wall when it 42 00:02:01.120 --> 00:02:04.799 tried to process anything that wasn't on a grid. Convolutional 43 00:02:04.879 --> 00:02:08.360 neural networks or CNNs, which is the architecture that basically 44 00:02:08.439 --> 00:02:12.639 drove the entire modern image recognition boom, they rely on 45 00:02:12.759 --> 00:02:17.560 sliding a mathematical filter evenly across a predictable. 
46 00:02:16.919 --> 00:02:20.199 Grid, kind of like a little square magnifying glass sliding 47 00:02:20.199 --> 00:02:20.879 over pixels. 48 00:02:21.120 --> 00:02:24.159 Yes, exactly like that. It slides over the image looking 49 00:02:24.199 --> 00:02:26.520 at a neat three-by-three square of pixels at 50 00:02:26.560 --> 00:02:26.919 a time. 51 00:02:27.120 --> 00:02:29.759 Okay, let's unpack this for a second. If traditional AI 52 00:02:29.919 --> 00:02:33.759 is like reading a perfectly formatted Excel spreadsheet or analyzing 53 00:02:33.759 --> 00:02:36.159 a chessboard, a graph is more like looking at a 54 00:02:36.199 --> 00:02:37.560 detective's messy corkboard. 55 00:02:37.639 --> 00:02:39.159 Oh, I love that analogy. Right? 56 00:02:39.159 --> 00:02:41.879 You know, the ones from the thrillers. Just chaotic pushpins with 57 00:02:41.919 --> 00:02:46.439 red string tying dozens of unpredictable suspects, locations and clues all together. 58 00:02:47.039 --> 00:02:50.599 A CNN takes its neat little square magnifying glass, stares 59 00:02:50.639 --> 00:02:53.479 at that tangled web of red string and just completely 60 00:02:53.560 --> 00:02:54.000 gives up. 61 00:02:54.159 --> 00:02:57.680 It completely breaks down, because on your detective's corkboard, one 62 00:02:57.719 --> 00:03:00.439 clue might have two strings attached to it, and then 63 00:03:00.479 --> 00:03:03.360 another clue right next to it might have five hundred 64 00:03:03.360 --> 00:03:05.560 strings connecting it to everything else on the board. 65 00:03:05.639 --> 00:03:06.759 Wow. Yeah. So you 66 00:03:06.759 --> 00:03:10.319 can't slide a standard fixed-size three-by-three filter 67 00:03:10.439 --> 00:03:13.520 over a spider web. The distance between the nodes isn't 68 00:03:13.560 --> 00:03:17.280 a straight line anymore. The concept of up, down, left, right, 69 00:03:17.840 --> 00:03:21.400 it just doesn't exist. It's purely about relationships and connections. 
70 00:03:21.719 --> 00:03:24.439 But computer scientists didn't just throw their hands up when 71 00:03:24.439 --> 00:03:27.039 they saw the corkboard, right? Yeah, I was looking at 72 00:03:27.039 --> 00:03:31.039 the early workarounds. The textbook mentions these things called network 73 00:03:31.080 --> 00:03:34.120 embedding methods, like DeepWalk and node2vec. 74 00:03:34.479 --> 00:03:36.560 Yeah, the early attempts to solve the problem. 75 00:03:36.599 --> 00:03:39.039 From what I gather, they tried to send virtual agents 76 00:03:39.080 --> 00:03:42.319 walking randomly along the strings of the corkboard to map 77 00:03:42.360 --> 00:03:44.879 it out, essentially trying to flatten the whole 3D 78 00:03:45.000 --> 00:03:47.680 web into a simple flat list of numbers. 79 00:03:47.840 --> 00:03:49.479 That's a great way to put it. They tried to 80 00:03:49.479 --> 00:03:52.599 map the nodes into low-dimensional vectors using those random walks. 81 00:03:52.840 --> 00:03:55.879 They were essentially trying to force a non-Euclidean graph 82 00:03:56.280 --> 00:03:58.520 to behave like a Euclidean spreadsheet. 83 00:03:58.599 --> 00:03:59.360 But it didn't work. 84 00:03:59.639 --> 00:04:02.879 No, it failed on a massive scale, and for two 85 00:04:03.199 --> 00:04:07.840 really critical reasons. First, they didn't share computational 86 00:04:07.080 --> 00:04:09.240 parameters. Which means what, exactly? It means 87 00:04:09.280 --> 00:04:12.159 every single node you added to the graph required the 88 00:04:12.240 --> 00:04:15.039 model to learn a brand new set of weights. So 89 00:04:15.159 --> 00:04:18.360 if you were analyzing a social network with billions of users, 90 00:04:18.720 --> 00:04:23.360 the computational cost just grew linearly until the machine choked. 91 00:04:23.439 --> 00:04:25.120 It was a nightmare. Oh man. 
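The random walks described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `corkboard` dictionary, the walk length of 5, and the function name `random_walk` are all made up for this example, and it covers just the walking step. DeepWalk-style methods then feed walks like these to a skip-gram model to learn one vector per node, which is why the number of parameters grows with the number of nodes and a pin added later has no vector at all.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Walk up to `length` nodes along edges, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Hypothetical toy corkboard: each pin maps to the pins it is tied to.
corkboard = {
    "suspect": ["alibi", "warehouse"],
    "alibi": ["suspect"],
    "warehouse": ["suspect", "getaway_car"],
    "getaway_car": ["warehouse"],
}

rng = random.Random(0)  # fixed seed so the walks are repeatable
walks = [random_walk(corkboard, pin, 5, rng) for pin in corkboard]
```

Each walk is a short sequence of pins, which is what lets a word-embedding style model treat the graph like a pile of sentences.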
92 00:04:25.680 --> 00:04:28.079 And the second failure was about adapting to the unknown, 93 00:04:28.120 --> 00:04:29.040 wasn't it? Exactly. 94 00:04:29.079 --> 00:04:31.879 If you trained one of these early models on a 95 00:04:31.920 --> 00:04:35.360 specific corkboard, and then I walked into the room and 96 00:04:35.519 --> 00:04:38.560 pinned a brand new suspect to the board with new strings, 97 00:04:38.959 --> 00:04:40.759 the model was totally blind to it. 98 00:04:40.879 --> 00:04:42.839 Wait, really, it just couldn't see the new pin? 99 00:04:42.959 --> 00:04:45.240 It couldn't process it at all. It just memorized the 100 00:04:45.279 --> 00:04:47.680 specific board it was looking at, rather than learning how 101 00:04:47.720 --> 00:04:48.920 to actually be a detective. 102 00:04:49.120 --> 00:04:53.040 So to build an AI that actually learns, researchers realized 103 00:04:53.120 --> 00:04:56.680 they had to understand the underlying structure of the graph mathematically, 104 00:04:57.040 --> 00:04:59.560 rather than just trying to flatten it into a list. Right. 105 00:05:00.079 --> 00:05:03.199 Part of understanding these non-Euclidean spaces relies on something 106 00:05:03.279 --> 00:05:04.920 called the Laplacian matrix. 107 00:05:05.360 --> 00:05:08.560 The text calls it the mathematical heartbeat of the graph. 108 00:05:08.920 --> 00:05:11.800 I really love that phrasing, but visualizing a matrix is 109 00:05:11.800 --> 00:05:14.759 always kind of tricky. If we think about the corkboard, 110 00:05:15.279 --> 00:05:18.319 how does the Laplacian capture that chaotic shape? 111 00:05:18.680 --> 00:05:21.800 Think about the tension in the strings. In graph theory, 112 00:05:21.959 --> 00:05:24.720 you start with an adjacency matrix, which is basically just 113 00:05:24.800 --> 00:05:27.279 a ledger showing which pushpins are connected to which. 114 00:05:27.360 --> 00:05:28.720 Okay, simple ledger, right. 
115 00:05:29.360 --> 00:05:32.360 Then you have a degree matrix, which just simply counts 116 00:05:32.399 --> 00:05:35.639 the total number of strings attached to each pushpin. The 117 00:05:35.759 --> 00:05:39.240 Laplacian matrix is the mathematical difference between the two: the 118 00:05:39.279 --> 00:05:41.759 degree matrix minus the adjacency matrix. 119 00:05:41.800 --> 00:05:44.639 So it's subtracting the connections from the total strings. 120 00:05:44.319 --> 00:05:46.959 Exactly, and by doing that it captures not just where 121 00:05:46.959 --> 00:05:49.560 the pins are, but the potential energy and the flow 122 00:05:49.600 --> 00:05:53.879 of information between them. It mathematically describes the overall shape 123 00:05:53.920 --> 00:05:55.639 and structure of the entire web. 124 00:05:55.720 --> 00:05:59.480 Wow. But so even with the Laplacian matrix acting as 125 00:05:59.519 --> 00:06:02.839 this perfect map of the energy, early researchers still had 126 00:06:02.839 --> 00:06:05.839 to figure out how to actually do convolution, right? They 127 00:06:05.959 --> 00:06:08.519 still had to figure out how to slide that magnifying 128 00:06:08.560 --> 00:06:12.120 glass over the strings to extract meaning. They did. 129 00:06:11.920 --> 00:06:14.680 And this is where the textbook gets fascinating, because the 130 00:06:14.839 --> 00:06:18.680 entire field of computer science literally split into two opposing 131 00:06:18.720 --> 00:06:20.199 philosophical camps trying 132 00:06:20.040 --> 00:06:22.199 to solve this: spectral and spatial. 133 00:06:21.920 --> 00:06:26.759 Exactly, the great divide in graph learning. The spectral approach 134 00:06:26.959 --> 00:06:31.000 is heavily rooted in complex physics and signal processing. It 135 00:06:31.040 --> 00:06:34.240 relies on the Fourier domain. 
So instead of looking at 136 00:06:34.319 --> 00:06:38.759 individual pushpins, spectral models like the Spectral Network and ChebNet, 137 00:06:39.079 --> 00:06:41.639 they look at the graph as a whole system of 138 00:06:41.720 --> 00:06:42.800 vibrating signals. 139 00:06:43.199 --> 00:06:47.240 So if spatial is looking at the individual pins, spectral 140 00:06:47.319 --> 00:06:49.759 is like plucking the strings to see how the whole 141 00:06:49.759 --> 00:06:50.759 board vibrates. 142 00:06:51.000 --> 00:06:52.399 That is a perfect analogy. 143 00:06:52.519 --> 00:06:55.439 Yes, they're looking at the overall structural frequencies of the 144 00:06:55.439 --> 00:06:58.199 graph based on that Laplacian matrix we just talked about. 145 00:06:58.240 --> 00:07:01.560 They look at the global resonance. But spectral methods run 146 00:07:01.600 --> 00:07:03.560 into a massive real- 147 00:07:03.319 --> 00:07:05.680 world roadblock. Because they're too rigid. 148 00:07:05.480 --> 00:07:09.319 Exactly. Because the filters they build are mathematically tied to 149 00:07:09.360 --> 00:07:12.560 the specific Laplacian matrix of that exact graph, they are 150 00:07:12.680 --> 00:07:17.079 hyper-specialized. Imagine tuning a grand piano to sound perfect 151 00:07:17.160 --> 00:07:20.120 in one specific concert hall. If you pick up that 152 00:07:20.199 --> 00:07:23.079 piano and move it to a different room with different acoustics, 153 00:07:23.240 --> 00:07:25.639 or in this case, a graph with a different structure, 154 00:07:26.199 --> 00:07:29.600 your tuning just doesn't work anymore. The model completely fails 155 00:07:29.639 --> 00:07:32.279 to generalize to new environments. 156 00:07:31.879 --> 00:07:35.000 Which means we have to abandon the whole system approach. 
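The definition given earlier, degree matrix minus adjacency matrix (L = D - A), can be sketched directly. The 4-pin board and the helper names `degree_matrix` and `laplacian` are assumptions made up for this illustration, and plain nested lists stand in for a real matrix library.

```python
def degree_matrix(adjacency):
    """Diagonal matrix whose entry (i, i) counts the strings on pin i."""
    n = len(adjacency)
    return [[sum(adjacency[i]) if i == j else 0 for j in range(n)]
            for i in range(n)]


def laplacian(adjacency):
    """Graph Laplacian: degree matrix minus adjacency matrix."""
    degree = degree_matrix(adjacency)
    n = len(adjacency)
    return [[degree[i][j] - adjacency[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical board: pin 0 is tied to pins 1, 2 and 3; pins 1 and 2 are tied.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
L = laplacian(A)
for row in L:
    print(row)
```

Every row of L sums to zero, because each pin's degree exactly cancels its connections. Spectral methods use the eigenvectors of this matrix as the graph's Fourier basis, so a filter learned in that basis belongs to this one board, which is the piano-tuning problem described above.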
157 00:07:35.120 --> 00:07:37.720 If we want flexible AI, we have to pivot to 158 00:07:37.759 --> 00:07:41.120 the other camp, the spatial approach. You do, and spatial 159 00:07:41.120 --> 00:07:44.680 methods basically say, you know, forget the global frequencies of 160 00:07:44.720 --> 00:07:47.319 the whole room, let's just zoom in and look at 161 00:07:47.319 --> 00:07:49.480 our immediate neighbors on the corkboard. Right. 162 00:07:49.720 --> 00:07:53.360 Spatial methods operate directly on the spatially close neighbors, but 163 00:07:53.439 --> 00:07:56.519 then we run right back into the core problem. How 164 00:07:56.560 --> 00:08:00.240 do you run a standard uniform filter over nodes that 165 00:08:00.319 --> 00:08:02.560 all have a wildly different number of neighbors? 166 00:08:02.680 --> 00:08:05.879 Right. And reading through the textbook's breakdown of early spatial models, 167 00:08:06.120 --> 00:08:09.519 I hit one called PATCHY-SAN, and I have to 168 00:08:09.519 --> 00:08:11.439 be honest, it felt like the researchers were just straight 169 00:08:11.519 --> 00:08:12.000 up cheating. 170 00:08:12.120 --> 00:08:13.680 A lot of people felt that way at the time. 171 00:08:13.879 --> 00:08:17.399 Right. From what I understand, PATCHY-SAN forces chaos into 172 00:08:17.560 --> 00:08:20.879 order by setting a totally arbitrary rule. It basically says, 173 00:08:21.240 --> 00:08:23.639 I'm only going to look at exactly k neighbors for 174 00:08:23.720 --> 00:08:27.560 every single node, no matter what. It extracts exactly k neighbors, 175 00:08:27.759 --> 00:08:30.839 normalizes them, and then just runs a standard 1D CNN 176 00:08:30.879 --> 00:08:31.319 over them. 177 00:08:31.439 --> 00:08:32.759 That's exactly what it does. 
178 00:08:32.879 --> 00:08:36.120 Wait a second, though. If PATCHY-SAN forces a chaotic 179 00:08:36.159 --> 00:08:39.039 web into a neat little sequence of exactly k neighbors, 180 00:08:39.240 --> 00:08:41.840 aren't we just slicing off vital parts of the graph 181 00:08:41.960 --> 00:08:44.279 just to make the math easier for the machine? Then 182 00:08:44.320 --> 00:08:46.639 we are literally ignoring data. 183 00:08:46.679 --> 00:08:50.679 You're not wrong. The researchers were prioritizing computational feasibility over 184 00:08:50.720 --> 00:08:54.799 complete accuracy. They needed something that could actually run. But 185 00:08:54.879 --> 00:08:57.720 that instinct you have that slicing off data is a 186 00:08:57.720 --> 00:09:00.840 fundamental flaw, that is exactly why the field moved 187 00:09:00.840 --> 00:09:03.480 away from rigid structures and developed GraphSAGE. 188 00:09:03.639 --> 00:09:06.000 GraphSAGE. That's a huge one in the text. 189 00:09:06.120 --> 00:09:09.159 It was a monumental leap forward because the creators of 190 00:09:09.200 --> 00:09:11.720 GraphSAGE realized you don't need to force the graph into 191 00:09:11.720 --> 00:09:14.720 a rigid shape. Instead of memorizing a fixed neighborhood of 192 00:09:14.759 --> 00:09:18.360 exactly k nodes, GraphSAGE learns an inductive framework. 193 00:09:18.799 --> 00:09:23.240 Inductive meaning it learns the underlying rule of the puzzle, 194 00:09:23.639 --> 00:09:26.039 not just the specific solution to one puzzle. 195 00:09:26.399 --> 00:09:30.720 It learns the strategy. So GraphSAGE uniformly samples a 196 00:09:30.759 --> 00:09:33.919 fixed-size set of neighbors. But the brilliance is in 197 00:09:33.960 --> 00:09:37.000 what it does next. It applies an aggregator 198 00:09:36.399 --> 00:09:38.679 function. Like finding an average? 
199 00:09:38.440 --> 00:09:42.360 Exactly, like a mean aggregator that finds the mathematical average 200 00:09:42.360 --> 00:09:46.080 of the features, or a pooling aggregator. It's not trying 201 00:09:46.120 --> 00:09:48.960 to learn the specific nodes themselves. It's learning the function 202 00:09:49.080 --> 00:09:52.639 of how to pull in feature information from whatever local 203 00:09:52.639 --> 00:09:54.240 neighborhood happens to be around it. 204 00:09:54.279 --> 00:09:56.720 Oh wow. So because it learns the how, you can 205 00:09:56.759 --> 00:10:00.000 take an entirely unseen node, drop it into the network tomorrow, 206 00:10:00.360 --> 00:10:03.039 and the model intuitively knows how to process it based 207 00:10:03.080 --> 00:10:04.679 on whatever new neighbors surround it. 208 00:10:04.720 --> 00:10:07.159 Exactly, it finally learned how to be the detective. It 209 00:10:07.200 --> 00:10:09.039 knows how to read the strings no matter what crazy 210 00:10:09.039 --> 00:10:10.039 board you put in front of it. 211 00:10:10.120 --> 00:10:14.919 That's incredible. But aggregating neighbors equally brings up another glaring 212 00:10:14.960 --> 00:10:19.080 real-world problem. In reality, not all relationships are created equal. 213 00:10:19.200 --> 00:10:19.879 No, definitely not. 214 00:10:20.279 --> 00:10:23.120 Think about your own life. If I ask my friends 215 00:10:23.159 --> 00:10:26.159 for advice on buying a car, my friend who has 216 00:10:26.159 --> 00:10:28.879 been a mechanic for twenty years matters a lot more 217 00:10:28.879 --> 00:10:30.960 than my friend who rides a unicycle. I would hope 218 00:10:31.000 --> 00:10:35.759 so. Right. But standard spatial aggregation, just averaging everyone together, 219 00:10:36.279 --> 00:10:40.120 treats the mechanic and the unicycle rider as mathematically equal. 
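The sample-then-average step described above can be sketched as follows. This is a rough illustration, not the GraphSAGE implementation: the function names, the toy graph, and the choice to sample with replacement when a node has too few neighbors are all assumptions, and the learned weight matrix and nonlinearity that would normally follow the concatenation are left out.

```python
import random


def sample_neighbors(neighbors, sample_size, rng):
    """Uniformly pick a fixed-size set of neighbors.

    Assumption: when a node has fewer neighbors than `sample_size`,
    sample with replacement so the output size stays fixed.
    """
    if len(neighbors) >= sample_size:
        return rng.sample(neighbors, sample_size)
    return [rng.choice(neighbors) for _ in range(sample_size)]


def mean_aggregate(features, sampled):
    """Element-wise average of the sampled neighbors' feature vectors."""
    dim = len(features[sampled[0]])
    return [sum(features[n][d] for n in sampled) / len(sampled)
            for d in range(dim)]


def sage_layer_input(node, graph, features, sample_size, rng):
    """Concatenate a node's own features with its neighborhood average."""
    sampled = sample_neighbors(graph[node], sample_size, rng)
    return features[node] + mean_aggregate(features, sampled)


# Hypothetical toy data: "new" was pinned to the board after training.
graph = {"new": ["a", "b", "c"]}
features = {
    "a": [1.0, 0.0],
    "b": [3.0, 0.0],
    "c": [2.0, 6.0],
    "new": [5.0, 5.0],
}

combined = sage_layer_input("new", graph, features, 3, random.Random(0))
```

Nothing in these functions refers to a specific node identity, only to whatever neighbors happen to be present, which is the sense in which the approach is inductive. It also shows the limitation raised above: every sampled neighbor counts equally in the average.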
220 00:10:40.720 --> 00:10:43.639 And this is where the architecture evolves to mirror human 221 00:10:43.720 --> 00:10:47.879 cognition much more closely. We transition into adding memory and 222 00:10:47.919 --> 00:10:51.799 attention to the graph. The textbook details graph recurrent networks, 223 00:10:51.960 --> 00:10:56.480 or GRNs, and graph attention networks, known as GATs. GATs. 224 00:10:57.039 --> 00:11:00.480 Here's where it gets really interesting to me. Older graph 225 00:11:00.559 --> 00:11:03.039 convolutional networks, the ones that just average all their neighbors, 226 00:11:03.399 --> 00:11:05.480 are sort of like being in a loud cocktail party 227 00:11:05.480 --> 00:11:07.559 where you try to listen to everyone in the room equally. 228 00:11:07.679 --> 00:11:10.120 That sounds exhausting. It is. You pull in so 229 00:11:10.200 --> 00:11:13.879 much overlapping chatter that it just creates a dull, useless hum. 230 00:11:14.360 --> 00:11:17.399 But graph attention networks, the GATs, they put on noise- 231 00:11:17.440 --> 00:11:20.240 canceling headphones and focus entirely on the one person with 232 00:11:20.320 --> 00:11:21.240 the juicy gossip. 233 00:11:21.519 --> 00:11:24.360 That's a great way to visualize the self-attention mechanism. 234 00:11:24.639 --> 00:11:28.440 It is a brilliant piece of engineering. Basically, for every 235 00:11:28.440 --> 00:11:33.399 single neighbor a node has, the model calculates an attention coefficient. 236 00:11:32.919 --> 00:11:35.360 Using LeakyReLU and softmax equations, right? 237 00:11:35.480 --> 00:11:38.159 Yes, the math gets heavy there. But to avoid the 238 00:11:38.159 --> 00:11:40.799 heavy jargon, just think of it as a mathematical filter 239 00:11:40.919 --> 00:11:44.159 that actively mutes the background noise and cranks up the 240 00:11:44.240 --> 00:11:47.600 volume on the important signal. 
It runs the data through 241 00:11:47.639 --> 00:11:52.039 a function that penalizes irrelevant information and then balances all 242 00:11:52.080 --> 00:11:54.799 those individual attention scores out so they add up to 243 00:11:54.840 --> 00:11:56.600 a clean one hundred percent. 244 00:11:56.759 --> 00:12:00.000 Oh, I see. So this assigns specific weighted importance 245 00:12:00.279 --> 00:12:03.240 to different neighbors. The mechanic gets an eighty-five percent 246 00:12:03.240 --> 00:12:06.559 attention score and the unicycle rider gets a two percent score. 247 00:12:06.679 --> 00:12:08.759 Exactly. It learns who to trust. 248 00:12:08.600 --> 00:12:11.600 And the text also highlights multi-head attention. If we 249 00:12:11.639 --> 00:12:14.519 stick to the cocktail party analogy, I assume that's like 250 00:12:14.559 --> 00:12:17.440 sending five different friends into the party, each instructed to 251 00:12:17.440 --> 00:12:20.200 listen for different kinds of gossip. Like one listens for 252 00:12:20.279 --> 00:12:23.720 financial news, one for relationship drama, and then they all 253 00:12:23.720 --> 00:12:25.080 compare notes at the end of the night. 254 00:12:25.519 --> 00:12:29.679 Yeah, that's spot on. Multi-head attention stabilizes the learning process 255 00:12:29.720 --> 00:12:35.159 by running several independent attention mechanisms simultaneously and concatenating the results. 256 00:12:35.480 --> 00:12:38.080 It ensures the model doesn't fixate on just one type 257 00:12:38.080 --> 00:12:39.480 of relationship. 258 00:12:38.960 --> 00:12:40.440 So it gets a well-rounded view. Right. 259 00:12:40.759 --> 00:12:43.080 But beyond just focusing on the right neighbors in the 260 00:12:43.080 --> 00:12:46.240 present moment, sometimes the network needs memory to understand the 261 00:12:46.240 --> 00:12:49.639 broader context. This is where graph recurrent networks come in. 
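The attention step described above (a score per neighbor, passed through LeakyReLU, then a softmax so the weights sum to 1) can be sketched like this. It is a simplified illustration under stated assumptions: the feature vectors and the `attn_vec` values are invented, the shared learned weight matrix that a graph attention layer applies to features first is omitted, and the 0.2 negative slope is simply a common default.

```python
import math


def leaky_relu(x, slope=0.2):
    """Pass positive values through; shrink negative ones."""
    return x if x > 0 else slope * x


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(node_vec, neighbor_vecs, attn_vec):
    """One attention weight per neighbor of the node."""
    scores = []
    for vec in neighbor_vecs:
        pair = node_vec + vec  # concatenate node and neighbor features
        raw = sum(a * x for a, x in zip(attn_vec, pair))
        scores.append(leaky_relu(raw))
    return softmax(scores)


def attend(neighbor_vecs, weights):
    """Weighted sum of neighbor features."""
    dim = len(neighbor_vecs[0])
    return [sum(w * v[d] for w, v in zip(weights, neighbor_vecs))
            for d in range(dim)]


def multi_head(node_vec, neighbor_vecs, attn_vecs):
    """Run several independent heads and concatenate their outputs."""
    out = []
    for attn_vec in attn_vecs:
        weights = attention_weights(node_vec, neighbor_vecs, attn_vec)
        out.extend(attend(neighbor_vecs, weights))
    return out


# Hypothetical features: the asker, the mechanic, the unicycle rider.
asker = [1.0, 0.0]
friends = [[2.0, 1.0], [-1.0, 0.5]]
head_a = [0.5, 0.5, 1.0, 1.0]   # made-up attention parameters
head_b = [0.1, 0.0, -0.5, 2.0]  # a second made-up head

weights = attention_weights(asker, friends, head_a)
combined = multi_head(asker, friends, [head_a, head_b])
```

With these made-up numbers the first friend ends up with most of the weight, and the two heads together produce a vector twice as long as a single head's output.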
262 00:12:50.039 --> 00:12:53.720 They heavily borrow memory gates like GRU and LSTM gates 263 00:12:53.960 --> 00:12:58.039 from traditional sequence models to remember long-term dependencies and 264 00:12:58.120 --> 00:12:59.519 forget irrelevant data. 265 00:13:00.039 --> 00:13:02.879 The source highlighted a specific model for analyzing text called 266 00:13:02.919 --> 00:13:07.759 the Sentence-LSTM, or S-LSTM. This honestly blew my mind. 267 00:13:08.200 --> 00:13:11.039 Normally text is just a straight line, but here they 268 00:13:11.039 --> 00:13:13.600 take a sentence, turn the words into nodes on a graph, 269 00:13:13.679 --> 00:13:16.559 so each word can look at its immediate neighbors. But then, 270 00:13:17.080 --> 00:13:19.639 this is the crazy part, they add this genius thing 271 00:13:19.679 --> 00:13:20.840 called a supernode. 272 00:13:21.120 --> 00:13:26.639 Yes, the supernode solves a massive architectural bottleneck. If you 273 00:13:26.720 --> 00:13:30.279 are analyzing a really long paragraph, a word needs to 274 00:13:30.360 --> 00:13:33.159 understand the grammar of the words immediately next to it, 275 00:13:33.399 --> 00:13:35.759 but it also needs to understand the overarching theme of 276 00:13:35.799 --> 00:13:37.240 the whole text. Right. 277 00:13:37.080 --> 00:13:39.759 Like if the text is a massive legal document, the 278 00:13:39.799 --> 00:13:41.600 first word of the page and the last word of 279 00:13:41.639 --> 00:13:44.399 the page might be hundreds of hops away from each other. 280 00:13:44.399 --> 00:13:47.679 On a normal graph, the signal would totally degrade before 281 00:13:47.720 --> 00:13:49.320 they ever communicated. Exactly. 282 00:13:49.600 --> 00:13:53.799 The S-LSTM elegantly solves this by connecting every single word 283 00:13:53.840 --> 00:13:57.279 node to its immediate neighbors, but also connecting every single 284 00:13:57.279 --> 00:13:59.600 word to one overarching supernode. 
285 00:13:59.639 --> 00:13:59.960 Wow. 286 00:14:00.240 --> 00:14:03.320 So the word nodes handle the local context, the immediate 287 00:14:03.320 --> 00:14:07.080 grammar and phrasing. Meanwhile, the supernode acts as a central hub, 288 00:14:07.399 --> 00:14:11.159 aggregating information from all the words simultaneously and feeding that 289 00:14:11.279 --> 00:14:13.679 global context back down to the individual words. 290 00:14:13.960 --> 00:14:16.759 It's like having a project manager who sees the entire 291 00:14:16.840 --> 00:14:20.480 timeline of the construction project, while the individual workers only 292 00:14:20.480 --> 00:14:23.960 see their daily tasks. The project manager constantly yells 293 00:14:24.000 --> 00:14:26.200 down from the scaffolding to make sure everyone is actually 294 00:14:26.240 --> 00:14:27.320 building the same house. 295 00:14:27.600 --> 00:14:30.360 That is exactly what it does, and because it allows 296 00:14:30.360 --> 00:14:33.960 information to flow so efficiently across the whole structure without 297 00:14:33.960 --> 00:14:39.159 degrading over long distances, the S-LSTM has actually outperformed incredibly 298 00:14:39.240 --> 00:14:42.759 powerful state-of-the-art sequence models like the Transformer 299 00:14:42.759 --> 00:14:44.679 on certain text classification tasks. 300 00:14:44.720 --> 00:14:48.639 That is wild. Okay, so if giving a graph neural 301 00:14:48.679 --> 00:14:53.159 network memory, dynamic attention and a project manager supernode makes 302 00:14:53.159 --> 00:14:56.600 it this incredibly smart, the logical next step in computer 303 00:14:56.639 --> 00:15:00.360 science is always the same: go deeper. Oh, always. Right, 304 00:15:00.399 --> 00:15:02.759 if a two-layer graph neural network is good, a 305 00:15:02.799 --> 00:15:06.159 fifty-layer network must be a superintelligence. Let's just stack 306 00:15:06.200 --> 00:15:07.759 these aggregation layers to the moon. 
307 00:15:07.879 --> 00:15:10.480 And that is exactly what happened with convolutional neural networks 308 00:15:10.480 --> 00:15:13.440 for images. Researchers went from models with just a few 309 00:15:13.519 --> 00:15:16.399 layers to ResNet architectures with over one hundred layers, and 310 00:15:16.440 --> 00:15:18.360 the performance just skyrocketed. 311 00:15:18.440 --> 00:15:21.279 But with graphs, it's not that simple, is it? 312 00:15:21.440 --> 00:15:23.759 Not at all. Doing that with graphs plunges you straight 313 00:15:23.799 --> 00:15:26.240 into the biggest, most frustrating trap in graph 314 00:15:26.120 --> 00:15:27.799 learning. The oversmoothing trap. 315 00:15:28.000 --> 00:15:32.200 Yes, to understand why stacking layers destroys a graph, we 316 00:15:32.279 --> 00:15:35.480 have to look back at the original vanilla GNN proposed 317 00:15:35.519 --> 00:15:39.000 back in two thousand and nine. It was painfully inefficient 318 00:15:39.039 --> 00:15:42.320 because it updated node states iteratively until it hit what 319 00:15:42.360 --> 00:15:45.320 they called a fixed point. By the time the math 320 00:15:45.399 --> 00:15:48.080 reached that fixed point, the representations of the nodes were 321 00:15:48.080 --> 00:15:49.440 far less informative. 322 00:15:49.559 --> 00:15:52.200 So what does this all mean for you listening? Think 323 00:15:52.240 --> 00:15:55.960 about a beautiful, diverse mosaic made of thousands of uniquely 324 00:15:56.000 --> 00:15:59.559 colored tiles. If you constantly average the colors of all 325 00:15:59.600 --> 00:16:02.279 your neighbors, and then in the next layer you average 326 00:16:02.279 --> 00:16:05.799 the new colors of your neighbors' neighbors, it blends. Right. Eventually, 327 00:16:05.879 --> 00:16:09.440 that gorgeous mosaic just turns into a giant, muddy gray 328 00:16:09.519 --> 00:16:11.320 blob. That is oversmoothing. 
329 00:16:11.519 --> 00:16:15.320 It's the mathematical homogenization of the data. By layer ten, 330 00:16:15.480 --> 00:16:17.600 a node isn't just looking at its immediate friends. It's 331 00:16:17.639 --> 00:16:20.679 looking at its friends of friends of friends, exponentially outward. 332 00:16:20.919 --> 00:16:23.440 It's pulling in massive amounts of irrelevant noise from the 333 00:16:23.440 --> 00:16:26.559 far edges of the graph, until every single node shares 334 00:16:26.600 --> 00:16:28.360 the exact same average representation. 335 00:16:28.559 --> 00:16:31.159 The network loses all its sharp edges. Exactly. 336 00:16:31.279 --> 00:16:33.440 You lose the unique features that define the node in 337 00:16:33.480 --> 00:16:34.440 the first place. 338 00:16:34.440 --> 00:16:37.639 Which completely ruins the point of the graph. I mean, 339 00:16:37.679 --> 00:16:40.519 if every node mathematically looks like a muddy gray blob, 340 00:16:40.879 --> 00:16:43.799 the AI can't classify a cancer cell from a healthy cell, 341 00:16:44.039 --> 00:16:45.799 or a bot account from a real user. 342 00:16:45.879 --> 00:16:46.679 It becomes useless. 343 00:16:46.879 --> 00:16:48.799 So if the problem is that we are averaging too 344 00:16:48.799 --> 00:16:51.679 many neighbors over too many layers until it becomes a blob, 345 00:16:52.120 --> 00:16:55.080 the logical solution has to be finding a way to 346 00:16:55.159 --> 00:16:58.159 hit the brakes, right? Giving the network a way to 347 00:16:58.200 --> 00:17:00.120 stop before it loses its identity. 348 00:17:00.519 --> 00:17:03.639 And that realization led to the development of graph residual 349 00:17:03.639 --> 00:17:07.960 networks, or GRNs. One of the most brilliant solutions the 350 00:17:08.000 --> 00:17:12.079 textbook covers is the Jump Knowledge Network, or JKN. 351 00:17:12.279 --> 00:17:13.599 Oh, this is fascinating. 
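The oversmoothing effect described above is easy to reproduce on a toy example. The sketch below assumes the simplest possible layer, where each node is replaced by the plain average of itself and its neighbors with no learned weights; the five-tile path graph and the starting shades are made up for illustration.

```python
def average_with_neighbors(values, graph):
    """One aggregation layer: each node becomes the mean of itself and its neighbors."""
    new_values = {}
    for node, neighbors in graph.items():
        group = [values[node]] + [values[n] for n in neighbors]
        new_values[node] = sum(group) / len(group)
    return new_values


def spread(values):
    """Gap between the lightest and darkest tile."""
    return max(values.values()) - min(values.values())


# Hypothetical row of five mosaic tiles, each touching the next.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}

history = [spread(values)]
for _ in range(50):  # stack fifty averaging layers
    values = average_with_neighbors(values, graph)
    history.append(spread(values))
```

Printing `history` shows the gap between tiles shrinking layer after layer until all five shades are practically the same gray, even though nothing about the graph itself changed.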
352 00:17:13.640 --> 00:17:17.279 The researchers behind JKN recognized that different nodes need different 353 00:17:17.319 --> 00:17:21.119 receptive fields. A node sitting right in the dense, crowded 354 00:17:21.160 --> 00:17:23.680 core of the social network might turn into a gray 355 00:17:23.720 --> 00:17:26.440 blob after just two layers simply because it has so 356 00:17:26.480 --> 00:17:28.759 many connections flooding it with data. 357 00:17:28.440 --> 00:17:30.400 Right, too much gossip at the party. Exactly. 358 00:17:30.680 --> 00:17:33.839 But a node out on the isolated quiet fringes might 359 00:17:33.920 --> 00:17:36.880 actually need five or six layers of aggregation just to 360 00:17:36.920 --> 00:17:39.000 gather enough context from the rest of the board to 361 00:17:39.000 --> 00:17:39.599 be useful. 362 00:17:39.799 --> 00:17:42.160 So it literally lets the node jump back through time 363 00:17:42.279 --> 00:17:43.279 to a previous layer. 364 00:17:43.400 --> 00:17:46.039 Yes, in the final layer of the network, the JKN 365 00:17:46.119 --> 00:17:51.039 lets every single node adaptively select which intermediate layer's representation 366 00:17:51.240 --> 00:17:55.039 was most useful for its specific situation. The dense core 367 00:17:55.119 --> 00:17:57.839 node can choose to use its representation from layer two, 368 00:17:58.200 --> 00:18:01.359 while the fringe node pulls from layer five. It preserves 369 00:18:01.400 --> 00:18:04.079 the structural awareness of each node before it gets smoothed 370 00:18:04.119 --> 00:18:04.880 out by the math. 371 00:18:05.440 --> 00:18:08.400 That is incredibly clever. It's basically like giving each node 372 00:18:08.680 --> 00:18:11.799 its own personalized stop button, like, okay, I've learned enough 373 00:18:11.799 --> 00:18:15.000 about my surroundings, stop averaging before I lose who I am. 374 00:18:15.160 --> 00:18:17.039 It's a very elegant solution, and. 
375 00:18:16.960 --> 00:18:20.440 The text also details how researchers borrowed tricks from those 376 00:18:20.480 --> 00:18:23.359 massive image networks to build DeepGCNs, right? 377 00:18:23.480 --> 00:18:27.799 Yes, DeepGCNs tackle both the vanishing gradient problem, which 378 00:18:27.799 --> 00:18:30.559 is a mathematical decay that happens in all deep networks, 379 00:18:30.960 --> 00:18:35.680 and oversmoothing. They use ResNet-style skip connections, which 380 00:18:35.839 --> 00:18:38.519 literally take the raw matrix of data from a previous 381 00:18:38.599 --> 00:18:40.759 layer and add it directly to the current one, keeping 382 00:18:40.759 --> 00:18:44.559 the original signal alive. Just bypassing the blur. Exactly. But 383 00:18:44.640 --> 00:18:47.759 the real breakthrough for preventing the gray blob in deep 384 00:18:47.839 --> 00:18:50.480 GCNs is a technique called dilated k-NN. 385 00:18:50.519 --> 00:18:54.279 Dilated k-nearest neighbors. Now, if the problem is 386 00:18:54.319 --> 00:18:57.440 pulling in too much dense noise from immediate neighbors, I'm 387 00:18:57.480 --> 00:19:01.640 guessing dilation forces the network to, like, ignore the people 388 00:19:01.720 --> 00:19:03.559 right next to it so it can look further away. 389 00:19:03.680 --> 00:19:06.880 That's the core idea. It expands the receptive field without 390 00:19:06.880 --> 00:19:10.799 adding pure noise. Instead of looking at every single immediate 391 00:19:10.799 --> 00:19:14.319 neighbor in a dense cluster, the network calculates a wider 392 00:19:14.440 --> 00:19:18.640 radius of nearest neighbors and then intentionally skips nodes at 393 00:19:18.640 --> 00:19:19.440 a set interval. 394 00:19:19.480 --> 00:19:20.240 Oh, I get it. 395 00:19:20.240 --> 00:19:22.680 It dilates its view, grabbing a sample from further out 396 00:19:22.720 --> 00:19:25.119 while ignoring the overwhelming density in between. 
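The skip-at-a-set-interval idea described above can be sketched as follows. This is a minimal illustration assuming the usual formulation, where the `k * dilation` nearest neighbors are found first and then every `dilation`-th one is kept; the function name `dilated_knn` and the points laid out along a line are made up for the example.

```python
import math


def dilated_knn(points, query_index, k, dilation):
    """Return indices of k neighbors, skipping `dilation - 1` between picks."""
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    # Sort every other point by its distance to the query point.
    others.sort(key=lambda i: math.dist(query, points[i]))
    nearest = others[: k * dilation]  # the wider radius
    return nearest[::dilation]        # keep every dilation-th neighbor


# Hypothetical points spaced one unit apart along a line.
points = [(float(i), 0.0) for i in range(7)]

plain = dilated_knn(points, 0, 3, 1)    # ordinary 3 nearest neighbors
dilated = dilated_knn(points, 0, 3, 2)  # same k, twice the reach
```

Both calls return three neighbors, so the cost of the aggregation stays the same, but the dilated version reaches further out and leaves gaps in between.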
397 00:19:25.319 --> 00:19:29.319 It's exactly like standing way back from a massive Impressionist painting. 398 00:19:29.960 --> 00:19:32.079 If you press your nose to the canvas, you are 399 00:19:32.119 --> 00:19:35.039 totally overwhelmed by the density of the brushstrokes. You can't 400 00:19:35.079 --> 00:19:37.079 see anything. You have to zoom out to see the 401 00:19:37.119 --> 00:19:38.519