WEBVTT 1 00:00:01.199 --> 00:00:06.200 Welcome to the Sentient Code, where intelligence is engineered, autonomy 2 00:00:06.280 --> 00:00:10.439 is emerging, and the line between human and machine grows thinner. 3 00:00:10.800 --> 00:00:15.359 Each episode, we decode the algorithms, explore the robotics, and 4 00:00:15.439 --> 00:00:19.000 examine the ideas shaping the future of artificial minds. 5 00:00:24.920 --> 00:00:27.199 You know that voice in your head, the one you're 6 00:00:27.280 --> 00:00:30.440 using right now to process what I'm saying? Maybe it 7 00:00:30.480 --> 00:00:32.240 is saying, okay, where is he going with this? Or 8 00:00:32.880 --> 00:00:34.880 maybe it's just reminding you that you forgot to switch 9 00:00:34.920 --> 00:00:35.799 the laundry. Right. 10 00:00:35.719 --> 00:00:38.840 The internal monologue, the narrator of the documentary that is 11 00:00:38.840 --> 00:00:39.840 your life. Exactly. 12 00:00:40.039 --> 00:00:43.079 We have always thought of that inner voice as just 13 00:00:45.520 --> 00:00:49.840 a human quirk, maybe even a byproduct of consciousness, something 14 00:00:49.840 --> 00:00:53.520 that just happens because we have language. But what if 15 00:00:53.520 --> 00:00:57.000 it is not just noise? What if that little voice 16 00:00:57.200 --> 00:00:59.600 is actually the engine of intelligence? 17 00:01:00.200 --> 00:01:02.679 That is the billion-dollar question. And if you ask 18 00:01:02.759 --> 00:01:05.920 the researchers at the Okinawa Institute of Science and Technology, 19 00:01:06.079 --> 00:01:09.319 or OIST, they will tell you that the reason AI 20 00:01:09.359 --> 00:01:11.560 has been hitting a wall lately is precisely because it 21 00:01:11.599 --> 00:01:13.959 doesn't have that voice. It doesn't mumble to itself. 22 00:01:14.120 --> 00:01:16.079 Mumbling? Was that the technical term they used? 23 00:01:16.159 --> 00:01:18.280 Well, they call it self-directed internal speech.
But yeah, 24 00:01:18.319 --> 00:01:20.280 effectively it is mumbling, and we are looking at a 25 00:01:20.319 --> 00:01:23.840 really fascinating study today. This was published just this January 26 00:01:23.879 --> 00:01:27.400 twenty eighth, twenty twenty six, in the journal Neural Computation. 27 00:01:27.879 --> 00:01:30.840 It is led by first author Dr. Jeffrey Queißer of 28 00:01:30.879 --> 00:01:35.200 the Cognitive Neurorobotics Research Unit, and it proposes something that 29 00:01:35.319 --> 00:01:37.680 frankly sounds a little sci-fi. 30 00:01:38.000 --> 00:01:40.319 Yeah. I read through the material for this deep dive 31 00:01:40.359 --> 00:01:43.040 and my immediate thought was, great, now the robots are 32 00:01:43.079 --> 00:01:44.560 going to be talking to themselves on 33 00:01:44.519 --> 00:01:46.959 the bus. Exactly, just muttering in the corner. 34 00:01:47.680 --> 00:01:50.959 But the implications here are massive, right? We aren't just 35 00:01:51.000 --> 00:01:53.319 talking about a chatbot that is a little bit wittier 36 00:01:53.760 --> 00:01:55.760 or more conversational. No, not at all. 37 00:01:56.040 --> 00:01:59.560 We are talking about a fundamental restructuring of how machines learn. 38 00:02:00.120 --> 00:02:03.439 We are moving away from the whole big data approach, 39 00:02:03.040 --> 00:02:05.599 where you just feed a computer the entire Internet. 40 00:02:05.359 --> 00:02:08.400 Right, just scraping everything. We are moving away from that 41 00:02:08.520 --> 00:02:11.240 and towards something much more biological, something that learns a 42 00:02:11.240 --> 00:02:13.000 lot more like a human child does. 43 00:02:13.319 --> 00:02:15.639 So the mission for our deep dive today is to 44 00:02:15.719 --> 00:02:19.520 really figure out why giving an artificial intelligence a mumble 45 00:02:19.560 --> 00:02:23.319 and a scratch pad might actually be the key to 46 00:02:23.360 --> 00:02:26.000 the next generation of robotics.
Because usually when we hear 47 00:02:26.039 --> 00:02:29.759 about AI upgrades, it's always, oh, we need more chips, 48 00:02:29.840 --> 00:02:31.960 or we need massive new data centers. 49 00:02:31.840 --> 00:02:34.360 More compute, more power, always. 50 00:02:34.240 --> 00:02:37.000 Right, but this is different. This is about the architecture itself. 51 00:02:37.199 --> 00:02:40.199 It is about architecture, but honestly, it's also about psychology, 52 00:02:40.360 --> 00:02:43.319 because to understand this machine architecture, we actually have to 53 00:02:43.360 --> 00:02:46.520 start with the human brain. We have to ask, why 54 00:02:46.560 --> 00:02:47.919 do you talk to yourself? 55 00:02:50.039 --> 00:02:50.919 Usually to keep 56 00:02:50.800 --> 00:02:53.840 from panicking, to be honest. Or if I'm cooking, if 57 00:02:53.879 --> 00:02:57.639 I am making a really complex recipe, I am definitely muttering: okay, 58 00:02:57.680 --> 00:03:00.120 onions are done. Now I need the garlic. Where did I 59 00:03:00.159 --> 00:03:00.680 put the garlic? 60 00:03:00.840 --> 00:03:04.120 Exactly, you are using self-talk as an executive function. 61 00:03:04.520 --> 00:03:07.639 You aren't just making noise into the void. You are 62 00:03:07.800 --> 00:03:13.400 actively organizing disparate ideas. You are weighing conflicting choices. You 63 00:03:13.479 --> 00:03:16.240 are processing sensory data in real time. So it has 64 00:03:16.319 --> 00:03:19.319 a purpose. A very specific purpose. In psychology, we call 65 00:03:19.319 --> 00:03:20.240 this metacognition. 66 00:03:20.520 --> 00:03:21.759 Thinking about thinking. Right. 67 00:03:21.800 --> 00:03:24.719 It allows you to objectify your own thought process. It 68 00:03:24.759 --> 00:03:27.560 creates a feedback loop where the output of one thought, 69 00:03:27.719 --> 00:03:30.879 like the onions are done, becomes the direct input for 70 00:03:30.919 --> 00:03:32.879 the next thought, which is get the garlic.
71 00:03:32.960 --> 00:03:34.360 So it's essentially a chain of logic. 72 00:03:34.680 --> 00:03:37.960 It is a chain. And Dr. Queißer's team is saying, look, 73 00:03:38.039 --> 00:03:40.800 this biological habit isn't a glitch in the human system. 74 00:03:41.120 --> 00:03:44.240 It is a highly functional mechanism. It is literally how 75 00:03:44.280 --> 00:03:47.240 we organize our minds. And if we want AI to 76 00:03:47.319 --> 00:03:50.319 navigate ambiguity the way humans do, we need to import 77 00:03:50.400 --> 00:03:52.960 this biology directly into the code. 78 00:03:53.240 --> 00:03:56.000 But it is not just the voice, right? There is 79 00:03:56.039 --> 00:03:58.680 this other piece of the puzzle that the paper emphasizes, 80 00:03:58.960 --> 00:04:01.520 called working memory. And I really want to pause on 81 00:04:01.560 --> 00:04:07.560 this because in the study they talk a lot about slots. Yes, slots. Slots, right, 82 00:04:07.639 --> 00:04:09.560 and they make a really big deal about how this 83 00:04:09.599 --> 00:04:13.080 is entirely different from how a normal neural network remembers things. 84 00:04:13.520 --> 00:04:14.960 So help me out here, because I think a lot 85 00:04:14.960 --> 00:04:17.639 of people would assume, doesn't the standard chatbot already have 86 00:04:17.680 --> 00:04:20.519 a memory? It remembers what I typed three prompts ago. 87 00:04:20.720 --> 00:04:23.279 It does, but it is a completely different kind of memory. 88 00:04:23.519 --> 00:04:25.759 Think of a standard neural network, like the ones running 89 00:04:25.759 --> 00:04:29.000 most current large language models, as a giant piece of 90 00:04:29.120 --> 00:04:30.079 tie-dyed fabric. 91 00:04:30.120 --> 00:04:33.000 Tie-dye. Okay, I am picturing a vintage T-shirt 92 00:04:33.160 --> 00:04:34.000 from the sixties. 93 00:04:34.199 --> 00:04:37.839 Perfect. When a standard neural network learns something new, the 94 00:04:37.959 --> 00:04:41.879 dye spreads out everywhere.
The information is distributed across all 95 00:04:41.959 --> 00:04:45.759 the connections, all the mathematical weights simultaneously. It is a 96 00:04:45.839 --> 00:04:47.680 holographic kind of storage. 97 00:04:47.759 --> 00:04:48.199 I see. 98 00:04:48.319 --> 00:04:50.920 So if you want to change one specific fact, or 99 00:04:50.959 --> 00:04:53.240 if you just need to hold one specific number in 100 00:04:53.279 --> 00:04:55.439 your head for a second, it is really hard to 101 00:04:55.480 --> 00:04:58.120 do that without messing up the pattern of the whole 102 00:04:57.879 --> 00:05:01.399 shirt. Because it's messy. You can't just pull one specific thread 103 00:05:01.399 --> 00:05:04.360 out without unraveling the entire image or changing the surrounding 104 00:05:04.399 --> 00:05:05.519 colors. Exactly. 105 00:05:06.040 --> 00:05:09.839 That architecture is fantastic for recognizing broad patterns, but it 106 00:05:09.879 --> 00:05:14.040 is actually really bad for holding specifics. Now, what Dr. 107 00:05:14.120 --> 00:05:17.800 Queißer and his team did was introduce these explicit slots. 108 00:05:18.279 --> 00:05:20.639 Imagine that on top of that tie-dye shirt, you sew 109 00:05:20.720 --> 00:05:23.160 on a few clear plastic pockets. 110 00:05:22.759 --> 00:05:25.480 Okay, like a plastic badge holder or a pocket protector. 111 00:05:25.639 --> 00:05:28.839 Right. These are distinct, protected containers. You can write a 112 00:05:28.920 --> 00:05:30.600 number on a piece of paper, put it in slot 113 00:05:30.600 --> 00:05:33.279 A, and it stays perfectly safe. It doesn't bleed into 114 00:05:33.279 --> 00:05:35.480 the tie-dye fabric at all. It functions as a true 115 00:05:35.560 --> 00:05:36.680 variable. I see.
116 00:05:36.720 --> 00:05:39.399 So the AI can say, okay, I am currently holding 117 00:05:39.439 --> 00:05:42.079 the number seven in my left hand, and it completely 118 00:05:42.079 --> 00:05:44.240 doesn't matter what the rest of the network is doing. 119 00:05:44.480 --> 00:05:46.920 That seven is safe and isolated. Precisely. 120 00:05:47.160 --> 00:05:50.000 And this is absolutely crucial for formal logic. If I 121 00:05:50.040 --> 00:05:52.519 tell you to reverse the sequence seven, two, nine, you 122 00:05:52.560 --> 00:05:54.560 need to hold those three numbers in your head, in 123 00:05:54.560 --> 00:05:57.920 your slots, and shuffle them around. A standard AI struggles 124 00:05:57.959 --> 00:06:01.040 with this because it tries to memorize the concept of seven, two, 125 00:06:01.160 --> 00:06:04.560 nine based on how often it has seen those specific 126 00:06:04.680 --> 00:06:06.279 numbers grouped together in the past. 127 00:06:06.600 --> 00:06:09.199 So it's essentially trying to vibe its way... Vibe 128 00:06:09.000 --> 00:06:12.240 its way through a math problem, yes. That is hilarious, 129 00:06:12.480 --> 00:06:15.319 but it's true. It is vibing based purely on statistics. 130 00:06:15.360 --> 00:06:17.519 It looks at the data and says, well, usually seven 131 00:06:17.600 --> 00:06:20.040 is followed by eight, but here it's two, and it 132 00:06:20.160 --> 00:06:22.959 just gets confused. But the OIST model is different. It 133 00:06:23.000 --> 00:06:24.959 puts seven in slot one, two in slot two, and 134 00:06:25.079 --> 00:06:27.839 nine in slot three, and then that is when the 135 00:06:27.879 --> 00:06:30.040 inner voice, the mumbling, kicks in, and 136 00:06:30.000 --> 00:06:32.759 the mumble says, swap slot one and slot three. 137 00:06:32.920 --> 00:06:37.079 Bingo. It generates a symbolic command directed at its own 138 00:06:37.120 --> 00:06:41.319 memory system: swap one and three.
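The slot-and-swap idea described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the OIST model: the names `SlotMemory` and `reverse_with_swaps` are hypothetical, the slots are a plain list rather than anything learned, and the swap commands are written by hand instead of being generated by a network.

```python
from typing import Any, List


class SlotMemory:
    """Hypothetical working memory: a fixed number of isolated slots."""

    def __init__(self, num_slots: int) -> None:
        self.slots: List[Any] = [None] * num_slots

    def write(self, index: int, value: Any) -> None:
        # Writing to one slot never touches the contents of the others.
        self.slots[index] = value

    def swap(self, i: int, j: int) -> None:
        # The command names slot positions only; it never inspects contents.
        self.slots[i], self.slots[j] = self.slots[j], self.slots[i]


def reverse_with_swaps(items: List[Any]) -> List[Any]:
    """Reverse a sequence using only position-based swap commands."""
    memory = SlotMemory(len(items))
    for position, item in enumerate(items):
        memory.write(position, item)
    # Hand-written stand-in for the inner speech: "swap slot i and slot j",
    # working inward from both ends. Slots are zero-indexed here, so the
    # transcript's "slot one and slot three" is slots 0 and 2.
    left, right = 0, len(items) - 1
    while left < right:
        memory.swap(left, right)
        left += 1
        right -= 1
    return list(memory.slots)


print(reverse_with_swaps([7, 2, 9]))                    # [9, 2, 7]
print(reverse_with_swaps(["apple", "zorp", "orange"]))  # ['orange', 'zorp', 'apple']
```

The same swap commands work for numbers, fruit, or made-up words, because the swap refers to positions and never looks inside a slot.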
It absolutely does not 139 00:06:41.519 --> 00:06:43.519 care that the numbers are seven and nine. They could 140 00:06:43.519 --> 00:06:46.120 be an apple and an orange. They could be completely 141 00:06:46.160 --> 00:06:49.800 made-up words. The logic holds perfectly because the slots 142 00:06:49.800 --> 00:06:52.319 are entirely separate from the content inside them. 143 00:06:52.439 --> 00:06:55.920 And this sounds exactly like what computer scientists call generalization. 144 00:06:56.480 --> 00:06:59.759 That is the magic word here, generalization. 145 00:06:59.160 --> 00:07:02.040 Because in the paper they use this incredibly dense phrase. 146 00:07:02.079 --> 00:07:06.399 They call it content-agnostic information processing. It is a mouthful. 147 00:07:06.560 --> 00:07:08.360 It really is, but it seems to be the core 148 00:07:08.399 --> 00:07:09.240 of why this works. 149 00:07:09.439 --> 00:07:11.519 It is a mouthful, but it is the holy grail 150 00:07:11.600 --> 00:07:16.000 of artificial intelligence research. Content-agnostic means the AI understands 151 00:07:16.000 --> 00:07:19.000 the underlying rule, regardless of the specific data it is 152 00:07:19.000 --> 00:07:22.120 looking at. Think about basic algebra. If you know that 153 00:07:22.160 --> 00:07:24.959 A plus B equals C, you can solve that equation 154 00:07:25.000 --> 00:07:27.959 whether A is five or A is five million, or 155 00:07:28.000 --> 00:07:29.560 A is a banana. 156 00:07:29.120 --> 00:07:31.639 Right, because I know the relationship between the parts, not 157 00:07:31.720 --> 00:07:33.600 just the parts themselves. Exactly. 158 00:07:33.759 --> 00:07:38.000 Traditional AI is often just memorizing millions of examples. If it 159 00:07:38.040 --> 00:07:41.560 has seen the sequence one, two, three reverse to three, two, 160 00:07:42.120 --> 00:07:44.399 one a million times in its training data, it can 161 00:07:44.480 --> 00:07:47.600 do it easily.
But if you give it XYZ, it 162 00:07:47.720 --> 00:07:50.800 might fail simply because it hasn't seen those specific letters 163 00:07:50.800 --> 00:07:51.920 in that specific 164 00:07:51.680 --> 00:07:53.399 order before. Which seems so brittle. 165 00:07:53.600 --> 00:07:57.319 It is extremely brittle. But the OIST researchers found that 166 00:07:57.360 --> 00:08:00.319 their model, the one equipped with the memory slots and 167 00:08:00.360 --> 00:08:02.959 the internal mumbling, could look at a sequence it had 168 00:08:03.000 --> 00:08:06.160 literally never seen before in its life and apply the 169 00:08:06.199 --> 00:08:08.560 reverse rule perfectly on the first try. 170 00:08:08.519 --> 00:08:11.079 Because it wasn't looking at the letters themselves, it was 171 00:08:11.120 --> 00:08:14.120 looking at the containers. Take what is in slot one 172 00:08:14.240 --> 00:08:16.040 and move it to slot three. Exactly. 173 00:08:16.079 --> 00:08:18.879 It completely separates the algorithm from the data, and that 174 00:08:18.959 --> 00:08:21.519 is something humans do naturally all day long, but neural 175 00:08:21.560 --> 00:08:23.879 networks have historically been terrible at it. 176 00:08:23.959 --> 00:08:27.079 Okay, so let's dig into the actual mumbling mechanism itself, 177 00:08:27.079 --> 00:08:29.399 because I am trying to visualize this. Yeah. How does 178 00:08:29.399 --> 00:08:32.200 a computer actually mumble? I mean, is it generating a 179 00:08:32.200 --> 00:08:34.039 tiny sound file? Is there a microphone involved? 180 00:08:34.159 --> 00:08:37.559 No, no audio is being generated. It is generating tokens. 181 00:08:38.039 --> 00:08:41.519 In AI terminology, a token is just a fundamental unit 182 00:08:41.559 --> 00:08:44.080 of information, like a word or a piece of a word.
183 00:08:44.559 --> 00:08:47.000 In a normal chatbot that you might use online, the 184 00:08:47.039 --> 00:08:50.120 tokens it generates come out immediately as text on your screen. 185 00:08:50.919 --> 00:08:55.480 But in this OIST system, the researchers created a recurrent loop. 186 00:08:55.360 --> 00:08:57.240 A loop, so it feeds back on itself. 187 00:08:57.320 --> 00:08:59.720 Right, the system generates a token. Let's say it generates 188 00:08:59.759 --> 00:09:01.639 the token for the word swap, but instead of showing 189 00:09:01.639 --> 00:09:04.679 that word to the user, it feeds that token directly 190 00:09:04.759 --> 00:09:07.600 back into its own input layer for the very next 191 00:09:07.759 --> 00:09:09.879 millisecond of processing. 192 00:09:09.440 --> 00:09:11.399 So it is whispering back into its own ear. 193 00:09:11.600 --> 00:09:14.279 It is a quiet mumble. The paper describes it as 194 00:09:14.320 --> 00:09:17.279 the low-level generation of tokens. It acts as an 195 00:09:17.320 --> 00:09:21.840 intermediate computational step. And the researchers did something very specific 196 00:09:21.840 --> 00:09:25.240 here to make this happen. They actively encouraged the system 197 00:09:25.320 --> 00:09:26.519 to do this during training. 198 00:09:26.919 --> 00:09:29.480 Encouraged? Like they gave it a digital cookie? 199 00:09:29.240 --> 00:09:32.080 Sort of, yeah. In machine learning we use things called 200 00:09:32.200 --> 00:09:36.039 loss functions and targets to guide behavior. They essentially set 201 00:09:36.080 --> 00:09:39.000 a strict target where the system was required to produce 202 00:09:39.039 --> 00:09:42.240 a certain amount of internal speech while it was attempting 203 00:09:42.240 --> 00:09:45.320 to solve the problem. They basically said to the AI, 204 00:09:45.799 --> 00:09:48.600 you cannot just guess the final answer. You have to 205 00:09:48.639 --> 00:09:51.159 show your work.
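The feedback loop and the "show your work" target described above might be sketched roughly as follows. Everything here is an assumption made for illustration: `toy_step` is a hand-written lookup table standing in for a trained network, and `speech_shortfall` is a made-up penalty, since the transcript does not give the paper's actual loss function.

```python
from typing import List, Optional, Tuple


def toy_step(task: str, previous_token: str) -> Tuple[str, Optional[str]]:
    """Hand-written stand-in for a learned network (not the paper's model).

    Takes the task input plus the token produced on the previous step and
    returns the next inner token, along with an answer once one is ready.
    """
    script = {
        "<start>": ("load", None),
        "load": ("swap", None),
        "swap": ("done", task[::-1]),
    }
    return script[previous_token]


def run_with_inner_speech(task: str, max_steps: int = 10) -> Tuple[Optional[str], List[str]]:
    """Run the loop, feeding each generated token back in as the next input."""
    inner_speech: List[str] = []
    token = "<start>"
    for _ in range(max_steps):
        # The token from the last step is fed back in here instead of
        # being shown to the user.
        token, answer = toy_step(task, token)
        inner_speech.append(token)
        if answer is not None:
            return answer, inner_speech
    return None, inner_speech


def speech_shortfall(inner_speech: List[str], target_length: int) -> int:
    """Hypothetical penalty: zero only when enough inner speech was produced."""
    return max(0, target_length - len(inner_speech))


answer, trace = run_with_inner_speech("729")
print(answer)                                    # 927
print(trace)                                     # ['load', 'swap', 'done']
print(speech_shortfall(trace, target_length=3))  # 0
```

A training setup along these lines would add something like `speech_shortfall` to the task loss, so that jumping straight to an answer without intermediate tokens is penalized.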
You have to talk it through step 206 00:09:51.159 --> 00:09:51.679 by step. 207 00:09:52.080 --> 00:09:55.159 Man, that instantly reminds me of my high school algebra teacher. 208 00:09:55.480 --> 00:09:57.399 I don't care if you got the right answer, show 209 00:09:57.440 --> 00:09:58.960 me the steps. And 210 00:09:58.879 --> 00:10:01.360 your teacher was exactly right, because if you show the steps, 211 00:10:01.399 --> 00:10:05.039 you actually prove that you understand the logic behind the solution. 212 00:10:05.480 --> 00:10:07.919 If you just write down the final number, you might 213 00:10:07.919 --> 00:10:09.919 have just memorized it from the textbook or taken a 214 00:10:09.960 --> 00:10:14.559 lucky guess. By forcing the AI to mumble the intermediate steps, 215 00:10:14.840 --> 00:10:17.600 they forced it to break the complex problem down into 216 00:10:17.759 --> 00:10:19.919 manageable, logical 217 00:10:19.480 --> 00:10:23.000 chunks. Which brings us directly to the specific tasks they 218 00:10:23.080 --> 00:10:26.960 used to test this theory. We mentioned reversing sequences earlier. 219 00:10:27.279 --> 00:10:31.320 The paper also talks about pattern creation. But why are 220 00:10:31.440 --> 00:10:35.200 these specific tasks so important for the researchers to use? 221 00:10:36.039 --> 00:10:39.360 They seem, I don't know, almost too simple, like reversing 222 00:10:39.360 --> 00:10:42.720 a list of items. A cheap pocket calculator can do that. 223 00:10:43.559 --> 00:10:46.200 A calculator can do that because a calculator is hard- 224 00:10:46.240 --> 00:10:49.120 coded by a human software engineer to do exactly that. 225 00:10:49.559 --> 00:10:51.559 A neural network, on the other hand, has to learn 226 00:10:51.639 --> 00:10:53.840 how to do it completely from scratch just by looking 227 00:10:53.879 --> 00:10:56.440 at examples. And for a neural network, these types of 228 00:10:56.440 --> 00:11:01.000 tasks are actually brutal.
They are highly computationally demanding because they 229 00:11:01.039 --> 00:11:03.200 require what we call sequential processing. 230 00:11:03.559 --> 00:11:05.480 Sequential, meaning it involves time? 231 00:11:05.480 --> 00:11:08.559 Yes, time and order. You have to remember the beginning 232 00:11:08.559 --> 00:11:11.240 of the sentence while you are simultaneously reading the end 233 00:11:11.279 --> 00:11:13.879 of the sentence, and then you have to purposefully manipulate 234 00:11:13.960 --> 00:11:16.960 the order of those elements. This requires holding multiple distinct 235 00:11:17.039 --> 00:11:20.240 data points in your head simultaneously without them overwriting each other. 236 00:11:20.399 --> 00:11:22.799 Ah, we are back to the tie-dye problem. 237 00:11:23.159 --> 00:11:25.799 Exactly. If I add blue dye for the end of 238 00:11:25.840 --> 00:11:28.879 the sentence, it might bleed over and turn the red 239 00:11:28.960 --> 00:11:31.240 dye at the beginning of the sentence into purple. 240 00:11:31.440 --> 00:11:34.399 So the information corrupts itself just by existing in the 241 00:11:34.440 --> 00:11:35.080 same space. 242 00:11:35.320 --> 00:11:38.120 Right. And the study showed that the models equipped with 243 00:11:38.200 --> 00:11:41.840 the explicit slots and the internal mumble just blew the 244 00:11:41.879 --> 00:11:45.039 standard models out of the water. They could handle significantly 245 00:11:45.080 --> 00:11:49.840 longer sequences, much more complex patterns. And here's the really 246 00:11:49.840 --> 00:11:53.879 mind-blowing part. They could switch between different tasks without crashing. 247 00:11:54.080 --> 00:11:58.360 Multitasking. Multitasking. Now, humans are pretty famous for thinking we 248 00:11:58.440 --> 00:12:01.639 are amazing at multitasking while actually being terrible at it. 249 00:12:01.919 --> 00:12:05.320 But for AI, it's usually a complete disaster, isn't it? 250 00:12:05.320 --> 00:12:08.360 It is usually catastrophic.
In fact, there is an official 251 00:12:08.440 --> 00:12:11.000 term for it in the field. It's called catastrophic forgetting. 252 00:12:11.320 --> 00:12:13.559 That sounds incredibly dramatic. It really is. 253 00:12:13.960 --> 00:12:17.279 If you train a standard artificial intelligence to play chess 254 00:12:17.399 --> 00:12:19.559 and it gets really good, and then you try to 255 00:12:19.600 --> 00:12:22.720 teach that exact same model to play checkers, it will 256 00:12:22.759 --> 00:12:25.919 almost always completely forget how to play chess. That's totally why, 257 00:12:26.080 --> 00:12:30.080 completely overwritten, because it overwrites the mathematical weights to learn 258 00:12:30.159 --> 00:12:33.759 the new game. The tie-dye pattern essentially gets entirely re-dyed 259 00:12:33.840 --> 00:12:34.679 with new colors. 260 00:12:34.759 --> 00:12:36.600 But the OIST model didn't do that. 261 00:12:36.799 --> 00:12:40.240 It remembered both. It did, and Dr. Queißer observed that 262 00:12:40.240 --> 00:12:44.039 the mumbling was the absolute key to this capability. The 263 00:12:44.120 --> 00:12:48.000 internal speech acted as a dynamic context manager. 264 00:12:48.120 --> 00:12:50.840 Break that down for me. A context manager? 265 00:12:50.519 --> 00:12:53.279 Think of a professional chef working in a really busy kitchen. 266 00:12:53.679 --> 00:12:55.639 They are chopping onions on the cutting board, but they 267 00:12:55.679 --> 00:12:58.399 also have a delicate sauce simmering over on the stove. 268 00:12:58.639 --> 00:13:00.759 They chop, chop, chop, and they physically stop and 269 00:13:00.759 --> 00:13:03.639 say out loud to themselves, okay, check the sauce. They 270 00:13:03.639 --> 00:13:06.000 walk over, they stir the sauce. Then they say, sauce 271 00:13:06.039 --> 00:13:07.159 is good. Back to onions. 272 00:13:07.240 --> 00:13:11.240 That little phrase, back to onions, it resets their mental state. It
273 00:13:11.159 --> 00:13:15.039 resets the context. The OIST system uses its mumble to 274 00:13:15.120 --> 00:13:18.960 explicitly label which task it is currently performing. It says internally, 275 00:13:19.200 --> 00:13:22.039 I am now doing task A, and it uses the 276 00:13:22.080 --> 00:13:25.519 memory slots specifically assigned for task A. Then it mumbles, 277 00:13:25.639 --> 00:13:29.080 switching to task B, and that command clears the slots 278 00:13:29.159 --> 00:13:32.080 or moves its attention to new slots. It actively prevents 279 00:13:32.159 --> 00:13:34.480 the parameters of the tasks from bleeding into each other. 280 00:13:34.720 --> 00:13:37.759 That is wild. That essentially bridges the huge gap between 281 00:13:37.759 --> 00:13:41.080 the rigid, single-task focus of traditional AI, where you 282 00:13:41.120 --> 00:13:44.000 have one specific bot for playing chess and a totally 283 00:13:44.000 --> 00:13:48.080 different bot for chatting, and the flexible, fluid adaptability that 284 00:13:48.200 --> 00:13:48.879 human beings have. 285 00:13:49.200 --> 00:13:52.000 It is a massive step towards general-purpose intelligence, and 286 00:13:52.039 --> 00:13:55.159 it leads us directly to another incredible benefit outlined in 287 00:13:55.200 --> 00:13:57.240 the research, which is data efficiency. 288 00:13:57.399 --> 00:13:59.320 Yes, this is a huge topic right now in the 289 00:13:59.360 --> 00:14:01.919 tech world. I constantly keep reading articles saying that 290 00:14:01.960 --> 00:14:04.159 we are basically running out of Internet, that the big 291 00:14:04.200 --> 00:14:07.320 tech companies have scraped every single book, every news article, 292 00:14:07.360 --> 00:14:10.200 every Reddit post, and there is literally nothing left of 293 00:14:10.279 --> 00:14:12.679 high quality to train the next generation of models on. 294 00:14:13.039 --> 00:14:16.279 That is a very real, very pressing problem.
The current 295 00:14:16.360 --> 00:14:20.240 dominant paradigm in AI development is essentially scale is all 296 00:14:20.279 --> 00:14:23.120 you need. Just make the model bigger, throw more processing 297 00:14:23.200 --> 00:14:25.519 power at it, and give it more data. But we 298 00:14:25.559 --> 00:14:28.039 are rapidly hitting the hard ceiling of what is actually 299 00:14:28.080 --> 00:14:32.759 available out there. The OIST research suggests a completely viable way 300 00:14:32.720 --> 00:14:35.399 out of that trap: sparse data utilization. 301 00:14:35.639 --> 00:14:39.039 Right. Because the OIST system is learning how to think, 302 00:14:39.360 --> 00:14:43.440 meaning the general underlying rules, rather than just what to think, 303 00:14:43.480 --> 00:14:47.440 which is just memorizing specific answers, it needs significantly less 304 00:14:47.519 --> 00:14:49.639 data to achieve the same or better performance. 305 00:14:49.919 --> 00:14:52.720 Going back to your algebra analogy earlier, if I teach 306 00:14:52.759 --> 00:14:55.240 a student the actual rules of algebra, I really only 307 00:14:55.240 --> 00:14:57.399 need to show them maybe ten practice problems, and they 308 00:14:57.399 --> 00:14:59.600 get the concept. They can apply it anywhere. But if 309 00:14:59.639 --> 00:15:01.480 I try to teach them algebra purely by showing 310 00:15:01.519 --> 00:15:03.840 them every single possible math problem in existence so they 311 00:15:03.840 --> 00:15:07.440 can memorize the answers, I would literally need infinite data. 312 00:15:07.679 --> 00:15:10.840 That is a perfect analogy. The OIST model is learning 313 00:15:10.879 --> 00:15:14.320 the rules of the game. Dr. Queißer explicitly calls it 314 00:15:14.480 --> 00:15:19.159 a complementary, lightweight alternative to these massive, heavy data models. 315 00:15:19.519 --> 00:15:21.080 And you have to imagine what that means for the 316 00:15:21.120 --> 00:15:24.200 real world.
Think about the environment, think about the energy costs. 317 00:15:24.600 --> 00:15:27.759 We wouldn't need to build these massive city-size data 318 00:15:27.759 --> 00:15:31.080 centers that consume as much electricity as a small country 319 00:15:31.279 --> 00:15:33.519 just to train a smart AI, and we 320 00:15:33.519 --> 00:15:36.039 wouldn't need a supercomputer to actually run the AI once 321 00:15:36.080 --> 00:15:38.279 it's trained. Yeah, which brings us to the part of 322 00:15:38.279 --> 00:15:40.600 the paper that got me really, truly excited, and that 323 00:15:40.759 --> 00:15:41.960 is robotics. 324 00:15:42.159 --> 00:15:45.440 Yes, the real-world application of all this theory. 325 00:15:45.159 --> 00:15:48.559 Because right now, let's be honest, robots are kind of dumb. 326 00:15:48.679 --> 00:15:50.840 They work perfectly in a car factory where everything is 327 00:15:50.879 --> 00:15:53.519 literally bolted down to the floor, the lighting never changes, 328 00:15:53.799 --> 00:15:55.960 and the exact same part comes down the assembly line 329 00:15:55.960 --> 00:15:58.320 every three seconds. But you put a state-of-the- 330 00:15:58.399 --> 00:16:01.360 art robot in my messy living room and it is total chaos. 331 00:16:01.399 --> 00:16:03.159 It gets stuck on a rug. Exactly. 332 00:16:03.519 --> 00:16:06.960 The paper explicitly talks about the challenge of transitioning AI 333 00:16:07.080 --> 00:16:10.480 from controlled environments to dynamic environments. 334 00:16:10.879 --> 00:16:13.679 Let's really take this out of the laboratory, because the 335 00:16:13.679 --> 00:16:18.240 paper specifically mentions agricultural robots as a use case. Let's 336 00:16:18.320 --> 00:16:22.320 visualize that. You have got a robotic tractor, or let's 337 00:16:22.320 --> 00:16:24.919 call it a weed bot, out in a massive cornfield. 338 00:16:25.080 --> 00:16:28.000 Okay, so you have this robot.
Its sole job is 339 00:16:28.039 --> 00:16:30.759 to drive down the row, visually identify a weed and 340 00:16:30.799 --> 00:16:34.000 pull it out, but obviously leave the valuable corn alone. Now, 341 00:16:34.039 --> 00:16:36.759 in a sterile lab setting, that is incredibly easy. The 342 00:16:36.840 --> 00:16:40.279 lighting is perfectly calibrated, the corn is bright green, the 343 00:16:40.279 --> 00:16:43.720 weed has a distinct leaf shape. The cameras process it instantly. 344 00:16:43.840 --> 00:16:45.799 But out in the actual real world... In the 345 00:16:45.799 --> 00:16:48.960 real world, a dark cloud passes over the sun. The 346 00:16:49.120 --> 00:16:52.600 ambient light drops by fifty percent in two seconds. A 347 00:16:52.639 --> 00:16:54.879 gust of wind blows the corn stalk, so it is 348 00:16:54.879 --> 00:16:57.720 suddenly leaning over at a forty-five-degree angle. Maybe 349 00:16:57.759 --> 00:16:59.840 there's a splash of mud that gets splattered right on 350 00:17:00.120 --> 00:17:01.279 the robot's camera lens. 351 00:17:01.559 --> 00:17:05.119 To a standard vision-based AI, that visual input just 352 00:17:05.240 --> 00:17:08.640 changed completely. It completely freaks out. It thinks the leaning 353 00:17:08.680 --> 00:17:12.279 corn is an entirely new, unrecognized object. It thinks the shadow 354 00:17:12.319 --> 00:17:14.039 from the cloud is a deep hole in the 355 00:17:14.000 --> 00:17:17.960 ground. Exactly, it throws an error and crashes, or worse, 356 00:17:18.000 --> 00:17:20.599 it just happily pulls up all the expensive corn. But 357 00:17:20.720 --> 00:17:23.799 a robot equipped with this inner voice architecture and working 358 00:17:23.839 --> 00:17:27.079 memory can actually self-correct in real time. It can 359 00:17:27.119 --> 00:17:30.160 literally talk itself through the sensory confusion. So it
360 00:17:30.160 --> 00:17:33.319 is internally mumbling, okay, the light just got a lot darker, 361 00:17:33.720 --> 00:17:36.160 but my sensors say I didn't actually move forward, so 362 00:17:36.200 --> 00:17:39.119 the object directly in front of me is highly likely 363 00:17:39.359 --> 00:17:41.279 to still be the corn stalk I was just looking at. 364 00:17:41.519 --> 00:17:45.160 Yes, exactly. It maintains a continuous state. It explicitly says 365 00:17:45.160 --> 00:17:48.559 to itself: current state is weeding row four. Event is 366 00:17:48.599 --> 00:17:52.599 sudden light reduction. Action is continue current task. It bridges 367 00:17:52.640 --> 00:17:54.960 the sudden gap in its sensory data by relying on 368 00:17:55.039 --> 00:17:58.680 a logical internal narrative. It creates a cognitive buffer against 369 00:17:58.680 --> 00:18:01.039 the unpredictability and chaos of the physical world. 370 00:18:01.319 --> 00:18:04.440 That is just remarkably human. I mean, that is exactly 371 00:18:04.440 --> 00:18:06.400 what I do when I am driving on the highway 372 00:18:06.400 --> 00:18:09.559 in a sudden rainstorm. I am talking to myself, saying, okay, 373 00:18:09.559 --> 00:18:12.559 I can't see much, just slow down, keep the wheel straight, 374 00:18:12.920 --> 00:18:16.039 look for the tail lights ahead. I am not completely 375 00:18:16.079 --> 00:18:18.839 relearning how to drive a car every single second. I 376 00:18:18.880 --> 00:18:21.400 am actively talking myself through the noise and the fear. 377 00:18:21.559 --> 00:18:24.000 And that is exactly why this research is so huge 378 00:18:24.000 --> 00:18:27.039 for the field of robotics. You simply cannot upload the 379 00:18:27.200 --> 00:18:29.920 entire Internet into the memory banks of a farm tractor. 380 00:18:30.440 --> 00:18:33.839 It is impossible.
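The "current state, event, action" narration from the weeding example a moment ago could be mocked up as below. This is a toy sketch only: `RobotState`, `narrate`, and the single hand-written rule are hypothetical and are not taken from the paper, which describes learned behavior rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass
class RobotState:
    task: str    # what the robot believes it is doing, e.g. "weeding row four"
    moved: bool  # whether its motion sensors reported movement since the last frame


def narrate(state: RobotState, event: str) -> str:
    """Return one line of inner speech that settles the next action."""
    # Hypothetical rule: if the light dropped but the robot has not moved,
    # the scene in front of it is probably unchanged, so keep going.
    if event == "sudden light reduction" and not state.moved:
        action = "continue current task"
    else:
        action = "pause and re-scan"
    return f"Current state is {state.task}. Event is {event}. Action is {action}."


print(narrate(RobotState(task="weeding row four", moved=False), "sudden light reduction"))
```

The point of the sketch is that the decision rests on the remembered task and motion state, not on the camera image alone.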
You need a centralized brain that is small, 381 00:18:34.119 --> 00:18:37.559 highly efficient, and capable of actively reasoning its way out 382 00:18:37.599 --> 00:18:40.680 of a novel problem, rather than just cross-referencing a 383 00:18:40.720 --> 00:18:44.519 massive database to remember a pre-programmed solution. This is 384 00:18:44.559 --> 00:18:48.559 the fundamental difference between simple automation and true autonomy. 385 00:18:48.759 --> 00:18:50.960 Break that distinction down for me a bit more, automation 386 00:18:51.119 --> 00:18:54.319 versus autonomy. People use those words interchangeably a lot. 387 00:18:54.480 --> 00:18:57.799 They do, but they are very different. Automation is like 388 00:18:57.839 --> 00:19:00.720 a train on a track. It is incredibly powerful, it 389 00:19:00.759 --> 00:19:04.079 is fast, it is efficient, but if a cow suddenly 390 00:19:04.160 --> 00:19:06.480 wanders onto the track, the train doesn't know how 391 00:19:06.480 --> 00:19:09.759 to evaluate the situation. It just hits the brakes and stops, 392 00:19:10.000 --> 00:19:13.680 or it crashes. Current robots are still largely just automated. 393 00:19:13.839 --> 00:19:17.640 They rigidly follow a script. Go forward ten feet, turn left 394 00:19:17.720 --> 00:19:19.400 ninety degrees, stop. But 395 00:19:19.319 --> 00:19:21.880 the real world does not have tracks. Exactly. 396 00:19:22.319 --> 00:19:24.759 Autonomy, on the other hand, is like driving a car. 397 00:19:25.039 --> 00:19:27.400 You can see the cow, evaluate the shoulder of the road, 398 00:19:27.440 --> 00:19:29.799 and steer around it. You can decide to go off 399 00:19:29.880 --> 00:19:32.559 road if you have to. Autonomy means you are writing 400 00:19:32.559 --> 00:19:35.960 the behavioral script in real time as the situation unfolds. 401 00:19:36.400 --> 00:19:39.160