WEBVTT

1
00:00:01.199 --> 00:00:06.200
<v Speaker 1>Welcome to the Sentient Code, where intelligence is engineered, autonomy

2
00:00:06.280 --> 00:00:10.439
<v Speaker 1>is emerging, and the line between human and machine grows thinner.

3
00:00:10.800 --> 00:00:15.359
<v Speaker 1>Each episode, we decode the algorithms, explore the robotics, and

4
00:00:15.439 --> 00:00:19.000
<v Speaker 1>examine the ideas shaping the future of artificial minds.

5
00:00:24.920 --> 00:00:27.199
<v Speaker 2>You know that voice in your head, the one you're

6
00:00:27.280 --> 00:00:30.440
<v Speaker 2>using right now to process what I'm saying. Maybe it

7
00:00:30.480 --> 00:00:32.240
<v Speaker 2>is saying, Okay, where is he going with this? Or

8
00:00:32.880 --> 00:00:34.880
<v Speaker 2>maybe it's just reminding you that you forgot to switch

9
00:00:34.920 --> 00:00:35.799
<v Speaker 2>the laundry, right?

10
00:00:35.719 --> 00:00:38.840
<v Speaker 3>The internal monologue, the narrator of the documentary that is

11
00:00:38.840 --> 00:00:39.840
<v Speaker 3>your life. Exactly.

12
00:00:40.039 --> 00:00:43.079
<v Speaker 2>We have always thought of that inner voice as just

13
00:00:45.520 --> 00:00:49.840
<v Speaker 2>a human quirk, maybe even a byproduct of consciousness, something

14
00:00:49.840 --> 00:00:53.520
<v Speaker 2>that just happens because we have language. But what if

15
00:00:53.520 --> 00:00:57.000
<v Speaker 2>it is not just noise? What if that little voice

16
00:00:57.200 --> 00:00:59.600
<v Speaker 2>is actually the engine of intelligence?

17
00:01:00.200 --> 00:01:02.679
<v Speaker 3>That is the billion dollar question. And if you ask

18
00:01:02.759 --> 00:01:05.920
<v Speaker 3>the researchers at the Okinawa Institute of Science and Technology

19
00:01:06.079 --> 00:01:09.319
<v Speaker 3>or OIST, they will tell you that the reason AI

20
00:01:09.359 --> 00:01:11.560
<v Speaker 3>has been hitting a wall lately is precisely because it

21
00:01:11.599 --> 00:01:13.959
<v Speaker 3>doesn't have that voice. It doesn't mumble to itself.

22
00:01:14.120 --> 00:01:16.079
<v Speaker 2>Mumbling? That was the technical term they use?

23
00:01:16.159 --> 00:01:18.280
<v Speaker 3>Well, they call it self-directed internal speech. But yeah,

24
00:01:18.319 --> 00:01:20.280
<v Speaker 3>effectively it is mumbling, and we are looking at a

25
00:01:20.319 --> 00:01:23.840
<v Speaker 3>really fascinating study today. This was published just this January

26
00:01:23.879 --> 00:01:27.400
<v Speaker 3>twenty eighth, twenty twenty six, in the journal Neural Computation.

27
00:01:27.879 --> 00:01:30.840
<v Speaker 3>It is led by first author Dr. Jeffrey Queißer of

28
00:01:30.879 --> 00:01:35.200
<v Speaker 3>the Cognitive Neurorobotics Research Unit, and it proposes something that

29
00:01:35.319 --> 00:01:37.680
<v Speaker 3>frankly sounds a little sci-fi.

30
00:01:38.000 --> 00:01:40.319
<v Speaker 2>Yeah. I read through the material for this deep dive

31
00:01:40.359 --> 00:01:43.040
<v Speaker 2>and my immediate thought was, great, now the robots are

32
00:01:43.079 --> 00:01:44.560
<v Speaker 2>going to be talking to themselves on

33
00:01:44.519 --> 00:01:46.959
<v Speaker 3>the bus. Exactly, just muttering in the corner.

34
00:01:47.680 --> 00:01:50.959
<v Speaker 2>But the implications here are massive, right? We aren't just

35
00:01:51.000 --> 00:01:53.319
<v Speaker 2>talking about a chatbot that is a little bit wittier

36
00:01:53.760 --> 00:01:55.760
<v Speaker 2>or more conversational. No, not at all.

37
00:01:56.040 --> 00:01:59.560
<v Speaker 3>We are talking about a fundamental restructuring of how machines learn.

38
00:02:00.120 --> 00:02:03.439
<v Speaker 3>We are moving away from the whole big data approach.

39
00:02:03.040 --> 00:02:05.599
<v Speaker 2>Where you just feed a computer the entire Internet.

40
00:02:05.359 --> 00:02:08.400
<v Speaker 3>Right, just scraping everything. We are moving away from that

41
00:02:08.520 --> 00:02:11.240
<v Speaker 3>and towards something much more biological, something that learns a

42
00:02:11.240 --> 00:02:13.000
<v Speaker 3>lot more like a human child does.

43
00:02:13.319 --> 00:02:15.639
<v Speaker 2>So the mission for our deep dive today is to

44
00:02:15.719 --> 00:02:19.520
<v Speaker 2>really figure out why giving an artificial intelligence a mumble

45
00:02:19.560 --> 00:02:23.319
<v Speaker 2>and a scratch pad might actually be the key to

46
00:02:23.360 --> 00:02:26.000
<v Speaker 2>the next generation of robotics. Because usually when we hear

47
00:02:26.039 --> 00:02:29.759
<v Speaker 2>about AI upgrades, it's always, oh, we need more chips,

48
00:02:29.840 --> 00:02:31.960
<v Speaker 2>or we need massive new data centers.

49
00:02:31.840 --> 00:02:34.360
<v Speaker 3>More compute, more power, always.

50
00:02:34.240 --> 00:02:37.000
<v Speaker 2>Right. But this is different. This is about the architecture itself.

51
00:02:37.199 --> 00:02:40.199
<v Speaker 3>It is about architecture, but honestly, it's also about psychology

52
00:02:40.360 --> 00:02:43.319
<v Speaker 3>because to understand this machine architecture, we actually have to

53
00:02:43.360 --> 00:02:46.520
<v Speaker 3>start with a human brain. We have to ask why

54
00:02:46.560 --> 00:02:47.919
<v Speaker 3>do you talk to yourself?

55
00:02:50.039 --> 00:02:50.919
<v Speaker 1>Usually to keep

56
00:02:50.800 --> 00:02:53.840
<v Speaker 2>from panicking, to be honest. Or if I'm cooking, if

57
00:02:53.879 --> 00:02:57.639
<v Speaker 2>I am making a really complex recipe, I am definitely muttering. Okay,

58
00:02:57.680 --> 00:03:00.120
<v Speaker 2>onions are done. Now I need the garlic. Where to

59
00:03:00.159 --> 00:03:00.680
<v Speaker 2>put the garlic?

60
00:03:00.840 --> 00:03:04.120
<v Speaker 3>Exactly, you are using self talk as an executive function.

61
00:03:04.520 --> 00:03:07.639
<v Speaker 3>You aren't just making noise into the void. You are

62
00:03:07.800 --> 00:03:13.400
<v Speaker 3>actively organizing disparate ideas. You are weighing conflicting choices. You

63
00:03:13.479 --> 00:03:16.240
<v Speaker 3>are processing sensory data in real time. So it has

64
00:03:16.319 --> 00:03:19.319
<v Speaker 3>a purpose, a very specific purpose. In psychology, we call

65
00:03:19.319 --> 00:03:20.240
<v Speaker 3>this metacognition.

66
00:03:20.520 --> 00:03:21.759
<v Speaker 2>Thinking about thinking, right.

67
00:03:21.800 --> 00:03:24.719
<v Speaker 3>It allows you to objectify your own thought process. It

68
00:03:24.759 --> 00:03:27.560
<v Speaker 3>creates a feedback loop where the output of one thought,

69
00:03:27.719 --> 00:03:30.879
<v Speaker 3>like the onions are done, becomes the direct input for

70
00:03:30.919 --> 00:03:32.879
<v Speaker 3>the next thought, which is get the garlic.

71
00:03:32.960 --> 00:03:34.360
<v Speaker 2>So it's essentially a chain of logic.

72
00:03:34.680 --> 00:03:37.960
<v Speaker 3>It is a chain. And Dr. Queißer's team is saying, look,

73
00:03:38.039 --> 00:03:40.800
<v Speaker 3>this biological habit isn't a glitch in the human system.

74
00:03:41.120 --> 00:03:44.240
<v Speaker 3>It is a highly functional mechanism. It is literally how

75
00:03:44.280 --> 00:03:47.240
<v Speaker 3>we organize our minds. And if we want AI to

76
00:03:47.319 --> 00:03:50.319
<v Speaker 3>navigate ambiguity the way humans do, we need to import

77
00:03:50.400 --> 00:03:52.960
<v Speaker 3>this biology directly into the code.

78
00:03:53.240 --> 00:03:56.000
<v Speaker 2>But it is not just the voice, right? There is

79
00:03:56.039 --> 00:03:58.680
<v Speaker 2>this other piece of the puzzle that the paper emphasizes

80
00:03:58.960 --> 00:04:01.520
<v Speaker 2>called working memory. And I really want to pause on

81
00:04:01.560 --> 00:04:07.560
<v Speaker 2>this because in the study they talk a lot about slots. Yes, slots. Slots, right,

82
00:04:07.639 --> 00:04:09.560
<v Speaker 2>and they make a really big deal about how this

83
00:04:09.599 --> 00:04:13.080
<v Speaker 2>is entirely different from how a normal neural network remembers things.

84
00:04:13.520 --> 00:04:14.960
<v Speaker 2>So help me out here, because I think a lot

85
00:04:14.960 --> 00:04:17.639
<v Speaker 2>of people would assume: doesn't the standard chatbot already have

86
00:04:17.680 --> 00:04:20.519
<v Speaker 2>a memory? It remembers what I typed three prompts ago.

87
00:04:20.720 --> 00:04:23.279
<v Speaker 3>It does, but it is a completely different kind of memory.

88
00:04:23.519 --> 00:04:25.759
<v Speaker 3>Think of a standard neural network like the ones running

89
00:04:25.759 --> 00:04:29.000
<v Speaker 3>most current large language models as a giant piece of

90
00:04:29.120 --> 00:04:30.079
<v Speaker 3>tie-dyed fabric.

91
00:04:30.120 --> 00:04:33.000
<v Speaker 2>Tie-dye. Okay, I am picturing a vintage T-shirt

92
00:04:33.160 --> 00:04:34.000
<v Speaker 2>from the sixties.

93
00:04:34.199 --> 00:04:37.839
<v Speaker 3>Perfect. When a standard neural network learns something new, the

94
00:04:37.959 --> 00:04:41.879
<v Speaker 3>dye spreads out everywhere. The information is distributed across all

95
00:04:41.959 --> 00:04:45.759
<v Speaker 3>the connections, all the mathematical weights simultaneously. It is a

96
00:04:45.839 --> 00:04:47.680
<v Speaker 3>holographic kind of storage.

97
00:04:47.759 --> 00:04:48.199
<v Speaker 2>I see.

98
00:04:48.319 --> 00:04:50.920
<v Speaker 3>So if you want to change one specific fact, or

99
00:04:50.959 --> 00:04:53.240
<v Speaker 3>if you just need to hold one specific number in

100
00:04:53.279 --> 00:04:55.439
<v Speaker 3>your head for a second, it is really hard to

101
00:04:55.480 --> 00:04:58.120
<v Speaker 3>do that without messing up the pattern of the whole

102
00:04:57.879 --> 00:05:01.399
<v Speaker 2>shirt. Because it's messy. You can't just pull one specific thread

103
00:05:01.399 --> 00:05:04.360
<v Speaker 2>out without unraveling the entire image or changing the surrounding

104
00:05:04.399 --> 00:05:05.519
<v Speaker 2>colors. Exactly.

105
00:05:06.040 --> 00:05:09.839
<v Speaker 3>That architecture is fantastic for recognizing broad patterns, but it

106
00:05:09.879 --> 00:05:14.040
<v Speaker 3>is actually really bad for holding specifics. Now, what Dr.

107
00:05:14.120 --> 00:05:17.800
<v Speaker 3>Queißer and his team did was introduce these explicit slots.

108
00:05:18.279 --> 00:05:20.639
<v Speaker 3>Imagine that on top of that tie-dye shirt, you sew

109
00:05:20.720 --> 00:05:23.160
<v Speaker 3>on a few clear plastic pockets.

110
00:05:22.759 --> 00:05:25.480
<v Speaker 2>Okay, like a plastic badge holder or a pocket protector.

111
00:05:25.639 --> 00:05:28.839
<v Speaker 3>Right, These are distinct protected containers. You can write a

112
00:05:28.920 --> 00:05:30.600
<v Speaker 3>number on a piece of paper, put it in slot

113
00:05:30.600 --> 00:05:33.279
<v Speaker 3>A, and it stays perfectly safe. It doesn't bleed into

114
00:05:33.279 --> 00:05:35.480
<v Speaker 3>the tie-dye fabric at all. It functions as a true

115
00:05:35.560 --> 00:05:36.680
<v Speaker 3>variable. I see.

116
00:05:36.720 --> 00:05:39.399
<v Speaker 2>So the AI can say, okay, I am currently holding

117
00:05:39.439 --> 00:05:42.079
<v Speaker 2>the number seven in my left hand, and it completely

118
00:05:42.079 --> 00:05:44.240
<v Speaker 2>doesn't matter what the rest of the network is doing.

119
00:05:44.480 --> 00:05:46.920
<v Speaker 2>That seven is safe and isolated. Precisely.

120
00:05:47.160 --> 00:05:50.000
<v Speaker 3>And this is absolutely crucial for formal logic. If I

121
00:05:50.040 --> 00:05:52.519
<v Speaker 3>tell you to reverse the sequence seven, two, nine, you

122
00:05:52.560 --> 00:05:54.560
<v Speaker 3>need to hold those three numbers in your head, in

123
00:05:54.560 --> 00:05:57.920
<v Speaker 3>your slots, and shuffle them around. A standard AI struggles

124
00:05:57.959 --> 00:06:01.040
<v Speaker 3>with this because it tries to memorize the concept of seven, two,

125
00:06:01.160 --> 00:06:04.560
<v Speaker 3>nine based on how often it has seen those specific

126
00:06:04.680 --> 00:06:06.279
<v Speaker 3>numbers grouped together in the past.

127
00:06:06.600 --> 00:06:09.199
<v Speaker 2>So it's essentially trying to vibe its way, vibe

128
00:06:09.000 --> 00:06:12.240
<v Speaker 3>its way through a math problem? Yes. That is hilarious,

129
00:06:12.480 --> 00:06:15.319
<v Speaker 3>but it's true. It is vibing based purely on statistics.

130
00:06:15.360 --> 00:06:17.519
<v Speaker 3>It looks at the data and says, well, usually seven

131
00:06:17.600 --> 00:06:20.040
<v Speaker 3>is followed by eight, but here it's two, and it

132
00:06:20.160 --> 00:06:22.959
<v Speaker 3>just gets confused. But the OIST model is different. It

133
00:06:23.000 --> 00:06:24.959
<v Speaker 3>puts seven in slot one, two in slot two, and

134
00:06:25.079 --> 00:06:27.839
<v Speaker 3>nine in slot three, and then that is when the

135
00:06:27.879 --> 00:06:30.040
<v Speaker 3>inner voice, the mumbling, kicks in, and

136
00:06:30.000 --> 00:06:32.759
<v Speaker 2>the mumble says, swap slot one and slot three.

137
00:06:32.920 --> 00:06:37.079
<v Speaker 3>Bingo. It generates a symbolic command directed at its own

138
00:06:37.120 --> 00:06:41.319
<v Speaker 3>memory system: swap one and three. It absolutely does not

139
00:06:41.519 --> 00:06:43.519
<v Speaker 3>care that the numbers are seven and nine. They could

140
00:06:43.519 --> 00:06:46.120
<v Speaker 3>be an apple and an orange. They could be completely

141
00:06:46.160 --> 00:06:49.800
<v Speaker 3>made up words. The logic holds perfectly because the slots

142
00:06:49.800 --> 00:06:52.319
<v Speaker 3>are entirely separate from the content inside them.
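
NOTE
Editor's sketch, not taken from the paper: a minimal Python illustration of the
content-agnostic slot idea the speakers describe. The SlotMemory class, the
reverse_with_swaps helper, and the zero-based slot numbers are all assumptions
made for illustration. The OIST model is a trained neural network, not this code.
```python
class SlotMemory:
    """Toy working memory: a fixed number of isolated slots (hypothetical)."""
    def __init__(self, size):
        self.slots = [None] * size
    def write(self, index, item):
        self.slots[index] = item
    def swap(self, i, j):
        # Works on slot positions only; it never inspects the contents.
        self.slots[i], self.slots[j] = self.slots[j], self.slots[i]
def reverse_with_swaps(items):
    """Reverse a sequence using only swap commands addressed to slots."""
    memory = SlotMemory(len(items))
    for index, item in enumerate(items):
        memory.write(index, item)
    # The command list depends on the sequence length, not on the items.
    commands = [("swap", i, len(items) - 1 - i) for i in range(len(items) // 2)]
    for _, i, j in commands:
        memory.swap(i, j)
    return memory.slots, commands
print(reverse_with_swaps([7, 2, 9]))
# prints ([9, 2, 7], [('swap', 0, 2)])
print(reverse_with_swaps(["apple", "blicket", "orange"]))
# prints (['orange', 'blicket', 'apple'], [('swap', 0, 2)])
```
The same command list reverses numbers, fruit, or made-up words, because each
command names a slot position rather than the item stored there.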

143
00:06:52.439 --> 00:06:55.920
<v Speaker 2>And this sounds exactly like what computer scientists call generalization.

144
00:06:56.480 --> 00:06:59.759
<v Speaker 3>That is the magic word here, generalization.

145
00:06:59.160 --> 00:07:02.040
<v Speaker 2>Because in the paper they use this incredibly dense phrase.

146
00:07:02.079 --> 00:07:06.399
<v Speaker 2>They call it content-agnostic information processing. It is a mouthful,

147
00:07:06.560 --> 00:07:08.360
<v Speaker 2>it really is, but it seems to be the core

148
00:07:08.399 --> 00:07:09.240
<v Speaker 2>of why this works.

149
00:07:09.439 --> 00:07:11.519
<v Speaker 3>It is a mouthful, but it is the holy grail

150
00:07:11.600 --> 00:07:16.000
<v Speaker 3>of artificial intelligence research. Content agnostic means the AI understands

151
00:07:16.000 --> 00:07:19.000
<v Speaker 3>the underlying rule, regardless of the specific data it is

152
00:07:19.000 --> 00:07:22.120
<v Speaker 3>looking at. Think about basic algebra. If you know that

153
00:07:22.160 --> 00:07:24.959
<v Speaker 3>A plus B equals C, you can solve that equation

154
00:07:25.000 --> 00:07:27.959
<v Speaker 3>whether A is five or A is five million, or

155
00:07:28.000 --> 00:07:29.560
<v Speaker 3>A is a banana.

156
00:07:29.120 --> 00:07:31.639
<v Speaker 2>Right, because I know the relationship between the parts, not

157
00:07:31.720 --> 00:07:33.600
<v Speaker 2>just the parts themselves. Exactly.

158
00:07:33.759 --> 00:07:38.000
<v Speaker 3>Traditional AI is often just memorizing millions of examples. If it

159
00:07:38.040 --> 00:07:41.560
<v Speaker 3>has seen the sequence one, two, three reversed as three, two,

160
00:07:42.120 --> 00:07:44.399
<v Speaker 3>one a million times in its training data, it can

161
00:07:44.480 --> 00:07:47.600
<v Speaker 3>do it easily. But if you give it X, Y, Z, it

162
00:07:47.720 --> 00:07:50.800
<v Speaker 3>might fail simply because it hasn't seen those specific letters

163
00:07:50.800 --> 00:07:51.920
<v Speaker 3>in that specific

164
00:07:51.680 --> 00:07:53.399
<v Speaker 2>order before. Which seems so brittle.

165
00:07:53.600 --> 00:07:57.319
<v Speaker 3>It is extremely brittle. But the OIST researchers found that

166
00:07:57.360 --> 00:08:00.319
<v Speaker 3>their model, the one equipped with the memory slots and

167
00:08:00.360 --> 00:08:02.959
<v Speaker 3>the internal mumbling, could look at a sequence it had

168
00:08:03.000 --> 00:08:06.160
<v Speaker 3>literally never seen before in its life and apply the

169
00:08:06.199 --> 00:08:08.560
<v Speaker 3>reverse rule perfectly on the first try.

170
00:08:08.519 --> 00:08:11.079
<v Speaker 2>Because it wasn't looking at the letters themselves, it was

171
00:08:11.120 --> 00:08:14.120
<v Speaker 2>looking at the containers. Take what is in slot one

172
00:08:14.240 --> 00:08:16.040
<v Speaker 2>and move it to slot three. Exactly.

173
00:08:16.079 --> 00:08:18.879
<v Speaker 3>It completely separates the algorithm from the data, and that

174
00:08:18.959 --> 00:08:21.519
<v Speaker 3>is something humans do naturally all day long, but neural

175
00:08:21.560 --> 00:08:23.879
<v Speaker 3>networks have historically been terrible at it.

176
00:08:23.959 --> 00:08:27.079
<v Speaker 2>Okay, so let's dig into the actual mumbling mechanism itself,

177
00:08:27.079 --> 00:08:29.399
<v Speaker 2>because I am trying to visualize this. Yeah, how does

178
00:08:29.399 --> 00:08:32.200
<v Speaker 2>a computer actually mumble? I mean, is it generating a

179
00:08:32.200 --> 00:08:34.039
<v Speaker 2>tiny sound file? Is there a microphone involved?

180
00:08:34.159 --> 00:08:37.559
<v Speaker 3>No, no audio is being generated. It is generating tokens.

181
00:08:38.039 --> 00:08:41.519
<v Speaker 3>In AI terminology, a token is just a fundamental unit

182
00:08:41.559 --> 00:08:44.080
<v Speaker 3>of information, like a word or a piece of a word.

183
00:08:44.559 --> 00:08:47.000
<v Speaker 3>In a normal chatbot that you might use online, the

184
00:08:47.039 --> 00:08:50.120
<v Speaker 3>tokens it generates come out immediately as text on your screen.

185
00:08:50.919 --> 00:08:55.480
<v Speaker 3>But in this OIST system, the researchers created a recurrent loop.

186
00:08:55.360 --> 00:08:57.240
<v Speaker 2>A loop, so it feeds back on itself.

187
00:08:57.320 --> 00:08:59.720
<v Speaker 3>Right, the system generates a token. Let's say it generates

188
00:08:59.759 --> 00:09:01.639
<v Speaker 3>the token for the word swap, but instead of showing

189
00:09:01.639 --> 00:09:04.679
<v Speaker 3>that word to the user, it feeds that token directly

190
00:09:04.759 --> 00:09:07.600
<v Speaker 3>back into its own input layer for the very next

191
00:09:07.759 --> 00:09:09.879
<v Speaker 3>millisecond of processing.
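
NOTE
Editor's sketch, not taken from the paper: a toy Python loop showing what
feeding an inner token back in as the next input could look like. The step and
run functions and the "swap i j" token format are hypothetical. The hand-written
rule in step stands in for what would be a trained recurrent network.
```python
def step(slots, inner_token):
    """Consume the previous inner token and emit the next one (hypothetical rule)."""
    if inner_token == "done":
        return "done"
    _, i, j = inner_token.split()
    i, j = int(i), int(j)
    if j > i:
        slots[i], slots[j] = slots[j], slots[i]
    # Move one position inward from both ends, or stop when the ends meet.
    if j - 1 > i + 1:
        return f"swap {i + 1} {j - 1}"
    return "done"
def run(items):
    """Reverse items by looping inner tokens; only the slots are returned as output."""
    slots = list(items)
    inner_token = f"swap 0 {len(slots) - 1}"
    trace = [inner_token]
    while inner_token != "done":
        # The token just produced becomes the input of the next step.
        inner_token = step(slots, inner_token)
        trace.append(inner_token)
    return slots, trace
print(run(list("abcde")))
# prints (['e', 'd', 'c', 'b', 'a'], ['swap 0 4', 'swap 1 3', 'done'])
```
The trace plays the role of the mumble: it drives each next step but is never
part of the answer shown to the user.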

192
00:09:09.440 --> 00:09:11.399
<v Speaker 2>So it is whispering back into its own ear.

193
00:09:11.600 --> 00:09:14.279
<v Speaker 3>It is a quiet mumble. The paper describes it as

194
00:09:14.320 --> 00:09:17.279
<v Speaker 3>the low level generation of tokens. It acts as an

195
00:09:17.320 --> 00:09:21.840
<v Speaker 3>intermediate computational step. And the researchers did something very specific

196
00:09:21.840 --> 00:09:25.240
<v Speaker 3>here to make this happen. They actively encouraged the system

197
00:09:25.320 --> 00:09:26.519
<v Speaker 3>to do this during training.

198
00:09:26.919 --> 00:09:29.480
<v Speaker 2>Encouraged like they gave it a digital cookie.

199
00:09:29.240 --> 00:09:32.080
<v Speaker 3>Sort of, yeah. In machine learning we use things called

200
00:09:32.200 --> 00:09:36.039
<v Speaker 3>loss functions and targets to guide behavior. They essentially set

201
00:09:36.080 --> 00:09:39.000
<v Speaker 3>a strict target where the system was required to produce

202
00:09:39.039 --> 00:09:42.240
<v Speaker 3>a certain amount of internal speech while it was attempting

203
00:09:42.240 --> 00:09:45.320
<v Speaker 3>to solve the problem. They basically said to the AI,

204
00:09:45.799 --> 00:09:48.600
<v Speaker 3>you cannot just guess the final answer. You have to

205
00:09:48.639 --> 00:09:51.159
<v Speaker 3>show your work. You have to talk it through step

206
00:09:51.159 --> 00:09:51.679
<v Speaker 3>by step.
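
NOTE
Editor's sketch, not taken from the paper: one way a training objective could
reward a target amount of inner speech. The episode does not give the actual
loss, so the training_loss function, the squared penalty, and the weight value
are all assumptions chosen only to make the idea concrete.
```python
def training_loss(task_error, inner_speech_steps, target_steps, weight=0.1):
    """Task error plus a penalty for missing the inner-speech target (assumed form)."""
    # The penalty is zero only when the model produces exactly the target amount.
    speech_penalty = (inner_speech_steps - target_steps) ** 2
    return task_error + weight * speech_penalty
# Same task error, but the silent guesser is penalized for skipping its steps.
print(training_loss(0.5, inner_speech_steps=4, target_steps=4))
print(training_loss(0.5, inner_speech_steps=0, target_steps=4))
```
Under this assumed form, a model that jumps straight to the answer scores worse
than one that talks through the expected number of steps.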

207
00:09:52.080 --> 00:09:55.159
<v Speaker 2>Man, that instantly reminds me of my high school algebra teacher.

208
00:09:55.480 --> 00:09:57.399
<v Speaker 2>I don't care if you got the right answer, show

209
00:09:57.440 --> 00:09:58.960
<v Speaker 2>me the steps.

210
00:09:58.879 --> 00:10:01.360
<v Speaker 3>And your teacher was exactly right, because if you show the steps,

211
00:10:01.399 --> 00:10:05.039
<v Speaker 3>you actually prove that you understand the logic behind the solution.

212
00:10:05.480 --> 00:10:07.919
<v Speaker 3>If you just write down the final number, you might

213
00:10:07.919 --> 00:10:09.919
<v Speaker 3>have just memorized it from the textbook or taken a

214
00:10:09.960 --> 00:10:14.559
<v Speaker 3>lucky guess. By forcing the AI to mumble the intermediate steps,

215
00:10:14.840 --> 00:10:17.600
<v Speaker 3>they forced it to break the complex problem down into

216
00:10:17.759 --> 00:10:19.919
<v Speaker 3>manageable logical

217
00:10:19.480 --> 00:10:23.000
<v Speaker 2>chunks. Which brings us directly to the specific tasks they

218
00:10:23.080 --> 00:10:26.960
<v Speaker 2>use to test this theory. We mentioned reversing sequences earlier.

219
00:10:27.279 --> 00:10:31.320
<v Speaker 2>The paper also talks about pattern creation. But why are

220
00:10:31.440 --> 00:10:35.200
<v Speaker 2>these specific tasks so important for the researchers to use.

221
00:10:36.039 --> 00:10:39.360
<v Speaker 2>They seem, I don't know, almost too simple, like reversing

222
00:10:39.360 --> 00:10:42.720
<v Speaker 2>a list of items. A cheap pocket calculator can do that.

223
00:10:43.559 --> 00:10:46.200
<v Speaker 3>A calculator can do that because a calculator is hard

224
00:10:46.240 --> 00:10:49.120
<v Speaker 3>coded by a human software engineer to do exactly that.

225
00:10:49.559 --> 00:10:51.559
<v Speaker 3>A neural network, on the other hand, has to learn

226
00:10:51.639 --> 00:10:53.840
<v Speaker 3>how to do it completely from scratch just by looking

227
00:10:53.879 --> 00:10:56.440
<v Speaker 3>at examples. And for a neural network, these types of

228
00:10:56.440 --> 00:11:01.000
<v Speaker 3>tasks are actually brutal. They are highly computationally demanding because they

229
00:11:01.039 --> 00:11:03.200
<v Speaker 3>require what we call sequential processing.

230
00:11:03.559 --> 00:11:05.480
<v Speaker 2>Sequential, meaning it involves time?

231
00:11:05.480 --> 00:11:08.559
<v Speaker 3>Yes, time and order. You have to remember the beginning

232
00:11:08.559 --> 00:11:11.240
<v Speaker 3>of the sentence while you are simultaneously reading the end

233
00:11:11.279 --> 00:11:13.879
<v Speaker 3>of the sentence, and then you have to purposefully manipulate

234
00:11:13.960 --> 00:11:16.960
<v Speaker 3>the order of those elements. This requires holding multiple distinct

235
00:11:17.039 --> 00:11:20.240
<v Speaker 3>data points in your head simultaneously without them overwriting each other.

236
00:11:20.399 --> 00:11:22.799
<v Speaker 2>Ah, we are back to the tie-dye problem.

237
00:11:23.159 --> 00:11:25.799
<v Speaker 3>Exactly, if I add blue dye for the end of

238
00:11:25.840 --> 00:11:28.879
<v Speaker 3>the sentence, it might bleed over and turn the red

239
00:11:28.960 --> 00:11:31.240
<v Speaker 3>dye at the beginning of the sentence into purple.

240
00:11:31.440 --> 00:11:34.399
<v Speaker 2>So the information corrupts itself just by existing in the

241
00:11:34.440 --> 00:11:35.080
<v Speaker 2>same space.

242
00:11:35.320 --> 00:11:38.120
<v Speaker 3>Right, And the study showed that the models equipped with

243
00:11:38.200 --> 00:11:41.840
<v Speaker 3>the explicit slots and the internal mumble just blew the

244
00:11:41.879 --> 00:11:45.039
<v Speaker 3>standard models out of the water. They could handle significantly

245
00:11:45.080 --> 00:11:49.840
<v Speaker 3>longer sequences, much more complex patterns. And here's the really

246
00:11:49.840 --> 00:11:53.879
<v Speaker 3>mind blowing part. They could switch between different tasks without crashing.

247
00:11:54.080 --> 00:11:58.360
<v Speaker 2>Multitasking. Multitasking. Now, humans are pretty famous for thinking we

248
00:11:58.440 --> 00:12:01.639
<v Speaker 2>are amazing at multitasking while actually being terrible at it.

249
00:12:01.919 --> 00:12:05.320
<v Speaker 2>But for AI, it's usually a complete disaster, isn't it.

250
00:12:05.320 --> 00:12:08.360
<v Speaker 3>It is usually catastrophic. In fact, there is an official

251
00:12:08.440 --> 00:12:11.000
<v Speaker 3>term for it in the field. It's called catastrophic forgetting.

252
00:12:11.320 --> 00:12:13.559
<v Speaker 2>That sounds incredibly dramatic. It really is.

253
00:12:13.960 --> 00:12:17.279
<v Speaker 3>If you train a standard artificial intelligence to play chess

254
00:12:17.399 --> 00:12:19.559
<v Speaker 3>and it gets really good, and then you try to

255
00:12:19.600 --> 00:12:22.720
<v Speaker 3>teach that exact same model to play checkers, it will

256
00:12:22.759 --> 00:12:25.919
<v Speaker 3>almost always completely forget how to play chess. It is totally wiped,

257
00:12:26.080 --> 00:12:30.080
<v Speaker 3>completely overwritten, because it overwrites the mathematical weights to learn

258
00:12:30.159 --> 00:12:33.759
<v Speaker 3>the new game. The tie-dye pattern essentially gets entirely re-dyed

259
00:12:33.840 --> 00:12:34.679
<v Speaker 3>with new colors.

260
00:12:34.759 --> 00:12:36.600
<v Speaker 2>But the OIST model didn't do that.

261
00:12:36.799 --> 00:12:40.240
<v Speaker 3>It remembered both? It did, and Dr. Queißer observed that

262
00:12:40.240 --> 00:12:44.039
<v Speaker 3>the mumbling was the absolute key to this capability. The

263
00:12:44.120 --> 00:12:48.000
<v Speaker 3>internal speech acted as a dynamic context manager.

264
00:12:48.120 --> 00:12:50.840
<v Speaker 2>Break that down for me. A context manager.

265
00:12:50.519 --> 00:12:53.279
<v Speaker 3>Think of a professional chef working in a really busy kitchen.

266
00:12:53.679 --> 00:12:55.639
<v Speaker 3>They are chopping onions on the cutting board, but they

267
00:12:55.679 --> 00:12:58.399
<v Speaker 3>also have a delicate sauce simmering over on the stove.

268
00:12:58.639 --> 00:13:00.759
<v Speaker 3>They chop, chop, chop, and then they physically stop and

269
00:13:00.759 --> 00:13:03.639
<v Speaker 3>say out loud to themselves, Okay, check the sauce. They

270
00:13:03.639 --> 00:13:06.000
<v Speaker 3>walk over, they stir the sauce. Then they say, sauce

271
00:13:06.039 --> 00:13:07.159
<v Speaker 3>is good. Back to onions.

272
00:13:07.240 --> 00:13:11.240
<v Speaker 2>That little phrase, back to onions, it resets their mental state.

273
00:13:11.159 --> 00:13:15.039
<v Speaker 3>It resets the context. The OIST system uses its mumble to

274
00:13:15.120 --> 00:13:18.960
<v Speaker 3>explicitly label which task it is currently performing. It says internally,

275
00:13:19.200 --> 00:13:22.039
<v Speaker 3>I am now doing task A, and it uses the

276
00:13:22.080 --> 00:13:25.519
<v Speaker 3>memory slots specifically assigned for task A. Then it mumbles

277
00:13:25.639 --> 00:13:29.080
<v Speaker 3>switching to task B, and that command clears the slots

278
00:13:29.159 --> 00:13:32.080
<v Speaker 3>or moves its attention to new slots. It actively prevents

279
00:13:32.159 --> 00:13:34.480
<v Speaker 3>the parameters of the tasks from bleeding into each other.
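
NOTE
Editor's sketch, not taken from the paper: a toy Python version of the chef
analogy, where an inner-speech command selects which bank of slots is active.
The TaskMemory class and the "switch to NAME" command format are hypothetical
names invented for illustration.
```python
class TaskMemory:
    """Keeps one private list of slots per task label (hypothetical design)."""
    def __init__(self):
        self.banks = {}
        self.current = None
    def mumble(self, command):
        # An inner-speech command of the assumed form "switch to NAME".
        prefix = "switch to "
        if command.startswith(prefix):
            self.current = command[len(prefix):]
            self.banks.setdefault(self.current, [])
    def remember(self, item):
        # Assumes a task has already been selected with mumble().
        self.banks[self.current].append(item)
    def recall(self):
        return list(self.banks[self.current])
chef = TaskMemory()
chef.mumble("switch to onions")
chef.remember("half chopped")
chef.mumble("switch to sauce")
chef.remember("stirred, looks good")
chef.mumble("switch to onions")
print(chef.recall())
# prints ['half chopped']
```
Switching back to the onions finds that task's notes untouched, because the
sauce was written into a different bank.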

280
00:13:34.720 --> 00:13:37.759
<v Speaker 2>That is wild. That essentially bridges the huge gap between

281
00:13:37.759 --> 00:13:41.080
<v Speaker 2>the rigid, single task focus of traditional AI, where you

282
00:13:41.120 --> 00:13:44.000
<v Speaker 2>have one specific bot for playing chess and a totally

283
00:13:44.000 --> 00:13:48.080
<v Speaker 2>different bot for chatting, and the flexible, fluid adaptability that

284
00:13:48.200 --> 00:13:48.879
<v Speaker 2>human beings have.

285
00:13:49.200 --> 00:13:52.000
<v Speaker 3>It is a massive step towards general purpose intelligence, and

286
00:13:52.039 --> 00:13:55.159
<v Speaker 3>it leads us directly to another incredible benefit outlined in

287
00:13:55.200 --> 00:13:57.240
<v Speaker 3>the research, which is data efficiency.

288
00:13:57.399 --> 00:13:59.320
<v Speaker 2>Yes, this is a huge topic right now in the

289
00:13:59.360 --> 00:14:01.919
<v Speaker 2>tech world. I constantly keep reading articles saying that

290
00:14:01.960 --> 00:14:04.159
<v Speaker 2>we are basically running out of Internet, that the big

291
00:14:04.200 --> 00:14:07.320
<v Speaker 2>tech companies have scraped every single book, every news article,

292
00:14:07.360 --> 00:14:10.200
<v Speaker 2>every Reddit post, and there is literally nothing left of

293
00:14:10.279 --> 00:14:12.679
<v Speaker 2>high quality to train the next generation of models on.

294
00:14:13.039 --> 00:14:16.279
<v Speaker 3>That is a very real, very pressing problem. The current

295
00:14:16.360 --> 00:14:20.240
<v Speaker 3>dominant paradigm in AI development is essentially scale is all

296
00:14:20.279 --> 00:14:23.120
<v Speaker 3>you need. Just make the model bigger, throw more processing

297
00:14:23.200 --> 00:14:25.519
<v Speaker 3>power at it, and give it more data. But we

298
00:14:25.559 --> 00:14:28.039
<v Speaker 3>are rapidly hitting the hard ceiling of what is actually

299
00:14:28.080 --> 00:14:32.759
<v Speaker 3>available out there. The OIST research suggests a completely viable way

300
00:14:32.720 --> 00:14:35.399
<v Speaker 2>out of that trap: sparse data utilization.

301
00:14:35.639 --> 00:14:39.039
<v Speaker 3>Right. Because the OIST system is learning how to think,

302
00:14:39.360 --> 00:14:43.440
<v Speaker 3>meaning the general underlying rules, rather than just what to think,

303
00:14:43.480 --> 00:14:47.440
<v Speaker 3>which is just memorizing specific answers, it needs significantly less

304
00:14:47.519 --> 00:14:49.639
<v Speaker 3>data to achieve the same or better performance.

305
00:14:49.919 --> 00:14:52.720
<v Speaker 2>Going back to your algebra analogy earlier, if I teach

306
00:14:52.759 --> 00:14:55.240
<v Speaker 2>a student the actual rules of algebra, I really only

307
00:14:55.240 --> 00:14:57.399
<v Speaker 2>need to show them maybe ten practice problems, and they

308
00:14:57.399 --> 00:14:59.600
<v Speaker 2>get the concept. They can apply it anywhere. But if

309
00:14:59.639 --> 00:15:01.480
<v Speaker 2>I try to teach them algebra purely by showing

310
00:15:01.519 --> 00:15:03.840
<v Speaker 2>them every single possible math problem in existence so they

311
00:15:03.840 --> 00:15:07.440
<v Speaker 2>can memorize the answers. I would literally need infinite data.

312
00:15:07.679 --> 00:15:10.840
<v Speaker 3>That is a perfect analogy. The OIST model is learning

313
00:15:10.879 --> 00:15:14.320
<v Speaker 3>the rules of the game. Dr. Queißer explicitly calls it

314
00:15:14.480 --> 00:15:19.159
<v Speaker 3>a complementary, lightweight alternative to these massive, heavy data models.

315
00:15:19.519 --> 00:15:21.080
<v Speaker 3>And you have to imagine what that means for the

316
00:15:21.120 --> 00:15:24.200
<v Speaker 3>real world. Think about the environment, think about the energy costs.

317
00:15:24.600 --> 00:15:27.759
<v Speaker 3>We wouldn't need to build these massive city-sized data

318
00:15:27.759 --> 00:15:31.080
<v Speaker 3>centers that consume as much electricity as a small country

319
00:15:31.279 --> 00:15:33.519
<v Speaker 3>just to train a smart AI.

320
00:15:33.519 --> 00:15:36.039
<v Speaker 2>And we wouldn't need a supercomputer to actually run the AI once

321
00:15:36.080 --> 00:15:38.279
<v Speaker 2>it's trained. Yeah, which brings us to the part of

322
00:15:38.279 --> 00:15:40.600
<v Speaker 2>the paper that got me really truly excited, and that

323
00:15:40.759 --> 00:15:41.960
<v Speaker 2>is robotics.

324
00:15:42.159 --> 00:15:45.440
<v Speaker 3>Yes, the real world application of all this theory.

325
00:15:45.159 --> 00:15:48.559
<v Speaker 2>Because right now, let's be honest, robots are kind of dumb.

326
00:15:48.679 --> 00:15:50.840
<v Speaker 2>They work perfectly in a car factory where everything is

327
00:15:50.879 --> 00:15:53.519
<v Speaker 2>literally bolted down to the floor, the lighting never changes

328
00:15:53.799 --> 00:15:55.960
<v Speaker 2>and the exact same part comes down the assembly line

329
00:15:55.960 --> 00:15:58.320
<v Speaker 2>every three seconds. But you put a state of the

330
00:15:58.399 --> 00:16:01.360
<v Speaker 2>art robot in my messy living room, and it is total chaos.

331
00:16:01.399 --> 00:16:03.159
<v Speaker 2>It gets stuck on a rug.

332
00:16:03.519 --> 00:16:06.960
<v Speaker 3>Exactly. The paper explicitly talks about the challenge of transitioning AI

333
00:16:07.080 --> 00:16:10.480
<v Speaker 3>from controlled environments to dynamic environments.

334
00:16:10.879 --> 00:16:13.679
<v Speaker 2>Let's really take this out of the laboratory, because the

335
00:16:13.679 --> 00:16:18.240
<v Speaker 2>paper specifically mentions agricultural robots as a use case. Let's

336
00:16:18.320 --> 00:16:22.320
<v Speaker 2>visualize that. You have got a robotic tractor, or let's

337
00:16:22.320 --> 00:16:24.919
<v Speaker 2>go with a weed bot out in a massive cornfield.

338
00:16:25.080 --> 00:16:28.000
<v Speaker 3>Okay, so you have this robot. Its sole job is

339
00:16:28.039 --> 00:16:30.759
<v Speaker 3>to drive down the row, visually identify a weed and

340
00:16:30.799 --> 00:16:34.000
<v Speaker 3>pull it out, but obviously leave the valuable corn alone. Now,

341
00:16:34.039 --> 00:16:36.759
<v Speaker 3>in a sterile lab setting, that is incredibly easy. The

342
00:16:36.840 --> 00:16:40.279
<v Speaker 3>lighting is perfectly calibrated, the corn is bright green, the

343
00:16:40.279 --> 00:16:43.720
<v Speaker 3>weed has a distinct leaf shape. The cameras process it instantly.

344
00:16:43.840 --> 00:16:45.799
<v Speaker 2>But out in the actual real world?

345
00:16:45.799 --> 00:16:48.960
<v Speaker 3>In the real world, a dark cloud passes over the sun. The

346
00:16:49.120 --> 00:16:52.600
<v Speaker 3>ambient light drops by fifty percent in two seconds. A

347
00:16:52.639 --> 00:16:54.879
<v Speaker 3>gust of wind blows the corn stalk, so it is

348
00:16:54.879 --> 00:16:57.720
<v Speaker 3>suddenly leaning over at a forty five degree angle. Maybe

349
00:16:57.759 --> 00:16:59.840
<v Speaker 3>there's a splash of mud that gets splattered right on

350
00:17:00.120 --> 00:17:01.279
<v Speaker 3>the robot's camera lens.

351
00:17:01.559 --> 00:17:05.119
<v Speaker 2>To a standard vision-based AI, that visual input just

352
00:17:05.240 --> 00:17:08.640
<v Speaker 2>changed completely. It completely freaks out. It thinks the leaning

353
00:17:08.680 --> 00:17:12.279
<v Speaker 2>corn is an entirely new, unrecognized object. It thinks the shadow

354
00:17:12.319 --> 00:17:14.039
<v Speaker 2>from the cloud is a deep hole in the ground.

355
00:17:14.000 --> 00:17:17.960
<v Speaker 3>Exactly. It throws an error and crashes, or worse,

356
00:17:18.000 --> 00:17:20.599
<v Speaker 3>it just happily pulls up all the expensive corn. But

357
00:17:20.720 --> 00:17:23.799
<v Speaker 3>a robot equipped with this inner voice architecture and working

358
00:17:23.839 --> 00:17:27.079
<v Speaker 3>memory can actually self-correct in real time. It can

359
00:17:27.119 --> 00:17:30.160
<v Speaker 3>literally talk itself through the sensory confusion.

360
00:17:30.160 --> 00:17:33.319
<v Speaker 2>So it is internally mumbling, okay, the light just got a lot darker,

361
00:17:33.720 --> 00:17:36.160
<v Speaker 2>but my sensors say I didn't actually move forward, so

362
00:17:36.200 --> 00:17:39.119
<v Speaker 2>the object directly in front of me is highly likely

363
00:17:39.359 --> 00:17:41.279
<v Speaker 2>to still be the corn stalk I was just looking at.

364
00:17:41.519 --> 00:17:45.160
<v Speaker 3>Yes, exactly. It maintains a continuous state. It explicitly says

365
00:17:45.160 --> 00:17:48.559
<v Speaker 3>to itself: Current state is weeding row four. Event is

366
00:17:48.599 --> 00:17:52.599
<v Speaker 3>sudden light reduction. Action is continue current task. It bridges

367
00:17:52.640 --> 00:17:54.960
<v Speaker 3>the sudden gap in its sensory data by relying on

368
00:17:55.039 --> 00:17:58.680
<v Speaker 3>a logical internal narrative. It creates a cognitive buffer against

369
00:17:58.680 --> 00:18:01.039
<v Speaker 3>the unpredictability and chaos of the physical world.

370
00:18:01.319 --> 00:18:04.440
<v Speaker 2>That is just remarkably human. I mean, that is exactly

371
00:18:04.440 --> 00:18:06.400
<v Speaker 2>what I do. When I am driving on the highway

372
00:18:06.400 --> 00:18:09.559
<v Speaker 2>in a sudden rainstorm. I am talking to myself saying, okay,

373
00:18:09.559 --> 00:18:12.559
<v Speaker 2>I can't see much, just slow down, keep the wheel straight,

374
00:18:12.920 --> 00:18:16.039
<v Speaker 2>look for the tail lights ahead. I am not completely

375
00:18:16.079 --> 00:18:18.839
<v Speaker 2>relearning how to drive a car every single second. I

376
00:18:18.880 --> 00:18:21.400
<v Speaker 2>am actively talking myself through the noise and the fear.

377
00:18:21.559 --> 00:18:24.000
<v Speaker 3>And that is exactly why this research is so huge

378
00:18:24.000 --> 00:18:27.039
<v Speaker 3>for the field of robotics. You simply cannot upload the

379
00:18:27.200 --> 00:18:29.920
<v Speaker 3>entire Internet into the memory banks of a farm tractor.

380
00:18:30.440 --> 00:18:33.839
<v Speaker 3>It is impossible. You need a centralized brain that is small,

381
00:18:34.119 --> 00:18:37.559
<v Speaker 3>highly efficient, and capable of actively reasoning its way out

382
00:18:37.599 --> 00:18:40.680
<v Speaker 3>of a novel problem, rather than just cross referencing a

383
00:18:40.720 --> 00:18:44.519
<v Speaker 3>massive database to remember a pre programmed solution. This is

384
00:18:44.559 --> 00:18:48.559
<v Speaker 3>the fundamental difference between simple automation and true autonomy.

385
00:18:48.759 --> 00:18:50.960
<v Speaker 2>Break that distinction down for me a bit more: automation

386
00:18:51.119 --> 00:18:54.319
<v Speaker 2>versus autonomy. People use those words interchangeably a lot.

387
00:18:54.480 --> 00:18:57.799
<v Speaker 3>They do, but they are very different. Automation is like

388
00:18:57.839 --> 00:19:00.720
<v Speaker 3>a train on a track. It is incredibly powerful, it

389
00:19:00.759 --> 00:19:04.079
<v Speaker 3>is fast, it is efficient, but if a cow suddenly

390
00:19:04.160 --> 00:19:06.480
<v Speaker 3>wanders onto the track, the train doesn't know how

391
00:19:06.480 --> 00:19:09.759
<v Speaker 3>to evaluate the situation. It just hits the brakes and stops,

392
00:19:10.000 --> 00:19:13.680
<v Speaker 3>or it crashes. Current robots are still largely just automated.

393
00:19:13.839 --> 00:19:17.640
<v Speaker 3>They rigidly follow a script. Go forward ten feet, turn left,

394
00:19:17.720 --> 00:19:19.400
<v Speaker 3>ninety degrees, stop.

395
00:19:19.319 --> 00:19:21.880
<v Speaker 2>But the real world does not have tracks.

396
00:19:22.319 --> 00:19:24.759
<v Speaker 3>Exactly. Autonomy, on the other hand, is like driving a car.

397
00:19:25.039 --> 00:19:27.400
<v Speaker 3>You can see the cow, evaluate the shoulder of the road,

398
00:19:27.440 --> 00:19:29.799
<v Speaker 3>and steer around it. You can decide to go off

399
00:19:29.880 --> 00:19:32.559
<v Speaker 3>road if you have to. Autonomy means you are writing

400
00:19:32.559 --> 00:19:35.960
<v Speaker 3>the behavioral script in real time as the situation unfolds.

401
00:19:36.400 --> 00:19:39.160
<v Speaker 3>The mumbling we are talking about is effectively that real

402
00:19:39.240 --> 00:19:40.640
<v Speaker 3>time script writing process.

403
00:19:40.720 --> 00:19:43.240
<v Speaker 2>And because this whole architecture is so lightweight, as Dr.

404
00:19:43.279 --> 00:19:46.160
<v Speaker 2>Queißer puts it, you can actually put this brain physically

405
00:19:46.200 --> 00:19:47.799
<v Speaker 2>inside the robot itself.

406
00:19:48.119 --> 00:19:51.039
<v Speaker 3>Right, it becomes an embedded system. The robot doesn't need

407
00:19:51.079 --> 00:19:53.920
<v Speaker 3>to constantly talk to a massive cloud server for every

408
00:19:53.920 --> 00:19:58.359
<v Speaker 3>single micro decision, and that drastically reduces latency. If that

409
00:19:58.519 --> 00:20:02.119
<v Speaker 3>automated tractor sees a ditch suddenly appear, it needs to

410
00:20:02.160 --> 00:20:05.200
<v Speaker 3>stop right now, not in the two seconds it takes

411
00:20:05.240 --> 00:20:07.359
<v Speaker 3>to send a video frame to a server farm in

412
00:20:07.440 --> 00:20:09.640
<v Speaker 3>Virginia and wait to get a stop command back.

413
00:20:09.799 --> 00:20:12.160
<v Speaker 2>Two seconds is an eternity when you are driving a

414
00:20:12.240 --> 00:20:13.000
<v Speaker 2>tractor into a ditch.

415
00:20:12.799 --> 00:20:13.839
<v Speaker 3>Exactly.

416
00:20:14.200 --> 00:20:17.200
<v Speaker 2>So, if I'm tracking this right, we have a proposed

417
00:20:17.240 --> 00:20:21.400
<v Speaker 2>system that learns significantly faster, uses a fraction of the

418
00:20:21.519 --> 00:20:25.720
<v Speaker 2>data, can multitask without forgetting its primary directive, and functions

419
00:20:25.839 --> 00:20:30.920
<v Speaker 2>exponentially better in the messy, unstructured real world. That honestly

420
00:20:31.000 --> 00:20:33.359
<v Speaker 2>sounds almost too good to be true. There has to

421
00:20:33.400 --> 00:20:36.039
<v Speaker 2>be a catch, or rather, what does this actually mean

422
00:20:36.039 --> 00:20:38.480
<v Speaker 2>for the future of how we as humans are going

423
00:20:38.480 --> 00:20:39.640
<v Speaker 2>to interact with these things?

424
00:20:39.880 --> 00:20:42.359
<v Speaker 3>Well, there is actually one more really interesting side effect

425
00:20:42.440 --> 00:20:45.000
<v Speaker 3>of this mumbling architecture that we haven't touched on yet,

426
00:20:45.079 --> 00:20:48.680
<v Speaker 3>and it solves a major headache in the field: interpretability.

427
00:20:48.759 --> 00:20:51.559
<v Speaker 2>Interpretability? Like being able to translate it?

428
00:20:51.039 --> 00:20:54.559
<v Speaker 3>More like being able to understand its motives. One

429
00:20:54.559 --> 00:20:57.799
<v Speaker 3>of the absolute biggest fears about modern AI, especially the

430
00:20:57.920 --> 00:21:00.920
<v Speaker 3>massive deep learning models, is that they are essentially a

431
00:21:00.920 --> 00:21:03.799
<v Speaker 3>black box. We feed data in, we get an answer out,

432
00:21:03.880 --> 00:21:06.799
<v Speaker 3>but we really don't know why it made that specific decision.

433
00:21:06.960 --> 00:21:09.720
<v Speaker 3>It just spits out the final output based on billions

434
00:21:09.799 --> 00:21:14.079
<v Speaker 3>of obscure mathematical weights. But with this OIST system, the actual

435
00:21:14.119 --> 00:21:17.160
<v Speaker 3>thought process is explicit and trackable.

436
00:21:16.759 --> 00:21:20.000
<v Speaker 2>Because it is actively generating and mumbling those tokens.

437
00:21:20.119 --> 00:21:24.319
<v Speaker 3>Yes, we can theoretically open up the system and read

438
00:21:24.440 --> 00:21:27.319
<v Speaker 3>the literal transcript of its internal deliberation. So if that

439
00:21:27.400 --> 00:21:31.079
<v Speaker 3>agricultural robot does something crazy and drives straight through a

440
00:21:31.079 --> 00:21:33.880
<v Speaker 3>wooden fence, the engineers don't have to just throw their

441
00:21:33.880 --> 00:21:35.880
<v Speaker 3>hands up and guess why the weights failed. They can

442
00:21:35.920 --> 00:21:38.200
<v Speaker 3>look at the internal log and literally read the mumble.

443
00:21:38.240 --> 00:21:40.799
<v Speaker 3>They can see it said: identified wooden fence as tall

444
00:21:40.920 --> 00:21:42.960
<v Speaker 3>dry grass, proceeding forward.

445
00:21:43.079 --> 00:21:45.400
<v Speaker 2>Oh wow. So it gives us an actual, readable audit

446
00:21:45.400 --> 00:21:46.720
<v Speaker 2>trail of its thought process.

447
00:21:47.160 --> 00:21:51.200
<v Speaker 3>Exactly. It makes safety engineering and debugging so much easier

448
00:21:51.240 --> 00:21:54.480
<v Speaker 3>and more transparent. We can debug the actual logic of

449
00:21:54.519 --> 00:21:57.160
<v Speaker 3>the machine, not just try to tweak the underlying math

450
00:21:57.240 --> 00:21:59.519
<v Speaker 3>and hope for the best. We can see exactly where

451
00:21:59.519 --> 00:22:00.960
<v Speaker 3>the reasoning went off the rails.

452
00:22:01.039 --> 00:22:04.079
<v Speaker 2>That is fascinating. It's exactly like being able to read

453
00:22:04.119 --> 00:22:07.279
<v Speaker 2>a student's rough draft of an essay to see exactly

454
00:22:07.279 --> 00:22:11.000
<v Speaker 2>where they misunderstood the core assignment, rather than just giving

455
00:22:11.000 --> 00:22:13.119
<v Speaker 2>the final paper an F and moving on.

456
00:22:13.400 --> 00:22:15.599
<v Speaker 3>It really is. And all of this leads us to

457
00:22:15.640 --> 00:22:18.400
<v Speaker 3>a final sort of philosophical point that the paper touches on.

458
00:22:18.799 --> 00:22:20.960
<v Speaker 3>Dr. Queißer and his team make a point to mention

459
00:22:21.039 --> 00:22:24.400
<v Speaker 3>that this research isn't just about building better, more efficient

460
00:22:24.480 --> 00:22:28.359
<v Speaker 3>robots for industry. It is fundamentally about understanding ourselves.

461
00:22:28.160 --> 00:22:30.599
<v Speaker 2>Getting back to the biological blueprint we talked about at the

462
00:22:30.680 --> 00:22:31.640
<v Speaker 2>start.

463
00:22:32.039 --> 00:22:35.279
<v Speaker 3>Right. By successfully modeling inner speech and working memory as a

464
00:22:35.279 --> 00:22:39.920
<v Speaker 3>distinct computational advantage in a machine, the study strongly validates

465
00:22:39.960 --> 00:22:44.039
<v Speaker 3>the hypothesis that our own internal monologue is a critical

466
00:22:44.200 --> 00:22:48.000
<v Speaker 3>functional component of human intelligence. It's not just some weird

467
00:22:48.079 --> 00:22:50.880
<v Speaker 3>evolutionary quirk or a side effect of learning to speak

468
00:22:50.880 --> 00:22:54.160
<v Speaker 3>out loud. We evolved the ability to talk to ourselves

469
00:22:54.200 --> 00:22:56.960
<v Speaker 3>because it is quite literally the most efficient way to

470
00:22:57.000 --> 00:22:58.720
<v Speaker 3>run our own biological software.

471
00:22:59.000 --> 00:23:03.440
<v Speaker 2>So we are basically biological machines running a continuous mumble

472
00:23:03.480 --> 00:23:06.720
<v Speaker 2>algorithm on a wetware neural network.

473
00:23:06.480 --> 00:23:08.799
<v Speaker 3>In a very real way, yes. It suggests that the

474
00:23:08.839 --> 00:23:11.119
<v Speaker 3>act of thinking is really just internal self-communication.

475
00:23:10.880 --> 00:23:14.000
<v Speaker 2>And that is, that is heavy.

476
00:23:14.119 --> 00:23:15.960
<v Speaker 3>It is heavy. It really changes how we look at

477
00:23:16.039 --> 00:23:19.880
<v Speaker 3>artificial intelligence going forward. Instead of seeing AI as this alien,

478
00:23:20.039 --> 00:23:23.720
<v Speaker 3>hyperfast super calculator that just knows things instantly, we might

479
00:23:23.759 --> 00:23:25.759
<v Speaker 3>start to see it as something that occasionally needs to

480
00:23:25.799 --> 00:23:28.559
<v Speaker 3>just take a minute, pause, and think things through before

481
00:23:28.559 --> 00:23:29.000
<v Speaker 3>it acts.

482
00:23:29.119 --> 00:23:31.640
<v Speaker 2>We are moving towards systems that do not just blindly

483
00:23:31.720 --> 00:23:35.359
<v Speaker 2>process incoming data, but actively interact with their own internal

484
00:23:35.400 --> 00:23:37.000
<v Speaker 2>states to figure out the world.

485
00:23:37.279 --> 00:23:42.680
<v Speaker 3>Correct. The future of robust, reliable artificial intelligence lies in

486
00:23:42.799 --> 00:23:46.079
<v Speaker 3>machines that quite literally talk to themselves.

487
00:23:46.279 --> 00:23:49.440
<v Speaker 2>I really love that perspective. It makes the AI feel

488
00:23:49.519 --> 00:23:53.240
<v Speaker 2>a lot less like a terrifying magic box and a

489
00:23:53.279 --> 00:23:55.519
<v Speaker 2>lot more like a, well, like a thinker.

490
00:23:55.559 --> 00:23:57.880
<v Speaker 3>A thinker with a very organized scratch pad.

491
00:23:57.759 --> 00:23:59.720
<v Speaker 2>A thinker with a scratch pad and a habit of

492
00:24:00.039 --> 00:24:00.920
<v Speaker 2>muttering under its breath.

493
00:24:01.119 --> 00:24:01.720
<v Speaker 3>Exactly.

494
00:24:01.960 --> 00:24:04.599
<v Speaker 2>So let's recap the really big takeaways here for everyone.

495
00:24:04.880 --> 00:24:08.039
<v Speaker 2>We started with the observation that traditional AI has been

496
00:24:08.960 --> 00:24:12.119
<v Speaker 2>somewhat stuck in this big data trap.

497
00:24:12.079 --> 00:24:17.200
<v Speaker 3>Right. Relying purely on massive, unsustainable data inputs to essentially memorize

498
00:24:17.200 --> 00:24:18.000
<v Speaker 3>the entire world.

499
00:24:18.119 --> 00:24:20.640
<v Speaker 2>And then this team at OIST comes along and says, no,

500
00:24:20.920 --> 00:24:23.640
<v Speaker 2>stop building bigger data centers. Look at the architecture of

501
00:24:23.680 --> 00:24:26.759
<v Speaker 2>the human brain instead. We have working memory, which gives

502
00:24:26.799 --> 00:24:30.079
<v Speaker 2>us those isolated slots, and we have internal speech, which

503
00:24:30.119 --> 00:24:31.000
<v Speaker 2>gives us the mumble.

504
00:24:31.200 --> 00:24:34.559
<v Speaker 3>And when you successfully combine those two biological concepts in code,

505
00:24:34.559 --> 00:24:38.359
<v Speaker 3>you achieve content-agnostic information processing. You get a system

506
00:24:38.400 --> 00:24:40.839
<v Speaker 3>with the ability to learn the underlying rules of a problem,

507
00:24:40.920 --> 00:24:43.160
<v Speaker 3>not just memorize the facts of the training data.

508
00:24:42.920 --> 00:24:47.440
<v Speaker 2>Which directly leads to much better generalization, the ability

509
00:24:47.480 --> 00:24:50.440
<v Speaker 2>to deal with the alien alphabet or the completely novel

510
00:24:50.519 --> 00:24:52.400
<v Speaker 2>sequence without crashing.

511
00:24:52.119 --> 00:24:56.559
<v Speaker 3>And significantly better multitasking, the chef managing the kitchen without

512
00:24:56.559 --> 00:24:57.720
<v Speaker 3>catastrophic forgetting.

513
00:24:57.799 --> 00:25:03.039
<v Speaker 2>And finally, it unlocks the potential for true real world robotics.

514
00:25:03.519 --> 00:25:07.039
<v Speaker 2>The autonomous tractor navigating a sudden storm without needing to

515
00:25:07.119 --> 00:25:09.200
<v Speaker 2>ask a cloud server what a shadow is.

516
00:25:08.920 --> 00:25:12.160
<v Speaker 3>Achieving true autonomy through internal self-regulation.

517
00:25:12.720 --> 00:25:15.440
<v Speaker 2>It really seems like we are witnessing a true maturing

518
00:25:15.519 --> 00:25:18.599
<v Speaker 2>of the entire field of AI, moving away from just

519
00:25:18.640 --> 00:25:24.200
<v Speaker 2>brute force computing and moving toward truly elegant, biologically inspired design.

520
00:25:24.440 --> 00:25:25.759
<v Speaker 3>I would go so far as to call it a

521
00:25:25.759 --> 00:25:30.160
<v Speaker 3>fundamental shift from artificial intelligence to artificial cognition. We aren't

522
00:25:30.200 --> 00:25:33.039
<v Speaker 3>just trying to simulate the final results of human thinking anymore.

523
00:25:33.039 --> 00:25:36.160
<v Speaker 3>We are actually simulating the step by step process of thinking.

524
00:25:36.279 --> 00:25:38.359
<v Speaker 2>That is a very crucial distinction to make.

525
00:25:38.559 --> 00:25:41.759
<v Speaker 3>It is, and Dr. Queißer's work really challenges all of

526
00:25:41.839 --> 00:25:45.319
<v Speaker 3>us to reconsider our basic definition of what learning actually is.

527
00:25:45.759 --> 00:25:49.000
<v Speaker 3>Learning is not just the massive accumulation of disjointed facts.

528
00:25:49.240 --> 00:25:52.480
<v Speaker 3>It is the active development of the internal logical

529
00:25:52.519 --> 00:25:55.799
<v Speaker 3>processes required to manipulate and understand those facts.

530
00:25:55.920 --> 00:25:58.799
<v Speaker 2>And the memory slots provide the stable canvas upon which

531
00:25:58.799 --> 00:26:01.519
<v Speaker 2>that complex process is drawn.

532
00:26:01.759 --> 00:26:04.839
<v Speaker 3>Together, they enable the machine to finally step out of the rigid,

533
00:26:05.039 --> 00:26:09.640
<v Speaker 3>fragile constraints of its historical training data and dynamically engage

534
00:26:09.640 --> 00:26:12.920
<v Speaker 3>with the sheer novelty and chaos of the real world.

535
00:26:13.680 --> 00:26:15.359
<v Speaker 2>As we wrap up this deep dive, I want to

536
00:26:15.440 --> 00:26:17.720
<v Speaker 2>leave you, the listener, with a final thought to mull

537
00:26:17.759 --> 00:26:20.599
<v Speaker 2>over. The next time you catch yourself talking to yourself,

538
00:26:20.839 --> 00:26:23.400
<v Speaker 2>maybe you are rehearsing an argument in the shower, or

539
00:26:23.480 --> 00:26:25.519
<v Speaker 2>muttering under your breath while you tear apart the house

540
00:26:25.559 --> 00:26:28.880
<v Speaker 2>looking for your car keys. Do not feel crazy.

541
00:26:29.000 --> 00:26:30.200
<v Speaker 3>Definitely not. You are highly functional.

542
00:26:30.319 --> 00:26:33.200
<v Speaker 2>You're just actively optimizing your working memory. You are running

543
00:26:33.200 --> 00:26:37.240
<v Speaker 2>some very high level executive code, and pretty soon your

544
00:26:37.279 --> 00:26:39.880
<v Speaker 2>smart toaster might be doing the exact same thing while

545
00:26:39.880 --> 00:26:41.880
<v Speaker 2>it figures out how to perfectly brown your bagel.

546
00:26:42.200 --> 00:26:44.839
<v Speaker 3>Let's just hope the toaster doesn't start arguing back about

547
00:26:44.880 --> 00:26:46.039
<v Speaker 3>what settings you chose.

548
00:26:46.319 --> 00:26:49.599
<v Speaker 2>That is definitely a problem for another deep dive. This

549
00:26:49.680 --> 00:26:52.400
<v Speaker 2>has been a truly fascinating look into the study "AI

550
00:26:52.519 --> 00:26:56.240
<v Speaker 2>That Talks to Itself Learns Faster and Smarter," coming out

551
00:26:56.279 --> 00:26:59.519
<v Speaker 2>of the Okinawa Institute of Science and Technology.

552
00:26:59.119 --> 00:27:01.599
<v Speaker 3>A very significant step forward in the field of

553
00:27:01.720 --> 00:27:03.160
<v Speaker 3>cognitive neurorobotics.

554
00:27:03.880 --> 00:27:05.759
<v Speaker 2>Thank you so much for taking the time to break

555
00:27:05.799 --> 00:27:08.440
<v Speaker 2>all of this complex architecture down with us today. It

556
00:27:08.519 --> 00:27:11.160
<v Speaker 2>was my absolute pleasure, and thank you for listening. Keep

557
00:27:11.160 --> 00:27:15.079
<v Speaker 2>talking to yourselves, everyone. It is genuinely good for your brain.
