WEBVTT

1
00:00:01.199 --> 00:00:06.200
<v Speaker 1>Welcome to the Sentient Code, where intelligence is engineered, autonomy

2
00:00:06.280 --> 00:00:10.439
<v Speaker 1>is emerging, and the line between human and machine grows thinner.

3
00:00:10.800 --> 00:00:15.359
<v Speaker 1>Each episode, we decode the algorithms, explore the robotics, and

4
00:00:15.439 --> 00:00:19.000
<v Speaker 1>examine the ideas shaping the future of artificial minds.

5
00:00:23.800 --> 00:00:25.440
<v Speaker 2>I want to start today by asking you to do

6
00:00:25.519 --> 00:00:30.160
<v Speaker 2>something that feels incredibly simple, almost, you know, trivial, but

7
00:00:30.239 --> 00:00:33.799
<v Speaker 2>it's actually a miracle of biology. Right now, just pause

8
00:00:33.840 --> 00:00:35.719
<v Speaker 2>for a second and notice exactly what you're doing. You're

9
00:00:35.719 --> 00:00:38.840
<v Speaker 2>listening to my voice, obviously, but maybe you're also driving,

10
00:00:38.880 --> 00:00:42.039
<v Speaker 2>so your eyes are scanning the road, watching for brake lights.

11
00:00:42.359 --> 00:00:44.960
<v Speaker 2>You feel the texture of the steering wheel under your hands.

12
00:00:45.159 --> 00:00:47.479
<v Speaker 2>Maybe you're drinking coffee and you can smell the roast.

13
00:00:47.600 --> 00:00:49.600
<v Speaker 3>It's the sensory soup. We're swimming in it.

14
00:00:49.759 --> 00:00:52.759
<v Speaker 2>Exactly, it's a soup. But here's the thing, and I

15
00:00:52.840 --> 00:00:56.799
<v Speaker 2>really want you to catch this. You aren't toggling between

16
00:00:56.840 --> 00:00:59.600
<v Speaker 2>these senses like you're switching apps on a phone. No,

17
00:01:00.399 --> 00:01:03.039
<v Speaker 2>you don't stop hearing to start seeing. You don't pause

18
00:01:03.079 --> 00:01:05.560
<v Speaker 2>your sense of smell to process the texture of the wheel.

19
00:01:06.079 --> 00:01:09.879
<v Speaker 2>Your brain is this incredible fluid mixing board. It takes

20
00:01:10.040 --> 00:01:15.439
<v Speaker 2>audio, visual, tactile, and textual inputs and weaves them into

21
00:01:15.480 --> 00:01:18.280
<v Speaker 2>this single, seamless narrative we call reality.

22
00:01:18.400 --> 00:01:21.079
<v Speaker 3>And it's completely effortless for us. I mean, it is

23
00:01:21.120 --> 00:01:25.200
<v Speaker 3>the defining feature of biological consciousness. So we don't really

24
00:01:25.200 --> 00:01:28.799
<v Speaker 3>think about modalities, do we? We just think about the world.

25
00:01:29.040 --> 00:01:31.959
<v Speaker 2>But and this is the big concept we're unpacking today.

26
00:01:32.359 --> 00:01:36.200
<v Speaker 2>Until very very recently, artificial intelligence was not like that

27
00:01:36.280 --> 00:01:38.000
<v Speaker 2>at all. In fact, it was the exact opposite.

28
00:01:38.040 --> 00:01:40.879
<v Speaker 3>Oh, it was completely fragmented. You look at the history

29
00:01:40.920 --> 00:01:44.319
<v Speaker 3>of AI really from the nineteen fifties up until well

30
00:01:44.519 --> 00:01:47.519
<v Speaker 3>the early twenty twenties, we were building a fractured mind.

31
00:01:47.599 --> 00:01:49.640
<v Speaker 3>We had what we call the island problem.

32
00:01:49.680 --> 00:01:52.120
<v Speaker 2>The island problem. I like that image. Paint the picture

33
00:01:52.159 --> 00:01:52.400
<v Speaker 2>for us.

34
00:01:52.439 --> 00:01:55.920
<v Speaker 3>Okay, so picture an archipelago. On one island, you have

35
00:01:56.079 --> 00:02:00.159
<v Speaker 3>these brilliant computer vision systems. They were specialists. They could

36
00:02:00.159 --> 00:02:01.879
<v Speaker 3>look at a photo of a cat and tell you

37
00:02:02.000 --> 00:02:06.159
<v Speaker 3>that's a tabby with like ninety nine percent accuracy. Superhuman

38
00:02:06.239 --> 00:02:08.680
<v Speaker 3>vision in some respects. Right. But if you showed that

39
00:02:08.800 --> 00:02:11.919
<v Speaker 3>same system a handwritten note that said this is a cat,

40
00:02:12.360 --> 00:02:15.360
<v Speaker 3>it was blind. It couldn't read. It had no concept

41
00:02:15.439 --> 00:02:16.360
<v Speaker 3>of what letters were.

42
00:02:16.240 --> 00:02:18.879
<v Speaker 2>Okay, so that's island one, the eye that cannot read. Exactly.

43
00:02:19.319 --> 00:02:21.919
<v Speaker 3>Then on the next island over you have the text bots.

44
00:02:22.199 --> 00:02:25.240
<v Speaker 3>The ancestors of, you know, ChatGPT and the like. They

45
00:02:25.280 --> 00:02:27.159
<v Speaker 3>could write you a sonnet about a cat. They could

46
00:02:27.159 --> 00:02:29.879
<v Speaker 3>define the biology of a feline. They could translate cat

47
00:02:29.919 --> 00:02:32.680
<v Speaker 3>into fifty languages. But if you showed them a picture

48
00:02:32.680 --> 00:02:39.080
<v Speaker 3>of a kitten, nothing, just static. They were effectively brains

49
00:02:39.199 --> 00:02:42.759
<v Speaker 3>in a jar that only knew the world through symbols.

50
00:02:43.080 --> 00:02:45.319
<v Speaker 2>So you have the eye that cannot read and the

51
00:02:45.319 --> 00:02:47.319
<v Speaker 2>brain that cannot see. Precisely.

52
00:02:47.960 --> 00:02:51.039
<v Speaker 3>And the worst part, they were built by different people.

53
00:02:51.240 --> 00:02:53.919
<v Speaker 3>The computer vision engineers didn't hang out with the natural

54
00:02:54.000 --> 00:02:55.360
<v Speaker 3>language processing engineers.

55
00:02:55.360 --> 00:02:57.080
<v Speaker 2>They were in different departments.

56
00:02:56.639 --> 00:02:59.400
<v Speaker 3>They used different math, they used different architectures. They were

57
00:02:59.400 --> 00:03:01.840
<v Speaker 3>effectively different species of intelligence.

58
00:03:02.000 --> 00:03:05.120
<v Speaker 2>So for fifty sixty years we were building these savants.

59
00:03:05.800 --> 00:03:09.039
<v Speaker 2>One savant could see perfect pixels, one savant could parse

60
00:03:09.080 --> 00:03:12.120
<v Speaker 2>perfect grammar. But they couldn't have a conversation.

61
00:03:12.240 --> 00:03:14.240
<v Speaker 3>They couldn't even acknowledge each other's existence.

62
00:03:14.400 --> 00:03:17.960
<v Speaker 2>And today, because the reason we're doing this show is

63
00:03:17.960 --> 00:03:19.439
<v Speaker 2>that something fundamental changed.

64
00:03:19.719 --> 00:03:23.000
<v Speaker 3>Today, the bridges have been built. The water between the

65
00:03:23.039 --> 00:03:27.280
<v Speaker 3>islands is gone. We are witnessing the rise of multimodal

66
00:03:27.400 --> 00:03:30.199
<v Speaker 3>AI, and I want to be really clear to everyone listening,

67
00:03:30.240 --> 00:03:32.800
<v Speaker 3>this isn't just a feature update. This isn't just now

68
00:03:32.800 --> 00:03:34.400
<v Speaker 3>your chatbot has a camera icon.

69
00:03:34.520 --> 00:03:36.280
<v Speaker 2>It feels much much bigger than that.

70
00:03:36.719 --> 00:03:40.280
<v Speaker 3>It is fundamental. We are moving from the era of

71
00:03:40.319 --> 00:03:43.360
<v Speaker 3>the specialist to the era of the generalist. We are

72
00:03:43.400 --> 00:03:46.719
<v Speaker 3>giving machines the ability to integrate senses in a way

73
00:03:46.759 --> 00:03:50.319
<v Speaker 3>that well, it mimics that human sensory soup we started with.

74
00:03:50.680 --> 00:03:53.759
<v Speaker 2>That's our mission for this discussion. We've pulled together a

75
00:03:53.759 --> 00:03:57.759
<v Speaker 2>stack of research, technical papers, and industry analysis to figure

76
00:03:57.800 --> 00:04:00.639
<v Speaker 2>out how this happened, because it seemed like for decades

77
00:04:00.639 --> 00:04:02.840
<v Speaker 2>we were stuck and then in the last few years

78
00:04:02.919 --> 00:04:04.520
<v Speaker 2>everything just collided. Right.

79
00:04:04.960 --> 00:04:07.599
<v Speaker 3>We're going to look at the architecture, the actual aha

80
00:04:07.840 --> 00:04:11.199
<v Speaker 3>moment that let machines see and read at the same time.

81
00:04:11.719 --> 00:04:14.520
<v Speaker 3>We'll look at the superpowers this unlocks, like reading X

82
00:04:14.599 --> 00:04:16.120
<v Speaker 3>rays while reading patient notes.

83
00:04:16.160 --> 00:04:19.120
<v Speaker 2>And we absolutely have to talk about the limitations because

84
00:04:19.160 --> 00:04:22.040
<v Speaker 2>the research shows that while these machines can see, they

85
00:04:22.120 --> 00:04:23.639
<v Speaker 2>hallucinate in brand new ways.

86
00:04:23.480 --> 00:04:25.920
<v Speaker 3>They do, and we need to ask the big

87
00:04:25.959 --> 00:04:29.040
<v Speaker 3>philosophical question: if a machine looks at a photo of

88
00:04:29.040 --> 00:04:31.240
<v Speaker 3>a funeral and writes a poem that makes you cry,

89
00:04:31.720 --> 00:04:35.000
<v Speaker 3>does it actually understand grief or is it just really

90
00:04:35.040 --> 00:04:35.920
<v Speaker 3>really good at math?

91
00:04:36.199 --> 00:04:38.759
<v Speaker 2>Let's get into the mechanics then, Section one. How did

92
00:04:38.800 --> 00:04:42.399
<v Speaker 2>we get here? Because I remember reading about AI in

93
00:04:42.480 --> 00:04:46.680
<v Speaker 2>say twenty fifteen, and it was all about these specialized tools.

94
00:04:46.720 --> 00:04:49.879
<v Speaker 2>You had one tool for chess, one tool for translating French.

95
00:04:50.759 --> 00:04:53.040
<v Speaker 2>When did the walls come down? Was there like a

96
00:04:53.079 --> 00:04:53.800
<v Speaker 2>single invention?

97
00:04:54.160 --> 00:04:56.600
<v Speaker 3>To understand the solution, you really have to understand why

98
00:04:56.600 --> 00:04:58.839
<v Speaker 3>the walls were there in the first place. And it

99
00:04:58.879 --> 00:05:02.439
<v Speaker 3>all boils down to the architecture, the literal shape of

100
00:05:02.480 --> 00:05:03.439
<v Speaker 3>the neural networks.

101
00:05:03.600 --> 00:05:06.120
<v Speaker 2>Okay, break that down for us, you know, don't go

102
00:05:06.199 --> 00:05:08.680
<v Speaker 2>too heavy on the jargon, but give us the reality.

103
00:05:08.720 --> 00:05:11.319
<v Speaker 2>Why couldn't the vision bot talk to the text bot?

104
00:05:11.439 --> 00:05:13.759
<v Speaker 3>Okay, so for a long time, the king of computer

105
00:05:13.839 --> 00:05:17.439
<v Speaker 3>vision was something called a CNN, a convolutional neural network.

106
00:05:17.480 --> 00:05:19.199
<v Speaker 2>We've touched on these before. These are the ones that

107
00:05:19.319 --> 00:05:22.199
<v Speaker 2>scan an image like a grid, right, looking for edges

108
00:05:22.199 --> 00:05:23.079
<v Speaker 2>and shapes. Right.

109
00:05:23.240 --> 00:05:26.079
<v Speaker 3>Imagine a sliding window moving over a picture. It looks

110
00:05:26.120 --> 00:05:28.040
<v Speaker 3>at a tiny patch of pixels, say a three by

111
00:05:28.079 --> 00:05:30.240
<v Speaker 3>three square, and asks, is there an edge

112
00:05:30.279 --> 00:05:30.439
<v Speaker 2>Here?

113
00:05:30.519 --> 00:05:32.319
<v Speaker 3>Is there a curve? Is there a color gradient? It

114
00:05:32.360 --> 00:05:35.680
<v Speaker 3>builds up from lines to shapes, to ears to eventually

115
00:05:35.720 --> 00:05:39.519
<v Speaker 3>a cat. It is designed mathematically to process grids of

116
00:05:39.519 --> 00:05:41.759
<v Speaker 3>spatial data. It understands space.
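
NOTE
Editorial sketch, not part of the spoken audio. The sliding-window idea described
above can be illustrated in a few lines of plain Python. The image, the kernel
values, and the name slide_filter are all hypothetical. A real CNN learns its
kernels from data and stacks many layers; this only shows one fixed 3x3 filter
responding to a dark-to-bright vertical edge.
```python
# Illustrative only: one fixed 3x3 filter slid over a tiny grayscale grid.
def slide_filter(image, kernel):
    """Slide a square kernel over a 2D list and return the grid of responses."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - k + 1):
        row = []
        for c in range(cols - k + 1):
            # Weighted sum of the k x k patch whose top-left corner is (r, c).
            total = sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            row.append(total)
        out.append(row)
    return out
# Hypothetical 4x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]
# Hypothetical kernel that responds to a left-to-right jump in brightness.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
response = slide_filter(image, vertical_edge)
print(response)
```
The response is zero where the window sees only dark pixels and positive where
the window straddles the edge.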

117
00:05:42.079 --> 00:05:44.040
<v Speaker 2>Okay, so that's the eye. It deals in grids. It

118
00:05:44.079 --> 00:05:45.399
<v Speaker 2>thinks in grids. Exactly.

119
00:05:45.480 --> 00:05:48.199
<v Speaker 3>But text, text isn't a grid. Text is

120
00:05:48.240 --> 00:05:51.920
<v Speaker 3>a stream. It's a sequence: the quick brown fox.

121
00:05:52.439 --> 00:05:55.399
<v Speaker 3>The order matters immensely. Of course, you can't just look

122
00:05:55.439 --> 00:05:58.639
<v Speaker 3>at brown without knowing quick came before it. So for

123
00:05:58.759 --> 00:06:02.759
<v Speaker 3>that we used RNNs, recurrent neural networks. These were

124
00:06:02.759 --> 00:06:06.480
<v Speaker 3>designed to remember the past. They process the word fox

125
00:06:06.720 --> 00:06:09.199
<v Speaker 3>while trying to hold onto the memory of the word "the."

126
00:06:09.600 --> 00:06:12.360
<v Speaker 2>So you have one kind of math designed for grids,

127
00:06:12.439 --> 00:06:15.120
<v Speaker 2>which is space, and a completely different kind of math

128
00:06:15.160 --> 00:06:16.839
<v Speaker 2>designed for streams, which is time.

129
00:06:17.120 --> 00:06:19.199
<v Speaker 3>You've got it. And you couldn't just plug one into

130
00:06:19.240 --> 00:06:21.439
<v Speaker 3>the other. They spoke different languages. It was like trying

131
00:06:21.439 --> 00:06:24.120
<v Speaker 3>to put a VHS tape into a toaster. The inputs

132
00:06:24.160 --> 00:06:25.439
<v Speaker 3>just didn't match the machinery.

133
00:06:25.519 --> 00:06:28.000
<v Speaker 2>So what changed? I know the answer involves transformers, because

134
00:06:28.000 --> 00:06:30.000
<v Speaker 2>that seems to be the answer to everything in AI lately.

135
00:06:30.079 --> 00:06:33.480
<v Speaker 2>But why? What did the transformer do that the others couldn't?

136
00:06:33.720 --> 00:06:37.040
<v Speaker 3>Well, the date was twenty seventeen. The paper was Attention

137
00:06:37.319 --> 00:06:39.319
<v Speaker 3>Is All You Need. We talk about it all the

138
00:06:39.360 --> 00:06:42.040
<v Speaker 3>time on this show. But the hidden revolution in that

139
00:06:42.079 --> 00:06:44.920
<v Speaker 3>paper wasn't just that it was better at language. It

140
00:06:44.959 --> 00:06:47.600
<v Speaker 3>was that the transformer was a universal substrate.

141
00:06:48.000 --> 00:06:51.639
<v Speaker 2>Universal substrate. Yeah, that sounds impressive, but what does it

142
00:06:51.680 --> 00:06:53.040
<v Speaker 2>actually mean in practice?

143
00:06:53.120 --> 00:06:55.959
<v Speaker 3>It means it's a structure that can process any kind

144
00:06:55.959 --> 00:06:58.199
<v Speaker 3>of information as long as you can turn that information

145
00:06:58.319 --> 00:06:59.240
<v Speaker 3>into a sequence.

146
00:06:59.519 --> 00:07:02.120
<v Speaker 2>So text is obviously a sequence, word after word after

147
00:07:02.199 --> 00:07:03.319
<v Speaker 2>word. That fits, right?

148
00:07:03.399 --> 00:07:06.600
<v Speaker 3>In AI terms, we call those tokens. But then the

149
00:07:06.639 --> 00:07:10.839
<v Speaker 3>researchers had this, this real aha moment. They realized, wait

150
00:07:10.839 --> 00:07:13.360
<v Speaker 3>a minute, we can treat an image as a sequence too.

151
00:07:13.759 --> 00:07:16.680
<v Speaker 2>Hold on, how do you turn a picture into a sequence?

152
00:07:17.079 --> 00:07:20.519
<v Speaker 2>A picture is a flat 2D object. It doesn't

153
00:07:20.519 --> 00:07:22.720
<v Speaker 2>have a start and end like a sentence does.

154
00:07:22.800 --> 00:07:25.120
<v Speaker 3>That was the stroke of genius. The researchers asked, what

155
00:07:25.199 --> 00:07:26.759
<v Speaker 3>if we forced it to be a sequence?

156
00:07:26.879 --> 00:07:27.879
<v Speaker 2>Forced it? How?

157
00:07:28.319 --> 00:07:31.480
<v Speaker 3>Imagine taking a photo of a dog. Now imagine taking

158
00:07:31.519 --> 00:07:33.560
<v Speaker 3>a pair of scissors and cutting it up into a

159
00:07:33.600 --> 00:07:37.160
<v Speaker 3>grid of little squares. Let's say sixteen by sixteen pixel squares.

160
00:07:37.560 --> 00:07:39.319
<v Speaker 3>You have a pile of these tiny patches.

161
00:07:39.360 --> 00:07:40.040
<v Speaker 2>Okay, I'm with you.

162
00:07:40.319 --> 00:07:43.480
<v Speaker 3>Now, you just line them up: square one, square two,

163
00:07:43.480 --> 00:07:45.199
<v Speaker 3>square three, from top left to bottom right.

164
00:07:45.319 --> 00:07:47.360
<v Speaker 2>You flatten the grid into a line. Exactly.

165
00:07:47.399 --> 00:07:50.439
<v Speaker 3>They turn the image into a sentence of visual words.

166
00:07:50.480 --> 00:07:53.199
<v Speaker 3>They call them patches. And once you did that, once

167
00:07:53.240 --> 00:07:55.720
<v Speaker 3>you turn the image into a sequence of patches, the

168
00:07:55.759 --> 00:07:57.759
<v Speaker 3>transformer looked at it and said, I know what to

169
00:07:57.800 --> 00:07:58.120
<v Speaker 3>do with this.

170
00:07:58.360 --> 00:08:00.920
<v Speaker 2>Because to the transformer, a patch of pixels is just

171
00:08:00.959 --> 00:08:03.160
<v Speaker 2>another token, the same way a word is a token.
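
NOTE
Editorial sketch, not part of the spoken audio. The cut-into-squares-and-line-
them-up step described above can be shown with plain Python lists. The function
name image_to_patch_sequence and the tiny 4x4 image are hypothetical, and the
sketch assumes the image dimensions divide evenly by the patch size. Real vision
transformers also project each patch through a learned layer and add position
information, which is omitted here.
```python
def image_to_patch_sequence(image, patch):
    """Cut a 2D list into patch x patch squares, ordered top-left to bottom-right."""
    rows, cols = len(image), len(image[0])
    sequence = []
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            # Flatten one square into a single list: one "visual word".
            square = [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            sequence.append(square)
    return sequence
# Hypothetical 4x4 "image" whose pixel values are simply 0..15, cut into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_sequence(image, 2)
print(len(tokens), tokens[0], tokens[-1])
```
Each inner list plays the role of one token in the sequence the transformer sees.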

172
00:08:03.279 --> 00:08:06.959
<v Speaker 3>Precisely. That is the everything-is-a-token realization. And

173
00:08:07.000 --> 00:08:10.480
<v Speaker 3>it didn't stop at images. Audio? That's just a sequence

174
00:08:10.519 --> 00:08:14.759
<v Speaker 3>of spectrogram slices. Video? That's just a sequence of frames

175
00:08:14.800 --> 00:08:17.360
<v Speaker 3>in temporal order. Even code or molecules.

176
00:08:17.439 --> 00:08:20.680
<v Speaker 2>So the machine stops seeing image versus text versus audio,

177
00:08:21.000 --> 00:08:24.000
<v Speaker 2>and just starts seeing data stream versus data stream.

178
00:08:24.079 --> 00:08:27.879
<v Speaker 3>Correct. It was like discovering that French, Mandarin and mathematics

179
00:08:27.920 --> 00:08:31.319
<v Speaker 3>are all actually dialects of the same underlying language. Once

180
00:08:31.360 --> 00:08:34.120
<v Speaker 3>they realized that the transformer could handle all of these

181
00:08:34.159 --> 00:08:39.360
<v Speaker 3>as sequences, the barrier between the senses just it evaporated.

182
00:08:39.440 --> 00:08:42.080
<v Speaker 2>That is wild. So the architecture was the lock, and

183
00:08:42.080 --> 00:08:45.519
<v Speaker 2>this idea of tokenization was the key that fit everything.

184
00:08:45.639 --> 00:08:47.799
<v Speaker 3>That's a beautiful way to put it. And once that

185
00:08:47.919 --> 00:08:52.120
<v Speaker 3>architectural problem was solved, the floodgates opened. We moved into

186
00:08:52.159 --> 00:08:55.279
<v Speaker 3>this phase of connecting the dots, of teaching these different

187
00:08:55.279 --> 00:08:56.679
<v Speaker 3>senses to talk to each other.

188
00:08:56.799 --> 00:08:59.759
<v Speaker 2>Okay, I get the architecture. That makes sense. We can

189
00:08:59.759 --> 00:09:02.759
<v Speaker 2>now feed everything into the same kind of machine. But

190
00:09:02.840 --> 00:09:06.720
<v Speaker 2>I'm still stuck on the understanding part. Just because I

191
00:09:06.759 --> 00:09:08.679
<v Speaker 2>feed a picture of a dog and the word dog

192
00:09:08.720 --> 00:09:11.080
<v Speaker 2>into the same machine, how does the machine know they

193
00:09:11.080 --> 00:09:13.279
<v Speaker 2>refer to the same thing. Surely it's not just looking

194
00:09:13.320 --> 00:09:14.159
<v Speaker 2>it up in a dictionary.

195
00:09:14.279 --> 00:09:17.120
<v Speaker 3>No, no, it's not a lookup table at all. It's geometry.

196
00:09:17.200 --> 00:09:19.480
<v Speaker 2>Geometry. You're gonna have to explain that one. How does

197
00:09:19.480 --> 00:09:21.600
<v Speaker 2>a picture of a dog become geometry?

198
00:09:21.720 --> 00:09:24.200
<v Speaker 3>This is where we have to talk about vectors and

199
00:09:24.360 --> 00:09:26.600
<v Speaker 3>high dimensional space, and to do that we have to

200
00:09:26.600 --> 00:09:29.240
<v Speaker 3>talk about how these things are actually trained. The most

201
00:09:29.279 --> 00:09:33.919
<v Speaker 3>famous example is a model called CLIP, from OpenAI.

202
00:09:34.120 --> 00:09:37.120
<v Speaker 2>CLIP. I've seen that mentioned. Contrastive Language-Image Pre-training.

203
00:09:37.399 --> 00:09:41.480
<v Speaker 3>It's a mouthful, but the concept is really elegant. Imagine

204
00:09:41.519 --> 00:09:44.639
<v Speaker 3>you have a massive bucket of data, and I'm talking

205
00:09:44.840 --> 00:09:48.840
<v Speaker 3>four hundred million images scraped from the Internet and the

206
00:09:48.840 --> 00:09:50.240
<v Speaker 3>text captions that came with them.

207
00:09:50.480 --> 00:09:54.519
<v Speaker 2>So like IMG zero zero one dot jpeg and the

208
00:09:54.559 --> 00:09:57.960
<v Speaker 2>alt text that says a golden retriever catching a frisbee

209
00:09:58.000 --> 00:09:58.440
<v Speaker 2>on the beach.

210
00:09:58.799 --> 00:10:01.720
<v Speaker 3>Right now, you start with a blank brain. It knows nothing.

211
00:10:01.799 --> 00:10:04.720
<v Speaker 3>You show it the image and you show it the text. Initially,

212
00:10:04.759 --> 00:10:08.039
<v Speaker 3>the machine thinks these are totally unrelated things. It turns

213
00:10:08.039 --> 00:10:10.120
<v Speaker 3>the image into a set of numbers. We call that

214
00:10:10.159 --> 00:10:12.480
<v Speaker 3>a vector, and it turns the text into another set

215
00:10:12.519 --> 00:10:15.480
<v Speaker 3>of numbers. And those numbers are in this mathematical space,

216
00:10:15.720 --> 00:10:18.639
<v Speaker 3>nowhere near each other. They're strangers on the map. Total strangers.

217
00:10:19.120 --> 00:10:22.440
<v Speaker 3>But then you apply something called contrastive loss. This is

218
00:10:22.480 --> 00:10:26.960
<v Speaker 3>the training mechanism. You essentially punish the machine. You say, hey,

219
00:10:27.639 --> 00:10:31.279
<v Speaker 3>these two sets of numbers, they belong together. Pull them closer.

220
00:10:31.360 --> 00:10:33.399
<v Speaker 2>You're forcing them to be neighbors. Exactly.

221
00:10:33.440 --> 00:10:36.320
<v Speaker 3>And simultaneously you show it the text a golden retriever

222
00:10:36.440 --> 00:10:39.039
<v Speaker 3>catching a frisbee and a picture of a toaster, and

223
00:10:39.120 --> 00:10:41.399
<v Speaker 3>you say, push these apart. These are not the same.

224
00:10:41.720 --> 00:10:43.639
<v Speaker 3>These live on opposite sides of the universe.
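
NOTE
Editorial sketch, not part of the spoken audio. The pull-together, push-apart
training signal described above can be approximated with a small softmax-based
contrastive loss. This is a simplified, one-directional version (image to text
only) with a fixed temperature of 0.1 chosen for illustration; it is not CLIP's
actual training code. The 2D embeddings and all names are hypothetical.
```python
import math
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Average cross-entropy of picking the matching caption for each image."""
    total = 0.0
    for i, img in enumerate(image_vecs):
        logits = [cosine(img, txt) / temperature for txt in text_vecs]
        log_sum = math.log(sum(math.exp(z) for z in logits))
        # The loss is low when the matching pair (index i) is the most similar.
        total += log_sum - logits[i]
    return total / len(image_vecs)
# Hypothetical 2D embeddings: index 0 is the dog photo and caption, index 1 the toaster.
images = [[1.0, 0.1], [0.1, 1.0]]
aligned_text = [[0.9, 0.2], [0.2, 0.9]]
swapped_text = [[0.2, 0.9], [0.9, 0.2]]
loss_aligned = contrastive_loss(images, aligned_text)
loss_swapped = contrastive_loss(images, swapped_text)
print(loss_aligned < loss_swapped)
```
Training nudges the encoders so that the loss on real pairs moves toward the
aligned case, which is the "punishment" the speaker describes.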

225
00:10:43.720 --> 00:10:47.120
<v Speaker 2>So it's this constant game of hot and cold, pushing

226
00:10:47.159 --> 00:10:48.000
<v Speaker 2>and pulling.

227
00:10:47.960 --> 00:10:51.240
<v Speaker 3>Done billions and billions of times, over and over, and

228
00:10:51.279 --> 00:10:53.799
<v Speaker 3>eventually the machine builds a map. We call it a

229
00:10:53.840 --> 00:10:57.799
<v Speaker 3>high dimensional vector space. Imagine a graph, but instead of

230
00:10:57.799 --> 00:11:02.039
<v Speaker 3>two or three axes, it has thousands. In this map,

231
00:11:02.120 --> 00:11:05.840
<v Speaker 3>the coordinates for the visual pattern of fur, floppy ears,

232
00:11:05.879 --> 00:11:09.200
<v Speaker 3>and tail end up located at the exact same coordinates

233
00:11:09.240 --> 00:11:11.440
<v Speaker 3>as the linguistic pattern for the word dog.

234
00:11:11.879 --> 00:11:14.840
<v Speaker 2>Wow. So it's not using a dictionary. It's not looking

235
00:11:14.960 --> 00:11:18.639
<v Speaker 2>up dog equals animal. It's mapping the concept of dogness

236
00:11:18.919 --> 00:11:22.080
<v Speaker 2>to a specific location in this massive, invisible space.

237
00:11:22.200 --> 00:11:24.360
<v Speaker 3>Yes, and this is why it feels like it understands

238
00:11:24.399 --> 00:11:27.240
<v Speaker 3>because that space has geometry. It has a kind of logic.

239
00:11:27.360 --> 00:11:29.759
<v Speaker 2>Okay, give me an example of that logic, because logic

240
00:11:29.799 --> 00:11:32.240
<v Speaker 2>implies it can do reasoning, not just matching.

241
00:11:32.440 --> 00:11:35.480
<v Speaker 3>Okay, think about the classic relationship between king and queen

242
00:11:35.720 --> 00:11:38.240
<v Speaker 3>in text. If you take the math vector for the

243
00:11:38.240 --> 00:11:40.919
<v Speaker 3>word king, subtract the vector for man, and then add

244
00:11:40.960 --> 00:11:43.879
<v Speaker 3>the vector for woman, you land almost perfectly on the

245
00:11:43.960 --> 00:11:44.679
<v Speaker 3>vector for queen.

246
00:11:44.919 --> 00:11:47.960
<v Speaker 2>Right. That's the famous example King minus man plus woman

247
00:11:48.000 --> 00:11:50.120
<v Speaker 2>equals queen. It's like vector arithmetic.
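
NOTE
Editorial sketch, not part of the spoken audio. The king minus man plus woman
arithmetic can be reproduced on toy data. The vectors below are hand-made and
hypothetical, with one axis loosely standing for royalty and one for gender;
real embeddings are learned, have hundreds or thousands of dimensions, and only
approximately satisfy this relationship.
```python
import math
# Hypothetical 2D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "toaster": [-0.8, 0.0],
}
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
def nearest(target, exclude):
    """Return the vocabulary word whose vector points most nearly along target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))
# king - man + woman, computed coordinate by coordinate.
analogy = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
result = nearest(analogy, exclude={"king", "man", "woman"})
print(result)
```
The input words are excluded from the search, as is conventional for this kind
of analogy test.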

248
00:11:50.440 --> 00:11:53.399
<v Speaker 3>Now do it with images. If you take the visual

249
00:11:53.519 --> 00:11:55.799
<v Speaker 3>vector of a king, a photo of a guy in

250
00:11:55.840 --> 00:11:59.879
<v Speaker 3>a crown, subtract the visual features that represent man, and

251
00:12:00.200 --> 00:12:05.200
<v Speaker 3>add the visual features that represent woman, the machine generates

252
00:12:05.240 --> 00:12:06.240
<v Speaker 3>an image of a queen.

253
00:12:06.240 --> 00:12:11.799
<v Speaker 2>That is mind blowing. The logic, the geometric relationship,

254
00:12:12.679 --> 00:12:15.120
<v Speaker 2>it holds up across the senses. It does.

255
00:12:15.279 --> 00:12:17.600
<v Speaker 3>It means the machine has found a concept layer that

256
00:12:17.679 --> 00:12:21.039
<v Speaker 3>sits deeper than language and deeper than pixels. It has

257
00:12:21.080 --> 00:12:22.679
<v Speaker 3>found the meaning that connects them.

258
00:12:22.759 --> 00:12:24.440
<v Speaker 2>It's performing analogical reasoning.

259
00:12:24.519 --> 00:12:26.720
<v Speaker 3>It is. That's how the system can look at a

260
00:12:26.720 --> 00:12:29.159
<v Speaker 3>photo of a funeral and connect it to the text

261
00:12:29.360 --> 00:12:32.159
<v Speaker 3>a moment of grief. It's not because it memorized that

262
00:12:32.240 --> 00:12:35.720
<v Speaker 3>specific photo and caption pair. It's because the visual information

263
00:12:35.759 --> 00:12:38.120
<v Speaker 3>in the funeral photo and the concept of grief from

264
00:12:38.159 --> 00:12:40.960
<v Speaker 3>the text live in the same emotional region of this

265
00:12:41.080 --> 00:12:42.039
<v Speaker 3>mathematical space.

266
00:12:42.120 --> 00:12:44.279
<v Speaker 2>It's math, the geometry of sadness.

267
00:12:43.799 --> 00:12:46.840
<v Speaker 3>In a mathematical sense. Yes, it has aligned the visual

268
00:12:46.840 --> 00:12:49.519
<v Speaker 3>features of sadness with the linguistic features of sadness.

269
00:12:49.600 --> 00:12:52.320
<v Speaker 2>That explains so much about why these systems feel like

270
00:12:52.360 --> 00:12:55.039
<v Speaker 2>they get it. They aren't just matching keywords. They are

271
00:12:55.120 --> 00:12:56.840
<v Speaker 2>navigating a map of meaning.

272
00:12:56.879 --> 00:12:59.759
<v Speaker 3>And usually the architecture that runs this map, the sort

273
00:12:59.799 --> 00:13:02.879
<v Speaker 3>of central brain, is a large language model. You have

274
00:13:02.960 --> 00:13:05.519
<v Speaker 3>these specialized encoders. You can think of them as the

275
00:13:05.559 --> 00:13:08.919
<v Speaker 3>eyes and ears that project all this information into the brain.

276
00:13:09.799 --> 00:13:13.600
<v Speaker 3>The LLM does the reasoning in that shared space, and

277
00:13:13.639 --> 00:13:15.320
<v Speaker 3>then it can send information back out.

278
00:13:15.480 --> 00:13:17.720
<v Speaker 2>So the LLM is the conductor of the orchestra, making

279
00:13:17.720 --> 00:13:19.679
<v Speaker 2>sure the strings and the woodwinds are all playing from

280
00:13:19.679 --> 00:13:22.080
<v Speaker 2>the same sheet music. Precisely. All right, so we have

281
00:13:22.159 --> 00:13:24.720
<v Speaker 2>the history: the silos are gone. We have the science: it's

282
00:13:24.720 --> 00:13:27.080
<v Speaker 2>a geometry of concepts. Now I want to talk about

283
00:13:27.080 --> 00:13:30.720
<v Speaker 2>the utility, because cool math is great, but what can

284
00:13:30.799 --> 00:13:31.600
<v Speaker 2>this actually do?

285
00:13:32.000 --> 00:13:35.000
<v Speaker 3>The capabilities are substantial, and I think we should start

286
00:13:35.039 --> 00:13:37.960
<v Speaker 3>with what we can call vision language nuance, because we're

287
00:13:37.960 --> 00:13:40.159
<v Speaker 3>not just talking about identifying objects anymore.

288
00:13:40.279 --> 00:13:42.360
<v Speaker 2>Right. This isn't just drawing a box around a cat

289
00:13:42.440 --> 00:13:45.480
<v Speaker 2>and saying cat ninety nine percent confidence. That was like

290
00:13:45.639 --> 00:13:46.960
<v Speaker 2>twenty fifteen era AI.

291
00:13:47.159 --> 00:13:51.000
<v Speaker 3>No, no. Now it's about identifying relationships, emotional tenor. It

292
00:13:51.039 --> 00:13:52.720
<v Speaker 3>can look at a scene and say this is a

293
00:13:52.759 --> 00:13:55.960
<v Speaker 3>tense negotiation happening in a corporate boardroom based on the

294
00:13:55.960 --> 00:13:58.919
<v Speaker 3>body language, the lighting, the arrangement of people. But one

295
00:13:58.919 --> 00:14:03.360
<v Speaker 3>of the most practical superpowers is something called OCR integration

296
00:14:03.639 --> 00:14:05.279
<v Speaker 3>optical character recognition.

297
00:14:05.559 --> 00:14:08.639
<v Speaker 2>But OCR has been around since the nineties. My scanner

298
00:14:08.720 --> 00:14:11.279
<v Speaker 2>came with it. Why is this a big deal now?

299
00:14:11.519 --> 00:14:15.840
<v Speaker 3>Old OCR was dumb. It just scraped text off a page.

300
00:14:15.919 --> 00:14:17.600
<v Speaker 3>It didn't know where the text was or what it

301
00:14:17.639 --> 00:14:21.559
<v Speaker 3>meant in context. Multimodal AI reads the text in context.

302
00:14:21.600 --> 00:14:23.159
<v Speaker 3>It can look at a street sign in a photo,

303
00:14:23.279 --> 00:14:24.919
<v Speaker 3>read the sign, look at the cars, look at the

304
00:14:24.919 --> 00:14:27.720
<v Speaker 3>time of day, and tell you if parking is legal right now.

305
00:14:27.600 --> 00:14:30.720
<v Speaker 2>Or, to go back to our earlier point, it

306
00:14:30.799 --> 00:14:33.519
<v Speaker 2>could read a handwritten note on a medical scan and

307
00:14:33.679 --> 00:14:36.080
<v Speaker 2>understand how that note relates to the X-ray itself.

308
00:14:35.879 --> 00:14:40.200
<v Speaker 3>Exactly, which segues perfectly into the second big capability,

309
00:14:40.480 --> 00:14:43.120
<v Speaker 3>document understanding. This is what some people are calling the

310
00:14:43.519 --> 00:14:44.759
<v Speaker 3>ultimate office assistant.

311
00:14:45.080 --> 00:14:46.240
<v Speaker 2>This is the one that I think is going to

312
00:14:46.320 --> 00:14:48.399
<v Speaker 2>change a lot of white collar work. I want you

313
00:14:48.440 --> 00:14:51.799
<v Speaker 2>to walk me through a scenario here, because I deal

314
00:14:51.840 --> 00:14:55.240
<v Speaker 2>with PDFs all day and they are where data goes

315
00:14:55.279 --> 00:14:55.600
<v Speaker 2>to die.

316
00:14:55.759 --> 00:14:59.320
<v Speaker 3>Okay, picture this. You have a fifty page annual report.

317
00:14:59.519 --> 00:15:03.639
<v Speaker 3>It's got three columns of text, complex bar charts, photos

318
00:15:03.639 --> 00:15:08.759
<v Speaker 3>with captions, footnotes. For old AI, that was a complete nightmare.

319
00:15:08.840 --> 00:15:12.039
<v Speaker 3>The text would get jumbled, the chart was invisible.

320
00:15:11.600 --> 00:15:13.600
<v Speaker 2>It was soup. You'd copy paste it into a text

321
00:15:13.639 --> 00:15:15.279
<v Speaker 2>file and just get absolute garbage. Right.

322
00:15:15.639 --> 00:15:19.120
<v Speaker 3>But a multimodal system sees the document like a human does.

323
00:15:19.159 --> 00:15:21.440
<v Speaker 3>It understands the layout. It can look at the bar chart,

324
00:15:21.720 --> 00:15:24.519
<v Speaker 3>extract the data from the visual bars, I mean literally

325
00:15:24.559 --> 00:15:27.360
<v Speaker 3>measuring the pixels of the bars, read the surrounding text,

326
00:15:27.519 --> 00:15:30.600
<v Speaker 3>understand what that data means, and answer a question like,

327
00:15:30.799 --> 00:15:32.960
<v Speaker 3>"Based on the chart on page three, which quarter had

328
00:15:32.960 --> 00:15:34.519
<v Speaker 3>the highest revenue?"

329
00:15:34.080 --> 00:15:36.919
<v Speaker 2>Without a human having to manually turn that chart into

330
00:15:36.960 --> 00:15:39.519
<v Speaker 2>an Excel sheet first. Zero preprocessing.

331
00:15:39.559 --> 00:15:40.960
<v Speaker 3>It just looks and understands.

332
00:15:41.000 --> 00:15:44.919
<v Speaker 2>That's incredible. It basically unlocks all the information that is

333
00:15:44.960 --> 00:15:48.720
<v Speaker 2>trapped inside images within documents. What about audio and video?

334
00:15:48.840 --> 00:15:51.240
<v Speaker 2>You mentioned earlier that video is just a sequence of frames.

335
00:15:51.519 --> 00:15:56.000
<v Speaker 3>Audio and video are huge frontiers now. In audio, we

336
00:15:56.039 --> 00:15:59.799
<v Speaker 3>aren't just transcribing speech to text anymore. We are analyzing

337
00:15:59.840 --> 00:16:03.919
<v Speaker 3>the vocal characteristics. The system can detect emotion. Is the

338
00:16:03.919 --> 00:16:08.600
<v Speaker 3>speaker angry, nervous, sarcastically happy?

339
00:16:08.720 --> 00:16:10.480
<v Speaker 2>It can hear the scare quotes in your voice?

340
00:16:10.559 --> 00:16:13.600
<v Speaker 3>It can, absolutely. And it can analyze music, not just

341
00:16:13.639 --> 00:16:16.879
<v Speaker 3>the genre, but the rhythm, the mood, the instrumentation. When

342
00:16:16.919 --> 00:16:19.840
<v Speaker 3>you combine that with video, you get narrative understanding. It

343
00:16:19.840 --> 00:16:22.360
<v Speaker 3>can track events over time and start to build a story.

344
00:16:22.559 --> 00:16:24.759
<v Speaker 2>But the real magic, and the research we looked at

345
00:16:24.840 --> 00:16:28.440
<v Speaker 2>was really emphatic about this, is the killer app of

346
00:16:28.559 --> 00:16:31.200
<v Speaker 2>true integration. It's not just being good at video

347
00:16:31.320 --> 00:16:33.200
<v Speaker 2>or good at text. It's the combo.

348
00:16:33.399 --> 00:16:35.399
<v Speaker 3>It is the synthesis. That's where the real power is.

349
00:16:35.519 --> 00:16:37.960
<v Speaker 3>Let's look at a coding scenario. Imagine you're a developer.

350
00:16:38.039 --> 00:16:40.679
<v Speaker 3>You're stuck. You get some cryptic error message. You take

351
00:16:40.720 --> 00:16:42.960
<v Speaker 3>a screenshot of your error message. You just paste it

352
00:16:43.000 --> 00:16:45.679
<v Speaker 3>into the AI. The AI reads the screenshot, looks at

353
00:16:45.679 --> 00:16:49.320
<v Speaker 3>your actual code file, consults the official software documentation online,

354
00:16:49.440 --> 00:16:51.039
<v Speaker 3>and synthesizes an answer.

355
00:16:51.240 --> 00:16:53.759
<v Speaker 2>So it's using its eyes and its reading comprehension at

356
00:16:53.759 --> 00:16:56.039
<v Speaker 2>the exact same time to solve one problem.

357
00:16:56.440 --> 00:17:00.159
<v Speaker 3>Or take medicine, the radiologist assistant idea. It looks at

358
00:17:00.200 --> 00:17:04.000
<v Speaker 3>the CT scan, that's vision. It reads the patient's history notes,

359
00:17:04.039 --> 00:17:07.480
<v Speaker 3>that's text. It checks the latest research papers from medical journals.

360
00:17:07.480 --> 00:17:11.079
<v Speaker 3>More text, and it synthesizes a potential diagnosis based on

361
00:17:11.200 --> 00:17:12.359
<v Speaker 3>all three modalities.

362
00:17:12.440 --> 00:17:15.359
<v Speaker 2>It becomes the ultimate second opinion engine.

363
00:17:15.480 --> 00:17:18.880
<v Speaker 3>Right. Or a final example, in design: you sketch a

364
00:17:18.960 --> 00:17:20.880
<v Speaker 3>rough idea for an app on a napkin. You take

365
00:17:20.920 --> 00:17:23.119
<v Speaker 3>a photo, you upload it, and you say, "Make this

366
00:17:23.160 --> 00:17:26.440
<v Speaker 3>look like a sleek modern app interface, but use our

367
00:17:26.480 --> 00:17:30.319
<v Speaker 3>official brand colors from this attached PDF." It sees the sketch,

368
00:17:30.400 --> 00:17:32.720
<v Speaker 3>it reads your brief, it consults the PDF for the

369
00:17:32.759 --> 00:17:35.400
<v Speaker 3>color codes, and it generates the final image.

370
00:17:35.440 --> 00:17:39.039
<v Speaker 2>It's closing the loop between idea, instruction, and creation. It

371
00:17:39.039 --> 00:17:42.039
<v Speaker 2>feels like we're getting closer to that Jarvis-from-Iron-Man fantasy,

372
00:17:42.079 --> 00:17:44.000
<v Speaker 2>the AI system that just handles it.

373
00:17:44.119 --> 00:17:46.160
<v Speaker 3>We are getting closer. But, and this is a very,

374
00:17:46.240 --> 00:17:48.039
<v Speaker 3>very big but, we have to talk about where it

375
00:17:48.079 --> 00:17:50.599
<v Speaker 3>breaks because it is not Jarvis yet, and you and

376
00:17:50.640 --> 00:17:53.079
<v Speaker 3>I need to be clear that this isn't magic. It

377
00:17:53.119 --> 00:17:55.079
<v Speaker 3>breaks in some surprisingly dumb ways.

378
00:17:55.279 --> 00:17:57.000
<v Speaker 2>I do want to play the skeptic here for a minute,

379
00:17:57.039 --> 00:18:01.240
<v Speaker 2>because it sounds perfect, but I know it's not. Where

380
00:18:01.240 --> 00:18:03.559
<v Speaker 2>does the machine stumble? What trips it up?

381
00:18:04.160 --> 00:18:08.079
<v Speaker 3>It stumbles in some surprisingly fundamental areas. The first one,

382
00:18:08.240 --> 00:18:11.599
<v Speaker 3>and this is almost ironic, is spatial reasoning.

383
00:18:11.400 --> 00:18:13.839
<v Speaker 2>Which is funny, right? Because you'd think a computer vision system

384
00:18:13.880 --> 00:18:14.920
<v Speaker 2>would be great at space.

385
00:18:15.319 --> 00:18:19.160
<v Speaker 3>It sees pixels, you would think. But remember, these systems

386
00:18:19.160 --> 00:18:22.440
<v Speaker 3>are trained on flat 2D images from the internet.

387
00:18:22.880 --> 00:18:25.759
<v Speaker 3>They struggle to build an intuitive 3D model of

388
00:18:25.799 --> 00:18:28.559
<v Speaker 3>the world. If you show it a picture of a

389
00:18:28.559 --> 00:18:30.839
<v Speaker 3>table with a messy pile of objects, and you ask,

390
00:18:31.200 --> 00:18:33.440
<v Speaker 3>"Is the apple behind the book or in front of it?"

391
00:18:33.440 --> 00:18:34.799
<v Speaker 3>it often gets really confused.

392
00:18:34.839 --> 00:18:36.839
<v Speaker 2>It sees the pixels of the apple and the pixels

393
00:18:36.839 --> 00:18:40.000
<v Speaker 2>of the book, but it doesn't get the depth, the

394
00:18:40.039 --> 00:18:42.160
<v Speaker 2>physics of one object occluding another.

395
00:18:42.240 --> 00:18:44.839
<v Speaker 3>It lacks a physics engine in its head. It doesn't

396
00:18:44.880 --> 00:18:48.359
<v Speaker 3>intuitively understand that solid objects occupy space and can't pass

397
00:18:48.400 --> 00:18:51.480
<v Speaker 3>through each other. And this is a massive problem for robotics.

398
00:18:52.119 --> 00:18:54.119
<v Speaker 3>If you want a robot to clean your kitchen, it

399
00:18:54.200 --> 00:18:56.519
<v Speaker 3>needs to know exactly where the cup is relative to

400
00:18:56.559 --> 00:18:59.480
<v Speaker 3>the edge of the table. Close enough isn't good enough

401
00:18:59.480 --> 00:19:01.200
<v Speaker 3>when you're handling fine china.

402
00:19:01.279 --> 00:19:03.119
<v Speaker 2>That makes a ton of sense. It sees the picture,

403
00:19:03.440 --> 00:19:06.680
<v Speaker 2>but it doesn't understand the physical reality behind the picture.

404
00:19:07.160 --> 00:19:10.279
<v Speaker 3>Exactly. Then there is temporal understanding. We can process video as

405
00:19:10.279 --> 00:19:13.880
<v Speaker 3>a stream of frames, but understanding causality over time is

406
00:19:14.039 --> 00:19:15.519
<v Speaker 3>really hard for these models.

407
00:19:15.519 --> 00:19:18.519
<v Speaker 2>Causality, you mean, like the glass broke because the ball hit it?

408
00:19:18.400 --> 00:19:22.839
<v Speaker 3>Exactly that. Or even following a complex argument

409
00:19:22.839 --> 00:19:26.000
<v Speaker 3>in a lecture. If I say A happened, which led

410
00:19:26.039 --> 00:19:28.519
<v Speaker 3>to B, but then C came along and prevented D

411
00:19:28.599 --> 00:19:32.119
<v Speaker 3>from occurring, the AI might track all the nouns but

412
00:19:32.240 --> 00:19:35.319
<v Speaker 3>lose the thread of the logic. That question, what happened

413
00:19:35.319 --> 00:19:39.680
<v Speaker 3>because of what, is a surprisingly difficult cognitive task for it.

414
00:19:39.680 --> 00:19:42.680
<v Speaker 2>It's the difference between seeing a series of snapshots and

415
00:19:42.759 --> 00:19:45.039
<v Speaker 2>understanding a story with a plot.

416
00:19:45.440 --> 00:19:48.119
<v Speaker 3>Precisely. And then we have to talk about the big one: hallucination.

417
00:19:48.400 --> 00:19:51.039
<v Speaker 2>We know text models lie, they make up facts, they

418
00:19:51.039 --> 00:19:54.440
<v Speaker 2>make up sources. Do multimodal models lie in the same way?

419
00:19:54.279 --> 00:19:58.279
<v Speaker 3>They lie in new and excitingly creative ways. We

420
00:19:58.359 --> 00:20:00.000
<v Speaker 3>call it multimodal hallucination.

421
00:20:00.440 --> 00:20:02.599
<v Speaker 2>That sounds terrifying. Give me an example of what that

422
00:20:02.640 --> 00:20:03.039
<v Speaker 2>looks like.

423
00:20:03.240 --> 00:20:06.000
<v Speaker 3>It can be. A system might read the text in

424
00:20:06.039 --> 00:20:09.240
<v Speaker 3>a financial report perfectly, but then look at a graph

425
00:20:09.319 --> 00:20:12.400
<v Speaker 3>on the same page and completely hallucinate a trend that isn't there.

426
00:20:12.839 --> 00:20:16.119
<v Speaker 3>It might say sales are trending upwards when the line

427
00:20:16.200 --> 00:20:19.759
<v Speaker 3>is clearly going down. Whoa. Or it might describe a

428
00:20:19.759 --> 00:20:23.240
<v Speaker 3>photo and just invent details about objects that are partially hidden.

429
00:20:23.599 --> 00:20:27.000
<v Speaker 2>So it sees half a car behind a building and says,

430
00:20:27.680 --> 00:20:30.000
<v Speaker 2>there is a red convertible with a dog in the

431
00:20:30.000 --> 00:20:33.319
<v Speaker 2>back seat, even though it can't possibly see the back seat.

432
00:20:33.440 --> 00:20:37.559
<v Speaker 3>Right, it's confabulating based on probability. It thinks, "Well, usually cars

433
00:20:37.599 --> 00:20:39.839
<v Speaker 3>have things in back seats," and it just fills in

434
00:20:39.880 --> 00:20:42.720
<v Speaker 3>the blanks with a plausible story. In a creative context,

435
00:20:42.720 --> 00:20:45.920
<v Speaker 3>you might call that imagination. But in a medical or

436
00:20:45.960 --> 00:20:48.160
<v Speaker 3>a legal context, that's malpractice.

437
00:20:48.279 --> 00:20:51.480
<v Speaker 2>That's a crucial distinction. It's just guessing, and sometimes a

438
00:20:51.559 --> 00:20:54.559
<v Speaker 2>guess is wrong with total unwavering confidence.

439
00:20:54.640 --> 00:20:57.279
<v Speaker 3>And there's also the problem of compositionality. This is what

440
00:20:57.319 --> 00:20:59.440
<v Speaker 3>I like to call the cat-annoyed-at-the-dog problem.

441
00:20:59.599 --> 00:21:02.279
<v Speaker 3>Explain that one. Identifying a cat and a dog in a picture

442
00:21:02.319 --> 00:21:06.799
<v Speaker 3>is easy. That's basic object recognition. But understanding their relationship,

443
00:21:07.119 --> 00:21:10.000
<v Speaker 3>"the cat is annoyed at the dog," requires understanding subtle

444
00:21:10.119 --> 00:21:13.720
<v Speaker 3>cues and the interaction between the two. Current systems often

445
00:21:13.759 --> 00:21:17.000
<v Speaker 3>struggle to bind those attributes correctly. They might see a

446
00:21:17.000 --> 00:21:19.519
<v Speaker 3>happy dog and an angry cat and output the sentence

447
00:21:19.559 --> 00:21:21.960
<v Speaker 3>"an angry dog and a happy cat." They mix up

448
00:21:21.960 --> 00:21:23.799
<v Speaker 3>who owns which emotion.

449
00:21:23.759 --> 00:21:25.440
<v Speaker 2>So they get all the ingredients right, but they get the

450
00:21:25.480 --> 00:21:26.599
<v Speaker 2>recipe completely wrong.

451
00:21:26.880 --> 00:21:31.440
<v Speaker 3>A perfect analogy. And finally, the deepest and most philosophical

452
00:21:31.480 --> 00:21:35.920
<v Speaker 3>limitation: grounding, or what we can call the fire problem.

453
00:21:35.960 --> 00:21:37.799
<v Speaker 2>This was the part of the research that really stuck

454
00:21:37.799 --> 00:21:40.599
<v Speaker 2>with me, the idea that the AI knows fire, but

455
00:21:40.680 --> 00:21:41.759
<v Speaker 2>it doesn't know fire.

456
00:21:41.960 --> 00:21:45.680
<v Speaker 3>It's the gap between data and experience. The AI has

457
00:21:45.680 --> 00:21:48.400
<v Speaker 3>seen a billion pixels of fire, it has read a

458
00:21:48.480 --> 00:21:52.000
<v Speaker 3>trillion words about heat, burning, smoke, and danger, but it

459
00:21:52.039 --> 00:21:55.240
<v Speaker 3>has never felt heat. It has never reflexively pulled its

460
00:21:55.279 --> 00:21:58.359
<v Speaker 3>hand away from a hot stove. It lacks sensory consequence.

461
00:21:58.519 --> 00:22:00.920
<v Speaker 2>But does that matter? I've never been to Mars, but

462
00:22:00.960 --> 00:22:02.319
<v Speaker 2>I feel like I know a lot about it. I

463
00:22:02.400 --> 00:22:04.920
<v Speaker 2>learned it all from books and pictures. If the AI

464
00:22:05.039 --> 00:22:08.640
<v Speaker 2>tells me, "Don't touch the fire, it's dangerous," does it

465
00:22:08.680 --> 00:22:10.559
<v Speaker 2>matter that it's never been burned itself?

466
00:22:11.559 --> 00:22:14.400
<v Speaker 3>That is the big counter argument. Maybe you don't need

467
00:22:14.440 --> 00:22:17.559
<v Speaker 3>a body to understand, but there is a very strong

468
00:22:17.640 --> 00:22:23.240
<v Speaker 3>hypothesis in cognitive science that true intelligence requires embodiment. That

469
00:22:23.319 --> 00:22:25.960
<v Speaker 3>you can't really think about the physical world unless you

470
00:22:25.960 --> 00:22:29.519
<v Speaker 3>have a body that risks being hurt by it. If

471
00:22:29.519 --> 00:22:32.720
<v Speaker 3>you don't fear the fire, do you really understand danger

472
00:22:33.720 --> 00:22:36.480
<v Speaker 3>or do you just know the statistical correlation between the

473
00:22:36.519 --> 00:22:38.200
<v Speaker 3>token "danger" and the token "fire"?

474
00:22:38.319 --> 00:22:40.799
<v Speaker 2>That's deep. We should definitely circle back to that later,

475
00:22:41.200 --> 00:22:43.480
<v Speaker 2>but first let's look at where this is hitting the

476
00:22:43.480 --> 00:22:46.440
<v Speaker 2>ground right now. Despite the hallucinations and the lack of

477
00:22:46.480 --> 00:22:49.400
<v Speaker 2>a body, where are these systems actually transforming the world today?

478
00:22:50.039 --> 00:22:52.400
<v Speaker 3>Medicine is the big one. I really can't overstate this.

479
00:22:52.480 --> 00:22:55.240
<v Speaker 3>Medicine has always always been multimodal.

480
00:22:55.519 --> 00:22:57.759
<v Speaker 2>Right, you go to the doctor, they look at you,

481
00:22:57.880 --> 00:23:00.799
<v Speaker 2>they listen to your lungs, they read your chart, they

482
00:23:00.799 --> 00:23:04.079
<v Speaker 2>look at your lab results. It's a mix of everything.

483
00:23:04.359 --> 00:23:09.039
<v Speaker 3>Exactly. A doctor is, at their core, an information integrator. Multimodal

484
00:23:09.079 --> 00:23:12.200
<v Speaker 3>AI is the first tool that really matches that workflow.

485
00:23:12.640 --> 00:23:15.079
<v Speaker 3>It acts as a high speed second opinion. It's not

486
00:23:15.119 --> 00:23:18.839
<v Speaker 3>replacing the doctor's judgment, but it's synthesizing the data faster

487
00:23:18.920 --> 00:23:20.480
<v Speaker 3>than any human could ever hope to.

488
00:23:20.720 --> 00:23:21.960
<v Speaker 2>It's the ultimate intern.

489
00:23:22.079 --> 00:23:24.400
<v Speaker 3>It's an intern that has read every single medical paper

490
00:23:24.440 --> 00:23:29.440
<v Speaker 3>ever published. And in fields like pathology, radiology, genomics, it's starting

491
00:23:29.480 --> 00:23:32.400
<v Speaker 3>to find patterns that humans miss. It might see a

492
00:23:32.400 --> 00:23:35.359
<v Speaker 3>faint correlation between a genetic marker mentioned in the text

493
00:23:35.400 --> 00:23:38.119
<v Speaker 3>of a patient's file and a specific cell shape in

494
00:23:38.160 --> 00:23:40.880
<v Speaker 3>a microscopy image that a human would never connect because

495
00:23:40.920 --> 00:23:42.200
<v Speaker 3>the data is just too vast.

496
00:23:42.559 --> 00:23:45.039
<v Speaker 2>Then there is accessibility. This feels like one of the

497
00:23:45.039 --> 00:23:48.160
<v Speaker 2>most immediate and unambiguously positive impacts.

498
00:23:48.359 --> 00:23:51.599
<v Speaker 3>It is democratization on a massive scale. Think about what

499
00:23:51.640 --> 00:23:55.200
<v Speaker 3>this fluidity between senses means. If you are blind, the

500
00:23:55.240 --> 00:23:58.920
<v Speaker 3>world is opaque to visual signals. Multimodal AI can describe

501
00:23:58.920 --> 00:24:00.799
<v Speaker 3>the visual world to you in text or audio.

502
00:24:01.119 --> 00:24:03.799
<v Speaker 2>"There's a blue car approaching on your left," or "The

503
00:24:03.880 --> 00:24:04.839
<v Speaker 2>light just turned green."

504
00:24:04.559 --> 00:24:08.160
<v Speaker 3>Exactly, or even, "You are holding the can of

505
00:24:08.200 --> 00:24:11.799
<v Speaker 3>soup upside down." It gives you eyes. For deaf users,

506
00:24:11.880 --> 00:24:14.759
<v Speaker 3>it can translate audio to text, but also describe the

507
00:24:14.759 --> 00:24:18.119
<v Speaker 3>emotion in the speaker's voice. It bridges the gap between

508
00:24:18.160 --> 00:24:20.200
<v Speaker 3>the sense you have and the information you need.

509
00:24:20.359 --> 00:24:22.839
<v Speaker 2>It completely removes the friction of format.

510
00:24:22.880 --> 00:24:28.240
<v Speaker 3>Domain number three: science and education. In science, we are

511
00:24:28.319 --> 00:24:32.839
<v Speaker 3>absolutely drowning in data. We have microscopy images, protein structures,

512
00:24:32.880 --> 00:24:37.079
<v Speaker 3>satellite data, research papers. No single human can read it all.

513
00:24:37.160 --> 00:24:39.640
<v Speaker 2>So the AI becomes a kind of research partner.

514
00:24:39.680 --> 00:24:41.880
<v Speaker 3>It becomes an active participant. It can read all the

515
00:24:41.920 --> 00:24:44.960
<v Speaker 3>latest papers, look at all the new experimental slides, and say, "Hey,

516
00:24:45.400 --> 00:24:48.720
<v Speaker 3>this pattern in the satellite data over the Amazon matches

517
00:24:48.799 --> 00:24:51.839
<v Speaker 3>this obscure theory from a paper published in 1990."

518
00:24:52.240 --> 00:24:55.359
<v Speaker 3>It connects dots that are separated by decades and disciplines.

519
00:24:55.559 --> 00:24:57.799
<v Speaker 2>And in education, how does it play out there?

520
00:24:58.000 --> 00:25:02.160
<v Speaker 3>Responsive tutoring. Imagine an AI that watches a student solve a

521
00:25:02.200 --> 00:25:05.480
<v Speaker 3>math problem on a piece of paper, literally watches the

522
00:25:05.519 --> 00:25:08.359
<v Speaker 3>pen move through the camera and at the same time

523
00:25:09.000 --> 00:25:12.279
<v Speaker 3>listens to them talk through their reasoning. It can pinpoint

524
00:25:12.319 --> 00:25:14.599
<v Speaker 3>exactly where the logic broke down. It doesn't just say,

525
00:25:14.680 --> 00:25:17.039
<v Speaker 3>"Wrong answer." It says, "You forgot to carry the one

526
00:25:17.039 --> 00:25:18.279
<v Speaker 3>in the tens column, right here."

527
00:25:18.720 --> 00:25:21.359
<v Speaker 2>That's the difference between a textbook and a real teacher.

528
00:25:21.960 --> 00:25:25.640
<v Speaker 2>A teacher watches the process, not just the result.

529
00:25:25.920 --> 00:25:28.799
<v Speaker 3>It is. And finally, of course, creative work. This is the controversial one.

530
00:25:28.880 --> 00:25:31.680
<v Speaker 2>Text to image, text to video. We see this everywhere

531
00:25:31.720 --> 00:25:32.720
<v Speaker 2>now. It's exploded.

532
00:25:33.000 --> 00:25:36.400
<v Speaker 3>It has democratized visual creation in a way we've never seen.

533
00:25:36.599 --> 00:25:38.400
<v Speaker 3>You don't need to know how to draw or paint

534
00:25:38.400 --> 00:25:41.240
<v Speaker 3>to create a stunning image anymore. But it creates this

535
00:25:41.440 --> 00:25:45.240
<v Speaker 3>massive tension regarding the displacement of professionals and the ethics

536
00:25:45.240 --> 00:25:47.440
<v Speaker 3>of using copyrighted work in training data.

537
00:25:47.519 --> 00:25:50.960
<v Speaker 2>Artists are rightfully saying, "Hey, you trained this model

538
00:25:51.000 --> 00:25:53.680
<v Speaker 2>on my entire life's work without my permission, and now

539
00:25:53.680 --> 00:25:56.319
<v Speaker 2>it's competing with me for jobs."

540
00:25:56.160 --> 00:25:59.839
<v Speaker 3>And it's a completely valid conflict. There's no easy answer. But the

541
00:26:00.000 --> 00:26:02.920
<v Speaker 3>potential is also there for using these tools as extensions

542
00:26:02.920 --> 00:26:06.519
<v Speaker 3>of human vision. It's a tool that can amplify creativity,

543
00:26:06.559 --> 00:26:09.920
<v Speaker 3>not just replace it. A director can visualize an entire

544
00:26:10.000 --> 00:26:13.880
<v Speaker 3>storyboard in seconds. An architect can iterate on a dozen

545
00:26:13.960 --> 00:26:15.440
<v Speaker 3>building facades instantly.

546
00:26:15.759 --> 00:26:19.079
<v Speaker 2>I want to pivot back to that philosophical moment we

547
00:26:19.119 --> 00:26:22.240
<v Speaker 2>touched on earlier, the grief example. So this is the

548
00:26:22.279 --> 00:26:23.720
<v Speaker 2>one that really keeps me up at night.

549
00:26:23.920 --> 00:26:25.839
<v Speaker 3>Let's go back to it. It's the most important question,

550
00:26:25.880 --> 00:26:26.160
<v Speaker 3>I think.

551
00:26:26.319 --> 00:26:29.799
<v Speaker 2>Okay, So if the machine recognizes the funeral, it correctly

552
00:26:29.839 --> 00:26:32.960
<v Speaker 2>identifies the sadness in people's faces, and it writes a

553
00:26:33.000 --> 00:26:37.079
<v Speaker 2>poem about loss that makes me cry, has it understood grief?

554
00:26:37.599 --> 00:26:40.359
<v Speaker 3>This is the critical distinction between what we call behavioral

555
00:26:40.400 --> 00:26:43.640
<v Speaker 3>performance versus experiential understanding.

556
00:26:42.920 --> 00:26:45.799
<v Speaker 2>Behavior versus experience. Okay, break that down for me.

557
00:26:46.039 --> 00:26:50.400
<v Speaker 3>Behaviorally, yes, absolutely, it performed the task of understanding grief perfectly.

558
00:26:50.759 --> 00:26:54.519
<v Speaker 3>It recognized the symbols. It generated the appropriate linguistic response.

559
00:26:54.799 --> 00:26:57.559
<v Speaker 3>It passed the Turing test for sadness with flying colors.

560
00:26:57.720 --> 00:27:02.359
<v Speaker 3>But experientially? Experientially, no. It is a hollow shell. It

561
00:27:02.440 --> 00:27:05.480
<v Speaker 3>processes the symbols of grief without the referent. It has

562
00:27:05.559 --> 00:27:08.039
<v Speaker 3>the map, but it has never visited the territory. It

563
00:27:08.079 --> 00:27:10.640
<v Speaker 3>has never lost anyone. It has never felt that hollow

564
00:27:10.680 --> 00:27:12.200
<v Speaker 3>ache of absence in its chest.

565
00:27:12.359 --> 00:27:15.119
<v Speaker 2>So why does this distinction matter? I mean, if the

566
00:27:15.119 --> 00:27:17.000
<v Speaker 2>poem is good, who cares whether the poet is a

567
00:27:17.039 --> 00:27:19.640
<v Speaker 2>sad robot or a sad human? If the output is

568
00:27:19.680 --> 00:27:22.039
<v Speaker 2>the same, why does the internal state matter so much?

569
00:27:22.079 --> 00:27:25.079
<v Speaker 3>For writing a poem, maybe it doesn't matter. For generating

570
00:27:25.119 --> 00:27:28.559
<v Speaker 3>ad copy it definitely doesn't matter. But for moral judgment,

571
00:27:29.000 --> 00:27:32.960
<v Speaker 3>for empathy, for wisdom, that gap is critical. If we

572
00:27:33.039 --> 00:27:35.680
<v Speaker 3>ask an AI to make decisions about elder care or

573
00:27:35.759 --> 00:27:38.920
<v Speaker 3>legal sentencing or childcare, do we want a system that

574
00:27:38.960 --> 00:27:41.279
<v Speaker 3>just mimics wisdom or one that actually has it?

575
00:27:41.400 --> 00:27:43.720
<v Speaker 2>That is a chilling thought. We shouldn't mistake a high

576
00:27:43.759 --> 00:27:45.640
<v Speaker 2>confidence output for lived experience.

577
00:27:46.119 --> 00:27:50.240
<v Speaker 3>Exactly. A human radiologist brings years of seeing patients, of

578
00:27:50.319 --> 00:27:52.359
<v Speaker 3>knowing the fear in their eyes when they get a

579
00:27:52.400 --> 00:27:56.720
<v Speaker 3>bad diagnosis, of understanding the weight of that responsibility. The

580
00:27:56.759 --> 00:28:00.200
<v Speaker 3>AI brings pattern matching on a massive data set. Those

581
00:28:00.200 --> 00:28:02.279
<v Speaker 3>are not the same thing. Even if the diagnosis it

582
00:28:02.319 --> 00:28:05.079
<v Speaker 3>gives is correct. The AI doesn't care if the patient

583
00:28:05.119 --> 00:28:08.240
<v Speaker 3>lives or dies. It just cares about minimizing the loss

584
00:28:08.240 --> 00:28:09.440
<v Speaker 3>function in its training.

585
00:28:09.920 --> 00:28:12.240
<v Speaker 2>So where does this all go next? If this is

586
00:28:12.279 --> 00:28:15.480
<v Speaker 2>where we are now on the frontier, what's beyond the frontier?

587
00:28:15.519 --> 00:28:18.599
<v Speaker 3>The trajectory is becoming very clear. We are moving from

588
00:28:18.759 --> 00:28:21.799
<v Speaker 3>processing information to acting on it.

589
00:28:21.920 --> 00:28:24.319
<v Speaker 2>The agentic shift. I keep hearing this term popping up

590
00:28:24.359 --> 00:28:24.839
<v Speaker 2>more and more.

591
00:28:24.960 --> 00:28:28.680
<v Speaker 3>Yes. Right now, most people interact with AI by chatting:

592
00:28:29.039 --> 00:28:32.200
<v Speaker 3>"Write this for me," "Analyze this data." The next wave

593
00:28:32.319 --> 00:28:33.799
<v Speaker 3>is agents that do things.

594
00:28:33.920 --> 00:28:35.759
<v Speaker 2>So not just "Tell me how to book a flight,"

595
00:28:35.839 --> 00:28:39.400
<v Speaker 2>but literally "Book me the cheapest flight to Chicago next Tuesday."

596
00:28:39.519 --> 00:28:42.119
<v Speaker 3>Book the flight, email my boss to let them know

597
00:28:42.160 --> 00:28:45.119
<v Speaker 3>I'll be out, update my calendar, and order a car

598
00:28:45.200 --> 00:28:48.000
<v Speaker 3>to take me to the airport. These are systems that

599
00:28:48.039 --> 00:28:51.119
<v Speaker 3>will browse the web, operate software on your computer, and

600
00:28:51.240 --> 00:28:54.359
<v Speaker 3>execute code to accomplish goals. They will have eyes to

601
00:28:54.400 --> 00:28:58.039
<v Speaker 3>see the screen, and hands, whether virtual or robotic, to

602
00:28:58.079 --> 00:28:58.519
<v Speaker 3>click the buttons.

603
00:28:58.440 --> 00:29:01.680
<v Speaker 2>And robots? Real, physical robots in the world?

604
00:29:01.839 --> 00:29:05.359
<v Speaker 3>That's the physical manifestation of the same idea. Robots that

605
00:29:05.440 --> 00:29:09.079
<v Speaker 3>watch a human demonstrate a task, say folding laundry or

606
00:29:09.119 --> 00:29:12.119
<v Speaker 3>assembling a circuit board, and then replicate it. The visual

607
00:29:12.200 --> 00:29:15.759
<v Speaker 3>understanding guides the motor control. The senses are connected to

608
00:29:15.799 --> 00:29:16.279
<v Speaker 3>the limbs.

609
00:29:16.519 --> 00:29:18.319
<v Speaker 2>This raises the stakes significantly.

610
00:29:18.400 --> 00:29:21.319
<v Speaker 3>It completely changes the risk profile. A chatbot that

611
00:29:21.359 --> 00:29:24.880
<v Speaker 3>writes a bad poem is embarrassing. A robot that misunderstands

612
00:29:24.920 --> 00:29:26.920
<v Speaker 3>the command "clean up the kitchen" and throws out your

613
00:29:26.960 --> 00:29:28.839
<v Speaker 3>vital medication is dangerous.

614
00:29:29.000 --> 00:29:32.440
<v Speaker 2>Or an autonomous software agent that misunderstands a financial instruction

615
00:29:32.680 --> 00:29:35.279
<v Speaker 2>and executes code that deletes a critical database.

616
00:29:35.000 --> 00:29:40.480
<v Speaker 3>Exactly. When perception leads directly to physical or digital action

617
00:29:40.599 --> 00:29:44.240
<v Speaker 3>in the world, safety isn't just about content moderation anymore.

618
00:29:44.359 --> 00:29:48.279
<v Speaker 3>It's about physical safety and operational security. We are giving

619
00:29:48.359 --> 00:29:51.559
<v Speaker 3>these systems hands. We need to be very, very sure

620
00:29:51.680 --> 00:29:53.200
<v Speaker 3>about the brain that's guiding them.

621
00:29:53.400 --> 00:29:55.240
<v Speaker 2>It really feels like we are standing on the edge

622
00:29:55.240 --> 00:29:56.559
<v Speaker 2>of a profoundly different world.

623
00:29:56.759 --> 00:29:59.519
<v Speaker 3>We are. The walls between the senses are gone, the

624
00:29:59.559 --> 00:30:00.640
<v Speaker 3>silos are broken.

625
00:30:01.240 --> 00:30:04.599
<v Speaker 2>So to kind of summarize our journey today, we started

626
00:30:04.640 --> 00:30:09.240
<v Speaker 2>with the island problem. AI was completely fragmented. We moved

627
00:30:09.240 --> 00:30:13.359
<v Speaker 2>to the universal substrate: transformers and tokens connected everything. We

628
00:30:13.400 --> 00:30:15.920
<v Speaker 2>saw the magic of vector space, where meanings are mapped

629
00:30:15.920 --> 00:30:19.440
<v Speaker 2>out geometrically. We looked at the superpowers, the medical assistant,

630
00:30:19.480 --> 00:30:23.279
<v Speaker 2>the coder's buddy. We acknowledged the very real limitations: no body,

631
00:30:23.720 --> 00:30:26.839
<v Speaker 2>no real spatial sense, the problem of hallucination. And we've

632
00:30:26.880 --> 00:30:29.079
<v Speaker 2>just looked ahead at the agentic future.

633
00:30:29.240 --> 00:30:31.400
<v Speaker 3>That is the arc. It's the collapse of separation.

634
00:30:31.599 --> 00:30:33.799
<v Speaker 2>What's the final thought here? What should we walk away thinking

635
00:30:33.880 --> 00:30:35.359
<v Speaker 2>about as we go about our day.

636
00:30:35.640 --> 00:30:39.079
<v Speaker 3>I think it's this. The machines are learning to see

637
00:30:39.119 --> 00:30:43.359
<v Speaker 3>and hear. They are developing senses. For decades, we have

638
00:30:43.400 --> 00:30:46.200
<v Speaker 3>spent so much time worrying about whether they can think

639
00:30:46.680 --> 00:30:49.200
<v Speaker 3>that we maybe haven't paid enough attention to what it

640
00:30:49.279 --> 00:30:52.599
<v Speaker 3>means that they can perceive, perceive us, perceive the world.

641
00:30:53.000 --> 00:30:56.440
<v Speaker 3>We are building a new kind of observer. It's not human,

642
00:30:56.640 --> 00:30:59.640
<v Speaker 3>but it's not blind anymore. The question we all need

643
00:30:59.680 --> 00:31:01.880
<v Speaker 3>to ask is are we paying enough attention to what

644
00:31:01.920 --> 00:31:03.799
<v Speaker 3>that means for the world we are building.

645
00:31:03.880 --> 00:31:05.759
<v Speaker 2>That is a question I think we will be wrestling

646
00:31:05.799 --> 00:31:07.799
<v Speaker 2>with for a very long time. Next time you look

647
00:31:07.839 --> 00:31:10.640
<v Speaker 2>at your phone, just remember it might be looking back

648
00:31:10.680 --> 00:31:13.319
<v Speaker 2>at you and it's finally starting to understand what it sees.

649
00:31:13.640 --> 00:31:16.240
<v Speaker 2>Thanks for joining us on this exploration. A pleasure, as always,
