WEBVTT 1 00:00:01.199 --> 00:00:06.200 Welcome to the Sentient Code, where intelligence is engineered, autonomy 2 00:00:06.280 --> 00:00:10.439 is emerging, and the line between human and machine grows thinner. 3 00:00:10.800 --> 00:00:15.359 Each episode, we decode the algorithms, explore the robotics, and 4 00:00:15.439 --> 00:00:19.000 examine the ideas shaping the future of artificial minds. 5 00:00:23.800 --> 00:00:25.440 I want to start today by asking you to do 6 00:00:25.519 --> 00:00:30.160 something that feels incredibly simple, almost, you know, trivial, but 7 00:00:30.239 --> 00:00:33.799 it's actually a miracle of biology. Right now, just pause 8 00:00:33.840 --> 00:00:35.719 for a second and notice exactly what you're doing. You're 9 00:00:35.719 --> 00:00:38.840 listening to my voice, obviously, but maybe you're also driving, 10 00:00:38.880 --> 00:00:42.039 so your eyes are scanning the road, watching for brake lights. 11 00:00:42.359 --> 00:00:44.960 You feel the texture of the steering wheel under your hands. 12 00:00:45.159 --> 00:00:47.479 Maybe you're drinking coffee and you can smell the roast. 13 00:00:47.600 --> 00:00:49.600 It's the sensory soup. We're swimming in it. 14 00:00:49.759 --> 00:00:52.759 Exactly, it's a soup. But here's the thing, and I 15 00:00:52.840 --> 00:00:56.799 really want you to catch this. You aren't toggling between 16 00:00:56.840 --> 00:00:59.600 these senses like you're switching apps on a phone. No, 17 00:01:00.399 --> 00:01:03.039 you don't stop hearing to start seeing. You don't pause 18 00:01:03.079 --> 00:01:05.560 your sense of smell to process the texture of the wheel. 19 00:01:06.079 --> 00:01:09.879 Your brain is this incredible, fluid mixing board. It takes 20 00:01:10.040 --> 00:01:15.439 audio, visual, tactile, and textual inputs and weaves them into 21 00:01:15.480 --> 00:01:18.280 this single, seamless narrative we call reality. 22 00:01:18.400 --> 00:01:21.079 And it's completely effortless for us. 
I mean, it is 23 00:01:21.120 --> 00:01:25.200 the defining feature of biological consciousness. So we don't really 24 00:01:25.200 --> 00:01:28.799 think about modalities, do we? We just think about the world. 25 00:01:29.040 --> 00:01:31.959 But, and this is the big concept we're unpacking today, 26 00:01:32.359 --> 00:01:36.200 until very, very recently, artificial intelligence was not like that 27 00:01:36.280 --> 00:01:38.000 at all. In fact, it was the exact opposite. 28 00:01:38.040 --> 00:01:40.879 Oh, it was completely fragmented. You look at the history 29 00:01:40.920 --> 00:01:44.319 of AI, really from the nineteen fifties up until, well, 30 00:01:44.519 --> 00:01:47.519 the early twenty twenties, we were building a fractured mind. 31 00:01:47.599 --> 00:01:49.640 We had what we call the island problem. 32 00:01:49.680 --> 00:01:52.120 The island problem. I like that image. Paint the picture 33 00:01:52.159 --> 00:01:52.400 for us. 34 00:01:52.439 --> 00:01:55.920 Okay, so picture an archipelago. On one island, you have 35 00:01:56.079 --> 00:02:00.159 these brilliant computer vision systems. They were specialists. They could 36 00:02:00.159 --> 00:02:01.879 look at a photo of a cat and tell you 37 00:02:02.000 --> 00:02:06.159 that's a tabby with, like, ninety nine percent accuracy. Superhuman 38 00:02:06.239 --> 00:02:08.680 vision in some respects. Right. But if you showed that 39 00:02:08.800 --> 00:02:11.919 same system a handwritten note that said this is a cat, 40 00:02:12.360 --> 00:02:15.360 it was blind. It couldn't read. It had no concept 41 00:02:15.439 --> 00:02:16.360 of what letters were. 42 00:02:16.240 --> 00:02:18.879 Okay, so that's island one: the eye that cannot read. Exactly. 43 00:02:19.319 --> 00:02:21.919 Then on the next island over you have the text bots, 44 00:02:22.199 --> 00:02:25.240 the ancestors of, you know, ChatGPT and the like. They 45 00:02:25.280 --> 00:02:27.159 could write you a sonnet about a cat. 
They could 46 00:02:27.159 --> 00:02:29.879 define the biology of a feline. They could translate cat 47 00:02:29.919 --> 00:02:32.680 into fifty languages. But if you showed them a picture 48 00:02:32.680 --> 00:02:39.080 of a kitten, nothing, just static. They were effectively brains 49 00:02:39.199 --> 00:02:42.759 in a jar that only knew the world through symbols. 50 00:02:43.080 --> 00:02:45.319 So you have the eye that cannot read and the 51 00:02:45.319 --> 00:02:47.319 brain that cannot see. Precisely. 52 00:02:47.960 --> 00:02:51.039 And the worst part: they were built by different people. 53 00:02:51.240 --> 00:02:53.919 The computer vision engineers didn't hang out with the natural 54 00:02:54.000 --> 00:02:55.360 language processing engineers. 55 00:02:55.360 --> 00:02:57.080 They were in different departments. 56 00:02:56.639 --> 00:02:59.400 They used different math, they used different architectures. They were 57 00:02:59.400 --> 00:03:01.840 effectively different species of intelligence. 58 00:03:02.000 --> 00:03:05.120 So for fifty, sixty years we were building these savants. 59 00:03:05.800 --> 00:03:09.039 One savant could see perfect pixels, one savant could parse 60 00:03:09.080 --> 00:03:12.120 perfect grammar. But they couldn't have a conversation. 61 00:03:12.240 --> 00:03:14.240 They couldn't even acknowledge each other's existence. 62 00:03:14.400 --> 00:03:17.960 And today, because the reason we're doing this show is 63 00:03:17.960 --> 00:03:19.439 that something fundamental changed. 64 00:03:19.719 --> 00:03:23.000 Today, the bridges have been built. The water between the 65 00:03:23.039 --> 00:03:27.280 islands is gone. We are witnessing the rise of multimodal 66 00:03:27.400 --> 00:03:30.199 AI, and I want to be really clear to everyone listening, 67 00:03:30.240 --> 00:03:32.800 this isn't just a feature update. This isn't just now 68 00:03:32.800 --> 00:03:34.400 your chatbot has a camera icon. 
69 00:03:34.520 --> 00:03:36.280 It feels much, much bigger than that. 70 00:03:36.719 --> 00:03:40.280 It is fundamental. We are moving from the era of 71 00:03:40.319 --> 00:03:43.360 the specialist to the era of the generalist. We are 72 00:03:43.400 --> 00:03:46.719 giving machines the ability to integrate senses in a way 73 00:03:46.759 --> 00:03:50.319 that, well, it mimics that human sensory soup we started with. 74 00:03:50.680 --> 00:03:53.759 That's our mission for this discussion. We've pulled together a 75 00:03:53.759 --> 00:03:57.759 stack of research, technical papers, and industry analysis to figure 76 00:03:57.800 --> 00:04:00.639 out how this happened, because it seemed like for decades 77 00:04:00.639 --> 00:04:02.840 we were stuck and then in the last few years 78 00:04:02.919 --> 00:04:04.520 everything just collided. Right. 79 00:04:04.960 --> 00:04:07.599 We're going to look at the architecture, the actual aha 80 00:04:07.840 --> 00:04:11.199 moment that let machines see and read at the same time. 81 00:04:11.719 --> 00:04:14.520 We'll look at the superpowers this unlocks, like reading X 82 00:04:14.599 --> 00:04:16.120 rays while reading patient notes. 83 00:04:16.160 --> 00:04:19.120 And we absolutely have to talk about the limitations, because 84 00:04:19.160 --> 00:04:22.040 the research shows that although these machines can see, they 85 00:04:22.120 --> 00:04:23.639 hallucinate in brand new 86 00:04:23.480 --> 00:04:25.920 ways. They do, and we need to ask the big 87 00:04:25.959 --> 00:04:29.040 philosophical question: if a machine looks at a photo of 88 00:04:29.040 --> 00:04:31.240 a funeral and writes a poem that makes you cry, 89 00:04:31.720 --> 00:04:35.000 does it actually understand grief, or is it just really, 90 00:04:35.040 --> 00:04:35.920 really good at math? 91 00:04:36.199 --> 00:04:38.759 Let's get into the mechanics then. Section one: how did 92 00:04:38.800 --> 00:04:42.399 we get here? 
Because I remember reading about AI in 93 00:04:42.480 --> 00:04:46.680 say twenty fifteen, and it was all about these specialized tools. 94 00:04:46.720 --> 00:04:49.879 You had one tool for chess, one tool for translating French. 95 00:04:50.759 --> 00:04:53.040 When did the walls come down? Was there, like, a 96 00:04:53.079 --> 00:04:53.800 single invention? 97 00:04:54.160 --> 00:04:56.600 To understand the solution, you really have to understand why 98 00:04:56.600 --> 00:04:58.839 the walls were there in the first place. And it 99 00:04:58.879 --> 00:05:02.439 all boils down to the architecture, the literal shape of 100 00:05:02.480 --> 00:05:03.439 the neural networks. 101 00:05:03.600 --> 00:05:06.120 Okay, break that down for us. You know, don't go 102 00:05:06.199 --> 00:05:08.680 too heavy on the jargon, but give us the reality. 103 00:05:08.720 --> 00:05:11.319 Why couldn't the vision bot talk to the text bot? 104 00:05:11.439 --> 00:05:13.759 Okay. So, for a long time, the king of computer 105 00:05:13.839 --> 00:05:17.439 vision was something called a CNN, a convolutional neural network. 106 00:05:17.480 --> 00:05:19.199 We've touched on these before. These are the ones that 107 00:05:19.319 --> 00:05:22.199 scan an image like a grid, right, looking for edges 108 00:05:22.199 --> 00:05:23.079 and shapes. Right. 109 00:05:23.240 --> 00:05:26.079 Imagine a sliding window moving over a picture. It looks 110 00:05:26.120 --> 00:05:28.040 at a tiny patch of pixels, say a three by 111 00:05:28.079 --> 00:05:30.240 three square, and asks: is there an edge 112 00:05:30.279 --> 00:05:30.439 here? 113 00:05:30.519 --> 00:05:32.319 Is there a curve? Is there a color gradient? It 114 00:05:32.360 --> 00:05:35.680 builds up from lines to shapes, to ears, to eventually 115 00:05:35.720 --> 00:05:39.519 a cat. It is designed mathematically to process grids of 116 00:05:39.519 --> 00:05:41.759 spatial data. It understands space. 
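The sliding-window idea described here can be sketched in a few lines. This is only an illustration under stated assumptions: the 5x5 toy image and the hand-picked vertical-edge kernel are made up for the example, and the function name `slide_3x3` is hypothetical. In a real CNN the kernel values are learned during training rather than chosen by hand.

```python
import numpy as np

def slide_3x3(image, kernel):
    """Slide a 3x3 kernel over a 2D image (stride 1, no padding)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]      # the tiny 3x3 window
            out[i, j] = np.sum(patch * kernel)   # "is there an edge here?"
    return out

# Toy image (assumption): two dark columns on the left, three bright on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

response = slide_3x3(image, kernel)
print(response)
```

Windows that straddle the dark-to-bright boundary produce a strong response, while windows sitting entirely inside the bright region produce zero, which is the sense in which this kind of network "thinks in grids".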
117 00:05:42.079 --> 00:05:44.040 Okay, so that's the eye. It deals in grids. It 118 00:05:44.079 --> 00:05:45.399 thinks in grids. Exactly. 119 00:05:45.480 --> 00:05:48.199 But for text, text isn't a grid. Text is 120 00:05:48.240 --> 00:05:51.920 a stream. It's a sequence: the quick brown fox. 121 00:05:52.439 --> 00:05:55.399 The order matters immensely. Of course, you can't just look 122 00:05:55.439 --> 00:05:58.639 at brown without knowing quick came before it. So for 123 00:05:58.759 --> 00:06:02.759 that we used RNNs, recurrent neural networks. These were 124 00:06:02.759 --> 00:06:06.480 designed to remember the past. They process the word fox 125 00:06:06.720 --> 00:06:09.199 while trying to hold onto the memory of the. 126 00:06:09.600 --> 00:06:12.360 So you have one kind of math designed for grids, 127 00:06:12.439 --> 00:06:15.120 which is space, and a completely different kind of math 128 00:06:15.160 --> 00:06:16.839 designed for streams, which is time. 129 00:06:17.120 --> 00:06:19.199 You've got it. And you couldn't just plug one into 130 00:06:19.240 --> 00:06:21.439 the other. They spoke different languages. It was like trying 131 00:06:21.439 --> 00:06:24.120 to put a VHS tape into a toaster. The inputs 132 00:06:24.160 --> 00:06:25.439 just didn't match the machinery. 133 00:06:25.519 --> 00:06:28.000 So what changed? I know the answer involves transformers, because 134 00:06:28.000 --> 00:06:30.000 that seems to be the answer to everything in AI lately. 135 00:06:30.079 --> 00:06:33.480 But why? What did the transformer do that the others couldn't? 136 00:06:33.720 --> 00:06:37.040 Well, the date was twenty seventeen. The paper was Attention 137 00:06:37.319 --> 00:06:39.319 Is All You Need. We talk about it all the 138 00:06:39.360 --> 00:06:42.040 time on this show. But the hidden revolution in that 139 00:06:42.079 --> 00:06:44.920 paper wasn't just that it was better at language. 
It 140 00:06:44.959 --> 00:06:47.600 was that the transformer was a universal substrate. 141 00:06:48.000 --> 00:06:51.639 Universal substrate. Yeah, that sounds impressive, but what does it 142 00:06:51.680 --> 00:06:53.040 actually mean in practice? 143 00:06:53.120 --> 00:06:55.959 It means it's a structure that can process any kind 144 00:06:55.959 --> 00:06:58.199 of information as long as you can turn that information 145 00:06:58.319 --> 00:06:59.240 into a sequence. 146 00:06:59.519 --> 00:07:02.120 So text is obviously a sequence, word after word after 147 00:07:02.199 --> 00:07:03.319 word. That fits. Right. 148 00:07:03.399 --> 00:07:06.600 In AI terms, we call those tokens. But then the 149 00:07:06.639 --> 00:07:10.839 researchers had this, this real aha moment. They realized, wait 150 00:07:10.839 --> 00:07:13.360 a minute, we can treat an image as a sequence too. 151 00:07:13.759 --> 00:07:16.680 Hold on, how do you turn a picture into a sequence? 152 00:07:17.079 --> 00:07:20.519 A picture is a flat 2D object. It doesn't 153 00:07:20.519 --> 00:07:22.720 have a start and end like a sentence does. 154 00:07:22.800 --> 00:07:25.120 That was the stroke of genius. The researchers asked, what 155 00:07:25.199 --> 00:07:26.759 if we forced it to be a sequence? 156 00:07:26.879 --> 00:07:27.879 Forced it? How? 157 00:07:28.319 --> 00:07:31.480 Imagine taking a photo of a dog. Now imagine taking 158 00:07:31.519 --> 00:07:33.560 a pair of scissors and cutting it up into a 159 00:07:33.600 --> 00:07:37.160 grid of little squares. Let's say sixteen by sixteen pixel squares. 160 00:07:37.560 --> 00:07:39.319 You have a pile of these tiny patches. 161 00:07:39.360 --> 00:07:40.040 Okay, I'm with you. 162 00:07:40.319 --> 00:07:43.480 Now, you just line them up: square one, square two, 163 00:07:43.480 --> 00:07:45.199 square three, from top left to bottom right. 164 00:07:45.319 --> 00:07:47.360 You flatten the grid into a line. Exactly. 
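The scissors-and-squares description above can be written out as a short sketch. It assumes a NumPy image array of shape (height, width, channels); the 224x224 size and the function name `image_to_patch_sequence` are illustrative assumptions, not anything named in the episode. A real vision transformer would typically also pass each flattened patch through a learned projection and add position information, which this sketch leaves out.

```python
import numpy as np

def image_to_patch_sequence(image, patch=16):
    """Cut an (H, W, C) image into patch x patch squares and line them up
    from top left to bottom right, one flattened vector per square."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    grid = image.reshape(rows, patch, cols, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)   # -> (rows, cols, patch, patch, C)
    return grid.reshape(rows * cols, patch * patch * c)

# Hypothetical 224x224 RGB photo (all zeros, just to show the shapes).
photo = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patch_sequence(photo)
print(tokens.shape)  # one row per patch, in reading order
```

With these assumed sizes the photo becomes 14 x 14 = 196 "visual words", each holding 16 x 16 x 3 = 768 numbers, which is exactly the kind of sequence a transformer already knows how to consume.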
165 00:07:47.399 --> 00:07:50.439 They turn the image into a sentence of visual words. 166 00:07:50.480 --> 00:07:53.199 They call them patches. And once you did that, once 167 00:07:53.240 --> 00:07:55.720 you turn the image into a sequence of patches, the 168 00:07:55.759 --> 00:07:57.759 transformer looked at it and said, I know what to 169 00:07:57.800 --> 00:07:58.120 do with this. 170 00:07:58.360 --> 00:08:00.920 Because to the transformer, a patch of pixels is just 171 00:08:00.959 --> 00:08:03.160 another token, the same way a word is a token. 172 00:08:03.279 --> 00:08:06.959 Precisely. That is the everything-is-a-token realization. And 173 00:08:07.000 --> 00:08:10.480 it didn't stop at images. Audio? That's just a sequence 174 00:08:10.519 --> 00:08:14.759 of spectrogram slices. Video? That's just a sequence of frames 175 00:08:14.800 --> 00:08:17.360 in temporal order. Even code or molecules. 176 00:08:17.439 --> 00:08:20.680 So the machine stops seeing image versus text versus audio, 177 00:08:21.000 --> 00:08:24.000 and just starts seeing data stream versus data stream. 178 00:08:24.079 --> 00:08:27.879 Correct. It was like discovering that French, Mandarin, and mathematics 179 00:08:27.920 --> 00:08:31.319 are all actually dialects of the same underlying language. Once 180 00:08:31.360 --> 00:08:34.120 they realized that the transformer could handle all of these 181 00:08:34.159 --> 00:08:39.360 as sequences, the barrier between the senses just, it evaporated. 182 00:08:39.440 --> 00:08:42.080 That is wild. So the architecture was the lock, and 183 00:08:42.080 --> 00:08:45.519 this idea of tokenization was the key that fit everything. 184 00:08:45.639 --> 00:08:47.799 That's a beautiful way to put it. And once that 185 00:08:47.919 --> 00:08:52.120 architectural problem was solved, the floodgates opened. 
We moved into 186 00:08:52.159 --> 00:08:55.279 this phase of connecting the dots, of teaching these different 187 00:08:55.279 --> 00:08:56.679 senses to talk to each other. 188 00:08:56.799 --> 00:08:59.759 Okay, I get the architecture. That makes sense. We can 189 00:08:59.759 --> 00:09:02.759 now feed everything into the same kind of machine. But 190 00:09:02.840 --> 00:09:06.720 I'm still stuck on the understanding part. Just because I 191 00:09:06.759 --> 00:09:08.679 feed a picture of a dog and the word dog 192 00:09:08.720 --> 00:09:11.080 into the same machine, how does the machine know they 193 00:09:11.080 --> 00:09:13.279 refer to the same thing? Surely it's not just looking 194 00:09:13.320 --> 00:09:14.159 it up in a dictionary. 195 00:09:14.279 --> 00:09:17.120 No, no, it's not a lookup table at all. It's geometry. 196 00:09:17.200 --> 00:09:19.480 Geometry. You're gonna have to explain that one. How does 197 00:09:19.480 --> 00:09:21.600 a picture of a dog become geometry? 198 00:09:21.720 --> 00:09:24.200 This is where we have to talk about vectors and 199 00:09:24.360 --> 00:09:26.600 high dimensional space, and to do that we have to 200 00:09:26.600 --> 00:09:29.240 talk about how these things are actually trained. The most 201 00:09:29.279 --> 00:09:33.919 famous example is a model called CLIP, from OpenAI. CLIP. 202 00:09:34.120 --> 00:09:37.120 I've seen that mentioned. Contrastive Language-Image Pre-training. 203 00:09:37.399 --> 00:09:41.480 It's a mouthful, but the concept is really elegant. Imagine 204 00:09:41.519 --> 00:09:44.639 you have a massive bucket of data, and I'm talking 205 00:09:44.840 --> 00:09:48.840 four hundred million images scraped from the Internet and the 206 00:09:48.840 --> 00:09:50.240 text captions that came with them. 
207 00:09:50.480 --> 00:09:54.519 So like IMG zero zero one dot jpeg and the 208 00:09:54.559 --> 00:09:57.960 alt text that says a golden retriever catching a frisbee 209 00:09:58.000 --> 00:09:58.440 on the beach. 210 00:09:58.799 --> 00:10:01.720 Right. Now, you start with a blank brain. It knows nothing. 211 00:10:01.799 --> 00:10:04.720 You show it the image and you show it the text. Initially, 212 00:10:04.759 --> 00:10:08.039 the machine thinks these are totally unrelated things. It turns 213 00:10:08.039 --> 00:10:10.120 the image into a set of numbers. We call that 214 00:10:10.159 --> 00:10:12.480 a vector, and it turns the text into another set 215 00:10:12.519 --> 00:10:15.480 of numbers. And those numbers are, in this mathematical space, 216 00:10:15.720 --> 00:10:18.639 nowhere near each other. They're strangers on the map. Total strangers. 217 00:10:19.120 --> 00:10:22.440 But then you apply something called contrastive loss. This is 218 00:10:22.480 --> 00:10:26.960 the training mechanism. You essentially punish the machine. You say, hey, 219 00:10:27.639 --> 00:10:31.279 these two sets of numbers, they belong together. Pull them closer. 220 00:10:31.360 --> 00:10:33.399 You're forcing them to be neighbors. Exactly. 221 00:10:33.440 --> 00:10:36.320 And simultaneously you show it the text a golden retriever 222 00:10:36.440 --> 00:10:39.039 catching a frisbee and a picture of a toaster, and 223 00:10:39.120 --> 00:10:41.399 you say, push these apart. These are not the same. 224 00:10:41.720 --> 00:10:43.639 These live on opposite sides of the universe. 225 00:10:43.720 --> 00:10:47.120 So it's this constant game of hot and cold, pushing 226 00:10:47.159 --> 00:10:48.000 and pulling. 227 00:10:47.960 --> 00:10:51.240 Done billions and billions of times, over and over, and 228 00:10:51.279 --> 00:10:53.799 eventually the machine builds a map. We call it a 229 00:10:53.840 --> 00:10:57.799 high dimensional vector space. 
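The pull-together, push-apart game described here is often written as a symmetric cross-entropy over a table of image-to-text similarities. The sketch below is one common formulation of a contrastive loss, not necessarily the exact recipe any particular model uses. The random vectors stand in for real encoder outputs, and the temperature value, the batch size, and all function names are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Row i of image_vecs and row i of text_vecs are a matching pair.
    Every other combination in the batch is treated as a mismatch."""
    img = l2_normalize(image_vecs)
    txt = l2_normalize(text_vecs)
    logits = img @ txt.T / temperature        # logits[i, j]: image i vs. text j
    idx = np.arange(len(img))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (image_to_text + text_to_image) / 2

# Stand-in embeddings (assumption): 4 caption vectors in 8 dimensions.
rng = np.random.default_rng(0)
text_vecs = rng.normal(size=(4, 8))
aligned_images = text_vecs + 0.01 * rng.normal(size=(4, 8))   # neighbors of their captions
mismatched_images = aligned_images[::-1]                      # every image paired with the wrong caption

print(contrastive_loss(aligned_images, text_vecs))
print(contrastive_loss(mismatched_images, text_vecs))
```

The loss comes out low when each image already sits next to its own caption and high when the pairs are scrambled. Training nudges the encoders to reduce that number, which is the "punish the machine" step in the conversation.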
Imagine a graph, but instead of 230 00:10:57.799 --> 00:11:02.039 two or three axes, it has thousands. In this map, 231 00:11:02.120 --> 00:11:05.840 the coordinates for the visual pattern of fur, floppy ears, 232 00:11:05.879 --> 00:11:09.200 and tail end up located at the exact same coordinates 233 00:11:09.240 --> 00:11:11.440 as the linguistic pattern for the word dog. 234 00:11:11.879 --> 00:11:14.840 Wow. So it's not using a dictionary. It's not looking 235 00:11:14.960 --> 00:11:18.639 up dog equals animal. It's mapping the concept of dogness 236 00:11:18.919 --> 00:11:22.080 to a specific location in this massive, invisible space. 237 00:11:22.200 --> 00:11:24.360 Yes, and this is why it feels like it understands, 238 00:11:24.399 --> 00:11:27.240 because that space has geometry. It has a kind of logic. 239 00:11:27.360 --> 00:11:29.759 Okay, give me an example of that logic, because logic 240 00:11:29.799 --> 00:11:32.240 implies it can do reasoning, not just matching. 241 00:11:32.440 --> 00:11:35.480 Okay, think about the classic relationship between king and queen 242 00:11:35.720 --> 00:11:38.240 in text. If you take the math vector for the 243 00:11:38.240 --> 00:11:40.919 word king, subtract the vector for man, and then add 244 00:11:40.960 --> 00:11:43.879 the vector for woman, you land almost perfectly on the 245 00:11:43.960 --> 00:11:44.679 vector for queen. 246 00:11:44.919 --> 00:11:47.960 Right. That's the famous example: king minus man plus woman 247 00:11:48.000 --> 00:11:50.120 equals queen. It's like vector arithmetic. 248 00:11:50.440 --> 00:11:53.399 Now do it with images. If you take the visual 249 00:11:53.519 --> 00:11:55.799 vector of a king, a photo of a guy in 250 00:11:55.840 --> 00:11:59.879 a crown, subtract the visual features that represent man, and 251 00:12:00.200 --> 00:12:05.200 add the visual features that represent woman, the machine generates 252 00:12:05.240 --> 00:12:06.240 an image of a 
253 00:12:06.240 --> 00:12:11.799 queen. That is mind-blowing. The logic, the geometric relationship, 254 00:12:12.679 --> 00:12:15.120 it holds up across the senses. It does. 255 00:12:15.279 --> 00:12:17.600 It means the machine has found a concept layer that 256 00:12:17.679 --> 00:12:21.039 sits deeper than language and deeper than pixels. It has 257 00:12:21.080 --> 00:12:22.679 found the meaning that connects them. 258 00:12:22.759 --> 00:12:24.440 It's performing analogical reasoning. 259 00:12:24.519 --> 00:12:26.720 It is. That's how the system can look at a 260 00:12:26.720 --> 00:12:29.159 photo of a funeral and connect it to the text 261 00:12:29.360 --> 00:12:32.159 a moment of grief. It's not because it memorized that 262 00:12:32.240 --> 00:12:35.720 specific photo and caption pair. It's because the visual information 263 00:12:35.759 --> 00:12:38.120 in the funeral photo and the concept of grief from 264 00:12:38.159 --> 00:12:40.960 the text live in the same emotional region of this 265 00:12:41.080 --> 00:12:42.039 mathematical space. 266 00:12:42.120 --> 00:12:44.279 It's math. The geometry of sadness. 267 00:12:43.799 --> 00:12:46.840 In a mathematical sense, yes. It has aligned the visual 268 00:12:46.840 --> 00:12:49.519 features of sadness with the linguistic features of sadness. 269 00:12:49.600 --> 00:12:52.320 That explains so much about why these systems feel like 270 00:12:52.360 --> 00:12:55.039 they get it. They aren't just matching keywords. They are 271 00:12:55.120 --> 00:12:56.840 navigating a map of meaning. 272 00:12:56.879 --> 00:12:59.759 And usually the architecture that runs this map, the sort 273 00:12:59.799 --> 00:13:02.879 of central brain, is a large language model. You have 274 00:13:02.960 --> 00:13:05.519 these specialized encoders. You can think of them as the 275 00:13:05.559 --> 00:13:08.919 eyes and ears that project all this information into the brain. 
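The king-minus-man-plus-woman arithmetic mentioned earlier in the conversation can be shown with a deliberately tiny toy. The three-dimensional vectors below are hand-built for illustration (the axes are loosely "royalty", "male", "female") and are not learned embeddings from any real model. The function names are hypothetical too. Real embedding spaces have hundreds or thousands of dimensions, and the result usually lands near the target rather than exactly on it.

```python
import numpy as np

# Hand-constructed toy vectors (assumption): [royalty, male, female]
vocab = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.7, 1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    """Return the vocabulary word whose vector points most nearly along target."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# king - man + woman, then look up the closest remaining word.
target = vocab["king"] - vocab["man"] + vocab["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)
```

Excluding the three input words from the lookup is the usual convention for this kind of analogy test, since the query vector tends to stay close to the words it was built from.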
276 00:13:09.799 --> 00:13:13.600 The LLM does the reasoning in that shared space, and 277 00:13:13.639 --> 00:13:15.320 then it can send information back out. 278 00:13:15.480 --> 00:13:17.720 So the LLM is the conductor of the orchestra, making 279 00:13:17.720 --> 00:13:19.679 sure the strings and the woodwinds are all playing from 280 00:13:19.679 --> 00:13:22.080 the same sheet music. Precisely. All right, so we have 281 00:13:22.159 --> 00:13:24.720 the history: the silos are gone. We have the science: it's 282 00:13:24.720 --> 00:13:27.080 a geometry of concepts. Now I want to talk about 283 00:13:27.080 --> 00:13:30.720 the utility, because cool math is great, but what can 284 00:13:30.799 --> 00:13:31.600 this actually do? 285 00:13:32.000 --> 00:13:35.000 The capabilities are substantial, and I think we should start 286 00:13:35.039 --> 00:13:37.960 with what we can call vision language nuance, because we're 287 00:13:37.960 --> 00:13:40.159 not just talking about identifying objects anymore. 288 00:13:40.279 --> 00:13:42.360 Right. This isn't just drawing a box around a cat 289 00:13:42.440 --> 00:13:45.480 and saying cat, ninety nine percent confidence. That was, like, 290 00:13:45.639 --> 00:13:46.960 twenty fifteen era AI. 291 00:13:47.159 --> 00:13:51.000 No, no. Now it's about identifying relationships, emotional tenor. It 292 00:13:51.039 --> 00:13:52.720 can look at a scene and say this is a 293 00:13:52.759 --> 00:13:55.960 tense negotiation happening in a corporate boardroom, based on the 294 00:13:55.960 --> 00:13:58.919 body language, the lighting, the arrangement of people. But one 295 00:13:58.919 --> 00:14:03.360 of the most practical superpowers is something called OCR integration, 296 00:14:03.639 --> 00:14:05.279 optical character recognition. 297 00:14:05.559 --> 00:14:08.639 But OCR has been around since the nineties. My scanner 298 00:14:08.720 --> 00:14:11.279 came with it. Why is this a big deal now? 
299 00:14:11.519 --> 00:14:15.840 Old OCR was dumb. It just scraped text off a page. 300 00:14:15.919 --> 00:14:17.600 It didn't know where the text was or what it 301 00:14:17.639 --> 00:14:21.559 meant in context. Multimodal AI reads the text in context. 302 00:14:21.600 --> 00:14:23.159 It can look at a street sign in a photo, 303 00:14:23.279 --> 00:14:24.919 read the sign, look at the cars, look at the 304 00:14:24.919 --> 00:14:27.720 time of day, and tell you if parking is legal right 305 00:14:27.600 --> 00:14:30.720 now. Or, to go back to our earlier point, it 306 00:14:30.799 --> 00:14:33.519 could read a handwritten note on a medical scan and 307 00:14:33.679 --> 00:14:36.080 understand how that note relates to the X ray 308 00:14:35.879 --> 00:14:40.200 itself. Exactly, which segues perfectly into the second big capability: 309 00:14:40.480 --> 00:14:43.120 document understanding. This is what some people are calling the 310 00:14:43.519 --> 00:14:44.759 ultimate office assistant. 311 00:14:45.080 --> 00:14:46.240 This is the one that I think is going to 312 00:14:46.320 --> 00:14:48.399 change a lot of white collar work. I want you 313 00:14:48.440 --> 00:14:51.799 to walk me through a scenario here, because I deal 314 00:14:51.840 --> 00:14:55.240 with PDFs all day and they are where data goes 315 00:14:55.279 --> 00:14:55.600 to die. 316 00:14:55.759 --> 00:14:59.320 Okay, picture this. You have a fifty page annual report. 317 00:14:59.519 --> 00:15:03.639 It's got three columns of text, complex bar charts, photos 318 00:15:03.639 --> 00:15:08.759 with captions, footnotes. For old AI, that was a complete nightmare. 319 00:15:08.840 --> 00:15:12.039 The text would get jumbled, the chart was invisible. 320 00:15:11.600 --> 00:15:13.600 It was soup. You'd copy paste it into a text 321 00:15:13.639 --> 00:15:15.279 file and just get absolute garbage. Right. 322 00:15:15.639 --> 00:15:19.120 But a multimodal system sees the document like a human does. 
323 00:15:19.159 --> 00:15:21.440 It understands the layout. It can look at the bar chart, 324 00:15:21.720 --> 00:15:24.519 extract the data from the visual bars, I mean literally 325 00:15:24.559 --> 00:15:27.360 measuring the pixels of the bars, read the surrounding text, 326 00:15:27.519 --> 00:15:30.600 understand what that data means, and answer a question like 327 00:15:30.799 --> 00:15:32.960 based on the chart on page three, which quarter had 328 00:15:32.960 --> 00:15:34.519 the highest revenue. 329 00:15:34.080 --> 00:15:36.919 Without a human having to manually turn that chart into 330 00:15:36.960 --> 00:15:39.519 an Excel sheet first zero preprocessing. 331 00:15:39.559 --> 00:15:40.960 It just looks and understands. 332 00:15:41.000 --> 00:15:44.919 That's incredible. It basically unlocks all the information that is 333 00:15:44.960 --> 00:15:48.720 trapped inside images, within documents. What about audio and video 334 00:15:48.840 --> 00:15:51.240 You mentioned earlier that video is just a sequence of frames. 335 00:15:51.519 --> 00:15:56.000 Audio and video are huge frontiers now. In audio, we 336 00:15:56.039 --> 00:15:59.799 aren't just transcribing speech to text anymore. We are analyzing 337 00:15:59.840 --> 00:16:03.919 the vocal characteristics. The system can detect emotion. Is the 338 00:16:03.919 --> 00:16:08.600 speaker angry, nervous, sarcastically happy? 339 00:16:08.720 --> 00:16:10.480 You can hear the scare quotes in your voice. 340 00:16:10.559 --> 00:16:13.600 It can absolutely and it can analyze music, not just 341 00:16:13.639 --> 00:16:16.879 the genre, but the rhythm, the mood, the instrumentation. When 342 00:16:16.919 --> 00:16:19.840 you combine that with video, you get narrative understanding. It 343 00:16:19.840 --> 00:16:22.360 can track events over time and start to build a story. 
344 00:16:22.559 --> 00:16:24.759 But the real magic, and the research we looked at 345 00:16:24.840 --> 00:16:28.440 was really emphatic about this, is the killer app of 346 00:16:28.559 --> 00:16:31.200 true integration. It's not just being good at video 347 00:16:31.320 --> 00:16:33.200 or good at text. It's the combo. 348 00:16:33.399 --> 00:16:35.399 It is the synthesis. That's where the real power is. 349 00:16:35.519 --> 00:16:37.960 Let's look at a coding scenario. Imagine you're a developer. 350 00:16:38.039 --> 00:16:40.679 You're stuck. You get some cryptic error message. You take 351 00:16:40.720 --> 00:16:42.960 a screenshot of your error message. You just paste it 352 00:16:43.000 --> 00:16:45.679 into the AI. The AI reads the screenshot, looks at 353 00:16:45.679 --> 00:16:49.320 your actual code file, consults the official software documentation online, 354 00:16:49.440 --> 00:16:51.039 and synthesizes an answer. 355 00:16:51.240 --> 00:16:53.759 So it's using its eyes and its reading comprehension at 356 00:16:53.759 --> 00:16:56.039 the exact same time to solve one problem. 357 00:16:56.440 --> 00:17:00.159 Or take medicine, the radiologist assistant idea. It looks at 358 00:17:00.200 --> 00:17:04.000 the CT scan, that's vision. It reads the patient's history notes, 359 00:17:04.039 --> 00:17:07.480 that's text. It checks the latest research papers from medical journals, 360 00:17:07.480 --> 00:17:11.079 more text, and it synthesizes a potential diagnosis based on 361 00:17:11.200 --> 00:17:12.359 all three modalities. 362 00:17:12.440 --> 00:17:15.359 It becomes the ultimate second opinion engine. 
363 00:17:15.480 --> 00:17:18.880 Right. Or a final example, in design: you sketch a 364 00:17:18.960 --> 00:17:20.880 rough idea for an app on a napkin. You take 365 00:17:20.920 --> 00:17:23.119 a photo, you upload it, and you say, make this 366 00:17:23.160 --> 00:17:26.440 look like a sleek modern app interface, but use our 367 00:17:26.480 --> 00:17:30.319 official brand colors from this attached PDF. It sees the sketch, 368 00:17:30.400 --> 00:17:32.720 it reads your brief, it consults the PDF for the 369 00:17:32.759 --> 00:17:35.400 color codes, and it generates the final image. 370 00:17:35.440 --> 00:17:39.039 It's closing the loop between idea, instruction, and creation. It 371 00:17:39.039 --> 00:17:42.039 feels like we're getting closer to that Jarvis from Iron Man fantasy, 372 00:17:42.079 --> 00:17:44.000 the AI system that just handles them. 373 00:17:44.119 --> 00:17:46.160 We are getting closer. But, and this is a very, 374 00:17:46.240 --> 00:17:48.039 very big but, we have to talk about where it 375 00:17:48.079 --> 00:17:50.599 breaks, because it is not Jarvis yet, and you and 376 00:17:50.640 --> 00:17:53.079 I need to be clear that this isn't magic. It 377 00:17:53.119 --> 00:17:55.079 breaks in some surprisingly dumb ways. 378 00:17:55.279 --> 00:17:57.000 I do want to play the skeptic here for a minute, 379 00:17:57.039 --> 00:18:01.240 because it sounds perfect, but I know it's not. Where 380 00:18:01.240 --> 00:18:03.559 does the machine stumble? What trips it up? 381 00:18:04.160 --> 00:18:08.079 It stumbles in some surprisingly fundamental areas. The first one, 382 00:18:08.240 --> 00:18:11.599 and this is almost ironic, is spatial reasoning. Which 383 00:18:11.400 --> 00:18:13.839 is funny, right, because you'd think a computer vision system 384 00:18:13.880 --> 00:18:14.920 would be great at space. 
385 00:18:15.319 --> 00:18:19.160 It sees pixels. You would think. But remember, these systems 386 00:18:19.160 --> 00:18:22.440 are trained on flat 2D images from the Internet. 387 00:18:22.880 --> 00:18:25.759 They struggle to build an intuitive 3D model of 388 00:18:25.799 --> 00:18:28.559 the world. If you show it a picture of a 389 00:18:28.559 --> 00:18:30.839 table with a messy pile of objects, and you ask, 390 00:18:31.200 --> 00:18:33.440 is the apple behind the book or in front of it, 391 00:18:33.440 --> 00:18:34.799 it often gets really confused. 392 00:18:34.839 --> 00:18:36.839 It sees the pixels of the apple and the pixels 393 00:18:36.839 --> 00:18:40.000 of the book, but it doesn't get the depth, the 394 00:18:40.039 --> 00:18:42.160 physics of one object occluding another. 395 00:18:42.240 --> 00:18:44.839 It lacks a physics engine in its head. It doesn't 396 00:18:44.880 --> 00:18:48.359 intuitively understand that solid objects occupy space and can't pass 397 00:18:48.400 --> 00:18:51.480 through each other. And this is a massive problem for robotics. 398 00:18:52.119 --> 00:18:54.119 If you want a robot to clean your kitchen, it 399 00:18:54.200 --> 00:18:56.519 needs to know exactly where the cup is relative to 400 00:18:56.559 --> 00:18:59.480 the edge of the table. Close enough isn't good enough. 401 00:18:59.480 --> 00:19:01.200