WEBVTT 1 00:00:01.199 --> 00:00:06.200 Welcome to the Sentient Code, where intelligence is engineered, autonomy 2 00:00:06.280 --> 00:00:10.439 is emerging, and the line between human and machine grows thinner. 3 00:00:10.800 --> 00:00:15.359 Each episode, we decode the algorithms, explore the robotics, and 4 00:00:15.439 --> 00:00:23.000 examine the ideas shaping the future of artificial minds. 5 00:00:23.920 --> 00:00:28.519 I spent twenty minutes yesterday, literally twenty minutes, trying to 6 00:00:28.559 --> 00:00:32.280 get a supposedly state of the art AI to figure 7 00:00:32.320 --> 00:00:34.719 out this completely absurd riddle. 8 00:00:34.960 --> 00:00:38.439 Oh, a riddle. Let me guess. It didn't go exactly 9 00:00:38.439 --> 00:00:39.000 as planned. 10 00:00:39.240 --> 00:00:41.600 No, it was a disaster. My seven year old told 11 00:00:41.640 --> 00:00:43.640 it to me. It was something about a penguin, a 12 00:00:43.679 --> 00:00:47.119 flashlight and a jar of peanut butter. 13 00:00:47.320 --> 00:00:49.600 Right, not exactly standard training data. 14 00:00:49.280 --> 00:00:55.560 Exactly. And the AI confidently generated this five paragraph, highly articulate, 15 00:00:55.640 --> 00:00:58.719 just beautifully written essay that completely missed the punchline. I 16 00:00:58.719 --> 00:01:02.439 mean, it fundamentally violated basic physics. Confidently incorrect. 17 00:01:02.439 --> 00:01:04.680 That is the hallmark of the current architecture. 18 00:01:04.799 --> 00:01:07.319 Yes, and yet I open up my feed right after that, 19 00:01:07.400 --> 00:01:10.439 and the headlines are absolutely screaming. They're saying this exact 20 00:01:10.480 --> 00:01:14.640 same architecture is about to autonomously replace human doctors and 21 00:01:14.760 --> 00:01:19.879 lawyers and engineers. 
There is this massive, dizzying disconnect happening 22 00:01:19.959 --> 00:01:23.040 right now. For you listening, you are constantly surrounded by 23 00:01:23.079 --> 00:01:26.640 these claims that artificial intelligence is reaching human levels of comprehension. 24 00:01:26.879 --> 00:01:29.159 You hear about them acing the bar exam, 25 00:01:29.000 --> 00:01:31.920 breezing through advanced medical licensing 26 00:01:31.480 --> 00:01:36.640 tests, mastering the exact standardized testing frameworks we've used for generations. 27 00:01:36.799 --> 00:01:39.719 Right, it paints this incredibly vivid picture for all of us, 28 00:01:40.040 --> 00:01:43.000 a picture of algorithms that are practically breathing, thinking, and 29 00:01:43.120 --> 00:01:45.599 understanding the world exactly the way you and I do. 30 00:01:46.159 --> 00:01:48.879 But, and this is the big question for today, what 31 00:01:49.000 --> 00:01:51.680 if the very yardsticks we have been using to measure 32 00:01:51.760 --> 00:01:54.680 artificial intelligence are just fundamentally broken? 33 00:01:54.760 --> 00:01:57.079 They are obsolete, completely broken. 34 00:01:57.120 --> 00:02:00.400 Today we are exploring a massive paradigm shift. We are 35 00:02:00.400 --> 00:02:03.959 looking at the core reality that those traditional academic benchmarks 36 00:02:03.959 --> 00:02:07.079 have completely and utterly lost their diagnostic utility. 37 00:02:07.400 --> 00:02:10.439 That is precisely our mission today. We have to completely 38 00:02:10.479 --> 00:02:13.400 deconstruct how we evaluate the artificial mind. We are no 39 00:02:13.439 --> 00:02:18.639 longer just talking about machines getting smarter in some vague sense. 40 00:02:18.680 --> 00:02:21.439 We're stepping into a high stakes intellectual mystery. 
41 00:02:21.560 --> 00:02:24.800 We are, because we are examining a profound shift in 42 00:02:24.840 --> 00:02:28.680 how we test computational intelligence. We're transitioning entirely away from 43 00:02:28.719 --> 00:02:34.039 testing generalized models on standard educational curricula. Instead, we're looking 44 00:02:34.080 --> 00:02:37.000 at how we evaluate them against the absolute limits of 45 00:02:37.120 --> 00:02:39.680 highly specialized human expertise, at 46 00:02:39.560 --> 00:02:42.879 the very frontier of scientific and historical discovery. 47 00:02:43.039 --> 00:02:47.919 Exactly. The central theme here is understanding the stark delineation 48 00:02:48.159 --> 00:02:52.840 between the statistical probability operations of a machine, the pattern matching. Right, 49 00:02:52.840 --> 00:02:57.840 the pattern matching, and the actualized, deep causal reasoning of 50 00:02:57.840 --> 00:03:01.439 a human mind. It is about separating the illusion of 51 00:03:01.479 --> 00:03:04.960 comprehension from genuine contextual synthesis. 52 00:03:05.039 --> 00:03:07.960 Okay, let's unpack this collapse of the old standard, because 53 00:03:08.000 --> 00:03:09.599 I think we all know how these models work at 54 00:03:09.599 --> 00:03:13.080 a baseline, right? They are incredibly sophisticated 55 00:03:12.319 --> 00:03:16.319 autocorrects navigating vector spaces to predict the next token. Exactly. 56 00:03:16.360 --> 00:03:18.599 But for a long time, the gold standard, the ultimate 57 00:03:18.639 --> 00:03:22.680 proving ground for these architectures was something called the MMLU, 58 00:03:23.039 --> 00:03:26.199 the Massive Multitask Language Understanding Exam. 59 00:03:26.280 --> 00:03:29.120 If you're building a multi billion dollar machine learning model, 60 00:03:29.400 --> 00:03:30.520 this was your benchmark. 
61 00:03:30.599 --> 00:03:34.840 It covered this incredibly broad generalized knowledge base, everything from 62 00:03:34.879 --> 00:03:39.560 basic high school European history to complex professional level medical 63 00:03:39.560 --> 00:03:42.039 diagnostics, microeconomics, tort law. 64 00:03:42.319 --> 00:03:42.520 Right. 65 00:03:42.599 --> 00:03:44.280 It was supposed to be the ultimate test of an 66 00:03:44.280 --> 00:03:46.479 AI's generalized knowledge. 67 00:03:46.000 --> 00:03:50.240 And when the MMLU was initially introduced, it did provide 68 00:03:50.280 --> 00:03:54.159 a highly effective metric. A percentage increase in accuracy on 69 00:03:54.199 --> 00:03:58.439 that exam directly correlated with tangible architectural improvements in the 70 00:03:58.479 --> 00:03:59.159 neural networks. 71 00:03:59.240 --> 00:04:01.360 It gave the developers a clear roadmap, a 72 00:04:01.400 --> 00:04:02.680 clear empirical trajectory. 73 00:04:02.759 --> 00:04:07.360 Yes. But then we witnessed a phenomenon that completely destabilized 74 00:04:07.360 --> 00:04:10.919 this metric, the exponential scaling of neural networks. 75 00:04:11.120 --> 00:04:13.599 The tech giants just started throwing hardware at it. 76 00:04:13.719 --> 00:04:17.800 Massive hardware. Developers began massively increasing both the parameter counts 77 00:04:17.800 --> 00:04:19.759 of these models and the sheer volume of their training 78 00:04:19.839 --> 00:04:23.279 data sets. They were essentially scraping the entire indexed Internet, 79 00:04:23.360 --> 00:04:23.839 and as a 80 00:04:23.800 --> 00:04:26.720 direct result of that scaling, the systems began achieving near 81 00:04:26.720 --> 00:04:28.360 perfect scores on the MMLU. 82 00:04:28.519 --> 00:04:30.360 They effectively maxed out the test. 83 00:04:30.439 --> 00:04:34.240 Which creates a massive structural flaw. 
If you have a 84 00:04:34.279 --> 00:04:38.279 diagnostic tool, any kind of test, and it routinely starts 85 00:04:38.279 --> 00:04:41.800 returning the maximum possible values across all these diverse subjects, 86 00:04:42.079 --> 00:04:44.240 it stops giving you any meaningful variance. 87 00:04:44.399 --> 00:04:46.879 It goes blind. It is a phenomenon known as saturation. 88 00:04:47.399 --> 00:04:52.160 Saturation. To put this in perspective for you listening, imagine 89 00:04:52.279 --> 00:04:55.199 you are a sports scientist. You're trying to test the 90 00:04:55.399 --> 00:04:59.759 absolute physical limits of an elite Olympic decathlete, a gold 91 00:04:59.800 --> 00:05:03.120 medalist. Right. But the only diagnostic tool you have in 92 00:05:03.160 --> 00:05:08.000 your lab is the standard middle school presidential fitness test, the 93 00:05:07.879 --> 00:05:09.480 one we all took in seventh grade. 94 00:05:09.560 --> 00:05:12.480 Exactly. Sure, the Olympian is going to get a perfect score. 95 00:05:12.560 --> 00:05:14.120 They're going to do all the pull ups, run the 96 00:05:14.120 --> 00:05:17.079 shuttle sprint, stretch past their toes without breaking a sweat. 97 00:05:17.600 --> 00:05:20.959 But that perfect score tells you absolutely nothing about their 98 00:05:21.120 --> 00:05:23.319 actual absolute physical limits. 99 00:05:23.600 --> 00:05:26.839 It doesn't tell you how their cardiovascular system handles the 100 00:05:26.879 --> 00:05:28.800 complex stress of a decathlon, or 101 00:05:28.759 --> 00:05:32.399 how they adapt to unpredictable physical challenges. It just tells 102 00:05:32.439 --> 00:05:34.000 you that they are stronger than a twelve year old. 103 00:05:34.319 --> 00:05:37.120 The test is saturated. It ceases to provide any insight 104 00:05:37.160 --> 00:05:41.560 into the underlying capabilities, or more importantly, the limitations of 105 00:05:41.600 --> 00:05:42.759 the system being tested. 
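The saturation point made above can be sketched numerically. This is a minimal illustration, assuming invented accuracy figures rather than real MMLU results; `score_spread` and `headroom` are hypothetical helper names introduced only for this note.

```python
from statistics import pvariance


def score_spread(accuracies):
    """Population variance of a list of benchmark accuracies (0.0 to 1.0)."""
    return pvariance(accuracies)


def headroom(accuracies, ceiling=1.0):
    """Gap between the best observed score and the highest score the test can report."""
    return ceiling - max(accuracies)


# Hypothetical accuracies, invented for illustration only.
early_generation = [0.35, 0.52, 0.68, 0.81]
saturated_generation = [0.97, 0.98, 0.98, 0.99]

for label, scores in (("early", early_generation), ("saturated", saturated_generation)):
    # A saturated test shows almost no spread between models and almost no headroom.
    print(label, "spread:", round(score_spread(scores), 5), "headroom:", round(headroom(scores), 2))
```

When every model lands near the ceiling, the variance between models shrinks toward zero, which is the sense in which the hosts say the test "goes blind".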
106 00:05:43.160 --> 00:05:46.279 What's fascinating here is how the saturation exposes a deep 107 00:05:46.399 --> 00:05:50.120 fundamental difference between high performance on tasks designed by humans 108 00:05:50.600 --> 00:05:53.439 and actual generalizable intelligence. 109 00:05:53.000 --> 00:05:55.480 Because getting an A on a human test doesn't mean 110 00:05:55.519 --> 00:05:57.279 you think like a human. Exactly. 111 00:05:57.680 --> 00:06:00.879 When these models achieve those near perfect scores on the MMLU, 112 00:06:01.199 --> 00:06:05.319 those strong empirical results are frequently just manifestations of highly 113 00:06:05.360 --> 00:06:10.800 sophisticated pattern matching. They are processing an unimaginably vast amount 114 00:06:10.839 --> 00:06:14.800 of ubiquitous online data and finding the correlations. 115 00:06:14.120 --> 00:06:16.759 They've read every prep book ever published. Millions of them. 116 00:06:17.040 --> 00:06:20.399 But, and this is the crucial distinction, that pattern matching 117 00:06:20.439 --> 00:06:25.160 does not represent deep synthesized understanding. The saturation of the 118 00:06:25.279 --> 00:06:30.079 MMLU proved that our old diagnostic tools were fundamentally incapable 119 00:06:30.480 --> 00:06:34.160 of mapping the computational differences between a machine executing a 120 00:06:34.160 --> 00:06:39.600 statistical operation and a human engaging in true causal comprehension. 121 00:06:39.040 --> 00:06:41.639 Right, which brings us to the mechanics of that illusion, 122 00:06:41.879 --> 00:06:44.879 because what this exposes is just how completely that specific 123 00:06:44.959 --> 00:06:47.639 architecture shatters when you take off the training wheels of 124 00:06:47.639 --> 00:06:48.439 the Internet's data. 125 00:06:48.480 --> 00:06:51.000 It breaks down fundamentally. 
So let's get into the technicals 126 00:06:51.040 --> 00:06:55.199 of vector embeddings, and let's go beyond the basic "it 127 00:06:55.319 --> 00:06:59.360 maps coordinates" explanation that we always hear. What is actually 128 00:06:59.360 --> 00:07:02.480 happening inside that high dimensional space when a model 129 00:07:02.519 --> 00:07:04.240 appears to be thinking? 130 00:07:04.839 --> 00:07:07.040 To understand the illusion, we have to look at the 131 00:07:07.079 --> 00:07:11.519 intersection of vector embeddings, attention mechanisms, and cosine similarity. 132 00:07:11.600 --> 00:07:12.759 Okay, lay it out for us. 133 00:07:12.959 --> 00:07:17.720 When an artificial intelligence processes text, it mathematically maps words 134 00:07:17.759 --> 00:07:20.319 and concepts into a space that can have tens of 135 00:07:20.360 --> 00:07:24.040 thousands of dimensions. Concepts that frequently appear together in the 136 00:07:24.079 --> 00:07:25.600 training data form dense 137 00:07:25.360 --> 00:07:28.639 clusters, so they live in the same mathematical neighborhood. 138 00:07:28.759 --> 00:07:32.319 Yes, the model uses attention heads to weigh the importance 139 00:07:32.319 --> 00:07:34.720 of different words in your prompt and then uses a 140 00:07:34.759 --> 00:07:38.480 mathematical function, often cosine similarity, to find the closest, 141 00:07:38.560 --> 00:07:41.920 most statistically relevant cluster of vectors to generate its response. 142 00:07:42.160 --> 00:07:45.000 So, if I ask it about a widely documented historical 143 00:07:45.000 --> 00:07:47.800 event like the moon landing, it's operating in a highly 144 00:07:47.879 --> 00:07:52.000 dense cluster. There are millions of articles, transcripts, and books 145 00:07:52.079 --> 00:07:56.839 in its training data, linking Apollo eleven, Armstrong, Moon, and 146 00:07:56.959 --> 00:08:01.160 nineteen sixty nine. 
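For readers who want to see the similarity measure the hosts name, here is a small sketch of cosine similarity over toy vectors. The three-dimensional "embeddings" and the `nearest` helper are invented for illustration; real models use far larger learned vectors, and this is not a description of how any particular system generates text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, made up for this example (assumption, not real model output).
embeddings = {
    "apollo": [0.9, 0.8, 0.1],
    "moon": [0.8, 0.9, 0.2],
    "penguin": [0.1, 0.2, 0.9],
}


def nearest(query, table):
    """Return the key whose vector is most similar to the query vector."""
    return max(table, key=lambda key: cosine_similarity(query, table[key]))


# Concepts in the same "neighborhood" score close to 1; distant ones score lower.
print(cosine_similarity(embeddings["apollo"], embeddings["moon"]))
print(cosine_similarity(embeddings["apollo"], embeddings["penguin"]))
print(nearest([0.1, 0.1, 1.0], embeddings))
```

A query that falls inside a tight cluster finds a close match easily; a query far from every stored vector still gets a "nearest" answer, just a poorly supported one.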
The cosine similarity points it directly to 147 00:08:01.240 --> 00:08:06.480 the center of a very tight, well defined mathematical neighborhood. 148 00:08:05.920 --> 00:08:08.879 It's essentially impossible for it to miss. The density of 149 00:08:08.920 --> 00:08:13.800 the data cluster allows for highly accurate statistical retrieval. It 150 00:08:13.839 --> 00:08:14.720 looks like mastery. 151 00:08:14.800 --> 00:08:16.439 It looks like it knows what the moon is, but 152 00:08:16.519 --> 00:08:17.040 it doesn't. 153 00:08:17.439 --> 00:08:20.199 And that is the problem. What happens when you introduce 154 00:08:20.279 --> 00:08:23.639 sparse data? What happens when you ask it to synthesize 155 00:08:23.639 --> 00:08:27.240 concepts that do not reside in a dense mathematical neighborhood? Like 156 00:08:27.199 --> 00:08:28.759 my seven year old's penguin riddle? 157 00:08:29.120 --> 00:08:33.320 Precisely. The cluster density is too low for reliable statistical retrieval. 158 00:08:33.840 --> 00:08:37.000 The attention mechanisms attempt to draw connections between vectors that 159 00:08:37.039 --> 00:08:40.080 are mathematically distant, leading to what we call hallucinations. 160 00:08:40.120 --> 00:08:41.519 Because it's forced to answer. 161 00:08:41.399 --> 00:08:44.080 The system is mathematically forced to predict the next token, 162 00:08:44.320 --> 00:08:46.919 so it wanders into a low density neighborhood and simply 163 00:08:46.960 --> 00:08:51.279 starts generating plausible sounding nonsense based on superficial syntactical patterns. 164 00:08:51.480 --> 00:08:54.720 Because it doesn't actually possess an internal model of reality. 165 00:08:54.759 --> 00:08:56.759 It's just doing high dimensional 166 00:08:56.200 --> 00:08:58.799 geometry. Geometry disguised as language. 167 00:08:58.840 --> 00:09:01.799 And this brings us to Doctor Tung Nguyen's analytical 168 00:09:01.840 --> 00:09:06.159 warning about the anthropomorphic fallacy. 
We are so incredibly wired 169 00:09:06.200 --> 00:09:09.799 evolutionarily to assume that if something can speak to us, 170 00:09:10.240 --> 00:09:14.279 if it uses syntax and grammar, it must think like us. 171 00:09:14.559 --> 00:09:20.080 Doctor Nguyen identifies this pervasive cognitive bias perfectly. Because these 172 00:09:20.159 --> 00:09:24.279 models are successfully completing tasks that were historically designed to 173 00:09:24.320 --> 00:09:29.320 require human cognition, like passing a medical board exam, observers 174 00:09:29.320 --> 00:09:34.279 incorrectly deduce that the machine must possess an equivalent cognitive framework. 175 00:09:34.360 --> 00:09:37.120 We project human thought onto a statistical calculator. 176 00:09:37.200 --> 00:09:40.320 Yes, a machine can predict the next token perfectly in 177 00:09:40.360 --> 00:09:43.960 a highly structured, well documented academic test solely because that 178 00:09:44.080 --> 00:09:46.639 data exists in abundance within its training corpora. 179 00:09:46.679 --> 00:09:47.679 That's all just correlations. 180 00:09:47.759 --> 00:09:50.399 But when confronted with a novel situation that requires actual 181 00:09:50.480 --> 00:09:54.000 contextual synthesis, a scenario it hasn't mapped the mathematical coordinates 182 00:09:54.039 --> 00:09:57.039 for, the statistical probability mapping completely breaks down. 183 00:09:57.279 --> 00:10:00.000 It's the ultimate trick of the light, and it's exactly 184 00:10:00.000 --> 00:10:03.759 what catalyzed this massive global shift in how we 185 00:10:03.840 --> 00:10:08.440 evaluate intelligence. The structural gaps had become so profound that 186 00:10:08.519 --> 00:10:11.240 they could no longer be mapped by isolated teams of 187 00:10:11.279 --> 00:10:14.919 computer scientists just working in their Silicon Valley silos. 188 00:10:15.000 --> 00:10:16.679 They needed a much broader perspective. 
189 00:10:16.799 --> 00:10:21.080 It required a massive interdisciplinary intervention. We are talking about 190 00:10:21.080 --> 00:10:24.840 the engineering of the ultimate metric, something known as Humanity's 191 00:10:24.919 --> 00:10:28.799 Last Exam, or the HLE. And let's clarify that name 192 00:10:28.879 --> 00:10:33.200 right now, because Humanity's Last Exam sounds incredibly melodramatic. 193 00:10:33.279 --> 00:10:35.279 It does sound like a cinematic apocalypse. 194 00:10:35.399 --> 00:10:37.720 It sounds like the title of a dystopian sci fi 195 00:10:37.759 --> 00:10:40.159 novel where we are all plugging into the matrix for 196 00:10:40.200 --> 00:10:40.919 the final time. 197 00:10:41.039 --> 00:10:44.639 It is a provocative title, certainly, but the nomenclature is 198 00:10:44.720 --> 00:10:49.039 purely a clinical rhetorical framing device. It is not an 199 00:10:49.039 --> 00:10:52.320 expression of apocalyptic dread regarding human relevance. 200 00:10:52.399 --> 00:10:54.000 We're not throwing in the towel. Not at all. 201 00:10:54.279 --> 00:10:58.279 Rather, it is a highly specialized initiative designed to systematically 202 00:10:58.320 --> 00:11:03.799 delineate the boundary between algorithmic operations and genuine human reasoning. 203 00:11:04.399 --> 00:11:08.879 The objective is to identify operational strengths and computational vulnerabilities 204 00:11:09.240 --> 00:11:12.879 so that we can engineer safer, more reliable technologies. 205 00:11:13.000 --> 00:11:16.519 It is about understanding exactly where the machines fail to 206 00:11:16.559 --> 00:11:18.240 synthesize reality. Exactly. 207 00:11:18.320 --> 00:11:19.600 It's about precision. And the 208 00:11:19.559 --> 00:11:23.159 scale of the consortium that built this test is just staggering. 209 00:11:23.519 --> 00:11:27.480 We are looking at nearly one thousand researchers globally, and crucially, 210 00:11:27.559 --> 00:11:31.399 they weren't just computer engineers. 
They realized that generalized domains 211 00:11:31.440 --> 00:11:34.559 were totally insufficient to test for true understanding. 212 00:11:34.679 --> 00:11:37.320 To break a statistical machine, you have to force a 213 00:11:37.399 --> 00:11:39.799 fusion of disparate knowledge bases. 214 00:11:39.960 --> 00:11:44.080 So they integrated historians, physicists, linguists, and medical researchers right 215 00:11:44.120 --> 00:11:45.679 alongside the computer scientists. 216 00:11:45.879 --> 00:11:50.399 That interdisciplinary composition is critical because conceptual integration is exactly 217 00:11:50.399 --> 00:11:55.200 where the statistical probability mapping of current architectures falters. Advanced 218 00:11:55.320 --> 00:11:59.399 human expertise is uniquely characterized by the ability to fuse disparate, 219 00:11:59.519 --> 00:12:04.080 seemingly unrelated domains of knowledge, drawing connections across disciplines. Yes. 220 00:12:04.639 --> 00:12:07.200 To test for this, the consortium published a highly rigorous 221 00:12:07.240 --> 00:12:11.559 assessment in the journal Nature, specifically under the DOI ten 222 00:12:11.639 --> 00:12:15.039 point one zero three eight four one five eight six 223 00:12:15.679 --> 00:12:19.159 zero two five zero nine nine six two four. This 224 00:12:19.240 --> 00:12:23.120 examination consists of exactly two thousand, five hundred questions, and 225 00:12:23.159 --> 00:12:26.919 it is bound by incredibly strict, unforgiving methodological constraints. 226 00:12:27.039 --> 00:12:29.639 Let's look at those constraints, because they are brilliantly designed 227 00:12:29.679 --> 00:12:32.759 to trap an AI. The first constraint is binary grading. 228 00:12:33.240 --> 00:12:37.000 Every single query among those twenty five hundred questions must 229 00:12:37.039 --> 00:12:40.360 possess exactly one clear, verifiable answer. 230 00:12:40.600 --> 00:12:42.679 There is no partial credit. None. 
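The binary grading constraint described above can be sketched as an exact-match scorer. This is a minimal illustration under stated assumptions: the transcript does not describe how the exam's answers are actually compared, so the whitespace and case normalization in `normalize`, and the function names themselves, are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored (an assumption)."""
    return " ".join(text.strip().lower().split())


def grade_binary(predicted, reference):
    """Return 1 only if the prediction matches the single reference answer, else 0. No partial credit."""
    return int(normalize(predicted) == normalize(reference))


def accuracy(pairs):
    """Fraction of (predicted, reference) pairs graded as correct."""
    return sum(grade_binary(p, r) for p, r in pairs) / len(pairs)


# An eloquent but non-matching answer scores exactly the same as a blank one: zero.
print(grade_binary("  Egyptian Grain ", "egyptian grain"))
print(grade_binary("It was probably some kind of imported grain", "egyptian grain"))
```

The design choice worth noticing is that fluency never enters the score, which is the point the hosts make about removing subjective interpretation.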
231 00:12:43.000 --> 00:12:45.879 There is no room for a beautifully written, eloquent essay 232 00:12:45.919 --> 00:12:48.360 that dances around the topic and sounds smart but says 233 00:12:48.440 --> 00:12:49.399 absolutely nothing. 234 00:12:49.480 --> 00:12:53.200 This binary constraint is absolutely essential for empirical validity. One 235 00:12:53.200 --> 00:12:56.399 of the greatest challenges in evaluating open ended algorithmic generation 236 00:12:56.519 --> 00:12:58.080 is subjective human interpretation. 237 00:12:58.240 --> 00:13:00.720 We get tripped up by good grammar. We do. If 238 00:13:00.559 --> 00:13:04.279 a model generates a highly articulate response, human evaluators could 239 00:13:04.279 --> 00:13:07.320 be easily deceived, even if the output is factually hallucinatory. 240 00:13:07.919 --> 00:13:11.440 The model's syntactical fluency masks its lack of actual comprehension. 241 00:13:11.679 --> 00:13:14.879 It speaks with so much confidence. 242 00:13:14.480 --> 00:13:18.919 But by enforcing strict binary grading, the test entirely eliminates 243 00:13:18.960 --> 00:13:23.960 that subjective vulnerability. The machine either successfully executed the complex 244 00:13:24.039 --> 00:13:27.679 logical deduction to arrive at the single verifiable truth, or 245 00:13:27.720 --> 00:13:29.200 it failed entirely. 246 00:13:29.039 --> 00:13:32.000 It strips away the AI's ability to smooth talk its 247 00:13:32.039 --> 00:13:34.960 way out of a corner. But the second constraint is 248 00:13:35.000 --> 00:13:39.519 the real killer: absolute immunity to rapid online search queries. 249 00:13:39.600 --> 00:13:41.639 This is where the paradigm shifts entirely. 250 00:13:41.879 --> 00:13:44.679 By engineering the test to be immune to basic search 251 00:13:44.720 --> 00:13:48.559 engine retrieval, the consortium forces the system entirely away from 252 00:13:48.559 --> 00:13:51.840 its primary operational strength. 
If an answer can be located 253 00:13:51.840 --> 00:13:55.120 as a contiguous factual string within an indexed database anywhere 254 00:13:55.159 --> 00:13:58.559 on the Internet, it completely fails to test structural comprehension. 255 00:13:58.679 --> 00:14:01.440 It just proves the machine can look things up incredibly fast. 256 00:14:01.639 --> 00:14:04.120 Right, if I can Google the exact phrase, it's not 257 00:14:04.200 --> 00:14:06.360 a good test of intelligence. Exactly. 258 00:14:06.759 --> 00:14:09.679 If the answer exists in a unified format within the 259 00:14:09.720 --> 00:14:12.960 training data, the model can simply rely on that high 260 00:14:13.039 --> 00:14:17.759 density vector cluster we discussed earlier. Therefore, the questions designed 261 00:14:17.759 --> 00:14:22.919 for the HLE demand multi step logical deduction, intricate spatial reasoning, 262 00:14:23.360 --> 00:14:26.480 or the synthesis of deeply obscured information that does not 263 00:14:26.639 --> 00:14:28.840 exist in a single location anywhere. 264 00:14:28.919 --> 00:14:30.879 It forces them to build something new. 265 00:14:31.080 --> 00:14:34.240 The system must piece together fragments of knowledge to derive 266 00:14:34.279 --> 00:14:36.399 an answer that hasn't been explicitly written 267 00:14:36.080 --> 00:14:38.960 down before. And to guarantee that these constraints were actually met, 268 00:14:39.039 --> 00:14:42.480 the consortium implemented an adversarial pre testing phase that I 269 00:14:42.559 --> 00:14:46.360 just find brilliant. They built a filtration protocol. Imagine a 270 00:14:46.519 --> 00:14:50.519 massive room of these thousand researchers, and every single proposed 271 00:14:50.600 --> 00:14:54.320 question was systematically administered to the leading state of the 272 00:14:54.399 --> 00:14:56.840 art artificial intelligence systems available at the 273 00:14:56.759 --> 00:14:58.360 time. All the top tier models. 
274 00:14:58.399 --> 00:15:00.960 If any of those models managed to produce the correct answer, 275 00:15:01.240 --> 00:15:05.200 that specific question was instantly destroyed, ripped up, and thrown out. 276 00:15:05.440 --> 00:15:09.000 This pre testing methodology is what ensures the exam remains 277 00:15:09.000 --> 00:15:14.080 perpetually stationed just beyond the frontier of current computational performance. 278 00:15:14.840 --> 00:15:17.000 It does not measure what the models can already do. 279 00:15:17.879 --> 00:15:21.360 It maps the exact perimeter of algorithmic ignorance. 280 00:15:21.519 --> 00:15:23.919 The perimeter of ignorance. I love that phrasing. 281 00:15:24.159 --> 00:15:28.759 It defines the precise boundary where statistical probability fails and 282 00:15:28.840 --> 00:15:30.320 causal deduction is required. 283 00:15:30.840 --> 00:15:33.559 This brings us to a specific area where this boundary 284 00:15:33.559 --> 00:15:38.159 mapping is most devastating: the deterministic vulnerability of these models. 285 00:15:38.840 --> 00:15:41.919 Let's look at the objective contributions of Doctor Tung Nguyen from 286 00:15:42.000 --> 00:15:45.240 Texas A and M University's Department of Computer Science and Engineering. 287 00:15:45.360 --> 00:15:47.240 He was a major player in this consortium. 288 00:15:47.320 --> 00:15:50.240 He authored seventy three questions for the assessment, which was 289 00:15:50.360 --> 00:15:54.279 the second highest individual contribution globally, and his queries were 290 00:15:54.360 --> 00:15:58.639 highly concentrated within the domains of rigorous mathematics and computer science. 291 00:15:59.159 --> 00:16:02.440 Doctor Nguyen's contributions are vital because they isolate a critical 292 00:16:02.519 --> 00:16:07.600 vulnerability inherent in all probabilistic models, the fundamental conflict between 293 00:16:07.639 --> 00:16:10.480 stochastic prediction and deterministic execution. 
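The adversarial pre-testing filter described a few cues earlier (discard any question that a current model answers correctly) can be sketched as follows. Everything here is a stand-in: the `Model` interface, the two toy models, and the exact-match comparison are assumptions made for illustration, not the consortium's actual tooling.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]   # hypothetical interface: question text in, answer text out
Item = Tuple[str, str]         # (question, reference answer)


def same_answer(a: str, b: str) -> bool:
    """Case-insensitive exact match after trimming whitespace (an assumed comparison rule)."""
    return a.strip().lower() == b.strip().lower()


def survives(item: Item, models: List[Model]) -> bool:
    """A question survives only if no model reproduces its reference answer."""
    question, reference = item
    return not any(same_answer(model(question), reference) for model in models)


def filter_candidates(candidates: List[Item], models: List[Model]) -> List[Item]:
    """Keep only the questions that every model gets wrong."""
    return [item for item in candidates if survives(item, models)]


# Toy stand-ins for "leading models" (invented for this example).
def lookup_model(question: str) -> str:
    known = {"What year was Apollo 11?": "1969"}
    return known.get(question, "unknown")


def guessing_model(question: str) -> str:
    return "42"


candidates = [
    ("What year was Apollo 11?", "1969"),
    ("Hypothetical expert question", "expert answer"),
]
kept = filter_candidates(candidates, [lookup_model, guessing_model])
print(kept)
```

Because the filter is defined relative to a fixed set of models, the surviving question set reflects what those models could not do at the time it was run.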
294 00:16:10.679 --> 00:16:12.360 Okay, let's break that down for the listener. 295 00:16:12.600 --> 00:16:18.200 Mathematical and computational logic requires step by step rigid determinism. 296 00:16:18.879 --> 00:16:22.600 A stochastic prediction model cannot navigate a rigorous mathematical proof. 297 00:16:22.840 --> 00:16:26.360 So say you give the AI a highly complex, fifty 298 00:16:26.360 --> 00:16:28.600 step mathematical proof that has never been solved in this 299 00:16:28.639 --> 00:16:32.039 specific way before. If you are a machine learning model 300 00:16:32.080 --> 00:16:36.519 relying on probabilistic guessing, just predicting the most likely next 301 00:16:36.519 --> 00:16:39.759 mathematical operation based on past training data, you might get 302 00:16:39.799 --> 00:16:42.840 step one right with ninety nine point nine percent certainty. 303 00:16:42.919 --> 00:16:44.240 You might even get step two right. 304 00:16:44.440 --> 00:16:47.399 But eventually you are going to make a tiny minor 305 00:16:47.559 --> 00:16:51.519 variable error because you are guessing, you're not deducing. Precisely. 306 00:16:51.879 --> 00:16:54.759 And in a rigorous mathematical proof, what happens when you 307 00:16:54.799 --> 00:16:57.600 introduce a single minor variable error at step fourteen? 308 00:16:57.720 --> 00:16:59.799 The entire logical structure collapses. 309 00:16:59.840 --> 00:17:03.159 The error compounds exponentially. A stochastic model might get the 310 00:17:03.159 --> 00:17:06.000 first steps right because those operational sequences are common in 311 00:17:06.000 --> 00:17:08.799 its training data, but the moment it has to logically deduce 312 00:17:08.799 --> 00:17:12.480 a novel sequence, its probabilistic nature forces a guess. The 313 00:17:12.519 --> 00:17:15.359 guess introduces an error, and the final answer is completely wrong. 314 00:17:15.519 --> 00:17:19.119 You're building a fifty story house of cards in a windstorm. 
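The compounding-error argument above can be made concrete with a short calculation. This is a simplified sketch that assumes every step succeeds independently with the same probability, which real proofs need not satisfy; `chain_success_probability` is a hypothetical name used only here.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independent steps."""
    return per_step_accuracy ** steps


# Fifty steps, as in the hosts' example; the accuracies are illustrative values.
for accuracy in (0.999, 0.99, 0.95):
    print(accuracy, round(chain_success_probability(accuracy, 50), 3))
```

Under this assumption, a fifty-step chain at 99.9 percent per-step accuracy still completes correctly roughly 95 percent of the time, while at 95 percent per step the chance of a fully correct chain drops below one in ten. The collapse the hosts describe depends heavily on how far per-step reliability falls once the model leaves familiar ground.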
315 00:17:19.440 --> 00:17:22.759 It just takes one microscopic miscalculation at the base and 316 00:17:22.799 --> 00:17:26.000 the whole thing comes down. We often think of computers 317 00:17:26.000 --> 00:17:29.559 as being inherently perfect at math, like a giant calculator. But 318 00:17:29.559 --> 00:17:32.119 these large language models are not calculators. 319 00:17:32.400 --> 00:17:35.160 They are language prediction engines trying to speak math. 320 00:17:35.319 --> 00:17:37.519 That is exactly what they are doing, and that is 321 00:17:37.559 --> 00:17:41.160 why they stumble when forced out of language and into pure, 322 00:17:41.759 --> 00:17:43.559 unforgiving deterministic logic. 323 00:17:44.039 --> 00:17:47.960 Now, to truly comprehend the massive cognitive divide that this 324 00:17:48.039 --> 00:17:50.880 exam is measuring, we need to spend some serious time 325 00:17:50.960 --> 00:17:55.400 analyzing the typology of the expert level assessment domains. This 326 00:17:55.440 --> 00:17:57.599 is where it gets incredibly fascinating. 327 00:17:57.640 --> 00:17:59.440 The domains themselves are extraordinary. 328 00:17:59.799 --> 00:18:02.880 Let's look at three specific examples of the types of questions that 329 00:18:02.960 --> 00:18:07.400 survived that brutal filtration process, and these are completely wild. 330 00:18:07.799 --> 00:18:12.160 Let's start with domain one: linguistic synthesis, specifically the translation 331 00:18:12.279 --> 00:18:13.920 of ancient Palmyrene inscriptions. 332 00:18:14.480 --> 00:18:18.960 Ancient Palmyrene represents a dialect that severely disrupts standard computational processing. 333 00:18:19.640 --> 00:18:22.440 It is an extinct language from the ancient city of Palmyra, 334 00:18:22.680 --> 00:18:24.559 located in present day Syria, 335 00:18:24.519 --> 00:18:26.799 a vital oasis hub on the Silk Road. 
336 00:18:27.079 --> 00:18:34.000 Crucially, its linguistic record possesses highly limited fragmentary representation. Because 337 00:18:34.039 --> 00:18:37.720 it is so obscure, it completely lacks the massive digital 338 00:18:37.759 --> 00:18:40.799 corpus required to train statistical engines effectively. 339 00:18:41.079 --> 00:18:43.799 Right, there just aren't millions of pages of ancient Palmyrene 340 00:18:43.839 --> 00:18:46.920 sitting on Wikipedia for the AI to ingest and map 341 00:18:47.000 --> 00:18:51.680 into its multidimensional vector space. The cluster density is practically zero. 342 00:18:51.880 --> 00:18:54.440 There is no broad pattern to recall. 343 00:18:54.160 --> 00:18:58.119 So when the AI encounters this dialect, its cosine similarity 344 00:18:58.119 --> 00:19:01.200 functions just hit a brick wall. But how does a 345 00:19:01.319 --> 00:19:04.880 human expert handle this? Because a human epigrapher doesn't just 346 00:19:04.920 --> 00:19:06.240 throw up their hands and give up when they don't 347 00:19:06.279 --> 00:19:07.200 have enough data points. 348 00:19:07.319 --> 00:19:09.720 No, they engage in something called epigraphic deduction. 349 00:19:10.039 --> 00:19:11.680 Let's walk through exactly what that looks like. 350 00:19:11.799 --> 00:19:16.640 Epigraphic deduction is a masterful example of multimodal contextual reasoning. 351 00:19:17.200 --> 00:19:21.000 A human epigrapher cross references disparate fields of knowledge that 352 00:19:21.359 --> 00:19:24.440 on the surface have nothing to do with linguistics. Let's 353 00:19:24.440 --> 00:19:26.920 say they are looking at a partially destroyed stone tablet 354 00:19:27.119 --> 00:19:29.759 containing a tax record from the year two fifty AD. 355 00:19:30.240 --> 00:19:31.400 Okay, setting the scene. 356 00:19:31.359 --> 00:19:35.039 The word indicating this specific tax commodity is chipped away. 
357 00:19:35.839 --> 00:19:39.559 An AI cannot statistically predict the missing word because the 358 00:19:39.640 --> 00:19:41.279 linguistic data is too sparse. 359 00:19:41.839 --> 00:19:44.960 But the human epigrapher steps back. They look at the 360 00:19:45.039 --> 00:19:47.799 chisel marks on the stone and realize they match the 361 00:19:47.839 --> 00:19:51.319 craftsmanship of a specific merchant class. Exactly. 362 00:19:51.400 --> 00:19:54.640 They expand the context window to reality itself. 363 00:19:54.839 --> 00:19:58.640 They analyze the regional historical context. They know that around 364 00:19:58.680 --> 00:20:02.200 two hundred and fifty AD there was a massive drought 365 00:20:02.240 --> 00:20:05.839 in the region that decimated local agriculture, which meant trade 366 00:20:05.920 --> 00:20:09.400 routes had to shift significantly to import grain from Egypt. 367 00:20:09.680 --> 00:20:13.079 They know about the political shifts, perhaps a specific marriage 368 00:20:13.119 --> 00:20:16.480 between a Palmyrene noble and a Roman patrician that altered 369 00:20:16.480 --> 00:20:18.519 tariff laws for that exact decade. 370 00:20:18.640 --> 00:20:21.559 So the human expert understands the human context in which 371 00:20:21.559 --> 00:20:22.759 the inscription was created. 372 00:20:23.039 --> 00:20:26.519 They use their causal understanding of history, geology, economics, and 373 00:20:26.559 --> 00:20:30.839 politics to infer the missing linguistic data. They deduce that 374 00:20:30.839 --> 00:20:33.839 the missing word must be the specific term for Egyptian grain, 375 00:20:34.240 --> 00:20:36.960 based on the convergence of all these non-linguistic variables. 376 00:20:37.160 --> 00:20:39.759 They solve the puzzle where half the pieces are missing 377 00:20:40.039 --> 00:20:42.920 by understanding the history of the factory that made the puzzle. 378 00:20:43.079 --> 00:20:44.480 That is a brilliant way to phrase it.
379 00:20:44.680 --> 00:20:49.319 The AI architecture completely lacks this multimodal contextual reasoning. Its 380 00:20:49.359 --> 00:20:53.359 standard statistical models fundamentally fail to synthesize the ancient texts 381 00:20:53.400 --> 00:20:56.839 because the variables involved in ancient political shifts and ecological 382 00:20:56.880 --> 00:21:00.640 disasters entirely evade their mathematical parameterization. 383 00:21:01.079 --> 00:21:04.319 They cannot compute the causal link between a drought and 384 00:21:04.359 --> 00:21:07.279 a missing chisel mark because those concepts don't live in 385 00:21:07.319 --> 00:21:10.400 the same mathematical neighborhood in their training data. 386 00:21:10.759 --> 00:21:14.559 This fundamental integration deficit leads us perfectly to the second domain, 387 00:21:14.599 --> 00:21:17.880 which forces a completely different kind of synthesis: spatial and 388 00:21:17.880 --> 00:21:21.880 biological reasoning. The designated task in this domain involves the 389 00:21:21.960 --> 00:21:27.319 identification of microscopic anatomical structures within avian biology. 390 00:21:26.839 --> 00:21:30.160 Specifically the complex physiological taxonomy of birds. 391 00:21:30.240 --> 00:21:32.559 Okay, so we are shifting from dead languages on the 392 00:21:32.599 --> 00:21:38.039 Silk Road to microscopic bird anatomy. Talk about interdisciplinary. So 393 00:21:38.279 --> 00:21:42.119 why does bird anatomy break a multi-billion-dollar AI? 394 00:21:42.480 --> 00:21:45.359 It comes down to the operational difficulty of dealing with 395 00:21:45.599 --> 00:21:50.200 messy real-world data. The Nature paper task requires deriving 396 00:21:50.240 --> 00:21:56.359 three-dimensional spatial relationships purely from chaotic two-dimensional microscopic imaging. 397 00:21:56.480 --> 00:21:59.119 Okay, elaborate on that operational difficulty.
398 00:21:59.200 --> 00:22:03.559 The core computational challenge is that the system must map abstract, 399 00:22:03.960 --> 00:22:10.599 obscure taxonomic classifications onto highly variable, often visually unclear, microscopic data. 400 00:22:11.359 --> 00:22:13.960 When a human biological researcher looks at a slide of 401 00:22:13.960 --> 00:22:17.000