WEBVTT

1
00:00:01.199 --> 00:00:06.200
<v Speaker 1>Welcome to the Sentient Code, where intelligence is engineered, autonomy

2
00:00:06.280 --> 00:00:10.439
<v Speaker 1>is emerging, and the line between human and machine grows thinner.

3
00:00:10.800 --> 00:00:15.359
<v Speaker 1>Each episode, we decode the algorithms, explore the robotics, and

4
00:00:15.439 --> 00:00:23.000
<v Speaker 1>examine the ideas shaping the future of artificial minds.

5
00:00:23.920 --> 00:00:28.519
<v Speaker 2>I spent twenty minutes yesterday, literally twenty minutes, trying to

6
00:00:28.559 --> 00:00:32.280
<v Speaker 2>get a supposedly state-of-the-art AI to figure

7
00:00:32.320 --> 00:00:34.719
<v Speaker 2>out this completely absurd riddle.

8
00:00:34.960 --> 00:00:38.439
<v Speaker 3>Oh, a riddle. Let me guess. It didn't go exactly

9
00:00:38.439 --> 00:00:39.000
<v Speaker 3>as planned.

10
00:00:39.240 --> 00:00:41.600
<v Speaker 2>No, it was a disaster. My seven-year-old told

11
00:00:41.640 --> 00:00:43.640
<v Speaker 2>it to me. It was something about a penguin, a

12
00:00:43.679 --> 00:00:47.119
<v Speaker 2>flashlight and a jar of peanut butter.

13
00:00:47.320 --> 00:00:49.600
<v Speaker 3>Right, not exactly standard training data.

14
00:00:49.280 --> 00:00:55.560
<v Speaker 2>Exactly. And the AI confidently generated this five-paragraph, highly articulate,

15
00:00:55.640 --> 00:00:58.719
<v Speaker 2>just beautifully written essay that completely missed the punchline. I

16
00:00:58.719 --> 00:01:02.439
<v Speaker 2>mean, it fundamentally violated basic physics. Confidently incorrect.

17
00:01:02.439 --> 00:01:04.680
<v Speaker 3>That is the hallmark of the current architecture.

18
00:01:04.799 --> 00:01:07.319
<v Speaker 2>Yes, and yet I open up my feed right after that,

19
00:01:07.400 --> 00:01:10.439
<v Speaker 2>and the headlines are absolutely screaming. They're saying, this exact

20
00:01:10.480 --> 00:01:14.640
<v Speaker 2>same architecture is about to autonomously replace human doctors and

21
00:01:14.760 --> 00:01:19.879
<v Speaker 2>lawyers and engineers. There is this massive dizzying disconnect happening

22
00:01:19.959 --> 00:01:23.040
<v Speaker 2>right now. For you listening, you are constantly surrounded by

23
00:01:23.079 --> 00:01:26.640
<v Speaker 2>these claims that artificial intelligence is reaching human levels of comprehension.

24
00:01:26.879 --> 00:01:29.159
<v Speaker 3>You hear about them acing the bar exam.

25
00:01:29.000 --> 00:01:31.920
<v Speaker 2>Breezing through advanced medical licensing tests.

26
00:01:31.480 --> 00:01:36.640
<v Speaker 3>Mastering the exact standardized testing frameworks we've used for generations.

27
00:01:36.799 --> 00:01:39.719
<v Speaker 2>Right, it paints this incredibly vivid picture for all of us,

28
00:01:40.040 --> 00:01:43.000
<v Speaker 2>a picture of algorithms that are practically breathing, thinking, and

29
00:01:43.120 --> 00:01:45.599
<v Speaker 2>understanding the world exactly the way you and I do.

30
00:01:46.159 --> 00:01:48.879
<v Speaker 2>But, and this is the big question for today, what

31
00:01:49.000 --> 00:01:51.680
<v Speaker 2>if the very yardsticks we have been using to measure

32
00:01:51.760 --> 00:01:54.680
<v Speaker 2>artificial intelligence are just fundamentally broken?

33
00:01:54.760 --> 00:01:57.079
<v Speaker 3>They are obsolete, completely broken.

34
00:01:57.120 --> 00:02:00.400
<v Speaker 2>Today we are exploring a massive paradigm shift. We are

35
00:02:00.400 --> 00:02:03.959
<v Speaker 2>looking at the core reality that those traditional academic benchmarks

36
00:02:03.959 --> 00:02:07.079
<v Speaker 2>have completely and utterly lost their diagnostic utility.

37
00:02:07.400 --> 00:02:10.439
<v Speaker 3>That is precisely our mission today. We have to completely

38
00:02:10.479 --> 00:02:13.400
<v Speaker 3>deconstruct how we evaluate the artificial mind. We are no

39
00:02:13.439 --> 00:02:18.639
<v Speaker 3>longer just talking about machines getting smarter in some vague sense.

40
00:02:18.680 --> 00:02:21.439
<v Speaker 2>We're stepping into a high stakes intellectual mystery.

41
00:02:21.560 --> 00:02:24.800
<v Speaker 3>We are, because we are examining a profound shift in

42
00:02:24.840 --> 00:02:28.680
<v Speaker 3>how we test computational intelligence. We're transitioning entirely away from

43
00:02:28.719 --> 00:02:34.039
<v Speaker 3>testing generalized models on standard educational curricula. Instead, we're looking

44
00:02:34.080 --> 00:02:37.000
<v Speaker 3>at how we evaluate them against the absolute limits of

45
00:02:37.120 --> 00:02:39.680
<v Speaker 3>highly specialized human expertise, at.

46
00:02:39.560 --> 00:02:42.879
<v Speaker 2>The very frontier of scientific and historical discovery.

47
00:02:43.039 --> 00:02:47.919
<v Speaker 3>Exactly. The central theme here is understanding the stark delineation

48
00:02:48.159 --> 00:02:52.840
<v Speaker 3>between the statistical probability operations of a machine, the pattern matching, right,

49
00:02:52.840 --> 00:02:57.840
<v Speaker 3>the pattern matching, and the actualized, deep causal reasoning of

50
00:02:57.840 --> 00:03:01.439
<v Speaker 3>a human mind. It is about separating the illusion of

51
00:03:01.479 --> 00:03:04.960
<v Speaker 3>comprehension from genuine contextual synthesis.

52
00:03:05.039 --> 00:03:07.960
<v Speaker 2>Okay, let's unpack this collapse of the old standard, because

53
00:03:08.000 --> 00:03:09.599
<v Speaker 2>I think we all know how these models work at

54
00:03:09.599 --> 00:03:13.080
<v Speaker 2>a baseline, right? They are incredibly sophisticated.

55
00:03:12.319 --> 00:03:16.319
<v Speaker 3>Autocorrects navigating vector spaces to predict the next token. Exactly.

56
00:03:16.360 --> 00:03:18.599
<v Speaker 2>But for a long time, the gold standard, the ultimate

57
00:03:18.639 --> 00:03:22.680
<v Speaker 2>proving ground for these architectures was something called the MMLU,

58
00:03:23.039 --> 00:03:26.199
<v Speaker 2>the Massive Multitask Language Understanding Exam.

59
00:03:26.280 --> 00:03:29.120
<v Speaker 3>If you're building a multi-billion-dollar machine learning model,

60
00:03:29.400 --> 00:03:30.520
<v Speaker 3>this was your benchmark.

61
00:03:30.599 --> 00:03:34.840
<v Speaker 2>It covered this incredibly broad generalized knowledge base, everything from

62
00:03:34.879 --> 00:03:39.560
<v Speaker 2>basic high school European history to complex professional-level medical

63
00:03:39.560 --> 00:03:42.039
<v Speaker 2>diagnostics, microeconomics, tort law.

64
00:03:42.319 --> 00:03:42.520
<v Speaker 1>Right.

65
00:03:42.599 --> 00:03:44.280
<v Speaker 2>It was supposed to be the ultimate test of an

66
00:03:44.280 --> 00:03:46.479
<v Speaker 2>AI's generalized knowledge.

67
00:03:46.000 --> 00:03:50.240
<v Speaker 3>And when the MMLU was initially introduced, it did provide

68
00:03:50.280 --> 00:03:54.159
<v Speaker 3>a highly effective metric. A percentage increase in accuracy on

69
00:03:54.199 --> 00:03:58.439
<v Speaker 3>that exam directly correlated with tangible architectural improvements in the

70
00:03:58.479 --> 00:03:59.159
<v Speaker 3>neural networks.

71
00:03:59.240 --> 00:04:01.360
<v Speaker 2>It gave the developers a clear roadmap, a

72
00:04:01.400 --> 00:04:02.680
<v Speaker 2>clear empirical trajectory.

73
00:04:02.759 --> 00:04:07.360
<v Speaker 3>Yes, But then we witnessed a phenomenon that completely destabilized

74
00:04:07.360 --> 00:04:10.919
<v Speaker 3>this metric, the exponential scaling of neural networks.

75
00:04:11.120 --> 00:04:13.599
<v Speaker 2>The tech giants just started throwing hardware at it.

76
00:04:13.719 --> 00:04:17.800
<v Speaker 3>Massive hardware. Developers began massively increasing both the parameter counts

77
00:04:17.800 --> 00:04:19.759
<v Speaker 3>of these models and the sheer volume of their training

78
00:04:19.839 --> 00:04:23.279
<v Speaker 3>data sets. They were essentially scraping the entire indexed Internet,

79
00:04:23.360 --> 00:04:23.839
<v Speaker 3>and as a

80
00:04:23.800 --> 00:04:26.720
<v Speaker 2>direct result of that scaling, the systems began achieving near

81
00:04:26.720 --> 00:04:28.360
<v Speaker 2>perfect scores on the MMLU.

82
00:04:28.519 --> 00:04:30.360
<v Speaker 3>They effectively maxed out the test.

83
00:04:30.439 --> 00:04:34.240
<v Speaker 2>Which creates a massive structural flaw. If you have a

84
00:04:34.279 --> 00:04:38.279
<v Speaker 2>diagnostic tool, any kind of test, and it routinely starts

85
00:04:38.279 --> 00:04:41.800
<v Speaker 2>returning the maximum possible values across all these diverse subjects,

86
00:04:42.079 --> 00:04:44.240
<v Speaker 2>it stops giving you any meaningful variance.

87
00:04:44.399 --> 00:04:46.879
<v Speaker 3>It goes blind. It is a phenomenon known as saturation.

88
00:04:47.399 --> 00:04:52.160
<v Speaker 2>Saturation. To put this in perspective for you listening, imagine

89
00:04:52.279 --> 00:04:55.199
<v Speaker 2>you are a sports scientist. You're trying to test the

90
00:04:55.399 --> 00:04:59.759
<v Speaker 2>absolute physical limits of an elite Olympic decathlete, a gold

91
00:04:59.800 --> 00:05:03.120
<v Speaker 2>medalist, right? But the only diagnostic tool you have in

92
00:05:03.160 --> 00:05:08.000
<v Speaker 2>your lab is the standard middle school presidential fitness test.

93
00:05:07.879 --> 00:05:09.480
<v Speaker 3>The one we all took in seventh grade.

94
00:05:09.560 --> 00:05:12.480
<v Speaker 2>Exactly. Sure, the Olympian is going to get a perfect score.

95
00:05:12.560 --> 00:05:14.120
<v Speaker 2>They're going to do all the pull-ups, run the

96
00:05:14.120 --> 00:05:17.079
<v Speaker 2>shuttle sprint, stretch past their toes without breaking a sweat.

97
00:05:17.600 --> 00:05:20.959
<v Speaker 2>But that perfect score tells you absolutely nothing about their

98
00:05:21.120 --> 00:05:23.319
<v Speaker 2>actual absolute physical limits.

99
00:05:23.600 --> 00:05:26.839
<v Speaker 3>It doesn't tell you how their cardiovascular system handles the

100
00:05:26.879 --> 00:05:28.800
<v Speaker 3>complex stress of a decathlon.

101
00:05:28.759 --> 00:05:32.399
<v Speaker 2>Or how they adapt to unpredictable physical challenges. It just tells

102
00:05:32.439 --> 00:05:34.000
<v Speaker 2>you that they are stronger than a twelve-year-old.

103
00:05:34.319 --> 00:05:37.120
<v Speaker 2>The test is saturated. It ceases to provide any insight

104
00:05:37.160 --> 00:05:41.560
<v Speaker 2>into the underlying capabilities, or more importantly, the limitations of

105
00:05:41.600 --> 00:05:42.759
<v Speaker 2>the system being tested.

106
00:05:43.160 --> 00:05:46.279
<v Speaker 3>What's fascinating here is how the saturation exposes a deep

107
00:05:46.399 --> 00:05:50.120
<v Speaker 3>fundamental difference between high performance on tasks designed by humans

108
00:05:50.600 --> 00:05:53.439
<v Speaker 3>and actual generalizable intelligence.

109
00:05:53.000 --> 00:05:55.480
<v Speaker 2>Because getting an A on a human test doesn't mean

110
00:05:55.519 --> 00:05:57.279
<v Speaker 2>you think like a human. Exactly.

111
00:05:57.680 --> 00:06:00.879
<v Speaker 3>When these models achieve those near perfect scores on the MMLU,

112
00:06:01.199 --> 00:06:05.319
<v Speaker 3>those strong empirical results are frequently just manifestations of highly

113
00:06:05.360 --> 00:06:10.800
<v Speaker 3>sophisticated pattern matching. They are processing an unimaginably vast amount

114
00:06:10.839 --> 00:06:14.800
<v Speaker 3>of ubiquitous online data and finding the correlations.

115
00:06:14.120 --> 00:06:16.759
<v Speaker 2>They've read every prep book ever published. Millions of them.

116
00:06:17.040 --> 00:06:20.399
<v Speaker 2>But, and this is the crucial distinction, that pattern matching

117
00:06:20.439 --> 00:06:25.160
<v Speaker 2>does not represent deep synthesized understanding. The saturation of the

118
00:06:25.279 --> 00:06:30.079
<v Speaker 2>MMLU proved that our old diagnostic tools were fundamentally incapable

119
00:06:30.480 --> 00:06:34.160
<v Speaker 2>of mapping the computational differences between a machine executing a

120
00:06:34.160 --> 00:06:39.600
<v Speaker 2>statistical operation and a human engaging in true causal comprehension.

121
00:06:39.040 --> 00:06:41.639
<v Speaker 3>Right, which brings us to the mechanics of that illusion.

122
00:06:41.879 --> 00:06:44.879
<v Speaker 3>Because what this exposes is just how completely that specific

123
00:06:44.959 --> 00:06:47.639
<v Speaker 3>architecture shatters when you take off the training wheels of

124
00:06:47.639 --> 00:06:48.439
<v Speaker 3>the Internet's data.

125
00:06:48.480 --> 00:06:51.000
<v Speaker 2>It breaks down fundamentally. So let's get into the technicals

126
00:06:51.040 --> 00:06:55.199
<v Speaker 2>of vector embeddings, and let's go beyond the basic "it

127
00:06:55.319 --> 00:06:59.360
<v Speaker 2>maps coordinates" explanation that we always hear. What is actually

128
00:06:59.360 --> 00:07:02.480
<v Speaker 2>happening inside that high-dimensional space when a model

129
00:07:02.519 --> 00:07:04.240
<v Speaker 2>appears to be thinking?

130
00:07:04.839 --> 00:07:07.040
<v Speaker 3>To understand the illusion, we have to look at the

131
00:07:07.079 --> 00:07:11.519
<v Speaker 3>intersection of vector embeddings, attention mechanisms, and cosine similarity.

132
00:07:11.600 --> 00:07:12.759
<v Speaker 2>Okay, lay it out for us.

133
00:07:12.959 --> 00:07:17.720
<v Speaker 3>When an artificial intelligence processes text, it mathematically maps words

134
00:07:17.759 --> 00:07:20.319
<v Speaker 3>and concepts into a space that can have tens of

135
00:07:20.360 --> 00:07:24.040
<v Speaker 3>thousands of dimensions. Concepts that frequently appear together in the

136
00:07:24.079 --> 00:07:25.600
<v Speaker 3>training data form dense clusters.

137
00:07:25.360 --> 00:07:28.639
<v Speaker 2>So they live in the same mathematical neighborhood.

138
00:07:28.759 --> 00:07:32.319
<v Speaker 3>Yes, the model uses attention heads to weigh the importance

139
00:07:32.319 --> 00:07:34.720
<v Speaker 3>of different words in your prompt and then uses a

140
00:07:34.759 --> 00:07:38.480
<v Speaker 3>mathematical function, often cosine similarity, to find the closest,

141
00:07:38.560 --> 00:07:41.920
<v Speaker 3>most statistically relevant cluster of vectors to generate its response.

142
00:07:42.160 --> 00:07:45.000
<v Speaker 2>So, if I ask it about a widely documented historical

143
00:07:45.000 --> 00:07:47.800
<v Speaker 2>event like the moon landing, it's operating in a highly

144
00:07:47.879 --> 00:07:52.000
<v Speaker 2>dense cluster. There are millions of articles, transcripts, and books

145
00:07:52.079 --> 00:07:56.839
<v Speaker 2>in its training data, linking Apollo eleven, Armstrong, Moon, and

146
00:07:56.959 --> 00:08:01.160
<v Speaker 2>nineteen sixty-nine. The cosine similarity points it directly to

147
00:08:01.240 --> 00:08:06.480
<v Speaker 2>the center of a very tight, well-defined mathematical neighborhood.

148
00:08:05.920 --> 00:08:08.879
<v Speaker 3>It's essentially impossible for it to miss. The density of

149
00:08:08.920 --> 00:08:13.800
<v Speaker 3>the data cluster allows for highly accurate statistical retrieval. It

150
00:08:13.839 --> 00:08:14.720
<v Speaker 3>looks like mastery.

151
00:08:14.800 --> 00:08:16.439
<v Speaker 2>It looks like it knows what the moon is, but

152
00:08:16.519 --> 00:08:17.040
<v Speaker 2>it doesn't.

153
00:08:17.439 --> 00:08:20.199
<v Speaker 3>And that is the problem. What happens when you introduce

154
00:08:20.279 --> 00:08:23.639
<v Speaker 3>sparse data? What happens when you ask it to synthesize

155
00:08:23.639 --> 00:08:27.240
<v Speaker 3>concepts that do not reside in a dense mathematical neighborhood?

156
00:08:27.199 --> 00:08:28.759
<v Speaker 2>Like my seven-year-old's penguin riddle?

157
00:08:29.120 --> 00:08:33.320
<v Speaker 3>Precisely. The cluster density is too low for reliable statistical retrieval.

158
00:08:33.840 --> 00:08:37.000
<v Speaker 3>The attention mechanisms attempt to draw connections between vectors that

159
00:08:37.039 --> 00:08:40.080
<v Speaker 3>are mathematically distant, leading to what we call hallucinations.

160
00:08:40.120 --> 00:08:41.519
<v Speaker 2>Because it's forced to answer.

161
00:08:41.399 --> 00:08:44.080
<v Speaker 3>The system is mathematically forced to predict the next token,

162
00:08:44.320 --> 00:08:46.919
<v Speaker 3>so it wanders into a low-density neighborhood and simply

163
00:08:46.960 --> 00:08:51.279
<v Speaker 3>starts generating plausible-sounding nonsense based on superficial syntactical patterns.

164
00:08:51.480 --> 00:08:54.720
<v Speaker 2>Because it doesn't actually possess an internal model of reality.

165
00:08:54.759 --> 00:08:56.759
<v Speaker 2>It's just doing high-dimensional geometry.

166
00:08:56.200 --> 00:08:58.799
<v Speaker 3>Geometry disguised as language.

167
00:08:58.840 --> 00:09:01.799
<v Speaker 2>And this brings us to Dr. Tung Nguyen's analytical

168
00:09:01.840 --> 00:09:06.159
<v Speaker 2>warning about the anthropomorphic fallacy. We are so incredibly wired

169
00:09:06.200 --> 00:09:09.799
<v Speaker 2>evolutionarily to assume that if something can speak to us,

170
00:09:10.240 --> 00:09:14.279
<v Speaker 2>if it uses syntax and grammar, it must think like us.

171
00:09:14.559 --> 00:09:20.080
<v Speaker 3>Dr. Nguyen identifies this pervasive cognitive bias perfectly. Because these

172
00:09:20.159 --> 00:09:24.279
<v Speaker 3>models are successfully completing tasks that were historically designed to

173
00:09:24.320 --> 00:09:29.320
<v Speaker 3>require human cognition, like passing a medical board exam, observers

174
00:09:29.320 --> 00:09:34.279
<v Speaker 3>incorrectly deduce that the machine must possess an equivalent cognitive framework.

175
00:09:34.360 --> 00:09:37.120
<v Speaker 2>We project human thought onto a statistical calculator.

176
00:09:37.200 --> 00:09:40.320
<v Speaker 3>Yes, a machine can predict the next token perfectly in

177
00:09:40.360 --> 00:09:43.960
<v Speaker 3>a highly structured, well documented academic test solely because that

178
00:09:44.080 --> 00:09:46.639
<v Speaker 3>data exists in abundance within its training corpuses.

179
00:09:46.679 --> 00:09:47.679
<v Speaker 2>That's all just correlations.

180
00:09:47.759 --> 00:09:50.399
<v Speaker 3>But when confronted with a novel situation that requires actual

181
00:09:50.480 --> 00:09:54.000
<v Speaker 3>contextual synthesis, a scenario it hasn't mapped the mathematical coordinates

182
00:09:54.039 --> 00:09:57.039
<v Speaker 3>for, the statistical probability mapping completely breaks down.

183
00:09:57.279 --> 00:10:00.000
<v Speaker 2>It's the ultimate trick of the light, and it's exactly

184
00:10:00.000 --> 00:10:03.759
<v Speaker 2>what catalyzed this massive global shift in how we

185
00:10:03.840 --> 00:10:08.440
<v Speaker 2>evaluate intelligence. The structural gaps have become so profound that

186
00:10:08.519 --> 00:10:11.240
<v Speaker 2>they could no longer be mapped by isolated teams of

187
00:10:11.279 --> 00:10:14.919
<v Speaker 2>computer scientists just working in their Silicon Valley silos.

188
00:10:15.000 --> 00:10:16.679
<v Speaker 3>They needed a much broader perspective.

189
00:10:16.799 --> 00:10:21.080
<v Speaker 2>It required a massive interdisciplinary intervention. We are talking about

190
00:10:21.080 --> 00:10:24.840
<v Speaker 2>the engineering of the ultimate metric, something known as Humanity's

191
00:10:24.919 --> 00:10:28.799
<v Speaker 2>Last Exam or the HLE. And let's clarify that name

192
00:10:28.879 --> 00:10:33.200
<v Speaker 2>right now, because Humanity's Last Exam sounds incredibly melodramatic.

193
00:10:33.279 --> 00:10:35.279
<v Speaker 3>It does sound like a cinematic apocalypse.

194
00:10:35.399 --> 00:10:37.720
<v Speaker 2>It sounds like the title of a dystopian sci-fi

195
00:10:37.759 --> 00:10:40.159
<v Speaker 2>novel where we are all plugging into the matrix for

196
00:10:40.200 --> 00:10:40.919
<v Speaker 2>the final time.

197
00:10:41.039 --> 00:10:44.639
<v Speaker 3>It is a provocative title, certainly, but the nomenclature is

198
00:10:44.720 --> 00:10:49.039
<v Speaker 3>purely a clinical rhetorical framing device. It is not an

199
00:10:49.039 --> 00:10:52.320
<v Speaker 3>expression of apocalyptic dread regarding human relevance.

200
00:10:52.399 --> 00:10:54.000
<v Speaker 2>We're not throwing in the towel. Not at all.

201
00:10:54.279 --> 00:10:58.279
<v Speaker 3>Rather, it is a highly specialized initiative designed to systematically

202
00:10:58.320 --> 00:11:03.799
<v Speaker 3>delineate the boundary between algorithmic operations and genuine human reasoning.

203
00:11:04.399 --> 00:11:08.879
<v Speaker 3>The objective is to identify operational strengths and computational vulnerabilities

204
00:11:09.240 --> 00:11:12.879
<v Speaker 3>so that we can engineer safer, more reliable technologies.

205
00:11:13.000 --> 00:11:16.519
<v Speaker 2>It is about understanding exactly where the machines fail to

206
00:11:16.559 --> 00:11:18.240
<v Speaker 2>synthesize reality. Exactly.

207
00:11:18.320 --> 00:11:19.600
<v Speaker 3>It's about precision.

208
00:11:19.559 --> 00:11:23.159
<v Speaker 2>And the scale of the consortium that built this test is just staggering.

209
00:11:23.519 --> 00:11:27.480
<v Speaker 2>We are looking at nearly one thousand researchers globally, and crucially,

210
00:11:27.559 --> 00:11:31.399
<v Speaker 2>they weren't just computer engineers. They realized that generalized domains

211
00:11:31.440 --> 00:11:34.559
<v Speaker 2>were totally insufficient to test for true understanding.

212
00:11:34.679 --> 00:11:37.320
<v Speaker 3>To break a statistical machine, you have to force a

213
00:11:37.399 --> 00:11:39.799
<v Speaker 3>fusion of disparate knowledge bases.

214
00:11:39.960 --> 00:11:44.080
<v Speaker 2>So they integrated historians, physicists, linguists, and medical researchers right

215
00:11:44.120 --> 00:11:45.679
<v Speaker 2>alongside the computer scientists.

216
00:11:45.879 --> 00:11:50.399
<v Speaker 3>That interdisciplinary composition is critical because conceptual integration is exactly

217
00:11:50.399 --> 00:11:55.200
<v Speaker 3>where the statistical probability mapping of current architectures falters. Advanced

218
00:11:55.320 --> 00:11:59.399
<v Speaker 3>human expertise is uniquely characterized by the ability to fuse disparate,

219
00:11:59.519 --> 00:12:04.080
<v Speaker 3>seemingly unrelated domains of knowledge, drawing connections across disciplines. Yes.

220
00:12:04.639 --> 00:12:07.200
<v Speaker 3>To test for this, the consortium published a highly rigorous

221
00:12:07.240 --> 00:12:11.559
<v Speaker 3>assessment in the journal Nature, specifically under the DOI ten

222
00:12:11.639 --> 00:12:15.039
<v Speaker 3>point one zero three eight four one five eight six

223
00:12:15.679 --> 00:12:19.159
<v Speaker 3>zero two five zero nine nine six two four. This

224
00:12:19.240 --> 00:12:23.120
<v Speaker 3>examination consists of exactly two thousand, five hundred questions, and

225
00:12:23.159 --> 00:12:26.919
<v Speaker 3>it is bound by incredibly strict, unforgiving methodological constraints.

226
00:12:27.039 --> 00:12:29.639
<v Speaker 2>Let's look at those constraints, because they are brilliantly designed

227
00:12:29.679 --> 00:12:32.759
<v Speaker 2>to trap an AI. The first constraint is binary grading.

228
00:12:33.240 --> 00:12:37.000
<v Speaker 2>Every single query among those twenty-five hundred questions must

229
00:12:37.039 --> 00:12:40.360
<v Speaker 2>possess exactly one clear, verifiable answer.

230
00:12:40.600 --> 00:12:42.679
<v Speaker 3>There is no partial credit. None.

231
00:12:43.000 --> 00:12:45.879
<v Speaker 2>There is no room for a beautifully written, eloquent essay

232
00:12:45.919 --> 00:12:48.360
<v Speaker 2>that dances around the topic and sounds smart but says

233
00:12:48.440 --> 00:12:49.399
<v Speaker 2>absolutely nothing.

234
00:12:49.480 --> 00:12:53.200
<v Speaker 3>This binary constraint is absolutely essential for empirical validity. One

235
00:12:53.200 --> 00:12:56.399
<v Speaker 3>of the greatest challenges in evaluating open ended algorithmic generation

236
00:12:56.519 --> 00:12:58.080
<v Speaker 3>is subjective human interpretation.

237
00:12:58.240 --> 00:13:00.720
<v Speaker 2>We get tripped up by good grammar. We do.

238
00:13:00.559 --> 00:13:04.279
<v Speaker 3>If a model generates a highly articulate response, human evaluators could

239
00:13:04.279 --> 00:13:07.320
<v Speaker 3>be easily deceived, even if the output is factually hallucinatory.

240
00:13:07.919 --> 00:13:11.440
<v Speaker 3>The model's syntactical fluency masks its lack of actual comprehension.

241
00:13:11.679 --> 00:13:14.879
<v Speaker 2>It speaks with so much confidence.

242
00:13:14.480 --> 00:13:18.919
<v Speaker 3>But by enforcing strict binary grading, the test entirely eliminates

243
00:13:18.960 --> 00:13:23.960
<v Speaker 3>that subjective vulnerability. The machine either successfully executed the complex

244
00:13:24.039 --> 00:13:27.679
<v Speaker 3>logical deduction to arrive at the single verifiable truth, or

245
00:13:27.720 --> 00:13:29.200
<v Speaker 3>it failed entirely.

246
00:13:29.039 --> 00:13:32.000
<v Speaker 2>It strips away the AI's ability to smooth talk its

247
00:13:32.039 --> 00:13:34.960
<v Speaker 2>way out of a corner. But the second constraint is

248
00:13:35.000 --> 00:13:39.519
<v Speaker 2>the real killer, absolute immunity to rapid online search queries.

249
00:13:39.600 --> 00:13:41.639
<v Speaker 3>This is where the paradigm shifts entirely.

250
00:13:41.879 --> 00:13:44.679
<v Speaker 2>By engineering the test to be immune to basic search

251
00:13:44.720 --> 00:13:48.559
<v Speaker 2>engine retrieval, the consortium forces the system entirely away from

252
00:13:48.559 --> 00:13:51.840
<v Speaker 2>its primary operational strength. If an answer can be located

253
00:13:51.840 --> 00:13:55.120
<v Speaker 2>as a contiguous factual string within an indexed database anywhere

254
00:13:55.159 --> 00:13:58.559
<v Speaker 2>on the Internet, it completely fails to test structural comprehension.

255
00:13:58.679 --> 00:14:01.440
<v Speaker 3>It just proves the machine can look things up incredibly fast.

256
00:14:01.639 --> 00:14:04.120
<v Speaker 2>Right, if I can Google the exact phrase, it's not

257
00:14:04.200 --> 00:14:06.360
<v Speaker 2>a good test of intelligence. Exactly.

258
00:14:06.759 --> 00:14:09.679
<v Speaker 3>If the answer exists in a unified format within the

259
00:14:09.720 --> 00:14:12.960
<v Speaker 3>training data, the model can simply rely on that high

260
00:14:13.039 --> 00:14:17.759
<v Speaker 3>density vector cluster we discussed earlier. Therefore, the questions designed

261
00:14:17.759 --> 00:14:22.919
<v Speaker 3>for the HLE demand multi-step logical deduction, intricate spatial reasoning,

262
00:14:23.360 --> 00:14:26.480
<v Speaker 3>or the synthesis of deeply obscured information that does not

263
00:14:26.639 --> 00:14:28.840
<v Speaker 3>exist in a single location anywhere.

264
00:14:28.919 --> 00:14:30.879
<v Speaker 2>It forces them to build something new.

265
00:14:31.080 --> 00:14:34.240
<v Speaker 3>The system must piece together fragments of knowledge to derive

266
00:14:34.279 --> 00:14:36.399
<v Speaker 3>an answer that hasn't been explicitly written down before.

267
00:14:36.080 --> 00:14:38.960
<v Speaker 2>And to guarantee that these constraints were actually met,

268
00:14:39.039 --> 00:14:42.480
<v Speaker 2>the consortium implemented an adversarial pre-testing phase that I

269
00:14:42.559 --> 00:14:46.360
<v Speaker 2>just find brilliant. They built a filtration protocol. Imagine a

270
00:14:46.519 --> 00:14:50.519
<v Speaker 2>massive room of these thousand researchers, and every single proposed

271
00:14:50.600 --> 00:14:54.320
<v Speaker 2>question was systematically administered to the leading state of the

272
00:14:54.399 --> 00:14:56.840
<v Speaker 2>art artificial intelligence systems available at the time.

273
00:14:56.759 --> 00:14:58.360
<v Speaker 3>All the top-tier models.

274
00:14:58.399 --> 00:15:00.960
<v Speaker 2>If any of those models managed to produce the correct answer,

275
00:15:01.240 --> 00:15:05.200
<v Speaker 2>that specific question was instantly destroyed, ripped up, and thrown out.

276
00:15:05.440 --> 00:15:09.000
<v Speaker 3>This pre-testing methodology is what ensures the exam remains

277
00:15:09.000 --> 00:15:14.080
<v Speaker 3>perpetually stationed just beyond the frontier of current computational performance.

278
00:15:14.840 --> 00:15:17.000
<v Speaker 3>It does not measure what the models can already do.

279
00:15:17.879 --> 00:15:21.360
<v Speaker 3>It maps the exact perimeter of algorithmic ignorance.

280
00:15:21.519 --> 00:15:23.919
<v Speaker 2>The perimeter of ignorance. I love that phrasing.

281
00:15:24.159 --> 00:15:28.759
<v Speaker 3>It defines the precise boundary where statistical probability fails and

282
00:15:28.840 --> 00:15:30.320
<v Speaker 3>causal deduction is required.

283
00:15:30.840 --> 00:15:33.559
<v Speaker 2>This brings us to a specific area where this boundary

284
00:15:33.559 --> 00:15:38.159
<v Speaker 2>mapping is most devastating: the deterministic vulnerability of these models.

285
00:15:38.840 --> 00:15:41.919
<v Speaker 2>Let's look at the objective contributions of Dr. Tung Nguyen from

286
00:15:42.000 --> 00:15:45.240
<v Speaker 2>Texas A and M University's Department of Computer Science and Engineering.

287
00:15:45.360 --> 00:15:47.240
<v Speaker 3>He was a major player in this consortium.

288
00:15:47.320 --> 00:15:50.240
<v Speaker 2>He authored seventy-three questions for the assessment, which was

289
00:15:50.360 --> 00:15:54.279
<v Speaker 2>the second-highest individual contribution globally, and his queries were

290
00:15:54.360 --> 00:15:58.639
<v Speaker 2>highly concentrated within the domains of rigorous mathematics and computer science.

291
00:15:59.159 --> 00:16:02.440
<v Speaker 3>Dr. Nguyen's contributions are vital because they isolate a critical

292
00:16:02.519 --> 00:16:07.600
<v Speaker 3>vulnerability inherent in all probabilistic models: the fundamental conflict between

293
00:16:07.639 --> 00:16:10.480
<v Speaker 3>stochastic prediction and deterministic execution.

294
00:16:10.679 --> 00:16:12.360
<v Speaker 2>Okay, let's break that down for the listener.

295
00:16:12.600 --> 00:16:18.200
<v Speaker 3>Mathematical and computational logic requires step-by-step, rigid determinism.

296
00:16:18.879 --> 00:16:22.600
<v Speaker 3>A stochastic prediction model cannot navigate a rigorous mathematical proof.

297
00:16:22.840 --> 00:16:26.360
<v Speaker 2>So say you give the AI a highly complex, fifty

298
00:16:26.360 --> 00:16:28.600
<v Speaker 2>step mathematical proof that has never been solved in this

299
00:16:28.639 --> 00:16:32.039
<v Speaker 2>specific way before. If you are a machine learning model

300
00:16:32.080 --> 00:16:36.519
<v Speaker 2>relying on probabilistic guessing, just predicting the most likely next

301
00:16:36.519 --> 00:16:39.759
<v Speaker 2>mathematical operation based on past training data, you might get

302
00:16:39.799 --> 00:16:42.840
<v Speaker 2>step one right with ninety-nine point nine percent certainty.

303
00:16:42.919 --> 00:16:44.240
<v Speaker 3>You might even get step two right.

304
00:16:44.440 --> 00:16:47.399
<v Speaker 2>But eventually you are going to make a tiny minor

305
00:16:47.559 --> 00:16:51.519
<v Speaker 2>variable error because you are guessing, you're not deducing precisely.

306
00:16:51.879 --> 00:16:54.759
<v Speaker 3>And in a rigorous mathematical proof, what happens when you

307
00:16:54.799 --> 00:16:57.600
<v Speaker 3>introduce a single minor variable error at step fourteen?

308
00:16:57.720 --> 00:16:59.799
<v Speaker 2>The entire logical structure collapses.

309
00:16:59.840 --> 00:17:03.159
<v Speaker 3>The error compounds exponentially. A stochastic model might get the

310
00:17:03.159 --> 00:17:06.000
<v Speaker 3>first steps right because those operational sequences are common in

311
00:17:06.000 --> 00:17:08.799
<v Speaker 3>its training data, but the moment it has to logically deduce

312
00:17:08.799 --> 00:17:12.480
<v Speaker 3>a novel sequence, its probabilistic nature forces a guess. The

313
00:17:12.519 --> 00:17:15.359
<v Speaker 3>guess introduces an error, and the final answer is completely wrong.
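
NOTE
Editor's sketch, not part of the spoken episode: a rough illustration of the compounding argument in this exchange, assuming every step of the proof is independently correct with the same probability. The function name and the independence assumption are hypothetical simplifications; real models do not fail independently per step.
```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps
# Hypothetical per-step accuracies for a fifty-step proof.
for accuracy in (0.999, 0.99, 0.95):
    result = chain_success_probability(accuracy, 50)
    print(f"{accuracy:.3f} per step over 50 steps: {result:.3f}")
```
Under this simplified model the chance of a flawless proof shrinks geometrically with its length, so how fast the "house of cards" falls depends heavily on the per-step error rate.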

314
00:17:15.519 --> 00:17:19.119
<v Speaker 2>You're building a fifty story house of cards in a windstorm.

315
00:17:19.440 --> 00:17:22.759
<v Speaker 2>It just takes one microscopic miscalculation at the base and

316
00:17:22.799 --> 00:17:26.000
<v Speaker 2>the whole thing comes down. We often think of computers

317
00:17:26.000 --> 00:17:29.559
<v Speaker 2>as being inherently perfect at math, like a giant calculator.

318
00:17:29.559 --> 00:17:32.119
<v Speaker 3>But these large language models are not calculators.

319
00:17:32.400 --> 00:17:35.160
<v Speaker 2>They are language prediction engines trying to speak math.

320
00:17:35.319 --> 00:17:37.519
<v Speaker 3>That is exactly what they are doing, and that is

321
00:17:37.559 --> 00:17:41.160
<v Speaker 3>why they stumble when forced out of language and into pure,

322
00:17:41.759 --> 00:17:43.559
<v Speaker 3>unforgiving deterministic logic.

323
00:17:44.039 --> 00:17:47.960
<v Speaker 2>Now, to truly comprehend the massive cognitive divide that this

324
00:17:48.039 --> 00:17:50.880
<v Speaker 2>exam is measuring, we need to spend some serious time

325
00:17:50.960 --> 00:17:55.400
<v Speaker 2>analyzing the typology of the expert level assessment domains. This

326
00:17:55.440 --> 00:17:57.599
<v Speaker 2>is where it gets incredibly fascinating.

327
00:17:57.640 --> 00:17:59.440
<v Speaker 3>The domains themselves are extraordinary.

328
00:17:59.799 --> 00:18:02.880
<v Speaker 2>Let's look at three specific examples of the types of questions that

329
00:18:02.960 --> 00:18:07.400
<v Speaker 2>survived that brutal filtration process, and these are completely wild.

330
00:18:07.799 --> 00:18:12.160
<v Speaker 2>Let's start with domain one, linguistic synthesis, specifically the translation

331
00:18:12.279 --> 00:18:13.920
<v Speaker 2>of ancient Palmyrene inscriptions.

332
00:18:14.480 --> 00:18:18.960
<v Speaker 3>Ancient Palmyrene represents a dialect that severely disrupts standard computational processing.

333
00:18:19.640 --> 00:18:22.440
<v Speaker 3>It is an extinct language from the ancient city of Palmyra,

334
00:18:22.680 --> 00:18:24.559
<v Speaker 3>located in present day Syria.

335
00:18:24.519 --> 00:18:26.799
<v Speaker 2>A vital oasis hub on the Silk Road.

336
00:18:27.079 --> 00:18:34.000
<v Speaker 3>Crucially, its linguistic record possesses highly limited fragmentary representation. Because

337
00:18:34.039 --> 00:18:37.720
<v Speaker 3>it is so obscure, it completely lacks the massive digital

338
00:18:37.759 --> 00:18:40.799
<v Speaker 3>corpus required to train statistical engines effectively.

339
00:18:41.079 --> 00:18:43.799
<v Speaker 2>Right, there just aren't millions of pages of ancient Palmyrene

340
00:18:43.839 --> 00:18:46.920
<v Speaker 2>sitting on Wikipedia for the AI to ingest and map

341
00:18:47.000 --> 00:18:51.680
<v Speaker 2>into its multidimensional vector space. The cluster density is practically zero.

342
00:18:51.880 --> 00:18:54.440
<v Speaker 3>There is no broad pattern to recall.

343
00:18:54.160 --> 00:18:58.119
<v Speaker 2>So when the AI encounters this dialect, its cosine similarity

344
00:18:58.119 --> 00:19:01.200
<v Speaker 2>functions just hit a brick wall. But how does a

345
00:19:01.319 --> 00:19:04.880
<v Speaker 2>human expert handle this? Because a human epigrapher doesn't just

346
00:19:04.920 --> 00:19:06.240
<v Speaker 2>throw up their hands and give up when they don't

347
00:19:06.279 --> 00:19:07.200
<v Speaker 2>have enough data points.
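
NOTE
Editor's sketch, not part of the spoken episode: the hosts refer to cosine similarity in a vector space, so here is a minimal pure-Python version of that measure. The three-dimensional vectors and their names are invented for illustration and are not real model embeddings.
```python
import math
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
# Toy "embeddings"; the numbers are made up for illustration.
query = [0.9, 0.1, 0.0]
well_covered_neighbor = [0.8, 0.2, 0.1]
unrelated_vector = [0.0, 0.1, 0.9]
print(cosine_similarity(query, well_covered_neighbor))
print(cosine_similarity(query, unrelated_vector))
```
A query only scores highly against vectors that already sit near it, which is one way to picture why a language with almost no training text offers few useful neighbors to retrieve.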

348
00:19:07.319 --> 00:19:09.720
<v Speaker 3>No, they engage in something called epigraphic deduction.

349
00:19:10.039 --> 00:19:11.680
<v Speaker 2>Let's walk through exactly what that looks like.

350
00:19:11.799 --> 00:19:16.640
<v Speaker 3>Epigraphic deduction is a masterful example of multimodal contextual reasoning.

351
00:19:17.200 --> 00:19:21.000
<v Speaker 3>A human epigrapher cross references disparate fields of knowledge that

352
00:19:21.359 --> 00:19:24.440
<v Speaker 3>on the surface have nothing to do with linguistics. Let's

353
00:19:24.440 --> 00:19:26.920
<v Speaker 3>say they are looking at a partially destroyed stone tablet

354
00:19:27.119 --> 00:19:29.759
<v Speaker 3>containing a tax record from the year two fifty AD.

355
00:19:30.240 --> 00:19:31.400
<v Speaker 2>Okay, setting the scene.

356
00:19:31.359 --> 00:19:35.039
<v Speaker 3>The word indicating this specific tax commodity is chipped away.

357
00:19:35.839 --> 00:19:39.559
<v Speaker 3>An AI cannot statistically predict the missing word because the

358
00:19:39.640 --> 00:19:41.279
<v Speaker 3>linguistic data is too sparse.

359
00:19:41.839 --> 00:19:44.960
<v Speaker 2>But the human epigrapher steps back. They look at the

360
00:19:45.039 --> 00:19:47.799
<v Speaker 2>chisel marks on the stone and realize it matches the

361
00:19:47.839 --> 00:19:51.319
<v Speaker 2>craftsmanship of a specific merchant class.

362
00:19:51.400 --> 00:19:54.640
<v Speaker 3>Exactly. They expand the context window to reality itself.

363
00:19:54.839 --> 00:19:58.640
<v Speaker 2>They analyze the regional historical context. They know that around

364
00:19:58.680 --> 00:20:02.200
<v Speaker 2>two hundred and fifty AD there was a massive drought

365
00:20:02.240 --> 00:20:05.839
<v Speaker 2>in the region that decimated local agriculture, which meant trade

366
00:20:05.920 --> 00:20:09.400
<v Speaker 2>routes had to shift significantly to import grain from Egypt.

367
00:20:09.680 --> 00:20:13.079
<v Speaker 3>They know about the political shifts, perhaps a specific marriage

368
00:20:13.119 --> 00:20:16.480
<v Speaker 3>between a Palmyrene noble and a Roman patrician that altered

369
00:20:16.480 --> 00:20:18.519
<v Speaker 3>tariff laws for that exact decade.

370
00:20:18.640 --> 00:20:21.559
<v Speaker 2>So the human expert understands the human context in which

371
00:20:21.559 --> 00:20:22.759
<v Speaker 2>the inscription was created.

372
00:20:23.039 --> 00:20:26.519
<v Speaker 3>They use their causal understanding of history, geology, economics, and

373
00:20:26.559 --> 00:20:30.839
<v Speaker 3>politics to infer the missing linguistic data. They deduce that

374
00:20:30.839 --> 00:20:33.839
<v Speaker 3>the missing word must be the specific term for Egyptian grain,

375
00:20:34.240 --> 00:20:36.960
<v Speaker 3>based on the convergence of all these non-linguistic variables.

376
00:20:37.160 --> 00:20:39.759
<v Speaker 2>They solve the puzzle where half the pieces are missing

377
00:20:40.039 --> 00:20:42.920
<v Speaker 2>by understanding the history of the factory that made the puzzle.

378
00:20:43.079 --> 00:20:44.480
<v Speaker 3>That is a brilliant way to phrase it.

379
00:20:44.680 --> 00:20:49.319
<v Speaker 2>The AI architecture completely lacks this multimodal contextual reasoning. Its

380
00:20:49.359 --> 00:20:53.359
<v Speaker 2>standard statistical models fundamentally fail to synthesize the ancient texts

381
00:20:53.400 --> 00:20:56.839
<v Speaker 2>because the variables involved in ancient political shifts and ecological

382
00:20:56.880 --> 00:21:00.640
<v Speaker 2>disasters entirely evade their mathematical parameterization.

383
00:21:01.079 --> 00:21:04.319
<v Speaker 3>They cannot compute the causal link between a drought and

384
00:21:04.359 --> 00:21:07.279
<v Speaker 3>a missing chisel mark because those concepts don't live in

385
00:21:07.319 --> 00:21:10.400
<v Speaker 3>the same mathematical neighborhood in their training data.

386
00:21:10.759 --> 00:21:14.559
<v Speaker 2>This fundamental integration deficit leads us perfectly to the second domain,

387
00:21:14.599 --> 00:21:17.880
<v Speaker 2>which forces a completely different kind of synthesis, spatial and

388
00:21:17.880 --> 00:21:21.880
<v Speaker 2>biological reasoning. The designated task in this domain involves the

389
00:21:21.960 --> 00:21:27.319
<v Speaker 2>identification of microscopic anatomical structures within avian biology.

390
00:21:26.839 --> 00:21:30.160
<v Speaker 3>Specifically the complex physiological taxonomy of birds.

391
00:21:30.240 --> 00:21:32.559
<v Speaker 2>Okay, so we are shifting from dead languages on the

392
00:21:32.599 --> 00:21:38.039
<v Speaker 2>Silk Road to microscopic bird anatomy. Talk about interdisciplinary. So

393
00:21:38.279 --> 00:21:42.119
<v Speaker 2>why does bird anatomy break a multi billion dollar AI?

394
00:21:42.480 --> 00:21:45.359
<v Speaker 3>It comes down to the operational difficulty of dealing with

395
00:21:45.599 --> 00:21:50.200
<v Speaker 3>messy real world data. The Nature paper task requires deriving

396
00:21:50.240 --> 00:21:56.359
<v Speaker 3>three dimensional spatial relationships purely from chaotic two dimensional microscopic imaging.

397
00:21:56.480 --> 00:21:59.119
<v Speaker 2>Okay, elaborate on that operational difficulty.

398
00:21:59.200 --> 00:22:03.559
<v Speaker 3>The core computational challenge is that the system must map abstract,

399
00:22:03.960 --> 00:22:10.599
<v Speaker 3>obscure taxonomic classifications onto highly variable, often visually unclear, microscopic data.

400
00:22:11.359 --> 00:22:13.960
<v Speaker 3>When a human biological researcher looks at a slide of

401
00:22:13.960 --> 00:22:17.000
<v Speaker 3>avian tissue under a microscope, they are not looking at

402
00:22:17.000 --> 00:22:19.880
<v Speaker 3>a perfectly formatted, color coded textbook diagram.

403
00:22:19.960 --> 00:22:21.880
<v Speaker 2>No, I've seen these slides, they look like Jackson Pollock

404
00:22:21.920 --> 00:22:23.799
<v Speaker 2>paintings made of pink and purple blobs. There are no

405
00:22:23.880 --> 00:22:24.720
<v Speaker 2>clean lines.

406
00:22:24.519 --> 00:22:27.920
<v Speaker 3>Precisely, they're looking at a chaotic field of overlapping cells,

407
00:22:28.480 --> 00:22:32.519
<v Speaker 3>artifacts from the staining process, and structural anomalies. The human

408
00:22:32.559 --> 00:22:35.759
<v Speaker 3>researcher has to mentally execute a three dimensional spatial rotation

409
00:22:36.279 --> 00:22:37.599
<v Speaker 3>of those two D anomalies in their mind,

410
00:22:37.599 --> 00:22:42.119
<v Speaker 2>while simultaneously applying highly specialized, obscure biological rules

411
00:22:42.119 --> 00:22:45.160
<v Speaker 2>regarding avian taxonomy to figure out what specific cellular structure

412
00:22:45.160 --> 00:22:46.119
<v Speaker 2>they are observing.

413
00:22:46.119 --> 00:22:51.240
<v Speaker 3>And the machine struggles profoundly to synthesize that specialized visual geometry

414
00:22:51.599 --> 00:22:53.720
<v Speaker 3>with the necessary biological context.

415
00:22:53.960 --> 00:22:57.960
<v Speaker 2>It exposes how coddled these models are by their training data.

416
00:22:58.400 --> 00:23:02.400
<v Speaker 2>These current architectures are so used to being spoon fed pristine,

417
00:23:02.920 --> 00:23:07.000
<v Speaker 2>unified textbook inputs, but the real world of scientific discovery

418
00:23:07.039 --> 00:23:08.400
<v Speaker 2>is incredibly noisy.

419
00:23:08.480 --> 00:23:09.640
<v Speaker 3>It is entirely unstructured.

420
00:23:09.720 --> 00:23:12.720
<v Speaker 2>You can't just read a million Wikipedia articles about bird

421
00:23:12.720 --> 00:23:16.839
<v Speaker 2>anatomy and suddenly understand a chaotic microscopic slide. You have

422
00:23:16.880 --> 00:23:19.599
<v Speaker 2>to mentally build a three D model of that tissue

423
00:23:19.599 --> 00:23:22.759
<v Speaker 2>in your head, apply obscure physiological rules to it,

424
00:23:22.799 --> 00:23:24.279
<v Speaker 2>and filter out the visual noise.

425
00:23:24.440 --> 00:23:26.960
<v Speaker 3>The AI just gets entirely lost in that noise because

426
00:23:26.960 --> 00:23:30.079
<v Speaker 3>it lacks a causal understanding of biology and spatial physics.

427
00:23:30.119 --> 00:23:32.440
<v Speaker 2>And this leads us to our third example, which incorporates

428
00:23:32.519 --> 00:23:37.319
<v Speaker 2>rigorous phonological and theological analysis. The designated task requires the

429
00:23:37.359 --> 00:23:40.440
<v Speaker 2>examination of detailed sound patterns within Biblical Hebrew.

430
00:23:40.759 --> 00:23:44.680
<v Speaker 3>This domain isolates the intersection of phonetics, historical linguistics, and

431
00:23:44.799 --> 00:23:46.400
<v Speaker 3>complex textual analysis.

432
00:23:46.599 --> 00:23:49.680
<v Speaker 2>This one is fascinating because we aren't just talking about

433
00:23:49.759 --> 00:23:53.079
<v Speaker 2>translating a sentence. We are talking about analyzing the historical

434
00:23:53.119 --> 00:23:57.319
<v Speaker 2>evolution of specific sound structures within a highly dense, centuries

435
00:23:57.319 --> 00:23:58.839
<v Speaker 2>old theological context.

436
00:23:59.160 --> 00:24:03.039
<v Speaker 3>The system has to map orthographic symbols, the written characters

437
00:24:03.039 --> 00:24:07.599
<v Speaker 3>on the page, to their historical phonetic realities. Consider the

438
00:24:07.640 --> 00:24:11.200
<v Speaker 3>Masoretic text and the Tiberian vocalization tradition.

439
00:24:10.880 --> 00:24:12.599
<v Speaker 2>Which is incredibly layered.

440
00:24:12.880 --> 00:24:17.480
<v Speaker 3>Yes, the system must understand how morphological rules, vowel points,

441
00:24:17.559 --> 00:24:21.799
<v Speaker 3>and consonant pronunciations were altered through centuries of human transmission,

442
00:24:22.119 --> 00:24:24.319
<v Speaker 3>oral tradition, and theological preservation.

443
00:24:24.519 --> 00:24:27.839
<v Speaker 2>It's requiring the system to hold an internal temporal model

444
00:24:27.839 --> 00:24:31.079
<v Speaker 2>of linguistic evolution. Let's say a scribe in the ninth

445
00:24:31.079 --> 00:24:34.519
<v Speaker 2>century made a slight, localized adjustment to a vowel pointing

446
00:24:34.759 --> 00:24:37.759
<v Speaker 2>based on a highly specific theological debate happening in their

447
00:24:37.799 --> 00:24:39.480
<v Speaker 2>specific community at that time.

448
00:24:39.440 --> 00:24:42.960
<v Speaker 3>A shift motivated by human belief, not mathematical probability.

449
00:24:43.279 --> 00:24:47.039
<v Speaker 2>Exactly, that tiny adjustment changes the phonetics of the word,

450
00:24:47.240 --> 00:24:49.599
<v Speaker 2>and an AI can't just look at that Hebrew word and

451
00:24:49.640 --> 00:24:52.079
<v Speaker 2>spit out the English equivalent based on a lookup table.

452
00:24:52.519 --> 00:24:54.640
<v Speaker 2>It has to understand the why and the how that

453
00:24:54.680 --> 00:24:57.920
<v Speaker 2>word sounded a specific way a thousand years ago, based

454
00:24:57.920 --> 00:25:00.480
<v Speaker 2>on the theological rules of the time, and how those

455
00:25:00.559 --> 00:25:02.279
<v Speaker 2>rules shifted across generations.

456
00:25:02.880 --> 00:25:08.480
<v Speaker 3>This completely neutralizes the superficial semantic processing of large language models.

457
00:25:08.559 --> 00:25:11.440
<v Speaker 3>It forces them into a depth of historical phonetics and

458
00:25:11.480 --> 00:25:15.759
<v Speaker 3>theological reasoning that extends far beyond the parameterization of current

459
00:25:15.839 --> 00:25:20.319
<v Speaker 3>statistical text generators. They cannot infer the phonetic shift without

460
00:25:20.359 --> 00:25:23.039
<v Speaker 3>an internal temporal model of the culture that produced it.

461
00:25:23.359 --> 00:25:28.000
<v Speaker 2>Synthesizing these three examples, the contextual epigraphic deduction of ancient Palmyrene,

462
00:25:28.440 --> 00:25:32.279
<v Speaker 2>the chaotic spatial reasoning required for avian microscopic anatomy, and

463
00:25:32.319 --> 00:25:36.160
<v Speaker 2>the phonetic evolution of Biblical Hebrew, it demonstrates the profound

464
00:25:36.240 --> 00:25:39.880
<v Speaker 2>depth and specialized expertise that constitutes true intelligence.

465
00:25:40.319 --> 00:25:42.319
<v Speaker 3>These are the things that require a mind, not just

466
00:25:42.359 --> 00:25:43.119
<v Speaker 3>a map.

467
00:25:43.160 --> 00:25:47.640
<v Speaker 2>And they contrast so sharply with the remarkably shallow knowledge base

468
00:25:47.759 --> 00:25:51.480
<v Speaker 2>of standard language models, and the data backing this up

469
00:25:51.559 --> 00:25:54.759
<v Speaker 2>is staggering. Let's look at the empirical trajectory and the

470
00:25:54.799 --> 00:25:58.480
<v Speaker 2>performance metrics, because this is where the theoretical hits the concrete.

471
00:25:58.559 --> 00:25:59.799
<v Speaker 3>The baseline measurements were.

472
00:25:59.720 --> 00:26:03.720
<v Speaker 2>Defa when they established those baseline measurements for the early models

473
00:26:03.759 --> 00:26:07.480
<v Speaker 2>taking this exam, it revealed an initial diagnostic floor that

474
00:26:07.599 --> 00:26:09.599
<v Speaker 2>was frankly shocking to the industry.

475
00:26:09.720 --> 00:26:12.759
<v Speaker 3>The severe underperformance of these highly regarded state of the

476
00:26:12.880 --> 00:26:17.279
<v Speaker 3>art models is statistically significant. When evaluated against the HLE,

477
00:26:17.640 --> 00:26:20.400
<v Speaker 3>GPT four achieved an accuracy rate of just two point

478
00:26:20.480 --> 00:26:21.200
<v Speaker 3>seven percent.

479
00:26:21.440 --> 00:26:23.079
<v Speaker 2>Two point seven percent.

480
00:26:22.880 --> 00:26:25.920
<v Speaker 3>Claude three point five Sonnet achieved four point one percent.

481
00:26:26.079 --> 00:26:29.400
<v Speaker 3>OpenAI's specialized reasoning model o1 reached a threshold

482
00:26:29.440 --> 00:26:30.359
<v Speaker 3>of only eight percent.

483
00:26:30.519 --> 00:26:32.599
<v Speaker 2>I want you listening to really let those numbers sink

484
00:26:32.640 --> 00:26:35.599
<v Speaker 2>in: two point seven percent, four point one percent, eight percent.

485
00:26:35.720 --> 00:26:39.400
<v Speaker 2>These single digit percentages indicate an absolute baseline collapse of

486
00:26:39.400 --> 00:26:41.119
<v Speaker 2>reasoning capabilities.

487
00:26:40.519 --> 00:26:41.359
<v Speaker 3>A total collapse.

488
00:26:41.559 --> 00:26:45.319
<v Speaker 2>When these massive architectures are stripped of easily searchable data

489
00:26:45.759 --> 00:26:50.039
<v Speaker 2>and forced to synthesize novel information across disciplines, they fall

490
00:26:50.079 --> 00:26:54.240
<v Speaker 2>apart entirely. In fact, on a binary or multiple choice format,

491
00:26:54.599 --> 00:26:57.559
<v Speaker 2>these scores are statistically worse than random guessing.

492
00:26:57.680 --> 00:26:58.880
<v Speaker 3>Yes, statistically worse.

493
00:26:59.200 --> 00:27:01.079
<v Speaker 2>If you just closed your eyes and flipped a coin

494
00:27:01.160 --> 00:27:04.440
<v Speaker 2>or picked answers at random, you would mathematically score higher

495
00:27:04.440 --> 00:27:07.680
<v Speaker 2>than these multi billion dollar supercomputers did on this exam.
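
NOTE
Editor's sketch, not part of the spoken episode: the comparison with random guessing only has a simple baseline when every question offers a fixed number of answer choices, which is the assumption made here. The choice counts are hypothetical; the scores are the three figures quoted in the episode. For open-ended short answers there is no equivalent chance baseline.
```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on questions with num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices
# Scores quoted in the episode, expressed as fractions.
quoted_scores = {"GPT four": 0.027, "Claude three point five Sonnet": 0.041, "o1": 0.08}
for num_choices in (2, 4, 5):
    baseline = chance_accuracy(num_choices)
    below = [name for name, score in quoted_scores.items() if score < baseline]
    print(f"{num_choices} choices: chance = {baseline:.2f}, below chance: {below}")
```
The sketch only shows how such a comparison would be computed; whether it applies depends on how much of the exam is actually in a fixed-choice format.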

496
00:27:07.880 --> 00:27:11.240
<v Speaker 3>That is a critical observation. The reason they perform worse

497
00:27:11.279 --> 00:27:14.640
<v Speaker 3>than random guessing is due to the mechanics of their failure.

498
00:27:15.319 --> 00:27:19.920
<v Speaker 3>They engage in systematic hallucination because of their structural compulsions,

499
00:27:20.039 --> 00:27:23.559
<v Speaker 3>their mathematical mandate to predict the next token. They are

500
00:27:23.640 --> 00:27:25.039
<v Speaker 3>driven to generate a response.

501
00:27:25.359 --> 00:27:27.119
<v Speaker 2>They can't just say I don't know.

502
00:27:27.440 --> 00:27:30.759
<v Speaker 3>Exactly. They lack the epistemic humility to simply state I do

503
00:27:30.839 --> 00:27:34.720
<v Speaker 3>not have sufficient data to synthesize a conclusion. Therefore, they

504
00:27:34.759 --> 00:27:40.799
<v Speaker 3>generate statistically plausible, beautifully articulated, but entirely logically fallacious answers.

505
00:27:40.880 --> 00:27:42.640
<v Speaker 2>They confidently lead you off a cliff.

506
00:27:42.880 --> 00:27:47.440
<v Speaker 3>They are confidently incorrect, drawn off course by superficial patterns

507
00:27:47.880 --> 00:27:50.799
<v Speaker 3>in the prompt that lead them away from the actual

508
00:27:51.279 --> 00:27:52.240
<v Speaker 3>complex truth.

509
00:27:52.440 --> 00:27:55.880
<v Speaker 2>However, we have to acknowledge the rapid iteration that followed.

510
00:27:56.559 --> 00:27:59.079
<v Speaker 2>The tech industry does not just sit still and accept

511
00:27:59.079 --> 00:28:03.599
<v Speaker 2>a two percent score. Subsequent models showed a steep improvement curve.

512
00:28:03.799 --> 00:28:05.160
<v Speaker 3>The optimization was rapid.

513
00:28:05.400 --> 00:28:08.240
<v Speaker 2>We saw Gemini three point one Pro and Claude four

514
00:28:08.279 --> 00:28:12.559
<v Speaker 2>point six eventually elevate their accuracy rates to approximately forty

515
00:28:12.640 --> 00:28:15.400
<v Speaker 2>to fifty percent. Now I have to challenge you here.

516
00:28:15.559 --> 00:28:18.000
<v Speaker 2>If I'm an AI developer listening to this, I'm screaming

517
00:28:18.039 --> 00:28:19.039
<v Speaker 2>at my dashboard right now.

518
00:28:19.039 --> 00:28:19.680
<v Speaker 3>I'm sure they are.

519
00:28:19.799 --> 00:28:22.240
<v Speaker 2>I'm saying, wait, hold on, we jumped from two percent

520
00:28:22.279 --> 00:28:25.079
<v Speaker 2>to fifty percent in just a few iterations. We improved

521
00:28:25.119 --> 00:28:28.400
<v Speaker 2>the system's performance by twenty five times. Give us another year,

522
00:28:28.599 --> 00:28:30.920
<v Speaker 2>throw another trillion dollars of compute at it, and we'll

523
00:28:30.960 --> 00:28:33.799
<v Speaker 2>hit one hundred percent. Why are you so certain that

524
00:28:33.839 --> 00:28:36.640
<v Speaker 2>fifty percent is an unbreakable ceiling and not just a

525
00:28:36.680 --> 00:28:39.039
<v Speaker 2>speed bump on the way to artificial superintelligence.

526
00:28:39.359 --> 00:28:43.799
<v Speaker 3>It is a valid counter argument, absolutely, but rigorous architectural

527
00:28:43.839 --> 00:28:46.920
<v Speaker 3>analysis must be applied to the persistence of the remaining

528
00:28:46.960 --> 00:28:50.759
<v Speaker 3>competency gap. The models hit a wall at that fifty

529
00:28:50.759 --> 00:28:54.519
<v Speaker 3>percent threshold. The advancement from the single digits to fifty

530
00:28:54.519 --> 00:28:58.720
<v Speaker 3>percent was achieved largely by optimizing logical routing protocols and

531
00:28:58.799 --> 00:29:01.359
<v Speaker 3>expanding what we call contextual processing windows.

532
00:29:00.920 --> 00:29:04.119
<v Speaker 2>Basically making their short term memory bigger.

533
00:29:04.480 --> 00:29:08.960
<v Speaker 3>Essentially, yes, the developers gave the AI a massively larger

534
00:29:09.000 --> 00:29:12.640
<v Speaker 3>short term memory to hold more variables in its active context simultaneously.

535
00:29:12.680 --> 00:29:15.000
<v Speaker 2>So they built a bigger desk for it to

536
00:29:15.000 --> 00:29:16.240
<v Speaker 2>spread all its papers out on.

537
00:29:16.119 --> 00:29:19.559
<v Speaker 3>Exactly, but bridging the final gap, crossing that fifty

538
00:29:19.599 --> 00:29:23.160
<v Speaker 3>percent chasm to reach true one hundred percent expert level

539
00:29:23.240 --> 00:29:28.720
<v Speaker 3>mastery across all domains requires fundamentally different cognitive architecture. It

540
00:29:28.759 --> 00:29:30.599
<v Speaker 3>requires true causal reasoning.

541
00:29:30.279 --> 00:29:32.319
<v Speaker 2>Which they don't have.

542
00:29:32.319 --> 00:29:35.519
<v Speaker 3>They don't. It requires the internal representation of reality that humans have,

543
00:29:35.839 --> 00:29:39.920
<v Speaker 3>which these statistical architectures inherently lack. You cannot simply add

544
00:29:39.960 --> 00:29:42.319
<v Speaker 3>more memory or a bigger desk to a statistical engine

545
00:29:42.359 --> 00:29:44.279
<v Speaker 3>and magically spark causal understanding.

546
00:29:44.359 --> 00:29:48.759
<v Speaker 2>The difference between retrieving a complex correlated path and constructing

547
00:29:48.799 --> 00:29:51.240
<v Speaker 2>a novel causal graph is non trivial.

548
00:29:51.680 --> 00:29:54.920
<v Speaker 3>It is the defining limitation. That is why this fifty

549
00:29:54.960 --> 00:29:59.279
<v Speaker 3>percent deficit likely represents a structural asymptote.

550
00:29:58.920 --> 00:30:01.839
<v Speaker 2>A structural asymptote, meaning a mathematical limit that a

551
00:30:01.920 --> 00:30:04.759
<v Speaker 2>curve approaches but can never quite reach, no matter how

552
00:30:04.799 --> 00:30:06.839
<v Speaker 2>far it extends or how much money you pour into

553
00:30:06.880 --> 00:30:07.960
<v Speaker 2>the server farms.

554
00:30:07.640 --> 00:30:11.319
<v Speaker 3>Precisely, and to ensure that this asymptote remains a valid,

555
00:30:11.759 --> 00:30:15.680
<v Speaker 3>uncorrupted measurement of the cognitive divide, the consortium had to

556
00:30:15.720 --> 00:30:20.119
<v Speaker 3>implement critical future proofing mechanisms for the benchmark itself. The

557
00:30:20.200 --> 00:30:22.880
<v Speaker 3>most vital of these is maintaining the strict opacity of

558
00:30:22.920 --> 00:30:26.160
<v Speaker 3>the exam. The vast majority of those twenty five hundred

559
00:30:26.240 --> 00:30:28.599
<v Speaker 3>questions are securely hidden from the public domain.

560
00:30:28.720 --> 00:30:30.720
<v Speaker 2>They have to keep it locked in a vault because

561
00:30:30.720 --> 00:30:32.920
<v Speaker 2>if they published the full data set of questions and

562
00:30:33.000 --> 00:30:37.279
<v Speaker 2>verified answers, the artificial intelligence models, which are constantly scraping

563
00:30:37.319 --> 00:30:41.119
<v Speaker 2>the Internet for their continuous training pipelines, would instantly ingest

564
00:30:41.160 --> 00:30:41.640
<v Speaker 2>the exam.

565
00:30:41.720 --> 00:30:44.880
<v Speaker 3>They would memorize the exact sequence of tokens.

566
00:30:44.599 --> 00:30:46.640
<v Speaker 2>The next time they took the test, they would achieve

567
00:30:46.720 --> 00:30:50.759
<v Speaker 2>perfect scores through the statistical weightings of memorized data, instantly

568
00:30:50.759 --> 00:30:55.079
<v Speaker 2>invalidating the entire diagnostic benchmark. The test would saturate again

569
00:30:55.400 --> 00:30:57.559
<v Speaker 2>just like the MMLU, and we would be back to

570
00:30:57.599 --> 00:30:58.119
<v Speaker 2>square one.

571
00:30:58.799 --> 00:31:01.119
<v Speaker 3>This brings us to a crucial phase of our analysis,

572
00:31:01.519 --> 00:31:06.000
<v Speaker 3>the strategic implications, the inherent risks, and the theoretical projections

573
00:31:06.039 --> 00:31:10.759
<v Speaker 3>surrounding these developments. The risk of misinterpretation regarding AI capabilities

574
00:31:11.039 --> 00:31:14.880
<v Speaker 3>is incredibly severe. We must issue a stark warning concerning

575
00:31:14.920 --> 00:31:16.440
<v Speaker 3>the danger of legacy testing.

576
00:31:16.680 --> 00:31:18.680
<v Speaker 2>This is where the rubber meets the road and impacts

577
00:31:18.680 --> 00:31:22.599
<v Speaker 2>the real world for you and me. If policymakers, hospital administrators,

578
00:31:22.640 --> 00:31:26.119
<v Speaker 2>software developers, and end users deploy these systems under the

579
00:31:26.160 --> 00:31:30.039
<v Speaker 2>false assumption that they possess human level competence, an assumption

580
00:31:30.119 --> 00:31:34.759
<v Speaker 2>based entirely on those saturated obsolete MMLU scores, the consequences

581
00:31:34.759 --> 00:31:35.720
<v Speaker 2>could be disastrous.

582
00:31:35.880 --> 00:31:38.000
<v Speaker 3>The systemic vulnerabilities are terrifying.

583
00:31:38.240 --> 00:31:43.000
<v Speaker 2>Imagine integrating a machine learning model into critical medical diagnostic infrastructure,

584
00:31:43.359 --> 00:31:46.960
<v Speaker 2>or utilizing it for complex regulatory compliance, or even embedding

585
00:31:47.000 --> 00:31:49.039
<v Speaker 2>it within judicial sentencing algorithms.

586
00:31:49.319 --> 00:31:53.119
<v Speaker 3>Doing so grants operational autonomy to mathematical models in high

587
00:31:53.119 --> 00:31:57.720
<v Speaker 3>stakes environments that far exceeds their actual cognitive capacities. You

588
00:31:57.759 --> 00:32:00.759
<v Speaker 3>are trusting a system to execute complex causal reasoning

589
00:32:00.839 --> 00:32:03.880
<v Speaker 3>in a life or death medical scenario when its underlying

590
00:32:03.960 --> 00:32:07.839
<v Speaker 3>architecture is only capable of probabilistic token generation.

591
00:32:08.200 --> 00:32:11.680
<v Speaker 2>It's like asking a really good autocorrect to perform surgery.

592
00:32:12.000 --> 00:32:15.200
<v Speaker 3>The diagnostic reality provided by the HLE demands a total

593
00:32:15.240 --> 00:32:18.240
<v Speaker 3>recalibration of how and where we deploy these systems safely.

594
00:32:18.440 --> 00:32:21.279
<v Speaker 2>It forces us to ask, if standard academic tests, even

595
00:32:21.359 --> 00:32:24.440
<v Speaker 2>incredibly hard ones, are constantly at risk of being memorized

596
00:32:24.440 --> 00:32:26.960
<v Speaker 2>and gamed by these models, what is the right way

597
00:32:27.000 --> 00:32:30.160
<v Speaker 2>to measure functional intelligence? This leads us to some fascinating

598
00:32:30.319 --> 00:32:34.400
<v Speaker 2>mind bending theoretical frameworks emerging in public discourse. The first

599
00:32:34.400 --> 00:32:36.640
<v Speaker 2>one I want to introduce is the financial Turing test.

600
00:32:36.880 --> 00:32:39.960
<v Speaker 3>The underlying argument of the financial Turing test is that

601
00:32:40.039 --> 00:32:44.359
<v Speaker 3>static academic testing, no matter how rigorous or heavily obfuscated,

602
00:32:44.640 --> 00:32:49.440
<v Speaker 3>will always retain an element of artificiality. Instead, proponents posit

603
00:32:49.480 --> 00:32:55.640
<v Speaker 3>that financial accumulation, specifically operating autonomously within dynamic global financial markets,

604
00:32:56.079 --> 00:32:59.799
<v Speaker 3>serves as a much more pragmatic, ungameable measure of functional

605
00:32:59.839 --> 00:33:00.559
<v Speaker 3>intelligence.

606
00:33:01.319 --> 00:33:04.000
<v Speaker 2>Let's walk through a scenario to really illustrate why this

607
00:33:04.119 --> 00:33:07.079
<v Speaker 2>is such a compelling idea. Imagine there is a sudden,

608
00:33:07.279 --> 00:33:11.400
<v Speaker 2>unexpected military coup in a minor lithium producing country in

609
00:33:11.480 --> 00:33:16.039
<v Speaker 2>South America. The AI needs to instantly recognize this news.

610
00:33:16.000 --> 00:33:18.640
<v Speaker 3>But it's not enough to just summarize the news article.

611
00:33:19.119 --> 00:33:22.160
<v Speaker 2>Exactly. To extract maximal capital, it has to realize that the

612
00:33:22.200 --> 00:33:25.359
<v Speaker 2>new dictator's brother happens to own a controlling stake in

613
00:33:25.400 --> 00:33:28.880
<v Speaker 2>a very specific, mid sized shipping company that operates out

614
00:33:28.880 --> 00:33:31.400
<v Speaker 2>of a neighboring port. It has to deduce that this

615
00:33:31.559 --> 00:33:34.119
<v Speaker 2>shipping company is about to get an exclusive monopoly on

616
00:33:34.160 --> 00:33:37.200
<v Speaker 2>lithium exports, and it has to aggressively buy stock in

617
00:33:37.240 --> 00:33:39.400
<v Speaker 2>that shipping company before the rest of the world's human

618
00:33:39.400 --> 00:33:40.839
<v Speaker 2>analysts make the same connection.

619
00:33:41.200 --> 00:33:46.359
<v Speaker 3>The financial markets represent a hyperdynamic, fiercely adversarial environment. To

620
00:33:46.400 --> 00:33:49.759
<v Speaker 3>succeed in your scenario requires the real time synthesis of

621
00:33:49.839 --> 00:33:55.839
<v Speaker 3>highly obscure geopolitical shifts, economic indicators, and unpredictable human behavior patterns.

622
00:33:56.000 --> 00:33:59.880
<v Speaker 2>A current AI fails in this scenario because the causal

623
00:34:00.039 --> 00:34:02.880
<v Speaker 2>link between the coup, the brother, and the shipping company

624
00:34:02.880 --> 00:34:05.079
<v Speaker 2>hasn't been written down in a thousand news articles yet.

625
00:34:05.559 --> 00:34:08.960
<v Speaker 2>The clustered density doesn't exist for it to retrieve the correlation.

626
00:34:09.280 --> 00:34:13.760
<v Speaker 3>A human hedge fund manager succeeds through rapid novel causal inference.

627
00:34:14.039 --> 00:34:16.960
<v Speaker 2>The proposition is that the autonomous navigation of such chaotic

628
00:34:17.039 --> 00:34:21.760
<v Speaker 2>real world systems demonstrates a far more generalizable, robust intelligence

629
00:34:21.840 --> 00:34:25.480
<v Speaker 2>than deep, isolated academic synthesis ever could. It is the

630
00:34:25.559 --> 00:34:29.440
<v Speaker 2>ultimate test of adapting to unstructured reality. You can't memorize

631
00:34:29.440 --> 00:34:30.239
<v Speaker 2>the stock market.

632
00:34:30.360 --> 00:34:33.840
<v Speaker 3>A second theoretical framework we must examine involves the application

633
00:34:33.920 --> 00:34:37.199
<v Speaker 3>of Goodhart's law and its connection to the IQ test paradox.

634
00:34:37.679 --> 00:34:40.480
<v Speaker 3>Goodhart's law is a well established principle in economics and

635
00:34:40.519 --> 00:34:43.559
<v Speaker 3>measurement theory, which dictates that when a measure becomes a target,

636
00:34:43.880 --> 00:34:45.559
<v Speaker 3>it ceases to be a reliable measure.

637
00:34:45.880 --> 00:34:48.679
<v Speaker 2>When a measure becomes a target, it ceases to be

638
00:34:48.719 --> 00:34:52.079
<v Speaker 2>a reliable measure. We have seen this historically with human

639
00:34:52.119 --> 00:34:56.639
<v Speaker 2>IQ tests. Originally, they were designed to measure underlying generalized

640
00:34:56.719 --> 00:35:00.760
<v Speaker 2>cognitive comprehension, but as society placed more and

641
00:35:00.800 --> 00:35:03.840
<v Speaker 2>more emphasis on the scores, using them for school admissions

642
00:35:03.880 --> 00:35:07.840
<v Speaker 2>and job placements, people started exposing themselves to the structural

643
00:35:07.880 --> 00:35:09.199
<v Speaker 2>formats of the tests.

644
00:35:09.320 --> 00:35:10.679
<v Speaker 3>They bought prep books.

645
00:35:10.599 --> 00:35:13.880
<v Speaker 2>They learned how to take the test. This format optimization

646
00:35:14.119 --> 00:35:18.440
<v Speaker 2>artificially inflated their scores, but it generated absolutely no corresponding

647
00:35:18.519 --> 00:35:22.480
<v Speaker 2>increase in their actual underlying cognitive comprehension. They just got

648
00:35:22.519 --> 00:35:23.880
<v Speaker 2>better at the game of the test.

649
00:35:24.119 --> 00:35:28.159
<v Speaker 3>We are observing this exact phenomenon currently with AI benchmarking.

650
00:35:28.400 --> 00:35:31.679
<v Speaker 3>It is commonly referred to in educational theory as teaching

651
00:35:31.719 --> 00:35:36.199
<v Speaker 3>to the test. Developers are continuously optimizing their algorithmic architectures,

652
00:35:36.480 --> 00:35:41.280
<v Speaker 3>specifically to maximize performance on localized metrics and standardized data sets.

653
00:35:41.679 --> 00:35:44.320
<v Speaker 2>They are engineering the models to beat the test rather

654
00:35:44.400 --> 00:35:47.280
<v Speaker 2>than fundamentally improving generalized causal reasoning.

655
00:35:47.519 --> 00:35:50.960
<v Speaker 3>This is precisely why the HLE must remain strictly obfuscated

656
00:35:51.320 --> 00:35:54.000
<v Speaker 3>to prevent the measure from becoming a mere target for

657
00:35:54.119 --> 00:35:55.480
<v Speaker 3>algorithmic optimization.

658
00:35:55.920 --> 00:35:58.280
<v Speaker 2>And if we extrapolate this arms race out to its

659
00:35:58.320 --> 00:36:01.639
<v Speaker 2>logical conclusion, we arrive at a third theoretical projection that

660
00:36:01.760 --> 00:36:06.119
<v Speaker 2>is truly a paradigm shifter: algorithmic exam generation. Think about

661
00:36:06.119 --> 00:36:09.920
<v Speaker 2>this logical progression. What happens after the theoretical mastery of

662
00:36:09.920 --> 00:36:13.960
<v Speaker 2>the HLE? Let's say, decades from now, an entirely

663
00:36:14.039 --> 00:36:18.039
<v Speaker 2>new architecture is invented that finally crosses that structural asymptote

664
00:36:18.119 --> 00:36:21.840
<v Speaker 2>and genuinely masters these twenty-five hundred expert level

665
00:36:21.920 --> 00:36:25.360
<v Speaker 2>questions using true causal reasoning. What is the next step?

666
00:36:25.719 --> 00:36:28.920
<v Speaker 3>The next methodological inversion is to task those advanced AI

667
00:36:29.000 --> 00:36:32.119
<v Speaker 3>systems with designing subsequent iterations of testing themselves.

668
00:36:32.199 --> 00:36:33.039
<v Speaker 2>That's a wild thought.

669
00:36:33.519 --> 00:36:36.599
<v Speaker 3>It posits a future where the computational system itself generates

670
00:36:36.679 --> 00:36:40.440
<v Speaker 3>multidisciplinary diagnostic queries designed to map the conceptual limits of

671
00:36:40.480 --> 00:36:44.840
<v Speaker 3>human cognition or to benchmark next generation computational architectures. The

672
00:36:44.880 --> 00:36:47.559
<v Speaker 3>AI would generate queries that require a level of conceptual

673
00:36:47.559 --> 00:36:50.800
<v Speaker 3>integration that far exceeds the limits of current human experts.

674
00:36:51.199 --> 00:36:54.119
<v Speaker 2>We would essentially rely on the machine to define the

675
00:36:54.159 --> 00:36:57.639
<v Speaker 2>new boundaries of intelligence, creating tests that we ourselves could

676
00:36:57.679 --> 00:37:01.519
<v Speaker 2>not pass. The student becomes the master, designing the test

677
00:37:01.599 --> 00:37:03.039
<v Speaker 2>for the next generation of minds.

678
00:37:03.239 --> 00:37:04.880
<v Speaker 3>It's an incredible structural shift.

679
00:37:04.960 --> 00:37:08.599
<v Speaker 2>But shifting the tone slightly, we also have to confront

680
00:37:08.639 --> 00:37:12.559
<v Speaker 2>the impending technological collision that makes all of this urgency,

681
00:37:12.639 --> 00:37:17.360
<v Speaker 2>all of this precise boundary mapping so incredibly palpable. I

682
00:37:17.440 --> 00:37:20.239
<v Speaker 2>am talking about the quantum threat horizon. We have been

683
00:37:20.280 --> 00:37:23.639
<v Speaker 2>discussing the limits of current AI running on classical silicon

684
00:37:23.679 --> 00:37:28.000
<v Speaker 2>computer chips. But what happens when advanced machine learning merges

685
00:37:28.039 --> 00:37:30.400
<v Speaker 2>with theoretical quantum hardware developments?

686
00:37:30.800 --> 00:37:34.039
<v Speaker 3>The intersection of speed and synthesis in a quantum augmented

687
00:37:34.079 --> 00:37:38.360
<v Speaker 3>AI model forces us to analyze projected security vulnerabilities on

688
00:37:38.400 --> 00:37:43.119
<v Speaker 3>an entirely different scale. This structural projection involves theoretical quantum

689
00:37:43.199 --> 00:37:47.639
<v Speaker 3>chips executing operations exponentially faster than any classical benchmark.

690
00:37:47.800 --> 00:37:51.920
<v Speaker 2>If machine learning capabilities scale alongside this quantum processing power,

691
00:37:52.079 --> 00:37:55.519
<v Speaker 2>the threat matrix expands exponentially. Let's get specific about that

692
00:37:55.559 --> 00:37:58.440
<v Speaker 2>threat matrix because it involves something called Shor's algorithm.

693
00:37:58.760 --> 00:38:01.559
<v Speaker 3>Yes, Shor's algorithm is the critical vulnerability.

694
00:38:01.800 --> 00:38:04.320
<v Speaker 2>And to understand why this is so terrifying, you have

695
00:38:04.360 --> 00:38:08.599
<v Speaker 2>to understand how classical encryption works. Things like RSA encryption,

696
00:38:08.920 --> 00:38:13.199
<v Speaker 2>which essentially secures the entire modern digital world, from your

697
00:38:13.239 --> 00:38:18.039
<v Speaker 2>online banking to classified government communications to the power grid,

698
00:38:18.119 --> 00:38:20.920
<v Speaker 2>rely on the fact that classical computers are really bad

699
00:38:21.199 --> 00:38:23.280
<v Speaker 2>at factoring the products of massive prime numbers.

700
00:38:23.599 --> 00:38:25.480
<v Speaker 3>They are computationally inefficient at it.

701
00:38:25.639 --> 00:38:29.679
<v Speaker 2>Let's use an analogy. Imagine classical encryption is a colossal,

702
00:38:29.960 --> 00:38:34.000
<v Speaker 2>incredibly complex maze. A classical computer trying to break that

703
00:38:34.119 --> 00:38:37.440
<v Speaker 2>encryption has to run down every single path one by one.

704
00:38:37.840 --> 00:38:40.159
<v Speaker 2>It hits a dead end, turns around, and tries the

705
00:38:40.159 --> 00:38:43.400
<v Speaker 2>next path. It would take a classical supercomputer millions of

706
00:38:43.480 --> 00:38:46.119
<v Speaker 2>years to check every path in the RSA maze.

707
00:38:46.119 --> 00:38:49.280
<v Speaker 3>But a quantum computer using Shor's algorithm doesn't run down the paths

708
00:38:49.320 --> 00:38:49.880
<v Speaker 3>one by one.

709
00:38:50.079 --> 00:38:53.840
<v Speaker 2>No, it essentially floods the entire maze with water, simultaneously

710
00:38:53.880 --> 00:38:55.239
<v Speaker 2>finding the exit instantly.

711
00:38:55.519 --> 00:38:59.599
<v Speaker 3>That is a highly effective analogy. Shor's algorithm utilizes quantum

712
00:38:59.639 --> 00:39:04.039
<v Speaker 3>superposition to shatter classical encryption by factoring the products of those massive

713
00:39:04.159 --> 00:39:09.840
<v Speaker 3>prime numbers at speeds impossible for classical architectures. Classical encryption

714
00:39:09.960 --> 00:39:13.159
<v Speaker 3>protocols face immediate critical failure in this scenario.

715
00:39:13.480 --> 00:39:17.960
<v Speaker 2>Now imagine integrating that raw maze flooding computational speed with

716
00:39:18.079 --> 00:39:22.440
<v Speaker 2>an AI model that has actually achieved expert level contextual synthesis.

717
00:39:22.920 --> 00:39:26.280
<v Speaker 2>You no longer just have a really fast calculator breaking codes.

718
00:39:26.559 --> 00:39:30.239
<v Speaker 2>You have an autonomous system capable of creatively deducing the

719
00:39:30.280 --> 00:39:32.360
<v Speaker 2>contextual architecture of a target system.

720
00:39:32.440 --> 00:39:35.199
<v Speaker 3>That is the systemic dismantling scenario. We are talking about

721
00:39:35.239 --> 00:39:39.519
<v Speaker 3>an AI capable of bypassing encryption, understanding the contextual logic

722
00:39:39.559 --> 00:39:43.079
<v Speaker 3>of a foundational financial infrastructure and systematically dismantling it.

723
00:39:43.199 --> 00:39:46.960
<v Speaker 2>Or infiltrating and encrypting governmental firewalls, entirely locking us out

724
00:39:47.000 --> 00:39:48.320
<v Speaker 2>of our own defensive systems.

725
00:39:48.679 --> 00:39:52.239
<v Speaker 3>And it would execute this systemic dismantling at a pace

726
00:39:52.320 --> 00:39:55.960
<v Speaker 3>that far exceeds any human defensive response protocol. By the

727
00:39:56.000 --> 00:39:59.320
<v Speaker 3>time human analysts realize the breach has occurred, the AI

728
00:39:59.400 --> 00:40:02.320
<v Speaker 3>has already rewritten the architecture of the system. The

729
00:40:02.360 --> 00:40:05.960
<v Speaker 3>intersection of quantum speed and advanced contextual reasoning is a

730
00:40:06.000 --> 00:40:08.599
<v Speaker 3>horizon we must prepare for with extreme precision.

731
00:40:09.239 --> 00:40:12.880
<v Speaker 2>So what does this all mean in final synthesis? This

732
00:40:12.920 --> 00:40:16.280
<v Speaker 2>is why the current diagnostic efforts of this massive global consortium

733
00:40:16.400 --> 00:40:20.119
<v Speaker 2>are so crucial. The assessment serves as the most precise

734
00:40:20.239 --> 00:40:23.800
<v Speaker 2>current instrument for quantifying the persistent cognitive divide.

735
00:40:24.039 --> 00:40:29.039
<v Speaker 3>It systematically delineates the profound separation between human specialized expertise,

736
00:40:29.320 --> 00:40:33.559
<v Speaker 3>our capacity for multimodal contextual causal reasoning, and machine learning's

737
00:40:33.599 --> 00:40:35.519
<v Speaker 3>reliance on statistical pattern recognition.

738
00:40:35.679 --> 00:40:38.639
<v Speaker 2>It leaves the entire scientific community to confront a deeply

739
00:40:38.760 --> 00:40:42.559
<v Speaker 2>rigorous theoretical question regarding the epistemology of artificial intelligence.

740
00:40:42.760 --> 00:40:46.559
<v Speaker 3>How must we define intelligence when the very instruments utilized

741
00:40:46.559 --> 00:40:50.800
<v Speaker 3>to measure it must be continuously elevated in complexity? We

742
00:40:50.840 --> 00:40:55.199
<v Speaker 3>have to strictly obfuscate the data, dynamically redesign the questions,

743
00:40:55.480 --> 00:40:59.199
<v Speaker 3>and constantly build higher walls solely to evade the rote

744
00:40:59.320 --> 00:41:02.960
<v Speaker 3>statistical memorization capabilities of the subjects being tested.

745
00:41:03.159 --> 00:41:06.719
<v Speaker 2>Are we measuring true comprehension or are we simply engaged

746
00:41:06.760 --> 00:41:10.360
<v Speaker 2>in an escalating, multi billion dollar arms race against an

747
00:41:10.360 --> 00:41:15.079
<v Speaker 2>increasingly sophisticated statistical parrot? The nature of intelligence itself becomes

748
00:41:15.079 --> 00:41:17.440
<v Speaker 2>a moving target. And that leads me to a

749
00:41:17.480 --> 00:41:19.800
<v Speaker 2>final lingering thought for you to mull over as we

750
00:41:19.840 --> 00:41:23.639
<v Speaker 2>wrap up this exploration. The core philosophical dilemma, actually: if

751
00:41:23.679 --> 00:41:26.920
<v Speaker 2>human intelligence is increasingly being defined purely in the negative,

752
00:41:26.960 --> 00:41:30.559
<v Speaker 2>defined simply as the things a statistical machine cannot yet do,

753
00:41:31.280 --> 00:41:34.199
<v Speaker 2>what happens to our fundamental understanding of human identity and

754
00:41:34.280 --> 00:41:37.320
<v Speaker 2>human exceptionalism on the day a machine finally does cross

755
00:41:37.320 --> 00:41:40.280
<v Speaker 2>that structural asymptote? Are we just going to endlessly

756
00:41:40.360 --> 00:41:43.760
<v Speaker 2>move the goalposts of our own uniqueness forever, creating harder

757
00:41:43.840 --> 00:41:45.960
<v Speaker 2>and harder tests to prove we are still special?

758
00:41:46.320 --> 00:41:50.440
<v Speaker 3>Or is there an unquantifiable, deeply intrinsic essence to human

759
00:41:50.480 --> 00:41:55.519
<v Speaker 3>causal reasoning, a spark of true comprehension that no mathematical algorithm,

760
00:41:55.559 --> 00:41:58.559
<v Speaker 3>no matter how complex or how quantum, will ever be

761
00:41:58.639 --> 00:41:59.239
<v Speaker 3>able to map?

762
00:41:59.320 --> 00:42:01.320
<v Speaker 2>It's a question that challenges the very core

763
00:42:01.440 --> 00:42:03.679
<v Speaker 2>of who we are in the algorithmic age.

764
00:42:03.719 --> 00:42:07.320
<v Speaker 3>It is a profound inquiry, one that demands continuous critical thinking,

765
00:42:07.679 --> 00:42:11.639
<v Speaker 3>questioning of our baseline assumptions, and an appreciation for the vast,

766
00:42:11.800 --> 00:42:15.239
<v Speaker 3>uncharted complexities of both the human and the artificial mind.

767
00:42:15.440 --> 00:42:18.840
<v Speaker 2>It truly is. Keep questioning those headlines, keep looking beyond

768
00:42:18.880 --> 00:42:21.800
<v Speaker 2>the near perfect test scores, and never stop exploring the

769
00:42:21.840 --> 00:42:25.800
<v Speaker 2>incredible boundary between math and mind. Keep your curiosity sharp.
