WEBVTT 1 00:00:00.120 --> 00:00:03.319 Welcome to the Grok 4 release. Here, this is the 2 00:00:03.240 --> 00:00:05.360 smartest AI in the world. We're going to show you 3 00:00:05.360 --> 00:00:08.800 exactly how and why. It really is remarkable to see 4 00:00:08.800 --> 00:00:13.080 the advancement of artificial intelligence, how quickly it is evolving. 5 00:00:14.000 --> 00:00:18.399 I sometimes compare it to the growth of a human 6 00:00:18.839 --> 00:00:23.839 and how fast a human learns and gains conscious awareness and understanding. 7 00:00:24.519 --> 00:00:30.640 And AI is advancing just vastly faster than any human. 8 00:00:31.199 --> 00:00:33.600 I mean, we're going to take you through a bunch 9 00:00:33.640 --> 00:00:38.119 of benchmarks that Grok 4 is able to achieve 10 00:00:38.119 --> 00:00:39.000 incredible numbers on. 11 00:00:39.520 --> 00:00:42.000 But it's actually worth noting 12 00:00:41.759 --> 00:00:45.840 that Grok 4, if given like the SAT, would 13 00:00:45.880 --> 00:00:48.920 get perfect SATs every time, even if it's never seen 14 00:00:48.960 --> 00:00:53.039 the questions before, and even going beyond that, to 15 00:00:53.159 --> 00:00:57.159 say like graduate student exams like the GRE, it will 16 00:00:57.200 --> 00:01:03.840 get near perfect results in every discipline of education, 17 00:01:04.000 --> 00:01:08.640 so from the humanities to like languages, math, physics, engineering, 18 00:01:09.359 --> 00:01:13.000 pick anything, and we're talking about questions that it's never 19 00:01:13.000 --> 00:01:16.280 seen before. These are not on the Internet, and 20 00:01:17.400 --> 00:01:21.680 Grok 4 is smarter than almost all graduate students in 21 00:01:21.760 --> 00:01:26.560 all disciplines simultaneously. Like it's actually just important to appreciate 22 00:01:26.599 --> 00:01:32.879 that, like, that's really something. 
And the reasoning capabilities of 23 00:01:33.000 --> 00:01:35.599 Grok are incredible. So there's some people out there 24 00:01:35.760 --> 00:01:39.280 who think AI can't reason, and look, it can reason 25 00:01:39.359 --> 00:01:44.040 at superhuman levels. So yeah, and frankly, it only 26 00:01:44.079 --> 00:01:47.359 gets better from here. So we'll take you through the 27 00:01:47.400 --> 00:01:50.599 Grok 4 release and show you the 28 00:01:50.680 --> 00:01:53.959 pace of progress here. Like I guess the first part 29 00:01:54.000 --> 00:01:56.879 is like, in terms of the training, we're going from 30 00:01:57.040 --> 00:02:00.519 Grok 2 to Grok 3 to Grok 4. We essentially 31 00:02:00.519 --> 00:02:03.359 increased the training by an order of magnitude in each case, 32 00:02:04.040 --> 00:02:08.120 so it's one hundred times more training than Grok 2, 33 00:02:08.919 --> 00:02:13.599 and that's only going to increase. So it's, yeah, frankly, 34 00:02:13.639 --> 00:02:15.319 I mean, I don't know. In some ways a little terrifying, 35 00:02:15.520 --> 00:02:18.360 but the growth of intelligence here is remarkable. 36 00:02:18.639 --> 00:02:21.000 Yes, it's important to realize there are two types of 37 00:02:21.000 --> 00:02:23.919 training compute. One is the pre-training compute, that's from 38 00:02:24.039 --> 00:02:27.439 Grok 2 to Grok 3. But from Grok 3 to 39 00:02:27.560 --> 00:02:30.599 Grok 4 we're actually putting a lot of compute in 40 00:02:30.960 --> 00:02:32.479 reasoning, in RL. 41 00:02:32.560 --> 00:02:35.000 And just like you said, this is literally the fastest 42 00:02:35.000 --> 00:02:37.319 moving field, and Grok 2 is like the high school 43 00:02:37.360 --> 00:02:39.879 student by today's standard. If you look back over the 44 00:02:39.960 --> 00:02:44.039 last twelve months, Grok 2 was only a concept. We didn't 45 00:02:44.159 --> 00:02:47.039 even have Grok 2 twelve months ago. 
And then by 46 00:02:47.039 --> 00:02:49.240 training Grok 2, that was the first time we scaled up 47 00:02:49.280 --> 00:02:51.919 the pre-training. We realized that if you actually 48 00:02:51.960 --> 00:02:55.879 do the data ablation really carefully, and the infra, and also 49 00:02:55.879 --> 00:02:59.360 the algorithm, we can actually push the pre-training quite 50 00:02:59.360 --> 00:03:02.280 a lot, by about 10x, to make the model 51 00:03:02.599 --> 00:03:05.479 the best pre-trained base model. And that's why we 52 00:03:05.520 --> 00:03:09.759 built Colossus, the world's supercomputer, with 100,000 H100s. 53 00:03:10.639 --> 00:03:14.039 And then with the best pre-trained model, we realized 54 00:03:14.199 --> 00:03:17.400 if you can collect these verifiable outcome rewards, you can 55 00:03:17.439 --> 00:03:19.719 actually train this model to start thinking from first principles, 56 00:03:19.759 --> 00:03:22.120 to reason and correct its own mistakes, and that's where 57 00:03:22.120 --> 00:03:24.840 the Grok reasoning comes from. And today we asked the 58 00:03:24.919 --> 00:03:28.000 question: what happens if you take the expansion of 59 00:03:28.080 --> 00:03:31.599 Colossus, with all two hundred thousand GPUs, and put all of these 60 00:03:31.840 --> 00:03:36.400 into RL, 10x more compute than any of the models 61 00:03:36.400 --> 00:03:39.719 out there, on reinforcement learning at unprecedented scale? 62 00:03:39.960 --> 00:03:40.719 What's going to happen? 63 00:03:41.360 --> 00:03:44.520 So this is the story of Grok 4, and you know, 64 00:03:44.840 --> 00:03:47.000 Tony, share some insight with the audience. 65 00:03:47.400 --> 00:03:50.199 Yeah, so yeah, let's just talk about how smart Grok 66 00:03:50.199 --> 00:03:53.400 4 is. So I guess we can start discussing 67 00:03:53.400 --> 00:03:56.759 this benchmark called Humanity's Last Exam, and this 68 00:03:56.840 --> 00:04:01.520 benchmark is a very, very challenging benchmark. 
Every single problem 69 00:04:01.759 --> 00:04:06.800 is curated by subject matter experts. It's in total twenty 70 00:04:06.840 --> 00:04:11.199 five hundred problems, and it consists of many different subjects: mathematics, 71 00:04:11.319 --> 00:04:15.879 natural sciences, engineering, and also all the humanities subjects. So 72 00:04:16.319 --> 00:04:19.199 essentially when it was first released, actually like earlier this year, 73 00:04:20.079 --> 00:04:23.360 most of the models out there could only get single 74 00:04:23.360 --> 00:04:26.639 digit accuracy on this benchmark. Yeah, so we can look 75 00:04:26.639 --> 00:04:29.800 at some of those examples. There is this mathematical problem 76 00:04:30.160 --> 00:04:35.160 which is about natural transformations in category theory, and there's 77 00:04:35.199 --> 00:04:40.000 this organic chemistry problem that talks about electrocyclic reactions. 78 00:04:40.560 --> 00:04:43.800 And also there's this linguistic problem that tries to ask 79 00:04:43.839 --> 00:04:48.279 you about distinguishing between closed and open syllables from a 80 00:04:48.399 --> 00:04:52.439 Hebrew source text. So you can see it's a 81 00:04:52.560 --> 00:04:56.519 very wide range of problems, and every single problem is 82 00:04:56.839 --> 00:04:59.720 a PhD or even advanced research level problem. 83 00:05:00.439 --> 00:05:03.480 Yeah, I mean, there are no humans that can 84 00:05:03.519 --> 00:05:06.040 actually answer these, can get a good score. I mean, 85 00:05:06.040 --> 00:05:08.759 if you ask me, say, like any given human, what, 86 00:05:09.279 --> 00:05:11.920 like what's the best that any human could score, I'd 87 00:05:11.959 --> 00:05:17.879 say maybe five percent, optimistically. So this is much harder 88 00:05:17.879 --> 00:05:21.439 than what any human can do. It's incredibly difficult. 
89 00:05:21.800 --> 00:05:23.639 And you can see from the types of questions, like, 90 00:05:23.680 --> 00:05:27.800 you might be incredible in linguistics or mathematics or chemistry 91 00:05:27.879 --> 00:05:30.120 or physics or any one of a number of subjects, but 92 00:05:30.120 --> 00:05:32.560 you're not going to be at a postgrad level 93 00:05:32.600 --> 00:05:37.160 in everything, and Grok 4 is at a postgrad level in everything. 94 00:05:37.839 --> 00:05:39.839 Like, some of these things are just 95 00:05:39.879 --> 00:05:45.879 worth repeating. Like, Grok 4 is postgraduate, like PhD level, 96 00:05:46.399 --> 00:05:49.600 in everything. Better than PhD, like, most PhDs would fail. 97 00:05:50.480 --> 00:05:53.040 So it's better. That said, I mean, at least with 98 00:05:53.079 --> 00:05:56.600 respect to academic questions. I want to just emphasize 99 00:05:56.600 --> 00:06:00.759 this point. With respect to academic questions, Grok 4 is better 100 00:06:00.800 --> 00:06:04.839 than PhD level in every subject, no exceptions. That doesn't 101 00:06:04.839 --> 00:06:08.680 mean that it's, you know, at times it may lack common sense, 102 00:06:08.959 --> 00:06:13.639 and it has not yet invented new technologies or discovered 103 00:06:13.680 --> 00:06:16.360 new physics, but that is just a matter of time. 104 00:06:17.560 --> 00:06:22.439 It may discover new technologies as soon as later this year, 105 00:06:22.800 --> 00:06:24.519 and I would be shocked if it has not done 106 00:06:24.519 --> 00:06:29.079 so next year. So I would expect Grok to literally 107 00:06:29.079 --> 00:06:32.160 discover new technologies that are actually useful no later than 108 00:06:32.160 --> 00:06:34.879 next year, and maybe end of this year. It might 109 00:06:34.959 --> 00:06:38.759 discover new physics next year, and within two years, I'd 110 00:06:38.759 --> 00:06:41.319 say almost certainly. Like, so just let that sink in. 
111 00:06:41.639 --> 00:06:45.160 Okay, so I guess we can talk about what's 112 00:06:45.199 --> 00:06:48.360 behind the scenes of Grok 4. As Jimmy mentioned, we're 113 00:06:48.519 --> 00:06:52.600 actually throwing a lot of compute into this training. When 114 00:06:52.639 --> 00:06:57.720 it started, it was only a single digit number. But as 115 00:06:57.759 --> 00:07:00.079 you start putting in more and more training compute, it 116 00:07:01.040 --> 00:07:05.160 started to gradually become smarter and smarter and eventually solved 117 00:07:06.079 --> 00:07:09.920 a quarter of the HLE problems. And this is without 118 00:07:09.920 --> 00:07:13.160 any tools. The next thing we did was adding 119 00:07:13.639 --> 00:07:18.079 tool capabilities to the model, and unlike Grok 3, I 120 00:07:18.079 --> 00:07:21.079 think Grok 3 actually is able to use tools as well, 121 00:07:21.160 --> 00:07:23.759 but here we actually make it more native in the 122 00:07:23.839 --> 00:07:27.000 sense that we put the tools into training. Grok 3 123 00:07:27.160 --> 00:07:30.079 was only relying on generalization. Here we actually put the 124 00:07:30.120 --> 00:07:34.079 tools into training, and it turns out this significantly improves 125 00:07:34.120 --> 00:07:37.079 the model's capability of using those tools. So how is 126 00:07:37.120 --> 00:07:41.680 this different? DeepSearch was exactly the Grok 3 reasoning model 127 00:07:42.560 --> 00:07:45.759 without any specific training, but we only asked it to 128 00:07:45.879 --> 00:07:49.399 use those tools. So compared to this, it was much 129 00:07:49.439 --> 00:07:54.240 weaker in terms of its tool capabilities, and unreliable. 130 00:07:54.399 --> 00:07:56.720 Yes, yes, and to be clear, like these are still, 131 00:07:56.759 --> 00:07:59.920 I'd say, this is still fairly primitive tool use. 
132 00:08:00.360 --> 00:08:02.199 If you compare it to, say, the tools that are 133 00:08:02.319 --> 00:08:06.720 used at Tesla or SpaceX, where you're using finite element analysis 134 00:08:06.800 --> 00:08:11.959 and computational fluid dynamics and you're able to run, or 135 00:08:12.000 --> 00:08:14.720 say like Tesla does, like crash simulations, where the 136 00:08:14.720 --> 00:08:17.839 simulations are so close to reality that if the test 137 00:08:17.879 --> 00:08:21.000 doesn't match the simulation, you assume that the test article 138 00:08:21.079 --> 00:08:24.040 is wrong. That's how good the simulations are. So Grok 139 00:08:24.160 --> 00:08:26.399 is not currently using any of the tools that a 140 00:08:26.480 --> 00:08:29.120 company would use, but that is something that we will 141 00:08:29.639 --> 00:08:32.279 provide it with later this year, so it'll have the 142 00:08:32.320 --> 00:08:37.679 tools that a company has and have a very accurate physics simulator. Ultimately, 143 00:08:37.720 --> 00:08:39.639 the thing that will make the biggest difference is being 144 00:08:39.679 --> 00:08:42.759 able to interact with the real world via humanoid robots. 145 00:08:43.120 --> 00:08:45.679 So you combine Grok with Optimus and it can actually 146 00:08:46.200 --> 00:08:48.600 interact with the real world and figure out 147 00:08:49.039 --> 00:08:52.559 if it's, you know, it can formulate a hypothesis 148 00:08:52.840 --> 00:08:56.039 and then confirm if that hypothesis is true or not. 149 00:08:57.039 --> 00:08:59.639 So we're really, you know, I think about like where 150 00:08:59.679 --> 00:08:59.960 we are today. 151 00:09:00.159 --> 00:09:04.799 We're at the beginning of an immense intelligence explosion. We're 152 00:09:04.840 --> 00:09:08.919 in the intelligence big bang right now, and the most 153 00:09:08.960 --> 00:09:13.080 interesting time to be alive of any time in history. Now, 154 00:09:13.279 --> 00:09:13.600 that said, 
155 00:09:13.679 --> 00:09:15.320 We need to make sure that the AI is a 156 00:09:15.360 --> 00:09:15.799 good AI. 157 00:09:16.039 --> 00:09:18.639 The thing that I think is most important for AI safety, 158 00:09:18.639 --> 00:09:21.639 at least my biological neural net tells me the most 159 00:09:21.679 --> 00:09:25.200 important thing for AI, is to be maximally truth seeking. 160 00:09:26.120 --> 00:09:30.000 You can think of AI as this super genius child 161 00:09:30.080 --> 00:09:33.159 that ultimately will outsmart you, but you can still instill 162 00:09:33.159 --> 00:09:39.200 the right values, encourage it to be sort of, you know, truthful, honorable, 163 00:09:39.440 --> 00:09:41.759 you know, good things, like the values you want to instill 164 00:09:41.879 --> 00:09:45.080 in a child that will ultimately grow up to be incredibly powerful. Yeah, 165 00:09:45.240 --> 00:09:48.080 these are still primitive tools and not the kind of 166 00:09:48.120 --> 00:09:51.879 tools that serious commercial companies use. But we will provide 167 00:09:51.879 --> 00:09:53.919 it with those tools, and I think it will be 168 00:09:53.919 --> 00:09:56.639 able to solve real world technology problems. 169 00:09:56.240 --> 00:09:57.879 Yes, yes, exactly. 170 00:09:58.000 --> 00:10:00.679 But is it just compute all you need? Is it 171 00:10:00.720 --> 00:10:02.639 just compute all you need at this point? 172 00:10:02.519 --> 00:10:05.320 Well, you need compute plus the right tools, and then 173 00:10:05.399 --> 00:10:07.559 ultimately to be able to interact with the physical world, 174 00:10:07.799 --> 00:10:10.840 and then we will effectively have an economy that is 175 00:10:11.360 --> 00:10:15.000 ultimately thousands of times bigger than our current economy, or 176 00:10:15.039 --> 00:10:18.320 maybe millions of times. 
If you think of civilization as 177 00:10:19.120 --> 00:10:23.879 percentage completion of the Kardashev scale, where Kardashev one is 178 00:10:24.639 --> 00:10:27.960 using all the energy output of a planet, and Kardashev 179 00:10:28.039 --> 00:10:30.519 two is using all the energy output of a sun, 180 00:10:31.240 --> 00:10:33.440 and three is all the energy output of a galaxy, 181 00:10:34.159 --> 00:10:37.120 we're only, in my opinion, probably closer to one percent 182 00:10:37.240 --> 00:10:40.840 of Kardashev one than we are to ten percent. So 183 00:10:41.360 --> 00:10:44.960 like maybe one or two percent of Kardashev one. 184 00:10:45.039 --> 00:10:45.639 So we 185 00:10:45.919 --> 00:10:48.399 will get to most of the way, like 186 00:10:48.320 --> 00:10:51.600 eighty, ninety percent of Kardashev one, and then hopefully, if civilization 187 00:10:51.679 --> 00:10:55.759 doesn't self-annihilate, the actual notion of a human economy, 188 00:10:56.000 --> 00:11:00.240 assuming civilization continues to progress, will seem very quaint in retrospect. 189 00:11:01.039 --> 00:11:04.120 It will seem like a sort of cavemen throwing sticks into 190 00:11:04.159 --> 00:11:07.799 a fire level of economy compared to what the future 191 00:11:07.840 --> 00:11:12.200 will hold. It's very exciting. I've been at times kind 192 00:11:12.240 --> 00:11:16.000 of worried about, like, well, you know, this seems like 193 00:11:16.720 --> 00:11:20.600 it's somewhat unnerving to have intelligence created that is far 194 00:11:20.639 --> 00:11:24.799 greater than our own, and will this be bad or good 195 00:11:24.840 --> 00:11:25.600 for humanity? 196 00:11:26.200 --> 00:11:28.399 I think it'll be good. Most likely it'll be good. 197 00:11:29.120 --> 00:11:32.679 But I somewhat reconciled myself to the fact that even 198 00:11:32.720 --> 00:11:35.159 if it wasn't going to be good, I'd at least 199 00:11:35.159 --> 00:11:36.480 like to be alive to see it happen. 
200 00:11:36.799 --> 00:11:41.240 So yeah, yeah, I think a technical problem that we 201 00:11:41.279 --> 00:11:44.279 still need to solve, besides just compute, is how do 202 00:11:44.360 --> 00:11:48.159 we unblock the data bottleneck, because when we try to 203 00:11:48.200 --> 00:11:52.279 scale up the RL in this case, we did invent 204 00:11:52.399 --> 00:11:56.440 a lot of new techniques and innovations to allow us to 205 00:11:56.559 --> 00:11:59.519 figure out how to find a lot of challenging RL 206 00:11:59.559 --> 00:12:02.200 problems to work on. It's not just that the problem itself 207 00:12:02.240 --> 00:12:04.480 needs to be challenging, but also 208 00:12:04.960 --> 00:12:07.759 you also need to have like a reliable signal to 209 00:12:07.799 --> 00:12:10.200 tell the model you did it wrong, you did it right. 210 00:12:10.360 --> 00:12:13.440 This is sort of the principle of reinforcement learning, and 211 00:12:14.320 --> 00:12:17.480 as the model gets smarter and smarter, the number of 212 00:12:17.799 --> 00:12:20.919 cool problems or challenging problems will be less and less. 213 00:12:21.519 --> 00:12:24.639 So it's going to be a new type of challenge 214 00:12:24.679 --> 00:12:26.720 that we need to surpass besides just compute. 215 00:12:26.840 --> 00:12:30.360 Yeah, we actually are running out of actual test questions 216 00:12:30.399 --> 00:12:33.679 to ask. So even questions that are ridiculously hard, 217 00:12:33.720 --> 00:12:37.559 if not essentially impossible, for humans, that are written down 218 00:12:37.639 --> 00:12:42.279 questions, are becoming trivial for AI. You know, the one 219 00:12:42.279 --> 00:12:45.240 thing that is an excellent judge of things is reality. 220 00:12:45.720 --> 00:12:49.200 So because physics is the law, ultimately everything else 221 00:12:49.279 --> 00:12:49.919 is a recommendation. 222 00:12:50.080 --> 00:12:51.279 You can't break physics. 
223 00:12:51.799 --> 00:12:54.320 So the ultimate test, I think, for an AI, 224 00:12:54.679 --> 00:12:59.519 the ultimate reasoning test, is reality. So you invent 225 00:12:59.559 --> 00:13:03.000 a new technology, like say, improve the design of a 226 00:13:02.399 --> 00:13:06.200 car or a rocket, or create a new medication. Does 227 00:13:06.240 --> 00:13:06.679 it work? 228 00:13:07.360 --> 00:13:10.279 Does the rocket get to orbit? Does the car drive? 229 00:13:10.480 --> 00:13:13.960 Does the medicine work? Whatever the case may be, reality 230 00:13:14.039 --> 00:13:16.559 is the ultimate judge here. So it's going to be 231 00:13:16.600 --> 00:13:18.879 reinforcement learning, closing the loop around reality. 232 00:13:19.120 --> 00:13:21.159 We asked the question, how do we even go further? 233 00:13:21.519 --> 00:13:26.159 So actually we are thinking about, now with a single agent, 234 00:13:26.240 --> 00:13:28.320 we're able to solve forty percent of the problems. What 235 00:13:28.399 --> 00:13:32.080 if we have multiple agents running at the same time? So 236 00:13:32.120 --> 00:13:35.600 this is what's called test-time compute. And as we 237 00:13:35.639 --> 00:13:38.759 scale up the test-time compute, actually we are able 238 00:13:38.799 --> 00:13:42.200 to solve more than fifty percent of the text 239 00:13:42.279 --> 00:13:47.120 only subset of the HLE problems. So it's a remarkable achievement. 240 00:13:47.159 --> 00:13:48.840 I think this is insanely difficult. 241 00:13:49.080 --> 00:13:51.360 What we're saying is that a majority of the text-based part 242 00:13:51.559 --> 00:13:55.600 of, you know, the scarily named Humanity's Last Exam, Grok 243 00:13:55.639 --> 00:13:58.159 4 can solve. You can try it out for yourself 244 00:13:58.559 --> 00:14:01.039 with Grok 4 Heavy. 
What it does is it spawns 245 00:14:01.120 --> 00:14:04.759 multiple agents in parallel, and all of those agents do 246 00:14:04.919 --> 00:14:09.240 work independently, and then they compare their work and they 247 00:14:09.039 --> 00:14:11.320 decide which one. It's like a study group. 248 00:14:12.120 --> 00:14:15.360 It's not as simple as a majority vote, because often 249 00:14:15.440 --> 00:14:18.279 only one of the agents actually figures out the trick 250 00:14:19.000 --> 00:14:22.000 or figures out the solution. But once they share 251 00:14:22.159 --> 00:14:25.679 the trick or figure out what the real nature of 252 00:14:25.679 --> 00:14:28.519 the problem is, they share that solution with the other 253 00:14:28.559 --> 00:14:32.080 agents, and then they compare notes and yield an answer. 254 00:14:32.440 --> 00:14:35.039 So that's the heavy part of Grok 4, where 255 00:14:35.720 --> 00:14:38.200 you scale up the test-time compute by roughly an 256 00:14:38.279 --> 00:14:42.519 order of magnitude, have multiple agents tackle the task, and 257 00:14:42.559 --> 00:14:46.639 then they compare their work and they put forward 258 00:14:46.399 --> 00:14:47.720 what they think is the best result. 259 00:14:48.039 --> 00:14:51.759 Yeah, so we will introduce Grok 4 and Grok 4 Heavy. 260 00:14:51.840 --> 00:14:55.159 Sorry, you can click to the next slide. Yeah, so yeah, 261 00:14:55.240 --> 00:14:59.159 so basically Grok 4 is the single agent version, 262 00:14:59.399 --> 00:15:04.000 and Grok 4 Heavy is the multi-agent version. So let's take 263 00:15:04.000 --> 00:15:07.000 a look at how they actually do on those exam problems 264 00:15:07.120 --> 00:15:09.159 and also some real life problems. 265 00:15:09.320 --> 00:15:11.440 Yeah. So we're going to start out here and we're 266 00:15:11.440 --> 00:15:14.200 actually going to look at one of those HLE problems. 267 00:15:14.240 --> 00:15:16.759 This is actually one of the easier math ones. 
I 268 00:15:16.759 --> 00:15:19.200 don't really understand it very well. I'm not that smart, 269 00:15:19.240 --> 00:15:21.679 but I can launch this job here and we can 270 00:15:21.720 --> 00:15:23.879 actually see how it's going to go through and start 271 00:15:23.879 --> 00:15:26.320 to think about this problem. While we're doing that, I 272 00:15:26.320 --> 00:15:28.200 also want to show a little bit more about what 273 00:15:28.240 --> 00:15:30.679 this model can do and launch a Grok 4 Heavy 274 00:15:30.679 --> 00:15:35.159 as well. So everyone knows Polymarket. It's extremely interesting. It 275 00:15:35.200 --> 00:15:38.559 aligns with what reality is most of the time, and 276 00:15:38.720 --> 00:15:41.200 with Grok what we're actually looking at is being able 277 00:15:41.240 --> 00:15:43.759 to see how we can try to take these markets 278 00:15:44.120 --> 00:15:46.600 and see if we can predict the future as well. 279 00:15:47.000 --> 00:15:49.559 So as we're letting this run, we'll see how Grok 4 280 00:15:49.720 --> 00:15:53.200 Heavy goes about predicting the World Series odds for the 281 00:15:53.240 --> 00:15:55.879 current teams. And while we're waiting for these to process, 282 00:15:55.879 --> 00:15:57.200 we're going to pass it over to Eric and he's 283 00:15:57.240 --> 00:15:59.159 going to show you an example of his. 284 00:15:59.600 --> 00:16:02.960 Yeah, so, I guess one of the coolest things about 285 00:16:03.240 --> 00:16:07.360 Grok 4 is its ability to understand the world and 286 00:16:07.440 --> 00:16:11.120 to solve hard problems by leveraging tools, like Tony discussed, 287 00:16:11.399 --> 00:16:13.320 and I think this is one kind of cool example of this. 288 00:16:14.360 --> 00:16:18.279 We asked it to generate a visualization of two black 289 00:16:18.360 --> 00:16:21.840 holes colliding. In my case it's actually pretty clear in its 290 00:16:21.840 --> 00:16:25.159 thinking trace about what these liberties are. 
For example, in 291 00:16:25.240 --> 00:16:27.919 order for it to actually be visible, you need to really exaggerate 292 00:16:28.120 --> 00:16:31.720 the scale of the waves. And yeah, so here's, like, 293 00:16:32.399 --> 00:16:36.639 you know, this kind of in action. It exaggerates the scale 294 00:16:36.679 --> 00:16:39.200 in like multiple ways. It drops off less in terms 295 00:16:39.200 --> 00:16:43.159 of amplitude over distance, but we can see the basic 296 00:16:43.200 --> 00:16:47.120 effects that are actually correct. It starts with the inspiral, 297 00:16:47.399 --> 00:16:50.720 it merges, and then you have the ringdown. This is 298 00:16:51.919 --> 00:16:56.879 basically largely correct, modulo some of the simplifications that it needed 299 00:16:56.919 --> 00:16:59.879 to do. It's actually quite explicit about this. It uses 300 00:17:00.080 --> 00:17:04.880 post-Newtonian approximations instead of actually computing the general relativistic 301 00:17:05.000 --> 00:17:07.359 effects near the center of the black hole, which is 302 00:17:07.400 --> 00:17:10.799 incorrect and, you know, will lead to, you know, some 303 00:17:10.880 --> 00:17:14.480 incorrect results. But the overall, you know, visualization is, yeah, 304 00:17:14.799 --> 00:17:18.480 it's basically there, and you can actually look at the 305 00:17:18.680 --> 00:17:22.839 kinds of resources that it references. So here it actually, 306 00:17:23.119 --> 00:17:26.039 you know, it obviously uses search. It gathers results from 307 00:17:26.039 --> 00:17:29.480 a bunch of links, but also reads through an undergraduate 308 00:17:29.559 --> 00:17:36.119 text on analytic gravitational wave models. It reasons quite a 309 00:17:36.119 --> 00:17:38.839 bit about the actual constants that it should use for 310 00:17:38.920 --> 00:17:43.960 a realistic simulation. It references existing real world data. It's 311 00:17:43.960 --> 00:17:44.839 a pretty good model. 312 00:17:45.160 --> 00:17:45.599 Going forward. 
313 00:17:45.759 --> 00:17:48.160 We can give it the same model that physicists use, 314 00:17:48.720 --> 00:17:51.240 so it can run the same level of compute that 315 00:17:51.599 --> 00:17:54.680 leading physics researchers are using and give you a physics 316 00:17:54.680 --> 00:17:56.559 accurate black hole simulation. 317 00:17:56.839 --> 00:17:58.200 Right now it's just running in your browser. 318 00:17:58.319 --> 00:18:00.000 This is just running in your browser. Pretty simple. 319 00:18:00.440 --> 00:18:02.680 Swapping back real quick here, we can actually take a look. 320 00:18:02.680 --> 00:18:05.559 The math problem is finished. The model was able to... 321 00:18:05.920 --> 00:18:08.519 Let's look at its thinking trace here so you can 322 00:18:08.559 --> 00:18:11.079 see how it went through the problem. I'll be honest 323 00:18:11.079 --> 00:18:13.839 with you guys, I really don't quite fully understand the math. 324 00:18:14.240 --> 00:18:15.720 But what I do know is that I looked at 325 00:18:15.759 --> 00:18:18.240 the answer ahead of time and it did come to 326 00:18:18.319 --> 00:18:20.160 the correct answer here in the final part. We 327 00:18:20.160 --> 00:18:22.240 can also come in and actually take a look here 328 00:18:22.279 --> 00:18:25.480 at our World Series prediction, and it's still thinking through 329 00:18:25.559 --> 00:18:28.079 on this one, but we can actually try some other 330 00:18:27.920 --> 00:18:28.559 stuff as well. 
331 00:18:28.599 --> 00:18:31.599 So we worked very heavily on working with all of 332 00:18:31.599 --> 00:18:34.039 our X tools and building out a really great X 333 00:18:34.079 --> 00:18:37.599 experience, so we can actually ask, you know, the model, 334 00:18:37.799 --> 00:18:39.599 you know, find me the xAI employee that has the 335 00:18:39.599 --> 00:18:42.880 weirdest profile photo, and then we can actually try out, 336 00:18:42.960 --> 00:18:46.200 you know, let's create a timeline based on X posts 337 00:18:46.400 --> 00:18:48.920 detailing the, you know, changes in the scores over time, 338 00:18:49.000 --> 00:18:51.240 and we can see, you know, all the conversation that 339 00:18:51.319 --> 00:18:53.440 was taking place at that time as well, so we 340 00:18:53.480 --> 00:18:55.880 can see who was, you know, announcing scores and 341 00:18:55.960 --> 00:18:58.160 like what were the reactions at those times as well. 342 00:18:58.680 --> 00:19:01.519 If we go back, this was the Greg Yang 343 00:19:01.599 --> 00:19:04.839 photo here. So Greg Yang, of course, who has his 344 00:19:05.000 --> 00:19:08.240 favorite photograph that he has on his account, that's actually 345 00:19:08.279 --> 00:19:09.599 not what he looks like in real life, 346 00:19:09.640 --> 00:19:12.359 by the way, but it had to understand that question. Yeah, 347 00:19:12.400 --> 00:19:13.599 which is, that's the wild part. 348 00:19:13.599 --> 00:19:15.920 It's like, it understands what is a weird photo. 349 00:19:16.920 --> 00:19:17.839 What is a weird photo? 350 00:19:18.000 --> 00:19:18.359 Yeah. 351 00:19:18.480 --> 00:19:20.240 What is a less or more weird photo? 352 00:19:21.160 --> 00:19:23.200 It goes through, it has to find all the team members, 353 00:19:23.240 --> 00:19:25.400 has to figure out who we all are, right, you know. 
354 00:19:25.400 --> 00:19:29.920 It searches without access to the internal xAI personnel logs, literally 355 00:19:29.920 --> 00:19:32.480 looking just at the internet. Exactly, so you 356 00:19:32.519 --> 00:19:34.279 could say, like, the weirdest of any company. 357 00:19:34.440 --> 00:19:37.160 Yeah. And we can also take a look here at 358 00:19:37.200 --> 00:19:40.039 the question here for Humanity's Last Exam. So it 359 00:19:40.119 --> 00:19:43.160 is still researching all of the historical scores, but it 360 00:19:43.200 --> 00:19:45.599 will have that final answer here soon. While it's finishing up, 361 00:19:45.599 --> 00:19:47.359 we can take a look at one of the ones 362 00:19:47.400 --> 00:19:49.799 that we set up here a second ago, and we 363 00:19:49.799 --> 00:19:51.519 can see, like, you know, it finds the date that 364 00:19:51.599 --> 00:19:54.319 Dan Hendrycks had initially announced it. We can go through, 365 00:19:54.400 --> 00:19:56.960 we can see, you know, OpenAI announcing their score 366 00:19:57.119 --> 00:20:00.680 back in February, and we can see, you know, progress 367 00:20:00.720 --> 00:20:03.640 happens with, like, Gemini. We can see, like, Kimi, and 368 00:20:03.680 --> 00:20:06.160 we can also even see, you know, the leaked benchmarks 369 00:20:06.200 --> 00:20:08.319 of what people are saying is, you know, if it's right, 370 00:20:08.359 --> 00:20:09.519 it's going to be pretty impressive. 371 00:20:10.039 --> 00:20:11.599 So pretty cool. 372 00:20:11.920 --> 00:20:13.680 But yeah, it's great. 373 00:20:14.000 --> 00:20:17.400 Yeah, we're going to close the loop around usefulness as well, 374 00:20:17.480 --> 00:20:19.400 so it's like, it's not just book smart, but 375 00:20:19.480 --> 00:20:21.880 actually practically smart. Exactly. 376 00:20:22.440 --> 00:20:24.200 And we can go back to the slides. 377 00:20:23.960 --> 00:20:28.240 Here. Yeah, so we actually evaluate also on the multimodal subset. 
378 00:20:28.440 --> 00:20:30.799 So on the full set, this is the number on 379 00:20:30.839 --> 00:20:34.000 the HLE exam. You can see there's a little 380 00:20:34.039 --> 00:20:37.799 dip in the numbers. This is actually something we're improving on, 381 00:20:38.079 --> 00:20:41.920 which is the multimodal understanding capabilities. But I do believe 382 00:20:42.759 --> 00:20:46.079 in a very short time we'll be able to really improve 383 00:20:46.359 --> 00:20:50.599 and get much higher numbers on 384 00:20:50.680 --> 00:20:51.319 this benchmark. 385 00:20:51.400 --> 00:20:54.279 The biggest weakness of Grok currently is that it's sort 386 00:20:54.279 --> 00:20:58.720 of partially blind. Its image understanding, obviously, and 387 00:20:58.759 --> 00:21:01.880 its image generation need to be a lot better, and 388 00:21:01.960 --> 00:21:04.839 that's actually being trained right now. Grok 4 is based 389 00:21:04.839 --> 00:21:08.240 on version six of our foundation model. We are training 390 00:21:08.359 --> 00:21:11.680 version seven, which will complete in a few weeks. That'll 391 00:21:11.720 --> 00:21:14.519 address the weakness on the vision side. 392 00:21:15.680 --> 00:21:18.200 Just to show off this last one here, so the 393 00:21:18.240 --> 00:21:21.400 prediction market finished here with the Heavy, and we can 394 00:21:21.440 --> 00:21:23.680 see here, we can see all the tools and the 395 00:21:23.680 --> 00:21:26.519 process it used to actually go through and find the 396 00:21:26.599 --> 00:21:29.160 right answer. It browsed a lot of odds sites. It 397 00:21:29.200 --> 00:21:32.240 calculated its own odds, comparing to the market to find 398 00:21:32.279 --> 00:21:34.559 its own alpha and edge. 
It walks you through the 399 00:21:34.680 --> 00:21:38.519 entire process here, and it calculates the odds of the 400 00:21:38.559 --> 00:21:41.599 winner being like the Dodgers, and it gives them a 401 00:21:41.680 --> 00:21:45.599