WEBVTT

1
00:00:01.080 --> 00:00:03.799
<v Speaker 1>How'd you like to listen to .NET Rocks with no ads?

2
00:00:04.440 --> 00:00:04.799
<v Speaker 2>Easy?

3
00:00:05.360 --> 00:00:08.560
<v Speaker 1>Become a patron for just five dollars a month. You

4
00:00:08.599 --> 00:00:11.320
<v Speaker 1>get access to a private RSS feed where all the

5
00:00:11.359 --> 00:00:14.599
<v Speaker 1>shows have no ads. Twenty dollars a month, we'll get

6
00:00:14.599 --> 00:00:18.440
<v Speaker 1>you that and a special .NET Rocks patron mug. Sign

7
00:00:18.519 --> 00:00:34.640
<v Speaker 1>up now at patreon.dotnetrocks.com. Hey,

8
00:00:34.880 --> 00:00:39.039
<v Speaker 1>guess what, it's .NET Rocks episode nineteen forty-four. I'm

9
00:00:39.119 --> 00:00:40.399
<v Speaker 1>Carl Franklin.

10
00:00:39.960 --> 00:00:44.359
<v Speaker 2>And I'm Richard Campbell. Nineteen forty-four. Richard, I'm looking forward

11
00:00:44.359 --> 00:00:47.079
<v Speaker 2>to the end of World War Two. Yeah, it's the

12
00:00:47.079 --> 00:00:50.679
<v Speaker 2>beginning of the end. So nineteen forty four, the Allies

13
00:00:50.719 --> 00:00:55.200
<v Speaker 2>launched D Day, the largest amphibious invasion in history, landing

14
00:00:55.200 --> 00:00:57.920
<v Speaker 2>troops on the beaches of Normandy, France on June sixth,

15
00:00:58.679 --> 00:01:02.039
<v Speaker 2>marking a turning point. In August, the Allied forces liberated

16
00:01:02.119 --> 00:01:08.280
<v Speaker 2>Paris from Nazi occupation. You're welcome. In December, here's an

17
00:01:08.280 --> 00:01:10.879
<v Speaker 2>anecdote to go with D Day if you like. Yeah.

18
00:01:10.959 --> 00:01:14.560
<v Speaker 2>In concern for the soldiers in D Day, they mass

19
00:01:14.599 --> 00:01:17.000
<v Speaker 2>produced penicillin for the very first time. There were two

20
00:01:17.000 --> 00:01:19.439
<v Speaker 2>and a half million doses of penicillin made for the

21
00:01:19.519 --> 00:01:22.439
<v Speaker 2>D Day invasion. That is so awesome. So post World

22
00:01:22.480 --> 00:01:27.400
<v Speaker 2>War two, the reason we have antibiotics was that preparation. Yeah.

23
00:01:27.439 --> 00:01:30.400
<v Speaker 1>In December, the Battle of the Bulge, the Germans launched

24
00:01:30.439 --> 00:01:34.519
<v Speaker 1>a major counter offensive in the Ardennes region of Belgium.

25
00:01:34.519 --> 00:01:37.120
<v Speaker 2>Did I say that right? Arden in the Ardennes? Yeah?

26
00:01:37.200 --> 00:01:38.040
<v Speaker 2>Yep, the Ardennes.

27
00:01:38.120 --> 00:01:42.560
<v Speaker 1>But the Allied forces eventually repelled the attack and in Rome,

28
00:01:42.760 --> 00:01:45.840
<v Speaker 1>three hundred and thirty five Italians were killed in the

29
00:01:46.280 --> 00:01:49.599
<v Speaker 1>Here's another thing I had to pronounce correctly in high school.

30
00:01:50.079 --> 00:01:56.480
<v Speaker 1>Ardeatine. Ar-de-a-tine. Ardeatine, all right?

31
00:01:56.560 --> 00:01:56.640
<v Speaker 3>Right?

32
00:01:56.799 --> 00:01:59.359
<v Speaker 1>Ardeatine. We're going with that. A-R-D-E-A-T-

33
00:01:59.599 --> 00:02:03.079
<v Speaker 1>I-N-E massacre, including seventy-five Jews and over

34
00:02:03.120 --> 00:02:07.280
<v Speaker 1>two hundred members of the Italian resistance from various groups.

35
00:02:07.680 --> 00:02:08.960
<v Speaker 2>So yeah, it's sort.

36
00:02:08.800 --> 00:02:12.000
<v Speaker 1>Of the beginning of the end, the unwinding and leading

37
00:02:12.120 --> 00:02:17.039
<v Speaker 1>up to the following year, nineteen forty five, which ended it.

38
00:02:17.159 --> 00:02:19.919
<v Speaker 2>Right. Yeah. It's also the year that the first plutonium

39
00:02:19.919 --> 00:02:22.759
<v Speaker 2>was ever made at the Hanford site in Washington, which will

40
00:02:22.759 --> 00:02:27.639
<v Speaker 2>eventually lead to the bomb at Nagasaki. Yeah. And the Harvard

41
00:02:27.800 --> 00:02:31.960
<v Speaker 2>Mark I, built by IBM based on a design

42
00:02:32.000 --> 00:02:35.479
<v Speaker 2>from a professor at Harvard: thirty-five hundred relays and a

43
00:02:35.560 --> 00:02:39.560
<v Speaker 2>fifty foot long camshaft because computers were different back then. Yeah,

44
00:02:39.639 --> 00:02:42.800
<v Speaker 2>they were, and famously, because it's a relay-based computer.

45
00:02:42.919 --> 00:02:47.120
<v Speaker 2>The next version of this they call, cleverly, the Mark two. Yeah,

46
00:02:47.159 --> 00:02:50.120
<v Speaker 2>will have a moth get trapped in one of the relays,

47
00:02:50.400 --> 00:02:54.039
<v Speaker 2>which Grace Hopper will find and remove and call the bug,

48
00:02:54.080 --> 00:02:56.080
<v Speaker 2>and that will be the first bug, first bug in

49
00:02:56.080 --> 00:02:58.560
<v Speaker 2>the machine. Yeah. We don't use a lot of relays

50
00:02:58.560 --> 00:02:59.479
<v Speaker 2>in computers anymore.

51
00:02:59.639 --> 00:03:03.439
<v Speaker 1>Yeah. And before we get started with Dr. Burchell, I

52
00:03:03.479 --> 00:03:07.240
<v Speaker 1>wanted to just have you comment on the amazing recovery

53
00:03:07.360 --> 00:03:10.560
<v Speaker 1>of the astronauts and the space station that happened this

54
00:03:10.759 --> 00:03:11.360
<v Speaker 1>past week.

55
00:03:11.599 --> 00:03:13.599
<v Speaker 2>Really not that amazing. It was so perfectly you know,

56
00:03:13.639 --> 00:03:17.159
<v Speaker 2>it was an expected thing. Butch and Suni are both

57
00:03:17.319 --> 00:03:22.840
<v Speaker 2>very experienced astronauts. When there were concerns about Starliner, they

58
00:03:22.960 --> 00:03:25.840
<v Speaker 2>sent up the next crew with only the next crew

59
00:03:25.919 --> 00:03:28.439
<v Speaker 2>on a Crew Dragon with only two passengers, so they

60
00:03:28.439 --> 00:03:30.520
<v Speaker 2>had the two additional seats for them to come back

61
00:03:30.639 --> 00:03:34.879
<v Speaker 2>at any time. Yeah. But since they had two extremely

62
00:03:35.000 --> 00:03:40.080
<v Speaker 2>qualified astronauts already up, why pay to send them back

63
00:03:40.120 --> 00:03:42.280
<v Speaker 2>down when you can put them to work and in fact,

64
00:03:42.280 --> 00:03:45.360
<v Speaker 2>they put Suni in charge of the mission. She took

65
00:03:45.439 --> 00:03:49.000
<v Speaker 2>over as mission commander for the station for the duration.

66
00:03:48.800 --> 00:03:50.759
<v Speaker 1>And she and Butch were happy to stay there. They

67
00:03:50.759 --> 00:03:52.280
<v Speaker 1>were like, no, we don't want to come home.

68
00:03:52.439 --> 00:03:54.400
<v Speaker 2>Come on. Totally. They were never going to get to

69
00:03:54.400 --> 00:03:58.159
<v Speaker 2>fly again. Those are retired astronauts, right, Yeah, so they

70
00:03:58.199 --> 00:03:59.919
<v Speaker 2>got a great gig. Now that's going to take them

71
00:04:00.039 --> 00:04:03.080
<v Speaker 2>more than a year to recover, which is also normal

72
00:04:03.360 --> 00:04:05.039
<v Speaker 2>for a six-month stay, and they had a nine-

73
00:04:05.039 --> 00:04:09.039
<v Speaker 2>month stay. Mark Kelly did a year, and you can

74
00:04:09.080 --> 00:04:11.120
<v Speaker 2>read his book on this, Like, recovery is not a

75
00:04:11.159 --> 00:04:14.120
<v Speaker 2>trivial thing. Yeah, I was watching him being interviewed. You know,

76
00:04:14.120 --> 00:04:17.240
<v Speaker 2>you haven't walked on your feet for nine months, your vestibular

77
00:04:17.279 --> 00:04:20.160
<v Speaker 2>system's messed up, your eyes have been bent out of shape. Like,

78
00:04:20.199 --> 00:04:23.639
<v Speaker 2>it's not a small problem, right to recover from this.

79
00:04:23.759 --> 00:04:27.079
<v Speaker 1>Yeah, I watched them being interviewed on the news when it

80
00:04:27.160 --> 00:04:29.319
<v Speaker 1>was happening. It's just still amazing to see that

81
00:04:29.519 --> 00:04:31.759
<v Speaker 1>Falcon booster land.

82
00:04:31.600 --> 00:04:33.040
<v Speaker 2>Land on its tail perfectly.

83
00:04:33.480 --> 00:04:36.439
<v Speaker 1>Always it always is just going to be amazing to me.

84
00:04:36.639 --> 00:04:39.279
<v Speaker 2>Yeah, no, it's it's a miracle. The crazier thing is

85
00:04:39.519 --> 00:04:43.120
<v Speaker 2>it really is that Starship booster being caught out of

86
00:04:43.160 --> 00:04:47.399
<v Speaker 2>the air. It's literally a twenty story, two hundred ton

87
00:04:47.600 --> 00:04:50.800
<v Speaker 2>building that flies, yeah, and they catch it out of

88
00:04:50.839 --> 00:04:52.959
<v Speaker 2>the air. So yeah, we are in amazing times. So

89
00:04:52.959 --> 00:04:55.800
<v Speaker 2>the space industry has been fundamentally changed by this, right?

90
00:04:56.120 --> 00:04:59.120
<v Speaker 2>The cost of flight is so much lower. It's hard

91
00:04:59.160 --> 00:05:01.160
<v Speaker 2>to even get your head around what's actually going on

92
00:05:01.240 --> 00:05:03.600
<v Speaker 2>up there right now. So it's very cool with the proliferation.

93
00:05:04.240 --> 00:05:06.560
<v Speaker 2>That's a very good experience for me. This week is

94
00:05:06.639 --> 00:05:08.399
<v Speaker 2>very I felt very good about it, all right.

95
00:05:08.480 --> 00:05:10.879
<v Speaker 1>So yeah, so that's a cue for me to roll

96
00:05:10.920 --> 00:05:12.399
<v Speaker 1>the music for Better Know a Framework.

97
00:05:12.480 --> 00:05:21.800
<v Speaker 2>So that's awesome. All right, man, what do you got?

98
00:05:21.800 --> 00:05:25.199
<v Speaker 2>Our good buddy Simon Cropp, the genius. Simon Cropp,

99
00:05:25.319 --> 00:05:28.360
<v Speaker 2>the genius. This guy is just, he's so brilliant. He's

100
00:05:28.360 --> 00:05:31.839
<v Speaker 2>brilliant and he comes up with solutions for things that

101
00:05:31.879 --> 00:05:33.120
<v Speaker 2>you didn't even know you needed. Yeah.

102
00:05:33.240 --> 00:05:36.680
<v Speaker 1>But this one is called Cymbal. It's a NuGet

103
00:05:36.720 --> 00:05:41.120
<v Speaker 1>package and it's an MSBuild task that enables bundling

104
00:05:41.199 --> 00:05:43.879
<v Speaker 1>.NET symbols for references with a deployed app.

105
00:05:44.040 --> 00:05:44.319
<v Speaker 2>Nice.

106
00:05:44.480 --> 00:05:50.040
<v Speaker 1>The goal being to enable line numbers for exceptions in production.

107
00:05:50.519 --> 00:05:52.399
<v Speaker 2>Oh okay, that's interesting.

108
00:05:52.240 --> 00:05:55.079
<v Speaker 1>Yeah, because I guess you don't get that. Yeah, yeah,

109
00:05:55.120 --> 00:05:58.120
<v Speaker 1>and this is this is what it does. So if

110
00:05:58.160 --> 00:06:01.959
<v Speaker 1>you're in production you have an exception and yeah, I

111
00:06:01.959 --> 00:06:06.839
<v Speaker 1>guess you log it, you're gonna see line numbers, all right, Yeah.

112
00:06:06.680 --> 00:06:09.759
<v Speaker 2>That's cool. You got to know he had that problem, right, like, yeah,

113
00:06:09.879 --> 00:06:11.680
<v Speaker 2>this is clearly a guy who built the thing to

114
00:06:11.720 --> 00:06:14.040
<v Speaker 2>fix a thing that he had, and now we all

115
00:06:14.040 --> 00:06:14.639
<v Speaker 2>get to benefit.

116
00:06:14.800 --> 00:06:17.920
<v Speaker 1>Another alternative, I guess is just deploying the debug symbols

117
00:06:17.959 --> 00:06:20.560
<v Speaker 1>with it, and now you're slowing things down in production.

118
00:06:20.600 --> 00:06:23.800
<v Speaker 2>So yeah, it's a lot more weight than just yeah,

119
00:06:24.000 --> 00:06:25.160
<v Speaker 2>you know, use this library.

120
00:06:25.360 --> 00:06:29.519
<v Speaker 1>So thank you, Simon, and SimonCropp/Cymbal on

121
00:06:29.480 --> 00:06:31.360
<v Speaker 2>GitHub continues to be awesome.

122
00:06:31.480 --> 00:06:34.759
<v Speaker 1>C-Y-M-B-A-L. Yeah, the musical thing, the musical thing,

123
00:06:34.800 --> 00:06:35.079
<v Speaker 1>all right?

124
00:06:35.079 --> 00:06:37.040
<v Speaker 2>Who's talking to us, Richard? Grabbed a comment off of

125
00:06:37.040 --> 00:06:38.759
<v Speaker 2>show eighteen thirty-five, the one we did with

126
00:06:38.800 --> 00:06:41.040
<v Speaker 2>our friend Mads Torgersen talking about the next C#,

127
00:06:41.040 --> 00:06:43.399
<v Speaker 2>because we've got a great comment, LLM related. This is

128
00:06:43.439 --> 00:06:46.959
<v Speaker 2>from Murray, who said Mads mentioned making sure language features

129
00:06:47.000 --> 00:06:49.560
<v Speaker 2>work with the tooling, such as ordering in LINQ syntax.

130
00:06:50.000 --> 00:06:52.600
<v Speaker 2>Increasingly, with Copilot and other LLMs, this is part of

131
00:06:52.639 --> 00:06:56.319
<v Speaker 2>the tooling. Yes. True. Obviously this is a year ago

132
00:06:56.360 --> 00:07:00.839
<v Speaker 2>this comment, so, you know, so much change has happened. It's challenging.

133
00:07:01.680 --> 00:07:04.160
<v Speaker 2>So given a piece of code using a new C#

134
00:07:04.319 --> 00:07:06.800
<v Speaker 2>language feature, which is what Mads was talking about,

135
00:07:06.920 --> 00:07:09.399
<v Speaker 2>have you tried asking ChatGPT or Copilot or some

136
00:07:09.560 --> 00:07:13.199
<v Speaker 2>other LLM to describe how that code works? If it

137
00:07:13.240 --> 00:07:16.680
<v Speaker 2>gets it right, does it mean it's intuitive? Is an

138
00:07:16.800 --> 00:07:19.519
<v Speaker 2>LLM's intuition, and at least he put that in quotes,

139
00:07:19.519 --> 00:07:22.319
<v Speaker 2>because there is no intuition in software, a

140
00:07:22.439 --> 00:07:25.639
<v Speaker 2>good approximation for the one that human programmers have, or

141
00:07:25.639 --> 00:07:28.720
<v Speaker 2>a bad approximation? And if programmers are using Copilot,

142
00:07:28.759 --> 00:07:32.639
<v Speaker 2>does it matter about the human's intuition or the LLM's? Let's

143
00:07:32.639 --> 00:07:35.759
<v Speaker 2>complicate this fact with next year's LLM, that would be now,

144
00:07:36.560 --> 00:07:40.600
<v Speaker 2>which will probably be profoundly different. Yes, so, having said

145
00:07:40.600 --> 00:07:42.240
<v Speaker 2>all that, it's probably best to just aim for the

146
00:07:42.319 --> 00:07:45.600
<v Speaker 2>human and let the LLM catch up. Yeah, no intuition

147
00:07:45.720 --> 00:07:48.879
<v Speaker 2>in software. The reality is, of course you would expect

148
00:07:48.920 --> 00:07:51.480
<v Speaker 2>it to not understand a new language feature. There has

149
00:07:51.519 --> 00:07:54.120
<v Speaker 2>to be some time for that language feature to be

150
00:07:54.160 --> 00:07:57.600
<v Speaker 2>documented properly. The good news being as they keep regenerating

151
00:07:57.639 --> 00:08:01.360
<v Speaker 2>these LLMs on a regular basis, and Microsoft builds these

152
00:08:01.399 --> 00:08:06.800
<v Speaker 2>features in public view on GitHub even before it ships.

153
00:08:07.000 --> 00:08:11.720
<v Speaker 2>It's likely in the knowledge base that is the LLM. Yeah, curiously,

154
00:08:12.000 --> 00:08:14.160
<v Speaker 2>you know, in my last trip to Microsoft talking to folks,

155
00:08:14.199 --> 00:08:18.720
<v Speaker 2>to see what they're using, they've been using Claude Sonnet 3.7.

156
00:08:18.879 --> 00:08:22.319
<v Speaker 2>That's their favorite for working in .NET, which isn't

157
00:08:22.319 --> 00:08:27.480
<v Speaker 2>that funny? Fascinating, But you know that's where it's at.

158
00:08:27.759 --> 00:08:30.680
<v Speaker 2>So Murray, you're right, let's focus on the human understanding

159
00:08:30.720 --> 00:08:32.960
<v Speaker 2>the language the most, because the software is only going

160
00:08:33.000 --> 00:08:35.519
<v Speaker 2>to generate what it's got in its model, and it's

161
00:08:35.600 --> 00:08:38.120
<v Speaker 2>up to you to assess it, although admittedly the compiler

162
00:08:38.200 --> 00:08:41.120
<v Speaker 2>has a say also. Yes, and a copy of Music

163
00:08:41.120 --> 00:08:42.679
<v Speaker 2>to Code By is on its way to you. If you'd like

164
00:08:42.679 --> 00:08:44.200
<v Speaker 2>a copy of Music to Code By, write a comment

165
00:08:44.240 --> 00:08:46.679
<v Speaker 2>on the website at dotnetrocks.com or on the Facebooks.

166
00:08:46.679 --> 00:08:48.320
<v Speaker 2>We publish every show there, and if you comment there

167
00:08:48.320 --> 00:08:50.120
<v Speaker 2>and we read it on the show, we'll send you a copy of

168
00:08:50.159 --> 00:08:50.720
<v Speaker 2>Music to Code By.

169
00:08:50.840 --> 00:08:52.399
<v Speaker 1>And if you don't want to wait for that, or

170
00:08:52.480 --> 00:08:55.000
<v Speaker 1>you have other ideas and you just want to buy

171
00:08:55.159 --> 00:08:57.279
<v Speaker 1>Music to Code By, you can go to musictocodeby

172
00:08:57.320 --> 00:09:01.519
<v Speaker 1>.net, and track twenty-two is new-ish, and

173
00:09:01.639 --> 00:09:04.519
<v Speaker 1>you can get the entire collection in MP3, FLAC, or

174
00:09:04.600 --> 00:09:09.480
<v Speaker 1>WAV for a very good deal. It's a very good price,

175
00:09:09.919 --> 00:09:13.879
<v Speaker 1>so happy coding. All right, well, let's bring on Dr. Burchell.

176
00:09:14.080 --> 00:09:18.639
<v Speaker 1>Dr. Jodie Burchell is the developer advocate in data science

177
00:09:18.679 --> 00:09:22.200
<v Speaker 1>at JetBrains and was previously a lead data scientist

178
00:09:22.200 --> 00:09:25.799
<v Speaker 1>at Verve Group Europe. She completed a PhD in clinical

179
00:09:25.840 --> 00:09:30.320
<v Speaker 1>psychology and a postdoc in biostatistics before leaving academia for

180
00:09:30.360 --> 00:09:34.039
<v Speaker 1>a data science career. She has worked for seven years

181
00:09:34.039 --> 00:09:37.600
<v Speaker 1>as a data scientist in both Australia and Germany, developing

182
00:09:37.639 --> 00:09:42.840
<v Speaker 1>a range of products including recommendation systems, analysis platforms, search

183
00:09:42.879 --> 00:09:47.200
<v Speaker 1>engine improvements and audience profiling. She's held a broad range

184
00:09:47.200 --> 00:09:51.159
<v Speaker 1>of responsibilities in her career, doing everything from data analytics

185
00:09:51.159 --> 00:09:55.559
<v Speaker 1>to maintaining machine learning solutions in production. She's a longtime

186
00:09:55.600 --> 00:10:01.159
<v Speaker 1>content creator in data science across conference and user group presentations, books, webinars,

187
00:10:01.200 --> 00:10:04.320
<v Speaker 1>and posts on both her own and JetBrains blogs.

188
00:10:04.639 --> 00:10:06.279
<v Speaker 1>In other words, a slacker.

189
00:10:09.320 --> 00:10:11.200
<v Speaker 2>It occurs to me, Jodie, that you and I hang

190
00:10:11.240 --> 00:10:13.320
<v Speaker 2>out several times a year at various conferences, but I

191
00:10:13.320 --> 00:10:15.320
<v Speaker 2>don't know that Carl's had time with you since we

192
00:10:15.399 --> 00:10:18.080
<v Speaker 2>did that show at Techorama. Techorama was the last time

193
00:10:18.120 --> 00:10:20.320
<v Speaker 2>I saw you, No a couple of years ago.

194
00:10:20.519 --> 00:10:25.240
<v Speaker 3>Yeah, yeah, exactly, So it's been a long time actually, Yeah.

195
00:10:25.360 --> 00:10:27.200
<v Speaker 2>Things have changed. You're at JetBrains now.

196
00:10:27.240 --> 00:10:32.399
<v Speaker 3>I have, certainly I think changed a lot. Yeah, yes, yeah, yeah,

197
00:10:32.440 --> 00:10:34.480
<v Speaker 3>I was at JetBrains when we first met as well,

198
00:10:34.519 --> 00:10:38.399
<v Speaker 3>but I think I had only been there just over

199
00:10:38.440 --> 00:10:41.440
<v Speaker 3>a year and so I was still like, I don't know,

200
00:10:41.559 --> 00:10:43.360
<v Speaker 3>a little bit more shy, I think, a little bit

201
00:10:43.440 --> 00:10:44.279
<v Speaker 3>less opinionated.

202
00:10:45.240 --> 00:10:47.480
<v Speaker 2>You've been hanging around with the troublemakers for a while.

203
00:10:47.320 --> 00:10:49.039
<v Speaker 3>Now, yeah, you talking about you?

204
00:10:49.519 --> 00:10:49.879
<v Speaker 2>Yeah?

205
00:10:49.879 --> 00:10:55.639
<v Speaker 3>Actually, well, and we're going to be hanging out in

206
00:10:55.679 --> 00:10:58.840
<v Speaker 3>my hometown of Melbourne next month.

207
00:10:59.120 --> 00:11:03.039
<v Speaker 2>Yeah, we're excited about that, yeah, NDC. Yes, so, And

208
00:11:03.159 --> 00:11:06.159
<v Speaker 2>of course I've got family in New Zealand, so I've

209
00:11:06.159 --> 00:11:08.039
<v Speaker 2>got to do a little time in Sydney to see

210
00:11:08.039 --> 00:11:09.759
<v Speaker 2>some folks there, and then I'll be in Melbourne for

211
00:11:09.799 --> 00:11:12.159
<v Speaker 2>the show with you, and then a week on the

212
00:11:12.200 --> 00:11:15.559
<v Speaker 2>farm hanging with the cows and the cousins and the

213
00:11:15.600 --> 00:11:20.039
<v Speaker 2>sheep and the sheep, No sheep, the sheep, what sheeps?

214
00:11:20.080 --> 00:11:23.399
<v Speaker 2>The South Island thing? No sheep on the farm. No, no,

215
00:11:23.440 --> 00:11:26.720
<v Speaker 2>it's it's it's a dairy farm. Dairy farm. Yeah. And

216
00:11:26.759 --> 00:11:29.840
<v Speaker 2>by the way, cows are awesome. Sheep are dumb, dumb,

217
00:11:29.879 --> 00:11:35.320
<v Speaker 2>dumb dumb, holy cow dumb. But they're tasty. Like how

218
00:11:35.480 --> 00:11:39.000
<v Speaker 2>Jodie says they're cute. I say they're tasty tasty. Where

219
00:11:39.120 --> 00:11:41.320
<v Speaker 2>my mind is at, that's in the cow. The cows

220
00:11:41.320 --> 00:11:44.039
<v Speaker 2>are smart enough that if they're actually having distress, you know,

221
00:11:44.080 --> 00:11:47.200
<v Speaker 2>in birthing or anything, they will come for help. Wow. Right,

222
00:11:47.360 --> 00:11:50.399
<v Speaker 2>Like they're bright and they and they follow the they

223
00:11:50.399 --> 00:11:52.120
<v Speaker 2>follow the gates of the packs where you want them

224
00:11:52.120 --> 00:11:53.279
<v Speaker 2>to go. But it doesn't mean they don't know how

225
00:11:53.279 --> 00:11:55.039
<v Speaker 2>to open them themselves if they really wanted to. I've

226
00:11:55.039 --> 00:11:57.440
<v Speaker 2>seen them do it. Yeah, damn, they're just playing along.

227
00:11:57.559 --> 00:12:01.000
<v Speaker 2>Cows are great, they really are. And LLMs are great.

228
00:12:01.159 --> 00:12:06.279
<v Speaker 3>Right in the right settings. Yeah, they are great.

229
00:12:06.679 --> 00:12:09.639
<v Speaker 2>Yes, But even that that show we did in twenty three,

230
00:12:09.759 --> 00:12:11.519
<v Speaker 2>you know you were the grown up in the room there,

231
00:12:11.879 --> 00:12:15.120
<v Speaker 2>just, like, listen, there are limits, like that.

232
00:12:15.360 --> 00:12:18.039
<v Speaker 2>We were so hype-ish in twenty-three, not that it's

233
00:12:18.080 --> 00:12:21.200
<v Speaker 2>all calm and rational in twenty-five, but it's so.

234
00:12:21.240 --> 00:12:24.039
<v Speaker 3>Funny actually, because I remember I was this was the

235
00:12:24.080 --> 00:12:26.519
<v Speaker 3>first talk I did on LLMs, so that one at

236
00:12:26.559 --> 00:12:28.799
<v Speaker 3>Techorama actually was the first one I ever did.

237
00:12:29.159 --> 00:12:29.840
<v Speaker 2>No free lunch.

238
00:12:30.000 --> 00:12:32.440
<v Speaker 3>Yeah yeah, yeah yeah, And I was I was actually

239
00:12:32.519 --> 00:12:35.399
<v Speaker 3>really scared of getting up and giving my opinion, like

240
00:12:35.440 --> 00:12:39.480
<v Speaker 3>being a contrarian. Obviously, I'm feeling so vindicated right now.

241
00:12:39.600 --> 00:12:41.960
<v Speaker 2>But it's great, isn't it?

242
00:12:41.960 --> 00:12:45.679
<v Speaker 3>It's great being right, but it's I will say, like

243
00:12:45.799 --> 00:12:48.080
<v Speaker 3>the hype has died slower than I thought it would.

244
00:12:48.120 --> 00:12:50.960
<v Speaker 3>So I think DeepSeek finally has spelled the beginning

245
00:12:50.960 --> 00:12:51.240
<v Speaker 3>of the.

246
00:12:51.279 --> 00:12:55.039
<v Speaker 2>End, but not the end of the business, but the

247
00:12:56.000 --> 00:12:57.080
<v Speaker 2>end of the hype cycle.

248
00:12:57.440 --> 00:12:58.480
<v Speaker 3>The end of the hype cycle.

249
00:12:58.600 --> 00:13:01.200
<v Speaker 2>Okay, I appreciate that the approach.

250
00:13:00.960 --> 00:13:05.519
<v Speaker 3>To how we're going to be I guess, manufacturing these models,

251
00:13:06.240 --> 00:13:10.440
<v Speaker 3>deploying these models, and thinking about these models fundamentally changed

252
00:13:10.440 --> 00:13:13.879
<v Speaker 3>with DeepSeek. So, um, it sort of showed that

253
00:13:14.039 --> 00:13:17.159
<v Speaker 3>this hyperinvestment in data centers, which was kicking off with

254
00:13:17.200 --> 00:13:20.799
<v Speaker 3>the Stargate project in the US. To explain context to

255
00:13:20.799 --> 00:13:21.919
<v Speaker 3>anyone in the audience who doesn't know.

256
00:13:21.919 --> 00:13:24.120
<v Speaker 2>It, five hundred billion dollars.

257
00:13:23.799 --> 00:13:27.600
<v Speaker 3>An intended five hundred billion dollar investment between OpenAI,

258
00:13:27.919 --> 00:13:32.039
<v Speaker 3>the US government, and I think Microsoft was involved so I.

259
00:13:31.960 --> 00:13:33.240
<v Speaker 2>Think Microsoft pulled out of it.

260
00:13:33.240 --> 00:13:37.080
<v Speaker 3>It was Oracle. Okay, gotcha. Yeah, yeah, that

261
00:13:37.440 --> 00:13:38.320
<v Speaker 3>just got announced.

262
00:13:39.440 --> 00:13:41.279
<v Speaker 2>Yeah, there was a little political game here is that

263
00:13:41.320 --> 00:13:43.399
<v Speaker 2>was also run around the town. They sort of announced this, Hey,

264
00:13:44.080 --> 00:13:45.759
<v Speaker 2>you know, I know we had this deal with

265
00:13:45.759 --> 00:13:47.720
<v Speaker 2>OpenAI where everything's going to run on Azure, but we're

266
00:13:47.799 --> 00:13:50.240
<v Speaker 2>ready to let that go. I think it was because

267
00:13:50.279 --> 00:13:52.759
<v Speaker 2>of Stargate that. Yeah, you know, there was sort of

268
00:13:52.799 --> 00:13:55.000
<v Speaker 2>this pressure on Microsoft. You have to keep growing, growing, growing,

269
00:13:55.039 --> 00:13:57.039
<v Speaker 2>and they're like, this is getting irrational. So if you

270
00:13:57.080 --> 00:13:59.159
<v Speaker 2>want to go play with someone else, you knock yourself out.

271
00:13:59.240 --> 00:14:02.039
<v Speaker 2>So yeah, bringing it back to DeepSeek for a minute. From

272
00:14:02.080 --> 00:14:05.440
<v Speaker 2>what I understand, you know, OpenAI and all

273
00:14:05.480 --> 00:14:08.039
<v Speaker 2>these other models are looking at that and learning from

274
00:14:08.080 --> 00:14:10.200
<v Speaker 2>it and figuring out how to make their own models

275
00:14:10.240 --> 00:14:16.600
<v Speaker 2>more efficient. And at one point I heard that the

276
00:14:17.480 --> 00:14:20.360
<v Speaker 2>Chinese model is, you know, hey, let's spend a lot

277
00:14:20.440 --> 00:14:24.240
<v Speaker 2>less money on these things so that they're less expensive.

278
00:14:24.360 --> 00:14:26.200
<v Speaker 2>We don't have to use as many processors and all

279
00:14:26.240 --> 00:14:29.960
<v Speaker 2>that stuff. And I think I heard that, you know,

280
00:14:30.039 --> 00:14:34.519
<v Speaker 2>the response from the American companies was, oh no, we're

281
00:14:34.600 --> 00:14:36.639
<v Speaker 2>just going to make it ten times more one hundred

282
00:14:36.639 --> 00:14:40.759
<v Speaker 2>times more powerful, you know, so a different kind of

283
00:14:40.840 --> 00:14:44.240
<v Speaker 2>mindset whereas but that was originally Now I think that

284
00:14:44.360 --> 00:14:52.879
<v Speaker 2>there's more of a desire to make, to get smaller LLMs, right, yeah,

285
00:14:53.080 --> 00:14:54.120
<v Speaker 2>that are more specialized.

286
00:14:54.360 --> 00:14:59.039
<v Speaker 3>The nuance of the story is that basically we've

287
00:14:59.120 --> 00:15:01.960
<v Speaker 3>known that there are ways to make neural nets more efficient,

288
00:15:02.120 --> 00:15:05.399
<v Speaker 3>right like, there are ways of making the models smaller,

289
00:15:05.799 --> 00:15:09.039
<v Speaker 3>or after you've trained them, actually trimming them down and

290
00:15:10.480 --> 00:15:13.039
<v Speaker 3>getting the same performance or almost the same performance for

291
00:15:13.200 --> 00:15:16.919
<v Speaker 3>much smaller number of parameters. We've also known for quite

292
00:15:16.919 --> 00:15:18.960
<v Speaker 3>a long time, and this is true with any machine

293
00:15:19.039 --> 00:15:22.279
<v Speaker 3>learning model, that the higher the quality of data the

294
00:15:22.879 --> 00:15:24.960
<v Speaker 3>you know, the better the model can perform for much

295
00:15:25.000 --> 00:15:27.440
<v Speaker 3>smaller number of parameters. So this was proven last year

296
00:15:27.480 --> 00:15:29.919
<v Speaker 3>with the Falcon last year or the year before with

297
00:15:30.000 --> 00:15:32.919
<v Speaker 3>the Falcon models, they were sort of the first big

298
00:15:33.000 --> 00:15:35.360
<v Speaker 3>open source ones that were trained on higher quality data

299
00:15:35.440 --> 00:15:38.240
<v Speaker 3>sets and got a lot more performance for less parameters.

300
00:15:38.759 --> 00:15:42.200
<v Speaker 3>But the most reliable way to get better performance was

301
00:15:42.320 --> 00:15:48.759
<v Speaker 3>to scale, and I think what happened. The story I've

302
00:15:48.799 --> 00:15:52.279
<v Speaker 3>heard in China is that they just couldn't get access

303
00:15:52.360 --> 00:15:57.720
<v Speaker 3>to the same size of GPUs because of sanctions. Not

304
00:15:57.840 --> 00:16:02.000
<v Speaker 3>sanctions. Basically, they weren't being sold in China, and so

305
00:16:02.320 --> 00:16:05.120
<v Speaker 3>they had to make do with older and much less

306
00:16:05.200 --> 00:16:07.440
<v Speaker 3>efficient processors, and they had to do all these tricks

307
00:16:07.519 --> 00:16:11.159
<v Speaker 3>to basically share the training across a bunch of smaller machines.

308
00:16:11.799 --> 00:16:16.919
<v Speaker 3>So this meant that they just couldn't create absolutely massive models.

309
00:16:17.679 --> 00:16:20.759
<v Speaker 3>And essentially this meant that, yeah, they were forced to

310
00:16:20.799 --> 00:16:24.240
<v Speaker 3>create a smaller model. But you know, the thing is

311
00:16:24.360 --> 00:16:28.519
<v Speaker 3>is the quality of AI researchers and AI engineers that

312
00:16:28.600 --> 00:16:32.440
<v Speaker 3>are being employed at companies like OpenAI and Anthropic

313
00:16:32.559 --> 00:16:35.480
<v Speaker 3>and companies like this. I'm sure that they knew it

314
00:16:35.600 --> 00:16:38.279
<v Speaker 3>was possible. It was just as I understood it, a

315
00:16:38.360 --> 00:16:42.759
<v Speaker 3>less reliable path to performance. And you know, the American

316
00:16:42.799 --> 00:16:45.519
<v Speaker 3>companies had they had the money and they had the

317
00:16:45.720 --> 00:16:48.679
<v Speaker 3>servers to train it, so why not go big?

318
00:16:49.000 --> 00:16:53.120
<v Speaker 2>And they understand that race right like they understand build bigger,

319
00:16:53.279 --> 00:16:56.200
<v Speaker 2>keep going like it's a very American approach to things. Yes,

320
00:16:56.399 --> 00:16:59.559
<v Speaker 2>you can always tune later, right, do your land grab now, but.

321
00:16:59.600 --> 00:17:03.799
<v Speaker 1>Also that there's a difference between having one huge model

322
00:17:03.919 --> 00:17:08.640
<v Speaker 1>like, you know, ChatGPT that knows everything, has bazillions

323
00:17:08.680 --> 00:17:11.799
<v Speaker 1>of nodes or whatever it is, and then can you know,

324
00:17:12.079 --> 00:17:16.359
<v Speaker 1>can cross-reference things, right, and connect the dots

325
00:17:17.119 --> 00:17:20.519
<v Speaker 1>very much in ways that humans do, but in even

326
00:17:20.640 --> 00:17:26.160
<v Speaker 1>more broadly. Whereas if you have smaller, less expensive models

327
00:17:26.240 --> 00:17:31.480
<v Speaker 1>that are just LLMs that are trained on specific data, right,

328
00:17:32.079 --> 00:17:35.880
<v Speaker 1>you'll probably get more accurate things out of them

329
00:17:36.079 --> 00:17:41.279
<v Speaker 1>for that particular set you know, that particular context maybe

330
00:17:41.880 --> 00:17:45.000
<v Speaker 1>and then be able to have many of those with

331
00:17:45.359 --> 00:17:48.880
<v Speaker 1>that have different expertise, but you won't necessarily be able

332
00:17:48.960 --> 00:17:51.160
<v Speaker 1>to it won't necessarily be able to connect the dots

333
00:17:51.319 --> 00:17:53.799
<v Speaker 1>like a large, huge model can.

334
00:17:53.960 --> 00:17:56.640
<v Speaker 3>Right. This can actually lead into a further discussion about

335
00:17:56.720 --> 00:18:01.039
<v Speaker 3>measurement if we want. But basically, looking at the current

336
00:18:01.079 --> 00:18:05.400
<v Speaker 3>benchmarks that they're using to assess performance of LLMs, DeepSeek

337
00:18:05.559 --> 00:18:08.880
<v Speaker 3>and smaller models coming out of China are actually rivaling

338
00:18:09.000 --> 00:18:12.880
<v Speaker 3>the performance of larger models. So basically the understanding seems

339
00:18:12.920 --> 00:18:15.160
<v Speaker 3>to be is that a lot of the parameters that

340
00:18:15.279 --> 00:18:19.680
<v Speaker 3>these big models have are not actually being used every

341
00:18:19.799 --> 00:18:23.720
<v Speaker 3>single time you try to do like inference for a

342
00:18:23.799 --> 00:18:26.559
<v Speaker 3>particular task. It's only a subset of the parameters. So

343
00:18:26.880 --> 00:18:29.559
<v Speaker 3>the way to think about parameters is think about neural

344
00:18:29.640 --> 00:18:32.400
<v Speaker 3>nets as like you have inputs and then you have

345
00:18:32.480 --> 00:18:35.880
<v Speaker 3>a bunch of neurons that are connected by what are

346
00:18:35.920 --> 00:18:39.079
<v Speaker 3>called weights. They're basically multipliers, and you can kind of

347
00:18:39.160 --> 00:18:43.400
<v Speaker 3>think about inference as a path that you take through

348
00:18:43.559 --> 00:18:46.519
<v Speaker 3>the neural net, where like, you know, the whole thing's

349
00:18:46.559 --> 00:18:49.640
<v Speaker 3>going to be used, but only certain weights will actually

350
00:18:49.839 --> 00:18:54.480
<v Speaker 3>have an impact for particular types of tasks. And it

351
00:18:54.559 --> 00:18:57.480
<v Speaker 3>sort of seems that what's happened with scaling down these

352
00:18:57.559 --> 00:19:01.039
<v Speaker 3>models is that because they learned on so much data,

353
00:19:01.359 --> 00:19:03.119
<v Speaker 3>and so much of the data seems to have not

354
00:19:03.240 --> 00:19:06.960
<v Speaker 3>been high quality, that they really, like a lot of

355
00:19:07.079 --> 00:19:10.799
<v Speaker 3>the parameters were not really being used in the majority

356
00:19:10.880 --> 00:19:13.519
<v Speaker 3>of cases, they were just I see dead weight.

357
00:19:14.359 --> 00:19:17.319
<v Speaker 1>And so so if you wanted to translate parameters and

358
00:19:17.400 --> 00:19:21.119
<v Speaker 1>neurons to language, we're talking about the probability of the

359
00:19:21.240 --> 00:19:24.720
<v Speaker 1>next word exactly right that it spits out. Yeah, and

360
00:19:24.799 --> 00:19:29.759
<v Speaker 1>what you're saying is that they're only choosing from parameters

361
00:19:29.839 --> 00:19:30.640
<v Speaker 1>with higher weights.

362
00:19:31.240 --> 00:19:34.319
<v Speaker 3>Yeah, it's it's like or words.

363
00:19:34.079 --> 00:19:34.799
<v Speaker 2>With higher weights.

364
00:19:35.279 --> 00:19:37.480
<v Speaker 3>Yeah. So basically the way it works is, like you

365
00:19:37.559 --> 00:19:40.799
<v Speaker 3>think about the last layer of the neural net is

366
00:19:40.920 --> 00:19:44.440
<v Speaker 3>basically like all the words in the vocabulary. So it's

367
00:19:44.440 --> 00:19:48.279
<v Speaker 3>obviously really really huge, and so the whole neural net

368
00:19:48.440 --> 00:19:51.960
<v Speaker 3>is trying to predict the probability of which of

369
00:19:52.079 --> 00:19:54.880
<v Speaker 3>these words is the most likely to come next. So

370
00:19:55.240 --> 00:19:58.400
<v Speaker 3>it's basically saying that for a particular input, only a

371
00:19:58.480 --> 00:20:01.480
<v Speaker 3>subset of that, you know, the paths that go through

372
00:20:01.519 --> 00:20:03.720
<v Speaker 3>the neural net are actually going to give good information

373
00:20:04.359 --> 00:20:08.240
<v Speaker 3>about what the next word is. And so yeah, it's

374
00:20:09.440 --> 00:20:13.279
<v Speaker 3>it's also like it's kind of fascinating because the models

375
00:20:13.319 --> 00:20:17.039
<v Speaker 3>are such black boxes. No one fully understands how the

376
00:20:17.519 --> 00:20:20.599
<v Speaker 3>decisions are being made. I'm putting decisions in air quotes.

377
00:20:20.640 --> 00:20:25.200
<v Speaker 3>I want to make this clear because interpretability is hot,

378
00:20:25.279 --> 00:20:28.000
<v Speaker 3>but this is actually interpretability is becoming a really hot

379
00:20:28.079 --> 00:20:31.279
<v Speaker 3>area in twenty twenty five. So actually understanding how LLMs

380
00:20:31.319 --> 00:20:34.000
<v Speaker 3>come to the conclusions they come to, or sorry, how

381
00:20:34.079 --> 00:20:36.720
<v Speaker 3>the predictions being made. Let's put it in more clinical terms,

382
00:20:37.200 --> 00:20:40.200
<v Speaker 3>and that's going to help firstly make the models more efficient,

383
00:20:40.279 --> 00:20:43.119
<v Speaker 3>but also demystify a lot of the assumptions we make

384
00:20:43.319 --> 00:20:46.480
<v Speaker 3>about the predictions they make. Like we look at the prediction,

385
00:20:46.599 --> 00:20:49.519
<v Speaker 3>we're like, oh, it's solving problems because if a person

386
00:20:49.599 --> 00:20:52.160
<v Speaker 3>did that, it would be showing problem solving. Or the

387
00:20:52.240 --> 00:20:54.839
<v Speaker 3>model's more intelligent because if a person did that, it

388
00:20:54.880 --> 00:20:58.000
<v Speaker 3>would be showing more intelligence, but that's just us projecting.

389
00:20:58.119 --> 00:21:01.839
<v Speaker 2>Sure, yeah, anthropomorphization. Now, you know, maybe

390
00:21:01.839 --> 00:21:03.880
<v Speaker 2>I'm thinking about this the wrong way, but you know,

391
00:21:03.960 --> 00:21:05.720
<v Speaker 2>as soon as you say that, I'm like, hey, there's like,

392
00:21:05.839 --> 00:21:08.559
<v Speaker 2>what six hundred thousand words in the Oxford Dictionary that's

393
00:21:08.599 --> 00:21:11.039
<v Speaker 2>just English and most people use fifteen hundred of them.

394
00:21:11.920 --> 00:21:14.599
<v Speaker 2>So oh yeah, yeah, yeah. You know here you've built

395
00:21:14.640 --> 00:21:18.279
<v Speaker 2>this model that has this huge potential range of comprehension

396
00:21:18.720 --> 00:21:21.759
<v Speaker 2>and you're using a tiny subset of it depending on

397
00:21:21.839 --> 00:21:24.400
<v Speaker 2>what you're doing. Especially when we're coming at this from

398
00:21:24.440 --> 00:21:27.200
<v Speaker 2>the Copilot point of view, it's like, I'm working on code.

399
00:21:27.440 --> 00:21:31.200
<v Speaker 1>Yeah, every symbol in the language is a is a

400
00:21:31.279 --> 00:21:32.440
<v Speaker 1>word essentially right.

401
00:21:33.839 --> 00:21:37.920
<v Speaker 2>So, but you also talked about performance, and my immediate

402
00:21:38.160 --> 00:21:41.000
<v Speaker 2>reaction was, so, what do we mean when we say performance?

403
00:21:42.400 --> 00:21:42.559
<v Speaker 3>Yes?

404
00:21:42.880 --> 00:21:45.440
<v Speaker 2>Is that speed? Is that a speed measurement or is

405
00:21:45.519 --> 00:21:47.079
<v Speaker 2>that an accuracy measurement?

406
00:21:47.440 --> 00:21:51.200
<v Speaker 3>Yeah? So to kind of put this in context, I

407
00:21:51.240 --> 00:21:55.759
<v Speaker 3>gave a keynote at NDC Porto about all the hairy

408
00:21:55.880 --> 00:21:59.200
<v Speaker 3>things that go along with assessing LLMs. So I didn't

409
00:21:59.200 --> 00:22:02.480
<v Speaker 3>get into speed. We can come back to that if

410
00:22:02.519 --> 00:22:05.039
<v Speaker 3>we get time. But it's more about like how do

411
00:22:05.119 --> 00:22:08.920
<v Speaker 3>people judge if these models are good? And last time

412
00:22:09.000 --> 00:22:11.480
<v Speaker 3>we talked and you gave the episode this name, we

413
00:22:11.559 --> 00:22:13.759
<v Speaker 3>talked about the concept of there's no free lunch in

414
00:22:13.880 --> 00:22:17.880
<v Speaker 3>machine learning, and what this means is there is no

415
00:22:19.000 --> 00:22:22.599
<v Speaker 3>there's no one model that will be best for every

416
00:22:22.759 --> 00:22:24.359
<v Speaker 3>possible task you can do.

417
00:22:24.960 --> 00:22:25.119
<v Speaker 2>Right.

418
00:22:25.359 --> 00:22:27.839
<v Speaker 3>But what we've seen with the way people talk about

419
00:22:28.000 --> 00:22:32.680
<v Speaker 3>LLMs is they're advertised exactly like this. Like, it's like, oh,

420
00:22:33.519 --> 00:22:35.960
<v Speaker 3>OpenAI just came out with the o1 model, and

421
00:22:36.079 --> 00:22:39.359
<v Speaker 3>it is the best model on the market, right, right,

422
00:22:39.640 --> 00:22:43.519
<v Speaker 3>And even if we're not, let's put like engineering considerations aside,

423
00:22:43.559 --> 00:22:45.519
<v Speaker 3>let's talk about like, let's put cost aside, let's put

424
00:22:45.559 --> 00:22:47.519
<v Speaker 3>speed aside. That's still not going to be true.

425
00:22:47.839 --> 00:22:49.920
<v Speaker 1>It's like who's the best guitar player in the world?

426
00:22:51.000 --> 00:22:52.799
<v Speaker 3>Yes, how do you measure this?

427
00:22:53.079 --> 00:22:55.880
<v Speaker 2>That's an impossible question to answer. Well, I think when they

428
00:22:55.920 --> 00:22:57.720
<v Speaker 2>were saying best that time, we were talking the largest

429
00:22:57.799 --> 00:22:59.559
<v Speaker 2>number of parameters, weren't they.

430
00:23:00.400 --> 00:23:03.759
<v Speaker 3>Well, what they're talking about is there's this suite of

431
00:23:04.240 --> 00:23:09.319
<v Speaker 3>benchmarks that are designed to assess LLM performance. And we

432
00:23:09.440 --> 00:23:12.480
<v Speaker 3>talked about this last time. But LLMs were originally designed

433
00:23:12.559 --> 00:23:17.680
<v Speaker 3>to be natural language processing task generalists. So they're good

434
00:23:17.720 --> 00:23:20.599
<v Speaker 3>at doing a range of natural language tasks, often without

435
00:23:20.720 --> 00:23:22.480
<v Speaker 3>further training out of the box, so they can do

436
00:23:22.559 --> 00:23:29.680
<v Speaker 3>things like classification, summarization, they can do translation, things like this.

437
00:23:30.680 --> 00:23:35.920
<v Speaker 3>So generally, when these models were first designed, they were

438
00:23:36.200 --> 00:23:39.319
<v Speaker 3>benchmarked against how well they could do these natural language tasks,

439
00:23:39.359 --> 00:23:42.640
<v Speaker 3>like specific things like question and answering, translation, blah blah blah.

440
00:23:43.759 --> 00:23:48.279
<v Speaker 3>But as the capabilities of the models have grown,

441
00:23:48.559 --> 00:23:51.759
<v Speaker 3>or maybe they seem to have grown, we don't know.

442
00:23:52.680 --> 00:23:55.079
<v Speaker 3>What we started doing is getting them to do things

443
00:23:55.319 --> 00:23:58.920
<v Speaker 3>like grade school math problems, or we've gotten them to

444
00:23:59.000 --> 00:24:02.920
<v Speaker 3>do suites of questions that are designed to assess problem

445
00:24:03.000 --> 00:24:06.680
<v Speaker 3>solving or blah blah blah. And then what we do

446
00:24:07.119 --> 00:24:10.240
<v Speaker 3>is we collate a bunch of these gold standard measures

447
00:24:10.279 --> 00:24:12.880
<v Speaker 3>together and we combine them in such a way, and

448
00:24:12.960 --> 00:24:15.480
<v Speaker 3>we create leaderboards and we rank these models and

449
00:24:15.559 --> 00:24:18.160
<v Speaker 3>we say, oh, okay, this model is the best because

450
00:24:18.160 --> 00:24:20.640
<v Speaker 3>it did the best at the MMLU, which is like

451
00:24:20.759 --> 00:24:25.680
<v Speaker 3>a reasoning benchmark, or this one's the best because it

452
00:24:25.799 --> 00:24:28.880
<v Speaker 3>did the best at like a collated collection of all

453
00:24:28.960 --> 00:24:32.240
<v Speaker 3>of these benchmarks. So it's doing well on reasoning, and

454
00:24:32.359 --> 00:24:34.680
<v Speaker 3>it's doing well on problem solving, and it's doing well

455
00:24:34.720 --> 00:24:38.440
<v Speaker 3>on math, and it's doing well on coding. But this

456
00:24:38.640 --> 00:24:42.519
<v Speaker 3>is the thing, like, firstly, a lot of these measures

457
00:24:43.000 --> 00:24:46.839
<v Speaker 3>have been found to have serious problems. They've been

458
00:24:46.880 --> 00:24:49.279
<v Speaker 3>found to really not measure what they claim

459
00:24:49.319 --> 00:24:53.799
<v Speaker 3>to measure in a variety of ways. And the second is, Okay,

460
00:24:54.079 --> 00:24:58.200
<v Speaker 3>I am an application developer. I want to design an

461
00:24:58.240 --> 00:25:00.960
<v Speaker 3>application that uses an LLM. Say I want to make

462
00:25:01.359 --> 00:25:06.240
<v Speaker 3>a chatbot that can help people plan their holidays. What

463
00:25:06.440 --> 00:25:09.599
<v Speaker 3>does it matter to me that an LLM is

464
00:25:09.799 --> 00:25:15.160
<v Speaker 3>really good at solving science problems, grade school math problems,

465
00:25:16.079 --> 00:25:18.480
<v Speaker 3>Like is that going to be good for my application?

466
00:25:18.960 --> 00:25:22.720
<v Speaker 2>So, got a calculator do you have, like, okay, gotta

467
00:25:22.799 --> 00:25:24.119
<v Speaker 2>coverage and.

468
00:25:24.160 --> 00:25:26.960
<v Speaker 3>It's probably going to do the math wrong anyway because

469
00:25:27.200 --> 00:25:33.119
<v Speaker 3>they're not symbolically simulating exactly they do that, But.

470
00:25:33.160 --> 00:25:36.240
<v Speaker 2>Then you also have it return the response in the

471
00:25:36.319 --> 00:25:37.119
<v Speaker 2>form of a limerick.

472
00:25:39.240 --> 00:25:41.240
<v Speaker 3>That's fantastic. It's what our customers needed.

473
00:25:41.680 --> 00:25:41.960
<v Speaker 2>That's it.

474
00:25:44.680 --> 00:25:47.039
<v Speaker 3>So yeah, this is part of the problem. The way

475
00:25:47.079 --> 00:25:49.680
<v Speaker 3>we talk about the way we talk about LLMs

476
00:25:50.200 --> 00:25:53.000
<v Speaker 3>is we talk about them like they are a thing

477
00:25:53.279 --> 00:25:57.519
<v Speaker 3>independent of machine learning, but they are absolutely not. And

478
00:25:58.640 --> 00:26:00.279
<v Speaker 3>part of the problem with that is it means that

479
00:26:00.319 --> 00:26:02.000
<v Speaker 3>the way that we use them is we tend to

480
00:26:02.079 --> 00:26:06.839
<v Speaker 3>trust their outputs too much, and we also tend to

481
00:26:08.000 --> 00:26:11.759
<v Speaker 3>you know, not have scrutiny about like whether a model

482
00:26:11.839 --> 00:26:14.519
<v Speaker 3>is the best fit for our use case, so we

483
00:26:14.599 --> 00:26:19.000
<v Speaker 3>don't design assessments to see like is this actually doing

484
00:26:19.079 --> 00:26:21.799
<v Speaker 3>what it's supposed to do, which we would absolutely do

485
00:26:21.920 --> 00:26:23.160
<v Speaker 3>with traditional machine learning.

486
00:26:23.359 --> 00:26:27.960
<v Speaker 1>I have had the experience of using LLMs in, you know,

487
00:26:28.079 --> 00:26:32.640
<v Speaker 1>both ChatGPT and Copilot to help with coding things,

488
00:26:32.880 --> 00:26:36.799
<v Speaker 1>and I found a situation where I asked it to

489
00:26:37.559 --> 00:26:41.359
<v Speaker 1>do something, you know, to write something, and instead of

490
00:26:41.880 --> 00:26:44.640
<v Speaker 1>pointing to something in the framework that already did that

491
00:26:45.559 --> 00:26:47.920
<v Speaker 1>and saying, why don't you just use this, it just

492
00:26:48.000 --> 00:26:52.359
<v Speaker 1>went ahead creating the thing, you know, reinventing the wheel.

493
00:26:53.240 --> 00:26:55.640
<v Speaker 1>And then you know, an hour later, I've got something

494
00:26:55.680 --> 00:26:58.160
<v Speaker 1>that works. But I'm like, hey, there's something in the

495
00:26:58.200 --> 00:26:59.519
<v Speaker 1>framework that works just like this.

496
00:27:00.319 --> 00:27:00.519
<v Speaker 2>Yeah.

497
00:27:02.039 --> 00:27:05.119
<v Speaker 1>So it's that's why you need a human in the

498
00:27:05.319 --> 00:27:06.000
<v Speaker 1>in the equation.

499
00:27:06.440 --> 00:27:08.920
<v Speaker 2>Although, was there a prompt there to say, is

500
00:27:09.039 --> 00:27:10.920
<v Speaker 2>there a class that does X?

501
00:27:11.279 --> 00:27:13.880
<v Speaker 1>Well, that would have been yeah, that's the human error

502
00:27:13.960 --> 00:27:16.880
<v Speaker 1>that because that should be the first question. It's like

503
00:27:16.960 --> 00:27:19.079
<v Speaker 1>when somebody says, you know, I have an idea for

504
00:27:19.160 --> 00:27:22.559
<v Speaker 1>an app, and my first question is, well, first of all,

505
00:27:23.119 --> 00:27:24.720
<v Speaker 1>I don't I'm not going to write it for you

506
00:27:24.839 --> 00:27:25.599
<v Speaker 1>unless you pay

507
00:27:25.559 --> 00:27:28.640
<v Speaker 2>me. And second of all, does it already exist? And

508
00:27:28.799 --> 00:27:31.400
<v Speaker 2>the answer is usually yeah, if it's really that good

509
00:27:31.440 --> 00:27:36.039
<v Speaker 2>of an idea, somebody else's somebody else has done it. Okay, well,

510
00:27:36.039 --> 00:27:37.480
<v Speaker 2>why don't we take a break now. I want to

511
00:27:37.519 --> 00:27:39.720
<v Speaker 2>dig into some of these evaluation strategies.

512
00:27:39.960 --> 00:27:43.119
<v Speaker 1>All right, we'll be right back after these very important messages.

513
00:27:43.119 --> 00:27:46.200
<v Speaker 1>Stay tuned. Do you have a complex dot net monolith

514
00:27:46.240 --> 00:27:50.119
<v Speaker 1>you'd like to refactor to a microservices architecture? The Microservice

515
00:27:50.240 --> 00:27:53.920
<v Speaker 1>Extractor for dot Net tool visualizes your app and

516
00:27:54.119 --> 00:27:58.079
<v Speaker 1>helps progressively extract code into microservices. Learn more at

517
00:27:58.160 --> 00:28:02.519
<v Speaker 1>aws dot Amazon dot com, slash modernize.

518
00:28:05.279 --> 00:28:07.200
<v Speaker 2>And we're back. It's dot NetRocks. I'm Richard Campbell, that's

519
00:28:07.240 --> 00:28:12.319
<v Speaker 2>Carl Franklin, talking to Dr. Jodie Burchell. Hi. And if

520
00:28:12.759 --> 00:28:16.519
<v Speaker 2>you don't enjoy those those ads and you'd like an alternative,

521
00:28:16.599 --> 00:28:19.279
<v Speaker 2>we do have a Patreon that provides an ad free feed.

522
00:28:19.880 --> 00:28:22.240
<v Speaker 2>Let's go to patreon dot com. Check it out Patreon

523
00:28:22.279 --> 00:28:25.200
<v Speaker 2>dot dot NetRocks dot com. Yeah, so I found the

524
00:28:25.279 --> 00:28:28.680
<v Speaker 2>DeepEval site that talks about MMLU. But near as I

525
00:28:28.720 --> 00:28:31.039
<v Speaker 2>can tell this is just a set of questions in

526
00:28:31.119 --> 00:28:32.279
<v Speaker 2>different topic areas.

527
00:28:32.920 --> 00:28:41.799
<v Speaker 3>Yes, yes, so so let's talk about benchmarks. So there

528
00:28:42.160 --> 00:28:47.240
<v Speaker 3>is a very famous leaderboard called the Hugging Face

529
00:28:48.400 --> 00:28:52.680
<v Speaker 3>Open LLM Leaderboard, something like that. Okay, so Hugging

530
00:28:52.759 --> 00:28:56.799
<v Speaker 3>Face is a company. They're based in France and basically

531
00:28:57.119 --> 00:29:00.480
<v Speaker 3>what they do in their open source branch is provide

532
00:29:00.599 --> 00:29:03.720
<v Speaker 3>access to all of the major open source what are

533
00:29:03.720 --> 00:29:07.920
<v Speaker 3>called foundational models, so big LLMs that are open

534
00:29:08.000 --> 00:29:14.960
<v Speaker 3>source, computer vision models, those that can generate audio, you know,

535
00:29:15.079 --> 00:29:19.119
<v Speaker 3>do transcription, all these sort of things. And so Hugging

536
00:29:19.200 --> 00:29:23.160
<v Speaker 3>Face take the open source models. They run these models

537
00:29:23.200 --> 00:29:25.440
<v Speaker 3>against a suite of benchmarks and then they collate them.

538
00:29:26.480 --> 00:29:29.880
<v Speaker 3>And they used to have a used to have a

539
00:29:30.000 --> 00:29:33.640
<v Speaker 3>leaderboard up until June last year. This was the first

540
00:29:33.759 --> 00:29:40.039
<v Speaker 3>version and it included scales like HellaSwag and the MMLU.

541
00:29:41.359 --> 00:29:43.480
<v Speaker 3>So it got retired for a couple of reasons. But

542
00:29:43.799 --> 00:29:46.319
<v Speaker 3>one of the reasons it got retired is people started

543
00:29:46.359 --> 00:29:51.359
<v Speaker 3>going through the questions and MMLU was bad. It had

544
00:29:51.400 --> 00:29:55.079
<v Speaker 3>a few questions that literally were like I think one

545
00:29:55.119 --> 00:29:59.400
<v Speaker 3>of them was something like the continuity of the theory.

546
00:29:59.640 --> 00:30:02.039
<v Speaker 3>That's that's the full question. And then it was a

547
00:30:02.119 --> 00:30:04.880
<v Speaker 3>bunch of multiple choice answers that were just lists of

548
00:30:05.000 --> 00:30:09.240
<v Speaker 3>numbers like that was the question, and think about, Wow,

549
00:30:09.400 --> 00:30:11.799
<v Speaker 3>the gold standard is a human, so a human's meant to

550
00:30:11.799 --> 00:30:13.799
<v Speaker 3>be able to answer this, And then you rank how

551
00:30:13.880 --> 00:30:16.039
<v Speaker 3>well the LLM goes, and I'm

552
00:30:15.960 --> 00:30:17.559
<v Speaker 2>Like, nobody can answer that.

553
00:30:17.720 --> 00:30:18.440
<v Speaker 3>What does this mean?

554
00:30:18.920 --> 00:30:19.759
<v Speaker 2>What does it even mean?

555
00:30:20.200 --> 00:30:23.039
<v Speaker 3>What does this mean? But my favorite, my favorite, my

556
00:30:23.079 --> 00:30:25.759
<v Speaker 3>favorite was HellaSwag. So apparently HellaSwag, I think,

557
00:30:25.920 --> 00:30:28.720
<v Speaker 3>was made using Mechanical Turk, so they got people to

558
00:30:28.799 --> 00:30:32.759
<v Speaker 3>generate the questions and then validate them. But clearly whoever

559
00:30:32.839 --> 00:30:36.119
<v Speaker 3>picked up this task was like not particularly invested, Like,

560
00:30:36.559 --> 00:30:38.200
<v Speaker 3>you know, they're not getting paid a lot of money,

561
00:30:38.400 --> 00:30:42.119
<v Speaker 3>they probably didn't care, right, And I have actually an

562
00:30:42.319 --> 00:30:48.440
<v Speaker 3>article with some of my favorite absolutely bizarre HellaSwag questions. Okay,

563
00:30:48.680 --> 00:30:51.839
<v Speaker 3>now keep in mind I am reading this out as

564
00:30:51.920 --> 00:30:54.880
<v Speaker 3>it's written. Okay, So we have a question, and we

565
00:30:55.000 --> 00:30:57.799
<v Speaker 3>have a bunch of multiple choice answers, and what the

566
00:30:57.960 --> 00:31:01.720
<v Speaker 3>LLM is supposed to do is complete the scenario. So

567
00:31:01.799 --> 00:31:05.119
<v Speaker 3>it's meant to pick the option that has the most

568
00:31:05.640 --> 00:31:10.559
<v Speaker 3>you know, fitting scenario end. Okay, so I've got one

569
00:31:10.599 --> 00:31:15.640
<v Speaker 3>for you. Man is in roofed gym weightlifting. Woman is

570
00:31:15.720 --> 00:31:19.279
<v Speaker 3>walking behind the man watching the man. Woman is: A,

571
00:31:20.200 --> 00:31:24.559
<v Speaker 3>tightening balls on stand on front of weight bar. B,

572
00:31:25.240 --> 00:31:29.279
<v Speaker 3>lifting eights while he man sits to watch her. C,

573
00:31:29.920 --> 00:31:34.160
<v Speaker 3>doing mediocrity spinning on the floor. D, lift the weight

574
00:31:34.319 --> 00:31:34.799
<v Speaker 3>lift man.

575
00:31:37.119 --> 00:31:38.279
<v Speaker 2>That doesn't make any sense.

576
00:31:39.119 --> 00:31:42.240
<v Speaker 3>It doesn't and probably around a third of the questions

577
00:31:42.279 --> 00:31:44.440
<v Speaker 3>in HellaSwag were this garbage.

578
00:31:44.480 --> 00:31:47.279
<v Speaker 1>I just want to know what mediocrity spins are. I

579
00:31:47.759 --> 00:31:50.680
<v Speaker 1>want to do that, and I just don't know because

580
00:31:50.680 --> 00:31:51.119
<v Speaker 1>I don't.

581
00:31:51.000 --> 00:31:51.640
<v Speaker 2>Know the definition.

582
00:31:52.279 --> 00:31:54.359
<v Speaker 3>That's every time I turn around and knock something off

583
00:31:54.400 --> 00:32:00.880
<v Speaker 3>a shelf with my clumsy hair. That was twenty twenty.

584
00:32:01.000 --> 00:32:02.920
<v Speaker 3>That was mediocol for the child.

585
00:32:04.480 --> 00:32:07.839
<v Speaker 2>Yeah, there's been some times. So I mean this just

586
00:32:07.920 --> 00:32:11.359
<v Speaker 2>seems lazy then, like the well, let's back up. Is

587
00:32:11.519 --> 00:32:14.519
<v Speaker 2>asking questions of an LLM actually a good way to

588
00:32:14.599 --> 00:32:15.799
<v Speaker 2>measure its effectiveness?

589
00:32:16.200 --> 00:32:21.799
<v Speaker 3>Now? Yes and no. So you can create well defined

590
00:32:22.079 --> 00:32:24.599
<v Speaker 3>problem suites if you have a good idea of what

591
00:32:24.720 --> 00:32:27.480
<v Speaker 3>you're assessing. So this is this is basic measurement theory, right,

592
00:32:27.480 --> 00:32:32.880
<v Speaker 3>It's like we learned this in psychology. It's tricky with

593
00:32:33.119 --> 00:32:37.839
<v Speaker 3>LLMs because we have a tendency to extrapolate too much.

594
00:32:38.279 --> 00:32:42.039
<v Speaker 3>We try to project what their performance would mean if

595
00:32:42.079 --> 00:32:44.400
<v Speaker 3>a human did that, and we can't do it because

596
00:32:44.599 --> 00:32:49.079
<v Speaker 3>LLMs do not have what's called fluid intelligence or general intelligence, right.

597
00:32:49.200 --> 00:32:52.839
<v Speaker 3>They have what you could essentially call crystallized intelligence, which

598
00:32:52.880 --> 00:32:55.440
<v Speaker 3>is that they have a bunch of little templates of

599
00:32:56.200 --> 00:33:00.160
<v Speaker 3>how things work based on scenarios they've seen before. They

600
00:33:00.160 --> 00:33:03.960
<v Speaker 3>can pattern match questions they see against this, So you've

601
00:33:03.960 --> 00:33:06.839
<v Speaker 3>got to be really careful about how far you deviate

602
00:33:06.920 --> 00:33:10.680
<v Speaker 3>from they're doing pattern matching to they're showing intelligence, right,

603
00:33:11.400 --> 00:33:13.880
<v Speaker 3>But it is possible. Let's say you want to assess

604
00:33:14.039 --> 00:33:18.160
<v Speaker 3>how well they do specific tasks, like they can answer

605
00:33:18.279 --> 00:33:21.240
<v Speaker 3>questions about history or whatever. That's fine. I think that's

606
00:33:21.279 --> 00:33:26.640
<v Speaker 3>fine to assess. It gets tricky because there are two

607
00:33:27.000 --> 00:33:31.079
<v Speaker 3>main problems with using questions other than the one I've

608
00:33:31.119 --> 00:33:34.200
<v Speaker 3>just said. The first is is that the answer type

609
00:33:34.359 --> 00:33:39.680
<v Speaker 3>that an LLM is presented with actually impacts their performance.

610
00:33:39.920 --> 00:33:44.599
<v Speaker 3>So most of these measurements use multiple choice questions, and

611
00:33:44.640 --> 00:33:46.720
<v Speaker 3>the reason that they do that is because it's much

612
00:33:46.839 --> 00:33:50.839
<v Speaker 3>easier to score because there are essentially ways of seeing, you know,

613
00:33:50.880 --> 00:33:53.000
<v Speaker 3>the probabilities of words I was talking about. You can

614
00:33:53.079 --> 00:33:56.680
<v Speaker 3>quite accurately tell what's the highest probability sequence that it

615
00:33:56.720 --> 00:33:59.119
<v Speaker 3>would have ended up predicting based on, you know, the

616
00:33:59.200 --> 00:34:02.880
<v Speaker 3>ones that it's presented with. So, you know, it's much

617
00:34:03.000 --> 00:34:06.200
<v Speaker 3>much easier to work this out. But you can also

618
00:34:06.279 --> 00:34:10.119
<v Speaker 3>get them to generate answers, and generating free form answers

619
00:34:10.239 --> 00:34:12.960
<v Speaker 3>is really hard to assess unless you're getting humans to

620
00:34:13.039 --> 00:34:15.519
<v Speaker 3>actually compare it to a gold standard because the statistical

621
00:34:15.559 --> 00:34:20.880
<v Speaker 3>ways we have of comparing two sequences are imperfect. So

622
00:34:22.280 --> 00:34:24.880
<v Speaker 3>most of the time people will use these multiple choice

623
00:34:24.880 --> 00:34:28.199
<v Speaker 3>answer keys. But the problem is is that LLMs seem

624
00:34:28.280 --> 00:34:31.440
<v Speaker 3>to do a lot better when they're presented with multiple

625
00:34:31.519 --> 00:34:35.159
<v Speaker 3>choice answers compared to free form answers. Sure, and the

626
00:34:35.199 --> 00:34:36.920
<v Speaker 3>reason it seems to be is because it's a lot

627
00:34:36.960 --> 00:34:40.239
<v Speaker 3>easier to just pattern match to something they've already memorized

628
00:34:40.360 --> 00:34:45.159
<v Speaker 3>as opposed to having to generalize a bit more. And

629
00:34:45.239 --> 00:34:51.039
<v Speaker 3>then the second big problem is LLMs are ridiculously

630
00:34:51.239 --> 00:34:54.159
<v Speaker 3>sensitive to the format of the prompt template you use.

631
00:34:54.199 --> 00:34:57.159
<v Speaker 3>We've already talked about this, like did you tell them

632
00:34:57.239 --> 00:35:02.199
<v Speaker 3>to use a framework that already exists? But it's so

633
00:35:02.519 --> 00:35:07.079
<v Speaker 3>much more subtle than that. So using a different placement

634
00:35:07.159 --> 00:35:11.840
<v Speaker 3>of punctuation, using different spacing, this can impact the performance

635
00:35:11.920 --> 00:35:17.480
<v Speaker 3>of LLMs on tasks by like thirty, fifty, seventy percent. Wow, yes, yeah,

636
00:35:17.719 --> 00:35:21.840
<v Speaker 3>and like why it seems to be again pattern matching.

637
00:35:22.159 --> 00:35:27.440
<v Speaker 3>So if that like particular formatting is closer to something

638
00:35:27.480 --> 00:35:30.800
<v Speaker 3>that it's seen already in training, it's more likely going

639
00:35:30.840 --> 00:35:31.800
<v Speaker 3>to be able to get it right.

640
00:35:32.199 --> 00:35:34.000
<v Speaker 2>So all I got to do is ee cummings my

641
00:35:34.159 --> 00:35:35.079
<v Speaker 2>prompt and it just.

642
00:35:40.079 --> 00:35:45.000
<v Speaker 1>Exactly. It's just, Richard invents a new verb.

643
00:35:45.079 --> 00:35:47.760
<v Speaker 2>Yeah. But you know an interesting point, like anytime you

644
00:35:47.840 --> 00:35:52.800
<v Speaker 2>want to remind a person that this software is not intelligent,

645
00:35:53.039 --> 00:35:55.880
<v Speaker 2>it's that, recognize that this is pattern matching. The

646
00:35:55.960 --> 00:35:59.000
<v Speaker 2>fact that, as a human, I can hand you

647
00:35:59.280 --> 00:36:03.199
<v Speaker 2>an all-lowercase, no-punctuation statement or a perfectly

648
00:36:03.400 --> 00:36:08.119
<v Speaker 2>punctuated statement, and you'll see it as exactly the same,

649
00:36:08.320 --> 00:36:11.320
<v Speaker 2>just one lazier than the other. But the software treats

650
00:36:11.320 --> 00:36:12.079
<v Speaker 2>it completely differently.

651
00:36:12.280 --> 00:36:15.599
<v Speaker 1>Exactly. Do you guys know the story of the

652
00:36:15.960 --> 00:36:20.800
<v Speaker 1>moment that Bill Gates went nuts over ChatGPT and

653
00:36:21.400 --> 00:36:24.599
<v Speaker 1>began to trust it and his mind was blown over it.

654
00:36:25.559 --> 00:36:28.360
<v Speaker 1>So, Richard, I sent you a link in the

655
00:36:28.480 --> 00:36:31.000
<v Speaker 1>chat you can post it there. This is the story

656
00:36:31.039 --> 00:36:34.039
<v Speaker 1>and I heard about this story on the radio.

657
00:36:34.880 --> 00:36:38.400
<v Speaker 1>So the story from this CNBC dot com thing: Bill

658
00:36:38.480 --> 00:36:43.000
<v Speaker 1>Gates watched ChatGPT ace an AP Bio exam and went

659
00:36:43.079 --> 00:36:47.960
<v Speaker 1>into quote a state of shock. And this was August eleventh,

660
00:36:48.239 --> 00:36:51.960
<v Speaker 1>twenty twenty three. So but what you don't know, and

661
00:36:52.239 --> 00:36:55.079
<v Speaker 1>I don't even know if they say it in this article,

662
00:36:55.119 --> 00:36:57.960
<v Speaker 1>I don't think they do. But a couple of months

663
00:36:58.000 --> 00:37:04.320
<v Speaker 1>before is when Sam Altman actually showed Bill ChatGPT

664
00:37:05.559 --> 00:37:07.960
<v Speaker 1>and he added a couple of things and he said,

665
00:37:08.039 --> 00:37:10.639
<v Speaker 1>you know what, you know, it would be a great test.

666
00:37:11.280 --> 00:37:16.760
<v Speaker 1>Sam, is if we could give it the AP Bio test

667
00:37:17.920 --> 00:37:22.000
<v Speaker 1>and it aced it. And then Sam goes home and

668
00:37:22.119 --> 00:37:25.880
<v Speaker 1>two months later brings it back and it aces the exam.

669
00:37:26.039 --> 00:37:27.760
<v Speaker 1>So what do you think happened in those two months?

670
00:37:27.960 --> 00:37:28.719
<v Speaker 3>What a mystery.

671
00:37:29.199 --> 00:37:31.599
<v Speaker 2>It's so strange, I can't imagine. I'd also point out

672
00:37:31.639 --> 00:37:34.239
<v Speaker 2>that an AP exam is largely multiple choice.

673
00:37:34.400 --> 00:37:37.320
<v Speaker 1>There you go, Well, I mean I thought it was

674
00:37:37.639 --> 00:37:40.400
<v Speaker 1>I thought they were essay questions. There were five questions,

675
00:37:40.480 --> 00:37:43.679
<v Speaker 1>but I don't I didn't read that part, but I

676
00:37:43.800 --> 00:37:49.760
<v Speaker 1>heard that they were five essay questions. Anyway, they

677
00:37:49.880 --> 00:37:52.760
<v Speaker 1>did not say that in the article, but that

678
00:37:52.840 --> 00:37:53.800
<v Speaker 1>apparently happened.

679
00:37:53.880 --> 00:37:54.960
<v Speaker 2>It all depends on what you train it on, right?

680
00:37:54.960 --> 00:37:58.039
<v Speaker 3>Yes, and this is actually a

681
00:37:58.599 --> 00:38:03.199
<v Speaker 3>bigger issue. So this is an issue that's called data leakage, again,

682
00:38:03.280 --> 00:38:06.280
<v Speaker 3>well known problem in machine learning. It's when your model

683
00:38:06.440 --> 00:38:09.360
<v Speaker 3>gets access to the test set during training, so that it

684
00:38:09.440 --> 00:38:13.920
<v Speaker 3>can basically learn the answers. And, well, the implication from

685
00:38:14.480 --> 00:38:16.800
<v Speaker 3>Carl is that this may not have been an accident

686
00:38:16.920 --> 00:38:20.920
<v Speaker 3>this time. But you know, we don't have a clear

687
00:38:21.000 --> 00:38:23.079
<v Speaker 3>idea of what's in the training data for a lot

688
00:38:23.159 --> 00:38:25.719
<v Speaker 3>of these models. Even open source models now are being

689
00:38:25.840 --> 00:38:28.880
<v Speaker 3>super cagey about what's in their training data. So they

690
00:38:28.880 --> 00:38:33.840
<v Speaker 3>say it's a competitive advantage. But we know from experiments

691
00:38:33.920 --> 00:38:37.760
<v Speaker 3>people have done that even benchmarks have ended up at

692
00:38:37.880 --> 00:38:41.599
<v Speaker 3>least partially leaking into the data. So we know that

693
00:38:41.840 --> 00:38:44.639
<v Speaker 3>a lot of these companies will optimize for benchmarks. They'll

694
00:38:44.679 --> 00:38:46.599
<v Speaker 3>keep training the models until or not keep training them,

695
00:38:46.639 --> 00:38:49.079
<v Speaker 3>but they'll keep tuning them until they do well on benchmarks.

696
00:38:49.159 --> 00:38:54.440
<v Speaker 3>But even accidentally, because they're just scraping the open internet,

697
00:38:54.800 --> 00:38:57.920
<v Speaker 3>sure, they've accidentally shoved in a bunch of these questions.

698
00:38:57.679 --> 00:38:59.519
<v Speaker 2>Which is probably where the benchmarks came from in the

699
00:38:59.559 --> 00:39:02.880
<v Speaker 2>first place. So, exactly, eventually you're going to meet

700
00:39:02.960 --> 00:39:05.480
<v Speaker 2>up with the data. It doesn't seem surprising at all.
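
NOTE
Editor's sketch, not from the episode: a rough way to look for the benchmark leakage discussed here is to check whether benchmark questions share long word n-grams with documents from a scraped corpus. The function names and the choice of 8-word n-grams are illustrative assumptions, not a standard from any specific lab.
```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)
```
Exact n-gram overlap only catches verbatim copies; paraphrased or translated leaks would need fuzzier matching.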

701
00:39:06.519 --> 00:39:09.960
<v Speaker 3>So yeah, the modern suite of like benchmarks that started

702
00:39:09.960 --> 00:39:14.679
<v Speaker 3>being created last year, they started making them private

703
00:39:14.719 --> 00:39:18.000
<v Speaker 3>to mitigate this. But it doesn't mean that you know,

704
00:39:18.199 --> 00:39:21.440
<v Speaker 3>you as a consumer, you're a lay user of an

705
00:39:21.599 --> 00:39:24.079
<v Speaker 3>LLM. Maybe not a lay user. You might

706
00:39:24.119 --> 00:39:26.519
<v Speaker 3>be a bit more technically advanced, but none of us

707
00:39:26.559 --> 00:39:29.719
<v Speaker 3>here are AI researchers, right? Right. And so we

708
00:39:30.199 --> 00:39:38.360
<v Speaker 3>might be not far from that, an informed consumer. Let's

709
00:39:38.360 --> 00:39:38.920
<v Speaker 3>put it that way.

710
00:39:40.480 --> 00:39:43.360
<v Speaker 1>But you know, I just found it. I'm sorry to interrupt,

711
00:39:43.360 --> 00:39:46.519
<v Speaker 1>I just found it. The story is that, you know,

712
00:39:46.679 --> 00:39:49.960
<v Speaker 1>Gates issued what he believed to be a rather difficult

713
00:39:50.119 --> 00:39:54.079
<v Speaker 1>challenge to Sam Altman: bring ChatGPT back to me

714
00:39:54.360 --> 00:39:57.639
<v Speaker 1>once it could exhibit advanced human level competency by achieving

715
00:39:57.679 --> 00:40:00.599
<v Speaker 1>the highest possible score on the AP Bio exam.

716
00:40:01.559 --> 00:40:06.719
<v Speaker 1>And so two months later, oh, magic, OpenAI's

717
00:40:06.760 --> 00:40:10.400
<v Speaker 1>developers came back and Gates watched the top score of

718
00:40:10.480 --> 00:40:13.320
<v Speaker 1>five on the test. So yeah, so there it is,

719
00:40:13.440 --> 00:40:17.639
<v Speaker 1>right in black and white that actually happened. And as

720
00:40:17.679 --> 00:40:20.800
<v Speaker 1>I was hearing this, I was like, you idiot, why

721
00:40:20.880 --> 00:40:24.000
<v Speaker 1>didn't you just say give it to me when it

722
00:40:24.079 --> 00:40:28.119
<v Speaker 1>can answer a test question and don't tell them what test? Yeah,

723
00:40:28.440 --> 00:40:31.960
<v Speaker 1>you know a test question and then just do it.

724
00:40:32.320 --> 00:40:32.639
<v Speaker 2>Try it.

725
00:40:33.400 --> 00:40:34.079
<v Speaker 3>Yeah, I don't know.

726
00:40:34.599 --> 00:40:36.480
<v Speaker 2>Far be it from me to call Bill Gates an idiot.

727
00:40:38.039 --> 00:40:40.159
<v Speaker 2>Did I actually do that? But you know, there's this

728
00:40:40.320 --> 00:40:43.320
<v Speaker 2>confirmation bias situation you can put yourself into. Yeah.

729
00:40:43.719 --> 00:40:46.440
<v Speaker 3>And this is the thing too, Like I don't blame

730
00:40:46.920 --> 00:40:50.760
<v Speaker 3>people for feeling enchanted by the models, like there is

731
00:40:50.840 --> 00:40:53.559
<v Speaker 3>something so human feeling about them because they're echoing back

732
00:40:54.159 --> 00:40:59.559
<v Speaker 3>our humanity. Yeah, but you need you always need to

733
00:40:59.599 --> 00:41:02.199
<v Speaker 3>be cautious, and like we were trying to do at

734
00:41:02.199 --> 00:41:05.280
<v Speaker 3>the beginning, like you see, we slip into anthropomorphizing the

735
00:41:05.320 --> 00:41:06.559
<v Speaker 3>models even though we know better.

736
00:41:06.519 --> 00:41:07.679
<v Speaker 2>For sure, because it's easy.

737
00:41:07.760 --> 00:41:14.320
<v Speaker 3>It is easy. But really, like even with the latest

738
00:41:14.360 --> 00:41:17.960
<v Speaker 3>benchmarks trying to assess AGI, this one called ARC-AGI

739
00:41:18.159 --> 00:41:22.639
<v Speaker 3>that OpenAI's o3 actually did very well on, got

740
00:41:22.760 --> 00:41:30.239
<v Speaker 3>seventy percent late last year. This is still just pattern matching,

741
00:41:30.559 --> 00:41:34.880
<v Speaker 3>but pattern matching in a more organized way. It's basically

742
00:41:34.960 --> 00:41:37.280
<v Speaker 3>the model has more of an ability to sort of

743
00:41:37.480 --> 00:41:41.239
<v Speaker 3>sort through which patterns might be the best to apply.

744
00:41:41.400 --> 00:41:45.320
<v Speaker 3>But again, we're just talking about a more systematic application

745
00:41:46.159 --> 00:41:49.760
<v Speaker 3>of crystallized intelligence. We're not talking about generalizability yet.

746
00:41:50.320 --> 00:41:52.800
<v Speaker 2>Yeah, And I mean the more I read, the less

747
00:41:52.800 --> 00:41:54.840
<v Speaker 2>I'm concerned about the AGI side of the equation. It

748
00:41:54.960 --> 00:41:56.960
<v Speaker 2>seems more and more like a marketing term to hire

749
00:41:57.000 --> 00:41:58.280
<v Speaker 2>more people to work at OpenAI.

750
00:41:59.119 --> 00:42:02.960
<v Speaker 1>Yeah, it's only a fluid term that keeps changing. The

751
00:42:03.079 --> 00:42:05.119
<v Speaker 1>definition keeps changing.

752
00:42:05.360 --> 00:42:07.320
<v Speaker 3>But how do you assess AGI? Like I don't know

753
00:42:07.360 --> 00:42:09.760
<v Speaker 3>if we talked about this last time, because I had

754
00:42:09.800 --> 00:42:12.400
<v Speaker 3>that in my first talk, But you know, how can

755
00:42:12.440 --> 00:42:16.559
<v Speaker 3>you even assess the gap between what a model knows

756
00:42:17.440 --> 00:42:21.440
<v Speaker 3>and you know, a task, so like the difficulty that

757
00:42:21.760 --> 00:42:23.840
<v Speaker 3>a model would have doing that task based on what

758
00:42:23.920 --> 00:42:27.400
<v Speaker 3>it already knows, and then standardize that across a bunch

759
00:42:27.440 --> 00:42:30.679
<v Speaker 3>of different models that have potentially been exposed to very

760
00:42:30.840 --> 00:42:34.119
<v Speaker 3>different tasks and knowledge? Like, it feels, it

761
00:42:34.199 --> 00:42:35.880
<v Speaker 3>feels like such a difficult.

762
00:42:35.519 --> 00:42:39.719
<v Speaker 2>Challenge and it's way too broad, and ultimately I feel

763
00:42:39.760 --> 00:42:42.920
<v Speaker 2>like it's a distraction from the fact that we're just

764
00:42:42.960 --> 00:42:47.119
<v Speaker 2>trying to be engineers using a useful tool. And

765
00:42:47.360 --> 00:42:49.320
<v Speaker 2>I mean, I led off this conversation talking about the

766
00:42:49.360 --> 00:42:52.159
<v Speaker 2>fact that I always ask folks like what one are

767
00:42:52.159 --> 00:42:54.159
<v Speaker 2>you using right now? What are you enamored of? And

768
00:42:54.199 --> 00:42:55.360
<v Speaker 2>the fact that you know, I had sort of a

769
00:42:55.440 --> 00:42:59.119
<v Speaker 2>universal "everybody likes Claude right now," it's like, why? What

770
00:42:59.519 --> 00:43:03.039
<v Speaker 2>do you What is your innate benchmark that made you

771
00:43:03.199 --> 00:43:05.119
<v Speaker 2>switch to this or is it just a social pressure

772
00:43:05.199 --> 00:43:08.159
<v Speaker 2>thing, vibes, man, because that smart person was using Claude? Now

773
00:43:08.199 --> 00:43:10.519
<v Speaker 2>I'll use Claude, and then there'll be some nice confirmation

774
00:43:10.679 --> 00:43:12.599
<v Speaker 2>bias there. Well, yeah, no, it seems to be doing

775
00:43:12.679 --> 00:43:15.480
<v Speaker 2>the thing. Is it actually better than the ChatGPT information?

776
00:43:15.920 --> 00:43:19.960
<v Speaker 2>I don't know how would you measure that? So we're

777
00:43:20.000 --> 00:43:22.639
<v Speaker 2>in this loop and I don't feel like I can

778
00:43:23.159 --> 00:43:25.599
<v Speaker 2>get a version, a new version of anything from any

779
00:43:25.639 --> 00:43:27.800
<v Speaker 2>of these folks come out, a new OpenAI, a

780
00:43:27.920 --> 00:43:30.519
<v Speaker 2>new Claude, or any of these, and say, okay, is

781
00:43:30.559 --> 00:43:32.360
<v Speaker 2>it worth switching? Yeah? I mean I know they want

782
00:43:32.400 --> 00:43:35.559
<v Speaker 2>me to. I know it's invariably more expensive, but is

783
00:43:35.599 --> 00:43:36.519
<v Speaker 2>it better?

784
00:43:37.119 --> 00:43:41.519
<v Speaker 3>Yeah? Look, I have a prediction that here we go.

785
00:43:42.280 --> 00:43:43.639
<v Speaker 3>I'm going to do a prediction why not?

786
00:43:43.880 --> 00:43:45.480
<v Speaker 2>Why not? Why not? Why not?

787
00:43:46.079 --> 00:43:48.559
<v Speaker 3>I'm going to say probably in a year's time, the

788
00:43:48.679 --> 00:43:52.559
<v Speaker 3>landscape of providers is going to look quite different. Oh, definitely,

789
00:43:52.639 --> 00:43:57.199
<v Speaker 3>And it's because the advantages of using smaller models is

790
00:43:57.480 --> 00:44:01.159
<v Speaker 3>just drastically outweighs using bigger ones. They're cheaper, they're more

791
00:44:01.280 --> 00:44:05.440
<v Speaker 3>environmentally friendly, they're more specialized, they can be more specialized,

792
00:44:05.480 --> 00:44:07.880
<v Speaker 3>like it's easier to tune them so that you can

793
00:44:08.320 --> 00:44:12.199
<v Speaker 3>focus them on specific tasks. And yeah, ultimately they're just

794
00:44:12.679 --> 00:44:14.840
<v Speaker 3>you know, it's easy to control what happens to your data.

795
00:44:15.239 --> 00:44:17.920
<v Speaker 2>So right, that's a big one.

796
00:44:18.320 --> 00:44:19.159
<v Speaker 3>That is a big one.

797
00:44:19.400 --> 00:44:22.719
<v Speaker 1>It's a big one, especially with something like DeepSeek.

798
00:44:22.800 --> 00:44:24.400
<v Speaker 1>You know, the only way I'm going to run that

799
00:44:24.800 --> 00:44:27.239
<v Speaker 1>is on my own network not connected to the internet.

800
00:44:27.440 --> 00:44:30.960
<v Speaker 2>Well, they do offer a local version. Yeah,

801
00:44:31.039 --> 00:44:33.480
<v Speaker 2>they do, right, which NVIDIA has been benchmarking with

802
00:44:33.599 --> 00:44:36.159
<v Speaker 2>a fair bit I noticed, like, which I thought was cool,

803
00:44:36.360 --> 00:44:38.719
<v Speaker 2>like, smart thing to do, not just because there's

804
00:44:38.719 --> 00:44:41.599
<v Speaker 2>lots of folks saying, no, don't use the Chinese LLM.

805
00:44:43.119 --> 00:44:45.239
<v Speaker 2>But yeah, the fact that NVIDIA just said.

806
00:44:45.119 --> 00:44:47.280
<v Speaker 1>That's the same reason they're saying don't use TikTok, right,

807
00:44:47.400 --> 00:44:48.880
<v Speaker 1>so they don't trust it more or less?

808
00:44:49.719 --> 00:44:51.760
<v Speaker 2>What could happen I don't.

809
00:44:51.679 --> 00:44:54.559
<v Speaker 3>Use Yeah, I don't use TikTok just because I'm deeply uncool,

810
00:44:54.880 --> 00:44:56.360
<v Speaker 3>Like it makes with you.

811
00:44:58.000 --> 00:44:58.920
<v Speaker 2>I'm so with you.

812
00:44:59.679 --> 00:45:02.119
<v Speaker 3>I actually had to make a TikTok. I was at

813
00:45:02.159 --> 00:45:05.039
<v Speaker 3>a workshop for my job like two months ago, so

814
00:45:05.159 --> 00:45:07.000
<v Speaker 3>you know, I'm an advocate. So they're like, hey, let's

815
00:45:07.039 --> 00:45:09.880
<v Speaker 3>teach these old people to make tiktoks nice. And I

816
00:45:09.960 --> 00:45:15.039
<v Speaker 3>made this TikTok with Michelle you know, Michelle Richard, Michelle Frost,

817
00:45:15.519 --> 00:45:18.239
<v Speaker 3>so yeah, yeah, yeah, she just started with JetBrains.

818
00:45:18.679 --> 00:45:20.559
<v Speaker 3>And so we made this TikTok with another of our

819
00:45:20.639 --> 00:45:23.239
<v Speaker 3>colleagues, and it involved Wilbur, her dog, and it was

820
00:45:23.360 --> 00:45:26.480
<v Speaker 3>just like it was so bad. And then they're like

821
00:45:26.679 --> 00:45:29.519
<v Speaker 3>it's awesome, like you should go on TikTok right now,

822
00:45:29.599 --> 00:45:32.960
<v Speaker 3>and I'm like no, I'm deeply ashamed, like.

823
00:45:34.519 --> 00:45:35.239
<v Speaker 2>I should see.

824
00:45:35.320 --> 00:45:37.440
<v Speaker 3>This was bad.

825
00:45:39.000 --> 00:45:41.199
<v Speaker 1>I do have to confess that I have a TikTok

826
00:45:41.199 --> 00:45:44.800
<v Speaker 1>account Carlotphoenix dot com. I have not used it yet

827
00:45:44.880 --> 00:45:47.920
<v Speaker 1>for anything more than hey, I'm here, and I certainly

828
00:45:48.039 --> 00:45:51.079
<v Speaker 1>don't scroll TikTok. I have so many better things to

829
00:45:51.199 --> 00:45:56.199
<v Speaker 1>do than to scroll inane, insane, crazy music videos of

830
00:45:56.239 --> 00:45:58.719
<v Speaker 1>people doing stupid things, and dogs and.

831
00:45:58.760 --> 00:46:01.639
<v Speaker 2>cats are also. You know what's interesting about TikTok

832
00:46:02.440 --> 00:46:05.760
<v Speaker 2>is you're not really picking the content they're picking the content. Yes, yeah,

833
00:46:06.199 --> 00:46:09.320
<v Speaker 2>they are watching your loiter time, so it's your behavior

834
00:46:09.360 --> 00:46:12.599
<v Speaker 2>that's only selecting the content. But you know, it is

835
00:46:12.679 --> 00:46:15.440
<v Speaker 2>a different mechanism there where you can't really curate a

836
00:46:15.559 --> 00:46:18.239
<v Speaker 2>list or build a social graph. That's not up to you.

837
00:46:19.079 --> 00:46:23.840
<v Speaker 2>And I find that interesting, right, like, that we

838
00:46:23.960 --> 00:46:27.119
<v Speaker 2>literally are handing over our attention to something else that's

839
00:46:27.400 --> 00:46:27.840
<v Speaker 2>driving it.

840
00:46:28.079 --> 00:46:30.239
<v Speaker 3>Yeah, I do have to admit I'm so curious about

841
00:46:30.239 --> 00:46:31.119
<v Speaker 3>their recommended the.

842
00:46:32.119 --> 00:46:35.840
<v Speaker 2>Yeah, well, as a technologist, right anything, like well, because

843
00:46:35.880 --> 00:46:38.480
<v Speaker 2>that's the thing that they're all upset about. This is

844
00:46:38.519 --> 00:46:40.400
<v Speaker 2>our secret sauce, and we'll be keeping it to ourselves,

845
00:46:40.400 --> 00:46:43.360
<v Speaker 2>thanks very much. Oh, it's well, here's the thing. Is

846
00:46:43.400 --> 00:46:45.320
<v Speaker 2>there a way when you see something that you don't

847
00:46:45.440 --> 00:46:47.039
<v Speaker 2>like to say, give her to death. I don't want

848
00:46:47.039 --> 00:46:49.920
<v Speaker 2>to see this again? No, because it's too late to

849
00:46:50.039 --> 00:46:53.519
<v Speaker 2>scroll past it. You've already scrolled. The problem is that

850
00:46:53.599 --> 00:46:55.400
<v Speaker 2>before you found out you didn't like it, you watched it.

851
00:46:57.880 --> 00:47:00.360
<v Speaker 2>You know, it's the old "I'm trying to improve

852
00:47:00.400 --> 00:47:02.679
<v Speaker 2>the quality of my diet by eating everything and deciding

853
00:47:02.719 --> 00:47:09.480
<v Speaker 2>what I like. Yeah, there is no nutritional label on

854
00:47:09.599 --> 00:47:10.400
<v Speaker 2>any of these things.

855
00:47:13.039 --> 00:47:14.760
<v Speaker 3>Sorry, it's called democracy.

856
00:47:15.519 --> 00:47:17.960
<v Speaker 2>Okay, so that's what you want to call it. The

857
00:47:18.039 --> 00:47:19.760
<v Speaker 2>only person who doesn't get a vote is the viewer.

858
00:47:26.000 --> 00:47:27.639
<v Speaker 2>I'm not cynical at all. I don't know what you're

859
00:47:27.639 --> 00:47:28.159
<v Speaker 2>talking about.

860
00:47:28.320 --> 00:47:31.719
<v Speaker 1>You remind me of David Mitchell, the British comedian

861
00:47:32.119 --> 00:47:33.800
<v Speaker 1>who just every once in a while will just go

862
00:47:33.920 --> 00:47:34.679
<v Speaker 1>off on a rant.

863
00:47:36.039 --> 00:47:46.079
<v Speaker 2>Just start and he'll keep me going. I'm fine, Everything's fine, Okay. Yeah,

864
00:47:46.159 --> 00:47:48.800
<v Speaker 2>we're still at this core issue of how do I

865
00:47:48.880 --> 00:47:52.800
<v Speaker 2>select an LLM for my app? I mean, part of

866
00:47:52.840 --> 00:47:55.800
<v Speaker 2>it is the running context, or I can

867
00:47:55.880 --> 00:47:58.159
<v Speaker 2>go down the cost side and can go down the

868
00:47:58.360 --> 00:48:00.679
<v Speaker 2>does it, you know, integrate well with my Do I

869
00:48:00.679 --> 00:48:02.400
<v Speaker 2>need cloud access? Or can I run local?

870
00:48:02.480 --> 00:48:02.519
<v Speaker 3>Like?

871
00:48:02.559 --> 00:48:05.159
<v Speaker 1>There's all those decisions. We need an LLM to answer

872
00:48:05.239 --> 00:48:05.719
<v Speaker 1>this question.

873
00:48:06.159 --> 00:48:06.679
<v Speaker 2>I don't think.

874
00:48:09.199 --> 00:48:11.840
<v Speaker 3>I have something even better. I've got a blog post, hey,

875
00:48:12.199 --> 00:48:15.199
<v Speaker 3>all right! So I will share this with you

876
00:48:15.320 --> 00:48:18.000
<v Speaker 3>so you can share it with the audience. But I

877
00:48:18.079 --> 00:48:19.760
<v Speaker 3>came across when I was writing my talk, I came

878
00:48:19.800 --> 00:48:24.440
<v Speaker 3>across this absolutely phenomenal blog post. So guy's an AI

879
00:48:24.519 --> 00:48:28.159
<v Speaker 3>engineer called Hamel Husain. So this guy works in an

880
00:48:28.159 --> 00:48:33.440
<v Speaker 3>AI consultancy. So, exciting job these days. He goes out and

881
00:48:33.639 --> 00:48:38.079
<v Speaker 3>he needs to basically build applications for companies that use AI.

882
00:48:39.159 --> 00:48:42.159
<v Speaker 3>And one of the jobs that he talks about is

883
00:48:42.519 --> 00:48:45.519
<v Speaker 3>he and his company were hired to build a chatbot

884
00:48:45.719 --> 00:48:48.840
<v Speaker 3>for real estate agents. So basically, they wanted the real

885
00:48:48.920 --> 00:48:50.760
<v Speaker 3>estate agents to be able to type in natural language,

886
00:48:50.840 --> 00:48:55.599
<v Speaker 3>like give me the contact details for everyone in this area,

887
00:48:55.880 --> 00:48:59.559
<v Speaker 3>whatever, you know, and then the LLM would generate a

888
00:48:59.719 --> 00:49:02.599
<v Speaker 3>query to a CRM, something like that, and return the information.

889
00:49:03.800 --> 00:49:06.519
<v Speaker 3>So when they first started building the app, they like,

890
00:49:06.679 --> 00:49:10.280
<v Speaker 3>they picked a good LLM based on the leaderboard, good one,

891
00:49:10.840 --> 00:49:13.920
<v Speaker 3>and then they wrote the initial prompt templates and then

892
00:49:14.639 --> 00:49:17.239
<v Speaker 3>you know, everything looked good, and then things started not

893
00:49:17.360 --> 00:49:19.480
<v Speaker 3>working on the edge cases, so they made the prompt

894
00:49:19.480 --> 00:49:21.400
<v Speaker 3>a bit more elaborate, and then the prompt started getting

895
00:49:21.400 --> 00:49:24.519
<v Speaker 3>really unwieldy, and then they realized the only evaluation metric

896
00:49:24.559 --> 00:49:27.159
<v Speaker 3>they had was vibes and they were like, really, this

897
00:49:28.079 --> 00:49:32.440
<v Speaker 3>is a mess. So he set out in this really

898
00:49:32.559 --> 00:49:35.400
<v Speaker 3>interesting way how they actually went back to ground zero

899
00:49:35.599 --> 00:49:39.280
<v Speaker 3>and they started again, and he said, like, basically, we

900
00:49:39.400 --> 00:49:43.960
<v Speaker 3>realized we needed a tiered assessment. So he said, like,

901
00:49:44.239 --> 00:49:47.559
<v Speaker 3>the first tier of assessment is unit tests, Like it

902
00:49:47.639 --> 00:49:50.800
<v Speaker 3>seems really obvious, right, But he's like, the thing is

903
00:49:50.920 --> 00:49:53.800
<v Speaker 3>is like, because it's nondeterministic, you're not going to have

904
00:49:54.079 --> 00:49:56.440
<v Speaker 3>one hundred percent pass rate on your unit tests. So

905
00:49:56.599 --> 00:50:00.199
<v Speaker 3>you need to determine what error rate you're happy with,

906
00:50:00.440 --> 00:50:02.320
<v Speaker 3>and that's going to require a bit of experimentation.

907
00:50:03.079 --> 00:50:04.960
<v Speaker 2>But you also have to accept a level of error rate,

908
00:50:05.119 --> 00:50:07.320
<v Speaker 2>like, you're not getting all green. Exactly.

909
00:50:07.039 --> 00:50:09.280
<v Speaker 3>Exactly, so it might be like you just need ninety

910
00:50:09.320 --> 00:50:11.559
<v Speaker 3>five or ninety nine percent or whatever to pass whatever

911
00:50:11.639 --> 00:50:15.159
<v Speaker 3>looks realistic. But you know, an example of unit tests

912
00:50:15.280 --> 00:50:18.360
<v Speaker 3>is let's say the query from the user was return

913
00:50:18.400 --> 00:50:21.000
<v Speaker 3>me the phone number of you know, Jane Smith or

914
00:50:21.920 --> 00:50:25.400
<v Speaker 3>you know someone like that, and then basically what you're

915
00:50:25.400 --> 00:50:27.719
<v Speaker 3>going to expect from the CRM is a phone number,

916
00:50:28.159 --> 00:50:30.199
<v Speaker 3>So you can write a unit test for that. You know,

917
00:50:31.159 --> 00:50:34.639
<v Speaker 3>it's basic engineering. And then he said, you know, you

918
00:50:34.760 --> 00:50:39.440
<v Speaker 3>can create a suite of manual evaluations, so you basically

919
00:50:39.519 --> 00:50:43.239
<v Speaker 3>look at the traces, how the LLM is interacting with

920
00:50:43.320 --> 00:50:45.320
<v Speaker 3>the users and the rest of the system, and you

921
00:50:45.760 --> 00:50:48.840
<v Speaker 3>manually evaluate that. And you don't have to keep doing

922
00:50:48.880 --> 00:50:51.280
<v Speaker 3>that forever because then you can use a new method

923
00:50:51.440 --> 00:50:54.320
<v Speaker 3>called LLM-as-a-judge, which is where you get another

924
00:50:54.519 --> 00:50:59.079
<v Speaker 3>LLM to also do the same assessments, and try to

925
00:50:59.239 --> 00:51:02.440
<v Speaker 3>get them to converge. And once you have a relatively

926
00:51:02.519 --> 00:51:05.800
<v Speaker 3>strong sense that the LLM is giving similar assessments to

927
00:51:05.920 --> 00:51:08.519
<v Speaker 3>your human you can you know, you need to check

928
00:51:08.519 --> 00:51:09.840
<v Speaker 3>in on it from time to time to see if

929
00:51:09.840 --> 00:51:13.000
<v Speaker 3>it's okay. But that you know, takes over that part

930
00:51:13.039 --> 00:51:15.599
<v Speaker 3>of the assessment, and then you know, you can go

931
00:51:15.719 --> 00:51:18.719
<v Speaker 3>up to your normal kind of higher level assessments like

932
00:51:19.199 --> 00:51:22.159
<v Speaker 3>A/B testing. You know, it's really, it's just a

933
00:51:22.239 --> 00:51:25.719
<v Speaker 3>normal engineering system, and you can create a feedback loop

934
00:51:25.760 --> 00:51:28.920
<v Speaker 3>where you can you know, refine your prompts or fine

935
00:51:28.920 --> 00:51:32.559
<v Speaker 3>tune models, or use different models that maybe are smaller

936
00:51:32.760 --> 00:51:35.639
<v Speaker 3>or cheaper and see whether you can get the same

937
00:51:35.679 --> 00:51:38.440
<v Speaker 3>sort of performance. So you know, obviously you're going to

938
00:51:38.559 --> 00:51:40.920
<v Speaker 3>need to just pick a model to start with. You

939
00:51:41.039 --> 00:51:42.880
<v Speaker 3>might be able to get a sense of whether it's

940
00:51:42.960 --> 00:51:46.559
<v Speaker 3>good for chatbot applications in this language, you know, do

941
00:51:46.639 --> 00:51:51.079
<v Speaker 3>your research on that. But this really shows me, like

942
00:51:51.239 --> 00:51:54.840
<v Speaker 3>it's just it's so obvious, right, like this is how

943
00:51:54.960 --> 00:51:56.000
<v Speaker 3>we do monitoring.

944
00:51:56.400 --> 00:52:00.239
<v Speaker 2>Yeah, and it's, I'm sorry, this looks too adult for me.

945
00:52:00.440 --> 00:52:03.159
<v Speaker 3>Well, I know, it looks like a lot of hard work.

946
00:52:03.719 --> 00:52:06.280
<v Speaker 2>Literally, as you actually have to work at building a

947
00:52:06.360 --> 00:52:10.559
<v Speaker 2>decent testing framework specific to your case. I

948
00:52:10.719 --> 00:52:15.239
<v Speaker 2>know, I wanted a happy button. Jody can

949
00:52:15.320 --> 00:52:17.639
<v Speaker 2>be sad. I want a button. Yeah.

950
00:52:18.000 --> 00:52:22.480
<v Speaker 1>We recorded a show with Spencer Schneidenbach, which is actually

951
00:52:22.639 --> 00:52:23.320
<v Speaker 1>next week's show.

952
00:52:23.400 --> 00:52:24.559
<v Speaker 2>We recorded it a couple.

953
00:52:24.440 --> 00:52:27.480
<v Speaker 1>Of days ago, so we have the benefit of future

954
00:52:27.599 --> 00:52:30.559
<v Speaker 1>looking here, and we talked about some of these things

955
00:52:30.639 --> 00:52:34.920
<v Speaker 1>with him, and uh, you know that just the comment

956
00:52:35.039 --> 00:52:37.000
<v Speaker 1>came up, and I think it was even me or Richard.

957
00:52:37.000 --> 00:52:39.639
<v Speaker 1>I can't remember who, but you know, we used

958
00:52:39.639 --> 00:52:42.400
<v Speaker 1>to be programmers that, you know,

959
00:52:42.519 --> 00:52:45.320
<v Speaker 1>we have a bug, we fix it. Now the program

960
00:52:45.480 --> 00:52:48.760
<v Speaker 1>is one hundred percent accurate. And now I mean it's

961
00:52:48.800 --> 00:52:53.039
<v Speaker 1>even more like we're psychologists now instead

962
00:52:53.079 --> 00:52:56.880
<v Speaker 1>of scientists. You know, we make some suggestions, we examine

963
00:52:56.880 --> 00:52:58.840
<v Speaker 1>the output, you know, we think about it a little

964
00:52:58.880 --> 00:53:01.440
<v Speaker 1>bit and it doesn't seem quite right. We ask some

965
00:53:01.519 --> 00:53:03.320
<v Speaker 1>more questions, examine the behavior.

966
00:53:03.880 --> 00:53:04.039
<v Speaker 2>You know.

967
00:53:04.159 --> 00:53:07.199
<v Speaker 1>It's like, if these things are going into our software,

968
00:53:08.760 --> 00:53:12.320
<v Speaker 1>I have a little trepidation about that just because of

969
00:53:12.480 --> 00:53:16.000
<v Speaker 1>the inaccuracies. Even if it's even if something is ninety

970
00:53:16.079 --> 00:53:20.320
<v Speaker 1>nine percent accurate. That's that's a bug. That's a one

971
00:53:20.400 --> 00:53:21.119
<v Speaker 1>percent bug.

972
00:53:21.480 --> 00:53:23.679
<v Speaker 2>Yeah, and one you probably can't pin down.

973
00:53:23.519 --> 00:53:25.039
<v Speaker 1>And one you probably can't fix.

974
00:53:25.199 --> 00:53:29.800
<v Speaker 2>Use probabilistic tools, get probabilistic results. Yeah, exactly.

975
00:53:32.360 --> 00:53:35.599
<v Speaker 3>Look, it's funny because I'm probably so much more comfortable

976
00:53:35.639 --> 00:53:37.639
<v Speaker 3>with this than any of you, because I'm like, hey,

977
00:53:37.760 --> 00:53:39.199
<v Speaker 3>this is just how stuff works.

978
00:53:39.440 --> 00:53:41.519
<v Speaker 2>That was how machine learning always worked. When we talked

979
00:53:41.559 --> 00:53:43.280
<v Speaker 2>to you in twenty three. You've been doing this for years,

980
00:53:43.320 --> 00:53:46.000
<v Speaker 2>and it's like you do the testing and there is

981
00:53:46.119 --> 00:53:48.400
<v Speaker 2>no one hundred percent exactly. Yeah, you get in the

982
00:53:48.480 --> 00:53:50.199
<v Speaker 2>mid nineties. You should feel good.

983
00:53:50.440 --> 00:53:55.280
<v Speaker 3>Yeah, yeah, well sometimes suspicious, it depends sometimes too quickly.

984
00:53:59.320 --> 00:54:02.800
<v Speaker 3>But yeah, I think it's an uncomfortable new reality. And

985
00:54:03.519 --> 00:54:06.519
<v Speaker 3>you know, it's something I've observed for years when you know,

986
00:54:07.400 --> 00:54:10.000
<v Speaker 3>you bring engineers into the world of machine learning, and

987
00:54:10.679 --> 00:54:13.719
<v Speaker 3>it is a deeply uncomfortable thing not knowing that something

988
00:54:13.960 --> 00:54:18.239
<v Speaker 3>is one hundred percent deterministic. I think the main problem

989
00:54:18.400 --> 00:54:21.599
<v Speaker 3>is, it's one thing to, say, have a system

990
00:54:21.800 --> 00:54:26.360
<v Speaker 3>that otherwise works totally in a deterministic fashion. So let's

991
00:54:26.360 --> 00:54:28.960
<v Speaker 3>say you've got some sort of system that say, takes

992
00:54:29.000 --> 00:54:33.039
<v Speaker 3>in queries or takes in numbers from a user, let's say,

993
00:54:33.079 --> 00:54:37.480
<v Speaker 3>like nutrition numbers for a piece of food or something,

994
00:54:38.480 --> 00:54:40.519
<v Speaker 3>and then you have a machine learning model that generates

995
00:54:40.559 --> 00:54:43.239
<v Speaker 3>a prediction that may be within a certain band of correct.

996
00:54:44.440 --> 00:54:47.000
<v Speaker 3>It's more difficult when you're talking about an LLM being

997
00:54:47.119 --> 00:54:49.880
<v Speaker 3>an actor as part of that system and generating pieces

998
00:54:49.880 --> 00:54:53.480
<v Speaker 3>of code that will then run that system and then

999
00:54:54.679 --> 00:54:57.360
<v Speaker 3>generating error in that way is actually quite consequential.

1000
00:54:57.559 --> 00:55:00.800
<v Speaker 2>I just like that phrase certain band of correct.

1001
00:55:03.199 --> 00:55:06.400
<v Speaker 3>We call it, we call it a confidence interval. Actually,

1002
00:55:09.280 --> 00:55:11.280
<v Speaker 3>how confident am I that this is correct?

1003
00:55:13.440 --> 00:55:16.199
<v Speaker 2>But you know, I think as a developer, when you're

1004
00:55:16.239 --> 00:55:18.679
<v Speaker 2>talking to leadership that want you to use these tools,

1005
00:55:18.719 --> 00:55:20.360
<v Speaker 2>I think they're going to provide an advantage. It's like

1006
00:55:20.440 --> 00:55:24.000
<v Speaker 2>one of their educations is these are nondeterministic models and

1007
00:55:24.000 --> 00:55:25.920
<v Speaker 2>there will always be a certain level of uncertainty, and

1008
00:55:25.960 --> 00:55:28.159
<v Speaker 2>if you're not good with that, we don't get to

1009
00:55:28.239 --> 00:55:28.840
<v Speaker 2>use these tools.

1010
00:55:28.960 --> 00:55:29.159
<v Speaker 1>Yeah.

1011
00:55:29.280 --> 00:55:30.639
<v Speaker 2>Yeah, yeah, that's right.

1012
00:55:31.239 --> 00:55:34.440
<v Speaker 1>So you know, the I guess first analysis you should

1013
00:55:34.480 --> 00:55:38.400
<v Speaker 1>do in your business is what level of uncertainty are

1014
00:55:38.440 --> 00:55:39.199
<v Speaker 1>we comfortable with?

1015
00:55:39.519 --> 00:55:42.079
<v Speaker 2>Can we tolerate? You know, what is the benchmark that

1016
00:55:42.119 --> 00:55:44.400
<v Speaker 2>we're shooting for? Well? And then the other side of

1017
00:55:44.440 --> 00:55:47.880
<v Speaker 2>this is the consequences of the uncertainty of the mistake,

1018
00:55:48.039 --> 00:55:50.719
<v Speaker 2>like, what happens? Right, are people gonna die? You know,

1019
00:55:50.800 --> 00:55:52.559
<v Speaker 2>I know, I see this over and over again, where

1020
00:55:52.719 --> 00:55:55.719
<v Speaker 2>like the first case of an LLM in an organization

1021
00:55:55.840 --> 00:55:58.800
<v Speaker 2>is with an HR system. So it's totally internal. And

1022
00:55:58.960 --> 00:56:01.440
<v Speaker 2>part of that is because the consequence of being incorrect

1023
00:56:01.599 --> 00:56:04.360
<v Speaker 2>is minor. You know, yes, you're going to make somebody

1024
00:56:04.400 --> 00:56:06.280
<v Speaker 2>angry if you tell them they have more vacation days

1025
00:56:06.280 --> 00:56:09.079
<v Speaker 2>than they do, but you probably haven't cost a company

1026
00:56:09.079 --> 00:56:09.719
<v Speaker 2>a lot of money.

1027
00:56:09.920 --> 00:56:12.920
<v Speaker 1>Well, and also there's a human there to make sure

1028
00:56:13.000 --> 00:56:16.960
<v Speaker 1>that the you know, accurate information gets given to the person.

1029
00:56:17.840 --> 00:56:20.280
<v Speaker 2>I like your optimism, but yes, I would hope you

1030
00:56:20.320 --> 00:56:23.440
<v Speaker 2>would hope. I would hope. But you know, get the point, like,

1031
00:56:24.119 --> 00:56:26.920
<v Speaker 2>there is a bunch of ways to manage this uncertainty.

1032
00:56:27.079 --> 00:56:29.440
<v Speaker 2>So there's going to be a new corporate title,

1033
00:56:29.920 --> 00:56:35.480
<v Speaker 2>job and it's going to be a nondeterministic compensator. Oh,

1034
00:56:36.440 --> 00:56:38.119
<v Speaker 2>I think that's from I think I think that's from

1035
00:56:38.159 --> 00:56:44.239
<v Speaker 2>Back to the Future. You're thinking CUO, Chief Uncertainty Officer.

1036
00:56:44.119 --> 00:56:47.559
<v Speaker 1>Chief Uncertainty, that's good. I think you're thinking of uncoupling

1037
00:56:47.639 --> 00:56:49.119
<v Speaker 1>the Heisenberg compensators.

1038
00:56:49.360 --> 00:56:51.599
<v Speaker 2>There you go, that's Star Trek, Star Trek.

1039
00:56:51.760 --> 00:56:56.440
<v Speaker 3>Yeah, bouncing all over the place, such geeks.

1040
00:56:58.159 --> 00:57:01.400
<v Speaker 2>I found the blog post from Husain, and I'll include

1041
00:57:01.400 --> 00:57:03.880
<v Speaker 2>it in the show, or from Hamel, Hamel Husain,

1042
00:57:04.079 --> 00:57:07.760
<v Speaker 2>who says "Your AI Product Needs Evals." And it's exactly

1043
00:57:07.840 --> 00:57:11.519
<v Speaker 2>the way you describe it. Building unit tests, doing model evaluation,

1044
00:57:11.760 --> 00:57:15.239
<v Speaker 2>doing A/B testing. This seems like a real concrete

1045
00:57:15.480 --> 00:57:19.559
<v Speaker 2>approach to just how do we at least be able

1046
00:57:19.639 --> 00:57:21.519
<v Speaker 2>to look people in the eye and say we've done

1047
00:57:21.599 --> 00:57:25.079
<v Speaker 2>our best to test this and have some certainty around it.

1048
00:57:25.840 --> 00:57:28.119
<v Speaker 3>Yep. And, well, I think what I like about it

1049
00:57:28.280 --> 00:57:34.599
<v Speaker 3>is it's not unfamiliar territory for engineers. This is exactly

1050
00:57:34.679 --> 00:57:38.000
<v Speaker 3>what you've all been doing for decades now, Like this

1051
00:57:38.159 --> 00:57:40.719
<v Speaker 3>is just monitoring well, or at.

1052
00:57:40.719 --> 00:57:43.199
<v Speaker 2>Least should have been doing. This is looking like the

1053
00:57:43.360 --> 00:57:44.639
<v Speaker 2>testing we do on software.

1054
00:57:44.960 --> 00:57:46.880
<v Speaker 3>But this is the thing. It's no one's fault that's

1055
00:57:46.960 --> 00:57:49.679
<v Speaker 3>an on-the-ground developer, because the way these models

1056
00:57:49.719 --> 00:57:52.360
<v Speaker 3>are sold is that, no, they're magic, like they are

1057
00:57:52.440 --> 00:57:55.519
<v Speaker 3>different to everything else. They are certainly not. They are

1058
00:57:56.280 --> 00:57:59.360
<v Speaker 3>the same as any other machine learning model, except slightly

1059
00:57:59.440 --> 00:58:04.360
<v Speaker 3>more problematic because you're probably involving them in critical parts

1060
00:58:04.400 --> 00:58:05.280
<v Speaker 3>of generating code.

1061
00:58:05.599 --> 00:58:08.880
<v Speaker 2>Just be careful and measure. Please be careful. Yeah. So, so,

1062
00:58:09.199 --> 00:58:13.760
<v Speaker 2>Dr. Burchell, what's next for you? What's in your inbox? I?

1063
00:58:14.400 --> 00:58:17.239
<v Speaker 3>As I said, I'm heading down to Melbourne in a

1064
00:58:17.320 --> 00:58:19.440
<v Speaker 3>month for NDC. I'm going to be giving this talk

1065
00:58:19.519 --> 00:58:21.679
<v Speaker 3>actually, the one I gave at Porto, and I'm going

1066
00:58:21.760 --> 00:58:24.559
<v Speaker 3>to be giving one that I gave in Oslo just

1067
00:58:24.599 --> 00:58:28.360
<v Speaker 3>about the psychology of LLMs. If you will not be

1068
00:58:28.480 --> 00:58:30.920
<v Speaker 3>with me in Australia, you can also watch that on

1069
00:58:31.079 --> 00:58:35.639
<v Speaker 3>YouTube on the NDC channel. At the moment I'm laying

1070
00:58:35.719 --> 00:58:38.400
<v Speaker 3>kind of low, I'm actually going for my German citizenship,

1071
00:58:38.599 --> 00:58:42.159
<v Speaker 3>so nice. Yeah, I gotta do my citizenship test in June.

1072
00:58:42.360 --> 00:58:45.119
<v Speaker 3>Just did my language test a couple of weeks ago, and

1073
00:58:45.440 --> 00:58:48.039
<v Speaker 3>like everything in Germany, it takes months, so I may

1074
00:58:48.119 --> 00:58:49.599
<v Speaker 3>be able to apply by the end of the year.

1075
00:58:49.800 --> 00:58:51.880
<v Speaker 2>So you got to ratchet up your complaint, too.

1076
00:58:52.760 --> 00:58:56.320
<v Speaker 3>I laughed, actually so hard, because a friend of mine

1077
00:58:56.360 --> 00:58:58.960
<v Speaker 3>did her exam and one of her writing tests was

1078
00:58:59.000 --> 00:59:00.599
<v Speaker 3>to write a letter of complaint.

1079
00:59:05.599 --> 00:59:08.880
<v Speaker 1>Well it started because before you came on, Richard, Jody says,

1080
00:59:08.920 --> 00:59:10.880
<v Speaker 1>how you doing, and I said, I can't complain, but

1081
00:59:10.960 --> 00:59:14.599
<v Speaker 1>I do anyway. She says, pre-German.

1082
00:59:17.079 --> 00:59:20.320
<v Speaker 2>Awesome, all right, thanks Jody, really appreciate it. Yeah, thank you.

1083
00:59:20.760 --> 00:59:22.559
<v Speaker 2>What a great conversation, all.

1084
00:59:22.480 --> 00:59:24.440
<v Speaker 3>Right, Always always a pleasure, Okay, and.

1085
00:59:24.440 --> 00:59:27.079
<v Speaker 1>We'll talk to you next time on dot net rocks.

1086
00:59:47.280 --> 00:59:49.840
<v Speaker 1>Dot net Rocks is brought to you by Franklins.Net

1087
00:59:50.119 --> 00:59:54.039
<v Speaker 1>and produced by PWOP Studios, a full service audio, video

1088
00:59:54.159 --> 00:59:58.159
<v Speaker 1>and post production facility located physically in New London, Connecticut,

1089
00:59:58.480 --> 01:00:02.639
<v Speaker 1>and of course in the cloud online at PWOP dot com.

1090
01:00:03.480 --> 01:00:05.519
<v Speaker 1>Visit our website at D O T N E T

1091
01:00:05.840 --> 01:00:09.840
<v Speaker 1>R O C K S dot com for RSS feeds, downloads,

1092
01:00:10.000 --> 01:00:13.679
<v Speaker 1>mobile apps, comments, and access to the full archives going

1093
01:00:13.719 --> 01:00:17.119
<v Speaker 1>back to show number one, recorded in September two thousand

1094
01:00:17.119 --> 01:00:19.760
<v Speaker 1>and two. And make sure you check out our sponsors.

1095
01:00:19.960 --> 01:00:22.760
<v Speaker 1>They keep us in business. Now, go write some code,

1096
01:00:23.320 --> 01:00:24.079
<v Speaker 1>See you next time.

1097
01:00:25.000 --> 01:00:28.920
<v Speaker 2>You got Jack Middle Vans and
