WEBVTT

1
00:00:00.160 --> 00:00:02.960
<v Speaker 1>Welcome to the deep dive. If you've been watching the

2
00:00:03.000 --> 00:00:05.919
<v Speaker 1>tech world, you know we're living through an incredible moment.

3
00:00:06.960 --> 00:00:10.720
<v Speaker 1>This AI revolution, powered by large language models, or

4
00:00:11.000 --> 00:00:12.679
<v Speaker 2>LLMs. It really is something else.

5
00:00:12.759 --> 00:00:14.720
<v Speaker 1>Yeah, it's not just a minor tech update. It feels

6
00:00:14.720 --> 00:00:17.480
<v Speaker 1>like one of those huge shifts, you know, like the computer,

7
00:00:17.559 --> 00:00:18.760
<v Speaker 1>the internet, or the smartphone.

8
00:00:18.760 --> 00:00:19.719
<v Speaker 3>Definitely pivotal.

9
00:00:19.920 --> 00:00:23.600
<v Speaker 1>We're seeing these prototypes that seem almost magical. You can

10
00:00:23.640 --> 00:00:26.559
<v Speaker 1>write stories, generate code, it's amazing.

11
00:00:26.600 --> 00:00:27.879
<v Speaker 3>The demos are stunning.

12
00:00:28.000 --> 00:00:30.760
<v Speaker 1>But here's the thing, right, taking that cool demo and

13
00:00:30.800 --> 00:00:35.479
<v Speaker 1>making it a reliable, production grade application, that's well, that's

14
00:00:35.520 --> 00:00:36.399
<v Speaker 1>a whole different.

15
00:00:36.159 --> 00:00:37.039
<v Speaker 3>Right, much harder game.

16
00:00:37.119 --> 00:00:39.079
<v Speaker 1>Yeah. So our mission today is really to cut through

17
00:00:39.079 --> 00:00:43.479
<v Speaker 1>some of that hype and navigate this pretty complex landscape

18
00:00:43.479 --> 00:00:47.159
<v Speaker 1>of LLM development. We want to equip you, our listener,

19
00:00:47.640 --> 00:00:51.000
<v Speaker 1>with the core intuition, some surprising facts maybe, and the

20
00:00:51.000 --> 00:00:55.320
<v Speaker 1>practical tools you'll need to build genuinely sophisticated applications, the

21
00:00:55.359 --> 00:00:56.600
<v Speaker 1>ones that actually work.

22
00:00:56.560 --> 00:00:58.759
<v Speaker 2>And to guide us on this deep dive. We're leaning

23
00:00:58.799 --> 00:01:02.600
<v Speaker 2>pretty heavily on a fantastic resource, Designing Large Language

24
00:01:02.640 --> 00:01:05.879
<v Speaker 2>Model Applications by Suhas Pai. What's great about it, I

25
00:01:05.920 --> 00:01:08.959
<v Speaker 2>think is that it's not just some dry technical manual.

26
00:01:09.319 --> 00:01:09.959
<v Speaker 3>It gives this.

27
00:01:10.040 --> 00:01:15.000
<v Speaker 2>really holistic overview for, you know, software engineers and ML folks,

28
00:01:15.040 --> 00:01:16.799
<v Speaker 2>product managers, anyone involved.

29
00:01:16.840 --> 00:01:17.439
<v Speaker 1>That's useful.

30
00:01:17.640 --> 00:01:20.920
<v Speaker 2>Yeah, and it provides surprising depth that helps you understand

31
00:01:21.280 --> 00:01:25.040
<v Speaker 2>not just what the models do, but fundamentally why they

32
00:01:25.079 --> 00:01:27.319
<v Speaker 2>behave the way they do, and that why is crucial,

33
00:01:27.599 --> 00:01:32.359
<v Speaker 2>absolutely crucial, especially for getting past fragile prototypes to something robust.

34
00:01:32.560 --> 00:01:35.480
<v Speaker 1>Okay, let's unpack this, then. When we talk about LLMs,

35
00:01:35.560 --> 00:01:37.640
<v Speaker 1>what are they actually made of? Like what are the

36
00:01:37.680 --> 00:01:39.840
<v Speaker 1>basic ingredients before they even start learning?

37
00:01:40.319 --> 00:01:40.519
<v Speaker 3>Right?

38
00:01:40.599 --> 00:01:43.079
<v Speaker 2>So, at their very core, LLMs are built on

39
00:01:43.120 --> 00:01:43.959
<v Speaker 2>pre-training data.

40
00:01:44.040 --> 00:01:46.079
<v Speaker 1>That's the raw fuel. Data, got it.

41
00:01:46.200 --> 00:01:48.640
<v Speaker 2>And you know that old saying garbage in, garbage out,

42
00:01:48.760 --> 00:01:53.400
<v Speaker 2>It applies massively here. The scale and maybe even more importantly,

43
00:01:53.480 --> 00:01:56.079
<v Speaker 2>the quality of this data is paramount.

44
00:01:56.239 --> 00:01:57.359
<v Speaker 1>So where does it all come from?

45
00:01:57.439 --> 00:02:01.040
<v Speaker 2>We're talking colossal amounts of text. A huge chunk often

46
00:02:01.079 --> 00:02:04.239
<v Speaker 2>comes from web text, like from Common Crawl. Massive, but

47
00:02:04.599 --> 00:02:07.480
<v Speaker 2>it needs so much cleaning because, well, the Internet's

48
00:02:07.079 --> 00:02:09.400
<v Speaker 1>messy. Understatement of the year, huh.

49
00:02:09.479 --> 00:02:09.680
<v Speaker 3>Right.

50
00:02:09.919 --> 00:02:12.560
<v Speaker 2>Then you have things like WebText or OpenWebText.

51
00:02:12.719 --> 00:02:16.039
<v Speaker 2>They often use signals from places like Reddit outbound links

52
00:02:16.080 --> 00:02:19.439
<v Speaker 2>specifically trying to filter for you know, higher quality stuff.

53
00:02:19.599 --> 00:02:21.800
<v Speaker 3>Wisdom of the crowd, kind of. Interesting.

54
00:02:21.919 --> 00:02:22.319
<v Speaker 1>What else?

55
00:02:22.520 --> 00:02:26.479
<v Speaker 2>There's factual knowledge from Wikipedia, super valuable for accuracy, but

56
00:02:26.560 --> 00:02:27.319
<v Speaker 2>the style.

57
00:02:27.080 --> 00:02:29.000
<v Speaker 1>Is very formal, yeah, very encyclopedic.

58
00:02:29.120 --> 00:02:33.360
<v Speaker 2>And historically BooksCorpus was big, lots of narrative but surprisingly

59
00:02:33.479 --> 00:02:36.840
<v Speaker 2>like twenty six percent romance novels from unpublished authors, so

60
00:02:37.199 --> 00:02:37.960
<v Speaker 2>quite specific.

61
00:02:38.000 --> 00:02:38.240
<v Speaker 1>Wow.

62
00:02:38.319 --> 00:02:41.719
<v Speaker 2>Okay, and now you see newer efforts like Hugging Face's

63
00:02:41.800 --> 00:02:44.960
<v Speaker 2>FineWeb, aiming for even cleaner web data, and

64
00:02:45.039 --> 00:02:47.759
<v Speaker 2>FineWeb-Edu, focusing on educational content.

65
00:02:48.439 --> 00:02:50.759
<v Speaker 3>Fifteen trillion tokens. It's huge.

66
00:02:50.840 --> 00:02:54.599
<v Speaker 1>So it's clearly not just about quantity, it's about cleaning

67
00:02:54.680 --> 00:02:57.560
<v Speaker 1>and curating this raw material. What does that involve exactly?

68
00:02:57.680 --> 00:03:00.840
<v Speaker 2>Data preprocessing is key. It's not glamorous, but maybe the

69
00:03:00.840 --> 00:03:03.400
<v Speaker 2>most vital step. Like what, specifically? You've got to strip

70
00:03:03.479 --> 00:03:07.000
<v Speaker 2>out all the web boilerplate: menus, nav links, lorem ipsum placeholders,

71
00:03:07.039 --> 00:03:07.719
<v Speaker 2>all that junk.

72
00:03:08.240 --> 00:03:10.599
<v Speaker 3>And language identification

73
00:03:10.199 --> 00:03:15.199
<v Speaker 2>is surprisingly tricky. Even in supposedly English-only datasets,

74
00:03:15.319 --> 00:03:16.599
<v Speaker 3>other languages creep in.

75
00:03:17.039 --> 00:03:19.800
<v Speaker 2>If you don't catch that, your model might suddenly start

76
00:03:19.800 --> 00:03:21.400
<v Speaker 2>speaking Spanish.

77
00:03:21.199 --> 00:03:23.439
<v Speaker 1>Which could be a bug, or maybe a

78
00:03:23.400 --> 00:03:26.759
<v Speaker 2>feature. Could be either. And quality filtering is vital too,

79
00:03:27.039 --> 00:03:29.159
<v Speaker 2>often using things like perplexity scores.

80
00:03:29.280 --> 00:03:31.840
<v Speaker 1>Okay, perplexity scores, how does that work? Break that down?

81
00:03:31.919 --> 00:03:33.039
<v Speaker 3>Sure. Think of it like this.

82
00:03:33.879 --> 00:03:36.199
<v Speaker 2>If you're trying to predict the next word in a

83
00:03:36.240 --> 00:03:39.039
<v Speaker 2>really well written, clear sentence, it's pretty easy.

84
00:03:39.560 --> 00:03:43.039
<v Speaker 3>Low uncertainty. That's low perplexity. Makes sense. But if you're

85
00:03:43.080 --> 00:03:43.639
<v Speaker 3>trying to.

86
00:03:43.599 --> 00:03:46.719
<v Speaker 2>Guess the next word in some garbled text full of errors,

87
00:03:46.759 --> 00:03:50.400
<v Speaker 2>it's super hard. High uncertainty, that's high perplexity.

88
00:03:50.479 --> 00:03:54.280
<v Speaker 1>Ah, okay, So high perplexity means noisy, bad data.

89
00:03:54.319 --> 00:03:56.400
<v Speaker 2>Basically, yeah, you probably don't want to feed that to

90
00:03:56.439 --> 00:03:57.439
<v Speaker 2>your expensive model.

91
00:03:57.560 --> 00:04:01.319
<v Speaker 1>Got it. And after cleaning, you mentioned deduplication and privacy.

92
00:04:01.560 --> 00:04:02.639
<v Speaker 1>Why is that so important?

93
00:04:02.639 --> 00:04:03.360
<v Speaker 3>Oh, it's massive.

94
00:04:03.400 --> 00:04:06.360
<v Speaker 2>Web text is full of duplicates. Removing them isn't just

95
00:04:06.360 --> 00:04:11.680
<v Speaker 2>about efficiency. It's critical to stop LLMs from accidentally memorizing

96
00:04:11.680 --> 00:04:15.000
<v Speaker 2>and leaking PII, personally identifiable information.

97
00:04:14.919 --> 00:04:17.120
<v Speaker 1>Right, even if it's technically published. Exactly.

98
00:04:17.199 --> 00:04:20.560
<v Speaker 2>That's the whole contextual integrity issue. Should an AI just

99
00:04:20.600 --> 00:04:23.399
<v Speaker 2>blurt out someone's address because it found it online somewhere?

100
00:04:23.560 --> 00:04:25.600
<v Speaker 2>It's tricky, especially with public figures.

101
00:04:26.319 --> 00:04:28.079
<v Speaker 1>Complex ethical ground. Totally.

102
00:04:28.839 --> 00:04:31.040
<v Speaker 2>And what's wild is that even a tiny bit of

103
00:04:31.120 --> 00:04:35.759
<v Speaker 2>manipulated data, like less than point one percent, can potentially

104
00:04:35.800 --> 00:04:38.519
<v Speaker 2>make it easier for other sensitive data to leak.

105
00:04:38.800 --> 00:04:41.800
<v Speaker 1>Wow. Okay, that's a lot about the raw material. But

106
00:04:41.879 --> 00:04:44.839
<v Speaker 1>this next part for me is where it gets really fascinating.

107
00:04:45.399 --> 00:04:48.879
<v Speaker 1>How do these models actually read? It's not like they

108
00:04:48.879 --> 00:04:50.279
<v Speaker 1>see words like we do, is it?

109
00:04:50.399 --> 00:04:50.920
<v Speaker 3>You're spot on.

110
00:04:51.000 --> 00:04:53.600
<v Speaker 2>They don't process discrete words like humans. They use something

111
00:04:53.639 --> 00:04:56.199
<v Speaker 2>called tokens, and often these tokens are.

112
00:04:56.079 --> 00:04:59.199
<v Speaker 1>Subwords. Subwords, like parts of words? Kind of, yeah.

113
00:04:59.480 --> 00:05:02.959
<v Speaker 2>So, for example, in GPT-NeoX, "office" might be one token,

114
00:05:03.000 --> 00:05:05.759
<v Speaker 2>but " office" with that little marker, meaning a space before it,

115
00:05:05.800 --> 00:05:06.639
<v Speaker 2>is a different token.

116
00:05:06.959 --> 00:05:09.480
<v Speaker 3>Case matters too: "office" versus "Office."

117
00:05:09.560 --> 00:05:11.319
<v Speaker 1>Okay, so it's more granular. Exactly.

118
00:05:11.600 --> 00:05:14.879
<v Speaker 2>And the subword approach is clever because it mostly avoids

119
00:05:14.600 --> 00:05:16.120
<v Speaker 3>the out-of-vocabulary problem.

120
00:05:16.439 --> 00:05:18.720
<v Speaker 2>If it sees a totally new word, it can usually

121
00:05:18.720 --> 00:05:22.399
<v Speaker 2>break it down into known subword pieces instead of just crashing.

122
00:05:22.519 --> 00:05:25.199
<v Speaker 1>So it's almost like they're reading in syllables or morphemes,

123
00:05:25.199 --> 00:05:25.839
<v Speaker 1>not whole words.

124
00:05:25.839 --> 00:05:28.879
<v Speaker 2>A bit like that, yeah. Smaller meaningful units. And

125
00:05:28.959 --> 00:05:34.720
<v Speaker 2>sometimes this process creates weird artifacts: glitch tokens. Glitch tokens? Yeah,

126
00:05:34.839 --> 00:05:38.120
<v Speaker 2>or undertrained ones. There's this great story about

127
00:05:38.160 --> 00:05:40.920
<v Speaker 2>SolidGoldMagikarp. It was a Reddit username that actually became

128
00:05:41.040 --> 00:05:42.360
<v Speaker 2>a token in GPT-2.

129
00:05:42.480 --> 00:05:44.079
<v Speaker 1>Seriously, a username? Yep.

130
00:05:44.519 --> 00:05:47.439
<v Speaker 2>But then later models like GPT-3 were trained on

131
00:05:47.480 --> 00:05:50.720
<v Speaker 2>different data where that token barely appeared. It had no

132
00:05:50.800 --> 00:05:53.720
<v Speaker 2>training signal. So if you fed GPT-3 SolidGoldMagikarp,

133
00:05:53.720 --> 00:05:55.920
<v Speaker 2>it would just act weirdly, like it had

134
00:05:55.959 --> 00:05:57.839
<v Speaker 2>no clue what to do. Bizarre. It is.

135
00:05:58.079 --> 00:06:01.920
<v Speaker 2>But it raises this cool question, what can these weird

136
00:06:01.920 --> 00:06:04.279
<v Speaker 2>tokens tell us about the training data. It's like a

137
00:06:04.279 --> 00:06:06.839
<v Speaker 2>little window into the model's digestive system.

138
00:06:07.000 --> 00:06:10.120
<v Speaker 1>Huh, a digital digestive system. I like that. Okay, so

139
00:06:10.160 --> 00:06:12.079
<v Speaker 1>we have the data, we have the tokens. How does

140
00:06:12.120 --> 00:06:16.199
<v Speaker 1>this all come together? We hear neural networks, transformers. What's

141
00:06:16.240 --> 00:06:16.800
<v Speaker 1>the engine?

142
00:06:17.000 --> 00:06:17.199
<v Speaker 3>Right?

143
00:06:17.240 --> 00:06:19.879
<v Speaker 2>The engine at the heart of almost all modern LLMs

144
00:06:20.000 --> 00:06:23.160
<v Speaker 2>is the transformer architecture. It was a huge breakthrough back

145
00:06:23.160 --> 00:06:25.199
<v Speaker 2>in 2017. Why was it such a big deal?

146
00:06:25.839 --> 00:06:32.000
<v Speaker 2>Because older recurrent neural networks, RNNs, really struggled with long sentences,

147
00:06:32.160 --> 00:06:35.759
<v Speaker 2>long range dependencies. They were trying to sort of cram

148
00:06:35.800 --> 00:06:38.079
<v Speaker 2>the meaning of a whole sentence into one single vector.

149
00:06:38.120 --> 00:06:40.879
<v Speaker 2>It just didn't scale well. The transformer changed that with

150
00:06:40.920 --> 00:06:43.199
<v Speaker 2>its key innovation: self-attention.

151
00:06:43.800 --> 00:06:47.439
<v Speaker 1>Self-attention. That sounds mindful. How does it work for AI?

152
00:06:47.839 --> 00:06:50.800
<v Speaker 2>Heh, yeah, it's actually pretty intuitive. Think about how we read.

153
00:06:50.920 --> 00:06:53.839
<v Speaker 2>We don't give every word equal weight, right? Definitely not.

154
00:06:54.079 --> 00:06:57.279
<v Speaker 2>We focus on certain words to understand context, like "bank"

155
00:06:57.560 --> 00:07:01.399
<v Speaker 2>means something different in "river bank" versus "savings bank." Self-attention

156
00:07:01.560 --> 00:07:04.399
<v Speaker 2>lets the model do exactly that: weigh the importance of

157
00:07:04.399 --> 00:07:07.399
<v Speaker 2>different words in the sequence as it processes them. So it

158
00:07:07.399 --> 00:07:09.720
<v Speaker 1>learns context from surrounding words? Precisely.

159
00:07:09.759 --> 00:07:12.040
<v Speaker 2>It's like that old linguistics idea: "You shall know a

160
00:07:12.079 --> 00:07:14.920
<v Speaker 2>word by the company it keeps." It uses these things

161
00:07:14.959 --> 00:07:18.680
<v Speaker 2>called query, key, and value matrices to let words mathematically

162
00:07:18.759 --> 00:07:19.879
<v Speaker 2>attend to each other.

163
00:07:20.040 --> 00:07:23.600
<v Speaker 1>Okay, mathematically attend, got it. And there are different types of

164
00:07:23.639 --> 00:07:24.560
<v Speaker 1>these transformers?

165
00:07:24.839 --> 00:07:27.959
<v Speaker 3>Yeah, broadly, three main transformer backbones.

166
00:07:28.199 --> 00:07:32.199
<v Speaker 2>First, encoder-only models like BERT, great for understanding text, things like

167
00:07:32.240 --> 00:07:33.279
<v Speaker 2>search or classification.

168
00:07:33.480 --> 00:07:33.720
<v Speaker 1>Right.

169
00:07:33.959 --> 00:07:37.680
<v Speaker 2>Then the original encoder-decoder design, still fantastic for things

170
00:07:37.720 --> 00:07:40.480
<v Speaker 2>like machine translation, where you need to process an input

171
00:07:40.600 --> 00:07:42.680
<v Speaker 2>and generate a distinct output.

172
00:07:42.839 --> 00:07:43.199
<v Speaker 1>Okay.

173
00:07:43.279 --> 00:07:46.040
<v Speaker 2>And finally, the one we usually associate with generative AI

174
00:07:46.199 --> 00:07:50.439
<v Speaker 2>like GPT-4: the decoder-only architecture. These models are

175
00:07:50.439 --> 00:07:53.600
<v Speaker 2>specialized in predicting the very next token in a sequence.

176
00:07:53.959 --> 00:07:55.319
<v Speaker 2>That's how they generate text.

177
00:07:55.600 --> 00:07:59.519
<v Speaker 1>And what about these mixture-of-experts models, MoE? Are

178
00:07:59.519 --> 00:08:00.319
<v Speaker 1>they different again?

179
00:08:00.680 --> 00:08:04.079
<v Speaker 2>They're a really interesting evolution, sort of built on the backbone.

180
00:08:04.199 --> 00:08:08.439
<v Speaker 2>Mixture of experts aims to massively increase a model's capacity,

181
00:08:08.519 --> 00:08:12.000
<v Speaker 2>how much it knows, without proportionally increasing the compute cost

182
00:08:12.079 --> 00:08:13.319
<v Speaker 2>for every single input.

183
00:08:13.360 --> 00:08:14.040
<v Speaker 1>How does that work?

184
00:08:14.279 --> 00:08:16.439
<v Speaker 2>The clever bit is that for any given input, only

185
00:08:16.439 --> 00:08:19.800
<v Speaker 2>a subset of specialized experts inside the model gets activated.

186
00:08:19.920 --> 00:08:23.600
<v Speaker 2>So ask about physics, the physics expert activates. Ask about poetry,

187
00:08:23.600 --> 00:08:26.240
<v Speaker 2>The poetry expert lights up. The others stay quiet.

188
00:08:26.519 --> 00:08:29.040
<v Speaker 1>Huh. So it's like calling on specialists. Exactly.

189
00:08:29.399 --> 00:08:31.680
<v Speaker 2>You get the power of a huge model, but you

190
00:08:31.800 --> 00:08:35.399
<v Speaker 2>only run the relevant parts for each query. Mistral's Mixtral

191
00:08:35.480 --> 00:08:39.000
<v Speaker 2>is a key example, and many suspect GPT-4 uses

192
00:08:39.039 --> 00:08:40.960
<v Speaker 2>something similar, though it's unconfirmed.

193
00:08:41.480 --> 00:08:45.720
<v Speaker 1>This is really critical then for you, the listener, understanding

194
00:08:45.759 --> 00:08:51.360
<v Speaker 1>these foundations, the data, the tokens, the transformer architecture, MoE.

195
00:08:52.039 --> 00:08:55.679
<v Speaker 1>It's crucial, absolutely, even if you never train one from scratch.

196
00:08:56.200 --> 00:09:00.360
<v Speaker 1>That intuition helps you debug, figure out why it's behaving oddly,

197
00:09:00.799 --> 00:09:03.320
<v Speaker 1>and build better apps. You start to see why it

198
00:09:03.360 --> 00:09:04.759
<v Speaker 1>might struggle or succeed.

199
00:09:04.480 --> 00:09:05.919
<v Speaker 3>Get a feel for the machine.

200
00:09:06.120 --> 00:09:09.200
<v Speaker 1>So we've established these models are powerful, but yeah, definitely

201
00:09:09.200 --> 00:09:12.559
<v Speaker 1>not perfect. What are some of the biggest practical limitations

202
00:09:12.799 --> 00:09:14.360
<v Speaker 1>and how are we starting to tackle them.

203
00:09:14.679 --> 00:09:18.360
<v Speaker 2>One of the biggest and probably most talked about, is hallucinations.

204
00:09:17.720 --> 00:09:19.120
<v Speaker 1>Right, when they just make stuff up.

205
00:09:19.080 --> 00:09:22.679
<v Speaker 2>Exactly. More formally, it's generated text that isn't grounded in

206
00:09:22.720 --> 00:09:24.679
<v Speaker 2>the training data or the input context.

207
00:09:25.000 --> 00:09:27.639
<v Speaker 3>It sounds plausible, but it's just fabrication.

208
00:09:27.679 --> 00:09:28.559
<v Speaker 1>Can you give an example?

209
00:09:28.799 --> 00:09:31.559
<v Speaker 2>Sure, there was a well-known case with the Nous

210
00:09:31.639 --> 00:09:36.399
<v Speaker 2>Research Hermes model. It hallucinated details about Ugandan medal winners

211
00:09:36.480 --> 00:09:39.519
<v Speaker 2>from the 2020 Olympics. Oh, wow. Yeah, it got

212
00:09:39.559 --> 00:09:42.639
<v Speaker 2>birth dates wrong, mixed up which medals they won. The

213
00:09:42.720 --> 00:09:45.039
<v Speaker 2>athletes were real, the core facts were real, but the

214
00:09:45.080 --> 00:09:48.720
<v Speaker 2>details were just invented, confidently stated but wrong.

215
00:09:48.919 --> 00:09:51.519
<v Speaker 1>Yikes, how do you even begin to fix that?

216
00:09:52.000 --> 00:09:56.440
<v Speaker 2>It's tough. Mitigation involves several things. Good product design helps:

217
00:09:56.480 --> 00:09:59.559
<v Speaker 2>try not to ask questions the LLM likely can't answer.

218
00:10:00.000 --> 00:10:02.639
<v Speaker 2>So knowing what it doesn't know is hard? True. We

219
00:10:02.679 --> 00:10:06.000
<v Speaker 2>also look at model self knowledge and calibration, basically, how

220
00:10:06.000 --> 00:10:09.159
<v Speaker 2>confident is the model in its own output. Sometimes low

221
00:10:09.200 --> 00:10:11.879
<v Speaker 2>confidence correlates with higher hallucination risk.

222
00:10:11.799 --> 00:10:13.879
<v Speaker 1>Okay, using its own uncertainty signals.

223
00:10:13.960 --> 00:10:14.480
<v Speaker 3>Yeah.

224
00:10:14.519 --> 00:10:17.720
<v Speaker 2>And then there are technical fixes during generation, like factual-nucleus

225
00:10:17.799 --> 00:10:21.320
<v Speaker 2>sampling, which tries to reduce randomness for more factual outputs,

226
00:10:21.919 --> 00:10:25.480
<v Speaker 2>or DoLa decoding, which cleverly uses differences between signals in

227
00:10:25.480 --> 00:10:28.000
<v Speaker 2>the transformer layers to spot potential hallucinations.

228
00:10:28.159 --> 00:10:31.240
<v Speaker 1>Fascinating. And sometimes they hallucinate just because the prompt itself

229
00:10:31.279 --> 00:10:33.399
<v Speaker 1>is confusing, right? Like, yeah, with irrelevant info?

230
00:10:33.600 --> 00:10:37.360
<v Speaker 2>Yeah, absolutely. If you put distracting sentences in the prompt

231
00:10:37.440 --> 00:10:41.600
<v Speaker 2>like mentioning Max selling apples in Sarah's unrelated math problem,

232
00:10:42.000 --> 00:10:44.879
<v Speaker 2>the LLM can get confused and incorporate the wrong info,

233
00:10:45.039 --> 00:10:49.159
<v Speaker 2>so prompting it to first identify and remove irrelevant context

234
00:10:49.399 --> 00:10:49.879
<v Speaker 2>can help.

235
00:10:50.200 --> 00:10:54.480
<v Speaker 1>Okay. So beyond just factual accuracy, what about actual reasoning?

236
00:10:55.159 --> 00:10:57.559
<v Speaker 1>Can they really connect dots logically?

237
00:10:58.000 --> 00:11:01.000
<v Speaker 2>That's a huge area of research and development. Natural language

238
00:11:01.039 --> 00:11:04.559
<v Speaker 2>reasoning means integrating knowledge to draw conclusions, and there are

239
00:11:04.559 --> 00:11:08.879
<v Speaker 2>different kinds. Deductive is pure logic: premise A, premise B,

240
00:11:09.039 --> 00:11:12.639
<v Speaker 2>therefore conclusion C. Like, Mr. Shockley is allergic to mushrooms.

241
00:11:12.720 --> 00:11:14.919
<v Speaker 2>This dish has mushrooms, so mister Shockley.

242
00:11:14.559 --> 00:11:15.279
<v Speaker 3>Should avoid it.

243
00:11:15.360 --> 00:11:19.200
<v Speaker 2>Simple logic. Then inductive, generalizing from examples: see hundreds of

244
00:11:19.279 --> 00:11:23.240
<v Speaker 2>round manhole covers, conclude manhole covers are generally round. Abductive

245
00:11:23.279 --> 00:11:27.519
<v Speaker 2>reasoning is finding the most likely explanation: streets wet, puddles, umbrellas. Hmm,

246
00:11:27.720 --> 00:11:31.039
<v Speaker 2>probably rain. Inference to the best explanation. Exactly. And then

247
00:11:31.080 --> 00:11:34.120
<v Speaker 2>there's common sense implicit stuff like you can't fit a

248
00:11:34.159 --> 00:11:35.440
<v Speaker 2>horse in a Mini Cooper.

249
00:11:35.399 --> 00:11:38.759
<v Speaker 1>Huh yeah, hopefully obvious. How do you get an LLM

250
00:11:38.799 --> 00:11:39.440
<v Speaker 1>to do that better?

251
00:11:39.879 --> 00:11:43.000
<v Speaker 2>A major technique is chain-of-thought prompting, or CoT.

252
00:11:43.559 --> 00:11:46.639
<v Speaker 2>You literally tell the LLM to think step by

253
00:11:46.480 --> 00:11:48.840
<v Speaker 1>step. Show your work, basically. Pretty

254
00:11:48.600 --> 00:11:51.519
<v Speaker 2>much. For a math problem like thirty four plus forty

255
00:11:51.519 --> 00:11:54.000
<v Speaker 2>four plus three twenty three three to two. Instead of

256
00:11:54.080 --> 00:11:55.960
<v Speaker 2>just asking for the answer, you ask it to break

257
00:11:56.000 --> 00:11:58.799
<v Speaker 2>it down first, calculate three, two, three, and so on.

258
00:11:59.200 --> 00:12:01.399
<v Speaker 2>Performance jumps dramatically.

259
00:12:01.000 --> 00:12:03.559
<v Speaker 1>Because it forces a sequential process. Right.

260
00:12:03.440 --> 00:12:05.679
<v Speaker 2>It gives it intermediate steps to work with. It costs

261
00:12:05.720 --> 00:12:08.600
<v Speaker 2>more tokens, more time, but it's often worth it for

262
00:12:08.679 --> 00:12:12.159
<v Speaker 2>complex tasks. You can also use verifiers, maybe another LLM

263
00:12:12.279 --> 00:12:15.559
<v Speaker 2>to check the steps, or even fine tune models specifically

264
00:12:15.600 --> 00:12:16.799
<v Speaker 2>on reasoning data sets.

265
00:12:16.879 --> 00:12:19.840
<v Speaker 1>Okay, so we can make them smarter, more reliable, but

266
00:12:19.879 --> 00:12:22.200
<v Speaker 1>these things are huge. How do we actually run them

267
00:12:22.200 --> 00:12:24.879
<v Speaker 1>efficiently in the real world. That sounds like a massive hurdle.

268
00:12:25.000 --> 00:12:26.679
<v Speaker 2>It is a huge hurdle, and that brings us to

269
00:12:26.840 --> 00:12:30.200
<v Speaker 2>choosing and optimizing LLMs for production. First, you have to

270
00:12:30.240 --> 00:12:33.879
<v Speaker 2>pick one. You've got proprietary providers, OpenAI, Google, Anthropic,

271
00:12:34.000 --> 00:12:37.480
<v Speaker 2>via APIs: easy to use, managed. The big players. And

272
00:12:37.519 --> 00:12:43.960
<v Speaker 2>then open-source models: Meta's Llama, EleutherAI, Mistral, Microsoft's Phi models.

273
00:12:44.080 --> 00:12:46.759
<v Speaker 2>You get the model weights, more transparency, more flexibility, but

274
00:12:46.799 --> 00:12:49.200
<v Speaker 2>you often have to manage the deployment yourself or use

275
00:12:49.240 --> 00:12:50.279
<v Speaker 2>specialized platforms.

276
00:12:50.320 --> 00:12:51.759
<v Speaker 1>Trade-offs there? Big time.

277
00:12:51.799 --> 00:12:55.519
<v Speaker 2>Transparency versus convenience, cost versus latency.

278
00:12:55.519 --> 00:12:57.759
<v Speaker 1>And once you pick one, how do you know if it's

279
00:12:57.759 --> 00:13:00.720
<v Speaker 1>any good for your job? Benchmarks are everywhere, but are

280
00:13:00.720 --> 00:13:01.600
<v Speaker 1>they the whole story?

281
00:13:02.000 --> 00:13:05.960
<v Speaker 2>Definitely not the whole story. Evaluating LLMs is super tricky.

282
00:13:06.039 --> 00:13:09.600
<v Speaker 2>Benchmarks can suffer from test set contamination. The model might

283
00:13:09.600 --> 00:13:12.960
<v Speaker 2>have seen the answers in its training data. Cheating, basically?

284
00:13:13.120 --> 00:13:16.399
<v Speaker 2>Kind of. Or models get over-optimized just to score

285
00:13:16.399 --> 00:13:19.159
<v Speaker 2>well on a benchmark, but aren't great in practice, and

286
00:13:19.200 --> 00:13:22.080
<v Speaker 2>they're very sensitive to how you prompt them. Frameworks like

287
00:13:22.120 --> 00:13:28.720
<v Speaker 2>Stanford's HELM try to be more comprehensive, looking at accuracy, robustness, fairness, calibration.

288
00:13:28.840 --> 00:13:29.519
<v Speaker 3>Lots of things.

289
00:13:29.559 --> 00:13:33.559
<v Speaker 2>So you need holistic evaluation and ideally your own internal

290
00:13:33.600 --> 00:13:37.000
<v Speaker 2>benchmarks tailored to your actual use case. That's key. Now,

291
00:13:37.159 --> 00:13:40.559
<v Speaker 2>actually running them, you generally need GPUs for decent speed.

292
00:13:40.559 --> 00:13:43.960
<v Speaker 2>They're just computationally intensive. Right, expensive hardware. Which is where

293
00:13:44.039 --> 00:13:45.120
<v Speaker 2>quantization comes in.

294
00:13:45.440 --> 00:13:46.279
<v Speaker 3>It's a lifesaver.

295
00:13:46.559 --> 00:13:48.720
<v Speaker 1>Quantization? Explain that. Sounds

296
00:13:48.399 --> 00:13:50.440
<v Speaker 3>complex. It's actually a pretty neat idea.

297
00:13:50.679 --> 00:13:53.759
<v Speaker 2>It's about reducing the memory footprint. You take the numbers

298
00:13:53.799 --> 00:13:57.639
<v Speaker 2>inside the model, usually high-precision floating-point numbers like

299
00:13:57.960 --> 00:14:01.000
<v Speaker 2>FP32, and represent them with fewer bits, like

300
00:14:01.120 --> 00:14:04.360
<v Speaker 2>FP16, BF16, or even 8-bit integers,

301
00:14:04.799 --> 00:14:07.879
<v Speaker 2>INT8. So, like compressing the numbers?

302
00:14:07.519 --> 00:14:10.879
<v Speaker 1>Exactly like compressing them, you lose a tiny bit of precision,

303
00:14:11.200 --> 00:14:15.440
<v Speaker 1>usually negligible, but the model becomes much smaller, uses less memory,

304
00:14:15.600 --> 00:14:18.519
<v Speaker 1>and runs faster. Tools like Ollama make it easier

305
00:14:18.519 --> 00:14:21.879
<v Speaker 1>to run these quantized models, even locally on a powerful laptop

306
00:14:21.919 --> 00:14:24.720
<v Speaker 2>sometimes. That's cool. Makes them more accessible. Okay, so it's loaded,

307
00:14:25.080 --> 00:14:27.960
<v Speaker 2>maybe quantized. How do you speed up the inference, the

308
00:14:28.000 --> 00:14:29.639
<v Speaker 2>actual running part, and make it cheaper.

309
00:14:29.840 --> 00:14:33.600
<v Speaker 1>Several key tricks for LLM inference optimization. A huge one

310
00:14:33.679 --> 00:14:36.799
<v Speaker 1>is the KV cache. KV cache? Yeah, key-value cache.

311
00:14:36.960 --> 00:14:38.840
<v Speaker 1>Think of it as the model's short term memory for

312
00:14:38.879 --> 00:14:41.720
<v Speaker 1>the current conversation or task. When you send a prompt,

313
00:14:41.799 --> 00:14:45.200
<v Speaker 1>especially one with instructions, those instructions often stay the same

314
00:14:45.279 --> 00:14:46.279
<v Speaker 1>for follow up questions.

315
00:14:46.519 --> 00:14:49.759
<v Speaker 2>The KV cache stores the internal calculations, the key and

316
00:14:49.879 --> 00:14:53.799
<v Speaker 2>value matrices from self attention related to that initial prompt,

317
00:14:53.840 --> 00:14:56.200
<v Speaker 2>so the model doesn't have to recalculate them every single

318
00:14:56.200 --> 00:14:59.360
<v Speaker 2>time you ask a follow up question. It dramatically speeds

319
00:14:59.360 --> 00:15:00.000
<v Speaker 2>things up after.

320
00:15:00.080 --> 00:15:00.679
<v Speaker 3>The first turn.

321
00:15:01.519 --> 00:15:05.080
<v Speaker 1>Ah, avoids redundant work. Clever. What

322
00:15:05.120 --> 00:15:09.120
<v Speaker 2>else? There's speculative decoding. This is pretty cool. You use

323
00:15:09.159 --> 00:15:12.480
<v Speaker 2>a small, fast draft model to generate a chunk of

324
00:15:12.480 --> 00:15:16.919
<v Speaker 2>tokens quickly, Then the larger, more accurate model verifies those

325
00:15:16.919 --> 00:15:18.120
<v Speaker 2>tokens in a batch. Like

326
00:15:18.080 --> 00:15:20.039
<v Speaker 1>a quick first draft, and then a careful

327
00:15:19.799 --> 00:15:23.360
<v Speaker 2>edit. Exactly. The big model checks the intern's work quickly

328
00:15:23.399 --> 00:15:26.360
<v Speaker 2>instead of doing it all slowly itself. Speeds things up

329
00:15:26.399 --> 00:15:29.559
<v Speaker 2>a lot for generation. Nice. We also use knowledge distillation.

330
00:15:30.120 --> 00:15:33.759
<v Speaker 2>Train a smaller student model to mimic a big teacher model.

331
00:15:33.960 --> 00:15:36.159
<v Speaker 2>You get a faster, cheaper model that retains a lot

332
00:15:36.159 --> 00:15:36.879
<v Speaker 2>of the capability.

333
00:15:37.080 --> 00:15:40.240
<v Speaker 1>Think DistilBERT, right? Smaller but still capable.
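
NOTE
Editor's sketch, not from the speakers: one common way to express "student mimics teacher" is a KL divergence between temperature-softened output distributions.
The speakers do not say which loss they have in mind, so the temperature of 2.0, the logits, and the function names below are all illustrative assumptions.
```python
# Toy distillation loss: how far the student's softened distribution is from the teacher's.
import math
def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions, scaled by temperature squared.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # ranks the classes the way the teacher does
far_student = [0.5, 1.0, 4.0]     # prefers the class the teacher likes least
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
same = distillation_loss(teacher, teacher)
```
Training the student means minimizing this value over many examples, usually alongside the ordinary task loss.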

334
00:15:39.879 --> 00:15:43.720
<v Speaker 2>And things like parallel decoding for generating multiple parts simultaneously,

335
00:15:44.000 --> 00:15:46.919
<v Speaker 2>or early exit where simpler queries might get an answer

336
00:15:46.919 --> 00:15:48.960
<v Speaker 2>from an earlier layer of the model without going all

337
00:15:49.000 --> 00:15:51.080
<v Speaker 2>the way through. Lots of techniques to

338
00:15:51.000 --> 00:15:53.879
<v Speaker 1>make them practical. Okay, this brings us squarely to the

339
00:15:53.919 --> 00:15:57.399
<v Speaker 1>application layer. How do we take these optimized LLMs and

340
00:15:57.440 --> 00:16:00.759
<v Speaker 1>actually plug them into complex software? They can't just operate

341
00:16:00.759 --> 00:16:01.720
<v Speaker 1>in a vacuum, can they?

342
00:16:01.879 --> 00:16:05.120
<v Speaker 2>No, definitely not. They have real limitations. Knowledge cutoff is

343
00:16:05.120 --> 00:16:08.159
<v Speaker 2>a big one. They don't know about yesterday's news unless retrained.

344
00:16:08.720 --> 00:16:13.480
<v Speaker 2>They struggle with precise math, no factual guarantees, can't easily

345
00:16:13.519 --> 00:16:18.960
<v Speaker 2>cite sources, and context windows, while growing, are still finite.

346
00:16:18.879 --> 00:16:20.440
<v Speaker 1>So they need help from the outside world.

347
00:16:20.519 --> 00:16:21.080
<v Speaker 3>Precisely.

348
00:16:21.120 --> 00:16:23.960
<v Speaker 2>You need to interface them with external tools and data.

349
00:16:24.360 --> 00:16:29.399
<v Speaker 2>We generally talk about three core LLM interaction paradigms. Okay. First,

350
00:16:29.519 --> 00:16:34.399
<v Speaker 2>the passive approach. This is basically retrieval-augmented generation, or RAG.

351
00:16:35.000 --> 00:16:38.120
<v Speaker 2>The LLM just receives information in its prompt. It doesn't

352
00:16:38.120 --> 00:16:40.879
<v Speaker 2>know where it came from. You feed it the relevant context.

353
00:16:40.679 --> 00:16:42.679
<v Speaker 1>Giving it the answer key snippet.

354
00:16:42.440 --> 00:16:44.720
<v Speaker 2>Kind of, yeah. Perfect for Q&A over your

355
00:16:44.720 --> 00:16:47.879
<v Speaker 2>own private documents. You retrieve the relevant text, put it

356
00:16:47.919 --> 00:16:49.919
<v Speaker 2>in the prompt, and the LLM's answer

357
00:16:49.600 --> 00:16:50.120
<v Speaker 3>is based on that.

358
00:16:50.240 --> 00:16:51.679
<v Speaker 1>Okay, passive, what's next?

359
00:16:52.039 --> 00:16:55.080
<v Speaker 3>Explicit tool use. Here the LLM is more active.

360
00:16:55.480 --> 00:16:57.600
<v Speaker 2>You give it instructions and a set of tools it

361
00:16:57.639 --> 00:17:00.039
<v Speaker 2>can use, like a web search tool, a calculator, a

362
00:17:00.399 --> 00:17:01.759
<v Speaker 2>database connector. And

363
00:17:01.679 --> 00:17:03.639
<v Speaker 1>it chooses which tool to use? Exactly.

364
00:17:04.000 --> 00:17:07.920
<v Speaker 2>Frameworks like LangChain help manage this. The LLM decides, okay,

365
00:17:07.960 --> 00:17:09.920
<v Speaker 2>to answer this, I need to search the web, and

366
00:17:09.960 --> 00:17:13.200
<v Speaker 2>it triggers the search tool. It becomes an orchestrator, more interactive.

367
00:17:13.279 --> 00:17:14.319
<v Speaker 1>And the third? The

368
00:17:14.200 --> 00:17:19.079
<v Speaker 2>most advanced is the agentic paradigm. Think autonomous agents. These

369
00:17:19.240 --> 00:17:22.720
<v Speaker 2>LLMs can interact with their environment, break down complex goals

370
00:17:22.720 --> 00:17:25.920
<v Speaker 2>into subtasks, and take a sequence of actions using tools

371
00:17:25.920 --> 00:17:26.799
<v Speaker 2>to achieve the goal.

372
00:17:27.079 --> 00:17:28.839
<v Speaker 1>Like that Apple CFO example you

373
00:17:28.799 --> 00:17:32.200
<v Speaker 2>mentioned earlier? Exactly like that. Who was Apple's CFO at

374
00:17:32.200 --> 00:17:35.039
<v Speaker 2>its lowest stock price in ten years? The agent figures

375
00:17:35.079 --> 00:17:39.480
<v Speaker 2>out: one, get stock data; two, find the lowest point; three,

376
00:17:39.720 --> 00:17:42.799
<v Speaker 2>find the CFO for that date. It plans and executes.

377
00:17:43.119 --> 00:17:46.240
<v Speaker 1>Wow, that's powerful. Still limitations though, you said?

378
00:17:46.079 --> 00:17:48.680
<v Speaker 2>Oh yeah, current agents can still get stuck in loops,

379
00:17:48.839 --> 00:17:51.519
<v Speaker 2>choose the wrong tool, or just fail. It's definitely the

380
00:17:51.519 --> 00:17:53.079
<v Speaker 2>frontier, very active research.

381
00:17:53.240 --> 00:17:56.839
<v Speaker 1>Okay, but let's go back to RAG, retrieval-augmented generation.

382
00:17:57.200 --> 00:18:00.079
<v Speaker 1>You said it's passive, but it feels like the cornerstone

383
00:18:00.119 --> 00:18:03.880
<v Speaker 1>of so many practical LLM apps today. Let's really dive

384
00:18:03.960 --> 00:18:07.079
<v Speaker 1>deep into RAG. Why is it so vital and how

385
00:18:07.079 --> 00:18:08.000
<v Speaker 1>does it actually work

386
00:18:08.079 --> 00:18:10.920
<v Speaker 3>under the hood? RAG is absolutely fundamental.

387
00:18:10.960 --> 00:18:14.480
<v Speaker 2>Its main job is letting LLMs access your specific private

388
00:18:14.559 --> 00:18:16.000
<v Speaker 2>data, stuff it never saw

389
00:18:15.880 --> 00:18:18.160
<v Speaker 1>during training. Right, bridging the knowledge gap.

390
00:18:18.039 --> 00:18:22.440
<v Speaker 2>Exactly, and by doing that it drastically reduces hallucinations because

391
00:18:22.599 --> 00:18:26.519
<v Speaker 2>responses are grounded in actual provided text. It allows for citations,

392
00:18:26.519 --> 00:18:29.480
<v Speaker 2>it lets the LLM talk about recent events, and it

393
00:18:29.559 --> 00:18:31.319
<v Speaker 2>handles the long tail entities.

394
00:18:31.640 --> 00:18:33.799
<v Speaker 1>Long-tail entities, what are those again? Think

395
00:18:33.880 --> 00:18:36.079
<v Speaker 3>really niche facts, stuff so rare

396
00:18:36.119 --> 00:18:38.680
<v Speaker 2>it might only appear once or twice in trillions of

397
00:18:38.720 --> 00:18:42.640
<v Speaker 2>tokens of training data. LLMs struggle to memorize that. RAG

398
00:18:42.640 --> 00:18:45.759
<v Speaker 2>retrieves that specific fact just when needed. Without RAG,

399
00:18:45.799 --> 00:18:49.000
<v Speaker 2>you'd need impossibly huge models to maybe memorize everything.

400
00:18:49.240 --> 00:18:53.200
<v Speaker 1>So RAG is essential for accessing specific, less common knowledge. Totally.

401
00:18:53.480 --> 00:18:55.960
<v Speaker 2>The RAG pipeline itself is actually quite sophisticated

402
00:18:55.960 --> 00:18:56.160
<v Speaker 1>now.

403
00:18:56.240 --> 00:18:57.440
<v Speaker 3>It's not just a simple

404
00:18:57.200 --> 00:18:58.880
<v Speaker 1>lookup. Okay, walk us through the steps.

405
00:18:59.119 --> 00:19:02.240
<v Speaker 2>It often starts with rewrite. The user's query might get

406
00:19:02.279 --> 00:19:06.039
<v Speaker 2>rephrased to better match the documents. Sometimes an LLM even

407
00:19:06.079 --> 00:19:09.359
<v Speaker 2>generates a hypothetical document (HyDE) that would answer the query,

408
00:19:09.599 --> 00:19:10.799
<v Speaker 2>and then you search using

409
00:19:10.640 --> 00:19:13.279
<v Speaker 1>that. Clever, search for the ideal answer

410
00:19:13.039 --> 00:19:18.119
<v Speaker 2>shape. Kind of. Then retrieve: fetching potentially relevant documents, often

411
00:19:18.240 --> 00:19:21.640
<v Speaker 2>using embeddings for semantic similarity. But other methods exist too,

412
00:19:21.759 --> 00:19:25.519
<v Speaker 2>like generative retrieval, where the LLM predicts document IDs.

413
00:19:25.640 --> 00:19:28.079
<v Speaker 1>Okay, got a pile of potential documents. Then?

414
00:19:28.000 --> 00:19:32.640
<v Speaker 2>Rerank. That initial retrieval might grab some irrelevant stuff. Reranking

415
00:19:32.720 --> 00:19:36.480
<v Speaker 2>uses another model, often a smaller specialized one, to score

416
00:19:36.559 --> 00:19:41.079
<v Speaker 2>and reorder the retrieved docs by true relevance. Quality

417
00:19:40.680 --> 00:19:43.200
<v Speaker 1>control, refining the results. What's next?

418
00:19:43.440 --> 00:19:46.920
<v Speaker 2>Refine. Now you might shorten or summarize the relevant snippets

419
00:19:46.920 --> 00:19:49.680
<v Speaker 2>to fit the context window better and make them more useful.

420
00:19:49.960 --> 00:19:52.559
<v Speaker 2>Techniques like Chain-of-Note use an LLM to generate

421
00:19:52.599 --> 00:19:56.960
<v Speaker 2>summaries or bullet points, highlighting key info from the retrieved

422
00:19:56.519 --> 00:19:59.559
<v Speaker 1>chunks. Making it digestible for the main LLM. Precisely.

423
00:20:00.160 --> 00:20:02.759
<v Speaker 2>Then insert. This is just about how you put that

424
00:20:02.799 --> 00:20:06.920
<v Speaker 2>refined context into the final prompt. Turns out LLMs often

425
00:20:06.920 --> 00:20:09.200
<v Speaker 2>pay more attention to stuff at the very beginning or

426
00:20:09.279 --> 00:20:11.599
<v Speaker 2>very end of the prompt: attention biases.

427
00:20:11.759 --> 00:20:12.920
<v Speaker 1>Huh, interesting quirk.

428
00:20:13.039 --> 00:20:13.480
<v Speaker 3>Yeah.

429
00:20:13.720 --> 00:20:18.480
<v Speaker 2>Finally, generate. The main LLM produces the answer, grounded by

430
00:20:18.480 --> 00:20:21.720
<v Speaker 2>the context you just carefully prepared and inserted. That's the

431
00:20:21.759 --> 00:20:24.720
<v Speaker 2>basic flow. Well, and it can get even more complex.

432
00:20:24.839 --> 00:20:28.880
<v Speaker 2>Techniques like FLARE interleave generation and retrieval. The LLM starts writing,

433
00:20:28.920 --> 00:20:32.920
<v Speaker 2>identifies a bit it's unsure about (low-confidence tokens), pauses,

434
00:20:33.279 --> 00:20:37.799
<v Speaker 2>retrieves more specific info just for that gap, then continues generating.
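
NOTE
Editor's sketch, not from the speakers: the basic stages walked through above (rewrite, retrieve, rerank, refine, insert) wired together as plain functions.
Every stage is a deliberately crude stand-in: word counts replace an embedding model, word overlap replaces a reranker model, and truncation replaces summarization.
The three-document corpus and all function names are hypothetical. The final generate step, the LLM call itself, is left out.
```python
# Toy RAG pipeline: every stage is a stand-in for a real model or index.
import math
from collections import Counter
DOCS = {
    "doc1": "The KV cache stores key and value matrices from self attention.",
    "doc2": "Speculative decoding uses a small draft model and a large verifier.",
    "doc3": "Knowledge distillation trains a small student to mimic a teacher.",
}
def embed(text):
    # Bag-of-words counts stand in for a learned embedding model.
    return Counter(text.lower().replace(".", "").replace("?", "").split())
def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
def rewrite(query):
    # A real system might ask an LLM to rephrase; here one abbreviation is expanded.
    return query.replace("KV", "key value KV")
def retrieve(query, k=2):
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)[:k]
def rerank(query, doc_ids):
    # Stand-in for a dedicated reranker: score by number of shared words.
    q = set(embed(query))
    return sorted(doc_ids, key=lambda d: len(q & set(embed(DOCS[d]))), reverse=True)
def refine(doc_id, max_words=12):
    # Stand-in for summarization: keep only the first few words.
    return " ".join(DOCS[doc_id].split()[:max_words])
def insert(query, snippets):
    # Context goes first and the question last, the two ends of the prompt.
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"Context:\n{context}\nQuestion: {query}\nAnswer using only the context."
def build_prompt(query):
    q = rewrite(query)
    ranked = rerank(q, retrieve(q))
    return ranked, insert(query, [refine(d) for d in ranked])
ranked, prompt = build_prompt("What does the KV cache store?")
```
The numbered snippets in the prompt are what make citations possible: the model can be asked to name the snippet it relied on.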

435
00:20:37.880 --> 00:20:41.920
<v Speaker 1>Wow, dynamically retrieving info mid sentence. That's intricate. Now, with

436
00:20:42.119 --> 00:20:45.359
<v Speaker 1>context windows getting huge, like Gemini 1.5 Pro's

437
00:20:45.440 --> 00:20:48.559
<v Speaker 1>million tokens, does RAG start to become less important? Can't

438
00:20:48.559 --> 00:20:50.720
<v Speaker 1>we just stuff everything in the context?

439
00:20:50.759 --> 00:20:53.839
<v Speaker 2>That's a really common question, and maybe counterintuitively, no,

440
00:20:53.920 --> 00:20:56.519
<v Speaker 2>RAG is still crucial, maybe even more crucial. It is

441
00:20:56.559 --> 00:20:59.240
<v Speaker 2>that because even a million tokens of real world data

442
00:20:59.279 --> 00:21:02.960
<v Speaker 2>can be incredibly noisy. Finding that one specific fact, the

443
00:21:03.000 --> 00:21:05.480
<v Speaker 2>needle in the haystack, is still hard for the LLM.

444
00:21:06.160 --> 00:21:10.599
<v Speaker 2>RAG provides that targeted grounding. It ensures factuality and allows

445
00:21:10.599 --> 00:21:13.079
<v Speaker 2>citations in a way that just having a massive context

446
00:21:13.119 --> 00:21:14.359
<v Speaker 2>window doesn't guarantee.

447
00:21:14.440 --> 00:21:17.640
<v Speaker 1>So long context helps fit more potential info, but RAG

448
00:21:17.799 --> 00:21:20.000
<v Speaker 1>helps find the right info within it. Exactly.

449
00:21:20.240 --> 00:21:23.640
<v Speaker 2>It reduces the noise problem. Long context isn't a magic

450
00:21:23.680 --> 00:21:26.799
<v Speaker 2>bullet for messy data or the need for verifiable grounding,

451
00:21:27.359 --> 00:21:30.720
<v Speaker 2>and often RAG and fine-tuning work together. You might

452
00:21:30.759 --> 00:21:34.519
<v Speaker 2>fine-tune the retriever, the reranker, the generator. Each part benefits.

453
00:21:34.559 --> 00:21:38.880
<v Speaker 1>Okay, that makes sense. So given all this complexity, RAG pipelines, optimization,

454
00:21:39.039 --> 00:21:42.319
<v Speaker 1>choosing models, how do developers actually build and manage these systems?

455
00:21:42.559 --> 00:21:44.640
<v Speaker 1>It sounds way beyond just fiddling with prompts.

456
00:21:44.680 --> 00:21:47.759
<v Speaker 2>Oh, it absolutely is. It requires real system design. One

457
00:21:47.799 --> 00:21:50.240
<v Speaker 2>common pattern is multi-LLM

458
00:21:49.720 --> 00:21:51.720
<v Speaker 1>architectures. Using more than one LLM?

459
00:21:51.880 --> 00:21:55.519
<v Speaker 2>Yeah, maybe a small, fast, cheap model for simple tasks

460
00:21:55.640 --> 00:21:58.559
<v Speaker 2>or routing the query, then escalating to a big, powerful

461
00:21:58.559 --> 00:22:01.680
<v Speaker 2>model for the complex reasoning. A cascade or a router

462
00:22:01.759 --> 00:22:06.079
<v Speaker 2>setup balances cost and capability. And the programming paradigms are

463
00:22:06.079 --> 00:22:11.000
<v Speaker 2>evolving too. We're moving away from just endless manual prompt engineering.
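
NOTE
Editor's sketch, not from the speakers: the cascade setup mentioned a moment ago, reduced to its control flow.
Both models are hypothetical stubs that return an answer plus a confidence score, and the 0.8 threshold is an arbitrary illustrative value.
How a real small model would estimate its own confidence is left open here.
```python
# Toy cascade: try the cheap model first, escalate only when it is unsure.
def small_model(query):
    # Hypothetical cheap model: confident only on a few inputs it recognizes.
    known = {"2+2": "4", "3*3": "9"}
    if query in known:
        return known[query], 0.95
    return "not sure", 0.2
def large_model(query):
    # Hypothetical expensive model: placeholder answer, always confident.
    return f"detailed answer to: {query}", 0.9
def cascade(query, threshold=0.8):
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(query)
    return answer, "large"
easy = cascade("2+2")
hard = cascade("Compare two retrieval strategies")
```
A router differs only in deciding up front which model to call, rather than trying the small one first.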

464
00:22:11.079 --> 00:22:11.720
<v Speaker 1>Thank goodness.

465
00:22:11.799 --> 00:22:15.559
<v Speaker 2>Right. Frameworks like DSPy are pushing this idea of programming,

466
00:22:15.680 --> 00:22:16.440
<v Speaker 2>not prompting.

467
00:22:16.519 --> 00:22:18.359
<v Speaker 1>DSPy? Tell me more.

468
00:22:18.759 --> 00:22:21.920
<v Speaker 2>It abstracts the prompting process. You define the flow using

469
00:22:22.039 --> 00:22:25.359
<v Speaker 2>modules that represent techniques like Chain of Thought or ReAct.

470
00:22:26.039 --> 00:22:29.200
<v Speaker 2>Then DSPy figures out the best prompt and even model

471
00:22:29.200 --> 00:22:32.440
<v Speaker 2>parameters to make that flow work for your data, often

472
00:22:32.440 --> 00:22:37.799
<v Speaker 2>through optimization. It separates the program logic from the prompt tinkering.

473
00:22:37.480 --> 00:22:39.960
<v Speaker 1>So the framework optimizes the prompts for you? Largely,

474
00:22:40.039 --> 00:22:43.079
<v Speaker 2>yes. It makes for more robust, generalizable applications.

475
00:22:43.160 --> 00:22:45.640
<v Speaker 1>That's a big shift. What about controlling the output format

476
00:22:45.640 --> 00:22:46.319
<v Speaker 1>more strictly?

477
00:22:46.559 --> 00:22:49.319
<v Speaker 2>For that, look at things like LMQL. It stands for

478
00:22:49.519 --> 00:22:52.240
<v Speaker 2>Language Model Query Language. It's like SQL but for

479
00:22:52.319 --> 00:22:53.880
<v Speaker 3>LLMs, embedded in Python.

480
00:22:54.039 --> 00:22:54.319
<v Speaker 1>Okay.

481
00:22:54.359 --> 00:22:57.640
<v Speaker 2>It lets you write prompts but add declarative constraints. Like,

482
00:22:57.680 --> 00:23:00.599
<v Speaker 2>you could define a template for generating a Jeopardy clue,

483
00:23:00.920 --> 00:23:03.960
<v Speaker 2>then add a where clause in LMQL saying the output

484
00:23:04.039 --> 00:23:06.480
<v Speaker 2>must be phrased as a question and be exactly three

485
00:23:06.480 --> 00:23:07.079
<v Speaker 2>words long.

486
00:23:07.279 --> 00:23:09.519
<v Speaker 1>Ah, so you enforce structure directly.

487
00:23:09.400 --> 00:23:12.759
<v Speaker 2>Exactly, much more control over the output format than just

488
00:23:12.839 --> 00:23:16.400
<v Speaker 2>asking nicely in the prompt. Really useful for structured data

489
00:23:16.480 --> 00:23:20.319
<v Speaker 2>extraction or making the LLM fit into existing systems.

490
00:23:20.640 --> 00:23:23.160
<v Speaker 1>What an incredible journey we've been on really, from the

491
00:23:23.640 --> 00:23:27.880
<v Speaker 1>fundamental building blocks, that messy data, the weirdness of tokens,

492
00:23:28.319 --> 00:23:31.680
<v Speaker 1>the transformer, all the way to these advanced techniques: RAG,

493
00:23:31.839 --> 00:23:38.200
<v Speaker 1>optimization, DSPy, LMQL, making them robust, efficient, integrated.

494
00:23:38.279 --> 00:23:39.400
<v Speaker 3>Yeah, it's a lot to cover.

495
00:23:39.640 --> 00:23:42.519
<v Speaker 1>It really shows these LLMs aren't just inscrutable black boxes,

496
00:23:42.559 --> 00:23:46.519
<v Speaker 1>are they? They're complex, yes, but they can be steered, optimized,

497
00:23:46.680 --> 00:23:47.559
<v Speaker 1>connected to the world.

498
00:23:47.680 --> 00:23:50.440
<v Speaker 2>Absolutely, they are tools we can shape and direct. And

499
00:23:50.480 --> 00:23:52.960
<v Speaker 2>maybe that's the provocative thought to leave you with. This

500
00:23:53.000 --> 00:23:55.880
<v Speaker 2>field is moving so incredibly fast. What we discussed today

501
00:23:55.920 --> 00:23:59.240
<v Speaker 2>as best practice, it might honestly be different in six

502
00:23:59.279 --> 00:23:59.960
<v Speaker 2>months or a year.

503
00:24:00.039 --> 00:24:01.559
<v Speaker 1>Constant change, constant change.

504
00:24:01.599 --> 00:24:04.079
<v Speaker 2>So the real lasting value for you as someone learning

505
00:24:04.160 --> 00:24:08.000
<v Speaker 2>or building in this space isn't just memorizing today's specific techniques.

506
00:24:08.279 --> 00:24:11.759
<v Speaker 2>It's getting that deeper, intuitive grasp of the underlying

507
00:24:11.279 --> 00:24:13.480
<v Speaker 1>principles. Understanding the why.

508
00:24:13.440 --> 00:24:16.839
<v Speaker 2>Understanding the why, how they work, why they fail, how

509
00:24:16.720 --> 00:24:17.559
<v Speaker 3>to nudge them.

510
00:24:17.920 --> 00:24:21.359
<v Speaker 2>That intuition will be your best asset as the technology

511
00:24:21.440 --> 00:24:24.119
<v Speaker 2>keeps evolving at this frankly breathtaking pace.

512
00:24:24.240 --> 00:24:27.359
<v Speaker 1>So keep exploring, keep asking questions, and keep building.

513
00:24:27.839 --> 00:24:29.400
<v Speaker 3>That's where the real learning happens.
