WEBVTT 1 00:00:00.160 --> 00:00:02.960 Welcome to the deep dive. If you've been watching the 2 00:00:03.000 --> 00:00:05.919 tech world, you know we're living through an incredible moment. 3 00:00:06.960 --> 00:00:10.720 This AI revolution powered by large language models, or 4 00:00:11.000 --> 00:00:12.679 LLMs, it really is something else. 5 00:00:12.759 --> 00:00:14.720 Yeah, it's not just a minor tech update. It feels 6 00:00:14.720 --> 00:00:17.480 like one of those huge shifts, you know, like the computer, 7 00:00:17.559 --> 00:00:18.760 the internet, or the smartphone. 8 00:00:18.760 --> 00:00:19.719 Definitely pivotal. 9 00:00:19.920 --> 00:00:23.600 We're seeing these prototypes that seem almost magical. You can 10 00:00:23.640 --> 00:00:26.559 write stories, generate code, it's amazing. 11 00:00:26.600 --> 00:00:27.879 The demos are stunning. 12 00:00:28.000 --> 00:00:30.760 But here's the thing, right? Taking that cool demo and 13 00:00:30.800 --> 00:00:35.479 making it a reliable, production-grade application, that's, well, that's 14 00:00:35.520 --> 00:00:36.399 a whole different 15 00:00:36.159 --> 00:00:37.039 creature. Much harder game. 16 00:00:37.119 --> 00:00:39.079 Yeah. So our mission today is really to cut through 17 00:00:39.079 --> 00:00:43.479 some of that hype and navigate this pretty complex landscape 18 00:00:43.479 --> 00:00:47.159 of LLM development. We want to equip you, our listener, 19 00:00:47.640 --> 00:00:51.000 with the core intuition, some surprising facts maybe, and the 20 00:00:51.000 --> 00:00:55.320 practical tools you'll need to build genuinely sophisticated applications, the 21 00:00:55.359 --> 00:00:56.600 ones that actually work. 22 00:00:56.560 --> 00:00:58.759 And to guide us on this deep dive, we're leaning 23 00:00:58.799 --> 00:01:02.600 pretty heavily on a fantastic resource, Designing Large Language 24 00:01:02.640 --> 00:01:05.879 Model Applications by Suhas Pai. What's great about it,
I 25 00:01:05.920 --> 00:01:08.959 think, is that it's not just some dry technical manual. 26 00:01:09.319 --> 00:01:09.959 It gives this 27 00:01:10.040 --> 00:01:15.000 really holistic overview for, you know, software engineers and ML folks, 28 00:01:15.040 --> 00:01:16.799 product managers, anyone involved. 29 00:01:16.840 --> 00:01:17.439 That's useful. 30 00:01:17.640 --> 00:01:20.920 Yeah, and it provides surprising depth that helps you understand 31 00:01:21.280 --> 00:01:25.040 not just what the models do, but fundamentally why they 32 00:01:25.079 --> 00:01:27.319 behave the way they do. And that why is crucial, 33 00:01:27.599 --> 00:01:32.359 absolutely crucial, especially for getting past fragile prototypes to something robust. 34 00:01:32.560 --> 00:01:35.480 Okay, let's unpack this then. When we talk about LLMs, 35 00:01:35.560 --> 00:01:37.640 what are they actually made of? Like, what are the 36 00:01:37.680 --> 00:01:39.840 basic ingredients before they even start learning? 37 00:01:40.319 --> 00:01:40.519 Right. 38 00:01:40.599 --> 00:01:43.079 So, at their very core, LLMs are built on pre- 39 00:01:43.120 --> 00:01:43.959 training data. 40 00:01:44.040 --> 00:01:46.079 That's the raw fuel. Data, got it. 41 00:01:46.200 --> 00:01:48.640 And you know that old saying, garbage in, garbage out? 42 00:01:48.760 --> 00:01:53.400 It applies massively here. The scale and, maybe even more importantly, 43 00:01:53.480 --> 00:01:56.079 the quality of this data is paramount. 44 00:01:56.239 --> 00:01:57.359 So where does it all come from? 45 00:01:57.439 --> 00:02:01.040 We're talking colossal amounts of text. A huge chunk often 46 00:02:01.079 --> 00:02:04.239 comes from web text, like from Common Crawl. Massive, but 47 00:02:04.599 --> 00:02:07.480 it needs so much cleaning because, well, the internet's 48 00:02:07.079 --> 00:02:09.400 messy. Understatement of the year, huh. 49 00:02:09.479 --> 00:02:09.680 Right.
50 00:02:09.919 --> 00:02:12.560 Then you have things like WebText or OpenWebText. 51 00:02:12.719 --> 00:02:16.039 They often use signals from places like Reddit outbound links, 52 00:02:16.080 --> 00:02:19.439 specifically trying to filter for, you know, higher quality stuff. 53 00:02:19.599 --> 00:02:21.800 Wisdom of the crowd, kind of. Interesting. 54 00:02:21.919 --> 00:02:22.319 What else? 55 00:02:22.520 --> 00:02:26.479 There's factual knowledge from Wikipedia, super valuable for accuracy, but 56 00:02:26.560 --> 00:02:27.319 the style 57 00:02:27.080 --> 00:02:29.000 is very formal. Yeah, very encyclopedic. 58 00:02:29.120 --> 00:02:33.360 And historically BooksCorpus was big, lots of narrative, but surprisingly 59 00:02:33.479 --> 00:02:36.840 like twenty six percent romance novels from unpublished authors, so 60 00:02:37.199 --> 00:02:37.960 quite specific. 61 00:02:38.000 --> 00:02:38.240 Wow. 62 00:02:38.319 --> 00:02:41.719 Okay, and now you see newer efforts like Hugging Face's 63 00:02:41.800 --> 00:02:44.960 FineWeb, aiming for even cleaner web data, and 64 00:02:45.039 --> 00:02:47.759 FineWeb-Edu, focusing on educational content. 65 00:02:48.439 --> 00:02:50.759 Fifteen trillion tokens. It's huge. 66 00:02:50.840 --> 00:02:54.599 So it's clearly not just about quantity, it's about cleaning 67 00:02:54.680 --> 00:02:57.560 and curating this raw material. What does that involve exactly? 68 00:02:57.680 --> 00:03:00.840 Data preprocessing is key. It's not glamorous, but maybe the 69 00:03:00.840 --> 00:03:03.400 most vital step. Like what, specifically? You've got to strip 70 00:03:03.479 --> 00:03:07.000 out all the web boilerplate: menus, nav links, lorem ipsum placeholders, 71 00:03:07.039 --> 00:03:07.719 all that junk. 72 00:03:08.240 --> 00:03:10.599 And language identification 73 00:03:10.199 --> 00:03:15.199 is surprisingly tricky. Even in supposedly English-only datasets, 74 00:03:15.319 --> 00:03:16.599 other languages creep in.
75 00:03:17.039 --> 00:03:19.800 If you don't catch that, your model might suddenly start 76 00:03:19.800 --> 00:03:21.400 speaking Spanish, which 77 00:03:21.199 --> 00:03:23.439 could be a bug, or maybe a 78 00:03:23.400 --> 00:03:26.759 feature. Could be either. And quality filtering is vital too, 79 00:03:27.039 --> 00:03:29.159 often using things like perplexity scores. 80 00:03:29.280 --> 00:03:31.840 Okay, perplexity scores, how does that work? Break that down. 81 00:03:31.919 --> 00:03:33.039 Sure. Think of it like this. 82 00:03:33.879 --> 00:03:36.199 If you're trying to predict the next word in a 83 00:03:36.240 --> 00:03:39.039 really well-written, clear sentence, it's pretty easy. 84 00:03:39.560 --> 00:03:43.039 Low uncertainty. That's low perplexity. Makes sense. But if you're 85 00:03:43.080 --> 00:03:43.639 trying to 86 00:03:43.599 --> 00:03:46.719 guess the next word in some garbled text full of errors, 87 00:03:46.759 --> 00:03:50.400 it's super hard. High uncertainty, that's high perplexity. 88 00:03:50.479 --> 00:03:54.280 Ah, okay. So high perplexity means noisy, bad data. 89 00:03:54.319 --> 00:03:56.400 Basically, yeah. You probably don't want to feed that to 90 00:03:56.439 --> 00:03:57.439 your expensive model. 91 00:03:57.560 --> 00:04:01.319 Got it. And after cleaning, you mentioned deduplication and privacy. 92 00:04:01.560 --> 00:04:02.639 Why is that so important? 93 00:04:02.639 --> 00:04:03.360 Oh, it's massive. 94 00:04:03.400 --> 00:04:06.360 Web text is full of duplicates. Removing them isn't just 95 00:04:06.360 --> 00:04:11.680 about efficiency. It's critical to stop LLMs from accidentally memorizing 96 00:04:11.680 --> 00:04:15.000 and leaking PII, personally identifiable information. 97 00:04:14.919 --> 00:04:17.120 Right, even if it's technically published. Exactly. 98 00:04:17.199 --> 00:04:20.560 That's the whole contextual integrity issue.
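To make the perplexity filter from a moment ago concrete, here is a minimal sketch. It assumes some reference model has already assigned a probability to each token of a document; the probability lists and the threshold of 50 are made-up illustrations, not values from the book or any real pipeline.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

def keep_document(token_probs, max_perplexity=50.0):
    """Keep a document only if its perplexity is at or under a hypothetical threshold."""
    return perplexity(token_probs) <= max_perplexity

# Hypothetical next-token probabilities from a reference model.
clean_text_probs = [0.5, 0.4, 0.6, 0.5]       # fluent sentence: rarely surprised
noisy_text_probs = [0.01, 0.02, 0.005, 0.01]  # garbled text: surprised at every step

print(keep_document(clean_text_probs), keep_document(noisy_text_probs))
```

The fluent example scores a perplexity of about 2 and is kept, while the garbled one scores about 100 and is dropped. In practice the threshold would be tuned on real data.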
Should an AI just 99 00:04:20.600 --> 00:04:23.399 blurt out someone's address because it found it online somewhere? 100 00:04:23.560 --> 00:04:25.600 It's tricky, especially with public figures. 101 00:04:26.319 --> 00:04:28.079 Complex ethical ground. Totally. 102 00:04:28.839 --> 00:04:31.040 And what's wild is that even a tiny bit of 103 00:04:31.120 --> 00:04:35.759 manipulated data, like less than point one percent, can potentially 104 00:04:35.800 --> 00:04:38.519 make it easier for other sensitive data to leak. 105 00:04:38.800 --> 00:04:41.800 Wow. Okay, that's a lot about the raw material. But 106 00:04:41.879 --> 00:04:44.839 this next part, for me, is where it gets really fascinating. 107 00:04:45.399 --> 00:04:48.879 How do these models actually read? It's not like they 108 00:04:48.879 --> 00:04:50.279 see words like we do, is it? 109 00:04:50.399 --> 00:04:50.920 You're spot on. 110 00:04:51.000 --> 00:04:53.600 They don't process discrete words like humans. They use something 111 00:04:53.639 --> 00:04:56.199 called tokens, and often these tokens are 112 00:04:56.079 --> 00:04:59.199 subwords. Subwords, like parts of words? Kind of, yeah. 113 00:04:59.480 --> 00:05:02.959 So, for example, in gpt ex, office might be one token, 114 00:05:03.000 --> 00:05:05.759 but office with that little marker meaning a space before it 115 00:05:05.800 --> 00:05:06.639 is a different token. 116 00:05:06.959 --> 00:05:09.480 Case matters too: office versus Office. 117 00:05:09.560 --> 00:05:11.319 Okay, so it's more granular. Exactly. 118 00:05:11.600 --> 00:05:14.879 And the subword approach is clever because it mostly avoids 119 00:05:14.600 --> 00:05:16.120 the out-of-vocabulary problem. 120 00:05:16.439 --> 00:05:18.720 If it sees a totally new word, it can usually 121 00:05:18.720 --> 00:05:22.399 break it down into known subword pieces instead of just crashing.
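The subword idea can be sketched in a few lines. This is a deliberately simplified greedy longest-match splitter over a tiny made-up vocabulary. Production tokenizers such as byte-level BPE learn their vocabulary and merge rules from data, so treat the `VOCAB` set and the fallback rule here as illustrative assumptions.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword split; falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece starts here: emit the character itself so nothing is lost.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: note the separate entries for leading space and capital letter.
VOCAB = {"office", " office", "Office", "un", "believ", "able"}

print(tokenize("office", VOCAB))        # one token
print(tokenize(" office", VOCAB))       # a different token, because of the leading space
print(tokenize("unbelievable", VOCAB))  # an unseen word split into known pieces
```

The last call shows the out-of-vocabulary point: "unbelievable" is not in the vocabulary, yet it still tokenizes as "un", "believ", "able".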
122 00:05:22.519 --> 00:05:25.199 So it's almost like they're reading in syllables or morphemes, 123 00:05:25.199 --> 00:05:25.839 not whole words. 124 00:05:25.839 --> 00:05:28.879 Oh, a bit like that, yeah. Smaller meaningful units. And 125 00:05:28.959 --> 00:05:34.720 sometimes this process creates weird artifacts, glitch tokens. Glitch tokens? Yeah, 126 00:05:34.839 --> 00:05:38.120 or undertrained ones. There's this great story about SolidGoldMagikarp. 127 00:05:38.160 --> 00:05:40.920 It was a Reddit username that actually became 128 00:05:41.040 --> 00:05:42.360 a token in GPT-2. 129 00:05:42.480 --> 00:05:44.079 Seriously, a username? Yep. 130 00:05:44.519 --> 00:05:47.439 But then later models like GPT-3 were trained on 131 00:05:47.480 --> 00:05:50.720 different data where that token barely appeared. It had no 132 00:05:50.800 --> 00:05:53.720 training signal. So if you fed GPT-3 SolidGoldMagikarp, 133 00:05:53.720 --> 00:05:55.920 it would just act weirdly, like it had 134 00:05:55.959 --> 00:05:57.839 no clue what to do. Bizarre. It is. 135 00:05:58.079 --> 00:06:01.920 But it raises this cool question: what can these weird 136 00:06:01.920 --> 00:06:04.279 tokens tell us about the training data? It's like a 137 00:06:04.279 --> 00:06:06.839 little window into the model's digestive system. 138 00:06:07.000 --> 00:06:10.120 Huh, a digital digestive system. I like that. Okay, so 139 00:06:10.160 --> 00:06:12.079 we have the data, we have the tokens. How does 140 00:06:12.120 --> 00:06:16.199 this all come together? We hear neural networks, transformers. What's 141 00:06:16.240 --> 00:06:16.800 the engine? 142 00:06:17.000 --> 00:06:17.199 Right. 143 00:06:17.240 --> 00:06:19.879 The engine at the heart of almost all modern LLMs 144 00:06:20.000 --> 00:06:23.160 is the transformer architecture. It was a huge breakthrough back 145 00:06:23.160 --> 00:06:25.199 in twenty seventeen.
Why was it such a big deal? 146 00:06:25.839 --> 00:06:32.000 Because older recurrent neural networks, RNNs, really struggled with long sentences, 147 00:06:32.160 --> 00:06:35.759 long-range dependencies. They were trying to sort of cram 148 00:06:35.800 --> 00:06:38.079 the meaning of a whole sentence into one single vector. 149 00:06:38.120 --> 00:06:40.879 It just didn't scale well. The transformer changed that with 150 00:06:40.920 --> 00:06:43.199 its key innovation, self-attention. 151 00:06:43.800 --> 00:06:47.439 Self-attention. That sounds mindful. How does it work for AI? 152 00:06:47.839 --> 00:06:50.800 Heh, yeah, it's actually pretty intuitive. Think about how we read. 153 00:06:50.920 --> 00:06:53.839 We don't give every word equal weight, right? Definitely not. 154 00:06:54.079 --> 00:06:57.279 We focus on certain words to understand context, like bank 155 00:06:57.560 --> 00:07:01.399 means something different in riverbank versus savings bank. Self-attention 156 00:07:01.560 --> 00:07:04.399 lets the model do exactly that: weigh the importance of 157 00:07:04.399 --> 00:07:07.399 different words in the sequence as it processes them. So it 158 00:07:07.399 --> 00:07:09.720 learns context from surrounding words. Precisely. 159 00:07:09.759 --> 00:07:12.040 It's like that old linguistics idea, you shall know a 160 00:07:12.079 --> 00:07:14.920 word by the company it keeps. It uses these things 161 00:07:14.959 --> 00:07:18.680 called query, key, and value matrices to let words mathematically 162 00:07:18.759 --> 00:07:19.879 attend to each other. 163 00:07:20.040 --> 00:07:23.600 Okay, mathematically attend, got it. And there are different types of 164 00:07:23.639 --> 00:07:24.560 these transformers? 165 00:07:24.839 --> 00:07:27.959 Yeah, broadly, three main transformer backbones. 166 00:07:28.199 --> 00:07:32.199 First, encoder-only models like BERT, great for understanding text. Think 167 00:07:32.240 --> 00:07:33.279 search or classification.
168 00:07:33.480 --> 00:07:33.720 Right. 169 00:07:33.959 --> 00:07:37.680 Then the original encoder-decoder design, still fantastic for things 170 00:07:37.720 --> 00:07:40.480 like machine translation, where you need to process an input 171 00:07:40.600 --> 00:07:42.680 and generate a distinct output. 172 00:07:42.839 --> 00:07:43.199 Okay. 173 00:07:43.279 --> 00:07:46.040 And finally, the one we usually associate with generative AI 174 00:07:46.199 --> 00:07:50.439 like GPT-4, the decoder-only architecture. These models are 175 00:07:50.439 --> 00:07:53.600 specialized in predicting the very next token in a sequence. 176 00:07:53.959 --> 00:07:55.319 That's how they generate text. 177 00:07:55.600 --> 00:07:59.519 And what about these mixture-of-experts models, MoE? Are 178 00:07:59.519 --> 00:08:00.319 they different again? 179 00:08:00.680 --> 00:08:04.079 They're a really interesting evolution, sort of built on the backbone. 180 00:08:04.199 --> 00:08:08.439 Mixture of experts aims to massively increase a model's capacity, 181 00:08:08.519 --> 00:08:12.000 how much it knows, without proportionally increasing the compute cost 182 00:08:12.079 --> 00:08:13.319 for every single input. 183 00:08:13.360 --> 00:08:14.040 How does that work? 184 00:08:14.279 --> 00:08:16.439 The clever bit is that for any given input, only 185 00:08:16.439 --> 00:08:19.800 a subset of specialized experts inside the model gets activated. 186 00:08:19.920 --> 00:08:23.600 So ask about physics, the physics expert activates. Ask about poetry, 187 00:08:23.600 --> 00:08:26.240 the poetry expert lights up. The others stay quiet. 188 00:08:26.519 --> 00:08:29.040 Huh. So it's like calling on specialists. Exactly. 189 00:08:29.399 --> 00:08:31.680 You get the power of a huge model, but you 190 00:08:31.800 --> 00:08:35.399 only run the relevant parts for each query.
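Here is a rough sketch of the "only some experts run" idea, assuming top-k gating with the gate weights renormalized over the chosen experts. The scalar lambda "experts" and the hard-coded router scores are stand-ins invented for illustration. In a real model the experts are feed-forward blocks, the router is learned, and routing typically happens per token rather than per topic.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_scores, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    # Indices of the top_k router scores (ties keep their original order).
    chosen = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:top_k]
    # Renormalize gate weights over the chosen experts only.
    gates = softmax([router_scores[i] for i in chosen])
    output = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return output, chosen

# Toy experts standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_scores = [0.1, 2.0, -1.0, 2.0]  # hypothetical router logits for one input

out, used = moe_forward(3.0, router_scores, experts, top_k=2)
print(out, used)
```

With these scores only experts 1 and 3 are evaluated, each with weight 0.5, so the other two cost nothing for this input.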
Mistral's Mixtral 191 00:08:35.480 --> 00:08:39.000 is a key example, and many suspect GPT-4 uses 192 00:08:39.039 --> 00:08:40.960 something similar, though it's unconfirmed. 193 00:08:41.480 --> 00:08:45.720 This is really critical then for you, the listener. Understanding 194 00:08:45.759 --> 00:08:51.360 these foundations, the data, the tokens, the transformer architecture, MoE. 195 00:08:52.039 --> 00:08:55.679 It's crucial, absolutely, even if you never train one from scratch. 196 00:08:56.200 --> 00:09:00.360 That intuition helps you debug, figure out why it's behaving oddly, 197 00:09:00.799 --> 00:09:03.320 and build better apps. You start to see why it 198 00:09:03.360 --> 00:09:04.759 might struggle or succeed. 199 00:09:04.480 --> 00:09:05.919 Get a feel for the machine. 200 00:09:06.120 --> 00:09:09.200 So we've established these models are powerful, but yeah, definitely 201 00:09:09.200 --> 00:09:12.559 not perfect. What are some of the biggest practical limitations, 202 00:09:12.799 --> 00:09:14.360 and how are we starting to tackle them? 203 00:09:14.679 --> 00:09:18.360 One of the biggest, and probably most talked about, is hallucinations. 204 00:09:17.720 --> 00:09:19.120 Right, when they just make stuff up. 205 00:09:19.080 --> 00:09:22.679 Exactly. More formally, it's generated text that isn't grounded in 206 00:09:22.720 --> 00:09:24.679 the training data or the input context. 207 00:09:25.000 --> 00:09:27.639 It sounds plausible, but it's just fabrication. 208 00:09:27.679 --> 00:09:28.559 Can you give an example? 209 00:09:28.799 --> 00:09:31.559 Sure, there was a well-known case with the Nous 210 00:09:31.639 --> 00:09:36.399 Research Hermes model. It hallucinated details about Ugandan medal winners 211 00:09:36.480 --> 00:09:39.519 from the twenty twenty Olympics. Oh wow. Yeah, it got 212 00:09:39.559 --> 00:09:42.639 birth dates wrong, mixed up which medals they won.
The 213 00:09:42.720 --> 00:09:45.039 athletes were real, the core facts were real, but the 214 00:09:45.080 --> 00:09:48.720 details were just invented. Confidently stated, but wrong. 215 00:09:48.919 --> 00:09:51.519 Yikes, how do you even begin to fix that? 216 00:09:52.000 --> 00:09:56.440 It's tough. Mitigation involves several things. Good product design helps: 217 00:09:56.480 --> 00:09:59.559 try not to ask questions the LLM likely can't answer, 218 00:10:00.000 --> 00:10:02.639 though knowing what it doesn't know is hard. True. We 219 00:10:02.679 --> 00:10:06.000 also look at model self-knowledge and calibration. Basically, how 220 00:10:06.000 --> 00:10:09.159 confident is the model in its own output? Sometimes low 221 00:10:09.200 --> 00:10:11.879 confidence correlates with higher hallucination risk. 222 00:10:11.799 --> 00:10:13.879 Okay, using its own uncertainty signals. 223 00:10:13.960 --> 00:10:14.480 Yeah. 224 00:10:14.519 --> 00:10:17.720 And then there are technical fixes during generation, like factual- 225 00:10:17.799 --> 00:10:21.320 nucleus sampling, which tries to reduce randomness for more factual outputs, 226 00:10:21.919 --> 00:10:25.480 or DoLa decoding, which cleverly uses differences between signals in 227 00:10:25.480 --> 00:10:28.000 the transformer layers to spot potential hallucinations. 228 00:10:28.159 --> 00:10:31.240 Fascinating. And sometimes they hallucinate just because the prompt itself 229 00:10:31.279 --> 00:10:33.399 is confusing, right? Like, yeah, with irrelevant info? 230 00:10:33.600 --> 00:10:37.360 Yeah, absolutely. If you put distracting sentences in the prompt, 231 00:10:37.440 --> 00:10:41.600 like mentioning Max selling apples in Sarah's unrelated math problem, 232 00:10:42.000 --> 00:10:44.879 the LLM can get confused and incorporate the wrong info. 233 00:10:45.039 --> 00:10:49.159 So prompting it to first identify and remove irrelevant context 234 00:10:49.399 --> 00:10:49.879 can help.
235 00:10:50.200 --> 00:10:54.480 Okay. So beyond just factual accuracy, what about actual reasoning? 236 00:10:55.159 --> 00:10:57.559 Can they really connect dots logically? 237 00:10:58.000 --> 00:11:01.000 That's a huge area of research and development. Natural language 238 00:11:01.039 --> 00:11:04.559 reasoning means integrating knowledge to draw conclusions, and there are 239 00:11:04.559 --> 00:11:08.879 different kinds. Deductive is pure logic: premise A, premise B, 240 00:11:09.039 --> 00:11:12.639 therefore conclusion C. Like, Mr. Shockley is allergic to mushrooms. 241 00:11:12.720 --> 00:11:14.919 This dish has mushrooms, so Mr. Shockley 242 00:11:14.559 --> 00:11:15.279 should avoid it. 243 00:11:15.360 --> 00:11:19.200 Simple logic. Then inductive, generalizing from examples: see hundreds of 244 00:11:19.279 --> 00:11:23.240 round manhole covers, conclude manhole covers are generally round. Abductive 245 00:11:23.279 --> 00:11:27.519 reasoning is finding the most likely explanation. Wet streets, puddles, umbrellas? Hmm, 246 00:11:27.720 --> 00:11:31.039 probably rain. Inference to the best explanation. Exactly. And then 247 00:11:31.080 --> 00:11:34.120 there's common sense, implicit stuff, like you can't fit a 248 00:11:34.159 --> 00:11:35.440 horse in a Mini Cooper. 249 00:11:35.399 --> 00:11:38.759 Huh, yeah, hopefully obvious. How do you get an LLM 250 00:11:38.799 --> 00:11:39.440 to do that better? 251 00:11:39.879 --> 00:11:43.000 A major technique is chain-of-thought prompting, or CoT. 252 00:11:43.559 --> 00:11:46.639 You literally tell the LLM to think step by 253 00:11:46.480 --> 00:11:48.840 step. Show your work, basically. Pretty 254 00:11:48.600 --> 00:11:51.519 much. For a math problem like thirty four plus forty 255 00:11:51.519 --> 00:11:54.000 four plus three twenty three three to two.
Instead of 256 00:11:54.080 --> 00:11:55.960 just asking for the answer, you ask it to break 257 00:11:56.000 --> 00:11:58.799 it down first, calculate three, two, three, and so on. 258 00:11:59.200 --> 00:12:01.399 Performance jumps dramatically. 259 00:12:01.000 --> 00:12:03.559 Because it forces a sequential process, right? 260 00:12:03.440 --> 00:12:05.679 It gives it intermediate steps to work with. It costs 261 00:12:05.720 --> 00:12:08.600 more tokens, more time, but it's often worth it for 262 00:12:08.679 --> 00:12:12.159 complex tasks. You can also use verifiers, maybe another LLM 263 00:12:12.279 --> 00:12:15.559 to check the steps, or even fine-tune models specifically 264 00:12:15.600 --> 00:12:16.799 on reasoning datasets. 265 00:12:16.879 --> 00:12:19.840 Okay, so we can make them smarter, more reliable, but 266 00:12:19.879 --> 00:12:22.200 these things are huge. How do we actually run them 267 00:12:22.200 --> 00:12:24.879 efficiently in the real world? That sounds like a massive hurdle. 268 00:12:25.000 --> 00:12:26.679 It is a huge hurdle, and that brings us to 269 00:12:26.840 --> 00:12:30.200 choosing and optimizing LLMs for production. First, you have to 270 00:12:30.240 --> 00:12:33.879 pick one. You've got proprietary providers, OpenAI, Google, Anthropic, 271 00:12:34.000 --> 00:12:37.480 via APIs. Easy to use, managed. The big players. And 272 00:12:37.519 --> 00:12:43.960 then open source models: Meta's Llama, EleutherAI, Mistral, Microsoft's Phi models. 273 00:12:44.080 --> 00:12:46.759 You get the model weights, more transparency, more flexibility, but 274 00:12:46.799 --> 00:12:49.200 you often have to manage the deployment yourself or use 275 00:12:49.240 --> 00:12:50.279 specialized platforms. 276 00:12:50.320 --> 00:12:51.759 Trade-offs there. Big time. 277 00:12:51.799 --> 00:12:55.519 Transparency versus convenience, cost versus latency. And
278 00:12:55.519 --> 00:12:57.759 once you pick one, how do you know if it's 279 00:12:57.759 --> 00:13:00.720 any good for your job? Benchmarks are everywhere, but are 280 00:13:00.720 --> 00:13:01.600 they the whole story? 281 00:13:02.000 --> 00:13:05.960 Definitely not the whole story. Evaluating LLMs is super tricky. 282 00:13:06.039 --> 00:13:09.600 Benchmarks can suffer from test set contamination. The model might 283 00:13:09.600 --> 00:13:12.960 have seen the answers in its training data. Cheating, basically. 284 00:13:13.120 --> 00:13:16.399 Kind of. Or models get over-optimized just to score 285 00:13:16.399 --> 00:13:19.159 well on a benchmark, but aren't great in practice. And 286 00:13:19.200 --> 00:13:22.080 they're very sensitive to how you prompt them. Frameworks like 287 00:13:22.120 --> 00:13:28.720 Stanford's HELM try to be more comprehensive, looking at accuracy, robustness, fairness, calibration. 288 00:13:28.840 --> 00:13:29.519 Lots of things. 289 00:13:29.559 --> 00:13:33.559 So you need holistic evaluation and, ideally, your own internal 290 00:13:33.600 --> 00:13:37.000 benchmarks tailored to your actual use case. That's key. Now, 291 00:13:37.159 --> 00:13:40.559 actually running them, you generally need GPUs for decent speed. 292 00:13:40.559 --> 00:13:43.960 They're just computationally intensive. Right, expensive hardware. Which is where 293 00:13:44.039 --> 00:13:45.120 quantization comes in. 294 00:13:45.440 --> 00:13:46.279 It's a lifesaver. 295 00:13:46.559 --> 00:13:48.720 Quantization? Explain that. Sounds 296 00:13:48.399 --> 00:13:50.440 complex. It's actually a pretty neat idea. 297 00:13:50.679 --> 00:13:53.759 It's about reducing the memory footprint.
You take the numbers 298 00:13:53.799 --> 00:13:57.639 inside the model, usually high-precision floating point numbers like 299 00:13:57.960 --> 00:14:01.000 FP32, and represent them with fewer bits, like 300 00:14:01.120 --> 00:14:04.360 FP16, BF16, or even eight-bit integers, 301 00:14:04.799 --> 00:14:07.879 INT8. So, like compressing the numbers? 302 00:14:07.519 --> 00:14:10.879 Exactly like compressing them. You lose a tiny bit of precision, 303 00:14:11.200 --> 00:14:15.440 usually negligible, but the model becomes much smaller, uses less memory, 304 00:14:15.600 --> 00:14:18.519 and runs faster. Tools like Ollama make it easier 305 00:14:18.519 --> 00:14:21.879 to run these quantized models, even locally on a powerful laptop 306 00:14:21.919 --> 00:14:24.720 sometimes. That's cool. Makes them more accessible. Okay, so it's loaded, 307 00:14:25.080 --> 00:14:27.960 maybe quantized. How do you speed up the inference, the 308 00:14:28.000 --> 00:14:29.639 actual running part, and make it cheaper? 309 00:14:29.840 --> 00:14:33.600 Several key tricks for LLM inference optimization. A huge one 310 00:14:33.679 --> 00:14:36.799 is the KV cache. KV cache? Yeah, key-value cache. 311 00:14:36.960 --> 00:14:38.840 Think of it as the model's short-term memory for 312 00:14:38.879 --> 00:14:41.720 the current conversation or task. When you send a prompt, 313 00:14:41.799 --> 00:14:45.200 especially one with instructions, those instructions often stay the same 314 00:14:45.279 --> 00:14:46.279 for follow-up questions. 315 00:14:46.519 --> 00:14:49.759 The KV cache stores the internal calculations, the key and 316 00:14:49.879 --> 00:14:53.799 value matrices from self-attention, related to that initial prompt, 317 00:14:53.840 --> 00:14:56.200 so the model doesn't have to recalculate them every single 318 00:14:56.200 --> 00:14:59.360 time you ask a follow-up question. It dramatically speeds 319 00:14:59.360 --> 00:15:00.000 things up after
320 00:15:00.080 --> 00:15:00.679 the first turn. 321 00:15:01.519 --> 00:15:05.080 Ah, avoids redundant work. Clever. What 322 00:15:05.120 --> 00:15:09.120 else? There's speculative decoding. This is pretty cool. You use 323 00:15:09.159 --> 00:15:12.480 a small, fast draft model to generate a chunk of 324 00:15:12.480 --> 00:15:16.919 tokens quickly. Then the larger, more accurate model verifies those 325 00:15:16.919 --> 00:15:18.120 tokens in a batch. Like 326 00:15:18.080 --> 00:15:20.039 a quick first draft, and then a careful 327 00:15:19.799 --> 00:15:23.360 edit. Exactly. The big model checks the intern's work quickly 328 00:15:23.399 --> 00:15:26.360 instead of doing it all slowly itself. Speeds things up 329 00:15:26.399 --> 00:15:29.559 a lot for generation. Nice. We also use knowledge distillation. 330 00:15:30.120 --> 00:15:33.759 Train a smaller student model to mimic a big teacher model. 331 00:15:33.960 --> 00:15:36.159 You get a faster, cheaper model that retains a lot 332 00:15:36.159 --> 00:15:36.879 of the capability. 333 00:15:37.080 --> 00:15:40.240 Think DistilBERT. Right, smaller but still capable. 334 00:15:39.879 --> 00:15:43.720 And things like parallel decoding for generating multiple parts simultaneously, 335 00:15:44.000 --> 00:15:46.919 or early exit, where simpler queries might get an answer 336 00:15:46.919 --> 00:15:48.960 from an earlier layer of the model without going all 337 00:15:49.000 --> 00:15:51.080 the way through. Lots of techniques to 338 00:15:51.000 --> 00:15:53.879 make them practical. Okay, this brings us squarely to the 339 00:15:53.919 --> 00:15:57.399 application layer. How do we take these optimized LLMs and 340 00:15:57.440 --> 00:16:00.759 actually plug them into complex software? They can't just operate 341 00:16:00.759 --> 00:16:01.720 in a vacuum, can they? 342 00:16:01.879 --> 00:16:05.120 No, definitely not. They have real limitations. Knowledge cutoff is 343 00:16:05.120 --> 00:16:08.159 a big one.
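Circling back to speculative decoding for a second, the draft-then-verify loop can be sketched with toy stand-ins. This assumes the greedy variant, where a drafted token is accepted only if it equals the target model's own top choice, so the final output matches what the target would have produced alone. The two `*_next` functions are invented placeholders for real models, and a real system would score all drafted positions in one batched forward pass instead of looping.

```python
def speculative_decode(prefix, draft_next, target_next, num_tokens, draft_len=4):
    """Greedy speculative decoding: output equals target-only greedy decoding."""
    out = list(prefix)
    goal = len(prefix) + num_tokens
    while len(out) < goal:
        # 1. The small draft model cheaply proposes a short run of tokens.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(out + proposal))
        # 2. The target model checks each proposed position in order.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)  # the target's token is always the one kept
            if expected != tok or len(out) == goal:
                break  # first disagreement (or done): discard the rest of the draft
    return out

# Toy "models" over digit tokens; the draft is wrong whenever the last token is 4.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

print(speculative_decode([0], draft_next, target_next, num_tokens=8))
```

The speedup in practice comes from the target verifying several drafted tokens per pass. Wherever the draft is right, several tokens are accepted at once. Where it is wrong, the target's own token replaces the bad guess.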
They don't know about yesterday's news unless retrained. 344 00:16:08.720 --> 00:16:13.480 They struggle with precise math, no factual guarantees, can't easily 345 00:16:13.519 --> 00:16:18.960 cite sources and context. Windows while growing are still finite. 346 00:16:18.879 --> 00:16:20.440 So they need help from the outside world. 347 00:16:20.519 --> 00:16:21.080 Precisely. 348 00:16:21.120 --> 00:16:23.960 You need to interface them with external tools and data. 349 00:16:24.360 --> 00:16:29.399 We generally talk about three core LLM interaction paradigms. Okay, First, 350 00:16:29.519 --> 00:16:34.399 the passive approach. This is basically retrieval augmented generation or RG. 351 00:16:35.000 --> 00:16:38.120 The LLM just receives information and its prompt. It doesn't 352 00:16:38.120 --> 00:16:40.879 know where it came from. You feed it the relevant context. 353 00:16:40.679 --> 00:16:42.679 Giving it the answer key snippet. 354 00:16:42.440 --> 00:16:44.720 Kind of yeah, perfect for Q and A over your 355 00:16:44.720 --> 00:16:47.879 own private documents. You retrieve the relevant text, put it 356 00:16:47.919 --> 00:16:49.919 in the prompt, and the LLM answer. 357 00:16:49.600 --> 00:16:50.120 Is based on that. 358 00:16:50.240 --> 00:16:51.679 Okay, passive, what's next? 359 00:16:52.039 --> 00:16:55.080 Explicit tool use here? The LLM is more active. 360 00:16:55.480 --> 00:16:57.600 You give it instructions and a set of tools it 361 00:16:57.639 --> 00:17:00.039 can use, like a web search tool, a calculator, a 362 00:17:00.399 --> 00:17:01.759 database connector, and. 363 00:17:01.679 --> 00:17:03.639 It chooses which tool to use exactly. 364 00:17:04.000 --> 00:17:07.920 Frameworks like lang chain help manage this. The LLM decides, okay, 365 00:17:07.960 --> 00:17:09.920 to answer this, I need to search the web, and 366 00:17:09.960 --> 00:17:13.200 it triggers the search tool. It becomes an orchestrator, more interactive. 
367 00:17:13.279 --> 00:17:14.319 And the third? The 368 00:17:14.200 --> 00:17:19.079 most advanced is the agentic paradigm. Think autonomous agents. These 369 00:17:19.240 --> 00:17:22.720 LLMs can interact with their environment, break down complex goals 370 00:17:22.720 --> 00:17:25.920 into subtasks, and take a sequence of actions using tools 371 00:17:25.920 --> 00:17:26.799 to achieve the goal. 372 00:17:27.079 --> 00:17:28.839 Like that Apple CFO example you 373 00:17:28.799 --> 00:17:32.200 mentioned earlier. Exactly like that. Who was Apple CFO at 374 00:17:32.200 --> 00:17:35.039 its lowest stock price in ten years? The agent figures 375 00:17:35.079 --> 00:17:39.480 out: one, get stock data; two, find lowest point; three, 376 00:17:39.720 --> 00:17:42.799 find CFO for that date. It plans and executes. 377 00:17:43.119 --> 00:17:46.240 Wow, that's powerful. Still limitations though, you said. 378 00:17:46.079 --> 00:17:48.680 Oh yeah, current agents can still get stuck in loops, 379 00:17:48.839 --> 00:17:51.519 choose the wrong tool, or just fail. It's definitely the 380 00:17:51.519 --> 00:17:53.079 frontier, very active research. 381 00:17:53.240 --> 00:17:56.839 Okay, but let's go back to RAG, retrieval-augmented generation. 382 00:17:57.200 --> 00:18:00.079 You said it's passive, but it feels like the cornerstone 383 00:18:00.119 --> 00:18:03.880 of so many practical LLM apps today. Let's really dive 384 00:18:03.960 --> 00:18:07.079 deep into RAG. Why is it so vital and how 385 00:18:07.079 --> 00:18:08.000 does it actually work 386 00:18:08.079 --> 00:18:10.920 under the hood? RAG is absolutely fundamental. 387 00:18:10.960 --> 00:18:14.480 Its main job is letting LLMs access your specific private 388 00:18:14.559 --> 00:18:16.000 data, stuff it never saw 389 00:18:15.880 --> 00:18:18.160 during training. Right, bridging the knowledge gap.
390 00:18:18.039 --> 00:18:22.440 Exactly, and by doing that it drastically reduces hallucinations because 391 00:18:22.599 --> 00:18:26.519 responses are grounded in actual provided text. It allows for citations. 392 00:18:26.519 --> 00:18:29.480 It lets the LLM talk about recent events, and it 393 00:18:29.559 --> 00:18:31.319 handles the long-tail entities. 394 00:18:31.640 --> 00:18:33.799 Long-tail entities, what are those again? Think 395 00:18:33.880 --> 00:18:36.079 really niche facts, stuff so rare 396 00:18:36.119 --> 00:18:38.680 it might only appear once or twice in trillions of 397 00:18:38.720 --> 00:18:42.640 tokens of training data. LLMs struggle to memorize that. RAG 398 00:18:42.640 --> 00:18:45.759 retrieves that specific fact just when needed. Without RAG, 399 00:18:45.799 --> 00:18:49.000 you'd need impossibly huge models to maybe memorize everything. 400 00:18:49.240 --> 00:18:53.200 So RAG is essential for accessing specific, less common knowledge. Totally. 401 00:18:53.480 --> 00:18:55.960