WEBVTT 1 00:00:00.160 --> 00:00:02.240 Welcome to the Deep Dive. We're the show that gets 2 00:00:02.279 --> 00:00:04.719 you quickly and thoroughly well informed on the topics that 3 00:00:04.759 --> 00:00:09.679 really matter. And today we are plunging headfirst into something that, honestly, 4 00:00:09.720 --> 00:00:11.480 it still feels a bit like science fiction, but it's 5 00:00:11.560 --> 00:00:16.359 very much real now: the world of artificial intelligence, specifically 6 00:00:16.440 --> 00:00:20.039 large language models, or LLMs. I mean, think about it. 7 00:00:20.039 --> 00:00:24.719 I remember when ChatGPT just exploded onto the scene. Exactly. 8 00:00:24.800 --> 00:00:27.160 It wasn't just some tech news item. It went global, 9 00:00:27.239 --> 00:00:29.760 pulling in, what, over one hundred million users in just two months. 10 00:00:30.079 --> 00:00:33.840 It really felt like these things could just conjure up text, answers, 11 00:00:34.119 --> 00:00:35.439 anything, right out of thin air. 12 00:00:35.560 --> 00:00:37.600 It did feel a bit magical, didn't it? Totally. 13 00:00:38.159 --> 00:00:40.560 So our mission with this Deep Dive is really to 14 00:00:40.560 --> 00:00:44.359 give you a shortcut into Learning LangChain, building AI 15 00:00:44.520 --> 00:00:47.600 and LLM applications. We want to go beyond the buzzwords, 16 00:00:47.600 --> 00:00:50.200 you know, get into the core concepts, the actual practical 17 00:00:50.240 --> 00:00:53.960 strategies for building powerful AI apps with this thing called 18 00:00:54.039 --> 00:00:56.600 LangChain. Our goal isn't just to tell you what's 19 00:00:56.600 --> 00:00:59.280 happening in AI, but really show you why it matters, 20 00:00:59.479 --> 00:01:02.920 point out the aha moments, and crucially, equip you, the 21 00:01:03.000 --> 00:01:06.319 listener, with the knowledge to maybe even apply it yourself.
22 00:01:06.480 --> 00:01:10.280 And to guide us through this pretty intricate and, let's 23 00:01:10.280 --> 00:01:13.400 be honest, rapidly evolving landscape, we're drawing directly from a 24 00:01:13.439 --> 00:01:17.560 really fantastic source, the book Learning LangChain by Mayo 25 00:01:17.640 --> 00:01:20.719 Oshin and Nuno Campos. And these aren't just, you know, academics 26 00:01:20.719 --> 00:01:23.000 writing about the field from afar. Mayo was actually an 27 00:01:23.280 --> 00:01:25.799 early developer, an advocate for the LangChain open source 28 00:01:25.840 --> 00:01:28.879 library itself, a real pioneer in that whole chat with 29 00:01:29.000 --> 00:01:32.000 data movement, and Nuno is a founding software engineer at 30 00:01:32.079 --> 00:01:35.239 LangChain. So this book, it isn't just theory, it's 31 00:01:35.280 --> 00:01:39.799 packed with super clear explanations, actionable techniques. Industry experts are 32 00:01:39.799 --> 00:01:43.760 calling it the go-to resource for building production-ready 33 00:01:43.840 --> 00:01:45.680 generative AI and agents. 34 00:01:45.959 --> 00:01:48.680 That's fantastic, a really solid foundation, then. So the big 35 00:01:48.760 --> 00:01:53.239 question we're tackling today is this: how can developers, maybe 36 00:01:53.280 --> 00:01:56.840 even those who don't have a deep machine learning background, 37 00:01:57.079 --> 00:02:00.200 how can they harness this incredible power of LLMs to 38 00:02:00.239 --> 00:02:04.879 build genuinely production-ready generative AI applications and these intelligent agents? Right, 39 00:02:05.120 --> 00:02:07.879 we're going to unpack the essential tools, the patterns, the 40 00:02:07.920 --> 00:02:11.840 thinking that transforms these powerful models from cool tech demos 41 00:02:11.879 --> 00:02:16.080 into practical, usable solutions. So, okay, let's start right at 42 00:02:16.120 --> 00:02:18.919 the beginning. These LLMs, they seem almost magical.
How exactly 43 00:02:18.919 --> 00:02:21.319 do they know the answers they give, and what precisely 44 00:02:21.400 --> 00:02:22.719 is a token in their world? 45 00:02:22.960 --> 00:02:24.039 Yeah, good place to start. 46 00:02:24.400 --> 00:02:27.919 So at their heart, large language models are generative models 47 00:02:28.000 --> 00:02:32.159 specifically built for text. They're trained on just vast amounts 48 00:02:32.199 --> 00:02:36.599 of text, think everything publicly available, books, articles, forums, code, 49 00:02:37.080 --> 00:02:41.159 even cleaned up video transcripts, an immense data set. Their 50 00:02:41.159 --> 00:02:43.599 core function isn't really magic, though it looks like it. 51 00:02:43.599 --> 00:02:48.080 It's incredibly sophisticated prediction. They basically predict the most probable 52 00:02:48.319 --> 00:02:51.879 next word or token in a sequence based on all 53 00:02:51.879 --> 00:02:53.800 the patterns they've learned. So if you feed it "the 54 00:02:53.800 --> 00:02:56.759 capital of England is", it's learned from countless examples that 55 00:02:56.800 --> 00:02:58.439 the highest probability next word is 56 00:02:58.439 --> 00:03:02.000 "London." OK, pattern matching on a massive scale. But what's 57 00:03:02.039 --> 00:03:03.879 a token? Is it just a word? 58 00:03:04.280 --> 00:03:04.840 Not always. 59 00:03:05.039 --> 00:03:08.439 Yeah, a token is the fundamental unit the LLM processes. 60 00:03:08.840 --> 00:03:12.039 Often it's a word, but sometimes longer or less common 61 00:03:12.080 --> 00:03:15.400 words get broken down, like "dearest" might become two tokens, 62 00:03:15.520 --> 00:03:19.240 "de" and "arest". On average, you know, for common English text, 63 00:03:19.240 --> 00:03:22.280 one token is roughly four characters. And the driving engine 64 00:03:22.319 --> 00:03:25.240 behind all this predictive power is something called the transformer 65 00:03:25.280 --> 00:03:26.479 neural network architecture.
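To make the "predict the most probable next word" idea from this exchange concrete, here is a toy sketch. It is an assumption-heavy illustration, not how a transformer works: it just counts which word follows each two-word context in a tiny made-up corpus. The names `predict_next` and `estimate_tokens` are hypothetical, and the second one only applies the rough four-characters-per-token rule of thumb mentioned above.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; a real LLM is trained on vastly more text.
corpus = [
    "the capital of england is london",
    "the capital of france is paris",
    "london is the capital of england",
    "the capital of england is london",
]

# Count which word follows each two-word context.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        context = (words[i], words[i + 1])
        next_word_counts[context][words[i + 2]] += 1


def predict_next(context_words):
    """Return the most frequent word seen after the last two words, or None."""
    context = tuple(context_words[-2:])
    counts = next_word_counts.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def estimate_tokens(text):
    """Rough token estimate using the four-characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))


prediction = predict_next("the capital of england is".split())
```

Here `prediction` comes out as "london" because that is the only continuation the corpus contains for "england is". Real tokenizers split text into subword units, so `estimate_tokens` is only a budgeting aid.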
66 00:03:26.560 --> 00:03:27.800 Right, the transformer architecture. 67 00:03:27.879 --> 00:03:29.960 Heard a lot about that. Yeah, it's key. Think of 68 00:03:29.960 --> 00:03:33.039 it as being really good at understanding context. It relates 69 00:03:33.080 --> 00:03:35.599 every word in a sentence to every other word, building 70 00:03:35.599 --> 00:03:38.840 this rich understanding of meaning and relationships. That's how they 71 00:03:38.879 --> 00:03:42.520 handle complex grammar and nuance, not just simple word prediction. 72 00:03:43.159 --> 00:03:46.000 And it was that understanding, or perhaps a limitation of 73 00:03:46.000 --> 00:03:48.319 that understanding, that led to LangChain, right? I read 74 00:03:48.360 --> 00:03:51.400 Harrison Chase started the open source library back in October 75 00:03:51.439 --> 00:03:54.680 twenty twenty two. What was the key realization? 76 00:03:54.280 --> 00:03:56.919 Exactly. The real breakthrough, the thing that sparked LangChain, 77 00:03:56.960 --> 00:04:00.000 was this insight: an LLM, as brilliant as it is 78 00:04:00.080 --> 00:04:03.840 with language, could totally fumble basic arithmetic. Like, ask it 79 00:04:03.879 --> 00:04:06.439 to calculate one thousand two hundred and thirty four modulo one 80 00:04:06.439 --> 00:04:08.759 twenty three on its own. It might just guess or 81 00:04:08.800 --> 00:04:09.319 get it wrong. 82 00:04:09.400 --> 00:04:12.400 Huh, weird, right? It can write poetry but not do 83 00:04:12.479 --> 00:04:13.039 simple math. 84 00:04:13.120 --> 00:04:14.319 It is kind of paradoxical. 85 00:04:14.520 --> 00:04:18.079 Yeah, and that raised this crucial question: how do you 86 00:04:18.120 --> 00:04:21.959 give this powerful language model capabilities it just doesn't have intrinsically?
87 00:04:22.759 --> 00:04:25.959 Harrison Chase's pivotal realization was, and this is key, the 88 00:04:26.000 --> 00:04:30.079 most interesting LLM applications needed to use LLMs together with 89 00:04:30.279 --> 00:04:34.959 other sources of computation or knowledge. LangChain was essentially 90 00:04:34.959 --> 00:04:37.959 built to provide the building blocks, the interfaces, and the 91 00:04:38.000 --> 00:04:42.160 tooling to reliably combine LLMs with other things, like giving 92 00:04:42.199 --> 00:04:44.800 it the ability to call out to a calculator when 93 00:04:44.839 --> 00:04:45.920 it sees a math problem. 94 00:04:46.000 --> 00:04:48.439 That makes so much sense, giving it tools. So, if 95 00:04:48.439 --> 00:04:51.879 the LLM isn't a calculator or a database itself, how 96 00:04:51.879 --> 00:04:53.759 do we even talk to it effectively? How do we 97 00:04:53.839 --> 00:04:55.120 guide it to do what we need? 98 00:04:55.240 --> 00:04:57.720 Yeah, that's all about prompting. The prompt is basically the 99 00:04:57.759 --> 00:05:00.959 instructions and input you provide to the model, and crucially, 100 00:05:01.519 --> 00:05:04.759 how you phrase that prompt significantly influences the model's output. 101 00:05:06.079 --> 00:05:08.920 There's also this fascinating control called temperature. You can think 102 00:05:08.959 --> 00:05:11.439 of it like a creativity dial. Lower temperature makes the 103 00:05:11.480 --> 00:05:16.600 output more focused, more deterministic, predictable. Higher temperature, it lets 104 00:05:16.639 --> 00:05:19.839 the model take more risks, get more creative, maybe even 105 00:05:19.879 --> 00:05:22.240 a bit random, useful for different tasks. 106 00:05:22.439 --> 00:05:26.399 Okay, so prompting is key, temperature controls creativity. What are 107 00:05:26.399 --> 00:05:27.920 the main ways we prompt these things?
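The "creativity dial" description of temperature can be sketched with a few lines of arithmetic. This is a minimal illustration under the common assumption that temperature divides the model's raw scores (logits) before a softmax turns them into probabilities; how any particular provider actually samples is not stated in the episode. The candidate tokens and scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
candidates = ["London", "Paris", "bananas"]
logits = [4.0, 2.0, 0.5]

focused = softmax_with_temperature(logits, 0.2)   # low temperature
creative = softmax_with_temperature(logits, 2.0)  # high temperature
```

With the low setting nearly all probability lands on the top candidate, which is the "focused, deterministic" behavior; with the high setting the unlikely candidates get a noticeably larger share, which is where the extra randomness comes from.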
108 00:05:28.439 --> 00:05:31.439 There are several core techniques, each kind of addressing a 109 00:05:31.439 --> 00:05:35.720 different need. The absolute simplest is zero-shot prompting. Just 110 00:05:35.759 --> 00:05:39.000 give it a direct instruction, like that example earlier, "How 111 00:05:39.000 --> 00:05:41.199 old was the thirtieth president of the United States when 112 00:05:41.240 --> 00:05:44.839 his wife's mother died?" It's straightforward for basic questions, but 113 00:05:44.959 --> 00:05:47.519 you know, it can often lead to inaccuracies or just 114 00:05:47.560 --> 00:05:49.399 making things up if the info isn't baked into 115 00:05:49.399 --> 00:05:50.240 its training data. 116 00:05:50.319 --> 00:05:53.000 Hallucinations, right? The dreaded hallucinations. Precisely. 117 00:05:53.319 --> 00:05:55.879 Then you've got chain of thought, or CoT. This 118 00:05:55.920 --> 00:05:58.680 is where you literally tell the model, "think step by step." 119 00:06:00.079 --> 00:06:03.959 It's intriguing because this often dramatically improves performance on reasoning tasks. 120 00:06:04.240 --> 00:06:07.079 It forces the LLM to sort of show its work, 121 00:06:07.319 --> 00:06:09.639 break down the problem. Like we learned in school math? 122 00:06:09.800 --> 00:06:12.959 Pretty much. But here's a funny quirk. The book notes 123 00:06:12.959 --> 00:06:16.680 that sometimes, for tasks where humans also tend to overthink 124 00:06:16.720 --> 00:06:20.639 and make mistakes, CoT can actually make the LLM perform worse. 125 00:06:21.240 --> 00:06:23.879 A good reminder they aren't just scaled up human brains. 126 00:06:24.199 --> 00:06:25.680 Huh, interesting. 127 00:06:25.800 --> 00:06:26.959 What else? Next 128 00:06:27.040 --> 00:06:27.240 up, 129 00:06:27.360 --> 00:06:30.040 and this is fundamental for making LLMs useful with your 130 00:06:30.120 --> 00:06:35.879 data, is retrieval augmented generation, or RAG.
This means providing 131 00:06:36.000 --> 00:06:39.439 relevant pieces of text, also known as context, directly within 132 00:06:39.480 --> 00:06:41.759 the prompt. So if you want the LLM to know 133 00:06:41.800 --> 00:06:45.319 about your company's latest internal report or today's news, you use 134 00:06:45.360 --> 00:06:49.199 RAG to feed it that specific information alongside the question. 135 00:06:49.000 --> 00:06:51.319 Ah okay, so RAG is how you give it knowledge 136 00:06:51.319 --> 00:06:52.879 it wasn't trained on. Precisely. 137 00:06:53.160 --> 00:06:56.040 And then for making LLMs do things, there's tool calling. 138 00:06:56.240 --> 00:06:58.199 This lets you give the LLM a list of external 139 00:06:58.199 --> 00:07:00.959 functions: our calculator example, maybe a search engine API, a 140 00:07:01.000 --> 00:07:03.600 weather service, whatever. You train it to recognize when it 141 00:07:03.600 --> 00:07:05.600 needs a tool and to signal its intent 142 00:07:05.480 --> 00:07:07.160 to use it. So it can decide, "I need to 143 00:07:07.160 --> 00:07:09.639 search for this" or "I need to calculate that." Exactly. 144 00:07:10.120 --> 00:07:14.560 And often the most powerful applications combine these techniques. Maybe 145 00:07:14.639 --> 00:07:16.920 use chain of thought to plan, RAG to fetch 146 00:07:16.959 --> 00:07:20.000 relevant data, and then tool calling to perform a specific 147 00:07:20.040 --> 00:07:23.480 action or calculation based on that data. Oh, and one 148 00:07:23.480 --> 00:07:26.279 more: few-shot prompting. This is where you give the 149 00:07:26.399 --> 00:07:28.879 LLM just a small number of examples, like here's a question, 150 00:07:28.959 --> 00:07:31.279 here's the right kind of answer. It helps it learn 151 00:07:31.360 --> 00:07:34.720 new tasks or formats on the fly without full retraining, 152 00:07:34.759 --> 00:07:36.360 like showing it a few examples to get the hang 153 00:07:36.399 --> 00:07:36.959 of something new.
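Few-shot prompting, as just described, mostly comes down to how the prompt text is assembled. Here is a minimal sketch in plain Python, with no particular library assumed. The function name `build_few_shot_prompt`, the "Q:" and "A:" layout, and the example questions are all hypothetical choices for illustration.

```python
def build_few_shot_prompt(instruction, examples, question):
    """Assemble an instruction, worked examples, and the new question into one prompt."""
    parts = [instruction.strip(), ""]
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}")
        parts.append(f"A: {example_answer}")
        parts.append("")
    # End with an open "A:" so the model continues in the demonstrated format.
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


# Hypothetical examples that demonstrate the answer format we want.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

prompt = build_few_shot_prompt(
    "Answer with the city name only.",
    examples,
    "What is the capital of England?",
)
```

Dropping the `examples` list to empty turns the same function into a zero-shot prompt, which makes it easy to compare the two approaches on the same question.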
154 00:07:37.079 --> 00:07:39.759 Wow, okay, that's a whole toolkit for interacting with them, 155 00:07:40.000 --> 00:07:41.759 and LangChain helps manage all of this? 156 00:07:41.959 --> 00:07:43.000 Yeah, that's the beauty of it. 157 00:07:43.079 --> 00:07:46.160 LangChain was one of the earliest open source libraries 158 00:07:46.199 --> 00:07:49.800 to provide these core LLM and prompting building blocks, and 159 00:07:49.879 --> 00:07:52.720 it's taken off massively. The community is huge, over seventy 160 00:07:52.759 --> 00:07:55.360 two thousand members, twenty eight million downloads a month. 161 00:07:55.439 --> 00:07:56.120 It's staggering. 162 00:07:56.519 --> 00:07:59.319 What LangChain does is offer these simple abstractions for 163 00:07:59.439 --> 00:08:02.639 all those techniques we just discussed, zero-shot, CoT, 164 00:08:02.720 --> 00:08:07.079 RAG, tool calling, few-shot. Plus, it integrates seamlessly with 165 00:08:07.120 --> 00:08:11.279 all the major LLM providers, OpenAI, Anthropic, Google, and 166 00:08:11.360 --> 00:08:15.240 popular open source models like Llama. This common interface is 167 00:08:15.279 --> 00:08:17.600 a really big deal. It means you can easily experiment, 168 00:08:17.920 --> 00:08:21.519 swap out different LLMs, and crucially, avoid being locked into 169 00:08:21.560 --> 00:08:22.399 a single provider. 170 00:08:22.720 --> 00:08:24.199 That gives you huge flexibility. 171 00:08:24.319 --> 00:08:27.480 That flexibility sounds key. Okay, so this brings us to 172 00:08:27.560 --> 00:08:30.560 a really crucial challenge, especially for us, the builders. If 173 00:08:30.560 --> 00:08:33.759 these LLMs are brilliant, but they fundamentally can't know everything.
174 00:08:33.759 --> 00:08:36.639 They weren't trained on my company's latest financials or yesterday's news, 175 00:08:36.919 --> 00:08:38.679 how do we stop them from just making stuff up, 176 00:08:38.720 --> 00:08:41.000 from hallucinating when we ask about that information? 177 00:08:41.440 --> 00:08:42.960 You've nailed the core problem. 178 00:08:43.600 --> 00:08:46.960 Just relying on the LLM's pretrained knowledge often isn't 179 00:08:47.039 --> 00:08:50.799 enough for real-world apps, precisely because, like you said, 180 00:08:50.919 --> 00:08:53.600 they lack private data, stuff not on the public Internet, 181 00:08:53.799 --> 00:08:56.159 and they lack knowledge of current events because of their 182 00:08:56.200 --> 00:08:57.240 knowledge cutoff date. 183 00:08:57.879 --> 00:08:58.519 When they don't have 184 00:08:58.519 --> 00:09:02.519 the information they need, they tend to hallucinate, generating plausible 185 00:09:02.519 --> 00:09:06.639 sounding but incorrect or even totally fabricated answers. 186 00:09:06.360 --> 00:09:08.519 Which can be dangerous in a real application. 187 00:09:08.720 --> 00:09:12.120 Absolutely, and that's exactly where retrieval augmented generation, RAG, 188 00:09:12.480 --> 00:09:16.440 becomes essential. It's basically your defense mechanism against hallucination, by 189 00:09:16.440 --> 00:09:19.039 providing the necessary context. RAG. 190 00:09:19.279 --> 00:09:21.799 So walk us through how it actually works. How does 191 00:09:21.840 --> 00:09:25.080 it ground the LLM in specific, maybe private or very 192 00:09:25.120 --> 00:09:26.080 current information? 193 00:09:26.559 --> 00:09:30.200 Okay, so, RAG is specifically designed to enhance the 194 00:09:30.240 --> 00:09:35.080 accuracy of outputs generated by LLMs by providing context from 195 00:09:35.200 --> 00:09:39.440 external sources.
Meta AI actually coined the term, and their 196 00:09:39.480 --> 00:09:42.679 research found that RAG makes models more factual and specific. 197 00:09:43.080 --> 00:09:46.000 The whole process generally involves four key steps for getting 198 00:09:46.000 --> 00:09:50.399 your documents ready, sometimes called ingestion or indexing. First, you 199 00:09:50.480 --> 00:09:53.480 extract the text from whatever documents you have. LangChain 200 00:09:53.559 --> 00:09:56.519 has helpers for this, like TextLoader for plain text files, 201 00:09:56.600 --> 00:09:59.639 or PDFLoader for PDFs, and many others. Simple enough, 202 00:10:00.120 --> 00:10:02.799 get the text out, step one. Step two, you split that 203 00:10:02.879 --> 00:10:06.440 text into manageable chunks. This is really important because, as 204 00:10:06.440 --> 00:10:09.080 we mentioned, LLMs have a context window, a limit on 205 00:10:09.120 --> 00:10:11.000 how much text they can look at in one go. 206 00:10:11.360 --> 00:10:13.559 Can't just feed it a five hundred page document, right? It's 207 00:10:13.559 --> 00:10:16.399 too big. Exactly. So, tools like LangChain's 208 00:10:16.440 --> 00:10:19.960 RecursiveCharacterTextSplitter cleverly break the text down. It tries 209 00:10:20.000 --> 00:10:23.120 to split along natural boundaries first, like paragraphs and sentences, 210 00:10:23.159 --> 00:10:25.840 then words, to keep things coherent. You can configure the 211 00:10:25.919 --> 00:10:28.600 chunk size and also add some chunk overlap, meaning consecutive 212 00:10:28.679 --> 00:10:30.120 chunks share a bit of text. 213 00:10:30.440 --> 00:10:34.000 Ah, overlap helps maintain context across the breaks. 214 00:10:34.240 --> 00:10:37.120 Precisely. It's like making sure the end of one chapter 215 00:10:37.159 --> 00:10:39.639 flows smoothly into the start of the next. Third step: 216 00:10:40.200 --> 00:10:44.799 you convert these text chunks into numbers, specifically into embeddings.
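The chunk size and chunk overlap settings from the splitting step can be sketched without any library. Note the swap: this is a plain fixed-size character splitter, not LangChain's recursive splitter, so it ignores paragraph and sentence boundaries. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
```

Each chunk starts with the last three characters of the previous one, which is the "shared bit of text" that keeps context from being lost at a break. A boundary-aware splitter adds the extra step of preferring paragraph, sentence, and word breaks before falling back to raw characters.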
217 00:10:44.960 --> 00:10:47.639 Embeddings. Okay, this sounds like where the magic happens. 218 00:10:47.720 --> 00:10:48.399 It kind of is. 219 00:10:48.919 --> 00:10:51.519 Think of an embedding as a long list of numbers, 220 00:10:51.559 --> 00:10:54.759 a vector, that represents the meaning of that text chunk. 221 00:10:55.159 --> 00:10:58.360 Now, it's a lossy representation. You can't perfectly reconstruct the 222 00:10:58.360 --> 00:11:01.080 original words just from the numbers, like you can't get 223 00:11:01.120 --> 00:11:02.639 perfect CD quality back 224 00:11:02.519 --> 00:11:03.279 from an MP3. 225 00:11:03.399 --> 00:11:05.320 But it captures the essence. Exactly. 226 00:11:05.399 --> 00:11:08.200 It captures the semantic essence, and this allows for math 227 00:11:08.279 --> 00:11:11.120 on words. This is a huge leap from older systems 228 00:11:11.120 --> 00:11:15.080 that just did keyword searching. LLM-based embeddings, or semantic 229 00:11:15.120 --> 00:11:16.600 embeddings, understand meaning. 230 00:11:16.799 --> 00:11:19.879 Okay, this is fascinating. How do we teach a computer 231 00:11:19.960 --> 00:11:24.720 the difference between, say, lion, pet, and dog? I get 232 00:11:24.759 --> 00:11:27.440 they're related, but how does the computer quantify that? If 233 00:11:27.440 --> 00:11:31.519 we connect this to the bigger picture, this cosine similarity 234 00:11:31.600 --> 00:11:36.960 idea, quantifying how close pet and dog are numerically versus lion. 235 00:11:37.559 --> 00:11:40.879 That seems powerful, but how does that number crunching actually 236 00:11:40.960 --> 00:11:42.039 enable better search? 237 00:11:42.279 --> 00:11:44.399 That's a fantastic question. Really gets to the heart of 238 00:11:44.440 --> 00:11:48.919 semantic search. Imagine all these words, or rather the concepts 239 00:11:48.919 --> 00:11:53.159 they represent, existing as points in some vast high-dimensional space.
240 00:11:53.759 --> 00:11:56.480 The embedding vectors for pet and dog would literally be 241 00:11:56.559 --> 00:11:58.919 mapped closer together in this space than either would be 242 00:11:58.960 --> 00:12:02.519 to lion, because their meanings are more related. Cosine similarity 243 00:12:02.600 --> 00:12:04.679 is just the mathematical tool we use to measure the 244 00:12:04.720 --> 00:12:07.399 angle or the closeness between these vectors. It gives a 245 00:12:07.440 --> 00:12:11.360 score usually between negative one, opposite meaning, and one, identical meaning. 246 00:12:11.720 --> 00:12:14.679 So pet and dog would have a cosine similarity score 247 00:12:14.799 --> 00:12:16.559 much closer to one than pet and lion. 248 00:12:16.759 --> 00:12:21.279 Ah. Okay, so similarity means closer in this meaning space. Exactly. 249 00:12:21.759 --> 00:12:24.679 And this ability to turn text into embeddings that capture 250 00:12:24.679 --> 00:12:28.559 deep meaning lets us search based on concepts, not just keywords. 251 00:12:28.840 --> 00:12:31.519 You could search for "happy house animal" and the system 252 00:12:31.519 --> 00:12:34.960 could find documents talking about joyful puppies or content cats, 253 00:12:35.240 --> 00:12:38.399 even if the exact words "happy", "house", or "animal" aren't there. 254 00:12:38.559 --> 00:12:40.720 It understands the underlying meaning is similar. 255 00:12:40.799 --> 00:12:44.759 That's incredibly powerful search. Okay, so we've extracted, split, and 256 00:12:44.840 --> 00:12:47.039 embedded the text. What's the final step? 257 00:12:47.320 --> 00:12:49.320 The fourth step is to store these embeddings in a 258 00:12:49.399 --> 00:12:53.440 vector store. Think of this as a specialized database designed 259 00:12:53.480 --> 00:12:57.440 to store these numerical vectors and perform those complex similarity 260 00:12:57.440 --> 00:13:02.000 calculations, like cosine similarity, really efficiently and quickly.
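The pet, dog, and lion comparison can be worked through numerically. This is a minimal sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The three-number "embeddings" are hand-made, hypothetical values chosen only for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, for illustration only.
# Dimensions loosely stand for: lives at home, is an animal, is dangerous.
embeddings = {
    "pet": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "lion": [0.1, 0.9, 0.9],
}

pet_dog = cosine_similarity(embeddings["pet"], embeddings["dog"])
pet_lion = cosine_similarity(embeddings["pet"], embeddings["lion"])
```

With these made-up values, `pet_dog` lands close to one while `pet_lion` is clearly lower, matching the intuition in the conversation that pet and dog sit nearer each other in the meaning space.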
There are 261 00:13:02.039 --> 00:13:05.039 lots of options, open source ones like pgvector, an 262 00:13:05.080 --> 00:13:08.679 extension for PostgreSQL, dedicated databases like Weaviate or 263 00:13:08.720 --> 00:13:12.519 Pinecone, or cloud services. When you, the user, ask 264 00:13:12.600 --> 00:13:16.000 a question, your question is also converted into an embedding vector. 265 00:13:16.639 --> 00:13:19.639 The vector store then rapidly finds the stored embeddings and 266 00:13:19.679 --> 00:13:23.559 their corresponding text chunks that are most similar mathematically to 267 00:13:23.600 --> 00:13:26.919 your query embedding. Those relevant chunks are then retrieved and 268 00:13:27.000 --> 00:13:29.320 passed to the LLM along with your original 269 00:13:29.080 --> 00:13:31.879 question, giving it the specific context it needs to answer 270 00:13:31.919 --> 00:13:32.879 accurately. Precisely. 271 00:13:32.919 --> 00:13:36.039 And LangChain also provides tools like its indexing API 272 00:13:36.080 --> 00:13:38.559 and record manager to help keep this vector store up 273 00:13:38.600 --> 00:13:41.600 to date. As your source documents change, you can efficiently 274 00:13:41.639 --> 00:13:45.000 track those changes, add new embeddings, remove old ones, and 275 00:13:45.039 --> 00:13:49.240 avoid costly reprocessing of unchanged documents, keeping the knowledge current. 276 00:13:49.519 --> 00:13:52.720 Okay, that makes sense. We've got the basic RAG pipeline down: 277 00:13:53.399 --> 00:13:57.559 index the data, retrieve relevant chunks, give them to the LLM. 278 00:13:57.960 --> 00:14:02.000 But I imagine building something truly production-ready involves more nuance. 279 00:14:02.639 --> 00:14:05.240 What are the common challenges, and how do we refine 280 00:14:05.279 --> 00:14:08.480 that search for knowledge to be even more accurate and robust?
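The retrieve-then-prompt flow recapped here (embed the question, find the most similar stored chunks, pass them to the LLM with the question) can be sketched end to end in memory. Everything below is a stand-in: `embed` is a toy bag-of-words counter rather than a real embedding model, so it only matches shared words, and `ToyVectorStore` is a hypothetical brute-force list, not pgvector, Weaviate, or Pinecone.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory list of (embedding, chunk) pairs with brute-force search."""

    def __init__(self):
        self.entries = []

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def search(self, query, k=2):
        """Return the k stored chunks most similar to the query, best first."""
        query_vector = embed(query)
        scored = [
            (cosine_similarity(query_vector, vector), chunk)
            for vector, chunk in self.entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]


store = ToyVectorStore()
for chunk in [
    "refunds are issued within 14 days of purchase",
    "our office is closed on public holidays",
    "shipping takes 3 to 5 business days",
]:
    store.add(chunk)

question = "how many days do refunds take"
context = store.search(question, k=1)
prompt = (
    "Answer using only this context:\n"
    + "\n".join(context)
    + "\n\nQuestion: "
    + question
)
```

The final `prompt` string is what would be sent to the model. Swapping `embed` for a real embedding model is what turns this keyword overlap into the semantic matching discussed earlier.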
281 00:14:08.799 --> 00:14:11.679 Yeah, moving from a basic RAG demo to production 282 00:14:12.279 --> 00:14:15.799 definitely introduces complexity. Users ask questions in all sorts of ways, 283 00:14:15.879 --> 00:14:19.399 sometimes ambiguously. Your data might live in multiple different places, 284 00:14:19.639 --> 00:14:22.080 and you often need to translate that natural language question 285 00:14:22.120 --> 00:14:25.000 into something more structured for retrieval. So we need more 286 00:14:25.039 --> 00:14:25.879 advanced strategies. 287 00:14:26.000 --> 00:14:26.639 Okay, what are they? 288 00:14:26.759 --> 00:14:30.039 The book highlights three main categories of strategy. The first 289 00:14:30.080 --> 00:14:33.480 is query transformation. The idea here is to modify the 290 00:14:33.559 --> 00:14:36.840 user's input before you even search, to improve the chances 291 00:14:36.840 --> 00:14:38.200 of finding the best documents. 292 00:14:38.360 --> 00:14:40.759 Ah, like cleaning up the question first? Exactly. 293 00:14:40.799 --> 00:14:44.240 One technique is rewrite-retrieve-read. Here, you actually use 294 00:14:44.279 --> 00:14:47.360 another LLM call first, just to rewrite the user's potentially 295 00:14:47.480 --> 00:14:51.600 vague or conversational query into a clearer, more focused search query. 296 00:14:51.919 --> 00:14:54.399 Then you use that rewritten query for the retrieval step. 297 00:14:54.720 --> 00:14:58.919 Smart, like having an assistant clarify your question before searching. 298 00:14:58.960 --> 00:15:00.480 Does it add much delay? 299 00:15:00.879 --> 00:15:02.720 Yeah, it's a little bit of latency, yeah, because it's 300 00:15:02.720 --> 00:15:06.039 an extra LLM call, but often the improvement in retrieval 301 00:15:06.120 --> 00:15:10.440 quality is worth it. Another transformation technique is multi-query retrieval.
302 00:15:11.320 --> 00:15:14.120 Instead of just one query, you have the LLM generate 303 00:15:14.559 --> 00:15:18.480 multiple versions of the given user question, maybe from slightly 304 00:15:18.480 --> 00:15:20.240 different angles or using different keywords. 305 00:15:20.320 --> 00:15:21.600 Oh, interesting, why do that? 306 00:15:21.879 --> 00:15:24.519 It's great for complex questions that might need information from 307 00:15:24.639 --> 00:15:28.159 multiple perspectives. You run retrievals for all those generated queries 308 00:15:28.200 --> 00:15:31.960 in parallel, then combine the unique documents found. It casts 309 00:15:31.960 --> 00:15:34.720 a wider net, reducing the chance you miss something important. 310 00:15:35.200 --> 00:15:39.120 Building on that is RAG fusion. It starts like multi-query, 311 00:15:39.399 --> 00:15:42.679 generating multiple queries and retrieving results for each, but then 312 00:15:42.679 --> 00:15:45.759 it has a crucial final reranking step using something 313 00:15:45.759 --> 00:15:47.360 called the reciprocal rank fusion, 314 00:15:47.519 --> 00:15:48.559 RRF, algorithm. 315 00:15:49.519 --> 00:15:50.320 Sounds technical. 316 00:15:50.559 --> 00:15:52.919 It's a clever way to combine the rankings from all 317 00:15:52.960 --> 00:15:57.279 the different searches. Documents that consistently rank highly across multiple 318 00:15:57.360 --> 00:16:00.720 queries get boosted to the very top. Really effective at 319 00:16:00.720 --> 00:16:03.559 finding the most relevant stuff while also broadening discovery. 320 00:16:03.639 --> 00:16:07.159 Okay, so RRF aggregates the wisdom of multiple searches. Cool. 321 00:16:07.480 --> 00:16:09.039 Any other transformation tricks? 322 00:16:09.360 --> 00:16:13.519 One more interesting one is hypothetical document embeddings, or HyDE. 323 00:16:14.200 --> 00:16:17.519 This is kind of counterintuitive.
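The reranking step in RAG fusion, reciprocal rank fusion, is small enough to sketch in full. Each document earns 1 / (k + rank) from every ranked list it appears in, and the totals decide the final order. The constant k = 60 is a commonly used default and is an assumption here, since the episode does not give a value; the document IDs and rankings are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results for three rewritten versions of one question.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]

fused = reciprocal_rank_fusion(rankings)
```

In this example `doc_b` ends up first because it ranks at or near the top in all three lists, while `doc_d`, which appears only once and in last place, ends up at the bottom. That is the "consistently ranks highly gets boosted" behavior described in the conversation.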
Instead of searching with the user's query, 324 00:16:17.679 --> 00:16:21.399 you first have an LLM create a hypothetical document that 325 00:16:21.480 --> 00:16:23.039 would be a perfect answer to the query. 326 00:16:23.120 --> 00:16:24.679 Wait, it makes up an answer first? 327 00:16:24.759 --> 00:16:25.159 Yeah. 328 00:16:25.360 --> 00:16:28.519 The intuition is that this generated ideal answer, even though 329 00:16:28.519 --> 00:16:32.480 it's hypothetical, is often semantically more similar to the actual 330 00:16:32.559 --> 00:16:36.919 relevant documents than the original, maybe short or ambiguous, user query. 331 00:16:37.639 --> 00:16:40.960 So you embed this hypothetical answer and use that embedding 332 00:16:41.320 --> 00:16:42.440 for the similarity search. 333 00:16:42.799 --> 00:16:45.799 Huh, that's clever. Using an ideal answer as a better 334 00:16:45.799 --> 00:16:49.360 search query. Okay, so that's query transformation. What's the second strategy? 335 00:16:49.600 --> 00:16:53.159 The second strategy is query routing. This tackles the problem 336 00:16:53.200 --> 00:16:56.960 you mentioned earlier: what if your data lives in different places? 337 00:16:57.200 --> 00:17:00.399 Maybe you have Python docs in one vector store and 338 00:17:00.519 --> 00:17:01.919 JavaScript docs in another. 339 00:17:02.240 --> 00:17:04.240 Right, how do you send the query to the right place? 340 00:17:04.319 --> 00:17:07.960 That's exactly what query routing does: forward a user's query 341 00:17:08.000 --> 00:17:11.000 to the relevant data source. There are a couple of 342 00:17:11.000 --> 00:17:14.880 ways. Logical routing uses an LLM to make the decision.
343 00:17:15.680 --> 00:17:18.680 You give the LLM descriptions of your available data sources, 344 00:17:18.720 --> 00:17:23.000 like "this index contains technical documentation for Python," and based 345 00:17:23.000 --> 00:17:25.640 on the user's query, the LLM picks which of the 346 00:17:25.680 --> 00:17:29.480 available indexes to use. LangChain helps ensure the LLM 347 00:17:29.519 --> 00:17:32.720 outputs its choice in a structured way your application can understand. 348 00:17:32.920 --> 00:17:35.799 So the LLM acts like a switchboard operator? Kind of, 349 00:17:35.880 --> 00:17:36.200 yeah. 350 00:17:36.559 --> 00:17:38.359 Alternatively, there's semantic routing. 351 00:17:38.839 --> 00:17:39.079 Here, 352 00:17:39.240 --> 00:17:42.319 you embed the descriptions of your data sources themselves. Then 353 00:17:42.359 --> 00:17:45.279 you compare the user's query embedding to these description embeddings. 354 00:17:45.599 --> 00:17:48.839 The closest match indicates the most relevant data source. This 355 00:17:48.920 --> 00:17:51.440 is more dynamic, doesn't require an LLM call for every 356 00:17:51.519 --> 00:17:52.279 routing decision. 357 00:17:52.480 --> 00:17:55.759 Okay, route it logically or semantically. Makes sense. What's the 358 00:17:55.799 --> 00:17:57.079 third major strategy? 359 00:17:57.319 --> 00:18:00.680 The third is query construction. This is about transforming a 360 00:18:00.759 --> 00:18:04.119 natural language query into the query language of the database 361 00:18:04.200 --> 00:18:07.279 or data source you're interacting with. It goes beyond 362 00:18:07.359 --> 00:18:10.359 just finding similar text chunks. Oh? So, well, maybe you 363 00:18:10.400 --> 00:18:14.759 need to combine semantic search with traditional database filters.
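Looking back at the semantic routing idea from a moment ago (embed each data source's description, then send the query to whichever description is closest), a minimal sketch might look like this. The source names, descriptions, and the bag-of-words `embed` function are all hypothetical stand-ins; a real router would use an actual embedding model so that related wording, not just shared words, counts as a match.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical data sources and their descriptions.
SOURCE_DESCRIPTIONS = {
    "python_docs": "technical documentation for python code functions and modules",
    "js_docs": "technical documentation for javascript code browsers and node",
}

# Embed each description once, up front, so routing needs no LLM call.
DESCRIPTION_VECTORS = {
    name: embed(text) for name, text in SOURCE_DESCRIPTIONS.items()
}


def route(query):
    """Return the name of the source whose description is closest to the query."""
    query_vector = embed(query)
    return max(
        DESCRIPTION_VECTORS,
        key=lambda name: cosine_similarity(query_vector, DESCRIPTION_VECTORS[name]),
    )


python_choice = route("how do I import modules in python")
js_choice = route("how do I read a file in node with javascript")
```

Because the description embeddings are computed once, each routing decision costs only one query embedding and a handful of similarity comparisons, which is the trade-off against logical routing's per-query LLM call.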
Text 364 00:18:14.759 --> 00:18:17.599 to metadata filter is a technique where the LLM extracts 365 00:18:17.640 --> 00:18:21.640 structured information, like a date, a category, a price range, 366 00:18:21.880 --> 00:18:24.079 directly from the user's natural language query. 367 00:18:24.279 --> 00:18:26.480 Ah, so if I ask for sci-fi movies from 368 00:18:26.480 --> 00:18:29.680 the eighties, it pulls out sci-fi for semantic search 369 00:18:29.759 --> 00:18:32.599 and nineteen eighties as a metadata filter. Exactly. 370 00:18:32.640 --> 00:18:35.079 It lets you combine the power of semantic understanding with 371 00:18:35.079 --> 00:18:38.279 the precision of structured filters. Another big one here is 372 00:18:38.359 --> 00:18:41.640 text-to-SQL. This involves having the LLM translate a 373 00:18:41.720 --> 00:18:45.039 natural language question, like "what were our total sales in 374 00:18:45.160 --> 00:18:48.640 Q three?", directly into an executable SQL query to run 375 00:18:48.680 --> 00:18:50.519 against a traditional relational database. 376 00:18:50.680 --> 00:18:53.319 Wow, that's powerful. How do you make that reliable? SQL 377 00:18:53.319 --> 00:18:53.920 can be tricky. 378 00:18:54.160 --> 00:18:57.480 It requires careful setup. You usually need to provide the 379 00:18:57.599 --> 00:19:00.559 LLM with a description of the database schema, like the 380 00:19:00.920 --> 00:19:04.640 CREATE TABLE statements, maybe some example rows, and often some 381 00:19:04.799 --> 00:19:08.319 few-shot examples of natural language questions paired with their 382 00:19:08.359 --> 00:19:10.240 correct SQL queries 383 00:19:09.839 --> 00:19:10.279 to guide it. 384 00:19:10.599 --> 00:19:14.359 Got it. So, text-to-SQL translates language to database code. 385 00:19:14.759 --> 00:19:17.240 If we connect this text-to-SQL capability to the 386 00:19:17.279 --> 00:19:21.559 bigger picture,
though, while it's incredibly powerful, letting an LLM 387 00:19:21.680 --> 00:19:25.720 generate SQL queries directly from potentially untrusted user input is 388 00:19:25.720 --> 00:19:27.519 one of the riskiest things you can do in a 389 00:19:27.640 --> 00:19:28.960 production application. Ah, 390 00:19:29.000 --> 00:19:31.279 security implications. Yeah, I can see that. 391 00:19:31.359 --> 00:19:34.960 Absolutely. This raises a really important question around safety. 392 00:19:35.200 --> 00:19:38.880 You must implement critical security measures, things like ensuring the 393 00:19:38.960 --> 00:19:42.480 database connection has read-only permissions, strictly limiting access to 394 00:19:42.519 --> 00:19:45.799 only the necessary tables, maybe even views, and definitely adding 395 00:19:45.839 --> 00:19:49.440 query timeouts to prevent denial of service attacks or runaway queries. 396 00:19:49.880 --> 00:19:53.720 It's a capability that demands extreme caution and robust safeguards. 397 00:19:54.079 --> 00:19:57.440 Absolutely crucial point. Okay, this is getting really interesting, especially 398 00:19:57.440 --> 00:20:00.400 for building interactive apps. How do we tackle the fact 399 00:20:00.440 --> 00:20:03.920 that LLMs are inherently forgetful? How do we give them 400 00:20:04.039 --> 00:20:07.160 memory to actually hold a conversation, especially as things get 401 00:20:07.200 --> 00:20:07.880