WEBVTT 1 00:00:00.160 --> 00:00:02.480 Welcome to the Deep Dive, the show that navigates the 2 00:00:02.560 --> 00:00:06.559 labyrinth of information, distilling the essence of what truly matters. 3 00:00:07.240 --> 00:00:11.199 I vividly remember the first time I interacted with GPT three. 4 00:00:11.480 --> 00:00:14.560 Oh yeah, it felt like an almost magical experience. For 5 00:00:14.640 --> 00:00:18.160 the first time, it genuinely seemed like the computer understood 6 00:00:18.160 --> 00:00:22.800 my complex inputs and could react appropriately, you know, solving 7 00:00:22.879 --> 00:00:26.480 diverse tasks from text analysis to coding, just based on 8 00:00:26.519 --> 00:00:29.839 my instructions. It was a complete game changer, especially compared 9 00:00:29.879 --> 00:00:34.039 to the well the prior neural networks that always needed specialized, 10 00:00:34.079 --> 00:00:35.719 hand labeled training data. 11 00:00:35.759 --> 00:00:38.640 It truly redefined what we thought was possible with AI. 12 00:00:38.799 --> 00:00:43.399 The leap was just undeniable, absolutely, and today our Deep 13 00:00:43.439 --> 00:00:45.920 Dive is all about harnessing that power. We're going to 14 00:00:45.960 --> 00:00:48.799 explore how these incredible language models can be used specifically 15 00:00:48.880 --> 00:00:51.039 for data analysis, helping you make the most of your 16 00:00:51.079 --> 00:00:54.200 data sets. Right, They've evolved so rapidly, moving from just 17 00:00:54.439 --> 00:00:59.119 processing text to understanding multimodal inputs that's images, audio, video, 18 00:00:59.200 --> 00:00:59.880 and of course tech. 19 00:01:00.079 --> 00:01:01.920 Yeah, the multi modality is huge. 20 00:01:02.079 --> 00:01:05.439 This expansion makes them an invaluable tool across pretty much 21 00:01:05.439 --> 00:01:06.840 every facet of data science. 22 00:01:07.120 --> 00:01:09.680 Our mission in this deep dive really is to show 23 00:01:09.680 --> 00:01:13.799 you how llms can act as expert guides to your data, 24 00:01:13.920 --> 00:01:17.280 offering a genuine shortcut to being well informed. A shortcut 25 00:01:17.319 --> 00:01:19.719 I like that will delve into how they extract the 26 00:01:19.760 --> 00:01:24.040 most important nuggets of knowledge and insight from various sources 27 00:01:24.400 --> 00:01:28.319 and empower you to build complex analysis pipelines with just 28 00:01:28.359 --> 00:01:31.400 a few lines of Python code, all driven by natural 29 00:01:31.480 --> 00:01:32.359 language instructions. 30 00:01:32.359 --> 00:01:34.840 Okay, let's unpack the core of this magic. Then it 31 00:01:34.879 --> 00:01:38.959 all begins with what GPT actually stands for, generative pre 32 00:01:39.040 --> 00:01:44.359 trained transformer. Generative is key here, meaning these models don't 33 00:01:44.400 --> 00:01:47.719 just classify or recognize things. They create new content, whether 34 00:01:47.760 --> 00:01:49.680 it's text, code or even images. 35 00:01:49.959 --> 00:01:52.680 And the pre trained aspect means they've learned from truly 36 00:01:52.719 --> 00:01:55.959 immense amounts of data, vast swaths of the Internet, books, 37 00:01:56.319 --> 00:01:59.480 and more, enabling them to understand languid broadly, not just 38 00:01:59.560 --> 00:02:03.640 you know, specific narrow task. This generic understanding allows them 39 00:02:03.680 --> 00:02:06.599 to then adapt to specialized tasks much more efficiently. And 40 00:02:06.640 --> 00:02:10.680 the transformer part, ah, that's the underlying neural network architecture, 41 00:02:11.719 --> 00:02:13.919 the brilliant design that makes all this possible. 42 00:02:14.400 --> 00:02:17.240 So how does this fundamental design let them tackle such 43 00:02:17.280 --> 00:02:18.919 a wide array of problems. 44 00:02:19.120 --> 00:02:22.240 What's truly fascinating is how this design allows lms to 45 00:02:22.240 --> 00:02:26.520 be universal task solvers. Unlike earlier models built for one 46 00:02:26.560 --> 00:02:31.479 specific purpose, llms are designed intended to serve as universal 47 00:02:31.520 --> 00:02:35.479 task solvers that can, in principle, solve any task the 48 00:02:35.560 --> 00:02:36.280 user desires. 49 00:02:36.800 --> 00:02:38.439 Any task wow. 50 00:02:38.199 --> 00:02:41.360 Well within reason. The way you communicate with them is 51 00:02:41.360 --> 00:02:44.319 through prompting. Think of a prompt as your direct instruction 52 00:02:44.400 --> 00:02:46.280 to the model. So the input you give it, and 53 00:02:46.319 --> 00:02:49.400 it can be multimodal, combining text with images or other 54 00:02:49.479 --> 00:02:53.560 data types. A really effective plompt needs a clear task description, 55 00:02:53.960 --> 00:02:56.240 all the relevant context like are we talking about reviewing 56 00:02:56.280 --> 00:02:58.080 laptops or lawnmowers for example? Right? 57 00:02:58.120 --> 00:02:59.520 Context matter Context is. 58 00:02:59.520 --> 00:03:02.639 Critical, and crucially, it can optionally include a few examples 59 00:03:02.680 --> 00:03:03.479 to guide the model. 60 00:03:03.800 --> 00:03:06.639 So if the prompt is the key, how much handholding 61 00:03:06.680 --> 00:03:09.159 do we actually need to give the model? Does it 62 00:03:09.240 --> 00:03:11.280 learn from a few examples or can it just get 63 00:03:11.280 --> 00:03:12.560 it from the description alone? 64 00:03:12.719 --> 00:03:15.960 Yeah, that brings us to fu shot learning versus zero 65 00:03:16.039 --> 00:03:18.879 shot learning. FU shot learning is when you provide those 66 00:03:18.919 --> 00:03:21.759 few examples directly in your prompt to show the model 67 00:03:21.840 --> 00:03:25.360 exactly what you expect showing your work exactly. It's like 68 00:03:25.400 --> 00:03:28.280 showing someone a couple of solved puzzles so they understand 69 00:03:28.280 --> 00:03:31.960 the pattern. Zero shot learning, on the other hand, means 70 00:03:32.080 --> 00:03:35.960 you're relying solely on your task description with no examples provided, 71 00:03:36.080 --> 00:03:39.120 and that works. It's impressive how often llms can still 72 00:03:39.120 --> 00:03:43.280 perform effectively even with zero shot prompting. It really depends 73 00:03:43.280 --> 00:03:45.479 on the task complexity and the model itself. 74 00:03:46.000 --> 00:03:48.879 And it's important to distinguish between the types of data 75 00:03:49.000 --> 00:03:51.719 lllms work with, right structured versus unstructured. 76 00:03:51.759 --> 00:03:55.080 Absolutely, we have structured data that's your tables grabs, anything 77 00:03:55.120 --> 00:03:59.039 with a fixed format that specialized tools can process very efficiently. 78 00:03:59.360 --> 00:04:02.439 For this primarily act as an intelligent. 79 00:04:01.960 --> 00:04:04.560 Interface, got it, like a translator kind of. 80 00:04:04.680 --> 00:04:08.840 Then there's unstructured data text, images, audio video, where llms 81 00:04:08.879 --> 00:04:12.080 operate directly on the raw content. A critical point for 82 00:04:12.080 --> 00:04:14.800 anyone using these models, and something that often surprises people, 83 00:04:15.159 --> 00:04:18.920 is that interacting with language models incurs monetary fees. AH 84 00:04:19.079 --> 00:04:22.079 the cost yes proportional to the amount of data process 85 00:04:22.720 --> 00:04:27.199 and using larger language models, well that's often more expensive significantly, 86 00:04:27.199 --> 00:04:28.000 So sometimes how. 87 00:04:28.000 --> 00:04:28.879 Do they measure that cost? 88 00:04:29.079 --> 00:04:32.279 These costs are calculated in tokens. Think of tokens as 89 00:04:32.319 --> 00:04:36.120 the smallest, meaningful lego bricks of language. So if I 90 00:04:36.160 --> 00:04:39.079 say Hello World, that might be just a few tokens. 91 00:04:39.199 --> 00:04:41.360 It's roughly four characters a text, give or take. 92 00:04:41.399 --> 00:04:43.399 That's a good way to put it. So for many 93 00:04:43.399 --> 00:04:46.600 of us are first let's say, dance with an LLM 94 00:04:46.800 --> 00:04:50.519 was likely through the chat GPT web interface. Ugly, most 95 00:04:50.519 --> 00:04:52.800 of you have probably already dabbled there, accessing it at 96 00:04:52.879 --> 00:04:55.759 chat dot OpenEye dot com. It's a great sandbox for 97 00:04:55.839 --> 00:05:00.279 quick text processing or even exploring its data analysis capabilities. 98 00:05:00.079 --> 00:05:03.199 And in that web interface you can perform some genuinely 99 00:05:03.279 --> 00:05:09.000 practical tasks. For text processing, classification is straightforward determining the 100 00:05:09.040 --> 00:05:11.959 sentiment of a movie review or sorting a product review, 101 00:05:12.120 --> 00:05:13.959 like for a I don't know a banana book a 102 00:05:13.959 --> 00:05:17.079 banana book into its correct category. You can even hint 103 00:05:17.079 --> 00:05:20.519 it your desired output format simply by saying answer concisely. 104 00:05:20.800 --> 00:05:21.720 Nice. 105 00:05:21.759 --> 00:05:25.439 For information extraction, it's brilliant at pulling structured data from 106 00:05:25.480 --> 00:05:29.720 freeform text, like gathering a name, GPA, and degree from 107 00:05:29.720 --> 00:05:31.120 a stack of applicant emails. 108 00:05:31.240 --> 00:05:33.839 And what's truly impressive is how it handles tables right 109 00:05:34.120 --> 00:05:34.920 right there in the chat. 110 00:05:35.079 --> 00:05:37.639 It really is. You can upload a dot csv file, 111 00:05:37.680 --> 00:05:41.959 for instance, review stable dot csv. Chat SHPT doesn't just 112 00:05:42.000 --> 00:05:45.240 display it. If you've enabled to write features, it generates 113 00:05:45.240 --> 00:05:48.120 an executes Python code behind the scenes to analyze that 114 00:05:48.199 --> 00:05:49.199 data hikon code. 115 00:05:49.319 --> 00:05:50.079 Really Yeah. 116 00:05:50.519 --> 00:05:52.199 You can even peak at the code by clicking the 117 00:05:52.199 --> 00:05:57.000 show analysis button. This demonstrates lms acting as intelligent orchestrators 118 00:05:57.000 --> 00:05:58.160 for external tools. 119 00:05:58.360 --> 00:05:58.639 Wow. 120 00:05:58.800 --> 00:06:02.240 They also excel as translation, converting natural language questions into 121 00:06:02.279 --> 00:06:04.639 formal query languages like sql. 122 00:06:04.319 --> 00:06:05.519 Ah SQL generation. 123 00:06:05.920 --> 00:06:09.600 That's useful, very You can then execute that SEQL on 124 00:06:09.639 --> 00:06:13.079 your own platform, say and squally database in a Google 125 00:06:13.120 --> 00:06:18.439 collab notebook. It's fantastic for writing complex multiline queries, and 126 00:06:18.480 --> 00:06:21.000 that handy copy code button makes it so easy to 127 00:06:21.040 --> 00:06:22.279 grab the generated sequel. 128 00:06:22.399 --> 00:06:26.680 That sounds incredibly powerful, but it also raises an important question. 129 00:06:26.879 --> 00:06:29.879 Can we truly trust everything? A LLLM tells us. 130 00:06:29.800 --> 00:06:31.920 That's a crucial point, and it's one of the biggest challenges. 131 00:06:32.439 --> 00:06:37.600 The term hallucinations refers to situations where lms will invent 132 00:06:37.680 --> 00:06:40.920 new content in the absence of information, invent things. Yes, 133 00:06:41.120 --> 00:06:43.680 and the truly profound insight here isn't just that they 134 00:06:43.680 --> 00:06:46.839 invent things, but that they do so with such convincing 135 00:06:46.879 --> 00:06:49.360 confidence it sounds completely plausible. 136 00:06:49.439 --> 00:06:51.199 Oh that's dangerous, it can be. 137 00:06:51.360 --> 00:06:54.920 This fundamentally shifts our perspective, and LLLLM doesn't know in 138 00:06:54.959 --> 00:06:57.759 the human sense, it generates plausibly, Yeah, forcing us to 139 00:06:57.800 --> 00:07:00.879 rethink how we trust automated information. So it's essential to 140 00:07:00.879 --> 00:07:03.959 always verify the output, Always double check before relying on it. 141 00:07:04.319 --> 00:07:06.000 Use alternative sources for corroboration. 142 00:07:06.199 --> 00:07:07.160 Okay, always verify. 143 00:07:07.279 --> 00:07:07.560 Got it. 144 00:07:07.879 --> 00:07:10.319 So, while the web interface is great for chatting and 145 00:07:10.399 --> 00:07:14.040 quick tasks, it's not really designed for building robust, complex 146 00:07:14.319 --> 00:07:15.759 data processing pipelines. 147 00:07:16.000 --> 00:07:16.519 Not really. 148 00:07:16.600 --> 00:07:18.959 No, for that, we need to go deeper into the code. 149 00:07:18.959 --> 00:07:21.680 This is where the open ai Python library comes into play. 150 00:07:21.800 --> 00:07:25.680 Exactly. The Python library allows you to directly invoke llms 151 00:07:25.920 --> 00:07:28.720 as a subfunction within your own code, giving you much 152 00:07:28.800 --> 00:07:30.000 more programmatic control. 153 00:07:30.079 --> 00:07:31.240 How do you get started with that? 154 00:07:31.600 --> 00:07:33.879 To get set up, you'll need Python three point nine 155 00:07:34.040 --> 00:07:37.360 or later, and then simply install the opene library using 156 00:07:37.399 --> 00:07:41.600 pip standard stuff. Okay, critically, you'll need an API key 157 00:07:41.680 --> 00:07:44.639 from open Ai, and it is highly recommended to store 158 00:07:44.680 --> 00:07:47.000 this securely as an environment variable. 159 00:07:46.720 --> 00:07:48.000 Right, don't just paste it in your. 160 00:07:48.000 --> 00:07:51.079 Code, absolutely not. Never ever share your code if it 161 00:07:51.120 --> 00:07:54.120 contains your open AI access key directly, as others could 162 00:07:54.160 --> 00:07:56.439 use it to encour charges on your account. Very important. 163 00:07:56.519 --> 00:07:57.519 Okay, key secured? 164 00:07:57.720 --> 00:08:01.319 Then what when using check completetion in Python, you can 165 00:08:01.360 --> 00:08:04.439 struct a list of messages. Each message has a role 166 00:08:04.680 --> 00:08:07.920 user for your input, assistant for the model's reply or 167 00:08:08.000 --> 00:08:11.040 system for instructions about the model's persona or behavior. 168 00:08:11.399 --> 00:08:14.279 System user assistant okay. 169 00:08:14.120 --> 00:08:17.759 And then the actual content of the message the client 170 00:08:17.839 --> 00:08:21.639 dot chat, dot completions, dot create function handles setting this off. 171 00:08:22.120 --> 00:08:26.560 Remember token usage, specifically, the total tokens attribute in the 172 00:08:26.680 --> 00:08:30.079 response you get back directly impacts cost right back to 173 00:08:30.120 --> 00:08:32.799 the tokens, and tokens generated by the model are often 174 00:08:32.840 --> 00:08:35.480 more expensive than the tokens you send as input. Keep 175 00:08:35.519 --> 00:08:36.000 that in mind. 176 00:08:36.120 --> 00:08:38.679 Ah, good tip. Now that we know how to talk 177 00:08:38.679 --> 00:08:41.960 to these models through code, the next logical step is 178 00:08:42.559 --> 00:08:44.360 how do we steer them, how do we make sure 179 00:08:44.360 --> 00:08:46.679 they behave exactly as we want, and crucially, how do 180 00:08:46.720 --> 00:08:47.960 we manage those costs. 181 00:08:48.360 --> 00:08:51.840 That's where customizing model behavior and optimizing for costing quality 182 00:08:51.879 --> 00:08:55.159 comes in. It's really about controlling the generation process. Oh so, 183 00:08:55.559 --> 00:08:58.960 for example, to control output length and therefore fees, you 184 00:08:59.000 --> 00:09:02.000 can set max token to specify a maximum response length. 185 00:09:02.279 --> 00:09:04.600 Pretty straightforward, can't limit the output makes sense. 186 00:09:04.919 --> 00:09:08.159 You can also use obsequences specific text patterns like maybe 187 00:09:08.279 --> 00:09:11.759 endo response or even something narrative like and they lived 188 00:09:11.759 --> 00:09:15.200 happily ever after to tell the model exactly when to 189 00:09:15.200 --> 00:09:18.919 stop generating. This can be very useful for getting structured. 190 00:09:18.559 --> 00:09:22.000 Outputs ah nat trick, and for controlling the actual words 191 00:09:22.000 --> 00:09:25.039 it chooses. How do we guide that for output generation? 192 00:09:25.399 --> 00:09:29.120 Presence penalty and frequency penalty are your levers for controlling repetitiveness. 193 00:09:29.919 --> 00:09:33.480 Positive values discourage the model from repeating tokens it's already 194 00:09:33.559 --> 00:09:36.080 used or that are present in the prompt helps keep 195 00:09:36.080 --> 00:09:36.559 things fresh. 196 00:09:36.639 --> 00:09:37.360 That's repetition. 197 00:09:37.679 --> 00:09:41.519 Good for truly surgical precision, like forcing a model to 198 00:09:41.639 --> 00:09:46.080 use specific words, say positive or negative in a sentiment task. 199 00:09:46.600 --> 00:09:47.720 There's legit bias. 200 00:09:48.000 --> 00:09:49.759 Legit bias sounds complex. 201 00:09:49.840 --> 00:09:52.759 It's a more advanced lever. It lets you explicitly increase 202 00:09:52.840 --> 00:09:56.080 or decrease the likelihood of specific tokens appearing. You need 203 00:09:56.120 --> 00:09:58.919 to find the token IDs using a token aser tool first. 204 00:09:59.200 --> 00:10:02.320 It's powerful, but typically for very niche use cases. You 205 00:10:02.320 --> 00:10:03.440 wouldn't use it every day. 206 00:10:03.440 --> 00:10:07.840 Okay, And what about controlling how creative or let's say 207 00:10:07.919 --> 00:10:12.120 random the model gets. Sometimes you want predictable, sometimes more exploratory. 208 00:10:12.360 --> 00:10:16.360 That's where randomization parameters are key. Temperature, typically set between 209 00:10:16.440 --> 00:10:20.840 zero and two, directly controls randomness. Higher values like maybe 210 00:10:20.919 --> 00:10:23.120 point eight or one point zero lead to more diverse 211 00:10:23.159 --> 00:10:27.000 and sometimes more created outputs. Lower values closer to zero 212 00:10:27.360 --> 00:10:29.480 make it more deterministic and focused. 213 00:10:29.480 --> 00:10:31.559 So zero for facts, higher for fiction. 214 00:10:32.080 --> 00:10:34.960 Sort of kind of yeah. TOP is an alternative approach 215 00:10:35.000 --> 00:10:38.840 that achieves a similar goal. It reduces randomization by focusing 216 00:10:38.879 --> 00:10:41.600 only on the highest probability tokens that add up to 217 00:10:41.600 --> 00:10:45.399 a certain cumulative probability. It's just a different way to tune. 218 00:10:45.200 --> 00:10:47.440 The randomness, temperature or TOP. Okay. 219 00:10:47.720 --> 00:10:50.120 And if you want multiple options for a single plumpt, 220 00:10:50.360 --> 00:10:52.840 you can use the N parameter to generate several replies 221 00:10:52.879 --> 00:10:55.240 at once. Gives you more choices to pick from. 222 00:10:55.720 --> 00:10:58.919 This raises an important question with all these settings, how 223 00:10:59.000 --> 00:11:02.039 do we get the best perform ormans while managing costs effectively? 224 00:11:02.080 --> 00:11:03.360 It sounds like a balancing act. 225 00:11:03.480 --> 00:11:07.679 It absolutely is, and that's where strategic optimization becomes crucial first, 226 00:11:08.159 --> 00:11:12.200 model selection. Do not always default to the largest, most 227 00:11:12.240 --> 00:11:13.600 expensive available model. 228 00:11:13.879 --> 00:11:15.399 Bigger isn't always better. 229 00:11:15.440 --> 00:11:18.759 Not necessarily and certainly not always cost effective. For many 230 00:11:18.799 --> 00:11:22.679 simpler tasks, a smaller, cheaper model like GPT three point 231 00:11:22.679 --> 00:11:26.639 five turbo might perform perfectly well. GPT four, for instance, 232 00:11:26.919 --> 00:11:29.639 can be over one hundred times more expensive per token in. 233 00:11:29.600 --> 00:11:31.799 Some cases, wow, a hundred times. 234 00:11:31.879 --> 00:11:35.879 Yeah, it's smart to check benchmarks like Stanford's ALM evaluation 235 00:11:36.440 --> 00:11:39.759 and definitely experiment with different models for your specific task 236 00:11:40.080 --> 00:11:42.960 to find that sweet spot between cost and quality. 237 00:11:43.200 --> 00:11:46.120 So model choice is clearly a big one. What else 238 00:11:46.159 --> 00:11:48.600 can we do besides tweaking temperature and penalties? 239 00:11:48.960 --> 00:11:52.480 Prompt engineering is absolutely vital. I can't stress this enough. 240 00:11:52.600 --> 00:11:55.240 The design of your prompt can have a significant effect on. 241 00:11:55.200 --> 00:11:57.840 Performance, really just the way you ask. 242 00:11:58.000 --> 00:12:01.600 Yes, it's a really counterintuitive insights sometimes, but the biggest 243 00:12:01.639 --> 00:12:03.679 leap in performance might not come from a bigger model 244 00:12:03.759 --> 00:12:07.320 or more training data, but simply from better instructions, like 245 00:12:07.360 --> 00:12:10.440 a skilled artisan responding to a perfectly precise brief. You know. 246 00:12:10.720 --> 00:12:13.399 That's the magic of fu shot learning, which we mentioned earlier, 247 00:12:13.840 --> 00:12:17.240 including samples of correctly solved tasks directly in the prompt 248 00:12:17.440 --> 00:12:21.360 can dramatically improve quality. It often allows cheaper models to 249 00:12:21.399 --> 00:12:25.000 perform comparably to much more expensive ones just because the 250 00:12:25.039 --> 00:12:26.399 task is clearer. 251 00:12:26.080 --> 00:12:28.320 So invest time in the prompt itself. 252 00:12:28.600 --> 00:12:31.840 Definitely. You can even find ready made prompt templates on 253 00:12:31.919 --> 00:12:35.240 platforms like prompt Base, though crafting your own specific to 254 00:12:35.320 --> 00:12:36.639 your need is usually best. 255 00:12:36.759 --> 00:12:38.799 And what about fine tuning? That sounds like a big 256 00:12:38.840 --> 00:12:40.559 step like retraining the model? 257 00:12:40.759 --> 00:12:43.799 It is kind of. Fine tuning allows you to specialize 258 00:12:43.840 --> 00:12:47.159 base models to the specific tasks you care most about. 259 00:12:47.720 --> 00:12:50.240 You take an existing model like GBT three point five 260 00:12:50.279 --> 00:12:53.440 Turbo and you continue its training, but with a relatively 261 00:12:53.440 --> 00:12:56.799 small amount of your own task specific data, typically fifty 262 00:12:56.879 --> 00:12:58.440 to maybe a few thousand examples. 263 00:12:58.480 --> 00:13:00.840 Fifty examples. That doesn't sound like much compared to the 264 00:13:00.840 --> 00:13:01.919 pre training data. 265 00:13:02.039 --> 00:13:05.639 It's not, but it's focused. The model already understands language, 266 00:13:05.639 --> 00:13:07.519 you're just nudging it to be really good at your 267 00:13:07.559 --> 00:13:08.320 specific thing. 268 00:13:08.440 --> 00:13:11.279 What are the upsides and downsides of that kind of specialization? 269 00:13:11.399 --> 00:13:12.200 Seems powerful? 270 00:13:12.440 --> 00:13:17.279 The advantages include potentially significantly improved accuracy for your specific 271 00:13:17.440 --> 00:13:21.000 use case, and you might get away with shorter, simpler prompts. 272 00:13:21.000 --> 00:13:23.759 Because the task is sort of baked into the specialized 273 00:13:23.799 --> 00:13:24.679 model now and. 274 00:13:24.600 --> 00:13:27.399 The downsides cost I assume yes. 275 00:13:27.559 --> 00:13:31.200 There are upfront monetary fees for the training process itself, 276 00:13:31.919 --> 00:13:36.000 and importantly it usually increases the cost per token for 277 00:13:36.080 --> 00:13:39.840 the fine tuned model's ongoing usage compared to the base model. 278 00:13:40.000 --> 00:13:41.960 Ah, so it costs more to run afterwards. 279 00:13:42.000 --> 00:13:44.639 Often Yes, the training data also needs to be in 280 00:13:44.639 --> 00:13:49.720 a specific JSM lines format, basically representing successful interactions as 281 00:13:49.759 --> 00:13:53.679 little conversations with user and assistant roles. It's a powerful tool, 282 00:13:53.879 --> 00:13:57.200 but when you typically explore once you've exhausted prompt engineering options. 283 00:13:57.279 --> 00:14:00.559 Okay, that makes sense. Let's unpack this further than beyond 284 00:14:00.720 --> 00:14:04.480 just text, lms are fundamentally transforming how we interact with 285 00:14:04.759 --> 00:14:06.440 all sorts of data, right, not just words on a. 286 00:14:06.440 --> 00:14:11.200 Paid absolutely for text analysis. Classification remains a natural application, 287 00:14:11.440 --> 00:14:15.879 like we said categorizing movie reviews or support tickets. Information extraction, 288 00:14:16.200 --> 00:14:19.240 where you pull structured data like compiling a table of 289 00:14:19.519 --> 00:14:23.559 applicant attributes from free form emails, is another really strong suit. 290 00:14:23.679 --> 00:14:27.360 Yeah, pulling structured data from messy texts is huge and. 291 00:14:27.279 --> 00:14:32.600 For clustering which groups semantically similar text documents, llms leverage 292 00:14:32.639 --> 00:14:34.039 something called embeddings. 293 00:14:34.200 --> 00:14:36.320 Embeddings heard that term, what is it exactly? 294 00:14:36.360 --> 00:14:39.039 Think of embeddings like assigning every piece of text a 295 00:14:39.159 --> 00:14:43.279 unique invisible address in a vast high dimensional space, like 296 00:14:43.360 --> 00:14:46.240 a point on a complex map. Okay, the closer to 297 00:14:46.600 --> 00:14:49.279 addresses or points are on this map, the more similarly 298 00:14:49.320 --> 00:14:51.879 the meaning of the texts. This allows the computer to 299 00:14:52.000 --> 00:14:55.840 understand semantic similarity without actually reading in our human sense. 300 00:14:55.879 --> 00:14:57.279 It's purely mathematical, so. 301 00:14:57.279 --> 00:14:58.480 It turns meaning into. 302 00:14:58.360 --> 00:15:02.279 Coordinates basically, yes, and this makes tasks like clustering emails 303 00:15:02.279 --> 00:15:06.120 incredibly efficient, separating them from, say, poems. It also powers 304 00:15:06.159 --> 00:15:09.759 things like semantic search and retrieval systems find me documents 305 00:15:09.799 --> 00:15:10.320 like this one. 306 00:15:10.440 --> 00:15:13.279 So it's about turning complex information into something that computer 307 00:15:13.360 --> 00:15:17.080 can intelligently compare and measure. That's genuinely mind. 308 00:15:16.919 --> 00:15:22.320 Boggling precisely, and for structured data analysis think relational databases 309 00:15:22.399 --> 00:15:27.440 or graph databases. Lms truly act as a universal interface interface. 310 00:15:27.480 --> 00:15:30.879 How they translate your natural language questions directly into formal 311 00:15:30.960 --> 00:15:34.960 query languages like SQL for tables or cipher for graphs. 312 00:15:35.679 --> 00:15:38.320 This contrast sharply with the traditional need for someone to 313 00:15:38.399 --> 00:15:42.399 manually write those precise, often complex queries. 314 00:15:42.080 --> 00:15:44.399 So I can just ask my database questions in English. 315 00:15:44.480 --> 00:15:46.919 That's the goal. We often use external tools with the 316 00:15:47.039 --> 00:15:50.279 LM for this because of efficiency, cost, and the sheer 317 00:15:50.399 --> 00:15:53.240 volume of large data sets that would exceed in LM's 318 00:15:53.399 --> 00:15:56.279 input limits. The LM acts as the translator. 319 00:15:56.320 --> 00:15:57.519 How does that work? In practice? 320 00:15:57.600 --> 00:16:00.840 You can build a natural language query interface for tabular data, 321 00:16:00.840 --> 00:16:04.039 for example, by first having the LM automatically extract the 322 00:16:04.120 --> 00:16:07.519 database structure, maybe by querrying the quite master table in 323 00:16:07.639 --> 00:16:09.000 squite ah. 324 00:16:09.080 --> 00:16:12.080 It figures out the tables and columns itself exactly. 325 00:16:12.519 --> 00:16:16.000 Then it translates your natural language questions into SEQL queries 326 00:16:16.039 --> 00:16:19.399 based on that structure, and finally, your application executes those 327 00:16:19.480 --> 00:16:21.399 queries against the actual database. 328 00:16:21.600 --> 00:16:24.879 That kind of automation sounds amazing, but with that level 329 00:16:24.879 --> 00:16:30.840 of power accessing databases directly, there must be a significant caution, right. 330 00:16:30.840 --> 00:16:34.120 Yes, a big one, a huge one actually. Do not 331 00:16:34.279 --> 00:16:37.639 blindly trust your language model to generate accurate queries. They 332 00:16:37.639 --> 00:16:40.559 can make mistakes. What kind of mistakes They might misunderstand 333 00:16:40.600 --> 00:16:44.639 the question, misinterpret the schema, or generate SQEL that's inefficient 334 00:16:45.080 --> 00:16:48.480 or just plain wrong, or worse, potentially destructive if you've 335 00:16:48.519 --> 00:16:49.519 given it right access. 336 00:16:49.639 --> 00:16:49.960 Yikes. 337 00:16:50.320 --> 00:16:53.639 So always always keep a backup of important data before 338 00:16:53.759 --> 00:16:57.799 enabling data access via language models, and ideally have checks 339 00:16:57.799 --> 00:17:01.159 in place, maybe even human review for sensitive querities. It's 340 00:17:01.200 --> 00:17:03.519 power that absolutely needs human oversight. 341 00:17:03.600 --> 00:17:06.480 Okay, creceed with caution on database access. Got it. And 342 00:17:06.559 --> 00:17:09.200 it's not just text and tables anymore. Llms are now 343 00:17:09.240 --> 00:17:11.759 analyzing images and videos too. How does that work? 344 00:17:12.000 --> 00:17:15.359 It's truly incredible. Models like GPT four to H are 345 00:17:15.480 --> 00:17:19.160 natively multimodal. This means they were trained from the ground 346 00:17:19.240 --> 00:17:21.839 up on different types of data, not just text, so 347 00:17:21.880 --> 00:17:24.839 they can see in the sense, you could ask free 348 00:17:24.880 --> 00:17:29.279 form natural language questions directly about images. For example, detect 349 00:17:29.480 --> 00:17:32.599 golden persian cats in this picture, and you provide the 350 00:17:32.640 --> 00:17:36.200 image along with the text wow. Your prompts combine text 351 00:17:36.240 --> 00:17:40.279 instructions with image ll components pointing to images online. You 352 00:17:40.279 --> 00:17:44.519 could even include multiple images in one prompt for comparative analysis, 353 00:17:44.559 --> 00:17:46.640 like what's different between these two photos? 354 00:17:46.640 --> 00:17:49.400 And cost? Is analyzing images expensive? 355 00:17:49.680 --> 00:17:52.240 The cost is generally proportional to the resolution of the 356 00:17:52.279 --> 00:17:56.920 images you submit. High resolution, more detail, potentially more tokens used. 357 00:17:56.920 --> 00:18:00.200 Okay, what about say, tagging people in photos. 358 00:18:00.160 --> 00:18:02.440 Do that too. You could provide a reference picture of 359 00:18:02.480 --> 00:18:04.880 a person alongside the pictures you want to tag, using 360 00:18:04.960 --> 00:18:08.559 multimodal prompts with two or more images and text instructions 361 00:18:08.640 --> 00:18:10.799 like is the person in the first image present in 362 00:18:10.799 --> 00:18:11.559 the second image? 363 00:18:11.599 --> 00:18:14.039 What if my images and videos aren't online if they're 364 00:18:14.079 --> 00:18:15.720 stored locally on my computer? 365 00:18:15.920 --> 00:18:19.319 Good question. For local images or video frames, you need 366 00:18:19.359 --> 00:18:23.480 to encode them first. Common formats like PNG jpeg up 367 00:18:23.559 --> 00:18:26.039 to about twenty milibuni in size need to be converted 368 00:18:26.079 --> 00:18:28.759 into a text format called Base sixty four and then 369 00:18:28.920 --> 00:18:30.079 encoded as UTF eight. 370 00:18:30.359 --> 00:18:32.119 Encode them as text yes. 371 00:18:32.119 --> 00:18:34.839 Essentially turning the image data into a long string of 372 00:18:34.920 --> 00:18:38.000 characters that can be sent in the API request along 373 00:18:38.039 --> 00:18:42.160 with your text prompt. Libraries like OpenCV are commonly used 374 00:18:42.200 --> 00:18:45.599 to extract individual frames from videos, maybe just the first 375 00:18:45.680 --> 00:18:48.160 ten frames, to get a sense of the videos content and. 376 00:18:48.119 --> 00:18:49.440 What would you do with those frames? 377 00:18:49.559 --> 00:18:52.759 You could use those sampled frames, along with text instructions 378 00:18:52.799 --> 00:18:56.240 to say, generate a concise video title, like provide the 379 00:18:56.240 --> 00:18:58.759 frames and ask generate a short title for a video 380 00:18:58.880 --> 00:19:01.799 showing these scenes. It might come back with traffic conditions 381 00:19:01.799 --> 00:19:03.039 on I five during rush hour. 382 00:19:03.200 --> 00:19:06.799 That's remarkable versatility taking us from text to tables, to 383 00:19:06.920 --> 00:19:10.359 images and video frames. And finally, what about audio data? 384 00:19:10.400 --> 00:19:11.039 Can they listen? 385 00:19:11.279 --> 00:19:14.039 Be sure, can or at least process the data for 386 00:19:14.119 --> 00:19:17.319 audio data analysis. Open AI's Whisper model is a real 387 00:19:17.400 --> 00:19:21.319 game changer. Yeah, it's a transformer model like GPT, but 388 00:19:21.440 --> 00:19:25.319 train specifically on over six hundred and eighty thousand hours 389 00:19:25.640 --> 00:19:30.160 of multilingual audio data. It's excellent for transcription, converting audio 390 00:19:30.200 --> 00:19:34.039 recordings into written text, typically English text output, though it 391 00:19:34.160 --> 00:19:35.960 understands many languages. 392 00:19:35.640 --> 00:19:37.920 So speech to text. What formats? 393 00:19:37.799 --> 00:19:41.279 It supports common formats like MP three, WAV and others, 394 00:19:41.599 --> 00:19:44.119 usually with a file size limit around twenty five milibit 395 00:19:44.119 --> 00:19:45.000 for the standard API. 396 00:19:45.079 --> 00:19:45.920 What can you build with that? 397 00:19:46.160 --> 00:19:49.960 You could build a full voice query interface. Imagine record 398 00:19:50.000 --> 00:19:53.200 a spoken question using a library like sound device on 399 00:19:53.240 --> 00:19:56.759 your computer, OK, transcribe that audio to text using whisper, 400 00:19:57.440 --> 00:20:00.839 translate that text into a SQL query GBT four H, 401 00:20:01.480 --> 00:20:04.000