WEBVTT 1 00:00:00.080 --> 00:00:02.000 Welcome to the deep dive. This is where we take 2 00:00:02.000 --> 00:00:05.519 a whole stack of articles, research papers, notes, and basically 3 00:00:05.559 --> 00:00:07.080 just dive in to pull out the key insights for 4 00:00:07.120 --> 00:00:10.759 you. Today, our mission is to really get under the 5 00:00:10.800 --> 00:00:13.800 hood of generative AI. It's a technology that's, well, it's 6 00:00:13.880 --> 00:00:16.160 changing things incredibly fast. And just to give you a 7 00:00:16.199 --> 00:00:18.640 sense of how impactful it is, remember back in 8 00:00:18.719 --> 00:00:21.960 2022, the Colorado State Fair art competition? The winning 9 00:00:21.960 --> 00:00:25.280 piece in the digital art category, Théâtre D'opéra Spatial, 10 00:00:25.640 --> 00:00:27.800 wasn't, you know, painted by a human. It was made 11 00:00:27.879 --> 00:00:31.280 using Midjourney, an AI, a really stunning sci-fi scene. 12 00:00:31.280 --> 00:00:34.840 It just perfectly captures this blend of creativity and, well, 13 00:00:34.880 --> 00:00:38.200 pure tech. That's where we're starting today. So, okay, generative AI. 14 00:00:38.359 --> 00:00:42.039 It's making headlines everywhere, creating art, writing code, sounding almost human. 15 00:00:42.560 --> 00:00:44.799 But what is it fundamentally? What's the big shift here? 16 00:00:44.880 --> 00:00:47.399 Yeah, it's a massive shift, really a paradigm shift, you 17 00:00:47.439 --> 00:00:50.600 could say. Think about most AI you've probably come across before. 18 00:00:50.640 --> 00:00:55.479 It's usually discriminative. I mean, it learns to tell things 19 00:00:55.520 --> 00:01:00.200 apart, right, classify, predict based on data it's seen, like 20 00:01:00.240 --> 00:01:03.200 telling a cat photo from a dog photo. Generative models, 21 00:01:03.240 --> 00:01:06.680 though, they do something different. They're not just recognizing patterns.
22 00:01:06.840 --> 00:01:09.560 They learn the deep underlying rules of the data itself. 23 00:01:09.840 --> 00:01:11.920 They learn enough about what makes a cat a cat 24 00:01:12.040 --> 00:01:15.959 that they can actually create entirely new cat images, believable 25 00:01:15.959 --> 00:01:21.239 ones, from scratch. It's about generation, not just discrimination. 26 00:01:21.400 --> 00:01:23.879 Wow, okay, so it's not just sorting or identifying. It's 27 00:01:23.920 --> 00:01:27.439 like imagining. Yeah, materializing things. That feels like a huge leap. 28 00:01:27.560 --> 00:01:29.159 It is a profound one. But, 29 00:01:29.120 --> 00:01:31.760 I mean, creating something totally new like that, this sounds 30 00:01:31.760 --> 00:01:35.400 incredibly complicated. What are some of the biggest challenges these 31 00:01:35.879 --> 00:01:37.480 imagination machines face? 32 00:01:37.680 --> 00:01:40.239 You're right, it's definitely not simple. For starters, the data 33 00:01:40.319 --> 00:01:43.599 itself is a huge hurdle. Real-world information is messy, 34 00:01:43.959 --> 00:01:47.959 you know, it's full of errors, noise, biases, and the models, well, 35 00:01:48.040 --> 00:01:50.159 they can learn those imperfections just as easily as the 36 00:01:50.239 --> 00:01:51.079 useful patterns. 37 00:01:51.200 --> 00:01:54.120 Ah, so garbage in, garbage out, potentially? Sort of, 38 00:01:54.200 --> 00:01:57.359 yeah. And then there's the issue of staying current, especially 39 00:01:57.359 --> 00:02:01.239 for large language models, LLMs. The world changes so fast, 40 00:02:01.280 --> 00:02:04.480 and the information they generate can become outdated pretty quickly 41 00:02:04.519 --> 00:02:06.120 if they're not constantly updated. 42 00:02:06.439 --> 00:02:09.120 Right, like asking about current events from a model trained 43 00:02:09.280 --> 00:02:10.479 last year. Exactly.
44 00:02:10.919 --> 00:02:15.599 And then there's the sheer computational power required. Learning these 45 00:02:15.639 --> 00:02:20.319 incredibly complex patterns and then generating new high-fidelity data, 46 00:02:21.039 --> 00:02:25.479 it demands massive amounts of compute. And finally, think about evaluation. 47 00:02:25.879 --> 00:02:28.400 With a discriminative model, you ask, is this a cat? 48 00:02:28.759 --> 00:02:32.199 Yes or no? Easy to check. But with a generative model, 49 00:02:32.280 --> 00:02:34.639 how do you evaluate if a generated cat picture is 50 00:02:34.840 --> 00:02:37.639 good or accurate? There isn't a single right answer. 51 00:02:37.719 --> 00:02:39.960 It's much more subjective, much more complex to measure. 52 00:02:40.000 --> 00:02:42.120 That makes total sense. It's not just about being correct, 53 00:02:42.159 --> 00:02:45.360 it's about being believable, useful. 54 00:02:45.080 --> 00:02:49.120 Plausible, believable, useful, coherent, all those things. 55 00:02:49.360 --> 00:02:54.080 Okay, so despite those big challenges, the promise must be huge, right? 56 00:02:54.120 --> 00:02:57.719 That's why everyone's pouring resources into this. Let's dig into 57 00:02:57.719 --> 00:03:01.080 some of those applications. Images, for instance. We've gone way 58 00:03:01.080 --> 00:03:02.719 beyond simple photo filters. Oh, 59 00:03:02.800 --> 00:03:08.560 absolutely. Models now can create incredibly diverse photorealistic images just 60 00:03:08.599 --> 00:03:12.759 from, say, a text description, things you'd never have imagined possible 61 00:03:12.759 --> 00:03:13.439 a few years ago. 62 00:03:13.520 --> 00:03:15.919 And it's not just art, right? You mentioned data augmentation. 63 00:03:16.319 --> 00:03:18.840 Yeah, that's a really practical one. Imagine you have only 64 00:03:18.879 --> 00:03:21.639 a small data set, maybe for training an AI to 65 00:03:21.719 --> 00:03:26.319 recognize a specific product defect.
Generative AI can create thousands 66 00:03:26.360 --> 00:03:30.080 of synthetic examples, different angles, lighting conditions, you name it, 67 00:03:30.240 --> 00:03:33.280 to bolster that data set, make the training more robust, 68 00:03:33.599 --> 00:03:36.680 maybe even reduce bias if the original data was skewed. 69 00:03:36.960 --> 00:03:39.280 That's clever, using AI to make other AI 70 00:03:39.159 --> 00:03:43.560 better. Exactly. And in content creation too, generating text for chatbots, 71 00:03:43.560 --> 00:03:47.120 helping writers brainstorm, even drafting emails. We've come such a 72 00:03:47.120 --> 00:03:49.000 long way from ELIZA back in the 73 00:03:48.960 --> 00:03:51.599 sixties. Right, those old rule-based bots. Now we have 74 00:03:51.680 --> 00:03:54.560 these powerful models built on architectures like transformers. 75 00:03:54.639 --> 00:03:57.520 It's a different world. But those challenges we mentioned, the 76 00:03:57.560 --> 00:04:00.759 messy data, keeping up with reality, the compute demands, 77 00:04:00.800 --> 00:04:05.840 and that tricky evaluation problem, they're still very real hurdles. 78 00:04:06.120 --> 00:04:10.240 Yeah, defining good enough for generated stuff, that's a tough one. Okay. 79 00:04:10.240 --> 00:04:13.000 So before we get deeper into the applications, how did 80 00:04:13.000 --> 00:04:15.400 we even get here? How did machines learn to imagine 81 00:04:15.439 --> 00:04:20.199 like this? Let's trace back the building blocks: deep neural networks. 82 00:04:20.519 --> 00:04:23.120 It goes way back, actually. Early ideas in the 83 00:04:23.199 --> 00:04:26.839 1940s were inspired by biological neurons. Simple things like the 84 00:04:26.879 --> 00:04:31.279 threshold logic unit. But those early models hit limitations. Famously,
85 00:04:31.399 --> 00:04:34.319 Minsky and Papert showed in their book Perceptrons that 86 00:04:34.360 --> 00:04:37.600 single-layer networks couldn't even solve basic problems like the XOR 87 00:04:37.720 --> 00:04:40.560 logic function. That led to the first AI winter in 88 00:04:40.600 --> 00:04:41.160 the seventies. 89 00:04:41.279 --> 00:04:44.639 Progress stalled. Right, the AI winter. So what thawed things out? 90 00:04:44.680 --> 00:04:47.199 What was the big breakthrough that got things moving again? 91 00:04:47.399 --> 00:04:51.199 The absolute game changer was backpropagation. Before that, figuring out 92 00:04:51.279 --> 00:04:53.439 how to adjust all the connections, the weights, in a 93 00:04:53.519 --> 00:04:57.600 deep network was incredibly inefficient, almost impossible for complex networks. 94 00:04:57.680 --> 00:05:00.160 How does it work, sort of, in simple terms? 95 00:05:00.120 --> 00:05:03.279 Well, it uses calculus, specifically the chain rule, to efficiently 96 00:05:03.319 --> 00:05:06.720 calculate how much each weight in the network contributed to 97 00:05:06.759 --> 00:05:09.800 the final error. It tells each connection exactly how to 98 00:05:09.839 --> 00:05:13.759 adjust itself, layer by layer, working backward from the output error. 99 00:05:14.319 --> 00:05:17.639 It made training deep networks practical. That's what really ended 100 00:05:17.639 --> 00:05:20.680 the AI winter and opened the door to modern deep learning. 101 00:05:20.959 --> 00:05:24.120 But you said even backpropagation wasn't perfect. It had issues? 102 00:05:24.199 --> 00:05:24.519 It did. 103 00:05:24.600 --> 00:05:27.439 A big one was the vanishing gradient problem.
In very 104 00:05:27.480 --> 00:05:30.199 deep networks, the error signal gets weaker and weaker as 105 00:05:30.199 --> 00:05:33.480 it propagates backward, so the early layers, the ones furthest 106 00:05:33.480 --> 00:05:37.360 from the output, learn extremely slowly or sometimes not at all, 107 00:05:37.639 --> 00:05:39.680 like a whisper getting lost down a long hallway. 108 00:05:39.759 --> 00:05:41.800 Okay, I can picture that. The signal just fades out. 109 00:05:42.000 --> 00:05:44.279 So once we had backpropagation, even with its flaws, 110 00:05:44.399 --> 00:05:48.199 what kinds of network structures or architectures started showing up? Well, 111 00:05:48.240 --> 00:05:52.399 for images, a major leap was convolutional neural networks, CNNs. 112 00:05:52.759 --> 00:05:55.439 They were kind of inspired by the human visual cortex. 113 00:05:56.000 --> 00:05:59.279 Instead of looking at an image pixel by pixel, CNNs 114 00:05:59.360 --> 00:06:02.639 use filters that slide across the image looking for specific 115 00:06:02.680 --> 00:06:07.199 local features: edges, corners, textures. And crucially, they share weights. 116 00:06:07.600 --> 00:06:10.279 The filter looking for a horizontal edge is the same 117 00:06:10.279 --> 00:06:12.560 filter whether it's looking at the top left or bottom right. 118 00:06:12.759 --> 00:06:14.240 This makes them way more efficient for 119 00:06:14.199 --> 00:06:17.920 images. Sharing weights, okay. And there were improvements on those 120 00:06:17.920 --> 00:06:19.040 basic CNNs? Oh, 121 00:06:19.000 --> 00:06:22.759 yeah, big ones, things like ReLU activation functions. They replaced 122 00:06:22.800 --> 00:06:25.959 older functions that saturated easily and helped fix that vanishing 123 00:06:25.959 --> 00:06:29.680 gradient problem. Kept the signal strong. And dropout, which 124 00:06:29.720 --> 00:06:33.079 sounds weird but works amazingly well.
During training, you randomly 125 00:06:33.120 --> 00:06:35.800 switch off some neurons. It forces the network not to 126 00:06:35.839 --> 00:06:38.600 rely too much on any single neuron, making it generalize 127 00:06:38.639 --> 00:06:41.120 better to new data. Kind of like cross-training for 128 00:06:41.160 --> 00:06:41.600 the network. 129 00:06:41.720 --> 00:06:44.839 Huh. Interesting. Okay, so that's images. What about sequences, like 130 00:06:45.000 --> 00:06:47.839 text or speech or time series data? 131 00:06:47.920 --> 00:06:51.480 For sequential data, the go-to became recurrent neural networks, 132 00:06:51.800 --> 00:06:55.680 or RNNs. They have loops, allowing information to persist. They 133 00:06:55.680 --> 00:06:57.360 have a kind of memory. 134 00:06:57.000 --> 00:06:59.240 A memory, right. But didn't they also have issues with 135 00:06:59.279 --> 00:07:00.759 long sequences? They did. 136 00:07:01.199 --> 00:07:04.480 That vanishing gradient problem hit them hard too when trying 137 00:07:04.519 --> 00:07:07.279 to remember things from many steps back, which led to 138 00:07:07.319 --> 00:07:11.279 the development of LSTMs, long short-term memory networks. LSTMs 139 00:07:11.279 --> 00:07:14.279 were a much more sophisticated type of RNN. They have 140 00:07:14.439 --> 00:07:18.839 these internal mechanisms called gates: an input gate, a forget gate, 141 00:07:19.040 --> 00:07:22.680 and an output gate. These gates carefully control what information gets 142 00:07:22.680 --> 00:07:25.959 stored in the memory cell, what gets forgotten, and what influences 143 00:07:26.000 --> 00:07:28.920 the output at each step. They were much, much 144 00:07:29.000 --> 00:07:32.839 better at capturing long-range dependencies, crucial for understanding language. 145 00:07:32.879 --> 00:07:36.240 Okay, so LSTMs improved memory.
But you mentioned earlier that 146 00:07:36.279 --> 00:07:39.199 even they had limitations, especially for really long text, which 147 00:07:39.279 --> 00:07:41.519 led to transformers. 148 00:07:40.920 --> 00:07:44.319 Exactly. This is where transformers completely changed the game, particularly 149 00:07:44.319 --> 00:07:47.120 for language. They threw out the sequential, step-by-step 150 00:07:47.120 --> 00:07:51.040 processing of RNNs and LSTMs. The core idea, the revolution, 151 00:07:51.639 --> 00:07:52.439 was self-attention. 152 00:07:52.920 --> 00:07:55.800 Self-attention, we hear that term a lot. What does 153 00:07:55.839 --> 00:07:57.160 it actually let the model do? 154 00:07:57.680 --> 00:08:01.639 Instead of processing word by word, self-attention allows every 155 00:08:01.680 --> 00:08:05.800 single word in a sentence to directly look at and 156 00:08:05.879 --> 00:08:09.439 weigh the importance of every other word in that same sentence, 157 00:08:09.279 --> 00:08:13.439 all at once. All at once, so no more sequential bottleneck? 158 00:08:13.759 --> 00:08:17.160 Precisely. It can instantly see how the first word relates 159 00:08:17.160 --> 00:08:19.600 to the last word, or how a pronoun relates to the 160 00:08:19.639 --> 00:08:22.879 noun it refers to, even if they're far apart. And crucially, 161 00:08:23.439 --> 00:08:26.680 because it's not sequential, you can process all words in parallel. 162 00:08:27.360 --> 00:08:30.920 This makes training on massive data sets much, much faster 163 00:08:31.000 --> 00:08:34.399 and more scalable than RNNs ever could be. It just unlocked 164 00:08:34.399 --> 00:08:36.279 a whole new level of performance and scale. 165 00:08:36.360 --> 00:08:38.279 Okay, that makes sense, why they were such a big deal. 166 00:08:38.440 --> 00:08:40.480 So if we have these powerful architectures, how do we 167 00:08:40.519 --> 00:08:44.519 get them to actually understand and use words?
How does 168 00:08:44.600 --> 00:08:47.080 text get turned into numbers the machine can process? 169 00:08:47.200 --> 00:08:49.879 Right, that's fundamental. The early approaches were pretty simple, like 170 00:08:49.919 --> 00:08:52.320 bag of words. You literally just count how many times 171 00:08:52.399 --> 00:08:53.840 each word appears in a document. 172 00:08:53.960 --> 00:08:55.519 Simple, but I guess it loses a lot. 173 00:08:55.720 --> 00:09:00.840 It loses all the context. Word order, grammar, gone. "Dog 174 00:09:00.879 --> 00:09:04.159 bites man" and "man bites dog" look exactly the same 175 00:09:04.200 --> 00:09:06.720 to a bag-of-words model. Not very useful for 176 00:09:06.799 --> 00:09:07.679 understanding meaning. 177 00:09:08.000 --> 00:09:09.720 Yeah, that seems like a pretty big flaw. 178 00:09:10.000 --> 00:09:12.279 So the next big step was word embeddings. These are 179 00:09:12.320 --> 00:09:16.120 dense vector representations, basically lists of numbers for each word. 180 00:09:16.879 --> 00:09:20.039 Models like word2vec learn these embeddings by looking 181 00:09:20.039 --> 00:09:22.840 at the contexts words appear in. The key idea was 182 00:09:22.879 --> 00:09:26.919 that words used in similar contexts should have similar numerical representations, 183 00:09:26.960 --> 00:09:31.000 similar vectors. It started capturing semantic relationships. 184 00:09:30.279 --> 00:09:33.240 So king and queen would be mathematically closer than king 185 00:09:33.320 --> 00:09:34.639 and cabbage? Exactly. 186 00:09:34.960 --> 00:09:37.840 But even those embeddings were static. The vector for bank 187 00:09:37.960 --> 00:09:40.039 was the same whether you meant a riverbank or a 188 00:09:40.080 --> 00:09:44.240 financial bank. The real breakthrough for nuance was contextual representations.
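NOTE A minimal sketch of the bag-of-words limitation described above, assuming plain lowercase, whitespace-based tokenization. The function name `bag_of_words` is hypothetical and chosen only for illustration. Because the representation just counts words, the two example sentences from the conversation end up identical.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Word order is discarded, so both sentences map to the same counts.
print(first == second)  # True
```

A real pipeline would usually add punctuation handling and a fixed vocabulary, but the loss of word order would be the same.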
189 00:09:44.720 --> 00:09:48.279 Models like BERT and ELMo generate embeddings that change based 190 00:09:48.320 --> 00:09:51.159 on the specific sentence the word is in. They understand 191 00:09:51.159 --> 00:09:53.840 that bank means different things in different contexts. That was 192 00:09:54.000 --> 00:09:55.840 huge for understanding language properly. 193 00:09:55.919 --> 00:09:58.879 Okay, so we have ways to represent words with nuance. Now, 194 00:09:59.000 --> 00:10:01.960 how do we make the machine talk, generate text? 195 00:10:02.399 --> 00:10:05.759 That's the job of language models. At their heart, they're 196 00:10:05.799 --> 00:10:08.799 trying to predict the next word in a sequence given 197 00:10:08.840 --> 00:10:12.440 the previous words, like a superpowered autocomplete. 198 00:10:12.639 --> 00:10:14.720 Just predicting the next word? How does that lead to 199 00:10:14.840 --> 00:10:16.960 coherent sentences or paragraphs? 200 00:10:17.240 --> 00:10:19.559 Well, once it predicts a word, that word becomes part 201 00:10:19.559 --> 00:10:21.759 of the context for predicting the next word, and so on. 202 00:10:22.120 --> 00:10:25.559 But just picking the single most probable word at each step, 203 00:10:25.600 --> 00:10:29.399 that's called greedy decoding, often leads to really repetitive or 204 00:10:29.440 --> 00:10:30.200 boring text. 205 00:10:30.480 --> 00:10:32.720 Right, it might just get stuck saying the same phrase 206 00:10:32.799 --> 00:10:33.679 over and over. 207 00:10:33.639 --> 00:10:37.720 Exactly. So we use more sophisticated decoding strategies. Beam search 208 00:10:37.799 --> 00:10:40.200 keeps track of several of the most likely sequences at 209 00:10:40.200 --> 00:10:42.320 each step, kind of looking ahead to find a better 210 00:10:42.360 --> 00:10:46.240 overall sentence. And then there's sampling. Instead of always picking 211 00:10:46.320 --> 00:10:49.799 the most likely word, you introduce some randomness.
You might 212 00:10:49.840 --> 00:10:53.840 sample from, say, the top ten most likely words, 213 00:10:53.919 --> 00:10:56.639 top-k sampling, or from the smallest set of words whose 214 00:10:56.679 --> 00:11:00.759 probabilities add up to a certain threshold, nucleus sampling. This 215 00:11:00.759 --> 00:11:04.559 adds variety and makes the text feel more natural, less predictable. 216 00:11:04.840 --> 00:11:07.960 So sampling adds a bit of creativity, stops it being 217 00:11:08.039 --> 00:11:09.279 robotic? Pretty much. 218 00:11:09.360 --> 00:11:11.720 Yeah, it helps avoid getting stuck in loops and generates 219 00:11:11.720 --> 00:11:12.639 more interesting output. 220 00:11:13.039 --> 00:11:15.240 And it seems like the transformer architecture with that 221 00:11:15.240 --> 00:11:18.840 self-attention mechanism was absolutely critical for enabling this kind of 222 00:11:18.879 --> 00:11:22.960 sophisticated text generation at scale, right? Can you expand on 223 00:11:23.039 --> 00:11:24.879 why it was such a turning point for these large 224 00:11:24.919 --> 00:11:25.600 language models? 225 00:11:25.639 --> 00:11:28.960 Oh, absolutely pivotal. That 2017 paper, "Attention Is All 226 00:11:29.000 --> 00:11:33.720 You Need," it really did shift the paradigm. Before transformers, remember, 227 00:11:33.759 --> 00:11:37.720 even LSTMs, our best sequential models, had that bottleneck issue. 228 00:11:37.759 --> 00:11:41.480 They had to cram the meaning of the entire input sequence, 229 00:11:41.559 --> 00:11:45.279 no matter how long, into a single fixed-size context 230 00:11:45.320 --> 00:11:48.720 vector to pass along. For very long sentences or documents, 231 00:11:48.759 --> 00:11:50.919 that just wasn't enough. Information got lost. 232 00:11:51.000 --> 00:11:52.720 The memory wasn't infinite.
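NOTE A minimal sketch of the decoding strategies discussed a little earlier: greedy decoding, top-k sampling, and nucleus sampling. It assumes the model's next-word distribution is already available as a dictionary of probabilities. The toy distribution and the function names are hypothetical, not taken from any particular library.

```python
import random


def greedy(probs: dict[str, float]) -> str:
    """Greedy decoding: always pick the single most probable word."""
    return max(probs, key=probs.get)


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable words and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


def nucleus_filter(probs: dict[str, float], threshold: float) -> dict[str, float]:
    """Keep the smallest set of top words whose probabilities reach the threshold."""
    kept, cumulative = [], 0.0
    for word, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((word, p))
        cumulative += p
        if cumulative >= threshold:
            break
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


# Hypothetical next-word distribution, for illustration only.
next_word = {"cat": 0.5, "dog": 0.3, "car": 0.15, "sky": 0.05}
rng = random.Random(0)

print(greedy(next_word))                             # always the top word
print(sample(top_k_filter(next_word, 2), rng))       # one of the 2 most likely words
print(sample(nucleus_filter(next_word, 0.9), rng))   # one of the words covering 90% of the mass
```

Greedy decoding returns the same word every time, while the two sampling variants can vary between runs but never pick words outside the filtered set.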
233 00:11:52.440 --> 00:11:56.720 Right. Transformers, by ditching recurrence entirely and using self-attention, 234 00:11:57.240 --> 00:12:01.639 broke that bottleneck wide open. Every word could directly attend 235 00:12:01.679 --> 00:12:06.279 to every other word, instantly capturing those long-range dependencies. Plus, 236 00:12:06.559 --> 00:12:10.399 they introduced multi-head self-attention. Think of it as allowing 237 00:12:10.399 --> 00:12:12.960 the model to pay attention to different kinds of relationships 238 00:12:13.039 --> 00:12:18.720 simultaneously in parallel subspaces. Maybe one head focuses on grammatical relationships, 239 00:12:18.960 --> 00:12:21.159 another on semantic similarity. 240 00:12:20.679 --> 00:12:23.159 So it could capture multiple layers of meaning at once. 241 00:12:23.320 --> 00:12:27.720 Exactly. That ability to handle long contexts effectively and efficiently, 242 00:12:28.000 --> 00:12:30.759 combined with the massive parallelism allowing them to train on 243 00:12:30.960 --> 00:12:34.039 unprecedented amounts of data, that's what paved the way for 244 00:12:34.080 --> 00:12:37.279 the truly large language models, the LLMs that we have today. 245 00:12:37.559 --> 00:12:40.799 And from that core transformer idea, different sort of flavors 246 00:12:40.879 --> 00:12:42.679 or families of models emerged? 247 00:12:42.919 --> 00:12:46.039 Yeah. Broadly speaking, you see three main types based on 248 00:12:46.120 --> 00:12:49.679 which parts of the original transformer architecture they use. First, 249 00:12:49.879 --> 00:12:53.279 encoder-only models, like the famous BERT. These are designed 250 00:12:53.320 --> 00:12:56.159 primarily for understanding text. They look at the whole sentence 251 00:12:56.200 --> 00:13:00.240 at once. Great for tasks like classification, sentiment analysis, or 252 00:13:00.320 --> 00:13:02.399 question answering, where context is key.
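NOTE A rough sketch of the self-attention idea described above, in which every word weighs every other word at once. As a simplifying assumption, the token vectors themselves serve as queries, keys, and values. Real transformers use learned projections and multiple heads, and the square-root scaling follows the common scaled dot-product formulation rather than anything stated in this conversation. The toy vectors are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: list[list[float]]):
    """Single-head self-attention with no learned projections (a simplification)."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # The output is a weighted mix of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs, all_weights


# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])  # each row is one token's attention over all tokens
```

Each token's scores depend only on the input vectors and not on another token's output, which is why the rows could be computed in parallel instead of step by step.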
253 00:13:02.559 --> 00:13:05.360 Okay, understanding text. What's the next type? 254 00:13:05.440 --> 00:13:08.600 Then you have decoder-only models, like the GPT family. 255 00:13:08.759 --> 00:13:12.759 These are built for generating text. They work sequentially, predicting 256 00:13:12.799 --> 00:13:16.000 the next word based on the preceding ones. This causal 257 00:13:16.120 --> 00:13:21.200 nature makes them naturals for chatbots, story writing, code generation. GPT 258 00:13:21.360 --> 00:13:25.919 really revolutionized generation with its ability for unsupervised multitask learning, 259 00:13:26.279 --> 00:13:28.480 learning many tasks just from raw text. 260 00:13:28.639 --> 00:13:30.639 Right. GPT is the one most people probably think of. 261 00:13:30.679 --> 00:13:32.039 And the third type? 262 00:13:31.840 --> 00:13:35.000 Encoder-decoder models, like T5 or the original transformer. 263 00:13:35.639 --> 00:13:37.720 These have both parts and are often used for 264 00:13:37.759 --> 00:13:41.000 sequence-to-sequence tasks, where you're transforming an input sequence into 265 00:13:41.000 --> 00:13:44.639 an output sequence. Think machine translation or text summarization. 266 00:13:45.000 --> 00:13:49.240 Got it. Encoder for understanding, decoder for generating, and both 267 00:13:49.240 --> 00:13:52.879 for transforming. And focusing on GPT, since it's so prominent, 268 00:13:53.240 --> 00:13:54.360 what were the big leaps there? 269 00:13:54.879 --> 00:13:57.759 Well, GPT-2 in 2019 was a major milestone, 270 00:13:58.120 --> 00:14:01.360 1.5 billion parameters, trained on a huge chunk 271 00:14:01.399 --> 00:14:04.879 of the Internet.
What was really stunning was its 272 00:14:04.919 --> 00:14:08.240 few-shot ability. Few-shot meaning? Meaning you could give it 273 00:14:08.320 --> 00:14:09.919 just a couple of examples of a task in the 274 00:14:09.960 --> 00:14:12.279 prompt, and it could often figure out how to do 275 00:14:12.320 --> 00:14:15.120 it without any specific training for that task. Yeah, it 276 00:14:15.159 --> 00:14:17.960 showed an incredible level of general language understanding. 277 00:14:18.039 --> 00:14:18.480 Wow. 278 00:14:18.639 --> 00:14:21.799 And then GPT-3 came along. Oh, GPT-3 279 00:14:21.840 --> 00:14:24.679 was enormous, 175 billion parameters, over 280 00:14:24.679 --> 00:14:27.879 one hundred times bigger. It started showing these emergent abilities, 281 00:14:27.919 --> 00:14:30.120 things it wasn't explicitly trained for but could just do, 282 00:14:30.279 --> 00:14:33.399 like unscrambling words or even basic arithmetic. It felt like 283 00:14:33.679 --> 00:14:35.120 a qualitative leap. 284 00:14:35.039 --> 00:14:37.960 But raw capability isn't always the same as being useful 285 00:14:38.080 --> 00:14:39.000 or safe, 286 00:14:38.720 --> 00:14:41.919 right? Exactly, and that led to InstructGPT in 287 00:14:41.960 --> 00:14:44.919 2022. It was actually smaller than GPT-3, but 288 00:14:45.080 --> 00:14:48.000 critically, it was much better at following instructions and aligning 289 00:14:48.039 --> 00:14:48.840 with user intent. 290 00:14:49.120 --> 00:14:51.559 How did they achieve that alignment? Through 291 00:14:51.399 --> 00:14:54.960 two extra training steps after the initial pre-training. First, 292 00:14:55.039 --> 00:14:57.720 instruction fine-tuning, where they trained it on examples of 293 00:14:57.759 --> 00:15:02.120 prompts and desired outputs, and second, crucially, reinforcement learning from 294 00:15:02.240 --> 00:15:05.440 human feedback, or RLHF.
295 00:15:04.960 --> 00:15:08.080 RLHF. That involves humans ranking different 296 00:15:07.759 --> 00:15:11.399 outcomes? Yes, humans would compare different responses from the model 297 00:15:11.440 --> 00:15:14.200 to the same prompt and indicate which one they preferred. 298 00:15:14.799 --> 00:15:17.679 This feedback was used to train a reward model, which 299 00:15:17.759 --> 00:15:21.240 then guided the LLM during further fine-tuning to produce 300 00:15:21.279 --> 00:15:24.399 outputs that humans are more likely to find helpful, honest, 301 00:15:24.519 --> 00:15:28.000 and harmless. That alignment step was key for making models 302 00:15:28.000 --> 00:15:30.399 like ChatGPT practical and safer to 303 00:15:30.360 --> 00:15:34.320 deploy. Alignment, right, that seems super important. Then it also 304 00:15:34.399 --> 00:15:36.919 brings up the point about access. Many of these really 305 00:15:36.960 --> 00:15:40.080 powerful models, like GPT-4, are closed source. We don't 306 00:15:40.159 --> 00:15:42.200 know the exact architecture or the training data. How does 307 00:15:42.200 --> 00:15:42.919 that affect things? 308 00:15:43.039 --> 00:15:45.320 It's a huge debate in the field. On one hand, 309 00:15:45.399 --> 00:15:48.639 companies invest billions and want to protect their IP. On 310 00:15:48.679 --> 00:15:54.120 the other hand, it raises serious questions about transparency, reproducibility, bias auditing, 311 00:15:54.200 --> 00:15:58.120 and just how can the broader community innovate and build 312 00:15:58.399 --> 00:15:59.759 if the cutting edge is locked away? 313 00:16:00.440 --> 00:16:01.879 So is there a counter movement? 314 00:16:02.120 --> 00:16:05.679 Absolutely. The open source LLM movement has exploded in response. 315 00:16:05.960 --> 00:16:09.120 You have major efforts like Meta's Llama models.
They release 316 00:16:09.200 --> 00:16:13.200 models with billions of parameters, allowing researchers and developers everywhere 317 00:16:13.360 --> 00:16:16.120 to experiment and build on them. They've shown really strong 318 00:16:16.159 --> 00:16:19.879 performance on benchmarks for coding, reasoning, common sense. 319 00:16:19.679 --> 00:16:22.440 So viable open alternatives are emerging? 320 00:16:22.080 --> 00:16:25.480 Definitely, and you see interesting architectural innovations too. Look at 321 00:16:25.519 --> 00:16:29.799 Mixtral from Mistral AI. It uses a mixture of experts, MoE, 322 00:16:29.799 --> 00:16:31.679 architecture. Mixture of experts. 323 00:16:31.679 --> 00:16:34.120 How does that work? Instead of the entire huge model 324 00:16:34.159 --> 00:16:37.679 processing every single input token, an MoE model has multiple 325 00:16:37.720 --> 00:16:42.919 smaller expert networks, usually specialized transformer layers. A lightweight router 326 00:16:43.039 --> 00:16:45.519 network directs each part of the input to only a 327 00:16:45.559 --> 00:16:46.759 small subset of these 328 00:16:46.600 --> 00:16:49.960 experts. Ah, so only part of the model is active 329 00:16:49.960 --> 00:16:51.639 at any given time. More efficient. 330 00:16:51.840 --> 00:16:54.159 Exactly. You could have a model with a massive total 331 00:16:54.240 --> 00:16:58.120 number of parameters, giving great capacity, but the actual computation 332 00:16:58.240 --> 00:17:01.399 needed for inference is much lower because you're only using 333 00:17:01.399 --> 00:17:04.079 a fraction of the experts for any given input. It's 334 00:17:04.119 --> 00:17:07.400 a clever way to scale up while managing costs. Plus, 335 00:17:07.519 --> 00:17:10.680 Mixtral has a very permissive Apache 2.0 license, 336 00:17:11.000 --> 00:17:12.119 making it widely 337 00:17:11.839 --> 00:17:15.400 usable. Interesting. Any other key open source players?
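NOTE A toy sketch of the mixture-of-experts routing described above, under stated assumptions: inputs are single numbers, the experts are simple functions, the router is a per-expert linear score, and `top_k=2` is an illustrative choice. None of the names or values come from Mixtral or any real library, and real experts are neural network layers with a learned router.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x: float, experts, router_weights: list[float], top_k: int = 2):
    """Send x to the top_k highest-scoring experts and mix their outputs."""
    # Router: one score per expert (a toy linear score, weight * x).
    scores = [w * x for w in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    # Only the chosen experts are evaluated. The rest stay idle for this input.
    output = sum(gate * experts[i](x) for gate, i in zip(gates, chosen))
    return output, chosen


# Four hypothetical experts and router weights, for illustration only.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_weights = [0.1, 0.9, -0.5, 0.4]

output, chosen = moe_layer(3.0, experts, router_weights, top_k=2)
print(chosen)  # [1, 3]: only two of the four experts ran for this input
```

The total parameter count grows with the number of experts, while the work done per input depends only on `top_k`, which is the efficiency argument made in the conversation.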
338 00:17:15.799 --> 00:17:18.720 Well, Dolly from Databricks took a different approach. They 339 00:17:18.799 --> 00:17:22.240 focused on creating a high-quality instruction-following data set, 340 00:17:22.440 --> 00:17:26.319 about 15,000 prompts and responses generated entirely by 341 00:17:26.319 --> 00:17:29.720 Databricks employees. Their goal was specifically to create an open 342 00:17:29.839 --> 00:17:33.720 instruction-tuned model without relying on data generated by proprietary 343 00:17:33.759 --> 00:17:36.880 models like ChatGPT, which often comes with restrictive licenses. 344 00:17:37.079 --> 00:17:40.880 They wanted to truly democratize instruction-following capabilities. 345 00:17:40.279 --> 00:17:43.359 So focusing on open data as much as open models. Precisely. 346 00:17:43.640 --> 00:17:47.119 And you also have models like Falcon from TII in 347 00:17:47.160 --> 00:17:51.359 the UAE, trained primarily on web data, and Grok-1 348 00:17:51.519 --> 00:17:55.440 from xAI, which also uses that mixture of experts architecture. 349 00:17:56.119 --> 00:17:59.599 The open source space is incredibly vibrant right now. OK. 350 00:18:00.000 --> 00:18:03.160 Whether it's open or closed, we have these incredibly powerful LLMs. 351 00:18:03.240 --> 00:18:07.319 If they're like general-purpose programmable machines, as some say, 352 00:18:08.279 --> 00:18:11.559 how do we, the users, actually program them? How do 353 00:18:11.599 --> 00:18:12.680 we tell them what we want? 354 00:18:12.839 --> 00:18:15.319 That's the art and science of prompt engineering. It's all 355 00:18:15.319 --> 00:18:18.640 about designing and refining the input, the prompt that you 356 00:18:18.680 --> 00:18:21.079 give to the model, to guide it towards the output 357 00:18:21.119 --> 00:18:21.480 you need. 358 00:18:21.759 --> 00:18:23.559 So the prompt is like the code we write for 359 00:18:23.599 --> 00:18:24.599 the LLM. 360 00:18:24.359 --> 00:18:27.680 In a way.
Yeah, you're essentially reprogramming the model's behavior 361 00:18:27.720 --> 00:18:30.880 on the fly, just using natural language instructions. It's becoming 362 00:18:30.880 --> 00:18:33.160 a crucial skill for anyone working with these models. 363 00:18:33.200 --> 00:18:35.400 And it's not just writing one prompt and being done, right? 364 00:18:35.480 --> 00:18:37.519 You mentioned it's iterative. Totally iterative. 365 00:18:37.559 --> 00:18:39.680 You design a prompt, you test it, you see what 366 00:18:39.720 --> 00:18:41.880 the model gives back, you evaluate that output, and then 367 00:18:41.880 --> 00:18:45.440 you refine the prompt based on the results. Lather, rinse, repeat. 368 00:18:45.680 --> 00:18:48.880 Okay, so what goes into a well-structured prompt? What 369 00:18:48.920 --> 00:18:49.839 are the key pieces? 370 00:18:50.680 --> 00:18:53.440 There are a few core components to think about. First, 371 00:18:53.720 --> 00:18:56.960 you often have system instructions, or a system prompt. This 372 00:18:57.039 --> 00:19:01.079 sets the stage, defines the LLM's persona or overall behavior 373 00:19:01.079 --> 00:19:04.559 for the conversation, like "You are a helpful assistant who 374 00:19:04.559 --> 00:19:09.359 explains complex topics simply." This usually persists across multiple 375 00:19:09.079 --> 00:19:12.119 turns. So setting the ground rules. Exactly. 376 00:19:12.400 --> 00:19:14.200 Then you have the main prompt template, which is the 377 00:19:14.279 --> 00:19:17.799 user-facing instruction, often with placeholders where specific input will go. 378 00:19:18.240 --> 00:19:20.480 You also need to consider the LLM parameters, things like 379 00:19:20.720 --> 00:19:21.960 temperature. Temperature? 380 00:19:22.000 --> 00:19:22.799 What does that control? 381 00:19:23.000 --> 00:19:26.359 Temperature controls the randomness of the output.
Higher temperature means 382 00:19:26.359 --> 00:19:31.279 more randomness, more creativity, maybe more unexpected results. Lower temperature 383 00:19:31.319 --> 00:19:35.680 makes the output more focused, more deterministic, sticking closer to the 384 00:19:35.720 --> 00:19:36.720 most probable words. 385 00:19:36.880 --> 00:19:39.119 Okay, creativity versus predictability. 386 00:19:39.440 --> 00:19:42.920 Right. And you might set completion tokens to limit the 387 00:19:42.960 --> 00:19:48.000 output length. And importantly, there are usually safeguards or guardrails 388 00:19:48.039 --> 00:19:50.960 in place, either built into the model or added around 389 00:19:50.960 --> 00:19:54.359 it, to prevent it from generating harmful, biased, or inappropriate 390 00:19:54.400 --> 00:19:55.440 content. Makes sense. 391 00:19:55.640 --> 00:19:59.279 So beyond the structure, what makes a prompt effective? Any 392 00:19:59.359 --> 00:20:00.200 general strategies? 393 00:20:00.839 --> 00:20:04.240 Clarity and specificity are key. Be really clear about what 394 00:20:04.279 --> 00:20:07.839 you want, don't be vague. If it's a complex task, 395 00:20:08.039 --> 00:20:10.799 break it down into smaller, simpler steps within the prompt. 396 00:20:11.200 --> 00:20:12.559 Tell the model how you want it to 397 00:20:12.559 --> 00:20:15.319 approach the problem. Step-by-step instructions. Exactly. 398 00:20:15.880 --> 00:20:18.559 And another really powerful technique is few-shot prompting. 399 00:20:19.079 --> 00:20:21.880 Ah, you mentioned that with GPT-2, giving examples. 400 00:20:22.039 --> 00:20:24.160 Yes, instead of just telling the model what to do, 401 00:20:24.200 --> 00:20:26.039