WEBVTT 1 00:00:00.200 --> 00:00:02.759 Welcome to the Deep Dive. We're the show that helps 2 00:00:02.799 --> 00:00:05.879 you cut through all that information noise and really get 3 00:00:05.919 --> 00:00:08.599 to the insights you actually need. If you've ever felt 4 00:00:08.599 --> 00:00:11.000 like you're trying to well drink from a fire hose 5 00:00:11.000 --> 00:00:13.359 when you're looking at the world of AI, especially genitive AI, 6 00:00:13.560 --> 00:00:17.440 you are definitely not alone. It's a lot. So today 7 00:00:17.480 --> 00:00:20.600 we're taking a deep dive into something really practical, unlocking 8 00:00:20.640 --> 00:00:24.120 creativity with Azure open Ai. It's basically a guide to 9 00:00:24.199 --> 00:00:27.399 using these really advanced AI models effectively. Our mission here 10 00:00:27.440 --> 00:00:30.120 is simple, cut through the complexity, pull out the most 11 00:00:30.120 --> 00:00:33.000 important stuff, the surprising facts, so you can get up 12 00:00:33.039 --> 00:00:36.000 to speed fast on how these tools work and importantly, 13 00:00:36.200 --> 00:00:38.479 how they're being used out there in the real world. 14 00:00:38.759 --> 00:00:41.039 Think of this as your shortcut, you know, to understanding 15 00:00:41.039 --> 00:00:43.399 the what, the how, and the why it matters. For 16 00:00:43.479 --> 00:00:46.799 Azure open Ai, the source for dipping into is super comprehensive. 17 00:00:46.840 --> 00:00:50.119 It goes from the absolute basics right through to advanced 18 00:00:50.119 --> 00:00:52.439 stuff security, how to actually run these things, the whole 19 00:00:52.520 --> 00:00:55.600 nine yards. Okay, so let's unpack this a bit. When 20 00:00:55.600 --> 00:00:58.039 we talk about large language models lllms, what are we 21 00:00:58.079 --> 00:01:00.600 actually talking about, Like, what's the core idea. 22 00:01:00.679 --> 00:01:05.920 So at their very core, lms are about taking human language, 23 00:01:06.040 --> 00:01:09.280 our text, and turning it into something computers can genuinely 24 00:01:09.319 --> 00:01:12.359 work with, not just store. It starts by breaking the 25 00:01:12.400 --> 00:01:15.480 text down into what are called tokens. These are usually 26 00:01:15.480 --> 00:01:17.120 words or sometimes parts of words. 27 00:01:17.159 --> 00:01:18.439 Okay, tokens, got it? 28 00:01:18.560 --> 00:01:22.000 Yeah, And then these tokens get converted into something called embeddings. 29 00:01:22.400 --> 00:01:25.799 These are numerical vectors, basically long strings of numbers. You 30 00:01:25.799 --> 00:01:29.079 can sort of imagine these embeddings like a really sophisticated map, 31 00:01:29.439 --> 00:01:31.719 where the position of each point tells you the meaning 32 00:01:31.760 --> 00:01:34.599 of that word or phrase and how it relates to others. 33 00:01:34.959 --> 00:01:37.040 Ah. So it's not just the word itself, but it's 34 00:01:37.079 --> 00:01:39.000 meaning in context exactly. 35 00:01:39.079 --> 00:01:42.400 That's how the computer starts to grasp the nuances the context, 36 00:01:42.760 --> 00:01:47.359 not just isolated words. Now, the real breakthrough tech here 37 00:01:47.480 --> 00:01:50.599 is the transformer architecture. Older models they really struggle to 38 00:01:50.680 --> 00:01:53.239 keep track of context in long pieces of text. They'd 39 00:01:53.280 --> 00:01:54.680 sort of forget the beginning, right. 40 00:01:54.640 --> 00:01:55.719 I remember that limitation. 41 00:01:56.120 --> 00:01:59.319 Yeah, But the transformer, with its self attention mechanism, totally 42 00:01:59.400 --> 00:02:02.680 change the game. It lets the model way how important 43 00:02:02.719 --> 00:02:05.000 different words are to each other, even across a really 44 00:02:05.040 --> 00:02:08.520 long sequence. It captures those deep relationships. And when an 45 00:02:08.639 --> 00:02:12.680 LM actually generates text, it does it word by word. 46 00:02:13.039 --> 00:02:14.879 It's called auto regressive generation. 47 00:02:15.159 --> 00:02:16.759 Auto regressive Okay, think of. 48 00:02:16.680 --> 00:02:19.960 It like a game where each move, each word is 49 00:02:20.000 --> 00:02:22.400 based on all the previous ones. It helps maintain context 50 00:02:22.439 --> 00:02:26.280 and coherence, even for really complex ideas. And these models 51 00:02:26.280 --> 00:02:29.360 are just massive, huge. They run on big clusters of 52 00:02:29.360 --> 00:02:32.680 computers and usually access them as a service through an 53 00:02:32.680 --> 00:02:36.360 API because they've been trained on just enormous amounts of text. 54 00:02:36.240 --> 00:02:39.479 Data like a skilled improv artist building on what came before. 55 00:02:39.520 --> 00:02:41.960 That's a helpful analogy. So okay, that's the foundation. Then 56 00:02:42.000 --> 00:02:44.319 what about foundation models? What makes them special? 57 00:02:44.599 --> 00:02:48.639 Well, what's really interesting about foundation models is while their 58 00:02:48.680 --> 00:02:51.639 main job is basically predicting the next word, their sheer 59 00:02:51.639 --> 00:02:57.479 scale changes things. They're trained on these immense diverse data sets. 60 00:02:57.520 --> 00:03:01.680 We're talking terabytes of data often, and this training gives 61 00:03:01.719 --> 00:03:03.520 them what are called emergent capability. 62 00:03:03.639 --> 00:03:05.639 Emergent capabilities meaning. 63 00:03:05.560 --> 00:03:07.120 Meaning they can do a whole bunch of tasks they 64 00:03:07.120 --> 00:03:10.840 weren't specifically programmed or trained for, often really well, sometimes 65 00:03:10.879 --> 00:03:14.240 just needing a few examples or even none. The main 66 00:03:14.280 --> 00:03:18.199 advantages are well. First that performance. It leads to really 67 00:03:18.199 --> 00:03:22.000 big productivity games. Think of them like a super efficient 68 00:03:22.039 --> 00:03:24.479 assistant for tasks that usually take a lot of time 69 00:03:25.039 --> 00:03:29.039 customer service processing data. They can speed things up dramatically. 70 00:03:28.719 --> 00:03:30.039 A turbo charger for the team. 71 00:03:30.120 --> 00:03:33.360 Yeah, pretty much, but this is important. They have limitations. 72 00:03:33.479 --> 00:03:35.159 The big one is hallucination. 73 00:03:35.800 --> 00:03:37.400 Ah heard about this. 74 00:03:37.520 --> 00:03:41.080 It's when the LM generates stuff that sounds totally plausible, 75 00:03:41.199 --> 00:03:45.360 really confident, but it's just not factually accurate or maybe 76 00:03:45.360 --> 00:03:46.280 even completely made up. 77 00:03:46.360 --> 00:03:50.000 So it's not lying, just pattern matching gone wrong exactly. 78 00:03:50.120 --> 00:03:53.199 It's confidently producing text that fits a pattern even if 79 00:03:53.240 --> 00:03:56.759 reality doesn't match. That's why human oversight is absolutely crucial. 80 00:03:57.360 --> 00:03:59.360 We need to ground them, which we can talk about. 81 00:04:00.039 --> 00:04:03.199 Another limit is the context window. It's basically the model's 82 00:04:03.199 --> 00:04:05.759 short term memory, how much info it can juggle at once. 83 00:04:06.199 --> 00:04:08.199 You know, some big models like GPT four to oh 84 00:04:08.280 --> 00:04:11.759 can handle say one hundred and twenty eight thousand input tokens, 85 00:04:11.800 --> 00:04:14.159 which is huge, but there's still a limit. Feed it 86 00:04:14.199 --> 00:04:17.040 too much and it just can't process it all simultaneously. 87 00:04:17.360 --> 00:04:21.879 Okay, so potential for errors, memory limits, but still incredibly powerful. 88 00:04:21.959 --> 00:04:25.639 So where are we seeing these foundation models really making 89 00:04:25.680 --> 00:04:28.279 a difference in the real world despite those caveats. 90 00:04:28.439 --> 00:04:31.519 Well, the fleckibility is just amazing. It's touching almost every industry. 91 00:04:31.800 --> 00:04:34.800 In content creation, for example, they're not just writing generic stuff. 92 00:04:35.120 --> 00:04:39.839 They can generate targeted marketing copy, blog posts, social media updates. Yeah, 93 00:04:39.879 --> 00:04:43.680 speeding up content pipelines hugely faster content okay. And in 94 00:04:43.720 --> 00:04:48.040 customer support, handling tons of common questions automatically. That frees 95 00:04:48.120 --> 00:04:52.920 up human agents for the really tricky, nuanced problems. Beyond that, 96 00:04:54.040 --> 00:04:56.560 text summarization is a big one, getting the gist of 97 00:04:56.600 --> 00:05:02.480 long documents quickly, powering sophisticated chatbots, virtual assistance for personalized help, 98 00:05:02.759 --> 00:05:07.000 even creative writing assistance, you know, brainstorming plots or dialogue. 99 00:05:07.199 --> 00:05:09.680 Interesting. What about more specialized fields. 100 00:05:09.720 --> 00:05:12.399 Yeah, definitely making inroads in healthcare for instance, they can 101 00:05:12.439 --> 00:05:16.360 help analyze initial patient info maybe symptoms alongside some images. 102 00:05:16.759 --> 00:05:19.000 But and this is critical, they are not built to 103 00:05:19.079 --> 00:05:23.319 interpret specialized medical scans or give medical advice that needs. 104 00:05:23.160 --> 00:05:24.920 A professional, very important distinction. 105 00:05:25.160 --> 00:05:27.839 Absolutely. Yeah, And we also see them in cybersecurity for 106 00:05:27.879 --> 00:05:32.079 analyzing potential threats and language learning apps creating accessibility tools 107 00:05:32.160 --> 00:05:35.120 like audio descriptions for videos. The list just keeps growing. 108 00:05:35.199 --> 00:05:38.759 Wow, that's a massive range from marketing copy to analyzing 109 00:05:38.800 --> 00:05:41.079 medical info sort of. Okay, now let's pivot. This is 110 00:05:41.079 --> 00:05:43.839 where it gets really interesting for businesses. Right, how does 111 00:05:43.879 --> 00:05:46.600 Microsoft's Azure open AI fit in? We hear about this 112 00:05:46.639 --> 00:05:47.560 big partnership. 113 00:05:47.759 --> 00:05:52.079 You're right, that partnership is central. Azure OpenAI Service or AOAI, 114 00:05:52.519 --> 00:05:55.360 is Microsoft's way of bringing these powerful open AI models 115 00:05:55.399 --> 00:05:58.480 into the enterprise world, but with a heavy focus on security, 116 00:05:58.600 --> 00:06:03.560 compliance and and manageability. It gives you secure rest API 117 00:06:03.720 --> 00:06:06.879 access to all the big open AI models GPT four Turbo, 118 00:06:06.959 --> 00:06:10.000 the new GPT four h GPT four oh Mini, GPT 119 00:06:10.079 --> 00:06:13.040 three point five Turbo for text tasks, Whisper for audio, 120 00:06:13.120 --> 00:06:15.439 Daily three for images, and the embedding models. 121 00:06:15.480 --> 00:06:18.720 So the models everyone's talking about, but package for business exactly. 122 00:06:19.000 --> 00:06:21.199 But the key difference with Azure open Ai is the 123 00:06:21.319 --> 00:06:24.360 enterprise grade stuff that's only available on Azure. We're talking 124 00:06:24.519 --> 00:06:27.959 robust security controls, private networking options so your data doesn't 125 00:06:28.000 --> 00:06:33.040 touch the public Internet, meaning strict compliance standards, broad geographic availability, 126 00:06:33.439 --> 00:06:36.879 and really important built in responsible AI content filtering. 127 00:06:36.920 --> 00:06:39.959 Okay, those enterprise features sound critical. Can you quickly run 128 00:06:40.040 --> 00:06:41.240 through the main model types. 129 00:06:41.240 --> 00:06:43.800 Again, sure you've got the GPT four family that's the 130 00:06:43.800 --> 00:06:46.360 top tier, like GPT four Oh, GPT four O Mini 131 00:06:46.360 --> 00:06:49.480 and Turbo. They have advanced reasoning, big context windows. GPT 132 00:06:49.519 --> 00:06:51.759 four in takes one hundred and twenty eight thousand input tokens, 133 00:06:51.759 --> 00:06:54.480 which is huge, a whole book almost pretty much, and 134 00:06:54.560 --> 00:06:56.920 GPT four Mini is interesting because it can output a 135 00:06:56.920 --> 00:06:59.720 lot of tokens up to sixteen thousand, great for longer 136 00:06:59.720 --> 00:07:03.600 respons bonses. Then there's GBT three point five Turbo, often 137 00:07:03.639 --> 00:07:06.519 the go to for being capable, the cost effective, especially 138 00:07:06.519 --> 00:07:09.519 for chats, and of course Whisper for audio, Dally three 139 00:07:09.560 --> 00:07:12.360 for images, and the embedding models which are essential for 140 00:07:12.399 --> 00:07:14.920 any kind of smart search or understanding meaning. 141 00:07:14.959 --> 00:07:17.279 And who gets access? Can any business just sign. 142 00:07:17.199 --> 00:07:21.160 Up right now? Access is mostly for enterprise customers and partners. 143 00:07:21.439 --> 00:07:25.319 You typically apply using your company email. It's a deliberate 144 00:07:25.360 --> 00:07:28.639 approach really, Microsoft wants to ensure these powerful tools are 145 00:07:28.639 --> 00:07:31.959 deployed responsibly and securely in business settings with the right 146 00:07:31.959 --> 00:07:34.040 support in governance structures in place. 147 00:07:34.160 --> 00:07:37.560 Makes sense for managing something this powerful. Okay, let's go deeper. Now, 148 00:07:37.600 --> 00:07:40.680 some of the more advanced capabilities that really unlock new potential. 149 00:07:41.199 --> 00:07:43.839 Tell us about those embedding models in Azure open AI. 150 00:07:43.879 --> 00:07:44.600 What do they let you do? 151 00:07:44.959 --> 00:07:49.839 Right? Embeddings they are absolutely fundamental for what we call 152 00:07:49.920 --> 00:07:53.639 semantic understanding and similarity searches. Instead of just matching keywords 153 00:07:53.639 --> 00:07:57.120 like finding car when someone types car, embeddings capture the meaning. 154 00:07:57.439 --> 00:07:59.720 So if you search for fast car, it can find 155 00:07:59.720 --> 00:08:02.000 dot com U means talking about rapid automobiles because it 156 00:08:02.079 --> 00:08:03.720 understands those concepts are similar. 157 00:08:03.839 --> 00:08:06.000 Much smarter search then exactly. 158 00:08:05.639 --> 00:08:09.000 Much more relevant results. Now there are older versions like 159 00:08:09.079 --> 00:08:11.920 ad A zero zero two, but the newer ones text 160 00:08:11.920 --> 00:08:15.240 embedding three small and text embedding three large, are well. 161 00:08:15.279 --> 00:08:18.720 They're significantly better. Text ebedting three small is much more 162 00:08:18.759 --> 00:08:22.199 cost effective and shows big performance jumps, especially for multi 163 00:08:22.279 --> 00:08:25.240 lingual stuff. Text embedding three large is the top performer 164 00:08:25.319 --> 00:08:26.920 overall for accuracy. 165 00:08:26.480 --> 00:08:27.800 Better and cheaper. Nice. 166 00:08:28.000 --> 00:08:29.959 And here's a really cool thing about these new models, 167 00:08:30.000 --> 00:08:33.840 a real aha moment. They use something called Matryoshka representation 168 00:08:34.000 --> 00:08:35.440 learning am. 169 00:08:36.519 --> 00:08:38.519 Like the Russian dolls exactly. 170 00:08:38.799 --> 00:08:42.000 It means you can actually shorten the embeddings, literally chop 171 00:08:42.039 --> 00:08:44.919 off numbers from the end of the sequence without them 172 00:08:44.960 --> 00:08:48.840 losing their core meaning. This is huge because shorter embeddings 173 00:08:48.879 --> 00:08:54.000 mean less storage, faster searches, lower costs, often while keeping 174 00:08:54.399 --> 00:08:57.799 or even improving performance compared to older, longer embeddings. It's 175 00:08:57.840 --> 00:08:58.759 incredibly efficient. 176 00:08:58.879 --> 00:09:01.799 That's amazing, trimming the th without losing the substance. Yeah. 177 00:09:02.039 --> 00:09:05.399 So you create these smart embeddings, where do you put them? 178 00:09:05.480 --> 00:09:08.200 Why are Azure vector databases important here? 179 00:09:08.399 --> 00:09:11.759 Good question. You need a special kind of database optimized 180 00:09:11.879 --> 00:09:14.679 for storing and searching these high dimensional vectors. That's where 181 00:09:14.679 --> 00:09:17.360 Azure vector databases come in. Their whole point is to 182 00:09:17.600 --> 00:09:21.679 enable really fast, really precise similarity searches based on that 183 00:09:21.679 --> 00:09:25.080 semantic meaning we talked about, not just keyword matching, find 184 00:09:25.120 --> 00:09:27.879 related concepts instantly across huge data sets. 185 00:09:27.960 --> 00:09:29.200 And Azure has options for this. 186 00:09:29.360 --> 00:09:31.879 Oh yes, Azure ai search is a big one. Interestingly, 187 00:09:31.960 --> 00:09:35.639 open Ai actually uses Azure ai Search for vector capabilities 188 00:09:35.639 --> 00:09:36.879 in chat GPT itself. 189 00:09:36.960 --> 00:09:37.240 Wow. 190 00:09:37.399 --> 00:09:41.440 Yeah. And there's also Azure Cosmos dB with vector capabilities, 191 00:09:41.679 --> 00:09:44.799 Azure Managed Rettis, and even Postgres School with the PG 192 00:09:45.000 --> 00:09:48.279 vector expansion. Lots of choices depending on your needs, all 193 00:09:48.279 --> 00:09:50.759 designed for handling these complex numerical vectors. 194 00:09:50.879 --> 00:09:54.320 Okay, Earlier you mentioned that limitation hallucination where lllms can 195 00:09:54.320 --> 00:09:57.639 make things up. How does retrieval, augmented generation or RI 196 00:09:58.200 --> 00:09:58.879 help fix that? 197 00:09:59.120 --> 00:10:03.039 Right? Is direct answer to the hallucination problem. It works 198 00:10:03.080 --> 00:10:04.360 by grounding the model. 199 00:10:04.559 --> 00:10:07.000 Grounding it like keeping its feet on the ground pretty much. 200 00:10:07.399 --> 00:10:11.320 It connects the LM's internal knowledge with real world verified information, 201 00:10:12.000 --> 00:10:14.679 usually from an external source. Think of it like giving 202 00:10:14.679 --> 00:10:17.159 the model a factual reference library to check before it 203 00:10:17.200 --> 00:10:19.759 generates an answer, keeps it rooted in reality. 204 00:10:19.960 --> 00:10:21.480 How does that work in practice? 205 00:10:22.080 --> 00:10:26.759 So the process is quite elegant. A user asks a question. First, 206 00:10:26.879 --> 00:10:30.200 the system retree is relevant information from an external knowledge base, 207 00:10:30.240 --> 00:10:33.559 typically one of those vector databases we just discussed. Then 208 00:10:33.679 --> 00:10:36.519 the LM gets both the original question and this retrieved 209 00:10:36.519 --> 00:10:40.600 factual context. It uses both pieces to generate the final response. 210 00:10:41.080 --> 00:10:44.759 Ah, so it's using verified info to guide its answer precisely. 211 00:10:44.799 --> 00:10:48.759 The benefits are huge. Much better accuracy because it's using facts, 212 00:10:49.279 --> 00:10:53.039 richer context than just its training data, more flexibility because 213 00:10:53.080 --> 00:10:55.799 you can update the knowledge base without retraining the whole model, 214 00:10:56.039 --> 00:10:57.000 and it scales well. 215 00:10:57.440 --> 00:10:59.720 Sounds great. Are there downsides? 216 00:11:00.080 --> 00:11:00.679 Our challenges? 217 00:11:00.759 --> 00:11:01.039 Yeah? 218 00:11:01.200 --> 00:11:04.320 Getting the document segmentation right for the retrieval step is tricky. 219 00:11:04.679 --> 00:11:07.440 Making sure the retrieved info is genuinely relevant can be hard, 220 00:11:07.960 --> 00:11:10.639 and setting of the whole RMA pipeline could be complex 221 00:11:10.679 --> 00:11:11.639 and resource intensive. 222 00:11:11.720 --> 00:11:15.879 Okay, makes sense. Moving beyond just text, what about models 223 00:11:15.879 --> 00:11:18.600 that understand images too? Tell us about azure OpenAI is 224 00:11:18.679 --> 00:11:20.759 multimodal stuff, especially GBT four oh. 225 00:11:20.919 --> 00:11:24.559 Yeah, multimodal is a really exciting frontier. Models like GBT 226 00:11:24.600 --> 00:11:27.720 four oh can process and understand both text and images 227 00:11:27.759 --> 00:11:29.559 together in the same input, so. 228 00:11:29.480 --> 00:11:31.519 You can show it a picture and ask questions. 229 00:11:31.240 --> 00:11:34.320 About it exactly. This opens up tons of practical uses, 230 00:11:34.600 --> 00:11:39.440 automatically generating detailed captions for images, visual question answering asking 231 00:11:39.679 --> 00:11:42.679 what color is the car in this picture, content moderation 232 00:11:42.759 --> 00:11:46.360 for visual stuff in e commerce, maybe generating product descriptions 233 00:11:46.399 --> 00:11:49.320 just from photos, and as we touched on, even assisting 234 00:11:49.360 --> 00:11:53.679 with initial medical diagnostics by looking at symptoms and related images. 235 00:11:54.679 --> 00:11:58.960 But again with that crucial caveat, not for interpreting specialized 236 00:11:59.000 --> 00:12:00.039 scans or giving. 237 00:11:59.799 --> 00:12:03.440 It right always the caveat. Are there things that struggles 238 00:12:03.440 --> 00:12:05.879 with visually definitely limitations. 239 00:12:06.120 --> 00:12:08.480 It might not perform as well with non Latin alphabets 240 00:12:08.519 --> 00:12:12.799 and images, or very small or rotated text. Sometimes precise 241 00:12:12.840 --> 00:12:15.600 spatial reasoning like is the blue box exactly to the 242 00:12:15.679 --> 00:12:18.080 left of the red sphere? Can be tricky for. 243 00:12:18.039 --> 00:12:21.399 It still a massive leap. Now, how do these models 244 00:12:21.440 --> 00:12:24.720 actually do things in the real world interact with other systems? 245 00:12:24.720 --> 00:12:26.039 How does function calling work? 246 00:12:26.200 --> 00:12:28.559 Function calling is super interesting. The key thing to get 247 00:12:28.600 --> 00:12:30.600 is the model itself doesn't run the function. 248 00:12:30.759 --> 00:12:31.720 It doesn't, then what does it do? 249 00:12:32.120 --> 00:12:34.879 It intelligently figures out if an external tool or function 250 00:12:35.000 --> 00:12:39.159 is needed to answer the user's request. If it decides yes, 251 00:12:39.600 --> 00:12:42.559 it then generates the parameters or arguments that function needs. 252 00:12:43.159 --> 00:12:45.279 So the flow is like this model thinks a function 253 00:12:45.320 --> 00:12:48.799 call would help. The API response tells your application, Hey, 254 00:12:48.919 --> 00:12:50.759 call this function with these arguments. 255 00:12:51.080 --> 00:12:53.279 So my app does the actual work exactly. 256 00:12:53.440 --> 00:12:57.039 Your application takes those parameters, runs the function. Maybe it 257 00:12:57.120 --> 00:13:00.000 queries a database, calls another API, sends an email, whatever. 258 00:13:00.480 --> 00:13:03.200 Then your app sends the result of that function call 259 00:13:03.279 --> 00:13:06.360 back to the LM. The LEM then uses that real 260 00:13:06.360 --> 00:13:10.120 world result to formulate its final informed answer to the user. 261 00:13:10.440 --> 00:13:12.559 It's a really dynamic way to connect the AI to 262 00:13:12.639 --> 00:13:13.519 external systems. 263 00:13:13.759 --> 00:13:16.720 Got it? And building on that interaction idea, what's the 264 00:13:16.799 --> 00:13:19.759 assistance API? Sounds like you can build more complex agents. 265 00:13:19.879 --> 00:13:23.879 Precisely, the Azure Open AI Assistance API is designed specifically 266 00:13:23.919 --> 00:13:27.919 for building these more sophisticated stateful AI assistants, tailored to 267 00:13:28.080 --> 00:13:31.240 particular jobs. It comes with some really powerful built in tools. 268 00:13:31.279 --> 00:13:34.080 One is a code interpreter. This lets the assistant write 269 00:13:34.120 --> 00:13:36.919 and run Python code securely in a sandboxed environment. 270 00:13:37.080 --> 00:13:39.159 Python code what for all. 271 00:13:39.000 --> 00:13:43.759 Sorts of things, performing complex calculations, analyzing data directly from 272 00:13:43.840 --> 00:13:48.279 uploaded files like csvs, even generating charts or processing files. 273 00:13:48.600 --> 00:13:52.200 It's incredibly powerful for data tasks. Another key tool is 274 00:13:52.240 --> 00:13:55.600 file search. This allows the assistant to access and retrieve 275 00:13:55.639 --> 00:13:57.720 information from documents you provide to It. 276 00:13:57.879 --> 00:14:01.279 Ah like a private knowledge base for assistant exactly. 277 00:14:01.679 --> 00:14:04.559 It acts as an external knowledge source, letting the assistant 278 00:14:04.600 --> 00:14:08.080 answer questions using your specific, up to date information, going 279 00:14:08.159 --> 00:14:11.679 way beyond its original training. It uses vector embeddings under the. 280 00:14:11.679 --> 00:14:13.960 Hood for this, and function calling is part of this too. 281 00:14:14.320 --> 00:14:17.159 Yep, function calling is integrated right into the assistance API 282 00:14:17.240 --> 00:14:20.159 as well, so your assistant can use those external tools seamlessly. 283 00:14:20.720 --> 00:14:25.720 Okay, so assistance API for interactive smart agents. What if 284 00:14:25.720 --> 00:14:28.080 you just need to process a ton of stuff and 285 00:14:28.120 --> 00:14:31.159 you don't need instant answers like batch processing. 286 00:14:31.480 --> 00:14:34.360 That's exactly where the batch API comes in. It's designed 287 00:14:34.399 --> 00:14:37.919 for asynchronous, non real time processing jobs where you can 288 00:14:37.960 --> 00:14:40.480 wait a bit for the results. You basically bundle up 289 00:14:40.480 --> 00:14:43.200 a whole load of requests into a single file, submit it, 290 00:14:43.399 --> 00:14:44.919 and AZURE processes in bulk. 291 00:14:45.279 --> 00:14:47.279 What are the advantages of doing it that way? 292 00:14:47.360 --> 00:14:51.399 Two main things, costs and quota. You typically see a 293 00:14:51.440 --> 00:14:54.960 significant cost reduction, often around fifty percent compared to making 294 00:14:54.960 --> 00:14:57.799 all those calls individually to the standard real time endpoints. 295 00:14:58.039 --> 00:15:01.360 Plus you get a dedicated quota for backs processing separate 296 00:15:01.360 --> 00:15:05.360 from your interactive traffic. Azure guarantees completion within twenty four hours, 297 00:15:05.360 --> 00:15:08.480 though usually it's much much faster. Perfect for large scale 298 00:15:08.519 --> 00:15:11.399 content generation, data cleansing, summarization tasks. 299 00:15:11.799 --> 00:15:15.279 Things like that fifty percent cost reduction is pretty compelling. Okay, 300 00:15:15.360 --> 00:15:18.960 let's switch gears slightly. Fine tuning. This comes up a lot, 301 00:15:19.000 --> 00:15:21.440 but it raises a big question. When do you actually 302 00:15:21.480 --> 00:15:24.440 need to fine tune a model? Especially with powerful things 303 00:15:24.519 --> 00:15:26.799 like prompt engineering and r GAG. 304 00:15:26.519 --> 00:15:30.679 Available, That is a really critical strategic question. Fine tuning 305 00:15:30.759 --> 00:15:33.919 is different. It means taking an existing pre trained LLM 306 00:15:34.159 --> 00:15:37.960 and actually retraining it, adapting it using your own specific 307 00:15:38.120 --> 00:15:41.480 curated example data. It's a supervised learning process. You show 308 00:15:41.480 --> 00:15:45.360 the model examples given this input produce this exact output. 309 00:15:45.440 --> 00:15:47.919 You're teaching it a very specific behavior or style. 310 00:15:48.000 --> 00:15:51.240 Okay, so you're modifying the model itself. What are the benefits? 311 00:15:51.600 --> 00:15:54.919 Well, you can potentially get much higher quality responses for 312 00:15:55.080 --> 00:15:58.559 very specific niche tasks. You can effectively train it on 313 00:15:58.639 --> 00:16:01.799 more data than fits in this andar context window because 314 00:16:01.840 --> 00:16:05.480 the knowledge gets baked into the model weights, and sometimes 315 00:16:05.519 --> 00:16:08.360 it can lead to using fewer tokens in your prompts later, 316 00:16:08.799 --> 00:16:09.679 saving costs. 317 00:16:10.440 --> 00:16:12.919 So when is it the right call over just better 318 00:16:12.960 --> 00:16:14.159 prompting or RAG? 319 00:16:14.600 --> 00:16:17.480 You should really only consider fine tuning when prompt engineering 320 00:16:17.519 --> 00:16:19.919 in a RAG aren't getting you the consistent quality or 321 00:16:19.960 --> 00:16:23.279 accuracy you need for a specific problem. It's best when 322 00:16:23.320 --> 00:16:25.919 you have a unique domain or a very specific data 323 00:16:26.000 --> 00:16:29.279 set that's well prepared in high quality, and crucially, you 324 00:16:29.360 --> 00:16:32.279 need clear goals and ways to measure if the fine 325 00:16:32.279 --> 00:16:35.480 tuning actually work. Like quantitative metrics, how much data do 326 00:16:35.480 --> 00:16:38.039 you need? That's a key point. While technically you might 327 00:16:38.080 --> 00:16:41.120 start with just like ten examples to get any real benefit, 328 00:16:41.200 --> 00:16:44.360 to really shift the model's behavior. Usually need hundreds or 329 00:16:44.360 --> 00:16:46.480 more likely thousands of high quality examples. 330 00:16:46.639 --> 00:16:48.679 Thousands okay, that's a commitment. 331 00:16:48.480 --> 00:16:52.879 It is, and importantly, low quality or inconsistent examples can 332 00:16:52.919 --> 00:16:56.120 actually hurt the model's performance, making it worse, so data 333 00:16:56.200 --> 00:17:00.600 quality is paramount. The process involves preparing that data, carefully, 334 00:17:00.919 --> 00:17:05.039 running the training job, and then rigorously evaluating both safety 335 00:17:05.079 --> 00:17:06.440 and performance before deploying. 336 00:17:06.559 --> 00:17:09.400 Okay, let's circle back to interacting with the model. Prompt 337 00:17:09.400 --> 00:17:12.240 engineering you mentioned it's powerful. It feels like a real 338 00:17:12.359 --> 00:17:15.319 art form almost. It's not just asking a basic question, is. 339 00:17:15.240 --> 00:17:18.519 It not at all? It's absolutely critical. Prompt engineering is 340 00:17:20.119 --> 00:17:22.839 basically the art and science of crafting your input, your 341 00:17:22.880 --> 00:17:25.880 prompt to guide the LM towards the specific kind of 342 00:17:25.880 --> 00:17:28.720 output you want without changing the model itself. It's all 343 00:17:28.720 --> 00:17:31.559 about how you communicate your request to the AI. Think 344 00:17:31.599 --> 00:17:34.440 of a really good prompt as having several key ingredients. 345 00:17:34.640 --> 00:17:36.799 Ingredients like a recipe, kind. 346 00:17:36.559 --> 00:17:39.960 Of first unique context like imagine you're a travel agent. 347 00:17:40.400 --> 00:17:44.440 Then clear instructions, write a three day itinerary, add constraints 348 00:17:45.160 --> 00:17:49.400 focusing on budget friendly options. You might include variables or 349 00:17:49.440 --> 00:17:53.400 specific inputs. Mention the Eiffel Tower in the louver, specify 350 00:17:53.400 --> 00:17:56.960 the desired output format, provide the answer as a bulleted list, 351 00:17:57.440 --> 00:18:00.200 and maybe set the tone style in an enthusiasm, sick 352 00:18:00.240 --> 00:18:01.079 and friendly tone. 353 00:18:01.119 --> 00:18:02.759 Wow, Okay, that's quite detailed. 354 00:18:02.839 --> 00:18:05.640 It can be, And one really powerful element is providing 355 00:18:05.720 --> 00:18:08.680 examples or templates, like here's an example of a good 356 00:18:08.720 --> 00:18:13.519 itinerary item. Day one morning, visit Notre Dame cathedral, free entry. 357 00:18:13.799 --> 00:18:17.000 Now create the rest. Putting these elements together makes a 358 00:18:17.079 --> 00:18:20.799 huge difference in getting tailored, useful responses instead of something 359 00:18:20.839 --> 00:18:21.680 generic that. 360 00:18:21.599 --> 00:18:25.079 Makes total sense layering the instructions, What about more advanced 361 00:18:25.119 --> 00:18:28.400 strategies you mentioned guiding the AI's thought process. 362 00:18:28.640 --> 00:18:32.160 Yet beyond just the structure, there are strategies. Always aim 363 00:18:32.240 --> 00:18:36.359 for clear, unambiguous instructions. Asking the model to adopt a 364 00:18:36.400 --> 00:18:40.640 specific persona helps. Using delimiters like triple quotes or XML 365 00:18:40.720 --> 00:18:45.119 tags to separate instructions from content is good practice. Breaking 366 00:18:45.119 --> 00:18:48.480 down complex tasks into steps for the model is effective, 367 00:18:48.920 --> 00:18:51.920 and as I mentioned, providing examples is almost always beneficial. 368 00:18:52.200 --> 00:18:55.079 One really important strategy is often called give the model 369 00:18:55.119 --> 00:18:55.680 time to think. 370 00:18:55.839 --> 00:18:58.319 Time to think. It's not actually thinking though, right. 371 00:18:58.240 --> 00:19:01.559 Right, it's not conscious Yeah. Structuring the prompt to encourage 372 00:19:01.559 --> 00:19:04.200 a step by step process often leads to better accuracy 373 00:19:04.200 --> 00:19:07.079 on complex problems. Force it to outline its steps before 374 00:19:07.119 --> 00:19:09.480 giving the final answer. It's like asking your person to 375 00:19:09.519 --> 00:19:11.440 show their work in math reduces errors. 376 00:19:11.559 --> 00:19:15.400 Oh, okay, show you work. What about specific named techniques? 377 00:19:15.720 --> 00:19:18.359 So we have a kind of progression. Zero shot is 378 00:19:18.400 --> 00:19:22.440 asking a question cold with no examples. One shot gives 379 00:19:22.440 --> 00:19:26.039 one example, Few shot gives well a few examples. Adding 380 00:19:26.079 --> 00:19:29.839 examples dramatically improves accuracy by showing the model the pattern 381 00:19:29.880 --> 00:19:32.240 you want. Then there's chain of thought or code T. 382 00:19:32.599 --> 00:19:35.160 This is where you explicitly ask the model to explain 383 00:19:35.200 --> 00:19:38.519 its reasoning step by step before giving the final answer. 384 00:19:39.000 --> 00:19:41.799 It forces that show your work process and really helps 385 00:19:41.799 --> 00:19:43.480 with complex logic or math problems. 386 00:19:43.480 --> 00:19:45.559 So you see it's reasoning exactly. 387 00:19:45.759 --> 00:19:48.519 Building on that is tree of SATs or toe T. 388 00:19:49.039 --> 00:19:51.920 This is more advanced. It lets the model explore multiple 389 00:19:51.960 --> 00:19:55.279 different reasoning paths like branches of a tree, evaluate them, 390 00:19:55.519 --> 00:19:58.039 and then choose the best one. Great for complex planning 391 00:19:58.119 --> 00:19:59.359 or exploring possibilities. 392 00:19:59.319 --> 00:20:01.160 Oka more complex now, yeah. 393 00:20:00.960 --> 00:20:04.039 A couple more interesting ones. Program aided language model or 394 00:20:04.200 --> 00:20:08.440 pall MS. This is fascinating. The LM actually generates small 395 00:20:08.480 --> 00:20:11.720 snippets of code, often Python, to help it solve a problem. 396 00:20:11.759 --> 00:20:13.160 It writes code to help itself. 397 00:20:13.440 --> 00:20:16.880 Yes, like if you ask a complex math question, it 398 00:20:16.960 --> 00:20:19.880 might write and run Python code using an interpreter to 399 00:20:19.880 --> 00:20:22.720 get the exact numerical answer rather than trying to estimate 400 00:20:22.720 --> 00:20:26.680 it linguistically. Then there's react. This technique lets the model 401 00:20:26.720 --> 00:20:29.799