WEBVTT

1
00:00:00.200 --> 00:00:02.759
<v Speaker 1>Welcome to the Deep Dive. We're the show that helps

2
00:00:02.799 --> 00:00:05.879
<v Speaker 1>you cut through all that information noise and really get

3
00:00:05.919 --> 00:00:08.599
<v Speaker 1>to the insights you actually need. If you've ever felt

4
00:00:08.599 --> 00:00:11.000
<v Speaker 1>like you're trying to, well, drink from a fire hose

5
00:00:11.000 --> 00:00:13.359
<v Speaker 1>when you're looking at the world of AI, especially generative AI,

6
00:00:13.560 --> 00:00:17.440
<v Speaker 1>you are definitely not alone. It's a lot. So today

7
00:00:17.480 --> 00:00:20.600
<v Speaker 1>we're taking a deep dive into something really practical, unlocking

8
00:00:20.640 --> 00:00:24.120
<v Speaker 1>creativity with Azure OpenAI. It's basically a guide to

9
00:00:24.199 --> 00:00:27.399
<v Speaker 1>using these really advanced AI models effectively. Our mission here

10
00:00:27.440 --> 00:00:30.120
<v Speaker 1>is simple: cut through the complexity, pull out the most

11
00:00:30.120 --> 00:00:33.000
<v Speaker 1>important stuff, the surprising facts, so you can get up

12
00:00:33.039 --> 00:00:36.000
<v Speaker 1>to speed fast on how these tools work and importantly,

13
00:00:36.200 --> 00:00:38.479
<v Speaker 1>how they're being used out there in the real world.

14
00:00:38.759 --> 00:00:41.039
<v Speaker 1>Think of this as your shortcut, you know, to understanding

15
00:00:41.039 --> 00:00:43.399
<v Speaker 1>the what, the how, and the why it matters for

16
00:00:43.479 --> 00:00:46.799
<v Speaker 1>Azure OpenAI. The source we're dipping into is super comprehensive.

17
00:00:46.840 --> 00:00:50.119
<v Speaker 1>It goes from the absolute basics right through to advanced

18
00:00:50.119 --> 00:00:52.439
<v Speaker 1>stuff: security, how to actually run these things, the whole

19
00:00:52.520 --> 00:00:55.600
<v Speaker 1>nine yards. Okay, so let's unpack this a bit. When

20
00:00:55.600 --> 00:00:58.039
<v Speaker 1>we talk about large language models, LLMs, what are we

21
00:00:58.079 --> 00:01:00.600
<v Speaker 1>actually talking about? Like, what's the core idea?

22
00:01:00.679 --> 00:01:05.920
<v Speaker 2>So at their very core, LLMs are about taking human language,

23
00:01:06.040 --> 00:01:09.280
<v Speaker 2>our text, and turning it into something computers can genuinely

24
00:01:09.319 --> 00:01:12.359
<v Speaker 2>work with, not just store. It starts by breaking the

25
00:01:12.400 --> 00:01:15.480
<v Speaker 2>text down into what are called tokens. These are usually

26
00:01:15.480 --> 00:01:17.120
<v Speaker 2>words or sometimes parts of words.

27
00:01:17.159 --> 00:01:18.439
<v Speaker 1>Okay, tokens, got it.

28
00:01:18.560 --> 00:01:22.000
<v Speaker 2>Yeah. And then these tokens get converted into something called embeddings.

29
00:01:22.400 --> 00:01:25.799
<v Speaker 2>These are numerical vectors, basically long strings of numbers. You

30
00:01:25.799 --> 00:01:29.079
<v Speaker 2>can sort of imagine these embeddings like a really sophisticated map,

31
00:01:29.439 --> 00:01:31.719
<v Speaker 2>where the position of each point tells you the meaning

32
00:01:31.760 --> 00:01:34.599
<v Speaker 2>of that word or phrase and how it relates to others.

33
00:01:34.959 --> 00:01:37.040
<v Speaker 1>Ah. So it's not just the word itself, but its

34
00:01:37.079 --> 00:01:39.000
<v Speaker 1>meaning in context.

35
00:01:39.079 --> 00:01:42.400
<v Speaker 2>Exactly. That's how the computer starts to grasp the nuances, the context,

36
00:01:42.760 --> 00:01:47.359
<v Speaker 2>not just isolated words. Now, the real breakthrough tech here

37
00:01:47.480 --> 00:01:50.599
<v Speaker 2>is the transformer architecture. Older models really struggled to

38
00:01:50.680 --> 00:01:53.239
<v Speaker 2>keep track of context in long pieces of text. They'd

39
00:01:53.280 --> 00:01:54.680
<v Speaker 2>sort of forget the beginning.

40
00:01:54.640 --> 00:01:55.719
<v Speaker 1>Right, I remember that limitation.

41
00:01:56.120 --> 00:01:59.319
<v Speaker 2>Yeah. But the transformer, with its self-attention mechanism, totally

42
00:01:59.400 --> 00:02:02.680
<v Speaker 2>changed the game. It lets the model weigh how important

43
00:02:02.719 --> 00:02:05.000
<v Speaker 2>different words are to each other, even across a really

44
00:02:05.040 --> 00:02:08.520
<v Speaker 2>long sequence. It captures those deep relationships. And when an

45
00:02:08.639 --> 00:02:12.680
<v Speaker 2>LLM actually generates text, it does it word by word.

46
00:02:13.039 --> 00:02:14.879
<v Speaker 2>It's called autoregressive generation.

47
00:02:15.159 --> 00:02:16.759
<v Speaker 1>Autoregressive, okay.

48
00:02:16.680 --> 00:02:19.960
<v Speaker 2>Think of it like a game where each move, each word, is

49
00:02:20.000 --> 00:02:22.400
<v Speaker 2>based on all the previous ones. It helps maintain context

50
00:02:22.439 --> 00:02:26.280
<v Speaker 2>and coherence, even for really complex ideas. And these models

51
00:02:26.280 --> 00:02:29.360
<v Speaker 2>are just massive, huge. They run on big clusters of

52
00:02:29.360 --> 00:02:32.680
<v Speaker 2>computers, and you usually access them as a service through an

53
00:02:32.680 --> 00:02:36.360
<v Speaker 2>API because they've been trained on just enormous amounts of text data.

54
00:02:36.240 --> 00:02:39.479
<v Speaker 1>Like a skilled improv artist building on what came before.

55
00:02:39.520 --> 00:02:41.960
<v Speaker 1>That's a helpful analogy. So okay, that's the foundation. Then

56
00:02:42.000 --> 00:02:44.319
<v Speaker 1>what about foundation models? What makes them special?

57
00:02:44.599 --> 00:02:48.639
<v Speaker 2>Well, what's really interesting about foundation models is while their

58
00:02:48.680 --> 00:02:51.639
<v Speaker 2>main job is basically predicting the next word, their sheer

59
00:02:51.639 --> 00:02:57.479
<v Speaker 2>scale changes things. They're trained on these immense diverse data sets.

60
00:02:57.520 --> 00:03:01.680
<v Speaker 2>We're talking terabytes of data often, and this training gives

61
00:03:01.719 --> 00:03:03.520
<v Speaker 2>them what are called emergent capabilities.

62
00:03:03.639 --> 00:03:05.639
<v Speaker 1>Emergent capabilities, meaning?

63
00:03:05.560 --> 00:03:07.120
<v Speaker 2>Meaning they can do a whole bunch of tasks they

64
00:03:07.120 --> 00:03:10.840
<v Speaker 2>weren't specifically programmed or trained for, often really well, sometimes

65
00:03:10.879 --> 00:03:14.240
<v Speaker 2>just needing a few examples or even none. The main

66
00:03:14.280 --> 00:03:18.199
<v Speaker 2>advantages are, well, first, that performance. It leads to really

67
00:03:18.199 --> 00:03:22.000
<v Speaker 2>big productivity gains. Think of them like a super efficient

68
00:03:22.039 --> 00:03:24.479
<v Speaker 2>assistant for tasks that usually take a lot of time:

69
00:03:25.039 --> 00:03:29.039
<v Speaker 2>customer service, processing data. They can speed things up dramatically.

70
00:03:28.719 --> 00:03:30.039
<v Speaker 1>A turbocharger for the team.

71
00:03:30.120 --> 00:03:33.360
<v Speaker 2>Yeah, pretty much. But this is important: they have limitations.

72
00:03:33.479 --> 00:03:35.159
<v Speaker 2>The big one is hallucination.

73
00:03:35.800 --> 00:03:37.400
<v Speaker 1>Ah, I've heard about this.

74
00:03:37.520 --> 00:03:41.080
<v Speaker 2>It's when the LLM generates stuff that sounds totally plausible,

75
00:03:41.199 --> 00:03:45.360
<v Speaker 2>really confident, but it's just not factually accurate or maybe

76
00:03:45.360 --> 00:03:46.280
<v Speaker 2>even completely made up.

77
00:03:46.360 --> 00:03:50.000
<v Speaker 1>So it's not lying, just pattern matching gone wrong?

78
00:03:50.120 --> 00:03:53.199
<v Speaker 2>Exactly. It's confidently producing text that fits a pattern even if

79
00:03:53.240 --> 00:03:56.759
<v Speaker 2>reality doesn't match. That's why human oversight is absolutely crucial.

80
00:03:57.360 --> 00:03:59.360
<v Speaker 2>We need to ground them, which we can talk about.

81
00:04:00.039 --> 00:04:03.199
<v Speaker 2>Another limit is the context window. It's basically the model's

82
00:04:03.199 --> 00:04:05.759
<v Speaker 2>short term memory, how much info it can juggle at once.

83
00:04:06.199 --> 00:04:08.199
<v Speaker 2>You know, some big models like GPT-4o

84
00:04:08.280 --> 00:04:11.759
<v Speaker 2>can handle say one hundred and twenty eight thousand input tokens,

85
00:04:11.800 --> 00:04:14.159
<v Speaker 2>which is huge, but there's still a limit. Feed it

86
00:04:14.199 --> 00:04:17.040
<v Speaker 2>too much and it just can't process it all simultaneously.

87
00:04:17.360 --> 00:04:21.879
<v Speaker 1>Okay, so potential for errors, memory limits, but still incredibly powerful.

88
00:04:21.959 --> 00:04:25.639
<v Speaker 1>So where are we seeing these foundation models really making

89
00:04:25.680 --> 00:04:28.279
<v Speaker 1>a difference in the real world despite those caveats.

90
00:04:28.439 --> 00:04:31.519
<v Speaker 2>Well, the flexibility is just amazing. It's touching almost every industry.

91
00:04:31.800 --> 00:04:34.800
<v Speaker 2>In content creation, for example, they're not just writing generic stuff.

92
00:04:35.120 --> 00:04:39.839
<v Speaker 2>They can generate targeted marketing copy, blog posts, social media updates. Yeah,

93
00:04:39.879 --> 00:04:43.680
<v Speaker 2>speeding up content pipelines hugely. Faster content, okay. And in

94
00:04:43.720 --> 00:04:48.040
<v Speaker 2>customer support, handling tons of common questions automatically. That frees

95
00:04:48.120 --> 00:04:52.920
<v Speaker 2>up human agents for the really tricky, nuanced problems. Beyond that,

96
00:04:54.040 --> 00:04:56.560
<v Speaker 2>text summarization is a big one, getting the gist of

97
00:04:56.600 --> 00:05:02.480
<v Speaker 2>long documents quickly, powering sophisticated chatbots, virtual assistants for personalized help,

98
00:05:02.759 --> 00:05:07.000
<v Speaker 2>even creative writing assistance, you know, brainstorming plots or dialogue.

99
00:05:07.199 --> 00:05:09.680
<v Speaker 1>Interesting. What about more specialized fields.

100
00:05:09.720 --> 00:05:12.399
<v Speaker 2>Yeah, definitely making inroads. In healthcare, for instance, they can

101
00:05:12.439 --> 00:05:16.360
<v Speaker 2>help analyze initial patient info, maybe symptoms alongside some images.

102
00:05:16.759 --> 00:05:19.000
<v Speaker 2>But and this is critical, they are not built to

103
00:05:19.079 --> 00:05:23.319
<v Speaker 2>interpret specialized medical scans or give medical advice. That needs a professional.

104
00:05:23.160 --> 00:05:24.920
<v Speaker 1>Very important distinction.

105
00:05:25.160 --> 00:05:27.839
<v Speaker 2>Absolutely. Yeah, And we also see them in cybersecurity for

106
00:05:27.879 --> 00:05:32.079
<v Speaker 2>analyzing potential threats, in language learning apps, creating accessibility tools

107
00:05:32.160 --> 00:05:35.120
<v Speaker 2>like audio descriptions for videos. The list just keeps growing.

108
00:05:35.199 --> 00:05:38.759
<v Speaker 1>Wow, that's a massive range from marketing copy to analyzing

109
00:05:38.800 --> 00:05:41.079
<v Speaker 1>medical info sort of. Okay, now let's pivot. This is

110
00:05:41.079 --> 00:05:43.839
<v Speaker 1>where it gets really interesting for businesses. Right, how does

111
00:05:43.879 --> 00:05:46.600
<v Speaker 1>Microsoft's Azure OpenAI fit in? We hear about this

112
00:05:46.639 --> 00:05:47.560
<v Speaker 1>big partnership.

113
00:05:47.759 --> 00:05:52.079
<v Speaker 2>You're right, that partnership is central. Azure OpenAI Service or AOAI,

114
00:05:52.519 --> 00:05:55.360
<v Speaker 2>is Microsoft's way of bringing these powerful OpenAI models

115
00:05:55.399 --> 00:05:58.480
<v Speaker 2>into the enterprise world, but with a heavy focus on security,

116
00:05:58.600 --> 00:06:03.560
<v Speaker 2>compliance, and manageability. It gives you secure REST API

117
00:06:03.720 --> 00:06:06.879
<v Speaker 2>access to all the big OpenAI models: GPT-4 Turbo,

118
00:06:06.959 --> 00:06:10.000
<v Speaker 2>the new GPT-4o, GPT-4o mini,

119
00:06:10.079 --> 00:06:13.040
<v Speaker 2>GPT-3.5 Turbo for text tasks, Whisper for audio,

120
00:06:13.120 --> 00:06:15.439
<v Speaker 2>DALL-E 3 for images, and the embedding models.

121
00:06:15.480 --> 00:06:18.720
<v Speaker 1>So the models everyone's talking about, but packaged for business.

122
00:06:19.000 --> 00:06:21.199
<v Speaker 2>Exactly. But the key difference with Azure OpenAI is the

123
00:06:21.319 --> 00:06:24.360
<v Speaker 2>enterprise grade stuff that's only available on Azure. We're talking

124
00:06:24.519 --> 00:06:27.959
<v Speaker 2>robust security controls, private networking options so your data doesn't

125
00:06:28.000 --> 00:06:33.040
<v Speaker 2>touch the public Internet, meeting strict compliance standards, broad geographic availability,

126
00:06:33.439 --> 00:06:36.879
<v Speaker 2>and, really important, built-in responsible AI content filtering.

127
00:06:36.920 --> 00:06:39.959
<v Speaker 1>Okay, those enterprise features sound critical. Can you quickly run

128
00:06:40.040 --> 00:06:41.240
<v Speaker 1>through the main model types again?

129
00:06:41.240 --> 00:06:43.800
<v Speaker 2>Sure. You've got the GPT-4 family. That's the

130
00:06:43.800 --> 00:06:46.360
<v Speaker 2>top tier, like GPT-4o, GPT-4o mini,

131
00:06:46.360 --> 00:06:49.480
<v Speaker 2>and Turbo. They have advanced reasoning, big context windows.

132
00:06:49.519 --> 00:06:51.759
<v Speaker 2>GPT-4o takes one hundred and twenty eight thousand input tokens,

133
00:06:51.759 --> 00:06:54.480
<v Speaker 2>which is huge, a whole book almost, pretty much. And

134
00:06:54.560 --> 00:06:56.920
<v Speaker 2>GPT-4o mini is interesting because it can output a

135
00:06:56.920 --> 00:06:59.720
<v Speaker 2>lot of tokens, up to sixteen thousand, great for longer

136
00:06:59.720 --> 00:07:03.600
<v Speaker 2>responses. Then there's GPT-3.5 Turbo, often

137
00:07:03.639 --> 00:07:06.519
<v Speaker 2>the go-to for being capable yet cost-effective, especially

138
00:07:06.519 --> 00:07:09.519
<v Speaker 2>for chats, and of course Whisper for audio, DALL-E 3

139
00:07:09.560 --> 00:07:12.360
<v Speaker 2>for images, and the embedding models which are essential for

140
00:07:12.399 --> 00:07:14.920
<v Speaker 2>any kind of smart search or understanding meaning.

141
00:07:14.959 --> 00:07:17.279
<v Speaker 1>And who gets access? Can any business just sign up?

142
00:07:17.199 --> 00:07:21.160
<v Speaker 2>Right now, access is mostly for enterprise customers and partners.

143
00:07:21.439 --> 00:07:25.319
<v Speaker 2>You typically apply using your company email. It's a deliberate

144
00:07:25.360 --> 00:07:28.639
<v Speaker 2>approach, really. Microsoft wants to ensure these powerful tools are

145
00:07:28.639 --> 00:07:31.959
<v Speaker 2>deployed responsibly and securely in business settings with the right

146
00:07:31.959 --> 00:07:34.040
<v Speaker 2>support and governance structures in place.

147
00:07:34.160 --> 00:07:37.560
<v Speaker 1>Makes sense for managing something this powerful. Okay, let's go deeper now,

148
00:07:37.600 --> 00:07:40.680
<v Speaker 1>into some of the more advanced capabilities that really unlock new potential.

149
00:07:41.199 --> 00:07:43.839
<v Speaker 1>Tell us about those embedding models in Azure OpenAI.

150
00:07:43.879 --> 00:07:44.600
<v Speaker 1>What do they let you do?

151
00:07:44.959 --> 00:07:49.839
<v Speaker 2>Right. Embeddings, they are absolutely fundamental for what we call

152
00:07:49.920 --> 00:07:53.639
<v Speaker 2>semantic understanding and similarity searches. Instead of just matching keywords

153
00:07:53.639 --> 00:07:57.120
<v Speaker 2>like finding "car" when someone types "car," embeddings capture the meaning.

154
00:07:57.439 --> 00:07:59.720
<v Speaker 2>So if you search for "fast car," it can find

155
00:07:59.720 --> 00:08:02.000
<v Speaker 2>documents talking about "rapid automobiles" because it

156
00:08:02.079 --> 00:08:03.720
<v Speaker 2>understands those concepts are similar.

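NOTE
A minimal sketch of the "fast car" versus "rapid automobiles" idea above, assuming made-up
3-dimensional vectors in place of real embeddings. The names TOY_EMBEDDINGS and
rank_by_similarity are hypothetical; the point is only that similarity is measured between
vectors (here with cosine similarity), not between keywords.
```python
import math
#
def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
#
# Hypothetical toy embeddings; real ones have hundreds or thousands of dimensions.
TOY_EMBEDDINGS = {
    "fast car": [0.9, 0.8, 0.1],
    "rapid automobile": [0.85, 0.75, 0.2],
    "banana bread recipe": [0.1, 0.2, 0.95],
}
#
def rank_by_similarity(query, embeddings):
    """Return the other phrases, most similar to the query first."""
    query_vec = embeddings[query]
    others = [name for name in embeddings if name != query]
    return sorted(
        others,
        key=lambda name: cosine_similarity(query_vec, embeddings[name]),
        reverse=True,
    )
```
Calling rank_by_similarity("fast car", TOY_EMBEDDINGS) puts "rapid automobile" first even
though the two phrases share no words.
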
157
00:08:03.839 --> 00:08:06.000
<v Speaker 1>Much smarter search, then.

158
00:08:05.639 --> 00:08:09.000
<v Speaker 2>Exactly, much more relevant results. Now there are older versions like

159
00:08:09.079 --> 00:08:11.920
<v Speaker 2>ada-002, but the newer ones,

160
00:08:11.920 --> 00:08:15.240
<v Speaker 2>text-embedding-3-small and text-embedding-3-large, are, well,

161
00:08:15.279 --> 00:08:18.720
<v Speaker 2>they're significantly better. text-embedding-3-small is much more

162
00:08:18.759 --> 00:08:22.199
<v Speaker 2>cost effective and shows big performance jumps, especially for

163
00:08:22.279 --> 00:08:25.240
<v Speaker 2>multilingual stuff. text-embedding-3-large is the top performer

164
00:08:25.319 --> 00:08:26.920
<v Speaker 2>overall for accuracy.

165
00:08:26.480 --> 00:08:27.800
<v Speaker 1>Better and cheaper. Nice.

166
00:08:28.000 --> 00:08:29.959
<v Speaker 2>And here's a really cool thing about these new models,

167
00:08:30.000 --> 00:08:33.840
<v Speaker 2>a real aha moment. They use something called Matryoshka representation

168
00:08:34.000 --> 00:08:35.440
<v Speaker 2>learning.

169
00:08:36.519 --> 00:08:38.519
<v Speaker 1>Like the Russian dolls?

170
00:08:38.799 --> 00:08:42.000
<v Speaker 2>Exactly. It means you can actually shorten the embeddings, literally chop

171
00:08:42.039 --> 00:08:44.919
<v Speaker 2>off numbers from the end of the sequence without them

172
00:08:44.960 --> 00:08:48.840
<v Speaker 2>losing their core meaning. This is huge because shorter embeddings

173
00:08:48.879 --> 00:08:54.000
<v Speaker 2>mean less storage, faster searches, lower costs, often while keeping

174
00:08:54.399 --> 00:08:57.799
<v Speaker 2>or even improving performance compared to older, longer embeddings. It's

175
00:08:57.840 --> 00:08:58.759
<v Speaker 2>incredibly efficient.

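NOTE
A minimal sketch of "chopping numbers off the end" of an embedding. The helper name
shorten_embedding is hypothetical, and rescaling the shortened vector back to unit length
is an assumption on my part (it keeps cosine or dot-product comparisons well behaved); the
episode itself only describes the truncation.
```python
import math
#
def shorten_embedding(vector, dims):
    """Keep the first `dims` values, then rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        # An all-zero prefix cannot be rescaled; return it unchanged.
        return head
    return [x / norm for x in head]
```
For example, shorten_embedding([3.0, 4.0, 12.0], 2) keeps the first two values and returns
a unit-length vector of about [0.6, 0.8].
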
176
00:08:58.879 --> 00:09:01.799
<v Speaker 1>That's amazing, trimming the fat without losing the substance. Yeah.

177
00:09:02.039 --> 00:09:05.399
<v Speaker 1>So you create these smart embeddings, where do you put them?

178
00:09:05.480 --> 00:09:08.200
<v Speaker 1>Why are Azure vector databases important here?

179
00:09:08.399 --> 00:09:11.759
<v Speaker 2>Good question. You need a special kind of database optimized

180
00:09:11.879 --> 00:09:14.679
<v Speaker 2>for storing and searching these high dimensional vectors. That's where

181
00:09:14.679 --> 00:09:17.360
<v Speaker 2>Azure vector databases come in. Their whole point is to

182
00:09:17.600 --> 00:09:21.679
<v Speaker 2>enable really fast, really precise similarity searches based on that

183
00:09:21.679 --> 00:09:25.080
<v Speaker 2>semantic meaning we talked about, not just keyword matching, finding

184
00:09:25.120 --> 00:09:27.879
<v Speaker 2>related concepts instantly across huge data sets.

185
00:09:27.960 --> 00:09:29.200
<v Speaker 1>And Azure has options for this.

186
00:09:29.360 --> 00:09:31.879
<v Speaker 2>Oh yes, Azure AI Search is a big one. Interestingly,

187
00:09:31.960 --> 00:09:35.639
<v Speaker 2>OpenAI actually uses Azure AI Search for vector capabilities

188
00:09:35.639 --> 00:09:36.879
<v Speaker 2>in ChatGPT itself.

189
00:09:36.960 --> 00:09:37.240
<v Speaker 1>Wow.

190
00:09:37.399 --> 00:09:41.440
<v Speaker 2>Yeah. And there's also Azure Cosmos DB with vector capabilities,

191
00:09:41.679 --> 00:09:44.799
<v Speaker 2>Azure Managed Redis, and even PostgreSQL with the pgvector

192
00:09:45.000 --> 00:09:48.279
<v Speaker 2>extension. Lots of choices depending on your needs, all

193
00:09:48.279 --> 00:09:50.759
<v Speaker 2>designed for handling these complex numerical vectors.

194
00:09:50.879 --> 00:09:54.320
<v Speaker 1>Okay. Earlier you mentioned that limitation, hallucination, where LLMs can

195
00:09:54.320 --> 00:09:57.639
<v Speaker 1>make things up. How does retrieval-augmented generation, or RAG,

196
00:09:58.200 --> 00:09:58.879
<v Speaker 1>help fix that?

197
00:09:59.120 --> 00:10:03.039
<v Speaker 2>Right. RAG is a direct answer to the hallucination problem. It works

198
00:10:03.080 --> 00:10:04.360
<v Speaker 2>by grounding the model.

199
00:10:04.559 --> 00:10:07.000
<v Speaker 1>Grounding it, like keeping its feet on the ground?

200
00:10:07.399 --> 00:10:11.320
<v Speaker 2>Pretty much. It connects the LLM's internal knowledge with real-world, verified information,

201
00:10:12.000 --> 00:10:14.679
<v Speaker 2>usually from an external source. Think of it like giving

202
00:10:14.679 --> 00:10:17.159
<v Speaker 2>the model a factual reference library to check before it

203
00:10:17.200 --> 00:10:19.759
<v Speaker 2>generates an answer, keeps it rooted in reality.

204
00:10:19.960 --> 00:10:21.480
<v Speaker 1>How does that work in practice?

205
00:10:22.080 --> 00:10:26.759
<v Speaker 2>So the process is quite elegant. A user asks a question. First,

206
00:10:26.879 --> 00:10:30.200
<v Speaker 2>the system retrieves relevant information from an external knowledge base,

207
00:10:30.240 --> 00:10:33.559
<v Speaker 2>typically one of those vector databases we just discussed. Then

208
00:10:33.679 --> 00:10:36.519
<v Speaker 2>the LLM gets both the original question and this retrieved

209
00:10:36.519 --> 00:10:40.600
<v Speaker 2>factual context. It uses both pieces to generate the final response.

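NOTE
The retrieve-then-generate flow just described can be sketched roughly as below. This is an
illustration under stated assumptions, not a real pipeline: retrieval is plain word overlap
standing in for a vector database, KNOWLEDGE_BASE is made-up sample data, and the final
model call is left out. All function names here are hypothetical.
```python
# Made-up sample passages standing in for an external knowledge base.
KNOWLEDGE_BASE = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
#
def retrieve(question, knowledge_base, top_k=2):
    """Rank passages by word overlap with the question (stand-in for vector search)."""
    question_words = set(question.lower().split())
    scored = [
        (len(question_words & set(passage.lower().split())), passage)
        for passage in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:top_k] if score > 0]
#
def build_grounded_prompt(question, passages):
    """Combine the retrieved context and the original question into one prompt."""
    context = "\n".join("- " + passage for passage in passages)
    return (
        "Answer using only the context below.\nContext:\n"
        + context
        + "\nQuestion: "
        + question
    )
```
The string returned by build_grounded_prompt is what would be sent to the model, so the
answer is generated from both the question and the retrieved passages.
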
210
00:10:41.080 --> 00:10:44.759
<v Speaker 1>Ah, so it's using verified info to guide its answer.

211
00:10:44.799 --> 00:10:48.759
<v Speaker 2>Precisely. The benefits are huge. Much better accuracy because it's using facts,

212
00:10:49.279 --> 00:10:53.039
<v Speaker 2>richer context than just its training data, more flexibility because

213
00:10:53.080 --> 00:10:55.799
<v Speaker 2>you can update the knowledge base without retraining the whole model,

214
00:10:56.039 --> 00:10:57.000
<v Speaker 2>and it scales well.

215
00:10:57.440 --> 00:10:59.720
<v Speaker 1>Sounds great. Are there downsides?

216
00:11:00.080 --> 00:11:00.679
<v Speaker 2>Or challenges?

217
00:11:00.759 --> 00:11:01.039
<v Speaker 1>Yeah.

218
00:11:01.200 --> 00:11:04.320
<v Speaker 2>Getting the document segmentation right for the retrieval step is tricky.

219
00:11:04.679 --> 00:11:07.440
<v Speaker 2>Making sure the retrieved info is genuinely relevant can be hard,

220
00:11:07.960 --> 00:11:10.639
<v Speaker 2>and setting up the whole RAG pipeline can be complex

221
00:11:10.679 --> 00:11:11.639
<v Speaker 2>and resource intensive.

222
00:11:11.720 --> 00:11:15.879
<v Speaker 1>Okay, makes sense. Moving beyond just text, what about models

223
00:11:15.879 --> 00:11:18.600
<v Speaker 1>that understand images too? Tell us about Azure OpenAI's

224
00:11:18.679 --> 00:11:20.759
<v Speaker 1>multimodal stuff, especially GPT-4o.

225
00:11:20.919 --> 00:11:24.559
<v Speaker 2>Yeah, multimodal is a really exciting frontier. Models like GPT-4o

226
00:11:24.600 --> 00:11:27.720
<v Speaker 2>can process and understand both text and images

227
00:11:27.759 --> 00:11:29.559
<v Speaker 2>together in the same input.

228
00:11:29.480 --> 00:11:31.519
<v Speaker 1>So you can show it a picture and ask questions about it?

229
00:11:31.240 --> 00:11:34.320
<v Speaker 2>Exactly. This opens up tons of practical uses:

230
00:11:34.600 --> 00:11:39.440
<v Speaker 2>automatically generating detailed captions for images, visual question answering, asking

231
00:11:39.679 --> 00:11:42.679
<v Speaker 2>"What color is the car in this picture?", content moderation

232
00:11:42.759 --> 00:11:46.360
<v Speaker 2>for visual stuff, in e-commerce maybe generating product descriptions

233
00:11:46.399 --> 00:11:49.320
<v Speaker 2>just from photos, and as we touched on, even assisting

234
00:11:49.360 --> 00:11:53.679
<v Speaker 2>with initial medical diagnostics by looking at symptoms and related images.

235
00:11:54.679 --> 00:11:58.960
<v Speaker 2>But again with that crucial caveat, not for interpreting specialized

236
00:11:59.000 --> 00:12:00.039
<v Speaker 2>scans or giving advice.

237
00:11:59.799 --> 00:12:03.440
<v Speaker 1>Right, always the caveat. Are there things it struggles

238
00:12:03.440 --> 00:12:05.879
<v Speaker 1>with visually?

239
00:12:06.120 --> 00:12:08.480
<v Speaker 2>Definitely limitations. It might not perform as well with non-Latin alphabets

240
00:12:08.519 --> 00:12:12.799
<v Speaker 2>in images, or with very small or rotated text. Sometimes precise

241
00:12:12.840 --> 00:12:15.600
<v Speaker 2>spatial reasoning, like "Is the blue box exactly to the

242
00:12:15.679 --> 00:12:18.080
<v Speaker 2>left of the red sphere?", can be tricky for it.

243
00:12:18.039 --> 00:12:21.399
<v Speaker 1>Still a massive leap. Now, how do these models

244
00:12:21.440 --> 00:12:24.720
<v Speaker 1>actually do things in the real world, interact with other systems?

245
00:12:24.720 --> 00:12:26.039
<v Speaker 1>How does function calling work?

246
00:12:26.200 --> 00:12:28.559
<v Speaker 2>Function calling is super interesting. The key thing to get

247
00:12:28.600 --> 00:12:30.600
<v Speaker 2>is the model itself doesn't run the function.

248
00:12:30.759 --> 00:12:31.720
<v Speaker 1>It doesn't? Then what does it do?

249
00:12:32.120 --> 00:12:34.879
<v Speaker 2>It intelligently figures out if an external tool or function

250
00:12:35.000 --> 00:12:39.159
<v Speaker 2>is needed to answer the user's request. If it decides yes,

251
00:12:39.600 --> 00:12:42.559
<v Speaker 2>it then generates the parameters or arguments that function needs.

252
00:12:43.159 --> 00:12:45.279
<v Speaker 2>So the flow is like this: the model thinks a function

253
00:12:45.320 --> 00:12:48.799
<v Speaker 2>call would help. The API response tells your application, Hey,

254
00:12:48.919 --> 00:12:50.759
<v Speaker 2>call this function with these arguments.

255
00:12:51.080 --> 00:12:53.279
<v Speaker 1>So my app does the actual work?

256
00:12:53.440 --> 00:12:57.039
<v Speaker 2>Exactly. Your application takes those parameters, runs the function. Maybe it

257
00:12:57.120 --> 00:13:00.000
<v Speaker 2>queries a database, calls another API, sends an email, whatever.

258
00:13:00.480 --> 00:13:03.200
<v Speaker 2>Then your app sends the result of that function call

259
00:13:03.279 --> 00:13:06.360
<v Speaker 2>back to the LLM. The LLM then uses that

260
00:13:06.360 --> 00:13:10.120
<v Speaker 2>real-world result to formulate its final, informed answer to the user.

261
00:13:10.440 --> 00:13:12.559
<v Speaker 2>It's a really dynamic way to connect the AI to

262
00:13:12.639 --> 00:13:13.519
<v Speaker 2>external systems.

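NOTE
A rough sketch of the application side of the function-calling flow just described. It makes
no real API call: simulated_call is a hand-written stand-in for what a model response might
contain (a function name plus JSON-encoded arguments), and get_order_status is a hypothetical
application function. The shape of the dictionary is an assumption for illustration.
```python
import json
#
def get_order_status(order_id):
    """Hypothetical application function; a real one might query a database."""
    return {"order_id": order_id, "status": "shipped"}
#
# Functions the application is willing to run on the model's behalf.
AVAILABLE_FUNCTIONS = {"get_order_status": get_order_status}
#
def run_requested_function(function_call):
    """Run the function the model asked for and return its result as a JSON string."""
    func = AVAILABLE_FUNCTIONS[function_call["name"]]
    arguments = json.loads(function_call["arguments"])
    return json.dumps(func(**arguments))
#
# Simulated model output: the model only names the function and supplies arguments.
simulated_call = {"name": "get_order_status", "arguments": '{"order_id": "A123"}'}
# The application does the actual work; this string is what gets sent back to the model.
result_for_model = run_requested_function(simulated_call)
```
Looking functions up in an explicit AVAILABLE_FUNCTIONS table means the application, not the
model, decides what is allowed to run.
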
263
00:13:13.759 --> 00:13:16.720
<v Speaker 1>Got it. And building on that interaction idea, what's the

264
00:13:16.799 --> 00:13:19.759
<v Speaker 1>Assistants API? Sounds like you can build more complex agents.

265
00:13:19.879 --> 00:13:23.879
<v Speaker 2>Precisely. The Azure OpenAI Assistants API is designed specifically

266
00:13:23.919 --> 00:13:27.919
<v Speaker 2>for building these more sophisticated stateful AI assistants, tailored to

267
00:13:28.080 --> 00:13:31.240
<v Speaker 2>particular jobs. It comes with some really powerful built in tools.

268
00:13:31.279 --> 00:13:34.080
<v Speaker 2>One is a code interpreter. This lets the assistant write

269
00:13:34.120 --> 00:13:36.919
<v Speaker 2>and run Python code securely in a sandboxed environment.

270
00:13:37.080 --> 00:13:39.159
<v Speaker 1>Python code? What for?

271
00:13:39.000 --> 00:13:43.759
<v Speaker 2>All sorts of things: performing complex calculations, analyzing data directly from

272
00:13:43.840 --> 00:13:48.279
<v Speaker 2>uploaded files like CSVs, even generating charts or processing files.

273
00:13:48.600 --> 00:13:52.200
<v Speaker 2>It's incredibly powerful for data tasks. Another key tool is

274
00:13:52.240 --> 00:13:55.600
<v Speaker 2>file search. This allows the assistant to access and retrieve

275
00:13:55.639 --> 00:13:57.720
<v Speaker 2>information from documents you provide to it.

276
00:13:57.879 --> 00:14:01.279
<v Speaker 1>Ah, like a private knowledge base for the assistant?

277
00:14:01.679 --> 00:14:04.559
<v Speaker 2>Exactly. It acts as an external knowledge source, letting the assistant

278
00:14:04.600 --> 00:14:08.080
<v Speaker 2>answer questions using your specific, up-to-date information, going

279
00:14:08.159 --> 00:14:11.679
<v Speaker 2>way beyond its original training. It uses vector embeddings under the hood for this.

280
00:14:11.679 --> 00:14:13.960
<v Speaker 1>And function calling is part of this too?

281
00:14:14.320 --> 00:14:17.159
<v Speaker 2>Yep, function calling is integrated right into the Assistants API

282
00:14:17.240 --> 00:14:20.159
<v Speaker 2>as well, so your assistant can use those external tools seamlessly.

283
00:14:20.720 --> 00:14:25.720
<v Speaker 1>Okay, so Assistants API for interactive smart agents. What if

284
00:14:25.720 --> 00:14:28.080
<v Speaker 1>you just need to process a ton of stuff and

285
00:14:28.120 --> 00:14:31.159
<v Speaker 1>you don't need instant answers, like batch processing?

286
00:14:31.480 --> 00:14:34.360
<v Speaker 2>That's exactly where the Batch API comes in. It's designed

287
00:14:34.399 --> 00:14:37.919
<v Speaker 2>for asynchronous, non real time processing jobs where you can

288
00:14:37.960 --> 00:14:40.480
<v Speaker 2>wait a bit for the results. You basically bundle up

289
00:14:40.480 --> 00:14:43.200
<v Speaker 2>a whole load of requests into a single file, submit it,

290
00:14:43.399 --> 00:14:44.919
<v Speaker 2>and Azure processes it in bulk.

291
00:14:45.279 --> 00:14:47.279
<v Speaker 1>What are the advantages of doing it that way?

292
00:14:47.360 --> 00:14:51.399
<v Speaker 2>Two main things: cost and quota. You typically see a

293
00:14:51.440 --> 00:14:54.960
<v Speaker 2>significant cost reduction, often around fifty percent compared to making

294
00:14:54.960 --> 00:14:57.799
<v Speaker 2>all those calls individually to the standard real time endpoints.

295
00:14:58.039 --> 00:15:01.360
<v Speaker 2>Plus you get a dedicated quota for batch processing, separate

296
00:15:01.360 --> 00:15:05.360
<v Speaker 2>from your interactive traffic. Azure guarantees completion within twenty four hours,

297
00:15:05.360 --> 00:15:08.480
<v Speaker 2>though usually it's much much faster. Perfect for large scale

298
00:15:08.519 --> 00:15:11.399
<v Speaker 2>content generation, data cleansing, summarization tasks, things like that.

299
00:15:11.799 --> 00:15:15.279
<v Speaker 1>Fifty percent cost reduction is pretty compelling. Okay,

300
00:15:15.360 --> 00:15:18.960
<v Speaker 1>let's switch gears slightly. Fine tuning. This comes up a lot,

301
00:15:19.000 --> 00:15:21.440
<v Speaker 1>but it raises a big question. When do you actually

302
00:15:21.480 --> 00:15:24.440
<v Speaker 1>need to fine tune a model? Especially with powerful things

303
00:15:24.519 --> 00:15:26.799
<v Speaker 1>like prompt engineering and RAG available?

304
00:15:26.519 --> 00:15:30.679
<v Speaker 2>That is a really critical strategic question. Fine tuning

305
00:15:30.759 --> 00:15:33.919
<v Speaker 2>is different. It means taking an existing pre-trained LLM

306
00:15:34.159 --> 00:15:37.960
<v Speaker 2>and actually retraining it, adapting it using your own specific

307
00:15:38.120 --> 00:15:41.480
<v Speaker 2>curated example data. It's a supervised learning process. You show

308
00:15:41.480 --> 00:15:45.360
<v Speaker 2>the model examples: given this input, produce this exact output.

309
00:15:45.440 --> 00:15:47.919
<v Speaker 2>You're teaching it a very specific behavior or style.

310
00:15:48.000 --> 00:15:51.240
<v Speaker 1>Okay, so you're modifying the model itself. What are the benefits?

311
00:15:51.600 --> 00:15:54.919
<v Speaker 2>Well, you can potentially get much higher quality responses for

312
00:15:55.080 --> 00:15:58.559
<v Speaker 2>very specific niche tasks. You can effectively train it on

313
00:15:58.639 --> 00:16:01.799
<v Speaker 2>more data than fits in the standard context window because

314
00:16:01.840 --> 00:16:05.480
<v Speaker 2>the knowledge gets baked into the model weights, and sometimes

315
00:16:05.519 --> 00:16:08.360
<v Speaker 2>it can lead to using fewer tokens in your prompts later,

316
00:16:08.799 --> 00:16:09.679
<v Speaker 2>saving costs.

317
00:16:10.440 --> 00:16:12.919
<v Speaker 1>So when is it the right call over just better

318
00:16:12.960 --> 00:16:14.159
<v Speaker 1>prompting or RAG?

319
00:16:14.600 --> 00:16:17.480
<v Speaker 2>You should really only consider fine tuning when prompt engineering

320
00:16:17.519 --> 00:16:19.919
<v Speaker 2>and RAG aren't getting you the consistent quality or

321
00:16:19.960 --> 00:16:23.279
<v Speaker 2>accuracy you need for a specific problem. It's best when

322
00:16:23.320 --> 00:16:25.919
<v Speaker 2>you have a unique domain or a very specific data

323
00:16:26.000 --> 00:16:29.279
<v Speaker 2>set that's well prepared and high quality, and crucially, you

324
00:16:29.360 --> 00:16:32.279
<v Speaker 2>need clear goals and ways to measure if the fine

325
00:16:32.279 --> 00:16:35.480
<v Speaker 2>tuning actually worked, like quantitative metrics. How much data do

326
00:16:35.480 --> 00:16:38.039
<v Speaker 2>you need? That's a key point. While technically you might

327
00:16:38.080 --> 00:16:41.120
<v Speaker 2>start with just like ten examples to get any real benefit,

328
00:16:41.200 --> 00:16:44.360
<v Speaker 2>to really shift the model's behavior, you usually need hundreds or

329
00:16:44.360 --> 00:16:46.480
<v Speaker 2>more likely thousands of high quality examples.

330
00:16:46.639 --> 00:16:48.679
<v Speaker 1>Thousands? Okay, that's a commitment.

331
00:16:48.480 --> 00:16:52.879
<v Speaker 2>It is, and importantly, low quality or inconsistent examples can

332
00:16:52.919 --> 00:16:56.120
<v Speaker 2>actually hurt the model's performance, making it worse, so data

333
00:16:56.200 --> 00:17:00.600
<v Speaker 2>quality is paramount. The process involves preparing that data carefully,

334
00:17:00.919 --> 00:17:05.039
<v Speaker 2>running the training job, and then rigorously evaluating both safety

335
00:17:05.079 --> 00:17:06.440
<v Speaker 2>and performance before deploying.

336
00:17:06.559 --> 00:17:09.400
<v Speaker 1>Okay, let's circle back to interacting with the model. Prompt

337
00:17:09.400 --> 00:17:12.240
<v Speaker 1>engineering you mentioned it's powerful. It feels like a real

338
00:17:12.359 --> 00:17:15.319
<v Speaker 1>art form almost. It's not just asking a basic question, is it?

339
00:17:15.240 --> 00:17:18.519
<v Speaker 2>Not at all. It's absolutely critical. Prompt engineering is

340
00:17:20.119 --> 00:17:22.839
<v Speaker 2>basically the art and science of crafting your input, your

341
00:17:22.880 --> 00:17:25.880
<v Speaker 2>prompt, to guide the LLM towards the specific kind of

342
00:17:25.880 --> 00:17:28.720
<v Speaker 2>output you want without changing the model itself. It's all

343
00:17:28.720 --> 00:17:31.559
<v Speaker 2>about how you communicate your request to the AI. Think

344
00:17:31.599 --> 00:17:34.440
<v Speaker 2>of a really good prompt as having several key ingredients.

345
00:17:34.640 --> 00:17:36.799
<v Speaker 1>Ingredients? Like a recipe?

346
00:17:36.559 --> 00:17:39.960
<v Speaker 2>Kind of. First, you need context, like: imagine you're a travel agent.

347
00:17:40.400 --> 00:17:44.440
<v Speaker 2>Then clear instructions: write a three-day itinerary. Add constraints:

348
00:17:45.160 --> 00:17:49.400
<v Speaker 2>focusing on budget friendly options. You might include variables or

349
00:17:49.440 --> 00:17:53.400
<v Speaker 2>specific inputs: mention the Eiffel Tower and the Louvre. Specify

350
00:17:53.400 --> 00:17:56.960
<v Speaker 2>the desired output format: provide the answer as a bulleted list.

351
00:17:57.440 --> 00:18:00.200
<v Speaker 2>And maybe set the tone or style: in an enthusiastic

352
00:18:00.240 --> 00:18:01.079
<v Speaker 2>and friendly tone.

353
00:18:01.119 --> 00:18:02.759
<v Speaker 1>Wow, okay, that's quite detailed.

354
00:18:02.839 --> 00:18:05.640
<v Speaker 2>It can be, And one really powerful element is providing

355
00:18:05.720 --> 00:18:08.680
<v Speaker 2>examples or templates, like here's an example of a good

356
00:18:08.720 --> 00:18:13.519
<v Speaker 2>itinerary item: Day one, morning: visit Notre-Dame Cathedral, free entry.

357
00:18:13.799 --> 00:18:17.000
<v Speaker 2>Now create the rest. Putting these elements together makes a

358
00:18:17.079 --> 00:18:20.799
<v Speaker 2>huge difference in getting tailored, useful responses instead of something

359
00:18:20.839 --> 00:18:21.680
<v Speaker 2>generic.

360
00:18:21.599 --> 00:18:25.079
<v Speaker 1>That makes total sense, layering the instructions. What about more advanced

361
00:18:25.119 --> 00:18:28.400
<v Speaker 1>strategies? You mentioned guiding the AI's thought process.

362
00:18:28.640 --> 00:18:32.160
<v Speaker 2>Yes, beyond just the structure, there are strategies. Always aim

363
00:18:32.240 --> 00:18:36.359
<v Speaker 2>for clear, unambiguous instructions. Asking the model to adopt a

364
00:18:36.400 --> 00:18:40.640
<v Speaker 2>specific persona helps. Using delimiters like triple quotes or XML

365
00:18:40.720 --> 00:18:45.119
<v Speaker 2>tags to separate instructions from content is good practice. Breaking

366
00:18:45.119 --> 00:18:48.480
<v Speaker 2>down complex tasks into steps for the model is effective,

367
00:18:48.920 --> 00:18:51.920
<v Speaker 2>and as I mentioned, providing examples is almost always beneficial.

368
00:18:52.200 --> 00:18:55.079
<v Speaker 2>One really important strategy is often called give the model

369
00:18:55.119 --> 00:18:55.680
<v Speaker 2>time to think.

370
00:18:55.839 --> 00:18:58.319
<v Speaker 1>Time to think. It's not actually thinking though, right.

371
00:18:58.240 --> 00:19:01.559
<v Speaker 2>Right, it's not conscious, yeah. Structuring the prompt to encourage

372
00:19:01.559 --> 00:19:04.200
<v Speaker 2>a step by step process often leads to better accuracy

373
00:19:04.200 --> 00:19:07.079
<v Speaker 2>on complex problems. Force it to outline its steps before

374
00:19:07.119 --> 00:19:09.480
<v Speaker 2>giving the final answer. It's like asking a person to

375
00:19:09.519 --> 00:19:11.440
<v Speaker 2>show their work in math. It reduces errors.

376
00:19:11.559 --> 00:19:15.400
<v Speaker 1>Oh, okay, show your work. What about specific named techniques?

377
00:19:15.720 --> 00:19:18.359
<v Speaker 2>So we have a kind of progression. Zero shot is

378
00:19:18.400 --> 00:19:22.440
<v Speaker 2>asking a question cold with no examples. One shot gives

379
00:19:22.440 --> 00:19:26.039
<v Speaker 2>one example. Few shot gives, well, a few examples. Adding

380
00:19:26.079 --> 00:19:29.839
<v Speaker 2>examples dramatically improves accuracy by showing the model the pattern

381
00:19:29.880 --> 00:19:32.240
<v Speaker 2>you want. Then there's chain of thought, or CoT.

382
00:19:32.599 --> 00:19:35.160
<v Speaker 2>This is where you explicitly ask the model to explain

383
00:19:35.200 --> 00:19:38.519
<v Speaker 2>its reasoning step by step before giving the final answer.

384
00:19:39.000 --> 00:19:41.799
<v Speaker 2>It forces that show your work process and really helps

385
00:19:41.799 --> 00:19:43.480
<v Speaker 2>with complex logic or math problems.

386
00:19:43.480 --> 00:19:45.559
<v Speaker 1>So you see its reasoning.

387
00:19:45.759 --> 00:19:48.519
<v Speaker 2>Exactly. Building on that is tree of thoughts, or ToT.

388
00:19:49.039 --> 00:19:51.920
<v Speaker 2>This is more advanced. It lets the model explore multiple

389
00:19:51.960 --> 00:19:55.279
<v Speaker 2>different reasoning paths like branches of a tree, evaluate them,

390
00:19:55.519 --> 00:19:58.039
<v Speaker 2>and then choose the best one. Great for complex planning

391
00:19:58.119 --> 00:19:59.359
<v Speaker 2>or exploring possibilities.

392
00:19:59.319 --> 00:20:01.160
<v Speaker 1>Okay, more complex now.

393
00:20:00.960 --> 00:20:04.039
<v Speaker 2>Yeah. A couple more interesting ones. Program-aided language models, or

394
00:20:04.200 --> 00:20:08.440
<v Speaker 2>PAL. This is fascinating. The LLM actually generates small

395
00:20:08.480 --> 00:20:11.720
<v Speaker 2>snippets of code, often Python, to help it solve a problem.

396
00:20:11.759 --> 00:20:13.160
<v Speaker 1>It writes code to help itself.

397
00:20:13.440 --> 00:20:16.880
<v Speaker 2>Yes, like if you ask a complex math question, it

398
00:20:16.960 --> 00:20:19.880
<v Speaker 2>might write and run Python code using an interpreter to

399
00:20:19.880 --> 00:20:22.720
<v Speaker 2>get the exact numerical answer rather than trying to estimate

400
00:20:22.720 --> 00:20:26.680
<v Speaker 2>it linguistically. Then there's ReAct. This technique lets the model

401
00:20:26.720 --> 00:20:29.799
<v Speaker 2>interleave reasoning steps with actions. It can decide it needs

402
00:20:29.799 --> 00:20:32.559
<v Speaker 2>more information, formulate a query to an external tool like

403
00:20:32.599 --> 00:20:34.680
<v Speaker 2>a search engine or database if you have function calling,

404
00:20:34.920 --> 00:20:37.599
<v Speaker 2>get the result, and then incorporate that into its reasoning

405
00:20:37.640 --> 00:20:38.279
<v Speaker 2>to continue.

406
00:20:38.319 --> 00:20:41.079
<v Speaker 1>So it can actively seek out information.

407
00:20:40.759 --> 00:20:44.480
<v Speaker 2>Yes, interact with tools to improve its response. And finally, reflection.

408
00:20:45.279 --> 00:20:47.359
<v Speaker 2>This allows an agent to look back at its past

409
00:20:47.400 --> 00:20:51.960
<v Speaker 2>actions and outcomes, receive feedback, often linguistic, and essentially learn

410
00:20:51.960 --> 00:20:54.960
<v Speaker 2>from its mistakes to improve its strategy over time. It

411
00:20:55.000 --> 00:20:56.240
<v Speaker 2>reflects on its own reasoning.

412
00:20:56.359 --> 00:20:59.839
<v Speaker 1>Wow, okay, lots of powerful techniques there. So if you

413
00:20:59.880 --> 00:21:03.119
<v Speaker 1>have all these prompting methods plus RAG plus fine tuning,

414
00:21:04.039 --> 00:21:07.160
<v Speaker 1>how do you decide which path to take? It seems complicated.

415
00:21:07.319 --> 00:21:10.039
<v Speaker 2>It really boils down to a few key factors. Your

416
00:21:10.079 --> 00:21:13.680
<v Speaker 2>specific goal, the resources you have, and your team's expertise.

417
00:21:14.319 --> 00:21:18.200
<v Speaker 2>Prompt engineering is almost always the starting point. It's low cost,

418
00:21:18.519 --> 00:21:21.599
<v Speaker 2>fast, accessible. You can often get great results just by

419
00:21:21.599 --> 00:21:24.039
<v Speaker 2>crafting better prompts using those techniques we discussed.

420
00:21:23.640 --> 00:21:26.640
<v Speaker 1>Start simple, iterate.

421
00:21:27.200 --> 00:21:30.440
<v Speaker 2>Exactly. If prompt engineering isn't enough, and especially if your application

422
00:21:30.559 --> 00:21:34.599
<v Speaker 2>needs access to external changing information or needs to avoid

423
00:21:34.599 --> 00:21:38.160
<v Speaker 2>hallucinations based on specific documents, then RAG is usually the

424
00:21:38.200 --> 00:21:41.319
<v Speaker 2>next step. It adds that grounding layer. Fine tuning is

425
00:21:41.359 --> 00:21:44.839
<v Speaker 2>generally the last resort. It's more expensive, needs significant data

426
00:21:44.880 --> 00:21:47.880
<v Speaker 2>and ML expertise. You'd only really go there if RAG

427
00:21:47.960 --> 00:21:50.400
<v Speaker 2>and prompting aren't hitting the mark for a very specific,

428
00:21:50.480 --> 00:21:53.119
<v Speaker 2>high value task where you need to deeply embed custom

429
00:21:53.200 --> 00:21:55.240
<v Speaker 2>knowledge or style into the model itself.

430
00:21:55.319 --> 00:21:58.079
<v Speaker 1>Okay, that clarifies the decision process. Now, switching to

431
00:21:58.119 --> 00:22:01.680
<v Speaker 1>deployment. For any business using this, security, responsible use, keeping

432
00:22:01.680 --> 00:22:04.920
<v Speaker 1>things running smoothly, these are absolutely critical. How does Azure

433
00:22:04.960 --> 00:22:06.920
<v Speaker 1>OpenAI handle that side of things?

434
00:22:07.279 --> 00:22:10.599
<v Speaker 2>Microsoft puts a huge emphasis on this. For compliance and

435
00:22:10.720 --> 00:22:14.759
<v Speaker 2>data privacy, the Azure OpenAI service meets strict standards

436
00:22:15.039 --> 00:22:19.119
<v Speaker 2>like SOC 1, 2, and 3. The really crucial privacy

437
00:22:19.160 --> 00:22:22.119
<v Speaker 2>point is that the models are stateless. They don't learn

438
00:22:22.160 --> 00:22:26.839
<v Speaker 2>from or remember your interactions. Your prompts, the completions generated,

439
00:22:26.920 --> 00:22:30.359
<v Speaker 2>any embeddings, any data you use for fine tuning, none

440
00:22:30.400 --> 00:22:32.400
<v Speaker 2>of it is shared with other customers. It's not sent

441
00:22:32.440 --> 00:22:35.319
<v Speaker 2>to OpenAI. Microsoft doesn't use it to improve their

442
00:22:35.319 --> 00:22:38.119
<v Speaker 2>base models. It's not shared with any third parties. Your

443
00:22:38.200 --> 00:22:39.160
<v Speaker 2>data stays yours.

444
00:22:39.359 --> 00:22:42.400
<v Speaker 1>That's a big reassurance for businesses. What about monitoring for

445
00:22:42.480 --> 00:22:43.079
<v Speaker 1>bad stuff?

446
00:22:43.240 --> 00:22:46.440
<v Speaker 2>Right. Microsoft does have real time abuse monitoring systems that

447
00:22:46.559 --> 00:22:50.279
<v Speaker 2>scan for harmful content generation. However, and this is important,

448
00:22:50.400 --> 00:22:52.920
<v Speaker 2>eligible customers can actually apply to opt out of this

449
00:22:53.039 --> 00:22:56.839
<v Speaker 2>monitoring. If approved, none of your prompt or completion data

450
00:22:56.880 --> 00:22:57.799
<v Speaker 2>is stored for that purpose.

451
00:22:57.880 --> 00:23:00.480
<v Speaker 1>Okay, control over monitoring. And content filtering?

452
00:23:00.759 --> 00:23:04.400
<v Speaker 2>Yes, there's built in content filtering that runs alongside the models.

453
00:23:04.640 --> 00:23:07.559
<v Speaker 2>It uses classification models to check prompts and outputs for

454
00:23:07.680 --> 00:23:11.640
<v Speaker 2>categories like hate speech, sexual content, violence, and self harm.

455
00:23:11.880 --> 00:23:16.480
<v Speaker 2>It operates on severity levels: safe, low, medium, high. The

456
00:23:16.559 --> 00:23:20.920
<v Speaker 2>default usually filters content rated medium or high severity. Businesses

457
00:23:20.960 --> 00:23:24.640
<v Speaker 2>can request customizations like only filtering high severity or even

458
00:23:24.680 --> 00:23:28.640
<v Speaker 2>disabling specific filters, but that typically requires justification and approval.

459
00:23:28.720 --> 00:23:33.000
<v Speaker 1>What about securing access? Not putting API keys in code?

460
00:23:32.920 --> 00:23:36.720
<v Speaker 2>Definitely best practice to avoid that. Azure uses managed identities.

461
00:23:37.039 --> 00:23:40.519
<v Speaker 2>This lets your Azure services authenticate securely to Azure OpenAI

462
00:23:40.880 --> 00:23:43.599
<v Speaker 2>without needing to embed keys directly in your application code.

463
00:23:43.680 --> 00:23:47.000
<v Speaker 2>Much safer, and for network security, you can use private endpoints.

464
00:23:47.400 --> 00:23:50.319
<v Speaker 2>This essentially connects the Azure OpenAI service directly to your

465
00:23:50.359 --> 00:23:54.559
<v Speaker 2>private Azure network, disabling public Internet access entirely for that resource.

466
00:23:55.000 --> 00:23:58.599
<v Speaker 2>All traffic stays within your secure boundary. And data encryption? Absolutely.

467
00:23:58.680 --> 00:24:02.160
<v Speaker 2>Data is encrypted both at rest and in transit. At rest,

468
00:24:02.319 --> 00:24:06.079
<v Speaker 2>it uses strong AES-256 encryption with Microsoft

469
00:24:06.119 --> 00:24:09.359
<v Speaker 2>managed keys by default. For extra control, you can bring

470
00:24:09.400 --> 00:24:13.279
<v Speaker 2>your own keys, BYOK, using Azure Key Vault. That's customer managed

471
00:24:13.359 --> 00:24:17.480
<v Speaker 2>keys, or CMK. In transit, all communication uses Transport Layer

472
00:24:17.519 --> 00:24:19.920
<v Speaker 2>Security, TLS 1.2 or higher.

473
00:24:20.039 --> 00:24:23.319
<v Speaker 1>Okay, very robust security layers. What about the responsible AI

474
00:24:23.480 --> 00:24:26.000
<v Speaker 1>side, ensuring the models themselves behave safely?

475
00:24:26.119 --> 00:24:29.640
<v Speaker 2>Microsoft has a whole responsible AI framework specifically adapted for

476
00:24:29.720 --> 00:24:32.599
<v Speaker 2>generative models like those in Azure OpenAI. It's generally

477
00:24:32.599 --> 00:24:37.119
<v Speaker 2>a four stage approach. First, identify potential harms. This involves

478
00:24:37.160 --> 00:24:40.440
<v Speaker 2>extensive testing, including red teaming, where experts actively try to

479
00:24:40.440 --> 00:24:44.759
<v Speaker 2>make the model produce harmful output. Second, measure: quantify how

480
00:24:44.799 --> 00:24:47.480
<v Speaker 2>often and how severely these harms occur using metrics and

481
00:24:47.559 --> 00:24:51.680
<v Speaker 2>human review. Third, mitigate: implement tools and strategies to reduce

482
00:24:51.720 --> 00:24:54.720
<v Speaker 2>those harms. This includes things like prompt engineering, guardrails, the

483
00:24:54.759 --> 00:24:58.440
<v Speaker 2>content filters we discussed, designing user experiences carefully, maybe adding

484
00:24:58.440 --> 00:25:01.440
<v Speaker 2>citations or limiting response length. It's a layered defense in depth strategy.

485
00:25:01.400 --> 00:25:03.440
<v Speaker 1>Defense in depth, right.

486
00:25:03.279 --> 00:25:07.559
<v Speaker 2>And fourth, operate. This is about having plans for ongoing

487
00:25:07.640 --> 00:25:12.440
<v Speaker 2>monitoring after deployment, collecting telemetry, gathering user feedback, and having

488
00:25:12.440 --> 00:25:14.799
<v Speaker 2>incident response plans ready if something goes wrong.

489
00:25:14.920 --> 00:25:17.119
<v Speaker 1>So it's an ongoing process, not just a one time

490
00:25:17.240 --> 00:25:18.119
<v Speaker 1>check.

491
00:25:18.200 --> 00:25:22.279
<v Speaker 2>Exactly. Continuous vigilance. For general operations, Azure Monitor is key.

492
00:25:22.440 --> 00:25:26.079
<v Speaker 2>It collects activity logs, resource logs, performance metrics from your

493
00:25:26.079 --> 00:25:28.960
<v Speaker 2>Azure OpenAI deployments. You can track things like processed

494
00:25:28.960 --> 00:25:31.400
<v Speaker 2>inference tokens to see your usage or if you're using

495
00:25:31.400 --> 00:25:35.720
<v Speaker 2>provisioned throughput, metrics like Provisioned-managed Utilization V2 show

496
00:25:35.720 --> 00:25:39.039
<v Speaker 2>how efficiently you're using that reserved capacity. You query all

497
00:25:39.079 --> 00:25:42.079
<v Speaker 2>this data using Kusto Query Language, KQL.

498
00:25:42.119 --> 00:25:45.319
<v Speaker 1>Okay. And finally, for scaling up, what about quotas and limits

499
00:25:45.480 --> 00:25:48.279
<v Speaker 1>and this PTUM thing you mentioned?

500
00:25:48.319 --> 00:25:51.279
<v Speaker 2>Right. So the standard pay-as-you-go Azure OpenAI uses

501
00:25:51.319 --> 00:25:55.480
<v Speaker 2>shared GPU infrastructure. Because it's shared, there are quotas and

502
00:25:55.519 --> 00:25:59.000
<v Speaker 2>rate limits to ensure fair usage. There are limits on say,

503
00:25:59.200 --> 00:26:01.839
<v Speaker 2>how many resources you can deploy per region, how

504
00:26:01.839 --> 00:26:05.400
<v Speaker 2>many concurrent DALL-E requests, how many Whisper requests per minute,

505
00:26:05.440 --> 00:26:08.160
<v Speaker 2>the number of fine tuning jobs, et cetera. Your main

506
00:26:08.200 --> 00:26:11.599
<v Speaker 2>throughput limit is usually measured in tokens per minute or TPM.

507
00:26:12.160 --> 00:26:15.480
<v Speaker 2>These TPM limits vary based on the region, the specific model,

508
00:26:15.559 --> 00:26:19.039
<v Speaker 2>and your deployment type. Enterprise agreement customers often get higher

509
00:26:19.119 --> 00:26:23.359
<v Speaker 2>default quotas. Requests per minute, RPM, is directly related, typically

510
00:26:23.400 --> 00:26:25.200
<v Speaker 2>six RPM per one thousand TPM.

511
00:26:25.480 --> 00:26:28.039
<v Speaker 1>So you manage these TPM limits?

512
00:26:27.839 --> 00:26:30.519
<v Speaker 2>Yes. Azure OpenAI quota management lets you allocate your total

513
00:26:30.519 --> 00:26:33.799
<v Speaker 2>available quota across your different model deployments as needed. Now,

514
00:26:33.839 --> 00:26:36.640
<v Speaker 2>for businesses needing really consistent high performance or low latency,

515
00:26:36.759 --> 00:26:40.039
<v Speaker 2>especially for critical apps, that's where Provisioned Throughput Unit Managed,

516
00:26:40.119 --> 00:26:41.240
<v Speaker 2>or PTUM, comes in.

517
00:26:41.359 --> 00:26:42.799
<v Speaker 1>PTUM. What does that give you?

518
00:26:43.039 --> 00:26:46.000
<v Speaker 2>It allows you to reserve dedicated processing capacity just for

519
00:26:46.039 --> 00:26:50.440
<v Speaker 2>your models. This guarantees consistent performance, stable latency and throughput

520
00:26:50.680 --> 00:26:53.400
<v Speaker 2>because you're not competing for resources on the shared infrastructure.

521
00:26:54.000 --> 00:26:56.920
<v Speaker 2>It also often comes with a significant cost saving, around

522
00:26:56.960 --> 00:26:59.240
<v Speaker 2>fifty percent potentially, compared to pay-as-you-go for

523
00:26:59.279 --> 00:27:02.079
<v Speaker 2>the same level of sustained usage. You can buy PTUs

524
00:27:02.119 --> 00:27:04.440
<v Speaker 2>hourly or commit to longer term reservations for even better rates.

525
00:27:04.400 --> 00:27:06.039
<v Speaker 1>And guarantees?

526
00:27:06.359 --> 00:27:10.640
<v Speaker 2>Yes. Importantly, PTUM comes with strong SLAs: a ninety nine

527
00:27:10.720 --> 00:27:13.440
<v Speaker 2>point nine percent uptime guarantee and a ninety nine

528
00:27:13.480 --> 00:27:17.440
<v Speaker 2>percent token generation latency guarantee, which provides that predictability crucial

529
00:27:17.480 --> 00:27:18.720
<v Speaker 2>for demanding applications.

530
00:27:19.000 --> 00:27:21.799
<v Speaker 1>What an incredible journey we've taken today. Seriously, we went

531
00:27:21.839 --> 00:27:25.119
<v Speaker 1>from the absolute fundamentals, what large language models even are,

532
00:27:25.279 --> 00:27:30.039
<v Speaker 1>understanding tokens, embeddings, transformers. Then we dove straight into Azure

533
00:27:30.079 --> 00:27:33.240
<v Speaker 1>OpenAI, looking at how Microsoft packages these powerful models

534
00:27:33.240 --> 00:27:37.559
<v Speaker 1>for enterprise use, covering security, compliance, all those critical aspects.

535
00:27:37.960 --> 00:27:42.519
<v Speaker 1>We explored the really advanced stuff too, embeddings, vector databases,

536
00:27:42.880 --> 00:27:46.880
<v Speaker 1>RAG for grounding models, multimodal capabilities with GPT-4o,

537
00:27:46.880 --> 00:27:50.799
<v Speaker 1>function calling, the Assistants API. And we didn't forget

538
00:27:50.880 --> 00:27:54.359
<v Speaker 1>the human element: mastering communication through prompt engineering, from basic

539
00:27:54.359 --> 00:27:57.240
<v Speaker 1>principles to advanced techniques like chain of thought and ReAct.

540
00:27:57.640 --> 00:28:02.960
<v Speaker 1>Finally, wrapping up with how to securely operationalize everything: compliance, filtering, monitoring, quotas,

541
00:28:03.000 --> 00:28:06.920
<v Speaker 1>and those PTUs for dedicated performance. It's a lot, but

542
00:28:07.000 --> 00:28:08.519
<v Speaker 1>hopefully a clear picture emerged.

543
00:28:08.680 --> 00:28:11.400
<v Speaker 2>It really covers the spectrum from the core concepts to

544
00:28:11.480 --> 00:28:14.680
<v Speaker 2>the practicalities of building and running real world AI solutions

545
00:28:14.720 --> 00:28:17.920
<v Speaker 2>responsibly on Azure. It's clear the power isn't just the

546
00:28:17.960 --> 00:28:19.720
<v Speaker 2>tech itself, but how you wield it.

547
00:28:19.920 --> 00:28:22.960
<v Speaker 1>That's perfectly put. As you've hopefully gathered from our chat,

548
00:28:23.079 --> 00:28:26.039
<v Speaker 1>the real magic happens when you understand not just what

549
00:28:26.079 --> 00:28:28.400
<v Speaker 1>these AI tools can do, but how to guide them,

550
00:28:28.480 --> 00:28:32.400
<v Speaker 1>how to integrate them thoughtfully, securely, and responsibly. So here's

551
00:28:32.440 --> 00:28:35.240
<v Speaker 1>something to think about. Given how fast AI is moving

552
00:28:35.519 --> 00:28:38.920
<v Speaker 1>and how widespread it's becoming, how might your understanding

553
00:28:39.000 --> 00:28:41.920
<v Speaker 1>of these tools, the capabilities, the limitations, the ways to

554
00:28:42.000 --> 00:28:46.200
<v Speaker 1>interact change how you tackle problems, not just tech problems,

555
00:28:46.240 --> 00:28:50.880
<v Speaker 1>but any challenge where information, creativity and communication are key.

556
00:28:51.119 --> 00:28:53.799
<v Speaker 2>Yeah, think about the possibilities that might have opened up

557
00:28:53.960 --> 00:28:57.319
<v Speaker 2>just from this discussion. What new approaches could you take?

558
00:28:57.279 --> 00:28:59.440
<v Speaker 1>What new questions does this spark for you? Maybe you want

559
00:28:59.440 --> 00:29:01.680
<v Speaker 1>to dig even deeper into the source material we used.

560
00:29:01.759 --> 00:29:05.000
<v Speaker 1>Or perhaps you're already thinking what specific area should we

561
00:29:05.079 --> 00:29:08.680
<v Speaker 1>deep dive into next time? Food for thought. Until then,

562
00:29:08.799 --> 00:29:10.200
<v Speaker 1>keep asking the big questions.
