WEBVTT

1
00:00:00.080 --> 00:00:02.000
<v Speaker 1>Welcome to the deep dive. This is where we take

2
00:00:02.000 --> 00:00:05.519
<v Speaker 1>a whole stack of articles, research papers, notes and basically

3
00:00:05.559 --> 00:00:07.080
<v Speaker 1>just dive in to pull out the key insights for

4
00:00:07.120 --> 00:00:10.759
<v Speaker 1>you. Today, our mission is to really get under the

5
00:00:10.800 --> 00:00:13.800
<v Speaker 1>hood of generative AI. It's a technology that's, well, it's

6
00:00:13.880 --> 00:00:16.160
<v Speaker 1>changing things incredibly fast. And just to give you a

7
00:00:16.199 --> 00:00:18.640
<v Speaker 1>sense of how impactful it is, remember back in twenty

8
00:00:18.719 --> 00:00:21.960
<v Speaker 1>twenty-two, the Colorado State Fair art competition? The winning

9
00:00:21.960 --> 00:00:25.280
<v Speaker 1>piece in the digital art category, Théâtre D'opéra Spatial,

10
00:00:25.640 --> 00:00:27.800
<v Speaker 1>wasn't, you know, painted by a human. It was made

11
00:00:27.879 --> 00:00:31.280
<v Speaker 1>using Midjourney, an AI. A really stunning sci-fi scene.

12
00:00:31.280 --> 00:00:34.840
<v Speaker 1>It just perfectly captures this blend of creativity and, well,

13
00:00:34.880 --> 00:00:38.200
<v Speaker 1>pure tech. That's where we're starting today. So okay, generative AI.

14
00:00:38.359 --> 00:00:42.039
<v Speaker 1>It's making headlines everywhere, creating art, writing code, sounding almost human.

15
00:00:42.560 --> 00:00:44.799
<v Speaker 1>But what is it fundamentally? What's the big shift here?

16
00:00:44.880 --> 00:00:47.399
<v Speaker 2>Yeah, it's a massive shift, really a paradigm shift, you

17
00:00:47.439 --> 00:00:50.600
<v Speaker 2>could say. Think about most AI you've probably come across before.

18
00:00:50.640 --> 00:00:55.479
<v Speaker 2>It's usually discriminative. I mean it learns to tell things

19
00:00:55.520 --> 00:01:00.200
<v Speaker 2>apart, right? Classify, predict, based on data it's seen, like

20
00:01:00.240 --> 00:01:03.200
<v Speaker 2>telling a cat photo from a dog photo. Generative models,

21
00:01:03.240 --> 00:01:06.680
<v Speaker 2>though, they do something different. They're not just recognizing patterns.

22
00:01:06.840 --> 00:01:09.560
<v Speaker 2>They learn the deep underlying rules of the data itself.

23
00:01:09.840 --> 00:01:11.920
<v Speaker 2>They learn enough about what makes a cat a cat

24
00:01:12.040 --> 00:01:15.959
<v Speaker 2>that they can actually create entirely new cat images, believable

25
00:01:15.959 --> 00:01:21.239
<v Speaker 2>ones from scratch. It's about generation, not just discrimination.

26
00:01:21.400 --> 00:01:23.879
<v Speaker 1>Wow, okay, so it's not just sorting or identifying. It's

27
00:01:23.920 --> 00:01:27.439
<v Speaker 1>like imagining. Yeah, materializing things. That feels like a huge leap.

28
00:01:27.560 --> 00:01:29.159
<v Speaker 2>It is, a profound one.

29
00:01:29.120 --> 00:01:31.760
<v Speaker 1>But I mean, creating something totally new like that, this sounds

30
00:01:31.760 --> 00:01:35.400
<v Speaker 1>incredibly complicated. What are some of the biggest challenges these

31
00:01:35.879 --> 00:01:37.480
<v Speaker 1>imagination machines face?

32
00:01:37.680 --> 00:01:40.239
<v Speaker 2>You're right, it's definitely not simple. For starters, the data

33
00:01:40.319 --> 00:01:43.599
<v Speaker 2>itself is a huge hurdle. Real world information is messy,

34
00:01:43.959 --> 00:01:47.959
<v Speaker 2>you know, it's full of errors, noise, biases, and the models, well,

35
00:01:48.040 --> 00:01:50.159
<v Speaker 2>they can learn those imperfections just as easily as the

36
00:01:50.239 --> 00:01:51.079
<v Speaker 2>useful patterns.

37
00:01:51.200 --> 00:01:54.120
<v Speaker 1>Ah, so garbage in, garbage out, potentially?

38
00:01:54.200 --> 00:01:57.359
<v Speaker 2>Sort of, yeah. And then there's the issue of staying current, especially

39
00:01:57.359 --> 00:02:01.239
<v Speaker 2>for large language models, LLMs. The world changes so fast

40
00:02:01.280 --> 00:02:04.480
<v Speaker 2>and the information they generate can become outdated pretty quickly

41
00:02:04.519 --> 00:02:06.120
<v Speaker 2>if they're not constantly updated.

42
00:02:06.439 --> 00:02:09.120
<v Speaker 1>Right, like asking about current events from a model trained

43
00:02:09.280 --> 00:02:10.479
<v Speaker 1>last year.

44
00:02:10.919 --> 00:02:15.599
<v Speaker 2>Exactly. And then there's the sheer computational power required. Learning these

45
00:02:15.639 --> 00:02:20.319
<v Speaker 2>incredibly complex patterns and then generating new high-fidelity data,

46
00:02:21.039 --> 00:02:25.479
<v Speaker 2>it demands massive amounts of compute. And finally, think about evaluation.

47
00:02:25.879 --> 00:02:28.400
<v Speaker 2>With a discriminative model, you ask, is this a cat?

48
00:02:28.759 --> 00:02:32.199
<v Speaker 2>Yes or no? Easy to check. But with a generative model,

49
00:02:32.280 --> 00:02:34.639
<v Speaker 2>how do you evaluate if a generated cat picture is

50
00:02:34.840 --> 00:02:37.639
<v Speaker 2>good or accurate? There isn't a single right answer.

51
00:02:37.719 --> 00:02:39.960
<v Speaker 2>It's much more subjective, much more complex to measure.

52
00:02:40.000 --> 00:02:42.120
<v Speaker 1>That makes total sense. It's not just about being correct,

53
00:02:42.159 --> 00:02:45.360
<v Speaker 1>it's about being believable, useful.

54
00:02:45.080 --> 00:02:49.120
<v Speaker 2>Plausible, believable, useful, coherent, all those things.

55
00:02:49.360 --> 00:02:54.080
<v Speaker 1>Okay, so despite those big challenges, the promise must be huge, right?

56
00:02:54.120 --> 00:02:57.719
<v Speaker 1>That's why everyone's pouring resources into this. Let's dig into

57
00:02:57.719 --> 00:03:01.080
<v Speaker 1>some of those applications. Images, for instance. We've gone way

58
00:03:01.080 --> 00:03:02.719
<v Speaker 1>beyond simple photo filters.

59
00:03:02.800 --> 00:03:08.560
<v Speaker 2>Oh, absolutely. Models now can create incredibly diverse photorealistic images just

60
00:03:08.599 --> 00:03:12.759
<v Speaker 2>from, say, a text description. Things you'd never have imagined possible

61
00:03:12.759 --> 00:03:13.439
<v Speaker 2>a few years ago.

62
00:03:13.520 --> 00:03:15.919
<v Speaker 1>And it's not just art, right? You mentioned data augmentation.

63
00:03:16.319 --> 00:03:18.840
<v Speaker 2>Yeah, that's a really practical one. Imagine you have only

64
00:03:18.879 --> 00:03:21.639
<v Speaker 2>a small data set maybe for training an AI to

65
00:03:21.719 --> 00:03:26.319
<v Speaker 2>recognize a specific product defect. Generative AI can create thousands

66
00:03:26.360 --> 00:03:30.080
<v Speaker 2>of synthetic examples, different angles, lighting conditions, you name it,

67
00:03:30.240 --> 00:03:33.280
<v Speaker 2>to bolster that data set, make the training more robust,

68
00:03:33.599 --> 00:03:36.680
<v Speaker 2>maybe even reduce bias if the original data was skewed.

69
00:03:36.960 --> 00:03:39.280
<v Speaker 1>That's clever, using AI to make other AI better.

70
00:03:39.159 --> 00:03:43.560
<v Speaker 2>Exactly. And in content creation too: generating text for chatbots,

71
00:03:43.560 --> 00:03:47.120
<v Speaker 2>helping writers brainstorm, even drafting emails. We've come such a

72
00:03:47.120 --> 00:03:49.000
<v Speaker 2>long way from ELIZA back in the sixties.

73
00:03:48.960 --> 00:03:51.599
<v Speaker 1>Right, those old rule-based bots. Now we have

74
00:03:51.680 --> 00:03:54.560
<v Speaker 1>these powerful models built on architectures like transformers.

75
00:03:54.639 --> 00:03:57.520
<v Speaker 2>It's a different world. But those challenges we mentioned, the

76
00:03:57.560 --> 00:04:00.759
<v Speaker 2>messy data, keeping up with reality, the compute demands,

77
00:04:00.800 --> 00:04:05.840
<v Speaker 2>and that tricky evaluation problem, they're still very real hurdles.

78
00:04:06.120 --> 00:04:10.240
<v Speaker 1>Yeah, defining "good enough" for generated stuff, that's a tough one. Okay.

79
00:04:10.240 --> 00:04:13.000
<v Speaker 1>So before we get deeper into the applications, how did

80
00:04:13.000 --> 00:04:15.400
<v Speaker 1>we even get here? How did machines learn to imagine

81
00:04:15.439 --> 00:04:20.199
<v Speaker 1>like this? Let's trace back the building blocks: deep neural networks.

82
00:04:20.519 --> 00:04:23.120
<v Speaker 2>It goes way back. Actually, early ideas in the nineteen

83
00:04:23.199 --> 00:04:26.839
<v Speaker 2>forties were inspired by biological neurons. Simple things like the

84
00:04:26.879 --> 00:04:31.279
<v Speaker 2>threshold logic unit. But those early models hit limitations. Famously,

85
00:04:31.399 --> 00:04:34.319
<v Speaker 2>Minsky and Papert showed in their book Perceptrons that single-

86
00:04:34.360 --> 00:04:37.600
<v Speaker 2>layer networks couldn't even solve basic problems like the XOR

87
00:04:37.720 --> 00:04:40.560
<v Speaker 2>logic function. That led to the first AI winter in

88
00:04:40.600 --> 00:04:41.160
<v Speaker 2>the seventies.

89
00:04:41.279 --> 00:04:44.639
<v Speaker 1>Progress stalled, right, the AI winter. So what thawed things out?

90
00:04:44.680 --> 00:04:47.199
<v Speaker 1>What was the big breakthrough that got things moving again?

91
00:04:47.399 --> 00:04:51.199
<v Speaker 2>The absolute game changer was backpropagation. Before that, figuring out

92
00:04:51.279 --> 00:04:53.439
<v Speaker 2>how to adjust all the connections, the weights, in a

93
00:04:53.519 --> 00:04:57.600
<v Speaker 2>deep network was incredibly inefficient, almost impossible for complex networks.

94
00:04:57.680 --> 00:05:00.160
<v Speaker 1>How does it work, sort of in simple terms?

95
00:05:00.120 --> 00:05:03.279
<v Speaker 2>Well, it uses calculus, specifically the chain rule, to efficiently

96
00:05:03.319 --> 00:05:06.720
<v Speaker 2>calculate how much each weight in the network contributed to

97
00:05:06.759 --> 00:05:09.800
<v Speaker 2>the final error. It tells each connection exactly how to

98
00:05:09.839 --> 00:05:13.759
<v Speaker 2>adjust itself, layer by layer, working backward from the output error.

99
00:05:14.319 --> 00:05:17.639
<v Speaker 2>It made training deep networks practical. That's what really ended

100
00:05:17.639 --> 00:05:20.680
<v Speaker 2>the AI winter and opened the door to modern deep learning.

101
00:05:20.959 --> 00:05:24.120
<v Speaker 1>But you said even backpropagation wasn't perfect. It had issues?

102
00:05:24.199 --> 00:05:24.519
<v Speaker 2>It did.

103
00:05:24.600 --> 00:05:27.439
<v Speaker 2>A big one was the vanishing gradient problem. In very

104
00:05:27.480 --> 00:05:30.199
<v Speaker 2>deep networks, the error signal gets weaker and weaker as

105
00:05:30.199 --> 00:05:33.480
<v Speaker 2>it propagates backward, so the early layers, the ones furthest

106
00:05:33.480 --> 00:05:37.360
<v Speaker 2>from the output, learn extremely slowly or sometimes not at all,

107
00:05:37.639 --> 00:05:39.680
<v Speaker 2>like a whisper getting lost down a long hallway.

108
00:05:39.759 --> 00:05:41.800
<v Speaker 1>Okay, I can picture that. The signal just fades out.

109
00:05:42.000 --> 00:05:44.279
<v Speaker 1>So once we had backpropagation, even with its flaws,

110
00:05:44.399 --> 00:05:48.199
<v Speaker 1>what kinds of network structures or architectures started showing up?

111
00:05:48.240 --> 00:05:52.399
<v Speaker 2>Well, for images, a major leap was convolutional neural networks, CNNs.

112
00:05:52.759 --> 00:05:55.439
<v Speaker 2>They were kind of inspired by the human visual cortex.

113
00:05:56.000 --> 00:05:59.279
<v Speaker 2>Instead of looking at an image pixel by pixel, CNNs

114
00:05:59.360 --> 00:06:02.639
<v Speaker 2>use filters that slide across the image looking for specific

115
00:06:02.680 --> 00:06:07.199
<v Speaker 2>local features: edges, corners, textures. And crucially, they share weights.

116
00:06:07.600 --> 00:06:10.279
<v Speaker 2>The filter looking for a horizontal edge is the same

117
00:06:10.279 --> 00:06:12.560
<v Speaker 2>filter whether it's looking at the top left or bottom right.

118
00:06:12.759 --> 00:06:14.240
<v Speaker 2>This makes them way more efficient for images.

119
00:06:14.199 --> 00:06:17.920
<v Speaker 1>Sharing weights, okay. And there were improvements on those

120
00:06:17.920 --> 00:06:19.040
<v Speaker 1>basic CNNs?

121
00:06:19.000 --> 00:06:22.759
<v Speaker 2>Oh yeah, big ones. Things like ReLU activation functions. They replaced

122
00:06:22.800 --> 00:06:25.959
<v Speaker 2>older functions that saturated easily and helped fix that vanishing

123
00:06:25.959 --> 00:06:29.680
<v Speaker 2>gradient problem. Kept the signal strong. And dropout, which

124
00:06:29.720 --> 00:06:33.079
<v Speaker 2>sounds weird but works amazingly well. During training, you randomly

125
00:06:33.120 --> 00:06:35.800
<v Speaker 2>switch off some neurons. It forces the network not to

126
00:06:35.839 --> 00:06:38.600
<v Speaker 2>rely too much on any single neuron, making it generalize

127
00:06:38.639 --> 00:06:41.120
<v Speaker 2>better to new data. Kind of like cross training for

128
00:06:41.160 --> 00:06:41.600
<v Speaker 2>the network.

129
00:06:41.720 --> 00:06:44.839
<v Speaker 1>Huh. Interesting. Okay, so that's images. What about sequences like

130
00:06:45.000 --> 00:06:47.839
<v Speaker 1>text or speech or time series data.

131
00:06:47.920 --> 00:06:51.480
<v Speaker 2>For sequential data, the go to became recurrent neural networks

132
00:06:51.800 --> 00:06:55.680
<v Speaker 2>or RNNs. They have loops allowing information to persist. They

133
00:06:55.680 --> 00:06:57.360
<v Speaker 2>have a kind of memory.

134
00:06:57.000 --> 00:06:59.240
<v Speaker 1>A memory, right. But didn't they also have issues with

135
00:06:59.279 --> 00:07:00.759
<v Speaker 1>long sequences?

136
00:07:01.199 --> 00:07:04.480
<v Speaker 2>They did. That vanishing gradient problem hit them hard too when trying

137
00:07:04.519 --> 00:07:07.279
<v Speaker 2>to remember things from many steps back, which led to

138
00:07:07.319 --> 00:07:11.279
<v Speaker 2>the development of LSTMs, long short-term memory networks. LSTMs

139
00:07:11.279 --> 00:07:14.279
<v Speaker 2>were a much more sophisticated type of RNN. They have

140
00:07:14.439 --> 00:07:18.839
<v Speaker 2>these internal mechanisms called gates: an input gate, a forget gate,

141
00:07:19.040 --> 00:07:22.680
<v Speaker 2>and an output gate. These gates carefully control what information gets

142
00:07:22.680 --> 00:07:25.959
<v Speaker 2>stored in the memory cell, what gets forgotten, and what influences

143
00:07:26.000 --> 00:07:28.920
<v Speaker 2>the output at each step. They were much, much

144
00:07:29.000 --> 00:07:32.839
<v Speaker 2>better at capturing long-range dependencies, crucial for understanding language.

145
00:07:32.879 --> 00:07:36.240
<v Speaker 1>Okay, so LSTMs improved memory. But you mentioned earlier that

146
00:07:36.279 --> 00:07:39.199
<v Speaker 1>even they had limitations, especially for really long text, which

147
00:07:39.279 --> 00:07:41.519
<v Speaker 1>led to transformers.

148
00:07:40.920 --> 00:07:44.319
<v Speaker 2>Exactly. This is where transformers completely changed the game, particularly

149
00:07:44.319 --> 00:07:47.120
<v Speaker 2>for language. They threw out the sequential, step by step

150
00:07:47.120 --> 00:07:51.040
<v Speaker 2>processing of RNNs and LSTMs. The core idea, the revolution,

151
00:07:51.639 --> 00:07:52.439
<v Speaker 2>was self-attention.

152
00:07:52.920 --> 00:07:55.800
<v Speaker 1>Self-attention, we hear that term a lot. What does

153
00:07:55.839 --> 00:07:57.160
<v Speaker 1>it actually let the model do?

154
00:07:57.680 --> 00:08:01.639
<v Speaker 2>Instead of processing word by word, self-attention allows every

155
00:08:01.680 --> 00:08:05.800
<v Speaker 2>single word in a sentence to directly look at and

156
00:08:05.879 --> 00:08:09.439
<v Speaker 2>weigh the importance of every other word in that same sentence, all at once.

157
00:08:09.279 --> 00:08:13.439
<v Speaker 1>All at once, so no more sequential bottleneck?

158
00:08:13.759 --> 00:08:17.160
<v Speaker 2>Precisely, it can instantly see how the first word relates

159
00:08:17.160 --> 00:08:19.600
<v Speaker 2>to the last word, or how a pronoun relates to the

160
00:08:19.639 --> 00:08:22.879
<v Speaker 2>noun it refers to, even if they're far apart. And crucially,

161
00:08:23.439 --> 00:08:26.680
<v Speaker 2>because it's not sequential, you can process all words in parallel.

162
00:08:27.360 --> 00:08:30.920
<v Speaker 2>This makes training on massive data sets much much faster

163
00:08:31.000 --> 00:08:34.399
<v Speaker 2>and scalable than RNNs ever could be. It just unlocked

164
00:08:34.399 --> 00:08:36.279
<v Speaker 2>a whole new level of performance and scale.

165
00:08:36.360 --> 00:08:38.279
<v Speaker 1>Okay, that makes sense why they were such a big deal.

166
00:08:38.440 --> 00:08:40.480
<v Speaker 1>So if we have these powerful architectures, how do we

167
00:08:40.519 --> 00:08:44.519
<v Speaker 1>get them to actually understand and use words? How does

168
00:08:44.600 --> 00:08:47.080
<v Speaker 1>text get turned into numbers the machine can process?

169
00:08:47.200 --> 00:08:49.879
<v Speaker 2>Right, that's fundamental. The early approaches were pretty simple, like

170
00:08:49.919 --> 00:08:52.320
<v Speaker 2>bag of words. You literally just count how many times

171
00:08:52.399 --> 00:08:53.840
<v Speaker 2>each word appears in a document.

172
00:08:53.960 --> 00:08:55.519
<v Speaker 1>Simple, but I guess it loses a lot.

173
00:08:55.720 --> 00:09:00.840
<v Speaker 2>It loses all the context. Word order, grammar, gone. "Dog

174
00:09:00.879 --> 00:09:04.159
<v Speaker 2>bites man" and "man bites dog" look exactly the same

175
00:09:04.200 --> 00:09:06.720
<v Speaker 2>to a bag of words model. Not very useful for

176
00:09:06.799 --> 00:09:07.679
<v Speaker 2>understanding meaning.

177
00:09:08.000 --> 00:09:09.720
<v Speaker 1>Yeah, that seems like a pretty big flaw.

178
00:09:10.000 --> 00:09:12.279
<v Speaker 2>So the next big step was word embeddings. These are

179
00:09:12.320 --> 00:09:16.120
<v Speaker 2>dense vector representations, basically lists of numbers for each word.

180
00:09:16.879 --> 00:09:20.039
<v Speaker 2>Models like word2vec learn these embeddings by looking

181
00:09:20.039 --> 00:09:22.840
<v Speaker 2>at the context words appear in. The key idea was

182
00:09:22.879 --> 00:09:26.919
<v Speaker 2>that words used in similar contexts should have similar numerical representations,

183
00:09:26.960 --> 00:09:31.000
<v Speaker 2>similar vectors. It started capturing semantic relationships.

184
00:09:30.279 --> 00:09:33.240
<v Speaker 1>So king and queen would be mathematically closer than king

185
00:09:33.320 --> 00:09:34.639
<v Speaker 1>and cabbage?

186
00:09:34.960 --> 00:09:37.840
<v Speaker 2>Exactly. But even those embeddings were static. The vector for "bank"

187
00:09:37.960 --> 00:09:40.039
<v Speaker 2>was the same whether you meant a riverbank or a

188
00:09:40.080 --> 00:09:44.240
<v Speaker 2>financial bank. The real breakthrough for nuance was contextual representations.

189
00:09:44.720 --> 00:09:48.279
<v Speaker 2>Models like BERT and ELMo generate embeddings that change based

190
00:09:48.320 --> 00:09:51.159
<v Speaker 2>on the specific sentence the word is in. They understand

191
00:09:51.159 --> 00:09:53.840
<v Speaker 2>that bank means different things in different contexts. That was

192
00:09:54.000 --> 00:09:55.840
<v Speaker 2>huge for understanding language properly.

193
00:09:55.919 --> 00:09:58.879
<v Speaker 1>Okay, so we have ways to represent words with nuance. Now,

194
00:09:59.000 --> 00:10:01.960
<v Speaker 1>how do we make the machine talk, generate text?

195
00:10:02.399 --> 00:10:05.759
<v Speaker 2>That's the job of language models. At their heart, they're

196
00:10:05.799 --> 00:10:08.799
<v Speaker 2>trying to predict the next word in a sequence given

197
00:10:08.840 --> 00:10:12.440
<v Speaker 2>the previous words, like a superpowered autocomplete.

198
00:10:12.639 --> 00:10:14.720
<v Speaker 1>Just predicting the next word. How does that lead to

199
00:10:14.840 --> 00:10:16.960
<v Speaker 1>coherent sentences or paragraphs?

200
00:10:17.240 --> 00:10:19.559
<v Speaker 2>Well, once it predicts a word, that word becomes part

201
00:10:19.559 --> 00:10:21.759
<v Speaker 2>of the context for predicting the next word and so on.

202
00:10:22.120 --> 00:10:25.559
<v Speaker 2>But just picking the single most probable word at each step,

203
00:10:25.600 --> 00:10:29.399
<v Speaker 2>that's called greedy decoding, often leads to really repetitive or

204
00:10:29.440 --> 00:10:30.200
<v Speaker 2>boring text.

205
00:10:30.480 --> 00:10:32.720
<v Speaker 1>Right, it might just get stuck saying the same phrase

206
00:10:32.799 --> 00:10:33.679
<v Speaker 1>over and over.

207
00:10:33.639 --> 00:10:37.720
<v Speaker 2>Exactly. So we use more sophisticated decoding strategies. Beam search

208
00:10:37.799 --> 00:10:40.200
<v Speaker 2>keeps track of several of the most likely sequences at

209
00:10:40.200 --> 00:10:42.320
<v Speaker 2>each step, kind of looking ahead to find a better

210
00:10:42.360 --> 00:10:46.240
<v Speaker 2>overall sentence. And then there's sampling. Instead of always picking

211
00:10:46.320 --> 00:10:49.799
<v Speaker 2>the most likely word, you introduce some randomness. You might

212
00:10:49.840 --> 00:10:53.840
<v Speaker 2>sample from, say, the top ten most likely words, top-k

213
00:10:53.919 --> 00:10:56.639
<v Speaker 2>sampling, or from the smallest set of words whose

214
00:10:56.679 --> 00:11:00.759
<v Speaker 2>probabilities add up to a certain threshold, nucleus sampling. This

215
00:11:00.759 --> 00:11:04.559
<v Speaker 2>adds variety and makes the text feel more natural, less predictable.

216
00:11:04.840 --> 00:11:07.960
<v Speaker 1>So sampling adds a bit of creativity, stops it being

217
00:11:08.039 --> 00:11:09.279
<v Speaker 1>robotic?

218
00:11:09.360 --> 00:11:11.720
<v Speaker 2>Pretty much, yeah. It helps avoid getting stuck in loops and generates

219
00:11:11.720 --> 00:11:12.639
<v Speaker 2>more interesting output.

220
00:11:13.039 --> 00:11:15.240
<v Speaker 1>And it seems like the transformer architecture with that self-

221
00:11:15.240 --> 00:11:18.840
<v Speaker 1>attention mechanism was absolutely critical for enabling this kind of

222
00:11:18.879 --> 00:11:22.960
<v Speaker 1>sophisticated text generation at scale, right? Can you expand on

223
00:11:23.039 --> 00:11:24.879
<v Speaker 1>why it was such a turning point for these large

224
00:11:24.919 --> 00:11:25.600
<v Speaker 1>language models?

225
00:11:25.639 --> 00:11:28.960
<v Speaker 2>Oh, absolutely pivotal. That twenty seventeen paper, Attention Is All

226
00:11:29.000 --> 00:11:33.720
<v Speaker 2>You Need, it really did shift the paradigm. Before transformers, remember,

227
00:11:33.759 --> 00:11:37.720
<v Speaker 2>even LSTMs, our best sequential models, had that bottleneck issue.

228
00:11:37.759 --> 00:11:41.480
<v Speaker 2>They had to cram the meaning of the entire input sequence,

229
00:11:41.559 --> 00:11:45.279
<v Speaker 2>no matter how long, into a single fixed-size context

230
00:11:45.320 --> 00:11:48.720
<v Speaker 2>vector to pass along. For very long sentences or documents

231
00:11:48.759 --> 00:11:50.919
<v Speaker 2>that just wasn't enough. Information got lost.

232
00:11:51.000 --> 00:11:52.720
<v Speaker 1>The memory wasn't infinite.

233
00:11:52.440 --> 00:11:56.720
<v Speaker 2>Right. Transformers, by ditching recurrence entirely and using self-attention,

234
00:11:57.240 --> 00:12:01.639
<v Speaker 2>broke that bottleneck wide open. Every word could directly attend

235
00:12:01.679 --> 00:12:06.279
<v Speaker 2>to every other word, instantly capturing those long range dependencies. Plus,

236
00:12:06.559 --> 00:12:10.399
<v Speaker 2>they introduced multi-head self-attention. Think of it as allowing

237
00:12:10.399 --> 00:12:12.960
<v Speaker 2>the model to pay attention to different kinds of relationships

238
00:12:13.039 --> 00:12:18.720
<v Speaker 2>simultaneously in parallel subspaces. Maybe one head focuses on grammatical relationships,

239
00:12:18.960 --> 00:12:21.159
<v Speaker 2>another on semantic similarity.

240
00:12:20.679 --> 00:12:23.159
<v Speaker 1>So it could capture multiple layers of meaning at once.

241
00:12:23.320 --> 00:12:27.720
<v Speaker 2>Exactly. That ability to handle long contexts effectively and efficiently,

242
00:12:28.000 --> 00:12:30.759
<v Speaker 2>combined with the massive parallelism allowing them to train on

243
00:12:30.960 --> 00:12:34.039
<v Speaker 2>unprecedented amounts of data, that's what paved the way for

244
00:12:34.080 --> 00:12:37.279
<v Speaker 2>the truly large language models, the LLMs that we have today.

245
00:12:37.559 --> 00:12:40.799
<v Speaker 1>And from that core transformer idea different sort of flavors

246
00:12:40.879 --> 00:12:42.679
<v Speaker 1>or families of models emerged?

247
00:12:42.919 --> 00:12:46.039
<v Speaker 2>Yeah. Broadly speaking, you see three main types based on

248
00:12:46.120 --> 00:12:49.679
<v Speaker 2>which parts of the original transformer architecture they use. First,

249
00:12:49.879 --> 00:12:53.279
<v Speaker 2>encoder-only models like the famous BERT. These are designed

250
00:12:53.320 --> 00:12:56.159
<v Speaker 2>primarily for understanding text. They look at the whole sentence

251
00:12:56.200 --> 00:13:00.240
<v Speaker 2>at once. Great for tasks like classification, sentiment analysis, or

252
00:13:00.320 --> 00:13:02.399
<v Speaker 2>question answering where context is key.

253
00:13:02.559 --> 00:13:05.360
<v Speaker 1>Okay, understanding text. What's the next type?

254
00:13:05.440 --> 00:13:08.600
<v Speaker 2>Then you have decoder-only models like the GPT family.

255
00:13:08.759 --> 00:13:12.759
<v Speaker 2>These are built for generating text. They work sequentially, predicting

256
00:13:12.799 --> 00:13:16.000
<v Speaker 2>the next word based on the preceding ones. This causal

257
00:13:16.120 --> 00:13:21.200
<v Speaker 2>nature makes them naturals for chatbots, story writing, code generation. GPT

258
00:13:21.360 --> 00:13:25.919
<v Speaker 2>really revolutionized generation with its ability for unsupervised multitask learning,

259
00:13:26.279 --> 00:13:28.480
<v Speaker 2>learning many tasks just from raw text.

260
00:13:28.639 --> 00:13:30.639
<v Speaker 1>Right. GPT is the one most people probably think of,

261
00:13:30.679 --> 00:13:32.039
<v Speaker 1>and the third type?

262
00:13:31.840 --> 00:13:35.000
<v Speaker 2>Encoder-decoder models like T5 or the original transformer.

263
00:13:35.639 --> 00:13:37.720
<v Speaker 2>These have both parts and are often used for sequence-

264
00:13:37.759 --> 00:13:41.000
<v Speaker 2>to-sequence tasks where you're transforming an input sequence into

265
00:13:41.000 --> 00:13:44.639
<v Speaker 2>an output sequence. Think machine translation or text summarization.

266
00:13:45.000 --> 00:13:49.240
<v Speaker 1>Got it: encoder for understanding, decoder for generating, and both

267
00:13:49.240 --> 00:13:52.879
<v Speaker 1>for transforming. And focusing on GPT, since it's so prominent,

268
00:13:53.240 --> 00:13:54.360
<v Speaker 1>what were the big leaps there?

269
00:13:54.879 --> 00:13:57.759
<v Speaker 2>Well, GPT-2 in twenty nineteen was a major milestone,

270
00:13:58.120 --> 00:14:01.360
<v Speaker 2>one point five billion parameters, trained on a huge chunk

271
00:14:01.399 --> 00:14:04.879
<v Speaker 2>of the Internet. What was really stunning was its few-

272
00:14:04.919 --> 00:14:08.240
<v Speaker 2>shot ability, few-shot meaning you could give it

273
00:14:08.320 --> 00:14:09.919
<v Speaker 2>just a couple of examples of a task in the

274
00:14:09.960 --> 00:14:12.279
<v Speaker 2>prompt and it could often figure out how to do

275
00:14:12.320 --> 00:14:15.120
<v Speaker 2>it without any specific training for that task. Yeah, it

276
00:14:15.159 --> 00:14:17.960
<v Speaker 2>showed an incredible level of general language understanding.

277
00:14:18.039 --> 00:14:18.480
<v Speaker 1>Wow.

278
00:14:18.639 --> 00:14:21.799
<v Speaker 2>And then GPT-3 came along, and wow, GPT-3

279
00:14:21.840 --> 00:14:24.679
<v Speaker 2>was enormous: one hundred and seventy-five billion parameters, over

280
00:14:24.679 --> 00:14:27.879
<v Speaker 2>one hundred times bigger. It started showing these emergent abilities

281
00:14:27.919 --> 00:14:30.120
<v Speaker 2>things it wasn't explicitly trained for but could just do,

282
00:14:30.279 --> 00:14:33.399
<v Speaker 2>like unscrambling words or even basic arithmetic. It felt like

283
00:14:33.679 --> 00:14:35.120
<v Speaker 2>a qualitative leap.

284
00:14:35.039 --> 00:14:37.960
<v Speaker 1>But raw capability isn't always the same as being useful

285
00:14:38.080 --> 00:14:39.000
<v Speaker 1>or safe.

286
00:14:38.720 --> 00:14:41.919
<v Speaker 2>Right, exactly. And that led to InstructGPT in twenty

287
00:14:41.960 --> 00:14:44.919
<v Speaker 2>twenty-two. It was actually smaller than GPT-3, but

288
00:14:45.080 --> 00:14:48.000
<v Speaker 2>critically it was much better at following instructions and aligning

289
00:14:48.039 --> 00:14:48.840
<v Speaker 2>with user intent.

290
00:14:49.120 --> 00:14:51.559
<v Speaker 1>How did they achieve that alignment?

291
00:14:51.399 --> 00:14:54.960
<v Speaker 2>Through two extra training steps after the initial pre-training. First,

292
00:14:55.039 --> 00:14:57.720
<v Speaker 2>instruction fine-tuning, where they trained it on examples of

293
00:14:57.759 --> 00:15:02.120
<v Speaker 2>prompts and desired outputs, and second, crucially, reinforcement learning from

294
00:15:02.240 --> 00:15:05.440
<v Speaker 2>human feedback, or RLHF.

295
00:15:04.960 --> 00:15:08.080
<v Speaker 1>RLHF, that involves humans ranking different outputs?

296
00:15:07.759 --> 00:15:11.399
<v Speaker 2>Yes, humans would compare different responses from the model

297
00:15:11.440 --> 00:15:14.200
<v Speaker 2>to the same prompt and indicate which one they preferred.

298
00:15:14.799 --> 00:15:17.679
<v Speaker 2>This feedback was used to train a reward model, which

299
00:15:17.759 --> 00:15:21.240
<v Speaker 2>then guided the LLM during further fine-tuning to produce

300
00:15:21.279 --> 00:15:24.399
<v Speaker 2>outputs that humans are more likely to find helpful, honest,

301
00:15:24.519 --> 00:15:28.000
<v Speaker 2>and harmless. That alignment step was key for making models

302
00:15:28.000 --> 00:15:30.399
<v Speaker 2>like ChatGPT practical and safer to deploy.

303
00:15:30.360 --> 00:15:34.320
<v Speaker 1>Alignment, right, that seems super important. Then it also

304
00:15:34.399 --> 00:15:36.919
<v Speaker 1>brings up the point about access. Many of these really

305
00:15:36.960 --> 00:15:40.080
<v Speaker 1>powerful models like GPT-4 are closed source. We don't

306
00:15:40.159 --> 00:15:42.200
<v Speaker 1>know the exact architecture or the training data. How does

307
00:15:42.200 --> 00:15:42.919
<v Speaker 1>that affect things?

308
00:15:43.039 --> 00:15:45.320
<v Speaker 2>It's a huge debate in the field. On one hand,

309
00:15:45.399 --> 00:15:48.639
<v Speaker 2>companies invest billions and want to protect their IP. On

310
00:15:48.679 --> 00:15:54.120
<v Speaker 2>the other hand, it raises serious questions about transparency, reproducibility, bias auditing,

311
00:15:54.200 --> 00:15:58.120
<v Speaker 2>and just how can the broader community innovate and build

312
00:15:58.399 --> 00:15:59.759
<v Speaker 2>if the cutting edge is locked away?

313
00:16:00.440 --> 00:16:01.879
<v Speaker 1>So is there a counter movement?

314
00:16:02.120 --> 00:16:05.679
<v Speaker 2>Absolutely. The open-source LLM movement has exploded in response.

315
00:16:05.960 --> 00:16:09.120
<v Speaker 2>You have major efforts like Meta's Llama models. They release

316
00:16:09.200 --> 00:16:13.200
<v Speaker 2>models with billions of parameters, allowing researchers and developers everywhere

317
00:16:13.360 --> 00:16:16.120
<v Speaker 2>to experiment and build on them. They've shown really strong

318
00:16:16.159 --> 00:16:19.879
<v Speaker 2>performance on benchmarks for coding, reasoning, common sense.

319
00:16:19.679 --> 00:16:22.440
<v Speaker 1>So viable open alternatives are emerging?

320
00:16:22.080 --> 00:16:25.480
<v Speaker 2>Definitely, and you see interesting architectural innovations too. Look at

321
00:16:25.519 --> 00:16:29.799
<v Speaker 2>Mixtral from Mistral AI. It uses a mixture of experts, MoE, architecture.

322
00:16:29.799 --> 00:16:31.679
<v Speaker 1>Mixture of experts. How does that work?

323
00:16:31.679 --> 00:16:34.120
<v Speaker 2>Instead of the entire huge model

324
00:16:34.159 --> 00:16:37.679
<v Speaker 2>processing every single input token, an MoE model has multiple

325
00:16:37.720 --> 00:16:42.919
<v Speaker 2>smaller expert networks, usually specialized transformer layers. A lightweight router

326
00:16:43.039 --> 00:16:45.519
<v Speaker 2>network directs each part of the input to only a

327
00:16:45.559 --> 00:16:46.759
<v Speaker 2>small subset of these experts.

328
00:16:46.600 --> 00:16:49.960
<v Speaker 1>Ah, so only part of the model is active

329
00:16:49.960 --> 00:16:51.639
<v Speaker 1>at any given time. More efficient?

330
00:16:51.840 --> 00:16:54.159
<v Speaker 2>Exactly, you could have a model with a massive total

331
00:16:54.240 --> 00:16:58.120
<v Speaker 2>number of parameters, giving great capacity, but the actual computation

332
00:16:58.240 --> 00:17:01.399
<v Speaker 2>needed for inference is much lower because you're only using

333
00:17:01.399 --> 00:17:04.079
<v Speaker 2>a fraction of the experts for any given input. It's

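NOTE
A toy sketch of the routing idea described above, assuming top-k routing with
k=2 and softmax-renormalized weights. The experts here are stand-in functions
and the router scores are hard-coded; in a real MoE layer both are learned.
```python
import math
def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights with a softmax."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    best = router_logits[top[0]]
    exps = {i: math.exp(router_logits[i] - best) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}
def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
# Four hypothetical experts; only two of them run for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
weights = route_top_k(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
```
Total capacity grows with the number of experts, while the work per input
depends only on k, which is the efficiency argument made in the conversation.
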
334
00:17:04.119 --> 00:17:07.400
<v Speaker 2>a clever way to scale up while managing costs. Plus,

335
00:17:07.519 --> 00:17:10.680
<v Speaker 2>Mixtral has a very permissive Apache 2.0 license,

336
00:17:11.000 --> 00:17:12.119
<v Speaker 2>making it widely usable.

337
00:17:11.839 --> 00:17:15.400
<v Speaker 1>Interesting. Any other key open-source players?

338
00:17:15.799 --> 00:17:18.720
<v Speaker 2>Well, Dolly from Databricks took a different approach. They

339
00:17:18.799 --> 00:17:22.240
<v Speaker 2>focused on creating a high-quality instruction-following dataset,

340
00:17:22.440 --> 00:17:26.319
<v Speaker 2>about fifteen thousand prompts and responses generated entirely by

341
00:17:26.319 --> 00:17:29.720
<v Speaker 2>Databricks employees. Their goal was specifically to create an open

342
00:17:29.839 --> 00:17:33.720
<v Speaker 2>instruction tuned model without relying on data generated by proprietary

343
00:17:33.759 --> 00:17:36.880
<v Speaker 2>models like ChatGPT, which often comes with restrictive licenses.

344
00:17:37.079 --> 00:17:40.880
<v Speaker 2>They wanted to truly democratize instruction-following capabilities.

345
00:17:40.279 --> 00:17:43.359
<v Speaker 1>So focusing on open data as much as open models?

346
00:17:43.640 --> 00:17:47.119
<v Speaker 2>Precisely. And you also have models like Falcon from TII in

347
00:17:47.160 --> 00:17:51.359
<v Speaker 2>the UAE, trained primarily on web data, and Grok-1

348
00:17:51.519 --> 00:17:55.440
<v Speaker 2>from xAI, which also uses that mixture of experts architecture.

349
00:17:56.119 --> 00:17:59.599
<v Speaker 2>The open-source space is incredibly vibrant right now.

350
00:18:00.000 --> 00:18:03.160
<v Speaker 1>OK, so open or closed, we have these incredibly powerful LLMs.

351
00:18:03.240 --> 00:18:07.319
<v Speaker 1>If they're like general purpose programmable machines, as some say,

352
00:18:08.279 --> 00:18:11.559
<v Speaker 1>how do we the users actually program them? How do

353
00:18:11.599 --> 00:18:12.680
<v Speaker 1>we tell them what we want?

354
00:18:12.839 --> 00:18:15.319
<v Speaker 2>That's the art and science of prompt engineering. It's all

355
00:18:15.319 --> 00:18:18.640
<v Speaker 2>about designing and refining the input, the prompt that you

356
00:18:18.680 --> 00:18:21.079
<v Speaker 2>give to the model to guide it towards the output

357
00:18:21.119 --> 00:18:21.480
<v Speaker 2>you need.

358
00:18:21.759 --> 00:18:23.559
<v Speaker 1>So the prompt is like the code we write for

359
00:18:23.599 --> 00:18:24.599
<v Speaker 1>the LLM?

360
00:18:24.359 --> 00:18:27.680
<v Speaker 2>In a way, yeah. You're essentially reprogramming the model's behavior

361
00:18:27.720 --> 00:18:30.880
<v Speaker 2>on the fly, just using natural language instructions. It's becoming

362
00:18:30.880 --> 00:18:33.160
<v Speaker 2>a crucial skill for anyone working with these models.

363
00:18:33.200 --> 00:18:35.400
<v Speaker 1>And it's not just writing one prompt and being done, right?

364
00:18:35.480 --> 00:18:37.519
<v Speaker 1>You mentioned it's iterative.

365
00:18:37.559 --> 00:18:39.680
<v Speaker 2>Totally iterative. You design a prompt, you test it, you see what

366
00:18:39.720 --> 00:18:41.880
<v Speaker 2>the model gives back, you evaluate that output, and then

367
00:18:41.880 --> 00:18:45.440
<v Speaker 2>you refine the prompt based on the results. Lather, rinse, repeat.

368
00:18:45.680 --> 00:18:48.880
<v Speaker 1>Okay, so what goes into a well structured prompt? What

369
00:18:48.920 --> 00:18:49.839
<v Speaker 1>are the key pieces?

370
00:18:50.680 --> 00:18:53.440
<v Speaker 2>There are a few core components to think about. First,

371
00:18:53.720 --> 00:18:56.960
<v Speaker 2>you often have system instructions, or a system prompt. This

372
00:18:57.039 --> 00:19:01.079
<v Speaker 2>sets the stage, defines the LLM's persona or overall behavior

373
00:19:01.079 --> 00:19:04.559
<v Speaker 2>for the conversation, like "You are a helpful assistant who

374
00:19:04.559 --> 00:19:09.359
<v Speaker 2>explains complex topics simply." This usually persists across multiple turns.

375
00:19:09.079 --> 00:19:12.119
<v Speaker 1>So setting the ground rules.

376
00:19:12.400 --> 00:19:14.200
<v Speaker 2>Exactly. Then you have the main prompt template, which is the

377
00:19:14.279 --> 00:19:17.799
<v Speaker 2>user facing instruction, often with placeholders where specific input will go.

378
00:19:18.240 --> 00:19:20.480
<v Speaker 2>You also need to consider the LLM parameters, things like

379
00:19:20.720 --> 00:19:21.960
<v Speaker 2>temperature.

380
00:19:22.000 --> 00:19:22.799
<v Speaker 1>Temperature? What does that control?

381
00:19:23.000 --> 00:19:26.359
<v Speaker 2>Temperature controls the randomness of the output. Higher temperature means

382
00:19:26.359 --> 00:19:31.279
<v Speaker 2>more randomness, more creativity, maybe more unexpected results. Lower temperature

383
00:19:31.319 --> 00:19:35.680
<v Speaker 2>makes the output more focused and deterministic, sticking closer to the

384
00:19:35.720 --> 00:19:36.720
<v Speaker 2>most probable words.

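NOTE
A small sketch of what temperature does to next-token probabilities, assuming
the usual approach of dividing the logits by the temperature before the softmax.
The logits below are made-up values for three candidate tokens.
```python
import math
def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
```
At low temperature nearly all probability sits on the top token, so sampling is
close to deterministic; at high temperature the alternatives get a real chance.
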
385
00:19:36.880 --> 00:19:39.119
<v Speaker 1>Okay, creativity versus predictability.

386
00:19:39.440 --> 00:19:42.920
<v Speaker 2>Right, And you might set completion tokens to limit the

387
00:19:42.960 --> 00:19:48.000
<v Speaker 2>output length. And importantly, there are usually safeguards or guardrails

388
00:19:48.039 --> 00:19:50.960
<v Speaker 2>in place, either built into the model or added around

389
00:19:50.960 --> 00:19:54.359
<v Speaker 2>it to prevent it from generating harmful, biased or inappropriate

390
00:19:54.400 --> 00:19:55.440
<v Speaker 2>content.

391
00:19:55.640 --> 00:19:59.279
<v Speaker 1>Makes sense. So beyond the structure, what makes a prompt effective? Any

392
00:19:59.359 --> 00:20:00.200
<v Speaker 1>general strategies?

393
00:20:00.839 --> 00:20:04.240
<v Speaker 2>Clarity and specificity are key. Be really clear about what

394
00:20:04.279 --> 00:20:07.839
<v Speaker 2>you want, don't be vague. If it's a complex task,

395
00:20:08.039 --> 00:20:10.799
<v Speaker 2>break it down into smaller, simpler steps within the prompt.

396
00:20:11.200 --> 00:20:12.559
<v Speaker 2>Tell the model how you want it to approach the problem.

397
00:20:12.559 --> 00:20:15.319
<v Speaker 1>Step-by-step instructions.

398
00:20:15.880 --> 00:20:18.559
<v Speaker 2>Exactly. And another really powerful technique is few-shot prompting.

399
00:20:19.079 --> 00:20:21.880
<v Speaker 1>Ah, you mentioned that with GPT-2, giving examples.

400
00:20:22.039 --> 00:20:24.160
<v Speaker 2>Yes, instead of just telling the model what to do,

401
00:20:24.200 --> 00:20:26.039
<v Speaker 2>you show it a few examples of the input and the

402
00:20:26.119 --> 00:20:28.839
<v Speaker 2>output you want. This helps it grasp the desired

403
00:20:28.880 --> 00:20:32.279
<v Speaker 2>format, style, or reasoning pattern much more effectively than just

404
00:20:32.319 --> 00:20:33.359
<v Speaker 2>instructions alone.

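NOTE
A minimal sketch of assembling a few-shot prompt as described above. The
"Input:" and "Output:" labels and the helper name are assumptions chosen for
illustration; any consistent format works.
```python
def build_few_shot_prompt(instruction, examples, query):
    """Place worked input/output pairs before the real query so the model can copy the pattern."""
    lines = [instruction]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    # The final block is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved every minute of it.", "positive"), ("A complete waste of time.", "negative")],
    "The plot dragged but the acting was superb.",
)
```
The examples carry the format and style, so the instruction itself can stay short.
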
405
00:20:33.440 --> 00:20:36.480
<v Speaker 1>Okay, showing is better than telling. What about really complex

406
00:20:36.559 --> 00:20:39.200
<v Speaker 1>tasks that require, like, multi-step reasoning?

407
00:20:39.440 --> 00:20:42.079
<v Speaker 2>This is where the advanced prompting techniques come in, and

408
00:20:42.119 --> 00:20:46.079
<v Speaker 2>they are really quite clever. One major one is chain

409
00:20:46.119 --> 00:20:49.319
<v Speaker 2>of thought, or CoT, prompting.

410
00:20:49.279 --> 00:20:52.119
<v Speaker 1>Chain of thought, making it think step by step?

411
00:20:51.839 --> 00:20:55.960
<v Speaker 2>Precisely. You explicitly instruct the model to think

412
00:20:56.160 --> 00:20:59.000
<v Speaker 2>step by step or show its reasoning before giving the

413
00:20:59.039 --> 00:21:02.720
<v Speaker 2>final answer. For problems like math word problems or complex

414
00:21:02.759 --> 00:21:06.960
<v Speaker 2>logic puzzles, forcing it to articulate the intermediate steps dramatically

415
00:21:07.000 --> 00:21:09.920
<v Speaker 2>improves its accuracy. It's less likely to jump to a

416
00:21:09.960 --> 00:21:10.720
<v Speaker 2>wrong conclusion.

417
00:21:10.839 --> 00:21:13.200
<v Speaker 1>So you're making the reasoning process explicit.

418
00:21:12.799 --> 00:21:16.000
<v Speaker 2>Yes, and building on that, you have tree of

419
00:21:16.039 --> 00:21:19.640
<v Speaker 2>thought. Instead of just one chain, the model explores multiple

420
00:21:19.680 --> 00:21:23.519
<v Speaker 2>different reasoning paths or branches, like exploring different possibilities in parallel.

421
00:21:23.720 --> 00:21:26.240
<v Speaker 2>It then evaluates these different thoughts to pick the most

422
00:21:26.240 --> 00:21:29.319
<v Speaker 2>promising path to the solution. It's like enabling the model

423
00:21:29.319 --> 00:21:30.079
<v Speaker 2>to brainstorm.

424
00:21:30.200 --> 00:21:32.319
<v Speaker 1>Wow, okay, that sounds powerful. Any others?

425
00:21:32.359 --> 00:21:35.480
<v Speaker 2>There's ReAct, which stands for reason and act.

426
00:21:35.920 --> 00:21:39.720
<v Speaker 2>This technique combines the LLM's reasoning capabilities with the ability

427
00:21:39.720 --> 00:21:43.359
<v Speaker 2>to use external tools. Tools like what? Like a calculator,

428
00:21:43.559 --> 00:21:47.400
<v Speaker 2>a web search API, a code execution environment, a database lookup.

429
00:21:47.880 --> 00:21:50.599
<v Speaker 2>The LLM can reason about the problem, decide it needs

430
00:21:50.599 --> 00:21:54.079
<v Speaker 2>more information, generate an action like "search the web for

431
00:21:54.200 --> 00:21:57.160
<v Speaker 2>recent news on X," get the result back from the tool,

432
00:21:57.440 --> 00:22:01.039
<v Speaker 2>incorporate that information into its reasoning, and continue towards the

433
00:22:01.039 --> 00:22:04.119
<v Speaker 2>final answer. It allows LLMs to interact with the world

434
00:22:04.359 --> 00:22:06.559
<v Speaker 2>and access up to date information.

435
00:22:06.480 --> 00:22:09.279
<v Speaker 1>So it can go beyond its internal knowledge.

436
00:22:08.880 --> 00:22:12.359
<v Speaker 2>Exactly. And one more is self-consistency. Here, you run

437
00:22:12.400 --> 00:22:15.559
<v Speaker 2>the same prompt, often the chain-of-thought prompt, multiple

438
00:22:15.559 --> 00:22:19.400
<v Speaker 2>times with some randomness enabled, generating several different reasoning paths.

439
00:22:19.759 --> 00:22:21.839
<v Speaker 2>You then look at the final answers produced by each

440
00:22:21.920 --> 00:22:24.599
<v Speaker 2>path and choose the answer that appears most frequently or

441
00:22:24.640 --> 00:22:28.079
<v Speaker 2>consistently across the different reasoning attempts. It's like taking a

442
00:22:28.079 --> 00:22:32.079
<v Speaker 2>majority vote among different ways of thinking, which often boosts robustness,

443
00:22:32.359 --> 00:22:33.960
<v Speaker 2>especially for things like arithmetic.

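NOTE
A minimal sketch of the majority vote behind self-consistency. The sampler
below is a stand-in that replays canned answers; in practice each call would
be one sampled chain-of-thought run with nonzero temperature, reduced to its
final answer.
```python
from collections import Counter
def self_consistent_answer(sample_answer, n_samples=5):
    """Call the sampler n_samples times and return the most frequent final answer and its vote count."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes
# Hypothetical final answers from five independent reasoning paths.
canned = iter(["42", "41", "42", "42", "17"])
winner, votes = self_consistent_answer(lambda: next(canned), n_samples=5)
```
Individual paths can slip, but an error has to repeat across several samples
to win the vote, which is why this tends to help on arithmetic-style tasks.
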
444
00:22:34.119 --> 00:22:38.039
<v Speaker 1>These techniques sound incredibly powerful for unlocking more complex capabilities.

445
00:22:38.200 --> 00:22:41.400
<v Speaker 1>But prompt engineering can't be perfect, right? What are the

446
00:22:41.440 --> 00:22:43.119
<v Speaker 1>downsides or limitations?

447
00:22:43.160 --> 00:22:46.119
<v Speaker 2>Definitely not perfect. One big issue is that prompts can

448
00:22:46.160 --> 00:22:49.480
<v Speaker 2>be very brittle. A prompt crafted perfectly for one model

449
00:22:49.599 --> 00:22:52.480
<v Speaker 2>might completely fail or give weird results on another model,

450
00:22:52.880 --> 00:22:55.680
<v Speaker 2>or even a slightly updated version of the same model.

451
00:22:56.119 --> 00:22:57.720
<v Speaker 2>They don't always transfer well.

452
00:22:57.519 --> 00:23:00.279
<v Speaker 1>So you might need to constantly retune your prompts.

453
00:23:00.160 --> 00:23:03.079
<v Speaker 2>Which leads to the next point. Evaluation is hard. How

454
00:23:03.079 --> 00:23:06.319
<v Speaker 2>do you objectively measure if one prompt is better than another?

455
00:23:06.799 --> 00:23:10.319
<v Speaker 2>Especially for creative or complex tasks, there aren't always simple metrics,

456
00:23:10.759 --> 00:23:15.000
<v Speaker 2>and the iterative process of designing, testing, refining, it takes time

457
00:23:15.039 --> 00:23:18.480
<v Speaker 2>and compute resources, which means latency and costs can add up,

458
00:23:18.559 --> 00:23:19.680
<v Speaker 2>especially during development.

459
00:23:19.759 --> 00:23:22.799
<v Speaker 1>Right, and are there risks like people using prompts maliciously?

460
00:23:23.119 --> 00:23:26.839
<v Speaker 2>Yes, that's a growing concern known as adversarial prompting. Bad

461
00:23:26.880 --> 00:23:29.240
<v Speaker 2>actors try to craft prompts to trick the model into

462
00:23:29.279 --> 00:23:32.680
<v Speaker 2>bypassing its safety guidelines, so-called jailbreaks, or to

463
00:23:32.720 --> 00:23:37.200
<v Speaker 2>reveal sensitive information, prompt injection. Defending against these is

464
00:23:37.200 --> 00:23:38.359
<v Speaker 2>an ongoing challenge.

465
00:23:38.519 --> 00:23:41.880
<v Speaker 1>Okay, so prompt engineering is key, but it has its challenges.

466
00:23:42.640 --> 00:23:46.319
<v Speaker 1>Given all this, how are developers actually building applications that

467
00:23:46.480 --> 00:23:49.440
<v Speaker 1>use these LLMs in the real world? Are there specific

468
00:23:49.519 --> 00:23:50.640
<v Speaker 1>tools or frameworks?

469
00:23:50.720 --> 00:23:54.680
<v Speaker 2>Yeah, the ecosystem around LLMs is evolving rapidly. Frameworks like

470
00:23:54.759 --> 00:23:58.240
<v Speaker 2>LangChain have become really popular. They provide building blocks

471
00:23:58.240 --> 00:24:01.119
<v Speaker 2>and abstractions to make it easier to chain LLMs together,

472
00:24:01.559 --> 00:24:04.799
<v Speaker 2>connect them to other data sources, and manage the overall

473
00:24:04.799 --> 00:24:06.400
<v Speaker 2>application logic.

474
00:24:06.480 --> 00:24:08.559
<v Speaker 1>So LangChain helps orchestrate things.

475
00:24:08.599 --> 00:24:11.559
<v Speaker 2>Exactly. It simplifies common patterns, and one of the most important

476
00:24:11.559 --> 00:24:15.200
<v Speaker 2>patterns it helps implement is retrieval-augmented generation, or RAG.

477
00:24:15.359 --> 00:24:17.960
<v Speaker 1>RAG, right. You mentioned that helps with hallucinations and

478
00:24:18.039 --> 00:24:19.559
<v Speaker 1>keeping info current.

479
00:24:20.039 --> 00:24:22.359
<v Speaker 2>Precisely. LLMs are trained on a snapshot of data. They don't

480
00:24:22.400 --> 00:24:26.759
<v Speaker 2>inherently know your company's latest internal documents or real-time news. RAG

481
00:24:26.759 --> 00:24:30.440
<v Speaker 2>fixes this. The idea is when a user asks a question,

482
00:24:30.640 --> 00:24:33.599
<v Speaker 2>the system first retrieves relevant snippets of information from an

483
00:24:33.640 --> 00:24:38.119
<v Speaker 2>external knowledge base, maybe your company wiki, product manuals, recent reports, whatever.

484
00:24:38.599 --> 00:24:40.759
<v Speaker 2>This is often done using a vector store, which is

485
00:24:40.799 --> 00:24:42.559
<v Speaker 2>like a searchable database for text meaning.

486
00:24:42.720 --> 00:24:44.480
<v Speaker 1>So it finds relevant facts first.

487
00:24:44.839 --> 00:24:48.480
<v Speaker 2>Yes, Then it takes those retrieved snippets and augments the

488
00:24:48.519 --> 00:24:52.480
<v Speaker 2>original prompt, essentially stuffing that relevant information into the context

489
00:24:52.559 --> 00:24:55.440
<v Speaker 2>window it sends to the LLM. So the LLM gets

490
00:24:55.480 --> 00:24:58.440
<v Speaker 2>the user's question plus the relevant facts needed to answer

491
00:24:58.480 --> 00:24:59.759
<v Speaker 2>it accurately and currently.

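NOTE
A toy sketch of the retrieve-then-augment flow described above. Word overlap
stands in for the embedding similarity a vector store would compute, and the
documents, helper names, and prompt wording are all hypothetical.
```python
def tokens(text):
    """Lowercase the text and strip basic punctuation so words can be compared."""
    return {word.strip(".,?!") for word in text.lower().split()}
def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query."""
    query_words = tokens(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokens(doc)), reverse=True)
    return ranked[:k]
def augment(query, snippets):
    """Put the retrieved snippets into the prompt ahead of the user's question."""
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"
documents = [
    "The refund window is 30 days.",
    "Our office is in Berlin.",
    "Refund requests go through the support portal.",
]
question = "How long is the refund window?"
top = retrieve(question, documents, k=1)
prompt = augment(question, top)
```
The model only sees the augmented prompt, so answer quality depends heavily on
what the retrieval step surfaces.
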
492
00:25:00.079 --> 00:25:03.200
<v Speaker 1>Ah. So you're giving the LLM the specific knowledge it needs,

493
00:25:03.319 --> 00:25:04.039
<v Speaker 1>right when it needs it.

494
00:25:03.960 --> 00:25:08.039
<v Speaker 2>Exactly. It massively improves factual accuracy, reduces made

495
00:25:08.079 --> 00:25:11.000
<v Speaker 2>up answers, and lets you ground the LLM's responses in

496
00:25:11.039 --> 00:25:14.799
<v Speaker 2>specific trusted data sources without having to constantly retrain the

497
00:25:14.960 --> 00:25:18.640
<v Speaker 2>entire model. RAG is fundamental for most serious enterprise

498
00:25:19.039 --> 00:25:20.079
<v Speaker 2>LLM applications today.

499
00:25:20.240 --> 00:25:23.599
<v Speaker 1>Okay, RAG is huge for grounding responses, but what

500
00:25:23.640 --> 00:25:26.240
<v Speaker 1>about more complex interactions like a chatbot that needs to

501
00:25:26.240 --> 00:25:29.920
<v Speaker 1>remember the conversation history or applications with multiple steps and

502
00:25:29.920 --> 00:25:30.759
<v Speaker 1>branching logic?

503
00:25:31.119 --> 00:25:33.960
<v Speaker 2>For that kind of complexity, you need ways to manage

504
00:25:34.039 --> 00:25:38.079
<v Speaker 2>state, the memory of the interaction. This is where tools

505
00:25:38.119 --> 00:25:40.599
<v Speaker 2>like LangGraph, which builds on LangChain, come in.

506
00:25:41.160 --> 00:25:44.039
<v Speaker 2>LangGraph allows you to define your LLM application as

507
00:25:44.039 --> 00:25:47.640
<v Speaker 2>a graph, specifically a state graph. Each node in the

508
00:25:47.680 --> 00:25:51.039
<v Speaker 2>graph represents a function, which could involve calling an LLM,

509
00:25:51.160 --> 00:25:53.960
<v Speaker 2>using a tool, or just processing data, and the edges

510
00:25:54.039 --> 00:25:56.000
<v Speaker 2>represent the flow based on the current state.

511
00:25:56.079 --> 00:25:58.079
<v Speaker 1>So it's like a flow chart for the LLM application?

512
00:25:58.279 --> 00:26:01.960
<v Speaker 2>Kind of, yeah, but it's designed explicitly for building cyclical,

513
00:26:02.000 --> 00:26:04.839
<v Speaker 2>stateful applications. It lets you create agents that can have

514
00:26:04.920 --> 00:26:09.839
<v Speaker 2>multi-turn conversations, remember context, make decisions, loop, branch, basically

515
00:26:09.839 --> 00:26:13.839
<v Speaker 2>build much more sophisticated and robust applications than simple linear chains.

516
00:26:14.119 --> 00:26:16.640
<v Speaker 1>And you mentioned tools and agents earlier with ReAct. How

517
00:26:16.640 --> 00:26:18.480
<v Speaker 1>does that fit into building applications?

518
00:26:18.720 --> 00:26:22.279
<v Speaker 2>It's central to making LLMs truly useful beyond just text generation.

519
00:26:23.039 --> 00:26:25.920
<v Speaker 2>By giving an LLM access to tools like the Tavily

520
00:26:25.960 --> 00:26:29.319
<v Speaker 2>search results tool for web searches or custom tools for

521
00:26:29.359 --> 00:26:31.960
<v Speaker 2>your databases, and defining how it can use them, you

522
00:26:32.000 --> 00:26:35.119
<v Speaker 2>turn it into an agent. This agent can then autonomously

523
00:26:35.160 --> 00:26:37.759
<v Speaker 2>decide which tool to use, what input to give it,

524
00:26:37.880 --> 00:26:39.759
<v Speaker 2>and how to use the tool's output to achieve a

525
00:26:39.839 --> 00:26:42.559
<v Speaker 2>larger goal set by the user. LangChain and

526
00:26:42.559 --> 00:26:46.039
<v Speaker 2>LangGraph provide frameworks for building these agents, managing their state,

527
00:26:46.160 --> 00:26:49.759
<v Speaker 2>their reasoning loops, and their interactions with tools and humans.

528
00:26:49.640 --> 00:26:52.279
<v Speaker 1>So the LLM becomes less of a text generator and

529
00:26:52.359 --> 00:26:55.279
<v Speaker 1>more of a problem solver that can use external resources.

530
00:26:55.279 --> 00:26:59.279
<v Speaker 2>Exactly. It's about moving from passive generation to active task execution.

531
00:27:00.240 --> 00:27:03.319
<v Speaker 1>These models are clearly incredibly powerful, and the tools for

532
00:27:03.359 --> 00:27:06.039
<v Speaker 1>building with them are getting sophisticated, But we keep coming

533
00:27:06.039 --> 00:27:09.920
<v Speaker 1>back to the fact that they are massive, expensive, computationally hungry.

534
00:27:10.319 --> 00:27:12.240
<v Speaker 1>Why is optimizing them such a big focus?

535
00:27:12.440 --> 00:27:15.039
<v Speaker 2>Well several reasons. Scalability is one. We want to be

536
00:27:15.079 --> 00:27:17.680
<v Speaker 2>able to run these models for more users more efficiently.

537
00:27:18.359 --> 00:27:22.160
<v Speaker 2>Cost is obviously huge. Training and running billion parameter models

538
00:27:22.200 --> 00:27:24.680
<v Speaker 2>requires immense hardware investment and energy consumption.

539
00:27:24.400 --> 00:27:26.680
<v Speaker 1>And the environmental impact too, I guess.

540
00:27:26.720 --> 00:27:30.359
<v Speaker 2>Absolutely, that's increasingly part of the conversation. There's

541
00:27:30.400 --> 00:27:33.559
<v Speaker 2>also research like the Scaling Laws work from Kaplan and

542
00:27:33.599 --> 00:27:38.519
<v Speaker 2>others suggesting that performance scales predictably with model size, dataset

543
00:27:38.559 --> 00:27:42.880
<v Speaker 2>size, and compute. But crucially, follow-up work found many

544
00:27:42.960 --> 00:27:47.119
<v Speaker 2>large models are technically undertrained, meaning they are so large

545
00:27:47.359 --> 00:27:49.400
<v Speaker 2>that they haven't been trained for long enough on enough

546
00:27:49.480 --> 00:27:52.720
<v Speaker 2>data relative to their size to reach their full potential

547
00:27:52.759 --> 00:27:56.200
<v Speaker 2>within typical compute budgets. This implies there are gains to

548
00:27:56.240 --> 00:27:59.079
<v Speaker 2>be had by being smarter about training, not just bigger.

549
00:28:00.000 --> 00:28:02.440
<v Speaker 1>How do we get smarter? How do we optimize these

550
00:28:02.480 --> 00:28:04.720
<v Speaker 1>things, starting right from the pre-training phase?

551
00:28:05.119 --> 00:28:08.839
<v Speaker 2>One major trend is focusing on data efficiency. Instead of

552
00:28:08.880 --> 00:28:11.599
<v Speaker 2>just throwing trillions of tokens scraped from the internet at

553
00:28:11.680 --> 00:28:14.920
<v Speaker 2>the model, there's a growing emphasis on using higher quality,

554
00:28:15.160 --> 00:28:18.920
<v Speaker 2>carefully curated, and sometimes even synthetically generated data.

555
00:28:18.839 --> 00:28:20.960
<v Speaker 1>Like the Microsoft Phi models you mentioned.

556
00:28:21.160 --> 00:28:24.960
<v Speaker 2>Exactly. They use textbook-quality data and synthetic stories, TinyStories,

557
00:28:25.160 --> 00:28:28.319
<v Speaker 2>to train much smaller models that achieve surprisingly strong performance,

558
00:28:28.720 --> 00:28:33.279
<v Speaker 2>suggesting data quality can sometimes trump sheer quantity. Another huge

559
00:28:33.319 --> 00:28:37.000
<v Speaker 2>area is using lower numerical precision. Models are typically trained

560
00:28:37.079 --> 00:28:40.400
<v Speaker 2>using thirty-two-bit floating-point numbers, FP32,

561
00:28:40.640 --> 00:28:43.880
<v Speaker 2>but using mixed precision, combining sixteen-bit floats like

562
00:28:43.960 --> 00:28:47.680
<v Speaker 2>bfloat16 with FP32, or even quantization using

563
00:28:47.720 --> 00:28:51.039
<v Speaker 2>eight-bit or even four-bit integers, drastically reduces the

564
00:28:51.039 --> 00:28:55.160
<v Speaker 2>memory footprint and speeds up computation, often with minimal impact on accuracy.

565
00:28:54.799 --> 00:28:58.680
<v Speaker 1>Quantization. So using less precise numbers saves space

566
00:28:58.720 --> 00:28:59.319
<v Speaker 1>and time?

567
00:28:59.240 --> 00:29:01.160
<v Speaker 2>A lot of space and time, yes. You can do

568
00:29:01.240 --> 00:29:04.960
<v Speaker 2>post-training quantization, PTQ, where you quantize an already trained model,

569
00:29:05.039 --> 00:29:08.759
<v Speaker 2>or quantization-aware training, QAT, where you incorporate the quantization

570
00:29:08.839 --> 00:29:11.839
<v Speaker 2>process during training to potentially get better accuracy. This lets

571
00:29:11.880 --> 00:29:14.000
<v Speaker 2>you run much larger models on the same hardware.

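NOTE
A minimal sketch of post-training quantization on a handful of weights,
assuming symmetric per-tensor int8 quantization with a single scale factor.
Real toolkits use more elaborate schemes; the weights here are made up.
```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale
def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]
weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```
Each weight now needs one byte instead of four, at the cost of a small
rounding error; very small weights such as 0.003 can round to zero.
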
572
00:29:14.119 --> 00:29:17.400
<v Speaker 1>Okay, data quality and number formats. What about the model

573
00:29:17.519 --> 00:29:20.119
<v Speaker 1>architecture itself? Can we make attention more efficient?

574
00:29:20.519 --> 00:29:25.839
<v Speaker 2>Yes, that's critical because standard self attention has quadratic complexity.

575
00:29:25.960 --> 00:29:28.400
<v Speaker 2>The compute grows with the square of the sequence length

576
00:29:28.720 --> 00:29:32.400
<v Speaker 2>O(n^2). For very long sequences, that becomes a bottleneck.

577
00:29:32.759 --> 00:29:36.920
<v Speaker 2>So researchers have developed various efficient attention mechanisms, things like

578
00:29:37.000 --> 00:29:39.920
<v Speaker 2>sparse attention where each token only attends to a subset

579
00:29:39.920 --> 00:29:44.200
<v Speaker 2>of other tokens, or methods that approximate attention using linear complexity.

580
00:29:44.759 --> 00:29:47.480
<v Speaker 2>And then there's flash attention, which doesn't change the math

581
00:29:47.519 --> 00:29:51.480
<v Speaker 2>of attention, but cleverly optimizes its implementation to be much

582
00:29:51.519 --> 00:29:55.240
<v Speaker 2>faster on modern GPUs by minimizing slow memory reads and writes.

583
00:29:55.559 --> 00:29:57.440
<v Speaker 2>It's become almost standard now.

584
00:29:57.519 --> 00:30:01.640
<v Speaker 1>So optimizing the core attention calculation. Are there entirely different architectures

585
00:30:01.640 --> 00:30:03.279
<v Speaker 1>emerging too?

586
00:30:03.079 --> 00:30:06.880
<v Speaker 2>Yes, things like Linformer, Perceiver IO, and we're also seeing architectures

587
00:30:06.920 --> 00:30:11.240
<v Speaker 2>designed specifically for multimodal inputs right from the start. Efficiency

588
00:30:11.279 --> 00:30:13.160
<v Speaker 2>is being baked into the design process now.

589
00:30:13.359 --> 00:30:16.200
<v Speaker 1>Okay, so we've made pre training more efficient. What about

590
00:30:16.200 --> 00:30:18.599
<v Speaker 1>when we have a massive pre trained model and just

591
00:30:18.640 --> 00:30:21.119
<v Speaker 1>want to adapt it to a new specific task. We

592
00:30:21.160 --> 00:30:22.440
<v Speaker 1>don't want to retrain everything.

593
00:30:22.720 --> 00:30:26.720
<v Speaker 2>That's where parameter-efficient fine-tuning, or PEFT, techniques are

594
00:30:26.839 --> 00:30:30.640
<v Speaker 2>absolutely essential. The goal is to adapt the model effectively

595
00:30:30.960 --> 00:30:34.319
<v Speaker 2>while only updating a very small percentage of its total parameters.

596
00:30:34.480 --> 00:30:35.680
<v Speaker 1>Why is that so beneficial?

597
00:30:35.960 --> 00:30:38.920
<v Speaker 2>It massively reduces the compute cost and time needed for

598
00:30:39.000 --> 00:30:42.240
<v Speaker 2>fine tuning. It requires much less memory, meaning you can

599
00:30:42.279 --> 00:30:46.119
<v Speaker 2>fine tune larger models on less powerful hardware, And importantly,

600
00:30:46.200 --> 00:30:48.240
<v Speaker 2>you only need to store the small number of changed

601
00:30:48.279 --> 00:30:51.279
<v Speaker 2>parameters for each task, not a full copy of the

602
00:30:51.359 --> 00:30:54.440
<v Speaker 2>huge model, which saves enormous amounts of storage.

603
00:30:54.559 --> 00:30:57.680
<v Speaker 1>Okay, so how do these PEFT methods work? What are

604
00:30:57.680 --> 00:30:58.400
<v Speaker 1>some examples?

605
00:30:58.599 --> 00:31:02.200
<v Speaker 2>One early approach was prompt tuning. Instead of tuning the model's weights,

606
00:31:02.240 --> 00:31:05.160
<v Speaker 2>you add a small number of learnable virtual tokens to

607
00:31:05.200 --> 00:31:09.000
<v Speaker 2>the input embedding layer and only train those. But perhaps

608
00:31:09.000 --> 00:31:11.720
<v Speaker 2>the most popular method right now is LoRA, which stands

609
00:31:11.759 --> 00:31:13.640
<v Speaker 2>for low-rank adaptation.

610
00:31:13.720 --> 00:31:14.440
<v Speaker 1>LoRA. How does that work?

611
00:31:14.759 --> 00:31:17.240
<v Speaker 2>LoRA works on the insight that the change needed to

612
00:31:17.279 --> 00:31:19.960
<v Speaker 2>adapt a pre trained model often lies in a low

613
00:31:20.000 --> 00:31:24.480
<v Speaker 2>dimensional subspace. So instead of updating the massive weight matrices directly,

614
00:31:24.920 --> 00:31:29.720
<v Speaker 2>LoRA injects pairs of small, trainable low-rank matrices alongside

615
00:31:29.720 --> 00:31:33.519
<v Speaker 2>the original frozen weights. During fine tuning, you only train

616
00:31:33.559 --> 00:31:37.880
<v Speaker 2>these small injected matrices. The original weights remain untouched. Because

617
00:31:37.880 --> 00:31:41.079
<v Speaker 2>these matrices are small, the number of trainable parameters is tiny,

618
00:31:41.160 --> 00:31:43.799
<v Speaker 2>often less than point one percent or even point zero

619
00:31:43.839 --> 00:31:47.000
<v Speaker 2>one percent of the total model size. Yet it performs

620
00:31:47.079 --> 00:31:50.440
<v Speaker 2>remarkably well, often matching full fine tuning performance.
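
NOTE
A minimal sketch of the LoRA idea discussed here, not an actual library API. The function names and the 4096 x 4096 matrix with rank 8 are assumptions chosen only for illustration. It compares trainable-parameter counts and shows a forward pass in which the frozen weight W is combined with the product of two small trainable matrices B and A.
```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating W directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]
def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x while W itself stays frozen."""
    column = [[v] for v in x]
    base = matmul(W, column)                # frozen pre-trained path
    update = matmul(matmul(B, A), column)   # low-rank trainable path
    return [b[0] + scale * u[0] for b, u in zip(base, update)]
# Hypothetical sizes: one 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)
```
For this one hypothetical matrix the adapter holds 65,536 values against 16,777,216 in W, roughly 0.4 percent. The whole-model fraction can be lower when adapters are attached to only some of the matrices.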

621
00:31:50.880 --> 00:31:54.000
<v Speaker 1>Wow, only training a tiny fraction but getting similar results.

622
00:31:54.039 --> 00:31:55.400
<v Speaker 1>That's huge. It really is.

623
00:31:55.480 --> 00:31:58.279
<v Speaker 2>And you can even combine techniques, like QLoRA, which applies

624
00:31:58.359 --> 00:32:01.440
<v Speaker 2>LoRA on top of a quantized, e.g. 4-bit, base model,

625
00:32:01.720 --> 00:32:04.960
<v Speaker 2>making fine tuning incredibly efficient in terms of memory usage.

626
00:32:05.039 --> 00:32:08.720
<v Speaker 1>So PEFT methods like LoRA make adapting models much more practical. Now,

627
00:32:08.720 --> 00:32:10.559
<v Speaker 1>once the model is trained and fine tuned, how do

628
00:32:10.599 --> 00:32:13.240
<v Speaker 1>we make it actually respond faster during inference when a

629
00:32:13.319 --> 00:32:14.519
<v Speaker 1>user is waiting for an answer?

630
00:32:14.720 --> 00:32:18.880
<v Speaker 2>Right. Inference optimization is crucial for user experience. Several techniques

631
00:32:18.880 --> 00:32:22.039
<v Speaker 2>help here: offloading parts of the model to CPU or

632
00:32:22.200 --> 00:32:25.559
<v Speaker 2>disk if GPU memory is tight, sharding the model across

633
00:32:25.599 --> 00:32:29.319
<v Speaker 2>multiple GPUs. Batch inference is a big one. Instead of

634
00:32:29.359 --> 00:32:32.559
<v Speaker 2>processing one user query at a time, you group multiple

635
00:32:32.599 --> 00:32:36.440
<v Speaker 2>queries together into a batch and process them simultaneously to

636
00:32:36.480 --> 00:32:39.200
<v Speaker 2>better utilize the parallel processing power of the hardware.

637
00:32:39.279 --> 00:32:41.119
<v Speaker 1>Processing requests in parallel.

638
00:32:41.200 --> 00:32:45.359
<v Speaker 2>Yes, and for generating text sequentially with transformers, KV caching

639
00:32:45.440 --> 00:32:49.000
<v Speaker 2>is absolutely vital. KV caching, what's that? In a transformer decoder,

640
00:32:49.240 --> 00:32:51.680
<v Speaker 2>to generate the next word, the model needs to compute

641
00:32:51.680 --> 00:32:55.519
<v Speaker 2>attention over all the previous words. This involves calculating

642
00:32:55.599 --> 00:33:00.400
<v Speaker 2>key (K) and value (V) tensors for each word. Caching

643
00:33:00.480 --> 00:33:03.799
<v Speaker 2>simply stores these calculated K and V tensors from previous steps,

644
00:33:04.400 --> 00:33:07.240
<v Speaker 2>so when generating the next word, the model doesn't need

645
00:33:07.279 --> 00:33:09.359
<v Speaker 2>to recompute all the keys and values for the words

646
00:33:09.359 --> 00:33:13.160
<v Speaker 2>it's already processed. It just reuses the cached ones. This

647
00:33:13.240 --> 00:33:16.519
<v Speaker 2>dramatically speeds up the generation process, especially for long sequences,

648
00:33:16.759 --> 00:33:18.599
<v Speaker 2>because most of the computation is reused.

649
00:33:18.200 --> 00:33:22.119
<v Speaker 1>Avoiding redundant calculations. Clever.

650
00:33:21.519 --> 00:33:23.640
<v Speaker 2>It makes a huge difference to latency.
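
NOTE
A minimal sketch of KV caching as described above, assuming a single attention head and toy projection functions that the caller passes in. The class and function names are hypothetical and do not come from any library. Each step projects only the newest token and reuses the stored K and V vectors for everything earlier.
```python
import math
def attend(q, keys, values):
    """Softmax attention of one query over the stored keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]
class KVCache:
    """Keeps the K and V vectors of tokens that were already processed."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # how many tokens have been projected so far
    def step(self, token_vec, project_q, project_k, project_v):
        # Only the newest token is projected; older K and V are reused as is.
        self.keys.append(project_k(token_vec))
        self.values.append(project_v(token_vec))
        self.projections += 1
        return attend(project_q(token_vec), self.keys, self.values)
```
Without the cache, step t would recompute K and V for all t tokens seen so far, so the total projection work grows quadratically with sequence length instead of linearly.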

651
00:33:23.759 --> 00:33:27.160
<v Speaker 1>So looking ahead, then, with all this focus on efficiency

652
00:33:27.200 --> 00:33:29.920
<v Speaker 1>and new techniques, what are some of the emerging trends

653
00:33:29.920 --> 00:33:31.359
<v Speaker 1>we should really be keeping an eye on?

654
00:33:31.480 --> 00:33:35.079
<v Speaker 2>Well, definitely the exploration of alternative architectures beyond the transformer.

655
00:33:35.480 --> 00:33:39.119
<v Speaker 2>We're seeing intriguing results from models like Mamba based on

656
00:33:39.160 --> 00:33:43.559
<v Speaker 2>state space models, which achieves strong performance potentially without attention's

657
00:33:43.640 --> 00:33:47.960
<v Speaker 2>quadratic complexity, and things like RWKV, which tries to combine

658
00:33:48.000 --> 00:33:51.000
<v Speaker 2>the best of RNNs' efficiency for long sequences

659
00:33:51.279 --> 00:33:53.599
<v Speaker 2>and transformers' parallel training.

660
00:33:53.960 --> 00:33:57.119
<v Speaker 1>So maybe the reign of the transformer isn't absolute.

661
00:33:57.000 --> 00:34:01.000
<v Speaker 2>It's being challenged, certainly. Another big trend is specialized hardware. We're

662
00:34:01.039 --> 00:34:05.599
<v Speaker 2>seeing more dedicated AI accelerators, NPUs (neural processing units), being

663
00:34:05.599 --> 00:34:09.599
<v Speaker 2>built into chips, alongside efforts to optimize AI software for

664
00:34:09.679 --> 00:34:14.239
<v Speaker 2>existing hardware using frameworks like Apple's Metal Performance Shaders (MPS)

665
00:34:14.360 --> 00:34:17.440
<v Speaker 2>or WebGPU for running models in browsers. And maybe

666
00:34:17.480 --> 00:34:20.760
<v Speaker 2>the most exciting, practically speaking, is the rise of really

667
00:34:20.800 --> 00:34:24.000
<v Speaker 2>capable small foundational models or SLMs.

668
00:34:23.760 --> 00:34:26.480
<v Speaker 1>Like the Phi models again. Small but mighty.

669
00:34:26.480 --> 00:34:30.480
<v Speaker 2>Exactly. Models like Phi-2 or Phi-3 demonstrate that

670
00:34:30.559 --> 00:34:34.920
<v Speaker 2>by using extremely high quality, carefully curated, and often synthetic data,

671
00:34:35.360 --> 00:34:39.360
<v Speaker 2>you can achieve performance comparable to much much larger models,

672
00:34:39.719 --> 00:34:43.639
<v Speaker 2>but with drastically less compute, memory, and cost. This could

673
00:34:43.719 --> 00:34:47.440
<v Speaker 2>democratize access to powerful AI capabilities significantly.

674
00:34:47.360 --> 00:34:50.719
<v Speaker 1>Right, making powerful AI runnable on, say, a laptop or even

675
00:34:50.760 --> 00:34:51.280
<v Speaker 1>a phone.

676
00:34:51.320 --> 00:34:54.199
<v Speaker 2>That's the direction things are heading. Efficiency across the board,

677
00:34:54.280 --> 00:34:57.599
<v Speaker 2>data, architecture, hardware, fine tuning, is the name of the

678
00:34:57.599 --> 00:34:58.199
<v Speaker 2>game right now.

679
00:34:58.360 --> 00:35:00.840
<v Speaker 1>Okay, this has been fascinating on the tech, but we

680
00:35:00.960 --> 00:35:04.119
<v Speaker 1>started this whole deep dive talking about AI art. Let's

681
00:35:04.119 --> 00:35:07.679
<v Speaker 1>swing back to images. How do machines generate pictures? What

682
00:35:07.719 --> 00:35:09.159
<v Speaker 1>are the key generative models?

683
00:35:09.199 --> 00:35:13.119
<v Speaker 2>There, two main families really dominated early on: variational

684
00:35:13.199 --> 00:35:17.320
<v Speaker 2>autoencoders (VAEs) and generative adversarial networks (GANs).

685
00:35:17.440 --> 00:35:21.159
<v Speaker 1>You mentioned them briefly. Autoencoders suggest encoding and decoding.

686
00:35:20.960 --> 00:35:25.440
<v Speaker 2>Precisely. A VAE learns two things: an encoder that compresses

687
00:35:25.480 --> 00:35:29.159
<v Speaker 2>an input image down into a compact latent vector. Think

688
00:35:29.159 --> 00:35:31.800
<v Speaker 2>of it as capturing the essential features, or as a sort

689
00:35:31.800 --> 00:35:34.639
<v Speaker 2>of barcode for the image, and a decoder that takes

690
00:35:34.639 --> 00:35:37.400
<v Speaker 2>a vector from that latent space and reconstructs an image.

691
00:35:37.719 --> 00:35:41.559
<v Speaker 2>The variational part is a clever mathematical trick using variational

692
00:35:41.599 --> 00:35:45.119
<v Speaker 2>inference and the reparameterization trick to make this latent space

693
00:35:45.280 --> 00:35:48.559
<v Speaker 2>smooth and continuous, so you can sample new points from

694
00:35:48.559 --> 00:35:51.960
<v Speaker 2>it and decode them into novel, realistic-looking images that

695
00:35:52.039 --> 00:35:53.440
<v Speaker 2>resemble the training data.
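
NOTE
A minimal sketch of the reparameterization trick mentioned above, assuming the encoder outputs a mean and a log-variance per latent dimension. The function name and the example numbers are hypothetical. The sample is written as z = mu + sigma * eps with eps drawn from a standard normal, so all the randomness sits in eps.
```python
import math
import random
def reparameterize(mu, log_var, eps=None):
    """Sample z = mu + sigma * eps, where sigma = exp(0.5 * log_var)."""
    if eps is None:
        # eps carries all the randomness, so mu and log_var remain
        # ordinary deterministic inputs that gradients can flow through.
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
# Hypothetical two-dimensional latent code produced by an encoder.
z = reparameterize([0.5, -1.0], [0.0, -2.0])
print(z)
```
To generate a new image after training, one would draw a latent vector directly from the prior and pass it through the decoder.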

696
00:35:53.320 --> 00:35:56.400
<v Speaker 1>So it learns a compressed representation and can generate from

697
00:35:56.440 --> 00:35:59.440
<v Speaker 1>that space. What about GANs? You said they're like a

698
00:35:59.519 --> 00:36:00.559
<v Speaker 1>game.

699
00:36:00.679 --> 00:36:03.679
<v Speaker 2>Exactly. GANs involve two neural networks competing against each other. You

700
00:36:03.679 --> 00:36:06.280
<v Speaker 2>have the generator, whose job is to create fake images

701
00:36:06.320 --> 00:36:09.400
<v Speaker 2>that look real, and you have the discriminator, whose job

702
00:36:09.440 --> 00:36:11.079
<v Speaker 2>is to look at an image and decide if it's

703
00:36:11.119 --> 00:36:13.920
<v Speaker 2>real (from the training set) or fake (made by the generator).

704
00:36:13.960 --> 00:36:16.360
<v Speaker 1>A forger and a detective.

705
00:36:16.760 --> 00:36:20.760
<v Speaker 2>Perfect analogy. They train together. The generator gets better at fooling the discriminator,

706
00:36:21.039 --> 00:36:24.880
<v Speaker 2>and the discriminator gets better at spotting fakes. This adversarial

707
00:36:24.960 --> 00:36:29.199
<v Speaker 2>process pushes the generator to produce increasingly realistic and high

708
00:36:29.239 --> 00:36:30.119
<v Speaker 2>quality images.

709
00:36:30.480 --> 00:36:33.000
<v Speaker 1>But GANs had some issues too, right? Like being hard

710
00:36:33.039 --> 00:36:33.400
<v Speaker 1>to train.

711
00:36:33.519 --> 00:36:36.480
<v Speaker 2>They can be notoriously tricky to train. Sometimes the training

712
00:36:36.519 --> 00:36:39.400
<v Speaker 2>is unstable, or the generator might suffer from mode collapse,

713
00:36:39.440 --> 00:36:41.440
<v Speaker 2>where it gets stuck producing only a few types of

714
00:36:41.480 --> 00:36:45.400
<v Speaker 2>images and fails to capture the diversity of the real data. However,

715
00:36:45.639 --> 00:36:48.519
<v Speaker 2>lots of variations were developed to address these issues, like

716
00:36:48.719 --> 00:36:53.719
<v Speaker 2>deep convolutional GANs (DCGANs) for better image quality, conditional

717
00:36:53.760 --> 00:36:56.400
<v Speaker 2>GANs (cGANs), where you can control the output by

718
00:36:56.400 --> 00:37:00.920
<v Speaker 2>providing extra information like a class label, and progressive GANs (ProGANs),

719
00:37:01.239 --> 00:37:04.760
<v Speaker 2>which achieved amazing high resolution results by training the generator

720
00:37:04.800 --> 00:37:08.280
<v Speaker 2>and discriminator gradually on increasingly larger image sizes.

721
00:37:08.960 --> 00:37:12.360
<v Speaker 1>Okay, so VAEs and GANs were foundational, and GANs, you

722
00:37:12.400 --> 00:37:14.599
<v Speaker 1>mentioned, are particularly good at style.

723
00:37:15.000 --> 00:37:19.639
<v Speaker 2>Yes, GANs really excel at style transfer, taking the content

724
00:37:19.679 --> 00:37:22.559
<v Speaker 2>of one image and rendering it in the artistic style

725
00:37:22.599 --> 00:37:25.960
<v Speaker 2>of another, turning your photo into a Monet painting, for instance.

726
00:37:25.960 --> 00:37:27.480
<v Speaker 1>How do they do that, especially if you don't have

727
00:37:27.559 --> 00:37:30.719
<v Speaker 1>paired examples like the exact same scene painted by Monet.

728
00:37:30.840 --> 00:37:34.119
<v Speaker 2>That's where CycleGAN was a brilliant innovation. It enables

729
00:37:34.280 --> 00:37:37.960
<v Speaker 2>unpaired image-to-image translation. You don't need photos of

730
00:37:38.000 --> 00:37:41.159
<v Speaker 2>horses perfectly matched with photos of zebras to learn how

731
00:37:41.159 --> 00:37:42.639
<v Speaker 2>to turn horses into zebras.

732
00:37:42.800 --> 00:37:45.039
<v Speaker 1>So how does it learn without matched pairs?

733
00:37:45.719 --> 00:37:49.559
<v Speaker 2>It uses a clever concept called cycle consistency loss. The idea

734
00:37:49.760 --> 00:37:53.119
<v Speaker 2>is if you translate an image from domain A, say horses,

735
00:37:53.199 --> 00:37:56.480
<v Speaker 2>to domain B, zebras, and then translate that result back

736
00:37:56.480 --> 00:37:58.920
<v Speaker 2>from domain B to domain A, you should get something

737
00:37:59.000 --> 00:38:00.320
<v Speaker 2>very close to your original image.

738
00:38:00.400 --> 00:38:01.960
<v Speaker 1>Ah, the round trip should bring you back home.

739
00:38:02.000 --> 00:38:05.920
<v Speaker 2>Exactly. This constraint forces the model to learn

740
00:38:06.000 --> 00:38:10.480
<v Speaker 2>meaningful translation mappings without needing perfectly aligned pairs. It also

741
00:38:10.639 --> 00:38:13.840
<v Speaker 2>uses discriminators in both domains and often an identity loss

742
00:38:14.119 --> 00:38:17.599
<v Speaker 2>to encourage the generator to preserve color and composition where appropriate.

743
00:38:18.159 --> 00:38:21.719
<v Speaker 2>CycleGAN opened up huge possibilities for creative image transformations.
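
NOTE
A minimal sketch of the cycle consistency idea described above, assuming images are flattened to lists of numbers and the two generators are plain functions. The names g_ab and g_ba are hypothetical stand-ins for the A-to-B and B-to-A generators, and the mean absolute error is one common choice of distance.
```python
def l1(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """Round-trip error for A to B to A plus B to A to B."""
    forward = l1(g_ba(g_ab(x_a)), x_a)   # horse to zebra and back to horse
    backward = l1(g_ab(g_ba(x_b)), x_b)  # zebra to horse and back to zebra
    return forward + backward
# Toy generators that happen to be exact inverses, so the loss is zero.
to_b = lambda v: [x + 1.0 for x in v]
to_a = lambda v: [x - 1.0 for x in v]
print(cycle_consistency_loss([0.0, 2.0], [1.0, 3.0], to_b, to_a))
```
In training, this term would be added to the adversarial losses from the two discriminators, so the generators are pushed both to look realistic and to stay invertible.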

744
00:38:21.920 --> 00:38:25.760
<v Speaker 1>That's really clever. But this ability to manipulate images and

745
00:38:25.880 --> 00:38:29.360
<v Speaker 1>videos so convincingly, it leads directly to the topic of

746
00:38:29.440 --> 00:38:30.880
<v Speaker 1>deepfakes, right?

747
00:38:31.360 --> 00:38:35.880
<v Speaker 2>It does. Deepfakes are essentially AI-generated or manipulated media, typically

748
00:38:36.000 --> 00:38:39.920
<v Speaker 2>video or images, where a person's likeness is replaced or

749
00:38:39.960 --> 00:38:41.039
<v Speaker 2>altered convincingly.

750
00:38:41.199 --> 00:38:45.079
<v Speaker 1>We've seen some pretty amazing and maybe sometimes concerning examples.

751
00:38:45.320 --> 00:38:48.760
<v Speaker 1>There are creative uses, like that Dalí Museum exhibit that brings

752
00:38:48.800 --> 00:38:52.000
<v Speaker 1>Salvador Dalí back to life, or campaigns like David Beckham

753
00:38:52.000 --> 00:38:54.639
<v Speaker 1>appearing to speak multiple languages fluently.

754
00:38:54.320 --> 00:38:57.719
<v Speaker 2>Yeah, and even things like AI-generated fashion models. The

755
00:38:57.800 --> 00:39:01.000
<v Speaker 2>technology itself can be used for productive or entertaining purposes.

756
00:39:01.079 --> 00:39:03.400
<v Speaker 1>How do they typically work? What are the main techniques?

757
00:39:03.880 --> 00:39:06.679
<v Speaker 2>Broadly, you can think of three modes: replacement, or swapping,

758
00:39:06.719 --> 00:39:09.360
<v Speaker 2>where one person's face is grafted onto another's body in

759
00:39:09.400 --> 00:39:12.280
<v Speaker 2>a video; reenactment, where you take a source video of

760
00:39:12.320 --> 00:39:15.840
<v Speaker 2>someone and use it to control the facial expressions, pose, gaze,

761
00:39:15.920 --> 00:39:18.480
<v Speaker 2>or mouth movements of a target person in another video.

762
00:39:19.119 --> 00:39:22.639
<v Speaker 2>And editing, where attributes like hair, color, age, or expression

763
00:39:22.960 --> 00:39:25.480
<v Speaker 2>are modified on an existing image or video.

764
00:39:25.679 --> 00:39:29.159
<v Speaker 1>And this relies on the AI really understanding faces in detail.

765
00:39:29.440 --> 00:39:33.480
<v Speaker 2>Absolutely. Deepfake generation relies heavily on accurately detecting and

766
00:39:33.559 --> 00:39:37.519
<v Speaker 2>modeling facial features. This often involves using standardized systems like

767
00:39:37.559 --> 00:39:41.519
<v Speaker 2>the Facial Action Coding System (FACS) to describe muscle movements,

768
00:39:42.039 --> 00:39:45.400
<v Speaker 2>using 3D morphable models (3DMMs) to represent face

769
00:39:45.440 --> 00:39:49.639
<v Speaker 2>shape and texture, or extracting precise facial landmarks, typically

770
00:39:49.679 --> 00:39:52.480
<v Speaker 2>68 key points on the face, identified using libraries like

771
00:39:52.559 --> 00:39:58.360
<v Speaker 2>dlib or MTCNN. The AI learns to manipulate these underlying representations.

772
00:39:57.559 --> 00:40:00.639
<v Speaker 1>But creating perfect deepfakes is still hard, right? What

773
00:40:00.679 --> 00:40:01.920
<v Speaker 1>are the technical challenges?

774
00:40:02.000 --> 00:40:05.679
<v Speaker 2>Definitely. Generalization is a big one. Models trained on certain

775
00:40:05.760 --> 00:40:09.000
<v Speaker 2>data sets might fail or produce weird artifacts when faced

776
00:40:09.000 --> 00:40:13.440
<v Speaker 2>with unseen lighting conditions, extreme angles, or different identities. Occlusions,

777
00:40:13.440 --> 00:40:15.559
<v Speaker 2>when parts of the face are blocked by hands, hair,

778
00:40:15.679 --> 00:40:19.239
<v Speaker 2>or objects, are really difficult to handle realistically. And maintaining

779
00:40:19.280 --> 00:40:24.119
<v Speaker 2>temporal consistency across video frames, avoiding flickering or unnatural transitions,

780
00:40:24.400 --> 00:40:28.039
<v Speaker 2>is a constant challenge. It's getting better, but artifacts are often still detectable.

781
00:40:27.800 --> 00:40:31.000
<v Speaker 1>And beyond the technical hurdles, there are obviously

782
00:40:31.079 --> 00:40:32.960
<v Speaker 1>significant ethical concerns too.

783
00:40:33.159 --> 00:40:38.679
<v Speaker 2>Huge concerns: misinformation, fraud, non-consensual pornography, undermining trust. The

784
00:40:38.679 --> 00:40:42.280
<v Speaker 2>potential for misuse is serious. This has led to widespread

785
00:40:42.320 --> 00:40:45.639
<v Speaker 2>calls for legislation, development of detection tools by companies like

786
00:40:45.679 --> 00:40:49.239
<v Speaker 2>Microsoft and initiatives like Deepware, and ongoing research into

787
00:40:49.360 --> 00:40:53.599
<v Speaker 2>robust watermarking and provenance tracking. It's a critical area.

788
00:40:53.239 --> 00:40:58.239
<v Speaker 1>Absolutely. So, pulling back a bit, as we look across text, images, video,

789
00:40:58.960 --> 00:41:01.159
<v Speaker 1>what does the future hold? How are we going to

790
00:41:01.199 --> 00:41:05.440
<v Speaker 1>interact with these increasingly powerful and multimodal AI systems?

791
00:41:05.519 --> 00:41:09.280
<v Speaker 2>Well, one ongoing challenge, especially with LLMs, is managing hallucinations. We

792
00:41:09.360 --> 00:41:11.800
<v Speaker 2>need systems that are not just fluent, but also factual.

793
00:41:12.119 --> 00:41:15.440
<v Speaker 2>As we discussed, RAG is a key technique here, grounding

794
00:41:15.559 --> 00:41:19.320
<v Speaker 2>LLM responses in external knowledge. This focus on factuality and

795
00:41:19.320 --> 00:41:20.760
<v Speaker 2>reliability will continue to be critical.

796
00:41:20.440 --> 00:41:22.719
<v Speaker 1>So making them more trustworthy.

797
00:41:22.840 --> 00:41:26.400
<v Speaker 2>Yes, and the trend towards multimodal models is undeniable. Systems

798
00:41:26.440 --> 00:41:28.960
<v Speaker 2>like GPT-4o that can seamlessly process and generate

799
00:41:29.000 --> 00:41:32.159
<v Speaker 2>combinations of text, audio, images, and video are the future.

800
00:41:32.719 --> 00:41:35.599
<v Speaker 2>Think of OpenAI's Sora model for text to video generation.

801
00:41:36.079 --> 00:41:39.559
<v Speaker 2>It points towards AI understanding and creating rich, dynamic content

802
00:41:39.760 --> 00:41:41.320
<v Speaker 2>across different senses.

803
00:41:41.079 --> 00:41:44.199
<v Speaker 1>Interacting with AI through more than just text. Exactly, and

804
00:41:44.239 --> 00:41:46.320
<v Speaker 1>this leads to the concept of AI agents.

805
00:41:46.760 --> 00:41:49.880
<v Speaker 2>These aren't just models anymore. They're systems designed to achieve

806
00:41:50.239 --> 00:41:54.639
<v Speaker 2>complex goals autonomously. They can break down tasks, use tools

807
00:41:55.119 --> 00:41:59.800
<v Speaker 2>like web search or code execution, access databases, learn from feedback,

808
00:41:59.800 --> 00:42:03.079
<v Speaker 2>maybe even collaborate with other AI agents. We're moving towards

809
00:42:03.079 --> 00:42:06.599
<v Speaker 2>systems that don't just respond, but proactively act in the

810
00:42:06.599 --> 00:42:11.039
<v Speaker 2>digital and potentially physical world to accomplish objectives. Multi-agent

811
00:42:11.079 --> 00:42:13.840
<v Speaker 2>systems with feedback loops start to hint at rudimentary forms

812
00:42:13.840 --> 00:42:16.960
<v Speaker 2>of self improvement, which inevitably brings up discussions around the

813
00:42:16.960 --> 00:42:19.880
<v Speaker 2>path towards artificial general intelligence or AGI.

814
00:42:20.159 --> 00:42:23.760
<v Speaker 1>Right, agents that can chain actions, learn, maybe even collaborate.

815
00:42:24.119 --> 00:42:26.880
<v Speaker 1>That really opens up possibilities. So we've taken quite the

816
00:42:26.960 --> 00:42:29.480
<v Speaker 1>journey here, a real deep dive into generative AI. We've

817
00:42:29.480 --> 00:42:32.320
<v Speaker 1>gone from the basic building blocks like neurons and backpropagation,

818
00:42:32.920 --> 00:42:36.679
<v Speaker 1>through the revolutions of CNNs, LSTMs, and especially transformers. We've

819
00:42:36.719 --> 00:42:39.400
<v Speaker 1>looked at how they handle text, the rise of LLMs,

820
00:42:39.480 --> 00:42:42.960
<v Speaker 1>the open source movement, the crucial art of prompt engineering,

821
00:42:43.239 --> 00:42:45.559
<v Speaker 1>and the tools like LangChain and RAG used to

822
00:42:45.559 --> 00:42:49.840
<v Speaker 1>build real applications. We've touched on optimization, image generation with

823
00:42:50.000 --> 00:42:53.800
<v Speaker 1>VAEs and GANs, style transfer, deepfakes, and now these

824
00:42:53.800 --> 00:42:57.199
<v Speaker 1>emerging multimodal models and agents. You've seen how these models

825
00:42:57.199 --> 00:43:00.960
<v Speaker 1>can create, translate, optimize, reason. But as they get better

826
00:43:01.000 --> 00:43:05.159
<v Speaker 1>and better at generating content that's increasingly indistinguishable from human creation,

827
00:43:05.239 --> 00:43:07.559
<v Speaker 1>you have to wonder: how might our very definitions of

828
00:43:07.599 --> 00:43:11.559
<v Speaker 1>things like knowledge, creativity, originality, maybe even truth itself start

829
00:43:11.679 --> 00:43:12.920
<v Speaker 1>to shift or evolve.

830
00:43:13.119 --> 00:43:16.239
<v Speaker 2>That's the big question, isn't it? The leap from simple

831
00:43:16.320 --> 00:43:20.760
<v Speaker 2>pattern recognition to these dynamic systems that can learn, generate

832
00:43:20.840 --> 00:43:24.920
<v Speaker 2>novel content, interact with tools, and potentially even improve themselves.

833
00:43:25.400 --> 00:43:29.440
<v Speaker 2>It does raise profound questions about intelligence, creativity, and what

834
00:43:29.480 --> 00:43:32.559
<v Speaker 2>it means to understand or create. It's a field moving

835
00:43:32.599 --> 00:43:36.280
<v Speaker 2>at incredible speed. There's always more to learn, more perspectives

836
00:43:36.280 --> 00:43:39.320
<v Speaker 2>to consider, and the deeper you dig, the more fascinating

837
00:43:39.360 --> 00:43:40.760
<v Speaker 2>and complex it all becomes.

838
00:43:40.920 --> 00:43:42.880
<v Speaker 1>And that's really what we hope you take away from this.

839
00:43:43.000 --> 00:43:47.199
<v Speaker 1>Keep exploring, keep questioning, think about how this technology impacts you,

840
00:43:47.199 --> 00:43:51.239
<v Speaker 1>your work, the world around you, because this generative revolution, well,

841
00:43:51.320 --> 00:43:53.039
<v Speaker 1>it feels like it's really just getting started.
