WEBVTT

1
00:00:00.160 --> 00:00:02.879
<v Speaker 1>Welcome to the deep dive, your express lane to understanding

2
00:00:02.919 --> 00:00:06.719
<v Speaker 1>what truly matters in today's most complex subjects. Forget digging

3
00:00:06.799 --> 00:00:10.199
<v Speaker 1>through dense texts. We're cutting straight to the core of

4
00:00:10.240 --> 00:00:12.720
<v Speaker 1>what you need to know today. We're diving into the

5
00:00:12.720 --> 00:00:17.160
<v Speaker 1>fascinating world of deep learning, specifically using PyTorch. Our source

6
00:00:17.239 --> 00:00:21.480
<v Speaker 1>material is a pretty comprehensive guide to PyTorch, and our mission? Well,

7
00:00:21.679 --> 00:00:26.640
<v Speaker 1>it's simple: unpack these powerful concepts into clear, actionable insights

8
00:00:26.719 --> 00:00:27.079
<v Speaker 1>for you.

9
00:00:27.359 --> 00:00:31.519
<v Speaker 2>Yeah, and it's incredibly relevant right now. Deep learning isn't

10
00:00:31.519 --> 00:00:36.079
<v Speaker 2>just theory anymore. It's actively reshaping entire industries. Think

11
00:00:36.079 --> 00:00:40.320
<v Speaker 2>personalized medicine, autonomous cars. So understanding the fundamentals and how

12
00:00:40.399 --> 00:00:43.280
<v Speaker 2>tools like PyTorch actually make it happen gives you a

13
00:00:43.320 --> 00:00:46.280
<v Speaker 2>really critical perspective on the edge of AI. We're hoping

14
00:00:46.320 --> 00:00:49.399
<v Speaker 2>to give you that foundational knowledge plus a practical feel

15
00:00:49.439 --> 00:00:50.320
<v Speaker 2>for what's under the hood.

16
00:00:50.439 --> 00:00:53.640
<v Speaker 1>Okay, let's unpack this, then. When we talk about machine intelligence,

17
00:00:53.679 --> 00:00:57.560
<v Speaker 1>you hear AI, machine learning, deep learning. They sometimes get

18
00:00:57.640 --> 00:00:59.799
<v Speaker 1>used interchangeably. Can we clarify that relationship?

19
00:00:59.840 --> 00:01:02.240
<v Speaker 2>Absolutely. It's helpful to think of it in layers.

20
00:01:02.600 --> 00:01:06.319
<v Speaker 2>So artificial intelligence, AI, that's the big overarching goal, right,

21
00:01:06.400 --> 00:01:12.120
<v Speaker 2>making machines intelligent. The grand ambition, exactly. Now, machine learning, or

22
00:01:12.239 --> 00:01:14.959
<v Speaker 2>ML, is one major way to get there. It's where

23
00:01:15.000 --> 00:01:19.079
<v Speaker 2>machines learn from data without being explicitly programmed for every

24
00:01:19.120 --> 00:01:19.879
<v Speaker 2>single task.

25
00:01:20.079 --> 00:01:22.000
<v Speaker 1>Okay, so they learn patterns, right?

26
00:01:22.120 --> 00:01:26.000
<v Speaker 2>And deep learning, DL, is a specific type within machine learning.

27
00:01:26.239 --> 00:01:29.480
<v Speaker 2>It's a technique that's proven incredibly effective, especially for learning

28
00:01:29.519 --> 00:01:33.239
<v Speaker 2>really complex patterns from unstructured data like images or sound.

29
00:01:33.439 --> 00:01:36.560
<v Speaker 1>So AI is the goal, ML is the approach. DL

30
00:01:36.640 --> 00:01:38.519
<v Speaker 1>is a powerful technique within that approach.

31
00:01:38.760 --> 00:01:41.719
<v Speaker 2>You got it. And the reason DL has really taken

32
00:01:41.760 --> 00:01:45.120
<v Speaker 2>off is its advantage in certain areas. Traditional ML often

33
00:01:45.159 --> 00:01:48.799
<v Speaker 2>needs humans to carefully engineer features from the data first,

34
00:01:49.120 --> 00:01:51.959
<v Speaker 2>like telling the algorithm what to look for in an image,

35
00:01:52.079 --> 00:01:54.519
<v Speaker 2>a lot of manual work, a ton. Deep learning kind

36
00:01:54.519 --> 00:01:57.560
<v Speaker 2>of flips that: the algorithm itself learns to extract the

37
00:01:57.599 --> 00:02:00.719
<v Speaker 2>important features directly from the raw data and builds up this

38
00:02:01.400 --> 00:02:03.959
<v Speaker 2>hierarchical understanding in a nonlinear way.

39
00:02:04.079 --> 00:02:06.560
<v Speaker 1>Nonlinear. That's key, isn't it?

40
00:02:07.000 --> 00:02:10.319
<v Speaker 2>Crucial, because real-world data isn't neat straight lines, and this

41
00:02:10.400 --> 00:02:13.159
<v Speaker 2>ability to learn features means DL models tend to keep

42
00:02:13.159 --> 00:02:16.520
<v Speaker 2>getting better the more data you give them. Traditional ML

43
00:02:16.560 --> 00:02:17.680
<v Speaker 2>can sometimes plateau.

44
00:02:18.080 --> 00:02:21.560
<v Speaker 1>Ah, so scale really matters for deep learning performance, like

45
00:02:22.199 --> 00:02:24.919
<v Speaker 1>more data equals significantly better results?

46
00:02:24.800 --> 00:02:29.159
<v Speaker 2>Generally, yes, significantly. Think of it like learning a language.

47
00:02:29.639 --> 00:02:33.960
<v Speaker 2>Traditional ML might learn vocabulary lists. Deep learning, with enough

48
00:02:34.000 --> 00:02:38.319
<v Speaker 2>exposure, starts to grasp the grammar and nuance, the underlying structure.

49
00:02:38.439 --> 00:02:40.479
<v Speaker 1>Okay, that makes sense. Now, if we want to build

50
00:02:40.479 --> 00:02:44.159
<v Speaker 1>these systems, we need tools, frameworks, and that brings us

51
00:02:44.199 --> 00:02:46.919
<v Speaker 1>to PyTorch. What makes it stand out? I've heard about

52
00:02:46.919 --> 00:02:48.599
<v Speaker 1>this define-by-run thing.

53
00:02:48.479 --> 00:02:51.240
<v Speaker 2>Right, that's a big one. So some other frameworks

54
00:02:51.360 --> 00:02:53.520
<v Speaker 2>use a define-and-run approach. You first build this

55
00:02:53.680 --> 00:02:56.479
<v Speaker 2>entire computation graph like a blueprint, and then you run

56
00:02:56.560 --> 00:02:59.360
<v Speaker 2>data through it. It's quite static. PyTorch is

57
00:02:59.360 --> 00:03:02.599
<v Speaker 2>define-by-run. The computation graph gets built dynamically on the

58
00:03:02.639 --> 00:03:05.319
<v Speaker 2>fly as your Python code executes.

59
00:03:05.000 --> 00:03:08.479
<v Speaker 1>So it's more flexible, like you can change things midstream?

60
00:03:08.560 --> 00:03:12.520
<v Speaker 2>Exactly. You can use standard Python loops, conditionals, print statements, debuggers.

61
00:03:12.960 --> 00:03:16.039
<v Speaker 2>It feels much more like regular programming. This makes it

62
00:03:16.080 --> 00:03:20.319
<v Speaker 2>super popular for research and rapid prototyping where you're constantly experimenting.

63
00:03:20.439 --> 00:03:23.080
<v Speaker 1>That sounds much more intuitive, especially if you're already comfortable

64
00:03:23.080 --> 00:03:23.599
<v Speaker 1>with Python.

65
00:03:23.719 --> 00:03:27.080
<v Speaker 2>It often is, and practically speaking, it's just a Python package.

66
00:03:27.159 --> 00:03:30.000
<v Speaker 2>You install it with pip or conda. Easy setup.

67
00:03:29.719 --> 00:03:32.319
<v Speaker 1>And GPUs, graphics cards? I hear they're important.

68
00:03:32.319 --> 00:03:34.919
<v Speaker 2>Oh, absolutely crucial for serious work.

69
00:03:35.520 --> 00:03:40.919
<v Speaker 2>Deep learning involves massive matrix multiplications, and GPUs, especially NVIDIA

70
00:03:40.960 --> 00:03:44.960
<v Speaker 2>ones using CUDA, are specifically designed to crunch those numbers

71
00:03:44.960 --> 00:03:48.280
<v Speaker 2>incredibly fast, way faster than a standard CPU.

72
00:03:48.479 --> 00:03:50.960
<v Speaker 1>What if you don't have a powerful GPU sitting around?

73
00:03:51.120 --> 00:03:55.400
<v Speaker 2>Cloud computing is your friend. Services like Google Cloud, AWS, Azure.

74
00:03:55.840 --> 00:03:58.280
<v Speaker 2>They offer instances with powerful GPUs you can rent by

75
00:03:58.319 --> 00:03:59.759
<v Speaker 2>the hour. Very accessible.

76
00:04:00.199 --> 00:04:02.039
<v Speaker 1>Good to know. So let's get into the nitty-gritty.

77
00:04:02.039 --> 00:04:05.199
<v Speaker 1>If we're building something, what are the absolute core components?

78
00:04:05.400 --> 00:04:07.479
<v Speaker 1>Starting with the neural network itself.

79
00:04:07.360 --> 00:04:11.319
<v Speaker 2>Right. At its heart, a neural network is an algorithm

80
00:04:11.360 --> 00:04:15.159
<v Speaker 2>designed to learn relationships. It maps input variables like say,

81
00:04:15.360 --> 00:04:18.560
<v Speaker 2>pixels in an image, to some target output like cat

82
00:04:18.680 --> 00:04:19.079
<v Speaker 2>or dog.

83
00:04:19.839 --> 00:04:21.079
<v Speaker 1>How does it learn that mapping?

84
00:04:21.839 --> 00:04:25.839
<v Speaker 2>Let's take a simpler example: predicting college admission. Your

85
00:04:25.879 --> 00:04:30.079
<v Speaker 2>inputs might be GPA, GRE score, university rank. In

86
00:04:30.120 --> 00:04:33.839
<v Speaker 2>the network, these inputs are connected to processing units, sometimes

87
00:04:33.839 --> 00:04:37.680
<v Speaker 2>called neurons. Each connection has a weight, basically a number

88
00:04:37.680 --> 00:04:40.279
<v Speaker 2>indicating how important that input is for the prediction.

89
00:04:40.160 --> 00:04:43.079
<v Speaker 1>And the network learns these weights?

90
00:04:43.319 --> 00:04:46.120
<v Speaker 2>Exactly. It learns them from the data. Inside a neuron, there

91
00:04:46.120 --> 00:04:50.120
<v Speaker 2>are typically two main operations. First, a dot product: summing

92
00:04:50.199 --> 00:04:53.040
<v Speaker 2>up all the inputs multiplied by their weights. It's like

93
00:04:53.120 --> 00:04:55.000
<v Speaker 2>mixing the ingredients based on their importance.

94
00:04:55.079 --> 00:04:56.480
<v Speaker 1>Okay, a weighted sum.

95
00:04:56.600 --> 00:05:01.439
<v Speaker 2>Then, crucially, it applies a nonlinear transformation, an activation function.

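NOTE
A minimal sketch of the neuron just described: a weighted sum followed by a
nonlinear activation. It uses plain Python rather than PyTorch so it runs with
no installs. The feature values, weights, and bias are made up for illustration,
and sigmoid is assumed as one possible activation.
```python
import math
def neuron(inputs, weights, bias):
    # Step 1: weighted sum (dot product of inputs and weights, plus a bias).
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: nonlinear activation; sigmoid squashes z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
# Hypothetical applicant: scaled GPA, scaled test score, scaled university rank.
applicant = [0.9, 0.8, 0.25]
weights = [1.5, 1.0, -0.5]  # illustrative values, not learned from data
admit_score = neuron(applicant, weights, bias=-1.0)
```
In a real network the weights and bias would be learned during training, not typed in by hand.
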
96
00:05:01.560 --> 00:05:03.759
<v Speaker 1>Why nonlinear? Why can't it just be linear?

97
00:05:03.920 --> 00:05:06.480
<v Speaker 2>Because if you just stack linear operations, no matter how

98
00:05:06.519 --> 00:05:08.920
<v Speaker 2>many layers you have, the whole thing is still just

99
00:05:08.959 --> 00:05:12.279
<v Speaker 2>one big linear transformation. It can only learn straight-line relationships.

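NOTE
A small sketch of why stacked linear layers collapse into one, assuming
one-input, one-output layers for simplicity. The helper name `linear` and the
numbers are hypothetical; the same algebra holds for matrices.
```python
def linear(w, b):
    # A one-input, one-output linear layer: x maps to w * x + b.
    return lambda x: w * x + b
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)
# Stacking the two layers with no activation in between...
stacked = lambda x: layer2(layer1(x))
# ...equals a single linear layer with w = w2 * w1 and b = w2 * b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)
same = all(abs(stacked(x) - collapsed(x)) < 1e-12 for x in [-2.0, 0.0, 1.5, 10.0])
```
Putting a nonlinear activation between the two layers is what breaks this equivalence.
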
100
00:05:12.560 --> 00:05:16.720
<v Speaker 1>Ah, and the real world is messy, not straight lines.

101
00:05:17.319 --> 00:05:21.879
<v Speaker 2>Precisely. Think about recognizing a face or understanding language. Super complex

102
00:05:22.079 --> 00:05:26.040
<v Speaker 2>nonlinear patterns. Those activation functions allow the network to learn

103
00:05:26.079 --> 00:05:29.360
<v Speaker 2>these intricate curves and boundaries. Each layer builds on the

104
00:05:29.399 --> 00:05:31.959
<v Speaker 2>previous one, learning more abstract features.

105
00:05:32.000 --> 00:05:36.399
<v Speaker 1>And in PyTorch, how do you actually define these network structures?

106
00:05:36.759 --> 00:05:39.680
<v Speaker 2>For simple feedforward networks, you can use

107
00:05:39.800 --> 00:05:42.480
<v Speaker 2>torch.nn.Sequential. It lets you just list the layers one

108
00:05:42.519 --> 00:05:46.000
<v Speaker 2>after another. Very straightforward. But for anything more complex,

109
00:05:46.040 --> 00:05:49.360
<v Speaker 2>maybe networks with multiple inputs or outputs or custom connections,

110
00:05:49.600 --> 00:05:52.720
<v Speaker 2>you'll typically define your own network class. You inherit from

111
00:05:53.040 --> 00:05:55.959
<v Speaker 2>torch.nn.Module and define the layers in

112
00:05:56.000 --> 00:05:58.279
<v Speaker 2>the __init__ method and how data flows through them in

113
00:05:58.319 --> 00:06:00.120
<v Speaker 2>the forward method. Gives you total control.

114
00:06:00.079 --> 00:06:02.959
<v Speaker 1>Right, more power, more flexibility. Now back to those

115
00:06:03.000 --> 00:06:06.240
<v Speaker 1>nonlinear activation functions. You said they're crucial. What are

116
00:06:06.240 --> 00:06:06.920
<v Speaker 1>some common ones?

117
00:06:07.000 --> 00:06:09.920
<v Speaker 2>Yeah, there are a few main players. Historically, sigmoid was popular.

118
00:06:10.240 --> 00:06:13.240
<v Speaker 2>It squashes any input value into a range between zero and one.

119
00:06:13.199 --> 00:06:16.920
<v Speaker 1>Useful for probabilities, maybe, like binary classification?

120
00:06:17.959 --> 00:06:20.800
<v Speaker 2>Exactly. Is it a cat, near one, or not, near

121
00:06:20.879 --> 00:06:23.439
<v Speaker 2>zero? The problem is when the output is very

122
00:06:23.480 --> 00:06:26.879
<v Speaker 2>close to zero or one, the gradient, the signal used

123
00:06:26.920 --> 00:06:30.560
<v Speaker 2>for learning, becomes tiny. It basically stops learning.

124
00:06:30.519 --> 00:06:33.800
<v Speaker 1>Ah, the vanishing gradient problem, leading to dead neurons.

125
00:06:33.920 --> 00:06:37.279
<v Speaker 2>Precisely, parts of the network just stop updating. Then there's

126
00:06:37.399 --> 00:06:41.160
<v Speaker 2>tanh, or hyperbolic tangent, similar to sigmoid, but it squashes

127
00:06:41.279 --> 00:06:42.839
<v Speaker 2>values between minus one and one.

128
00:06:42.920 --> 00:06:45.079
<v Speaker 1>Why is minus one to one better than zero to one?

129
00:06:44.959 --> 00:06:48.360
<v Speaker 2>Because its output is zero-centered. This often helps

130
00:06:48.399 --> 00:06:51.319
<v Speaker 2>the optimization process converge a bit faster and more reliably.

131
00:06:51.839 --> 00:06:54.839
<v Speaker 2>It's generally preferred over sigmoid in many cases. Okay, what

132
00:06:54.920 --> 00:06:58.399
<v Speaker 2>else? The current crowd favorite really is ReLU, rectified

133
00:06:58.439 --> 00:07:01.240
<v Speaker 2>linear unit. It's super simple. If the input is negative,

134
00:07:01.279 --> 00:07:03.920
<v Speaker 2>the output is zero. If it's positive, the output is

135
00:07:03.959 --> 00:07:05.319
<v Speaker 2>just the input value itself.

136
00:07:05.360 --> 00:07:07.519
<v Speaker 1>That sounds really simple. Why is it so popular?

137
00:07:07.920 --> 00:07:11.079
<v Speaker 2>It's computationally very cheap, much faster than sigmoid or tanh,

138
00:07:11.759 --> 00:07:14.639
<v Speaker 2>and in practice it often helps networks learn faster because

139
00:07:14.639 --> 00:07:17.959
<v Speaker 2>it doesn't saturate for positive values, but it can still die.

140
00:07:18.199 --> 00:07:21.319
<v Speaker 2>If a neuron consistently gets negative input, it just outputs

141
00:07:21.399 --> 00:07:22.759
<v Speaker 2>zero and the gradient becomes zero.

142
00:07:22.600 --> 00:07:25.920
<v Speaker 1>So it has its own dying ReLU problem.

143
00:07:26.160 --> 00:07:29.680
<v Speaker 2>It can, yeah, which led to variations like Leaky ReLU.

144
00:07:30.560 --> 00:07:34.240
<v Speaker 2>Instead of outputting zero for negative inputs, it outputs a

145
00:07:34.319 --> 00:07:38.160
<v Speaker 2>very small value, like 0.01 times the input,

146
00:07:38.319 --> 00:07:40.759
<v Speaker 2>just enough to keep it alive, basically. Exactly, it keeps the

147
00:07:40.800 --> 00:07:44.040
<v Speaker 2>gradient flowing and prevents the neuron from completely dying off.

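NOTE
A plain-Python sketch of the activation functions discussed above, written
without PyTorch so it runs anywhere. The function names are illustrative, and
the 0.01 slope for Leaky ReLU is the value mentioned in the conversation.
```python
import math
def sigmoid(x):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
def sigmoid_grad(x):
    # Derivative of sigmoid; it shrinks toward zero as |x| grows (saturation).
    s = sigmoid(x)
    return s * (1.0 - s)
def relu(x):
    # Zero for negative inputs, the input itself otherwise.
    return x if x > 0 else 0.0
def leaky_relu(x, slope=0.01):
    # For negative inputs the output is slope * x: small and negative, not zero.
    return x if x > 0 else slope * x
for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x), math.tanh(x), relu(x), leaky_relu(x))
```
Note that for a negative input the Leaky ReLU output is itself slightly negative, which is what keeps a nonzero gradient flowing.
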
148
00:07:44.120 --> 00:07:47.160
<v Speaker 1>Okay, so we have networks with layers, weights and these

149
00:07:47.199 --> 00:07:50.759
<v Speaker 1>activation functions. How do we measure if it's actually learning

150
00:07:50.800 --> 00:07:54.240
<v Speaker 1>the right thing? How do we quantify good or bad predictions?

151
00:07:54.480 --> 00:07:56.959
<v Speaker 2>That's the job of the loss function, sometimes called a

152
00:07:57.000 --> 00:08:00.399
<v Speaker 2>cost function or objective function. Its whole purpose is to

153
00:08:00.399 --> 00:08:02.920
<v Speaker 2>take the network's predictions and compare them to the actual

154
00:08:03.000 --> 00:08:06.199
<v Speaker 2>correct answers, the targets or labels, and spit out a

155
00:08:06.240 --> 00:08:07.800
<v Speaker 2>single number, the loss.

156
00:08:07.680 --> 00:08:09.519
<v Speaker 1>One number representing the error.

157
00:08:09.639 --> 00:08:12.000
<v Speaker 2>Yep. A high loss means the predictions are bad. A low

158
00:08:12.040 --> 00:08:15.199
<v Speaker 2>loss means they're good. The entire goal of training is

159
00:08:15.240 --> 00:08:17.000
<v Speaker 2>to minimize this loss value.

160
00:08:17.079 --> 00:08:18.040
<v Speaker 1>What are some examples?

161
00:08:18.199 --> 00:08:21.240
<v Speaker 2>Well, if you're predicting a continuous value like the price

162
00:08:21.279 --> 00:08:23.839
<v Speaker 2>of a house or a T-shirt, like in the

163
00:08:23.879 --> 00:08:27.600
<v Speaker 2>book's example, that's regression. A common loss function is mean

164
00:08:27.680 --> 00:08:31.959
<v Speaker 2>squared error, MSE. It calculates the average of the squared

165
00:08:31.959 --> 00:08:34.960
<v Speaker 2>differences between each prediction and the actual value.

166
00:08:35.159 --> 00:08:38.519
<v Speaker 1>Squaring makes errors positive and penalizes larger errors more, right?

167
00:08:38.559 --> 00:08:43.080
<v Speaker 2>Exactly. Now, for classification, deciding between categories like

168
00:08:43.120 --> 00:08:46.679
<v Speaker 2>cat versus dog versus panda, you often use cross-entropy loss.

169
00:08:47.159 --> 00:08:50.679
<v Speaker 2>It measures how different the network's predicted probability distribution is

170
00:08:50.720 --> 00:08:53.759
<v Speaker 2>from the actual distribution, which is usually one for the

171
00:08:53.799 --> 00:08:55.440
<v Speaker 2>correct class and zero for others.

172
00:08:55.679 --> 00:08:58.519
<v Speaker 1>So if the network is very confident about the wrong class,

173
00:08:58.840 --> 00:09:01.240
<v Speaker 1>the cross-entropy loss will be high?

174
00:09:01.320 --> 00:09:05.360
<v Speaker 2>Very high. It heavily penalizes confident wrong answers, pushing the network towards

175
00:09:05.360 --> 00:09:07.759
<v Speaker 2>predicting the correct class with high probability.

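NOTE
A hedged sketch of the two loss functions described above, in plain Python.
The helper names `mse` and `cross_entropy` are illustrative, and the numbers are
made up. This cross-entropy takes probabilities directly, whereas PyTorch's
nn.CrossEntropyLoss expects raw scores (logits).
```python
import math
def mse(predictions, targets):
    # Average of the squared differences between predictions and targets.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
def cross_entropy(probabilities, correct_index):
    # With a one-hot target, cross-entropy reduces to -log(p of the correct class).
    return -math.log(probabilities[correct_index])
regression_loss = mse([2.0, 4.0], [1.0, 6.0])  # (1 + 4) / 2
confident_right = cross_entropy([0.9, 0.05, 0.05], 0)  # small loss
confident_wrong = cross_entropy([0.05, 0.9, 0.05], 0)  # large loss
```
The confidently wrong prediction gets a much larger loss than the confidently right one.
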
176
00:09:08.039 --> 00:09:10.639
<v Speaker 1>Okay, so the loss function tells us how bad we are.

177
00:09:10.799 --> 00:09:13.000
<v Speaker 1>How does the network use that information to get better?

178
00:09:13.279 --> 00:09:15.840
<v Speaker 2>That's where optimizers come in. The loss function gives us

179
00:09:15.840 --> 00:09:18.799
<v Speaker 2>the error signal. The optimizer is the algorithm that uses

180
00:09:18.840 --> 00:09:20.840
<v Speaker 2>that signal to update the network's weights.

181
00:09:20.919 --> 00:09:22.960
<v Speaker 1>It adjusts the knobs basically.

182
00:09:22.759 --> 00:09:26.000
<v Speaker 2>Precisely, it figures out how to adjust each weight to

183
00:09:26.039 --> 00:09:29.120
<v Speaker 2>reduce the loss. The most basic one is stochastic

184
00:09:29.120 --> 00:09:33.120
<v Speaker 2>gradient descent, SGD, but there are more advanced ones

185
00:09:33.200 --> 00:09:36.639
<v Speaker 2>like Adam or RMSprop that often converge faster and

186
00:09:36.720 --> 00:09:39.759
<v Speaker 2>more reliably by adapting the learning rate for each weight.

187
00:09:40.320 --> 00:09:42.679
<v Speaker 1>And in PyTorch, how does that training loop look? You

188
00:09:42.720 --> 00:09:45.759
<v Speaker 1>mentioned steps like zero_grad, backward, step.

189
00:09:45.799 --> 00:09:48.759
<v Speaker 2>Right. It's a cycle. For each batch of data: one, you

190
00:09:48.799 --> 00:09:51.039
<v Speaker 2>feed the data forward through the network to get predictions.

191
00:09:51.200 --> 00:09:55.200
<v Speaker 2>Two, you calculate the loss using your chosen loss function. Three, crucially,

192
00:09:55.320 --> 00:09:58.879
<v Speaker 2>you call optimizer.zero_grad(). This clears out any old

193
00:09:58.879 --> 00:10:02.759
<v Speaker 2>gradient calculations from the previous batch. Very important. Four, you

194
00:10:02.799 --> 00:10:06.600
<v Speaker 2>call loss.backward(). This is where PyTorch automatically calculates

195
00:10:06.600 --> 00:10:09.279
<v Speaker 2>the gradients, how much each weight contributed to the loss,

196
00:10:09.360 --> 00:10:13.720
<v Speaker 2>using backpropagation. Five, finally, you call optimizer.step(). This

197
00:10:13.840 --> 00:10:16.000
<v Speaker 2>tells the optimizer to update the weights based on the

198
00:10:16.000 --> 00:10:17.279
<v Speaker 2>gradients it just calculated.

199
00:10:17.320 --> 00:10:22.480
<v Speaker 1>Predict, calculate loss, clear old gradients, calculate new gradients, update weights, repeat.

200
00:10:22.679 --> 00:10:25.080
<v Speaker 2>That's the essence of training a neural network. You do

201
00:10:25.120 --> 00:10:27.480
<v Speaker 2>this over and over, batch after batch, epoch after

202
00:10:27.559 --> 00:10:30.480
<v Speaker 2>epoch, until the loss is low and the network performs well.

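NOTE
The five steps above can be sketched without PyTorch as a hand-written loop for
a one-weight model. This is an illustration under stated assumptions, not
PyTorch's implementation: the toy data, learning rate, and variable names are
made up, and each comment names the PyTorch call that the step stands in for.
The gradient is accumulated with += on purpose, since PyTorch also accumulates
gradients across backward calls, which is why they are cleared every batch.
```python
# Toy data for y = 3 * x; the model is a single weight w (illustrative only).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w, grad, lr = 0.0, 0.0, 0.05
for epoch in range(200):
    for x, y in data:  # each "batch" here is a single example
        pred = w * x  # 1. forward pass
        loss = (pred - y) ** 2  # 2. compute the loss
        grad = 0.0  # 3. clear the old gradient, like optimizer.zero_grad()
        grad += 2 * (pred - y) * x  # 4. d(loss)/dw, like loss.backward()
        w -= lr * grad  # 5. update the weight, like optimizer.step()
```
After training, w should sit very close to 3, the slope used to generate the toy data.
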
203
00:10:30.639 --> 00:10:34.600
<v Speaker 1>Okay, before we get into specific applications like vision or language,

204
00:10:34.679 --> 00:10:37.960
<v Speaker 1>we need to talk about how PyTorch actually handles the data.

205
00:10:38.399 --> 00:10:40.039
<v Speaker 1>You mentioned tensors earlier.

206
00:10:40.200 --> 00:10:44.000
<v Speaker 2>Yes, tensors are the absolute fundamental data structure in PyTorch.

207
00:10:44.600 --> 00:10:49.039
<v Speaker 2>You can think of them as multidimensional arrays, like NumPy

208
00:10:49.080 --> 00:10:53.080
<v Speaker 2>arrays, but with superpowers, especially acceleration on GPUs.

209
00:10:53.279 --> 00:10:55.320
<v Speaker 1>Multidimensional, what does that mean exactly?

210
00:10:55.480 --> 00:10:58.279
<v Speaker 2>It refers to the tensor's order or number of dimensions.

211
00:10:58.759 --> 00:11:01.559
<v Speaker 2>A single number like five is a scalar, a tensor

212
00:11:01.559 --> 00:11:04.480
<v Speaker 2>of order zero. A list of numbers like one, two,

213
00:11:04.559 --> 00:11:07.559
<v Speaker 2>three is a vector, order one. A grid of numbers

214
00:11:07.559 --> 00:11:09.960
<v Speaker 2>like a spreadsheet table is a matrix, order two, and

215
00:11:10.039 --> 00:11:12.360
<v Speaker 2>you can keep going. An image might be order three,

216
00:11:12.799 --> 00:11:15.840
<v Speaker 2>height, width, color channels, and a batch of images would

217
00:11:15.840 --> 00:11:18.679
<v Speaker 2>be order four: batch size, height, width, channels.

218
00:11:18.919 --> 00:11:21.039
<v Speaker 1>So the order tells you how many indices you need

219
00:11:21.080 --> 00:11:22.559
<v Speaker 1>to access a specific element.

220
00:11:22.720 --> 00:11:24.960
<v Speaker 2>Exactly. To get element (2, 1, 2, 2) from that

221
00:11:25.039 --> 00:11:27.960
<v Speaker 2>hypothetical fourth-order tensor, you'd use four zero-based indices, like

222
00:11:28.039 --> 00:11:30.960
<v Speaker 2>my_tensor[1, 0, 1, 1]. The number of indices always

223
00:11:30.960 --> 00:11:32.000
<v Speaker 2>matches the tensor's order.

224
00:11:32.279 --> 00:11:34.159
<v Speaker 1>How do you know the shape or size of a tensor?

225
00:11:34.320 --> 00:11:38.279
<v Speaker 2>You use the .size() method or the .shape attribute. It

226
00:11:38.320 --> 00:11:40.879
<v Speaker 2>returns a tuple telling you the length of each dimension.

227
00:11:41.360 --> 00:11:44.759
<v Speaker 2>For instance, a batch of thirty-two images, each two

228
00:11:44.799 --> 00:11:47.039
<v Speaker 2>hundred twenty-four by two hundred twenty-four pixels

229
00:11:47.240 --> 00:11:50.080
<v Speaker 2>with three color channels, would have a shape of

230
00:11:50.080 --> 00:11:51.960
<v Speaker 2>(32, 224, 224, 3). Can you

231
00:11:52.039 --> 00:11:55.399
<v Speaker 2>change the shape? Yes, using methods like .view() or

232
00:11:55.519 --> 00:11:58.799
<v Speaker 2>.reshape(). This lets you rearrange the elements into a

233
00:11:58.799 --> 00:12:02.240
<v Speaker 2>different configuration without changing the total number of elements or

234
00:12:02.279 --> 00:12:05.639
<v Speaker 2>the underlying data. It's really useful for, say, flattening an

235
00:12:05.639 --> 00:12:08.200
<v Speaker 2>image before feeding it into a simple linear layer.

236
00:12:07.919 --> 00:12:10.399
<v Speaker 1>And view has that handy -1 trick.

237
00:12:10.519 --> 00:12:12.919
<v Speaker 2>Yeah, if you specify -1 for one dimension,

238
00:12:13.240 --> 00:12:16.879
<v Speaker 2>PyTorch automatically calculates its size based on the total number

239
00:12:16.879 --> 00:12:19.360
<v Speaker 2>of elements and the sizes of the other dimensions you provided.

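NOTE
A sketch of the arithmetic behind the -1 trick, in plain Python. The helper
`infer_shape` is hypothetical and is not a PyTorch function; it only mimics the
size calculation that a call such as .view(32, -1) has to perform.
```python
def infer_shape(total_elements, shape):
    # Multiply together all the dimensions that were given explicitly.
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 not in shape:
        # Nothing to infer: the requested shape must match the element count.
        if known != total_elements:
            raise ValueError("shape does not match the number of elements")
        return tuple(shape)
    if shape.count(-1) > 1 or known == 0 or total_elements % known != 0:
        raise ValueError("cannot infer the missing dimension")
    # Replace the single -1 with the size that makes the counts agree.
    return tuple(total_elements // known if dim == -1 else dim for dim in shape)
total = 32 * 224 * 224 * 3  # the batch of images from the example above
flat_shape = infer_shape(total, (32, -1))  # one flat vector per image
```
Each of the 32 images flattens to 224 * 224 * 3 = 150528 values.
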
240
00:12:19.600 --> 00:12:23.159
<v Speaker 2>Super convenient. And remember, .view() usually returns a new

241
00:12:23.200 --> 00:12:25.960
<v Speaker 2>tensor sharing the same data. It doesn't modify the original

242
00:12:25.960 --> 00:12:26.279
<v Speaker 2>in place.

243
00:12:26.320 --> 00:12:29.000
<v Speaker 1>What about basic math? Adding, multiplying?

244
00:12:29.080 --> 00:12:33.960
<v Speaker 2>Tensors support all the standard element-wise operations: addition, subtraction, multiplication, division.

245
00:12:34.240 --> 00:12:35.679
<v Speaker 2>They work just like you'd expect from arrays.

246
00:12:35.759 --> 00:12:38.519
<v Speaker 1>Any gotchas there? The book mentions something about division.

247
00:12:38.840 --> 00:12:42.440
<v Speaker 2>Ah, yes, data types. If you have a tensor of integers,

248
00:12:42.679 --> 00:12:46.559
<v Speaker 2>say torch.tensor([5, 3]), which defaults to

249
00:12:46.600 --> 00:12:49.519
<v Speaker 2>int64, and you divide five by three element-wise,

250
00:12:49.840 --> 00:12:52.960
<v Speaker 2>you might get one, because it performs integer division. Right,

251
00:12:53.080 --> 00:12:56.240
<v Speaker 2>it truncates the decimal. Exactly. To get the floating-point result,

252
00:12:56.240 --> 00:12:58.799
<v Speaker 2>like 1.6667, you need to make

253
00:12:58.840 --> 00:13:00.919
<v Speaker 2>sure at least one of the tensors has a

254
00:13:00.919 --> 00:13:03.840
<v Speaker 2>floating-point dtype, like torch.float32.

255
00:13:04.440 --> 00:13:06.960
<v Speaker 2>You can specify the dtype when creating the tensor

256
00:13:07.080 --> 00:13:09.639
<v Speaker 2>or cast it later. Always be mindful of your data

257
00:13:09.639 --> 00:13:10.720
<v Speaker 2>types. Makes sense.

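NOTE
A plain-Python analogue of the division gotcha, using ordinary integers rather
than tensors. As far as I know, recent PyTorch releases perform true division
when / is applied to integer tensors, so the truncation described here may only
apply to older versions or to explicit floor division; check your own version.
```python
a, b = 5, 3
truncated = a // b  # integer (floor) division drops the fractional part
exact = a / b  # true division keeps it
as_float = float(a) / b  # casting one operand first, like choosing a float dtype
```
Either way, being explicit about the data type avoids surprises.
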
258
00:13:11.000 --> 00:13:13.879
<v Speaker 1>Tensors really seem like the core way PyTorch handles all

259
00:13:13.960 --> 00:13:16.919
<v Speaker 1>numerical data from inputs to weights to gradients.

260
00:13:17.120 --> 00:13:22.879
<v Speaker 2>They absolutely are. Everything flows as tensors. Getting comfortable manipulating them, indexing, reshaping,

261
00:13:22.960 --> 00:13:26.600
<v Speaker 2>doing operations is key to working effectively with PyTorch.

262
00:13:26.639 --> 00:13:28.600
<v Speaker 1>All right, so we have the network building blocks, we

263
00:13:28.679 --> 00:13:31.679
<v Speaker 1>understand loss and optimization, and we know data is handled

264
00:13:31.759 --> 00:13:34.720
<v Speaker 1>via tensors. Let's see this stuff in action. Computer vision

265
00:13:34.759 --> 00:13:36.919
<v Speaker 1>seems like a huge area for deep learning.

266
00:13:36.759 --> 00:13:39.279
<v Speaker 2>Definitely one of the fields where it first made massive breakthroughs.

267
00:13:39.559 --> 00:13:41.200
<v Speaker 2>The problem with images, as we touched on, is

268
00:13:41.200 --> 00:13:43.039
<v Speaker 2>that if you just flatten them into a long vector

269
00:13:43.120 --> 00:13:46.480
<v Speaker 2>for a standard fully connected network, you lose

270
00:13:46.320 --> 00:13:48.960
<v Speaker 1>all the spatial information, like which pixels are next to

271
00:13:49.000 --> 00:13:49.559
<v Speaker 1>each other.

272
00:13:49.519 --> 00:13:53.120
<v Speaker 2>Precisely, and the number of weights needed becomes astronomically large

273
00:13:53.399 --> 00:13:56.919
<v Speaker 2>even for moderately sized images. It just doesn't scale well

274
00:13:57.039 --> 00:13:59.399
<v Speaker 2>and doesn't leverage the inherent structure of images.

275
00:13:59.519 --> 00:14:01.240
<v Speaker 1>So what's the deep learning solution?

276
00:14:01.639 --> 00:14:06.120
<v Speaker 2>Convolutional neural networks or CNNs. They are designed specifically to

277
00:14:06.200 --> 00:14:08.799
<v Speaker 2>process grid-like data, like images.

278
00:14:08.879 --> 00:14:10.000
<v Speaker 1>How do they work differently?

279
00:14:10.279 --> 00:14:13.120
<v Speaker 2>Instead of connecting every input pixel to every neuron in

280
00:14:13.159 --> 00:14:17.399
<v Speaker 2>the first layer, CNNs use filters or kernels. These are

281
00:14:17.440 --> 00:14:19.840
<v Speaker 2>small windows of weights, say three by three or five

282
00:14:19.960 --> 00:14:21.879
<v Speaker 2>by five, that slide across the input image.

283
00:14:21.799 --> 00:14:25.279
<v Speaker 1>Like scanning the image with a small magnifying glass?

284
00:14:25.480 --> 00:14:28.159
<v Speaker 2>Kind of, yeah. Each filter learns to detect a

285
00:14:28.159 --> 00:14:31.960
<v Speaker 2>specific local feature, maybe a vertical edge, a horizontal line,

286
00:14:32.320 --> 00:14:35.320
<v Speaker 2>a certain curve, or a texture. As the filter slides

287
00:14:35.360 --> 00:14:38.639
<v Speaker 2>across the image, it creates an activation map showing where

288
00:14:38.639 --> 00:14:39.519
<v Speaker 2>it found that feature.

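NOTE
A minimal plain-Python sketch of a filter sliding across an image to produce an
activation map. The helper name `slide_filter`, the tiny 4x4 image, and the
hand-written 2x2 vertical-edge filter are all illustrative; in a real CNN the
filter weights are learned, and a library routine does the sliding.
```python
def slide_filter(image, kernel):
    # Slide the kernel over every position where it fully fits (no padding).
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]
# 4x4 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 1, 1]] * 4
# Hypothetical hand-written vertical-edge filter; a CNN would learn such weights.
vertical_edge = [[-1, 1], [-1, 1]]
activation_map = slide_filter(image, vertical_edge)
```
The map is nonzero only in the middle column, where the dark-to-bright edge sits.
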
289
00:14:39.600 --> 00:14:42.600
<v Speaker 1>And because it's sliding, it detects that feature regardless of

290
00:14:42.600 --> 00:14:43.120
<v Speaker 1>where it is in the image?

291
00:14:43.039 --> 00:14:46.200
<v Speaker 2>Exactly. That's called translation invariance, a

292
00:14:46.279 --> 00:14:50.919
<v Speaker 2>key property, and crucially, it preserves the spatial relationships between features.

293
00:14:51.559 --> 00:14:53.919
<v Speaker 2>Layers deeper in the CNN then learn to combine these

294
00:14:53.960 --> 00:14:58.919
<v Speaker 2>simpler features into more complex ones: edges combine to form corners, corners

295
00:14:58.919 --> 00:15:01.600
<v Speaker 2>and textures combine into object parts like eyes or wheels.

296
00:15:01.720 --> 00:15:05.679
<v Speaker 1>Let's look at that journey. The classic MNIST dataset,

297
00:15:05.799 --> 00:15:08.320
<v Speaker 1>handwritten digits. How well do CNNs do there?

298
00:15:08.600 --> 00:15:11.840
<v Speaker 2>Even a fairly simple CNN can achieve really high accuracy

299
00:15:11.879 --> 00:15:14.960
<v Speaker 2>on MNIST, like ninety eight percent or ninety nine percent.

300
00:15:15.120 --> 00:15:17.639
<v Speaker 2>It's a standard benchmark, and CNNs crush it.

301
00:15:17.840 --> 00:15:20.600
<v Speaker 1>But then you take that same CNN, maybe trained on MNIST,

302
00:15:20.720 --> 00:15:23.320
<v Speaker 1>and try it on something harder like the Dogs Versus

303
00:15:23.320 --> 00:15:24.879
<v Speaker 1>Cats challenge from Kaggle.

304
00:15:24.600 --> 00:15:28.120
<v Speaker 2>And suddenly it might struggle, maybe only seventy five percent accuracy.

305
00:15:28.200 --> 00:15:31.720
<v Speaker 2>The features learned for recognizing simple digits aren't necessarily complex

306
00:15:31.840 --> 00:15:34.759
<v Speaker 2>enough or the right kind to distinguish between detailed photos

307
00:15:34.759 --> 00:15:37.720
<v Speaker 2>of different animal breeds. It doesn't generalize well enough.

308
00:15:37.559 --> 00:15:39.799
<v Speaker 1>Which brings us back to an important idea you mentioned,

309
00:15:40.039 --> 00:15:42.720
<v Speaker 1>how do we tackle these harder tasks, especially if we

310
00:15:42.759 --> 00:15:45.679
<v Speaker 1>don't have millions of labeled dog and cat photos ourselves.

311
00:15:45.919 --> 00:15:50.080
<v Speaker 2>Transfer learning. This is hugely powerful in computer vision. The

312
00:15:50.159 --> 00:15:53.919
<v Speaker 2>idea is: why start learning from scratch when others have

313
00:15:53.960 --> 00:15:57.240
<v Speaker 2>already trained massive models on enormous datasets?

314
00:15:57.039 --> 00:15:59.720
<v Speaker 1>Like learning to ride a motorbike after knowing how

315
00:15:59.720 --> 00:16:02.679
<v Speaker 1>to drive a car, reusing the basic road knowledge.

316
00:16:02.840 --> 00:16:06.320
<v Speaker 2>Perfect analogy. We take a pre-trained model like VGG

317
00:16:06.440 --> 00:16:09.799
<v Speaker 2>sixteen or ResNet, which has already been trained on ImageNet,

318
00:16:09.879 --> 00:16:13.559
<v Speaker 2>a data set with millions of images across one thousand categories.

319
00:16:14.159 --> 00:16:17.519
<v Speaker 2>These models have learned incredibly rich and general visual features

320
00:16:17.519 --> 00:16:22.120
<v Speaker 2>in their early layers: edge detectors, texture detectors, basic shape detectors.

321
00:16:22.200 --> 00:16:24.759
<v Speaker 1>So you take that pre-trained network and...

322
00:16:24.679 --> 00:16:28.000
<v Speaker 2>You typically freeze the weights of those early convolutional layers,

323
00:16:28.000 --> 00:16:30.120
<v Speaker 2>you don't let them train anymore. You basically treat them

324
00:16:30.120 --> 00:16:34.200
<v Speaker 2>as fixed feature extractors. Then you replace the final classification

325
00:16:34.279 --> 00:16:36.639
<v Speaker 2>layer, which was trained for the original one thousand

326
00:16:36.679 --> 00:16:39.440
<v Speaker 2>ImageNet classes, with a new one suited to your task,

327
00:16:39.639 --> 00:16:41.759
<v Speaker 2>like discriminating between dogs and cats.
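
NOTE
Editor's sketch of the freeze-and-replace recipe described above, not the
book's own code. It assumes PyTorch is installed. The tiny `backbone` is a
hypothetical stand-in for a real pre-trained network such as ResNet, and the
layer sizes, learning rate, and random batch are illustrative only.
```python
import torch
from torch import nn
# Hypothetical stand-in for a pre-trained backbone (a real one would carry ImageNet weights).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
# Freeze the backbone so it acts as a fixed feature extractor.
for param in backbone.parameters():
    param.requires_grad = False
# New task-specific head: two outputs for dogs vs. cats.
head = nn.Linear(8, 2)
model = nn.Sequential(backbone, head)
# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)
# One illustrative training step on a random batch of four 32x32 RGB images.
images = torch.randn(4, 3, 32, 32)
labels = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
After the step, only the head's weight and bias have gradients; the frozen
backbone parameters are left untouched.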

328
00:16:41.559 --> 00:16:43.879
<v Speaker 1>And you only train this new final layer, or maybe

329
00:16:43.879 --> 00:16:45.759
<v Speaker 1>the last few layers?

330
00:16:45.919 --> 00:16:48.879
<v Speaker 2>Exactly. You only train the small task-specific part of the

331
00:16:48.919 --> 00:16:52.759
<v Speaker 2>network using your relatively smaller data set, like the dogs

332
00:16:52.799 --> 00:16:56.960
<v Speaker 2>versus cats images. The bulk of the network's knowledge is transferred.

333
00:16:56.919 --> 00:16:59.240
<v Speaker 1>And the result on dogs versus cats?

334
00:16:59.480 --> 00:17:02.960
<v Speaker 2>It's dramatic. Instead of seventy five percent accuracy, using transfer learning with

335
00:17:03.039 --> 00:17:05.640
<v Speaker 2>a pre-trained ResNet can easily push you up to

336
00:17:05.720 --> 00:17:10.319
<v Speaker 2>ninety eight percent or higher. Massive improvement, leveraging knowledge learned

337
00:17:10.400 --> 00:17:11.799
<v Speaker 2>from a different, larger task.

338
00:17:12.039 --> 00:17:15.319
<v Speaker 1>That's amazing. And you can even peek inside these CNNs,

339
00:17:16.400 --> 00:17:17.799
<v Speaker 1>visualize what they're learning?

340
00:17:17.920 --> 00:17:20.400
<v Speaker 2>Yeah, it's fascinating. You can look at the activations, the

341
00:17:20.440 --> 00:17:24.079
<v Speaker 2>output maps from different filters at different layers. In early layers,

342
00:17:24.160 --> 00:17:27.440
<v Speaker 2>you'll see activations responding to simple things like edges and corners.

343
00:17:27.599 --> 00:17:31.759
<v Speaker 2>Go deeper, and you see activations responding to more complex textures, patterns,

344
00:17:31.880 --> 00:17:34.039
<v Speaker 2>or even parts of objects like eyes or snouts.

345
00:17:34.240 --> 00:17:36.519
<v Speaker 1>It really gives you a sense that the network is building up

346
00:17:36.599 --> 00:17:38.759
<v Speaker 1>understanding hierarchically.

347
00:17:38.920 --> 00:17:41.680
<v Speaker 2>It does. It demystifies the black box a little bit. And beyond

348
00:17:41.799 --> 00:17:44.839
<v Speaker 2>VGG and ResNet, there are other cool architectures. You mentioned

349
00:17:44.880 --> 00:17:48.920
<v Speaker 2>ResNet solving the vanishing gradient issue with skip connections.

350
00:17:48.519 --> 00:17:50.359
<v Speaker 1>Right, letting information bypass layers.

351
00:17:50.680 --> 00:17:54.599
<v Speaker 2>Then there's Inception, or GoogLeNet, which cleverly uses parallel

352
00:17:54.640 --> 00:17:58.240
<v Speaker 2>convolutional filters of different sizes (one by one, three by three,

353
00:17:58.319 --> 00:18:01.559
<v Speaker 2>five by five) in the same layer and concatenates

354
00:18:01.559 --> 00:18:05.799
<v Speaker 2>their outputs. It captures features at multiple scales simultaneously, and

355
00:18:05.839 --> 00:18:09.400
<v Speaker 2>it uses one by one convolutions smartly for dimensionality reduction,

356
00:18:09.680 --> 00:18:12.960
<v Speaker 2>making it efficient. And DenseNet: DenseNet took connectivity

357
00:18:12.960 --> 00:18:16.839
<v Speaker 2>even further. Each layer receives inputs from all preceding layers

358
00:18:16.880 --> 00:18:19.599
<v Speaker 2>and passes its own feature maps to all subsequent layers.

359
00:18:20.000 --> 00:18:23.400
<v Speaker 2>It sounds complex, but it actually encourages feature reuse and

360
00:18:23.440 --> 00:18:26.279
<v Speaker 2>can lead to models with fewer parameters that are very effective.

361
00:18:26.559 --> 00:18:29.319
<v Speaker 1>So if one of these powerful models gives great results,

362
00:18:29.359 --> 00:18:31.559
<v Speaker 1>can you do even better by combining them?

363
00:18:31.680 --> 00:18:35.039
<v Speaker 2>Yes, that's model ensembling. You train several different high performing models,

364
00:18:35.039 --> 00:18:38.000
<v Speaker 2>maybe a ResNet, an Inception, and a DenseNet, independently

365
00:18:38.039 --> 00:18:40.319
<v Speaker 2>on the same task. Then for a new image, you

366
00:18:40.359 --> 00:18:42.839
<v Speaker 2>get predictions from all of them and combine those predictions,

367
00:18:43.000 --> 00:18:45.640
<v Speaker 2>often just by averaging their output probabilities or taking a

368
00:18:45.680 --> 00:18:46.599
<v Speaker 2>majority vote.

369
00:18:46.480 --> 00:18:48.519
<v Speaker 1>And that actually improves accuracy further?

370
00:18:48.880 --> 00:18:52.279
<v Speaker 2>Often, yes. It can smooth out the errors or biases

371
00:18:52.319 --> 00:18:56.960
<v Speaker 2>of individual models. For dogs versus cats, ensembling can nudge

372
00:18:57.000 --> 00:19:00.680
<v Speaker 2>accuracy even higher, maybe to ninety nine point three percent or more.

373
00:19:01.240 --> 00:19:03.680
<v Speaker 2>The downside is, well, you have to train and run

374
00:19:03.759 --> 00:19:06.759
<v Speaker 2>multiple models, so it's computationally more expensive.
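
NOTE
Editor's sketch of the two combination rules mentioned above: averaging output
probabilities and majority voting. The three probability rows are made-up
numbers standing in for the outputs of three trained models on one image.
```python
from collections import Counter
# Hypothetical [P(cat), P(dog)] outputs from three models for one image.
model_probs = [
    [0.40, 0.60],
    [0.55, 0.45],
    [0.20, 0.80],
]
classes = ["cat", "dog"]
def average_probs(probs):
    """Average each class probability across models."""
    n_models = len(probs)
    return [sum(col) / n_models for col in zip(*probs)]
def majority_vote(probs):
    """Each model votes for its top class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return Counter(votes).most_common(1)[0][0]
avg = average_probs(model_probs)
avg_choice = classes[max(range(len(avg)), key=avg.__getitem__)]
vote_choice = classes[majority_vote(model_probs)]
print(avg, avg_choice, vote_choice)
```
Here the second model disagrees with the other two, and both rules still pick
"dog". Every model has to run on every image, which is the extra cost the
speakers mention.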

375
00:19:06.319 --> 00:19:09.640
<v Speaker 1>A trade off between performance and cost. Okay, let's switch

376
00:19:09.640 --> 00:19:14.839
<v Speaker 1>gears now, from seeing to understanding language: natural language processing,

377
00:19:15.240 --> 00:19:17.240
<v Speaker 1>or NLP. Text data is different. It's sequential.

378
00:19:17.279 --> 00:19:20.480
<v Speaker 2>Absolutely. The meaning often depends on the order

379
00:19:20.519 --> 00:19:24.319
<v Speaker 2>of words, so the first step is usually tokenization: breaking

380
00:19:24.319 --> 00:19:26.240
<v Speaker 2>the text down into smaller units, or tokens.

381
00:19:25.960 --> 00:19:28.440
<v Speaker 1>Like words or characters?

382
00:19:28.519 --> 00:19:31.839
<v Speaker 2>Could be either. For a review like "just perfect," splitting

383
00:19:31.880 --> 00:19:35.119
<v Speaker 2>by spaces gives you the word tokens "just" and "perfect." Using Python's

384
00:19:35.119 --> 00:19:37.400
<v Speaker 2>list function on the phrase would give you character tokens:

385
00:19:37.519 --> 00:19:40.200
<v Speaker 2>"j", "u", "s", "t", and so on. The choice depends on the task.
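
NOTE
Editor's sketch of the two tokenizations just described, using only built-in
Python. Splitting on whitespace is the simplest possible word tokenizer; real
pipelines usually also handle punctuation and casing.
```python
review = "just perfect"
# Word tokens: split on whitespace.
word_tokens = review.split()
# Character tokens: list() yields every character, including the space.
char_tokens = list(review)
print(word_tokens)       # ['just', 'perfect']
print(char_tokens[:4])   # ['j', 'u', 's', 't']
```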

386
00:19:40.319 --> 00:19:43.759
<v Speaker 1>Okay. Once you have tokens, you need numbers, right? Vectorization.

387
00:19:43.359 --> 00:19:46.400
<v Speaker 2>Right, we need to represent these tokens numerically. One old

388
00:19:46.440 --> 00:19:50.079
<v Speaker 2>method is one-hot encoding, where each unique word gets

389
00:19:50.119 --> 00:19:52.640
<v Speaker 2>a huge vector that's all zeros except for a single

390
00:19:52.680 --> 00:19:54.039
<v Speaker 2>one at its specific index.

391
00:19:54.359 --> 00:19:57.200
<v Speaker 1>Sounds very sparse, and it doesn't capture meaning, does it? Like,

392
00:19:57.400 --> 00:20:00.720
<v Speaker 1>king and queen would be totally unrelated vectors.

393
00:20:01.000 --> 00:20:05.240
<v Speaker 2>Exactly. It's rarely used in modern deep learning for NLP. Much

394
00:20:05.279 --> 00:20:09.359
<v Speaker 2>more powerful are word embeddings. These represent words as dense,

395
00:20:09.720 --> 00:20:13.119
<v Speaker 2>relatively low dimensional vectors, maybe one hundred or three hundred

396
00:20:13.160 --> 00:20:14.519
<v Speaker 2>dimensions instead of millions.

397
00:20:14.759 --> 00:20:16.480
<v Speaker 1>And these vectors capture meaning?

398
00:20:16.839 --> 00:20:19.680
<v Speaker 2>Yes, that's the key. They are learned in such a

399
00:20:19.680 --> 00:20:23.079
<v Speaker 2>way that words with similar meanings end up having similar

400
00:20:23.160 --> 00:20:27.759
<v Speaker 2>vector representations, like the vector for king might be mathematically

401
00:20:27.759 --> 00:20:31.440
<v Speaker 2>close to the vector for queen or monarch. They capture

402
00:20:31.440 --> 00:20:32.680
<v Speaker 2>semantic relationships.
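
NOTE
Editor's sketch contrasting one-hot vectors with dense embeddings via cosine
similarity. The three-dimensional embedding values are invented for
illustration; real embeddings are learned and, as noted above, typically have
a hundred or more dimensions.
```python
import math
vocab = ["king", "queen", "apple"]
def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]
def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
# Hypothetical embeddings chosen so that king and queen point in similar directions.
embedding = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}
print(cosine(one_hot("king"), one_hot("queen")))       # 0.0: distinct one-hot vectors are orthogonal
print(cosine(embedding["king"], embedding["queen"]))   # close to 1
print(cosine(embedding["king"], embedding["apple"]))   # much smaller
```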

403
00:20:32.960 --> 00:20:34.440
<v Speaker 1>How are these embeddings learned?

404
00:20:34.839 --> 00:20:37.319
<v Speaker 2>Often they're learned as part of training a larger model

405
00:20:37.359 --> 00:20:40.880
<v Speaker 2>on a specific task, like sentiment analysis on the IMDb

406
00:20:41.000 --> 00:20:43.599
<v Speaker 2>movie review dataset mentioned in the book. The network

407
00:20:43.680 --> 00:20:46.039
<v Speaker 2>learns embeddings that help it predict whether a review is

408
00:20:46.079 --> 00:20:47.039
<v Speaker 2>positive or negative.

409
00:20:47.200 --> 00:20:50.240
<v Speaker 1>And, like with images, can you use pre-trained embeddings?

410
00:20:50.440 --> 00:20:54.359
<v Speaker 2>Absolutely, and it's very common. Models like GloVe or

411
00:20:54.359 --> 00:20:57.400
<v Speaker 2>fastText are trained on massive text corpora, like all of

412
00:20:57.400 --> 00:21:01.839
<v Speaker 2>Wikipedia or the Web to produce general word embeddings. If

413
00:21:01.839 --> 00:21:04.200
<v Speaker 2>you don't have much text data for your specific task,

414
00:21:04.759 --> 00:21:07.799
<v Speaker 2>using these pre-trained embeddings gives your model a huge

415
00:21:07.839 --> 00:21:12.440
<v Speaker 2>head start on understanding language. torchtext is a useful library here.

416
00:21:12.680 --> 00:21:15.880
<v Speaker 1>So we have numerical representations of words that capture meaning. How

417
00:21:15.920 --> 00:21:18.079
<v Speaker 1>do we process sequences where order matters?

418
00:21:18.160 --> 00:21:22.359
<v Speaker 2>The classic approach is recurrent neural networks, RNNs. They're designed

419
00:21:22.359 --> 00:21:25.559
<v Speaker 2>to handle sequences by processing tokens one by one while

420
00:21:25.599 --> 00:21:29.359
<v Speaker 2>maintaining an internal hidden state. This state acts like a memory,

421
00:21:29.559 --> 00:21:32.640
<v Speaker 2>accumulating information from previous tokens in the sequence.

422
00:21:32.759 --> 00:21:35.119
<v Speaker 1>Sounds good, but I remember reading they have a weakness,

423
00:21:35.119 --> 00:21:36.599
<v Speaker 1>something about long sequences.

424
00:21:36.720 --> 00:21:40.799
<v Speaker 2>Yeah, the long term dependency problem. Standard RNNs struggle to

425
00:21:40.799 --> 00:21:44.119
<v Speaker 2>retain information from tokens seen much earlier in a long sequence.

426
00:21:44.640 --> 00:21:47.400
<v Speaker 2>The signal tends to fade or get overwritten. It's hard

427
00:21:47.440 --> 00:21:49.720
<v Speaker 2>for them to connect, say, the subject at the beginning

428
00:21:49.720 --> 00:21:52.839
<v Speaker 2>of a long paragraph to a verb much later. Also,

429
00:21:52.880 --> 00:21:55.519
<v Speaker 2>the vanishing gradient problem can hit them hard during training.

430
00:21:55.720 --> 00:21:56.519
<v Speaker 1>So what's the fix?

431
00:21:56.880 --> 00:22:01.160
<v Speaker 2>Long short-term memory networks, LSTMs. They are a special,

432
00:22:01.519 --> 00:22:06.000
<v Speaker 2>more complex type of RNN, specifically designed to overcome these issues.

433
00:22:06.079 --> 00:22:06.799
<v Speaker 1>How do they do it?

434
00:22:07.200 --> 00:22:11.160
<v Speaker 2>LSTMs have a more sophisticated internal structure. They introduce a

435
00:22:11.200 --> 00:22:14.240
<v Speaker 2>cell state that runs through the entire chain, acting like

436
00:22:14.279 --> 00:22:17.720
<v Speaker 2>a conveyor belt for information, making it easier for context

437
00:22:17.759 --> 00:22:21.839
<v Speaker 2>to persist over long distances. And they use gates: input gates,

438
00:22:22.079 --> 00:22:23.759
<v Speaker 2>forget gates, output gates.

439
00:22:23.319 --> 00:22:25.880
<v Speaker 1>Gates, like little controllers?

440
00:22:26.240 --> 00:22:29.119
<v Speaker 2>Exactly. They are small neural networks themselves that learn to control

441
00:22:29.160 --> 00:22:32.240
<v Speaker 2>the flow of information. The forget gate learns what old

442
00:22:32.279 --> 00:22:35.000
<v Speaker 2>information to throw away from the cell state. The input

443
00:22:35.000 --> 00:22:37.680
<v Speaker 2>gate learns what new information from the current token to add.

444
00:22:38.119 --> 00:22:40.640
<v Speaker 2>The output gate learns what part of the cell state

445
00:22:40.680 --> 00:22:43.279
<v Speaker 2>to output as the hidden state for the next step.

446
00:22:43.680 --> 00:22:47.200
<v Speaker 2>This selective memory management lets them handle long term dependencies

447
00:22:47.279 --> 00:22:47.759
<v Speaker 2>much better.
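
NOTE
Editor's sketch of a single LSTM time step with scalar states, to make the
three gates concrete. It follows the standard LSTM equations; the weight
names and the value 0.5 are hypothetical, and a real layer uses learned
weight matrices rather than scalars.
```python
import math
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with scalar input, hidden state, and cell state."""
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])    # forget gate
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])    # input gate
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])    # output gate
    g = math.tanh(w["g_x"] * x + w["g_h"] * h_prev + w["g_b"])  # candidate values
    c = f * c_prev + i * g    # cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)      # hidden state passed to the next step
    return h, c
# Hypothetical weights; in practice they are learned by backpropagation.
names = [gate + suffix for gate in "fiog" for suffix in ("_x", "_h", "_b")]
weights = {name: 0.5 for name in names}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```
With a forget gate near 1 and an input gate near 0, the cell state passes
through a step almost unchanged, which is the conveyor-belt behavior
described above.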

448
00:22:47.880 --> 00:22:50.759
<v Speaker 1>That sounds much more capable. Are LSTMs the only way?

449
00:22:50.960 --> 00:22:56.160
<v Speaker 2>Not necessarily. Sometimes one-dimensional convolutional networks, similar to the

450
00:22:56.240 --> 00:22:59.599
<v Speaker 2>CNNs used for images but applied along the sequence dimension,

451
00:23:00.039 --> 00:23:02.960
<v Speaker 2>can be very effective for text tasks too. They can

452
00:23:03.000 --> 00:23:06.880
<v Speaker 2>capture local patterns like phrases efficiently and are often faster

453
00:23:07.000 --> 00:23:09.000
<v Speaker 2>to train than LSTMs.

454
00:23:09.599 --> 00:23:12.440
<v Speaker 1>Interesting. So what are some big applications of these sequence models

455
00:23:12.480 --> 00:23:12.960
<v Speaker 1>in NLP?

456
00:23:13.240 --> 00:23:17.000
<v Speaker 2>A fundamental one is language modeling: predicting the next word

457
00:23:17.279 --> 00:23:19.319
<v Speaker 2>in a sequence given the preceding words.

458
00:23:19.720 --> 00:23:22.480
<v Speaker 1>Seems simple, but I guess that's the basis for a lot.

459
00:23:22.519 --> 00:23:26.160
<v Speaker 2>Huge applications: autocomplete on your phone, for instance, but also

460
00:23:26.319 --> 00:23:29.039
<v Speaker 2>machine translation (predicting the next word in the target language),

461
00:23:29.240 --> 00:23:33.400
<v Speaker 2>image captioning (generating a textual description word by word), summarization,

462
00:23:33.480 --> 00:23:36.799
<v Speaker 2>and even creative text generation: writing stories, poems, or code.

463
00:23:36.920 --> 00:23:39.279
<v Speaker 1>And this is where models like BERT and GPT come in?

464
00:23:39.519 --> 00:23:43.240
<v Speaker 2>Yes, exactly. The transformer architecture, which relies on a mechanism

465
00:23:43.359 --> 00:23:47.920
<v Speaker 2>called self attention rather than recurrence, really revolutionized language modeling

466
00:23:47.960 --> 00:23:51.599
<v Speaker 2>around twenty seventeen. It allowed for much better parallelization and

467
00:23:51.680 --> 00:23:55.160
<v Speaker 2>capturing long range dependencies. This paved the way for massive

468
00:23:55.200 --> 00:23:58.839
<v Speaker 2>pre-trained language models like ELMo, BERT, and the GPT

469
00:23:58.920 --> 00:24:01.599
<v Speaker 2>series: GPT two, GPT three, and beyond.

470
00:24:01.960 --> 00:24:04.559
<v Speaker 1>These models are trained on enormous amounts of text and

471
00:24:04.599 --> 00:24:08.440
<v Speaker 1>can then be fine-tuned for various specific NLP tasks,

472
00:24:08.799 --> 00:24:10.240
<v Speaker 1>achieving state of the art results.

473
00:24:10.279 --> 00:24:13.480
<v Speaker 2>Precisely, they have a remarkable grasp of language. Of course,

474
00:24:13.519 --> 00:24:16.119
<v Speaker 2>the power of models like GPT two also raised important

475
00:24:16.119 --> 00:24:19.920
<v Speaker 2>ethical discussions about potential misuse like generating fake news.

476
00:24:20.039 --> 00:24:22.920
<v Speaker 1>A crucial point. Okay, we've covered networks that see and

477
00:24:22.920 --> 00:24:26.680
<v Speaker 1>networks that understand language. What about models that create things

478
00:24:27.079 --> 00:24:29.559
<v Speaker 1>or models that learn through trial and error?

479
00:24:29.640 --> 00:24:33.000
<v Speaker 2>Ah, yes, this gets us into generative models and reinforcement learning,

480
00:24:33.039 --> 00:24:36.039
<v Speaker 2>really exciting areas. Let's start with autoencoders.

481
00:24:36.079 --> 00:24:36.640
<v Speaker 1>Autoencoders. What's their goal?

482
00:24:36.880 --> 00:24:41.759
<v Speaker 2>They're generally unsupervised learning algorithms. Their goal is simple: learn

483
00:24:41.839 --> 00:24:45.160
<v Speaker 2>to reconstruct their own input. They typically have two parts:

484
00:24:45.680 --> 00:24:48.119
<v Speaker 2>an encoder that compresses the input data into a lower

485
00:24:48.119 --> 00:24:52.400
<v Speaker 2>dimensional representation, the bottleneck or latent space, and a decoder

486
00:24:52.400 --> 00:24:55.519
<v Speaker 2>that tries to rebuild the original input from that compressed representation.

487
00:24:55.920 --> 00:24:58.599
<v Speaker 1>Why compress it just to rebuild it? What's the use?

488
00:24:59.119 --> 00:25:03.559
<v Speaker 2>Several things. That compressed representation, the bottleneck, forces the network

489
00:25:03.599 --> 00:25:06.480
<v Speaker 2>to learn the most salient features of the data, so

490
00:25:06.519 --> 00:25:09.480
<v Speaker 2>they can be used for dimensionality reduction, or for

491
00:25:09.519 --> 00:25:12.039
<v Speaker 2>denoising, if you train it on noisy images but

492
00:25:12.119 --> 00:25:15.440
<v Speaker 2>ask it to reconstruct clean ones. And importantly, they form

493
00:25:15.480 --> 00:25:17.400
<v Speaker 2>the basis for some generative models.

494
00:25:16.960 --> 00:25:20.640
<v Speaker 1>Like variational autoencoders, VAEs?

495
00:25:20.759 --> 00:25:23.680
<v Speaker 2>Exactly. VAEs are a type of generative autoencoder. Instead of

496
00:25:23.680 --> 00:25:25.920
<v Speaker 2>mapping an input to a single fixed point in the

497
00:25:26.000 --> 00:25:29.480
<v Speaker 2>latent space, the encoder maps it to a probability distribution,

498
00:25:29.960 --> 00:25:34.039
<v Speaker 2>usually a Gaussian distribution. Why? Because then, to generate

499
00:25:34.119 --> 00:25:36.359
<v Speaker 2>new data, you can just sample a point from that

500
00:25:36.440 --> 00:25:39.039
<v Speaker 2>learned distribution in the latent space and feed it to

501
00:25:39.119 --> 00:25:42.319
<v Speaker 2>the decoder. Since it learned to decode points from that

502
00:25:42.359 --> 00:25:45.920
<v Speaker 2>general area into realistic outputs, it can generate novel examples

503
00:25:45.960 --> 00:25:49.160
<v Speaker 2>that look similar to the training data but aren't exact copies.

504
00:25:49.599 --> 00:25:53.799
<v Speaker 2>Think generating new faces that look plausible. Features like smiling

505
00:25:53.920 --> 00:25:57.400
<v Speaker 2>might be represented probabilistically across the latent space.

506
00:25:57.720 --> 00:26:00.759
<v Speaker 1>How do you train something that involves sampling? Isn't that tricky?

507
00:26:01.319 --> 00:26:03.920
<v Speaker 2>It uses a clever technique called the reparameterization trick,

508
00:26:04.359 --> 00:26:07.279
<v Speaker 2>which allows gradients to flow back through the sampling process,

509
00:26:07.680 --> 00:26:10.559
<v Speaker 2>making the VAE trainable with standard backpropagation.
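
NOTE
Editor's sketch of the reparameterization trick for one latent dimension. The
encoder outputs (`mu`, `log_var`) are made-up numbers, and the finite
difference at the end is only there to show that a gradient with respect to
`mu` exists once the noise is drawn separately.
```python
import math
import random
def reparameterize(mu, log_var, eps):
    """z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps
random.seed(0)
mu, log_var = 1.5, -2.0        # hypothetical encoder outputs
eps = random.gauss(0.0, 1.0)   # noise drawn outside the differentiable path
z = reparameterize(mu, log_var, eps)
# With eps held fixed, z is an ordinary function of mu and log_var,
# so gradients exist: dz/dmu = 1 and dz/dsigma = eps.
step = 1e-6
dz_dmu = (reparameterize(mu + step, log_var, eps) - z) / step
print(z, dz_dmu)
```
Sampling z directly from the distribution would leave no such path from z
back to the encoder outputs.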

510
00:26:11.119 --> 00:26:15.400
<v Speaker 1>Okay, cool. What about restricted Boltzmann machines, RBMs, and deep

511
00:26:15.400 --> 00:26:17.960
<v Speaker 1>belief networks, DBNs? They sound a bit old-school.

512
00:26:17.839 --> 00:26:21.319
<v Speaker 2>They are, but conceptually important. An RBM is

513
00:26:21.319 --> 00:26:24.599
<v Speaker 2>a simple two layer network, a visible layer for input

514
00:26:24.960 --> 00:26:27.920
<v Speaker 2>and a hidden layer that learns patterns in an unsupervised way.

515
00:26:28.400 --> 00:26:30.680
<v Speaker 2>They were often used for things like collaborative filtering.

516
00:26:30.160 --> 00:26:32.200
<v Speaker 1>Like movie recommendations?

517
00:26:32.599 --> 00:26:36.799
<v Speaker 2>Exactly. Imagine the visible layer represents movies you've liked or disliked.

518
00:26:37.359 --> 00:26:40.559
<v Speaker 2>The RBM learns connections to hidden units that might represent

519
00:26:40.640 --> 00:26:45.119
<v Speaker 2>underlying factors like genres or actor preferences, even if those

520
00:26:45.119 --> 00:26:48.680
<v Speaker 2>aren't explicitly labeled. Then it can use these learned factors

521
00:26:48.720 --> 00:26:51.519
<v Speaker 2>to predict if you'd like other movies. A deep belief

522
00:26:51.559 --> 00:26:55.880
<v Speaker 2>network, DBN, is essentially a stack of RBMs trained layer

523
00:26:55.920 --> 00:26:59.519
<v Speaker 2>by layer in an unsupervised fashion, often followed by supervised

524
00:26:59.519 --> 00:26:59.960
<v Speaker 2>fine tuning.

525
00:27:00.319 --> 00:27:04.000
<v Speaker 1>Interesting. Now for the big one in generative models: generative

526
00:27:04.000 --> 00:27:06.799
<v Speaker 1>adversarial networks, GANs. Sounds like a competition.

527
00:27:06.880 --> 00:27:10.440
<v Speaker 2>It absolutely is, yeah. Popularized by Ian Goodfellow in twenty fourteen,

528
00:27:10.799 --> 00:27:13.960
<v Speaker 2>GANs involve two neural networks locked in a contest. You

529
00:27:14.000 --> 00:27:16.480
<v Speaker 2>have a generator network that tries to create fake data

530
00:27:16.599 --> 00:27:19.680
<v Speaker 2>like images that looks realistic, and you have a discriminator

531
00:27:19.680 --> 00:27:22.359
<v Speaker 2>network that tries to distinguish between real data from the

532
00:27:22.400 --> 00:27:24.960
<v Speaker 2>training set and the fake data created by the generator.

533
00:27:25.000 --> 00:27:27.319
<v Speaker 1>The classic counterfeiter and police analogy, right.

534
00:27:27.440 --> 00:27:30.920
<v Speaker 2>That's the one. The generator wants to fool the discriminator,

535
00:27:31.240 --> 00:27:33.960
<v Speaker 2>the discriminator wants to catch the fakes. They train together.

536
00:27:34.480 --> 00:27:37.559
<v Speaker 2>As the discriminator gets better at spotting fakes, the generator

537
00:27:37.559 --> 00:27:40.039
<v Speaker 2>has to get better at creating more convincing fakes to

538
00:27:40.119 --> 00:27:44.480
<v Speaker 2>fool it. This adversarial process pushes both networks to improve,

539
00:27:44.799 --> 00:27:48.839
<v Speaker 2>often resulting in generators that can create stunningly realistic synthetic data.

540
00:27:48.880 --> 00:27:53.039
<v Speaker 2>How does the discriminator learn? It's trained like a standard classifier,

541
00:27:53.119 --> 00:27:55.799
<v Speaker 2>on a mix of real images labeled as real and

542
00:27:55.880 --> 00:27:59.319
<v Speaker 2>fake images from the generator labeled as fake. Its feedback,

543
00:27:59.440 --> 00:28:02.039
<v Speaker 2>how well it classifies them, is used both to train

544
00:28:02.079 --> 00:28:04.519
<v Speaker 2>itself and to guide the generator on how to improve

545
00:28:04.559 --> 00:28:09.640
<v Speaker 2>its fakes. DCGANs, deep convolutional GANs, were an early successful

546
00:28:09.680 --> 00:28:11.599
<v Speaker 2>architecture for generating decent images.
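
NOTE
Editor's sketch of how the discriminator's outputs become two loss values, one
for each network. The probabilities are invented, and the generator loss shown
is the commonly used "label the fakes as real" form, which is an assumption
here rather than something the speakers specify.
```python
import math
def bce(prediction, target):
    """Binary cross-entropy for a single probability and a 0/1 label."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))
# Hypothetical discriminator outputs: probability that an image is real.
d_on_real = [0.9, 0.8]
d_on_fake = [0.3, 0.1]
# Discriminator: real images carry label 1, generated images label 0.
d_loss = (sum(bce(p, 1) for p in d_on_real) + sum(bce(p, 0) for p in d_on_fake)) / 4
# Generator: wants the discriminator to say "real" (label 1) on its fakes.
g_loss = sum(bce(p, 1) for p in d_on_fake) / 2
print(round(d_loss, 4), round(g_loss, 4))
```
In this snapshot the discriminator is winning: its loss is low while the
generator's is high, so the generator receives the stronger push to improve.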

547
00:28:11.680 --> 00:28:15.200
<v Speaker 1>Incredible stuff. Okay, finally, let's talk about machines learning to act.

548
00:28:15.640 --> 00:28:17.920
<v Speaker 1>Reinforcement learning, RL.

549
00:28:18.359 --> 00:28:21.160
<v Speaker 2>Right. RL is about training an agent, like a robot or

550
00:28:21.200 --> 00:28:23.880
<v Speaker 2>a game AI, to make sequences of decisions in an

551
00:28:24.000 --> 00:28:26.599
<v Speaker 2>environment like the real world or a game level to

552
00:28:26.680 --> 00:28:29.039
<v Speaker 2>maximize some notion of cumulative reward.

553
00:28:29.240 --> 00:28:32.839
<v Speaker 1>So it learns by doing, getting feedback?

554
00:28:33.039 --> 00:28:35.039
<v Speaker 2>Exactly. The agent takes an action in a certain state of

555
00:28:35.079 --> 00:28:38.519
<v Speaker 2>the environment. The environment responds by transitioning to a new

556
00:28:38.559 --> 00:28:41.920
<v Speaker 2>state and giving the agent a reward: positive for good

557
00:28:41.920 --> 00:28:44.759
<v Speaker 2>actions, negative for bad. The agent's goal is to learn

558
00:28:44.759 --> 00:28:48.440
<v Speaker 2>a policy, a strategy for choosing actions that maximizes the

559
00:28:48.480 --> 00:28:51.400
<v Speaker 2>total reward it collects over time. It's a continuous loop

560
00:28:51.440 --> 00:28:54.920
<v Speaker 2>of observe, act, get rewarded, learn.

561
00:28:55.200 --> 00:28:57.000
<v Speaker 1>Is it always learning from direct interaction?

562
00:28:57.200 --> 00:29:00.960
<v Speaker 2>Not always. There's model based RL, where the agent tries

563
00:29:01.000 --> 00:29:03.519
<v Speaker 2>to learn a model of how the environment works, predicting

564
00:29:03.559 --> 00:29:06.440
<v Speaker 2>next states and rewards, and then plans using that model.

565
00:29:06.720 --> 00:29:09.839
<v Speaker 2>More common, perhaps, is model free RL, where the agent

566
00:29:09.920 --> 00:29:12.440
<v Speaker 2>learns directly through trial and error, figuring out which actions

567
00:29:12.519 --> 00:29:15.599
<v Speaker 2>lead to good rewards in which states without explicitly building

568
00:29:15.599 --> 00:29:16.480
<v Speaker 2>a model of the world.

569
00:29:16.720 --> 00:29:19.359
<v Speaker 1>How does deep learning fit into this? RL sounds like

570
00:29:19.400 --> 00:29:21.960
<v Speaker 1>it could involve really complex states like pixels on a

571
00:29:22.000 --> 00:29:22.599
<v Speaker 1>game screen.

572
00:29:22.720 --> 00:29:26.079
<v Speaker 2>That's where deep Q-networks, DQNs, made a huge splash,

573
00:29:26.079 --> 00:29:29.599
<v Speaker 2>particularly with playing Atari games directly from pixels. A DQN

574
00:29:29.720 --> 00:29:32.160
<v Speaker 2>uses a deep neural network, often a CNN for visual

575
00:29:32.160 --> 00:29:36.279
<v Speaker 2>input, to approximate the Q value. The Q value Q(s, a)

576
00:29:36.880 --> 00:29:40.000
<v Speaker 2>represents the expected future reward an agent can get if

577
00:29:40.000 --> 00:29:42.880
<v Speaker 2>it takes action a in state s and then follows

578
00:29:42.880 --> 00:29:46.720
<v Speaker 2>the optimal policy thereafter. The network learns to predict these

579
00:29:46.799 --> 00:29:49.920
<v Speaker 2>Q values for all possible actions in a given state.

580
00:29:50.400 --> 00:29:52.720
<v Speaker 2>The best action is simply the one with the highest

581
00:29:52.720 --> 00:29:54.079
<v Speaker 2>predicted Q value.

582
00:29:53.920 --> 00:29:56.279
<v Speaker 1>So the deep network learns to evaluate how good each

583
00:29:56.319 --> 00:29:57.599
<v Speaker 1>action is?

584
00:29:58.240 --> 00:30:02.599
<v Speaker 2>Exactly. DQNs introduced some key innovations. One is the DQN loss function.

585
00:30:03.640 --> 00:30:06.519
<v Speaker 2>It uses a separate target network, a slightly older copy

586
00:30:06.519 --> 00:30:09.240
<v Speaker 2>of the main Q network to provide more stable target

587
00:30:09.319 --> 00:30:12.319
<v Speaker 2>Q values during training, preventing oscillations.
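
NOTE
Editor's sketch of greedy action selection and the training target built from
the target network. The Q-value lists are hypothetical network outputs, and
the discount factor 0.99 and the squared-error loss are assumed, commonly
used choices.
```python
GAMMA = 0.99  # discount factor (assumed value)
def greedy_action(q_values):
    """Index of the action with the highest predicted Q value."""
    return max(range(len(q_values)), key=q_values.__getitem__)
def td_target(reward, next_q_target, done):
    """Target for Q(s, a): r + gamma * max over a' of Q_target(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + GAMMA * max(next_q_target)
# Hypothetical outputs for a game with three actions.
q_online = [0.2, 1.1, 0.7]        # main network, current state
q_target_next = [0.5, 0.9, 1.4]   # target network, next state
action = greedy_action(q_online)
target = td_target(1.0, q_target_next, done=False)
loss = (q_online[action] - target) ** 2   # squared error for the chosen action
print(action, target, loss)
```
Because `q_target_next` comes from the older copy of the network, the target
does not shift every time the main network is updated.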

588
00:30:11.799 --> 00:30:13.319
<v Speaker 1>And experience replay?

589
00:30:13.640 --> 00:30:16.480
<v Speaker 2>Crucial. Instead of learning only from the very last action, the

590
00:30:16.519 --> 00:30:20.359
<v Speaker 2>agent stores its experiences, (state, action, reward, next state) tuples,

591
00:30:20.640 --> 00:30:23.839
<v Speaker 2>in a large memory buffer. During training, it samples random

592
00:30:23.880 --> 00:30:28.039
<v Speaker 2>mini-batches from this buffer. This breaks correlations in sequential experiences,

593
00:30:28.200 --> 00:30:31.440
<v Speaker 2>improves data efficiency, and prevents the agent from getting stuck

594
00:30:31.440 --> 00:30:32.400
<v Speaker 2>in short term loops.
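
NOTE
Editor's sketch of an experience replay buffer as just described: a bounded
memory of (state, action, reward, next state) tuples sampled uniformly at
random. The class name, capacity, and integer "states" are illustrative
assumptions.
```python
import random
from collections import deque
class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest experiences drop off automatically
    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))
    def sample(self, batch_size):
        # Uniform random sampling breaks up the ordering of consecutive steps.
        return random.sample(list(self.memory), batch_size)
    def __len__(self):
        return len(self.memory)
random.seed(0)
buffer = ReplayBuffer(capacity=5)
for t in range(8):   # hypothetical transitions with integer states
    buffer.push(t, t % 2, 1.0, t + 1)
batch = buffer.sample(3)
print(len(buffer), batch)
```
Only the five most recent transitions survive, and each training batch mixes
them in random order.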

595
00:30:32.480 --> 00:30:34.720
<v Speaker 1>There was also something about double deep Q-learning.

596
00:30:35.039 --> 00:30:38.200
<v Speaker 2>Yes, double DQN is a refinement that helps reduce the

597
00:30:38.240 --> 00:30:42.440
<v Speaker 2>tendency of standard DQNs to overestimate Q values, leading

598
00:30:42.440 --> 00:30:44.519
<v Speaker 2>to more stable and sometimes better policies.

599
00:30:44.839 --> 00:30:48.319
<v Speaker 1>Are DQNs the only way deep learning is used in RL?

600
00:30:48.519 --> 00:30:52.440
<v Speaker 2>No, there are other major approaches. Policy gradient methods directly

601
00:30:52.480 --> 00:30:56.319
<v Speaker 2>learn the policy function, a network that outputs probabilities for

602
00:30:56.359 --> 00:30:59.319
<v Speaker 2>each action given a state, trying to directly optimize the

603
00:30:59.319 --> 00:31:01.480
<v Speaker 2>policy to maximize rewards.

604
00:31:01.000 --> 00:31:02.440
<v Speaker 1>And actor-critic methods?

605
00:31:02.599 --> 00:31:05.839
<v Speaker 2>Actor-critic methods combine ideas from both value-based methods, like

606
00:31:05.920 --> 00:31:09.440
<v Speaker 2>DQN, and policy-based methods. They typically have two networks:

607
00:31:09.839 --> 00:31:13.000
<v Speaker 2>an actor network that learns the policy (decides which action

608
00:31:13.079 --> 00:31:15.359
<v Speaker 2>to take) and a critic network that learns a value

609
00:31:15.359 --> 00:31:18.519
<v Speaker 2>function (evaluates how good the chosen action or current state is).

610
00:31:19.039 --> 00:31:21.359
<v Speaker 2>The critic helps train the actor more efficiently. A3C,

611
00:31:21.480 --> 00:31:25.440
<v Speaker 2>asynchronous advantage actor-critic, was a very influential algorithm

612
00:31:25.480 --> 00:31:28.359
<v Speaker 2>in this family, using multiple parallel agents to explore the

613
00:31:28.440 --> 00:31:29.240
<v Speaker 2>environment faster.

614
00:31:29.440 --> 00:31:33.400
<v Speaker 1>Wow. So RL powered by deep learning can learn complex

615
00:31:33.440 --> 00:31:37.240
<v Speaker 1>strategies in complex environments. What are the real-world applications

616
00:31:37.240 --> 00:31:38.359
<v Speaker 1>beyond games?

617
00:31:38.359 --> 00:31:43.920
<v Speaker 2>Tons. Robotics (teaching robots manipulation skills), optimizing traffic light control,

618
00:31:44.039 --> 00:31:48.480
<v Speaker 2>creating personalized recommendation systems that adapt over time, resource management,

619
00:31:48.680 --> 00:31:52.680
<v Speaker 2>and even generating creative content like images or music by

620
00:31:52.720 --> 00:31:55.440
<v Speaker 2>framing it as a sequential decision problem.

621
00:31:55.480 --> 00:31:58.079
<v Speaker 1>What an incredible journey we've taken through deep learning with

622
00:31:58.160 --> 00:32:02.799
<v Speaker 1>PyTorch, from the core concepts, AI, ML, DL, PyTorch's define-by-run

623
00:32:02.799 --> 00:32:07.839
<v Speaker 1>approach, through the anatomy of neural networks, activations, loss, optimizers,

624
00:32:07.880 --> 00:32:09.640
<v Speaker 1>the power of tensors.

625
00:32:09.319 --> 00:32:12.079
<v Speaker 2>Then seeing it all applied in computer vision with CNNs,

626
00:32:12.279 --> 00:32:15.319
<v Speaker 2>the magic of transfer learning, and visualizing

627
00:32:14.720 --> 00:32:19.160
<v Speaker 1>what models learn. And shifting to language with NLP: tokenization, embeddings,

628
00:32:19.200 --> 00:32:22.039
<v Speaker 1>the challenges of sequences handled by LSTMs, and the rise

629
00:32:22.039 --> 00:32:24.400
<v Speaker 1>of transformers and huge pre-trained models.

630
00:32:24.440 --> 00:32:28.000
<v Speaker 2>Finally, touching on generative models like VAEs and GANs, creating

631
00:32:28.039 --> 00:32:31.440
<v Speaker 2>new data, and RL agents learning through interaction with DQNs

632
00:32:31.480 --> 00:32:32.519
<v Speaker 2>and actor-critic methods.

633
00:32:32.599 --> 00:32:35.400
<v Speaker 1>Yeah, hopefully you listening now have a much stronger foundation

634
00:32:35.559 --> 00:32:38.119
<v Speaker 1>or real understanding of what's powering some of the most

635
00:32:38.160 --> 00:32:39.480
<v Speaker 1>advanced AI out there.

636
00:32:39.920 --> 00:32:43.079
<v Speaker 2>Absolutely. Yeah, and remember, this knowledge is really most valuable

637
00:32:43.079 --> 00:32:45.440
<v Speaker 2>when you start applying it, or at least digging deeper

638
00:32:45.480 --> 00:32:48.799
<v Speaker 2>into areas that sparked your interest. This deep dive is

639
00:32:48.799 --> 00:32:49.720
<v Speaker 2>a starting point.

640
00:32:50.079 --> 00:32:52.160
<v Speaker 1>So where should someone go next if they want to

641
00:32:52.160 --> 00:32:52.920
<v Speaker 1>continue learning?

642
00:32:53.160 --> 00:32:55.720
<v Speaker 2>Well, reading research papers is always a good way to

643
00:32:55.759 --> 00:32:58.359
<v Speaker 2>stay on the cutting edge. Sites like paperswithcode.com

644
00:32:58.440 --> 00:33:01.319
<v Speaker 2>link papers to code implementations, which is

645
00:33:01.319 --> 00:33:05.440
<v Speaker 2>super helpful. arxiv-sanity.com helps filter the firehose

646
00:33:05.480 --> 00:33:06.759
<v Speaker 2>of new papers.

647
00:33:06.279 --> 00:33:09.920
<v Speaker 1>Those are good resources. What about specific topics?

648
00:33:10.079 --> 00:33:12.759
<v Speaker 2>If computer vision grabbed you, maybe you look into object

649
00:33:12.799 --> 00:33:16.680
<v Speaker 2>detection models like SSD, Faster R-CNN, or YOLO, or image

650
00:33:16.680 --> 00:33:20.400
<v Speaker 2>segmentation with Mask R-CNN. If language is your thing, exploring

651
00:33:20.440 --> 00:33:24.160
<v Speaker 2>libraries like Hugging Face's Transformers or open-source translation projects

652
00:33:24.200 --> 00:33:26.519
<v Speaker 2>like OpenNMT could be great next steps.

653
00:33:26.640 --> 00:33:27.880
<v Speaker 1>Fantastic suggestions.

654
00:33:27.960 --> 00:33:30.400
<v Speaker 2>The key is to keep exploring and if possible, get

655
00:33:30.440 --> 00:33:31.759
<v Speaker 2>your hands dirty with some code.

656
00:33:31.920 --> 00:33:34.839
<v Speaker 1>Definitely. So as we wrap up, here's something to think about.

657
00:33:34.880 --> 00:33:39.119
<v Speaker 3>We've talked about all these powerful tools, models that see,

658
00:33:39.680 --> 00:33:43.480
<v Speaker 3>understand language, generate content, make decisions. How deeply are these

659
00:33:43.480 --> 00:33:46.200
<v Speaker 3>technologies already woven into our daily lives, maybe in ways

660
00:33:46.240 --> 00:33:49.200
<v Speaker 3>we don't even notice? And looking forward, what new possibilities,

661
00:33:49.279 --> 00:33:51.680
<v Speaker 3>or perhaps what new questions does that raise for you

662
00:33:51.839 --> 00:33:52.680
<v Speaker 3>and for society?

663
00:33:52.799 --> 00:33:54.839
<v Speaker 1>Something to mull over. Thanks for joining us on the

664
00:33:54.880 --> 00:33:55.400
<v Speaker 1>deep dive.
