WEBVTT

1
00:00:00.160 --> 00:00:03.080
<v Speaker 1>Welcome to the Deep Dive. Today, we're embarking on a

2
00:00:03.120 --> 00:00:07.440
<v Speaker 1>journey into the powerful world of deep learning, seen specifically

3
00:00:07.480 --> 00:00:08.320
<v Speaker 1>through the lens of R.

4
00:00:08.560 --> 00:00:09.000
<v Speaker 2>That's right.

5
00:00:09.080 --> 00:00:12.119
<v Speaker 1>Our mission is to extract the most important insights and

6
00:00:12.919 --> 00:00:15.880
<v Speaker 1>practical applications from Hands-On Deep Learning with R by

7
00:00:15.880 --> 00:00:17.280
<v Speaker 1>Michael Pawlus and Rodger Devine.

8
00:00:17.359 --> 00:00:19.480
<v Speaker 2>And this isn't just about theory, right? Think of this

9
00:00:19.519 --> 00:00:24.600
<v Speaker 2>as your guide to designing, building, and truly improving neural

10
00:00:24.640 --> 00:00:28.199
<v Speaker 2>network models. We're distilling it down into core concepts and

11
00:00:28.239 --> 00:00:31.920
<v Speaker 2>some surprising applications. It's kind of a shortcut really to

12
00:00:32.079 --> 00:00:34.960
<v Speaker 2>being genuinely well informed in this complex space.

13
00:00:35.840 --> 00:00:38.679
<v Speaker 1>So, whether you're prepping for a big meeting, maybe just

14
00:00:38.719 --> 00:00:41.000
<v Speaker 1>catching up on the latest in data science, or you're

15
00:00:41.039 --> 00:00:45.960
<v Speaker 1>simply like insatiably curious, prepare for some serious aha moments.

16
00:00:46.159 --> 00:00:47.880
<v Speaker 1>Let's start at the beginning, then. Deep learning, it's a

17
00:00:47.880 --> 00:00:50.600
<v Speaker 1>powerful subset of machine learning, but they fundamentally share a

18
00:00:50.600 --> 00:00:52.799
<v Speaker 1>lot of common ground. What are some of those essential

19
00:00:52.799 --> 00:00:53.560
<v Speaker 1>building blocks?

20
00:00:53.679 --> 00:00:57.079
<v Speaker 2>Well, at its core, it's all about preparing your data

21
00:00:57.119 --> 00:01:00.600
<v Speaker 2>for modeling. The book uses a great example actually with

22
00:01:01.119 --> 00:01:04.079
<v Speaker 2>the London Air Quality Network data set. The goal there

23
00:01:04.200 --> 00:01:07.599
<v Speaker 2>is predicting nitrogen dioxide levels, and this involves some really

24
00:01:07.599 --> 00:01:11.959
<v Speaker 2>crucial steps, like identifying and extracting relevant date information, you know,

25
00:01:12.120 --> 00:01:16.280
<v Speaker 2>the day, the month, and also intelligently handling missing values,

26
00:01:16.319 --> 00:01:21.239
<v Speaker 2>for instance, removing maybe a small percentage like three or

27
00:01:21.280 --> 00:01:23.760
<v Speaker 2>four percent of missing target values that you just can't

28
00:01:23.799 --> 00:01:25.079
<v Speaker 2>reliably guess.

29
00:01:25.120 --> 00:01:27.920
<v Speaker 1>Right, because bad guesses could throw the whole model off.

30
00:01:27.640 --> 00:01:31.040
<v Speaker 2>Exactly, and filtering out variables that don't add any

31
00:01:31.079 --> 00:01:34.640
<v Speaker 2>real information. Think about columns where every single value is identical,

32
00:01:34.959 --> 00:01:36.879
<v Speaker 2>like maybe a site ID if you're only looking at

33
00:01:36.879 --> 00:01:39.359
<v Speaker 2>one site, or units. They just don't help.

34
00:01:39.519 --> 00:01:42.040
<v Speaker 1>And data quality isn't just about missing values, is it?

35
00:01:42.280 --> 00:01:46.120
<v Speaker 1>We also saw the importance of checks like confirming provisional

36
00:01:46.239 --> 00:01:48.719
<v Speaker 1>or ratified values, making sure you know the status of

37
00:01:48.719 --> 00:01:52.359
<v Speaker 1>the data, and transforming data types too, ensuring numeric data

38
00:01:52.400 --> 00:01:55.640
<v Speaker 1>isn't accidentally stored as text, which happens more than you'd think.

39
00:01:55.799 --> 00:01:58.920
<v Speaker 2>Oh definitely. Then comes the actual model training, so that

40
00:01:58.959 --> 00:02:02.319
<v Speaker 2>means splitting your data into training and testing sets, typically

41
00:02:02.319 --> 00:02:05.680
<v Speaker 2>a good chunk maybe seventy or eighty percent for training,

42
00:02:06.000 --> 00:02:09.680
<v Speaker 2>and then choosing the right algorithm. The book highlights

43
00:02:09.840 --> 00:02:14.639
<v Speaker 2>XGBoost as a pretty robust gradient tree boosting method.

44
00:02:14.960 --> 00:02:16.680
<v Speaker 1>Boosting. That's different from something like a random forest, right?

45
00:02:16.439 --> 00:02:19.759
<v Speaker 2>Yeah, very different. What's important about boosting methods

46
00:02:19.800 --> 00:02:22.879
<v Speaker 2>like XGBoost is that they learn iteratively. Each new

47
00:02:23.000 --> 00:02:26.400
<v Speaker 2>model essentially tries to correct the mistakes of the previous one.

48
00:02:26.520 --> 00:02:29.960
<v Speaker 2>It's a refinement process. Random forests, on the other hand,

49
00:02:30.080 --> 00:02:33.000
<v Speaker 2>use bagging. They build many independent models and sort of

50
00:02:33.080 --> 00:02:35.719
<v Speaker 2>average their results. Both powerful but different approaches.

51
00:02:35.800 --> 00:02:38.879
<v Speaker 1>Okay, that makes sense. This brings us to a critical point, then:

52
00:02:39.319 --> 00:02:43.439
<v Speaker 1>how do you truly evaluate your model's results? Because simple

53
00:02:43.479 --> 00:02:45.599
<v Speaker 1>accuracy can be really misleading, can't it?

54
00:02:45.680 --> 00:02:46.240
<v Speaker 2>Absolutely.

55
00:02:46.319 --> 00:02:49.199
<v Speaker 1>The credit card fraud example in the book perfectly illustrates this.

56
00:02:49.719 --> 00:02:52.840
<v Speaker 1>If only, say, zero point one percent of transactions are fraudulent,

57
00:02:53.039 --> 00:02:55.919
<v Speaker 1>a model predicting no fraud every time gets ninety nine

58
00:02:55.960 --> 00:02:59.240
<v Speaker 1>point nine percent accuracy, but it's completely useless. It misses

59
00:02:59.280 --> 00:03:02.039
<v Speaker 1>every single instance of actual fraud.

60
00:03:03.000 --> 00:03:06.080
<v Speaker 2>Precisely. That's why just looking at accuracy can be, well,

61
00:03:06.280 --> 00:03:09.960
<v Speaker 2>dangerously deceptive sometimes. It really forces you to think about

62
00:03:09.960 --> 00:03:12.840
<v Speaker 2>the cost of being wrong. You need metrics like mean

63
00:03:12.919 --> 00:03:17.840
<v Speaker 2>absolute error, or root mean squared error: MAE

64
00:03:17.960 --> 00:03:18.560
<v Speaker 2>and RMSE.
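
NOTE
Editor's sketch, not from the book: the fraud example can be checked with a few lines of arithmetic.
It is written in Python rather than R purely for illustration, and the counts (1,000 transactions,
1 of them fraudulent) are hypothetical values chosen to match the 0.1 percent figure in the episode.
```python
# Hypothetical data: 1,000 transactions, 1 fraudulent (0.1 percent).
labels = [1] + [0] * 999           # 1 = fraud, 0 = legitimate
predictions = [0] * len(labels)    # a "model" that always predicts no fraud
# Accuracy: share of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
# Recall: share of actual fraud cases that the model caught.
caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = caught / sum(labels)
print(accuracy, recall)
```
Accuracy looks excellent while recall shows that no fraud was caught at all, which is the point the speakers make.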

65
00:03:18.800 --> 00:03:20.360
<v Speaker 1>That's the one that squares the errors.

66
00:03:20.560 --> 00:03:23.199
<v Speaker 2>Yeah, and the real insight with RMSE isn't just the math.

67
00:03:23.360 --> 00:03:26.960
<v Speaker 2>It's that it forces you to heavily penalize those big,

68
00:03:27.000 --> 00:03:31.599
<v Speaker 2>potentially catastrophic mispredictions. So if missing badly is really really

69
00:03:31.639 --> 00:03:34.919
<v Speaker 2>bad for your project, RMSE is often the better choice.

70
00:03:35.039 --> 00:03:37.520
<v Speaker 2>It makes those big errors hurt more in the calculation.
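
NOTE
Editor's sketch, not from the book: a small comparison of MAE and RMSE, in Python rather than R for illustration.
The function names and the numbers are hypothetical. Both prediction sets have the same total absolute error,
but one spreads it evenly and the other concentrates it in a single large miss.
```python
import math
def mae(actual, predicted):
    # Mean absolute error: average size of the errors.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
def rmse(actual, predicted):
    # Root mean squared error: errors are squared before averaging.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
actual = [10.0, 10.0, 10.0, 10.0]
steady = [12.0, 8.0, 12.0, 8.0]     # four errors of size 2
one_big = [10.0, 10.0, 10.0, 18.0]  # a single error of size 8
print(mae(actual, steady), rmse(actual, steady))
print(mae(actual, one_big), rmse(actual, one_big))
```
MAE is the same for both sets, while RMSE is larger for the set with the single big miss, which is why RMSE suits projects where large errors are especially costly.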

71
00:03:38.000 --> 00:03:40.120
<v Speaker 1>Got it. So, once we have a model, how do

72
00:03:40.159 --> 00:03:42.719
<v Speaker 1>we actually make it better? The book talks about strategies

73
00:03:42.759 --> 00:03:44.039
<v Speaker 1>like cross-validation.

74
00:03:44.000 --> 00:03:46.800
<v Speaker 2>Right, cross-validation. That's where you basically repeat the

75
00:03:46.879 --> 00:03:49.479
<v Speaker 2>train-test split multiple times with different slices of your data.

76
00:03:49.800 --> 00:03:52.039
<v Speaker 2>It gives you a much more reliable estimate of how

77
00:03:52.039 --> 00:03:54.120
<v Speaker 2>the model will perform on unseen data.

78
00:03:54.159 --> 00:03:56.479
<v Speaker 1>And early stopping. That sounds important too.

79
00:03:56.400 --> 00:03:58.639
<v Speaker 2>Yeah, early stopping is key. It means you monitor the

80
00:03:58.680 --> 00:04:01.759
<v Speaker 2>model's performance on a validation set during training, and you

81
00:04:01.879 --> 00:04:04.879
<v Speaker 2>just stop the training if things haven't improved for say,

82
00:04:05.240 --> 00:04:08.759
<v Speaker 2>twenty-five rounds or epochs. That prevents overfitting.
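
NOTE
Editor's sketch, not from the book: the early stopping rule described here, reduced to its bookkeeping.
It is in Python rather than R for illustration. The function name, the loss values, and the patience of 3
are hypothetical (the episode mentions 25 rounds); real libraries expose this as a training option.
```python
def early_stopping_round(val_losses, patience):
    """Return the index of the round at which training would stop."""
    best = float("inf")
    rounds_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return i
    # Patience never ran out, so training runs to the last round.
    return len(val_losses) - 1
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.59]
print(early_stopping_round(losses, patience=3))
```
With a small patience, training halts before the late improvement at the end of the list; a larger patience trades extra rounds for the chance to see it.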

83
00:04:08.479 --> 00:04:11.639
<v Speaker 1>And grid searches for hyperparameter tuning, that's about finding

84
00:04:11.680 --> 00:04:12.960
<v Speaker 1>the best settings.

85
00:04:12.800 --> 00:04:17.519
<v Speaker 2>Exactly. Systematically trying out different combinations of settings, like learning

86
00:04:17.600 --> 00:04:21.000
<v Speaker 2>rates or tree depths to find that optimal configuration for

87
00:04:21.079 --> 00:04:25.519
<v Speaker 2>your specific problem. And you know, we also briefly touched

88
00:04:25.519 --> 00:04:28.680
<v Speaker 2>on a wider range of machine learning algorithms. Beyond

89
00:04:28.839 --> 00:04:31.720
<v Speaker 2>XGBoost, there's a whole family: things like decision trees

90
00:04:31.839 --> 00:04:37.120
<v Speaker 2>and their ensemble cousin random forests, logistic regression for classification problems,

91
00:04:38.000 --> 00:04:40.959
<v Speaker 2>support vector machines which are great at finding separation lines

92
00:04:40.959 --> 00:04:44.560
<v Speaker 2>in data, k-nearest neighbors, k-means for clustering, and

93
00:04:44.759 --> 00:04:48.560
<v Speaker 2>other boosting methods too, like gradient boosting machines, GBM, and

94
00:04:48.680 --> 00:04:52.399
<v Speaker 2>LightGBM. Understanding how these iterative methods work, how they

95
00:04:52.399 --> 00:04:55.040
<v Speaker 2>build on information, it really does lay some groundwork for

96
00:04:55.079 --> 00:04:56.959
<v Speaker 2>grasping deep learning concepts later on.

97
00:04:57.279 --> 00:04:59.120
<v Speaker 1>All right, So, if you're listening and ready to get

98
00:04:59.120 --> 00:05:01.720
<v Speaker 1>hands on with deep learning in R, what are the

99
00:05:02.160 --> 00:05:04.040
<v Speaker 1>essential libraries we need to get started?

100
00:05:04.040 --> 00:05:06.879
<v Speaker 2>Okay. The primary workhorses mentioned are H2O,

101
00:05:06.959 --> 00:05:09.959
<v Speaker 2>MXNet, and Keras, and we also saw some more specialized

102
00:05:10.000 --> 00:05:14.639
<v Speaker 2>packages like RBM for restricted Boltzmann machines and ReinforcementLearning.

103
00:05:14.120 --> 00:05:18.920
<v Speaker 1>And installation. It's not always just install.packages, is it?

104
00:05:19.480 --> 00:05:22.639
<v Speaker 2>Some seem straightforward from CRAN, the main R archive, right?

105
00:05:22.519 --> 00:05:26.319
<v Speaker 1>Some are, but others, yeah, like RBM or especially Keras,

106
00:05:26.319 --> 00:05:28.759
<v Speaker 1>often need a bit more work. Keras usually relies on

107
00:05:28.839 --> 00:05:32.079
<v Speaker 1>TensorFlow running in the background, often in a separate Python

108
00:05:32.160 --> 00:05:34.839
<v Speaker 1>environment like Conda or a virtual environment.

109
00:05:34.560 --> 00:05:36.839
<v Speaker 2>So you might need devtools or need to point

110
00:05:36.959 --> 00:05:39.240
<v Speaker 2>R to the right Python installation.

111
00:05:39.240 --> 00:05:42.240
<v Speaker 1>Exactly. And MXNet, for instance, might even need external libraries

112
00:05:42.360 --> 00:05:45.279
<v Speaker 1>installed on your system first, like OpenCV for image stuff

113
00:05:45.360 --> 00:05:47.879
<v Speaker 1>or OpenBLAS for linear algebra. It can get a

114
00:05:47.920 --> 00:05:50.800
<v Speaker 1>little complex, but what's really insightful here, I think, is

115
00:05:50.879 --> 00:05:55.360
<v Speaker 1>understanding the different strengths of each library within R. Keras

116
00:05:55.399 --> 00:05:58.639
<v Speaker 1>gives you incredibly broad support for almost any neural network

117
00:05:58.720 --> 00:06:02.000
<v Speaker 1>architecture you can think of: RNNs, CNNs, MLPs, you

118
00:06:02.079 --> 00:06:04.920
<v Speaker 1>name it. H2O is fantastic when you're dealing

119
00:06:04.920 --> 00:06:08.079
<v Speaker 1>with really really large data sets because you can store

120
00:06:08.120 --> 00:06:11.240
<v Speaker 1>objects out of memory across a cluster if needed. And

121
00:06:11.319 --> 00:06:15.600
<v Speaker 1>MXNet, it provides a really robust, efficient set of algorithms.

122
00:06:16.120 --> 00:06:19.240
<v Speaker 2>Powerful stuff, and the book shows examples for each. Yeah,

123
00:06:19.279 --> 00:06:21.560
<v Speaker 2>we saw how to get a basic example running with

124
00:06:21.639 --> 00:06:24.959
<v Speaker 2>each one, including a practical demo of preprocessing the

125
00:06:25.000 --> 00:06:28.720
<v Speaker 2>Adult census data set, converting character columns to numbers using

126
00:06:28.759 --> 00:06:32.639
<v Speaker 2>one-hot encoding, scaling everything between zero and one. Standard

127
00:06:32.720 --> 00:06:33.720
<v Speaker 2>but crucial steps.
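
NOTE
Editor's sketch, not from the book: the two preprocessing steps just mentioned, one-hot encoding and
scaling to the zero-to-one range, in Python rather than R for illustration. The helper names and the
sample values are hypothetical stand-ins for columns of a census-style data set.
```python
def one_hot(values):
    # One column per distinct category, 1 where the row has that category.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories
def min_max_scale(values):
    # Rescale so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]
encoded, cats = one_hot(["Private", "State-gov", "Private"])
scaled = min_max_scale([20, 35, 50])
print(cats, encoded, scaled)
```
Sorting the categories is only a design choice to keep the column order stable between runs.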

128
00:06:33.920 --> 00:06:36.959
<v Speaker 1>Okay, let's dig into the deep part. How exactly does

129
00:06:37.079 --> 00:06:40.360
<v Speaker 1>deep learning get that name? And what's really at its core?

130
00:06:40.600 --> 00:06:44.040
<v Speaker 2>Right. The deep comes from using multiple hidden layers made

131
00:06:44.120 --> 00:06:46.800
<v Speaker 2>up of these artificial neurons. These layers are stacked and

132
00:06:46.839 --> 00:06:49.680
<v Speaker 2>they mimic in a very very simplified way, how our

133
00:06:49.680 --> 00:06:53.439
<v Speaker 2>brains process information. The real key insight is that each

134
00:06:53.519 --> 00:06:57.439
<v Speaker 2>layer can learn progressively more complex features from the data.

135
00:06:57.519 --> 00:06:59.959
<v Speaker 1>How does that work in practice? For an image?

136
00:07:00.439 --> 00:07:03.360
<v Speaker 2>So imagine the first layer might identify basic edges or

137
00:07:03.399 --> 00:07:06.160
<v Speaker 2>corners in an image. The next layer might combine those

138
00:07:06.240 --> 00:07:09.639
<v Speaker 2>edges to detect simple shapes. A layer deeper might recognize

139
00:07:09.639 --> 00:07:12.560
<v Speaker 2>textures or parts of objects, and so on. This hierarchical

140
00:07:12.639 --> 00:07:14.680
<v Speaker 2>learning, building complexity layer by layer.

141
00:07:14.800 --> 00:07:17.279
<v Speaker 1>That's what makes it deep. And what does this structure

142
00:07:17.319 --> 00:07:18.800
<v Speaker 1>mean for how they actually learn?

143
00:07:19.120 --> 00:07:22.000
<v Speaker 2>Well, the process starts with random weights. These are just

144
00:07:22.120 --> 00:07:25.600
<v Speaker 2>numbers assigned to the connections between neurons, representing the strength

145
00:07:25.639 --> 00:07:28.920
<v Speaker 2>of the connection. Then these weights are adjusted over and

146
00:07:28.959 --> 00:07:32.920
<v Speaker 2>over again iteratively to minimize the difference the error between

147
00:07:32.959 --> 00:07:36.319
<v Speaker 2>the network's predictions and the actual answers in your training data.

148
00:07:36.480 --> 00:07:38.680
<v Speaker 1>So it's constantly refining itself based on feedback.

149
00:07:38.759 --> 00:07:41.839
<v Speaker 2>Exactly, it's a continuous refinement process, very much like how

150
00:07:41.879 --> 00:07:45.399
<v Speaker 2>those boosting algorithms learn from the errors of previous iterations, actually.

151
00:07:45.439 --> 00:07:47.759
<v Speaker 1>That makes sense. But okay, zooming in on those

152
00:07:47.759 --> 00:07:51.279
<v Speaker 1>individual neurons, Yeah, how do they decide whether to fire

153
00:07:51.800 --> 00:07:53.879
<v Speaker 1>or pass a signal forward?

154
00:07:53.920 --> 00:07:57.680
<v Speaker 2>Ah, good question. That's where two key things come in: bias

155
00:07:57.720 --> 00:08:01.399
<v Speaker 2>functions and activation functions. Bias functions, you can think of

156
00:08:01.439 --> 00:08:04.600
<v Speaker 2>them as shifting the decision boundary, allowing the model to

157
00:08:04.639 --> 00:08:09.199
<v Speaker 2>better separate different classes of data. And activation functions they're

158
00:08:09.199 --> 00:08:11.959
<v Speaker 2>the real decision makers inside each neuron. They take the

159
00:08:11.959 --> 00:08:15.199
<v Speaker 2>weighted sum of inputs plus the bias and decide if

160
00:08:15.240 --> 00:08:18.000
<v Speaker 2>and how strongly that neuron should fire and pass the

161
00:08:18.079 --> 00:08:19.199
<v Speaker 2>signal to the next layer.

162
00:08:19.439 --> 00:08:21.399
<v Speaker 1>Right, and we looked at a whole range of these

163
00:08:21.439 --> 00:08:24.160
<v Speaker 1>activation functions, didn't we? From the simple on-off Heaviside.

164
00:08:24.279 --> 00:08:26.279
<v Speaker 2>Yeah, the Heaviside is very basic, just a step.

165
00:08:26.600 --> 00:08:29.120
<v Speaker 2>But the nonlinear ones are where things get interesting.

166
00:08:29.319 --> 00:08:33.120
<v Speaker 2>There's the sigmoid function, which squishes values into a range

167
00:08:33.159 --> 00:08:37.200
<v Speaker 2>between zero and one, really useful for probabilities or binary outcomes.

168
00:08:38.000 --> 00:08:42.120
<v Speaker 2>Then its cousin, the hyperbolic tangent, or tanh, which is

169
00:08:42.320 --> 00:08:44.360
<v Speaker 2>similar but ranges from minus one to one.

170
00:08:44.360 --> 00:08:47.679
<v Speaker 1>And ReLU seems really popular. Rectified linear units.

171
00:08:47.720 --> 00:08:51.080
<v Speaker 2>Oh yeah, ReLU is huge. It's simple. It outputs the

172
00:08:51.080 --> 00:08:55.159
<v Speaker 2>input directly if it's positive and zero otherwise. That simplicity

173
00:08:55.279 --> 00:08:58.600
<v Speaker 2>makes training much faster in many cases. But it has

174
00:08:58.639 --> 00:09:02.200
<v Speaker 2>a potential issue called the dying ReLU problem, where neurons

175
00:09:02.240 --> 00:09:06.039
<v Speaker 2>can get stuck outputting zero, so leaky ReLU was developed.

176
00:09:06.080 --> 00:09:08.639
<v Speaker 2>It gives a tiny slope for negative inputs just to

177
00:09:08.679 --> 00:09:09.399
<v Speaker 2>keep things flowing.

178
00:09:09.559 --> 00:09:10.600
<v Speaker 1>And Swish was another one.

179
00:09:10.679 --> 00:09:12.840
<v Speaker 2>Swish, yeah, a more recent one that's shown good results.

180
00:09:12.879 --> 00:09:15.120
<v Speaker 2>It's a smoother function, lots of options really.
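
NOTE
Editor's sketch, not from the book: the activation functions discussed above, written out in Python
rather than R for illustration. The leaky ReLU slope of 0.01 and the plain form of Swish (input times
its sigmoid) are common choices assumed here, not values taken from the episode.
```python
import math
def heaviside(x):
    return 1.0 if x > 0 else 0.0       # simple on-off step
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # squashes into (0, 1)
def relu(x):
    return max(0.0, x)                 # passes positives, zeroes the rest
def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x   # small slope keeps negatives flowing
def swish(x):
    return x * sigmoid(x)              # smooth relative of ReLU
for x in (-2.0, 0.0, 2.0):
    print(x, heaviside(x), sigmoid(x), math.tanh(x), relu(x), leaky_relu(x), swish(x))
```
Here tanh comes from the standard library and stays between minus one and one, matching the description of it as sigmoid's cousin.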

181
00:09:14.960 --> 00:09:19.879
<v Speaker 1>And for classification tasks where you have like multiple categories dogs, cats, birds,

182
00:09:20.799 --> 00:09:22.399
<v Speaker 1>the softmax function is key, right?

183
00:09:22.279 --> 00:09:26.279
<v Speaker 2>Absolutely essential. Softmax takes the outputs for each class

184
00:09:26.320 --> 00:09:29.240
<v Speaker 2>and converts them into probabilities that all add up to one.

185
00:09:29.399 --> 00:09:31.879
<v Speaker 2>So the model tells you I think it's seventy percent

186
00:09:31.960 --> 00:09:35.159
<v Speaker 2>likely a cat, twenty percent a dog, ten percent a bird. Okay,
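
NOTE
Editor's sketch, not from the book: softmax turning raw class scores into probabilities that sum to one,
in Python rather than R for illustration. The three scores for cat, dog, and bird are hypothetical.
```python
import math
def softmax(scores):
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
probs = softmax([2.0, 0.75, 0.05])  # hypothetical raw scores: cat, dog, bird
print([round(p, 2) for p in probs], sum(probs))
```
The largest score always receives the largest probability, and the ordering of the classes is preserved.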

187
00:09:35.559 --> 00:09:38.000
<v Speaker 2>And you know, the book even walks you through building

188
00:09:38.000 --> 00:09:41.039
<v Speaker 2>a very basic network from scratch in just base R, just

189
00:09:41.039 --> 00:09:43.720
<v Speaker 2>to illustrate how weights get updated and how a line

190
00:09:43.759 --> 00:09:47.200
<v Speaker 2>can separate classes. Then it scales up using the

191
00:09:47.279 --> 00:09:49.720
<v Speaker 2>neuralnet package for the Wisconsin cancer data set, which is

192
00:09:49.720 --> 00:09:55.080
<v Speaker 2>a classic, and importantly, it shows the backpropagation step.

193
00:09:54.279 --> 00:09:57.120
<v Speaker 1>Backpropagation, that's how the error gets used to update

194
00:09:57.120 --> 00:09:58.279
<v Speaker 1>the weights?

195
00:09:58.399 --> 00:10:00.679
<v Speaker 2>Exactly. The error is calculated at the output, and then it's

196
00:10:00.720 --> 00:10:04.159
<v Speaker 2>propagated backward through the network layer by layer, telling each

197
00:10:04.200 --> 00:10:06.759
<v Speaker 2>weight how much it needs to adjust to reduce that error.

198
00:10:06.840 --> 00:10:08.480
<v Speaker 2>It's the core learning mechanism.

199
00:10:08.480 --> 00:10:12.200
<v Speaker 1>Fascinating stuff. Okay, let's move to applications. Image recognition is

200
00:10:12.200 --> 00:10:15.000
<v Speaker 1>a huge one for deep learning. Can we use traditional

201
00:10:15.000 --> 00:10:16.480
<v Speaker 1>machine learning for images at all?

202
00:10:16.600 --> 00:10:20.279
<v Speaker 2>You absolutely can. Yeah. Using what are sometimes called shallow nets,

203
00:10:20.360 --> 00:10:23.399
<v Speaker 2>things like random forests or simple neural networks. You can

204
00:10:23.440 --> 00:10:26.600
<v Speaker 2>apply them to data sets like Fashion-MNIST.

205
00:10:26.399 --> 00:10:30.200
<v Speaker 1>Fashion-MNIST, that's the clothing images instead of handwritten digits?

206
00:10:30.320 --> 00:10:32.879
<v Speaker 2>Right. It's a bit more challenging than the original MNIST digits.

207
00:10:33.120 --> 00:10:36.799
<v Speaker 2>But shallow nets, their limitations become pretty clear when you

208
00:10:36.840 --> 00:10:40.960
<v Speaker 2>move to larger, more complex real world images. They just

209
00:10:41.039 --> 00:10:43.919
<v Speaker 2>struggle to efficiently capture all the intricate patterns.

210
00:10:44.200 --> 00:10:47.639
<v Speaker 1>And this is where convolutional neural networks, CNNs, really come

211
00:10:47.639 --> 00:10:49.799
<v Speaker 1>into their own, isn't it? How do they manage it?

212
00:10:50.320 --> 00:10:54.120
<v Speaker 2>What really sets CNNs apart is their architecture, specifically designed

213
00:10:54.120 --> 00:10:58.399
<v Speaker 2>for grid-like data like images. They automatically learn the

214
00:10:58.440 --> 00:11:02.279
<v Speaker 2>right features directly from the pixels. They use specialized layers

215
00:11:02.279 --> 00:11:05.240
<v Speaker 2>called convolution layers that apply filters across the image to

216
00:11:05.320 --> 00:11:10.320
<v Speaker 2>detect specific patterns like edges, corners, textures, maybe even simple shapes.

217
00:11:10.279 --> 00:11:12.919
<v Speaker 1>So they're not just looking at individual pixels anymore?

218
00:11:13.240 --> 00:11:16.080
<v Speaker 2>Not at all. Then they often use pooling layers, which reduce the size,

219
00:11:16.200 --> 00:11:19.440
<v Speaker 2>the dimensionality, making the process more efficient and helping the

220
00:11:19.440 --> 00:11:22.679
<v Speaker 2>network focus on the most important features. And techniques like

221
00:11:22.759 --> 00:11:26.600
<v Speaker 2>adding padding, say padding "same", can help control how quickly

222
00:11:26.639 --> 00:11:29.879
<v Speaker 2>the dimensions shrink, letting you build deeper networks without losing

223
00:11:29.879 --> 00:11:31.000
<v Speaker 2>information too fast.

224
00:11:31.279 --> 00:11:34.399
<v Speaker 1>And you can build really deep CNNs, right? Stacking multiple

225
00:11:34.440 --> 00:11:35.759
<v Speaker 1>convolution and pooling layers.

226
00:11:35.759 --> 00:11:38.919
<v Speaker 2>Oh yes, that allows the network to learn this hierarchy

227
00:11:38.960 --> 00:11:41.679
<v Speaker 2>of features we talked about: simple features in early layers

228
00:11:41.960 --> 00:11:45.120
<v Speaker 2>combined into more complex ones in deeper layers. It's kind

229
00:11:45.120 --> 00:11:47.799
<v Speaker 2>of analogous to how our own visual system works in

230
00:11:47.840 --> 00:11:48.200
<v Speaker 2>a way.

231
00:11:48.639 --> 00:11:52.240
<v Speaker 1>So with these complex models, how do we optimize them effectively?

232
00:11:52.399 --> 00:11:57.639
<v Speaker 2>Good question. Optimization is key. We discussed various algorithms called optimizers,

233
00:11:57.919 --> 00:12:01.799
<v Speaker 2>things like stochastic gradient descent, SGD, which is a basic workhorse.

234
00:12:02.159 --> 00:12:05.840
<v Speaker 2>Then RMSprop, and Adam is a very popular one nowadays.

235
00:12:05.879 --> 00:12:09.000
<v Speaker 2>It sort of combines the ideas of RMSprop with momentum,

236
00:12:09.399 --> 00:12:11.919
<v Speaker 2>often leading to faster convergence.

237
00:12:11.519 --> 00:12:14.080
<v Speaker 1>And choosing the right loss function is important too.

238
00:12:13.639 --> 00:12:18.840
<v Speaker 2>Crucial. For binary classification, binary cross-entropy. For multiple classes,

239
00:12:18.879 --> 00:12:22.360
<v Speaker 2>categorical cross-entropy. For regression problems, where you predict a

240
00:12:22.440 --> 00:12:25.639
<v Speaker 2>number, maybe mean squared error. And sometimes you need metrics

241
00:12:25.679 --> 00:12:30.080
<v Speaker 2>beyond just accuracy, like cosine similarity or KL divergence, especially

242
00:12:30.080 --> 00:12:32.720
<v Speaker 2>if you're comparing probability distributions or embeddings.

243
00:12:32.840 --> 00:12:36.279
<v Speaker 1>Okay, and you mentioned ways to prevent overfitting like dropout layers.

244
00:12:36.399 --> 00:12:39.840
<v Speaker 2>Yeah, dropout is a really clever technique. During training, it

245
00:12:39.919 --> 00:12:42.879
<v Speaker 2>randomly sets a fraction of neuron outputs to zero for

246
00:12:42.919 --> 00:12:44.240
<v Speaker 2>each training example.
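
NOTE
Editor's sketch, not from the book: what a dropout layer does to one layer's outputs during training,
in Python rather than R for illustration. The function name and values are hypothetical. Rescaling the
surviving outputs by 1 / (1 - rate), often called inverted dropout, is an assumption added here and is
not something the speakers describe.
```python
import random
def dropout(outputs, rate, rng):
    """Zero each output with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [o / keep if rng.random() < keep else 0.0 for o in outputs]
rng = random.Random(0)  # seeded so the sketch is repeatable
layer_outputs = [0.5, 1.2, 0.8, 0.3, 0.9, 1.1]
print(dropout(layer_outputs, rate=0.5, rng=rng))
```
A fresh random mask is drawn on every call, so no single neuron can be relied on from one training example to the next.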

247
00:12:44.000 --> 00:12:46.679
<v Speaker 1>So it forces the network not to rely too heavily

248
00:12:46.720 --> 00:12:48.559
<v Speaker 1>on any single neuron?

249
00:12:48.720 --> 00:12:52.279
<v Speaker 2>Exactly. It encourages redundancy and makes the network more robust. And

250
00:12:52.480 --> 00:12:56.320
<v Speaker 2>early stopping, like we mentioned before, halting training when performance

251
00:12:56.320 --> 00:12:59.480
<v Speaker 2>on a validation set stops improving is another vital tool

252
00:12:59.480 --> 00:13:02.679
<v Speaker 2>against overfitting. It helps find that sweet spot for the

253
00:13:02.759 --> 00:13:04.759
<v Speaker 2>number of training epochs.

254
00:13:05.080 --> 00:13:08.840
<v Speaker 1>Right. Okay, let's shift gears a bit. Multilayer perceptrons, or MLPs.

255
00:13:09.279 --> 00:13:13.080
<v Speaker 1>What about them, particularly for signal detection tasks? What makes

256
00:13:13.080 --> 00:13:13.720
<v Speaker 1>them distinct?

257
00:13:14.320 --> 00:13:17.960
<v Speaker 2>MLPs are kind of the classic, foundational feedforward neural network.

258
00:13:18.440 --> 00:13:22.200
<v Speaker 2>Their defining feature is that they only use fully connected layers.

259
00:13:22.120 --> 00:13:24.919
<v Speaker 1>Meaning every neuron in one layer connects to every single

260
00:13:24.960 --> 00:13:25.639
<v Speaker 1>neuron in the next layer?

261
00:13:25.600 --> 00:13:29.279
<v Speaker 2>Precisely. Unlike CNNs with their specialized convolution layers

262
00:13:29.399 --> 00:13:33.159
<v Speaker 2>or RNNs with their recurrent connections, MLPs are just stacks

263
00:13:33.159 --> 00:13:37.000
<v Speaker 2>of these dense fully connected layers. They're good general purpose learners,

264
00:13:37.000 --> 00:13:41.399
<v Speaker 2>maybe less specialized than CNNs for images or LSTMs for sequences.

265
00:13:41.759 --> 00:13:42.679
<v Speaker 1>And for MLPs,

266
00:13:42.840 --> 00:13:45.360
<v Speaker 1>we looked at some specific data prep steps, didn't we,

267
00:13:45.840 --> 00:13:49.759
<v Speaker 1>like trimming white space from categories? Why is that important?

268
00:13:49.879 --> 00:13:53.480
<v Speaker 2>Ah, yes. It sounds trivial, but if you have "Male" with

269
00:13:53.559 --> 00:13:57.120
<v Speaker 2>a leading space and "Male" without, the computer sees them

270
00:13:57.159 --> 00:14:00.600
<v Speaker 2>as two totally different categories, so cleaning that up is essential.

271
00:14:00.799 --> 00:14:04.799
<v Speaker 1>And rescaling numeric values to a zero-to-one range. Why

272
00:14:04.799 --> 00:14:06.120
<v Speaker 1>do we do that rescale step, again?

273
00:14:06.159 --> 00:14:09.200
<v Speaker 2>It's really about efficiency and stability. If you have

274
00:14:09.240 --> 00:14:11.440
<v Speaker 2>one feature ranging from zero to one and another from

275
00:14:11.519 --> 00:14:14.639
<v Speaker 2>zero to one million, the larger feature can dominate the

276
00:14:14.720 --> 00:14:18.360
<v Speaker 2>learning process. Scaling brings everything into the same range, so

277
00:14:18.399 --> 00:14:21.240
<v Speaker 2>they contribute more equally, and it often helps the model's

278
00:14:21.279 --> 00:14:24.399
<v Speaker 2>optimization process converge faster and more reliably.

279
00:14:24.600 --> 00:14:26.639
<v Speaker 1>Makes sense. And there was a rule of thumb for

280
00:14:26.720 --> 00:14:27.720
<v Speaker 1>hidden layer size?

281
00:14:27.799 --> 00:14:30.799
<v Speaker 2>Yeah, a common heuristic, just a starting point really, is

282
00:14:30.840 --> 00:14:32.799
<v Speaker 2>to try setting the number of nodes in a hidden

283
00:14:32.879 --> 00:14:35.559
<v Speaker 2>layer to about two thirds of the input layer size.

284
00:14:35.720 --> 00:14:37.879
<v Speaker 2>We saw how you could write functions in R using

285
00:14:37.879 --> 00:14:40.559
<v Speaker 2>the MXNet syntax in the book to easily test

286
00:14:40.639 --> 00:14:43.840
<v Speaker 2>different node counts and even experiment with adding more hidden layers.

287
00:14:44.159 --> 00:14:47.080
<v Speaker 1>Okay, now let's talk about something we all encounter daily.

288
00:14:47.919 --> 00:14:52.480
<v Speaker 1>Recommender systems. Yeah, streaming movies, online shopping. How do they

289
00:14:52.519 --> 00:14:55.399
<v Speaker 1>actually work and where does deep learning fit in?

290
00:14:55.720 --> 00:15:00.600
<v Speaker 2>Right, recommenders. Broadly, there are three main types: collaborative filtering,

291
00:15:00.720 --> 00:15:03.519
<v Speaker 2>which finds users similar to you and recommends what they liked,

292
00:15:04.120 --> 00:15:07.440
<v Speaker 2>content based filtering, which recommends items similar to ones you've

293
00:15:07.480 --> 00:15:10.759
<v Speaker 2>liked before based on their attributes, and hybrid systems, which

294
00:15:11.320 --> 00:15:13.000
<v Speaker 2>try to combine the best of both worlds.

295
00:15:13.080 --> 00:15:15.480
<v Speaker 1>And a big challenge is the cold start problem, right,

296
00:15:15.799 --> 00:15:17.240
<v Speaker 1>for new users or new items?

297
00:15:17.320 --> 00:15:20.559
<v Speaker 2>Exactly. If you're a new user, the system knows nothing

298
00:15:20.600 --> 00:15:22.960
<v Speaker 2>about your tastes. If it's a brand new movie, nobody

299
00:15:22.960 --> 00:15:25.559
<v Speaker 2>has rated it yet. That makes recommendations difficult initially.

300
00:15:25.600 --> 00:15:29.720
<v Speaker 1>What seemed really fascinating here was the idea of embeddings.

301
00:15:30.240 --> 00:15:32.639
<v Speaker 1>How do these low dimensional vectors help?

302
00:15:32.840 --> 00:15:35.919
<v Speaker 2>Embeddings are a really powerful concept in deep learning, not

303
00:15:36.000 --> 00:15:40.600
<v Speaker 2>just for recommendations. They basically learn dense, low-dimensional vector

304
00:15:40.639 --> 00:15:44.519
<v Speaker 2>representations for things like users and items. Instead of dealing

305
00:15:44.519 --> 00:15:49.240
<v Speaker 2>with huge, sparse matrices of user-item interactions, you map

306
00:15:49.440 --> 00:15:54.080
<v Speaker 2>users and items into this shared latent space, a coordinate system,

307
00:15:54.120 --> 00:15:54.879
<v Speaker 2>if you will.

308
00:15:54.759 --> 00:15:57.240
<v Speaker 1>And closeness in that space means similarity.

309
00:15:57.159 --> 00:16:01.000
<v Speaker 2>Precisely. Users end up close to items they like, and similar users

310
00:16:01.000 --> 00:16:03.879
<v Speaker 2>close to each other. It captures these affinities efficiently, making

311
00:16:03.960 --> 00:16:06.960
<v Speaker 2>it easy to calculate similarity like with a dot product,

312
00:16:07.240 --> 00:16:09.480
<v Speaker 2>even when you don't have explicit ratings for everything.

313
00:16:09.679 --> 00:16:12.519
<v Speaker 1>And we looked at the steam-200k.csv

314
00:16:12.639 --> 00:16:15.600
<v Speaker 1>dataset, which uses implicit feedback.

315
00:16:15.759 --> 00:16:18.000
<v Speaker 2>Yeah, that was a great example. Instead of star ratings,

316
00:16:18.039 --> 00:16:21.519
<v Speaker 2>it uses hours played for video games. Sid Meier's Civilization

317
00:16:21.679 --> 00:16:25.240
<v Speaker 2>V had huge hours logged by some users. This implicit

318
00:16:25.320 --> 00:16:29.159
<v Speaker 2>data (clicks, views, purchase history, playtime) is often much more

319
00:16:29.200 --> 00:16:31.679
<v Speaker 2>abundant and sometimes more revealing than explicit ratings.

320
00:16:31.759 --> 00:16:34.519
<v Speaker 1>So we saw preparing that data, doing some exploratory data

321
00:16:34.559 --> 00:16:37.639
<v Speaker 1>analysis (EDA) to understand those interactions.

322
00:16:37.200 --> 00:16:39.039
<v Speaker 2>Yep, understanding who plays what and for how long.

323
00:16:39.279 --> 00:16:42.440
<v Speaker 1>And then building a custom Keras model using both user

324
00:16:42.480 --> 00:16:44.080
<v Speaker 1>embeddings and item embeddings.

325
00:16:44.159 --> 00:16:46.559
<v Speaker 2>Right. But then there's another layer. How do you account

326
00:16:46.600 --> 00:16:50.279
<v Speaker 2>for inherent biases? Some users just play games way more

327
00:16:50.360 --> 00:16:53.440
<v Speaker 2>than others, regardless of the specific game, and some games

328
00:16:53.480 --> 00:16:55.000
<v Speaker 2>are just universally popular.

329
00:16:55.200 --> 00:16:57.960
<v Speaker 1>Ah, so you need to model those baseline tendencies.

330
00:16:58.000 --> 00:17:01.200
<v Speaker 2>Exactly. Adding specific bias embeddings, one for the average user's

331
00:17:01.279 --> 00:17:04.480
<v Speaker 2>tendency and one for the average item's popularity, can really

332
00:17:04.480 --> 00:17:07.400
<v Speaker 2>improve the model. It lets the main embeddings focus on

333
00:17:07.480 --> 00:17:11.839
<v Speaker 2>the interaction effect, the specific user-item affinity, separate from

334
00:17:11.839 --> 00:17:15.960
<v Speaker 2>these general biases. In the book's example, adding biases nearly

335
00:17:16.079 --> 00:17:19.759
<v Speaker 2>doubled the trainable parameters, but led to much better recommendations.

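NOTE
A minimal Python sketch of the scoring idea discussed above, not the book's R and Keras model. A predicted affinity is the dot product of a user embedding and an item embedding, plus one bias per user and one per item. All numbers and the helper name predict_affinity are illustrative assumptions.
```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
def predict_affinity(user_vec, item_vec, user_bias, item_bias):
    """Interaction term (dot product) plus the two baseline bias terms."""
    return dot(user_vec, item_vec) + user_bias + item_bias
# Hypothetical learned values for one user and two games (assumed, not from the book).
user_vec = [0.5, -1.0, 2.0]
strategy_game = [1.0, 0.0, 1.5]
racing_game = [-0.5, 1.0, 0.0]
user_bias = 0.25  # this user plays a lot in general
item_bias = 0.5   # the strategy game is popular with everyone
score_strategy = predict_affinity(user_vec, strategy_game, user_bias, item_bias)
score_racing = predict_affinity(user_vec, racing_game, user_bias, 0.0)
```
Ranking games by this score for a given user yields the recommendation order. The bias terms absorb general play volume and popularity, so the vectors only have to model the user-item interaction.
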
336
00:17:19.880 --> 00:17:23.759
<v Speaker 1>Very clever. Okay, let's pivot to time series data. Stock

337
00:17:23.799 --> 00:17:27.160
<v Speaker 1>price forecasting is the classic example. How does deep learning

338
00:17:27.200 --> 00:17:30.440
<v Speaker 1>tackle this, given that the order of events is so critical?

339
00:17:30.319 --> 00:17:34.440
<v Speaker 2>Time series is definitely unique. Unlike, say, image classification, where

340
00:17:34.440 --> 00:17:37.240
<v Speaker 2>you can shuffle the images, with time series the sequence

341
00:17:37.279 --> 00:17:40.480
<v Speaker 2>is everything. You absolutely have to maintain chronological order when

342
00:17:40.480 --> 00:17:42.599
<v Speaker 2>splitting data for training and testing.

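NOTE
A minimal Python sketch of a chronological train/test split, assuming the series is already sorted oldest first. The helper name chronological_split, the 80 percent fraction, and the prices are illustrative assumptions.
```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series so every test point comes after every training point."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]
# Assumed toy prices, oldest first.
prices = [101, 103, 102, 106, 108, 107, 111, 115, 114, 118]
train, test = chronological_split(prices)
```
There is no shuffling step: the model is fitted on the earlier values and evaluated only on the later ones.
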
343
00:17:42.440 --> 00:17:44.400
<v Speaker 1>Because the past predicts the future, basically.

344
00:17:44.480 --> 00:17:48.000
<v Speaker 2>Fundamentally, yes. The patterns are in the sequence. We

345
00:17:48.119 --> 00:17:52.519
<v Speaker 2>compared this deep learning approach to traditional methods like ARIMA models.

346
00:17:52.920 --> 00:17:56.519
<v Speaker 2>ARIMA can be good, but often struggles to predict complex

347
00:17:56.640 --> 00:17:59.480
<v Speaker 2>patterns far beyond the training data it saw.

348
00:17:59.839 --> 00:18:03.119
<v Speaker 1>This is where recurrent neural networks (RNNs) and

349
00:18:03.200 --> 00:18:07.480
<v Speaker 1>especially long short-term memory (LSTM) networks come in. These

350
00:18:07.480 --> 00:18:08.920
<v Speaker 1>are the game changers.

351
00:18:08.640 --> 00:18:12.200
<v Speaker 2>They really are, for sequential data. LSTMs in particular are

352
00:18:12.240 --> 00:18:16.759
<v Speaker 2>designed to have memory. They have internal mechanisms, these gates,

353
00:18:16.799 --> 00:18:20.000
<v Speaker 2>that allow them to retain information from previous steps in

354
00:18:20.039 --> 00:18:22.039
<v Speaker 2>the sequence and use it for current predictions.

355
00:18:22.119 --> 00:18:23.839
<v Speaker 1>So they can remember relevant past events.

356
00:18:23.559 --> 00:18:27.599
<v Speaker 2>Exactly, they can learn long-range dependencies. A crucial

357
00:18:27.640 --> 00:18:30.400
<v Speaker 2>step we saw was transforming the raw stock prices using

358
00:18:30.480 --> 00:18:33.400
<v Speaker 2>log differences. This helps achieve stationarity.

359
00:18:33.559 --> 00:18:35.200
<v Speaker 1>Stationarity, what does that mean again?

360
00:18:35.480 --> 00:18:38.160
<v Speaker 2>It means the statistical properties of the time series, like

361
00:18:38.240 --> 00:18:42.079
<v Speaker 2>its average and variance, don't change over time. Most time

362
00:18:42.119 --> 00:18:46.119
<v Speaker 2>series models, including LSTMs, work much better with stationary data.

363
00:18:46.880 --> 00:18:50.200
<v Speaker 2>Raw stock prices usually aren't stationary; they tend to trend

364
00:18:50.279 --> 00:18:53.599
<v Speaker 2>upwards over time. Log differences often stabilize them.

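NOTE
A minimal Python sketch of the log-difference transform mentioned above: each value becomes log(price today) minus log(price yesterday). The helper name log_differences and the toy prices are illustrative assumptions.
```python
import math
def log_differences(prices):
    """Return log(p[t]) - log(p[t-1]) for each consecutive pair of prices."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(prices, prices[1:])]
# Assumed toy series that grows by 10% at every step, so it trends upward.
prices = [100.0, 110.0, 121.0, 133.1]
log_returns = log_differences(prices)
```
The raw prices keep climbing, but every log difference is about log(1.1), so the transformed series has a roughly constant level. The result is one element shorter than the input.
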
365
00:18:53.680 --> 00:18:56.440
<v Speaker 1>Okay, and we use a time series generator to prepare

366
00:18:56.440 --> 00:18:56.799
<v Speaker 1>the data.

367
00:18:57.000 --> 00:19:00.119
<v Speaker 2>Yeah, that's a handy tool in Keras. It automatically creates

368
00:19:00.119 --> 00:19:02.720
<v Speaker 2>batches of sequential data for the LSTM. You tell it

369
00:19:02.799 --> 00:19:06.119
<v Speaker 2>how many past days to look back at, say ten days,

370
00:19:06.160 --> 00:19:09.200
<v Speaker 2>to predict the next day's value. It handles creating those

371
00:19:09.200 --> 00:19:10.359
<v Speaker 2>sliding windows for you.

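NOTE
A minimal Python sketch of the sliding-window idea described above. It is similar in spirit to what a time series generator produces, but it is not the Keras implementation and it skips batching. The helper name sliding_windows and the toy series are illustrative assumptions.
```python
def sliding_windows(series, lookback):
    """Pair each run of `lookback` consecutive values with the value that follows it."""
    samples = []
    for start in range(len(series) - lookback):
        window = series[start:start + lookback]
        target = series[start + lookback]
        samples.append((window, target))
    return samples
# Assumed toy series; a lookback of 3 stands in for "the last three days".
series = [1, 2, 3, 4, 5, 6]
samples = sliding_windows(series, lookback=3)
```
Each sample is an (input window, next value) pair, so a series of length n yields n minus lookback training samples.
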
372
00:19:10.680 --> 00:19:12.839
<v Speaker 1>And then we built the actual LSTM model in Keras.

373
00:19:12.400 --> 00:19:16.000
<v Speaker 2>Right, a sequential model, defining the LSTM layers,

374
00:19:16.079 --> 00:19:18.920
<v Speaker 2>specifying the number of units or memory cells in each layer,

375
00:19:18.960 --> 00:19:21.359
<v Speaker 2>and crucially the input shape which has to match the

376
00:19:21.400 --> 00:19:24.400
<v Speaker 2>lookback window and number of features. And of course

377
00:19:24.440 --> 00:19:27.759
<v Speaker 2>tuning is vital here too: experimenting with the lookback window

378
00:19:27.799 --> 00:19:30.680
<v Speaker 2>size (maybe three days works better than ten, or vice versa),

379
00:19:31.039 --> 00:19:34.279
<v Speaker 2>adding multiple LSTM layers, maybe with dropout in between to

380
00:19:34.359 --> 00:19:36.680
<v Speaker 2>prevent overfitting on the sequence.

381
00:19:36.319 --> 00:19:40.480
<v Speaker 1>And refining the optimizer, like the Adam optimizer's learning rate.

382
00:19:40.920 --> 00:19:43.640
<v Speaker 2>Definitely. Finding the right learning rate is often critical for stable training,

383
00:19:43.720 --> 00:19:46.240
<v Speaker 2>especially with time series where things can fluctuate a lot.

384
00:19:46.359 --> 00:19:49.119
<v Speaker 1>Okay, this next one is maybe the most mind bending

385
00:19:49.599 --> 00:19:56.440
<v Speaker 1>generative adversarial networks (GANs), creating synthetic images like faces totally

386
00:19:56.519 --> 00:19:58.960
<v Speaker 1>from scratch. How does that even work?

387
00:19:59.119 --> 00:20:02.559
<v Speaker 2>It is pretty amazing stuff. GANs are a really special

388
00:20:02.640 --> 00:20:06.920
<v Speaker 2>type of unsupervised learning model. They're generative because their goal

389
00:20:07.039 --> 00:20:10.119
<v Speaker 2>is to create new data that looks like the training data,

390
00:20:10.599 --> 00:20:13.759
<v Speaker 2>and they're adversarial because they involve two neural networks locked

391
00:20:13.759 --> 00:20:15.039
<v Speaker 2>in a competition.

392
00:20:14.640 --> 00:20:17.000
<v Speaker 1>The generator and the discriminator.

393
00:20:17.440 --> 00:20:20.440
<v Speaker 2>Exactly. The generator takes random noise as input and tries to

394
00:20:20.480 --> 00:20:23.400
<v Speaker 2>transform it into a realistic looking image like a face.

395
00:20:23.880 --> 00:20:27.319
<v Speaker 2>The discriminator, meanwhile, is shown both real images from the

396
00:20:27.359 --> 00:20:30.359
<v Speaker 2>training set and fake images from the generator and has

397
00:20:30.359 --> 00:20:32.440
<v Speaker 2>to learn to tell the difference: is this image real

398
00:20:32.519 --> 00:20:32.839
<v Speaker 2>or fake?

399
00:20:33.079 --> 00:20:35.480
<v Speaker 1>So it's like a counterfeiter, the generator, trying to fool

400
00:20:35.519 --> 00:20:37.319
<v Speaker 1>a detective, the discriminator.

401
00:20:37.480 --> 00:20:40.559
<v Speaker 2>That's a great analogy, and the key is they both

402
00:20:40.599 --> 00:20:43.839
<v Speaker 2>get better over time because of each other. The generator

403
00:20:43.920 --> 00:20:47.440
<v Speaker 2>learns to make more convincing fakes to fool the discriminator.

404
00:20:47.839 --> 00:20:51.599
<v Speaker 2>The discriminator gets better at spotting fakes, forcing the generator

405
00:20:51.599 --> 00:20:54.799
<v Speaker 2>to improve further. It's this constant cat and mouse game.

406
00:20:55.039 --> 00:20:56.839
<v Speaker 1>So how do you even know if a GAN is

407
00:20:56.880 --> 00:20:58.920
<v Speaker 1>working well? Is there an accuracy score?

408
00:20:59.160 --> 00:21:02.480
<v Speaker 2>That's the tricky part. Unlike most models where you have clear

409
00:21:02.559 --> 00:21:07.200
<v Speaker 2>metrics like accuracy or RMSE, evaluating GANs is often subjective.

410
00:21:07.440 --> 00:21:09.960
<v Speaker 2>There's no single number that tells you how real the

411
00:21:10.000 --> 00:21:12.720
<v Speaker 2>generated images are. Often you just have to look at

412
00:21:12.720 --> 00:21:16.559
<v Speaker 2>the output and judge visually. Are the generated faces plausible?

413
00:21:16.640 --> 00:21:19.799
<v Speaker 2>Do they look realistic? Although there are some more advanced

414
00:21:19.799 --> 00:21:22.799
<v Speaker 2>metrics researchers use, visual inspection is still common.

415
00:21:23.160 --> 00:21:26.359
<v Speaker 1>Okay, so how are these two networks actually built? The generator?

416
00:21:26.480 --> 00:21:29.400
<v Speaker 2>The generator typically starts with a vector of random noise.

417
00:21:29.759 --> 00:21:34.000
<v Speaker 2>It then uses layers like dense layers, reshaping layers, and crucially,

418
00:21:34.400 --> 00:21:36.200
<v Speaker 2>2D transposed convolutions.

419
00:21:36.240 --> 00:21:37.920
<v Speaker 1>Transposed convolutions. What do they do?

420
00:21:38.160 --> 00:21:41.599
<v Speaker 2>They're essentially the opposite of regular convolutions. They upsample the

421
00:21:41.640 --> 00:21:44.359
<v Speaker 2>feature maps, making the image larger. So you go from

422
00:21:44.359 --> 00:21:47.640
<v Speaker 2>a small noise vector through layers that gradually increase the

423
00:21:47.720 --> 00:21:51.279
<v Speaker 2>spatial dimensions, maybe from twenty five by twenty five pixels

424
00:21:51.319 --> 00:21:54.039
<v Speaker 2>to fifty by fifty until you get the desired output

425
00:21:54.079 --> 00:21:57.559
<v Speaker 2>image size. You'll also see things like batch normalization to

426
00:21:57.599 --> 00:22:01.680
<v Speaker 2>help stabilize training, and activation functions like ReLU.

427
00:22:01.680 --> 00:22:04.640
<v Speaker 1>And the discriminator, it's basically a classifier?

428
00:22:04.720 --> 00:22:08.720
<v Speaker 2>Pretty much, yeah. The discriminator is usually a standard convolutional neural network (CNN).

429
00:22:09.119 --> 00:22:11.880
<v Speaker 2>It takes an image, real or fake, as input. It

430
00:22:12.000 --> 00:22:15.240
<v Speaker 2>uses regular 2D convolution layers, often with strides greater

431
00:22:15.279 --> 00:22:17.799
<v Speaker 2>than one, which helps reduce the image dimensions as you

432
00:22:17.839 --> 00:22:21.400
<v Speaker 2>go deeper. It might use leaky ReLU activations, which

433
00:22:21.440 --> 00:22:24.519
<v Speaker 2>sometimes work well in discriminators. Then eventually it flattens the features,

434
00:22:24.759 --> 00:22:29.160
<v Speaker 2>maybe applies some dropout for regularization, and outputs a single probability,

435
00:22:29.400 --> 00:22:32.519
<v Speaker 2>usually via a sigmoid function, representing the likelihood that the

436
00:22:32.519 --> 00:22:33.559
<v Speaker 2>input image was real.

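NOTE
A small Python sketch of the size arithmetic behind the resizing described above for both networks: transposed convolutions grow the feature map in the generator, and strided convolutions shrink it in the discriminator. It assumes the Keras-style "same" and "valid" padding conventions with no dilation or output padding. The helper names and the kernel size of 5 are illustrative assumptions.
```python
import math
def conv_transpose_size(size, stride, kernel, padding="same"):
    """Output length of a transposed convolution (upsampling)."""
    if padding == "same":
        return size * stride
    return size * stride + max(kernel - stride, 0)  # "valid" padding
def conv_size(size, stride, kernel, padding="same"):
    """Output length of a regular strided convolution (downsampling)."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid" padding
# Generator direction: a 25-pixel side doubled by a stride of 2.
upsampled = conv_transpose_size(25, stride=2, kernel=5)
# Discriminator direction: a 50-pixel side halved by a stride of 2.
downsampled = conv_size(50, stride=2, kernel=5)
```
With "same" padding a stride of 2 doubles or halves each spatial dimension, which matches the 25 by 25 to 50 by 50 step mentioned in the discussion.
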
437
00:22:33.720 --> 00:22:36.400
<v Speaker 1>And preparing the image data for this, that involves

438
00:22:36.680 --> 00:22:38.599
<v Speaker 1>loading JPEGs, resizing?

439
00:22:38.880 --> 00:22:41.559
<v Speaker 2>Yeah, consistency is key. You need to load all your

440
00:22:41.599 --> 00:22:46.039
<v Speaker 2>real images, maybe JPEG files, into numerical arrays, resize

441
00:22:46.079 --> 00:22:48.240
<v Speaker 2>them all to the exact same dimensions like fifty by

442
00:22:48.319 --> 00:22:51.480
<v Speaker 2>fifty pixels in the book's example, and then typically stack

443
00:22:51.559 --> 00:22:55.480
<v Speaker 2>them all into a single large four-dimensional array: number

444
00:22:55.519 --> 00:22:58.559
<v Speaker 2>of images, height, width, color channels. That's the format the

445
00:22:58.640 --> 00:22:59.440
<v Speaker 2>networks expect.

446
00:23:00.119 --> 00:23:01.680
<v Speaker 1>You train them together, right?

447
00:23:01.720 --> 00:23:04.759
<v Speaker 2>You alternate training steps. You train the discriminator on a

448
00:23:04.759 --> 00:23:07.759
<v Speaker 2>batch of real images labeled as real and fake images

449
00:23:07.799 --> 00:23:10.279
<v Speaker 2>from the generator labeled as fake. Then you freeze

450
00:23:10.279 --> 00:23:13.200
<v Speaker 2>the discriminator's weights and train the generator based on whether

451
00:23:13.240 --> 00:23:16.559
<v Speaker 2>the discriminator was fooled by its latest fakes. The generator's

452
00:23:16.559 --> 00:23:19.599
<v Speaker 2>goal is to produce images that the discriminator labels as real.

453
00:23:19.920 --> 00:23:22.720
<v Speaker 2>It's this back and forth that drives the learning process,

454
00:23:23.079 --> 00:23:25.720
<v Speaker 2>and tweaking parts of this, like the network architectures or

455
00:23:25.759 --> 00:23:28.559
<v Speaker 2>training process, can lead to wildly different results.

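NOTE
A minimal Python sketch of the labels used in the two alternating training steps described above, with binary cross-entropy as the loss. It shows only the loss targets, not a full training loop or the book's R code, and the two discriminator outputs are assumed numbers.
```python
import math
def binary_cross_entropy(label, prob):
    """Loss for one prediction; `prob` is the discriminator's estimate that the image is real."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
REAL, FAKE = 1, 0
# Assumed discriminator outputs for one real image and one generated image.
prob_real_image = 0.9
prob_fake_image = 0.2
# Discriminator step: real images are labeled real, generated images are labeled fake.
d_loss = binary_cross_entropy(REAL, prob_real_image) + binary_cross_entropy(FAKE, prob_fake_image)
# Generator step (discriminator frozen): the same fake is scored against the label "real",
# so the generator's loss is high when the discriminator was not fooled.
g_loss = binary_cross_entropy(REAL, prob_fake_image)
```
Here the discriminator is doing well, so its loss is small while the generator's loss is large. As the generator improves, prob_fake_image rises and the balance shifts.
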
456
00:23:28.799 --> 00:23:32.240
<v Speaker 1>Wow. Okay, that was an incredible deep dive into hands

457
00:23:32.240 --> 00:23:35.119
<v Speaker 1>on deep learning with R. We've really covered a huge

458
00:23:35.119 --> 00:23:38.480
<v Speaker 1>amount of ground, everything from those foundational machine learning concepts,

459
00:23:38.880 --> 00:23:43.000
<v Speaker 1>setting up the R environment, getting into the nitty gritty

460
00:23:43.000 --> 00:23:46.839
<v Speaker 1>of artificial neural networks, and then exploring all these diverse applications.

461
00:23:47.039 --> 00:23:50.039
<v Speaker 2>Yeah, we really have. We've seen how deep learning is

462
00:23:50.119 --> 00:23:54.200
<v Speaker 2>powering critical areas like image recognition with those CNNs, how

463
00:23:54.440 --> 00:24:00.279
<v Speaker 2>personalized recommender systems work using embeddings, how LSTMs tackle time

464
00:24:00.279 --> 00:24:04.640
<v Speaker 2>series forecasting, and yeah, that utterly fascinating world of GANs

465
00:24:04.680 --> 00:24:08.160
<v Speaker 2>that can literally generate entirely new data from scratch.

466
00:24:08.480 --> 00:24:10.920
<v Speaker 1>So what does this all mean for you the listener?

467
00:24:10.920 --> 00:24:13.599
<v Speaker 1>Hopefully you've gained some really practical insights into how these

468
00:24:13.680 --> 00:24:16.519
<v Speaker 1>complex models are actually built, how they're optimized, and how

469
00:24:16.559 --> 00:24:19.599
<v Speaker 1>they're applied in real world scenarios. And crucially, you've seen

470
00:24:19.640 --> 00:24:23.000
<v Speaker 1>the specific R libraries and techniques that make it all possible

471
00:24:23.039 --> 00:24:25.640
<v Speaker 1>within that ecosystem. It's kind of a shortcut to understanding

472
00:24:25.680 --> 00:24:28.160
<v Speaker 1>the nuances, right, the things that really set these powerful

473
00:24:28.160 --> 00:24:28.759
<v Speaker 1>tools apart.

474
00:24:29.000 --> 00:24:31.680
<v Speaker 2>Absolutely, and I think the true power of deep learning,

475
00:24:31.720 --> 00:24:34.720
<v Speaker 2>when you boil it down, lies in this amazing ability

476
00:24:34.759 --> 00:24:39.200
<v Speaker 2>to learn incredibly intricate patterns and generate really profound insights,

477
00:24:39.519 --> 00:24:42.759
<v Speaker 2>often from just vast amounts of raw data. And maybe

478
00:24:42.799 --> 00:24:45.400
<v Speaker 2>if we connect this to the bigger picture, consider that

479
00:24:45.519 --> 00:24:48.960
<v Speaker 2>adversarial training concept from GANs, you know, where two components

480
00:24:49.039 --> 00:24:52.400
<v Speaker 2>learn by competing against each other. Could that idea

481
00:24:52.640 --> 00:24:56.279
<v Speaker 2>inspire completely new approaches to problem solving in fields way

482
00:24:56.319 --> 00:24:59.319
<v Speaker 2>beyond just generating images? Maybe areas like, I don't know,

483
00:24:59.359 --> 00:25:02.480
<v Speaker 2>scientific discovery or designing complex systems, where you could

484
00:25:02.480 --> 00:25:06.279
<v Speaker 2>set up competing agents or models and that iterative competition

485
00:25:06.400 --> 00:25:08.759
<v Speaker 2>actually drives you towards optimal solutions?

486
00:25:08.960 --> 00:25:12.640
<v Speaker 1>That is a thought-provoking idea, using that competitive dynamic. Yeah,

487
00:25:12.759 --> 00:25:15.759
<v Speaker 1>very interesting, something to definitely mull over. Well, that's all

488
00:25:15.799 --> 00:25:17.160
<v Speaker 1>the time we have for this deep dive.
