WEBVTT

1
00:00:00.120 --> 00:00:04.080
<v Speaker 1>Welcome to the deep dive today. We're taking a shortcut

2
00:00:04.120 --> 00:00:07.360
<v Speaker 1>really, to understanding deep learning. It's everywhere, right? It really is.

3
00:00:07.519 --> 00:00:10.240
<v Speaker 1>So we've got some excerpts here from Deep Learning with Python,

4
00:00:10.800 --> 00:00:14.000
<v Speaker 1>and basically our mission is to pull out the core ideas,

5
00:00:14.119 --> 00:00:17.920
<v Speaker 1>what's it doing, how does it work fundamentally, and maybe

6
00:00:18.000 --> 00:00:18.760
<v Speaker 1>where it's headed?

7
00:00:18.920 --> 00:00:22.600
<v Speaker 2>Yeah, getting the essence without you know, needing a PhD

8
00:00:22.679 --> 00:00:26.160
<v Speaker 2>in maths. Exactly, avoid the overwhelm. And the book itself

9
00:00:26.239 --> 00:00:29.160
<v Speaker 2>really tries to make it accessible. It pushes back against

10
00:00:29.160 --> 00:00:32.359
<v Speaker 2>this idea that deep learning is some kind of like

11
00:00:32.479 --> 00:00:37.240
<v Speaker 2>dark art. It highlights how Python and TensorFlow 2 specifically,

12
00:00:37.560 --> 00:00:41.320
<v Speaker 2>plus the Keras community, how all that has made it

13
00:00:41.359 --> 00:00:42.640
<v Speaker 2>practical for way more people.

14
00:00:42.759 --> 00:00:45.560
<v Speaker 1>Right, So we want to give you listening a clear

15
00:00:45.679 --> 00:00:47.679
<v Speaker 1>sense of what it's good for. Its limits too.

16
00:00:47.719 --> 00:00:50.560
<v Speaker 2>Right absolutely, and the sort of standard steps people take

17
00:00:50.600 --> 00:00:52.840
<v Speaker 2>to solve problems with it, you know, from computer vision

18
00:00:52.840 --> 00:00:53.719
<v Speaker 2>to language stuff.

19
00:00:53.840 --> 00:00:55.600
<v Speaker 1>And the author talks about how fast it's all.

20
00:00:55.479 --> 00:00:56.880
<v Speaker 2>Moving, oh, incredibly fast. Yeah.

21
00:00:56.960 --> 00:00:58.840
<v Speaker 1>Yeah, So for you listening, if you want to get

22
00:00:58.840 --> 00:01:01.439
<v Speaker 1>a handle on a complex topic pretty quickly, maybe for work,

23
00:01:01.479 --> 00:01:04.040
<v Speaker 1>maybe just because you're curious. This is aimed at giving

24
00:01:04.079 --> 00:01:06.000
<v Speaker 1>you that solid foundation.

25
00:01:05.719 --> 00:01:08.400
<v Speaker 2>Core ideas, the impact.

26
00:01:07.920 --> 00:01:11.840
<v Speaker 1>We're looking for those aha moments, keeping it focused. So

27
00:01:11.879 --> 00:01:14.480
<v Speaker 1>you walk away feeling like, okay, I get the big picture.

28
00:01:14.519 --> 00:01:16.480
<v Speaker 2>Now sounds good. Where should we start?

29
00:01:16.599 --> 00:01:19.719
<v Speaker 1>Okay, let's unpack it. What is deep learning and how

30
00:01:19.719 --> 00:01:22.359
<v Speaker 1>does it fit in with you know, AI and machine learning?

31
00:01:22.400 --> 00:01:24.280
<v Speaker 1>Those terms get thrown around a lot, they do.

32
00:01:24.359 --> 00:01:28.519
<v Speaker 2>It's a good starting point. So AI, artificial intelligence in

33
00:01:28.560 --> 00:01:32.319
<v Speaker 2>the really broad sense, is just automating tasks that usually

34
00:01:32.359 --> 00:01:33.599
<v Speaker 2>need human smarts.

35
00:01:33.760 --> 00:01:34.079
<v Speaker 1>Okay.

36
00:01:34.159 --> 00:01:37.079
<v Speaker 2>It's a huge field actually, and older than many people think.

37
00:01:37.519 --> 00:01:41.319
<v Speaker 2>Early AI, sometimes called symbolic AI, was very different, so

38
00:01:41.840 --> 00:01:44.879
<v Speaker 2>it was more about programmers writing down tons and tons

39
00:01:44.879 --> 00:01:48.519
<v Speaker 2>of rules by hand, building these big knowledge databases. The

40
00:01:48.560 --> 00:01:51.680
<v Speaker 2>computer wasn't really learning from experience in the way we

41
00:01:51.719 --> 00:01:52.280
<v Speaker 2>think of now.

42
00:01:52.400 --> 00:01:56.599
<v Speaker 1>Ah, So no actual learning, just following pre written instructions,

43
00:01:56.599 --> 00:01:58.920
<v Speaker 1>basically like those old chess programs.

44
00:01:58.680 --> 00:02:02.000
<v Speaker 2>Exactly like that, just rules. Machine learning then emerged as

45
00:02:02.040 --> 00:02:05.040
<v Speaker 2>its own thing, where the focus shifted. The idea became

46
00:02:05.599 --> 00:02:09.159
<v Speaker 2>can we build programs, models, that learn from data?

47
00:02:09.280 --> 00:02:10.879
<v Speaker 1>Okay, that sounds more familiar.

48
00:02:10.960 --> 00:02:13.960
<v Speaker 2>Yeah. The model finds patterns in the data itself, makes predictions,

49
00:02:13.960 --> 00:02:18.759
<v Speaker 2>makes decisions, all without programmers explicitly telling it every single rule.

50
00:02:18.879 --> 00:02:21.000
<v Speaker 1>Right, and deep learning? Where does that fit?

51
00:02:21.199 --> 00:02:25.599
<v Speaker 2>Deep learning is a subfield within machine learning. Its defining

52
00:02:25.680 --> 00:02:29.719
<v Speaker 2>characteristic is using these multi-stage ways of learning representations

53
00:02:29.719 --> 00:02:30.319
<v Speaker 2>of the data.

54
00:02:30.400 --> 00:02:31.159
<v Speaker 1>Multi stage.

55
00:02:31.280 --> 00:02:34.520
<v Speaker 2>Yeah, think of processing the data through many layers. Each

56
00:02:34.639 --> 00:02:38.439
<v Speaker 2>layer learns to represent the data in a slightly more complex,

57
00:02:38.520 --> 00:02:41.159
<v Speaker 2>more useful way based on the layer before it.

58
00:02:41.360 --> 00:02:44.360
<v Speaker 1>Okay, so breaking it down, not trying to learn everything

59
00:02:44.400 --> 00:02:47.360
<v Speaker 1>in one giant leap. The book uses three figures right

60
00:02:47.479 --> 00:02:48.719
<v Speaker 1>to explain how it works.

61
00:02:48.800 --> 00:02:50.840
<v Speaker 2>It does. Yeah, it's a good way to picture it. First,

62
00:02:50.960 --> 00:02:54.960
<v Speaker 2>the basic idea deep learning maps inputs to targets. It

63
00:02:55.039 --> 00:02:57.199
<v Speaker 2>learns this mapping by just looking at lots and lots

64
00:02:57.240 --> 00:02:57.719
<v Speaker 2>of examples.

65
00:02:57.719 --> 00:03:00.280
<v Speaker 1>We'll show it cat pictures and dog pictures.

66
00:03:00.240 --> 00:03:02.879
<v Speaker 2>Exactly, and tell it which is which. That's the input,

67
00:03:03.400 --> 00:03:08.000
<v Speaker 2>the image, and the target, the label: cat or dog. Second,

68
00:03:08.360 --> 00:03:11.719
<v Speaker 2>this mapping isn't direct. The data flows through a deep

69
00:03:11.800 --> 00:03:15.520
<v Speaker 2>sequence of simple transformations. The layers you mentioned? Precisely. These

70
00:03:15.599 --> 00:03:18.400
<v Speaker 2>layers are like steps in an assembly line. Each one

71
00:03:18.439 --> 00:03:21.199
<v Speaker 2>does something relatively simple to the data it receives. And

72
00:03:21.319 --> 00:03:25.560
<v Speaker 2>the third point, crucially, these transformations, these operations, the layers

73
00:03:25.599 --> 00:03:29.240
<v Speaker 2>perform, they aren't hand-coded by a programmer. The model

74
00:03:29.319 --> 00:03:32.479
<v Speaker 2>learns what transformations are useful by seeing all those examples

75
00:03:32.520 --> 00:03:33.120
<v Speaker 2>during training.

76
00:03:33.199 --> 00:03:36.479
<v Speaker 1>Okay, learned transformations. That feels like the core of it,

77
00:03:36.520 --> 00:03:39.800
<v Speaker 1>doesn't it. It figures out what features matter on its own.

78
00:03:39.879 --> 00:03:40.479
<v Speaker 2>That's the magic.

79
00:03:40.560 --> 00:03:43.360
<v Speaker 1>Yeah, And this figuring out happens in what the book

80
00:03:43.400 --> 00:03:46.159
<v Speaker 1>calls the training loop. Can you walk us through that?

81
00:03:46.199 --> 00:03:46.919
<v Speaker 1>What's happening there?

82
00:03:46.960 --> 00:03:50.599
<v Speaker 2>Okay? The training loop. So when you first create a network,

83
00:03:51.199 --> 00:03:54.199
<v Speaker 2>its internal settings, these numbers called weights, are just

84
00:03:54.240 --> 00:03:56.680
<v Speaker 2>set randomly, to small random numbers.

85
00:03:56.400 --> 00:03:59.520
<v Speaker 1>Usually. So it knows nothing, basically. Its first guesses are

86
00:03:59.520 --> 00:04:00.120
<v Speaker 1>wild.

87
00:04:00.319 --> 00:04:02.879
<v Speaker 2>Pretty much guaranteed to be wrong, yeah, which means it

88
00:04:02.879 --> 00:04:06.199
<v Speaker 2>will have a high loss score. The loss is just

89
00:04:06.240 --> 00:04:08.919
<v Speaker 2>a number that measures how far off the network's predictions

90
00:04:08.919 --> 00:04:12.800
<v Speaker 2>are from the actual targets. High loss means very wrong.

91
00:04:12.719 --> 00:04:14.759
<v Speaker 1>Like static on a radio you haven't tuned.

92
00:04:14.520 --> 00:04:17.399
<v Speaker 2>Yet, good analogy, lots of static initially, but then for

93
00:04:17.439 --> 00:04:19.639
<v Speaker 2>each example you show it during training.

94
00:04:19.519 --> 00:04:21.480
<v Speaker 1>Like one cat picture, right.

95
00:04:21.360 --> 00:04:24.160
<v Speaker 2>It makes a prediction. Yeah, it calculates the loss how

96
00:04:24.199 --> 00:04:26.560
<v Speaker 2>wrong it was for that picture, And then comes the

97
00:04:26.560 --> 00:04:30.800
<v Speaker 2>clever part, using calculus, specifically the gradient.

98
00:04:30.480 --> 00:04:33.439
<v Speaker 1>Gradient sounds technical. It is a

99
00:04:33.319 --> 00:04:36.480
<v Speaker 2>bit, but think of it like this. The gradient tells

100
00:04:36.519 --> 00:04:39.360
<v Speaker 2>you the direction of steepest increase in the loss, like

101
00:04:39.720 --> 00:04:41.000
<v Speaker 2>which way is more wrong?

102
00:04:41.319 --> 00:04:41.720
<v Speaker 1>Okay?

103
00:04:42.000 --> 00:04:45.519
<v Speaker 2>So the optimization algorithm, usually some form of gradient descent,

104
00:04:46.079 --> 00:04:48.879
<v Speaker 2>takes that information and adjusts the weights slightly in the

105
00:04:48.879 --> 00:04:51.839
<v Speaker 2>opposite direction, the direction that would have made the loss

106
00:04:51.879 --> 00:04:53.959
<v Speaker 2>a tiny bit smaller for that one example.

107
00:04:54.160 --> 00:04:57.800
<v Speaker 1>Ah, So it nudges the weights downhill towards less error.

108
00:04:58.000 --> 00:05:00.959
<v Speaker 2>Exactly, it takes a small step downhill on the error landscape.

109
00:05:00.959 --> 00:05:04.480
<v Speaker 1>And it does this over and over for every example, yep.

110
00:05:04.680 --> 00:05:08.319
<v Speaker 2>For every example in your training data, usually in small batches,

111
00:05:08.839 --> 00:05:11.399
<v Speaker 2>and you repeat this process over the entire data set

112
00:05:11.519 --> 00:05:14.519
<v Speaker 2>multiple times. Each full pass through the data set is

113
00:05:14.560 --> 00:05:17.000
<v Speaker 2>called an epoch. Epoch, got it. And as you go

114
00:05:17.040 --> 00:05:20.240
<v Speaker 2>through more and more epochs, tweaking the weights after each batch,

115
00:05:20.360 --> 00:05:22.920
<v Speaker 2>the overall loss score gradually goes.

116
00:05:22.800 --> 00:05:24.439
<v Speaker 1>Down the static clears up.

117
00:05:24.600 --> 00:05:27.439
<v Speaker 2>Right. A well trained network is one where the loss

118
00:05:27.480 --> 00:05:31.560
<v Speaker 2>is very low, meaning its predictions are consistently close to

119
00:05:31.600 --> 00:05:32.959
<v Speaker 2>the actual target values.
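
NOTE
A minimal sketch of the training loop described above, assuming a toy model with one weight and one bias and mean squared error as the loss. The data, learning rate, batch size, and function names are illustrative assumptions, not taken from the book.
```python
import random
# Toy data (assumed for illustration): targets follow y = 2x + 1.
random.seed(0)
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(-20, 21)]
def mean_squared_loss(w, b, batch):
    """Loss score: average squared gap between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)
def gradients(w, b, batch):
    """Partial derivatives of the loss with respect to w and b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b
# Weights start as small random numbers, so the first predictions are poor.
w, b = random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)
learning_rate, batch_size = 0.05, 8
initial_loss = mean_squared_loss(w, b, data)
for epoch in range(200):  # one epoch is one full pass over the data
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w, grad_b = gradients(w, b, batch)
        # Step against the gradient, the direction that lowers the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
final_loss = mean_squared_loss(w, b, data)
print(f"loss went from {initial_loss:.3f} to {final_loss:.6f}; w={w:.3f}, b={b:.3f}")
```
A real network has millions of weights and gets its gradients from automatic differentiation, but the loop has the same shape: predict, score, step downhill, repeat.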

120
00:05:33.079 --> 00:05:35.800
<v Speaker 1>Okay, that makes a lot of sense learning from mistakes

121
00:05:36.000 --> 00:05:39.480
<v Speaker 1>step by tiny step. Now. The book also mentions other

122
00:05:39.560 --> 00:05:43.920
<v Speaker 1>machine learning algorithms kind of for context logistic regression, SVMs,

123
00:05:44.120 --> 00:05:46.040
<v Speaker 1>random forests. Why bring those up?

124
00:05:46.639 --> 00:05:48.959
<v Speaker 2>It helps to see where deep learning fits in the

125
00:05:48.959 --> 00:05:52.079
<v Speaker 2>broader picture. These are really important tools in what you

126
00:05:52.160 --> 00:05:55.959
<v Speaker 2>might call classical or shallow machine learning. Shallow yeah, generally

127
00:05:56.000 --> 00:05:59.000
<v Speaker 2>meaning they don't have that deep, multi layered structure for

128
00:05:59.160 --> 00:06:03.040
<v Speaker 2>learning representations. Logistic regression, for instance, it's pretty simple, but

129
00:06:03.079 --> 00:06:06.040
<v Speaker 2>still super useful. For classification, it's often the first thing

130
00:06:06.040 --> 00:06:07.959
<v Speaker 2>you try. Like the hello world

131
00:06:07.759 --> 00:06:11.399
<v Speaker 1>of ML. Okay, and SVMs, support vector machines?

132
00:06:12.160 --> 00:06:15.360
<v Speaker 2>SVMs try to find the best possible boundary, like a

133
00:06:15.399 --> 00:06:18.439
<v Speaker 2>line or a plane, to separate different classes in your data.

134
00:06:18.800 --> 00:06:21.560
<v Speaker 2>They have this neat mathematical trick called the kernel.

135
00:06:21.240 --> 00:06:23.319
<v Speaker 1>trick. Ooh, kernel trick sounds

136
00:06:23.360 --> 00:06:27.079
<v Speaker 2>fancy. It kind of is. It lets SVMs handle complex

137
00:06:27.360 --> 00:06:32.240
<v Speaker 2>nonlinear separations without explicitly calculating coordinates in a super high

138
00:06:32.240 --> 00:06:35.040
<v Speaker 2>dimensional space. It's computationally clever.
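
NOTE
A small sketch of the idea behind the kernel trick, assuming the degree-2 polynomial kernel k(p, q) = (p . q)^2 on 2-D points. The kernel choice and the function names are illustrative assumptions; the episode does not name a specific kernel.
```python
import math
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))
def explicit_features(p):
    """Map a 2-D point into the 3-D feature space of the degree-2 kernel."""
    x1, x2 = p
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]
def poly_kernel(p, q):
    """Same inner product, computed without building the 3-D vectors."""
    return dot(p, q) ** 2
p, q = (1.0, 2.0), (3.0, -1.0)
via_features = dot(explicit_features(p), explicit_features(q))
via_kernel = poly_kernel(p, q)
print(via_features, via_kernel)
```
Both routes give the same number up to floating-point rounding. The kernel only ever touches the original 2-D coordinates, which is what makes very high-dimensional feature spaces affordable.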

139
00:06:35.160 --> 00:06:39.160
<v Speaker 1>Hmm. Interesting Maybe for another deep dive. What about random

140
00:06:39.199 --> 00:06:40.720
<v Speaker 1>forests and gradient boosting.

141
00:06:40.959 --> 00:06:44.680
<v Speaker 2>Both are ensemble methods. They combine predictions from many simpler models.

142
00:06:45.120 --> 00:06:48.279
<v Speaker 2>Random forests build lots of decision trees on different subsets

143
00:06:48.319 --> 00:06:51.240
<v Speaker 2>of the data and features, then average their outputs or

144
00:06:51.279 --> 00:06:52.360
<v Speaker 2>take a majority.

145
00:06:52.079 --> 00:06:54.720
<v Speaker 1>vote. Like wisdom of the crowd, but for trees.

146
00:06:54.560 --> 00:06:58.160
<v Speaker 2>Sort of Yeah, they're often really strong performers, very robust.

147
00:06:58.480 --> 00:07:01.800
<v Speaker 2>Gradient boosting machines are also ensemble methods, but they build

148
00:07:01.839 --> 00:07:05.079
<v Speaker 2>trees sequentially. Each new tree tries to correct the errors

149
00:07:05.079 --> 00:07:06.279
<v Speaker 2>made by the trees that came before.

150
00:07:06.319 --> 00:07:09.040
<v Speaker 1>It. Oh interesting, like building on previous mistakes.

151
00:07:08.680 --> 00:07:14.560
<v Speaker 2>Exactly, and they often slightly outperform random forests, though they

152
00:07:14.560 --> 00:07:17.279
<v Speaker 2>can be a bit more sensitive to tuning. But again,

153
00:07:17.319 --> 00:07:19.759
<v Speaker 2>these are generally considered shallow compared.

154
00:07:19.360 --> 00:07:21.920
<v Speaker 1>To deep learning, right, they don't have that automatic, multi

155
00:07:22.000 --> 00:07:25.639
<v Speaker 1>layered feature learning. So that brings us back, what is

156
00:07:25.680 --> 00:07:29.680
<v Speaker 1>it about deep learning that's so transformative? What's the key difference?

157
00:07:29.879 --> 00:07:33.199
<v Speaker 2>I think the biggest thing is its ability to learn

158
00:07:33.360 --> 00:07:38.079
<v Speaker 2>all the layers of representation jointly, simultaneously. Jointly, as opposed to?

159
00:07:38.160 --> 00:07:41.160
<v Speaker 2>As opposed to traditional approaches where you might have separate steps.

160
00:07:41.319 --> 00:07:44.720
<v Speaker 2>Like first you'd manually engineer some features from the.

161
00:07:44.759 --> 00:07:47.879
<v Speaker 1>Raw data, like counting specific words in text or finding

162
00:07:48.000 --> 00:07:49.319
<v Speaker 1>edges in an image. Exactly.

163
00:07:49.399 --> 00:07:52.040
<v Speaker 2>You'd do that feature engineering, and then you'd feed those

164
00:07:52.079 --> 00:07:55.480
<v Speaker 2>engineered features into a classifier like an SVM or a

165
00:07:55.519 --> 00:07:57.959
<v Speaker 2>logistic regression. Deep learning does it all in one go.

166
00:07:58.040 --> 00:08:00.480
<v Speaker 2>The network learns the best features and how to classify

167
00:08:00.560 --> 00:08:01.639
<v Speaker 2>based on them all together.

168
00:08:01.800 --> 00:08:05.319
<v Speaker 1>Ah okay, and why is learning them jointly so powerful?

169
00:08:05.519 --> 00:08:07.839
<v Speaker 2>Because the features can adapt to each other during learning.

170
00:08:08.399 --> 00:08:11.800
<v Speaker 2>If one layer starts extracting a slightly different, maybe better

171
00:08:12.079 --> 00:08:16.079
<v Speaker 2>type of feature, the layers above it can adjust automatically

172
00:08:16.240 --> 00:08:19.199
<v Speaker 2>to make use of that improved representation. It's much more

173
00:08:19.240 --> 00:08:20.399
<v Speaker 2>dynamic and integrated.

174
00:08:20.720 --> 00:08:24.519
<v Speaker 1>So the features themselves evolve during training to be optimal

175
00:08:24.600 --> 00:08:25.160
<v Speaker 1>for the task.

176
00:08:25.279 --> 00:08:27.519
<v Speaker 2>That's a great way to put it. This allows deep

177
00:08:27.600 --> 00:08:32.200
<v Speaker 2>learning to learn really complex abstract concepts by breaking them down.

178
00:08:32.440 --> 00:08:34.919
<v Speaker 2>You start with simple features at the bottom layers, like

179
00:08:35.080 --> 00:08:37.679
<v Speaker 2>edges or textures in an image, and as you go

180
00:08:37.759 --> 00:08:40.799
<v Speaker 2>up through the layers, the network combines these to learn

181
00:08:40.840 --> 00:08:44.480
<v Speaker 2>more complex things like object parts and eventually whole objects.

182
00:08:44.639 --> 00:08:47.720
<v Speaker 1>Like building complex ideas from simpler blocks. That makes sense.

183
00:08:48.120 --> 00:08:50.320
<v Speaker 1>The book also touches on the pace of progress and

184
00:08:50.440 --> 00:08:54.320
<v Speaker 1>mentions a sort of explosive phase. Where are we now?

185
00:08:54.360 --> 00:08:55.279
<v Speaker 1>According to the author?

186
00:08:55.360 --> 00:08:57.840
<v Speaker 2>Yeah, the author reflects on that period maybe around 2017,

187
00:08:57.960 --> 00:09:02.600
<v Speaker 2>2018, especially with transformer models revolutionizing language tasks.

188
00:09:03.200 --> 00:09:06.320
<v Speaker 2>It felt like huge breakthroughs were happening constantly, like the

189
00:09:06.360 --> 00:09:07.720
<v Speaker 2>steep early part of an S

190
00:09:07.720 --> 00:09:11.080
<v Speaker 1>curve. Exponential growth, almost. Almost.

191
00:09:11.360 --> 00:09:13.639
<v Speaker 2>But the feeling, at least when the book was written

192
00:09:13.639 --> 00:09:15.840
<v Speaker 2>around 2021, was that we're probably in the

193
00:09:15.879 --> 00:09:19.159
<v Speaker 2>second half of that S curve now, meaning meaning progress

194
00:09:19.200 --> 00:09:23.159
<v Speaker 2>is still definitely happening and it's significant, but maybe the

195
00:09:23.200 --> 00:09:27.200
<v Speaker 2>era of those absolutely fundamental paradigm shifting discoveries every few

196
00:09:27.200 --> 00:09:29.240
<v Speaker 2>months is slowing down a bit.

197
00:09:29.519 --> 00:09:32.159
<v Speaker 1>So more refinement, building on the existing foundations.

198
00:09:32.240 --> 00:09:36.960
<v Speaker 2>That's the sense, yeah. More incremental, but still powerful progress, finding

199
00:09:37.000 --> 00:09:39.840
<v Speaker 2>new ways to apply these incredibly strong foundations that have

200
00:09:39.840 --> 00:09:40.120
<v Speaker 2>been laid.

201
00:09:40.200 --> 00:09:44.000
<v Speaker 1>Okay, interesting perspectives, still moving fast, but maybe maturing. All right,

202
00:09:44.080 --> 00:09:46.480
<v Speaker 1>let's get into the real nuts and bolts, the components.

203
00:09:46.559 --> 00:09:49.559
<v Speaker 1>The book starts with tensors. What are they? Why are

204
00:09:49.639 --> 00:09:50.679
<v Speaker 1>they the starting point?

205
00:09:51.000 --> 00:09:55.080
<v Speaker 2>Tensors are basically the containers for data in neural networks.

206
00:09:55.120 --> 00:09:57.639
<v Speaker 2>You can think of them as generalizations of vectors and

207
00:09:57.679 --> 00:09:59.879
<v Speaker 2>matrices to potentially higher dimensions.

208
00:10:00.159 --> 00:10:02.840
<v Speaker 1>So like a number is a tensor, a list of

209
00:10:02.919 --> 00:10:04.080
<v Speaker 1>numbers a table.

210
00:10:03.919 --> 00:10:06.759
<v Speaker 2>Exactly, a single number is a scalar, or rank-0

211
00:10:06.960 --> 00:10:09.600
<v Speaker 2>tensor. A list of numbers, a vector, is a

212
00:10:09.639 --> 00:10:12.720
<v Speaker 2>rank-1 tensor. A table of numbers, a matrix, is

213
00:10:12.720 --> 00:10:14.120
<v Speaker 2>a rank-2 tensor, and you

214
00:10:14.080 --> 00:10:16.519
<v Speaker 1>Can have ranked three, ranked four, and so on.

215
00:10:16.639 --> 00:10:18.840
<v Speaker 2>Yep, the rank just tells you how many axes or

216
00:10:18.879 --> 00:10:20.120
<v Speaker 2>dimensions the tensor has.

217
00:10:20.240 --> 00:10:23.000
<v Speaker 1>What defines a tensor then, besides the data.

218
00:10:22.799 --> 00:10:26.320
<v Speaker 2>itself? Two key things: its shape and its data type,

219
00:10:26.600 --> 00:10:28.960
<v Speaker 2>or dtype. The shape tells you how many elements

220
00:10:29.000 --> 00:10:31.159
<v Speaker 2>are along each axis, like a matrix might be shape

221
00:10:31.320 --> 00:10:33.559
<v Speaker 2>(3, 5). The dtype tells you what kind of

222
00:10:33.639 --> 00:10:36.440
<v Speaker 2>numbers are inside, like 32-bit floating-point numbers

223
00:10:36.519 --> 00:10:37.159
<v Speaker 2>or integers.

224
00:10:37.240 --> 00:10:41.000
<v Speaker 1>Okay, can you give examples of real-world data as tensors? Sure.

225
00:10:41.279 --> 00:10:44.960
<v Speaker 2>Simple tabular data like customer info, age, income, whatever, it could

226
00:10:44.960 --> 00:10:48.440
<v Speaker 2>be a rank-2 tensor: rows are customers, columns are features. Right,

227
00:10:48.679 --> 00:10:51.679
<v Speaker 2>time series data like daily stock prices for several stocks

228
00:10:51.759 --> 00:10:55.519
<v Speaker 2>might be rank-3: stocks, time steps, features like open,

229
00:10:55.559 --> 00:10:59.559
<v Speaker 2>high, low, close. Images are typically rank-4: number of images,

230
00:10:59.639 --> 00:11:02.480
<v Speaker 2>height, width, color channels, usually three for RGB.

231
00:11:02.720 --> 00:11:04.200
<v Speaker 1>More dimensions for images.

232
00:11:04.840 --> 00:11:07.360
<v Speaker 2>Video adds another dimension for time or frames, making it

233
00:11:07.399 --> 00:11:10.200
<v Speaker 2>rank-5: number of videos, frames, height, width, channels.
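
NOTE
A rough sketch of rank and shape for the data examples above, using plain nested lists so it runs without extra packages. The helpers shape_of and zeros and all the sizes are made up for illustration; in practice NumPy arrays expose the same information through their ndim, shape, and dtype attributes.
```python
def shape_of(tensor):
    """Shape of a regular, non-empty nested list; a bare number has shape ()."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)
def zeros(*dims):
    """Build a nested list of zeros with the requested shape."""
    if not dims:
        return 0.0
    return [zeros(*dims[1:]) for _ in range(dims[0])]
# Sizes below are arbitrary small numbers chosen for the example.
examples = {
    "scalar": 7.0,
    "vector": [1.0, 2.0, 3.0],
    "customers x features": zeros(3, 5),
    "stocks x time steps x features": zeros(2, 4, 4),
    "images x height x width x channels": zeros(2, 8, 8, 3),
    "videos x frames x height x width x channels": zeros(2, 3, 8, 8, 3),
}
for name, tensor in examples.items():
    shape = shape_of(tensor)
    print(f"{name}: rank {len(shape)}, shape {shape}")
```
The rank is just the length of the shape tuple, which is why a scalar comes out as rank 0 and a batch of videos as rank 5.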

234
00:11:10.360 --> 00:11:13.039
<v Speaker 1>Okay, I see how tensors provide this flexible structure for

235
00:11:13.080 --> 00:11:16.120
<v Speaker 1>all sorts of data. So if tensors hold the data,

236
00:11:16.159 --> 00:11:19.919
<v Speaker 1>what are the tensor operations? The book mentions the gears.

237
00:11:20.159 --> 00:11:22.879
<v Speaker 2>These are the mathematical operations that the layers perform on

238
00:11:22.919 --> 00:11:26.440
<v Speaker 2>the tensors. They're the calculations that transform the data

239
00:11:26.559 --> 00:11:28.120
<v Speaker 2>as it flows through the network.

240
00:11:28.279 --> 00:11:29.600
<v Speaker 1>Like, what kind of operations?

241
00:11:29.679 --> 00:11:33.200
<v Speaker 2>Well, there are simple element wise operations where you do

242
00:11:33.240 --> 00:11:35.879
<v Speaker 2>the same thing like add, multiply, or apply a function

243
00:11:36.399 --> 00:11:40.000
<v Speaker 2>to each individual number in the tensor. There's broadcasting, which

244
00:11:40.039 --> 00:11:43.360
<v Speaker 2>is a set of rules allowing operations between tensors of

245
00:11:43.639 --> 00:11:47.480
<v Speaker 2>different but compatible shapes. It's very useful. The tensor product

246
00:11:47.559 --> 00:11:50.559
<v Speaker 2>or dot product is absolutely fundamental. It's a core operation

247
00:11:50.639 --> 00:11:55.279
<v Speaker 2>in linear algebra and used constantly in dense layers. And reshaping,

248
00:11:55.320 --> 00:11:58.399
<v Speaker 2>which changes the tensor shape without changing its contents.
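
NOTE
The four operations just listed can be sketched on small nested lists. This is a simplified illustration with hypothetical helper names; real frameworks implement these as optimized array routines, and their broadcasting rules cover more cases than the single row-wise case shown here.
```python
def relu(matrix):
    """Element-wise: apply max(x, 0) to every entry independently."""
    return [[max(x, 0.0) for x in row] for row in matrix]
def add_broadcast(matrix, vector):
    """Broadcasting (one simple case): reuse the vector for every row."""
    return [[x + v for x, v in zip(row, vector)] for row in matrix]
def dot(matrix, vector):
    """Matrix-vector product: one sum of products per row."""
    return [sum(x * v for x, v in zip(row, vector)) for row in matrix]
def reshape(matrix, rows, cols):
    """Same entries in the same order, regrouped into a new shape."""
    flat = [x for row in matrix for x in row]
    assert len(flat) == rows * cols, "reshape must keep the element count"
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]
m = [[1.0, -2.0, 3.0], [-4.0, 5.0, -6.0]]  # shape (2, 3)
print(relu(m))
print(add_broadcast(m, [10.0, 20.0, 30.0]))
print(dot(m, [1.0, 1.0, 1.0]))
print(reshape(m, 3, 2))
```
Note that reshape only regroups the six entries; nothing is added, dropped, or reordered.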

249
00:11:58.519 --> 00:12:03.200
<v Speaker 1>The book also has this geometric interpretation deep learning as

250
00:12:03.360 --> 00:12:07.240
<v Speaker 1>untangling data manifolds. That sounds abstract.

251
00:12:07.600 --> 00:12:09.799
<v Speaker 2>It is a bit abstract, but it's a powerful way

252
00:12:09.840 --> 00:12:14.440
<v Speaker 2>to think about it. Imagine your raw data points, maybe

253
00:12:14.600 --> 00:12:18.360
<v Speaker 2>images of handwritten digits are all jumbled together in a

254
00:12:18.399 --> 00:12:21.600
<v Speaker 2>high dimensional space like a crumpled piece of paper.

255
00:12:21.759 --> 00:12:22.960
<v Speaker 1>Okay, a messy blob.

256
00:12:23.120 --> 00:12:27.279
<v Speaker 2>Right, a data manifold. Each layer in a deep network

257
00:12:27.440 --> 00:12:31.720
<v Speaker 2>applies a transformation, a tensor operation that essentially tries to

258
00:12:31.799 --> 00:12:35.399
<v Speaker 2>uncrumple that paper a little bit. It stretches, rotates, and

259
00:12:35.440 --> 00:12:38.080
<v Speaker 2>folds the space that data lives in, trying to make

260
00:12:38.080 --> 00:12:41.559
<v Speaker 2>the different categories the different digits in this example more

261
00:12:41.600 --> 00:12:42.440
<v Speaker 2>easily separable.

262
00:12:42.960 --> 00:12:46.000
<v Speaker 1>So layer by layer, it's smoothing out the crumpled paper

263
00:12:46.080 --> 00:12:49.080
<v Speaker 1>until the digits written on different parts are clearly distinct.

264
00:12:48.759 --> 00:12:53.279
<v Speaker 2>Exactly, untangling the manifold. After enough layers, ideally the different

265
00:12:53.279 --> 00:12:55.639
<v Speaker 2>classes of data will be nicely separated, maybe even by

266
00:12:55.639 --> 00:12:56.320
<v Speaker 2>simple planes.

267
00:12:56.480 --> 00:12:59.159
<v Speaker 1>That's a great visual. Okay, so tensors are data, operations

268
00:12:59.200 --> 00:13:02.919
<v Speaker 1>manipulate them geometrically. The next piece is layers. What are they?

269
00:13:02.919 --> 00:13:05.600
<v Speaker 2>Fundamentally, layers are the building blocks you stack together to

270
00:13:05.639 --> 00:13:07.919
<v Speaker 2>create a deep learning model. You can think of them

271
00:13:07.919 --> 00:13:10.840
<v Speaker 2>as modules that process data. They take one or more

272
00:13:10.840 --> 00:13:13.159
<v Speaker 2>tensors as input and spit out one or more tensors

273
00:13:13.200 --> 00:13:14.240
<v Speaker 2>as output.

274
00:13:14.000 --> 00:13:16.039
<v Speaker 1>And they perform those tensor operations we

275
00:13:16.039 --> 00:13:19.879
<v Speaker 2>just talked about. Precisely. Some layers are stateless, their output

276
00:13:19.919 --> 00:13:22.559
<v Speaker 2>just depends on the current input. Others have internal state.

277
00:13:23.039 --> 00:13:25.440
<v Speaker 2>This state consists of the layer's weights.

278
00:13:25.279 --> 00:13:27.000
<v Speaker 1>The things that get learned during training.

279
00:13:27.320 --> 00:13:30.679
<v Speaker 2>Exactly. The weights are themselves tensors, and they contain the

280
00:13:30.720 --> 00:13:33.600
<v Speaker 2>knowledge the layer has learned. They get updated during training

281
00:13:33.679 --> 00:13:34.559
<v Speaker 2>via gradient descent.

282
00:13:34.960 --> 00:13:37.279
<v Speaker 1>And we use different types of layers for different.

283
00:13:37.080 --> 00:13:41.320
<v Speaker 2>data, right? Yes, absolutely. Dense layers, also called fully connected

284
00:13:41.360 --> 00:13:45.720
<v Speaker 2>layers, are common for vector data. Convolutional layers like Conv2D

285
00:13:45.840 --> 00:13:49.320
<v Speaker 2>two D are the stars for image data. Recurrent layers

286
00:13:49.360 --> 00:13:52.840
<v Speaker 2>like LSTMs or GRUs are designed for sequential data like

287
00:13:52.879 --> 00:13:55.679
<v Speaker 2>text or time series. You choose layers suited to your

288
00:13:55.759 --> 00:13:56.399
<v Speaker 2>data structure.

289
00:13:56.480 --> 00:13:58.440
<v Speaker 1>Let's zoom in on dense layers for a second. What's

290
00:13:58.480 --> 00:14:00.919
<v Speaker 1>the core operation they do and what's the deal with

291
00:14:01.080 --> 00:14:03.320
<v Speaker 1>activation functions like ReLU?

292
00:14:03.639 --> 00:14:07.480
<v Speaker 2>Okay. A dense layer performs what's mathematically called an affine transform.

293
00:14:07.879 --> 00:14:11.120
<v Speaker 2>It takes the input vector, multiplies it by a weight matrix,

294
00:14:11.200 --> 00:14:13.840
<v Speaker 2>that's a tensor product, and then adds a bias vector.

295
00:14:14.240 --> 00:14:17.519
<v Speaker 2>It's basically output = dot(W, input) +

296
00:14:17.399 --> 00:14:20.360
<v Speaker 1>b. A linear transformation plus an offset.

297
00:14:20.519 --> 00:14:23.600
<v Speaker 2>Correct. Now, here's a really important point. If you just

298
00:14:23.639 --> 00:14:26.720
<v Speaker 2>stack a bunch of these dense layers together doing only

299
00:14:26.759 --> 00:14:30.720
<v Speaker 2>these affine transforms, the whole stack is mathematically equivalent to

300
00:14:30.840 --> 00:14:34.039
<v Speaker 2>just one single affine transform. You haven't actually gained any

301
00:14:34.080 --> 00:14:37.200
<v Speaker 2>expressive power beyond a simple linear model, no matter how

302
00:14:37.279 --> 00:14:38.240
<v Speaker 2>many layers you add.
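
NOTE
A quick numerical check of the claim above, assuming two small dense layers with no activation function. The weights, biases, and input are arbitrary integers picked so the comparison is exact; the helper names are hypothetical.
```python
def matvec(W, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]
def matmul(A, B):
    """Matrix-matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]
def affine(W, b, x):
    """One dense layer without activation: dot(W, x) + b."""
    return [wx + bi for wx, bi in zip(matvec(W, x), b)]
W1, b1 = [[1, 2], [3, 4], [5, 6]], [1, 0, -1]  # maps 2 inputs to 3 units
W2, b2 = [[1, 0, 2], [0, 1, -1]], [4, 5]       # maps 3 units to 2 outputs
x = [7, -3]
stacked = affine(W2, b2, affine(W1, b1, x))
# Collapse both layers: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = [wb + bi for wb, bi in zip(matvec(W2, b1), b2)]
collapsed = affine(W, b, x)
print(stacked, collapsed)
```
The two results are identical for every input, not just this one, because W2(W1 x + b1) + b2 expands to (W2 W1) x + (W2 b1 + b2).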

303
00:14:38.360 --> 00:14:41.679
<v Speaker 1>Whoa. Okay, so stacking linear operations just gives you another

304
00:14:41.720 --> 00:14:43.600
<v Speaker 1>linear operation. That seems limiting.

305
00:14:43.679 --> 00:14:46.519
<v Speaker 2>It is. That's why we need activation functions. They introduce

306
00:14:46.679 --> 00:14:50.159
<v Speaker 2>nonlinearity into the network after the affine transform in

307
00:14:50.200 --> 00:14:50.720
<v Speaker 2>each layer.

308
00:14:50.879 --> 00:14:52.840
<v Speaker 1>Nonlinearity. Why is that crucial?

309
00:14:53.320 --> 00:14:57.440
<v Speaker 2>Because most real-world relationships are nonlinear. If your

310
00:14:57.480 --> 00:15:00.279
<v Speaker 2>network can only model linear functions, it's going to fail

311
00:15:00.360 --> 00:15:04.759
<v Speaker 2>on most interesting problems. Activation functions break that linearity.

312
00:15:04.279 --> 00:15:07.679
<v Speaker 1>And ReLU is a common one, rectified linear unit.

313
00:15:07.759 --> 00:15:12.240
<v Speaker 2>Very common and incredibly simple. It just computes max(x, 0).

314
00:15:12.559 --> 00:15:15.080
<v Speaker 2>So if the input x is positive, it passes it

315
00:15:15.120 --> 00:15:18.320
<v Speaker 2>through unchanged. If it's negative, it outputs zero.
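
NOTE
A tiny illustration of ReLU and of what the kink at zero makes possible. The two-unit network below is a hand-built example with fixed weights of +1 and -1, not something from the book; it computes the absolute value, a shape no single affine transform can produce.
```python
def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return max(x, 0.0)
def tiny_network(x):
    """Two hidden ReLU units (weights +1 and -1, biases 0), then a sum."""
    hidden = [relu(1.0 * x), relu(-1.0 * x)]
    return hidden[0] + hidden[1]
for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, relu(x), tiny_network(x))
```
Trained networks learn their weights instead of having them set by hand, but the principle is the same: many such kinks, combined across layers, can bend a function into complicated shapes.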

316
00:15:18.480 --> 00:15:20.960
<v Speaker 1>That's it. That little kink at zero is enough.

317
00:15:21.320 --> 00:15:25.440
<v Speaker 2>It seems simple. But stacking layers with these ReLU activations

318
00:15:25.639 --> 00:15:31.200
<v Speaker 2>allows the network to approximate arbitrarily complex nonlinear functions. It's

319
00:15:31.240 --> 00:15:33.120
<v Speaker 2>what gives deep networks their power.

320
00:15:33.240 --> 00:15:37.840
<v Speaker 1>Okay, ReLU simple function, massive impact because it adds nonlinearity.

321
00:15:38.399 --> 00:15:38.799
<v Speaker 2>Got it.

322
00:15:39.480 --> 00:15:43.480
<v Speaker 1>Now, these layers have weight matrices. You said they're initialized

323
00:15:43.559 --> 00:15:44.639
<v Speaker 1>randomly?

324
00:15:44.519 --> 00:15:47.080
<v Speaker 2>Yep, usually with small random values. If you started them all

325
00:15:47.120 --> 00:15:51.279
<v Speaker 2>at zero, they wouldn't learn properly. Randomness breaks the symmetry.

326
00:15:50.879 --> 00:15:53.240
<v Speaker 1>And the whole point of training is to adjust these

327
00:15:53.320 --> 00:15:54.480
<v Speaker 1>random weights.

328
00:15:54.120 --> 00:15:57.000
<v Speaker 2>Exactly, to adjust them based on the feedback signal, the loss,

329
00:15:57.360 --> 00:16:01.399
<v Speaker 2>so that the network's overall transformation from input to output performs

330
00:16:01.399 --> 00:16:05.039
<v Speaker 2>the task correctly. The learned weights encode the solution.

331
00:16:04.919 --> 00:16:08.519
<v Speaker 1>And that adjustment mechanism is gradient-based optimization. Let's break that

332
00:16:08.559 --> 00:16:08.960
<v Speaker 1>down.

333
00:16:09.000 --> 00:16:11.759
<v Speaker 2>Right. This is the engine driving the learning. The core idea

334
00:16:12.039 --> 00:16:14.279
<v Speaker 2>is to use the gradient of the loss function.

335
00:16:14.120 --> 00:16:15.960
<v Speaker 1>Its negative is the direction of steepest descent.

336
00:16:15.759 --> 00:16:18.200
<v Speaker 2>To figure out how to change the weights to decrease

337
00:16:18.240 --> 00:16:20.519
<v Speaker 2>the loss. We want to go downhill on that loss

338
00:16:20.600 --> 00:16:21.200
<v Speaker 2>landscape we talked about.

339
00:16:21.159 --> 00:16:23.840
<v Speaker 1>Okay, and how does it actually take the steps?

340
00:16:24.159 --> 00:16:29.320
<v Speaker 2>A common algorithm is stochastic gradient descent, or SGD. Stochastic

341
00:16:29.399 --> 00:16:31.960
<v Speaker 2>just means it uses small random batches of the training

342
00:16:32.039 --> 00:16:35.159
<v Speaker 2>data to estimate the gradient at each step, rather than

343
00:16:35.159 --> 00:16:36.919
<v Speaker 2>the whole data set, which would be very slow.

344
00:16:37.039 --> 00:16:39.559
<v Speaker 1>So it gets a noisy estimate of the downhill direction

345
00:16:39.600 --> 00:16:40.960
<v Speaker 1>from a small sample.

346
00:16:40.799 --> 00:16:43.919
<v Speaker 2>Exactly, and it takes a small step in that estimated

347
00:16:43.919 --> 00:16:47.039
<v Speaker 2>downhill direction updating the weights. The size of that step

348
00:16:47.120 --> 00:16:49.120
<v Speaker 2>is controlled by the learning rate.
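
NOTE
The mini-batch SGD loop described above can be sketched in a few lines of
plain Python. Everything here is an illustrative assumption: a one-weight
model y_hat = w * x, noise-free toy data for y = 3x, mean squared error as
the loss, batches of 2, and a learning rate of 0.1.
```python
import random
# Toy data for y = 3x with no noise; the model is y_hat = w * x.
xs = [i / 10 for i in range(10)]
ys = [3.0 * x for x in xs]
def batch_gradient(w, batch):
    """Gradient of mean squared error with respect to w on one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
rng = random.Random(0)   # fixed seed so the shuffling is repeatable
w = 0.0                  # starting weight
learning_rate = 0.1      # size of each step
data = list(zip(xs, ys))
for epoch in range(100):
    rng.shuffle(data)                      # new random batches every epoch
    for start in range(0, len(data), 2):   # mini-batches of 2 samples
        batch = data[start:start + 2]
        # Step against the gradient estimated from this small batch only.
        w -= learning_rate * batch_gradient(w, batch)
# w ends up very close to the true slope, 3.0.
```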

349
00:16:49.240 --> 00:16:50.279
<v Speaker 1>Ah, the learning rate. That sounds important.

350
00:16:50.320 --> 00:16:53.240
<v Speaker 2>It's critical. Too big and you might

351
00:16:53.279 --> 00:16:57.200
<v Speaker 2>overshoot the minimum or bounce around wildly. Too small and

352
00:16:57.240 --> 00:16:59.799
<v Speaker 2>training will take forever, or you might get stuck easily.

353
00:17:00.120 --> 00:17:01.600
<v Speaker 2>Finding a good learning rate is key.
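
NOTE
A toy illustration of the learning-rate trade-off just described, assuming
the simplest possible loss, f(w) = w**2, whose minimum is at w = 0. The
three learning rates are arbitrary picks for the demo, not recommendations.
```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w
too_big = descend(1.1)      # each step multiplies w by about -1.2, so it overshoots and grows
too_small = descend(0.001)  # each step multiplies w by about 0.998, so it barely moves
reasonable = descend(0.3)   # each step multiplies w by about 0.4, so it closes in on 0
```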

354
00:17:01.960 --> 00:17:05.000
<v Speaker 1>And the loss function itself, that's what defines the landscape

355
00:17:05.000 --> 00:17:07.240
<v Speaker 1>we're descending. It tells us how wrong we are.

356
00:17:07.440 --> 00:17:11.519
<v Speaker 2>Precisely, it quantifies the mismatch between the network's predictions and

357
00:17:11.559 --> 00:17:15.160
<v Speaker 2>the true target values. Different tasks need different loss functions,

358
00:17:15.519 --> 00:17:17.599
<v Speaker 2>but the goal is always to minimize it.

359
00:17:17.880 --> 00:17:22.160
<v Speaker 1>Now, this descent, can it get stuck? The book mentions

360
00:17:22.279 --> 00:17:24.119
<v Speaker 1>local versus global minima.

361
00:17:24.480 --> 00:17:28.440
<v Speaker 2>Yes, that's a potential issue. The loss landscape for deep

362
00:17:28.440 --> 00:17:32.680
<v Speaker 2>networks can be very complex, with many valleys. SGD might

363
00:17:32.720 --> 00:17:36.400
<v Speaker 2>find the bottom of a small nearby valley, a local minimum,

364
00:17:36.680 --> 00:17:39.880
<v Speaker 2>but miss a much deeper valley elsewhere, the global minimum.

365
00:17:39.920 --> 00:17:42.759
<v Speaker 1>So it finds a solution, but maybe not the best

366
00:17:42.799 --> 00:17:43.559
<v Speaker 1>possible one.

367
00:17:43.680 --> 00:17:48.200
<v Speaker 2>Potentially, yes, although in practice for very high dimensional problems

368
00:17:48.240 --> 00:17:52.000
<v Speaker 2>in deep learning, many local minima are often quite good anyway.

369
00:17:52.720 --> 00:17:55.559
<v Speaker 2>But techniques like momentum can help.

370
00:17:55.599 --> 00:17:56.279
<v Speaker 1>Momentum? How does that help?

371
00:17:56.519 --> 00:17:59.079
<v Speaker 2>Momentum adds a sort of inertia to the update step.

372
00:17:59.359 --> 00:18:01.759
<v Speaker 2>It considers the direction of previous steps, not just the

373
00:18:01.759 --> 00:18:04.839
<v Speaker 2>current gradient. This can help the optimizer roll through small

374
00:18:04.880 --> 00:18:08.480
<v Speaker 2>local minima or navigate flat regions more effectively.
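
NOTE
A minimal sketch of the inertia idea, assuming the classical momentum
update (velocity = momentum * velocity - learning_rate * gradient) on the
toy loss f(w) = w**2. Frameworks differ in the exact formulation, so treat
this as one common variant rather than any library's implementation.
```python
def gradient(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w
learning_rate = 0.1
momentum = 0.9
w_plain = 1.0                 # plain gradient descent
w_mom, velocity = 1.0, 0.0    # gradient descent with momentum
for _ in range(2):
    w_plain -= learning_rate * gradient(w_plain)
    # The velocity keeps a decaying memory of the previous steps.
    velocity = momentum * velocity - learning_rate * gradient(w_mom)
    w_mom += velocity
# After two steps w_plain is about 0.64 and w_mom is about 0.46:
# the remembered first step makes the second step longer.
```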

375
00:18:08.079 --> 00:18:09.799
<v Speaker 1>Like giving it a push to get over little bumps.

376
00:18:09.839 --> 00:18:13.920
<v Speaker 1>Yeah. Cool, okay. And you mentioned backpropagation earlier as

377
00:18:14.000 --> 00:18:16.039
<v Speaker 1>the way to calculate these gradients efficiently.

378
00:18:16.240 --> 00:18:20.359
<v Speaker 2>Yes. Backpropagation is the algorithm that makes training deep networks feasible.

379
00:18:20.720 --> 00:18:24.079
<v Speaker 2>It's a clever application of the chain rule from calculus.

380
00:18:23.640 --> 00:18:25.200
<v Speaker 1>Chain rule, right, for nested functions.

381
00:18:25.440 --> 00:18:29.079
<v Speaker 2>Exactly. A deep network is just a long chain of

382
00:18:29.119 --> 00:18:33.400
<v Speaker 2>nested functions, the layers. Backpropagation starts with the final loss

383
00:18:33.640 --> 00:18:36.799
<v Speaker 2>and works backward through the network, layer by layer. Why

384
00:18:36.839 --> 00:18:40.400
<v Speaker 2>backward? Because it efficiently calculates how much each weight in

385
00:18:40.440 --> 00:18:43.960
<v Speaker 2>the network contributed to the final error by reusing calculations

386
00:18:44.000 --> 00:18:46.720
<v Speaker 2>from later layers. It figures out the gradient of the

387
00:18:46.759 --> 00:18:50.200
<v Speaker 2>loss with respect to every single weight in the network.
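
NOTE
To make the backward pass concrete, here is a hand-written chain-rule
sketch for a tiny two-weight network. The network shape, the squared-error
loss, and the numbers are assumptions chosen for illustration. Note how
d_y, computed at the last layer, is reused when working back to w1.
```python
def forward(w1, w2, x, target):
    """Nested functions: h = relu(w1 * x), y = w2 * h, loss = (y - target)**2."""
    pre = w1 * x
    h = max(pre, 0.0)
    y = w2 * h
    loss = (y - target) ** 2
    return pre, h, y, loss
def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    pre, h, y, _ = forward(w1, w2, x, target)
    d_y = 2.0 * (y - target)                 # d loss / d y
    d_w2 = d_y * h                           # because y = w2 * h
    d_h = d_y * w2                           # reused for the earlier layer
    d_pre = d_h * (1.0 if pre > 0 else 0.0)  # ReLU passes gradient only where pre > 0
    d_w1 = d_pre * x                         # because pre = w1 * x
    return d_w1, d_w2
grads = backward(w1=0.5, w2=-1.5, x=2.0, target=1.0)
# Finite-difference check on w1, which recomputes the whole forward pass twice.
eps = 1e-6
loss_plus = forward(0.5 + eps, -1.5, 2.0, 1.0)[3]
loss_minus = forward(0.5 - eps, -1.5, 2.0, 1.0)[3]
numeric_d_w1 = (loss_plus - loss_minus) / (2 * eps)
```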

388
00:18:50.400 --> 00:18:54.079
<v Speaker 1>Wow, without having to recalculate everything from scratch for each

389
00:18:54.119 --> 00:18:55.200
<v Speaker 1>weight?

390
00:18:55.319 --> 00:19:00.240
<v Speaker 2>Precisely. It's computationally very efficient, and modern frameworks like TensorFlow

391
00:19:00.559 --> 00:19:02.880
<v Speaker 2>have automatic differentiation tools built in.

392
00:19:02.640 --> 00:19:05.039
<v Speaker 1>Like GradientTape in TensorFlow?

393
00:19:05.240 --> 00:19:07.880
<v Speaker 2>Exactly. You define your network's forward pass, how the data flows through,

394
00:19:07.920 --> 00:19:11.640
<v Speaker 2>and TensorFlow, using tools like GradientTape, automatically figures out

395
00:19:11.640 --> 00:19:14.720
<v Speaker 2>how to compute the gradients needed for backpropagation. It handles

396
00:19:14.720 --> 00:19:15.799
<v Speaker 2>all that calculus for you.

397
00:19:15.920 --> 00:19:19.559
<v Speaker 1>That's amazing, takes away a huge mathematical burden. Okay, so

398
00:19:19.599 --> 00:19:23.720
<v Speaker 1>we have tensors, operations, layers, activation functions, and this gradient

399
00:19:23.759 --> 00:19:27.160
<v Speaker 1>descent engine powered by backpropagation. Let's talk about Keras. The

400
00:19:27.160 --> 00:19:30.319
<v Speaker 1>book focuses on it heavily. What is Keras?

401
00:19:30.240 --> 00:19:34.119
<v Speaker 2>Keras is essentially a user-friendly interface, an API for

402
00:19:34.240 --> 00:19:37.279
<v Speaker 2>doing deep learning in Python. Its main goal is to

403
00:19:37.279 --> 00:19:40.200
<v Speaker 2>make building and experimenting with models fast and easy.

404
00:19:40.519 --> 00:19:42.200
<v Speaker 1>An interface on top of something else.

405
00:19:42.359 --> 00:19:46.000
<v Speaker 2>Yes, it runs on top of lower level tensor computation libraries.

406
00:19:46.200 --> 00:19:49.279
<v Speaker 2>TensorFlow is the primary one, especially since Keras was integrated

407
00:19:49.279 --> 00:19:52.319
<v Speaker 2>directly into TensorFlow 2. But it was designed to be

408
00:19:52.400 --> 00:19:54.000
<v Speaker 2>backend-agnostic.

409
00:19:53.599 --> 00:19:57.200
<v Speaker 1>So TensorFlow does the heavy lifting, the tensor math running

410
00:19:57.200 --> 00:20:00.640
<v Speaker 1>on GPUs or TPUs, and Keras provides a simpler way

411
00:20:00.680 --> 00:20:01.960
<v Speaker 1>to tell TensorFlow what to do.

412
00:20:02.160 --> 00:20:04.359
<v Speaker 2>That's a great way to put it. Keras abstracts away

413
00:20:04.400 --> 00:20:06.240
<v Speaker 2>a lot of the boilerplate code you'd need if you

414
00:20:06.240 --> 00:20:08.640
<v Speaker 2>were using raw TensorFlow. It lets you focus more on the

415
00:20:08.640 --> 00:20:11.119
<v Speaker 2>model architecture and the experiment design.

416
00:20:11.240 --> 00:20:15.079
<v Speaker 1>Makes sense. The book mentions TensorFlow concepts like tf.Tensor and

417
00:20:15.119 --> 00:20:17.160
<v Speaker 1>tf.Variable. How do they fit in?

418
00:20:17.240 --> 00:20:20.279
<v Speaker 2>Well, tf.Tensor is just TensorFlow's implementation of the tensors

419
00:20:20.319 --> 00:20:23.960
<v Speaker 2>we've been discussing, the multidimensional arrays holding data.

420
00:20:24.000 --> 00:20:26.279
<v Speaker 2>tf.Variable is a special kind of tensor used to hold

421
00:20:26.319 --> 00:20:30.559
<v Speaker 2>the model state, specifically the learnable parameters, the weights and biases.

422
00:20:30.920 --> 00:20:34.279
<v Speaker 1>Ah, so variables are the tensors that the optimizer's allowed

423
00:20:34.279 --> 00:20:36.000
<v Speaker 1>to change during training?

424
00:20:36.000 --> 00:20:41.079
<v Speaker 2>Exactly. Their values persist across training steps. TensorFlow also provides

425
00:20:41.119 --> 00:20:46.319
<v Speaker 2>all the tensor operations, like matrix multiplication, addition, activation functions, etc.,

426
00:20:46.880 --> 00:20:49.920
<v Speaker 2>that operate on these tensors and variables, often mimicking the

427
00:20:49.920 --> 00:20:53.200
<v Speaker 2>interface of NumPy, which is familiar to many Python users.

428
00:20:53.319 --> 00:20:57.640
<v Speaker 1>Okay, so how do you actually build a model using Keras?

429
00:20:58.279 --> 00:21:01.359
<v Speaker 1>The book mentions a few ways. The Sequential API?

430
00:21:01.440 --> 00:21:04.000
<v Speaker 2>Right. The Sequential API is the simplest way. It's literally for

431
00:21:04.079 --> 00:21:07.279
<v Speaker 2>building a model layer by layer, in a linear stack:

432
00:21:07.759 --> 00:21:10.960
<v Speaker 2>the output of one layer feeds directly into the next. Super

433
00:21:10.960 --> 00:21:14.160
<v Speaker 2>straightforward for many common network types, like building a single

434
00:21:14.200 --> 00:21:15.759
<v Speaker 2>tower of Legos.

435
00:21:15.519 --> 00:21:18.359
<v Speaker 1>Simple, but maybe limited if you want something more complex.

436
00:21:18.240 --> 00:21:22.440
<v Speaker 2>Exactly. For more complex architectures, you'd use the Functional API.

437
00:21:22.920 --> 00:21:25.039
<v Speaker 2>This lets you build models that are like graphs of

438
00:21:25.119 --> 00:21:28.359
<v Speaker 2>layers rather than just a straight line. Graphs, meaning? Meaning

439
00:21:28.400 --> 00:21:33.039
<v Speaker 2>you can have multiple inputs, multiple outputs, layers that share connections, branches, merges.

440
00:21:33.559 --> 00:21:36.640
<v Speaker 2>Much more flexible. If your model isn't just a simple stack,

441
00:21:36.839 --> 00:21:38.720
<v Speaker 2>the Functional API is usually the way to go.

442
00:21:38.680 --> 00:21:42.519
<v Speaker 1>Okay, more powerful. And the third way, model subclassing?

443
00:21:42.680 --> 00:21:45.839
<v Speaker 2>Model subclassing is the most flexible approach. You define your

444
00:21:45.920 --> 00:21:49.480
<v Speaker 2>model as a Python class inheriting from keras.Model. You

445
00:21:49.519 --> 00:21:52.640
<v Speaker 2>define the layers in the __init__ method, and then, crucially,

446
00:21:52.880 --> 00:21:55.759
<v Speaker 2>you define the forward pass, how data flows through the

447
00:21:55.839 --> 00:21:58.160
<v Speaker 2>layers, yourself, in a method called call.

448
00:21:58.640 --> 00:22:01.000
<v Speaker 1>So you have complete control over the computation.

449
00:22:01.240 --> 00:22:05.759
<v Speaker 2>Total control. Great for research or really non standard architectures.

450
00:22:06.440 --> 00:22:08.640
<v Speaker 2>The trade off is that you lose some of the

451
00:22:08.680 --> 00:22:11.960
<v Speaker 2>automatic features of the other APIs, like easy model plotting

452
00:22:12.039 --> 00:22:12.839
<v Speaker 2>or serialization.

453
00:22:13.160 --> 00:22:16.440
<v Speaker 1>You have a bit more responsibility. Right, more power, more work.

454
00:22:16.720 --> 00:22:19.599
<v Speaker 1>Makes sense. Okay, so you've built your model using one

455
00:22:19.640 --> 00:22:23.039
<v Speaker 1>of these APIs. What's the standard workflow for training and

456
00:22:23.160 --> 00:22:26.880
<v Speaker 1>using it? Compile, fit, evaluate, predict?

457
00:22:26.960 --> 00:22:29.440
<v Speaker 2>That's the core loop, yeah. First you compile the model. This step configures the

458
00:22:29.519 --> 00:22:32.559
<v Speaker 2>learning process. You tell Keras which optimizer to use, like

459
00:22:32.599 --> 00:22:34.039
<v Speaker 2>Adam or RMSprop.

460
00:22:34.000 --> 00:22:35.519
<v Speaker 1>The algorithm for gradient descent.

461
00:22:35.759 --> 00:22:38.640
<v Speaker 2>Right. You specify the loss function you want to minimize,

462
00:22:38.720 --> 00:22:42.079
<v Speaker 2>like categorical cross-entropy for multiclass classification or MSE

463
00:22:42.279 --> 00:22:45.440
<v Speaker 2>for regression. And you list any metrics you want to track

464
00:22:45.559 --> 00:22:47.039
<v Speaker 2>during training, like accuracy.

465
00:22:47.200 --> 00:22:49.039
<v Speaker 1>Okay, setting the rules for learning. Then fit?

466
00:22:49.440 --> 00:22:53.039
<v Speaker 2>Yep, fit is where the actual training happens.

467
00:22:53.200 --> 00:22:55.799
<v Speaker 2>You give it your training data, inputs and target outputs.

468
00:22:56.000 --> 00:22:57.880
<v Speaker 2>You tell it how many epochs to train for and

469
00:22:58.000 --> 00:23:01.160
<v Speaker 2>usually the batch size, how many samples to process before

470
00:23:01.240 --> 00:23:02.400
<v Speaker 2>updating the weights.

471
00:23:02.200 --> 00:23:06.319
<v Speaker 1>And Keras handles the loop, the backpropagation, updating weights?

472
00:23:06.200 --> 00:23:09.440
<v Speaker 2>All of it. You can also pass validation data to fit,

473
00:23:09.759 --> 00:23:12.319
<v Speaker 2>so Keras will evaluate the model on data it hasn't

474
00:23:12.359 --> 00:23:17.200
<v Speaker 2>trained on after each epoch. That's crucial for monitoring progress.

475
00:23:17.440 --> 00:23:19.640
<v Speaker 1>Monitoring, yeah. So after fit finishes, what's next?

476
00:23:19.359 --> 00:23:22.359
<v Speaker 2>You use evaluate. You give it a separate

477
00:23:22.400 --> 00:23:24.799
<v Speaker 2>test data set, data the model has never seen during

478
00:23:24.839 --> 00:23:29.079
<v Speaker 2>training or validation tuning. Evaluate returns the final loss and

479
00:23:29.160 --> 00:23:32.960
<v Speaker 2>metric values like accuracy on this test set. This gives

480
00:23:32.960 --> 00:23:34.920
<v Speaker 2>you the best estimate of how well your model will

481
00:23:34.960 --> 00:23:39.319
<v Speaker 2>generalize to new real-world data. The final report card. Exactly.

482
00:23:39.839 --> 00:23:43.200
<v Speaker 2>And then finally predict. You give predict new input data

483
00:23:43.440 --> 00:23:47.000
<v Speaker 2>without labels, and the trained model gives you its predictions.

484
00:23:47.200 --> 00:23:48.759
<v Speaker 2>This is how you actually use the model.

485
00:23:48.839 --> 00:23:54.640
<v Speaker 1>Compile, fit, evaluate, predict. Got the flow. You mentioned monitoring

486
00:23:54.680 --> 00:23:58.079
<v Speaker 1>progress during fit using validation data. The book also talks

487
00:23:58.119 --> 00:23:59.640
<v Speaker 1>about callbacks and TensorBoard.

488
00:23:59.759 --> 00:24:03.160
<v Speaker 2>Yes, these are super useful. Validation data, as we said,

489
00:24:03.240 --> 00:24:05.960
<v Speaker 2>lets you see if your model's starting to overfit, doing

490
00:24:06.000 --> 00:24:10.119
<v Speaker 2>better on training data but worse on unseen data. Callbacks

491
00:24:10.160 --> 00:24:13.000
<v Speaker 2>are objects you can pass to fit that perform actions

492
00:24:13.160 --> 00:24:15.920
<v Speaker 2>at certain points during training. Like what kind of actions?

493
00:24:16.279 --> 00:24:19.680
<v Speaker 2>Things like early stopping. This callback monitors a metric on

494
00:24:19.680 --> 00:24:22.799
<v Speaker 2>the validation set, maybe validation loss, and if it stops

495
00:24:22.799 --> 00:24:25.920
<v Speaker 2>improving for a certain number of epochs, it automatically stops

496
00:24:25.920 --> 00:24:26.359
<v Speaker 2>the training.
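
NOTE
The patience logic behind early stopping can be sketched in plain Python.
The function early_stopping_epoch and the loss values are hypothetical and
only mimic the idea; this is not the Keras EarlyStopping implementation.
```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # improvement: remember it
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch        # ran to the end without stopping
# Made-up validation losses: improving until epoch 2, then stalling.
history = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
# Training would stop at epoch 4, and the best weights came from epoch 2.
```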

497
00:24:26.359 --> 00:24:29.559
<v Speaker 1>Now that's smart. Prevents wasting time and overfitting.

498
00:24:29.279 --> 00:24:32.200
<v Speaker 2>Exactly. Or ModelCheckpoint, which saves your model's weights whenever the

499
00:24:32.240 --> 00:24:35.039
<v Speaker 2>validation performance improves, so you always keep the best version.

500
00:24:35.559 --> 00:24:37.920
<v Speaker 2>And TensorBoard is a visualization toolkit.

501
00:24:38.039 --> 00:24:39.519
<v Speaker 1>What does TensorBoard show you?

502
00:24:39.319 --> 00:24:43.000
<v Speaker 2>You can log metrics during training, loss, accuracy, etc.,

503
00:24:43.799 --> 00:24:45.720
<v Speaker 2>and view plots of them in your web browser in

504
00:24:45.759 --> 00:24:49.240
<v Speaker 2>real time. You can visualize the model graph, examine histograms

505
00:24:49.240 --> 00:24:52.359
<v Speaker 2>of weights and activations. It gives you much deeper insight

506
00:24:52.400 --> 00:24:53.920
<v Speaker 2>into what's happening during training.

507
00:24:54.160 --> 00:24:57.920
<v Speaker 1>Sounds invaluable for debugging and understanding. Okay. The book also

508
00:24:57.960 --> 00:25:02.920
<v Speaker 1>distinguishes between regression and classification tasks. How do they differ

509
00:25:02.960 --> 00:25:04.519
<v Speaker 1>in terms of loss and metrics?

510
00:25:04.640 --> 00:25:07.160
<v Speaker 2>Right. The goal is different. Classification is about predicting a

511
00:25:07.200 --> 00:25:11.319
<v Speaker 2>category label: cat or dog, spam or not spam. Regression is about predicting

512
00:25:11.319 --> 00:25:14.000
<v Speaker 2>a continuous number: price, temperature, age.

513
00:25:13.559 --> 00:25:16.039
<v Speaker 1>So the way you measure success has to be

514
00:25:16.079 --> 00:25:17.039
<v Speaker 1>different.

515
00:25:17.400 --> 00:25:20.599
<v Speaker 2>Exactly. For classification, you often use loss functions like categorical

516
00:25:20.680 --> 00:25:24.039
<v Speaker 2>cross-entropy or binary cross-entropy, and you measure performance with

517
00:25:24.119 --> 00:25:26.880
<v Speaker 2>metrics like accuracy, what fraction did it get right,

518
00:25:27.000 --> 00:25:27.119
<v Speaker 1>Right.

519
00:25:27.279 --> 00:25:28.519
<v Speaker 2>precision, recall.

520
00:25:28.640 --> 00:25:29.000
<v Speaker 1>Okay.

521
00:25:29.160 --> 00:25:33.200
<v Speaker 2>For regression, common loss functions are mean squared error, MSE,

522
00:25:33.839 --> 00:25:37.799
<v Speaker 2>or mean absolute error, MAE. These measure how far off the

523
00:25:37.880 --> 00:25:41.200
<v Speaker 2>numerical predictions are on average. So your metrics are also

524
00:25:41.200 --> 00:25:44.920
<v Speaker 2>things like MAE or RMSE, root mean squared error. Accuracy

525
00:25:44.920 --> 00:25:46.480
<v Speaker 2>doesn't really make sense for regression.
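
NOTE
The losses and metrics named above are short formulas. Here is a
plain-Python sketch with made-up targets and predictions; the function
names are chosen for this example and are not a library API.
```python
import math
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
def binary_crossentropy(y_true, p_pred):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, p_pred))
    return -total / len(y_true)
def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    return sum((p >= threshold) == bool(t) for t, p in zip(y_true, p_pred)) / len(y_true)
# Regression: how far off are the numbers?
prices_true, prices_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
reg_mse = mse(prices_true, prices_pred)     # 4 / 3
reg_rmse = math.sqrt(reg_mse)
reg_mae = mae(prices_true, prices_pred)     # 2 / 3
# Classification: how good are the predicted probabilities and labels?
labels, probs = [1, 0, 1, 1], [0.9, 0.2, 0.4, 0.8]
clf_loss = binary_crossentropy(labels, probs)
clf_acc = accuracy(labels, probs)           # 3 of 4 correct
```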

526
00:25:46.720 --> 00:25:50.079
<v Speaker 1>Got it. Different targets, different ways to measure how close

527
00:25:50.119 --> 00:25:53.759
<v Speaker 1>you are. Okay, let's shift to some really key concepts

528
00:25:53.839 --> 00:25:58.039
<v Speaker 1>in developing models: generalization, overfitting, underfitting. These seem critical.

529
00:25:58.200 --> 00:26:02.160
<v Speaker 2>Absolutely fundamental. Generalization is the whole point, really.

530
00:26:02.480 --> 00:26:07.440
<v Speaker 2>It's the model's ability to perform well on new, unseen data,

531
00:26:07.799 --> 00:26:09.240
<v Speaker 2>not just the data it was trained on.

532
00:26:09.400 --> 00:26:11.400
<v Speaker 1>You want it to work in the real world.

533
00:26:11.880 --> 00:26:14.839
<v Speaker 2>Exactly. Overfitting is when the model learns the training data too well.

534
00:26:15.240 --> 00:26:18.680
<v Speaker 2>It memorizes the noise and specific quirks of the training set,

535
00:26:18.759 --> 00:26:21.839
<v Speaker 2>but it fails to generalize to new data. It performs

536
00:26:21.880 --> 00:26:24.200
<v Speaker 2>great on training data, poorly on test data.

537
00:26:24.319 --> 00:26:28.640
<v Speaker 1>Okay, memorizing instead of learning the underlying pattern. And underfitting?

538
00:26:28.839 --> 00:26:31.839
<v Speaker 2>Underfitting is the opposite problem. The model is too simple.

539
00:26:32.039 --> 00:26:34.640
<v Speaker 2>It can't even capture the underlying patterns in the training data,

540
00:26:34.839 --> 00:26:38.319
<v Speaker 2>let alone generalize. It performs poorly on both training and

541
00:26:38.440 --> 00:26:39.720
<v Speaker 2>test data.

542
00:26:39.480 --> 00:26:42.079
<v Speaker 1>So we need to find that sweet spot. Complex enough

543
00:26:42.079 --> 00:26:45.039
<v Speaker 1>to learn, but not so complex it just memorizes. The

544
00:26:45.079 --> 00:26:48.599
<v Speaker 1>book calls this the tension between optimization and generalization.

545
00:26:49.039 --> 00:26:52.759
<v Speaker 2>Right, because as you train your model longer, optimizing it

546
00:26:52.799 --> 00:26:55.559
<v Speaker 2>on the training data, its performance on the training data

547
00:26:55.640 --> 00:26:59.480
<v Speaker 2>keeps getting better, but its performance on unseen validation data

548
00:26:59.559 --> 00:27:02.359
<v Speaker 2>will improve for a while, then peak, and then start

549
00:27:02.400 --> 00:27:04.000
<v Speaker 2>to get worse as overfitting kicks in.

550
00:27:04.279 --> 00:27:07.839
<v Speaker 1>Ah, So optimizing too much hurts generalization.

551
00:27:07.640 --> 00:27:11.519
<v Speaker 2>Beyond a certain point. Yes, all the techniques for building

552
00:27:11.519 --> 00:27:14.920
<v Speaker 2>good models are about managing this tension, finding that peak

553
00:27:15.039 --> 00:27:16.880
<v Speaker 2>generalization performance.

554
00:27:16.440 --> 00:27:20.119
<v Speaker 1>And how do we reliably measure generalization. The book mentions

555
00:27:20.160 --> 00:27:21.799
<v Speaker 1>different evaluation protocols: hold-out, k-fold.

556
00:27:22.079 --> 00:27:24.880
<v Speaker 2>Yeah, because just looking at performance

557
00:27:24.880 --> 00:27:28.559
<v Speaker 2>on the training set is misleading. The simplest is hold-out

558
00:27:28.559 --> 00:27:32.519
<v Speaker 2>validation. You split your data: training set, validation set, test set.

559
00:27:32.160 --> 00:27:35.119
<v Speaker 1>Train on training, tune on validation, final check on test.

560
00:27:35.200 --> 00:27:38.400
<v Speaker 2>Precisely, you use the validation set during development to make

561
00:27:38.440 --> 00:27:41.000
<v Speaker 2>decisions like when to stop training or how many layers

562
00:27:41.000 --> 00:27:43.720
<v Speaker 2>to use. The test set is kept completely separate until

563
00:27:43.720 --> 00:27:46.920
<v Speaker 2>the very end for one final unbiased evaluation.

564
00:27:47.039 --> 00:27:49.759
<v Speaker 1>What if you don't have much data?

565
00:27:49.400 --> 00:27:52.880
<v Speaker 2>Then k-fold cross-validation is better. You split the data, minus the

566
00:27:52.920 --> 00:27:56.240
<v Speaker 2>test set, into k chunks or folds. Then you train

567
00:27:56.319 --> 00:27:59.759
<v Speaker 2>k models. Each model uses k minus one folds for training

568
00:28:00.039 --> 00:28:01.880
<v Speaker 2>and one fold for validation.

569
00:28:01.720 --> 00:28:04.759
<v Speaker 1>So every data point gets used for validation exactly once.

570
00:28:05.000 --> 00:28:08.119
<v Speaker 2>Right. Then you average the validation scores from the k runs.

571
00:28:08.359 --> 00:28:11.200
<v Speaker 2>It gives a more robust estimate, especially with small data sets.

572
00:28:11.440 --> 00:28:14.559
<v Speaker 2>Iterated k-fold just repeats this whole process multiple times with

573
00:28:14.599 --> 00:28:18.400
<v Speaker 2>different shuffles for even more stability. But always always keep

574
00:28:18.400 --> 00:28:21.160
<v Speaker 2>that final test set pristine until the very end.
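
NOTE
The k-fold split described above, sketched as a plain-Python index
generator. k_fold_indices is a hypothetical helper written for this
example. In practice you would usually shuffle the data first, train a
fresh model on each train split, and average the k validation scores.
```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) pairs, one per fold."""
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder so every sample is covered.
        end = num_samples if fold == k - 1 else start + fold_size
        val = list(range(start, end))
        train = list(range(0, start)) + list(range(end, num_samples))
        yield train, val
folds = list(k_fold_indices(num_samples=10, k=3))
# Each sample index appears in exactly one validation fold.
all_val = sorted(i for _, val in folds for i in val)
```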

575
00:28:21.240 --> 00:28:24.640
<v Speaker 1>Got it. Validation guides development, test gives the final score.

576
00:28:25.319 --> 00:28:28.880
<v Speaker 1>What about data preprocessing, things like scaling features? Why do that?

577
00:28:29.319 --> 00:28:31.759
<v Speaker 2>Neural networks can be quite sensitive to the scale of

578
00:28:31.759 --> 00:28:34.599
<v Speaker 2>input features. If one feature ranges from zero to one

579
00:28:34.640 --> 00:28:37.519
<v Speaker 2>and another from zero to one million, the network might struggle.

580
00:28:37.920 --> 00:28:40.960
<v Speaker 2>The larger valued feature could dominate the learning process or

581
00:28:41.000 --> 00:28:42.200
<v Speaker 2>cause numerical instability.

582
00:28:41.759 --> 00:28:44.160
<v Speaker 1>So you need to put them on a similar scale.

583
00:28:44.240 --> 00:28:49.400
<v Speaker 2>Yeah, it generally helps training significantly. Common techniques are normalization,

584
00:28:49.599 --> 00:28:53.559
<v Speaker 2>scaling to be between zero and one, or standardization, scaling to

585
00:28:53.559 --> 00:28:58.240
<v Speaker 2>have zero mean and unit variance. You typically scale each feature independently.
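
NOTE
Standardization as just described, sketched with the standard library
only. The feature values are made up, and standardize_columns is a name
chosen for this example. A common practice, assumed here rather than
stated in the conversation, is to compute the mean and standard deviation
on the training data only and reuse them for validation and test data.
```python
from statistics import mean, pstdev
def standardize_columns(rows):
    """Scale each feature (column) to zero mean and unit variance, independently."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # assumes no column is constant
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
# Two features on very different scales (made-up values).
raw = [[0.1, 100000.0], [0.5, 500000.0], [0.9, 900000.0]]
scaled = standardize_columns(raw)
scaled_columns = list(zip(*scaled))
```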

586
00:28:58.319 --> 00:29:01.640
<v Speaker 1>Makes sense. And model capacity, that's about how complex the

587
00:29:01.680 --> 00:29:04.519
<v Speaker 1>model is, number of layers, units?

588
00:29:04.799 --> 00:29:07.559
<v Speaker 2>Exactly. It's roughly how much information the model can store, how

589
00:29:07.559 --> 00:29:11.440
<v Speaker 2>complex a function it can learn. Too little capacity leads

590
00:29:11.440 --> 00:29:15.200
<v Speaker 2>to underfitting. Too much capacity makes overfitting easier.

591
00:29:15.599 --> 00:29:18.599
<v Speaker 1>So finding the right capacity for your specific problem and

592
00:29:18.720 --> 00:29:19.799
<v Speaker 1>data is crucial.

593
00:29:20.000 --> 00:29:23.640
<v Speaker 2>It's a key part of model development, yeah. It often involves experimentation.

594
00:29:23.599 --> 00:29:26.039
<v Speaker 1>And if your model has too much capacity and starts overfitting,

595
00:29:26.079 --> 00:29:27.839
<v Speaker 1>you use regularization techniques?

596
00:29:28.000 --> 00:29:32.039
<v Speaker 2>Yes. Regularization methods are designed specifically to combat overfitting and

597
00:29:32.160 --> 00:29:36.119
<v Speaker 2>encourage better generalization. They work by constraining the complexity of

598
00:29:36.119 --> 00:29:37.119
<v Speaker 2>the model during training.

599
00:29:37.319 --> 00:29:38.519
<v Speaker 1>How? What are some examples?

600
00:29:38.839 --> 00:29:41.440
<v Speaker 2>Well, one simple form is just reducing the model size:

601
00:29:41.480 --> 00:29:45.480
<v Speaker 2>fewer layers or fewer units, neurons, per layer. Another very

602
00:29:45.480 --> 00:29:49.319
<v Speaker 2>common one is dropout. During training, dropout randomly sets the

603
00:29:49.319 --> 00:29:51.680
<v Speaker 2>output of a fraction of neurons in a layer to

604
00:29:51.799 --> 00:29:55.279
<v Speaker 2>zero for each training example. This forces the network to

605
00:29:55.319 --> 00:29:58.759
<v Speaker 2>learn more robust representations that don't rely too heavily on

606
00:29:58.799 --> 00:29:59.960
<v Speaker 2>any single neuron.
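
NOTE
A plain-Python sketch of dropout on one layer's outputs. Scaling the
surviving values by 1 / (1 - rate) during training (so-called inverted
dropout) is an assumption of this sketch; the conversation only mentions
the zeroing. The function name and the rate of 0.5 are illustrative.
```python
import random
def dropout(activations, rate, training, rng):
    """Zero a random fraction `rate` of activations, during training only."""
    if not training:
        return list(activations)          # inference: leave everything untouched
    keep = 1.0 - rate
    # Scale the survivors up so the expected total stays the same.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]
rng = random.Random(0)                    # fixed seed for repeatability
layer_output = [1.0] * 8
train_out = dropout(layer_output, rate=0.5, training=True, rng=rng)
test_out = dropout(layer_output, rate=0.5, training=False, rng=rng)
# Every training value is either dropped (0.0) or kept and rescaled (2.0).
```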

607
00:30:00.519 --> 00:30:03.680
<v Speaker 1>Like forcing it to have redundant pathways?

608
00:30:04.039 --> 00:30:07.720
<v Speaker 2>Kind of, yeah. Another technique is weight regularization, like L1 or L2.

609
00:30:08.359 --> 00:30:11.079
<v Speaker 2>This adds a penalty to the loss function based on

610
00:30:11.119 --> 00:30:13.880
<v Speaker 2>the size of the model's weights. It encourages the model

611
00:30:13.920 --> 00:30:18.039
<v Speaker 2>to learn smaller, simpler weight configurations, which often generalize better.
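
NOTE
The weight penalty described above amounts to one extra term added to the
loss. A sketch with made-up weights, a made-up data loss of 0.30, and an
arbitrary regularization strength of 0.01:
```python
def l1_penalty(weights, strength):
    """L1: proportional to the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)
def l2_penalty(weights, strength):
    """L2: proportional to the sum of squared weight values."""
    return strength * sum(w * w for w in weights)
weights = [1.0, -2.0]
data_loss = 0.30   # made-up loss on a training batch
total_l1 = data_loss + l1_penalty(weights, strength=0.01)   # about 0.33
total_l2 = data_loss + l2_penalty(weights, strength=0.01)   # about 0.35
# Larger weights raise the total loss, so the optimizer is nudged toward smaller ones.
```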

612
00:30:18.319 --> 00:30:21.599
<v Speaker 1>Okay, so a whole toolkit to fight overfitting. Now, the

613
00:30:21.599 --> 00:30:24.920
<v Speaker 1>book also touches on ethical considerations. What's the main point there?

614
00:30:25.359 --> 00:30:28.799
<v Speaker 2>It's a really important reminder that technology isn't neutral. The

615
00:30:28.920 --> 00:30:32.200
<v Speaker 2>choices we make when designing and deploying AI systems, what

616
00:30:32.319 --> 00:30:35.240
<v Speaker 2>data we use, what objective we optimize for, how we

617
00:30:35.319 --> 00:30:37.839
<v Speaker 2>test it, can have real-world ethical consequences.

618
00:30:37.440 --> 00:30:41.319
<v Speaker 1>Like biases in the training data leading to unfair outcomes?

619
00:30:41.400 --> 00:30:44.519
<v Speaker 2>That's a major one. If your data reflects historical biases,

620
00:30:44.559 --> 00:30:47.279
<v Speaker 2>your model will likely learn and perpetuate them. We need

621
00:30:47.279 --> 00:30:49.799
<v Speaker 2>to be aware of potential harms and actively work to

622
00:30:49.839 --> 00:30:53.599
<v Speaker 2>mitigate them. Technical choices have moral dimensions.

623
00:30:53.640 --> 00:30:57.559
<v Speaker 1>A crucial point. Okay, let's walk through the overall machine learning workflow the

624
00:30:57.559 --> 00:30:59.680
<v Speaker 1>book outlines. What are the big stages?

625
00:31:00.039 --> 00:31:04.440
<v Speaker 2>It starts crucially with defining the task, really understanding the

626
00:31:04.440 --> 00:31:07.279
<v Speaker 2>problem you're trying to solve before you jump into code.

627
00:31:07.480 --> 00:31:09.839
<v Speaker 1>Understanding the context, the user, the goal.

628
00:31:09.759 --> 00:31:13.079
<v Speaker 2>Exactly. What's the value? How will the model actually

629
00:31:13.119 --> 00:31:15.559
<v Speaker 2>be used? What data do you have or can you get?

630
00:31:16.240 --> 00:31:19.720
<v Speaker 2>And then framing that business problem as a specific ML task.

631
00:31:20.480 --> 00:31:24.240
<v Speaker 2>Is it classification, regression, something else?

632
00:31:24.440 --> 00:31:27.720
<v Speaker 1>So problem definition first, then then.

633
00:31:27.799 --> 00:31:29.920
<v Speaker 2>Collect a data set. This is often the hardest, most

634
00:31:29.960 --> 00:31:33.319
<v Speaker 2>expensive part. You need inputs, you need corresponding targets, labels.

635
00:31:33.559 --> 00:31:36.920
<v Speaker 2>Data quality and availability are often the bottlenecks. It sometimes involves

636
00:31:36.960 --> 00:31:38.039
<v Speaker 2>lots of manual labeling.

637
00:31:38.200 --> 00:31:41.559
<v Speaker 1>Right, garbage in, garbage out applies strongly here. Absolutely.

638
00:31:41.759 --> 00:31:44.799
<v Speaker 2>Step three is develop a model. This involves choosing a

639
00:31:44.839 --> 00:31:46.599
<v Speaker 2>suitable architecture.

640
00:31:46.079 --> 00:31:49.400
<v Speaker 1>Like convnets for images, transformers for text. Right, and then

641
00:31:49.319 --> 00:31:52.880
<v Speaker 2>the initial goal is often, counterintuitively, to build a model

642
00:31:52.920 --> 00:31:56.079
<v Speaker 2>that's powerful enough to overfit the training data first. Why

643
00:31:56.160 --> 00:31:59.559
<v Speaker 2>overfit first? Because if you can't even overfit the training data,

644
00:32:00.079 --> 00:32:02.960
<v Speaker 2>it means your model doesn't have enough capacity or something

645
00:32:02.960 --> 00:32:06.359
<v Speaker 2>else is fundamentally wrong. Overfitting proves your model can learn

646
00:32:06.400 --> 00:32:09.119
<v Speaker 2>the training patterns. You need to reach that point before

647
00:32:09.160 --> 00:32:12.599
<v Speaker 2>you can start regularizing, and you monitor training and validation

648
00:32:12.720 --> 00:32:13.680
<v Speaker 2>metrics constantly.

649
00:32:13.759 --> 00:32:17.559
<v Speaker 1>Okay, achieve overfitting, then pull back. So step four is

650
00:32:17.880 --> 00:32:20.400
<v Speaker 1>regularize and tune. Exactly.

651
00:32:20.799 --> 00:32:24.839
<v Speaker 2>Now you focus on generalization. You adjust hyperparameters like

652
00:32:24.960 --> 00:32:29.319
<v Speaker 2>learning rate, network size, regularization strength, apply techniques like dropout,

653
00:32:29.559 --> 00:32:32.640
<v Speaker 2>all guided by the performance on your validation set. The

654
00:32:32.759 --> 00:32:35.880
<v Speaker 2>goal is to find the settings that give the best validation

655
00:32:35.559 --> 00:32:39.519
<v Speaker 1>performance. Maximize generalization. And the final step?

656
00:32:39.599 --> 00:32:42.599
<v Speaker 2>Deploy the model. Get it out into the real world.

657
00:32:42.960 --> 00:32:45.960
<v Speaker 2>This involves exporting it, maybe to a non-Python format,

658
00:32:46.200 --> 00:32:49.680
<v Speaker 2>integrating it into your production system, monitoring its performance live

659
00:32:49.960 --> 00:32:53.000
<v Speaker 2>and crucially collecting data on how it's doing to feed

660
00:32:53.000 --> 00:32:54.720
<v Speaker 2>into training the next version of the model.

661
00:32:55.200 --> 00:32:59.319
<v Speaker 1>So it's a cycle, really: define, collect, develop, regularize, deploy,

662
00:32:59.559 --> 00:33:00.720
<v Speaker 1>monitor and repeat.

663
00:33:01.039 --> 00:33:02.720
<v Speaker 2>It's very much an iterative process.

664
00:33:02.799 --> 00:33:06.160
<v Speaker 1>Yes. Okay, that workflow makes a lot of sense. So

665
00:33:06.400 --> 00:33:08.880
<v Speaker 1>wrapping up our deep dive today, we covered a lot

666
00:33:08.920 --> 00:33:12.480
<v Speaker 1>of ground. We did. The core idea: deep learning uses

667
00:33:12.559 --> 00:33:16.400
<v Speaker 1>multi layered neural networks to learn representations from data. It's

668
00:33:16.400 --> 00:33:20.240
<v Speaker 1>built on tensors, tensor operations, layers, and powered by

669
00:33:20.279 --> 00:33:21.920
<v Speaker 1>gradient descent and backpropagation.

670
00:33:22.039 --> 00:33:24.559
<v Speaker 2>The tools like Keras and TensorFlow make it much more

671
00:33:24.559 --> 00:33:27.559
<v Speaker 2>accessible to build and train these complex models.

672
00:33:27.640 --> 00:33:34.680
<v Speaker 1>Plus, understanding those key concepts, generalization, overfitting, the evaluation protocols,

673
00:33:34.920 --> 00:33:38.319
<v Speaker 1>the whole workflow is crucial for actually using it effectively

674
00:33:38.400 --> 00:33:39.160
<v Speaker 1>and responsibly.

675
00:33:39.359 --> 00:33:41.799
<v Speaker 2>Absolutely, it's not just about the algorithms, but the whole

676
00:33:41.799 --> 00:33:42.720
<v Speaker 2>process around them.

677
00:33:42.799 --> 00:33:44.920
<v Speaker 1>So a final thought for you, our listener, to chew

678
00:33:44.960 --> 00:33:48.480
<v Speaker 1>on: the book mentions this interesting trade-off, maybe losing

679
00:33:48.519 --> 00:33:52.640
<v Speaker 1>some cultural diversity for more intellectual or technical diversity as

680
00:33:52.680 --> 00:33:54.720
<v Speaker 1>societies become more globally connected.

681
00:33:55.039 --> 00:33:56.920
<v Speaker 2>Yeah, that was an intriguing point.

682
00:33:56.680 --> 00:34:00.200
<v Speaker 1>As AI and deep learning become even more pervasive, how

683
00:34:00.279 --> 00:34:03.839
<v Speaker 1>might they influence that balance? Could they create new kinds

684
00:34:03.880 --> 00:34:08.119
<v Speaker 1>of digital diversity, or maybe new forms of homogeneity. It's

685
00:34:08.159 --> 00:34:10.800
<v Speaker 1>something to think about: how this technology shapes not just

686
00:34:10.800 --> 00:34:13.280
<v Speaker 1>what we can do, but maybe even how we think

687
00:34:13.400 --> 00:34:14.119
<v Speaker 1>and interact.

688
00:34:14.559 --> 00:34:17.239
<v Speaker 2>Definitely food for thought. The impact goes way beyond just

689
00:34:17.280 --> 00:34:18.039
<v Speaker 2>the tech itself.

690
00:34:18.119 --> 00:34:20.719
<v Speaker 1>Indeed. Well, thanks for joining us on this deep dive.

691
00:34:20.840 --> 00:34:21.400
<v Speaker 2>My pleasure.
