WEBVTT

1
00:00:00.080 --> 00:00:02.960
<v Speaker 1>The world of AI and machine learning is just exploding,

2
00:00:03.040 --> 00:00:04.879
<v Speaker 1>isn't it? And if you're a coder looking in, you

3
00:00:04.919 --> 00:00:06.879
<v Speaker 1>might be thinking, Okay, how do I actually get started

4
00:00:06.919 --> 00:00:09.240
<v Speaker 1>here without needing a PhD?

5
00:00:09.560 --> 00:00:12.640
<v Speaker 2>That's a really common question and honestly, the tools available

6
00:00:12.679 --> 00:00:16.440
<v Speaker 2>now have made it much more accessible for developers to jump in.

7
00:00:16.199 --> 00:00:19.199
<v Speaker 1>Absolutely, and that's really our mission for this deep dive.

8
00:00:19.280 --> 00:00:23.440
<v Speaker 1>We're digging into key parts of Laurence Moroney's book AI

9
00:00:23.600 --> 00:00:26.440
<v Speaker 1>and Machine Learning for Coders. We want to pull out

10
00:00:26.480 --> 00:00:27.760
<v Speaker 1>the most important bits for you.

11
00:00:28.039 --> 00:00:30.600
<v Speaker 2>Think of it as a practical starting point. We're aiming

12
00:00:30.600 --> 00:00:33.479
<v Speaker 2>to give you the essentials on deep learning, computer vision,

13
00:00:33.640 --> 00:00:38.119
<v Speaker 2>and NLP, basically focusing on how you can use TensorFlow

14
00:00:38.159 --> 00:00:39.079
<v Speaker 2>to tackle these things.

15
00:00:39.280 --> 00:00:42.079
<v Speaker 1>Yeah, the perspective here is crucial. It's tailored for you,

16
00:00:42.560 --> 00:00:44.799
<v Speaker 1>the coder. It's about the tools you can grab now

17
00:00:44.840 --> 00:00:47.399
<v Speaker 1>and the problems you can start solving.

18
00:00:47.439 --> 00:00:50.079
<v Speaker 2>Right, just like the book intends: equipping you to become

19
00:00:50.119 --> 00:00:53.439
<v Speaker 2>an ML developer by focusing on actually doing it, not

20
00:00:53.479 --> 00:00:56.079
<v Speaker 2>getting bogged down in just the theory or the super

21
00:00:56.119 --> 00:00:56.960
<v Speaker 2>complex math.

22
00:00:57.280 --> 00:01:00.759
<v Speaker 1>And people have really responded to that approach. Monn called it

23
00:01:00.799 --> 00:01:04.120
<v Speaker 1>the much-needed practical starting point. Su Fu mentioned how it

24
00:01:04.159 --> 00:01:07.120
<v Speaker 1>teaches the key building blocks so you can code AI

25
00:01:07.239 --> 00:01:09.840
<v Speaker 1>for PCs, mobile, the browser.

26
00:01:10.159 --> 00:01:14.079
<v Speaker 2>Yeah, Laurence Moroney's vision was clearly about empowering developers, and,

27
00:01:15.000 --> 00:01:17.799
<v Speaker 2>like Andrew Ng says in the foreword, great adventures await you.

28
00:01:17.879 --> 00:01:18.920
<v Speaker 2>It's an exciting space.

29
00:01:19.200 --> 00:01:22.400
<v Speaker 1>Okay, so let's unpack this. Where does machine learning really

30
00:01:22.439 --> 00:01:27.400
<v Speaker 1>differ from, say, traditional programming, and what's the main platform

31
00:01:26.959 --> 00:01:27.879
<v Speaker 3>we'll be talking about?

32
00:01:28.000 --> 00:01:30.760
<v Speaker 2>Well, the fundamental difference is a kind of flip in thinking.

33
00:01:31.159 --> 00:01:35.239
<v Speaker 2>In traditional programming, you write explicit rules. Rules act on data,

34
00:01:35.439 --> 00:01:36.400
<v Speaker 2>and you get answers.

35
00:01:36.799 --> 00:01:40.359
<v Speaker 1>Like coding a game like Breakout. You specifically write the logic:

36
00:01:40.920 --> 00:01:45.000
<v Speaker 1>how the ball moves, what happens when it hits a brick, scoring, paddle misses.

37
00:01:45.200 --> 00:01:47.319
<v Speaker 1>It's all defined rules.

38
00:01:47.599 --> 00:01:50.640
<v Speaker 2>Exactly. Or think about activity detection from a wearable. You might

39
00:01:50.640 --> 00:01:53.040
<v Speaker 2>write rules like okay, if speed is over x, it's running.

40
00:01:53.120 --> 00:01:56.000
<v Speaker 2>If it's between y and x, it's walking. You define

41
00:01:56.000 --> 00:01:57.359
<v Speaker 2>the logic based on the data.

42
00:01:57.400 --> 00:02:01.000
<v Speaker 1>But that approach hits the ceiling pretty quickly, right? What

43
00:02:01.079 --> 00:02:03.480
<v Speaker 1>if you wanted to detect something way more complex, like

44
00:02:03.719 --> 00:02:05.280
<v Speaker 1>golfing?

45
00:02:05.760 --> 00:02:08.159
<v Speaker 2>Precisely. How do you write rules for that? The mix of

46
00:02:08.240 --> 00:02:13.719
<v Speaker 2>swinging, pausing, walking. It gets incredibly hard, maybe even impossible,

47
00:02:14.000 --> 00:02:18.599
<v Speaker 2>to define robust rules that cover every single variation by hand.

48
00:02:18.840 --> 00:02:21.719
<v Speaker 1>So the old "rules act on data to give answers"

49
00:02:21.759 --> 00:02:24.599
<v Speaker 1>method breaks down when the rules are just too fuzzy

50
00:02:24.680 --> 00:02:27.840
<v Speaker 1>or complex to write yourself. And that's where ML comes in.

51
00:02:28.000 --> 00:02:30.280
<v Speaker 2>Right. With machine learning, you kind of flip it. You

52
00:02:30.360 --> 00:02:32.879
<v Speaker 2>provide the data and you provide the answers. We call

53
00:02:32.919 --> 00:02:36.520
<v Speaker 2>those labels. Then the machine learning algorithm figures out the

54
00:02:36.560 --> 00:02:38.960
<v Speaker 2>rules or patterns connecting the data to those answers.

55
00:02:39.080 --> 00:02:39.639
<v Speaker 3>Ah, okay.

56
00:02:39.680 --> 00:02:42.120
<v Speaker 1>So for the activity example, you'd give it sensor data

57
00:02:42.120 --> 00:02:45.719
<v Speaker 1>from someone walking, running, biking, and golfing, and you'd label

58
00:02:45.759 --> 00:02:48.319
<v Speaker 1>those chunks of data: this part is walking, this part

59
00:02:48.360 --> 00:02:48.800
<v Speaker 1>is golfing.

60
00:02:48.879 --> 00:02:51.759
<v Speaker 2>Yep. And the ML algorithm looks at all that labeled

61
00:02:51.800 --> 00:02:55.520
<v Speaker 2>sensor data, maybe acceleration, rotation, time, whatever, and it learns

62
00:02:55.520 --> 00:02:58.800
<v Speaker 2>the underlying patterns that distinguish golfing from the others. It

63
00:02:58.840 --> 00:03:01.919
<v Speaker 2>derives the complex rules you couldn't realistically write. That's the

64
00:03:01.960 --> 00:03:03.919
<v Speaker 2>core shift. It's pretty powerful.

65
00:03:03.879 --> 00:03:06.280
<v Speaker 1>And the platform that's really designed to put this power into

66
00:03:06.319 --> 00:03:07.960
<v Speaker 1>coders' hands is TensorFlow.

67
00:03:08.080 --> 00:03:11.879
<v Speaker 2>TensorFlow, yeah. It's this huge open source platform for building

68
00:03:11.879 --> 00:03:14.719
<v Speaker 2>and using ML models. Its real value, I think, is

69
00:03:14.719 --> 00:03:17.400
<v Speaker 2>that it handles a lot of the underlying complexity. It

70
00:03:17.520 --> 00:03:20.639
<v Speaker 2>implements common algorithms, common patterns, so you,

71
00:03:20.719 --> 00:03:23.560
<v Speaker 1>the coder, can focus more on the actual problem you're

72
00:03:23.599 --> 00:03:27.800
<v Speaker 1>trying to solve with ML, and less on, say, implementing backpropagation from scratch.

73
00:03:27.639 --> 00:03:31.280
<v Speaker 2>Exactly. Focus on the scenario. And TensorFlow is built

74
00:03:31.280 --> 00:03:33.680
<v Speaker 2>to be flexible. You can deploy the models you create

75
00:03:33.719 --> 00:03:37.360
<v Speaker 2>almost anywhere: web, cloud, mobile apps on Android or iOS,

76
00:03:37.400 --> 00:03:38.919
<v Speaker 2>even tiny embedded systems.

77
00:03:39.199 --> 00:03:43.080
<v Speaker 1>And how do you typically work with it? Python, install, IDEs?

78
00:03:42.800 --> 00:03:44.800
<v Speaker 2>All of the above, really. You can pip install it

79
00:03:44.800 --> 00:03:49.639
<v Speaker 2>in Python, use it in IDEs like PyCharm, or a

80
00:03:49.879 --> 00:03:52.800
<v Speaker 2>really popular way is using cloud environments like Google Colab,

81
00:03:53.000 --> 00:03:56.039
<v Speaker 2>which gives you access to GPUs and TPUs without needing

82
00:03:56.080 --> 00:03:56.960
<v Speaker 2>the hardware yourself.

83
00:03:57.120 --> 00:04:00.919
<v Speaker 1>Okay, let's drill down. What does the simplest possible example

84
00:04:00.960 --> 00:04:03.879
<v Speaker 1>of this learning look like? Like the basic building block?

85
00:04:03.599 --> 00:04:07.520
<v Speaker 2>Right. So imagine teaching a network a really simple

86
00:04:07.680 --> 00:04:11.400
<v Speaker 2>linear relationship, like y equals two x minus one. You'd give

87
00:04:11.400 --> 00:04:14.360
<v Speaker 2>examples: if x is one, y is one; if x is two,

88
00:04:14.400 --> 00:04:16.360
<v Speaker 2>y is three; if x is three, y is five; and so

89
00:04:16.399 --> 00:04:19.560
<v Speaker 2>on. A tiny neural network, even one with just

90
00:04:19.600 --> 00:04:22.399
<v Speaker 2>a single neuron, can learn this. It does it by

91
00:04:22.439 --> 00:04:27.040
<v Speaker 2>adjusting two internal values, a weight it multiplies the input

92
00:04:27.279 --> 00:04:29.199
<v Speaker 2>x by, and a bias it adds.
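
NOTE
A minimal sketch of the single-neuron idea described above, written in plain
Python rather than TensorFlow so it runs anywhere. The training points, the
learning rate, the epoch count, and the name train_single_neuron are
assumptions made for illustration; they are not taken from the book.
```python
# Hypothetical training data for y = 2x - 1 (values chosen for illustration).
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x - 1.0 for x in xs]
def train_single_neuron(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y = weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0  # the two learned parameters
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to weight and bias.
        grad_w = sum(2 * (weight * x + bias - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (weight * x + bias - y) for x, y in zip(xs, ys)) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias
weight, bias = train_single_neuron(xs, ys)
print(weight, bias)
```
With these settings the learned weight and bias should end up very close to
2 and -1. The learning rate and the epoch count are the hyperparameters here.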

93
00:04:29.600 --> 00:04:31.600
<v Speaker 1>So it's basically learning the two and the minus one

94
00:04:32.000 --> 00:04:34.920
<v Speaker 1>from our equation. Yeah, those learned numbers, the weight and bias,

95
00:04:35.000 --> 00:04:36.399
<v Speaker 1>those are the parameters of the network.

96
00:04:35.879 --> 00:04:38.199
<v Speaker 2>Exactly. And that's a really important distinction

97
00:04:38.279 --> 00:04:42.680
<v Speaker 2>for coders getting into ML. Parameters, the weights and biases, are

98
00:04:42.720 --> 00:04:45.680
<v Speaker 2>what the network learns from the data. They're different from hyperparameters.

99
00:04:45.399 --> 00:04:48.639
<v Speaker 1>Right. Hyperparameters are the knobs you turn before training starts,

100
00:04:48.639 --> 00:04:51.560
<v Speaker 1>aren't they? Things that control the learning process itself.

101
00:04:51.680 --> 00:04:55.000
<v Speaker 2>Yeah, things like the learning rate, how quickly it adjusts weights,

102
00:04:55.079 --> 00:04:56.879
<v Speaker 2>or the number of neurons you decide to put in

103
00:04:56.920 --> 00:04:59.600
<v Speaker 2>a layer, or how many epochs meaning how many times

104
00:04:59.600 --> 00:05:02.040
<v Speaker 2>it sees a whole data set. You experiment with these

105
00:05:02.079 --> 00:05:06.120
<v Speaker 2>to get better results. And neurons also usually have something

106
00:05:06.160 --> 00:05:08.720
<v Speaker 2>called an activation function. It's like a little function that

107
00:05:08.759 --> 00:05:13.000
<v Speaker 2>processes the neuron's output. A common one is ReLU, rectified

108
00:05:13.040 --> 00:05:15.680
<v Speaker 2>linear unit. It basically just passes the value through if

109
00:05:15.720 --> 00:05:20.079
<v Speaker 2>it's positive and outputs zero otherwise. This adds nonlinearity, which

110
00:05:20.120 --> 00:05:22.519
<v Speaker 2>is crucial for learning anything beyond simple lines.
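
NOTE
A short sketch of the ReLU activation just described, in plain Python. The
function name relu and the sample inputs are chosen here for illustration.
```python
def relu(value):
    """Rectified linear unit: pass positive values through, output zero otherwise."""
    return value if value > 0 else 0.0
print([relu(v) for v in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
```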

111
00:05:22.600 --> 00:05:25.160
<v Speaker 1>Okay, makes sense. We've got the basic shift to learning

112
00:05:25.360 --> 00:05:28.439
<v Speaker 1>the platform, the simplest building block. Let's apply this to

113
00:05:28.480 --> 00:05:31.839
<v Speaker 1>a huge area: computer vision, making machines see.

114
00:05:31.959 --> 00:05:34.959
<v Speaker 2>So at its core, an image is just a

115
00:05:35.000 --> 00:05:38.240
<v Speaker 2>grid of numbers, right? Pixels. A small grayscale image

116
00:05:38.279 --> 00:05:41.000
<v Speaker 2>like from the Fashion MNIST data set might be twenty

117
00:05:41.000 --> 00:05:43.959
<v Speaker 2>eight by twenty eight pixels. Each pixel has a value,

118
00:05:44.040 --> 00:05:46.399
<v Speaker 2>say zero to two fifty five, for how bright it is.

119
00:05:46.319 --> 00:05:48.680
<v Speaker 1>And color images just have more numbers per pixel,

120
00:05:48.759 --> 00:05:50.519
<v Speaker 1>usually three for red, green, and blue.

121
00:05:50.680 --> 00:05:51.319
<v Speaker 2>Three channels.

122
00:05:51.439 --> 00:05:51.600
<v Speaker 3>Yeah.

123
00:05:51.639 --> 00:05:54.160
<v Speaker 2>Now if you try to feed those raw pixel values

124
00:05:54.199 --> 00:05:57.240
<v Speaker 2>directly into that simple single-neuron network we just talked about,

125
00:05:57.319 --> 00:05:59.720
<v Speaker 2>or even a basic multilayer network, well, it would

126
00:05:59.800 --> 00:06:03.920
<v Speaker 2>really struggle because a simple network doesn't understand spatial structure.

127
00:06:04.120 --> 00:06:06.680
<v Speaker 2>It just sees a flat list of numbers. It might

128
00:06:06.959 --> 00:06:10.439
<v Speaker 2>learn to recognize, say a sneaker, if it's exactly like

129
00:06:10.480 --> 00:06:13.160
<v Speaker 2>the ones in the training data, same position, same angle.

130
00:06:13.319 --> 00:06:16.240
<v Speaker 1>Ah, but if you show it the sneaker slightly rotated,

131
00:06:16.920 --> 00:06:19.319
<v Speaker 1>or maybe a different type of sneaker, like a high heel,

132
00:06:19.319 --> 00:06:22.879
<v Speaker 2>it might completely fail. It hasn't learned the features that make

133
00:06:23.000 --> 00:06:26.720
<v Speaker 2>something a sneaker. It just memorized specific pixel patterns in

134
00:06:26.759 --> 00:06:27.759
<v Speaker 2>specific locations.

135
00:06:27.800 --> 00:06:28.800
<v Speaker 3>Okay, so that's the problem

136
00:06:28.839 --> 00:06:31.720
<v Speaker 1>convolutional neural networks, or CNNs, are designed to solve.

137
00:06:31.360 --> 00:06:35.079
<v Speaker 2>Exactly. CNNs are built to automatically find and

138
00:06:35.160 --> 00:06:40.680
<v Speaker 2>learn hierarchical features in images, things like edges, textures, shapes, objects,

139
00:06:40.920 --> 00:06:42.879
<v Speaker 2>regardless of where they appear in the image.

140
00:06:43.160 --> 00:06:44.800
<v Speaker 3>How do they do that? What are the core ideas?

141
00:06:44.879 --> 00:06:49.480
<v Speaker 2>Two main operations, convolutions and pooling. Convolutions use small filters,

142
00:06:49.920 --> 00:06:52.360
<v Speaker 2>like maybe a three by three grid of weights, that

143
00:06:52.439 --> 00:06:55.639
<v Speaker 2>slide across the image. Each filter is trained to detect

144
00:06:55.720 --> 00:06:58.519
<v Speaker 2>a specific local pattern, maybe a vertical edge or a

145
00:06:58.560 --> 00:07:00.160
<v Speaker 2>certain curve or texture.

146
00:07:00.439 --> 00:07:03.120
<v Speaker 1>So the filter scans the image and produces a sort

147
00:07:03.120 --> 00:07:04.720
<v Speaker 1>of map showing where it found that pattern.

148
00:07:04.639 --> 00:07:09.959
<v Speaker 2>Precisely. And applying a filter reduces the image dimensions slightly.

149
00:07:10.600 --> 00:07:12.720
<v Speaker 2>A three by three filter on a twenty eight by

150
00:07:12.759 --> 00:07:15.079
<v Speaker 2>twenty eight image gives you a twenty six by twenty

151
00:07:15.120 --> 00:07:16.000
<v Speaker 2>six output map.

152
00:07:16.120 --> 00:07:17.120
<v Speaker 3>Okay, and pooling?

153
00:07:17.240 --> 00:07:20.079
<v Speaker 2>Pooling layers then reduce the size of these feature maps,

154
00:07:20.160 --> 00:07:23.720
<v Speaker 2>making the representation smaller and more manageable while keeping the

155
00:07:23.720 --> 00:07:27.759
<v Speaker 2>most important information. A common type is max pooling. You

156
00:07:27.839 --> 00:07:29.800
<v Speaker 2>might take a two by two area and just keep

157
00:07:29.800 --> 00:07:32.759
<v Speaker 2>the maximum value, throwing away the other three. This halves

158
00:07:32.800 --> 00:07:35.480
<v Speaker 2>the dimensions but keeps the strongest signal for that feature

159
00:07:35.519 --> 00:07:36.199
<v Speaker 2>in that region.
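
NOTE
A rough sketch of the two operations just described, in plain Python, to make
the size arithmetic concrete: a 3x3 filter over a 28x28 image with no padding
gives 26x26, and 2x2 max pooling then gives 13x13. The test image, the
hand-written edge filter, and the function names are assumptions for
illustration; a real CNN learns its filter weights during training.
```python
def convolve2d(image, kernel):
    """Slide a square kernel over an image with no padding (deep-learning style, no kernel flip)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
             for c in range(out_w)] for r in range(out_h)]
def max_pool(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
             for c in range(cols)] for r in range(rows)]
# Hypothetical 28x28 grayscale image: dark left half, bright right half.
image = [[0 if c < 14 else 255 for c in range(28)] for r in range(28)]
# Hand-written vertical-edge filter (a trained network would learn its own weights).
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
feature_map = convolve2d(image, vertical_edge)
pooled = max_pool(feature_map)
print(len(feature_map), len(feature_map[0]))  # 26 26
print(len(pooled), len(pooled[0]))  # 13 13
```
The feature map is nonzero only in the columns where the filter straddles the
dark-to-bright boundary, which is the "map of where the pattern was found".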

160
00:07:36.360 --> 00:07:39.160
<v Speaker 1>And by stacking these convolution and pooling layers,

161
00:07:39.000 --> 00:07:43.000
<v Speaker 2>the network builds up understanding. Early layers find simple edges,

162
00:07:43.160 --> 00:07:46.759
<v Speaker 2>middle layers combine edges into corners and textures. Deeper layers

163
00:07:46.800 --> 00:07:49.519
<v Speaker 2>combine those into parts of objects, and then whole objects.

164
00:07:49.959 --> 00:07:51.720
<v Speaker 2>The key thing for you as a coder is that

165
00:07:51.759 --> 00:07:54.800
<v Speaker 2>the CNN automates this feature extraction. You don't have to

166
00:07:54.839 --> 00:07:56.519
<v Speaker 2>hand code edge detectors anymore.

167
00:07:56.639 --> 00:07:57.879
<v Speaker 3>Right, let's make it concrete.

168
00:07:58.000 --> 00:07:59.920
<v Speaker 1>The book uses a horse-or-human classifier

169
00:08:00.000 --> 00:08:00.319
<v Speaker 3>example.

170
00:08:00.399 --> 00:08:03.519
<v Speaker 2>Yeah, that data set uses bigger color images, maybe three

171
00:08:03.639 --> 00:08:06.319
<v Speaker 2>hundred by three hundred pixels with three color channels. So

172
00:08:06.360 --> 00:08:08.800
<v Speaker 2>the input shape is three hundred by three hundred by three.

173
00:08:08.920 --> 00:08:12.040
<v Speaker 1>And since it's just two classes horse or human, it's

174
00:08:12.160 --> 00:08:15.480
<v Speaker 1>binary classification. You can use just one neuron in the

175
00:08:15.519 --> 00:08:16.480
<v Speaker 1>final output layer.

176
00:08:16.560 --> 00:08:19.600
<v Speaker 2>You can, and you typically attach a sigmoid activation function

177
00:08:19.680 --> 00:08:23.120
<v Speaker 2>to it. Sigmoid squashes any input value into a range

178
00:08:23.160 --> 00:08:26.600
<v Speaker 2>between zero and one, perfect for probability. You can interpret

179
00:08:26.720 --> 00:08:29.920
<v Speaker 2>the output as say, the probability that the image is

180
00:08:29.920 --> 00:08:30.360
<v Speaker 2>a human.
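
NOTE
A small sketch of the sigmoid function mentioned above, in plain Python. The
branch on the sign of z is only there to avoid overflow for large negative
inputs. Treating 0.5 as the decision threshold and the name classify are
assumptions for illustration, not something stated in the episode.
```python
import math
def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
def classify(raw_output, threshold=0.5):
    """Read the sigmoid output as P(human) and pick a label (hypothetical helper)."""
    return "human" if sigmoid(raw_output) >= threshold else "horse"
print(sigmoid(0.0))  # 0.5
print(classify(2.0), classify(-2.0))  # human horse
```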

181
00:08:30.600 --> 00:08:33.919
<v Speaker 1>The source mentioned a specific failure case, though, where a

182
00:08:34.000 --> 00:08:36.840
<v Speaker 1>model trained on this data set saw a picture of

183
00:08:36.919 --> 00:08:39.720
<v Speaker 1>just like the top half of a person and classified

184
00:08:39.759 --> 00:08:42.399
<v Speaker 1>it as a horse. Why would that happen? It seems

185
00:08:42.399 --> 00:08:44.080
<v Speaker 1>like a common beginner frustration.

186
00:08:44.519 --> 00:08:47.279
<v Speaker 2>It often boils down to the training data and overfitting.

187
00:08:47.519 --> 00:08:50.200
<v Speaker 2>If your training set mostly contains full body pictures of

188
00:08:50.279 --> 00:08:53.440
<v Speaker 2>humans standing up and maybe horses in profile, the model

189
00:08:53.519 --> 00:08:57.080
<v Speaker 2>learns those specific views. When it sees something unusual, like

190
00:08:57.200 --> 00:08:59.519
<v Speaker 2>only the upper body of a person, perhaps in a pose

191
00:08:59.559 --> 00:09:02.200
<v Speaker 2>it hasn't seen, it might latch onto some features, maybe texture,

192
00:09:02.240 --> 00:09:05.039
<v Speaker 2>maybe background, that it learned were associated with horses in

193
00:09:05.080 --> 00:09:08.320
<v Speaker 2>the training data. It hasn't generalized the concept of human

194
00:09:08.399 --> 00:09:10.240
<v Speaker 2>well enough outside the examples it saw.

195
00:09:10.480 --> 00:09:14.799
<v Speaker 1>Okay, so this overfitting, doing great on training data but

196
00:09:14.919 --> 00:09:18.000
<v Speaker 1>failing on new stuff: how do you fight that, especially

197
00:09:18.000 --> 00:09:19.200
<v Speaker 1>if you don't have tons of data?

198
00:09:19.759 --> 00:09:23.320
<v Speaker 2>Several really good techniques. One is image augmentation. It's clever.

199
00:09:23.960 --> 00:09:26.759
<v Speaker 2>During training, you don't just feed the network your original images.

200
00:09:26.879 --> 00:09:29.960
<v Speaker 2>You apply random transformations on the fly, maybe rotate the

201
00:09:30.000 --> 00:09:33.120
<v Speaker 2>image slightly, zoom in or out a bit, shift it

202
00:09:33.159 --> 00:09:35.159
<v Speaker 2>horizontally or vertically, flip it.

203
00:09:35.320 --> 00:09:38.720
<v Speaker 1>Ah, so you're essentially creating slightly modified versions of your

204
00:09:38.720 --> 00:09:42.120
<v Speaker 1>existing images, making the data set seem bigger and more varied.

205
00:09:42.279 --> 00:09:44.759
<v Speaker 2>Exactly, the model learns that a horse is still a

206
00:09:44.799 --> 00:09:48.159
<v Speaker 2>horse even if it's tilted or slightly zoomed. It becomes

207
00:09:48.159 --> 00:09:52.159
<v Speaker 2>more robust. TensorFlow's ImageDataGenerator makes this super easy

208
00:09:52.200 --> 00:09:52.679
<v Speaker 2>to set up.

209
00:09:52.799 --> 00:09:53.200
<v Speaker 3>What else?

210
00:09:53.399 --> 00:09:57.200
<v Speaker 2>Another huge one is transfer learning. This is incredibly powerful.

211
00:09:57.240 --> 00:09:59.960
<v Speaker 2>You might only have a few thousand horse and human images,

212
00:10:00.559 --> 00:10:03.799
<v Speaker 2>but other people have trained enormous models, like MobileNet

213
00:10:03.879 --> 00:10:08.039
<v Speaker 2>or Inception, on millions of images covering, say, one thousand

214
00:10:08.120 --> 00:10:11.559
<v Speaker 2>different categories on the ImageNet data set. Those massive models

215
00:10:11.600 --> 00:10:15.759
<v Speaker 2>have already learned really really good general purpose feature extractors

216
00:10:15.799 --> 00:10:21.159
<v Speaker 2>in their early convolutional layers. They know how to detect edges, textures, shapes,

217
00:10:21.679 --> 00:10:26.279
<v Speaker 2>basic object parts, things useful for recognizing any image.

218
00:10:26.480 --> 00:10:30.120
<v Speaker 1>So with transfer learning, you take those pre trained layers.

219
00:10:29.840 --> 00:10:32.799
<v Speaker 2>Yep, you basically chop off the original final layers that

220
00:10:32.840 --> 00:10:35.879
<v Speaker 2>were specific to the one thousand ImageNet classes. You

221
00:10:36.279 --> 00:10:38.840
<v Speaker 2>freeze the weights of the early layers so they don't change,

222
00:10:39.159 --> 00:10:41.799
<v Speaker 2>and you add your own new classification layers on top,

223
00:10:41.879 --> 00:10:44.440
<v Speaker 2>maybe just a couple of layers ending in that single

224
00:10:44.480 --> 00:10:47.559
<v Speaker 2>sigmoid neuron for your horse-or-human task.

225
00:10:47.200 --> 00:10:49.519
<v Speaker 1>And you only train your new small layers using your

226
00:10:49.519 --> 00:10:50.360
<v Speaker 1>smaller data set.

227
00:10:50.519 --> 00:10:54.240
<v Speaker 2>Mostly yes, or sometimes you fine tune by unfreezing the

228
00:10:54.279 --> 00:10:56.679
<v Speaker 2>last few pre trained layers and training them a tiny

229
00:10:56.720 --> 00:10:59.360
<v Speaker 2>bit too. But the point is you're leveraging all the

230
00:10:59.399 --> 00:11:03.000
<v Speaker 2>knowledge learned from the giant data set for your specific problem.

231
00:11:03.279 --> 00:11:06.279
<v Speaker 2>It's a massive shortcut. TensorFlow Hub is a great place

232
00:11:06.279 --> 00:11:07.799
<v Speaker 2>to find these pre trained models.

233
00:11:08.039 --> 00:11:10.480
<v Speaker 3>That makes a lot of sense. Any other tricks for overfitting?

234
00:11:10.559 --> 00:11:14.720
<v Speaker 2>Dropout regularization is another common one. During training, for each

235
00:11:14.759 --> 00:11:18.759
<v Speaker 2>batch of data, you randomly drop out, meaning temporarily ignore,

236
00:11:18.879 --> 00:11:20.840
<v Speaker 2>a certain percentage of the neurons in a layer.

237
00:11:20.879 --> 00:11:21.919
<v Speaker 3>Wait, you just turn them off? Why?

238
00:11:21.960 --> 00:11:25.240
<v Speaker 2>It sounds weird, but it forces the network to

239
00:11:25.320 --> 00:11:29.759
<v Speaker 2>learn more robust and redundant representations. It prevents any single

240
00:11:29.840 --> 00:11:33.559
<v Speaker 2>neuron from becoming overly specialized or critical for making predictions

241
00:11:33.559 --> 00:11:36.399
<v Speaker 2>based on the training data. It encourages the network to

242
00:11:36.440 --> 00:11:39.840
<v Speaker 2>distribute the learning. You often see the training accuracy and

243
00:11:39.879 --> 00:11:43.360
<v Speaker 2>the validation accuracy, the performance on data held back from training,

244
00:11:43.720 --> 00:11:46.240
<v Speaker 2>stay much closer together when you use dropout.
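
NOTE
A minimal sketch of dropout on one layer's outputs during training, in plain
Python. It uses the "inverted dropout" variant, where the surviving values
are scaled up by 1 / (1 - rate) so the expected total stays the same; that
choice, the fixed seed, and the sample values are assumptions made here for
illustration. The rate must be below 1.
```python
import random
def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
layer_output = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
print(dropout(layer_output, rate=0.5, rng=rng))
```
At prediction time no neurons are dropped; the layer output is used as is.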

245
00:11:46.320 --> 00:11:48.320
<v Speaker 1>Okay, got it. So that's a good overview for images.

246
00:11:48.559 --> 00:11:52.200
<v Speaker 1>What about the other big area text, natural language processing?

247
00:11:52.480 --> 00:11:54.480
<v Speaker 1>How do machines start to understand language?

248
00:11:54.720 --> 00:11:58.480
<v Speaker 2>Well, like images, text needs to be converted into numbers first.

249
00:11:58.840 --> 00:12:01.639
<v Speaker 2>The initial step is usually tokenization.

250
00:12:01.399 --> 00:12:04.519
<v Speaker 1>Breaking sentences down into words or maybe even parts of words,

251
00:12:04.799 --> 00:12:07.840
<v Speaker 1>and giving each unique piece a number ID, like "the"

252
00:12:07.919 --> 00:12:09.840
<v Speaker 1>is one, "cat" is two, "sat" is three.

253
00:12:09.679 --> 00:12:13.679
<v Speaker 2>Exactly. You build a vocabulary, a mapping from

254
00:12:13.720 --> 00:12:18.200
<v Speaker 2>words to integer tokens. Good tokenizers handle things like punctuation.

255
00:12:18.320 --> 00:12:22.200
<v Speaker 2>Maybe "today?" just becomes the token for "today." And crucially,

256
00:12:22.679 --> 00:12:25.159
<v Speaker 2>you need a plan for words that weren't in your

257
00:12:25.200 --> 00:12:28.559
<v Speaker 2>training vocabulary: an out-of-vocabulary, or OOV, token.

258
00:12:28.759 --> 00:12:32.240
<v Speaker 1>Once you have tokens, you turn sentences into sequences of

259
00:12:32.279 --> 00:12:35.240
<v Speaker 1>these numbers. "The cat sat" might become the sequence one,

260
00:12:35.320 --> 00:12:36.159
<v Speaker 1>two, three.

261
00:12:36.480 --> 00:12:39.919
<v Speaker 2>Right. And because neural networks usually need fixed size inputs, you

262
00:12:39.960 --> 00:12:42.200
<v Speaker 2>have to make all your sequences the same length. You

263
00:12:42.279 --> 00:12:45.960
<v Speaker 2>either pad shorter sequences, usually with zeros, or you truncate

264
00:12:45.960 --> 00:12:46.799
<v Speaker 2>longer ones.

265
00:12:46.879 --> 00:12:47.759
<v Speaker 3>How do you pick the length?

266
00:12:48.080 --> 00:12:50.679
<v Speaker 2>You typically look at the distribution of sentence lengths in

267
00:12:50.720 --> 00:12:53.440
<v Speaker 2>your data. Maybe ninety five percent of your sentences are

268
00:12:53.519 --> 00:12:56.120
<v Speaker 2>shorter than say eighty five words, so you might pick

269
00:12:56.159 --> 00:12:58.879
<v Speaker 2>eighty five as your max length to minimize padding while

270
00:12:58.919 --> 00:13:00.799
<v Speaker 2>capturing most sentences fully.
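
NOTE
A plain-Python sketch of the steps discussed above: build a vocabulary, map
sentences to integer sequences with an OOV fallback, then pad or truncate to
a fixed length. The helper names, the "<OOV>" marker, the ID numbering
(0 reserved for padding, 1 for OOV, so real words start at 2), and padding at
the end rather than the start are all assumptions made for illustration; a
library tokenizer may number and pad differently.
```python
def build_vocab(sentences):
    """Map each distinct lowercase word to an integer ID; 0 is padding, 1 is OOV."""
    vocab = {"<OOV>": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            word = word.strip(".,!?")  # so "today?" and "today" share one token
            if word and word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab
def to_sequence(sentence, vocab):
    """Replace each word with its ID, falling back to the OOV ID for unseen words."""
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    return [vocab.get(w, vocab["<OOV>"]) for w in words if w]
def pad_or_truncate(sequence, max_length):
    """Right-pad with zeros or cut off the tail so every sequence has max_length items."""
    return (sequence + [0] * max_length)[:max_length]
vocab = build_vocab(["The cat sat.", "Is it sunny today?"])
print(pad_or_truncate(to_sequence("The cat sat today!", vocab), 6))  # [2, 3, 4, 8, 0, 0]
print(pad_or_truncate(to_sequence("The dog sat", vocab), 2))  # [2, 1]
```
In the second call "dog" was never seen, so it maps to the OOV ID before the
sequence is cut to two items.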

271
00:13:00.559 --> 00:13:03.759
<v Speaker 1>And sometimes you clean the text first, remove HTML, maybe

272
00:13:03.799 --> 00:13:04.399
<v Speaker 1>common words?

273
00:13:04.639 --> 00:13:09.320
<v Speaker 2>Yeah, preprocessing is often important, removing HTML tags, maybe converting

274
00:13:09.320 --> 00:13:13.320
<v Speaker 2>to lowercase. Sometimes you remove stop words, common words like

275
00:13:13.519 --> 00:13:16.080
<v Speaker 2>"is," "it," and "the" that might not carry

276
00:13:16.120 --> 00:13:19.799
<v Speaker 2>much specific meaning for your task. So "Is it sunny

277
00:13:19.840 --> 00:13:23.200
<v Speaker 2>today?" might become tokens for just "sunny" and "today."

278
00:13:23.480 --> 00:13:26.480
<v Speaker 1>Okay, so we have sequences of numbers. But just

279
00:13:26.519 --> 00:13:29.720
<v Speaker 1>assigning arbitrary IDs like one, two, three doesn't tell the

280
00:13:29.759 --> 00:13:33.559
<v Speaker 1>model anything about meaning, right? "Cat" as two isn't inherently related

281
00:13:33.559 --> 00:13:35.200
<v Speaker 1>to "dog" as, maybe, fifty.

282
00:13:35.240 --> 00:13:37.600
<v Speaker 2>Exactly. That's where embeddings come in. This is a really key

283
00:13:37.639 --> 00:13:38.519
<v Speaker 2>concept in NLP.

284
00:13:38.759 --> 00:13:38.879
<v Speaker 3>Right.

285
00:13:39.120 --> 00:13:42.240
<v Speaker 2>Embeddings represent words not just as single numbers, but as

286
00:13:42.559 --> 00:13:46.080
<v Speaker 2>vectors, lists of numbers in a multidimensional space. Think

287
00:13:46.120 --> 00:13:49.320
<v Speaker 2>of it like giving each word coordinates on a complex map.

288
00:13:49.320 --> 00:13:51.600
<v Speaker 1>And the idea is that words with similar meanings end up

289
00:13:51.600 --> 00:13:52.600
<v Speaker 1>closer together.

290
00:13:52.360 --> 00:13:52.879
<v Speaker 3>On this map.

291
00:13:52.919 --> 00:13:56.039
<v Speaker 2>Precisely. King and queen might have similar vectors. Walking and

292
00:13:56.120 --> 00:13:59.679
<v Speaker 2>running might be close. The relationships between words are captured

293
00:13:59.679 --> 00:14:02.159
<v Speaker 2>by their relative positions in this embedding space.

294
00:14:02.240 --> 00:14:04.519
<v Speaker 1>The book uses a cool example with Pride and Prejudice

295
00:14:04.559 --> 00:14:07.639
<v Speaker 1>characters, right? Plotting them based on learned dimensions like gender

296
00:14:07.720 --> 00:14:08.399
<v Speaker 1>or nobility.

297
00:14:08.639 --> 00:14:11.440
<v Speaker 2>Yeah, that's a great way to visualize it. Mr. Darcy

298
00:14:11.480 --> 00:14:14.360
<v Speaker 2>and Elizabeth Bennet might be positioned based on these learned

299
00:14:14.360 --> 00:14:18.320
<v Speaker 2>semantic features. The key is that the network learns these

300
00:14:18.399 --> 00:14:23.559
<v Speaker 2>vector representations. The dimensions aren't predefined. They emerge from how

301
00:14:23.600 --> 00:14:26.000
<v Speaker 2>words are used together in the training text.

302
00:14:26.360 --> 00:14:29.120
<v Speaker 1>So the network figures out that king is used in

303
00:14:29.200 --> 00:14:33.679
<v Speaker 1>similar contexts to queen, but maybe also related to man,

304
00:14:34.000 --> 00:14:35.679
<v Speaker 1>while queen is related to woman.

305
00:14:35.919 --> 00:14:38.480
<v Speaker 2>Right, and you can even optimize the number of dimensions

306
00:14:38.559 --> 00:14:41.159
<v Speaker 2>in your embedding vectors. A rule of thumb is maybe

307
00:14:41.200 --> 00:14:44.120
<v Speaker 2>the fourth root of your vocabulary size. So for a

308
00:14:44.120 --> 00:14:47.279
<v Speaker 2>few thousand words, maybe seven or eight dimensions is enough

309
00:14:47.440 --> 00:14:50.320
<v Speaker 2>instead of, say, sixteen or thirty two. It trains faster

310
00:14:50.440 --> 00:14:51.759
<v Speaker 2>without losing too much meaning.
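
NOTE
The fourth-root rule of thumb mentioned above, as a quick calculation in
plain Python. The function name and the sample vocabulary sizes are chosen
here for illustration, and the result is only a starting point to experiment
from, not a fixed rule.
```python
def suggested_embedding_dim(vocab_size):
    """Rule-of-thumb embedding size: the fourth root of the vocabulary size, rounded."""
    return max(1, round(vocab_size ** 0.25))
for size in (2000, 4000, 10000):
    print(size, suggested_embedding_dim(size))
```
For 2,000 and 4,000 words this gives 7 and 8, matching the "seven or eight
dimensions" figure for a vocabulary of a few thousand words.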

311
00:14:52.159 --> 00:14:55.000
<v Speaker 1>But if you just average the embedding vectors for all

312
00:14:55.039 --> 00:14:57.440
<v Speaker 1>words in a sentence, you lose word order, don't you?

313
00:14:57.480 --> 00:14:58.600
<v Speaker 1>It becomes a bag of words.
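
NOTE
A tiny sketch of why averaging word vectors throws away word order, in plain
Python. The two-dimensional embedding values are made up by hand for
illustration; learned embeddings would have more dimensions and different
numbers.
```python
# Hypothetical 2-dimensional embeddings, values picked by hand for illustration.
embeddings = {"dog": [1.0, 0.0], "bites": [0.5, 0.5], "man": [0.0, 1.0]}
def average_embedding(sentence):
    """Average the word vectors of a sentence, ignoring word order."""
    vectors = [embeddings[word] for word in sentence.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]
print(average_embedding("dog bites man") == average_embedding("man bites dog"))  # True
```
Both sentences collapse to the same averaged vector, so a model that only
sees the average cannot tell them apart.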

314
00:14:58.799 --> 00:15:01.960
<v Speaker 2>That's a major limitation of simple embedding approaches. Word order

315
00:15:02.039 --> 00:15:04.960
<v Speaker 2>is critical in language. Dog bites man versus man bites

316
00:15:05.000 --> 00:15:07.480
<v Speaker 2>dog: totally different meanings, same bag of words.

317
00:15:07.480 --> 00:15:10.440
<v Speaker 1>So to handle sequence and context, we need something more sophisticated,

318
00:15:10.720 --> 00:15:13.320
<v Speaker 1>like recurrent neural networks, RNNs?

319
00:15:13.720 --> 00:15:16.480
<v Speaker 2>Exactly. RNNs are designed from the ground up for sequential data.

320
00:15:16.759 --> 00:15:18.919
<v Speaker 2>They have a kind of internal memory or state that

321
00:15:18.960 --> 00:15:21.919
<v Speaker 2>gets updated as they process each word or token in

322
00:15:21.960 --> 00:15:26.000
<v Speaker 2>a sequence. This state carries context from previous words forward.

323
00:15:26.519 --> 00:15:29.799
<v Speaker 1>But you mentioned simple RNNs can struggle with long sentences.

324
00:15:30.039 --> 00:15:33.159
<v Speaker 1>They might forget important context from the beginning.

325
00:15:33.480 --> 00:15:37.039
<v Speaker 2>Yeah, that's the vanishing gradient problem. Essentially, the influence of

326
00:15:37.080 --> 00:15:39.960
<v Speaker 2>early words can fade out over long sequences. If you

327
00:15:39.960 --> 00:15:42.440
<v Speaker 2>have a sentence like I grew up in France, so

328
00:15:42.559 --> 00:15:45.840
<v Speaker 2>I speak fluent [blank], the word France early on is key

329
00:15:45.879 --> 00:15:49.000
<v Speaker 2>to predicting French at the end. A simple RNN might

330
00:15:49.080 --> 00:15:49.919
<v Speaker 2>lose that connection.
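
NOTE
Editor's sketch of how the influence of an early state fades in a simple RNN. It assumes a scalar recurrence h_t = tanh(w * h_{t-1} + u * x_t); the weights w and u and the function name are hypothetical values picked for illustration.
```python
import math
def gradient_wrt_first_state(inputs, w=0.5, u=1.0):
    """Return d h_T / d h_0 for a scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: each step multiplies by d h_t / d h_{t-1} = w * (1 - h_t ** 2).
        grad *= w * (1.0 - h * h)
    return grad
for length in (5, 20, 50):
    print(length, gradient_wrt_first_state([0.1] * length))
```
Because every step multiplies the gradient by a factor smaller than one in magnitude, the printed values shrink rapidly as the sequence gets longer.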

331
00:15:50.279 --> 00:15:53.440
<v Speaker 1>Okay, and that's why LSTMs, long short-term memory networks,

332
00:15:53.440 --> 00:15:54.279
<v Speaker 1>were developed, right?

333
00:15:54.480 --> 00:15:57.559
<v Speaker 2>LSTMs are a special type of RNN. They have internal

334
00:15:57.600 --> 00:16:02.360
<v Speaker 2>mechanisms called gates that explicitly control what information to remember,

335
00:16:02.559 --> 00:16:05.440
<v Speaker 2>what to forget, and what to output. This makes them

336
00:16:05.519 --> 00:16:09.039
<v Speaker 2>much, much better at capturing long-range dependencies in sequences.

337
00:16:09.320 --> 00:16:12.639
<v Speaker 1>And then there are bidirectional LSTMs. How do they improve things?

338
00:16:12.960 --> 00:16:16.840
<v Speaker 2>So a standard LSTM reads the sequence from start to end.

339
00:16:17.320 --> 00:16:21.840
<v Speaker 2>A bidirectional LSTM has two LSTMs. One reads forwards, the

340
00:16:21.840 --> 00:16:25.399
<v Speaker 2>other reads backwards. Then it combines their outputs at each step.
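
NOTE
Editor's sketch of the bidirectional idea only. run_recurrent is a hypothetical stand-in for an LSTM pass (a simple decaying running state, not a real LSTM cell); the point is how the forward and backward outputs are paired at each step.
```python
def run_recurrent(values, decay=0.5):
    """Stand-in for an LSTM pass: a running state updated at every step."""
    state = 0.0
    outputs = []
    for value in values:
        state = decay * state + value
        outputs.append(state)
    return outputs
def bidirectional(values):
    """Pair each step's forward output with its backward output."""
    forward = run_recurrent(values)
    # Read the sequence in reverse, then flip the outputs back into original order.
    backward = run_recurrent(values[::-1])[::-1]
    return list(zip(forward, backward))
print(bidirectional([1.0, 0.0, 0.0, 2.0]))
```
In the printed pairs, the backward half of the first step already reflects the large value at the end of the sequence, which the forward half has not seen yet.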

341
00:16:25.559 --> 00:16:27.840
<v Speaker 1>Ah, so it gets context from both directions.

342
00:16:27.679 --> 00:16:32.080
<v Speaker 2>Exactly. For understanding language, this is often really powerful.

343
00:16:32.159 --> 00:16:36.639
<v Speaker 2>Think about sentiment analysis. Sometimes the keyword determining the sentiment

344
00:16:36.720 --> 00:16:40.559
<v Speaker 2>comes late in the sentence, or for predicting a missing word,

345
00:16:40.840 --> 00:16:43.240
<v Speaker 2>knowing the words that come after it is just as

346
00:16:43.240 --> 00:16:45.039
<v Speaker 2>important as knowing the words before it.

347
00:16:45.039 --> 00:16:47.960
<v Speaker 1>Like that "I lived in [country]" Gaelic example, right? Seeing

348
00:16:48.000 --> 00:16:49.679
<v Speaker 1>Gaelic later helps figure out the country.

349
00:16:49.519 --> 00:16:53.200
<v Speaker 2>Precisely. The backward pass provides that future context,

350
00:16:53.360 --> 00:16:56.879
<v Speaker 2>and you can feed pretrained embeddings like GloVe vectors

351
00:16:56.919 --> 00:16:59.080
<v Speaker 2>into these LSTMs to give them a head start on

352
00:16:59.240 --> 00:16:59.840
<v Speaker 2>word meaning.

353
00:17:00.080 --> 00:17:02.360
<v Speaker 1>Okay, so we can use these models to understand text.

354
00:17:02.720 --> 00:17:04.880
<v Speaker 1>What about generating text? How does that work?

355
00:17:05.440 --> 00:17:08.599
<v Speaker 2>The core idea is pretty straightforward, actually. You train a

356
00:17:08.640 --> 00:17:11.279
<v Speaker 2>model to predict the next word in a sequence given

357
00:17:11.319 --> 00:17:12.200
<v Speaker 2>the preceding words.

358
00:17:12.400 --> 00:17:14.720
<v Speaker 1>So if your training data has the sentence "the quick

359
00:17:14.759 --> 00:17:19.640
<v Speaker 1>brown fox," you'd create training examples like input "the," label

360
00:17:19.720 --> 00:17:22.599
<v Speaker 1>"quick"; input "the quick," label "brown"; input "the quick brown,"

361
00:17:22.680 --> 00:17:23.880
<v Speaker 1>label "fox."

362
00:17:24.160 --> 00:17:27.400
<v Speaker 2>Exactly. You slide a window across your text corpus, creating these

363
00:17:27.440 --> 00:17:31.559
<v Speaker 2>input sequences and their corresponding next word labels. The labels

364
00:17:31.559 --> 00:17:34.240
<v Speaker 2>are usually one-hot encoded: a vector of zeros with

365
00:17:34.279 --> 00:17:37.440
<v Speaker 2>a single one at the index corresponding to the correct

366
00:17:37.640 --> 00:17:39.640
<v Speaker 2>next word in your vocabulary.
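
NOTE
Editor's sketch of building next-word training pairs and one-hot labels from the "the quick brown fox" example. build_examples and one_hot are hypothetical helper names; a real pipeline would also map words to integer tokens and pad the inputs to a fixed length, which is omitted here.
```python
def build_examples(sentence):
    """Turn one sentence into (input words, next word) pairs."""
    words = sentence.lower().split()
    return [(words[:i], words[i]) for i in range(1, len(words))]
def one_hot(word, vocabulary):
    """Vector of zeros with a single one at the word's index."""
    return [1 if candidate == word else 0 for candidate in vocabulary]
sentence = "the quick brown fox"
vocabulary = sorted(set(sentence.split()))
for inputs, label in build_examples(sentence):
    print(inputs, label, one_hot(label, vocabulary))
```
Each printed row is one training example: the words seen so far, the word to predict, and that word's one-hot label.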

367
00:17:39.599 --> 00:17:42.559
<v Speaker 1>And the model architecture would be similar: an embedding layer, then maybe an

368
00:17:42.640 --> 00:17:45.079
<v Speaker 1>LSTM or bidirectional LSTM.

369
00:17:44.720 --> 00:17:48.359
<v Speaker 2>Yep, very common. Then to generate text, you start with

370
00:17:48.440 --> 00:17:51.480
<v Speaker 2>a seed text, maybe a word or a phrase. You

371
00:17:51.559 --> 00:17:55.200
<v Speaker 2>feed that seed sequence into your trained model. It outputs

372
00:17:55.200 --> 00:17:58.359
<v Speaker 2>a probability distribution over all the words in your vocabulary

373
00:17:58.359 --> 00:17:59.960
<v Speaker 2>for what the next word is most likely to be.

374
00:18:00.559 --> 00:18:03.440
<v Speaker 1>You pick a word based on those probabilities, maybe the

375
00:18:03.440 --> 00:18:04.319
<v Speaker 1>most likely one?

376
00:18:04.319 --> 00:18:06.960
<v Speaker 2>Usually, yeah, or sometimes you sample from the distribution to

377
00:18:06.960 --> 00:18:10.079
<v Speaker 2>get more variety. Then you append that predicted word to

378
00:18:10.160 --> 00:18:13.400
<v Speaker 2>your seed text. Now you have a slightly longer sequence.

379
00:18:13.359 --> 00:18:15.279
<v Speaker 1>And you feed that new sequence back into the model to

380
00:18:15.319 --> 00:18:17.240
<v Speaker 1>predict the next word and repeat.

381
00:18:17.480 --> 00:18:20.400
<v Speaker 2>That's the loop: keep feeding the growing sequence back in,

382
00:18:20.599 --> 00:18:22.680
<v Speaker 2>predicting the next word, appending it.
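
NOTE
Editor's sketch of the predict-and-append generation loop. As a loud assumption, the "model" here is a toy bigram counter that only looks at the last word, standing in for the trained embedding-plus-LSTM network described in the discussion; train_bigram_model and generate are hypothetical names.
```python
from collections import Counter, defaultdict
def train_bigram_model(text):
    """Count which word follows which; a stand-in for a trained network."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following
def generate(model, seed_text, num_words):
    """Repeatedly predict the next word and append it to the sequence."""
    words = seed_text.lower().split()
    if not words:
        return ""
    for _ in range(num_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # the last word was never seen with a successor
        next_word = candidates.most_common(1)[0][0]  # greedy: pick the most likely word
        words.append(next_word)
    return " ".join(words)
corpus = "the quick brown fox jumps over the lazy dog"
model = train_bigram_model(corpus)
print(generate(model, "quick", 3))
```
Swapping the greedy pick for sampling from the predicted distribution is the variation mentioned above for getting more variety.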

383
00:18:22.720 --> 00:18:25.799
<v Speaker 1>The book had an example using song lyrics, right? Starting

384
00:18:25.839 --> 00:18:28.839
<v Speaker 1>with "in the town of Athy."

385
00:18:28.440 --> 00:18:31.759
<v Speaker 2>Yeah, and the model correctly predicted "one," which was the actual

386
00:18:31.799 --> 00:18:33.720
<v Speaker 2>next word in the song it was trained on, and

387
00:18:33.839 --> 00:18:37.440
<v Speaker 2>using different seeds like "sweet Jeremy saw Dublin" produced other

388
00:18:37.519 --> 00:18:40.839
<v Speaker 2>plausible next words based on the patterns learned from the lyrics.

389
00:18:41.359 --> 00:18:44.359
<v Speaker 1>Though it's fair to say these simple generation models can

390
00:18:44.400 --> 00:18:47.759
<v Speaker 1>often start repeating themselves or outputting stuff that doesn't make

391
00:18:47.839 --> 00:18:48.799
<v Speaker 1>much sense after a while.

392
00:18:48.799 --> 00:18:52.799
<v Speaker 2>No, definitely, they can descend into gibberish quite quickly. Getting

393
00:18:52.799 --> 00:18:56.440
<v Speaker 2>coherent long form text generation is much harder. It often

394
00:18:56.440 --> 00:19:01.240
<v Speaker 2>involves more complex architectures, maybe stacking multiple LSTMs, careful tuning

395
00:19:01.279 --> 00:19:05.960
<v Speaker 2>of hyperparameters, and more sophisticated sampling strategies. The generated

396
00:19:05.960 --> 00:19:08.359
<v Speaker 2>Shakespearean text in the book is an example of getting

397
00:19:08.400 --> 00:19:11.799
<v Speaker 2>something more structured, even if parts are nonsensical.

398
00:19:11.920 --> 00:19:15.640
<v Speaker 1>Right. Okay, images, text. What about data where the sequence is

399
00:19:15.720 --> 00:19:19.880
<v Speaker 1>time itself, time series data like weather or stock prices?

400
00:19:20.079 --> 00:19:24.640
<v Speaker 2>Yeah, time series data is everywhere, weather forecasts, stock market trends,

401
00:19:24.680 --> 00:19:28.400
<v Speaker 2>sensor readings over time, even something like Moore's law tracking

402
00:19:28.440 --> 00:19:33.440
<v Speaker 2>transistor density. It's all data points ordered chronologically.

403
00:19:33.440 --> 00:19:36.599
<v Speaker 1>And this kind of data often has specific characteristics you need to understand.

404
00:19:36.720 --> 00:19:41.839
<v Speaker 2>Absolutely. There's often an overall trend: is the value generally

405
00:19:42.240 --> 00:19:46.839
<v Speaker 2>increasing or decreasing over time? There's seasonality, patterns that repeat

406
00:19:46.880 --> 00:19:51.720
<v Speaker 2>at regular intervals. Think daily temperature cycles, weekly website traffic spikes,

407
00:19:52.039 --> 00:19:56.920
<v Speaker 2>yearly retail sales patterns. Okay, there's autocorrelation, meaning the value

408
00:19:56.920 --> 00:19:59.079
<v Speaker 2>at one point in time is correlated with values at

409
00:19:59.079 --> 00:20:01.920
<v Speaker 2>previous points, like if it's hot today, it's probably going

410
00:20:01.960 --> 00:20:05.240
<v Speaker 2>to be warmish tomorrow, or maybe a predictable decay after

411
00:20:05.279 --> 00:20:06.039
<v Speaker 2>some event.

412
00:20:05.960 --> 00:20:08.640
<v Speaker 1>And then there's always just random noise, right? Fluctuations you can't

413
00:20:08.680 --> 00:20:09.920
<v Speaker 1>really predict.

414
00:20:10.319 --> 00:20:12.640
<v Speaker 2>Exactly. Understanding these components helps in modeling.

415
00:20:13.000 --> 00:20:15.759
<v Speaker 1>So how do you prep time series data for an

416
00:20:15.880 --> 00:20:18.000
<v Speaker 1>ML model? It's not exactly sentences.

417
00:20:18.079 --> 00:20:21.279
<v Speaker 2>The standard technique is windowing. You essentially turn the time

418
00:20:21.319 --> 00:20:24.640
<v Speaker 2>series prediction problem into a supervised learning problem.

419
00:20:24.680 --> 00:20:25.400
<v Speaker 3>How does that work?

420
00:20:25.640 --> 00:20:30.480
<v Speaker 2>You create fixed size input sequences or windows of past

421
00:20:30.599 --> 00:20:33.680
<v Speaker 2>data points. For example, a window might be the last

422
00:20:33.720 --> 00:20:37.160
<v Speaker 2>thirty days of temperature readings, and the corresponding label for

423
00:20:37.200 --> 00:20:40.000
<v Speaker 2>that window is typically the value you want to predict,

424
00:20:40.240 --> 00:20:42.920
<v Speaker 2>maybe the temperature reading for the next day, day thirty one.
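
NOTE
Editor's sketch of windowing a time series into supervised (window, label) pairs. make_windows is a hypothetical helper and the temperature readings are invented; a window of 3 is used instead of 30 only to keep the output short.
```python
def make_windows(series, window_size):
    """Slide a fixed-size window over the series; the label is the next value."""
    examples = []
    for start in range(len(series) - window_size):
        window = series[start:start + window_size]
        label = series[start + window_size]
        examples.append((window, label))
    return examples
temperatures = [20, 21, 19, 22, 24, 23]
for window, label in make_windows(temperatures, window_size=3):
    print(window, label)
```
Each row pairs three past readings with the reading that followed them, which is the shape of training data described above.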

425
00:20:43.079 --> 00:20:46.319
<v Speaker 1>Okay, So you slide this window across your entire time

426
00:20:46.400 --> 00:20:51.039
<v Speaker 1>series history, creating lots of (input window, next value) pairs.

427
00:20:51.400 --> 00:20:52.799
<v Speaker 2>Precisely. That becomes your training dataset.

428
00:20:52.640 --> 00:20:55.519
<v Speaker 1>And before you build a fancy ML model, you'd

429
00:20:55.519 --> 00:20:58.079
<v Speaker 1>probably want some simple baselines to compare against.

430
00:20:58.279 --> 00:21:00.400
<v Speaker 2>Definitely. You need to know if your ML model is

431
00:21:00.400 --> 00:21:04.200
<v Speaker 2>actually adding value. The simplest baseline is the naive forecast.

432
00:21:04.960 --> 00:21:06.920
<v Speaker 2>Just predict that the next value will be the same

433
00:21:06.960 --> 00:21:09.519
<v Speaker 2>as the last observed value, so predict tomorrow's temperature will

434
00:21:09.519 --> 00:21:10.480
<v Speaker 2>be the same as today's.

435
00:21:10.599 --> 00:21:13.799
<v Speaker 1>Or maybe a moving average, averaging the values in the last window.

436
00:21:13.559 --> 00:21:16.680
<v Speaker 2>Right, that smooths out noise, but often lags behind

437
00:21:16.680 --> 00:21:18.519
<v Speaker 2>trends and doesn't capture seasonality.

438
00:21:18.519 --> 00:21:18.759
<v Speaker 2>well.

439
00:21:19.079 --> 00:21:23.160
<v Speaker 2>You calculate the error of these baselines, maybe mean absolute

440
00:21:23.319 --> 00:21:26.000
<v Speaker 2>error, MAE, and that gives you a target for your

441
00:21:26.079 --> 00:21:26.920
<v Speaker 2>ML model to beat.
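
NOTE
Editor's sketch of the two baselines and the MAE comparison. The series values, the window size of 2, and the function names are assumptions for illustration; both baselines are scored on the same time steps so the numbers are comparable.
```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
def naive_forecast(series, start):
    """Predict each value from index start onward as the value just before it."""
    return series[start - 1:-1]
def moving_average_forecast(series, window_size):
    """Predict each value as the mean of the window_size values before it."""
    return [
        sum(series[i - window_size:i]) / window_size
        for i in range(window_size, len(series))
    ]
series = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
window_size = 2
actual = series[window_size:]
naive_mae = mean_absolute_error(actual, naive_forecast(series, window_size))
ma_mae = mean_absolute_error(actual, moving_average_forecast(series, window_size))
print(naive_mae, ma_mae)
```
Whichever baseline scores lower sets the MAE that a trained model has to beat before it is worth deploying.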

442
00:21:27.279 --> 00:21:30.920
<v Speaker 1>What ML models work well here on this windowed data?

443
00:21:30.880 --> 00:21:35.720
<v Speaker 2>You can try basic dense neural networks, DNNs, but architectures that

444
00:21:35.799 --> 00:21:39.119
<v Speaker 2>understand sequences are often better. You can use 1D convolutions,

445
00:21:40.079 --> 00:21:43.599
<v Speaker 2>similar to how 2D CNNs find spatial patterns in images.

446
00:21:44.079 --> 00:21:47.960
<v Speaker 2>1D CNNs can find patterns across consecutive time steps within your window.

447
00:21:48.440 --> 00:21:50.480
<v Speaker 2>You need to use causal padding, though, to make sure

448
00:21:50.519 --> 00:21:53.319
<v Speaker 2>the convolution only looks at past data, not future data

449
00:21:53.359 --> 00:21:54.400
<v Speaker 2>it shouldn't know about.
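
NOTE
Editor's sketch of causal padding in a 1D convolution, written in plain Python rather than with a framework. causal_conv1d is a hypothetical helper; in Keras the same behavior is requested with padding="causal" on a Conv1D layer.
```python
def causal_conv1d(series, kernel):
    """1D convolution where the output at step t only sees inputs at t and earlier."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(series)  # pad on the left only, never on the right
    outputs = []
    for t in range(len(series)):
        window = padded[t:t + len(kernel)]  # the window ends at the current time step
        outputs.append(sum(w * x for w, x in zip(kernel, window)))
    return outputs
series = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]  # averages the previous value and the current value
print(causal_conv1d(series, kernel))
```
Changing the last value of the series leaves every earlier output untouched, which is the guarantee causal padding is meant to give.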

450
00:21:54.240 --> 00:21:57.519
<v Speaker 1>Makes sense. And RNNs, LSTMs, GRUs?

451
00:21:58.039 --> 00:22:00.880
<v Speaker 2>Absolutely. Since time series is inherently sequential, RNNs are a

452
00:22:00.960 --> 00:22:04.279
<v Speaker 2>natural fit. They can maintain state across the timesteps within

453
00:22:04.319 --> 00:22:08.559
<v Speaker 2>the window, potentially capturing complex temporal patterns and dependencies that

454
00:22:08.599 --> 00:22:11.039
<v Speaker 2>simple DNNs or even CNNs might miss.

455
00:22:11.240 --> 00:22:14.599
<v Speaker 1>So again, it's about experimenting with different window sizes, architectures,

456
00:22:14.680 --> 00:22:17.440
<v Speaker 1>hyperparameters to see what works best for your specific

457
00:22:17.480 --> 00:22:18.119
<v Speaker 1>time series.

458
00:22:18.160 --> 00:22:21.119
<v Speaker 2>Exactly, different data sets will respond better to different approaches.

459
00:22:21.240 --> 00:22:23.680
<v Speaker 1>Okay, so we've trained all these amazing models for vision,

460
00:22:23.960 --> 00:22:27.400
<v Speaker 1>language, time series. Now the crucial part for a coder:

461
00:22:28.039 --> 00:22:30.559
<v Speaker 1>how do you actually use them in an application? Deployment.

462
00:22:30.839 --> 00:22:33.559
<v Speaker 2>Right, if you want to run your model directly on

463
00:22:33.599 --> 00:22:35.920
<v Speaker 2>a user's device, like in a native mobile app on

464
00:22:36.279 --> 00:22:39.400
<v Speaker 2>Android or iOS, or maybe on an embedded system like

465
00:22:39.400 --> 00:22:43.799
<v Speaker 2>a Raspberry Pi, the main tool is TensorFlow Lite.

466
00:22:43.599 --> 00:22:46.160
<v Speaker 1>TFLite. And why run it on the device, on

467
00:22:46.200 --> 00:22:48.559
<v Speaker 1>the edge? Why not just call a server API?

468
00:22:48.400 --> 00:22:51.480
<v Speaker 2>Several big advantages. Latency: the prediction happens right

469
00:22:51.559 --> 00:22:56.000
<v Speaker 2>there instantly, no network delay. Connectivity: it works even if

470
00:22:56.039 --> 00:22:59.440
<v Speaker 2>the device is offline. And privacy: the user's data, like

471
00:22:59.480 --> 00:23:02.599
<v Speaker 2>an image you want to classify, doesn't have to leave their device,

472
00:23:02.759 --> 00:23:04.240
<v Speaker 2>which is huge nowadays.

473
00:23:04.319 --> 00:23:06.400
<v Speaker 1>Okay, that makes sense. So how does it work? Let's

474
00:23:06.400 --> 00:23:09.039
<v Speaker 1>take our simple y = 2x - 1 model again.

475
00:23:09.279 --> 00:23:11.799
<v Speaker 2>So you train your model as usual, probably in Python

476
00:23:11.880 --> 00:23:15.200
<v Speaker 2>using TensorFlow. You save the trained model. Then you use

477
00:23:15.200 --> 00:23:17.480
<v Speaker 2>a tool called tf.lite.TFLiteConverter

478
00:23:17.519 --> 00:23:20.880
<v Speaker 2>to convert your saved TensorFlow model into a special

479
00:23:20.960 --> 00:23:22.720
<v Speaker 2>optimized .tflite format.

480
00:23:22.880 --> 00:23:24.759
<v Speaker 3>So you get this lightweight .tflite file.

481
00:23:24.839 --> 00:23:27.519
<v Speaker 2>Yep, it's usually much smaller. Then in your mobile app

482
00:23:27.559 --> 00:23:31.160
<v Speaker 2>code, Java or Kotlin for Android, Swift or Objective-C for iOS, or

483
00:23:31.200 --> 00:23:33.720
<v Speaker 2>your embedded system code, like Python or C++

484
00:23:33.720 --> 00:23:36.680
<v Speaker 2>on a Pi, you use the TFLite interpreter library.

485
00:23:36.920 --> 00:23:39.519
<v Speaker 2>You load that .tflite file into the interpreter.

486
00:23:40.160 --> 00:23:42.559
<v Speaker 2>You prepare your input data, making sure it has the

487
00:23:42.640 --> 00:23:45.960
<v Speaker 2>exact shape and data type the model expects. For example,

488
00:23:46.039 --> 00:23:48.559
<v Speaker 2>that might be a float32 array containing

489
00:23:48.599 --> 00:23:50.160
<v Speaker 2>10.0 with a shape of (1, 1).

490
00:23:50.319 --> 00:23:52.200
<v Speaker 3>Gotta get the details right there.

491
00:23:52.720 --> 00:23:55.240
<v Speaker 2>Absolutely. Then you pass that input to the interpreter, invoke it,

492
00:23:55.680 --> 00:23:58.680
<v Speaker 2>run the prediction, and it gives you back the output tensor,

493
00:23:58.680 --> 00:24:01.519
<v Speaker 2>which would contain something like 18.97.

494
00:24:01.359 --> 00:24:04.160
<v Speaker 1>The prediction for y. What about more complex things, like

495
00:24:04.240 --> 00:24:06.079
<v Speaker 1>running an image classifier on mobile?

496
00:24:06.440 --> 00:24:08.839
<v Speaker 2>The core process is the same: convert to

497
00:24:09.000 --> 00:24:12.319
<v Speaker 2>.tflite, use the interpreter. The trickier part is usually getting

498
00:24:12.319 --> 00:24:15.000
<v Speaker 2>the image data into the right format before feeding it

499
00:24:15.039 --> 00:24:18.200
<v Speaker 2>to the interpreter. How so? Mobile platforms have their own

500
00:24:18.240 --> 00:24:23.759
<v Speaker 2>image formats, like Android's Bitmap or iOS's UIImage. Your CNN model, however,

501
00:24:23.920 --> 00:24:27.839
<v Speaker 2>probably expects input as, say, a 224 by

502
00:24:27.880 --> 00:24:30.279
<v Speaker 2>224 by 3 tensor of floating-point numbers

503
00:24:30.319 --> 00:24:32.880
<v Speaker 2>normalized between 0 and 1. So you need code to take

504
00:24:32.880 --> 00:24:35.880
<v Speaker 2>the native image, resize it, extract the pixel values, maybe

505
00:24:35.880 --> 00:24:38.319
<v Speaker 2>dealing with raw byte buffers and bit shifting for colors,

506
00:24:38.720 --> 00:24:41.480
<v Speaker 2>normalize them and arrange them into the correct tensor shape.
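
NOTE
Editor's sketch of the pixel wrangling step, shown in Python for consistency even though on-device code would be Kotlin, Java, or Swift. It assumes pixels arrive as packed 32-bit ARGB integers (the layout Android uses for color ints); argb_to_normalized_rgb is a hypothetical helper, and resizing to the model's input size is omitted.
```python
def argb_to_normalized_rgb(pixels, width, height):
    """Convert packed ARGB ints into a height x width x 3 nested list of floats in [0, 1]."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            value = pixels[y * width + x]
            red = (value >> 16) & 0xFF  # shift the channel down, then mask to 8 bits
            green = (value >> 8) & 0xFF
            blue = value & 0xFF
            row.append([red / 255.0, green / 255.0, blue / 255.0])
        rows.append(row)
    return rows
pixels = [0xFFFF0000, 0xFF00FF00, 0xFF0000FF, 0xFFFFFFFF]  # red, green, blue, white
tensor = argb_to_normalized_rgb(pixels, width=2, height=2)
print(tensor[0][0], tensor[1][1])
```
The alpha channel is dropped because the model in the example expects three color channels.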

507
00:24:41.519 --> 00:24:43.839
<v Speaker 1>All right, there's some data wrangling involved, yeah, and some

508
00:24:43.920 --> 00:24:47.240
<v Speaker 1>platform-specific setup for the interpreter itself in the app project.

509
00:24:47.279 --> 00:24:52.279
<v Speaker 2>Definitely. TFLite also offers optimization, particularly quantization. During the

510
00:24:52.279 --> 00:24:54.720
<v Speaker 2>conversion process, you can tell it to convert the model's

511
00:24:54.799 --> 00:24:59.039
<v Speaker 2>weights from 32-bit floats to, say, 8-bit integers,

512
00:24:59.079 --> 00:25:01.599
<v Speaker 2>and that makes the model much smaller, often a 4x

513
00:25:01.720 --> 00:25:04.920
<v Speaker 2>reduction in file size, and it usually runs faster too,

514
00:25:04.960 --> 00:25:07.799
<v Speaker 2>maybe a 2-3x speedup on mobile CPUs or

515
00:25:07.880 --> 00:25:10.000
<v Speaker 2>specialized hardware like Edge TPUs.

516
00:25:10.119 --> 00:25:10.720
<v Speaker 3>Is there a catch?

517
00:25:11.079 --> 00:25:13.640
<v Speaker 2>There can be a small loss in accuracy because you're

518
00:25:13.680 --> 00:25:17.200
<v Speaker 2>reducing precision. You always need to test the quantized model

519
00:25:17.240 --> 00:25:19.880
<v Speaker 2>to see if the accuracy trade off is acceptable for

520
00:25:20.000 --> 00:25:21.440
<v Speaker 2>your specific application.
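
NOTE
Editor's sketch of what 8-bit weight quantization trades away. This is a simplified symmetric scheme with one scale for all weights, written for illustration; it is an assumption and not a description of what the TFLite converter does internally. The weight values are made up.
```python
from array import array
def quantize(weights):
    """Symmetric 8-bit quantization: one float scale, integers in [-127, 127]."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest else 1.0
    return [round(w / scale) for w in weights], scale
def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
float_bytes = len(array("f", weights).tobytes())  # 4 bytes per weight
int8_bytes = len(array("b", quantized).tobytes())  # 1 byte per weight
print(float_bytes, int8_bytes, max_error)
```
The storage drops by a factor of four while each restored weight is off by at most about half a quantization step, which is the small accuracy cost worth testing for.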

521
00:25:21.599 --> 00:25:24.279
<v Speaker 1>Okay, so TFLite for native and embedded. What if you

522
00:25:24.319 --> 00:25:28.039
<v Speaker 1>want ML in the browser or for a Node.js backend?

523
00:25:27.960 --> 00:25:30.960
<v Speaker 2>Then you're looking at TensorFlow.js, TFJS. It lets you

524
00:25:31.000 --> 00:25:34.000
<v Speaker 2>define, train, and run ML models entirely in JavaScript.

525
00:25:34.039 --> 00:25:36.000
<v Speaker 1>You can actually train models in the browser?

526
00:25:36.279 --> 00:25:39.319
<v Speaker 2>You can. You can define a model layer by layer,

527
00:25:39.400 --> 00:25:42.680
<v Speaker 2>similar to how you do it in Python, using JavaScript APIs.

528
00:25:43.119 --> 00:25:47.240
<v Speaker 2>You'd use TFJS tensors for your data. Training often involves

529
00:25:47.279 --> 00:25:50.160
<v Speaker 2>async/await because it happens asynchronously in the browser.

530
00:25:50.599 --> 00:25:52.839
<v Speaker 2>You can build that y = 2x - 1 model or

531
00:25:52.960 --> 00:25:56.559
<v Speaker 2>even more complex things like image classifiers or models that

532
00:25:56.599 --> 00:25:59.240
<v Speaker 2>work on CSV data, all within JavaScript.

533
00:25:59.279 --> 00:25:59.920
<v Speaker 3>That's pretty cool.

534
00:26:00.279 --> 00:26:02.839
<v Speaker 1>But maybe the biggest win for web devs is using

535
00:26:02.880 --> 00:26:04.480
<v Speaker 1>models someone else has already trained.

536
00:26:04.559 --> 00:26:08.400
<v Speaker 2>Absolutely, that's the power of pre converted JavaScript models. Places

537
00:26:08.440 --> 00:26:12.200
<v Speaker 2>like tensorflow.org and TF Hub offer many sophisticated models

538
00:26:12.240 --> 00:26:14.400
<v Speaker 2>already converted to the TFJS format.

539
00:26:14.359 --> 00:26:16.519
<v Speaker 1>So you can just load them and use them with minimal

540
00:26:16.559 --> 00:26:17.440
<v Speaker 1>code.

541
00:26:17.799 --> 00:26:21.359
<v Speaker 2>Exactly. There are models for toxicity detection in text, image classification

542
00:26:21.640 --> 00:26:23.920
<v Speaker 2>using things like MobileNet, where you can just pass

543
00:26:23.960 --> 00:26:27.240
<v Speaker 2>an img tag, pose detection with PoseNet that gives you

544
00:26:27.279 --> 00:26:30.240
<v Speaker 2>coordinates of body joints from an image or video. You

545
00:26:30.240 --> 00:26:33.160
<v Speaker 2>can integrate these powerful features into your web app pretty easily.

546
00:26:33.359 --> 00:26:35.319
<v Speaker 1>Just load the model script, write a few lines of

547
00:26:35.359 --> 00:26:37.200
<v Speaker 1>JavaScript to run inference.

548
00:26:36.960 --> 00:26:40.519
<v Speaker 2>Pretty much, and you can even do transfer learning in TFJS.

549
00:26:40.799 --> 00:26:43.200
<v Speaker 2>You can load a pretrained model like MobileNet, use

550
00:26:43.240 --> 00:26:46.640
<v Speaker 2>a function like model.infer(img, embedding) to grab

551
00:26:46.720 --> 00:26:49.680
<v Speaker 2>the internal feature embeddings for your own images, and then

552
00:26:49.799 --> 00:26:52.960
<v Speaker 2>use those embeddings as input to train a small new

553
00:26:53.039 --> 00:26:57.160
<v Speaker 2>TFJS model tailored to your specific classification task.

554
00:26:57.400 --> 00:26:57.880
<v Speaker 3>Nice.

555
00:26:58.279 --> 00:27:02.559
<v Speaker 1>Okay, one more deployment scenario. What if you need a dedicated,

556
00:27:02.640 --> 00:27:06.480
<v Speaker 1>scalable server for running predictions, maybe lots of users hitting

557
00:27:06.519 --> 00:27:06.960
<v Speaker 1>it at once.

558
00:27:07.240 --> 00:27:10.839
<v Speaker 2>For that kind of robust production environment, you'd use TensorFlow Serving.

559
00:27:11.240 --> 00:27:14.359
<v Speaker 2>It's specifically designed to deploy TensorFlow models as a high

560
00:27:14.400 --> 00:27:15.599
<v Speaker 2>performance inference server.

561
00:27:15.759 --> 00:27:16.559
<v Speaker 3>What does it handle for you?

562
00:27:16.799 --> 00:27:20.079
<v Speaker 2>It's built for low latency and high throughput. Crucially, it

563
00:27:20.160 --> 00:27:23.680
<v Speaker 2>handles model versioning really well. You can deploy multiple versions

564
00:27:23.720 --> 00:27:25.920
<v Speaker 2>of the same model, say v1 and a new

565
00:27:26.000 --> 00:27:28.960
<v Speaker 2>v2 you just trained, and TensorFlow Serving can manage

566
00:27:28.960 --> 00:27:32.759
<v Speaker 2>serving requests to either version or transition traffic smoothly.

567
00:27:32.680 --> 00:27:35.240
<v Speaker 3>So you can A/B test models or roll out updates safely.

568
00:27:35.640 --> 00:27:40.599
<v Speaker 2>Exactly. It typically exposes APIs like REST or gRPC. Your

569
00:27:40.640 --> 00:27:44.720
<v Speaker 2>application sends an inference request with the data, TensorFlow Serving

570
00:27:44.960 --> 00:27:49.119
<v Speaker 2>loads the correct model version, often managed via configuration files,

571
00:27:49.559 --> 00:27:53.000
<v Speaker 2>runs the prediction efficiently and sends the result back. It's

572
00:27:53.000 --> 00:27:55.720
<v Speaker 2>built for reliable, scalable production use.
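
NOTE
Editor's sketch of what a client request to the REST endpoint could look like. The URL pattern follows TensorFlow Serving's documented predict route; the host localhost:8501, the model name my_model, and the helper build_predict_request are assumptions for illustration. The code only builds the request and does not send it.
```python
import json
def build_predict_request(host, model_name, instances, version=None):
    """Return the URL and JSON body for a TensorFlow Serving REST predict call."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"  # pin a specific model version
    url = f"http://{host}{path}:predict"
    body = json.dumps({"instances": instances})
    return url, body
url, body = build_predict_request("localhost:8501", "my_model", [[10.0]], version=2)
print(url)
print(body)
```
Leaving out the version lets the server choose which loaded version answers, while pinning it is one way to compare v1 and v2 side by side.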

573
00:27:55.920 --> 00:27:58.359
<v Speaker 1>Okay, we've covered a ton of ground from the basic

574
00:27:58.440 --> 00:28:02.799
<v Speaker 1>concepts through vision, language, time series, and now all these

575
00:28:02.839 --> 00:28:06.279
<v Speaker 1>ways to deploy models. This really pulls together the practical

576
00:28:06.319 --> 00:28:07.359
<v Speaker 1>side from Moroney's book.

577
00:28:07.480 --> 00:28:10.440
<v Speaker 2>Yeah, we started with that fundamental shift learning rules from

578
00:28:10.519 --> 00:28:13.240
<v Speaker 2>data instead of writing them, and saw how TensorFlow provides

579
00:28:13.279 --> 00:28:14.759
<v Speaker 2>the platform for coders to do that.

580
00:28:14.880 --> 00:28:18.319
<v Speaker 1>Then we dove into computer vision: CNNs, dealing with data

581
00:28:18.319 --> 00:28:20.720
<v Speaker 1>limits using augmentation and transfer learning.

582
00:28:20.839 --> 00:28:25.319
<v Speaker 2>And NLP: tokenizing, embeddings to capture meaning, and RNNs like LSTMs,

583
00:28:25.400 --> 00:28:28.960
<v Speaker 2>especially bidirectional ones, to understand sequence and context, even

584
00:28:28.960 --> 00:28:29.960
<v Speaker 2>for generating text.

585
00:28:30.039 --> 00:28:33.240
<v Speaker 1>We touched on time series: windowing data, using CNNs and

586
00:28:33.319 --> 00:28:34.640
<v Speaker 1>RNNs for prediction.

587
00:28:34.480 --> 00:28:37.680
<v Speaker 2>And finally, getting those models out there: TFLite for devices,

588
00:28:37.720 --> 00:28:40.759
<v Speaker 2>TFJS for the web, and TF Serving for scalable backend

589
00:28:40.880 --> 00:28:41.480
<v Speaker 2>inference.

590
00:28:41.799 --> 00:28:44.799
<v Speaker 1>It really feels like these tools make AI and ML

591
00:28:45.119 --> 00:28:47.839
<v Speaker 1>much more practical for coders today. You don't need to

592
00:28:47.839 --> 00:28:50.960
<v Speaker 1>be a deep research scientist to start building things.

593
00:28:51.000 --> 00:28:54.960
<v Speaker 2>Exactly. The focus shifts to understanding the techniques: when to use

594
00:28:55.000 --> 00:28:58.160
<v Speaker 2>a CNN versus an LSTM, how to prepare your data,

595
00:28:58.640 --> 00:29:01.079
<v Speaker 2>how to leverage pretrained models, and applying them

596
00:29:01.079 --> 00:29:02.720
<v Speaker 2>to your problems.

597
00:29:02.559 --> 00:29:03.279
<v Speaker 3>Which leads to a final thought.

598
00:29:03.359 --> 00:29:06.759
<v Speaker 1>Maybe. We talked about transfer learning using these powerful

599
00:29:06.839 --> 00:29:09.240
<v Speaker 1>pretrained models, and also about running models right on the

600
00:29:09.279 --> 00:29:13.920
<v Speaker 1>device with TFLite for speed and privacy. So thinking

601
00:29:13.960 --> 00:29:18.240
<v Speaker 1>about those capabilities combined, what kinds of really complex, maybe

602
00:29:18.240 --> 00:29:22.079
<v Speaker 1>even personalized, intelligent features could you start imagining for the

603
00:29:22.119 --> 00:29:24.960
<v Speaker 1>applications you build? Things that maybe seemed impossible just a

604
00:29:24.960 --> 00:29:27.359
<v Speaker 1>few years ago, but are now potentially within reach for

605
00:29:27.400 --> 00:29:28.720
<v Speaker 1>a coder using these tools?
