WEBVTT

1
00:00:00.120 --> 00:00:05.519
<v Speaker 1>Welcome to the deep dive. Today, we're jumping into the

2
00:00:05.559 --> 00:00:08.119
<v Speaker 1>really interesting world of Python image processing.

3
00:00:08.240 --> 00:00:09.800
<v Speaker 2>Yeah, it's a big topic, it is.

4
00:00:10.240 --> 00:00:12.599
<v Speaker 1>And you asked us for a way to quickly get

5
00:00:12.599 --> 00:00:16.719
<v Speaker 1>the main ideas, the techniques for manipulating and understanding images.

6
00:00:16.879 --> 00:00:19.239
<v Speaker 1>So that's what we're doing today.

7
00:00:19.320 --> 00:00:22.679
<v Speaker 2>That's the plan. We're using the Python Image Processing Cookbook

8
00:00:23.079 --> 00:00:26.839
<v Speaker 2>as our guide. It's packed with practical stuff. Right.

9
00:00:26.960 --> 00:00:31.519
<v Speaker 1>Think of it as decoding how computers learn to see

10
00:00:32.039 --> 00:00:33.960
<v Speaker 1>and even change the images we look at.

11
00:00:33.799 --> 00:00:36.640
<v Speaker 2>Every day. Exactly. Our goal, our mission, if you like,

12
00:00:36.759 --> 00:00:39.679
<v Speaker 2>is to pull out the most useful, maybe even surprising

13
00:00:40.200 --> 00:00:41.240
<v Speaker 2>bits from the cookbook.

14
00:00:41.399 --> 00:00:43.840
<v Speaker 1>Yeah, give you that shortcut to the core concepts without

15
00:00:43.840 --> 00:00:47.359
<v Speaker 1>getting lost in all the super technical code details right away.

16
00:00:47.719 --> 00:00:50.880
<v Speaker 1>We're aiming for those aha moments. Ready to dive in?

17
00:00:50.880 --> 00:00:52.880
<v Speaker 2>Let's do it. A really fun place to begin is

18
00:00:52.920 --> 00:00:56.399
<v Speaker 2>creating artistic effects. The cookbook shows some, well, pretty cool

19
00:00:56.399 --> 00:00:58.320
<v Speaker 2>ways to take a normal photo and make it something

20
00:00:58.359 --> 00:00:59.000
<v Speaker 2>else entirely.

21
00:00:59.159 --> 00:01:00.359
<v Speaker 1>Okay, I like the sound of that.

22
00:01:00.439 --> 00:01:03.799
<v Speaker 2>Like what? Well, one is turning photos into cartoons. Oh yeah,

23
00:01:03.840 --> 00:01:06.159
<v Speaker 2>not just a simple filter, I guess. No, no, it's

24
00:01:06.159 --> 00:01:08.079
<v Speaker 2>more involved, a sequence of steps, actually.

25
00:01:08.120 --> 00:01:10.480
<v Speaker 1>All right, walk me through it. How do you

26
00:01:10.519 --> 00:01:12.280
<v Speaker 1>start making a photo look like a cartoon?

27
00:01:12.560 --> 00:01:16.319
<v Speaker 2>First step is something called bilateral filtering. Imagine you want

28
00:01:16.319 --> 00:01:19.120
<v Speaker 2>to smooth out parts of an image, but not the

29
00:01:19.200 --> 00:01:22.680
<v Speaker 2>sharp lines, the edges. Okay. Bilateral filtering does that. It

30
00:01:22.840 --> 00:01:27.239
<v Speaker 2>smooths areas with similar colors, but keeps the important boundaries sharp.

31
00:01:27.959 --> 00:01:31.079
<v Speaker 2>You'd use the bilateralFilter function in OpenCV-Python

32
00:01:31.159 --> 00:01:31.359
<v Speaker 2>for this.

33
00:01:31.680 --> 00:01:35.680
<v Speaker 1>Uh. Okay, so soften the texture, keep the lines, got it.

34
00:01:35.920 --> 00:01:39.040
<v Speaker 2>What's next? Then comes median blurring. This is more about

35
00:01:39.079 --> 00:01:42.760
<v Speaker 2>smoothing out noise and creating those flat blocks of color

36
00:01:42.799 --> 00:01:44.319
<v Speaker 2>you see in cartoons.

37
00:01:43.840 --> 00:01:46.079
<v Speaker 1>Right, like simplifying the textures. Exactly.

38
00:01:46.159 --> 00:01:48.680
<v Speaker 2>The function for that is medianBlur. It sort of

39
00:01:48.719 --> 00:01:51.040
<v Speaker 2>averages out small imperfections.

40
00:01:50.519 --> 00:01:53.920
<v Speaker 1>Makes sense. Flat colors, sharp lines, so the lines need

41
00:01:53.959 --> 00:01:55.159
<v Speaker 1>to be emphasized somehow.

42
00:01:54.680 --> 00:01:57.920
<v Speaker 2>You got it. That's where adaptive thresholding comes in.

43
00:01:58.840 --> 00:02:00.879
<v Speaker 2>This really makes the main edges pop. Think of it

44
00:02:00.959 --> 00:02:03.560
<v Speaker 2>like inking in the outlines. Okay. Even if the lighting

45
00:02:03.640 --> 00:02:07.439
<v Speaker 2>isn't perfect across the image, adaptiveThreshold helps find and

46
00:02:07.560 --> 00:02:09.000
<v Speaker 2>enhance those dominant edges.

47
00:02:09.400 --> 00:02:13.479
<v Speaker 1>Nice bold outlines, flat colors. How do they merge?

48
00:02:13.759 --> 00:02:17.319
<v Speaker 2>The final step uses a bitwise AND operation.

49
00:02:18.039 --> 00:02:20.840
<v Speaker 2>Imagine you have the smooth color image on one layer

50
00:02:21.120 --> 00:02:24.560
<v Speaker 2>and the strong edges on another. The bitwise_and function

51
00:02:24.680 --> 00:02:27.520
<v Speaker 2>basically combines them, so the color fills in up to

52
00:02:27.599 --> 00:02:30.800
<v Speaker 2>those strong edges. That gives you the final cartoon look.

53
00:02:31.159 --> 00:02:33.560
<v Speaker 1>That's actually really clever. It's like a recipe for mimicking

54
00:02:33.599 --> 00:02:36.000
<v Speaker 1>an art style. What other artistic tricks are in there?

55
00:02:36.039 --> 00:02:39.960
<v Speaker 2>There's also simulating light art or long exposure effects. You

56
00:02:40.000 --> 00:02:42.639
<v Speaker 2>know those photos with light trails from cars or water

57
00:02:42.680 --> 00:02:44.080
<v Speaker 2>that looks all smooth and silky.

58
00:02:44.199 --> 00:02:46.360
<v Speaker 1>Oh yeah, those are cool. How's that done?

59
00:02:46.520 --> 00:02:49.599
<v Speaker 2>It's surprisingly simple at its core. You just average together

60
00:02:49.719 --> 00:02:53.159
<v Speaker 2>many frames from a video clip, average them. Yeah, anything

61
00:02:53.240 --> 00:02:56.159
<v Speaker 2>static in the video stays clear when you average the frames,

62
00:02:56.199 --> 00:02:58.879
<v Speaker 2>but anything moving gets blurred together. That's how you get

63
00:02:58.879 --> 00:03:00.400
<v Speaker 2>the light trails or the smooth water.

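NOTE
The frame-averaging idea above can be sketched with NumPy alone. The tiny
synthetic clip is a hypothetical stand-in for frames read from a video
file, and long_exposure is an illustrative name, not a cookbook function.
```python
import numpy as np
def long_exposure(frames):
    """Average an iterable of equally sized uint8 frames into one image."""
    total = None
    count = 0
    for frame in frames:
        f = frame.astype(np.float64)  # accumulate in float to avoid uint8 overflow
        total = f if total is None else total + f
        count += 1
    if count == 0:
        raise ValueError("no frames given")
    return np.round(total / count).astype(np.uint8)
# Static gray background with one bright dot moving left to right.
clip = []
for x in range(8):
    frame = np.full((4, 8), 50, dtype=np.uint8)
    frame[2, x] = 250
    clip.append(frame)
result = long_exposure(clip)
```
The static background keeps its value of 50, while the moving dot is
smeared into a faint streak of value 75 along the whole of row 2.
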
64
00:03:00.639 --> 00:03:03.919
<v Speaker 1>Ah. Right, So if you film traffic at night, the

65
00:03:03.919 --> 00:03:07.400
<v Speaker 1>buildings would be sharp, but the headlights would become streaks

66
00:03:07.439 --> 00:03:10.159
<v Speaker 1>across the image, like leaving the camera shutter open.

67
00:03:09.879 --> 00:03:14.120
<v Speaker 2>Precisely, the digital equivalent. The cookbook mentions getting that

68
00:03:14.240 --> 00:03:16.080
<v Speaker 2>silky water look this way.

69
00:03:15.960 --> 00:03:19.639
<v Speaker 1>Clever, very clever. Yeah. What about drawing styles, like pencil sketches?

70
00:03:19.840 --> 00:03:22.680
<v Speaker 2>Yep. The cookbook covers that too. It uses different kinds

71
00:03:22.680 --> 00:03:25.120
<v Speaker 2>of edge detection to pull out the outlines and details,

72
00:03:25.159 --> 00:03:26.439
<v Speaker 2>kind of like an artist would.

73
00:03:26.360 --> 00:03:29.159
<v Speaker 1>Edge detection, finding the sharp changes in brightness, right?

74
00:03:29.280 --> 00:03:29.439
<v Speaker 2>Yeah.

75
00:03:29.479 --> 00:03:31.599
<v Speaker 1>The book mentioned a few ways for sketches.

76
00:03:31.719 --> 00:03:35.560
<v Speaker 2>Exactly. One is using difference of Gaussians, DoG.

77
00:03:35.280 --> 00:03:37.479
<v Speaker 1>DoG, XDoG. Okay.

78
00:03:37.560 --> 00:03:39.879
<v Speaker 2>Yeah. The idea is you blur the image slightly differently

79
00:03:39.960 --> 00:03:43.319
<v Speaker 2>twice and then compare them. The differences highlight the edges.

80
00:03:43.879 --> 00:03:45.919
<v Speaker 2>XDoG is just a variation on that, maybe for a

81
00:03:45.960 --> 00:03:47.159
<v Speaker 2>more stylized look.

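NOTE
A rough NumPy-only sketch of the difference-of-Gaussians idea: blur the
image at two slightly different scales and subtract. The sigma of 1.0, the
scale ratio of 1.6, and the helper names are assumptions for illustration.
XDoG is not shown.
```python
import numpy as np
def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about three sigma, normalized to sum 1."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()
def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array with edge replication."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
def difference_of_gaussians(img, sigma=1.0, ratio=1.6):
    """Narrow blur minus wide blur: near zero in flat areas, large near edges."""
    return gaussian_blur(img, sigma) - gaussian_blur(img, ratio * sigma)
# Synthetic image (hypothetical data): dark left half, bright right half.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
dog = difference_of_gaussians(img)
edge_strength = np.abs(dog).sum(axis=0)  # response summed down each column
```
The response is zero in the flat regions and peaks within a couple of
pixels of the step between columns 15 and 16.
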
82
00:03:47.240 --> 00:03:49.960
<v Speaker 1>So the computer compares slightly different views to find the

83
00:03:50.000 --> 00:03:54.360
<v Speaker 1>important lines. Interesting. It also mentioned anisotropic diffusion. Again, we

84
00:03:54.479 --> 00:03:55.560
<v Speaker 1>heard that from noise reduction.

85
00:03:55.759 --> 00:03:59.280
<v Speaker 2>Yes, it's versatile for sketching. It smooths out the image

86
00:03:59.319 --> 00:04:02.680
<v Speaker 2>while keeping the key edges sharp. It simplifies things, makes

87
00:04:02.719 --> 00:04:05.479
<v Speaker 2>it look more abstract, more like a sketch. The book

88
00:04:05.520 --> 00:04:08.960
<v Speaker 2>even gives some parameters, like kappa twenty, niter twenty,

89
00:04:09.159 --> 00:04:10.560
<v Speaker 2>as starting points.

90
00:04:10.520 --> 00:04:12.879
<v Speaker 1>So it's a smart smoothing that knows what to keep. And

91
00:04:12.919 --> 00:04:15.400
<v Speaker 1>the last sketch method was the dodge operation.

92
00:04:16.040 --> 00:04:20.319
<v Speaker 2>Sounds like photography. It's related. For sketching, you invert the image,

93
00:04:20.519 --> 00:04:23.040
<v Speaker 2>blur the inverted version quite a bit, and then sort

94
00:04:23.040 --> 00:04:26.120
<v Speaker 2>of divide the original by that blurred inversion, sometimes

95
00:04:26.120 --> 00:04:29.920
<v Speaker 2>with a threshold too. It really emphasizes contrast along edges,

96
00:04:30.079 --> 00:04:32.360
<v Speaker 2>giving that bright outline sketch effect.

97
00:04:32.839 --> 00:04:36.639
<v Speaker 1>It's amazing how math can replicate these artistic looks. Okay,

98
00:04:36.680 --> 00:04:39.720
<v Speaker 1>So moving from art, the cookbook gets into image enhancement,

99
00:04:40.160 --> 00:04:42.959
<v Speaker 1>making images better, clearer, right.

100
00:04:42.800 --> 00:04:45.120
<v Speaker 2>And a big part of that is denoising. We all

101
00:04:45.160 --> 00:04:48.560
<v Speaker 2>hate grainy photos. Simple filters are the first thing mentioned,

102
00:04:48.839 --> 00:04:52.079
<v Speaker 2>like blurring. Right, but that can kill details too. Exactly.

103
00:04:52.240 --> 00:04:55.680
<v Speaker 1>Things like Gaussian or median blur reduce noise, but they

104
00:04:55.680 --> 00:04:57.040
<v Speaker 1>often blur everything else.

105
00:04:56.920 --> 00:04:59.199
<v Speaker 2>along with it. Which brings us back to things like

106
00:04:59.279 --> 00:05:00.560
<v Speaker 2>anisotropic diffusion.

107
00:05:00.680 --> 00:05:04.240
<v Speaker 1>Seems useful. It really is, because it smooths while trying

108
00:05:04.240 --> 00:05:07.560
<v Speaker 1>to preserve edges. It's often better at removing noise without

109
00:05:07.639 --> 00:05:09.120
<v Speaker 1>making the whole image look soft.

110
00:05:09.399 --> 00:05:12.680
<v Speaker 2>Okay. And then there are denoising autoencoders. Now that

111
00:05:12.720 --> 00:05:15.800
<v Speaker 2>sounds like AI. It is. It's a neural network. You

112
00:05:15.879 --> 00:05:18.680
<v Speaker 2>train it by feeding it noisy images and teaching it

113
00:05:18.720 --> 00:05:20.199
<v Speaker 2>to output clean versions.

114
00:05:20.399 --> 00:05:21.279
<v Speaker 1>How does it learn that?

115
00:05:21.519 --> 00:05:25.279
<v Speaker 2>Through training, lots of examples. It sees a noisy input,

116
00:05:25.480 --> 00:05:27.519
<v Speaker 2>makes a guess at the clean output, compares it to

117
00:05:27.560 --> 00:05:30.279
<v Speaker 2>the actual clean image, and adjusts itself to get closer

118
00:05:30.319 --> 00:05:33.439
<v Speaker 2>next time. It learns to recognize noise patterns and remove them.

119
00:05:33.720 --> 00:05:36.279
<v Speaker 2>The book even mentions you can use color images and

120
00:05:36.319 --> 00:05:37.639
<v Speaker 2>try different network types.

121
00:05:37.759 --> 00:05:40.720
<v Speaker 1>Wow, so the network literally learns what noise is and

122
00:05:40.759 --> 00:05:45.759
<v Speaker 1>how to subtract it. Okay, what else for enhancement? Histogram

123
00:05:45.759 --> 00:05:48.600
<v Speaker 1>matching sounds like adjusting brightness and contrast.

124
00:05:48.759 --> 00:05:51.920
<v Speaker 2>Kind of. A histogram shows the distribution of brightness levels.

125
00:05:52.399 --> 00:05:55.639
<v Speaker 2>Histogram matching lets you take the overall tonal feel of

126
00:05:55.800 --> 00:05:58.360
<v Speaker 2>one image, the template, and apply it to another image,

127
00:05:58.399 --> 00:06:03.680
<v Speaker 2>the source. You calculate these things called cumulative distribution functions, CDFs, for

128
00:06:03.800 --> 00:06:07.439
<v Speaker 2>both images. They summarize the brightness distribution. Then you map

129
00:06:07.480 --> 00:06:10.199
<v Speaker 2>the brightness levels from the source image to the corresponding

130
00:06:10.240 --> 00:06:12.319
<v Speaker 2>levels in the template based on these CDFs.

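NOTE
The CDF mapping just described might look like this in plain NumPy. It is
a sketch under stated assumptions: single-channel images, random synthetic
"day" and "night" arrays in place of real photos, and a hypothetical
function name.
```python
import numpy as np
def match_histogram(source, template):
    """Remap `source` so its brightness distribution follows `template`."""
    src_values, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    tpl_values, tpl_counts = np.unique(template.ravel(), return_counts=True)
    # Cumulative distribution functions of both images, each ending at 1.0.
    src_cdf = np.cumsum(src_counts) / source.size
    tpl_cdf = np.cumsum(tpl_counts) / template.size
    # For each source level, look up the template level at the same CDF value.
    matched = np.interp(src_cdf, tpl_cdf, tpl_values)
    return matched[src_idx].reshape(source.shape)
rng = np.random.default_rng(0)
day = rng.integers(150, 251, size=(32, 32))    # bright synthetic image
night = rng.integers(0, 61, size=(32, 32))     # dark synthetic template
out = match_histogram(day, night)
```
The output keeps the layout of the day image but its values now fall
inside the dark range of the night template.
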
131
00:06:12.439 --> 00:06:13.399
<v Speaker 1>And why would you do that?

132
00:06:13.600 --> 00:06:17.480
<v Speaker 2>For creative effects, mostly. The cookbook suggests making a daytime

133
00:06:17.480 --> 00:06:20.720
<v Speaker 2>photo look like night vision by matching its histogram to

134
00:06:20.920 --> 00:06:22.240
<v Speaker 2>a picture taken at night.

135
00:06:22.399 --> 00:06:25.600
<v Speaker 1>Huh. So you could completely change the mood by borrowing

136
00:06:25.600 --> 00:06:26.360
<v Speaker 1>the tonal range.

137
00:06:26.399 --> 00:06:30.399
<v Speaker 2>That's powerful. Definitely. And the last enhancement technique here is

138
00:06:30.600 --> 00:06:35.319
<v Speaker 2>seamless cloning or Poisson image editing. This is about pasting

139
00:06:35.399 --> 00:06:38.199
<v Speaker 2>something from one image into another really realistically.

140
00:06:38.360 --> 00:06:41.160
<v Speaker 1>Ah yes, cutting and pasting without it looking fake. How

141
00:06:41.199 --> 00:06:41.759
<v Speaker 1>does that work?

142
00:06:41.839 --> 00:06:43.920
<v Speaker 2>The magic is in the blending. It looks at the

143
00:06:43.959 --> 00:06:46.279
<v Speaker 2>gradients, the changes in color, at the boundary of the

144
00:06:46.319 --> 00:06:49.439
<v Speaker 2>object you're pasting. Okay. And it tries to adjust the

145
00:06:49.480 --> 00:06:53.040
<v Speaker 2>pasted object so its gradients smoothly transition into the gradients

146
00:06:53.040 --> 00:06:56.600
<v Speaker 2>of the background image. The cv2.seamlessClone function

147
00:06:56.759 --> 00:07:00.000
<v Speaker 2>in OpenCV, maybe with the cv2.MIXED_CLONE

148
00:07:00.199 --> 00:07:04.040
<v Speaker 2>option, uses some clever math, Poisson equations, to figure this out.

149
00:07:04.079 --> 00:07:05.680
<v Speaker 1>So it's matching not just the colors, but the way

150
00:07:05.759 --> 00:07:07.680
<v Speaker 1>light and shadow change across the boundary.

151
00:07:07.839 --> 00:07:12.199
<v Speaker 2>Very cool. Exactly. Now, after enhancing images, the book moves

152
00:07:12.240 --> 00:07:16.920
<v Speaker 2>into understanding their structure, starting with edge detection algorithms. We

153
00:07:17.000 --> 00:07:18.639
<v Speaker 2>mentioned some for sketching, but there's more.

154
00:07:18.920 --> 00:07:21.720
<v Speaker 1>Right, we talked about Canny and Marr-Hildreth. For Canny,

155
00:07:22.160 --> 00:07:25.480
<v Speaker 1>the book said, less blur means more detail, maybe more

156
00:07:25.519 --> 00:07:28.319
<v Speaker 1>noise and more blur gives cleaner, stronger edges.

157
00:07:28.480 --> 00:07:31.680
<v Speaker 2>Correct. That blur amount, the sigma value, controls the trade-off.

158
00:07:31.720 --> 00:07:35.720
<v Speaker 2>And Marr-Hildreth uses the Laplacian of Gaussian, LoG, filter.

159
00:07:36.079 --> 00:07:39.800
<v Speaker 2>It highlights rapid intensity changes. Then you find the zero

160
00:07:39.959 --> 00:07:43.480
<v Speaker 2>crossings in that filtered image, which often mark the edges.

161
00:07:43.600 --> 00:07:46.680
<v Speaker 1>Zero crossings where the filtered value goes from positive to

162
00:07:46.720 --> 00:07:47.720
<v Speaker 1>negative or vice versa.

163
00:07:47.800 --> 00:07:51.600
<v Speaker 2>Yeah, it pinpoints those sharp transitions. And the third method

164
00:07:51.680 --> 00:07:55.639
<v Speaker 2>mentioned was wavelet based edge detection. I know wavelets from audio.

165
00:07:56.199 --> 00:07:59.920
<v Speaker 2>How do they work for images? Similar idea, actually. Wavelets

166
00:08:00.040 --> 00:08:02.959
<v Speaker 2>break down the image into different frequency components. Edges are

167
00:08:03.000 --> 00:08:05.959
<v Speaker 2>sharp features, so they contain a lot of high frequency information.

168
00:08:06.639 --> 00:08:10.120
<v Speaker 2>By looking at the wavelet coefficients, the numbers representing these frequencies,

169
00:08:10.319 --> 00:08:12.759
<v Speaker 2>you can find where the high frequencies are concentrated, and

170
00:08:12.759 --> 00:08:14.639
<v Speaker 2>that tells you where the edges are. It's another way

171
00:08:14.639 --> 00:08:15.040
<v Speaker 2>to find sharpness.

172
00:08:14.920 --> 00:08:20.439
<v Speaker 1>Analyzing the image's visual frequencies. Neat perspective. Okay. Next

173
00:08:20.519 --> 00:08:25.120
<v Speaker 1>up, image restoration, fixing broken images. Exactly.

174
00:08:25.240 --> 00:08:28.920
<v Speaker 2>Deblurring is a big one. The cookbook mentions Wiener filters.

175
00:08:29.120 --> 00:08:32.240
<v Speaker 1>I think I've heard of those for signal processing. Yep.

176
00:08:32.720 --> 00:08:36.120
<v Speaker 2>Applied to images, Wiener filters try to reverse blurring. They

177
00:08:36.200 --> 00:08:38.960
<v Speaker 2>estimate the original sharp image considering both how it was

178
00:08:38.960 --> 00:08:42.080
<v Speaker 2>blurred and any noise present. There's usually a parameter to

179
00:08:42.120 --> 00:08:44.039
<v Speaker 2>balance how much denoising versus deblurring you want.

180
00:08:44.039 --> 00:08:47.279
<v Speaker 1>A balancing act, right, trying to unblur without

181
00:08:47.279 --> 00:08:51.879
<v Speaker 1>making noise worse. The book also mentioned constrained least squares

182
00:08:51.919 --> 00:08:58.240
<v Speaker 1>filtering, CLS, with a Laplacian constraint. Sounds complicated.

183
00:08:57.759 --> 00:08:59.840
<v Speaker 2>It's a bit more advanced. CLS lets you add assumptions

184
00:08:59.840 --> 00:09:03.559
<v Speaker 2>about the original image. Using a Laplacian constraint basically

185
00:09:03.600 --> 00:09:06.639
<v Speaker 2>tells the algorithm the original image was probably smooth, so

186
00:09:07.120 --> 00:09:09.840
<v Speaker 2>try to make the deblurred result smooth too, while

187
00:09:09.879 --> 00:09:11.200
<v Speaker 2>still trying to recover detail.

188
00:09:11.440 --> 00:09:14.399
<v Speaker 1>Got it. Adding some prior knowledge. What about denoising with

189
00:09:14.679 --> 00:09:17.879
<v Speaker 1>Markov random fields, MRFs? Sounds statistical.

190
00:09:18.000 --> 00:09:21.159
<v Speaker 2>It is. MRFs model how pixels relate to their neighbors.

191
00:09:21.320 --> 00:09:24.480
<v Speaker 2>The basic idea is that nearby pixels usually have similar values

192
00:09:24.480 --> 00:09:27.279
<v Speaker 2>in a clean image, So the algorithm tries to find

193
00:09:27.320 --> 00:09:30.320
<v Speaker 2>a denoised image where these local relationships are most likely,

194
00:09:30.600 --> 00:09:34.840
<v Speaker 2>effectively smoothing out the random noise that violates those neighborhood similarities.

195
00:09:35.440 --> 00:09:39.039
<v Speaker 2>The book mentions converting pixels to minus one and one first,

196
00:09:39.080 --> 00:09:41.679
<v Speaker 2>which is common for some MRF methods.

197
00:09:41.399 --> 00:09:46.000
<v Speaker 1>So finding the most probable clean image based on pixel statistics. Okay,

198
00:09:46.200 --> 00:09:48.159
<v Speaker 1>And fixing holes, image inpainting?

199
00:09:48.279 --> 00:09:53.080
<v Speaker 2>Yeah, like digital art restoration, filling in missing bits plausibly.

200
00:09:54.000 --> 00:09:56.919
<v Speaker 2>Total variation inpainting is one method mentioned.

201
00:09:56.600 --> 00:09:59.600
<v Speaker 1>Total variation, heard that before. How does it fill holes?

202
00:09:59.799 --> 00:10:02.720
<v Speaker 2>It tries to fill the missing area by extending information

203
00:10:02.799 --> 00:10:05.080
<v Speaker 2>from the surrounding pixels, but it does it in a

204
00:10:05.120 --> 00:10:07.480
<v Speaker 2>way that keeps the filled area as smooth as possible,

205
00:10:07.679 --> 00:10:11.360
<v Speaker 2>minimizing sharp changes or new edges within the patch. OpenCV

206
00:10:11.480 --> 00:10:12.399
<v Speaker 2>has functions for this.

207
00:10:12.600 --> 00:10:16.159
<v Speaker 1>Smoothly propagating the existing textures into the gap. Okay, And

208
00:10:16.399 --> 00:10:17.799
<v Speaker 1>the last restoration technique,

209
00:10:17.399 --> 00:10:21.240
<v Speaker 2>dictionary learning. Sounds like building a library. That's a

210
00:10:21.240 --> 00:10:24.440
<v Speaker 2>good analogy. You learn a set of basic image patches,

211
00:10:24.480 --> 00:10:27.799
<v Speaker 2>the dictionary, from the image itself. Then you assume that

212
00:10:27.879 --> 00:10:30.240
<v Speaker 2>any noisy or missing part of the image can be

213
00:10:30.279 --> 00:10:34.279
<v Speaker 2>reconstructed by combining these learned dictionary atoms or patches. So

214
00:10:34.320 --> 00:10:37.080
<v Speaker 2>you find the best combination to represent and rebuild the

215
00:10:37.159 --> 00:10:37.960
<v Speaker 2>damaged area.

216
00:10:38.279 --> 00:10:41.879
<v Speaker 1>So it learns the image's own building blocks and uses

217
00:10:41.919 --> 00:10:44.039
<v Speaker 1>them for repairs. Clever.

218
00:10:44.480 --> 00:10:49.080
<v Speaker 2>Very. Okay, moving on to binary image processing, just black and white pixels.

219
00:10:48.840 --> 00:10:51.519
<v Speaker 1>Still useful stuff you can do though, like the

220
00:10:51.559 --> 00:10:53.320
<v Speaker 1>distance transform. What's that measure?

221
00:10:53.759 --> 00:10:56.080
<v Speaker 2>For every white pixel, it calculates how far it is

222
00:10:56.159 --> 00:10:59.720
<v Speaker 2>from the nearest black pixel, the background boundary. Pixels deep

223
00:10:59.720 --> 00:11:02.519
<v Speaker 2>inside a white shape get high values. Pixels near the

224
00:11:02.639 --> 00:11:06.480
<v Speaker 2>edge get low values. Good for analyzing thickness or shape.

225
00:11:06.720 --> 00:11:10.480
<v Speaker 1>Makes sense, sort of a thickness map. What about the morphological gradient?

226
00:11:09.720 --> 00:11:13.000
<v Speaker 2>That's mainly for highlighting the boundaries of objects in

227
00:11:13.039 --> 00:11:16.000
<v Speaker 2>a binary image. You get it by subtracting an eroded

228
00:11:16.080 --> 00:11:20.000
<v Speaker 2>version, shrunk, of the image from a dilated version, expanded.

229
00:11:20.360 --> 00:11:22.279
<v Speaker 2>It leaves just a thin outline.

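NOTE
A small NumPy-only sketch of the morphological gradient, dilation minus
erosion, on a binary image. The 3x3 square structuring element, the zero
padding at the border, and the helper names are assumptions chosen for
brevity.
```python
import numpy as np
def _neighborhood_stack(binary):
    """Stack the nine 3x3-shifted copies of a 2-D boolean image (zero padded)."""
    padded = np.pad(binary, 1, mode="constant", constant_values=False)
    h, w = binary.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])
def dilate(binary):
    """A pixel is on if any pixel in its 3x3 neighborhood is on (expands shapes)."""
    return _neighborhood_stack(binary).any(axis=0)
def erode(binary):
    """A pixel stays on only if its whole 3x3 neighborhood is on (shrinks shapes)."""
    return _neighborhood_stack(binary).all(axis=0)
def morphological_gradient(binary):
    """Pixels present in the dilation but not in the erosion."""
    return dilate(binary) & ~erode(binary)
square = np.zeros((9, 9), dtype=bool)
square[3:6, 3:6] = True                 # a 3x3 white square
outline = morphological_gradient(square)
```
Here the dilation is a 5x5 square and the erosion is the single center
pixel, so the outline holds 24 pixels. With this structuring element the
band straddles the boundary and is about two pixels wide.
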
230
00:11:22.399 --> 00:11:25.120
<v Speaker 1>A clean way to get just the edges. And the

231
00:11:25.200 --> 00:11:27.759
<v Speaker 1>hit-or-miss transform. Taking name. It is.

232
00:11:27.879 --> 00:11:30.840
<v Speaker 2>It's for finding very specific small shapes or patterns. You

233
00:11:30.919 --> 00:11:34.840
<v Speaker 2>use two little templates, one matching the foreground pattern and

234
00:11:34.919 --> 00:11:38.600
<v Speaker 2>one matching the required background around it. It only triggers

235
00:11:38.600 --> 00:11:39.799
<v Speaker 2>where both match perfectly.

236
00:11:40.200 --> 00:11:44.000
<v Speaker 1>A very precise pattern finder for binary images. Got it.

237
00:11:44.720 --> 00:11:48.279
<v Speaker 1>Last one here is morphological watershed. I know watershed for

238
00:11:48.279 --> 00:11:49.720
<v Speaker 1>segmenting grayscale images.

239
00:11:49.879 --> 00:11:53.480
<v Speaker 2>Yep, same principle, powerful for binary and grayscale. You treat

240
00:11:53.480 --> 00:11:56.080
<v Speaker 2>the image like a 3D landscape based on intensity.

241
00:11:56.559 --> 00:11:59.279
<v Speaker 2>Then you flood it from low points, the markers. Where

242
00:11:59.320 --> 00:12:03.039
<v Speaker 2>the water from different basins meets, those are your segmentation boundaries.

243
00:12:03.240 --> 00:12:03.559
<v Speaker 1>Okay.

244
00:12:03.639 --> 00:12:06.440
<v Speaker 2>The cookbook says you can place markers by finding peaks

245
00:12:06.519 --> 00:12:10.000
<v Speaker 2>in the distance transform image or in low gradient areas.

246
00:12:10.240 --> 00:12:13.639
<v Speaker 2>Great for separating touching objects like cells, or just finding

247
00:12:13.679 --> 00:12:15.720
<v Speaker 2>distinct blobs.

248
00:12:15.279 --> 00:12:18.279
<v Speaker 1>Flooding the image landscape to find the natural divides. All right,

249
00:12:18.360 --> 00:12:22.240
<v Speaker 1>let's shift to image registration, aligning images.

250
00:12:22.000 --> 00:12:24.679
<v Speaker 2>Super important for comparing images taken at different times or with

251
00:12:24.720 --> 00:12:28.440
<v Speaker 2>different cameras or different medical scanners. The book starts with

252
00:12:28.559 --> 00:12:30.919
<v Speaker 2>medical image registration using SimpleITK.

253
00:12:31.159 --> 00:12:33.440
<v Speaker 1>Yeah, like aligning a CT and an MRI scan of

254
00:12:33.480 --> 00:12:36.240
<v Speaker 1>the same patient. Right, how does SimpleITK do it?

255
00:12:36.240 --> 00:12:40.200
<v Speaker 2>It finds the best geometric transformation, maybe shifting, rotating, scaling,

256
00:12:40.240 --> 00:12:43.000
<v Speaker 2>to line them up. It does this by optimizing a

257
00:12:43.039 --> 00:12:47.320
<v Speaker 2>similarity score, like Mattes mutual information, using a specific transform

258
00:12:47.320 --> 00:12:51.240
<v Speaker 2>model, like Similarity2DTransform, and an optimizer, maybe gradient

259
00:12:51.279 --> 00:12:54.279
<v Speaker 2>descent. You read the images, set up the process, run it,

260
00:12:54.639 --> 00:12:56.759
<v Speaker 2>then resample one image to match the other.

261
00:12:57.240 --> 00:13:00.440
<v Speaker 1>A systematic way to find the perfect overlap. Okay. Then

262
00:13:00.480 --> 00:13:02.519
<v Speaker 1>there's the ECC algorithm and warping.

263
00:13:02.919 --> 00:13:07.399
<v Speaker 2>ECC is enhanced correlation coefficient. It's an algorithm designed to

264
00:13:07.440 --> 00:13:10.240
<v Speaker 2>figure out the geometric warp needed to align two images,

265
00:13:10.519 --> 00:13:13.759
<v Speaker 2>maybe correcting for slight camera shifts. Once you have the warp,

266
00:13:13.879 --> 00:13:14.440
<v Speaker 2>you apply it.

267
00:13:14.519 --> 00:13:17.399
<v Speaker 1>Gotcha. What about faces? Aligning faces with dlib.

268
00:13:17.519 --> 00:13:20.799
<v Speaker 2>Dlib is great for finding facial landmarks eyes, nose corners,

269
00:13:20.840 --> 00:13:23.279
<v Speaker 2>mouth corners, et cetera. Right, once you have those points

270
00:13:23.320 --> 00:13:26.039
<v Speaker 2>on two faces, you can calculate an affine transformation

271
00:13:26.200 --> 00:13:28.639
<v Speaker 2>to warp one face so its landmarks line up with

272
00:13:28.679 --> 00:13:31.879
<v Speaker 2>the other. This normalizes the pose. The FaceAligner

273
00:13:31.960 --> 00:13:34.320
<v Speaker 2>class in imutils helps here.

274
00:13:34.080 --> 00:13:38.000
<v Speaker 1>Central for face recognition, I bet. Okay. Robust matching and homography

275
00:13:38.039 --> 00:13:40.120
<v Speaker 1>with RANSAC sounds like dealing with errors.

276
00:13:40.120 --> 00:13:44.080
<v Speaker 2>Exactly. When you match features between images, say, using

277
00:13:44.159 --> 00:13:48.399
<v Speaker 2>SIFT features and BRIEF descriptors, you often get bad matches, outliers.

278
00:13:49.000 --> 00:13:52.840
<v Speaker 2>RANSAC, random sample consensus, helps find the true transformation, the

279
00:13:52.879 --> 00:13:55.200
<v Speaker 2>homography, despite these outliers.

280
00:13:55.279 --> 00:13:55.559
<v Speaker 1>Wow.

281
00:13:55.879 --> 00:13:59.679
<v Speaker 2>It randomly picks small subsets of matches, calculates a homography,

282
00:13:59.759 --> 00:14:02.639
<v Speaker 2>and sees how many other matches agree with it. It repeats

283
00:14:02.679 --> 00:14:05.519
<v Speaker 2>this and picks the homography supported by the most matches,

284
00:14:05.720 --> 00:14:06.960
<v Speaker 2>ignoring the ones that don't fit.

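NOTE
To keep the example short, this sketch runs the RANSAC loop described
above on a simpler model than a homography: a pure 2-D translation between
matched points. That swap, the tolerance, the iteration count, and the
synthetic matches are all assumptions for illustration. The loop is the
same as described: sample, fit, count agreeing matches, keep the best.
```python
import numpy as np
def ransac_translation(src, dst, n_iters=100, tol=1.0, seed=0):
    """Estimate t with dst ~ src + t while ignoring outlier matches."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        i = rng.integers(len(src))              # minimal sample: one match
        t = dst[i] - src[i]                     # model hypothesized from it
        residual = np.linalg.norm(src + t - dst, axis=1)
        inliers = residual < tol                # matches that agree with the model
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set only, ignoring everything else.
    best_t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return best_t, best_inliers
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(30, 2))
dst = src + np.array([5.0, -3.0])               # true shift
dst[:8] = rng.uniform(0, 100, size=(8, 2))      # first 8 matches are corrupted
t, inliers = ransac_translation(src, dst)
```
For a homography the minimal sample would be four matches instead of one,
and OpenCV's cv2.findHomography with the cv2.RANSAC flag handles the
fitting.
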
285
00:14:07.440 --> 00:14:12.440
<v Speaker 1>Finds the consensus, ignores the noise. Smart. And image mosaicing,

286
00:14:13.120 --> 00:14:14.679
<v Speaker 1>making panoramas.

287
00:14:14.120 --> 00:14:17.840
<v Speaker 2>Yeah, stitching overlapping photos. The usual steps are: find features

288
00:14:17.879 --> 00:14:21.320
<v Speaker 2>like SIFT, match them between images, calculate the homography to

289
00:14:21.360 --> 00:14:24.840
<v Speaker 2>warp them into alignment, then blend the seams. OpenCV's

290
00:14:25.039 --> 00:14:28.039
<v Speaker 2>cv2.Stitcher class makes it easier. The book also

291
00:14:28.120 --> 00:14:31.519
<v Speaker 2>mentions cylindrical warping for very wide panoramas to handle distortion.

292
00:14:31.759 --> 00:14:34.480
<v Speaker 1>So seamless panoramas are impressive. What about face morphing? That

293
00:14:34.600 --> 00:14:35.159
<v Speaker 1>sounds fun?

294
00:14:35.399 --> 00:14:39.519
<v Speaker 2>It is. Creating that smooth video transition between two faces.

295
00:14:39.840 --> 00:14:43.120
<v Speaker 2>You need corresponding points on both faces first. Then you

296
00:14:43.200 --> 00:14:46.559
<v Speaker 2>calculate an average shape between them. Then you warp both

297
00:14:46.600 --> 00:14:50.440
<v Speaker 2>original faces towards that average shape. Finally, you blend the

298
00:14:50.480 --> 00:14:55.360
<v Speaker 2>warped images together over time, usually with alpha blending. Mesh warping

299
00:14:55.440 --> 00:14:57.519
<v Speaker 2>is one technique for the warp itself.

300
00:14:57.159 --> 00:14:59.720
<v Speaker 1>Guiding one face to become another by aligning features and

301
00:14:59.720 --> 00:15:05.600
<v Speaker 1>blending. And finally, registration leads to building an image search engine.

302
00:15:05.720 --> 00:15:09.120
<v Speaker 2>Content based image retrieval, Yeah, a multi step process.

303
00:15:09.120 --> 00:15:09.799
<v Speaker 1>How does it work?

304
00:15:10.519 --> 00:15:14.200
<v Speaker 2>You extract features like SIFT from every image in your database,

305
00:15:14.679 --> 00:15:19.360
<v Speaker 2>create compact descriptions of those features, index them efficiently. Then

306
00:15:19.480 --> 00:15:22.360
<v Speaker 2>for a query image, you extract its feature descriptions and

307
00:15:22.399 --> 00:15:25.320
<v Speaker 2>search the index for images with the most similar descriptions

308
00:15:25.600 --> 00:15:29.320
<v Speaker 2>using tools like FLANN for speed and the ratio test for reliability.

309
00:15:29.519 --> 00:15:32.759
<v Speaker 1>Creating a visual fingerprint and searching for matches. Powerful stuff.

310
00:15:32.840 --> 00:15:37.360
<v Speaker 2>Definitely. Okay, next major area: image segmentation, dividing an image

311
00:15:37.360 --> 00:15:38.399
<v Speaker 2>into meaningful parts.

312
00:15:38.600 --> 00:15:42.679
<v Speaker 1>Simplest way is thresholding, right? The book mentions Otsu and Ridler-Calvard.

313
00:15:42.960 --> 00:15:47.320
<v Speaker 2>Yeah. Basic idea is separating foreground from background based on brightness.

314
00:15:47.919 --> 00:15:52.039
<v Speaker 2>Otsu's method and Ridler-Calvard are automatic threshold finders. They

315
00:15:52.080 --> 00:15:56.080
<v Speaker 2>analyze the histogram to find the best split point. The mahotas

316
00:15:56.120 --> 00:16:00.600
<v Speaker 2>library has them. Otsu minimizes variance within classes. Ridler-Calvard

317
00:16:00.639 --> 00:16:01.879
<v Speaker 2>is iterative.

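NOTE
Editor's sketch, not the cookbook's code (which uses the Mahotas library): Otsu's
method written out in plain Python for 8-bit pixel values. It maximizes between-class
variance, which is equivalent to minimizing within-class variance. The function name
and the toy pixel data are assumptions for illustration.
```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.
    Pixels with value <= t form the dark class, the rest the bright class."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # intensity sum of the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        # Unnormalized between-class variance for this split.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t
dark = [10, 12, 11, 13, 9] * 20
bright = [200, 205, 198, 202, 210] * 20
t = otsu_threshold(dark + bright)
```
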
318
00:16:01.799 --> 00:16:04.600
<v Speaker 1>So they find the threshold for you. Yes. What about

319
00:16:04.639 --> 00:16:07.759
<v Speaker 1>segmentation with self-organizing maps, SOMs?

320
00:16:08.120 --> 00:16:11.720
<v Speaker 2>SOMs are neural networks used for clustering. You can feed

321
00:16:11.799 --> 00:16:15.399
<v Speaker 2>image pixel data like color into an SOM. It learns

322
00:16:15.440 --> 00:16:18.200
<v Speaker 2>to group similar pixels together on its map. Okay, so

323
00:16:18.200 --> 00:16:20.720
<v Speaker 2>you can use the trained SOM to segment the image

324
00:16:20.720 --> 00:16:23.720
<v Speaker 2>based on which map neuron a pixel activates, or just

325
00:16:23.759 --> 00:16:26.919
<v Speaker 2>to reduce the number of colors, quantization. The book mentions

326
00:16:27.000 --> 00:16:28.519
<v Speaker 2>using it on handwritten digits.

327
00:16:28.320 --> 00:16:31.159
<v Speaker 1>Letting the data cluster itself. What's random walk segmentation?

328
00:16:31.440 --> 00:16:34.399
<v Speaker 2>That one's interactive. You first label a few seed pixels

329
00:16:34.440 --> 00:16:37.440
<v Speaker 2>for each region you want. Then for every unlabeled pixel,

330
00:16:37.480 --> 00:16:40.320
<v Speaker 2>the algorithm figures out the probability that a random walk

331
00:16:40.360 --> 00:16:44.080
<v Speaker 2>starting there would hit each seed region first. The pixel

332
00:16:44.080 --> 00:16:47.000
<v Speaker 2>gets assigned to the region with the highest probability. Often

333
00:16:47.039 --> 00:16:48.200
<v Speaker 2>gives really nice results.

334
00:16:48.480 --> 00:16:52.440
<v Speaker 1>A guided approach using initial hints. What about segmenting skin

335
00:16:52.960 --> 00:16:54.960
<v Speaker 1>with the GMM-EM algorithm?

336
00:16:54.639 --> 00:16:59.960
<v Speaker 2>Gaussian mixture model, GMM, and expectation-maximization, EM. The idea

337
00:17:00.240 --> 00:17:03.000
<v Speaker 2>is that skin colors follow a mix of Gaussian distributions.

338
00:17:03.399 --> 00:17:06.480
<v Speaker 2>You train a GMM on skin and non-skin examples, then

339
00:17:06.559 --> 00:17:10.240
<v Speaker 2>you use the trained model to classify pixels in new images.

340
00:17:09.880 --> 00:17:13.440
<v Speaker 1>Learning the statistics of skin color. Okay, medical image segmentation

341
00:17:13.519 --> 00:17:15.319
<v Speaker 1>again, U-Net and watershed, right?

342
00:17:15.799 --> 00:17:18.880
<v Speaker 2>Deep learning models like U-Net are huge in medical imaging,

343
00:17:18.920 --> 00:17:23.000
<v Speaker 2>now, great at learning complex patterns for segmenting organs or tumors,

344
00:17:23.359 --> 00:17:27.480
<v Speaker 2>and watershed via SimpleITK is still useful, especially for

345
00:17:27.559 --> 00:17:29.640
<v Speaker 2>separating touching cells or structures.

346
00:17:29.720 --> 00:17:32.720
<v Speaker 1>Then deep semantic segmentation, assigning a label to every pixel.

347
00:17:32.519 --> 00:17:35.359
<v Speaker 2>Exactly, using models like DeepLabV3+

348
00:17:35.480 --> 00:17:38.319
<v Speaker 2>or FCN. Not just there's a car, but these specific

349
00:17:38.400 --> 00:17:42.279
<v Speaker 2>pixels are car, these are road, et cetera. Pixel-level understanding.

350
00:17:42.200 --> 00:17:44.680
<v Speaker 1>Got it. And deep instance segmentation, how's that different?

351
00:17:44.720 --> 00:17:46.920
<v Speaker 2>It goes one step further. Semantic says these are all

352
00:17:46.920 --> 00:17:49.680
<v Speaker 2>car pixels. Instance says this is car #1, this

353
00:17:49.720 --> 00:17:52.640
<v Speaker 2>is car #2, this is car #3, each

354
00:17:52.720 --> 00:17:53.720
<v Speaker 2>with its own mask.

355
00:17:53.599 --> 00:17:57.240
<v Speaker 1>Ah, distinguishing individual objects of the same class.

356
00:17:57.640 --> 00:18:01.079
<v Speaker 2>Precisely. Models like Mask R-CNN do this. They build on object

357
00:18:01.160 --> 00:18:04.200
<v Speaker 2>detectors like Faster R-CNN and add a branch to predict

358
00:18:04.200 --> 00:18:05.960
<v Speaker 2>the mask for each detected instance.

359
00:18:06.119 --> 00:18:08.920
<v Speaker 1>Car one, car two, car three, each outlined in much more detail.

360
00:18:08.799 --> 00:18:14.279
<v Speaker 2>Exactly. Okay, next up: image classification, assigning one label

361
00:18:14.359 --> 00:18:15.240
<v Speaker 2>to the whole image.

362
00:18:15.240 --> 00:18:19.119
<v Speaker 1>The book starts with feature-based methods: HOG and logistic regression.

363
00:18:19.200 --> 00:18:21.000
<v Speaker 1>We saw HOG for detection.

364
00:18:20.799 --> 00:18:25.119
<v Speaker 2>Yep, histogram of oriented gradients. You extract these HOG features,

365
00:18:25.160 --> 00:18:27.599
<v Speaker 2>which capture edge direction info, and feed them into a

366
00:18:27.640 --> 00:18:31.799
<v Speaker 2>standard classifier like logistic regression to categorize the entire image.

367
00:18:32.039 --> 00:18:37.400
<v Speaker 2>Classic machine learning pipeline: extract features, train classifier, evaluate.

368
00:18:37.079 --> 00:18:40.279
<v Speaker 1>Using gradients as a signature. What about texture classification? Gabor

369
00:18:40.319 --> 00:18:41.359
<v Speaker 1>filter banks?

370
00:18:41.279 --> 00:18:45.160
<v Speaker 2>Gabor filters are sensitive to orientation and frequency. Great for texture.

371
00:18:45.279 --> 00:18:47.240
<v Speaker 2>A bank is just a set of Gabor filters with

372
00:18:47.279 --> 00:18:50.240
<v Speaker 2>different parameters. You apply the bank, get a feature vector

373
00:18:50.279 --> 00:18:52.759
<v Speaker 2>describing the texture, and compare it to feature vectors of

374
00:18:52.839 --> 00:18:53.680
<v Speaker 2>known textures.

375
00:18:53.880 --> 00:18:56.720
<v Speaker 1>Analyzing the image grain with special filters. Okay, and then

376
00:18:56.839 --> 00:19:00.079
<v Speaker 1>the big one: pretrained deep learning models, transfer learning.

377
00:19:00.599 --> 00:19:04.920
<v Speaker 2>Huge shortcut. Models like VGG16, MobileNetV2, ResNet,

378
00:19:05.039 --> 00:19:09.079
<v Speaker 2>Inception, trained on millions of ImageNet images. They've learned

379
00:19:09.200 --> 00:19:10.519
<v Speaker 2>general visual features.

380
00:19:10.599 --> 00:19:12.480
<v Speaker 1>Do you just use them out of the box?

381
00:19:12.559 --> 00:19:14.960
<v Speaker 2>Pretty much. You feed your image in, get predictions based on the

382
00:19:15.039 --> 00:19:18.400
<v Speaker 2>vast knowledge they already have. The cookbook shows classifying a

383
00:19:18.519 --> 00:19:19.880
<v Speaker 2>cheetah and swans this way.

384
00:19:19.920 --> 00:19:24.279
<v Speaker 1>Borrowing expertise, cool. And training a custom classifier using

385
00:19:24.359 --> 00:19:25.519
<v Speaker 1>transfer learning?

386
00:19:25.960 --> 00:19:28.440
<v Speaker 2>Right. You take a pretrained model, usually chop off its

387
00:19:28.480 --> 00:19:32.960
<v Speaker 2>final classification layer, freeze the early layers which learn general features,

388
00:19:33.279 --> 00:19:36.039
<v Speaker 2>add your own new classification layers on top, and train

389
00:19:36.160 --> 00:19:38.519
<v Speaker 2>only those new layers, or maybe fine-tune a bit

390
00:19:38.519 --> 00:19:40.559
<v Speaker 2>more on your specific dataset.

391
00:19:40.319 --> 00:19:43.000
<v Speaker 1>Ah, adapting it.

392
00:19:42.839 --> 00:19:46.000
<v Speaker 2>Exactly. Much faster and needs less data than training from scratch. The

393
00:19:46.039 --> 00:19:49.279
<v Speaker 2>book mentions ImageDataGenerator for augmenting your data too,

394
00:19:49.400 --> 00:19:51.680
<v Speaker 2>creating variations to help the model generalize.

395
00:19:51.880 --> 00:19:55.640
<v Speaker 1>Taking a generalist model and making it a specialist. Okay.

396
00:19:56.200 --> 00:19:59.960
<v Speaker 1>Classifying traffic signs is mentioned next, with challenges like imbalance

397
00:20:00.079 --> 00:20:00.759
<v Speaker 1>and overfitting.

398
00:20:00.960 --> 00:20:04.680
<v Speaker 2>Yeah, important for self-driving. Some signs are rare, that's the imbalance.

399
00:20:04.880 --> 00:20:09.240
<v Speaker 2>Models might memorize the training data, that's overfitting, so you use

400
00:20:09.279 --> 00:20:13.279
<v Speaker 2>techniques like resampling classes or heavy data augmentation during training,

401
00:20:13.400 --> 00:20:16.319
<v Speaker 2>maybe with PyTorch, to make the model more robust.

402
00:20:16.039 --> 00:20:21.200
<v Speaker 1>Real-world considerations. Finally, estimating human pose with OpenPose.

403
00:20:21.160 --> 00:20:25.160
<v Speaker 2>Finding key body joints, elbows, knees, et cetera. OpenPose

404
00:20:25.319 --> 00:20:28.279
<v Speaker 2>is a popular bottom-up method. It finds all potential

405
00:20:28.279 --> 00:20:30.480
<v Speaker 2>body parts first, then figures out how to assemble them

406
00:20:30.480 --> 00:20:34.039
<v Speaker 2>into skeletons. Uses a deep network, often VGG-based, to

407
00:20:34.119 --> 00:20:38.039
<v Speaker 2>predict heat maps for joints and connection maps, part affinity fields.

408
00:20:37.759 --> 00:20:39.920
<v Speaker 1>Finds the parts, then connects the dots. Got it.

409
00:20:40.000 --> 00:20:44.559
<v Speaker 2>Okay, almost there. Last big section: object detection, finding where

410
00:20:44.559 --> 00:20:45.200
<v Speaker 2>things are.

411
00:20:45.200 --> 00:20:49.039
<v Speaker 1>Starting with HOG again, but with non-maximum suppression. What's that?

412
00:20:49.279 --> 00:20:52.279
<v Speaker 2>When using sliding windows with HOG, you often detect the

413
00:20:52.319 --> 00:20:56.400
<v Speaker 2>same object multiple times with slightly different overlapping boxes.

414
00:20:56.440 --> 00:20:59.880
<v Speaker 2>Non-maximum suppression, NMS, cleans this up. It keeps the box

415
00:21:00.119 --> 00:21:03.279
<v Speaker 2>with the highest confidence score for an object and suppresses other

416
00:21:03.359 --> 00:21:05.000
<v Speaker 2>boxes that overlap heavily with it.

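NOTE
Editor's sketch, not taken from the cookbook: greedy non-maximum suppression in plain
Python. Boxes are assumed to be (x1, y1, x2, y2) tuples; the function names and the
0.5 overlap threshold are illustrative choices.
```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap one already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
# The first two boxes cover the same object; the third is a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```
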
417
00:21:05.200 --> 00:21:08.440
<v Speaker 1>Getting rid of redundant detections, makes sense. Then YOLOv3,

418
00:21:08.599 --> 00:21:09.759
<v Speaker 1>You Only Look Once.

419
00:21:09.759 --> 00:21:12.960
<v Speaker 2>Yeah, famous for speed. Processes the whole image at once.

420
00:21:13.400 --> 00:21:16.680
<v Speaker 2>YOLOv3 uses a better backbone network, Darknet-53,

421
00:21:16.680 --> 00:21:20.839
<v Speaker 2>detects at multiple scales, better prediction layers than older versions.

422
00:21:20.960 --> 00:21:22.079
<v Speaker 2>Great for real time.

423
00:21:21.920 --> 00:21:24.279
<v Speaker 1>Faster R-CNN is next. How does that compare?

424
00:21:24.440 --> 00:21:28.119
<v Speaker 2>It's often very accurate, but usually slower than YOLO. It's

425
00:21:28.160 --> 00:21:31.920
<v Speaker 2>a two-stage process. First, a region proposal network, RPN,

426
00:21:32.240 --> 00:21:36.240
<v Speaker 2>suggests areas that might contain objects. Then a second stage

427
00:21:36.319 --> 00:21:39.599
<v Speaker 2>classifies those proposals and refines their bounding boxes.

428
00:21:39.680 --> 00:21:44.200
<v Speaker 1>Propose, then classify. More deliberate. And Mask R-CNN builds on this

429
00:21:44.720 --> 00:21:47.079
<v Speaker 1>for instance segmentation, which we already covered.

430
00:21:46.920 --> 00:21:49.200
<v Speaker 2>Right, adds that mask prediction branch.

431
00:21:49.359 --> 00:21:51.839
<v Speaker 1>Multi-object tracking is mentioned briefly for video.

432
00:21:51.680 --> 00:21:54.759
<v Speaker 2>Yeah, following multiple objects over time. Usually involves detecting in

433
00:21:54.799 --> 00:21:57.759
<v Speaker 2>each frame and then associating detections across frames to keep

434
00:21:57.759 --> 00:21:58.720
<v Speaker 2>track of who's who.

435
00:21:58.599 --> 00:22:02.680
<v Speaker 1>Okay, and reading text next: EAST and Tesseract.

436
00:22:02.480 --> 00:22:06.119
<v Speaker 2>EAST is a deep learning model specifically for detecting text regions

437
00:22:06.119 --> 00:22:10.400
<v Speaker 2>in images, even angled or curved text. Tesseract is a

438
00:22:10.519 --> 00:22:14.400
<v Speaker 2>powerful OCR engine for recognizing the characters inside the regions

439
00:22:14.480 --> 00:22:18.240
<v Speaker 2>EAST finds. Detect, then recognize.

440
00:22:17.599 --> 00:22:20.279
<v Speaker 1>Locate the text, then read it. Got it. And finally,

441
00:22:20.319 --> 00:22:22.119
<v Speaker 1>face detection with Haar cascades.

442
00:22:22.400 --> 00:22:26.359
<v Speaker 2>A more traditional but fast method. Uses simple rectangular features.

443
00:22:26.559 --> 00:22:31.119
<v Speaker 2>OpenCV has pretrained Haar cascades for faces, eyes, smiles,

444
00:22:31.160 --> 00:22:34.319
<v Speaker 2>still useful for real time stuff because they're computationally cheap.

445
00:22:34.440 --> 00:22:38.519
<v Speaker 1>Good overview of detection. Last section now: face recognition, colorization,

446
00:22:38.640 --> 00:22:41.039
<v Speaker 1>generation, starting with FaceNet embeddings.

447
00:22:41.119 --> 00:22:43.279
<v Speaker 2>FaceNet learns to map faces to a point in

448
00:22:43.279 --> 00:22:46.319
<v Speaker 2>a high-dimensional space, an embedding. The trick is faces

449
00:22:46.359 --> 00:22:49.000
<v Speaker 2>of the same person map to points close together, different

450
00:22:49.039 --> 00:22:49.799
<v Speaker 2>people map far apart.

451
00:22:49.599 --> 00:22:51.920
<v Speaker 1>A unique vector for each face.

452
00:22:52.279 --> 00:22:54.680
<v Speaker 2>Sort of, yeah. You use a pretrained FaceNet

453
00:22:54.720 --> 00:22:57.319
<v Speaker 2>to get these embeddings, then compare them like with distance,

454
00:22:57.519 --> 00:22:59.960
<v Speaker 2>or train a classifier on them to recognize people.

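NOTE
Editor's sketch, not taken from the cookbook: comparing face embeddings by distance.
Toy 3-D vectors stand in for real FaceNet embeddings, which would come from a
pretrained network. The function names, the gallery, and the 0.8 threshold are all
made up for illustration.
```python
import math
def l2_normalize(vec):
    """Scale a vector to unit length before comparing distances."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]
def identify(query, gallery, threshold=0.8):
    """Return the name of the closest gallery embedding, or None if too far."""
    name, dist = min(
        ((n, math.dist(query, e)) for n, e in gallery.items()),
        key=lambda pair: pair[1],
    )
    return name if dist < threshold else None
# Same person should land close together, different people far apart.
gallery = {
    "alice": l2_normalize([1.0, 0.1, 0.0]),
    "bob": l2_normalize([0.0, 1.0, 0.2]),
}
probe = l2_normalize([0.9, 0.2, 0.0])
stranger = l2_normalize([0.0, 0.0, 1.0])
```
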
455
00:23:00.079 --> 00:23:01.319
<v Speaker 1>Digital face fingerprints. Okay.

456
00:23:01.359 --> 00:23:04.920
<v Speaker 2>Automatic colorization with a CNN, taking grayscale and making it color.

457
00:23:05.599 --> 00:23:09.880
<v Speaker 2>A CNN trained on color images learns the relationship between luminance,

458
00:23:10.519 --> 00:23:14.880
<v Speaker 2>grayscale, and chrominance, color. Given a grayscale image, the L channel,

459
00:23:15.119 --> 00:23:17.200
<v Speaker 2>it predicts the color channels a and b in Lab space.

460
00:23:17.240 --> 00:23:21.640
<v Speaker 1>Learning typical color patterns, a little less magical. Image

461
00:23:21.680 --> 00:23:23.079
<v Speaker 1>generation with a GAN.

462
00:23:23.319 --> 00:23:27.960
<v Speaker 2>We mentioned these. Generative adversarial networks. The generator makes fake images.

463
00:23:28.079 --> 00:23:30.440
<v Speaker 2>The discriminator tries to spot the fakes. They compete and

464
00:23:30.480 --> 00:23:33.680
<v Speaker 2>both get better. The generator learns to make really realistic

465
00:23:33.720 --> 00:23:38.079
<v Speaker 2>images from noise. The book showed training one on anime faces.

466
00:23:37.599 --> 00:23:41.160
<v Speaker 1>The counterfeiter versus the cop. Okay, and variational autoencoders,

467
00:23:41.240 --> 00:23:43.880
<v Speaker 1>VAEs, for generation and reconstruction.

468
00:23:44.160 --> 00:23:47.960
<v Speaker 2>VAEs also generate images. They learn a probabilistic latent space.

469
00:23:48.359 --> 00:23:51.079
<v Speaker 2>This means they can reconstruct inputs, but also generate new

470
00:23:51.160 --> 00:23:54.759
<v Speaker 2>plausible samples by drawing points from that learned probability distribution.

471
00:23:55.160 --> 00:23:57.079
<v Speaker 2>The cookbook used Fashion-MNIST.

472
00:23:57.039 --> 00:23:59.559
<v Speaker 1>Learns the underlying probability of the data, not just a fixed

473
00:23:59.640 --> 00:24:05.240
<v Speaker 1>representation. Interesting. Last one: restricted Boltzmann machines, RBMs, for reconstructing

474
00:24:05.279 --> 00:24:06.079
<v Speaker 1>Bangla MNIST.

475
00:24:06.440 --> 00:24:10.920
<v Speaker 2>RBMs are older unsupervised models, important historically in deep learning.

476
00:24:11.200 --> 00:24:16.400
<v Speaker 2>They learn representations, often of binary data, trained with contrastive divergence.

477
00:24:16.640 --> 00:24:20.160
<v Speaker 2>They can capture data patterns and reconstruct noisy inputs. The

478
00:24:20.200 --> 00:24:23.839
<v Speaker 2>example showed reconstructing Bangla digits and visualizing the learned features.

479
00:24:24.079 --> 00:24:26.880
<v Speaker 1>Fascinating to see these earlier foundational techniques too.

480
00:24:27.039 --> 00:24:30.200
<v Speaker 2>Absolutely. Wow, we've really covered a lot of ground, from

481
00:24:31.000 --> 00:24:33.920
<v Speaker 2>artistic filters and making images clearer all the way to

482
00:24:34.240 --> 00:24:39.920
<v Speaker 2>complex detection, segmentation, classification, and even generating totally new images

483
00:24:39.920 --> 00:24:40.480
<v Speaker 2>with Python.

484
00:24:40.680 --> 00:24:43.160
<v Speaker 1>It really is incredible how these algorithms can learn to

485
00:24:43.200 --> 00:24:47.119
<v Speaker 1>perceive and even manipulate images, sometimes better than we can.

486
00:24:47.440 --> 00:24:50.359
<v Speaker 1>And just the sheer number of different approaches to similar problems.

487
00:24:50.160 --> 00:24:51.759
<v Speaker 1>It's kind of amazing, isn't it.

488
00:24:51.759 --> 00:24:53.920
<v Speaker 2>It truly is a really dynamic field. We hope this

489
00:24:53.920 --> 00:24:56.599
<v Speaker 2>deep dive sparks some ideas for you. That Python Image

490
00:24:56.599 --> 00:25:00.359
<v Speaker 2>Processing Cookbook has way more detail and code, of course.

491
00:25:00.319 --> 00:25:02.920
<v Speaker 1>Yeah, definitely check out the resources if something caught your eye. Makes

492
00:25:02.960 --> 00:25:06.400
<v Speaker 1>you wonder, what kind of image challenges or maybe creative

493
00:25:06.400 --> 00:25:08.759
<v Speaker 1>projects could you tackle with these kinds of tools? What's

494
00:25:08.799 --> 00:25:11.680
<v Speaker 1>the next visual puzzle you might want to solve?
