WEBVTT

1
00:00:00.160 --> 00:00:03.040
<v Speaker 1>Welcome to the deep dive. Today we're plunging into Neural

2
00:00:03.080 --> 00:00:07.280
<v Speaker 1>Network Programming with Java by Fábio M. Soares and Alan M. F. Souza.

3
00:00:07.400 --> 00:00:09.800
<v Speaker 2>Yeah, and this book, it's not just code, is it?

4
00:00:09.800 --> 00:00:12.800
<v Speaker 2>It really goes deep into the fundamentals. Exactly.

5
00:00:12.960 --> 00:00:16.079
<v Speaker 1>It's about how these, well, intelligent systems are built, how

6
00:00:16.120 --> 00:00:20.440
<v Speaker 1>they learn. Our mission today: extract the key insights.

7
00:00:20.600 --> 00:00:25.239
<v Speaker 2>Uh huh, those surprising bits, the aha moments. We want

8
00:00:25.280 --> 00:00:27.879
<v Speaker 2>to give you a real shortcut to understanding these networks.

9
00:00:27.920 --> 00:00:31.600
<v Speaker 1>We'll make the complex stuff digestible, engaging, show you what

10
00:00:31.640 --> 00:00:33.119
<v Speaker 1>they are, but also why they work.

11
00:00:33.039 --> 00:00:35.320
<v Speaker 2>And the incredible things they can actually do out

12
00:00:35.320 --> 00:00:36.159
<v Speaker 2>there in the real world.

13
00:00:36.280 --> 00:00:40.359
<v Speaker 1>Okay, so neural networks, these artificial brains. Where do we

14
00:00:40.399 --> 00:00:43.320
<v Speaker 1>even start? What's the core idea? How can we even

15
00:00:43.399 --> 00:00:44.880
<v Speaker 1>build something based on a brain?

16
00:00:45.240 --> 00:00:47.399
<v Speaker 2>It's fascinating, actually. You have to go way back, like

17
00:00:47.479 --> 00:00:51.679
<v Speaker 2>the 1940s. Really, that early? Yeah. A neurophysiologist, Warren

18
00:00:51.759 --> 00:00:56.039
<v Speaker 2>McCulloch, and a mathematician, Walter Pitts. They created the first

19
00:00:56.280 --> 00:00:58.960
<v Speaker 2>mathematical model of an artificial neuron.

20
00:00:58.479 --> 00:01:00.280
<v Speaker 1>Inspired by the real thing?

21
00:01:00.560 --> 00:01:04.120
<v Speaker 2>Absolutely. They saw the natural neuron as a kind of

22
00:01:04.560 --> 00:01:07.840
<v Speaker 2>simple processor. It sums up signals, decides whether to fire,

23
00:01:08.120 --> 00:01:12.239
<v Speaker 2>and propagates the signal onward. That basic idea was the spark.

24
00:01:12.840 --> 00:01:16.519
<v Speaker 1>Biological simplicity leading to, well, technological complexity.

25
00:01:16.680 --> 00:01:17.120
<v Speaker 2>You got it.

26
00:01:17.200 --> 00:01:20.920
<v Speaker 1>So, building on that, these artificial networks have some key parts, right?

27
00:01:21.000 --> 00:01:22.640
<v Speaker 1>The building blocks. Definitely.

28
00:01:23.239 --> 00:01:25.519
<v Speaker 2>First up, the artificial neuron itself.

29
00:01:25.640 --> 00:01:27.920
<v Speaker 1>The basic processing unit. Exactly.

30
00:01:27.680 --> 00:01:30.319
<v Speaker 2>Takes multiple inputs, kind of like dendrites.

31
00:01:29.920 --> 00:01:31.200
<v Speaker 1>Aggregates them, right?

32
00:01:31.040 --> 00:01:34.000
<v Speaker 2>Sums them up, and then produces a single output, like

33
00:01:34.040 --> 00:01:36.560
<v Speaker 2>an axon firing, based on some internal logic.

34
00:01:36.719 --> 00:01:39.400
<v Speaker 1>Okay, makes sense. But the connections matter too.

35
00:01:39.319 --> 00:01:41.719
<v Speaker 2>Oh, hugely. That's where the weights come in. They're the

36
00:01:41.719 --> 00:01:42.879
<v Speaker 2>connections between neurons.

37
00:01:42.599 --> 00:01:44.040
<v Speaker 1>Not just wires, though?

38
00:01:44.200 --> 00:01:48.879
<v Speaker 2>No, no, they amplify or reduce the signals passing through.

39
00:01:49.159 --> 00:01:50.959
<v Speaker 2>They multiply the input signal.

40
00:01:50.719 --> 00:01:53.840
<v Speaker 1>And that's where the learning happens, adjusting these weights? Precisely.

41
00:01:54.239 --> 00:01:56.480
<v Speaker 2>The weights essentially store the network's knowledge.

42
00:01:56.640 --> 00:02:01.079
<v Speaker 1>But is that enough, just weights and neurons? Feels like

43
00:02:01.120 --> 00:02:02.680
<v Speaker 1>something's missing for real complexity.

44
00:02:02.920 --> 00:02:06.840
<v Speaker 2>You're right. You need bias and those crucial activation functions.

45
00:02:07.000 --> 00:02:08.439
<v Speaker 1>Okay, tell me about bias first.

46
00:02:08.800 --> 00:02:11.439
<v Speaker 2>Bias is like an extra input, always set to one,

47
00:02:11.639 --> 00:02:14.599
<v Speaker 2>with its own weight. It adds a constant value to

48
00:02:14.639 --> 00:02:17.360
<v Speaker 2>the sum before the activation function kicks in.

49
00:02:17.560 --> 00:02:18.639
<v Speaker 1>Why? What does that do?

50
00:02:18.960 --> 00:02:22.319
<v Speaker 2>It gives the neuron more flexibility. It basically shifts the

51
00:02:22.319 --> 00:02:26.599
<v Speaker 2>activation threshold, allowing the network to model more complex relationships,

52
00:02:26.759 --> 00:02:30.240
<v Speaker 2>stuff that doesn't necessarily pass through the origin. Helps handle

53
00:02:30.280 --> 00:02:31.439
<v Speaker 2>nonlinear stuff better.

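NOTE
A minimal Java sketch of the neuron described so far: inputs multiplied by weights, a bias input fixed at one with its own weight, then an activation function. This is an illustration, not the book's Neuron class. The names NeuronSketch, weightedSum and output, and the sample numbers, are assumptions.
```java
import java.util.function.DoubleUnaryOperator;
// Illustrative only: NeuronSketch and its method names are assumptions, not the book's Neuron class.
final class NeuronSketch {
    // Weighted sum of the inputs plus the bias term (bias input fixed at 1.0, scaled by its own weight).
    static double weightedSum(double[] inputs, double[] weights, double biasWeight) {
        double sum = 1.0 * biasWeight;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum;
    }
    // Neuron output: the activation function applied to the weighted sum.
    static double output(double[] inputs, double[] weights, double biasWeight, DoubleUnaryOperator activation) {
        return activation.applyAsDouble(weightedSum(inputs, weights, biasWeight));
    }
    public static void main(String[] args) {
        DoubleUnaryOperator sigmoid = z -> 1.0 / (1.0 + Math.exp(-z));
        double[] inputs = {1.0, 0.5};
        double[] weights = {0.4, -0.2};
        System.out.println(output(inputs, weights, 0.1, sigmoid));
    }
}
```
Changing the bias weight from 0.1 to 0.0 in main shifts the weighted sum before the sigmoid is applied, which is the threshold shift described above.
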
54
00:02:31.599 --> 00:02:34.120
<v Speaker 1>Got it. And activation functions, you said they're crucial. The

55
00:02:34.120 --> 00:02:36.280
<v Speaker 1>book mentions sigmoid, tanh.

56
00:02:36.439 --> 00:02:40.560
<v Speaker 2>Right, hyperbolic tangent, also purely linear functions. But the key

57
00:02:40.599 --> 00:02:43.479
<v Speaker 2>insight here is the nonlinear ones.

58
00:02:42.879 --> 00:02:44.199
<v Speaker 1>Like sigmoid and tanh?

59
00:02:44.439 --> 00:02:48.800
<v Speaker 2>Exactly. Without nonlinearity, even a deep network, a multilayer

60
00:02:48.879 --> 00:02:51.919
<v Speaker 2>one, would just be doing a sequence of linear operations.

61
00:02:51.960 --> 00:02:56.000
<v Speaker 2>It means it could only solve linear problems. Nonlinear activation

62
00:02:56.080 --> 00:02:59.439
<v Speaker 2>functions let the network learn really complex, curved boundaries in

63
00:02:59.479 --> 00:03:02.680
<v Speaker 2>the data. Think image recognition, that's inherently nonlinear.

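NOTE
A small Java sketch of why stacked linear layers stay linear: two linear neurons in sequence compute exactly what one linear neuron with combined weights computes, while a tanh hidden neuron breaks that equivalence. The class name, method names and scalar setup are assumptions chosen for brevity.
```java
// Illustrative only: a scalar toy, not a full network.
final class LinearCollapseSketch {
    // Two stacked neurons with a purely linear activation.
    static double twoLinearLayers(double x, double w1, double b1, double w2, double b2) {
        double hidden = w1 * x + b1;
        return w2 * hidden + b2;
    }
    // The same mapping written as a single linear neuron with merged weight and bias.
    static double oneLinearLayer(double x, double w1, double b1, double w2, double b2) {
        return (w2 * w1) * x + (w2 * b1 + b2);
    }
    // Same stack, but with tanh as the hidden activation, so it no longer collapses.
    static double withTanh(double x, double w1, double b1, double w2, double b2) {
        return w2 * Math.tanh(w1 * x + b1) + b2;
    }
    public static void main(String[] args) {
        for (double x = -2.0; x <= 2.0; x += 1.0) {
            System.out.println(x + ": stacked=" + twoLinearLayers(x, 0.5, 0.25, 2.0, -1.0)
                    + " merged=" + oneLinearLayer(x, 0.5, 0.25, 2.0, -1.0)
                    + " tanh=" + withTanh(x, 0.5, 0.25, 2.0, -1.0));
        }
    }
}
```
The stacked and merged columns agree for every x, which is the sense in which depth without nonlinearity adds nothing.
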
64
00:03:02.759 --> 00:03:06.599
<v Speaker 1>Okay, so that nonlinearity is the secret sauce for handling complexity.

65
00:03:06.199 --> 00:03:07.560
<v Speaker 2>A big part of it.

66
00:03:07.639 --> 00:03:10.800
<v Speaker 1>Yeah. And these neurons, they aren't just, you know, floating around.

67
00:03:11.120 --> 00:03:12.919
<v Speaker 1>They're organized into layers.

68
00:03:12.960 --> 00:03:16.039
<v Speaker 2>Correct. You have an input layer where data comes in, an

69
00:03:16.039 --> 00:03:19.520
<v Speaker 2>output layer where the result comes out, and in between

70
00:03:19.680 --> 00:03:22.280
<v Speaker 2>potentially one or more hidden layers.

71
00:03:22.360 --> 00:03:24.080
<v Speaker 1>Hidden layers sound important.

72
00:03:23.800 --> 00:03:26.000
<v Speaker 2>They really are. They allow the network to build up

73
00:03:26.039 --> 00:03:30.120
<v Speaker 2>layers of abstraction. It learns intermediate features, representations of the

74
00:03:30.199 --> 00:03:33.639
<v Speaker 2>data that aren't obvious in the raw input but are

75
00:03:33.719 --> 00:03:38.159
<v Speaker 2>useful for the final task. That's where complex knowledge gets represented.

76
00:03:37.719 --> 00:03:40.479
<v Speaker 1>Like building its own internal understanding. Kind of, yeah. And

77
00:03:40.520 --> 00:03:44.400
<v Speaker 1>how these layers and neurons are arranged, that gives different architectures.

78
00:03:44.560 --> 00:03:47.520
<v Speaker 2>Yep. Simple ones are monolayer, just input and output.

79
00:03:47.879 --> 00:03:50.840
<v Speaker 2>More complex are multilayer, with those hidden layers. Okay.

80
00:03:51.159 --> 00:03:54.000
<v Speaker 2>Then there's how the signal flows. Feedforward is the

81
00:03:54.000 --> 00:03:57.960
<v Speaker 2>basic type: signal goes one way, input to output. Straightforward.

82
00:03:58.000 --> 00:04:00.639
<v Speaker 2>But then you have feedback networks, or recurrent networks.

83
00:04:00.719 --> 00:04:04.360
<v Speaker 1>Recurrent, meaning the signal can loop back? Exactly.

84
00:04:05.039 --> 00:04:08.039
<v Speaker 2>Outputs from neurons can be fed back as inputs to

85
00:04:08.080 --> 00:04:12.000
<v Speaker 2>neurons in the same or earlier layers. This introduces memory,

86
00:04:12.199 --> 00:04:12.759
<v Speaker 2>a sense of time.

87
00:04:12.719 --> 00:04:16.920
<v Speaker 1>Useful for sequences, time series data?

88
00:04:16.560 --> 00:04:21.439
<v Speaker 2>Perfect for that, pattern recognition over time. But the catch

89
00:04:22.120 --> 00:04:25.759
<v Speaker 2>is they're significantly harder to train. Why is that? Because

90
00:04:25.759 --> 00:04:30.079
<v Speaker 2>the network's state depends on its previous states. That feedback

91
00:04:30.120 --> 00:04:34.959
<v Speaker 2>loop complicates the learning process, tracking how errors should propagate back.

92
00:04:35.160 --> 00:04:37.920
<v Speaker 1>Right, that makes sense. So we have these components, these

93
00:04:38.000 --> 00:04:42.560
<v Speaker 1>architectures, these artificial brains. But how do they actually learn?

94
00:04:42.639 --> 00:04:43.600
<v Speaker 1>What's the mechanism?

95
00:04:44.000 --> 00:04:48.519
<v Speaker 2>Well, fundamentally, learning is about adjusting those weights, systematically changing

96
00:04:48.519 --> 00:04:51.399
<v Speaker 2>the connection strengths based on experience, based on data. Yeah.

97
00:04:51.720 --> 00:04:54.959
<v Speaker 2>But what's really fascinating is the distributed nature of this intelligence.

98
00:04:55.079 --> 00:04:58.600
<v Speaker 2>Meaning it's spread out? Exactly. It's not one central brain

99
00:04:58.680 --> 00:05:02.279
<v Speaker 2>part holding all the knowledge. It's across potentially millions or

100
00:05:02.279 --> 00:05:05.879
<v Speaker 2>billions of tiny connections, each weight holding a small piece.

101
00:05:05.720 --> 00:05:08.639
<v Speaker 1>So it's robust. Losing a few connections isn't catastrophic.

102
00:05:08.759 --> 00:05:12.079
<v Speaker 2>Generally, yes, very robust compared to traditional programs where one

103
00:05:12.160 --> 00:05:15.439
<v Speaker 2>error can crash everything. And this distributed learning helps them

104
00:05:15.480 --> 00:05:17.120
<v Speaker 2>generalize well to new data.

105
00:05:17.199 --> 00:05:20.480
<v Speaker 1>Okay, and the book talks about two main ways they learn,

106
00:05:20.560 --> 00:05:24.040
<v Speaker 1>two paradigms. First is supervised learning.

107
00:05:23.439 --> 00:05:26.600
<v Speaker 2>Right, learning with a teacher. Essentially, you give the

108
00:05:26.639 --> 00:05:30.920
<v Speaker 2>network an input X and the correct output Y you want it to produce.

109
00:05:30.680 --> 00:05:32.839
<v Speaker 1>Labeled data? Exactly.

110
00:05:33.160 --> 00:05:36.600
<v Speaker 2>The network makes a prediction, compares it to the target Y,

111
00:05:37.240 --> 00:05:40.759
<v Speaker 2>calculates the error, and then uses that error to adjust its weights.

112
00:05:40.480 --> 00:05:43.800
<v Speaker 1>So it learns to map X to Y? Precisely.

113
00:05:44.240 --> 00:05:47.160
<v Speaker 2>This is great for things like image classification. Here's a

114
00:05:47.199 --> 00:05:49.399
<v Speaker 2>picture, tell if it's a cat or a dog. Or

115
00:05:49.480 --> 00:05:53.600
<v Speaker 2>speech recognition, forecasting, tasks where you know the right answer

116
00:05:53.680 --> 00:05:54.279
<v Speaker 2>during training.

117
00:05:54.480 --> 00:05:57.399
<v Speaker 1>Okay, supervised is learning from examples. What's the other type?

118
00:05:57.279 --> 00:06:01.040
<v Speaker 2>Unsupervised learning. Here there's no teacher, no labels. You just

119
00:06:01.040 --> 00:06:03.639
<v Speaker 2>give the network the input data X, and it has

120
00:06:03.639 --> 00:06:07.079
<v Speaker 2>to figure things out on its own, find hidden structures, patterns, correlations,

121
00:06:07.360 --> 00:06:08.800
<v Speaker 2>group similar data points together.

122
00:06:08.639 --> 00:06:13.120
<v Speaker 1>So discovering patterns rather than predicting known answers? Exactly.

123
00:06:13.560 --> 00:06:17.199
<v Speaker 2>Think clustering, grouping customers based on purchasing habits without knowing

124
00:06:17.279 --> 00:06:21.319
<v Speaker 2>the groups beforehand, or data compression, finding efficient ways to

125
00:06:21.360 --> 00:06:22.360
<v Speaker 2>represent the information.

126
00:06:22.720 --> 00:06:26.040
<v Speaker 1>That sounds powerful for exploration. It really is.

127
00:06:26.199 --> 00:06:28.399
<v Speaker 2>Discovering insights you didn't even know to look for.

128
00:06:28.680 --> 00:06:32.319
<v Speaker 1>So in both cases there's a learning algorithm driving this

129
00:06:32.439 --> 00:06:33.800
<v Speaker 1>weight adjustment.

130
00:06:33.360 --> 00:06:37.199
<v Speaker 2>Yes, a systematic procedure. The goal is usually to minimize

131
00:06:37.199 --> 00:06:39.519
<v Speaker 2>a cost function, which is just a mathematical way of

132
00:06:39.560 --> 00:06:41.800
<v Speaker 2>measuring the total error the network is making.

133
00:06:41.920 --> 00:06:44.199
<v Speaker 1>And a key part of this is splitting the data.

134
00:06:44.759 --> 00:06:47.240
<v Speaker 1>Training and testing? Absolutely crucial.

135
00:06:47.720 --> 00:06:50.560
<v Speaker 2>You train the network on one set of data, but

136
00:06:50.680 --> 00:06:53.759
<v Speaker 2>you evaluate its real performance on a separate set it's

137
00:06:53.839 --> 00:06:55.240
<v Speaker 2>never seen before, the test set.

138
00:06:55.240 --> 00:06:56.720
<v Speaker 1>Why separate them?

139
00:06:56.519 --> 00:07:00.160
<v Speaker 2>To prevent overtraining, or overfitting. Yeah, that's when the network

140
00:07:00.319 --> 00:07:03.639
<v Speaker 2>basically just memorizes the training examples, noise and all.

141
00:07:03.600 --> 00:07:05.000
<v Speaker 1>Like cramming for a test?

142
00:07:05.199 --> 00:07:07.720
<v Speaker 2>Exactly. It does great on the stuff it memorized, but

143
00:07:07.839 --> 00:07:10.800
<v Speaker 2>fails miserably on new questions because it didn't learn the

144
00:07:10.879 --> 00:07:15.319
<v Speaker 2>underlying concepts. Testing on unseen data checks for that generalization ability.

145
00:07:15.360 --> 00:07:17.720
<v Speaker 1>Okay. And there are knobs to tune in this learning

146
00:07:17.759 --> 00:07:19.759
<v Speaker 1>process, parameters?

147
00:07:19.279 --> 00:07:21.720
<v Speaker 2>Oh yeah. A big one is the learning rate, usually

148
00:07:21.759 --> 00:07:24.720
<v Speaker 2>called eta. What does that control? It controls how much

149
00:07:24.759 --> 00:07:27.360
<v Speaker 2>the weights are adjusted in response to the error in

150
00:07:27.439 --> 00:07:28.000
<v Speaker 2>each step.

151
00:07:28.240 --> 00:07:30.959
<v Speaker 1>So like the size of the learning steps.

152
00:07:30.959 --> 00:07:34.120
<v Speaker 2>Kind of. Too high and you might overshoot the best

153
00:07:34.120 --> 00:07:37.920
<v Speaker 2>solution, bouncing around erratically. Too low and learning can be

154
00:07:37.959 --> 00:07:41.600
<v Speaker 2>incredibly slow, might get stuck. It's a balancing act. Makes sense.

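NOTE
A toy Java sketch of the learning-rate trade-off, assuming a one-weight cost f(w) = w * w (gradient 2 * w) in place of a real network error. The class name, the method descend and the three eta values are assumptions picked to show a sensible step, a too-large step and a too-small step.
```java
// Illustrative only: gradient descent on a one-dimensional toy cost, not a network trainer.
final class LearningRateSketch {
    // Repeats the update w = w - eta * gradient for the cost f(w) = w * w.
    static double descend(double startWeight, double eta, int steps) {
        double w = startWeight;
        for (int s = 0; s < steps; s++) {
            double gradient = 2.0 * w;
            w -= eta * gradient; // the learning rate eta scales every step
        }
        return w;
    }
    public static void main(String[] args) {
        System.out.println("eta=0.1:   " + descend(1.0, 0.1, 20));   // shrinks toward the minimum at 0
        System.out.println("eta=1.1:   " + descend(1.0, 1.1, 20));   // overshoots and grows in magnitude
        System.out.println("eta=0.001: " + descend(1.0, 0.001, 20)); // barely moves in 20 steps
    }
}
```
Each step multiplies w by (1 - 2 * eta), so any eta above 1.0 makes the weight swing further from the minimum on this particular cost.
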
155
00:07:42.000 --> 00:07:45.000
<v Speaker 1>And how does the network know when to stop learning?

156
00:07:45.399 --> 00:07:48.399
<v Speaker 2>Those are the stopping conditions. Could be a maximum number

157
00:07:48.399 --> 00:07:50.879
<v Speaker 2>of training cycles, called epochs.

158
00:07:50.480 --> 00:07:53.480
<v Speaker 1>Epochs meaning passes through the whole training data set?

159
00:07:53.399 --> 00:07:56.319
<v Speaker 2>Right. Or you might stop when the error on the

160
00:07:56.360 --> 00:07:59.959
<v Speaker 2>training set or maybe a separate validation set drops below

161
00:08:00.279 --> 00:08:05.240
<v Speaker 2>a certain target threshold, or when the error stops improving significantly.

162
00:08:04.800 --> 00:08:07.720
<v Speaker 1>Setting the goalposts for its education? Pretty much, yeah. So

163
00:08:07.800 --> 00:08:10.480
<v Speaker 1>let's get concrete. The book talks about some early algorithms.

164
00:08:10.759 --> 00:08:14.560
<v Speaker 2>The perceptron, the simplest one, really. It updates weights based

165
00:08:14.600 --> 00:08:16.800
<v Speaker 2>directly on the output error and the learning rate.

166
00:08:17.199 --> 00:08:19.879
<v Speaker 1>Super basic, but it has limits, right? You mentioned that earlier.

167
00:08:19.920 --> 00:08:22.399
<v Speaker 2>Big limits. This raises the really important question of what

168
00:08:22.519 --> 00:08:23.199
<v Speaker 2>can't it do?

169
00:08:23.639 --> 00:08:26.920
<v Speaker 1>And the classic example is the XOR problem. Yes.

170
00:08:26.800 --> 00:08:30.079
<v Speaker 2>Exactly, exclusive OR. If you plot the inputs and

171
00:08:30.079 --> 00:08:33.600
<v Speaker 2>outputs for XOR on a 2D graph, you have

172
00:08:33.720 --> 00:08:37.240
<v Speaker 2>points at (0,0), which maps to zero, (0,1), which maps to one,

173
00:08:37.279 --> 00:08:38.600
<v Speaker 2>(1,0), which maps to one, and (1,1), which maps to zero.

174
00:08:38.440 --> 00:08:41.120
<v Speaker 1>Right, two classes, zero and one.

175
00:08:41.279 --> 00:08:44.799
<v Speaker 2>Try drawing a single straight line to perfectly separate the

176
00:08:44.879 --> 00:08:45.799
<v Speaker 2>zeros from the ones.

177
00:08:45.960 --> 00:08:46.440
<v Speaker 1>You can't.

178
00:08:46.519 --> 00:08:50.000
<v Speaker 2>You absolutely cannot, And that's the perceptron's limitation. It can

179
00:08:50.000 --> 00:08:53.279
<v Speaker 2>only learn problems that are linearly separable, problems where you

180
00:08:53.279 --> 00:08:57.039
<v Speaker 2>can draw that single line, or a plane in higher dimensions.

181
00:08:56.600 --> 00:08:59.080
<v Speaker 1>Like an AND gate, that's linearly separable. The

182
00:08:59.080 --> 00:09:01.240
<v Speaker 1>book uses a warning system example for that.

183
00:09:01.320 --> 00:09:03.440
<v Speaker 2>Right, if sensor A and sensor B are on,

184
00:09:03.679 --> 00:09:08.159
<v Speaker 2>trigger the alarm. A perceptron can learn that easily, but XOR?

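NOTE
A minimal Java sketch of the perceptron rule discussed above, with both stopping conditions (an error-free epoch, or a maximum number of epochs). It learns the AND gate but never reaches an error-free epoch on XOR. PerceptronSketch, its methods, the step activation, eta = 0.5 and the 100-epoch cap are assumptions, not the book's code.
```java
// Illustrative only: PerceptronSketch is a hypothetical class, not code from the book.
final class PerceptronSketch {
    private final double[] weights = {0.0, 0.0};
    private double bias = 0.0;
    // Step activation: fire (1) only when the weighted sum is positive.
    int predict(double[] x) {
        double sum = bias;
        for (int i = 0; i < x.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum > 0 ? 1 : 0;
    }
    // Perceptron rule: w += eta * (target - output) * x. Returns the epoch with zero errors, or -1 if maxEpochs is hit first.
    int train(double[][] inputs, int[] targets, double eta, int maxEpochs) {
        for (int epoch = 1; epoch <= maxEpochs; epoch++) {
            int errors = 0;
            for (int n = 0; n < inputs.length; n++) {
                int error = targets[n] - predict(inputs[n]);
                if (error != 0) {
                    errors++;
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] += eta * error * inputs[n][i];
                    }
                    bias += eta * error;
                }
            }
            if (errors == 0) {
                return epoch;
            }
        }
        return -1;
    }
    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        System.out.println("AND stops at epoch: " + new PerceptronSketch().train(inputs, new int[] {0, 0, 0, 1}, 0.5, 100));
        System.out.println("XOR stops at epoch: " + new PerceptronSketch().train(inputs, new int[] {0, 1, 1, 0}, 0.5, 100));
    }
}
```
A return value of -1 for XOR means the epoch limit was the only stopping condition that ever triggered, which is the linear-separability limit in practice.
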
185
00:09:07.679 --> 00:09:10.000
<v Speaker 1>So a step up from the basic perceptron was the

186
00:09:10.039 --> 00:09:10.639
<v Speaker 1>delta rule.

187
00:09:10.720 --> 00:09:14.279
<v Speaker 2>Yeah, an improvement. It takes the activation function's nonlinearity

188
00:09:14.279 --> 00:09:18.000
<v Speaker 2>into account, specifically its derivative, when calculating the weight updates.

189
00:09:18.240 --> 00:09:21.840
<v Speaker 2>It's a bit more sophisticated, uses gradient descent, conceptually.

190
00:09:21.399 --> 00:09:23.879
<v Speaker 1>But still fundamentally limited to single layers?

191
00:09:23.960 --> 00:09:25.799
<v Speaker 2>Mostly, yeah. Still struggles with things like XOR.

192
00:09:25.879 --> 00:09:28.159
<v Speaker 1>So here's where it gets really interesting, right? How did

193
00:09:28.159 --> 00:09:30.080
<v Speaker 1>they crack problems like XOR?

194
00:09:30.440 --> 00:09:35.360
<v Speaker 2>The breakthrough was multilayer perceptrons, or MLPs, adding those hidden layers.

195
00:09:35.480 --> 00:09:36.159
<v Speaker 1>That was the key.

196
00:09:36.240 --> 00:09:39.039
<v Speaker 2>That was the revolutionary idea. By adding one or more

197
00:09:39.120 --> 00:09:42.360
<v Speaker 2>hidden layers between the input and output, the network gains

198
00:09:42.399 --> 00:09:45.320
<v Speaker 2>the ability to learn nonlinear decision boundaries.

199
00:09:45.639 --> 00:09:47.759
<v Speaker 1>How? What do the hidden layers do?

200
00:09:48.000 --> 00:09:51.000
<v Speaker 2>They essentially learn to transform the input data into a

201
00:09:51.039 --> 00:09:56.399
<v Speaker 2>new representation. In this new hidden space, the problem can

202
00:09:56.440 --> 00:10:02.159
<v Speaker 2>become linearly separable. The hidden layer learns useful intermediate features.

203
00:10:01.759 --> 00:10:04.200
<v Speaker 1>So it finds its own way to make the problem solvable.

204
00:10:04.320 --> 00:10:08.320
<v Speaker 2>Exactly, it learns abstractions. For XOR, a hidden layer can

205
00:10:08.360 --> 00:10:12.120
<v Speaker 2>create internal representations that allow a final output layer to

206
00:10:12.240 --> 00:10:13.559
<v Speaker 2>draw that separating line.

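NOTE
A Java sketch of how one hidden layer makes XOR solvable. The weights here are set by hand (an assumption for illustration; a trained network would learn its own values): one hidden neuron acts like OR, the other like AND, and the output neuron separates "OR but not AND" with a single threshold. All names are hypothetical.
```java
// Illustrative only: hand-picked weights, not learned ones.
final class XorHiddenLayerSketch {
    // Step activation: 1 when the weighted sum is positive, else 0.
    static int step(double z) {
        return z > 0 ? 1 : 0;
    }
    static int xor(int x1, int x2) {
        int h1 = step(x1 + x2 - 0.5); // hidden neuron 1 behaves like OR
        int h2 = step(x1 + x2 - 1.5); // hidden neuron 2 behaves like AND
        return step(h1 - h2 - 0.5);   // output fires for "OR but not AND"
    }
    public static void main(String[] args) {
        for (int x1 = 0; x1 <= 1; x1++) {
            for (int x2 = 0; x2 <= 1; x2++) {
                System.out.println(x1 + " XOR " + x2 + " = " + xor(x1, x2));
            }
        }
    }
}
```
In the hidden space (h1, h2) the four inputs land on (0,0), (1,0), (1,0) and (1,1), and those points can be split by one straight line.
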
207
00:10:13.759 --> 00:10:16.879
<v Speaker 1>Metaphorically speaking. Wow. But how do you train these? If

208
00:10:16.919 --> 00:10:19.799
<v Speaker 1>the hidden layers aren't directly connected to the final error,

209
00:10:20.320 --> 00:10:21.720
<v Speaker 1>how do their weights get updated?

210
00:10:22.159 --> 00:10:26.399
<v Speaker 2>Ah, that's where the truly game-changing algorithm comes in: backpropagation.

211
00:10:26.519 --> 00:10:29.480
<v Speaker 2>The famous backprop? That's the one. It calculates the error

212
00:10:29.480 --> 00:10:32.440
<v Speaker 2>at the output layer, just like before, but then it

213
00:10:32.519 --> 00:10:35.080
<v Speaker 2>propagates that error backwards, layer by layer.

214
00:10:34.919 --> 00:10:36.039
<v Speaker 1>Back through the hidden layers?

215
00:10:36.279 --> 00:10:40.879
<v Speaker 2>Yes, it uses the chain rule from calculus, essentially, to

216
00:10:40.960 --> 00:10:45.399
<v Speaker 2>figure out how much each weight in every layer, including

217
00:10:45.440 --> 00:10:47.360
<v Speaker 2>the hidden ones, contributed to the final error.

218
00:10:47.279 --> 00:10:49.679
<v Speaker 1>And then adjusts them accordingly?

219
00:10:49.759 --> 00:10:53.519
<v Speaker 2>Precisely. It allows the entire network, all the connections, to

220
00:10:53.720 --> 00:10:57.120
<v Speaker 2>learn in a coordinated way based on the final output error.

221
00:10:57.440 --> 00:11:00.519
<v Speaker 2>It's what made training deep, complex networks feasible.

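NOTE
A Java sketch of the chain-rule step at the heart of backpropagation, for a tiny 2-2-1 sigmoid network with squared error. It computes the gradient for one hidden weight analytically and compares it with a finite-difference estimate. The network size, the omission of biases, and every name here are assumptions made to keep the example short.
```java
// Illustrative only: one gradient, checked numerically; not a full trainer.
final class BackpropSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }
    // Forward pass of a 2-2-1 network (biases omitted for brevity); returns {h0, h1, y}.
    static double[] forward(double[] x, double[][] wHidden, double[] wOut) {
        double h0 = sigmoid(wHidden[0][0] * x[0] + wHidden[0][1] * x[1]);
        double h1 = sigmoid(wHidden[1][0] * x[0] + wHidden[1][1] * x[1]);
        double y = sigmoid(wOut[0] * h0 + wOut[1] * h1);
        return new double[] {h0, h1, y};
    }
    static double loss(double[] x, double target, double[][] wHidden, double[] wOut) {
        double y = forward(x, wHidden, wOut)[2];
        return 0.5 * (y - target) * (y - target);
    }
    // Chain rule: gradient of the loss with respect to hidden weight wHidden[j][i].
    static double hiddenGradient(double[] x, double target, double[][] wHidden, double[] wOut, int j, int i) {
        double[] a = forward(x, wHidden, wOut);
        double y = a[2];
        double deltaOut = (y - target) * y * (1.0 - y);                 // error signal at the output neuron
        double deltaHidden = deltaOut * wOut[j] * a[j] * (1.0 - a[j]);  // propagated back to hidden neuron j
        return deltaHidden * x[i];
    }
    // Central finite difference on the same weight, used only as a cross-check.
    static double numericGradient(double[] x, double target, double[][] wHidden, double[] wOut, int j, int i, double eps) {
        double original = wHidden[j][i];
        wHidden[j][i] = original + eps;
        double up = loss(x, target, wHidden, wOut);
        wHidden[j][i] = original - eps;
        double down = loss(x, target, wHidden, wOut);
        wHidden[j][i] = original;
        return (up - down) / (2.0 * eps);
    }
    public static void main(String[] args) {
        double[] x = {0.5, -1.0};
        double[][] wHidden = {{0.1, 0.2}, {-0.3, 0.4}};
        double[] wOut = {0.7, -0.6};
        System.out.println("analytic: " + hiddenGradient(x, 1.0, wHidden, wOut, 1, 0));
        System.out.println("numeric:  " + numericGradient(x, 1.0, wHidden, wOut, 1, 0, 1e-6));
    }
}
```
A weight update would then subtract eta times this gradient; repeating that for every weight in every layer is the coordinated learning described above.
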
222
00:11:00.799 --> 00:11:04.399
<v Speaker 1>Powerful stuff. The book also mentions Levenberg-Marquardt.

223
00:11:04.840 --> 00:11:09.960
<v Speaker 2>Yeah, briefly. It's another, more complex optimization algorithm. It often converges

224
00:11:10.000 --> 00:11:13.159
<v Speaker 2>faster than basic backprop for smaller networks or certain types

225
00:11:13.159 --> 00:11:16.879
<v Speaker 2>of problems, but it's computationally more intensive. It's like a more

226
00:11:16.919 --> 00:11:19.440
<v Speaker 2>sophisticated engine for finding those optimal weights.

227
00:11:19.080 --> 00:11:23.360
<v Speaker 1>And thinking about implementation, the book uses Java. How

228
00:11:23.360 --> 00:11:24.799
<v Speaker 1>does it structure things?

229
00:11:24.879 --> 00:11:27.480
<v Speaker 2>It takes a nice object-oriented approach. You have classes

230
00:11:27.559 --> 00:11:28.600
<v Speaker 2>like Neuron, Layer, NeuralNet.

231
00:11:28.759 --> 00:11:31.360
<v Speaker 1>Modeling the concepts directly in code?

232
00:11:31.000 --> 00:11:35.279
<v Speaker 2>Exactly. Neuron objects have their weights, bias, activation function.

233
00:11:35.919 --> 00:11:38.559
<v Speaker 2>Layer objects group neurons. NeuralNet puts the layers together,

234
00:11:38.759 --> 00:11:41.240
<v Speaker 2>makes the theory very concrete and practical if you're coding

235
00:11:41.279 --> 00:11:41.519
<v Speaker 2>it up.

236
00:11:41.759 --> 00:11:45.120
<v Speaker 1>Cool. So we have these powerful MLPs trained with backprop.

237
00:11:45.720 --> 00:11:48.279
<v Speaker 1>What kinds of real world problems do they tackle? The

238
00:11:48.320 --> 00:11:50.480
<v Speaker 1>book mentions two main classes.

239
00:11:50.840 --> 00:11:53.840
<v Speaker 2>Right, broadly speaking, classification and regression.

240
00:11:53.919 --> 00:11:55.639
<v Speaker 1>Classification is putting things into categories?

241
00:11:55.480 --> 00:11:59.000
<v Speaker 2>Exactly, assigning an input record to one of several

242
00:11:59.039 --> 00:12:02.080
<v Speaker 2>predefined classes, like: is this email spam or not spam?

243
00:12:02.159 --> 00:12:05.720
<v Speaker 2>Is this tumor malignant or benign? Predicting a student's major

244
00:12:05.799 --> 00:12:07.120
<v Speaker 2>based on grades.

245
00:12:07.080 --> 00:12:08.879
<v Speaker 1>How does the network output work for that?

246
00:12:09.240 --> 00:12:12.799
<v Speaker 2>Multiple outputs? Often, yeah. You might have one output neuron

247
00:12:12.840 --> 00:12:16.600
<v Speaker 2>per class. The neuron with the highest activation wins and

248
00:12:16.679 --> 00:12:18.240
<v Speaker 2>determines the predicted class.

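NOTE
A short Java sketch of the "highest activation wins" rule for a network with one output neuron per class. The class name, method name and sample activations are assumptions.
```java
// Illustrative only: picks the index of the largest output activation.
final class ArgmaxSketch {
    static int predictedClass(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) {
                best = i;
            }
        }
        return best;
    }
    public static void main(String[] args) {
        double[] outputs = {0.10, 0.72, 0.18}; // hypothetical activations for three classes
        System.out.println("Predicted class index: " + predictedClass(outputs));
    }
}
```
On ties this version keeps the earliest index, which is one arbitrary but common convention.
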
249
00:12:18.279 --> 00:12:22.639
<v Speaker 1>And evaluating classification, you need specific metrics. The book mentions

250
00:12:22.679 --> 00:12:23.720
<v Speaker 1>confusion matrices.

251
00:12:23.879 --> 00:12:27.600
<v Speaker 2>Absolutely, a confusion matrix shows you not just the overall accuracy,

252
00:12:28.039 --> 00:12:30.559
<v Speaker 2>but what kind of errors the network is making. How

253
00:12:30.559 --> 00:12:33.840
<v Speaker 2>many actual positives were predicted as negative, false negatives? How

254
00:12:33.840 --> 00:12:36.480
<v Speaker 2>many actual negatives were predicted as positive, false positives?

255
00:12:36.200 --> 00:12:39.720
<v Speaker 1>Which leads to metrics like sensitivity and specificity.

256
00:12:40.080 --> 00:12:43.200
<v Speaker 2>Right. Sensitivity is the true positive rate, how well it

257
00:12:43.279 --> 00:12:47.960
<v Speaker 2>identifies actual positives. Specificity is the true negative rate, how

258
00:12:47.960 --> 00:12:51.759
<v Speaker 2>well it identifies actual negatives. Super important in medical diagnosis,

259
00:12:51.759 --> 00:12:53.360
<v Speaker 2>for example. You need to know both.

260
00:12:53.360 --> 00:12:56.440
<v Speaker 1>Makes sense. And the other class was regression?

261
00:12:57.039 --> 00:13:00.440
<v Speaker 2>Regression is about predicting a continuous numerical value, not

262
00:13:00.559 --> 00:13:01.279
<v Speaker 2>a category.

263
00:13:01.360 --> 00:13:05.360
<v Speaker 1>It's like predicting house prices or stock values? Exactly.

264
00:13:05.600 --> 00:13:08.879
<v Speaker 2>Finding a function that maps inputs to a number, predicting

265
00:13:08.919 --> 00:13:11.519
<v Speaker 2>best ticket prices based on route, time of day, et cetera.

266
00:13:11.759 --> 00:13:12.879
<v Speaker 2>That's a regression task.

267
00:13:13.039 --> 00:13:15.320
<v Speaker 1>The book gives some concrete examples, right?

268
00:13:15.080 --> 00:13:18.840
<v Speaker 2>Yeah, some good ones. A university enrollment status predictor, that's classification.

269
00:13:19.480 --> 00:13:23.279
<v Speaker 2>Takes gender, grades, predicts if they'll enroll. And the medical

270
00:13:23.320 --> 00:13:26.679
<v Speaker 2>ones, disease diagnosis. Specifically, they look at breast cancer and

271
00:13:26.720 --> 00:13:30.759
<v Speaker 2>diabetes data sets, using various medical inputs to predict the diagnosis.

272
00:13:30.799 --> 00:13:33.320
<v Speaker 2>Again, classic classification.

273
00:13:32.679 --> 00:13:35.679
<v Speaker 1>And they show how their Classification class helps analyze this

274
00:13:36.279 --> 00:13:38.679
<v Speaker 1>with those confusion matrices.

275
00:13:38.440 --> 00:13:43.759
<v Speaker 2>Yeah, it calculates the matrix, sensitivity, specificity, accuracy. Really helps

276
00:13:43.799 --> 00:13:46.919
<v Speaker 2>you understand the performance beyond just a single accuracy number.

277
00:13:47.320 --> 00:13:50.600
<v Speaker 2>It's fascinating seeing how networks find patterns in that complex

278
00:13:50.639 --> 00:13:51.519
<v Speaker 2>medical data.

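NOTE
A Java sketch of a binary confusion matrix with the three metrics mentioned: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / total. This is not the book's Classification class. The class name, the 0/1 label encoding and the sample labels are assumptions.
```java
// Illustrative only: labels are encoded as 1 (positive) and 0 (negative).
final class ConfusionMatrixSketch {
    final int truePositives;
    final int trueNegatives;
    final int falsePositives;
    final int falseNegatives;
    ConfusionMatrixSketch(int[] actual, int[] predicted) {
        int tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1 && predicted[i] == 1) tp++;
            else if (actual[i] == 0 && predicted[i] == 0) tn++;
            else if (actual[i] == 0 && predicted[i] == 1) fp++;
            else fn++;
        }
        truePositives = tp;
        trueNegatives = tn;
        falsePositives = fp;
        falseNegatives = fn;
    }
    // True positive rate: share of actual positives that were caught.
    double sensitivity() {
        return (double) truePositives / (truePositives + falseNegatives);
    }
    // True negative rate: share of actual negatives that were correctly cleared.
    double specificity() {
        return (double) trueNegatives / (trueNegatives + falsePositives);
    }
    double accuracy() {
        int total = truePositives + trueNegatives + falsePositives + falseNegatives;
        return (double) (truePositives + trueNegatives) / total;
    }
    public static void main(String[] args) {
        int[] actual = {1, 1, 1, 1, 0, 0, 0, 0, 0, 0};
        int[] predicted = {1, 1, 1, 0, 0, 0, 0, 0, 0, 1};
        ConfusionMatrixSketch m = new ConfusionMatrixSketch(actual, predicted);
        System.out.println("sensitivity=" + m.sensitivity() + " specificity=" + m.specificity() + " accuracy=" + m.accuracy());
    }
}
```
With these made-up labels the accuracy is 0.8 while sensitivity is only 0.75, which is the kind of gap a single accuracy number hides.
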
279
00:13:51.600 --> 00:13:56.159
<v Speaker 1>Definitely. Now, shifting gears slightly: what about that other learning paradigm,

280
00:13:56.320 --> 00:13:58.559
<v Speaker 1>unsupervised learning? Where does that shine?

281
00:13:58.759 --> 00:14:02.159
<v Speaker 2>Right. Unsupervised is about discovery, and a prime example the book

282
00:14:02.200 --> 00:14:07.679
<v Speaker 2>covers is self-organizing maps, or SOMs, also called Kohonen networks.

283
00:14:07.919 --> 00:14:09.240
<v Speaker 1>What's unique about SOMs?

284
00:14:09.559 --> 00:14:13.320
<v Speaker 2>They map high-dimensional input data onto a lower-dimensional grid,

285
00:14:13.679 --> 00:14:15.799
<v Speaker 2>usually 1D or 2D. They create a kind

286
00:14:15.840 --> 00:14:19.519
<v Speaker 2>of map where similar inputs activate neurons that are close

287
00:14:19.519 --> 00:14:20.600
<v Speaker 2>to each other on the map.

288
00:14:20.799 --> 00:14:23.600
<v Speaker 1>So it organizes the data visually? Pretty much.

289
00:14:23.840 --> 00:14:25.879
<v Speaker 2>It preserves the topology of the data. You get these

290
00:14:25.919 --> 00:14:29.240
<v Speaker 2>clusters forming naturally on the map, showing relationships in the data.

291
00:14:29.320 --> 00:14:31.559
<v Speaker 2>It's great for visualization and exploration.

292
00:14:31.720 --> 00:14:34.559
<v Speaker 1>How do they learn without labels? What's the mechanism?

293
00:14:34.639 --> 00:14:37.960
<v Speaker 2>It's based on competitive learning, sometimes called winner takes all, though it's a bit more nuanced.

294
00:14:37.799 --> 00:14:39.519
<v Speaker 1>Winner takes all?

295
00:14:39.639 --> 00:14:42.679
<v Speaker 2>When an input is presented, all neurons compute their output,

296
00:14:43.320 --> 00:14:46.919
<v Speaker 2>but only one winner neuron, the one whose weight vector

297
00:14:47.000 --> 00:14:50.320
<v Speaker 2>is closest to the input vector, gets strongly activated. Okay.

298
00:14:50.480 --> 00:14:53.440
<v Speaker 2>Then that winner neuron and its neighbors on the map

299
00:14:53.440 --> 00:14:56.440
<v Speaker 2>grid update their weights to become even closer to that

300
00:14:56.519 --> 00:14:57.240
<v Speaker 2>input vector.

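NOTE
A Java sketch of one self-organizing-map step as described: find the winner by Euclidean distance, then pull the winner and its grid neighbors toward the input with w += eta * (x - w). It assumes a 1D grid and a hard neighborhood radius; practical SOMs usually shrink the radius and learning rate over time, which is omitted here. All names and numbers are hypothetical.
```java
// Illustrative only: a single competitive-learning step on a 1D grid of neurons.
final class SomSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }
    // The winner is the neuron whose weight vector is closest to the input.
    static int findWinner(double[][] weights, double[] input) {
        int winner = 0;
        for (int n = 1; n < weights.length; n++) {
            if (squaredDistance(weights[n], input) < squaredDistance(weights[winner], input)) {
                winner = n;
            }
        }
        return winner;
    }
    // Moves the winner and every neuron within `radius` grid positions toward the input.
    static void update(double[][] weights, double[] input, int winner, int radius, double eta) {
        for (int n = 0; n < weights.length; n++) {
            if (Math.abs(n - winner) <= radius) {
                for (int i = 0; i < input.length; i++) {
                    weights[n][i] += eta * (input[i] - weights[n][i]);
                }
            }
        }
    }
    public static void main(String[] args) {
        double[][] weights = {{0, 0}, {1, 1}, {5, 5}, {9, 9}};
        double[] input = {1.2, 0.8};
        int winner = findWinner(weights, input);
        update(weights, input, winner, 1, 0.5);
        System.out.println("winner=" + winner + " updated weight=" + weights[winner][0] + "," + weights[winner][1]);
    }
}
```
Because neighbors are dragged along with the winner, adjacent grid positions end up responding to similar inputs, which is how the map organizes itself.
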
301
00:14:57.440 --> 00:15:00.799
<v Speaker 1>Ah, so neighboring neurons learn similar things? Exactly.

302
00:15:00.840 --> 00:15:04.799
<v Speaker 2>That's how the map organizes itself over time. Different regions

303
00:15:04.799 --> 00:15:07.600
<v Speaker 2>of the map specialize in responding to different types of inputs,

304
00:15:08.080 --> 00:15:09.960
<v Speaker 2>forming those clusters or centroids.

305
00:15:10.120 --> 00:15:12.399
<v Speaker 1>Cool. What are some examples the book uses for this?

306
00:15:12.919 --> 00:15:15.960
<v Speaker 2>One is clustering animals, giving the network characteristics: does it

307
00:15:16.039 --> 00:15:18.360
<v Speaker 2>have fur? Is it terrestrial? Does it have mammary glands?

308
00:15:18.559 --> 00:15:21.840
<v Speaker 2>And letting the SOM group the animals based on similarity

309
00:15:22.080 --> 00:15:24.960
<v Speaker 2>without any predefined labels like mammal or reptile.

310
00:15:25.120 --> 00:15:26.759
<v Speaker 1>It discovers the categories. Right.

311
00:15:27.000 --> 00:15:31.399
<v Speaker 2>Another big one is customer profiling, analyzing transaction data, maybe demographics,

312
00:15:31.639 --> 00:15:34.279
<v Speaker 2>to find hidden segments or clusters of customers.

313
00:15:34.320 --> 00:15:36.720
<v Speaker 1>That sounds commercially very valuable.

314
00:15:36.440 --> 00:15:39.440
<v Speaker 2>Hugely. Businesses use it to understand their customer base, better

315
00:15:39.559 --> 00:15:44.639
<v Speaker 2>target marketing, etc. But it often requires careful data preprocessing.

316
00:15:44.120 --> 00:15:45.720
<v Speaker 1>Because the network needs numbers.

317
00:15:45.960 --> 00:15:48.840
<v Speaker 2>Yeah, you need to convert different data types, numerical,

318
00:15:48.919 --> 00:15:53.240
<v Speaker 2>categorical like gender or city, into a format the network

319
00:15:53.240 --> 00:15:55.080
<v Speaker 2>can handle. That's often a big part of the job.
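
NOTE
Editorial sketch, not taken from the book: one common way to do the conversion
just described. It assumes one-hot encoding for categorical fields and min-max
scaling for numeric ones; the class name, method names, and the sample city
list are hypothetical.
```java
import java.util.Arrays;
import java.util.List;
// Hypothetical sketch: turning mixed customer attributes into numeric network inputs.
final class EncodingSketch {
    // One-hot encoding: 1.0 at the category's position, 0.0 everywhere else.
    static double[] oneHot(String value, List<String> categories) {
        double[] encoded = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        encoded[index] = 1.0;
        return encoded;
    }
    // Min-max scaling of a numeric value into the range [0, 1].
    static double scale(double value, double min, double max) {
        return (value - min) / (max - min);
    }
    public static void main(String[] args) {
        List<String> cities = Arrays.asList("Lisbon", "Porto", "Faro");
        double[] city = oneHot("Porto", cities);
        double age = scale(40.0, 20.0, 60.0);
        System.out.println(Arrays.toString(city) + " age=" + age);
    }
}
```
One-hot encoding avoids implying an order between categories, which a single
integer code per city would do.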

320
00:15:55.240 --> 00:15:59.639
<v Speaker 1>Okay, so we have supervised for prediction, unsupervised for discovery.

321
00:16:00.080 --> 00:16:05.120
<v Speaker 1>What about tasks that combine aspects, like pattern recognition?

322
00:16:05.279 --> 00:16:10.000
<v Speaker 2>Pattern recognition, especially something like optical character recognition, OCR, is

323
00:16:10.000 --> 00:16:12.279
<v Speaker 2>a great example. It often involves elements of both.

324
00:16:12.279 --> 00:16:14.759
<v Speaker 1>Recognizing handwriting or typed text?

325
00:16:15.000 --> 00:16:18.480
<v Speaker 2>Exactly. The book has a nice OCR case study recognizing

326
00:16:18.519 --> 00:16:20.080
<v Speaker 2>handwritten digits, zero through nine.

327
00:16:20.279 --> 00:16:22.679
<v Speaker 1>How did they represent the digits for the network?

328
00:16:22.840 --> 00:16:26.360
<v Speaker 2>They use simple five by five pixel grayscale images. Each

329
00:16:26.399 --> 00:16:30.960
<v Speaker 2>image is flattened into a vector of twenty five pixel inputs.
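
NOTE
Editorial sketch, not taken from the book: one way a five by five grayscale
image could be flattened into twenty-five inputs. It assumes row-major order
and pixel intensities in 0..255 scaled to 0..1; both choices, and the names
FlattenSketch and flatten, are assumptions for illustration.
```java
import java.util.Arrays;
// Hypothetical sketch: flattening a small grayscale image into a network input vector.
final class FlattenSketch {
    // Row-major flattening: pixel (row, col) lands at index row * width + col,
    // with its intensity scaled from 0..255 down to 0..1.
    static double[] flatten(int[][] image) {
        int height = image.length;
        int width = image[0].length;
        double[] vector = new double[height * width];
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                vector[row * width + col] = image[row][col] / 255.0;
            }
        }
        return vector;
    }
    public static void main(String[] args) {
        // A crude digit "1": a vertical stroke down the middle column.
        int[][] one = new int[5][5];
        for (int row = 0; row < 5; row++) {
            one[row][2] = 255;
        }
        double[] inputs = flatten(one);
        System.out.println(inputs.length + " inputs: " + Arrays.toString(inputs));
    }
}
```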

330
00:16:30.639 --> 00:16:32.799
<v Speaker 1>So the image becomes numerical data.

331
00:16:32.840 --> 00:16:36.759
<v Speaker 2>Precisely. That transformation from visual information to numbers the network

332
00:16:36.799 --> 00:16:39.679
<v Speaker 2>can process is fundamental. Then typically you train it using

333
00:16:39.720 --> 00:16:43.039
<v Speaker 2>supervised learning. Show it lots of examples: three images

334
00:16:43.080 --> 00:16:45.720
<v Speaker 2>labeled as three, four images labeled as four, and so on.

335
00:16:46.000 --> 00:16:49.679
<v Speaker 1>Okay. Now, throughout these examples, something you mentioned earlier seems important:

336
00:16:50.080 --> 00:16:53.000
<v Speaker 1>The trial and error aspect of designing these networks.

337
00:16:53.039 --> 00:16:56.720
<v Speaker 2>Oh, absolutely. It's rarely straightforward. The weather forecasting example they

338
00:16:56.759 --> 00:16:59.759
<v Speaker 2>discuss in chapter five really highlights this. They

339
00:16:59.759 --> 00:17:04.160
<v Speaker 2>had to experiment empirically, try different network structures, different numbers

340
00:17:04.200 --> 00:17:09.119
<v Speaker 2>of hidden neurons, different learning parameters, and, crucially, carefully select

341
00:17:09.160 --> 00:17:11.200
<v Speaker 2>the training and test data sets.

342
00:17:11.200 --> 00:17:14.240
<v Speaker 1>And the goal isn't always just the lowest possible error on

343
00:17:14.279 --> 00:17:15.240
<v Speaker 1>the training set.

344
00:17:15.480 --> 00:17:18.960
<v Speaker 2>Not necessarily. This is a key point. Sometimes a network

345
00:17:18.960 --> 00:17:23.079
<v Speaker 2>that achieves a slightly higher error, say mean squared error,

346
00:17:23.440 --> 00:17:27.359
<v Speaker 2>MSE, during training might actually perform better on the unseen

347
00:17:27.400 --> 00:17:28.079
<v Speaker 2>test data.

348
00:17:28.160 --> 00:17:30.720
<v Speaker 1>Better generalization.

349
00:17:30.480 --> 00:17:33.400
<v Speaker 2>Exactly. It learned the underlying pattern better; it wasn't just overfitting to

350
00:17:33.400 --> 00:17:36.079
<v Speaker 2>the training noise. They saw this in both the weather

351
00:17:36.119 --> 00:17:40.119
<v Speaker 2>forecasting and the OCR digit recognition results. The network that

352
00:17:40.200 --> 00:17:43.400
<v Speaker 2>generalized best wasn't always the one with the absolute rock

353
00:17:43.440 --> 00:17:45.160
<v Speaker 2>bottom training MSE.
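
NOTE
Editorial sketch, not taken from the book: how mean squared error can be
computed and how a candidate network might be chosen by its error on held-out
test data rather than on training data. The class and method names are
hypothetical, and the error values in main are made-up illustrative numbers,
not results from the book.
```java
// Hypothetical sketch: comparing candidate networks by test error instead of training error.
final class GeneralizationSketch {
    // Mean squared error: the average of the squared differences between prediction and target.
    static double mse(double[] predicted, double[] target) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - target[i];
            sum += diff * diff;
        }
        return sum / predicted.length;
    }
    // Returns the index of the candidate with the lowest error on the test set.
    static int selectByTestError(double[] testMse) {
        int best = 0;
        for (int i = 1; i < testMse.length; i++) {
            if (testMse[i] < testMse[best]) {
                best = i;
            }
        }
        return best;
    }
    public static void main(String[] args) {
        // Illustrative numbers only: candidate 1 fits the training set more tightly,
        // but candidate 0 does better on data it has not seen.
        double[] trainMse = {0.010, 0.002};
        double[] testMse = {0.015, 0.040};
        int chosen = selectByTestError(testMse);
        System.out.println("Chosen candidate: " + chosen + " (training MSE " + trainMse[chosen] + ")");
    }
}
```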

354
00:17:45.279 --> 00:17:48.440
<v Speaker 1>So it's an iterative design process. It requires judgment.

355
00:17:48.240 --> 00:17:51.000
<v Speaker 2>Very much so. Part science, part art, maybe.

356
00:17:50.720 --> 00:17:53.400
<v Speaker 1>And things can go wrong, right? Common issues?

357
00:17:53.279 --> 00:17:58.119
<v Speaker 2>For sure. Bad input selection, feeding the network irrelevant data, noisy

358
00:17:58.200 --> 00:18:03.039
<v Speaker 2>data that obscures the patterns, choosing an unsuitable network structure,

359
00:18:03.119 --> 00:18:05.640
<v Speaker 2>too simple or maybe overly complex.

360
00:18:05.319 --> 00:18:07.920
<v Speaker 1>So optimization is key.

361
00:18:08.200 --> 00:18:11.519
<v Speaker 2>Definitely. Techniques exist to help. Like for input selection, you

362
00:18:11.519 --> 00:18:15.920
<v Speaker 2>can analyze data correlation using something like the Pearson coefficient

363
00:18:16.000 --> 00:18:18.880
<v Speaker 2>to see which potential inputs are actually strongly related to

364
00:18:18.960 --> 00:18:21.920
<v Speaker 2>the output you're trying to predict. Helps weed out the noise.
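
NOTE
Editorial sketch, not taken from the book: the standard Pearson correlation
coefficient between one candidate input and the output, which could be used to
rank inputs as described above. The class and method names are hypothetical,
and the sketch assumes equal-length arrays with nonzero variance.
```java
// Hypothetical sketch: Pearson correlation between a candidate input and the target output.
final class PearsonSketch {
    // r = sum((x - meanX) * (y - meanY)) / sqrt(sum((x - meanX)^2) * sum((y - meanY)^2))
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }
    public static void main(String[] args) {
        double[] input = {1.0, 2.0, 3.0, 4.0};
        double[] output = {2.0, 4.0, 6.0, 8.0};
        System.out.println("r = " + pearson(input, output));
    }
}
```
Values near 1 or -1 indicate a strong linear relationship; values near 0
suggest the input carries little linear information about the output, although
a nonlinear relationship could still exist.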

365
00:18:22.079 --> 00:18:23.960
<v Speaker 1>Makes sense. And if you have tons of inputs, like

366
00:18:24.000 --> 00:18:25.640
<v Speaker 1>from high-res images?

367
00:18:25.519 --> 00:18:29.799
<v Speaker 2>Then dimensionality reduction techniques become vital: ways to compress the

368
00:18:29.799 --> 00:18:33.160
<v Speaker 2>input data, capture the most important information in fewer dimensions,

369
00:18:33.519 --> 00:18:36.640
<v Speaker 2>making the learning task more manageable without losing too much signal.

370
00:18:36.920 --> 00:18:40.720
<v Speaker 1>So it sounds like mastering neural networks takes patience, experimentation,

371
00:18:40.920 --> 00:18:41.759
<v Speaker 1>and constant refinement.

372
00:18:41.839 --> 00:18:44.039
<v Speaker 2>Yeah, it's not usually a one-shot deal. You build,

373
00:18:44.079 --> 00:18:47.000
<v Speaker 2>you test, you tweak, you learn from the results and iterate.

374
00:18:47.160 --> 00:18:49.359
<v Speaker 1>Well, this has been an incredible deep dive. We've really

375
00:18:49.519 --> 00:18:55.880
<v Speaker 1>unpacked the core pieces: artificial neurons, weights, bias, activation functions, layers.

376
00:18:55.839 --> 00:18:57.000
<v Speaker 2>Uh huh, the building blocks.

377
00:18:57.119 --> 00:19:03.240
<v Speaker 1>Explored how they learn: supervised with a teacher, unsupervised, discovering patterns

378
00:19:03.240 --> 00:19:03.960
<v Speaker 1>on their own.

379
00:19:04.160 --> 00:19:09.039
<v Speaker 2>With algorithms like backpropagation making the complex learning possible.

380
00:19:08.680 --> 00:19:12.920
<v Speaker 1>And competitive learning driving that self-organization in SOMs. Yeah,

381
00:19:12.960 --> 00:19:18.200
<v Speaker 1>and we saw their versatility: forecasting, diagnosis, clustering, even reading handwriting.

382
00:19:18.279 --> 00:19:21.839
<v Speaker 2>It really shows they're more than just algorithms. They're inspired

383
00:19:21.880 --> 00:19:25.400
<v Speaker 2>by life, finding knowledge in ways we might not expect,

384
00:19:25.640 --> 00:19:28.359
<v Speaker 2>almost like extensions of our own ways of finding patterns.

385
00:19:28.400 --> 00:19:32.359
<v Speaker 1>Absolutely. So, given everything we've discussed, their ability to self-

386
00:19:32.440 --> 00:19:36.880
<v Speaker 1>organize, adapt, create internal representations, here's a final thought for you.

387
00:19:37.599 --> 00:19:40.759
<v Speaker 1>What new frontiers of human knowledge might these networks unlock that

388
00:19:40.880 --> 00:19:42.720
<v Speaker 1>maybe we can't even conceive of yet?

389
00:19:43.079 --> 00:19:44.680
<v Speaker 2>That is the big question, isn't it?

390
00:19:44.920 --> 00:19:46.839
<v Speaker 1>Thank you for joining us on this deep dive.
