WEBVTT 1 00:00:00.160 --> 00:00:03.040 Welcome to the deep dive. Today we're plunging into Neural 2 00:00:03.080 --> 00:00:07.280 Network Programming with Java by Fábio Soares and Alan Souza. 3 00:00:07.400 --> 00:00:09.800 Yeah, and this book, it's not just code, is it? 4 00:00:09.800 --> 00:00:12.800 It really goes deep into the fundamentals. Exactly. 5 00:00:12.960 --> 00:00:16.079 It's about how these, well, intelligent systems are built, how 6 00:00:16.120 --> 00:00:20.440 they learn. Our mission today: extract the key insights. 7 00:00:20.600 --> 00:00:25.239 Uh huh, those surprising bits, the aha moments. We want 8 00:00:25.280 --> 00:00:27.879 to give you a real shortcut to understanding these networks. 9 00:00:27.920 --> 00:00:31.600 We'll make the complex stuff digestible, engaging, show you what 10 00:00:31.640 --> 00:00:33.119 they are, but also why they 11 00:00:33.039 --> 00:00:35.320 work, and the incredible things they can actually do out 12 00:00:35.320 --> 00:00:36.159 there in the real world. 13 00:00:36.280 --> 00:00:40.359 Okay, so neural networks, these artificial brains, where do we 14 00:00:40.399 --> 00:00:43.320 even start? What's the core idea? How can we even 15 00:00:43.399 --> 00:00:44.880 build something based on a brain? 16 00:00:45.240 --> 00:00:47.399 It's fascinating. Actually, you have to go way back, like 17 00:00:47.479 --> 00:00:51.679 the nineteen forties. Really, that early? Yeah. A neurophysiologist, Warren 18 00:00:51.759 --> 00:00:56.039 McCulloch, and a mathematician, Walter Pitts. They created the first 19 00:00:56.280 --> 00:00:58.960 mathematical model of an artificial 20 00:00:58.479 --> 00:01:00.280 neuron. Inspired by the real thing? 21 00:01:00.560 --> 00:01:04.120 Absolutely, they saw the natural neuron as a kind of 22 00:01:04.560 --> 00:01:07.840 simple processor. It sums up signals, decides whether to fire, 23 00:01:08.120 --> 00:01:12.239 propagates it onward. That basic idea was the spark.
24 00:01:12.840 --> 00:01:16.519 Biological simplicity leading to, well, technological complexity. 25 00:01:16.680 --> 00:01:17.120 You got it. 26 00:01:17.200 --> 00:01:20.920 So, building on that, these artificial networks have some key parts, right? 27 00:01:21.000 --> 00:01:22.640 The building blocks. Definitely. 28 00:01:23.239 --> 00:01:25.519 First up, the artificial neuron itself. 29 00:01:25.640 --> 00:01:27.920 The basic processing unit. Exactly. 30 00:01:27.680 --> 00:01:30.319 Takes multiple inputs, kind of like dendrites. 31 00:01:29.920 --> 00:01:31.200 Aggregates them. Right. 32 00:01:31.040 --> 00:01:34.000 Sums them up, and then produces a single output, like 33 00:01:34.040 --> 00:01:36.560 an axon firing, based on some internal logic. 34 00:01:36.719 --> 00:01:39.400 Okay, makes sense. But the connections matter too. 35 00:01:39.319 --> 00:01:41.719 Oh, hugely. That's where the weights come in. They're the 36 00:01:41.719 --> 00:01:42.879 connections between 37 00:01:42.599 --> 00:01:44.040 neurons. Not just wires, though? 38 00:01:44.200 --> 00:01:48.879 No, no, they amplify or reduce the signals passing through. 39 00:01:49.159 --> 00:01:50.959 They multiply the input signal. 40 00:01:50.719 --> 00:01:53.840 And that's where the learning happens, adjusting these weights? Precisely. 41 00:01:54.239 --> 00:01:56.480 The weights essentially store the network's knowledge. 42 00:01:56.640 --> 00:02:01.079 But is that enough, just weights and neurons? Feels like 43 00:02:01.120 --> 00:02:02.680 something's missing for real complexity. 44 00:02:02.920 --> 00:02:06.840 You're right, you need bias and those crucial activation functions. 45 00:02:07.000 --> 00:02:08.439 Okay, tell me about bias first. 46 00:02:08.800 --> 00:02:11.439 Bias is like an extra input, always set to one, 47 00:02:11.639 --> 00:02:14.599 with its own weight. It adds a constant value to 48 00:02:14.639 --> 00:02:17.360 the sum before the activation function kicks in.
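The neuron described above (inputs multiplied by weights, summed, plus a bias that behaves like a weight on a constant input of one) can be sketched in a few lines of Java. This is a minimal illustration, not the book's code: the class name `NeuronSketch`, the method names, and the hard-threshold "fire" rule are assumptions chosen for clarity.

```java
// Minimal sketch of one artificial neuron. All names here are hypothetical,
// and the step-style "fire" rule is an assumption used only for illustration.
final class NeuronSketch {

    /** Weighted sum of the inputs plus the bias (a weight on a constant input of 1). */
    static double weightedSum(double[] inputs, double[] weights, double biasWeight) {
        double sum = biasWeight * 1.0; // the bias input is always 1
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i]; // each weight amplifies or reduces its signal
        }
        return sum;
    }

    /** Fires (returns 1) when the weighted sum is non-negative, otherwise returns 0. */
    static int fire(double[] inputs, double[] weights, double biasWeight) {
        return weightedSum(inputs, weights, biasWeight) >= 0.0 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.0};
        double[] w = {0.6, 0.6};
        // Same inputs and weights, different bias: the bias shifts the firing threshold.
        System.out.println(fire(x, w, -1.0));
        System.out.println(fire(x, w, -0.5));
    }
}
```

With the same inputs and weights, changing only the bias weight changes whether the neuron fires, which is the "shifting the threshold" effect in a nutshell.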
49 00:02:17.560 --> 00:02:18.639 Why? What does that do? 50 00:02:18.960 --> 00:02:22.319 It gives the neuron more flexibility. It basically shifts the 51 00:02:22.319 --> 00:02:26.599 activation threshold, allowing the network to model more complex relationships, 52 00:02:26.759 --> 00:02:30.240 stuff that doesn't necessarily pass through the origin. Helps handle 53 00:02:30.280 --> 00:02:31.439 nonlinear stuff better. 54 00:02:31.599 --> 00:02:34.120 Got it. And activation functions, you said they're crucial. The 55 00:02:34.120 --> 00:02:36.280 book mentions sigmoid, tanh. 56 00:02:36.439 --> 00:02:40.560 Right, hyperbolic tangent, also purely linear functions. But the key 57 00:02:40.599 --> 00:02:43.479 insight here is the nonlinear 58 00:02:42.879 --> 00:02:44.199 ones, like sigmoid and tanh. 59 00:02:44.439 --> 00:02:48.800 Exactly. Without nonlinearity, even a deep network, a multilayer 60 00:02:48.879 --> 00:02:51.919 one, would just be doing a sequence of linear operations. 61 00:02:51.960 --> 00:02:56.000 It means it could only solve linear problems. Nonlinear activation 62 00:02:56.080 --> 00:02:59.439 functions let the network learn really complex curved boundaries in 63 00:02:59.479 --> 00:03:02.680 the data. Think image recognition, that's inherently nonlinear. 64 00:03:02.759 --> 00:03:06.599 Okay, so that nonlinearity is the secret sauce for handling 65 00:03:06.199 --> 00:03:07.560 complexity. A big part of it. 66 00:03:07.639 --> 00:03:10.800 Yeah. And these neurons, they aren't just, you know, floating around. 67 00:03:11.120 --> 00:03:12.919 They're organized into layers. 68 00:03:12.960 --> 00:03:16.039 Correct. You have an input layer where data comes in, an 69 00:03:16.039 --> 00:03:19.520 output layer where the result comes out, and in between, 70 00:03:19.680 --> 00:03:22.280 potentially one or more hidden layers. 71 00:03:22.360 --> 00:03:24.080 Hidden layers sound important. 72 00:03:23.800 --> 00:03:26.000 They really are.
They allow the network to build up 73 00:03:26.039 --> 00:03:30.120 layers of abstraction. It learns intermediate features, representations of the 74 00:03:30.199 --> 00:03:33.639 data that aren't obvious in the raw input but are 75 00:03:33.719 --> 00:03:38.159 useful for the final task. That's where complex knowledge gets represented. 76 00:03:37.719 --> 00:03:40.479 Like building its own internal understanding. Kind of. Yeah, and 77 00:03:40.520 --> 00:03:44.400 how these layers and neurons are arranged, that gives different architectures. 78 00:03:44.560 --> 00:03:47.520 Yep. Simple ones are mono-layer, just input and output. 79 00:03:47.879 --> 00:03:50.840 More complex are multilayer, with those hidden layers. Okay. 80 00:03:51.159 --> 00:03:54.000 Then there's how the signal flows. Feedforward is the 81 00:03:54.000 --> 00:03:57.960 basic type: signal goes one way, input to output. Straightforward. 82 00:03:58.000 --> 00:04:00.639 But then you have feedback networks, or recurrent networks. 83 00:04:00.719 --> 00:04:04.360 Recurrent, meaning the signal can loop back? Exactly. 84 00:04:05.039 --> 00:04:08.039 Outputs from neurons can be fed back as inputs to 85 00:04:08.080 --> 00:04:12.000 neurons in the same or earlier layers. This introduces memory, 86 00:04:12.199 --> 00:04:12.759 a sense of 87 00:04:12.719 --> 00:04:16.920 time. Useful for sequences, time series data? 88 00:04:16.560 --> 00:04:21.439 Perfect for that. Pattern recognition over time. But the catch 89 00:04:22.120 --> 00:04:25.759 is they're significantly harder to train. Why is that? Because 90 00:04:25.759 --> 00:04:30.079 the network state depends on its previous states. That feedback 91 00:04:30.120 --> 00:04:34.959 loop complicates the learning process, tracking how errors should propagate back. 92 00:04:35.160 --> 00:04:37.920 Right, that makes sense. So we have these components, these 93 00:04:38.000 --> 00:04:42.560 architectures, these artificial brains. But how do they actually learn? 
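The layered, one-way signal flow described in this stretch can be sketched as a small feedforward pass: input, then a hidden layer with a nonlinear activation, then an output layer. This is a hedged illustration under stated assumptions; the class `FeedForwardSketch`, its method names, and the choice of tanh for the hidden layer and sigmoid for the output are not taken from the book.

```java
// Hypothetical sketch of a feedforward pass through one hidden layer.
// Layer sizes, names, and activation choices are assumptions for illustration.
final class FeedForwardSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One layer: out[j] = activation(bias[j] + sum over i of in[i] * w[j][i]). */
    static double[] layer(double[] in, double[][] w, double[] bias, boolean useTanh) {
        double[] out = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            double z = bias[j];
            for (int i = 0; i < in.length; i++) {
                z += in[i] * w[j][i];
            }
            out[j] = useTanh ? Math.tanh(z) : sigmoid(z); // the nonlinear step
        }
        return out;
    }

    /** The signal flows one way only: input -> hidden (tanh) -> output (sigmoid). */
    static double[] forward(double[] input, double[][] wHidden, double[] bHidden,
                            double[][] wOut, double[] bOut) {
        double[] hidden = layer(input, wHidden, bHidden, true);
        return layer(hidden, wOut, bOut, false);
    }

    public static void main(String[] args) {
        double[][] wHidden = {{0.5, -0.5}, {0.3, 0.8}};
        double[] bHidden = {0.1, -0.1};
        double[][] wOut = {{1.0, -1.0}};
        double[] bOut = {0.0};
        double[] y = forward(new double[]{1.0, 0.0}, wHidden, bHidden, wOut, bOut);
        System.out.println(y[0]);
    }
}
```

If the `tanh` and `sigmoid` calls were removed, the two layers would reduce to a single linear map of the input, which is why the nonlinear activations matter in a multilayer network.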
94 00:04:42.639 --> 00:04:43.600 What's the mechanism? 95 00:04:44.000 --> 00:04:48.519 Well, fundamentally, learning is about adjusting those weights, systematically changing 96 00:04:48.519 --> 00:04:51.399 the connection strengths based on experience, based on data. Yeah, 97 00:04:51.720 --> 00:04:54.959 but what's really fascinating is the distributed nature of this intelligence. 98 00:04:55.079 --> 00:04:58.600 Meaning it's spread out? Exactly. It's not one central brain 99 00:04:58.680 --> 00:05:02.279 part holding all the knowledge. It's across potentially millions or 100 00:05:02.279 --> 00:05:05.879 billions of tiny connections, each weight holding a small piece. 101 00:05:05.720 --> 00:05:08.639 So it's robust. Losing a few connections isn't catastrophic. 102 00:05:08.759 --> 00:05:12.079 Generally, yes, very robust compared to traditional programs where one 103 00:05:12.160 --> 00:05:15.439 error can crash everything. And this distributed learning helps them 104 00:05:15.480 --> 00:05:17.120 generalize well to new data. 105 00:05:17.199 --> 00:05:20.480 Okay, and the book talks about two main ways they learn, 106 00:05:20.560 --> 00:05:24.040 two paradigms. First is supervised 107 00:05:23.439 --> 00:05:26.600 learning. Right, learning with a teacher. Essentially, you give the 108 00:05:26.639 --> 00:05:30.920 network an input X and the correct output Y you want it 109 00:05:30.680 --> 00:05:32.839 to produce. Labeled data. Exactly. 110 00:05:33.160 --> 00:05:36.600 The network makes a prediction, compares it to the target Y, 111 00:05:37.240 --> 00:05:40.759 calculates the error, and then uses that error to adjust 112 00:05:40.480 --> 00:05:43.800 its weights. So it learns to map X to Y? Precisely. 113 00:05:44.240 --> 00:05:47.160 This is great for things like image classification.
Here's a 114 00:05:47.199 --> 00:05:49.399 picture, tell if it's a cat or a dog. Or 115 00:05:49.480 --> 00:05:53.600 speech recognition, forecasting, tasks where you know the right answer 116 00:05:53.680 --> 00:05:54.279 during training. 117 00:05:54.480 --> 00:05:57.399 Okay, supervised is learning from examples. What's the other 118 00:05:57.279 --> 00:06:01.040 type? Unsupervised learning. Here there's no teacher, no labels. You just 119 00:06:01.040 --> 00:06:03.639 give the network the input data X, and it has 120 00:06:03.639 --> 00:06:07.079 to figure things out on its own: find hidden structures, patterns, correlations, 121 00:06:07.360 --> 00:06:08.800 group similar data points 122 00:06:08.639 --> 00:06:13.120 together. So discovering patterns rather than predicting known answers? Exactly. 123 00:06:13.560 --> 00:06:17.199 Think clustering, grouping customers based on purchasing habits without knowing 124 00:06:17.279 --> 00:06:21.319 the groups beforehand, or data compression, finding efficient ways to 125 00:06:21.360 --> 00:06:22.360 represent the information. 126 00:06:22.720 --> 00:06:26.040 That sounds powerful for exploration. It really is. 127 00:06:26.199 --> 00:06:28.399 Discovering insights you didn't even know to look for. 128 00:06:28.680 --> 00:06:32.319 So in both cases there's a learning algorithm driving this 129 00:06:32.439 --> 00:06:33.800 weight adjustment. 130 00:06:33.360 --> 00:06:37.199 Yes, a systematic procedure. The goal is usually to minimize 131 00:06:37.199 --> 00:06:39.519 a cost function, which is just a mathematical way of 132 00:06:39.560 --> 00:06:41.800 measuring the total error the network is making. 133 00:06:41.920 --> 00:06:44.199 And a key part of this is splitting the data. 134 00:06:44.759 --> 00:06:47.240 Training and testing. Absolutely crucial. 135 00:06:47.720 --> 00:06:50.560 You train the network on one set of data, but 136 00:06:50.680 --> 00:06:53.759 you evaluate its real performance on a separate set.
It's 137 00:06:53.839 --> 00:06:55.240 one it has never seen before: the test set. 138 00:06:55.240 --> 00:06:56.720 Why separate them? 139 00:06:56.519 --> 00:07:00.160 To prevent overtraining, or overfitting. Yeah, that's when the network 140 00:07:00.319 --> 00:07:03.639 basically just memorizes the training examples, noise and 141 00:07:03.600 --> 00:07:05.000 all. Like cramming for a test. 142 00:07:05.199 --> 00:07:07.720 Exactly. It does great on the stuff it memorized, but 143 00:07:07.839 --> 00:07:10.800 fails miserably on new questions because it didn't learn the 144 00:07:10.879 --> 00:07:15.319 underlying concepts. Testing on unseen data checks for that generalization ability. 145 00:07:15.360 --> 00:07:17.720 Okay. And there are knobs to tune in this learning 146 00:07:17.759 --> 00:07:19.759 process? Parameters? 147 00:07:19.279 --> 00:07:21.720 Oh yeah. A big one is the learning rate, usually 148 00:07:21.759 --> 00:07:24.720 called eta. What does that control? It controls how much 149 00:07:24.759 --> 00:07:27.360 the weights are adjusted in response to the error in 150 00:07:27.439 --> 00:07:28.000 each step. 151 00:07:28.240 --> 00:07:30.959 So like the size of the learning steps. 152 00:07:30.959 --> 00:07:34.120 Kind of. Too high and you might overshoot the best 153 00:07:34.120 --> 00:07:37.920 solution, bouncing around erratically. Too low and learning can be 154 00:07:37.959 --> 00:07:41.600 incredibly slow, might get stuck. It's a balancing act. Makes sense. 155 00:07:42.000 --> 00:07:45.000 And how does the network know when to stop learning? 156 00:07:45.399 --> 00:07:48.399 Those are the stopping conditions. Could be a maximum number 157 00:07:48.399 --> 00:07:50.879 of training cycles, called epochs. 158 00:07:50.480 --> 00:07:53.480 Epochs, meaning passes through the whole training data set.
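To make the learning rate and the epoch limit concrete, here is a small sketch that fits a single weight by gradient descent on a toy data set. Everything in it is an assumption made for illustration: the class name `TrainingLoopSketch`, the data (generated from y = 2x), the one-weight model, and the error target used as a second stopping condition.

```java
// Hypothetical sketch: gradient descent on a one-weight model y = w * x.
// The toy data, names, and stopping rule are illustrative assumptions.
final class TrainingLoopSketch {
    static final double[] X = {1.0, 2.0, 3.0};
    static final double[] Y = {2.0, 4.0, 6.0}; // generated by y = 2x

    /** Mean squared error of the model y = w * x over the toy data. */
    static double mse(double w) {
        double sum = 0.0;
        for (int i = 0; i < X.length; i++) {
            double err = w * X[i] - Y[i];
            sum += err * err;
        }
        return sum / X.length;
    }

    /** Trains from w = 0 and returns {final weight, epochs run, final MSE}. */
    static double[] train(double eta, int maxEpochs, double targetMse) {
        double w = 0.0;
        int epoch = 0;
        // Stop when the epoch budget is spent or the error target is reached.
        while (epoch < maxEpochs && mse(w) > targetMse) {
            double grad = 0.0;
            for (int i = 0; i < X.length; i++) {
                grad += 2.0 * (w * X[i] - Y[i]) * X[i];
            }
            w -= eta * grad / X.length; // eta scales the size of each step
            epoch++;
        }
        return new double[] {w, epoch, mse(w)};
    }

    public static void main(String[] args) {
        double[] ok = train(0.05, 1000, 1e-6);  // moderate eta: settles near w = 2
        double[] bad = train(0.3, 50, 1e-6);    // eta too high: each step overshoots
        System.out.println("w=" + ok[0] + " epochs=" + (int) ok[1]);
        System.out.println("final MSE with large eta=" + bad[2]);
    }
}
```

On this toy problem a moderate learning rate reaches the error target well before the epoch limit, while the larger one overshoots on every step and ends with a higher error than it started with.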
159 00:07:53.399 --> 00:07:56.319 Right. Or you might stop when the error on the 160 00:07:56.360 --> 00:07:59.959 training set, or maybe a separate validation set, drops below 161 00:08:00.279 --> 00:08:05.240 a certain target threshold, or when the error stops improving significantly. 162 00:08:04.800 --> 00:08:07.720 Setting the goalposts for its education. Pretty much. Yeah, so 163 00:08:07.800 --> 00:08:10.480 let's get concrete. The book talks about some early algorithms. 164 00:08:10.759 --> 00:08:14.560 The perceptron, the simplest one, really. It updates weights based 165 00:08:14.600 --> 00:08:16.800 directly on the output error and the learning rate. 166 00:08:17.199 --> 00:08:19.879 Super basic, but it has limits, right? You mentioned that earlier. 167 00:08:19.920 --> 00:08:22.399 Big limits. This raises the really important question of what 168 00:08:22.519 --> 00:08:23.199 can't it do? 169 00:08:23.639 --> 00:08:26.920 And the classic example is the XOR problem. 170 00:08:26.800 --> 00:08:30.079 Exactly, exclusive OR. If you plot the inputs and 171 00:08:30.079 --> 00:08:33.600 outputs for XOR on a 2D graph, you have 172 00:08:33.720 --> 00:08:37.240 points at zero zero, which maps to zero, zero one, which maps to one, 173 00:08:37.279 --> 00:08:38.600 one zero, which maps to one, and one one, which maps to 174 00:08:38.440 --> 00:08:41.120 zero. Right, two classes, zero and one. 175 00:08:41.279 --> 00:08:44.799 Try drawing a single straight line to perfectly separate the 176 00:08:44.879 --> 00:08:45.799 zeros from the ones. 177 00:08:45.960 --> 00:08:46.440 You can't. 178 00:08:46.519 --> 00:08:50.000 You absolutely cannot, and that's the perceptron's limitation. It can 179 00:08:50.000 --> 00:08:53.279 only learn problems that are linearly separable, problems where you 180 00:08:53.279 --> 00:08:57.039 can draw that single line, or a plane in higher dimensions. 181 00:08:56.600 --> 00:08:59.080 Like an AND gate, that's linearly separable.
The 182 00:08:59.080 --> 00:09:01.240 book uses a warning system example for that. 183 00:09:01.320 --> 00:09:03.440 Right, if sensor A AND sensor B are on, 184 00:09:03.679 --> 00:09:08.159 trigger the alarm. A perceptron can learn that easily, but not XOR. 185 00:09:07.679 --> 00:09:10.000 So a step up from the basic perceptron was the 186 00:09:10.039 --> 00:09:10.639 delta rule. 187 00:09:10.720 --> 00:09:14.279 Yeah, an improvement. It takes the activation function's nonlinearity 188 00:09:14.279 --> 00:09:18.000 into account, specifically its derivative, when calculating the weight updates. 189 00:09:18.240 --> 00:09:21.840 It's a bit more sophisticated, uses gradient descent conceptually. 190 00:09:21.399 --> 00:09:23.879 But still fundamentally limited to single layers, mostly. 191 00:09:23.960 --> 00:09:25.799 Yeah, still struggles with things like XOR. 192 00:09:25.879 --> 00:09:28.159 So here's where it gets really interesting, right? How did 193 00:09:28.159 --> 00:09:30.080 they crack problems like XOR? 194 00:09:30.440 --> 00:09:35.360 The breakthrough was multilayer perceptrons, or MLPs, adding those hidden layers. 195 00:09:35.480 --> 00:09:36.159 That was the key. 196 00:09:36.240 --> 00:09:39.039 That was the revolutionary idea. By adding one or more 197 00:09:39.120 --> 00:09:42.360 hidden layers between the input and output, the network gains 198 00:09:42.399 --> 00:09:45.320 the ability to learn nonlinear decision boundaries. 199 00:09:45.639 --> 00:09:47.759 How? What do the hidden layers do? 200 00:09:48.000 --> 00:09:51.000 They essentially learn to transform the input data into a 201 00:09:51.039 --> 00:09:56.399 new representation. In this new hidden space, the problem can 202 00:09:56.440 --> 00:10:02.159 become linearly separable. The hidden layer learns useful intermediate features. 203 00:10:01.759 --> 00:10:04.200 So it finds its own way to make the problem solvable.
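The contrast between the AND gate and XOR can be checked directly with a tiny single-layer perceptron. This is a minimal sketch, not the book's implementation: the class `PerceptronSketch`, the step threshold at zero, the zero starting weights, and the learning rate of 1.0 are assumptions for illustration.

```java
// Hypothetical sketch of the classic perceptron rule on two-input gates.
// Names, initial weights, and the learning rate are illustrative assumptions.
final class PerceptronSketch {

    /** Step activation: outputs 1 when the weighted sum plus bias is positive. */
    static int predict(int[] x, double[] w, double bias) {
        return (w[0] * x[0] + w[1] * x[1] + bias) > 0.0 ? 1 : 0;
    }

    /** Trains on the samples, then returns how many are still misclassified. */
    static int trainAndCountErrors(int[][] inputs, int[] targets, double eta, int maxEpochs) {
        double[] w = {0.0, 0.0};
        double bias = 0.0;
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int mistakes = 0;
            for (int n = 0; n < inputs.length; n++) {
                int error = targets[n] - predict(inputs[n], w, bias);
                if (error != 0) {
                    mistakes++;
                    // Update driven directly by the output error and the learning rate.
                    w[0] += eta * error * inputs[n][0];
                    w[1] += eta * error * inputs[n][1];
                    bias += eta * error;
                }
            }
            if (mistakes == 0) {
                break; // a full pass without errors: training has converged
            }
        }
        int errors = 0;
        for (int n = 0; n < inputs.length; n++) {
            if (predict(inputs[n], w, bias) != targets[n]) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        int[][] in = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        System.out.println("AND errors: " + trainAndCountErrors(in, new int[]{0, 0, 0, 1}, 1.0, 100));
        System.out.println("XOR errors: " + trainAndCountErrors(in, new int[]{0, 1, 1, 0}, 1.0, 100));
    }
}
```

AND converges to zero errors because a single line separates its classes. For XOR no setting of two weights and a bias can classify all four points, so at least one error always remains, however long training runs.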
204 00:10:04.320 --> 00:10:08.320 Exactly, it learns abstractions. For XOR, a hidden layer can 205 00:10:08.360 --> 00:10:12.120 create internal representations that allow a final output layer to 206 00:10:12.240 --> 00:10:13.559 draw that separating line. 207 00:10:13.759 --> 00:10:16.879 Metaphorically speaking. Wow. But how do you train these? If 208 00:10:16.919 --> 00:10:19.799 the hidden layers aren't directly connected to the final error, 209 00:10:20.320 --> 00:10:21.720 how do their weights get updated? 210 00:10:22.159 --> 00:10:26.399 Ah, that's where the truly game-changing algorithm comes in. Backpropagation. 211 00:10:26.519 --> 00:10:29.480 The famous backprop? That's the one. It calculates the error 212 00:10:29.480 --> 00:10:32.440 at the output layer, just like before, but then it 213 00:10:32.519 --> 00:10:35.080 propagates that error backwards, layer by layer. 214 00:10:34.919 --> 00:10:36.039 Back through the hidden layers. 215 00:10:36.279 --> 00:10:40.879 Yes, it uses the chain rule from calculus, essentially, to 216 00:10:40.960 --> 00:10:45.399 figure out how much each weight in every layer, including 217 00:10:45.440 --> 00:10:47.360 the hidden ones, contributed to the 218 00:10:47.279 --> 00:10:49.679 final error. And then adjusts them accordingly? 219 00:10:49.759 --> 00:10:53.519 Precisely, it allows the entire network, all the connections, to 220 00:10:53.720 --> 00:10:57.120 learn in a coordinated way based on the final output error. 221 00:10:57.440 --> 00:11:00.519 It's what made training deep, complex networks feasible. 222 00:11:00.799 --> 00:11:04.399 Powerful stuff. The book also mentions Levenberg-Marquardt. 223 00:11:04.840 --> 00:11:09.960 Yeah. Briefly, it's another, more complex optimization algorithm. Often converges 224 00:11:10.000 --> 00:11:13.159 faster than basic backprop for smaller networks or certain types 225 00:11:13.159 --> 00:11:16.879 of problems, but computationally more intensive.
It's like a more 226 00:11:16.919 --> 00:11:19.440 sophisticated engine for finding those optimal 227 00:11:19.080 --> 00:11:23.360 weights. And thinking about implementation, the book uses Java. How 228 00:11:23.360 --> 00:11:24.799 does it structure things? 229 00:11:24.879 --> 00:11:27.480 It takes a nice object-oriented approach. You have classes 230 00:11:27.559 --> 00:11:28.600 like Neuron, Layer, 231 00:11:28.759 --> 00:11:31.360 NeuralNet. Modeling the concepts directly 232 00:11:31.000 --> 00:11:35.279 in code? Exactly. Neuron objects have their weights, bias, activation function. 233 00:11:35.919 --> 00:11:38.559 Layer objects group neurons. NeuralNet puts the layers together. 234 00:11:38.759 --> 00:11:41.240 Makes the theory very concrete and practical if you're coding 235 00:11:41.279 --> 00:11:41.519 it up. 236 00:11:41.759 --> 00:11:45.120 Cool. So we have these powerful MLPs trained with backprop. 237 00:11:45.720 --> 00:11:48.279 What kinds of real-world problems do they tackle? The 238 00:11:48.320 --> 00:11:50.480 book mentions two main classes. 239 00:11:50.840 --> 00:11:53.840 Right, broadly speaking, classification and regression. 240 00:11:53.919 --> 00:11:55.639 Classification is putting things into 241 00:11:55.480 --> 00:11:59.000 categories. Exactly, assigning an input record to one of several 242 00:11:59.039 --> 00:12:02.080 predefined classes, like is this email spam or not spam? 243 00:12:02.159 --> 00:12:05.720 Is this tumor malignant or benign? Predicting a student's major 244 00:12:05.799 --> 00:12:07.120 based on grades. 245 00:12:07.080 --> 00:12:08.879 How does the network output work for that? 246 00:12:09.240 --> 00:12:12.799 Multiple outputs, often. Yeah, you might have one output neuron 247 00:12:12.840 --> 00:12:16.600 per class. The neuron with the highest activation wins and 248 00:12:16.679 --> 00:12:18.240 determines the predicted class.
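The "highest activation wins" rule for a network with one output neuron per class amounts to taking the index of the largest output. A minimal sketch, with the class name `ArgMaxSketch` and the tie-breaking choice (first maximum wins) as assumptions:

```java
// Hypothetical sketch: pick the predicted class from one output neuron per class.
final class ArgMaxSketch {

    /** Returns the index of the largest activation; on ties, the first one wins. */
    static int predictedClass(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative activations for three classes.
        double[] activations = {0.1, 0.7, 0.2};
        System.out.println(predictedClass(activations));
    }
}
```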
249 00:12:18.279 --> 00:12:22.639 And evaluating classification, you need specific metrics. The book mentions 250 00:12:22.679 --> 00:12:23.720 confusion matrices. 251 00:12:23.879 --> 00:12:27.600 Absolutely, a confusion matrix shows you not just the overall accuracy, 252 00:12:28.039 --> 00:12:30.559 but what kind of errors the network is making. How 253 00:12:30.559 --> 00:12:33.840 many actual positives were predicted as negative, false negatives? How 254 00:12:33.840 --> 00:12:36.480 many actual negatives were predicted as positive, false 255 00:12:36.200 --> 00:12:39.720 positives? Which leads to metrics like sensitivity and specificity. 256 00:12:40.080 --> 00:12:43.200 Right. Sensitivity is the true positive rate, how well it 257 00:12:43.279 --> 00:12:47.960 identifies actual positives. Specificity is the true negative rate, how 258 00:12:47.960 --> 00:12:51.759 well it identifies actual negatives. Super important in medical diagnosis, 259 00:12:51.759 --> 00:12:53.360 for example. You need to know 260 00:12:53.360 --> 00:12:56.440 both. Makes sense. And the other class was regression. 261 00:12:57.039 --> 00:13:00.440 Regression is about predicting a continuous numerical value, not 262 00:13:00.559 --> 00:13:01.279 a category. 263 00:13:01.360 --> 00:13:05.360 Like predicting house prices or stock values? Exactly. 264 00:13:05.600 --> 00:13:08.879 Finding a function that maps inputs to a number. Predicting 265 00:13:08.919 --> 00:13:11.519 best ticket prices based on route, time of day, et cetera. 266 00:13:11.759 --> 00:13:12.879 That's a regression task. 267 00:13:13.039 --> 00:13:15.320 The book gives some concrete examples, right? Yeah. 268 00:13:15.080 --> 00:13:18.840 Some good ones.
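The confusion-matrix metrics mentioned a moment ago follow directly from four counts: true positives, false negatives, false positives, and true negatives. The sketch below is illustrative only; the class `ConfusionMatrixSketch`, its array layout, and the toy labels are assumptions and do not reproduce the book's classification class.

```java
// Hypothetical sketch of binary confusion-matrix metrics.
// The counts array layout {TP, FN, FP, TN} is an assumption for this example.
final class ConfusionMatrixSketch {

    /** Counts {TP, FN, FP, TN} from actual and predicted labels (1 = positive, 0 = negative). */
    static int[] count(int[] actual, int[] predicted) {
        int[] c = new int[4];
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1 && predicted[i] == 1) c[0]++;      // true positive
            else if (actual[i] == 1 && predicted[i] == 0) c[1]++; // false negative
            else if (actual[i] == 0 && predicted[i] == 1) c[2]++; // false positive
            else c[3]++;                                          // true negative
        }
        return c;
    }

    /** True positive rate: TP / (TP + FN). */
    static double sensitivity(int[] c) {
        return (double) c[0] / (c[0] + c[1]);
    }

    /** True negative rate: TN / (TN + FP). */
    static double specificity(int[] c) {
        return (double) c[3] / (c[3] + c[2]);
    }

    /** Fraction of all predictions that were correct. */
    static double accuracy(int[] c) {
        return (double) (c[0] + c[3]) / (c[0] + c[1] + c[2] + c[3]);
    }

    public static void main(String[] args) {
        int[] actual = {1, 1, 1, 0, 0, 0, 0, 1};    // toy labels, not real data
        int[] predicted = {1, 0, 1, 0, 1, 0, 0, 1};
        int[] c = count(actual, predicted);
        System.out.println("sensitivity=" + sensitivity(c)
                + " specificity=" + specificity(c)
                + " accuracy=" + accuracy(c));
    }
}
```

Reporting sensitivity and specificity separately shows which kind of mistake a classifier makes, something a single accuracy figure hides.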
A university enrollment status predictor, that's classification. 269 00:13:19.480 --> 00:13:23.279 Takes gender, grades, predicts if they'll enroll. And the medical 270 00:13:23.320 --> 00:13:26.679 ones, disease diagnosis. Specifically, they look at breast cancer and 271 00:13:26.720 --> 00:13:30.759 diabetes data sets, using various medical inputs to predict the diagnosis. 272 00:13:30.799 --> 00:13:33.320 Again, classic classification. 273 00:13:32.679 --> 00:13:35.679 And they show how their classification class helps analyze this 274 00:13:36.279 --> 00:13:38.679 with those confusion matrices. 275 00:13:38.440 --> 00:13:43.759 Yeah, it calculates the matrix, sensitivity, specificity, accuracy. Really helps 276 00:13:43.799 --> 00:13:46.919 you understand the performance beyond just a single accuracy number. 277 00:13:47.320 --> 00:13:50.600 It's fascinating seeing how networks find patterns in that complex 278 00:13:50.639 --> 00:13:51.519 medical data. 279 00:13:51.600 --> 00:13:56.159 Definitely. Now, shifting gears slightly. What about that other learning paradigm, 280 00:13:56.320 --> 00:13:58.559 unsupervised learning? Where does that shine? 281 00:13:58.759 --> 00:14:02.159 Right. Unsupervised is about discovery, and a prime example the book 282 00:14:02.200 --> 00:14:07.679 covers is self-organizing maps, or SOMs, also called Kohonen networks. 283 00:14:07.919 --> 00:14:09.240 What's unique about SOMs? 284 00:14:09.559 --> 00:14:13.320 They map high-dimensional input data onto a lower-dimensional grid, 285 00:14:13.679 --> 00:14:15.799 usually 1D or 2D. They create a kind 286 00:14:15.840 --> 00:14:19.519 of map where similar inputs activate neurons that are close 287 00:14:19.519 --> 00:14:20.600 to each other on the map. 288 00:14:20.799 --> 00:14:23.600 So it organizes the data visually? Pretty much. 289 00:14:23.840 --> 00:14:25.879 It preserves the topology of the data.
You get these 290 00:14:25.919 --> 00:14:29.240 clusters forming naturally on the map, showing relationships in the data. 291 00:14:29.320 --> 00:14:31.559 It's great for visualization and exploration. 292 00:14:31.720 --> 00:14:34.559 How do they learn without labels? What's the mechanism? 293 00:14:34.639 --> 00:14:37.960 It's based on competitive learning, sometimes called winner takes all, though 294 00:14:37.799 --> 00:14:39.519 it's a bit more nuanced. Winner takes all? 295 00:14:39.639 --> 00:14:42.679 When an input is presented, all neurons compute their output, 296 00:14:43.320 --> 00:14:46.919 but only one winner neuron, the one whose weight vector 297 00:14:47.000 --> 00:14:50.320 is closest to the input vector, gets strongly activated. Okay. 298 00:14:50.480 --> 00:14:53.440 Then that winner neuron and its neighbors on the map 299 00:14:53.440 --> 00:14:56.440 grid update their weights to become even closer to that 300 00:14:56.519 --> 00:14:57.240 input vector. 301 00:14:57.440 --> 00:15:00.799 Ah, so neighboring neurons learn similar things. Exactly. 302 00:15:00.840 --> 00:15:04.799 That's how the map organizes itself over time. Different regions 303 00:15:04.799 --> 00:15:07.600 of the map specialize in responding to different types of inputs, 304 00:15:08.080 --> 00:15:09.960 forming those clusters or centroids. 305 00:15:10.120 --> 00:15:12.399 Cool. What are some examples the book uses for this? 306 00:15:12.919 --> 00:15:15.960 One is clustering animals, giving the network characteristics: does it 307 00:15:16.039 --> 00:15:18.360 have fur, is it terrestrial, does it have mammary glands? 308 00:15:18.559 --> 00:15:21.840 And letting the SOM group the animals based on similarity, 309 00:15:22.080 --> 00:15:24.960 without any predefined labels like mammal or reptile. 310 00:15:25.120 --> 00:15:26.759 It discovers the categories. Right.
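The competitive step described above (find the neuron whose weight vector is closest to the input, then pull it and its grid neighbors toward that input) can be sketched for a one-dimensional map. The class `SomSketch`, the Euclidean distance measure, the fixed neighborhood radius, and the constant learning rate are assumptions for illustration; a practical SOM would typically shrink both the radius and the rate over time.

```java
// Hypothetical sketch of one competitive-learning step on a 1D self-organizing map.
// weights[j] is the weight vector of the neuron at grid position j.
final class SomSketch {

    /** Index of the neuron whose weight vector is closest (squared Euclidean) to the input. */
    static int winner(double[][] weights, double[] input) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < weights.length; j++) {
            double dist = 0.0;
            for (int k = 0; k < input.length; k++) {
                double d = input[k] - weights[j][k];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }

    /** Moves the winner and its grid neighbors within the radius toward the input. */
    static void update(double[][] weights, double[] input, double eta, int radius) {
        int win = winner(weights, input);
        for (int j = 0; j < weights.length; j++) {
            if (Math.abs(j - win) <= radius) {
                for (int k = 0; k < input.length; k++) {
                    weights[j][k] += eta * (input[k] - weights[j][k]);
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] w = {{0.0, 0.0}, {1.0, 1.0}, {5.0, 5.0}};
        double[] x = {0.9, 0.9};
        System.out.println("winner index: " + winner(w, x));
        update(w, x, 0.5, 1); // winner and its immediate neighbors move toward x
        System.out.println(w[0][0] + " " + w[1][0] + " " + w[2][0]);
    }
}
```

Because neighbors are updated along with the winner, nearby grid positions end up with similar weight vectors, which is what produces the map-like organization.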
311 00:15:27.000 --> 00:15:31.399 Another big one is customer profiling, analyzing transaction data, maybe demographics, 312 00:15:31.639 --> 00:15:34.279 to find hidden segments or clusters of customers. 313 00:15:34.320 --> 00:15:36.720 That sounds commercially very valuable. 314 00:15:36.440 --> 00:15:39.440 Hugely. Businesses use it to understand their customer base better, 315 00:15:39.559 --> 00:15:44.639 target marketing, etc. But it often requires careful data preprocessing. 316 00:15:44.120 --> 00:15:45.720 Because the network needs numbers. 317 00:15:45.960 --> 00:15:48.840 Yeah, you need to convert different data types, numerical, 318 00:15:48.919 --> 00:15:53.240 categorical like gender or city, into a format the network 319 00:15:53.240 --> 00:15:55.080 can handle. That's often a big part of the job. 320 00:15:55.240 --> 00:15:59.639 Okay, so we have supervised for prediction, unsupervised for discovery. 321 00:16:00.080 --> 00:16:05.120 What about tasks that combine aspects, like pattern recognition? 322 00:16:05.279 --> 00:16:10.000 Pattern recognition, especially something like optical character recognition, OCR, is 323 00:16:10.000 --> 00:16:12.279 a great example. It often involves elements of 324 00:16:12.279 --> 00:16:14.759 both. Recognizing handwriting or typed text? 325 00:16:15.000 --> 00:16:18.480 Exactly. The book has a nice OCR case study, recognizing 326 00:16:18.519 --> 00:16:20.080 handwritten digits zero through nine. 327 00:16:20.279 --> 00:16:22.679 How did they represent the digits for the network? 328 00:16:22.840 --> 00:16:26.360 They use simple five by five pixel grayscale images. Each 329 00:16:26.399 --> 00:16:30.960 image is flattened into a vector of twenty-five pixel inputs. 330 00:16:30.639 --> 00:16:32.799 So the image becomes numerical data. 331 00:16:32.840 --> 00:16:36.759 Precisely, that transformation from visual information to numbers the network 332 00:16:36.799 --> 00:16:39.679 can process is fundamental.
Then typically you train it using 333 00:16:39.720 --> 00:16:43.039 supervised learning. Show it lots of examples: images of a three 334 00:16:43.080 --> 00:16:45.720 labeled as three, images of a four labeled as four, and so on. 335 00:16:46.000 --> 00:16:49.679 Okay. Now, throughout these examples, something you mentioned earlier seems important. 336 00:16:50.080 --> 00:16:53.000 The trial and error aspect of designing these networks. 337 00:16:53.039 --> 00:16:56.720 Oh, absolutely, it's rarely straightforward. The weather forecasting example they 338 00:16:56.759 --> 00:16:59.759 discuss in chapter five really highlights this. Oh? So they 339 00:16:59.759 --> 00:17:04.160 had to experiment empirically, try different network structures, different numbers 340 00:17:04.200 --> 00:17:09.119 of hidden neurons, different learning parameters, and crucially, carefully select 341 00:17:09.160 --> 00:17:11.200 the training and test data sets. And 342 00:17:11.200 --> 00:17:14.240 the goal isn't always just the lowest possible error on 343 00:17:14.279 --> 00:17:15.240 the training set. 344 00:17:15.480 --> 00:17:18.960 Not necessarily, this is a key point. Sometimes a network 345 00:17:18.960 --> 00:17:23.079 that achieves a slightly higher error, say mean squared error, 346 00:17:23.440 --> 00:17:27.359 MSE, during training might actually perform better on the unseen 347 00:17:27.400 --> 00:17:28.079 test data. 348 00:17:28.160 --> 00:17:30.720 Better generalization. Exactly. 349 00:17:30.480 --> 00:17:33.400 It learned the underlying pattern better, wasn't just overfitting to 350 00:17:33.400 --> 00:17:36.079 the training noise. They saw this in both the weather 351 00:17:36.119 --> 00:17:40.119 forecasting and the OCR digit recognition results. The network that 352 00:17:40.200 --> 00:17:43.400 generalized best wasn't always the one with the absolute rock 353 00:17:43.440 --> 00:17:45.160 bottom training MSE.
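Mean squared error, the measure referred to above, is simply the average of the squared differences between targets and predictions. A minimal sketch follows; the class name `MseSketch` and the toy numbers are illustrative assumptions, not results from the book.

```java
// Hypothetical sketch: mean squared error, computed separately on training and test data.
final class MseSketch {

    /** Average of squared differences between targets and predictions. */
    static double mse(double[] targets, double[] predictions) {
        double sum = 0.0;
        for (int i = 0; i < targets.length; i++) {
            double err = targets[i] - predictions[i];
            sum += err * err;
        }
        return sum / targets.length;
    }

    public static void main(String[] args) {
        // Toy values for illustration only: the same model scored on two data sets.
        double trainMse = mse(new double[]{1.0, 2.0, 3.0}, new double[]{1.1, 1.9, 3.0});
        double testMse = mse(new double[]{4.0, 5.0}, new double[]{3.5, 5.5});
        // Comparing the two exposes overfitting: a low training error paired with a
        // much higher test error suggests memorization rather than generalization.
        System.out.println("train MSE=" + trainMse + ", test MSE=" + testMse);
    }
}
```

When comparing candidate networks, scoring each one with `mse` on held-out test data, rather than on the training data, is what reveals which one generalizes best.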
354 00:17:45.279 --> 00:17:48.440 So it's an iterative design process, requires judgment. 355 00:17:48.240 --> 00:17:51.000 Very much so. Part science, part art, maybe. 356 00:17:50.720 --> 00:17:53.400 And things can go wrong, right? Common issues? For 357 00:17:53.279 --> 00:17:58.119 sure. Bad input selection, feeding the network irrelevant data, noisy 358 00:17:58.200 --> 00:18:03.039 data that obscures the patterns, choosing an unsuitable network structure, 359 00:18:03.119 --> 00:18:05.640 too simple or maybe overly 360 00:18:05.319 --> 00:18:07.920 complex. So optimization is key. 361 00:18:08.200 --> 00:18:11.519 Definitely, techniques exist to help. Like for input selection, you 362 00:18:11.519 --> 00:18:15.920 can analyze data correlation using something like the Pearson coefficient 363 00:18:16.000 --> 00:18:18.880 to see which potential inputs are actually strongly related to 364 00:18:18.960 --> 00:18:21.920 the output you're trying to predict. Helps weed out the noise. 365 00:18:22.079 --> 00:18:23.960 Makes sense. And if you have tons of inputs, like 366 00:18:24.000 --> 00:18:25.640 from high-res images? 367 00:18:25.519 --> 00:18:29.799 Then dimensionality reduction techniques become vital, ways to compress the 368 00:18:29.799 --> 00:18:33.160 input data, capture the most important information in fewer dimensions, 369 00:18:33.519 --> 00:18:36.640 making the learning task more manageable without losing too much signal. 370 00:18:36.920 --> 00:18:40.720 So it sounds like mastering neural networks takes patience, experimentation, 371 00:18:40.920 --> 00:18:41.759 and constant refinement. 372 00:18:41.839 --> 00:18:44.039 Yeah, it's not usually a one-shot deal. You build, 373 00:18:44.079 --> 00:18:47.000 you test, you tweak, you learn from the results and iterate. 374 00:18:47.160 --> 00:18:49.359 Well, this has been an incredible deep dive.
We've really 375 00:18:49.519 --> 00:18:55.880 unpacked the core pieces: artificial neurons, weights, bias, activation functions, layers. 376 00:18:55.839 --> 00:18:57.000 Uh huh, the building blocks. 377 00:18:57.119 --> 00:19:03.240 Explored how they learn: supervised with a teacher, unsupervised, discovering patterns 378 00:19:03.240 --> 00:19:03.960 on their own. 379 00:19:04.160 --> 00:19:09.039 With algorithms like backpropagation making the complex learning possible. 380 00:19:08.680 --> 00:19:12.920 And competitive learning driving that self-organization in SOMs. Yeah, 381 00:19:12.960 --> 00:19:18.200 and we saw their versatility: forecasting, diagnosis, clustering, even reading handwriting. 382 00:19:18.279 --> 00:19:21.839 It really shows they're more than just algorithms. They're inspired 383 00:19:21.880 --> 00:19:25.400 by life, finding knowledge in ways we might not expect, 384 00:19:25.640 --> 00:19:28.359 almost like extensions of our own ways of finding patterns. 385 00:19:28.400 --> 00:19:32.359 Absolutely. So, given everything we've discussed, their ability to self-organize, 386 00:19:32.440 --> 00:19:36.880 adapt, create internal representations, here's a final thought for you. 387 00:19:37.599 --> 00:19:40.759 What new frontiers of human knowledge might these networks unlock that 388 00:19:40.880 --> 00:19:42.720 maybe we can't even conceive of yet? 389 00:19:43.079 --> 00:19:44.680 That is the big question, isn't it? 390 00:19:44.920 --> 00:19:46.839 Thank you for joining us on this deep dive.