WEBVTT 1 00:00:00.160 --> 00:00:03.080 Welcome to the Deep Dive. Today, we're embarking on a 2 00:00:03.120 --> 00:00:07.440 journey into the powerful world of deep learning, seen specifically 3 00:00:07.480 --> 00:00:08.320 through the lens of R. 4 00:00:08.560 --> 00:00:09.000 That's right. 5 00:00:09.080 --> 00:00:12.119 Our mission is to extract the most important insights and 6 00:00:12.919 --> 00:00:15.880 practical applications from Hands-On Deep Learning with R by 7 00:00:15.880 --> 00:00:17.280 Michael Pawlus and Rodger Devine. 8 00:00:17.359 --> 00:00:19.480 And this isn't just about theory, right? Think of this 9 00:00:19.519 --> 00:00:24.600 as your guide to designing, building, and truly improving neural 10 00:00:24.640 --> 00:00:28.199 network models. We're distilling it down into core concepts and 11 00:00:28.239 --> 00:00:31.920 some surprising applications. It's kind of a shortcut, really, to 12 00:00:32.079 --> 00:00:34.960 being genuinely well informed in this complex space. 13 00:00:35.840 --> 00:00:38.679 So, whether you're prepping for a big meeting, maybe just 14 00:00:38.719 --> 00:00:41.000 catching up on the latest in data science, or you're 15 00:00:41.039 --> 00:00:45.960 simply, like, insatiably curious, prepare for some serious aha moments. 16 00:00:46.159 --> 00:00:47.880 Let's start at the beginning, then. Deep learning, it's a 17 00:00:47.880 --> 00:00:50.600 powerful subset of machine learning, but they fundamentally share a 18 00:00:50.600 --> 00:00:52.799 lot of common ground. What are some of those essential 19 00:00:52.799 --> 00:00:53.560 building blocks? 20 00:00:53.679 --> 00:00:57.079 Well, at its core, it's all about preparing your data 21 00:00:57.119 --> 00:01:00.600 for modeling. The book uses a great example, actually, with 22 00:01:01.119 --> 00:01:04.079 the London Air Quality Network data set.
The goal there 23 00:01:04.200 --> 00:01:07.599 is predicting nitrogen dioxide levels, and this involves some really 24 00:01:07.599 --> 00:01:11.959 crucial steps like identifying and extracting relevant date information, you know, 25 00:01:12.120 --> 00:01:16.280 the day, the month, and also intelligently handling missing values, 26 00:01:16.319 --> 00:01:21.239 for instance, removing maybe a small percentage, like three or 27 00:01:21.280 --> 00:01:23.760 four percent, of missing target values that you just can't 28 00:01:23.799 --> 00:01:25.079 reliably guess. 29 00:01:25.120 --> 00:01:27.920 Right, because bad guesses could throw the whole model 30 00:01:27.640 --> 00:01:31.040 off. Exactly, and filtering out variables that don't add any 31 00:01:31.079 --> 00:01:34.640 real information. Think about columns where every single value is identical, 32 00:01:34.959 --> 00:01:36.879 like maybe a site ID, if you're only looking at 33 00:01:36.879 --> 00:01:39.359 one site, or units. They just don't help. 34 00:01:39.519 --> 00:01:42.040 And data quality isn't just about missing values, is it? 35 00:01:42.280 --> 00:01:46.120 We also saw the importance of checks like confirming provisional 36 00:01:46.239 --> 00:01:48.719 or ratified values, making sure you know the status of 37 00:01:48.719 --> 00:01:52.359 the data, and transforming data types too, ensuring numeric data 38 00:01:52.400 --> 00:01:55.640 isn't accidentally stored as text, which happens more than you'd think. 39 00:01:55.799 --> 00:01:58.920 Oh, definitely. Then comes the actual model training, so that 40 00:01:58.959 --> 00:02:02.319 means splitting your data into training and testing sets, typically 41 00:02:02.319 --> 00:02:05.680 a good chunk, maybe seventy or eighty percent, for training, 42 00:02:06.000 --> 00:02:09.680 and then choosing the right algorithm. The book highlights 43 00:02:09.840 --> 00:02:14.639 XGBoost as a pretty robust gradient tree boosting method. Boosting?
44 00:02:14.960 --> 00:02:16.680 That's different from something like a random 45 00:02:16.439 --> 00:02:19.759 forest, right? Yeah, very different. What's important about boosting methods 46 00:02:19.800 --> 00:02:22.879 like XGBoost is that they learn iteratively. Each new 47 00:02:23.000 --> 00:02:26.400 model essentially tries to correct the mistakes of the previous one. 48 00:02:26.520 --> 00:02:29.960 It's a refinement process. Random forests, on the other hand, 49 00:02:30.080 --> 00:02:33.000 use bagging. They build many independent models and sort of 50 00:02:33.080 --> 00:02:35.719 average their results. Both powerful, but different approaches. 51 00:02:35.800 --> 00:02:38.879 Okay, that makes sense. This brings us to a critical point, then. 52 00:02:39.319 --> 00:02:43.439 How do you truly evaluate your model's results? Because simple 53 00:02:43.479 --> 00:02:45.599 accuracy can be really misleading, can't it? 54 00:02:45.680 --> 00:02:46.240 Absolutely. 55 00:02:46.319 --> 00:02:49.199 The credit card fraud example in the book perfectly illustrates this. 56 00:02:49.719 --> 00:02:52.840 If only, say, zero point one percent of transactions are fraudulent, 57 00:02:53.039 --> 00:02:55.919 a model predicting no fraud every time gets ninety nine 58 00:02:55.960 --> 00:02:59.240 point nine percent accuracy, but it's completely useless. It misses 59 00:02:59.280 --> 00:03:02.039 every single instance of actual fraud. Precisely. 60 00:03:03.000 --> 00:03:06.080 That's why just looking at accuracy, it can be, well, 61 00:03:06.280 --> 00:03:09.960 dangerously deceptive sometimes. It really forces you to think about 62 00:03:09.960 --> 00:03:12.840 the cost of being wrong. You need metrics like mean 63 00:03:12.919 --> 00:03:17.840 absolute error, MAE, or root mean squared error, RMSE. 64 00:03:17.960 --> 00:03:18.560 And RMSE? 65 00:03:18.800 --> 00:03:20.360 That's the one that squares the errors.
66 00:03:20.560 --> 00:03:23.199 Yeah, and the real insight with RMSE isn't just the math. 67 00:03:23.360 --> 00:03:26.960 It's that it forces you to heavily penalize those big, 68 00:03:27.000 --> 00:03:31.599 potentially catastrophic mispredictions. So if missing badly is really, really 69 00:03:31.639 --> 00:03:34.919 bad for your project, RMSE is often the better choice. 70 00:03:35.039 --> 00:03:37.520 It makes those big errors hurt more in the calculation. 71 00:03:38.000 --> 00:03:40.120 Got it. So, once we have a model, how do 72 00:03:40.159 --> 00:03:42.719 we actually make it better? The book talks about strategies 73 00:03:42.759 --> 00:03:44.039 like cross-validation. 74 00:03:44.000 --> 00:03:46.800 Right, cross-validation. That's where you basically repeat the train-test 75 00:03:46.879 --> 00:03:49.479 split multiple times with different slices of your data. 76 00:03:49.800 --> 00:03:52.039 It gives you a much more reliable estimate of how 77 00:03:52.039 --> 00:03:54.120 the model will perform on unseen data. 78 00:03:54.159 --> 00:03:56.479 And early stopping. That sounds important too. 79 00:03:56.400 --> 00:03:58.639 Yeah, early stopping is key. It means you monitor the 80 00:03:58.680 --> 00:04:01.759 model's performance on a validation set during training, and you 81 00:04:01.879 --> 00:04:04.879 just stop the training if things haven't improved for, say, 82 00:04:05.240 --> 00:04:08.759 twenty five rounds or epochs. Prevents overfitting. 83 00:04:08.479 --> 00:04:11.639 And grid searches for hyperparameter tuning, that's about finding 84 00:04:11.680 --> 00:04:12.960 the best settings. 85 00:04:12.800 --> 00:04:17.519 Exactly. Systematically trying out different combinations of settings, like learning 86 00:04:17.600 --> 00:04:21.000 rates or tree depths, to find that optimal configuration for 87 00:04:21.079 --> 00:04:25.519 your specific problem.
And you know, we also briefly touched 88 00:04:25.519 --> 00:04:28.680 on a wider range of machine learning algorithms. Beyond 89 00:04:28.839 --> 00:04:31.720 XGBoost, there's a whole family, things like decision trees 90 00:04:31.839 --> 00:04:37.120 and their ensemble cousin, random forests, logistic regression for classification problems, 91 00:04:38.000 --> 00:04:40.959 support vector machines, which are great at finding separation lines 92 00:04:40.959 --> 00:04:44.560 in data, k-nearest neighbors, k-means for clustering, and 93 00:04:44.759 --> 00:04:48.560 other boosting methods too, like gradient boosting machines, GBM, and 94 00:04:48.680 --> 00:04:52.399 LightGBM. Understanding how these iterative methods work, how they 95 00:04:52.399 --> 00:04:55.040 build on information, it really does lay some groundwork for 96 00:04:55.079 --> 00:04:56.959 grasping deep learning concepts later on. 97 00:04:57.279 --> 00:04:59.120 All right. So, if you're listening and ready to get 98 00:04:59.120 --> 00:05:01.720 hands on with deep learning in R, what are the 99 00:05:02.160 --> 00:05:04.040 essential libraries we need to get started? Okay. 100 00:05:04.040 --> 00:05:06.879 The primary workhorses mentioned are H2O, 101 00:05:06.959 --> 00:05:09.959 MXNet, and Keras, and we also saw some more specialized 102 00:05:10.000 --> 00:05:14.639 packages like RBM for restricted Boltzmann machines and ReinforcementLearning. 103 00:05:14.120 --> 00:05:18.920 And installation, it's not always just install.packages, is it? 104 00:05:19.480 --> 00:05:22.639 Some seem straightforward from CRAN, the main R archive, right? 105 00:05:22.519 --> 00:05:26.319 Some are, but others, yeah, like RBM or especially Keras, 106 00:05:26.319 --> 00:05:28.759 often need a bit more work. Keras usually relies on 107 00:05:28.839 --> 00:05:32.079 TensorFlow running in the background, often in a separate Python 108 00:05:32.160 --> 00:05:34.839 environment like Conda or a virtual environment.
109 00:05:34.560 --> 00:05:36.839 So you might need devtools or need to point 110 00:05:36.959 --> 00:05:39.240 R to the right Python installation. Exactly. 111 00:05:39.240 --> 00:05:42.240 And MXNet, for instance, might even need external libraries 112 00:05:42.360 --> 00:05:45.279 installed in your system first, like OpenCV for image stuff 113 00:05:45.360 --> 00:05:47.879 or OpenBLAS for linear algebra. It can get a 114 00:05:47.920 --> 00:05:50.800 little complex, but what's really insightful here, I think, is 115 00:05:50.879 --> 00:05:55.360 understanding the different strengths of each library within R. Keras 116 00:05:55.399 --> 00:05:58.639 gives you incredibly broad support for almost any neural network 117 00:05:58.720 --> 00:06:02.000 architecture you can think of, RNNs, CNNs, MLPs, you 118 00:06:02.079 --> 00:06:04.920 name it. H2O is fantastic when you're dealing 119 00:06:04.920 --> 00:06:08.079 with really, really large data sets, because you can store 120 00:06:08.120 --> 00:06:11.240 objects out of memory, across a cluster if needed. And 121 00:06:11.319 --> 00:06:15.600 MXNet, it provides a really robust, efficient set of algorithms. 122 00:06:16.120 --> 00:06:19.240 Powerful stuff, and the book shows examples for each. Yeah, 123 00:06:19.279 --> 00:06:21.560 we saw how to get a basic example running with 124 00:06:21.639 --> 00:06:24.959 each one, including a practical demo of preprocessing the 125 00:06:25.000 --> 00:06:28.720 Adult census data set, converting character columns to numbers using 126 00:06:28.759 --> 00:06:32.639 one-hot encoding, scaling everything between zero and one. Standard 127 00:06:32.720 --> 00:06:33.720 but crucial steps. 128 00:06:33.920 --> 00:06:36.959 Okay, let's dig into the deep part. How exactly does 129 00:06:37.079 --> 00:06:40.360 deep learning get that name? And what's really at its core? 130 00:06:40.600 --> 00:06:44.040 Right.
The deep comes from using multiple hidden layers made 131 00:06:44.120 --> 00:06:46.800 up of these artificial neurons. These layers are stacked, and 132 00:06:46.839 --> 00:06:49.680 they mimic, in a very, very simplified way, how our 133 00:06:49.680 --> 00:06:53.439 brains process information. The real key insight is that each 134 00:06:53.519 --> 00:06:57.439 layer can learn progressively more complex features from the data. 135 00:06:57.519 --> 00:06:59.959 How does that work in practice? For an image? 136 00:07:00.439 --> 00:07:03.360 So imagine the first layer might identify basic edges or 137 00:07:03.399 --> 00:07:06.160 corners in an image. The next layer might combine those 138 00:07:06.240 --> 00:07:09.639 edges to detect simple shapes. A layer deeper might recognize 139 00:07:09.639 --> 00:07:12.560 textures or parts of objects, and so on. This hierarchical 140 00:07:12.639 --> 00:07:14.680 learning, building complexity layer by layer, 141 00:07:14.800 --> 00:07:17.279 that's what makes it deep. And what does this structure 142 00:07:17.319 --> 00:07:18.800 mean for how they actually learn? 143 00:07:19.120 --> 00:07:22.000 Well, the process starts with random weights. These are just 144 00:07:22.120 --> 00:07:25.600 numbers assigned to the connections between neurons, representing the strength 145 00:07:25.639 --> 00:07:28.920 of the connection. Then these weights are adjusted over and 146 00:07:28.959 --> 00:07:32.920 over again, iteratively, to minimize the difference, the error, between 147 00:07:32.959 --> 00:07:36.319 the network's predictions and the actual answers in your training data. 148 00:07:36.480 --> 00:07:38.680 So it's constantly refining itself based on feedback. 149 00:07:38.759 --> 00:07:41.839 Exactly, it's a continuous refinement process, very much like how 150 00:07:41.879 --> 00:07:45.399 those boosting algorithms learn from the errors of previous iterations.
151 00:07:45.439 --> 00:07:47.759 Actually, that makes sense. But okay, zooming in on those 152 00:07:47.759 --> 00:07:51.279 individual neurons, yeah, how do they decide whether to fire 153 00:07:51.800 --> 00:07:53.879 or pass a signal forward? Ah, 154 00:07:53.920 --> 00:07:57.680 good question. That's where two key things come in, bias 155 00:07:57.720 --> 00:08:01.399 functions and activation functions. Bias functions, you can think of 156 00:08:01.439 --> 00:08:04.600 them as shifting the decision boundary, allowing the model to 157 00:08:04.639 --> 00:08:09.199 better separate different classes of data. And activation functions, they're 158 00:08:09.199 --> 00:08:11.959 the real decision makers inside each neuron. They take the 159 00:08:11.959 --> 00:08:15.199 weighted sum of inputs plus the bias and decide if 160 00:08:15.240 --> 00:08:18.000 and how strongly that neuron should fire and pass the 161 00:08:18.079 --> 00:08:19.199 signal to the next layer. 162 00:08:19.439 --> 00:08:21.399 Right, and we looked at a whole range of these 163 00:08:21.439 --> 00:08:24.160 activation functions, didn't we? From the simple on/off Heaviside. 164 00:08:24.279 --> 00:08:26.279 Yeah, the Heaviside is very basic, just a step, 165 00:08:26.600 --> 00:08:29.120 but the nonlinear ones are where things get interesting. 166 00:08:29.319 --> 00:08:33.120 There's the sigmoid function, which squishes values into a range 167 00:08:33.159 --> 00:08:37.200 between zero and one, really useful for probabilities or binary outcomes. 168 00:08:38.000 --> 00:08:42.120 Then its cousin, the hyperbolic tangent, or tanh, which is 169 00:08:42.320 --> 00:08:44.360 similar but ranges from minus one to 170 00:08:44.360 --> 00:08:47.679 one. And ReLU seems really popular. Rectified linear units. 171 00:08:47.720 --> 00:08:51.080 Oh yeah, ReLU is huge. It's simple. It outputs the 172 00:08:51.080 --> 00:08:55.159 input directly if it's positive and zero otherwise.
That simplicity 173 00:08:55.279 --> 00:08:58.600 makes training much faster in many cases. But it has 174 00:08:58.639 --> 00:09:02.200 a potential issue called the dying ReLU problem, where neurons 175 00:09:02.240 --> 00:09:06.039 can get stuck outputting zero, so leaky ReLU was developed. 176 00:09:06.080 --> 00:09:08.639 It gives a tiny slope for negative inputs just to 177 00:09:08.679 --> 00:09:09.399 keep things flowing. 178 00:09:09.559 --> 00:09:10.600 And Swish was another one. 179 00:09:10.679 --> 00:09:12.840 Swish, yeah, a more recent one that's shown good results. 180 00:09:12.879 --> 00:09:15.120 It's a smoother function. Lots of options, really. 181 00:09:14.960 --> 00:09:19.879 And for classification tasks where you have, like, multiple categories, dogs, cats, birds, 182 00:09:20.799 --> 00:09:22.399 the softmax function is key. 183 00:09:22.279 --> 00:09:26.279 Right, absolutely essential. Softmax takes the outputs for each class 184 00:09:26.320 --> 00:09:29.240 and converts them into probabilities that all add up to one. 185 00:09:29.399 --> 00:09:31.879 So the model tells you, I think it's seventy percent 186 00:09:31.960 --> 00:09:35.159 likely a cat, twenty percent a dog, ten percent a bird. Okay. 187 00:09:35.559 --> 00:09:38.000 And you know, the book even walks you through building 188 00:09:38.000 --> 00:09:41.039 a very basic network from scratch in just base R, just 189 00:09:41.039 --> 00:09:43.720 to illustrate how weights get updated and how a line 190 00:09:43.759 --> 00:09:47.200 can separate classes. Then it scales up using the neuralnet 191 00:09:47.279 --> 00:09:49.720 package for the Wisconsin cancer data set, which is 192 00:09:49.720 --> 00:09:55.080 a classic, and importantly, it shows the backpropagation 193 00:09:54.279 --> 00:09:57.120 step. Backpropagation, that's how the error gets used to update 194 00:09:57.120 --> 00:09:58.279 the weights? Exactly.
195 00:09:58.399 --> 00:10:00.679 The error is calculated at the output, and then it's 196 00:10:00.720 --> 00:10:04.159 propagated backward through the network layer by layer, telling each 197 00:10:04.200 --> 00:10:06.759 weight how much it needs to adjust to reduce that error. 198 00:10:06.840 --> 00:10:08.480 It's the core learning mechanism. 199 00:10:08.480 --> 00:10:12.200 Fascinating stuff. Okay, let's move to applications. Image recognition is 200 00:10:12.200 --> 00:10:15.000 a huge one for deep learning. Can we use traditional 201 00:10:15.000 --> 00:10:16.480 machine learning for images at all? 202 00:10:16.600 --> 00:10:20.279 You absolutely can. Yeah. Using what are sometimes called shallow nets, 203 00:10:20.360 --> 00:10:23.399 things like random forests or simple neural networks. You can 204 00:10:23.440 --> 00:10:26.600 apply them to data sets like Fashion MNIST. Fashion 205 00:10:26.399 --> 00:10:30.200 MNIST, that's the clothing images instead of handwritten digits. Right. 206 00:10:30.320 --> 00:10:32.879 It's a bit more challenging than the original MNIST digits. 207 00:10:33.120 --> 00:10:36.799 But shallow nets, their limitations become pretty clear when you 208 00:10:36.840 --> 00:10:40.960 move to larger, more complex real world images. They just 209 00:10:41.039 --> 00:10:43.919 struggle to efficiently capture all the intricate patterns. 210 00:10:44.200 --> 00:10:47.639 And this is where convolutional neural networks, CNNs, really come 211 00:10:47.639 --> 00:10:49.799 into their own, isn't it? How do they manage it? 212 00:10:50.320 --> 00:10:54.120 What really sets CNNs apart is their architecture, specifically designed 213 00:10:54.120 --> 00:10:58.399 for grid-like data like images. They automatically learn the 214 00:10:58.440 --> 00:11:02.279 right features directly from the pixels.
They use specialized layers, 215 00:11:02.279 --> 00:11:05.240 convolution layers, that apply filters across the image to 216 00:11:05.320 --> 00:11:10.320 detect specific patterns like edges, corners, textures, maybe even simple shapes. So 217 00:11:10.279 --> 00:11:12.919 they're not just looking at individual pixels anymore? Not at all. 218 00:11:13.240 --> 00:11:16.080 Then they often use pooling layers, which reduce the size, 219 00:11:16.200 --> 00:11:19.440 the dimensionality, making the process more efficient and helping the 220 00:11:19.440 --> 00:11:22.679 network focus on the most important features. And techniques like 221 00:11:22.759 --> 00:11:26.600 adding padding, say, padding "same", can help control how quickly 222 00:11:26.639 --> 00:11:29.879 the dimensions shrink, letting you build deeper networks without losing 223 00:11:29.879 --> 00:11:31.000 information too fast. 224 00:11:31.279 --> 00:11:34.399 And you can build really deep CNNs, right? Stacking multiple 225 00:11:34.440 --> 00:11:35.759 convolution and pooling layers. 226 00:11:35.759 --> 00:11:38.919 Oh yes, that allows the network to learn this hierarchy 227 00:11:38.960 --> 00:11:41.679 of features we talked about, simple features in early layers 228 00:11:41.960 --> 00:11:45.120 combined into more complex ones in deeper layers. It's kind 229 00:11:45.120 --> 00:11:47.799 of analogous to how our own visual system works, in 230 00:11:47.840 --> 00:11:48.200 a way. 231 00:11:48.639 --> 00:11:52.240 So with these complex models, how do we optimize them effectively? 232 00:11:52.399 --> 00:11:57.639 Good question. Optimization is key. We discussed various algorithms called optimizers, 233 00:11:57.919 --> 00:12:01.799 things like stochastic gradient descent, SGD, which is a basic workhorse. 234 00:12:02.159 --> 00:12:05.840 Then RMSprop, and Adam is a very popular one nowadays.
235 00:12:05.879 --> 00:12:09.000 It sort of combines the ideas of RMSprop with momentum, 236 00:12:09.399 --> 00:12:11.919 often leading to faster convergence. 237 00:12:11.519 --> 00:12:14.080 And choosing the right loss function is important 238 00:12:13.639 --> 00:12:18.840 too. Crucial. For binary classification, binary cross-entropy; for multiple classes, 239 00:12:18.879 --> 00:12:22.360 categorical cross-entropy; for regression problems where you predict a 240 00:12:22.440 --> 00:12:25.639 number, maybe mean squared error. And sometimes you need metrics 241 00:12:25.679 --> 00:12:30.080 beyond just accuracy, like cosine similarity or KL divergence, especially 242 00:12:30.080 --> 00:12:32.720 if you're comparing probability distributions or embeddings. 243 00:12:32.840 --> 00:12:36.279 Okay, and you mentioned ways to prevent overfitting, like dropout layers. 244 00:12:36.399 --> 00:12:39.840 Yeah, dropout is a really clever technique. During training, it 245 00:12:39.919 --> 00:12:42.879 randomly sets a fraction of neuron outputs to zero for 246 00:12:42.919 --> 00:12:44.240 each training example. 247 00:12:44.000 --> 00:12:46.679 So it forces the network not to rely too heavily 248 00:12:46.720 --> 00:12:48.559 on any single neuron. Exactly. 249 00:12:48.720 --> 00:12:52.279 It encourages redundancy and makes the network more robust. And 250 00:12:52.480 --> 00:12:56.320 early stopping, like we mentioned before, halting training when performance 251 00:12:56.320 --> 00:12:59.480 on a validation set stops improving, is another vital tool 252 00:12:59.480 --> 00:13:02.679 against overfitting. Helps find that sweet spot for the 253 00:13:02.759 --> 00:13:04.759 number of training epochs. Right. 254 00:13:05.080 --> 00:13:08.840 Okay, let's shift gears a bit. Multilayer perceptrons, or MLPs. 255 00:13:09.279 --> 00:13:13.080 What about them, particularly for signal detection tasks? What makes 256 00:13:13.080 --> 00:13:13.720 them distinct?
257 00:13:14.320 --> 00:13:17.960 MLPs are kind of the classic, foundational feedforward neural network. 258 00:13:18.440 --> 00:13:22.200 Their defining feature is that they only use fully connected layers. 259 00:13:22.120 --> 00:13:24.919 Meaning every neuron in one layer connects to every single 260 00:13:24.960 --> 00:13:25.639 neuron in the 261 00:13:25.600 --> 00:13:29.279 next layer. Precisely. Unlike CNNs with their specialized convolution layers 262 00:13:29.399 --> 00:13:33.159 or RNNs with their recurrent connections, MLPs are just stacks 263 00:13:33.159 --> 00:13:37.000 of these dense, fully connected layers. They're good general purpose learners, 264 00:13:37.000 --> 00:13:41.399 maybe less specialized than CNNs for images or LSTMs for sequences. 265 00:13:41.759 --> 00:13:42.679 And for MLPs, 266 00:13:42.840 --> 00:13:45.360 we looked at some specific data prep steps, didn't we? 267 00:13:45.840 --> 00:13:49.759 Like trimming white space from categories. Why is that important? Ah, 268 00:13:49.879 --> 00:13:53.480 yes, it sounds trivial, but if you have Male with 269 00:13:53.559 --> 00:13:57.120 a leading space and Male without, the computer sees them 270 00:13:57.159 --> 00:14:00.600 as two totally different categories, so cleaning that up is essential. 271 00:14:00.799 --> 00:14:04.799 And rescaling numeric values to a zero to one range. Why 272 00:14:04.799 --> 00:14:06.120 do we do that rescale step 273 00:14:06.159 --> 00:14:09.200 again? It's really about efficiency and stability. If you have 274 00:14:09.240 --> 00:14:11.440 one feature ranging from zero to one and another from 275 00:14:11.519 --> 00:14:14.639 zero to one million, the larger feature can dominate the 276 00:14:14.720 --> 00:14:18.360 learning process.
Scaling brings everything into the same range, so 277 00:14:18.399 --> 00:14:21.240 they contribute more equally, and it often helps the model's 278 00:14:21.279 --> 00:14:24.399 optimization process converge faster and more reliably. 279 00:14:24.600 --> 00:14:26.639 Makes sense, and there was a rule of thumb for 280 00:14:26.720 --> 00:14:27.720 hidden layer size. 281 00:14:27.799 --> 00:14:30.799 Yeah, a common heuristic, just a starting point really, is 282 00:14:30.840 --> 00:14:32.799 to try setting the number of nodes in a hidden 283 00:14:32.879 --> 00:14:35.559 layer to about two thirds of the input layer size. 284 00:14:35.720 --> 00:14:37.879 We saw how you could write functions in R using 285 00:14:37.879 --> 00:14:40.559 the MXNet syntax in the book to easily test 286 00:14:40.639 --> 00:14:43.840 different node counts and even experiment with adding more hidden layers. 287 00:14:44.159 --> 00:14:47.080 Okay, now let's talk about something we all encounter daily. 288 00:14:47.919 --> 00:14:52.480 Recommender systems. Yeah, streaming movies, online shopping. How do they 289 00:14:52.519 --> 00:14:55.399 actually work, and where does deep learning fit in? 290 00:14:55.720 --> 00:15:00.600 Right, recommenders. Broadly, there are three main types. Collaborative filtering, 291 00:15:00.720 --> 00:15:03.519 which finds users similar to you and recommends what they liked; 292 00:15:04.120 --> 00:15:07.440 content-based filtering, which recommends items similar to ones you've 293 00:15:07.480 --> 00:15:10.759 liked before based on their attributes; and hybrid systems, which 294 00:15:11.320 --> 00:15:13.000 try to combine the best of both worlds. 295 00:15:13.080 --> 00:15:15.480 And a big challenge is the cold start problem, right? 296 00:15:15.799 --> 00:15:17.240 For new users or new items? 297 00:15:17.320 --> 00:15:20.559 Exactly. If you're a new user, the system knows nothing 298 00:15:20.600 --> 00:15:22.960 about your tastes.
If it's a brand new movie, nobody 299 00:15:22.960 --> 00:15:25.559 has rated it yet. That makes recommendations difficult 300 00:15:25.600 --> 00:15:29.720 initially. What seemed really fascinating here was the idea of embeddings. 301 00:15:30.240 --> 00:15:32.639 How do these low-dimensional vectors help? 302 00:15:32.840 --> 00:15:35.919 Embeddings are a really powerful concept in deep learning, not 303 00:15:36.000 --> 00:15:40.600 just for recommendations. They basically learn dense, low-dimensional vector 304 00:15:40.639 --> 00:15:44.519 representations for things like users and items. Instead of dealing 305 00:15:44.519 --> 00:15:49.240 with huge, sparse matrices of user-item interactions, you map 306 00:15:49.440 --> 00:15:54.080 users and items into this shared latent space, a coordinate system, 307 00:15:54.120 --> 00:15:54.879 if you will. 308 00:15:54.759 --> 00:15:57.240 And closeness in that space means similarity. 309 00:15:57.159 --> 00:16:01.000 Precisely. Users are close to items they like, and similar users are 310 00:16:01.000 --> 00:16:03.879 close to each other. It captures these affinities efficiently, making 311 00:16:03.960 --> 00:16:06.960 it easy to calculate similarity, like with a dot product, 312 00:16:07.240 --> 00:16:09.480 even when you don't have explicit ratings for everything. 313 00:16:09.679 --> 00:16:12.519 And we looked at the steam-200k.csv 314 00:16:12.639 --> 00:16:15.600 data set, which uses implicit feedback. 315 00:16:15.759 --> 00:16:18.000 Yeah, that was a great example. Instead of star ratings, 316 00:16:18.039 --> 00:16:21.519 it uses hours played for video games. Sid Meier's Civilization 317 00:16:21.679 --> 00:16:25.240 V had huge hours logged by some users. This implicit 318 00:16:25.320 --> 00:16:29.159 data, clicks, views, purchase history, playtime, is often much more 319 00:16:29.200 --> 00:16:31.679 abundant and sometimes more revealing than explicit ratings.
320 00:16:31.759 --> 00:16:34.519 So we saw preparing that data, doing some exploratory data 321 00:16:34.559 --> 00:16:37.639 analysis, EDA, to understand those interactions. 322 00:16:37.200 --> 00:16:39.039 Yep, understanding who plays what, for how long. 323 00:16:39.279 --> 00:16:42.440 And then building a custom Keras model using both user 324 00:16:42.480 --> 00:16:44.080 embeddings and item embeddings. 325 00:16:44.159 --> 00:16:46.559 Right. But then there's another layer. How do you account 326 00:16:46.600 --> 00:16:50.279 for inherent biases? Some users just play games way more 327 00:16:50.360 --> 00:16:53.440 than others, regardless of the specific game, and some games 328 00:16:53.480 --> 00:16:55.000 are just universally popular. 329 00:16:55.200 --> 00:16:57.960 Ah, so you need to model those baseline tendencies. Exactly. 330 00:16:58.000 --> 00:17:01.200 Adding specific bias embeddings, one for the average user's 331 00:17:01.279 --> 00:17:04.480 tendency and one for the average item's popularity, can really 332 00:17:04.480 --> 00:17:07.400 improve the model. It lets the main embeddings focus on 333 00:17:07.480 --> 00:17:11.839 the interaction effect, the specific user-item affinity, separate from 334 00:17:11.839 --> 00:17:15.960 these general biases. In the book's example, adding biases nearly 335 00:17:16.079 --> 00:17:19.759 doubled the trainable parameters, but led to much better recommendations. 336 00:17:19.880 --> 00:17:23.759 Very clever. Okay, let's pivot to time series data. Stock 337 00:17:23.799 --> 00:17:27.160 price forecasting is the classic example. How does deep learning 338 00:17:27.200 --> 00:17:30.440 tackle this, given that the order of events is so critical? 339 00:17:30.319 --> 00:17:34.440 Time series is definitely unique. Unlike, say, image classification, where 340 00:17:34.440 --> 00:17:37.240 you can shuffle the images, with time series the sequence 341 00:17:37.279 --> 00:17:40.480 is everything.
You absolutely have to maintain chronological order when 342 00:17:40.480 --> 00:17:42.599 splitting data for training and testing. 343 00:17:42.440 --> 00:17:44.400 Because the past predicts the future, 344 00:17:44.480 --> 00:17:48.000 basically. Fundamentally, yes, the patterns are in the sequence. We 345 00:17:48.119 --> 00:17:52.519 compared this deep learning approach to traditional methods like ARIMA models. 346 00:17:52.920 --> 00:17:56.519 ARIMA can be good, but often struggles to predict complex 347 00:17:56.640 --> 00:17:59.480 patterns far beyond the training data it saw. 348 00:17:59.839 --> 00:18:03.119 This is where recurrent neural networks, RNNs, and 349 00:18:03.200 --> 00:18:07.480 especially long short-term memory, LSTM, networks come in. These 350 00:18:07.480 --> 00:18:08.920 are the game changers. 351 00:18:08.640 --> 00:18:12.200 They really are, for sequential data. LSTMs in particular are 352 00:18:12.240 --> 00:18:16.759 designed to have memory. They have internal mechanisms, these gates, 353 00:18:16.799 --> 00:18:20.000 that allow them to retain information from previous steps in 354 00:18:20.039 --> 00:18:22.039 the sequence and use it for current predictions. 355 00:18:22.119 --> 00:18:23.839 So they can remember relevant past 356 00:18:23.559 --> 00:18:27.599 events. Exactly, they can learn long-range dependencies. A crucial 357 00:18:27.640 --> 00:18:30.400 step we saw was transforming the raw stock prices using 358 00:18:30.480 --> 00:18:33.400 log differences. This helps achieve stationarity. 359 00:18:33.559 --> 00:18:35.200 Stationarity, what does that mean again? 360 00:18:35.480 --> 00:18:38.160 It means the statistical properties of the time series, like 361 00:18:38.240 --> 00:18:42.079 its average and variance, don't change over time. Most time 362 00:18:42.119 --> 00:18:46.119 series models, including LSTMs, work much better with stationary data.
363 00:18:46.880 --> 00:18:50.200 Raw stock prices usually aren't stationary; they tend to trend 364 00:18:50.279 --> 00:18:53.599 upwards over time. Log differences often stabilize them. 365 00:18:53.680 --> 00:18:56.440 Okay, and we use a time series generator to prepare 366 00:18:56.440 --> 00:18:56.799 the data. 367 00:18:57.000 --> 00:19:00.119 Yeah, that's a handy tool in Keras. It automatically creates 368 00:19:00.119 --> 00:19:02.720 batches of sequential data for the LSTM. You tell it 369 00:19:02.799 --> 00:19:06.119 how many past days to look back at, say ten days, 370 00:19:06.160 --> 00:19:09.200 to predict the next day's value. It handles creating those 371 00:19:09.200 --> 00:19:10.359 sliding windows for you. 372 00:19:10.680 --> 00:19:12.839 And then we built the actual LSTM 373 00:19:12.400 --> 00:19:16.000 model in Keras. Right, a sequential model, defining the LSTM layers, 374 00:19:16.079 --> 00:19:18.920 specifying the number of units or memory cells in each layer, 375 00:19:18.960 --> 00:19:21.359 and crucially the input shape, which has to match the 376 00:19:21.400 --> 00:19:24.400 lookback window and number of features. And of course 377 00:19:24.440 --> 00:19:27.759 tuning is vital here too: experimenting with the lookback window 378 00:19:27.799 --> 00:19:30.680 size, maybe three days works better than ten or vice versa, 379 00:19:31.039 --> 00:19:34.279 adding multiple LSTM layers, maybe with dropout in between to 380 00:19:34.359 --> 00:19:36.680 prevent overfitting on the sequence. 381 00:19:36.319 --> 00:19:40.480 And refining the optimizer, like the Adam optimizer's learning rate. Definitely. 382 00:19:40.920 --> 00:19:43.640 Finding the right learning rate is often critical for stable training, 383 00:19:43.720 --> 00:19:46.240 especially with time series where things can fluctuate a lot.
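To make the sliding-window idea concrete, here is a minimal sketch of the kind of input/target pairs a time series generator produces, and of the (lookback, features) shape an LSTM layer expects. It is plain Python rather than the Keras utility itself; `sliding_windows` is a hypothetical helper, and the series values are invented.

```python
def sliding_windows(series, lookback):
    """Pair each run of `lookback` consecutive values with the value after it."""
    inputs, targets = [], []
    for start in range(len(series) - lookback):
        inputs.append(series[start:start + lookback])
        targets.append(series[start + lookback])
    return inputs, targets


# Hypothetical, already-transformed series (for example, log differences).
series = [0.1, 0.2, 0.3, 0.4, 0.5]

# With a lookback of 3, each sample uses three past values to predict the next.
X, y = sliding_windows(series, lookback=3)

# An LSTM expects input shaped (samples, lookback, features). For a single
# series there is one feature, so each value is wrapped in its own list.
X_3d = [[[value] for value in window] for window in X]
```

Changing `lookback` from 3 to 10 changes both the number of samples and the input shape the first LSTM layer has to declare, which is why the window size and the model definition need to be tuned together.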
384 00:19:46.359 --> 00:19:49.119 Okay, this next one is maybe the most mind-bending: 385 00:19:49.599 --> 00:19:56.440 generative adversarial networks, GANs, creating synthetic images like faces totally 386 00:19:56.519 --> 00:19:58.960 from scratch. How does that even work? 387 00:19:59.119 --> 00:20:02.559 It is pretty amazing stuff. GANs are a really special 388 00:20:02.640 --> 00:20:06.920 type of unsupervised learning model. They're generative because their goal 389 00:20:07.039 --> 00:20:10.119 is to create new data that looks like the training data, 390 00:20:10.599 --> 00:20:13.759 and they're adversarial because they involve two neural networks locked 391 00:20:13.759 --> 00:20:15.039 in a competition. 392 00:20:14.640 --> 00:20:17.000 The generator and the discriminator. Exactly. 393 00:20:17.440 --> 00:20:20.440 The generator takes random noise as input and tries to 394 00:20:20.480 --> 00:20:23.400 transform it into a realistic-looking image, like a face. 395 00:20:23.880 --> 00:20:27.319 The discriminator, meanwhile, is shown both real images from the 396 00:20:27.359 --> 00:20:30.359 training set and fake images from the generator and has 397 00:20:30.359 --> 00:20:32.440 to learn to tell the difference: is this image real 398 00:20:32.519 --> 00:20:32.839 or fake? 399 00:20:33.079 --> 00:20:35.480 So it's like a counterfeiter, the generator, trying to fool 400 00:20:35.519 --> 00:20:37.319 a detective, the discriminator. 401 00:20:37.480 --> 00:20:40.559