WEBVTT 1 00:00:00.160 --> 00:00:03.720 Welcome to the deep dive. We take source materials, unpack 2 00:00:03.759 --> 00:00:06.400 complex topics, and basically give you the crucial insights and 3 00:00:06.440 --> 00:00:08.279 maybe some surprising facts along the way. 4 00:00:08.400 --> 00:00:10.240 Yeah, think of it as your shortcut to getting up 5 00:00:10.240 --> 00:00:11.279 to speed. Exactly. 6 00:00:11.679 --> 00:00:14.320 So today we're diving into some excerpts from a book, 7 00:00:14.480 --> 00:00:19.559 AI Crash Course. We're looking specifically at chapters covering reinforcement learning, 8 00:00:19.600 --> 00:00:21.600 deep learning, and AI in general. 9 00:00:21.800 --> 00:00:25.160 Right, and this source, it positions itself as a kind 10 00:00:25.160 --> 00:00:27.760 of all-in-one guide. It's built from online courses 11 00:00:27.760 --> 00:00:31.199 that were apparently quite successful, okay, and it really stresses 12 00:00:31.839 --> 00:00:35.240 getting the intuition first, then the math, and then, you know, 13 00:00:35.320 --> 00:00:36.840 actually coding things 14 00:00:36.679 --> 00:00:39.240 up. Right, intuition first. So our mission here is to 15 00:00:39.320 --> 00:00:42.920 pull out those core ideas, help build that intuitive feel, 16 00:00:42.960 --> 00:00:45.520 and look at some of the, well, pretty exciting real 17 00:00:45.560 --> 00:00:48.560 world applications they talk about. The idea is to help you, 18 00:00:48.759 --> 00:00:53.079 the listener, understand how these AI models actually work and, importantly, 19 00:00:53.280 --> 00:00:54.280 where they might be used. 20 00:00:54.320 --> 00:00:56.479 And the book sets a big stage. It talks about 21 00:00:56.479 --> 00:01:04.359 AI's potential impact across, well, almost everything: transport, education, security, jobs, entertainment, 22 00:01:03.920 --> 00:01:06.439 even the environment. So a lot of potential there. 23 00:01:06.519 --> 00:01:10.040 Definitely. 
It frames these technologies as potentially transformative. 24 00:01:10.120 --> 00:01:13.519 Okay, let's unpack this. Where does our source material suggest 25 00:01:13.560 --> 00:01:16.920 we start when building an AI, especially in this reinforcement 26 00:01:16.959 --> 00:01:17.879 learning space? 27 00:01:17.799 --> 00:01:20.439 It starts with the absolute foundation. You have to define 28 00:01:20.439 --> 00:01:21.920 the AI's environment. 29 00:01:22.079 --> 00:01:22.640 The environment. 30 00:01:22.879 --> 00:01:25.760 Yeah, that's the world, the context the AI operates in. 31 00:01:26.200 --> 00:01:28.840 And it's got three really key parts. All right, what 32 00:01:28.879 --> 00:01:33.200 are those? First up, states. States are basically the inputs 33 00:01:33.239 --> 00:01:36.159 the AI gets, what it perceives. 34 00:01:36.000 --> 00:01:38.439 Like sensor readings for a self-driving car, 35 00:01:38.319 --> 00:01:42.680 maybe? Exactly, sensor readings, speed, location, or for a simple 36 00:01:42.760 --> 00:01:45.280 robot in a maze, the state might just be which 37 00:01:45.280 --> 00:01:47.640 square it's currently in. It's the "where am I?" or 38 00:01:47.680 --> 00:01:48.879 "what's happening?" info. 39 00:01:49.040 --> 00:01:52.040 Okay, so state is what the AI knows about its situation. 40 00:01:52.519 --> 00:01:53.159 What's next? 41 00:01:53.280 --> 00:01:55.920 Next are the actions. These are the things the AI 42 00:01:56.000 --> 00:01:57.719 can do, the choices it can make. 43 00:01:57.840 --> 00:02:00.640 So for the car, turn left, accelerate, brake? 44 00:02:00.439 --> 00:02:04.280 Yep, or for the maze robot, move north, south, east, west. 45 00:02:04.519 --> 00:02:07.480 Those are its possible moves, its decisions. Makes sense. 46 00:02:07.719 --> 00:02:12.319 State, action, and the third piece must be important. 47 00:02:12.080 --> 00:02:16.800 Critically important: rewards. 
This is the feedback, what the AI 48 00:02:17.000 --> 00:02:20.439 gets after it takes an action in a certain state. 49 00:02:20.800 --> 00:02:22.319 Ah, the feedback loop. 50 00:02:22.520 --> 00:02:26.000 Precisely. It could be positive, like reaching a goal, or 51 00:02:26.039 --> 00:02:30.639 negative, like hitting a wall. This reward signal is what 52 00:02:30.719 --> 00:02:33.520 guides the AI. It tells it what's good and what's bad. 53 00:02:33.680 --> 00:02:35.800 So the whole game for the AI is to figure 54 00:02:35.800 --> 00:02:38.840 out how to act, which actions to take in which 55 00:02:38.919 --> 00:02:41.800 states to get the most reward possible over time. 56 00:02:41.919 --> 00:02:45.199 That's the core idea, maximize cumulative reward. It learns by 57 00:02:45.439 --> 00:02:49.199 essentially trial and error, driven by those rewards. And the 58 00:02:49.240 --> 00:02:53.599 source makes this important distinction too, between training mode and inference 59 00:02:53.199 --> 00:02:55.639 mode. Right, training versus inference. 60 00:02:55.319 --> 00:02:59.240 Training is where it's learning, interacting, getting rewards, updating its 61 00:02:59.280 --> 00:03:02.199 understanding. Inference, well, that's showtime: the trained AI 62 00:03:02.599 --> 00:03:05.039 uses what it learned to just do the task without 63 00:03:05.080 --> 00:03:05.719 learning anymore. 64 00:03:05.800 --> 00:03:08.960 Got it, learn first, then perform. So with that framework, 65 00:03:09.000 --> 00:03:12.639 states, actions, rewards, what's, like, the first actual AI model 66 00:03:12.680 --> 00:03:13.439 the book introduces? 67 00:03:13.560 --> 00:03:15.680 It kicks off with the classic problem, actually, the multi- 68 00:03:15.719 --> 00:03:16.520 armed bandit problem. 69 00:03:16.639 --> 00:03:18.680 Ah, the slot machines. I remember this 70 00:03:18.599 --> 00:03:22.759 one. Yeah, exactly, multiple slot machines, bandits, in a casino. 
71 00:03:23.520 --> 00:03:26.400 Each pays out with a different probability, but you don't 72 00:03:26.439 --> 00:03:27.520 know those probabilities. 73 00:03:27.680 --> 00:03:29.520 So the question is, how do you play them to 74 00:03:29.599 --> 00:03:33.080 maximize your winnings over time without knowing which machine is 75 00:03:33.120 --> 00:03:34.879 actually best initially? Exactly. 76 00:03:35.280 --> 00:03:38.560 And the AI approach discussed is Thompson sampling. 77 00:03:38.680 --> 00:03:40.919 Thompson sampling. Okay, what's the intuition there? 78 00:03:41.000 --> 00:03:43.879 Sounds statistical. It is, but the intuition is quite neat. 79 00:03:43.919 --> 00:03:46.039 You could just keep playing the machine that's paid out 80 00:03:46.080 --> 00:03:46.479 the most 81 00:03:46.360 --> 00:03:48.039 so far, right? Yeah, seems logical. 82 00:03:48.199 --> 00:03:50.639 Exploit the winner. But that winner might just be on 83 00:03:50.680 --> 00:03:51.520 a lucky streak. 84 00:03:52.120 --> 00:03:55.120 Thompson sampling is smarter. It keeps track of wins and 85 00:03:55.159 --> 00:03:58.400 losses for each machine, okay, and uses that history to 86 00:03:58.479 --> 00:04:03.159 maintain a probability distribution for each machine's likely success rate, 87 00:04:03.800 --> 00:04:05.960 specifically a beta distribution. 88 00:04:06.199 --> 00:04:10.080 A beta distribution. So more wins on a machine means 89 00:04:10.199 --> 00:04:13.599 its distribution shifts towards predicting higher success? Right. 90 00:04:13.680 --> 00:04:17.199 More wins, fewer losses, the distribution gets more confident that 91 00:04:17.240 --> 00:04:21.519 the machine is good. Now here's the clever bit. Each round, 92 00:04:21.560 --> 00:04:23.560 you don't just pick the machine with the highest average 93 00:04:23.560 --> 00:04:24.560 win rate so far. 
94 00:04:24.680 --> 00:04:28.000 No? Well then, you take a random draw from each 95 00:04:28.040 --> 00:04:32.319 machine's beta distribution, and you play the machine whose random 96 00:04:32.399 --> 00:04:34.439 draw came out highest for that round. 97 00:04:34.759 --> 00:04:37.600 Random draw? Why random? Seems like you'd want the most 98 00:04:37.639 --> 00:04:38.199 likely winner. 99 00:04:38.519 --> 00:04:42.399 That randomness is key. It builds in exploration. A machine 100 00:04:42.439 --> 00:04:46.399 you haven't played much will have a wider, less certain distribution, 101 00:04:46.959 --> 00:04:50.319 so its random draws might sometimes be high, prompting you 102 00:04:50.360 --> 00:04:50.959 to try it out. 103 00:04:51.160 --> 00:04:54.199 Ah, so it forces you to explore the less known options 104 00:04:54.240 --> 00:04:58.079 sometimes, just in case they're actually better than the current favorite. 105 00:04:58.199 --> 00:05:02.959 Exactly. It naturally balances exploring new things (exploration) with 106 00:05:03.079 --> 00:05:06.319 sticking to what seems to work (exploitation), and it does 107 00:05:06.360 --> 00:05:09.240 this just based on the observed wins and losses, without 108 00:05:09.279 --> 00:05:10.600 needing the true payout rates. 109 00:05:10.759 --> 00:05:14.879 That's really clever. Balancing exploration and exploitation is a classic problem. 110 00:05:15.000 --> 00:05:16.920 Does the book give a real world example? 111 00:05:17.079 --> 00:05:20.160 Yes, a great one. Online advertising: which ad version gets the 112 00:05:20.199 --> 00:05:21.360 most clicks or sign ups? 113 00:05:21.399 --> 00:05:24.040 Okay, so each ad variation is like a slot machine 114 00:05:23.800 --> 00:05:27.240 arm. Perfect analogy. You show different ads (actions), you 115 00:05:27.319 --> 00:05:30.959 track clicks or conversions (rewards): one for a click, zero for 116 00:05:31.040 --> 00:05:35.519 no click. Thompson sampling figures out which ad performs best over 
117 00:05:35.319 --> 00:05:40.040 time by showing ads somewhat randomly based on those beta distributions, 118 00:05:40.120 --> 00:05:41.199 learning as it goes. 119 00:05:41.079 --> 00:05:45.560 Yep, it converges on the statistically best ad, adapting as 120 00:05:45.600 --> 00:05:48.680 it gets more data. Maps directly from the casino problem. 121 00:05:48.959 --> 00:05:52.480 Pretty neat. Very neat. Okay, so Thompson sampling helps pick 122 00:05:52.519 --> 00:05:55.800 the best single option. But many problems involve a sequence 123 00:05:55.800 --> 00:05:56.839 of actions to reach a 124 00:05:56.759 --> 00:05:59.959 goal. Right, and that's where the source introduces Q-learning. 125 00:06:00.319 --> 00:06:04.720 This is a really foundational reinforcement learning algorithm for sequential decisions. 126 00:06:04.959 --> 00:06:06.360 Q-learning. What's the Q stand for? 127 00:06:06.800 --> 00:06:10.120 It stands for quality, essentially. The core idea is the 128 00:06:10.199 --> 00:06:10.920 Q value. 129 00:06:11.040 --> 00:06:12.639 Okay, quality value. What does it represent? 130 00:06:12.720 --> 00:06:16.560 A Q value, written Q(s, a), is a number. It 131 00:06:16.600 --> 00:06:19.759 represents the expected total future reward you'll get if you 132 00:06:19.800 --> 00:06:22.000 take action a when you're in state s, and, 133 00:06:22.639 --> 00:06:24.560 this is key, you act optimally after that. 134 00:06:24.720 --> 00:06:27.240 Whoa. Okay, so it's not just the immediate reward for 135 00:06:27.279 --> 00:06:30.160 taking action a, it's that plus the best possible rewards 136 00:06:30.199 --> 00:06:31.040 you could get from then on. 137 00:06:31.439 --> 00:06:34.879 Exactly. It's the long-term value of taking that specific 138 00:06:34.959 --> 00:06:38.720 action in that specific state. The goal of Q-learning 139 00:06:39.040 --> 00:06:42.279 is to learn these Q values for all possible state 140 00:06:42.319 --> 00:06:43.160 action pairs. 
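A minimal sketch of the Thompson sampling loop described earlier for the ad example, not the book's own code. It assumes each ad pays a reward of 1 (click) or 0 (no click), and that each arm's belief is a Beta(wins + 1, losses + 1) distribution. The function name `thompson_sampling` and the click rates are hypothetical values chosen for illustration.

```python
import random

def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Thompson sampling on arms that pay 1 or 0 (ads or slot machines)."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # count of reward-1 outcomes per arm
    losses = [0] * len(true_rates)  # count of reward-0 outcomes per arm
    for _ in range(rounds):
        # One random draw per arm from its Beta(wins + 1, losses + 1) belief.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))  # play the arm with the highest draw
        # The environment pays 1 with the arm's hidden probability.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

# Hypothetical click rates for three ad versions (unknown to the algorithm).
wins, losses = thompson_sampling([0.05, 0.25, 0.10], rounds=5000)
plays = [w + l for w, l in zip(wins, losses)]
print(plays)  # the second ad should end up shown far more often than the others
```

Rarely played arms keep a wide distribution, so their draws occasionally come out on top, which is where the exploration comes from.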
141 00:06:43.439 --> 00:06:45.079 So if you know all the Q values, you just 142 00:06:45.120 --> 00:06:47.519 pick the action with the highest Q value in your 143 00:06:47.519 --> 00:06:49.279 current state, and that's the best move. 144 00:06:49.480 --> 00:06:51.879 That's the idea for using it once it's learned, yes. 145 00:06:52.480 --> 00:06:55.560 But how does it learn those values? You use something 146 00:06:55.560 --> 00:06:57.319 called the temporal difference, or TD. 147 00:06:57.759 --> 00:06:59.959 Temporal difference. Sounds like difference over time. 148 00:07:00.199 --> 00:07:03.319 Kind of. Think of TD as measuring the surprise. It's 149 00:07:03.360 --> 00:07:06.439 the difference between the AI's current estimate of Q(s, a) 150 00:07:07.079 --> 00:07:09.720 and a better estimate it gets after actually taking action a, 151 00:07:10.079 --> 00:07:12.800 getting a reward r, and seeing the next state, s prime. 152 00:07:12.959 --> 00:07:14.319 How does it get that better estimate? 153 00:07:14.399 --> 00:07:17.519 The better estimate is the immediate reward r plus the 154 00:07:17.560 --> 00:07:20.519 maximum Q value it could get from that next state, s 155 00:07:20.560 --> 00:07:22.920 prime. Basically r plus max Q(s', a'). 156 00:07:23.040 --> 00:07:26.639 Okay, so TD is actual reward plus best future value 157 00:07:26.639 --> 00:07:29.399 from next state minus my old estimate of current state 158 00:07:29.439 --> 00:07:30.079 action value. 159 00:07:30.120 --> 00:07:32.399 You've got it. A big positive TD means "Wow, that 160 00:07:32.439 --> 00:07:34.800 action was way better than I thought." A negative TD 161 00:07:34.920 --> 00:07:36.360 means "Oops, that was worse." 162 00:07:36.639 --> 00:07:39.560 And this TD error is used to update the original 163 00:07:39.600 --> 00:07:40.879 Q value estimate? 164 00:07:40.600 --> 00:07:43.959 Precisely, using the Bellman equation, which is the mathematical rule 165 00:07:43.959 --> 00:07:46.360 for this update. 
It uses the TD error and a 166 00:07:46.439 --> 00:07:49.920 learning rate to nudge the Q value closer to that 167 00:07:49.920 --> 00:07:54.560 better estimate. It links the immediate reward to the future potential. 168 00:07:54.319 --> 00:07:56.680 So it learns iteratively. Can you walk through the training 169 00:07:56.680 --> 00:07:58.000 process generally? Sure. 170 00:07:58.439 --> 00:08:01.480 You start by initializing all Q values, maybe to zero. 171 00:08:01.920 --> 00:08:04.920 Then you run many episodes. In each episode, maybe start at 172 00:08:04.959 --> 00:08:08.399 a random state, pick a random valid action, see what 173 00:08:08.480 --> 00:08:10.319 reward you get and what state you land in. 174 00:08:10.399 --> 00:08:10.720 Okay. 175 00:08:10.879 --> 00:08:13.439 Then you calculate that TD error based on the reward and 176 00:08:13.480 --> 00:08:15.360 the max Q value of the next state, and you 177 00:08:15.480 --> 00:08:17.360 update the Q value for the state-action pair you 178 00:08:17.480 --> 00:08:19.639 just experienced. Repeat, repeat, repeat. 179 00:08:19.480 --> 00:08:22.240 Lots of exploration and updating. Exactly. 180 00:08:22.240 --> 00:08:25.800 Over time, exploring the environment and propagating these rewards back 181 00:08:25.879 --> 00:08:28.759 via the TD updates, the Q values start to converge 182 00:08:28.800 --> 00:08:30.480 towards the true optimal values. 183 00:08:30.680 --> 00:08:33.960 And then once training is done, the inference process is 184 00:08:33.919 --> 00:08:36.840 simple? Very simple. Put the AI in any state s. 185 00:08:37.360 --> 00:08:39.720 It looks up the learned Q values for all possible 186 00:08:39.759 --> 00:08:42.559 actions a from that state. It picks the action with 187 00:08:42.600 --> 00:08:45.000 the highest Q value. That's its policy. 188 00:08:45.320 --> 00:08:48.000 Okay, that makes sense. It learns the map of values, 189 00:08:48.080 --> 00:08:51.039 then follows the path of highest value. 
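The training and inference steps just described can be sketched as a small tabular Q-learning loop. This is an illustrative sketch, not the book's implementation: the four-location layout (A, B, C, G in a line), the reward values, and the constants `GAMMA` and `ALPHA` are all assumptions, and the update includes a discount factor gamma, which the conversation only brings up later in the context of deep Q-learning.

```python
import random

# Hypothetical layout: A - B - C - G in a line, with G as the goal.
NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "G"], "G": ["C"]}
GOAL = "G"
GAMMA = 0.75  # discount factor (assumed value)
ALPHA = 0.9   # learning rate (assumed value)

def train(iterations=2000, seed=0):
    rng = random.Random(seed)
    # Initialize every Q(s, a) to zero; an action is the location moved to.
    q = {s: {a: 0.0 for a in moves} for s, moves in NEIGHBORS.items()}
    for _ in range(iterations):
        state = rng.choice(list(NEIGHBORS))    # random starting state
        action = rng.choice(NEIGHBORS[state])  # random valid action
        next_state = action
        reward = 1000.0 if next_state == GOAL else 1.0
        # TD error: (reward + discounted best next value) - current estimate.
        td = reward + GAMMA * max(q[next_state].values()) - q[state][action]
        q[state][action] += ALPHA * td         # nudge toward the better estimate
    return q

def best_path(q, start, max_steps=10):
    """Inference: follow the highest Q value from each state until the goal."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        state = path[-1]
        path.append(max(q[state], key=q[state].get))
    return path

q_table = train()
print(best_path(q_table, "A"))
```

Because the large reward sits on the move into G, the TD updates carry that value backward, so moves pointing toward G end up with the highest Q values.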
The source gives 190 00:08:51.080 --> 00:08:52.960 a warehouse robot example, right? 191 00:08:52.879 --> 00:08:56.120 Yeah, a really clear one: guiding a robot through a 192 00:08:56.120 --> 00:08:59.039 maze-like warehouse layout to get to a specific 193 00:08:59.120 --> 00:09:00.840 goal location, say location G. 194 00:09:01.200 --> 00:09:04.320 How does that map to states, actions, rewards? 195 00:09:04.799 --> 00:09:08.440 The states are just the robot's current location: A, B, C. The 196 00:09:08.480 --> 00:09:11.600 actions are moving to an adjacent connected location. Simple enough. 197 00:09:11.639 --> 00:09:13.399 And the rewards are designed to get it to G. 198 00:09:13.799 --> 00:09:16.360 Maybe a small reward, like plus one, for any valid 199 00:09:16.360 --> 00:09:19.519 move between locations, zero reward if it tries to move 200 00:09:19.600 --> 00:09:22.039 through a wall, and the goal? A big reward, say 201 00:09:22.120 --> 00:09:26.000 plus one thousand, for reaching location G. That high value 202 00:09:26.039 --> 00:09:27.360 at the goal is the incentive. 203 00:09:27.519 --> 00:09:31.159 So during training, the robot wanders around, bumping into walls, 204 00:09:31.240 --> 00:09:34.000 maybe stumbling into G eventually. Right, and 205 00:09:33.919 --> 00:09:36.279 when it gets rewards, especially that big one at G, 206 00:09:36.919 --> 00:09:40.720 the TD updates start propagating that value backwards along the 207 00:09:40.759 --> 00:09:41.639 paths leading to G. 208 00:09:41.879 --> 00:09:45.080 So actions that lead towards G gradually get higher Q 209 00:09:45.159 --> 00:09:50.360 values. Exactly. The Q values effectively learn the goodness of 210 00:09:50.399 --> 00:09:52.159 each move in terms of reaching 211 00:09:51.879 --> 00:09:53.679 the goal. And the source also mentioned you could add 212 00:09:53.720 --> 00:09:57.159 intermediate goals, like forcing the robot to go through location 213 00:09:57.320 --> 00:09:58.240 K on the way to G. 
214 00:09:58.600 --> 00:10:01.559 Yes, you just tweak the reward matrix: give a medium- 215 00:10:01.600 --> 00:10:04.960 sized reward, maybe five hundred, specifically for the action of 216 00:10:05.000 --> 00:10:08.480 moving from J to K, if that's the desired intermediate step. 217 00:10:08.320 --> 00:10:11.159 Ah, make that specific transition valuable. 218 00:10:10.919 --> 00:10:13.200 Or you could add a big negative reward, minus 219 00:10:13.240 --> 00:10:15.440 five hundred, for a transition you wanted to avoid, like 220 00:10:15.480 --> 00:10:18.720 going from J to F. You shape the desired path by 221 00:10:18.759 --> 00:10:21.480 manipulating the rewards for specific state-action 222 00:10:21.360 --> 00:10:25.120 pairs. Very flexible. And in inference, the trained robot would 223 00:10:25.120 --> 00:10:27.679 then follow the path that accumulated the highest 224 00:10:27.399 --> 00:10:30.279 Q values? Exactly. The example path mentioned is E to I 225 00:10:30.440 --> 00:10:34.799 to J to K, then L, H, G. The robot figures that out just 226 00:10:34.840 --> 00:10:37.559 by following the highest Q value at each step, guided 227 00:10:37.600 --> 00:10:38.759 by the rewards you designed. 228 00:10:39.000 --> 00:10:41.639 Okay, Q-learning seems powerful for these kinds of discrete 229 00:10:41.679 --> 00:10:44.679 state-space problems. But what about more complex stuff, like 230 00:10:44.960 --> 00:10:47.960 dealing with messy continuous data or images? 231 00:10:48.399 --> 00:10:52.159 Exactly, that's the limit of basic Q-learning tables. For 232 00:10:52.279 --> 00:10:55.279 more complex problems, the source brings in artificial neural networks, 233 00:10:55.320 --> 00:10:58.799 ANNs, and deep learning. The artificial brains? Kind of, yeah, 234 00:10:59.039 --> 00:11:02.279 inspired by biological brains. The basic unit is the neuron. 235 00:11:02.399 --> 00:11:05.440 It gets inputs, multiplies them by weights, sums them 
236 00:11:05.399 --> 00:11:08.639 up, and passes the result through an activation function like 237 00:11:08.799 --> 00:11:10.799 ReLU, the rectifier you mentioned? Right. 238 00:11:10.840 --> 00:11:14.159 That activation function adds nonlinearity, which is super important for 239 00:11:14.240 --> 00:11:18.480 learning complex patterns. These neurons are arranged in layers: input, 240 00:11:18.600 --> 00:11:21.440 hidden layers, output. Information flows forward. 241 00:11:21.679 --> 00:11:25.559 Okay, and how do these networks learn? You mentioned adjusting weights. 242 00:11:25.720 --> 00:11:29.039 They learn by trying to minimize error. For example, predicting 243 00:11:29.039 --> 00:11:32.120 house prices: the network makes a prediction, you compare it to 244 00:11:32.120 --> 00:11:32.919 the actual price. 245 00:11:33.200 --> 00:11:35.679 That difference is the loss, the error, and it tries to 246 00:11:35.720 --> 00:11:36.720 reduce that error? 247 00:11:36.799 --> 00:11:41.000 Yes, using optimization algorithms like gradient descent. It calculates how 248 00:11:41.039 --> 00:11:43.879 adjusting each weight would affect the error and nudges the 249 00:11:43.879 --> 00:11:47.080 weights in the direction that reduces the error, over many, many 250 00:11:46.879 --> 00:11:50.720 examples. The book uses that house price prediction example. What's 251 00:11:50.759 --> 00:11:54.399 a really critical step when you feed data like house size, 252 00:11:54.440 --> 00:11:57.399 number of bedrooms, et cetera into an ANN? 253 00:11:57.279 --> 00:12:00.559 Data prep is huge. Splitting into training and test sets 254 00:12:00.639 --> 00:12:04.279 is standard, but the crucial thing, especially for ANNs, 255 00:12:04.600 --> 00:12:05.679 is scaling the data. 256 00:12:06.360 --> 00:12:08.240 Scaling. Why is that so vital? 257 00:12:08.480 --> 00:12:12.480 Imagine number of bedrooms, maybe one to five, versus square 258 00:12:12.519 --> 00:12:17.759 footage, in the thousands. 
Without scaling, the network might overweight square footage 259 00:12:17.799 --> 00:12:20.320 just because the numbers are bigger, even if bedrooms are 260 00:12:20.360 --> 00:12:21.039 just as important. 261 00:12:21.240 --> 00:12:24.200 Ah, the scale of the numbers dominates the learning. 262 00:12:24.320 --> 00:12:27.679 Exactly. Scaling methods like MinMaxScaler, mentioned in the 263 00:12:27.720 --> 00:12:30.759 source, bring all features into a similar range, like zero 264 00:12:30.840 --> 00:12:33.559 to one, so the network learns based on the predictive 265 00:12:33.600 --> 00:12:36.519 power of each feature, not just its raw numerical size. 266 00:12:36.639 --> 00:12:39.639 Makes sense, leveling the playing field for the input features. Okay, 267 00:12:39.679 --> 00:12:43.159 so we have Q-learning for sequences, ANNs for complex data. 268 00:12:43.679 --> 00:12:44.840 What happens when you put them 269 00:12:44.679 --> 00:12:48.600 together? Magic happens. That's deep Q-learning, or DQN. This 270 00:12:48.679 --> 00:12:52.000 is where things get really powerful for complex RL problems. 271 00:12:51.600 --> 00:12:53.639 Deep Q-learning. So the deep comes from the deep 272 00:12:53.720 --> 00:12:55.600 learning neural network? Exactly. 273 00:12:55.919 --> 00:12:58.440 The ANN acts as a function approximator for the 274 00:12:58.519 --> 00:13:02.279 Q function, instead of a giant table storing Q(s, a) 275 00:13:02.799 --> 00:13:06.159 for every possible state and action, which is impossible for 276 00:13:06.240 --> 00:13:07.320 complex environments. 277 00:13:07.399 --> 00:13:10.559 Right, the state space could be enormous or even continuous. 278 00:13:10.639 --> 00:13:14.240 The ANN takes the state s as input, and its 279 00:13:14.279 --> 00:13:18.200 output layer predicts the Q values for all possible actions 280 00:13:18.240 --> 00:13:19.320 a from that state. 
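Returning briefly to the scaling step mentioned above: a minimal sketch of min-max scaling in plain Python, assuming the usual formula (x - min) / (max - min) applied per feature. The helper `min_max_scale` and the sample values are hypothetical, and this is not the scikit-learn scaler itself.

```python
def min_max_scale(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical house features on very different numeric scales.
bedrooms = [1, 2, 3, 5]
square_feet = [800, 1500, 2400, 4000]

scaled_bedrooms = min_max_scale(bedrooms)
scaled_square_feet = min_max_scale(square_feet)
print(scaled_bedrooms)     # [0.0, 0.25, 0.5, 1.0]
print(scaled_square_feet)  # [0.0, 0.21875, 0.5, 1.0]
```

After scaling, both features span the same range, so neither dominates the weight updates purely because of its units. In practice the minimum and maximum are taken from the training set and reused on the test set.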
281 00:13:19.559 --> 00:13:21.919 So the network learns to estimate the Q values on 282 00:13:21.960 --> 00:13:23.720 the fly based on the input state. 283 00:13:23.840 --> 00:13:27.039 Precisely. It generalizes. Now, when it comes to choosing an 284 00:13:27.039 --> 00:13:30.080 action during training, DQN doesn't always just pick the action 285 00:13:30.159 --> 00:13:33.279 with the highest predicted Q value. That would be pure exploitation. 286 00:13:33.559 --> 00:13:36.559 It needs exploration too, right, like in Thompson sampling? Exactly. 287 00:13:36.919 --> 00:13:40.360 The source mentions common strategies like softmax or epsilon-greedy 288 00:13:40.360 --> 00:13:41.879 exploration. Epsilon-greedy, 289 00:13:41.919 --> 00:13:44.000 that's the one where, say, ten percent of the time, 290 00:13:44.039 --> 00:13:46.000 it just picks a random action instead of the best one? 291 00:13:46.120 --> 00:13:50.360 Yeah, that's the idea: with probability epsilon, explore randomly; otherwise, 292 00:13:50.879 --> 00:13:55.320 exploit the best-known action. Softmax assigns probabilities based on 293 00:13:55.440 --> 00:13:59.440 Q values, giving even weaker actions some chance. This exploration 294 00:13:59.519 --> 00:14:03.000 is crucial for discovering potentially better strategies the AI doesn't 295 00:14:03.000 --> 00:14:03.759 know about yet. 296 00:14:03.919 --> 00:14:07.120 Okay, so how does the DQN actually learn? How does 297 00:14:07.159 --> 00:14:09.879 the network get better at predicting Q values? 298 00:14:10.159 --> 00:14:12.320 It's similar to the Q-learning update, but uses the 299 00:14:12.360 --> 00:14:16.080 network. The AI is in state s, picks an action 300 00:14:16.159 --> 00:14:19.279 a using epsilon-greedy or similar, observes the reward 301 00:14:19.480 --> 00:14:20.919 r and the next state, s prime. 302 00:14:21.240 --> 00:14:21.480 Okay. 
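The two exploration strategies named above, epsilon-greedy and softmax, can be sketched as follows. This is an illustrative sketch under assumptions: the Q values are a plain list indexed by action, the `temperature` parameter is one common way to control how sharply softmax favors high Q values, and all function names are hypothetical.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

def softmax_probs(q_values, temperature=1.0):
    """Turn Q values into action probabilities; higher Q means higher probability."""
    top = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(q_values, temperature, rng):
    """Sample an action in proportion to its softmax probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

rng = random.Random(0)
q_values = [1.0, 2.5, 0.3]  # hypothetical predicted Q values for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # never explores
probs = softmax_probs(q_values)
print(greedy_action, probs)
```

Setting epsilon to 0.1 gives the "ten percent of the time" behavior mentioned in the conversation; softmax instead gives every action a nonzero chance that grows with its Q value.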
303 00:14:21.559 --> 00:14:24.120 It then uses the same neural network to predict the 304 00:14:24.159 --> 00:14:27.200 maximum Q value possible from that next state, s prime. 305 00:14:27.320 --> 00:14:30.159 Let's call that max Q(s', a'). It calculates the target Q value: 306 00:14:30.320 --> 00:14:34.360 target equals r plus gamma times max Q(s', a'). Gamma is a 307 00:14:34.440 --> 00:14:36.600 discount factor for future rewards. So, 308 00:14:36.679 --> 00:14:39.159 reward plus the discounted best value from the next state, 309 00:14:39.360 --> 00:14:40.960 that's the target? Right. Now 310 00:14:41.000 --> 00:14:43.279 it compares this target value to the Q value the 311 00:14:43.320 --> 00:14:45.759 network originally predicted for the action a it actually took 312 00:14:45.759 --> 00:14:48.080 in state s. The difference between the prediction and the 313 00:14:48.120 --> 00:14:51.159 target is the error, the temporal difference error again. 314 00:14:51.120 --> 00:14:53.639 And that error signal is used to update 315 00:14:53.360 --> 00:14:57.320 the network? Exactly. The error is backpropagated through the ANN, 316 00:14:57.799 --> 00:15:01.600 adjusting the weights so that next time the network's prediction 317 00:15:01.759 --> 00:15:05.440 for Q(s, a) will be closer to that target value. 318 00:15:05.639 --> 00:15:08.639 It learns to make better predictions through experience. 319 00:15:08.519 --> 00:15:11.200 And there was something about experience replay 320 00:15:10.759 --> 00:15:14.639 memory? Ah yes, crucial for stability. Instead of learning only 321 00:15:14.679 --> 00:15:17.519 from the very last thing that happened, the AI stores 322 00:15:17.600 --> 00:15:22.159 lots of past experiences, (state, action, reward, next state) 323 00:15:22.600 --> 00:15:26.320 tuples, in a big memory buffer. Okay. Then for learning updates, 324 00:15:26.320 --> 00:15:29.960 it samples random mini-batches of these past experiences 
325 00:15:29.399 --> 00:15:31.039 from the buffer. Why random batches? 326 00:15:31.159 --> 00:15:35.039 It breaks the correlation between consecutive experiences. Learning step by 327 00:15:35.039 --> 00:15:38.320 step can be unstable because consecutive states are often very similar. 328 00:15:38.759 --> 00:15:41.840 Random sampling makes the training data more diverse and independent 329 00:15:41.879 --> 00:15:45.080 in each batch, which really helps stabilize the learning process 330 00:15:45.080 --> 00:15:46.279 for the deep neural network. 331 00:15:46.360 --> 00:15:49.440 Got it. Okay, DQN sounds really powerful. The source must 332 00:15:49.440 --> 00:15:52.759 have some cool applications. You mentioned a virtual self-driving 333 00:15:52.480 --> 00:15:54.919 car? Yeah, a great example. In the book, they use 334 00:15:54.960 --> 00:15:57.879 a Kivy app, a Python framework, to simulate it. The 335 00:15:57.879 --> 00:16:01.080 input states for the AI are things like the 336 00:16:01.159 --> 00:16:05.320 car's angle towards the goal, but also, crucially, sensor readings. 337 00:16:05.399 --> 00:16:09.639 What kind of sensors? Virtual sensors detecting sand, basically obstacles 338 00:16:09.720 --> 00:16:11.960 or off-road areas, to the left, front, and right. 339 00:16:12.440 --> 00:16:15.039 This gives the AI situational awareness. 340 00:16:14.600 --> 00:16:17.120 And the actions are simple driving controls? 341 00:16:16.799 --> 00:16:17.919 Basic steering adjustments, 342 00:16:18.000 --> 00:16:18.240 yeah. 343 00:16:18.799 --> 00:16:21.840 The rewards are set up to encourage driving well: a 344 00:16:21.879 --> 00:16:25.519 penalty, negative one, for hitting sand borders, a smaller penalty, 345 00:16:25.600 --> 00:16:27.279 negative point two, if it moves away from 346 00:16:27.279 --> 00:16:29.559 the goal, and a small reward, plus point one, for 347 00:16:29.600 --> 00:16:30.440 moving towards the goal. 
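The reward scheme just described could be written as a small function. This is a sketch under stated assumptions, not the book's code: the name `driving_reward`, its parameters, and the choice to treat an unchanged distance as "moving away" are all hypothetical; only the three reward values come from the conversation.

```python
def driving_reward(on_sand, distance, last_distance):
    """Reward for one simulation step of the virtual car.

    on_sand:        True if the car hit sand or a border this step
    distance:       current distance to the goal
    last_distance:  distance to the goal on the previous step
    """
    if on_sand:
        return -1.0   # largest penalty: hitting sand or a border
    if distance < last_distance:
        return 0.1    # small reward: the car got closer to the goal
    return -0.2       # smaller penalty: moving away (or, by assumption, not closer)

print(driving_reward(False, distance=5.0, last_distance=6.0))  # 0.1
```

Keeping the step rewards small relative to the sand penalty means that staying on the road matters more than any single step toward the goal.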
348 00:16:30.639 --> 00:16:34.159 So the DQN learns to process those sensor inputs, predict 349 00:16:34.240 --> 00:16:37.720 Q values for steering actions, and choose actions that avoid 350 00:16:37.759 --> 00:16:39.039 penalties and get rewards. 351 00:16:39.120 --> 00:16:41.919 Exactly. It learns through trial and error in the simulation 352 00:16:42.320 --> 00:16:45.279 to stay on the road, avoid sand, and navigate towards 353 00:16:45.360 --> 00:16:48.279 the target, eventually making round trips. You use something like 354 00:16:48.320 --> 00:16:50.679 PyTorch or TensorFlow to build the ANN part. 355 00:16:50.919 --> 00:16:51.360 Very cool. 356 00:16:51.519 --> 00:16:54.639 And the server cooling example, that sounded really practical. 357 00:16:54.279 --> 00:16:58.879 Extremely practical, applying DQN to minimize energy costs in a server environment. 358 00:16:59.080 --> 00:17:02.240 So the input states there are things affecting temperature, right? 359 00:17:02.440 --> 00:17:06.400 Server's current temperature, maybe number of active users, data transmission rate, 360 00:17:06.440 --> 00:17:10.079 factors influencing heat load. And the actions? Discrete choices. The 361 00:17:10.119 --> 00:17:13.039 source example used things like cool by one point five 362 00:17:13.079 --> 00:17:16.640 degrees C, cool by point five degrees C, do nothing, heat 363 00:17:16.680 --> 00:17:18.960 by one point five degrees C, heat by one point 364 00:17:19.000 --> 00:17:21.720 five degrees C. Five distinct actions. 365 00:17:21.319 --> 00:17:23.960 And the reward is the energy saved compared to a 366 00:17:24.000 --> 00:17:26.839 standard, maybe thermostat-based, system? Exactly. 367 00:17:26.880 --> 00:17:30.400 The goal is purely energy efficiency. 
The DQN trains by 368 00:17:30.400 --> 00:17:33.759 simulating temperature changes based on inputs and its actions, learning 369 00:17:33.799 --> 00:17:36.559 which sequence of cooling and heating actions keeps the temperature within 370 00:17:36.599 --> 00:17:39.559 an acceptable range while using the least energy possible. 371 00:17:39.720 --> 00:17:41.440 And it uses a standard ANN setup? 372 00:17:41.599 --> 00:17:44.680 Yeah, the source mentions a typical structure: maybe two hidden 373 00:17:44.720 --> 00:17:48.200 layers, mean squared error (MSE) loss to measure how far 374 00:17:48.240 --> 00:17:51.319 off its temperature prediction is, the Adam optimizer to adjust 375 00:17:51.359 --> 00:17:54.759 weights, and epsilon-greedy exploration during training. 376 00:17:54.559 --> 00:17:56.480 And the result was significant? Quite significant. 377 00:17:56.519 --> 00:17:59.200 Yeah, the source cited achieving up to eighty-seven percent 378 00:17:59.279 --> 00:18:01.839 energy savings compared to the baseline. That's a huge real 379 00:18:01.839 --> 00:18:03.200 world win from applying RL. 380 00:18:03.519 --> 00:18:07.440 Wow. Okay, so DQN handles complex states, but what about 381 00:18:07.519 --> 00:18:10.440 visual states like images or game screens? 382 00:18:10.759 --> 00:18:14.880 Ah, now we get to deep convolutional Q-learning, DCQN. 383 00:18:15.319 --> 00:18:18.680 This brings in convolutional neural networks, CNNs. 384 00:18:18.759 --> 00:18:21.400 CNNs. They're specialized for images, right? Exactly. 385 00:18:21.640 --> 00:18:25.240 They're designed to process grid-like data, and images are 386 00:18:25.240 --> 00:18:26.039 the prime example. 387 00:18:26.119 --> 00:18:28.079 How do they work, sort of intuitively? 388 00:18:28.319 --> 00:18:31.160 Well, the first key step is convolution. You slide small 389 00:18:31.160 --> 00:18:34.240 filters across the image. 
Each filter is designed to detect 390 00:18:34.279 --> 00:18:38.039 a specific simple feature like a vertical edge, horizontal edge, 391 00:18:38.119 --> 00:18:41.359 a corner, maybe a certain texture or color patch. This 392 00:18:41.440 --> 00:18:42.680 produces feature maps. 393 00:18:42.720 --> 00:18:44.519 Okay, finding basic patterns. Then 394 00:18:44.440 --> 00:18:47.640 comes pooling, often max pooling. It takes small regions of 395 00:18:47.680 --> 00:18:50.119 the feature map and just keeps the maximum value. It's 396 00:18:50.119 --> 00:18:52.839 a way to downsample, reduce the data size, while keeping 397 00:18:52.839 --> 00:18:55.880 the most salient features detected. It makes the network more 398 00:18:55.960 --> 00:18:57.799 robust to small shifts or distortions. 399 00:18:57.920 --> 00:19:01.640 So extract features, then condense them. Right. After several layers 400 00:19:01.640 --> 00:19:05.559 of convolution and pooling, you've extracted increasingly complex features. 401 00:19:06.039 --> 00:19:08.480