WEBVTT 1 00:00:00.120 --> 00:00:03.879 Welcome to the Deep Dive. We're your shortcut to getting informed, 2 00:00:04.639 --> 00:00:07.440 mixing facts with just enough fun to keep things interesting. 3 00:00:07.919 --> 00:00:11.160 Today we're jumping into reinforcement learning, RL, and its, well, 4 00:00:11.359 --> 00:00:16.160 supercharged version, deep reinforcement learning, or DRL. Our main guide 5 00:00:16.160 --> 00:00:18.839 for this is the book Deep Reinforcement Learning with Python, 6 00:00:18.879 --> 00:00:24.120 second edition. It's by Sudharsan Ravichandiran and Valerii Babushkin. And 7 00:00:24.120 --> 00:00:27.039 our mission really is to unpack how AI agents learn 8 00:00:27.480 --> 00:00:31.800 through interacting and getting rewards. We'll explore some applications you 9 00:00:31.879 --> 00:00:34.439 might not expect, and figure out what makes these learning 10 00:00:34.439 --> 00:00:35.759 methods just so powerful. 11 00:00:35.960 --> 00:00:38.000 Yeah, and the really key thing about RL, I think, 12 00:00:38.079 --> 00:00:41.159 is that the agents learn by actually doing stuff. It's 13 00:00:41.159 --> 00:00:43.200 not like other machine learning where you just feed it 14 00:00:43.240 --> 00:00:46.000 a load of data someone already collected. Here, the agent 15 00:00:46.079 --> 00:00:48.799 is sort of dropped into its world. It has to 16 00:00:48.840 --> 00:00:52.679 try things out, make choices, and learn directly from the 17 00:00:52.679 --> 00:00:55.960 consequences, from the feedback it gets. It's intelligence that really 18 00:00:55.960 --> 00:00:57.759 grows through experience. 19 00:00:57.320 --> 00:01:01.079 Right, learning by doing. That trial-and-error aspect, absolutely fundamental. 20 00:01:01.159 --> 00:01:04.159 So basically, we've got an agent, that's the learner, in 21 00:01:04.200 --> 00:01:06.359 an environment, its world, and in that world there are 22 00:01:06.359 --> 00:01:10.920 different situations or states.
The agent takes actions and then 23 00:01:11.000 --> 00:01:14.560 gets rewards or, I guess, sometimes penalties, right? That's the 24 00:01:14.560 --> 00:01:15.719 feedback. Exactly. 25 00:01:16.319 --> 00:01:19.159 The classic analogy, and it really works, is teaching a 26 00:01:19.239 --> 00:01:21.840 dog to catch a ball. You don't sit it down 27 00:01:21.879 --> 00:01:23.480 with physics diagrams, 28 00:01:23.040 --> 00:01:25.000 do you? Ah, no, definitely not. 29 00:01:25.239 --> 00:01:27.359 You just throw the ball. If it catches it, great, 30 00:01:27.719 --> 00:01:31.760 here's a cookie, positive reward. If it misses, well, no cookie, 31 00:01:32.040 --> 00:01:35.000 maybe just a neutral outcome. And over lots and lots 32 00:01:35.000 --> 00:01:37.840 of throws, the dog starts figuring out, okay, these actions 33 00:01:37.879 --> 00:01:41.799 in this kind of situation, they lead to cookies. It's 34 00:01:41.799 --> 00:01:45.359 building a strategy, really, to maximize those treats. That continuous 35 00:01:45.439 --> 00:01:48.840 loop, action, feedback, reward, in this dynamic world, that's the 36 00:01:48.879 --> 00:01:50.200 absolute heart of RL. 37 00:01:50.359 --> 00:01:53.799 That cookie example makes the feedback loop really clear. But 38 00:01:54.439 --> 00:01:57.280 how is this learn-by-doing thing really different from 39 00:01:57.280 --> 00:02:00.120 other kinds of machine learning people might know, like, say, 40 00:02:00.159 --> 00:02:00.920 supervised learning? 41 00:02:01.200 --> 00:02:04.280 Yeah, that's a really important difference. So with supervised learning, 42 00:02:04.319 --> 00:02:07.280 you're essentially showing the model examples that are already labeled.
43 00:02:07.680 --> 00:02:10.240 Think of teaching it to spot cats by showing it thousands 44 00:02:10.280 --> 00:02:12.800 of pictures, and each one clearly says cat or not 45 00:02:12.919 --> 00:02:17.400 cat. Got it, labeled data. Right. And unsupervised learning, that's 46 00:02:17.400 --> 00:02:20.840 about finding hidden patterns in data that isn't labeled, like 47 00:02:21.159 --> 00:02:25.719 grouping similar photos together automatically. But in RL, the agent is 48 00:02:25.800 --> 00:02:28.560 kind of on its own. It learns by directly messing 49 00:02:28.599 --> 00:02:31.560 with the environment, changing its behavior based on the feedback 50 00:02:31.560 --> 00:02:34.199 it gets in real time. There's no pre-cooked data 51 00:02:34.240 --> 00:02:36.919 set of right answers. It has to discover what works 52 00:02:37.159 --> 00:02:39.280 through this constant back and forth with its world. 53 00:02:39.400 --> 00:02:42.520 So it's much more dynamic, this interaction. Okay, and to 54 00:02:42.639 --> 00:02:45.840 handle that interaction you need some structure, right? Environments need 55 00:02:45.879 --> 00:02:48.919 to be framed somehow for decision making. What's the usual 56 00:02:48.919 --> 00:02:49.439 way to do that? 57 00:02:50.000 --> 00:02:53.159 The standard way, the framework most people use, is called 58 00:02:53.159 --> 00:02:57.960 a Markov decision process, or MDP. Basically, it's a mathematical 59 00:02:58.000 --> 00:03:01.719 way to model these sequential decision problems. It formally defines 60 00:03:01.759 --> 00:03:06.400 all those bits we mentioned: the states, the possible actions, crucially, 61 00:03:06.719 --> 00:03:09.639 the probabilities of moving between states when you take an action, 62 00:03:09.919 --> 00:03:13.560 and the rewards you get for those transitions.
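As a rough illustration of the pieces just listed (states, actions, transition probabilities, rewards), here is a minimal Python sketch of a toy MDP. The two states, the action names, and every number are invented for illustration, loosely echoing the dog-and-ball analogy; none of it comes from the book.

```python
import random

# Hypothetical toy MDP; state names, action names, and numbers are assumptions.
STATES = ["ball_in_air", "ball_caught"]
ACTIONS = ["jump", "wait"]

# TRANSITIONS[state][action] is a list of (next_state, probability, reward).
TRANSITIONS = {
    "ball_in_air": {
        "jump": [("ball_caught", 0.7, 1.0), ("ball_in_air", 0.3, 0.0)],
        "wait": [("ball_in_air", 1.0, 0.0)],
    },
    "ball_caught": {
        "jump": [("ball_caught", 1.0, 0.0)],
        "wait": [("ball_caught", 1.0, 0.0)],
    },
}


def step(state, action, rng=random):
    """Sample (next_state, reward) from the current state and action only."""
    outcomes = TRANSITIONS[state][action]
    weights = [prob for _, prob, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

Note that `step` never looks at how the agent reached `state`, which is one way to picture the Markov property that comes up next.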
The beauty 63 00:03:13.560 --> 00:03:15.960 of an MDP is it lets us map out almost 64 00:03:15.960 --> 00:03:19.120 any kind of decision-making sequence mathematically, which is how 65 00:03:19.199 --> 00:03:21.039 machines can start planning strategically. 66 00:03:21.120 --> 00:03:23.599 Okay, that makes sense. It provides the rules of the game, 67 00:03:23.680 --> 00:03:23.919 so to 68 00:03:23.879 --> 00:03:25.919 speak. Exactly, and a key part of it is the 69 00:03:25.960 --> 00:03:29.479 Markov property, which sounds complicated, but it just means the 70 00:03:29.520 --> 00:03:32.680 agent's decision only depends on its current state. It doesn't 71 00:03:32.719 --> 00:03:35.199 need to remember the entire history of how it got there. 72 00:03:35.240 --> 00:03:36.360 Just, where am I now? 73 00:03:36.479 --> 00:03:39.840 Right, the present is all that matters for the next decision. Okay. 74 00:03:39.919 --> 00:03:43.479 So we have the environment structured, but the agent needs 75 00:03:43.520 --> 00:03:46.280 its own plan, its strategy. How does it figure out 76 00:03:46.560 --> 00:03:48.479 what to actually do in each state? 77 00:03:48.759 --> 00:03:51.360 That's its policy. You can think of the policy as 78 00:03:51.360 --> 00:03:54.639 the agent's rule book or its behavior. It tells the 79 00:03:54.680 --> 00:03:57.240 agent which action to take when it finds itself in 80 00:03:57.280 --> 00:04:01.439 a particular state. Policies can be deterministic, meaning in this state, 81 00:04:02.000 --> 00:04:05.800 always do this specific action. Simple. Okay. Or they can 82 00:04:05.800 --> 00:04:09.879 be stochastic. This means a state maps to a probability 83 00:04:09.919 --> 00:04:13.400 distribution over actions, so maybe it's seventy percent likely to 84 00:04:13.400 --> 00:04:16.160 go left, thirty percent likely to go right. This allows 85 00:04:16.199 --> 00:04:19.920 for a bit more randomness, which can be good for exploring.
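To make the deterministic versus stochastic distinction concrete, here is a small Python sketch. The state names ("corridor", "doorway") are hypothetical, and the 70/30 split is just the example figure from the conversation.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"corridor": "left", "doorway": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {"corridor": {"left": 0.7, "right": 0.3}}


def sample_action(policy, state, rng=random):
    """Draw one action according to the state's action probabilities."""
    distribution = policy[state]
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Sample many times to see the empirical split approach 70/30.
rng = random.Random(0)
counts = {"left": 0, "right": 0}
for _ in range(10_000):
    counts[sample_action(stochastic_policy, "corridor", rng)] += 1
```

The deterministic policy needs no sampling at all: `deterministic_policy["corridor"]` always returns the same action.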
Ultimately, 86 00:04:19.959 --> 00:04:22.600 the agent is trying to learn the best possible policy, 87 00:04:22.720 --> 00:04:25.120 the one that gets it the most cumulative reward over 88 00:04:25.319 --> 00:04:28.720 many runs or episodes. An episode is just one full 89 00:04:28.759 --> 00:04:30.920 sequence of interaction from start to finish. 90 00:04:31.120 --> 00:04:33.800 And how does it know if a policy is actually good? 91 00:04:34.079 --> 00:04:35.839 How does it judge its own strategy? 92 00:04:36.279 --> 00:04:39.079 Ah, well, that's where value functions and Q functions come in. 93 00:04:39.079 --> 00:04:42.319 They're ways to evaluate policies. A value function basically asks, 94 00:04:42.360 --> 00:04:45.040 starting from this state, how much total reward can I 95 00:04:45.079 --> 00:04:47.279 expect to get if I follow my current policy? It's 96 00:04:47.279 --> 00:04:49.279 about the long-term value of being in a state. 97 00:04:49.360 --> 00:04:51.360 Okay, the value of a situation. 98 00:04:51.240 --> 00:04:54.199 Precisely, and a Q function goes one step deeper. It 99 00:04:54.319 --> 00:04:57.240 asks, how good is it to take this specific action 100 00:04:57.439 --> 00:04:59.560 when I'm in this specific state and then follow my 101 00:04:59.600 --> 00:05:03.920 policy afterwards? Okay. The agent uses these calculations to figure 102 00:05:03.920 --> 00:05:07.199 out which actions in which states are likely to lead 103 00:05:07.199 --> 00:05:08.959 to the best outcomes down the line. 104 00:05:09.040 --> 00:05:11.720 Okay, this all sounds really solid. But like you said, 105 00:05:11.759 --> 00:05:14.879 these agents can be in massive environments. Thinking about video 106 00:05:14.920 --> 00:05:19.040 games or robotics, the number of possible states and actions 107 00:05:19.160 --> 00:05:22.199 must be huge, right? Trying to calculate a Q value 108 00:05:22.240 --> 00:05:26.600 for every single possibility?
Yeah, that sounds computationally, well, impossible. 109 00:05:26.800 --> 00:05:28.639 How did RL get past that? 110 00:05:28.639 --> 00:05:31.120 That is exactly the challenge that led to deep reinforcement 111 00:05:31.199 --> 00:05:34.240 learning, or DRL. You're spot on. In complex worlds, you 112 00:05:34.279 --> 00:05:36.639 just can't compute and store all those Q values, there 113 00:05:36.639 --> 00:05:39.680 are too many. So DRL brings in deep neural networks. 114 00:05:39.959 --> 00:05:43.240 Instead of calculating exact values, these networks learn to approximate 115 00:05:43.240 --> 00:05:46.360 the Q function, or sometimes even the policy itself. This 116 00:05:46.439 --> 00:05:49.279 is the breakthrough that lets RL handle really high-dimensional 117 00:05:49.360 --> 00:05:52.000 inputs like raw pixels from a game screen, which was 118 00:05:52.079 --> 00:05:53.000 unthinkable before. 119 00:05:53.240 --> 00:05:57.680 Okay, so neural networks approximate the answers instead of calculating 120 00:05:57.680 --> 00:06:02.800 everything perfectly. How do these networks learn? What's the mechanism there? 121 00:06:03.399 --> 00:06:05.879 Well, at a high level, think of a basic artificial 122 00:06:05.920 --> 00:06:09.759 neural network, an ANN. You've got layers of interconnected nodes: an 123 00:06:09.879 --> 00:06:12.759 input layer, one or more hidden layers, and an output layer. 124 00:06:12.800 --> 00:06:15.759 Data flows through, gets transformed at each layer, often using 125 00:06:15.759 --> 00:06:18.920 something called an activation function like ReLU. That's one that 126 00:06:19.000 --> 00:06:21.360 just outputs zero if the input is negative, and the 127 00:06:21.360 --> 00:06:24.639 input itself if it's positive. It adds nonlinearity, which is 128 00:06:24.680 --> 00:06:28.360 crucial. Now, learning happens by adjusting the connections, the weights 129 00:06:28.360 --> 00:06:30.800 and biases within the network.
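The ReLU rule described above (zero for negative inputs, the input itself otherwise) fits in a few lines of plain Python. The `dense_relu` helper and the example weights and biases are hypothetical, chosen only to show one layer transforming its input; a real network would use a library such as NumPy or a deep learning framework.

```python
def relu(x):
    """Return 0.0 for negative inputs and the input itself otherwise."""
    return x if x > 0 else 0.0


def dense_relu(inputs, weights, biases):
    """One layer: weighted sum plus bias for each node, then ReLU."""
    return [
        relu(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Two inputs feeding a layer of two nodes (illustrative numbers only).
hidden = dense_relu(
    [1.0, -2.0],
    weights=[[0.5, 0.5], [1.0, -1.0]],
    biases=[0.0, 0.5],
)
```

Here the first node's weighted sum is negative, so ReLU clips it to zero, while the second node's positive sum passes through unchanged.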
The network makes a prediction, 130 00:06:30.959 --> 00:06:34.079 say a Q value. We compare that prediction to a 131 00:06:34.120 --> 00:06:34.959 target value, what 132 00:06:34.920 --> 00:06:35.480 it should have been. 133 00:06:35.560 --> 00:06:39.279 The difference is the loss. Then, using calculus tricks like 134 00:06:39.319 --> 00:06:42.399 gradient descent and backpropagation, the network figures out how to 135 00:06:42.439 --> 00:06:44.839 tweak its weights and biases to reduce that loss, to 136 00:06:44.839 --> 00:06:48.639 make better predictions next time. It's iterative refinement. 137 00:06:48.240 --> 00:06:51.560 Got it. So the network learns by correcting its own 138 00:06:51.600 --> 00:06:55.160 mistakes over and over, and this ability to approximate with 139 00:06:55.199 --> 00:06:58.360 networks led to some big moments, right? I remember hearing 140 00:06:58.399 --> 00:06:59.920 a lot about deep Q-networks. 141 00:07:00.839 --> 00:07:05.480 Oh, absolutely. DQN, developed by Google's DeepMind, was a landmark. 142 00:07:05.639 --> 00:07:07.839 It was famously used to play a whole suite of 143 00:07:07.879 --> 00:07:11.240 Atari games, often reaching human-level skill just from looking 144 00:07:11.279 --> 00:07:14.199 at the screen pixels. That really grabbed people's attention. 145 00:07:14.360 --> 00:07:16.720 Yeah, that was huge. What made it work so well? 146 00:07:16.439 --> 00:07:19.120 It had a couple of really clever innovations to deal 147 00:07:19.160 --> 00:07:21.839 with the instability you get when you combine deep learning 148 00:07:21.920 --> 00:07:26.240 with RL's constantly changing data. First was experience replay.
Instead 149 00:07:26.279 --> 00:07:28.199 of learning only from the very last thing that happened, 150 00:07:28.199 --> 00:07:32.240 the agent stores lots of past experiences: state, action, reward, 151 00:07:32.720 --> 00:07:36.560 next state, in a memory buffer. A diary? Exactly, and then 152 00:07:36.560 --> 00:07:40.480 for training it samples random batches from this memory. This 153 00:07:40.560 --> 00:07:43.360 breaks up the correlations in sequential data, you know, one 154 00:07:43.399 --> 00:07:45.319 step often looks a lot like the next, which makes 155 00:07:45.319 --> 00:07:48.240 the learning much more stable and efficient. It stops the 156 00:07:48.279 --> 00:07:52.120 network forgetting old, useful stuff. The second big idea 157 00:07:52.199 --> 00:07:55.439 was the target network. They used a separate, slightly older 158 00:07:55.439 --> 00:07:58.000 copy of the main network just to calculate the target 159 00:07:58.040 --> 00:08:01.000 Q values. This target network is held fixed for 160 00:08:01.040 --> 00:08:02.199 a while, then updated. 161 00:08:02.360 --> 00:08:05.399 Ah, so the target isn't constantly shifting while the main 162 00:08:05.399 --> 00:08:06.199 network is trying to 163 00:08:06.240 --> 00:08:10.040 learn. Precisely, it provides a stable goalpost, preventing the learning 164 00:08:10.079 --> 00:08:13.759 process from chasing its own tail and diverging. Those two tricks, 165 00:08:13.959 --> 00:08:17.439 experience replay and target networks, were key to DQN's success. 166 00:08:17.560 --> 00:08:21.000 Okay, so DQN is about learning the values of actions. 167 00:08:21.079 --> 00:08:23.920 It's value-based. Are there other ways to go about it? 168 00:08:23.959 --> 00:08:25.040 Maybe more direct ways? 169 00:08:25.360 --> 00:08:28.600 Yes, there are. Another major family of methods is policy 170 00:08:28.680 --> 00:08:32.200 gradient methods.
Instead of figuring out Q values first and 171 00:08:32.200 --> 00:08:35.200 then working out the policy from those, these methods try 172 00:08:35.200 --> 00:08:38.720 to learn the optimal policy directly. They adjust the policy 173 00:08:38.759 --> 00:08:42.399 parameters to favor actions that lead to higher rewards. This 174 00:08:42.519 --> 00:08:45.519 is often really useful in environments where the actions are continuous. 175 00:08:45.639 --> 00:08:49.320 Continuous, like controlling the throttle or steering angle of a car. 176 00:08:49.360 --> 00:08:51.879 It's not just left, right, up, down. It's a whole 177 00:08:51.960 --> 00:08:55.799 range of values. Policy gradient methods handle that naturally, often 178 00:08:55.919 --> 00:08:59.000 using those stochastic policies we mentioned earlier to explore. 179 00:08:59.080 --> 00:09:03.279 Okay, policy learning makes sense for certain problems. Is there 180 00:09:04.360 --> 00:09:06.320 a way to get the best of both worlds, to combine 181 00:09:06.399 --> 00:09:08.480 value learning and policy learning? 182 00:09:08.639 --> 00:09:12.200 There is, and that brings us to actor-critic methods. 183 00:09:12.399 --> 00:09:14.919 These are really popular now and form the basis for 184 00:09:15.080 --> 00:09:17.679 many state-of-the-art algorithms. They essentially have two 185 00:09:17.679 --> 00:09:20.519 components working together. You have the actor, which is a 186 00:09:20.559 --> 00:09:23.360 policy network, it decides which action to take, and you 187 00:09:23.440 --> 00:09:26.720 have the critic, which is a value network. It evaluates 188 00:09:26.759 --> 00:09:28.919 the action taken by the actor, saying, hey, that was 189 00:09:28.960 --> 00:09:31.720 a good move, or, hmm, maybe not so great. The 190 00:09:31.759 --> 00:09:35.720 critic's feedback then helps the actor update its policy more effectively. 191 00:09:36.240 --> 00:09:38.879 It's a nice synergy.
The actor acts, the critic critiques, 192 00:09:38.919 --> 00:09:42.639 and they both improve together. Algorithms like DDPG, TD3, 193 00:09:43.000 --> 00:09:46.039 SAC, they're all built on this actor-critic 194 00:09:45.759 --> 00:09:49.000 idea. Actor and critic working together. I like that. Okay, 195 00:09:49.039 --> 00:09:51.080 before we look ahead, let's maybe touch on a classic 196 00:09:51.120 --> 00:09:53.960 problem that really highlights a core RL challenge, the multi-armed 197 00:09:54.000 --> 00:09:54.559 bandit. 198 00:09:54.759 --> 00:09:58.120 Ah, yes, the multi-armed bandit, or MAB. It's 199 00:09:58.120 --> 00:10:00.960 simpler than full RL but captures a fundamental trade-off. 200 00:10:01.360 --> 00:10:04.240 Imagine you're in front of several slot machines, or bandits, 201 00:10:04.320 --> 00:10:06.600 each with a lever, an arm. You pull an arm, 202 00:10:06.679 --> 00:10:09.639 you get a payout, a reward. The catch is, each machine 203 00:10:09.639 --> 00:10:13.000 pays out differently, with probabilities you don't know beforehand. So 204 00:10:13.039 --> 00:10:15.639 the big question is, do you stick with the machine 205 00:10:15.639 --> 00:10:18.840 that seems best so far, that's exploitation, or do you 206 00:10:18.879 --> 00:10:21.200 try out other machines hoping to find an even better 207 00:10:21.240 --> 00:10:22.279 one, that's exploration? 208 00:10:22.679 --> 00:10:25.960 Right, the explore versus exploit dilemma. How do you balance that? 209 00:10:26.320 --> 00:10:29.000 There are various strategies, but a common simple one is 210 00:10:29.000 --> 00:10:32.600 called epsilon-greedy. Most of the time, say ninety percent, 211 00:10:32.679 --> 00:10:36.159 that's one minus epsilon, you exploit by pulling the arm 212 00:10:36.159 --> 00:10:38.360 of the machine that has given the best average reward 213 00:10:38.440 --> 00:10:42.039 so far, but with a small probability, epsilon, maybe ten percent.
214 00:10:42.519 --> 00:10:45.120 You explore by picking an arm completely at random, just 215 00:10:45.120 --> 00:10:47.519 to see what happens. It's a basic way to ensure 216 00:10:47.559 --> 00:10:49.799 you don't get stuck on a suboptimal choice forever. 217 00:10:50.200 --> 00:10:52.480 That's a neat, simple way to think about it. Does 218 00:10:52.559 --> 00:10:55.759 this MAB idea show up in the real world outside 219 00:10:55.759 --> 00:10:56.919 of casinos? Oh, 220 00:10:56.960 --> 00:11:00.320 absolutely. It's used all over the place, especially online. Think 221 00:11:00.320 --> 00:11:03.840 about websites running A/B tests for things like which advertisement 222 00:11:03.840 --> 00:11:06.919 banner gets more clicks. Instead of a fixed A/B test, 223 00:11:07.200 --> 00:11:10.279 a multi-armed bandit approach can start showing the better 224 00:11:10.320 --> 00:11:13.080 performing ad more often, even while the test is still running, 225 00:11:13.399 --> 00:11:16.720 maximizing clicks faster. It also extends to what are called 226 00:11:16.720 --> 00:11:20.000 contextual bandits. This is where the best arm depends on 227 00:11:20.039 --> 00:11:23.320 the context, like the user. Netflix famously uses this for 228 00:11:23.360 --> 00:11:26.159 personalizing the thumbnail images for shows and movies based on 229 00:11:26.200 --> 00:11:29.639 your viewing history. The reward is you clicking play. It's 230 00:11:29.639 --> 00:11:32.720 also great for cold-start problems in recommendations, quickly learning 231 00:11:32.720 --> 00:11:33.879 what a new user might like. 232 00:11:34.000 --> 00:11:37.039 Wow. Okay, so that simple bandit idea is behind a 233 00:11:37.080 --> 00:11:40.159 lot of the personalization we see online. That's quite surprising. 234 00:11:40.480 --> 00:11:43.960 Now let's broaden out again. We've talked games, recommendations, but 235 00:11:44.039 --> 00:11:46.320 where else is RL making a real impact?
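For readers who want to see the epsilon-greedy rule from the bandit discussion in code, here is a minimal Python sketch. The payout probabilities, step count, and seed are made-up values, and the function name is hypothetical; it is one simple way to write the idea, not the book's implementation.

```python
import random


def epsilon_greedy_bandit(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Simulate pulling slot-machine arms with an epsilon-greedy rule."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    pulls = [0] * n_arms          # how often each arm was pulled
    avg_reward = [0.0] * n_arms   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm completely at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best average reward so far.
            arm = max(range(n_arms), key=lambda a: avg_reward[a])
        # The machine pays 1 with its hidden probability, else 0.
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running average.
        avg_reward[arm] += (reward - avg_reward[arm]) / pulls[arm]
    return pulls, avg_reward


# Three machines whose payout rates the agent does not know in advance.
pulls, avg_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

Over enough pulls, most of the traffic should shift to the best-paying arm, while the epsilon fraction of random pulls keeps the estimates for the other arms from going stale.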
You mentioned 236 00:11:46.320 --> 00:11:48.360 the source book covers quite a few areas. 237 00:11:48.519 --> 00:11:51.200 Yeah, the range is pretty impressive now. For instance, dynamic 238 00:11:51.240 --> 00:11:54.840 pricing: businesses use RL agents to adjust prices on the 239 00:11:54.840 --> 00:11:57.600 fly based on real-time supply and demand, trying to 240 00:11:57.600 --> 00:11:59.000 maximize revenue. 241 00:11:58.919 --> 00:12:01.200 Like airline tickets or ride-sharing apps. 242 00:12:01.200 --> 00:12:05.960 Exactly like that. Then there's manufacturing, training intelligent robots using 243 00:12:06.200 --> 00:12:09.039 RL to perform tasks like picking and placing objects with 244 00:12:09.120 --> 00:12:12.639 high precision. This can reduce costs and improve efficiency on 245 00:12:12.679 --> 00:12:16.240 assembly lines. Finance is another big one. RL is used 246 00:12:16.240 --> 00:12:21.519 for things like optimizing investment portfolios or developing algorithmic trading strategies. 247 00:12:21.720 --> 00:12:24.080 JP Morgan, for example, used it to improve how they 248 00:12:24.080 --> 00:12:26.960 execute large trades for clients, making them more efficient. 249 00:12:27.279 --> 00:12:31.480 Interesting, so finance, manufacturing, what else? 250 00:12:31.679 --> 00:12:36.039 Well, there's neural architecture search, or NAS. That's basically using 251 00:12:36.159 --> 00:12:39.120 RL to automatically design the structure of other neural networks 252 00:12:39.120 --> 00:12:42.039 to get the best performance on a task, automating AI 253 00:12:42.120 --> 00:12:45.639 design with AI. And even in natural language processing, NLP, 254 00:12:45.799 --> 00:12:50.120 people are using RL for tasks like improving abstractive text summarization, 255 00:12:50.679 --> 00:12:53.759 getting AI to write concise summaries, or making chatbots more 256 00:12:53.759 --> 00:12:55.080 engaging and goal-oriented.
257 00:12:55.200 --> 00:12:57.440 It really is branching out everywhere. The field sounds like 258 00:12:57.480 --> 00:12:59.960 it's moving incredibly fast. Yeah. What's kind of on the horizon? 259 00:13:00.279 --> 00:13:02.039 What are the really cutting-edge areas right now? 260 00:13:02.360 --> 00:13:05.759 It is moving fast. Some really exciting frontiers include things 261 00:13:05.799 --> 00:13:09.360 like meta-reinforcement learning. This is about developing agents that 262 00:13:09.399 --> 00:13:11.720 can learn how to learn, so they get better at 263 00:13:11.720 --> 00:13:15.480 picking up new tasks quickly because they've learned general learning strategies. 264 00:13:15.679 --> 00:13:16.519 Learning to learn. 265 00:13:16.679 --> 00:13:18.600 Okay, that sounds powerful. Yeah. 266 00:13:18.840 --> 00:13:22.519 Then there's hierarchical reinforcement learning, or HRL. The idea here 267 00:13:22.600 --> 00:13:26.399 is to break down really big, complex tasks into smaller, 268 00:13:26.519 --> 00:13:30.240 more manageable subgoals or subtasks. Think about a robot 269 00:13:30.320 --> 00:13:33.200 needing to make coffee. HRL might break that down into 270 00:13:33.360 --> 00:13:35.919 go to cupboard, get mug, go to machine, press button. 271 00:13:36.080 --> 00:13:39.639 It makes tackling long-horizon problems much more feasible. Like 272 00:13:39.679 --> 00:13:42.799 the taxi example in the outline: decompose driving into get 273 00:13:42.799 --> 00:13:44.600 passenger and drop off passenger. Makes sense. 274 00:13:44.600 --> 00:13:47.000 Break it down. Yeah, and you mentioned something earlier that 275 00:13:47.080 --> 00:13:50.679 sounded almost like AI imagination. Ah, 276 00:13:50.679 --> 00:13:55.720 right, imagination-augmented agents, or I2A. This is a 277 00:13:55.759 --> 00:13:59.919 fascinating direction.
These agents try to internally simulate or imagine 278 00:14:00.159 --> 00:14:03.159 the likely consequences of their actions before actually taking them 279 00:14:03.200 --> 00:14:05.200 in the real world. It's a bit like how a 280 00:14:05.320 --> 00:14:07.840 chess player thinks ahead: if I move here, what might 281 00:14:07.879 --> 00:14:11.559 happen next? They combine learning from actual experience (model-free) 282 00:14:11.960 --> 00:14:14.519 with learning an internal model of the world to plan 283 00:14:14.759 --> 00:14:19.360 (model-based). This allows for more sophisticated planning, especially in environments 284 00:14:19.360 --> 00:14:22.440 where mistakes are costly, like certain puzzle games such as Sokoban, 285 00:14:22.440 --> 00:14:23.639 which was mentioned in the source. 286 00:14:23.840 --> 00:14:26.120 Wow. From a dog learning to get a ball with 287 00:14:26.159 --> 00:14:28.559 cookies all the way to AI agents that can sort 288 00:14:28.600 --> 00:14:31.759 of imagine the future. That's quite a journey we've covered. 289 00:14:31.799 --> 00:14:34.480 We've really seen how this core idea of learning through 290 00:14:34.480 --> 00:14:38.559 trial and error, through rewards and interactions, scales up massively 291 00:14:38.600 --> 00:14:42.159 with deep learning. It lets AI tackle these incredibly complex 292 00:14:42.200 --> 00:14:45.679 problems in finance, robotics, online systems, you name it. It 293 00:14:45.720 --> 00:14:49.279 really emphasizes how RL lets agents learn directly, adapt on 294 00:14:49.320 --> 00:14:51.840 the fly. We're probably just scratching the surface of what's 295 00:14:51.639 --> 00:14:54.879 possible. Absolutely, and maybe a final thought for you to 296 00:14:54.879 --> 00:14:58.639 consider is just how that simple principle of learning 297 00:14:58.720 --> 00:15:02.679 from feedback, which seems intuitive with the dog analogy, scales up.
298 00:15:03.000 --> 00:15:06.279 It scales to let machines master complex games, manage huge 299 00:15:06.279 --> 00:15:09.600 financial portfolios, personalize your online world, and even start to 300 00:15:09.600 --> 00:15:12.879 build internal models to imagine outcomes. Where else could this 301 00:15:12.919 --> 00:15:16.039 fundamental principle of adaptive, reward-driven learning take us next? 302 00:15:16.240 --> 00:15:18.440 What new kinds of dynamic intelligence might emerge?