WEBVTT

1
00:00:00.120 --> 00:00:03.879
<v Speaker 1>Welcome to the deep dive. We're your shortcut to getting informed,

2
00:00:04.639 --> 00:00:07.440
<v Speaker 1>mixing facts with just enough fun to keep things interesting.

3
00:00:07.919 --> 00:00:11.160
<v Speaker 1>Today we're jumping into reinforcement learning, RL, and its, well,

4
00:00:11.359 --> 00:00:16.160
<v Speaker 1>supercharged version, deep reinforcement learning, or DRL. Our main guide

5
00:00:16.160 --> 00:00:18.839
<v Speaker 1>for this is the book Deep Reinforcement Learning with Python,

6
00:00:18.879 --> 00:00:24.120
<v Speaker 1>Second Edition. It's by Sudharsan Ravichandiran and Valerii Babushkin. And

7
00:00:24.120 --> 00:00:27.039
<v Speaker 1>our mission really is to unpack how AI agents learn

8
00:00:27.480 --> 00:00:31.800
<v Speaker 1>through interacting and getting rewards. We'll explore some applications you

9
00:00:31.879 --> 00:00:34.439
<v Speaker 1>might not expect, and figure out what makes these learning

10
00:00:34.439 --> 00:00:35.759
<v Speaker 1>methods just so powerful.

11
00:00:35.960 --> 00:00:38.000
<v Speaker 2>Yeah, and the really key thing about RL, I think,

12
00:00:38.079 --> 00:00:41.159
<v Speaker 2>is that the agents learn by actually doing stuff. It's

13
00:00:41.159 --> 00:00:43.200
<v Speaker 2>not like other machine learning where you just feed it

14
00:00:43.240 --> 00:00:46.000
<v Speaker 2>a load of data someone already collected. Here, the agent

15
00:00:46.079 --> 00:00:48.799
<v Speaker 2>is sort of dropped into its world. It has to

16
00:00:48.840 --> 00:00:52.679
<v Speaker 2>try things out, make choices, and learn directly from the

17
00:00:52.679 --> 00:00:55.960
<v Speaker 2>consequences from the feedback it gets. It's intelligence that really

18
00:00:55.960 --> 00:00:57.759
<v Speaker 2>grows through experience.

19
00:00:57.320 --> 00:01:01.079
<v Speaker 1>Right, learning by doing. That trial-and-error aspect is absolutely fundamental.

20
00:01:01.159 --> 00:01:04.159
<v Speaker 1>So basically, we've got an agent that's the learner in

21
00:01:04.200 --> 00:01:06.359
<v Speaker 1>an environment, its world, and in that world there are

22
00:01:06.359 --> 00:01:10.920
<v Speaker 1>different situations or states. The agent takes actions and then

23
00:01:11.000 --> 00:01:14.560
<v Speaker 1>gets rewards or, I guess, sometimes penalties, right? That's the

24
00:01:14.560 --> 00:01:15.719
<v Speaker 1>feedback. Exactly.

25
00:01:16.319 --> 00:01:19.159
<v Speaker 2>The classic analogy, and it really works, is teaching a

26
00:01:19.239 --> 00:01:21.840
<v Speaker 2>dog to catch a ball. You don't sit it down

27
00:01:21.879 --> 00:01:23.480
<v Speaker 2>with physics diagrams.

28
00:01:23.040 --> 00:01:25.000
<v Speaker 1>Do you? Ah, no, definitely not.

29
00:01:25.239 --> 00:01:27.359
<v Speaker 2>You just throw the ball. If it catches it, great,

30
00:01:27.719 --> 00:01:31.760
<v Speaker 2>here's a cookie, positive reward. If it misses, well, no cookie,

31
00:01:32.040 --> 00:01:35.000
<v Speaker 2>maybe just a neutral outcome. And over lots and lots

32
00:01:35.000 --> 00:01:37.840
<v Speaker 2>of throws, the dog starts figuring out, okay, these actions

33
00:01:37.879 --> 00:01:41.799
<v Speaker 2>in this kind of situation they lead to cookies. It's

34
00:01:41.799 --> 00:01:45.359
<v Speaker 2>building a strategy really to maximize those treats. That continuous

35
00:01:45.439 --> 00:01:48.840
<v Speaker 2>loop, action, feedback, reward, in this dynamic world, that's the

36
00:01:48.879 --> 00:01:50.200
<v Speaker 2>absolute heart of RL.

37
00:01:50.359 --> 00:01:53.799
<v Speaker 1>That cookie example makes the feedback loop really clear. But

38
00:01:54.439 --> 00:01:57.280
<v Speaker 1>how is this learn-by-doing thing really different from

39
00:01:57.280 --> 00:02:00.120
<v Speaker 1>other kinds of machine learning people might know, like, say,

40
00:02:00.159 --> 00:02:00.920
<v Speaker 1>supervised learning?

41
00:02:01.200 --> 00:02:04.280
<v Speaker 2>Yeah, that's a really important difference. So with supervised learning,

42
00:02:04.319 --> 00:02:07.280
<v Speaker 2>you're essentially showing the model examples that are already labeled.

43
00:02:07.680 --> 00:02:10.240
<v Speaker 2>Think of teaching it to spot cats by showing it thousands

44
00:02:10.280 --> 00:02:12.800
<v Speaker 2>of pictures, and each one clearly says cat or not

45
00:02:12.919 --> 00:02:17.400
<v Speaker 2>cat. Got it, labeled data. Right. And unsupervised learning, that's

46
00:02:17.400 --> 00:02:20.840
<v Speaker 2>about finding hidden patterns in data that isn't labeled, like

47
00:02:21.159 --> 00:02:25.719
<v Speaker 2>grouping similar photos together automatically. But in RL, the agent is

48
00:02:25.800 --> 00:02:28.560
<v Speaker 2>kind of on its own. It learns by directly messing

49
00:02:28.599 --> 00:02:31.560
<v Speaker 2>with the environment, changing its behavior based on the feedback

50
00:02:31.560 --> 00:02:34.199
<v Speaker 2>it gets in real time. There's no pre-cooked data

51
00:02:34.240 --> 00:02:36.919
<v Speaker 2>set of right answers. It has to discover what works

52
00:02:37.159 --> 00:02:39.280
<v Speaker 2>through this constant back and forth with its world.

53
00:02:39.400 --> 00:02:42.520
<v Speaker 1>So it's much more dynamic, this interaction. Okay, and to

54
00:02:42.639 --> 00:02:45.840
<v Speaker 1>handle that interaction you need some structure, right? Environments need

55
00:02:45.879 --> 00:02:48.919
<v Speaker 1>to be framed somehow for decision making. What's the usual

56
00:02:48.919 --> 00:02:49.439
<v Speaker 1>way to do that?

57
00:02:50.000 --> 00:02:53.159
<v Speaker 2>The standard way, the framework most people use is called

58
00:02:53.159 --> 00:02:57.960
<v Speaker 2>a Markov decision process or MDP. Basically, it's a mathematical

59
00:02:58.000 --> 00:03:01.719
<v Speaker 2>way to model these sequential decision problems. It formally defines

60
00:03:01.759 --> 00:03:06.400
<v Speaker 2>all those bits we mentioned: the states, the possible actions, and, crucially,

61
00:03:06.719 --> 00:03:09.639
<v Speaker 2>the probabilities of moving between states when you take an action,

62
00:03:09.919 --> 00:03:13.560
<v Speaker 2>and the rewards you get for those transitions. The beauty

63
00:03:13.560 --> 00:03:15.960
<v Speaker 2>of an MDP is it lets us map out almost

64
00:03:15.960 --> 00:03:19.120
<v Speaker 2>any kind of decision making sequence mathematically, which is how

65
00:03:19.199 --> 00:03:21.039
<v Speaker 2>machines can start planning strategically.

66
00:03:21.120 --> 00:03:23.599
<v Speaker 1>Okay, that makes sense. It provides the rules of the game,

67
00:03:23.680 --> 00:03:23.919
<v Speaker 1>so to

68
00:03:23.879 --> 00:03:25.919
<v Speaker 2>speak. Exactly, and a key part of it is the

69
00:03:25.960 --> 00:03:29.479
<v Speaker 2>Markov property, which sounds complicated, but it just means the

70
00:03:29.520 --> 00:03:32.680
<v Speaker 2>agent's decision only depends on its current state. It doesn't

71
00:03:32.719 --> 00:03:35.199
<v Speaker 2>need to remember the entire history of how it got there.

72
00:03:35.240 --> 00:03:36.360
<v Speaker 2>Just where am I now?

73
00:03:36.479 --> 00:03:39.840
<v Speaker 1>Right. The present is all that matters for the next decision. Okay.

74
00:03:39.919 --> 00:03:43.479
<v Speaker 1>So we have the environment structured, but the agent needs

75
00:03:43.520 --> 00:03:46.280
<v Speaker 1>its own plan, its strategy. How does it figure out

76
00:03:46.560 --> 00:03:48.479
<v Speaker 1>what to actually do in each state?

77
00:03:48.759 --> 00:03:51.360
<v Speaker 2>That's its policy. You can think of the policy as

78
00:03:51.360 --> 00:03:54.639
<v Speaker 2>the agent's rule book or its behavior. It tells the

79
00:03:54.680 --> 00:03:57.240
<v Speaker 2>agent which action to take when it finds itself in

80
00:03:57.280 --> 00:04:01.439
<v Speaker 2>a particular state. Policies can be deterministic, meaning in this state,

81
00:04:02.000 --> 00:04:05.800
<v Speaker 2>always do this specific action. Simple. Okay. Or they can

82
00:04:05.800 --> 00:04:09.879
<v Speaker 2>be stochastic. This means a state maps to a probability

83
00:04:09.919 --> 00:04:13.400
<v Speaker 2>distribution over actions, so maybe it's seventy percent likely to

84
00:04:13.400 --> 00:04:16.160
<v Speaker 2>go left, thirty percent likely to go right. This allows

85
00:04:16.199 --> 00:04:19.920
<v Speaker 2>for a bit more randomness, which can be good for exploring. Ultimately,

86
00:04:19.959 --> 00:04:22.600
<v Speaker 2>the agent is trying to learn the best possible policy,

87
00:04:22.720 --> 00:04:25.120
<v Speaker 2>the one that gets it the most cumulative reward over

88
00:04:25.319 --> 00:04:28.720
<v Speaker 2>many runs or episodes. An episode is just one full

89
00:04:28.759 --> 00:04:30.920
<v Speaker 2>sequence of interaction from start to finish.
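
NOTE
Editorial sketch, not from the episode: one way to picture the stochastic
policy described above, reusing the seventy/thirty split from the dialogue.
The names POLICY, sample_action and greedy_action are hypothetical, and the
single-state setup is an assumption made only to keep the example short.
```python
import random
#
# Assumed toy policy for a single state: 70% left, 30% right.
POLICY = {"left": 0.7, "right": 0.3}
#
def sample_action(policy, rng):
    """Stochastic policy: draw one action according to its probability."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
#
def greedy_action(policy):
    """Deterministic counterpart: always return the most probable action."""
    return max(policy, key=policy.get)
#
if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_action(POLICY, rng) for _ in range(5)])
    print(greedy_action(POLICY))
```
Sampling keeps some exploration alive, while the greedy version always
repeats the same choice for a given state.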

90
00:04:31.120 --> 00:04:33.800
<v Speaker 1>And how does it know if a policy is actually good?

91
00:04:34.079 --> 00:04:35.839
<v Speaker 1>How does it judge its own strategy?

92
00:04:36.279 --> 00:04:39.079
<v Speaker 2>Ah, well, that's where value functions and Q functions come in.

93
00:04:39.079 --> 00:04:42.319
<v Speaker 2>They're ways to evaluate policies. A value function basically asks,

94
00:04:42.360 --> 00:04:45.040
<v Speaker 2>starting from this state, how much total reward can I

95
00:04:45.079 --> 00:04:47.279
<v Speaker 2>expect to get if I follow my current policy? It's

96
00:04:47.279 --> 00:04:49.279
<v Speaker 2>about the long-term value of being in a state.

97
00:04:49.360 --> 00:04:51.360
<v Speaker 1>Okay, the value of a situation.

98
00:04:51.240 --> 00:04:54.199
<v Speaker 2>Precisely. And a Q function goes one step deeper. It

99
00:04:54.319 --> 00:04:57.240
<v Speaker 2>asks how good is it to take this specific action

100
00:04:57.439 --> 00:04:59.560
<v Speaker 2>when I'm in this specific state and then follow my

101
00:04:59.600 --> 00:05:03.920
<v Speaker 2>policy afterwards? Okay. The agent uses these calculations to figure

102
00:05:03.920 --> 00:05:07.199
<v Speaker 2>out which actions in which states are likely to lead

103
00:05:07.199 --> 00:05:08.959
<v Speaker 2>to the best outcomes down the line.

104
00:05:09.040 --> 00:05:11.720
<v Speaker 1>Okay, this all sounds really solid, But like you said,

105
00:05:11.759 --> 00:05:14.879
<v Speaker 1>these agents can be in massive environments. Think about video

106
00:05:14.920 --> 00:05:19.040
<v Speaker 1>games or robotics. The number of possible states and actions

107
00:05:19.160 --> 00:05:22.199
<v Speaker 1>must be huge, right? Trying to calculate a Q value

108
00:05:22.240 --> 00:05:26.600
<v Speaker 1>for every single possibility? Yeah, that sounds computationally, well, impossible.

109
00:05:26.800 --> 00:05:28.639
<v Speaker 1>How did RL get past that?

110
00:05:28.639 --> 00:05:31.120
<v Speaker 2>That is exactly the challenge that led to deep reinforcement

111
00:05:31.199 --> 00:05:34.240
<v Speaker 2>learning or DRL. You're spot on. In complex worlds, you

112
00:05:34.279 --> 00:05:36.639
<v Speaker 2>just can't compute and store all those Q values. There

113
00:05:36.639 --> 00:05:39.680
<v Speaker 2>are too many. So DRL brings in deep neural networks.

114
00:05:39.959 --> 00:05:43.240
<v Speaker 2>Instead of calculating exact values, these networks learn to approximate

115
00:05:43.240 --> 00:05:46.360
<v Speaker 2>the Q function, or sometimes even the policy itself. This

116
00:05:46.439 --> 00:05:49.279
<v Speaker 2>is the breakthrough that lets RL handle really high dimensional

117
00:05:49.360 --> 00:05:52.000
<v Speaker 2>inputs like raw pixels from a game screen, which was

118
00:05:52.079 --> 00:05:53.000
<v Speaker 2>unthinkable before.

119
00:05:53.240 --> 00:05:57.680
<v Speaker 1>Okay, so neural networks approximate the answers instead of calculating

120
00:05:57.680 --> 00:06:02.800
<v Speaker 1>everything perfectly. How do these networks learn? What's the mechanism there?

121
00:06:03.399 --> 00:06:05.879
<v Speaker 2>Well, at a high level, think of a basic artificial

122
00:06:05.920 --> 00:06:09.759
<v Speaker 2>neural network, or ANN. You've got layers of interconnected nodes: an

123
00:06:09.879 --> 00:06:12.759
<v Speaker 2>input layer, one or more hidden layers, and an output layer.

124
00:06:12.800 --> 00:06:15.759
<v Speaker 2>Data flows through and gets transformed at each layer, often using

125
00:06:15.759 --> 00:06:18.920
<v Speaker 2>something called an activation function, like ReLU. That's one that

126
00:06:19.000 --> 00:06:21.360
<v Speaker 2>just outputs zero if the input is negative, and the

127
00:06:21.360 --> 00:06:24.639
<v Speaker 2>input itself if it's positive. It adds nonlinearity, which is

128
00:06:24.680 --> 00:06:28.360
<v Speaker 2>crucial. Now, learning happens by adjusting the connections, the weights

129
00:06:28.360 --> 00:06:30.800
<v Speaker 2>and biases within the network. The network makes a prediction,

130
00:06:30.959 --> 00:06:34.079
<v Speaker 2>say a Q value. We compare that prediction to a

131
00:06:34.120 --> 00:06:34.959
<v Speaker 2>target value, what

132
00:06:34.920 --> 00:06:35.480
<v Speaker 1>it should have been.

133
00:06:35.560 --> 00:06:39.279
<v Speaker 2>The difference is the loss. Then, using calculus tricks like

134
00:06:39.319 --> 00:06:42.399
<v Speaker 2>gradient descent and backpropagation, the network figures out how to

135
00:06:42.439 --> 00:06:44.839
<v Speaker 2>tweak its weights and biases to reduce that loss to

136
00:06:44.839 --> 00:06:48.639
<v Speaker 2>make better predictions next time. It's iterative refinement.
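
NOTE
Editorial sketch, not from the episode: the ReLU rule and the
predict, compare, adjust loop described above, shrunk to a single weight.
The toy model (prediction = weight * x), the squared-error loss, the
learning rate and the function names are all assumptions for illustration;
a real network repeats this update across many weights via backpropagation.
```python
def relu(x):
    """ReLU: zero for negative inputs, the input itself otherwise."""
    return x if x > 0 else 0.0
#
def gradient_step(weight, x, target, lr=0.1):
    """One gradient-descent update for the toy model prediction = weight * x."""
    prediction = weight * x
    loss = (prediction - target) ** 2        # squared error against the target
    grad = 2 * (prediction - target) * x     # d(loss)/d(weight) by the chain rule
    return weight - lr * grad, loss
#
if __name__ == "__main__":
    w = 0.0
    for _ in range(50):
        w, loss = gradient_step(w, x=1.0, target=2.0)
    print(round(w, 3), relu(-2.0), relu(3.0))
```
Each call nudges the weight against the gradient, so the loss shrinks
step by step, which is the iterative refinement mentioned in the dialogue.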

137
00:06:48.240 --> 00:06:51.560
<v Speaker 1>Got it. So the network learns by correcting its own

138
00:06:51.600 --> 00:06:55.160
<v Speaker 1>mistakes over and over, and this ability to approximate with

139
00:06:55.199 --> 00:06:58.360
<v Speaker 1>networks led to some big moments, right? I remember hearing

140
00:06:58.399 --> 00:06:59.920
<v Speaker 1>a lot about deep Q-networks.

141
00:07:00.839 --> 00:07:05.480
<v Speaker 2>Oh, absolutely. DQN, developed by Google's DeepMind, was a landmark.

142
00:07:05.639 --> 00:07:07.839
<v Speaker 2>It was famously used to play a whole suite of

143
00:07:07.879 --> 00:07:11.240
<v Speaker 2>Atari games, often reaching human level skill just from looking

144
00:07:11.279 --> 00:07:14.199
<v Speaker 2>at the screen pixels. That really grabbed people's attention.

145
00:07:14.360 --> 00:07:16.720
<v Speaker 1>Yeah, that was huge. What made it work so well?

146
00:07:16.439 --> 00:07:19.120
<v Speaker 2>It had a couple of really clever innovations to deal

147
00:07:19.160 --> 00:07:21.839
<v Speaker 2>with the instability you get when you combine deep learning

148
00:07:21.920 --> 00:07:26.240
<v Speaker 2>with RL's constantly changing data. First was experience replay. Instead

149
00:07:26.279 --> 00:07:28.199
<v Speaker 2>of learning only from the very last thing that happened,

150
00:07:28.199 --> 00:07:32.240
<v Speaker 2>the agent stores lots of past experiences (state, action, reward,

151
00:07:32.720 --> 00:07:36.560
<v Speaker 2>next state) in a memory buffer. A diary. Exactly, and then

152
00:07:36.560 --> 00:07:40.480
<v Speaker 2>for training it samples random batches from this memory. This

153
00:07:40.560 --> 00:07:43.360
<v Speaker 2>breaks up the correlations in sequential data. You know, one

154
00:07:43.399 --> 00:07:45.319
<v Speaker 2>step often looks a lot like the next, which makes

155
00:07:45.319 --> 00:07:48.240
<v Speaker 2>the learning much more stable and efficient. It stops the

156
00:07:48.279 --> 00:07:52.120
<v Speaker 2>network forgetting old, useful stuff. The second big idea

157
00:07:52.199 --> 00:07:55.439
<v Speaker 2>was the target network. They used a separate, slightly older

158
00:07:55.439 --> 00:07:58.000
<v Speaker 2>copy of the main network just to calculate the target

159
00:07:58.040 --> 00:08:01.000
<v Speaker 2>Q values. This target network is held fixed for

160
00:08:01.040 --> 00:08:02.199
<v Speaker 2>a while, then updated.
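
NOTE
Editorial sketch, not from the episode: a minimal version of the
experience replay memory described above. The class name ReplayBuffer,
its method names, the capacity and the seed parameter are hypothetical;
this is not DeepMind's implementation, only the store-then-sample idea.
```python
import random
from collections import deque
#
class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)   # oldest entries fall off first
        self.rng = random.Random(seed)
    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))
    def sample(self, batch_size):
        # A random batch breaks up the correlation between consecutive steps.
        return self.rng.sample(list(self.memory), batch_size)
    def __len__(self):
        return len(self.memory)
#
if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=5, seed=0)
    for t in range(10):
        buffer.store(state=t, action=0, reward=1.0, next_state=t + 1)
    print(len(buffer), buffer.sample(3))
```
A target network would sit alongside this as a second copy of the
Q-network whose weights are refreshed only every so often.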

161
00:08:02.360 --> 00:08:05.399
<v Speaker 1>Ah, so the target isn't constantly shifting while the main

162
00:08:05.399 --> 00:08:06.199
<v Speaker 1>network is trying to

163
00:08:06.240 --> 00:08:10.040
<v Speaker 2>learn. Precisely, it provides a stable goalpost, preventing the learning

164
00:08:10.079 --> 00:08:13.759
<v Speaker 2>process from chasing its own tail and diverging. Those two tricks,

165
00:08:13.959 --> 00:08:17.439
<v Speaker 2>experience replay and target networks, were key to DQN's success.

166
00:08:17.560 --> 00:08:21.000
<v Speaker 1>Okay, so DQN is about learning the values of actions.

167
00:08:21.079 --> 00:08:23.920
<v Speaker 1>It's value based. Are there other ways to go about it?

168
00:08:23.959 --> 00:08:25.040
<v Speaker 1>Maybe more direct ways?

169
00:08:25.360 --> 00:08:28.600
<v Speaker 2>Yes, there are. Another major family of methods is policy

170
00:08:28.680 --> 00:08:32.200
<v Speaker 2>gradient methods. Instead of figuring out Q values first and

171
00:08:32.200 --> 00:08:35.200
<v Speaker 2>then working out the policy from those, these methods try

172
00:08:35.200 --> 00:08:38.720
<v Speaker 2>to learn the optimal policy directly. They adjust the policy

173
00:08:38.759 --> 00:08:42.399
<v Speaker 2>parameters to favor actions that lead to higher rewards. This

174
00:08:42.519 --> 00:08:45.519
<v Speaker 2>is often really useful in environments where the actions are continuous.

175
00:08:45.639 --> 00:08:49.320
<v Speaker 2>Continuous, like controlling the throttle or steering angle of a car.

176
00:08:49.360 --> 00:08:51.879
<v Speaker 2>It's not just left, right, up, down, It's a whole

177
00:08:51.960 --> 00:08:55.799
<v Speaker 2>range of values. Policy gradient methods handle that naturally, often

178
00:08:55.919 --> 00:08:59.000
<v Speaker 2>using those stochastic policies we mentioned earlier to explore.

179
00:08:59.080 --> 00:09:03.279
<v Speaker 1>Okay, policy learning makes sense for certain problems. Is there

180
00:09:04.360 --> 00:09:06.320
<v Speaker 1>a way to get the best of both worlds, to combine

181
00:09:06.399 --> 00:09:08.480
<v Speaker 1>value learning and policy learning?

182
00:09:08.639 --> 00:09:12.200
<v Speaker 2>There is, and that brings us to actor-critic methods.

183
00:09:12.399 --> 00:09:14.919
<v Speaker 2>These are really popular now and form the basis for

184
00:09:15.080 --> 00:09:17.679
<v Speaker 2>many state of the art algorithms. They essentially have two

185
00:09:17.679 --> 00:09:20.519
<v Speaker 2>components working together. You have the actor, which is a

186
00:09:20.559 --> 00:09:23.360
<v Speaker 2>policy network. It decides which action to take. And you

187
00:09:23.440 --> 00:09:26.720
<v Speaker 2>have the critic, which is a value network. It evaluates

188
00:09:26.759 --> 00:09:28.919
<v Speaker 2>the action taken by the actor, saying hey, that was

189
00:09:28.960 --> 00:09:31.720
<v Speaker 2>a good move or hmm, maybe not so great. The

190
00:09:31.759 --> 00:09:35.720
<v Speaker 2>critic's feedback then helps the actor update its policy more effectively.

191
00:09:36.240 --> 00:09:38.879
<v Speaker 2>It's a nice synergy. The actor acts, the critic critiques,

192
00:09:38.919 --> 00:09:42.639
<v Speaker 2>and they both improve together. Algorithms like DDPG, TD3,

193
00:09:43.000 --> 00:09:46.039
<v Speaker 2>SAC, they're all built on this actor-critic

194
00:09:45.759 --> 00:09:49.000
<v Speaker 1>idea. Actor and critic working together. I like that. Okay,

195
00:09:49.039 --> 00:09:51.080
<v Speaker 1>before we look ahead, let's maybe touch on a classic

196
00:09:51.120 --> 00:09:53.960
<v Speaker 1>problem that really highlights a core RL challenge, the multi-

197
00:09:54.000 --> 00:09:54.559
<v Speaker 1>armed bandit.

198
00:09:54.759 --> 00:09:58.120
<v Speaker 2>Ah, yes, the multi-armed bandit, or MAB. It's

199
00:09:58.120 --> 00:10:00.960
<v Speaker 2>simpler than full RL but captures a fundamental trade-off.

200
00:10:01.360 --> 00:10:04.240
<v Speaker 2>Imagine you're in front of several slot machines or bandits,

201
00:10:04.320 --> 00:10:06.600
<v Speaker 2>each with a lever, an arm. You pull an arm,

202
00:10:06.679 --> 00:10:09.639
<v Speaker 2>you get a payout, a reward. The catch is, each machine

203
00:10:09.639 --> 00:10:13.000
<v Speaker 2>pays out differently, with probabilities you don't know beforehand. So

204
00:10:13.039 --> 00:10:15.639
<v Speaker 2>the big question is do you stick with the machine

205
00:10:15.639 --> 00:10:18.840
<v Speaker 2>that seems best so far, that's exploitation, or do you

206
00:10:18.879 --> 00:10:21.200
<v Speaker 2>try out other machines hoping to find an even better

207
00:10:21.240 --> 00:10:22.279
<v Speaker 2>one, that's exploration?

208
00:10:22.679 --> 00:10:25.960
<v Speaker 1>Right, the explore versus exploit dilemma. How do you balance that?

209
00:10:26.320 --> 00:10:29.000
<v Speaker 2>There are various strategies, but a common simple one is

210
00:10:29.000 --> 00:10:32.600
<v Speaker 2>called epsilon-greedy. Most of the time, say ninety percent,

211
00:10:32.679 --> 00:10:36.159
<v Speaker 2>that's one minus epsilon, you exploit by pulling the arm

212
00:10:36.159 --> 00:10:38.360
<v Speaker 2>of the machine that has given the best average reward

213
00:10:38.440 --> 00:10:42.039
<v Speaker 2>so far. But with a small probability, epsilon, maybe ten percent,

214
00:10:42.519 --> 00:10:45.120
<v Speaker 2>you explore by picking an arm completely at random, just

215
00:10:45.120 --> 00:10:47.519
<v Speaker 2>to see what happens. It's a basic way to ensure

216
00:10:47.559 --> 00:10:49.799
<v Speaker 2>you don't get stuck on a suboptimal choice forever.
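
NOTE
Editorial sketch, not from the episode: the epsilon-greedy rule described
above, run on a simulated bandit. The function names, the payout
probabilities, the 0/1 rewards and the fixed seed are assumptions chosen
for illustration; epsilon = 0.1 matches the ten percent in the dialogue.
```python
import random
#
def epsilon_greedy(avg_rewards, epsilon, rng):
    """With probability epsilon explore a random arm, otherwise exploit the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(avg_rewards))
    return max(range(len(avg_rewards)), key=lambda i: avg_rewards[i])
#
def run_bandit(payout_probs, pulls, epsilon=0.1, seed=0):
    """Pull arms for a number of rounds and track each arm's average reward."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    avgs = [0.0] * len(payout_probs)
    for _ in range(pulls):
        arm = epsilon_greedy(avgs, epsilon, rng)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        avgs[arm] += (reward - avgs[arm]) / counts[arm]   # running mean
    return counts, avgs
#
if __name__ == "__main__":
    counts, avgs = run_bandit([0.2, 0.8], pulls=5000)
    print(counts, [round(a, 2) for a in avgs])
```
With epsilon set to zero the agent never explores, so it can lock onto
whichever arm happened to pay out first.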

217
00:10:50.200 --> 00:10:52.480
<v Speaker 1>That's a neat simple way to think about it. Does

218
00:10:52.559 --> 00:10:55.759
<v Speaker 1>this MAB idea show up in the real world outside

219
00:10:55.759 --> 00:10:56.919
<v Speaker 1>of casinos? Oh,

220
00:10:56.960 --> 00:11:00.320
<v Speaker 2>absolutely. It's used all over the place, especially online. Think

221
00:11:00.320 --> 00:11:03.840
<v Speaker 2>about websites running A/B tests for things like which advertisement

222
00:11:03.840 --> 00:11:06.919
<v Speaker 2>banner gets more clicks. Instead of a fixed A/B test,

223
00:11:07.200 --> 00:11:10.279
<v Speaker 2>a multi armed bandit approach can start showing the better

224
00:11:10.320 --> 00:11:13.080
<v Speaker 2>performing ad more often even while the test is still running,

225
00:11:13.399 --> 00:11:16.720
<v Speaker 2>maximizing clicks faster. It also extends to what are called

226
00:11:16.720 --> 00:11:20.000
<v Speaker 2>contextual bandits. This is where the best arm depends on

227
00:11:20.039 --> 00:11:23.320
<v Speaker 2>the context, like the user. Netflix famously uses this for

228
00:11:23.360 --> 00:11:26.159
<v Speaker 2>personalizing the thumbnail images for shows and movies based on

229
00:11:26.200 --> 00:11:29.639
<v Speaker 2>your viewing history. The reward is you clicking play. It's

230
00:11:29.639 --> 00:11:32.720
<v Speaker 2>also great for cold-start problems in recommendations, quickly learning

231
00:11:32.720 --> 00:11:33.879
<v Speaker 2>what a new user might like.

232
00:11:34.000 --> 00:11:37.039
<v Speaker 1>Wow. Okay, so that simple bandit idea is behind a

233
00:11:37.080 --> 00:11:40.159
<v Speaker 1>lot of the personalization we see online. That's quite surprising.

234
00:11:40.480 --> 00:11:43.960
<v Speaker 1>Now let's broaden out again. We've talked games, recommendations. But

235
00:11:44.039 --> 00:11:46.320
<v Speaker 1>where else is RL making a real impact? You mentioned

236
00:11:46.320 --> 00:11:48.360
<v Speaker 1>the source book covers quite a few areas.

237
00:11:48.519 --> 00:11:51.200
<v Speaker 2>Yeah, the range is pretty impressive now. For instance, dynamic

238
00:11:51.240 --> 00:11:54.840
<v Speaker 2>pricing: businesses use RL agents to adjust prices on the

239
00:11:54.840 --> 00:11:57.600
<v Speaker 2>fly based on real time supply and demand, trying to

240
00:11:57.600 --> 00:11:59.000
<v Speaker 2>maximize revenue.

241
00:11:58.919 --> 00:12:01.200
<v Speaker 1>Like airline tickets or ride-sharing apps.

242
00:12:01.200 --> 00:12:05.960
<v Speaker 2>Exactly like that. Then there's manufacturing: training intelligent robots using

243
00:12:06.200 --> 00:12:09.039
<v Speaker 2>RL to perform tasks like picking and placing objects with

244
00:12:09.120 --> 00:12:12.639
<v Speaker 2>high precision. This can reduce costs and improve efficiency on

245
00:12:12.679 --> 00:12:16.240
<v Speaker 2>assembly lines. Finance is another big one. RL is used

246
00:12:16.240 --> 00:12:21.519
<v Speaker 2>for things like optimizing investment portfolios or developing algorithmic trading strategies.

247
00:12:21.720 --> 00:12:24.080
<v Speaker 2>JP Morgan, for example, used it to improve how they

248
00:12:24.080 --> 00:12:26.960
<v Speaker 2>execute large trades for clients, making them more efficient.

249
00:12:27.279 --> 00:12:31.480
<v Speaker 1>Interesting, so finance, manufacturing, what else?

250
00:12:31.679 --> 00:12:36.039
<v Speaker 2>Well, there's neural architecture search, or NAS. That's basically using

251
00:12:36.159 --> 00:12:39.120
<v Speaker 2>RL to automatically design the structure of other neural networks

252
00:12:39.120 --> 00:12:42.039
<v Speaker 2>to get the best performance on a task, automating AI

253
00:12:42.120 --> 00:12:45.639
<v Speaker 2>design with AI. And even in natural language processing, NLP,

254
00:12:45.799 --> 00:12:50.120
<v Speaker 2>people are using RL for tasks like improving abstractive text summarization,

255
00:12:50.679 --> 00:12:53.759
<v Speaker 2>getting AI to write concise summaries, or making chatbots more

256
00:12:53.759 --> 00:12:55.080
<v Speaker 2>engaging and goal-oriented.

257
00:12:55.200 --> 00:12:57.440
<v Speaker 1>It really is branching out everywhere. The field sounds like

258
00:12:57.480 --> 00:12:59.960
<v Speaker 1>it's moving incredibly fast. Yeah. What's kind of on the horizon?

259
00:13:00.279 --> 00:13:02.039
<v Speaker 1>What are the really cutting edge areas right now?

260
00:13:02.360 --> 00:13:05.759
<v Speaker 2>It is moving fast. Some really exciting frontiers include things

261
00:13:05.799 --> 00:13:09.360
<v Speaker 2>like meta reinforcement learning. This is about developing agents that

262
00:13:09.399 --> 00:13:11.720
<v Speaker 2>can learn how to learn, so they get better at

263
00:13:11.720 --> 00:13:15.480
<v Speaker 2>picking up new tasks quickly because they've learned general learning strategies,

264
00:13:15.679 --> 00:13:16.519
<v Speaker 2>learning to learn.

265
00:13:16.679 --> 00:13:18.600
<v Speaker 1>Okay, that sounds powerful. Yeah.

266
00:13:18.840 --> 00:13:22.519
<v Speaker 2>Then there's hierarchical reinforcement learning or HRL. The idea here

267
00:13:22.600 --> 00:13:26.399
<v Speaker 2>is to break down really big complex tasks into smaller,

268
00:13:26.519 --> 00:13:30.240
<v Speaker 2>more manageable sub goals or subtasks. Think about a robot

269
00:13:30.320 --> 00:13:33.200
<v Speaker 2>needing to make coffee. HRL might break that down into

270
00:13:33.360 --> 00:13:35.919
<v Speaker 2>go to cupboard, get mug, go to machine, press button.

271
00:13:36.080 --> 00:13:39.639
<v Speaker 2>It makes tackling long horizon problems much more feasible. Like

272
00:13:39.679 --> 00:13:42.799
<v Speaker 2>the taxi example in the outline, decomposing driving into get

273
00:13:42.799 --> 00:13:44.600
<v Speaker 2>passenger and drop off passenger. Makes sense.

274
00:13:44.600 --> 00:13:47.000
<v Speaker 1>Break it down. Yeah, and you mentioned something earlier that

275
00:13:47.080 --> 00:13:50.679
<v Speaker 1>sounded almost like AI imagination.

276
00:13:50.679 --> 00:13:55.720
<v Speaker 2>Ah, right, imagination-augmented agents, or I2A. This is a

277
00:13:55.759 --> 00:13:59.919
<v Speaker 2>fascinating direction. These agents try to internally simulate or imagine

278
00:14:00.159 --> 00:14:03.159
<v Speaker 2>the likely consequences of their actions before actually taking them

279
00:14:03.200 --> 00:14:05.200
<v Speaker 2>in the real world. It's a bit like how a

280
00:14:05.320 --> 00:14:07.840
<v Speaker 2>chess player thinks ahead: if I move here, what might

281
00:14:07.879 --> 00:14:11.559
<v Speaker 2>happen next? They combine learning from actual experience (model-free)

282
00:14:11.960 --> 00:14:14.519
<v Speaker 2>with learning an internal model of the world to plan

283
00:14:14.759 --> 00:14:19.360
<v Speaker 2>(model-based). This allows for more sophisticated planning, especially in environments

284
00:14:19.360 --> 00:14:22.440
<v Speaker 2>where mistakes are costly, like certain puzzle games such as Sokoban,

285
00:14:22.440 --> 00:14:23.639
<v Speaker 2>which was mentioned in the source.

286
00:14:23.840 --> 00:14:26.120
<v Speaker 1>Wow. From a dog learning to get a ball with

287
00:14:26.159 --> 00:14:28.559
<v Speaker 1>cookies all the way to AI agents that can sort

288
00:14:28.600 --> 00:14:31.759
<v Speaker 1>of imagine the future. That's quite a journey we've covered.

289
00:14:31.799 --> 00:14:34.480
<v Speaker 1>We've really seen how this core idea of learning through

290
00:14:34.480 --> 00:14:38.559
<v Speaker 1>trial and error, through rewards and interactions scales up massively

291
00:14:38.600 --> 00:14:42.159
<v Speaker 1>with deep learning. It lets AI tackle these incredibly complex

292
00:14:42.200 --> 00:14:45.679
<v Speaker 1>problems in finance, robotics, online systems, you name it. It

293
00:14:45.720 --> 00:14:49.279
<v Speaker 1>really emphasizes how RL lets agents learn directly and adapt on

294
00:14:49.320 --> 00:14:51.840
<v Speaker 1>the fly. We're probably just scratching the surface of what's

295
00:14:51.639 --> 00:14:54.879
<v Speaker 2>possible. Absolutely, and maybe a final thought for you to

296
00:14:54.879 --> 00:14:58.639
<v Speaker 2>consider is just how that simple principle of learning

297
00:14:58.720 --> 00:15:02.679
<v Speaker 2>from feedback, which seems intuitive with the dog analogy, scales up.

298
00:15:03.000 --> 00:15:06.279
<v Speaker 2>It scales to let machines master complex games, manage huge

299
00:15:06.279 --> 00:15:09.600
<v Speaker 2>financial portfolios, personalize your online world, and even start to

300
00:15:09.600 --> 00:15:12.879
<v Speaker 2>build internal models to imagine outcomes. Where else could this

301
00:15:12.919 --> 00:15:16.039
<v Speaker 2>fundamental principle of adaptive reward driven learning take us next?

302
00:15:16.240 --> 00:15:18.440
<v Speaker 2>What new kinds of dynamic intelligence might emerge?
