WEBVTT

1
00:00:00.160 --> 00:00:03.720
<v Speaker 1>Welcome to the deep dive. We take source materials, unpack

2
00:00:03.759 --> 00:00:06.400
<v Speaker 1>complex topics and basically give you the crucial insights and

3
00:00:06.440 --> 00:00:08.279
<v Speaker 1>maybe some surprising facts along the way.

4
00:00:08.400 --> 00:00:10.240
<v Speaker 2>Yeah, think of it as your shortcut to getting up

5
00:00:10.240 --> 00:00:11.279
<v Speaker 2>to speed. Exactly.

6
00:00:11.679 --> 00:00:14.320
<v Speaker 1>So today we're diving into some excerpts from a book

7
00:00:14.480 --> 00:00:19.559
<v Speaker 1>AI Crash Course. We're looking specifically at chapters covering reinforcement learning,

8
00:00:19.600 --> 00:00:21.600
<v Speaker 1>deep learning, and AI in general.

9
00:00:21.800 --> 00:00:25.160
<v Speaker 2>Right, and this source it positions itself as a kind

10
00:00:25.160 --> 00:00:27.760
<v Speaker 2>of all-in-one guide. It's built from online courses

11
00:00:27.760 --> 00:00:31.199
<v Speaker 2>that were apparently quite successful, okay, and it really stresses

12
00:00:31.839 --> 00:00:35.240
<v Speaker 2>getting the intuition first, then the math, and then you know,

13
00:00:35.320 --> 00:00:36.840
<v Speaker 2>actually coding things up.

14
00:00:36.679 --> 00:00:39.240
<v Speaker 1>Right, intuition first. So our mission here is to

15
00:00:39.320 --> 00:00:42.920
<v Speaker 1>pull out those core ideas, help build that intuitive feel,

16
00:00:42.960 --> 00:00:45.520
<v Speaker 1>and look at some of the, well, pretty exciting real-world

17
00:00:45.560 --> 00:00:48.560
<v Speaker 1>applications they talk about. The idea is to help you,

18
00:00:48.759 --> 00:00:53.079
<v Speaker 1>the listener, understand how these AI models actually work and importantly,

19
00:00:53.280 --> 00:00:54.280
<v Speaker 1>where they might be used.

20
00:00:54.320 --> 00:00:56.479
<v Speaker 2>And the book sets a big stage. It talks about

21
00:00:56.479 --> 00:01:04.359
<v Speaker 2>AI's potential impact across, well, almost everything: transport, education, security, jobs, entertainment.

22
00:01:03.920 --> 00:01:06.439
<v Speaker 1>Even the environment. So a lot of potential there.

23
00:01:06.519 --> 00:01:10.040
<v Speaker 2>Definitely. It frames these technologies as potentially transformative.

24
00:01:10.120 --> 00:01:13.519
<v Speaker 1>Okay, let's unpack this. Where does our source material suggest

25
00:01:13.560 --> 00:01:16.920
<v Speaker 1>we start when building an AI, especially in this reinforcement

26
00:01:16.959 --> 00:01:17.879
<v Speaker 1>learning space?

27
00:01:17.799 --> 00:01:20.439
<v Speaker 2>It starts with the absolute foundation. You have to define

28
00:01:20.439 --> 00:01:21.920
<v Speaker 2>the AI's environment.

29
00:01:22.079 --> 00:01:22.640
<v Speaker 1>The environment.

30
00:01:22.879 --> 00:01:25.760
<v Speaker 2>Yeah, that's the world, the context the AI operates in.

31
00:01:26.200 --> 00:01:28.840
<v Speaker 2>And it's got three really key parts. All right, what

32
00:01:28.879 --> 00:01:33.200
<v Speaker 2>are those? First up: states. States are basically the inputs

33
00:01:33.239 --> 00:01:36.159
<v Speaker 2>the AI gets, what it perceives.

34
00:01:36.000 --> 00:01:38.439
<v Speaker 1>Like sensor readings for a self-driving car, maybe?

35
00:01:38.319 --> 00:01:42.680
<v Speaker 2>Exactly: sensor readings, speed, location. Or for a simple

36
00:01:42.760 --> 00:01:45.280
<v Speaker 2>robot in a maze, the state might just be which

37
00:01:45.280 --> 00:01:47.640
<v Speaker 2>square it's currently in. It's the "where am I?" or

38
00:01:47.680 --> 00:01:48.879
<v Speaker 2>"what's happening?" info.

39
00:01:49.040 --> 00:01:52.040
<v Speaker 1>Okay, so state is what the AI knows about its situation.

40
00:01:52.519 --> 00:01:53.159
<v Speaker 1>What's next?

41
00:01:53.280 --> 00:01:55.920
<v Speaker 2>Next are the actions. These are the things the AI

42
00:01:56.000 --> 00:01:57.719
<v Speaker 2>can do, the choices it can make.

43
00:01:57.840 --> 00:02:00.640
<v Speaker 1>So for the car: turn left, accelerate, brake.

44
00:02:00.439 --> 00:02:04.280
<v Speaker 2>Yep, or for the maze robot: move north, south, east, west.

45
00:02:04.519 --> 00:02:07.480
<v Speaker 2>Those are its possible moves, its decisions. Makes sense.

46
00:02:07.719 --> 00:02:12.319
<v Speaker 1>State action, and the third piece must be important.

47
00:02:12.080 --> 00:02:16.800
<v Speaker 2>Critically important: rewards. This is the feedback, what the AI

48
00:02:17.000 --> 00:02:20.439
<v Speaker 2>gets after it takes an action in a certain state.

49
00:02:20.800 --> 00:02:22.319
<v Speaker 1>Ah, the feedback loop.

50
00:02:22.520 --> 00:02:26.000
<v Speaker 2>Precisely, it could be positive like reaching a goal, or

51
00:02:26.039 --> 00:02:30.639
<v Speaker 2>negative like hitting a wall. This reward signal is what

52
00:02:30.719 --> 00:02:33.520
<v Speaker 2>guides the AI. It tells it what's good and what's bad.

53
00:02:33.680 --> 00:02:35.800
<v Speaker 1>So the whole game for the AI is to figure

54
00:02:35.800 --> 00:02:38.840
<v Speaker 1>out how to act, which actions to take in which

55
00:02:38.919 --> 00:02:41.800
<v Speaker 1>states to get the most reward possible over time.

56
00:02:41.919 --> 00:02:45.199
<v Speaker 2>That's the core idea, maximize cumulative reward. It learns by

57
00:02:45.439 --> 00:02:49.199
<v Speaker 2>essentially trial and error driven by those rewards. And the

58
00:02:49.240 --> 00:02:53.599
<v Speaker 2>source makes this important distinction too, between training mode and inference mode.

59
00:02:53.199 --> 00:02:55.639
<v Speaker 1>Right. Training versus inference.

60
00:02:55.319 --> 00:02:59.240
<v Speaker 2>Training is where it's learning, interacting, getting rewards, updating its

61
00:02:59.280 --> 00:03:02.199
<v Speaker 2>understanding. Inference, well, that's showtime: the trained AI

62
00:03:02.599 --> 00:03:05.039
<v Speaker 2>uses what it learned to just do the task without

63
00:03:05.080 --> 00:03:05.719
<v Speaker 2>learning anymore.

64
00:03:05.800 --> 00:03:08.960
<v Speaker 1>Got it: learn first, then perform. So with that framework,

65
00:03:09.000 --> 00:03:12.639
<v Speaker 1>states, actions, rewards, what's, like, the first actual AI model

66
00:03:12.680 --> 00:03:13.439
<v Speaker 1>the book introduces?

67
00:03:13.560 --> 00:03:15.680
<v Speaker 2>It kicks off with a classic problem, actually: the multi-armed

68
00:03:15.719 --> 00:03:16.520
<v Speaker 2>bandit problem.

69
00:03:16.639 --> 00:03:18.680
<v Speaker 1>Ah, the slot machines. I remember this one.

70
00:03:18.599 --> 00:03:22.759
<v Speaker 2>Yeah, exactly. Multiple slot machines, bandits, in a casino.

71
00:03:23.520 --> 00:03:26.400
<v Speaker 2>Each pays out with a different probability, but you don't

72
00:03:26.439 --> 00:03:27.520
<v Speaker 2>know those probabilities.

73
00:03:27.680 --> 00:03:29.520
<v Speaker 1>So the question is, how do you play them to

74
00:03:29.599 --> 00:03:33.080
<v Speaker 1>maximize your winnings over time without knowing which machine is

75
00:03:33.120 --> 00:03:34.879
<v Speaker 1>actually best initially?

76
00:03:35.280 --> 00:03:38.560
<v Speaker 2>Exactly. And the AI approach discussed is Thompson sampling.

77
00:03:38.680 --> 00:03:40.919
<v Speaker 1>Thompson sampling. Okay, what's the intuition there? Sounds statistical.

78
00:03:41.000 --> 00:03:43.879
<v Speaker 2>It is, but the intuition is quite neat.

79
00:03:43.919 --> 00:03:46.039
<v Speaker 2>You could just keep playing the machine that's paid out

80
00:03:46.080 --> 00:03:46.479
<v Speaker 2>the most so far, right?

81
00:03:46.360 --> 00:03:48.039
<v Speaker 1>Yeah, seems logical.

82
00:03:48.199 --> 00:03:50.639
<v Speaker 2>Exploit the winner, but that winner might just be on

83
00:03:50.680 --> 00:03:51.520
<v Speaker 2>a lucky streak.

84
00:03:52.120 --> 00:03:55.120
<v Speaker 1>Thompson sampling is smarter. It keeps track of wins and

85
00:03:55.159 --> 00:03:58.400
<v Speaker 1>losses for each machine, okay, and uses that history to

86
00:03:58.479 --> 00:04:03.159
<v Speaker 1>maintain a probability distribution for each machine's likely success rate,

87
00:04:03.800 --> 00:04:05.960
<v Speaker 1>specifically a beta distribution.

88
00:04:06.199 --> 00:04:10.080
<v Speaker 2>A beta distribution, so more wins on a machine means

89
00:04:10.199 --> 00:04:13.599
<v Speaker 2>its distribution shifts towards predicting higher success?

90
00:04:13.680 --> 00:04:17.199
<v Speaker 1>Right. More wins, fewer losses, the distribution gets more confident that

91
00:04:17.240 --> 00:04:21.519
<v Speaker 1>the machine is good. Now here's the clever bit. Each round,

92
00:04:21.560 --> 00:04:23.560
<v Speaker 1>you don't just pick the machine with the highest average

93
00:04:23.560 --> 00:04:24.560
<v Speaker 1>win rate so far.

94
00:04:24.680 --> 00:04:28.000
<v Speaker 2>No. Instead, you take a random draw from each

95
00:04:28.040 --> 00:04:32.319
<v Speaker 2>machine's beta distribution, and you play the machine whose random

96
00:04:32.399 --> 00:04:34.439
<v Speaker 2>draw came out highest for that round.

97
00:04:34.759 --> 00:04:37.600
<v Speaker 1>Random draw? Why random? Seems like you'd want the most

98
00:04:37.639 --> 00:04:38.199
<v Speaker 1>likely winner.

99
00:04:38.519 --> 00:04:42.399
<v Speaker 2>That randomness is key. It builds in exploration. A machine

100
00:04:42.439 --> 00:04:46.399
<v Speaker 2>you haven't played much will have a wider, less certain distribution,

101
00:04:46.959 --> 00:04:50.319
<v Speaker 2>so its random draws might sometimes be high, prompting you

102
00:04:50.360 --> 00:04:50.959
<v Speaker 2>to try it out.

103
00:04:51.160 --> 00:04:54.199
<v Speaker 1>Ah, so it forces you to explore the less known options,

104
00:04:54.240 --> 00:04:58.079
<v Speaker 1>sometimes just in case they're actually better than the current favorite.

105
00:04:58.199 --> 00:05:02.959
<v Speaker 2>Exactly. It naturally balances exploring new things, exploration, with

106
00:05:03.079 --> 00:05:06.319
<v Speaker 2>sticking to what seems to work, exploitation, and it does

107
00:05:06.360 --> 00:05:09.240
<v Speaker 2>this just based on the observed wins and losses, without

108
00:05:09.279 --> 00:05:10.600
<v Speaker 2>needing the true payout rates.
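
NOTE
Editor's sketch, not taken from the book: a minimal Thompson sampling loop for
the slot-machine setup the speakers describe. The true payout rates, the number
of rounds, the seed, and the uniform Beta(1, 1) starting point (the "+ 1" terms)
are all assumptions chosen for illustration.
```python
import random
def thompson_sampling(true_rates, rounds, seed=0):
    """Play `rounds` pulls over Bernoulli arms; return per-arm win and loss counts."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(rounds):
        # One random draw per arm from Beta(wins + 1, losses + 1).
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        # Play the arm whose draw came out highest this round.
        arm = draws.index(max(draws))
        # The true rates are hidden from the learner; they only simulate the payout.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses
# Hypothetical payout probabilities for three machines.
wins, losses = thompson_sampling([0.05, 0.13, 0.09], rounds=2000)
plays = [w + l for w, l in zip(wins, losses)]
```
With enough rounds, most plays tend to go to the arm with the highest true rate,
while the less-played arms still get tried now and then because their wider
distributions occasionally produce a high draw.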

109
00:05:10.759 --> 00:05:14.879
<v Speaker 1>That's really clever. Balancing exploration and exploitation is a classic problem.

110
00:05:15.000 --> 00:05:16.920
<v Speaker 1>Does the book give a real world example?

111
00:05:17.079 --> 00:05:20.160
<v Speaker 2>Yes, a great one. Online advertising: which ad version gets the

112
00:05:20.199 --> 00:05:21.360
<v Speaker 2>most clicks or sign ups?

113
00:05:21.399 --> 00:05:24.040
<v Speaker 1>Okay, so each ad variation is like a slot machine arm.

114
00:05:23.800 --> 00:05:27.240
<v Speaker 2>Perfect analogy. You show different ads (actions), you

115
00:05:27.319 --> 00:05:30.959
<v Speaker 2>track clicks or conversions (rewards): one for click, zero for

116
00:05:31.040 --> 00:05:35.519
<v Speaker 2>no click. Thompson sampling figures out which ad performs best over time.

117
00:05:35.319 --> 00:05:40.040
<v Speaker 1>By showing ads somewhat randomly based on those beta distributions.

118
00:05:40.120 --> 00:05:41.199
<v Speaker 1>Learning as it goes.

119
00:05:41.079 --> 00:05:45.560
<v Speaker 2>Yep, it converges on the statistically best ad, adapting as

120
00:05:45.600 --> 00:05:48.680
<v Speaker 2>it gets more data. It maps directly from the casino problem.

121
00:05:48.959 --> 00:05:52.480
<v Speaker 1>Pretty neat. Very neat. Okay, so Thompson sampling helps pick

122
00:05:52.519 --> 00:05:55.800
<v Speaker 1>the best single option. But many problems involve a sequence

123
00:05:55.800 --> 00:05:56.839
<v Speaker 1>of actions to reach a goal.

124
00:05:56.759 --> 00:05:59.959
<v Speaker 2>Right, and that's where the source introduces Q-learning.

125
00:06:00.319 --> 00:06:04.720
<v Speaker 2>This is a really foundational reinforcement learning algorithm for sequential decisions.

126
00:06:04.959 --> 00:06:06.360
<v Speaker 1>Q learning. What's the Q stand for?

127
00:06:06.800 --> 00:06:10.120
<v Speaker 2>It stands for quality. Essentially, the core idea is the

128
00:06:10.199 --> 00:06:10.920
<v Speaker 2>Q value.

129
00:06:11.040 --> 00:06:12.639
<v Speaker 1>Okay, quality value. What does it represent?

130
00:06:12.720 --> 00:06:16.560
<v Speaker 2>A Q-value, written Q(s, a), is a number. It

131
00:06:16.600 --> 00:06:19.759
<v Speaker 2>represents the expected total future reward you'll get if you

132
00:06:19.800 --> 00:06:22.000
<v Speaker 2>take action a when you're in state s, and,

133
00:06:22.639 --> 00:06:24.560
<v Speaker 2>this is key, you act optimally after that.

134
00:06:24.720 --> 00:06:27.240
<v Speaker 1>Whoa, okay. So it's not just the immediate reward for

135
00:06:27.279 --> 00:06:30.160
<v Speaker 1>taking action a, it's that plus the best possible rewards

136
00:06:30.199 --> 00:06:31.040
<v Speaker 1>you could get from then on.

137
00:06:31.439 --> 00:06:34.879
<v Speaker 2>Exactly. It's the long term value of taking that specific

138
00:06:34.959 --> 00:06:38.720
<v Speaker 2>action in that specific state. The goal of Q learning

139
00:06:39.040 --> 00:06:42.279
<v Speaker 2>is to learn these Q values for all possible state

140
00:06:42.319 --> 00:06:43.160
<v Speaker 2>action pairs.

141
00:06:43.439 --> 00:06:45.079
<v Speaker 1>So if you know all the Q values, you just

142
00:06:45.120 --> 00:06:47.519
<v Speaker 1>pick the action with the highest Q value in your

143
00:06:47.519 --> 00:06:49.279
<v Speaker 1>current state, and that's the best move.

144
00:06:49.480 --> 00:06:51.879
<v Speaker 2>That's the idea for using it once it's learned. Yes,

145
00:06:52.480 --> 00:06:55.560
<v Speaker 2>but how does it learn those values? You use something

146
00:06:55.560 --> 00:06:57.319
<v Speaker 2>called a temporal difference or TD.

147
00:06:57.759 --> 00:06:59.959
<v Speaker 1>Temporal difference sounds like difference over time.

148
00:07:00.199 --> 00:07:03.319
<v Speaker 2>Kind of. Think of TD as measuring the surprise. It's

149
00:07:03.360 --> 00:07:06.439
<v Speaker 2>the difference between the AI's current estimate of Q(s, a)

150
00:07:07.079 --> 00:07:09.720
<v Speaker 2>and a better estimate it gets after actually taking action a,

151
00:07:10.079 --> 00:07:12.800
<v Speaker 2>getting a reward r, and seeing the next state s'.

152
00:07:12.959 --> 00:07:14.319
<v Speaker 1>How does it get that better estimate?

153
00:07:14.399 --> 00:07:17.519
<v Speaker 2>The better estimate is the immediate reward r plus the

154
00:07:17.560 --> 00:07:20.519
<v Speaker 2>maximum Q-value it could get from that next state s',

155
00:07:20.560 --> 00:07:22.920
<v Speaker 2>basically r plus max Q(s', a').

156
00:07:23.040 --> 00:07:26.639
<v Speaker 1>Okay, so TD is actual reward plus best future value

157
00:07:26.639 --> 00:07:29.399
<v Speaker 1>from next state minus my old estimate of current state

158
00:07:29.439 --> 00:07:30.079
<v Speaker 1>action value.

159
00:07:30.120 --> 00:07:32.399
<v Speaker 2>You've got it. A big positive TD means Wow, that

160
00:07:32.439 --> 00:07:34.800
<v Speaker 2>action was way better than I thought. A negative TD

161
00:07:34.920 --> 00:07:36.360
<v Speaker 2>means Oops, that was worse.

162
00:07:36.639 --> 00:07:39.560
<v Speaker 1>And this TD error is used to update the original

163
00:07:39.600 --> 00:07:40.879
<v Speaker 1>Q value estimate.

164
00:07:40.600 --> 00:07:43.959
<v Speaker 2>Precisely, using the Bellman equation, which is the mathematical rule

165
00:07:43.959 --> 00:07:46.360
<v Speaker 2>for this update. It uses the TD error and a

166
00:07:46.439 --> 00:07:49.920
<v Speaker 2>learning rate to nudge the Q value closer to that

167
00:07:49.920 --> 00:07:54.560
<v Speaker 2>better estimate. It links the immediate reward to the future potential.
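
NOTE
Editor's sketch of the update just described, under stated assumptions: the
two-state table, the learning rate alpha, and the discount factor gamma are
hypothetical values. The speakers only bring up the discount factor later, for
deep Q-learning; setting gamma to 1.0 gives the plain "r plus max Q" version.
```python
def td_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Best value available from the next state (0.0 if it has no actions).
    best_next = max(q[next_state].values()) if q[next_state] else 0.0
    # TD error: better estimate minus the current estimate.
    td_error = reward + gamma * best_next - q[state][action]
    # Nudge the old estimate toward the better one by a fraction alpha.
    q[state][action] += alpha * td_error
    return td_error
# Hypothetical table: q[state][action] holds the current estimate.
q = {"A": {"to_B": 0.0}, "B": {"to_A": 0.0, "stay": 10.0}}
td = td_update(q, "A", "to_B", reward=1.0, next_state="B")
```
Here the TD error is positive (the move turned out better than the initial
estimate of zero), so the stored value for ("A", "to_B") moves upward.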

168
00:07:54.319 --> 00:07:56.680
<v Speaker 1>So it learns iteratively. Can you walk through the training

169
00:07:56.680 --> 00:07:58.000
<v Speaker 1>process generally?

170
00:07:58.439 --> 00:08:01.480
<v Speaker 2>Sure. You start by initializing all Q-values, maybe to zero.

171
00:08:01.920 --> 00:08:04.920
<v Speaker 2>Then you run many episodes. In each episode, maybe start at

172
00:08:04.959 --> 00:08:08.399
<v Speaker 2>a random state, pick a random valid action, see what

173
00:08:08.480 --> 00:08:10.319
<v Speaker 2>reward you get and what state you land in.

174
00:08:10.399 --> 00:08:10.720
<v Speaker 1>Okay?

175
00:08:10.879 --> 00:08:13.439
<v Speaker 2>Then you calculate that TD error based on the reward and

176
00:08:13.480 --> 00:08:15.360
<v Speaker 2>the max q value of the next state, and you

177
00:08:15.480 --> 00:08:17.360
<v Speaker 2>update the Q value for the state action pair you

178
00:08:17.480 --> 00:08:19.639
<v Speaker 2>just experienced. Repeat, repeat, repeat.

179
00:08:19.480 --> 00:08:22.240
<v Speaker 1>Lots of exploration and updating.

180
00:08:22.240 --> 00:08:25.800
<v Speaker 2>Exactly. Over time, exploring the environment and propagating these rewards back

181
00:08:25.879 --> 00:08:28.759
<v Speaker 2>via the TD updates, the Q-values start to converge

182
00:08:28.800 --> 00:08:30.480
<v Speaker 2>towards the true optimal values.

183
00:08:30.680 --> 00:08:33.960
<v Speaker 1>And then once training is done, the inference process is simple?

184
00:08:33.919 --> 00:08:36.840
<v Speaker 2>Very simple. Put the AI in any state s,

185
00:08:37.360 --> 00:08:39.720
<v Speaker 2>it looks up the learned Q values for all possible

186
00:08:39.759 --> 00:08:42.559
<v Speaker 2>actions A from that state. It picks the action with

187
00:08:42.600 --> 00:08:45.000
<v Speaker 2>the highest Q value. That's its policy.

188
00:08:45.320 --> 00:08:48.000
<v Speaker 1>Okay, that makes sense. It learns the map of values,

189
00:08:48.080 --> 00:08:51.039
<v Speaker 1>then follows the path of highest value. The source gives

190
00:08:51.080 --> 00:08:52.960
<v Speaker 1>a warehouse robot example, right?

191
00:08:52.879 --> 00:08:56.120
<v Speaker 2>Yeah, a really clear one. Guiding a robot through a

192
00:08:56.120 --> 00:08:59.039
<v Speaker 2>maze like a warehouse layout to get to a specific

193
00:08:59.120 --> 00:09:00.840
<v Speaker 2>goal location, say location G.

194
00:09:01.200 --> 00:09:04.320
<v Speaker 1>How does that map to states, actions, rewards?

195
00:09:04.799 --> 00:09:08.440
<v Speaker 2>The states are just the robot's current location: A, B, C. The

196
00:09:08.480 --> 00:09:11.600
<v Speaker 2>actions are moving to an adjacent connected location. Simple enough.

197
00:09:11.639 --> 00:09:13.399
<v Speaker 2>And the rewards are designed to get it to G.

198
00:09:13.799 --> 00:09:16.360
<v Speaker 2>Maybe a small reward like plus one for any valid

199
00:09:16.360 --> 00:09:19.519
<v Speaker 2>move between locations, zero reward if it tries to move

200
00:09:19.600 --> 00:09:22.039
<v Speaker 2>through a wall. And the goal: a big reward, say

201
00:09:22.120 --> 00:09:26.000
<v Speaker 2>plus one thousand for reaching location G. That high value

202
00:09:26.039 --> 00:09:27.360
<v Speaker 2>at the goal is the incentive.
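
NOTE
Editor's sketch of the reward design just described, on a hypothetical
three-location corridor A - B - C with goal C. This is not the book's warehouse
layout. The values of gamma and alpha, the iteration count, the seed, and the
choice to put the big reward on the goal's own cell are assumptions.
```python
import random
names = ["A", "B", "C"]
# R[s][a] is the reward for moving from location s to location a.
R = [
    [0, 1, 0],     # from A the only valid move is to B
    [1, 0, 1],     # from B you can move to A or C
    [0, 1, 1000],  # the goal C carries the big reward
]
gamma, alpha = 0.75, 0.9
Q = [[0.0] * 3 for _ in range(3)]
rng = random.Random(0)
for _ in range(1000):
    s = rng.randrange(3)                           # random starting location
    valid = [a for a in range(3) if R[s][a] > 0]   # moves not blocked by a wall
    a = rng.choice(valid)                          # the action is "move to a"
    td = R[s][a] + gamma * max(Q[a]) - Q[s][a]     # temporal difference
    Q[s][a] += alpha * td
def route(start, goal):
    """Follow the highest Q-value from each location until the goal is reached."""
    path, s = [names[start]], start
    while s != goal and len(path) < 10:
        s = max(range(3), key=lambda a: Q[s][a])
        path.append(names[s])
    return path
best = route(0, 2)
```
After training, the large value at the goal has propagated backwards, so the
move from B to C ends up with a higher Q-value than the move from B back to A.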

203
00:09:27.519 --> 00:09:31.159
<v Speaker 1>So during training, the robot wanders around, bumping into walls,

204
00:09:31.240 --> 00:09:34.000
<v Speaker 1>maybe stumbling into G eventually?

205
00:09:33.919 --> 00:09:36.279
<v Speaker 2>Right, and when it gets rewards, especially that big one at G,

206
00:09:36.919 --> 00:09:40.720
<v Speaker 2>the TD updates start propagating that value backwards along the

207
00:09:40.759 --> 00:09:41.639
<v Speaker 2>paths leading to G.

208
00:09:41.879 --> 00:09:45.080
<v Speaker 1>So actions that lead towards G gradually get higher Q-values.

209
00:09:45.159 --> 00:09:50.360
<v Speaker 2>Exactly. The Q-values effectively learn the goodness of

210
00:09:50.399 --> 00:09:52.159
<v Speaker 2>each move in terms of reaching the goal.

211
00:09:51.879 --> 00:09:53.679
<v Speaker 1>And the source also mentions you could add

212
00:09:53.720 --> 00:09:57.159
<v Speaker 1>intermediate goals like forcing the robot to go through location

213
00:09:57.320 --> 00:09:58.240
<v Speaker 1>K on the way to G.

214
00:09:58.600 --> 00:10:01.559
<v Speaker 2>Yes, you just tweak the reward matrix. Give a medium-sized

215
00:10:01.600 --> 00:10:04.960
<v Speaker 2>reward, maybe five hundred, specifically for the action of

216
00:10:05.000 --> 00:10:08.480
<v Speaker 2>moving from J to K, if that's the desired intermediate step.

217
00:10:08.320 --> 00:10:11.159
<v Speaker 1>Ah, make that specific transition valuable.

218
00:10:10.919 --> 00:10:13.200
<v Speaker 2>Or you could add a big negative reward, minus

219
00:10:13.240 --> 00:10:15.440
<v Speaker 2>five hundred, for a transition you want it to avoid, like

220
00:10:15.480 --> 00:10:18.720
<v Speaker 2>going from J to F. You shape the desired path by

221
00:10:18.759 --> 00:10:21.480
<v Speaker 2>manipulating the rewards for specific state-action pairs.

222
00:10:21.360 --> 00:10:25.120
<v Speaker 1>Very flexible. And in inference, the trained robot would

223
00:10:25.120 --> 00:10:27.679
<v Speaker 1>then follow the path that accumulated the highest Q-values?

224
00:10:27.399 --> 00:10:30.279
<v Speaker 2>Testing the example path mentioned: E to I

225
00:10:30.440 --> 00:10:34.799
<v Speaker 2>to J to K, then L, H, G. The robot figures that out just

226
00:10:34.840 --> 00:10:37.559
<v Speaker 2>by following the highest Q value at each step, guided

227
00:10:37.600 --> 00:10:38.759
<v Speaker 2>by the rewards you designed.

228
00:10:39.000 --> 00:10:41.639
<v Speaker 1>Okay, Q learning seems powerful for these kinds of discrete

229
00:10:41.679 --> 00:10:44.679
<v Speaker 1>state-space problems. But what about more complex stuff like

230
00:10:44.960 --> 00:10:47.960
<v Speaker 1>dealing with messy continuous data or images.

231
00:10:48.399 --> 00:10:52.159
<v Speaker 2>Exactly, that's the limit of basic Q learning tables. For

232
00:10:52.279 --> 00:10:55.279
<v Speaker 2>more complex problems, the source brings in artificial neural networks,

233
00:10:55.320 --> 00:10:58.799
<v Speaker 2>ANNs, and deep learning. The artificial brains, kind of. Yeah,

234
00:10:59.039 --> 00:11:02.279
<v Speaker 2>inspired by biological brains. The basic unit is the neuron.

235
00:11:02.399 --> 00:11:05.440
<v Speaker 2>It gets inputs, multiplies them by weights, sums them up,

236
00:11:05.399 --> 00:11:08.639
<v Speaker 1>and passes the result through an activation function like

237
00:11:08.799 --> 00:11:10.799
<v Speaker 1>ReLU, the rectifier you mentioned, right?

238
00:11:10.840 --> 00:11:14.159
<v Speaker 2>That activation function adds nonlinearity, which is super important for

239
00:11:14.240 --> 00:11:18.480
<v Speaker 2>learning complex patterns. These neurons are arranged in layers: input,

240
00:11:18.600 --> 00:11:21.440
<v Speaker 2>hidden layers, output. Information flows forward.
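
NOTE
Editor's sketch of the single neuron described in this exchange: a weighted sum
of inputs passed through the rectifier (ReLU). The bias term and the example
numbers are assumptions added for illustration; the speakers mention only
inputs, weights, and the activation function.
```python
def relu(x):
    """Rectifier: pass positive values through, clip negatives to zero."""
    return max(0.0, x)
def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(total)
out_pos = neuron([1.0, 2.0], [0.5, 0.25])   # 0.5 + 0.5 = 1.0
out_neg = neuron([1.0, 2.0], [-1.0, 0.25])  # -1.0 + 0.5 = -0.5, clipped to 0.0
```
The clipping in the second call is the nonlinearity: the output is no longer a
straight-line function of the inputs.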

241
00:11:21.679 --> 00:11:25.559
<v Speaker 1>Okay, and how do these networks learn? You mentioned adjusting weights.

242
00:11:25.720 --> 00:11:29.039
<v Speaker 2>They learn by trying to minimize error. For example, predicting

243
00:11:29.039 --> 00:11:32.120
<v Speaker 2>house prices: the network makes a prediction, you compare it to

244
00:11:32.120 --> 00:11:32.919
<v Speaker 2>the actual price.

245
00:11:33.200 --> 00:11:35.679
<v Speaker 1>That difference is the loss, the error, and it tries to

246
00:11:35.720 --> 00:11:36.720
<v Speaker 1>reduce that error.

247
00:11:36.799 --> 00:11:41.000
<v Speaker 2>Yes, using optimization algorithms like gradient descent, it calculates how

248
00:11:41.039 --> 00:11:43.879
<v Speaker 2>adjusting each weight would affect the error and nudges the

249
00:11:43.879 --> 00:11:47.080
<v Speaker 2>weights in the direction that reduces the error, over many, many examples.

250
00:11:46.879 --> 00:11:50.720
<v Speaker 1>The book uses that house price prediction example. What's

251
00:11:50.759 --> 00:11:54.399
<v Speaker 1>a really critical step when you feed data like house size,

252
00:11:54.440 --> 00:11:57.399
<v Speaker 1>number of bedrooms, et cetera, into an ANN?

253
00:11:57.279 --> 00:12:00.559
<v Speaker 2>Data prep is huge. Splitting into training and test sets

254
00:12:00.639 --> 00:12:04.279
<v Speaker 2>is standard, but the crucial thing, especially for ANNs,

255
00:12:04.600 --> 00:12:05.679
<v Speaker 2>is scaling the data.

256
00:12:06.360 --> 00:12:08.240
<v Speaker 1>Scaling? Why is that so vital?

257
00:12:08.480 --> 00:12:12.480
<v Speaker 2>Imagine number of bedrooms maybe one to five versus square

258
00:12:12.519 --> 00:12:17.759
<v Speaker 2>footage thousands. Without scaling, the network might overweight square footage

259
00:12:17.799 --> 00:12:20.320
<v Speaker 2>just because the numbers are bigger. Even if bedrooms are

260
00:12:20.360 --> 00:12:21.039
<v Speaker 2>just as important.

261
00:12:21.240 --> 00:12:24.200
<v Speaker 1>Ah, the scale of the numbers dominates the learning.

262
00:12:24.320 --> 00:12:27.679
<v Speaker 2>Exactly. Scaling methods like MinMaxScaler, mentioned in the

263
00:12:27.720 --> 00:12:30.759
<v Speaker 2>source, bring all features into a similar range, like zero

264
00:12:30.840 --> 00:12:33.559
<v Speaker 2>to one. So the network learns based on the predictive

265
00:12:33.600 --> 00:12:36.519
<v Speaker 2>power of each feature, not just its raw numerical size.
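
NOTE
Editor's sketch of min-max scaling in plain Python, so the arithmetic is
visible. The bedroom and square-footage numbers are made up. In a real
workflow the minimum and maximum would normally be computed on the training
split only and then reused for the test split.
```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
bedrooms = [1, 2, 3, 5]
square_feet = [800, 1200, 2000, 4000]
scaled_bedrooms = min_max_scale(bedrooms)        # [0.0, 0.25, 0.5, 1.0]
scaled_square_feet = min_max_scale(square_feet)  # [0.0, 0.125, 0.375, 1.0]
```
After scaling, both features span the same zero-to-one range, so neither one
dominates simply because its raw numbers are larger.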

266
00:12:36.639 --> 00:12:39.639
<v Speaker 1>Makes sense, leveling the playing field for the input features. Okay,

267
00:12:39.679 --> 00:12:43.159
<v Speaker 1>so we have Q-learning for sequences, ANNs for complex data.

268
00:12:43.679 --> 00:12:44.840
<v Speaker 1>What happens when you put them together?

269
00:12:44.679 --> 00:12:48.600
<v Speaker 2>Magic happens. That's deep Q-learning, or DQN. This

270
00:12:48.679 --> 00:12:52.000
<v Speaker 2>is where things get really powerful for complex RL problems.

271
00:12:51.600 --> 00:12:53.639
<v Speaker 1>Deep Q learning. So the deep comes from the deep

272
00:12:53.720 --> 00:12:55.600
<v Speaker 1>learning neural network?

273
00:12:55.919 --> 00:12:58.440
<v Speaker 2>Exactly. The ANN acts as a function approximator for the

274
00:12:58.519 --> 00:13:02.279
<v Speaker 2>Q-function, instead of a giant table storing Q(s, a)

275
00:13:02.799 --> 00:13:06.159
<v Speaker 2>for every possible state and action, which is impossible for

276
00:13:06.240 --> 00:13:07.320
<v Speaker 2>complex environments.

277
00:13:07.399 --> 00:13:10.559
<v Speaker 1>Right, the state space could be enormous or even continuous.

278
00:13:10.639 --> 00:13:14.240
<v Speaker 2>The ANN takes the state s as input, and its

279
00:13:14.279 --> 00:13:18.200
<v Speaker 2>output layer predicts the Q values for all possible actions

280
00:13:18.240 --> 00:13:19.320
<v Speaker 2>A from that state.

281
00:13:19.559 --> 00:13:21.919
<v Speaker 1>So the network learns to estimate the Q values on

282
00:13:21.960 --> 00:13:23.720
<v Speaker 1>the fly based on the input state.

283
00:13:23.840 --> 00:13:27.039
<v Speaker 2>Precisely. It generalizes. Now, when it comes to choosing an

284
00:13:27.039 --> 00:13:30.080
<v Speaker 2>action during training, DQN doesn't always just pick the action

285
00:13:30.159 --> 00:13:33.279
<v Speaker 2>with the highest predicted Q value. That would be pure exploitation.

286
00:13:33.559 --> 00:13:36.559
<v Speaker 1>It needs exploration too, right? Like in Thompson sampling.

287
00:13:36.919 --> 00:13:40.360
<v Speaker 2>Exactly. The source mentions common strategies like softmax or epsilon-greedy

288
00:13:40.360 --> 00:13:41.879
<v Speaker 2>exploration.

289
00:13:41.919 --> 00:13:44.000
<v Speaker 1>Epsilon-greedy? That's the one where, say ten percent of the time,

290
00:13:44.039 --> 00:13:46.000
<v Speaker 1>it just picks a random action instead of the best one?

291
00:13:46.120 --> 00:13:50.360
<v Speaker 2>Yeah, that's the idea: with probability epsilon, explore randomly; otherwise

292
00:13:50.879 --> 00:13:55.320
<v Speaker 2>exploit the best known action. Softmax assigns probabilities based on

293
00:13:55.440 --> 00:13:59.440
<v Speaker 2>Q values, giving even weaker actions some chance. This exploration

294
00:13:59.519 --> 00:14:03.000
<v Speaker 2>is crucial for discovering potentially better strategies the AI doesn't

295
00:14:03.000 --> 00:14:03.759
<v Speaker 2>know about yet.
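
NOTE
Editor's sketch of the two exploration strategies mentioned here. The Q-values
are made up, and the temperature parameter in the softmax is an assumption the
speakers do not mention.
```python
import math
import random
def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
def softmax_probs(q_values, temperature=1.0):
    """Turn Q-values into action probabilities; higher values get more weight."""
    top = max(q_values)  # subtracting the max avoids overflow in exp
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
q_values = [1.0, 2.0, 0.5]
# With epsilon set to zero the choice is purely greedy.
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
probs = softmax_probs(q_values)
# Sample an action in proportion to its softmax probability.
sampled_action = random.choices(range(len(q_values)), weights=probs)[0]
```
With softmax, even the weakest action keeps a nonzero probability, which is the
"some chance" for weaker actions that the speaker refers to.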

296
00:14:03.919 --> 00:14:07.120
<v Speaker 1>Okay, so how does the DQN actually learn? How does

297
00:14:07.159 --> 00:14:09.879
<v Speaker 1>the network get better at predicting Q values.

298
00:14:10.159 --> 00:14:12.320
<v Speaker 2>It's similar to the q learning update, but uses the

299
00:14:12.360 --> 00:14:16.080
<v Speaker 2>network. The AI is in state s, picks an action

300
00:14:16.159 --> 00:14:19.279
<v Speaker 2>a using epsilon-greedy or similar, observes the reward

301
00:14:19.480 --> 00:14:20.919
<v Speaker 2>r and the next state s'.

302
00:14:21.240 --> 00:14:21.480
<v Speaker 1>Okay.

303
00:14:21.559 --> 00:14:24.120
<v Speaker 2>It then uses the same neural network to predict the

304
00:14:24.159 --> 00:14:27.200
<v Speaker 2>maximum Q-value possible from that next state, s'.

305
00:14:27.320 --> 00:14:30.159
<v Speaker 2>Let's call that max Q(s', a'). It calculates the target Q-value:

306
00:14:30.320 --> 00:14:34.360
<v Speaker 2>target equals r plus gamma times max Q(s', a'). Gamma is a

307
00:14:34.440 --> 00:14:36.600
<v Speaker 2>discount factor for future rewards.

308
00:14:36.679 --> 00:14:39.159
<v Speaker 1>So reward plus the discounted best value from the next state,

309
00:14:39.360 --> 00:14:40.960
<v Speaker 1>that's the target?

310
00:14:41.000 --> 00:14:43.279
<v Speaker 2>Right. Now it compares this target value to the Q-value the

311
00:14:43.320 --> 00:14:45.759
<v Speaker 2>network originally predicted for the action a it actually took

312
00:14:45.759 --> 00:14:48.080
<v Speaker 2>in state s. The difference between the prediction and the

313
00:14:48.120 --> 00:14:51.159
<v Speaker 2>target is the error, the temporal difference error again.

314
00:14:51.120 --> 00:14:53.639
<v Speaker 1>And that error signal is used to update the network?

315
00:14:53.360 --> 00:14:57.320
<v Speaker 2>Exactly. The error is backpropagated through the ANN,

316
00:14:57.799 --> 00:15:01.600
<v Speaker 2>adjusting the weights so that next time the network's prediction

317
00:15:01.759 --> 00:15:05.440
<v Speaker 2>for Q(s, a) will be closer to that target value.

318
00:15:05.639 --> 00:15:08.639
<v Speaker 2>It learns to make better predictions through experience.
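
NOTE
Editor's sketch of the target and error computation described in this exchange,
with plain numbers standing in for network outputs. In a real DQN the next-state
Q-values and the prediction would come from the network's forward pass; the
values below and the squared-error loss are assumptions for illustration.
```python
def dqn_target(reward, next_q_values, gamma=0.9):
    """Target = r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)
def td_error(predicted_q, target):
    """Difference between the target and the prediction for the action taken."""
    return target - predicted_q
# Hypothetical predicted Q(s', a') for each of three actions.
next_q_values = [0.2, 1.5, -0.3]
target = dqn_target(reward=1.0, next_q_values=next_q_values, gamma=0.9)
error = td_error(predicted_q=2.0, target=target)
# A squared error like this is what a framework would backpropagate.
loss = 0.5 * error ** 2
```
Only the Q-value of the action actually taken is compared with the target; the
outputs for the other actions are left alone in that update.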

319
00:15:08.519 --> 00:15:11.200
<v Speaker 1>And there is something about experience replay memory?

320
00:15:10.759 --> 00:15:14.639
<v Speaker 2>Ah yes, crucial for stability. Instead of learning only

321
00:15:14.679 --> 00:15:17.519
<v Speaker 2>from the very last thing that happened, the AI stores

322
00:15:17.600 --> 00:15:22.159
<v Speaker 2>lots of past experiences, (state, action, reward, next state)

323
00:15:22.600 --> 00:15:26.320
<v Speaker 2>tuples, in a big memory buffer. Okay. Then for learning updates,

324
00:15:26.320 --> 00:15:29.960
<v Speaker 2>it samples random mini-batches of these past experiences from the buffer.

325
00:15:29.399 --> 00:15:31.039
<v Speaker 1>Why random batches?

326
00:15:31.159 --> 00:15:35.039
<v Speaker 2>It breaks the correlation between consecutive experiences. Learning step by

327
00:15:35.039 --> 00:15:38.320
<v Speaker 2>step can be unstable because consecutive states are often very similar.

328
00:15:38.759 --> 00:15:41.840
<v Speaker 2>Random sampling makes the training data more diverse and independent

329
00:15:41.879 --> 00:15:45.080
<v Speaker 2>in each batch, which really helps stabilize the learning process

330
00:15:45.080 --> 00:15:46.279
<v Speaker 2>for the deep neural network.
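
NOTE
Editor's sketch of an experience replay memory as described above: store transitions in a bounded buffer, then draw random mini-batches for learning.
The class name, capacity, and batch size are assumptions chosen for illustration.
```python
import random
from collections import deque
class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out first
    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))
    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up runs of consecutive steps.
        return random.sample(list(self.buffer), batch_size)
    def __len__(self):
        return len(self.buffer)
memory = ReplayMemory(capacity=1000)
for step in range(50):  # placeholder transitions; integers stand in for real states
    memory.push(state=step, action=step % 3, reward=0.1, next_state=step + 1)
batch = memory.sample(batch_size=8)
```
Each sampled batch mixes old and recent transitions, which is the decorrelation effect mentioned in the discussion.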

331
00:15:46.360 --> 00:15:49.440
<v Speaker 1>Got it. Okay, DQN sounds really powerful. The source must

332
00:15:49.440 --> 00:15:52.759
<v Speaker 1>have some cool applications. You mentioned a virtual self-driving car?

333
00:15:52.480 --> 00:15:54.919
<v Speaker 2>Yeah, a great example in the book. They use

334
00:15:54.960 --> 00:15:57.879
<v Speaker 2>a Kivy app, a Python framework, to simulate it. The

335
00:15:57.879 --> 00:16:01.080
<v Speaker 2>input states for the AI are things like the

336
00:16:01.159 --> 00:16:05.320
<v Speaker 2>car's angle towards the goal, but also crucially sensor readings,

337
00:16:05.399 --> 00:16:09.639
<v Speaker 2>What kind of sensors? Virtual sensors detecting sand, basically obstacles

338
00:16:09.720 --> 00:16:11.960
<v Speaker 2>or off road areas to the left, front and right.

339
00:16:12.440 --> 00:16:15.039
<v Speaker 2>This gives the AI situational awareness.

340
00:16:14.600 --> 00:16:17.120
<v Speaker 1>And the actions are simple driving controls.

341
00:16:16.799 --> 00:16:17.919
<v Speaker 2>Basic steering adjustments.

342
00:16:18.000 --> 00:16:18.240
<v Speaker 1>Yeah.

343
00:16:18.799 --> 00:16:21.840
<v Speaker 2>The rewards are set up to encourage driving well: a

344
00:16:21.879 --> 00:16:25.519
<v Speaker 2>penalty of negative one for hitting sand borders, a smaller penalty

345
00:16:25.600 --> 00:16:27.279
<v Speaker 2>of negative point two if it moves away from

346
00:16:27.279 --> 00:16:29.559
<v Speaker 2>the goal, and a small reward of plus point one for

347
00:16:29.600 --> 00:16:30.440
<v Speaker 2>moving towards the goal.
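
NOTE
Editor's sketch of the reward scheme just quoted (negative one on sand, negative point two when moving away, plus point one when moving closer).
The function and argument names are hypothetical; treating an unchanged distance as "moving away" is an assumption.
```python
def driving_reward(on_sand, distance, last_distance):
    """Reward for one simulation step, using the values quoted in the discussion."""
    if on_sand:
        return -1.0  # hit a sand border
    if distance < last_distance:
        return 0.1   # moved towards the goal
    return -0.2      # moved away from the goal (assumption: also covers no change)
print(driving_reward(on_sand=False, distance=40.0, last_distance=50.0))
```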

348
00:16:30.639 --> 00:16:34.159
<v Speaker 1>So the DQN learns to process those sensor inputs, predict

349
00:16:34.240 --> 00:16:37.720
<v Speaker 1>Q-values for steering actions, and choose actions that avoid

350
00:16:37.759 --> 00:16:39.039
<v Speaker 1>penalties and get rewards.

351
00:16:39.120 --> 00:16:41.919
<v Speaker 2>Exactly, it learns through trial and error in the simulation

352
00:16:42.320 --> 00:16:45.279
<v Speaker 2>to stay on the road, avoid sand and navigate towards

353
00:16:45.360 --> 00:16:48.279
<v Speaker 2>the target, eventually making round trips. You use something like

354
00:16:48.320 --> 00:16:50.679
<v Speaker 2>PyTorch or TensorFlow to build the ANN part.

355
00:16:50.919 --> 00:16:51.360
<v Speaker 1>Very cool.

356
00:16:51.519 --> 00:16:54.639
<v Speaker 1>And the server cooling example that sounded really practical.

357
00:16:54.279 --> 00:16:58.879
<v Speaker 2>Extremely practical: applying DQN to minimize energy costs in a server environment.

358
00:16:59.080 --> 00:17:02.240
<v Speaker 1>So the input states there are things affecting temperature, right?

359
00:17:02.440 --> 00:17:06.400
<v Speaker 2>Server's current temperature, maybe number of active users, data transmission rate,

360
00:17:06.440 --> 00:17:10.079
<v Speaker 2>factors influencing heat load. And the actions: discrete choices. The

361
00:17:10.119 --> 00:17:13.039
<v Speaker 2>source example used things like cool by one point five

362
00:17:13.079 --> 00:17:16.640
<v Speaker 2>degrees C, cool by point five degrees C, do nothing, heat

363
00:17:16.680 --> 00:17:18.960
<v Speaker 2>by point five degrees C, heat by one point

364
00:17:19.000 --> 00:17:21.720
<v Speaker 2>five degrees C. Five distinct actions.

365
00:17:21.319 --> 00:17:23.960
<v Speaker 1>And the reward is the energy saved compared to a

366
00:17:24.000 --> 00:17:26.839
<v Speaker 1>standard, maybe thermostat-based, system?

367
00:17:26.880 --> 00:17:30.400
<v Speaker 2>Exactly. The goal is purely energy efficiency. The DQN trains by

368
00:17:30.400 --> 00:17:33.759
<v Speaker 2>simulating temperature changes based on inputs and its actions, learning

369
00:17:33.799 --> 00:17:36.559
<v Speaker 2>which sequence of cooling and heating actions keeps the temperature within

370
00:17:36.599 --> 00:17:39.559
<v Speaker 2>an acceptable range while using the least energy possible.

371
00:17:39.720 --> 00:17:41.440
<v Speaker 1>And it uses a standard ANN setup?

372
00:17:41.599 --> 00:17:44.680
<v Speaker 2>Yeah, the source mentions a typical structure: maybe two hidden

373
00:17:44.720 --> 00:17:48.200
<v Speaker 2>layers, mean squared error (MSE) loss to measure how far

374
00:17:48.240 --> 00:17:51.319
<v Speaker 2>off its temperature prediction is, the Adam optimizer to adjust

375
00:17:51.359 --> 00:17:54.759
<v Speaker 2>weights, and epsilon-greedy exploration during training.
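
NOTE
Editor's sketch of epsilon-greedy exploration over a small discrete action set, as mentioned above.
The Q-values and the epsilon settings are made-up illustrations; only the idea of five discrete actions comes from the discussion.
```python
import random
def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
q_values = [0.1, 0.7, 0.3, 0.2, 0.0]  # hypothetical Q-values for five actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)     # always exploits
exploring_action = epsilon_greedy(q_values, epsilon=1.0)  # always explores
```
A common design choice, not stated in the source, is to start with a high epsilon and lower it as training progresses.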

376
00:17:54.559 --> 00:17:56.480
<v Speaker 1>And the result was significant?

377
00:17:56.519 --> 00:17:59.200
<v Speaker 2>Quite significant. Yeah, the source cited achieving up to eighty-seven percent

378
00:17:59.279 --> 00:18:01.839
<v Speaker 2>energy savings compared to the baseline. That's a huge real

379
00:18:01.839 --> 00:18:03.200
<v Speaker 2>world win from applying RL.

380
00:18:03.519 --> 00:18:07.440
<v Speaker 1>Wow. Okay, so DQN handles complex states, but what about

381
00:18:07.519 --> 00:18:10.440
<v Speaker 1>visual states like images or game screens.

382
00:18:10.759 --> 00:18:14.880
<v Speaker 2>Ah, now we get to deep convolutional Q-learning, DCQN.

383
00:18:15.319 --> 00:18:18.680
<v Speaker 2>This brings in convolutional neural networks, CNNs.

384
00:18:18.759 --> 00:18:21.400
<v Speaker 1>CNNs. They're specialized for images, right?

385
00:18:21.640 --> 00:18:25.240
<v Speaker 2>Exactly. They're designed to process grid-like data, and images are

386
00:18:25.240 --> 00:18:26.039
<v Speaker 2>the prime example.

387
00:18:26.119 --> 00:18:28.079
<v Speaker 1>How do they work sort of intuitively?

388
00:18:28.319 --> 00:18:31.160
<v Speaker 2>Well, the first key step is convolution. You slide small

389
00:18:31.160 --> 00:18:34.240
<v Speaker 2>filters across the image. Each filter is designed to detect

390
00:18:34.279 --> 00:18:38.039
<v Speaker 2>a specific simple feature like a vertical edge, horizontal edge,

391
00:18:38.119 --> 00:18:41.359
<v Speaker 2>a corner, maybe a certain texture or color patch. This

392
00:18:41.440 --> 00:18:42.680
<v Speaker 2>produces feature maps.
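
NOTE
Editor's sketch of the sliding-filter step described above, written in plain Python so it runs without any library.
The tiny image and the vertical-edge filter are hypothetical; like most deep learning libraries, the code does not flip the kernel (strictly a cross-correlation).
```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        feature_map.append(row)
    return feature_map
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]  # dark left half, bright right half
vertical_edge = [[-1, 1],
                 [-1, 1]]  # responds where brightness rises from left to right
feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```
The feature map is non-zero only in the column where the dark-to-bright edge sits. In a trained CNN the filter values are learned rather than set by hand.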

393
00:18:42.720 --> 00:18:44.519
<v Speaker 1>Okay, finding basic patterns.

394
00:18:44.440 --> 00:18:47.640
<v Speaker 2>Then comes pooling, often max pooling. It takes small regions of

395
00:18:47.680 --> 00:18:50.119
<v Speaker 2>the feature map and just keeps the maximum value. It's

396
00:18:50.119 --> 00:18:52.839
<v Speaker 2>a way to downsample, to reduce the data size while keeping

397
00:18:52.839 --> 00:18:55.880
<v Speaker 2>the most salient features detected. It makes the network more

398
00:18:55.960 --> 00:18:57.799
<v Speaker 2>robust to small shifts or distortions.
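
NOTE
Editor's sketch of max pooling as described above: keep only the largest value in each small region of a feature map.
The 2x2 window and the example values are assumptions for illustration.
```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size region."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[i * size + di][j * size + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
pooled = max_pool2d(feature_map)
print(pooled)
```
The 4x4 map shrinks to 2x2, and a strong activation survives even if it shifts by one cell within its window.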

399
00:18:57.920 --> 00:19:01.640
<v Speaker 1>So extract features, then condense them.

400
00:19:01.640 --> 00:19:05.559
<v Speaker 2>Right. After several layers of convolution and pooling, you've extracted increasingly complex features.

401
00:19:06.039 --> 00:19:08.480
<v Speaker 2>Then you flatten the final 2D feature maps into

402
00:19:08.480 --> 00:19:09.880
<v Speaker 2>a single long 1D vector.

403
00:19:09.759 --> 00:19:12.000
<v Speaker 1>Okay, a feature vector.

404
00:19:11.920 --> 00:19:14.799
<v Speaker 2>And that vector is then fed into a standard fully connected

405
00:19:14.839 --> 00:19:18.200
<v Speaker 2>ANN, like we discussed before, for the final prediction or

406
00:19:18.240 --> 00:19:20.839
<v Speaker 2>decision making, in this case predicting Q-values.

407
00:19:20.960 --> 00:19:24.839
<v Speaker 1>And you mentioned CNNs can handle 3D inputs. That's

408
00:19:24.880 --> 00:19:25.799
<v Speaker 1>important for...

409
00:19:25.559 --> 00:19:28.400
<v Speaker 2>For the next example, yeah: playing the classic Snake game

410
00:19:28.480 --> 00:19:29.920
<v Speaker 2>using DCQN.

411
00:19:29.480 --> 00:19:32.160
<v Speaker 1>Ah, Snake. Perfect visual task.

412
00:19:32.440 --> 00:19:34.960
<v Speaker 2>Exactly. The state isn't just a single snapshot of the game screen.

413
00:19:35.000 --> 00:19:38.799
<v Speaker 2>To understand movement, the AI needs context. So the state

414
00:19:38.839 --> 00:19:42.440
<v Speaker 2>input is actually a stack of recent game frames.

415
00:19:42.480 --> 00:19:45.400
<v Speaker 1>Like layering the last few frames together?

416
00:19:45.519 --> 00:19:47.880
<v Speaker 2>Precisely. Think of it like a 3D volume: width, height,

417
00:19:48.000 --> 00:19:50.720
<v Speaker 2>and a short time dimension. This allows the CNN to

418
00:19:50.720 --> 00:19:54.039
<v Speaker 2>perceive motion and direction, not just static positions.
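
NOTE
Editor's sketch of the frame-stacking idea above: the state is the last few frames layered together, so motion becomes visible.
The class name and n_frames=4 are assumptions; strings stand in for real image arrays.
```python
from collections import deque
class FrameStack:
    """Keeps the most recent n_frames frames as one stacked state."""
    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)
    def reset(self, first_frame):
        # Fill the stack with copies of the first frame so the state has full depth at once.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()
    def step(self, new_frame):
        self.frames.append(new_frame)  # the oldest frame is dropped automatically
        return self.state()
    def state(self):
        return list(self.frames)
stack = FrameStack(n_frames=4)
state = stack.reset("f0")
state = stack.step("f1")
state = stack.step("f2")
print(state)
```
With real frames, the stacked list would be converted to a single array whose extra axis plays the role of the short time dimension.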

419
00:19:54.079 --> 00:19:56.599
<v Speaker 1>That's clever. Okay, what are the actions for Snake?

420
00:19:57.640 --> 00:20:01.359
<v Speaker 2>Simple: up, down, left, right. Four possible moves.

421
00:20:01.559 --> 00:20:04.160
<v Speaker 1>What about impossible moves? Like, if the snake is going right,

422
00:20:04.200 --> 00:20:05.400
<v Speaker 1>it can't immediately go left.

423
00:20:05.680 --> 00:20:09.599
<v Speaker 2>Good point. The AI might try to command left; the

424
00:20:09.640 --> 00:20:12.960
<v Speaker 2>game engine ignores it. The snake continues right and promptly dies.

425
00:20:13.559 --> 00:20:16.799
<v Speaker 2>The key is the AI takes the action left, observes

426
00:20:16.799 --> 00:20:20.559
<v Speaker 2>the outcome, death, gets a negative reward, and associates that

427
00:20:20.559 --> 00:20:23.799
<v Speaker 2>negative reward with the attempted action, left, in that specific

428
00:20:23.880 --> 00:20:27.240
<v Speaker 2>state: moving right, next to itself. It learns that trying to

429
00:20:27.279 --> 00:20:28.279
<v Speaker 2>go left here is bad.

430
00:20:28.759 --> 00:20:31.839
<v Speaker 1>Ah, okay. It learns the consequence of the attempted action,

431
00:20:32.000 --> 00:20:35.440
<v Speaker 1>even if the game rules prevent it. What about the rewards?

432
00:20:35.279 --> 00:20:38.480
<v Speaker 2>Simple and effective: plus one for eating an apple, negative

433
00:20:38.480 --> 00:20:41.279
<v Speaker 2>one for dying, hitting a wall or itself, and crucially, a

434
00:20:41.319 --> 00:20:44.440
<v Speaker 2>small negative reward, like negative point zero three, for every

435
00:20:44.440 --> 00:20:45.920
<v Speaker 2>single step that doesn't end the game or get an apple.

436
00:20:45.839 --> 00:20:49.200
<v Speaker 1>A living penalty? Why penalize it

437
00:20:49.279 --> 00:20:50.240
<v Speaker 1>just for moving?

438
00:20:50.039 --> 00:20:53.160
<v Speaker 2>To encourage efficiency. Without it, the snake could just wiggle

439
00:20:53.200 --> 00:20:56.000
<v Speaker 2>around in empty space forever. Not dying is okay, not

440
00:20:56.039 --> 00:20:59.240
<v Speaker 2>getting apples is okay. That small penalty incentivizes it to

441
00:20:59.240 --> 00:21:01.720
<v Speaker 2>find apples quickly, because eating an apple, plus one, is

442
00:21:01.759 --> 00:21:05.119
<v Speaker 2>the main way to counteract the accumulating negative living penalty.
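
NOTE
Editor's sketch of the Snake reward scheme above, with a small calculation showing why the living penalty pushes the agent to reach apples quickly.
The function names and the step counts are hypothetical; the reward values are the ones quoted in the discussion.
```python
def snake_reward(died, ate_apple, living_penalty=-0.03):
    """Reward for a single step of the game."""
    if died:
        return -1.0
    if ate_apple:
        return 1.0
    return living_penalty
def return_per_apple(steps_to_apple):
    """Undiscounted reward collected over one trip that ends by eating an apple."""
    wandering = (steps_to_apple - 1) * snake_reward(died=False, ate_apple=False)
    return wandering + snake_reward(died=False, ate_apple=True)
quick = return_per_apple(steps_to_apple=5)
slow = return_per_apple(steps_to_apple=30)
print(quick, slow)
```
The quick trip keeps most of the plus one, while the slow trip loses most of it to accumulated penalties.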

443
00:21:04.920 --> 00:21:08.640
<v Speaker 1>Makes sense, drives it towards the objective efficiently. So the

444
00:21:08.720 --> 00:21:13.039
<v Speaker 1>DCQN takes the stacked frames, processes them through CNN layers

445
00:21:13.039 --> 00:21:16.839
<v Speaker 1>to understand the visual state, where's the snake, apple, walls, body,

446
00:21:16.640 --> 00:21:19.839
<v Speaker 2>flattens those features, feeds them to an ANN, which

447
00:21:19.920 --> 00:21:23.319
<v Speaker 2>outputs the Q-values for up, down, left, right.

448
00:21:23.240 --> 00:21:27.920
<v Speaker 1>And it learns by taking actions with exploration, getting rewards and penalties,

449
00:21:28.000 --> 00:21:32.240
<v Speaker 1>and updating the whole network, CNN plus ANN, via backpropagation

450
00:21:32.519 --> 00:21:33.519
<v Speaker 1>using the TD error.

451
00:21:33.440 --> 00:21:36.079
<v Speaker 2>You've got it. And the result mentioned in the source:

452
00:21:36.599 --> 00:21:39.279
<v Speaker 2>After training, the AI could consistently eat around ten to

453
00:21:39.279 --> 00:21:42.519
<v Speaker 2>eleven apples per game, which is pretty decent for learning

454
00:21:42.519 --> 00:21:43.079
<v Speaker 2>from scratch.

455
00:21:43.160 --> 00:21:45.319
<v Speaker 1>That's really cool. So, quite a journey there, from the

456
00:21:45.400 --> 00:21:46.359
<v Speaker 1>absolute basics of

457
00:21:46.359 --> 00:21:48.319
<v Speaker 2>RL, states, actions, rewards, through

458
00:21:48.200 --> 00:21:52.440
<v Speaker 1>Thompson sampling for simple choices, Q-learning for basic sequences.

459
00:21:52.079 --> 00:21:55.279
<v Speaker 2>Then bringing in the power of neural networks with DQN

460
00:21:55.359 --> 00:21:56.240
<v Speaker 2>for complex

461
00:21:55.920 --> 00:21:58.599
<v Speaker 1>states, handling things like driving and server cooling. And

462
00:21:58.559 --> 00:22:02.079
<v Speaker 2>finally DCQN, using convolutional networks to actually see and

463
00:22:02.119 --> 00:22:03.200
<v Speaker 2>play a game

464
00:22:03.119 --> 00:22:07.680
<v Speaker 1>like Snake. Yeah, across casinos, warehouses, cars, server rooms, video games.

465
00:22:07.960 --> 00:22:12.240
<v Speaker 1>It's amazing how that core loop of interaction and reward applies.

466
00:22:12.160 --> 00:22:15.519
<v Speaker 2>And the source really hammers home that idea of intuition

467
00:22:15.680 --> 00:22:20.039
<v Speaker 2>first, practice, continuous learning. It even mentions resources like OpenAI

468
00:22:20.119 --> 00:22:21.920
<v Speaker 2>Gym for getting hands-on.

469
00:22:22.000 --> 00:22:26.400
<v Speaker 1>Right, because understanding is one thing, but actually building these things...

470
00:22:26.119 --> 00:22:28.839
<v Speaker 2>That takes practice. And thinking back to that big picture

471
00:22:28.839 --> 00:22:31.640
<v Speaker 2>the book painted at the start, all those potential

472
00:22:31.720 --> 00:22:33.000
<v Speaker 2>application areas.

473
00:22:32.759 --> 00:22:34.440
<v Speaker 1>Yeah, it kind of brings it full circle. We've seen

474
00:22:34.440 --> 00:22:36.359
<v Speaker 1>the building blocks. Now you can think about where

475
00:22:36.400 --> 00:22:37.599
<v Speaker 1>they might fit.

476
00:22:37.640 --> 00:22:40.079
<v Speaker 2>Exactly. We pulled out the core ideas from these excerpts.

477
00:22:40.279 --> 00:22:42.480
<v Speaker 1>So what's the big takeaway here for you listening?

478
00:22:42.880 --> 00:22:46.039
<v Speaker 2>Well, I think it's that these complex AI systems, they

479
00:22:46.079 --> 00:22:50.039
<v Speaker 2>often boil down to these understandable core principles. Defining the

480
00:22:50.079 --> 00:22:54.880
<v Speaker 2>problem clearly, states, actions, rewards, is maybe half

481
00:22:54.759 --> 00:22:57.759
<v Speaker 1>the battle. Yeah, and then using these learning algorithms, often

482
00:22:57.799 --> 00:23:01.359
<v Speaker 1>involving neural networks now, to figure out the optimal strategy

483
00:23:01.519 --> 00:23:05.240
<v Speaker 1>through interaction and feedback, whether it's beta distributions for ads

484
00:23:05.440 --> 00:23:07.119
<v Speaker 1>or CNNs for Snake.

485
00:23:07.640 --> 00:23:10.839
<v Speaker 2>The potential is just huge and it's evolving so fast.

486
00:23:11.240 --> 00:23:13.519
<v Speaker 1>So as you think about what we've talked through, the

487
00:23:13.599 --> 00:23:17.680
<v Speaker 1>multi-armed bandit, the warehouse robot, the self-driving car simulation,

488
00:23:18.079 --> 00:23:23.000
<v Speaker 1>the energy-saving server, the game-playing AI, maybe ask yourself this:

489
00:23:23.359 --> 00:23:25.519
<v Speaker 1>is there a problem or a task in your world,

490
00:23:25.599 --> 00:23:29.039
<v Speaker 1>maybe work, maybe a hobby, that you could perhaps frame

491
00:23:29.200 --> 00:23:31.839
<v Speaker 1>in terms of states, actions and rewards?

492
00:23:32.079 --> 00:23:35.200
<v Speaker 2>How would an AI, learning just through trial and error

493
00:23:35.200 --> 00:23:39.359
<v Speaker 2>and feedback, approach solving it? It's a powerful way to

494
00:23:39.359 --> 00:23:41.079
<v Speaker 2>think about automation and optimization.

495
00:23:41.359 --> 00:23:44.759
<v Speaker 1>That idea of learning by doing, driven by feedback, it

496
00:23:44.799 --> 00:23:45.519
<v Speaker 1>really is powerful.

497
00:23:45.599 --> 00:23:46.680
<v Speaker 2>Definitely something to mull over.

498
00:23:46.839 --> 00:23:48.400
<v Speaker 1>Thanks for diving deep with us today.

499
00:23:48.480 --> 00:23:49.359
<v Speaker 2>Yeah, great discussion.

500
00:23:49.400 --> 00:23:50.319
<v Speaker 1>We'll see you on the next one.
