WEBVTT

1
00:00:00.080 --> 00:00:01.000
<v Speaker 1>Welcome to the deep dive.

2
00:00:01.000 --> 00:00:02.560
<v Speaker 2>We're here to cut through the noise, pull out the

3
00:00:02.640 --> 00:00:05.839
<v Speaker 2>insights that matter, really just for you. Today, we are

4
00:00:05.919 --> 00:00:10.080
<v Speaker 2>diving deep into something moving incredibly fast, this escalating digital

5
00:00:10.160 --> 00:00:13.560
<v Speaker 2>arms race. I mean, cyber threats are just increasing. They're getting, well,

6
00:00:13.839 --> 00:00:17.800
<v Speaker 2>shockingly sophisticated. You've got nation state actors, advanced machine learning.

7
00:00:18.280 --> 00:00:23.280
<v Speaker 2>They're acting as real force multipliers for the attackers. So

8
00:00:24.519 --> 00:00:27.679
<v Speaker 2>our deep dive today, it's about reinforcement learning, RL. It's

9
00:00:27.719 --> 00:00:31.160
<v Speaker 2>this cutting edge part of AI, and it's not just

10
00:00:31.199 --> 00:00:34.039
<v Speaker 2>theory anymore. It's becoming a really powerful practical tool. It's

11
00:00:34.079 --> 00:00:38.280
<v Speaker 2>fundamentally changing cybersecurity, especially in the, well, critical area of penetration testing.

12
00:00:38.880 --> 00:00:40.759
<v Speaker 2>Our mission, really, is to take you on a journey. We'll

13
00:00:40.799 --> 00:00:43.159
<v Speaker 2>look at the core ideas of RL, how it's actually

14
00:00:43.240 --> 00:00:46.359
<v Speaker 2>being used in cyber ops, highlight the challenges, the clever

15
00:00:46.399 --> 00:00:48.799
<v Speaker 2>solutions popping up, and then look at the real world

16
00:00:48.920 --> 00:00:51.840
<v Speaker 2>uses and this future of AI combating AI. The goal

17
00:00:51.920 --> 00:00:54.039
<v Speaker 2>is simple, give you a shortcut to being genuinely well

18
00:00:54.079 --> 00:00:58.039
<v Speaker 2>informed, maybe offer some surprising insights, practical takeaways. Okay,

19
00:00:58.079 --> 00:01:00.960
<v Speaker 2>so with that laid out, let's unpack this. Traditional penetration testing.

20
00:01:01.039 --> 00:01:04.840
<v Speaker 1>It's absolutely vital for securing our digital world, right, but

21
00:01:04.879 --> 00:01:07.280
<v Speaker 1>it also, let's be honest, often seems like a slow,

22
00:01:07.400 --> 00:01:10.079
<v Speaker 1>manual and incredibly complex undertaking. Is that fair?

23
00:01:10.200 --> 00:01:12.519
<v Speaker 3>Oh, absolutely. That's right at the heart of the challenge.

24
00:01:12.560 --> 00:01:16.760
<v Speaker 3>Pen testing, yeah, it is where these highly technical red

25
00:01:16.840 --> 00:01:21.239
<v Speaker 3>teams simulate real attacks, trying to find the holes in

26
00:01:21.280 --> 00:01:25.519
<v Speaker 3>an organization's defenses. And it's crucial. I mean identifying weaknesses,

27
00:01:25.640 --> 00:01:29.640
<v Speaker 3>prioritizing where you spend your security budget, tuning defenses, meeting

28
00:01:29.680 --> 00:01:32.519
<v Speaker 3>compliance like PCI or HIPAA. For the execs, it's

29
00:01:32.599 --> 00:01:35.840
<v Speaker 3>risk management, reputation. For dev teams, it's baking security in

30
00:01:35.879 --> 00:01:39.319
<v Speaker 3>from the start, super important stuff. But what's fascinating here

31
00:01:39.400 --> 00:01:43.480
<v Speaker 3>and a bit problematic, is how this critical process, which

32
00:01:43.519 --> 00:01:46.879
<v Speaker 3>is really labor intensive too, struggles. It struggles with the

33
00:01:46.920 --> 00:01:49.959
<v Speaker 3>sheer volume of data in modern networks. You know, despite

34
00:01:50.000 --> 00:01:52.959
<v Speaker 3>having brilliant human testers, the results are often found through

35
00:01:53.079 --> 00:01:57.079
<v Speaker 3>well, manual, tedious means. It's just an overabundance of information: logs,

36
00:01:57.079 --> 00:01:59.799
<v Speaker 3>network endpoints. It's overwhelming. Even when you use automated tools,

37
00:01:59.840 --> 00:02:01.879
<v Speaker 3>making sense of it all, that's still a huge challenge

38
00:02:01.920 --> 00:02:04.040
<v Speaker 3>for the analyst. So it really begs the question, doesn't it?

39
00:02:04.040 --> 00:02:07.200
<v Speaker 3>How can we possibly scale human expertise to keep up

40
00:02:07.239 --> 00:02:09.280
<v Speaker 3>with this constantly growing threat landscape?

41
00:02:09.400 --> 00:02:12.400
<v Speaker 1>Right, and that sounds like the perfect entry point for AI,

42
00:02:12.479 --> 00:02:17.240
<v Speaker 1>specifically reinforcement learning as this force multiplier you mentioned. So

43
00:02:17.319 --> 00:02:20.439
<v Speaker 1>how does RL actually step in? What lets it chew

44
00:02:20.599 --> 00:02:24.039
<v Speaker 1>through these mountains of data that swamp human teams?

45
00:02:24.319 --> 00:02:28.039
<v Speaker 3>Well, AI capabilities, they've just improved so dramatically. RL models

46
00:02:28.039 --> 00:02:31.039
<v Speaker 3>can now sift through, I mean, mountains of data that

47
00:02:31.199 --> 00:02:34.960
<v Speaker 3>maybe was ignored before. They find patterns, anomalies, these sort

48
00:02:35.000 --> 00:02:39.680
<v Speaker 3>of graph linked epiphanies. It just massively accelerates the ability

49
00:02:39.759 --> 00:02:42.360
<v Speaker 3>to spot and stop bad actors. It really is about

50
00:02:42.439 --> 00:02:45.400
<v Speaker 3>using AI to combat AI, or at least AI to

51
00:02:45.439 --> 00:02:48.719
<v Speaker 3>combat the complexity that modern systems and threats bring.

52
00:02:49.000 --> 00:02:52.199
<v Speaker 1>Okay, let's break down RL itself. The taxi driver analogy

53
00:02:52.240 --> 00:02:55.080
<v Speaker 1>is pretty classic, right, helps make it concrete. Imagine you're

54
00:02:55.080 --> 00:02:58.240
<v Speaker 1>a taxi driver. Your goal maximize fares in a city,

55
00:02:58.360 --> 00:03:01.400
<v Speaker 1>that whole city, the traffic, passengers, time of day, that's the environment.

56
00:03:01.400 --> 00:03:04.439
<v Speaker 3>Exactly, and your states are your current situation, like

57
00:03:04.479 --> 00:03:06.599
<v Speaker 3>where your taxi is right now, the time, the weather.

58
00:03:06.759 --> 00:03:09.159
<v Speaker 3>In cyber terms, that translates to things like the network

59
00:03:09.159 --> 00:03:11.800
<v Speaker 3>config, maybe a host status. Is it up? Have

60
00:03:11.800 --> 00:03:13.560
<v Speaker 3>we scanned it? What access level do we have?

61
00:03:13.840 --> 00:03:16.960
<v Speaker 1>Then you have actions, the choices the driver makes. Go downtown,

62
00:03:17.199 --> 00:03:21.039
<v Speaker 1>wait at the station. In cyber, that's your scans, your exploits,

63
00:03:21.159 --> 00:03:23.879
<v Speaker 1>trying to get higher privileges on a machine you've popped.

64
00:03:23.719 --> 00:03:27.560
<v Speaker 3>And absolutely crucial for learning: the reward. That's the feedback

65
00:03:27.599 --> 00:03:31.759
<v Speaker 3>for the driver. It's the fare, simple enough. In cybersecurity simulations,

66
00:03:31.800 --> 00:03:34.840
<v Speaker 3>it's often framed as costs or penalties for certain actions,

67
00:03:35.120 --> 00:03:37.280
<v Speaker 3>maybe a big lump sum reward for hitting a key

68
00:03:37.280 --> 00:03:39.280
<v Speaker 3>objective like getting domain admin.

69
00:03:39.439 --> 00:03:42.439
<v Speaker 1>But here's a really elegant part, I think: Markov decision

70
00:03:42.479 --> 00:03:46.319
<v Speaker 1>processes MDPs. Instead of the driver needing to remember every

71
00:03:46.360 --> 00:03:49.280
<v Speaker 1>single fare they've ever collected to decide where to go next,

72
00:03:49.360 --> 00:03:52.800
<v Speaker 1>which would be crazy, MDPs simplify things. They focus on

73
00:03:52.840 --> 00:03:55.960
<v Speaker 1>the present moment, the here and now. This lets the agent,

74
00:03:56.080 --> 00:03:59.639
<v Speaker 1>our driver or our cyber agent, make quick, informed decisions

75
00:03:59.680 --> 00:04:02.039
<v Speaker 1>based on the current state, not the entire history. It's

76
00:04:02.039 --> 00:04:03.960
<v Speaker 1>about what matters right now. Makes sense.

77
00:04:03.759 --> 00:04:07.000
<v Speaker 3>It does, it makes the problem tractable, and finally

78
00:04:07.080 --> 00:04:10.240
<v Speaker 3>you have the objective function. This is the mathematical goal

79
00:04:10.280 --> 00:04:14.759
<v Speaker 3>the agent tries to maximize: total fare, maybe. And often

80
00:04:14.840 --> 00:04:19.040
<v Speaker 3>it's a discounted sum of future rewards, meaning rewards you

81
00:04:19.120 --> 00:04:21.560
<v Speaker 3>might get way down the line are seen as less

82
00:04:21.639 --> 00:04:24.439
<v Speaker 3>valuable than rewards you can get right now. It reflects

83
00:04:24.480 --> 00:04:28.680
<v Speaker 3>that real world trade off, immediate gains often feel more important.

84
00:04:28.959 --> 00:04:31.519
<v Speaker 1>So here's where it gets really interesting. Imagine teaching a

85
00:04:31.560 --> 00:04:35.040
<v Speaker 1>computer to think like a hacker by letting it continuously

86
00:04:35.120 --> 00:04:38.399
<v Speaker 1>interact with a simulated network. That's essentially what reinforcement learning

87
00:04:38.399 --> 00:04:41.120
<v Speaker 1>allows us to do in cybersecurity, and you mentioned combining

88
00:04:41.120 --> 00:04:43.319
<v Speaker 1>this with neural networks that gets us into deep reinforcement

89
00:04:43.439 --> 00:04:44.279
<v Speaker 1>learning, DRL.

90
00:04:44.680 --> 00:04:48.959
<v Speaker 3>Exactly. DRL uses those neural nets to handle, well, incredibly

91
00:04:49.000 --> 00:04:53.680
<v Speaker 3>complex inputs and figure out sophisticated strategies, or policies. And

92
00:04:53.720 --> 00:04:58.079
<v Speaker 3>specific algorithms, you mentioned PPO, proximal policy optimization. That one,

93
00:04:58.160 --> 00:05:00.959
<v Speaker 3>along with others like DQN or A2C, has

94
00:05:01.000 --> 00:05:05.079
<v Speaker 3>been really key. PPO especially brought a lot more stability

95
00:05:05.079 --> 00:05:08.040
<v Speaker 3>and efficiency to the training process. It lets us actually

96
00:05:08.120 --> 00:05:12.560
<v Speaker 3>apply these powerful learning methods to really large complex network

97
00:05:12.600 --> 00:05:15.480
<v Speaker 3>simulations without things going completely off the rails.

98
00:05:15.519 --> 00:05:18.240
<v Speaker 1>Okay, so the theory sounds powerful, but how do we

99
00:05:18.279 --> 00:05:22.800
<v Speaker 1>actually connect this theoretical AI agent to a real, messy,

100
00:05:22.839 --> 00:05:25.120
<v Speaker 1>live network. That sounds like a huge leap.

101
00:05:25.240 --> 00:05:27.120
<v Speaker 3>It is a huge leap, and that's where this grounding

102
00:05:27.199 --> 00:05:28.839
<v Speaker 3>problem comes in. You have to make sure the AI's

103
00:05:28.920 --> 00:05:32.680
<v Speaker 3>understanding, its representation of reality, is actually tied accurately

104
00:05:32.720 --> 00:05:33.839
<v Speaker 3>to the system it's interacting with.

105
00:05:34.000 --> 00:05:36.199
<v Speaker 1>Right, how do you bridge that gap between the clean

106
00:05:36.279 --> 00:05:39.720
<v Speaker 1>model and the, well, the chaos of a real corporate network?

107
00:05:39.839 --> 00:05:43.199
<v Speaker 3>The key approach involves a high level architecture. It's often

108
00:05:43.240 --> 00:05:48.639
<v Speaker 3>called something like the layered reference model or LRMSHRAG. Think

109
00:05:48.639 --> 00:05:51.000
<v Speaker 3>of it like building layers of maps for the AI,

110
00:05:51.240 --> 00:05:54.519
<v Speaker 3>each one adding more detail. First, you take info from

111
00:05:54.519 --> 00:05:57.360
<v Speaker 3>the real network and abstract it into an attack graph.

112
00:05:57.759 --> 00:06:01.120
<v Speaker 3>This graph then becomes the foundation for the Markov decision process,

113
00:06:01.399 --> 00:06:05.040
<v Speaker 3>the MDP. That's the environment the RL agent actually learns inside.

114
00:06:05.160 --> 00:06:07.240
<v Speaker 1>Okay, an attack graph is the base map.

115
00:06:07.199 --> 00:06:10.399
<v Speaker 3>Exactly, but the crucial part is layering more context onto

116
00:06:10.399 --> 00:06:14.600
<v Speaker 3>that basic MDP. First, there's a terrain MDP. This layer

117
00:06:14.639 --> 00:06:19.199
<v Speaker 3>adds concepts of cyber terrain, so firewalls become obstacles. Maybe

118
00:06:19.199 --> 00:06:22.480
<v Speaker 3>an intrusion detection system, an IDS, has a field of fire.

119
00:06:22.920 --> 00:06:26.759
<v Speaker 3>It borrows from military ideas. Actually, intelligence preparation of the battlefield,

120
00:06:27.120 --> 00:06:29.639
<v Speaker 3>understanding the environment to predict moves.

121
00:06:29.560 --> 00:06:31.879
<v Speaker 1>So mapping the cyber landscape strategically. Makes sense.

122
00:06:31.959 --> 00:06:35.079
<v Speaker 3>Then you add an adversary MDP. This layer tailors the

123
00:06:35.160 --> 00:06:38.439
<v Speaker 3>environment to specific types of attackers, maybe using node attack

124
00:06:38.519 --> 00:06:41.360
<v Speaker 3>templates or reflecting the capabilities of your own red team.

125
00:06:41.480 --> 00:06:44.240
<v Speaker 1>So modeling different kinds of threats.

126
00:06:45.040 --> 00:06:48.879
<v Speaker 3>Precisely. And finally, a task MDP. This refines the whole setup

127
00:06:48.879 --> 00:06:52.279
<v Speaker 3>for specific goals. Are you doing crown jewel analysis? Trying

128
00:06:52.360 --> 00:06:56.759
<v Speaker 3>to find exfiltration paths? The task shapes the environment and rewards,

129
00:06:57.199 --> 00:07:01.160
<v Speaker 3>and importantly, as networks change or tasks change, these agents

130
00:07:01.160 --> 00:07:03.560
<v Speaker 3>don't always have to start from scratch. They can use

131
00:07:03.560 --> 00:07:06.480
<v Speaker 3>transfer learning to share knowledge between tasks, or even meta

132
00:07:06.519 --> 00:07:10.319
<v Speaker 3>learning, basically learning how to learn more efficiently, to adapt quickly.

133
00:07:10.680 --> 00:07:13.079
<v Speaker 3>So this whole layered approach, that's how we connect the

134
00:07:13.079 --> 00:07:15.360
<v Speaker 3>theory to the practice. It gives the structure needed to

135
00:07:15.399 --> 00:07:19.240
<v Speaker 3>make these AI driven cyber operations actually feasible on real networks.

136
00:07:19.360 --> 00:07:21.920
<v Speaker 1>Okay, but even with that structure, there must be massive

137
00:07:22.000 --> 00:07:25.800
<v Speaker 1>practical challenges. You mentioned scaling earlier. Real companies have networks

138
00:07:25.839 --> 00:07:28.240
<v Speaker 1>with what, tens of thousands of machines?

139
00:07:27.879 --> 00:07:30.519
<v Speaker 3>Oh, easily. Tens of thousands of hosts is

140
00:07:30.519 --> 00:07:33.360
<v Speaker 3>not uncommon in large enterprises, and that scale is a

141
00:07:33.439 --> 00:07:36.360
<v Speaker 3>huge problem for RL models. If your model doesn't scale well,

142
00:07:36.360 --> 00:07:40.639
<v Speaker 3>it becomes incredibly computationally expensive. Training takes forever, or maybe

143
00:07:40.680 --> 00:07:43.519
<v Speaker 3>just won't converge, meaning it never settles on a good strategy,

144
00:07:43.959 --> 00:07:47.439
<v Speaker 3>or worse, the reward signal just keeps bouncing around wildly,

145
00:07:47.600 --> 00:07:52.120
<v Speaker 3>never improving. The simulation becomes useless, slower than just using

146
00:07:52.160 --> 00:07:52.839
<v Speaker 3>a human team.

147
00:07:52.600 --> 00:07:56.000
<v Speaker 1>And the attack graphs themselves must explode in size.

148
00:07:55.519 --> 00:07:59.680
<v Speaker 3>Exponentially. Traditional attack graph generation just blows up

149
00:07:59.720 --> 00:08:03.279
<v Speaker 3>as you add hosts. You end up with these unbelievably vast,

150
00:08:03.800 --> 00:08:07.560
<v Speaker 3>complex decision spaces for the RL agent to explore. It's

151
00:08:07.600 --> 00:08:09.600
<v Speaker 3>like going from tic tac toe to I don't know,

152
00:08:09.879 --> 00:08:12.319
<v Speaker 3>4D chess with millions of pieces.

153
00:08:12.399 --> 00:08:14.680
<v Speaker 1>Wow. Okay, So how on earth do you make that

154
00:08:14.759 --> 00:08:17.480
<v Speaker 1>manageable for an AI? How do you simplify the choices?

155
00:08:17.680 --> 00:08:20.519
<v Speaker 3>That's where action space simplification comes in. You have to

156
00:08:20.519 --> 00:08:25.040
<v Speaker 3>make the problem tractable. Strategies include things like reducing the dimensions,

157
00:08:25.519 --> 00:08:28.360
<v Speaker 3>maybe focusing only on the most relevant actions at any

158
00:08:28.360 --> 00:08:32.240
<v Speaker 3>given point, or combining similar actions into more general ones.

159
00:08:32.679 --> 00:08:36.000
<v Speaker 3>Using hierarchical action spaces is another key idea: you teach

160
00:08:36.000 --> 00:08:39.480
<v Speaker 3>the agent high level goals first, like gain access to

161
00:08:39.600 --> 00:08:43.000
<v Speaker 3>subnet X, before it learns the specific low level steps.

162
00:08:43.360 --> 00:08:44.399
<v Speaker 3>It's about smart abstraction.

163
00:08:44.159 --> 00:08:46.559
<v Speaker 1>Makes sense, giving it a better way to think

164
00:08:46.559 --> 00:08:50.279
<v Speaker 1>about its options. What about the realism challenge, especially with rewards?

165
00:08:50.480 --> 00:08:53.320
<v Speaker 1>You need the AI to value things like a real attacker, right?

166
00:08:53.639 --> 00:08:57.320
<v Speaker 1>You mentioned CVSS scores earlier, the zero to ten vulnerability

167
00:08:57.399 --> 00:08:58.679
<v Speaker 1>rating that, you said, has limits.

168
00:08:58.559 --> 00:09:02.159
<v Speaker 3>Big limits in this context. CVSS is standardized,

169
00:09:02.200 --> 00:09:05.240
<v Speaker 3>which is good, but it focuses purely on technical severity.

170
00:09:05.279 --> 00:09:08.919
<v Speaker 3>It often lacks crucial context, like what's the actual business

171
00:09:09.039 --> 00:09:11.639
<v Speaker 3>value of the data on that server? Or are there compensating

172
00:09:11.639 --> 00:09:15.039
<v Speaker 3>security controls already in place. It's also static, it doesn't change,

173
00:09:15.159 --> 00:09:17.320
<v Speaker 3>and it doesn't really capture human factors.

174
00:09:17.519 --> 00:09:19.879
<v Speaker 1>So a critical vulnerability on a test server isn't the

175
00:09:19.879 --> 00:09:22.840
<v Speaker 1>same risk as a medium one on the main financial

176
00:09:22.919 --> 00:09:24.200
<v Speaker 1>database.

177
00:09:24.320 --> 00:09:27.679
<v Speaker 3>Exactly. CVSS doesn't capture that nuance. It's not really a measure

178
00:09:27.679 --> 00:09:31.799
<v Speaker 3>of risk, just technical severity, and it definitely doesn't generalize

179
00:09:31.840 --> 00:09:35.600
<v Speaker 3>well to evaluating an entire attack path with multiple steps.

180
00:09:35.720 --> 00:09:39.440
<v Speaker 1>So how do you inject that realism, that context?

181
00:09:39.559 --> 00:09:42.919
<v Speaker 3>Well, real attackers think holistically, don't they. They weigh factors

182
00:09:42.960 --> 00:09:46.480
<v Speaker 3>beyond just the technical vulnerability. They look at the cyber terrain:

183
00:09:46.639 --> 00:09:50.919
<v Speaker 3>firewalls, IDS, detection potential. So the reward system needs to

184
00:09:50.919 --> 00:09:53.759
<v Speaker 3>mimic that. We need to build in that contextual awareness.

185
00:09:54.120 --> 00:09:57.120
<v Speaker 3>One way is using these service-based penalties. We talked

186
00:09:57.159 --> 00:10:00.320
<v Speaker 3>about them: assigning different negative rewards or costs based on

187
00:10:00.360 --> 00:10:03.720
<v Speaker 3>the type of service being attacked. Like, attacking authentication services

188
00:10:03.799 --> 00:10:05.879
<v Speaker 3>might get a minus six penalty, hitting data

189
00:10:05.919 --> 00:10:08.600
<v Speaker 3>services minus four, maybe security or common services minus

190
00:10:08.639 --> 00:10:12.320
<v Speaker 3>two. The exact numbers are relative, tuned for the simulation,

191
00:10:12.759 --> 00:10:15.840
<v Speaker 3>but they reflect the proportional risk to the organization, higher

192
00:10:15.840 --> 00:10:17.720
<v Speaker 3>penalty for hitting more critical services.

193
00:10:17.919 --> 00:10:22.600
<v Speaker 1>Got it. So penalties reflecting business impact, essentially. Now, bringing

194
00:10:22.600 --> 00:10:25.840
<v Speaker 1>this all together, the scaling, the realism, what's the approach

195
00:10:25.879 --> 00:10:28.559
<v Speaker 1>that's really making this work in practice? The workhorse solution?

196
00:10:28.919 --> 00:10:31.840
<v Speaker 3>A really promising combination that's emerged is known as double

197
00:10:31.879 --> 00:10:36.000
<v Speaker 3>agent plus PPO, or DAPPO. It starts with the double

198
00:10:36.080 --> 00:10:39.600
<v Speaker 3>agent architecture, the DAA. Instead of one monolithic AI trying

199
00:10:39.600 --> 00:10:43.919
<v Speaker 3>to figure everything out, you have two specialized agents working together.

200
00:10:44.399 --> 00:10:47.559
<v Speaker 3>There's an exploration agent whose job is to decide which

201
00:10:47.559 --> 00:10:50.639
<v Speaker 3>host to target next, and then there's an exploitation agent

202
00:10:50.679 --> 00:10:53.919
<v Speaker 3>that decides which specific action or exploit to use on

203
00:10:54.000 --> 00:10:54.840
<v Speaker 3>that chosen host.

204
00:10:54.960 --> 00:10:58.279
<v Speaker 1>Ah, so like a team: one doing recon and target selection,

205
00:10:58.440 --> 00:11:00.720
<v Speaker 1>the other handling the actual attack execution.

206
00:11:00.879 --> 00:11:05.240
<v Speaker 3>Precisely, this decomposition makes the whole learning problem much more tractable.

207
00:11:05.600 --> 00:11:09.440
<v Speaker 3>Each agent has a smaller, more focused learning space, and importantly,

208
00:11:09.480 --> 00:11:13.480
<v Speaker 3>it's quite conceptually sound from an attacker's perspective. Real attackers

209
00:11:13.519 --> 00:11:15.240
<v Speaker 3>often think in terms of where do I go next?

210
00:11:15.279 --> 00:11:16.720
<v Speaker 3>And then what do I do once I'm there?

211
00:11:16.919 --> 00:11:20.679
<v Speaker 1>Okay, that makes intuitive sense. Splitting the problem helps. And

212
00:11:20.720 --> 00:11:23.559
<v Speaker 1>the PPO part, proximal policy optimization?

213
00:11:23.799 --> 00:11:26.960
<v Speaker 3>That's the other key piece. Applying PPO to both of

214
00:11:27.000 --> 00:11:30.320
<v Speaker 3>these agents provides the stability and efficiency we talked about earlier.

215
00:11:31.000 --> 00:11:34.120
<v Speaker 3>PPO is just much better than some older algorithms like

216
00:11:34.279 --> 00:11:37.679
<v Speaker 3>say A2C, especially for complex problems. It gives

217
00:11:37.679 --> 00:11:42.320
<v Speaker 3>you stability, robustness, and sample efficiency, less data needed to learn,

218
00:11:42.720 --> 00:11:46.399
<v Speaker 3>less likely to get stuck. And this combination, the double

219
00:11:46.440 --> 00:11:49.639
<v Speaker 3>agent architecture powered by PPO, is what has really enabled

220
00:11:49.639 --> 00:11:53.720
<v Speaker 3>these systems to scale effectively to networks of thousands of nodes.

221
00:11:54.279 --> 00:11:57.120
<v Speaker 3>It keeps the learning stable even in huge environments.

222
00:11:57.200 --> 00:11:59.639
<v Speaker 1>So essentially, instead of one AI trying to do everything,

223
00:12:00.039 --> 00:12:02.519
<v Speaker 1>we're giving it a specialized team and a really smart,

224
00:12:02.639 --> 00:12:05.519
<v Speaker 1>stable way to learn. Allows it to tackle networks far

225
00:12:05.639 --> 00:12:08.679
<v Speaker 1>larger than before. It's like having that reconnaissance expert and

226
00:12:08.759 --> 00:12:12.159
<v Speaker 1>an exploit expert working together, powered by the best learning methods.

227
00:12:12.240 --> 00:12:13.279
<v Speaker 3>That's a great way to put it.

228
00:12:13.360 --> 00:12:15.559
<v Speaker 1>Okay, So these aren't just lab experiments. You're saying, this

229
00:12:15.720 --> 00:12:19.799
<v Speaker 1>DAPPO approach, and these layered models, are actually being used

230
00:12:19.799 --> 00:12:22.399
<v Speaker 1>now for real cybersecurity tasks.

231
00:12:22.639 --> 00:12:26.679
<v Speaker 3>Yes, absolutely, we're seeing RL applied in several practical ways.

232
00:12:27.159 --> 00:12:32.200
<v Speaker 3>One key area is crown jewels analysis, or CJA-RL. Here,

233
00:12:32.519 --> 00:12:35.399
<v Speaker 3>RL models are trained specifically to find the most effective,

234
00:12:35.519 --> 00:12:40.840
<v Speaker 3>often the stealthiest paths to compromise an organization's highest value assets.

235
00:12:40.960 --> 00:12:44.759
<v Speaker 1>Their crown jewels. So finding the quickest way to the

236
00:12:44.799 --> 00:12:45.919
<v Speaker 1>most important stuff.

237
00:12:45.519 --> 00:12:48.879
<v Speaker 3>Not just the quickest, but often the path of

238
00:12:48.960 --> 00:12:53.080
<v Speaker 3>least resistance or least detection. The insights you get provide

239
00:12:53.080 --> 00:12:56.960
<v Speaker 3>a really nuanced understanding of attackers' methods of discreetly navigating

240
00:12:56.960 --> 00:13:00.000
<v Speaker 3>through networks. It can reveal attack paths you simply wouldn't

241
00:13:00.080 --> 00:13:00.679
<v Speaker 3>have thought of manually.

242
00:13:00.519 --> 00:13:03.679
<v Speaker 1>Exposing those hidden routes. What else?

243
00:13:03.759 --> 00:13:08.279
<v Speaker 3>Another big one is discovering exfiltration paths. The focus here shifts.

244
00:13:08.480 --> 00:13:11.320
<v Speaker 3>It's not about getting in anymore, but about how attackers

245
00:13:11.320 --> 00:13:13.759
<v Speaker 3>get sensitive data out after a breach while trying to

246
00:13:13.840 --> 00:13:17.960
<v Speaker 3>minimize detection. Ah, a getaway plan. Exactly. The model has to

247
00:13:18.000 --> 00:13:22.320
<v Speaker 3>consider things like protocol and payload considerations. Agents might learn,

248
00:13:22.360 --> 00:13:25.440
<v Speaker 3>for example, to use specific protocols, like tunneling exfil traffic

249
00:13:25.440 --> 00:13:29.120
<v Speaker 3>through the Domain Name System, DNS, because DNS traffic often looks

250
00:13:29.159 --> 00:13:32.960
<v Speaker 3>benign and isn't heavily scrutinized. They can also learn

251
00:13:32.960 --> 00:13:36.399
<v Speaker 3>to use strategic pauses to avoid detection, mimicking low and

252
00:13:36.519 --> 00:13:39.759
<v Speaker 3>slow techniques, or maybe they learn to stick to just

253
00:13:40.039 --> 00:13:44.279
<v Speaker 3>one protocol consistently to better blend in with benign or

254
00:13:44.320 --> 00:13:48.799
<v Speaker 3>otherwise unmonitored traffic. It's about modeling that stealthy data theft.

255
00:13:48.919 --> 00:13:51.399
<v Speaker 1>That's fascinating. And it keeps going?

256
00:13:51.759 --> 00:13:54.480
<v Speaker 3>Oh yes. Another application is discovering command and control channels.

257
00:13:54.279 --> 00:13:58.039
<v Speaker 1>C2 channels, right? The phone-home mechanism for malware.

258
00:13:58.159 --> 00:14:01.879
<v Speaker 3>Precisely, these are the pathways that malware, once it's inside

259
00:14:01.919 --> 00:14:05.279
<v Speaker 3>and undetected, uses to get instructions from its operator and

260
00:14:05.399 --> 00:14:08.440
<v Speaker 3>send back stolen data or status updates. It has to

261
00:14:08.480 --> 00:14:12.919
<v Speaker 3>execute nefarious tasks under direction. RL agents can learn how

262
00:14:12.919 --> 00:14:15.919
<v Speaker 3>to establish and maintain these channels, figuring out how to

263
00:14:16.000 --> 00:14:20.080
<v Speaker 3>navigate through firewalls, again using strategic pauses, sleep actions, to

264
00:14:20.240 --> 00:14:22.960
<v Speaker 3>lie low and avoid detection. They might even learn optimal

265
00:14:23.039 --> 00:14:26.440
<v Speaker 3>data upload speeds, maybe consistently choosing fast upload options

266
00:14:26.480 --> 00:14:29.559
<v Speaker 3>over slow if the coast seems clear, balancing speed against the

267
00:14:29.639 --> 00:14:32.600
<v Speaker 3>risk of setting off alarms. It reveals how persistent threats

268
00:14:32.639 --> 00:14:33.679
<v Speaker 3>maintain their foothold.

269
00:14:33.840 --> 00:14:36.679
<v Speaker 1>Incredible, So mapping out not just the break in, but

270
00:14:36.759 --> 00:14:38.919
<v Speaker 1>the long term occupation and data theft too.

271
00:14:39.240 --> 00:14:42.039
<v Speaker 3>Exactly. And perhaps one of the most advanced applications is

272
00:14:42.480 --> 00:14:46.799
<v Speaker 3>exposing surveillance detection routes or SDRs. This is like super

273
00:14:46.840 --> 00:14:50.440
<v Speaker 3>advanced reconnaissance. The goal is to find paths an attacker

274
00:14:50.480 --> 00:14:54.120
<v Speaker 3>could use to gain maximum surveillance exposure, learn as much

275
00:14:54.120 --> 00:14:58.559
<v Speaker 3>as possible about the network while simultaneously minimizing opportunities of

276
00:14:58.600 --> 00:15:01.759
<v Speaker 3>being detected. The ultimate stealth recon.

277
00:15:01.600 --> 00:15:04.240
<v Speaker 1>Maximum info, minimum footprint. How does that work?

278
00:15:04.519 --> 00:15:07.279
<v Speaker 3>One really interesting technique used here is a warm up

279
00:15:07.279 --> 00:15:11.200
<v Speaker 3>phase. Before the RL agent starts actually learning and updating

280
00:15:11.200 --> 00:15:14.879
<v Speaker 3>its strategy based on rewards, it first explores areas of

281
00:15:14.879 --> 00:15:18.480
<v Speaker 3>the network deemed safe to explore without changing its internal weights.

282
00:15:18.919 --> 00:15:21.799
<v Speaker 3>It just gathers initial information cautiously.

283
00:15:21.759 --> 00:15:24.879
<v Speaker 1>Like a human operator, carefully mapping out the surroundings before

284
00:15:24.919 --> 00:15:26.120
<v Speaker 1>making any risky moves.

285
00:15:26.240 --> 00:15:29.559
<v Speaker 3>Exactly like that. It mimics that initial caution. This warm

286
00:15:29.679 --> 00:15:32.440
<v Speaker 3>up sets the stage for more efficient and targeted learning

287
00:15:32.519 --> 00:15:32.879
<v Speaker 3>later on.

288
00:15:33.279 --> 00:15:35.440
<v Speaker 1>And does this also show different attacker styles?

289
00:15:35.679 --> 00:15:39.639
<v Speaker 3>Yes, very clearly. By adjusting the penalty scales how much

290
00:15:39.799 --> 00:15:42.840
<v Speaker 3>the agent is punished for potentially being detected, you can

291
00:15:42.879 --> 00:15:47.159
<v Speaker 3>simulate different adversary behaviors and different levels of risk aversion.

292
00:15:47.600 --> 00:15:50.399
<v Speaker 3>For instance, with a low penalty scale, say a value

293
00:15:50.399 --> 00:15:53.039
<v Speaker 3>of one, the agent acts more like a smash and

294
00:15:53.080 --> 00:15:56.960
<v Speaker 3>grab operator or maybe a less experienced attacker. It might

295
00:15:56.960 --> 00:16:00.000
<v Speaker 3>perform noisy scans, not caring as much about stealth.

296
00:16:00.279 --> 00:16:01.879
<v Speaker 1>Okay, the loud attacker, right.

297
00:16:02.279 --> 00:16:04.279
<v Speaker 3>But if you crank up the penalty scale maybe two

298
00:16:04.279 --> 00:16:07.240
<v Speaker 3>to eleven, the agent starts behaving very differently. It acts

299
00:16:07.279 --> 00:16:10.399
<v Speaker 3>more like highly competent actors like nation state actors or

300
00:16:10.440 --> 00:16:15.320
<v Speaker 3>APTs, advanced persistent threats. It displays highly risk-averse behavior,

301
00:16:15.480 --> 00:16:18.879
<v Speaker 3>chooses the most direct paths that minimize exposure, tries to

302
00:16:18.879 --> 00:16:21.759
<v Speaker 3>minimize its overall footprint. It becomes incredibly stealthy.

303
00:16:21.919 --> 00:16:25.240
<v Speaker 1>So you can model specific threat actors, from script kiddies

304
00:16:25.279 --> 00:16:28.759
<v Speaker 1>to spies just by tuning the AI's aversion to risk.

305
00:16:28.840 --> 00:16:33.360
<v Speaker 3>That's the idea. It allows defenders to anticipate the specific tactics, techniques,

306
00:16:33.399 --> 00:16:36.960
<v Speaker 3>and procedures, the TTPs, associated with different adversary profiles.

307
00:16:37.159 --> 00:16:40.559
<v Speaker 1>These aren't just theoretical models. They're literally showing us how

308
00:16:40.720 --> 00:16:43.320
<v Speaker 1>attackers might move through a network, whether they're looking for

309
00:16:43.440 --> 00:16:46.600
<v Speaker 1>the most valuable data or trying to stay hidden. It's like

310
00:16:46.679 --> 00:16:50.559
<v Speaker 1>having a crystal ball for cyber defense, revealing attacker TTPs

311
00:16:50.879 --> 00:16:53.519
<v Speaker 1>even before they strike. It's quite remarkable.

312
00:16:53.759 --> 00:16:55.759
<v Speaker 3>It really shifts the perspective for defenders.

313
00:16:56.320 --> 00:16:59.600
<v Speaker 1>So, looking ahead, what does this all mean for the future?

314
00:16:59.600 --> 00:17:03.600
<v Speaker 1>We're in this AI versus AI situation or heading deeper

315
00:17:03.600 --> 00:17:07.599
<v Speaker 1>into it. What are the next frontiers beyond these simulation applications?

316
00:17:07.839 --> 00:17:11.480
<v Speaker 3>Well, the applications are expanding rapidly. We're seeing AI, including

317
00:17:11.559 --> 00:17:15.359
<v Speaker 3>RL principles, move more into active threat detection, shifting away

318
00:17:15.400 --> 00:17:20.000
<v Speaker 3>from just relying on known signatures of malware towards behavioral

319
00:17:20.000 --> 00:17:25.000
<v Speaker 3>based detection, using sophisticated ML to spot anomalies, unusual patterns

320
00:17:25.000 --> 00:17:28.039
<v Speaker 3>of activity that might indicate a novel, never before seen threat.

321
00:17:28.480 --> 00:17:31.799
<v Speaker 3>Protecting against the unknown unknowns.

322
00:17:31.440 --> 00:17:35.200
<v Speaker 1>So spotting bad behavior even if you don't recognize a specific tool.

323
00:17:35.240 --> 00:17:38.960
<v Speaker 3>Exactly. And related to that is specific ransomware detection. We can

324
00:17:39.000 --> 00:17:43.640
<v Speaker 3>simulate the entire ransomware life cycle, the initial spread, installation, staging,

325
00:17:43.720 --> 00:17:48.519
<v Speaker 3>data encryption, and also simulate defenses like honeypots.

326
00:17:48.160 --> 00:17:50.960
<v Speaker 1>Ah, those decoy systems designed to trap attackers.

327
00:17:51.359 --> 00:17:55.119
<v Speaker 3>Right, AI can help optimize honeypot placement and analyze the

328
00:17:55.119 --> 00:17:57.160
<v Speaker 3>behavior of attackers who fall into them.

329
00:17:57.480 --> 00:18:00.880
<v Speaker 1>What about offense? Can AI actually create new attacks?

330
00:18:01.039 --> 00:18:04.920
<v Speaker 3>That's one of the really disruptive possibilities, the potential for

331
00:18:05.000 --> 00:18:10.000
<v Speaker 3>AI models to perhaps invent new atomic level vulnerabilities, maybe

332
00:18:10.079 --> 00:18:14.839
<v Speaker 3>by fuzzing or analyzing code in novel ways, automating penetration testing,

333
00:18:14.880 --> 00:18:18.519
<v Speaker 3>not just by orchestrating known exploits, but by discovering entirely

334
00:18:18.640 --> 00:18:21.119
<v Speaker 3>new ones at a granular level. That's a big step.

335
00:18:21.200 --> 00:18:23.079
<v Speaker 1>Wow, okay, that's significant. What else on the horizon?

336
00:18:23.079 --> 00:18:27.240
<v Speaker 3>Asset discovery and classification. Imagine AI models that can

337
00:18:27.240 --> 00:18:28.960
<v Speaker 3>infer the role of a server or the type of

338
00:18:29.039 --> 00:18:32.119
<v Speaker 3>data it holds, "ah, there's likely PII in here," just from

339
00:18:32.160 --> 00:18:36.160
<v Speaker 3>analyzing network traffic or scan results, even with limited initial information.

340
00:18:35.799 --> 00:18:38.880
<v Speaker 1>Making sense of the network automatically.

341
00:18:38.480 --> 00:18:43.640
<v Speaker 3>And attribution, assisting human analysts in identifying and assigning responsibility

342
00:18:43.720 --> 00:18:48.599
<v Speaker 3>to threat actors. There's research into using metric learning, essentially

343
00:18:48.680 --> 00:18:52.119
<v Speaker 3>comparing patterns seen in live network data flows and endpoints

344
00:18:52.359 --> 00:18:55.400
<v Speaker 3>against a library of synthetic attack paths generated by RL

345
00:18:55.440 --> 00:18:59.000
<v Speaker 3>agents trained to mimic different known actors. This could potentially

346
00:18:59.079 --> 00:19:03.200
<v Speaker 3>allow for zero ROO attribution identifying a new campaign launched

347
00:19:03.240 --> 00:19:06.000
<v Speaker 3>by a known group even if the specific tools are new.

348
00:19:05.680 --> 00:19:09.039
<v Speaker 1>Identifying the actor behind a novel attack almost immediately.

349
00:19:09.079 --> 00:19:10.000
<v Speaker 1>That would be huge for response.

350
00:19:10.039 --> 00:19:15.359
<v Speaker 3>Game changing. And finally, defensive modeling. Moving beyond static,

351
00:19:15.440 --> 00:19:18.200
<v Speaker 3>pre-programmed responses, like if you see this, block

352
00:19:18.279 --> 00:19:22.200
<v Speaker 3>that IP, towards truly AI-driven defenses that can dynamically

353
00:19:22.240 --> 00:19:25.440
<v Speaker 3>analyze an ongoing attack and choose the optimal countermeasures in

354
00:19:25.480 --> 00:19:29.279
<v Speaker 3>real time, adapting as the attack evolves. Active, intelligent defense.

355
00:19:29.519 --> 00:19:32.119
<v Speaker 1>This really paints a picture of an accelerating arms race.

356
00:19:32.319 --> 00:19:35.079
<v Speaker 1>We're going to see true AI attacks, aren't we? Not

357
00:19:35.119 --> 00:19:37.960
<v Speaker 1>just humans using AI tools, but AI directing the attack?

358
00:19:38.200 --> 00:19:41.599
<v Speaker 3>It seems inevitable. Malicious actors will likely use RL and

359
00:19:41.680 --> 00:19:45.880
<v Speaker 3>other ML techniques to automate complex attack patterns, including the

360
00:19:45.920 --> 00:19:49.880
<v Speaker 3>initial scanning and enumeration phases, which are often tedious. And

361
00:19:49.960 --> 00:19:54.039
<v Speaker 3>think about social engineering. AI could be used for honing, refining,

362
00:19:54.079 --> 00:19:58.480
<v Speaker 3>and using more efficiently these attacks, crafting hyper personalized phishing emails,

363
00:19:58.720 --> 00:20:03.759
<v Speaker 3>maybe even generating realistic, relevant, customized synthetic media, voice,

364
00:20:04.079 --> 00:20:06.839
<v Speaker 3>video for spear phishing or disinformation.

365
00:20:06.319 --> 00:20:08.480
<v Speaker 1>Deepfakes for hacking. That's unsettling.

366
00:20:08.440 --> 00:20:11.039
<v Speaker 3>It is. And attackers could use ML defensively too, in a sense,

367
00:20:11.200 --> 00:20:14.160
<v Speaker 3>observe how defenses like IDS or antivirus react to

368
00:20:14.200 --> 00:20:16.799
<v Speaker 3>their probes and then use that feedback to craft malware

369
00:20:16.960 --> 00:20:20.119
<v Speaker 3>or just simply hone their techniques to avoid detection, learning

370
00:20:20.200 --> 00:20:22.720
<v Speaker 3>to bypass our security controls.

371
00:20:22.599 --> 00:20:25.480
<v Speaker 1>So the AI learns how to be invisible to our AI defenses.

372
00:20:25.759 --> 00:20:28.880
<v Speaker 3>That's the adversarial dynamic, and a key challenge for a

373
00:20:28.920 --> 00:20:32.359
<v Speaker 3>defensive AI is generalization. How do you get an RL

374
00:20:32.480 --> 00:20:36.119
<v Speaker 3>model trained on one network simulation to perform well on

375
00:20:36.160 --> 00:20:39.400
<v Speaker 3>a completely different real-world network it's never seen before?

376
00:20:40.000 --> 00:20:42.720
<v Speaker 3>That's where techniques like meta-learning, learning how to learn,

377
00:20:42.799 --> 00:20:46.359
<v Speaker 3>how to adapt quickly to new environments, become absolutely critical.

378
00:20:46.880 --> 00:20:51.839
<v Speaker 3>And this raises a fascinating, maybe provocative thought. Cybersecurity might

379
00:20:51.880 --> 00:20:55.720
<v Speaker 3>be uniquely suited for AI evolution. How so? Think about it.

380
00:20:55.720 --> 00:20:59.599
<v Speaker 3>It's perhaps the one domain of AI application that presents

381
00:20:59.680 --> 00:21:03.640
<v Speaker 3>the conditions for true evolution. Why? Because it's AI

382
00:21:03.799 --> 00:21:08.200
<v Speaker 3>existing in its natural environment. It's constantly interacting with other software,

383
00:21:08.240 --> 00:21:11.480
<v Speaker 3>with hardware, with networks, and crucially with other AIs, both

384
00:21:11.519 --> 00:21:15.920
<v Speaker 3>friendly and adversarial. It's a dynamic, competitive ecosystem. It might

385
00:21:15.960 --> 00:21:18.440
<v Speaker 3>be the first place where human intelligence is really forced

386
00:21:18.519 --> 00:21:20.519
<v Speaker 3>to turn the keys over to an AI that truly

387
00:21:20.559 --> 00:21:24.000
<v Speaker 3>surpasses us, simply because the speed and complexity demand it.

388
00:21:24.119 --> 00:21:27.599
<v Speaker 1>That's a huge point, an environment driving AI evolution, because

389
00:21:27.640 --> 00:21:30.480
<v Speaker 1>the stakes are so high and the interaction so constant.

390
00:21:30.480 --> 00:21:33.720
<v Speaker 3>It really raises an important question, doesn't it? As these AIs

391
00:21:33.799 --> 00:21:36.960
<v Speaker 3>become more capable, especially in a competitive space like cyber,

392
00:21:37.480 --> 00:21:40.640
<v Speaker 3>how do we ensure we design them responsibly? The arms

393
00:21:40.720 --> 00:21:43.359
<v Speaker 3>race dynamic likely means we will build them to be

394
00:21:43.440 --> 00:21:46.839
<v Speaker 3>as effective as possible, even if their intelligence isn't human-like.

395
00:21:47.079 --> 00:21:49.079
<v Speaker 3>We need them to serve our defensive purposes.

396
00:21:49.279 --> 00:21:51.759
<v Speaker 1>A profound challenge layered on top of the technical ones.

397
00:21:52.039 --> 00:21:54.319
<v Speaker 1>So let's try to wrap this up. This deep dive,

398
00:21:54.359 --> 00:21:56.759
<v Speaker 1>I think has really shown how reinforcement learning isn't just

399
00:21:56.799 --> 00:22:01.319
<v Speaker 1>tweaking cybersecurity, it's fundamentally transforming it. We're moving away from

400
00:22:01.319 --> 00:22:07.759
<v Speaker 1>these cumbersome, often slow, manual processes towards dynamic AI driven insights,

401
00:22:08.200 --> 00:22:11.319
<v Speaker 1>insights that can mimic, anticipate, and hopefully counter even

402
00:22:11.319 --> 00:22:14.880
<v Speaker 1>the most sophisticated adversaries out there. This ongoing AI versus

403
00:22:14.960 --> 00:22:18.480
<v Speaker 1>AI arms race, well, it has huge implications for everyone,

404
00:22:18.519 --> 00:22:21.240
<v Speaker 1>really. Understanding these evolving capabilities, it's not just for the

405
00:22:21.279 --> 00:22:24.359
<v Speaker 1>cyber pros anymore. It's relevant for anyone who lives or works

406
00:22:24.400 --> 00:22:27.680
<v Speaker 1>in our digital world. So hopefully we've left you, our listener,

407
00:22:27.720 --> 00:22:30.240
<v Speaker 1>with a sense of the incredible potential here, but also

408
00:22:30.720 --> 00:22:33.880
<v Speaker 1>maybe the scale of the challenge and this constant, fascinating

409
00:22:33.920 --> 00:22:36.200
<v Speaker 1>evolution happening right now in our digital defenses.
