WEBVTT 1 00:00:00.080 --> 00:00:01.000 Welcome to the deep dive. 2 00:00:01.000 --> 00:00:02.560 We're here to cut through the noise, pull out the 3 00:00:02.640 --> 00:00:05.839 insights that matter, really just for you. Today, we are 4 00:00:05.919 --> 00:00:10.080 diving deep into something moving incredibly fast, this escalating digital 5 00:00:10.160 --> 00:00:13.560 arms race. I mean, cyber threats are just increasing. They're getting, well, 6 00:00:13.839 --> 00:00:17.800 shockingly sophisticated. You've got nation state actors, advanced machine learning. 7 00:00:18.280 --> 00:00:23.280 They're acting as real force multipliers for the attackers. So 8 00:00:24.519 --> 00:00:27.679 our deep dive today, it's about reinforcement learning, RL. It's 9 00:00:27.719 --> 00:00:31.160 this cutting edge part of AI, and it's not just 10 00:00:31.199 --> 00:00:34.039 theory anymore. It's becoming a really powerful practical tool. It's 11 00:00:34.079 --> 00:00:38.280 fundamentally changing cybersecurity, especially in the, well, critical area of penetration testing. 12 00:00:38.880 --> 00:00:40.759 Our mission really is to take you on a journey. We'll 13 00:00:40.799 --> 00:00:43.159 look at the core ideas of RL, how it's actually 14 00:00:43.240 --> 00:00:46.359 being used in cyber ops, highlight the challenges, the clever 15 00:00:46.399 --> 00:00:48.799 solutions popping up, and then look at the real world 16 00:00:48.920 --> 00:00:51.840 uses in this future of AI combating AI. The goal 17 00:00:51.920 --> 00:00:54.039 is simple: give you a shortcut to being genuinely well 18 00:00:54.079 --> 00:00:58.039 informed, maybe offer some surprising insights, practical takeaways. Okay, 19 00:00:58.079 --> 00:01:00.960 so with that laid out, let's unpack this. Traditional penetration testing. 
20 00:01:01.039 --> 00:01:04.840 It's absolutely vital for securing our digital world, right, but 21 00:01:04.879 --> 00:01:07.280 it's also, let's be honest, often seems like a slow, 22 00:01:07.400 --> 00:01:10.079 manual and incredibly complex undertaking. Is that fair? 23 00:01:10.200 --> 00:01:12.519 Oh, absolutely. That's right at the heart of the challenge. 24 00:01:12.560 --> 00:01:16.760 Pen testing, yeah, it is where these highly technical red 25 00:01:16.840 --> 00:01:21.239 teams simulate real attacks, trying to find the holes in 26 00:01:21.280 --> 00:01:25.519 an organization's defenses. And it's crucial. I mean identifying weaknesses, 27 00:01:25.640 --> 00:01:29.640 prioritizing where you spend your security budget, tuning defenses, meeting 28 00:01:29.680 --> 00:01:32.519 compliance like PCI or HIPAA. For the execs, it's 29 00:01:32.599 --> 00:01:35.840 risk management, reputation. For dev teams, it's baking security in 30 00:01:35.879 --> 00:01:39.319 from the start. Super important stuff. But what's fascinating here, 31 00:01:39.400 --> 00:01:43.480 and a bit problematic, is how this critical process, which 32 00:01:43.519 --> 00:01:46.879 is really labor intensive, struggles. It struggles with the 33 00:01:46.920 --> 00:01:49.959 sheer volume of data in modern networks. You know, despite 34 00:01:50.000 --> 00:01:52.959 having brilliant human testers, the results are often found through, 35 00:01:53.079 --> 00:01:57.079 well, manual, tedious means. It's just an overabundance of information, logs, 36 00:01:57.079 --> 00:01:59.799 network endpoints. It's overwhelming. Even when you use automated tools, 37 00:01:59.840 --> 00:02:01.879 making sense of it all, that's still a huge challenge 38 00:02:01.920 --> 00:02:04.040 for the analyst. So it really begs the question, doesn't it? 39 00:02:04.040 --> 00:02:07.200 How can we possibly scale human expertise to keep up 40 00:02:07.239 --> 00:02:09.280 with this constantly growing threat landscape? 
41 00:02:09.400 --> 00:02:12.400 Right, and that sounds like the perfect entry point for AI, 42 00:02:12.479 --> 00:02:17.240 specifically reinforcement learning as this force multiplier you mentioned. So 43 00:02:17.319 --> 00:02:20.439 how does RL actually step in? What lets it chew 44 00:02:20.599 --> 00:02:24.039 through these mountains of data that swamp human teams? 45 00:02:24.319 --> 00:02:28.039 Well, AI capabilities, they've just improved so dramatically. RL models 46 00:02:28.039 --> 00:02:31.039 can now sift through, I mean, mountains of data that 47 00:02:31.199 --> 00:02:34.960 maybe was ignored before. They find patterns, anomalies, these sort 48 00:02:35.000 --> 00:02:39.680 of graph linked epiphanies. It just massively accelerates the ability 49 00:02:39.759 --> 00:02:42.360 to spot and stop bad actors. It really is about 50 00:02:42.439 --> 00:02:45.400 using AI to combat AI, or at least AI to 51 00:02:45.439 --> 00:02:48.719 combat the complexity that modern systems and threats bring. 52 00:02:49.000 --> 00:02:52.199 Okay, let's break down RL itself. The taxi driver analogy 53 00:02:52.240 --> 00:02:55.080 is pretty classic, right, helps make it concrete. Imagine you're 54 00:02:55.080 --> 00:02:58.240 a taxi driver. Your goal: maximize fares in a city. 55 00:02:58.360 --> 00:03:01.400 That whole city, the traffic, passengers, time of day, that's the 56 00:03:01.400 --> 00:03:04.439 environment. Exactly, and your states are your current situation, like 57 00:03:04.479 --> 00:03:06.599 where your taxi is right now, the time, the weather. 58 00:03:06.759 --> 00:03:09.159 In cyber terms, that translates to things like the network 59 00:03:09.159 --> 00:03:11.800 config, maybe a host's status. Is it up? Have 60 00:03:11.800 --> 00:03:13.560 we scanned it? What access level do we have? 61 00:03:13.840 --> 00:03:16.960 Then you have actions, the choices the driver makes. 
Go downtown, 62 00:03:17.199 --> 00:03:21.039 wait at the station. In cyber, that's your scans, your exploits, 63 00:03:21.159 --> 00:03:23.879 trying to get higher privileges on a machine you've popped. 64 00:03:23.719 --> 00:03:27.560 And absolutely crucial for learning: the reward. That's the feedback. 65 00:03:27.599 --> 00:03:31.759 For the driver, it's the fare, simple enough. In cybersecurity simulations, 66 00:03:31.800 --> 00:03:34.840 it's often framed as costs or penalties for certain actions, 67 00:03:35.120 --> 00:03:37.280 maybe a big lump sum reward for hitting a key 68 00:03:37.280 --> 00:03:39.280 objective like getting domain admin. 69 00:03:39.439 --> 00:03:42.439 But here's a really elegant part, I think: Markov decision 70 00:03:42.479 --> 00:03:46.319 processes, MDPs. Instead of the driver needing to remember every 71 00:03:46.360 --> 00:03:49.280 single fare they've ever collected to decide where to go next, 72 00:03:49.360 --> 00:03:52.800 which would be crazy, MDPs simplify things. They focus on 73 00:03:52.840 --> 00:03:55.960 the present moment, the here and now. This lets the agent, 74 00:03:56.080 --> 00:03:59.639 our driver or our cyber agent, make quick, informed decisions 75 00:03:59.680 --> 00:04:02.039 based on the current state, not the entire history. It's 76 00:04:02.039 --> 00:04:03.960 about what matters right now. Makes 77 00:04:03.759 --> 00:04:07.000 sense. It does, it makes the problem tractable. And finally 78 00:04:07.080 --> 00:04:10.240 you have the objective function. This is the mathematical goal 79 00:04:10.280 --> 00:04:14.759 the agent tries to maximize. Total fare, maybe. And often 80 00:04:14.840 --> 00:04:19.040 it's a discounted sum of future rewards, meaning rewards you 81 00:04:19.120 --> 00:04:21.560 might get way down the line are seen as less 82 00:04:21.639 --> 00:04:24.439 valuable than rewards you can get right now. 
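As a rough illustration of the discounted sum of future rewards just described, here is a minimal Python sketch. The function name `discounted_return`, the discount factor `gamma`, and the fare values are assumptions made for this example; the conversation gives no specific numbers.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a sequence of rewards, weighting the reward at step t by gamma**t.

    With 0 < gamma < 1, rewards further in the future count for less.
    gamma=0.9 is an illustrative choice, not a value from the discussion.
    """
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical taxi fares collected on three consecutive trips.
fares = [10.0, 10.0, 10.0]

# The same fare is worth less the later it arrives: 10, then 9, then about 8.1.
print(discounted_return(fares, gamma=0.9))
```

Setting `gamma` closer to 1 makes the agent more patient; setting it closer to 0 makes it favor immediate fares.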
It reflects 83 00:04:24.480 --> 00:04:28.680 that real world trade-off: immediate gains often feel more important. 84 00:04:28.959 --> 00:04:31.519 So here's where it gets really interesting. Imagine teaching a 85 00:04:31.560 --> 00:04:35.040 computer to think like a hacker by letting it continuously 86 00:04:35.120 --> 00:04:38.399 interact with a simulated network. That's essentially what reinforcement learning 87 00:04:38.399 --> 00:04:41.120 allows us to do in cybersecurity, and you mentioned combining 88 00:04:41.120 --> 00:04:43.319 this with neural networks. That gets us into deep reinforcement 89 00:04:43.439 --> 00:04:44.279 learning, DRL. 90 00:04:44.680 --> 00:04:48.959 Exactly. DRL uses those neural nets to handle, well, incredibly 91 00:04:49.000 --> 00:04:53.680 complex inputs and figure out sophisticated strategies or policies. And 92 00:04:53.720 --> 00:04:58.079 specific algorithms: you mentioned PPO, proximal policy optimization. That one, 93 00:04:58.160 --> 00:05:00.959 along with others like DQN or A2C, has 94 00:05:01.000 --> 00:05:05.079 been really key. PPO especially brought a lot more stability 95 00:05:05.079 --> 00:05:08.040 and efficiency to the training process. It lets us actually 96 00:05:08.120 --> 00:05:12.560 apply these powerful learning methods to really large complex network 97 00:05:12.600 --> 00:05:15.480 simulations without things going completely off the rails. 98 00:05:15.519 --> 00:05:18.240 Okay, so the theory sounds powerful, but how do we 99 00:05:18.279 --> 00:05:22.800 actually connect this theoretical AI agent to a real, messy, 100 00:05:22.839 --> 00:05:25.120 live network? That sounds like a huge leap. 101 00:05:25.240 --> 00:05:27.120 It is a huge leap, and that's where this grounding 102 00:05:27.199 --> 00:05:28.839 problem comes in. 
You have to make sure the AI's 103 00:05:28.920 --> 00:05:32.680 understanding, its representation of reality, is actually tied accurately 104 00:05:32.720 --> 00:05:33.839 to the system it's interacting with. 105 00:05:34.000 --> 00:05:36.199 Right, how do you bridge that gap between the clean 106 00:05:36.279 --> 00:05:39.720 model and the, well, the chaos of a real corporate network? 107 00:05:39.839 --> 00:05:43.199 The key approach involves a high level architecture. It's often 108 00:05:43.240 --> 00:05:48.639 called something like the layered reference model, or LRM-RAG. Think 109 00:05:48.639 --> 00:05:51.000 of it like building layers of maps for the AI, 110 00:05:51.240 --> 00:05:54.519 each one adding more detail. First, you take info from 111 00:05:54.519 --> 00:05:57.360 the real network and abstract it into an attack graph. 112 00:05:57.759 --> 00:06:01.120 This graph then becomes the foundation for the Markov decision process, 113 00:06:01.399 --> 00:06:05.040 the MDP. That's the environment the RL agent actually learns inside. 114 00:06:05.160 --> 00:06:07.240 Okay, an attack graph is the base map. 115 00:06:07.199 --> 00:06:10.399 Exactly, but the crucial part is layering more context onto 116 00:06:10.399 --> 00:06:14.600 that basic MDP. First, there's a terrain MDP. This layer 117 00:06:14.639 --> 00:06:19.199 adds concepts of cyber terrain, so firewalls become obstacles. Maybe 118 00:06:19.199 --> 00:06:22.480 an intrusion detection system, an IDS, has a field of fire. 119 00:06:22.920 --> 00:06:26.759 It borrows from military ideas, actually: intelligence preparation of the battlefield, 120 00:06:27.120 --> 00:06:29.639 understanding the environment to predict moves. So 121 00:06:29.560 --> 00:06:31.879 mapping the cyber landscape strategically. Makes sense. 122 00:06:31.959 --> 00:06:35.079 Then you add an adversary MDP. 
This layer tailors the 123 00:06:35.160 --> 00:06:38.439 environment to specific types of attackers, maybe using node attack 124 00:06:38.519 --> 00:06:41.360 templates or reflecting the capabilities of your own red team. 125 00:06:41.480 --> 00:06:44.240 So modeling different kinds of threats precisely. 126 00:06:45.040 --> 00:06:48.879 And finally, a task MDP. This refines the whole setup 127 00:06:48.879 --> 00:06:52.279 for specific goals. Are you doing crown jewel analysis? Trying 128 00:06:52.360 --> 00:06:56.759 to find exfiltration paths? The task shapes the environment and rewards, 129 00:06:57.199 --> 00:07:01.160 and importantly, as networks change or tasks change, these agents 130 00:07:01.160 --> 00:07:03.560 don't always have to start from scratch. They can use 131 00:07:03.560 --> 00:07:06.480 transfer learning to share knowledge between tasks, or even meta- 132 00:07:06.519 --> 00:07:10.319 learning, basically learning how to learn more efficiently, to adapt quickly. 133 00:07:10.680 --> 00:07:13.079 So this whole layered approach, that's how we connect the 134 00:07:13.079 --> 00:07:15.360 theory to the practice. It gives the structure needed to 135 00:07:15.399 --> 00:07:19.240 make these AI driven cyber operations actually feasible on real networks. 136 00:07:19.360 --> 00:07:21.920 Okay, but even with that structure, there must be massive 137 00:07:22.000 --> 00:07:25.800 practical challenges. You mentioned scaling earlier. Real companies have networks 138 00:07:25.839 --> 00:07:28.240 with, what, tens of thousands 139 00:07:27.879 --> 00:07:30.519 of machines? Oh, easily. Tens of thousands of hosts is 140 00:07:30.519 --> 00:07:33.360 not uncommon in large enterprises, and that scale is a 141 00:07:33.439 --> 00:07:36.360 huge problem for RL models. If your model doesn't scale well, 142 00:07:36.360 --> 00:07:40.639 it becomes incredibly computationally expensive. 
Training takes forever, or maybe 143 00:07:40.680 --> 00:07:43.519 just won't converge, meaning it never settles on a good strategy, 144 00:07:43.959 --> 00:07:47.439 or worse, the reward signal just keeps bouncing around wildly, 145 00:07:47.600 --> 00:07:52.120 never improving. The simulation becomes useless, slower than just using 146 00:07:52.160 --> 00:07:52.839 a human 147 00:07:52.600 --> 00:07:56.000 team. And the attack graphs themselves must explode 148 00:07:55.519 --> 00:07:59.680 in size. Exponentially. Traditional attack graph generation just blows up 149 00:07:59.720 --> 00:08:03.279 as you add hosts. You end up with these unbelievably vast, 150 00:08:03.800 --> 00:08:07.560 complex decision spaces for the RL agent to explore. It's 151 00:08:07.600 --> 00:08:09.600 like going from tic tac toe to, I don't know, 152 00:08:09.879 --> 00:08:12.319 4D chess with millions of pieces. 153 00:08:12.399 --> 00:08:14.680 Wow. Okay, so how on earth do you make that 154 00:08:14.759 --> 00:08:17.480 manageable for an AI? How do you simplify the choices? 155 00:08:17.680 --> 00:08:20.519 That's where action space simplification comes in. You have to 156 00:08:20.519 --> 00:08:25.040 make the problem tractable. Strategies include things like reducing the dimensions, 157 00:08:25.519 --> 00:08:28.360 maybe focusing only on the most relevant actions at any 158 00:08:28.360 --> 00:08:32.240 given point, or combining similar actions into more general ones. 159 00:08:32.679 --> 00:08:36.000 Using hierarchical action spaces is another key idea: teach 160 00:08:36.000 --> 00:08:39.480 the agent high level goals first, like gain access to 161 00:08:39.600 --> 00:08:43.000 subnet X, before it learns the specific low level steps. 162 00:08:43.360 --> 00:08:44.399 It's about smart 163 00:08:44.159 --> 00:08:46.559 abstraction. Makes sense, giving it a better way to think 164 00:08:46.559 --> 00:08:50.279 about its options. 
What about the realism challenge, especially with rewards? 165 00:08:50.480 --> 00:08:53.320 You need the AI to value things like a real attacker, right? 166 00:08:53.639 --> 00:08:57.320 You mentioned CVSS scores earlier, the zero to ten vulnerability 167 00:08:57.399 --> 00:08:58.679 rating. Okay, you said it has 168 00:08:58.559 --> 00:09:02.159 limits. Big limits in this context. CVSS is standardized, 169 00:09:02.200 --> 00:09:05.240 which is good, but it focuses purely on technical severity. 170 00:09:05.279 --> 00:09:08.919 It often lacks crucial context, like what's the actual business 171 00:09:09.039 --> 00:09:11.639 value of the data on that server? Or are there compensating 172 00:09:11.639 --> 00:09:15.039 security controls already in place? It's also static, it doesn't change, 173 00:09:15.159 --> 00:09:17.320 and it doesn't really capture human factors. 174 00:09:17.519 --> 00:09:19.879 So a critical vulnerability on a test server isn't the 175 00:09:19.879 --> 00:09:22.840 same risk as a medium one on the main financial 176 00:09:22.919 --> 00:09:24.200 database. Exactly. 177 00:09:24.320 --> 00:09:27.679 CVSS doesn't capture that nuance. It's not really a measure 178 00:09:27.679 --> 00:09:31.799 of risk, just technical severity, and it definitely doesn't generalize 179 00:09:31.840 --> 00:09:35.600 well to evaluating an entire attack path with multiple steps. 180 00:09:35.720 --> 00:09:39.440 So how do you inject that realism, that context? 181 00:09:39.559 --> 00:09:42.919 Well, real attackers think holistically, don't they? They weigh factors 182 00:09:42.960 --> 00:09:46.480 beyond just the technical vulnerability. They look at the cyber terrain: 183 00:09:46.639 --> 00:09:50.919 firewalls, IDS, detection potential. So the reward system needs to 184 00:09:50.919 --> 00:09:53.759 mimic that. We need to build in that contextual awareness. 
185 00:09:54.120 --> 00:09:57.120 One way is using these service based penalties we talked 186 00:09:57.159 --> 00:10:00.320 about, assigning different negative rewards or costs based on 187 00:10:00.360 --> 00:10:03.720 the type of service being attacked. Like attacking authentication services 188 00:10:03.799 --> 00:10:05.879 might get a minus six penalty, hitting data 189 00:10:05.919 --> 00:10:08.600 services minus four, maybe security or common services minus 190 00:10:08.639 --> 00:10:12.320 two. The exact numbers are relative, tuned for the simulation, 191 00:10:12.759 --> 00:10:15.840 but they reflect the proportional risk to the organization: higher 192 00:10:15.840 --> 00:10:17.720 penalty for hitting more critical services. 193 00:10:17.919 --> 00:10:22.600 Got it, so penalties reflecting business impact, essentially. Now bringing 194 00:10:22.600 --> 00:10:25.840 this all together, the scaling, the realism. What's the approach 195 00:10:25.879 --> 00:10:28.559 that's really making this work in practice? The workhorse solution. 196 00:10:28.919 --> 00:10:31.840 A really promising combination that's emerged is known as double 197 00:10:31.879 --> 00:10:36.000 agent plus PPO or DAPPO. It starts with the double 198 00:10:36.080 --> 00:10:39.600 agent architecture, the DAA. Instead of one monolithic AI trying 199 00:10:39.600 --> 00:10:43.919 to figure everything out, you have two specialized agents working together. 200 00:10:44.399 --> 00:10:47.559 There's an exploration agent whose job is to decide which 201 00:10:47.559 --> 00:10:50.639 host to target next, and then there's an exploitation agent 202 00:10:50.679 --> 00:10:53.919 that decides which specific action or exploit to use on 203 00:10:54.000 --> 00:10:54.840 that chosen host. 204 00:10:54.960 --> 00:10:58.279 Ah, so like a team, one doing recon and target selection, 205 00:10:58.440 --> 00:11:00.720 the other handling the actual attack execution. 
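The service based penalties described in this segment can be sketched as a small reward function. This is a minimal illustration, not the speakers' implementation: the penalty values are illustrative stand-ins chosen only to preserve the ordering they describe (authentication costliest, then data, then security and common services), and `GOAL_REWARD`, `step_reward`, and `path_reward` are hypothetical names introduced here.

```python
# Illustrative per-service penalties (assumed values): more critical
# services carry a larger negative reward when the agent attacks them.
SERVICE_PENALTY = {
    "authentication": -6.0,
    "data": -4.0,
    "security": -2.0,
    "common": -2.0,
}

# Hypothetical lump-sum reward for reaching the objective (e.g. domain admin).
GOAL_REWARD = 100.0


def step_reward(service_type, reached_goal=False):
    """Reward for one action: a service penalty, plus the goal bonus if earned.

    Unknown service types get no penalty here; that default is an assumption.
    """
    penalty = SERVICE_PENALTY.get(service_type, 0.0)
    return penalty + (GOAL_REWARD if reached_goal else 0.0)


def path_reward(path):
    """Total reward for an attack path given as (service_type, reached_goal) pairs."""
    return sum(step_reward(service, goal) for service, goal in path)


# A path through a common service, then a data service that reaches the goal.
print(path_reward([("common", False), ("data", True)]))
```

Because the penalties are relative, only their ratios matter to the learned behavior; scaling them all up makes the agent more reluctant to touch critical services on its way to the goal.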
206 00:11:00.879 --> 00:11:05.240 Precisely, this decomposition makes the whole learning problem much more tractable. 207 00:11:05.600 --> 00:11:09.440 Each agent has a smaller, more focused learning space, and importantly, 208 00:11:09.480 --> 00:11:13.480 it's quite conceptually sound from an attacker's perspective. Real attackers 209 00:11:13.519 --> 00:11:15.240 often think in terms of where do I go next? 210 00:11:15.279 --> 00:11:16.720 And then what do I do once I'm there? 211 00:11:16.919 --> 00:11:20.679 Okay, that makes intuitive sense. Splitting the problem helps. And 212 00:11:20.720 --> 00:11:23.559 the PPO part, proximal policy optimization? 213 00:11:23.799 --> 00:11:26.960 That's the other key piece. Applying PPO to both of 214 00:11:27.000 --> 00:11:30.320 these agents provides the stability and efficiency we talked about earlier. 215 00:11:31.000 --> 00:11:34.120 PPO is just much better than some older algorithms like, 216 00:11:34.279 --> 00:11:37.679 say, A2C, especially for complex problems. It gives 217 00:11:37.679 --> 00:11:42.320 you stability, robustness, and sample efficiency: less data needed to learn, 218 00:11:42.720 --> 00:11:46.399 less likely to get stuck. And this combination, the double 219 00:11:46.440 --> 00:11:49.639 agent architecture powered by PPO, is what has really enabled 220 00:11:49.639 --> 00:11:53.720 these systems to scale effectively to networks of thousands of nodes. 221 00:11:54.279 --> 00:11:57.120 It keeps the learning stable even in huge environments. 222 00:11:57.200 --> 00:11:59.639 So essentially, instead of one AI trying to do everything, 223 00:12:00.039 --> 00:12:02.519 we're giving it a specialized team and a really smart, 224 00:12:02.639 --> 00:12:05.519 stable way to learn. Allows it to tackle networks far 225 00:12:05.639 --> 00:12:08.679 larger than before. 
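To make the "where do I go next, then what do I do there" split concrete, here is a minimal structural sketch in Python. It assumes random stand-ins in place of trained policies, so it shows only how the decision is decomposed, not how PPO trains either agent. All function names, host names, and action names are hypothetical.

```python
import random


def flat_action_count(num_hosts, num_actions):
    """Choices a single monolithic agent faces: every action on every host."""
    return num_hosts * num_actions


def split_action_counts(num_hosts, num_actions):
    """Choices each specialized agent faces when the decision is decomposed."""
    return num_hosts, num_actions


def choose_host(known_hosts, rng):
    """Stand-in for the exploration agent: pick which host to target next."""
    return rng.choice(sorted(known_hosts))


def choose_action(host, actions_by_host, rng):
    """Stand-in for the exploitation agent: pick an action for that host."""
    return rng.choice(actions_by_host[host])


def double_agent_step(actions_by_host, rng):
    """One combined decision: first the host, then the action on it."""
    host = choose_host(actions_by_host.keys(), rng)
    return host, choose_action(host, actions_by_host, rng)


# Hypothetical network with two known hosts and their available actions.
network = {
    "web-01": ["scan", "exploit_http"],
    "db-01": ["scan", "exploit_sql", "escalate"],
}
print(double_agent_step(network, random.Random(0)))

# With 1000 hosts and 50 actions, one agent faces 50000 joint choices,
# while the two agents face 1000 and 50 choices respectively.
print(flat_action_count(1000, 50), split_action_counts(1000, 50))
```

In a real system each stand-in would be replaced by a learned policy, with the exploitation agent conditioned on the host the exploration agent selected.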
It's like having that reconnaissance expert and 226 00:12:08.759 --> 00:12:12.159 an exploit expert working together, powered by the best learning methods. 227 00:12:12.240 --> 00:12:13.279 That's a great way to put it. 228 00:12:13.360 --> 00:12:15.559 Okay, so these aren't just lab experiments. You're saying this 229 00:12:15.720 --> 00:12:19.799 DAPPO approach and these layered models are actually being used 230 00:12:19.799 --> 00:12:22.399 now for real cybersecurity tasks. 231 00:12:22.639 --> 00:12:26.679 Yes, absolutely, we're seeing RL applied in several practical ways. 232 00:12:27.159 --> 00:12:32.200 One key area is crown jewels analysis, or CJA-RL. Here, 233 00:12:32.519 --> 00:12:35.399 RL models are trained specifically to find the most effective, 234 00:12:35.519 --> 00:12:40.840 often the stealthiest paths to compromise an organization's highest value assets, 235 00:12:40.960 --> 00:12:44.759 their crown jewels. So finding the quickest way to the 236 00:12:44.799 --> 00:12:45.919 most important 237 00:12:45.519 --> 00:12:48.879 stuff. Not just the quickest, but often the path of 238 00:12:48.960 --> 00:12:53.080 least resistance or least detection. The insights you get provide 239 00:12:53.080 --> 00:12:56.960 a really nuanced understanding of attackers' methods of discreetly navigating 240 00:12:56.960 --> 00:13:00.000 through networks. It can reveal attack paths you simply wouldn't 241 00:13:00.080 --> 00:13:00.679 have thought 242 00:13:00.519 --> 00:13:03.679 of manually. Exposing those hidden routes. What else? 243 00:13:03.759 --> 00:13:08.279 Another big one is discovering exfiltration paths. The focus here shifts. 244 00:13:08.480 --> 00:13:11.320 It's not about getting in anymore, but about how attackers 245 00:13:11.320 --> 00:13:13.759 get sensitive data out after a breach while trying to 246 00:13:13.840 --> 00:13:17.960 minimize detection. Ah, the getaway plan. Exactly. 
The model has to 247 00:13:18.000 --> 00:13:22.320 consider things like protocol and payload considerations. Agents might learn, 248 00:13:22.360 --> 00:13:25.440 for example, to use specific protocols, like tunneling exfil traffic 249 00:13:25.440 --> 00:13:29.120 through the Domain Name System, DNS, because DNS traffic often looks 250 00:13:29.159 --> 00:13:32.960 benign and isn't heavily scrutinized. Very They can also learn 251 00:13:32.960 --> 00:13:36.399 to use strategic pauses to avoid detection, mimicking low and 252 00:13:36.519 --> 00:13:39.759 slow techniques, or maybe they learn to stick to just 253 00:13:40.039 --> 00:13:44.279 one protocol consistently to better blend in with benign or 254 00:13:44.320 --> 00:13:48.799 otherwise unmonitored traffic. It's about modeling that stealthy data theft. 255 00:13:48.919 --> 00:13:51.399 That's fascinating. And it keeps going? Oh yes. 256 00:13:51.759 --> 00:13:54.480 Another application is discovering command and control 257 00:13:54.279 --> 00:13:58.039 channels. C2 channels, right? The phone-home mechanism for malware. 258 00:13:58.159 --> 00:14:01.879 Precisely, these are the pathways that malware, once it's inside 259 00:14:01.919 --> 00:14:05.279 and undetected, uses to get instructions from its operator and 260 00:14:05.399 --> 00:14:08.440 send back stolen data or status updates. It has to 261 00:14:08.480 --> 00:14:12.919 execute nefarious tasks under direction. RL agents can learn how 262 00:14:12.919 --> 00:14:15.919 to establish and maintain these channels, figuring out how to 263 00:14:16.000 --> 00:14:20.080 navigate through firewalls, again using strategic pauses, sleep actions, to 264 00:14:20.240 --> 00:14:22.960 lie low and avoid detection. 
They might even learn optimal 265 00:14:23.039 --> 00:14:26.440 data upload speeds, maybe consistently choosing fast upload options 266 00:14:26.480 --> 00:14:29.559 over slow if the coast seems clear, balancing speed against the 267 00:14:29.639 --> 00:14:32.600 risk of setting off alarms. It reveals how persistent threats 268 00:14:32.639 --> 00:14:33.679 maintain their foothold. 269 00:14:33.840 --> 00:14:36.679 Incredible. So mapping out not just the break-in, but 270 00:14:36.759 --> 00:14:38.919 the long term occupation and data theft too. 271 00:14:39.240 --> 00:14:42.039 Exactly. And perhaps one of the most advanced applications is 272 00:14:42.480 --> 00:14:46.799 exposing surveillance detection routes, or SDRs. This is like super 273 00:14:46.840 --> 00:14:50.440 advanced reconnaissance. The goal is to find paths an attacker 274 00:14:50.480 --> 00:14:54.120 could use to gain maximum surveillance exposure, learn as much 275 00:14:54.120 --> 00:14:58.559 as possible about the network, while simultaneously minimizing opportunities of 276 00:14:58.600 --> 00:15:01.759 being detected. The ultimate stealth recon. 277 00:15:01.600 --> 00:15:04.240 Maximum info, minimum footprint. How does that work? 278 00:15:04.519 --> 00:15:07.279 One really interesting technique used here is a warm-up 279 00:15:07.279 --> 00:15:11.200 phase. Before the RL agent starts actually learning and updating 280 00:15:11.200 --> 00:15:14.879 its strategy based on rewards, it first explores areas of 281 00:15:14.879 --> 00:15:18.480 the network deemed safe to explore without changing its internal weights. 282 00:15:18.919 --> 00:15:21.799 It just gathers initial information cautiously. 283 00:15:21.759 --> 00:15:24.879 Like a human operator, carefully mapping out the surroundings before 284 00:15:24.919 --> 00:15:26.120 making any risky moves. 285 00:15:26.240 --> 00:15:29.559 Exactly like that, it mimics that initial caution. 
This warm- 286 00:15:29.679 --> 00:15:32.440 up sets the stage for more efficient and targeted learning 287 00:15:32.519 --> 00:15:32.879 later on. 288 00:15:33.279 --> 00:15:35.440 And does this also show different attacker styles? 289 00:15:35.679 --> 00:15:39.639 Yes, very clearly. By adjusting the penalty scale, how much 290 00:15:39.799 --> 00:15:42.840 the agent is punished for potentially being detected, you can 291 00:15:42.879 --> 00:15:47.159 simulate different adversary behaviors and different levels of risk aversion. 292 00:15:47.600 --> 00:15:50.399 For instance, with a low penalty scale, say a value 293 00:15:50.399 --> 00:15:53.039 of one, the agent acts more like a smash and 294 00:15:53.080 --> 00:15:56.960 grab operator or maybe a less experienced attacker. It might 295 00:15:56.960 --> 00:16:00.000 perform noisy scans, not caring as much about stealth. 296 00:16:00.279 --> 00:16:01.879 Okay, the loud attacker, right. 297 00:16:02.279 --> 00:16:04.279 But if you crank up the penalty scale maybe two 298 00:16:04.279 --> 00:16:07.240 to eleven, the agent starts behaving very differently. It acts 299 00:16:07.279 --> 00:16:10.399 more like highly competent actors, like nation state actors or 300 00:16:10.440 --> 00:16:15.320 APTs, advanced persistent threats. It displays highly risk averse behavior, 301 00:16:15.480 --> 00:16:18.879 chooses the most direct paths that minimize exposure, tries to 302 00:16:18.879 --> 00:16:21.759 minimize its overall footprint. It becomes incredibly stealthy. 303 00:16:21.919 --> 00:16:25.240 So you can model specific threat actors, from script kiddies 304 00:16:25.279 --> 00:16:28.759 to spies, just by tuning the AI's aversion to risk. 305 00:16:28.840 --> 00:16:33.360 That's the idea. It allows defenders to anticipate the specific tactics, techniques, 306 00:16:33.399 --> 00:16:36.960 and procedures, the TTPs, associated with different adversary profiles. 307 00:16:37.159 --> 00:16:40.559 These aren't just theoretical models. 
They're literally showing us how 308 00:16:40.720 --> 00:16:43.320 attackers might move through a network, whether they're looking for 309 00:16:43.440 --> 00:16:46.600 the most valuable data or trying to stay hidden. It's like 310 00:16:46.679 --> 00:16:50.559 having a crystal ball for cyber defense, revealing attacker TTPs 311 00:16:50.879 --> 00:16:53.519 even before they strike. It's quite remarkable. 312 00:16:53.759 --> 00:16:55.759 It really shifts the perspective for defenders. 313 00:16:56.320 --> 00:16:59.600 So, looking ahead, what does this all mean for the future? 314 00:16:59.600 --> 00:17:03.600 We're in this AI versus AI situation, or heading deeper 315 00:17:03.600 --> 00:17:07.599 into it. What are the next frontiers beyond these simulation applications? 316 00:17:07.839 --> 00:17:11.480 Well, the applications are expanding rapidly. We're seeing AI, including 317 00:17:11.559 --> 00:17:15.359 RL principles, move more into active threat detection, shifting away 318 00:17:15.400 --> 00:17:20.000 from just relying on known signatures of malware towards behavioral 319 00:17:20.000 --> 00:17:25.000 based detection, using sophisticated ML to spot anomalies, unusual patterns 320 00:17:25.000 --> 00:17:28.039 of activity that might indicate a novel, never before seen threat. 321 00:17:28.480 --> 00:17:31.799 Protecting against the unknown unknowns. So spotting 322 00:17:31.440 --> 00:17:35.200 bad behavior even if you don't recognize a specific tool. Exactly. 323 00:17:35.240 --> 00:17:38.960 And related to that is specific ransomware detection. We can 324 00:17:39.000 --> 00:17:43.640 simulate the entire ransomware life cycle, the initial spread, installation, staging, 325 00:17:43.720 --> 00:17:48.519 data encryption, and also simulate defenses like honeypots. 326 00:17:48.160 --> 00:17:50.960 Ah, those decoy systems designed to trap attackers. 
327 00:17:51.359 --> 00:17:55.119 Right, AI can help optimize honeypot placement and analyze the 328 00:17:55.119 --> 00:17:57.160 behavior of attackers who fall into them. 329 00:17:57.480 --> 00:18:00.880 What about offense? Can AI actually create new attacks? 330 00:18:01.039 --> 00:18:04.920 That's one of the really disruptive possibilities, the potential for 331 00:18:05.000 --> 00:18:10.000 AI models to perhaps invent new atomic level vulnerabilities, maybe 332 00:18:10.079 --> 00:18:14.839 by fuzzing or analyzing code in novel ways. Automating penetration testing, 333 00:18:14.880 --> 00:18:18.519 not just by orchestrating known exploits, but by discovering entirely 334 00:18:18.640 --> 00:18:21.119 new ones at a granular level. That's a big step. 335 00:18:21.200 --> 00:18:23.079 Wow, okay, that's significant. What else on the 336 00:18:23.079 --> 00:18:27.240 horizon? Asset discovery and classification. Imagine AI models that can 337 00:18:27.240 --> 00:18:28.960 infer the role of a server or the type of 338 00:18:29.039 --> 00:18:32.119 data it holds, "ah, there's likely PII in here," just from 339 00:18:32.160 --> 00:18:36.160 analyzing network traffic or scan results, even with limited initial 340 00:18:35.799 --> 00:18:38.880 information. Making sense of the network automatically. 341 00:18:38.480 --> 00:18:43.640 And attribution: assisting human analysts in identifying and assigning responsibility 342 00:18:43.720 --> 00:18:48.599 to threat actors. There's research into using metric learning, essentially 343 00:18:48.680 --> 00:18:52.119 comparing patterns seen in live network data flows, endpoints, 344 00:18:52.359 --> 00:18:55.400 against a library of synthetic attack paths generated by RL 345 00:18:55.440 --> 00:18:59.000 agents trained to mimic different known actors. 
This could potentially 346 00:18:59.079 --> 00:19:03.200 allow for zero ROO attribution, identifying a new campaign launched 347 00:19:03.240 --> 00:19:06.000 by a known group even if the specific tools 348 00:19:05.680 --> 00:19:09.039 are new. Identifying the actor behind a novel attack almost immediately. 349 00:19:09.079 --> 00:19:10.000 That would be huge for 350 00:19:10.039 --> 00:19:15.359 response. Game-changing. And finally, defensive modeling. Moving beyond static, 351 00:19:15.440 --> 00:19:18.200 preprogrammed responses like "if you see this, block 352 00:19:18.279 --> 00:19:22.200 that IP" towards truly AI-driven defenses that can dynamically 353 00:19:22.240 --> 00:19:25.440 analyze an ongoing attack and choose the optimal countermeasures in 354 00:19:25.480 --> 00:19:29.279 real time, adapting as the attack evolves. Active, intelligent defense. 355 00:19:29.519 --> 00:19:32.119 This really paints a picture of an accelerating arms race. 356 00:19:32.319 --> 00:19:35.079 We're going to see true AI attacks, aren't we? Not 357 00:19:35.119 --> 00:19:37.960 just humans using AI tools, but AI directing the attack? 358 00:19:38.200 --> 00:19:41.599 It seems inevitable. Malicious actors will likely use RL and 359 00:19:41.680 --> 00:19:45.880 other ML techniques to automate complex attack patterns, including the 360 00:19:45.920 --> 00:19:49.880 initial scanning and enumeration phases, which are often tedious. And 361 00:19:49.960 --> 00:19:54.039 think about social engineering. AI could be used for honing, refining, 362 00:19:54.079 --> 00:19:58.480 and using these attacks more efficiently, crafting hyper-personalized phishing emails, 363 00:19:58.720 --> 00:20:03.759 maybe even generating realistic, relevant, customized synthetic media (voice, 364 00:20:04.079 --> 00:20:06.839 video) for spear phishing or disinformation. 365 00:20:06.319 --> 00:20:08.480 Deepfakes for hacking. That's UNSOI it is.
366 00:20:08.440 --> 00:20:11.039 And attackers could use ML defensively too, in a sense: 367 00:20:11.200 --> 00:20:14.160 observe how defenses like IDS or antivirus react to 368 00:20:14.200 --> 00:20:16.799 their probes and then use that feedback to craft malware 369 00:20:16.960 --> 00:20:20.119 or just simply hone their techniques to avoid detection, learning 370 00:20:20.200 --> 00:20:22.720 to bypass our security controls. So 371 00:20:22.599 --> 00:20:25.480 the AI learns how to be invisible to our AI defenses. 372 00:20:25.759 --> 00:20:28.880 That's the adversarial dynamic, and a key challenge for a 373 00:20:28.920 --> 00:20:32.359 defensive AI is generalization. How do you get an RL 374 00:20:32.480 --> 00:20:36.119 model trained on one network simulation to perform well on 375 00:20:36.160 --> 00:20:39.400 a completely different real-world network it's never seen before? 376 00:20:40.000 --> 00:20:42.720 That's where techniques like meta-learning (learning how to learn, 377 00:20:42.799 --> 00:20:46.359 how to adapt quickly to new environments) become absolutely critical. 378 00:20:46.880 --> 00:20:51.839 And this raises a fascinating, maybe provocative thought. Cybersecurity might 379 00:20:51.880 --> 00:20:55.720 be uniquely suited for AI evolution. How so? Think about it. 380 00:20:55.720 --> 00:20:59.599 It's perhaps the one domain of AI application that presents 381 00:20:59.680 --> 00:21:03.640 the conditions for true evolution. Why? Because it's AI 382 00:21:03.799 --> 00:21:08.200 existing in its natural environment. It's constantly interacting with other software, 383 00:21:08.240 --> 00:21:11.480 with hardware, with networks, and crucially with other AIs, both 384 00:21:11.519 --> 00:21:15.920 friendly and adversarial. It's a dynamic, competitive ecosystem.
It might 385 00:21:15.960 --> 00:21:18.440 be the first place where human intelligence is really forced 386 00:21:18.519 --> 00:21:20.519 to turn the keys over to an AI that truly 387 00:21:20.559 --> 00:21:24.000 surpasses us, simply because the speed and complexity demand it. 388 00:21:24.119 --> 00:21:27.599 That's a huge point: an environment driving AI evolution, because 389 00:21:27.640 --> 00:21:30.480 the stakes are so high and the interaction so constant. It 390 00:21:30.480 --> 00:21:33.720 really raises an important question, doesn't it? As these AIs 391 00:21:33.799 --> 00:21:36.960 become more capable, especially in a competitive space like cyber, 392 00:21:37.480 --> 00:21:40.640 how do we ensure we design them responsibly? The arms 393 00:21:40.720 --> 00:21:43.359 race dynamic likely means we will build them to be 394 00:21:43.440 --> 00:21:46.839 as effective as possible, even if their intelligence isn't human-like. 395 00:21:47.079 --> 00:21:49.079 We need them to serve our defensive purposes. 396 00:21:49.279 --> 00:21:51.759 A profound challenge layered on top of the technical ones. 397 00:21:52.039 --> 00:21:54.319 So let's try to wrap this up. This deep dive, 398 00:21:54.359 --> 00:21:56.759 I think, has really shown how reinforcement learning isn't just 399 00:21:56.799 --> 00:22:01.319 tweaking cybersecurity, it's fundamentally transforming it. We're moving away from 400 00:22:01.319 --> 00:22:07.759 these cumbersome, often slow, manual processes towards dynamic AI-driven insights, 401 00:22:08.200 --> 00:22:11.319