WEBVTT 1 00:00:00.080 --> 00:00:01.679 Welcome to the Deep Dive, the show that digs 2 00:00:01.720 --> 00:00:03.919 through stacks of sources to give you the key takeaways, 3 00:00:04.120 --> 00:00:07.480 making sure you're well informed. And today, wow, we are 4 00:00:07.519 --> 00:00:11.599 plunging into a pretty intense digital battlefield. The stakes are 5 00:00:11.599 --> 00:00:14.919 incredibly high. We're talking about malware, you know, that nasty 6 00:00:14.960 --> 00:00:18.359 software designed purely to disrupt, damage, steal. And the scale, 7 00:00:18.399 --> 00:00:21.280 the sheer scale of this problem, it's just staggering. Get this. 8 00:00:22.120 --> 00:00:25.800 Every single day, something like three hundred and fifty thousand 9 00:00:25.839 --> 00:00:29.640 new instances of malicious software pop up and get detected. Just think 10 00:00:29.679 --> 00:00:32.119 about that number for a second. And back in twenty eighteen, 11 00:00:32.200 --> 00:00:34.960 over six hundred and sixty nine million new variants were 12 00:00:35.000 --> 00:00:37.439 spotted in that year alone. This isn't just annoying pop-ups. 13 00:00:37.479 --> 00:00:40.320 It's a huge financial hit. Businesses were spending on average 14 00:00:40.359 --> 00:00:42.600 two point four million US dollars back in twenty eighteen, 15 00:00:42.640 --> 00:00:45.759 twenty nineteen just fighting malware and web attacks. So our 16 00:00:45.799 --> 00:00:47.759 mission for this deep dive is to really get into 17 00:00:47.799 --> 00:00:51.399 how cutting-edge artificial intelligence, specifically deep learning, is being 18 00:00:51.520 --> 00:00:54.280 used as, well, a crucial line of defense. We want 19 00:00:54.359 --> 00:00:58.159 to explore how these intelligent systems are learning, adapting, maybe 20 00:00:58.200 --> 00:01:00.640 even predicting threats that the old ways just can't catch.
21 00:01:00.640 --> 00:01:03.280 It's not just about spotting the known bad guys anymore, right? 22 00:01:03.399 --> 00:01:06.439 It's about anticipating the unknown, the brand new stuff. 23 00:01:06.640 --> 00:01:10.959 That's precisely it. You know, for years cybersecurity really leaned 24 00:01:11.000 --> 00:01:14.439 heavily on what's called signature-based detection. You could think 25 00:01:14.480 --> 00:01:18.120 of it like having a huge photo album of known criminals. 26 00:01:18.280 --> 00:01:21.560 It's great for recognizing malware we've already seen and fingerprinted, 27 00:01:21.959 --> 00:01:25.519 very efficient for that. But its big weakness, its Achilles' heel, 28 00:01:25.719 --> 00:01:28.879 really, is the zero-day attack. Ah, 29 00:01:28.599 --> 00:01:30.680 the infamous zero days. Exactly. 30 00:01:31.040 --> 00:01:34.560 These are completely new malware variants, never seen before. They 31 00:01:34.560 --> 00:01:36.760 don't have a signature, no photo in the album to match. 32 00:01:37.239 --> 00:01:39.719 And that's exactly where AI and deep learning are stepping in. 33 00:01:40.040 --> 00:01:43.120 They use much more sophisticated methods, like looking at dynamic 34 00:01:43.159 --> 00:01:46.200 behavior to spot malicious intent, even if the code itself 35 00:01:46.239 --> 00:01:46.760 is brand new. 36 00:01:46.840 --> 00:01:49.560 Okay, let's unpack that a bit, starting with, like, the 37 00:01:49.640 --> 00:01:52.680 raw materials. How do we even study malware? I gather 38 00:01:52.760 --> 00:01:55.760 there are two main ways, static and dynamic analysis. 39 00:01:55.799 --> 00:01:56.159 That's right. 40 00:01:56.200 --> 00:01:59.560 Static analysis is, well, like examining a suspicious package without 41 00:01:59.560 --> 00:02:02.280 actually opening it.
You're looking at the code itself without 42 00:02:02.359 --> 00:02:05.480 running it, things like library calls it might make, text 43 00:02:05.519 --> 00:02:09.639 strings inside it, byte sequences, maybe the sequence of API calls 44 00:02:09.639 --> 00:02:14.319 it seems designed to make. Signature-based detection mostly 45 00:02:14.439 --> 00:02:17.520 uses this static data, but as we said, it totally 46 00:02:17.599 --> 00:02:20.400 misses new malware because there's no existing signature. Right, no 47 00:02:20.479 --> 00:02:24.159 mugshot. Exactly. And then you have dynamic analysis. This is 48 00:02:24.159 --> 00:02:28.039 where you actually detonate the malware, so to speak. 49 00:02:28.120 --> 00:02:30.560 You run it. Sounds risky. 50 00:02:30.599 --> 00:02:32.520 Well, you run it or emulate it in a very 51 00:02:32.520 --> 00:02:36.280 controlled environment, a sandbox usually, and you watch what it does. 52 00:02:36.439 --> 00:02:38.759 So you track the actual API calls it makes, how 53 00:02:38.759 --> 00:02:41.280 it interacts with the system, maybe even low-level hardware 54 00:02:41.319 --> 00:02:45.080 events. For unknown malware, seeing its behavior, what it actually 55 00:02:45.159 --> 00:02:48.639 does, is absolutely critical. It's not just about its blueprint, 56 00:02:48.319 --> 00:02:50.520 but its actions. Makes sense. And I heard some people 57 00:02:50.560 --> 00:02:53.080 are even combining them, like a hybrid approach. 58 00:02:53.240 --> 00:02:56.639 Yes, absolutely. Hybrid analysis tries to get the best of 59 00:02:56.680 --> 00:02:59.400 both worlds, looking at both the static structure and the 60 00:02:59.479 --> 00:03:01.759 dynamic behavior to build a more complete picture. 61 00:03:02.159 --> 00:03:03.639 Things like mal DNA try to do this. 62 00:03:04.039 --> 00:03:06.439 So you mentioned API calls and other things you look for. 63 00:03:06.599 --> 00:03:09.520 These are the features, right? The specific clues. Precisely.
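As a rough illustration of the static side described above, meaning looking at the text strings inside a file without ever running it, here is a minimal Python sketch. The function name `printable_strings`, the four-character minimum, and the sample bytes are assumptions made up for this note; they are not taken from the episode's sources or from any particular analysis tool.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII at least `min_len` characters long.

    The bytes are only read, never executed, which is the point of
    static analysis.
    """
    # 0x20-0x7e is the printable ASCII range; {n,} means "n or more".
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up bytes standing in for a fragment of a binary file.
sample = b"\x00\x01kernel32.dll\x00\xff\x10abc\x00CreateFileA\x90"
print(printable_strings(sample))
```

Strings recovered this way, such as library or API names, are the kind of static clue the conversation goes on to call features.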
64 00:03:09.759 --> 00:03:13.080 Features are the specific characteristics we extract. And API call 65 00:03:13.120 --> 00:03:17.080 sequences are incredibly valuable. Why? Because they directly show what 66 00:03:17.120 --> 00:03:20.479 a program is trying to do. Interact with files, connect 67 00:03:20.520 --> 00:03:24.280 to the network, modify the system: API calls reveal 68 00:03:24.039 --> 00:03:26.560 that. Ah, okay. And the key 69 00:03:26.280 --> 00:03:29.240 insight here is that the order of these calls often 70 00:03:29.319 --> 00:03:33.840 screams malicious intent. Think about it: opening a file, encrypting it, 71 00:03:33.919 --> 00:03:36.960 then deleting the original. That sequence tells a very different 72 00:03:36.960 --> 00:03:38.879 story than just opening and reading a file. 73 00:03:38.960 --> 00:03:41.360 Yeah, it definitely sounds like ransomware. Exactly. 74 00:03:41.919 --> 00:03:45.120 So researchers use techniques like n-grams, which is just 75 00:03:45.199 --> 00:03:47.680 a fancy way of saying they look at short ordered 76 00:03:47.680 --> 00:03:50.919 sequences of calls, like pairs or triplets, to capture this 77 00:03:51.039 --> 00:03:54.919 vital order information. Opcode sequences are another important feature too. 78 00:03:55.159 --> 00:03:58.360 Those are the really low-level machine instructions, giving insight 79 00:03:58.400 --> 00:03:59.879 into the program's core functions. 80 00:04:00.439 --> 00:04:02.919 So how do analysts actually get this data? What tools 81 00:04:02.919 --> 00:04:03.560 are they using? 82 00:04:03.800 --> 00:04:06.639 Ah, there's a whole toolkit. For static analysis, you have 83 00:04:06.680 --> 00:04:10.520 disassemblers and debuggers like IDA Pro or OllyDbg. They let 84 00:04:10.520 --> 00:04:13.719 you peek inside the compiled code, see the assembly instructions, 85 00:04:13.800 --> 00:04:16.480 extract opcodes, potential API calls. And for
86 00:04:16.439 --> 00:04:19.040 the dynamic side, the sandbox stuff? Right. 87 00:04:19.199 --> 00:04:21.920 Tools like API Monitor are used to track those API 88 00:04:22.000 --> 00:04:24.480 calls live, but you usually need to run the malware 89 00:04:24.560 --> 00:04:27.839 inside a virtual machine or sandbox to contain it. Buster 90 00:04:27.959 --> 00:04:32.839 Sandbox Analyzer, BSA, and similar tools like CWSandbox are 91 00:04:32.879 --> 00:04:35.560 designed for exactly that. They run the malware safely and 92 00:04:35.680 --> 00:04:39.639 log everything it does: file changes, network connections, API calls. 93 00:04:39.959 --> 00:04:43.879 There are even more advanced tools like Ether, which use hardware virtualization. 94 00:04:44.040 --> 00:04:46.519 They kind of sit outside the operating system the malware 95 00:04:46.560 --> 00:04:48.600 is running in, making them much harder for the malware 96 00:04:48.639 --> 00:04:49.160 to detect. 97 00:04:49.160 --> 00:04:52.199 Okay, this is fascinating. So you've got all this raw data, 98 00:04:52.240 --> 00:04:56.360 API sequences, opcodes, behaviors. Now how do you actually 99 00:04:56.399 --> 00:04:59.120 feed this into an AI? How does the machine see 100 00:04:59.560 --> 00:04:59.959 the malware? 101 00:05:00.399 --> 00:05:01.000 Well, this is 102 00:05:00.920 --> 00:05:03.160 where some really creative approaches come in. One of the 103 00:05:03.160 --> 00:05:05.839 most surprising ones is malware visualization. 104 00:05:06.000 --> 00:05:08.439 Visualization? You mean like charts and graphs? 105 00:05:08.439 --> 00:05:12.759 No, literally turning the malware code, the binary file itself, 106 00:05:13.079 --> 00:05:15.720 into an image, usually a grayscale image. 107 00:05:15.759 --> 00:05:18.600 Wait, what? Turning code into a picture? How does that 108 00:05:18.639 --> 00:05:19.160 even work? 109 00:05:19.439 --> 00:05:22.560 Or why? It sounds bizarre,
I know, but researchers found 110 00:05:22.560 --> 00:05:25.120 that malware samples from the same family, even if they 111 00:05:25.120 --> 00:05:28.600 look different in code, often end up having similar textures 112 00:05:28.639 --> 00:05:32.759 and structural patterns when you represent their binary data as pixels 113 00:05:32.279 --> 00:05:34.240 in an image. Like a visual fingerprint? 114 00:05:34.480 --> 00:05:37.600 Kind of, yeah. Kindred attributes, as some call it. And 115 00:05:37.639 --> 00:05:40.800 the brilliant part is this lets us use incredibly powerful 116 00:05:40.879 --> 00:05:44.120 deep learning models that were originally designed for image recognition. 117 00:05:44.560 --> 00:05:48.240 You mean, like the AI that recognizes cats in photos? 118 00:05:47.800 --> 00:05:52.079 Exactly. Convolutional neural networks, or CNNs. They're designed to find 119 00:05:52.120 --> 00:05:57.000 patterns in images: edges, textures, shapes, increasingly complex features. So 120 00:05:57.160 --> 00:05:59.759 by turning malware into an image, we can train a 121 00:05:59.800 --> 00:06:02.879 CNN to spot the visual hallmarks of malicious code, 122 00:06:03.079 --> 00:06:06.199 even if it has no obvious image component itself. It's 123 00:06:06.199 --> 00:06:07.279 surprisingly effective. 124 00:06:07.399 --> 00:06:10.360 Wow. Okay, that's pretty cool. So CNNs for the image approach. 125 00:06:10.519 --> 00:06:12.439 What other AI tools are in the box? 126 00:06:12.720 --> 00:06:16.240 Well, for data that's sequential, where the order is crucial, 127 00:06:16.360 --> 00:06:19.279 like those API call sequences or opcode sequences we 128 00:06:19.360 --> 00:06:23.199 talked about,
we use different architectures. Recurrent neural networks, or 129 00:06:23.319 --> 00:06:27.759 RNNs, are designed specifically for sequential data. Okay. And within 130 00:06:27.920 --> 00:06:32.000 RNNs, variants like LSTMs, long short-term memory networks, are 131 00:06:32.079 --> 00:06:36.120 really powerful. They have mechanisms to remember information over longer sequences, 132 00:06:36.199 --> 00:06:39.160 which is perfect for tracking complex behaviors that unfold over 133 00:06:39.040 --> 00:06:40.959 time. So they can connect an early action with 134 00:06:40.920 --> 00:06:42.160 a later one? Precisely. 135 00:06:42.560 --> 00:06:46.480 LSTMs are actually quite successful commercially. Another popular variation is 136 00:06:46.519 --> 00:06:49.480 the GRU, or gated recurrent unit, which is a bit 137 00:06:49.519 --> 00:06:52.920 simpler than LSTM but often performs just as well. Both 138 00:06:53.079 --> 00:06:57.160 LSTMs and GRUs have shown really significant improvements in detecting malware, 139 00:06:57.439 --> 00:07:01.319 even things like spotting cybersecurity events based on, say, patterns 140 00:07:01.319 --> 00:07:03.000 in social media messages over time. 141 00:07:03.279 --> 00:07:04.879 Interesting. Any other architectures? 142 00:07:05.079 --> 00:07:09.639 Definitely. There are residual networks, or ResNets. Their key innovation 143 00:07:09.839 --> 00:07:13.839 is allowing the network to learn identity mappings, basically letting 144 00:07:13.879 --> 00:07:17.279 the signal skip layers if needed. This helps train much 145 00:07:17.439 --> 00:07:21.000 deeper networks without running into problems like vanishing gradients, where 146 00:07:21.000 --> 00:07:22.600 the signal gets too weak to train the 147 00:07:22.600 --> 00:07:23.639 early layers effectively. 148 00:07:23.959 --> 00:07:26.319 It's kind of inspired by how neurons connect in the brain.
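The "letting the signal skip layers" idea behind residual networks can be shown in a few lines. What follows is only a toy sketch in plain Python, not a trainable network: `residual_block`, `zero_layer`, and `half_relu_layer` are hypothetical names invented for this note, and real ResNets apply the same addition to learned convolutional layers.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, layer: Callable[[Vector], Vector]) -> Vector:
    """Apply `layer` and add the untouched input back (the skip connection)."""
    fx = layer(x)
    if len(fx) != len(x):
        raise ValueError("layer must keep the vector length unchanged")
    return [f + xi for f, xi in zip(fx, x)]


def zero_layer(x: Vector) -> Vector:
    """Hypothetical layer that has learned nothing: it outputs all zeros."""
    return [0.0 for _ in x]


def half_relu_layer(x: Vector) -> Vector:
    """Hypothetical toy layer: scale by 0.5, then clip negatives to zero."""
    return [max(0.0, 0.5 * v) for v in x]


signal = [2.0, -4.0]
print(residual_block(signal, zero_layer))       # input passes through unchanged
print(residual_block(signal, half_relu_layer))  # input plus the layer's correction
```

Because the input is always added back, a layer that contributes nothing still passes the signal through intact, which is the identity mapping described above.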
149 00:07:26.600 --> 00:07:29.519 Deeper networks mean potentially learning more complex patterns, 150 00:07:29.560 --> 00:07:30.920 I guess. That's the idea. 151 00:07:31.360 --> 00:07:34.399 And then there are GANs, generative adversarial networks. 152 00:07:35.040 --> 00:07:38.279 These are fascinating. Adversarial? Sounds intense. 153 00:07:38.519 --> 00:07:41.279 It is, in a way. You have two networks competing. 154 00:07:41.839 --> 00:07:45.439 A generator tries to create fake data, like fake malware samples, 155 00:07:45.920 --> 00:07:49.079 and a discriminator tries to tell the generator's fakes apart 156 00:07:49.120 --> 00:07:49.680 from real 157 00:07:49.560 --> 00:07:51.240 data. Like a game of cat and mouse. 158 00:07:51.319 --> 00:07:54.639 Exactly, a minimax game. The generator gets better at 159 00:07:54.680 --> 00:07:58.480 fooling the discriminator, and the discriminator gets better at spotting fakes. 160 00:07:59.040 --> 00:08:01.959 The really exciting part about GANs is their potential for 161 00:08:02.040 --> 00:08:05.439 things like zero-day malware detection, because the generator might 162 00:08:05.439 --> 00:08:08.959 create novel malicious patterns. Or we can even use them 163 00:08:08.959 --> 00:08:11.680 in the lab to generate challenging new threats to test 164 00:08:11.720 --> 00:08:14.839 our defenses before similar things appear in the wild. It's 165 00:08:14.839 --> 00:08:15.720 like a digital 166 00:08:15.439 --> 00:08:19.120 sparring partner. Proactive defense, I like that. What about understanding 167 00:08:19.199 --> 00:08:22.360 the words of malware, like opcodes or API calls? 168 00:08:22.480 --> 00:08:22.800 Ah, 169 00:08:22.879 --> 00:08:25.439 yes, that's where word embedding techniques come in, like 170 00:08:25.519 --> 00:08:28.680 Word2Vec, or even approaches based on hidden Markov models 171 00:08:28.720 --> 00:08:31.800 like HMM2Vec.
The core idea is similar to 172 00:08:31.839 --> 00:08:35.159 how language models understand words and sentences. You treat 173 00:08:35.240 --> 00:08:38.519 opcodes or API calls as words. These techniques learn to 174 00:08:38.559 --> 00:08:42.480 represent these words as numerical vectors in a high-dimensional 175 00:08:42.000 --> 00:08:44.159 space. Vectors, like points on a map? 176 00:08:44.480 --> 00:08:47.600 Sort of, yes. And the key is that words used 177 00:08:47.639 --> 00:08:50.799 in similar contexts, like API calls that often appear together 178 00:08:50.799 --> 00:08:54.919 in malicious sequences, end up closer together in this vector space. 179 00:08:55.559 --> 00:08:58.559 Word2Vec, for example, trained with just a shallow 180 00:08:58.600 --> 00:09:02.799 neural network, can capture really meaningful relationships. It learns the 181 00:09:02.879 --> 00:09:05.360 meaning or function of an opcode from how it's 182 00:09:05.440 --> 00:09:06.879 used alongside others. So 183 00:09:06.799 --> 00:09:09.039 it groups similar functions together automatically? 184 00:09:09.200 --> 00:09:13.440 Essentially, yes. It captures semantic relationships. There are others too, briefly, 185 00:09:13.639 --> 00:09:17.360 like extreme learning machines, or ELMs. These are super fast 186 00:09:17.679 --> 00:09:21.200 because they don't use the typical backpropagation training method, solving 187 00:09:21.240 --> 00:09:22.519 linear equations instead. 188 00:09:22.639 --> 00:09:25.960 Wow, okay, so it's a really diverse AI toolkit. CNNs 189 00:09:25.960 --> 00:09:30.759 for images, RNNs for sequences, GANs for generating challenges, embeddings 190 00:09:30.799 --> 00:09:31.519 for meaning. 191 00:09:31.399 --> 00:09:34.840 Exactly. They're not just generic algorithms, they're specific tools honed 192 00:09:34.879 --> 00:09:37.639 for different facets of the malware problem. Each has its 193 00:09:37.639 --> 00:09:39.440 strengths depending on the data and the goal.
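To make the "similar contexts end up close together" point from the embedding discussion concrete, here is a small Python sketch. It is deliberately not Word2Vec: instead of training a shallow neural network, it swaps in a much simpler count-based stand-in, where each call is represented by how often other calls appear next to it, and vectors are compared with cosine similarity. The function names, the window size, and the three toy call traces are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt
from typing import Dict, List, Mapping


def cooccurrence_vectors(sequences: List[List[str]],
                         window: int = 1) -> Dict[str, Counter]:
    """Count, for every token, which tokens occur within `window` positions."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for i, token in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][seq[j]] += 1
    return vectors


def cosine(a: Mapping[str, int], b: Mapping[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up API call traces, treated as "sentences" of "words".
traces = [
    ["OpenFile", "ReadFile", "CloseFile"],
    ["OpenFile", "WriteFile", "CloseFile"],
    ["Connect", "Send", "Recv"],
]
vecs = cooccurrence_vectors(traces)
print(cosine(vecs["ReadFile"], vecs["WriteFile"]))  # same neighbors, high similarity
print(cosine(vecs["ReadFile"], vecs["Send"]))       # no shared neighbors
```

In this toy data `ReadFile` and `WriteFile` share exactly the same neighbors, so they land close together, while `Send` shares none with them. Learned embeddings pursue the same goal with dense vectors and far more data.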
194 00:09:39.559 --> 00:09:42.080 Right, it's like having different kinds of sensors and analyzers. 195 00:09:42.240 --> 00:09:44.440 So let's talk about where this is actually being deployed. 196 00:09:44.559 --> 00:09:47.440 Where are these AI techniques making a real difference on 197 00:09:47.480 --> 00:09:48.159 the front lines? 198 00:09:48.360 --> 00:09:53.039 Good question. A huge area is Android malware detection. Think 199 00:09:53.080 --> 00:09:56.159 about it, billions of smartphones out there. It's a massive target. 200 00:09:56.240 --> 00:09:58.440 Yeah, my phone feels like my life sometimes. Right. 201 00:09:59.159 --> 00:10:02.600 So AI systems analyze Android apps using static, dynamic or 202 00:10:02.679 --> 00:10:05.639 hybrid methods. They look for suspicious API calls an app 203 00:10:05.679 --> 00:10:10.440 shouldn't need, like ptrace for debugging other processes, or 204 00:10:10.559 --> 00:10:15.879 mkdir to create directories unexpectedly, or connect for unusual network activity. 205 00:10:16.320 --> 00:10:19.159 They also flag risky permission requests. Does that simple game 206 00:10:19.200 --> 00:10:22.919 really need SEND_SMS permission, or READ_CONTACTS, or 207 00:10:22.919 --> 00:10:25.679 SYSTEM_ALERT_WINDOW to draw over other apps? AI learns the 208 00:10:25.720 --> 00:10:27.759 patterns of legitimate apps versus malware. 209 00:10:27.879 --> 00:10:29.960 That makes sense. What about newer areas? I keep hearing 210 00:10:30.000 --> 00:10:31.759 about smart cars and potential hacking. 211 00:10:31.879 --> 00:10:35.320 That's a critical emerging frontier. Connected vehicle security, part of 212 00:10:35.320 --> 00:10:39.559 intelligent transportation systems, or ITS. Modern cars are basically computers 213 00:10:39.559 --> 00:10:43.720 on wheels, packed with sensors and embedded devices, communicating wirelessly: V2V, 214 00:10:43.799 --> 00:10:46.720 vehicle to vehicle, and V2I, vehicle to
215 00:10:46.679 --> 00:10:49.279 infrastructure. Which means more attack surfaces. 216 00:10:49.039 --> 00:10:52.480 Exactly, and the risks are serious. Denial of service, DoS, 217 00:10:52.600 --> 00:10:57.399 or distributed denial of service, DDoS, attacks could cripple communication. 218 00:10:57.840 --> 00:11:02.000 Imagine jamming traffic safety messages or preventing cars from coordinating 219 00:11:02.039 --> 00:11:02.840 at intersections. 220 00:11:02.919 --> 00:11:05.000 That sounds potentially catastrophic. 221 00:11:05.200 --> 00:11:06.120 It could be. So 222 00:11:06.200 --> 00:11:09.639 AI is being developed to monitor the complex network traffic 223 00:11:09.720 --> 00:11:13.320 in and around vehicles, looking for anomalies, communication patterns that 224 00:11:13.399 --> 00:11:17.480 indicate jamming, spoofing, or attempts to compromise vehicle systems. 225 00:11:17.639 --> 00:11:21.480 Okay, cars, phones. What about the cloud? So much runs 226 00:11:21.519 --> 00:11:21.919 there now. 227 00:11:22.000 --> 00:11:25.960 Absolutely, cloud infrastructure protection is vital. A major threat is 228 00:11:26.000 --> 00:11:30.759 malware injection into virtual machines, VMs. Because cloud platforms often 229 00:11:30.799 --> 00:11:34.919 automatically provision lots of similar VMs, if one type gets compromised, 230 00:11:35.240 --> 00:11:38.960 malware can potentially spread very easily to others configured the same 231 00:11:38.720 --> 00:11:41.200 way. Like an infection spreading through identical twins. 232 00:11:41.440 --> 00:11:46.200 A good analogy. AI techniques, sometimes even simpler machine learning 233 00:11:46.279 --> 00:11:49.639 like k-nearest neighbors or local outlier factor, can monitor the 234 00:11:49.679 --> 00:11:53.960 hypervisor, the software managing the VMs. They look at performance metrics: 235 00:11:54.159 --> 00:11:58.720 CPU load, memory usage, network I/O.
Anomalies in these patterns 236 00:11:58.799 --> 00:12:01.080 can indicate a VM has been compromised and is doing 237 00:12:01.120 --> 00:12:02.480 something malicious. 238 00:12:02.080 --> 00:12:03.720 Like a fever chart for the VM. 239 00:12:03.840 --> 00:12:06.559 Kind of, yeah, though it can be less effective against 240 00:12:06.600 --> 00:12:09.120 low and slow malware that tries very hard to hide 241 00:12:09.159 --> 00:12:12.159 its activity and not cause obvious performance spikes. 242 00:12:11.919 --> 00:12:15.320 Right, stealthy attacks. What about just general network defense, like 243 00:12:15.360 --> 00:12:17.000 intrusion detection systems? 244 00:12:17.080 --> 00:12:20.879 Yes, IDSs are a classic battleground where AI is making inroads. 245 00:12:21.279 --> 00:12:24.159 Instead of just relying on known attack signatures, AI can 246 00:12:24.159 --> 00:12:28.000 perform anomaly detection on system event logs. Think database logs, 247 00:12:28.120 --> 00:12:32.000 operating system logs. AI models, particularly autoencoders, can learn 248 00:12:32.000 --> 00:12:34.519 what normal activity looks like for a specific user or 249 00:12:34.559 --> 00:12:36.960 system. Establishing a baseline? Exactly. 250 00:12:37.639 --> 00:12:41.720 Then any significant deviation from that learned normality gets flagged 251 00:12:41.759 --> 00:12:44.879 as suspicious. It might be an attacker trying to escalate 252 00:12:44.919 --> 00:12:49.159 privileges or moving laterally through the network. Some systems even 253 00:12:49.240 --> 00:12:52.679 use hybrid approaches, maybe combining deep learning like autoencoders 254 00:12:52.679 --> 00:12:56.279 for complex dependent data with traditional machine learning like support 255 00:12:56.320 --> 00:13:01.080 vector machines for simpler independent data like timestamps. 256 00:13:00.159 --> 00:13:03.519 From different angles. And what about something seemingly simpler like spam?
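The baseline idea from the cloud and intrusion detection discussion, learning what normal looks like and flagging deviations, can be sketched with the k-nearest-neighbors approach mentioned for hypervisor metrics. This is a minimal illustration only: the function name `knn_anomaly_score`, the choice of `k`, and the (CPU, memory, network) readings are invented for this note, and a production monitor would also need scaling, thresholds, and far more data.

```python
from math import dist
from typing import Sequence, Tuple

Metrics = Tuple[float, ...]  # e.g. (cpu_load, memory_use, network_io), each 0..1


def knn_anomaly_score(point: Metrics, baseline: Sequence[Metrics],
                      k: int = 3) -> float:
    """Mean Euclidean distance from `point` to its k nearest baseline samples.

    A large score means the reading is far from anything seen during
    normal operation.
    """
    if not 0 < k <= len(baseline):
        raise ValueError("k must be between 1 and the number of baseline samples")
    nearest = sorted(dist(point, sample) for sample in baseline)[:k]
    return sum(nearest) / k


# Made-up "normal" readings for one VM.
normal_readings = [
    (0.20, 0.30, 0.10),
    (0.25, 0.35, 0.10),
    (0.22, 0.30, 0.12),
    (0.18, 0.28, 0.09),
]
typical = (0.21, 0.31, 0.10)
spike = (0.95, 0.90, 0.80)
print(knn_anomaly_score(typical, normal_readings))  # small: close to the baseline
print(knn_anomaly_score(spike, normal_readings))    # much larger: worth flagging
```

As the conversation notes, a score like this reacts to obvious spikes; low and slow activity that stays near the baseline would slip under it.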
257 00:13:03.639 --> 00:13:07.120 Ah, but spam gets clever too. Image spam is a 258 00:13:07.120 --> 00:13:11.440 big one. Spammers embed their malicious messages or links inside images, 259 00:13:11.759 --> 00:13:14.440 specifically to bypass text-based filters. 260 00:13:14.679 --> 00:13:17.320 Oh right, so the filter doesn't see the text. Correct. 261 00:13:17.559 --> 00:13:21.720 But AI, especially CNNs again, often combined with transfer learning 262 00:13:21.759 --> 00:13:25.039 models like VGG19, which are pre-trained on millions 263 00:13:25.080 --> 00:13:28.200 of images, can fight back effectively. They don't just read text. 264 00:13:28.279 --> 00:13:31.720 They analyze the image itself: its metadata like height, width, 265 00:13:32.120 --> 00:13:36.320 color statistics, mean color skewness, texture patterns, even shapes detected 266 00:13:36.399 --> 00:13:39.840 using edge filters. They learn the visual characteristics of spam 267 00:13:39.559 --> 00:13:43.200 images. So the AI sees the spamminess in the image itself. 268 00:13:43.200 --> 00:13:43.720 That's clever. 269 00:13:44.120 --> 00:13:47.559 It shows how AI can tackle threats designed to evade 270 00:13:47.600 --> 00:13:48.440 older methods. 271 00:13:48.519 --> 00:13:50.879 It really does feel like a constant arms race, though. 272 00:13:51.240 --> 00:13:54.799 As our AI gets better at spotting malware, 273 00:13:54.320 --> 00:13:58.039 the attackers start using AI themselves to create better malware. 274 00:13:58.120 --> 00:13:59.720 It's an unavoidable cycle. 275 00:13:59.440 --> 00:14:02.600 Which leads to this concept I've read about, adversarial examples. 276 00:14:02.879 --> 00:14:03.759 Sounds ominous. 277 00:14:03.840 --> 00:14:08.720 It's a major challenge. Adversarial examples, or AEs, are inputs, 278 00:14:08.759 --> 00:14:10.960 could be an image, could be a data file, could 279 00:14:10.960 --> 00:14:14.799 be a software binary, that are intentionally but very slightly modified.
280 00:14:15.200 --> 00:14:18.639 The modification is often tiny, maybe even imperceptible to a human, 281 00:14:19.159 --> 00:14:22.879 but it's specifically crafted to fool an AI classification 282 00:14:22.320 --> 00:14:25.279 model. To make the AI misjudge it? Exactly. In the 283 00:14:25.320 --> 00:14:29.159 malware context, an attacker could take a genuinely malicious file, tweak 284 00:14:29.200 --> 00:14:31.200 it just a little bit, maybe adding some junk code, 285 00:14:31.279 --> 00:14:33.720 changing a few bytes, so that our AI detector now 286 00:14:33.799 --> 00:14:35.399 classifies it as benign. 287 00:14:35.120 --> 00:14:36.840 But it still does the bad stuff? 288 00:14:37.200 --> 00:14:42.440 Crucially, yes. It preserves its original malicious functionality while wearing 289 00:14:42.480 --> 00:14:47.559 this AI-fooling camouflage. It highlights that even powerful AI 290 00:14:47.600 --> 00:14:50.960 models can have these exploitable blind spots. There are even 291 00:14:51.000 --> 00:14:54.480 techniques to create universal perturbations that can fool a model 292 00:14:54.679 --> 00:14:56.240 across many different inputs. 293 00:14:56.360 --> 00:14:59.559 That's worrying. So the malware itself is also evolving, partly 294 00:14:59.600 --> 00:15:01.559 in response to our defenses. 295 00:15:01.080 --> 00:15:03.679 Constantly. And machine learning is actually being used to track 296 00:15:03.720 --> 00:15:08.559 this evolution. Researchers analyze malware families over time, perhaps looking 297 00:15:08.559 --> 00:15:12.240 at opcode sequences within specific time windows. They use techniques, 298 00:15:12.279 --> 00:15:15.240 maybe even simpler ones like linear SVMs, to detect points 299 00:15:15.240 --> 00:15:19.039 where a malware family significantly changed its characteristics. 300 00:15:18.440 --> 00:15:21.519 Like finding evolutionary branches in the malware family tree.
301 00:15:21.559 --> 00:15:25.919 Precisely. Understanding how threats adapt helps us anticipate future shifts 302 00:15:25.919 --> 00:15:27.279 in their tactics or structure. 303 00:15:27.399 --> 00:15:30.320 There must be practical challenges in just studying all this malware, 304 00:15:30.440 --> 00:15:31.559 especially older stuff 305 00:15:31.559 --> 00:15:35.960 or live threats. Oh, absolutely. Handling live malware is inherently risky, 306 00:15:36.279 --> 00:15:39.039 and for older samples, the infrastructure they relied on, especially 307 00:15:39.080 --> 00:15:41.720 their command and control servers, C2 servers, is often 308 00:15:41.799 --> 00:15:42.960 long gone. So you 309 00:15:42.960 --> 00:15:45.200 can't see their full behavior? Not easily. 310 00:15:45.720 --> 00:15:48.879 That's where C2 server emulators become really useful. These 311 00:15:48.879 --> 00:15:52.240 are tools researchers build to mimic the original C2 server. 312 00:15:52.840 --> 00:15:55.919 This allows them to run the malware, even historical samples, 313 00:15:56.200 --> 00:15:59.320 in an isolated lab network and observe its full range 314 00:15:59.320 --> 00:16:02.039 of capability. Because the malware thinks it's talking to its 315 00:16:02.039 --> 00:16:05.799 real controller, you can extract features, understand its entire life cycle. 316 00:16:05.919 --> 00:16:07.720 You trick the malware into showing its hand. 317 00:16:08.519 --> 00:16:08.960 Essentially, 318 00:16:09.080 --> 00:16:12.000 yes. Sometimes you might even need to slightly patch the 319 00:16:12.039 --> 00:16:15.519 malware itself, maybe to bypass some anti-analysis checks it has, 320 00:16:16.000 --> 00:16:18.679 or if, say, an encryption key needed for its C2 321 00:16:18.759 --> 00:16:21.159 communication was lost to time, like with some old 322 00:16:21.159 --> 00:16:22.279 CryptoLocker variants. 323 00:16:22.360 --> 00:16:25.159 It's a complex process.
Now, with all this focus on AI, 324 00:16:25.320 --> 00:16:28.799 this AI mania, almost, are there downsides, things we need 325 00:16:28.879 --> 00:16:29.799 to be cautious about? 326 00:16:30.080 --> 00:16:31.200 That's a very important point. 327 00:16:31.440 --> 00:16:35.639 Yes, while AI is powerful, we need perspective. Machine learning 328 00:16:35.679 --> 00:16:39.240 is data driven, but it's not magic. Humans still make 329 00:16:39.320 --> 00:16:43.440 crucial decisions, things like choosing the right model architecture, setting 330 00:16:43.480 --> 00:16:46.320 parameters like the number of hidden states in an HMM, 331 00:16:46.559 --> 00:16:49.840 selecting the kernel function for an SVM. These aren't automatic. 332 00:16:50.080 --> 00:16:53.720 They require human expertise and significantly impact performance. 333 00:16:53.840 --> 00:16:56.159 Right. The human element is still key in setting it 334 00:16:56.159 --> 00:16:59.600 up. Definitely, and there are practical constraints. More data is 335 00:16:59.639 --> 00:17:03.039 often better, but it needs more computing power, more storage, 336 00:17:03.279 --> 00:17:07.079 longer training times. That's a real bottleneck. Plus, some highly 337 00:17:07.119 --> 00:17:08.359 tuned models can become 338 00:17:08.319 --> 00:17:10.440 very specific to the data set they were trained on. 339 00:17:10.720 --> 00:17:13.519 They might not generalize well to new, slightly different data, 340 00:17:13.559 --> 00:17:16.880 which is a constant issue with evolving malware. There's a 341 00:17:16.960 --> 00:17:20.440 real need for more robust, more generic deep learning approaches. 342 00:17:20.559 --> 00:17:22.319 Adaptability is crucial. And 343 00:17:22.279 --> 00:17:25.160 another big challenge, maybe less technical, but just as important, 344 00:17:25.599 --> 00:17:28.880 is the lack of a unified standard for malware taxonomy.
345 00:17:29.519 --> 00:17:32.720 Different antivirus vendors often label the same threat differently, 346 00:17:33.000 --> 00:17:37.000 even with tools like VirusTotal that aggregate results. Correlating 347 00:17:37.039 --> 00:17:41.319 threats globally and building truly comprehensive data sets is harder 348 00:17:41.319 --> 00:17:43.519 than it should be because we don't always speak the 349 00:17:43.559 --> 00:17:45.240 same language when naming things. 350 00:17:46.319 --> 00:17:48.079 That makes collaborative defense tricky. 351 00:17:48.440 --> 00:17:53.119 It does. And one final, sort of intriguing point. Researchers 352 00:17:53.119 --> 00:17:56.079 have found that different methods for selecting the most important features, 353 00:17:56.079 --> 00:17:59.160 like those API calls or opcodes, can sometimes pick 354 00:17:59.279 --> 00:18:01.119 vastly different sets of features. 355 00:18:00.880 --> 00:18:01.720 But they still work? 356 00:18:01.960 --> 00:18:05.759 But they still end up achieving similar classification accuracy, which 357 00:18:05.799 --> 00:18:09.119 raises a fascinating question. Are these methods truly finding the 358 00:18:09.200 --> 00:18:12.640 single best set of features, or are there potentially multiple 359 00:18:12.680 --> 00:18:15.400 different sets of features that are almost equally good at 360 00:18:15.480 --> 00:18:18.440 identifying malware? It makes you wonder about what the AI 361 00:18:18.559 --> 00:18:19.279 is really learning. 362 00:18:19.440 --> 00:18:22.000 That is interesting. It suggests maybe there isn't one perfect 363 00:18:22.079 --> 00:18:23.160 way to see the malware. 364 00:18:23.400 --> 00:18:25.680 Okay, we have definitely covered a lot of ground in 365 00:18:25.720 --> 00:18:28.960 this deep dive. We've seen how AI and deep learning 366 00:18:29.000 --> 00:18:35.079 are genuinely transforming the fight against malware.
From visualizing code 367 00:18:35.119 --> 00:18:37.559 as images, which is still kind of blowing my mind, yeah, 368 00:18:37.720 --> 00:18:42.160 to understanding behavior through sequences and protecting everything from our 369 00:18:42.240 --> 00:18:45.519 phones and cars to the cloud. It's clearly a super dynamic, 370 00:18:45.599 --> 00:18:46.960 constantly evolving field. 371 00:18:47.119 --> 00:18:49.839 It absolutely is, and I think the key takeaway is 372 00:18:50.599 --> 00:18:55.079 the sheer complexity of this ongoing cybersecurity arms race. AI 373 00:18:55.160 --> 00:18:58.440 gives us incredibly powerful new tools, yes, but the ingenuity 374 00:18:58.480 --> 00:19:02.440 of attackers means it's never solved. Critical thinking, human oversight, asking 375 00:19:02.440 --> 00:19:05.960 the right questions, understanding the limitations of the AI, these 376 00:19:06.000 --> 00:19:09.720 remain completely indispensable. It's very much a human-machine partnership. 377 00:19:09.839 --> 00:19:13.880 Absolutely, a partnership against an ever-adapting adversary. So maybe 378 00:19:13.880 --> 00:19:16.519 the thought to leave you, our listener, with is this: 379 00:19:17.559 --> 00:19:20.680 As AI gets better and better at spotting the hidden patterns, 380 00:19:20.720 --> 00:19:23.839 the secret signatures of malicious code, what new forms of 381 00:19:23.880 --> 00:19:27.440 digital camouflage will the attackers invent next? And will our 382 00:19:27.440 --> 00:19:30.920 intelligent defenses always find the optimal way to adapt, or 383 00:19:31.039 --> 00:19:33.400 just one of many good enough ways? Constantly pushing the 384 00:19:33.480 --> 00:19:36.200 very boundaries of what these intelligent systems can even perceive 385 00:19:36.599 --> 00:19:37.920 is definitely something to think about.