WEBVTT 1 00:00:00.080 --> 00:00:02.640 You know, usually when we talk about a medical diagnosis, 2 00:00:02.640 --> 00:00:06.200 there's this expectation of, like, pure precision. 3 00:00:06.280 --> 00:00:08.400 Oh totally, it feels like engineering. 4 00:00:07.960 --> 00:00:10.839 Right, exactly. Like, you break your arm, the X-ray 5 00:00:10.960 --> 00:00:13.720 shows that jagged white line and the doctor just points 6 00:00:13.720 --> 00:00:14.800 and says, well, there it is. 7 00:00:15.160 --> 00:00:19.480 It's binary: broken or not broken. It's clean, and honestly, 8 00:00:19.559 --> 00:00:21.160 it's very comforting for patients. 9 00:00:21.160 --> 00:00:23.600 We like things to be visible, easily categorized. 10 00:00:23.640 --> 00:00:24.359 Yeah, we really do. 11 00:00:24.839 --> 00:00:29.039 But then you step into the world of analyzing the 12 00:00:29.039 --> 00:00:32.119 microscopic landscape of our own blood, or you try 13 00:00:32.200 --> 00:00:37.920 to extract a single meaningful diagnostic fact from thousands of 14 00:00:37.960 --> 00:00:39.840 pages of dense medical texts. 15 00:00:39.880 --> 00:00:41.640 Oh yeah, that's a nightmare. 16 00:00:41.439 --> 00:00:44.439 And suddenly that clean X-ray machine is useless. We're 17 00:00:44.439 --> 00:00:47.560 looking at a data landscape that is murky. It's chaotic. 18 00:00:47.719 --> 00:00:50.960 It is the absolute definition of diagnostic muddy waters. 19 00:00:51.000 --> 00:00:53.520 Okay, let's unpack this, because if you are trying to 20 00:00:53.560 --> 00:00:57.039 make sense of this modern avalanche of information, you need 21 00:00:57.079 --> 00:01:00.560 to understand how the tools are evolving. Today, our grounding 22 00:01:00.600 --> 00:01:04.000 material is a really fascinating collection of research from the 23 00:01:04.000 --> 00:01:09.400 proceedings of the International Conference on Data Science, Computation and
24 00:01:09.319 --> 00:01:13.719 Security, which is a really long title. Yeah. 25 00:01:13.480 --> 00:01:16.959 It is. Specifically, we're looking at the IDSCS twenty twenty one 26 00:01:16.920 --> 00:01:20.159 papers. Right, and this collection is really a treasure trove 27 00:01:20.400 --> 00:01:25.519 of how modern computational techniques are being applied to incredibly messy, 28 00:01:25.799 --> 00:01:27.760 real world problems. Exactly. 29 00:01:28.079 --> 00:01:30.159 And our mission today on this deep dive is to 30 00:01:30.200 --> 00:01:34.439 explore how cutting edge data science is solving three massive, 31 00:01:34.519 --> 00:01:35.560 interconnected challenges. 32 00:01:35.719 --> 00:01:37.439 Yeah, three really big ones. 33 00:01:37.519 --> 00:01:39.840 We are going to look at how algorithms are finally 34 00:01:39.920 --> 00:01:43.560 learning to digest human knowledge without losing the meaning, a crucial step, 35 00:01:44.000 --> 00:01:47.319 how they are identifying microscopic threats hiding in our bloodstreams, 36 00:01:47.359 --> 00:01:49.200 and crucially, how they are figuring out how to do 37 00:01:49.239 --> 00:01:53.000 all of this without demanding access to our private raw data. 38 00:01:53.040 --> 00:01:55.519 Because at the core of all these papers is a single, 39 00:01:55.640 --> 00:01:59.640 unifying mathematical challenge, which is extracting the vital signal from 40 00:01:59.680 --> 00:02:01.000 an overwhelming amount of noise. 41 00:02:01.319 --> 00:02:04.439 Well, let's start with the loudest noise of all, which 42 00:02:04.480 --> 00:02:07.480 is a problem I know you, the listener, deal with 43 00:02:07.879 --> 00:02:08.840 every single day. 44 00:02:09.080 --> 00:02:10.680 Oh, information overload. 45 00:02:11.000 --> 00:02:16.919 Yes, we are drowning in text: emails, one hundred page reports, articles, 46 00:02:17.039 --> 00:02:18.000 research papers. 47 00:02:18.199 --> 00:02:18.919 It's endless.
48 00:02:19.439 --> 00:02:22.400 The traditional AI summary tools we've been using for the 49 00:02:22.479 --> 00:02:26.560 last few years often feel, frankly, kind of dumb. 50 00:02:26.919 --> 00:02:31.400 Why is that? Well, because traditional text summarization algorithms often 51 00:02:31.479 --> 00:02:34.719 just skim the surface. Okay. The core problem is their methodology. 52 00:02:34.800 --> 00:02:38.680 They essentially look for frequently repeated words and just grab 53 00:02:38.719 --> 00:02:40.280 the sentences containing them. 54 00:02:40.159 --> 00:02:41.960 So they're just, like, scanning 55 00:02:41.520 --> 00:02:44.159 for matches. Right. But by doing that, they lose the 56 00:02:44.159 --> 00:02:47.319 crucial context. They miss the nuances. I see. And they 57 00:02:47.360 --> 00:02:52.039 frequently drop highly useful specialized entities that maybe only appear 58 00:02:52.120 --> 00:02:54.479 once or twice, but they change the entire meaning of 59 00:02:54.479 --> 00:02:57.800 a paragraph. They don't actually understand what they are summarizing. 60 00:02:57.800 --> 00:02:58.840 They're literally just counting. 61 00:02:59.000 --> 00:03:02.919 So instead of actually reading the textbook, they're blindly highlighting repeated words. 62 00:03:03.400 --> 00:03:07.240 It's like a stressed college student highlighting the word mitochondria 63 00:03:07.599 --> 00:03:09.639 fifty times without knowing what it does. 64 00:03:10.319 --> 00:03:13.960 That is a very accurate, albeit depressing, analogy. 65 00:03:14.280 --> 00:03:16.319 Yeah. So the first paper we are looking at today, 66 00:03:16.680 --> 00:03:21.120 by Singh and Deepak, proposes a knowledge centric semantic approach. 67 00:03:21.280 --> 00:03:22.120 Yeah, they do. 68 00:03:22.280 --> 00:03:24.439 How does this actually fix the counting problem? 69 00:03:24.560 --> 00:03:27.479 By building what they call a term based ontology model. 70 00:03:27.639 --> 00:03:29.319 Okay, term based ontology.
71 00:03:29.400 --> 00:03:33.199 Yeah, let's break down how this algorithm actually reads. After 72 00:03:33.240 --> 00:03:33.960 it cleans up the 73 00:03:33.919 --> 00:03:37.479 text, like removing basic stop words and punctuation. 74 00:03:36.759 --> 00:03:40.199 Exactly. After that, it applies something called TF-IDF. 75 00:03:40.360 --> 00:03:44.199 Okay, what is TF-IDF? Because that sounds like, I 76 00:03:44.199 --> 00:03:46.400 don't know, heavy military jargon. 77 00:03:46.560 --> 00:03:50.479 It does. It stands for term frequency, inverse document frequency. 78 00:03:50.599 --> 00:03:51.520 It is a mouthful. 79 00:03:51.719 --> 00:03:53.680 It is. Think of it as a way to weigh 80 00:03:53.680 --> 00:03:55.919 the importance of a word. It looks at the most 81 00:03:55.960 --> 00:03:59.520 frequent words in a specific document. Okay. But it compares 82 00:03:59.560 --> 00:04:02.719 that to how rare those words are across a massive 83 00:04:02.919 --> 00:04:04.520 general corpus of text. 84 00:04:04.680 --> 00:04:05.919 Wait, can you give me an example? 85 00:04:06.240 --> 00:04:09.319 Sure. If the word blood appears fifty times in a 86 00:04:09.319 --> 00:04:14.039 medical paper but rarely in general English, TF-IDF flags it 87 00:04:14.080 --> 00:04:17.639 as a highly unique identifier for that specific text. 88 00:04:17.839 --> 00:04:21.240 Okay, so it finds the unique keywords, but how does 89 00:04:21.240 --> 00:04:22.000 it know what they mean? 90 00:04:22.360 --> 00:04:25.040 Well, this is where it gets semantic. It takes these 91 00:04:25.079 --> 00:04:29.560 extracted features and cross references them with external knowledge sources. 92 00:04:29.879 --> 00:04:31.839 Like what, specifically? Wikidata. 93 00:04:32.199 --> 00:04:34.120 Oh wow. Yeah, it 94 00:04:34.160 --> 00:04:37.199 is actively looking up the concepts it finds to build 95 00:04:37.279 --> 00:04:38.879 a domain based ontology.
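For readers who want to see the arithmetic behind the TF-IDF idea discussed here, this is a minimal sketch. The function name `tf_idf`, the smoothed IDF formula, and the toy sentences are illustrative assumptions; the conversation does not say which TF-IDF variant the paper uses.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each term in doc_tokens: term frequency times inverse document frequency.

    corpus is a list of token lists standing in for "general" text.
    The smoothed IDF below is one common variant, chosen here as an assumption.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        # How many reference documents mention the term at all.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothing keeps terms that never appear in the corpus from dividing by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


# Toy data: a "medical" sentence scored against three everyday sentences.
medical = "blood cells carry oxygen and blood cells carry iron in the blood".split()
general = [
    "the cat sat in the sun".split(),
    "the market fell and the rates rose".split(),
    "rain fell in the city and the park".split(),
]

scores = tf_idf(medical, general)
top_term = max(scores, key=scores.get)  # "blood": frequent here, absent from the general text
```

In this toy run, "blood" outranks "the" even though both appear in the document, because "the" shows up in every reference sentence and "blood" in none.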
96 00:04:39.079 --> 00:04:42.319 So, like, a literal mathematical map of how all these 97 00:04:42.439 --> 00:04:45.079 terms relate to each other in the real world. Exactly. 98 00:04:45.120 --> 00:04:47.279 So it's not just looking at the document in a vacuum. 99 00:04:47.480 --> 00:04:50.759 It's using Wikidata to build a web of meaning, linking 100 00:04:50.839 --> 00:04:52.800 terms together before it even tries to summarize. 101 00:04:52.839 --> 00:04:55.480 You've got it. But then comes the hard part, which 102 00:04:55.519 --> 00:04:58.319 is deciding which sentences to actually keep and which to 103 00:04:58.360 --> 00:05:00.920 throw away. Right. This is where they bring in heavy 104 00:05:00.959 --> 00:05:03.720 duty statistical tools, starting with cross entropy. 105 00:05:03.759 --> 00:05:07.279 All right, slow down. What is cross entropy in plain English? 106 00:05:07.360 --> 00:05:07.560 Right. 107 00:05:07.600 --> 00:05:07.959 Sorry. 108 00:05:08.160 --> 00:05:12.279 In information theory, cross entropy essentially measures the difference between 109 00:05:12.399 --> 00:05:14.279 two probability distributions. 110 00:05:14.560 --> 00:05:16.199 Still a bit technical. Okay. 111 00:05:16.279 --> 00:05:19.600 In the context of reading text, the algorithm is calculating 112 00:05:19.600 --> 00:05:21.879 the surprise factor of a new sentence. 113 00:05:22.160 --> 00:05:24.199 The surprise factor? Yeah. 114 00:05:23.800 --> 00:05:27.240 It's mathematically asking: based on the web of meaning I've 115 00:05:27.279 --> 00:05:30.800 already built from the previous sentences, how much genuinely new, 116 00:05:30.959 --> 00:05:34.720 surprising information does this next sentence actually give me? 117 00:05:34.959 --> 00:05:37.519 Well, that is brilliant. It really is. So instead of 118 00:05:37.560 --> 00:05:40.720 just reading blindly, this algorithm acts like a genius friend 119 00:05:40.720 --> 00:05:43.720 who actually understands the meaning of the words.
If the 120 00:05:43.759 --> 00:05:46.959 cross entropy is low, it means the algorithm isn't surprised 121 00:05:47.000 --> 00:05:47.279 at all. 122 00:05:47.360 --> 00:05:51.120 Precisely. It already knows this information, so the sentence is redundant. 123 00:05:50.759 --> 00:05:53.199 And it mathematically proves which sentences are redundant. 124 00:05:53.360 --> 00:05:56.639 Yes. And to further eliminate redundancy, it pairs this with 125 00:05:56.959 --> 00:06:02.120 NPMI, or normalized pointwise mutual information, alongside ANOVA. 126 00:06:02.279 --> 00:06:03.879 Okay, NPMI, what does that do? 127 00:06:04.120 --> 00:06:07.839 NPMI looks at co-occurrence. If two concepts, say interest rates 128 00:06:07.839 --> 00:06:10.920 and inflation, almost always show up together in the text, 129 00:06:11.319 --> 00:06:13.079 NPMI flags that strong 130 00:06:12.839 --> 00:06:15.240 relationship. Makes sense. And ANOVA? 131 00:06:15.319 --> 00:06:19.879 The algorithm then uses an analysis of variance, or ANOVA, 132 00:06:20.319 --> 00:06:24.399 to generate statistical p-values for these term relationships. 133 00:06:24.519 --> 00:06:27.560 So it's assigning a strict mathematical grade to every single 134 00:06:27.600 --> 00:06:28.680 word relationship. 135 00:06:28.800 --> 00:06:30.720 Yes, and the grading is ruthless. 136 00:06:31.000 --> 00:06:32.480 Really? Oh yeah. 137 00:06:32.680 --> 00:06:37.000 The system groups sentences based on these p-values. It 138 00:06:37.120 --> 00:06:41.120 uses a strict threshold. Like a cutoff point? Exactly. If 139 00:06:41.160 --> 00:06:43.920 the calculated value of cross entropy and the intersection of 140 00:06:43.959 --> 00:06:47.639 those NPMI scores is less than point five, that sentence 141 00:06:47.680 --> 00:06:49.040 is entirely eliminated. 142 00:06:49.240 --> 00:06:50.399 Just gone? Gone.
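The two measures named in this stretch of the conversation have standard textbook definitions, sketched below. The helper names `npmi` and `cross_entropy`, the choice of sentences as the co-occurrence window, and the toy data are assumptions for illustration. How the paper combines these scores with ANOVA and the point five cutoff is not spelled out here, so that step is deliberately left out.

```python
import math


def npmi(term_a, term_b, sentences):
    """Normalized pointwise mutual information, in the range [-1, 1].

    sentences is a list of token sets; two terms "co-occur" when they share a sentence.
    """
    n = len(sentences)
    p_a = sum(1 for s in sentences if term_a in s) / n
    p_b = sum(1 for s in sentences if term_b in s) / n
    p_ab = sum(1 for s in sentences if term_a in s and term_b in s) / n
    if p_ab == 0:
        return -1.0  # the terms never appear together
    if p_ab == 1:
        return 1.0  # both terms are in every sentence; avoids dividing by log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), for dicts mapping term -> probability.

    eps guards against log(0) when q gives a term no probability mass.
    """
    return -sum(px * math.log(q.get(x, 0.0) + eps) for x, px in p.items() if px > 0)


sentences = [
    {"interest", "rates", "inflation"},
    {"inflation", "rates", "rise"},
    {"weather", "rain"},
    {"rain", "city"},
]
together = npmi("rates", "inflation", sentences)  # always co-occur in this toy text
apart = npmi("rates", "rain", sentences)          # never co-occur

seen = {"rates": 0.5, "inflation": 0.5}           # what the reader has built up so far
familiar = {"rates": 0.5, "inflation": 0.5}       # a sentence that matches it
lopsided = {"rates": 0.9, "inflation": 0.1}       # a sentence that matches it poorly
low_surprise = cross_entropy(seen, familiar)
high_surprise = cross_entropy(seen, lopsided)
```

The "surprise factor" in the conversation corresponds to the second function: the better a model of the text so far predicts the new material, the lower the cross entropy.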
143 00:06:50.519 --> 00:06:53.839 It is mathematically proving that the sentence adds no new 144 00:06:53.959 --> 00:06:55.040 value to the summary. 145 00:06:55.439 --> 00:06:58.000 But wait, if you mathematically chop up a fifty page 146 00:06:58.040 --> 00:07:01.519 document based on variance and entropy, the resulting summary might 147 00:07:01.560 --> 00:07:03.839 contain the right facts, but it's going to sound like 148 00:07:03.879 --> 00:07:06.079 a glitching robot trying to speak English. 149 00:07:05.839 --> 00:07:07.079 Right, it would be super disjointed. 150 00:07:07.040 --> 00:07:09.639 Sentences will just abruptly smash into each other. Which is 151 00:07:09.680 --> 00:07:12.839 exactly why the authors included a final polishing step using 152 00:07:12.879 --> 00:07:15.040 two distinct agents. Oh, they fixed the flow. 153 00:07:15.279 --> 00:07:18.360 Yes. First, a lexical agent using WordNet two point 154 00:07:18.480 --> 00:07:21.279 zero steps in. What's WordNet? It acts like a massive 155 00:07:21.319 --> 00:07:26.160 conceptual dictionary to ensure the vocabulary transitions naturally and captures 156 00:07:26.199 --> 00:07:27.120 the right lexemes. 157 00:07:27.279 --> 00:07:28.879 Okay, that helps the vocabulary. 158 00:07:28.920 --> 00:07:32.240 And then a grammatical agent restructures the phrasing to fix 159 00:07:32.279 --> 00:07:35.879 the grammatical errors that inevitably happen when you stitch disparate 160 00:07:35.920 --> 00:07:36.800 sentences together. 161 00:07:37.360 --> 00:07:40.560 So what were the actual results of this highly semantic, 162 00:07:40.639 --> 00:07:42.360 mathematically ruthless approach? 163 00:07:42.480 --> 00:07:45.439 Well, they tested this on the DUC two thousand and 164 00:07:45.480 --> 00:07:46.319 seven data 165 00:07:46.120 --> 00:07:47.759 set. Which is what, like a benchmark?
166 00:07:47.920 --> 00:07:51.600 Yeah, it's a standard academic benchmark containing hundreds of documents 167 00:07:51.600 --> 00:07:55.839 with manually created, human written summaries to test algorithms against. Okay. 168 00:07:55.959 --> 00:07:59.040 Kind of. The Singh and Deepak model achieved an F 169 00:07:59.120 --> 00:08:01.600 measure of eighty eight point twenty percent. 170 00:08:01.759 --> 00:08:03.480 Wow, that's high. It is. And, 171 00:08:03.439 --> 00:08:06.279 perhaps more importantly, a false negative rate of just zero 172 00:08:06.279 --> 00:08:07.160 point one four. 173 00:08:07.360 --> 00:08:10.680 Wait, point one four? That's tiny. Let's translate that false 174 00:08:10.720 --> 00:08:13.279 negative rate into the real world. Okay. A false negative 175 00:08:13.279 --> 00:08:15.800 means the algorithm looked at a crucial piece of information 176 00:08:15.920 --> 00:08:19.199 and mistakenly decided to delete it. A rate of point 177 00:08:19.279 --> 00:08:23.560 one four means it is almost never deleting vital information. Exactly. 178 00:08:24.000 --> 00:08:27.480 For context, they compared it to baseline models like IKTWA, 179 00:08:27.920 --> 00:08:30.680 which only hit a seventy eight point three six percent 180 00:08:30.959 --> 00:08:31.480 F measure. 181 00:08:31.560 --> 00:08:34.159 If I'm relying on an AI to summarize a massive 182 00:08:34.240 --> 00:08:37.480 legal contract or a dense medical history, I need to 183 00:08:37.559 --> 00:08:41.200 know it didn't accidentally delete the hidden fee clause or 184 00:08:41.240 --> 00:08:42.559 the patient's drug allergy. 185 00:08:42.720 --> 00:08:43.320 It's vital. 186 00:08:43.440 --> 00:08:47.200 This math actually provides that confidence. It's so efficient. This 187 00:08:47.320 --> 00:08:49.960 directly benefits anyone trying to learn faster. 188 00:08:50.159 --> 00:08:52.919 It is a massive leap forward. It shows that by
It shows that by 189 00:08:52.919 --> 00:08:56.759 mapping the ontology of words, computers can finally move past 190 00:08:56.799 --> 00:08:59.639 just counting text to actually distilling human thought. 191 00:09:00.000 --> 00:09:03.519 If we can use statistical thresholds to filter out useless sentences, 192 00:09:03.879 --> 00:09:06.399 can we use that exact same logic to filter out 193 00:09:06.480 --> 00:09:07.919 useless noise in a medical scan. 194 00:09:08.120 --> 00:09:09.720 That is exactly what we're looking at. 195 00:09:09.759 --> 00:09:13.080 Next, we're moving from processing human language to processing human 196 00:09:13.080 --> 00:09:17.279 biology because data science isn't just about reading text faster, right, 197 00:09:17.639 --> 00:09:19.600 It's about seeing what the human eye misses. 198 00:09:19.799 --> 00:09:23.840 It is the transition from semantic analysis to geometric analysis 199 00:09:24.279 --> 00:09:26.519 using the exact same underlying. 200 00:09:26.039 --> 00:09:30.480 Principle extracting vital features from a sea of noise. This 201 00:09:30.559 --> 00:09:33.720 brings us to the second paper by Ali Siddam hasim 202 00:09:33.799 --> 00:09:37.840 geda Way and Gamela Judah. They're detecting abnormal red blood 203 00:09:37.840 --> 00:09:43.039 cells or RBCs using morphology and rotation. Now, why is 204 00:09:43.080 --> 00:09:45.399 this a problem that needs a data science solution? 205 00:09:45.960 --> 00:09:49.840 Because the steaks are incredibly high for conditions like hemolytic anemia, 206 00:09:50.320 --> 00:09:52.559 with sickle cell anemia being a prime example. 207 00:09:52.639 --> 00:09:53.639 Okay, hemoltic anemia. 208 00:09:53.799 --> 00:09:56.919 Yeah, in a healthy person, red blood cells are perfectly 209 00:09:56.960 --> 00:10:01.120 circular and flexible, but genetic abnormality can cause these cells 210 00:10:01.120 --> 00:10:05.639 to become deformed. 
They get all misshapen. Right. They turn elliptical, rectangular, 211 00:10:05.720 --> 00:10:07.480 or sickle shaped, like a crescent moon. 212 00:10:07.639 --> 00:10:11.360 And because of that elongated, jagged shape, they become rigid. 213 00:10:11.399 --> 00:10:13.759 They get stuck in blood vessels, and they rupture easily 214 00:10:13.799 --> 00:10:15.320 as they pass through our capillaries. 215 00:10:15.440 --> 00:10:18.399 Right. Now, the traditional way to diagnose this involves a 216 00:10:18.440 --> 00:10:21.240 highly trained hematology technician sitting at a 217 00:10:21.159 --> 00:10:23.360 microscope, staring through the lens all day, 218 00:10:23.559 --> 00:10:27.480 manually examining a glass slide smeared with blood and looking 219 00:10:27.480 --> 00:10:31.120 for these deformed cells among hundreds or thousands of normal ones. 220 00:10:31.200 --> 00:10:32.159 That sounds exhausting. 221 00:10:32.480 --> 00:10:36.159 It is tedious, it is painstakingly slow, and it is 222 00:10:36.279 --> 00:10:39.799 highly prone to fatigue. You are relying entirely on a 223 00:10:39.840 --> 00:10:41.000 tired human eye. 224 00:10:41.080 --> 00:10:44.600 So Sudiam and his team built an automated solution. Yes, 225 00:10:44.639 --> 00:10:47.120 they did. And what fascinated me is how they prep 226 00:10:47.200 --> 00:10:50.720 the image before the computer even looks for the cells. 227 00:10:51.279 --> 00:10:52.679 The preprocessing is key. 228 00:10:52.759 --> 00:10:55.600 They have to find the region of interest, or ROI. 229 00:10:56.080 --> 00:10:59.759 They take the standard grayscale microscope image and essentially peel 230 00:10:59.759 --> 00:11:02.240 it apart by converting it into pure black and 231 00:11:02.279 --> 00:11:03.480 white binary images. 232 00:11:03.559 --> 00:11:05.200 But they don't just do it once.
233 00:11:05.159 --> 00:11:10.039 Right. They process it at very specific intensity thresholds, like 234 00:11:10.440 --> 00:11:15.240 sixty, seventy, eighty, ninety, one hundred. Correct. Why those specific numbers? 235 00:11:15.320 --> 00:11:17.240 Why not just make the dark stuff black and the 236 00:11:17.320 --> 00:11:18.080 light stuff white? 237 00:11:18.320 --> 00:11:21.159 Well, because a blood smear is messy. Lighting under a 238 00:11:21.200 --> 00:11:24.799 microscope isn't perfectly even. Some cells overlap, some are faded. 239 00:11:24.919 --> 00:11:27.080 Oh, so it's not a uniform image. Not at all. 240 00:11:27.120 --> 00:11:30.399 By running multiple thresholds, the algorithm is essentially adjusting the 241 00:11:30.440 --> 00:11:31.759 exposure step by step. 242 00:11:31.840 --> 00:11:34.000 Oh, like changing the settings on a camera. Exactly. 243 00:11:34.039 --> 00:11:36.840 It's finding the optimal contrast where the true edge of 244 00:11:36.879 --> 00:11:38.879 the cell separates from the background fluid. 245 00:11:39.120 --> 00:11:42.240 And as it creates these binary images, it runs a 246 00:11:42.279 --> 00:11:45.799 cleaning protocol. Yes. Anything that shows up as a smooth 247 00:11:45.879 --> 00:11:49.600 region smaller than one hundred pixels is instantly deleted. 248 00:11:49.679 --> 00:11:51.879 It mathematically decides this is too small to be a 249 00:11:51.919 --> 00:11:53.679 red blood cell. It must be a speck of dust 250 00:11:53.759 --> 00:11:55.000 or an artifact on the glass. 251 00:11:55.440 --> 00:11:59.320 This leaves the algorithm with a clean map of distinct objects. Right, 252 00:11:59.399 --> 00:12:02.159 but an object, it's just a blob of pixels. Yeah. 253 00:12:02.200 --> 00:12:04.559 How does the computer know if it's a normal circle 254 00:12:04.759 --> 00:12:06.320 or an abnormal sickle cell? 255 00:12:06.480 --> 00:12:07.559 That's the real challenge.
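The preprocessing described in this part of the conversation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are made up, cells are assumed to be darker than the background, "region" is taken to mean a 4-connected group of foreground pixels, and the 20 by 20 test image is synthetic. The thresholds and the one hundred pixel cutoff are the values mentioned in the discussion.

```python
from collections import deque


def binarize(gray, threshold):
    """Mark a pixel as foreground (1) when it is darker than the threshold.

    Assumes dark cells on a light background; gray is a list of rows of 0..255 ints.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_small_regions(binary, min_area=100):
    """Erase 4-connected foreground regions that contain fewer than min_area pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill to collect one connected region.
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for ry, rx in region:
                    cleaned[ry][rx] = 0  # too small to be a cell: treat as dust
    return cleaned


# Synthetic image: light background, one 12x12 "cell" and one 2x2 speck.
gray = [[255] * 20 for _ in range(20)]
for y in range(2, 14):
    for x in range(2, 14):
        gray[y][x] = 50
for y in range(17, 19):
    for x in range(17, 19):
        gray[y][x] = 50

# One cleaned binary mask per intensity threshold named in the discussion.
masks = {t: remove_small_regions(binarize(gray, t)) for t in (60, 70, 80, 90, 100)}
```

On a real smear the masks would differ from threshold to threshold, which is the point of sweeping them; on this synthetic image they all agree because the pixels are either very dark or very light.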
256 00:12:07.639 --> 00:12:09.399 Here's where it gets really interesting. I was looking at 257 00:12:09.399 --> 00:12:13.440 the paper and it explains that once the algorithm isolates 258 00:12:13.440 --> 00:12:16.919 a cell, it actually rotates the image of that cell 259 00:12:16.960 --> 00:12:20.879 by ten, twenty, thirty, and forty degrees counterclockwise. 260 00:12:21.000 --> 00:12:22.360 Yes, it spins the image. 261 00:12:22.399 --> 00:12:25.360 And my first thought was, wait, why does the algorithm 262 00:12:25.399 --> 00:12:27.279 then rotate the images? Why not just look at the 263 00:12:27.320 --> 00:12:30.000 cell as it is? If a sickle cell is shaped 264 00:12:30.039 --> 00:12:33.159 like a crescent moon, rotating it on a slide doesn't 265 00:12:33.200 --> 00:12:36.320 magically turn it into a circle. Why does the computer 266 00:12:36.440 --> 00:12:37.720 care what angle it's sitting at? 267 00:12:37.759 --> 00:12:40.399 It's a great question. It's because computers don't see shapes 268 00:12:40.440 --> 00:12:42.720 the way human eyes do. What do you mean? They 269 00:12:42.720 --> 00:12:45.919 don't look at a cluster of pixels and instantly recognize 270 00:12:45.919 --> 00:12:49.519 a crescent. They understand geometry through bounding boxes. 271 00:12:49.600 --> 00:12:50.519 Bounding boxes. 272 00:12:50.639 --> 00:12:53.600 Yeah. Think of a bounding box as drawing a strict 273 00:12:53.639 --> 00:12:57.519 square or rectangle around the absolute furthest edges of the object. 274 00:12:57.559 --> 00:12:59.799 Okay, I'm visualizing drawing a tight box 275 00:12:59.600 --> 00:13:02.799 around a cell. And that box is aligned perfectly with 276 00:13:02.879 --> 00:13:06.799 a horizontal x axis and a vertical y axis. Right. Now, 277 00:13:06.879 --> 00:13:10.440 think about how blood is smeared onto a slide. The 278 00:13:10.559 --> 00:13:13.360 cells land completely randomly. 279 00:13:13.120 --> 00:13:14.879 Just splattered everywhere. Right.
280 00:13:14.759 --> 00:13:19.240 They are oriented at all possible chaotic angles. A normal, 281 00:13:19.360 --> 00:13:20.879 healthy red blood cell is 282 00:13:20.879 --> 00:13:23.759 circular, so it's the same in every direction. Exactly. 283 00:13:23.960 --> 00:13:26.519 Its height and its width are roughly equal no matter 284 00:13:26.559 --> 00:13:28.519 how you spin it inside that bounding box. 285 00:13:28.720 --> 00:13:31.480 But a sickle cell has a distinct long axis and 286 00:13:31.519 --> 00:13:34.720 a short axis. Ah. So, if an elongated sickle cell 287 00:13:34.759 --> 00:13:37.720 happens to land diagonally on the slide and the computer 288 00:13:37.840 --> 00:13:40.240 draws a straight up and down bounding box around it, 289 00:13:40.759 --> 00:13:43.799 the box has to stretch out horizontally and vertically to 290 00:13:43.879 --> 00:13:45.879 capture the diagonal corners. Exactly. 291 00:13:45.960 --> 00:13:48.639 If it lands diagonally, the bounding box might actually look 292 00:13:48.639 --> 00:13:49.519 perfectly square. 293 00:13:49.879 --> 00:13:51.960 Oh wow, I wouldn't have thought of that. 294 00:13:52.159 --> 00:13:55.559 The computer won't capture the cell's true maximum length versus 295 00:13:55.559 --> 00:13:58.759 its true minimum width. It'll just see a big square box. 296 00:13:59.360 --> 00:14:02.240 So by rotating the object by ten, twenty, thirty, 297 00:14:02.279 --> 00:14:05.679 and forty degrees, the algorithm forces the cell into alignment 298 00:14:06.080 --> 00:14:07.519 with the x and y axis. 299 00:14:07.840 --> 00:14:11.600 Yes, it tests different angles until it finds the orientation 300 00:14:11.720 --> 00:14:15.240 where the bounding box is stretched to its absolute maximum limit. 301 00:14:15.840 --> 00:14:19.600 That is so incredibly clever.
It's essentially testing the geometry 302 00:14:19.600 --> 00:14:23.480 at every angle to find the cell's true stretched out shape. 303 00:14:23.679 --> 00:14:24.440 And the math 304 00:14:24.240 --> 00:14:27.639 they use to flag it as diseased is so beautifully simple. 305 00:14:27.759 --> 00:14:29.639 It really is just basic subtraction, right? 306 00:14:29.720 --> 00:14:32.720 Yeah, they calculate the difference between the height and width 307 00:14:32.759 --> 00:14:33.879 of that bounding box. 308 00:14:33.919 --> 00:14:34.960 They call it delta, right? 309 00:14:35.039 --> 00:14:38.559 Delta equals the absolute value of height minus width. If 310 00:14:38.600 --> 00:14:41.279 the minimum difference they find during all those rotations is 311 00:14:41.320 --> 00:14:42.919 greater than seven pixels, and 312 00:14:42.960 --> 00:14:46.000 the cell's total area falls within the biological norm of 313 00:14:46.039 --> 00:14:48.559 four hundred and fifty to one thousand pixels, 314 00:14:48.159 --> 00:14:50.919 then the algorithm officially flags that cell as abnormal. 315 00:14:51.039 --> 00:14:53.720 That's it? That simple? Yep. If the height and width 316 00:14:53.759 --> 00:14:56.639 remain relatively equal, so a delta of less than seven, 317 00:14:57.320 --> 00:14:59.840 it's classified as a normal, healthy cell. 318 00:15:00.120 --> 00:15:03.399 And the results from this geometric approach? They tested it 319 00:15:03.440 --> 00:15:07.279 on a data set of forty real blood smear images 320 00:15:07.759 --> 00:15:09.240 from the erythrocytesIDB. 321 00:15:09.440 --> 00:15:11.759 This is a very solid test set. Yeah. 322 00:15:11.799 --> 00:15:14.519 It achieved an eighty six percent detection rate with only 323 00:15:14.559 --> 00:15:16.720 a fourteen percent false alarm rate. 324 00:15:16.919 --> 00:15:18.639 That's incredibly promising.
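A rough sketch of the rotate-and-measure rule as it is stated in this conversation: rotate the cell, draw an axis-aligned bounding box, take delta as the absolute value of height minus width, and flag the cell when the smallest delta across the rotations exceeds seven pixels and the area is between four hundred and fifty and one thousand pixels. The function names, the representation of a cell as a list of pixel coordinates, and the synthetic ellipses are assumptions. The paper itself may combine the angles differently, so this follows the spoken description only.

```python
import math


def bbox_delta(points, angle_deg):
    """Rotate (x, y) points counterclockwise, then return |height - width| of the bounding box."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    xs = [x * c - y * s for x, y in points]
    ys = [x * s + y * c for x, y in points]
    return abs((max(ys) - min(ys)) - (max(xs) - min(xs)))


def is_abnormal(points, angles=(10, 20, 30, 40), min_delta=7, area_range=(450, 1000)):
    """Apply the rule as described in the discussion; area is the pixel count."""
    smallest = min(bbox_delta(points, a) for a in angles)
    return smallest > min_delta and area_range[0] <= len(points) <= area_range[1]


def filled_ellipse(a, b):
    """Pixels of a filled ellipse with semi-axes a and b, centred on the origin (test data)."""
    return [(x, y) for x in range(-a, a + 1) for y in range(-b, b + 1)
            if (x / a) ** 2 + (y / b) ** 2 <= 1.0]


round_cell = filled_ellipse(15, 15)  # roughly circular, like a healthy cell
long_cell = filled_ellipse(45, 4)    # strongly elongated, like a sickle cell

round_flag = is_abnormal(round_cell)
long_flag = is_abnormal(long_cell)
```

The round shape keeps a near-zero delta at every angle, so it is never flagged. The elongated shape used here is deliberately extreme; with a milder elongation, one of the four angles can bring the box close to square, which is worth keeping in mind when reading the minimum-delta rule.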
325 00:15:18.759 --> 00:15:22.200 We are talking about taking a diagnostic process that usually 326 00:15:22.240 --> 00:15:26.279 requires a highly trained hematologist in a specialized lab and 327 00:15:26.480 --> 00:15:29.600 codifying it into an automated algorithm that could run on 328 00:15:29.639 --> 00:15:31.559 a basic computer in a remote clinic. 329 00:15:31.679 --> 00:15:34.399 It's the democratization of diagnostics. It really is. 330 00:15:34.519 --> 00:15:36.639 Okay, so we have an AI that can read and 331 00:15:36.639 --> 00:15:40.000 summarize complex documents like a genius, and another AI that 332 00:15:40.039 --> 00:15:43.679 can tirelessly diagnose our blood based on rotating bounding boxes. 333 00:15:43.759 --> 00:15:45.360 Two massive leaps forward. 334 00:15:45.519 --> 00:15:49.080 But both of these incredible tools, an algorithm that digests 335 00:15:49.080 --> 00:15:51.879 our personal documents and a system that diagnoses our blood, 336 00:15:51.919 --> 00:15:53.480 share a massive vulnerability. 337 00:15:53.600 --> 00:15:54.440 They absolutely do. 338 00:15:54.600 --> 00:15:57.480 They are completely reliant on digesting massive amounts of data, 339 00:15:57.600 --> 00:15:59.759 which brings up the elephant in the room: data hunger. 340 00:15:59.799 --> 00:16:01.600 Who owns this data, and how do we keep 341 00:16:01.600 --> 00:16:04.799 it safe? We can't just throw all of our most intimate data 342 00:16:04.840 --> 00:16:08.320 onto a giant, centralized public server so an algorithm can 343 00:16:08.360 --> 00:16:11.039 practice on it. No, we really can't. Which brings us 344 00:16:11.080 --> 00:16:14.519 to the third paper, a systematic review by Kapil Tiwari, 345 00:16:14.960 --> 00:16:19.600 Samiksha Shukla, and Jossy P. George on privacy preserving machine learning, 346 00:16:20.000 --> 00:16:20.960 or PPML.
347 00:16:21.279 --> 00:16:24.720 What's fascinating here is that this addresses the central paradox 348 00:16:24.879 --> 00:16:26.639 of modern artificial intelligence. 349 00:16:26.720 --> 00:16:28.120 The paradox? Yeah. 350 00:16:28.320 --> 00:16:32.519 ML's effectiveness relies entirely on the amount, distribution, and variety 351 00:16:32.559 --> 00:16:33.480 of training data. 352 00:16:33.639 --> 00:16:35.559 Right, it needs to see a lot of examples to 353 00:16:35.559 --> 00:16:38.960 learn. Exactly. An AI trained only on data from one 354 00:16:39.000 --> 00:16:42.440 hospital in London won't be very accurate at diagnosing patients 355 00:16:42.440 --> 00:16:45.480 in a rural clinic in India. It needs diverse data 356 00:16:45.519 --> 00:16:46.360 to avoid bias. 357 00:16:46.519 --> 00:16:49.440 But getting that data from multiple diverse sources is a 358 00:16:49.519 --> 00:16:54.879 nightmare because of privacy concerns, security threats, data sovereignty laws: 359 00:16:54.759 --> 00:16:58.440 HIPAA in the US, GDPR in Europe. A hospital in 360 00:16:58.440 --> 00:17:01.519 London legally cannot hand over its patient files to a 361 00:17:01.559 --> 00:17:03.399 tech company in Silicon Valley. And 362 00:17:03.320 --> 00:17:05.640 companies want to protect their competitive advantages too. 363 00:17:06.000 --> 00:17:08.440 Exactly, even if it were legal, they wouldn't want to share. 364 00:17:08.960 --> 00:17:13.160 So we face a roadblock. We have these algorithms, but 365 00:17:13.400 --> 00:17:16.480 the data is locked away in disconnected silos. 366 00:17:16.599 --> 00:17:19.240 So what does this all mean? We're basically stuck between 367 00:17:20.000 --> 00:17:24.400 wanting these life saving, time saving AI tools and not 368 00:17:24.519 --> 00:17:27.720 wanting to hand over our private medical records or personal 369 00:17:27.759 --> 00:17:29.480 notes to a giant central server. 370 00:17:29.720 --> 00:17:32.920 Yeah, we are stuck.
But PPML is the crucial bridge here. Okay. 371 00:17:33.000 --> 00:17:36.039 It's an emerging suite of techniques designed to aggregate data, 372 00:17:36.359 --> 00:17:40.000 train models, and serve inferences without ever actually exposing the 373 00:17:40.079 --> 00:17:41.799 underlying raw data. 374 00:17:41.880 --> 00:17:45.160 So it's the silent security guard making the text summarization 375 00:17:45.240 --> 00:17:48.200 and the medical imaging possible in the real world. But 376 00:17:48.240 --> 00:17:50.960 how does it actually work? A security guard just stops 377 00:17:50.960 --> 00:17:51.720 people at the door. 378 00:17:51.799 --> 00:17:55.440 Well, PPML isn't just one algorithm, it's an entire suite 379 00:17:55.480 --> 00:17:59.200 of cryptographic and statistical techniques. Like what? One of the most 380 00:17:59.279 --> 00:18:02.839 powerful mechanisms they review in this paper is called federated learning. 381 00:18:03.079 --> 00:18:04.880 Okay, walk me through federated learning. 382 00:18:05.079 --> 00:18:07.279 In traditional AI, you take all the data from all 383 00:18:07.279 --> 00:18:09.799 over the world, move it to one giant central server, 384 00:18:10.039 --> 00:18:13.480 and train the model there. Federated learning completely reverses that 385 00:18:13.599 --> 00:18:16.880 architecture. Reverses it? Yeah, instead of moving the data to 386 00:18:16.960 --> 00:18:18.960 the model, we move the model to the data. 387 00:18:19.119 --> 00:18:21.799 Oh wow. It's like asking a thousand hospitals to bake 388 00:18:21.839 --> 00:18:25.319 a cake together. You want the ultimate perfect recipe, but 389 00:18:25.359 --> 00:18:29.039 you aren't allowed to know whose kitchen provided the eggs, 390 00:18:29.240 --> 00:18:31.960 or who sifted the flour, or what their kitchens even 391 00:18:32.000 --> 00:18:32.400 look like. 392 00:18:32.559 --> 00:18:36.440 That is a brilliant way to conceptualize it.
In federated learning, 393 00:18:36.480 --> 00:18:39.799 a central server sends a blank, untrained copy of the 394 00:18:39.839 --> 00:18:45.119 AI model out to thousands of local hospitals or say smartphones. Okay, 395 00:18:45.440 --> 00:18:48.519 the model trains locally on that private data behind the 396 00:18:48.559 --> 00:18:53.000 hospital's own firewalls. The raw data never ever leaves the building. 397 00:18:53.160 --> 00:18:55.680 Wait, if the data never leaves, how does the central 398 00:18:55.720 --> 00:18:56.680 AI get any smarter? 399 00:18:57.000 --> 00:18:59.599 Because the local model doesn't send back the medical records, 400 00:18:59.640 --> 00:19:02.119 It only sends back the math. The math, Yeah, it 401 00:19:02.200 --> 00:19:05.640