WEBVTT 1 00:00:00.200 --> 00:00:04.120 Right now, sitting inside almost every single cell of your 2 00:00:04.160 --> 00:00:08.320 body is a three billion letter instruction 3 00:00:07.960 --> 00:00:10.960 manual, which is just, I mean, it's a staggering scale 4 00:00:11.000 --> 00:00:12.279 to even try to picture. 5 00:00:12.599 --> 00:00:15.160 Yeah, think about that scale for a second. If a 6 00:00:15.160 --> 00:00:17.800 doctor wants to find out why you're sick or, you know, 7 00:00:17.839 --> 00:00:21.160 why a medication isn't working, they essentially have to find 8 00:00:21.239 --> 00:00:24.480 a single microscopic typo in a book that is three 9 00:00:24.640 --> 00:00:25.960 million pages long. 10 00:00:26.000 --> 00:00:28.160 Right, and they need to find it fast. I mean, 11 00:00:28.160 --> 00:00:31.160 thirty years ago, doing that was a biological impossibility. It 12 00:00:31.239 --> 00:00:35.200 took over a decade and literally billions of dollars just 13 00:00:35.280 --> 00:00:35.840 to do it once. 14 00:00:36.000 --> 00:00:36.640 Wow. 15 00:00:36.880 --> 00:00:39.880 But today we expect those answers in a matter of days. 16 00:00:40.039 --> 00:00:43.399 It's moved from being this purely biological challenge to what 17 00:00:43.479 --> 00:00:46.399 is essentially a computational miracle. 18 00:00:46.560 --> 00:00:48.960 Okay, let's unpack this, because if you've ever wondered how 19 00:00:48.960 --> 00:00:51.399 a simple cheek swab or, like, a vial of blood 20 00:00:51.479 --> 00:00:54.079 drawn at a clinic actually turns into a highly personalized 21 00:00:54.119 --> 00:00:57.159 medical profile, well, this is exactly the breakdown you need. 22 00:00:57.280 --> 00:00:58.960 Yeah. It's a fascinating journey. 23 00:00:58.840 --> 00:01:02.560 It really is. We're not just talking about the biology today.
24 00:01:02.640 --> 00:01:05.120 We are taking a deep dive into the journey from 25 00:01:05.120 --> 00:01:09.079 the wet, messy chemistry of a human cell to the 26 00:01:09.280 --> 00:01:13.400 digital data on a computer screen. And more importantly, we're 27 00:01:13.400 --> 00:01:17.640 looking at the mind-bending mathematical tricks that allow a standard, 28 00:01:17.760 --> 00:01:22.000 cheap laptop to search your entire genetic code without instantly 29 00:01:22.040 --> 00:01:22.680 catching fire. 30 00:01:22.920 --> 00:01:25.400 It really is a collision of two completely different worlds. 31 00:01:25.599 --> 00:01:28.359 I mean, you have to physically extract the data from 32 00:01:28.359 --> 00:01:31.280 the molecule first. Yeah, right. And only then can the 33 00:01:31.319 --> 00:01:32.760 algorithms do their heavy lifting. 34 00:01:32.920 --> 00:01:35.040 Right. So let's start with that physical extraction. We've got 35 00:01:35.040 --> 00:01:37.799 this invisible DNA in a tube. How did we go 36 00:01:37.920 --> 00:01:41.040 from painstakingly reading one genetic sentence at a time to 37 00:01:41.280 --> 00:01:45.079 basically scanning the entire three million page library in an afternoon? 38 00:01:45.200 --> 00:01:46.799 I mean, it didn't happen overnight. 39 00:01:46.560 --> 00:01:49.359 No, not at all. It's a story of, well, constant, 40 00:01:49.359 --> 00:01:52.079 aggressive problem solving. It started back in nineteen seventy seven 41 00:01:52.120 --> 00:01:54.400 with what we now call first generation sequencing, or Sanger 42 00:01:54.480 --> 00:01:58.319 sequencing. The foundational idea was brilliant, honestly. They used 43 00:01:58.359 --> 00:02:01.480 a natural enzyme to copy a strand of DNA, but 44 00:02:01.560 --> 00:02:05.200 they spiked the chemical soup with these modified nucleotides. You know, 45 00:02:05.359 --> 00:02:07.560 the A, C, G, and T building blocks. 46 00:02:07.680 --> 00:02:11.000 Right. 
The sources mentioned these modified blocks have fluorescent glowing 47 00:02:11.039 --> 00:02:14.439 tags on them. They act like molecular stop signs. 48 00:02:14.520 --> 00:02:17.599 Yeah, that's exactly it. That's the key. Imagine you're copying 49 00:02:17.639 --> 00:02:19.599 a sentence, but every time you write the letter A, 50 00:02:20.080 --> 00:02:20.960 your pen freezes. 51 00:02:21.159 --> 00:02:22.919 Oh weird, okay, right, so. 52 00:02:22.919 --> 00:02:25.919 You'd end up with a fragment ending in A. By 53 00:02:26.000 --> 00:02:28.439 running this process over and over, you end up with 54 00:02:28.479 --> 00:02:31.800 a massive mixture of DNA fragments of all different lengths. 55 00:02:32.360 --> 00:02:36.120 You sort them by size using electrical charge, a technique 56 00:02:36.120 --> 00:02:39.719 called electrophoresis, got it, and then a camera reads the 57 00:02:39.759 --> 00:02:42.039 glowing colors at the end of each fragment one by 58 00:02:42.080 --> 00:02:43.080 one to spell. 59 00:02:42.879 --> 00:02:45.840 Out the sequence, which sounds incredibly accurate, but I mean 60 00:02:46.039 --> 00:02:47.319 practically agonizing. 61 00:02:47.439 --> 00:02:49.280 Oh it's painfully slow. Yeah. 62 00:02:49.319 --> 00:02:52.479 The sources say this method maxes out at reading fragments 63 00:02:52.520 --> 00:02:55.560 about eight hundred letters long. If I'm trying to read 64 00:02:55.599 --> 00:02:58.759 a three billion letter genome, that makes me think of 65 00:02:58.960 --> 00:03:03.960 like a medieval monk painstakingly copying an encyclopedia by hand, 66 00:03:04.159 --> 00:03:05.159 letter by single letter. 67 00:03:05.199 --> 00:03:06.199 It's a great analogy. 68 00:03:06.479 --> 00:03:09.319 It works, but you aren't mass producing anything that way. 69 00:03:09.479 --> 00:03:13.039 No, it was a massive bottleneck, and that specific speed 70 00:03:13.080 --> 00:03:16.199 limit is what triggered a complete rethinking of the process. 
71 00:03:16.759 --> 00:03:20.639 Companies like Illumina came along and essentially disrupted the biological 72 00:03:20.639 --> 00:03:23.759 space like a Silicon Valley tech company. Right. They introduced 73 00:03:23.800 --> 00:03:25.560 second generation sequencing. 74 00:03:25.159 --> 00:03:28.680 The massively parallel approach. So if Sanger was the medieval monk, 75 00:03:29.000 --> 00:03:32.680 Illumina is like taking that three million page book, tossing 76 00:03:32.719 --> 00:03:35.159 it into a wood chipper and turning it into millions 77 00:03:35.159 --> 00:03:38.039 of tiny pieces of confetti. Exactly. And then you read 78 00:03:38.120 --> 00:03:40.599 every single shred of confetti at the exact same moment 79 00:03:40.719 --> 00:03:43.759 and basically force a computer to paste the book back together. 80 00:03:44.080 --> 00:03:47.199 That is essentially what they do. Yeah, but to read 81 00:03:47.360 --> 00:03:50.159 millions of tiny shreds at once, the signal has to 82 00:03:50.199 --> 00:03:52.080 be loud enough for a camera sensor to actually pick 83 00:03:52.120 --> 00:03:55.599 it up. A single DNA molecule is just too faint. 84 00:03:56.199 --> 00:03:59.520 So there are these techniques, like emulsion PCR, bridge PCR... 85 00:03:59.199 --> 00:04:02.199 Hold on, I see that mentioned everywhere in the deep dive sources, 86 00:04:02.199 --> 00:04:04.960 but what does that actually mean in this context? Bridge 87 00:04:05.039 --> 00:04:06.280 PCR? Think 88 00:04:06.159 --> 00:04:10.000 of it as microscopic photocopying. They wash the DNA fragments 89 00:04:10.039 --> 00:04:13.439 over a tiny glass slide. The fragments attach to the slide, 90 00:04:13.759 --> 00:04:15.759 and enzymes duplicate them right there in 91 00:04:15.719 --> 00:04:17.279 place. Okay, right there on the glass. 92 00:04:17.360 --> 00:04:20.720 Yeah, they bend over, forming a bridge, and copy themselves 93 00:04:20.759 --> 00:04:24.759 again and again. 
Suddenly, instead of one faint DNA molecule, 94 00:04:25.079 --> 00:04:28.000 you have a dense little cluster of thousands of identical 95 00:04:28.040 --> 00:04:30.360 clones standing up like a tiny forest on the glass. 96 00:04:30.399 --> 00:04:30.920 Oh wow. 97 00:04:31.000 --> 00:04:33.040 So when you attach a glowing chemical tag to them, 98 00:04:33.360 --> 00:04:36.399 that entire cluster flashes brightly enough for a digital camera 99 00:04:36.439 --> 00:04:37.079 to photograph. 100 00:04:37.360 --> 00:04:40.160 That is wild. So you take a picture, wash the 101 00:04:40.199 --> 00:04:43.079 chemicals away, add the next letter, and take another picture. 102 00:04:43.600 --> 00:04:47.480 Just millions of clusters flashing in sequence. But you know, 103 00:04:47.600 --> 00:04:51.079 reading the sources, there's another second gen variation that completely 104 00:04:51.120 --> 00:04:54.560 blew my mind: Ion Torrent. Oh yeah. They don't use lasers, 105 00:04:54.560 --> 00:04:55.959 they don't use cameras, because 106 00:04:55.720 --> 00:04:57.839 they aren't looking at light at all. They are literally 107 00:04:57.879 --> 00:04:59.839 measuring the acidity of the chemical soup. 108 00:05:00.160 --> 00:05:02.160 Hold on, how do you read a genetic code by 109 00:05:02.319 --> 00:05:03.439 checking the pH level? 110 00:05:03.519 --> 00:05:06.480 It comes down to basic chemistry, really. Every single time 111 00:05:06.480 --> 00:05:10.279 a new nucleotide successfully attaches to a growing DNA strand, 112 00:05:10.560 --> 00:05:15.439 the chemical bond naturally releases a single positively charged hydrogen 113 00:05:15.480 --> 00:05:19.120 ion. Okay. Ion Torrent machines use a semiconductor chip 114 00:05:19.439 --> 00:05:23.399 layered with millions of microscopic wells. It's basically a massive 115 00:05:23.439 --> 00:05:27.480 grid of tiny pH meters. It detects that microscopic drop 116 00:05:27.480 --> 00:05:30.519 in pH when the hydrogen ion pops off. 
So it's 117 00:05:30.560 --> 00:05:34.199 translating a biological event directly into a digital electronic signal. 118 00:05:34.279 --> 00:05:37.759 You're just listening for the electrical pop of a hydrogen ion. Unbelievable. 119 00:05:37.800 --> 00:05:38.759 It is pretty incredible. 120 00:05:38.839 --> 00:05:42.839 But even with that speed, second generation sequencing still relies 121 00:05:42.879 --> 00:05:47.319 on tearing the DNA into tiny confetti, right? Which brings 122 00:05:47.360 --> 00:05:51.399 us to the third generation technologies like PacBio and 123 00:05:51.439 --> 00:05:55.040 Oxford Nanopore. This reads like pure science fiction. They don't 124 00:05:55.079 --> 00:05:57.120 chop it up, they don't pause to take pictures, they 125 00:05:57.160 --> 00:05:58.279 just read it continuously. 126 00:05:58.439 --> 00:06:03.199 Yeah, it's called single molecule real-time sequencing. With nanopore, imagine 127 00:06:03.240 --> 00:06:08.079 a microscopic hole, a literal pore punctured through a synthetic membrane. 128 00:06:08.680 --> 00:06:12.319 They apply a steady electrical current across that membrane. Then 129 00:06:12.720 --> 00:06:16.120 they physically pull a single long strand of DNA 130 00:06:15.879 --> 00:06:17.639 through that hole, like threading a needle. 131 00:06:17.800 --> 00:06:20.800 Exactly like that. And because the molecular shapes of an 132 00:06:20.800 --> 00:06:24.199 A, C, G, and a T are slightly different, they each block 133 00:06:24.279 --> 00:06:26.800 the hole in a uniquely different way as they pass through. 134 00:06:26.920 --> 00:06:29.720 Oh, I see. Yeah, that physically alters the electrical current. 135 00:06:29.879 --> 00:06:32.560 The machine reads the specific changes in the voltage to 136 00:06:32.600 --> 00:06:35.240 spell out the letters as the strand zips through. 137 00:06:35.120 --> 00:06:38.759 Which means you can read massive uninterrupted stretches. 
The sources 138 00:06:38.800 --> 00:06:41.199 say up to twenty thousand letters in a single read. 139 00:06:41.279 --> 00:06:44.160 It's like feeding the entire intact book through a high 140 00:06:44.160 --> 00:06:45.560 speed ticker tape scanner. 141 00:06:45.800 --> 00:06:47.399 Yep, it's a huge leap in read length. 142 00:06:47.600 --> 00:06:49.519 But I've got to pause you here. I'm looking at 143 00:06:49.519 --> 00:06:53.199 the data from the sources. If this third generation tech 144 00:06:53.360 --> 00:06:57.000 is so revolutionary and reads so fast, why are we 145 00:06:57.000 --> 00:06:59.800 still using the second generation confetti method at all? 146 00:07:00.120 --> 00:07:04.319 Right. Well, what's fascinating here is a very stubborn, hidden 147 00:07:04.360 --> 00:07:07.759 trade-off between length and accuracy. When you are violently 148 00:07:07.800 --> 00:07:10.560 pulling a molecule through a microscopic hole at high speed, 149 00:07:10.920 --> 00:07:14.519 the sensor occasionally blinks. It might miss a letter entirely, 150 00:07:14.639 --> 00:07:17.600 or accidentally read the same letter twice. These are called 151 00:07:17.639 --> 00:07:21.600 insertion and deletion errors. Third generation tools historically sit at 152 00:07:21.639 --> 00:07:25.079 an error rate of about seventeen point eight to seventeen 153 00:07:25.160 --> 00:07:27.240 point nine percent. Almost 154 00:07:26.879 --> 00:07:30.199 an eighteen percent error rate in a medical context? I mean, 155 00:07:30.199 --> 00:07:32.920 if I'm looking for a single cancer-causing mutation, an 156 00:07:32.959 --> 00:07:35.680 eighteen percent failure rate sounds absolutely terrifying. 157 00:07:35.759 --> 00:07:38.680 It does sound alarming, for sure, but scientists realized something 158 00:07:38.720 --> 00:07:43.399 brilliant about those errors. They're completely random. The nanopore doesn't 159 00:07:43.399 --> 00:07:46.319 systematically struggle with the letter C, for example. 
It's just 160 00:07:46.399 --> 00:07:47.120 random static. 161 00:07:47.199 --> 00:07:49.240 Okay, so how do you fix random static? 162 00:07:49.480 --> 00:07:52.399 The workaround is actually quite elegant. You just sequence the 163 00:07:52.480 --> 00:07:55.160 exact same strand of DNA twenty or thirty times. 164 00:07:55.160 --> 00:07:57.680 Oh, I see, because the odds of the machine making 165 00:07:57.720 --> 00:08:00.920 the exact same random mistake on the exact same letter 166 00:08:01.000 --> 00:08:03.279 twenty times in a row are basically zero. 167 00:08:03.480 --> 00:08:06.399 Precisely. You layer the thirty reads on top of each other, 168 00:08:06.759 --> 00:08:11.160 the random glitches mathematically cancel out, and the true underlying 169 00:08:11.199 --> 00:08:12.680 sequence emerges clearly. 170 00:08:12.879 --> 00:08:15.959 Okay, so that leads us directly to the next massive problem. 171 00:08:16.279 --> 00:08:19.720 If third generation sequencing has a nearly eighteen percent raw 172 00:08:19.839 --> 00:08:22.439 error rate, just dumping all that text into a computer 173 00:08:22.519 --> 00:08:25.560 file is completely useless. The computer needs to know which 174 00:08:25.639 --> 00:08:29.279 letters are biological facts and which letters are just machine hallucinations. 175 00:08:29.839 --> 00:08:31.639 So how do we tag the trustworthy data? 176 00:08:32.000 --> 00:08:35.039 That's where specialized file formats come in. The most basic 177 00:08:35.120 --> 00:08:37.080 format used to be called FASTA. It was just a 178 00:08:37.080 --> 00:08:40.840 plain text file, literally just a string of A's, C's, G's, and T's. 179 00:08:41.320 --> 00:08:44.159 But as you pointed out, FASTA isn't enough anymore. 180 00:08:44.559 --> 00:08:46.759 We needed a way to track the confidence of every 181 00:08:46.799 --> 00:08:47.399 single letter. 
182 00:08:47.679 --> 00:08:51.200 Enter the FASTQ format, where the Q literally stands 183 00:08:51.200 --> 00:08:52.600 for quality. Exactly. 184 00:08:52.879 --> 00:08:56.720 FASTQ attaches a crucial piece of metadata called the Phred 185 00:08:57.279 --> 00:09:01.039 quality score, or Q score. The sequencing machine actually grades 186 00:09:01.039 --> 00:09:04.200 its own homework. For every single letter it outputs, it 187 00:09:04.279 --> 00:09:06.960 calculates a mathematical probability that it made a mistake. 188 00:09:07.120 --> 00:09:09.840 I found the engineering behind this fascinating. A Q score 189 00:09:09.919 --> 00:09:12.679 is a number, right? Say, a score of thirty means 190 00:09:12.720 --> 00:09:15.360 a ninety nine point nine percent accuracy rate. 191 00:09:15.480 --> 00:09:16.919 Right, it's a logarithmic scale. 192 00:09:16.960 --> 00:09:19.240 But if you have to store a two digit number 193 00:09:19.399 --> 00:09:22.879 next to every single letter of a three billion letter genome, 194 00:09:23.279 --> 00:09:26.279 you instantly double or triple your file size. Our hard 195 00:09:26.360 --> 00:09:29.720 drives would fill up immediately. So instead, the algorithms take 196 00:09:29.759 --> 00:09:32.919 that Q score number, add exactly thirty three to it, 197 00:09:33.159 --> 00:09:34.679 and map it to a keyboard symbol. 198 00:09:34.720 --> 00:09:36.519 It's an incredibly clever compression hack. 199 00:09:36.720 --> 00:09:38.919 But wait, why add exactly thirty three? Why not just 200 00:09:39.039 --> 00:09:40.000 use the number itself? 201 00:09:40.120 --> 00:09:42.399 Well, it's because of how computers read text using the 202 00:09:42.440 --> 00:09:45.639 ASCII standard. The first thirty two characters in a computer's 203 00:09:45.720 --> 00:09:50.480 language aren't printable. They are invisible commands like escape or return. 204 00:09:50.639 --> 00:09:51.360 Oh right, okay. 
205 00:09:51.399 --> 00:09:54.519 By mathematically adding thirty three to the Q score, you 206 00:09:54.639 --> 00:09:58.519 jump past those invisible commands and land perfectly on standard 207 00:09:58.559 --> 00:10:01.879 printable characters. So instead of storing the number thirty, the 208 00:10:01.919 --> 00:10:05.480 computer stores a single XH symbol or maybe a question mark. 209 00:10:05.879 --> 00:10:09.000 You fit complex probability data into a single byte of memory. 210 00:10:09.200 --> 00:10:12.440 That is brilliant. And the stakes here are real, because 211 00:10:12.480 --> 00:10:14.919 if a doctor is looking at your file and the 212 00:10:15.000 --> 00:10:18.600 sequence shows a genetic marker for a severe disease, they 213 00:10:18.639 --> 00:10:20.799 need to know if that marker has a high Q 214 00:10:21.000 --> 00:10:23.799 score or if it's just a low quality machine glitch. 215 00:10:23.960 --> 00:10:26.960 Exactly. If we connect this to the bigger picture, we 216 00:10:27.000 --> 00:10:29.360 aren't just trusting one read. We look at the read 217 00:10:29.399 --> 00:10:32.960 depth and the genotype quality. If fifty reads show a mutation 218 00:10:33.120 --> 00:10:36.000 and have high Q scores, the algorithm confidently calls it 219 00:10:36.039 --> 00:10:37.240 a true variant. 220 00:10:36.960 --> 00:10:38.720 It ignores the low quality glitches. 221 00:10:38.960 --> 00:10:41.399 Yes, and once we trust the letters, we have to 222 00:10:41.399 --> 00:10:43.759 figure out what they mean. You take those millions of 223 00:10:43.840 --> 00:10:46.600 verified FASTQ shreds and you align them against a 224 00:10:46.639 --> 00:10:50.000 standard reference human genome. It's like checking your puzzle pieces 225 00:10:50.000 --> 00:10:52.559 against the picture on the front of the box. Once 226 00:10:52.559 --> 00:10:55.159 they are aligned, they are saved as a BAM file, 227 00:10:55.440 --> 00:10:57.519 which is a highly compressed binary format. 
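A minimal sketch of the "add thirty three" quality encoding discussed above, assuming the common Phred+33 convention and the standard definition Q = -10 * log10(P_error). The function names are hypothetical and chosen only for illustration; this is not taken from any particular sequencing toolkit.

```python
import math

PHRED_OFFSET = 33  # the "add exactly thirty three" step; skips ASCII control codes and the space


def error_prob_to_q(p_error: float) -> int:
    """Convert an error probability into a Phred quality score (logarithmic scale)."""
    return round(-10 * math.log10(p_error))


def q_to_char(q: int) -> str:
    """Store a quality score as one printable ASCII character."""
    return chr(q + PHRED_OFFSET)


def char_to_q(symbol: str) -> int:
    """Recover the quality score from its single-character encoding."""
    return ord(symbol) - PHRED_OFFSET


def q_to_accuracy(q: int) -> float:
    """Probability that the base call is correct, given its Q score."""
    return 1 - 10 ** (-q / 10)
```

Under these assumptions, a one-in-a-thousand error probability gives Q30, which is stored as the question mark character and decodes back to 99.9 percent accuracy.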
228 00:10:57.799 --> 00:11:01.159 But humans are fundamentally ninety nine point nine percent identical. 229 00:11:01.519 --> 00:11:04.120 If you sequence my DNA, almost all of it is 230 00:11:04.159 --> 00:11:06.759 exactly the same as the reference map. It seems wildly 231 00:11:06.759 --> 00:11:10.519 inefficient to store three billion letters just to say, yep, 232 00:11:10.639 --> 00:11:11.600 still human. 233 00:11:11.519 --> 00:11:14.080 Which is why the final piece of this file pipeline 234 00:11:14.360 --> 00:11:18.519 is the VCF, or variant call format. We don't store 235 00:11:18.559 --> 00:11:22.200 your whole genome. The VCF file only stores your mutations, 236 00:11:22.440 --> 00:11:25.240 the differences. It's essentially a list of typos. It says, 237 00:11:25.519 --> 00:11:28.200 at chromosome four, position one million, there should be an A, 238 00:11:28.360 --> 00:11:29.559 but in this patient it's a G. 239 00:11:30.200 --> 00:11:32.399 Okay, let's step back, because I'm looking at the sheer 240 00:11:32.440 --> 00:11:34.759 math of this alignment process. We sort of glossed over 241 00:11:34.799 --> 00:11:37.720 how we actually match the puzzle pieces. If I have 242 00:11:37.759 --> 00:11:40.440 a one hundred letter fragment, and I have three billion possible 243 00:11:40.480 --> 00:11:43.200 places to stick it on the reference genome, wouldn't 244 00:11:43.279 --> 00:11:46.279 a standard computer search algorithm just freeze? I mean, how 245 00:11:46.320 --> 00:11:48.120 do they avoid a total system crash? 246 00:11:48.200 --> 00:11:51.320 This is where we get into the real heavy algorithmic lifting. 247 00:11:51.440 --> 00:11:54.120 The first major hurdle is that genetic mutations mean you 248 00:11:54.159 --> 00:11:56.919 almost never have an exact match. You might have a 249 00:11:56.919 --> 00:11:59.440 missing letter or an extra one, so you can't just 250 00:11:59.559 --> 00:12:03.399 hit Ctrl+F and search for the exact string. 
251 00:12:03.919 --> 00:12:06.440 You have to use something called dynamic programming to calculate 252 00:12:06.480 --> 00:12:07.279 the edit distance. 253 00:12:07.919 --> 00:12:10.559 I read about this. It's about finding the minimum number 254 00:12:10.559 --> 00:12:15.519 of operations insertions, deletions, or substitutions to change one string 255 00:12:15.519 --> 00:12:19.000 of text into another. The source gave a great simple example, 256 00:12:19.440 --> 00:12:22.960 changing the word ants to bent. You substitute the A 257 00:12:23.279 --> 00:12:25.759 for an E, insert a B at the front, and 258 00:12:25.799 --> 00:12:28.279 delete the S at the end. That takes three steps, 259 00:12:28.360 --> 00:12:31.480 perfect exactly. But scaling that up to thousands of letters 260 00:12:31.519 --> 00:12:34.840 creates an astronomical number of possible operations. 261 00:12:34.960 --> 00:12:38.360 Right if you try to calculate every single possible combination 262 00:12:38.480 --> 00:12:42.519 from scratch using standard recursion, which essentially means the computer 263 00:12:42.600 --> 00:12:44.960 solves the problem by breaking it into smaller pieces and 264 00:12:44.960 --> 00:12:47.960 solving every single piece over and over, the computing time 265 00:12:48.000 --> 00:12:52.080 grows exponentially. The universe would literally end before your laptop finished. 266 00:12:52.200 --> 00:12:55.279 Okay, so if recursion crashes the computer, how does dynamic 267 00:12:55.320 --> 00:12:56.440 programming solve it? 268 00:12:56.759 --> 00:13:00.240 By using memory to save time, it builds what's called 269 00:13:00.279 --> 00:13:05.320 a dependency graph or a table. Think of it like 270 00:13:05.360 --> 00:13:09.039 getting driving directions. 
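The table idea introduced above can be sketched as the textbook edit distance (Levenshtein) calculation. This is a minimal illustration under the assumption that every insertion, deletion, and substitution costs one step; the function name is hypothetical, and real read aligners use more elaborate scoring than this.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # table[i][j] holds the answer for the first i letters of source
    # versus the first j letters of target: each sub-problem is solved once.
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all i letters
    for j in range(cols):
        table[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                      # deletion
                table[i][j - 1] + 1,                      # insertion
                table[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return table[-1][-1]
```

Calling `edit_distance("ants", "bent")` reproduces the three-step example from the discussion. The work grows with the product of the two string lengths rather than exponentially, because each cell is filled in once and then reused.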
If you want to calculate the absolute 271 00:13:09.159 --> 00:13:11.879 fastest route from New York to Seattle, and part of 272 00:13:11.879 --> 00:13:15.559 your route goes through Chicago, you calculate the Chicago Seattle 273 00:13:15.639 --> 00:13:18.679 leg once. You write that answer down on a sticky note. 274 00:13:18.799 --> 00:13:21.399 Oh okay, so yeah, if you were testing a million 275 00:13:21.440 --> 00:13:24.080 different routes out of New York and a bunch of 276 00:13:24.159 --> 00:13:28.200 them eventually passed through Chicago, you don't mathematically recalculate the 277 00:13:28.240 --> 00:13:30.960 western half of the United States every single time. You 278 00:13:31.000 --> 00:13:32.320 just look at your sticky 279 00:13:31.960 --> 00:13:34.200 note. Right, you've already done that math. Exactly. 280 00:13:34.600 --> 00:13:37.879 Dynamic programming does this for DNA. It solves the tiny 281 00:13:37.960 --> 00:13:40.679 sub-problems of the text, saves the answers in a 282 00:13:40.720 --> 00:13:43.519 massive table, and just looks them up. It drops the 283 00:13:43.519 --> 00:13:46.000 computing time from trillions of years down to minutes. 284 00:13:46.200 --> 00:13:49.240 It caches the answers. That makes total sense. But even 285 00:13:49.240 --> 00:13:52.039 with the sticky notes, searching every edge of a three 286 00:13:52.120 --> 00:13:55.799 billion letter genome for millions of tiny confetti fragments is 287 00:13:55.840 --> 00:13:59.159 still too slow, which brings us to a concept called 288 00:13:59.200 --> 00:14:01.559 a Bloom filter. And I've got to admit, this is 289 00:14:01.559 --> 00:14:04.120 where the computer science gets really counterintuitive for me. 290 00:14:04.480 --> 00:14:06.519 It is a bit mind-bending at first. 291 00:14:06.759 --> 00:14:11.600 It's a space-efficient probabilistic data structure. Basically, it asks 292 00:14:11.639 --> 00:14:15.759 a massive database, does this sequence exist in here? 
Without 293 00:14:15.799 --> 00:14:17.240 actually looking through the data. Yeah. 294 00:14:17.279 --> 00:14:20.320 It uses mathematical hash functions and a simple bit array, 295 00:14:20.559 --> 00:14:23.080 just a microscopic sequence of ones and zeros. When you 296 00:14:23.120 --> 00:14:26.120 insert a genetic sequence into the system, it runs it 297 00:14:26.159 --> 00:14:29.960 through a math formula that flips specific zeros to ones. Okay. 298 00:14:30.360 --> 00:14:32.480 When you want to search for a sequence later, you 299 00:14:32.559 --> 00:14:35.480 run it through the same formula. If all the corresponding 300 00:14:35.519 --> 00:14:38.440 bits are ones, it tells you the item is probably there. 301 00:14:38.799 --> 00:14:41.159 But if even a single bit is a zero, it 302 00:14:41.240 --> 00:14:45.120 guarantees with absolute mathematical certainty that the item is not there. 303 00:14:45.320 --> 00:14:46.919 I was trying to picture this, and it makes me 304 00:14:46.960 --> 00:14:50.639 think of a very strict bouncer at a crowded VIP club. 305 00:14:50.960 --> 00:14:53.679 The bouncer uses a series of quick, weird rules to 306 00:14:53.759 --> 00:14:56.559 check people at the door. Are you wearing red shoes? 307 00:14:56.759 --> 00:14:57.600 Do you have a ticket? 308 00:14:57.840 --> 00:15:00.200 That's a good way to look at it. Right. Now 309 00:15:00.240 --> 00:15:03.200 and then the bouncer might mistakenly let a random person 310 00:15:03.200 --> 00:15:05.480 in who isn't on the list. That's a false positive. 311 00:15:05.639 --> 00:15:08.960 But the bouncer will absolutely never ever turn away someone 312 00:15:09.000 --> 00:15:12.000 who is actually on the list. There are zero false negatives. 313 00:15:12.360 --> 00:15:14.720 But let me challenge this directly. Go for it. Why 314 00:15:14.759 --> 00:15:18.320 would computer scientists intentionally design an algorithm that we know 315 00:15:18.600 --> 00:15:22.240 for a fact gives false positives? 
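The bit array and hash function mechanics described above can be sketched as follows. This is a toy illustration, not the structure any specific genomics tool uses: the class name, the default sizes, and the choice of seeded SHA-256 as the "math formula" are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1  # flip zeros to ones

    def might_contain(self, item: str) -> bool:
        # Any zero bit proves absence; all ones only means "probably present".
        return all(self.bits[position] for position in self._positions(item))
```

A caller would `add` every short sequence of interest, then use `might_contain` as a cheap first-pass check and run the slower exact comparison only on the hits.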
Isn't accuracy the 316 00:15:22.399 --> 00:15:24.279 entire point of medical science? 317 00:15:24.639 --> 00:15:28.039 This raises an important question about computational trade-offs. It's 318 00:15:28.080 --> 00:15:31.519 all about conserving memory and speed. A Bloom filter takes 319 00:15:31.600 --> 00:15:35.679 up an unbelievably small amount of memory. By intentionally allowing 320 00:15:35.759 --> 00:15:39.080 a tiny predictable margin of error, say a one or 321 00:15:39.120 --> 00:15:43.519 two percent false positive rate, we can achieve near instantaneous search 322 00:15:43.320 --> 00:15:45.879 speeds. Because you aren't using the Bloom filter for the 323 00:15:45.919 --> 00:15:48.720 final answer. You use it to instantly discard the ninety 324 00:15:48.799 --> 00:15:52.039 nine percent of the genome where the sequence definitely doesn't belong. 325 00:15:51.840 --> 00:15:54.879 Precisely. You use the cheap, fast algorithm to clear 326 00:15:54.879 --> 00:15:57.440 away the junk, and then you only perform the slow, 327 00:15:57.639 --> 00:16:01.600 rigorous dynamic programming check on the few positive hits. You 328 00:16:01.639 --> 00:16:04.720 save your heavy computational artillery for the targets that actually matter. 329 00:16:04.960 --> 00:16:08.559 Okay, so Bloom filters tell us if a sequence exists somewhere. 330 00:16:09.000 --> 00:16:11.080 But to find exactly where it lives in the genome, 331 00:16:11.120 --> 00:16:13.279 we need an index, like the index at the back 332 00:16:13.320 --> 00:16:15.320 of a textbook telling you which page a word is on. 333 00:16:15.879 --> 00:16:18.519 But when I was looking at the source text, standard 334 00:16:18.519 --> 00:16:22.840 computer indexes for something this large are impossibly bloated. A 335 00:16:22.879 --> 00:16:26.039 standard index, a suffix tree, for the human genome takes 336 00:16:26.120 --> 00:16:28.120 up about forty gigabytes of active 
337 00:16:27.840 --> 00:16:30.840 memory, which is a fatal bottleneck. You can't load forty 338 00:16:30.879 --> 00:16:33.440 gigabytes of data into the RAM of a standard computer. 339 00:16:33.799 --> 00:16:35.879 It means the computer would have to constantly read back 340 00:16:35.879 --> 00:16:38.240 and forth from the hard drive, which slows everything to 341 00:16:38.279 --> 00:16:39.200 an absolute crawl. 342 00:16:39.320 --> 00:16:41.519 And this is where we get to the absolute crown 343 00:16:41.639 --> 00:16:45.559 jewel of this whole deep dive. Researchers Ferragina and Manzini 344 00:16:45.759 --> 00:16:48.559 created the FM index, and they did it using a 345 00:16:48.559 --> 00:16:53.600 mathematical trick called the Burrows-Wheeler transform, or BWT. But honestly, 346 00:16:53.679 --> 00:16:55.720 reading the mechanics of this transform broke my brain a 347 00:16:55.759 --> 00:16:57.679 little bit. How does BWT actually work? 348 00:16:57.960 --> 00:17:02.200 It is notoriously difficult to visualize, but incredibly elegant 349 00:17:02.200 --> 00:17:06.480 once you get it. The BWT is a permutation. It 350 00:17:06.559 --> 00:17:11.279 reorganizes the text. Imagine taking a sequence of letters, rotating 351 00:17:11.279 --> 00:17:13.960 the whole sequence by one letter, writing that down, rotating 352 00:17:14.039 --> 00:17:17.480 it again, and listing out all the possible rotations. Then 353 00:17:17.720 --> 00:17:19.599 you sort those rows alphabetically. 354 00:17:19.720 --> 00:17:21.759 Okay, I'm with you, but why do that? What does 355 00:17:21.799 --> 00:17:25.000 alphabetically sorting a bunch of rotated gibberish actually achieve? 356 00:17:25.559 --> 00:17:28.440 Because of the underlying structure of human language and DNA, 357 00:17:29.039 --> 00:17:32.759 when you sort those rotations alphabetically, a mathematical magic trick happens 358 00:17:32.799 --> 00:17:36.119 in the final column of that list. 
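The rotate, sort, and read-the-last-column recipe described above can be sketched naively like this. Two assumptions go beyond the discussion: an end-of-text marker `$` is appended so the transform can be reversed, which is the usual convention, and every rotation is materialized in memory, which is only workable for toy inputs. Production indexes build the same result from a suffix array instead. The function name is hypothetical.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    text += terminator  # marks the original end of the sequence
    # Every rotation of the text, one letter at a time, sorted alphabetically.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the final character of each sorted row.
    return "".join(row[-1] for row in rotations)
```

For the classic toy input `"banana"`, this returns `"annb$aa"`, where the repeated letters have been pulled next to each other.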
Identical characters suddenly 359 00:17:36.200 --> 00:17:39.440 group together. So instead of a random string like ACGTAC, 360 00:17:39.960 --> 00:17:42.079 the final column will spit out long runs of the 361 00:17:42.119 --> 00:17:44.759 same letter, like AACCGT. Oh. 362 00:17:44.839 --> 00:17:47.759 And because they are grouped together, you can compress them. Exactly. 363 00:17:47.799 --> 00:17:49.200 It's called run-length encoding. 364 00:17:49.359 --> 00:17:51.279 Wait, let me make sure I'm picturing this right. Instead 365 00:17:51.279 --> 00:17:54.720 of the computer wasting memory writing out AAAAA, run-length 366 00:17:54.720 --> 00:17:57.279 encoding just writes five A. Yes, that 367 00:17:57.359 --> 00:18:00.279 single trick allows the FM index to shrink the forty 368 00:18:00.279 --> 00:18:02.359 gigabyte index down to less than two gigabytes. 369 00:18:02.599 --> 00:18:06.440 Here's where it gets really interesting. Suddenly the entire searchable 370 00:18:06.519 --> 00:18:09.559 map of the human genome fits comfortably into the active 371 00:18:09.599 --> 00:18:11.480 memory of a cheap laptop you could buy at a 372 00:18:11.559 --> 00:18:12.359 big box store. 373 00:18:12.839 --> 00:18:16.319 Yeah, that is just staggering. But the compression isn't even 374 00:18:16.359 --> 00:18:19.240 the craziest part. The sources mentioned it allows for 375 00:18:19.599 --> 00:18:24.359 backward search, which sounds impossible. How do you search compressed 376 00:18:24.400 --> 00:18:27.720 data without uncompressing it first? This is the true genius 377 00:18:27.720 --> 00:18:31.319 of the BWT. Because of how the matrix is mathematically structured, 378 00:18:31.559 --> 00:18:34.519 you can jump between the columns to trace a sequence backwards, 379 00:18:34.599 --> 00:18:38.000 letter by letter, without ever unpacking the file. And here 380 00:18:38.079 --> 00:18:40.519 is the kicker. 
The time it takes to search for 381 00:18:40.559 --> 00:18:43.000 a pattern is proportional only to the length of your 382 00:18:43.039 --> 00:18:46.359 query string. It completely ignores the massive size of the 383 00:18:46.359 --> 00:18:47.119 actual genome. 384 00:18:47.200 --> 00:18:49.240 Hold on, you're saying that if I want to search 385 00:18:49.240 --> 00:18:51.839 for a fifty letter sequence, it takes the exact same 386 00:18:51.880 --> 00:18:54.359 amount of time whether I am searching the tiny genome 387 00:18:54.400 --> 00:18:57.640 of a fruit fly or the three billion letter human genome? 388 00:18:57.799 --> 00:19:00.480 Exactly. The size of the haystack no longer matters. The 389 00:19:00.519 --> 00:19:02.599 time it takes only depends on the size of the needle. 390 00:19:02.839 --> 00:19:05.319 It completely democratized genomic research overnight. 391 00:19:05.440 --> 00:19:08.920 Okay, so we've gone from wet chemistry to massive raw data, 392 00:19:09.440 --> 00:19:14.599 to error-correcting files, to mind-blowing compression algorithms. So 393 00:19:14.839 --> 00:19:17.720 what does this all mean? What does all this computational 394 00:19:17.799 --> 00:19:21.240 heavy lifting actually do for the person listening right now? 395 00:19:21.359 --> 00:19:23.440 If you are a patient in a hospital, what are 396 00:19:23.440 --> 00:19:24.279 the stakes? 397 00:19:24.039 --> 00:19:27.279 The stakes are your life. Before these algorithms, the genome 398 00:19:27.359 --> 00:19:30.079 was a black box. Today, because we can search it 399 00:19:30.079 --> 00:19:33.000 so quickly, we discovered incredible things. We found out that 400 00:19:33.039 --> 00:19:36.680 humans only have about twenty thousand protein coding genes. That's 401 00:19:36.720 --> 00:19:38.480