WEBVTT 1 00:00:00.160 --> 00:00:02.759 Welcome to the Deep Dive, where we slice through the 2 00:00:02.759 --> 00:00:07.240 information clutter to bring you the clearest, most important insights. Today, 3 00:00:07.280 --> 00:00:09.400 we're taking a bit of a shortcut to becoming well 4 00:00:09.400 --> 00:00:14.039 informed about a really powerful tool in natural language processing, NLP. 5 00:00:14.279 --> 00:00:18.039 It's called spaCy. That's right, and it's an interesting one. 6 00:00:18.079 --> 00:00:20.800 If you think of those huge language models, you know, 7 00:00:20.879 --> 00:00:24.440 like ChatGPT, maybe, as a big, powerful food processor, okay, 8 00:00:24.600 --> 00:00:29.399 then spaCy is more like your practical, really well-optimized 9 00:00:29.440 --> 00:00:33.200 kitchen knife. It's a library that's specifically designed to help 10 00:00:33.240 --> 00:00:34.560 you get actual work done. 11 00:00:34.560 --> 00:00:36.280 So moving beyond just theory. 12 00:00:36.320 --> 00:00:40.039 Exactly, beyond just academic concepts and into efficient, practical application. 13 00:00:40.719 --> 00:00:43.320 And we're going to uncover some surprising depth today, I 14 00:00:43.359 --> 00:00:46.719 think, from, you know, basic text processing right up to 15 00:00:46.799 --> 00:00:48.439 integrating with the latest AI stuff. 16 00:00:48.560 --> 00:00:51.679 Sounds good. And our mission for this deep dive is 17 00:00:51.719 --> 00:00:56.240 basically to give you a comprehensive but still really accessible 18 00:00:56.560 --> 00:00:59.280 understanding of what spaCy can do. We're drawing from quite 19 00:00:59.280 --> 00:01:05.480 a few sources, including the excellent book Mastering spaCy. Okay, 20 00:01:05.920 --> 00:01:08.599 let's unpack this. So to kick us off, what's the 21 00:01:08.799 --> 00:01:12.400 absolute core thing our listeners should get about spaCy? 22 00:01:12.079 --> 00:01:15.519 Well, at its heart,
spaCy is this incredibly fast open 23 00:01:15.519 --> 00:01:20.920 source Python library, and it's really built for production-ready 24 00:01:21.079 --> 00:01:22.200 NLP applications. 25 00:01:22.280 --> 00:01:24.200 Production-ready. That sounds important. 26 00:01:24.280 --> 00:01:26.599 It is. A lot of its speed comes from using 27 00:01:26.640 --> 00:01:30.079 Cython for the really performance-critical bits, so it's highly 28 00:01:30.079 --> 00:01:33.000 optimized but still easy to use within Python. 29 00:01:33.120 --> 00:01:35.799 Aha. So it's not just another, like, academic tool set. 30 00:01:35.840 --> 00:01:37.840 It's built for real-world stuff from the 31 00:01:37.799 --> 00:01:40.719 get-go. Precisely. That's a key difference compared to maybe 32 00:01:40.760 --> 00:01:43.879 something like NLTK, the Natural Language Toolkit, which, historically at 33 00:01:43.920 --> 00:01:47.159 least, was often more focused on students and researchers. With spaCy, you're 34 00:01:47.239 --> 00:01:48.480 hitting the ground running for deployment. 35 00:01:48.799 --> 00:01:50.519 You mentioned it's built to get work done. Is that, 36 00:01:50.599 --> 00:01:53.040 like, the official philosophy? Pretty much. 37 00:01:53.680 --> 00:01:56.719 Ines Montani, one of the core creators, often talks about this. 38 00:01:57.120 --> 00:01:59.799 The goal is genuinely to help people do their work efficiently. 39 00:02:00.280 --> 00:02:03.040 They're not trying to build some massive do-everything system. 40 00:02:03.159 --> 00:02:06.599 Oh, okay. It's more about providing these sharp, reliable tools, 41 00:02:06.680 --> 00:02:09.960 like that knife, to fit nicely into whatever you're already doing. 42 00:02:10.120 --> 00:02:13.680 Got it. And getting started, is it complex? 43 00:02:15.000 --> 00:02:17.919 Not really. It works with modern Python and runs on, you know, 44 00:02:17.960 --> 00:02:21.039 the usual operating systems: Windows, Mac, Linux.
45 00:02:21.039 --> 00:02:24.319 And best practice is probably virtual environments, right? 46 00:02:24.439 --> 00:02:27.400 Keep things clean. Oh, absolutely, always a good idea for 47 00:02:27.439 --> 00:02:29.680 any Python project. Keeps your dependencies sorted. 48 00:02:29.960 --> 00:02:33.039 Now, you mentioned something important. The language models aren't built 49 00:02:33.039 --> 00:02:33.719 in, correct? 50 00:02:33.800 --> 00:02:37.639 That's a key point. spaCy itself is the framework, the tools, 51 00:02:38.240 --> 00:02:40.919 but for the statistical smarts, things like tagging parts of 52 00:02:40.960 --> 00:02:43.800 speech or finding named entities, you need to download a 53 00:02:43.840 --> 00:02:45.319 language model separately. 54 00:02:45.039 --> 00:02:48.280 Like en_core_web_sm, that kind of thing? Exactly, like en_core_web_sm 55 00:02:48.319 --> 00:02:49.280 for English. 56 00:02:49.319 --> 00:02:52.639 There's a quick command-line thing, python -m spacy download 57 00:02:52.639 --> 00:02:56.360 en_core_web_sm. That downloads the small English model, gets you 58 00:02:56.400 --> 00:02:57.680 the core pipeline components. 59 00:02:58.000 --> 00:02:59.879 Okay, and once you've got that, how do you sort 60 00:02:59.879 --> 00:03:01.319 of see what it's doing? 61 00:03:01.639 --> 00:03:04.639 Oh, well, that's where displaCy comes in. It's spaCy's built 62 00:03:04.680 --> 00:03:08.280 in visualization tool, and it's fantastic. How so? It just 63 00:03:08.360 --> 00:03:14.080 makes really complex linguistic concepts much easier to grasp visually. 64 00:03:14.560 --> 00:03:17.840 You can see dependency parses, how words connect, or see 65 00:03:18.159 --> 00:03:20.879 named entities highlighted right in the text. It helps you 66 00:03:20.960 --> 00:03:22.319 spot patterns almost 67 00:03:22.120 --> 00:03:24.199 instantly. So you can actually see the analysis. 68 00:03:24.280 --> 00:03:26.759 Yeah, you can try it online.
There's a demo, or 69 00:03:26.800 --> 00:03:29.199 you can run it locally from your code, even in 70 00:03:29.280 --> 00:03:32.319 Jupyter notebooks. It's super helpful for understanding what's going on 71 00:03:32.319 --> 00:03:32.879 under the hood. 72 00:03:32.919 --> 00:03:37.520 Okay, so setup done, model downloaded, visualization ready. Let's 73 00:03:37.520 --> 00:03:39.879 talk about the core processing. You mentioned a pipeline. 74 00:03:39.919 --> 00:03:41.960 Yeah, I think of it like an NLP assembly line. 75 00:03:42.120 --> 00:03:44.919 When you load a model, say using spacy.load, 76 00:03:45.360 --> 00:03:48.319 you get back this nlp object. Right. And when you 77 00:03:48.360 --> 00:03:51.680 feed text into that object, like doc = nlp("This is 78 00:03:51.680 --> 00:03:54.360 some text"), it runs the text through a sequence of 79 00:03:54.400 --> 00:03:55.280 processing steps. 80 00:03:55.319 --> 00:03:57.479 The pipeline components. Exactly. 81 00:03:57.520 --> 00:04:01.800 The default pipeline usually includes a tokenizer, a tagger 82 00:04:01.840 --> 00:04:05.159 for part of speech, a dependency parser for sentence structure, 83 00:04:05.719 --> 00:04:10.120 and an entity recognizer, or NER component. Each does 84 00:04:10.159 --> 00:04:11.319 its specific 85 00:04:10.879 --> 00:04:13.680 job, and the output is this Doc object. 86 00:04:13.919 --> 00:04:16.800 Right. The Doc object holds the result. It's not just 87 00:04:16.839 --> 00:04:19.720 the text, it's the text broken down into tokens, and 88 00:04:19.800 --> 00:04:22.519 each token is enriched with all the linguistic features found 89 00:04:22.519 --> 00:04:23.160 by the pipeline. 90 00:04:23.240 --> 00:04:26.240 Let's break down that pipeline. First up, tokenization and sentence 91 00:04:26.279 --> 00:04:29.160 segmentation. Sounds simple, just splitting words? Ah, 92 00:04:29.120 --> 00:04:32.199 well, it's a bit more nuanced than just splitting on spaces.
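As a rough illustration of the load-then-call flow described above, the sketch below loads a pipeline and tokenizes a short string. It assumes spaCy v3 is installed. The fallback to a blank English pipeline is an assumption added here so the snippet still runs when en_core_web_sm has not been downloaded, and the sample sentence is not from the episode.

```python
import spacy

try:
    # Trained small English pipeline (has to be downloaded separately).
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback: tokenizer-only pipeline with no trained components.
    nlp = spacy.blank("en")

text = "Let's go to N.Y.!"
doc = nlp(text)

# Names of the components that run after the tokenizer (empty for the blank fallback).
print(nlp.pipe_names)

# Rule-based tokenization compared with naive whitespace splitting.
tokens = [token.text for token in doc]
print(tokens)
print(text.split())
```

With the trained pipeline loaded, each token would also carry tags, dependencies, and entity labels; the blank fallback only tokenizes.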
93 00:04:32.519 --> 00:04:36.319 Tokenization is breaking the text into its smallest meaningful parts, 94 00:04:36.720 --> 00:04:40.800 the tokens. Words, numbers, punctuation, they all become tokens. Okay. 95 00:04:40.879 --> 00:04:44.639 But here's a surprising detail. Unlike most other pipeline components, 96 00:04:44.879 --> 00:04:47.680 the default tokenizer doesn't rely on a statistical model. 97 00:04:47.759 --> 00:04:48.959 Oh, what does it use? 98 00:04:49.079 --> 00:04:52.680 It uses really carefully crafted, language-specific rules, which makes 99 00:04:52.720 --> 00:04:55.240 it very fast and predictable. And you can even customize it. 100 00:04:55.279 --> 00:04:57.439 You can add special cases, like telling it how to 101 00:04:57.480 --> 00:04:59.480 handle slang or specific abbreviations. 102 00:04:59.560 --> 00:05:02.199 Like teaching it that "lemme" should be "lem" and "me"? Exactly, 103 00:05:02.240 --> 00:05:04.600 that kind of thing. Gives you fine-grained control. 104 00:05:04.360 --> 00:05:08.839 And sentence segmentation, finding sentence boundaries? That's 105 00:05:08.639 --> 00:05:13.600 actually often more complex than tokenization. Think about abbreviations like 106 00:05:13.720 --> 00:05:19.279 "Mr." or complex punctuation. spaCy has a unique approach here. 107 00:05:19.360 --> 00:05:22.920 What's that? It often uses the dependency parser, which understands 108 00:05:22.920 --> 00:05:26.439 sentence structure, to help figure out sentence boundaries really accurately. 109 00:05:26.480 --> 00:05:28.399 It's quite a sophisticated design choice. 110 00:05:28.399 --> 00:05:32.480 Interesting. Okay, next step: lemmatization, getting the root word? Yep. 111 00:05:32.639 --> 00:05:35.439 The lemma is the base or dictionary form. So, like 112 00:05:35.480 --> 00:05:38.160 you said, "eating," "eats," "eat," "ate," they all boil down 113 00:05:38.160 --> 00:05:39.240 to the lemma "eat." 114 00:05:39.240 --> 00:05:40.959 How useful is that in practice?
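The tokenizer special case mentioned a moment ago can be sketched as follows. This is a minimal example assuming spaCy v3 and a blank English pipeline (so no model download is needed); the sample sentence is made up.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

before = [t.text for t in nlp("lemme see that")]

# Special-case rule for the exact string "lemme".
# The ORTH pieces must concatenate back to the original string.
nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])

after = [t.text for t in nlp("lemme see that")]
print(before)
print(after)
```

Special cases are keyed on the exact string, so a capitalized "Lemme" would need its own rule.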
115 00:05:41.079 --> 00:05:44.680 Oh, incredibly useful. Think about a chatbot for booking flights. 116 00:05:45.079 --> 00:05:47.120 A user might say "I want to fly," or "show 117 00:05:47.160 --> 00:05:48.959 me flights," or "I flew yesterday." 118 00:05:49.079 --> 00:05:52.040 Right, different forms of the same core idea. Exactly. 119 00:05:52.160 --> 00:05:55.920 Lemmatization reduces "fly" and "flew" down to "fly," and "flights" to "flight," so 120 00:05:56.000 --> 00:05:58.040 your system only needs to look for a couple of base 121 00:05:58.120 --> 00:06:02.560 forms to understand the core intent. It simplifies things massively. 122 00:06:02.199 --> 00:06:04.279 Makes sense, and you could use it for other things too, 123 00:06:04.360 --> 00:06:05.399 like place names. 124 00:06:05.439 --> 00:06:08.839 Definitely. Maybe someone says "Angeltown" when they mean Los Angeles. 125 00:06:09.160 --> 00:06:11.879 You can actually add custom rules using something called an 126 00:06:11.879 --> 00:06:14.920 AttributeRuler to map "Angeltown" to the canonical "Los 127 00:06:14.959 --> 00:06:19.120 Angeles" lemma during processing. Ensures consistency. 128 00:06:18.600 --> 00:06:22.759 So spaCy processes the text, applies these steps, and stores 129 00:06:22.800 --> 00:06:26.600 the results. You mentioned container objects: Doc, Token, Span. 130 00:06:26.560 --> 00:06:29.480 Right, these are your main ways of accessing the processed information. 131 00:06:30.000 --> 00:06:33.480 The Doc object represents the whole processed text. Okay. If 132 00:06:33.519 --> 00:06:35.480 you loop over a Doc, like for token in doc, 133 00:06:35.959 --> 00:06:38.000 you get individual Token objects. 134 00:06:37.639 --> 00:06:39.839 And each token knows things about itself. 135 00:06:39.560 --> 00:06:42.120 Loads of things. A Token object holds the original word, 136 00:06:42.160 --> 00:06:45.439 its lemma, its part-of-speech tag, its dependency relation.
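The custom lemma rule for "Angeltown" mentioned above might look like the sketch below. It assumes spaCy v3 and adds an attribute_ruler component to a blank English pipeline; in a trained pipeline the component usually exists already and would be fetched with nlp.get_pipe instead. The sample sentence is made up.

```python
import spacy

nlp = spacy.blank("en")

# The AttributeRuler sets token attributes on tokens matched by Matcher-style patterns.
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(
    patterns=[[{"LOWER": "angeltown"}]],   # one single-token pattern
    attrs={"LEMMA": "Los Angeles"},        # attribute to assign on a match
)

doc = nlp("I grew up in Angeltown.")
lemmas = {token.text: token.lemma_ for token in doc}
print(lemmas["Angeltown"])
```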
137 00:06:45.839 --> 00:06:48.480 It also has boolean flags like token.is_punct, 138 00:06:48.560 --> 00:06:51.519 token.is_currency, token.like_url, token. 139 00:06:51.600 --> 00:06:52.480 like_num. Wow. 140 00:06:52.560 --> 00:06:54.399 Okay, so you can check if a token looks like 141 00:06:54.439 --> 00:06:56.560 a URL or a number easily? Yep. 142 00:06:56.759 --> 00:06:58.879 And it knows its entity type if it's part of 143 00:06:58.920 --> 00:07:02.319 one, like token.ent_type_ might be PERSON or ORG. 144 00:07:02.879 --> 00:07:05.959 It even has a token.shape_ attribute that gives 145 00:07:06.000 --> 00:07:08.600 you a kind of abstract representation of the word's orthography, 146 00:07:08.720 --> 00:07:11.040 like is it capitalized, is it all digits, et cetera. 147 00:07:11.360 --> 00:07:13.439 Really useful for rule-based matching. 148 00:07:13.279 --> 00:07:14.680 And Span? Where does that fit in? 149 00:07:15.079 --> 00:07:17.680 A Span is just a slice of the Doc representing 150 00:07:17.720 --> 00:07:21.240 multiple tokens. Sentences are Span objects. You can get them 151 00:07:21.319 --> 00:07:24.920 via doc.sents. Named entities are also Span objects, 152 00:07:25.040 --> 00:07:29.319 accessible via doc.ents. So Doc, Token, Span are 153 00:07:29.399 --> 00:07:31.279 how you navigate and use the processed 154 00:07:31.000 --> 00:07:33.519 text. Got it. Let's move into some of those linguistic 155 00:07:33.560 --> 00:07:40.040 features. Part-of-speech tagging, POS tagging. That's identifying nouns, verbs, adjectives? 156 00:07:39.480 --> 00:07:43.120 Exactly, categorizing words by their grammatical role in the sentence. 157 00:07:43.199 --> 00:07:45.000 And how does spaCy figure that out? Is it 158 00:07:45.040 --> 00:07:46.079 just a dictionary lookup? 159 00:07:46.160 --> 00:07:48.000 Oh no, it's much smarter than that. It looks at 160 00:07:48.000 --> 00:07:51.439 the word in context.
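The token flags and Span containers described above can be inspected with a few lines. This is a sketch assuming spaCy v3. It uses a blank English pipeline plus the rule-based sentencizer component, a stand-in chosen here because the blank pipeline has no parser to set sentence boundaries. The sample text is made up.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based sentence splitter; trained pipelines usually get boundaries from the parser.
nlp.add_pipe("sentencizer")

doc = nlp("It costs $20. See https://spacy.io for details.")

# Lexical attributes are available even without trained components.
for token in doc:
    print(token.text, token.shape_, token.is_punct,
          token.is_currency, token.like_num, token.like_url)

# A Span is a slice of the Doc; sentences are Spans too.
span = doc[1:4]
sentences = [sent.text for sent in doc.sents]
print(span.text)
print(sentences)
```

doc.ents stays empty here because nothing in this pipeline assigns entities.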
The surrounding words heavily influence the tag. 161 00:07:51.920 --> 00:07:56.959 It uses sequential statistical models trained on large amounts of text. 162 00:07:56.680 --> 00:07:58.319 So the same word could get different tags? 163 00:07:58.639 --> 00:08:02.079 Absolutely. Think of the word "book." "I read a book," 164 00:08:02.519 --> 00:08:05.839 noun, versus "I want to book a flight," verb. The 165 00:08:05.920 --> 00:08:08.199 context tells the tagger which role it's playing. 166 00:08:08.720 --> 00:08:11.160 And why is this useful beyond just grammar? 167 00:08:11.839 --> 00:08:14.959 Well, it's really important for understanding meaning, especially for word 168 00:08:15.000 --> 00:08:18.279 sense disambiguation, figuring out which meaning of a word is intended. 169 00:08:18.439 --> 00:08:19.399 Can you give an example? 170 00:08:19.560 --> 00:08:22.319 Sure, take the word "beat." It can mean many things. 171 00:08:22.639 --> 00:08:25.040 But if the POS tagger confidently tags it as an 172 00:08:25.079 --> 00:08:28.800 adjective, ADJ, as in "I'm totally beat," you know it 173 00:08:28.839 --> 00:08:31.040 almost certainly means exhausted. Ah, 174 00:08:31.079 --> 00:08:33.720 I see. The tag helps narrow down the meaning. 175 00:08:33.559 --> 00:08:36.519 Precisely, even if the verb or noun tags might still 176 00:08:36.519 --> 00:08:39.919 be ambiguous: "beat the drum" versus "follow the beat." The 177 00:08:40.000 --> 00:08:43.080 adjective tag is often quite specific. It adds a layer 178 00:08:43.120 --> 00:08:46.200 of understanding, even if lemmatization kind of flattens out things 179 00:08:46.279 --> 00:08:46.960 like verb tense. 180 00:08:47.080 --> 00:08:50.799 Okay, that makes sense. Next up, dependency parsing. This sounds 181 00:08:50.840 --> 00:08:53.639 a bit more complex. Mapping sentence relationships? 182 00:08:53.720 --> 00:08:57.600 It is complex but incredibly powerful.
Dependency parsing represents the 183 00:08:57.600 --> 00:09:01.000 grammatical structure of a sentence not just as a flat sequence, 184 00:09:01.159 --> 00:09:03.600 but as a tree of relationships. It shows how words 185 00:09:03.600 --> 00:09:04.279 depend on each 186 00:09:04.200 --> 00:09:06.799 other. Head and dependent? Exactly. Each 187 00:09:06.679 --> 00:09:09.799 word, except usually the main verb, the root, has a 188 00:09:09.840 --> 00:09:12.840 head word it modifies or relates to, and a specific 189 00:09:12.919 --> 00:09:17.559 dependency label describes that relationship, like nsubj for nominal subject, 190 00:09:17.679 --> 00:09:20.840 or dobj for direct object. Why go to all this trouble? Well, 191 00:09:20.879 --> 00:09:24.120 what's fascinating here is that sentences aren't just sequences of tokens. 192 00:09:24.360 --> 00:09:27.840 They have this deep, inherent structure, and understanding that structure 193 00:09:27.879 --> 00:09:32.000 is absolutely crucial for many real-world NLP tasks. Like 194 00:09:32.080 --> 00:09:36.159 what? Think about chatbots or machine translation. You need 195 00:09:36.200 --> 00:09:39.559 to know who did what to whom. Consider "I forwarded 196 00:09:39.600 --> 00:09:42.279 you the email" versus "you forwarded me the email." 197 00:09:42.440 --> 00:09:44.440 Same words, totally different meaning. Exactly. 198 00:09:44.679 --> 00:09:47.879 Dependency parsing helps the system figure out that "I" is 199 00:09:47.879 --> 00:09:50.120 the subject, the one doing the forwarding, in the first sentence, 200 00:09:50.320 --> 00:09:53.279 and "you" is the subject in the second. It disambiguates 201 00:09:53.320 --> 00:09:58.399 the roles based on the grammatical structure, the nsubj, dobj, iobj relationships. 202 00:09:58.840 --> 00:10:02.159 Without that, understanding user intent would be much, much 203 00:10:02.000 --> 00:10:05.159 harder. Right, that makes the importance clear.
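To make the "who did what to whom" point concrete, here is a small sketch that reads subject and object off a dependency tree. It assumes spaCy v3. The parse is written by hand as a stand-in for real parser output (so no model is needed), the helper who_did_what is a hypothetical name, and the "dative" label on the recipient is an illustrative annotation for the role the episode calls the indirect object.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

def who_did_what(doc):
    """Return (subject, verb, direct object) read off the root's children."""
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    subject = next(c.text for c in root.children if c.dep_ == "nsubj")
    obj = next(c.text for c in root.children if c.dep_ == "dobj")
    return subject, root.text, obj

# Hand-written annotations (assumption, not model output).
# heads[i] is the index of token i's head; the root points to itself.
heads = [1, 1, 1, 4, 1]
deps = ["nsubj", "ROOT", "dative", "det", "dobj"]

first = Doc(nlp.vocab, words=["I", "forwarded", "you", "the", "email"],
            heads=heads, deps=deps)
second = Doc(nlp.vocab, words=["You", "forwarded", "me", "the", "email"],
             heads=heads, deps=deps)

print(who_did_what(first))
print(who_did_what(second))
```

With a trained pipeline, the same helper would work on nlp("I forwarded you the email"), provided the parser assigns these labels.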
Okay, what about 204 00:10:05.519 --> 00:10:09.519 named entity recognition, NER? Spotting real-world objects? 205 00:10:09.799 --> 00:10:12.840 Yep. A named entity is basically anything that can be 206 00:10:12.879 --> 00:10:16.080 referred to with a proper name or a quantity. So 207 00:10:16.399 --> 00:10:21.480 people's names, company names, locations, dates, monetary values, percentages. 208 00:10:21.879 --> 00:10:27.360 The categories seem pretty standard: PERSON, ORG, GPE for geopolitical entity. 209 00:10:27.440 --> 00:10:30.200 Those are common ones, yes, but the specific set of 210 00:10:30.320 --> 00:10:33.000 entity types is actually quite flexible and often depends on 211 00:10:33.039 --> 00:10:35.120 the data the model was trained on or the 212 00:10:35.159 --> 00:10:38.399 specific task you have in mind. How so? Well, if 213 00:10:38.399 --> 00:10:42.960 you're analyzing financial news, entities like MONEY and PERCENT might 214 00:10:43.000 --> 00:10:46.360 be way more important and frequent than, say, WORK_OF_ART. 215 00:10:46.840 --> 00:10:49.480 The model needs to be tailored or chosen based on 216 00:10:49.519 --> 00:10:50.000 the domain. 217 00:10:50.159 --> 00:10:52.000 And how good is NER these days? 218 00:10:51.759 --> 00:10:54.200 It's gotten incredibly good. The state-of-the-art methods 219 00:10:54.240 --> 00:10:58.039 often use those transformer architectures we mentioned earlier. They're very 220 00:10:58.039 --> 00:11:01.480 effective at understanding context to identify entities accurately. 221 00:11:01.679 --> 00:11:05.919 Okay. And sometimes the default tokenization or entity spans might 222 00:11:05.960 --> 00:11:07.759 not be quite right. Can you fix them? 223 00:11:08.039 --> 00:11:10.919 Yes, absolutely. spaCy provides a really neat mechanism called doc. 224 00:11:11.000 --> 00:11:14.279 retokenize. It lets you merge multiple tokens into one, 225 00:11:14.639 --> 00:11:16.399 or split a single token into several.
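A merge with doc.retokenize, as just described, can be sketched like this. It assumes spaCy v3 and a blank English pipeline; the sample sentence and the lemma assigned to the merged token are illustrative choices.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I moved to New York City last year.")
before = [t.text for t in doc]

# Merge the three tokens of "New York City" into a single token.
# Changes are applied when the context manager exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6], attrs={"LEMMA": "New York City"})

after = [t.text for t in doc]
print(before)
print(after)
```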
226 00:11:16.440 --> 00:11:17.519 Why would you need to do that? 227 00:11:17.759 --> 00:11:19.799 Well, maybe an entity like "New York City" got split 228 00:11:19.840 --> 00:11:22.200 into three tokens, but you want to treat it as 229 00:11:22.240 --> 00:11:25.120 a single unit for analysis. You can merge them. Or 230 00:11:25.120 --> 00:11:28.399 maybe a typo resulted in "SanFrancisco" being one token 231 00:11:28.639 --> 00:11:29.519 and you want to split it. 232 00:11:29.639 --> 00:11:33.240 Ah, okay. So for cleanup and normalization. 233 00:11:33.120 --> 00:11:36.919 Exactly. Merging is usually simpler. Splitting can be a bit 234 00:11:36.919 --> 00:11:39.360 more involved because you then need to spell out the 235 00:11:39.480 --> 00:11:42.840 linguistic features and dependencies for the new tokens you've created. 236 00:11:43.039 --> 00:11:45.919 But it's a very powerful tool for practical adjustments. 237 00:11:46.200 --> 00:11:50.159 Let's shift gears slightly to rule-based matching. You mentioned 238 00:11:50.240 --> 00:11:53.799 regular expressions can be tricky. What's spaCy's alternative? 239 00:11:54.320 --> 00:11:58.200 spaCy offers the Matcher class, and it's designed to be a, well, 240 00:11:58.200 --> 00:12:02.399 a much cleaner, more readable, and definitely more maintainable alternative 241 00:12:02.639 --> 00:12:05.120 for finding patterns in text compared to regex. 242 00:12:05.320 --> 00:12:06.720 Why is regex problematic? 243 00:12:06.919 --> 00:12:09.559 Regular expressions can just become incredibly dense and hard to read, 244 00:12:10.000 --> 00:12:13.559 especially for complex patterns. They're also easy to get subtly wrong, 245 00:12:13.799 --> 00:12:16.039 which can lead to bugs that are hard to track down, 246 00:12:16.399 --> 00:12:17.679 and they operate purely on 247 00:12:17.600 --> 00:12:19.799 strings. And the Matcher is different how? 248 00:12:19.759 --> 00:12:22.840 The Matcher works with Token objects and their attributes.
You 249 00:12:22.919 --> 00:12:26.399 define patterns not as strings, but as lists of dictionaries, 250 00:12:26.639 --> 00:12:28.960 where each dictionary specifies the attributes 251 00:12:29.000 --> 00:12:31.360 a token must have. Like LOWER to match the word 252 00:12:31.360 --> 00:12:32.919 "hello" regardless 253 00:12:32.360 --> 00:12:36.600 of case? Precisely. Or IS_PUNCT: True to match any 254 00:12:36.600 --> 00:12:40.759 punctuation mark, or LIKE_NUM: True for number-like tokens. 255 00:12:41.039 --> 00:12:44.240 You're matching based on linguistic features, not just character sequences. 256 00:12:44.320 --> 00:12:46.000 That sounds much more robust. 257 00:12:45.799 --> 00:12:48.600 It is, and you can use extended syntax too. You 258 00:12:48.639 --> 00:12:51.440 can match based on token length, LENGTH, check if a 259 00:12:51.480 --> 00:12:54.399 token is in a list, IN, or use boolean 260 00:12:54.480 --> 00:12:58.519 flags like IS_DIGIT, IS_ALPHA, IS_UPPER. Great for finding, say, 261 00:12:58.720 --> 00:13:00.360 emphasized words in all caps. 262 00:13:00.480 --> 00:13:03.480 Does it have regex-like operators, like optional parts? 263 00:13:03.639 --> 00:13:06.399 Yes, you can use operators like "?" to make a 264 00:13:06.440 --> 00:13:09.840 token pattern optional. Think about matching names like "Barack Obama" 265 00:13:09.919 --> 00:13:13.159 but also "Barack Hussein Obama." The middle-name token can 266 00:13:13.159 --> 00:13:17.039 be marked as optional, and you have operators like "+", 267 00:13:17.399 --> 00:13:21.480 one or more, and "*", zero or more, for specifying occurrences, 268 00:13:21.519 --> 00:13:25.320 similar to regex. There's even a really useful online demo 269 00:13:25.639 --> 00:13:28.320 on the spaCy website where you can build and test 270 00:13:28.399 --> 00:13:29.799 Matcher patterns interactively. 271 00:13:29.919 --> 00:13:32.799 Okay, that covers matching specific patterns.
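The Matcher patterns discussed above, one dictionary per token with an optional middle token, can be sketched as follows. This assumes spaCy v3 and a blank English pipeline; the rule names FULL_NAME and SHOUT and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token; "OP": "?" makes the middle-name token optional.
name_pattern = [
    {"LOWER": "barack"},
    {"LOWER": "hussein", "OP": "?"},
    {"LOWER": "obama"},
]
# A single alphabetic, all-caps token longer than one character (emphasis).
shout_pattern = [{"IS_ALPHA": True, "IS_UPPER": True, "LENGTH": {">": 1}}]

matcher.add("FULL_NAME", [name_pattern])
matcher.add("SHOUT", [shout_pattern])

doc = nlp("Barack Obama and Barack Hussein Obama were GREAT speakers.")

# Each match is (match_id, start, end); the id maps back to the rule name.
found = sorted(
    (doc[start:end].text, nlp.vocab.strings[match_id])
    for match_id, start, end in matcher(doc)
)
print(found)
```

Because patterns are written against token attributes, the same rule keeps working if the casing or surrounding punctuation changes.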
What if you have, 272 00:13:32.919 --> 00:13:35.799 like, a huge list of things to find, say thousands 273 00:13:35.840 --> 00:13:37.200 of product names? Right. 274 00:13:37.200 --> 00:13:41.039 Creating individual Matcher patterns for thousands of specific phrases would 275 00:13:41.039 --> 00:13:44.039 be, well, not very efficient or practical. 276 00:13:44.200 --> 00:13:45.519 So what's the solution for that? 277 00:13:46.000 --> 00:13:49.960 spaCy provides the PhraseMatcher. It's optimized specifically for efficiently 278 00:13:50.000 --> 00:13:53.720 scanning text against large lists of multi-word phrases or dictionaries. 279 00:13:53.840 --> 00:13:54.559 How does that work? 280 00:13:54.639 --> 00:13:56.799 You give it a list of Doc objects representing the 281 00:13:56.840 --> 00:14:00.000 phrases you want to find, like "Angela Merkel," "Donald Trump," 282 00:14:00.159 --> 00:14:03.720 "Alexis Tsipras." It then uses a really efficient algorithm to 283 00:14:03.799 --> 00:14:07.200 find all occurrences of those exact phrases in your target text, 284 00:14:07.759 --> 00:14:10.679 much faster than running thousands of individual rules. 285 00:14:10.759 --> 00:14:14.080 Very useful for terminology lists or gazetteers. Exactly. 286 00:14:14.200 --> 00:14:17.000 And it can even match based on token attributes, not 287 00:14:17.039 --> 00:14:19.799 just the exact words. For instance, you could match based 288 00:14:19.799 --> 00:14:22.519 on the shape attribute, which is handy for finding structured 289 00:14:22.600 --> 00:14:26.279 data like IP addresses or specific code patterns in log files, 290 00:14:26.519 --> 00:14:28.519 even if the exact digits change. 291 00:14:28.840 --> 00:14:31.360 So you have the Matcher for flexible patterns and the Phrase 292 00:14:31.360 --> 00:14:35.200 Matcher for large lists. How do you integrate these findings 293 00:14:35.279 --> 00:14:37.960 back into the main spaCy Doc?
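For the list-of-phrases case described above, a PhraseMatcher sketch might look like this. It assumes spaCy v3 and a blank English pipeline; the rule name POLITICIAN and the sample sentence are illustrative, while the three names come from the discussion.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

terms = ["Angela Merkel", "Donald Trump", "Alexis Tsipras"]
# make_doc only runs the tokenizer, which is all exact-text matching needs.
patterns = [nlp.make_doc(term) for term in terms]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("POLITICIAN", patterns)

doc = nlp("Angela Merkel met Alexis Tsipras before calling Donald Trump.")
hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(hits)
```

For the shape-based matching mentioned in the discussion, the matcher would be created with attr="SHAPE" instead of the default exact-text comparison.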
That's where the Span 294 00:14:38.039 --> 00:14:40.559 Ruler comes in. It's a pipeline component that lets you 295 00:14:40.679 --> 00:14:44.200 use rules, defined very similarly to Matcher patterns, to 296 00:14:44.360 --> 00:14:47.720 directly add Span objects to your Doc. Add them where? To 297 00:14:47.879 --> 00:14:49.879 doc.spans. You can also configure it to add them 298 00:14:49.879 --> 00:14:52.799 to doc.ents, so effectively adding rule-based named entities. 299 00:14:53.039 --> 00:14:54.519 Or you can have it add them to a custom 300 00:14:54.559 --> 00:14:57.200 span group like doc.spans["my_custom_patterns"]. So 301 00:14:57.159 --> 00:14:59.279 you add it to the pipeline like other components? 302 00:14:58.919 --> 00:15:02.120 Yep, nlp.add_pipe("span_ruler"). Then you 303 00:15:02.120 --> 00:15:04.519 provide it with your patterns. For example, you could define 304 00:15:04.559 --> 00:15:07.200 a pattern to find every instance of the word "Chime" 305 00:15:07.600 --> 00:15:09.600 and label it as an ORG entity. 306 00:15:09.759 --> 00:15:13.759 What if the regular NER model also finds entities? Do 307 00:15:13.840 --> 00:15:14.679 they clash? 308 00:15:14.759 --> 00:15:18.919 Good question. You can configure the Span Ruler. You can 309 00:15:18.960 --> 00:15:23.000 tell it whether your rule-based entities should overwrite entities 310 00:15:23.080 --> 00:15:27.679 found by the statistical NER model, overwrite=True, or not, 311 00:15:28.279 --> 00:15:31.039 overwrite=False. You can also set it up so that 312 00:15:31.080 --> 00:15:35.039 statistical entities don't overwrite your rule-based ones. Gives you 313 00:15:35.080 --> 00:15:37.720 control over which source of entities takes precedence. 314 00:15:37.840 --> 00:15:41.399 Okay, this rule-based stuff seems really practical. Can we 315 00:15:41.440 --> 00:15:45.600 talk about some specific recipes, like real-world extraction examples?
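The Span Ruler setup described above can be sketched as below. It assumes a spaCy v3 release recent enough to ship the span_ruler component, and it uses a blank English pipeline, so there is no statistical NER to clash with. The sample sentence is made up.

```python
import spacy

nlp = spacy.blank("en")

# Matches go to doc.spans under the default key "ruler";
# annotate_ents=True additionally writes them to doc.ents.
ruler = nlp.add_pipe(
    "span_ruler",
    config={"annotate_ents": True, "overwrite": True},
)
ruler.add_patterns([{"label": "ORG", "pattern": [{"LOWER": "chime"}]}])

doc = nlp("Chime announced a new feature; analysts say Chime is growing.")
span_hits = [(span.text, span.label_) for span in doc.spans["ruler"]]
ent_hits = [(ent.text, ent.label_) for ent in doc.ents]
print(span_hits)
print(ent_hits)
```

In a trained pipeline, where the component sits relative to the ner component and how overwrite is set decide which source of entities wins.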
316 00:15:45.799 --> 00:15:49.000 Absolutely, here's where it gets really interesting, showing spaCy's power. 317 00:15:49.519 --> 00:15:53.000 So you can easily build patterns to extract things like IBANs, 318 00:15:53.080 --> 00:15:56.759 international bank account numbers, or phone numbers, these highly structured 319 00:15:56.840 --> 00:15:57.519 numeric things. 320 00:15:57.559 --> 00:15:58.279 Okay, what else? 321 00:15:58.320 --> 00:16:01.919 Think about social media. You could create patterns to find mentions 322 00:16:02.000 --> 00:16:05.799 expressing opinions, like matching the sequence business name plus is/was/ 323 00:16:05.879 --> 00:16:08.840 be plus maybe an adverb plus an adjective. 324 00:16:08.519 --> 00:16:10.639 Like finding "Cafe X was really great"? 325 00:16:10.480 --> 00:16:14.200 Exactly. That pattern structure, "Cafe X," "was," adverb, adjective, could 326 00:16:14.200 --> 00:16:16.879 pick up "Cafe X is good," "Cafe Y was very slow," 327 00:16:17.120 --> 00:16:20.399 "Restaurant Z will be amazing." Helps you gauge sentiment. Clever. 328 00:16:20.720 --> 00:16:21.600 Other examples? 329 00:16:21.759 --> 00:16:24.919 Hashtags are easy. You can match the hashtag symbol followed 330 00:16:24.919 --> 00:16:28.320 by tokens that meet certain criteria, like IS_ASCII or IS_ALPHA, 331 00:16:28.879 --> 00:16:32.080 to reliably pull out things like #DeepLearning or 332 00:16:32.159 --> 00:16:33.279 #WeekendFun. 333 00:16:33.559 --> 00:16:36.039 And what about slightly more complex entities? 334 00:16:36.399 --> 00:16:40.159 You can even use patterns to refine entities. For example, 335 00:16:40.200 --> 00:16:42.639 maybe the NER just picks up "Smith" as a person.
336 00:16:43.279 --> 00:16:45.000 You could use a Matcher pattern to look for 337 00:16:45.039 --> 00:16:48.440 a preceding title like Mr., Ms., Dr., and then 338 00:16:48.519 --> 00:16:51.840 retokenize to merge the title and the name into a single, 339 00:16:52.080 --> 00:16:54.399 more complete entity span, "Ms. Smith." 340 00:16:54.559 --> 00:16:57.159 Wow. Okay, that's quite granular 341 00:16:56.720 --> 00:16:59.799 control. It really is. These rule-based tools, combined with 342 00:16:59.840 --> 00:17:02.039 the linguistic features, give you a lot of power for 343 00:17:02.159 --> 00:17:03.639 precise information extraction. 344 00:17:04.000 --> 00:17:07.160 Let's push deeper now into understanding meaning and intent. How 345 00:17:07.200 --> 00:17:09.759 does spaCy help with semantic parsing, figuring out what a 346 00:17:09.880 --> 00:17:11.799 user actually wants? A great 347 00:17:11.599 --> 00:17:13.880 way to explore this is with data sets like ATIS, 348 00:17:13.960 --> 00:17:17.759 the Airline Travel Information System. It contains thousands of 349 00:17:17.799 --> 00:17:19.400 real user requests about 350 00:17:19.119 --> 00:17:22.440 flights. Like "show me flights from Boston to Denver"? Exactly. 351 00:17:22.680 --> 00:17:25.880 Or "what's the cheapest flight?" "What meals are served on 352 00:17:25.960 --> 00:17:31.079 flight X?" Analyzing these requires understanding not just the words, 353 00:17:31.559 --> 00:17:32.720 but the underlying goal. 354 00:17:33.079 --> 00:17:35.000 Where do you even start with something like that? 355 00:17:35.200 --> 00:17:38.359 Well, a really crucial first step, honestly, is just looking 356 00:17:38.400 --> 00:17:41.799 at the data yourself. Read through a sample of the utterances, 357 00:17:42.240 --> 00:17:44.640 get a feel for the common patterns, the types of 358 00:17:44.759 --> 00:17:47.039 entities involved, the grammar people use.
359 00:17:47.240 --> 00:17:49.480 What kind of things would you look for in the 360 00:17:49.799 --> 00:17:50.559 ATIS data? 361 00:17:51.000 --> 00:17:55.440 You'd quickly notice people specifying origins and destinations. But it's 362 00:17:55.440 --> 00:17:58.240 not enough just to spot Boston and Denver. You need 363 00:17:58.279 --> 00:18:01.720 to capture the relationship: from Boston, to Denver. You'd see 364 00:18:01.720 --> 00:18:05.519 the importance of prepositions like "from," "to," "in." Those little 365 00:18:05.519 --> 00:18:07.400 words carry a lot of semantic 366 00:18:06.960 --> 00:18:09.440 weight. So you need more than just finding keywords. 367 00:18:09.559 --> 00:18:12.759 Definitely, you need to understand the relationships between the words. 368 00:18:13.200 --> 00:18:15.200 And that's where spaCy's DependencyMatcher 369 00:18:15.119 --> 00:18:17.279 comes in. Another matcher? How's this one different? 370 00:18:17.400 --> 00:18:19.920 Well, the Matcher looks for sequences of tokens based on 371 00:18:19.960 --> 00:18:23.519 their attributes. The DependencyMatcher looks for patterns based on 372 00:18:23.559 --> 00:18:26.319 the syntactic dependency relationships between tokens. 373 00:18:26.599 --> 00:18:29.880 Ah, using that dependency parse tree we talked about earlier. 374 00:18:29.559 --> 00:18:33.359 Precisely. It lets you find patterns like a verb connected 375 00:18:33.359 --> 00:18:37.079 to a noun with a direct object relationship, dobj. 376 00:18:38.440 --> 00:18:40.319 This is key for identifying intent. 377 00:18:40.599 --> 00:18:43.400 Can you give a quick linguistic primer on that? Objects? 378 00:18:43.519 --> 00:18:47.079 Sure. So, very basically, you have transitive verbs, which need 379 00:18:47.119 --> 00:18:50.519 an object to act upon, like "I bought flowers." "Flowers" 380 00:18:50.640 --> 00:18:53.640 is a direct object. And intransitive verbs, which don't, 381 00:18:54.079 --> 00:18:57.880 like "I slept." Okay.
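The verb-plus-direct-object pattern described above can be expressed with the DependencyMatcher roughly as follows. This is a sketch assuming spaCy v3. The parse of "I bought flowers" is written by hand as a stand-in for parser output (so no model is needed), and the rule name VERB_DOBJ and the node names "verb" and "object" are made up for illustration.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations standing in for parser output (assumption).
doc = Doc(
    nlp.vocab,
    words=["I", "bought", "flowers"],
    pos=["PRON", "VERB", "NOUN"],
    heads=[1, 1, 1],                 # index of each token's head; root points to itself
    deps=["nsubj", "ROOT", "dobj"],
)

# Anchor on a verb, then require an immediate dependent labelled dobj.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("VERB_DOBJ", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
pairs = [(doc[ids[0]].text, doc[ids[1]].text) for _, ids in matcher(doc)]
print(pairs)
```

Because the pattern is written against the tree rather than word order, the verb and its object do not need to be adjacent in the text.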
And sometimes there's an indirect object too, 382 00:18:57.920 --> 00:18:59.799 like "I gave him the book." "Book" is direct, "him" 383 00:18:59.799 --> 00:19:03.519 is indirect. The DependencyMatcher lets you specify these relationships 384 00:19:03.519 --> 00:19:04.319 in your patterns. 385 00:19:04.519 --> 00:19:08.240 How does that help find intent in the flight examples? 386 00:19:08.000 --> 00:19:10.319 Well, you could define a pattern looking for a verb 387 00:19:10.480 --> 00:19:13.799 like "show" or "find" that has a direct object, dobj, 388 00:19:14.119 --> 00:19:18.640 like "flights." That pattern, defined using dependency relations, would match 389 00:19:19.000 --> 00:19:21.880 "show me flights," "find flights," "I need you to show flights," etc., 390 00:19:22.440 --> 00:19:25.039 capturing the core intent regardless of the exact phrasing. 391 00:19:25.160 --> 00:19:27.839 That seems much more robust than just keyword spotting. 392 00:19:28.119 --> 00:19:30.599 It is, and you can build more complex patterns. What 393 00:19:30.640 --> 00:19:34.160 if someone says "show all flights and fares"? The Dependency 394 00:19:34.160 --> 00:19:38.599 Matcher can use the conj dependency link between "flights" and 395 00:19:38.720 --> 00:19:42.559 "fares" to recognize that the user has two related intents 396 00:19:42.599 --> 00:19:43.640 connected by "and." 397 00:19:44.039 --> 00:19:48.160 Okay, that's powerful. But this raises a question. Once you've 398 00:19:48.279 --> 00:19:51.519 used these matchers to figure out the intent, say "book flight," 399 00:19:51.880 --> 00:19:53.799 how do you store that information with the Doc? 400 00:19:54.240 --> 00:19:56.799 Great question. You don't want that information just floating around 401 00:19:57.039 --> 00:20:01.079