WEBVTT 1 00:00:00.080 --> 00:00:04.040 Okay, let's try and unpack this. Imagine, just for a second, 2 00:00:04.080 --> 00:00:08.359 the sheer, staggering amount of text data we all generate 3 00:00:08.839 --> 00:00:09.640 every single day. 4 00:00:09.720 --> 00:00:10.759 It's unbelievable. 5 00:00:10.800 --> 00:00:15.320 Really. Yeah, emails, social media posts, articles, research papers, I mean, 6 00:00:15.519 --> 00:00:19.079 even just our normal conversations, like this massive digital ocean 7 00:00:19.120 --> 00:00:25.120 of words. Absolutely. But how do computers, these logical binary machines, 8 00:00:25.519 --> 00:00:27.359 how do they actually make any sense of it all? 9 00:00:27.359 --> 00:00:30.879 How do they read it, understand it, maybe even respond? 10 00:00:30.960 --> 00:00:35.159 Well, that's exactly where natural language processing comes in. NLP. Right, NLP. 11 00:00:35.439 --> 00:00:40.719 It's this really fascinating field dedicated to helping computers interact 12 00:00:40.759 --> 00:00:43.880 with and analyze natural human languages, like the ones we speak. 13 00:00:43.960 --> 00:00:45.560 And what's really interesting, you were saying, is how it 14 00:00:45.640 --> 00:00:47.119 pulls from so many different areas. 15 00:00:47.159 --> 00:00:50.880 Exactly. What's truly fascinating here is how this field bridges 16 00:00:51.000 --> 00:00:54.359 so many different disciplines. Our deep dive today is based 17 00:00:54.399 --> 00:00:57.600 on a pretty solid source, Natural Language Processing with Java, 18 00:00:57.640 --> 00:01:01.200 Second Edition, and our mission really is to pull out 19 00:01:01.240 --> 00:01:03.840 the most important bits of knowledge and insight for you, 20 00:01:03.920 --> 00:01:09.159 the listener. Because NLP is, well, it's multidisciplinary.
It draws 21 00:01:09.200 --> 00:01:14.519 heavily from computer science, artificial intelligence (AI), and also formal linguistics, 22 00:01:15.000 --> 00:01:17.719 and we're talking about the tech behind things you use 23 00:01:17.799 --> 00:01:22.680 constantly: search engines, obviously, but also automated help systems, chatbots. 24 00:01:22.760 --> 00:01:23.680 Well yeah, those. 25 00:01:23.760 --> 00:01:28.280 Even really complex projects. Remember IBM's Watson playing Jeopardy? 26 00:01:27.840 --> 00:01:30.599 That kind of thing. Wow. Okay, so when we talk 27 00:01:30.599 --> 00:01:36.200 about natural language processing, NLP, what is it fundamentally? 28 00:01:36.200 --> 00:01:37.680 What does it actually do? 29 00:01:38.040 --> 00:01:38.680 Well, at its core, 30 00:01:38.799 --> 00:01:42.799 the formal definition involves using computer science, AI, and linguistics 31 00:01:42.840 --> 00:01:46.280 to analyze natural language. Okay. But maybe a more useful 32 00:01:46.319 --> 00:01:49.319 way to think about it is it's like a sophisticated toolkit, 33 00:01:49.680 --> 00:01:52.159 a set of tools designed to pull out meaningful, useful 34 00:01:52.200 --> 00:01:55.920 information from all that messy, unstructured language data. You know, 35 00:01:56.000 --> 00:01:58.640 web pages, documents, tweet streams. 36 00:01:58.400 --> 00:02:01.519 Right, unstructured, meaning not like a neat database 37 00:02:01.120 --> 00:02:03.719 table. Precisely. And every time you type a query into 38 00:02:03.760 --> 00:02:07.000 Google or Bing, NLP is humming away behind the scenes. 39 00:02:07.079 --> 00:02:10.120 It's translating your human question into something the computer can 40 00:02:10.159 --> 00:02:12.479 actually act on to get you the results you want. 41 00:02:13.039 --> 00:02:15.280 And to do that, it has to deal with, well, 42 00:02:15.560 --> 00:02:18.319 the fundamentals of language itself. We often hear words like 43 00:02:18.400 --> 00:02:21.360 syntax and semantics.
Could you break those down a bit 44 00:02:22.000 --> 00:02:24.240 in the NLP context, I mean, and why is it 45 00:02:24.280 --> 00:02:25.599 so important to make that distinction? 46 00:02:25.759 --> 00:02:26.080 Sure. 47 00:02:26.319 --> 00:02:30.599 So, syntax, that's basically the grammar, the rules for how 48 00:02:30.639 --> 00:02:33.479 you put words together to make a valid sentence. For instance, 49 00:02:33.599 --> 00:02:38.199 in English, "Tim hit the ball" works, syntactically correct, but 50 00:02:38.319 --> 00:02:40.439 "hit ball Tim," that just doesn't fly. 51 00:02:40.639 --> 00:02:41.639 The syntax is wrong. 52 00:02:41.719 --> 00:02:43.360 Okay, so that's structure. Exactly. 53 00:02:43.439 --> 00:02:46.159 Then you have semantics, and that's about the meaning of 54 00:02:46.280 --> 00:02:48.039 the words and the sentences themselves. 55 00:02:48.080 --> 00:02:50.719 The meaning. That sounds harder. It is. 56 00:02:50.800 --> 00:02:53.560 And this isn't just, you know, a linguistic detail. It's 57 00:02:53.719 --> 00:02:57.120 arguably the Mount Everest for NLP, because the real challenge 58 00:02:57.159 --> 00:03:01.000 isn't just sorting words correctly, it's understanding the world those 59 00:03:01.000 --> 00:03:04.919 words are describing. Without getting the semantics, a computer could 60 00:03:04.919 --> 00:03:07.560 index a million tweets about a movie, maybe, but it 61 00:03:07.599 --> 00:03:09.680 couldn't tell you if people genuinely liked it or if 62 00:03:09.680 --> 00:03:10.840 they were just being sarcastic. 63 00:03:11.039 --> 00:03:14.199 Ah, sarcasm. Yeah, computers must hate 64 00:03:13.919 --> 00:03:14.879 that. They do. 65 00:03:15.319 --> 00:03:18.759 It's the difference between just processing data and actually grasping 66 00:03:18.879 --> 00:03:22.479 human intent.
And this is super important now because of 67 00:03:22.520 --> 00:03:27.400 the sheer volume of unstructured stuff out there: blogs, tweets, 68 00:03:27.840 --> 00:03:31.439 social media. You need to understand it, not just file 69 00:03:31.479 --> 00:03:31.879 it away. 70 00:03:32.199 --> 00:03:34.960 It sounds incredibly complex. I mean, human language is so, 71 00:03:36.520 --> 00:03:39.719 well, messy, isn't it? Yeah, compared to rigid computer code. 72 00:03:39.960 --> 00:03:43.080 What are some of those really fundamental, maybe frustratingly subtle 73 00:03:43.159 --> 00:03:45.439 challenges that make NLP so difficult? 74 00:03:45.919 --> 00:03:48.560 You've absolutely nailed the core problem. Natural languages are just 75 00:03:48.639 --> 00:03:51.520 full of nuance and ambiguity. They're not precise like Python or Java. 76 00:03:51.840 --> 00:03:53.840 I mean, one obvious thing is just the sheer number 77 00:03:53.840 --> 00:03:56.280 of languages, hundreds of them, each with its own syntax, 78 00:03:56.319 --> 00:03:56.960 its own quirks. 79 00:03:57.000 --> 00:03:57.639 Yeah, that's a lot. 80 00:03:57.800 --> 00:04:01.120 But even within one language like English, the challenges 81 00:04:01.120 --> 00:04:04.439 are, well, profound. Take ambiguity. Words often have multiple meanings. 82 00:04:04.439 --> 00:04:06.919 Think about "home": it could be your house, could be your hometown, 83 00:04:06.919 --> 00:04:10.000 could be home base in baseball. NLP systems have to 84 00:04:10.080 --> 00:04:13.719 perform something called word sense disambiguation (WSD) to try and 85 00:04:13.719 --> 00:04:15.280 figure out the intended meaning from 86 00:04:15.120 --> 00:04:16.639 the context. WSD. 87 00:04:17.040 --> 00:04:19.920 And then there's coreference. That's where different words or pronouns 88 00:04:19.959 --> 00:04:22.439 refer back to the same thing. Like in "The city 89 00:04:22.480 --> 00:04:25.240 is large but beautiful. It fills the entire valley."
"It" 90 00:04:26.000 --> 00:04:30.199 clearly refers to the city. Humans get that instantly. Computers, 91 00:04:30.680 --> 00:04:31.920 not so easy. I see. 92 00:04:32.000 --> 00:04:33.680 But the subtle problems 93 00:04:33.199 --> 00:04:37.920 go even deeper, into things we barely notice, like punctuation. 94 00:04:38.360 --> 00:04:42.560 Punctuation? Really? Like commas and periods? Exactly. 95 00:04:42.639 --> 00:04:46.160 A period seems simple, right? But it could end a sentence, 96 00:04:46.439 --> 00:04:49.439 or it could end an abbreviation like Mr. or Mrs. Ah. 97 00:04:49.600 --> 00:04:51.839 Or it could be part of a number like three 98 00:04:51.920 --> 00:04:56.120 point one four nine, or part of an ellipsis, you 99 00:04:56.160 --> 00:05:00.519 know, the three dots. Never really thought about that. Abbreviations themselves 100 00:05:00.560 --> 00:05:03.519 are tricky. Is it CIA or C.I.A. with periods? How 101 00:05:03.519 --> 00:05:06.399 does the machine know? Then you've got sentences inside quotes, 102 00:05:06.519 --> 00:05:09.079 or totally different conventions in tweets or chat messages where 103 00:05:09.079 --> 00:05:12.720 line breaks mean something else. Wow. Even simple things, contractions 104 00:05:12.759 --> 00:05:13.959 like "can't" or "don't." 105 00:05:14.240 --> 00:05:16.480 How do you split that? Is it one token or two? 106 00:05:16.759 --> 00:05:19.480 What about hyphenated words like "first-cut"? 107 00:05:19.439 --> 00:05:20.240 And don't forget 108 00:05:20.360 --> 00:05:23.600 numbers or special characters mixed in with words, like iPhone 109 00:05:23.639 --> 00:05:26.040 5s, or a web address, or an email. 110 00:05:26.399 --> 00:05:29.079 Wow. Okay, so it sounds like even the simplest things, 111 00:05:29.240 --> 00:05:32.399 like a single period, can be a total minefield for 112 00:05:32.439 --> 00:05:37.240 a computer trying to understand text. What's the common thread here?
113 00:05:37.240 --> 00:05:39.439 What makes all these little details so tricky? 114 00:05:39.920 --> 00:05:43.560 I think the common thread is context and, frankly, human intuition. 115 00:05:44.000 --> 00:05:47.519 We just effortlessly figure this stuff out using the surrounding 116 00:05:47.519 --> 00:05:50.720 information and our world knowledge. But for a computer, each 117 00:05:50.759 --> 00:05:52.920 of these is like a tiny decision point where it 118 00:05:52.959 --> 00:05:56.040 needs to apply some rule or a statistical guess. Right. 119 00:05:56.079 --> 00:05:58.680 Okay, so given all these complexities, what does this actually 120 00:05:58.720 --> 00:06:02.160 mean for building systems that, you know, process language? How 121 00:06:02.199 --> 00:06:05.199 do we even start to tackle this massive ocean of words? 122 00:06:05.560 --> 00:06:07.519 Well, the good news is that even the most complex 123 00:06:07.639 --> 00:06:10.279 NLP applications are usually built up from a set of 124 00:06:10.959 --> 00:06:14.399 fundamental techniques, building blocks, if you will. These often work 125 00:06:14.480 --> 00:06:17.319 together in sequence in what we call a pipeline. Pipeline. 126 00:06:17.519 --> 00:06:20.480 And the very first step usually is finding the parts 127 00:06:20.480 --> 00:06:25.160 of the text. This covers two main things: tokenization and normalization. 128 00:06:25.639 --> 00:06:29.920 Tokenization, breaking into tokens, like words? Exactly. 129 00:06:30.000 --> 00:06:34.120 Tokenization is absolutely fundamental. It's breaking down that raw stream 130 00:06:34.160 --> 00:06:36.439 of text into individual units 131 00:06:36.160 --> 00:06:37.160 we call tokens.
132 00:06:37.959 --> 00:06:40.759 Usually these are words, but sometimes they can be smaller 133 00:06:40.800 --> 00:06:44.360 things too, like morphemes. Morphemes? Yeah, the smallest bits 134 00:06:44.399 --> 00:06:48.199 of a word that still have meaning, like the "un" in "unbreakable" 135 00:06:48.399 --> 00:06:52.680 or the "-ed" suffix in "bounded." Ah. Or tokens could 136 00:06:52.680 --> 00:06:55.319 be bigger, like multi-word phrases that act as a 137 00:06:55.360 --> 00:06:59.680 single unit, but yeah, mostly think words. NLP also has 138 00:06:59.759 --> 00:07:03.879 to figure out how to handle things like abbreviations, contractions, numbers, 139 00:07:03.920 --> 00:07:06.399 and even, you know, synonyms, different words meaning the same thing. 140 00:07:06.519 --> 00:07:08.120 And normalization, what's that about? 141 00:07:08.360 --> 00:07:11.240 So once you have your tokens, normalization is basically cleaning 142 00:07:11.319 --> 00:07:14.920 them up. It's essential preprocessing. A lot of NLP tools 143 00:07:14.920 --> 00:07:17.360 and APIs, they kind of assume the data coming in 144 00:07:17.439 --> 00:07:18.920 is already clean and consistent. 145 00:07:19.079 --> 00:07:19.720 Right, makes sense. 146 00:07:19.800 --> 00:07:23.439 So normalization involves things like converting everything to lowercase so 147 00:07:23.519 --> 00:07:26.680 "The" and "the" are treated the same, removing stop words, 148 00:07:26.720 --> 00:07:30.360 those really common words like "the," "is," "as," which often 149 00:07:30.399 --> 00:07:33.360 don't add much unique meaning for analysis. Okay. And then 150 00:07:33.399 --> 00:07:37.120 we get into stemming. This is reducing words down to 151 00:07:37.160 --> 00:07:40.839 their root form, so like "running," "runs," and "ran" might 152 00:07:40.920 --> 00:07:44.160 all get reduced down to just "run." There's a famous 153 00:07:44.199 --> 00:07:46.040 algorithm called the Porter stemmer for this. 154 00:07:46.199 --> 00:07:47.360 Okay, stemming, got it.
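A minimal Python sketch of the tokenization and normalization steps just described. The stop-word list, the regular expression, and the `crude_stem` helper are illustrative assumptions, not anything taken from the book or from a real NLP library, and `crude_stem` is a toy suffix stripper rather than the Porter stemmer.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "as", "a", "an", "of", "to"}

# Keeps contractions ("can't") and hyphenated words ("first-cut") as single
# tokens; any other non-space, non-alphanumeric character is its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    """Break raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def normalize(tokens):
    """Lowercase every token, then drop punctuation and stop words."""
    lowered = [token.lower() for token in tokens]
    return [t for t in lowered if t[0].isalnum() and t not in STOP_WORDS]


def crude_stem(token):
    """Toy suffix stripper (NOT the Porter stemmer): removes one common
    suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("The dog can't stop running.")
cleaned = normalize(tokens)
```

With this sketch, `cleaned` keeps `can't` as one token and drops `the` and the final period. The toy stemmer reduces `runs` to `run` but leaves an irregular form like `ran` unchanged, which is the kind of case lemmatization is meant to handle.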
155 00:07:47.480 --> 00:07:50.800 And then there's something a bit more sophisticated called lemmatization. 156 00:07:51.720 --> 00:07:52.759 This tries to find 157 00:07:52.600 --> 00:07:56.160 the actual dictionary form, or lemma, of a word. So, 158 00:07:56.360 --> 00:07:59.360 for example, the lemma of "was" is actually 159 00:07:59.120 --> 00:08:02.639 "be." Ah, I see the difference. Stemming is cruder, 160 00:08:03.279 --> 00:08:06.759 lemmatization is more linguistically aware. Exactly. 161 00:08:06.839 --> 00:08:10.720 Tools like Stanford CoreNLP or OpenNLP have modules 162 00:08:10.759 --> 00:08:12.600 that can do this lemmatization pretty well. 163 00:08:12.720 --> 00:08:15.240 Okay, so we've broken the text into its basic atoms, 164 00:08:15.360 --> 00:08:18.240 the tokens, the words. But language isn't just a jumble 165 00:08:18.279 --> 00:08:21.600 of words, right? It's structured into sentences, into ideas. You'd 166 00:08:21.600 --> 00:08:24.079 think finding sentences would be easy, just look for a period, 167 00:08:24.240 --> 00:08:27.959 question mark, exclamation point. But I suspect it's not that simple. 168 00:08:28.040 --> 00:08:31.160 You suspect correctly, it's definitely not that simple. This process 169 00:08:31.240 --> 00:08:36.320 is called sentence boundary disambiguation, SBD. SBD. And the difficulty, as 170 00:08:36.360 --> 00:08:39.279 you pointed out earlier, comes right back to the ambiguity 171 00:08:39.320 --> 00:08:43.919 of punctuation, especially the humble period. It ends sentences, sure, 172 00:08:44.440 --> 00:08:48.159 but it also ends abbreviations like Mr., appears in numbers like three 173 00:08:48.200 --> 00:08:51.799 point one four, and signifies omissions, ellipses. 174 00:08:51.879 --> 00:08:52.840 Right, the list goes on. 175 00:08:53.039 --> 00:08:56.039 So imagine the sentence "Mr. and Mrs. Smith went to Washington." 176 00:08:56.480 --> 00:09:00.639 Those first two periods don't end sentences.
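One way to picture the period problem is a small rule-based splitter. The sketch below is an illustration under stated assumptions, not how any particular library does it: the abbreviation list is hypothetical and tiny, and the only other rule is that a sentence-ending token must be followed by a capitalized word or by the end of the text.

```python
# Hypothetical abbreviation list for illustration; a real system needs far more.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "ms.", "prof."}


def split_sentences(text):
    """Split whitespace-separated text into sentences.

    A token ending in '.', '?' or '!' closes a sentence unless it is a
    known abbreviation or the next token does not start with an
    uppercase letter.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "?", "!")):
            continue  # "3.14" ends in a digit, so it never closes a sentence
        if token.lower() in ABBREVIATIONS:
            continue  # "Mr." and "Mrs." do not end the sentence
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is None or next_token[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that had no closing punctuation
        sentences.append(" ".join(current))
    return sentences


sentences = split_sentences(
    "Mr. and Mrs. Smith went to Washington. The value is 3.14 today. Really?"
)
```

Here the periods after `Mr` and `Mrs` and the one inside `3.14` are skipped, so the text splits into three sentences. The capitalization rule is deliberately naive: an abbreviation missing from the list followed by a proper noun would still be split incorrectly.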
No. Getting SBD 177 00:09:00.720 --> 00:09:03.120 right is crucial because many of the next steps in 178 00:09:03.159 --> 00:09:06.399 an NLP pipeline, like assigning parts of speech or finding 179 00:09:06.519 --> 00:09:09.879 named entities, they typically operate on one sentence at a time. 180 00:09:10.279 --> 00:09:12.200 Okay, so if you split the sentence wrong? 181 00:09:12.080 --> 00:09:14.919 Exactly, you can completely mess up the downstream analysis. You 182 00:09:14.960 --> 00:09:18.120 might confuse "He walked over. The hill was steep" with 183 00:09:18.200 --> 00:09:20.919 the single phrase "over the hill." Totally different 184 00:09:20.639 --> 00:09:22.519 meaning. Yikes. How do they handle it, then? 185 00:09:22.759 --> 00:09:22.960 Well, 186 00:09:23.000 --> 00:09:25.279 there are different approaches. Some are rule-based. 187 00:09:25.559 --> 00:09:28.919 LingPipe, for example, has something called a heuristic sentence model. 188 00:09:29.240 --> 00:09:32.000 It uses clever lists, like sets of words that are 189 00:09:32.240 --> 00:09:34.799 possible stops at the end of a sentence, words that 190 00:09:34.840 --> 00:09:38.480 are impossible just before a period (penultimates), and words that 191 00:09:38.519 --> 00:09:41.480 are impossible at the start of a new sentence, plus 192 00:09:41.639 --> 00:09:44.159 flags for things like balancing parentheses or quotes. 193 00:09:44.440 --> 00:09:46.039 Wow, that sounds like detective work. 194 00:09:46.320 --> 00:09:49.480 It kind of is, using lots of rules and heuristics 195 00:09:49.519 --> 00:09:50.360 to make the best guess. 196 00:09:50.440 --> 00:09:52.519 Okay, this is where it gets really interesting for me. 197 00:09:52.679 --> 00:09:56.559 We've got words, we've got sentences. How do computers go 198 00:09:56.679 --> 00:09:59.279 beyond that to actually pick out the key things in 199 00:09:59.320 --> 00:10:01.840 the text? The who? What? Where?
How does it know 200 00:10:02.120 --> 00:10:04.480 Apple is the company in one sentence and the fruit 201 00:10:04.519 --> 00:10:05.000 in another? 202 00:10:05.320 --> 00:10:08.600 Right, that's the job of named entity recognition, or NER. 203 00:10:08.840 --> 00:10:12.720 NER? NER is the process of finding mentions of entities, 204 00:10:12.759 --> 00:10:18.159 typically things like people, places, organizations, dates, money, time, and 205 00:10:18.200 --> 00:10:21.200 classifying them, tagging them with their specific category. 206 00:10:21.360 --> 00:10:23.159 Why is that hard? Seems like you could 207 00:10:23.000 --> 00:10:26.759 use lists. Lists help, but names themselves are ambiguous. Is 208 00:10:26.840 --> 00:10:28.639 Penny a person's name or a coin? 209 00:10:28.960 --> 00:10:29.480 Good point. 210 00:10:29.720 --> 00:10:31.240 Is Georgia the 211 00:10:31.399 --> 00:10:34.639 US state, the country, or maybe even a person's name? 212 00:10:35.000 --> 00:10:36.360 Context is everything. 213 00:10:37.000 --> 00:10:38.919 Context again. Yep, and 214 00:10:39.159 --> 00:10:43.320 entities can be mentioned in different ways: IBM versus International 215 00:10:43.320 --> 00:10:46.600 Business Machines. The system needs to know those refer to 216 00:10:46.759 --> 00:10:47.799 the same organization. 217 00:10:48.440 --> 00:10:51.840 So how do they do NER? Lists and... Well, 218 00:10:51.879 --> 00:10:55.240 there are broadly two main approaches. One is rule-based, 219 00:10:55.320 --> 00:10:58.960 where human experts write detailed rules or use large predefined 220 00:10:58.960 --> 00:11:02.799 lists, gazetteers they're sometimes called. The other approach, which 221 00:11:02.799 --> 00:11:06.200 is very common now, is machine learning. These systems learn 222 00:11:06.279 --> 00:11:08.399 patterns from huge amounts of text that have already been 223 00:11:08.440 --> 00:11:12.639 annotated with entities.
They use statistical models. Examples? Exactly. And 224 00:11:12.679 --> 00:11:16.279 for common structured entities you can sometimes use regular expressions, 225 00:11:16.279 --> 00:11:20.200 those pattern-matching rules, to find things like phone numbers, URLs, 226 00:11:20.440 --> 00:11:24.600 zip codes, email addresses, maybe even specific time and date 227 00:11:24.440 --> 00:11:27.600 formats. Okay, so we found the entities. What about the 228 00:11:27.639 --> 00:11:29.519 other words? Like, how do we get computers to understand 229 00:11:29.519 --> 00:11:33.000 the grammar? What's a noun, what's a verb, adjective? And 230 00:11:33.039 --> 00:11:35.039 why does that actually matter for understanding? 231 00:11:35.320 --> 00:11:37.840 Yeah, that's crucial too. This is done using part-of- 232 00:11:37.840 --> 00:11:39.879 speech tagging, or POS tagging. 233 00:11:39.960 --> 00:11:40.799 POS tagging. 234 00:11:41.080 --> 00:11:47.759 It's the process of assigning a grammatical tag like noun, verb, adjective, preposition, pronoun, adverb, conjunction, 235 00:11:47.879 --> 00:11:50.000 interjection to each word in a sentence. 236 00:11:50.159 --> 00:11:50.960 Why do we need that? 237 00:11:51.279 --> 00:11:53.679 It's really important for figuring out the context of a 238 00:11:53.679 --> 00:11:56.600 word and its role in the sentence structure. Knowing if 239 00:11:56.759 --> 00:11:59.759 "book" is a noun or a verb changes everything. 240 00:11:59.480 --> 00:12:02.320 True, "book the flight" versus "read the book." Precisely. 241 00:12:02.600 --> 00:12:06.720 But even POS tagging has challenges. Remember normalization? If you 242 00:12:06.799 --> 00:12:10.000 lowercase everything, you might confuse "sam" the word with "Sam" 243 00:12:10.039 --> 00:12:14.840 the name, a proper noun. Contractions again, like "can't," hyphenated words 244 00:12:14.960 --> 00:12:19.080 like "state-of-the-art," embedded numbers like "version five," weird character 245 00:12:19.120 --> 00:12:20.639 sequences like URLs.
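For the structured entities mentioned a moment ago (phone numbers, URLs, zip codes, email addresses), a regular-expression sketch might look like the following. The patterns and labels are simplified assumptions that cover only common US-style formats; real-world patterns usually need to handle many more variations.

```python
import re

# Simplified, hypothetical patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def find_entities(text):
    """Return (label, matched_text) pairs for every pattern match."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


entities = find_entities(
    "Call 555-867-5309 or email jane.doe@example.com, ZIP 02139, "
    "see https://example.org/docs for more."
)
```

This only works for entities with a rigid surface form. It says nothing about whether "Georgia" is a state, a country, or a person, which is where the gazetteer and machine-learning approaches come in.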
246 00:12:21.240 --> 00:12:22.960 They all make POS tagging harder. 247 00:12:23.320 --> 00:12:25.519 So how are the tags assigned? Is there a standard? 248 00:12:25.919 --> 00:12:28.080 There are several tag sets, but a very common one 249 00:12:28.120 --> 00:12:31.000 is the Penn Treebank tag set. It uses short tags 250 00:12:31.080 --> 00:12:34.240 like NN for a singular noun, NNS for a plural noun, 251 00:12:34.559 --> 00:12:37.840 VBD for a past-tense verb, JJ for an adjective, 252 00:12:37.879 --> 00:12:41.440 and so on. And to train these POS tagging models, 253 00:12:41.799 --> 00:12:44.799 you need a corpus, that's a large body of text 254 00:12:44.799 --> 00:12:47.200 that has already been manually tagged with the correct parts 255 00:12:47.200 --> 00:12:50.159 of speech. Famous examples are the Brown Corpus or the 256 00:12:50.159 --> 00:12:54.120 British National Corpus. The models learn from these labeled examples. 257 00:12:54.200 --> 00:12:58.600 Okay, this is fascinating. We've identified words, sentences, entities, grammar. 258 00:12:59.000 --> 00:13:02.159 Let's shift focus a bit. How do computers go beyond 259 00:13:02.279 --> 00:13:05.799 just identifying these pieces? How do they actually represent the text, 260 00:13:06.159 --> 00:13:09.200 especially the meaning in context, for deeper analysis? 261 00:13:09.360 --> 00:13:12.360 Right, moving towards representation, this brings us to two really 262 00:13:12.399 --> 00:13:15.720 important concepts: feature engineering and word embedding. 263 00:13:15.960 --> 00:13:18.840 Feature engineering sounds like something out of AI. 264 00:13:18.879 --> 00:13:19.919 It is, very much so. 265 00:13:20.200 --> 00:13:23.000 Feature engineering is essentially the art, and it is still 266 00:13:23.000 --> 00:13:26.159 something of an art, of transforming raw data into numerical 267 00:13:26.200 --> 00:13:29.519 features that machine learning algorithms can actually work with.
It 268 00:13:29.559 --> 00:13:32.600 requires using domain knowledge to select or create the right 269 00:13:32.639 --> 00:13:36.279 features that will help the algorithm learn effectively. It's still 270 00:13:36.279 --> 00:13:38.240 a very human-driven process in many ways. 271 00:13:38.279 --> 00:13:40.080 Okay, and how does that apply to text? 272 00:13:40.159 --> 00:13:43.879 Well, one common technique in text feature engineering is using n-grams. 273 00:13:44.240 --> 00:13:44.720 N-grams. 274 00:13:44.799 --> 00:13:48.200 Yeah, n-grams are simply sequences of n consecutive words 275 00:13:48.600 --> 00:13:49.440 from the text. 276 00:13:49.600 --> 00:13:51.919 So if you have the sentence "This is an 277 00:13:51.919 --> 00:13:54.639 n-gram model," a two-gram, or bigram, would 278 00:13:54.639 --> 00:13:57.679 be "this is," "is an," "an n-gram," "n-gram model." 279 00:13:57.759 --> 00:14:00.279 A three-gram, or trigram, would be "this is an," "is 280 00:14:00.320 --> 00:14:02.320 an n-gram," "an n-gram model." 281 00:14:02.360 --> 00:14:04.600 Okay. Sequences of words. Why are they useful? 282 00:14:04.960 --> 00:14:07.919 They help capture a bit more context than just single words. 283 00:14:08.120 --> 00:14:11.279 They help us estimate the probability of a word sequence occurring. 284 00:14:11.679 --> 00:14:14.120 This is often used to predict the next word in 285 00:14:14.159 --> 00:14:17.960 a sequence, maybe for autocomplete. Many models use the Markov 286 00:14:17.960 --> 00:14:20.559 assumption here, the idea that the probability of the next 287 00:14:20.559 --> 00:14:23.320 word depends only on the previous one or few words. 288 00:14:23.120 --> 00:14:25.039 Right, like on my phone keyboard. Exactly. 289 00:14:25.440 --> 00:14:27.039 But then we get to word embedding.
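The n-gram example and the Markov assumption can be sketched in a few lines of Python. The function names and the tiny training text are made up for illustration; the model simply counts which word follows which and predicts the most frequent follower.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_bigram_model(tokens):
    """Count followers of each word (first-order Markov assumption:
    the next word depends only on the previous one)."""
    following = defaultdict(Counter)
    for previous, current in ngrams(tokens, 2):
        following[previous][current] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


sentence = "this is an n-gram model".split()
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

# Invented toy corpus: "cat" follows "the" twice, "mat" only once.
model = train_bigram_model("the cat sat on the mat the cat ran".split())
suggestion = predict_next(model, "the")
```

`bigrams` and `trigrams` reproduce the lists read out in the conversation, and `suggestion` is the word an autocomplete built on these counts would offer after "the".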
290 00:14:27.519 --> 00:14:30.120 This is a really powerful set of techniques for how 291 00:14:30.120 --> 00:14:33.320 computers can deal with the context and meaning of words 292 00:14:33.399 --> 00:14:34.919 in a more sophisticated way. 293 00:14:35.080 --> 00:14:37.759 Embedding, like putting words into some kind of space? 294 00:14:38.080 --> 00:14:40.399 That's a great way to think about it. The goal 295 00:14:40.559 --> 00:14:44.240 is to represent words as numerical vectors, lists of numbers, 296 00:14:44.240 --> 00:14:47.080 in a high-dimensional space, and the key idea is 297 00:14:47.080 --> 00:14:50.919 that words with similar meanings should have similar vector representations. 298 00:14:50.919 --> 00:14:53.000 They should be close to each other in this space. 299 00:14:53.440 --> 00:14:56.039 So "king" and "queen" would be close. Yeah. And maybe 300 00:14:56.080 --> 00:14:57.799 "apple" and "banana." 301 00:14:57.279 --> 00:15:01.120 Precisely, but Apple the company should be closer to, say, 302 00:15:01.440 --> 00:15:04.120 Microsoft or Google than to "banana." 303 00:15:04.279 --> 00:15:04.639 Okay. 304 00:15:05.039 --> 00:15:07.840 The aim is to capture not just context, but also 305 00:15:08.080 --> 00:15:13.120 maybe hierarchical relationships like king, queen, prince, and morphological information 306 00:15:13.240 --> 00:15:14.120 like "run" and "ran." 307 00:15:14.159 --> 00:15:15.639 How do they create these embeddings? 308 00:15:15.679 --> 00:15:19.240 There are two main families of approaches. First, frequency-based embedding. 309 00:15:19.360 --> 00:15:22.840 These rely on counting how often words appear together: simple 310 00:15:22.879 --> 00:15:26.440 counts, like a count vector, or more sophisticated methods like 311 00:15:26.639 --> 00:15:28.200 TF-IDF. 312 00:15:27.960 --> 00:15:29.600 TF-IDF, I've heard of that. 313 00:15:29.600 --> 00:15:32.759 Yeah, it's very common, especially in information retrieval.
It stands 314 00:15:32.759 --> 00:15:37.039 for term frequency-inverse document frequency. It combines two scores. 315 00:15:37.200 --> 00:15:40.559 TF, term frequency, is just how often a word appears 316 00:15:40.559 --> 00:15:44.799 in a single document, a simple count. IDF, inverse document frequency, 317 00:15:44.919 --> 00:15:47.759 measures how important that word is across the entire collection 318 00:15:47.799 --> 00:15:51.080 of documents. The idea is that words appearing in many, 319 00:15:51.080 --> 00:15:54.879 many documents, like "the," "is," "a," are less informative than 320 00:15:54.919 --> 00:15:58.000 words appearing in only a few, so rare words get 321 00:15:58.039 --> 00:15:59.399 a higher IDF score. 322 00:15:59.519 --> 00:16:03.120 Ah, so it balances frequency within a document with rarity 323 00:16:03.159 --> 00:16:04.879 across all documents. Exactly. 324 00:16:04.919 --> 00:16:08.919 The combined TF-IDF score helps rank how relevant a document 325 00:16:08.960 --> 00:16:11.600 is to a query, for example. It gives more weight 326 00:16:11.679 --> 00:16:14.200 to terms that are frequent in that document but relatively 327 00:16:14.279 --> 00:16:15.000 rare overall. 328 00:16:15.159 --> 00:16:17.720 Makes sense. And the other type of embedding? 329 00:16:17.600 --> 00:16:21.559 The second type is prediction-based embedding. These methods typically 330 00:16:21.639 --> 00:16:23.679 use neural networks and try to predict a word based 331 00:16:23.720 --> 00:16:26.279 on its neighbors, or predict the neighbors based on the word. 332 00:16:26.600 --> 00:16:30.039 This is where you hear names like word2vec, GloVe, CBOW 333 00:16:30.120 --> 00:16:33.200 (continuous bag of words), and skip-gram models. They often 334 00:16:33.240 --> 00:16:36.960 capture more subtle semantic relationships than frequency-based methods. 335 00:16:37.039 --> 00:16:39.840 Okay, neural networks getting involved.
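As a rough sketch of the TF-IDF idea described above, assuming the plain textbook form: TF is the raw count of a term in a document, and IDF is the logarithm of the total number of documents divided by the number of documents containing the term. Libraries typically use smoothed or normalized variants, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Assumes tf = raw count in the document and idf = log(N / df),
    where N is the number of documents and df is the number of
    documents that contain the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf(docs)
```

In the first document, "the" appears twice but occurs in every document, so its score is zero. "mat" appears once but in no other document, so it outranks "cat", which also shows up in the second document.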
So these embeddings create these 336 00:16:40.600 --> 00:16:44.600 complex vector representations. You mentioned high-dimensional space. How high 337 00:16:44.600 --> 00:16:46.480 are we talking? Does that cause problems? 338 00:16:46.720 --> 00:16:48.919 Oh, it absolutely causes problems. 339 00:16:49.559 --> 00:16:53.559 We're often talking about vectors with hundreds, sometimes even thousands 340 00:16:53.679 --> 00:16:55.080 of dimensions for each word. 341 00:16:55.240 --> 00:16:55.679 Wow. 342 00:16:55.919 --> 00:16:58.519 Now imagine you have a vocabulary of a million words, 343 00:16:58.559 --> 00:17:02.120 each with a 300-dimensional vector. That requires a lot 344 00:17:02.159 --> 00:17:05.240 of memory, over six gigabytes in that example, and computation. 345 00:17:05.599 --> 00:17:06.799 It can become impractical. 346 00:17:06.960 --> 00:17:08.599 Yeah, I can see that. So what do you do? 347 00:17:08.880 --> 00:17:12.279 This is where dimensionality reduction techniques come in. We need 348 00:17:12.319 --> 00:17:15.200 ways to reduce the number of dimensions while preserving as 349 00:17:15.279 --> 00:17:17.039 much of the important information as 350 00:17:16.920 --> 00:17:19.519 possible. Like summarizing the dimensions? Sort of. 351 00:17:19.920 --> 00:17:24.440 One classic technique is principal component analysis, or PCA. PCA 352 00:17:24.599 --> 00:17:27.279 is a linear algorithm. It looks for the directions in 353 00:17:27.319 --> 00:17:30.759 the data where the variance is highest, the principal components, 354 00:17:31.119 --> 00:17:34.519 and projects the data onto a lower-dimensional subspace defined 355 00:17:34.559 --> 00:17:37.160 by those components. It basically tries to find the main 356 00:17:37.240 --> 00:17:39.920 axes of variation and discard the less important ones. 357 00:17:40.000 --> 00:17:42.440 Okay, linear, finds the main trends.
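A back-of-the-envelope calculation for the million-word, 300-dimension example. The helper name and the bytes-per-value choices are assumptions for illustration, and the function counts only the raw numbers in a dense table, with no allowance for the vocabulary strings or any runtime overhead.

```python
def embedding_memory_gb(vocab_size, dimensions, bytes_per_value):
    """Raw storage for a dense embedding table, in gigabytes (10**9 bytes)."""
    return vocab_size * dimensions * bytes_per_value / 1e9


# One million words, 300 dimensions each.
float32_gb = embedding_memory_gb(1_000_000, 300, 4)  # 4-byte floats
float64_gb = embedding_memory_gb(1_000_000, 300, 8)  # 8-byte doubles

# The same vocabulary after reducing to 50 dimensions (an arbitrary target).
reduced_gb = embedding_memory_gb(1_000_000, 50, 4)
```

Raw storage alone comes to about 1.2 GB at 4 bytes per value and 2.4 GB at 8 bytes, so a figure above six gigabytes would have to include overhead this sketch does not model. Either way, memory scales linearly with the number of dimensions, which is why cutting 300 dimensions to 50 shrinks the table by a factor of six.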
358 00:17:42.480 --> 00:17:46.720 Right. But sometimes the relationships between words, their meanings, aren't 359 00:17:46.759 --> 00:17:49.880 purely linear. They might be clustered in more complex ways. 360 00:17:50.440 --> 00:17:54.759 That's where nonlinear techniques like t-distributed stochastic neighbor embedding, 361 00:17:54.920 --> 00:17:56.400 or t-SNE, come 362 00:17:56.480 --> 00:18:01.119 in. t-SNE, that sounds fancy. It is quite sophisticated. 363 00:18:01.359 --> 00:18:04.559 It's a nonlinear, nondeterministic algorithm, meaning you might get 364 00:18:04.599 --> 00:18:07.920 slightly different results each time you run it. It's 365 00:18:07.960 --> 00:18:10.880 particularly good at creating 2D or 3D maps 366 00:18:10.880 --> 00:18:14.599 of high-dimensional data that preserve the local structure, meaning 367 00:18:14.920 --> 00:18:17.119 points that are close together in the high-dimensional space 368 00:18:17.240 --> 00:18:19.599 tend to remain close together in the low-dimensional map. 369 00:18:19.680 --> 00:18:23.039 So it's good for visualization, seeing clusters of words? Exactly. 370 00:18:23.200 --> 00:18:26.480 PCA is maybe better for just raw compression sometimes, but 371 00:18:26.599 --> 00:18:30.880 t-SNE is fantastic for visualizing and exploring complex relationships in 372 00:18:30.960 --> 00:18:33.960 data like word embeddings. It's really good at finding structure 373 00:18:33.960 --> 00:18:36.440 that other algorithms might miss because it's so flexible. 374 00:18:36.599 --> 00:18:40.079 That's a great comparison. Okay, so once we've processed words, 375 00:18:40.880 --> 00:18:44.799 maybe represented them with these embeddings, how do we classify 376 00:18:45.240 --> 00:18:47.920 entire pieces of text? Like, is this news article 377 00:18:47.920 --> 00:18:52.079 about sports or politics? Is this customer review positive or negative? Right.
378 00:18:52.119 --> 00:18:55.039 Moving up to the document level, this involves tasks like 379 00:18:55.119 --> 00:18:58.440 text classification, sentiment analysis, and language identification. 380 00:18:58.519 --> 00:19:01.599 Okay, let's take those one by one. Text classification? 381 00:19:01.960 --> 00:19:07.079 Text classification is pretty straightforward conceptually. It's about assigning a 382 00:19:07.160 --> 00:19:11.160 piece of text, could be a sentence, paragraph, document, to 383 00:19:11.400 --> 00:19:15.119 one or more predefined categories. Classic example: spam detection 384 00:19:15.200 --> 00:19:17.599 in email. Is this email spam or not spam? Right. 385 00:19:17.759 --> 00:19:21.839 But it's used for much more: automatically organizing huge archives 386 00:19:21.839 --> 00:19:24.920 of documents by topic, maybe trying to determine the authorship 387 00:19:24.920 --> 00:19:28.319 of historical texts. There is famous work on the Federalist 388 00:19:28.319 --> 00:19:31.519 Papers using this. Cool. Or even trying to infer things 389 00:19:31.559 --> 00:19:35.240 like the author's age range or gender based on writing style. 390 00:19:35.480 --> 00:19:38.960 Interesting, okay. And sentiment analysis, that's the positive-negative thing? 391 00:19:39.279 --> 00:19:39.559 Yes. 392 00:19:39.799 --> 00:19:43.559 Sentiment analysis is a specific type of text classification focused 393 00:19:43.599 --> 00:19:46.759 on determining the emotional tone or attitude expressed in a 394 00:19:46.799 --> 00:19:51.079 piece of text. Is it positive, negative, neutral? Sometimes it's 395 00:19:51.119 --> 00:19:53.599 mapped to a numerical rating, like stars out of five. 396 00:19:53.680 --> 00:19:57.880 Where do you apply that? Reviews, social media? 397 00:19:57.279 --> 00:20:01.599 All of the above: product reviews, movie reviews, social media comments, 398 00:20:01.720 --> 00:20:04.839 survey responses, anything where you want to gauge opinion.
It 399 00:20:04.839 --> 00:20:08.720 can be applied at different levels: the whole document, individual sentences, 400 00:20:08.799 --> 00:20:10.920 even clauses within sentences. 401 00:20:10.519 --> 00:20:12.920