WEBVTT

1
00:00:00.080 --> 00:00:04.040
<v Speaker 1>Okay, let's try and unpack this. Imagine just for a second,

2
00:00:04.080 --> 00:00:08.359
<v Speaker 1>the sheer, staggering amount of text data we all generate

3
00:00:08.839 --> 00:00:09.640
<v Speaker 1>every single day.

4
00:00:09.720 --> 00:00:10.759
<v Speaker 2>It's unbelievable.

5
00:00:10.800 --> 00:00:15.320
<v Speaker 1>Really. Yeah, emails, social media posts, articles, research papers, I mean,

6
00:00:15.519 --> 00:00:19.079
<v Speaker 1>even just our normal conversations like this massive digital ocean

7
00:00:19.120 --> 00:00:25.120
<v Speaker 1>of words. Absolutely. But how do computers, these logical binary machines,

8
00:00:25.519 --> 00:00:27.359
<v Speaker 1>how do they actually make any sense of it all?

9
00:00:27.359 --> 00:00:30.879
<v Speaker 1>How do they read it, understand it, maybe even respond?

10
00:00:30.960 --> 00:00:35.159
<v Speaker 2>Well, that's exactly where natural language processing comes in: NLP. Right, NLP.

11
00:00:35.439 --> 00:00:40.719
<v Speaker 2>It's this really fascinating field dedicated to helping computers interact

12
00:00:40.759 --> 00:00:43.880
<v Speaker 2>with and analyze natural human languages like the ones we speak.

13
00:00:43.960 --> 00:00:45.560
<v Speaker 1>And what's really interesting you were saying, is how it

14
00:00:45.640 --> 00:00:47.119
<v Speaker 1>pulls from so many different areas.

15
00:00:47.159 --> 00:00:50.880
<v Speaker 2>Exactly, what's truly fascinating here is how this field bridges

16
00:00:51.000 --> 00:00:54.359
<v Speaker 2>so many different disciplines. Our deep dive today is based

17
00:00:54.399 --> 00:00:57.600
<v Speaker 2>on a pretty solid source, Natural Language Processing with Java,

18
00:00:57.640 --> 00:01:01.200
<v Speaker 2>Second Edition, and our mission really is to pull out

19
00:01:01.240 --> 00:01:03.840
<v Speaker 2>the most important bits of knowledge and insight for you,

20
00:01:03.920 --> 00:01:09.159
<v Speaker 2>the listener. Because NLP is well, it's multidisciplinary. It draws

21
00:01:09.200 --> 00:01:14.519
<v Speaker 2>heavily from computer science, artificial intelligence (AI), and also formal linguistics,

22
00:01:15.000 --> 00:01:17.719
<v Speaker 2>and we're talking about the tech behind things you use

23
00:01:17.799 --> 00:01:22.680
<v Speaker 2>constantly: search engines, obviously, but also automated help systems, chatbots.

24
00:01:22.760 --> 00:01:23.680
<v Speaker 1>Well yeah, those.

25
00:01:23.760 --> 00:01:28.280
<v Speaker 2>Even really complex projects. Remember IBM's Watson playing Jeopardy.

26
00:01:27.840 --> 00:01:30.599
<v Speaker 1>That kind of thing. Wow. Okay, So when we talk

27
00:01:30.599 --> 00:01:36.200
<v Speaker 1>about natural language processing, NLP, what is it fundamentally?

28
00:01:36.200 --> 00:01:37.680
<v Speaker 1>What does it actually do?

29
00:01:38.040 --> 00:01:38.680
<v Speaker 3>Well, at its core,

30
00:01:38.799 --> 00:01:42.799
<v Speaker 2>the formal definition involves using computer science, AI, and linguistics

31
00:01:42.840 --> 00:01:46.280
<v Speaker 2>to analyze natural language. Okay, but maybe a more useful

32
00:01:46.319 --> 00:01:49.319
<v Speaker 2>way to think about it is it's like a sophisticated toolkit,

33
00:01:49.680 --> 00:01:52.159
<v Speaker 2>a set of tools designed to pull out meaningful, useful

34
00:01:52.200 --> 00:01:55.920
<v Speaker 2>information from all that messy unstructured language data. You know,

35
00:01:56.000 --> 00:01:58.640
<v Speaker 2>web pages, documents, tweet streams.

36
00:01:58.400 --> 00:02:01.519
<v Speaker 1>Right, unstructured meaning not like a neat database table.

37
00:02:01.120 --> 00:02:03.719
<v Speaker 2>Precisely. And every time you type a query into

38
00:02:03.760 --> 00:02:07.000
<v Speaker 2>Google or Bing, NLP is humming away behind the scenes.

39
00:02:07.079 --> 00:02:10.120
<v Speaker 2>It's translating your human question into something the computer can

40
00:02:10.159 --> 00:02:12.479
<v Speaker 2>actually act on to get you the results you want.

41
00:02:13.039 --> 00:02:15.280
<v Speaker 1>And to do that, it has to deal with, well,

42
00:02:15.560 --> 00:02:18.319
<v Speaker 1>the fundamentals of language itself. We often hear words like

43
00:02:18.400 --> 00:02:21.360
<v Speaker 1>syntax and semantics. Could you break those down a bit

44
00:02:22.000 --> 00:02:24.240
<v Speaker 1>in the NLP context, I mean, and why is it

45
00:02:24.280 --> 00:02:25.599
<v Speaker 1>so important to make that distinction?

46
00:02:25.759 --> 00:02:26.080
<v Speaker 3>Sure.

47
00:02:26.319 --> 00:02:30.599
<v Speaker 2>So, syntax that's basically the grammar, the rules for how

48
00:02:30.639 --> 00:02:33.479
<v Speaker 2>you put words together to make a valid sentence. For instance,

49
00:02:33.599 --> 00:02:38.199
<v Speaker 2>in English, "Tim hit the ball" works. Syntactically correct. But

50
00:02:38.319 --> 00:02:40.439
<v Speaker 2>"hit ball Tim," that just doesn't fly.

51
00:02:40.639 --> 00:02:41.639
<v Speaker 3>The syntax is wrong.

52
00:02:41.719 --> 00:02:43.360
<v Speaker 1>Okay, so that's structure. Exactly.

53
00:02:43.439 --> 00:02:46.159
<v Speaker 2>Then you have semantics, and that's about the meaning of

54
00:02:46.280 --> 00:02:48.039
<v Speaker 2>the words and the sentences themselves.

55
00:02:48.080 --> 00:02:50.719
<v Speaker 1>The meaning. That sounds harder. It is.

56
00:02:50.800 --> 00:02:53.560
<v Speaker 2>And this isn't just you know, a linguistic detail. It's

57
00:02:53.719 --> 00:02:57.120
<v Speaker 2>arguably the Mount Everest for NLP, because the real challenge

58
00:02:57.159 --> 00:03:01.000
<v Speaker 2>isn't just sorting words correctly, it's understanding the world those

59
00:03:01.000 --> 00:03:04.919
<v Speaker 2>words are describing. Without getting the semantics, a computer could

60
00:03:04.919 --> 00:03:07.560
<v Speaker 2>index a million tweets about a movie maybe, but it

61
00:03:07.599 --> 00:03:09.680
<v Speaker 2>couldn't tell you if people genuinely liked it or if

62
00:03:09.680 --> 00:03:10.840
<v Speaker 2>they were just being sarcastic.

63
00:03:11.039 --> 00:03:14.199
<v Speaker 1>Ah, sarcasm. Yeah, computers must hate that.

64
00:03:13.919 --> 00:03:14.879
<v Speaker 3>They do.

65
00:03:15.319 --> 00:03:18.759
<v Speaker 2>It's the difference between just processing data and actually grasping

66
00:03:18.879 --> 00:03:22.479
<v Speaker 2>human intent. And this is super important now because of

67
00:03:22.520 --> 00:03:27.400
<v Speaker 2>the sheer volume of unstructured stuff out there, blogs, tweets,

68
00:03:27.840 --> 00:03:31.439
<v Speaker 2>social media. You need to understand it, not just file

69
00:03:31.479 --> 00:03:31.879
<v Speaker 2>it away.

70
00:03:32.199 --> 00:03:34.960
<v Speaker 1>It sounds incredibly complex. I mean, human language is so

71
00:03:36.520 --> 00:03:39.719
<v Speaker 1>well, messy, isn't it? Yeah, compared to rigid computer code.

72
00:03:39.960 --> 00:03:43.080
<v Speaker 1>What are some of those really fundamental, maybe frustratingly subtle

73
00:03:43.159 --> 00:03:45.439
<v Speaker 1>challenges that make NLP so difficult.

74
00:03:45.919 --> 00:03:48.560
<v Speaker 2>You've absolutely nailed the core problem. Natural languages are just

75
00:03:48.639 --> 00:03:51.520
<v Speaker 2>full of nuance and ambiguity. They're not precise like Python or Java.

76
00:03:51.840 --> 00:03:53.840
<v Speaker 2>I mean, one obvious thing is just the sheer number

77
00:03:53.840 --> 00:03:56.280
<v Speaker 2>of languages, hundreds of them, each with its own syntax,

78
00:03:56.319 --> 00:03:56.960
<v Speaker 2>its own quirks.

79
00:03:57.000 --> 00:03:57.639
<v Speaker 1>Yeah, that's a lot.

80
00:03:57.800 --> 00:04:01.120
<v Speaker 2>But even within one language like English, the challenges

81
00:04:01.120 --> 00:04:04.439
<v Speaker 2>are, well, profound. Take ambiguity. Words often have multiple meanings.

82
00:04:04.439 --> 00:04:06.919
<v Speaker 2>Think about "home": could be your house, could be your hometown,

83
00:04:06.919 --> 00:04:10.000
<v Speaker 2>could be home base in baseball. NLP systems have to

84
00:04:10.080 --> 00:04:13.719
<v Speaker 2>perform something called word sense disambiguation, WSD, to try and

85
00:04:13.719 --> 00:04:15.280
<v Speaker 2>figure out the intended meaning from the context.

86
00:04:15.120 --> 00:04:16.639
<v Speaker 1>WSD.

87
00:04:17.040 --> 00:04:19.920
<v Speaker 2>And then there's coreference. That's where different words or pronouns

88
00:04:19.959 --> 00:04:22.439
<v Speaker 2>refer back to the same thing. Like in "The city

89
00:04:22.480 --> 00:04:25.240
<v Speaker 2>is large but beautiful. It fills the entire valley," "it"

90
00:04:26.000 --> 00:04:30.199
<v Speaker 2>clearly refers to the city. Humans get that instantly. Computers,

91
00:04:30.680 --> 00:04:31.920
<v Speaker 2>not so easy. I see.

92
00:04:32.000 --> 00:04:33.680
<v Speaker 3>But the subtle problems

93
00:04:33.199 --> 00:04:37.920
<v Speaker 2>go even deeper into things we barely notice, like punctuation.

94
00:04:38.360 --> 00:04:42.560
<v Speaker 1>Punctuation? Really? Like commas and periods? Exactly.

95
00:04:42.639 --> 00:04:46.160
<v Speaker 2>A period seems simple, right, But it could end a sentence,

96
00:04:46.439 --> 00:04:49.439
<v Speaker 2>or it could end an abbreviation like "Mr." or "Mrs."

97
00:04:49.600 --> 00:04:51.839
<v Speaker 2>Or it could be part of a number like three

98
00:04:51.920 --> 00:04:56.120
<v Speaker 2>point one four nine, or part of an ellipsis, you

99
00:04:56.160 --> 00:05:00.519
<v Speaker 2>know, the three dots. Never really thought about that. Abbreviations themselves

100
00:05:00.560 --> 00:05:03.519
<v Speaker 2>are tricky. Is it CIA or C.I.A. with periods? How

101
00:05:03.519 --> 00:05:06.399
<v Speaker 2>does the machine know? Then you've got sentences inside quotes

102
00:05:06.519 --> 00:05:09.079
<v Speaker 2>or totally different conventions in tweets or chat messages where

103
00:05:09.079 --> 00:05:12.720
<v Speaker 2>line breaks mean something else. Wow. Even simple things, contractions

104
00:05:12.759 --> 00:05:13.959
<v Speaker 2>like "can't" or "don't."

105
00:05:14.240 --> 00:05:16.480
<v Speaker 3>How do you split that? Is it one token or two?

106
00:05:16.759 --> 00:05:19.480
<v Speaker 2>What about hyphenated words like "first-cut"?

107
00:05:19.439 --> 00:05:20.240
<v Speaker 3>And don't forget

108
00:05:20.360 --> 00:05:23.600
<v Speaker 2>numbers or special characters mixed in with words, like iPhone

109
00:05:23.639 --> 00:05:26.040
<v Speaker 2>5s, or a web address, or an email.

110
00:05:26.399 --> 00:05:29.079
<v Speaker 1>Wow. Okay, so it sounds like even the simplest things,

111
00:05:29.240 --> 00:05:32.399
<v Speaker 1>like a single period, can be a total minefield for

112
00:05:32.439 --> 00:05:37.240
<v Speaker 1>a computer trying to understand text. What's the common thread here?

113
00:05:37.240 --> 00:05:39.439
<v Speaker 1>What makes all these little details so tricky.

114
00:05:39.920 --> 00:05:43.560
<v Speaker 2>I think the common thread is context and frankly human intuition.

115
00:05:44.000 --> 00:05:47.519
<v Speaker 2>We just effortlessly figure this stuff out using the surrounding

116
00:05:47.519 --> 00:05:50.720
<v Speaker 2>information and our world knowledge. But for a computer, each

117
00:05:50.759 --> 00:05:52.920
<v Speaker 2>of these is like a tiny decision point where it

118
00:05:52.959 --> 00:05:56.040
<v Speaker 2>needs to apply some rule or a statistical guess. Right.

119
00:05:56.079 --> 00:05:58.680
<v Speaker 1>Okay, So given all these complexities, what does this actually

120
00:05:58.720 --> 00:06:02.160
<v Speaker 1>mean for building systems that you know, process language? How

121
00:06:02.199 --> 00:06:05.199
<v Speaker 1>do we even start to tackle this massive ocean of words?

122
00:06:05.560 --> 00:06:07.519
<v Speaker 2>Well, the good news is that even the most complex

123
00:06:07.639 --> 00:06:10.279
<v Speaker 2>NLP applications are usually built up from a set of

124
00:06:10.959 --> 00:06:14.399
<v Speaker 2>fundamental techniques, building blocks, if you will. These often work

125
00:06:14.480 --> 00:06:17.319
<v Speaker 2>together in sequence in what we call a pipeline. Pipeline.

126
00:06:17.519 --> 00:06:20.480
<v Speaker 2>and the very first step usually is finding the parts

127
00:06:20.480 --> 00:06:25.160
<v Speaker 2>of the text. This covers two main things, tokenization and normalization.

128
00:06:25.639 --> 00:06:29.920
<v Speaker 1>Tokenization, breaking into tokens, like words? Exactly.

129
00:06:30.000 --> 00:06:34.120
<v Speaker 2>Tokenization is absolutely fundamental. It's breaking down that raw stream

130
00:06:34.160 --> 00:06:36.439
<v Speaker 2>of text into individual units

131
00:06:36.160 --> 00:06:37.160
<v Speaker 3>we call tokens.

132
00:06:37.959 --> 00:06:40.759
<v Speaker 2>Usually these are words, but sometimes they can be smaller

133
00:06:40.800 --> 00:06:44.360
<v Speaker 2>things too, like morphemes. Morphemes? Yeah, the smallest bits

134
00:06:44.399 --> 00:06:48.199
<v Speaker 2>of a word that still have meaning, like the "un" in "unbreakable"

135
00:06:48.399 --> 00:06:52.680
<v Speaker 2>or the "-ed" suffix in "bounded." Ah. Or tokens could

136
00:06:52.680 --> 00:06:55.319
<v Speaker 2>be bigger, like multi word phrases that act as a

137
00:06:55.360 --> 00:06:59.680
<v Speaker 2>single unit, but yeah, mostly think words. NLP also has

138
00:06:59.759 --> 00:07:03.879
<v Speaker 2>to figure out how to handle things like abbreviations, contractions, numbers,

139
00:07:03.920 --> 00:07:06.399
<v Speaker 2>and even, you know, synonyms, different words meaning the same thing.
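
NOTE
A minimal tokenizer sketch for the cases discussed above: contractions,
hyphenated words, mixed tokens like "5s", and web or e-mail addresses. The
names TOKEN_RE and tokenize are hypothetical; nothing here comes from the
book or from an NLP library, and keeping "can't" as a single token is only
one of several reasonable choices.
```python
import re
# Hypothetical pattern, alternatives tried left to right.
TOKEN_RE = re.compile(
    r"https?://\S+"                    # web address
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # e-mail address
    r"|\w+(?:['-]\w+)*"                # word, contraction, or hyphenated word
    r"|\S"                             # any other single non-space character
)
def tokenize(text):
    """Return the tokens of text in order of appearance."""
    return TOKEN_RE.findall(text)
print(tokenize("I can't find the first-cut iPhone 5s."))
```
Swapping the third alternative for one that splits on the apostrophe would
turn "can't" into two tokens instead; which is right depends on what the
later pipeline stages expect.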

140
00:07:06.519 --> 00:07:08.120
<v Speaker 1>And normalization, what's that about?

141
00:07:08.360 --> 00:07:11.240
<v Speaker 2>So once you have your tokens, normalization is basically cleaning

142
00:07:11.319 --> 00:07:14.920
<v Speaker 2>them up. It's essential preprocessing. A lot of NLP tools

143
00:07:14.920 --> 00:07:17.360
<v Speaker 2>and APIs they kind of assume the data coming in

144
00:07:17.439 --> 00:07:18.920
<v Speaker 2>is already clean and consistent.

145
00:07:19.079 --> 00:07:19.720
<v Speaker 1>Right, makes sense.

146
00:07:19.800 --> 00:07:23.439
<v Speaker 2>So normalization involves things like converting everything to lowercase so

147
00:07:23.519 --> 00:07:26.680
<v Speaker 2>"The" and "the" are treated the same, removing stop words,

148
00:07:26.720 --> 00:07:30.360
<v Speaker 2>those really common words like "the," "is," "as," which often

149
00:07:30.399 --> 00:07:33.360
<v Speaker 2>don't add much unique meaning for analysis. Okay, and then

150
00:07:33.399 --> 00:07:37.120
<v Speaker 2>we get into stemming. This is reducing words down to

151
00:07:37.160 --> 00:07:40.839
<v Speaker 2>their root form, so like running, runs and ran might

152
00:07:40.920 --> 00:07:44.160
<v Speaker 2>all get reduced down to just run. There's a famous

153
00:07:44.199 --> 00:07:46.040
<v Speaker 2>algorithm called the Porter stemmer for this.
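
NOTE
A rough sketch of the normalization steps just described: lowercasing,
stop-word removal, and stemming. This is a naive suffix stripper, not the
Porter algorithm, and STOP_WORDS, SUFFIXES, naive_stem and normalize are
illustrative names rather than any library's API. Note that suffix
stripping alone leaves an irregular form like "ran" untouched.
```python
STOP_WORDS = {"the", "is", "as", "a", "an", "and", "of"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # naive suffix list; NOT the Porter algorithm
def naive_stem(word):
    """Strip one common suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]  # "runn" becomes "run"
            return stem
    return word
def normalize(tokens):
    """Lowercase, drop stop words, and stem what is left."""
    lowered = [t.lower() for t in tokens]
    return [naive_stem(t) for t in lowered if t not in STOP_WORDS]
print(normalize(["The", "dog", "runs", "and", "is", "running"]))
```
Mapping "ran" to "run" needs dictionary or morphological knowledge, which
is the job of lemmatization rather than stemming.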

154
00:07:46.199 --> 00:07:47.360
<v Speaker 1>Okay, stemming, got it.

155
00:07:47.480 --> 00:07:50.800
<v Speaker 2>And then there's something a bit more sophisticated called lemmatization.

156
00:07:51.720 --> 00:07:52.759
<v Speaker 3>This tries to find

157
00:07:52.600 --> 00:07:56.160
<v Speaker 2>the actual dictionary form, or lemma, of a word. So,

158
00:07:56.360 --> 00:07:59.360
<v Speaker 2>for example, the lemma of "was" is actually "be."

159
00:07:59.120 --> 00:08:02.639
<v Speaker 1>Ah, I see the difference. Stemming is cruder,

160
00:08:03.279 --> 00:08:06.759
<v Speaker 1>lemmatization is more linguistically aware. Exactly.

161
00:08:06.839 --> 00:08:10.720
<v Speaker 2>Tools like Stanford CoreNLP or OpenNLP have modules

162
00:08:10.759 --> 00:08:12.600
<v Speaker 2>that can do this lemmatization pretty well.

163
00:08:12.720 --> 00:08:15.240
<v Speaker 1>Okay, so we've broken the text into its basic atoms,

164
00:08:15.360 --> 00:08:18.240
<v Speaker 1>the tokens, the words. But language isn't just a jumble

165
00:08:18.279 --> 00:08:21.600
<v Speaker 1>of words, right, It's structured into sentences, into ideas. You'd

166
00:08:21.600 --> 00:08:24.079
<v Speaker 1>think finding sentences would be easy, just look for a period,

167
00:08:24.240 --> 00:08:27.959
<v Speaker 1>question mark, exclamation point. But I suspect it's not that simple.

168
00:08:28.040 --> 00:08:31.160
<v Speaker 2>You suspect correctly, it's definitely not that simple. This process

169
00:08:31.240 --> 00:08:36.320
<v Speaker 2>is called sentence boundary disambiguation, SBD. SBD. And the difficulty, as

170
00:08:36.360 --> 00:08:39.279
<v Speaker 2>you pointed out earlier, comes right back to the ambiguity

171
00:08:39.320 --> 00:08:43.919
<v Speaker 2>of punctuation, especially the humble period. It ends sentences, sure,

172
00:08:44.440 --> 00:08:48.159
<v Speaker 2>but it also ends abbreviations ("Mr."), appears in numbers (three

173
00:08:48.200 --> 00:08:51.799
<v Speaker 2>point one four), and signifies omissions: ellipses.

174
00:08:51.879 --> 00:08:52.840
<v Speaker 1>Right, the list goes on.

175
00:08:53.039 --> 00:08:56.039
<v Speaker 2>So imagine the sentence "Mr. and Mrs. Smith went to Washington."

176
00:08:56.480 --> 00:09:00.639
<v Speaker 2>Those first two periods don't end sentences. No. Getting SBD

177
00:09:00.720 --> 00:09:03.120
<v Speaker 2>right is crucial because many of the next steps in

178
00:09:03.159 --> 00:09:06.399
<v Speaker 2>an NLP pipeline, like assigning parts of speech or finding

179
00:09:06.519 --> 00:09:09.879
<v Speaker 2>named entities, typically operate on one sentence at a time.

180
00:09:10.279 --> 00:09:12.200
<v Speaker 1>Okay, so if you split the sentence wrong.

181
00:09:12.080 --> 00:09:14.919
<v Speaker 2>Exactly, you can completely mess up the downstream analysis. You

182
00:09:14.960 --> 00:09:18.120
<v Speaker 2>might confuse "He walked over. The hill was steep." with

183
00:09:18.200 --> 00:09:20.919
<v Speaker 2>the single phrase "over the hill." Totally different meaning.

184
00:09:20.639 --> 00:09:22.519
<v Speaker 1>Yikes. How do they handle it, then?

185
00:09:22.759 --> 00:09:22.960
<v Speaker 2>Well,

186
00:09:23.000 --> 00:09:25.279
<v Speaker 3>there are different approaches. Some are rule-based.

187
00:09:25.559 --> 00:09:28.919
<v Speaker 2>LingPipe, for example, has something called a heuristic sentence model.

188
00:09:29.240 --> 00:09:32.000
<v Speaker 2>It uses clever lists, like sets of words that are

189
00:09:32.240 --> 00:09:34.799
<v Speaker 2>possible stops at the end of a sentence, words that

190
00:09:34.840 --> 00:09:38.480
<v Speaker 2>are impossible just before a period (penultimates), and words that

191
00:09:38.519 --> 00:09:41.480
<v Speaker 2>are impossible at the start of a new sentence, plus

192
00:09:41.639 --> 00:09:44.159
<v Speaker 2>flags for things like balancing parentheses or quotes.
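
NOTE
A toy rule-based sentence splitter in the spirit of the heuristics above.
It is not LingPipe's heuristic sentence model; the ABBREVIATIONS list and
the split_sentences function are assumptions made up for this sketch, and
it only covers two of the period ambiguities mentioned (abbreviations and
decimal numbers).
```python
import re
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "ms.", "prof."}  # illustrative, far from complete
def split_sentences(text):
    """Split after ., ! or ? unless the stop closes a known abbreviation or sits inside a number."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+", text):
        end = match.end()
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        digit_before = text[match.start() - 1 : match.start()].isdigit()
        digit_after = end < len(text) and text[end].isdigit()
        if last_word in ABBREVIATIONS or (digit_before and digit_after):
            continue  # not a sentence boundary; keep scanning
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without a final stop
    return sentences
print(split_sentences("Mr. and Mrs. Smith went to Washington. Pi is about 3.14. Really?"))
```
A naive split on every period would have produced five fragments from the
first two sentences alone, which is what breaks per-sentence steps later on.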

193
00:09:44.440 --> 00:09:46.039
<v Speaker 1>Wow, that sounds like detective work.

194
00:09:46.320 --> 00:09:49.480
<v Speaker 2>It kind of is using lots of rules and heuristics

195
00:09:49.519 --> 00:09:50.360
<v Speaker 2>to make the best guess.

196
00:09:50.440 --> 00:09:52.519
<v Speaker 1>Okay, this is where it gets really interesting for me.

197
00:09:52.679 --> 00:09:56.559
<v Speaker 1>We've got words, we've got sentences. How do computers go

198
00:09:56.679 --> 00:09:59.279
<v Speaker 1>beyond that to actually pick out the key things in

199
00:09:59.320 --> 00:10:01.840
<v Speaker 1>the text? The who? What? Where? How does it know

200
00:10:02.120 --> 00:10:04.480
<v Speaker 1>Apple is the company in one sentence and the fruit

201
00:10:04.519 --> 00:10:05.000
<v Speaker 1>in another.

202
00:10:05.320 --> 00:10:08.600
<v Speaker 2>Right, that's the job of named entity recognition, or NER.

203
00:10:08.840 --> 00:10:12.720
<v Speaker 2>NER. NER is the process of finding mentions of entities,

204
00:10:12.759 --> 00:10:18.159
<v Speaker 2>typically things like people, places, organizations, dates, money, time, and

205
00:10:18.200 --> 00:10:21.200
<v Speaker 2>classifying them, tagging them with their specific category.

206
00:10:21.360 --> 00:10:23.159
<v Speaker 1>Why is that hard? Seems like you could use lists.

207
00:10:23.000 --> 00:10:26.759
<v Speaker 2>Lists help, but names themselves are ambiguous. Is

208
00:10:26.840 --> 00:10:28.639
<v Speaker 2>Penny a person's name or a coin?

209
00:10:28.960 --> 00:10:29.480
<v Speaker 1>Good point.

210
00:10:29.720 --> 00:10:31.240
<v Speaker 3>Is Georgia the

211
00:10:31.399 --> 00:10:34.639
<v Speaker 2>US state, the country, or maybe even a person's name?

212
00:10:35.000 --> 00:10:36.360
<v Speaker 3>Context is everything.

213
00:10:37.000 --> 00:10:38.919
<v Speaker 1>Context again. Yep, and

214
00:10:39.159 --> 00:10:43.320
<v Speaker 2>entities can be mentioned in different ways: IBM versus International

215
00:10:43.320 --> 00:10:46.600
<v Speaker 2>Business Machines. The system needs to know those refer to

216
00:10:46.759 --> 00:10:47.799
<v Speaker 2>the same organization.

217
00:10:48.440 --> 00:10:51.840
<v Speaker 1>So how do they do NER? Lists and...

218
00:10:51.879 --> 00:10:55.240
<v Speaker 2>Well, there are broadly two main approaches. One is rule-based,

219
00:10:55.320 --> 00:10:58.960
<v Speaker 2>where human experts write detailed rules or use large predefined

220
00:10:58.960 --> 00:11:02.799
<v Speaker 2>lists (gazetteers, they're sometimes called). The other approach, which

221
00:11:02.799 --> 00:11:06.200
<v Speaker 2>is very common now, is machine learning. These systems learn

222
00:11:06.279 --> 00:11:08.399
<v Speaker 2>patterns from huge amounts of text that has already been

223
00:11:08.440 --> 00:11:12.639
<v Speaker 2>annotated with entities. They use statistical models. Examples? Exactly. And

224
00:11:12.679 --> 00:11:16.279
<v Speaker 2>for common structured entities you can sometimes use regular expressions,

225
00:11:16.279 --> 00:11:20.200
<v Speaker 2>those pattern-matching rules, to find things like phone numbers, URLs,

226
00:11:20.440 --> 00:11:24.600
<v Speaker 2>zip codes, email addresses, maybe even specific time and date formats.

227
00:11:24.440 --> 00:11:27.600
<v Speaker 1>Okay, so we found the entities. What about the

228
00:11:27.639 --> 00:11:29.519
<v Speaker 1>other words? Like, how do we get computers to understand

229
00:11:29.519 --> 00:11:33.000
<v Speaker 1>the grammar? What's a noun, what's a verb, adjective? And

230
00:11:33.039 --> 00:11:35.039
<v Speaker 1>why does that actually matter for understanding?
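
NOTE
A sketch of the regular-expression approach to structured entities
mentioned a moment ago (phone numbers, URLs, zip codes, email addresses).
The patterns and the names PATTERNS and find_entities are hypothetical and
deliberately loose; they assume US-style phone and ZIP formats and would
miss many real-world variants.
```python
import re
# Hypothetical, simplified patterns; real formats vary far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,]+"),
    "US_PHONE": re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}"),
    "US_ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}
def find_entities(text):
    """Return (label, matched text) pairs, ordered by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]
print(find_entities("Call 555-123-4567 or write to help@example.com, ZIP 30301."))
```
Patterns like these work because the entity's shape is fixed; they do
nothing for ambiguous names such as Penny or Georgia, where context decides.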

231
00:11:35.320 --> 00:11:37.840
<v Speaker 2>Yeah, that's crucial too. This is done using part-of-

232
00:11:37.840 --> 00:11:39.879
<v Speaker 2>speech tagging, or POS tagging.

233
00:11:39.960 --> 00:11:40.799
<v Speaker 1>POS tagging.

234
00:11:41.080 --> 00:11:47.759
<v Speaker 2>It's the process of assigning a grammatical tag like noun, verb, adjective, preposition, pronoun, adverb, conjunction,

235
00:11:47.879 --> 00:11:50.000
<v Speaker 2>interjection to each word in a sentence.

236
00:11:50.159 --> 00:11:50.960
<v Speaker 1>Why do we need that?

237
00:11:51.279 --> 00:11:53.679
<v Speaker 2>It's really important for figuring out the context of a

238
00:11:53.679 --> 00:11:56.600
<v Speaker 2>word and its role in the sentence structure. Knowing if

239
00:11:56.759 --> 00:11:59.759
<v Speaker 2>"book" is a noun or a verb changes everything.

240
00:11:59.480 --> 00:12:02.320
<v Speaker 1>True. "Book the flight" versus "read the book." Precisely.

241
00:12:02.600 --> 00:12:06.720
<v Speaker 2>But even POS tagging has challenges. Remember normalization? If you

242
00:12:06.799 --> 00:12:10.000
<v Speaker 2>lowercase everything, you might confuse "sam" the word with "Sam"

243
00:12:10.039 --> 00:12:14.840
<v Speaker 2>the name, a proper noun. Contractions again ("can't"), hyphenated words

244
00:12:14.960 --> 00:12:19.080
<v Speaker 2>("state-of-the-art"), embedded numbers ("version 5"), weird character

245
00:12:19.120 --> 00:12:20.639
<v Speaker 2>sequences like URLs.

246
00:12:21.240 --> 00:12:22.960
<v Speaker 3>They all make POS tagging harder.

247
00:12:23.320 --> 00:12:25.519
<v Speaker 1>So how are the tags assigned? Is there a standard?

248
00:12:25.919 --> 00:12:28.080
<v Speaker 2>There are several tag sets, but a very common one

249
00:12:28.120 --> 00:12:31.000
<v Speaker 2>is the Penn Treebank tag set. It uses short tags

250
00:12:31.080 --> 00:12:34.240
<v Speaker 2>like NN for a singular noun, NNS for a plural noun,

251
00:12:34.559 --> 00:12:37.840
<v Speaker 2>VBD for a past-tense verb, JJ for an adjective,

252
00:12:37.879 --> 00:12:41.440
<v Speaker 2>and so on. And to train these POS tagging models,

253
00:12:41.799 --> 00:12:44.799
<v Speaker 2>you need a corpus that's a large body of text

254
00:12:44.799 --> 00:12:47.200
<v Speaker 2>that has already been manually tagged with the correct parts

255
00:12:47.200 --> 00:12:50.159
<v Speaker 2>of speech. Famous examples are the Brown Corpus or the

256
00:12:50.159 --> 00:12:54.120
<v Speaker 2>British National Corpus. The models learn from these labeled examples.

257
00:12:54.200 --> 00:12:58.600
<v Speaker 1>Okay, this is fascinating. We've identified words, sentences, entities, grammar.

258
00:12:59.000 --> 00:13:02.159
<v Speaker 1>Let's shift focus a bit. How do computers go beyond

259
00:13:02.279 --> 00:13:05.799
<v Speaker 1>just identifying these pieces? How do they actually represent the text,

260
00:13:06.159 --> 00:13:09.200
<v Speaker 1>especially the meaning in context for deeper analysis?

261
00:13:09.360 --> 00:13:12.360
<v Speaker 2>Right, moving towards representation. This brings us to two really

262
00:13:12.399 --> 00:13:15.720
<v Speaker 2>important concepts, feature engineering and word embedding.

263
00:13:15.960 --> 00:13:18.840
<v Speaker 1>Feature engineering sounds like something out of AI.

264
00:13:18.879 --> 00:13:19.919
<v Speaker 3>It is, very much so.

265
00:13:20.200 --> 00:13:23.000
<v Speaker 2>Feature engineering is essentially the art (and it is still

266
00:13:23.000 --> 00:13:26.159
<v Speaker 2>something of an art) of transforming raw data into numerical

267
00:13:26.200 --> 00:13:29.519
<v Speaker 2>features that machine learning algorithms can actually work with. It

268
00:13:29.559 --> 00:13:32.600
<v Speaker 2>requires using domain knowledge to select or create the right

269
00:13:32.639 --> 00:13:36.279
<v Speaker 2>features that will help the algorithm learn effectively. It's still

270
00:13:36.279 --> 00:13:38.240
<v Speaker 2>a very human driven process in many ways.

271
00:13:38.279 --> 00:13:40.080
<v Speaker 1>Okay, and how does that apply to text?

272
00:13:40.159 --> 00:13:43.879
<v Speaker 2>Well, one common technique in text feature engineering is using n-grams.

273
00:13:44.240 --> 00:13:44.720
<v Speaker 1>N-grams?

274
00:13:44.799 --> 00:13:48.200
<v Speaker 2>Yeah, n-grams are simply sequences of n consecutive words

275
00:13:48.600 --> 00:13:49.440
<v Speaker 2>from the text.

276
00:13:49.600 --> 00:13:51.919
<v Speaker 3>So if you have the sentence "this is an

277
00:13:51.919 --> 00:13:54.639
<v Speaker 2>n-gram model," a two-gram, or bigram, would

278
00:13:54.639 --> 00:13:57.679
<v Speaker 2>be "this is," "is an," "an n-gram," "n-gram model."

279
00:13:57.759 --> 00:14:00.279
<v Speaker 2>A three-gram, or trigram, would be "this is an," "is

280
00:14:00.320 --> 00:14:02.320
<v Speaker 2>an n-gram," "an n-gram model."
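
NOTE
The bigram and trigram example above, written out as a short sketch. The
function name ngrams is an illustrative choice, and splitting on
whitespace stands in for a real tokenizer.
```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
tokens = "this is an n-gram model".split()
print(ngrams(tokens, 2))  # the four bigrams listed in the example
print(ngrams(tokens, 3))  # the three trigrams listed in the example
```
A sentence of k tokens yields k - n + 1 n-grams, so the counts shrink by
one each time n grows by one.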

281
00:14:02.360 --> 00:14:04.600
<v Speaker 1>Okay. Sequences of words? Why are they useful?

282
00:14:04.960 --> 00:14:07.919
<v Speaker 2>They help capture a bit more context than just single words.

283
00:14:08.120 --> 00:14:11.279
<v Speaker 2>They help us estimate the probability of a word sequence occurring.

284
00:14:11.679 --> 00:14:14.120
<v Speaker 2>This is often used to predict the next word in

285
00:14:14.159 --> 00:14:17.960
<v Speaker 2>a sequence, maybe for autocomplete. Many models use the Markov

286
00:14:17.960 --> 00:14:20.559
<v Speaker 2>assumption here, the idea that the probability of the next

287
00:14:20.559 --> 00:14:23.320
<v Speaker 2>word depends only on the previous one or few words.
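
NOTE
A small sketch of the Markov assumption with bigrams: the probability of
the next word is estimated from the previous word alone, as
count(previous, next) divided by count(previous, anything). The function
names and the nine-word training text are made up for illustration; a real
model would use a large corpus and some form of smoothing.
```python
from collections import Counter, defaultdict
def train_bigrams(tokens):
    """Count, for each word, which words follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following
def next_word_probability(following, prev, nxt):
    """Estimate P(nxt | prev) from the bigram counts; 0.0 if prev was never seen."""
    total = sum(following[prev].values())
    return following[prev][nxt] / total if total else 0.0
def most_likely_next(following, prev):
    """Return the most frequent follower of prev, or None if there is none."""
    counts = following[prev]
    return counts.most_common(1)[0][0] if counts else None
model = train_bigrams("the cat sat on the mat the cat ran".split())
print(next_word_probability(model, "the", "cat"), most_likely_next(model, "the"))
```
In this toy text "the" is followed by "cat" twice and "mat" once, so "cat"
is the suggestion a keyboard built on these counts would offer.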

288
00:14:23.120 --> 00:14:25.039
<v Speaker 1>Right, like on my phone keyboard. Exactly.

289
00:14:25.440 --> 00:14:27.039
<v Speaker 3>But then we get to word embedding.

290
00:14:27.519 --> 00:14:30.120
<v Speaker 2>This is a really powerful set of techniques for how

291
00:14:30.120 --> 00:14:33.320
<v Speaker 2>computers can deal with the context and meaning of words

292
00:14:33.399 --> 00:14:34.919
<v Speaker 2>in a more sophisticated way.

293
00:14:35.080 --> 00:14:37.759
<v Speaker 1>Embedding, like putting words into some kind of space?

294
00:14:38.080 --> 00:14:40.399
<v Speaker 2>That's a great way to think about it. The goal

295
00:14:40.559 --> 00:14:44.240
<v Speaker 2>is to represent words as numerical vectors, lists of numbers

296
00:14:44.240 --> 00:14:47.080
<v Speaker 2>in a high dimensional space, and the key idea is

297
00:14:47.080 --> 00:14:50.919
<v Speaker 2>that words with similar meanings should have similar vector representations.

298
00:14:50.919 --> 00:14:53.000
<v Speaker 2>They should be close to each other in this space.
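
NOTE
One common way to measure "close to each other" is cosine similarity
between two word vectors; that choice is an assumption here, since the
discussion does not name a measure. The three-dimensional vectors below
are invented purely for illustration and are not real embeddings.
```python
import math
def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
# Made-up toy vectors; real embeddings have hundreds of dimensions.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["banana"]))
```
With these toy values the first score comes out much higher than the
second, which is the behavior a good embedding is meant to show for
related and unrelated words.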

299
00:14:53.440 --> 00:14:56.039
<v Speaker 1>So "king" and "queen" would be close? Yeah. And maybe

300
00:14:56.080 --> 00:14:57.799
<v Speaker 1>"apple" and "banana."

301
00:14:57.279 --> 00:15:01.120
<v Speaker 2>Precisely, but Apple the company should be closer to, say,

302
00:15:01.440 --> 00:15:04.120
<v Speaker 2>Microsoft or Google than to "banana."

303
00:15:04.279 --> 00:15:04.639
<v Speaker 1>Okay.

304
00:15:05.039 --> 00:15:07.840
<v Speaker 2>The aim is to capture not just context, but also

305
00:15:08.080 --> 00:15:13.120
<v Speaker 2>maybe hierarchical relationships, like king, queen, prince, and morphological information,

306
00:15:13.240 --> 00:15:14.120
<v Speaker 2>like run, running, ran.

307
00:15:14.159 --> 00:15:15.639
<v Speaker 1>How do they create these embeddings?

308
00:15:15.679 --> 00:15:19.240
<v Speaker 2>There are two main families of approaches. First, frequency-based embedding.

309
00:15:19.360 --> 00:15:22.840
<v Speaker 2>These rely on counting how often words appear together: simple

310
00:15:22.879 --> 00:15:26.440
<v Speaker 2>counts, like a count vector, or more sophisticated methods like

311
00:15:26.639 --> 00:15:28.200
<v Speaker 2>TF-IDF.

312
00:15:27.960 --> 00:15:29.600
<v Speaker 1>TF-IDF, I've heard of that one.

313
00:15:29.600 --> 00:15:32.759
<v Speaker 2>Yeah, it's very common, especially in information retrieval. It stands

314
00:15:32.759 --> 00:15:37.039
<v Speaker 2>for term frequency-inverse document frequency. It combines two scores.

315
00:15:37.200 --> 00:15:40.559
<v Speaker 2>TF, term frequency, is just how often a word appears

316
00:15:40.559 --> 00:15:44.799
<v Speaker 2>in a single document, a simple count. IDF, inverse document frequency,

317
00:15:44.919 --> 00:15:47.759
<v Speaker 2>measures how important that word is across the entire collection

318
00:15:47.799 --> 00:15:51.080
<v Speaker 2>of documents. The idea is that words appearing in many,

319
00:15:51.080 --> 00:15:54.879
<v Speaker 2>many documents, like "the," "is," "a," are less informative than

320
00:15:54.919 --> 00:15:58.000
<v Speaker 2>words appearing in only a few, so rare words get

321
00:15:58.039 --> 00:15:59.399
<v Speaker 2>a higher IDF score.

322
00:15:59.519 --> 00:16:03.120
<v Speaker 1>Ah, So it balances frequency within a document with rarity

323
00:16:03.159 --> 00:16:04.879
<v Speaker 1>across all documents. Exactly.

324
00:16:04.919 --> 00:16:08.919
<v Speaker 2>The combined TF-IDF score helps rank how relevant a document

325
00:16:08.960 --> 00:16:11.600
<v Speaker 2>is to a query. For example, it gives more weight

326
00:16:11.679 --> 00:16:14.200
<v Speaker 2>to terms that are frequent in that document but relatively

327
00:16:14.279 --> 00:16:15.000
<v Speaker 2>rare overall.

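NOTE
Editor's sketch, not from the episode: a minimal Python illustration of the TF-IDF weighting described above. It assumes raw counts for TF and log(N / df) for IDF; real libraries often use smoothed or normalized variants. The toy corpus and function names are hypothetical. Blank lines are omitted so this stays a single WebVTT comment block.
```python
import math
from collections import Counter
# Hypothetical toy corpus; each document is already tokenized and lowercased.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "stock", "market", "fell"],
]
def tf(term, doc):
    # Term frequency: raw count of the term in one document.
    return Counter(doc)[term]
def idf(term, corpus):
    # Inverse document frequency: log(N / number of documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0
def tf_idf(term, doc, corpus):
    # Combined score: frequent in this document, rare across the collection.
    return tf(term, doc) * idf(term, corpus)
print(tf_idf("the", docs[0], docs))  # "the" is in every document, so IDF is log(1) = 0
print(tf_idf("mat", docs[0], docs))  # "mat" is in one document only, so it scores higher
```
With this variant, a word that appears in every document gets zero weight no matter how often it occurs, which is the balancing effect the speakers describe.
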
328
00:16:15.159 --> 00:16:17.720
<v Speaker 1>Makes sense. And the other type of embedding?

329
00:16:17.600 --> 00:16:21.559
<v Speaker 2>The second type is prediction based embedding. These methods typically

330
00:16:21.639 --> 00:16:23.679
<v Speaker 2>use neural networks and try to predict a word based

331
00:16:23.720 --> 00:16:26.279
<v Speaker 2>on its neighbors, or predict the neighbors based on the word.

332
00:16:26.600 --> 00:16:30.039
<v Speaker 2>This is where you hear names like Word2Vec, GloVe, CBOW

333
00:16:30.120 --> 00:16:33.200
<v Speaker 2>(continuous bag of words), and skip-gram models. They often

334
00:16:33.240 --> 00:16:36.960
<v Speaker 2>capture more subtle semantic relationships than frequency-based methods.

335
00:16:37.039 --> 00:16:39.840
<v Speaker 1>Okay, neural networks getting involved. So these embeddings create these

336
00:16:40.600 --> 00:16:44.600
<v Speaker 1>complex vector representations. You mentioned high dimensional space. How high

337
00:16:44.600 --> 00:16:46.480
<v Speaker 1>are we talking? Does that cause problems?

338
00:16:46.720 --> 00:16:48.919
<v Speaker 3>Oh, it absolutely causes problems.

339
00:16:49.559 --> 00:16:53.559
<v Speaker 2>We're often talking about vectors with hundreds, sometimes even thousands

340
00:16:53.679 --> 00:16:55.080
<v Speaker 2>of dimensions for each word.

341
00:16:55.240 --> 00:16:55.679
<v Speaker 1>Wow.

342
00:16:55.919 --> 00:16:58.519
<v Speaker 2>Now imagine you have a vocabulary of a million words,

343
00:16:58.559 --> 00:17:02.120
<v Speaker 2>each with three hundred dimensions. That requires a lot

344
00:17:02.159 --> 00:17:05.240
<v Speaker 2>of memory, over six gigabytes in that example, and computation;

345
00:17:05.599 --> 00:17:06.799
<v Speaker 2>it can become impractical.

346
00:17:06.960 --> 00:17:08.599
<v Speaker 1>Yeah, I can see that. So what do you do?

347
00:17:08.880 --> 00:17:12.279
<v Speaker 2>This is where dimensionality reduction techniques come in. We need

348
00:17:12.319 --> 00:17:15.200
<v Speaker 2>ways to reduce the number of dimensions while preserving as

349
00:17:15.279 --> 00:17:17.039
<v Speaker 2>much of the important information as possible.

350
00:17:16.920 --> 00:17:19.519
<v Speaker 1>Like summarizing the dimensions? Sort of.

351
00:17:19.920 --> 00:17:24.440
<v Speaker 2>One classic technique is principal component analysis or PCA. PCA

352
00:17:24.599 --> 00:17:27.279
<v Speaker 2>is a linear algorithm. It looks for the directions in

353
00:17:27.319 --> 00:17:30.759
<v Speaker 2>the data where the variance is highest, the principal components,

354
00:17:31.119 --> 00:17:34.519
<v Speaker 2>and projects the data onto a lower dimensional subspace defined

355
00:17:34.559 --> 00:17:37.160
<v Speaker 2>by those components. It basically tries to find the main

356
00:17:37.240 --> 00:17:39.920
<v Speaker 2>axes of variation and discard the less important ones.

357
00:17:40.000 --> 00:17:42.440
<v Speaker 1>Okay, linear finds the main trends.

358
00:17:42.480 --> 00:17:46.720
<v Speaker 2>Right. But sometimes the relationships between words, their meanings, aren't

359
00:17:46.759 --> 00:17:49.880
<v Speaker 2>purely linear. They might be clustered in more complex ways.

360
00:17:50.440 --> 00:17:54.759
<v Speaker 2>That's where nonlinear techniques like t-distributed stochastic neighbor embedding,

361
00:17:54.920 --> 00:17:56.400
<v Speaker 2>or t-SNE, come in.

362
00:17:56.480 --> 00:18:01.119
<v Speaker 1>t-SNE, that sounds fancy. It is quite sophisticated.

363
00:18:01.359 --> 00:18:04.559
<v Speaker 2>It's a nonlinear, nondeterministic algorithm, meaning you might get

364
00:18:04.599 --> 00:18:07.920
<v Speaker 2>slightly different results each time you run it. It's

365
00:18:07.960 --> 00:18:10.880
<v Speaker 2>particularly good at creating 2D or 3D maps

366
00:18:10.880 --> 00:18:14.599
<v Speaker 2>of high dimensional data that preserve the local structure, meaning

367
00:18:14.920 --> 00:18:17.119
<v Speaker 2>points that are close together in the high dimensional space

368
00:18:17.240 --> 00:18:19.599
<v Speaker 2>tend to remain close together in the low dimensional map.

369
00:18:19.680 --> 00:18:23.039
<v Speaker 1>So it's good for visualization, seeing clusters of words. Exactly.

370
00:18:23.200 --> 00:18:26.480
<v Speaker 2>PCA is maybe better for just raw compression sometimes, but

371
00:18:26.599 --> 00:18:30.880
<v Speaker 2>t-SNE is fantastic for visualizing and exploring complex relationships in

372
00:18:30.960 --> 00:18:33.960
<v Speaker 2>data like word embeddings. It's really good at finding structure

373
00:18:33.960 --> 00:18:36.440
<v Speaker 2>that other algorithms might miss because it's so flexible.

374
00:18:36.599 --> 00:18:40.079
<v Speaker 1>That's a great comparison. Okay, So once we've processed words

375
00:18:40.880 --> 00:18:44.799
<v Speaker 1>maybe represented them with these embeddings, how do we classify

376
00:18:45.240 --> 00:18:47.920
<v Speaker 1>entire pieces of text? It's like, is this news article

377
00:18:47.920 --> 00:18:52.079
<v Speaker 1>about sports or politics? Is this customer review positive or negative? Right?

378
00:18:52.119 --> 00:18:55.039
<v Speaker 2>Moving up to the document level, this involves tasks like

379
00:18:55.119 --> 00:18:58.440
<v Speaker 2>text classification, sentiment analysis, and language identification.

380
00:18:58.519 --> 00:19:01.599
<v Speaker 1>Okay, let's take those one by one. Text classification.

381
00:19:01.960 --> 00:19:07.079
<v Speaker 2>Text classification is pretty straightforward conceptually. It's about assigning a

382
00:19:07.160 --> 00:19:11.160
<v Speaker 2>piece of text, could be a sentence, paragraph, document, to

383
00:19:11.400 --> 00:19:15.119
<v Speaker 2>one or more predefined categories. Classic example: spam detection

384
00:19:15.200 --> 00:19:17.599
<v Speaker 2>in email. Is this email spam or not spam? Right.

385
00:19:17.759 --> 00:19:21.839
<v Speaker 2>But it's used for much more: automatically organizing huge archives

386
00:19:21.839 --> 00:19:24.920
<v Speaker 2>of documents by topic, maybe trying to determine the authorship

387
00:19:24.920 --> 00:19:28.319
<v Speaker 2>of historical texts. There is famous work on the Federalist

388
00:19:28.319 --> 00:19:31.519
<v Speaker 2>Papers using this. Cool. Or even trying to infer things

389
00:19:31.559 --> 00:19:35.240
<v Speaker 2>like the author's age range or gender based on writing style.

390
00:19:35.480 --> 00:19:38.960
<v Speaker 1>Interesting, okay. And sentiment analysis, that's the positive/negative thing?

391
00:19:39.279 --> 00:19:39.559
<v Speaker 3>Yes.

392
00:19:39.799 --> 00:19:43.559
<v Speaker 2>Sentiment analysis is a specific type of text classification focused

393
00:19:43.599 --> 00:19:46.759
<v Speaker 2>on determining the emotional tone or attitude expressed in a

394
00:19:46.799 --> 00:19:51.079
<v Speaker 2>piece of text. Is it positive, negative, neutral? Sometimes it's

395
00:19:51.119 --> 00:19:53.599
<v Speaker 2>mapped to a numerical rating, like stars out of five.

396
00:19:53.680 --> 00:19:57.880
<v Speaker 1>Where do you apply that? Reviews, social media.

397
00:19:57.279 --> 00:20:01.599
<v Speaker 2>All of the above: product reviews, movie reviews, social media comments,

398
00:20:01.720 --> 00:20:04.839
<v Speaker 2>survey responses, anything where you want to gauge opinion. It

399
00:20:04.839 --> 00:20:08.720
<v Speaker 2>can be applied at different levels. The whole document, individual sentences,

400
00:20:08.799 --> 00:20:10.920
<v Speaker 2>even clauses within sentences.

401
00:20:10.519 --> 00:20:12.920
<v Speaker 1>Are there challenges there? It seems like it could be tricky. Oh,

402
00:20:13.039 --> 00:20:13.720
<v Speaker 1>very tricky.

403
00:20:14.039 --> 00:20:16.400
<v Speaker 2>One big challenge is that a single piece of text

404
00:20:16.519 --> 00:20:19.680
<v Speaker 2>can express different sentiments about different things, different

405
00:20:19.359 --> 00:20:20.599
<v Speaker 3>targets or attributes.

406
00:20:20.880 --> 00:20:23.759
<v Speaker 2>Think about a review like the ride was very rough,

407
00:20:23.960 --> 00:20:27.359
<v Speaker 2>but the attendants did an excellent job of making us comfortable.

408
00:20:27.000 --> 00:20:29.319
<v Speaker 1>Right, negative about the ride, positive about the staff.

409
00:20:29.160 --> 00:20:31.799
<v Speaker 2>Exactly. The system needs to figure that out.

410
00:20:31.960 --> 00:20:36.759
<v Speaker 2>It's not just one overall sentiment. Sarcasm, irony, negation: they

411
00:20:36.799 --> 00:20:37.799
<v Speaker 2>all make it hard too.

412
00:20:37.880 --> 00:20:38.759
<v Speaker 1>How do they approach it?

413
00:20:39.000 --> 00:20:43.119
<v Speaker 2>Often they use sentiment lexicons. These are basically dictionaries where

414
00:20:43.240 --> 00:20:46.799
<v Speaker 2>words are pre scored with positive or negative sentiment values.

415
00:20:47.359 --> 00:20:52.640
<v Speaker 2>Examples include the General Inquirer or the MPQA Subjectivity Cues Lexicon.

416
00:20:53.200 --> 00:20:55.640
<v Speaker 2>You can essentially count up the positive

417
00:20:55.240 --> 00:20:56.079
<v Speaker 3>and negative words.

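NOTE
Editor's sketch, not from the episode: a minimal Python version of lexicon-based scoring, counting positive and negative words. The six-word lexicon and the function name are hypothetical; published lexicons such as the ones named above are far larger and are not loaded here. Blank lines are omitted so this stays a single WebVTT comment block.
```python
# Hypothetical mini-lexicon; real lexicons score thousands of words.
LEXICON = {"excellent": 1, "comfortable": 1, "good": 1, "rough": -1, "terrible": -1, "bad": -1}
def lexicon_score(text):
    # Sum the scores of known words; unknown words contribute 0.
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(word, 0) for word in words)
review = "The ride was very rough, but the attendants did an excellent job of making us comfortable."
print(lexicon_score(review))  # rough (-1) + excellent (+1) + comfortable (+1)
```
Note that the single total hides the mixed sentiment in the review: the ride is scored negatively and the staff positively, which is exactly the per-target problem raised earlier.
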
418
00:20:56.119 --> 00:20:57.000
<v Speaker 1>Can you build your own?

419
00:20:57.160 --> 00:20:57.599
<v Speaker 3>You can.

420
00:20:57.960 --> 00:21:00.880
<v Speaker 2>You can use semi-supervised learning techniques to build

421
00:21:00.960 --> 00:21:04.680
<v Speaker 2>a custom lexicon for your specific domain, which often works better.

422
00:21:04.880 --> 00:21:08.359
<v Speaker 1>Okay. And the last one was language identification, right?

423
00:21:08.440 --> 00:21:13.079
<v Speaker 2>Language identification. This is usually simpler: detecting which natural language

424
00:21:13.119 --> 00:21:16.359
<v Speaker 2>a piece of text is written in. Is it English, French, Spanish,

425
00:21:16.480 --> 00:21:17.000
<v Speaker 3>Japanese?

426
00:21:17.599 --> 00:21:21.319
<v Speaker 2>Tools like LingPipe have models trained on large multilingual

427
00:21:21.440 --> 00:21:23.559
<v Speaker 2>data sets like the Leipzig Corpora Collection to do this.

428
00:21:23.279 --> 00:21:25.960
<v Speaker 1>When would that be hard?

429
00:21:26.440 --> 00:21:29.440
<v Speaker 2>It gets tricky with very short texts like tweets, or

430
00:21:29.480 --> 00:21:31.519
<v Speaker 2>when a single text mixes multiple languages.

431
00:21:31.640 --> 00:21:34.799
<v Speaker 1>Yeah, I can see that, Okay, stepping back again, Yeah,

432
00:21:34.880 --> 00:21:37.279
<v Speaker 1>sometimes you don't just want to classify a document, You

433
00:21:37.319 --> 00:21:39.680
<v Speaker 1>want to know what it's about. More broadly, what are

434
00:21:39.720 --> 00:21:42.960
<v Speaker 1>the main themes in say a thousand news articles.

435
00:21:43.039 --> 00:21:44.920
<v Speaker 2>Ah, now you're talking about topic modeling.

436
00:21:45.000 --> 00:21:46.440
<v Speaker 1>Topic modeling? Exactly.

437
00:21:47.000 --> 00:21:48.920
<v Speaker 2>This is a set of techniques used to discover the

438
00:21:49.000 --> 00:21:52.119
<v Speaker 2>hidden abstract topics that occur in a collection of documents.

439
00:21:52.559 --> 00:21:55.079
<v Speaker 2>The idea is that each document can be represented as

440
00:21:55.119 --> 00:21:58.839
<v Speaker 2>a mixture of topics, and each topic is a distribution over words.

441
00:21:58.400 --> 00:22:00.000
<v Speaker 1>A mixture of topics?

442
00:22:00.240 --> 00:22:03.039
<v Speaker 2>Yeah, so an article might be seventy percent about politics,

443
00:22:03.480 --> 00:22:07.880
<v Speaker 2>twenty percent about economics, and ten percent about international relations.

444
00:22:08.640 --> 00:22:11.839
<v Speaker 2>Topic modeling helps find the relevance of each word across

445
00:22:11.880 --> 00:22:15.200
<v Speaker 2>the topics, e.g., "election" and "vote" are relevant to the

446
00:22:15.200 --> 00:22:18.480
<v Speaker 2>politics topic, and the relevance of the topics across each document.

447
00:22:18.599 --> 00:22:21.160
<v Speaker 1>How does it find those topics? They aren't labeled beforehand,

448
00:22:21.240 --> 00:22:21.640
<v Speaker 1>right? Right.

449
00:22:21.680 --> 00:22:23.000
<v Speaker 3>It's usually unsupervised.

450
00:22:23.160 --> 00:22:27.880
<v Speaker 2>A very popular method is latent Dirichlet allocation, or LDA. LDA?

451
00:22:28.079 --> 00:22:31.640
<v Speaker 2>LDA is a generative statistical model. Basically, it assumes a

452
00:22:31.680 --> 00:22:34.599
<v Speaker 2>process for how documents are created based on underlying topics,

453
00:22:34.759 --> 00:22:37.160
<v Speaker 2>and then it works backward from the observed documents, the

454
00:22:37.200 --> 00:22:40.240
<v Speaker 2>words, to infer the most likely topic structure that could

455
00:22:40.240 --> 00:22:43.519
<v Speaker 2>have generated them. It typically involves converting the text into

456
00:22:43.559 --> 00:22:46.559
<v Speaker 2>a document term matrix and then using sampling methods to

457
00:22:46.720 --> 00:22:50.279
<v Speaker 2>estimate two key matrices: a document-topic matrix and a

458
00:22:50.319 --> 00:22:51.119
<v Speaker 2>topic-term matrix.

459
00:22:50.960 --> 00:22:55.240
<v Speaker 1>So it uncovers the hidden themes automatically. That sounds

460
00:22:55.279 --> 00:22:58.359
<v Speaker 1>incredibly powerful for exploring large data sets.

461
00:22:58.440 --> 00:22:59.000
<v Speaker 3>It really is.

462
00:22:59.079 --> 00:23:02.240
<v Speaker 2>It's great for understanding the main themes running through large

463
00:23:02.240 --> 00:23:04.240
<v Speaker 2>amounts of text without having to read everything.

464
00:23:04.400 --> 00:23:07.839
<v Speaker 1>Now we've talked a lot about words, context topics, but

465
00:23:07.960 --> 00:23:10.880
<v Speaker 1>how do computers really get at the structure of a sentence,

466
00:23:11.319 --> 00:23:15.680
<v Speaker 1>the relationships between words, how phrases fit together. This feels

467
00:23:15.720 --> 00:23:17.960
<v Speaker 1>like it's getting closer to genuine understanding.

468
00:23:18.240 --> 00:23:18.640
<v Speaker 3>You're right.

469
00:23:18.680 --> 00:23:21.079
<v Speaker 2>This is about digging into the grammatical structure. This is

470
00:23:21.119 --> 00:23:23.960
<v Speaker 2>the domain of parsing and relationship extraction.

471
00:23:24.400 --> 00:23:27.400
<v Speaker 1>Parsing, like diagramming sentences in school?

472
00:23:26.920 --> 00:23:31.759
<v Speaker 2>Very similar idea. Parsing, or syntactic analysis, is

473
00:23:31.799 --> 00:23:34.680
<v Speaker 2>the process of analyzing a string of symbols, in this case

474
00:23:34.759 --> 00:23:36.799
<v Speaker 2>words in a sentence, according to the rules of a

475
00:23:36.839 --> 00:23:40.319
<v Speaker 2>formal grammar. The output is often a parse tree, which

476
00:23:40.400 --> 00:23:43.759
<v Speaker 2>is a tree like structure showing how the sentence is organized,

477
00:23:44.000 --> 00:23:45.880
<v Speaker 2>how phrases are nested within each other.

478
00:23:46.119 --> 00:23:47.480
<v Speaker 1>Are there different kinds of parsing?

479
00:23:47.759 --> 00:23:47.960
<v Speaker 3>Yes.

480
00:23:48.079 --> 00:23:51.400
<v Speaker 2>Two main types are common. Dependency parsing focuses on the

481
00:23:51.440 --> 00:23:56.319
<v Speaker 2>grammatical relationships between individual words: which word modifies which other word.

482
00:23:56.880 --> 00:23:59.599
<v Speaker 2>For example, the verb governs the noun object, and adjectives

483
00:23:59.640 --> 00:24:05.200
<v Speaker 2>modify a noun. Phrase structure parsing, or constituency parsing, focuses

484
00:24:05.200 --> 00:24:08.759
<v Speaker 2>on breaking the sentence down into nested phrases or constituents

485
00:24:08.799 --> 00:24:12.279
<v Speaker 2>like noun phrases, verb phrases, prepositional phrases.

486
00:24:11.839 --> 00:24:13.200
<v Speaker 1>And why do we need parse trees? What are they

487
00:24:13.240 --> 00:24:13.519
<v Speaker 1>used for?

488
00:24:13.759 --> 00:24:18.240
<v Speaker 2>They're fundamental for many advanced NLP tasks: information extraction (pulling

489
00:24:18.240 --> 00:24:22.759
<v Speaker 2>specific facts from text), sophisticated grammar checking. High-quality machine

490
00:24:22.759 --> 00:24:26.960
<v Speaker 2>translation often relies heavily on parsing the source and target languages.

491
00:24:27.079 --> 00:24:29.720
<v Speaker 1>Okay, and related to that, you mentioned coreference before.

492
00:24:29.279 --> 00:24:32.839
<v Speaker 2>Right, coreference resolution. This is a crucial related task:

493
00:24:33.079 --> 00:24:36.119
<v Speaker 2>figuring out when different expressions in a text all refer

494
00:24:36.240 --> 00:24:39.480
<v Speaker 2>back to the same person, place, or thing, the same entity.

495
00:24:39.599 --> 00:24:42.839
<v Speaker 2>Remember the example: "he" (the robber) saw "him" (the policeman)

496
00:24:42.880 --> 00:24:45.799
<v Speaker 2>in Boston. Resolving "he" to the robber and "him" to

497
00:24:45.839 --> 00:24:48.200
<v Speaker 2>the policeman is coreference resolution.

498
00:24:48.160 --> 00:24:52.160
<v Speaker 1>And understanding these relationships, the parsing and the coreference, that

499
00:24:52.240 --> 00:24:55.160
<v Speaker 1>must be vital for things like answering questions, right?

500
00:24:55.400 --> 00:24:58.559
<v Speaker 2>Absolutely. Think about a question answering system. If you ask

501
00:24:58.640 --> 00:25:01.200
<v Speaker 2>"Who is the thirty-second president of the United States?",

502
00:25:01.880 --> 00:25:04.319
<v Speaker 2>the system needs to parse that question. It needs to

503
00:25:04.400 --> 00:25:08.359
<v Speaker 2>identify "who" as the thing being asked for, "is president

504
00:25:08.440 --> 00:25:11.240
<v Speaker 2>of" as the relationship, and "the thirty-second" and "the

505
00:25:11.319 --> 00:25:14.720
<v Speaker 2>United States" as specifying the entity. Then it uses that

506
00:25:14.759 --> 00:25:18.279
<v Speaker 2>structured understanding to find the answer: Franklin D. Roosevelt. Parsing

507
00:25:18.279 --> 00:25:19.400
<v Speaker 2>and coreference are key.

508
00:25:19.759 --> 00:25:22.839
<v Speaker 1>Wow, so we've covered a huge amount of ground all

509
00:25:22.839 --> 00:25:26.000
<v Speaker 1>these intricate techniques. Let's maybe pull back a bit and

510
00:25:26.000 --> 00:25:29.000
<v Speaker 1>look at the bigger picture. How are these NLP models

511
00:25:29.039 --> 00:25:32.000
<v Speaker 1>actually developed and trained and put to use. It sounds

512
00:25:32.000 --> 00:25:33.200
<v Speaker 1>like a massive process.

513
00:25:33.559 --> 00:25:35.799
<v Speaker 2>It certainly can be. But there's a general workflow, a

514
00:25:35.880 --> 00:25:38.839
<v Speaker 2>process for how these models are typically built and deployed. First,

515
00:25:38.880 --> 00:25:42.160
<v Speaker 2>you identify the specific task you need to solve: sentiment analysis,

516
00:25:42.160 --> 00:25:45.279
<v Speaker 2>NER, translation, whatever it is. Then you select an

517
00:25:45.319 --> 00:25:48.640
<v Speaker 2>appropriate model or algorithm for that task. Then comes the

518
00:25:48.640 --> 00:25:53.799
<v Speaker 2>crucial part, building and/or training the model. This nearly

519
00:25:53.880 --> 00:25:57.880
<v Speaker 2>always requires data, specifically a corpus, that large collection of

520
00:25:57.920 --> 00:26:01.200
<v Speaker 2>text often marked up or annotated with the correct answers

521
00:26:01.240 --> 00:26:01.599
<v Speaker 2>for your task.

522
00:26:01.480 --> 00:26:03.319
<v Speaker 1>Right, the training data. Exactly.

523
00:26:03.640 --> 00:26:06.480
<v Speaker 2>You train the model on this data. Then you need

524
00:26:06.519 --> 00:26:09.839
<v Speaker 2>to verify its quality, test how well it performs, usually

525
00:26:09.880 --> 00:26:12.359
<v Speaker 2>on a separate sample set of data it hasn't seen before.

526
00:26:12.480 --> 00:26:13.359
<v Speaker 1>Check your work.

527
00:26:13.279 --> 00:26:16.680
<v Speaker 2>Precisely, and once you're satisfied with the quality, you can

528
00:26:16.759 --> 00:26:19.880
<v Speaker 2>finally apply the model to your real world problem, to

529
00:26:20.000 --> 00:26:24.000
<v Speaker 2>new unseen data. But even before all that, a critical

530
00:26:24.039 --> 00:26:27.359
<v Speaker 2>first step in almost any NLP project is preparing the data.

531
00:26:27.480 --> 00:26:30.200
<v Speaker 1>Ah. Data prep always important.

532
00:26:29.799 --> 00:26:33.279
<v Speaker 2>Crucial, finding the right data, getting it into a usable format,

533
00:26:33.319 --> 00:26:37.200
<v Speaker 2>and very importantly making sure it's clean. As we said,

534
00:26:37.240 --> 00:26:40.160
<v Speaker 2>many NLP APIs and tools just assume the input data is

535
00:26:40.200 --> 00:26:43.440
<v Speaker 2>already cleaned up: consistent lowercase, punctuation handled, etc.

536
00:26:43.880 --> 00:26:45.720
<v Speaker 1>So cleaning is key. What about tools for this? You

537
00:26:45.759 --> 00:26:46.759
<v Speaker 1>mentioned Java earlier.

538
00:26:47.039 --> 00:26:49.960
<v Speaker 2>Yeah, the source text focuses on Java. Java has pretty

539
00:26:49.960 --> 00:26:53.359
<v Speaker 2>good built in support for character processing, reading files, and

540
00:26:53.400 --> 00:26:55.720
<v Speaker 2>there are libraries for handling all sorts of formats you

541
00:26:55.759 --> 00:27:01.039
<v Speaker 2>might encounter: HTML, Microsoft Word documents, PDFs, XML.

542
00:27:00.000 --> 00:27:02.839
<v Speaker 1>Wow, so you can pull text out of those? Definitely.

543
00:27:02.759 --> 00:27:05.240
<v Speaker 2>And when it comes to the NLP tools themselves, especially

544
00:27:05.279 --> 00:27:08.440
<v Speaker 2>in the Java world, there are several major players. Apache

545
00:27:08.480 --> 00:27:12.519
<v Speaker 2>OpenNLP is a popular toolkit. The Stanford NLP Group

546
00:27:12.599 --> 00:27:15.799
<v Speaker 2>provides a very comprehensive suite of tools. LingPipe is

547
00:27:15.839 --> 00:27:19.880
<v Speaker 2>another powerful option. For search specifically, Apache Lucene Core

548
00:27:20.039 --> 00:27:23.559
<v Speaker 2>is a fantastic open source library for building text search engines.

549
00:27:23.559 --> 00:27:26.559
<v Speaker 2>It underlies things like Elasticsearch and Solr, and it

550
00:27:26.599 --> 00:27:31.240
<v Speaker 2>relies heavily on NLP concepts like tokenization for indexing. Lucene, yeah,

551
00:27:31.279 --> 00:27:32.960
<v Speaker 2>I heard of that. And of course, for the really

552
00:27:33.000 --> 00:27:36.279
<v Speaker 2>cutting-edge stuff, deep learning libraries like Deeplearning4j

553
00:27:36.519 --> 00:27:41.240
<v Speaker 2>(DL4J) integrate NLP capabilities. Another really core concept, often

554
00:27:41.279 --> 00:27:45.599
<v Speaker 2>tied closely to NLP, especially search, is information retrieval, or IR.

555
00:27:45.279 --> 00:27:47.759
<v Speaker 1>How is that different from NLP?

556
00:27:48.160 --> 00:27:52.799
<v Speaker 2>They're very related and often overlap. IR is specifically focused on

557
00:27:53.279 --> 00:27:58.240
<v Speaker 2>finding relevant information within large collections of unstructured data, data

558
00:27:58.319 --> 00:28:02.440
<v Speaker 2>without a predefined model, like text documents or web pages. Search

559
00:28:02.480 --> 00:28:04.599
<v Speaker 2>engines are the classic IR application.

560
00:28:04.759 --> 00:28:06.039
<v Speaker 1>How do they search so fast?

561
00:28:06.200 --> 00:28:10.519
<v Speaker 2>A key technique is using inverted indexes. Instead of storing

562
00:28:10.599 --> 00:28:14.079
<v Speaker 2>documents and searching through them, an inverted index maps each

563
00:28:14.200 --> 00:28:17.720
<v Speaker 2>term (word) to a list of documents where it appears,

564
00:28:17.759 --> 00:28:19.920
<v Speaker 2>and often its position within those documents.

565
00:28:19.759 --> 00:28:22.000
<v Speaker 1>Like the index in the back of a book,

566
00:28:22.200 --> 00:28:23.920
<v Speaker 1>but for every word? Exactly.

567
00:28:24.039 --> 00:28:25.440
<v Speaker 3>It makes lookups much faster.

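NOTE
Editor's sketch, not from the episode: a minimal positional inverted index in Python, mapping each term to the documents and positions where it occurs. Whitespace tokenization, the two-document collection, and the function name are simplifying assumptions; this is not how Lucene stores its index. Blank lines are omitted so this stays a single WebVTT comment block.
```python
from collections import defaultdict
def build_inverted_index(docs):
    # Map each term to {doc_id: [positions where the term occurs]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index
# Hypothetical two-document collection.
docs = {1: "java makes text search fast", 2: "text search uses an inverted index"}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # ids of the documents that contain "search"
print(index["text"][2])         # positions of "text" in document 2
```
A query for a term becomes a single dictionary lookup instead of a scan over every document, which is the speed-up the speakers describe.
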
568
00:28:26.079 --> 00:28:28.759
<v Speaker 2>IR also deals with things like how to efficiently store

569
00:28:28.799 --> 00:28:32.000
<v Speaker 2>the vocabulary, the dictionary of terms, using structures like hash

570
00:28:32.000 --> 00:28:36.160
<v Speaker 2>tables or trees. And it needs tolerant retrieval. Tolerant retrieval?

571
00:28:36.319 --> 00:28:39.519
<v Speaker 2>Meaning it can handle things like typos or spelling variations.

572
00:28:39.920 --> 00:28:43.359
<v Speaker 2>This involves spelling correction, finding the nearest correct word to

573
00:28:43.400 --> 00:28:47.359
<v Speaker 2>a misspelled query term, maybe using edit distance or phonetic matching

574
00:28:47.400 --> 00:28:50.880
<v Speaker 2>algorithms like Soundex. And IR systems need to rank the

575
00:28:50.920 --> 00:28:54.920
<v Speaker 2>documents they find. This often involves the vector space model,

576
00:28:55.119 --> 00:28:58.319
<v Speaker 2>where documents and queries are represented as vectors, and scoring

577
00:28:58.359 --> 00:29:01.079
<v Speaker 2>and term weighting, which brings us right back to good

578
00:29:01.079 --> 00:29:05.200
<v Speaker 2>old TF-IDF for figuring out which documents are most relevant to the query.

579
00:29:04.799 --> 00:29:07.759
<v Speaker 1>It all connects back. It really does.

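NOTE
Editor's sketch, not from the episode: the edit-distance approach to spelling correction mentioned above, written as a standard Levenshtein dynamic program in Python. The tiny vocabulary and the correct() helper are hypothetical; production systems typically combine this with candidate filtering rather than scanning the whole dictionary. Blank lines are omitted so this stays a single WebVTT comment block.
```python
def edit_distance(a, b):
    # Levenshtein distance via dynamic programming, keeping one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]
def correct(word, vocabulary):
    # Pick the vocabulary term with the smallest edit distance to the query term.
    return min(vocabulary, key=lambda term: edit_distance(word, term))
vocabulary = ["search", "engine", "index", "token"]  # hypothetical dictionary of terms
print(correct("serch", vocabulary))  # "search" is one insertion away
```
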
580
00:29:07.880 --> 00:29:11.519
<v Speaker 1>Now you mentioned pipelines earlier. It's clear these components are

581
00:29:11.559 --> 00:29:14.319
<v Speaker 1>powerful on their own, but the real magic must be

582
00:29:14.480 --> 00:29:17.960
<v Speaker 1>when they work together. How are these individual pieces actually

583
00:29:18.000 --> 00:29:20.160
<v Speaker 1>assembled into a working system.

584
00:29:19.880 --> 00:29:23.559
<v Speaker 2>That's exactly right. It's done using combined pipelines. A pipeline

585
00:29:23.559 --> 00:29:26.240
<v Speaker 2>in this context is just that sequence of operations we

586
00:29:26.319 --> 00:29:30.079
<v Speaker 2>talked about. The output from one NLP step, say tokenization,

587
00:29:30.279 --> 00:29:33.480
<v Speaker 2>becomes the input for the next step, maybe POS tagging. Yeah.

588
00:29:33.519 --> 00:29:35.839
<v Speaker 2>And its output feeds the next, like NER.

589
00:29:35.799 --> 00:29:37.799
<v Speaker 1>Like an assembly line for text analysis.

590
00:29:37.599 --> 00:29:40.920
<v Speaker 2>Perfect analogy. And tools are often designed with this

591
00:29:41.000 --> 00:29:44.200
<v Speaker 2>in mind. The Stanford CoreNLP library, for instance, is

592
00:29:44.200 --> 00:29:47.359
<v Speaker 2>built around this idea. It uses annotator objects for each

593
00:29:47.440 --> 00:29:52.400
<v Speaker 2>task: tokenize, sentence split, POS tag, lemmatize, NER, parse, coreference.

594
00:29:52.960 --> 00:29:56.160
<v Speaker 2>You can easily define a pipeline specifying which annotators you

595
00:29:56.240 --> 00:29:57.079
<v Speaker 2>want to run

596
00:29:57.000 --> 00:29:57.720
<v Speaker 3>and in what order.

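NOTE
Editor's sketch, not from the episode: the assembly-line idea in plain Python, where each stage's output is the next stage's input. This is not the Stanford CoreNLP API; the stage functions and run_pipeline are hypothetical stand-ins for real annotators. Blank lines are omitted so this stays a single WebVTT comment block.
```python
import re
def tokenize(text):
    # Split into word tokens and single punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text)
def lowercase(tokens):
    return [token.lower() for token in tokens]
def remove_punctuation(tokens):
    return [token for token in tokens if token.isalnum()]
def run_pipeline(text, stages):
    # Feed the output of each stage into the next, in the order given.
    result = text
    for stage in stages:
        result = stage(result)
    return result
stages = [tokenize, lowercase, remove_punctuation]  # hypothetical stage list
print(run_pipeline("He saw him in Boston.", stages))
```
Choosing which stages run, and in what order, is just a matter of editing the list, which mirrors the flexibility described for annotator-based pipelines.
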
597
00:29:57.839 --> 00:29:59.519
<v Speaker 1>That seems really flexible. It is.

598
00:30:00.039 --> 00:30:03.440
<v Speaker 2>These pipelines often start even before the core NLP tasks.

599
00:30:03.880 --> 00:30:06.359
<v Speaker 2>They might include initial steps for just getting the text

600
00:30:06.440 --> 00:30:09.920
<v Speaker 2>out of various formats. There are libraries like Boilerpipe,

601
00:30:10.000 --> 00:30:14.359
<v Speaker 2>specifically designed to extract the main text content from messy HTML

602
00:30:14.359 --> 00:30:17.680
<v Speaker 2>web pages, stripping out ads and menus. Oh, useful. Or

603
00:30:17.759 --> 00:30:21.079
<v Speaker 2>Apache POI for pulling text from Microsoft Word files, or

604
00:30:21.119 --> 00:30:23.920
<v Speaker 2>Apache Tika for handling PDFs and a huge range of

605
00:30:23.960 --> 00:30:27.640
<v Speaker 2>other formats. Tika is amazing, actually. It can detect and

606
00:30:27.720 --> 00:30:31.279
<v Speaker 2>extract metadata and text from thousands of different types of files,

607
00:30:31.519 --> 00:30:33.960
<v Speaker 2>so it provides that clean text input needed to start

608
00:30:34.000 --> 00:30:35.519
<v Speaker 2>the main NLP pipeline.

609
00:30:35.640 --> 00:30:38.880
<v Speaker 1>Okay, that paints a much clearer picture of how it

610
00:30:38.880 --> 00:30:42.240
<v Speaker 1>all fits together. We've gone through a lot of complex components.

611
00:30:42.960 --> 00:30:45.559
<v Speaker 1>How does this all come together to create something tangible,

612
00:30:45.599 --> 00:30:48.880
<v Speaker 1>something maybe you and I interact with regularly. Let's connect

613
00:30:48.880 --> 00:30:55.039
<v Speaker 1>these techniques to a really common modern application: chatbots.

614
00:30:55.200 --> 00:31:00.240
<v Speaker 2>Ah, chatbots, yes, a perfect example of NLP in action. They have

615
00:31:00.480 --> 00:31:03.160
<v Speaker 2>absolutely exploded in popularity, haven't they?

616
00:31:03.000 --> 00:31:06.400
<v Speaker 1>Definitely. On websites, in apps, voice assistants.

617
00:31:06.119 --> 00:31:10.680
<v Speaker 2>Exactly. Facebook Messenger, Slack bots, Amazon Alexa, Google Assistant, Siri.

618
00:31:11.359 --> 00:31:14.480
<v Speaker 2>They're everywhere, and they've evolved quite a bit, from very

619
00:31:14.480 --> 00:31:19.160
<v Speaker 2>simple systems that just answered predefined questions to more sophisticated,

620
00:31:19.200 --> 00:31:22.519
<v Speaker 2>action oriented bots that can actually do things for you,

621
00:31:22.759 --> 00:31:25.599
<v Speaker 2>book appointments, place orders, provide detailed support.

622
00:31:25.759 --> 00:31:28.039
<v Speaker 1>So how do they work underneath? What's the architecture?

623
00:31:28.200 --> 00:31:29.680
<v Speaker 3>Well, you can think about a spectrum.

624
00:31:29.920 --> 00:31:32.319
<v Speaker 2>On one end, you have simple chatbots, maybe just following

625
00:31:32.359 --> 00:31:34.799
<v Speaker 2>a strict script or decision tree. Then you have more

626
00:31:34.799 --> 00:31:38.839
<v Speaker 2>conversational chatbots that can maintain context across several turns of

627
00:31:38.880 --> 00:31:41.000
<v Speaker 2>the dialogue. They remember what you said earlier. And then

628
00:31:41.039 --> 00:31:44.519
<v Speaker 2>you have the more advanced AI chatbots. These often use

629
00:31:44.640 --> 00:31:48.640
<v Speaker 2>machine learning, learning from vast amounts of training conversations. These

630
00:31:48.759 --> 00:31:52.720
<v Speaker 2>AI bots heavily leverage the NLP techniques we've discussed. They

631
00:31:52.799 --> 00:31:57.880
<v Speaker 2>use natural language understanding NLU, which involves intent classifications and

632
00:31:58.640 --> 00:32:01.039
<v Speaker 2>figuring out what the user wants to do. When you

633
00:32:01.079 --> 00:32:03.680
<v Speaker 2>say to Alexa, "Set a timer for five minutes," the

634
00:32:03.759 --> 00:32:06.839
<v Speaker 2>intent is "set a timer." Got it? And they use

635
00:32:07.000 --> 00:32:10.640
<v Speaker 2>entity extraction, pulling out the key pieces of information. In

636
00:32:10.680 --> 00:32:12.599
<v Speaker 2>that example, five minutes.

637
00:32:12.319 --> 00:32:14.960
<v Speaker 1>Is the time entity.

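NOTE
Editor's sketch, not from the episode or the book: a minimal Python illustration of
intent classification and entity extraction for the timer example. The keyword table,
the number-word table, and the function names are assumptions made for illustration;
a production NLU component would typically use a trained classifier instead.
```python
import re
# Hypothetical keyword rules standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "set_timer": ("timer",),
    "get_weather": ("weather", "forecast"),
}
# Tiny, made-up vocabulary of spelled-out numbers for the example.
NUMBER_WORDS = {"one": 1, "two": 2, "five": 5, "ten": 10}
DURATION = re.compile(r"\b(\d+|one|two|five|ten)\s+(second|minute|hour)s?\b")
def classify_intent(utterance):
    """Return the first intent whose keywords appear in the utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & set(keywords):
            return intent
    return "unknown"
def extract_duration(utterance):
    """Return (amount, unit) for the first duration mentioned, or None."""
    match = DURATION.search(utterance.lower())
    if match is None:
        return None
    amount, unit = match.groups()
    value = int(amount) if amount.isdigit() else NUMBER_WORDS[amount]
    return value, unit
print(classify_intent("Set a timer for five minutes"))
print(extract_duration("Set a timer for five minutes"))
```
Here the intent is the action the user wants, and the duration is the entity that
parameterizes it.
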
638
00:32:14.640 --> 00:32:17.960
<v Speaker 2>Exactly, So a modern AI chatbot is essentially running a

639
00:32:17.960 --> 00:32:23.119
<v Speaker 2>complex NLP pipeline: understand the user's utterance (NLU), decide what

640
00:32:23.200 --> 00:32:26.119
<v Speaker 2>to do, maybe query a database or API, and then

641
00:32:26.200 --> 00:32:29.599
<v Speaker 2>generate a natural language response (NLG, natural language generation).

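NOTE
Editor's sketch, not from the episode or the book: the three pipeline stages described
above (understand, decide, generate) written as plain Python functions. The weather
intent, the WEATHER dictionary standing in for a database or API, and all function
names are hypothetical.
```python
def understand(utterance):
    """NLU step: map raw text to an intent and entities (toy keyword rule)."""
    text = utterance.lower()
    if "weather" in text:
        city = text.rsplit(" in ", 1)[1].strip(" ?.") if " in " in text else None
        return {"intent": "get_weather", "city": city}
    return {"intent": "unknown"}
def decide(parsed, weather_db):
    """Decision step: look up data; the dict stands in for a database or API."""
    if parsed["intent"] == "get_weather" and parsed.get("city") in weather_db:
        city = parsed["city"]
        return {"action": "report", "city": city, "forecast": weather_db[city]}
    return {"action": "fallback"}
def generate(decision):
    """NLG step: fill a response template."""
    if decision["action"] == "report":
        return f"The forecast for {decision['city'].title()} is {decision['forecast']}."
    return "Sorry, I did not understand that."
def respond(utterance, weather_db):
    """Run the full pipeline on one user utterance."""
    return generate(decide(understand(utterance), weather_db))
WEATHER = {"paris": "sunny"}  # made-up data for the example
print(respond("What's the weather in Paris?", WEATHER))
```
Keeping the stages separate means any one of them could later be swapped for a
learned model without touching the others.
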
642
00:32:29.759 --> 00:32:32.519
<v Speaker 1>That makes sense. What about simpler bots, though? Not all

643
00:32:32.519 --> 00:32:34.079
<v Speaker 1>of them are full AI, right?

644
00:32:34.079 --> 00:32:35.039
<v Speaker 3>No, definitely not.

645
00:32:35.519 --> 00:32:39.079
<v Speaker 2>Many useful chatbots, especially for specific tasks, are retrieval

646
00:32:39.119 --> 00:32:42.359
<v Speaker 2>based models. These don't generate novel sentences, but select the

647
00:32:42.359 --> 00:32:46.279
<v Speaker 2>best response from a predefined set, often using rules, templates,

648
00:32:46.319 --> 00:32:48.319
<v Speaker 2>and the conversation history context.

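NOTE
Editor's sketch, not from the episode or the book: one simple way a retrieval-based bot
might select a reply from a predefined set. Scoring by word overlap (Jaccard similarity)
is an assumption chosen for brevity; the speakers mention rules, templates and
conversation context, which this example does not model. The stored question and
answer pairs are invented.
```python
import re
# Hypothetical predefined (question, response) pairs.
RESPONSES = [
    ("what are your opening hours", "We are open 9am to 5pm, Monday to Friday."),
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("where is my order", "You can track your order from the Orders page."),
]
FALLBACK = "Sorry, I don't have an answer for that."
def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))
def best_response(utterance, threshold=0.2):
    """Pick the stored response whose question overlaps most with the input."""
    query = tokens(utterance)
    best_score, best_reply = 0.0, None
    for question, reply in RESPONSES:
        candidate = tokens(question)
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        if score > best_score:
            best_score, best_reply = score, reply
    return best_reply if best_score >= threshold else FALLBACK
print(best_response("I need to reset my password"))
```
No sentence is generated here: the bot can only return one of the stored replies or
the fallback.
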
649
00:32:48.680 --> 00:32:51.039
<v Speaker 1>Okay, like picking the best canned response.

650
00:32:50.759 --> 00:32:54.079
<v Speaker 2>Kind of, but it can be quite sophisticated. One

651
00:32:54.160 --> 00:32:56.880
<v Speaker 2>common way to define the patterns and responses for these

652
00:32:56.880 --> 00:33:01.240
<v Speaker 2>bots is using Artificial Intelligence Markup Language, or AIML.

653
00:33:01.400 --> 00:33:01.960
<v Speaker 1>AIML.

654
00:33:02.079 --> 00:33:06.119
<v Speaker 2>Yeah, it's an XML based language specifically designed for creating chatbots.

655
00:33:06.599 --> 00:33:09.400
<v Speaker 2>You define patterns the bot should recognize, and the corresponding

656
00:33:09.440 --> 00:33:12.400
<v Speaker 2>templates for its response. So a very simple AIML rule

657
00:33:12.480 --> 00:33:15.279
<v Speaker 2>might look like, if the user input pattern is hello,

658
00:33:15.799 --> 00:33:18.160
<v Speaker 2>the template response is "Hello, how are you?"

659
00:33:18.079 --> 00:33:19.920
<v Speaker 1>Okay. Basic pattern matching.

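NOTE
Editor's sketch, not from the episode or the book: the hello rule written as an AIML
category and matched by a few lines of Python. The category, pattern and template
elements are standard AIML; the loader, the normalization step and the fallback reply
are simplified assumptions, not how any particular AIML interpreter is implemented.
```python
import xml.etree.ElementTree as ET
AIML_SOURCE = """
<aiml version="1.0.1">
  <category>
    <pattern>HELLO</pattern>
    <template>Hello, how are you?</template>
  </category>
</aiml>
"""
def load_rules(source):
    """Map each category's pattern text to its template text."""
    root = ET.fromstring(source)
    rules = {}
    for category in root.iter("category"):
        rules[category.findtext("pattern").strip()] = category.findtext("template").strip()
    return rules
def normalize(utterance):
    """Uppercase the input and drop punctuation before matching."""
    kept = "".join(ch for ch in utterance.upper() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())
def reply(utterance, rules):
    """Return the template for an exactly matching pattern, else a fallback."""
    return rules.get(normalize(utterance), "I have no answer for that.")
RULES = load_rules(AIML_SOURCE)
print(reply("Hello!", RULES))
```
Each category pairs one input pattern with one response template, which is the unit
the speakers describe.
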
660
00:33:20.079 --> 00:33:23.400
<v Speaker 2>It gets more powerful. You can use wildcards, like

661
00:33:23.480 --> 00:33:26.599
<v Speaker 2>a pattern "I like *". The star matches any word or phrase.

662
00:33:27.160 --> 00:33:29.640
<v Speaker 2>The response template could be "Okay, so you like star."

663
00:33:30.039 --> 00:33:32.680
<v Speaker 2>The star tag inserts whatever the user actually said after "I like."

664
00:33:32.440 --> 00:33:35.440
<v Speaker 1>Clever, so it can echo back

665
00:33:35.519 --> 00:33:37.000
<v Speaker 1>part of the user's input. Yep.

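NOTE
Editor's sketch, not from the episode or the book: how the wildcard rule could behave,
using a regular expression in place of a real AIML matcher. The star element in the
template is standard AIML; translating the pattern to a regex, and the function names,
are assumptions made for illustration.
```python
import re
# One hypothetical rule: AIML pattern text and a template containing <star/>.
PATTERN = "I LIKE *"
TEMPLATE = "Okay, so you like <star/>."
def pattern_to_regex(pattern):
    """Turn each * into a capture group matching one or more words."""
    parts = [r"(.+)" if word == "*" else re.escape(word) for word in pattern.split()]
    return re.compile(r"\s+".join(parts) + r"$", re.IGNORECASE)
def respond(utterance):
    """Return the filled template, or None when the pattern does not match."""
    cleaned = utterance.strip().rstrip(".!?")
    match = pattern_to_regex(PATTERN).match(cleaned)
    if match is None:
        return None
    # Substitute the captured words where the template has <star/>.
    return TEMPLATE.replace("<star/>", match.group(1))
print(respond("I like green tea."))
```
The captured text is echoed back unchanged, which is why such bots can appear to
follow what the user said without any deeper understanding.
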
666
00:33:37.519 --> 00:33:40.599
<v Speaker 2>And AIML also has tags like set and get that

667
00:33:40.720 --> 00:33:43.680
<v Speaker 2>let the bot store and retrieve information within the conversation.

668
00:33:43.839 --> 00:33:46.559
<v Speaker 2>It can remember your name, for instance, using set, and

669
00:33:46.599 --> 00:33:49.480
<v Speaker 2>then use it later with get. That helps maintain context.

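NOTE
Editor's sketch, not from the episode or the book: the idea behind AIML's set and get
tags, imitated with a per-conversation dictionary. The Session class, the two
hard-coded patterns and the reply wording are hypothetical; a real interpreter would
read them from AIML categories.
```python
import re
class Session:
    """Holds per-conversation variables, in the spirit of AIML set and get."""
    def __init__(self):
        self.predicates = {}
    def respond(self, utterance):
        text = utterance.strip().rstrip(".!?")
        match = re.match(r"my name is (.+)$", text, re.IGNORECASE)
        if match:
            # Plays the role of a set tag: store the captured name.
            self.predicates["name"] = match.group(1)
            return f"Nice to meet you, {self.predicates['name']}."
        if re.match(r"what is my name$", text, re.IGNORECASE):
            # Plays the role of a get tag: read the stored name back.
            name = self.predicates.get("name")
            return f"Your name is {name}." if name else "You have not told me your name."
        return "Tell me more."
session = Session()
print(session.respond("My name is Ada."))
print(session.respond("What is my name?"))
```
Because the stored value lives in the session object, a later turn can use what an
earlier turn provided.
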
670
00:33:49.559 --> 00:33:52.599
<v Speaker 1>So even simpler chatbots use these structured ways to manage

671
00:33:52.640 --> 00:33:54.079
<v Speaker 1>dialogue.

672
00:33:54.400 --> 00:33:58.640
<v Speaker 2>Exactly. It provides a framework for building reasonably interactive conversational flows

673
00:33:58.920 --> 00:34:01.400
<v Speaker 2>without needing deep learning AI.

674
00:34:01.839 --> 00:34:04.559
<v Speaker 1>Okay, so let's wrap this up. What does this all

675
00:34:04.599 --> 00:34:08.000
<v Speaker 1>mean for us? You know, the people using this tech

676
00:34:08.039 --> 00:34:11.440
<v Speaker 1>every day. It feels like from deciphering single words and

677
00:34:11.480 --> 00:34:15.760
<v Speaker 1>their messy meanings, to understanding complex sentences, finding topics in

678
00:34:15.840 --> 00:34:19.960
<v Speaker 1>mountains of text, and even building these interactive chatbots. NLP

679
00:34:20.119 --> 00:34:23.920
<v Speaker 1>is just fundamentally changing how we interact with computers and information.

680
00:34:24.159 --> 00:34:25.320
<v Speaker 3>It absolutely is.

681
00:34:25.400 --> 00:34:28.079
<v Speaker 2>The progress has been stunning, especially in the last decade

682
00:34:28.159 --> 00:34:28.360
<v Speaker 2>or so.

683
00:34:28.480 --> 00:34:30.239
<v Speaker 1>It's really incredible to see how far it's come.

684
00:34:30.119 --> 00:34:33.039
<v Speaker 2>Which really leaves us with, well, a pretty

685
00:34:33.039 --> 00:34:36.360
<v Speaker 2>fascinating question to think about, doesn't it? Given how much

686
00:34:36.360 --> 00:34:40.679
<v Speaker 2>of our own language, its subjectivity, its ambiguity, its reliance

687
00:34:40.719 --> 00:34:44.400
<v Speaker 2>on shared context, remains challenging even for us humans, how

688
00:34:44.480 --> 00:34:47.239
<v Speaker 2>much further can computers really go? Can they move beyond

689
00:34:47.280 --> 00:34:50.159
<v Speaker 2>just processing language to truly understanding it with all its

690
00:34:50.280 --> 00:34:53.599
<v Speaker 2>nuance and implicit meaning? And if they can, or as

691
00:34:53.599 --> 00:34:57.239
<v Speaker 2>they get closer, what kinds of new applications might emerge?

692
00:34:57.559 --> 00:35:00.960
<v Speaker 2>How might our relationship with technology change and morph as

693
00:35:01.000 --> 00:35:04.159
<v Speaker 2>these tools become even more sophisticated at grasping the real

694
00:35:04.239 --> 00:35:05.400
<v Speaker 2>depth of human intention.

695
00:35:05.639 --> 00:35:07.960
<v Speaker 1>Wow, that's definitely something to chew on. What is the

696
00:35:08.039 --> 00:35:12.400
<v Speaker 1>limit of machine understanding of something so fundamentally human as language?

697
00:35:12.519 --> 00:35:16.880
<v Speaker 2>Exactly. The possibilities, and perhaps the challenges, seem genuinely boundless.
