WEBVTT

1
00:00:00.160 --> 00:00:02.759
<v Speaker 1>Welcome to the deep dive, where we slice through the

2
00:00:02.759 --> 00:00:07.240
<v Speaker 1>information clutter to bring you the clearest, most important insights. Today,

3
00:00:07.280 --> 00:00:09.400
<v Speaker 1>we're taking a bit of a shortcut to becoming well

4
00:00:09.400 --> 00:00:14.039
<v Speaker 1>informed about a really powerful tool in natural language processing, or NLP.

5
00:00:14.279 --> 00:00:18.039
<v Speaker 2>It's called spaCy. That's right, and it's an interesting one.

6
00:00:18.079 --> 00:00:20.800
<v Speaker 2>If you think of those huge language models, you know,

7
00:00:20.879 --> 00:00:24.440
<v Speaker 2>like ChatGPT, maybe, as a big, powerful food processor. Okay,

8
00:00:24.600 --> 00:00:29.399
<v Speaker 2>then spaCy is more like your practical, really well-optimized

9
00:00:29.440 --> 00:00:33.200
<v Speaker 2>kitchen knife. It's a library that's specifically designed to help

10
00:00:33.240 --> 00:00:34.560
<v Speaker 2>you get actual work done.

11
00:00:34.560 --> 00:00:36.280
<v Speaker 1>So moving beyond just theory.

12
00:00:36.320 --> 00:00:40.039
<v Speaker 2>Exactly, beyond just academic concepts and into efficient, practical application.

13
00:00:40.719 --> 00:00:43.320
<v Speaker 2>And we're going to uncover some surprising depth today. I

14
00:00:43.359 --> 00:00:46.719
<v Speaker 2>think, from, you know, basic text processing right up to

15
00:00:46.799 --> 00:00:48.439
<v Speaker 2>integrating with the latest AI stuff.

16
00:00:48.560 --> 00:00:51.679
<v Speaker 1>Sounds good. And our mission for this deep dive is

17
00:00:51.719 --> 00:00:56.240
<v Speaker 1>basically to give you a comprehensive but still really accessible

18
00:00:56.560 --> 00:00:59.280
<v Speaker 1>understanding of what spaCy can do. We're drawing from quite

19
00:00:59.280 --> 00:01:05.480
<v Speaker 1>a few sources, including the excellent book Mastering spaCy. Okay,

20
00:01:05.920 --> 00:01:08.599
<v Speaker 1>let's unpack this. So to kick us off, what's the

21
00:01:08.799 --> 00:01:12.400
<v Speaker 1>absolute core thing our listeners should get about spaCy?

22
00:01:12.079 --> 00:01:15.519
<v Speaker 2>Well, at its heart, spaCy is this incredibly fast open

23
00:01:15.519 --> 00:01:20.920
<v Speaker 2>source Python library, and it's really built for production-ready

24
00:01:21.079 --> 00:01:22.200
<v Speaker 2>NLP applications.

25
00:01:22.280 --> 00:01:24.200
<v Speaker 1>Production ready. That sounds important.

26
00:01:24.280 --> 00:01:26.599
<v Speaker 2>It is. A lot of its speed comes from using

27
00:01:26.640 --> 00:01:30.079
<v Speaker 2>Cython for the really performance-critical bits, so it's highly

28
00:01:30.079 --> 00:01:33.000
<v Speaker 2>optimized but still easy to use within Python.

29
00:01:33.120 --> 00:01:35.799
<v Speaker 1>Aha. So it's not just another, like, academic toolset.

30
00:01:35.840 --> 00:01:37.840
<v Speaker 1>It's built for real-world stuff from the get-go.

31
00:01:37.799 --> 00:01:40.719
<v Speaker 2>Precisely. That's a key difference compared to maybe

32
00:01:40.760 --> 00:01:43.879
<v Speaker 2>something like NLTK, the Natural Language Toolkit, which historically at

33
00:01:43.920 --> 00:01:47.159
<v Speaker 2>was often more focused on students and researchers. With spaCy, you're

34
00:01:47.239 --> 00:01:48.480
<v Speaker 2>hitting the ground running for deployment.

35
00:01:48.799 --> 00:01:50.519
<v Speaker 1>You mentioned it's built to get work done. Is that

36
00:01:50.599 --> 00:01:53.040
<v Speaker 1>like the official philosophy?

37
00:01:53.680 --> 00:01:56.719
<v Speaker 2>Pretty much. Ines Montani, one of the core creators, often talks about this.

38
00:01:57.120 --> 00:01:59.799
<v Speaker 2>The goal is genuinely to help people do their work efficiently.

39
00:02:00.280 --> 00:02:03.040
<v Speaker 2>They're not trying to build some massive do everything system.

40
00:02:03.159 --> 00:02:06.599
<v Speaker 2>Oh, okay. It's more about providing these sharp, reliable tools,

41
00:02:06.680 --> 00:02:09.960
<v Speaker 2>like that knife, to fit nicely into whatever you're already doing.

42
00:02:10.120 --> 00:02:13.680
<v Speaker 1>Got it. And getting started, is it complex?

43
00:02:15.000 --> 00:02:17.919
<v Speaker 2>Not really. It works with modern Python and runs on, you know,

44
00:02:17.960 --> 00:02:21.039
<v Speaker 2>the usual operating systems, Windows, Mac, Linux.

45
00:02:21.039 --> 00:02:24.319
<v Speaker 1>And best practice is probably virtual environments, right?

46
00:02:24.439 --> 00:02:27.400
<v Speaker 2>Keep things clean. Oh, absolutely, always a good idea for

47
00:02:27.439 --> 00:02:29.680
<v Speaker 2>any Python project. Keeps your dependencies sorted.

48
00:02:29.960 --> 00:02:33.039
<v Speaker 1>Now you mentioned something important. The language models aren't built

49
00:02:33.039 --> 00:02:33.719
<v Speaker 1>in, correct?

50
00:02:33.800 --> 00:02:37.639
<v Speaker 2>That's a key point. spaCy itself is the framework, the tools,

51
00:02:38.240 --> 00:02:40.919
<v Speaker 2>but for the statistical smarts, things like tagging parts of

52
00:02:40.960 --> 00:02:43.800
<v Speaker 2>speech or finding named entities, you need to download a

53
00:02:43.840 --> 00:02:45.319
<v Speaker 2>language model separately.

54
00:02:45.039 --> 00:02:48.280
<v Speaker 1>Like en_core_web_sm, that kind of thing? Exactly, like en_core_web_sm

55
00:02:48.319 --> 00:02:49.280
<v Speaker 1>for English.

56
00:02:49.319 --> 00:02:52.639
<v Speaker 2>There's this quick command-line thing, python -m spacy download

57
00:02:52.639 --> 00:02:56.360
<v Speaker 2>en_core_web_sm. That downloads the small English model, gets you

58
00:02:56.400 --> 00:02:57.680
<v Speaker 2>the core pipeline components.

59
00:02:58.000 --> 00:02:59.879
<v Speaker 1>Okay, and once you've got that, how do you sort

60
00:02:59.879 --> 00:03:01.319
<v Speaker 1>of see what it's doing?

61
00:03:01.639 --> 00:03:04.639
<v Speaker 2>Oh, well, that's where displaCy comes in. It's spaCy's built

62
00:03:04.680 --> 00:03:08.280
<v Speaker 2>in visualization tool, and it's fantastic. How so? It just

63
00:03:08.360 --> 00:03:14.080
<v Speaker 2>makes really complex linguistic concepts much easier to grasp visually.

64
00:03:14.560 --> 00:03:17.840
<v Speaker 2>You can see dependency parses, how words connect, or see

65
00:03:18.159 --> 00:03:20.879
<v Speaker 2>named entities highlighted right in the text. It helps you

66
00:03:20.960 --> 00:03:22.319
<v Speaker 2>spot patterns almost instantly.

67
00:03:22.120 --> 00:03:24.199
<v Speaker 1>So you can actually see the analysis.

68
00:03:24.280 --> 00:03:26.759
<v Speaker 2>Yeah, you can try it online. There's a demo, or

69
00:03:26.800 --> 00:03:29.199
<v Speaker 2>you can run it locally from your code. Even in

70
00:03:29.280 --> 00:03:32.319
<v Speaker 2>Jupyter notebooks. It's super helpful for understanding what's going on

71
00:03:32.319 --> 00:03:32.879
<v Speaker 2>under the hood.

72
00:03:32.919 --> 00:03:37.520
<v Speaker 1>Okay, so setup done, model downloaded, visualization ready. Let's

73
00:03:37.520 --> 00:03:39.879
<v Speaker 1>talk about the core processing. You mentioned a pipeline.

74
00:03:39.919 --> 00:03:41.960
<v Speaker 2>Yeah, think of it like an NLP assembly line.

75
00:03:42.120 --> 00:03:44.919
<v Speaker 2>When you load a model, say using spacy.load,

76
00:03:45.360 --> 00:03:48.319
<v Speaker 2>you get back this nlp object. Right. And when you

77
00:03:48.360 --> 00:03:51.680
<v Speaker 2>feed text into that object, like doc = nlp("This is

78
00:03:51.680 --> 00:03:54.360
<v Speaker 2>some text"), it runs the text through a sequence of

79
00:03:54.400 --> 00:03:55.280
<v Speaker 2>processing steps.

80
00:03:55.319 --> 00:03:57.479
<v Speaker 1>The pipeline components. Exactly.

81
00:03:57.520 --> 00:04:01.800
<v Speaker 2>The default pipeline usually includes a tokenizer, a tagger

82
00:04:01.840 --> 00:04:05.159
<v Speaker 2>for part of speech, a dependency parser for sentence structure,

83
00:04:05.719 --> 00:04:10.120
<v Speaker 2>and an entity recognizer, or NER component. Each does

84
00:04:10.159 --> 00:04:11.319
<v Speaker 2>its specific job.

85
00:04:10.879 --> 00:04:13.680
<v Speaker 1>And the output is this Doc object.

86
00:04:13.919 --> 00:04:16.800
<v Speaker 2>Right. The doc object holds the result. It's not just

87
00:04:16.839 --> 00:04:19.720
<v Speaker 2>the text, it's the text broken down into tokens, and

88
00:04:19.800 --> 00:04:22.519
<v Speaker 2>each token is enriched with all the linguistic features found

89
00:04:22.519 --> 00:04:23.160
<v Speaker 2>by the pipeline.

90
00:04:23.240 --> 00:04:26.240
<v Speaker 1>Let's break down that pipeline. First up, tokenization and sentence

91
00:04:26.279 --> 00:04:29.160
<v Speaker 1>segmentation. Sounds simple, just splitting words?

92
00:04:29.120 --> 00:04:32.199
<v Speaker 2>Ah, well, it's a bit more nuanced than just splitting on spaces.

93
00:04:32.519 --> 00:04:36.319
<v Speaker 2>Tokenization is breaking the text into its smallest meaningful parts,

94
00:04:36.720 --> 00:04:40.800
<v Speaker 2>the tokens. Words, numbers, punctuation, they all become tokens. Okay.

95
00:04:40.879 --> 00:04:44.639
<v Speaker 2>But here's a surprising detail. Unlike most other pipeline components,

96
00:04:44.879 --> 00:04:47.680
<v Speaker 2>the default tokenizer doesn't rely on a statistical model.

97
00:04:47.759 --> 00:04:48.959
<v Speaker 1>Oh, what does it use?

98
00:04:49.079 --> 00:04:52.680
<v Speaker 2>It uses really carefully crafted language-specific rules, which makes

99
00:04:52.720 --> 00:04:55.240
<v Speaker 2>it very fast and predictable. And you can even customize it.

100
00:04:55.279 --> 00:04:57.439
<v Speaker 2>You can add special cases like telling it how to

101
00:04:57.480 --> 00:04:59.480
<v Speaker 2>handle slang or specific abbreviations.

102
00:04:59.560 --> 00:05:02.199
<v Speaker 1>Like teaching it that "lemme" should be "lem" and "me"? Exactly.

103
00:05:02.240 --> 00:05:04.600
<v Speaker 2>That kind of thing. It gives you fine-grained control.
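
NOTE
Editor's sketch, not from the episode: one way the tokenizer special case
discussed above might look. It assumes spaCy v3 is installed and uses a blank
English pipeline so no model download is needed. The NORM value "let" is an
illustrative choice, not something the hosts specified.
```python
import spacy
# Blank English pipeline: rule-based tokenizer only, no trained components.
nlp = spacy.blank("en")
# The ORTH pieces must join back into the original string exactly ("lem" + "me").
# NORM is optional; here it records "let" as the normalized form of "lem".
nlp.tokenizer.add_special_case("lemme", [{"ORTH": "lem", "NORM": "let"}, {"ORTH": "me"}])
doc = nlp("lemme see that")
tokens = [token.text for token in doc]
print(tokens)
```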

104
00:05:04.360 --> 00:05:08.839
<v Speaker 1>And sentence segmentation, finding sentence boundaries?

105
00:05:08.639 --> 00:05:13.600
<v Speaker 2>That's actually often more complex than tokenization. Think about abbreviations like

106
00:05:13.720 --> 00:05:19.279
<v Speaker 2>"Mr." or complex punctuation. spaCy has a unique approach here.

107
00:05:19.360 --> 00:05:22.920
<v Speaker 2>What's that? It often uses the dependency parser, which understands

108
00:05:22.920 --> 00:05:26.439
<v Speaker 2>sentence structure to help figure out sentence boundaries really accurately.

109
00:05:26.480 --> 00:05:28.399
<v Speaker 2>It's quite a sophisticated design choice.

110
00:05:28.399 --> 00:05:32.480
<v Speaker 1>Interesting. Okay, next step: lemmatization, getting the root word. Yep.

111
00:05:32.639 --> 00:05:35.439
<v Speaker 2>The lemma is the base or dictionary form. So like

112
00:05:35.480 --> 00:05:38.160
<v Speaker 2>you said, eating, eats, eat, ate, they all boil down

113
00:05:38.160 --> 00:05:39.240
<v Speaker 2>to the lemma eat.

114
00:05:39.240 --> 00:05:40.959
<v Speaker 1>How useful is that in practice?

115
00:05:41.079 --> 00:05:44.680
<v Speaker 2>Oh, incredibly useful. Think about a chatbot for booking flights.

116
00:05:45.079 --> 00:05:47.120
<v Speaker 2>A user might say I want to fly, or show

117
00:05:47.160 --> 00:05:48.959
<v Speaker 2>me flights or I flew yesterday.

118
00:05:49.079 --> 00:05:52.040
<v Speaker 1>Right, different forms of the same core idea. Exactly.

119
00:05:52.160 --> 00:05:55.920
<v Speaker 2>Lemmatization reduces fly, flights, flew all down to fly, so

120
00:05:56.000 --> 00:05:58.040
<v Speaker 2>your system only needs to look for that one base

121
00:05:58.120 --> 00:06:02.560
<v Speaker 2>form to understand the core intent. It simplifies things massively.

122
00:06:02.199 --> 00:06:04.279
<v Speaker 1>Makes sense, and you could use it for other things too,

123
00:06:04.360 --> 00:06:05.399
<v Speaker 1>like place names.

124
00:06:05.439 --> 00:06:08.839
<v Speaker 2>Definitely. Maybe someone says Angeltown when they mean Los Angeles.

125
00:06:09.160 --> 00:06:11.879
<v Speaker 2>You can actually add custom rules using something called an

126
00:06:11.879 --> 00:06:14.920
<v Speaker 2>AttributeRuler to map Angeltown to the canonical Los

127
00:06:14.959 --> 00:06:19.120
<v Speaker 2>Angeles lemma during processing. That ensures consistency.
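
NOTE
Editor's sketch, not from the episode: a minimal way the Angeltown mapping
might be expressed with the attribute ruler. It assumes spaCy v3 and a blank
English pipeline. In a trained pipeline you would more likely fetch the
existing component with nlp.get_pipe("attribute_ruler") instead of adding one.
```python
import spacy
nlp = spacy.blank("en")
# The attribute ruler maps Matcher-style token patterns to attribute overrides.
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"LOWER": "angeltown"}]], attrs={"LEMMA": "Los Angeles"})
doc = nlp("I want to fly to Angeltown")
# Only the matched token gets a lemma here, because the blank pipeline has no lemmatizer.
place_lemma = doc[-1].lemma_
print(place_lemma)
```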

128
00:06:18.600 --> 00:06:22.759
<v Speaker 1>So spaCy processes the text, applies these steps and stores

129
00:06:22.800 --> 00:06:26.600
<v Speaker 1>the results. You mentioned container objects: Doc, Token, Span.

130
00:06:26.560 --> 00:06:29.480
<v Speaker 2>Right, these are your main ways of accessing the processed information.

131
00:06:30.000 --> 00:06:33.480
<v Speaker 2>The doc object represents the whole processed text. Okay, if

132
00:06:33.519 --> 00:06:35.480
<v Speaker 2>you loop over a doc, like for token in doc,

133
00:06:35.959 --> 00:06:38.000
<v Speaker 2>you get individual token objects.

134
00:06:37.639 --> 00:06:39.839
<v Speaker 1>And each token knows things about itself.

135
00:06:39.560 --> 00:06:42.120
<v Speaker 2>Loads of things. A token object holds the original word,

136
00:06:42.160 --> 00:06:45.439
<v Speaker 2>its lemma, its part-of-speech tag, its dependency relation.

137
00:06:45.839 --> 00:06:48.480
<v Speaker 2>It also has boolean flags like token.is_punct,

138
00:06:48.560 --> 00:06:51.519
<v Speaker 2>token.is_currency, token.like_url,

139
00:06:51.600 --> 00:06:52.480
<v Speaker 2>token.like_num. Wow.

140
00:06:52.560 --> 00:06:54.399
<v Speaker 1>Okay, So you can check if a token looks like

141
00:06:54.439 --> 00:06:56.560
<v Speaker 1>a URL or a number easily. Yep.

142
00:06:56.759 --> 00:06:58.879
<v Speaker 2>And it knows its entity type if it's part of

143
00:06:58.920 --> 00:07:02.319
<v Speaker 2>one, like token.ent_type_ might be PERSON or ORG.

144
00:07:02.879 --> 00:07:05.959
<v Speaker 2>It even has a token.shape_ attribute that gives

145
00:07:06.000 --> 00:07:08.600
<v Speaker 2>you a kind of abstract representation of the word's orthography,

146
00:07:08.720 --> 00:07:11.040
<v Speaker 2>like is it capitalized, is it all digits, et cetera.

147
00:07:11.360 --> 00:07:13.439
<v Speaker 2>Really useful for rule-based matching.
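
NOTE
Editor's sketch, not from the episode: a quick look at the lexical flags and
the shape attribute mentioned above. These attributes need no trained model,
so a blank English pipeline is assumed. The sample sentence is made up.
```python
import spacy
nlp = spacy.blank("en")
doc = nlp("Visit https://spacy.io for $5 today!")
# Index tokens by their text so each flag can be inspected by name.
by_text = {token.text: token for token in doc}
for token in doc:
    print(token.text, token.is_punct, token.is_currency, token.like_url, token.like_num, token.shape_)
```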

148
00:07:13.279 --> 00:07:14.680
<v Speaker 1>And Span? Where does that fit in?

149
00:07:15.079 --> 00:07:17.680
<v Speaker 2>A Span is just a slice of the Doc representing

150
00:07:17.720 --> 00:07:21.240
<v Speaker 2>multiple tokens. Sentences are span objects. You can get them

151
00:07:21.319 --> 00:07:24.920
<v Speaker 2>via doc.sents. Named entities are also Span objects,

152
00:07:25.040 --> 00:07:29.319
<v Speaker 2>accessible via doc.ents. So Doc, Token, Span are

153
00:07:29.399 --> 00:07:31.279
<v Speaker 2>how you navigate and use the processed text.

154
00:07:31.000 --> 00:07:33.519
<v Speaker 1>Got it. Let's move into some of those linguistic

155
00:07:33.560 --> 00:07:40.040
<v Speaker 1>features. Part-of-speech tagging, POS tagging. That's identifying nouns, verbs, adjectives?

156
00:07:39.480 --> 00:07:43.120
<v Speaker 2>Exactly, categorizing words by their grammatical role in the sentence.

157
00:07:43.199 --> 00:07:45.000
<v Speaker 1>And how does spaCy figure that out? Is it

158
00:07:45.040 --> 00:07:46.079
<v Speaker 1>just a dictionary look up?

159
00:07:46.160 --> 00:07:48.000
<v Speaker 2>Oh no, it's much smarter than that. It looks at

160
00:07:48.000 --> 00:07:51.439
<v Speaker 2>the word in context. The surrounding words heavily influence the tag.

161
00:07:51.920 --> 00:07:56.959
<v Speaker 2>It uses sequential statistical models trained on large amounts of text.

162
00:07:56.680 --> 00:07:58.319
<v Speaker 1>So the same word could get different tags.

163
00:07:58.639 --> 00:08:02.079
<v Speaker 2>Absolutely. Think of the word "book." "I read a book,"

164
00:08:02.519 --> 00:08:05.839
<v Speaker 2>noun, versus "I want to book a flight," verb. The

165
00:08:05.920 --> 00:08:08.199
<v Speaker 2>context tells the tagger which role it's playing.

166
00:08:08.720 --> 00:08:11.160
<v Speaker 1>And why is this useful beyond just grammar?

167
00:08:11.839 --> 00:08:14.959
<v Speaker 2>Well, it's really important for understanding meaning, especially for word

168
00:08:15.000 --> 00:08:18.279
<v Speaker 2>sense disambiguation, figuring out which meaning of a word is intended.

169
00:08:18.439 --> 00:08:19.399
<v Speaker 1>Can you give an example?

170
00:08:19.560 --> 00:08:22.319
<v Speaker 2>Sure, take the word beat. It can mean many things,

171
00:08:22.639 --> 00:08:25.040
<v Speaker 2>But if the POS tagger confidently tags it as an

172
00:08:25.079 --> 00:08:28.800
<v Speaker 2>adjective, ADJ, as in "I'm totally beat," you know it

173
00:08:28.839 --> 00:08:31.040
<v Speaker 2>almost certainly means exhausted. Ah.

174
00:08:31.079 --> 00:08:33.720
<v Speaker 1>I see. The tag helps narrow down the meaning.

175
00:08:33.559 --> 00:08:36.519
<v Speaker 2>Precisely, even if the verb or noun tags might still

176
00:08:36.519 --> 00:08:39.919
<v Speaker 2>be ambiguous, "beat the drum" versus "follow the beat," the

177
00:08:40.000 --> 00:08:43.080
<v Speaker 2>adjective tag is often quite specific. It adds a layer

178
00:08:43.120 --> 00:08:46.200
<v Speaker 2>of understanding, even if lemmatization kind of flattens out things

179
00:08:46.279 --> 00:08:46.960
<v Speaker 2>like verb tense.

180
00:08:47.080 --> 00:08:50.799
<v Speaker 1>Okay, that makes sense. Next up, dependency parsing. This sounds

181
00:08:50.840 --> 00:08:53.639
<v Speaker 1>a bit more complex. Mapping sentence relationships.

182
00:08:53.720 --> 00:08:57.600
<v Speaker 2>It is complex but incredibly powerful. Dependency parsing represents the

183
00:08:57.600 --> 00:09:01.000
<v Speaker 2>grammatical structure of a sentence not just as a flat sequence,

184
00:09:01.159 --> 00:09:03.600
<v Speaker 2>but as a tree of relationships. It shows how words

185
00:09:03.600 --> 00:09:04.279
<v Speaker 2>depend on each other.

186
00:09:04.200 --> 00:09:06.799
<v Speaker 1>Head and dependent? Exactly.

187
00:09:06.679 --> 00:09:09.799
<v Speaker 2>Each word, except usually the main verb, the root, has a

188
00:09:09.840 --> 00:09:12.840
<v Speaker 2>head word it modifies or relates to, and a specific

189
00:09:12.919 --> 00:09:17.559
<v Speaker 2>dependency label describes that relationship, like nsubj for nominal subject,

190
00:09:17.679 --> 00:09:20.840
<v Speaker 2>or dobj for direct object. Why go to all this trouble? Well,

191
00:09:20.879 --> 00:09:24.120
<v Speaker 2>what's fascinating here is that sentences aren't just sequences of tokens.

192
00:09:24.360 --> 00:09:27.840
<v Speaker 2>They have this deep, inherent structure, and understanding that structure

193
00:09:27.879 --> 00:09:32.000
<v Speaker 2>is absolutely crucial for many real world NLP tasks, like

194
00:09:32.080 --> 00:09:36.159
<v Speaker 2>what? Think about chatbots or machine translation. You need

195
00:09:36.200 --> 00:09:39.559
<v Speaker 2>to know who did what to whom. Consider "I forwarded

196
00:09:39.600 --> 00:09:42.279
<v Speaker 2>you the email" versus "You forwarded me the email."

197
00:09:42.440 --> 00:09:44.440
<v Speaker 1>Same words, totally different meaning. Exactly.

198
00:09:44.679 --> 00:09:47.879
<v Speaker 2>Dependency parsing helps the system figure out that "I" is

199
00:09:47.879 --> 00:09:50.120
<v Speaker 2>the subject, the one doing the forwarding, in the first sentence,

200
00:09:50.320 --> 00:09:53.279
<v Speaker 2>and "you" is the subject in the second. It disambiguates

201
00:09:53.320 --> 00:09:58.399
<v Speaker 2>the roles based on the grammatical structure: nsubj, dobj, iobj relationships.
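
NOTE
Editor's sketch, not from the episode: how the subject might be read off a
dependency parse for the two "forwarded" sentences. To stay runnable without
a trained model, the parses below are hand-annotated, which is an assumption
standing in for real parser output. The labels (nsubj, dobj, dative) follow
the scheme spaCy's English pipelines are generally documented to use, and
subject_of is a hypothetical helper name.
```python
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
def build(words, heads, deps):
    # Hand-annotated parse standing in for a trained parser's output.
    return Doc(nlp.vocab, words=words, heads=heads, deps=deps)
def subject_of(doc):
    # Return the text of the first nominal subject attached to the root verb.
    for token in doc:
        if token.dep_ == "nsubj" and token.head.dep_ == "ROOT":
            return token.text
    return None
deps = ["nsubj", "ROOT", "dative", "det", "dobj"]
heads = [1, 1, 1, 4, 1]  # absolute index of each token's head; the root points at itself
first = build(["I", "forwarded", "you", "the", "email"], heads, deps)
second = build(["You", "forwarded", "me", "the", "email"], heads, deps)
print(subject_of(first), subject_of(second))
```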

202
00:09:58.840 --> 00:10:02.159
<v Speaker 2>Without that, understanding user intent would be much, much harder.

203
00:10:02.000 --> 00:10:05.159
<v Speaker 1>Right, that makes the importance clear. Okay, what about

204
00:10:05.519 --> 00:10:09.519
<v Speaker 1>named entity recognition, NER? Spotting real-world objects?

205
00:10:09.799 --> 00:10:12.840
<v Speaker 2>Yep. A named entity is basically anything that can be

206
00:10:12.879 --> 00:10:16.080
<v Speaker 2>referred to with a proper name or a quantity. So

207
00:10:16.399 --> 00:10:21.480
<v Speaker 2>people's names, company names, locations, dates, monetary values, percentages.

208
00:10:21.879 --> 00:10:27.360
<v Speaker 1>The categories seem pretty standard: PERSON, ORG, GPE for geopolitical entity.

209
00:10:27.440 --> 00:10:30.200
<v Speaker 2>Those are common ones, yes, but the specific set of

210
00:10:30.320 --> 00:10:33.000
<v Speaker 2>entity types is actually quite flexible and often depends on

211
00:10:33.039 --> 00:10:35.120
<v Speaker 2>the data the model was trained on or the

212
00:10:35.159 --> 00:10:38.399
<v Speaker 2>specific task you have in mind. How so? Well, if

213
00:10:38.399 --> 00:10:42.960
<v Speaker 2>you're analyzing financial news, entities like money and percentage might

214
00:10:43.000 --> 00:10:46.360
<v Speaker 2>be way more important and frequent than say, work of art.

215
00:10:46.840 --> 00:10:49.480
<v Speaker 2>The model needs to be tailored or chosen based on

216
00:10:49.519 --> 00:10:50.000
<v Speaker 2>the domain.

217
00:10:50.159 --> 00:10:52.000
<v Speaker 1>And how good is NER these days?

218
00:10:51.759 --> 00:10:54.200
<v Speaker 2>It's gotten incredibly good. The state of the art methods

219
00:10:54.240 --> 00:10:58.039
<v Speaker 2>often use those transformer architectures we mentioned earlier. They're very

220
00:10:58.039 --> 00:11:01.480
<v Speaker 2>effective at understanding context to identify entities accurately.

221
00:11:01.679 --> 00:11:05.919
<v Speaker 1>Okay, And sometimes the default tokenization or entity spans might

222
00:11:05.960 --> 00:11:07.759
<v Speaker 1>not be quite right. Can you fix them?

223
00:11:08.039 --> 00:11:10.919
<v Speaker 2>Yes, absolutely. spaCy provides a really neat mechanism called

224
00:11:11.000 --> 00:11:14.279
<v Speaker 2>doc.retokenize. It lets you merge multiple tokens into one,

225
00:11:14.639 --> 00:11:16.399
<v Speaker 2>or split a single token into several.

226
00:11:16.440 --> 00:11:17.519
<v Speaker 1>Why would you need to do that?

227
00:11:17.759 --> 00:11:19.799
<v Speaker 2>Well, maybe an entity like New York City got split

228
00:11:19.840 --> 00:11:22.200
<v Speaker 2>into three tokens, but you want to treat it as

229
00:11:22.240 --> 00:11:25.120
<v Speaker 2>a single unit for analysis, you can merge them. Or

230
00:11:25.120 --> 00:11:28.399
<v Speaker 2>maybe a typo resulted in San Francisco being one token

231
00:11:28.639 --> 00:11:29.519
<v Speaker 2>and you want to split it.

232
00:11:29.639 --> 00:11:33.240
<v Speaker 1>Ah, okay. So for cleanup and normalization.

233
00:11:33.120 --> 00:11:36.919
<v Speaker 2>Exactly. Merging is usually simpler. Splitting can be a bit

234
00:11:36.919 --> 00:11:39.360
<v Speaker 2>more involved because spaCy then needs to figure out the

235
00:11:39.480 --> 00:11:42.840
<v Speaker 2>linguistic features and dependencies for the new tokens you've created.

236
00:11:43.039 --> 00:11:45.919
<v Speaker 2>But it's a very powerful tool for practical adjustments.
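
NOTE
Editor's sketch, not from the episode: merging and splitting with the
retokenizer, using the New York City and San Francisco examples from the
conversation. It assumes spaCy v3 and a blank English pipeline. The head
assignments in the split are illustrative choices.
```python
import spacy
nlp = spacy.blank("en")
# Merge: treat "New York City" as one token.
doc = nlp("I live in New York City")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6], attrs={"LEMMA": "New York City"})
merged = [token.text for token in doc]
# Split: break the typo "SanFrancisco" into two tokens.
doc2 = nlp("I live in SanFrancisco")
with doc2.retokenize() as retokenizer:
    # Each new token needs a head: "San" attaches to the second subtoken,
    # and "Francisco" attaches to the existing token "in".
    heads = [(doc2[3], 1), doc2[2]]
    retokenizer.split(doc2[3], ["San", "Francisco"], heads=heads)
split_tokens = [token.text for token in doc2]
print(merged, split_tokens)
```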

237
00:11:46.200 --> 00:11:50.159
<v Speaker 1>Let's shift gears slightly to rule-based matching. You mentioned

238
00:11:50.240 --> 00:11:53.799
<v Speaker 1>regular expressions can be tricky. What's spaCy's alternative?

239
00:11:54.320 --> 00:11:58.200
<v Speaker 2>spaCy offers the Matcher class, and it's designed to be, well,

240
00:11:58.200 --> 00:12:02.399
<v Speaker 2>a much cleaner, more readable, and definitely more maintainable alternative

241
00:12:02.639 --> 00:12:05.120
<v Speaker 2>for finding patterns in text compared to regex.

242
00:12:05.320 --> 00:12:06.720
<v Speaker 1>Why is regex problematic?

243
00:12:06.919 --> 00:12:09.559
<v Speaker 2>Regular expressions can just become incredibly dense and hard to read,

244
00:12:10.000 --> 00:12:13.559
<v Speaker 2>especially for complex patterns. They're also easy to get subtly wrong,

245
00:12:13.799 --> 00:12:16.039
<v Speaker 2>which can lead to bugs that are hard to track down,

246
00:12:16.399 --> 00:12:17.679
<v Speaker 2>and they operate purely on strings.

247
00:12:17.600 --> 00:12:19.799
<v Speaker 1>And the Matcher is different how?

248
00:12:19.759 --> 00:12:22.840
<v Speaker 2>The Matcher works with Token objects and their attributes. You

249
00:12:22.919 --> 00:12:26.399
<v Speaker 2>define patterns not as strings, but as lists of dictionaries,

250
00:12:26.639 --> 00:12:28.960
<v Speaker 2>where each dictionary specifies the attributes a token must have.

251
00:12:29.000 --> 00:12:31.360
<v Speaker 1>Like LOWER to match the word

252
00:12:31.360 --> 00:12:32.919
<v Speaker 1>"hello" regardless of case?

253
00:12:32.360 --> 00:12:36.600
<v Speaker 2>Precisely. Or IS_PUNCT: True to match any

254
00:12:36.600 --> 00:12:40.759
<v Speaker 2>punctuation mark, or LIKE_NUM: True for number-like tokens.

255
00:12:41.039 --> 00:12:44.240
<v Speaker 2>You're matching based on linguistic features, not just character sequences.
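
NOTE
Editor's sketch, not from the episode: a small Matcher pattern built from the
three attributes just mentioned (LOWER, IS_PUNCT, LIKE_NUM). It assumes spaCy
v3 and a blank English pipeline. The rule name and sample text are made up
for illustration.
```python
import spacy
from spacy.matcher import Matcher
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# One dict per token: "hello" in any case, any punctuation mark, then a number-like token.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LIKE_NUM": True}]
matcher.add("GREETING_NUMBER", [pattern])
doc = nlp("HELLO, 42 times. Hello world.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```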

256
00:12:44.320 --> 00:12:46.000
<v Speaker 1>That sounds much more robust.

257
00:12:45.799 --> 00:12:48.600
<v Speaker 2>It is, and you can use extended syntax too. You

258
00:12:48.639 --> 00:12:51.440
<v Speaker 2>can match based on token length with LENGTH, check if a

259
00:12:51.480 --> 00:12:54.399
<v Speaker 2>token is in a list with IN, or use boolean

260
00:12:54.480 --> 00:12:58.519
<v Speaker 2>flags like IS_DIGIT, IS_ALPHA, IS_UPPER. Great for finding, say,

261
00:12:58.720 --> 00:13:00.360
<v Speaker 2>emphasized words in all caps.

262
00:13:00.480 --> 00:13:03.480
<v Speaker 1>Does it have regex-like operators, like optional parts?

263
00:13:03.639 --> 00:13:06.399
<v Speaker 2>Yes, you can use operators like "?" to make a

264
00:13:06.440 --> 00:13:09.840
<v Speaker 2>token pattern optional. Think about matching names like Barack Obama

265
00:13:09.919 --> 00:13:13.159
<v Speaker 2>but also Barack Hussein Obama. The middle name token can

266
00:13:13.159 --> 00:13:17.039
<v Speaker 2>be marked as optional, and you have operators like "+"

267
00:13:17.399 --> 00:13:21.480
<v Speaker 2>for one or more and "*" for zero or more, for specifying occurrences,

268
00:13:21.519 --> 00:13:25.320
<v Speaker 2>similar to regex. There's even a really useful online demo

269
00:13:25.639 --> 00:13:28.320
<v Speaker 2>on the spaCy website where you can build and test

270
00:13:28.399 --> 00:13:29.799
<v Speaker 2>matcher patterns interactively.
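
NOTE
Editor's sketch, not from the episode: the optional-token operator and the
extended syntax described above, assuming spaCy v3 and a blank English
pipeline. The rule names OBAMA and SHOUT are hypothetical labels.
```python
import spacy
from spacy.matcher import Matcher
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "?" makes the middle-name token optional.
matcher.add("OBAMA", [[{"LOWER": "barack"}, {"LOWER": "hussein", "OP": "?"}, {"LOWER": "obama"}]])
# All-caps words of at least two characters, via IS_UPPER and an extended LENGTH comparison.
matcher.add("SHOUT", [[{"IS_UPPER": True, "LENGTH": {">=": 2}}]])
doc = nlp("Barack Obama and Barack Hussein Obama were GREAT, I think")
found = {(nlp.vocab.strings[match_id], doc[start:end].text) for match_id, start, end in matcher(doc)}
print(sorted(found))
```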

271
00:13:29.919 --> 00:13:32.799
<v Speaker 1>Okay, that covers matching specific patterns. What if you have

272
00:13:32.919 --> 00:13:35.799
<v Speaker 1>like a huge list of things to find, say thousands

273
00:13:35.840 --> 00:13:37.200
<v Speaker 1>of product names?

274
00:13:37.200 --> 00:13:41.039
<v Speaker 2>Right. Creating individual Matcher patterns for thousands of specific phrases would

275
00:13:41.039 --> 00:13:44.039
<v Speaker 2>be, well, not very efficient or practical.

276
00:13:44.200 --> 00:13:45.519
<v Speaker 1>So what's the solution for that?

277
00:13:46.000 --> 00:13:49.960
<v Speaker 2>spaCy provides the PhraseMatcher. It's optimized specifically for efficiently

278
00:13:50.000 --> 00:13:53.720
<v Speaker 2>scanning text against large lists of multi word phrases or dictionaries.

279
00:13:53.840 --> 00:13:54.559
<v Speaker 1>How does that work?

280
00:13:54.639 --> 00:13:56.799
<v Speaker 2>You give it a list of doc objects representing the

281
00:13:56.840 --> 00:14:00.000
<v Speaker 2>phrases you want to find, like Angela Merkel, Donald Trump,

282
00:14:00.159 --> 00:14:03.720
<v Speaker 2>Alexis Tsipras. It then uses a really efficient algorithm to

283
00:14:03.799 --> 00:14:07.200
<v Speaker 2>find all occurrences of those exact phrases in your target text,

284
00:14:07.759 --> 00:14:10.679
<v Speaker 2>much faster than running thousands of individual rules.
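
NOTE
Editor's sketch, not from the episode: a PhraseMatcher over the three names
mentioned above, assuming spaCy v3 and a blank English pipeline. The rule
name POLITICIAN and the sample sentence are illustrative.
```python
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Angela Merkel", "Donald Trump", "Alexis Tsipras"]
# make_doc only tokenizes, which is all the PhraseMatcher needs for exact-text matching.
matcher.add("POLITICIAN", [nlp.make_doc(term) for term in terms])
doc = nlp("Angela Merkel met Donald Trump and later Alexis Tsipras.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```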

285
00:14:10.759 --> 00:14:14.080
<v Speaker 1>Very useful for terminology lists or gazetteers. Exactly.

286
00:14:14.200 --> 00:14:17.000
<v Speaker 2>And it can even match based on token attributes, not

287
00:14:17.039 --> 00:14:19.799
<v Speaker 2>just the exact words. For instance, you could match based

288
00:14:19.799 --> 00:14:22.519
<v Speaker 2>on the shape attribute, which is handy for finding structured

289
00:14:22.600 --> 00:14:26.279
<v Speaker 2>data like IP addresses or specific code patterns in log files,

290
00:14:26.519 --> 00:14:28.519
<v Speaker 2>even if the exact digits change.
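
NOTE
Editor's sketch, not from the episode: shape-based phrase matching for IP
addresses, assuming spaCy v3 and a blank English pipeline. The example
addresses are made up. Only the digit layout of the pattern docs matters,
not the digits themselves.
```python
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.blank("en")
# attr="SHAPE" compares token shapes (for example ddd.ddd.d.d) instead of exact text.
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", [nlp.make_doc("127.0.0.1"), nlp.make_doc("127.127.0.0")])
doc = nlp("The router answers on 192.168.1.1 and the backup on 192.168.2.1 today")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```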

291
00:14:28.840 --> 00:14:31.360
<v Speaker 3>So you have the matcher for flexible patterns and phrase

292
00:14:31.360 --> 00:14:35.200
<v Speaker 3>matcher for large lists. How do you integrate these findings

293
00:14:35.279 --> 00:14:37.960
<v Speaker 3>back into the main spaCy Doc? That's where the

294
00:14:38.039 --> 00:14:40.559
<v Speaker 3>SpanRuler comes in. It's a pipeline component that lets you

295
00:14:40.679 --> 00:14:44.200
<v Speaker 3>use rules, defined very similarly to Matcher patterns, to

296
00:14:44.360 --> 00:14:47.720
<v Speaker 3>directly add Span objects to your Doc. Add them where? To

297
00:14:47.879 --> 00:14:49.879
<v Speaker 3>doc.spans. You can configure it to add them

298
00:14:49.879 --> 00:14:52.799
<v Speaker 3>to doc.ents, so effectively adding rule-based named entities.

299
00:14:53.039 --> 00:14:54.519
<v Speaker 3>Or you can have it add them to a custom

300
00:14:54.559 --> 00:14:57.200
<v Speaker 3>span group like doc.spans["my_custom_patterns"].

301
00:14:57.159 --> 00:14:59.279
<v Speaker 1>So you add it to the pipeline like other components?

302
00:14:58.919 --> 00:15:02.120
<v Speaker 2>Yep, nlp.add_pipe("span_ruler"). Then you

303
00:15:02.120 --> 00:15:04.519
<v Speaker 2>provide it with your patterns. For example, you could define

304
00:15:04.559 --> 00:15:07.200
<v Speaker 2>a pattern to find every instance of the word chime

305
00:15:07.600 --> 00:15:09.600
<v Speaker 2>and label it as an ORG entity.

306
00:15:09.759 --> 00:15:13.759
<v Speaker 1>What if the regular NER model also finds entities? Do

307
00:15:13.840 --> 00:15:14.679
<v Speaker 1>they clash?

308
00:15:14.759 --> 00:15:18.919
<v Speaker 2>Good question. You can configure the span ruler. You can

309
00:15:18.960 --> 00:15:23.000
<v Speaker 2>tell it whether your rule based entities should overwrite entities

310
00:15:23.080 --> 00:15:27.679
<v Speaker 2>found by the statistical NER model, overwrite=True, or not,

311
00:15:28.279 --> 00:15:31.039
<v Speaker 2>overwrite=False. You can also set it up so that

312
00:15:31.080 --> 00:15:35.039
<v Speaker 2>statistical entities don't overwrite your rule-based ones. It gives you

313
00:15:35.080 --> 00:15:37.720
<v Speaker 2>control over which source of entities takes precedence.
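
NOTE
Editor's sketch, not from the episode: adding a span ruler that labels
"chime" as ORG and also writes to doc.ents. It assumes a spaCy v3 release
recent enough to ship the span_ruler component, and a blank English
pipeline, so there is no statistical NER here to clash with. "ruler" is the
component's default spans key.
```python
import spacy
nlp = spacy.blank("en")
# annotate_ents=True also writes matches to doc.ents; overwrite=True lets the
# rule-based spans replace any existing entities.
ruler = nlp.add_pipe("span_ruler", config={"annotate_ents": True, "overwrite": True})
ruler.add_patterns([{"label": "ORG", "pattern": [{"LOWER": "chime"}]}])
doc = nlp("I have an account with Chime since 2017")
ents = [(ent.text, ent.label_) for ent in doc.ents]
spans = [(span.text, span.label_) for span in doc.spans["ruler"]]
print(ents, spans)
```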

314
00:15:37.840 --> 00:15:41.399
<v Speaker 1>Okay, this rule based stuff seems really practical. Can we

315
00:15:41.440 --> 00:15:45.600
<v Speaker 1>talk about some specific recipes like real world extraction examples?

316
00:15:45.799 --> 00:15:49.000
<v Speaker 2>Absolutely, here's where it gets really interesting, showing spaCy's power.

317
00:15:49.519 --> 00:15:53.000
<v Speaker 2>So you can easily build patterns to extract things like IBANs,

318
00:15:53.080 --> 00:15:56.759
<v Speaker 2>international bank account numbers, or phone numbers, these highly structured

319
00:15:56.840 --> 00:15:57.519
<v Speaker 2>numeric things.

320
00:15:57.559 --> 00:15:58.279
<v Speaker 1>Okay, what else?

321
00:15:58.320 --> 00:16:01.919
<v Speaker 2>Think about social media. You could create patterns to find mentions

322
00:16:02.000 --> 00:16:05.799
<v Speaker 2>expressing opinions, like matching the sequence: business name, plus is, was,

323
00:16:05.879 --> 00:16:08.840
<v Speaker 2>or be, plus maybe an adverb, plus an adjective.

324
00:16:08.519 --> 00:16:10.639
<v Speaker 1>Like finding "Cafe X was really great."

325
00:16:10.480 --> 00:16:14.200
<v Speaker 2>Exactly. That pattern structure, business name, verb, adverb, adjective, could

326
00:16:14.200 --> 00:16:16.879
<v Speaker 2>pick up "Cafe X is good," "Cafe Y was very slow,"

327
00:16:17.120 --> 00:16:20.399
<v Speaker 2>"Restaurant Z will be amazing." Helps you gauge sentiment.

328
00:16:20.720 --> 00:16:21.600
<v Speaker 1>Clever. Other examples?

329
00:16:21.759 --> 00:16:24.919
<v Speaker 2>Hashtags are easy. You can match the hashtag symbol followed

330
00:16:24.919 --> 00:16:28.320
<v Speaker 2>by tokens that meet certain criteria, like IS_ASCII or IS_ALPHA,

331
00:16:28.879 --> 00:16:32.080
<v Speaker 2>to reliably pull out things like #deeplearning or

332
00:16:32.159 --> 00:16:33.279
<v Speaker 2>#weekendfun.

333
00:16:33.559 --> 00:16:36.039
<v Speaker 1>And what about slightly more complex entities?

334
00:16:36.399 --> 00:16:40.159
<v Speaker 2>You can even use patterns to refine entities. For example,

335
00:16:40.200 --> 00:16:42.639
<v Speaker 2>maybe the NER just picks up "Smith" as a person.

336
00:16:43.279 --> 00:16:45.000
<v Speaker 2>You could use a Matcher pattern to look for

337
00:16:45.039 --> 00:16:48.440
<v Speaker 2>a preceding title like Mr., Ms., or Dr., and then

338
00:16:48.519 --> 00:16:51.840
<v Speaker 2>retokenize to merge the title and the name into a single,

339
00:16:52.080 --> 00:16:54.399
<v Speaker 2>more complete entity span, "Ms. Smith."

340
00:16:54.559 --> 00:16:57.159
<v Speaker 1>Wow. Okay, that's quite granular control.

341
00:16:56.720 --> 00:16:59.799
<v Speaker 2>It really is. These rule-based tools, combined with

342
00:16:59.840 --> 00:17:02.039
<v Speaker 2>the linguistic features, give you a lot of power for

343
00:17:02.159 --> 00:17:03.639
<v Speaker 2>precise information extraction.

344
00:17:04.000 --> 00:17:07.160
<v Speaker 1>Let's push deeper now into understanding meaning and intent. How

345
00:17:07.200 --> 00:17:09.759
<v Speaker 1>does spaCy help with semantic parsing, figuring out what a

346
00:17:09.880 --> 00:17:11.799
<v Speaker 1>user actually wants?

347
00:17:11.599 --> 00:17:13.880
<v Speaker 2>A great way to explore this is with datasets like ATIS,

348
00:17:13.960 --> 00:17:17.759
<v Speaker 2>the Airline Travel Information System. It contains thousands of

349
00:17:17.799 --> 00:17:19.400
<v Speaker 2>real user requests about flights.

350
00:17:19.119 --> 00:17:22.440
<v Speaker 1>Like, "show me flights from Boston to Denver"?

351
00:17:22.680 --> 00:17:25.880
<v Speaker 2>Exactly. Or "what's the cheapest flight?" "What meals are served on

352
00:17:25.960 --> 00:17:31.079
<v Speaker 2>flight X?" Analyzing these requires understanding not just the words,

353
00:17:31.559 --> 00:17:32.720
<v Speaker 2>but the underlying goal.

354
00:17:33.079 --> 00:17:35.000
<v Speaker 1>Where do you even start with something like that?

355
00:17:35.200 --> 00:17:38.359
<v Speaker 2>Well, a really crucial first step, honestly, is just looking

356
00:17:38.400 --> 00:17:41.799
<v Speaker 2>at the data yourself. Read through a sample of the utterances,

357
00:17:42.240 --> 00:17:44.640
<v Speaker 2>get a feel for the common patterns, the types of

358
00:17:44.759 --> 00:17:47.039
<v Speaker 2>entities involved, the grammar people use.

359
00:17:47.240 --> 00:17:49.480
<v Speaker 1>What kind of things would you look for in the

360
00:17:49.799 --> 00:17:50.559
<v Speaker 1>ATIS data?

361
00:17:51.000 --> 00:17:55.440
<v Speaker 2>You'd quickly notice people specifying origins and destinations. But it's

362
00:17:55.440 --> 00:17:58.240
<v Speaker 2>not enough just to spot Boston and Denver. You need

363
00:17:58.279 --> 00:18:01.720
<v Speaker 2>to capture the relationship: from Boston, to Denver. You'd see

364
00:18:01.720 --> 00:18:05.519
<v Speaker 2>the importance of prepositions like from, to, in. Those little

365
00:18:05.519 --> 00:18:07.400
<v Speaker 2>words carry a lot of semantic weight.

366
00:18:06.960 --> 00:18:09.440
<v Speaker 1>So you need more than just finding keywords.

367
00:18:09.559 --> 00:18:12.759
<v Speaker 2>Definitely, you need to understand the relationships between the words.

368
00:18:13.200 --> 00:18:15.200
<v Speaker 2>And that's where spaCy's DependencyMatcher comes in.

369
00:18:15.119 --> 00:18:17.279
<v Speaker 1>Another matcher? How's this one different?

370
00:18:17.400 --> 00:18:19.920
<v Speaker 2>Well, the Matcher looks for sequences of tokens based on

371
00:18:19.960 --> 00:18:23.519
<v Speaker 2>their attributes. The DependencyMatcher looks for patterns based on

372
00:18:23.559 --> 00:18:26.319
<v Speaker 2>the syntactic dependency relationships between tokens.

373
00:18:26.599 --> 00:18:29.880
<v Speaker 1>Ah, using that dependency parse tree we talked about earlier.

374
00:18:29.559 --> 00:18:33.359
<v Speaker 2>Precisely. It lets you find patterns like a verb connected

375
00:18:33.359 --> 00:18:37.079
<v Speaker 2>to a noun with a direct object relationship, dobj.

376
00:18:38.440 --> 00:18:40.319
<v Speaker 2>This is key for identifying intent.

377
00:18:40.599 --> 00:18:43.400
<v Speaker 1>Can you give a quick linguistic primer on that? Objects?

378
00:18:43.519 --> 00:18:47.079
<v Speaker 2>Sure. So, very basically, you have transitive verbs, which need

379
00:18:47.119 --> 00:18:50.519
<v Speaker 2>an object to act upon, like "I bought flowers," where flowers

380
00:18:50.640 --> 00:18:53.640
<v Speaker 2>is the direct object, and intransitive verbs, which don't,

381
00:18:54.079 --> 00:18:57.880
<v Speaker 2>like "I slept." And sometimes there's an indirect object too,

382
00:18:57.920 --> 00:18:59.799
<v Speaker 2>like "I gave him the book": book is direct, him

383
00:18:59.799 --> 00:19:03.519
<v Speaker 2>is indirect. The DependencyMatcher lets you specify these relationships

384
00:19:03.519 --> 00:19:04.319
<v Speaker 2>in your patterns.

385
00:19:04.519 --> 00:19:08.240
<v Speaker 1>How does that help find intent in the flight examples?

386
00:19:08.000 --> 00:19:10.319
<v Speaker 2>Well, you could define a pattern looking for a verb

387
00:19:10.480 --> 00:19:13.799
<v Speaker 2>like show or find that has a direct object, dobj,

388
00:19:14.119 --> 00:19:18.640
<v Speaker 2>like flights. That pattern, defined using dependency relations, would match

389
00:19:19.000 --> 00:19:21.880
<v Speaker 2>"show me flights," "find flights," "I need you to show flights," etc.,

390
00:19:22.440 --> 00:19:25.039
<v Speaker 2>capturing the core intent regardless of the exact phrasing.
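
NOTE
Editor's sketch of the verb plus direct object pattern described above, assuming spaCy 3.x. To stay runnable without downloading a trained model, the parse for one sample utterance is written by hand; in a real pipeline the parser would supply the heads and dependency labels. The pattern name SHOW_OBJECT and the sample words are illustrative assumptions.
```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
nlp = spacy.blank("en")
# Hand-written parse: heads are absolute token indices, deps are dependency labels.
words = ["show", "me", "flights", "from", "Boston"]
heads = [0, 0, 0, 2, 3]
deps = ["ROOT", "dative", "dobj", "prep", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)
pattern = [
    # Anchor token: the verb "show" or "find".
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LOWER": {"IN": ["show", "find"]}}},
    # A direct child of the verb whose dependency label is dobj.
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("SHOW_OBJECT", [pattern])
# Each match carries token indices in the same order as the pattern entries.
pairs = [(doc[v].text, doc[o].text) for _, (v, o) in matcher(doc)]
```
Because the pattern follows the tree rather than word order, the intervening "me" does not get in the way of pairing the verb with its object.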

391
00:19:25.160 --> 00:19:27.839
<v Speaker 1>That seems much more robust than just keyword spotting.

392
00:19:28.119 --> 00:19:30.599
<v Speaker 2>It is, and you can build more complex patterns. What

393
00:19:30.640 --> 00:19:34.160
<v Speaker 2>if someone says, "show all flights and fares"? The

394
00:19:34.160 --> 00:19:38.599
<v Speaker 2>DependencyMatcher can use the conjunct (conj) dependency link between flights and

395
00:19:38.720 --> 00:19:42.559
<v Speaker 2>fares to recognize that the user has two related intents

396
00:19:42.599 --> 00:19:43.640
<v Speaker 2>connected by and.

397
00:19:44.039 --> 00:19:48.160
<v Speaker 1>Okay, that's powerful, But this raises a question. Once you've

398
00:19:48.279 --> 00:19:51.519
<v Speaker 1>used these matchers to figure out the intent, say book flight,

399
00:19:51.880 --> 00:19:53.799
<v Speaker 1>how do you store that information with the doc?

400
00:19:54.240 --> 00:19:56.799
<v Speaker 2>Great question. You don't want that information just floating around.

401
00:19:57.039 --> 00:20:01.079
<v Speaker 2>spaCy has a mechanism for this: extension attributes.

402
00:20:01.119 --> 00:20:04.599
<v Speaker 2>You can define your own custom attributes on Doc, Token,

403
00:20:04.759 --> 00:20:07.480
<v Speaker 2>or Span objects. So you could create an attribute called,

404
00:20:07.519 --> 00:20:10.920
<v Speaker 2>say, doc._.intent. The underscore indicates it's a custom

405
00:20:10.920 --> 00:20:12.000
<v Speaker 2>extension.

406
00:20:11.839 --> 00:20:13.160
<v Speaker 1>And how do you set that attribute?

407
00:20:13.279 --> 00:20:16.880
<v Speaker 2>You typically create a custom spaCy pipeline component. You use a

408
00:20:16.880 --> 00:20:20.400
<v Speaker 2>special decorator, @Language.factory, to define it. Inside

409
00:20:20.440 --> 00:20:24.000
<v Speaker 2>this component's __call__ method, which processes the doc, you'd run

410
00:20:24.000 --> 00:20:27.079
<v Speaker 2>your Matcher or DependencyMatcher, figure out the intent, and

411
00:20:27.119 --> 00:20:29.079
<v Speaker 2>then set doc._.intent to "book_flight."

412
00:20:29.079 --> 00:20:31.839
<v Speaker 1>So you can tailor the pipeline to extract

413
00:20:31.880 --> 00:20:34.039
<v Speaker 1>and store exactly what you need.

414
00:20:34.079 --> 00:20:38.200
<v Speaker 2>Exactly. It makes spaCy incredibly flexible and extensible for specific tasks.
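
NOTE
Editor's sketch of a custom component that stores an intent on the Doc, along the lines described above, assuming spaCy 3.x. The component name intent_tagger, the IntentTagger class, and the token pattern are hypothetical choices for illustration; a token Matcher stands in for the DependencyMatcher so that no trained model is needed.
```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc
# Register the custom attribute; force=True allows re-running this snippet.
Doc.set_extension("intent", default=None, force=True)
class IntentTagger:
    def __init__(self, vocab):
        self.matcher = Matcher(vocab)
        # "book", then any number of tokens, then "flight" or "flights".
        self.matcher.add("BOOK_FLIGHT", [[{"LOWER": "book"}, {"OP": "*"}, {"LOWER": {"IN": ["flight", "flights"]}}]])
    def __call__(self, doc):
        # Set the attribute only when the pattern matched; otherwise keep the default.
        if self.matcher(doc):
            doc._.intent = "book_flight"
        return doc
@Language.factory("intent_tagger")
def create_intent_tagger(nlp, name):
    return IntentTagger(nlp.vocab)
nlp = spacy.blank("en")
nlp.add_pipe("intent_tagger")
intent = nlp("Please book me a flight to Denver")._.intent
other = nlp("What meals are served?")._.intent
```
Downstream code can then read doc._.intent like any built-in attribute.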

415
00:20:38.799 --> 00:20:42.200
<v Speaker 1>Now we touched on performance earlier. What about processing large

416
00:20:42.279 --> 00:20:45.400
<v Speaker 1>datasets like the full ATIS corpus, with thousands

417
00:20:45.400 --> 00:20:48.799
<v Speaker 1>of utterances? Doing them one by one sounds slow.

418
00:20:49.079 --> 00:20:52.480
<v Speaker 2>It would be. Processing doc = nlp(text) for each of

419
00:20:52.519 --> 00:20:55.680
<v Speaker 2>the 4,978 utterances individually would take

420
00:20:55.759 --> 00:20:56.319
<v Speaker 2>quite a while.

421
00:20:56.440 --> 00:20:57.559
<v Speaker 1>So what's the efficient way?

422
00:20:57.720 --> 00:21:01.279
<v Speaker 2>The key is the nlp.pipe method, or

423
00:21:01.279 --> 00:21:03.000
<v Speaker 2>Language.pipe if you're using the base class.

424
00:21:03.079 --> 00:21:04.200
<v Speaker 1>How does pipe help?

425
00:21:04.400 --> 00:21:07.519
<v Speaker 2>It processes the texts as a stream and, crucially, it

426
00:21:07.559 --> 00:21:11.480
<v Speaker 2>buffers them internally and processes them in batches. This allows

427
00:21:11.480 --> 00:21:16.240
<v Speaker 2>spaCy to leverage optimizations and parallel processing much more effectively.

428
00:21:15.759 --> 00:21:17.799
<v Speaker 1>And the speed difference is noticeable?

429
00:21:17.440 --> 00:21:20.279
<v Speaker 2>Oh, absolutely, dramatic. The sources mention going from something like

430
00:21:20.319 --> 00:21:22.759
<v Speaker 2>27 seconds for processing the ATIS dataset one

431
00:21:22.759 --> 00:21:26.160
<v Speaker 2>by one down to under six seconds using nlp.pipe.

432
00:21:26.240 --> 00:21:29.359
<v Speaker 2>It's the standard way to process large volumes of text efficiently.
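
NOTE
Editor's sketch contrasting one-by-one calls with nlp.pipe, assuming spaCy 3.x. The three sample utterances and the batch size are illustrative assumptions; with a blank pipeline and so few texts there is no meaningful timing difference, so the sketch shows only the calling pattern, not the speed-up quoted above.
```python
import spacy
nlp = spacy.blank("en")
texts = ["show me flights from Boston to Denver", "what is the cheapest flight", "what meals are served on this flight"]
# One call per text: simple, but pays the per-call overhead every time.
docs_one_by_one = [nlp(text) for text in texts]
# nlp.pipe takes an iterable of texts and yields Doc objects, batching internally.
docs_batched = list(nlp.pipe(texts, batch_size=2))
token_counts = [len(doc) for doc in docs_batched]
```
Both approaches produce the same Doc contents in the same order; only the throughput differs on large inputs.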

433
00:21:29.480 --> 00:21:33.759
<v Speaker 1>Okay, essential for any real world application. Let's pivot now

434
00:21:33.799 --> 00:21:37.359
<v Speaker 1>to the really cutting edge stuff, transformers and large language

435
00:21:37.359 --> 00:21:41.519
<v Speaker 1>models, LLMs. The transformer architecture kind of kicked things off, right?

436
00:21:41.559 --> 00:21:43.519
<v Speaker 1>The "Attention Is All You Need" paper.

437
00:21:43.799 --> 00:21:47.759
<v Speaker 2>Yes, that 2017 paper was a landmark. Transformers really

438
00:21:47.799 --> 00:21:48.960
<v Speaker 2>revolutionized NLP.

439
00:21:49.200 --> 00:21:50.559
<v Speaker 1>What problem were they trying to solve?

440
00:21:50.759 --> 00:21:55.880
<v Speaker 2>Well, previous models like LSTMs processed text sequentially. This meant

441
00:21:55.880 --> 00:21:59.000
<v Speaker 2>they could struggle with long-range dependencies, forgetting information

442
00:21:59.079 --> 00:22:01.599
<v Speaker 2>from the beginning of a long text, and they weren't

443
00:22:01.599 --> 00:22:04.519
<v Speaker 2>easily parallelizable, which limited training speed.

444
00:22:04.640 --> 00:22:08.880
<v Speaker 1>And transformers fix this how? With attention?

445
00:22:09.440 --> 00:22:13.079
<v Speaker 2>Exactly. The core innovation is the self-attention mechanism, often implemented

446
00:22:13.079 --> 00:22:16.039
<v Speaker 2>in a multi-head attention block. Instead of just looking

447
00:22:16.079 --> 00:22:19.400
<v Speaker 2>at the immediately preceding words, self-attention allows the model

448
00:22:19.400 --> 00:22:21.559
<v Speaker 2>to weigh the importance of all words in the input

449
00:22:21.599 --> 00:22:25.079
<v Speaker 2>sequence when calculating the representation for a single word.

450
00:22:24.960 --> 00:22:26.720
<v Speaker 1>So it looks at the whole context at once?

451
00:22:27.200 --> 00:22:31.039
<v Speaker 2>Sort of, yeah. It calculates a word's embedding, its representation,

452
00:22:31.480 --> 00:22:34.640
<v Speaker 2>by taking a weighted average of the embeddings of all

453
00:22:34.680 --> 00:22:38.079
<v Speaker 2>other words in the sequence, where the weights, the attention

454
00:22:38.200 --> 00:22:42.799
<v Speaker 2>scores, indicate relevance. This lets it understand language much more

455
00:22:42.839 --> 00:22:43.920
<v Speaker 2>deeply in context.
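
NOTE
Editor's sketch of the weighted-average idea behind self-attention, in plain Python. It is deliberately simplified: real transformers apply learned query, key, and value projections and use several heads, and each word also attends to itself, which this sketch includes. The toy two-dimensional embeddings are assumptions for illustration only.
```python
import math
def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def self_attention(vectors):
    """Return one context-aware vector per input vector, plus the attention weights."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Scaled dot-product score of this vector against every vector in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # Weighted average of all vectors, using the attention weights.
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attention = self_attention(toy_embeddings)
```
Each row of attention sums to one, so every output vector is a blend of the whole sequence, weighted by relevance to the word in question.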

456
00:22:44.039 --> 00:22:47.440
<v Speaker 1>What was the big aha moment with this?

457
00:22:47.720 --> 00:22:51.079
<v Speaker 2>A major one was that transformers could generate dynamic word vectors.

458
00:22:51.720 --> 00:22:54.319
<v Speaker 2>Older methods like word2vec gave the same vector

459
00:22:54.319 --> 00:22:57.519
<v Speaker 2>for "bank" every time, but a transformer can understand the

460
00:22:57.559 --> 00:23:00.160
<v Speaker 2>context and give a different vector for "bank" in "river bank"

461
00:23:00.480 --> 00:23:02.039
<v Speaker 2>versus "bank" in "investment bank."

462
00:23:02.400 --> 00:23:04.759
<v Speaker 1>That's a huge leap in understanding nuance.

463
00:23:04.920 --> 00:23:08.480
<v Speaker 2>It really was. And libraries like Hugging Face's Transformers library

464
00:23:08.480 --> 00:23:11.279
<v Speaker 2>now provide access to literally thousands of these pre-trained

465
00:23:11.279 --> 00:23:12.240
<v Speaker 2>transformer models.

466
00:23:12.359 --> 00:23:15.240
<v Speaker 1>How does spaCy integrate with these? Can you use transformers

467
00:23:15.279 --> 00:23:16.640
<v Speaker 1>within a spaCy pipeline?

468
00:23:16.720 --> 00:23:20.799
<v Speaker 2>Yes, absolutely. A great example is text classification. Let's say

469
00:23:20.799 --> 00:23:24.000
<v Speaker 2>you want to classify Amazon product reviews as positive or negative.

470
00:23:24.440 --> 00:23:27.079
<v Speaker 2>You can use spaCy's TextCategorizer component.

471
00:23:26.799 --> 00:23:27.480
<v Speaker 1>Which is trainable?

472
00:23:27.599 --> 00:23:31.079
<v Speaker 2>Right, it's a trainable pipeline component. You'd prepare your training

473
00:23:31.160 --> 00:23:34.599
<v Speaker 2>data, the reviews labeled as positive or negative, using spaCy's

474
00:23:34.640 --> 00:23:38.680
<v Speaker 2>Example object, and then serialize it efficiently using DocBin.

475
00:23:38.960 --> 00:23:41.759
<v Speaker 1>How do you manage the training process itself?

476
00:23:42.400 --> 00:23:46.000
<v Speaker 2>spaCy has a really nice configuration system. Instead of hard-

477
00:23:46.039 --> 00:23:50.640
<v Speaker 2>coding parameters, you define everything, the pipeline components, model settings,

478
00:23:50.720 --> 00:23:55.279
<v Speaker 2>hyperparameters, data paths, in a single configuration file,

479
00:23:55.279 --> 00:23:55.839
<v Speaker 2>config.cfg.

480
00:23:56.079 --> 00:23:57.000
<v Speaker 1>Why is that better?

481
00:23:57.240 --> 00:24:00.279
<v Speaker 2>It makes your experiments incredibly reproducible. There are no

482
00:24:00.359 --> 00:24:03.240
<v Speaker 2>hidden defaults. Everything is explicit in the config file.

483
00:24:03.759 --> 00:24:05.920
<v Speaker 2>You can then train your pipeline directly from the command

484
00:24:06.000 --> 00:24:07.200
<v Speaker 2>line using spacy train.

485
00:24:07.799 --> 00:24:11.000
<v Speaker 1>And can you include a transformer in that pipeline for

486
00:24:11.160 --> 00:24:12.079
<v Speaker 1>text classification?

487
00:24:12.440 --> 00:24:15.960
<v Speaker 2>Yes, you can configure the pipeline to include a transformer component.

488
00:24:16.359 --> 00:24:19.599
<v Speaker 2>This component generates those context-aware embeddings we talked about,

489
00:24:19.759 --> 00:24:23.039
<v Speaker 2>which are then fed into the text categorizer. Often, adding

490
00:24:23.039 --> 00:24:27.000
<v Speaker 2>a transformer significantly boosts the accuracy of the classifier because

491
00:24:27.000 --> 00:24:29.880
<v Speaker 2>it has a richer understanding of the text's meaning and sentiment.

492
00:24:30.079 --> 00:24:33.640
<v Speaker 1>Okay, so let's name some names. What about famous transformer

493
00:24:33.680 --> 00:24:36.839
<v Speaker 1>models like BERT and RoBERTa? What makes them special?

494
00:24:37.200 --> 00:24:41.480
<v Speaker 2>Right. BERT, Bidirectional Encoder Representations from Transformers, was a

495
00:24:41.559 --> 00:24:45.359
<v Speaker 2>huge step. Its key innovation was being bidirectional during

496
00:24:45.440 --> 00:24:46.440
<v Speaker 2>pre-training.

497
00:24:46.319 --> 00:24:49.680
<v Speaker 1>Meaning it looked forwards and backwards in the text simultaneously?

498
00:24:49.759 --> 00:24:54.000
<v Speaker 2>Yeah. Previous models were often unidirectional or combined separate left-

499
00:24:54.000 --> 00:24:56.640
<v Speaker 2>to-right and right-to-left models. BERT used a

500
00:24:56.640 --> 00:25:00.920
<v Speaker 2>technique called masked language modeling, predicting hidden words, to learn

501
00:25:01.079 --> 00:25:04.240
<v Speaker 2>context from both directions at the same time. This gave

502
00:25:04.279 --> 00:25:06.119
<v Speaker 2>it a deeper understanding.

503
00:25:05.720 --> 00:25:08.160
<v Speaker 1>And it produced those dynamic word vectors?

504
00:25:08.279 --> 00:25:11.160
<v Speaker 2>Yes. It also used some special tokens, like [CLS] at

505
00:25:11.160 --> 00:25:13.640
<v Speaker 2>the beginning of sequences and [SEP] to separate sentences,

506
00:25:13.920 --> 00:25:17.000
<v Speaker 2>and it used WordPiece tokenization, breaking words into common

507
00:25:17.039 --> 00:25:20.680
<v Speaker 2>subword units, like "playing" might become "play" and "##ing."

508
00:25:21.519 --> 00:25:24.200
<v Speaker 2>This helps it handle large vocabularies and even words it

509
00:25:24.200 --> 00:25:25.839
<v Speaker 2>hasn't explicitly seen before.
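
NOTE
Editor's sketch of the greedy longest-match-first splitting commonly used for WordPiece-style tokenization at inference time, in plain Python. The tiny vocabulary is a made-up assumption; real BERT vocabularies hold tens of thousands of pieces, and this sketch omits details such as lowercasing and a maximum word length.
```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
pieces = wordpiece_tokenize("playing", toy_vocab)
unseen = wordpiece_tokenize("unplayable", toy_vocab)
```
The second call shows how a word that is absent from the vocabulary as a whole can still be represented through known pieces.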

510
00:25:25.960 --> 00:25:28.440
<v Speaker 1>What about RoBERTa? How did that improve on BERT?

511
00:25:28.720 --> 00:25:33.480
<v Speaker 2>RoBERTa, developed by Facebook AI, basically took the BERT architecture

512
00:25:33.680 --> 00:25:37.559
<v Speaker 2>and optimized the training procedure. They used things like dynamic masking,

513
00:25:37.920 --> 00:25:41.160
<v Speaker 2>changing the masked words during training, trained on much more

514
00:25:41.240 --> 00:25:44.839
<v Speaker 2>data for longer, and removed one of BERT's training objectives,

515
00:25:45.039 --> 00:25:49.079
<v Speaker 2>next sentence prediction, finding it didn't always help. These changes

516
00:25:49.119 --> 00:25:52.440
<v Speaker 2>generally led to better performance on downstream tasks compared to

517
00:25:52.480 --> 00:25:53.799
<v Speaker 2>the original BERT models.

518
00:25:53.880 --> 00:25:57.440
<v Speaker 1>Okay, so transformers are powerful pre-trained models. What about the

519
00:25:57.480 --> 00:26:00.200
<v Speaker 1>even bigger ones, the large language models, or LLMs? How

520
00:26:00.200 --> 00:26:00.799
<v Speaker 1>do they fit in?

521
00:26:01.160 --> 00:26:04.960
<v Speaker 2>LLMs are essentially an evolution, or maybe a scaling up,

522
00:26:05.440 --> 00:26:08.400
<v Speaker 2>of those pre-trained language models like BERT. We're talking

523
00:26:08.440 --> 00:26:11.240
<v Speaker 2>models with vastly more parameters, GPT-3 has 175

524
00:26:11.240 --> 00:26:14.119
<v Speaker 2>billion, for instance, trained on absolutely enormous

525
00:26:14.119 --> 00:26:14.880
<v Speaker 2>amounts of text and code.

526
00:26:14.720 --> 00:26:17.680
<v Speaker 1>And they can do, well, almost anything text-related?

527
00:26:17.720 --> 00:26:22.240
<v Speaker 2>They're incredibly versatile, yeah: translation, summarization, question answering, code generation,

528
00:26:22.400 --> 00:26:25.960
<v Speaker 2>creative writing. They've shown promise in specialized fields like medicine,

529
00:26:26.240 --> 00:26:27.440
<v Speaker 2>law, education too.

530
00:26:27.559 --> 00:26:30.319
<v Speaker 1>But they're not perfect, right, What are the downsides?

531
00:26:30.640 --> 00:26:34.200
<v Speaker 2>Definitely not perfect. There are key limitations. One is the

532
00:26:34.279 --> 00:26:38.599
<v Speaker 2>sheer computational cost. Training and even running them requires massive resources.

533
00:26:39.000 --> 00:26:41.880
<v Speaker 2>They can also be slower to generate responses compared to

534
00:26:41.920 --> 00:26:46.319
<v Speaker 2>smaller models, and crucially, they have this tendency to hallucinate.

535
00:26:46.440 --> 00:26:48.759
<v Speaker 1>Hallucinate, meaning they make stuff up?

536
00:26:49.160 --> 00:26:52.640
<v Speaker 2>Essentially, yes. They can generate responses that sound perfectly plausible

537
00:26:52.640 --> 00:26:56.640
<v Speaker 2>and grammatically correct, but are factually incorrect or nonsensical. They

538
00:26:56.680 --> 00:27:00.480
<v Speaker 2>don't inherently know things. They're predicting probable sequences of words.

539
00:27:00.079 --> 00:27:01.960
<v Speaker 1>So you need to be careful how you

540
00:27:02.079 --> 00:27:02.440
<v Speaker 1>use them.

541
00:27:02.640 --> 00:27:05.720
<v Speaker 2>Very careful. A lot of work goes into prompt engineering,

542
00:27:06.039 --> 00:27:09.920
<v Speaker 2>carefully crafting the input prompt to guide the LLM towards

543
00:27:09.920 --> 00:27:11.359
<v Speaker 2>the desired accurate output.

544
00:27:11.640 --> 00:27:14.319
<v Speaker 1>How does spaCy help manage interactions with LLMs?

545
00:27:14.359 --> 00:27:17.599
<v Speaker 2>There's a package called spacy-llm. It provides a structured way

546
00:27:17.599 --> 00:27:21.440
<v Speaker 2>to integrate LLMs into spaCy workflows. It treats interactions with

547
00:27:21.480 --> 00:27:25.519
<v Speaker 2>an LLM as defined tasks, like summarization or entity extraction.

548
00:27:25.359 --> 00:27:26.720
<v Speaker 1>And it uses prompts?

549
00:27:26.799 --> 00:27:29.039
<v Speaker 2>Yes, it uses Jinja templates to define the prompts for

550
00:27:29.079 --> 00:27:31.359
<v Speaker 2>these tasks. You can use built in tasks, or you

551
00:27:31.359 --> 00:27:32.839
<v Speaker 2>can define your own custom tasks.

552
00:27:32.920 --> 00:27:34.359
<v Speaker 1>Custom tasks like what?

553
00:27:34.480 --> 00:27:37.759
<v Speaker 2>For example, the sources mention creating a custom task to

554
00:27:37.880 --> 00:27:42.039
<v Speaker 2>extract specific quotes from a text and the surrounding context sentences.

555
00:27:42.440 --> 00:27:45.079
<v Speaker 2>You define the prompt template to ask the LLM for

556
00:27:45.160 --> 00:27:48.480
<v Speaker 2>this specific output, and you also define how to parse

557
00:27:48.559 --> 00:27:52.920
<v Speaker 2>the LLM's potentially messy response back into a structured format

558
00:27:52.920 --> 00:27:53.920
<v Speaker 2>that spaCy can use.

559
00:27:54.200 --> 00:27:57.880
<v Speaker 1>So spacy-llm provides a bridge and some structure for using

560
00:27:58.000 --> 00:28:00.559
<v Speaker 1>LLMs within a more controlled spaCy environment.

561
00:28:00.680 --> 00:28:04.440
<v Speaker 2>Exactly. It helps make using LLMs more systematic and reproducible.

562
00:28:04.759 --> 00:28:07.000
<v Speaker 1>Let's circle back to training your own models. We talked

563
00:28:07.039 --> 00:28:09.680
<v Speaker 1>about NER. When would you actually need to train a

564
00:28:09.720 --> 00:28:12.720
<v Speaker 1>custom NER model instead of using a pre-trained one?

565
00:28:12.880 --> 00:28:16.079
<v Speaker 2>That's a common question. The rule of thumb is, if

566
00:28:16.119 --> 00:28:20.039
<v Speaker 2>a pre-trained spaCy model like en_core_web_lg performs reasonably

567
00:28:20.039 --> 00:28:23.079
<v Speaker 2>well on your data, maybe gets say seventy five percent

568
00:28:23.119 --> 00:28:26.039
<v Speaker 2>accuracy or higher on the entities you care about, you

569
00:28:26.119 --> 00:28:27.440
<v Speaker 2>might not need full custom training.

570
00:28:27.519 --> 00:28:28.240
<v Speaker 1>What would you do then?

571
00:28:28.759 --> 00:28:31.839
<v Speaker 2>You could potentially fine-tune the existing model, or more often,

572
00:28:31.920 --> 00:28:34.960
<v Speaker 2>you'd use other spaCy components, like the Matcher or span ruler

573
00:28:35.000 --> 00:28:38.359
<v Speaker 2>we discussed, to add rules that catch the specific cases

574
00:28:38.440 --> 00:28:41.640
<v Speaker 2>the pre-trained model misses or gets wrong, kind of

575
00:28:41.680 --> 00:28:42.400
<v Speaker 2>like augmenting it.

576
00:28:42.759 --> 00:28:45.160
<v Speaker 1>But when is custom training unavoidable?

577
00:28:45.400 --> 00:28:50.160
<v Speaker 2>It's usually necessary when your domain has many important entity

578
00:28:50.200 --> 00:28:53.240
<v Speaker 2>types that are just completely absent from the pre-trained models.

579
00:28:53.920 --> 00:28:59.119
<v Speaker 2>Think about highly specialized fields: specific financial instruments, unique biological

580
00:28:59.160 --> 00:29:03.799
<v Speaker 2>gene names, custom product codes, very niche legal terms. If

581
00:29:03.839 --> 00:29:06.319
<v Speaker 2>the pre-trained model doesn't even know these categories exist,

582
00:29:06.759 --> 00:29:09.200
<v Speaker 2>rules alone won't cut it. You need to teach a

583
00:29:09.240 --> 00:29:12.359
<v Speaker 2>model from scratch or significantly fine-tune one.

584
00:29:12.359 --> 00:29:15.680
<v Speaker 1>And that involves getting data and labeling it?

585
00:29:15.960 --> 00:29:19.319
<v Speaker 2>Exactly. Data collection is the first step. Then comes annotation, manually

586
00:29:19.440 --> 00:29:23.319
<v Speaker 2>labeling examples of your text with the entities, parts of speech, dependencies,

587
00:29:23.359 --> 00:29:24.799
<v Speaker 2>whatever your model needs to learn.

588
00:29:24.920 --> 00:29:27.480
<v Speaker 1>Are there tools for that? Annotation sounds tedious?

589
00:29:27.640 --> 00:29:30.400
<v Speaker 2>It can be, but there are great tools. Prodigy, also

590
00:29:30.400 --> 00:29:32.759
<v Speaker 2>from the makers of Spacey, is a very modern annotation

591
00:29:32.839 --> 00:29:35.480
<v Speaker 2>tool that often uses active learning to be more efficient.

592
00:29:35.880 --> 00:29:38.359
<v Speaker 2>It suggests labels you confirm or correct. There are

593
00:29:38.400 --> 00:29:41.960
<v Speaker 2>also open-source options like Nertwig, which integrates with Jupyter Notebooks.

594
00:29:42.079 --> 00:29:44.559
<v Speaker 1>Okay, so you annotate your data using one of these tools.

595
00:29:44.599 --> 00:29:45.559
<v Speaker 1>Then what?

596
00:29:45.440 --> 00:29:49.599
<v Speaker 2>Then you convert that annotated data into spaCy's efficient binary format,

597
00:29:49.759 --> 00:29:52.839
<v Speaker 2>DocBin. You typically split your data into training and

598
00:29:52.839 --> 00:29:57.279
<v Speaker 2>evaluation sets. Then you use spaCy's training system, spacy train

599
00:29:57.319 --> 00:30:00.519
<v Speaker 2>with a config file, to train your custom NER component,

600
00:30:01.200 --> 00:30:04.440
<v Speaker 2>and finally use spacy evaluate to see how well your

601
00:30:04.480 --> 00:30:07.799
<v Speaker 2>trained model performs on the unseen evaluation data.

602
00:30:07.960 --> 00:30:12.240
<v Speaker 1>What's fascinating here is the possibility of combining models. Can

603
00:30:12.319 --> 00:30:16.720
<v Speaker 1>you use your custom-trained NER model alongside one of

604
00:30:16.799 --> 00:30:18.160
<v Speaker 1>spaCy's pre-trained ones?

605
00:30:18.240 --> 00:30:20.519
<v Speaker 2>Yes, and that's often a very powerful approach. You get

606
00:30:20.519 --> 00:30:21.480
<v Speaker 2>the best of both worlds.

607
00:30:21.519 --> 00:30:22.640
<v Speaker 1>How does that work technically?

608
00:30:22.839 --> 00:30:25.880
<v Speaker 2>First, you'd package your custom trained pipeline component, maybe the

609
00:30:25.920 --> 00:30:29.480
<v Speaker 2>one that recognizes fashion brand entities, into an installable Python

610
00:30:29.519 --> 00:30:31.640
<v Speaker 2>package using the spacy package command.

611
00:30:31.680 --> 00:30:34.440
<v Speaker 1>Okay, so it's like distributing your own mini model?

612
00:30:35.079 --> 00:30:39.240
<v Speaker 2>Exactly. Then you use another command, spacy assemble, with a special configuration file.

613
00:30:39.559 --> 00:30:41.720
<v Speaker 2>This config file tells spaCy how to build a new

614
00:30:41.759 --> 00:30:44.240
<v Speaker 2>pipeline by sourcing components from different places.

615
00:30:44.319 --> 00:30:46.720
<v Speaker 1>So you could say, take my custom fashion brand component

616
00:30:46.759 --> 00:30:49.880
<v Speaker 1>and also take the GPE location and money components from

617
00:30:49.920 --> 00:30:51.720
<v Speaker 1>the standard en_core_web_sm model?

618
00:30:51.960 --> 00:30:56.079
<v Speaker 2>Precisely. spacy assemble pulls these components together into a single,

619
00:30:56.480 --> 00:31:00.440
<v Speaker 2>unified pipeline that can recognize entities from both your

620
00:31:00.559 --> 00:31:04.359
<v Speaker 2>custom training and the general purpose pre trained model. It's

621
00:31:04.400 --> 00:31:08.279
<v Speaker 2>a very neat way to create highly specialized, yet broadly

622
00:31:08.319 --> 00:31:09.880
<v Speaker 2>capable NLP systems.

623
00:31:10.000 --> 00:31:13.319
<v Speaker 1>Very cool. Let's touch on entity linking. That's about connecting

624
00:31:13.400 --> 00:31:16.440
<v Speaker 1>mentions in text to actual entries in a knowledge base,

625
00:31:16.559 --> 00:31:18.359
<v Speaker 1>right? Disambiguating "Washington"?

626
00:31:18.160 --> 00:31:22.640
<v Speaker 2>Exactly. Is "Washington" referring to George Washington the person, Washington, DC,

627
00:31:22.839 --> 00:31:26.160
<v Speaker 2>the city, or Washington State? Entity linking aims to resolve

628
00:31:26.160 --> 00:31:29.079
<v Speaker 2>that ambiguity by linking the mention to a unique identifier,

629
00:31:29.319 --> 00:31:32.039
<v Speaker 2>often in a knowledge base like Wikidata or a custom

630
00:31:32.079 --> 00:31:33.000
<v Speaker 2>company database.

631
00:31:33.200 --> 00:31:34.519
<v Speaker 1>How does spaCy handle this?

632
00:31:34.759 --> 00:31:38.119
<v Speaker 2>spaCy has an EntityLinker component. Its architecture basically involves

633
00:31:38.119 --> 00:31:40.680
<v Speaker 2>three main parts. First, you need a knowledge base, a KB.

634
00:31:40.839 --> 00:31:41.519
<v Speaker 1>What's in the KB?

635
00:31:41.960 --> 00:31:44.200
<v Speaker 3>It stores information about the entities

636
00:31:43.839 --> 00:31:50.319
<v Speaker 2>you want to link to: their unique IDs, like Wikidata QIDs, names, descriptions,

637
00:31:50.319 --> 00:31:54.279
<v Speaker 2>and aliases. spaCy provides tools to create this, for example

638
00:31:54.319 --> 00:31:57.359
<v Speaker 2>an InMemoryLookupKB. You'd add entries for, say,

639
00:31:57.519 --> 00:32:01.119
<v Speaker 2>Taylor Swift, the singer, Taylor Lautner, the actor, Taylor Fritz,

640
00:32:01.160 --> 00:32:03.559
<v Speaker 2>the tennis player, each with its unique ID and maybe

641
00:32:03.599 --> 00:32:04.279
<v Speaker 2>a short description.

642
00:32:04.640 --> 00:32:06.480
<v Speaker 1>What else is needed besides the KB?

643
00:32:07.039 --> 00:32:10.160
<v Speaker 2>Second, you need a way to generate candidate entities from

644
00:32:10.200 --> 00:32:13.000
<v Speaker 2>the KB for a given mention in the text. If

645
00:32:13.000 --> 00:32:16.200
<v Speaker 2>the text says Taylor, the system needs to know that Swift, Lautner,

646
00:32:16.240 --> 00:32:19.759
<v Speaker 2>and Fritz are all potential candidates. You also add aliases

647
00:32:19.799 --> 00:32:23.079
<v Speaker 2>with prior probabilities. Maybe the alias Taylor Swift has a

648
00:32:23.079 --> 00:32:25.680
<v Speaker 2>one hundred percent probability of linking to the singer's ID,

649
00:32:26.160 --> 00:32:28.920
<v Speaker 2>while just Taylor has an equal chance for all three

650
00:32:28.960 --> 00:32:31.400
<v Speaker 2>initially. And the third part? The third part is a

651
00:32:31.440 --> 00:32:33.559
<v Speaker 2>machine learning model which is trained to look at the mention,

652
00:32:33.759 --> 00:32:36.440
<v Speaker 2>its context in the sentence, and the information about the

653
00:32:36.440 --> 00:32:39.480
<v Speaker 2>candidate entities from the KB and then predict the most

654
00:32:39.519 --> 00:32:42.359
<v Speaker 2>likely correct link, or predict NIL if none of the

655
00:32:42.400 --> 00:32:43.440
<v Speaker 2>candidates seem right.
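
NOTE
Editor's sketch, not from the episode: the first two parts (knowledge base and
candidate generation with prior probabilities) in plain Python. This is not
spaCy's API. The IDs "E1" to "E3" are placeholders, not real Wikidata QIDs, and
the prior-only picker is a naive baseline standing in for the trained model,
which would also weigh the sentence context.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class Entity:
    kb_id: str  # placeholder identifier, not a real Wikidata QID
    description: str
# Hypothetical knowledge base entries, keyed by ID.
ENTITIES = {
    "E1": Entity("E1", "Taylor Swift, singer"),
    "E2": Entity("E2", "Taylor Lautner, actor"),
    "E3": Entity("E3", "Taylor Fritz, tennis player"),
}
# alias mapped to (kb_id, prior probability) pairs; priors for one alias sum to at most 1.
ALIASES = {
    "Taylor Swift": [("E1", 1.0)],
    "Taylor": [("E1", 1 / 3), ("E2", 1 / 3), ("E3", 1 / 3)],
}
def get_candidates(mention: str) -> list[tuple[Entity, float]]:
    """Return candidate entities for a mention, highest prior first."""
    pairs = ALIASES.get(mention, [])
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [(ENTITIES[kb_id], prior) for kb_id, prior in ranked]
def link_by_prior(mention: str) -> str:
    """Pick the top-prior candidate, or "NIL" when the KB offers no candidate."""
    candidates = get_candidates(mention)
    # With equal priors the first listed entity wins, which is arbitrary:
    # this is exactly the case where a context-aware model is needed.
    return candidates[0][0].kb_id if candidates else "NIL"
print(link_by_prior("Taylor Swift"))
print([entity.description for entity, _ in get_candidates("Taylor")])
print(link_by_prior("Madonna"))
```
The ambiguous alias returns all three candidates with equal priors, so priors
alone cannot decide it; an unknown mention falls through to NIL.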

656
00:32:44.000 --> 00:32:46.079
<v Speaker 1>Does training this require special data?

657
00:32:46.440 --> 00:32:50.000
<v Speaker 2>Yes. When you train the entity linker component, your training

658
00:32:50.079 --> 00:32:54.160
<v Speaker 2>data needs to clearly specify which mentions should link to

659
00:32:54.200 --> 00:32:58.039
<v Speaker 2>which KB IDs. You often need a custom corpus reader

660
00:32:58.079 --> 00:33:00.759
<v Speaker 2>to handle this specific data format during training.

661
00:33:00.920 --> 00:33:03.920
<v Speaker 1>Okay, we've built all these amazing models and pipelines. How

662
00:33:03.920 --> 00:33:06.119
<v Speaker 1>do we actually put them into the hands of users

663
00:33:06.240 --> 00:33:11.160
<v Speaker 1>or other systems? Let's talk deployment: building apps and APIs.

664
00:33:10.920 --> 00:33:13.960
<v Speaker 2>Right, moving from the lab to the real world. Two popular Python

665
00:33:14.000 --> 00:33:17.200
<v Speaker 2>frameworks are great for this with spaCy. For building interactive

666
00:33:17.200 --> 00:33:20.119
<v Speaker 2>web applications quickly, especially if you're not a front-end expert,

667
00:33:20.200 --> 00:33:21.480
<v Speaker 2>Streamlit is fantastic.

668
00:33:21.599 --> 00:33:23.160
<v Speaker 1>Streamlit? How does it work?

669
00:33:23.240 --> 00:33:25.839
<v Speaker 2>It lets you build web apps purely in Python. You

670
00:33:25.880 --> 00:33:30.160
<v Speaker 2>can create widgets like text boxes (st.text_area) and buttons very easily.

671
00:33:30.480 --> 00:33:34.480
<v Speaker 2>There's even a specific package, spacy-streamlit, that provides ready-

672
00:33:34.519 --> 00:33:38.839
<v Speaker 2>made components to visualize spaCy's analysis, like NER results, directly

673
00:33:38.880 --> 00:33:40.279
<v Speaker 2>in your Streamlit app.

674
00:33:40.240 --> 00:33:43.599
<v Speaker 1>So you could build a quick demo tool for your spaCy pipeline?

675
00:33:43.799 --> 00:33:47.720
<v Speaker 2>Exactly. And a key feature is Streamlit's caching: @st.cache_resource

676
00:33:47.839 --> 00:33:50.640
<v Speaker 2>or @st.cache_data. This prevents your spaCy

677
00:33:50.640 --> 00:33:53.480
<v Speaker 2>models from having to reload every single time a user

678
00:33:53.519 --> 00:33:55.960
<v Speaker 2>interacts with the app, which makes it much faster and

679
00:33:56.039 --> 00:33:56.799
<v Speaker 2>more responsive.
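
NOTE
Editor's sketch, not from the episode: why caching the loader matters. Streamlit
reruns the script on each interaction, so an uncached loader would run every
time. Here functools.lru_cache from the standard library stands in for
Streamlit's caching decorator (in a real app, @st.cache_resource would play this
role), and a dict with a short sleep stands in for a loaded spaCy pipeline.
```python
import functools
import time
LOAD_COUNT = 0  # counts how often the expensive load actually runs
@functools.lru_cache(maxsize=None)
def load_pipeline(name: str) -> dict:
    """Stand-in for an expensive model load such as a spaCy pipeline."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.1)  # simulate slow loading
    return {"name": name}
def handle_interaction(text: str) -> str:
    """Simulate one script rerun triggered by a user interaction."""
    nlp = load_pipeline("en_core_web_sm")  # cached after the first call
    return f"{nlp['name']} processed {len(text.split())} words"
for text in ["first click", "second click here", "third"]:
    print(handle_interaction(text))
print("loads:", LOAD_COUNT)
```
Three simulated interactions trigger a single load; without the decorator the
counter would reach three.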

680
00:33:56.880 --> 00:33:59.039
<v Speaker 1>What if you need something more robust, like a back-

681
00:33:59.160 --> 00:34:01.519
<v Speaker 1>end API that other services can call?

682
00:34:01.599 --> 00:34:04.640
<v Speaker 2>Then FastAPI is an excellent choice. It's a modern, high-

683
00:34:04.640 --> 00:34:09.679
<v Speaker 2>performance Python framework specifically designed for building APIs.

684
00:34:09.320 --> 00:34:10.639
<v Speaker 1>What makes FastAPI good?

685
00:34:10.960 --> 00:34:14.039
<v Speaker 2>It's known for its speed. It also leverages Python type

686
00:34:14.119 --> 00:34:17.119
<v Speaker 2>hints heavily. You define the expected data types for your

687
00:34:17.159 --> 00:34:20.280
<v Speaker 2>API inputs and outputs, which FastAPI uses for automatic

688
00:34:20.360 --> 00:34:24.000
<v Speaker 2>data validation, catching errors early, and also for automatically generating

689
00:34:24.039 --> 00:34:27.119
<v Speaker 2>interactive API documentation using Swagger UI.

690
00:34:27.280 --> 00:34:29.280
<v Speaker 1>So it makes development faster and more reliable.

691
00:34:29.599 --> 00:34:33.800
<v Speaker 2>Yes, significantly. You use Pydantic models to define your data

692
00:34:33.800 --> 00:34:38.039
<v Speaker 2>structures, and FastAPI handles the validation and serialization. You

693
00:34:38.079 --> 00:34:40.800
<v Speaker 2>could easily build an API endpoint that takes some text,

694
00:34:41.079 --> 00:34:44.119
<v Speaker 2>runs it through your spaCy NER pipeline, and returns the

695
00:34:44.119 --> 00:34:49.000
<v Speaker 2>extracted entities as structured JSON data. The autogenerated documentation makes

696
00:34:49.000 --> 00:34:52.559
<v Speaker 2>it super easy for others or yourself to understand and

697
00:34:52.599 --> 00:34:53.519
<v Speaker 2>test the API.
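
NOTE
Editor's sketch, not from the episode: the request and response shape of such an
endpoint, using only the standard library. Dataclasses stand in for Pydantic
models, a plain function stands in for the FastAPI route handler, and a regex
that tags dollar amounts stands in for the spaCy NER pipeline. All names here
(TextRequest, EntityOut, ner_endpoint) are hypothetical.
```python
import json
import re
from dataclasses import asdict, dataclass
@dataclass
class TextRequest:
    text: str
    def __post_init__(self) -> None:
        # Manual check standing in for the validation a typed API framework automates.
        if not isinstance(self.text, str) or not self.text.strip():
            raise ValueError("text must be a non-empty string")
@dataclass
class EntityOut:
    text: str
    label: str
    start: int
    end: int
def extract_entities(text: str) -> list[EntityOut]:
    """Toy extractor that tags dollar amounts as MONEY; a real app would run its NER pipeline here."""
    pattern = re.compile(r"\$\d+(?:\.\d+)?")
    return [
        EntityOut(match.group(), "MONEY", match.start(), match.end())
        for match in pattern.finditer(text)
    ]
def ner_endpoint(payload: dict) -> str:
    """Validate the payload, extract entities, and serialize the result as JSON."""
    request = TextRequest(**payload)
    entities = extract_entities(request.text)
    return json.dumps({"entities": [asdict(entity) for entity in entities]})
print(ner_endpoint({"text": "The jacket costs $120 today."}))
```
Typed input and output structures are what let a framework validate requests
and generate documentation without extra code.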

698
00:34:53.920 --> 00:34:58.159
<v Speaker 1>Okay, building models, deploying apps, the whole process can get complex.

699
00:34:58.519 --> 00:35:01.079
<v Speaker 1>How do you manage the entire end-to-end workflow,

700
00:35:01.280 --> 00:35:03.440
<v Speaker 1>especially for reproducibility and collaboration.

701
00:35:03.719 --> 00:35:06.480
<v Speaker 2>That's where workflow management tools come in. spaCy has a

702
00:35:06.519 --> 00:35:09.719
<v Speaker 2>companion tool called Weasel. Weasel? Yeah, Weasel helps you structure

703
00:35:09.719 --> 00:35:13.280
<v Speaker 2>your entire NLP project. You define your workflow steps, like

704
00:35:13.320 --> 00:35:17.920
<v Speaker 2>downloading data, preprocessing, training, evaluating, along with any data assets

705
00:35:17.920 --> 00:35:21.639
<v Speaker 2>and custom commands, all within a configuration file, project.yml.

706
00:35:21.960 --> 00:35:24.400
<v Speaker 2>It makes your project reproducible and easier for others to

707
00:35:24.480 --> 00:35:25.400
<v Speaker 2>understand and run.

708
00:35:25.599 --> 00:35:28.639
<v Speaker 1>What about managing the data itself and the models? They

709
00:35:28.639 --> 00:35:30.920
<v Speaker 1>can get large and change often.

710
00:35:31.199 --> 00:35:34.960
<v Speaker 2>For that, Data Version Control (DVC) is an increasingly popular tool, especially

711
00:35:35.000 --> 00:35:37.920
<v Speaker 2>in the ML world. It works alongside Git. How does

712
00:35:38.000 --> 00:35:42.400
<v Speaker 2>DVC help? It tackles several common problems. First, sharing large

713
00:35:42.480 --> 00:35:45.719
<v Speaker 2>datasets and models is hard with Git alone. DVC

714
00:35:45.880 --> 00:35:48.480
<v Speaker 2>lets you version your data and models, storing them in

715
00:35:48.519 --> 00:35:52.000
<v Speaker 2>remote storage like S3 or Google Cloud Storage, while

716
00:35:52.079 --> 00:35:56.119
<v Speaker 2>keeping small metafiles in Git. This makes collaboration much easier.

717
00:35:56.159 --> 00:35:58.800
<v Speaker 2>What else? It helps make your data processing and model

718
00:35:58.840 --> 00:36:02.519
<v Speaker 2>training pipelines reliable and reproducible. You define the steps and

719
00:36:02.559 --> 00:36:06.599
<v Speaker 2>their dependencies, and DVC can track everything. It also crucially

720
00:36:06.599 --> 00:36:09.239
<v Speaker 2>helps with tracking model metrics over time, so you can

721
00:36:09.280 --> 00:36:12.559
<v Speaker 2>see how performance changes as you modify data or code.

722
00:36:12.960 --> 00:36:16.360
<v Speaker 2>It really embraces GitOps principles for MLOps, making your ML

723
00:36:16.400 --> 00:36:19.920
<v Speaker 2>workflows versioned, automated, and continuously reconciled.

724
00:36:20.000 --> 00:36:22.960
<v Speaker 1>So it brings better engineering practices to data science?

725
00:36:23.039 --> 00:36:25.360
<v Speaker 2>Exactly. It helps manage the whole life cycle, and tools like

726
00:36:25.440 --> 00:36:28.360
<v Speaker 2>DVC Studio even provide features like a model registry for

727
00:36:28.480 --> 00:36:31.199
<v Speaker 2>managing and sharing your trained models effectively across a team.

728
00:36:31.599 --> 00:36:34.559
<v Speaker 2>Weasel and DVC together provide a really solid foundation for

729
00:36:34.639 --> 00:36:36.440
<v Speaker 2>managing serious NLP projects.

730
00:36:36.760 --> 00:36:41.000
<v Speaker 1>Wow. From the core concepts and pipeline, through advanced analysis

731
00:36:41.039 --> 00:36:46.039
<v Speaker 1>like dependency parsing and NER, rule-based matching, transformers, LLMs,

732
00:36:46.360 --> 00:36:50.960
<v Speaker 1>training custom models, and finally, deployment and workflow management. That

733
00:36:51.039 --> 00:36:55.000
<v Speaker 1>was an incredibly thorough deep dive into the spaCy ecosystem.

734
00:36:55.159 --> 00:36:56.960
<v Speaker 2>It really covers a lot of ground, doesn't it.

735
00:36:56.960 --> 00:36:59.960
<v Speaker 1>It absolutely does. It really reinforces that idea of spaCy

736
00:37:00.079 --> 00:37:05.320
<v Speaker 1>being that practical, well-optimized kitchen knife: precise, powerful,

737
00:37:05.360 --> 00:37:08.199
<v Speaker 1>and adaptable for so many different NLP tasks.

738
00:37:08.280 --> 00:37:09.840
<v Speaker 2>Yeah. And you know, if we connect this to the

739
00:37:09.840 --> 00:37:13.960
<v Speaker 2>bigger picture, really understanding tools like spaCy empowers you not

740
00:37:14.119 --> 00:37:17.599
<v Speaker 2>just to build things, but also to critically evaluate how

741
00:37:17.599 --> 00:37:20.480
<v Speaker 2>these language AI systems work, how they understand or misunderstand

742
00:37:20.480 --> 00:37:20.880
<v Speaker 2>the world.

743
00:37:20.960 --> 00:37:21.639
<v Speaker 1>That's a great point.

744
00:37:21.719 --> 00:37:23.760
<v Speaker 2>It kind of raises an important question for anyone listening,

745
00:37:23.800 --> 00:37:25.880
<v Speaker 2>I think. Now that you have this deeper insight, how

746
00:37:25.920 --> 00:37:28.840
<v Speaker 2>will you use this knowledge? Maybe to refine your own analysis,

747
00:37:28.920 --> 00:37:30.960
<v Speaker 2>or perhaps even to build something new and impactful?

748
00:37:31.239 --> 00:37:34.480
<v Speaker 1>A fantastic thought to end on. We definitely encourage you,

749
00:37:34.719 --> 00:37:38.519
<v Speaker 1>our listeners, to keep exploring these fascinating topics and think

750
00:37:38.519 --> 00:37:40.599
<v Speaker 1>about how you might apply this knowledge, whether it's in

751
00:37:40.639 --> 00:37:43.960
<v Speaker 1>your work, your studies, or just your own curiosity about

752
00:37:44.079 --> 00:37:44.960
<v Speaker 1>language and AI.
