WEBVTT

1
00:00:00.120 --> 00:00:03.319
<v Speaker 1>You know, it's wild how second nature it's become to

2
00:00:03.399 --> 00:00:06.960
<v Speaker 1>just talk to our devices. Hey, Google, set a timer, Siri,

3
00:00:07.080 --> 00:00:09.560
<v Speaker 1>what's the weather? We barely think about it.

4
00:00:09.759 --> 00:00:11.679
<v Speaker 2>Yeah, it really feels like something we just take for

5
00:00:11.720 --> 00:00:12.480
<v Speaker 2>granted now.

6
00:00:12.519 --> 00:00:14.880
<v Speaker 1>But pull back for a second. How does that actually happen?

7
00:00:14.960 --> 00:00:17.640
<v Speaker 1>How does your phone hear you, understand what you want,

8
00:00:17.719 --> 00:00:19.320
<v Speaker 1>and then, you know, do something?

9
00:00:19.399 --> 00:00:21.320
<v Speaker 2>It does feel a bit like a magic trick, doesn't it?

10
00:00:22.079 --> 00:00:25.879
<v Speaker 2>But behind that simple interaction is this whole layered world

11
00:00:25.920 --> 00:00:26.800
<v Speaker 2>of technology.

12
00:00:27.079 --> 00:00:30.440
<v Speaker 1>It's quite complex, actually, and that's exactly the world we're

13
00:00:30.440 --> 00:00:33.479
<v Speaker 1>diving into today. We're taking a deep dive into how

14
00:00:33.520 --> 00:00:38.479
<v Speaker 1>you build these voice-based applications, specifically thinking about Android devices.

15
00:00:38.520 --> 00:00:41.600
<v Speaker 2>Okay, and our guide for this exploration is fascinating. It

16
00:00:41.679 --> 00:00:45.560
<v Speaker 2>is a detailed technical guide published back in 2013.

17
00:00:45.640 --> 00:00:47.600
<v Speaker 1>2013, so a bit of a snapshot from that

18
00:00:47.640 --> 00:00:48.560
<v Speaker 1>era.

19
00:00:48.640 --> 00:00:50.439
<v Speaker 2>Exactly. It gives us a really interesting look at the tools

20
00:00:50.439 --> 00:00:54.280
<v Speaker 2>and approaches developers were using then, leveraging Google's own capabilities

21
00:00:54.479 --> 00:00:56.200
<v Speaker 2>and also some open source software.

22
00:00:56.359 --> 00:00:59.320
<v Speaker 1>Right. So our mission in this deep dive is to

23
00:00:59.359 --> 00:01:02.119
<v Speaker 1>kind of cut through the complexity. We want to unpack

24
00:01:02.159 --> 00:01:04.680
<v Speaker 1>the core concepts, the essential building blocks.

25
00:01:04.840 --> 00:01:05.840
<v Speaker 2>Fundamentals, yeah.

26
00:01:05.760 --> 00:01:09.319
<v Speaker 1>The fundamentals, and show you the journey developers took to create

27
00:01:09.319 --> 00:01:12.480
<v Speaker 1>apps you could talk to, all without you needing to

28
00:01:12.519 --> 00:01:15.000
<v Speaker 1>pore over the original technical manual yourself.

29
00:01:15.120 --> 00:01:15.599
<v Speaker 2>Sounds good.

30
00:01:16.239 --> 00:01:19.599
<v Speaker 1>We'll start with the absolute basics, speaking and listening and

31
00:01:19.719 --> 00:01:23.120
<v Speaker 1>build up from there, maybe getting into complex conversations and

32
00:01:23.200 --> 00:01:25.159
<v Speaker 1>even early virtual assistants.

33
00:01:25.719 --> 00:01:28.519
<v Speaker 2>Okay, let's unpack this. Starting at the very beginning seems right.

34
00:01:29.400 --> 00:01:32.359
<v Speaker 2>Before a device can respond or have any kind of conversation,

35
00:01:32.840 --> 00:01:36.000
<v Speaker 2>it first has to be able to speak and hear.

36
00:01:36.560 --> 00:01:39.680
<v Speaker 1>Right. The fundamental capabilities Android provides for this are text to

37
00:01:39.719 --> 00:01:44.760
<v Speaker 1>speech, TTS, and automated speech recognition, or ASR.

38
00:01:45.159 --> 00:01:48.680
<v Speaker 2>And thinking about why these are important, well, it opens

39
00:01:48.719 --> 00:01:51.239
<v Speaker 2>up so many possibilities. Imagine your hands are full, like

40
00:01:51.280 --> 00:01:52.920
<v Speaker 2>you're driving and need directions.

41
00:01:53.000 --> 00:01:55.319
<v Speaker 1>Oh yeah, it's simple then.

42
00:01:55.239 --> 00:01:58.400
<v Speaker 2>Or, and this is crucial, for accessibility. Think about someone with a

43
00:01:58.439 --> 00:02:03.400
<v Speaker 2>visual impairment using a screen reader, that's TTS, or ASR

44
00:02:03.480 --> 00:02:07.599
<v Speaker 2>helping someone communicate if they have difficulty speaking, much like

45
00:02:07.640 --> 00:02:10.719
<v Speaker 2>Stephen Hawking used speech synthesis technology.

46
00:02:10.879 --> 00:02:14.800
<v Speaker 1>They're really foundational tools. Okay, let's start with TTS, text

47
00:02:14.800 --> 00:02:19.080
<v Speaker 1>to speech. Simply put, it turns written text into spoken audio.

48
00:02:19.000 --> 00:02:22.479
<v Speaker 2>Right, and the technology works in stages. First, it needs

49
00:02:22.479 --> 00:02:25.120
<v Speaker 2>to understand the text itself, things like you know how

50
00:02:25.159 --> 00:02:27.960
<v Speaker 2>to pronounce words that look similar but sound different based

51
00:02:28.000 --> 00:02:29.360
<v Speaker 2>on context.

52
00:02:28.919 --> 00:02:31.400
<v Speaker 1>Like "read" versus "read".

53
00:02:31.360 --> 00:02:34.960
<v Speaker 2>Exactly. Or converting numbers and abbreviations into full words. The cool

54
00:02:35.000 --> 00:02:38.759
<v Speaker 2>part for developers often is the system handles a lot

55
00:02:38.759 --> 00:02:40.960
<v Speaker 2>of this linguistic complexity. You don't always have to get

56
00:02:41.000 --> 00:02:41.599
<v Speaker 2>into the weeds.

57
00:02:41.639 --> 00:02:43.800
<v Speaker 1>Okay, So it understands the text, then it has to

58
00:02:43.840 --> 00:02:45.960
<v Speaker 1>generate the actual sound.

59
00:02:46.400 --> 00:02:48.960
<v Speaker 2>Exactly. A common approach, especially around the time this book was written,

60
00:02:49.000 --> 00:02:51.000
<v Speaker 2>was something called concatenative synthesis.

61
00:02:51.120 --> 00:02:52.960
<v Speaker 1>Concatenative synthesis. Okay, okay.

62
00:02:53.080 --> 00:02:55.360
<v Speaker 2>Think of it like building a sentence by stitching together

63
00:02:55.400 --> 00:03:00.280
<v Speaker 2>tiny prerecorded pieces of speech. These could be sounds, syllables, words,

64
00:03:00.319 --> 00:03:01.280
<v Speaker 2>even short phrases.

65
00:03:01.400 --> 00:03:03.759
<v Speaker 1>Ah, like digital Lego bricks for voice.

66
00:03:04.000 --> 00:03:07.719
<v Speaker 2>Kind of. Algorithms select the right pieces and join them smoothly,

67
00:03:07.879 --> 00:03:12.439
<v Speaker 2>trying to mimic natural rhythm and intonation. When it's done well,

68
00:03:12.520 --> 00:03:14.360
<v Speaker 2>it can sound remarkably natural.

69
00:03:14.680 --> 00:03:17.039
<v Speaker 1>It's kind of amazing how they make those pieces fit together.

70
00:03:17.319 --> 00:03:20.240
<v Speaker 1>But wait, if they're stitching together prerecorded bits, why

71
00:03:20.280 --> 00:03:22.599
<v Speaker 1>not just record a human voice actor saying everything the

72
00:03:22.639 --> 00:03:23.680
<v Speaker 1>app might need to say.

73
00:03:24.000 --> 00:03:27.439
<v Speaker 2>That's a great point, and sometimes apps do use professional

74
00:03:27.479 --> 00:03:31.319
<v Speaker 2>voice actors, you know, for specific prompts where quality and

75
00:03:31.400 --> 00:03:33.520
<v Speaker 2>consistency are paramount.

76
00:03:33.039 --> 00:03:35.520
<v Speaker 1>Like a standard greeting or instruction.

77
00:03:36.120 --> 00:03:39.879
<v Speaker 2>Exactly. But TTS becomes absolutely essential when the text is dynamic,

78
00:03:40.360 --> 00:03:43.319
<v Speaker 2>when you can't possibly prerecord everything it might need

79
00:03:43.360 --> 00:03:43.680
<v Speaker 2>to say.

80
00:03:43.879 --> 00:03:46.199
<v Speaker 1>Ah, right, like reading out a text message.

81
00:03:45.840 --> 00:03:48.080
<v Speaker 2>You just got, or a news headline that just updated.

82
00:03:48.120 --> 00:03:49.840
<v Speaker 1>Or someone's name from your contact list.

83
00:03:49.520 --> 00:03:52.759
<v Speaker 2>Precisely. You just can't anticipate every single phrase or name.

84
00:03:53.080 --> 00:03:55.840
<v Speaker 2>So while the quality might sometimes be a trade off

85
00:03:55.879 --> 00:04:00.560
<v Speaker 2>compared to, say, a perfectly recorded voice, TTS offers that

86
00:04:00.680 --> 00:04:03.159
<v Speaker 2>vital flexibility for dynamic content.

87
00:04:02.919 --> 00:04:05.960
<v Speaker 1>And Android has had this built in for ages, right? How

88
00:04:06.000 --> 00:04:07.800
<v Speaker 1>do developers actually hook into it?

89
00:04:07.960 --> 00:04:12.199
<v Speaker 2>Yeah, the capability has been there since Android 1.6,

90
00:04:12.240 --> 00:04:16.079
<v Speaker 2>believe it or not. Developers use the framework provided. A key

91
00:04:16.120 --> 00:04:19.720
<v Speaker 2>step is making sure the necessary language data is actually

92
00:04:19.720 --> 00:04:20.720
<v Speaker 2>on the user's device.

93
00:04:20.399 --> 00:04:21.959
<v Speaker 1>The voice itself?

94
00:04:22.040 --> 00:04:25.199
<v Speaker 2>Yeah, the voice files, the rules for that language. The

95
00:04:25.199 --> 00:04:28.279
<v Speaker 2>system lets developers check for this using something called an intent,

96
00:04:28.839 --> 00:04:31.199
<v Speaker 2>and even prompt the user to install it if it's missing.

97
00:04:31.480 --> 00:04:31.879
<v Speaker 1>Okay.

98
00:04:32.040 --> 00:04:35.800
<v Speaker 2>The book suggests using a common software design pattern, a singleton,

99
00:04:36.040 --> 00:04:39.319
<v Speaker 2>basically ensuring only one instance of the TTS engine is created.

100
00:04:39.480 --> 00:04:42.839
<v Speaker 2>This helps manage resources efficiently. (Smart.) And the examples in

101
00:04:42.879 --> 00:04:45.040
<v Speaker 2>the book show how you might use this to, say,

102
00:04:45.360 --> 00:04:47.680
<v Speaker 2>read back text the user typed in, or maybe read

103
00:04:47.680 --> 00:04:50.560
<v Speaker 2>text loaded from a file. You can specify the language too,

104
00:04:50.680 --> 00:04:52.439
<v Speaker 2>like English or even regional variations.

105
00:04:52.600 --> 00:04:54.839
<v Speaker 1>Okay, so that's how the device speaks. Now the other

106
00:04:54.959 --> 00:05:00.759
<v Speaker 1>side, hearing us: automated speech recognition, ASR. This is turning

107
00:05:00.839 --> 00:05:02.720
<v Speaker 1>our spoken words into text.

108
00:05:02.639 --> 00:05:06.600
<v Speaker 2>Right, and like TTS, it involves steps. First, the device

109
00:05:06.639 --> 00:05:08.879
<v Speaker 2>needs to capture the sound from the microphone and process it.

110
00:05:09.000 --> 00:05:10.120
<v Speaker 2>Think of it as cleaning up.

111
00:05:10.040 --> 00:05:11.959
<v Speaker 1>The audio, getting rid of background noise.

112
00:05:11.920 --> 00:05:16.199
<v Speaker 2>Yeah, removing noise maybe echo, and just preparing it digitally

113
00:05:16.279 --> 00:05:17.120
<v Speaker 2>for analysis.

114
00:05:17.240 --> 00:05:19.839
<v Speaker 1>Then comes the recognition part itself. Yeah, breaking down the

115
00:05:19.879 --> 00:05:21.319
<v Speaker 1>audio into... what, sounds?

116
00:05:22.480 --> 00:05:27.519
<v Speaker 2>Basically, yeah, into tiny segments, phones, the basic

117
00:05:27.560 --> 00:05:30.399
<v Speaker 2>sounds of the language, and then it tries to match them.

118
00:05:30.759 --> 00:05:31.240
<v Speaker 1>Wow.

119
00:05:31.399 --> 00:05:34.759
<v Speaker 2>This is where powerful statistical models come into play. These

120
00:05:34.800 --> 00:05:37.920
<v Speaker 2>models are trained on massive amounts of recorded speech, learning

121
00:05:38.000 --> 00:05:41.079
<v Speaker 2>how different sounds are typically pronounced in different contexts by

122
00:05:41.079 --> 00:05:41.759
<v Speaker 2>different people.

123
00:05:41.920 --> 00:05:42.240
<v Speaker 1>Wow.

124
00:05:42.360 --> 00:05:45.399
<v Speaker 2>Okay, they build what's called an acoustic model. It's essentially

125
00:05:45.480 --> 00:05:48.680
<v Speaker 2>a statistical map of how sounds relate to words.

126
00:05:48.920 --> 00:05:51.879
<v Speaker 1>But words can sound exactly alike. You mentioned "read" and

127
00:05:51.959 --> 00:05:55.040
<v Speaker 1>"red", or "to" and "two". How does it know the difference?

128
00:05:55.199 --> 00:05:58.800
<v Speaker 2>Ah, good question. That's where another statistical model helps, the

129
00:05:58.879 --> 00:05:59.439
<v Speaker 2>language model.

130
00:05:59.519 --> 00:06:00.000
<v Speaker 1>Language model.

131
00:06:00.319 --> 00:06:03.920
<v Speaker 2>This one understands the probability of words appearing together in sequence.

132
00:06:04.439 --> 00:06:08.519
<v Speaker 2>It knows that after "I went", the word "to" is

133
00:06:08.600 --> 00:06:09.959
<v Speaker 2>far, far more likely than "two".

134
00:06:10.360 --> 00:06:11.959
<v Speaker 1>Right, context.

135
00:06:12.240 --> 00:06:15.000
<v Speaker 2>Exactly. The language model provides that crucial context to help resolve

136
00:06:15.040 --> 00:06:16.000
<v Speaker 2>those ambiguities.

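NOTE
A minimal sketch of the idea just described: a language model prefers the word that more often follows the previous word. The class name and the bigram counts are made up for illustration and are not taken from the book; a real model would be trained on a large corpus.
```java
import java.util.HashMap;
import java.util.Map;
/** Toy bigram language model that picks between words that sound alike. */
final class BigramDisambiguator {
    // Hypothetical counts of "previous word + candidate" pairs.
    private static final Map<String, Integer> BIGRAM_COUNTS = new HashMap<>();
    static {
        BIGRAM_COUNTS.put("went to", 950);
        BIGRAM_COUNTS.put("went two", 2);
        BIGRAM_COUNTS.put("bought two", 300);
        BIGRAM_COUNTS.put("bought to", 5);
    }
    /** Returns the candidate that most often follows previousWord in the toy counts. */
    static String pick(String previousWord, String... candidates) {
        String best = candidates[0];
        int bestCount = -1;
        for (String candidate : candidates) {
            int count = BIGRAM_COUNTS.getOrDefault(previousWord + " " + candidate, 0);
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }
}
```
With these toy counts, BigramDisambiguator.pick("went", "two", "to") returns "to", while pick("bought", "to", "two") returns "two".
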
137
00:06:16.279 --> 00:06:19.839
<v Speaker 1>And the result isn't always just one single interpretation, is

138
00:06:19.920 --> 00:06:21.879
<v Speaker 1>it? Like, it's not always certain?

139
00:06:22.120 --> 00:06:25.600
<v Speaker 2>No, definitely not. Typically, the ASR system gives you back

140
00:06:26.040 --> 00:06:29.439
<v Speaker 2>a list of possible results, ranked by how confident it

141
00:06:29.560 --> 00:06:32.199
<v Speaker 2>is in each one. (A list?) Yeah, it's often called

142
00:06:32.240 --> 00:06:36.399
<v Speaker 2>an N-best list. Each possibility comes with a confidence score,

143
00:06:36.720 --> 00:06:39.920
<v Speaker 2>usually from zero to one. A score near one means

144
00:06:39.920 --> 00:06:41.279
<v Speaker 2>the system is pretty sure it got it right.

145
00:06:40.959 --> 00:06:44.040
<v Speaker 1>And that's useful for the developer?

146
00:06:43.800 --> 00:06:46.680
<v Speaker 2>Incredibly valuable. They can just pick the top result if

147
00:06:46.680 --> 00:06:49.800
<v Speaker 2>the confidence is high. Or if it's lower, or if

148
00:06:49.839 --> 00:06:51.879
<v Speaker 2>the top one doesn't make sense in context, they can

149
00:06:51.920 --> 00:06:54.079
<v Speaker 2>look at the others in the list. Or maybe even

150
00:06:54.199 --> 00:06:56.839
<v Speaker 2>use the confidence score to decide, "Hmm, I'd better ask

151
00:06:56.879 --> 00:06:58.040
<v Speaker 2>the user to confirm this."

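NOTE
A rough sketch of how an app might act on an N-best list with confidence scores, as discussed above. The thresholds (0.8 and 0.4), the class names, and the three actions are assumptions chosen for illustration. On a device the hypotheses would come from the recognizer; here they are passed in directly so the sketch runs as plain Java.
```java
import java.util.List;
/** Decides what to do with a ranked list of recognition hypotheses. */
final class NBestPolicy {
    enum Action { ACCEPT, CONFIRM, REPROMPT }
    /** One recognition hypothesis with a confidence between 0 and 1. */
    static final class Hypothesis {
        final String text;
        final double confidence;
        Hypothesis(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }
    // Illustrative thresholds; real values would be tuned per application.
    static final double ACCEPT_AT = 0.8;
    static final double CONFIRM_AT = 0.4;
    /** Expects hypotheses sorted best-first, as in an N-best list. */
    static Action decide(List<Hypothesis> nBest) {
        if (nBest.isEmpty()) {
            return Action.REPROMPT;
        }
        double top = nBest.get(0).confidence;
        if (top >= ACCEPT_AT) {
            return Action.ACCEPT;
        }
        return top >= CONFIRM_AT ? Action.CONFIRM : Action.REPROMPT;
    }
}
```
A top hypothesis at 0.92 would be accepted outright, one at 0.55 would trigger a confirmation question, and one at 0.2 would lead the app to ask again.
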
152
00:06:58.040 --> 00:07:00.480
<v Speaker 1>This capability has also been around on Android for a

153
00:07:00.519 --> 00:07:03.199
<v Speaker 1>while, since version 2.1, often when you tap

154
00:07:03.240 --> 00:07:05.319
<v Speaker 1>the little microphone icon on the keyboard.

155
00:07:05.079 --> 00:07:08.480
<v Speaker 2>Yes, exactly, and developers have flexibility here too. You can

156
00:07:08.560 --> 00:07:11.480
<v Speaker 2>use a simple built-in tool, an intent, that handles

157
00:07:11.480 --> 00:07:15.519
<v Speaker 2>the "Speak now" prompt and feedback automatically. Super easy, quick

158
00:07:15.560 --> 00:07:18.920
<v Speaker 2>and dirty, pretty much. Or if you want more control

159
00:07:18.920 --> 00:07:21.920
<v Speaker 2>over the look and feel, the user interface, you could

160
00:07:22.000 --> 00:07:25.680
<v Speaker 2>use a more advanced component, a SpeechRecognizer instance. This

161
00:07:25.800 --> 00:07:29.279
<v Speaker 2>lets you manage the UI yourself and react to specific

162
00:07:29.319 --> 00:07:32.560
<v Speaker 2>recognition events like when the user starts or stops speaking.

163
00:07:32.759 --> 00:07:35.480
<v Speaker 1>More control, more work, typically.

164
00:07:35.600 --> 00:07:38.439
<v Speaker 2>Yeah. The book again suggests using a library approach here, like

165
00:07:38.480 --> 00:07:41.839
<v Speaker 2>ASRLib, just to keep the code organized and reusable.

166
00:07:42.079 --> 00:07:44.759
<v Speaker 1>And you mentioned language models. Can you tell the system

167
00:07:45.279 --> 00:07:48.319
<v Speaker 1>what kind of speech to expect, like am I dictating

168
00:07:48.319 --> 00:07:50.920
<v Speaker 1>an email or just barking a search query?

169
00:07:51.160 --> 00:07:54.680
<v Speaker 2>Exactly. You can specify different language models. There's one designed

170
00:07:54.759 --> 00:07:58.480
<v Speaker 2>for free-form dictation, like long sentences, and another optimized

171
00:07:58.519 --> 00:08:00.879
<v Speaker 2>for shorter phrases like web search queries.

172
00:08:00.920 --> 00:08:01.399
<v Speaker 1>Ah.

173
00:08:01.560 --> 00:08:03.920
<v Speaker 2>The book does note though, that even with these models,

174
00:08:03.920 --> 00:08:06.560
<v Speaker 2>the input can still be quite open-ended, so the

175
00:08:06.600 --> 00:08:09.439
<v Speaker 2>developer might need to do more processing afterwards to figure

176
00:08:09.480 --> 00:08:11.079
<v Speaker 2>out the specific command or meaning.

177
00:08:11.560 --> 00:08:14.680
<v Speaker 1>Oh and because these systems often connect to cloud services

178
00:08:14.720 --> 00:08:15.879
<v Speaker 1>for the heavy lifting.

179
00:08:15.600 --> 00:08:19.600
<v Speaker 2>Right, the recognition part. Yeah, the app usually needs permission

180
00:08:19.600 --> 00:08:22.600
<v Speaker 2>to access the Internet, and you need to handle potential

181
00:08:22.800 --> 00:08:27.120
<v Speaker 2>errors like no speech detected or no match found or

182
00:08:27.160 --> 00:08:29.079
<v Speaker 2>maybe a network problem.

183
00:08:29.120 --> 00:08:32.639
<v Speaker 1>Got it. So we've got the building blocks. The device

184
00:08:32.679 --> 00:08:36.440
<v Speaker 1>can speak, TTS, and it can listen and turn speech

185
00:08:36.519 --> 00:08:40.600
<v Speaker 1>into text, ASR, even giving us a list of possibilities

186
00:08:40.600 --> 00:08:43.759
<v Speaker 1>with confidence scores. How do we actually put those together

187
00:08:43.840 --> 00:08:45.200
<v Speaker 1>to build simple interactions?

188
00:08:45.240 --> 00:08:48.240
<v Speaker 2>That's the next logical step, right, moving from just hearing

189
00:08:48.320 --> 00:08:51.360
<v Speaker 2>or speaking to creating a basic back and forth. Think

190
00:08:51.360 --> 00:08:53.320
<v Speaker 2>about those early voice actions.

191
00:08:53.120 --> 00:08:54.639
<v Speaker 1>Like on Google Now back in the day.

192
00:08:54.679 --> 00:08:57.080
<v Speaker 2>Exactly. Telling your phone "call Mom" or "go to

193
00:08:57.080 --> 00:09:01.080
<v Speaker 2>wikipedia.org". These are structured commands, simple cause and effect.

194
00:09:01.080 --> 00:09:04.360
<v Speaker 1>And they're built just by combining those core TTS

195
00:09:04.360 --> 00:09:06.399
<v Speaker 1>and ASR capabilities we just talked about?

196
00:09:06.440 --> 00:09:09.559
<v Speaker 2>Pretty much. The book provides some straightforward examples. One is

197
00:09:09.600 --> 00:09:10.759
<v Speaker 2>an app called VoiceSearch.

198
00:09:10.879 --> 00:09:13.799
<v Speaker 1>It just takes whatever you say, listens using ASR, right?

199
00:09:13.759 --> 00:09:16.360
<v Speaker 2>Grabs the top result from that N-best list, the

200
00:09:16.360 --> 00:09:18.600
<v Speaker 2>one with the highest confidence, assumes it's right, and

201
00:09:18.759 --> 00:09:22.080
<v Speaker 2>immediately plugs it into a standard Android web search intent.

202
00:09:22.440 --> 00:09:24.679
<v Speaker 2>Boom, search results appear. Very simple.

203
00:09:24.639 --> 00:09:27.879
<v Speaker 1>Okay, but that immediately brings up a potential problem

204
00:09:27.919 --> 00:09:30.039
<v Speaker 1>which you hinted at. What if the ASR got it

205
00:09:30.080 --> 00:09:34.080
<v Speaker 1>wrong? (Exactly.) This seems particularly tricky in another example app

206
00:09:34.080 --> 00:09:37.759
<v Speaker 1>the book mentions, VoiceLaunch, which tries to launch an

207
00:09:37.799 --> 00:09:41.720
<v Speaker 1>installed application based on what the user says. (Right.) What

208
00:09:41.840 --> 00:09:44.559
<v Speaker 1>if you don't say the exact app name, like maybe

209
00:09:44.559 --> 00:09:47.159
<v Speaker 1>you say "music player", but the app is actually called

210
00:09:47.159 --> 00:09:47.759
<v Speaker 1>"Play Music"?

211
00:09:48.000 --> 00:09:50.559
<v Speaker 2>This is where the idea of similarity measures comes in.

212
00:09:50.720 --> 00:09:53.960
<v Speaker 2>It's a crucial concept. The app needs a way to

213
00:09:54.000 --> 00:09:57.399
<v Speaker 2>compare what the user said to the actual names of

214
00:09:57.440 --> 00:09:59.960
<v Speaker 2>the apps installed on the device to find the best

215
00:10:00.120 --> 00:10:02.600
<v Speaker 2>match, even if it's not identical.

216
00:10:02.720 --> 00:10:04.240
<v Speaker 1>How does it do that? Just check if the letters

217
00:10:04.279 --> 00:10:04.799
<v Speaker 1>are similar?

218
00:10:05.000 --> 00:10:08.120
<v Speaker 2>That's part of it, orthographic similarity, looking at the spelling.

219
00:10:08.440 --> 00:10:11.600
<v Speaker 2>But crucially, it can also look at phonetic similarity.

220
00:10:11.679 --> 00:10:12.720
<v Speaker 1>How words sound alike?

221
00:10:12.879 --> 00:10:15.080
<v Speaker 2>Yes, so it could figure out that, I don't know,

222
00:10:15.320 --> 00:10:19.200
<v Speaker 2>"photos" and "fotos" probably refer to the same thing, even

223
00:10:19.200 --> 00:10:20.200
<v Speaker 2>if the spelling's different.

224
00:10:20.240 --> 00:10:21.000
<v Speaker 1>Okay, that's clever.

225
00:10:21.320 --> 00:10:25.519
<v Speaker 2>The book mentions using algorithms like Soundex for this phonetic comparison,

226
00:10:26.080 --> 00:10:29.720
<v Speaker 2>although it notes the specific implementation they included was primarily

227
00:10:29.759 --> 00:10:34.200
<v Speaker 2>tuned for English. The key thing is normalizing the input first,

228
00:10:34.399 --> 00:10:38.039
<v Speaker 2>like removing spaces, making everything lowercase, before you do

229
00:10:38.120 --> 00:10:39.440
<v Speaker 2>the comparison. (Makes sense.)

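NOTE
A sketch of the normalize-then-compare step using classic American Soundex. This is a generic textbook version written for illustration, not the implementation shipped with the book, and the class and method names are assumptions. It only handles the letters a to z, which fits the remark that the approach is tuned for English.
```java
import java.util.Locale;
/** Normalizes names and compares them with four-character Soundex codes. */
final class PhoneticMatcher {
    // Soundex digit for each letter a..z; '0' marks letters that are not coded (vowels, h, w, y).
    private static final String CODES = "01230120022455012623010202";
    /** Lowercases and strips everything except a-z, so "Play Music" becomes "playmusic". */
    static String normalize(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }
    /** Returns the Soundex code, or an empty string if no letters remain after normalizing. */
    static String soundex(String input) {
        String word = normalize(input);
        if (word.isEmpty()) {
            return "";
        }
        StringBuilder code = new StringBuilder();
        code.append(Character.toUpperCase(word.charAt(0)));
        char previous = CODES.charAt(word.charAt(0) - 'a');
        for (int i = 1; i < word.length() && code.length() < 4; i++) {
            char letter = word.charAt(i);
            char digit = CODES.charAt(letter - 'a');
            if (digit != '0' && digit != previous) {
                code.append(digit);
            }
            // h and w do not separate letters that share a code; vowels do.
            if (letter != 'h' && letter != 'w') {
                previous = digit;
            }
        }
        while (code.length() < 4) {
            code.append('0');
        }
        return code.toString();
    }
    /** True when both inputs produce the same non-empty Soundex code. */
    static boolean soundsAlike(String a, String b) {
        String codeA = soundex(a);
        return !codeA.isEmpty() && codeA.equals(soundex(b));
    }
}
```
In this sketch "Robert" and "Rupert" both map to R163, so soundsAlike reports a match. Classic Soundex keeps the first letter, so "photos" (P320) and "fotos" (F320) do not match, which is one reason a matcher might combine a phonetic code with a spelling-based measure.
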
230
00:10:39.519 --> 00:10:43.879
<v Speaker 1>Okay, so even with similarity measures, ASR isn't perfect. That

231
00:10:43.919 --> 00:10:47.120
<v Speaker 1>potential for error means you often need to double check

232
00:10:47.159 --> 00:10:49.919
<v Speaker 1>with the user, right? Confirm things.

233
00:10:50.120 --> 00:10:53.879
<v Speaker 2>Absolutely. Confirmation is vital for robust interaction. The book includes a

234
00:10:53.919 --> 00:10:58.039
<v Speaker 2>simple example building on that VoiceSearch app. After recognizing something

235
00:10:58.159 --> 00:11:01.039
<v Speaker 2>like "pizza places" (right, yeah), it might use TTS to ask,

236
00:11:01.120 --> 00:11:04.039
<v Speaker 2>"Did you say pizza places?" And then it uses ASR again,

237
00:11:04.480 --> 00:11:07.000
<v Speaker 2>but this time listening specifically for a simple yes or no.

238
00:11:06.799 --> 00:11:10.720
<v Speaker 1>Ah, constraining the expected input.

239
00:11:11.039 --> 00:11:13.639
<v Speaker 2>Exactly. It's a basic but really important step, especially when you're

240
00:11:13.679 --> 00:11:16.879
<v Speaker 2>dealing with single critical pieces of data before taking an action.

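NOTE
A small sketch of the confirmation step: after asking "Did you say ...?", the app only needs to find a clear yes or no somewhere in the recognizer's hypotheses. The word lists and names below are hypothetical and would need extending for the phrases real users say.
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
/** Interprets the reply to a yes/no confirmation prompt. */
final class ConfirmationParser {
    enum Reply { YES, NO, UNCLEAR }
    // Hypothetical accepted words for each answer.
    private static final Set<String> YES_WORDS = new HashSet<>(Arrays.asList("yes", "yeah", "yep", "correct"));
    private static final Set<String> NO_WORDS = new HashSet<>(Arrays.asList("no", "nope", "wrong"));
    /** Scans the N-best list best-first and returns the first clear yes or no. */
    static Reply parse(List<String> nBest) {
        for (String hypothesis : nBest) {
            String cleaned = hypothesis.trim().toLowerCase(Locale.ROOT);
            if (YES_WORDS.contains(cleaned)) {
                return Reply.YES;
            }
            if (NO_WORDS.contains(cleaned)) {
                return Reply.NO;
            }
        }
        return Reply.UNCLEAR;
    }
}
```
Looking past the top hypothesis helps here: if the recognizer returns "know" first and "no" second, the sketch still reads the reply as NO. An UNCLEAR result would be the cue to ask again.
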
241
00:11:17.000 --> 00:11:20.480
<v Speaker 1>So we can make the device speak, listen, perform these

242
00:11:20.519 --> 00:11:24.759
<v Speaker 1>simple command-action pairs, handle some ambiguity with similarity, and

243
00:11:24.879 --> 00:11:28.960
<v Speaker 1>even ask for basic yes/no confirmation. But these interactions

244
00:11:29.000 --> 00:11:31.720
<v Speaker 1>still feel quite rigid. You know, you have to say

245
00:11:31.759 --> 00:11:34.000
<v Speaker 1>things in a very specific way, or it's just one

246
00:11:34.039 --> 00:11:36.600
<v Speaker 1>command at a time. How do you make the conversation

247
00:11:36.679 --> 00:11:40.320
<v Speaker 1>more flexible, like guide the user through collecting multiple pieces

248
00:11:40.360 --> 00:11:40.879
<v Speaker 1>of information?

249
00:11:41.039 --> 00:11:43.960
<v Speaker 2>Okay, yeah, that takes us into the realm of more

250
00:11:44.000 --> 00:11:47.440
<v Speaker 2>structured conversations, often called form-filling dialogue.

251
00:11:47.159 --> 00:11:49.720
<v Speaker 1>Form filling, like on a website?

252
00:11:49.320 --> 00:11:52.480
<v Speaker 2>Exactly, the same idea. The goal is to gather several

253
00:11:52.519 --> 00:11:55.919
<v Speaker 2>distinct pieces of information from the user, one by one,

254
00:11:56.279 --> 00:11:58.879
<v Speaker 2>but doing it through voice instead of textboxes and dropdowns.

255
00:11:58.960 --> 00:12:01.279
<v Speaker 1>Okay, so, like booking a flight, it might ask, "What

256
00:12:01.320 --> 00:12:04.120
<v Speaker 1>city are you flying from?" Then once you answer, "What

257
00:12:04.200 --> 00:12:05.559
<v Speaker 1>is your destination?"

258
00:12:05.600 --> 00:12:07.480
<v Speaker 2>Exactly. And then maybe, "What date do you want to travel?"

259
00:12:07.679 --> 00:12:10.080
<v Speaker 2>To manage this, you need a system. You need a

260
00:12:10.120 --> 00:12:12.879
<v Speaker 2>way to define the pieces of information you need. Think

261
00:12:12.879 --> 00:12:14.039
<v Speaker 2>of these as slots to be filled.

262
00:12:14.000 --> 00:12:15.840
<v Speaker 1>Like fields on a form.

263
00:12:15.799 --> 00:12:19.840
<v Speaker 2>Precisely. And you need an algorithm, some logic that knows

264
00:12:19.879 --> 00:12:22.480
<v Speaker 2>how to navigate the conversation to collect the info for

265
00:12:22.559 --> 00:12:24.759
<v Speaker 2>each slot in some sensible order.

266
00:12:24.879 --> 00:12:27.440
<v Speaker 1>The book points to something called VoiceXML as a kind

267
00:12:27.440 --> 00:12:28.320
<v Speaker 1>of model for this.

268
00:12:28.600 --> 00:12:32.639
<v Speaker 2>Yeah. VoiceXML is, or was, a W3C standard

269
00:12:32.639 --> 00:12:35.360
<v Speaker 2>for defining these kinds of voice dialogues, often used in

270
00:12:35.559 --> 00:12:39.639
<v Speaker 2>call center systems. It uses concepts like forms, which contain

271
00:12:39.759 --> 00:12:43.080
<v Speaker 2>fields or slots. Each field has a prompt, which is

272
00:12:43.080 --> 00:12:44.759
<v Speaker 2>what the system asks the user.

273
00:12:44.759 --> 00:12:46.360
<v Speaker 1>"What is your destination?"

274
00:12:46.600 --> 00:12:50.200
<v Speaker 2>Right. And optionally, fields can have grammars associated with them, which

275
00:12:50.279 --> 00:12:53.240
<v Speaker 2>constrain or help interpret what the user can say in response.

276
00:12:53.559 --> 00:12:56.559
<v Speaker 1>So for a destination field, the grammar might only accept

277
00:12:56.600 --> 00:12:57.519
<v Speaker 1>city names.

278
00:12:57.519 --> 00:13:01.360
<v Speaker 2>Potentially, yes. And VoiceXML uses a concept called the form

279
00:13:01.360 --> 00:13:05.480
<v Speaker 2>interpretation algorithm or FIA. It's basically the logic engine that

280
00:13:05.519 --> 00:13:08.399
<v Speaker 2>steps through the form, asking for one piece of required

281
00:13:08.399 --> 00:13:11.639
<v Speaker 2>information at a time until all the necessary slots are filled.

282
00:13:12.279 --> 00:13:15.279
<v Speaker 2>The book uses a simplified subset of these ideas specifically

283
00:13:15.279 --> 00:13:16.600
<v Speaker 2>for Android development.

284
00:13:16.559 --> 00:13:18.840
<v Speaker 1>And there's a specific library in the book to help build this?

285
00:13:19.279 --> 00:13:23.360
<v Speaker 2>Yes, a library called FormFillLib, containing classes to represent

286
00:13:23.399 --> 00:13:27.120
<v Speaker 2>these forms and fields. It works by parsing XML files

287
00:13:27.360 --> 00:13:30.919
<v Speaker 2>that the developer writes. These XML files define the structure

288
00:13:30.919 --> 00:13:33.879
<v Speaker 2>of the conversation, what questions to ask, in what order,

289
00:13:34.039 --> 00:13:35.240
<v Speaker 2>which fields are needed.

290
00:13:35.240 --> 00:13:38.960
<v Speaker 1>So the conversation logic is separate from the main app code?

291
00:13:39.240 --> 00:13:42.840
<v Speaker 2>Exactly. It uses standard Android tools like XmlPullParser, handled

292
00:13:42.879 --> 00:13:46.279
<v Speaker 2>via another helper library, XMLLib, to read these definitions.

293
00:13:46.759 --> 00:13:50.120
<v Speaker 2>Then a key piece called the dialogue interpreter class steps

294
00:13:50.120 --> 00:13:53.440
<v Speaker 2>through this structure, triggering the right TTS prompt and listening

295
00:13:53.440 --> 00:13:55.679
<v Speaker 2>for ASR responses to fill each field.

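NOTE
A simplified sketch of the form-filling loop described above: walk the fields in order, ask each prompt, and keep the answer once something usable comes back. It is not the book's interpreter. The askUser function is a hypothetical stand-in for "speak the prompt with TTS, then listen with ASR", and it is synchronous here, whereas a real Android app would receive results through asynchronous callbacks.
```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
/** Minimal form-filling loop: ask for each field in order until every slot is filled. */
final class FormFiller {
    /**
     * promptsByField maps a field name to its prompt, in asking order.
     * askUser takes a prompt and returns the recognized answer, or null if nothing usable was heard.
     */
    static Map<String, String> fill(Map<String, String> promptsByField, Function<String, String> askUser, int maxAttemptsPerField) {
        Map<String, String> slots = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : promptsByField.entrySet()) {
            String answer = null;
            // Repeat the prompt until we get an answer or run out of attempts.
            for (int attempt = 0; attempt < maxAttemptsPerField && answer == null; attempt++) {
                String heard = askUser.apply(field.getValue());
                if (heard != null && !heard.trim().isEmpty()) {
                    answer = heard.trim();
                }
            }
            if (answer == null) {
                throw new IllegalStateException("No answer for field: " + field.getKey());
            }
            slots.put(field.getKey(), answer);
        }
        return slots;
    }
}
```
For the flight example, the prompts map would hold entries such as "destination" with the prompt "What is your destination?", and the returned map would hold the filled slots in the same order.
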
296
00:13:55.759 --> 00:13:59.039
<v Speaker 1>Does it handle background tasks like parsing might take time?

297
00:13:59.200 --> 00:14:02.480
<v Speaker 2>Good point. It's designed to do the potentially slow work

298
00:14:02.600 --> 00:14:05.519
<v Speaker 2>like parsing the XML or waiting for ASR in the

299
00:14:05.559 --> 00:14:09.440
<v Speaker 2>background using Android's AsyncTask, so the main app remains responsive.

300
00:14:09.759 --> 00:14:11.840
<v Speaker 2>That separation of concerns is really nice.

301
00:14:11.879 --> 00:14:13.840
<v Speaker 1>A great example used in the book is the MusicBrain

302
00:14:13.879 --> 00:14:15.120
<v Speaker 1>app. What does that do?

303
00:14:15.519 --> 00:14:18.679
<v Speaker 2>Right. The MusicBrain demo app uses this form-filling library.

304
00:14:18.840 --> 00:14:21.480
<v Speaker 2>It guides the user through a voice dialogue asking for

305
00:14:21.559 --> 00:14:23.919
<v Speaker 2>details like maybe a word that appears in an album

306
00:14:23.960 --> 00:14:25.399
<v Speaker 2>title, or a start and end date range.

307
00:14:25.320 --> 00:14:27.879
<v Speaker 1>Using that form structure.

308
00:14:27.559 --> 00:14:30.480
<v Speaker 2>Exactly. Once it collects all the pieces of information it needs

309
00:14:30.480 --> 00:14:33.159
<v Speaker 2>by filling the slots in its form, it uses that

310
00:14:33.159 --> 00:14:37.399
<v Speaker 2>collected information to query the MusicBrainz web service, which

311
00:14:37.440 --> 00:14:39.440
<v Speaker 2>is a big online music database.

312
00:14:39.600 --> 00:14:43.960
<v Speaker 1>Ah. So it's combining the voice interface with external data, a mashup.

313
00:14:43.960 --> 00:14:47.480
<v Speaker 2>Precisely. It shows how you can take the structured data

314
00:14:47.759 --> 00:14:51.320
<v Speaker 2>gathered via voice and use it to interact with online services,

315
00:14:51.759 --> 00:14:55.279
<v Speaker 2>retrieve information, process it, maybe filter or sort the

316
00:14:55.320 --> 00:14:59.159
<v Speaker 2>results, like sorting albums by release date using helper classes,

317
00:14:59.240 --> 00:15:01.759
<v Speaker 2>and then present that back to the user, perhaps speaking

318
00:15:01.799 --> 00:15:03.120
<v Speaker 2>the results or showing them on screen.

319
00:15:03.360 --> 00:15:05.960
<v Speaker 1>Okay, so form filling lets us manage these multi step

320
00:15:06.080 --> 00:15:10.159
<v Speaker 1>conversations to get structured data like album name and date range.

321
00:15:10.720 --> 00:15:13.559
<v Speaker 1>But you mentioned that the ASR input within each step

322
00:15:13.639 --> 00:15:17.399
<v Speaker 1>was still somewhat open ended in these basic examples. How

323
00:15:17.399 --> 00:15:19.759
<v Speaker 1>do we make the app understand more than just the

324
00:15:19.799 --> 00:15:21.960
<v Speaker 1>words the user says? How do we get it to

325
00:15:22.080 --> 00:15:24.000
<v Speaker 1>understand the meaning behind the words?

326
00:15:24.320 --> 00:15:27.720
<v Speaker 2>Right. That's a critical step towards more intelligent interaction, and

327
00:15:27.799 --> 00:15:29.799
<v Speaker 2>that's where grammars come in, leading us into the field

328
00:15:29.840 --> 00:15:32.240
<v Speaker 2>of natural language understanding or NLU.

329
00:15:32.600 --> 00:15:34.759
<v Speaker 1>Grammars and NLU. Okay.

330
00:15:34.799 --> 00:15:38.519
<v Speaker 2>Grammars are tools designed specifically to help the application interpret

331
00:15:38.600 --> 00:15:41.720
<v Speaker 2>more complex user inputs. They help extract not just the

332
00:15:41.759 --> 00:15:45.919
<v Speaker 2>sequence of words, but the underlying meaning and specific structured

333
00:15:45.960 --> 00:15:46.879
<v Speaker 2>pieces of information.

334
00:15:47.480 --> 00:15:51.200
<v Speaker 1>So going beyond just recognizing "show me flights to London"

335
00:15:51.639 --> 00:15:53.240
<v Speaker 1>as a sequence of words.

336
00:15:52.919 --> 00:15:55.879
<v Speaker 2>So understanding that the user's intent is to see flights

337
00:15:56.440 --> 00:15:58.320
<v Speaker 2>and the destination parameter is London.

338
00:15:58.720 --> 00:16:01.159
<v Speaker 1>Got it. How do you create these grammars?

339
00:16:01.639 --> 00:16:06.080
<v Speaker 2>The book discusses two main approaches. First, there are handcrafted

340
00:16:05.399 --> 00:16:07.600
<v Speaker 1>grammars, written manually by developers.

341
00:16:07.639 --> 00:16:10.480
<v Speaker 2>Exactly, you write them yourself, often in an XML format

342
00:16:10.519 --> 00:16:15.120
<v Speaker 2>like SRGS, the Speech Recognition Grammar Specification, though the book uses

343
00:16:15.159 --> 00:16:18.759
<v Speaker 2>its own simplified XML format. You define the structure of

344
00:16:18.799 --> 00:16:23.960
<v Speaker 2>acceptable phrases using rules, items within rules, alternatives, optional parts,

345
00:16:24.000 --> 00:16:25.320
<v Speaker 2>and links between different rules.

346
00:16:25.399 --> 00:16:27.759
<v Speaker 1>Can you give an example for that flight query?

347
00:16:28.039 --> 00:16:31.000
<v Speaker 2>Sure, you might have a top-level rule like rule

348
00:16:31.080 --> 00:16:34.639
<v Speaker 2>ID find flight. Inside that you might have an item

349
00:16:34.679 --> 00:16:38.279
<v Speaker 2>for the phrase "show flights" or "find flights." Then maybe

350
00:16:38.279 --> 00:16:41.840
<v Speaker 2>an optional item, repeat zero to one, for the word "to,"

351
00:16:42.320 --> 00:16:45.039
<v Speaker 2>and then, crucially, a reference to another rule, like ruleref

352
00:16:45.080 --> 00:16:48.159
<v Speaker 2>URI "#city", which defines all the

353
00:16:48.279 --> 00:16:49.559
<v Speaker 2>valid city names.

354
00:16:49.279 --> 00:16:52.480
<v Speaker 1>And the "#city" rule would list London, Paris, New

355
00:16:52.559 --> 00:16:53.159
<v Speaker 1>York. Right.

356
00:16:53.480 --> 00:16:56.240
<v Speaker 2>And within those rules you can use special semantic tags.

357
00:16:56.600 --> 00:16:59.240
<v Speaker 2>So next to the item London in the city rule,

358
00:16:59.440 --> 00:17:02.279
<v Speaker 2>you might have a tag like tag out LHR tag.

359
00:17:02.879 --> 00:17:05.839
<v Speaker 2>This tells the system if the user says London, don't

360
00:17:05.880 --> 00:17:09.519
<v Speaker 2>just return the word London, return the airport code LHR. Ah.

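NOTE
A minimal sketch of the rule just described, assuming the grammar is
compiled by hand into a regular expression. The class name, the method
name, and the Paris/CDG entry are illustrative assumptions, not the
book's grammar format or library.
```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Hypothetical sketch: a find-flight rule written directly as a regular expression.
final class FlightGrammarSketch {
    // Semantic tags for the city rule: the spoken word maps to the value returned instead.
    static final Map<String, String> CITY_TAGS = Map.of("london", "LHR", "paris", "CDG");
    // "show flights" or "find flights", an optional "to", then one item of the city rule.
    static final Pattern FIND_FLIGHT =
            Pattern.compile("(?:show|find) flights(?: to)? (london|paris)");
    // Returns the airport code for a matching utterance, or null when the grammar rejects it.
    static String interpret(String utterance) {
        Matcher m = FIND_FLIGHT.matcher(utterance.trim().toLowerCase(Locale.ROOT));
        return m.matches() ? CITY_TAGS.get(m.group(1)) : null;
    }
    public static void main(String[] args) {
        System.out.println(interpret("Show flights to London"));
        System.out.println(interpret("I want a flight to London"));
    }
}
```
The first call yields LHR; the second yields null, because that phrasing
is not covered by the rule.
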
361
00:17:09.680 --> 00:17:14.319
<v Speaker 1>Extracting structured data directly based on the grammar match. That's powerful,

362
00:17:14.519 --> 00:17:15.240
<v Speaker 1>very powerful.

363
00:17:15.599 --> 00:17:18.039
<v Speaker 2>But as you can imagine writing these grammars to cover

364
00:17:18.119 --> 00:17:20.119
<v Speaker 2>all the different ways a user might phrase.

365
00:17:19.799 --> 00:17:22.400
<v Speaker 1>something: "flights to London," "show me London flights," "I want

366
00:17:22.400 --> 00:17:23.559
<v Speaker 1>a flight to London." Exactly.

367
00:17:23.720 --> 00:17:28.279
<v Speaker 2>Designing handcrafted grammars for spontaneous, unpredictable speech is incredibly hard

368
00:17:28.559 --> 00:17:30.880
<v Speaker 2>and very time consuming. That's the big challenge.

369
00:17:30.920 --> 00:17:32.079
<v Speaker 1>So what's the alternative?

370
00:17:32.319 --> 00:17:35.960
<v Speaker 2>That's where the second type comes in. Statistical grammars, or

371
00:17:36.000 --> 00:17:38.119
<v Speaker 2>more broadly, statistical NLU

372
00:17:37.799 --> 00:17:39.279
<v Speaker 1>models, learned from data.

373
00:17:39.519 --> 00:17:43.039
<v Speaker 2>Yes, these aren't written by hand. They're trained on vast

374
00:17:43.039 --> 00:17:47.200
<v Speaker 2>amounts of real world language data using machine learning techniques.

375
00:17:47.119 --> 00:17:50.000
<v Speaker 1>And the advantage is they can be much more flexible, handle

376
00:17:50.119 --> 00:17:52.279
<v Speaker 1>variations you didn't explicitly code for.

377
00:17:52.839 --> 00:17:56.200
<v Speaker 2>That's the key benefit. Because they work based on probabilities

378
00:17:56.200 --> 00:17:59.279
<v Speaker 2>and patterns learned from how people actually speak, they can

379
00:17:59.319 --> 00:18:04.119
<v Speaker 2>often handle more irregular wording, synonyms, even slightly ungrammatical

380
00:18:04.160 --> 00:18:07.359
<v Speaker 2>inputs that would break a strict handcrafted grammar.

381
00:18:07.759 --> 00:18:08.759
<v Speaker 1>What's the downside?

382
00:18:09.039 --> 00:18:11.440
<v Speaker 2>The main one is they require huge data sets to

383
00:18:11.480 --> 00:18:14.759
<v Speaker 2>train effectively, and access to these trained models often comes

384
00:18:14.880 --> 00:18:17.839
<v Speaker 2>via cloud services. The book mentions a service from a

385
00:18:17.839 --> 00:18:20.440
<v Speaker 2>company called Maluuba as an example of a real world

386
00:18:20.480 --> 00:18:23.119
<v Speaker 2>statistical NLU system available around that time.

387
00:18:22.960 --> 00:18:25.240
<v Speaker 1>And that kind of system tries to identify the core

388
00:18:25.279 --> 00:18:27.720
<v Speaker 1>intention and the relevant details, the entities.

389
00:18:27.880 --> 00:18:30.960
<v Speaker 2>Precisely. You give it a phrase like "what's the weather

390
00:18:31.000 --> 00:18:34.559
<v Speaker 2>in Belfast for tomorrow," and a statistical NLU system could

391
00:18:34.599 --> 00:18:38.920
<v Speaker 2>analyze it and return something structured, like category is weather, action is

392
00:18:39.119 --> 00:18:43.519
<v Speaker 2>check status, and the entities are location Belfast and date tomorrow,

393
00:18:43.680 --> 00:18:46.640
<v Speaker 2>maybe even resolving tomorrow to the actual calendar date. It's

394
00:18:46.640 --> 00:18:50.400
<v Speaker 2>focused on extracting that core meaning, often regardless of the

395
00:18:50.440 --> 00:18:51.960
<v Speaker 2>exact sentence structure used.

396
00:18:52.519 --> 00:18:55.039
<v Speaker 1>Does the book include a library to help developers work

397
00:18:55.079 --> 00:18:56.400
<v Speaker 1>with these different grammar types?

398
00:18:56.519 --> 00:19:00.200
<v Speaker 2>It does, an NLU lib. It contains classes for handling

399
00:19:00.240 --> 00:19:04.359
<v Speaker 2>those handcrafted grammars, parsing the XML definitions into Java objects,

400
00:19:04.680 --> 00:19:07.839
<v Speaker 2>converting the rules into patterns, often using regular expressions behind

401
00:19:07.880 --> 00:19:10.839
<v Speaker 2>the scenes, and then using Java's matching tools to check

402
00:19:10.920 --> 00:19:14.799
<v Speaker 2>user input against these patterns. It also extracts the semantic

403
00:19:14.839 --> 00:19:17.039
<v Speaker 2>information based on those tags we talked

404
00:19:16.799 --> 00:19:18.480
<v Speaker 1>about. And for the statistical ones?

405
00:19:18.559 --> 00:19:21.680
<v Speaker 2>The library also includes code demonstrating how to connect to

406
00:19:21.880 --> 00:19:26.200
<v Speaker 2>external statistical NLU services like that Maluuba API, sending the

407
00:19:26.279 --> 00:19:30.279
<v Speaker 2>user's text and parsing the structured semantic result that comes back, assuming,

408
00:19:30.319 --> 00:19:32.799
<v Speaker 2>of course, the developer has API access.

409
00:19:32.640 --> 00:19:34.960
<v Speaker 1>And is there a demo app to play with this?

410
00:19:35.440 --> 00:19:38.319
<v Speaker 2>Yes, a grammar test app. It's quite useful. It lets

411
00:19:38.319 --> 00:19:41.000
<v Speaker 2>you input text, either typing it or using the results

412
00:19:41.000 --> 00:19:44.640
<v Speaker 2>from ASR, and then test that input against either a

413
00:19:44.680 --> 00:19:48.079
<v Speaker 2>handcrafted grammar file you provide or by sending it off

414
00:19:48.119 --> 00:19:49.599
<v Speaker 2>to the statistical

415
00:19:48.960 --> 00:19:51.440
<v Speaker 1>service. So you can see the difference. Exactly.

416
00:19:51.519 --> 00:19:54.480
<v Speaker 2>It shows you whether the input is considered valid according

417
00:19:54.519 --> 00:19:58.160
<v Speaker 2>to the grammar and more importantly, what semantic information or

418
00:19:58.160 --> 00:20:02.279
<v Speaker 2>structured representation it extracts. It's a clear way to see

419
00:20:02.359 --> 00:20:05.039
<v Speaker 2>the different capabilities and outputs of the two approaches to

420
00:20:05.119 --> 00:20:06.039
<v Speaker 2>understanding language.

421
00:20:06.079 --> 00:20:08.799
<v Speaker 1>Okay, this is really building up. We've gone from basic

422
00:20:08.839 --> 00:20:12.720
<v Speaker 1>speaking and listening to simple commands, structured form filling, and

423
00:20:12.759 --> 00:20:16.799
<v Speaker 1>now understanding meaning with grammars and NLU. How do we

424
00:20:16.880 --> 00:20:20.039
<v Speaker 1>make these voice apps even more robust and user friendly,

425
00:20:20.359 --> 00:20:23.039
<v Speaker 1>maybe for a wider audience or in different situations.

426
00:20:23.079 --> 00:20:26.519
<v Speaker 2>Well, two key aspects the book covers next are

427
00:20:26.599 --> 00:20:28.440
<v Speaker 2>multilinguality and multimodality.

428
00:20:28.599 --> 00:20:33.119
<v Speaker 1>Multilinguality, supporting different languages. Seems obvious, but important.

429
00:20:32.880 --> 00:20:34.799
<v Speaker 2>Absolutely essential if you want your app to reach a

430
00:20:34.799 --> 00:20:37.680
<v Speaker 2>global audience. It means being able to use TTS and

431
00:20:37.759 --> 00:20:39.559
<v Speaker 2>ASR in languages other than

432
00:20:39.440 --> 00:20:41.799
<v Speaker 1>just the default. How do developers handle that?

433
00:20:42.039 --> 00:20:45.920
<v Speaker 2>They specify the languages using standard codes like ISO

434
00:20:46.039 --> 00:20:49.039
<v Speaker 2>639-1 codes, "en" for English, "es" for

435
00:20:49.079 --> 00:20:52.599
<v Speaker 2>Spanish, and so on. The Android system provides ways, again

436
00:20:52.680 --> 00:20:55.759
<v Speaker 2>using intents to check which languages are actually supported or

437
00:20:55.799 --> 00:20:57.880
<v Speaker 2>installed on the user's specific device.

438
00:20:57.920 --> 00:21:02.200
<v Speaker 1>Because not all devices might have all languages preinstalled. Right, and

439
00:21:02.160 --> 00:21:05.480
<v Speaker 2>this ties directly into the broader concept of localization in

440
00:21:05.519 --> 00:21:09.640
<v Speaker 2>Android development, you know, providing different text strings, images, layouts,

441
00:21:09.640 --> 00:21:13.839
<v Speaker 2>and resource folders like res/values-es for Spanish users versus

442
00:21:13.880 --> 00:21:16.839
<v Speaker 2>res/values-en for English users. It's about adapting the whole

443
00:21:16.880 --> 00:21:20.000
<v Speaker 2>app experience. The book has a very simple silly parrot

444
00:21:20.000 --> 00:21:22.640
<v Speaker 2>app just to show switching TTS and ASR language.

445
00:21:22.640 --> 00:21:26.039
<v Speaker 1>Okay, so that's multiple languages. What about multimodality? Sounds complex.

446
00:21:26.160 --> 00:21:29.000
<v Speaker 2>It just means combining voice interaction with the traditional graphical

447
00:21:29.119 --> 00:21:32.519
<v Speaker 2>user interface, the GUI, the buttons and screens we're used

448
00:21:32.519 --> 00:21:33.240
<v Speaker 2>to tapping

449
00:21:32.920 --> 00:21:36.039
<v Speaker 1>on. Ah, voice and touch working together. Why do that?

450
00:21:36.240 --> 00:21:39.720
<v Speaker 2>Because sometimes tapping is just easier or faster than speaking

451
00:21:39.759 --> 00:21:45.000
<v Speaker 2>for certain inputs, or users might want visual feedback confirming

452
00:21:45.000 --> 00:21:48.000
<v Speaker 2>what the system understood, or maybe they start by voice

453
00:21:48.079 --> 00:21:51.400
<v Speaker 2>and then finish by touch, or vice versa. The idea

454
00:21:51.440 --> 00:21:54.400
<v Speaker 2>is to create a seamless link. A link between what?

455
00:21:54.680 --> 00:21:56.839
<v Speaker 2>Between the fields of information we were talking about in

456
00:21:56.880 --> 00:22:00.000
<v Speaker 2>form filling dialogues and the visual elements on the screen

457
00:22:00.319 --> 00:22:03.119
<v Speaker 2>like drop-down lists, Spinners in Android, lists you can

458
00:22:03.160 --> 00:22:07.680
<v Speaker 2>scroll, ListViews, radio buttons, checkboxes, text entry fields, EditText.

459
00:22:07.920 --> 00:22:09.799
<v Speaker 1>Okay, so how does that work in practice?

460
00:22:10.039 --> 00:22:13.440
<v Speaker 2>Imagine you have a field in your voice dialogue for say, urgency,

461
00:22:14.240 --> 00:22:17.640
<v Speaker 2>with options low, medium, high. You might also have radio

462
00:22:17.680 --> 00:22:19.920
<v Speaker 2>buttons on the screen for the same options. Right. If

463
00:22:20.000 --> 00:22:23.200
<v Speaker 2>the user says medium urgency, the app not only fills

464
00:22:23.240 --> 00:22:26.359
<v Speaker 2>that voice field internally, but it also automatically checks the

465
00:22:26.400 --> 00:22:27.519
<v Speaker 2>medium radio button on

466
00:22:27.480 --> 00:22:29.640
<v Speaker 1>the screen. Ah, synchronization. Exactly.

467
00:22:30.119 --> 00:22:33.039
<v Speaker 2>And conversely, if the user taps the high radio button

468
00:22:33.039 --> 00:22:36.000
<v Speaker 2>on the screen, the app knows that the urgency information

469
00:22:36.039 --> 00:22:38.559
<v Speaker 2>has been provided, so the voice dialogue system shouldn't ask

470
00:22:38.640 --> 00:22:40.920
<v Speaker 2>for it orally anymore. The state is shared.

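NOTE
A minimal sketch of the shared state described here, in plain Java with
no Android dependency. The class, field, and method names are
hypothetical and are not the book's library API; the string
checkedRadioButton merely stands in for an on-screen radio group.
```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
// Hypothetical sketch: one urgency field kept in sync between voice and touch input.
final class SharedFormSketch {
    static final List<String> URGENCY_OPTIONS = List.of("low", "medium", "high");
    private final Map<String, String> fields = new HashMap<>();
    private String checkedRadioButton;   // stands in for the on-screen radio group
    // Voice path: accept the value only if it names one of the options the GUI offers.
    boolean fillByVoice(String recognized) {
        String text = recognized.toLowerCase(Locale.ROOT);
        for (String option : URGENCY_OPTIONS) {
            if (text.contains(option)) {
                fields.put("urgency", option);
                checkedRadioButton = option;   // mirror the spoken value onto the GUI
                return true;
            }
        }
        return false;
    }
    // Touch path: a tap fills the same field, so the dialogue need not ask again.
    void fillByTouch(String option) {
        checkedRadioButton = option;
        fields.put("urgency", option);
    }
    boolean shouldPromptForUrgency() { return !fields.containsKey("urgency"); }
    String checkedRadioButton() { return checkedRadioButton; }
}
```
Both paths write to the same map, which is why the prompt check gives
the same answer whichever way the value arrived.
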
471
00:22:41.240 --> 00:22:44.519
<v Speaker 1>That sounds really useful, but potentially complex to manage.

472
00:22:44.599 --> 00:22:48.759
<v Speaker 2>It requires careful design. Grammars become important here too, to

473
00:22:48.920 --> 00:22:52.640
<v Speaker 2>ensure that the voice input for a specific field actually

474
00:22:52.720 --> 00:22:56.319
<v Speaker 2>matches one of the valid options available in the corresponding

475
00:22:56.359 --> 00:22:58.440
<v Speaker 2>GUI element, like the items

476
00:22:58.119 --> 00:23:00.079
<v Speaker 1>in a drop-down list. And the book provides help

477
00:23:00.119 --> 00:23:00.599
<v Speaker 1>for this too?

478
00:23:00.720 --> 00:23:03.720
<v Speaker 2>Another library, yes, building on the form-filling one, a

479
00:23:03.839 --> 00:23:07.720
<v Speaker 2>multimodal form fill lib. It extends the basic form library by

480
00:23:07.759 --> 00:23:11.039
<v Speaker 2>adding grammar checking within the dialogue flow, so it only

481
00:23:11.079 --> 00:23:14.000
<v Speaker 2>accepts voice input that actually matches the grammar defined for

482
00:23:14.039 --> 00:23:18.240
<v Speaker 2>the current field. And crucially, it includes methods like oral

483
00:23:18.240 --> 00:23:21.319
<v Speaker 2>to GUI and GUI to oral, specifically designed to synchronize the

484
00:23:21.319 --> 00:23:24.759
<v Speaker 2>state between the internal voice fields and the visual GUI elements.

485
00:23:24.880 --> 00:23:26.000
<v Speaker 1>Is there an example app for this?

486
00:23:26.240 --> 00:23:28.559
<v Speaker 2>They use a mock send message app as a demonstration.

487
00:23:28.759 --> 00:23:31.079
<v Speaker 2>It lets the user provide details for sending a message

488
00:23:31.079 --> 00:23:35.279
<v Speaker 2>recipient, urgency, maybe the message body itself, either by speaking the

489
00:23:35.240 --> 00:23:38.000
<v Speaker 1>information, following the form-filling prompts.

490
00:23:37.839 --> 00:23:41.440
<v Speaker 2>Right, or by interacting directly with the GUI elements on

491
00:23:41.480 --> 00:23:43.839
<v Speaker 2>the screen, like picking a contact from a list or

492
00:23:43.880 --> 00:23:47.319
<v Speaker 2>tapping a radio button for urgency. The app keeps track

493
00:23:47.319 --> 00:23:50.160
<v Speaker 2>of the information consistently, regardless of whether it came via

494
00:23:50.240 --> 00:23:53.200
<v Speaker 2>voice or touch. It really highlights how voice and touch

495
00:23:53.480 --> 00:23:57.359
<v Speaker 2>don't have to be separate, isolated interaction modes. They can

496
00:23:57.400 --> 00:23:59.119
<v Speaker 2>complement each other within the same task.

497
00:23:59.440 --> 00:24:01.880
<v Speaker 1>Okay, this is quite a journey. We've layered on all

498
00:24:01.880 --> 00:24:09.519
<v Speaker 1>these capabilities: speaking, listening, simple commands, handling ambiguity, multistep forms,

499
00:24:10.039 --> 00:24:13.920
<v Speaker 1>really understanding language with grammars and NLU, adding flexibility with

500
00:24:14.000 --> 00:24:18.000
<v Speaker 1>multiple languages, and combining voice with the screen through multimodality.

501
00:24:18.599 --> 00:24:21.039
<v Speaker 1>What happens when you take all these pieces and integrate

502
00:24:21.039 --> 00:24:23.559
<v Speaker 1>them into one coherent system? Well, that's when

503
00:24:23.440 --> 00:24:26.359
<v Speaker 2>you get into the realm of virtual personal assistants,

504
00:24:25.960 --> 00:24:29.640
<v Speaker 1>VPAs. Ah, the Siris, the Google Assistants, the Alexas of

505
00:24:29.640 --> 00:24:30.759
<v Speaker 1>the world. Exactly.

506
00:24:30.920 --> 00:24:33.680
<v Speaker 2>These are the conversational agents we're much more familiar with today.

507
00:24:34.200 --> 00:24:37.200
<v Speaker 2>They really represent the culmination of all these underlying technologies

508
00:24:37.200 --> 00:24:41.599
<v Speaker 2>we've been discussing, designed to understand potentially complex requests and

509
00:24:41.680 --> 00:24:43.119
<v Speaker 2>perform a whole range of tasks.

510
00:24:43.480 --> 00:24:46.720
<v Speaker 1>And the fundamental challenge for a VPA trying to bring

511
00:24:46.759 --> 00:24:50.759
<v Speaker 1>everything together must be accurately figuring out the user's intention

512
00:24:50.880 --> 00:24:53.319
<v Speaker 1>from whatever they happen to say, however they say it.

513
00:24:53.440 --> 00:24:55.559
<v Speaker 2>That's the core of it. And as we saw when

514
00:24:55.599 --> 00:24:59.000
<v Speaker 2>discussing grammars and NLU, you can approach this understanding in

515
00:24:59.000 --> 00:25:02.119
<v Speaker 2>different ways. You might try to classify the user's input

516
00:25:02.400 --> 00:25:04.720
<v Speaker 2>using statistical methods trained on lots of

517
00:25:04.759 --> 00:25:08.240
<v Speaker 1>data, which is good for more open-ended questions or requests.

518
00:25:07.839 --> 00:25:11.400
<v Speaker 2>Right, or you might use more structured grammars for specific commands,

519
00:25:11.559 --> 00:25:14.839
<v Speaker 2>extracting the core intent and any relevant details or parameters.

520
00:25:15.359 --> 00:25:18.240
<v Speaker 2>Often modern systems use a hybrid approach.

521
00:25:18.079 --> 00:25:21.680
<v Speaker 1>And once the intention is hopefully understood, the VPA needs

522
00:25:21.720 --> 00:25:24.480
<v Speaker 1>more logic. Right, it needs a system to decide what

523
00:25:24.599 --> 00:25:25.240
<v Speaker 1>to do next.

524
00:25:25.440 --> 00:25:29.359
<v Speaker 2>Definitely, it needs dialogue management. Based on the understood intent

525
00:25:29.559 --> 00:25:32.480
<v Speaker 2>and the current state of the conversation, the dialogue manager

526
00:25:32.519 --> 00:25:35.079
<v Speaker 2>decides the next action. Does it need to ask a

527
00:25:35.079 --> 00:25:38.599
<v Speaker 2>clarifying question, Does it have enough information to perform the task?

528
00:25:38.920 --> 00:25:40.559
<v Speaker 2>Does it need to access some data?

529
00:25:40.599 --> 00:25:42.799
<v Speaker 1>And then it needs to generate a response.

530
00:25:42.680 --> 00:25:46.400
<v Speaker 2>Yes, response generation figuring out what to say back to

531
00:25:46.400 --> 00:25:49.799
<v Speaker 2>the user via TTS or what information to display on

532
00:25:49.839 --> 00:25:52.079
<v Speaker 2>the screen, or what action to actually perform on the

533
00:25:52.119 --> 00:25:54.119
<v Speaker 2>device or via web service.

534
00:25:54.279 --> 00:25:57.440
<v Speaker 1>This whole idea of conversational AI has a long history,

535
00:25:57.480 --> 00:25:59.000
<v Speaker 1>doesn't it? Even before smartphones.

536
00:25:59.039 --> 00:26:01.079
<v Speaker 2>Oh yeah, it goes way back. The book even mentions

537
00:26:01.119 --> 00:26:04.599
<v Speaker 2>early chatbots like ELIZA from the nineteen sixties, but for

538
00:26:04.680 --> 00:26:07.880
<v Speaker 2>building more sophisticated VPAs around the time the book was written,

539
00:26:08.240 --> 00:26:11.680
<v Speaker 2>it focuses on using a specific platform called Pandorabots.

540
00:26:11.799 --> 00:26:13.240
<v Speaker 1>Pandorabots. What's that?

541
00:26:13.400 --> 00:26:16.160
<v Speaker 2>It's a platform still around today, actually, for creating and

542
00:26:16.240 --> 00:26:20.960
<v Speaker 2>hosting conversational agents or chatbots. It uses a language called AIML.

543
00:26:21.319 --> 00:26:25.640
<v Speaker 2>AIML, Artificial Intelligence Markup Language, is based on XML, and

544
00:26:25.640 --> 00:26:28.920
<v Speaker 2>it lets you define the bot's conversational behavior using categories.

545
00:26:29.200 --> 00:26:31.480
<v Speaker 2>Each category has a pattern, which is basically

546
00:26:31.200 --> 00:26:33.839
<v Speaker 1>what the user might say, a potential input phrase.

547
00:26:33.839 --> 00:26:37.319
<v Speaker 2>Right, and a template, which defines how the bot should respond.

548
00:26:37.920 --> 00:26:40.440
<v Speaker 2>You can use wildcards like the asterisk in the patterns to

549
00:26:40.440 --> 00:26:43.720
<v Speaker 2>make them more flexible, matching multiple variations of user input,

550
00:26:44.519 --> 00:26:47.279
<v Speaker 2>and there are special tags like srai that let you

551
00:26:47.359 --> 00:26:51.000
<v Speaker 2>redirect from one pattern to another, helping to handle synonyms

552
00:26:51.240 --> 00:26:53.799
<v Speaker 2>or rephrase requests without duplicating logic.

553
00:26:53.960 --> 00:26:56.519
<v Speaker 1>What happens if the user says something that doesn't match

554
00:26:56.559 --> 00:26:57.200
<v Speaker 1>any pattern?

555
00:26:57.480 --> 00:27:01.960
<v Speaker 2>Good question. AIML has the concept of an ultimate default category,

556
00:27:02.359 --> 00:27:05.480
<v Speaker 2>a fallback pattern that matches anything else, usually triggering a

557
00:27:05.519 --> 00:27:09.400
<v Speaker 2>response like sorry, I didn't understand that or could you rephrase?

558
00:27:10.079 --> 00:27:12.319
<v Speaker 2>It's crucial for making the bot seem less brittle.

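NOTE
A minimal sketch of the category idea: patterns with a wildcard, a
redirect in the spirit of srai, and a catch-all fallback. This is a toy
matcher written for illustration, not the Pandorabots engine or real
AIML matching order; the SRAI: prefix and all names are assumptions.
```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;
// Hypothetical sketch of AIML-style categories held in a simple ordered map.
final class CategorySketch {
    // Pattern mapped to template; insertion order decides priority, a lone "*" is the fallback.
    static final Map<String, String> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("HELLO", "Hi there!");
        CATEGORIES.put("HI *", "SRAI:HELLO");   // redirect to the HELLO category
        CATEGORIES.put("*", "Sorry, I didn't understand that.");
    }
    static String respond(String input) {
        // Normalize the way AIML patterns are usually written: upper case, no punctuation.
        String normalized = input.trim().toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9 ]", "");
        for (Map.Entry<String, String> category : CATEGORIES.entrySet()) {
            // Quote the literal text, then let each "*" match one or more characters.
            String regex = Pattern.quote(category.getKey()).replace("*", "\\E.+\\Q");
            if (normalized.matches(regex)) {
                String template = category.getValue();
                return template.startsWith("SRAI:") ? respond(template.substring(5)) : template;
            }
        }
        return "";   // only reached for empty input
    }
    public static void main(String[] args) {
        System.out.println(respond("hi bot"));
        System.out.println(respond("what's up"));
    }
}
```
Here "hi bot" is redirected to the HELLO category, while "what's up"
falls through to the default category.
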
559
00:27:12.440 --> 00:27:16.000
<v Speaker 1>Okay, So the core conversation logic matching input patterns to

560
00:27:16.039 --> 00:27:19.920
<v Speaker 1>output templates runs on the Pandorabots platform. How does

561
00:27:19.960 --> 00:27:22.200
<v Speaker 1>that connect to doing things on the Android device.

562
00:27:22.640 --> 00:27:24.640
<v Speaker 2>This is where a really interesting feature mentioned in the

563
00:27:24.640 --> 00:27:28.200
<v Speaker 2>book comes in, specifically designed for mobile vpas. It's the

564
00:27:28.319 --> 00:27:30.839
<v Speaker 2>oob tag. O-O-B? Out of band. It's a special

565
00:27:30.839 --> 00:27:33.680
<v Speaker 2>tag that the AIML developer can embed within the bot's

566
00:27:33.720 --> 00:27:37.200
<v Speaker 2>response template alongside the text the bot is supposed to say.

567
00:27:37.240 --> 00:27:39.640
<v Speaker 2>How does that work? So the bot's template might have

568
00:27:39.720 --> 00:27:41.880
<v Speaker 2>the text "Okay, I'll search the web for that for you,"

569
00:27:42.240 --> 00:27:44.680
<v Speaker 2>but hidden within the same AIML template, there could also

570
00:27:44.720 --> 00:27:47.759
<v Speaker 2>be an oob tag containing a command like search, query

571
00:27:47.799 --> 00:27:51.319
<v Speaker 2>the user asked for, search. Okay. The Android app communicates

572
00:27:51.319 --> 00:27:54.440
<v Speaker 2>with the Pandora bot online. It sends the user's transcribed

573
00:27:54.480 --> 00:27:58.160
<v Speaker 2>speech from ASR. The Pandora bot finds the matching pattern

574
00:27:58.359 --> 00:28:01.720
<v Speaker 2>and sends back the AIML template response. The Android app

575
00:28:01.799 --> 00:28:05.559
<v Speaker 2>then parses this response. It takes the regular text part, "Okay,

576
00:28:05.559 --> 00:28:07.920
<v Speaker 2>I'll search," and sends it to the TTS engine to

577
00:28:07.920 --> 00:28:11.200
<v Speaker 2>be spoken. But it also looks for any oob tags. If

578
00:28:11.200 --> 00:28:14.440
<v Speaker 2>it finds one, like search, it intercepts that command and

579
00:28:14.559 --> 00:28:17.359
<v Speaker 2>executes the corresponding action on the device, in this case

580
00:28:17.480 --> 00:28:19.839
<v Speaker 2>launching a web search with the specified query. Wow.

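NOTE
A minimal sketch of the client-side step described here: split a bot
reply into the text to speak and any embedded commands. The oob and
search element names follow the description in the conversation; the
class and method names are hypothetical, and regular expressions stand
in for a proper XML parser to keep the example short.
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Hypothetical sketch: separate spoken text from embedded oob commands in a bot reply.
final class OobSketch {
    static final Pattern OOB = Pattern.compile("<oob>(.*?)</oob>", Pattern.DOTALL);
    static final Pattern SEARCH = Pattern.compile("<search>(.*?)</search>", Pattern.DOTALL);
    // Text outside the oob tags is what would be handed to the TTS engine.
    static String spokenText(String reply) {
        return OOB.matcher(reply).replaceAll("").trim();
    }
    // Collects search queries found inside oob tags; a real app would launch each one.
    static List<String> searchQueries(String reply) {
        List<String> queries = new ArrayList<>();
        Matcher oob = OOB.matcher(reply);
        while (oob.find()) {
            Matcher search = SEARCH.matcher(oob.group(1));
            while (search.find()) {
                queries.add(search.group(1).trim());
            }
        }
        return queries;
    }
    public static void main(String[] args) {
        String reply = "Okay, I'll search the web for that. <oob><search>android speech</search></oob>";
        System.out.println(spokenText(reply));
        System.out.println(searchQueries(reply));
    }
}
```
Keeping the two extraction steps separate means the spoken text can go
to TTS even when no command is present.
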
581
00:28:20.200 --> 00:28:23.480
<v Speaker 1>So the conversation, logic and knowledge stays centralized on the

582
00:28:23.480 --> 00:28:26.920
<v Speaker 1>Pandorabots server, defined in AIML, but the actual device

583
00:28:26.960 --> 00:28:30.680
<v Speaker 1>actions, searching, launching apps, making calls, opening URLs, are triggered

584
00:28:30.720 --> 00:28:34.160
<v Speaker 1>locally on the Android device by these hidden commands embedded

585
00:28:34.200 --> 00:28:36.359
<v Speaker 1>in the bot's response. Exactly.

586
00:28:36.599 --> 00:28:39.480
<v Speaker 2>It's a clever way to decouple the conversational intelligence from

587
00:28:39.519 --> 00:28:43.960
<v Speaker 2>the specific device functionalities. The book provides a VPA lib

588
00:28:44.079 --> 00:28:47.839
<v Speaker 2>library to handle this communication, connecting to a specific Pandora

589
00:28:47.839 --> 00:28:52.119
<v Speaker 2>bot online using its bot ID, sending the ASR input, parsing

590
00:28:52.200 --> 00:28:55.319
<v Speaker 2>the XML response from Pandorabots, looking for the spoken

591
00:28:55.359 --> 00:28:58.759
<v Speaker 2>part, often marked by a "that" tag, and checking for oob tags,

592
00:28:59.039 --> 00:29:01.279
<v Speaker 2>and then running the corresponding functions on the device

593
00:29:01.319 --> 00:29:04.359
<v Speaker 2>based on those tags. The book did note a limitation

594
00:29:04.400 --> 00:29:06.319
<v Speaker 2>at the time that the gender of the TTS voice

595
00:29:06.319 --> 00:29:10.240
<v Speaker 2>couldn't be controlled programmatically, which might affect the perceived persona

596
00:29:10.279 --> 00:29:10.680
<v Speaker 2>of the bot.

597
00:29:10.839 --> 00:29:14.039
<v Speaker 1>Does the book include sample VPAs built using this

598
00:29:14.039 --> 00:29:14.799
<v Speaker 1>Pandorabots approach?

599
00:29:14.880 --> 00:29:18.759
<v Speaker 2>It does. It describes three: Jack, Derek, and Stacy. Jack

600
00:29:18.839 --> 00:29:21.200
<v Speaker 2>is based on a well-known general-purpose AIML bot

601
00:29:21.240 --> 00:29:25.200
<v Speaker 2>called ALICE, designed for broad, open-ended conversation. Derek, on

602
00:29:25.240 --> 00:29:27.799
<v Speaker 2>the other hand, is presented as a specialized bot. It's

603
00:29:27.920 --> 00:29:30.920
<v Speaker 2>trained specifically on a particular knowledge domain. The example uses

604
00:29:31.039 --> 00:29:34.279
<v Speaker 2>FAQs about type two diabetes. This highlights how you can

605
00:29:34.359 --> 00:29:38.000
<v Speaker 2>encode expert or domain-specific knowledge using these AIML patterns

606
00:29:38.000 --> 00:29:38.559
<v Speaker 2>and templates.

607
00:29:38.640 --> 00:29:41.599
<v Speaker 1>So one generalist, one specialist, and Stacy.

608
00:29:41.920 --> 00:29:47.240
<v Speaker 2>Stacy is basically Jack the general conversational bot, but enhanced

609
00:29:47.279 --> 00:29:51.839
<v Speaker 2>with that oob tag functionality, so Stacy can not only chat,

610
00:29:52.039 --> 00:29:56.319
<v Speaker 2>but can also actually control device functions like searching online

611
00:29:56.359 --> 00:29:59.880
<v Speaker 2>or launching apps based on understanding those embedded commands within

612
00:30:00.000 --> 00:30:00.839
<v Speaker 2>the conversation flow.

613
00:30:00.960 --> 00:30:03.079
<v Speaker 1>That really brings it all together. If you look back

614
00:30:03.160 --> 00:30:08.039
<v Speaker 1>at how the book illustrates the overall VPA structure, you know,

615
00:30:08.279 --> 00:30:12.759
<v Speaker 1>starting with ASR capturing the audio, then spoken language understanding

616
00:30:12.759 --> 00:30:17.039
<v Speaker 1>whether that's NLU, grammars, or AIML pattern matching, to figure

617
00:30:16.720 --> 00:30:18.640
<v Speaker 2>out the intent. Right, the understanding part.

618
00:30:18.559 --> 00:30:22.200
<v Speaker 1>Then dialogue management deciding the next step, response generation formulating

619
00:30:22.240 --> 00:30:25.920
<v Speaker 1>the output, and finally TTS speaking the response.

620
00:30:25.640 --> 00:30:28.960
<v Speaker 2>All while potentially connecting to external data sources, knowledge bases,

621
00:30:29.000 --> 00:30:32.319
<v Speaker 2>and triggering those device actions via things like the oob tag.

622
00:30:32.599 --> 00:30:35.279
<v Speaker 1>You can really see how all those individual building blocks

623
00:30:35.319 --> 00:30:40.680
<v Speaker 1>we discussed, TTS, ASR, forms, grammars, multimodality, fit together into

624
00:30:40.680 --> 00:30:42.039
<v Speaker 1>that complete VPA system.

625
00:30:42.079 --> 00:30:43.920
<v Speaker 2>It is the full picture now, showing how you assemble

626
00:30:43.960 --> 00:30:46.400
<v Speaker 2>these components to create those conversational agents.

627
00:30:46.240 --> 00:30:48.440
<v Speaker 1>And that really brings our deep dive to a close,

628
00:30:48.519 --> 00:30:52.400
<v Speaker 1>doesn't it. We've journeyed all the way from the absolute fundamentals,

629
00:30:52.440 --> 00:30:57.880
<v Speaker 1>the device's ability to speak using TTS and listen using ASR.

630
00:30:57.759 --> 00:30:59.599
<v Speaker 2>The core building blocks. Through

631
00:30:59.359 --> 00:31:02.960
<v Speaker 1>building those first simple command interactions, bringing in ideas like

632
00:31:03.200 --> 00:31:05.599
<v Speaker 1>similarity measures and confirmation.

633
00:31:05.200 --> 00:31:06.680
<v Speaker 2>Making them a bit more robust. And

634
00:31:06.759 --> 00:31:11.319
<v Speaker 1>structuring more complex multi-turn conversations using form-filling dialogues,

635
00:31:11.400 --> 00:31:12.799
<v Speaker 1>managing the flow to gather

636
00:31:12.640 --> 00:31:15.039
<v Speaker 2>information. Getting that structured data. Then

637
00:31:14.960 --> 00:31:18.680
<v Speaker 1>diving into how to understand user input more deeply, moving

638
00:31:18.720 --> 00:31:23.920
<v Speaker 1>beyond words to meaning using both handcrafted and statistical grammars via.

639
00:31:23.880 --> 00:31:27.079
<v Speaker 2>NLU, extracting intent and entities, and.

640
00:31:27.039 --> 00:31:31.079
<v Speaker 1>Then adding layers of flexibility through multilinguality supporting different languages

641
00:31:31.119 --> 00:31:36.519
<v Speaker 1>and multimodality, seamlessly combining voice with graphical interfaces.

642
00:31:36.000 --> 00:31:38.240
<v Speaker 2>Making the interaction richer and more adaptable.

643
00:31:38.480 --> 00:31:41.599
<v Speaker 1>And finally we saw how all these technologies converge in

644
00:31:41.640 --> 00:31:46.519
<v Speaker 1>the architecture of virtual personal assistants, using platforms like Pandorabots

645
00:31:46.519 --> 00:31:50.680
<v Speaker 1>with AIML and leveraging clever techniques like that OOB

646
00:31:50.799 --> 00:31:54.720
<v Speaker 1>tag to bridge the gap between conversation and actually doing

647
00:31:54.759 --> 00:31:55.799
<v Speaker 1>things on the device.

648
00:31:56.000 --> 00:31:59.839
<v Speaker 2>Absolutely. This deep dive, even though it's rooted in a

649
00:31:59.880 --> 00:32:03.720
<v Speaker 2>technical guide from over a decade ago now, really gives

650
00:32:03.759 --> 00:32:07.240
<v Speaker 2>you a solid conceptual shortcut. It helps you understand the

651
00:32:07.240 --> 00:32:10.880
<v Speaker 2>core challenges and the fundamental building blocks involved in bringing

652
00:32:10.960 --> 00:32:12.839
<v Speaker 2>voice interaction to life on a device.

653
00:32:13.079 --> 00:32:16.319
<v Speaker 1>Yeah, it lays out the pieces really clearly. And considering

654
00:32:16.319 --> 00:32:19.200
<v Speaker 1>all these elements together, from breaking down sounds and words

655
00:32:19.480 --> 00:32:23.599
<v Speaker 1>to understanding complex intent, managing dialogue flow over multiple turns,

656
00:32:23.680 --> 00:32:26.880
<v Speaker 1>generating responses, and having the ability to trigger pretty much

657
00:32:26.960 --> 00:32:30.000
<v Speaker 1>any device function or connect to any web service, it

658
00:32:30.039 --> 00:32:32.200
<v Speaker 1>really opens up your imagination, doesn't it, to the kind

659
00:32:32.240 --> 00:32:36.240
<v Speaker 1>of truly personalized voice applications you could potentially create.

660
00:32:36.480 --> 00:32:39.160
<v Speaker 2>Right. It makes you think beyond the general-purpose assistants

661
00:32:39.200 --> 00:32:41.880
<v Speaker 2>we mostly use today. What if a voice assistant wasn't

662
00:32:41.920 --> 00:32:45.319
<v Speaker 2>just generic? What if it was deeply, deeply specialized.

663
00:32:45.680 --> 00:32:50.759
<v Speaker 1>Yeah, like, imagine a VPA that understands the incredibly specific vocabulary,

664
00:32:51.039 --> 00:32:54.880
<v Speaker 1>the jargon, the unique needs of your particular hobby or

665
00:32:54.920 --> 00:32:58.480
<v Speaker 1>your specific job, maybe using a custom built grammar or

666
00:32:58.480 --> 00:33:01.640
<v Speaker 1>a finely tuned NLU model, and it could instantly

667
00:33:01.680 --> 00:33:05.799
<v Speaker 1>connect you to niche online resources or internal databases that

668
00:33:05.880 --> 00:33:08.440
<v Speaker 1>a general assistant wouldn't even know existed.

669
00:33:08.359 --> 00:33:11.160
<v Speaker 2>Or think about something more personal, maybe an assistant that reads

670
00:33:11.200 --> 00:33:14.920
<v Speaker 2>you recipes, but it uses curated audio samples to sound

671
00:33:14.960 --> 00:33:18.680
<v Speaker 2>like a comforting, familiar voice, maybe even your grandmother's voice,

672
00:33:18.759 --> 00:33:21.000
<v Speaker 2>if you had the recordings, and it lets you navigate

673
00:33:21.039 --> 00:33:24.759
<v Speaker 2>the cooking steps completely hands free, using just simple voice

674
00:33:24.759 --> 00:33:26.799
<v Speaker 2>commands tailored to that specific task.

675
00:33:26.400 --> 00:33:29.319
<v Speaker 1>That would be amazing. It makes you wonder what

676
00:33:29.440 --> 00:33:33.839
<v Speaker 1>kinds of unique, genuinely helpful or even just wonderfully quirky,

677
00:33:33.920 --> 00:33:36.720
<v Speaker 1>and specific voice interfaces are still waiting to be built,

678
00:33:36.960 --> 00:33:40.799
<v Speaker 1>especially when you start combining these foundational technologies, TTS, ASR,

679
00:33:41.119 --> 00:33:45.319
<v Speaker 1>NLU, dialogue management, device control, in new and unexpected ways.

680
00:33:45.319 --> 00:33:48.839
<v Speaker 2>Exactly. What's the potential beyond the mainstream assistants we interact

681
00:33:48.839 --> 00:33:51.079
<v Speaker 2>with now? It definitely leaves you with something to think about.
