WEBVTT 1 00:00:00.120 --> 00:00:03.319 You know, it's wild how second nature it's become to 2 00:00:03.399 --> 00:00:06.960 just talk to our devices. Hey, Google, set a timer. Siri, 3 00:00:07.080 --> 00:00:09.560 what's the weather? We barely think about it. 4 00:00:09.759 --> 00:00:11.679 Yeah, it really feels like something we just take for 5 00:00:11.720 --> 00:00:12.480 granted now. 6 00:00:12.519 --> 00:00:14.880 But pull back for a second. How does that actually happen? 7 00:00:14.960 --> 00:00:17.640 How does your phone hear you, understand what you want, 8 00:00:17.719 --> 00:00:19.320 and then, you know, do something? 9 00:00:19.399 --> 00:00:21.320 It does feel a bit like a magic trick, doesn't it? 10 00:00:22.079 --> 00:00:25.879 But behind that simple interaction is this whole layered world 11 00:00:25.920 --> 00:00:26.800 of technology. 12 00:00:27.079 --> 00:00:30.440 It's quite complex, actually, and that's exactly the world we're 13 00:00:30.440 --> 00:00:33.479 diving into today. We're taking a deep dive into how 14 00:00:33.520 --> 00:00:38.479 you build these voice-based applications, specifically thinking about Android devices. 15 00:00:38.520 --> 00:00:41.600 Okay, and our guide for this exploration is fascinating. It 16 00:00:41.679 --> 00:00:45.560 is a detailed technical guide published back in twenty thirteen. 17 00:00:45.640 --> 00:00:47.600 Twenty thirteen, so a bit of a snapshot from that 18 00:00:47.640 --> 00:00:48.560 era. Exactly. 19 00:00:48.640 --> 00:00:50.439 It gives us a really interesting look at the tools 20 00:00:50.439 --> 00:00:54.280 and approaches developers were using then, leveraging Google's own capabilities 21 00:00:54.479 --> 00:00:56.200 and also some open-source software. 22 00:00:56.359 --> 00:00:59.320 Right. So our mission in this deep dive is to 23 00:00:59.359 --> 00:01:02.119 kind of cut through the complexity.
We want to unpack 24 00:01:02.159 --> 00:01:04.680 the core concepts, the essential building blocks. 25 00:01:04.840 --> 00:01:05.840 Fundamentals, yeah. 26 00:01:05.760 --> 00:01:09.319 The fundamentals, and show you the journey developers took to create 27 00:01:09.319 --> 00:01:12.480 apps you could talk to, all without you needing to 28 00:01:12.519 --> 00:01:15.000 pore over the original technical manual yourself. 29 00:01:15.120 --> 00:01:15.599 Sounds good. 30 00:01:16.239 --> 00:01:19.599 We'll start with the absolute basics, speaking and listening, and 31 00:01:19.719 --> 00:01:23.120 build up from there, maybe getting into complex conversations and 32 00:01:23.200 --> 00:01:25.159 even early virtual assistants. 33 00:01:25.719 --> 00:01:28.519 Okay, let's unpack this. Starting at the very beginning seems right. 34 00:01:29.400 --> 00:01:32.359 Before a device can respond or have any kind of conversation, 35 00:01:32.840 --> 00:01:36.000 it first has to be able to speak. And hear, right. 36 00:01:36.560 --> 00:01:39.680 The fundamental capabilities Android provides for this are text to 37 00:01:39.719 --> 00:01:44.760 speech, TTS, and automated speech recognition, or ASR. 38 00:01:45.159 --> 00:01:48.680 And thinking about why these are important, well, it opens 39 00:01:48.719 --> 00:01:51.239 up so many possibilities. Imagine your hands are full, like 40 00:01:51.280 --> 00:01:52.920 you're driving and need directions. 41 00:01:53.000 --> 00:01:55.319 Oh yeah, it's simple then. Or, and 42 00:01:55.239 --> 00:01:58.400 this is crucial, for accessibility. Think about someone with a 43 00:01:58.439 --> 00:02:03.400 visual impairment using a screen reader, that's TTS, or ASR 44 00:02:03.480 --> 00:02:07.599 helping someone communicate if they have difficulty speaking, much like 45 00:02:07.640 --> 00:02:10.719 Stephen Hawking used speech synthesis technology. 46 00:02:10.879 --> 00:02:14.800 They're really foundational tools.
Okay, let's start with TTS, text 47 00:02:14.800 --> 00:02:19.080 to speech. Simply put, it turns written text into spoken audio. 48 00:02:19.000 --> 00:02:22.479 Right, and the technology works in stages. First, it needs 49 00:02:22.479 --> 00:02:25.120 to understand the text itself, things like, you know, how 50 00:02:25.159 --> 00:02:27.960 to pronounce words that look similar but sound different based 51 00:02:28.000 --> 00:02:29.360 on context. 52 00:02:28.919 --> 00:02:31.400 Like read versus read. Exactly. 53 00:02:31.360 --> 00:02:34.960 Or converting numbers and abbreviations into full words. The cool 54 00:02:35.000 --> 00:02:38.759 part for developers often is the system handles a lot 55 00:02:38.759 --> 00:02:40.960 of this linguistic complexity. You don't always have to get 56 00:02:41.000 --> 00:02:41.599 into the weeds. 57 00:02:41.639 --> 00:02:43.800 Okay, so it understands the text, then it has to 58 00:02:43.840 --> 00:02:45.960 generate the actual sound. Exactly. 59 00:02:46.400 --> 00:02:48.960 A common approach, especially around the time this book was written, 60 00:02:49.000 --> 00:02:51.000 was something called concatenative synthesis. 61 00:02:51.120 --> 00:02:52.960 Concatenative synthesis. Okay, okay. 62 00:02:53.080 --> 00:02:55.360 Think of it like building a sentence by stitching together 63 00:02:55.400 --> 00:03:00.280 tiny prerecorded pieces of speech. These could be sounds, syllables, words, 64 00:03:00.319 --> 00:03:01.280 even short phrases. 65 00:03:01.400 --> 00:03:03.759 Ah, like digital Lego bricks for voice. 66 00:03:04.000 --> 00:03:07.719 Kind of. Algorithms select the right pieces and join them smoothly, 67 00:03:07.879 --> 00:03:12.439 trying to mimic natural rhythm and intonation. When it's done well, 68 00:03:12.520 --> 00:03:14.360 it can sound remarkably natural. 69 00:03:14.680 --> 00:03:17.039 It's kind of amazing how they make those pieces fit together.
70 00:03:17.319 --> 00:03:20.240 But wait, if they're stitching together prerecorded bits, why 71 00:03:20.280 --> 00:03:22.599 not just record a human voice actor saying everything the 72 00:03:22.639 --> 00:03:23.680 app might need to say? 73 00:03:24.000 --> 00:03:27.439 That's a great point, and sometimes apps do use professional 74 00:03:27.479 --> 00:03:31.319 voice actors, you know, for specific prompts where quality and 75 00:03:31.400 --> 00:03:33.520 consistency are paramount. 76 00:03:33.039 --> 00:03:35.520 Like a standard greeting or instruction. Exactly. 77 00:03:36.120 --> 00:03:39.879 But TTS becomes absolutely essential when the text is dynamic, 78 00:03:40.360 --> 00:03:43.319 when you can't possibly prerecord everything it might need 79 00:03:43.360 --> 00:03:43.680 to say. 80 00:03:43.879 --> 00:03:46.199 Ah, right, like reading out a text message 81 00:03:45.840 --> 00:03:48.080 you just got, or a news headline that just updated, or 82 00:03:48.120 --> 00:03:49.840 someone's name from your contact 83 00:03:49.520 --> 00:03:52.759 list. Precisely, you just can't anticipate every single phrase or name. 84 00:03:53.080 --> 00:03:55.840 So while the quality might sometimes be a trade-off 85 00:03:55.879 --> 00:04:00.560 compared to, say, a perfectly recorded voice, TTS offers that 86 00:04:00.680 --> 00:04:03.159 vital flexibility for dynamic content. 87 00:04:02.919 --> 00:04:05.960 And Android has had this built in for ages, right? How 88 00:04:06.000 --> 00:04:07.800 do developers actually hook into it? 89 00:04:07.960 --> 00:04:12.199 Yeah, the capability has been there since Android 1.6, 90 00:04:12.240 --> 00:04:16.079 believe it or not. Developers use the framework provided. A key 91 00:04:16.120 --> 00:04:19.720 step is making sure the necessary language data is actually 92 00:04:19.720 --> 00:04:20.720 on the user's 93 00:04:20.399 --> 00:04:21.959 device. The voice itself?
94 00:04:22.040 --> 00:04:25.199 Yeah, the voice files, the rules for that language. The 95 00:04:25.199 --> 00:04:28.279 system lets developers check for this using something called an intent, 96 00:04:28.839 --> 00:04:31.199 and even prompt the user to install it if it's missing. 97 00:04:31.480 --> 00:04:31.879 Okay. 98 00:04:32.040 --> 00:04:35.800 The book suggests using a common software design pattern, a singleton, 99 00:04:36.040 --> 00:04:39.319 basically ensuring only one instance of the TTS engine is created. 100 00:04:39.480 --> 00:04:42.839 This helps manage resources efficiently. Smart. And the examples in 101 00:04:42.879 --> 00:04:45.040 the book show how you might use this to, say, 102 00:04:45.360 --> 00:04:47.680 read back text the user typed in, or maybe read 103 00:04:47.680 --> 00:04:50.560 text loaded from a file. You can specify the language too, 104 00:04:50.680 --> 00:04:52.439 like English or even regional variations. 105 00:04:52.600 --> 00:04:54.839 Okay, so that's how the device speaks. Now the other 106 00:04:54.959 --> 00:05:00.759 side, hearing us: automated speech recognition, ASR. This is turning 107 00:05:00.839 --> 00:05:02.720 our spoken words into text. 108 00:05:02.639 --> 00:05:06.600 Right, and like TTS, it involves steps. First, the device 109 00:05:06.639 --> 00:05:08.879 needs to capture the sound from the microphone and process it. 110 00:05:09.000 --> 00:05:10.120 Think of it as cleaning up 111 00:05:10.040 --> 00:05:11.959 the audio, getting rid of background noise. 112 00:05:11.920 --> 00:05:16.199 Yeah, removing noise, maybe echo, and just preparing it digitally 113 00:05:16.279 --> 00:05:17.120 for analysis. 114 00:05:17.240 --> 00:05:19.839 Then comes the recognition part itself. Yeah, breaking down the 115 00:05:19.879 --> 00:05:21.319 audio into.
116 00:05:22.480 --> 00:05:27.519 What, sounds? Basically, yeah, into tiny segments, phones, the basic 117 00:05:27.560 --> 00:05:30.399 sounds of the language, and then it tries to match them. 118 00:05:30.759 --> 00:05:31.240 Wow. 119 00:05:31.399 --> 00:05:34.759 This is where powerful statistical models come into play. These 120 00:05:34.800 --> 00:05:37.920 models are trained on massive amounts of recorded speech, learning 121 00:05:38.000 --> 00:05:41.079 how different sounds are typically pronounced in different contexts by 122 00:05:41.079 --> 00:05:41.759 different people. 123 00:05:41.920 --> 00:05:42.240 Wow. 124 00:05:42.360 --> 00:05:45.399 Okay, they build what's called an acoustic model. It's essentially 125 00:05:45.480 --> 00:05:48.680 a statistical map of how sounds relate to words. 126 00:05:48.920 --> 00:05:51.879 But words can sound exactly alike. You mentioned read and 127 00:05:51.959 --> 00:05:55.040 read, or to and two. How does it know the difference? 128 00:05:55.199 --> 00:05:58.800 Ah, good question. That's where another statistical model helps, the 129 00:05:58.879 --> 00:05:59.439 language model. 130 00:05:59.519 --> 00:06:00.000 Language model. 131 00:06:00.319 --> 00:06:03.920 This one understands the probability of words appearing together in sequence. 132 00:06:04.439 --> 00:06:08.519 It knows that after I went, the word to is 133 00:06:08.600 --> 00:06:09.959 far, far more likely than two. 134 00:06:10.360 --> 00:06:11.959 Right, context. Exactly. 135 00:06:12.240 --> 00:06:15.000 The language model provides that crucial context to help resolve 136 00:06:15.040 --> 00:06:16.000 those ambiguities. 137 00:06:16.279 --> 00:06:19.839 And the result isn't always just one single interpretation, is 138 00:06:19.920 --> 00:06:21.879 it? Like, it's not always certain? 139 00:06:22.120 --> 00:06:25.600 No, definitely not.
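The to/two disambiguation described above can be sketched with a toy bigram language model. This is only an illustration in Python, not the recognizer's actual implementation: the corpus, `train_bigrams`, and `pick_homophone` are all made up for the example, and a real language model is trained on vastly more text.

```python
from collections import Counter

# Tiny made-up corpus, purely for illustration.
CORPUS = [
    "i went to the shop",
    "i went to bed",
    "she went to work",
    "i bought two apples",
    "we saw two films",
]

def train_bigrams(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def pick_homophone(previous_word, candidates, bigrams):
    """Return the candidate seen most often right after previous_word."""
    return max(candidates, key=lambda word: bigrams[(previous_word, word)])

BIGRAMS = train_bigrams(CORPUS)
print(pick_homophone("went", ["two", "to"], BIGRAMS))
```

The acoustic model alone cannot separate words that sound identical, so the word history is what breaks the tie here.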
Typically, the ASR system gives you back 140 00:06:26.040 --> 00:06:29.439 a list of possible results, ranked by how confident it 141 00:06:29.560 --> 00:06:32.199 is in each one. A list? Yeah, it's often called 142 00:06:32.240 --> 00:06:36.399 an N-best list. Each possibility comes with a confidence score, 143 00:06:36.720 --> 00:06:39.920 usually from zero to one. A score near one means 144 00:06:39.920 --> 00:06:41.279 the system is pretty 145 00:06:40.959 --> 00:06:44.040 sure it got it right. And that's useful for the developer? 146 00:06:43.800 --> 00:06:46.680 Incredibly valuable. They can just pick the top result if 147 00:06:46.680 --> 00:06:49.800 the confidence is high, or if it's lower, or if 148 00:06:49.839 --> 00:06:51.879 the top one doesn't make sense in context, they can 149 00:06:51.920 --> 00:06:54.079 look at the others in the list. Or maybe even 150 00:06:54.199 --> 00:06:56.839 use the confidence score to decide, hmm, I'd better ask 151 00:06:56.879 --> 00:06:58.040 the user to confirm this. 152 00:06:58.040 --> 00:07:00.480 This capability has also been around on Android for a 153 00:07:00.519 --> 00:07:03.199 while, since version 2.1. Often when you tap 154 00:07:03.240 --> 00:07:05.319 the little microphone icon on the keyboard. 155 00:07:05.079 --> 00:07:08.480 Yes, exactly, and developers have flexibility here too. You can 156 00:07:08.560 --> 00:07:11.480 use a simple built-in tool, an intent that handles 157 00:07:11.480 --> 00:07:15.519 the speak now prompt and feedback automatically. Super easy. Quick 158 00:07:15.560 --> 00:07:18.920 and dirty? Pretty much. Or if you want more control 159 00:07:18.920 --> 00:07:21.920 over the look and feel, the user interface, you could 160 00:07:22.000 --> 00:07:25.680 use a more advanced component, a SpeechRecognizer instance.
This 161 00:07:25.800 --> 00:07:29.279 lets you manage the UI yourself and react to specific 162 00:07:29.319 --> 00:07:32.560 recognition events, like when the user starts or stops speaking. 163 00:07:32.759 --> 00:07:35.480 More control, more work. Typically, yeah. 164 00:07:35.600 --> 00:07:38.439 The book again suggests using a library approach here, like 165 00:07:38.480 --> 00:07:41.839 an ASRLib, just to keep the code organized and reusable. 166 00:07:42.079 --> 00:07:44.759 And you mentioned language models. Can you tell the system 167 00:07:45.279 --> 00:07:48.319 what kind of speech to expect, like am I dictating 168 00:07:48.319 --> 00:07:50.920 an email or just barking a search query? 169 00:07:51.160 --> 00:07:54.680 Exactly. You can specify different language models. There's one designed 170 00:07:54.759 --> 00:07:58.480 for free-form dictation, like long sentences, and another optimized 171 00:07:58.519 --> 00:08:00.879 for shorter phrases, like web search queries. 172 00:08:00.920 --> 00:08:01.399 Ah. 173 00:08:01.560 --> 00:08:03.920 The book does note, though, that even with these models, 174 00:08:03.920 --> 00:08:06.560 the input can still be quite open-ended, so the 175 00:08:06.600 --> 00:08:09.439 developer might need to do more processing afterwards to figure 176 00:08:09.480 --> 00:08:11.079 out the specific command or meaning. 177 00:08:11.560 --> 00:08:14.680 Oh, and because these systems often connect to cloud services 178 00:08:14.720 --> 00:08:15.879 for the heavy lifting. 179 00:08:15.600 --> 00:08:19.600 Right, the recognition part. Yeah, the app usually needs permission 180 00:08:19.600 --> 00:08:22.600 to access the Internet, and you need to handle potential 181 00:08:22.800 --> 00:08:27.120 errors like no speech detected, or no match found, or 182 00:08:27.160 --> 00:08:29.079 maybe a network problem. 183 00:08:29.120 --> 00:08:32.639 Got it. So we've got the building blocks.
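The way a developer might act on a ranked result list with confidence scores, as discussed earlier, can be sketched in a few lines. This is a language-neutral Python illustration rather than Android code: the `Hypothesis` class, the `decide` function, and both threshold values are assumptions chosen for the example, not values from the book or the platform.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # 0.0 to 1.0, higher means the recognizer is more certain

ACCEPT_THRESHOLD = 0.8   # assumed value, tune per application
CONFIRM_THRESHOLD = 0.4  # assumed value, tune per application

def decide(n_best):
    """Map a list of recognition hypotheses to an (action, text) pair."""
    if not n_best:
        return ("reprompt", None)  # nothing recognized, ask again
    best = max(n_best, key=lambda h: h.confidence)
    if best.confidence >= ACCEPT_THRESHOLD:
        return ("accept", best.text)   # confident enough to act directly
    if best.confidence >= CONFIRM_THRESHOLD:
        return ("confirm", best.text)  # ask the user "did you say ...?"
    return ("reprompt", None)          # too uncertain, ask the user to repeat

results = [Hypothesis("pizza places", 0.55), Hypothesis("pizza plays", 0.20)]
print(decide(results))
```

An empty list stands in for the no-speech and no-match errors mentioned above; a real app would also handle network failures separately.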
The device 184 00:08:32.679 --> 00:08:36.440 can speak, TTS, and it can listen and turn speech 185 00:08:36.519 --> 00:08:40.600 into text, ASR, even giving us a list of possibilities 186 00:08:40.600 --> 00:08:43.759 with confidence scores. How do we actually put those together 187 00:08:43.840 --> 00:08:45.200 to build simple interactions? 188 00:08:45.240 --> 00:08:48.240 That's the next logical step, right, moving from just hearing 189 00:08:48.320 --> 00:08:51.360 or speaking to creating a basic back and forth. Think 190 00:08:51.360 --> 00:08:53.320 about those early voice actions. 191 00:08:53.120 --> 00:08:54.639 Like on Google Now back in the 192 00:08:54.679 --> 00:08:57.080 day? Exactly. Telling your phone call Mom or go to 193 00:08:57.080 --> 00:09:01.080 Wikipedia dot org. These are structured commands, simple cause and 194 00:09:01.080 --> 00:09:04.360 effect. And they're built just by combining those core TTS 195 00:09:04.360 --> 00:09:06.399 and ASR capabilities we just talked about. 196 00:09:06.440 --> 00:09:09.559 Pretty much. The book provides some straightforward examples. One is 197 00:09:09.600 --> 00:09:10.759 an app called VoiceSearch. 198 00:09:10.879 --> 00:09:13.799 It just takes whatever you say, listens using ASR, right? 199 00:09:13.759 --> 00:09:16.360 Grabs the top result from that N-best list, the 200 00:09:16.360 --> 00:09:18.600 one with the highest confidence, assumes it's right, and 201 00:09:18.759 --> 00:09:22.080 immediately plugs it into a standard Android web search intent. 202 00:09:22.440 --> 00:09:24.679 Boom, search results appear. Very 203 00:09:24.639 --> 00:09:27.879 simple. Okay, but that immediately brings up a potential problem 204 00:09:27.919 --> 00:09:30.039 which you hinted at. What if the ASR got it 205 00:09:30.080 --> 00:09:34.080 wrong? Exactly. This seems particularly tricky in another example app.
206 00:09:34.080 --> 00:09:37.759 The book mentions VoiceLaunch, which tries to launch an 207 00:09:37.799 --> 00:09:41.720 installed application based on what the user says. Right, what 208 00:09:41.840 --> 00:09:44.559 if you don't say the exact app name, like maybe 209 00:09:44.559 --> 00:09:47.159 you say music player, but the app is actually called 210 00:09:47.159 --> 00:09:47.759 Play Music? 211 00:09:48.000 --> 00:09:50.559 This is where the idea of similarity measures comes in. 212 00:09:50.720 --> 00:09:53.960 It's a crucial concept. The app needs a way to 213 00:09:54.000 --> 00:09:57.399 compare what the user said to the actual names of 214 00:09:57.440 --> 00:09:59.960 the apps installed on the device to find the best 215 00:10:00.120 --> 00:10:02.600 match, even if it's not identical. 216 00:10:02.720 --> 00:10:04.240 How does it do that? Just check if the letters 217 00:10:04.279 --> 00:10:04.799 are similar? 218 00:10:05.000 --> 00:10:08.120 That's part of it, orthographic similarity, looking at the spelling. 219 00:10:08.440 --> 00:10:11.600 But crucially, it can also look at phonetic similarity. 220 00:10:11.679 --> 00:10:12.720 How words sound alike? 221 00:10:12.879 --> 00:10:15.080 Yes, so it could figure out that, I don't know, 222 00:10:15.320 --> 00:10:19.200 photos and fotos probably refer to the same thing, even 223 00:10:19.200 --> 00:10:20.200 if the spelling's different. 224 00:10:20.240 --> 00:10:21.000 Okay, that's clever. 225 00:10:21.320 --> 00:10:25.519 The book mentions using algorithms like Soundex for this phonetic comparison, 226 00:10:26.080 --> 00:10:29.720 although it notes the specific implementation they included was primarily 227 00:10:29.759 --> 00:10:34.200 tuned for English. The key thing is normalizing the input first, 228 00:10:34.399 --> 00:10:38.039 like removing spaces, making everything lowercase, before you do 229 00:10:38.120 --> 00:10:39.440 the comparison. Makes sense.
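To make the normalize-then-compare idea concrete, here is a minimal Python sketch of classic American Soundex plus a lookup over app names. It is not the book's implementation: `normalize`, `soundex`, `best_match`, and the sample app names are hypothetical, and the matching rule (identical codes only) is a simplifying assumption.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def normalize(text):
    """Lowercase the text and drop spaces, digits and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def soundex(word):
    """Return the four-character Soundex code of word, or '' if it has no letters."""
    word = normalize(word)
    if not word:
        return ""
    result = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]

def best_match(spoken, app_names):
    """Return the first app name whose Soundex code equals that of the spoken text."""
    target = soundex(spoken)
    for name in app_names:
        if soundex(name) == target:
            return name
    return None

print(best_match("play musik", ["Maps", "Play Music", "Camera"]))
```

Classic Soundex keeps the first letter as written, so two words that start with different letters always get different codes, and word order matters too. A fuller matcher would likely combine this with an orthographic measure such as edit distance.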
230 00:10:39.519 --> 00:10:43.879 Okay, so even with similarity measures, ASR isn't perfect. That 231 00:10:43.919 --> 00:10:47.120 potential for error means you often need to double-check 232 00:10:47.159 --> 00:10:49.919 with the user, right? Confirm things? Absolutely. 233 00:10:50.120 --> 00:10:53.879 Confirmation is vital for robust interaction. The book includes a 234 00:10:53.919 --> 00:10:58.039 simple example building on that VoiceSearch app. After recognizing something 235 00:10:58.159 --> 00:11:01.039 like pizza places, right, it might use TTS to ask, 236 00:11:01.120 --> 00:11:04.039 did you say pizza places? And then it uses ASR again, 237 00:11:04.480 --> 00:11:07.000 but this time listening specifically for a simple 238 00:11:06.799 --> 00:11:10.720 yes or no. Ah, constraining the expected input. Exactly. 239 00:11:11.039 --> 00:11:13.639 It's a basic but really important step, especially when you're 240 00:11:13.679 --> 00:11:16.879 dealing with single critical pieces of data before taking an action. 241 00:11:17.000 --> 00:11:20.480 So we can make the device speak, listen, perform these 242 00:11:20.519 --> 00:11:24.759 simple command-action pairs, handle some ambiguity with similarity, and 243 00:11:24.879 --> 00:11:28.960 even ask for basic yes-no confirmation. But these interactions 244 00:11:29.000 --> 00:11:31.720 still feel quite rigid. You know, you have to say 245 00:11:31.759 --> 00:11:34.000 things in a very specific way, or it's just one 246 00:11:34.039 --> 00:11:36.600 command at a time. How do you make the conversation 247 00:11:36.679 --> 00:11:40.320 more flexible, like guide the user through collecting multiple pieces 248 00:11:40.360 --> 00:11:40.879 of information? 249 00:11:41.039 --> 00:11:43.960 Okay, yeah, that takes us into the realm of more 250 00:11:44.000 --> 00:11:47.440 structured conversations, often called form-filling 251 00:11:47.159 --> 00:11:49.720 dialogue. Form filling, like on a website?
252 00:11:49.320 --> 00:11:52.480 Exactly, the same idea. The goal is to gather several 253 00:11:52.519 --> 00:11:55.919 distinct pieces of information from the user, one by one, 254 00:11:56.279 --> 00:11:58.879 but doing it through voice instead of text boxes and dropdowns. 255 00:11:58.960 --> 00:12:01.279 Okay, so, like booking a flight, it might ask what 256 00:12:01.320 --> 00:12:04.120 city are you flying from? Then once you answer, what 257 00:12:04.200 --> 00:12:05.559 is your destination? Exactly. 258 00:12:05.600 --> 00:12:07.480 And then maybe what date do you want to travel? 259 00:12:07.679 --> 00:12:10.080 To manage this, you need a system. You need a 260 00:12:10.120 --> 00:12:12.879 way to define the pieces of information you need. Think 261 00:12:12.879 --> 00:12:14.039 of these as slots to be 262 00:12:14.000 --> 00:12:15.840 filled. Like fields on a form? 263 00:12:15.799 --> 00:12:19.840 Precisely, and you need an algorithm, some logic that knows 264 00:12:19.879 --> 00:12:22.480 how to navigate the conversation to collect the info for 265 00:12:22.559 --> 00:12:24.759 each slot in some sensible order. 266 00:12:24.879 --> 00:12:27.440 The book points to something called VoiceXML as a kind 267 00:12:27.440 --> 00:12:28.320 of model for this. 268 00:12:28.600 --> 00:12:32.639 Yeah. VoiceXML is, or was, a W3C standard 269 00:12:32.639 --> 00:12:35.360 for defining these kinds of voice dialogues, often used in 270 00:12:35.559 --> 00:12:39.639 call center systems. It uses concepts like forms, which contain 271 00:12:39.759 --> 00:12:43.080 fields or slots. Each field has a prompt, which is 272 00:12:43.080 --> 00:12:44.759 what the system asks the user. 273 00:12:44.759 --> 00:12:46.360 What is your destination, right? 274 00:12:46.600 --> 00:12:50.200 And optionally, fields can have grammars associated with them, which 275 00:12:50.279 --> 00:12:53.240 constrain or help interpret what the user can say in response.
276 00:12:53.559 --> 00:12:56.559 So for a destination field, the grammar might only accept 277 00:12:56.600 --> 00:12:57.519 city names. 278 00:12:57.519 --> 00:13:01.360 Potentially, yes. And VoiceXML uses a concept called the Form 279 00:13:01.360 --> 00:13:05.480 Interpretation Algorithm, or FIA. It's basically the logic engine that 280 00:13:05.519 --> 00:13:08.399 steps through the form, asking for one piece of required 281 00:13:08.399 --> 00:13:11.639 information at a time until all the necessary slots are filled. 282 00:13:12.279 --> 00:13:15.279 The book uses a simplified subset of these ideas specifically 283 00:13:15.279 --> 00:13:16.600 for Android development. 284 00:13:16.559 --> 00:13:18.840 And there's a specific library in the book to help build this? 285 00:13:19.279 --> 00:13:23.360 Yes, a library called FormFillLib, containing classes to represent 286 00:13:23.399 --> 00:13:27.120 these forms and fields. It works by parsing XML files 287 00:13:27.360 --> 00:13:30.919 that the developer writes. These XML files define the structure 288 00:13:30.919 --> 00:13:33.879 of the conversation, what questions to ask, in what order, 289 00:13:34.039 --> 00:13:35.240 which fields are needed. 290 00:13:35.240 --> 00:13:38.960 So the conversation logic is separate from the main app code? Exactly. 291 00:13:39.240 --> 00:13:42.840 It uses standard Android tools like XmlPullParser, handled 292 00:13:42.879 --> 00:13:46.279 via another helper library, XMLLib, to read these definitions. 293 00:13:46.759 --> 00:13:50.120 Then a key piece called the DialogInterpreter class steps 294 00:13:50.120 --> 00:13:53.440 through this structure, triggering the right TTS prompt and listening 295 00:13:53.440 --> 00:13:55.679 for ASR responses to fill each field. 296 00:13:55.759 --> 00:13:59.039 Does it handle background tasks? Like, parsing might take time. 297 00:13:59.200 --> 00:14:02.480 Good point.
It's designed to do the potentially slow work, 298 00:14:02.600 --> 00:14:05.519 like parsing the XML or waiting for ASR, in the 299 00:14:05.559 --> 00:14:09.440 background using Android's AsyncTask, so the main app remains responsive. 300 00:14:09.759 --> 00:14:11.840 That separation of concerns is really nice. 301 00:14:11.879 --> 00:14:13.840 A great example used in the book is the 302 00:14:13.879 --> 00:14:15.120 MusicBrain app. What does that do? 303 00:14:15.519 --> 00:14:18.679 Right. The MusicBrain demo app uses this form-filling library. 304 00:14:18.840 --> 00:14:21.480 It guides the user through a voice dialogue asking for 305 00:14:21.559 --> 00:14:23.919 details like maybe a word that appears in an album 306 00:14:23.960 --> 00:14:25.399 title or a start and end 307 00:14:25.320 --> 00:14:27.879 date range, using that form structure? 308 00:14:27.559 --> 00:14:30.480 Exactly. Once it collects all the pieces of information it needs 309 00:14:30.480 --> 00:14:33.159 by filling the slots in its form, it uses that 310 00:14:33.159 --> 00:14:37.399 collected information to query the MusicBrainz web service, which 311 00:14:37.440 --> 00:14:39.440 is a big online music database. 312 00:14:39.600 --> 00:14:43.960 Ah. So it's combining the voice interface with external data, a 313 00:14:43.960 --> 00:14:47.480 mashup. Precisely. It shows how you can take the structured data 314 00:14:47.759 --> 00:14:51.320 gathered via voice and use it to interact with online services, 315 00:14:51.759 --> 00:14:55.279 retrieve information, process it, maybe filter or sort the 316 00:14:55.320 --> 00:14:59.159 results, like sorting albums by release date using helper classes, 317 00:14:59.240 --> 00:15:01.759 and then present that back to the user, perhaps speaking 318 00:15:01.799 --> 00:15:03.120 the results or showing them on screen.
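The slot-by-slot loop described above, where a form's fields are visited in order and each prompt is repeated until its slot is filled, can be sketched as follows. This is a simplified Python illustration of the general idea, not FormFillLib or a full VoiceXML Form Interpretation Algorithm: `Field`, `run_form`, and the prompt texts are hypothetical, and `speak` and `listen` stand in for the TTS and ASR calls.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    name: str
    prompt: str
    value: Optional[str] = None  # None means the slot is still empty

def run_form(fields, speak, listen):
    """Prompt for each unfilled field in order until every slot has a value."""
    while True:
        pending = next((f for f in fields if f.value is None), None)
        if pending is None:
            break  # every slot is filled, the form is complete
        speak(pending.prompt)
        answer = listen()
        if answer:  # an empty answer means nothing was recognized, so ask again
            pending.value = answer
    return {f.name: f.value for f in fields}

# Scripted stand-ins for TTS and ASR so the sketch runs without a device.
scripted_answers = iter(["", "love", "1990"])
spoken_prompts = []
form = [
    Field("title_word", "Which word should appear in the album title?"),
    Field("start_year", "From which year?"),
]
filled = run_form(form, spoken_prompts.append, lambda: next(scripted_answers))
print(filled)
```

A production loop would cap the number of retries per field and run the listening step off the main thread; both are left out here for brevity.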
319 00:15:03.360 --> 00:15:05.960 Okay, so form filling lets us manage these multi-step 320 00:15:06.080 --> 00:15:10.159 conversations to get structured data like album name and date range. 321 00:15:10.720 --> 00:15:13.559 But you mentioned that the ASR input within each step 322 00:15:13.639 --> 00:15:17.399 was still somewhat open-ended in these basic examples. How 323 00:15:17.399 --> 00:15:19.759 do we make the app understand more than just the 324 00:15:19.799 --> 00:15:21.960 words the user says? How do we get it to 325 00:15:22.080 --> 00:15:24.000 understand the meaning behind the words? 326 00:15:24.320 --> 00:15:27.720 Right. That's a critical step towards more intelligent interaction, and 327 00:15:27.799 --> 00:15:29.799 that's where grammars come in, leading us into the field 328 00:15:29.840 --> 00:15:32.240 of natural language understanding, or NLU. 329 00:15:32.600 --> 00:15:34.759 Grammars and NLU. Okay. 330 00:15:34.799 --> 00:15:38.519 Grammars are tools designed specifically to help the application interpret 331 00:15:38.600 --> 00:15:41.720 more complex user inputs. They help extract not just the 332 00:15:41.759 --> 00:15:45.919 sequence of words, but the underlying meaning and specific structured 333 00:15:45.960 --> 00:15:46.879 pieces of information. 334 00:15:47.480 --> 00:15:51.200 So going beyond just recognizing show me flights to London 335 00:15:51.639 --> 00:15:53.240 as a sequence of words. 336 00:15:52.919 --> 00:15:55.879 So understanding that the user's intent is to see flights 337 00:15:56.440 --> 00:15:58.320 and the destination parameter is London. 338 00:15:58.720 --> 00:16:01.159 Got it. How do you create these grammars? 339 00:16:01.639 --> 00:16:06.080 The book discusses two main approaches. First, there are handcrafted
341 00:16:07.639 --> 00:16:10.480 Exactly, you write them yourself, often in an XML format 342 00:16:10.519 --> 00:16:15.120 like SRGS Speech Recognition Grammar Specification, though the book uses 343 00:16:15.159 --> 00:16:18.759 its own simplified XML format. You define the structure of 344 00:16:18.799 --> 00:16:23.960 acceptable phrases using rules, items within rules, alternatives, optional parts, 345 00:16:24.000 --> 00:16:25.320 and links between different rules. 346 00:16:25.399 --> 00:16:27.759 Can you give an example for that flight query? 347 00:16:28.039 --> 00:16:31.000 Sure, you might have a top level rule like rule 348 00:16:31.080 --> 00:16:34.639 ID fine flight. Inside that you might have an item 349 00:16:34.679 --> 00:16:38.279 for the phrase show flights or fine flights. Then maybe 350 00:16:38.279 --> 00:16:41.840 an optional item repeat zero one for the word two, 351 00:16:42.320 --> 00:16:45.039 and then crucially a reference to another rule like ruler 352 00:16:45.080 --> 00:16:48.159 ref u r F hashtag city which defines all the 353 00:16:48.279 --> 00:16:49.559 valid city names. 354 00:16:49.279 --> 00:16:52.480 And the hashtag city rule would list London, Paris, New 355 00:16:52.559 --> 00:16:53.159 York right. 356 00:16:53.480 --> 00:16:56.240 And within those rules you can use special semantic tags. 357 00:16:56.600 --> 00:16:59.240 So next to the item London in the city rule, 358 00:16:59.440 --> 00:17:02.279 you might have a tag like tag out lhr tag. 359 00:17:02.879 --> 00:17:05.839 This tells the system if the user says London, don't 360 00:17:05.880 --> 00:17:09.519 just return the word London, return the airport code lhr AH. 361 00:17:09.680 --> 00:17:14.319 Extracting structured data directly based on the grammar match. That's powerful, 362 00:17:14.519 --> 00:17:15.240 very powerful. 
363 00:17:15.599 --> 00:17:18.039 But as you can imagine, writing these grammars to cover 364 00:17:18.119 --> 00:17:20.119 all the different ways a user might phrase 365 00:17:19.799 --> 00:17:22.400 something. Flights to London, show me London flights, I want 366 00:17:22.400 --> 00:17:23.559 a flight to London. Exactly. 367 00:17:23.720 --> 00:17:28.279 Designing handcrafted grammars for spontaneous, unpredictable speech is incredibly hard 368 00:17:28.559 --> 00:17:30.880 and very time-consuming. That's the big challenge. 369 00:17:30.920 --> 00:17:32.079 So what's the alternative? 370 00:17:32.319 --> 00:17:35.960 That's where the second type comes in. Statistical grammars, or 371 00:17:36.000 --> 00:17:38.119 more broadly, statistical NLU 372 00:17:37.799 --> 00:17:39.279 models. Learned from data? 373 00:17:39.519 --> 00:17:43.039 Yes, these aren't written by hand. They're trained on vast 374 00:17:43.039 --> 00:17:47.200 amounts of real-world language data using machine learning techniques. 375 00:17:47.119 --> 00:17:50.000 And the advantage is they can be much more flexible, handle 376 00:17:50.119 --> 00:17:52.279 variations you didn't explicitly code for? 377 00:17:52.839 --> 00:17:56.200 That's the key benefit. Because they work based on probabilities 378 00:17:56.200 --> 00:17:59.279 and patterns learned from how people actually speak, they can 379 00:17:59.319 --> 00:18:04.119 often handle more irregular wording, synonyms, even slightly ungrammatical 380 00:18:04.160 --> 00:18:07.359 inputs that would break a strict handcrafted grammar. 381 00:18:07.759 --> 00:18:08.759 What's the downside? 382 00:18:09.039 --> 00:18:11.440 The main one is they require huge data sets to 383 00:18:11.480 --> 00:18:14.759 train effectively, and access to these trained models often comes 384 00:18:14.880 --> 00:18:17.839 via cloud services.
The book mentions a service from a 385 00:18:17.839 --> 00:18:20.440 company called Maluuba as an example of a real-world 386 00:18:20.480 --> 00:18:23.119 statistical NLU system available around that time. 387 00:18:22.960 --> 00:18:25.240 And that kind of system tries to identify the core 388 00:18:25.279 --> 00:18:27.720 intention and the relevant details, the entities? 389 00:18:27.880 --> 00:18:30.960 Precisely. You give it a phrase like what's the weather 390 00:18:31.000 --> 00:18:34.559 in Belfast for tomorrow, and a statistical NLU system could 391 00:18:34.599 --> 00:18:38.920 analyze it and return something structured, like category is weather, action is 392 00:18:39.119 --> 00:18:43.519 check status, and the entities are location, Belfast, and date, tomorrow, 393 00:18:43.680 --> 00:18:46.640 maybe even resolving tomorrow to the actual calendar date. It's 394 00:18:46.640 --> 00:18:50.400 focused on extracting that core meaning, often regardless of the 395 00:18:50.440 --> 00:18:51.960 exact sentence structure used. 396 00:18:52.519 --> 00:18:55.039 Does the book include a library to help developers work 397 00:18:55.079 --> 00:18:56.400 with these different grammar types? 398 00:18:56.519 --> 00:19:00.200 It does, NLULib. It contains classes for handling 399 00:19:00.240 --> 00:19:04.359 those handcrafted grammars, parsing the XML definitions into Java objects, 400 00:19:04.680 --> 00:19:07.839 converting the rules into patterns, often using regular expressions behind 401 00:19:07.880 --> 00:19:10.839