WEBVTT 1 00:00:00.200 --> 00:00:03.399 Welcome to React Roundup, the podcast where we keep you 2 00:00:03.480 --> 00:00:07.080 updated on all things React related. This show is brought 3 00:00:07.080 --> 00:00:11.080 to you by Unvoid and Top End Devs. Unvoid provides 4 00:00:11.160 --> 00:00:14.839 high quality design and software development services on a client 5 00:00:14.880 --> 00:00:20.239 friendly business model. Unlike all other software agencies, Unvoid allows 6 00:00:20.239 --> 00:00:24.120 clients to only pay after the work is delivered and approved. 7 00:00:24.600 --> 00:00:28.079 Visit unvoid dot com to learn more, and reach out 8 00:00:28.199 --> 00:00:30.679 if you know a company that needs more professionals to 9 00:00:30.760 --> 00:00:36.000 help with design and software development. That's u n void 10 00:00:36.479 --> 00:00:39.840 dot com. And Top End Devs helps you stay up 11 00:00:39.840 --> 00:00:44.320 to date with cutting edge technologies like JavaScript, Ruby, Elixir, 12 00:00:44.520 --> 00:00:48.799 and AI. Visit topenddevs dot com to join their AI dev 13 00:00:48.880 --> 00:00:54.159 boot camp and weekly community meetups, and access expert tutorials. I'm 14 00:00:54.240 --> 00:00:58.079 Lucas Paganini, founder of Unvoid and host of this podcast. 15 00:00:58.439 --> 00:01:05.079 Thank you for tuning in. Let's jump into the episode. 16 00:01:07.079 --> 00:01:10.400 Hey everybody, and welcome to another episode of React Roundup. 17 00:01:10.560 --> 00:01:13.120 I am your host today, TJ VanToll, and with 18 00:01:13.239 --> 00:01:15.120 me on the panel, I have Paige Niedringhaus. 19 00:01:15.359 --> 00:01:18.400 Hey everyone. And our special guest today is actually a 20 00:01:18.439 --> 00:01:22.640 React Roundup returning champion. We have Ian Lavery here. Ian, 21 00:01:22.799 --> 00:01:23.680 welcome back to the show. 22 00:01:24.040 --> 00:01:25.239 Hey, thanks for having me back. 
23 00:01:25.319 --> 00:01:27.760 Yeah, so why don't you start, you know, for people, 24 00:01:28.000 --> 00:01:30.439 I think the last show, looking back, that show 25 00:01:30.480 --> 00:01:31.760 was about a year ago. We'll have to look up 26 00:01:31.799 --> 00:01:34.040 the episode number and toss it in the show notes. 27 00:01:34.079 --> 00:01:35.840 But it's been a while, so why don't you let 28 00:01:35.879 --> 00:01:38.120 people know who you are, what you do, your 29 00:01:38.120 --> 00:01:41.159 background, why you're famous, all those sorts of things. 30 00:01:41.560 --> 00:01:45.239 Yeah. So I work for a speech recognition company called 31 00:01:45.359 --> 00:01:50.400 Picovoice, and we're a developer focused company that tries 32 00:01:50.480 --> 00:01:54.480 to empower developers all over, on any platform, 33 00:01:54.719 --> 00:01:58.319 to bring voice to their platform. So we have a 34 00:01:58.359 --> 00:02:02.439 whole variety of different products that cover speech to text, 35 00:02:02.599 --> 00:02:05.719 voice activation, wake word, all that, and we just want 36 00:02:05.840 --> 00:02:08.759 everybody to have a voice on their platform. Besides that, 37 00:02:08.960 --> 00:02:13.240 I do, like, interactive media art, and I 38 00:02:13.319 --> 00:02:15.159 play bass in a couple of bands. 39 00:02:16.879 --> 00:02:19.560 That's awesome, not just one band but multiple. 40 00:02:20.599 --> 00:02:22.479 Yeah, I'm an overachiever, I 41 00:02:22.439 --> 00:02:26.680 guess. Well, cool. So Picovoice looks interesting. I remember 42 00:02:26.759 --> 00:02:29.560 us talking about it last time, but maybe you can 43 00:02:29.560 --> 00:02:31.639 give an overview of, like, how it works. Like if 44 00:02:31.639 --> 00:02:33.520 I use Picovoice, what am I, what 45 00:02:33.560 --> 00:02:35.280 am I getting? 
Am I getting a service that I 46 00:02:35.319 --> 00:02:38.000 can send like audio to you, and it comes back 47 00:02:38.039 --> 00:02:40.719 with the words like what other features maybe you could 48 00:02:40.759 --> 00:02:43.719 give us, like the rundown of everything. It does everything 49 00:02:43.759 --> 00:02:44.039 you do. 50 00:02:44.400 --> 00:02:46.960 Yeah. So the big thing with us is and our 51 00:02:47.039 --> 00:02:49.919 sort of thing that sets us apart from pretty much 52 00:02:49.960 --> 00:02:54.280 every other voice service is that we're entirely on device 53 00:02:54.680 --> 00:02:57.719 and so there is no there is no service. There's 54 00:02:57.759 --> 00:03:01.759 no cloud API that you're calling to send your audio to, 55 00:03:02.159 --> 00:03:05.439 which I mean, look look around. That's pretty much every 56 00:03:05.479 --> 00:03:09.280 single voice thing is just an API. So we're one 57 00:03:09.319 --> 00:03:12.199 of the only ones out there that is actually giving 58 00:03:12.240 --> 00:03:15.080 you the ability to hold on to your audio data 59 00:03:15.120 --> 00:03:17.919 and your user's audio data and process it on the 60 00:03:18.000 --> 00:03:21.919 device and return Again. We have like a variety of products. 61 00:03:21.960 --> 00:03:25.439 So we have like wakeword detection, where it's just like hey, 62 00:03:25.520 --> 00:03:28.879 Siri and okay Google. It's just all it's doing is 63 00:03:28.919 --> 00:03:31.639 sitting there processing frames of audio, waiting for you to 64 00:03:31.680 --> 00:03:34.039 say the thing, and then when it wakes up, it 65 00:03:34.080 --> 00:03:36.759 does the thing that you tell it to do. But 66 00:03:36.840 --> 00:03:40.639 we also have voice activity detection and which just basically 67 00:03:41.000 --> 00:03:45.199 peaks when it hears somebody talking. And obviously speech to text. 68 00:03:45.199 --> 00:03:49.680 Everyone wants speech to text, so auto transcription of voice. 
69 00:03:49.919 --> 00:03:52.319 Yeah, it's very cool. It's also like one of those 70 00:03:52.360 --> 00:03:56.159 problems that I feel like is it's becoming more commonplace. 71 00:03:56.199 --> 00:03:58.680 We have smart devices in our house. Our phones can 72 00:03:58.800 --> 00:04:00.439 listen to wake words and that sort of thing. But 73 00:04:00.479 --> 00:04:03.960 I still I'm still sort of fascinated by the underlying technology. 74 00:04:04.479 --> 00:04:06.879 Maybe you could just start give us like the world's 75 00:04:06.879 --> 00:04:09.360 simplest rundown of like how does how does it actually 76 00:04:09.479 --> 00:04:11.599 work on the back end? Like do you just have 77 00:04:11.879 --> 00:04:14.240 a whole bunch of like low level C code that 78 00:04:14.280 --> 00:04:18.000 looks for patterns in audio data or like, I don't know, 79 00:04:18.000 --> 00:04:20.120 we don't need it. Sounds like two hours, but I'm. 80 00:04:20.000 --> 00:04:23.000 Just no, it's a good question. So, I mean, basically, 81 00:04:23.079 --> 00:04:26.680 it's deep learning, right, It's it's it's machine learning. So 82 00:04:26.720 --> 00:04:29.839 we teach through machine learning. We teach a machine a 83 00:04:29.879 --> 00:04:32.879 statistical model of what a word sounds like, or what 84 00:04:32.959 --> 00:04:36.240 a series of sounds sounds like. So we basically take 85 00:04:36.519 --> 00:04:41.639 audio in our actual When we're teaching our machine, all 86 00:04:41.680 --> 00:04:44.759 we're doing is sending it frames of audio that are labeled, 87 00:04:44.920 --> 00:04:46.800 and we get it to remember them and like form 88 00:04:46.839 --> 00:04:50.319 a little statistical pattern, and then it for something like wakeword. 89 00:04:50.480 --> 00:04:53.800 It's just like, hey, remember this pattern of three things. 90 00:04:54.120 --> 00:04:56.720 Just remember that and say, hey, I think I saw it. 
91 00:04:57.240 --> 00:04:59.360 So it's a lot more complicated when you get into 92 00:04:59.399 --> 00:05:02.720 speech to text because not only are you teaching it 93 00:05:02.879 --> 00:05:06.759 every sound in the language, but you're also teaching it 94 00:05:06.920 --> 00:05:10.560 every word in the language, because then you're dealing with 95 00:05:10.759 --> 00:05:15.000 audio and writing, which are different things. I think people 96 00:05:15.240 --> 00:05:19.360 think language is a combination of those things, but really 97 00:05:19.560 --> 00:05:23.240 they're two entirely separate things. Like, there's the 98 00:05:23.319 --> 00:05:25.560 series of sounds you make with your mouth that other 99 00:05:25.600 --> 00:05:29.120 people understand, and then there's the symbols you write them 100 00:05:29.160 --> 00:05:34.279 down with and the grammar and punctuation and everything that 101 00:05:34.360 --> 00:05:36.879 you put into the written form, and they're different, so 102 00:05:36.920 --> 00:05:39.639 we actually have to treat them differently. But you'll see 103 00:05:39.680 --> 00:05:42.879 a lot of the big cloud providers out there. The 104 00:05:42.959 --> 00:05:46.120 reason they got it so right so fast is because 105 00:05:46.160 --> 00:05:49.360 they had such large machines in the cloud in order 106 00:05:49.439 --> 00:05:52.480 to do this, so sort of like it outpaced the 107 00:05:52.639 --> 00:05:57.800 actual progress of voice recognition, and now everything's kind of caught 108 00:05:57.879 --> 00:06:00.759 up and we can actually do it on device, which 109 00:06:00.839 --> 00:06:03.360 is a big win because, to be honest, we were 110 00:06:03.399 --> 00:06:07.079 like boiling the ocean for like a while doing speech 111 00:06:07.160 --> 00:06:09.040 to text, and now we can do it on like 112 00:06:09.079 --> 00:06:10.360 a microcontroller. 
113 00:06:10.439 --> 00:06:15.519 So if you're using something like Picovoice, is it 114 00:06:15.600 --> 00:06:17.800 something where you as a user have to train the 115 00:06:17.839 --> 00:06:21.879 models, or are the models already there? It's trained. It knows 116 00:06:22.319 --> 00:06:25.319 you're speaking English or it knows you're speaking Spanish, and 117 00:06:25.360 --> 00:06:27.720 it should be smart enough to be 118 00:06:27.759 --> 00:06:32.360 able to take that audio and translate it into the 119 00:06:32.439 --> 00:06:34.040 correct written words? 120 00:06:34.439 --> 00:06:37.160 Right. So, like for speech to text, for instance, we 121 00:06:37.279 --> 00:06:41.120 basically just have a general language model. 122 00:06:41.240 --> 00:06:45.360 We offer eight different languages, and you just give it 123 00:06:45.399 --> 00:06:48.079 the language you want and we'll understand that language. But 124 00:06:48.399 --> 00:06:51.879 we actually use this thing called transfer learning, and we 125 00:06:51.959 --> 00:06:55.920 have a website, Picovoice Console, where you can basically, 126 00:06:56.319 --> 00:06:59.160 we have sort of a general model, but then you 127 00:06:59.279 --> 00:07:03.360 sort of do train it yourself. Because for something like wake word, 128 00:07:03.519 --> 00:07:06.399 we have a model that understands a bunch of sounds 129 00:07:06.439 --> 00:07:08.360 in whatever language you give it, but then you want 130 00:07:08.360 --> 00:07:11.639 it to represent a certain series of sounds, like OK Google. 131 00:07:11.920 --> 00:07:15.800 So you literally type that into our console and 132 00:07:15.959 --> 00:07:18.800 hit train, and then it will pop out a model 133 00:07:18.839 --> 00:07:23.120 that understands that. 
So that's sort of the, when 134 00:07:23.120 --> 00:07:26.240 we say you train it, it's not like, oh, you 135 00:07:26.279 --> 00:07:29.079 have to go out and gather four thousand recordings of 136 00:07:29.120 --> 00:07:32.920 this word and, you know, submit it to something and 137 00:07:33.000 --> 00:07:36.680 watch statistics go and decide. No, no, it's just like, 138 00:07:37.000 --> 00:07:38.519 we did the hard work. 139 00:07:38.720 --> 00:07:40.800 I was gonna say, because by saying that, you're sort 140 00:07:40.800 --> 00:07:43.160 of implying that you went out and got four 141 00:07:43.199 --> 00:07:47.000 thousand recordings of these different words, right, or like... 142 00:07:47.240 --> 00:07:50.079 No, no. So the thing is, again, we've 143 00:07:50.079 --> 00:07:52.759 trained the general model, so it understands the sounds we 144 00:07:52.839 --> 00:07:55.639 need it to understand. You just tell us which sounds you 145 00:07:55.680 --> 00:07:58.480 want to form your wake word, and 146 00:07:58.519 --> 00:08:01.160 we pop out a model that just waits 147 00:08:01.199 --> 00:08:02.360 for that series of sounds. 148 00:08:02.519 --> 00:08:05.399 Interesting, because I would have guessed that your building of 149 00:08:05.399 --> 00:08:07.360 the model was to get a bunch of people to 150 00:08:07.399 --> 00:08:10.680 say, like, it almost seems, it kind of breaks my 151 00:08:10.720 --> 00:08:13.360 mind a little bit that that's possible, right, that you can 152 00:08:13.399 --> 00:08:14.759 sort of generalize. 153 00:08:14.600 --> 00:08:17.720 The old style, like... So I worked, I 154 00:08:17.759 --> 00:08:22.000 worked at a speech recognition company right out of college. 155 00:08:22.399 --> 00:08:25.959 And what we did, we had one of the early, 156 00:08:26.079 --> 00:08:29.879 early wake word engines, and what we would do is, 157 00:08:30.120 --> 00:08:32.600 it was all B2B, the company. We'd basically 
We basically 158 00:08:32.720 --> 00:08:36.279 enter a contract with the company that says, hey, we're 159 00:08:36.279 --> 00:08:39.600 going to go out and gather four thousand recordings of 160 00:08:39.679 --> 00:08:42.679 this wake word, and we're going to train it and 161 00:08:42.720 --> 00:08:46.240 then deliver you the model. And it was very formal, 162 00:08:46.720 --> 00:08:50.159 and that was basically state of the art at the time. 163 00:08:50.320 --> 00:08:54.639 But we're actually a bit past that now because we're 164 00:08:54.639 --> 00:08:58.399 able to use this concept of transfer learning to take 165 00:08:58.399 --> 00:09:00.720 a general model and just kind of pointed in the 166 00:09:00.759 --> 00:09:03.080 right direction. So we no longer need to do all 167 00:09:03.120 --> 00:09:06.639 that all that pounding the pavement asking for people to 168 00:09:06.720 --> 00:09:08.879 say a wake word, because that was a lot of 169 00:09:08.919 --> 00:09:11.879 work and it took months, like every time somebody signed 170 00:09:11.919 --> 00:09:14.240 a contract. And I know because I was running the 171 00:09:14.279 --> 00:09:18.039 crowdsourcing technology for that company, So I would have to 172 00:09:18.039 --> 00:09:22.039 post these jobs and these these people would record it 173 00:09:22.039 --> 00:09:24.279 on their on their like mobile device, and I'd have 174 00:09:24.320 --> 00:09:26.720 to go through all the recordings and like you know, 175 00:09:27.440 --> 00:09:30.279 some people would just yeah. Some people would just you know, 176 00:09:30.440 --> 00:09:34.200 speak their manifesto into the phone, and I'd be like, no, no, no, no. 
177 00:09:37.240 --> 00:09:40.279 So one thing that I'm curious about is, I'm 178 00:09:40.320 --> 00:09:44.639 assuming that when you would do these wake word gatherings, 179 00:09:45.200 --> 00:09:48.120 you would have to take into account accents, because I 180 00:09:48.159 --> 00:09:52.679 know that that is something that every automated assistant struggles with: 181 00:09:52.799 --> 00:09:57.840 English accents, Scottish accents, Caribbean accents, all speaking English, 182 00:09:57.879 --> 00:10:02.600 but all slightly differently. So is Picovoice able to 183 00:10:02.919 --> 00:10:05.519 account for that and be able to interpret, you know, 184 00:10:05.600 --> 00:10:10.559 a deep Southern accent versus maybe a New York or Boston accent? 185 00:10:11.639 --> 00:10:14.759 Yeah, so I mean, that's still a challenge for us. 186 00:10:14.799 --> 00:10:17.879 But I think the reason we're a bit more resilient 187 00:10:17.919 --> 00:10:20.759 to it is because we've trained this general model on, 188 00:10:21.000 --> 00:10:26.159 like, geez, like, ten, a hundred thousand hours of speech. It's 189 00:10:26.159 --> 00:10:29.919 heard all the accents, not all the accents, but 190 00:10:29.960 --> 00:10:34.639 it's heard a lot of variation, so it 191 00:10:34.799 --> 00:10:37.080 tends to be a bit more resilient. When I was 192 00:10:37.120 --> 00:10:39.639 doing the old style where we would get people to record, 193 00:10:40.000 --> 00:10:43.240 that was actually a lot less resilient to it because 194 00:10:43.399 --> 00:10:48.200 we only had like, you know, three hundred participants recording 195 00:10:48.279 --> 00:10:51.159 these wake words, and how much variety are you going 196 00:10:51.200 --> 00:10:54.399 to get between three hundred people? Like, not enough. But 
But 197 00:10:55.039 --> 00:10:57.720 when we train these general models, we have like tens 198 00:10:57.720 --> 00:11:01.600 of thousands of different speakers, maybe more, so we tend 199 00:11:01.639 --> 00:11:05.039 to be a lot more sensitive to the variations. But 200 00:11:05.039 --> 00:11:07.600 but it is, it is definitely a challenge because even 201 00:11:07.720 --> 00:11:11.200 us as humans, if you hear like a really thick 202 00:11:11.279 --> 00:11:14.120 accent that you're not used to, it can be confusing, 203 00:11:14.600 --> 00:11:18.559 like like we're we're not perfect either with it. So 204 00:11:18.840 --> 00:11:20.519 it's it's it's a challenge. 205 00:11:20.840 --> 00:11:24.159 So I think you so you added multiple language depart 206 00:11:24.240 --> 00:11:26.440 I believe that's new or at least newish from the 207 00:11:26.960 --> 00:11:31.679 last time we talk. So does that that like more generalizability, 208 00:11:31.720 --> 00:11:36.159 make that easier or I imagine there's still all sorts of 209 00:11:36.240 --> 00:11:37.480 challenges that go into that. 210 00:11:38.240 --> 00:11:41.480 Yeah, So when you when you actually work with a 211 00:11:41.519 --> 00:11:46.039 totally different language, that's basically starting over because accents is 212 00:11:46.080 --> 00:11:48.200 one thing you've already taught it the series of sounds 213 00:11:48.240 --> 00:11:51.879 in the language, and you're just looking for a combination 214 00:11:51.960 --> 00:11:54.399 of those sounds and those symbols. But when you move 215 00:11:54.440 --> 00:11:56.879 into a new language, there's a new set of symbols, 216 00:11:57.039 --> 00:12:00.240 and there's a new set of sounds. You know, there's 217 00:12:00.519 --> 00:12:03.919 everybody has an inventory. 
We call it a phonemic inventory, 218 00:12:04.320 --> 00:12:08.440 and it's basically a series of sounds that you hear 219 00:12:08.480 --> 00:12:12.279 in the language, and every language has a different phonemic inventory, 220 00:12:12.600 --> 00:12:15.840 and we need to train the machine to understand only 221 00:12:15.879 --> 00:12:19.360 that inventory of sounds and all the symbols that go 222 00:12:19.440 --> 00:12:21.879 into that. So when we start a new language, we 223 00:12:21.960 --> 00:12:24.399 have to do it completely from scratch. We have to 224 00:12:25.279 --> 00:12:29.559 get new data in that language, We need to get 225 00:12:29.879 --> 00:12:33.720 new text in that language, and we need to do 226 00:12:33.759 --> 00:12:37.919 our best to even understand the language enough to work 227 00:12:37.960 --> 00:12:41.399 with it because we need to listen to these recordings. 228 00:12:41.440 --> 00:12:44.200 We need to normalize the text we get and make 229 00:12:44.240 --> 00:12:46.879 sure it's not like full of symbols and stuff, but 230 00:12:47.120 --> 00:12:50.879 understand it enough so that we actually don't confuse the 231 00:12:50.919 --> 00:12:54.279 machine learning process, and that that could be a real challenge. 232 00:12:54.440 --> 00:12:55.360 It's a lot of work. 233 00:12:55.200 --> 00:12:59.320 Actually, so it's fascinating. 
Does that mean, like, when you 234 00:12:59.399 --> 00:13:01.480 kick off a new language, I feel like you almost need 235 00:13:01.519 --> 00:13:05.879 to have like a professional linguist on staff for almost 236 00:13:05.919 --> 00:13:08.840 each of these languages, right? Like, or do you like 237 00:13:08.879 --> 00:13:11.840 bring on somebody who's, like, you know, a world class, 238 00:13:12.039 --> 00:13:15.559 I don't know, a Spanish linguist to help, or like, 239 00:13:15.559 --> 00:13:17.240 how much of it are you able, like as 240 00:13:17.240 --> 00:13:20.559 a software developer, to sort of test on your own, 241 00:13:20.559 --> 00:13:22.519 and how much do you have to rely on a 242 00:13:22.600 --> 00:13:25.440 native speaker as the only person that can actually figure 243 00:13:25.480 --> 00:13:26.279 some of these things out? 244 00:13:26.679 --> 00:13:29.960 Yeah, so we do have, like, basically our machine 245 00:13:30.000 --> 00:13:33.799 learning team. They do have to be part linguist, like, 246 00:13:34.039 --> 00:13:38.200 because if you've studied languages, you at least understand the components, 247 00:13:38.639 --> 00:13:42.639 and basically every language is just a combination of the components. 248 00:13:43.799 --> 00:13:46.559 So they have a lot of expertise in that field 249 00:13:46.559 --> 00:13:50.600 to understand, when they approach a new language, how it works. 250 00:13:50.799 --> 00:13:54.840 But then that's not enough. So usually what we'll do 251 00:13:55.120 --> 00:13:58.720 is we'll get somebody, well, we will get a native speaker. 252 00:13:59.120 --> 00:14:04.159 Usually we'll basically hire somebody on a contract to work 253 00:14:04.200 --> 00:14:06.840 with us to help with the language, because you do 254 00:14:06.960 --> 00:14:10.960 need that expertise. 
Like, the fact is, even somebody who's 255 00:14:11.000 --> 00:14:13.840 like a language expert, if they sit down to an 256 00:14:14.000 --> 00:14:16.200 entirely new language, they're not going to be able to 257 00:14:16.720 --> 00:14:20.159 understand it enough to do the work that needs to 258 00:14:20.200 --> 00:14:23.519 be done to actually get it to a production ready state. 259 00:14:23.720 --> 00:14:26.759 So we often do need to get a native speaker 260 00:14:26.799 --> 00:14:30.519 in there to provide their input, and that will really 261 00:14:31.080 --> 00:14:33.679 speed the process along. We tried to do it without 262 00:14:33.879 --> 00:14:38.000 experts a couple times, and it's just like, you just 263 00:14:38.000 --> 00:14:41.440 don't get the performance and you spend a lot more time, 264 00:14:41.879 --> 00:14:44.519 you waste a lot more time, I should say. 265 00:14:44.759 --> 00:14:46.679 Sure. I mean, that makes a lot of sense. When you 266 00:14:46.720 --> 00:14:50.960 think about getting expertise in anything else, 267 00:14:51.080 --> 00:14:54.399 it will almost undoubtedly go much quicker if you have 268 00:14:54.480 --> 00:14:56.919 somebody who is proficient in whatever it is that you're 269 00:14:56.919 --> 00:14:57.519 trying to do. 270 00:14:58.000 --> 00:15:01.240 Yeah, well, they can recognize mistakes, grammar and stuff, the 271 00:15:01.320 --> 00:15:03.679 stuff that's really hard to pick up as a non 272 00:15:03.759 --> 00:15:04.480 native speaker. 273 00:15:04.960 --> 00:15:08.960 Yes. So what languages do you currently offer Picovoice for? 274 00:15:09.519 --> 00:15:13.320 So we have, I believe last year we announced we 275 00:15:13.360 --> 00:15:18.919 had Spanish, French, German, English, and then this year we 276 00:15:19.000 --> 00:15:25.480 added four new languages. We added Japanese, Korean, Portuguese, and Italian. 277 00:15:26.279 --> 00:15:27.600 Those are some tough ones. 
278 00:15:28.279 --> 00:15:31.559 Yeah, well, especially like when you get into the written 279 00:15:31.639 --> 00:15:37.919 forms of Korean and Japanese, they become very challenging. Like 280 00:15:38.519 --> 00:15:41.039 you know, we in English we have twenty six characters. 281 00:15:41.200 --> 00:15:45.279 Japanese has two alphabets of fifty six and then an 282 00:15:45.320 --> 00:15:53.399 additional alphabet of tens of thousands. So yeah, yeah, so 283 00:15:53.440 --> 00:15:56.679 that the text representation of that is really difficult. The 284 00:15:57.399 --> 00:16:01.080 actual spoken version of Japanese is a lot easier than 285 00:16:01.120 --> 00:16:05.519 English because Japanese has fifty six sounds, and they all 286 00:16:05.639 --> 00:16:10.279 map to a combination of characters. English mapping a combination 287 00:16:10.360 --> 00:16:15.960 of characters to the sound is incredibly difficult. Turns out 288 00:16:16.080 --> 00:16:18.440 we made some mistakes early on and we didn't really 289 00:16:18.480 --> 00:16:20.240 fix them. 290 00:16:20.600 --> 00:16:23.440 I mean, just thinking about the amount of spellings that 291 00:16:23.480 --> 00:16:27.399 we have for the same sounding word based on context, 292 00:16:27.519 --> 00:16:30.759 I can not even imagine how you would be able 293 00:16:30.759 --> 00:16:32.639 to figure that out for a transcript. 294 00:16:32.879 --> 00:16:35.399 And it's all exceptions in English. It's like, oh, yeah, 295 00:16:35.440 --> 00:16:38.720 it's this unless this or this unless this, and like 296 00:16:38.840 --> 00:16:41.879 here's three different reasons why this rule is wrong. 
297 00:16:41.960 --> 00:16:45.480 And yeah, you see this when you have like younger 298 00:16:45.559 --> 00:16:47.679 kids that are starting to write, and you look at 299 00:16:47.679 --> 00:16:50.679 their writing, because they don't know the exceptions yet, right, 300 00:16:50.879 --> 00:16:53.360 but they can speak it because they know. So you 301 00:16:53.480 --> 00:16:55.960 get, like, it's words you don't even think about too, 302 00:16:55.960 --> 00:16:58.840 because we internalize them so quickly. Because one of my 303 00:16:58.960 --> 00:17:02.320 kids spelled "because" wrong, and then you're like, oh, well, 304 00:17:02.320 --> 00:17:03.960 "because" is pretty easy. But then you think about it 305 00:17:04.000 --> 00:17:06.519 for like half a second and you realize, like, actually, 306 00:17:06.519 --> 00:17:09.119 the word "because" makes absolutely no sense, right? 307 00:17:09.759 --> 00:17:11.920 Like if you try and explain it, you suddenly find 308 00:17:11.960 --> 00:17:17.720 yourself going, "It just is what it is." Yes, just memorize it, yep. 309 00:17:19.039 --> 00:17:23.119 I mean, that's really fantastic that you have taken on 310 00:17:23.200 --> 00:17:27.559 and, it sounds like, gotten through some very difficult dialects. 311 00:17:27.960 --> 00:17:30.880 What are future languages that you hope to 312 00:17:30.920 --> 00:17:32.160 be able to process as well? 313 00:17:32.720 --> 00:17:36.559 So we're going, yeah, exactly. So we're going to try, 314 00:17:37.000 --> 00:17:39.759 next year, we're going to double our language count again, 315 00:17:39.960 --> 00:17:43.680 I think, and we're going to do Chinese, Vietnamese, 316 00:17:44.240 --> 00:17:49.440 what else? Dutch, I believe, Russian, Polish, I think. Yeah, 317 00:17:49.519 --> 00:17:52.480 I can't remember all of them. But you basically need 
But you basically need 318 00:17:52.799 --> 00:17:57.599 to be like a fully inclusive speech recognition company, you 319 00:17:57.680 --> 00:18:01.240 basically need like a bare minimum of like fifty languages. 320 00:18:01.559 --> 00:18:04.039 So like we're going to get to like twenty of 321 00:18:04.079 --> 00:18:07.279 the most popular and hold there for a while. Is 322 00:18:07.640 --> 00:18:11.559 kind of our plan because that covers a lot of people, 323 00:18:11.759 --> 00:18:15.519 Like that covers the majority of people, because because even 324 00:18:15.599 --> 00:18:19.119 in the cases where the people might not speak the language, 325 00:18:19.160 --> 00:18:20.960 they usually are like, oh but I speak this what 326 00:18:21.119 --> 00:18:24.359 this more popular language? But to really get up there, 327 00:18:24.440 --> 00:18:26.720 like I mean, you do need to get to like 328 00:18:26.880 --> 00:18:30.480 fifty or something. And I mean Google has like one 329 00:18:30.559 --> 00:18:34.359 hundred and fifty, so you know, it's it's kind of 330 00:18:34.400 --> 00:18:36.680 a never ending thing for us. 331 00:18:37.559 --> 00:18:39.400 How about Hindi that's a big one. 332 00:18:39.519 --> 00:18:41.960 Oh yeah, that's actually one of the other ones we're 333 00:18:41.960 --> 00:18:43.039 going to do next year. 334 00:18:43.200 --> 00:18:47.359 Yeah, so I guess I got to ask one last question. 335 00:18:47.400 --> 00:18:49.839 Are there any languages like you've come to hate, like 336 00:18:49.880 --> 00:18:52.480 because it was like very difficult or. 337 00:18:56.559 --> 00:19:00.920 It's funny how much you can hate your own language. No, actually, 338 00:19:01.079 --> 00:19:03.799 like seriously, English is the only Like I look at 339 00:19:03.839 --> 00:19:06.359 all other languages we've done, and I'm like, these are 340 00:19:06.400 --> 00:19:11.440 so much easier, like English is. 
Actually, it's just, it 341 00:19:11.519 --> 00:19:14.720 came out of a mess of languages. It was a 342 00:19:14.759 --> 00:19:18.799 lot of combinations that happened over time, and a lot 343 00:19:18.799 --> 00:19:21.680 of them happened during, like, you know, a lot of 344 00:19:21.720 --> 00:19:26.119 English developed during, like, illiteracy, and so there's like really 345 00:19:26.160 --> 00:19:29.279 interesting examples you can find of, like, stuff where it's 346 00:19:29.359 --> 00:19:32.240 just like, oh, yeah, this was just a mistake that happened, 347 00:19:32.960 --> 00:19:34.759 you know, two hundred years ago that they kept in. 348 00:19:35.079 --> 00:19:37.599 Or actually, I have a fun fact: the word "dumb." 349 00:19:37.799 --> 00:19:39.480 So you look at that, you're like, why does it 350 00:19:39.480 --> 00:19:42.480 have a "b" at the end? Apparently there 351 00:19:42.559 --> 00:19:46.519 was a time where the, like, ruling class of England 352 00:19:46.559 --> 00:19:49.160 was trying to make it harder to write English so 353 00:19:49.200 --> 00:19:52.400 that the peasantry couldn't, like, pick it up. And they 354 00:19:52.440 --> 00:19:56.519 literally just added some letters to the language here or there, 355 00:19:57.200 --> 00:19:59.480 and were like, this is the proper way to write it, 356 00:19:59.599 --> 00:20:02.480 just to confuse people. And we literally still 357 00:20:02.519 --> 00:20:05.680 have that to this day. So English is so weird. 358 00:20:06.680 --> 00:20:08.960 So that's why "knife" has a "k" in front of it. 359 00:20:09.200 --> 00:20:12.839 Yeah, yeah, like stuff like that. I think they 360 00:20:12.839 --> 00:20:14.720 were just messing with us, and now we're just like, 361 00:20:15.000 --> 00:20:16.000 we have to live with that. 
362 00:20:17.240 --> 00:20:19.359 So I want to pivot a little bit and talk 363 00:20:19.400 --> 00:20:22.480 about the actual web development, like the side where you 364 00:20:22.519 --> 00:20:25.680 might actually use a service like this, because I remember 365 00:20:25.759 --> 00:20:27.519 last time we chatted a little bit too about 366 00:20:27.720 --> 00:20:30.960 common use cases, right? So maybe we could just start 367 00:20:31.039 --> 00:20:32.880 with a review, like, we have a lot of 368 00:20:32.880 --> 00:20:36.240 web developers who listen to this show. What do you think, like, 369 00:20:36.400 --> 00:20:39.240 I guess, A, what would using something like this look like? 370 00:20:39.240 --> 00:20:40.759 Like how do you actually get it in an app? 371 00:20:41.039 --> 00:20:43.839 And B, I guess, like, what are some common use 372 00:20:43.880 --> 00:20:46.599 cases that you see for use on the web as well? 373 00:20:46.799 --> 00:20:49.400 Right. So one of the big things is, obviously, on 374 00:20:49.440 --> 00:20:51.599 the web, people are a lot more comfortable calling like 375 00:20:51.880 --> 00:20:55.920 an API, and that is what they've come to expect 376 00:20:55.960 --> 00:21:00.240 for speech recognition and stuff. But we're actually, 377 00:21:00.319 --> 00:21:03.000 we're actually kind of bringing back the power of 378 00:21:03.039 --> 00:21:06.160 the browser itself. So, I mean, the browser is 379 00:21:06.160 --> 00:21:09.039 a virtual environment that can run whatever you want. And 380 00:21:09.319 --> 00:21:12.640 we actually can run entirely in the browser on the 381 00:21:12.680 --> 00:21:16.599 client side. And that's big because, I mean, 382 00:21:16.640 --> 00:21:20.599 these days, we're getting a lot of progressive web apps, 383 00:21:21.160 --> 00:21:24.119 and the sort of web app is a big thing, 384 00:21:24.240 --> 00:21:27.079 especially with like SaaS companies and stuff. 
So if you're 385 00:21:27.119 --> 00:21:30.160 running like a SaaS company and you want to 386 00:21:30.200 --> 00:21:34.720 integrate, like, voice into your console or something, having it 387 00:21:35.000 --> 00:21:39.359 on the client side, I mean, it lowers 388 00:21:39.400 --> 00:21:42.160 the latency, it gives you a lot more direct control 389 00:21:42.319 --> 00:21:46.440 of what happens when you get voice, and it means 390 00:21:46.519 --> 00:21:50.119 you can be robust to connection issues, which, 391 00:21:50.119 --> 00:21:52.599 you know, that's a huge thing. Not everyone has 392 00:21:53.000 --> 00:21:55.519 amazing Internet, and you don't want to have to be 393 00:21:55.880 --> 00:21:58.839 making calls out to an API and just hoping 394 00:21:58.880 --> 00:22:01.079 it comes back for your feature to work. This 395 00:22:01.440 --> 00:22:03.519 will just work. And also, on top of all that, 396 00:22:03.839 --> 00:22:08.319 it's less expensive because we're not calling an API, 397 00:22:08.440 --> 00:22:13.480 we're not depending on cloud infrastructure. So you're actually, if 398 00:22:13.559 --> 00:22:17.279 you're a developer and you integrate Picovoice into your 399 00:22:17.480 --> 00:22:19.920 web app, your client is going to be using their 400 00:22:19.960 --> 00:22:23.160 machine to do the processing. So I think it's just 401 00:22:23.240 --> 00:22:26.000