WEBVTT

1
00:00:00.200 --> 00:00:03.399
<v Speaker 1>Welcome to React Roundup, the podcast where we keep you

2
00:00:03.480 --> 00:00:07.080
<v Speaker 1>updated on all things React related. This show is brought

3
00:00:07.080 --> 00:00:11.080
<v Speaker 1>to you by Unvoid and Top End Devs. Unvoid provides

4
00:00:11.160 --> 00:00:14.839
<v Speaker 1>high quality design and software development services on a client

5
00:00:14.880 --> 00:00:20.239
<v Speaker 1>friendly business model. Unlike all other software agencies, Unvoid allows

6
00:00:20.239 --> 00:00:24.120
<v Speaker 1>clients to only pay after the work is delivered and approved.

7
00:00:24.600 --> 00:00:28.079
<v Speaker 1>Visit unvoid dot com to learn more and reach out

8
00:00:28.199 --> 00:00:30.679
<v Speaker 1>if you know a company that needs more professionals to

9
00:00:30.760 --> 00:00:36.000
<v Speaker 1>help with design and software development. That's u n void

10
00:00:36.479 --> 00:00:39.840
<v Speaker 1>dot com. And Top End Devs helps you stay up

11
00:00:39.840 --> 00:00:44.320
<v Speaker 1>to date with cutting edge technologies like JavaScript, Ruby, Elixir,

12
00:00:44.520 --> 00:00:48.799
<v Speaker 1>and AI. Visit topenddevs dot com to join their AI dev

13
00:00:48.880 --> 00:00:54.159
<v Speaker 1>boot camp, weekly community meetups and access expert tutorials. I'm

14
00:00:54.240 --> 00:00:58.079
<v Speaker 1>Lucas Paganini, founder of Unvoid and host of this podcast.

15
00:00:58.439 --> 00:01:05.079
<v Speaker 1>Thank you for tuning in. Let's jump into the episode.

16
00:01:07.079 --> 00:01:10.400
<v Speaker 2>Hey everybody, and welcome to another episode of React Roundup.

17
00:01:10.560 --> 00:01:13.120
<v Speaker 2>I am your host today, TJ VanToll, and with

18
00:01:13.239 --> 00:01:15.120
<v Speaker 2>me on the panel, I have Paige Niedringhaus.

19
00:01:15.359 --> 00:01:18.400
<v Speaker 2>Hey everyone. And our special guest today is actually a

20
00:01:18.439 --> 00:01:22.640
<v Speaker 2>React Roundup returning champion. We have Ian Lavery here. Ian.

21
00:01:22.799 --> 00:01:23.680
<v Speaker 2>Welcome back to the show.

22
00:01:24.040 --> 00:01:25.239
<v Speaker 3>Hey, thanks for having me back.

23
00:01:25.319 --> 00:01:27.760
<v Speaker 2>Yeah, so why don't you start, you know, for people,

24
00:01:28.000 --> 00:01:30.439
<v Speaker 2>I think the show, we were looking back, the show

25
00:01:30.480 --> 00:01:31.760
<v Speaker 2>was about a year ago. We'll have to look up

26
00:01:31.799 --> 00:01:34.040
<v Speaker 2>the episode number and toss it in the show notes.

27
00:01:34.079 --> 00:01:35.840
<v Speaker 2>But it's been a while, so why don't you tell

28
00:01:35.879 --> 00:01:38.120
<v Speaker 2>people who you are, what you do, your

29
00:01:38.120 --> 00:01:41.159
<v Speaker 2>background, why you're famous, all those sorts of things.

30
00:01:41.560 --> 00:01:45.239
<v Speaker 3>Yeah. So I work for a speech recognition company called

31
00:01:45.359 --> 00:01:50.400
<v Speaker 3>Picovoice, and we're a developer-focused company that tries

32
00:01:50.480 --> 00:01:54.480
<v Speaker 3>to power developers all over on any platform to have

33
00:01:54.719 --> 00:01:58.319
<v Speaker 3>to bring voice to their platform. So we have a

34
00:01:58.359 --> 00:02:02.439
<v Speaker 3>whole variety of different products that cover speech to text,

35
00:02:02.599 --> 00:02:05.719
<v Speaker 3>voice activation, wake word, all that, and we just want

36
00:02:05.840 --> 00:02:08.759
<v Speaker 3>everybody to have a voice on their platform. Besides that,

37
00:02:08.960 --> 00:02:13.240
<v Speaker 3>I'm a, I do like interactive media art and I

38
00:02:13.319 --> 00:02:15.159
<v Speaker 3>play bass in a couple of bands.

39
00:02:16.879 --> 00:02:19.560
<v Speaker 4>That's awesome, not just one band but multiple.

40
00:02:20.599 --> 00:02:22.479
<v Speaker 3>Yeah, I'm an overachiever, I guess.

41
00:02:22.439 --> 00:02:26.680
<v Speaker 2>Well, cool. So Picovoice looks interesting. I remember

42
00:02:26.759 --> 00:02:29.560
<v Speaker 2>us talking about it last time, but maybe you can

43
00:02:29.560 --> 00:02:31.639
<v Speaker 2>give an overview of like how it works. Like if

44
00:02:31.639 --> 00:02:33.520
<v Speaker 2>I, if I use Picovoice, what am I, what

45
00:02:33.560 --> 00:02:35.280
<v Speaker 2>am I getting? Am I getting a service that I

46
00:02:35.319 --> 00:02:38.000
<v Speaker 2>can send like audio to you, and it comes back

47
00:02:38.039 --> 00:02:40.719
<v Speaker 2>with the words? Like, what other features? Maybe you could

48
00:02:40.759 --> 00:02:43.719
<v Speaker 2>give us, like, the rundown of everything it does, everything

49
00:02:43.759 --> 00:02:44.039
<v Speaker 2>you do.

50
00:02:44.400 --> 00:02:46.960
<v Speaker 3>Yeah. So the big thing with us is and our

51
00:02:47.039 --> 00:02:49.919
<v Speaker 3>sort of thing that sets us apart from pretty much

52
00:02:49.960 --> 00:02:54.280
<v Speaker 3>every other voice service is that we're entirely on device

53
00:02:54.680 --> 00:02:57.719
<v Speaker 3>and so there is no there is no service. There's

54
00:02:57.759 --> 00:03:01.759
<v Speaker 3>no cloud API that you're calling to send your audio to,

55
00:03:02.159 --> 00:03:05.439
<v Speaker 3>which I mean, look look around. That's pretty much every

56
00:03:05.479 --> 00:03:09.280
<v Speaker 3>single voice thing is just an API. So we're one

57
00:03:09.319 --> 00:03:12.199
<v Speaker 3>of the only ones out there that is actually giving

58
00:03:12.240 --> 00:03:15.080
<v Speaker 3>you the ability to hold on to your audio data

59
00:03:15.120 --> 00:03:17.919
<v Speaker 3>and your user's audio data and process it on the

60
00:03:18.000 --> 00:03:21.919
<v Speaker 3>device and return. Again, we have like a variety of products.

61
00:03:21.960 --> 00:03:25.439
<v Speaker 3>So we have like wake word detection, where it's just like Hey

62
00:03:25.520 --> 00:03:28.879
<v Speaker 3>Siri and OK Google. It's just, all it's doing is

63
00:03:28.919 --> 00:03:31.639
<v Speaker 3>sitting there processing frames of audio, waiting for you to

64
00:03:31.680 --> 00:03:34.039
<v Speaker 3>say the thing, and then when it wakes up, it

65
00:03:34.080 --> 00:03:36.759
<v Speaker 3>does the thing that you tell it to do. But

66
00:03:36.840 --> 00:03:40.639
<v Speaker 3>we also have voice activity detection, which just basically

67
00:03:41.000 --> 00:03:45.199
<v Speaker 3>peaks when it hears somebody talking. And obviously speech to text.

68
00:03:45.199 --> 00:03:49.680
<v Speaker 3>Everyone wants speech to text, so auto transcription of voice.

69
00:03:49.919 --> 00:03:52.319
<v Speaker 2>Yeah, it's very cool. It's also like one of those

70
00:03:52.360 --> 00:03:56.159
<v Speaker 2>problems that I feel like is it's becoming more commonplace.

71
00:03:56.199 --> 00:03:58.680
<v Speaker 2>We have smart devices in our house. Our phones can

72
00:03:58.800 --> 00:04:00.439
<v Speaker 2>listen to wake words and that sort of thing. But

73
00:04:00.479 --> 00:04:03.960
<v Speaker 2>I still I'm still sort of fascinated by the underlying technology.

74
00:04:04.479 --> 00:04:06.879
<v Speaker 2>Maybe you could just start give us like the world's

75
00:04:06.879 --> 00:04:09.360
<v Speaker 2>simplest rundown of like how does how does it actually

76
00:04:09.479 --> 00:04:11.599
<v Speaker 2>work on the back end? Like do you just have

77
00:04:11.879 --> 00:04:14.240
<v Speaker 2>a whole bunch of like low level C code that

78
00:04:14.280 --> 00:04:18.000
<v Speaker 2>looks for patterns in audio data or like, I don't know,

79
00:04:18.000 --> 00:04:20.120
<v Speaker 2>we don't need, it sounds like two hours, but I'm just...

80
00:04:20.000 --> 00:04:23.000
<v Speaker 3>No, it's a good question. So, I mean, basically,

81
00:04:23.079 --> 00:04:26.680
<v Speaker 3>it's deep learning, right? It's, it's machine learning. So

82
00:04:26.720 --> 00:04:29.839
<v Speaker 3>we teach through machine learning. We teach a machine a

83
00:04:29.879 --> 00:04:32.879
<v Speaker 3>statistical model of what a word sounds like, or what

84
00:04:32.959 --> 00:04:36.240
<v Speaker 3>a series of sounds sounds like. So we basically take

85
00:04:36.519 --> 00:04:41.639
<v Speaker 3>audio in our actual... when we're teaching our machine, all

86
00:04:41.680 --> 00:04:44.759
<v Speaker 3>we're doing is sending it frames of audio that are labeled,

87
00:04:44.920 --> 00:04:46.800
<v Speaker 3>and we get it to remember them and like form

88
00:04:46.839 --> 00:04:50.319
<v Speaker 3>a little statistical pattern, and then, for something like wake word,

89
00:04:50.480 --> 00:04:53.800
<v Speaker 3>it's just like, hey, remember this pattern of three things.

90
00:04:54.120 --> 00:04:56.720
<v Speaker 3>Just remember that and say, hey, I think I saw it.

91
00:04:57.240 --> 00:04:59.360
<v Speaker 3>So it's a lot more complicated when you get into

92
00:04:59.399 --> 00:05:02.720
<v Speaker 3>speech to text because not only are you teaching it

93
00:05:02.879 --> 00:05:06.759
<v Speaker 3>every sound in the language, but you're also teaching it

94
00:05:06.920 --> 00:05:10.560
<v Speaker 3>every word in the language, because then you're dealing with

95
00:05:10.759 --> 00:05:15.000
<v Speaker 3>audio and writing, which are different things. I think people

96
00:05:15.240 --> 00:05:19.360
<v Speaker 3>think language is a combination of those things, but really

97
00:05:19.560 --> 00:05:23.240
<v Speaker 3>they're two entirely separate things. They're like that there's the

98
00:05:23.319 --> 00:05:25.560
<v Speaker 3>series of sounds you make with your mouth that other

99
00:05:25.600 --> 00:05:29.120
<v Speaker 3>people understand, and then there's the symbols you write them

100
00:05:29.160 --> 00:05:34.279
<v Speaker 3>down with and the grammar and punctuation and everything that

101
00:05:34.360 --> 00:05:36.879
<v Speaker 3>you put into the written form, and they're different, so

102
00:05:36.920 --> 00:05:39.639
<v Speaker 3>we actually have to treat them differently. But you'll see

103
00:05:39.680 --> 00:05:42.879
<v Speaker 3>a lot of the big cloud providers out there. The

104
00:05:42.959 --> 00:05:46.120
<v Speaker 3>reason they got it so right so fast is because

105
00:05:46.160 --> 00:05:49.360
<v Speaker 3>they had such large machines in the cloud in order

106
00:05:49.439 --> 00:05:52.480
<v Speaker 3>to do this, so sort of like it outpaced the

107
00:05:52.639 --> 00:05:57.800
<v Speaker 3>actual progress of voice recognition, and now everything's kind of caught

108
00:05:57.879 --> 00:06:00.759
<v Speaker 3>up and we can actually do it on device, which

109
00:06:00.839 --> 00:06:03.360
<v Speaker 3>is a big win because, to be honest, we were

110
00:06:03.399 --> 00:06:07.079
<v Speaker 3>like boiling the ocean for like a while doing speech

111
00:06:07.160 --> 00:06:09.040
<v Speaker 3>to text, and now we can do it on like

112
00:06:09.079 --> 00:06:10.360
<v Speaker 3>a microcontroller.

113
00:06:10.439 --> 00:06:15.519
<v Speaker 4>So if you're using something like Picovoice, is it

114
00:06:15.600 --> 00:06:17.800
<v Speaker 4>something that you as a user have to train the

115
00:06:17.839 --> 00:06:21.879
<v Speaker 4>models, or the model's already there, it's trained, it knows

116
00:06:22.319 --> 00:06:25.319
<v Speaker 4>you're speaking English or it knows you're speaking Spanish, and

117
00:06:25.360 --> 00:06:27.720
<v Speaker 4>it will just it should be smart enough to be

118
00:06:27.759 --> 00:06:32.360
<v Speaker 4>able to take that audio and translate it into the

119
00:06:32.439 --> 00:06:34.040
<v Speaker 4>correct written words.

120
00:06:34.439 --> 00:06:37.160
<v Speaker 3>Right, So, like for speech to text, for instance, we

121
00:06:37.279 --> 00:06:41.120
<v Speaker 3>basically just have a general language model. You just give it.

122
00:06:41.240 --> 00:06:45.360
<v Speaker 3>We offer eight different languages, and you just give it

123
00:06:45.399 --> 00:06:48.079
<v Speaker 3>the language you want and we'll understand that language. But

124
00:06:48.399 --> 00:06:51.879
<v Speaker 3>we actually use this thing called transfer learning, and we

125
00:06:51.959 --> 00:06:55.920
<v Speaker 3>have a website, Picovoice Console, where you can basically

126
00:06:56.319 --> 00:06:59.160
<v Speaker 3>we have sort of a general model, but then you

127
00:06:59.279 --> 00:07:03.360
<v Speaker 3>sort of do train it yourself. Because for something like wake word,

128
00:07:03.519 --> 00:07:06.399
<v Speaker 3>we have a model that understands a bunch of sounds

129
00:07:06.439 --> 00:07:08.360
<v Speaker 3>in whatever language you give it, but then you want

130
00:07:08.360 --> 00:07:11.639
<v Speaker 3>it to represent a certain series of sounds like okay, Google,

131
00:07:11.920 --> 00:07:15.800
<v Speaker 3>So you literally type that in to our console and

132
00:07:15.959 --> 00:07:18.800
<v Speaker 3>hit train, and then it will pop out a model

133
00:07:18.839 --> 00:07:23.120
<v Speaker 3>that understands that. So that's that's sort of the when

134
00:07:23.120 --> 00:07:26.240
<v Speaker 3>we say you train it, it's not like, oh, you

135
00:07:26.279 --> 00:07:29.079
<v Speaker 3>have to go out and gather four thousand recordings of

136
00:07:29.120 --> 00:07:32.920
<v Speaker 3>this word and you know, submit it to something and

137
00:07:33.000 --> 00:07:36.680
<v Speaker 3>watch statistics go and decide. No. No, it's just like

138
00:07:37.000 --> 00:07:38.519
<v Speaker 3>we, we did the hard work.

139
00:07:38.720 --> 00:07:40.800
<v Speaker 2>I was gonna say, because by saying that, you're sort

140
00:07:40.800 --> 00:07:43.160
<v Speaker 2>of implying that you went out and have four

141
00:07:43.199 --> 00:07:47.000
<v Speaker 2>thousand recordings of these different words, right? Or like...

142
00:07:47.240 --> 00:07:50.079
<v Speaker 3>No, No. So the thing is, it's again we've we've

143
00:07:50.079 --> 00:07:52.759
<v Speaker 3>trained the general model, so it understands the sounds we

144
00:07:52.839 --> 00:07:55.639
<v Speaker 3>need it to understand. You just tell us which sounds you

145
00:07:55.680 --> 00:07:58.480
<v Speaker 3>want, you want to form your wake word, and

146
00:07:58.519 --> 00:08:01.160
<v Speaker 3>we pop out a model that, that just waits

147
00:08:01.199 --> 00:08:02.360
<v Speaker 3>for that series of sounds.

148
00:08:02.519 --> 00:08:05.399
<v Speaker 2>Interesting because I would have guessed that your building of

149
00:08:05.399 --> 00:08:07.360
<v Speaker 2>the model was to get a bunch of people to

150
00:08:07.399 --> 00:08:10.680
<v Speaker 2>say like it almost seems it kind of breaks my

151
00:08:10.720 --> 00:08:13.360
<v Speaker 2>mind a little bit that's possible, right, that you can

152
00:08:13.399 --> 00:08:14.759
<v Speaker 2>sort of general...

153
00:08:14.600 --> 00:08:17.720
<v Speaker 3>That's the old style, the like... So I worked, I

154
00:08:17.759 --> 00:08:22.000
<v Speaker 3>worked at a speech recognition company right out of college.

155
00:08:22.399 --> 00:08:25.959
<v Speaker 3>And what we did we had one of the early

156
00:08:26.079 --> 00:08:29.879
<v Speaker 3>early wake word engines, and what we would do is we'd

157
00:08:30.120 --> 00:08:32.600
<v Speaker 3>it was all B2B, the company. We'd basically

158
00:08:32.720 --> 00:08:36.279
<v Speaker 3>enter a contract with the company that says, hey, we're

159
00:08:36.279 --> 00:08:39.600
<v Speaker 3>going to go out and gather four thousand recordings of

160
00:08:39.679 --> 00:08:42.679
<v Speaker 3>this wake word, and we're going to train it and

161
00:08:42.720 --> 00:08:46.240
<v Speaker 3>then deliver you the model. And it was very formal,

162
00:08:46.720 --> 00:08:50.159
<v Speaker 3>and that was basically state of the art at the time.

163
00:08:50.320 --> 00:08:54.639
<v Speaker 3>But we're actually a bit past that now because we're

164
00:08:54.639 --> 00:08:58.399
<v Speaker 3>able to use this concept of transfer learning to take

165
00:08:58.399 --> 00:09:00.720
<v Speaker 3>a general model and just kind of point it in the

166
00:09:00.759 --> 00:09:03.080
<v Speaker 3>right direction. So we no longer need to do all

167
00:09:03.120 --> 00:09:06.639
<v Speaker 3>that all that pounding the pavement asking for people to

168
00:09:06.720 --> 00:09:08.879
<v Speaker 3>say a wake word, because that was a lot of

169
00:09:08.919 --> 00:09:11.879
<v Speaker 3>work and it took months, like every time somebody signed

170
00:09:11.919 --> 00:09:14.240
<v Speaker 3>a contract. And I know because I was running the

171
00:09:14.279 --> 00:09:18.039
<v Speaker 3>crowdsourcing technology for that company, So I would have to

172
00:09:18.039 --> 00:09:22.039
<v Speaker 3>post these jobs and these these people would record it

173
00:09:22.039 --> 00:09:24.279
<v Speaker 3>on their on their like mobile device, and I'd have

174
00:09:24.320 --> 00:09:26.720
<v Speaker 3>to go through all the recordings and like you know,

175
00:09:27.440 --> 00:09:30.279
<v Speaker 3>some people would just yeah. Some people would just you know,

176
00:09:30.440 --> 00:09:34.200
<v Speaker 3>speak their manifesto into the phone, and I'd be like, no, no, no, no.

177
00:09:37.240 --> 00:09:40.279
<v Speaker 4>So one one thing that I'm curious about is I'm

178
00:09:40.320 --> 00:09:44.639
<v Speaker 4>assuming that when you would do these these wake word gatherings,

179
00:09:45.200 --> 00:09:48.120
<v Speaker 4>you would have to take into account accents, because I

180
00:09:48.159 --> 00:09:52.679
<v Speaker 4>know that that is something that every automated assistant struggles with.

181
00:09:52.799 --> 00:09:57.840
<v Speaker 4>There's English accents, Scottish accents, Caribbean accents, all speaking English,

182
00:09:57.879 --> 00:10:02.600
<v Speaker 4>but all slightly differently. So is Picovoice able to

183
00:10:02.919 --> 00:10:05.519
<v Speaker 4>account for that and be able to interpret, you know,

184
00:10:05.600 --> 00:10:10.559
<v Speaker 4>a deep Southern accent versus maybe a New York Boston accent.

185
00:10:11.639 --> 00:10:14.759
<v Speaker 3>Yeah, So I mean that that's still a challenge for us.

186
00:10:14.799 --> 00:10:17.879
<v Speaker 3>But I think the reason we're a bit more resilient

187
00:10:17.919 --> 00:10:20.759
<v Speaker 3>to it is because we've trained this general model on

188
00:10:21.000 --> 00:10:26.159
<v Speaker 3>like, geez, like ten hundred thousand hours of speech. It's

189
00:10:26.159 --> 00:10:29.919
<v Speaker 3>heard all the accents, not not all the accents, but

190
00:10:29.960 --> 00:10:34.639
<v Speaker 3>it's heard it's heard a lot of variation, so it

191
00:10:34.799 --> 00:10:37.080
<v Speaker 3>tends to be a bit more resilient. When I was

192
00:10:37.120 --> 00:10:39.639
<v Speaker 3>doing the old style where we would get people to record,

193
00:10:40.000 --> 00:10:43.240
<v Speaker 3>that was actually a lot less resilient to it because

194
00:10:43.399 --> 00:10:48.200
<v Speaker 3>we only had like, you know, three hundred participants recording

195
00:10:48.279 --> 00:10:51.159
<v Speaker 3>these wake words, and how much variety are you going

196
00:10:51.200 --> 00:10:54.399
<v Speaker 3>to get between three hundred people? Like? Not enough? But

197
00:10:55.039 --> 00:10:57.720
<v Speaker 3>when we train these general models, we have like tens

198
00:10:57.720 --> 00:11:01.600
<v Speaker 3>of thousands of different speakers, maybe more, so we tend

199
00:11:01.639 --> 00:11:05.039
<v Speaker 3>to be a lot more sensitive to the variations. But

200
00:11:05.039 --> 00:11:07.600
<v Speaker 3>but it is, it is definitely a challenge because even

201
00:11:07.720 --> 00:11:11.200
<v Speaker 3>us as humans, if you hear like a really thick

202
00:11:11.279 --> 00:11:14.120
<v Speaker 3>accent that you're not used to, it can be confusing,

203
00:11:14.600 --> 00:11:18.559
<v Speaker 3>like like we're we're not perfect either with it. So

204
00:11:18.840 --> 00:11:20.519
<v Speaker 3>it's it's it's a challenge.

205
00:11:20.840 --> 00:11:24.159
<v Speaker 2>So I think you, so you added multiple language support.

206
00:11:24.240 --> 00:11:26.440
<v Speaker 2>I believe that's new or at least newish from the

207
00:11:26.960 --> 00:11:31.679
<v Speaker 2>last time we talked. So does that, like, more generalizability,

208
00:11:31.720 --> 00:11:36.159
<v Speaker 2>make that easier or I imagine there's still all sorts of

209
00:11:36.240 --> 00:11:37.480
<v Speaker 2>challenges that go into that.

210
00:11:38.240 --> 00:11:41.480
<v Speaker 3>Yeah, So when you when you actually work with a

211
00:11:41.519 --> 00:11:46.039
<v Speaker 3>totally different language, that's basically starting over because accents is

212
00:11:46.080 --> 00:11:48.200
<v Speaker 3>one thing, you've already taught it the series of sounds

213
00:11:48.240 --> 00:11:51.879
<v Speaker 3>in the language, and you're just looking for a combination

214
00:11:51.960 --> 00:11:54.399
<v Speaker 3>of those sounds and those symbols. But when you move

215
00:11:54.440 --> 00:11:56.879
<v Speaker 3>into a new language, there's a new set of symbols,

216
00:11:57.039 --> 00:12:00.240
<v Speaker 3>and there's a new set of sounds. You know, there's

217
00:12:00.519 --> 00:12:03.919
<v Speaker 3>everybody has an inventory. We call it a phonemic inventory,

218
00:12:04.320 --> 00:12:08.440
<v Speaker 3>and it's basically a series of sounds that you hear

219
00:12:08.480 --> 00:12:12.279
<v Speaker 3>in the language, and every language has a different phonemic inventory,

220
00:12:12.600 --> 00:12:15.840
<v Speaker 3>and we need to train the machine to understand only

221
00:12:15.879 --> 00:12:19.360
<v Speaker 3>that inventory of sounds and all the symbols that go

222
00:12:19.440 --> 00:12:21.879
<v Speaker 3>into that. So when we start a new language, we

223
00:12:21.960 --> 00:12:24.399
<v Speaker 3>have to do it completely from scratch. We have to

224
00:12:25.279 --> 00:12:29.559
<v Speaker 3>get new data in that language, We need to get

225
00:12:29.879 --> 00:12:33.720
<v Speaker 3>new text in that language, and we need to do

226
00:12:33.759 --> 00:12:37.919
<v Speaker 3>our best to even understand the language enough to work

227
00:12:37.960 --> 00:12:41.399
<v Speaker 3>with it because we need to listen to these recordings.

228
00:12:41.440 --> 00:12:44.200
<v Speaker 3>We need to normalize the text we get and make

229
00:12:44.240 --> 00:12:46.879
<v Speaker 3>sure it's not like full of symbols and stuff, but

230
00:12:47.120 --> 00:12:50.879
<v Speaker 3>understand it enough so that we actually don't confuse the

231
00:12:50.919 --> 00:12:54.279
<v Speaker 3>machine learning process, and that that could be a real challenge.

232
00:12:54.440 --> 00:12:55.360
<v Speaker 3>It's a lot of work, actually.

233
00:12:55.200 --> 00:12:59.320
<v Speaker 2>So it's fascinating. Does that mean like when you

234
00:12:59.399 --> 00:13:01.480
<v Speaker 2>kick off a new language, I feel like you almost need

235
00:13:01.519 --> 00:13:05.879
<v Speaker 2>to have like a professional linguist on staff for almost

236
00:13:05.919 --> 00:13:08.840
<v Speaker 2>each of these languages, right, Like, or do you like

237
00:13:08.879 --> 00:13:11.840
<v Speaker 2>bring on somebody who's, like, you know, a world class

238
00:13:12.039 --> 00:13:15.559
<v Speaker 2>I don't know, a Spanish linguist to help, or like

239
00:13:15.559 --> 00:13:17.240
<v Speaker 2>like how much of it are you able, like as

240
00:13:17.240 --> 00:13:20.559
<v Speaker 2>a software developer to sort of test on your own

241
00:13:20.559 --> 00:13:22.519
<v Speaker 2>and how much do you have to rely on a

242
00:13:22.600 --> 00:13:25.440
<v Speaker 2>native speaker as the only person that can actually figure

243
00:13:25.480 --> 00:13:26.279
<v Speaker 2>some of these things out.

244
00:13:26.679 --> 00:13:29.960
<v Speaker 3>Yeah, so we do have like basically our like machine

245
00:13:30.000 --> 00:13:33.799
<v Speaker 3>learning team. They do have to be part linguist, like

246
00:13:34.039 --> 00:13:38.200
<v Speaker 3>because if you've studied languages, you at least understand the components,

247
00:13:38.639 --> 00:13:42.639
<v Speaker 3>and basically every language is just a combination of the components.

248
00:13:43.799 --> 00:13:46.559
<v Speaker 3>So they have a lot of expertise in that field

249
00:13:46.559 --> 00:13:50.600
<v Speaker 3>to understand when they approach a new language how it works.

250
00:13:50.799 --> 00:13:54.840
<v Speaker 3>But then that's not enough. So usually what we'll do

251
00:13:55.120 --> 00:13:58.720
<v Speaker 3>is we'll get somebody, well, we will get a native speaker.

252
00:13:59.120 --> 00:14:04.159
<v Speaker 3>Usually we'll basically hire somebody on a contract to work

253
00:14:04.200 --> 00:14:06.840
<v Speaker 3>with us to help with the language, because you do

254
00:14:06.960 --> 00:14:10.960
<v Speaker 3>need that expertise. Like, the fact is, even somebody who's

255
00:14:11.000 --> 00:14:13.840
<v Speaker 3>like a language expert, if they sit down to an

256
00:14:14.000 --> 00:14:16.200
<v Speaker 3>entirely new language, they're not going to be able to

257
00:14:16.720 --> 00:14:20.159
<v Speaker 3>understand it enough to do the work that needs to

258
00:14:20.200 --> 00:14:23.519
<v Speaker 3>be done to actually get it to a production ready state.

259
00:14:23.720 --> 00:14:26.759
<v Speaker 3>So we often do need to get a native speaker

260
00:14:26.799 --> 00:14:30.519
<v Speaker 3>in there to provide their input and that will really

261
00:14:31.080 --> 00:14:33.679
<v Speaker 3>speed the process along. We tried to do it without

262
00:14:33.879 --> 00:14:38.000
<v Speaker 3>experts a couple times, and it's just like you just

263
00:14:38.000 --> 00:14:41.440
<v Speaker 3>don't get the performance and you spend a lot more time,

264
00:14:41.879 --> 00:14:44.519
<v Speaker 3>you waste a lot more time, I should say.

265
00:14:44.759 --> 00:14:46.679
<v Speaker 4>Sure. I mean, that makes a lot of sense. When you

266
00:14:46.720 --> 00:14:50.960
<v Speaker 4>think about getting expertise in anything else, it's a lot.

267
00:14:51.080 --> 00:14:54.399
<v Speaker 4>It will almost undoubtedly go much quicker if you have

268
00:14:54.480 --> 00:14:56.919
<v Speaker 4>somebody who is proficient in whatever it is that you're

269
00:14:56.919 --> 00:14:57.519
<v Speaker 4>trying to do.

270
00:14:58.000 --> 00:15:01.240
<v Speaker 3>Yeah, well, they can recognize mistakes, grammar and stuff, the

271
00:15:01.320 --> 00:15:03.679
<v Speaker 3>stuff that's really hard to pick up as a non

272
00:15:03.759 --> 00:15:04.480
<v Speaker 3>native speaker.

273
00:15:04.960 --> 00:15:08.960
<v Speaker 4>Yes. So what languages do you currently offer Picovoice for?

274
00:15:09.519 --> 00:15:13.320
<v Speaker 3>So we have I believe last year we announced we

275
00:15:13.360 --> 00:15:18.919
<v Speaker 3>had Spanish, French, German, English, and then this year we

276
00:15:19.000 --> 00:15:25.480
<v Speaker 3>added four new languages. We added Japanese, Korean, Portuguese, and Italian.

277
00:15:26.279 --> 00:15:27.600
<v Speaker 4>These are some tough ones.

278
00:15:28.279 --> 00:15:31.559
<v Speaker 3>Yeah, well, especially like when you get into the written

279
00:15:31.639 --> 00:15:37.919
<v Speaker 3>forms of Korean and Japanese, they become very challenging. Like

280
00:15:38.519 --> 00:15:41.039
<v Speaker 3>you know, we in English we have twenty six characters.

281
00:15:41.200 --> 00:15:45.279
<v Speaker 3>Japanese has two alphabets of fifty six and then an

282
00:15:45.320 --> 00:15:53.399
<v Speaker 3>additional alphabet of tens of thousands. So yeah, yeah, so

283
00:15:53.440 --> 00:15:56.679
<v Speaker 3>that the text representation of that is really difficult. The

284
00:15:57.399 --> 00:16:01.080
<v Speaker 3>actual spoken version of Japanese is a lot easier than

285
00:16:01.120 --> 00:16:05.519
<v Speaker 3>English because Japanese has fifty six sounds, and they all

286
00:16:05.639 --> 00:16:10.279
<v Speaker 3>map to a combination of characters. English mapping a combination

287
00:16:10.360 --> 00:16:15.960
<v Speaker 3>of characters to the sound is incredibly difficult. Turns out

288
00:16:16.080 --> 00:16:18.440
<v Speaker 3>we made some mistakes early on and we didn't really

289
00:16:18.480 --> 00:16:20.240
<v Speaker 3>fix them.

290
00:16:20.600 --> 00:16:23.440
<v Speaker 4>I mean, just thinking about the amount of spellings that

291
00:16:23.480 --> 00:16:27.399
<v Speaker 4>we have for the same sounding word based on context,

292
00:16:27.519 --> 00:16:30.759
<v Speaker 4>I cannot even imagine how you would be able

293
00:16:30.759 --> 00:16:32.639
<v Speaker 4>to figure that out for a transcript.

294
00:16:32.879 --> 00:16:35.399
<v Speaker 3>And it's all exceptions in English. It's like, oh, yeah,

295
00:16:35.440 --> 00:16:38.720
<v Speaker 3>it's this unless this or this unless this, and like

296
00:16:38.840 --> 00:16:41.879
<v Speaker 3>here's three different reasons why this rule is wrong.

297
00:16:41.960 --> 00:16:45.480
<v Speaker 2>And yeah, you see this when you have like younger

298
00:16:45.559 --> 00:16:47.679
<v Speaker 2>kids that are starting to write, and you look at

299
00:16:47.679 --> 00:16:50.679
<v Speaker 2>their writing because they start they don't know the exceptions yet, right,

300
00:16:50.879 --> 00:16:53.360
<v Speaker 2>but they can speak it because they know. So you

301
00:16:53.480 --> 00:16:55.960
<v Speaker 2>get like, they it's words you don't even think about too,

302
00:16:55.960 --> 00:16:58.840
<v Speaker 2>because we internalize them so quickly. Because one of my

303
00:16:58.960 --> 00:17:02.320
<v Speaker 2>kids spelled "because" wrong, and then you're like, oh, well,

304
00:17:02.320 --> 00:17:03.960
<v Speaker 2>"because" is pretty easy. But then you think about it

305
00:17:04.000 --> 00:17:06.519
<v Speaker 2>for like half a second and you realize, like, actually,

306
00:17:06.519 --> 00:17:09.119
<v Speaker 2>the word "because" makes absolutely no sense, like, right?

307
00:17:09.759 --> 00:17:11.920
<v Speaker 3>Like if you try and explain it, you suddenly find

308
00:17:11.960 --> 00:17:17.720
<v Speaker 3>yourself going, it just is what it is. Yes, just memorize it, yep.

309
00:17:19.039 --> 00:17:23.119
<v Speaker 4>I mean that's really fantastic that you have taken on

310
00:17:23.200 --> 00:17:27.559
<v Speaker 4>and it sounds like gotten through some very difficult dialects.

311
00:17:27.960 --> 00:17:30.880
<v Speaker 4>Are what are future future languages that you hope to

312
00:17:30.920 --> 00:17:32.160
<v Speaker 4>be able to process as well?

313
00:17:32.720 --> 00:17:36.559
<v Speaker 3>So we're going, yeah, exactly, So we're going to try

314
00:17:37.000 --> 00:17:39.759
<v Speaker 3>next year. We're going to double our language count again,

315
00:17:39.960 --> 00:17:43.680
<v Speaker 3>I think, and we're going to do going to do Chinese, Vietnamese,

316
00:17:44.240 --> 00:17:49.440
<v Speaker 3>what else? Dutch, I believe, Russian, Polish, I think. Yeah,

317
00:17:49.519 --> 00:17:52.480
<v Speaker 3>I can't remember all of them. But you basically need

318
00:17:52.799 --> 00:17:57.599
<v Speaker 3>to be like a fully inclusive speech recognition company, you

319
00:17:57.680 --> 00:18:01.240
<v Speaker 3>basically need like a bare minimum of like fifty languages.

320
00:18:01.559 --> 00:18:04.039
<v Speaker 3>So like we're going to get to like twenty of

321
00:18:04.079 --> 00:18:07.279
<v Speaker 3>the most popular and hold there for a while. Is

322
00:18:07.640 --> 00:18:11.559
<v Speaker 3>kind of our plan because that covers a lot of people,

323
00:18:11.759 --> 00:18:15.519
<v Speaker 3>Like that covers the majority of people, because because even

324
00:18:15.599 --> 00:18:19.119
<v Speaker 3>in the cases where the people might not speak the language,

325
00:18:19.160 --> 00:18:20.960
<v Speaker 3>they usually are like, oh but I speak this what

326
00:18:21.119 --> 00:18:24.359
<v Speaker 3>this more popular language. But to really get up there,

327
00:18:24.440 --> 00:18:26.720
<v Speaker 3>like I mean, you do need to get to like

328
00:18:26.880 --> 00:18:30.480
<v Speaker 3>fifty or something. And I mean Google has like one

329
00:18:30.559 --> 00:18:34.359
<v Speaker 3>hundred and fifty, so you know, it's it's kind of

330
00:18:34.400 --> 00:18:36.680
<v Speaker 3>a never ending thing for us.

331
00:18:37.559 --> 00:18:39.400
<v Speaker 4>How about Hindi that's a big one.

332
00:18:39.519 --> 00:18:41.960
<v Speaker 3>Oh yeah, that's actually one of the other ones we're

333
00:18:41.960 --> 00:18:43.039
<v Speaker 3>going to do next year.

334
00:18:43.200 --> 00:18:47.359
<v Speaker 2>Yeah, so I guess I got to ask one last question.

335
00:18:47.400 --> 00:18:49.839
<v Speaker 2>Are there any languages like you've come to hate, like

336
00:18:49.880 --> 00:18:52.480
<v Speaker 2>because it was like very difficult or.

337
00:18:56.559 --> 00:19:00.920
<v Speaker 3>It's funny how much you can hate your own language. No, actually,

338
00:19:01.079 --> 00:19:03.799
<v Speaker 3>like seriously, English is the only Like I look at

339
00:19:03.839 --> 00:19:06.359
<v Speaker 3>all other languages we've done, and I'm like, these are

340
00:19:06.400 --> 00:19:11.440
<v Speaker 3>so much easier, like English is. Actually, it's just it

341
00:19:11.519 --> 00:19:14.720
<v Speaker 3>came out of a mess of languages. It was a

342
00:19:14.759 --> 00:19:18.799
<v Speaker 3>lot of combinations that happened over time, and a lot

343
00:19:18.799 --> 00:19:21.680
<v Speaker 3>of them happened during like you know, a lot of

344
00:19:21.720 --> 00:19:26.119
<v Speaker 3>English developed during like illiteracy, and so there's like really

345
00:19:26.160 --> 00:19:29.279
<v Speaker 3>interesting examples you can find of like stuff where it's

346
00:19:29.359 --> 00:19:32.240
<v Speaker 3>just like, oh, yeah, this was just a mistake that happened,

347
00:19:32.960 --> 00:19:34.759
<v Speaker 3>you know, two hundred years ago that they kept in

348
00:19:35.079 --> 00:19:37.599
<v Speaker 3>or actually, I have a fun fact the word dumb.

349
00:19:37.799 --> 00:19:39.480
<v Speaker 3>So you look at that, you're like, why does it

350
00:19:39.480 --> 00:19:42.480
<v Speaker 3>have a b at the end? That apparently was, there

351
00:19:42.559 --> 00:19:46.519
<v Speaker 3>was a time where the like ruling class of England

352
00:19:46.559 --> 00:19:49.160
<v Speaker 3>was trying to make it harder to write English so

353
00:19:49.200 --> 00:19:52.400
<v Speaker 3>that the peasantry couldn't, like, pick it up. And they

354
00:19:52.440 --> 00:19:56.519
<v Speaker 3>literally just added some letters to the language here or there,

355
00:19:57.200 --> 00:19:59.480
<v Speaker 3>and were like, this is the proper way to write it,

356
00:19:59.599 --> 00:20:02.480
<v Speaker 3>and then just to confuse people. And we literally still

357
00:20:02.519 --> 00:20:05.680
<v Speaker 3>have that to this day. So English is so weird.

358
00:20:06.680 --> 00:20:08.960
<v Speaker 4>So that's why knife has a K in front of it.

359
00:20:09.200 --> 00:20:12.839
<v Speaker 3>Yeah, yeah, like like stuff like that. I think they

360
00:20:12.839 --> 00:20:14.720
<v Speaker 3>were just messing with us and now we're just like

361
00:20:15.000 --> 00:20:16.000
<v Speaker 3>we have to live with that.

362
00:20:17.240 --> 00:20:19.359
<v Speaker 2>So I want to pivot a little bit and talk

363
00:20:19.400 --> 00:20:22.480
<v Speaker 2>about the actual web development, like the side where you

364
00:20:22.519 --> 00:20:25.680
<v Speaker 2>might actually use a service like this, because I remember

365
00:20:25.759 --> 00:20:27.519
<v Speaker 2>last time we chatted a little bit too about

366
00:20:27.720 --> 00:20:30.960
<v Speaker 2>common use cases, right, So maybe we could just start

367
00:20:31.039 --> 00:20:32.880
<v Speaker 2>with a review, like how we have a lot of

368
00:20:32.880 --> 00:20:36.240
<v Speaker 2>web developers listen to this show. What do you think, like,

369
00:20:36.400 --> 00:20:39.240
<v Speaker 2>I guess, A, what would using something like this look like?

370
00:20:39.240 --> 00:20:40.759
<v Speaker 2>Like how do you actually get it in an app?

371
00:20:41.039 --> 00:20:43.839
<v Speaker 2>And B I guess like, what are some common use

372
00:20:43.880 --> 00:20:46.599
<v Speaker 2>cases that you see for use on the web as well?

373
00:20:46.799 --> 00:20:49.400
<v Speaker 3>Right? So one of the big things is obviously on

374
00:20:49.440 --> 00:20:51.599
<v Speaker 3>the web, people are a lot more comfortable calling like

375
00:20:51.880 --> 00:20:55.920
<v Speaker 3>an API and that is what they've come to expect

376
00:20:55.960 --> 00:21:00.240
<v Speaker 3>for speech recognition and stuff. But we're actually bringing

377
00:21:00.319 --> 00:21:03.000
<v Speaker 3>the We're actually kind of bringing back the power of

378
00:21:03.039 --> 00:21:06.160
<v Speaker 3>the browser itself. So the I mean the browser is

379
00:21:06.160 --> 00:21:09.039
<v Speaker 3>a virtual environment that can run whatever you want. And

380
00:21:09.319 --> 00:21:12.640
<v Speaker 3>we actually can run entirely in the browser on the

381
00:21:12.680 --> 00:21:16.599
<v Speaker 3>client side. And that's that's big because I mean, in

382
00:21:16.640 --> 00:21:20.599
<v Speaker 3>the these days, we're getting a lot of progressive web apps,

383
00:21:21.160 --> 00:21:24.119
<v Speaker 3>and the sort of web app is a big thing,

384
00:21:24.240 --> 00:21:27.079
<v Speaker 3>especially with like SaaS companies and stuff. So if you're

385
00:21:27.119 --> 00:21:30.160
<v Speaker 3>running like a SaaS company and you you want to

386
00:21:30.200 --> 00:21:34.720
<v Speaker 3>integrate like voice into your console or something, having it

387
00:21:35.000 --> 00:21:39.359
<v Speaker 3>on the client side is is I mean, it lowers

388
00:21:39.400 --> 00:21:42.160
<v Speaker 3>the latency, It gives you a lot more direct control

389
00:21:42.319 --> 00:21:46.440
<v Speaker 3>of what happens when you get voice, and it means

390
00:21:46.519 --> 00:21:50.119
<v Speaker 3>you can be robust to connection issues, which like that

391
00:21:50.119 --> 00:21:52.599
<v Speaker 3>that you know, that's a huge thing. Not everyone has

392
00:21:53.000 --> 00:21:55.519
<v Speaker 3>amazing Internet and you don't want to have to be

393
00:21:55.880 --> 00:21:58.839
<v Speaker 3>making calls out to an API and just hoping

394
00:21:58.880 --> 00:22:01.079
<v Speaker 3>it comes back for your feature to work. This

395
00:22:01.440 --> 00:22:03.519
<v Speaker 3>will just work. And also on top of all that,

396
00:22:03.839 --> 00:22:08.319
<v Speaker 3>it's it's less expensive because we're not calling an API,

397
00:22:08.440 --> 00:22:13.480
<v Speaker 3>We're not depending on cloud infrastructure. So you're actually if

398
00:22:13.559 --> 00:22:17.279
<v Speaker 3>you're a developer and you integrate Picovoice into your

399
00:22:17.480 --> 00:22:19.920
<v Speaker 3>web app, your client is going to be using their

400
00:22:19.960 --> 00:22:23.160
<v Speaker 3>machine to do the processing. So I think it's just

401
00:22:23.240 --> 00:22:26.000
<v Speaker 3>a win win situation for that. Yeah.

402
00:22:26.000 --> 00:22:29.079
<v Speaker 2>I feel it's especially important considering it's audio too, So

403
00:22:29.240 --> 00:22:33.400
<v Speaker 2>like bandwidth is like you you're not just shipping off

404
00:22:33.480 --> 00:22:35.319
<v Speaker 2>like a couple of things in a query string to

405
00:22:35.400 --> 00:22:37.759
<v Speaker 2>some service. You're like uploading.

406
00:22:37.279 --> 00:22:39.599
<v Speaker 3>Audio, megs of audio.

407
00:22:40.160 --> 00:22:45.880
<v Speaker 2>Yeah, so the bandwidth consideration is amplified significantly.
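
NOTE
Editor's sketch, not from the episode: rough arithmetic for what streaming raw audio
to a cloud API costs in bandwidth. The audio format (16 kHz, 16-bit, mono PCM) and the
function name are assumptions; real services may use other rates or compressed codecs.
```typescript
// Bytes per second for uncompressed PCM audio.
function pcmBytesPerSecond(sampleRateHz: number, bitsPerSample: number, channels: number): number {
  return (sampleRateHz * bitsPerSample * channels) / 8;
}
// Assumed format: 16 kHz, 16-bit, mono (not stated in the episode).
const bytesPerSecond = pcmBytesPerSecond(16000, 16, 1);
const kilobitsPerSecond = (bytesPerSecond * 8) / 1000;
const megabytesPerHour = (bytesPerSecond * 3600) / 1e6;
console.log(bytesPerSecond, kilobitsPerSecond, megabytesPerHour);
```
Under these assumptions one open stream is a steady upload for as long as the
microphone is on, which is the cost that on-device processing avoids.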

408
00:22:46.519 --> 00:22:50.519
<v Speaker 3>Yeah. No, and it makes everything that that actually allows

409
00:22:50.519 --> 00:22:54.680
<v Speaker 3>for like something like real time audio. Real time audio

410
00:22:54.920 --> 00:22:57.079
<v Speaker 3>is very challenging to do for an API because you

411
00:22:57.079 --> 00:23:00.720
<v Speaker 3>basically need to stream it to the service and have

412
00:23:00.880 --> 00:23:04.480
<v Speaker 3>responses being streamed back. That's that's really expensive. That's like

413
00:23:04.519 --> 00:23:08.319
<v Speaker 3>a constant bandwidth issue. But when you're doing real time

414
00:23:08.359 --> 00:23:10.400
<v Speaker 3>audio and it's all running in your browser on the

415
00:23:10.440 --> 00:23:14.440
<v Speaker 3>client side, it's it's snappy and you can do things

416
00:23:14.440 --> 00:23:16.279
<v Speaker 3>that require timing and.

417
00:23:16.640 --> 00:23:18.839
<v Speaker 2>Yeah, very cool. And I know I think I remember

418
00:23:18.839 --> 00:23:21.119
<v Speaker 2>from last time too that because one of the ways

419
00:23:21.119 --> 00:23:24.319
<v Speaker 2>you keep it snappy is it's not JavaScript code running

420
00:23:24.359 --> 00:23:26.759
<v Speaker 2>in the browser, right, it's your I don't remember your

421
00:23:26.759 --> 00:23:28.720
<v Speaker 2>exact tech stack, but I know you have some sort

422
00:23:28.720 --> 00:23:31.759
<v Speaker 2>of fancy way of doing that. Maybe you could walk

423
00:23:31.799 --> 00:23:34.599
<v Speaker 2>people through some of the magic and challenges of how

424
00:23:34.640 --> 00:23:35.160
<v Speaker 2>that works.

425
00:23:35.559 --> 00:23:39.119
<v Speaker 3>Yeah, so our core code is in like C because

426
00:23:39.119 --> 00:23:42.000
<v Speaker 3>we we were trying to keep it as efficient and

427
00:23:42.039 --> 00:23:45.759
<v Speaker 3>snappy as possible. Now C code, and when you think

428
00:23:45.799 --> 00:23:47.880
<v Speaker 3>of C code next to React, you're like, how does

429
00:23:47.920 --> 00:23:52.319
<v Speaker 3>this even work? Like can these two ever talk? But

430
00:23:52.400 --> 00:23:55.400
<v Speaker 3>it turns out they can with Wasm. And what we

431
00:23:55.480 --> 00:23:58.559
<v Speaker 3>do is we compile basically all our core code into

432
00:23:58.640 --> 00:24:02.200
<v Speaker 3>a Wasm binary and then we ship that with our

433
00:24:02.680 --> 00:24:08.160
<v Speaker 3>like npm package. So when you npm install Picovoice,

434
00:24:08.319 --> 00:24:10.640
<v Speaker 3>part of the part of what's going to be shipped

435
00:24:10.799 --> 00:24:17.160
<v Speaker 3>with your website is our Wasm blob. And basically Wasm's really

436
00:24:17.160 --> 00:24:21.640
<v Speaker 3>cool because it basically just wraps your native code in

437
00:24:21.799 --> 00:24:28.200
<v Speaker 3>JavaScript and then allows you to basically attach to it

438
00:24:28.559 --> 00:24:31.920
<v Speaker 3>like any sort of dynamic library, say, here's the functions

439
00:24:31.960 --> 00:24:34.079
<v Speaker 3>I want to call, here's the data I'm going to

440
00:24:34.119 --> 00:24:37.839
<v Speaker 3>put into it, and then you just call it like

441
00:24:37.920 --> 00:24:41.640
<v Speaker 3>you would any any other library. It's a little trickier

442
00:24:41.920 --> 00:24:45.359
<v Speaker 3>to work with because you're dealing with I mean JavaScript

443
00:24:45.359 --> 00:24:49.039
<v Speaker 3>obviously one of the things we're pretty aware of, and

444
00:24:49.160 --> 00:24:51.039
<v Speaker 3>I'm sure the listeners of your show are aware of.

445
00:24:51.119 --> 00:24:55.920
<v Speaker 3>Is JavaScript is like eh, types whatever, even TypeScript is

446
00:24:56.000 --> 00:24:58.440
<v Speaker 3>like is like yeah, types, but like you know, a

447
00:24:58.519 --> 00:25:01.240
<v Speaker 3>number is a number, right? Well, C is like, what,

448
00:25:01.640 --> 00:25:05.079
<v Speaker 3>how many bits is your number? Like, it needs to know.

449
00:25:06.440 --> 00:25:09.680
<v Speaker 3>So you start to need to think about that. When

450
00:25:09.720 --> 00:25:11.880
<v Speaker 3>you work with WASM, you start to need to think

451
00:25:11.920 --> 00:25:14.640
<v Speaker 3>about okay, is this a thirty-two-bit int going

452
00:25:14.640 --> 00:25:17.319
<v Speaker 3>in here? And you need to start to think of okay,

453
00:25:18.359 --> 00:25:20.480
<v Speaker 3>you need to start to think of memory, like okay,

454
00:25:20.720 --> 00:25:23.400
<v Speaker 3>I need to have a pointer. I need to pass

455
00:25:23.440 --> 00:25:26.279
<v Speaker 3>in a pointer here to get something back and then

456
00:25:26.440 --> 00:25:29.960
<v Speaker 3>convert that pointer to like a JavaScript object of some sort.

457
00:25:30.680 --> 00:25:34.039
<v Speaker 3>So it's challenging to work with, but once you get

458
00:25:34.079 --> 00:25:38.359
<v Speaker 3>it working, it's extremely powerful because then we can ship

459
00:25:38.440 --> 00:25:42.279
<v Speaker 3>something that's incredibly complex piece of code and just put

460
00:25:42.480 --> 00:25:48.640
<v Speaker 3>basically a slim interface of JavaScript around it, and then

461
00:25:48.799 --> 00:25:51.920
<v Speaker 3>any JavaScript developer can just call it, just talking to

462
00:25:51.960 --> 00:25:54.200
<v Speaker 3>it like it's JavaScript. They don't need to worry about

463
00:25:54.240 --> 00:25:59.160
<v Speaker 3>the Wasm; that was our problem. Yeah, so it's challenging

464
00:25:59.240 --> 00:26:01.799
<v Speaker 3>to work with, but I do if anybody's thinking of

465
00:26:02.440 --> 00:26:06.160
<v Speaker 3>has a challenging problem that requires the efficiency of C,

466
00:26:07.240 --> 00:26:09.119
<v Speaker 3>Don't be afraid of it. It's not that it's not

467
00:26:09.200 --> 00:26:12.279
<v Speaker 3>that hard, and it is pretty amazing when you start

468
00:26:12.319 --> 00:26:12.799
<v Speaker 3>working with it.
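
NOTE
Editor's sketch, not Picovoice's code: the bit-width and pointer bookkeeping described
above, with a plain ArrayBuffer standing in for Wasm linear memory so it runs anywhere.
The names (memory, alloc, writeSamples, readInt32) and the bump allocator are
assumptions for illustration only.
```typescript
// A plain ArrayBuffer stands in for Wasm linear memory (assumption: one 64 KiB page).
const memory = new ArrayBuffer(64 * 1024);
let nextFree = 0; // hypothetical bump-allocator state
// Returns a "pointer": a byte offset into memory, kept 4-byte aligned.
function alloc(bytes: number): number {
  const ptr = nextFree;
  nextFree += Math.ceil(bytes / 4) * 4;
  if (nextFree > memory.byteLength) throw new Error("out of memory");
  return ptr;
}
// Copies 16-bit PCM samples into memory and returns the pointer to them.
function writeSamples(samples: number[]): number {
  const ptr = alloc(samples.length * 2);
  new Int16Array(memory, ptr, samples.length).set(samples);
  return ptr;
}
// Reads a 32-bit signed little-endian integer back out of memory at a pointer.
function readInt32(ptr: number): number {
  return new DataView(memory).getInt32(ptr, true);
}
// JavaScript numbers are doubles; stored as int32, this value wraps at 32 bits.
const resultPtr = alloc(4);
new DataView(memory).setInt32(resultPtr, 2147483647 + 1, true);
const wrapped = readInt32(resultPtr);
// Values outside the 16-bit range wrap the same way when written as samples.
const samplesPtr = writeSamples([1, -2, 40000]);
const stored = Array.from(new Int16Array(memory, samplesPtr, 3));
console.log(wrapped, samplesPtr, stored);
```
The wrapper a library ships hides exactly this kind of offset and width handling, so
callers only see ordinary JavaScript values.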

469
00:26:12.839 --> 00:26:16.599
<v Speaker 4>Actually, Okay, so it works or there there is an

470
00:26:16.720 --> 00:26:19.440
<v Speaker 4>NPM package if you want to use JavaScript with it.

471
00:26:19.759 --> 00:26:23.960
<v Speaker 4>But what if you are a Python developer or maybe

472
00:26:24.000 --> 00:26:27.119
<v Speaker 4>you're working with a microcontroller like Arduino,

473
00:26:27.240 --> 00:26:32.039
<v Speaker 4>are there options for other languages

474
00:26:32.119 --> 00:26:32.400
<v Speaker 4>like that?

475
00:26:32.920 --> 00:26:36.720
<v Speaker 3>Yeah, So, I mean we support since we're a developer

476
00:26:36.759 --> 00:26:41.960
<v Speaker 3>focused company, we're pretty obsessed with our SDKs. So I

477
00:26:42.000 --> 00:26:45.240
<v Speaker 3>think for our two most popular products, I think we

478
00:26:45.279 --> 00:26:48.960
<v Speaker 3>have like twenty SDKs for each one, and it covers

479
00:26:48.960 --> 00:26:53.279
<v Speaker 3>all the favorites. And we even have you know, we

480
00:26:53.319 --> 00:26:56.559
<v Speaker 3>have three No, we have four different web SDKs. We

481
00:26:56.599 --> 00:27:03.240
<v Speaker 3>have Vanilla JavaScript, but we also have React, Angular and Vue.

482
00:27:03.799 --> 00:27:06.720
<v Speaker 3>So it allows we basically wanted to be like, use

483
00:27:06.759 --> 00:27:11.039
<v Speaker 3>it in your favorite environment, like, yeah, use it like

484
00:27:11.079 --> 00:27:14.720
<v Speaker 3>you use anything else in your in your stack, Like

485
00:27:14.759 --> 00:27:17.519
<v Speaker 3>we don't want to just disturb that basically.

486
00:27:17.680 --> 00:27:21.519
<v Speaker 4>Right, that's awesome. So what are some of the use

487
00:27:21.559 --> 00:27:25.519
<v Speaker 4>cases that you've seen people employing it for recently?

488
00:27:25.960 --> 00:27:29.599
<v Speaker 3>So we've seen, so we've actually come to some interesting

489
00:27:29.640 --> 00:27:34.319
<v Speaker 3>ones lately. So auto content moderation is a big one

490
00:27:34.400 --> 00:27:39.319
<v Speaker 3>right now. So let's say you're Minecraft or something, or

491
00:27:39.400 --> 00:27:43.279
<v Speaker 3>you're I guess, let's let's go Fortnite, and you have

492
00:27:43.480 --> 00:27:48.240
<v Speaker 3>open audio streams hundreds of thousands of players, and you're

493
00:27:48.240 --> 00:27:51.880
<v Speaker 3>trying to moderate all that. That's that's very difficult. And

494
00:27:52.119 --> 00:27:55.160
<v Speaker 3>it turns out a lot of big companies out there

495
00:27:55.240 --> 00:28:00.279
<v Speaker 3>are using auto moderation, which basically takes that audio and

496
00:28:00.440 --> 00:28:10.640
<v Speaker 3>is basically looking for key phrases let's call them, and

497
00:28:10.920 --> 00:28:13.559
<v Speaker 3>it's just looking to flag them and then and then

498
00:28:13.680 --> 00:28:16.480
<v Speaker 3>they'll usually have you know, a person go in and

499
00:28:16.799 --> 00:28:21.079
<v Speaker 3>inspect the actual content of it and decide whether it

500
00:28:21.200 --> 00:28:24.079
<v Speaker 3>was you know, a mistake or whether it is actually

501
00:28:24.119 --> 00:28:27.720
<v Speaker 3>like a bannable offense. Yeah, so that that's actually a

502
00:28:27.759 --> 00:28:31.240
<v Speaker 3>new exciting one. Also, like call centers, it turns out,

503
00:28:31.519 --> 00:28:35.440
<v Speaker 3>again we've got open phone lines, like like a whole

504
00:28:35.440 --> 00:28:38.920
<v Speaker 3>building full of them, and we're trying to understand, you know,

505
00:28:39.079 --> 00:28:41.599
<v Speaker 3>what's being said on all these different calls, and you

506
00:28:41.680 --> 00:28:44.880
<v Speaker 3>can't have people listening to all that audio. So a

507
00:28:44.880 --> 00:28:47.960
<v Speaker 3>lot of big call center companies need some sort of

508
00:28:48.039 --> 00:28:50.799
<v Speaker 3>automated system to take in all the audio from all

509
00:28:50.839 --> 00:28:54.319
<v Speaker 3>their phones and do something with it. So we're we're

510
00:28:54.400 --> 00:28:57.160
<v Speaker 3>encountering more use cases like that lately.

511
00:28:57.200 --> 00:29:01.440
<v Speaker 2>Actually, those are both really fascinating. It's funny. The content

512
00:29:01.480 --> 00:29:04.960
<v Speaker 2>moderation one really resonated with me because I play I

513
00:29:05.000 --> 00:29:07.319
<v Speaker 2>don't know. My kids are eleven, so they're right at

514
00:29:07.359 --> 00:29:09.839
<v Speaker 2>that impressionable age, but they're also right in the age

515
00:29:09.839 --> 00:29:12.119
<v Speaker 2>where they want to play like games that are the

516
00:29:12.119 --> 00:29:15.480
<v Speaker 2>sort where they have open audio. So there's a game

517
00:29:15.519 --> 00:29:18.920
<v Speaker 2>we play that's like five v five, so five people

518
00:29:18.920 --> 00:29:21.079
<v Speaker 2>on each team, and it has it has a way

519
00:29:21.119 --> 00:29:23.240
<v Speaker 2>for you to do audio communication. And the very first

520
00:29:23.279 --> 00:29:25.160
<v Speaker 2>thing I did was make sure to shut that off,

521
00:29:25.720 --> 00:29:30.400
<v Speaker 2>like disable it, because like I'm a professional Internet user,

522
00:29:30.519 --> 00:29:32.720
<v Speaker 2>and that's the first thing to learn is I don't

523
00:29:32.759 --> 00:29:35.960
<v Speaker 2>trust anybody. I wouldn't even want to hear it myself,

524
00:29:36.039 --> 00:29:38.440
<v Speaker 2>much less my kids, though I know.

525
00:29:38.680 --> 00:29:41.640
<v Speaker 3>It like brings me back to like like when I

526
00:29:41.680 --> 00:29:45.039
<v Speaker 3>was like you know, eleven or twelve, Like the Internet

527
00:29:45.200 --> 00:29:47.559
<v Speaker 3>was like a new exciting thing, and I just would

528
00:29:47.599 --> 00:29:50.720
<v Speaker 3>like I remember going to like I like was like

529
00:29:51.119 --> 00:29:53.839
<v Speaker 3>really into like going to like like different video game

530
00:29:53.880 --> 00:29:56.079
<v Speaker 3>websites and stuff. And then there was these just these

531
00:29:56.160 --> 00:29:59.279
<v Speaker 3>chat rooms about video games you could go to and

532
00:29:59.359 --> 00:30:02.559
<v Speaker 3>it was literally just like a room with like everybody

533
00:30:02.559 --> 00:30:05.680
<v Speaker 3>their microphones are on and you just start talking and

534
00:30:05.720 --> 00:30:08.279
<v Speaker 3>it was like, when I think of that now, I'm like,

535
00:30:08.359 --> 00:30:14.920
<v Speaker 3>oh my god, that's frightening. Yeah, but uh yeah, I

536
00:30:14.960 --> 00:30:17.440
<v Speaker 3>mean the fact is is we can we could keep

537
00:30:17.519 --> 00:30:21.519
<v Speaker 3>those spaces safe with these sorts of tools, because then

538
00:30:21.880 --> 00:30:24.680
<v Speaker 3>then ne'er-do-wells out there at least get banned when

539
00:30:24.720 --> 00:30:27.759
<v Speaker 3>they're when they're being inappropriate or whatever.

540
00:30:28.039 --> 00:30:31.319
<v Speaker 4>Oh god, Well, one thing that you put in the

541
00:30:31.319 --> 00:30:33.200
<v Speaker 4>show notes today that I would really like to hear

542
00:30:33.279 --> 00:30:37.319
<v Speaker 4>more about is a new speech to text engine or

543
00:30:37.440 --> 00:30:39.960
<v Speaker 4>engines, Cheetah and Leopard. So maybe you could tell us

544
00:30:39.960 --> 00:30:41.160
<v Speaker 4>a little bit more about those.

545
00:30:41.759 --> 00:30:44.400
<v Speaker 3>Yeah, so I think, yeah, last time we spoke, we

546
00:30:44.440 --> 00:30:49.079
<v Speaker 3>actually didn't have a publicly available speech to text engine,

547
00:30:49.160 --> 00:30:52.519
<v Speaker 3>and we were using our speech to intent engine, which

548
00:30:52.559 --> 00:30:56.640
<v Speaker 3>was called Rhino, which was basically, like you, basically, yeah,

549
00:30:57.240 --> 00:31:02.599
<v Speaker 3>the founder of the company is pretty obsessed with animals,

550
00:31:02.720 --> 00:31:07.519
<v Speaker 3>So Rhino basically you teach it a small grammar and

551
00:31:07.559 --> 00:31:09.559
<v Speaker 3>then it would understand that grammar, which is great for

552
00:31:09.640 --> 00:31:12.519
<v Speaker 3>stuff like you know, controlling a coffee maker or like

553
00:31:12.680 --> 00:31:15.519
<v Speaker 3>you know, there's only so many functions that it needs to understand,

554
00:31:15.880 --> 00:31:19.160
<v Speaker 3>but we decided to kind of go that extra mile

555
00:31:19.319 --> 00:31:25.519
<v Speaker 3>and bring speech to text to devices. And traditionally language

556
00:31:25.519 --> 00:31:29.319
<v Speaker 3>models are in the gigabyte realm of size, and we

557
00:31:29.319 --> 00:31:33.599
<v Speaker 3>actually got ours down to twenty megabytes for language and

558
00:31:34.200 --> 00:31:37.240
<v Speaker 3>that's sort of the big win for this is like

559
00:31:37.559 --> 00:31:41.119
<v Speaker 3>we can run on anything that can take twenty megabytes

560
00:31:41.240 --> 00:31:46.559
<v Speaker 3>of memory or of storage. And so Leopard and Cheetah

561
00:31:46.599 --> 00:31:50.000
<v Speaker 3>are actually two different sides of the same coin. So

562
00:31:50.160 --> 00:31:53.720
<v Speaker 3>Leopard is a speech to text engine that takes in

563
00:31:53.920 --> 00:31:57.720
<v Speaker 3>a set amount of audio, so like an audio file

564
00:31:58.039 --> 00:32:01.200
<v Speaker 3>or something, and gives you a transcript of that, and

565
00:32:01.440 --> 00:32:04.240
<v Speaker 3>that's a lot that's an easier problem because you can

566
00:32:04.279 --> 00:32:06.640
<v Speaker 3>basically say, okay, this is all the audio I'm going

567
00:32:06.680 --> 00:32:08.519
<v Speaker 3>to get, so I'm going to look forward, I'm going to

568
00:32:08.599 --> 00:32:10.960
<v Speaker 3>look back, I'm going to make inferences based on the

569
00:32:10.960 --> 00:32:13.400
<v Speaker 3>future and the past and give you a response. But

570
00:32:13.480 --> 00:32:16.279
<v Speaker 3>then Cheetah, of course, because it's the fast one, it

571
00:32:16.359 --> 00:32:19.920
<v Speaker 3>goes it's real time, so it has zero look ahead,

572
00:32:20.000 --> 00:32:23.000
<v Speaker 3>which means it will take in every frame of audio

573
00:32:23.119 --> 00:32:25.519
<v Speaker 3>that you give it and it will return what it

574
00:32:25.559 --> 00:32:28.519
<v Speaker 3>thinks is being said, so they're both speech to text engines,

575
00:32:28.599 --> 00:32:31.440
<v Speaker 3>but they just work at different use cases. So I

576
00:32:31.440 --> 00:32:34.880
<v Speaker 3>mean audio files. The accuracy is much better, but of

577
00:32:34.920 --> 00:32:37.799
<v Speaker 3>course you sacrifice the sort of real time effect.
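
NOTE
Editor's sketch, not the Leopard or Cheetah APIs: a toy contrast between file-mode
processing (past and future context) and streaming with zero look-ahead (past only).
Each "frame" is one energy value and the task is just labeling speech versus silence;
all names and the threshold are assumptions for illustration.
```typescript
type Frame = number; // toy stand-in: one energy value per audio frame
function mean(values: Frame[]): number {
  return values.reduce((a, b) => a + b, 0) / values.length;
}
// File mode: each frame is labeled using the previous, current, and next frame.
function labelBatch(frames: Frame[], threshold: number): boolean[] {
  return frames.map((_, i) => mean(frames.slice(Math.max(0, i - 1), i + 2)) >= threshold);
}
// Streaming mode: zero look-ahead, so each frame is labeled from past frames only.
function makeStreamingLabeler(threshold: number): (frame: Frame) => boolean {
  let previous: Frame | null = null;
  return (frame) => {
    const context = previous === null ? [frame] : [previous, frame];
    previous = frame;
    return mean(context) >= threshold;
  };
}
const frames: Frame[] = [0, 2, 10, 10, 0];
const batch = labelBatch(frames, 4);
const processFrame = makeStreamingLabeler(4);
const streamed = frames.map((f) => processFrame(f));
console.log(batch, streamed);
```
On this input the file-mode labels turn on one frame earlier, because the loud frame
that follows is already visible; the streaming labeler has to commit without it.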

578
00:32:38.160 --> 00:32:41.039
<v Speaker 2>Yeah, so twenty megs is impressive, but is that still

579
00:32:41.079 --> 00:32:44.200
<v Speaker 2>like small enough for a browser to use, Like does

580
00:32:44.240 --> 00:32:46.880
<v Speaker 2>a user have to download that to use it in

581
00:32:46.920 --> 00:32:48.119
<v Speaker 2>their web app?

582
00:32:47.960 --> 00:32:51.720
<v Speaker 3>So that was a challenge we recently. So we recently

583
00:32:51.759 --> 00:32:55.640
<v Speaker 3>did the web SDKs for Cheetah and Leopard, and we actually

584
00:32:55.640 --> 00:32:58.599
<v Speaker 3>had to kind of redesign our whole system of delivering

585
00:32:58.920 --> 00:33:02.480
<v Speaker 3>language to the browser to handle this. So, yes,

586
00:33:02.559 --> 00:33:06.039
<v Speaker 3>twenty megabytes is a lot, but we actually separate the

587
00:33:06.160 --> 00:33:10.000
<v Speaker 3>language model from the package, so basically we let the

588
00:33:10.039 --> 00:33:13.279
<v Speaker 3>developer decide how that's delivered to the user, but we

589
00:33:13.359 --> 00:33:15.640
<v Speaker 3>also made it part of our system that it could

590
00:33:15.680 --> 00:33:18.880
<v Speaker 3>be either a base64 representation that you can bake

591
00:33:18.920 --> 00:33:20.920
<v Speaker 3>into your website if you just want it to always

592
00:33:20.960 --> 00:33:23.240
<v Speaker 3>be there, or if you want to be kind of

593
00:33:23.319 --> 00:33:26.119
<v Speaker 3>smarter about it, what you can do is put it

594
00:33:26.160 --> 00:33:29.960
<v Speaker 3>in your public folder and have it downloaded to the

595
00:33:30.039 --> 00:33:33.920
<v Speaker 3>user's browser on first load, and then cached in local

596
00:33:33.960 --> 00:33:36.480
<v Speaker 3>storage for the rest of the time, so that the

597
00:33:36.559 --> 00:33:39.400
<v Speaker 3>next So the very first load, yeah, it'll be a

598
00:33:39.400 --> 00:33:42.119
<v Speaker 3>twenty megabyte load, but the second load will be instant

599
00:33:42.279 --> 00:33:44.079
<v Speaker 3>because they already have the language model.
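
NOTE
Editor's sketch, not Picovoice's implementation: the download-once-then-cache pattern
described above. The ModelStore interface, loadModel, and the model path are
hypothetical. The store is in-memory so the sketch runs anywhere; in a browser it
might be backed by IndexedDB or the Cache API, since localStorage quotas are
typically only a few megabytes.
```typescript
// Hypothetical storage interface for cached model files.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}
// In-memory store, standing in for a persistent browser store.
function memoryStore(): ModelStore {
  const items = new Map<string, Uint8Array>();
  return {
    get: async (key) => items.get(key),
    put: async (key, bytes) => { items.set(key, bytes); },
  };
}
// Returns the cached model if present; otherwise downloads it once and caches it.
async function loadModel(
  url: string,
  store: ModelStore,
  download: (url: string) => Promise<Uint8Array>,
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) return cached;
  const bytes = await download(url);
  await store.put(url, bytes);
  return bytes;
}
// Fake download that counts calls, standing in for a fetch of the model file.
let downloads = 0;
const fakeDownload = async (_url: string): Promise<Uint8Array> => {
  downloads += 1;
  return new Uint8Array([1, 2, 3]);
};
```
Calling loadModel twice with the same store and URL should trigger only one download;
starting that first call early, before the user opens the voice feature, hides the wait.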

600
00:33:44.279 --> 00:33:46.720
<v Speaker 2>It's a pretty neat system because I think like it's

601
00:33:46.799 --> 00:33:48.960
<v Speaker 2>it's the nature of the beast, because I mean it's

602
00:33:49.079 --> 00:33:51.039
<v Speaker 2>it's in a way, it's kind of more of like

603
00:33:51.079 --> 00:33:54.000
<v Speaker 2>a native app feature, and native apps are downloading like

604
00:33:55.000 --> 00:33:58.799
<v Speaker 2>gigs at times of stuff, and so it's like a

605
00:33:58.799 --> 00:34:01.279
<v Speaker 2>feature that helps the web sort of compete with that.

606
00:34:01.359 --> 00:34:03.680
<v Speaker 2>So I think it makes sense, and I think, like

607
00:34:04.039 --> 00:34:06.599
<v Speaker 2>honestly that I think that's the best that this solution

608
00:34:06.759 --> 00:34:08.880
<v Speaker 2>is kind of clever because that's kind of all you

609
00:34:08.920 --> 00:34:11.599
<v Speaker 2>can do because you can't you can't magically get it

610
00:34:11.599 --> 00:34:14.559
<v Speaker 2>to the user ahead of time, like through an app

611
00:34:14.559 --> 00:34:15.360
<v Speaker 2>store or something.

612
00:34:15.480 --> 00:34:18.079
<v Speaker 3>So and a developer, if they're if they want to

613
00:34:18.119 --> 00:34:20.840
<v Speaker 3>be clever about it, they can they can stream it

614
00:34:21.079 --> 00:34:24.440
<v Speaker 3>from their public folder asynchronously on the first load, so

615
00:34:24.519 --> 00:34:26.840
<v Speaker 3>that it's just like by the time the user wants

616
00:34:26.840 --> 00:34:29.760
<v Speaker 3>to activate the voice feature, it's already downloaded. You know.

617
00:34:30.480 --> 00:34:32.159
<v Speaker 3>It's it's just the sort of thing you need to

618
00:34:32.480 --> 00:34:35.960
<v Speaker 3>you need to handle these sorts of ways. Because you know,

619
00:34:36.400 --> 00:34:38.559
<v Speaker 3>we were working with a company recently that they do

620
00:34:38.639 --> 00:34:41.639
<v Speaker 3>this all the time in their mobile apps. They'll their

621
00:34:41.639 --> 00:34:45.480
<v Speaker 3>mobile app actually downloads like stuff all the time to

622
00:34:45.920 --> 00:34:50.000
<v Speaker 3>keep their their app working, and it does it all asynchronously,

623
00:34:50.039 --> 00:34:52.119
<v Speaker 3>like when you open up the app and you know,

624
00:34:52.239 --> 00:34:54.880
<v Speaker 3>the user's none the wiser, but behind the scenes there's

625
00:34:54.960 --> 00:34:57.079
<v Speaker 3>all this stuff. So when when you look up, why

626
00:34:57.159 --> 00:35:00.360
<v Speaker 3>is this app using three point six gigabytes? When I

627
00:35:00.400 --> 00:35:03.480
<v Speaker 3>downloaded it, it was only five hundred megabytes, that's because they

628
00:35:03.519 --> 00:35:06.360
<v Speaker 3>only delivered like the core code and the rest of

629
00:35:06.400 --> 00:35:10.719
<v Speaker 3>it was downloaded later on. Yeah, so's it's it's just

630
00:35:10.760 --> 00:35:14.079
<v Speaker 3>how it's just how you do stuff now is just

631
00:35:14.239 --> 00:35:17.159
<v Speaker 3>keep keep the package sizes small, but then just deliver

632
00:35:17.239 --> 00:35:19.519
<v Speaker 3>the features kind of as they're being used.

633
00:35:19.719 --> 00:35:22.639
<v Speaker 2>Yeah, I know, iOS and Android even have like APIs

634
00:35:22.679 --> 00:35:25.039
<v Speaker 2>built in to help you do that sort of thing

635
00:35:25.039 --> 00:35:27.599
<v Speaker 2>because it's it's such a common model.

636
00:35:27.960 --> 00:35:30.800
<v Speaker 3>Yeah, I think all the big companies want that. You know,

637
00:35:30.920 --> 00:35:34.239
<v Speaker 3>if you're Spotify, you just you got to have the features.

638
00:35:34.559 --> 00:35:38.480
<v Speaker 3>You don't want people to see three point six gigabytes

639
00:35:38.519 --> 00:35:41.000
<v Speaker 3>when they go to download your app. There's like a

640
00:35:41.000 --> 00:35:44.800
<v Speaker 3>sticker shock thing that happens. So it's kind of a

641
00:35:44.920 --> 00:35:47.840
<v Speaker 3>funny thing because it ends up being that it's it's

642
00:35:47.880 --> 00:35:50.400
<v Speaker 3>like when you book like an Airbnb and there's all

643
00:35:50.400 --> 00:35:53.119
<v Speaker 3>these extra expenses that like get reported later, or like

644
00:35:53.159 --> 00:35:56.960
<v Speaker 3>a, or a flight where you get like the info later.

645
00:35:57.199 --> 00:35:59.159
<v Speaker 3>It's sort of like that. It's like reduce the sticker

646
00:35:59.199 --> 00:36:01.400
<v Speaker 3>shock and then we will show you the expenses after.

647
00:36:03.039 --> 00:36:06.119
<v Speaker 2>So you also have an article in here about writing

648
00:36:06.119 --> 00:36:11.039
<v Speaker 2>a podcast transcription server, I'm struggling to pronounce for

649
00:36:11.079 --> 00:36:14.199
<v Speaker 2>some reason, which is a fascinating idea that I think,

650
00:36:14.280 --> 00:36:16.280
<v Speaker 2>Like I know when we were talking before the show too,

651
00:36:16.320 --> 00:36:20.280
<v Speaker 2>we've done transcriptions and videos. I'm sure there's other people

652
00:36:20.320 --> 00:36:23.000
<v Speaker 2>that are call centers is another example, right, the things

653
00:36:23.000 --> 00:36:26.199
<v Speaker 2>that you want to transcribe. So does that use Cheetah,

654
00:36:26.400 --> 00:36:28.800
<v Speaker 2>Leopard or how does that work?

655
00:36:29.159 --> 00:36:31.559
<v Speaker 3>Yeah, so it uses Leopard because we actually have

656
00:36:31.639 --> 00:36:34.480
<v Speaker 3>the ability to get a whole file, like an hour

657
00:36:34.599 --> 00:36:39.280
<v Speaker 3>long podcast and transcribe it from start to finish. And yeah,

658
00:36:39.320 --> 00:36:41.679
<v Speaker 3>the reason I kind of came up with that as

659
00:36:41.719 --> 00:36:44.320
<v Speaker 3>an idea to sort of demo our technology is like,

660
00:36:44.440 --> 00:36:47.400
<v Speaker 3>I know, I've listened to podcasts for years, and like

661
00:36:47.719 --> 00:36:50.239
<v Speaker 3>it's so often on a long running show. I'm sure

662
00:36:50.719 --> 00:36:53.280
<v Speaker 3>on this show you get the Hey have we talked

663
00:36:53.320 --> 00:36:55.599
<v Speaker 3>about that? Did we talk about this? I feel like

664
00:36:55.599 --> 00:36:59.679
<v Speaker 3>we've talked about this, and having show notes to go

665
00:36:59.760 --> 00:37:03.360
<v Speaker 3>back to is probably a really helpful thing. Or like

666
00:37:03.519 --> 00:37:05.519
<v Speaker 3>I was thinking of doing a next phase of the

667
00:37:05.599 --> 00:37:08.599
<v Speaker 3>article where I actually make a podcast like searchable. So

668
00:37:08.840 --> 00:37:13.880
<v Speaker 3>I made it transcribable and basically stored the like text representation.

669
00:37:14.079 --> 00:37:16.280
<v Speaker 3>But once you have the text representation, you can make

670
00:37:16.280 --> 00:37:18.639
<v Speaker 3>it searchable and then you can start being like, oh,

671
00:37:18.639 --> 00:37:20.880
<v Speaker 3>when did I say this? And then it will just

672
00:37:21.000 --> 00:37:23.559
<v Speaker 3>pop up the episode you said it in. So it

673
00:37:23.599 --> 00:37:25.360
<v Speaker 3>was just kind of kind of an idea I came

674
00:37:25.440 --> 00:37:29.079
<v Speaker 3>up with because I see a lot of people using

675
00:37:29.159 --> 00:37:33.000
<v Speaker 3>Leopard on a server to basically hook into an event

676
00:37:33.320 --> 00:37:36.880
<v Speaker 3>that's happening somewhere, whether it be an RSS feed, RSS

677
00:37:37.000 --> 00:37:40.920
<v Speaker 3>feed that's like updated or you know, yeah, like a

678
00:37:40.920 --> 00:37:45.199
<v Speaker 3>new audio file or video is uploaded, and it hooks

679
00:37:45.199 --> 00:37:47.880
<v Speaker 3>into that event, it runs it through Leopard and then

680
00:37:48.039 --> 00:37:50.320
<v Speaker 3>stores it in a database. I thought that was like

681
00:37:50.400 --> 00:37:52.639
<v Speaker 3>kind of a universal use case, Like it's just so

682
00:37:53.119 --> 00:37:55.719
<v Speaker 3>it seems like a fundamental part of the web to

683
00:37:55.840 --> 00:37:57.559
<v Speaker 3>like have something like that in a server.

684
00:37:57.920 --> 00:38:01.559
<v Speaker 4>Yeah, I mean, it'd be so useful and it

685
00:38:01.559 --> 00:38:04.880
<v Speaker 4>would it would help I think everybody from people who

686
00:38:04.960 --> 00:38:08.280
<v Speaker 4>just want to reread part of a podcast if they're

687
00:38:08.320 --> 00:38:10.599
<v Speaker 4>looking for something specific instead of having to just kind

688
00:38:10.599 --> 00:38:13.239
<v Speaker 4>of hop through trying to figure out where it was

689
00:38:13.320 --> 00:38:16.719
<v Speaker 4>that that useful bit of information was.

690
00:38:16.599 --> 00:38:19.159
<v Speaker 3>Well, and and you can think too, like you can deliver

691
00:38:19.280 --> 00:38:22.039
<v Speaker 3>these like let's say you attached it to your podcast.

692
00:38:22.119 --> 00:38:25.519
<v Speaker 3>You can like deliver these transcripts along with your podcast

693
00:38:25.559 --> 00:38:28.400
<v Speaker 3>because if you have the server hook in, transcribe it,

694
00:38:28.440 --> 00:38:31.159
<v Speaker 3>and then deliver the transcript along with the podcast, suddenly

695
00:38:31.159 --> 00:38:36.559
<v Speaker 3>you've got a follow along with the transcript podcast. So

696
00:38:36.800 --> 00:38:39.239
<v Speaker 3>these sorts of things are useful for like auto captioning

697
00:38:39.519 --> 00:38:43.000
<v Speaker 3>like videos or audio as well for like accessibility.

698
00:38:43.239 --> 00:38:46.360
<v Speaker 2>It's accessible. It's also like marketing. People like it for

699
00:38:46.559 --> 00:38:51.239
<v Speaker 2>SEO purposes too, because you know audio, Google can index audio,

700
00:38:51.320 --> 00:38:54.599
<v Speaker 2>but if you have a transcription, it absolutely can even better.

701
00:38:54.679 --> 00:38:57.920
<v Speaker 3>Yeah, one hundred percent correct. Yeah, search engines aren't very

702
00:38:57.920 --> 00:39:01.400
<v Speaker 3>good at indexing audio, so you just have to plaster

703
00:39:01.480 --> 00:39:02.239
<v Speaker 3>the text somewhere.

704
00:39:03.960 --> 00:39:06.920
<v Speaker 2>Can you recognize different speakers because that's the other thing

705
00:39:06.960 --> 00:39:09.840
<v Speaker 2>about a transcript, right, is knowing who's talking. Do

706
00:39:09.880 --> 00:39:11.880
<v Speaker 2>you have the ability internally even if you don't know

707
00:39:11.960 --> 00:39:14.559
<v Speaker 2>names obviously, but can you say, like this is voice one,

708
00:39:14.639 --> 00:39:15.159
<v Speaker 2>voice two.

709
00:39:15.480 --> 00:39:19.119
<v Speaker 3>Yeah. So actually we're working right now on a I'll

710
00:39:19.159 --> 00:39:23.599
<v Speaker 3>give you guys the scoop right now on a speaker

711
00:39:23.880 --> 00:39:27.559
<v Speaker 3>identification system that will basically be able to tell people

712
00:39:27.639 --> 00:39:30.639
<v Speaker 3>apart because yeah, when when you think of something like

713
00:39:31.000 --> 00:39:33.280
<v Speaker 3>doing like like a Zoom meeting, if you want like

714
00:39:33.440 --> 00:39:36.599
<v Speaker 3>to have meeting notes, Yeah, it would be really useful

715
00:39:36.639 --> 00:39:39.159
<v Speaker 3>to have like this came from this person, This came

716
00:39:39.199 --> 00:39:41.360
<v Speaker 3>from this person, This came from this person, and you

717
00:39:41.360 --> 00:39:44.760
<v Speaker 3>can use different I mean, Zoom obviously has the ability

718
00:39:44.800 --> 00:39:46.960
<v Speaker 3>to know where the audio is coming from, so it

719
00:39:47.000 --> 00:39:49.280
<v Speaker 3>can kind of just label it. But if you have

720
00:39:49.400 --> 00:39:52.760
<v Speaker 3>an anonymous audio stream with a bunch of different voices,

721
00:39:53.079 --> 00:39:55.760
<v Speaker 3>that's that's challenging because you don't know. You just have

722
00:39:55.800 --> 00:39:58.440
<v Speaker 3>to base your assumptions on the character of the voice.

723
00:39:58.719 --> 00:40:01.719
<v Speaker 3>Who's who's different. That's actually a problem we're working out

724
00:40:01.800 --> 00:40:04.760
<v Speaker 3>right now. And I mean that's that's useful for not

725
00:40:04.800 --> 00:40:08.199
<v Speaker 3>only speaker labeling, but also speaker verification. So like

726
00:40:08.360 --> 00:40:11.239
<v Speaker 3>if we want to voice activate something that only responds

727
00:40:11.320 --> 00:40:14.679
<v Speaker 3>to your voice, that's also like another use case for it.

728
00:40:14.880 --> 00:40:15.559
<v Speaker 4>That would be cool.

729
00:40:15.800 --> 00:40:18.960
<v Speaker 2>Well, this has been a blast. Is there anything that

730
00:40:19.000 --> 00:40:22.519
<v Speaker 2>you wanted to discuss today that we have not gotten

731
00:40:22.559 --> 00:40:23.440
<v Speaker 2>to at all?

732
00:40:24.159 --> 00:40:26.639
<v Speaker 3>No, I think I think we covered a lot here.

733
00:40:26.760 --> 00:40:29.480
<v Speaker 2>Yeah, yeah, excellent. So with that, why don't we move into

734
00:40:29.519 --> 00:40:32.519
<v Speaker 2>our picks, and Paige, do you want to kick us off?

735
00:40:32.920 --> 00:40:36.360
<v Speaker 4>Sure? So my pick is going to continue the trend

736
00:40:36.400 --> 00:40:40.280
<v Speaker 4>that I started last week, which was Star Trek. As

737
00:40:40.719 --> 00:40:43.119
<v Speaker 4>many of you who have been listening for a while know,

738
00:40:43.199 --> 00:40:46.599
<v Speaker 4>I've been on a Star Trek journey through Next Generation

739
00:40:46.760 --> 00:40:50.760
<v Speaker 4>and forward. So most recently I've begun watching Star Trek

740
00:40:50.840 --> 00:40:54.840
<v Speaker 4>Lower Decks, which is their animated series, and there's only

741
00:40:54.960 --> 00:40:57.840
<v Speaker 4>I think there's maybe two, maybe three seasons of it,

742
00:40:58.039 --> 00:41:00.519
<v Speaker 4>but it is, it is in the style of Rick

743
00:41:00.559 --> 00:41:04.159
<v Speaker 4>and Morty, and it is the funniest Star Trek that

744
00:41:04.239 --> 00:41:07.039
<v Speaker 4>I've ever seen, to the point where I'm actually laughing,

745
00:41:07.159 --> 00:41:12.840
<v Speaker 4>which is unusual for anything animated, but it is really good.

746
00:41:12.960 --> 00:41:17.119
<v Speaker 4>There's a lot of references to other Star Trek franchises,

747
00:41:17.199 --> 00:41:20.840
<v Speaker 4>so if you are familiar with Next Generation or Voyager

748
00:41:21.159 --> 00:41:24.920
<v Speaker 4>or Enterprise, they throw in all sorts of little jokes

749
00:41:24.920 --> 00:41:28.239
<v Speaker 4>that are related to those characters. So I would definitely

750
00:41:28.400 --> 00:41:32.320
<v Speaker 4>recommend it. It's it's as family friendly as the rest

751
00:41:32.360 --> 00:41:35.440
<v Speaker 4>of the Star Trek franchise is, and it's also

752
00:41:35.599 --> 00:41:38.119
<v Speaker 4>got a much bigger dose of humor than most of

753
00:41:38.159 --> 00:41:40.920
<v Speaker 4>them do. So if you're looking for something that's quick

754
00:41:41.000 --> 00:41:44.800
<v Speaker 4>twenty twenty five minute episodes, I would definitely say it's

755
00:41:44.840 --> 00:41:46.039
<v Speaker 4>a good one.

756
00:41:46.400 --> 00:41:48.480
<v Speaker 2>Excellent. I still have not gotten into the Star Trek world,

757
00:41:48.639 --> 00:41:52.079
<v Speaker 2>so it's at some point. I've had it recommended several times,

758
00:41:52.119 --> 00:41:53.960
<v Speaker 2>but I feel like it's such an like you can't

759
00:41:54.000 --> 00:41:56.880
<v Speaker 2>like casually wade into it, right, like you kind of

760
00:41:56.880 --> 00:42:00.360
<v Speaker 2>have to. Yeah, so my pick this week is

761
00:42:00.360 --> 00:42:02.719
<v Speaker 2>going to be The Great British Bakeoff, which I think

762
00:42:02.920 --> 00:42:07.679
<v Speaker 2>was a previous pick of yours, Paige. I started off watching

763
00:42:07.719 --> 00:42:10.400
<v Speaker 2>it because I just wanted to know what it was about, right,

764
00:42:10.599 --> 00:42:12.320
<v Speaker 2>just that sort of thing. And then next thing I

765
00:42:12.400 --> 00:42:14.559
<v Speaker 2>knew I had watched a few episodes and I didn't

766
00:42:14.599 --> 00:42:15.920
<v Speaker 2>even really understand why.

767
00:42:16.559 --> 00:42:20.239
<v Speaker 3>It's so it's so comforting that show, Like there's just

768
00:42:20.239 --> 00:42:22.920
<v Speaker 3>something so positive and warm about it.

769
00:42:22.920 --> 00:42:26.920
<v Speaker 2>It's it's strangely compelling, Like I can't even understand why

770
00:42:26.960 --> 00:42:29.320
<v Speaker 2>I ended up watching it, but it's quite good. I

771
00:42:29.400 --> 00:42:32.400
<v Speaker 2>think Netflix has like five seasons or so, I mean

772
00:42:32.440 --> 00:42:33.920
<v Speaker 2>I don't know how much I'm going to watch, but

773
00:42:33.920 --> 00:42:35.840
<v Speaker 2>I've, it's a good thing to just have on when you're

774
00:42:35.840 --> 00:42:38.320
<v Speaker 2>not sure what to do. It's just comforting, nice to

775
00:42:38.360 --> 00:42:41.079
<v Speaker 2>have on in the background. So I've been sucked into

776
00:42:41.119 --> 00:42:41.639
<v Speaker 2>that as well.

777
00:42:42.199 --> 00:42:44.719
<v Speaker 3>That's funny. My wife and I like literally just started

778
00:42:44.719 --> 00:42:47.400
<v Speaker 3>watching that like a few weeks ago, and for the

779
00:42:47.440 --> 00:42:50.199
<v Speaker 3>same reason, like why are people talking about this so much?

780
00:42:50.239 --> 00:42:53.639
<v Speaker 3>And yeah, now we're like shouting at the screen, that's

781
00:42:53.719 --> 00:42:56.679
<v Speaker 3>not a good bake. Look at the crumb structure.

782
00:42:57.159 --> 00:43:03.079
<v Speaker 2>Yeah, excellent, Ian. What picks do you have for us?

783
00:43:03.599 --> 00:43:06.000
<v Speaker 3>Yeah. So I think last time I brought Mandy, a

784
00:43:06.039 --> 00:43:09.480
<v Speaker 3>really gnarly horror film. So I figured I'd back up and

785
00:43:09.679 --> 00:43:12.639
<v Speaker 3>maybe do something a little different this time and actually

786
00:43:12.880 --> 00:43:15.960
<v Speaker 3>go with something tech related. So we've been working with

787
00:43:16.840 --> 00:43:21.000
<v Speaker 3>Mixpanel recently, which is an amazing service. It's really

788
00:43:21.039 --> 00:43:24.360
<v Speaker 3>helped us because we're trying to add some analytics to

789
00:43:24.599 --> 00:43:28.280
<v Speaker 3>our website and our console, but custom analytics that allow

790
00:43:28.360 --> 00:43:32.320
<v Speaker 3>us to like basically track like basically when somebody enters

791
00:43:32.320 --> 00:43:35.320
<v Speaker 3>the website, what they interact with and like how long,

792
00:43:35.440 --> 00:43:38.599
<v Speaker 3>and like develop our own metrics based on the code

793
00:43:38.639 --> 00:43:41.360
<v Speaker 3>we actually put in the website, and Mixpanel is

794
00:43:41.400 --> 00:43:44.639
<v Speaker 3>amazing at this. What they basically do is they're just

795
00:43:44.880 --> 00:43:46.599
<v Speaker 3>all they do is they say, hey, we're just going

796
00:43:46.639 --> 00:43:49.280
<v Speaker 3>to take events and we're going to represent them in

797
00:43:49.320 --> 00:43:51.960
<v Speaker 3>a whole bunch of different ways. You can filter, you

798
00:43:52.000 --> 00:43:56.599
<v Speaker 3>can form funnels, you can show user flow, you can take

799
00:43:56.599 --> 00:43:59.679
<v Speaker 3>a user and actually watch where they go on the

800
00:43:59.679 --> 00:44:03.480
<v Speaker 3>website and stuff. It's super super helpful for us. We've

801
00:44:03.480 --> 00:44:06.239
<v Speaker 3>actually been we've had a total crush on them since

802
00:44:06.280 --> 00:44:09.400
<v Speaker 3>we started working with their their product because they just

803
00:44:09.760 --> 00:44:12.559
<v Speaker 3>not only is their their UI like so nice to

804
00:44:12.599 --> 00:44:14.800
<v Speaker 3>work with, but it's just made our life. We were

805
00:44:14.800 --> 00:44:17.400
<v Speaker 3>thinking of building this far because basically we, you know, wanted

806
00:44:17.440 --> 00:44:21.119
<v Speaker 3>data analytics to our website, but Google Analytics and that

807
00:44:21.320 --> 00:44:23.840
<v Speaker 3>the sort of general analytics were just not enough. We

808
00:44:23.880 --> 00:44:27.400
<v Speaker 3>needed like really specific ones, right, and we were going

809
00:44:27.480 --> 00:44:30.880
<v Speaker 3>to build it ourselves, and then we stumbled

810
00:44:31.480 --> 00:44:33.719
<v Speaker 3>upon Mixpanel and it was like, oh my god,

811
00:44:33.760 --> 00:44:36.280
<v Speaker 3>this saved us, Like like they made what we could

812
00:44:36.280 --> 00:44:38.280
<v Speaker 3>have made. It would have taken us. We would have

813
00:44:38.320 --> 00:44:40.559
<v Speaker 3>had to start a new company to make what they made,

814
00:44:40.880 --> 00:44:44.599
<v Speaker 3>and it's just it's so it's so helpful, So definitely

815
00:44:44.679 --> 00:44:47.480
<v Speaker 3>for any developer out there, that wants to add like

816
00:44:47.679 --> 00:44:50.719
<v Speaker 3>customer analytics, Mixpanel's really, really helpful.

817
00:44:51.039 --> 00:44:54.840
<v Speaker 2>Awesome, excellent, Well this has been amazing. My last question

818
00:44:54.920 --> 00:44:57.519
<v Speaker 2>for you. If people want to follow you, keep up

819
00:44:57.519 --> 00:44:59.599
<v Speaker 2>with you, what are the best, where are the best places

820
00:44:59.599 --> 00:45:00.280
<v Speaker 2>to go to do that.

821
00:45:00.559 --> 00:45:03.960
<v Speaker 3>Yeah, I mean, let's see, I don't I don't have like,

822
00:45:04.639 --> 00:45:07.719
<v Speaker 3>I don't have like professional socials out there, but I

823
00:45:07.760 --> 00:45:11.079
<v Speaker 3>do have. I am on Medium as Ian Lavery, so

824
00:45:11.119 --> 00:45:13.280
<v Speaker 3>you can read any articles I put up there. You

825
00:45:13.280 --> 00:45:16.639
<v Speaker 3>can follow Picovoice AI on Twitter, and we

826
00:45:16.719 --> 00:45:19.360
<v Speaker 3>have a YouTube channel if you want to check out

827
00:45:19.400 --> 00:45:23.559
<v Speaker 3>my bands, yeah, Fellow Kids and Sleep Circle, check them out.

828
00:45:23.840 --> 00:45:26.800
<v Speaker 2>Yeah, no, excellent, that's great. We'll get those in the

829
00:45:26.840 --> 00:45:29.840
<v Speaker 2>show notes. And yeah, thanks for joining us today. This

830
00:45:29.880 --> 00:45:30.519
<v Speaker 2>is a great chat.

831
00:45:30.920 --> 00:45:32.519
<v Speaker 3>Yeah, thanks for having me. This is great.

832
00:45:32.679 --> 00:45:35.039
<v Speaker 2>Cool, all right, everybody until next week.

833
00:45:35.360 --> 00:45:35.880
<v Speaker 4>See you then.
