WEBVTT

1
00:00:00.120 --> 00:00:04.280
<v Speaker 1>Welcome to the deep dive. Today, we're really getting into

2
00:00:04.320 --> 00:00:07.120
<v Speaker 1>the weeds of artificial intelligence.

3
00:00:06.719 --> 00:00:09.839
<v Speaker 2>Specifically for .NET developers.

4
00:00:09.320 --> 00:00:11.839
<v Speaker 1>Right, exactly. We're looking at how you can actually use

5
00:00:11.880 --> 00:00:14.480
<v Speaker 1>things like speech, language and search.

6
00:00:14.480 --> 00:00:17.559
<v Speaker 2>All powered by Microsoft's Cognitive Services, based on the source

7
00:00:17.559 --> 00:00:18.280
<v Speaker 2>material we've got.

8
00:00:18.359 --> 00:00:21.760
<v Speaker 1>Yeah, so think of this as, well, moving past the buzzwords.

9
00:00:21.239 --> 00:00:23.640
<v Speaker 2>Right, from the big ideas down to the practical tools

10
00:00:23.679 --> 00:00:25.320
<v Speaker 2>you can use to build smarter apps.

11
00:00:25.480 --> 00:00:27.960
<v Speaker 1>Our goal here is, you know, to cut through the noise,

12
00:00:28.120 --> 00:00:30.079
<v Speaker 1>pull out the really key concepts.

13
00:00:29.960 --> 00:00:33.439
<v Speaker 2>Get those aha moments, maybe uncover some surprising bits.

14
00:00:33.399 --> 00:00:35.759
<v Speaker 1>And do it without you needing to, like, write a

15
00:00:35.799 --> 00:00:38.079
<v Speaker 1>single line of code right now.

16
00:00:37.920 --> 00:00:41.799
<v Speaker 2>It's about understanding the essentials, the foundation, the services, so you

17
00:00:41.920 --> 00:00:42.960
<v Speaker 2>know what's possible.

18
00:00:43.240 --> 00:00:47.520
<v Speaker 1>Okay, so let's lay that foundation. AI. It's everywhere, often

19
00:00:47.560 --> 00:00:49.960
<v Speaker 1>sounds like science fiction. What's the real deal?

20
00:00:50.200 --> 00:00:53.240
<v Speaker 2>Well, at its core, it's about building systems that do

21
00:00:53.359 --> 00:00:54.920
<v Speaker 2>things needing human-like intelligence.

22
00:00:54.759 --> 00:00:58.799
<v Speaker 1>But maybe not sentient robots just yet.

23
00:00:58.880 --> 00:01:02.960
<v Speaker 2>Ha, no, it's much more grounded. Think specific tasks, specific

24
00:01:03.039 --> 00:01:04.760
<v Speaker 2>capabilities that are available now.

25
00:01:05.040 --> 00:01:07.920
<v Speaker 1>And getting here wasn't exactly a smooth ride, was it?

26
00:01:08.480 --> 00:01:10.159
<v Speaker 1>There were these AI winters.

27
00:01:10.480 --> 00:01:13.920
<v Speaker 2>That's right, two main periods, sort of mid-seventies to

28
00:01:14.000 --> 00:01:17.439
<v Speaker 2>early eighties and then again late eighties to early nineties.

29
00:01:17.680 --> 00:01:20.680
<v Speaker 2>What happened there? Basically, the hype got way ahead of

30
00:01:20.680 --> 00:01:23.640
<v Speaker 2>the actual technology. Promises were made that just couldn't be

31
00:01:23.680 --> 00:01:25.519
<v Speaker 2>delivered with the computing power back then.

32
00:01:25.200 --> 00:01:29.079
<v Speaker 1>So, uh, funding dried up, progress stalled.

33
00:01:29.120 --> 00:01:32.879
<v Speaker 2>Exactly. People got disillusioned. But then things started picking up again.

34
00:01:33.239 --> 00:01:34.079
<v Speaker 1>Why? What changed?

35
00:01:34.159 --> 00:01:37.920
<v Speaker 2>Computers got faster, cheaper, you know, Moore's law in action.

36
00:01:38.640 --> 00:01:40.879
<v Speaker 2>Suddenly some older ideas became feasible.

37
00:01:40.599 --> 00:01:45.200
<v Speaker 1>And we started seeing these specialized systems achieving big things.

38
00:01:45.400 --> 00:01:48.799
<v Speaker 2>Yes, like IBM's Deep Blue beating Garry Kasparov at

39
00:01:48.879 --> 00:01:49.920
<v Speaker 2>chess in ninety-seven.

40
00:01:50.040 --> 00:01:52.719
<v Speaker 1>That was huge, a machine beating the world champ in

41
00:01:52.760 --> 00:01:53.719
<v Speaker 1>such a complex game.

42
00:01:53.840 --> 00:01:56.040
<v Speaker 2>It showed that focused AI could really excel. But the

43
00:01:56.680 --> 00:01:58.920
<v Speaker 2>current boom, that really kicked off more recently.

44
00:01:58.640 --> 00:02:00.680
<v Speaker 1>Driven by the tech giants investing heavily.

45
00:02:00.879 --> 00:02:04.120
<v Speaker 2>Absolutely, and then you had moments like IBM Watson winning

46
00:02:04.200 --> 00:02:06.120
<v Speaker 2>Jeopardy, yeah, in twenty eleven.

47
00:02:05.920 --> 00:02:09.199
<v Speaker 1>Right, that really showcased natural language processing in a big way.

48
00:02:09.240 --> 00:02:13.520
<v Speaker 2>It did, and after that companies really started productizing these

49
00:02:13.599 --> 00:02:17.120
<v Speaker 2>AI capabilities, making them available as services.

50
00:02:16.719 --> 00:02:19.520
<v Speaker 1>Which brings us to today, where AI feels like it's

51
00:02:19.560 --> 00:02:21.439
<v Speaker 1>baked into so much tech.

52
00:02:21.680 --> 00:02:25.719
<v Speaker 2>Your phone, websites, even video game characters reacting to what

53
00:02:25.800 --> 00:02:26.120
<v Speaker 2>you do.

54
00:02:26.479 --> 00:02:30.199
<v Speaker 1>So it went from big dreams, some setbacks, and now

55
00:02:30.240 --> 00:02:32.599
<v Speaker 1>it's practical, usable tech.

56
00:02:32.479 --> 00:02:35.639
<v Speaker 2>Pretty much, and that practicality is changing how we even

57
00:02:35.719 --> 00:02:36.919
<v Speaker 2>interact with computers.

58
00:02:37.159 --> 00:02:40.199
<v Speaker 1>Let's talk about user interfaces. We started way back with

59
00:02:40.280 --> 00:02:41.560
<v Speaker 1>the command line, the CLI.

60
00:02:41.719 --> 00:02:44.840
<v Speaker 2>Powerful, yeah, if you knew the commands, but super intimidating

61
00:02:44.840 --> 00:02:45.840
<v Speaker 2>for beginners.

62
00:02:45.520 --> 00:02:47.879
<v Speaker 1>Like learning a secret code.

63
00:02:47.919 --> 00:02:51.680
<v Speaker 2>Kind of, yeah. Then came the GUI, the graphical user interface.

64
00:02:51.319 --> 00:02:54.439
<v Speaker 1>Total game changer: windows, icons, the mouse.

65
00:02:54.360 --> 00:02:58.240
<v Speaker 2>Building on work from places like Xerox PARC, then popularized

66
00:02:58.280 --> 00:03:01.120
<v Speaker 2>by Apple and Microsoft. It made computing accessible.

67
00:03:01.159 --> 00:03:02.759
<v Speaker 1>You could see what you were doing.

68
00:03:03.080 --> 00:03:06.840
<v Speaker 2>Exactly. But now there's another shift happening, towards conversation.

69
00:03:06.560 --> 00:03:09.520
<v Speaker 1>The conversational user interface, or CUI.

70
00:03:09.639 --> 00:03:12.080
<v Speaker 2>Right. The idea is you just talk or type to the system,

71
00:03:12.199 --> 00:03:14.759
<v Speaker 2>like messaging a friend. No clicking through menus.

72
00:03:14.639 --> 00:03:17.719
<v Speaker 1>So ordering pizza is just typing, "Get me a large pepperoni."

73
00:03:18.199 --> 00:03:22.280
<v Speaker 2>That's the goal, simple natural interaction. Messaging apps have made

74
00:03:22.360 --> 00:03:23.960
<v Speaker 2>us really comfortable with this.

75
00:03:23.479 --> 00:03:25.120
<v Speaker 1>Okay, but that sounds like it needs some

76
00:03:25.159 --> 00:03:26.479
<v Speaker 1>serious smarts behind it.

77
00:03:26.479 --> 00:03:30.879
<v Speaker 2>It absolutely does. This is where AI is crucial, specifically

78
00:03:31.039 --> 00:03:33.960
<v Speaker 2>natural language understanding, NLU.

79
00:03:33.719 --> 00:03:36.599
<v Speaker 1>Because the system has to figure out what you actually mean,

80
00:03:37.120 --> 00:03:38.120
<v Speaker 1>not just what you typed.

81
00:03:38.360 --> 00:03:42.159
<v Speaker 2>Precisely, it needs NLU in the back end to interpret

82
00:03:42.199 --> 00:03:43.520
<v Speaker 2>that conversational input.

83
00:03:43.680 --> 00:03:48.439
<v Speaker 1>So "what's the weather" and "forecast for today" mean the

84
00:03:48.439 --> 00:03:49.439
<v Speaker 1>same thing to the system?

85
00:03:49.680 --> 00:03:52.960
<v Speaker 2>A good NLU system should understand that, yes. It identifies

86
00:03:53.039 --> 00:03:54.960
<v Speaker 2>the user's goal, the intent.

87
00:03:55.240 --> 00:03:57.000
<v Speaker 1>It pulls out the important bits of information.

88
00:03:57.199 --> 00:04:00.759
<v Speaker 2>Those are the entities. So, "weather in London tomorrow": the

89
00:04:00.759 --> 00:04:03.680
<v Speaker 2>intent is get weather, entities are London and tomorrow.

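NOTE
Editor's sketch of the intent-and-entities idea in the weather example above, assuming a toy keyword matcher.
The names NluResult, CITIES, DAYS and parse_utterance are hypothetical and not part of any Microsoft API.
A real NLU service learns from labeled utterances rather than fixed keyword lists like these.
```python
from dataclasses import dataclass, field
@dataclass
class NluResult:
    intent: str
    entities: dict = field(default_factory=dict)
CITIES = {"london", "paris", "seattle"}  # hypothetical location vocabulary
DAYS = {"today", "tomorrow"}  # hypothetical date vocabulary
def parse_utterance(text: str) -> NluResult:
    """Map an utterance to an intent plus entities using simple keyword rules."""
    words = text.lower().replace("?", "").split()
    intent = "GetWeather" if {"weather", "forecast"} & set(words) else "None"
    entities = {}
    for word in words:
        if word in CITIES:
            entities["location"] = word
        elif word in DAYS:
            entities["date"] = word
    return NluResult(intent, entities)
result = parse_utterance("weather in London tomorrow")
print(result.intent, result.entities)
```
Both "what's the weather" and "forecast for today" land on the same GetWeather intent here, which is the behavior the speakers describe.
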
90
00:04:03.840 --> 00:04:06.960
<v Speaker 1>Seems intuitive for us, but you mentioned CUIs aren't perfect. There are challenges?

91
00:04:07.039 --> 00:04:10.400
<v Speaker 2>Oh, definitely. They struggle with really complex,

92
00:04:10.520 --> 00:04:12.840
<v Speaker 2>nuanced conversations, and there are risks.

93
00:04:13.120 --> 00:04:17.519
<v Speaker 1>Remember Tay, Microsoft's Twitter bot? Oh yeah, that went sideways fast.

94
00:04:17.800 --> 00:04:22.000
<v Speaker 2>It learned from interactions, but unfortunately it learned toxic stuff

95
00:04:22.160 --> 00:04:25.160
<v Speaker 2>very quickly and started spewing offensive tweets.

96
00:04:25.360 --> 00:04:27.199
<v Speaker 1>Had to be shut down almost immediately.

97
00:04:27.319 --> 00:04:30.360
<v Speaker 2>A really stark reminder that AI learning from the real

98
00:04:30.399 --> 00:04:33.480
<v Speaker 2>world needs careful controls. You can't just let it loose

99
00:04:33.480 --> 00:04:34.399
<v Speaker 2>without safeguards.

100
00:04:34.600 --> 00:04:38.279
<v Speaker 1>So maybe a mix is better for now, combining CUI and GUI?

101
00:04:38.560 --> 00:04:41.879
<v Speaker 2>Yeah, the source suggests a hybrid approach often makes sense.

102
00:04:42.360 --> 00:04:46.160
<v Speaker 2>Use conversation for simple things, stick to graphical interfaces for

103
00:04:46.240 --> 00:04:47.439
<v Speaker 2>more complex tasks.

104
00:04:47.519 --> 00:04:51.079
<v Speaker 1>Okay, let's dig into that NLU piece more. It's fundamental.

105
00:04:51.720 --> 00:04:54.800
<v Speaker 1>Why is it considered an AI-hard problem?

106
00:04:54.639 --> 00:04:58.240
<v Speaker 2>Because human language is just incredibly complex and subtle. Getting

107
00:04:58.279 --> 00:05:01.560
<v Speaker 2>a machine to grasp it properly isn't just one algorithm.

108
00:05:01.720 --> 00:05:04.560
<v Speaker 1>It's like computer vision or machine translation, it requires lots of

109
00:05:04.560 --> 00:05:05.800
<v Speaker 1>different techniques working together.

110
00:05:05.959 --> 00:05:09.639
<v Speaker 2>Exactly, there are multiple layers of difficulty. Like what? Well, first,

111
00:05:09.639 --> 00:05:13.680
<v Speaker 2>there's syntax, the grammar, the structure of sentences. Machines need

112
00:05:13.720 --> 00:05:14.680
<v Speaker 2>to parse that correctly.

113
00:05:14.720 --> 00:05:16.439
<v Speaker 1>Okay, sentence rules. Makes sense.

114
00:05:16.600 --> 00:05:20.560
<v Speaker 2>Then semantics. That's the meaning of words and sentences, synonyms,

115
00:05:20.600 --> 00:05:21.879
<v Speaker 2>words with multiple meanings.

116
00:05:22.160 --> 00:05:26.319
<v Speaker 1>"I like apples" versus "I'm fond of apples": same meaning,

117
00:05:26.480 --> 00:05:27.160
<v Speaker 1>different words.

118
00:05:27.319 --> 00:05:30.399
<v Speaker 2>Right, The machine needs to get that underlying concept.

119
00:05:30.600 --> 00:05:32.439
<v Speaker 1>Sounds tricky. What's next?

120
00:05:32.279 --> 00:05:36.800
<v Speaker 2>Pragmatics. This is maybe the toughest bit. It's understanding

121
00:05:36.920 --> 00:05:40.759
<v Speaker 2>the implied meaning, the context, what's not being explicitly said. Exactly,

122
00:05:40.800 --> 00:05:42.480
<v Speaker 2>if I say, "Wow, it's hot in here," I might

123
00:05:42.519 --> 00:05:46.319
<v Speaker 2>actually mean, "Can you open a window?" The machine needs

124
00:05:46.519 --> 00:05:48.720
<v Speaker 2>situational awareness.

125
00:05:48.399 --> 00:05:51.519
<v Speaker 1>Which computers usually lack. They don't have our common sense

126
00:05:51.600 --> 00:05:53.199
<v Speaker 1>or world knowledge.

127
00:05:53.680 --> 00:05:55.319
<v Speaker 2>Precisely. And then you've got just plain ambiguity.

128
00:05:55.319 --> 00:05:57.480
<v Speaker 1>Words meaning different things.

129
00:05:57.360 --> 00:06:00.720
<v Speaker 2>Yeah, like "bank": riverbank or financial bank? Or sentences that

130
00:06:00.759 --> 00:06:03.000
<v Speaker 2>can be read multiple ways, like "I saw a man

131
00:06:03.040 --> 00:06:04.759
<v Speaker 2>on a hill with a telescope."

132
00:06:04.519 --> 00:06:06.800
<v Speaker 1>The classic. Who has the telescope?

133
00:06:06.879 --> 00:06:10.680
<v Speaker 2>Right. And finally, just the sheer variation in language: spoken versus

134
00:06:10.680 --> 00:06:15.000
<v Speaker 2>written, dialects, slang, typos. It's messy, very messy for a

135
00:06:15.000 --> 00:06:17.240
<v Speaker 2>machine to handle consistently.

136
00:06:16.639 --> 00:06:22.160
<v Speaker 1>So syntax, semantics, pragmatics, ambiguity, variation. Quite a challenge. Were

137
00:06:22.160 --> 00:06:23.759
<v Speaker 1>there early attempts to crack this?

138
00:06:24.240 --> 00:06:27.800
<v Speaker 2>Yeah, some famous ones. ELIZA, back in the sixties, mimicked

139
00:06:27.839 --> 00:06:31.680
<v Speaker 2>a therapist using pattern matching. It seemed smart, but didn't really understand.

140
00:06:31.560 --> 00:06:33.199
<v Speaker 1>More like clever tricks.

141
00:06:33.360 --> 00:06:37.240
<v Speaker 2>Kind of. A bigger step was SHRDLU, around nineteen seventy.

142
00:06:37.480 --> 00:06:39.199
<v Speaker 1>SHRDLU? What did it do?

143
00:06:39.560 --> 00:06:43.279
<v Speaker 2>It operated in a tiny virtual world of blocks. You

144
00:06:43.319 --> 00:06:45.480
<v Speaker 2>could tell it "pick up the blue pyramid" or ask

145
00:06:45.600 --> 00:06:49.519
<v Speaker 2>questions about the blocks. And it understood? Within that very limited world, yes,

146
00:06:49.639 --> 00:06:53.199
<v Speaker 2>remarkably well. It showed that NLU was possible if you

147
00:06:53.279 --> 00:06:55.279
<v Speaker 2>constrain the domain significantly.

148
00:06:55.800 --> 00:06:59.759
<v Speaker 1>Okay, so NLU is hard but vital. How do developers

149
00:06:59.839 --> 00:07:02.680
<v Speaker 1>like our listeners actually use it today without building it

150
00:07:02.680 --> 00:07:03.279
<v Speaker 1>all themselves.

151
00:07:03.360 --> 00:07:06.959
<v Speaker 2>That's where cloud services come in, like Microsoft's LUIS, the Language

152
00:07:07.040 --> 00:07:09.160
<v Speaker 2>Understanding Intelligent Service. LUIS.

153
00:07:09.240 --> 00:07:11.560
<v Speaker 1>So it's like NLU as a service?

154
00:07:11.600 --> 00:07:13.360
<v Speaker 2>Pretty much. You don't need to be the deep learning expert. Your

155
00:07:13.439 --> 00:07:16.199
<v Speaker 2>job is mainly to train it for your specific application.

156
00:07:16.319 --> 00:07:17.399
<v Speaker 1>How does that training work?

157
00:07:17.600 --> 00:07:21.680
<v Speaker 2>You feed it example sentences, they're called utterances, that your users might say.

158
00:07:21.560 --> 00:07:24.839
<v Speaker 1>Things like "find me a nearby Italian restaurant."

159
00:07:24.720 --> 00:07:27.480
<v Speaker 2>Exactly. And for each utterance you tell LUIS the

160
00:07:27.600 --> 00:07:30.600
<v Speaker 2>user's goal, the intent, like find restaurant, and you

161
00:07:30.680 --> 00:07:32.079
<v Speaker 2>label the key info, the entities.

162
00:07:32.279 --> 00:07:37.720
<v Speaker 1>So Italian would be cuisine type, nearby implies location?

163
00:07:37.959 --> 00:07:41.160
<v Speaker 2>Precisely. You provide lots of examples, LUIS learns from them using

164
00:07:41.199 --> 00:07:44.800
<v Speaker 2>machine learning algorithms. Then when a new sentence comes in,

165
00:07:44.879 --> 00:07:47.399
<v Speaker 2>it predicts the intent and extracts the entities.

166
00:07:47.839 --> 00:07:49.639
<v Speaker 1>What are the main bits you configure in LUIS?

167
00:07:50.079 --> 00:07:52.800
<v Speaker 2>You define your intents, the actions users can take, and

168
00:07:52.839 --> 00:07:55.079
<v Speaker 2>you define your entities, the data points you need.

169
00:07:55.480 --> 00:07:57.040
<v Speaker 1>Are there different types of entities?

170
00:07:57.319 --> 00:07:59.959
<v Speaker 2>Yes, quite a few. Simple entities are ones you define,

172
00:08:04.120 --> 00:08:05.399
<v Speaker 2>like product category. But LUIS also has prebuilt entities,

172
00:08:04.120 --> 00:08:05.399
<v Speaker 2>which are super useful.

173
00:08:05.560 --> 00:08:06.199
<v Speaker 1>What do they cover?

174
00:08:06.439 --> 00:08:10.839
<v Speaker 2>Common stuff like dates, times, numbers, locations, email addresses, percentages.

175
00:08:10.959 --> 00:08:13.199
<v Speaker 2>Saves you a lot of effort. It already knows how

176
00:08:13.240 --> 00:08:16.040
<v Speaker 2>to recognize "next Tuesday at three pm."

177
00:08:16.240 --> 00:08:17.279
<v Speaker 1>That's handy. What else?

178
00:08:17.439 --> 00:08:20.920
<v Speaker 2>You can create composite entities to group related entities like

179
00:08:21.000 --> 00:08:25.439
<v Speaker 2>an order entity containing item and quantity, and hierarchical entities

180
00:08:25.480 --> 00:08:29.199
<v Speaker 2>for parent child relationships like person name having first name

181
00:08:29.240 --> 00:08:30.439
<v Speaker 2>and last name.

182
00:08:30.360 --> 00:08:33.200
<v Speaker 1>That helps structure the extracted data. What about phrase lists?

183
00:08:33.799 --> 00:08:37.480
<v Speaker 2>Think of them as giving LUIS hints. You list words

184
00:08:37.559 --> 00:08:41.639
<v Speaker 2>or phrases that are strong indicators for certain intents or entities,

185
00:08:42.039 --> 00:08:44.679
<v Speaker 2>like a list of all your product names or synonyms

186
00:08:44.759 --> 00:08:46.039
<v Speaker 2>for "book a meeting."

187
00:08:46.200 --> 00:08:49.000
<v Speaker 1>It helps boost the signal for important terms.

188
00:08:49.039 --> 00:08:51.399
<v Speaker 2>Exactly. And then there's active learning. This is really important after you launch.

189
00:08:51.440 --> 00:08:52.799
<v Speaker 1>What does that do?

190
00:08:53.080 --> 00:08:56.720
<v Speaker 2>LUIS identifies utterances it wasn't very sure about. It shows

191
00:08:56.799 --> 00:08:59.799
<v Speaker 2>them to you. You clarify the correct intents and entities,

192
00:09:00.240 --> 00:09:03.919
<v Speaker 2>and that feedback helps retrain and improve the model over time.

193
00:09:04.360 --> 00:09:07.639
<v Speaker 1>So the model gets smarter based on real user interactions.

194
00:09:07.879 --> 00:09:10.120
<v Speaker 2>Correct. It's a continuous improvement cycle.

195
00:09:10.120 --> 00:09:13.519
<v Speaker 1>And the overall flow for an app using LUIS?

196
00:09:13.399 --> 00:09:15.919
<v Speaker 2>Your app gets the user's text, sends it to the

197
00:09:15.960 --> 00:09:19.279
<v Speaker 2>LUIS API. LUIS sends back JSON with the predicted intent

198
00:09:19.440 --> 00:09:22.039
<v Speaker 2>and entities. Your app uses that info to do the right thing.

199
00:09:21.960 --> 00:09:25.840
<v Speaker 1>Like calling another API, querying a database, whatever the action is?

200
00:09:25.600 --> 00:09:28.639
<v Speaker 2>Exactly. It integrates nicely with things like

201
00:09:28.720 --> 00:09:31.399
<v Speaker 2>the Microsoft Bot Framework for building chatbots.

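NOTE
Editor's sketch of the app flow just described: receive JSON with a predicted intent and entities, then act on it.
The JSON shape, the field names topIntent, type and value, and the handler names are assumptions for illustration only.
They are not the documented LUIS response schema, so check the service reference before relying on any field.
```python
import json
# Hypothetical response body, standing in for what an NLU endpoint might return.
SAMPLE_RESPONSE = '{"query": "find me a nearby Italian restaurant", "topIntent": "FindRestaurant", "entities": [{"type": "CuisineType", "value": "Italian"}]}'
def find_restaurant(entities: dict) -> str:
    """Pretend to call a restaurant search using the extracted cuisine."""
    cuisine = entities.get("CuisineType", "any")
    return f"Searching for {cuisine} restaurants"
HANDLERS = {"FindRestaurant": find_restaurant}  # intent name mapped to app action
def handle_response(raw_json: str) -> str:
    """Pick a handler from the predicted intent and pass it the entities."""
    data = json.loads(raw_json)
    entities = {item["type"]: item["value"] for item in data.get("entities", [])}
    handler = HANDLERS.get(data.get("topIntent"))
    if handler is None:
        return "Sorry, I did not understand that."
    return handler(entities)
reply = handle_response(SAMPLE_RESPONSE)
print(reply)
```
Keeping the intent-to-handler mapping in one dictionary is a design choice that makes it easy to add new intents without touching the dispatch logic.
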
202
00:09:31.679 --> 00:09:36.120
<v Speaker 1>Okay. LUIS handles the core understanding. What other text analysis

203
00:09:36.159 --> 00:09:38.639
<v Speaker 1>tools are there in Cognitive Services?

204
00:09:38.679 --> 00:09:42.559
<v Speaker 2>Several useful ones. There's the Bing Spell Check API. Just

205
00:09:42.600 --> 00:09:46.159
<v Speaker 2>basic spell check? It's smarter than that. It's contextual. It

206
00:09:46.240 --> 00:09:49.799
<v Speaker 2>understands that booking is correct in booking a flight, but

207
00:09:49.879 --> 00:09:53.919
<v Speaker 2>maybe not somewhere else. It gets proper nouns like Microsoft,

208
00:09:54.399 --> 00:09:56.840
<v Speaker 2>even if slightly misspelled.

209
00:09:56.879 --> 00:09:59.799
<v Speaker 1>Ah, so it considers the surrounding words. Useful for cleaning up

210
00:09:59.879 --> 00:10:01.159
<v Speaker 1>user input.

211
00:10:01.519 --> 00:10:04.919
<v Speaker 2>Definitely. It even handles some slang and common brand name misspellings.

212
00:10:05.120 --> 00:10:06.480
<v Speaker 1>What else in the text suite?

213
00:10:06.639 --> 00:10:10.159
<v Speaker 2>The Text Analytics API bundles a few things. Language detection

214
00:10:10.279 --> 00:10:13.320
<v Speaker 2>figures out what language the text is in, useful for routing

215
00:10:13.320 --> 00:10:15.559
<v Speaker 2>support tickets or filtering content.

216
00:10:15.360 --> 00:10:17.639
<v Speaker 1>And sentiment analysis? That seems really popular.

217
00:10:17.759 --> 00:10:21.000
<v Speaker 2>It is. Analyzing if text is positive, negative, or neutral.

218
00:10:21.360 --> 00:10:25.200
<v Speaker 2>Companies use it constantly for customer reviews, social media monitoring.

219
00:10:24.799 --> 00:10:26.960
<v Speaker 1>Getting a pulse on customer opinion at scale.

220
00:10:26.759 --> 00:10:29.360
<v Speaker 2>Right, it usually gives a score, like 0.9 for

221
00:10:29.559 --> 00:10:31.720
<v Speaker 2>very positive, 0.1 for very negative.

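NOTE
Editor's sketch of how an app might use a 0-to-1 sentiment score like the 0.9 and 0.1 values mentioned above.
The function name label_sentiment, the 0.4 and 0.6 cutoffs, and the review scores are assumptions chosen for illustration.
The service itself only returns the score; where to draw the lines is an application decision.
```python
def label_sentiment(score: float, negative_below: float = 0.4, positive_above: float = 0.6) -> str:
    """Turn a 0-to-1 sentiment score into a coarse label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > positive_above:
        return "positive"
    if score < negative_below:
        return "negative"
    return "neutral"
reviews = {"Love this app": 0.9, "It is fine": 0.5, "Terrible support": 0.1}  # made-up scores
for text, score in reviews.items():
    print(f"{text}: {label_sentiment(score)}")
```
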
222
00:10:31.759 --> 00:10:33.120
<v Speaker 1>Does it do summarization too?

223
00:10:33.480 --> 00:10:36.759
<v Speaker 2>Not exactly summarization, but key phrase extraction pulls out the

224
00:10:36.799 --> 00:10:40.679
<v Speaker 2>main talking points, the important noun phrases, and topic detection

225
00:10:40.840 --> 00:10:45.000
<v Speaker 2>can group large amounts of text like reviews, into underlying themes.

226
00:10:45.279 --> 00:10:48.039
<v Speaker 1>The source also mentioned something called the Web Language Model,

227
00:10:48.279 --> 00:10:48.960
<v Speaker 1>or WebLM.

228
00:10:49.240 --> 00:10:52.919
<v Speaker 2>Yeah, that's a language model trained on, well, huge amounts

229
00:10:52.960 --> 00:10:56.639
<v Speaker 2>of web data from Bing. It understands common word sequences

230
00:10:56.639 --> 00:10:57.600
<v Speaker 2>and probabilities.

231
00:10:57.679 --> 00:10:58.200
<v Speaker 1>What's that used for?

232
00:10:58.279 --> 00:11:02.480
<v Speaker 2>Things like word breaking, splitting "buyticketsnow" into

233
00:11:02.919 --> 00:11:07.440
<v Speaker 2>"buy tickets now." Calculating joint probability: how likely is the

234
00:11:07.480 --> 00:11:12.080
<v Speaker 2>phrase "natural language processing" versus, say, "natural language pineapple"?

235
00:11:12.240 --> 00:11:14.320
<v Speaker 1>Okay, measuring how natural a phrase sounds.

236
00:11:14.039 --> 00:11:17.919
<v Speaker 2>And conditional probability, predicting the next word: given "artificial," how

237
00:11:18.080 --> 00:11:21.679
<v Speaker 2>likely is "intelligence" to follow? This powers things like autocorrect

238
00:11:21.799 --> 00:11:22.879
<v Speaker 2>and text suggestions.

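NOTE
Editor's sketch of the conditional probability idea above: given "artificial", how likely is "intelligence" to follow?
It assumes a tiny made-up corpus and plain bigram counts; CORPUS and conditional_probability are hypothetical names.
A web-scale language model is trained on vastly more text and uses smoothing, which this sketch leaves out.
```python
from collections import Counter
CORPUS = "artificial intelligence is everywhere and artificial intelligence needs data but artificial flavors do not".split()
# Count each word that has a successor, and each adjacent word pair.
unigrams = Counter(CORPUS[:-1])
bigrams = Counter(zip(CORPUS, CORPUS[1:]))
def conditional_probability(previous: str, word: str) -> float:
    """Estimate P(word | previous) as count(previous word) / count(previous)."""
    if unigrams[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / unigrams[previous]
p_intelligence = conditional_probability("artificial", "intelligence")
p_pineapple = conditional_probability("artificial", "pineapple")
print(round(p_intelligence, 3), p_pineapple)
```
In this toy corpus "artificial" appears three times and is followed by "intelligence" twice, so the estimate is two thirds, while an unseen pair scores zero.
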
239
00:11:22.879 --> 00:11:25.799
<v Speaker 1>Wow, quite a toolbox for text. Now, what about turning

240
00:11:25.840 --> 00:11:27.639
<v Speaker 1>speech into text and back.

241
00:11:27.720 --> 00:11:30.039
<v Speaker 2>That's where the speech APIs come in. Speech to text,

242
00:11:30.279 --> 00:11:34.159
<v Speaker 2>STT, converts audio to text. Text to speech, TTS, does

243
00:11:34.200 --> 00:11:34.679
<v Speaker 2>the reverse.

244
00:11:34.720 --> 00:11:36.279
<v Speaker 1>How does STT work, generally?

245
00:11:36.559 --> 00:11:40.080
<v Speaker 2>It analyzes the audio signal, breaks it down into basic

246
00:11:40.159 --> 00:11:44.440
<v Speaker 2>sound units called phonemes, and uses acoustic and language models

247
00:11:44.480 --> 00:11:47.519
<v Speaker 2>to figure out the most likely sequence of words. Usually

248
00:11:47.559 --> 00:11:48.799
<v Speaker 2>gives a confidence score too.

249
00:11:48.559 --> 00:11:50.440
<v Speaker 1>And Microsoft's offerings?

250
00:11:50.679 --> 00:11:53.960
<v Speaker 2>There are standard speech APIs, but the really interesting one

251
00:11:54.039 --> 00:11:59.840
<v Speaker 2>is the Custom Speech Service, CRIS. Custom, how so? CRIS lets

252
00:11:59.879 --> 00:12:03.200
<v Speaker 2>you adapt the speech recognition model to your specific scenario.

253
00:12:03.360 --> 00:12:03.879
<v Speaker 1>What does that mean?

254
00:12:04.039 --> 00:12:07.159
<v Speaker 2>You can upload your own audio data and accurate transcripts.

255
00:12:07.519 --> 00:12:09.720
<v Speaker 2>If your app will be used in a noisy factory

256
00:12:09.919 --> 00:12:13.320
<v Speaker 2>or involves lots of specific jargon or product names, you

257
00:12:13.360 --> 00:12:15.879
<v Speaker 2>can train a model that's much better at understanding that

258
00:12:15.960 --> 00:12:18.279
<v Speaker 2>specific audio environment and vocabulary.

259
00:12:18.480 --> 00:12:21.480
<v Speaker 1>Ah, so you tailor it to overcome background noise or

260
00:12:21.519 --> 00:12:23.200
<v Speaker 1>specialized language.

261
00:12:23.240 --> 00:12:26.120
<v Speaker 2>Exactly. It can make a huge difference in accuracy for specific

262
00:12:26.200 --> 00:12:28.320
<v Speaker 2>use cases compared to a general purpose model.

263
00:12:28.360 --> 00:12:30.360
<v Speaker 1>And what about recognizing who is talking?

264
00:12:30.799 --> 00:12:34.879
<v Speaker 2>That's speaker recognition. Two main types. Verification confirms if a

265
00:12:34.960 --> 00:12:38.480
<v Speaker 2>voice matches a known person, like voice login. It usually

266
00:12:38.559 --> 00:12:41.519
<v Speaker 2>needs enrollment where the person says specific phrases.

267
00:12:41.559 --> 00:12:42.960
<v Speaker 1>Okay, one to one matching.

268
00:12:42.879 --> 00:12:47.159
<v Speaker 2>And identification, which tries to figure out which speaker from

269
00:12:47.200 --> 00:12:50.200
<v Speaker 2>a pre enrolled group is the one talking. Useful for

270
00:12:50.240 --> 00:12:51.960
<v Speaker 2>transcription that notes who said what.

271
00:12:51.679 --> 00:12:54.480
<v Speaker 1>And for the other way, text to speech, making

272
00:12:54.519 --> 00:12:55.840
<v Speaker 1>the computer talk naturally?

273
00:12:55.879 --> 00:12:59.879
<v Speaker 2>Yeah. TTS takes text and generates audio. The source mentions SSML,

274
00:13:00.039 --> 00:13:03.320
<v Speaker 2>Speech Synthesis Markup Language. What's that for? It lets you

275
00:13:03.360 --> 00:13:07.039
<v Speaker 2>control how the text is spoken. Things like emphasis, pitch,

276
00:13:07.440 --> 00:13:12.519
<v Speaker 2>speaking rate, pauses, even pronunciation of specific words. It helps make

277
00:13:12.559 --> 00:13:14.240
<v Speaker 2>the synthesized voice sound less robotic.

278
00:13:14.320 --> 00:13:17.320
<v Speaker 1>Speech tech feels like it's gotten way better recently.

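NOTE
Editor's sketch of what SSML markup for emphasis, pauses, rate and pitch can look like, built as a Python string.
The element names come from the W3C SSML specification; the sentence text is made up, and which attributes a given
text-to-speech service honors is an assumption to verify. The parse step only checks that the markup is well-formed XML.
```python
import xml.etree.ElementTree as ET
SSML_NS = "http://www.w3.org/2001/10/synthesis"
ssml = (
    f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
    'Your order is <emphasis level="strong">confirmed</emphasis>.'
    '<break time="500ms"/>'
    '<prosody rate="slow" pitch="high">Thank you for calling.</prosody>'
    '</speak>'
)
root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
# Strip the "{namespace}" prefix ElementTree adds to each tag name.
tags = [element.tag.split("}")[1] for element in root.iter()]
print(tags)
```
The resulting string is what an app would send to a synthesis endpoint in place of plain text.
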
279
00:13:17.399 --> 00:13:21.000
<v Speaker 2>It really has, largely thanks to deep learning models, but

280
00:13:21.080 --> 00:13:24.039
<v Speaker 2>accuracy is still a challenge, especially in noisy places or

281
00:13:24.080 --> 00:13:27.240
<v Speaker 2>with strong accents. The source notes that even with claims

282
00:13:27.279 --> 00:13:30.080
<v Speaker 2>of low error rates like Google's four point nine percent,

283
00:13:30.399 --> 00:13:32.799
<v Speaker 2>that's often in ideal conditions.

284
00:13:32.279 --> 00:13:34.879
<v Speaker 1>Which is why that custom speech service is valuable for

285
00:13:34.919 --> 00:13:36.840
<v Speaker 1>bridging the gap in real world scenarios.

286
00:13:36.879 --> 00:13:37.440
<v Speaker 2>Precisely.

287
00:13:37.720 --> 00:13:41.120
<v Speaker 1>Okay, shifting focus again, let's talk search and recommendations, making

288
00:13:41.159 --> 00:13:42.240
<v Speaker 1>information findable.

289
00:13:42.480 --> 00:13:45.600
<v Speaker 2>Right, we have explicit search. You type a query, but

290
00:13:45.840 --> 00:13:48.759
<v Speaker 2>AI enables implicit search.

291
00:13:48.639 --> 00:13:50.360
<v Speaker 1>Where the system anticipates what you need.

292
00:13:50.639 --> 00:13:54.240
<v Speaker 2>Yeah, like Amazon showing "customers who bought this also bought"

293
00:13:54.519 --> 00:13:56.480
<v Speaker 2>or related items. It's proactive.

294
00:13:56.600 --> 00:13:59.000
<v Speaker 1>The source mentioned the three Ps of search.

295
00:13:58.919 --> 00:14:04.840
<v Speaker 2>Right: pervasive, search everywhere; predictive, anticipating needs; proactive, giving answers before

296
00:14:04.879 --> 00:14:05.399
<v Speaker 2>you ask.

297
00:14:05.879 --> 00:14:09.720
<v Speaker 1>That's the ideal, and Microsoft has Bing APIs for web,

298
00:14:09.879 --> 00:14:14.480
<v Speaker 1>image, news search. Let's focus on recommendations though. How do

299
00:14:14.600 --> 00:14:15.159
<v Speaker 1>those work?

300
00:14:15.440 --> 00:14:18.320
<v Speaker 2>The main goal is usually to increase sales or engagement

301
00:14:18.360 --> 00:14:19.879
<v Speaker 2>by suggesting relevant things.

302
00:14:20.000 --> 00:14:20.799
<v Speaker 1>What kinds are there?

303
00:14:20.919 --> 00:14:24.679
<v Speaker 2>Frequently bought together, FBT, is common: items often bought in

304
00:14:24.720 --> 00:14:27.159
<v Speaker 2>the same transaction, like a camera and a memory card.

305
00:14:27.279 --> 00:14:29.840
<v Speaker 2>Makes sense. Then item to item, which is a type

306
00:14:29.840 --> 00:14:33.320
<v Speaker 2>of collaborative filtering. It suggests items based on what other

307
00:14:33.480 --> 00:14:37.120
<v Speaker 2>similar users liked: "People who viewed this also viewed."

308
00:14:36.840 --> 00:14:38.080
<v Speaker 1>Based on collective behavior.

309
00:14:38.200 --> 00:14:40.679
<v Speaker 2>And user to item, which is more personalized. It looks

310
00:14:40.679 --> 00:14:44.080
<v Speaker 2>at your past history, views, purchases to recommend things specifically

311
00:14:44.080 --> 00:14:44.320
<v Speaker 2>for you.

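NOTE
Editor's sketch of the frequently-bought-together idea described above, assuming simple pair counting over baskets.
TRANSACTIONS, pair_counts and bought_together are hypothetical names, and the baskets are made-up data.
This is plain co-occurrence counting, not the algorithm behind the hosted recommendations service.
```python
from collections import Counter
from itertools import combinations
TRANSACTIONS = [
    {"camera", "memory card"},
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"laptop", "mouse"},
]
# Count how often each unordered pair of items shares a basket.
pair_counts = Counter()
for basket in TRANSACTIONS:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
def bought_together(item: str, top_n: int = 2) -> list:
    """Return the items most often bought in the same transaction as item."""
    scores = Counter()
    for (first, second), count in pair_counts.items():
        if first == item:
            scores[second] += count
        elif second == item:
            scores[first] += count
    return [name for name, _ in scores.most_common(top_n)]
suggestions = bought_together("camera")
print(suggestions)
```
With these baskets the memory card co-occurs with the camera three times and the tripod once, so the memory card ranks first.
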
312
00:14:44.799 --> 00:14:46.919
<v Speaker 1>How do you build these using the Microsoft service?

313
00:14:47.200 --> 00:14:51.519
<v Speaker 2>You need data. Two main types: catalog data, which is

314
00:14:51.639 --> 00:14:55.799
<v Speaker 2>info about your items, products, articles, whatever, including features, and

315
00:14:56.039 --> 00:15:00.519
<v Speaker 2>usage data, records of user interactions like clicks, purchases, ratings.

316
00:15:00.600 --> 00:15:01.200
<v Speaker 1>So you feed it

317
00:15:01.240 --> 00:15:02.840
<v Speaker 1>your product list and how people interact with it?

318
00:15:02.559 --> 00:15:06.120
<v Speaker 2>Exactly. Then you train different recommendation models, called

319
00:15:06.159 --> 00:15:09.320
<v Speaker 2>builds, on that data. There are specific builds for FBT

320
00:15:09.559 --> 00:15:12.360
<v Speaker 2>and others, like SAR, that handle item to item and user

321
00:15:12.399 --> 00:15:13.000
<v Speaker 2>to item.

322
00:15:12.960 --> 00:15:15.679
<v Speaker 1>And the quality of recommendations depends heavily on that input data?

323
00:15:15.720 --> 00:15:19.519
<v Speaker 2>Absolutely. Good data, good quantity leads to better recommendations.

324
00:15:19.559 --> 00:15:22.120
<v Speaker 1>The source also mentioned ranking and offline evaluation.

325
00:15:22.519 --> 00:15:25.360
<v Speaker 2>Ranking is crucial: how do you order the results? It's

326
00:15:25.360 --> 00:15:28.279
<v Speaker 2>often based on relevance scores derived from usage data and

327
00:15:28.320 --> 00:15:32.320
<v Speaker 2>item features. Offline evaluation lets you test your trained models

328
00:15:32.320 --> 00:15:35.440
<v Speaker 2>on historical data before deploying them to see which build

329
00:15:35.519 --> 00:15:36.360
<v Speaker 2>performs best.

330
00:15:36.679 --> 00:15:39.200
<v Speaker 1>Okay, so we've covered a lot of ground from AI

331
00:15:39.200 --> 00:15:44.519
<v Speaker 1>history to interfaces, NLU with LUIS, text analysis, speech

332
00:15:44.600 --> 00:15:46.919
<v Speaker 1>tech, search, recommendations.

333
00:15:47.000 --> 00:15:49.360
<v Speaker 2>It's quite a journey, but the key takeaway, I think,

334
00:15:49.440 --> 00:15:53.200
<v Speaker 2>is how these advanced AI capabilities are becoming accessible.

335
00:15:53.759 --> 00:15:56.759
<v Speaker 1>Right. Things that needed huge research teams are now available as

336
00:15:56.799 --> 00:15:59.360
<v Speaker 1>APIs, like Cognitive Services.

337
00:15:59.159 --> 00:16:02.120
<v Speaker 2>Especially for developers already in the .NET world. It

338
00:16:02.240 --> 00:16:05.039
<v Speaker 2>lowers the barrier significantly to adding intelligence.

339
00:16:05.080 --> 00:16:07.120
<v Speaker 1>It makes you really think about where AI is already

340
00:16:07.159 --> 00:16:08.200
<v Speaker 1>working behind the scenes in

341
00:16:08.200 --> 00:16:11.879
<v Speaker 2>your life, or how these tools could reshape industries. Think

342
00:16:11.879 --> 00:16:14.720
<v Speaker 2>about customer service, retail, healthcare.

343
00:16:14.799 --> 00:16:17.759
<v Speaker 1>Definitely, yeah. And that brings us to the future. The

344
00:16:17.799 --> 00:16:22.679
<v Speaker 1>source touches on this idea of AI-first organizations.

345
00:16:22.080 --> 00:16:25.960
<v Speaker 2>Yeah, companies embedding AI into their core strategy, their products,

346
00:16:25.960 --> 00:16:27.200
<v Speaker 2>how their people work.

347
00:16:27.080 --> 00:16:29.360
<v Speaker 1>And it addresses the big question about jobs.

348
00:16:29.559 --> 00:16:34.080
<v Speaker 2>The perspective offered is interesting: tasks, not jobs, will be eliminated.

349
00:16:34.639 --> 00:16:37.440
<v Speaker 2>The focus shifts to how human roles will change and

350
00:16:37.519 --> 00:16:38.679
<v Speaker 2>work alongside AI.

351
00:16:38.919 --> 00:16:43.399
<v Speaker 1>Augmented intelligence, not just artificial intelligence replacing humans.

352
00:16:43.519 --> 00:16:46.919
<v Speaker 2>Exactly. Combining the strengths of both, humans and machines working together

353
00:16:47.000 --> 00:16:49.559
<v Speaker 2>can achieve way more than either could alone.

354
00:16:49.639 --> 00:16:53.840
<v Speaker 1>It's a vision of AI becoming woven into everything: cars, factories, shopping,

355
00:16:54.240 --> 00:16:55.200
<v Speaker 1>daily life.

356
00:16:55.000 --> 00:16:56.360
<v Speaker 2>A fundamental transformation.

357
00:16:56.919 --> 00:17:00.200
<v Speaker 1>So to wrap up, we've seen how AI has evolved,

358
00:17:00.559 --> 00:17:03.799
<v Speaker 1>how tools like Cognitive Services make it practical for developers,

359
00:17:04.039 --> 00:17:07.920
<v Speaker 1>especially with .NET, to build apps that understand language, speech,

360
00:17:08.039 --> 00:17:10.720
<v Speaker 1>and user needs through search and recommendations.

361
00:17:10.799 --> 00:17:13.160
<v Speaker 2>It really puts powerful capabilities within reach.

362
00:17:13.440 --> 00:17:17.200
<v Speaker 1>And thinking about that future, that augmented intelligence idea where

363
00:17:17.240 --> 00:17:21.319
<v Speaker 1>tasks change and humans partner with AI. Here's a final

364
00:17:21.359 --> 00:17:23.799
<v Speaker 1>thought for you to consider. If AI is set to

365
00:17:23.839 --> 00:17:27.839
<v Speaker 1>transform our tasks and merge with our capabilities, what completely

366
00:17:27.880 --> 00:17:30.960
<v Speaker 1>new roles, new kinds of expertise, or maybe even entirely

367
00:17:31.000 --> 00:17:34.400
<v Speaker 1>new opportunities might emerge from this human-machine partnership in

368
00:17:34.440 --> 00:17:37.200
<v Speaker 1>the coming years, things we perhaps can't even quite imagine today?

369
00:17:37.240 --> 00:17:38.960
<v Speaker 2>Something definitely worth pondering.
