1
00:00:01,000 --> 00:00:04,759
How'd you like to listen to
.NET Rocks with no ads? Easy.

2
00:00:05,320 --> 00:00:09,400
Become a patron for just five dollars
a month. You get access to a

3
00:00:09,480 --> 00:00:14,199
private RSS feed where all the shows
have no ads. Twenty dollars a month,

4
00:00:14,240 --> 00:00:18,399
we'll get you that and a special
.NET Rocks patron mug. Sign

5
00:00:18,480 --> 00:00:24,079
up now at patreon.dotnetrocks.com.
Hey there, this

6
00:00:24,120 --> 00:00:28,440
is Jeff Fritz, the Purple Blazer
guy from Microsoft, letting you in on

7
00:00:28,480 --> 00:00:32,640
a little secret about my friend Carl
Franklin. You know, the guy who

8
00:00:32,679 --> 00:00:37,399
started .NET Rocks, the first
podcast about .NET, in two thousand and

9
00:00:37,399 --> 00:00:43,000
two. The guy who's been teaching
Blazor on YouTube since twenty twenty. Yeah,

10
00:00:43,039 --> 00:00:47,600
that Carl Franklin. Well, Carl's
joined up with the folks from Code

11
00:00:47,600 --> 00:00:52,399
in a Castle to teach a week-long,
hands-on Blazor class at... Are you

12
00:00:52,399 --> 00:00:58,439
ready to get this? At a
castle slash villa in Tuscany. It's sort

13
00:00:58,439 --> 00:01:03,920
of a luxury vacation with Blazor
learning built in. Carl's calling it the

14
00:01:03,000 --> 00:01:10,000
Blazor Master Class. You'll learn Blazor
from the ground up, finishing the week

15
00:01:10,040 --> 00:01:15,680
with the ability to build and deploy
Blazor applications. Since the training happens for

16
00:01:15,719 --> 00:01:19,920
only four hours in the morning over
six days, you can bring your significant

17
00:01:19,959 --> 00:01:26,040
other, your partner, with you, and you
should, right? This part of Italy is absolutely

18
00:01:26,079 --> 00:01:30,560
beautiful. There's so much to see
and do. Larry and Marco

19
00:01:30,680 --> 00:01:34,799
from Code in a Castle are organizing
daily activities both at the castle and in

20
00:01:34,840 --> 00:01:38,959
the area. The castle is in
the Maremma, a less touristed region of

21
00:01:40,079 --> 00:01:45,879
Tuscany, offering both classic Tuscan hill
country as well as easy access to the

22
00:01:45,879 --> 00:01:52,239
Etruscan Riviera, with sublime local food, wine, and olive oil around every corner.

23
00:01:52,519 --> 00:01:56,400
Breakfast is included every day. There
will be two communal dinners at the

24
00:01:56,439 --> 00:02:00,680
castle bookending the experience, and
most other meals and all activities are included.

25
00:02:01,480 --> 00:02:07,319
And did I mention you'll learn Blazor
in person from Carl Franklin? Listen.

26
00:02:07,719 --> 00:02:12,199
Space is limited and for very good
reason. This is quality training in a

27
00:02:12,240 --> 00:02:19,759
beautiful setting. Go to
codeinacastle.com/blazor2023, that's

28
00:02:19,879 --> 00:02:25,159
B-L-A-Z-O-R two zero two
three, to take advantage of this amazing opportunity

29
00:02:25,360 --> 00:02:31,240
to join Carl in Tuscany for an
unforgettable week of la dolce vita while advancing

30
00:02:31,240 --> 00:02:38,879
your programming skills in this important new
technology. After building software for a while,

31
00:02:38,960 --> 00:02:43,560
you know it's only a matter of
time before you see an HTTP timeout

32
00:02:43,639 --> 00:02:46,800
or a database deadlock. In software, it's not a case of if things

33
00:02:46,879 --> 00:02:52,199
fail, but a case of when.
One mishap like this and valuable data is

34
00:02:52,240 --> 00:02:57,400
lost forever. And these failures occur
all the time, but it doesn't have

35
00:02:57,439 --> 00:03:00,560
to be this way. Introducing
NServiceBus, the ultimate tool to build

36
00:03:00,639 --> 00:03:07,759
robust and reliable systems that can handle
failures gracefully, maintain high availability, and

37
00:03:07,919 --> 00:03:12,479
scale to meet growing demand. For
more than fifteen years, NServiceBus

38
00:03:12,479 --> 00:03:16,360
has been trusted to run mission critical
systems that must not go down or lose

39
00:03:16,479 --> 00:03:22,199
any data ever, and now you
can try it for yourself. NServiceBus

40
00:03:22,240 --> 00:03:25,680
integrates seamlessly with your .NET
applications and can be hosted on premises or

41
00:03:25,719 --> 00:03:30,919
in the cloud. Say goodbye to
lost data and system failures and say hello

42
00:03:30,960 --> 00:03:35,879
to a better, more reliable way
of building distributed systems. Try NServiceBus

43
00:03:35,879 --> 00:03:40,439
today by heading over to
go.particular.net/dotnetrocks

44
00:03:40,800 --> 00:04:00,879
and start building better systems with asynchronous
messaging using NServiceBus. Hey, Antwerp!

45
00:04:00,240 --> 00:04:11,319
It's .NET Rocks! Holy crap, there
must be fifty thousand people here.

46
00:04:11,479 --> 00:04:13,719
Yeah, who knew we were in
a stadium? I know, right,

47
00:04:13,759 --> 00:04:15,040
we are in a sta... We're
actually in a movie theater, which is

48
00:04:15,079 --> 00:04:18,240
cool. It is cool. The
last time we did a .NET Rocks

49
00:04:18,240 --> 00:04:23,480
in a movie theater I think was
in Sofia, Bulgaria. Oh yeah,

50
00:04:23,519 --> 00:04:27,160
do you remember that? Yeah,
a few years ago. That was DevReach.

51
00:04:27,199 --> 00:04:30,319
And the funny thing was the Bulgarians
thought it would be funny if Richard and

52
00:04:30,360 --> 00:04:35,639
I announced the names of the winners, the Bulgarian winners of the swag at

53
00:04:35,680 --> 00:04:41,480
the end of the show, and
that was funny for them, for them,

54
00:04:43,399 --> 00:04:50,560
Bulgarian names need to buy a bow
Okay, it lacks where snudge and

55
00:04:50,720 --> 00:04:55,120
they were just laughing their butts off. That's you. Oh you were there?

56
00:04:55,160 --> 00:04:59,480
Okay, all right, well,
uh, we're glad to be here

57
00:05:00,079 --> 00:05:02,959
obviously. Yeah, last day of
the show. Fun to do a live

58
00:05:03,000 --> 00:05:06,800
show. Yeah, it's a live
show. We have some good stuff coming

59
00:05:06,879 --> 00:05:11,199
up here. Laila Bougria is here
and we'll be talking to her in a

60
00:05:11,240 --> 00:05:14,319
minute. But first we have to
do this little thing called Better Know a

61
00:05:14,360 --> 00:05:24,759
Framework. Roll the crazy music. All
right, buddy, what do you got?

62
00:05:25,319 --> 00:05:28,959
Well, I saw this come across
Twitter. It's a tweet and while

63
00:05:28,959 --> 00:05:32,519
I'm linking to it, Boston Dynamics. You know who those crazy people are,

64
00:05:32,399 --> 00:05:35,839
the guys with the robots. They
make the robots that dance, they

65
00:05:35,879 --> 00:05:40,600
do backflips now. Yeah,
a little parkour with robots and stuff.

66
00:05:40,639 --> 00:05:44,160
Yeah, they used to have a
little uncanny, tethered robots, robots that were

67
00:05:44,160 --> 00:05:47,279
tethered, and then they got gas
powered robots. Now they're battery driven and

68
00:05:47,399 --> 00:05:51,160
they put out videos every once in
a while of things that look like animals,

69
00:05:51,199 --> 00:05:57,720
like cougars and dogs. Well,
anyway, they've put ChatGPT into

70
00:05:57,720 --> 00:06:01,000
a robot and now you can talk
to it and it will talk to you

71
00:06:01,079 --> 00:06:05,120
back. And so there's a tweet
about that. It was from April,

72
00:06:05,240 --> 00:06:10,720
but I thought it was so cool
and a little bit scary. But they're

73
00:06:10,759 --> 00:06:14,720
asking it questions like, you know, are you function... It's like Data,

74
00:06:14,800 --> 00:06:16,600
are you functioning within normal parameters?
You know? And it would say,

75
00:06:16,720 --> 00:06:20,160
yes, you know, I have
this blah blah blah. What my levels

76
00:06:20,199 --> 00:06:25,199
are? You know these kinds of
things, my battery level? How many

77
00:06:25,560 --> 00:06:30,360
what did it say? How many
interactions in your last mission, and it

78
00:06:30,399 --> 00:06:33,920
will tell you about its last mission
and where it went. Just as they

79
00:06:33,959 --> 00:06:38,040
used the word mission, I don't
even know if it was mission, but

80
00:06:38,120 --> 00:06:42,519
was that an extermination mission? It could
have been. Just wondering. It's pretty scary,

81
00:06:42,560 --> 00:06:45,399
but it's cool. That's awesome.
So that's what I got.

82
00:06:45,600 --> 00:06:48,480
It's a tweet with a video.
Okay, yeah, Boston Dynamics videos are

83
00:06:48,480 --> 00:06:51,959
always amazing, fun. Good one.
Who's talking to us, Richard? Grabbed a

84
00:06:51,959 --> 00:06:57,399
comment off a show seventeen fifty three. That's the one we did with Mika

85
00:06:57,480 --> 00:07:02,759
about Visual Studio twenty twenty two productivity
back in August last year, and Mark

86
00:07:02,759 --> 00:07:05,240
Wansel had this great comment. Mark
has a lot of great comments, as well,

87
00:07:05,240 --> 00:07:10,079
and he says one of the things
that Mika noted a few times was telemetry.

88
00:07:11,040 --> 00:07:14,000
I think that this would be an
interesting show topic. There's a popular

89
00:07:14,079 --> 00:07:17,759
GitHub project called OpenTelemetry that might
be a good starting space. There seems

90
00:07:17,759 --> 00:07:20,759
to be an art to telemetry:
what to collect, how much, performance

91
00:07:20,800 --> 00:07:26,759
considerations and privacy considerations. Check out
opentelemetry.io. What do you

92
00:07:26,759 --> 00:07:30,639
think? That's not a good idea. Don't want to do that. No,

93
00:07:30,920 --> 00:07:32,240
we won't do that show. Yeah, sorry, Mark, I wish

94
00:07:32,240 --> 00:07:34,680
we could help you, but we
can't help you. Yeah, but I

95
00:07:34,720 --> 00:07:38,600
will send you a copy of Music
to Code By. And if you'd like a copy of Music

96
00:07:38,759 --> 00:07:41,480
to Code By, write a comment on the
website at dotnetrocks.com or

97
00:07:41,480 --> 00:07:43,959
on the facebooks. We publish every
show there, and if you comment there

98
00:07:43,959 --> 00:07:45,639
and we read it on the show,
we'll send you a copy of

99
00:07:45,720 --> 00:07:48,160
Music to Code By. And you should follow
us on Twitter. But the real fun

100
00:07:48,279 --> 00:07:54,120
happens on Mastodon. On Mastodon,
I'm @carlfranklin@techhub.social,

101
00:07:54,240 --> 00:07:58,120
and I'm @richcampbell@mastodon.social. Send us a toot, let us

102
00:07:58,160 --> 00:08:03,759
know you're listening, and that brings
us to our show on OpenTelemetry.

103
00:08:03,839 --> 00:08:07,839
What? Yeah. Sorry, that's
the topic. Laila Bougria is here.

104
00:08:07,920 --> 00:08:11,800
She's a software engineer with over fifteen
years of experience in the .NET space

105
00:08:11,240 --> 00:08:16,959
and currently works at Particular Software, where
they build NServiceBus. Maybe you've used

106
00:08:16,000 --> 00:08:18,720
it, maybe heard of it, certainly
been on the show a few times.

107
00:08:18,759 --> 00:08:24,120
Yeah. She's a Microsoft MVP and
frequent speaker at conferences, and in her spare time she

108
00:08:24,160 --> 00:08:28,720
loves to knit and crochet. Welcome. Thank you. How about that, Laila,

109
00:08:28,879 --> 00:08:35,200
huh, we were lying when we
said we weren't going to do a show

110
00:08:35,240 --> 00:08:39,320
about OpenTelemetry. That's kind of
why we read that. Yeah, yeah,

111
00:08:39,320 --> 00:08:41,759
I figured if I've got a comment where
literally somebody asked for OpenTelemetry,

112
00:08:41,799 --> 00:08:43,879
now is the time to
read it. So, Laila, what's

113
00:08:43,879 --> 00:08:52,440
the elevator pitch for OpenTelemetry?
Well, the elevator pitch. So we've

114
00:08:52,679 --> 00:08:58,320
we've had multiple telemetry signals in software
for years, right, the oldest one

115
00:08:58,320 --> 00:09:03,000
being logs. I think we all
know logs. We've also mostly used metrics.

116
00:09:03,480 --> 00:09:07,399
Maybe, I guess, distributed tracing is the
newest signal, but I would say

117
00:09:07,440 --> 00:09:13,960
that the elevator pitch for OpenTelemetry
is really correlating them all together. And

118
00:09:13,080 --> 00:09:18,440
that's what makes it really interesting for
me at least, because you know,

119
00:09:18,480 --> 00:09:22,279
each signal has its own value.
And we've also been logging for years,

120
00:09:22,320 --> 00:09:24,279
and even though we can see,
like you know, tracing might be the

121
00:09:24,360 --> 00:09:28,440
better option today, but then we
still have all of those logs we don't

122
00:09:28,480 --> 00:09:33,639
want to go rewrite our entire applications,
though. We do have log providers so

123
00:09:33,759 --> 00:09:37,080
you can plug in however you want
to do your logging, but that's not

124
00:09:37,159 --> 00:09:41,480
really that's taking the same source and
putting it in different places, isn't it.

125
00:09:41,519 --> 00:09:46,480
You're talking about ingesting different sources into
one place exactly, Yeah, and

126
00:09:46,519 --> 00:09:52,399
then being able to connect it.
So basically, imagine, you know,

127
00:09:52,480 --> 00:09:58,480
over the rainbow, that you get alerted
and there's a metric that looks out of

128
00:09:58,519 --> 00:10:01,240
whack, and it's like, okay, something is up. What is up?

129
00:10:01,519 --> 00:10:05,200
I don't know, just the metric
is out of whack. So you

130
00:10:05,279 --> 00:10:09,080
have to go figure out how to
do that, and that's usually a challenge

131
00:10:09,159 --> 00:10:13,559
because you then have to go figure
out, Okay, how do I connect

132
00:10:13,559 --> 00:10:16,240
that to those other signals that I
have. But what if you could look

133
00:10:16,240 --> 00:10:20,320
at the metric and say, okay, I can see that it's correlated to

134
00:10:20,759 --> 00:10:24,000
these traces, and then go look
at that, and then the traces would

135
00:10:24,000 --> 00:10:28,799
also be connected to the logs and
you would be able to basically, yeah,

136
00:10:28,840 --> 00:10:33,399
paint that entire picture of what's going
on, and you wouldn't be losing

137
00:10:33,440 --> 00:10:37,919
all of that time doing that thing manually. Sounds like a Wikipedia rabbit hole,

138
00:10:37,080 --> 00:10:41,639
right. One thing leads you to
another to another. So I'm trying

139
00:10:41,639 --> 00:10:46,039
to distinguish between all these different things. I mean, logging means a particular

140
00:10:46,039 --> 00:10:54,679
product like SQL Server spitting out logs
about how it's functioning, possibly versus metrics

141
00:10:54,720 --> 00:10:58,240
being more things like the state of
a server, like the hard drive's pinned

142
00:10:58,279 --> 00:11:03,759
or running low on memory. And then
when I think about traces, I think

143
00:11:03,799 --> 00:11:09,360
about there's tools that specifically are about
following a button click from the client to

144
00:11:09,559 --> 00:11:15,759
the server to the database and back
again. Right. Yeah, I like

145
00:11:15,799 --> 00:11:18,679
to think of that like following a
business transaction, right, right, like

146
00:11:18,759 --> 00:11:22,720
the workflow. Right. So it's
true, Yeah, that's definitely true.

147
00:11:22,720 --> 00:11:26,120
But at the same time, we've
also been logging in our applications a lot,

148
00:11:26,759 --> 00:11:31,720
and that still then connects together.
This is as a developer writing code

149
00:11:31,759 --> 00:11:37,039
to push messages onto a log right, So that's usually I'm a developer,

150
00:11:37,120 --> 00:11:43,000
right, So I'm always looking at
things from the application perspective, and how

151
00:11:43,000 --> 00:11:46,639
do I make this application observable?
And there's multiple signals, and you would

152
00:11:46,679 --> 00:11:52,960
choose each individual signal based on whatever
your scenario is. It could even be

153
00:11:52,080 --> 00:11:58,600
that for a specific scenario you might
come to the conclusion that using multiple signals

154
00:11:58,679 --> 00:12:03,320
might be useful. So for example, let's say that a failure occurs,

155
00:12:03,559 --> 00:12:07,200
right, you want to keep track
of that and have a trace that reflects

156
00:12:07,240 --> 00:12:11,480
that failure. But you might also
want to have that reflected in metrics so you

157
00:12:11,519 --> 00:12:16,240
can do the alerting and
all of that. Yeah, So that

158
00:12:16,480 --> 00:12:18,200
means tripping an alert through
the system to say there was a

159
00:12:18,240 --> 00:12:22,679
failure that occurred as well as what
shows up in the log. Right. Now,

160
00:12:22,960 --> 00:12:28,559
there's a lot of third party products
out there that do telemetry. Or

161
00:12:28,799 --> 00:12:33,720
does OpenTelemetry allow you to use
those as sources and then pull them together,

162
00:12:33,840 --> 00:12:37,919
or does OpenTelemetry have its own
things that you can plug in or

163
00:12:37,960 --> 00:12:41,799
both? It's both, and that's
a good part because I think if we

164
00:12:41,840 --> 00:12:46,720
look at our applications and we want
to make them observable, there's a lot

165
00:12:46,759 --> 00:12:50,799
that we can do. But I
think with the OpenTelemetry project, and

166
00:12:50,919 --> 00:12:56,200
also like all of the effort that
the entire community has basically put into this,

167
00:12:56,559 --> 00:13:00,120
they've made it easy for us,
right? Because you could basically say, what

168
00:13:00,240 --> 00:13:03,960
stack are you using? Oh, I'm
using ASP.NET Core and I'm

169
00:13:03,039 --> 00:13:07,039
using the Azure SDK, whatever it
is, right, and you could just

170
00:13:07,799 --> 00:13:13,399
use those instrumentation libraries that are available
from those frameworks. It could be built

171
00:13:13,440 --> 00:13:18,200
into the framework or could be a
dedicated package and you can turn them on

172
00:13:18,000 --> 00:13:22,120
and just by doing that, you're
already collecting a bunch of information, and

173
00:13:22,159 --> 00:13:26,159
specifically in a distributed system, it's
usually going to give you insight into that

174
00:13:26,519 --> 00:13:33,200
interservice communication where we have the blind
gaps. So it already like gives you

175
00:13:33,399 --> 00:13:37,679
a lot of information. So the
cool thing is that you can then intercept

176
00:13:37,080 --> 00:13:43,039
basically those traces that are being generated
by those libraries, and in .NET that

177
00:13:43,159 --> 00:13:46,440
would be calling Activity.Current,
right, and you could add

178
00:13:46,960 --> 00:13:50,960
you could create your own activities,
which is, by the way,

179
00:13:50,000 --> 00:13:56,759
the sort of same thing as a
span. But yeah, they basically what

180
00:13:56,840 --> 00:14:01,159
they did is, the Activity API already
existed, and instead of creating a completely

181
00:14:01,200 --> 00:14:05,120
new API to match the naming of
the OpenTelemetry specification, they just

182
00:14:05,279 --> 00:14:09,799
implemented the specification inside the Activity API. So that's why the name is a

183
00:14:09,840 --> 00:14:13,200
little bit different. But basically,
what you could do is take the current

184
00:14:13,200 --> 00:14:16,120
activity, which could be emitted by
an instrumentation library, and say, I

185
00:14:16,159 --> 00:14:22,639
want to add some information to this
that is specific to my application, to

186
00:14:22,679 --> 00:14:26,840
the workflow I'm running in so that
I can get even more insight. What's

187
00:14:26,879 --> 00:14:31,840
the voodoo that allows us to go
between the different tiers in an app and

188
00:14:31,879 --> 00:14:35,840
say these are all part of the
same transaction? Right, that's basically a

189
00:14:35,919 --> 00:14:41,759
propagation mechanism really, because if you
think of a trace, it's basically a

190
00:14:41,799 --> 00:14:46,120
bunch of spans that are connected to
each other. Now, what happens is

191
00:14:46,200 --> 00:14:50,639
at the beginning of the trace,
we basically get a trace ID assigned and

192
00:14:50,720 --> 00:14:56,759
that's going to be carried across all
of the spans, and then each span has

193
00:14:56,799 --> 00:15:01,720
a unique ID. Now, in
order for that information to propagate across multiple

194
00:15:01,759 --> 00:15:07,440
services, we need a propagation mechanism. And there's multiple protocols that are basically

195
00:15:07,480 --> 00:15:09,519
supported by the OpenTelemetry project.
The most well-known one is

196
00:15:09,559 --> 00:15:15,320
the W3C Trace Context for
HTTP headers. I always feel like I need a

197
00:15:15,320 --> 00:15:22,200
breath. Yeah. And then there's
another one for gRPC. So it depends

198
00:15:22,200 --> 00:15:26,200
on what you're doing there. So
I have, for example, an

199
00:15:26,240 --> 00:15:31,759
Azure web app, right? I
have Application Insights turned on, I've got

200
00:15:31,799 --> 00:15:35,480
all the switches lit up, and
do I need OpenTelemetry at that point?

201
00:15:35,559 --> 00:15:39,480
What's it going to give me over
what Azure Insights already has. So

202
00:15:39,519 --> 00:15:46,679
the way that I look at that
is what Application Insights provides you is also

203
00:15:46,799 --> 00:15:52,639
sort of known as black box instrumentation, so it's basically independent of your specific

204
00:15:52,639 --> 00:15:58,639
application code. Yeah, but obviously
the things that we are doing in our

205
00:15:58,679 --> 00:16:03,799
code is usually the interesting bits,
and sometimes we need a little bit more

206
00:16:03,840 --> 00:16:07,279
insight to understand, you know,
what pieces of code are we executing there,

207
00:16:07,320 --> 00:16:11,639
what are we doing, what is
like the cause of latency or whatever

208
00:16:11,639 --> 00:16:15,200
it is, what piece is slow.
Application Insights doesn't provide that? Well, I

209
00:16:15,240 --> 00:16:22,720
think it's quite different to
compare, because, at least to my

210
00:16:22,840 --> 00:16:26,600
understanding, it's more of an overall
view that you get, and you can

211
00:16:26,639 --> 00:16:33,519
still use the Application Insights SDK
directly and still emit like application specific telemetry.

212
00:16:33,600 --> 00:16:36,960
But then you're tied, right? You're tied to the vendor,

213
00:16:37,360 --> 00:16:41,039
right, and the thing is right
right, exactly, and if you do want

214
00:16:41,120 --> 00:16:45,799
to change to another vendor, you
have that vendor lock-in. That's the point,

215
00:16:45,039 --> 00:16:49,320
right, Yeah, So if you
use OpenTelemetry, then you could

216
00:16:49,360 --> 00:16:53,480
just wire up a different exporter.
I mean, that would also... App Insights has

217
00:16:53,559 --> 00:16:57,399
good features for .NET specifically. Absolutely. Yeah, if you've got some other

218
00:16:57,440 --> 00:17:00,799
code written in other things that aren't
.NET related, yeah, App Insights is

219
00:17:00,840 --> 00:17:03,160
only going to do so much for
you and certainly not going to work if

220
00:17:03,160 --> 00:17:07,119
you're in a container on AWS,
is it? Yeah. So, I

221
00:17:07,119 --> 00:17:11,960
mean it certainly opens the door to
working with more platforms, even more places

222
00:17:12,039 --> 00:17:17,400
and open. Hey, hey, there's
a concept for you. Yeah, and

223
00:17:17,400 --> 00:17:22,359
it being available cross-platform and
cross-runtime as well. Like there's implementations

224
00:17:22,440 --> 00:17:27,599
for instrumentation libraries in many languages for
many frameworks. So that's really cool, especially

225
00:17:27,640 --> 00:17:30,839
if you think of like multi stack
applications. Right. Sure, if I've

226
00:17:30,839 --> 00:17:34,559
got a group of Python developers going
to build a data importer for me,

227
00:17:34,640 --> 00:17:37,559
the fact that I can instrument it
the same way as everything I've got built

228
00:17:37,599 --> 00:17:41,319
in .NET, that's pretty compelling, right? And bring everything under one

229
00:17:41,359 --> 00:17:45,960
roof and measured the same way.
I mean, that's the real problem is

230
00:17:45,960 --> 00:17:51,559
that often we're chasing problems that transition
between different systems, and because they're measured

231
00:17:51,559 --> 00:17:55,519
differently, it's very hard to associate
stuff together. I'm just thinking in terms

232
00:17:55,519 --> 00:17:57,279
of how much code I need to
write as a developer to take advantage of

233
00:17:57,279 --> 00:18:00,839
all this and how much of it
comes out of the box. Well, that's

234
00:18:00,839 --> 00:18:06,559
what I... That's why I mentioned the
instrumentation libraries, right, and even if

235
00:18:06,599 --> 00:18:11,920
you know things like event counters and
stuff like that, even that has dedicated

236
00:18:11,119 --> 00:18:15,920
libraries available already, so you can
basically just turn them on and like I

237
00:18:15,960 --> 00:18:22,079
said, you could plug into that
and add to that information. So

238
00:18:22,160 --> 00:18:25,920
usually what I say is, look
at what it already gives you. Turn

239
00:18:25,960 --> 00:18:30,039
on the instrumentation library. Yeah exactly, don't reinvent the library. So look

240
00:18:30,039 --> 00:18:33,240
at what's out there. Turn it
on and see what it emits, and

241
00:18:33,640 --> 00:18:37,559
take a look at what type of
insight that already gives you, and it

242
00:18:37,559 --> 00:18:40,599
probably is picked up by OpenTelemetry
just fine. Yeah, yeah,

243
00:18:40,640 --> 00:18:45,720
Well, for example, let's say
that you have ASP.NET Core instrumentation enabled,

244
00:18:45,839 --> 00:18:48,559
right, so you're going to see
the request, but you don't get a

245
00:18:48,559 --> 00:18:52,359
lot of insight into the request.
But I mean it has hooks that you

246
00:18:52,400 --> 00:18:59,799
could then plug in additional information and
expose whatever is you know, interesting to

247
00:18:59,799 --> 00:19:03,039
you in that scenario, so that
you could understand like the sort of business

248
00:19:03,119 --> 00:19:07,200
context of what's going on. And
that's what makes it really powerful. I

249
00:19:07,240 --> 00:19:11,240
spent enough time on the firefighting side
of being a sysadmin where there's stuff being

250
00:19:11,240 --> 00:19:15,920
spewed out in the logs that we're
looking at. We just don't know what

251
00:19:15,960 --> 00:19:18,519
it means, right, right,
like anything there? Yeah, it's clear.

252
00:19:18,559 --> 00:19:21,759
It's like your own little Internet.
Everything you need to know is there,

253
00:19:22,119 --> 00:19:25,720
you just can't find it. Yeah, So you know, how do

254
00:19:25,720 --> 00:19:30,480
you add the additional information that helps
someone see? This is where we're having

255
00:19:30,640 --> 00:19:34,759
Right. That's where the vendors really come
in, right, because they are then

256
00:19:34,839 --> 00:19:40,559
going to offer capabilities that allow you
to query, to analyze that information,

257
00:19:41,000 --> 00:19:45,000
and to basically get to actionable insights, because that's the whole point of telemetry.

258
00:19:45,079 --> 00:19:47,400
Right. That's a bunch of data, but then what do I do

259
00:19:47,480 --> 00:19:52,440
with it? Right? So you
want to basically have tools that help you

260
00:19:52,279 --> 00:19:56,279
get pointers on how do I fix
this problem or how do I improve this

261
00:19:56,400 --> 00:20:03,240
latency issue that I'm seeing, or
maybe even see like which feature gets used

262
00:20:03,240 --> 00:20:06,759
a lot, and things like that. This is almost more of a profiling

263
00:20:06,920 --> 00:20:08,559
thing, right? Like, what are
the functions we call the most often?

264
00:20:08,640 --> 00:20:11,640
And we know perfectly well why the
system is slow. It's the database's

265
00:20:11,759 --> 00:20:17,240
fault. We just blame the
DBA. Then we're done. Life is

266
00:20:17,240 --> 00:20:22,759
good, all right. You know
Another one that I always think of is

267
00:20:22,880 --> 00:20:30,480
because I'm you know, I'm like
an observability enthusiast, I would say,

268
00:20:30,759 --> 00:20:36,720
right. And the reason why I'm
so enthusiastic about it is because, well,

269
00:20:36,720 --> 00:20:41,119
my sort of core focus has been
message-based systems for a few years. Hard

270
00:20:41,240 --> 00:20:47,720
to troubleshoot. Yeah,
it's just, like, messages split across queues, and

271
00:20:47,759 --> 00:20:49,680
then you get out of order messages
and it's like, what's going on?

272
00:20:49,960 --> 00:20:53,839
Why am I seeing this fail?
And especially in a message based system,

273
00:20:55,480 --> 00:20:59,680
usually the problem is happening, like, further
upstream, right? Right. So

274
00:20:59,759 --> 00:21:03,279
how do you get to that?
Right? So? And how do you

275
00:21:03,319 --> 00:21:07,880
connect that even outside of the messages
that are being sent, because it has

276
00:21:07,880 --> 00:21:10,519
to maybe connect back to, like
you said, a click somewhere on a

277
00:21:10,640 --> 00:21:15,880
user interface. So being able to
have that full visibility across all of the

278
00:21:15,000 --> 00:21:18,799
subsystems is really, really powerful. Yeah, and again I'm still worrying this is a

279
00:21:18,839 --> 00:21:22,359
lot of code for me, right? But
you're telling me that when you're using

280
00:21:22,359 --> 00:21:26,200
a library that has protocol understanding,
it's going to insert a lot of that

281
00:21:26,200 --> 00:21:30,920
information automatically for us. So it
sews it all together. Yes. Yeah. And

282
00:21:30,000 --> 00:21:36,559
because of the sort of nature of
how distributed tracing works, that information is

283
00:21:36,559 --> 00:21:41,799
going to be connected together through that
same trace idea that's basically being propagated does

284
00:21:41,839 --> 00:21:45,359
have substantial overhead. Is there is
there any reason to only turn it on

285
00:21:45,400 --> 00:21:47,920
when you have a problem or can
you leave it on all the time?

286
00:21:48,519 --> 00:21:53,039
Okay, that's going to
be a long question. You can

287
00:21:53,160 --> 00:22:00,480
go for the "it depends." It depends,
definitely. So yeah, it definitely depends.

288
00:22:00,640 --> 00:22:04,839
But so usually what I tend to
say is make sure that what you're

289
00:22:04,920 --> 00:22:11,920
collecting is useful, right? Start there, because if you just turn, I don't

290
00:22:11,920 --> 00:22:15,640
know, every instrumentation library on the
planet on, we log all the things

291
00:22:15,720 --> 00:22:19,039
and capture all of the traces and
all of that, you're going to have

292
00:22:19,079 --> 00:22:23,200
to sift through all of that
information to be able to understand like what's

293
00:22:23,240 --> 00:22:27,039
going on. Right, So it's
also not a thing of oh, look

294
00:22:27,079 --> 00:22:30,759
at all of these instrumentation libraries and
then turning everything on, because you're going

295
00:22:30,799 --> 00:22:34,960
to be incredibly overwhelmed, to the
point that as a developer you might feel

296
00:22:36,000 --> 00:22:38,200
like, okay, this is not
useful. Yeah, let's just turn it

297
00:22:38,240 --> 00:22:41,640
all back off. I mean,
I've also had the problem where I've said,

298
00:22:41,680 --> 00:22:44,319
Okay, I'm not going to measure
this thing, and then I never

299
00:22:44,400 --> 00:22:47,279
get data for that thing, Like
it turns out I'm looking for the wrong thing

300
00:22:47,359 --> 00:22:51,720
in the wrong place. Like that's
not a number that moves. So some

301
00:22:51,759 --> 00:22:56,039
of these telemetry products that are out
there have ways that they can work on

302
00:22:56,079 --> 00:23:03,160
a background thread or they can attach
as a sidecar, you know. So

303
00:23:03,839 --> 00:23:07,960
do you have those kinds of things
where you can sort of stay out of

304
00:23:07,960 --> 00:23:11,160
the way so if there is something
that takes up some more time, it

305
00:23:11,160 --> 00:23:15,079
can happen on a background thread.
Yeah. So that's where the open telemetry

306
00:23:15,119 --> 00:23:18,799
project is also really interesting because if
let's say that you look at the basic

307
00:23:18,880 --> 00:23:22,920
samples that are out there for dot
net right, what you're going to see

308
00:23:22,960 --> 00:23:26,000
there is that you can basically,
in a service, enable open telemetry and an

309
00:23:26,000 --> 00:23:30,200
exporter, which means that you're collecting
let's say, for the sake of the

310
00:23:30,240 --> 00:23:36,519
example, traces, and sending them directly to
an observability back end. Basically what that

311
00:23:36,519 --> 00:23:41,279
could be is Azure Application Insights or
Jaeger or Honeycomb, whatever it is.

312
00:23:41,079 --> 00:23:47,519
So, but the thing is
that there are many problems with that. First

313
00:23:47,559 --> 00:23:49,880
of all, like you said,
there is overhead for that service because it

314
00:23:49,920 --> 00:23:55,400
has to collect all of that information. There might even be some processing behind

315
00:23:55,440 --> 00:24:00,759
the scenes happening as well. We
had a service bus, well, well

316
00:24:00,880 --> 00:24:03,319
that's, we'll get to that. We
will get to that, yeah, and

317
00:24:03,359 --> 00:24:07,680
then you have to export it.
But then imagine that the observability back end

318
00:24:07,799 --> 00:24:12,400
is not available for a few seconds
because, you know, it's the network.

319
00:24:12,599 --> 00:24:21,559
It's the network. Yeah right,
so well no, that's usually that's

320
00:24:21,680 --> 00:24:26,960
handled by the libraries themselves, but
it is adding that pressure to the services

321
00:24:27,400 --> 00:24:32,359
telemetry, right, yeah, you
don't want to have that disconnected information and

322
00:24:32,400 --> 00:24:34,599
then be looking at half of the story, right? But there are ways to

323
00:24:34,640 --> 00:24:38,079
solve this, and that's where the
open telemetry collector comes in. And then

324
00:24:38,079 --> 00:24:42,200
you have multiple deployment options on how
to run that. The first one is,

325
00:24:42,240 --> 00:24:45,039
like you said, a sidecar,
So basically you're going to have a

326
00:24:45,079 --> 00:24:51,759
sidecar for each service that you're instrumenting, and then immediately you're offloading all of

327
00:24:51,799 --> 00:24:55,960
the telemetry. Well, you're collecting
it and sending it through to the sidecar.

328
00:24:56,079 --> 00:24:59,799
But then there's any processing that needs
to be done, like redacting information

329
00:25:00,039 --> 00:25:04,839
because, remember, sensitive information, you don't
want that in all of that telemetry

330
00:25:04,839 --> 00:25:10,079
you're collecting. So that's just down
the road. We can be fined very

331
00:25:10,079 --> 00:25:14,880
easily. I call that digital white
out. So then you have all of

332
00:25:14,880 --> 00:25:18,599
that processing and then you could export
it then to the observability back end and

333
00:25:18,640 --> 00:25:22,839
you'd be able to handle all those
communication issues and all of that in the

334
00:25:22,920 --> 00:25:27,599
sidecar and your service is not
affected. Now that's one option, but

335
00:25:27,759 --> 00:25:33,599
you could also set up the open
telemetry collector as a gateway,

336
00:25:33,640 --> 00:25:37,599
so it's a central component and all
of the services can basically send their telemetry

337
00:25:37,640 --> 00:25:42,640
information to that central component, which
then will take care of processing all of

338
00:25:42,680 --> 00:25:48,519
that information to the back end.
Yeah, and it could batch that information

339
00:25:48,720 --> 00:25:52,319
and it's a pretty powerful thing,
and you could even, like, go crazy, or

340
00:25:52,400 --> 00:25:56,319
if you need it right, but
you could have a sort of hybrid model

341
00:25:56,359 --> 00:26:00,480
in which you have a sidecar per
service, which is then sending its information to the

342
00:26:00,480 --> 00:26:06,119
central collectors. Is there anything special
in the storage on that? Is it just

343
00:26:06,160 --> 00:26:10,799
blobs or text files or are they
actually using a database on that? Well?

344
00:26:10,839 --> 00:26:15,599
I think it depends on the signal, sure, because usually metrics go

345
00:26:15,799 --> 00:26:21,960
to time series databases, and then logs... Honestly, I don't know. It's

346
00:26:22,160 --> 00:26:26,000
a good question, depends but it
depends. Yeah. But well, actually

347
00:26:26,039 --> 00:26:32,519
about those time series databases. That's
a sort of interesting topic on its own

348
00:26:32,559 --> 00:26:36,920
because at the beginning I was talking
about telemetry correlation, right, and

349
00:26:37,079 --> 00:26:42,079
basically adding the trace ID to the
metric, which is called an exemplar in open

350
00:26:42,079 --> 00:26:48,319
telemetry naming, so that you
would be able to connect that together,

351
00:26:48,400 --> 00:26:49,799
so you'd see a spike in the
metric and see, oh, that's caused

352
00:26:49,839 --> 00:26:55,440
by those traces. Right now,
the thing is that you have to be

353
00:26:55,480 --> 00:27:00,920
aware of what's known as cardinality explosion. So I'll try to... When a bomb

354
00:27:02,000 --> 00:27:06,200
goes off in the Vatican? Dude. Did I say that? I'm

355
00:27:06,240 --> 00:27:12,839
sorry, those are cardinal explosions,
not the same. Cardinality explosion. Cardinality explosion.

356
00:27:14,480 --> 00:27:17,720
I'll try to explain that, but
usually I do this visually, so

357
00:27:17,839 --> 00:27:21,720
okay, give me a bit
to try. But think of the exemplar as

358
00:27:21,759 --> 00:27:25,440
basically a label, right that you're
adding to a metric, and that's going

359
00:27:25,480 --> 00:27:30,359
to give you some insights on the
context in which that metric is being collected.

360
00:27:30,799 --> 00:27:34,559
So let's say that I have a
metric called failure rate, because I

361
00:27:34,599 --> 00:27:40,640
want insight into that and to have
a little bit more background information. I

362
00:27:40,680 --> 00:27:44,240
want to know in which environment that
happened: build, development, test,

363
00:27:44,480 --> 00:27:48,240
production, whatever it is. And
I also want to know which HTTP status

364
00:27:48,279 --> 00:27:52,440
code came back. Now that HTTP
status code in a production environment is going

365
00:27:52,480 --> 00:27:57,200
to have like for the sake of
the example, thirty possible values. And

366
00:27:57,319 --> 00:28:03,839
we have three different environments. So
that's three possible values for that environment label.

367
00:28:03,799 --> 00:28:12,480
Now the cardinality is basically all of
the possible combinations of those values of

368
00:28:12,720 --> 00:28:18,759
every label that you add. That is
the cardinality. So it's a multiplier. Exactly.

369
00:28:18,799 --> 00:28:23,839
So we're fine with the environment and
then having the HTTP status code, and then

370
00:28:23,839 --> 00:28:29,160
I add customer ID. Oh boy. So why'd you do that?

371
00:28:29,200 --> 00:28:32,400
We were at like one hundred combinations, and
then you threw in twenty thousand customers and

372
00:28:32,519 --> 00:28:36,319
ruined everything. Or more. Yeah, a hundred
thousand customers, a million customers. Nobody

373
00:28:36,319 --> 00:28:41,960
would ever do that. Ouch. And that's
how we then basically get cardinality explosion because

374
00:28:41,960 --> 00:28:45,160
what happens in the time series database
is that every time you have a sort

375
00:28:45,160 --> 00:28:51,079
of unique combination and your series is
created, so your cost goes up right

376
00:28:51,200 --> 00:28:55,599
and it becomes really hard to query
that information. So it's also important to

377
00:28:55,960 --> 00:28:59,960
One field. That's just one field. If you just don't do that,

378
00:29:00,400 --> 00:29:04,319
there won't be this problem. Yeah, well it seems like a good idea

379
00:29:04,319 --> 00:29:10,759
when you do it, right? Until cardinality.
Exactly. So it's also important to be

380
00:29:10,839 --> 00:29:14,480
aware of, you know, what
observability back end are you using and how

381
00:29:14,519 --> 00:29:18,160
does that work, because there are
some tools out there that do support high

382
00:29:18,160 --> 00:29:21,960
cardinality. So it's just something that
you have to be aware of. Yeah,

383
00:29:22,279 --> 00:29:25,200
and so you can tolerate if you
really come to the resolution you have

384
00:29:25,240 --> 00:29:29,359
to do that one way or the
other. I mean, customer ID doesn't seem

385
00:29:29,400 --> 00:29:33,920
that crazy because it is useful if
you've got a customer on a phone to

386
00:29:33,000 --> 00:29:37,799
say, hey, I could pull
all the transactions, all of those streams

387
00:29:37,799 --> 00:29:41,279
for all of that customer and sort
of look at where they were having problems.

388
00:29:41,240 --> 00:29:47,440
Yep. That's like full production debuggability, right? Yeah, without a doubt.

389
00:29:48,400 --> 00:29:49,759
And with that, I've got to
interrupt for one moment for this very

390
00:29:49,839 --> 00:29:57,799
important message. And we're back. It's dot net Rocks. I'm Richard

391
00:29:57,799 --> 00:30:02,079
Campbell, that's Carl Franklin. Hey, Hey, talking to our friend Leila

392
00:30:02,200 --> 00:30:04,880
about open telemetry. Hey hey,
and we've kind of gotten to that place

393
00:30:06,000 --> 00:30:11,400
now, right? Like,
how do we visualize this? Because you're getting

394
00:30:11,400 --> 00:30:15,240
probably a lot of information.
Like, what is the tooling that comes

395
00:30:15,240 --> 00:30:18,720
with it? Or do I have to
write my own? What are the dashboards?

396
00:30:18,799 --> 00:30:23,160
Well, that's hopefully where you choose
a vendor, right, and then you

397
00:30:23,200 --> 00:30:27,359
know, with the abilities of open
telemetry, by standardizing all of that information,

398
00:30:27,400 --> 00:30:30,880
I usually just say, like,
try a bunch out. You see

399
00:30:30,880 --> 00:30:37,000
what are the requirements for choices?
I imagine you can just hook these up

400
00:30:37,000 --> 00:30:41,119
to standard graph controls and things like
that from your various vendors. If you

401
00:30:41,160 --> 00:30:45,920
want a dashboard, yeah, yeah, yeah, there's just so many options,

402
00:30:45,960 --> 00:30:52,519
and I feel like each of them
has their own strength. But yeah,

403
00:30:52,559 --> 00:30:55,599
for example, if you're
running in the Azure stack and

404
00:30:55,680 --> 00:30:59,359
you're already using Application Insights, it's
a thing, you know, it makes

405
00:30:59,359 --> 00:31:03,880
sense, again, that you're using that as
well. Yeah, But then for example,

406
00:31:03,359 --> 00:31:07,519
I've played around with Honeycomb, and
I think that they're especially like the

407
00:31:07,680 --> 00:31:12,200
collaboration that they've built into the tool
is really cool as well. Is that an AWS product?

408
00:31:14,400 --> 00:31:17,200
No, no, it's its own
company, Honeycomb. Okay. Big,

409
00:31:17,200 --> 00:31:21,240
big taste in a big, big bite. Nice. That's the breakfast cereal.

410
00:31:21,319 --> 00:31:23,279
But okay, I was a kid
of the you know what I'm thinking.

411
00:31:23,319 --> 00:31:27,880
I'm thinking that there is a
Honey something product in AWS, a no

412
00:31:27,960 --> 00:31:32,319
code product, yeah, okay,
different one, but yeah, Honeycomb is

413
00:31:32,359 --> 00:31:37,960
an instrumentation library. Yeah. So, and they're very invested in open telemetry

414
00:31:37,000 --> 00:31:41,839
as well, and I think it's
one of those tools that supports high cardinality.

415
00:31:41,880 --> 00:31:44,480
They talk about it a lot as
well, right, yeah, but

416
00:31:44,640 --> 00:31:48,000
yeah, what stood out to
me was really the sort of collaborative

417
00:31:48,359 --> 00:31:52,279
features that they had, because
usually if you're looking at a huge problem,

418
00:31:52,599 --> 00:31:55,000
you're not doing that by yourself,
right, yeah, you know,

419
00:31:55,079 --> 00:32:00,160
and it's like, if
you have a twenty-four-seven team and one

420
00:32:00,240 --> 00:32:01,920
ends their shift, you want to
be able to hand over where you left

421
00:32:01,960 --> 00:32:07,240
off things like that. Yeah,
I'm wondering about... So who is this? Typically

422
00:32:07,279 --> 00:32:08,920
the systems that are going to get
these packages in the first place,

423
00:32:08,920 --> 00:32:14,000
that's where the errors first
show up, where the problems appear,

424
00:32:14,079 --> 00:32:16,160
and then they might be passing it
to development saying hey, we're looking at

425
00:32:16,160 --> 00:32:22,519
this and we think it's this
kind of problem. Well that's where I'm

426
00:32:22,559 --> 00:32:29,640
also expecting some evolution because yes,
usually now it would be a different team

427
00:32:29,680 --> 00:32:32,880
when they would be looking at something
that looks funny, yeah right, and

428
00:32:32,920 --> 00:32:38,119
then get some understanding, hopefully some
actionable insights, right, and then being

429
00:32:38,160 --> 00:32:44,200
able to bring that back to the
development team. But it's really interesting

430
00:32:44,240 --> 00:32:49,440
to me in the sense that we're
the developers. We are the ones going

431
00:32:49,559 --> 00:32:55,839
to be writing that application-specific telemetry. So it's also really important to

432
00:32:55,880 --> 00:33:02,640
get like organization wide alignment as well
on the type of telemetry that you're going

433
00:33:02,640 --> 00:33:07,039
to be collecting. So this seems
like there's an infinite number of decisions to

434
00:33:07,079 --> 00:33:10,920
make here, right? I mean,
just by using open telemetry, that's

435
00:33:10,960 --> 00:33:15,599
just one step of many. Yeah, what are we going to be looking

436
00:33:15,640 --> 00:33:17,759
at? What's the granularity of it? How are we going to query it?

437
00:33:19,039 --> 00:33:22,000
how are we going to look at
it visually? Like, these are

438
00:33:22,039 --> 00:33:24,559
all things that aren't just in the
box. You have to think them through.

439
00:33:24,839 --> 00:33:30,359
Yeah. Well, usually what I
try to give, or try to

440
00:33:30,400 --> 00:33:35,920
advise as well, is to
sort of document guidelines, like: what are

441
00:33:35,960 --> 00:33:38,759
you looking for with your telemetry?
What are the problems that you're trying to

442
00:33:38,799 --> 00:33:45,240
solve and like come up with a
specific set of questions that you could as

443
00:33:45,240 --> 00:33:51,680
a developer when you write a feature, ask yourself so that you could add

444
00:33:51,720 --> 00:33:53,920
the telemetry that is going to answer
those questions. Right, I mean,

445
00:33:53,960 --> 00:33:57,759
what if you don't know what you
don't know? What if you don't know

446
00:33:57,799 --> 00:34:00,480
what you want to look for as
a guidance? Are right to say?

447
00:34:00,400 --> 00:34:05,079
Yeah? Some of this sounds like
business related decisions, like we sell widgets,

448
00:34:05,119 --> 00:34:07,880
and I want to know when a
sale fails because of technology rather than

449
00:34:07,920 --> 00:34:12,280
the customer didn't want to buy it, right, yeah, definitely, or

450
00:34:12,360 --> 00:34:16,280
simple things like this page took too
long to render or whatever. Yeah.

451
00:34:15,639 --> 00:34:21,079
Yeah. And then from the sort
of failure perspective, I usually try to

452
00:34:21,599 --> 00:34:24,599
look at the code and think to
myself, if something were failing here,

453
00:34:24,800 --> 00:34:29,159
what would I
be looking at if I were debugging this?

454
00:34:29,360 --> 00:34:32,920
Like, what state would be
interesting to me? What variables would

455
00:34:32,960 --> 00:34:37,679
I be looking at? Have I
captured the input enough? Have I captured

456
00:34:37,719 --> 00:34:42,760
what's going out to be able to
understand and be able to then you know,

457
00:34:42,840 --> 00:34:45,320
go back and try to understand what
happened from the outside. Right,

458
00:34:45,440 --> 00:34:49,280
there's an easy solution to all of this:
you just put all your code in a

459
00:34:49,320 --> 00:34:53,760
try with an empty catch. No
problem. Is that On Error Resume Next? A

460
00:34:53,920 --> 00:34:59,159
sort of, yeah. It's more like
slash day to turn off all the debugs.

461
00:34:59,239 --> 00:35:01,039
I don't want to know. Just
keep going, don't tell me about these

462
00:35:01,079 --> 00:35:06,159
guys. I mean a lot of
this we've talked about very proactively, like

463
00:35:06,199 --> 00:35:09,559
we're going to detect the errors before
the customer does or before the customer complains.

464
00:35:09,639 --> 00:35:14,000
Yeah. I think there's another dynamic
where the customer is complaining and we're

465
00:35:14,039 --> 00:35:19,159
getting a ticket that's like, what error? I think it'd be very challenging to

466
00:35:19,159 --> 00:35:22,559
say, you've got this ticket, it's
about this customer, it was roughly at

467
00:35:22,559 --> 00:35:24,280
this time, and now you want
to go dig through the logs to say,

468
00:35:24,320 --> 00:35:28,679
can we see what this person's complaining
about? The thing to do is to have

469
00:35:28,719 --> 00:35:31,679
the customer on the phone and enable
this sort of thing, but just for

470
00:35:31,719 --> 00:35:37,480
them, just for their customer ID
and say and then just watch it as

471
00:35:37,519 --> 00:35:39,679
they're going through the process where it
fails and now you've got something. But

472
00:35:40,440 --> 00:35:45,639
is that possible? Well, I
think so, but it sort of depends

473
00:35:45,679 --> 00:35:49,920
on what type of sampling strategies that
you're applying. And also, like you

474
00:35:50,000 --> 00:35:52,840
know the type of observability that you're
collecting, because let's say that if you

475
00:35:53,159 --> 00:35:59,800
want full insight, so basically any
error that occurs to any user, right,

476
00:36:00,360 --> 00:36:02,880
be able to say, oh,
you know, that was, since we're in

477
00:36:02,880 --> 00:36:08,119
Antwerp, that was Shalts and it was
three pm on a Friday, and basically

478
00:36:08,119 --> 00:36:13,480
be able to find that request and
look at what were they doing, because

479
00:36:13,480 --> 00:36:16,840
the thing is that users when they
open tickets or whatever it is, the

480
00:36:16,840 --> 00:36:21,920
thing is that they weren't paying attention;
they were doing their usual thing, right?

481
00:36:22,000 --> 00:36:24,840
They didn't sit down and say, let's
cause an error, and I'm now thinking

482
00:36:24,880 --> 00:36:29,679
about every step that I did so
I could be able to explain it to

483
00:36:29,719 --> 00:36:32,320
you later. I usually get: it
doesn't work. What, can you be more

484
00:36:32,360 --> 00:36:38,760
explicit? It doesn't work. Right. So then being able to connect that

485
00:36:38,840 --> 00:36:43,559
information back and say, oh,
yeah, that was shuts right, and

486
00:36:43,599 --> 00:36:46,679
then find that, you know,
traces, logs, metrics, whatever was connected to that

487
00:36:46,800 --> 00:36:52,639
information. It's like being able to
debug in production. Really yeah, And

488
00:36:52,719 --> 00:36:55,719
then I think you just described, like,
turning up a lot of data there too.

489
00:36:57,280 --> 00:37:00,480
Like that's also you know, we
were also warning not to do that

490
00:37:00,519 --> 00:37:05,480
because you're getting buried in minutiae. Yeah, yep. That's where the sampling strategies

491
00:37:05,519 --> 00:37:10,000
come in. Sure, and there are different
ways to go about that. So if

492
00:37:10,079 --> 00:37:15,880
you think about traces specifically, you
basically can choose between head and tail sampling

493
00:37:16,119 --> 00:37:22,599
and with head sampling, there you're basically
going to decide whether to sample the trace

494
00:37:22,679 --> 00:37:25,639
at the beginning. So let's say
when the business transaction starts, right,

495
00:37:25,679 --> 00:37:29,760
they're going to immediately make the decision
of I'm keeping this or I just don't

496
00:37:29,800 --> 00:37:34,840
care about it. Right. Usually
that's the most unbiased type of sampling as

497
00:37:34,840 --> 00:37:40,039
well quick order. Right. The
thing is what if something fails, right,

498
00:37:40,559 --> 00:37:45,239
you don't know that up front. What
if this request was super slow?

499
00:37:45,960 --> 00:37:50,159
It's not something that you can know
up front, so you could be losing

500
00:37:50,639 --> 00:37:53,880
a lot of insightful information. Sure, and that's where you get, you

501
00:37:53,920 --> 00:38:00,519
know, the tail based approach where
you're basically going to collect everything so the

502
00:38:00,719 --> 00:38:05,199
entire trace across all of the services
that it goes through, and at the

503
00:38:05,320 --> 00:38:08,239
end make the decision of is this
an interesting trace? Do I want to

504
00:38:08,320 --> 00:38:12,400
keep it? So, for example, does it carry some specific attributes that

505
00:38:12,440 --> 00:38:15,559
I care about? Or was it
slow? Yeah. And that begs the

506
00:38:15,639 --> 00:38:19,880
question can you turn these things on
and off in production without restarting. Right.

507
00:38:19,880 --> 00:38:23,320
So that's where it becomes important again
how you deployed this, right,

508
00:38:23,360 --> 00:38:28,639
because if you have a sort of
direct export, then it's really tricky, but

509
00:38:29,119 --> 00:38:31,159
if you had it deployed as
a sidecar, then it could be

510
00:38:31,239 --> 00:38:36,440
a thing of changing the configuration of
the sidecar. Like, tail-based says,

511
00:38:36,519 --> 00:38:39,440
I'm going to assess the finished transactions, right? There's nothing special about this,

512
00:38:39,840 --> 00:38:43,800
throw it out right. Oh,
this one had an unusual value,

513
00:38:43,800 --> 00:38:45,679
it took too long, it generated
this error, and so forth. I'm going

514
00:38:45,679 --> 00:38:50,440
to keep this one. And so
that way you're sort of culling as you

515
00:38:50,519 --> 00:38:52,800
complete. Exactly. That's pretty cool.
Yeah, there's a cost that comes with

516
00:38:52,840 --> 00:38:57,599
that, sure, because basically you're
collecting everything, so you've got that overhead

517
00:38:57,679 --> 00:39:01,639
on the workload. Although hopefully asynchronous
to some degree. Actually, that asynchronicity brings

518
00:39:01,639 --> 00:39:05,000
up an interesting point, that you wanted to
have the customer on the phone while you

519
00:39:05,000 --> 00:39:07,000
work them through it, like that
one scenario. How much latency do we

520
00:39:07,079 --> 00:39:10,639
have when we're doing all that processing
separately, Like how soon can we see

521
00:39:10,719 --> 00:39:15,159
data from when it tries actionized?
You happen, right, that's again another

522
00:39:15,239 --> 00:39:20,239
concern that you have to basically be
focused though, right, like what is

523
00:39:21,559 --> 00:39:25,199
what type of latency are you willing
to accept? And then it becomes a

524
00:39:25,239 --> 00:39:29,400
thing of being very mindful of what
type of processing that you're doing there,

525
00:39:29,440 --> 00:39:32,119
because obviously all of that telemetry is
going to have to go through that pipeline

526
00:39:32,519 --> 00:39:37,400
of processors, so you have to
be very mindful of that. And then

527
00:39:37,960 --> 00:39:39,719
you have to get it out as
soon as you can so that you can

528
00:39:39,760 --> 00:39:44,519
see it as quickly as you can
and you can create as quickly as you

529
00:39:44,559 --> 00:39:49,599
can. But we're still talking seconds, aren't we really? Yeah, I

530
00:39:49,599 --> 00:39:54,320
hope even less, milliseconds. That's been
my usual experience with asynchronous telemetry.

531
00:39:54,480 --> 00:40:00,000
It's only slightly behind, but
it's not holding the transaction while it finishes,

532
00:40:00,239 --> 00:40:02,440
you know that. That to me, the big sin here is don't

533
00:40:02,519 --> 00:40:07,960
delay the customer. Get
the working transaction done. All telemetry can

534
00:40:07,000 --> 00:40:12,320
happen later, even though that later
is a few milliseconds, right, That's

535
00:40:12,360 --> 00:40:15,679
why you offload that to another
component again. Yeah, safer, safer

536
00:40:15,719 --> 00:40:19,840
to work that way anyway. What
you don't want is a transaction failed because

537
00:40:19,880 --> 00:40:23,440
you were measuring it. Exactly, exactly. That's
dumb, that's a no-no. Yeah,

538
00:40:23,480 --> 00:40:30,039
so yeah, definitely quantum. I
just got it, just like that,

539
00:40:30,079 --> 00:40:32,159
boom, it all makes it just
makes sense, even though I still don't

540
00:40:32,199 --> 00:40:36,400
understand quantum. I'll leave that to
you and sip, okay, we'll work

541
00:40:36,440 --> 00:40:39,320
We'll keep working on that problem.
Yeah, that's definitely something to keep in

542
00:40:39,360 --> 00:40:44,880
mind is that, you know,
instrumentation is at the end of the day,

543
00:40:44,960 --> 00:40:47,559
nice to have. It's not a
mission-critical component, so you have to

544
00:40:47,599 --> 00:40:52,719
be very mindful of how you
do that. So what's one of the biggest

545
00:40:52,719 --> 00:40:58,000
mistakes that customers or people using
open telemetry make when they first

546
00:40:58,000 --> 00:41:00,519
start? The biggest mistake? Well, I turned everything on. Of course

547
00:41:00,519 --> 00:41:05,119
you did. I do that too. I like all the knobs, turn

548
00:41:05,199 --> 00:41:10,039
on all the knobs and then yeah, another thing was just figuring out what

549
00:41:10,079 --> 00:41:16,320
do I really need? Right.
So it's really understanding what your requirement is,

550
00:41:16,840 --> 00:41:22,119
because if you're saying I want to
be able to debug any user request

551
00:41:22,159 --> 00:41:28,599
that happens in the system and have
full visibility into that as opposed to for

552
00:41:28,639 --> 00:41:35,480
example, I walked into a project
it's massive, there's zero documentation, and

553
00:41:35,519 --> 00:41:37,800
all of the people that worked on
it, they're gone, right, and

554
00:41:37,920 --> 00:41:42,679
there's no instrumentation. Yeah, at
that point, what you would like to

555
00:41:42,719 --> 00:41:46,599
have observability wise, it's just insight
into how does this thing work? Yeah?

556
00:41:46,679 --> 00:41:51,440
Just follow a transaction, right,
you know, have one end to

557
00:41:51,559 --> 00:41:54,239
end trace once and you'll have made
a lot of progress. Yes, Now,

558
00:41:54,280 --> 00:41:58,679
the amount of telemetry that you need
for those two things, it's completely

559
00:41:58,719 --> 00:42:02,559
the opposite, right, it's like
one percent versus one hundred. So it's

560
00:42:02,599 --> 00:42:07,199
definitely still for me as well a
learning experience. I mean, this is

561
00:42:07,239 --> 00:42:12,960
still pretty new and we're basically just
adjusting to see still how the project is

562
00:42:12,960 --> 00:42:16,519
evolving. Yeah, because I mean
many parts of the specification are stable by

563
00:42:16,559 --> 00:42:21,280
now, but a lot of things
are still evolving. Yeah. Another thing

564
00:42:21,360 --> 00:42:25,400
that keeps coming up is sort of
the three pillars of observability being you know,

565
00:42:25,440 --> 00:42:29,400
traces, metrics, and logs,
And I wonder, but what if

566
00:42:29,400 --> 00:42:32,400
a fourth one comes along? That might
just happen. I can't think of a

567
00:42:32,440 --> 00:42:37,599
fourth one, right, neither find
us pretty well. Yeah, when when

568
00:42:37,679 --> 00:42:40,239
the only thing we had was logs, could we think of metrics and traces?

569
00:42:40,280 --> 00:42:44,239
I guess that's true. Yeah,
we started dreaming about them. It's

570
00:42:44,280 --> 00:42:46,400
like, I've had that
experience of I have the log from this

571
00:42:46,480 --> 00:42:50,920
machine, the log from this machine, now try and line those entries up.

572
00:42:51,119 --> 00:42:54,639
Would quantum computing introduce a fourth pillar? I think it would introduce two to

573
00:42:54,639 --> 00:42:59,679
the sixteenth pillars, or two to
the two hundred and thirty-second pillars there.

574
00:43:00,239 --> 00:43:02,320
I also like this idea of knowing
it failed before the customer knows it's

575
00:43:02,440 --> 00:43:06,400
failed. You know, you get
what I'm hoping to, this tail log

576
00:43:06,519 --> 00:43:09,559
thing of, I see these numbers as
insufficient in some way, it kicks

577
00:43:09,880 --> 00:43:13,559
up into a system where someone can
look at it and perhaps even call a

578
00:43:13,559 --> 00:43:16,960
customer and say, hey, we
noticed, can we help you with...? You

579
00:43:17,000 --> 00:43:21,199
know, it's not just you're waiting
for the people to complain or

580
00:43:21,239 --> 00:43:23,559
if it has to be on fire. I have never gotten a message or

581
00:43:23,559 --> 00:43:28,320
an email like that from a company
that I... and if that happened

582
00:43:28,320 --> 00:43:31,599
to me, I would be like
really impressed. So I'm on a website,

583
00:43:31,639 --> 00:43:35,880
for example, and it screws up, and then I immediately get an

584
00:43:35,880 --> 00:43:37,880
email that says, hey, we
noticed you had this problem, didn't work.

585
00:43:37,960 --> 00:43:43,400
Here's a solution. Wow,
that would be amazing. Only time I ever

586
00:43:43,440 --> 00:43:45,599
see that is like a credit card
when I'm traveling, yeah where every so

587
00:43:45,639 --> 00:43:50,119
often I use the card and a
minute or so later the phone rings and

588
00:43:50,119 --> 00:43:53,079
it's hey, are you in Belgium? Right, yep, yep, I'm

589
00:43:53,079 --> 00:43:57,039
in Belgium. Okay, that's good. Then thanks, And then you're like,

590
00:43:57,239 --> 00:44:00,800
that's pretty cool. It turns out
that was Bob from New Jersey.

591
00:44:00,039 --> 00:44:06,440
Just want to know. That happens
to me all the time. I asked

592
00:44:06,480 --> 00:44:09,639
you about how service bus or you
know, service buses, and NServiceBus plays

593
00:44:09,639 --> 00:44:14,000
into this. Obviously, this is
what you do for your job. How

594
00:44:14,039 --> 00:44:21,679
does using a messaging system figure into
open telemetry. Well, like I said

595
00:44:21,679 --> 00:44:25,719
earlier, right, um, building
a message-based system is really nice.

596
00:44:25,880 --> 00:44:29,760
Well, you know, if you
take into account the entire problem space.

597
00:44:30,280 --> 00:44:32,480
But one of the things you can't
get around, basically, is how

598
00:44:32,480 --> 00:44:37,400
hard it becomes to troubleshoot things in
that type of system. Now, with

599
00:44:37,559 --> 00:44:42,840
the platform that we're building at Particular, we don't only have NServiceBus

600
00:44:42,840 --> 00:44:45,440
as a middleware framework, but we
also have a bunch of tools. Now

601
00:44:45,880 --> 00:44:52,960
ServiceInsight specifically is basically already that
sort of black box instrumentation. It's like

602
00:44:52,000 --> 00:44:57,159
you don't need to do anything,
you just need to configure that you want

603
00:44:58,239 --> 00:45:04,199
basically those platform tools to be enabled, and it already gives you insight into

604
00:45:04,360 --> 00:45:08,199
all of the messages that are being
sent around in the system, and you

605
00:45:08,199 --> 00:45:14,760
can see that in a production environment
and understand which exact flow and where that

606
00:45:14,840 --> 00:45:20,639
came from and which message led to
which message. So we've really been doing

607
00:45:20,679 --> 00:45:22,679
observability for years. Yeah, that's
the thing. Well, because of the

608
00:45:22,719 --> 00:45:28,280
asynchrony and out-of-orderness, like
you can't count on timestamps, you really

609
00:45:28,320 --> 00:45:32,039
need to have some kind of attribute
flag to be able to know they're related.

610
00:45:32,119 --> 00:45:35,920
Yeah, those are things that we
capture in the message headers, so we

611
00:45:35,960 --> 00:45:40,639
can basically know which message leads to
which other messages. So that's really

612
00:45:40,679 --> 00:45:45,159
cool. But the thing that we
could never solve, and that's the sort

613
00:45:45,159 --> 00:45:51,239
of gap that OpenTelemetry closes,
is having that visibility system wise, so

614
00:45:51,360 --> 00:45:55,800
connecting back to you know, a
front end or even a database, a web

615
00:45:55,800 --> 00:46:00,599
server that restarted in the middle of
something like those kinds of things like,

616
00:46:00,599 --> 00:46:02,559
that's where you want the metrics to
show. Hey, this machine got into

617
00:46:02,599 --> 00:46:07,840
crisis and the supervisor killed the process
and it recovered and continued, but it

618
00:46:08,199 --> 00:46:12,760
kicked off all this weirdness right,
like I can. You're looking at it,

619
00:46:13,039 --> 00:46:15,639
You're looking at the traces and going, what the heck happened here?

620
00:46:15,840 --> 00:46:17,280
Right? Right? Is the
program broken? Is it just a bug?

621
00:46:17,320 --> 00:46:20,519
And it's like, no, this
is what recovery looks like. Yeah,

622
00:46:20,599 --> 00:46:23,599
yeah, exactly is one of these
machines. Some of that is also built

623
00:46:23,639 --> 00:46:29,880
into the platform because we have insight
into what are the failures, how many failures

624
00:46:29,880 --> 00:46:35,199
are happening, So we also have
messaging-specific metrics already in there, and

625
00:46:35,199 --> 00:46:40,039
yeah, now we're working to
basically make sure that it also connects and

626
00:46:40,199 --> 00:46:46,280
feeds into the OpenTelemetry signals so
that if people are using it, that

627
00:46:46,400 --> 00:46:51,000
they get all of that information in
there as well. Wow, that's pretty

628
00:46:51,239 --> 00:46:54,440
substantially cool, especially when you
start throwing cloud in here where it's entirely

629
00:46:54,480 --> 00:46:59,800
possible the cloud vendor might move you
and it might have impact on your software,

630
00:47:00,159 --> 00:47:02,800
like stuff you literally don't have control
over, Like hopefully your telemetry can

631
00:47:02,840 --> 00:47:06,400
surface that in a way where you're
like, oh, this wasn't us.

632
00:47:07,440 --> 00:47:12,159
It's the vendor changed something for whatever
reason, and we should absorb it.

633
00:47:12,440 --> 00:47:15,440
But we don't know how yet because
we've never had this happen before and right

634
00:47:15,519 --> 00:47:16,639
now we have to look at it
and say, what would we do differently?

635
00:47:17,639 --> 00:47:22,880
Yeah, yeah, because we have
sort of recoverability built in, so retries

636
00:47:22,920 --> 00:47:25,800
and all of that happens out of
the box. You don't even have to

637
00:47:25,800 --> 00:47:30,360
configure them. But yeah, then
it's like, okay, why does it

638
00:47:30,360 --> 00:47:36,280
take ten times for this message to
be processed? Every time a message of

639
00:47:36,320 --> 00:47:38,719
that type comes in? Right,
So it's you don't want to hide that

640
00:47:38,800 --> 00:47:43,920
away. It could be that maybe
there's a database suffering underneath, right,

641
00:47:43,960 --> 00:47:45,559
it retries too quickly and it takes
that long for that thing to get up

642
00:47:45,559 --> 00:47:49,920
to speed or yeah, yeah,
along one of the side effects. Yeah,

643
00:47:50,039 --> 00:47:53,440
that's also why we have delayed retries. So we'll basically also have this

644
00:47:53,559 --> 00:47:59,239
sort of backoff mechanism. We'll
retry immediately. But if we see exactly

645
00:47:59,400 --> 00:48:01,760
that's too yep, yeah, it's
the same. It's the same concept.

646
00:48:01,840 --> 00:48:07,639
Yeah. Yeah. So what's next
for you? What are you doing next?

647
00:48:07,840 --> 00:48:12,440
What's in your inbox? What's in
my inbox? Well, uh,

648
00:48:12,480 --> 00:48:15,960
we're at Techorama today. Then
I'm taking a few days with the family.

649
00:48:16,119 --> 00:48:20,360
It's a long weekend, is it? It's a long weekend. There's a holiday tomorrow.

650
00:48:20,400 --> 00:48:25,199
And then they're basically bridging to the
weekend and then on Sunday, I'm

651
00:48:25,280 --> 00:48:30,400
leaving to Oslo for NDC. Very
great. I will not be joining you

652
00:48:30,440 --> 00:48:34,159
there this year, so I heard. Yeah, not happy to hear that.

653
00:48:34,400 --> 00:48:37,639
Sorry, just schedules. Yeah.
We love Oslo, we love NDC.

654
00:48:37,800 --> 00:48:39,880
We're there all the time. Usually, I'm gonna have to miss it

655
00:48:39,920 --> 00:48:43,639
this year. Yeah, I'm really
excited to go. It's my first time

656
00:48:43,679 --> 00:48:46,639
at Oslo's top sure to get into. Yeah, you know, and I

657
00:48:46,760 --> 00:48:51,280
always loved the way they did the
show on the floor in the but I

658
00:48:51,400 --> 00:48:57,760
kind of tell you this: Techorama's pretty
close to the Yeah, and I have

659
00:48:57,840 --> 00:49:01,559
it homies. Yeah, yeah,
all right, it's really cool. Everybody

660
00:49:01,559 --> 00:49:09,679
give it up for Laila Bougria and
we'll see you next time on dot net

661
00:49:09,800 --> 00:49:36,719
rocks. Dot net Rocks is brought
to you by Franklins.Net and produced by

662
00:49:36,800 --> 00:49:42,760
PWOP Studios, a full service audio, video and post production facility located physically

663
00:49:42,800 --> 00:49:46,719
in New London, Connecticut, and
of course in the cloud online at pwop

664
00:49:47,000 --> 00:49:52,599
dot com. Visit our website at
d o t n e t r o c k s dot com for

665
00:49:52,800 --> 00:49:58,599
RSS feeds, downloads, mobile apps, comments, and access to the full

666
00:49:58,719 --> 00:50:02,280
archives going back to show number one, recorded in September two thousand and two.

667
00:50:02,800 --> 00:50:06,960
And make sure you check out our
sponsors. They keep us in business.

668
00:50:07,440 --> 00:50:10,159
Now go write some code. See
you next time. You got a

669
00:50:10,239 --> 00:50:12,400
dead middle band
