WEBVTT

1
00:00:01.000 --> 00:00:04.759
How'd you like to listen to .NET
Rocks with no ads? Easy.

2
00:00:05.320 --> 00:00:09.400
Become a patron for just five dollars
a month. You get access to a

3
00:00:09.480 --> 00:00:14.199
private RSS feed where all the shows
have no ads. Twenty dollars a month,

4
00:00:14.240 --> 00:00:18.399
we'll get you that and a special
.NET Rocks patron mug. Sign

5
00:00:18.480 --> 00:00:24.079
up now at patreon.dotnetrocks.com.
Hey there, this

6
00:00:24.120 --> 00:00:28.440
is Jeff Fritz, the Purple Blazer
guy from Microsoft, letting you in on

7
00:00:28.480 --> 00:00:32.640
a little secret about my friend Carl
Franklin. You know, the guy who

8
00:00:32.679 --> 00:00:37.399
started .NET Rocks, the first
podcast about .NET, in two thousand and

9
00:00:37.399 --> 00:00:43.000
two? The guy who's been teaching
Blazor on YouTube since twenty twenty? Yeah,

10
00:00:43.039 --> 00:00:47.600
that Carl Franklin. Well, Carl's
joined up with the folks from Code

11
00:00:47.600 --> 00:00:52.399
in a Castle to teach a week-long,
hands-on Blazor class at... are you

12
00:00:52.399 --> 00:00:58.439
ready to get this? At a
castle slash villa in Tuscany. It's sort

13
00:00:58.439 --> 00:01:03.920
of a luxury vacation with Blazor
learning built in. Carl's calling it the

14
00:01:03.000 --> 00:01:10.000
Blazor Master Class. You'll learn Blazor
from the ground up, finishing the week

15
00:01:10.040 --> 00:01:15.680
with the ability to build and deploy
Blazor applications. Since the training happens for

16
00:01:15.719 --> 00:01:19.920
only four hours in the morning over
six days, you can bring your significant

17
00:01:19.959 --> 00:01:26.040
other, your partner, with you, and you
should, right? This part of Italy is absolutely

18
00:01:26.079 --> 00:01:30.560
beautiful. There's so much to see
and do. And Larry and Marco

19
00:01:30.680 --> 00:01:34.799
from Code in a Castle are organizing
daily activities both at the castle and in

20
00:01:34.840 --> 00:01:38.959
the area. The castle is in
the Maremma, a less touristed region of

21
00:01:40.079 --> 00:01:45.879
Tuscany, offering both classic Tuscan hill
country as well as easy access to the

22
00:01:45.879 --> 00:01:52.239
Etruscan Riviera. With sublime local food, wine and olive oil around every corner.

23
00:01:52.519 --> 00:01:56.400
Breakfast is included every day. There
will be two communal dinners at the

24
00:01:56.439 --> 00:02:00.680
castle bookending the experience, and
most other meals and all activities are included.

25
00:02:01.480 --> 00:02:07.319
And did I mention you'll learn Blazor
in person from Carl Franklin? Listen.

26
00:02:07.719 --> 00:02:12.199
Space is limited and for very good
reason. This is quality training in a

27
00:02:12.240 --> 00:02:19.759
beautiful setting. Go to
codeinacastle.com/blazor2023, that's

28
00:02:19.879 --> 00:02:25.159
B-L-A-Z-O-R two zero two
three, to take advantage of this amazing opportunity

29
00:02:25.360 --> 00:02:31.240
to join Carl in Tuscany for an
unforgettable week of la dolce vita while advancing

30
00:02:31.240 --> 00:02:38.879
your programming skills in this important new
technology. After building software for a while,

31
00:02:38.960 --> 00:02:43.560
you know it's only a matter of
time before you see an HTTP timeout

32
00:02:43.639 --> 00:02:46.800
or a database deadlock. In software, it's not a case of if things

33
00:02:46.879 --> 00:02:52.199
fail, but a case of when.
One mishap like this and valuable data is

34
00:02:52.240 --> 00:02:57.400
lost forever. And these failures occur
all the time, but it doesn't have

35
00:02:57.439 --> 00:03:00.560
to be this way. Introducing
NServiceBus, the ultimate tool to build

36
00:03:00.639 --> 00:03:07.759
robust and reliable systems that can handle
failures gracefully, maintain high availability, and

37
00:03:07.919 --> 00:03:12.479
scale to meet growing demand. For
more than fifteen years, NServiceBus

38
00:03:12.479 --> 00:03:16.360
has been trusted to run mission critical
systems that must not go down or lose

39
00:03:16.479 --> 00:03:22.199
any data ever, and now you
can try it for yourself.

40
00:03:22.240 --> 00:03:25.680
NServiceBus integrates seamlessly with your .NET
applications and can be hosted on premises or

41
00:03:25.719 --> 00:03:30.919
in the cloud. Say goodbye to
lost data and system failures and say hello

42
00:03:30.960 --> 00:03:35.879
to a better, more reliable way
of building distributed systems. Try

43
00:03:35.879 --> 00:03:40.439
NServiceBus today by heading over to
go.particular.net/dotnetrocks

44
00:03:40.800 --> 00:04:00.879
and start building better systems with asynchronous
messaging using NServiceBus. Hey, Antwerp!

45
00:04:00.240 --> 00:04:11.319
It's .NET Rocks! Holy crap, there
must be fifty thousand people here.

46
00:04:11.479 --> 00:04:13.719
Yeah, who knew we were in
a stadium? I know, right,

47
00:04:13.759 --> 00:04:15.040
we are in a sta... We're
actually in a movie theater, which is

48
00:04:15.079 --> 00:04:18.240
cool. It is cool. The
last time we did a .NET Rocks

49
00:04:18.240 --> 00:04:23.480
in a movie theater I think was
in Sofia, Bulgaria. Oh yeah,

50
00:04:23.519 --> 00:04:27.160
do you remember that? Yeah,
a few years ago. That was DevReach, yeah.

51
00:04:27.199 --> 00:04:30.319
And the funny thing was the Bulgarians
thought it would be funny if Richard and

52
00:04:30.360 --> 00:04:35.639
I announced the names of the winners, the Bulgarian winners of the swag at

53
00:04:35.680 --> 00:04:41.480
the end of the show, and
that was funny for them, for them,

54
00:04:43.399 --> 00:04:50.560
Bulgarian names need to buy a bow
Okay, it lacks where snudge and

55
00:04:50.720 --> 00:04:55.120
they were just laughing their butts off. That's you. Oh you were there?

56
00:04:55.160 --> 00:04:59.480
Okay, all right, well,
uh, we're glad to be here

57
00:05:00.079 --> 00:05:02.959
obviously. Yeah, last day of
the show. Fun to do a live

58
00:05:03.000 --> 00:05:06.800
show. Yeah, it's a live
show. We have some good stuff coming

59
00:05:06.879 --> 00:05:11.199
up here. Laila Bougria is here
and we'll be talking to her in a

60
00:05:11.240 --> 00:05:14.319
minute. But first we have to
do this little thing called Better Know a

61
00:05:14.360 --> 00:05:24.759
Framework. Roll the crazy music. All
right, buddy, what do you got?

62
00:05:25.319 --> 00:05:28.959
Well, I saw this come across
Twitter. It's a tweet and while

63
00:05:28.959 --> 00:05:32.519
I'm linking to it, Boston Dynamics. You know who those crazy people are,

64
00:05:32.399 --> 00:05:35.839
the guys with the robots. They
make the robots that dance, they

65
00:05:35.879 --> 00:05:40.600
do backflips now. Yeah,
a little parkour with robots and stuff.

66
00:05:40.639 --> 00:05:44.160
Yeah, they used to have a
little uncanny tethered robots, robots that were

67
00:05:44.160 --> 00:05:47.279
tethered, and then they got gas
powered robots. Now they're battery driven and

68
00:05:47.399 --> 00:05:51.160
they put out videos every once in
a while of things that look like animals,

69
00:05:51.199 --> 00:05:57.720
like cougars and dogs. Well,
anyway, they've put ChatGPT into

70
00:05:57.720 --> 00:06:01.000
a robot and now you can talk
to it and it will talk to you

71
00:06:01.079 --> 00:06:05.120
back. And so there's a tweet
about that. It was from April,

72
00:06:05.240 --> 00:06:10.720
but I thought it was so cool
and a little bit scary. But they're

73
00:06:10.759 --> 00:06:14.720
asking it questions like, you know, are you func... It's like Data,

74
00:06:14.800 --> 00:06:16.600
are you functioning within normal parameters?
You know? And it would say,

75
00:06:16.720 --> 00:06:20.160
yes, you know, I have
this blah blah blah. What my levels

76
00:06:20.199 --> 00:06:25.199
are? You know these kinds of
things, my battery level? How many

77
00:06:25.560 --> 00:06:30.360
what did it say? How many
interactions in your last mission, and it

78
00:06:30.399 --> 00:06:33.920
will tell you about its last mission
and where it went. Just as they

79
00:06:33.959 --> 00:06:38.040
used the word mission, I don't
even know if it was mission, but

80
00:06:38.120 --> 00:06:42.519
was that an extermination mission? It could
have been. Just wondering. It's pretty scary,

81
00:06:42.560 --> 00:06:45.399
but it's cool. That's awesome.
So that's what I got.

82
00:06:45.600 --> 00:06:48.480
It's a tweet with a video.
Okay, yeah, Boston Dynamics videos are

83
00:06:48.480 --> 00:06:51.959
always amazing, fun. Good one.
Who's talking to us, Richard? Grabbed a

84
00:06:51.959 --> 00:06:57.399
comment off of show seventeen fifty three. That's the one we did with Mika

85
00:06:57.480 --> 00:07:02.759
about Visual Studio twenty twenty two productivity
back in August last year, and Mark

86
00:07:02.759 --> 00:07:05.240
Wansel had this great comment. Mark
has a lot of great comments as well,

87
00:07:05.240 --> 00:07:10.079
and he says one of the things
that Mika noted a few times was telemetry.

88
00:07:11.040 --> 00:07:14.000
I think that this would be an
interesting show topic. There's a popular

89
00:07:14.079 --> 00:07:17.759
GitHub project called OpenTelemetry that might
be a good starting place. There seems

90
00:07:17.759 --> 00:07:20.759
to be an art of telemetry.
What to collect, how much, performance

91
00:07:20.800 --> 00:07:26.759
considerations and privacy considerations. Check out
opentelemetry.io. What do you

92
00:07:26.759 --> 00:07:30.639
think? That's not a good idea. We don't want to do that. No,

93
00:07:30.920 --> 00:07:32.240
we won't do that show. Yeah, sorry, Mark, I wish

94
00:07:32.240 --> 00:07:34.680
we could help you, but we
can't help you. Yeah, but I

95
00:07:34.720 --> 00:07:38.600
will send you a copy of Music to
Code By. And if you'd like a copy of Music to

96
00:07:38.759 --> 00:07:41.480
Code By, write a comment on the
website at dotnetrocks.com or

97
00:07:41.480 --> 00:07:43.959
on the facebooks. We publish every
show there, and if you comment there

98
00:07:43.959 --> 00:07:45.639
and we read it on the show,
we'll send you a copy of

99
00:07:45.720 --> 00:07:48.160
Music to Code By. And you should follow
us on Twitter. But the real fun

100
00:07:48.279 --> 00:07:54.120
happens on Mastodon. On Mastodon,
I'm @carlfranklin@techhub.social,

101
00:07:54.240 --> 00:07:58.120
and I'm @richcampbell@mastodon.social. Send us a toot, let us

102
00:07:58.160 --> 00:08:03.759
know you're listening, and that brings
us to our show on OpenTelemetry.

103
00:08:03.839 --> 00:08:07.839
What? Yeah. Oh, sorry, that's
the topic. Laila Bougria is here.

104
00:08:07.920 --> 00:08:11.800
She's a software engineer with over fifteen
years of experience in the dot net space

105
00:08:11.240 --> 00:08:16.959
and currently works at Particular Software, where
they build NServiceBus. Maybe you've used

106
00:08:16.000 --> 00:08:18.720
it, maybe heard of it. They've certainly
been on the show a few times.

107
00:08:18.759 --> 00:08:24.120
Yeah. She's a Microsoft MVP and a
frequent speaker at conferences, and in her spare time she

108
00:08:24.160 --> 00:08:28.720
loves to knit and crochet. Welcome. Thank you. How about Laila,

109
00:08:28.879 --> 00:08:35.200
huh, we were lying when we
said we weren't going to do a show

110
00:08:35.240 --> 00:08:39.320
about open telemetry. That's kind of
why we read that. Yeah, yeah,

111
00:08:39.320 --> 00:08:41.759
I figured if I've got a comment
where literally somebody asked for open

112
00:08:41.799 --> 00:08:43.879
telemetry, now is the time to
read it. So, Laila, what's

113
00:08:43.879 --> 00:08:52.440
the elevator pitch for OpenTelemetry?
Well, the elevator pitch. So we've

114
00:08:52.679 --> 00:08:58.320
we've had multiple telemetry signals in software
for years, right, the oldest one

115
00:08:58.320 --> 00:09:03.000
being logs. I think we all
know logs. We've also mostly used metrics.

116
00:09:03.480 --> 00:09:07.399
Maybe, I guess, distributed tracing is the
newest signal, but I would say

117
00:09:07.440 --> 00:09:13.960
that the elevator pitch for OpenTelemetry
is really correlating them all together. And

118
00:09:13.080 --> 00:09:18.440
that's what makes it really interesting for
me at least, because you know,

119
00:09:18.480 --> 00:09:22.279
each signal has its own value.
And we've also been logging for years,

120
00:09:22.320 --> 00:09:24.279
and even though we can see,
like you know, tracing might be the

121
00:09:24.360 --> 00:09:28.440
better option today, but then we
still have all of those logs we don't

122
00:09:28.480 --> 00:09:33.639
want to go rewrite entire applications,
though. We do have log providers so

123
00:09:33.759 --> 00:09:37.080
you can plug in however you want
to do your logging, but that's not

124
00:09:37.159 --> 00:09:41.480
really that's taking the same source and
putting it in different places, isn't it.

125
00:09:41.519 --> 00:09:46.480
You're talking about ingesting different sources into
one place exactly, Yeah, and

126
00:09:46.519 --> 00:09:52.399
then being able to connect it.
So basically, imagine, you know,

127
00:09:52.480 --> 00:09:58.480
over the rainbow, that you get alerted
and there's a metric that looks out of

128
00:09:58.519 --> 00:10:01.240
whack, and it's like, okay, something is up. What is up?

129
00:10:01.519 --> 00:10:05.200
I don't know, just the metric
is out of whack. So you

130
00:10:05.279 --> 00:10:09.080
have to go figure out how to
do that, and that's usually a challenge

131
00:10:09.159 --> 00:10:13.559
because you then have to go figure
out, Okay, how do I connect

132
00:10:13.559 --> 00:10:16.240
that to those other signals that I
have. But what if you could look

133
00:10:16.240 --> 00:10:20.320
at the metric and say, okay, I can see that it's correlated to

134
00:10:20.759 --> 00:10:24.000
these traces, and then go look
at that, and then the traces would

135
00:10:24.000 --> 00:10:28.799
also be connected to the logs and
you would be able to basically, yeah,

136
00:10:28.840 --> 00:10:33.399
paint that entire picture of what's going
on, and you wouldn't be losing

137
00:10:33.440 --> 00:10:37.919
all of that time doing that thing manually. Sounds like a Wikipedia rabbit hole,

138
00:10:37.080 --> 00:10:41.639
right. One thing leads you to
another to another. So I'm trying
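
NOTE
A minimal Python sketch of the correlation idea described in this part of the
conversation: once spans and log lines carry the same trace ID, a metric that
looks off can be followed to its traces and from there to the matching logs.
The records, field names, and helper functions are hypothetical illustrations
and are not part of OpenTelemetry itself.
```python
# Hypothetical records; in a real system these would come from a telemetry backend.
spans = [
    {"trace_id": "t1", "name": "POST /orders", "duration_ms": 2300},
    {"trace_id": "t2", "name": "POST /orders", "duration_ms": 40},
]
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "database deadlock, retrying"},
    {"trace_id": "t2", "level": "INFO", "message": "order stored"},
]
def slow_traces(span_records, threshold_ms):
    """Step 1: the latency metric looks off, so find the traces behind the slow requests."""
    return {s["trace_id"] for s in span_records if s["duration_ms"] >= threshold_ms}
def logs_for(log_records, trace_ids):
    """Step 2: keep only the log messages that share a trace ID with those traces."""
    return [entry["message"] for entry in log_records if entry["trace_id"] in trace_ids]
```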

139
00:10:41.639 --> 00:10:46.039
to distinguish between all these different things. I mean, logging means a particular

140
00:10:46.039 --> 00:10:54.679
product like SQL Server spitting out logs
about how it's functioning, possibly versus metrics

141
00:10:54.720 --> 00:10:58.240
being more things like the state of
a server, like the hard drive's pinned

142
00:10:58.279 --> 00:11:03.759
or running low on memory. And then
when I think about traces, I think

143
00:11:03.799 --> 00:11:09.360
about there's tools that specifically are about
following a button click from the client to

144
00:11:09.559 --> 00:11:15.759
the server to the database and back
again. Right. Yeah, I like

145
00:11:15.799 --> 00:11:18.679
to think of that like following a
business transaction, right, right, like

146
00:11:18.759 --> 00:11:22.720
the workflow. Right. So it's
true, Yeah, that's definitely true.

147
00:11:22.720 --> 00:11:26.120
But at the same time, we've
also been logging in our applications a lot,

148
00:11:26.759 --> 00:11:31.720
and that still then connects together.
This is as a developer writing code

149
00:11:31.759 --> 00:11:37.039
to push messages onto a log right, So that's usually I'm a developer,

150
00:11:37.120 --> 00:11:43.000
right, So I'm always looking at
things from the application perspective, and how

151
00:11:43.000 --> 00:11:46.639
do I make this application observable?
And there's multiple signals, and you would

152
00:11:46.679 --> 00:11:52.960
choose each individual signal based on whatever
your scenario is. It could even be

153
00:11:52.080 --> 00:11:58.600
that for a specific scenario you might
come to the conclusion that using multiple signals

154
00:11:58.679 --> 00:12:03.320
might be useful. So for example, let's say that a failure occurs,

155
00:12:03.559 --> 00:12:07.200
right, you want to keep track
of that and have a trace that reflects

156
00:12:07.240 --> 00:12:11.480
that failure. But you might also
want to have that reflected in metrics so you

157
00:12:11.519 --> 00:12:16.240
can do the alerting and
all of that. Yeah, So that

158
00:12:16.480 --> 00:12:18.200
meaning it trickles up through
the system means to say there was a

159
00:12:18.240 --> 00:12:22.679
failure that occurred as well as what
shows up in the log. Right. Now,

160
00:12:22.960 --> 00:12:28.559
there's a lot of third party products
out there that do telemetry. Or

161
00:12:28.799 --> 00:12:33.720
does open Telemetry allow you to use
those as sources and then pull them together,

162
00:12:33.840 --> 00:12:37.919
or does open Telemetry have its own
things that you can plug in or

163
00:12:37.960 --> 00:12:41.799
both? It's both, and that's
a good part because I think if we

164
00:12:41.840 --> 00:12:46.720
look at our applications and we want
to make them observable, there's a lot

165
00:12:46.759 --> 00:12:50.799
that we can do. But I
think with the Open Telemetry project, and

166
00:12:50.919 --> 00:12:56.200
also like all of the effort that
the entire community has basically put into this,

167
00:12:56.559 --> 00:13:00.120
they've made it easy for us,
right because you could basically say what

168
00:13:00.240 --> 00:13:03.960
stack are you using? Oh, I'm
using ASP.NET Core and I'm

169
00:13:03.039 --> 00:13:07.039
using the Azure SDK, whatever it
is, right, and you could just

170
00:13:07.799 --> 00:13:13.399
use those instrumentation libraries that are available
from those frameworks. It could be built

171
00:13:13.440 --> 00:13:18.200
into the framework or could be a
dedicated package and you can turn them on

172
00:13:18.000 --> 00:13:22.120
and just by doing that, you're
already collecting a bunch of information, and

173
00:13:22.159 --> 00:13:26.159
specifically in the distributed system, it's
usually going to give you insight into that

174
00:13:26.519 --> 00:13:33.200
interservice communication where we have the blind
gaps. So it already like gives you

175
00:13:33.399 --> 00:13:37.679
a lot of information. So the
cool thing is that you can then intercept

176
00:13:37.080 --> 00:13:43.039
basically those traces that are being generated
by those libraries, and in .NET that

177
00:13:43.159 --> 00:13:46.440
would be calling Activity.Current,
right, and you could add

178
00:13:46.960 --> 00:13:50.960
you could create your own activities,
which is, by the way,

179
00:13:50.000 --> 00:13:56.759
the sort of same thing as a
span. But yeah, they basically what

180
00:13:56.840 --> 00:14:01.159
they did is the Activity API already
existed, and instead of creating a completely

181
00:14:01.200 --> 00:14:05.120
new API to match the naming of
the OpenTelemetry specification, they just

182
00:14:05.279 --> 00:14:09.799
implemented the specification inside the Activity API. So that's why the name is a

183
00:14:09.840 --> 00:14:13.200
little bit different. But basically,
what you could do is take the current

184
00:14:13.200 --> 00:14:16.120
activity, which could be emitted by
an instrumentation library, and say, I

185
00:14:16.159 --> 00:14:22.639
want to add some information to this
that is specific to my application, to

186
00:14:22.679 --> 00:14:26.840
the workflow I'm running in so that
I can get even more insight. What's
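
NOTE
A small Python sketch of the enrichment pattern just described: a library
opens a span (an "activity" in .NET terms) and application code adds its own
tags to whichever span is current. Span, start_span, current_span and
place_order are hypothetical stand-ins written for this illustration; they are
not the .NET Activity API or the OpenTelemetry API.
```python
import contextvars
from contextlib import contextmanager
# Hypothetical holder for the span that is active in the current execution context.
_current_span = contextvars.ContextVar("current_span", default=None)
class Span:
    """A named unit of work that carries key/value tags."""
    def __init__(self, name):
        self.name = name
        self.tags = {}
    def set_tag(self, key, value):
        self.tags[key] = value
@contextmanager
def start_span(name):
    """Make a new span current inside the with-block, as an instrumentation library would."""
    span = Span(name)
    token = _current_span.set(span)
    try:
        yield span
    finally:
        _current_span.reset(token)
def current_span():
    """Return the active span, or None when no span is open."""
    return _current_span.get()
def place_order(order_id, total):
    """Application code: enrich whatever span the framework already opened."""
    span = current_span()
    if span is not None:
        span.set_tag("order.id", order_id)
        span.set_tag("order.total", total)
    return order_id
```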

187
00:14:26.879 --> 00:14:31.840
the voodoo that allows us to go
between the different tiers in an app and

188
00:14:31.879 --> 00:14:35.840
say these are all part of the
same transaction. Right, that's basically a

189
00:14:35.919 --> 00:14:41.759
propagation mechanism really, because if you
think of a trace, it's basically a

190
00:14:41.799 --> 00:14:46.120
bunch of spans that are connected to
each other. Now, what happens is

191
00:14:46.200 --> 00:14:50.639
at the beginning of the trace,
we basically get a trace ID assigned and

192
00:14:50.720 --> 00:14:56.759
that's going to be carried across all
of the spans, and then each span has

193
00:14:56.799 --> 00:15:01.720
a unique ID. Now, in
order for that information to propagate across multiple

194
00:15:01.759 --> 00:15:07.440
services, we need a propagation mechanism. And there's multiple protocols that are basically

195
00:15:07.480 --> 00:15:09.519
supported by the OpenTelemetry project.
The most well-known one is

196
00:15:09.559 --> 00:15:15.320
the W3C Trace Context for
HTTP headers. I always feel like I need a

197
00:15:15.320 --> 00:15:22.200
breath. Yeah. And then there's
another one for gRPC. So it depends
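
NOTE
A Python sketch of the propagation mechanism discussed above, using the W3C
Trace Context "traceparent" header: version 00, a 32-hex-character trace ID,
a 16-hex-character parent span ID, and 2 hex characters of flags. The trace ID
stays the same across services while every span gets its own ID. The function
names are assumptions made for this illustration, and real propagators handle
more cases (tracestate, all-zero IDs, future versions).
```python
import re
import secrets
# Header layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
def new_span_id():
    """Return a fresh 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)
def start_trace():
    """Start a new trace: one 16-byte trace ID plus the root span's ID."""
    return secrets.token_hex(16), new_span_id()
def inject(trace_id, span_id, sampled=True):
    """Build the traceparent value the caller sends with its outgoing request."""
    flags = "01" if sampled else "00"
    return "00-" + trace_id + "-" + span_id + "-" + flags
def extract(header):
    """Parse a traceparent value; return (trace_id, parent_span_id) or None if malformed."""
    match = TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    return match.group(2), match.group(3)
def continue_trace(header):
    """On the receiving service: keep the trace ID, mint a new span ID, remember the parent."""
    parsed = extract(header)
    if parsed is None:
        trace_id, parent_id = secrets.token_hex(16), None
    else:
        trace_id, parent_id = parsed
    return {"trace_id": trace_id, "span_id": new_span_id(), "parent_id": parent_id}
```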

198
00:15:22.200 --> 00:15:26.200
on what you're doing there. So
I have, for example, an Azure

199
00:15:26.240 --> 00:15:31.759
web app, right? I
have Application Insights turned on, I've got

200
00:15:31.799 --> 00:15:35.480
all the switches lit up, and
do I need open telemetry at that point?

201
00:15:35.559 --> 00:15:39.480
What's it going to give me over
what Azure Insights already has? So

202
00:15:39.519 --> 00:15:46.679
the way that I look at that
is what Application Insights provides you is also

203
00:15:46.799 --> 00:15:52.639
sort of known as black box instrumentation, so it's basically independent of your specific

204
00:15:52.639 --> 00:15:58.639
application code. Yeah, but obviously
the things that we are doing in our

205
00:15:58.679 --> 00:16:03.799
code is usually the interesting bits,
and sometimes we need a little bit more

206
00:16:03.840 --> 00:16:07.279
insight to understand, you know,
what pieces of code are we executing there,

207
00:16:07.320 --> 00:16:11.639
what are we doing, what is
like the cause of latency or whatever

208
00:16:11.639 --> 00:16:15.200
it is, what piece is slow.
Application Insights doesn't provide that? Well, I

209
00:16:15.240 --> 00:16:22.720
think it's quite difficult to
compare because, at least to my

210
00:16:22.840 --> 00:16:26.600
understanding, it's more of an overall
view that you get, and you can

211
00:16:26.639 --> 00:16:33.519
still use the Application Insights SDK
directly and still emit like application specific telemetry.

212
00:16:33.600 --> 00:16:36.960
But then you're tied, right? You're tied to the vendor,

213
00:16:37.360 --> 00:16:41.039
right, and the thing is right
right exactly, and if you don't want

214
00:16:41.120 --> 00:16:45.799
to change to another vendor, you
have that vendor lock-in. That's the point,

215
00:16:45.039 --> 00:16:49.320
right, Yeah, So if you
use OpenTelemetry, then you could

216
00:16:49.360 --> 00:16:53.480
just wire up a different exporter.
I mean, that said, App Insights has

217
00:16:53.559 --> 00:16:57.399
good features for .NET specifically. Absolutely. Yeah, if you've got some other

218
00:16:57.440 --> 00:17:00.799
code written in other things that aren't
.NET related, yeah, App Insights is

219
00:17:00.840 --> 00:17:03.160
only going to do so much for
you and certainly not going to work if

220
00:17:03.160 --> 00:17:07.119
you're in a container on AWS,
is it? Yeah. So, I

221
00:17:07.119 --> 00:17:11.960
mean it certainly opens the door to
working with more platforms, even more places

222
00:17:12.039 --> 00:17:17.400
and open hey, hey, there's
a concept for you. Yeah, and
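
NOTE
A Python sketch of the "wire up a different exporter" point made a moment
ago: application code records telemetry against one small interface and the
backend is chosen once at start-up, so switching vendors does not touch the
instrumented code. TextExporter, InMemoryExporter, Tracer and handle_request
are hypothetical names for this illustration, not real OpenTelemetry classes.
```python
class TextExporter:
    """Hypothetical backend that keeps each span as a formatted text line."""
    def __init__(self):
        self.lines = []
    def export(self, span):
        self.lines.append(span["name"] + " " + str(span["duration_ms"]) + "ms")
class InMemoryExporter:
    """Hypothetical second backend exposing the same export method."""
    def __init__(self):
        self.spans = []
    def export(self, span):
        self.spans.append(dict(span))
class Tracer:
    """Instrumentation talks to the tracer only; the exporter is picked at start-up."""
    def __init__(self, exporter):
        self.exporter = exporter
    def record(self, name, duration_ms):
        self.exporter.export({"name": name, "duration_ms": duration_ms})
def handle_request(tracer):
    """Application code: identical no matter which backend is wired in."""
    tracer.record("GET /health", 3)
```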

223
00:17:17.400 --> 00:17:22.359
it being available cross platform and cross
runtime as well. Like, there's implementations

224
00:17:22.440 --> 00:17:27.599
for instrumentation libraries in many languages for
many frameworks. So that's really cool, especially

225
00:17:27.640 --> 00:17:30.839
if you think of like multi stack
applications. Right. Sure, if I've

226
00:17:30.839 --> 00:17:34.559
got a group of Python developers going
to build a data importer for me,

227
00:17:34.640 --> 00:17:37.559
the fact that I can instrument it
the same way as everything I've got built

228
00:17:37.599 --> 00:17:41.319
in .NET, that's pretty compelling, right, and bring everything under one

229
00:17:41.359 --> 00:17:45.960
roof and measured the same way.
I mean, that's the real problem is

230
00:17:45.960 --> 00:17:51.559
that often we're chasing problems that transition
between different systems, and because they're measured

231
00:17:51.559 --> 00:17:55.519
differently, it's very hard to associate
the stuff together. I'm just thinking in terms

232
00:17:55.519 --> 00:17:57.279
of how much code I need to
write as a developer to take advantage of

233
00:17:57.279 --> 00:18:00.839
all this and how much of it
comes out of the box. Well, that's

234
00:18:00.839 --> 00:18:06.559
what I That's why I mentioned the
instrumentation libraries, right, and even if

235
00:18:06.599 --> 00:18:11.920
you know things like event counters and
stuff like that, even that has dedicated

236
00:18:11.119 --> 00:18:15.920
libraries available already, so you can
basically just turn them on and like I

237
00:18:15.960 --> 00:18:22.079
said, you could plug into that
and add to that information. So

238
00:18:22.160 --> 00:18:25.920
usually what I say is, look
at what it already gives you. Turn

239
00:18:25.960 --> 00:18:30.039
on the instrumentation library. Yeah exactly, don't reinvent the library. So look

240
00:18:30.039 --> 00:18:33.240
at what's out there. Turn it
on and see what it emits, and

241
00:18:33.640 --> 00:18:37.559
take a look at what type of
insight that already gives you, and it

242
00:18:37.559 --> 00:18:40.599
probably is picked up by OpenTelemetry, so
you're just fine. Yeah, yeah,

243
00:18:40.640 --> 00:18:45.720
Well, for example, let's say
that you have ASP.NET Core instrumentation enabled,

244
00:18:45.839 --> 00:18:48.559
right, so you're going to see
the request, but you don't get a

245
00:18:48.559 --> 00:18:52.359
lot of insight into the request.
But I mean it has hooks that you

246
00:18:52.400 --> 00:18:59.799
could then plug in additional information and
expose whatever is you know, interesting to

247
00:18:59.799 --> 00:19:03.039
you in that scenario, so that
you could understand like the sort of business

248
00:19:03.119 --> 00:19:07.200
context of what's going on. And
that's what makes it really powerful. I

249
00:19:07.240 --> 00:19:11.240
spent enough time on the firefighting side
of being a sysadmin where there's stuff being

250
00:19:11.240 --> 00:19:15.920
spewed out in the logs that we're
looking at. We just don't know what

251
00:19:15.960 --> 00:19:18.519
it means, right, right,
like anything there? Yeah, it's clear.

252
00:19:18.559 --> 00:19:21.759
It's like your own little Internet.
Everything you need to know is there,

253
00:19:22.119 --> 00:19:25.720
you just can't find it. Yeah, So you know, how do

254
00:19:25.720 --> 00:19:30.480
you add the additional information that helps
someone see? This is where we're having

255
00:19:30.640 --> 00:19:34.759
right. That's where vendors really come
in, right, because they are then

256
00:19:34.839 --> 00:19:40.559
going to offer capabilities that allow you
to query, to analyze that information,

257
00:19:41.000 --> 00:19:45.000
and to basically get to actionable insights, because that's the whole point of telemetry.

258
00:19:45.079 --> 00:19:47.400
Right. That's a bunch of data, But then what do I do

259
00:19:47.480 --> 00:19:52.440
with it? Right? So you
want to basically have tools that help you

260
00:19:52.279 --> 00:19:56.279
get pointers on how do I fix
this problem or how do I improve this

261
00:19:56.400 --> 00:20:03.240
latency issue that I'm seeing, or
maybe even see like which feature gets used

262
00:20:03.240 --> 00:20:06.759
a lot, and things like that. Sure, it's almost more of a profiling

263
00:20:06.920 --> 00:20:08.559
thing, right? Like, what are
the functions we call the most often?

264
00:20:08.640 --> 00:20:11.640
And we know perfectly well why the
system is slow. It's the database's

265
00:20:11.759 --> 00:20:17.240
fault. We just blame the
DBA. Then we're done. Life is

266
00:20:17.240 --> 00:20:22.759
good, all right. You know
Another one that I always think of is

267
00:20:22.880 --> 00:20:30.480
because I'm you know, I'm like
an observability enthusiast, I would say,

268
00:20:30.759 --> 00:20:36.720
right. And the reason why I'm
so enthusiastic about it is because, well,

269
00:20:36.720 --> 00:20:41.119
my sort of core focus has been
message-based systems for two years. Hard

270
00:20:41.240 --> 00:20:47.720
to troubleshoot. Yeah,
it's just, messages split across queues, and

271
00:20:47.759 --> 00:20:49.680
then you get out of order messages
and it's like, what's going on?

272
00:20:49.960 --> 00:20:53.839
Why am I seeing this fail?
And especially in a message based system,

273
00:20:55.480 --> 00:20:59.680
usually the problem is happening, like, far
upstream, right? Right. So

274
00:20:59.759 --> 00:21:03.279
how do you get to that?
Right? So? And how do you

275
00:21:03.319 --> 00:21:07.880
connect it even outside of the messages
that are being sent, because it has

276
00:21:07.880 --> 00:21:10.519
to maybe connect back to, like
you said, a click somewhere on a

277
00:21:10.640 --> 00:21:15.880
user interface. So being able to
have that full visibility across all of the

278
00:21:15.000 --> 00:21:18.799
subsystems is really, really powerful. Yeah, and again I'm still worrying this is a

279
00:21:18.839 --> 00:21:22.359
lot of code for me, right? But
you're telling me that when you're using

280
00:21:22.359 --> 00:21:26.200
a library that has protocol understanding,
it's going to insert a lot of that

281
00:21:26.200 --> 00:21:30.920
information automatically for us, so it
all ties together. Yes. Yeah. And

282
00:21:30.000 --> 00:21:36.559
because of the sort of nature of
how distributed tracing works, that information is

283
00:21:36.559 --> 00:21:41.799
going to be connected together through that
same trace ID that's basically being propagated. Does it

284
00:21:41.839 --> 00:21:45.359
have substantial overhead? Is
there any reason to only turn it on

285
00:21:45.400 --> 00:21:47.920
when you have a problem or can
you leave it on all the time?

286
00:21:48.519 --> 00:21:53.039
Okay, that's going to
be a long question. As you can

287
00:21:53.160 --> 00:22:00.480
you can go for the "it depends." It depends,
definitely. So yeah, it definitely depends.

288
00:22:00.640 --> 00:22:04.839
But so usually what I tend to
say is make sure that what you're

289
00:22:04.920 --> 00:22:11.920
collecting is useful, right? Start there, because if you just turn, I don't

290
00:22:11.920 --> 00:22:15.640
know, every instrumentation library on the
planet on, and log all the things

291
00:22:15.720 --> 00:22:19.039
and capture all of the traces and
all of that, you're going to have

292
00:22:19.079 --> 00:22:23.200
to sift through all of that
information to be able to understand like what's

293
00:22:23.240 --> 00:22:27.039
going on. Right, So it's
also not a thing of oh, look

294
00:22:27.079 --> 00:22:30.759
at all of these instrumentation libraries and
then turning everything on, because you're going

295
00:22:30.799 --> 00:22:34.960
to be incredibly overwhelmed, to the
point that as a developer you might feel

296
00:22:36.000 --> 00:22:38.200
like, okay, this is not
useful. Yeah, let's just turn it

297
00:22:38.240 --> 00:22:41.640
all back off. I mean,
I've also had the problem where I've said,

298
00:22:41.680 --> 00:22:44.319
Okay, I'm not going to measure
this thing, and then I never

299
00:22:44.400 --> 00:22:47.279
get data for that thing, Like
it turns out I'm looking

300
00:22:47.359 --> 00:22:51.720
in the wrong place. Like that's
not a number that moves. So some

301
00:22:51.759 --> 00:22:56.039
of these telemetry products that are out
there have ways that they can work on

302
00:22:56.079 --> 00:23:03.160
a background thread or they can attach
as a sidecar, you know. So

303
00:23:03.839 --> 00:23:07.960
do you have those kinds of things
where you can sort of stay out of

304
00:23:07.960 --> 00:23:11.160
the way so if there is something
that takes up some more time, it

305
00:23:11.160 --> 00:23:15.079
can happen on a background thread.
Yeah. So that's where the open telemetry

306
00:23:15.119 --> 00:23:18.799
project is also really interesting because if
let's say that you look at the basic

307
00:23:18.880 --> 00:23:22.920
samples that are out there for dot
net right, what you're going to see

308
00:23:22.960 --> 00:23:26.000
there is that you can basically,
in a service, enable open telemetry, add an

309
00:23:26.000 --> 00:23:30.200
exporter, which means that you're collecting
let's say, for the sake of the

310
00:23:30.240 --> 00:23:36.519
example traces and sending them directly to
an observability back end. Basically what that

311
00:23:36.519 --> 00:23:41.279
could be Azure Application Insights or
Jaeger or Honeycomb, whatever it is.

312
00:23:41.079 --> 00:23:47.519
So, but the thing is
that there are many problems with that. First

313
00:23:47.559 --> 00:23:49.880
of all, like you said,
there is overhead for that service because it

314
00:23:49.920 --> 00:23:55.400
has to collect all of that information. There might even be some processing behind

315
00:23:55.440 --> 00:24:00.759
the scenes happening as well. We
had a service bus, well, well

316
00:24:00.880 --> 00:24:03.319
that's, we'll get to that. We
will get to that, yeah, and

317
00:24:03.359 --> 00:24:07.680
then you have to export it.
But then imagine that the observability back end

318
00:24:07.799 --> 00:24:12.400
is not available for a few seconds
because it's you know, it's the network.

319
00:24:12.599 --> 00:24:21.559
It's the network. Yeah right,
so well no, that's usually that's

320
00:24:21.680 --> 00:24:26.960
handled by the libraries themselves, but
it is adding that pressure to the services

321
00:24:27.400 --> 00:24:32.359
telemetry, right, yeah, you
don't want to have that disconnected information and

322
00:24:32.400 --> 00:24:34.599
then looking at half of the story, right, But there are ways to

323
00:24:34.640 --> 00:24:38.079
solve this, and that's where the
open telemetry collector comes in. And then

324
00:24:38.079 --> 00:24:42.200
you have multiple deployment options on how
to run that. The first one is,

325
00:24:42.240 --> 00:24:45.039
like you said, a sidecar,
So basically you're going to have a

326
00:24:45.079 --> 00:24:51.759
sidecar for each service that you're instrumenting, and then immediately you're offloading all of

327
00:24:51.799 --> 00:24:55.960
the telemetry. Well, you're collecting
it and sending it through to the sidecar.

328
00:24:56.079 --> 00:24:59.799
But then there's any processing that needs
to be done, like redacting information,

329
00:25:00.039 --> 00:25:04.839
because remember, sensitive information, you don't
want that in all of that telemetry

330
00:25:04.839 --> 00:25:10.079
you're collecting. So that's just down
the road. We can be fined very

331
00:25:10.079 --> 00:25:14.880
easily. I call that digital white
out. So then you have all of

332
00:25:14.880 --> 00:25:18.599
that processing and then you could export
it then to the observability back end and

333
00:25:18.640 --> 00:25:22.839
you'd be able to handle all those
communication issues and all of that in the

334
00:25:22.920 --> 00:25:27.599
sidecar and your service is not
affected. Now that's one option, but

335
00:25:27.759 --> 00:25:33.599
you could also set up the open
telemetry collector as a gateway,

336
00:25:33.640 --> 00:25:37.599
so it's a central component, and all
of the services can basically send their telemetry

337
00:25:37.640 --> 00:25:42.640
information to that central component, which
then will take care of processing all of

338
00:25:42.680 --> 00:25:48.519
that information on to the back end.
Yeah, and it could batch that information

339
00:25:48.720 --> 00:25:52.319
and it's a pretty powerful thing,
and you could even, like, go crazy, or,

340
00:25:52.400 --> 00:25:56.319
if you need it right, but
you could have a sort of hybrid model

341
00:25:56.359 --> 00:26:00.480
in which you have a sidecar per
service, which is then sending its information to the

342
00:26:00.480 --> 00:26:06.119
central collectors. Is there anything special
in the storage on that? Is it just

343
00:26:06.160 --> 00:26:10.799
blobs or text files or are they
actually using a database on that? Well?

344
00:26:10.839 --> 00:26:15.599
I think it depends on the signal, sure, because usually metrics go

345
00:26:15.799 --> 00:26:21.960
to time series databases and then logs. Honestly, I don't know. It's

346
00:26:22.160 --> 00:26:26.000
a good question, depends but it
depends. Yeah. But well, actually

347
00:26:26.039 --> 00:26:32.519
about those time series databases. That's
a sort of interesting topic on its own

348
00:26:32.559 --> 00:26:36.920
because at the beginning I was talking
about telemetry correlation, right, and

349
00:26:37.079 --> 00:26:42.079
basically adding the trace ID to the
metric, that's called an exemplar in open

350
00:26:42.079 --> 00:26:48.319
telemetry naming, so that you
would be able to connect that together,

351
00:26:48.400 --> 00:26:49.799
so you'd see a spike in the
metric and see, oh, that's caused

352
00:26:49.839 --> 00:26:55.440
by those traces. Right now,
the thing is that you have to be

353
00:26:55.480 --> 00:27:00.920
aware of what's known as cardinality explosion. So I'll try to... When a bomb

354
00:27:02.000 --> 00:27:06.200
goes off in the Vatican? Dude, did I say that? I'm

355
00:27:06.240 --> 00:27:12.839
sorry, those are cardinal explosions,
not the same. Cardinality explosion. Cardinality explosion.

356
00:27:14.480 --> 00:27:17.720
I'll try to explain that, but
usually I do this visually, so

357
00:27:17.839 --> 00:27:21.720
okay, give me a bit
to try. But think of an exemplar as

358
00:27:21.759 --> 00:27:25.440
basically a label, right that you're
adding to a metric, and that's going

359
00:27:25.480 --> 00:27:30.359
to give you some insights on the
context in which that metric is being collected.

360
00:27:30.799 --> 00:27:34.559
So let's say that I have a
metric called failure rate, because I

361
00:27:34.599 --> 00:27:40.640
want insight into that and to have
a little bit more background information. I

362
00:27:40.680 --> 00:27:44.240
want to know in which environment that
happened, well, development, test,

363
00:27:44.480 --> 00:27:48.240
production, whatever it is. And
I also want to know which HTTP status

364
00:27:48.279 --> 00:27:52.440
code came back. Now that HTTP
status code in a production environment is going

365
00:27:52.480 --> 00:27:57.200
to have like for the sake of
the example, thirty possible values. And

366
00:27:57.319 --> 00:28:03.839
we have three different environments. So
that's three possible values for that environment label.

367
00:28:03.799 --> 00:28:12.480
Now the cardinality is basically all of
the possible combinations of those values of

368
00:28:12.720 --> 00:28:18.759
every label that you add that is
a cardinality. So it's a multiplier exactly.

369
00:28:18.799 --> 00:28:23.839
So we're fine with the environment and
then having the HTTP status code, and then

370
00:28:23.839 --> 00:28:29.160
I add customer ID. Oh boy, so why'd you do that?

371
00:28:29.200 --> 00:28:32.400
We were at like one hundred combinations, and
then you threw in twenty thousand customers and

372
00:28:32.519 --> 00:28:36.319
ruined everything. Or more. Yeah, a
hundred thousand customers, a million customers. Nobody

373
00:28:36.319 --> 00:28:41.960
would ever do that. Ouch. And that's
how we then basically get cardinality explosion because

374
00:28:41.960 --> 00:28:45.160
what happens in the time series database
is that every time you have a sort

375
00:28:45.160 --> 00:28:51.079
of unique combination, a new series is
created, so your cost goes up, right,

376
00:28:51.200 --> 00:28:55.599
and it becomes really hard to query
that information. So it's also important to

377
00:28:55.960 --> 00:28:59.960
one field, which just one field. If you just don't do that,

378
00:29:00.400 --> 00:29:04.319
there won't be this problem. Yeah, well it seems like a good idea

379
00:29:04.319 --> 00:29:10.759
when you do it right until cardinalities
exactly. So it's also important to be

380
00:29:10.839 --> 00:29:14.480
aware of, you know, what
observability back end are you using and how

381
00:29:14.519 --> 00:29:18.160
does that work, because there are
some tools out there that do support high

382
00:29:18.160 --> 00:29:21.960
cardinality. So it's just something that
you have to be aware of. Yeah,
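
NOTE
Editor's sketch, not part of the spoken conversation. It works through the
cardinality arithmetic described above using the rough numbers from the
discussion (about 30 HTTP status codes, 3 environments, 20,000 customers).
The function and label names are hypothetical and are not OpenTelemetry APIs.
```python
from math import prod
def series_count(label_values):
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(label_values.values())
# Rough numbers from the conversation: ~30 status codes and 3 environments.
base = {"http_status_code": 30, "environment": 3}
# Adding a customer ID label multiplies the total by the number of customers.
with_customer = {**base, "customer_id": 20_000}
print(series_count(base))           # 30 * 3
print(series_count(with_customer))  # 30 * 3 * 20,000
```
This is only an upper bound: a back end creates a series per combination
that is actually observed, which is why one high-cardinality label dominates.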

383
00:29:22.279 --> 00:29:25.200
and so you can tolerate if you
really come to the resolution you have

384
00:29:25.240 --> 00:29:29.359
to do that one way or the
other. I mean, customer ID doesn't seem

385
00:29:29.400 --> 00:29:33.920
that crazy because it is useful if
you've got a customer on a phone to

386
00:29:33.000 --> 00:29:37.799
say, hey, I could pull
all the transactions, all of those streams

387
00:29:37.799 --> 00:29:41.279
for all of that customer and sort
of look at where they were having problems.

388
00:29:41.240 --> 00:29:47.440
Yep. That's like full production debuggability, right? Yeah, without doubt.

389
00:29:48.400 --> 00:29:49.759
And with that, I've got to
interrupt for one moment for this very

390
00:29:49.839 --> 00:29:57.799
important message. And we're back. It's dot net Rocks. I'm Richard

391
00:29:57.799 --> 00:30:02.079
Campbell, that's Carl Franklin. Hey, hey, talking to our friend Laila

392
00:30:02.200 --> 00:30:04.880
about open telemetry. Hey hey,
and we've kind of gotten to that place

393
00:30:06.000 --> 00:30:11.400
now, right? Like,
how do we visualize this? Because you're

394
00:30:11.400 --> 00:30:15.240
probably getting a lot of information.
Like, is there tooling that comes

395
00:30:15.240 --> 00:30:18.720
with it, or do I have to
write my own? What are the dashboards?

396
00:30:18.799 --> 00:30:23.160
Well, that's hopefully where you choose
a vendor, right, and then you

397
00:30:23.200 --> 00:30:27.359
know, with the abilities of open
telemetry, by standardizing all of that information,

398
00:30:27.400 --> 00:30:30.880
I usually just say, like,
try a bunch out. You see

399
00:30:30.880 --> 00:30:37.000
what are the requirements for choices?
I imagine you can just hook these up

400
00:30:37.000 --> 00:30:41.119
to standard graph controls and things like
that from your various vendors. If you

401
00:30:41.160 --> 00:30:45.920
want a dashboard, yeah, yeah, yeah, there's just so many options,

402
00:30:45.960 --> 00:30:52.519
and I feel like each of them
has their own strength. But yeah,

403
00:30:52.559 --> 00:30:55.599
for example, if you're
running in the Azure stack and

404
00:30:55.680 --> 00:30:59.359
you're already using Application Insights, it's
a thing, you know, it makes

405
00:30:59.359 --> 00:31:03.880
sense to continue using that as
well. Yeah. But then, for example,

406
00:31:03.359 --> 00:31:07.519
I've played around with Honeycomb, and
I think that they're especially like the

407
00:31:07.680 --> 00:31:12.200
collaboration that they've built into the tool
is really cool. Is that an AWS product?

408
00:31:14.400 --> 00:31:17.200
No, no, it's its own
company, Honeycomb. Okay, big

409
00:31:17.200 --> 00:31:21.240
big taste and a big, big bite. Nice. That's the breakfast cereal.

410
00:31:21.319 --> 00:31:23.279
But okay, I was a kid
of the you know what I'm thinking.

411
00:31:23.319 --> 00:31:27.880
I'm thinking of that there is a
Honey-something product in AWS, a no-code

412
00:31:27.960 --> 00:31:32.319
product, yeah, okay,
different one, but yeah, Honeycomb is

413
00:31:32.359 --> 00:31:37.960
an instrumentation library. Yeah. So, and they're very invested in open telemetry

414
00:31:37.000 --> 00:31:41.839
as well, and I think it's
one of those tools that supports high cardinality.

415
00:31:41.880 --> 00:31:44.480
They talk about it a lot as
well, right, yeah, but

416
00:31:44.640 --> 00:31:48.000
yeah, what stood out to
me was really the sort of collaborative

417
00:31:48.359 --> 00:31:52.279
features that they had, because
usually if you're looking at a huge problem,

418
00:31:52.599 --> 00:31:55.000
you're not doing that by yourself,
right, yeah, you know,

419
00:31:55.079 --> 00:32:00.160
and if it's like, if you
have a twenty-four seven team and one

420
00:32:00.240 --> 00:32:01.920
ends their shift, you want to
be able to hand over where you left

421
00:32:01.960 --> 00:32:07.240
off things like that. Yeah,
I'm wondering about So who is this Typically

422
00:32:07.279 --> 00:32:08.920
the systems that are going to get
these packages in the first place,

423
00:32:08.920 --> 00:32:14.000
that's where the errors first
show up, where the problems appear,

424
00:32:14.079 --> 00:32:16.160
and then they might be passing it
to development saying hey, we're looking at

425
00:32:16.160 --> 00:32:22.519
this and we think it's it's this
kind of problem. Well that's where I'm

426
00:32:22.559 --> 00:32:29.640
also expecting some evolution because yes,
usually now it would be a different team

427
00:32:29.680 --> 00:32:32.880
when they would be looking at something
that looks funny, yeah right, and

428
00:32:32.920 --> 00:32:38.119
then get some understanding, hopefully some
actionable insights, right, and then being

429
00:32:38.160 --> 00:32:44.200
able to bring that back to the
development team. But it's it's really interesting

430
00:32:44.240 --> 00:32:49.440
to me in the sense that we're
the developers, We are the ones going

431
00:32:49.559 --> 00:32:55.839
to be writing that application specific telemetry. So it's it's also really important to

432
00:32:55.880 --> 00:33:02.640
get like organization wide alignment as well
on the type of telemetry that you're going

433
00:33:02.640 --> 00:33:07.039
to be collecting. So this seems
like there's an infinite number of decisions to

434
00:33:07.079 --> 00:33:10.920
make here, right, I mean
just by using open telemetry, that's

435
00:33:10.960 --> 00:33:15.599
just one step of many. Yeah, what are we going to be looking

436
00:33:15.640 --> 00:33:17.759
at, what's the granularity of it, how are we going to query it,

437
00:33:19.039 --> 00:33:22.000
how are we going to look at
it visually? Like, these are

438
00:33:22.039 --> 00:33:24.559
all things that aren't just in the
box. You have to think them through.

439
00:33:24.839 --> 00:33:30.359
Yeah. Well, usually what I
try to give, or try to

440
00:33:30.400 --> 00:33:35.920
advise as well, is to
sort of document guidelines, like what are

441
00:33:35.960 --> 00:33:38.759
you looking for with your telemetry?
What are the problems that you're trying to

442
00:33:38.799 --> 00:33:45.240
solve and like come up with a
specific set of questions that you could as

443
00:33:45.240 --> 00:33:51.680
a developer when you write a feature, ask yourself so that you could add

444
00:33:51.720 --> 00:33:53.920
the telemetry that is going to answer
those questions. Right, I mean,

445
00:33:53.960 --> 00:33:57.759
what if you don't know what you
don't know? What if you don't know

446
00:33:57.799 --> 00:34:00.480
what you want to look for as
a guidance? Are right to say?

447
00:34:00.400 --> 00:34:05.079
Yeah? Some of this sounds like
business related decisions, like we sell widgets,

448
00:34:05.119 --> 00:34:07.880
and I want to know when a
sale fails because of technology rather than

449
00:34:07.920 --> 00:34:12.280
the customer didn't want to buy it, right, yeah, definitely, or

450
00:34:12.360 --> 00:34:16.280
simple things like this page took too
long to render or whatever. Yeah.

451
00:34:15.639 --> 00:34:21.079
Yeah. And then from the sort
of failure perspective, I usually try to

452
00:34:21.599 --> 00:34:24.599
look at the code and think to
myself, if something were failing here,

453
00:34:24.800 --> 00:34:29.159
what would I be... What would I
be looking at if I were debugging this,

454
00:34:29.360 --> 00:34:32.920
like, what what state would be
interesting to me? What variables would

455
00:34:32.960 --> 00:34:37.679
I be looking at? Have I
captured the input enough? Have I captured

456
00:34:37.719 --> 00:34:42.760
what's going out to be able to
understand and be able to then you know,

457
00:34:42.840 --> 00:34:45.320
go back and try to understand what
happened from the outside. Right,

458
00:34:45.440 --> 00:34:49.280
there's an easy solution to all of this:
you just put all your code in a

459
00:34:49.320 --> 00:34:53.760
try with an empty catch. No
problem. Is that On Error Resume Next,

460
00:34:53.920 --> 00:34:59.159
sort of? Yeah, it's more like
slash day to turn off all the debuggs.

461
00:34:59.239 --> 00:35:01.039
I don't want to know. Just
keep getting don't tell me about these

462
00:35:01.079 --> 00:35:06.159
guys. I mean a lot of
this we've talked about very proactively, like

463
00:35:06.199 --> 00:35:09.559
we're going to detect the errors before
the customer does or before the customer complains.

464
00:35:09.639 --> 00:35:14.000
Yeah. I think there's another dynamic
where the customer is complaining and we're

465
00:35:14.039 --> 00:35:19.159
getting a ticket that's like what error. I think it'd be very challenging to

466
00:35:19.159 --> 00:35:22.559
say, you've got this ticket is
about this customer, it was roughly at

467
00:35:22.559 --> 00:35:24.280
this time, and now you want
to go dig through the logs to say,

468
00:35:24.320 --> 00:35:28.679
can we see what this person's complained
about. To do is to have

469
00:35:28.719 --> 00:35:31.679
the customer on the phone and enable
this sort of thing, but just for

470
00:35:31.719 --> 00:35:37.480
them, just for their customer ID
and say and then just watch it as

471
00:35:37.519 --> 00:35:39.679
they're going through the process where it
fails and now you've got something. But

472
00:35:40.440 --> 00:35:45.639
is that possible? Well, I
think so, but it sort of depends

473
00:35:45.679 --> 00:35:49.920
on what type of sampling strategies that
you're applying. And also, like you

474
00:35:50.000 --> 00:35:52.840
know the type of observability that you're
collecting, because let's say that if you

475
00:35:53.159 --> 00:35:59.800
want full insight, so basically any
error that occurs to any user, right,

476
00:36:00.360 --> 00:36:02.880
be able to say, oh,
you know, that was, since we're in

477
00:36:02.880 --> 00:36:08.119
Antwerp, that was Shalts, and it was
three pm on a Friday, and basically

478
00:36:08.119 --> 00:36:13.480
be able to find that request and
look at what were they doing, because

479
00:36:13.480 --> 00:36:16.840
the thing is that users when they
open tickets or whatever it is, the

480
00:36:16.840 --> 00:36:21.920
thing is that they weren't paying attention
to doing their usual thing, right.

481
00:36:22.000 --> 00:36:24.840
They didn't sit down and say, let's
cause an error, and I'm not thinking

482
00:36:24.880 --> 00:36:29.679
about every step that I did so
I could be able to explain it to

483
00:36:29.719 --> 00:36:32.320
you later. I usually get "it
doesn't work." Well, can you be more

484
00:36:32.360 --> 00:36:38.760
explicit? "It doesn't work." Right. So then being able to connect that

485
00:36:38.840 --> 00:36:43.559
information back and say, oh,
yeah, that was shuts right, and

486
00:36:43.599 --> 00:36:46.679
then find that, you know,
traces, logs, metrics, whatever was connected to that

487
00:36:46.800 --> 00:36:52.639
information. It's like being able to
debug in production. Really yeah, And

488
00:36:52.719 --> 00:36:55.719
then we think you just described like
turning up a lot of data there too,

489
00:36:57.280 --> 00:37:00.480
Like that's also you know, we
were also warning not to do that

490
00:37:00.519 --> 00:37:05.480
because you're getting buried in minutiae. Yeah, yep. That's where the sampling strategies

491
00:37:05.519 --> 00:37:10.000
come in. Sure, and there's different
ways to go about that. So if

492
00:37:10.079 --> 00:37:15.880
you think about traces specifically, you
basically can choose between head and tail sampling

493
00:37:16.119 --> 00:37:22.599
and with head sampling, there you're basically
going to decide whether to sample the trace

494
00:37:22.679 --> 00:37:25.639
at the beginning. So let's say
when the business transaction starts, right,

495
00:37:25.679 --> 00:37:29.760
they're going to immediately make the decision
of I'm keeping this or I just don't

496
00:37:29.800 --> 00:37:34.840
care about it. Right. Usually
that's the most unbiased type of sampling as

497
00:37:34.840 --> 00:37:40.039
well quick order. Right. The
thing is what if something fails, right,

498
00:37:40.559 --> 00:37:45.239
you don't know that upfront, What
if this request was super slow.

499
00:37:45.960 --> 00:37:50.159
It's not something that you can know
up front, so you could be losing

500
00:37:50.639 --> 00:37:53.880
a lot of insightful information. Sure, and that's where you get, you

501
00:37:53.920 --> 00:38:00.519
know, the tail based approach where
you're basically going to collect everything so the

502
00:38:00.719 --> 00:38:05.199
entire trace across all of the services
that it goes through, and at the

503
00:38:05.320 --> 00:38:08.239
end make the decision of is this
an interesting trace? Do I want to

504
00:38:08.320 --> 00:38:12.400
keep it? So? For example, does it carry some specific attributes that

505
00:38:12.440 --> 00:38:15.559
I care about? Or was it
slow? Yeah. And that begs the

506
00:38:15.639 --> 00:38:19.880
question can you turn these things on
and off in production without restarting. Right.

507
00:38:19.880 --> 00:38:23.320
So that's where it becomes important again
how you deployed this, right,

508
00:38:23.360 --> 00:38:28.639
because if you have a sort of
direct export, then it's really tricky, but

509
00:38:29.119 --> 00:38:31.159
if you had it deployed as
a sidecar, then it could be

510
00:38:31.239 --> 00:38:36.440
a thing of changing the configuration of
the sidecar. Like, tail-based says,

511
00:38:36.519 --> 00:38:39.440
I'm going to assess the finished transaction. Right, there's nothing special about this,

512
00:38:39.840 --> 00:38:43.800
throw it out right. Oh,
this one had an unusual value,

513
00:38:43.800 --> 00:38:45.679
it took too long, it generated
this error, and so forth. I'm going

514
00:38:45.679 --> 00:38:50.440
to keep this one. And so
that way you're sort of culling as you

515
00:38:50.519 --> 00:38:52.800
complete exactly. That's pretty cool.
Yeah, there's a cost that comes with

516
00:38:52.840 --> 00:38:57.599
that, sure, because basically you're
collecting everything, so you've got that overhead

517
00:38:57.679 --> 00:39:01.639
on the workload. Although hopefully asynchronous
to some degree. Actually, that asynchronicity brings

518
00:39:01.639 --> 00:39:05.000
up an interesting point: you wanted to
have the customer on the phone while you

519
00:39:05.000 --> 00:39:07.000
work them through it, like that
one scenario. How much latency do we

520
00:39:07.079 --> 00:39:10.639
have when we're doing all that processing
separately, Like how soon can we see

521
00:39:10.719 --> 00:39:15.159
data from when a transaction
happens? Right, that's again another

522
00:39:15.239 --> 00:39:20.239
concern that you have to basically be
focused though, right, like what is

523
00:39:21.559 --> 00:39:25.199
what type of latency are you willing
to accept? And then it becomes a

524
00:39:25.239 --> 00:39:29.400
thing of being very mindful of what
type of processing that you're doing there,

525
00:39:29.440 --> 00:39:32.119
because obviously all of that telemetry is
going to have to go through that pipeline

526
00:39:32.519 --> 00:39:37.400
of processors, so you have to
be very mindful of that. And then

527
00:39:37.960 --> 00:39:39.719
you have to get it out as
soon as you can so that you can

528
00:39:39.760 --> 00:39:44.519
see it as quickly as you can
and you can query as quickly as you

529
00:39:44.559 --> 00:39:49.599
can. But we're still talking seconds, aren't we really? Yeah, I

530
00:39:49.599 --> 00:39:54.320
hope even less milliseconds. That's been
my usual experience with asynchronous telemetry.

531
00:39:54.480 --> 00:40:00.000
It's only slightly behind it, but
it's not holding the transaction while it finishes,

532
00:40:00.239 --> 00:40:02.440
you know that. That to me, the big sin here is don't

533
00:40:02.519 --> 00:40:07.960
delay the customer. Get
the working transaction done. All telemetry can

534
00:40:07.000 --> 00:40:12.320
happen later, even though that later
is a few milliseconds, right, That's

535
00:40:12.360 --> 00:40:15.679
why you offload that to another
component again. Yeah, safer, safer

536
00:40:15.719 --> 00:40:19.840
to work that way anyway. What
you don't want is a transaction failed because

537
00:40:19.880 --> 00:40:23.440
you were measuring it exactly exactly that's
dumb, that's no no. Yeah,
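
NOTE
Editor's sketch, not part of the spoken conversation. It illustrates the
point above: the request path only enqueues telemetry, a background worker
does the export, and an export failure never fails the transaction.
BackgroundExporter and its methods are hypothetical names, not a real SDK
class, and a production version would bound the queue and batch its sends.
```python
import queue
import threading
class BackgroundExporter:
    """Buffers telemetry in memory and exports it on a worker thread."""
    def __init__(self, export):
        self._export = export
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()
    def record(self, item):
        # Called on the request path: enqueue only, never wait on the back end.
        self._queue.put(item)
    def _run(self):
        while True:
            item = self._queue.get()
            try:
                self._export(item)
            except Exception:
                pass  # A telemetry failure must never fail the business transaction.
            finally:
                self._queue.task_done()
    def flush(self):
        # Block until everything queued so far has been handed to the export function.
        self._queue.join()
sent = []
exporter = BackgroundExporter(sent.append)
exporter.record({"name": "checkout", "duration_ms": 42})
exporter.flush()
print(sent)
```
The same separation is what a sidecar or gateway collector provides at the
process level instead of the thread level.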

538
00:40:23.480 --> 00:40:30.039
so yeah, definitely quantum. I
just got it, just like that,

539
00:40:30.079 --> 00:40:32.159
boom, it all makes it just
makes sense, even though it still don't

540
00:40:32.199 --> 00:40:36.400
understand quantum. I'll leave that to
you and sip, okay, we'll work

541
00:40:36.440 --> 00:40:39.320
We'll keep working on that problem.
Yeah, that's definitely something to keep in

542
00:40:39.360 --> 00:40:44.880
mind is that, you know,
instrumentation is at the end of the day,

543
00:40:44.960 --> 00:40:47.559
nice to have. It's not a
mission critical component, so you have to

544
00:40:47.599 --> 00:40:52.719
be very mindful of how you
do that. So one of the biggest

545
00:40:52.719 --> 00:40:58.000
mistakes that customers or people using
open telemetry make when they first

546
00:40:58.000 --> 00:41:00.519
start? The biggest mistake? Well, I turned everything on. Of course

547
00:41:00.519 --> 00:41:05.119
you did. I do that too. I like all the knobs, turn

548
00:41:05.199 --> 00:41:10.039
on all the knobs and then yeah, another thing was just figuring out what

549
00:41:10.079 --> 00:41:16.320
do I really need? Right,
So it's really understanding what your requirement is,

550
00:41:16.840 --> 00:41:22.119
because if you're saying I want to
be able to debug any user request

551
00:41:22.159 --> 00:41:28.599
that happens in the system and have
full visibility into that as opposed to for

552
00:41:28.639 --> 00:41:35.480
example, I walked into a project
it's massive, there's zero documentation, and

553
00:41:35.519 --> 00:41:37.800
all of the people that worked on
it, they're gone, right, and

554
00:41:37.920 --> 00:41:42.679
there's no instrumentation. Yeah, at
that point, what you would like to

555
00:41:42.719 --> 00:41:46.599
have observability wise, it's just insight
into how does this thing work? Yeah?

556
00:41:46.679 --> 00:41:51.440
Just follow a transaction, right,
you know, have one end to

557
00:41:51.559 --> 00:41:54.239
end trace once and you'll have made
a lot of progress. Yes, Now,

558
00:41:54.280 --> 00:41:58.679
the amount of telemetry that you need
for those two things, it's completely

559
00:41:58.719 --> 00:42:02.559
the opposite, right, it's like
one percent versus one hundred. So it's

560
00:42:02.599 --> 00:42:07.199
definitely still for me as well a
learning experience. I mean, this is

561
00:42:07.239 --> 00:42:12.960
still pretty new and we're basically just
adjusting to see still how the project is

562
00:42:12.960 --> 00:42:16.519
evolving. Yeah, because I mean
many parts of the specification are stable by

563
00:42:16.559 --> 00:42:21.280
now, but a lot of things
are still evolving. Yeah. Another thing

564
00:42:21.360 --> 00:42:25.400
that keeps coming up is sort of
the three pillars of observability being you know,

565
00:42:25.440 --> 00:42:29.400
traces, metrics, and logs,
And I wonder, but what if

566
00:42:29.400 --> 00:42:32.400
a fourth one comes along that might
just happen. I can't think of a

567
00:42:32.440 --> 00:42:37.599
fourth one, right, neither find
us pretty well. Yeah, when when

568
00:42:37.679 --> 00:42:40.239
the only thing we had was logs, could we think of metrics and traces?

569
00:42:40.280 --> 00:42:44.239
I guess that's true. Yeah,
we started dreaming about them. It's

570
00:42:44.280 --> 00:42:46.400
like I'm trying to I've had that
experience of I have the log from this

571
00:42:46.480 --> 00:42:50.920
machine, the log from this machine, now try and line those entries up.

572
00:42:51.119 --> 00:42:54.639
Would quantum computing introduce a fourth pillar? I think it would introduce two to

573
00:42:54.639 --> 00:42:59.679
the sixteenth pillars, or two to
the two hundred and thirty-second pillars there.

574
00:43:00.239 --> 00:43:02.320
I also like this idea of knowing
it failed before the customer does have

575
00:43:02.440 --> 00:43:06.400
failed. You know, you get
what I'm hoping to, this tail loog

576
00:43:06.519 --> 00:43:09.559
thing of I see these numbers as
insufficient. In some way, it kicks

577
00:43:09.880 --> 00:43:13.559
up into a system where someone can
look at it and perhaps even call a

578
00:43:13.559 --> 00:43:16.960
customer and say, hey, we
noticed, can we help you with You

579
00:43:17.000 --> 00:43:21.199
know, it's not just you're waiting
for the for the people to complain or

580
00:43:21.239 --> 00:43:23.559
if it has to be on fire. I have never gotten a message or

581
00:43:23.559 --> 00:43:28.320
an email like that from a company
that I and if I if that happened

582
00:43:28.320 --> 00:43:31.599
to me, I would be like
really impressed. So I'm on a website,

583
00:43:31.639 --> 00:43:35.880
for example, and it screws up, and then I immediately get an

584
00:43:35.880 --> 00:43:37.880
email that says, hey, we
noticed you had this problem, didn't work.

585
00:43:37.960 --> 00:43:43.400
Here's a solution. Mm. Wow,
that would be amazing. The only time I ever

586
00:43:43.440 --> 00:43:45.599
see that is, like, a credit card
when I'm traveling, yeah where every so

587
00:43:45.639 --> 00:43:50.119
often I use the card and a
minute or so later the phone rings and

588
00:43:50.119 --> 00:43:53.079
it's hey, are you in Belgium? Right, yep, yep, I'm

589
00:43:53.079 --> 00:43:57.039
in Belgium. Okay, that's good. Then thanks, And then you're like,

590
00:43:57.239 --> 00:44:00.800
that's pretty cool. It turns out
that was Bob from New Jersey.

591
00:44:00.039 --> 00:44:06.440
Just just want to know that happens
to me all the time. I asked

592
00:44:06.480 --> 00:44:09.639
you about how service bus or you
know, service buses and NServiceBus plays

593
00:44:09.639 --> 00:44:14.000
into this. Obviously, this is
what you do for your job. How

594
00:44:14.039 --> 00:44:21.679
does using a messaging system figure into
open telemetry. Well, like I said

595
00:44:21.679 --> 00:44:25.719
earlier, right, um, building
a message based system is is really nice.

596
00:44:25.880 --> 00:44:29.760
Well, you know, if you
take into account the entire problem space.

597
00:44:30.280 --> 00:44:32.480
But one of the things you can't
get around, basically, is how

598
00:44:32.480 --> 00:44:37.400
hard it becomes to troubleshoot things in
that type of system. Now, with

599
00:44:37.559 --> 00:44:42.840
the platform that we're building at Particular, we don't only have NServiceBus

600
00:44:42.840 --> 00:44:45.440
as a middleware framework, but we
also have a bunch of tools. Now

601
00:44:45.880 --> 00:44:52.960
ServiceInsight specifically is basically already that
sort of black-box instrumentation. It's like

602
00:44:52.000 --> 00:44:57.159
you don't need to do anything,
you just need to configure that you want

603
00:44:58.239 --> 00:45:04.199
basically those platform tools to be enabled, and it already gives you insight into

604
00:45:04.360 --> 00:45:08.199
all of the messages that are being
sent around in the system, and you

605
00:45:08.199 --> 00:45:14.760
can see that in a production environment
and understand the exact flow and where that

606
00:45:14.840 --> 00:45:20.639
came from and which message led to
which message. So we've really been doing

607
00:45:20.679 --> 00:45:22.679
observability for years. Yeah, that's
the thing. Well, because of the

608
00:45:22.719 --> 00:45:28.280
asynchrony and out-of-orderness, like
you can't count on timestamps, you really

609
00:45:28.320 --> 00:45:32.039
need to have some kind of attribute
flag to be able to know what's related.

610
00:45:32.119 --> 00:45:35.920
Yeah, those are things that we
capture in the message headers, so we

611
00:45:35.960 --> 00:45:40.639
can basically know which message leads to
which other messages. So that's really

612
00:45:40.679 --> 00:45:45.159
cool. But the thing that we
could never solve, and that's the sort

613
00:45:45.159 --> 00:45:51.239
of gap that OpenTelemetry closes,
is having that visibility system-wide, so

614
00:45:51.360 --> 00:45:55.800
connecting back to, you know, a
front end or even a database, a web

615
00:45:55.800 --> 00:46:00.599
server that restarted in the middle of
something, those kinds of things. Like,

616
00:46:00.599 --> 00:46:02.559
that's where you want the metrics to
show. Hey, this machine got into

617
00:46:02.599 --> 00:46:07.840
crisis and the supervisor killed the process
and it recovered and continued, but it

618
00:46:08.199 --> 00:46:12.760
kicked off all this weirdness, right?
You're looking at it,

619
00:46:13.039 --> 00:46:15.639
you're looking at the traces and going, what the heck happened here?

620
00:46:15.840 --> 00:46:17.280
Right? Right? Is
the program broken? Is it just a bug?

621
00:46:17.320 --> 00:46:20.519
And it's like, no, this
is what recovery looks like. Yeah,

622
00:46:20.599 --> 00:46:23.599
yeah, exactly, on one of these
machines. Some of that is also built

623
00:46:23.639 --> 00:46:29.880
into the platform because we have insight
into what the failures are, how many failures

624
00:46:29.880 --> 00:46:35.199
are happening. So we also have
messaging-specific metrics already in there, and

625
00:46:35.199 --> 00:46:40.039
yeah, now we're working to
basically make sure that it also connects and

626
00:46:40.199 --> 00:46:46.280
feeds into the OpenTelemetry signals so
that if people are using it,

627
00:46:46.400 --> 00:46:51.000
they get all of that information in
there as well. Wow, that's pretty

628
00:46:51.239 --> 00:46:54.440
substantially cool, especially when you
start throwing cloud in here, where it's entirely

629
00:46:54.480 --> 00:46:59.800
possible the cloud vendor might move you
and it might have impact on your software,

630
00:47:00.159 --> 00:47:02.800
like stuff you literally don't have control
over. Like, hopefully your telemetry can

631
00:47:02.840 --> 00:47:06.400
surface that in a way where you're
like, oh, this wasn't us.

632
00:47:07.440 --> 00:47:12.159
It's that the vendor changed something for whatever
reason, and we should absorb it.

633
00:47:12.440 --> 00:47:15.440
But we don't know how yet because
we've never had this happen before and right

634
00:47:15.519 --> 00:47:16.639
now we have to look at it
and say, what would we do differently?

635
00:47:17.639 --> 00:47:22.880
Yeah, yeah, because we have
sort of recoverability built in, so retries

636
00:47:22.920 --> 00:47:25.800
and all of that happens out of
the box. You don't even have to

637
00:47:25.800 --> 00:47:30.360
configure them. But yeah, then
it's like, okay, why does it

638
00:47:30.360 --> 00:47:36.280
take ten times for this message to
be processed? Every time a message of

639
00:47:36.320 --> 00:47:38.719
that type comes in? Right,
So you don't want to hide that

640
00:47:38.800 --> 00:47:43.920
away. It could be that maybe
there's a database suffering underneath, right,

641
00:47:43.960 --> 00:47:45.559
it retries too quickly and it takes
that long for that thing to get up

642
00:47:45.559 --> 00:47:49.920
to speed. Yeah, yeah,
that's one of the side effects. Yeah,

643
00:47:50.039 --> 00:47:53.440
that's also why we have delayed retries. So we'll basically also have this

644
00:47:53.559 --> 00:47:59.239
sort of backoff mechanism. We'll
retry immediately. But if we see exactly

645
00:47:59.400 --> 00:48:01.760
that's too yep, yeah, it's
the same. It's the same concept.

646
00:48:01.840 --> 00:48:07.639
Yeah. Yeah. So what's next
for you? What are you doing next?

647
00:48:07.840 --> 00:48:12.440
What's in your inbox? What's in
my inbox? Well,

648
00:48:12.480 --> 00:48:15.960
we're at Techorama today. Then
I'm taking a few days with the family.

649
00:48:16.119 --> 00:48:20.360
It's a long weekend, is it? It's a long weekend. It's a holiday tomorrow.

650
00:48:20.400 --> 00:48:25.199
And then they're basically bridging to the
weekend and then on Sunday, I'm

651
00:48:25.280 --> 00:48:30.400
leaving for Oslo for NDC. Very
great. I will not be joining you

652
00:48:30.440 --> 00:48:34.159
there this year, so I heard. Yeah, not happy to hear that.

653
00:48:34.400 --> 00:48:37.639
Sorry, just schedules. Yeah.
We love Oslo, we love NDC.

654
00:48:37.800 --> 00:48:39.880
We're there all the time, usually. I'm gonna have to miss it

655
00:48:39.920 --> 00:48:43.639
this year. Yeah, I'm really
excited to go. It's my first time

656
00:48:43.679 --> 00:48:46.639
at Oslo's top sure to get into. Yeah, you know, and I

657
00:48:46.760 --> 00:48:51.280
always loved the way they did the
show on the floor in the but I

658
00:48:51.400 --> 00:48:57.760
can tell you this, Techorama's pretty
close to the Yeah, and I have

659
00:48:57.840 --> 00:49:01.559
it homies. Yeah, yeah,
all right, it's really cool. Everybody

660
00:49:01.559 --> 00:49:09.679
give it up for Laila Bougria and
we'll see you next time on dot net

661
00:49:09.800 --> 00:49:36.719
rocks. Dot net Rocks is brought
to you by Franklins.Net and produced by

662
00:49:36.800 --> 00:49:42.760
PWOP Studios, a full-service audio, video, and post-production facility located physically

663
00:49:42.800 --> 00:49:46.719
in New London, Connecticut, and
of course in the cloud online at pwop

664
00:49:47.000 --> 00:49:52.599
dot com. Visit our website at
dotnetrocks dot com for

665
00:49:52.800 --> 00:49:58.599
RSS feeds, downloads, mobile apps, comments, and access to the full

666
00:49:58.719 --> 00:50:02.280
archives going back to show number one, recorded in September two thousand and two.

667
00:50:02.800 --> 00:50:06.960
And make sure you check out our
sponsors. They keep us in business.

668
00:50:07.440 --> 00:50:10.159
Now go write some code. See
you next time. You got a

669
00:50:10.239 --> 00:50:12.400
dead middle band

