1
00:00:01,080 --> 00:00:05,679
How'd you like to listen to .NET
Rocks with no ads? Easy. Become

2
00:00:05,679 --> 00:00:09,839
a patron for just five dollars a
month. You get access to a private

3
00:00:10,000 --> 00:00:14,400
RSS feed where all the shows have
no ads. Twenty dollars a month will

4
00:00:14,439 --> 00:00:18,839
get you that and a special .NET
Rocks patron mug. Sign up now

5
00:00:18,839 --> 00:00:50,560
at patreon.dotnetrocks.com. Hi, it's all right. Holy

6
00:00:50,960 --> 00:00:55,560
crap, this is a great room. It's a great lot of echo.

7
00:00:55,880 --> 00:01:02,719
Yes, certainly filled this space.
He determined level of absolutely the blast radius.

8
00:01:03,240 --> 00:01:07,400
Right, we are back in Portugal. I love I love Porto,

9
00:01:07,519 --> 00:01:14,519
I love Portugal. I was in
the middle of town and I came across

10
00:01:14,560 --> 00:01:23,000
a restaurant and the sign said churrascaria, which means grill, right? Infante.

11
00:01:23,959 --> 00:01:27,599
I didn't know you guys ate babies
over here. So no, no,

12
00:01:27,599 --> 00:01:32,280
no, not funny, too long, too long, too soon?

13
00:01:33,799 --> 00:01:38,519
Infante. I guess Infante is the
name of the area. Yeah, but

14
00:01:38,599 --> 00:01:42,799
it looked like we eat our babies. Well done, nice on a skewer,

15
00:01:42,840 --> 00:01:44,560
I think it was W.
C. Fields who said it was all

16
00:01:44,560 --> 00:01:51,439
about the sauce. Yes, I
love children exactly. You got it.

17
00:01:53,280 --> 00:01:55,439
Oh, we're gonna have some fun today. We are going to have

18
00:01:55,519 --> 00:01:59,000
some fun. But first we have
this little thing called Better Know a Framework.

19
00:01:59,239 --> 00:02:12,000
Roll that crazy music. Well I've
been waiting for this one because it happened

20
00:02:12,000 --> 00:02:15,800
a while ago. But if you
don't know, I have a consultancy called

21
00:02:15,879 --> 00:02:22,639
App vNext, and we are the
shepherds of an open source project called Polly.

22
00:02:22,879 --> 00:02:25,919
Anybody use Polly? How about, uh, how about a clap of hands?

23
00:02:27,360 --> 00:02:31,240
Use Polly a lot? Well, Reverend
Billy just walked in the room.

24
00:02:31,280 --> 00:02:36,520
Well, Polly just
came out with a major update,

25
00:02:37,080 --> 00:02:39,520
version eight. And, believe
it or not, it started with the

26
00:02:39,560 --> 00:02:45,400
.NET team, because the .NET
team was basically looking at the source code

27
00:02:45,400 --> 00:02:49,639
and said, hey, we think
we can improve the performance and the resource

28
00:02:49,759 --> 00:02:54,360
usage of Polly, but it's going
to require some new interfaces and you know,

29
00:02:54,879 --> 00:03:00,240
almost a complete rewrite. And so
the rest of us said, yeah,

30
00:03:00,280 --> 00:03:04,400
I'm sorry, the .NET team
calls you, yeah, and says they

31
00:03:04,439 --> 00:03:07,199
want to make your project. Yeah, I mean what are you saying?

32
00:03:07,319 --> 00:03:14,800
Nah? So it took a lot
of meetings and a lot of understanding,

33
00:03:14,800 --> 00:03:19,400
and basically what they were able to
do is without you having to change any

34
00:03:19,439 --> 00:03:23,560
of your code that uses Polly,
you will get the benefits of the new

35
00:03:23,599 --> 00:03:29,520
performance and resource allocation that's under the
hood. But if you want to use

36
00:03:29,560 --> 00:03:34,439
the new models and the paradigms and
the interfaces, you you can do that.

37
00:03:34,680 --> 00:03:38,960
So Greenfield, you can go forward
with a new style. But just

38
00:03:39,000 --> 00:03:45,639
if you use Polly in place,
I'm not sure if it's completely compatible or

39
00:03:45,680 --> 00:03:49,759
you have to change some class to
some other class. But it's pretty much

40
00:03:50,000 --> 00:03:54,080
a simple fix. And
I have been emailing with Joel Hulen

41
00:03:54,080 --> 00:04:00,879
to schedule a Polly show. Yes,
we definitely will get cool. So that's

42
00:04:00,919 --> 00:04:02,879
what I got. Awesome. Who's talking to us, Richard? Grabbed a

43
00:04:02,879 --> 00:04:06,960
comment off of show eighteen sixty one
we did with Jeremy Miller back in the

44
00:04:06,960 --> 00:04:11,840
summer of '23, talking about
minimal architecture because Jeremy likes to cause trouble.

45
00:04:11,879 --> 00:04:15,720
Goodness knows. And this comment comes from
Trevor who says, I love this

46
00:04:15,759 --> 00:04:19,879
discussion, enjoyed the comments on microservices
versus monoliths, which is actually a reference

47
00:04:19,920 --> 00:04:25,120
to an earlier show we did in Porto. We've been following this trend of people

48
00:04:25,439 --> 00:04:30,040
sort of pushing back on microservices.
Yes, I got pushed heavily into microservices

49
00:04:30,040 --> 00:04:33,639
approaches with a product that we'd built
and re architected into microservices, and it

50
00:04:33,720 --> 00:04:40,079
was the worst mistake ever. Things
just became more complex, it was harder

51
00:04:40,120 --> 00:04:43,279
to maintain, it added a bunch
of latency and security issues, and the complexity

52
00:04:43,399 --> 00:04:46,879
was just not worth it. And
so I came up with a new acronym

53
00:04:46,480 --> 00:04:55,600
for appropriately sized service, or ASS. Nice.
I can relate. This is a good

54
00:04:55,639 --> 00:04:58,800
one. I one hundred percent believe
in services, separation of concerns, and

55
00:04:58,879 --> 00:05:02,399
clean architectures, but the approach must
be appropriate to the complexity of the solution and

56
00:05:02,480 --> 00:05:06,079
the size of the team. It
makes no sense to have one hundred separate

57
00:05:06,079 --> 00:05:10,199
services for a team of ten people. But then it also makes no sense

58
00:05:10,199 --> 00:05:12,639
to have a massive single deployment with
a code base and a team of

59
00:05:12,639 --> 00:05:15,399
two hundred and fifty people. The
services need to work with the cognitive load

60
00:05:15,639 --> 00:05:20,600
and be appropriate to the organization and
team structures. And I was loving the

61
00:05:20,600 --> 00:05:26,319
discussion on all of this except for
that one point. Stop making the CTO

62
00:05:26,399 --> 00:05:32,279
out to be the bad guy.
Love from Trevor CTO. That's great,

63
00:05:32,399 --> 00:05:36,879
Yep, that's fair. I definitely
think we need a whole show on ASS.

64
00:05:38,000 --> 00:05:40,920
I think so it seems up.
Yah, So, Trevor, thank you

65
00:05:40,959 --> 00:05:43,319
so much for your comment, and
a copy of Music to Code By is on its

66
00:05:43,360 --> 00:05:45,240
way to you. And if you'd
like a copy of Music to Code By,

67
00:05:45,240 --> 00:05:46,680
write a comment on the website
at dotnetrocks.com or on

68
00:05:46,680 --> 00:05:49,439
the Facebooks. We publish every
show there, and if you comment there

69
00:05:49,439 --> 00:05:51,639
and we read it on the show, we'll
send you a copy of Music to Code By. And

70
00:05:51,759 --> 00:05:56,399
you can also follow us on Twitter
if you want to. But the real

71
00:05:56,759 --> 00:06:00,439
cool kids are over on Mastodon. I'm @carlfranklin at techhub.social,

72
00:06:00,519 --> 00:06:04,480
and I'm @richcampbell at mastodon.social. Send
us a toot. We'll get around to reading it

73
00:06:05,920 --> 00:06:11,920
all, publish it and with that
let us introduce Charity Majors to the show.

74
00:06:12,079 --> 00:06:16,319
Charity is an ops engineer and CTO
at honeycomb.io. Before that,

75
00:06:16,480 --> 00:06:23,240
she worked at Parse, Facebook and
Linden Lab on operations and developer tools

76
00:06:23,279 --> 00:06:27,160
and always seemed to wind up running
the databases. That's because it's where all

77
00:06:27,160 --> 00:06:30,360
the problems were. Yeah, we
stand next to the database. You're going

78
00:06:30,399 --> 00:06:35,240
to be running it. Also co-author
of O'Reilly's Database Reliability Engineering and the

79
00:06:35,319 --> 00:06:41,879
newly released Observability Engineering. Charity loves
free speech, free software, and single

80
00:06:41,920 --> 00:06:50,439
malt Scotch. Let's have a
round of applause for Charity Majors! Okay,

81
00:06:50,759 --> 00:06:53,839
I guess we have to start at
the beginning, right at the beginning.

82
00:06:53,879 --> 00:06:58,199
What the heck is observability engineering?
That's a great question. It's like

83
00:06:58,199 --> 00:07:03,519
the engineering of Windows, yeah,
kind of. I mean observability comes from

84
00:07:03,560 --> 00:07:09,319
control theory, right, and it's
like, how well can you understand what's

85
00:07:09,360 --> 00:07:17,040
going on inside your systems just by
observing outputs? And yeah, exactly.

86
00:07:17,759 --> 00:07:23,959
And you know, I for years
was like really religious about trying to define

87
00:07:23,959 --> 00:07:26,560
it in a very specific way,
and I should have won, but I

88
00:07:26,639 --> 00:07:33,160
lost. So so I mean it's
come to it's come to just kind of

89
00:07:33,160 --> 00:07:40,399
be a generic sitting, which is
what it is. But when we were

90
00:07:40,399 --> 00:07:43,079
trying to figure out how to talk
about, what I think of is just

91
00:07:43,160 --> 00:07:47,079
kind of the next generation of telemetry. It's kind of distinguished from the last

92
00:07:47,120 --> 00:07:50,639
generation of telemetry, obviously, which
was very much focused around the metric,

93
00:07:50,959 --> 00:07:56,079
right, which is just a number with tags appended. It doesn't handle high

94
00:07:56,079 --> 00:08:01,079
cardinality, doesn't handle dimensionality, doesn't
handle it's super fast. Is that powerful?

95
00:08:01,160 --> 00:08:05,839
Now you dropped some OLAP terms in
there: cardinality, flexibility. Like, it's funny

96
00:08:05,879 --> 00:08:11,319
for a database person to drop
OLAP, but you're talking about just any

97
00:08:11,360 --> 00:08:15,839
way that you can really observe the
state, the internal state, not necessarily

98
00:08:15,879 --> 00:08:18,319
what it's doing on the outside.
It's about observing the internal state and being

99
00:08:18,399 --> 00:08:24,279
able to explore it right, not
having to decide in advance, here's the

100
00:08:24,319 --> 00:08:26,920
data I'm going to collect, because
here's the questions I'm going to need to

101
00:08:26,959 --> 00:08:30,560
answer. Here's my dashboard. You
know, it's about being able to go

102
00:08:30,959 --> 00:08:35,759
to combine your questions to ask because, like anything that you're trying to understand

103
00:08:35,840 --> 00:08:37,759
these days is going to be a
very complicated answer to cart. It's like,

104
00:08:39,320 --> 00:08:43,440
okay, these errors are spiking,
but only for users that are running

105
00:08:43,440 --> 00:08:46,440
this version of Android, with a
particular firmware, in this region, with this

106
00:08:46,559 --> 00:08:50,879
language pack. And each of those
is a high-cardinality dimension.

107
00:08:50,960 --> 00:08:54,240
And if you don't capture the data
in a way that preserves all that context,

108
00:08:54,720 --> 00:08:58,639
you can't ask these questions. Do
you have some examples of how observe

109
00:09:00,039 --> 00:09:05,799
observability has improved a project in particular? Sure, I mean I think of

110
00:09:05,840 --> 00:09:11,480
it as it's really it's kind of
where development meets operations, right, Like,

111
00:09:11,600 --> 00:09:16,279
I feel like big picture. You
know, in the beginning, there

112
00:09:16,320 --> 00:09:18,519
were engineers who wrote code and they
owned it in production, right, right,

113
00:09:18,720 --> 00:09:22,639
And then everything got super complicated and
we're like, ah, there's too

114
00:09:22,720 --> 00:09:24,320
much. So some of us are
going to write code and some of us

115
00:09:24,360 --> 00:09:28,600
are going to understand it. And
that was not about it. That was

116
00:09:28,639 --> 00:09:31,799
not a great idea, and so
like we're kind of like reunifying the streams

117
00:09:31,840 --> 00:09:35,759
now. I think every engineer should
be writing their code and owning it in

118
00:09:35,799 --> 00:09:43,000
production. Everyone who's a specialist in operations should
be also like opening the door and looking

119
00:09:43,080 --> 00:09:46,559
under the hood and understanding the code, right? Specialization is great,

120
00:09:46,080 --> 00:09:50,039
but ultimately, you know, our
systems have gotten so complex that you have

121
00:09:50,080 --> 00:09:52,559
to write it and understand. I
feel like you got to dig into that

122
00:09:52,799 --> 00:09:58,679
own it in production because it's not
like they're also going to be sysadmins exactly

123
00:09:58,080 --> 00:10:03,759
as they are responsible for the You're
responsible for your systems, right, you

124
00:10:03,759 --> 00:10:07,080
wrote it, you own it,
You unleashed this support upon the world.

125
00:10:07,039 --> 00:10:09,759
I mean, I feel like there
are these feedback loops in the heart of

126
00:10:09,840 --> 00:10:13,320
engineering. Some of them are like
code review, right. Some of them

127
00:10:13,960 --> 00:10:18,639
are like deploys. But like,
if you don't hook up the feedback loop,

128
00:10:18,840 --> 00:10:22,559
if you aren't being exposed to the
consequence of what you're doing, then

129
00:10:22,600 --> 00:10:24,840
like you're not you don't actually know
if your code is good or not.

130
00:10:24,120 --> 00:10:28,919
Well, I think there's a great
point there as a developer then that if

131
00:10:28,919 --> 00:10:31,399
my telemetry just tells me how many
times my code was hit, that doesn't

132
00:10:31,399 --> 00:10:35,639
necessarily give me anything to do.
And this is this is where I feel

133
00:10:35,679 --> 00:10:39,960
like operations folks have had a harder
time embracing observability in some ways than software

134
00:10:39,960 --> 00:10:43,759
engineers have, because with ops people,
it's like we learned how to debug,

135
00:10:43,799 --> 00:10:48,600
but it looked like this, I've
got a dashboard. Something's wrong, So

136
00:10:48,600 --> 00:10:52,919
I'm gonna start paging through dashboards and
looking for similar spikes, just like pattern

137
00:10:52,919 --> 00:10:54,879
matching with my eyeballs. Like, right, oh, it looks like it's Redis,

138
00:10:54,919 --> 00:10:58,120
you know, and you get it. It's great because you're like you

139
00:10:58,159 --> 00:11:01,879
get this hero journey where you just
jump to the answer and you understand what's going

140
00:11:01,879 --> 00:11:03,840
on because you're in this shit all
day every day. Nobody else does,

141
00:11:05,120 --> 00:11:07,120
like, whoa, how did you do that? You just reset the Redis

142
00:11:07,159 --> 00:11:11,799
service and the problems went away, right? But like, that's not debugging. Pattern matching

143
00:11:11,840 --> 00:11:16,159
with your eyeballs is not debugging.
Debugging looks like you take the step,

144
00:11:16,440 --> 00:11:20,519
you ask a question, you look
at the answer. Based on the answer,

145
00:11:20,559 --> 00:11:22,720
you take another step. It's like
following a trail of breadcrumbs.

146
00:11:22,759 --> 00:11:26,639
You don't know what the answer looks
like until you get there. Can we

147
00:11:26,679 --> 00:11:31,600
talk about some of the new modern
observability tools that we might think about using

148
00:11:31,639 --> 00:11:35,919
to replace the tools that we're currently
using. Yeah, I mean, I

149
00:11:35,919 --> 00:11:41,000
think big picture, it has to
be based It can't just be based on

150
00:11:41,039 --> 00:11:46,159
the metric because remember you've discarded all
that you're looking output exactly. It has

151
00:11:46,200 --> 00:11:50,519
to be based on arbitrarily wide
structured data blobs, which now look like

152
00:11:50,600 --> 00:11:56,200
spans, right? Those are just
like wide events structured which you can trace

153
00:11:56,240 --> 00:11:58,240
because there's been a number that's appended
to it. That's what you need in

154
00:11:58,279 --> 00:12:01,960
order to understand your telemetry in production. Because I can imagine at a peak

155
00:12:03,080 --> 00:12:05,320
load, like we think about a
metric that shows, you know, this

156
00:12:05,360 --> 00:12:09,519
is when we're posting the most number
of transactions. You're now really interested in

157
00:12:09,519 --> 00:12:13,960
the state of yes, we're we
queuing out yes, what's happening? Like

158
00:12:13,120 --> 00:12:16,799
metrics are great, but they're they're
limited, right, they're a snapshot.

159
00:12:18,080 --> 00:12:20,559
What you want to be is like, you know, okay, when this

160
00:12:20,679 --> 00:12:24,519
happened, what else happened? Right? What else is connected to it?

161
00:12:24,679 --> 00:12:26,279
You know? And like the old
generation of tools are ones where you find

162
00:12:26,799 --> 00:12:31,039
you're capturing this data another time.
For every single tool you're like, okay,

163
00:12:31,039 --> 00:12:35,559
here's my dashboards and the metrics,
here's my logs, here's my traces.

164
00:12:35,639 --> 00:12:37,000
So every time you're like I've got
a spike, I want to find

165
00:12:37,080 --> 00:12:41,720
the logs, there's nothing that connects
them. You're just eyeballing timestamps and hoping

166
00:12:41,720 --> 00:12:46,360
that they happen to match up.
And like, if you're finding the logs,

167
00:12:46,360 --> 00:12:48,080
and you want to jump to a trace, like, that's not actually good enough.

168
00:12:48,320 --> 00:12:54,200
You can derive all of those data
formats from these arbitrarily wide events, from spans.

169
00:12:54,240 --> 00:12:56,039
You can't go in the other direction. When you say spans, what

170
00:12:56,159 --> 00:13:01,840
exactly are you talking about? A span
is a one hop of the trace,

171
00:13:01,120 --> 00:13:05,840
okay, across all of them.
So should we be gathering spans? All

172
00:13:05,879 --> 00:13:11,679
of the gathering telemetry one event per
request, per service, all of the

173
00:13:11,799 --> 00:13:16,399
data should be aggregated into that one
arbitrarily wide event, so you have

174
00:13:16,519 --> 00:13:22,120
all that context, Like a really
mature instrumented service will have like two hundred

175
00:13:22,200 --> 00:13:26,879
three hundred dimensions per hop, and
that's that's magic because you're passing along all

176
00:13:26,919 --> 00:13:31,559
of the parameters, you're passing along
all of the IDs,

177
00:13:31,759 --> 00:13:35,919
you're passing along all of that context, which lets you after the fact come

178
00:13:35,000 --> 00:13:39,879
back and say, oh, this
thing and this service that happened was connected

179
00:13:39,919 --> 00:13:41,679
to that thing and that service that
that happened. Now, this is not

180
00:13:41,679 --> 00:13:46,840
necessarily a per transaction level, like
you're not just chasing a transaction. What's

181
00:13:46,960 --> 00:13:56,759
this It basically it's one span from
a time it's well typically, well this

182
00:13:56,000 --> 00:14:03,240
complicated. There are lots of ways
that you can define a span, but

183
00:14:03,399 --> 00:14:05,240
typically I like to think it about
if you want to have a span around

184
00:14:05,240 --> 00:14:09,399
something that's interesting. So like if
it's anytime that you're crossing the network,

185
00:14:09,559 --> 00:14:13,840
you want a span. Anytime you're
making a database request, you want a

186
00:14:13,879 --> 00:14:18,399
span because that's historically where problems happen. It's wherever you're crossing the right.

187
00:14:18,440 --> 00:14:20,960
So when you start it at a
user interface interaction and go from there and

188
00:14:22,000 --> 00:14:26,879
then you know, likewise we're back
end services that go on time like you

189
00:14:26,919 --> 00:14:30,879
can have you can have spans and
tracing in monoliths too, and it's super

190
00:14:31,120 --> 00:14:37,039
super useful there as well. But
it becomes indispensable once you have services.

191
00:14:39,000 --> 00:14:41,200
Yeah, because if you think about
it as a monolith, at least you

192
00:14:41,320 --> 00:14:46,399
have all that context and it persists
throughout the request. When you jump across

193
00:14:46,440 --> 00:14:50,120
the network from service to service,
you're deciding what state's going to come with.

194
00:14:50,200 --> 00:14:54,679
And so how do you do all
this without bringing the server CPU to

195
00:14:54,759 --> 00:14:58,799
its knees? Do you do this? Typically? The way we do it

196
00:14:58,799 --> 00:15:01,639
now is it attached to back around
threads and that kind of stuff, and

197
00:15:01,720 --> 00:15:03,720
you can lose you know, if
those threads hang, you can lose data.

198
00:15:05,120 --> 00:15:07,120
There are lots of ways to do
it. Obviously, I think that

199
00:15:07,200 --> 00:15:11,960
my service does it best. Your
Honeycomb is, well, you've got to

200
00:15:11,960 --> 00:15:16,279
tell us about Honeycomb. Well,
sure, you know I'm not I'm not

201
00:15:16,320 --> 00:15:20,200
really great at pitching, but I
will say that, like, you know,

202
00:15:20,960 --> 00:15:24,480
the idea of how observability should happen
is how we built our service,

203
00:15:24,679 --> 00:15:30,039
you know, down to like the
data store, like because like, well

204
00:15:30,039 --> 00:15:31,919
back... have any of
you ever built apps on Parse? The

205
00:15:33,000 --> 00:15:37,919
mobile backend as a service.
No, wow, I loved so much.

206
00:15:39,879 --> 00:15:45,039
It was it was like Firebase but
better and earlier. Any Facebook people

207
00:15:45,080 --> 00:15:48,960
here, Okay, cool, I
will have, I will have a grudge against
Mark Zuckerberg forever for what he
208
00:15:48,960 --> 00:15:54,120
Spark Soccer for forever for what he
did to Parse. Okay, they shut

209
00:15:54,120 --> 00:15:58,960
it down. It's like we got
acquired. Anytime you want to get acquired,

210
00:16:00,120 --> 00:16:04,679
or make sure that you have an
executive level sponsor who believes you were fond.

211
00:16:04,960 --> 00:16:11,320
Right. Not so we got shut
down even though we were still growing

212
00:16:11,320 --> 00:16:15,639
like gangbusters. Anyway, Parse had
had one hundred million apps by the time

213
00:16:15,799 --> 00:16:19,840
I left, and we had built
our service originally on Ruby on Rails,

214
00:16:21,000 --> 00:16:23,720
which was not a terrible decision because
most startups fail. It's usually not because

215
00:16:23,759 --> 00:16:27,879
it's and Rails had the strength of
you can move fast, you can move

216
00:16:29,720 --> 00:16:33,960
It's just a we believed everything.
Yeah, ok, say whatever you want.

217
00:16:33,039 --> 00:16:40,960
Doesn't that Ruby on rails. The
downside is it's got it doesn't have

218
00:16:41,000 --> 00:16:44,519
threads right, fixed pool of workers, right. And so that was fine.

219
00:16:44,639 --> 00:16:47,399
We had one hundred thousand apps,
but we got bigger and bigger,

220
00:16:47,399 --> 00:16:51,320
and instead of having one database and
back end, we now had thirty forty

221
00:16:51,399 --> 00:16:55,159
fifty. And when you've got that
many something slow at any given time,

222
00:16:55,519 --> 00:17:00,120
which means that the fixed pool is
going it's filling up constantly with threads that

223
00:17:00,159 --> 00:17:03,599
are waiting on that one back end
service got oh and like, as a

224
00:17:03,640 --> 00:17:10,559
reliability engineer, this was personally humiliating. You're going down every day just like

225
00:17:10,839 --> 00:17:15,400
an app hit the top ten on iTunes,
down goes Parse, again and again.

226
00:17:17,000 --> 00:17:21,920
And I tried everything to try and
figure this out. And what finally helped

227
00:17:21,960 --> 00:17:25,319
us was number one, we did
a rewrite to Golang. We actually

228
00:17:25,559 --> 00:17:30,599
considered, uh, using .NET, and it
got out voted. And I learned later

229
00:17:30,720 --> 00:17:34,160
that the blog post that I wrote
about why it got outvoted made

230
00:17:34,200 --> 00:17:38,039
a lot of people at Microsoft very
angry and changed a lot of their decisions,

231
00:17:38,079 --> 00:17:42,480
which is great. Yeah, inspiring
anger is really like, but I

232
00:17:42,480 --> 00:17:45,759
mean, I can I made a
career. I have a tough time of

233
00:17:45,119 --> 00:17:48,000
disagreeing with you on picking Golang,
too. When you think about a back

234
00:17:48,079 --> 00:17:52,920
end service at velocity like that,
language is very well suited for that.

235
00:17:53,119 --> 00:17:56,240
It was great, but it was
half of the half of the answer because

236
00:17:56,279 --> 00:18:00,640
we also had to understand what was
going on, not just write the code, understand it.

237
00:18:00,880 --> 00:18:06,519
This is where observability came into
play because Facebook also had this service called

238
00:18:06,519 --> 00:18:11,160
Scuba, which was, don't get
me wrong, butt-ugly, aggressively hostile to

239
00:18:11,279 --> 00:18:15,039
users, but it did one thing
really well, which is it let you

240
00:18:15,079 --> 00:18:21,839
slice and dice in near real time
on dimensions of high cardinality with wide events.

241
00:18:21,920 --> 00:18:25,759
Right high cardinality for those who don't
know, it's the number of unique

242
00:18:25,759 --> 00:18:27,440
items in the set. So if
you've got a collection of one hundred million

243
00:18:27,519 --> 00:18:33,680
users, any unique ID, like Social
Security number for the US folks, would be

244
00:18:33,079 --> 00:18:37,880
the highest possible cardinality. Something like
species equals human would be the lowest because

245
00:18:38,079 --> 00:18:42,079
there's only one, right? First name, last name: high
cardinality, but there's some dupes, so

246
00:18:42,119 --> 00:18:47,480
it's not as high as the Social Security
number. So everything around metrics is oriented

247
00:18:47,519 --> 00:18:52,920
around low cardinality dimensions, but everything
you want to use for debugging requires high

248
00:18:52,920 --> 00:18:59,039
cardinality. I see some people over here have run
into the influence before. So Scuba,

249
00:18:59,160 --> 00:19:02,519
let us slice and dice on these
high-cardinality dimensions, and instead of having to

250
00:19:02,599 --> 00:19:06,680
like, you know, obscur and
be like either I either I read it

251
00:19:06,680 --> 00:19:10,839
to dashboard for it or it's going
to be hours to like just dive through

252
00:19:10,880 --> 00:19:14,759
the logs and figure it out and
everything. It's like, instead it would

253
00:19:14,799 --> 00:19:18,119
be like, okay, we're getting
a spike in errors, let's break down

254
00:19:18,160 --> 00:19:21,839
by app, one in ten million app IDs. Break down by that. Okay,

255
00:19:21,960 --> 00:19:26,319
now break down by reads and writes.
Now break down by normalized database query.

256
00:19:26,359 --> 00:19:36,559
Now are you making cube gestures or
gestures? Call them after Colum.

257
00:19:36,599 --> 00:19:40,200
It was just like, step by
step it would take me to It's like

258
00:19:40,240 --> 00:19:45,119
it isn't even engineering anymore. It's
like support. Right, These problems went

259
00:19:45,119 --> 00:19:48,640
from being like intractable, like it
would be I'm doing mean, from like

260
00:19:48,680 --> 00:19:51,200
it would take us a day to
figure out and then it would never happen

261
00:19:51,240 --> 00:19:56,039
again, to just being like,
you know, thirty seconds, like every

262
00:19:56,119 --> 00:20:00,279
single time. And that was what
like when I was leaving Facebook. You

263
00:20:00,319 --> 00:20:02,960
know, I've never been one of
those kids who's like, I want to

264
00:20:03,000 --> 00:20:07,759
start a company because I kind of
hate those people. But when I was

265
00:20:07,799 --> 00:20:11,960
thinking about having to live without this
tooling, I was like, I can't.

266
00:20:11,039 --> 00:20:14,880
I actually can't conceive of it,
Like, it's become so core to how

267
00:20:14,920 --> 00:20:18,680
I how I perceive the world as
an engineer, Like I just can't imagine

268
00:20:18,680 --> 00:20:22,599
going back. And so that's why
I made Honeycomb. So how did

269
00:20:22,640 --> 00:20:29,000
the rest of the people in the
organization react to this new culture of

270
00:20:29,000 --> 00:20:36,240
observability and spans? There's a learning
curve, right, We've all spent our

271
00:20:36,279 --> 00:20:40,160
careers fitting our brain into asking questions
in the metrics-and-dashboards type of way.

272
00:20:40,680 --> 00:20:45,440
But like you know how every job
I've ever had, the person who's

273
00:20:45,440 --> 00:20:48,640
best at debugging is always the
person who's been there the longest. Always.

274
00:20:48,920 --> 00:20:53,559
That's no longer true when we have
different tools because instead of relying so

275
00:20:53,680 --> 00:20:57,720
much on what's in your head to
reason about system, it's right in front

276
00:20:57,759 --> 00:21:02,119
of you and you're just asking questions
and it's and it's more like the more

277
00:21:02,240 --> 00:21:06,400
curious you are, the more debugging
you do, the better you get.

278
00:21:06,519 --> 00:21:08,160
You don't have to you don't have
to have the whole system in your head.

279
00:21:08,640 --> 00:21:12,400
You can find the answer more quickly
that way, and it's kind of

280
00:21:12,440 --> 00:21:17,359
a beautiful thing. Yeah, process
of discovery too, right, find exceptions.

281
00:21:17,440 --> 00:21:21,200
That's why, like observability is not
just about yes, you're gonna have

282
00:21:21,200 --> 00:21:22,400
to have a columnar store. Yes, you're gonna have to have all these

283
00:21:22,400 --> 00:21:26,599
things in the back end and make
it fast. Because the other thing about

284
00:21:26,680 --> 00:21:29,200
logging tools is like if you want
to ask something interesting, it's like you

285
00:21:29,319 --> 00:21:32,000
enter the tool, you know,
the and then you're like, Okay,

286
00:21:32,000 --> 00:21:34,079
I'm gonna take thirty minutes and go
out for coffee because it's gonna like it

287
00:21:34,119 --> 00:21:37,519
has to be fast, it has
to be interactive. It has to be

288
00:21:37,599 --> 00:21:40,759
like under a second, because you're like, you're taking steps and you have to

289
00:21:40,799 --> 00:21:44,680
stay in the zone, right you're
yeah, exactly. It has to be

290
00:21:44,720 --> 00:21:47,759
explorable, it has to be interactive, and it has to let you,

291
00:21:48,039 --> 00:21:51,920
I think most importantly, draw on
the on the brains of the people around

292
00:21:51,960 --> 00:21:55,920
you. So something we built into
Honeycomb is history. You know,

293
00:21:55,920 --> 00:22:00,119
how you're debugging and it's like,
oh, I've lost the thread, so

294
00:22:02,160 --> 00:22:03,440
you can just go that, you
scroll back up that's where I knew I

295
00:22:03,440 --> 00:22:06,880
had it right, and you branch
out and you try something else. But

296
00:22:06,920 --> 00:22:11,400
then also you have access to the
history of everyone on your team. So

297
00:22:11,599 --> 00:22:15,400
if it's like last Thanksgiving, we
had this terrible my squel outage, you

298
00:22:15,440 --> 00:22:18,440
know, and everything was uh and
Ben and Emily were on call, say,

299
00:22:18,759 --> 00:22:22,160
and then I'm on call in March
and I'm like, this feels a

300
00:22:22,200 --> 00:22:26,759
lot like what was happening last last
November. I'm gonna go and look at

301
00:22:26,759 --> 00:22:30,000
like, well, what were Ben
and Emily doing and what did they say?

302
00:22:30,039 --> 00:22:36,559
Help them find out what trace are
So journaling, yeah, actions systems.

303
00:22:36,799 --> 00:22:40,920
History doesn't repeat, but it rhymes,
right? So much. So much of

304
00:22:40,960 --> 00:22:45,599
the wisdom of your like these are
sociotechnical systems. It's not just production.

305
00:22:47,000 --> 00:22:49,440
Like an example I often use is this: you've got the New York Times

306
00:22:49,440 --> 00:22:53,880
and the Washington Post. They're like
both big newspapers, right, but if

307
00:22:53,920 --> 00:22:59,400
you took their teams and swapped them, you couldn't actually do that because so

308
00:22:59,480 --> 00:23:03,599
much of this, this different system lives
in the heads of the people who write

309
00:23:03,640 --> 00:23:07,119
it. So like being able to
draw on that wisdom and use it,

310
00:23:07,200 --> 00:23:11,559
like it makes you a better engineer. Like all of the shit that I

311
00:23:11,640 --> 00:23:15,519
learned about being an engineer was looking
over the shoulder of amazing engineers that

312
00:23:15,559 --> 00:23:18,039
I got. But it sounds like
the journaling approach you're talking about allows us

313
00:23:18,079 --> 00:23:23,759
to look over the best exactly and
and and get to know you don't have

314
00:23:23,799 --> 00:23:29,839
to remember how they did it because
it's recorded there, and so you know

315
00:23:30,039 --> 00:23:33,880
you can even learn their approach and
how they attacked that I feel like,

316
00:23:33,039 --> 00:23:37,160
you know, especially now nowadays,
when we're doing so much distributed working,

317
00:23:37,279 --> 00:23:40,759
like remote working, I worry a
lot about how are we going to bring

318
00:23:40,839 --> 00:23:44,720
up the next generation of engineers,
you know, and I feel I hope

319
00:23:44,720 --> 00:23:47,759
that we're all starting to think about
making this more of our tooling. Just

320
00:23:48,000 --> 00:23:51,039
how can we learn from you know, it's kind of embarrassing, but like

321
00:23:51,240 --> 00:23:53,799
when I was in college, I
learned so much from just going around and

322
00:23:53,960 --> 00:24:02,200
reading the Bash histories of all the
people I knew, from either trying the

323
00:24:02,200 --> 00:24:07,799
commands. You know, it's fucking
fascinating. Right, Oh, that's how

324
00:24:07,839 --> 00:24:11,519
you have to learn, said not
right, what does this do? I

325
00:24:11,519 --> 00:24:12,720
think we need a lot more of
that in our tools. Yeah, and

326
00:24:14,119 --> 00:24:18,200
I worry that we're making it even
harder to make that jump from junior to

327
00:24:18,440 --> 00:24:22,480
an intermediate. I mean, we've
always had a problem with intermediates anyway,

328
00:24:22,839 --> 00:24:27,000
but a lot of the automation tools
that are taking a lot of are eliminating

329
00:24:27,039 --> 00:24:33,039
the beginner stuff. Yeah, like
this whole like the generative AI stuff,

330
00:24:33,759 --> 00:24:37,279
Like it's great for senior engineers.
You're now so much more productive, you

331
00:24:37,319 --> 00:24:40,839
can put so much faster. But
like the way that you get to that

332
00:24:40,880 --> 00:24:48,079
point is, it's scars. How, how
are we going to force ourselves? I

333
00:24:48,119 --> 00:24:52,039
mean, I believe that the solutions
will emerge. I hope it looks pretty

334
00:24:52,039 --> 00:24:56,440
bad. And I also think that
younger generation will find them too, because

335
00:24:56,480 --> 00:24:59,640
they are not you know, we've
done this show where we're talking about is

336
00:24:59,640 --> 00:25:03,839
all this scar tissue actually holding us
back? Right? That we have some

337
00:25:03,880 --> 00:25:07,039
of it? Yeah, some of
it's value, I think. You know,

338
00:25:07,079 --> 00:25:10,599
you have to internalize the damage and
say, like, what does this

339
00:25:10,680 --> 00:25:12,839
really look like? In generally speaking, And when someone says I will never

340
00:25:12,960 --> 00:25:17,160
use X product or X technique,
it's like you have not internalized your scar

341
00:25:17,160 --> 00:25:22,000
as well. I love that every
team has, I think, but also

342
00:25:22,039 --> 00:25:26,640
that can they learn to speak to
These are the concerns I have when you

343
00:25:26,680 --> 00:25:30,480
think about the broader approaches to things
that might have created that problem back in

344
00:25:30,519 --> 00:25:33,799
the past. I read this book
called The Trauma of Everyday Life, which

345
00:25:33,839 --> 00:25:38,160
is written by this guy who's a
psychiatrist and a Zen Buddhist, and he's

346
00:25:38,200 --> 00:25:45,559
talking about how trauma isn't necessarily something
to be avoided because it's literally what shapes

347
00:25:45,599 --> 00:25:49,559
you. Think about a bonsai: that's
just a normal tree, but it was

348
00:25:49,599 --> 00:25:53,519
it was put in this very specific
where its roots couldn't grow right, and

349
00:25:53,559 --> 00:25:57,720
so it's not it's like not RECTI
people like trauma is great, but it's

350
00:25:57,759 --> 00:26:02,759
also like there's scar tissue. It's
just going to be different. And again,

351
00:26:02,759 --> 00:26:03,519
how you react to and how you
work to it, you can make

352
00:26:03,559 --> 00:26:08,640
beautiful things. So what are some
of the other pitfalls that people will encounter

353
00:26:08,839 --> 00:26:15,640
when sort of moving to this observability. The big one is the cognitive just

354
00:26:15,920 --> 00:26:19,480
the model that we have in our
brain. I feel like our industry has

355
00:26:22,240 --> 00:26:23,920
avoided this for a long time.
I feel like there's a bit of a

356
00:26:23,960 --> 00:26:29,039
reckoning. You know, OpenTelemetry. By the way, I've got to

357
00:26:29,039 --> 00:26:33,119
put in a quick plug for OpenTelemetry
here. It's amazing. I know all

358
00:26:33,160 --> 00:26:37,240
of us hate redoing our code,
but like the promise of OpenTelemetry is

359
00:26:37,599 --> 00:26:42,839
you reinstrument your code once and then
vendors have to compete for your dollars based

360
00:26:42,880 --> 00:26:48,720
on being awesome instead of having you
locked in. It is. It's it's

361
00:26:48,759 --> 00:26:55,559
the number two project after Kubernetes in
the, what do you call it, thank you, CNCF.

362
00:26:55,960 --> 00:26:59,519
It's super active. A lot of
contributors. I was pretty skeptical about this,

363
00:26:59,640 --> 00:27:03,119
but it's it's it's the way I
wish we had had this ten years

364
00:27:03,119 --> 00:27:04,880
ago. I think we'd all parties. You know, we could have we

365
00:27:04,920 --> 00:27:10,119
could have just chosen to. It's a political problem, not a technical one. Totally. We're there now,

366
00:27:10,400 --> 00:27:12,160
but here we are. These are
the tools that we have.

367
00:27:12,160 --> 00:27:18,039
OpenTelemetry is worth putting on your roadmap
for the next year or two because there's

368
00:27:18,119 --> 00:27:22,680
also this this reckoning that's happening with
costs right now. Most of these vendors

369
00:27:22,759 --> 00:27:27,480
are billing just like ungodly amounts of
dollars that do not correspond to the value

370
00:27:27,480 --> 00:27:30,359
that you get out of them because
they can, because they've got you locked in.

371
00:27:30,680 --> 00:27:33,720
Yeah, yeah, And so I
feel like we need to take our power

372
00:27:33,720 --> 00:27:37,880
back. And it's a great pitch
for a feature that's not necessarily a new

373
00:27:37,920 --> 00:27:41,119
feature to say, hey, I
can reduce our costs by moving us off

374
00:27:41,279 --> 00:27:45,160
this tool and onto OpenTelemetry.
You know, I'm going on some of these

375
00:27:45,200 --> 00:27:49,319
sidebar rants here, but like I
feel like learning to treat like one artifact

376
00:27:49,400 --> 00:27:52,839
of the zero interest rate, like, period was that engineers forgot how to talk

377
00:27:52,880 --> 00:27:56,839
about our work in terms of dollars, you know, because like dollars are

378
00:27:56,920 --> 00:28:00,720
the universal denominator. Maybe something the
Euros. I don't know, but like

379
00:28:00,880 --> 00:28:06,519
money is the universal denominator, and
if we can't learn to talk about the

380
00:28:06,640 --> 00:28:11,400
value of the shit that we provide
to people in finance people, I feel

381
00:28:11,400 --> 00:28:15,279
like many, many VPs of engineering and
CTOs have this phenomena where they feel like

382
00:28:15,319 --> 00:28:18,160
the junior partner at the table.
They aren't really invited to all the critical

383
00:28:18,200 --> 00:28:21,960
meetings and stuff. And I believe
that that's because we haven't learned to talk

384
00:28:22,000 --> 00:28:26,640
about the value that we bring and
cost in the same language as every other

385
00:28:26,680 --> 00:28:29,359
team. Because if we did,
we generate a lot of value. We

386
00:28:29,599 --> 00:28:34,200
generate all company, We have all
the power we should need to have.

387
00:28:34,599 --> 00:28:37,920
All we've got to do is get a
handle on it. And Charity, we need to

388
00:28:37,960 --> 00:28:45,839
interrupt for one moment for this
very important message and we're back. It's

389
00:28:45,880 --> 00:28:48,039
dot NetRocks. I'm Richard Campbell. That's Carl Franklin. Talking to our

390
00:28:48,039 --> 00:28:56,920
friend Charity Majors about observability engineering and
watching the sausage being made. And so

391
00:28:56,079 --> 00:29:00,279
I want to follow up on, you brought up generative AI and

392
00:29:00,359 --> 00:29:04,960
things for programmers. It's great for
senior programmers who can be more productive with

393
00:29:06,039 --> 00:29:08,799
stuff they might have forgotten how to
write or don't really care to figure out

394
00:29:08,880 --> 00:29:14,720
and just let ChatGPT do it
for you. But what do you think

395
00:29:14,880 --> 00:29:18,839
the future of observability is, especially
in lieu of AI and where it's going.

396
00:29:18,920 --> 00:29:23,839
Do you think that we'll have AI
bots sort of watching our telemetry and

397
00:29:23,920 --> 00:29:30,480
giving us English prompts, you know, sending us text messages. Vendors are

398
00:29:30,519 --> 00:29:33,799
going to sell CTOs and VPs like
tens of billions of dollars worth of bullshit

399
00:29:33,880 --> 00:29:37,200
that says that they can do that. Yeah, something that blew my mind

400
00:29:37,200 --> 00:29:41,599
when I became CTO... So are
you saying we don't need this? We

401
00:29:41,160 --> 00:29:45,319
have everything that we need right there
in front of something that blew my mind

402
00:29:45,319 --> 00:29:51,279
when I became CTO, it took me a while
to internalize, but most executives have more

403
00:29:51,319 --> 00:29:56,720
trust and confidence in their vendor relationships
than their employees, because employees come and go;

404
00:29:56,039 --> 00:30:02,759
the vendors last forever as long as
you keep paying them. In my mind,

405
00:30:02,200 --> 00:30:06,759
but what they're selling when they come
in. This is why my dander

406
00:30:06,839 --> 00:30:10,440
got raised so much by the whole
AIOps thing, because they are all just

407
00:30:10,640 --> 00:30:12,640
like you don't need to understand your
systems. Pay us all this money,

408
00:30:12,799 --> 00:30:17,440
we'll understand it for you, and
but like the false positives are ridiculous and

409
00:30:17,480 --> 00:30:19,920
off the charts, all of the
data is junk. You would be better

410
00:30:21,000 --> 00:30:23,680
off just like turning off all that
data, Like it's just so many problems

411
00:30:23,720 --> 00:30:29,440
with it. I believe that we
should be looking at computers to do what

412
00:30:29,519 --> 00:30:33,440
computers do best, and people to
do what people do best, and computers

413
00:30:33,480 --> 00:30:37,319
crunch numbers, people attach meaning to
things. Sure, like your graphs are

414
00:30:37,359 --> 00:30:42,039
spiking all day long, most of
them you don't care about, because our

415
00:30:42,039 --> 00:30:45,599
computers are now resilient to a whole
lot of failure. Sure, it really

416
00:30:45,680 --> 00:30:52,160
takes a person coming along and going,
that matters. That matters often because it mattered

417
00:30:52,200 --> 00:30:56,000
to another person, and you're the
person who's there connecting those dots. And

418
00:30:56,039 --> 00:31:00,480
once you've decided it matters, you
need to understand why. And I think

419
00:31:00,480 --> 00:31:03,200
there are all kinds of ways for
computers to help us do that. We

420
00:31:03,279 --> 00:31:07,200
do this really cool thing called BubbleUp
at Honeycomb, where any graph,

421
00:31:07,319 --> 00:31:11,160
any heat map that you've constructed,
you draw a little bubble around something you're

422
00:31:11,160 --> 00:31:14,720
like, I care about this,
and then we compute the baseline for all

423
00:31:14,759 --> 00:31:18,920
the hundreds of dimensions and the dimensions
that are inside the thing you care about,

424
00:31:18,079 --> 00:31:21,640
and then we diff them and sort
them, so it's like, Okay,

425
00:31:21,680 --> 00:31:23,759
this thing you care about, here
are the five to ten ways that it

426
00:31:23,880 --> 00:31:27,200
is different from everything that you don't
care about. Computers are great at that,

427
00:31:27,720 --> 00:31:30,160
but they can't tell you what to
care about, and they shouldn't try

428
00:31:30,160 --> 00:31:33,880
it, because it's a fucking mess. Maybe ten years from now I will

429
00:31:33,920 --> 00:31:37,039
be eating my words, but for
the foreseeable future, I really think that

430
00:31:37,079 --> 00:31:44,160
we're all best served if we focus
on helping people understand what has meaning and

431
00:31:44,240 --> 00:31:45,599
letting computers take care of the rest. Yeah. I mean I can see

432
00:31:45,640 --> 00:31:49,240
the machine tools helping to point us
to the unusual. Sure, yeah, but

433
00:31:49,440 --> 00:31:53,359
you still have to interpret them.
Yeah. Still you want them to create

434
00:31:53,400 --> 00:31:57,440
that graph for you. You want
them to intelligently sample often, you want

435
00:31:57,480 --> 00:32:00,799
them to do you know, but
you don't don't want them in the business

436
00:32:00,799 --> 00:32:04,160
of telling you what that No,
they don't know, you don't know,

437
00:32:04,359 --> 00:32:08,279
and more and more stately, like
they're not even qualified to make that s

438
00:32:08,319 --> 00:32:13,119
that's been in any way. That
being said, like I can tell you

439
00:32:13,160 --> 00:32:15,119
we're talking a lot of OLAP
terms here, like a lot of data

440
00:32:15,200 --> 00:32:21,640
analytic terms around all this and machine
learning models evolved from a lot of that

441
00:32:21,759 --> 00:32:24,680
technology. So you can see a
shape of this shape of history. You

442
00:32:24,799 --> 00:32:29,799
can't see a shape, but I
don't believe that it is. It is

443
00:32:29,920 --> 00:32:35,000
one that So here's the thing.
At the bottom line, we are forget

444
00:32:35,000 --> 00:32:38,160
technology. We are held legally accountable
for your engineers. We are legally and

445
00:32:38,240 --> 00:32:43,400
ethically and morally accountable for the code
we put out into the world. Right,

446
00:32:43,880 --> 00:32:46,279
we can't point at an algorithm when it comes
to that, even if it's a

447
00:32:46,359 --> 00:32:50,519
machine learning I think, I don't
know if you've read a EULA lately,

448
00:32:51,359 --> 00:32:53,160
A boy, oh boy, they
work really hard to make sure we're not

449
00:32:53,279 --> 00:33:01,160
legally accountable for any I believe in
the near infinite possibility employers. That's that's

450
00:33:01,200 --> 00:33:07,039
true. I want to make sure
that I understand. I also, I

451
00:33:07,079 --> 00:33:09,559
mean, I like that we're also
going to moral and ethical aspect because I

452
00:33:09,559 --> 00:33:14,559
think we need. I think that
legal aspects holding us back, that we

453
00:33:14,599 --> 00:33:17,920
can't own the value of what we
make, that we can't own the value

454
00:33:17,960 --> 00:33:23,559
of what we make as long as we're
abdicating our responsibilities. And then really,

455
00:33:23,599 --> 00:33:28,480
you know, the EULA was invented
to allow us to not hold liability for

456
00:33:28,519 --> 00:33:34,039
the impact of our software, and
so we're kind of in a trap right

457
00:33:34,079 --> 00:33:37,960
as an industry. If we were
responsible for the damage we did, we

458
00:33:38,000 --> 00:33:43,480
would we would our employers would insist
on higher standards because they're getting caught up

459
00:33:43,480 --> 00:33:47,759
in that as well. But because
we've avoided the responsibility so thoroughly, I

460
00:33:47,799 --> 00:33:52,400
see what you're saying. That being
said like this is now we get into

461
00:33:52,400 --> 00:33:54,880
a pretty deep philosophical side of this
thing, like let's face it, good

462
00:33:54,920 --> 00:33:58,640
telemetry. In the end, we're
trying to understand why is the software behaving

463
00:33:58,640 --> 00:34:01,759
the way it's behaving? Why are our
customers unhappy? I mean, those are

464
00:34:01,759 --> 00:34:06,519
the things that actually matter. I
think. The more often that we as

465
00:34:06,599 --> 00:34:09,639
technologists speak in the term of the
customers, I think, why are our

466
00:34:09,639 --> 00:34:13,719
customers unhappy? You know? And
this is something I've been really grappling with

467
00:34:13,840 --> 00:34:15,039
lately. I don't know if I'm
alone in this or not, but I

468
00:34:15,079 --> 00:34:22,719
have like an almost knee jerk,
almost disgusted or like reaction towards like customer

469
00:34:22,920 --> 00:34:27,840
and value and things. And I've
been trying to because we've been battered with

470
00:34:27,920 --> 00:34:31,440
it, because we've been battered beaten
up those words. Yeah, I don't

471
00:34:31,480 --> 00:34:36,320
know, just the business aspect,
like I think there's some vestige of me

472
00:34:36,400 --> 00:34:38,440
that's still like, ew, we're better
than that. And I hate myself

473
00:34:38,440 --> 00:34:42,679
as I'm saying that. You know, and you also opened with: the dollars

474
00:34:42,719 --> 00:34:45,639
matter. They do, they tend to
come from the customers. Oh, you

475
00:34:45,639 --> 00:34:47,880
should have seen me ten years ago, because this is a chill version.

476
00:34:47,960 --> 00:34:52,440
I get that. Okay, No, but you're you're absolutely right. We

477
00:34:52,519 --> 00:34:54,239
do this for the customer. We
do this for our users. That's the

478
00:34:54,320 --> 00:34:58,840
reason we exist, and we have
a responsibility to them. Sure, And

479
00:34:58,880 --> 00:35:00,840
I don't think I'll ever feel comfortable
saying, well, the machine told me

480
00:35:00,920 --> 00:35:05,239
it was fine. That's a cop
out every time. Because the machine didn't

481
00:35:05,280 --> 00:35:08,079
tell you anything, you interpreted it
and chose to vocal. You know,

482
00:35:08,159 --> 00:35:14,559
in the end, everything we've talked
about program it's fine, getting very philosophical,

483
00:35:14,679 --> 00:35:17,159
but also none of this has described
an action we should take. All

484
00:35:17,159 --> 00:35:21,440
we're doing is observe what's going on. We still have to decide on the

485
00:35:21,480 --> 00:35:24,079
action. How would you change the
code? Given you've seen this in telemetry

486
00:35:24,199 --> 00:35:28,000
and you know what else? Like
I feel like this loops back really nicely

487
00:35:28,039 --> 00:35:31,679
into just like what is what is
the meaningful life? Right? Like because

488
00:35:31,760 --> 00:35:36,639
like that book that what's his face
wrote about about work and what makes us

489
00:35:36,679 --> 00:35:40,800
happy is it's not like having twenty
hours a day whatever, but it's like

490
00:35:42,199 --> 00:35:46,880
autonomy, mastery and meaning purpose.
Yeah, Daniel Pink. Daniel Pink,

491
00:35:46,960 --> 00:35:51,960
thank you, and like the meaning
the purpose that comes into play for us

492
00:35:52,000 --> 00:35:53,920
when it impacts other people. Well, and you hit on the key thing,

493
00:35:53,960 --> 00:35:58,800
which is when we crack this,
not like every time you chase a

494
00:35:58,840 --> 00:36:01,480
problem downline that and it turns into
a code change, you can make that

495
00:36:01,639 --> 00:36:07,079
then later testing shows that problem doesn't
occur. Boy, that's a good day.

496
00:36:07,280 --> 00:36:13,000
Like you talk about purpose, there
is nothing better than figuring that complicated

497
00:36:13,000 --> 00:36:16,880
problem out and then literally, like
you, you live in a very hypothesis

498
00:36:16,960 --> 00:36:21,119
based world. It's like, well, I've seen this telemetry, I've seen

499
00:36:21,119 --> 00:36:24,119
this output. I believe it's this
code problem. Now I'm going to make

500
00:36:24,119 --> 00:36:28,559
a modification. I'm going to put
it into the stream and I'm going to

501
00:36:28,639 --> 00:36:31,280
go back and test again. And
if I don't see it, then I

502
00:36:31,400 --> 00:36:35,679
can, you know, hypothesize really
because I might be wrong. We may

503
00:36:35,719 --> 00:36:39,079
not have recreated conditions perfectly.
That we're on it, that we're pushing

504
00:36:39,079 --> 00:36:43,320
the right thing, and nobody knows
just how deep that went. No,

505
00:36:43,880 --> 00:36:46,360
I also wonder, you know how
many times have you been fighting a problem

506
00:36:46,440 --> 00:36:51,159
like that and you start changing code just to see if you can change behavior

507
00:36:51,679 --> 00:36:55,440
at all? Like, am I
even assistant? They have emergent properties,

508
00:36:55,440 --> 00:37:00,199
They're no longer like I feel like
part of moving from like the old version

509
00:37:00,199 --> 00:37:05,880
to the new is accepting that TDD
is not enough. Interesting. Like the tests tell

510
00:37:05,960 --> 00:37:10,039
you will this logically execute, but
that reality ends at the border of your

511
00:37:10,079 --> 00:37:14,320
laptop. Yes, and the universe
is weirder. The weird intera is so

512
00:37:14,440 --> 00:37:16,920
much weirder than that. I feel
like our jobs are not done. It's

513
00:37:16,920 --> 00:37:22,719
like until we've instrumented that code,
deployed it and watched it in production and

514
00:37:22,800 --> 00:37:25,400
asked ourselves, is it doing what
I expected it to do? And does anything

515
00:37:25,400 --> 00:37:30,119
else look weird? I know that
on the show before I was you know,

516
00:37:30,199 --> 00:37:31,719
I've done a lot of load testing. It's like I have never invented

517
00:37:31,719 --> 00:37:37,159
the load tests as weird as customers
on Saturday actually comes even come close.

518
00:37:37,480 --> 00:37:44,480
So customers are evil do things you
can't. They really opened six windows and

519
00:37:44,519 --> 00:37:47,400
hit refresh all at the same time. Did he really really? Okay?

520
00:37:49,360 --> 00:37:52,679
May I seek some practical advice on
behalf of the listeners. So let's say

521
00:37:52,679 --> 00:38:00,559
you're listening, you've been surfing, you went to honeycomb dot io and

522
00:38:00,639 --> 00:38:02,400
you checked it out, and you're
thinking this might be good. How do

523
00:38:02,440 --> 00:38:06,000
you go back to here? How
do these people in the audience go back

524
00:38:06,039 --> 00:38:10,320
to their teams and introduce this concept
without getting flogged? You know? How

525
00:38:10,519 --> 00:38:14,880
how do you approach that? I
mean, that's a great question. My

526
00:38:14,960 --> 00:38:20,199
approach is always to look for something
that's really painful, like, you know,

527
00:38:20,599 --> 00:38:23,199
things that are going down, you
don't understand, problem, problem,

528
00:38:23,239 --> 00:38:29,159
you can't crack and especially this the
siloed approach to telemetry that we're doing anyhow,

529
00:38:29,239 --> 00:38:30,840
things that are waking people up in
the middle of the night. You know,

530
00:38:31,000 --> 00:38:34,679
we've seen this a lot where you
know, people have tried to bring

531
00:38:34,760 --> 00:38:37,440
it in whatever, but then there's
an intractable problem and they put Honeycomb

532
00:38:37,480 --> 00:38:42,559
on it, and it's just like
like we've even had multiple times we've had

533
00:38:42,559 --> 00:38:47,360
our sales engineers doing demos on people's
production systems and you're about to have an

534
00:38:47,360 --> 00:38:51,719
outage here because this thing's happened,
and they're like what in the like ten

535
00:38:51,719 --> 00:38:55,440
minutes later they get paged because it
is that Like I know, I'm a

536
00:38:55,440 --> 00:38:59,400
founder, believe nothing I say,
But is that much easier when you have

537
00:38:59,480 --> 00:39:01,440
the right tool, when you have
the right visibility, just to be able

538
00:39:01,440 --> 00:39:05,079
to see what's going on? Yeah, looking at something like that, or

539
00:39:05,079 --> 00:39:09,280
somewhat counterintuitively the other side, another
place we've seen a lot of success is

540
00:39:09,280 --> 00:39:14,559
people instrumenting their CI/CD pipelines, right, because if you instrument your CI/CD pipeline

541
00:39:14,559 --> 00:39:17,199
as a trace, you can see
where all that time is going. Yeah.

542
00:39:17,480 --> 00:39:22,320
Yeah, that's kind of another approach
to this, the model of what

543
00:39:22,480 --> 00:39:27,119
is the hard work here? What's
actually hurting us? The struggle is only

544
00:39:27,119 --> 00:39:30,760
getting in the front door. We
have like zero churn; if the company didn't

545
00:39:30,760 --> 00:39:34,480
go out of business, they keep buying
us. But it's difficult to get in

546
00:39:34,519 --> 00:39:37,280
the front door. But once we
get inside, like no, but I

547
00:39:37,280 --> 00:39:39,679
think you've made the most compelling argument, and that's going to be tough for

548
00:39:39,719 --> 00:39:43,400
anyone in the room who's thinking about
this. It's like you have to go

549
00:39:43,440 --> 00:39:45,960
pick the largest dragon in the room
and say, I think I could take

550
00:39:46,000 --> 00:39:50,760
that one on if I had this
lance. If I can get this lance

551
00:39:50,800 --> 00:39:52,360
and give it a go, I'll go
for the big guy. Yeah, and

552
00:39:52,400 --> 00:39:55,239
that's the kind of bet you need
to make. But you know the underlying

553
00:39:55,280 --> 00:39:58,519
part of this, because a lot
of the software is already set up for

554
00:39:58,519 --> 00:40:00,960
the right telemetry, but it's
the custom stuff we're building that is not

555
00:40:00,960 --> 00:40:05,559
is how you provide visibility into
that yep, orienting it around. You

556
00:40:05,559 --> 00:40:07,199
know a lot of people also come
to us when they start their OpenTelemetry journey

557
00:40:07,199 --> 00:40:12,440
because we have many of the world's
best experts in OTel, so we could

558
00:40:12,440 --> 00:40:16,280
actually help consult. What do I
need to push onto the OpenTelemetry

559
00:40:16,320 --> 00:40:20,320
stack. That's going to help me, that's going to let these tools understand.

560
00:40:20,320 --> 00:40:23,920
Do any of you in the
room have a question for Charity?

561
00:40:24,119 --> 00:40:28,280
Raise your hand, It's all right
right here. Phil Haack has a question,

562
00:40:28,719 --> 00:40:31,280
So why don't you repeat the question? The question is that's a lot

563
00:40:31,320 --> 00:40:35,880
of data, and how does that cost? How does the cost work out?

564
00:40:36,119 --> 00:40:39,119
How does the cost work out? Again, this is why... So on Twitter I

565
00:40:39,159 --> 00:40:42,320
was joking the other week and it
kind of got out of control and I

566
00:40:42,320 --> 00:40:45,039
could never write a database. No, really, never write a database.

567
00:40:45,320 --> 00:40:49,239
And I thought it was a very
fun self-own because we wrote a database.

568
00:40:49,480 --> 00:40:52,639
People didn't understand that. So yeah, it's a ton of data.

569
00:40:52,719 --> 00:40:57,920
Like we've got like seven hundred customers
and we run the combined production loads of

570
00:40:57,960 --> 00:41:00,000
all of them. It's like two
billion events persons or something like that,

571
00:41:01,719 --> 00:41:07,480
and we give everyone sixty days of
storage basically for free. And the way

572
00:41:07,480 --> 00:41:15,000
that we do this we so it's
a columnar store. So indexes are verboten

573
00:41:15,119 --> 00:41:17,760
for observability because indexes are a way of
picking. I want this to run fast

574
00:41:17,760 --> 00:41:21,000
and nothing else to run fast.
You want to be able to query on

575
00:41:21,079 --> 00:41:24,360
any of these dimensions. So it's
a columnar store. And you're right,

576
00:41:24,480 --> 00:41:28,480
like two years in we ran into
this. We're never going to be profitable

577
00:41:28,480 --> 00:41:31,840
because there's so all of these SSDs, all this ram. And that's when

578
00:41:32,119 --> 00:41:37,760
one of my I brilliant engineer I've
been working he was my first manager.

579
00:41:37,039 --> 00:41:40,639
Name is Ian. He's he's nowhere
on the internet and he's amazing. Uh.

580
00:41:40,800 --> 00:41:45,760
He started looking into the cost models
and did some tests and so now

581
00:41:45,840 --> 00:41:51,280
actually, data comes in, hits
the API, gets dropped into Kafka and then

582
00:41:51,320 --> 00:41:54,840
gets read off onto you know a
pair of nodes, which are as you

583
00:41:54,840 --> 00:41:59,840
would think like lots of CPU, lots
of RAM, but then after like thirty

584
00:42:00,000 --> 00:42:06,320
six minutes it gets tailed out to
S3. The query planner actually runs

585
00:42:06,360 --> 00:42:12,400
the Lambda jobs. Uh so the
query planner comes in forks out spans and

586
00:42:12,400 --> 00:42:15,719
and like we thought it was going
to be so much slower, like doing processing,

587
00:42:15,800 --> 00:42:20,880
you know, from all these S3 buckets. It wasn't. It was different

588
00:42:20,920 --> 00:42:24,480
performance characteristics, but most queries still
return in under a second, and S3

589
00:42:24,559 --> 00:42:29,320
is cheap, so that's where most
of the data is, and the Lambda

590
00:42:29,400 --> 00:42:31,239
jobs are pretty expensive. That's a
big line item on our bills.

591
00:42:31,239 --> 00:42:35,920
So we've done we've actually done some
really great talks and written some great

592
00:42:35,920 --> 00:42:40,320
pieces about how we use Honeycomb to
optimize our Lambda jobs so that the planner,

593
00:42:40,960 --> 00:42:45,360
yeah, it's all. Yeah.
Our Honeycomb blog, by the way,

594
00:42:45,480 --> 00:42:49,639
is dope. Like we we don't
do a lot of selling there.

595
00:42:49,679 --> 00:42:52,519
We just talk about a lot of
engineering and it's pretty great. There was

596
00:42:52,559 --> 00:42:58,320
another question back here. First somebody
back there had their hand up. Okay,

597
00:42:58,320 --> 00:43:00,920
it wasn't you, but go ahead, son, repeat the question.

598
00:43:00,960 --> 00:43:07,159
I'm sorry. So the question I
think was something there's something about SLOs and

599
00:43:07,559 --> 00:43:12,920
metrics and it's too expensive to store
all the traces for all events, okay,

600
00:43:14,280 --> 00:43:17,079
And there's a few different answers for
this. I would probably want to

601
00:43:17,079 --> 00:43:21,880
ask you some more questions. It's
feel free to find me afterwards. But

602
00:43:21,960 --> 00:43:24,400
you're absolutely right. It can be
absolutely cost prohibitive to store the trace for

603
00:43:24,480 --> 00:43:28,039
every request if you have a lot of
traffic, because if you think about it,

604
00:43:28,360 --> 00:43:34,400
you might find yourself storing five to
thirty times as much telemetry data as

605
00:43:34,559 --> 00:43:38,960
production traffic. Obviously that's not tenable, right. The first solution that we

606
00:43:39,199 --> 00:43:45,840
usually steer people towards is intelligent sampling, which does not mean just, like, dumb

607
00:43:45,920 --> 00:43:47,400
dumb sampling, where, like, one out
of every ten, you drop them. It

608
00:43:47,440 --> 00:43:52,000
means, like, we have a thing called
Refinery, where there's a difference between head

609
00:43:52,039 --> 00:43:54,960
sampling and tail sampling, meaning sampling
before you know what is coming and

610
00:43:55,079 --> 00:43:59,079
after you know what's coming.
So some of these things you sample after

611
00:43:59,119 --> 00:44:01,159
you know what's coming. Be sure
and grab all of the slow events,

612
00:44:01,639 --> 00:44:05,599
right. Some of it is head
sampling where you're just like, okay,

613
00:44:06,960 --> 00:44:10,119
for example, requests that are health
checks, that are 200s. This

614
00:44:10,239 --> 00:44:13,800
is junk. I don't need to
store all of these. That's going to be like

615
00:44:13,800 --> 00:44:16,800
a quarter of your traffic sometimes,
so, like, sample them heavily. 200

616
00:44:16,840 --> 00:44:22,639
OKs to the main page, sample
them medium. Keep every request that's in

617
00:44:22,840 --> 00:44:25,519
error, or every request that is
to /payments or to billing, or, you

618
00:44:25,559 --> 00:44:30,880
know, like, there's a lot in
there that is trash that you can

619
00:44:30,159 --> 00:44:35,519
discard if you kind of go in
there with a fine-tooth comb. The

620
00:44:35,559 --> 00:44:44,480
part that was about SLOs.
We derive SLOs from events, and it's

621
00:44:44,519 --> 00:44:50,800
actually really important that they're not derived
from metrics. We're actually the only product

622
00:44:50,800 --> 00:44:54,400
out there that does SLOs the way
they're supposed to be done per the Google

623
00:44:54,480 --> 00:44:59,360
SRE book, because other companies don't
actually capture their data in a way that

624
00:44:59,440 --> 00:45:04,159
lets them do that. It's actually
pretty dope. Like you have your SLOs,

625
00:45:04,199 --> 00:45:07,000
it tells you how quickly you're
burning down the budget, and then

626
00:45:07,320 --> 00:45:12,559
it tells you what is different
about the requests that are erring that are

627
00:45:12,559 --> 00:45:16,880
burning down the budget the other blah
blah blah. We also have the metrics,

628
00:45:17,239 --> 00:45:22,599
but I'm not sure if you're talking
about our metrics product or the events.

629
00:45:22,800 --> 00:45:25,039
Like the number one answer to the
events being too expensive is you use

630
00:45:25,920 --> 00:45:30,039
smart sampling. And the number one
answer to the SLOs is, if you want those

631
00:45:30,280 --> 00:45:34,239
badly, you can absolutely do sampling. So one of the

632
00:45:34,280 --> 00:45:37,239
things: in every event that is
sent to us, there is a sample

633
00:45:37,320 --> 00:45:40,840
rate embedded in it, so every one
will say, like, 1/5, and that

634
00:45:40,960 --> 00:45:45,920
means compute this to be five like
this, so the numbers all work

635
00:45:45,960 --> 00:45:50,880
out to look like they weren't sampled. We had questions. You got to

636
00:45:50,880 --> 00:45:52,960
move on. Question right in the front. Hang on, can you repeat that with

637
00:45:53,000 --> 00:45:58,000
the mic? The question was, obviously, you can go wrong with logging because

638
00:45:58,079 --> 00:46:00,800
you can get way too many log
events. You can go wrong with metrics

639
00:46:00,840 --> 00:46:04,760
because you can have, famously,
like, the thirty-thousand-dollar metric that had

640
00:46:04,840 --> 00:46:08,079
high cardinality in it and, oops, your
budget. And so the question is, I

641
00:46:08,079 --> 00:46:14,760
think, how can you go wrong
with observability as distinct from those you know?

642
00:46:14,840 --> 00:46:16,800
And I want to say, even
if you aren't doing traces,

643
00:46:17,920 --> 00:46:22,280
the most important thing to take away
when it comes to telemetry is the magic

644
00:46:22,360 --> 00:46:28,480
of the one wide structured event per
request per service. I actually found out

645
00:46:28,639 --> 00:46:31,159
years into this that this is how
Amazon has done their telemetry all along.

646
00:46:31,480 --> 00:46:36,920
they had like a flat file at
the root domain of every node where they

647
00:46:37,000 --> 00:46:40,880
keep one of these, like, wide events. It's
magic. It makes everything... because a

648
00:46:40,880 --> 00:46:45,840
lot of the logs that you are
encountering, or because like when a request

649
00:46:45,880 --> 00:46:51,159
is executing through a service,
it's just like, oh, all these

650
00:46:51,159 --> 00:46:54,480
strings, right? But if you
just like collapse them into one wide event

651
00:46:54,760 --> 00:47:00,199
with all of those keys and values, then you have that context, right

652
00:47:00,199 --> 00:47:04,719
you can... it's magic. So
the number one thing that I think people

653
00:47:04,800 --> 00:47:07,960
get wrong with observability is not understanding
that that's the heart of everything. It

654
00:47:07,960 --> 00:47:12,480
isn't which tool you're using. It
isn't whether you're tracing or not. It's

655
00:47:12,559 --> 00:47:16,199
that. That is the number
one thing that everyone should be caring about.

656
00:47:16,719 --> 00:47:22,559
The number two thing I think comes
out when dealing with spans, a slightly

657
00:47:22,679 --> 00:47:25,960
higher order problem, and that's because
I feel like as an industry we have

658
00:47:27,079 --> 00:47:31,400
... we aren't really, we don't
really have a set of like good conventions.

659
00:47:31,440 --> 00:47:34,760
You were asking me, like,
when should you have this span?

660
00:47:35,679 --> 00:47:37,880
Man, like, I hope five
years from now everybody's like, well,

661
00:47:37,880 --> 00:47:40,079
obviously you should have this span blah
blah blah. But we aren't there yet,

662
00:47:40,239 --> 00:47:45,800
right, and so it's really easy
to either generate too many spans and

663
00:47:45,840 --> 00:47:47,599
they get lost in the noise kind
of like with logs, or too few

664
00:47:47,639 --> 00:47:52,199
spans and then not have the detail
that you need when you need it. The

665
00:47:52,280 --> 00:47:57,639
question is where to start with
OpenTelemetry. And there are only two good

666
00:47:57,679 --> 00:48:02,920
answers. One is my favorite:
start with the biggest pain. And if

667
00:48:02,960 --> 00:48:07,719
you have, like, a really resistant culture,

668
00:48:08,079 --> 00:48:12,719
then start with the least pain.
But I actually think that the best

669
00:48:12,719 --> 00:48:15,239
way to roll anything that has to
do with telemetry out is to kind

670
00:48:15,280 --> 00:48:20,360
of think of your attention like a
headlamp, and if you're on call for

671
00:48:20,360 --> 00:48:24,480
something that's breaking, have an
instrument-first mentality, like, you instrument to figure

672
00:48:24,480 --> 00:48:29,400
out what's wrong, not fool
around. You instrument, have it tell

673
00:48:29,400 --> 00:48:31,280
you the answer, and then it's
there for the next time you get paged

674
00:48:31,280 --> 00:48:36,000
again. Instrument to find the problem, and it's there. And as your

675
00:48:36,199 --> 00:48:38,639
headlamp kind of moves around the
stack, you know, within a couple

676
00:48:38,679 --> 00:48:42,840
of months most of the stuff that
really matters will be instrumented and then

677
00:48:42,840 --> 00:48:45,519
you can put it on the backlog
to do the rest and finish up and

678
00:48:45,559 --> 00:48:47,719
get rid of your ovenders. All
right, Well, I think that's it,

679
00:48:47,800 --> 00:48:53,559
so let's give Charity Majors a big
round of applause. I will see

680
00:48:53,599 --> 00:49:21,480
you next time. .NET Rocks
is brought to you by Franklins.Net

681
00:49:21,519 --> 00:49:25,480
and produced by PWOP Studios,
a full service audio, video and post

682
00:49:25,480 --> 00:49:30,239
production facility located physically in New London, Connecticut, and of course in the

683
00:49:30,280 --> 00:49:37,159
cloud online at pwop dot com.
Visit our website at D O T N

684
00:49:37,199 --> 00:49:40,760
E T R O C K S
dot com for RSS feeds, downloads,

685
00:49:40,920 --> 00:49:45,159
mobile apps, comments, and access
to the full archives going back to show

686
00:49:45,239 --> 00:49:50,519
number one, recorded in September two
thousand and two. And make sure you

687
00:49:50,599 --> 00:49:53,159
check out our sponsors. They keep
us in business. Now, go write

688
00:49:53,159 --> 00:50:01,559
some code. See you next time.
you got jacks. See a summer time

689
00:50:01,760 --> 00:50:07,960
on that means home. Then my
texes in my credit b
