WEBVTT

1
00:00:01.080 --> 00:00:05.679
How'd you like to listen to .NET
Rocks with no ads? Easy. Become

2
00:00:05.679 --> 00:00:09.839
a patron for just five dollars a
month. You get access to a private

3
00:00:10.000 --> 00:00:14.400
RSS feed where all the shows have
no ads. Twenty dollars a month will

4
00:00:14.439 --> 00:00:18.839
get you that and a special .NET
Rocks patron mug. Sign up now

5
00:00:18.839 --> 00:00:50.560
at patreon.dotnetrocks.com. Hi, it's all right. Holy

6
00:00:50.960 --> 00:00:55.560
crap, this is a great room. It's a great lot of echo.

7
00:00:55.880 --> 00:01:02.719
Yes, certainly filled this space.
He determined level of absolutely the blast radius.

8
00:01:03.240 --> 00:01:07.400
Right, we are back in Portugal. I love I love Porto,

9
00:01:07.519 --> 00:01:14.519
I love Portugal. I was in
the middle of town and I came across

10
00:01:14.560 --> 00:01:23.000
a restaurant and the sign said churrascaria, which means grill, right? The Infante.

11
00:01:23.959 --> 00:01:27.599
I didn't know you guys ate babies
over here. So no, no,

12
00:01:27.599 --> 00:01:32.280
no, not funny, too long, too long, too soon?

13
00:01:33.799 --> 00:01:38.519
Infante. I guess Infante is the
name of the area. Yeah, but

14
00:01:38.599 --> 00:01:42.799
it looked like we eat our babies. Well done, nice on a skewer,

15
00:01:42.840 --> 00:01:44.560
I think it was W.
C. Fields who said it was all

16
00:01:44.560 --> 00:01:51.439
about the sauce. Yes, I
love children exactly. You got it.

17
00:01:53.280 --> 00:01:55.439
Oh, we're gonna have some fun today. We are going to have

18
00:01:55.519 --> 00:01:59.000
some fun. But first we have
this little thing called Better Know a Framework.

19
00:01:59.239 --> 00:02:12.000
Roll that crazy music. Well I've
been waiting for this one because it happened

20
00:02:12.000 --> 00:02:15.800
a while ago. But if you
don't know, I have a consultancy called

21
00:02:15.879 --> 00:02:22.639
App vNext, and we are the
shepherds of an open source project called Polly.

22
00:02:22.879 --> 00:02:25.919
Anybody use Polly? How about, uh, how about a clap of hands?

23
00:02:27.360 --> 00:02:31.240
Use Polly a lot? Well, Reverend
Billy just walked in the room.

24
00:02:31.280 --> 00:02:36.520
Well, Polly just did a make
just came out with a major update,

25
00:02:37.080 --> 00:02:39.520
version eight. And, believe
it or not, it started with the

26
00:02:39.560 --> 00:02:45.400
.NET team, because the .NET
team was basically looking at the source code

27
00:02:45.400 --> 00:02:49.639
and said, hey, we think
we can improve the performance and the resource

28
00:02:49.759 --> 00:02:54.360
usage of Polly, but it's going
to require some new interfaces and you know,

29
00:02:54.879 --> 00:03:00.240
almost a complete rewrite. And so
the rest of us said, yeah,

30
00:03:00.280 --> 00:03:04.400
I'm sorry, the .NET team
called you? Yeah. To say they

31
00:03:04.439 --> 00:03:07.199
want to make your project... Yeah, I mean, what are you going to say?

32
00:03:07.319 --> 00:03:14.800
Nah? So it took a lot
of meetings and a lot of understanding,

33
00:03:14.800 --> 00:03:19.400
and basically what they were able to
do is without you having to change any

34
00:03:19.439 --> 00:03:23.560
of your code that uses Polly,
you will get the benefits of the new

35
00:03:23.599 --> 00:03:29.520
performance and resource allocation that's under the
hood. But if you want to use

36
00:03:29.560 --> 00:03:34.439
the new models and the paradigms and
the interfaces, you can do that.

37
00:03:34.680 --> 00:03:38.960
So Greenfield, you can go forward
with a new style. But just

38
00:03:39.000 --> 00:03:45.639
if you use Polly in place,
I'm not sure if it's completely compatible or

39
00:03:45.680 --> 00:03:49.759
you have to change some class to
some other class. But it's pretty much

40
00:03:50.000 --> 00:03:54.080
a simple a simple fix. And
I have been emailing with Joel Hulen

41
00:03:54.080 --> 00:04:00.879
to schedule a Polly show for... Yes,
we definitely will. Great, cool. So that's

42
00:04:00.919 --> 00:04:02.879
what, that's what I got. Awesome. Who's talking to us, Richard? Grabbed a

43
00:04:02.879 --> 00:04:06.960
comment off of show eighteen sixty one
we did with Jeremy Miller back in the

44
00:04:06.960 --> 00:04:11.840
summer of twenty three, talking about
minimal architecture because Jeremy likes to cause trouble.

45
00:04:11.879 --> 00:04:15.720
Goodness knows. And this comment comes from
Trevor who says, I love this

46
00:04:15.759 --> 00:04:19.879
discussion, enjoyed the comments on microservices
versus monoliths, which is actually a reference

47
00:04:19.920 --> 00:04:25.120
to an earlier show we did in Porto. We've been following this trend of people

48
00:04:25.439 --> 00:04:30.040
sort of pushing back on microservices.
Yes, I got pushed heavily into microservices

49
00:04:30.040 --> 00:04:33.639
approaches with a product that we'd built
and re-architected into microservices, and it

50
00:04:33.720 --> 00:04:40.079
was the worst mistake ever. Things
just became more complex, it was harder

51
00:04:40.120 --> 00:04:43.279
to maintain, it added a bunch
of latency and security issues, and the complexity

52
00:04:43.399 --> 00:04:46.879
was just not worth it. And
so I came up with a new acronym

53
00:04:46.480 --> 00:04:55.600
for appropriately sized services, or ASS. Nice.
I can relate. This is a good

54
00:04:55.639 --> 00:04:58.800
one. I one hundred percent believe
in services, separation of concerns, and

55
00:04:58.879 --> 00:05:02.399
clean architectures, but the approach must
be appropriate to the complexity of the solution and

56
00:05:02.480 --> 00:05:06.079
the size of the team. It
makes no sense to have one hundred separate

57
00:05:06.079 --> 00:05:10.199
services for a team of ten people. But then it also makes no sense

58
00:05:10.199 --> 00:05:12.639
to have a massive single deployment with
a code base and a team of

59
00:05:12.639 --> 00:05:15.399
two hundred and fifty people. The
services need to work with the cognitive load

60
00:05:15.639 --> 00:05:20.600
and be appropriate to the organization and
team structures. And I was loving the

61
00:05:20.600 --> 00:05:26.319
discussion on all of this except for
that one point. Stop making the CTO

62
00:05:26.399 --> 00:05:32.279
out to be the bad guy.
Love from Trevor, CTO. That's great,

63
00:05:32.399 --> 00:05:36.879
Yep, that's fair. I definitely
think we need a whole show on ASS.

64
00:05:38.000 --> 00:05:40.920
I think so it seems up.
Yeah. So, Trevor, thank you

65
00:05:40.959 --> 00:05:43.319
so much for your comment, and
a copy of Music to Code By is on its

66
00:05:43.360 --> 00:05:45.240
way to you. And if you'd
like a copy of Music to Code By,

67
00:05:45.240 --> 00:05:46.680
write a comment on the website
at dotnetrocks.com or on

68
00:05:46.680 --> 00:05:49.439
the Facebook. We publish every
show there, and if you comment there

69
00:05:49.439 --> 00:05:51.639
and we read it on the show, we'll
send you a copy of Music to Code By. And

70
00:05:51.759 --> 00:05:56.399
you can also follow us on Twitter
if you want to. But the real

71
00:05:56.759 --> 00:06:00.439
cool kids are over on Mastodon. I'm at Carl Franklin at techhub.social

72
00:06:00.519 --> 00:06:04.480
and I'm Rich Campbell at mastodon.social. Send
us a toot; we'll get around to reading it

73
00:06:05.920 --> 00:06:11.920
all, publish it and with that
let us introduce Charity Majors to the show.

74
00:06:12.079 --> 00:06:16.319
Charity is an ops engineer and CTO
at honeycomb.io. Before that,

75
00:06:16.480 --> 00:06:23.240
she worked at Parse, Facebook and
Linden Lab on operations and developer tools

76
00:06:23.279 --> 00:06:27.160
and always seemed to wind up running
the databases. That's because it's where all

77
00:06:27.160 --> 00:06:30.360
the problems were. Yeah, you
stand next to the database, you're going

78
00:06:30.399 --> 00:06:35.240
to be running it. Also coauthor
of O'Reilly's Database Reliability Engineering and the

79
00:06:35.319 --> 00:06:41.879
newly released Observability Engineering. Charity loves
free speech, free software, and single

80
00:06:41.920 --> 00:06:50.439
malt Scotch. Can we
have a round of applause for Charity Majors? Okay,

81
00:06:50.759 --> 00:06:53.839
I guess we have to start at
the beginning, right at the beginning.

82
00:06:53.879 --> 00:06:58.199
What the heck is observability engineering?
That's a great question. It's like

83
00:06:58.199 --> 00:07:03.519
the engineering of Windows, yeah,
kind of. I mean observability comes from

84
00:07:03.560 --> 00:07:09.319
control theory, right, and it's
like, how well can you understand what's

85
00:07:09.360 --> 00:07:17.040
going on inside your systems just by
observing outputs? And yeah, exactly.

86
00:07:17.759 --> 00:07:23.959
And you know, I for years
was like really religious about trying to define

87
00:07:23.959 --> 00:07:26.560
it in a very specific way,
and I should have won, but I

88
00:07:26.639 --> 00:07:33.160
lost. So so I mean it's
come to it's come to just kind of

89
00:07:33.160 --> 00:07:40.399
be a generic synonym, which is
what it is. But when we were

90
00:07:40.399 --> 00:07:43.079
trying to figure out how to talk
about what I think of as just

91
00:07:43.160 --> 00:07:47.079
kind of the next generation of telemetry. It's kind of distinguished from the last

92
00:07:47.120 --> 00:07:50.639
generation of telemetry, obviously, which
was very much focused around the metric,

93
00:07:50.959 --> 00:07:56.079
right, which is just a number with tags appended. It doesn't handle high

94
00:07:56.079 --> 00:08:01.079
cardinality, doesn't handle dimensionality, doesn't
handle... It's super fast, but it's not that powerful.

95
00:08:01.160 --> 00:08:05.839
Now, you dropped some OLAP terms in
there: cardinality, flexibility. Like, it's funny

96
00:08:05.879 --> 00:08:11.319
for a database person to drop
OLAP, but you're talking about just any

97
00:08:11.360 --> 00:08:15.839
way that you can really observe the
state, the internal state, not necessarily

98
00:08:15.879 --> 00:08:18.319
what it's doing on the outside.
It's about observing the internal state and being

99
00:08:18.399 --> 00:08:24.279
able to explore it right, not
having to decide in advance, here's the

100
00:08:24.319 --> 00:08:26.920
data I'm going to collect, because
here's the questions I'm going to need to

101
00:08:26.959 --> 00:08:30.560
answer. Here's my dashboard. You
know, it's about being able to go

102
00:08:30.959 --> 00:08:35.759
to combine your questions to ask because, like anything that you're trying to understand

103
00:08:35.840 --> 00:08:37.759
these days is going to be a
very complicated answer. It's like,

104
00:08:39.320 --> 00:08:43.440
okay, these errors are spiking,
but only for users that are running

105
00:08:43.440 --> 00:08:46.440
this version of Android, with a
particular firmware in this region, with this

106
00:08:46.559 --> 00:08:50.879
language pack. Each of those
is a high-cardinality dimension.

107
00:08:50.960 --> 00:08:54.240
And if you don't capture the data
in a way that preserves all that context,

108
00:08:54.720 --> 00:08:58.639
you can't ask those questions. Do
you have some examples of how observe

109
00:09:00.039 --> 00:09:05.799
observability has improved a project in particular? Sure, I mean I think of

110
00:09:05.840 --> 00:09:11.480
it as it's really it's kind of
where development meets operations, right, Like,

111
00:09:11.600 --> 00:09:16.279
I feel like big picture. You
know, in the beginning, there

112
00:09:16.320 --> 00:09:18.519
were engineers who wrote code and they
owned it in production, right, right,

113
00:09:18.720 --> 00:09:22.639
And then everything got super complicated and
we're like, ah, there's too

114
00:09:22.720 --> 00:09:24.320
much. So some of us are
going to write code and some of us

115
00:09:24.360 --> 00:09:28.600
are going to understand it. And
that was not about it. That was

116
00:09:28.639 --> 00:09:31.799
not a great idea, and so
like we're kind of like reunifying the streams

117
00:09:31.840 --> 00:09:35.759
now. I think every engineer should
be writing their code and owning it in

118
00:09:35.799 --> 00:09:43.000
production. Everyone who's a specialist in operations should
be also like opening the door and looking

119
00:09:43.080 --> 00:09:46.559
under the hood and understanding the code. Right? Specialization is great,

120
00:09:46.080 --> 00:09:50.039
but ultimately, you know, our
systems have gotten so complex that you have

121
00:09:50.080 --> 00:09:52.559
to write it and understand. I
feel like you got to dig into that

122
00:09:52.799 --> 00:09:58.679
own it in production because it's not
like they're also going to be sysadmins. Exactly.

123
00:09:58.080 --> 00:10:03.759
as they are responsible for the You're
responsible for your systems, right, you

124
00:10:03.759 --> 00:10:07.080
wrote it, you own it,
You unleashed this support upon the world.

125
00:10:07.039 --> 00:10:09.759
I mean, I feel like there
are these feedback loops in the heart of

126
00:10:09.840 --> 00:10:13.320
engineering. Some of them are like
code review, right. Some of them

127
00:10:13.960 --> 00:10:18.639
are like deploys. But like,
if you don't hook up the feedback loop,

128
00:10:18.840 --> 00:10:22.559
if you aren't being exposed to the
consequence of what you're doing, then

129
00:10:22.600 --> 00:10:24.840
like you're not you don't actually know
if your code is good or not.

130
00:10:24.120 --> 00:10:28.919
Well, I think there's a great
point there as a developer then that if

131
00:10:28.919 --> 00:10:31.399
my telemetry just tells me how many
times my code was hit, that doesn't

132
00:10:31.399 --> 00:10:35.639
necessarily give me anything to do.
And this is this is where I feel

133
00:10:35.679 --> 00:10:39.960
like operations folks have had a harder
time embracing observability in some ways than software

134
00:10:39.960 --> 00:10:43.759
engineers have, because with ops people,
it's like we learned how to debug,

135
00:10:43.799 --> 00:10:48.600
but it looked like this, I've
got a dashboard. Something's wrong, So

136
00:10:48.600 --> 00:10:52.919
I'm gonna start paging through dashboards and
looking for similar spikes, just like pattern

137
00:10:52.919 --> 00:10:54.879
matching with my eyeballs. Like, right, oh, it looks like it's Redis,

138
00:10:54.919 --> 00:10:58.120
you know, and you get it. It's great because you're like you

139
00:10:58.159 --> 00:11:01.879
get this hero journey where you just
jump to the answer and you understand what's going

140
00:11:01.879 --> 00:11:03.840
on because you're in this shit all
day every day. Nobody else does,

141
00:11:05.120 --> 00:11:07.120
like, whoa, how did you do that? Just reset the Redis

142
00:11:07.159 --> 00:11:11.799
service, problems went away. Right? But, like, that's not debugging. Pattern matching

143
00:11:11.840 --> 00:11:16.159
with your eyeballs is not debugging.
Debugging looks like you take the step,

144
00:11:16.440 --> 00:11:20.519
you ask a question, you look
at the answer. Based on the answer,

145
00:11:20.559 --> 00:11:22.720
you take another step. It's like
following a trail of breadcrumbs.

146
00:11:22.759 --> 00:11:26.639
You don't know what the answer looks
like until you get there. Can we

147
00:11:26.679 --> 00:11:31.600
talk about some of the new modern
observability tools that we might think about using

148
00:11:31.639 --> 00:11:35.919
to replace the tools that we're currently
using. Yeah, I mean, I

149
00:11:35.919 --> 00:11:41.000
think big picture, it has to
be based It can't just be based on

150
00:11:41.039 --> 00:11:46.159
the metric because remember you've discarded all
that context. You're only looking at outputs. Exactly. It has

151
00:11:46.200 --> 00:11:50.519
to be based on arbitrarily wide
structured data blobs, which now look like

152
00:11:50.600 --> 00:11:56.200
spans, right? Those are just
like wide events structured which you can trace

153
00:11:56.240 --> 00:11:58.240
because there's been a number that's appended
to it. That's what you need in

154
00:11:58.279 --> 00:12:01.960
order to understand your telemetry in production. Because I can imagine at a peak

155
00:12:03.080 --> 00:12:05.320
load, like we think about a
metric that shows, you know, this

156
00:12:05.360 --> 00:12:09.519
is when we're posting the most number
of transactions. You're now really interested in

157
00:12:09.519 --> 00:12:13.960
the state of... Yes. Were we
queuing? Yes. What's happening? Like,

158
00:12:13.120 --> 00:12:16.799
metrics are great, but they're they're
limited, right, they're a snapshot.

159
00:12:18.080 --> 00:12:20.559
What you want to be is like, you know, okay, when this

160
00:12:20.679 --> 00:12:24.519
happened, what else happened? Right? What else is connected to it?

161
00:12:24.679 --> 00:12:26.279
You know? And like the old
generation of tools are ones where you find

162
00:12:26.799 --> 00:12:31.039
you're capturing this data another time
for every single tool. You're like, okay,

163
00:12:31.039 --> 00:12:35.559
here's my dashboards and the metrics,
here's my logs, here's my traces.

164
00:12:35.639 --> 00:12:37.000
So every time you're like I've got
a spike, I want to find

165
00:12:37.080 --> 00:12:41.720
the logs, there's nothing that connects
them. You're just eyeballing timestamps and hoping

166
00:12:41.720 --> 00:12:46.360
that they happen to match up.
And like, if you're finding the logs,

167
00:12:46.360 --> 00:12:48.080
and you want to jump to a trace, like, that's not actually good enough.

168
00:12:48.320 --> 00:12:54.200
You can derive all of those data
formats from these arbitrarily wide events, from spans.

169
00:12:54.240 --> 00:12:56.039
You can't go in the other direction. When you say spans, what

170
00:12:56.159 --> 00:13:01.840
exactly are you talking about? A span
is a one hop of the trace,

171
00:13:01.120 --> 00:13:05.840
okay, across all of them.
So should we be gathering spans? You should

172
00:13:05.879 --> 00:13:11.679
be gathering telemetry as one event per
request, per service. All of the

173
00:13:11.799 --> 00:13:16.399
data should be aggregated into that one
arbitrarily wide event, so you have

174
00:13:16.519 --> 00:13:22.120
all that context, Like a really
mature instrumented service will have like two hundred

175
00:13:22.200 --> 00:13:26.879
three hundred dimensions per hop, and
that's that's magic because you're passing along all

176
00:13:26.919 --> 00:13:31.559
of the parameters, you're passing along
all of the IDs,

177
00:13:31.759 --> 00:13:35.919
you're passing along all of that context, which lets you after the fact come

178
00:13:35.000 --> 00:13:39.879
back and say, oh, this
thing in this service that happened was connected

179
00:13:39.919 --> 00:13:41.679
to that thing in that service that
happened. Now, this is not

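NOTE Editor's sketch, not from the recording: one way to picture the "one wide event per request, per service" idea described above. This is a minimal Python illustration under stated assumptions: the handle_request helper, the field names, and the values are all hypothetical, and it is not Honeycomb's API.
```python
import time
import uuid
def handle_request(trace_id: str, user_id: str, endpoint: str) -> dict:
    """Build one wide event for one request in one service (hypothetical fields)."""
    start = time.monotonic()
    event = {
        "trace_id": trace_id,          # passed in by the caller so every hop shares it
        "span_id": uuid.uuid4().hex,   # unique to this hop
        "service": "api",
        "endpoint": endpoint,
        "user_id": user_id,            # high-cardinality value kept as-is
    }
    # Real work would happen here; keep adding context to the same event.
    event["db_query_count"] = 2
    event["status_code"] = 200
    event["duration_ms"] = (time.monotonic() - start) * 1000.0
    return event
event = handle_request("trace-abc", "user-123", "/checkout")
print(sorted(event))
```
Keeping every field on the same record is what makes it possible to relate them after the fact.
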
180
00:13:41.679 --> 00:13:46.840
necessarily a per transaction level, like
you're not just chasing a transaction. What's

181
00:13:46.960 --> 00:13:56.759
this? It's basically... it's one span from
a time... it's, well, typically... well, it's

182
00:13:56.000 --> 00:14:03.240
complicated. There are lots of ways
that you can define a span, but

183
00:14:03.399 --> 00:14:05.240
typically I like to think about it as:
you want to have a span around

184
00:14:05.240 --> 00:14:09.399
something that's interesting. So like if
it's anytime that you're crossing the network,

185
00:14:09.559 --> 00:14:13.840
you want a span. Anytime you're
making a database request, you want a

186
00:14:13.879 --> 00:14:18.399
span, because that's historically where problems happen. It's wherever you're crossing the... Right.

187
00:14:18.440 --> 00:14:20.960
So when you start it at a
user interface interaction and go from there and

188
00:14:22.000 --> 00:14:26.879
then, you know, likewise with back
end services that go on over time. Like, you

189
00:14:26.919 --> 00:14:30.879
can have, you can have spans and
tracing in monoliths too, and it's super

190
00:14:31.120 --> 00:14:37.039
super useful there as well. But
it becomes indispensable once you have services.

191
00:14:39.000 --> 00:14:41.200
Yeah, because if you think about
it as a monolith, at least you

192
00:14:41.320 --> 00:14:46.399
have all that context and it persists
throughout the request. When you jump across

193
00:14:46.440 --> 00:14:50.120
the network from service to service,
you're deciding what state's going to come with.

194
00:14:50.200 --> 00:14:54.679
And so how do you do all
this without bringing the server CPU to

195
00:14:54.759 --> 00:14:58.799
its knees? How do you do this? Typically, the way we do it

196
00:14:58.799 --> 00:15:01.639
now is it's attached to background
threads and that kind of stuff, and

197
00:15:01.720 --> 00:15:03.720
you can lose, you know, if
those threads hang, you can lose data.

198
00:15:05.120 --> 00:15:07.120
There are lots of ways to do
it. Obviously, I think that

199
00:15:07.200 --> 00:15:11.960
my service does it best. Your
Honeycomb? Well, you've got to

200
00:15:11.960 --> 00:15:16.279
tell us about Honeycomb. Well,
sure, you know I'm not I'm not

201
00:15:16.320 --> 00:15:20.200
really great at pitching, but I
will say that, like, you know,

202
00:15:20.960 --> 00:15:24.480
the idea of how observability should happen
is how we built our service,

203
00:15:24.679 --> 00:15:30.039
you know, down to like the
data store, like because like, well

204
00:15:30.039 --> 00:15:31.919
back... how many of you, any of
you, ever built apps on Parse? The

205
00:15:33.000 --> 00:15:37.919
mobile backend as a service?
No? Wow. I loved it so much.

206
00:15:39.879 --> 00:15:45.039
It was it was like Firebase but
better and earlier. Any Facebook people

207
00:15:45.080 --> 00:15:48.960
here? Okay, cool. I
will have, I will have a grudge against

208
00:15:48.960 --> 00:15:54.120
Mark Zuckerberg forever for what he
did to Parse. Okay, they shut

209
00:15:54.120 --> 00:15:58.960
it down. It's like we got
acquired. Anytime you're going to get acquired,

210
00:16:00.120 --> 00:16:04.679
make sure that you have an
executive-level sponsor who believes in you.

211
00:16:04.960 --> 00:16:11.320
Right? We did not. So we got shut
down even though we were still growing

212
00:16:11.320 --> 00:16:15.639
like gangbusters. Anyway, Parse had
had one hundred million apps by the time

213
00:16:15.799 --> 00:16:19.840
I left, and we had built
our service originally on Ruby on Rails,

214
00:16:21.000 --> 00:16:23.720
which was not a terrible decision because
most startups fail. It's usually not because

215
00:16:23.759 --> 00:16:27.879
it's and Rails had the strength of
you can move fast, you can move

216
00:16:29.720 --> 00:16:33.960
It's just a we believed everything.
Yeah, ok, say whatever you want.

217
00:16:33.039 --> 00:16:40.960
Doesn't that Ruby on rails. The
downside is it's got... it doesn't have

218
00:16:41.000 --> 00:16:44.519
threads, right? Fixed pool of workers, right? And so that was fine

219
00:16:44.639 --> 00:16:47.399
when we had one hundred thousand apps,
but we got bigger and bigger,

220
00:16:47.399 --> 00:16:51.320
and instead of having one database and
back end, we now had thirty forty

221
00:16:51.399 --> 00:16:55.159
fifty. And when you've got that
many, something's slow at any given time,

222
00:16:55.519 --> 00:17:00.120
which means that the fixed pool is
going... it's filling up constantly with threads that

223
00:17:00.159 --> 00:17:03.599
are waiting on that one back end
service. God. Oh, and, like, as a

224
00:17:03.640 --> 00:17:10.559
reliability engineer, this was personally humiliating. We're going down every day. Just, like,

225
00:17:10.839 --> 00:17:15.400
an app hits the top ten in iTunes,
down goes Parse, again and again.

226
00:17:17.000 --> 00:17:21.920
And I tried everything to try and
figure this out. And what finally helped

227
00:17:21.960 --> 00:17:25.319
us was number one, we did
a rewrite to Golang. We actually

228
00:17:25.559 --> 00:17:30.599
considered, uh, using .NET and it
got outvoted. And I learned later

229
00:17:30.720 --> 00:17:34.160
that the blog post that I wrote
about why it got outvoted made

230
00:17:34.200 --> 00:17:38.039
a lot of people at Microsoft very
angry and changed a lot of their decisions,

231
00:17:38.079 --> 00:17:42.480
which is great. Yeah, inspiring
anger is really like, but I

232
00:17:42.480 --> 00:17:45.759
mean, I can I made a
career. I have a tough time

233
00:17:45.119 --> 00:17:48.000
disagreeing with you on picking Golang,
too. When you think about a back

234
00:17:48.079 --> 00:17:52.920
end service at velocity like that,
language is very well suited for that.

235
00:17:53.119 --> 00:17:56.240
It was great, but it was
half of the half of the answer because

236
00:17:56.279 --> 00:18:00.640
we also had to understand what was
going on. Not just write the code; understand it.

237
00:18:00.880 --> 00:18:06.519
This is where observability came into
play because Facebook also had this service called

238
00:18:06.519 --> 00:18:11.160
Scuba, which was, don't get
me wrong, butt-ugly, aggressively hostile to

239
00:18:11.279 --> 00:18:15.039
users, but it did one thing
really well, which is let you

240
00:18:15.079 --> 00:18:21.839
slice and dice in near real time
on dimensions of high cardinality with wide events.

241
00:18:21.920 --> 00:18:25.759
Right. Cardinality, for those who don't
know, is the number of unique

242
00:18:25.759 --> 00:18:27.440
items in the set. So if
you've got a collection of one hundred million

243
00:18:27.519 --> 00:18:33.680
users, any unique ID, like social
security number for the US folks, would be

244
00:18:33.079 --> 00:18:37.880
the highest possible cardinality. Something like
species equals human would be the lowest, because

245
00:18:38.079 --> 00:18:42.079
there's only one, right? First name, last name: high
cardinality, but there are some dupes, so

246
00:18:42.119 --> 00:18:47.480
it's not as high as the social security
number. So everything around metrics is oriented

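NOTE Editor's sketch, not from the recording: the cardinality definition above (number of unique values in a set) expressed as a tiny Python example. The cardinality helper and the sample records are hypothetical and only illustrate the user ID vs. first name vs. species comparison.
```python
def cardinality(events: list[dict], field: str) -> int:
    """Number of distinct values seen for one field."""
    return len({e[field] for e in events})
events = [
    {"user_id": "u1", "first_name": "Ana", "species": "human"},
    {"user_id": "u2", "first_name": "Ana", "species": "human"},
    {"user_id": "u3", "first_name": "Rui", "species": "human"},
]
counts = {f: cardinality(events, f) for f in ("user_id", "first_name", "species")}
print(counts)
```
A unique ID has as many values as there are records, a name has some duplicates, and a constant field has exactly one value.
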
247
00:18:47.519 --> 00:18:52.920
around low cardinality dimensions, but everything
you want to use for debugging requires high

248
00:18:52.920 --> 00:18:59.039
cardinality. I see some people over here have run
into this with Influx before. So Scuba,

249
00:18:59.160 --> 00:19:02.519
let us slice and dice on these
high-cardinality dimensions, and instead of having to,

250
00:19:02.599 --> 00:19:06.680
like, you know... and
be like, either I, either I already built

251
00:19:06.680 --> 00:19:10.839
a dashboard for it or it's going
to be hours to like just dive through

252
00:19:10.880 --> 00:19:14.759
the logs and figure it out and
everything. It's like, instead it would

253
00:19:14.799 --> 00:19:18.119
be like, okay, we're getting
a spike in errors, let's break down

254
00:19:18.160 --> 00:19:21.839
by app ID, one in ten million app IDs. Break down by that. Okay,

255
00:19:21.960 --> 00:19:26.319
now break down by reads versus writes.
Now break down by normalized database query.

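NOTE Editor's sketch, not from the recording: the step-by-step "break down by" narrowing described above, written as a small Python example. The break_down helper, the field names, and the sample error records are hypothetical; Scuba and Honeycomb do this interactively over a columnar store rather than with in-memory lists.
```python
from collections import Counter
def break_down(events, field):
    """Count events per distinct value of one field, largest group first."""
    return Counter(e[field] for e in events).most_common()
errors = [
    {"app_id": "a7", "op": "write", "query": "UPDATE t SET x=?"},
    {"app_id": "a7", "op": "write", "query": "UPDATE t SET x=?"},
    {"app_id": "a7", "op": "read", "query": "SELECT * FROM t"},
    {"app_id": "b2", "op": "read", "query": "SELECT * FROM t"},
]
# Step 1: which app accounts for most of the errors?
top_app = break_down(errors, "app_id")[0][0]
narrowed = [e for e in errors if e["app_id"] == top_app]
# Step 2: within that app, reads or writes?
top_op = break_down(narrowed, "op")[0][0]
narrowed = [e for e in narrowed if e["op"] == top_op]
# Step 3: within that, which normalized query?
top_query = break_down(narrowed, "query")[0][0]
print(top_app, top_op, top_query)
```
Each step uses the previous answer to choose the next question, which is the breadcrumb-following style of debugging the speaker describes.
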
256
00:19:26.359 --> 00:19:36.559
Now are you making cube gestures or
gestures? Column after column.

257
00:19:36.599 --> 00:19:40.200
It was just like, step by
step it would take me to It's like

258
00:19:40.240 --> 00:19:45.119
it isn't even engineering anymore. It's
like support. Right, These problems went

259
00:19:45.119 --> 00:19:48.640
from being like intractable, like it
would be I'm doing mean, from like

260
00:19:48.680 --> 00:19:51.200
it would take us a day to
figure out and then it would never happen

261
00:19:51.240 --> 00:19:56.039
again, to just being like,
you know, thirty seconds, like every

262
00:19:56.119 --> 00:20:00.279
single time. And that was what
like when I was leaving Facebook. You

263
00:20:00.319 --> 00:20:02.960
know, I've never been one of
those kids who's like, I want to

264
00:20:03.000 --> 00:20:07.759
start a company because I kind of
hate those people. But when I was

265
00:20:07.799 --> 00:20:11.960
thinking about having to live without this
tooling, I was like, I can't.

266
00:20:11.039 --> 00:20:14.880
I actually can't conceive of it,
like, it's become so core to how

267
00:20:14.920 --> 00:20:18.680
I how I perceive the world as
an engineer, Like I just can't imagine

268
00:20:18.680 --> 00:20:22.599
going back. And so that's why
I made Honeycomb. So how did

269
00:20:22.640 --> 00:20:29.000
the rest of the people in the
organization react to this new culture of

270
00:20:29.000 --> 00:20:36.240
observability and spans? There's a learning
curve, right, We've all spent our

271
00:20:36.279 --> 00:20:40.160
careers fitting our brain into asking questions
in the metrics and dashboards type of way.

272
00:20:40.680 --> 00:20:45.440
But like you know how every job
I've ever had, the person who's

273
00:20:45.440 --> 00:20:48.640
best at debugging is always the
person who's been there the longest. Always.

274
00:20:48.920 --> 00:20:53.559
That's no longer true when we have
different tools because instead of relying so

275
00:20:53.680 --> 00:20:57.720
much on what's in your head to
reason about the system, it's right in front

276
00:20:57.759 --> 00:21:02.119
of you and you're just asking questions
and it's and it's more like the more

277
00:21:02.240 --> 00:21:06.400
curious you are, the more debugging
you do, the better you get.

278
00:21:06.519 --> 00:21:08.160
You don't have to you don't have
to have the whole system in your head.

279
00:21:08.640 --> 00:21:12.400
You can find the answer more quickly
that way, and it's kind of

280
00:21:12.440 --> 00:21:17.359
a beautiful thing. Yeah, process
of discovery too, right, find exceptions.

281
00:21:17.440 --> 00:21:21.200
That's why, like observability is not
just about yes, you're gonna have

282
00:21:21.200 --> 00:21:22.400
to have a columnar store. Yes, you're gonna have to have all these

283
00:21:22.400 --> 00:21:26.599
things in the back end and make
it fast. Because the other thing about

284
00:21:26.680 --> 00:21:29.200
logging tools is like if you want
to ask something interesting, it's like you

285
00:21:29.319 --> 00:21:32.000
enter the tool, you know,
and then you're like, okay,

286
00:21:32.000 --> 00:21:34.079
I'm gonna take thirty minutes and go
out for coffee because it's gonna like it

287
00:21:34.119 --> 00:21:37.519
has to be fast, it has
to be interactive. It has to be

288
00:21:37.599 --> 00:21:40.759
like under a second, because you're, like, you're taking steps and you have to

289
00:21:40.799 --> 00:21:44.680
stay in the zone, right you're
yeah, exactly. It has to be

290
00:21:44.720 --> 00:21:47.759
explorable, it has to be interactive, and it has to let you,

291
00:21:48.039 --> 00:21:51.920
I think most importantly, draw on
the on the brains of the people around

292
00:21:51.960 --> 00:21:55.920
you. So something we built into
Honeycomb is history. You know

293
00:21:55.920 --> 00:22:00.119
how you're debugging and it's like,
oh, I've lost the thread? So

294
00:22:02.160 --> 00:22:03.440
you can just go back, you
scroll back up: that's where I knew I

295
00:22:03.440 --> 00:22:06.880
had it right, and you branch
out and you try something else. But

296
00:22:06.920 --> 00:22:11.400
then also you have access to the
history of everyone on your team. So

297
00:22:11.599 --> 00:22:15.400
if it's like last Thanksgiving, we
had this terrible MySQL outage, you

298
00:22:15.440 --> 00:22:18.440
know, and everything was uh and
Ben and Emily were on call, say,

299
00:22:18.759 --> 00:22:22.160
and then I'm on call in March
and I'm like, this feels a

300
00:22:22.200 --> 00:22:26.759
lot like what was happening last last
November. I'm gonna go and look at

301
00:22:26.759 --> 00:22:30.000
like, well, what were Ben
and Emily doing, and what did they say

302
00:22:30.039 --> 00:22:36.559
helped them find it? What traces...
So, journaling. Yeah. Actions, systems.

303
00:22:36.799 --> 00:22:40.920
History doesn't repeat, but it rhymes.
Right. So much, so much of

304
00:22:40.960 --> 00:22:45.599
the wisdom of your like these are
sociotechnical systems. It's not just production.

305
00:22:47.000 --> 00:22:49.440
Like an example I often use is this: you've got the New York Times

306
00:22:49.440 --> 00:22:53.880
and the Washington Post. They're like
both big newspapers, right, but if

307
00:22:53.920 --> 00:22:59.400
you took their teams and swapped them, you couldn't actually do that because so

308
00:22:59.480 --> 00:23:03.599
much of this this different system lives
in the heads of the people who write

309
00:23:03.640 --> 00:23:07.119
it. So like being able to
draw on that wisdom and use it,

310
00:23:07.200 --> 00:23:11.559
like it makes you a better engineer. Like all of the shit that I

311
00:23:11.640 --> 00:23:15.519
learned about being an engineer was looking
over the shoulder of amazing engineers that

312
00:23:15.559 --> 00:23:18.039
I got. But it sounds like
the journaling approach you're talking about allows us

313
00:23:18.079 --> 00:23:23.759
to look over the best exactly and
and and get to know you don't have

314
00:23:23.799 --> 00:23:29.839
to remember how they did it because
it's recorded there, and so you know

315
00:23:30.039 --> 00:23:33.880
you can even learn their approach and
how they attacked that I feel like,

316
00:23:33.039 --> 00:23:37.160
you know, especially now nowadays,
when we're doing so much distributed working,

317
00:23:37.279 --> 00:23:40.759
like remote working, I worry a
lot about how are we going to bring

318
00:23:40.839 --> 00:23:44.720
up the next generation of engineers,
you know, and I feel I hope

319
00:23:44.720 --> 00:23:47.759
that we're all starting to think about
making this more of our tooling. Just

320
00:23:48.000 --> 00:23:51.039
how can we learn from you know, it's kind of embarrassing, but like

321
00:23:51.240 --> 00:23:53.799
when I was in college, I
learned so much from just going around and

322
00:23:53.960 --> 00:24:02.200
reading the Bash histories of all the
people I knew, from either trying the

323
00:24:02.200 --> 00:24:07.799
commands. You know, it's fucking
fascinating. Right, Oh, that's how

324
00:24:07.839 --> 00:24:11.519
you have to learn, said not
right, what does this do? I

325
00:24:11.519 --> 00:24:12.720
think we need a lot more of
that in our tools. Yeah, and

326
00:24:14.119 --> 00:24:18.200
I worry that we're making it even
harder to make that jump from junior to

327
00:24:18.440 --> 00:24:22.480
an intermediate. I mean, we've
always had a problem with intermediates anyway,

328
00:24:22.839 --> 00:24:27.000
but a lot of the automation tools
that are taking a lot of are eliminating

329
00:24:27.039 --> 00:24:33.039
the beginner stuff. Yeah, like
this whole like the generative AI stuff,

330
00:24:33.759 --> 00:24:37.279
Like it's great for senior engineers.
You're now so much more productive, you

331
00:24:37.319 --> 00:24:40.839
can put so much faster. But
like the way that you get to that

332
00:24:40.880 --> 00:24:48.079
point is, it's scars. How, how
are we going to force ourselves? I

333
00:24:48.119 --> 00:24:52.039
mean, I believe that the solutions
will emerge. I hope it looks pretty

334
00:24:52.039 --> 00:24:56.440
bad. And I also think that
younger generation will find them too, because

335
00:24:56.480 --> 00:24:59.640
they are not you know, we've
done this show where we're talking about is

336
00:24:59.640 --> 00:25:03.839
all this scar tissue actually holding us
back? Right? That we have some

337
00:25:03.880 --> 00:25:07.039
of it? Yeah, some of
it's value, I think. You know,

338
00:25:07.079 --> 00:25:10.599
you have to internalize the damage and
say, like, what does this

339
00:25:10.680 --> 00:25:12.839
really look like, generally speaking? And when someone says I will never

340
00:25:12.960 --> 00:25:17.160
use X product or X technique,
it's like you have not internalized your scar

341
00:25:17.160 --> 00:25:22.000
as well. I love that every
team has, I think, but also

342
00:25:22.039 --> 00:25:26.640
that can they learn to speak to
These are the concerns I have when you

343
00:25:26.680 --> 00:25:30.480
think about the broader approaches to things
that might have created that problem back in

344
00:25:30.519 --> 00:25:33.799
the past. I read this book
called The Trauma of Everyday Life, which

345
00:25:33.839 --> 00:25:38.160
is written by this guy who's a
psychiatrist and a Zen Buddhist, and he's

346
00:25:38.200 --> 00:25:45.559
talking about how trauma isn't necessarily something
to be avoided because it's literally what shapes

347
00:25:45.599 --> 00:25:49.559
you. Think about a bonsai: that's
just a normal tree, but it was

348
00:25:49.599 --> 00:25:53.519
it was put in this very specific
where its roots couldn't grow right, and

349
00:25:53.559 --> 00:25:57.720
so it's not it's like not RECTI
people like trauma is great, but it's

350
00:25:57.759 --> 00:26:02.759
also like, the scar tissue was
just going to be different. And again,

351
00:26:02.759 --> 00:26:03.519
how you react to and how you
work to it, you can make

352
00:26:03.559 --> 00:26:08.640
beautiful things. So what are some
of the other pitfalls that people will encounter

353
00:26:08.839 --> 00:26:15.640
when sort of moving to this observability. The big one is the cognitive just

354
00:26:15.920 --> 00:26:19.480
the model that we have in our
brain. I feel like our industry has

355
00:26:22.240 --> 00:26:23.920
avoided this for a long time.
I feel like there's a bit of a

356
00:26:23.960 --> 00:26:29.039
reckoning. You know, OpenTelemetry. By the way, I've got to

357
00:26:29.039 --> 00:26:33.119
put in a quick plug for OpenTelemetry
here. It's amazing. I know all

358
00:26:33.160 --> 00:26:37.240
of us hate redoing our code,
but like the promise of OpenTelemetry is

359
00:26:37.599 --> 00:26:42.839
you reinstrument your code once and then
vendors have to compete for your dollars based

360
00:26:42.880 --> 00:26:48.720
on being awesome instead of having you
locked in. It is. It's it's

361
00:26:48.759 --> 00:26:55.559
the number two project after Kubernetes in
the, what do you call it... thank you, CNCF.

362
00:26:55.960 --> 00:26:59.519
It's super active. A lot of
contributors. I was pretty skeptical about this,

363
00:26:59.640 --> 00:27:03.119
but it's it's it's the way I
wish we had had this ten years

364
00:27:03.119 --> 00:27:04.880
ago. I think we'd all parties. You know, we could have we

365
00:27:04.920 --> 00:27:10.119
could have just chose to... political problem, not a technical one. Totally. We're there now,

366
00:27:10.400 --> 00:27:12.160
but here we are. These are
the tools that we have.

367
00:27:12.160 --> 00:27:18.039
OpenTelemetry is worth putting on your roadmap
for the next year or two because there's

368
00:27:18.119 --> 00:27:22.680
also this this reckoning that's happening with
costs right now. Most of these vendors

369
00:27:22.759 --> 00:27:27.480
are billing just like ungodly amounts of
dollars that do not correspond to the value

370
00:27:27.480 --> 00:27:30.359
that you get out of them because
they can, because they've got you locked in.

371
00:27:30.680 --> 00:27:33.720
Yeah, yeah, And so I
feel like we need to take our power

372
00:27:33.720 --> 00:27:37.880
well. And it's a great pitch
for a feature that's not necessarily a new

373
00:27:37.920 --> 00:27:41.119
feature to say, hey, I
can reduce our costs by moving us off

374
00:27:41.279 --> 00:27:45.160
this tool and onto OpenTelemetry.
You know, I'm compliance some of these

375
00:27:45.200 --> 00:27:49.319
sidebar rants here, but like I
feel like learning to treat like one artifact

376
00:27:49.400 --> 00:27:52.839
of the zero interest rate, like, period was that engineers forgot how to talk

377
00:27:52.880 --> 00:27:56.839
about our work in terms of dollars, you know, because like dollars are

378
00:27:56.920 --> 00:28:00.720
the universal denominator. Maybe something the
Euros. I don't know, but like

379
00:28:00.880 --> 00:28:06.519
money is the universal denominator, and
if we can't learn to talk about the

380
00:28:06.640 --> 00:28:11.400
value of the shit that we provide
to people in finance people, I feel

381
00:28:11.400 --> 00:28:15.279
like many, many VPs of engineering and
CTOs have this phenomena where they feel like

382
00:28:15.319 --> 00:28:18.160
the junior partner at the table.
They aren't really invited to all the critical

383
00:28:18.200 --> 00:28:21.960
meetings and stuff. And I believe
that that's because we haven't learned to talk

384
00:28:22.000 --> 00:28:26.640
about the value that we bring and
cost in the same language as every other

385
00:28:26.680 --> 00:28:29.359
team. Because if we did,
we generate a lot of value. We

386
00:28:29.599 --> 00:28:34.200
generate all company, We have all
the power we should need to have.

387
00:28:34.599 --> 00:28:37.920
All we got to do is get a
hand on it. And Charity, we need to

388
00:28:37.960 --> 00:28:45.839
break for one moment for this
very important message and we're back. It's

389
00:28:45.880 --> 00:28:48.039
dot NetRocks. I'm Richard Campbell. That's Carl Franklin. Talking to our

390
00:28:48.039 --> 00:28:56.920
friend Charity Majors about observability engineering and
watching the sausage being made. And so

391
00:28:56.079 --> 00:29:00.279
I want to follow up on you. You brought up generative AI and

392
00:29:00.359 --> 00:29:04.960
things for programmers. It's great for
senior programmers who can be more productive with

393
00:29:06.039 --> 00:29:08.799
stuff they might have forgotten how to
write or don't really care to figure out

394
00:29:08.880 --> 00:29:14.720
and just let chat GPT do it
for you. But what do you think

395
00:29:14.880 --> 00:29:18.839
the future of observability is, especially
in light of AI and where it's going.

396
00:29:18.920 --> 00:29:23.839
Do you think that we'll have AI
bots sort of watching our telemetry and

397
00:29:23.920 --> 00:29:30.480
giving us English prompts, you know, sending us text messages. Vendors are

398
00:29:30.519 --> 00:29:33.799
going to sell CTOs and VPs like
tens of billions of dollars worth of bullshit

399
00:29:33.880 --> 00:29:37.200
that says that they can do that. Yeah, something that blew my mind

400
00:29:37.200 --> 00:29:41.599
when I became CTO... So are
you saying we don't need this? We

401
00:29:41.160 --> 00:29:45.319
have everything that we need right there
in front of something that blew my mind

402
00:29:45.319 --> 00:29:51.279
when I became CTO, took me a while to
internalize, is that most executives have more

403
00:29:51.319 --> 00:29:56.720
trust and confidence in their vendor relationships
than their employees, because employees come and go,

404
00:29:56.039 --> 00:30:02.759
but vendors last forever as long as
you keep paying them. In my mind,

405
00:30:02.200 --> 00:30:06.759
but what they're selling when they come
in. This is why my dander

406
00:30:06.839 --> 00:30:10.440
got raised so much by the whole
AIOps thing, because they are all just

407
00:30:10.640 --> 00:30:12.640
like you don't need to understand your
systems. Pay us all this money,

408
00:30:12.799 --> 00:30:17.440
we'll understand it for you, and
but like the false positives are ridiculous and

409
00:30:17.480 --> 00:30:19.920
off the charts, all of the
data is junk. You would be better

410
00:30:21.000 --> 00:30:23.680
off just like turning off all that
data, Like it's just so many problems

411
00:30:23.720 --> 00:30:29.440
with it. I believe that we
should be looking to computers to do what

412
00:30:29.519 --> 00:30:33.440
computers do best, and people to
do what people do best, and computers

413
00:30:33.480 --> 00:30:37.319
crunch numbers, people attach meaning to
things. Sure, like your graphs are

414
00:30:37.359 --> 00:30:42.039
spiking all day long, most of
them you don't care about, because our

415
00:30:42.039 --> 00:30:45.599
computers are now resilient to a whole
lot of failure. Sure, it really

416
00:30:45.680 --> 00:30:52.160
takes a person coming along and going, that
matters. That matters often because it mattered

417
00:30:52.200 --> 00:30:56.000
to another person, and you're the
person who's there connecting those dots. And

418
00:30:56.039 --> 00:31:00.480
once you've decided it matters, you
need to understand why. And I think

419
00:31:00.480 --> 00:31:03.200
there are all kinds of ways for
computers to help us do that. We

420
00:31:03.279 --> 00:31:07.200
do this really cool thing called BubbleUp
at Honeycomb, where any graph that

421
00:31:07.319 --> 00:31:11.160
any heat map that you've constructed,
you draw a little bubble around something you're

422
00:31:11.160 --> 00:31:14.720
like, I care about this,
and then we compute the baseline for all

423
00:31:14.759 --> 00:31:18.920
the hundreds of dimensions and the dimensions
that are inside the thing you care about,

424
00:31:18.079 --> 00:31:21.640
and then we diff them and sort
them, so it's like, Okay,

425
00:31:21.680 --> 00:31:23.759
this thing you care about, here
are the five to ten ways that

426
00:31:23.880 --> 00:31:27.200
it is different from everything that you don't
care about. Computers are great at that,
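
NOTE Editor's sketch, not from the talk: the diff-and-sort idea described around here (compare how common each field value is inside a selection against everything outside it, then rank by the difference) can be illustrated in a few lines of Python. This is not Honeycomb's implementation; the function names, the dict-shaped events, and the choice of "everything outside the selection" as the baseline are assumptions made for illustration.
```python
from collections import Counter
def value_shares(events):
    """Fraction of events that carry each (dimension, value) pair."""
    counts = Counter((k, v) for e in events for k, v in e.items())
    total = len(events) or 1  # avoid dividing by zero on an empty group
    return {pair: n / total for pair, n in counts.items()}
def bubble_diff(events, selected, top=5):
    """Rank (dimension, value) pairs by how differently they occur inside vs. outside the selection."""
    inside = value_shares([e for e in events if selected(e)])
    outside = value_shares([e for e in events if not selected(e)])
    pairs = set(inside) | set(outside)
    diffs = {p: inside.get(p, 0.0) - outside.get(p, 0.0) for p in pairs}
    # Largest absolute difference first: these are the ways the selection stands out.
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]
```
For instance, passing a predicate such as `lambda e: e["duration_ms"] > 1000` ranks the field values that most distinguish slow events from the rest; deciding that slow events are what matters is still left to the person.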

427
00:31:27.720 --> 00:31:30.160
but they can't tell you what to
care about, and they shouldn't try

428
00:31:30.160 --> 00:31:33.880
it, because it's a fucking mess. Maybe ten years from now I will

429
00:31:33.920 --> 00:31:37.039
be eating my words, but for
the foreseeable future, I really think that

430
00:31:37.079 --> 00:31:44.160
we're all best served if we focus
on helping people understand what has meaning and

431
00:31:44.240 --> 00:31:45.599
letting computers take care of the rest. Yeah. I mean I can see

432
00:31:45.640 --> 00:31:49.240
the machine tools helping to point us
to unusual things. Sure, yeah, but

433
00:31:49.440 --> 00:31:53.359
you still have to interpret them.
Yeah. Still you want them to create

434
00:31:53.400 --> 00:31:57.440
that graph for you. You want
them to intelligently sample often, you want

435
00:31:57.480 --> 00:32:00.799
them to do you know, but
you don't don't want them in the business

436
00:32:00.799 --> 00:32:04.160
of telling you what that No,
they don't know, you don't know,

437
00:32:04.359 --> 00:32:08.279
and more and more stately, like
they're not even qualified to make that s

438
00:32:08.319 --> 00:32:13.119
that's been in any way. That
being said, like I can tell you

439
00:32:13.160 --> 00:32:15.119
we're talking a lot of OLAP
terms here, like a lot of data

440
00:32:15.200 --> 00:32:21.640
analytic terms around all this and machine
learning models evolved from a lot of that

441
00:32:21.759 --> 00:32:24.680
technology. So you can see a
shape of this shape of history. You

442
00:32:24.799 --> 00:32:29.799
can't see a shape, but I
don't believe that it is. It is

443
00:32:29.920 --> 00:32:35.000
one that So here's the thing.
At the bottom line, we are forget

444
00:32:35.000 --> 00:32:38.160
technology. We are held legally accountable
as engineers. We are legally and

445
00:32:38.240 --> 00:32:43.400
ethically and morally accountable for the code
we put out into the world. Right,

446
00:32:43.880 --> 00:32:46.279
we can't point at an algorithm when it comes
to that, even if it's a

447
00:32:46.359 --> 00:32:50.519
machine learning I think, I don't
know if you've read a EULA lately,

448
00:32:51.359 --> 00:32:53.160
but boy, oh boy, they
work really hard to make sure we're not

449
00:32:53.279 --> 00:33:01.160
legally accountable for any I believe in
the near infinite possibility employers. That's that's

450
00:33:01.200 --> 00:33:07.039
true. I want to make sure
that I understand. I also, I

451
00:33:07.079 --> 00:33:09.559
mean, I like that we're also
going to moral and ethical aspect because I

452
00:33:09.559 --> 00:33:14.559
think we need. I think that
legal aspect's holding us back, that we

453
00:33:14.599 --> 00:33:17.920
can't own the value of what we
make, that we can't own the value

454
00:33:17.960 --> 00:33:23.559
of what we make as long as we're
abdicating our responsibilities. And then really,

455
00:33:23.599 --> 00:33:28.480
you know, the EULA was invented
to allow us to not hold liability for

456
00:33:28.519 --> 00:33:34.039
the impact of our software, and
so we're kind of in a trap right

457
00:33:34.079 --> 00:33:37.960
as an industry. If we were
responsible for the damage we did, we

458
00:33:38.000 --> 00:33:43.480
would we would our employers would insist
on higher standards because they're getting caught up

459
00:33:43.480 --> 00:33:47.759
in that as well. But because
we've avoided the responsibility so thoroughly, I

460
00:33:47.799 --> 00:33:52.400
see what you're saying. That being
said like this is now we get into

461
00:33:52.400 --> 00:33:54.880
a pretty deep philosophical side of this
thing, like let's face it, good

462
00:33:54.920 --> 00:33:58.640
telemetry. In the end, we're
trying to understand why is the software behaving

463
00:33:58.640 --> 00:34:01.759
the way it's behaving? Why are our
customers unhappy? I mean, those are

464
00:34:01.759 --> 00:34:06.519
the things that actually matter. I
think. The more often that we as

465
00:34:06.599 --> 00:34:09.639
technologists speak in the term of the
customers, I think, why are our

466
00:34:09.639 --> 00:34:13.719
customers unhappy? You know? And
this is something I've been really grappling with

467
00:34:13.840 --> 00:34:15.039
lately. I don't know if I'm
alone in this or not, but I

468
00:34:15.079 --> 00:34:22.719
have like an almost knee jerk,
almost disgusted or like reaction towards like customer

469
00:34:22.920 --> 00:34:27.840
and value and things. And I've
been trying to because we've been battered with

470
00:34:27.920 --> 00:34:31.440
it, because we've been battered beaten
up those words. Yeah, I don't

471
00:34:31.480 --> 00:34:36.320
know, just the business aspect,
like I think there's some vestige of me.

472
00:34:36.400 --> 00:34:38.440
That's still like, ew, we're better
than that. And I hate myself

473
00:34:38.440 --> 00:34:42.679
as I'm saying that. You know, and you're also open with the dollars

474
00:34:42.719 --> 00:34:45.639
matter. They do, they kind of
come from the customers. Oh, you

475
00:34:45.639 --> 00:34:47.880
should have seen me ten years ago, because this is a chill version.

476
00:34:47.960 --> 00:34:52.440
I get that. Okay, No, but you're you're absolutely right. We

477
00:34:52.519 --> 00:34:54.239
do this for the customer. We
do this for our users. That's the

478
00:34:54.320 --> 00:34:58.840
reason we exist, and we have
a responsibility to them. Sure, And

479
00:34:58.880 --> 00:35:00.840
I don't think I'll ever feel comfortable
saying, well, the machine told me

480
00:35:00.920 --> 00:35:05.239
it was fine. That's a cop
out every time. Because the machine didn't

481
00:35:05.280 --> 00:35:08.079
tell you anything, you interpreted it
and chose to vocal. You know,

482
00:35:08.159 --> 00:35:14.559
in the end, everything we've talked
about program it's fine, getting very philosophical,

483
00:35:14.679 --> 00:35:17.159
but also none of this has described
an action we should take. All

484
00:35:17.159 --> 00:35:21.440
we're doing is observe what's going on. We still have to decide on the

485
00:35:21.480 --> 00:35:24.079
action. How would you change the
code, given you've seen this in telemetry?

486
00:35:24.199 --> 00:35:28.000
and you know what else? Like
I feel like this loops back really nicely

487
00:35:28.039 --> 00:35:31.679
into just like what is what is
the meaningful life? Right? Like because

488
00:35:31.760 --> 00:35:36.639
like that book that what's his face
wrote about about work and what makes us

489
00:35:36.679 --> 00:35:40.800
happy is it's not like having twenty
hours a day whatever, but it's like

490
00:35:42.199 --> 00:35:46.880
autonomy, mastery and meaning purpose.
Yeah, Daniel Pink. Daniel Pink,

491
00:35:46.960 --> 00:35:51.960
thank you, and like the meaning
the purpose that comes into play for us

492
00:35:52.000 --> 00:35:53.920
when it impacts other people. Well, and you hit on the key thing,

493
00:35:53.960 --> 00:35:58.800
which is when we crack this,
not like every time you chase a

494
00:35:58.840 --> 00:36:01.480
problem down like that and it turns into
a code change you can make, that

495
00:36:01.639 --> 00:36:07.079
then in later testing shows that problem doesn't
occur. Boy, that's a good day.

496
00:36:07.280 --> 00:36:13.000
Like you talk about purpose, there
is nothing better than figuring that complicated

497
00:36:13.000 --> 00:36:16.880
problem out and then literally, like
you, you live in a very hypothesis

498
00:36:16.960 --> 00:36:21.119
based world. It's like, well, I've seen this telemetry, I've seen

499
00:36:21.119 --> 00:36:24.119
this output. I believe it's this
code problem. Now I'm going to make

500
00:36:24.119 --> 00:36:28.559
a modification. I'm going to put
it into the stream and I'm going to

501
00:36:28.639 --> 00:36:31.280
go back and test again. And
if I don't see it, then I

502
00:36:31.400 --> 00:36:35.679
can, you know, hypothesize really
because I might be wrong. We may

503
00:36:35.719 --> 00:36:39.079
not have recreated the conditions perfectly.
That we're on it, that we're pushing

504
00:36:39.079 --> 00:36:43.320
the right thing, and nobody knows
just how deep that went. No,

505
00:36:43.880 --> 00:36:46.360
I also wonder, you know how
many times have you been fighting a problem

506
00:36:46.440 --> 00:36:51.159
like that and you start changing code just to see if you can change behavior

507
00:36:51.679 --> 00:36:55.440
at all? Like, am I
even in the system? They have emergent properties,

508
00:36:55.440 --> 00:37:00.199
They're no longer like I feel like
part of moving from like the old version

509
00:37:00.199 --> 00:37:05.880
to the new is accepting that TDD
is not enough. Interesting. Like, the tests tell

510
00:37:05.960 --> 00:37:10.039
you will this logically execute, but
that reality ends at the border of your

511
00:37:10.079 --> 00:37:14.320
laptop. Yes, and the universe
is weirder. The weird intera is so

512
00:37:14.440 --> 00:37:16.920
much weirder than that. I feel
like our jobs are not done. It's

513
00:37:16.920 --> 00:37:22.719
like until we've instrumented that code,
deployed it and watched it in production and

514
00:37:22.800 --> 00:37:25.400
asked ourselves, is it doing what
I expected it to do? And does anything

515
00:37:25.400 --> 00:37:30.119
else look weird? I know that
on the show before I was you know,

516
00:37:30.199 --> 00:37:31.719
I've done a lot of load testing. It's like I have never invented

517
00:37:31.719 --> 00:37:37.159
a load test as weird as customers
on Saturday actually are, doesn't even come close.

518
00:37:37.480 --> 00:37:44.480
So customers are evil, they do things you
can't. They really opened six windows and

519
00:37:44.519 --> 00:37:47.400
hit refresh all at the same time. Did he really really? Okay?

520
00:37:49.360 --> 00:37:52.679
May I seek some practical advice on
behalf of the listeners. So let's say

521
00:37:52.679 --> 00:38:00.559
you're listening, you've been surfing, you went to Honeycomb dot io and

522
00:38:00.639 --> 00:38:02.400
you checked it out, and you're
thinking this might be good. How do

523
00:38:02.440 --> 00:38:06.000
you go back to here? How
do these people in the audience go back

524
00:38:06.039 --> 00:38:10.320
to their teams and introduce this concept
without getting flogged? You know? How

525
00:38:10.519 --> 00:38:14.880
how do you approach that? I
mean, that's a great question. My

526
00:38:14.960 --> 00:38:20.199
approach is always to look for something
that's really painful, like, you know,

527
00:38:20.599 --> 00:38:23.199
things that are going down, you
don't understand, problem, problem,

528
00:38:23.239 --> 00:38:29.159
you can't crack and especially this the
siloed approach to telemetry that we're doing anyhow,

529
00:38:29.239 --> 00:38:30.840
things that are waking people up in
the middle of the night. You know,

530
00:38:31.000 --> 00:38:34.679
we've seen this a lot where you
know, people have tried to bring

531
00:38:34.760 --> 00:38:37.440
it in whatever, but then there's
an intractable problem and they put Honeycomb

532
00:38:37.480 --> 00:38:42.559
on it, and it's just like
like we've even had multiple times we've had

533
00:38:42.559 --> 00:38:47.360
our sales engineers doing demos on people's
production systems and you're about to have an

534
00:38:47.360 --> 00:38:51.719
outage here because this thing's happened,
and they're like what in the like ten

535
00:38:51.719 --> 00:38:55.440
minutes later they get paged because it
is that Like I know, I'm a

536
00:38:55.440 --> 00:38:59.400
founder, believe nothing I say,
but it is that much easier when you have

537
00:38:59.480 --> 00:39:01.440
the right tool, when you have
the right visibility, just to be able

538
00:39:01.440 --> 00:39:05.079
to see what's going on? Yeah, looking at something like that, or

539
00:39:05.079 --> 00:39:09.280
somewhat counterintuitively the other side, another
place we've seen a lot of success is

540
00:39:09.280 --> 00:39:14.559
people instrumenting their CI/CD pipelines, right, because if you instrument your CI/CD pipeline

541
00:39:14.559 --> 00:39:17.199
as a trace, you can see
where all that time is going. Yeah.

542
00:39:17.480 --> 00:39:22.320
Yeah, that's kind of another approach
to this, the model of what

543
00:39:22.480 --> 00:39:27.119
is the hard work here? What's
actually hurting us? The struggle is only

544
00:39:27.119 --> 00:39:30.760
getting in the front door. We
have like zero churn: if the company didn't

545
00:39:30.760 --> 00:39:34.480
go out of business, they keep buying
us. But it's difficult to get in

546
00:39:34.519 --> 00:39:37.280
the front door. But once we
get inside, like no, but I

547
00:39:37.280 --> 00:39:39.679
think you've made the most compelling argument, and that's going to be tough for

548
00:39:39.719 --> 00:39:43.400
anyone in the room who's thinking about
this. It's like you have to go

549
00:39:43.440 --> 00:39:45.960
pick the largest dragon in the room
and say, I think I could take

550
00:39:46.000 --> 00:39:50.760
that one on if I had this
lance. If I can get this lance

551
00:39:50.800 --> 00:39:52.360
and give it a go, I'll go
for the big guy. Yeah, and

552
00:39:52.400 --> 00:39:55.239
that's the kind of bet you need
to make. But you know the underlying

553
00:39:55.280 --> 00:39:58.519
part of this, because a lot
of the software is already set up for

554
00:39:58.519 --> 00:40:00.960
the right telemetry, but it's
the custom stuff we're building that is not.

555
00:40:00.960 --> 00:40:05.559
is how you provide visibility into
that yep, orienting it around. You

556
00:40:05.559 --> 00:40:07.199
know a lot of people also come
to us when they start their OpenTelemetry journey

557
00:40:07.199 --> 00:40:12.440
because we have many of the world's
best experts in OTel, so we could

558
00:40:12.440 --> 00:40:16.280
actually help consult. What do I
need to push onto the OpenTelemetry

559
00:40:16.320 --> 00:40:20.320
stack. That's going to help me, that's going to let these tools understand.

560
00:40:20.320 --> 00:40:23.920
Do any of you in the
room have a question for Charity?

561
00:40:24.119 --> 00:40:28.280
Raise your hand, It's all right
right here. Phil Haack has a question,

562
00:40:28.719 --> 00:40:31.280
So why don't you repeat the question? The question is that's a lot

563
00:40:31.320 --> 00:40:35.880
of data, and how does that cost... how does the cost work out?

564
00:40:36.119 --> 00:40:39.119
How does the cost work out? Again, this is why... So on Twitter I

565
00:40:39.159 --> 00:40:42.320
was joking the other week and it
kind of got out of control and I

566
00:40:42.320 --> 00:40:45.039
said, never write a database. No, really, never write a database.

567
00:40:45.320 --> 00:40:49.239
And I thought it was a very
fun self-own, because we wrote a database.

568
00:40:49.480 --> 00:40:52.639
People didn't understand that. So yeah, it's a ton of data.

569
00:40:52.719 --> 00:40:57.920
Like we've got like seven hundred customers
and we run the combined production loads of

570
00:40:57.960 --> 00:41:00.000
all of them. It's like two
billion events per second or something like that,

571
00:41:01.719 --> 00:41:07.480
and we give everyone sixty days of
storage basically for free. And the way

572
00:41:07.480 --> 00:41:15.000
that we do this we so it's
a columnar store. So indexes are verboten

573
00:41:15.119 --> 00:41:17.760
for observability because indexes are a way of
picking. I want this to run fast

574
00:41:17.760 --> 00:41:21.000
and nothing else to run fast.
You want to be able to query on

575
00:41:21.079 --> 00:41:24.360
any of these dimensions. So it's
a columnar store. And you're right,

576
00:41:24.480 --> 00:41:28.480
like two years in we ran into
this. We're never going to be profitable

577
00:41:28.480 --> 00:41:31.840
because there's so all of these SSDs, all this ram. And that's when

578
00:41:32.119 --> 00:41:37.760
one of my I brilliant engineer I've
been working he was my first manager.

579
00:41:37.039 --> 00:41:40.639
Name is Ian. He's he's nowhere
on the internet and he's amazing. Uh.

580
00:41:40.800 --> 00:41:45.760
He started looking into the cost models
and did some tests and so now

581
00:41:45.840 --> 00:41:51.280
actually, data comes in, hits
the API, gets dropped into Kafka and then

582
00:41:51.320 --> 00:41:54.840
gets read off onto you know a
pair of nodes, which are as you

583
00:41:54.840 --> 00:41:59.840
would think, like, lots of CPU, lots
of RAM, but then after like thirty

584
00:42:00.000 --> 00:42:06.320
six minutes it gets tailed out to
S three. The queery planner actually runs

585
00:42:06.360 --> 00:42:12.400
the LANDA jobs. Uh so the
query planner comes in forks out spans and

586
00:42:12.400 --> 00:42:15.719
and like we thought it was going
to be so much slower, like doing processing,

587
00:42:15.800 --> 00:42:20.880
you know, from all these S3 buckets. It wasn't. It was different

588
00:42:20.920 --> 00:42:24.480
performance characteristics, but most queries still
return in under a second and S3

589
00:42:24.559 --> 00:42:29.320
is cheap, so that's what most
of the data is, and the lambda

590
00:42:29.400 --> 00:42:31.239
jobs are pretty expensive. That's a
big line item on our bills.

591
00:42:31.239 --> 00:42:35.920
So we've done we've actually done some
really great talks and written some great

592
00:42:35.920 --> 00:42:40.320
pieces about how we use Honeycomb to
optimize our Lambda jobs so that the planner,

593
00:42:40.960 --> 00:42:45.360
yeah, it's all. Yeah.
Our Honeycomb blog, by the way,

594
00:42:45.480 --> 00:42:49.639
is dope. Like we we don't
do a lot of selling there.

595
00:42:49.679 --> 00:42:52.519
We just talk about a lot of
engineering and it's pretty great. There was

596
00:42:52.559 --> 00:42:58.320
another question back here. First somebody
back there had their hand up. Okay,

597
00:42:58.320 --> 00:43:00.920
it wasn't you, but go ahead, son, repeat the question.

598
00:43:00.960 --> 00:43:07.159
I'm sorry. So the question I
think was something there's something about SLOs and

599
00:43:07.559 --> 00:43:12.920
metrics and it's too expensive to store
all the traces for all events, okay,

600
00:43:14.280 --> 00:43:17.079
And there's a few different answers for
this. I would probably want to

601
00:43:17.079 --> 00:43:21.880
ask you some more questions. It's
feel free to find me afterwards. But

602
00:43:21.960 --> 00:43:24.400
you're absolutely right. It can be
absolutely cost prohibitive to store the trace for

603
00:43:24.480 --> 00:43:28.039
every request if you have a lot of
traffic, because if you think about it,

604
00:43:28.360 --> 00:43:34.400
you might find yourself storing five to
thirty times as much telemetry data as

605
00:43:34.559 --> 00:43:38.960
production traffic. Obviously that's not tenable, right. The first solution that we

606
00:43:39.199 --> 00:43:45.840
usually steer people towards is intelligent sampling, which does not mean just like dumb

607
00:43:45.920 --> 00:43:47.400
dumb sampling, where like one out
of every ten you drop. It

608
00:43:47.440 --> 00:43:52.000
means, like, we have a thing called
Refinery, and there's a difference between head

609
00:43:52.039 --> 00:43:54.960
sampling and tail sampling, meaning sampling
before you know what is coming and

610
00:43:55.079 --> 00:43:59.079
after you know what's coming.
So some of these things you sample after

611
00:43:59.119 --> 00:44:01.159
you know what's coming. Be sure
and grab all of the slow events,

612
00:44:01.639 --> 00:44:05.599
right. Some of it is head
sampling where you're just like, okay,

613
00:44:06.960 --> 00:44:10.119
for example, requests that are health
checks, that are 200s. This

614
00:44:10.239 --> 00:44:13.800
is junk. I don't need to
store all of these. That's going to be like

615
00:44:13.800 --> 00:44:16.800
a quarter of your traffic sometimes,
so, like, sample them heavily. 200

616
00:44:16.840 --> 00:44:22.639
OKs to the main page, sample
them medium. Keep every request that's in

617
00:44:22.840 --> 00:44:25.519
error, or every request that is
to /payments or to /billing, or, you

618
00:44:25.559 --> 00:44:30.880
know, like, there's a lot
that is trash that you can

619
00:44:30.159 --> 00:44:35.519
discard if you kind of go in
there with a fine-tooth comb. The

620
00:44:35.559 --> 00:44:44.480
part that was about SLOs.
We derive SLOs from events, and it's

621
00:44:44.519 --> 00:44:50.800
actually really important that they're not derived
from metrics. We're actually the only product

622
00:44:50.800 --> 00:44:54.400
out there that does SLOs the way
they're supposed to be done, per the Google

623
00:44:54.480 --> 00:44:59.360
SRE book, because other companies don't
actually capture their data in a way that

624
00:44:59.440 --> 00:45:04.159
lets them do that. It's actually
pretty dope. Like you have your SLOs,

625
00:45:04.199 --> 00:45:07.000
it tells you how quickly you're
burning down the budget, and then

626
00:45:07.320 --> 00:45:12.559
it tells you what is different
about the requests that are erroring, that are

627
00:45:12.559 --> 00:45:16.880
burning down the budget, compared to the others, blah
blah blah. We also have metrics,

628
00:45:17.239 --> 00:45:22.599
but I'm not sure if you're talking
about our metrics product or the events.

629
00:45:22.800 --> 00:45:25.039
Like, the number one answer to the
events being too expensive is you use

630
00:45:25.920 --> 00:45:30.039
smart sampling. And the number one
answer to the SLOs is you want those

631
00:45:30.280 --> 00:45:34.239
badly. You can absolutely do sampling. So one of the

632
00:45:34.280 --> 00:45:37.239
things: in every event that is
sent to us, there is a sample

633
00:45:37.320 --> 00:45:40.840
rate embedded in it, so every one
will say, like, 1/5, and that

634
00:45:40.960 --> 00:45:45.920
means compute this to be five like
this, so the numbers all work

635
00:45:45.960 --> 00:45:50.880
out to look like they weren't sampled. We had questions. You got to

636
00:45:50.880 --> 00:45:52.960
move on. Question right in the front. Hang on, can you repeat that with

637
00:45:53.000 --> 00:45:58.000
the mic? The question was, obviously, you can go wrong with logging because

638
00:45:58.079 --> 00:46:00.800
you can get way too many log
events. You can go wrong with metrics

639
00:46:00.840 --> 00:46:04.760
because you can have, famously,
like, the thirty-thousand-dollar metric that had

640
00:46:04.840 --> 00:46:08.079
high cardinality in it and, oops, your
budget. And so the question is, I

641
00:46:08.079 --> 00:46:14.760
think, how can you go wrong
with observability, as distinct from those, you know?

642
00:46:14.840 --> 00:46:16.800
And I want to say, even
if you aren't doing traces,

643
00:46:17.920 --> 00:46:22.280
the most important thing to take away
when it comes to telemetry is the magic

644
00:46:22.360 --> 00:46:28.480
of the one wide structured event per
request per service. I actually found out

645
00:46:28.639 --> 00:46:31.159
years into this that this is how
Amazon has done their telemetry all along.

646
00:46:31.480 --> 00:46:36.920
They had, like, a flat file at
the root domain of every node where they

647
00:46:37.000 --> 00:46:40.880
keep one of these, like, wide events. It's
magic. It makes everything... because a

648
00:46:40.880 --> 00:46:45.840
lot of the logs that you are
encountering, or because like when a request

649
00:46:45.880 --> 00:46:51.159
is executing through a service,
it's just like, oh, all these

650
00:46:51.159 --> 00:46:54.480
strings, right? But if you
just like collapse them into one wide event

651
00:46:54.760 --> 00:47:00.199
with all of those keys and values, then you have that context, right

652
00:47:00.199 --> 00:47:04.719
It's magic. So
the number one thing that I think people

653
00:47:04.800 --> 00:47:07.960
get wrong with observability is not understanding
that that's the heart of everything. It

654
00:47:07.960 --> 00:47:12.480
isn't which tool you're using. It
isn't whether you're tracing or not. It's

655
00:47:12.559 --> 00:47:16.199
that. That is the number
one thing that everyone should be caring about.

656
00:47:16.719 --> 00:47:22.559
The number two thing, I think, comes
out when dealing with spans, a slightly

657
00:47:22.679 --> 00:47:25.960
higher-order problem, and that's because
I feel like as an industry we

658
00:47:27.079 --> 00:47:31.400
don't really have a set of, like,
good conventions.
really have a set of like good conventions.

659
00:47:31.440 --> 00:47:34.760
You were asking me, like,
when should you have this span?

660
00:47:35.679 --> 00:47:37.880
Man, like, I hope five
years from now everybody's like, well,

661
00:47:37.880 --> 00:47:40.079
obviously you should have this span blah
blah blah. But we aren't there yet,

662
00:47:40.239 --> 00:47:45.800
right, and so it's really easy
to either generate too many spans and

663
00:47:45.840 --> 00:47:47.599
they get lost in the noise kind
of like with logs, or too few

664
00:47:47.639 --> 00:47:52.199
spans and then not have the detail
that you need when you need it. The

665
00:47:52.280 --> 00:47:57.639
question is where to start with
OpenTelemetry. And there are only two good

666
00:47:57.679 --> 00:48:02.920
answers. One is my favorite:
start with the biggest pain. And if

667
00:48:02.960 --> 00:48:07.719
you have a really resistant culture,

668
00:48:08.079 --> 00:48:12.719
then start with the least pain.
But I actually think that the best

669
00:48:12.719 --> 00:48:15.239
way to roll anything that has to
do with telemetry out is to kind

670
00:48:15.280 --> 00:48:20.360
of think of your attention like a
headlamp, and if you're on call for

671
00:48:20.360 --> 00:48:24.480
something that's breaking, have an
instrument-first mentality, like, you instrument to figure

672
00:48:24.480 --> 00:48:29.400
out what's wrong. Don't futz
around; you instrument, have it tell

673
00:48:29.400 --> 00:48:31.280
you the answer, and then it's
there for the next time. You get paged

674
00:48:31.280 --> 00:48:36.000
again, instrument to find the problem, and it's there. And as your

675
00:48:36.199 --> 00:48:38.639
headlamp kind of moves around the
stack, you know, within a couple

676
00:48:38.679 --> 00:48:42.840
of months most of the stuff that
really matters will be instrumented, and then

677
00:48:42.840 --> 00:48:45.519
you can put it on the backlog
to do the rest and finish up and

678
00:48:45.559 --> 00:48:47.719
get rid of your ovenders. All
right, Well, I think that's it,

679
00:48:47.800 --> 00:48:53.559
so let's give Charity Majors a big
round of applause. I will see

680
00:48:53.599 --> 00:49:21.480
you next time. .NET Rocks
is brought to you by Franklins.Net

681
00:49:21.519 --> 00:49:25.480
and produced by PWOP Studios,
a full service audio, video and post

682
00:49:25.480 --> 00:49:30.239
production facility located physically in New London, Connecticut, and of course in the

683
00:49:30.280 --> 00:49:37.159
cloud online at pwop dot com.
Visit our website at D O T N

684
00:49:37.199 --> 00:49:40.760
E T R O C K S
dot com for RSS feeds, downloads,

685
00:49:40.920 --> 00:49:45.159
mobile apps, comments, and access
to the full archives going back to show

686
00:49:45.239 --> 00:49:50.519
number one, recorded in September two
thousand and two. And make sure you

687
00:49:50.599 --> 00:49:53.159
check out our sponsors. They keep
us in business. Now, go write

688
00:49:53.159 --> 00:50:01.559
some code, See you next time
you got jacks. See a summer time

689
00:50:01.760 --> 00:50:07.960
on that means home. Then my
texes in my credit b

