WEBVTT

1
00:00:14.919 --> 00:00:21.120
What's going on, y'all, welcome to another
episode of Adventures in DevOps. I'm

2
00:00:21.160 --> 00:00:25.960
your host, Will Button. Joining me
in the studio, my co-host, back

3
00:00:26.039 --> 00:00:30.760
on a streak, making Tom Brady look
like a slacker, Warren Parad. Welcome,

4
00:00:30.800 --> 00:00:36.880
Warren. And thanks for letting me
come back. I've got my hopes up

5
00:00:36.920 --> 00:00:40.320
that it will keep on going.
I'm Warren, I'm the CTO of Authress,

6
00:00:40.359 --> 00:00:43.880
just to reintroduce myself. Yeah, I mean, I like how this

7
00:00:43.960 --> 00:00:49.119
is going so far, and I
have no plans to coward my way out

8
00:00:50.079 --> 00:00:54.439
of the future. Right on. I'm
excited to hear that, because otherwise it's

9
00:00:54.479 --> 00:01:00.679
just me and the guest, and that always
leads to trouble. Speaking of guests, joining us

10
00:01:00.719 --> 00:01:08.719
today: Pete Fritchman, SRE and staff infrastructure
engineer over at Observe, Inc. And

11
00:01:10.599 --> 00:01:17.719
Pete has joined us today to talk
about observability in those pesky internal applications that

12
00:01:17.760 --> 00:01:23.799
we all have. And I don't
know, it might be a love hate

13
00:01:23.879 --> 00:01:26.680
relationship there, but Pete, welcome
to the show. Hey, thank you

14
00:01:26.719 --> 00:01:32.280
for having me. Right on. So
tell us a little bit about your background,

15
00:01:32.280 --> 00:01:37.319
because you've got the staff
engineer title, which takes a while

16
00:01:37.439 --> 00:01:42.760
to get to and is also I
think a little bit uncommon on the infrastructure

17
00:01:42.840 --> 00:01:47.879
side. You know, it's pretty
common on the software engineering side, but

18
00:01:48.560 --> 00:01:53.040
I think it's a little more rare
to see staff engineers on the infrastructure side.

19
00:01:53.040 --> 00:01:57.439
So tell us how you got to
that point. Sure, yeah,

20
00:01:57.480 --> 00:02:00.879
I mean I've been doing this for
a long time. I was into

21
00:02:00.959 --> 00:02:04.200
computers as a kid, which I
guess is more common these days, maybe

22
00:02:04.319 --> 00:02:07.680
less so in the nineties. But
I knew early on as a kid that

23
00:02:07.879 --> 00:02:12.680
I wanted to do computer stuff,
and I wasn't exactly sure what computer stuff

24
00:02:12.840 --> 00:02:16.560
was. Something with programming. I
enjoyed that kind of you know, make

25
00:02:16.599 --> 00:02:22.159
the computer do what I want aspect
of things. I was really lucky to

26
00:02:22.240 --> 00:02:27.199
land an internship at my local ISP
in like eighth grade, the summer after

27
00:02:27.240 --> 00:02:30.039
eighth grade, before ninth grade.
Oh no, I had a great mentor

28
00:02:30.080 --> 00:02:34.879
there. And I had kind of... I had done a mentorship, like, my

29
00:02:34.960 --> 00:02:38.599
seventh grade summer doing Linux stuff,
and I bought a laptop and ran Linux

30
00:02:38.639 --> 00:02:43.560
on it. And running Linux on
a laptop is a great way to become

31
00:02:43.599 --> 00:02:45.759
one with Linux, you know,
hate it, love it, whatever.

32
00:02:46.759 --> 00:02:52.039
And working at this ISP and having
a really great mentor, George, kind of, like,

33
00:02:52.439 --> 00:02:54.199
I realized, okay, sysadmin
is the thing I'd like to do.

34
00:02:54.240 --> 00:03:01.919
Like I enjoy debugging these problems and
building things and writing automation. So,

35
00:03:02.120 --> 00:03:05.439
you know, I did the high
school thing. I worked all through

36
00:03:05.520 --> 00:03:09.680
high school. I've been a workaholic
forever, for better or worse. I

37
00:03:09.800 --> 00:03:14.759
worked at this ISP all through high
school. Went to college for a year.

38
00:03:15.840 --> 00:03:19.639
Wasn't my thing. It was fun
socially, but the school part was

39
00:03:19.719 --> 00:03:23.319
just not. I don't know,
I just lacked the focus. I just

40
00:03:23.360 --> 00:03:29.080
really wanted to work, right? So
I had contributed a bunch to FreeBSD in

41
00:03:29.080 --> 00:03:30.879
my high school days. I was
a ports committer and ports are like the

42
00:03:30.879 --> 00:03:37.400
packages in FreeBSD. And through
that I landed a job in a really

43
00:03:37.400 --> 00:03:44.240
awesome group at FedEx doing system administration
on everything Internet facing. And it was

44
00:03:44.240 --> 00:03:46.439
a small group, and looking back, like that was the group to be

45
00:03:46.479 --> 00:03:52.560
in at FedEx for doing Unix-y stuff. They were definitely ahead of their time.

46
00:03:53.319 --> 00:03:54.800
Everything was automated. They wrote their
own tools, and I thought this

47
00:03:54.879 --> 00:03:58.719
was very normal for, like, a 2002
shop, which, you

48
00:03:58.719 --> 00:04:01.560
know, now we know maybe wasn't.
So the whole automation-first thing has

49
00:04:01.599 --> 00:04:06.120
always just kind of been how else
would you do it? Kind of thinking,

50
00:04:06.879 --> 00:04:11.919
and then I was lucky enough to
land an SRE gig at Google

51
00:04:11.960 --> 00:04:15.280
after that in 2005, and you know, they were like,

52
00:04:15.319 --> 00:04:16.439
hey, we should automate things and
I was like, well, yeah,

53
00:04:16.439 --> 00:04:20.079
how else would you do it?
Right? And then just from there

54
00:04:20.120 --> 00:04:25.959
it's been a whirlwind of startups.
And I tried the banking world for a

55
00:04:25.959 --> 00:04:29.600
little bit, had some fun there, had some not-fun there. I

56
00:04:29.759 --> 00:04:32.800
ultimately decided to go back to the
startup world because I think that's my true

57
00:04:33.879 --> 00:04:36.199
you know, that's where, for
me, that's where the most fun is.

58
00:04:36.759 --> 00:04:39.879
The most fun I had at a
bank was kind of a startup in a

59
00:04:39.920 --> 00:04:45.199
bank, and that's very hard to
find. Oh, for sure. So yeah,

60
00:04:45.240 --> 00:04:49.279
cool. Yeah, I mean, there's
definitely a huge... it's almost

61
00:04:49.319 --> 00:04:57.439
two completely separate careers, working at
large enterprise organizations versus startups. Like, you

62
00:04:57.480 --> 00:05:00.879
have to have two different mental models
to be successful at each of those.

63
00:05:01.079 --> 00:05:03.959
It's almost two different skill sets.
I mean the base technical skill set of

64
00:05:03.959 --> 00:05:09.759
course, and then you know at
a startup you kind of have to pick

65
00:05:09.839 --> 00:05:11.759
up the pieces and lead the way
with what you have. And then at

66
00:05:11.759 --> 00:05:14.360
a big enterprise you have all the
resources, but you also have all the

67
00:05:14.399 --> 00:05:17.439
politics, and so you have to
you know, as much as you know,

68
00:05:17.720 --> 00:05:20.439
everyone I know hates that you have
to play. You have to make

69
00:05:20.480 --> 00:05:27.800
friends in other organizations and figure out
how to influence people, and it is

70
00:05:27.839 --> 00:05:30.439
hard and stressful. I enjoy the
stress of tech, and the stress of

71
00:05:30.439 --> 00:05:36.399
that is just it's a lot.
Yeah, it's a lot of political engineering

72
00:05:36.720 --> 00:05:47.040
versus software engineering. Right, for
sure. So when you started work at

73
00:05:47.079 --> 00:05:51.120
the ISP, like, are we talking
back, like, dial-up modem ISP?

74
00:05:51.800 --> 00:05:55.680
Yeah. I sat in the after
room. I had

75
00:05:55.720 --> 00:05:58.959
like PM2es next to me
and those are quiet at least, and

76
00:05:59.000 --> 00:06:00.879
then I had a rack of like
56Ks that were just, you know,

77
00:06:00.920 --> 00:06:03.319
I'm pretty sure I could whistle my
way into a 14.4 connection.

78
00:06:03.560 --> 00:06:09.000
Back then, we were like the
regional ISP, so we did

79
00:06:09.160 --> 00:06:14.160
T1s for businesses, and their
big thing was they would put the... I

80
00:06:14.199 --> 00:06:15.160
forget what Cisco it was, but
there was something where you can take the

81
00:06:15.160 --> 00:06:16.879
T1 and, like, oh,
I want to take some of my pairs

82
00:06:16.920 --> 00:06:19.360
and use them as phone lines,
and some of them as data and that

83
00:06:19.439 --> 00:06:25.199
was, like, revolutionary in '97. So, yeah, for sure.

84
00:06:25.279 --> 00:06:28.360
Yeah. I was working in the
telco industry right around then. Nice.

85
00:06:28.920 --> 00:06:31.240
Yeah, and then we got you
know, DSLAMs, and the whole DSL thing

86
00:06:31.360 --> 00:06:33.920
came through, and it was just
it was a fun way to get a

87
00:06:33.959 --> 00:06:36.319
lot of exposure to a lot of
different things. I got to go to

88
00:06:36.399 --> 00:06:40.240
POPs and put stuff in, I
got to build Unix boxes, I got

89
00:06:40.279 --> 00:06:42.839
to kind of do the whole gamut
of things. So you got the benefit

90
00:06:42.839 --> 00:06:46.120
of dealing with Y2K. Yeah, you know, like... yeah,

91
00:06:46.199 --> 00:06:50.279
it was. It was such a
non event that people at work weren't even

92
00:06:50.279 --> 00:06:53.720
worried. You know, everyone was
very much a realist there. They were

93
00:06:53.759 --> 00:06:56.519
like, eh, whatever, everything breaks, we just live in a world where

94
00:06:56.560 --> 00:07:00.319
nothing works, and you know,
we patched our software. What more is there

95
00:07:00.360 --> 00:07:02.720
to do. I knew people.
I didn't know them like very well or

96
00:07:02.720 --> 00:07:08.600
personally, but I knew people pre-
Y2K that, in the area I lived

97
00:07:08.600 --> 00:07:13.560
in at the time, they built
underground bunkers and stocked them, and they were

98
00:07:13.600 --> 00:07:17.480
going like late December they were going
underground and saying we're not coming out for

99
00:07:17.879 --> 00:07:23.399
ten years or something, and I've
never seen any of those people again,

100
00:07:23.519 --> 00:07:28.199
so I'm like insanely curious. Were
they out in January? Are they still in

101
00:07:28.720 --> 00:07:33.079
Yeah? Yeah, yeah. I
didn't have, like, the, you know,

102
00:07:33.480 --> 00:07:35.639
I didn't know COBOL. I mean, there was a whole, you know,

103
00:07:36.000 --> 00:07:41.600
crew of people that were just making
insane money on projects, just on

104
00:07:41.680 --> 00:07:44.240
Y2K prep, needed or not. I
mean, I feel like we'll hit that

105
00:07:44.279 --> 00:07:46.519
again in 2038 with 32-bit
time_t. Like, that

106
00:07:46.560 --> 00:07:49.680
one is actually a little scary.
I hope to be retired by that,
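
NOTE
Editor's sketch, not part of the spoken audio. The 2038 limit mentioned here
follows from storing time as a signed 32-bit count of seconds since
1970-01-01 UTC. The variable names below are illustrative assumptions.
```python
from datetime import datetime, timedelta, timezone
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1
# Unix time counts seconds from this instant.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Last second a signed 32-bit time_t can represent.
last_second = epoch + timedelta(seconds=INT32_MAX)
print(last_second.isoformat())
```
One second after that instant, a signed 32-bit counter wraps to a negative value.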

107
00:07:50.519 --> 00:07:59.480
Right, no doubt. That seems
like too much. Cool. So, talking

108
00:07:59.560 --> 00:08:05.079
about internal platforms and observability. Yeah, like, whenever this came

109
00:08:05.120 --> 00:08:09.720
across as our topic, I was
like, you know, that's just

110
00:08:09.839 --> 00:08:15.240
brilliant because all of these little internal
apps, and sometimes not little internal apps,

111
00:08:15.279 --> 00:08:20.959
but like the things that the company
uses to make decisions about are we

112
00:08:20.040 --> 00:08:26.279
doing the right thing or not?
They often are just like little pet projects.

113
00:08:26.360 --> 00:08:33.240
So what's your experience there, and
how do you get those recognized as

114
00:08:33.320 --> 00:08:37.879
the valuable assets that they are?
Yeah, Well, often it happens on

115
00:08:37.960 --> 00:08:43.159
its own, but it's the
worst case of there's a catastrophic failure at

116
00:08:43.200 --> 00:08:46.000
the worst possible time, and everyone
goes, oh, that was really important.

117
00:08:46.679 --> 00:08:52.919
I worked somewhere at a public company, and the system that did the

118
00:08:52.159 --> 00:08:56.480
closing the books every day, you
know, like, very boring, right?

119
00:08:56.000 --> 00:08:58.600
Well, it wasn't working and it
was taking you know, thirty hours to

120
00:08:58.639 --> 00:09:01.919
close twenty four hours worth of books, and obviously that doesn't work so well.

121
00:09:03.600 --> 00:09:05.600
And they kind of didn't notice it
or take any action until, like

122
00:09:05.759 --> 00:09:09.159
they were a week before they had
to do some SEC quarterly filing thing.

123
00:09:09.200 --> 00:09:11.879
I don't know all the details,
but it became, you know, suddenly

124
00:09:11.960 --> 00:09:15.919
on Friday afternoon, it was we
need all hands on deck to fix this.

125
00:09:16.000 --> 00:09:24.200
But it's everything from that to not
just necessarily apps, but like internal

126
00:09:24.200 --> 00:09:28.600
infrastructure, right? CI and developer experience, and everyone's got this, you know,

127
00:09:28.159 --> 00:09:31.879
shell script infrastructure that devs run to
run the local cluster on their laptop,

128
00:09:31.919 --> 00:09:35.440
and there's you know, any company
you can probably sit down and find a

129
00:09:35.440 --> 00:09:39.759
million of them. I just think
they're perpetually under... you know, the sexy

130
00:09:39.759 --> 00:09:43.720
part is production, right, people
want to build SLOs for users and show

131
00:09:43.720 --> 00:09:48.399
awesome graphs and look at this great
incident management process we have. Then

132
00:09:48.399 --> 00:09:52.679
they have this, like, Jenkins instance
that's barely holding on internally, you know,

133
00:09:52.799 --> 00:09:54.399
that everyone hates but no one really
talks about because it's just, oh,

134
00:09:54.440 --> 00:09:58.919
yeah, that's life, you know, that whole side of things just needs

135
00:10:01.000 --> 00:10:07.000
more love. Yeah, And so
I think part of that, you know,

136
00:10:07.159 --> 00:10:11.799
like you mentioned there, especially with
like around Jenkins and things, it's

137
00:10:11.879 --> 00:10:18.519
people. It's like people who build
tools to solve specific problems that they're having,

138
00:10:18.159 --> 00:10:24.039
and then it almost has like this
organic growth of other people see it

139
00:10:24.080 --> 00:10:28.200
and they're like, oh yeah,
I need to use that too, and

140
00:10:28.240 --> 00:10:33.720
then it grows in its role in
the company. So how do you identify

141
00:10:33.879 --> 00:10:39.919
those before that catastrophic event? Yeah, well, you know a lot if

142
00:10:39.960 --> 00:10:41.720
you think about So I do a
lot of thinking about like how to compare

143
00:10:41.759 --> 00:10:45.559
it to production. So in production, if you say I want to launch

144
00:10:45.600 --> 00:10:48.240
this new microservice or I want
to consume this new AWS service, you

145
00:10:48.279 --> 00:10:52.120
write the design docs, you have
launch review. I mean hopefully right,

146
00:10:52.320 --> 00:10:56.519
ideally you have these things, you
have this whole like rigmarole and process for

147
00:10:56.559 --> 00:10:58.320
it. But internally it's like,
oh, I'm going to fire up an

148
00:10:58.360 --> 00:11:00.799
EC2 and run this new tool
I downloaded. Hey, two weeks later,

149
00:11:00.840 --> 00:11:05.159
it's important. You have to apply that
same principle. And part of it

150
00:11:05.200 --> 00:11:05.879
can be a staffing thing, right, Like, you know, you might

151
00:11:05.919 --> 00:11:09.519
have this team of a million people
working on PROD and then you got the

152
00:11:09.559 --> 00:11:15.000
two you know, the IT guys
working on infrastructure stuff that doesn't work very

153
00:11:15.000 --> 00:11:18.120
well at all. Yeah, So
I mean, I think you have to

154
00:11:18.159 --> 00:11:24.320
put some process to it, and
I mean I think that I'm a big

155
00:11:24.360 --> 00:11:28.960
fan of the whole infrastructure management,
incident management, postmortem process. Like,

156
00:11:28.000 --> 00:11:31.600
I think that that is a great
way to drive, you know,

157
00:11:31.639 --> 00:11:33.480
outages just happen. And we
all have to accept that nothing's one hundred

158
00:11:33.480 --> 00:11:37.559
percent, but you should get the
most out of every outage. So when

159
00:11:37.559 --> 00:11:43.200
your internal tool does blow up and
cause a you know, company visible,

160
00:11:43.279 --> 00:11:45.559
everyone is like, hey, I
don't really know what this was, but

161
00:11:45.600 --> 00:11:48.960
we couldn't do business for a day. Make the most of that, right,

162
00:11:48.080 --> 00:11:52.000
Hey, this thing broke. By
the way, there's five other services

163
00:11:52.000 --> 00:11:56.039
we've identified that are in the exact
same state and could all blow up tomorrow

164
00:11:56.039 --> 00:11:58.240
and we'd be on the same call. And so that's your that's your you

165
00:11:58.240 --> 00:12:03.240
know, entrance point to getting everyone
to have attention on that thing. And

166
00:12:03.240 --> 00:12:05.120
it's unfortunate that, you know,
things have to fail first sometimes, but

167
00:12:07.039 --> 00:12:09.879
that's how it goes when you're especially
at a startup where you kind of have

168
00:12:09.919 --> 00:12:15.480
conflicting priorities. Oh for sure.
Yeah, And I think it's a really

169
00:12:15.559 --> 00:12:18.120
good point there, like,
outages happen, so let's just make the

170
00:12:18.120 --> 00:12:22.679
most out of that learning experience.
Absolutely, which is that's been a cultural

171
00:12:22.799 --> 00:12:28.639
change for us as an industry over
the last, I don't know, I'd say

172
00:12:28.639 --> 00:12:33.559
ten to fifteen years. I feel
like Etsy published that blame you know,

173
00:12:33.600 --> 00:12:37.519
the famous blameless postmortems. It feels
like that's ages ago now, but it's

174
00:12:37.519 --> 00:12:41.799
still so relevant every day. Yeah. Yeah, things broke, who cares

175
00:12:41.840 --> 00:12:43.639
how they broke. Let's just have
it not break in the same way again,

176
00:12:45.480 --> 00:12:46.840
right, And if you can do
that, that to me is like

177
00:12:46.879 --> 00:12:50.399
the health of an SRE team,
right, it's not. Yeah, I

178
00:12:50.440 --> 00:12:52.679
really hate places that measure health by
oh there were four outages last quarter,

179
00:12:54.240 --> 00:12:58.240
okay, like, were they repeat root
causes, Yes, okay, maybe there's

180
00:12:58.279 --> 00:13:01.639
a real problem. But if they
weren't repeat root causes and you're solving root

181
00:13:01.639 --> 00:13:07.480
causes and writing good postmortems, outages
are great. Yeah, yeah,

182
00:13:07.480 --> 00:13:09.000
for sure. And I think it's
I think that's one of the things that

183
00:13:09.720 --> 00:13:16.759
it's really hard to convince people of
is outages aren't as bad as you think

184
00:13:16.879 --> 00:13:24.000
because where most of us work like
we can recover and and I say that

185
00:13:24.039 --> 00:13:30.000
coming from a background where sometimes when
we had outages, people's lives were at

186
00:13:30.039 --> 00:13:33.399
stake. And so if you can
walk out of an outage and say no

187
00:13:33.440 --> 00:13:39.200
one died, like we're going to
be all right. Yeah. Well,

188
00:13:39.200 --> 00:13:43.759
plus it ties into SLIs, SLOs, error
budgets, right? People, a lot of

189
00:13:43.799 --> 00:13:48.200
times before the SLO concept became very
popular, like, is five minutes of downtime

190
00:13:48.200 --> 00:13:52.679
bad? What does that mean?
They have this error budget and you can

191
00:13:52.720 --> 00:13:54.519
go, well, you know,
we're a three-nines service, committed to... we

192
00:13:54.559 --> 00:13:58.840
have forty four and a half minutes
a month. Okay, five minutes isn't

193
00:13:58.879 --> 00:14:01.039
great, but it's fine.
You know we can. Yeah, but
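
NOTE
Editor's sketch, not part of the spoken audio. The error-budget arithmetic in
this exchange can be worked through as below, assuming "three nines" means a
99.9% availability target over one calendar month. The function name is
hypothetical. The result is about 43.2 minutes for a 30-day month and about
44.6 minutes for a 31-day month, in line with the ballpark figure quoted.
```python
def error_budget_minutes(slo: float, days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over `days` days."""
    return (1.0 - slo) * days * 24 * 60
# Budget for a three-nines (99.9%) service in a 30-day and a 31-day month.
budget_30 = error_budget_minutes(0.999, 30)
budget_31 = error_budget_minutes(0.999, 31)
# A five-minute outage spends only part of the monthly budget.
remaining_after_outage = budget_31 - 5
print(round(budget_30, 2), round(budget_31, 2), round(remaining_after_outage, 2))
```
Framing an outage as minutes spent from a budget is what turns the question
"is five minutes bad?" into a number rather than a feeling.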

194
00:14:01.080 --> 00:14:05.080
if you have you know, hey, I've gone past it and we're writing

195
00:14:05.159 --> 00:14:07.240
checks or you know, giving SLA
credits like okay, now you know we

196
00:14:07.279 --> 00:14:15.080
have this quantifiable not how people feel
number to talk about things and yeah,

197
00:14:15.120 --> 00:14:18.399
so do you for internal apps,
do you take them that far where you

198
00:14:18.519 --> 00:14:24.000
give them, assign them SLIs and SLOs
and SLAs for performance? I've been starting

199
00:14:24.000 --> 00:14:28.000
to so this is actually my first, my first gig out of many where

200
00:14:28.000 --> 00:14:31.120
I've really kind of focused on internal
stuff. I've often been the problem of,

201
00:14:31.159 --> 00:14:33.679
you know, the person focusing on
prod and, you know, on a

202
00:14:33.679 --> 00:14:37.440
different team that now my team is
kind of you know, wearing all the

203
00:14:37.480 --> 00:14:43.240
hats thing here, and yeah,
so we're putting SLOs on internal services

204
00:14:43.240 --> 00:14:46.759
and internal workflows and trying to treat
them with the same we have that in

205
00:14:46.759 --> 00:14:50.840
production, of course, and we're
trying to give it the same, uh,

206
00:14:50.519 --> 00:14:56.720
the same level of you know,
thought and execution. I've seen a

207
00:14:56.799 --> 00:15:03.200
huge shift in the industry overall,
like, where some teams, maybe they're called

208
00:15:03.200 --> 00:15:07.240
platform teams, have sort of been
excluded from thinking about what a product is

209
00:15:07.440 --> 00:15:11.159
or how they do product management or
even product ownership. And I feel like

210
00:15:11.159 --> 00:15:15.919
that's really been turning around, first
with the DevOps movement and now just I'll

211
00:15:15.919 --> 00:15:18.720
look at how we build services,
microservices altogether, and no one's

212
00:15:18.799 --> 00:15:24.679
sort of excluded or has a different
process for internal teams. Doesn't matter.

213
00:15:24.799 --> 00:15:28.600
You're still offering a real service to
someone. Just happens to be your customers

214
00:15:28.679 --> 00:15:31.759
are within the same company, or same
organization. That's exactly how I sell it

215
00:15:31.840 --> 00:15:35.120
is. Yeah, you still have
customers, they just happen to be coworkers,

216
00:15:35.960 --> 00:15:39.480
right. Yeah, you can still
do the same stuff. You can

217
00:15:39.480 --> 00:15:41.679
build user journeys. You can figure
out what does a customer expect, what

218
00:15:41.720 --> 00:15:45.080
makes a customer angry. It's a
lot easier to figure it out because you

219
00:15:45.120 --> 00:15:48.080
can ask them in Slack instead of
guessing, like, man, what are

220
00:15:48.120 --> 00:15:50.039
my customers? What's the threshold at
which my customer complains? You can just

221
00:15:50.120 --> 00:15:52.600
straight up ask them when they work
for... You know, I don't think they

222
00:15:52.679 --> 00:15:56.879
necessarily have good, like nice answers
though, Like you're a customer, you

223
00:15:56.919 --> 00:16:00.600
find there's a variety of personalities,
right? People, the moment a Jenkins

224
00:16:00.679 --> 00:16:03.759
job doesn't pass in four minutes,
they're like, ah, Jenkins is crap.

225
00:16:03.799 --> 00:16:07.240
It's terrible, but at least you
can get it. You know. That's

226
00:16:07.240 --> 00:16:10.080
how customers are. You know,
are in general right you have so you

227
00:16:10.080 --> 00:16:12.399
get the whole gamut of emotions.
But you can tell when there's an outage.

228
00:16:12.440 --> 00:16:15.480
You know, how loud people are. Internal people will tend to be

229
00:16:15.519 --> 00:16:18.519
louder internally than your customers. I
mean, that's what I mean, Like

230
00:16:18.600 --> 00:16:21.200
you could definitely like it's still a
problem in some regard, but like how

231
00:16:21.240 --> 00:16:25.320
do you temper the difference of perspective? Like I feel like users on the

232
00:16:25.360 --> 00:16:27.960
outside are more quiet in a lot
of ways, like it's difficult to pull

233
00:16:29.000 --> 00:16:32.039
information out of them, and internally, as you mentioned, you know,

234
00:16:32.120 --> 00:16:34.519
it's like everyone screaming as soon as
something goes even a little bit wrong. I

235
00:16:34.559 --> 00:16:38.559
think that if you can show them
that, A, there are graphs, right?

236
00:16:38.600 --> 00:16:41.360
Like, there should always be a graph, you know, like, generically. I worked

237
00:16:41.360 --> 00:16:44.639
for a guy that always said, where's
the graph? Where's the graph? And it was

238
00:16:44.639 --> 00:16:45.919
annoying at first, but you're like, okay, maybe there should always be

239
00:16:45.960 --> 00:16:51.240
a graph so people know that, hey, people are actually watching Jenkins.

240
00:16:51.240 --> 00:16:55.320
Or rather, computers are watching
Jenkins, not people, right,

241
00:16:56.440 --> 00:16:57.759
And if you show like, hey, we had this outage and yes it

242
00:16:57.799 --> 00:17:00.480
was a really terrible day and no
one shipped any code, but here's this

243
00:17:00.519 --> 00:17:06.599
great postmortem everyone can read and that's
got a reasonable trigger and root cause and

244
00:17:06.640 --> 00:17:10.000
follow ups and timeline and we're actually
closing the follow ups. Like it's very

245
00:17:10.039 --> 00:17:12.759
visible and I feel that should be
the same way with customer outages. It's

246
00:17:12.839 --> 00:17:18.640
very visible, right?
There shouldn't be anything to hide. Yeah,

247
00:17:18.759 --> 00:17:22.079
So one idea I've been working on
because I work a lot with startups

248
00:17:22.440 --> 00:17:29.640
and like, the whole thing about
a startup is odds are the product that

249
00:17:29.720 --> 00:17:32.599
you launch is not going to be
the product that you're successful with, and

250
00:17:32.640 --> 00:17:34.680
you're going to try a lot of
different things before you become successful as a

251
00:17:34.680 --> 00:17:41.160
company. So how do we measure
that so we don't spend any more time

252
00:17:41.279 --> 00:17:45.799
than is absolutely necessary working on the
wrong problem. And so I've been working

253
00:17:45.799 --> 00:17:48.680
on this idea of success criteria,
like what does it take to make this

254
00:17:48.799 --> 00:17:55.400
application successful? And a lot of
my background is in the infrastructure for mobile

255
00:17:55.400 --> 00:17:57.759
apps, and for mobile apps,
you can measure it as you know,

256
00:17:57.799 --> 00:18:03.880
we need ten thousand monthly active users
spending forty five minutes per week in the

257
00:18:04.000 --> 00:18:10.039
app or something like that, measure
engagement. Yeah, yeah, yeah,

258
00:18:10.119 --> 00:18:14.000
So like because then if you if
you tie in, like you know,

259
00:18:14.039 --> 00:18:17.920
our total cost of acquisition to
get a new user is X amount of

260
00:18:17.920 --> 00:18:22.319
dollars, and our cost of infrastructure
is x amount of dollars per user.

261
00:18:22.400 --> 00:18:25.559
You know, you can do some
pretty simple math there to find out how

262
00:18:25.559 --> 00:18:30.240
many users you need for this to
be a profitable product. And so trying
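
NOTE
Editor's sketch, not part of the spoken audio. The "pretty simple math"
described here can be illustrated as a break-even calculation. The
conversation only names acquisition cost and infrastructure cost per user;
revenue per user, fixed monthly cost, expected user lifetime, the function
name, and every figure below are hypothetical assumptions added for
illustration.
```python
import math
def break_even_users(fixed_monthly_cost, revenue_per_user, infra_cost_per_user,
                     acquisition_cost_per_user, expected_lifetime_months):
    """Users needed for monthly margin to cover fixed costs, or None if never."""
    # Spread the one-time acquisition cost over the months a user is expected to stay.
    acquisition_per_month = acquisition_cost_per_user / expected_lifetime_months
    margin_per_user = revenue_per_user - infra_cost_per_user - acquisition_per_month
    if margin_per_user <= 0:
        return None  # Each extra user loses money, so no user count breaks even.
    return math.ceil(fixed_monthly_cost / margin_per_user)
# Hypothetical figures: 20,000 fixed cost, 5 revenue, 1 infra, 24 acquisition over 12 months.
needed = break_even_users(20000, 5.0, 1.0, 24.0, 12)
print(needed)
```
A None result is the early signal to shelve the product; a reachable user
count is the success criterion to measure against.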

263
00:18:30.279 --> 00:18:34.599
to figure that out and get that
in the early stages of an application so

264
00:18:34.720 --> 00:18:41.960
we know when to either double down
on this application or shelve it as soon

265
00:18:42.000 --> 00:18:47.240
as possible. I'm wondering if like
that same thing applies to internal tools,

266
00:18:47.240 --> 00:18:51.319
and how do you how do you
define what that looks like? Yeah,

267
00:18:51.359 --> 00:18:53.359
that's a good question. And I
think I had like a good you know,

268
00:18:53.440 --> 00:19:02.240
where we deployed the like a code
analysis tool, and yeah, I

269
00:19:02.240 --> 00:19:04.400
don't it's a good I don't know
how to quantify it for stuff like that.

270
00:19:04.559 --> 00:19:08.039
But again, we have some of
the very active developers that will give

271
00:19:08.079 --> 00:19:11.319
feedback of like, hey, this
thing gave me a bad analysis, which

272
00:19:11.359 --> 00:19:12.240
we can take and make better.
It's like, okay, that means they're

273
00:19:12.240 --> 00:19:17.640
actually looking at it and you know, making code health better. And I

274
00:19:17.640 --> 00:19:19.799
don't know, I think the like, I think you're on the right track

275
00:19:19.880 --> 00:19:23.480
with you know, having a number
like MAUs and that kind of thing.

276
00:19:25.119 --> 00:19:32.400
It's tough internally, Yeah, yeah, well, and it's interesting because yeah,

277
00:19:32.480 --> 00:19:34.279
well, I think there's two classes
of internal products, right there's internal

278
00:19:34.319 --> 00:19:38.039
products that people are very opinionated on. Code analysis is one, right,

279
00:19:38.039 --> 00:19:41.200
Like, I think this gives me
bad code analysis. I don't like that.

280
00:19:41.920 --> 00:19:48.160
But CI right, I think largely
people don't actually care what CI product

281
00:19:48.160 --> 00:19:52.440
you're running. They care do my
PRs get approvals fast? Is master green?

282
00:19:53.599 --> 00:19:56.720
Do deployments happen at the pace that
we think they should at our company,

283
00:19:57.079 --> 00:20:00.240
and if that works, who cares
what the product is? Right,

284
00:20:00.279 --> 00:20:02.920
they're like, And so for stuff
like that, you can just be super

285
00:20:02.960 --> 00:20:04.240
results oriented, right, Like,
hey, here are the SLIs,

286
00:20:04.319 --> 00:20:08.119
these are the SLOs we're trying to go
for. We're making them, we're not making

287
00:20:08.160 --> 00:20:15.279
them. We're a Jenkins shop
now, and we're considering alternatives. And

288
00:20:15.519 --> 00:20:17.839
of course, you know, you
ask people what we should use, there's a

289
00:20:17.920 --> 00:20:21.680
million things, but no one actually
cares what we run, right, you

290
00:20:21.680 --> 00:20:25.200
know, if they can declare their
jobs in some declarative way that's not too

291
00:20:25.279 --> 00:20:30.720
terrible and it works, everyone's happy. Yeah, I'll tell you what'll make

292
00:20:30.759 --> 00:20:34.119
them care. If they have to
be the ones to convert your Jenkins jobs

293
00:20:34.200 --> 00:20:37.599
to the new platform, then a
lot of people will be like, ah,

294
00:20:37.680 --> 00:20:42.559
you know, Jenkins is probably okay. There's a lot of GitHub Actions

295
00:20:42.640 --> 00:20:45.759
talk because a lot of people have
done stuff there. So I mean part

296
00:20:45.759 --> 00:20:48.160
of you know, part of that
is, that's the political, or, well, the

297
00:20:48.720 --> 00:20:52.640
human side of it is, hey, maybe if all of the products

298
00:20:52.720 --> 00:20:55.279
can do the thing, maybe we
pick the one people have the most experience

299
00:20:55.319 --> 00:20:57.480
with just to make life easier in
the transition. That's a good point.

300
00:20:57.640 --> 00:21:02.279
I mean, Will, you asked a question, which is, how do we know

301
00:21:02.400 --> 00:21:06.559
our thing is going to be successful
at a startup level? And if we

302
00:21:06.599 --> 00:21:11.559
take the perceived public metric of ninety
percent fail, what do companies do with

303
00:21:11.920 --> 00:21:15.720
teams that are part of that ninety
percent? Do they let them go?

304
00:21:17.400 --> 00:21:22.440
Do they reposition them? I think
that's a huge struggle and it's scary to

305
00:21:22.559 --> 00:21:27.720
know that there are some metrics associated
with your success, even within a larger company

306
00:21:29.279 --> 00:21:33.920
that you don't have job security in. Yeah, I can tell you just

307
00:21:33.960 --> 00:21:42.200
in my experience with startups, that's
like, when you identify the success/fail criteria

308
00:21:42.759 --> 00:21:48.519
answers that question because if you can
identify it early, then you take those

309
00:21:48.559 --> 00:21:52.640
people and you put them on what
your next idea is. But in many

310
00:21:52.720 --> 00:21:56.720
cases, if you wait too late
to identify that this product is not going

311
00:21:56.759 --> 00:22:00.079
to be successful, you've already been
bleeding too much cash for so long that

312
00:22:00.200 --> 00:22:04.960
you've got to salvage what's left,
and the first thing that goes when you're

313
00:22:04.960 --> 00:22:08.279
trying to cut costs is your staff. I wonder if there's a lesson to

314
00:22:08.279 --> 00:22:12.079
be learned that can be pulled into
internal apps, though. In the startup market,

315
00:22:12.200 --> 00:22:17.880
there's this idea of letters of intent, right getting signatures from potential customers

316
00:22:17.920 --> 00:22:22.000
even before you built anything based off
of that idea, and maybe maybe there's

317
00:22:22.000 --> 00:22:27.200
an idea here of how to transition
this to even internal teams. Yeah,

318
00:22:27.519 --> 00:22:32.240
I think, yeah, I like
the idea of formalizing it. I mean,

319
00:22:32.240 --> 00:22:34.480
I think what I often do for
these kind of things is like you

320
00:22:34.559 --> 00:22:37.079
find, uh, you know,
hey, you might have ten teams and

321
00:22:37.160 --> 00:22:41.480
six teams really care about this thing
you're building because they're big users of it.

322
00:22:41.480 --> 00:22:44.359
You find like a champion on each
team that is, you know,

323
00:22:44.400 --> 00:22:48.440
opinionated and is willing to talk to
you about what's good and bad. And

324
00:22:48.519 --> 00:22:49.920
maybe not a letter of intent,
but you write a design doc and then

325
00:22:49.960 --> 00:22:52.240
you make sure that they've read it
and give you feedback, not just to

326
00:22:52.319 --> 00:22:55.200
check, but you know, if
anyone reads a design doc and has no

327
00:22:55.279 --> 00:23:00.400
comments, they didn't actually read it. Like I've never met an engineer without

328
00:23:00.400 --> 00:23:03.079
an opinion on a design. So
so you get you know, you get

329
00:23:03.119 --> 00:23:07.240
feedback from people on these teams,
and it's maybe not as formal as a

330
00:23:07.279 --> 00:23:10.400
letter of intent, but you kind
of have the you know, social buy

331
00:23:10.400 --> 00:23:14.559
in, right, Yeah. Yeah, And that was one of the things

332
00:23:14.559 --> 00:23:19.119
that we whenever I worked at Active, I started with them super early and

333
00:23:21.240 --> 00:23:26.119
we just accidentally got this right.
But we built this platform for all the

334
00:23:26.160 --> 00:23:34.079
engineers to build and deploy their software, and the way we built it,

335
00:23:34.119 --> 00:23:40.640
we we had such great collaboration between
an infrastructure team and an engineering team that

336
00:23:40.839 --> 00:23:47.359
anytime the engineering team wanted to expand
the capabilities of that platform tool, most

337
00:23:47.400 --> 00:23:49.839
of the time their request came in
the form of a pull request to just

338
00:23:49.920 --> 00:23:55.839
add that feature to it. Yeah, and I've never been able to duplicate

339
00:23:55.880 --> 00:24:00.440
that since then, but that had
such a significant impact on how I think

340
00:24:00.440 --> 00:24:03.799
about platforms that that's my goal every
day. Yeah. I like that.

341
00:24:03.799 --> 00:24:08.200
It's kind of the open source approach, right, Yeah, Yeah, that's

342
00:24:08.240 --> 00:24:11.039
I mean, I think a lot
of that is the buy in, right,

343
00:24:11.039 --> 00:24:14.440
because if you're using a platform or
a language that no one else wants

344
00:24:14.559 --> 00:24:18.119
or knows or likes, they're not
going to learn Rust to send you a

345
00:24:18.160 --> 00:24:19.799
pull request to fix a thing they
don't like. But if it's already in

346
00:24:19.880 --> 00:24:22.759
Go and the rest of your code's
in Go, yeah, it'll just show

347
00:24:22.839 --> 00:24:27.759
up one morning. Maybe there's a
corollary here as well to the total economy

348
00:24:27.799 --> 00:24:33.000
in the startup world, where you
know your customers are somewhere along the spectrum

349
00:24:33.079 --> 00:24:37.640
of innovators, early adopters, early
majority, all the way to laggards,

350
00:24:37.680 --> 00:24:38.880
and it's really who are you talking
to? You know, what does the

351
00:24:38.880 --> 00:24:42.240
rest of your organization look like?
Because I feel like, well, you

352
00:24:42.240 --> 00:24:48.839
would need innovators and early adopters there
who not only need that functionality but have

353
00:24:48.880 --> 00:24:52.000
a huge stake or care about how
it's implemented, and not just people who

354
00:24:52.039 --> 00:24:57.880
think it's table stakes or just belongs
there or have opinions that they just want

355
00:24:57.880 --> 00:25:03.680
it done their way. Yeah,
yeah, true, yeah, And that's

356
00:25:03.920 --> 00:25:06.039
I think you hit that. I
think you hit it right on the head

357
00:25:06.079 --> 00:25:14.480
there that early stage startups require that
innovation mindset. For sure. They're not

358
00:25:14.480 --> 00:25:18.319
going to get very far if only
a few people are thinking about how to

359
00:25:18.319 --> 00:25:21.680
push whatever their product
is forward. Yeah, I really

360
00:25:21.680 --> 00:25:25.359
agree, and I want to clarify that
there's, you know, there's a

361
00:25:25.400 --> 00:25:30.039
certain type of person who just wants
to show up for work and do their

362
00:25:30.039 --> 00:25:33.799
assigned job and they want to do
that for twenty or thirty years till they

363
00:25:33.799 --> 00:25:38.240
retire. And there's absolutely nothing wrong
with that, just that an early stage

364
00:25:38.240 --> 00:25:44.160
startup is not the right environment for
you to be successful. Yeah, you

365
00:25:44.240 --> 00:25:45.960
definitely see that. You can find
it at the bigger companies. You have these

366
00:25:45.960 --> 00:25:49.200
classes of employees, right, there
are some people that you can tell either

367
00:25:49.240 --> 00:25:53.279
came from startups or are going to
go to startups afterwards. And then like

368
00:25:53.279 --> 00:25:56.880
you said, there are some people
that just you know, close tickets and

369
00:25:56.920 --> 00:26:00.640
do the work. And that's yeah, completely fine. Yeah, I sometimes

370
00:26:00.640 --> 00:26:06.599
wish that I could be that person. It'd be a lot less stressful. And

371
00:26:06.680 --> 00:26:07.559
yeah, I don't know, maybe
I will be in five or day.

372
00:26:07.559 --> 00:26:12.960
Who knows, right, people change, But yeah, So let's talk a

373
00:26:12.960 --> 00:26:18.480
little bit about the types of internal
platforms, because we've talked about you know,

374
00:26:18.480 --> 00:26:26.480
like CI/CD. What other tools are
out there that we should be looking

375
00:26:26.599 --> 00:26:33.400
for as internal tools. One that
pops up immediately in my mind only because

376
00:26:33.440 --> 00:26:37.680
I've been bitten by this one at
every single company I've ever worked at for

377
00:26:37.720 --> 00:26:45.440
the last three decades is the internal
data analytics team. Like those guys have

378
00:26:45.279 --> 00:26:52.960
monster infrastructures and are just doing what
it takes to generate the reports that the

379
00:26:53.000 --> 00:26:56.680
business wants to see. And whenever
you get a hold of it, like,

380
00:26:56.759 --> 00:27:00.960
oh wow, I'm not even really
certain how this thing is working.

381
00:27:00.839 --> 00:27:04.400
So what other types of
internal tools are out there? Well,

382
00:27:04.480 --> 00:27:08.000
some people run their own VCS,
you know, Git, GitLab on-prem.

383
00:27:08.880 --> 00:27:12.759
Uh, we run Gerrit, so that's its
own, you know, care and feeding, and that's

384
00:27:12.759 --> 00:27:18.880
the loos. I think one thing
people actually miss is the monitoring infrastructure itself.

385
00:27:18.880 --> 00:27:23.119
You know, if you run Prometheus
and Elastic or if you don't have

386
00:27:23.160 --> 00:27:26.920
a vendor, right, and even
if you have a vendor, you have

387
00:27:26.000 --> 00:27:30.119
some component that is shipping metrics to
them, right, you have to monitor

388
00:27:30.160 --> 00:27:33.839
that. If that stuff all breaks, you're flying blind and you need to

389
00:27:33.880 --> 00:27:40.799
know it. I think that often
goes neglected, right. I worked somewhere

390
00:27:40.839 --> 00:27:42.720
where someone came to us once and
they were like, you know, they

391
00:27:42.799 --> 00:27:45.960
run a really noisy on call rotation, which is its own problem, and

392
00:27:47.039 --> 00:27:48.839
they knew things were broken because they
hadn't gotten a PagerDuty page in

393
00:27:48.920 --> 00:27:52.359
ninety minutes, and that was their
escalation and I was like, well,

394
00:27:52.440 --> 00:27:55.400
it's actually broken, and this is
so sad that this is how we found

395
00:27:55.400 --> 00:27:59.039
out about the problem. On many
levels, right? One, we didn't know.

396
00:27:59.160 --> 00:28:03.440
Two, you expect to be paged
at least once every ninety minutes, but

397
00:28:03.559 --> 00:28:07.119
it's it's truly I think the monitoring
infrastructure, it seems like it should be

398
00:28:07.119 --> 00:28:11.400
obvious, but it's not necessarily obvious.
Do you think that's, do you think

399
00:28:11.400 --> 00:28:17.759
that's a miss from, like, existing observability
tools that they aren't able to have internal

400
00:28:17.799 --> 00:28:21.640
metrics on what the expectation is on
getting logs, Like I feel like we've

401
00:28:21.720 --> 00:28:25.480
used CloudWatch for a while, and
one of the things it does in AWS

402
00:28:25.559 --> 00:28:30.759
is you can alert on missing data. Well that's a hard like in Prometheus,

403
00:28:30.759 --> 00:28:33.880
it's very hard to alert on missing
data, right? You end up alerting on, like,

404
00:28:33.920 --> 00:28:37.200
the 'up' time series being zero. But
then what if you have a problem that

405
00:28:37.240 --> 00:28:41.079
generates your targets and there is no
'up' time series, right? Like, there are a

406
00:28:41.119 --> 00:28:45.119
lot of these different things. Yeah, I think that it's not Yeah,

407
00:28:45.160 --> 00:28:47.960
I don't think it's first class of
you know, watch the watchers. I

408
00:28:48.039 --> 00:28:49.759
think the data is there, though, and you have to kind of do

409
00:28:49.880 --> 00:28:56.359
it. I worked somewhere where we
had a relatively large Prometheus and Grafana infrastructure

410
00:28:56.400 --> 00:29:00.440
for metrics and a really large Splunk
infrastructure for logs, and both of them

411
00:29:00.440 --> 00:29:03.480
had their problems, and they were run by
two different teams. And we got together

412
00:29:03.519 --> 00:29:07.480
and said, hey, let's make
a deal. Right, We'll write a

413
00:29:07.519 --> 00:29:11.720
prober for Splunk and have it export
metrics and monitor Splunk in Prometheus. You

414
00:29:11.759 --> 00:29:15.039
write a prober for Prometheus and have
it write logs and monitor it in Splunk. And

415
00:29:15.079 --> 00:29:18.160
that's not perfect, but you make do
with what you've got. And that was

416
00:29:18.200 --> 00:29:22.839
a company where it was a large
place that was going to be hard to

417
00:29:22.880 --> 00:29:26.039
bring in a third party tool to
help us, and you know, but

418
00:29:26.240 --> 00:29:30.200
we made do with it. And
it was super successful. We found lots

419
00:29:30.240 --> 00:29:32.240
of problems with you know, if
they both died at the same time,

420
00:29:32.279 --> 00:29:33.599
Okay, yeah, that's the end
of the world. But the world is

421
00:29:33.599 --> 00:29:37.799
probably already ending if they both died. So do you think there are third

422
00:29:37.799 --> 00:29:41.839
party vendors that actually solve this in
a reasonable way? Well, I mean,

423
00:29:41.839 --> 00:29:45.640
I think it's hard to say,
right, It depends what you're looking.

424
00:29:45.640 --> 00:29:48.519
If you're using a third party vendor
for monitoring, you know, you

425
00:29:48.559 --> 00:29:53.799
need to look at your metrics of
shipping. Is your data there with them?

426
00:29:55.000 --> 00:29:57.119
Right? So you have to set
something up in their system to do

427
00:29:57.200 --> 00:30:00.880
it. But then, do you
want a backup? Uh, you

428
00:30:00.880 --> 00:30:03.920
know? Do you want to know
if they're up? You may have to

429
00:30:03.000 --> 00:30:08.200
run a little side monitoring infrastructure too
to watch them, because there might not be

430
00:30:08.200 --> 00:30:11.359
anything you can do about it,
but you may want to at least be

431
00:30:11.400 --> 00:30:14.359
aware that, Hey, the thing
that normally sends me alerts is not going

432
00:30:14.440 --> 00:30:18.559
to send me alerts. Maybe we
should all go back to the NOC

433
00:30:18.680 --> 00:30:21.799
days and you know, stare at
some things for a couple of hours.

434
00:30:22.839 --> 00:30:23.799
I mean, that's exactly what I
don't want to have to think about,

435
00:30:23.799 --> 00:30:27.319
Like, I don't want to have to think about
my vendor being down in some way that

436
00:30:27.480 --> 00:30:30.599
requires me to monitor them. Like
I feel like, you know, if

437
00:30:30.640 --> 00:30:33.240
I'm paying money out for that,
I should I should be getting that by

438
00:30:33.319 --> 00:30:37.160
default. I don't know. Maybe
I think that's just my pessimistic brain.

439
00:30:37.599 --> 00:30:44.559
Yeah, for sure, everything will
break. Everything will break, So whether

440
00:30:44.599 --> 00:30:47.880
you're building it or not, it's
going to break and you either want to

441
00:30:47.880 --> 00:30:49.519
know about it or you know,
if you think something doesn't break, you're

442
00:30:49.599 --> 00:30:52.359
just not measuring it. Yeah,
that's kind of mine. Yeah, yeah,

443
00:30:52.480 --> 00:30:56.160
I have two examples where I think
that's a really like a lot of

444
00:30:56.160 --> 00:31:03.160
effort has gone into that. Coralogix
is a logging and monitoring platform, and

445
00:31:03.200 --> 00:31:07.559
they have anomaly detection, and so
if you have a service that

446
00:31:08.200 --> 00:31:12.400
normally spits out, you know,
one thousand log entries per hour, if

447
00:31:12.440 --> 00:31:18.839
it stops, it identifies that as
an anomaly, and will trigger on that and

448
00:31:18.839 --> 00:31:23.319
say, hey, nothing is like
technically alerted, but something's changed here.

449
00:31:25.000 --> 00:31:29.920
And then another one that we just
recently started using is a tool for monitoring

450
00:31:29.960 --> 00:31:36.559
our infrastructure spend called CloudZero,
and it has a really cool anomaly detection

451
00:31:36.720 --> 00:31:40.640
as well. It says, hey,
this project, their current spend rate has

452
00:31:40.680 --> 00:31:45.039
been this, but today it changed
to this, which is you know,

453
00:31:45.079 --> 00:31:49.000
not necessarily a problem, but cool
that it acknowledges that and you're like,

454
00:31:49.079 --> 00:31:56.440
yeah, why did that change?
I think the correlation analysis like that is

455
00:31:56.480 --> 00:32:00.400
really good for not necessarily paging someone
at two a.m., but for reports and debugging, like

456
00:32:00.440 --> 00:32:06.039
hey, I have a problem that started
two days ago, what else changed two

457
00:32:06.119 --> 00:32:07.720
days ago? Right, Hey,
you also started spending less money on this.

458
00:32:07.839 --> 00:32:10.960
It's like, yeah, it doesn't
mean causation, but I'm certainly going

459
00:32:12.000 --> 00:32:15.039
to look there first. Now.
I think people underestimate that really though,

460
00:32:15.079 --> 00:32:19.119
like just looking out, when does
the problem start? What else happened there?

461
00:32:19.160 --> 00:32:22.200
And I feel like it's so obvious
to say that, but it's usually

462
00:32:22.200 --> 00:32:24.480
one of the first things that's missed. Yeah, for sure. It's been

463
00:32:24.519 --> 00:32:28.240
like a pet project of mine for
a very long time. I have a

464
00:32:28.279 --> 00:32:31.640
super stale, busted repo trying to
do this. Back when, uh remember

465
00:32:31.720 --> 00:32:36.720
CEP, complex event processing? It was
all the hype and like maybe it wasn't

466
00:32:36.720 --> 00:32:39.519
all the hype, but it was
somewhat popular in the late two thousands,

467
00:32:39.839 --> 00:32:45.720
I was trying to do the uh
Holt-Winters forecasting, like the thing that's

468
00:32:45.720 --> 00:32:47.480
built in, you know, RRDtool
added that thing, again, ages ago,

469
00:32:47.680 --> 00:32:52.519
dating myself. You could, you
know, put the prediction line in front

470
00:32:52.559 --> 00:32:54.039
of your graph, And I always
thought like, yeah, let's just do

471
00:32:54.079 --> 00:32:58.039
that, like let's predict every time
series. Like it may not be great

472
00:32:58.039 --> 00:33:00.079
signal and I'm not going to look
at it unless something's broke, but hey,

473
00:33:00.160 --> 00:33:02.880
I've got twenty thousand metrics. I'm
not going to look at all of

474
00:33:02.920 --> 00:33:07.319
them, right, So when the
site starts getting slow, it'd be

475
00:33:07.359 --> 00:33:10.000
really interesting if I could also see, hey, I/O wait time on

476
00:33:10.039 --> 00:33:14.240
the database server went up. It's
okay. Are we sending queries? Did a

477
00:33:14.319 --> 00:33:16.839
disk go bad? Oh, hey, look, errors on this disk spindle started

478
00:33:16.880 --> 00:33:22.000
going up too, Like maybe there's
something here. And I think it solves

479
00:33:22.200 --> 00:33:29.160
another monitoring problem that people still make
is people monitor root causes and not what

480
00:33:29.200 --> 00:33:31.799
they actually want out of a system, right, like sending me an alert

481
00:33:31.799 --> 00:33:35.599
that my CPU is high at two
in the morning, Like, cool,

482
00:33:35.599 --> 00:33:38.039
we're using the CPU we paid for, like, thumbs up, back to bed, right?

483
00:33:38.640 --> 00:33:43.240
awesome capacity planning. But if it's
making the site slow, page me

484
00:33:43.279 --> 00:33:45.960
and say the site is slow,
right? Don't page me and say, don't

485
00:33:45.000 --> 00:33:47.640
page me and say Java is doing
GC pauses? Well, yeah, you're

486
00:33:47.680 --> 00:33:52.759
running Java. That's what it does all
day. If they're too long, you're

487
00:33:52.759 --> 00:33:55.440
going to fail a latency monitor and
page me on the thing that matters.

488
00:33:55.440 --> 00:33:59.880
So like, I want these root
causes to be the correlation engine to kind

489
00:33:59.880 --> 00:34:02.119
of help me figure out. Okay, well eighty thousand things go wrong,

490
00:34:02.160 --> 00:34:05.839
and if you put an alert on
all of those. You're back at that

491
00:34:05.880 --> 00:34:10.519
team that notices they don't get paged
for ninety minutes. Right, Yeah,

492
00:34:10.519 --> 00:34:14.559
if you look at every team that
has page fatigue, this is exactly the

493
00:34:14.559 --> 00:34:16.880
problem. Yeah, most people don't
even know why they get paged for ninety

494
00:34:16.880 --> 00:34:20.559
percent of the things. It's organically
grown over the life of the team.

495
00:34:20.880 --> 00:34:23.920
Hey, this one time, you
know, the database backups did this thing

496
00:34:24.039 --> 00:34:27.199
terribly, and now we page on
it. And then you have so many

497
00:34:27.239 --> 00:34:30.159
of those it's like, well,
there's eight thousand alerts and most of them

498
00:34:30.199 --> 00:34:34.760
resolve themselves, and we just shrug
our shoulders. And I don't think it

499
00:34:34.800 --> 00:34:37.880
starts from a good place. Like
I think there's this idea that just having

500
00:34:37.000 --> 00:34:42.480
dashboards is valuable in itself, and
it sort of propagates from there. It's

501
00:34:42.480 --> 00:34:44.280
like, oh, you know,
what are all the metrics we could be

502
00:34:44.320 --> 00:34:49.079
collecting, and let's show so.
And sometimes there is an organization that thinks

503
00:34:49.119 --> 00:34:52.239
that there is some inherent value,
like oh, this is how many requests

504
00:34:52.280 --> 00:34:55.199
we're getting, Let's use it to
get more headcount because we have to support

505
00:34:55.239 --> 00:35:00.159
such complex things. Or I remember
a previous company that thought it was really

506
00:35:00.159 --> 00:35:05.119
cool to have a geolocation dashboard of
where things were happening all over the world

507
00:35:05.360 --> 00:35:07.320
related to their thing, And I'm
like, you know what, I did

508
00:35:07.440 --> 00:35:09.519
need to drive that with real data. I could have just randomly pinged a

509
00:35:09.559 --> 00:35:13.960
spot on the world on a flat, you know, two dimensional map and

510
00:35:14.000 --> 00:35:15.559
be like, look it's happening.
I didn't need to know the lights coming

511
00:35:15.679 --> 00:35:20.559
up all over Yeah, for sure. It's just so totally unnecessary. And

512
00:35:20.599 --> 00:35:24.480
so I've been on this maybe personal
vendetta here to make them be actionable.

513
00:35:24.760 --> 00:35:29.119
So you know, what is the
business impact? But more than that,

514
00:35:29.199 --> 00:35:31.400
you know what will you do with
that information? Is the site being slow?

515
00:35:31.440 --> 00:35:36.280
Will you actually take an action here
to make it faster? Or is

516
00:35:36.280 --> 00:35:38.760
there like a run book or something
that we can go and actually execute on.

517
00:35:38.880 --> 00:35:42.960
Well that's another interesting part of it
all is like the term run book

518
00:35:42.960 --> 00:35:46.440
has been so ruined by so many
places. Like if your run book is

519
00:35:46.480 --> 00:35:52.280
log into the server and run the
script, like yeah, it's just not

520
00:35:52.320 --> 00:35:55.039
automated, and like I think our
run book should be these are the places

521
00:35:55.039 --> 00:35:58.960
you should look, like, if
it's ever a thing we know that breaks,

522
00:35:59.239 --> 00:36:01.840
fix it. Like, yeah,
I talked to someone once that was

523
00:36:01.840 --> 00:36:05.760
saying they were very proud. I
mean I felt that or they built this

524
00:36:05.800 --> 00:36:07.960
great thing and like we had this
service that crashes all the time, and

525
00:36:07.960 --> 00:36:13.039
we built this great thing that automatically
restarts it. I was like, okay,

526
00:36:14.320 --> 00:36:16.679
I see right. How about collecting
stack traces and sending it to developers

527
00:36:16.719 --> 00:36:21.679
and fixing the reason the service crashes at all? Like, yes, you should

528
00:36:21.719 --> 00:36:24.119
restart the service when it crashes,
no argument, But like, that is

529
00:36:24.119 --> 00:36:27.840
not the end of your journey.
That is the first ten percent of your

530
00:36:27.920 --> 00:36:30.519
journey, right, you know,
I you know, I love that because

531
00:36:30.519 --> 00:36:32.880
it actually happened at one of the
previous companies I was at. The

532
00:36:32.960 --> 00:36:38.000
name of the service was the Service
Monitor Monitor, and it did actually do

533
00:36:38.079 --> 00:36:43.039
this. The root cause, though, and maybe you've got some infinite wisdom

534
00:36:43.079 --> 00:36:49.039
here, is they were using a
library for math operations that had a memory

535
00:36:49.079 --> 00:36:52.920
leak in it, and so this
would have required actually contacting a third party

536
00:36:52.920 --> 00:36:57.960
company to get their open source software
actually fixed, which I think was under

537
00:36:57.960 --> 00:37:04.519
a proprietary use license. So sometimes
you have to do some ridiculous things.

538
00:37:05.119 --> 00:37:07.320
Yeah, I mean, I guess
if you're in a world where you have

539
00:37:07.400 --> 00:37:10.800
to use a vendor library you can't
fix, you may just have to you

540
00:37:10.840 --> 00:37:15.519
know, live with the workarounds and
acknowledge the terror. Yeah, no,

541
00:37:15.639 --> 00:37:20.119
for sure. The problem is when
you start extending that to every problem that

542
00:37:20.199 --> 00:37:23.880
looks similar, rather than knowing that
it's the right answer. Right, the nightly

543
00:37:23.920 --> 00:37:31.239
Jenkins restart. Yeah, well go
ahead. I would say, like,

544
00:37:31.280 --> 00:37:36.519
you know, at some
bank places I've worked, they were like,

545
00:37:36.559 --> 00:37:39.039
hey, we should just you know, we're twenty four by five point five,

546
00:37:39.159 --> 00:37:42.920
right, you know the Japan and
US markets and they're closed, you

547
00:37:42.920 --> 00:37:47.559
know, Friday night. Let's just
reboot everything every Saturday. I'm like why,

548
00:37:47.679 --> 00:37:51.320
Like why not? And I was
like, I think I asked the

549
00:37:51.400 --> 00:37:55.599
question first, like like yeah,
we should. We should reboot things when

550
00:37:55.639 --> 00:38:00.000
we upgrade the kernel, and we
should use that time to apply upgrades.

551
00:38:00.159 --> 00:38:02.599
But just doing a bunch of stuff
every Saturday because we can because there's no

552
00:38:02.679 --> 00:38:07.840
services. It's like it feels like
we're just making busy work and you know,

553
00:38:08.079 --> 00:38:10.800
finding things to break on a Saturday
and ruin our weekends. Yeah,

554
00:38:10.800 --> 00:38:15.320
but those are also really hard arguments to
argue against, even though you know there's

555
00:38:15.360 --> 00:38:23.239
something fundamentally wrong. Yeah, we
didn't do it. Regress, Yeah,

556
00:38:23.239 --> 00:38:27.239
someone you know, there was a
change set out to put a reboot in

557
00:38:27.239 --> 00:38:31.480
a cron job, and I was
like, that is no, that's something

558
00:38:31.559 --> 00:38:35.480
is going to be terrible there,
and I don't want to that's how to

559
00:38:35.559 --> 00:38:38.800
ruin a weekend 101.
Yeah, totally. And it was already

560
00:38:38.840 --> 00:38:43.800
like that because I don't know.
I think twenty four by seven is a

561
00:38:43.880 --> 00:38:47.280
way better environment to have fundamentally good
operations than twenty four by five point five.

562
00:38:47.280 --> 00:38:51.360
There's just too many bad habits in
twenty four by five point five. Oh,

563
00:38:51.599 --> 00:38:53.920
we can always restart this by turning
off the database and running, or turning off

564
00:38:53.960 --> 00:38:57.599
the clients and running an alter.
Eventually you're going to have to run an

565
00:38:57.599 --> 00:39:00.880
alter mid week during trading. Yeah, you're not gonna like it, but

566
00:39:00.880 --> 00:39:02.320
it's gonna happen, and you're gonna
have to take an outage to do

567
00:39:02.320 --> 00:39:05.599
it because you haven't figured out how
to do it. I mean, there's

568
00:39:05.639 --> 00:39:08.000
two obvious failure modes from that,
which are, well, what happens if

569
00:39:08.039 --> 00:39:13.320
something changes on the restart, like
you know, a new upgrade or something

570
00:39:13.400 --> 00:39:17.280
right now you're triggering that at an
unpredictable time, or just straight database replication

571
00:39:17.440 --> 00:39:21.760
crashing and losing out on whatever
was in the journal at that moment,

572
00:39:22.639 --> 00:39:25.079
that's not the thing you want to
actually have happen. Yeah, do it

573
00:39:25.119 --> 00:39:28.719
in your lab to figure out how
to deal with it, but maybe not

574
00:39:28.760 --> 00:39:37.000
prod. Yeah. So you've mentioned
Jenkins and Gerrit, and so I feel

575
00:39:37.039 --> 00:39:40.519
like I'm picking up on a trend
here that you're running a lot of services

576
00:39:40.679 --> 00:39:45.239
in house, whereas other companies may
choose to use SaaS providers for those. Do

577
00:39:45.280 --> 00:39:50.159
you have a particular opinion on that? Well, I mean I feel like

578
00:39:50.360 --> 00:39:54.360
my whole career, I've been at
all angles of the build versus buy debate.

579
00:39:54.800 --> 00:39:59.719
It was funny some of the banks
I've worked at, or really any of

580
00:39:59.719 --> 00:40:04.440
the big orgs you see, like
the historical cio CTOs whoever they are,

581
00:40:04.760 --> 00:40:08.039
one will get hired and
they're like, we have to build everything. Then

582
00:40:08.239 --> 00:40:10.760
then you can you know, you
have ten years of this right, and

583
00:40:10.760 --> 00:40:14.400
then you look back and you have
all these disparate things of like, oh,

584
00:40:14.440 --> 00:40:17.239
this must have been built during the
build era of two thousand and nine.

585
00:40:17.679 --> 00:40:22.880
I mean, I think the answer
is, really it depends on the

586
00:40:22.920 --> 00:40:25.920
staff and what their expertise is,
and should you run Jenkins in house?

587
00:40:25.920 --> 00:40:29.599
Well, can you run Jenkins in house? Right? Like, do you have, are

588
00:40:29.639 --> 00:40:34.480
you going to dedicate the resources to
do it? Right? But just outsourcing

589
00:40:34.519 --> 00:40:37.639
something isn't always as easy as just
paying somebody, right we mentioned before,

590
00:40:37.639 --> 00:40:43.360
it's porting stuff there, it's operationalizing
it, and actually, like taking taking

591
00:40:43.360 --> 00:40:47.800
advantage of a SaaS service to
provide value is sometimes just as hard.

592
00:40:49.119 --> 00:40:52.559
The challenge isn't running the infrastructure,
it's using the infrastructure effectively. Right,

593
00:40:52.599 --> 00:40:58.800
Having Jenkins doesn't help. Having jobs
that do meaningful things is important, and

594
00:40:58.880 --> 00:41:01.280
you have that problem in any CI
infrastructure. So I think some of the

595
00:41:01.320 --> 00:41:07.400
things that doesn't matter, you know, observability, right, it's really hard

596
00:41:07.480 --> 00:41:12.920
to run high availability observability. It's
also a mouthful to say. So, I

597
00:41:12.960 --> 00:41:15.039
mean, like I'm a fan of
I mean worker reservability company, but I

598
00:41:15.039 --> 00:41:17.719
mean I'm a fan of outsourcing some
of that stuff. I think there's a

599
00:41:17.719 --> 00:41:22.239
certain scale where you run it internally, I think very few people are at

600
00:41:22.239 --> 00:41:29.760
that scale. Yeah, very few
people are actually running for four nines proper,

601
00:41:30.360 --> 00:41:32.719
Like that's I don't know if it's
a triple digit number, but it's

602
00:41:32.719 --> 00:41:36.119
a small amount of companies and actually
a lot of people think they're doing it,

603
00:41:36.159 --> 00:41:38.320
but not a lot of people actually
do it, right, It's interesting.

604
00:41:38.320 --> 00:41:42.360
I mean, we're we're five nines
on our core competency service, which

605
00:41:42.400 --> 00:41:45.440
is like identifying stuff, but yeah, it's huge. We're not we're not

606
00:41:45.599 --> 00:41:50.719
running the monitoring observability stuff ourselves like
we are, like we've found we're using

607
00:41:50.719 --> 00:41:52.760
our cloud provider or actually we're still
in the process of trying to find a

608
00:41:52.840 --> 00:41:57.079
vendor that actually works with us.
And I think that's going to fuel my

609
00:41:57.199 --> 00:42:00.280
next question. I'm curious whether or
not you see companies get the build versus

610
00:42:00.280 --> 00:42:05.719
buy decision right. I know
that they put a lot of effort into

611
00:42:05.840 --> 00:42:09.079
comparing vendors, but then I feel
like it's sort of there's this gap on

612
00:42:09.679 --> 00:42:15.920
actually being able to correctly identify what
the total cost of ownership is if they

613
00:42:15.920 --> 00:42:19.840
do actually go and build or run
something themselves. You know that that is

614
00:42:19.840 --> 00:42:22.360
a tough one, right. TCO
is so hard because it's easy to do

615
00:42:22.440 --> 00:42:25.360
the this is my AWS bill,
this is the vendor bill. How do

616
00:42:25.360 --> 00:42:30.599
you quantify the outages and the people
and the stress and the upgrades? And

617
00:42:30.159 --> 00:42:32.199
yeah, I haven't seen a good
answer for that. I mean, of

618
00:42:32.239 --> 00:42:36.360
course many you know, any vendor
will try to do that for you because

619
00:42:36.400 --> 00:42:39.079
they of course want to help.
Yeah, good, but that one is

620
00:42:39.119 --> 00:42:43.639
tough, I mean, but seeing
it done right. I think the most

621
00:42:43.679 --> 00:42:47.320
common mistake people make when they're doing
the evaluation is, writing a requirements doc is

622
00:42:47.320 --> 00:42:52.639
surprisingly hard. Yeah, like writing
a requirements doc for CI and don't use

623
00:42:52.679 --> 00:42:58.239
a single product name. Right, right.
People say, like, my requirement is

624
00:42:58.280 --> 00:43:00.199
an Envoy proxy. It's like,
there's absolutely no way. If your requirement

625
00:43:00.239 --> 00:43:04.760
is an Envoy proxy, what
have you done? Right? If your

626
00:43:04.760 --> 00:43:07.760
requirement is a thing that speaks xDS, maybe Envoy's the only answer. But

627
00:43:07.840 --> 00:43:12.679
write down the thing you want,
the outcome you want. This is the

628
00:43:12.719 --> 00:43:16.159
same root cause versus slow thing,
like, it's the XY problem

629
00:43:16.199 --> 00:43:19.679
as well. I don't know if
yeah yeah, people say I want to

630
00:43:19.679 --> 00:43:22.559
do this thing is like, what
are you really trying to accomplish? And

631
00:43:22.599 --> 00:43:25.320
then let's figure out this is what
we could build and accomplish that. This

632
00:43:25.360 --> 00:43:29.360
is what we can buy and accomplish
that or not accomplish that, and maybe

633
00:43:29.400 --> 00:43:31.440
the decision gets more obvious, but
it's just so hard to frame it in

634
00:43:31.519 --> 00:43:36.840
terms of completely agnostic to the tool, the thing you want to do.

635
00:43:37.400 --> 00:43:38.760
You get to be the bad guy. You you have to say, well,

636
00:43:39.119 --> 00:43:42.199
you were going to do an easy
job, which is just pick a

637
00:43:42.239 --> 00:43:45.000
tool. Uh, and now you're
forcing them to go back to the drawing

638
00:43:45.039 --> 00:43:49.840
board like well, why you know, really really look right? Why do

639
00:43:49.880 --> 00:43:54.000
you want your builds to succeed?
Yeah? Exactly, But it's a good

640
00:43:54.079 --> 00:43:57.239
question, like in the CI case, like why do you want your builds

641
00:43:57.280 --> 00:44:00.400
to succeed? Because I want to
merge code faster and ship code to production

642
00:44:00.480 --> 00:44:04.599
faster. Okay, so your metric
is time from commit to development, time

643
00:44:04.599 --> 00:44:07.199
from commit to production. Okay,
we came up with really great metrics from

644
00:44:07.199 --> 00:44:14.480
asking why. I think it usually
it seems kind of like, I don't

645
00:44:14.480 --> 00:44:17.119
know, weird at first, and
you know, contrived, but I think

646
00:44:17.159 --> 00:44:20.840
it actually leads to like, oh, okay, yeah that makes sense.
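
NOTE
A minimal sketch of the "time from commit to production" metric discussed
in the cues above. The record layout and the sample timestamps are
hypothetical, chosen only to show the arithmetic.
```python
from datetime import datetime
from statistics import median
def lead_times_minutes(changes):
    """Minutes from commit to production for each (committed_at, deployed_at) pair."""
    return [(deployed - committed).total_seconds() / 60 for committed, deployed in changes]
# Hypothetical sample data: three changes with their commit and deploy times.
changes = [
    (datetime(2023, 1, 2, 9, 0), datetime(2023, 1, 2, 9, 45)),
    (datetime(2023, 1, 2, 10, 0), datetime(2023, 1, 2, 12, 0)),
    (datetime(2023, 1, 3, 14, 0), datetime(2023, 1, 3, 14, 30)),
]
durations = lead_times_minutes(changes)
print(f"median commit-to-production: {median(durations):.0f} min, worst: {max(durations):.0f} min")
```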

647
00:44:21.079 --> 00:44:23.639
No, I'm totally what do Yeah, it makes me think that there's like

648
00:44:23.719 --> 00:44:28.000
some Freudian stuff here where I just
want to sit here with a pipe and

649
00:44:28.039 --> 00:44:34.039
go. But why do you want
your build to succeed? Well, is

650
00:44:34.039 --> 00:44:37.719
it because of unresolved issues with your
mother that you feel your build must succeed?

651
00:44:38.360 --> 00:44:44.400
I mean when you pull individuals into
an organization, their personal values do

652
00:44:44.679 --> 00:44:50.679
impact what that organization drives as important
and sometimes I've known software engineers to be

653
00:44:51.039 --> 00:44:55.159
uh, quite illogical and driven by
their emotional state to you know, it

654
00:44:55.199 --> 00:44:59.400
has to be like this, it's
so much better, and you know,

655
00:44:59.440 --> 00:45:02.760
you joke, but there is something
there. No, that's definitely I mean,

656
00:45:02.800 --> 00:45:13.719
I'm probably guilty of that. Yeah. So you work for an observability

657
00:45:13.760 --> 00:45:17.360
company and you're observing your internal tools. Do you use your own product for

658
00:45:17.480 --> 00:45:22.119
that? We do. Yeah,
yeah, trying to dogfood all the

659
00:45:22.159 --> 00:45:29.400
time and say we measure SLOs internally. One cool thing we've done is we

660
00:45:29.480 --> 00:45:31.159
have a bunch of you know,
it's not great, like you know,

661
00:45:31.239 --> 00:45:36.639
some some shell scripts stuff that maybe
shouldn't be shell scripts, but everyone's got

662
00:45:36.760 --> 00:45:40.119
a pile of that somewhere and it's
pretty important in the normal workflow. And

663
00:45:40.159 --> 00:45:45.039
it was having a lot of problems, and uh, you know, we

664
00:45:45.119 --> 00:45:47.239
just found a better way to measure
like what is the success of people that

665
00:45:47.360 --> 00:45:51.159
say I want to run a local
cluster? How often does that fail?

666
00:45:52.039 --> 00:45:57.840
Turns out it was failing a lot
more than we thought. And then the

667
00:45:57.840 --> 00:45:59.960
debugging, it's really hard because you're
like, okay, well could you go

668
00:46:00.159 --> 00:46:02.320
put a set -x in this
file and run it again in the

669
00:46:02.360 --> 00:46:07.280
mouth, and no one wants to
like, no one wants to do that,

670
00:46:07.400 --> 00:46:08.239
right. People just want to say, hey, the thing's broken,

671
00:46:08.360 --> 00:46:12.519
can't you just fix it?
Right? So we built all of this

672
00:46:12.639 --> 00:46:15.559
in even to our shell right,
so like we have like a tracing view

673
00:46:15.599 --> 00:46:20.079
I think of microservice tracing.
Like there's a request ID and you

674
00:46:20.159 --> 00:46:22.280
know it hits tny services, Like
you run this one shell script that actually

675
00:46:22.320 --> 00:46:27.000
is running like you know, ninety
three shell scripts underneath it for better or

676
00:46:27.079 --> 00:46:30.239
worse or worse, but it's happening, and we have like a trace graph

677
00:46:30.280 --> 00:46:34.360
of you ran this and like so
the people come and say, man,

678
00:46:34.400 --> 00:46:37.280
it took nine minutes to start this
thing and it normally takes six minutes.

679
00:46:38.159 --> 00:46:42.800
This sucks. Why is this?
So we can go in, look up

680
00:46:42.840 --> 00:46:45.480
their username, find the
nine-minute run, click it, look

681
00:46:45.559 --> 00:46:49.000
at the trace, and go,
oh, yeah, there was this bug

682
00:46:49.079 --> 00:46:53.280
pulling from ECR or whatever it is, right, and just doing that helped

683
00:46:53.320 --> 00:46:57.800
us pin down. It turns out
all these problems with the tool were like

684
00:46:57.840 --> 00:47:00.599
systemic to one or two things being
wrong, you know, poor assumptions being

685
00:47:00.599 --> 00:47:04.880
made, and we fix those and
we're kind of off to the races on

686
00:47:04.920 --> 00:47:08.079
it, and now we're building a
prober for it, so we'll have like

687
00:47:08.119 --> 00:47:10.039
a graph. You know, again, everything has a graph, right,

688
00:47:10.079 --> 00:47:14.559
So the same way we have an
SLO for ingesting an observation in a certain

689
00:47:14.599 --> 00:47:17.079
amount of time, we'll have an
SLO for, hey, this thing can create

690
00:47:17.119 --> 00:47:22.000
an environment locally and it happens in
less than this amount of time. And

691
00:47:22.000 --> 00:47:24.960
then we'll have variants of it,
like does it work for people that run

692
00:47:24.960 --> 00:47:28.519
it over and over again on their
box? What about a new engineer that

693
00:47:28.519 --> 00:47:30.320
logs into a fresh box and runs
it, because that's always a different you

694
00:47:30.400 --> 00:47:35.719
know, there's some you know,
that's where always the goblins live in these

695
00:47:35.719 --> 00:47:37.760
things. Well, my Terraform works
fine. Oh man, I destroyed it.

696
00:47:37.800 --> 00:47:40.239
Have to run it from scratch,
you know, does that work?

697
00:47:40.480 --> 00:47:45.760
So kind of trying to measure all
those different things from that using our tool.
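
NOTE
A minimal sketch of the idea described in the cues above: one run ID shared
by every step of a script-of-scripts, with per-step duration and exit code,
so a slow or failed run can be looked up by user. The function name, field
names, and sample steps are hypothetical, not the tooling used at Observe.
```python
import subprocess
import sys
import time
import uuid
def run_traced(steps, user):
    """Run each (name, argv) step in order and return one span dict per step."""
    run_id = uuid.uuid4().hex  # plays the role of the "request ID" for the whole run
    spans = []
    for name, argv in steps:
        start = time.monotonic()
        result = subprocess.run(argv, capture_output=True, text=True)
        spans.append({
            "run_id": run_id,
            "user": user,
            "step": name,
            "seconds": time.monotonic() - start,
            "exit_code": result.returncode,
        })
        if result.returncode != 0:
            break  # stop at the first failing step, like `set -e` would
    return spans
# Hypothetical steps; a real wrapper would invoke the actual shell scripts.
steps = [
    ("pull-images", [sys.executable, "-c", "print('pulled')"]),
    ("start-cluster", [sys.executable, "-c", "raise SystemExit(3)"]),
    ("seed-data", [sys.executable, "-c", "print('never runs')"]),
]
spans = run_traced(steps, user="example-user")
for span in spans:
    print(f"{span['step']}: exit={span['exit_code']} in {span['seconds']:.2f}s")
```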

698
00:47:46.760 --> 00:47:51.559
Is there like an eBPF integration here
that you plan on utilizing in the

699
00:47:51.599 --> 00:47:55.679
future to understand what the requests are
or how the script is running fundamentally on

700
00:47:55.719 --> 00:47:59.039
the machine. Yeah, I think
I think we're looking at that, and

701
00:47:59.119 --> 00:48:01.039
I haven't dealt with it too much
myself, but I think that that is

702
00:48:01.119 --> 00:48:05.679
kind of the ultimate for this,
I would I think marry the two right

703
00:48:06.119 --> 00:48:08.039
eBPF for the raw, like, just show
me everything that's running and help me, and

704
00:48:08.079 --> 00:48:15.119
then my injected data. So yeah, I think that is in the future

705
00:48:15.639 --> 00:48:20.159
for your internal tools, Like once
you identify them, you're like, Okay,

706
00:48:20.440 --> 00:48:23.639
we need to bring this up to
being like a part of our we

707
00:48:23.639 --> 00:48:28.199
need to treat it like it's part
of our core infrastructure. How do you

708
00:48:29.719 --> 00:48:35.800
socialize that across engineering so that everyone
knows that this is a supported tool.

709
00:48:35.840 --> 00:48:38.519
This is a preferred path if you
were thinking about going and building something on

710
00:48:38.559 --> 00:48:42.559
your own for this one use case, we already have you covered here.

711
00:48:42.760 --> 00:48:49.079
How do you communicate that? Email to
the whole list? Pin a Slack message?

712
00:48:49.079 --> 00:48:52.800
What could go wrong, right? @channel, baby. You can

713
00:48:52.880 --> 00:48:55.920
write documentation all day, but like
I think we all know how much our

714
00:48:55.920 --> 00:49:01.239
documentation gets read, right, I
think I think the way to do it

715
00:49:01.320 --> 00:49:06.119
is to, this, you know, goes
to the layer 8 problem we were talking about

716
00:49:06.119 --> 00:49:08.400
earlier, but like have to build
a good relationship with all these teams and

717
00:49:08.440 --> 00:49:13.639
have them come to you sooner in
the process, like the sooner infrastructure and

718
00:49:13.800 --> 00:49:17.599
SRE folks can be involved
in anything, like before code is written

719
00:49:17.639 --> 00:49:21.239
would be super ideal, so that
them come to you and say, hey,

720
00:49:21.280 --> 00:49:23.079
I wrote this really cool thing,
but it uses MongoDB. It's

721
00:49:23.079 --> 00:49:27.719
like, we don't do that here, right, Like maybe there's a great

722
00:49:27.760 --> 00:49:30.000
reason that does that, but you
know, but it's much harder once they've

723
00:49:30.000 --> 00:49:32.719
written code, right, Like,
you know, it's the poker pot-committed

724
00:49:32.760 --> 00:49:36.320
thing, like, well, I called with the
flush draw on the flop, I'm putting all

725
00:49:36.320 --> 00:49:37.519
my money in on the turn no matter
what. It's like, let's do the

726
00:49:37.559 --> 00:49:42.639
math. Not great, get to
them, you know, pre flop,

727
00:49:42.880 --> 00:49:45.719
right, Like should you even be
in this pot with MongoDB? Right?

728
00:49:45.760 --> 00:49:50.920
Maybe? Yeah, but that's not
easy to do, right, You

729
00:49:51.000 --> 00:49:54.360
have to There's not a technical answer
there. That's just build relationships, make

730
00:49:54.400 --> 00:50:00.159
your team available, make people you
know, had interactions people like you know.

731
00:50:00.599 --> 00:50:02.199
I think it was back to the
sort of product management thing earlier that

732
00:50:02.239 --> 00:50:07.199
we were talking about, where if
you are at the point where you need

733
00:50:07.239 --> 00:50:09.440
to tell people about the thing that
you're working on, like maybe you didn't

734
00:50:09.440 --> 00:50:15.000
approach the situation necessarily in the best
way rather than driving it from how you

735
00:50:15.039 --> 00:50:16.920
would a startup, which is okay, you know, what do our customers

736
00:50:17.000 --> 00:50:20.840
pain points look like? And you
know, as users, what do they

737
00:50:20.840 --> 00:50:22.360
want? And they're coming to us
and saying, hey, when are you

738
00:50:22.400 --> 00:50:27.280
done with this thing that we asked
for? And the implementation details are of

739
00:50:27.320 --> 00:50:30.880
course what you're picking because you know
that best. But fundamentally it's just a

740
00:50:30.920 --> 00:50:35.760
matter of pinging them on whatever RSS
feed that they're looking at. Yeah.

741
00:50:35.880 --> 00:50:40.079
I worked at a large company before
and we had an infrastructure PM and it

742
00:50:40.159 --> 00:50:45.360
was amazing. It was just so
it felt like it was easy mode,

743
00:50:45.400 --> 00:50:47.679
right, Like someone else is going
to go gather requirements I don't have.

744
00:50:49.320 --> 00:50:52.039
Yeah, I just I'll go do
some work. That's that's fine. And

745
00:50:52.039 --> 00:50:54.960
then they show up, you know, prioritize on a list. It's like,

746
00:50:55.000 --> 00:50:59.480
well this is this is awesome,
this is what it's like on the

747
00:50:59.480 --> 00:51:01.760
other side, right, you just
need to plug that into ChatGPT and

748
00:51:01.880 --> 00:51:04.920
get the answer out as well,
and then you can just you know,

749
00:51:04.960 --> 00:51:10.239
stop doing all the work altogether.
What do my engineers want? That's interesting?

750
00:51:10.960 --> 00:51:15.239
Get a Slack bot that uses
ChatGPT to act as a PM role.

751
00:51:16.920 --> 00:51:20.360
I was just the other way around. Have the PMs just answer,

752
00:51:20.480 --> 00:51:22.079
like answer the question of what they
want to have built, and then it

753
00:51:22.119 --> 00:51:27.960
will automatically build it for them.
This is what the hive mind Internet wants

754
00:51:28.000 --> 00:51:31.920
to build. Probably not so bad
if everyone wants to. I don't think

755
00:51:31.920 --> 00:51:37.679
you'll get to five nines. I
think it's more like five two's will be

756
00:51:37.719 --> 00:51:45.079
in the front in there somewhere if
you carry it out to enough decimal places.

757
00:51:45.119 --> 00:51:51.639
There's some nines. You know,
non repeating imager and doesn't really good

758
00:51:54.480 --> 00:51:59.079
cool, So what else should we
be thinking about? For internal tools?

759
00:52:00.320 --> 00:52:05.280
What's your big takeaway piece of advice? Apply all the same rigor. Like, you

760
00:52:05.280 --> 00:52:10.079
know, when PROD breaks, you
go and declare an incident using your incident

761
00:52:10.119 --> 00:52:15.039
management tool. You have a communications
role, you have you have the whole

762
00:52:15.039 --> 00:52:16.519
thing or you got to run,
but hopefully if you don't, you should,

763
00:52:17.360 --> 00:52:21.119
And then when you're done, you
write a post mortem. You might

764
00:52:21.159 --> 00:52:23.119
even publish it to customers. Publish it
internally, right, Like why shouldn't an

765
00:52:23.119 --> 00:52:27.559
engineer be able to read about why
Jenkins broke? If production breaks, you're

766
00:52:27.559 --> 00:52:30.119
going to file a bunch of follow
ups. You're going to prioritize it over

767
00:52:30.199 --> 00:52:32.159
other work because we don't want prod
to break again. Same thing for internal

768
00:52:32.199 --> 00:52:37.039
tools, Like it's really I think
that's the big takeaway is there's really not

769
00:52:37.159 --> 00:52:40.639
much of a difference. I mean, people say, well, if prod

770
00:52:40.719 --> 00:52:44.639
is down, our customers that pay
us can't do work. Okay, great,

771
00:52:45.039 --> 00:52:47.760
if CI is down, how are
you going to ship a fix when

772
00:52:47.880 --> 00:52:51.880
PROD is down? Right? Like
you're like, oh, I brought this,

773
00:52:51.960 --> 00:52:54.760
I wrote this really great infrastructure where
changes can only go through CI and

774
00:52:54.800 --> 00:52:59.760
not handmade. That's a great thing. But if your CI isn't as

775
00:53:00.039 --> 00:53:02.559
available as production, three nine CI,
four nine production. You can only

776
00:53:02.599 --> 00:53:07.079
make changes through CI. You know, not a math major. But we're

777
00:53:07.079 --> 00:53:10.559
going to have a problem here forty
minutes out of the year. You know

778
00:53:10.559 --> 00:53:15.119
it's going to be an issue.
So I think that treating it, I

779
00:53:15.159 --> 00:53:19.960
think people just underestimate how important that
stuff really is and the impact it can

780
00:53:19.960 --> 00:53:22.239
have. You don't have to wait
till it. You know. The worst

781
00:53:22.239 --> 00:53:28.840
case is production is having a problem. Your monitoring is broken and you can't

782
00:53:28.840 --> 00:53:30.719
see it. Your CI is broken
and you can't ship a fix for it,

783
00:53:31.079 --> 00:53:35.519
whether that fix is config or code, I mean, and then that's

784
00:53:35.519 --> 00:53:38.119
a really terrible postmortem to have to
send out to a customer of well,

785
00:53:38.159 --> 00:53:42.760
we knew what was wrong, but
we had the fix. We couldn't build

786
00:53:42.800 --> 00:53:46.119
the Docker image that had the fix
because, you know, our two thousand and four

787
00:53:46.159 --> 00:53:51.280
era Jenkins decided to crash and
not start yet. Yeah,

788
00:53:51.400 --> 00:53:53.079
that's not It was not a good
look for anybody. I mean, you

789
00:53:53.119 --> 00:53:58.000
identified it. If there's value in
doing this activity for some of your services

790
00:53:58.079 --> 00:54:00.039
because of what the users look like, then there's probably value in doing it

791
00:54:00.079 --> 00:54:04.400
for other services that just happen to
be internal. I think that's a big

792
00:54:04.440 --> 00:54:07.280
part of post mortems, right,
Like if you find a production problem where

793
00:54:07.440 --> 00:54:10.440
oh, hey, we had this
bug in our database connection pool and we

794
00:54:10.480 --> 00:54:14.440
had this reconnection issue and this thing
happened. Where else do you have connection

795
00:54:14.480 --> 00:54:16.519
pools? Right? That should be
the logical question. You ask the same

796
00:54:16.519 --> 00:54:21.440
thing internally, Right, we had
this neglected service. Where else what else

797
00:54:21.599 --> 00:54:24.880
is flying under the radar that's going
to bite us? And those are the

798
00:54:24.880 --> 00:54:28.599
tough post, you know, some of
the postmortem actions are like, fix this bug

799
00:54:28.639 --> 00:54:30.960
with a reproduced case. Okay,
that's like a day of work, right,

800
00:54:30.239 --> 00:54:32.360
you know when you're done because
the test passes, it's easy.

801
00:54:34.000 --> 00:54:38.199
These are much more ominous, like
project the follow ups. But if you

802
00:54:38.199 --> 00:54:43.199
don't do them, you're going to
pay the price. Is there like some

803
00:54:43.360 --> 00:54:47.599
obvious pitfall that a lot of companies
or maybe even everyone seems to get wrong

804
00:54:47.800 --> 00:54:52.920
in this area sort of besides the
stuff we've been talking about, well in

805
00:54:52.960 --> 00:54:59.840
post mortems, people are extraordinarily bad
at distinguishing between root causes and triggers.

806
00:55:00.920 --> 00:55:04.440
Right, Pete typed this command and
took the site down, right? Root

807
00:55:04.480 --> 00:55:09.320
cause is not "Pete sucks," right? That's right. Like, the root cause

808
00:55:09.440 --> 00:55:13.599
is why was there a command to
where are the where are the seatbelts?

809
00:55:13.639 --> 00:55:15.440
Where are the gates? Where is
the code review? Where's all that stuff?

810
00:55:15.480 --> 00:55:19.920
Or you have to really really I
think the five whys thing is interesting.

811
00:55:19.920 --> 00:55:22.360
I don't think you actually have to
write down why five times in a

812
00:55:22.440 --> 00:55:27.159
document and fill it out. I think that's a
little bit, you know, demeaning and

813
00:55:27.239 --> 00:55:30.000
not great, but the philosophy of
like, really come to the root cause

814
00:55:30.039 --> 00:55:36.440
of the problem, right, Like
root causes aren't, well, AWS had an

815
00:55:36.480 --> 00:55:39.239
availability zone die? Like, what
can we do? You can run more

816
00:55:39.280 --> 00:55:42.679
than one availability zone? You can
do this, You can do that right,

817
00:55:42.760 --> 00:55:45.599
like, and maybe you choose not
to at this point, but you

818
00:55:45.599 --> 00:55:50.239
should at least identify it and say
Okay, like we know the root cause,

819
00:55:50.320 --> 00:55:53.000
and we've chosen that this is a
risk, and this is why we're

820
00:55:53.000 --> 00:55:57.599
a three nine service or a four
nine service. And maybe someday we'll make

821
00:55:57.599 --> 00:55:59.920
it better, maybe we won't,
but at least being honest with yourself about

822
00:55:59.960 --> 00:56:04.800
it. Ah, that's huge right
there. I want to highlight that because

823
00:56:06.760 --> 00:56:08.800
like, just because you identify the
root cause doesn't mean you have to do

824
00:56:08.840 --> 00:56:16.119
anything about it. Because I've seen
multiple instances where companies build infrastructure that is

825
00:56:16.280 --> 00:56:22.960
far beyond their budget and their actual
requirements because they're focused on that. And

826
00:56:22.360 --> 00:56:29.719
the analogy I like to use is
like, whenever I go to work every

827
00:56:29.800 --> 00:56:35.480
day, the fastest, most efficient
way for me to get there is buying

828
00:56:35.519 --> 00:56:39.800
my own jet copter, But my
budget really says I should stick with my

829
00:56:39.840 --> 00:56:44.320
eighty seven Toyota Corolla, you know, and so you have to, like, balance those

830
00:56:44.320 --> 00:56:49.159
two things, right, right,
Maybe the fix is leave ten minutes earlier

831
00:56:49.239 --> 00:56:52.920
instead instead of mind Yeah exactly.
Yeah. Well I think also it comes

832
00:56:52.920 --> 00:56:55.920
back to SLOs of like, Okay, we had this big problem, but

833
00:56:57.119 --> 00:57:00.360
hey, was it a big problem?
You know? Like it's hard back to

834
00:57:00.360 --> 00:57:02.440
the emotion thing, right, Like
if a big customer is impacted, it

835
00:57:02.760 --> 00:57:07.239
gets a lot more priority, but
you have to quantify at the end of

836
00:57:07.239 --> 00:57:08.960
the day, like, hey,
this is how many nines we have.

837
00:57:09.079 --> 00:57:14.320
This is our error budget. We
used it during this incident. Maybe that's

838
00:57:14.360 --> 00:57:17.239
not good, but that's this is
par for the course, and we don't

839
00:57:17.320 --> 00:57:22.440
need to suddenly become multi cloud,
multi region, you know, all the

840
00:57:22.480 --> 00:57:25.840
things load balancing, complexity, because
that's the other thing is looking at.

841
00:57:25.960 --> 00:57:30.920
You know, you can add nines
with complexity, but complexity can also reduce

842
00:57:30.000 --> 00:57:35.920
nines. And you have to be
careful about over-engineering in response to things.
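
NOTE
A minimal sketch of the "how many nines, what is our error budget"
arithmetic from the cues above, assuming a 365-day year. The function
names and the sample 20-minute incident are hypothetical.
```python
MINUTES_PER_YEAR = 365 * 24 * 60
def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of `nines` nines (3 means 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines
def budget_remaining(nines, incident_minutes):
    """Fraction of the yearly error budget left after the given incidents."""
    budget = allowed_downtime_minutes(nines)
    return 1 - sum(incident_minutes) / budget
for nines in (3, 4, 5):
    print(f"{nines} nines: {allowed_downtime_minutes(nines):.1f} min/year")
# Hypothetical incident: one 20-minute outage against a four-nines target.
print(f"budget left: {budget_remaining(4, [20]):.0%}")
```
Under these assumptions three nines allows roughly 526 minutes a year,
four nines roughly 53, and five nines roughly 5.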

843
00:57:35.960 --> 00:57:38.760
And I hate to use the term
like you know this is always the escalator.

844
00:57:38.880 --> 00:57:42.119
Well, if there's an act of
God that you know, this managed

845
00:57:42.119 --> 00:57:44.360
service goes away, it's like,
well, what's an act of God?

846
00:57:44.440 --> 00:57:47.079
Is that an AZ going down? I
don't know, that's just expected, right,

847
00:57:49.239 --> 00:57:52.320
Is it a tornado hit the Virginia
area and all of us-east-1

848
00:57:52.320 --> 00:57:53.639
went away? Okay, there's
an act of God we don't have to

849
00:57:53.639 --> 00:57:58.480
plan for. But if you're trying
to be a five nine service, you're

850
00:57:58.480 --> 00:58:00.480
not running in one region anyway,
So you know, it kind of all

851
00:58:00.559 --> 00:58:04.440
has to that has to make sense
together. You know, the whole story

852
00:58:04.440 --> 00:58:07.360
has to just kind of flow.
And that's where I think some people get

853
00:58:07.400 --> 00:58:10.320
off. They they write an SLO, but they don't have a story to

854
00:58:10.360 --> 00:58:15.920
back it, or they write a
really complex story where they don't need for

855
00:58:15.000 --> 00:58:22.719
an SLO that's simpler. Yeah,
point for sure. But it's an art

856
00:58:22.840 --> 00:58:25.840
so it's you know, there's no
right or wrong answer. I always say

857
00:58:25.960 --> 00:58:30.880
so, everything is so technical and SLOs
are this, like, wafty, what is right?

858
00:58:30.920 --> 00:58:35.360
What is wrong? Who really knows? You just kind of have to

859
00:58:35.360 --> 00:58:37.679
do it. People always we were
rolling out SLOs in a company. How

860
00:58:37.679 --> 00:58:39.960
do I know my SLO's right?
I was like, well, you know,

861
00:58:40.000 --> 00:58:44.400
it's not. It's your first SLO. So you build it and you

862
00:58:44.519 --> 00:58:47.440
have post mortems and you adjust it
as you go. You know, were

863
00:58:47.480 --> 00:58:51.159
you at you know, did you
have an outage? Yes? Did your

864
00:58:51.280 --> 00:58:54.519
SLO show it? No? Cool. Make
it more aggressive? Did you not have

865
00:58:54.559 --> 00:58:58.239
an outage? And your SLO says
you had an outage? Make it less

866
00:58:58.239 --> 00:59:00.800
aggressive? Right? Like it sounds
simple, but that's just the feedback loop.
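
NOTE
A minimal sketch of the feedback loop described in the cues above: after
each postmortem, compare what users experienced with what the SLO showed
and adjust. The function name and the sample history are hypothetical.
```python
def review_slo(had_outage, slo_breached):
    """Suggest an SLO adjustment from one postmortem, per the feedback loop described above."""
    if had_outage and not slo_breached:
        return "tighten"  # users were hurt but the SLO stayed green: make it more aggressive
    if slo_breached and not had_outage:
        return "loosen"  # the SLO fired without real user pain: make it less aggressive
    return "keep"  # the SLO agreed with reality
# Hypothetical review history: (had_outage, slo_breached) per period.
history = [(True, False), (False, True), (True, True), (False, False)]
print([review_slo(outage, breached) for outage, breached in history])
```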

867
00:59:00.800 --> 00:59:05.079
And if you're doing it twelve months
later, you probably have a pretty

868
00:59:05.079 --> 00:59:09.599
decent setup and have a much better
idea of what your customers consider an outage

869
00:59:09.679 --> 00:59:14.000
or not. You hit on something
really interesting there. Actually, So if

870
00:59:14.159 --> 00:59:17.559
SLA is your you know, contracted
amount and the and the I is whatever,

871
00:59:17.559 --> 00:59:22.239
your indicator is always just an objective. So it does seem like it's

872
00:59:22.280 --> 00:59:25.039
it must be subjective in every way. How do you sort of pick that?

873
00:59:25.119 --> 00:59:28.599
How do you know what? First
off, if you have an SLA,

874
00:59:29.079 --> 00:59:31.320
the SLO must be more aggressive,
let's hope, right, right,

875
00:59:32.679 --> 00:59:37.960
But let's assume for a moment you
don't have a contractual you know, I

876
00:59:37.239 --> 00:59:40.079
say, you know, SLAs are like
SLOs with lawyers, right, it's really

877
00:59:40.400 --> 00:59:45.800
kind of a yeah, you have
to guess, Like I mean, I

878
00:59:45.800 --> 00:59:49.320
think that's the unfortunate of it.
And that's where you know, hire somebody

879
00:59:49.320 --> 00:59:52.440
with experience. They'll be able to
guess more accurately maybe, but they'll at

880
00:59:52.519 --> 00:59:55.559
least also know when they're wrong.
So I don't know. It's a process

881
00:59:55.559 --> 01:00:00.119
where you just have to iterate and
sometimes you have a major miss You're like,

882
01:00:00.159 --> 01:00:02.360
oh man, we really thought we
were measuring this service and we had

883
01:00:02.360 --> 01:00:07.239
this massive outage and our dashboard
was like, everything's green, nothing's wrong,

884
01:00:07.280 --> 01:00:10.159
and users are entre It's like,
okay, well we missed this super critical

885
01:00:10.199 --> 01:00:15.599
part of the picture, and then
you do the postmortem thing and you figure

886
01:00:15.599 --> 01:00:17.440
out, Okay, well I made
this common mistake. What other of my

887
01:00:17.599 --> 01:00:22.199
SLOs have this common, you know, that
I make this mistake more than once.

888
01:00:22.360 --> 01:00:27.679
Would it be fair to say that
it should be meaningful so that if you're

889
01:00:27.760 --> 01:00:30.639
violating it, then you're taking some
action as a result, and if you're

890
01:00:30.679 --> 01:00:34.960
not, then you don't do anything. So maybe it's about finding that sweet

891
01:00:34.960 --> 01:00:37.519
spot where it causes the right thing
to happen in your organization. Absolutely,

892
01:00:37.599 --> 01:00:42.000
I mean I think that. I
think to get there right, you need

893
01:00:42.039 --> 01:00:45.639
to have some kind of alerting and
reporting around it. I think alerting on

894
01:00:45.800 --> 01:00:50.119
SLOs is like a very hard problem. Like, the Google book will talk about

895
01:00:50.119 --> 01:00:52.599
burn rate alerting. Have fun implementing
that, right, that's very hard.
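
NOTE
A rough approximation of the burn rate alerting mentioned in the cues
above, not a full implementation. Burn rate here is the observed error
ratio divided by the error ratio the SLO allows. The two-window check and
the 14.4 default threshold are assumptions for illustration, as are the
function names and sample counts.
```python
def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total_events == 0:
        return 0.0
    error_ratio = bad_events / total_events
    return error_ratio / (1 - slo_target)
def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold (each is (bad, total))."""
    return (burn_rate(*long_window, slo_target) > threshold
            and burn_rate(*short_window, slo_target) > threshold)
# Hypothetical counts for a 99.9% SLO: last hour, then last five minutes.
print(burn_rate(20, 1000, 0.999))
print(should_page((20, 1000), (3, 100), 0.999))
```
Requiring the short window as well keeps the alert from firing long after
the problem has already stopped.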

896
01:00:52.840 --> 01:00:55.920
But if you can get there or
have some approximation of it and report on

897
01:00:55.960 --> 01:01:00.639
it, like I'm a big fan
of getting you know, it doesn't work

898
01:01:00.719 --> 01:01:02.840
right away, but eventually maturing to
a point where every alert has a ticket,

899
01:01:04.239 --> 01:01:08.440
right, and the ticket is either: fix
the thing that caused the alert.

900
01:01:08.719 --> 01:01:13.199
You know, this needs to be
more resilient, This needs more replicas or

901
01:01:13.360 --> 01:01:15.599
this alert paged me and nothing was actually
wrong. We should fix the alert and

902
01:01:15.639 --> 01:01:19.199
if you do that over time,
like that's how you can get SLOs and

903
01:01:19.280 --> 01:01:22.000
cause change, and when they really
are broken, the post mortem loop is

904
01:01:22.000 --> 01:01:27.480
the real fix. Like the trick
with SLOs is you kind of I don't

905
01:01:27.480 --> 01:01:30.239
know. There's that perpetual battle with
infrastructure and product right You're like, hey,

906
01:01:30.280 --> 01:01:32.719
we need you guys to write more
stable code and products, like we

907
01:01:32.760 --> 01:01:36.079
need features, we need this needs
to be a different color, right,

908
01:01:36.239 --> 01:01:38.480
which is fine, like that needs
there's a balance there. But if you

909
01:01:38.519 --> 01:01:42.800
get everyone to agree on SLOs and
you're really user focused, like, hey,

910
01:01:43.840 --> 01:01:46.719
users are happy when the latency you
know p ninety nine latency is this,

911
01:01:46.800 --> 01:01:50.519
and users are sad when it's
over that, and everyone everyone,

912
01:01:50.599 --> 01:01:55.159
like business engineering management, agrees on
that. Your argument is a lot easier

913
01:01:55.199 --> 01:01:59.960
when you have an outage to say, our users weren't happy, Like objectively,

914
01:02:00.599 --> 01:02:05.079
our users weren't happy, So we
either fix the thing or we decide

915
01:02:05.119 --> 01:02:07.639
that our threshold was incorrect on what
is a happy user. But one of

916
01:02:07.639 --> 01:02:12.480
the two has to We can't do
nothing. We can't just write features and

917
01:02:12.559 --> 01:02:15.159
say, well, users might be
unhappy again, because we know this thing

918
01:02:15.199 --> 01:02:19.039
will blow up, and I find
that that is a good you know,

919
01:02:19.039 --> 01:02:21.599
you almost trick the trick product into
it, and they're like, oh yeah,

920
01:02:21.599 --> 01:02:22.280
SLOs, these are great, and then
later it's like, oh, yeah,

921
01:02:22.280 --> 01:02:25.119
I guess we have to fix this
now, so you know, we'll give

922
01:02:25.239 --> 01:02:30.880
we'll give you a sprint or you
know, whatever it is. You know,

923
01:02:30.039 --> 01:02:35.000
I did the same trick a long
time ago with OKRs. Very similar

924
01:02:35.079 --> 01:02:37.840
thing, where you know, once
they agree to the OKRs, you

925
01:02:37.880 --> 01:02:39.599
try to set the mindset up what
does this actually mean? You know,

926
01:02:39.599 --> 01:02:43.760
why are we setting it? And
you set the so we can know what

927
01:02:43.800 --> 01:02:45.599
to do when we know what to
do the right thing, and then later

928
01:02:45.840 --> 01:02:47.559
you can just point back to it
and be like, hey, you know,

929
01:02:47.599 --> 01:02:50.960
we decided what the right thing was
going to be in this situation.

930
01:02:51.320 --> 01:02:52.960
Now it's time to execute on it. Or you know, you have to

931
01:02:53.000 --> 01:02:58.639
make a trade off. As you
said, Yeah, it's a tough.

932
01:02:58.840 --> 01:03:04.400
You know, you can't solve people
problems with tech, but you try.

933
01:03:07.320 --> 01:03:10.519
That'll be the million dollars million dollars
you know thing. But I don't know,

934
01:03:10.599 --> 01:03:14.239
I feel like data is always helpful. You know, emotions are bad,

935
01:03:14.239 --> 01:03:17.840
you know, everyone has emotional arguments, but data is hard to People

936
01:03:17.880 --> 01:03:22.039
still argue with data, but if
I feel like if you're on the side

937
01:03:22.039 --> 01:03:27.400
of data, you at least have
a fighting chance of pushing for change,

938
01:03:27.519 --> 01:03:31.199
or how do you know, like
you ever feel like you're in a situation

939
01:03:31.280 --> 01:03:37.400
where you get scared of survivorship bias, where you even if you're doing root

940
01:03:37.440 --> 01:03:40.599
cause analysis and you're finding really what
the underlying problem is that even though you're

941
01:03:40.639 --> 01:03:45.280
going on and fixing it, that
you're missing some other side of the iceberg

942
01:03:45.440 --> 01:03:51.039
that is waiting out there to come
and crush you? Totally. But I mean

943
01:03:51.079 --> 01:03:54.519
I think that I don't know.
I just accepted it, right. It's

944
01:03:54.519 --> 01:04:00.800
a pessimistic SRE brain. Like
everything will break. Everything I fix will

945
01:04:00.800 --> 01:04:03.800
also break. When I leave a company, everybody will blame me when it breaks.

946
01:04:03.840 --> 01:04:08.199
And that's whatever, you know,
Like I've just I've internalized it and

947
01:04:08.239 --> 01:04:11.000
accepted it. And it used to
stress me out a lot more and now

948
01:04:11.039 --> 01:04:15.719
it's just like, Yep, this
thing I'm rolling out might break, and

949
01:04:15.840 --> 01:04:17.760
you know, at least you know
that if it breaks, you're going to

950
01:04:17.800 --> 01:04:20.960
write a good postmortem and learn from
it. And that's kind of the consolation

951
01:04:21.079 --> 01:04:25.039
prize of like, yeah, it
sucks to write a big postmortem after a

952
01:04:25.039 --> 01:04:29.239
big outage. But also you know, that's that's the gig, and like

953
01:04:29.440 --> 01:04:32.320
we shouldn't have hired Pete, that's
that's the R. Well, you know,

954
01:04:32.400 --> 01:04:34.679
like the first place I ever worked was
FedEx, right, there was this

955
01:04:34.840 --> 01:04:40.280
meeting every week, the R squared
Meeting, the Redundancy and Reliability Meeting. Interesting.

956
01:04:40.519 --> 01:04:44.440
and it was like how good of
a week is the VP having? Right?

957
01:04:45.760 --> 01:04:47.599
And there was always this rumor that
like, oh, somebody was fired

958
01:04:47.639 --> 01:04:50.239
once at this meeting because they had
an outage, and like no one can

959
01:04:50.320 --> 01:04:56.719
actually tell you that person's name,
what year it was. I'm pretty sure

960
01:04:56.800 --> 01:04:59.679
it was, you know, it
was trying to hype you up to prepare

961
01:04:59.760 --> 01:05:01.519
for it. I think it was. I don't know if they intentionally did

962
01:05:01.519 --> 01:05:05.320
this or it was just a grow
but like, that's such a terrible you

963
01:05:05.360 --> 01:05:10.320
know. I think the blameless culture
is big. People can't be a problem,

964
01:05:10.440 --> 01:05:15.000
right, But I think often it's
not the person that made the change that

965
01:05:15.119 --> 01:05:17.679
is actually the problem. It is
someone's reviewing code, right, Like code

966
01:05:17.679 --> 01:05:21.679
review should be first class, not
a checking a box. I feel I

967
01:05:21.719 --> 01:05:26.159
feel really bad when a change I
reviewed causes an outage. That happened recently.

968
01:05:26.280 --> 01:05:30.679
I was like, who approved this
change? Oh, never mind, I

969
01:05:30.719 --> 01:05:33.760
should have done better due diligence,
And yeah, I mean it goes back

970
01:05:33.800 --> 01:05:36.280
even further than that though, because
if you do think that it is one

971
01:05:36.320 --> 01:05:40.800
person's responsibility that caused the problem,
you can look at, well, you

972
01:05:40.840 --> 01:05:43.559
know, what was the culture we
had that allowed them to make the mistake?

973
01:05:43.679 --> 01:05:45.519
Or you know why were they even
hired? Right? Were they a

974
01:05:45.559 --> 01:05:47.119
good fit for the role in the
first place? And you can definitely go

975
01:05:47.199 --> 01:05:51.320
back up somewhere else or a different
chain to really dive in there some of

976
01:05:51.320 --> 01:05:55.199
those things you might not write.
It's like a meta postmortem of the... Yeah,

977
01:05:55.199 --> 01:05:57.599
for sure, how do we get
here? Right? How do we

978
01:05:57.679 --> 01:06:01.760
end up hiring people that understand networking
for a networking product or whatever? You

979
01:06:01.840 --> 01:06:05.159
know it? Right? Yeah,
I think that stuff's important. And then

980
01:06:05.199 --> 01:06:09.039
then change the process, right,
Okay, these are a new criteria for

981
01:06:09.119 --> 01:06:13.960
this for sure. Yeah. Yeah. When I was in the Navy,

982
01:06:14.039 --> 01:06:18.519
we had this process for getting your
certification for whatever job you were doing.

983
01:06:18.599 --> 01:06:25.960
That I've tried to bring into pull
request reviews that it takes a cultural shift

984
01:06:26.000 --> 01:06:28.960
to get fully implemented. But in
the Navy. It was set up so

985
01:06:29.000 --> 01:06:30.800
that, like I was a nuclear
engineer, so you had to learn all

986
01:06:30.800 --> 01:06:35.599
these different skills to operate the power
plant. And so you would go around

987
01:06:35.599 --> 01:06:43.559
the power plant and work with the
existing engineers and show them that you knew

988
01:06:43.679 --> 01:06:47.960
how to do something, and if
they felt like you understood it, they

989
01:06:47.960 --> 01:06:50.960
would sign it off in your book. And this was way back pre computer

990
01:06:51.079 --> 01:06:55.639
stuff, so you actually had a
physical book that you carried around, but

991
01:06:55.760 --> 01:06:58.400
you signed it off in that book. And then if at any point in

992
01:06:58.440 --> 01:07:02.079
your career you ever screwed that task
up to a point where your skills were

993
01:07:02.079 --> 01:07:06.000
called into question, they would open
up the book to see who signed it

994
01:07:06.199 --> 01:07:10.519
and go back to that person and
say, hey, why did Will

995
01:07:10.559 --> 01:07:15.320
screw this up? And and so
it was that like that. It did

996
01:07:15.320 --> 01:07:18.800
put that sense of pressure on you
so that before you would sign off on

997
01:07:18.840 --> 01:07:25.960
anyone's book on anything, you wanted
to make sure that you were reasonably confident

998
01:07:26.000 --> 01:07:29.840
that they actually knew what they were
doing. And so I treat pull requests

999
01:07:29.840 --> 01:07:33.079
the same way like if I approve
a pull request and it breaks something,

1000
01:07:33.679 --> 01:07:38.360
I don't consider that to be a
fault with the person who submitted the pull

1001
01:07:38.400 --> 01:07:41.800
request. I consider it to be
my fault for not catching it in the

1002
01:07:41.840 --> 01:07:47.840
review. This is like the Erdős
number corollary, the uh, there's the

1003
01:07:47.960 --> 01:07:50.239
you know, you chase it back
all the way up the chain, like,

1004
01:07:50.280 --> 01:07:51.920
well, you know, who did
that person? You know, what

1005
01:07:51.920 --> 01:08:01.920
does that person's book look like?
The reviewer? Right? No, I

1006
01:08:01.920 --> 01:08:04.440
think that's a good, good way
to look at it, and it's just

1007
01:08:04.480 --> 01:08:12.239
good for accountability and you know,
yeah the meta issues. Yeah, and

1008
01:08:12.599 --> 01:08:15.279
like a one off instance, you
know, is not that big a deal.

1009
01:08:15.319 --> 01:08:23.279
But over time, you know,
if like everyone that I signed off

1010
01:08:23.319 --> 01:08:27.880
on this particular skill is having problems, that's going to point back to the

1011
01:08:27.960 --> 01:08:33.800
root cause being me not actually understanding
either what the skill is or how to

1012
01:08:33.840 --> 01:08:41.119
evaluate that skill. Yeah. I've
been places where the root cause has just

1013
01:08:41.199 --> 01:08:45.640
been fatigue. It's like the on-call
rotations are too insane. This was at

1014
01:08:45.640 --> 01:08:48.199
the end of this person covered someone
else's on call. They were on two

1015
01:08:48.199 --> 01:08:51.000
weeks of twenty four by seven on
call. They had had a bunch of

1016
01:08:51.039 --> 01:08:55.079
major incidents overnight, they were running
on no sleep, and they just simply

1017
01:08:55.079 --> 01:08:58.600
did the wrong thing. You can't
fault the human for that. Why do

1018
01:08:58.720 --> 01:09:01.520
we put them in this grinder like
every other industry. Airline industry is a

1019
01:09:01.560 --> 01:09:04.199
great one, right, you can
only have limits on how much you

1020
01:09:04.239 --> 01:09:08.039
can fly and be responsible over people's
lives. I mean, obviously when lives

1021
01:09:08.039 --> 01:09:10.680
are at stake, like you said,
you have to be more rigorous. But

1022
01:09:10.880 --> 01:09:13.439
it doesn't have to be that rigorous
for you know, being on call.

1023
01:09:13.479 --> 01:09:15.840
But you know, we have an
informal policy on our team is if we

1024
01:09:15.880 --> 01:09:20.319
take overnight pages, someone will offer
to cover the next day, the next

1025
01:09:20.439 --> 01:09:26.319
night so that person can catch up. Yeah, there's nothing worse than having

1026
01:09:26.359 --> 01:09:29.960
a week of terror where you're losing
sleep every night like you're just you're dead

1027
01:09:30.119 --> 01:09:32.279
by then, and then a major
incident. You know, the worst timing

1028
01:09:32.319 --> 01:09:35.079
always happens in these things, right, Like the worst outages are never one

1029
01:09:35.119 --> 01:09:39.279
thing. It's a confluence of events. Right, So that terrible outage is

1030
01:09:39.279 --> 01:09:43.840
going to come Friday when you're running
on fumes and your brain isn't isn't there,

1031
01:09:44.399 --> 01:09:46.359
and you're going to make mistakes,
you're not going to see problems and

1032
01:09:46.399 --> 01:09:51.920
that's not your fault. But that's
tough to in a startup world like oh

1033
01:09:53.000 --> 01:09:56.760
yes, you're meant to grind,
but there has to be some reasonableness of

1034
01:09:57.600 --> 01:10:00.880
Okay, the people with responsibility are
keeping the site up need to be aware

1035
01:10:00.960 --> 01:10:04.199
and awake because we're not just running
run books where we copy and paste off.

1036
01:10:04.319 --> 01:10:10.000
It's the run book is use your
brain. Yeah, I think we

1037
01:10:10.079 --> 01:10:14.439
really realized over the at least the
most recent decade that the grind is not

1038
01:10:14.479 --> 01:10:18.159
helpful. Even like doing more hours
of work, especially in knowledge work industry,

1039
01:10:18.760 --> 01:10:23.479
does not translate to an additional value. And so if you do have

1040
01:10:23.880 --> 01:10:27.680
outages or incidents every day that
people are on, like, it

1041
01:10:27.720 --> 01:10:31.520
seems like fundamental that you would intentionally
rotate them off so that someone else is

1042
01:10:31.520 --> 01:10:35.760
there because and you know, I
think there's like a pride issue here where

1043
01:10:35.800 --> 01:10:40.560
the engineer just wants to stay on
because you know, it's their rotation and

1044
01:10:40.920 --> 01:10:44.680
they don't realize that it's actually harming
the company. Like they should speak up

1045
01:10:44.680 --> 01:10:47.239
for the benefit of the company.
It's not about them necessarily. Right.

1046
01:10:47.439 --> 01:10:51.479
Heroics shouldn't be I mean, it's
great sometimes heroics just have to happen,

1047
01:10:51.520 --> 01:10:55.079
and they happen and it's good,
but that shouldn't be the goal of like,

1048
01:10:55.119 --> 01:10:57.399
oh man, I want to be
a hero of the Saturday outage. Please

1049
01:10:57.479 --> 01:11:00.439
no. Like, it was really fun
early in my career, and now it's

1050
01:11:00.479 --> 01:11:03.720
like, oh, there's heroics happening. This is awful. Yeah, it's

1051
01:11:03.760 --> 01:11:09.880
it's the cult of the hero because
then you you fulfill that hero role and

1052
01:11:09.920 --> 01:11:14.439
then you know, you get at
mentioned in Slack and you know, everyone's

1053
01:11:14.479 --> 01:11:15.520
like, oh wow, that was
such a tough effort, you know,

1054
01:11:15.720 --> 01:11:20.840
and the worst yeah, the worst
word to use in HR: rockstar. Yeah.

1055
01:11:20.880 --> 01:11:24.319
Absolutely. I don't want to be
a rock star at work, man,

1056
01:11:24.479 --> 01:11:27.159
Like, please no. I want
to... if I'm going to be a

1057
01:11:27.239 --> 01:11:30.439
rock star, I want it to
be in Mötley Crüe, fueled by cocaine

1058
01:11:30.479 --> 01:11:32.680
and hookers. Let me be a
rock star, like being a rock star

1059
01:11:32.760 --> 01:11:38.279
and infrastructure is like, yeah,
where's my hotel room to tear up?

1060
01:11:38.439 --> 01:11:41.399
Yeah, I mean at the beginning
it was oh they described me as a

1061
01:11:41.479 --> 01:11:43.800
rockstar, that must mean I'm doing
great. And now it's like, oh,

1062
01:11:45.239 --> 01:11:47.640
now it's like, watch your mouth.
Yeah, don't call it that.

1063
01:11:50.079 --> 01:11:54.159
But I think I think the culture
is Yeah, I don't know. It's

1064
01:11:54.920 --> 01:11:58.039
it's a tough one to just you
don't want to actively discourage you. You

1065
01:11:58.039 --> 01:12:00.039
don't want to say log off and
go home, like if you want to

1066
01:12:00.079 --> 01:12:02.079
work. I don't know. My
thing is like I'm a workaholic, and

1067
01:12:02.119 --> 01:12:04.840
I always say I'm happy to work
long hours if it's what I want to

1068
01:12:04.880 --> 01:12:09.680
do. If I want to work
on Saturday, awesome. If other people

1069
01:12:09.680 --> 01:12:13.039
want me to work on Saturday,
that kind of falls apart for me.

1070
01:12:14.840 --> 01:12:16.680
And the social engineering trick is that, well, just give Pete work that

1071
01:12:16.720 --> 01:12:19.720
he really likes and is passionate about, and he'll work on Saturday. That's

1072
01:12:19.720 --> 01:12:23.680
a fine. Okay, that's a
good trick. It works on me,

1073
01:12:24.079 --> 01:12:29.279
but I don't think people have used
that. And some weekends I game all

1074
01:12:29.319 --> 01:12:31.960
weekend. Some weekends I work all
weekend. But usually it's it's my choice.

1075
01:12:33.079 --> 01:12:38.760
The point about that, I think
part of that is that we are

1076
01:12:39.319 --> 01:12:43.479
creative in the work that we do, you know, using our creativity to

1077
01:12:43.560 --> 01:12:49.600
solve problems. And creativity doesn't doesn't
show up at nine am when you hit

1078
01:12:49.680 --> 01:12:54.359
the time clock, you know,
and if it's something that you're excited about,

1079
01:12:54.479 --> 01:12:58.039
you you get this, you know, that creative push like, oh,

1080
01:12:58.079 --> 01:13:00.119
I got to go do this,
which is what leads you to you

1081
01:13:00.960 --> 01:13:04.399
go in and work on Saturday.
Most of my best work is done yeah,

1082
01:13:04.439 --> 01:13:08.560
after midnight or on the weekends correct, Like it's exactly that is.

1083
01:13:08.600 --> 01:13:11.399
You get the idea and you're like, man, you're laying in bed and

1084
01:13:11.439 --> 01:13:13.039
you're like, I can build it
this way, that way. It's like

1085
01:13:13.079 --> 01:13:16.119
past draft to build it right,
Like I know. It always fueled me

1086
01:13:16.359 --> 01:13:20.520
was there was some really annoying problem
that I did, like some other problem

1087
01:13:20.560 --> 01:13:25.159
I didn't want to have to solve, Like someone was asking ridiculous things on

1088
01:13:25.199 --> 01:13:29.680
how to fix Jenkins and the solution
was something I didn't want them to do,

1089
01:13:30.199 --> 01:13:34.399
and so I felt the need motivated
to just have this problem completely go

1090
01:13:34.479 --> 01:13:39.479
away. And that's when I would
really work on things like non stop wast

1091
01:13:39.479 --> 01:13:45.680
nerd sniping. Right, someone can't
be done see you Monday. Right here

1092
01:13:45.720 --> 01:13:50.319
it is, but I don't think. I think the other problem with that

1093
01:13:50.520 --> 01:13:54.359
is it has to be clear on
the team that that's happening. Like some

1094
01:13:54.439 --> 01:13:59.199
people just don't ever work weekends.
I think that's great, Like you know,

1095
01:13:59.720 --> 01:14:02.199
that's that should shouldn't be a problem. So it shouldn't be a peer

1096
01:14:02.199 --> 01:14:04.560
pressure of like, oh, Pete worked
over the weekend, everyone else should.

1097
01:14:04.560 --> 01:14:08.760
Like I think that is a terrible
message to send, so you have to

1098
01:14:08.760 --> 01:14:13.119
be careful with it, like I've
learned not to send too many emails.

1099
01:14:13.199 --> 01:14:15.600
Well, I don't. Luckily the
company I work at it's not an email

1100
01:14:15.640 --> 01:14:19.039
company, but at, like, big companies
where email is life. I'm always very

1101
01:14:19.039 --> 01:14:24.680
careful about sending emails on the weekend
because you're implicitly setting expectations for other people

1102
01:14:24.760 --> 01:14:29.399
that are watching, and I feel
like that is trouble. Yeah, I

1103
01:14:29.439 --> 01:14:33.680
mean that's another avenue realistically, Like, I think there is a thing about

1104
01:14:34.000 --> 01:14:36.640
just like you want the load on
your systems to be constant. You don't

1105
01:14:36.640 --> 01:14:41.159
want to see spikes because they're incredibly
hard to deal with. The same goes for

1106
01:14:41.680 --> 01:14:46.159
teams that are putting out work.
Right, If some engineers are incredibly spiky

1107
01:14:46.720 --> 01:14:50.399
on load, then you're unpredictable and
what you can deliver and how much,

1108
01:14:50.439 --> 01:14:55.239
and the reliability or quality of that. So it's you're not necessarily doing anyone

1109
01:14:55.279 --> 01:15:00.840
a favor by one day going out
and solving a problem. If that's your

1110
01:15:00.880 --> 01:15:02.560
pattern. Yeah, I think it's
a difficult lesson for a lot of people

1111
01:15:02.600 --> 01:15:08.359
to learn. I struggle with it
a lot. I often find myself like

1112
01:15:08.439 --> 01:15:11.199
I'll work a weekend because I want
to, and then Monday, I'm like,

1113
01:15:11.239 --> 01:15:15.079
oh, man, I don't want
to work today, Like, yeah,

1114
01:15:15.159 --> 01:15:16.520
well that's fine, right, because
then you're then you're still having the

1115
01:15:16.520 --> 01:15:20.159
same amount of work that you're sort
of putting out and productivity for the team,

1116
01:15:20.159 --> 01:15:24.159
but you're not overburdening them because someone
still has to review that, right,

1117
01:15:24.199 --> 01:15:27.640
someone that's still creating followup work down
the road, and depending on the

1118
01:15:27.640 --> 01:15:30.920
statement, maybe that creates incidents as
well. Yeah, now, you're right,

1119
01:15:31.000 --> 01:15:35.359
it's a very tough balance to figure
out that. Yeah, I'm still

1120
01:15:35.399 --> 01:15:39.760
trying to figure out what that is. I would say my whole career is

1121
01:15:39.760 --> 01:15:44.600
pretty spiky over There are some places
I work at where I I'm always you

1122
01:15:44.600 --> 01:15:45.439
know, I can't help the workaholic
in me. But there are some

1123
01:15:45.439 --> 01:15:48.199
places where I chill a little more
in some places where I'm like eighty hour

1124
01:15:48.279 --> 01:15:51.479
weeks insanity, And I don't know. For me, I've learned it's a

1125
01:15:51.479 --> 01:15:56.479
healthy cycle. Like every couple jobs, I do the insane grind because that's

1126
01:15:56.520 --> 01:15:59.359
just where I'm at, and then
I have a couple of years of like

1127
01:15:59.680 --> 01:16:02.640
less insane grind, kind of relax
and find other things to do, and

1128
01:16:02.680 --> 01:16:05.000
then right back to it because I
miss it, you know. I mean,

1129
01:16:05.279 --> 01:16:08.560
it doesn't matter what your pattern is. I mean, if you are

1130
01:16:08.600 --> 01:16:11.279
spiky, it's fine as long as
it's somehow consistent. Right. You know,

1131
01:16:11.399 --> 01:16:14.399
if every couple of weekend, you
know, every other weekend, you

1132
01:16:14.439 --> 01:16:17.000
do extra work, then you sort
of expect that into the realm of things

1133
01:16:17.000 --> 01:16:20.600
and how that's going to play out. But if it happens and it's unpredictable,

1134
01:16:20.680 --> 01:16:24.600
then you don't know what the impact
is on the team. You may

1135
01:16:24.680 --> 01:16:28.720
think that the team can have more
work done than is reasonable, and so

1136
01:16:28.760 --> 01:16:32.560
a new big project comes out and
now it's taking even longer or unexpected because

1137
01:16:32.640 --> 01:16:40.279
he's not pulling those weekends anymore and
doing the a real job that has a

1138
01:16:40.279 --> 01:16:45.279
lot of value, but messed up
some sort of prediction or timelines or deadlines.

1139
01:16:45.880 --> 01:16:48.399
I think engineering time prediction is like
the hard I just throw that out

1140
01:16:48.439 --> 01:16:55.640
the window, especially with infra.
Right, well, it's two weeks of

1141
01:16:55.640 --> 01:16:59.479
infra work. Oh so it'll be
done in two weeks, maybe like two

1142
01:16:59.520 --> 01:17:02.720
weeks of infra work sometimes takes
two months with outages and interrupts, and

1143
01:17:03.279 --> 01:17:08.279
it's hard to explain sometimes. Yeah, it's like Einstein's theory of relativity.

1144
01:17:08.399 --> 01:17:14.479
This is real the fast where we
go this is different where we go no.

1145
01:17:14.600 --> 01:17:16.479
But I think going back to your
you're talking about working in spikes.

1146
01:17:16.520 --> 01:17:24.640
I think that's like, that's how
we've worked as humans for you know,

1147
01:17:24.960 --> 01:17:28.640
forty thousand years. Like you go
out and you do the big grind to

1148
01:17:30.199 --> 01:17:31.880
you know, to to hunt the
animals, and then you go back and

1149
01:17:31.920 --> 01:17:35.399
you just rest and you relax
for a while. Or you go out

1150
01:17:35.439 --> 01:17:40.399
and you work all summer to plant
the crops and harvest the crops and then

1151
01:17:40.439 --> 01:17:42.800
store them for the winter, and
then you ride the winter out. So

1152
01:17:42.840 --> 01:17:49.720
I think that behavior is actually something
that has been native to us for a

1153
01:17:49.800 --> 01:17:54.279
long long time, and to try
to break that in the course of a

1154
01:17:54.399 --> 01:17:59.920
three decade career is going to be
difficult. I mean you're on as a

1155
01:18:00.119 --> 01:18:02.840
there definitely what are you called it? Art? Right? So you look

1156
01:18:02.880 --> 01:18:08.039
at the Renaissance artists, famous ones
like you know, see what they did?

1157
01:18:08.079 --> 01:18:11.560
What are other artists doing before?
And even today? You know,

1158
01:18:11.560 --> 01:18:14.319
how are they working? Because that's
the same of expectation you can have for

1159
01:18:14.359 --> 01:18:18.079
any knowledge work, which is very
similar to a creative process. You have

1160
01:18:18.159 --> 01:18:21.640
to have the right motivation and sometimes
that's really hard to figure out. I

1161
01:18:21.640 --> 01:18:26.000
don't know what that is sometimes,
right, some weekends, I'm just I

1162
01:18:26.000 --> 01:18:29.640
want to be at and stare at
a screen. And some weeks I wake

1163
01:18:29.720 --> 01:18:30.960
up early on a Saturday and I'm
like, let's write some code, let's

1164
01:18:30.960 --> 01:18:35.159
do stuff, And yeah, I
have no idea what what drives it?

1165
01:18:36.000 --> 01:18:39.359
Sometimes I know, but often it's
just I don't know. It's hard to

1166
01:18:39.359 --> 01:18:43.920
say how I'm feeling or will feel
until I actually wake up. And that's

1167
01:18:44.079 --> 01:18:48.880
what to do for sure. I
think if we figure that out and can

1168
01:18:49.159 --> 01:18:54.279
define it and reproduce it, we've
got our next multi billion dollar startup.

1169
01:18:55.039 --> 01:19:01.640
Seriously, awesome. Is there anything
else we should talk about for internal platforms

1170
01:19:01.960 --> 01:19:06.159
of infrastructure? I think we covered
a lot. I think that Yeah,

1171
01:19:06.199 --> 01:19:11.199
all the things I want to talk
about awesome. Cool. Let's do some

1172
01:19:11.279 --> 01:19:15.640
picks. Warren, I've been picking on
you for picks the last couple of episodes,

1173
01:19:15.760 --> 01:19:20.039
but I gave Pete the heads up
before we started recording, so I

1174
01:19:20.119 --> 01:19:23.560
know he's prepped. So I'm gonna
put Pete on the spot. What'd you

1175
01:19:23.560 --> 01:19:29.399
bring for us? Pete? I'm
a gamer, and I like, uh,

1176
01:19:29.880 --> 01:19:32.279
I don't know. I like games
that are really hard. I like

1177
01:19:32.319 --> 01:19:35.439
games that are kind of like a
second job, and grindy and I don't

1178
01:19:35.439 --> 01:19:39.560
know, I'm a glutton for punishment. I guess this is the summary.

1179
01:19:40.279 --> 01:19:43.840
Some people play games just to have
casual. I have some casual you know,

1180
01:19:44.000 --> 01:19:45.840
hang out with the boys and play
some games kind of thing. But

1181
01:19:45.880 --> 01:19:48.880
I like spreadsheet on the second monitor
and gaming on the first monitor kind of

1182
01:19:48.880 --> 01:19:53.479
thing. So I really like ARPGs,
you know, hack and slash anything where

1183
01:19:53.479 --> 01:19:58.119
you can min-max. It's just
interesting. Up my alley. A new

1184
01:19:58.159 --> 01:20:00.560
game came out. I don't know. I think it went full release ten

1185
01:20:00.640 --> 01:20:04.479
days ago. It's been around for
years in, like, betas and alpha,

1186
01:20:04.479 --> 01:20:10.359
it's called Last Epoch. It's like
a you know, Diablo ish ARPG thing.

1187
01:20:11.399 --> 01:20:14.760
And there's Diablo. I played a
lot of Diablo. There's PoE,

1188
01:20:14.840 --> 01:20:17.720
which is like the ultimate. Like
I don't have enough hours to I'm also

1189
01:20:17.840 --> 01:20:20.520
like a very addicted personality, so
I have to choose my games carefully. I don't

1190
01:20:20.520 --> 01:20:25.399
play Factorio because I know that
I would just stop working. Like Last

1191
01:20:25.399 --> 01:20:30.039
Epoch is this nice middle of It's
a lot more complex than Diablo. There's

1192
01:20:30.039 --> 01:20:32.680
a lot more things that can contribute
to your final build that you can nerd

1193
01:20:32.720 --> 01:20:36.600
out on trying this and trying that
there's a lot of RNG, so there's

1194
01:20:36.640 --> 01:20:43.039
the grind aspect of it. It
just checks all the boxes. And for

1195
01:20:43.159 --> 01:20:45.399
me, I know a game is
good when I lose track of time constantly.

1196
01:20:45.640 --> 01:20:49.079
Oh it's a very am I should
go to bed now. That's been

1197
01:20:49.199 --> 01:20:53.680
like the last week and a half
as the game came out, And so

1198
01:20:53.800 --> 01:20:57.960
that is my pick. If you
like ARPGs or min-maxing games or things

1199
01:20:58.000 --> 01:21:01.399
like that, this game just checks
all the boxes for me. Written by

1200
01:21:01.520 --> 01:21:04.800
gamers. Sometimes you play a game
when it's pretty clear that the people that

1201
01:21:04.840 --> 01:21:09.279
wrote it don't actually play the game
they've written. It's actually way more common

1202
01:21:09.520 --> 01:21:13.239
than you think. And you realize
because there's no quality of life features and

1203
01:21:13.239 --> 01:21:15.640
it's like, well this is awkward
and I have to do it every two

1204
01:21:15.720 --> 01:21:18.079
raids and this sucks. This game is just
like oh I have to do this thing.

1205
01:21:18.119 --> 01:21:20.800
Oh wait, this is really easy
to do, Like okay, just

1206
01:21:20.960 --> 01:21:26.399
like someone who plays the game wrote the
game. So yeah, I enjoy a

1207
01:21:26.439 --> 01:21:30.560
game like that. It's a smaller
company. So is there an online

1208
01:21:30.600 --> 01:21:34.680
multiplayer mode for it? Yeah?
So it's like a whole way do you

1209
01:21:34.680 --> 01:21:38.319
go to the common areas there's a
bunch of other people there, and there's

1210
01:21:38.399 --> 01:21:42.159
trading and stuff, and there's you
know, public chat where half trolling,

1211
01:21:42.239 --> 01:21:46.399
half people asking questions. But you
can party up with your buddies and tackle

1212
01:21:46.479 --> 01:21:49.319
content together. And so I have
a small group I play with and we've

1213
01:21:49.359 --> 01:21:54.079
always played ARPGs, so this is
our current one that we're all just kind

1214
01:21:54.119 --> 01:21:57.640
of grinding up characters on and figuring
out what the best builds are and what

1215
01:21:57.720 --> 01:22:00.520
synergizes with each other and things like
that. All right, So the follow

1216
01:22:00.600 --> 01:22:03.760
up question is do you want to
share your gamer tag so that the listeners

1217
01:22:03.800 --> 01:22:09.760
of the show can jump on and
talk trash. Yeah, I'm Pete F backwards,

1218
01:22:09.760 --> 01:22:15.479
fetep. All right, someone out
there has p F places. So I'm

1219
01:22:15.479 --> 01:22:18.279
a yeah, you said difficulty.
I thought for sure you were going to

1220
01:22:18.319 --> 01:22:23.960
bring up Lost Souls or Dark Souls, and so I that kind of game

1221
01:22:24.600 --> 01:22:28.800
I if I was good at I
would play. I don't think anyone's good

1222
01:22:28.800 --> 01:22:31.159
at it the whole, like the
the you know those kind of games where

1223
01:22:31.159 --> 01:22:33.880
it's like jumping mechanics and stuff.
I don't know, I'm that's stuff.

1224
01:22:33.920 --> 01:22:38.079
He's not a platformer. That that's
it? Yeah, I mean really twitchy

1225
01:22:38.159 --> 01:22:41.520
games. Yeah, I'm totally with
you. I played Ninja Gaiden a

1226
01:22:41.560 --> 01:22:44.600
lot in the past, and like
that was very there are some parts that

1227
01:22:44.600 --> 01:22:48.279
were incredibly challenging. Blessed Podcast.
Enough of that, Like some of the

1228
01:22:48.279 --> 01:22:51.760
boss fights are legitimately challenging. You
know, we just, one dude we fought

1229
01:22:51.880 --> 01:22:55.239
probably forty times in a row before we
beat him, but now we're really good

1230
01:22:55.239 --> 01:23:00.840
at it because we yeah, we
post-mortemed each death. Okay,

1231
01:23:01.600 --> 01:23:04.960
you can't stand in the middle if
really where games are worth right, Like

1232
01:23:05.000 --> 01:23:09.600
it's I think that's a real lesson. You know, you really have to

1233
01:23:09.640 --> 01:23:14.239
take root cause analysis and post mortems
to your private life, and you

1234
01:23:14.279 --> 01:23:17.039
know in your friend group when something
goes wrong, you really need to investigate.

1235
01:23:17.399 --> 01:23:20.000
I often joke that, like many
things about me are really good for

1236
01:23:20.039 --> 01:23:24.479
work and really terrible for personal relationships, but they work out really well in

1237
01:23:24.520 --> 01:23:30.600
gaming. So root-causing relationship problems
is usually a bad idea. Root-causing

1238
01:23:30.600 --> 01:23:38.039
why you died to a boss.
Apply it where you can. Fair enough,

1239
01:23:39.680 --> 01:23:42.920
all right, Warren, what'd you
bring this week? Yeah, so a

1240
01:23:42.960 --> 01:23:45.479
couple weeks ago I mentioned this already, but I'm gonna plug it again.

1241
01:23:45.880 --> 01:23:51.359
On Friday, there's the DecompileD conference
in Dresden, Germany, uh, which I'm

1242
01:23:51.359 --> 01:23:56.239
actually giving a talk at, about our
journey at Authress and adding security. But

1243
01:23:56.279 --> 01:23:58.760
there are some interesting talks there,
like there's one that I really want to

1244
01:23:58.800 --> 01:24:03.159
go to that's about migrating from Kubernetes
to serverless, and I think there

1245
01:24:03.239 --> 01:24:06.039
are a bunch of other ones that
are really interesting that I'm looking forward to.

1246
01:24:06.560 --> 01:24:13.600
Nice, excellent, cool. So
my pick is going to be no surprise

1247
01:24:13.680 --> 01:24:16.800
to anyone who's been listening to the
last few episodes. I am picking

1248
01:24:16.880 --> 01:24:23.439
PlatformCon, coming up in June. It's
a five-day virtual conference about platform engineering,

1249
01:24:24.199 --> 01:24:27.720
so check it out. Totally free,
and there's going to be tons of

1250
01:24:27.760 --> 01:24:32.279
great talks there and specific to me. At the end of the conference,

1251
01:24:32.319 --> 01:24:36.239
I will be doing a live Q
and A session with some of the speakers,

1252
01:24:36.760 --> 01:24:40.720
so they're finalizing who the speakers are
going to be, and then once

1253
01:24:40.800 --> 01:24:44.880
that's done, I am going to
try and turn this into a Q and

1254
01:24:44.920 --> 01:24:47.319
A session that you actually want to
listen to. So I'm going to go

1255
01:24:47.359 --> 01:24:53.800
out on X and start asking people
which speakers do you want me to interview

1256
01:24:54.199 --> 01:24:58.279
and what questions do you want me
to ask them, so that it becomes

1257
01:24:58.279 --> 01:25:00.800
the interview that you want to hear, and yes, I know that by

1258
01:25:00.840 --> 01:25:04.840
going out on X and asking you
what questions to ask, some of you

1259
01:25:04.880 --> 01:25:08.640
are going to ask some of the
wrong questions. So I'm just gonna be

1260
01:25:08.680 --> 01:25:12.760
honest here with you. I'm not
going to ask that, but thank you

1261
01:25:12.760 --> 01:25:17.640
for listening to the show, you
sick little pervert. And so whenever the

1262
01:25:17.760 --> 01:25:23.119
Q and A session comes up,
I will ask the questions, some of

1263
01:25:23.159 --> 01:25:26.600
them because we know what you're going
to want to ask, but I will ask

1264
01:25:26.600 --> 01:25:29.800
some of them and then you will
be able to say, hey, I

1265
01:25:29.880 --> 01:25:32.760
heard about this on the X,
which takes it full circle because I did

1266
01:25:32.840 --> 01:25:36.439
go through that whole setup just to
walk all the way through and plug this

1267
01:25:36.560 --> 01:25:43.039
ZZ Top song in my pick for
today. So all of you listeners who

1268
01:25:43.039 --> 01:25:45.520
are ZZ Top fans out there, that
was all done for you. Thanks for

1269
01:25:45.600 --> 01:25:56.319
listening. Cool, So I think
we got an episode here. Awesome.

1270
01:25:56.840 --> 01:26:00.159
Yeah, Pete, thanks for joining
us. This has been a great talk.

1271
01:26:00.159 --> 01:26:01.720
Fun to have you on the show. Yeah, thank you for having

1272
01:26:01.760 --> 01:26:05.399
me again. This was great.
Yeah, anytime. Warren, thanks again

1273
01:26:05.479 --> 01:26:09.239
for joining me as a co-host
here. Of course, I love

1274
01:26:09.279 --> 01:26:12.239
having you on the show. It's
been a lot of fun. Look forward

1275
01:26:12.239 --> 01:26:15.560
to seeing you next week. Yeah, and for all you listeners out there,

1276
01:26:15.640 --> 01:26:18.359
we will see you all next
week too. Thanks everyone,

