1
00:00:14,919 --> 00:00:21,120
On, y'all, welcome to another
episode of Adventures in DevOps. I'm

2
00:00:21,160 --> 00:00:25,960
your host, Will Button. Joining me
in the studio, my co-host, back

3
00:00:26,039 --> 00:00:30,760
on a streak, making Tom Brady look
like a slacker, Warren Parad. Welcome,

4
00:00:30,800 --> 00:00:36,880
Warren. And thanks for letting me
come back for another. I have my hopes up

5
00:00:36,920 --> 00:00:40,320
that it will keep on going.
I'm Warren, I'm the CTO of Authress,

6
00:00:40,359 --> 00:00:43,880
just to reintroduce myself. Yeah, I mean, I like how this

7
00:00:43,960 --> 00:00:49,119
is going so far, and I
have no plans to coward my way out

8
00:00:50,079 --> 00:00:54,439
of the future. Right on. I'm
excited to hear that, because otherwise it's

9
00:00:54,479 --> 00:01:00,679
just me and the guest, and that always
leads to trouble. Speaking of guests, joining us

10
00:01:00,719 --> 00:01:08,719
today, Pete Fritchman, SRE, staff infrastructure
engineer over at Observe, Inc. And

11
00:01:10,599 --> 00:01:17,719
Pete has joined us today to talk
about observability in those pesky internal applications that

12
00:01:17,760 --> 00:01:23,799
we all have. And I don't
know, it might be a love-hate

13
00:01:23,879 --> 00:01:26,680
relationship there, but Pete, welcome
to the show. Hey, thank you

14
00:01:26,719 --> 00:01:32,280
for having me. Right on. So
tell us a little bit about your background,

15
00:01:32,280 --> 00:01:37,319
because you've got the staff
engineer title, which takes a while

16
00:01:37,439 --> 00:01:42,760
to get to and is also I
think a little bit uncommon on the infrastructure

17
00:01:42,840 --> 00:01:47,879
side. You know, it's pretty
common on the software engineering side, but

18
00:01:48,560 --> 00:01:53,040
I think it's a little more rare
to see staff engineers on the infrastructure side.

19
00:01:53,040 --> 00:01:57,439
So tell us how you got to
that point. Sure, yeah,

20
00:01:57,480 --> 00:02:00,879
I mean I've been doing this for
a long time. I got into

21
00:02:00,959 --> 00:02:04,200
computers as a kid, which I
guess is more common these days, maybe

22
00:02:04,319 --> 00:02:07,680
less so in the nineties. But
I knew early on as a kid that

23
00:02:07,879 --> 00:02:12,680
I wanted to do computer stuff,
and I wasn't exactly sure what computer stuff

24
00:02:12,840 --> 00:02:16,560
was. Something with programming. I
enjoyed that kind of, you know, "make

25
00:02:16,599 --> 00:02:22,159
the computer do what I want" aspect
of things. I was really lucky to

26
00:02:22,240 --> 00:02:27,199
land an internship at my local ISP
in like eighth grade, the summer after

27
00:02:27,240 --> 00:02:30,039
eighth grade, before ninth grade.
You know, I had a great mentor

28
00:02:30,080 --> 00:02:34,879
there, and I had kind of, I had done a mentorship, like, my

29
00:02:34,960 --> 00:02:38,599
seventh grade summer doing Linux stuff,
and I bought a laptop and ran Linux

30
00:02:38,639 --> 00:02:43,560
on it. And running Linux on
a laptop is a great way to become

31
00:02:43,599 --> 00:02:45,759
one with Linux, you know,
hate it, love it, whatever.

32
00:02:46,759 --> 00:02:52,039
And working at this ISP and having
a really great mentor, George, kind of, like,

33
00:02:52,439 --> 00:02:54,199
I realized, okay, sysadmin
is the thing I'd like to do.

34
00:02:54,240 --> 00:03:01,919
Like I enjoy debugging these problems and
building things and writing automation. So,

35
00:03:02,120 --> 00:03:05,439
you know, I did the high
school thing. I worked all through

36
00:03:05,520 --> 00:03:09,680
high school. I've been a workaholic
forever, you know, for better or worse. I

37
00:03:09,800 --> 00:03:14,759
worked at this ISP all through high
school. Went to college for a year.

38
00:03:15,840 --> 00:03:19,639
Wasn't my thing. It was fun
socially, but the school part was

39
00:03:19,719 --> 00:03:23,319
just not. I don't know,
I just lacked the focus. I just

40
00:03:23,360 --> 00:03:29,080
really wanted to work, right? So
I had contributed a bunch to FreeBSD in

41
00:03:29,080 --> 00:03:30,879
my high school days. I was
a ports committer and ports are like the

42
00:03:30,879 --> 00:03:37,400
packages in FreeBSD. And through
that I landed a job in a really

43
00:03:37,400 --> 00:03:44,240
awesome group at FedEx doing system administration
on everything Internet facing. And it was

44
00:03:44,240 --> 00:03:46,439
a small group, and looking back, like that was the group to be

45
00:03:46,479 --> 00:03:52,560
in at FedEx for doing Unix-y stuff. They were definitely ahead of their time.

46
00:03:53,319 --> 00:03:54,800
Everything was automated. They wrote their
own tools, and I thought this

47
00:03:54,879 --> 00:03:58,719
was very normal for, like, a 2002
shop, which, you

48
00:03:58,719 --> 00:04:01,560
know, now we know maybe wasn't.
So the whole automation-first thing has

49
00:04:01,599 --> 00:04:06,120
always just kind of been "how else
would you do it?" kind of thinking.

50
00:04:06,879 --> 00:04:11,919
And then I was lucky enough to
land an SRE gig at Google

51
00:04:11,960 --> 00:04:15,280
after that in 2005, and, you know, they were like,

52
00:04:15,319 --> 00:04:16,439
hey, we should automate things and
I was like, well, yeah,

53
00:04:16,439 --> 00:04:20,079
how else would you do it?
Right? And then just from there

54
00:04:20,120 --> 00:04:25,959
it's been a whirlwind of startups.
And I tried the banking world for a

55
00:04:25,959 --> 00:04:29,600
little bit, had some fun there, had some not-fun there. I

56
00:04:29,759 --> 00:04:32,800
ultimately decided to go back to the
startup world because I think that's my true

57
00:04:33,879 --> 00:04:36,199
you know, for
me, that's where the most fun is.

58
00:04:36,759 --> 00:04:39,879
The most fun I had at a
bank was kind of a startup in a

59
00:04:39,920 --> 00:04:45,199
bank, and that's very hard to
find. Oh, for sure. So yeah,

60
00:04:45,240 --> 00:04:49,279
cool. Yeah, I mean, there's
definitely a huge... it's almost

61
00:04:49,319 --> 00:04:57,439
two completely separate careers, working at
large enterprise organizations versus startups. Like, you

62
00:04:57,480 --> 00:05:00,879
have to have two different mental models
to be successful at each of those.

63
00:05:01,079 --> 00:05:03,959
It's almost two different skill sets.
I mean the base technical skill set of

64
00:05:03,959 --> 00:05:09,759
course, and then you know at
a startup you kind of have to pick

65
00:05:09,839 --> 00:05:11,759
up the pieces and lead the way
with what you have. And then at

66
00:05:11,759 --> 00:05:14,360
a big enterprise you have all the
resources, but you also have all the

67
00:05:14,399 --> 00:05:17,439
politics, and so you have to,
you know, as much as

68
00:05:17,720 --> 00:05:20,439
everyone I know hates that, you have
to play. You have to make

69
00:05:20,480 --> 00:05:27,800
friends in other organizations and figure out
how to influence people, and it is

70
00:05:27,839 --> 00:05:30,439
hard and stressful. I enjoy the
stress of tech, and the stress of

71
00:05:30,439 --> 00:05:36,399
that is just it's a lot.
Yeah, it's a lot of political engineering

72
00:05:36,720 --> 00:05:47,040
versus software engineering. Right, for
sure. So when you started work at

73
00:05:47,079 --> 00:05:51,120
the ISP, like, are we talking
back, like, dial-up modem ISP?

74
00:05:51,800 --> 00:05:55,680
Yeah. I sat in the after
room. I had

75
00:05:55,720 --> 00:05:58,959
like, PM2Es next to me,
and those are quiet at least, and

76
00:05:59,000 --> 00:06:00,879
then I had a rack of, like,
56Ks that were just, you know,

77
00:06:00,920 --> 00:06:03,319
I'm pretty sure I could whistle my
way into a 14.4 connection.

78
00:06:03,560 --> 00:06:09,000
Back then, we were, like, the
regional ISP, so we did

79
00:06:09,160 --> 00:06:14,160
T1s for businesses, and their
big thing was they would put the, I

80
00:06:14,199 --> 00:06:15,160
forget what Cisco it was, but
there was something where you can take the

81
00:06:15,160 --> 00:06:16,879
T1 and, like, oh,
I want to take some of my pairs

82
00:06:16,920 --> 00:06:19,360
and use them as phone lines,
and some of them as data, and that

83
00:06:19,439 --> 00:06:25,199
was, like, revolutionary in '97. So, yeah, for sure.

84
00:06:25,279 --> 00:06:28,360
Yeah. I was working in the
telco industry right around then. Nice.

85
00:06:28,920 --> 00:06:31,240
Yeah, and then we got, you
know, DSLAMs, and then the DSL thing

86
00:06:31,360 --> 00:06:33,920
came through, and it was just
it was a fun way to get a

87
00:06:33,959 --> 00:06:36,319
lot of exposure to a lot of
different things. I got to go to

88
00:06:36,399 --> 00:06:40,240
POPs and put stuff in, I
got to build Unix boxes, I got

89
00:06:40,279 --> 00:06:42,839
to kind of do the whole gamut
of things. So you got the benefit

90
00:06:42,839 --> 00:06:46,120
of dealing with Y2K. Yeah, you know, like, yeah,

91
00:06:46,199 --> 00:06:50,279
it was. It was such a
non event that people at work weren't even

92
00:06:50,279 --> 00:06:53,720
worried. You know, everyone was
very much a realist there. They were

93
00:06:53,759 --> 00:06:56,519
like, eh, whatever, everything breaks, we just live in a world where

94
00:06:56,560 --> 00:07:00,319
nothing works, and, you know,
we patched our software. What more is there

95
00:07:00,360 --> 00:07:02,720
to do? I knew people.
I didn't know them like very well or

96
00:07:02,720 --> 00:07:08,600
personally, but I knew people
pre-Y2K that, in the area I lived

97
00:07:08,600 --> 00:07:13,560
in at the time, they built
underground bunkers and stocked them, and they were

98
00:07:13,600 --> 00:07:17,480
going like late December they were going
underground and saying we're not coming out for

99
00:07:17,879 --> 00:07:23,399
ten years or something, and I've
never seen any of those people again,

100
00:07:23,519 --> 00:07:28,199
so I'm, like, insanely curious. Were
they out in January? Are they still in

101
00:07:28,720 --> 00:07:33,079
there? Yeah, yeah, yeah. I
didn't have, like, the, you know,

102
00:07:33,480 --> 00:07:35,639
I didn't know COBOL. I mean, there was a whole, you know,

103
00:07:36,000 --> 00:07:41,600
crew of people that were just making
insane money on projects, just Y2K

104
00:07:41,680 --> 00:07:44,240
prep, needed or not. I
mean, I feel like we'll hit that

105
00:07:44,279 --> 00:07:46,519
again in 2038 with
32-bit time_t. Like, that

106
00:07:46,560 --> 00:07:49,680
one is actually a little scary.
I hope to be retired by then,

107
00:07:50,519 --> 00:07:59,480
right? No doubt. That seems
like too much. Cool. So, talking

108
00:07:59,560 --> 00:08:05,079
about internal platforms and observability. Yeah, like, whenever this came

109
00:08:05,120 --> 00:08:09,720
across as our topic, I was
like, you know, that's just

110
00:08:09,839 --> 00:08:15,240
brilliant because all of these little internal
apps, and sometimes not little internal apps,

111
00:08:15,279 --> 00:08:20,959
but like the things that the company
uses to make decisions about are we

112
00:08:20,040 --> 00:08:26,279
doing the right thing or not?
They often are just like little pet projects.

113
00:08:26,360 --> 00:08:33,240
So what's your experience there, and
how do you get those recognized as

114
00:08:33,320 --> 00:08:37,879
the valuable assets that they are?
Yeah, Well, often it happens on

115
00:08:37,960 --> 00:08:43,159
its own, and but it's the
worst case of there's a catastrophic failure at

116
00:08:43,200 --> 00:08:46,000
the worst possible time, and everyone
goes, oh, that was really important.

117
00:08:46,679 --> 00:08:52,919
I worked somewhere at a public company, and the system that did the

118
00:08:52,159 --> 00:08:56,480
closing the books every day, you
know, like, very boring, right?

119
00:08:56,000 --> 00:08:58,600
Well, it wasn't working and it
was taking you know, thirty hours to

120
00:08:58,639 --> 00:09:01,919
close twenty four hours worth of books, and obviously that doesn't work so well.

121
00:09:03,600 --> 00:09:05,600
And they kind of didn't notice it
or take any action until, like

122
00:09:05,759 --> 00:09:09,159
they were a week before they had
to do some SEC filing, quarterly thing.

123
00:09:09,200 --> 00:09:11,879
I don't know all the details,
but it became, you know, suddenly

124
00:09:11,960 --> 00:09:15,919
on Friday afternoon, it was we
need all hands on deck to fix this.

125
00:09:16,000 --> 00:09:24,200
But it's everything from that to not
just necessarily apps, but like internal

126
00:09:24,200 --> 00:09:28,600
infrastructure, right? CI and developer experience. And everyone's got this, you know,

127
00:09:28,159 --> 00:09:31,879
shell script infrastructure that devs run to
run the local cluster on their laptop,

128
00:09:31,919 --> 00:09:35,440
and there's you know, any company
you can probably sit down and find a

129
00:09:35,440 --> 00:09:39,759
million of them. I just think
they're perpetually under, you know... The sexy

130
00:09:39,759 --> 00:09:43,720
part is production, right? People
want to build SLOs for users and show

131
00:09:43,720 --> 00:09:48,399
awesome graphs and "look at this great
incident management process we have." Then

132
00:09:48,399 --> 00:09:52,679
they have this, like, Jenkins instance
that's barely holding on internally, you know,

133
00:09:52,799 --> 00:09:54,399
that everyone hates but no one really
talks about because it's just, oh,

134
00:09:54,440 --> 00:09:58,919
yeah, that's life. You know, that whole side of things just needs

135
00:10:01,000 --> 00:10:07,000
more love. Yeah, And so
I think part of that, you know,

136
00:10:07,159 --> 00:10:11,799
like you mentioned there, especially with
like around Jenkins and things, it's

137
00:10:11,879 --> 00:10:18,519
people. It's like people who build
tools to solve specific problems that they're having,

138
00:10:18,159 --> 00:10:24,039
and then it almost has like this
organic growth of other people see it

139
00:10:24,080 --> 00:10:28,200
and are like, oh yeah,
I need to use that too, and

140
00:10:28,240 --> 00:10:33,720
then it grows in its role in
the company. So how do you identify

141
00:10:33,879 --> 00:10:39,919
those before that catastrophic event? Yeah, well, you know,

142
00:10:39,960 --> 00:10:41,720
if you think about it, I do a
lot of thinking about, like, how to compare

143
00:10:41,759 --> 00:10:45,559
it to production. So in production, if you say I want to launch

144
00:10:45,600 --> 00:10:48,240
this new microservice or I want
to consume this new AWS service, you

145
00:10:48,279 --> 00:10:52,120
write the design docs, you have a
launch review. I mean, hopefully, right?

146
00:10:52,320 --> 00:10:56,519
Ideally you have these things, you
have this whole, like, rigmarole and process for

147
00:10:56,559 --> 00:10:58,320
it. But internally it's like,
oh, I'm going to fire up an

148
00:10:58,360 --> 00:11:00,799
EC2 and run this new tool
I downloaded. Hey, two weeks later,

149
00:11:00,840 --> 00:11:05,159
it's important. You have to apply that
same principle. And part of it

150
00:11:05,200 --> 00:11:05,879
can be a staffing thing, right, Like, you know, you might

151
00:11:05,919 --> 00:11:09,519
have this team of a million people
working on prod, and then you've got the

152
00:11:09,559 --> 00:11:15,000
two, you know, IT guys
working on infrastructure stuff that doesn't work very

153
00:11:15,000 --> 00:11:18,120
well at all. Yeah, So
I mean, I think you have to

154
00:11:18,159 --> 00:11:24,320
put some process to it, and
I mean I think that I'm a big

155
00:11:24,360 --> 00:11:28,960
fan of the whole infrastructure management,
incident management, postmortem process. Like,

156
00:11:28,000 --> 00:11:31,600
I think that that is a great
way to drive, you know...

157
00:11:31,639 --> 00:11:33,480
Outages just happen. And we
all have to accept that nothing's one hundred

158
00:11:33,480 --> 00:11:37,559
percent, but you should get the
most out of every outage. So when

159
00:11:37,559 --> 00:11:43,200
your internal tool does blow up and
cause a, you know, company-visible,

160
00:11:43,279 --> 00:11:45,559
everyone is like, hey, I
don't really know what this was, but

161
00:11:45,600 --> 00:11:48,960
we couldn't do business for a day. Make the most of that, right,

162
00:11:48,080 --> 00:11:52,000
Hey, this thing broke. By
the way, there's five other services

163
00:11:52,000 --> 00:11:56,039
we've identified that are in the exact
same state and could all blow up tomorrow

164
00:11:56,039 --> 00:11:58,240
and we'd be on the same call. And so that's your that's your you

165
00:11:58,240 --> 00:12:03,240
know, entrance point to getting everyone
to have attention on that thing. And

166
00:12:03,240 --> 00:12:05,120
it's unfortunate that, you know,
things have to fail first sometimes, but

167
00:12:07,039 --> 00:12:09,879
that's how it goes, especially
at a startup where you kind of have

168
00:12:09,919 --> 00:12:15,480
conflicting priorities. Oh for sure.
Yeah, And I think it's a really

169
00:12:15,559 --> 00:12:18,120
good point there, which is, like,
outages happen, so let's just make the

170
00:12:18,120 --> 00:12:22,679
most out of that learning experience.
Absolutely, which is that's been a cultural

171
00:12:22,799 --> 00:12:28,639
change for us as an industry over
the last I don't know, i'd say

172
00:12:28,639 --> 00:12:33,559
ten to fifteen years. I feel
like Etsy published that, you know,

173
00:12:33,600 --> 00:12:37,519
the famous blameless postmortems. It feels
like that's ages ago now, but it's

174
00:12:37,519 --> 00:12:41,799
still so relevant every day. Yeah. Yeah, things broke, who cares

175
00:12:41,840 --> 00:12:43,639
how they broke. Let's just have
it not break in the same way again,

176
00:12:45,480 --> 00:12:46,840
right, And if you can do
that, that to me is like

177
00:12:46,879 --> 00:12:50,399
the health of an SRE team,
right, it's not. Yeah, I

178
00:12:50,440 --> 00:12:52,679
really hate places that measure health by
oh there were four outages last quarter,

179
00:12:54,240 --> 00:12:58,240
Okay, like, were they repeat root
causes? Yes? Okay, maybe there's

180
00:12:58,279 --> 00:13:01,639
a real problem. But if they
weren't repeat root causes and you're solving root

181
00:13:01,639 --> 00:13:07,480
causes and writing good postmortems, outages
are great. Yeah, yeah,

182
00:13:07,480 --> 00:13:09,000
for sure. And I think it's
I think that's one of the things that

183
00:13:09,720 --> 00:13:16,759
it's really hard to convince people of
is outages aren't as bad as you think

184
00:13:16,879 --> 00:13:24,000
because where most of us work, like,
we can recover. And I say that

185
00:13:24,039 --> 00:13:30,000
coming from a background where sometimes when
we had outages, people's lives were at

186
00:13:30,039 --> 00:13:33,399
stake. And so if you can
walk out of an outage and say no

187
00:13:33,440 --> 00:13:39,200
one died, like we're going to
be all right. Yeah. Well,

188
00:13:39,200 --> 00:13:43,759
plus it ties into SLIs, SLOs, error
budgets, right? People, a lot of

189
00:13:43,799 --> 00:13:48,200
times before the SLO concept became very
popular, were like, is five minutes of downtime

190
00:13:48,200 --> 00:13:52,679
bad? What does that mean?
Then you have this error budget and you can

191
00:13:52,720 --> 00:13:54,519
go, well, you know,
we're a three-nines service, committed to... we

192
00:13:54,559 --> 00:13:58,840
have forty-four and a half minutes
a month. Okay, five minutes isn't

193
00:13:58,879 --> 00:14:01,039
great, but it's fine.
You know, we can... Yeah, but

194
00:14:01,080 --> 00:14:05,080
if you have, you know, hey, I've gone past it and we're writing

195
00:14:05,159 --> 00:14:07,240
checks or, you know, giving SLA
credits, like, okay, now, you know, we

196
00:14:07,279 --> 00:14:15,080
have this quantifiable, not-how-people-feel
number to talk about things. Yeah.
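
[Editor's note] The error-budget arithmetic Pete describes can be sketched in a few lines of Python. This is only an illustration, not anything shown on the episode: the function names and the 30-day default window are assumptions. A three-nines (99.9%) availability target works out to roughly 43 to 45 minutes of allowed downtime per month depending on month length, which is in line with the figure he quotes.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window of `days`.

    `slo` is a fraction, e.g. 0.999 for "three nines". The 30-day default
    window is an assumption made for this sketch.
    """
    total_minutes = days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Minutes of error budget left after `downtime_minutes` of outage.

    A negative result means the budget is overspent, which is the point
    where SLA credits may come into play.
    """
    return error_budget_minutes(slo, days) - downtime_minutes


if __name__ == "__main__":
    # Budget for a three-nines service over a 30-day and a 31-day month.
    for days in (30, 31):
        print(f"{days}-day budget: {error_budget_minutes(0.999, days):.2f} min")
    # A five-minute outage against the 30-day budget.
    print(f"left after 5 min outage: {budget_remaining(0.999, 5.0):.2f} min")
```

The same calculation applies unchanged to an internal service such as CI; only the target and the definition of "down" differ.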

197
00:14:15,120 --> 00:14:18,399
So, for internal apps,
do you take them that far, where you

198
00:14:18,519 --> 00:14:24,000
assign them SLIs and SLOs
and SLAs for performance? I've been starting

199
00:14:24,000 --> 00:14:28,000
to. So this is actually my first gig out of many where

200
00:14:28,000 --> 00:14:31,120
I've really kind of focused on internal
stuff. I've often been the problem of,

201
00:14:31,159 --> 00:14:33,679
you know, the person focusing on
prod and, you know, on a

202
00:14:33,679 --> 00:14:37,440
different team. Now my team is
kind of, you know, a wearing all the

203
00:14:37,480 --> 00:14:43,240
hats thing here. And yeah,
so we're putting SLOs on internal services

204
00:14:43,240 --> 00:14:46,759
and internal workflows and trying to treat
them with the same... we have that in

205
00:14:46,759 --> 00:14:50,840
production, of course, and we're
trying to give it the same, uh,

206
00:14:50,519 --> 00:14:56,720
the same level of you know,
thought and execution. I've seen a

207
00:14:56,799 --> 00:15:03,200
huge shift in the industry overall,
where some teams, maybe they're called

208
00:15:03,200 --> 00:15:07,240
platform teams, have sort of been
excluded from thinking about what a product is

209
00:15:07,440 --> 00:15:11,159
or how they do product management or
even product ownership. And I feel like

210
00:15:11,159 --> 00:15:15,919
that's really been turning around, first
with the DevOps movement and now just, overall,

211
00:15:15,919 --> 00:15:18,720
looking at how we build services,
microservices, all together, and no one's

212
00:15:18,799 --> 00:15:24,679
sort of excluded or has a different
process for internal teams. Doesn't matter.

213
00:15:24,799 --> 00:15:28,600
You're still offering a real service to
someone. It just happens that your customers

214
00:15:28,679 --> 00:15:31,759
are within the same
organization. That's exactly how I sell it

215
00:15:31,840 --> 00:15:35,120
is. Yeah, you still have
customers, they just happen to be coworkers,

216
00:15:35,960 --> 00:15:39,480
right. Yeah, you can still
do the same stuff. You can

217
00:15:39,480 --> 00:15:41,679
build user journeys. You can figure
out what does a customer expect, what

218
00:15:41,720 --> 00:15:45,080
makes a customer angry. It's a
lot easier to figure it out because you

219
00:15:45,120 --> 00:15:48,080
can ask them in Slack instead of
guessing, like, man, what are

220
00:15:48,120 --> 00:15:50,039
my customers? What's the threshold at
which my customer complains? You can just

221
00:15:50,120 --> 00:15:52,600
straight up ask them when they work
with you. You know, I don't think they

222
00:15:52,679 --> 00:15:56,879
necessarily have good, like, nice answers,
though. Like your customers, you

223
00:15:56,919 --> 00:16:00,600
find there's a variety of personalities,
right? People, the moment a Jenkins

224
00:16:00,679 --> 00:16:03,759
job doesn't pass in four minutes,
they're like, ah, Jenkins is crap,

225
00:16:03,799 --> 00:16:07,240
it's terrible. But at least you
can get it. You know, that's

226
00:16:07,240 --> 00:16:10,080
how customers are, you know,
in general, right? So you

227
00:16:10,080 --> 00:16:12,399
get the whole gamut of emotions.
But you can tell when there's an outage.

228
00:16:12,440 --> 00:16:15,480
You know, how loud people are. Internal people will tend to be

229
00:16:15,519 --> 00:16:18,519
louder internally than your customers. I
mean, that's what I mean, Like

230
00:16:18,600 --> 00:16:21,200
you could definitely like it's still a
problem in some regard, but like how

231
00:16:21,240 --> 00:16:25,320
do you temper the difference of perspective? Like I feel like users on the

232
00:16:25,360 --> 00:16:27,960
outside are more quiet in a lot
of ways, like it's difficult to pull

233
00:16:29,000 --> 00:16:32,039
information out of them, and internally, as you mentioned, you know,

234
00:16:32,120 --> 00:16:34,519
it's like everyone's screaming as soon as
something goes even a little bit wrong. I

235
00:16:34,559 --> 00:16:38,559
think that if you can show them
that, A, there are graphs, right?

236
00:16:38,600 --> 00:16:41,360
like, there should always be a graph, you know, like, generically. I worked

237
00:16:41,360 --> 00:16:44,639
for a guy that always said, "Where's
the graph? Where's the graph?" And it was

238
00:16:44,639 --> 00:16:45,919
annoying at first, but you're like, okay, maybe there should always be

239
00:16:45,960 --> 00:16:51,240
a graph, so people know that, hey, people are actually watching Jenkins.

240
00:16:51,240 --> 00:16:55,320
Or rather, computers are watching
Jenkins, not people, right?

241
00:16:56,440 --> 00:16:57,759
And if you show like, hey, we had this outage and yes it

242
00:16:57,799 --> 00:17:00,480
was a really terrible day and no
one shipped any code, but here's this

243
00:17:00,519 --> 00:17:06,599
great postmortem everyone can read, and that's
got a reasonable trigger and root cause and

244
00:17:06,640 --> 00:17:10,000
follow ups and timeline and we're actually
closing the follow ups. Like it's very

245
00:17:10,039 --> 00:17:12,759
visible and I feel that should be
the same way with customer outages. It's

246
00:17:12,839 --> 00:17:18,640
very visible, right, nothing,
there shouldn't be anything to hide. Yeah,

247
00:17:18,759 --> 00:17:22,079
So one idea I've been working on
because I work a lot with startups

248
00:17:22,440 --> 00:17:29,640
and like, the whole thing about
a startup is odds are the product that

249
00:17:29,720 --> 00:17:32,599
you launch is not going to be
the product that you're successful with, and

250
00:17:32,640 --> 00:17:34,680
you're going to try a lot of
different things before you become successful as a

251
00:17:34,680 --> 00:17:41,160
company. So how do we measure
that so we don't spend any more time

252
00:17:41,279 --> 00:17:45,799
than is absolutely necessary working on the
wrong problem. And so I've been working

253
00:17:45,799 --> 00:17:48,680
on this idea of success criteria,
like what does it take to make this

254
00:17:48,799 --> 00:17:55,400
application successful? And a lot of
my background is in the infrastructure for mobile

255
00:17:55,400 --> 00:17:57,759
apps, and for mobile apps,
you can measure it as you know,

256
00:17:57,799 --> 00:18:03,880
we need ten thousand monthly active users
spending forty five minutes per week in the

257
00:18:04,000 --> 00:18:10,039
app or something like that, measure
engagement. Yeah, yeah, yeah,

258
00:18:10,119 --> 00:18:14,000
So, like, because then if
you tie in, like, you know,

259
00:18:14,039 --> 00:18:17,920
our total cost of acquisition to
get a new user is X amount of

260
00:18:17,920 --> 00:18:22,319
dollars, and our cost of infrastructure
is x amount of dollars per user.

261
00:18:22,400 --> 00:18:25,559
You know, you can do some
pretty simple math there to find out how

262
00:18:25,559 --> 00:18:30,240
many users you need for this to
be a profitable product. And so trying

263
00:18:30,279 --> 00:18:34,599
to figure that out and get that
in the early stages of an application so

264
00:18:34,720 --> 00:18:41,960
we know when to either double down
on this application or shelve it as soon

265
00:18:42,000 --> 00:18:47,240
as possible. I'm wondering if like
that same thing applies to internal tools,

266
00:18:47,240 --> 00:18:51,319
and how do you how do you
define what that looks like? Yeah,
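
[Editor's note] The "pretty simple math" Will mentions can be sketched as a break-even calculation. Everything beyond acquisition cost and per-user infrastructure cost is an assumption added for illustration: the episode gives no revenue or fixed-cost figures, and the function and parameter names below are hypothetical.

```python
import math


def payback_months(acquisition_cost: float,
                   revenue_per_user: float,
                   infra_cost_per_user: float) -> float:
    """Months until one user's monthly margin repays their acquisition cost.

    `revenue_per_user` and `infra_cost_per_user` are monthly figures.
    Returns math.inf when a user costs at least as much to serve as they bring in.
    """
    margin = revenue_per_user - infra_cost_per_user
    if margin <= 0:
        return math.inf
    return acquisition_cost / margin


def break_even_users(fixed_monthly_cost: float,
                     revenue_per_user: float,
                     infra_cost_per_user: float):
    """Smallest user count whose combined monthly margin covers fixed costs.

    Returns None when the per-user margin is not positive, since no number
    of users makes the product profitable in that case.
    """
    margin = revenue_per_user - infra_cost_per_user
    if margin <= 0:
        return None
    return math.ceil(fixed_monthly_cost / margin)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    print(payback_months(12.0, 5.0, 1.0))
    print(break_even_users(40_000.0, 5.0, 1.0))
```

For an internal tool there is usually no revenue figure, so a stand-in such as engineer hours saved would have to take its place, which is part of why the question is hard to answer internally.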

267
00:18:51,359 --> 00:18:53,359
that's a good question. And I
think I had, like, a good, you know,

268
00:18:53,440 --> 00:19:02,240
where we deployed, like, a code
analysis tool, and yeah, I

269
00:19:02,240 --> 00:19:04,400
don't, it's a good... I don't know
how to quantify it for stuff like that.

270
00:19:04,559 --> 00:19:08,039
But again, we have some
very active developers that will give

271
00:19:08,079 --> 00:19:11,319
feedback of, like, hey, this
thing gave me a bad analysis, which

272
00:19:11,359 --> 00:19:12,240
we can then make better.
It's like, okay, that means they're
It's like, okay, that means they're

273
00:19:12,240 --> 00:19:17,640
actually looking at it and you know, making code health better. And I

274
00:19:17,640 --> 00:19:19,799
don't know, I think the like, I think you're on the right track

275
00:19:19,880 --> 00:19:23,480
with, you know, having a number
like MAUs and that kind of thing.

276
00:19:25,119 --> 00:19:32,400
It's tough internally, Yeah, yeah, well, and it's interesting because yeah,

277
00:19:32,480 --> 00:19:34,279
well, I think there's two classes
of internal products, right there's internal

278
00:19:34,319 --> 00:19:38,039
products that people are very opinionated on. Code analysis is one, right,

279
00:19:38,039 --> 00:19:41,200
Like, I think this gives me
bad code analysis. I don't like that.

280
00:19:41,920 --> 00:19:48,160
But CI right, I think largely
people don't actually care what CI product

281
00:19:48,160 --> 00:19:52,440
you're running. They care: do my
PRs get approvals fast? Is master green?

282
00:19:53,599 --> 00:19:56,720
Do deployments happen at the pace that
we think they should at our company,

283
00:19:57,079 --> 00:20:00,240
and if that works, who cares
what the product is? Right,

284
00:20:00,279 --> 00:20:02,920
they're like, And so for stuff
like that, you can just be super

285
00:20:02,960 --> 00:20:04,240
results oriented, right, Like,
hey, is are the s lives?

286
00:20:04,319 --> 00:20:08,119
These our solos we're trying to go
for we're making them, we're not making

287
00:20:08,160 --> 00:20:15,279
them. We're a d inking shop
now, and we're considering alternatives. And

288
00:20:15,519 --> 00:20:17,839
of course, you know, you
ask people what we should. There's a

289
00:20:17,920 --> 00:20:21,680
million things, but no one actually
cares what we run, right, you

290
00:20:21,680 --> 00:20:25,200
know, if they can declare their
jobs in some declarative way that's not too

291
00:20:25,279 --> 00:20:30,720
terrible and it works, everyone's happy. Yeah, I'll tell you what'll make

292
00:20:30,759 --> 00:20:34,119
them care. If they have to
be the ones to convert your Jenkins jobs

293
00:20:34,200 --> 00:20:37,599
to the new platform, then a
lot of people will be like, ah,

294
00:20:37,680 --> 00:20:42,559
you know, Jenkins is probably okay. There's a lot of GitHub Actions

295
00:20:42,640 --> 00:20:45,759
talk because a lot of people have
done stuff there. So I mean part

296
00:20:45,759 --> 00:20:48,160
of you know, part of that
is, that's the political, or, well, the

297
00:20:48,720 --> 00:20:52,640
human side of it is, Hey, maybe it's if all of the products

298
00:20:52,720 --> 00:20:55,279
can do the thing, maybe we
pick the one people have the most experience

299
00:20:55,319 --> 00:20:57,480
with just to make life easier in
the transition. That's a good point.

300
00:20:57,640 --> 00:21:02,279
I mean, Will, you asked a question, which is, how do we know

301
00:21:02,400 --> 00:21:06,559
our thing is going to be successful
at a startup level? And if we

302
00:21:06,599 --> 00:21:11,559
take the perceived public metric that ninety
percent fail, what do companies do with

303
00:21:11,920 --> 00:21:15,720
teams that are part of that ninety
percent? Do they let them go?

304
00:21:17,400 --> 00:21:22,440
Do they reposition them? I think
that's a huge struggle and it's scary to

305
00:21:22,559 --> 00:21:27,720
know that there are some metrics associated
with your success, even within a larger company,

306
00:21:29,279 --> 00:21:33,920
that you don't have job security in. Yeah, I can tell you just

307
00:21:33,960 --> 00:21:42,200
in my experience with startups, that's
like, when you identify the success/fail criteria

308
00:21:42,759 --> 00:21:48,519
answers that question because if you can
identify it early, then you take those

309
00:21:48,559 --> 00:21:52,640
people and you put them on what
your next idea is. But in many

310
00:21:52,720 --> 00:21:56,720
cases, if you wait too late
to identify that this product is not going

311
00:21:56,759 --> 00:22:00,079
to be successful. You've already been
bleeding too much cash for so long that

312
00:22:00,200 --> 00:22:04,960
you've got to salvage what's left,
and the first thing that goes when you're

313
00:22:04,960 --> 00:22:08,279
trying to cut costs is your staff. I wonder if there's a lesson to

314
00:22:08,279 --> 00:22:12,079
be learned that can be pulled to
internal apps, though. In the startup market,

315
00:22:12,200 --> 00:22:17,880
there's this idea of letters of intent, right, getting signatures from potential customers

316
00:22:17,920 --> 00:22:22,000
even before you built anything based off
of that idea, and maybe maybe there's

317
00:22:22,000 --> 00:22:27,200
an idea here of how to transition
this to even internal teams. Yeah,

318
00:22:27,519 --> 00:22:32,240
I think, yeah, I like
the idea of formalizing it. I mean,

319
00:22:32,240 --> 00:22:34,480
I think what I often do for
these kind of things is like you

320
00:22:34,559 --> 00:22:37,079
find, uh, you know,
hey, you might have ten teams and

321
00:22:37,160 --> 00:22:41,480
six teams really care about this thing
you're building because they're big users of it.

322
00:22:41,480 --> 00:22:44,359
You find like a champion on each
team that is, you know,

323
00:22:44,400 --> 00:22:48,440
opinionated and is willing to talk to
you about what's good and bad. And

324
00:22:48,519 --> 00:22:49,920
maybe not a letter of intent,
but you write a design doc and then

325
00:22:49,960 --> 00:22:52,240
you make sure that they've read it
and give you feedback, not just to

326
00:22:52,319 --> 00:22:55,200
check, but you know, if
anyone reads a design doc and has no

327
00:22:55,279 --> 00:23:00,400
comments, they didn't actually read it. Like I've never met an engineer without

328
00:23:00,400 --> 00:23:03,079
an opinion on a design. So
so you get you know, you get

329
00:23:03,119 --> 00:23:07,240
feedback from people on these teams,
and it's maybe not as formal as a

330
00:23:07,279 --> 00:23:10,400
letter of intent, but you kind
of have the you know, social buy

331
00:23:10,400 --> 00:23:14,559
in, right, Yeah. Yeah, And that was one of the things

332
00:23:14,559 --> 00:23:19,119
that we whenever I worked at Active, I started with them super early and

333
00:23:21,240 --> 00:23:26,119
we just accidentally got this right.
But we built this platform for all the

334
00:23:26,160 --> 00:23:34,079
engineers to build and deploy their software, and the way we built it,

335
00:23:34,119 --> 00:23:40,640
we we had such great collaboration between
an infrastructure team and an engineering team that

336
00:23:40,839 --> 00:23:47,359
anytime the engineering team wanted to expand
the capabilities of that platform tool, most

337
00:23:47,400 --> 00:23:49,839
of the time their request came in
the form of a pull request to just

338
00:23:49,920 --> 00:23:55,839
add that feature to it. Yeah, and I've never been able to duplicate

339
00:23:55,880 --> 00:24:00,440
that since then, but that had
such a significant impact on how I think

340
00:24:00,440 --> 00:24:03,799
about platforms that that's my goal every
day. Yeah. I like that.

341
00:24:03,799 --> 00:24:08,200
It's kind of the open source approach, right, Yeah, Yeah, that's

342
00:24:08,240 --> 00:24:11,039
I mean, I think a lot
of that is the buy in, right,

343
00:24:11,039 --> 00:24:14,440
because if you're using a platform or
a language that no one else wants

344
00:24:14,559 --> 00:24:18,119
or knows or likes. They're not
going to learn Rust to send you a

345
00:24:18,160 --> 00:24:19,799
pull request to fix a thing they
don't like. But if it's already in

346
00:24:19,880 --> 00:24:22,759
Go and the rest of your code's
in Go, yeah, it'll just show

347
00:24:22,839 --> 00:24:27,759
up one morning. Maybe there's a
corollary here as well to the total economy

348
00:24:27,799 --> 00:24:33,000
in the startup world, where you
know your customers are somewhere along the spectrum

349
00:24:33,079 --> 00:24:37,640
of innovators, early adopters, early
majority, all the way to laggards,

350
00:24:37,680 --> 00:24:38,880
and it's really who are you talking
to? You know, what does the

351
00:24:38,880 --> 00:24:42,240
rest of your organization look like?
Because I feel like, well, you

352
00:24:42,240 --> 00:24:48,839
would need innovators and early adopters there
who not only need that functionality but have

353
00:24:48,880 --> 00:24:52,000
a huge stake or care about how
it's implemented, and not just people who

354
00:24:52,039 --> 00:24:57,880
think it's table stakes or just belongs
there or have opinions that they just want

355
00:24:57,880 --> 00:25:03,680
it done their way. Yeah,
yeah, true, yeah, And that's

356
00:25:03,920 --> 00:25:06,039
I think you hit that. I
think you hit it right on the head

357
00:25:06,079 --> 00:25:14,480
there that early stage startups require that
innovation mindset. For sure. They're not

358
00:25:14,480 --> 00:25:18,319
going to get very far if only
a few people are thinking about how to

359
00:25:18,319 --> 00:25:21,680
push whatever their product
is forward. Yeah, I really

360
00:25:21,680 --> 00:25:25,359
agree, and I want to clarify that
there. You know, there's there's a

361
00:25:25,400 --> 00:25:30,039
certain type of person who just wants
to show up for work and do their

362
00:25:30,039 --> 00:25:33,799
assigned job and they want to do
that for twenty or thirty years till they

363
00:25:33,799 --> 00:25:38,240
retire. And there's absolutely nothing wrong
with that, just that an early stage

364
00:25:38,240 --> 00:25:44,160
startup is not the right environment for
you to be successful. Yeah, you

365
00:25:44,240 --> 00:25:45,960
definitely see that. You find
at the bigger companies you have these

366
00:25:45,960 --> 00:25:49,200
classes of employees, right, there
are some people that you can tell either

367
00:25:49,240 --> 00:25:53,279
came from startups or are going to
go to startups afterwards. And then like

368
00:25:53,279 --> 00:25:56,880
you said, there are some people
that just you know, close tickets and

369
00:25:56,920 --> 00:26:00,640
do the work. And that's yeah, completely fine. Yeah, I sometimes

370
00:26:00,640 --> 00:26:06,599
wish that I could be that person. It'd be a lot less stressful. And

371
00:26:06,680 --> 00:26:07,559
yeah, I don't know, maybe
I will be in five or day.

372
00:26:07,559 --> 00:26:12,960
Who knows, right, people change, But yeah, So let's talk a

373
00:26:12,960 --> 00:26:18,480
little bit about the types of internal
platforms, because we've talked about you know,

374
00:26:18,480 --> 00:26:26,480
like CI/CD. What other tools are
out there that we should be looking

375
00:26:26,599 --> 00:26:33,400
for as internal tools. One that
pops up immediately in my mind only because

376
00:26:33,440 --> 00:26:37,680
I've been bitten by this one at
every single company I've ever worked at for

377
00:26:37,720 --> 00:26:45,440
the last three decades is the internal
data analytics team. Like those guys have

378
00:26:45,279 --> 00:26:52,960
monster infrastructures and are just doing what
it takes to generate the reports that the

379
00:26:53,000 --> 00:26:56,680
business wants to see. And whenever
you get a hold of it, like,

380
00:26:56,759 --> 00:27:00,960
oh wow, I'm not even really
certain how this thing is working.

381
00:27:00,839 --> 00:27:04,400
So what other
internal tools are out there? Well,

382
00:27:04,480 --> 00:27:08,000
some people run their own VCS,
you know, Git, GitLab on-prem.

383
00:27:08,880 --> 00:27:12,759
We run Gerrit, so that's its
own, you know, care and feeding, and that's

384
00:27:12,759 --> 00:27:18,880
the loos. I think one thing
people actually miss is the monitoring infrastructure itself.

385
00:27:18,880 --> 00:27:23,119
You know, if you run Prometheus
and Elastic or if you don't have

386
00:27:23,160 --> 00:27:26,920
a vendor, right, and even
if you have a vendor, you have

387
00:27:26,000 --> 00:27:30,119
some component that is shipping metrics to
them, right, you have to monitor

388
00:27:30,160 --> 00:27:33,839
that. If that stuff all breaks, you're flying blind and you need to

389
00:27:33,880 --> 00:27:40,799
know it. I think that often
goes neglected, right. I worked somewhere

390
00:27:40,839 --> 00:27:42,720
where someone came to us once and
they were like, you know, they

391
00:27:42,799 --> 00:27:45,960
run a really noisy on call rotation, which is its own problem, and

392
00:27:47,039 --> 00:27:48,839
they knew things were broken because they
hadn't gotten a PagerDuty page in

393
00:27:48,920 --> 00:27:52,359
ninety minutes, and that was their
escalation and I was like, well,

394
00:27:52,440 --> 00:27:55,400
it's actually broken, and this is
so sad that this is how we found

395
00:27:55,400 --> 00:27:59,039
out about the problem. On many
levels, right? One, we didn't know.

396
00:27:59,160 --> 00:28:03,440
Two, you expect to be paged
at least once every ninety minutes, but

397
00:28:03,559 --> 00:28:07,119
it's it's truly I think the monitoring
infrastructure, it seems like it should be

398
00:28:07,119 --> 00:28:11,400
obvious, but it's not necessarily obvious.
Do you think that's... Do you think

399
00:28:11,400 --> 00:28:17,759
that's a miss from, like, existing observability
tools that they aren't able to have internal

400
00:28:17,799 --> 00:28:21,640
metrics on what the expectation is on
getting logs, Like I feel like we've

401
00:28:21,720 --> 00:28:25,480
used CloudWatch for a while and
one of the things it does in AWS

402
00:28:25,559 --> 00:28:30,759
is you can alert on missing data. Well that's a hard like in Prometheus,

403
00:28:30,759 --> 00:28:33,880
it's very hard to alert on missing
data, right? You end up alerting on, like,

404
00:28:33,920 --> 00:28:37,200
the up time series being zero. But
then what if you have a problem that

405
00:28:37,240 --> 00:28:41,079
generates your targets and there is no
up time series, right? Like, there are a

406
00:28:41,119 --> 00:28:45,119
lot of these different things. Yeah, I think that it's not Yeah,

407
00:28:45,160 --> 00:28:47,960
I don't think it's first class of
you know, watch the watchers. I

408
00:28:48,039 --> 00:28:49,759
think the data is there, though, and you have to kind of do

409
00:28:49,880 --> 00:28:56,359
it. I worked somewhere where we
had a relatively large Prometheus and Grafana infrastructure

410
00:28:56,400 --> 00:29:00,440
for metrics and a really large Splunk
infrastructure for logs, and both of them

411
00:29:00,440 --> 00:29:03,480
had their problems, and they were run by
two different teams. And we got together

412
00:29:03,519 --> 00:29:07,480
and said, hey, let's make
a deal. Right, We'll write a

413
00:29:07,519 --> 00:29:11,720
prober for Splunk and have it export
metrics and monitor Splunk in Prometheus. You

414
00:29:11,759 --> 00:29:15,039
write a prober for Prometheus and have
it write logs and monitor it in Splunk. And

415
00:29:15,079 --> 00:29:18,160
that's not perfect, but you make do
with what you've got. And that was

416
00:29:18,200 --> 00:29:22,839
a company where it was a large
place that was going to be hard to

417
00:29:22,880 --> 00:29:26,039
bring in a third party tool to
help us, and you know, but

418
00:29:26,240 --> 00:29:30,200
we made do with it. And
it was super successful. We found lots

419
00:29:30,240 --> 00:29:32,240
of problems with you know, if
they both died at the same time,

420
00:29:32,279 --> 00:29:33,599
Okay, yeah, that's the end
of the world. But the world is

421
00:29:33,599 --> 00:29:37,799
probably already ending if they both died. So do you think there are third

422
00:29:37,799 --> 00:29:41,839
party vendors that actually solve this in
a reasonable way? Well, I mean,

423
00:29:41,839 --> 00:29:45,640
I think it's hard to say,
right, It depends what you're looking.

424
00:29:45,640 --> 00:29:48,519
If you're using a third party vendor
for monitoring, you know, you

425
00:29:48,559 --> 00:29:53,799
need to look at your metrics of
shipping. Is your data there with them?

426
00:29:55,000 --> 00:29:57,119
Right? So you have to set
something up in their system to do

427
00:29:57,200 --> 00:30:00,880
it. But then, do you
want a backup? You

428
00:30:00,880 --> 00:30:03,920
know? Do you want to know
if they're up? You may have to

429
00:30:03,000 --> 00:30:08,200
run a little side monitoring infrastructure too
to watch them, because there might not be

430
00:30:08,200 --> 00:30:11,359
anything you can do about it,
but you may want to at least be

431
00:30:11,400 --> 00:30:14,359
aware that, Hey, the thing
that normally sends me alerts is not going

432
00:30:14,440 --> 00:30:18,559
to send me alerts. Maybe we
should all go back to the NOC

433
00:30:18,680 --> 00:30:21,799
days and you know, stare at
some things for a couple of hours.

434
00:30:22,839 --> 00:30:23,799
I mean, that's exactly what I
don't want to have to think about,

435
00:30:23,799 --> 00:30:27,319
Like, I don't want to have to think about
my vendor being down in some way that

436
00:30:27,480 --> 00:30:30,599
requires me to monitor them. Like
I feel like, you know, if

437
00:30:30,640 --> 00:30:33,240
I'm paying money out for that,
I should I should be getting that by

438
00:30:33,319 --> 00:30:37,160
default. I don't know. Maybe
I think that's just my pessimistic brain.

439
00:30:37,599 --> 00:30:44,559
Yeah, for sure, everything will
break. Everything will break, So whether

440
00:30:44,599 --> 00:30:47,880
you're building it or not, it's
going to break and you either want to

441
00:30:47,880 --> 00:30:49,519
know about it or you know,
if you think something doesn't break, you're

442
00:30:49,599 --> 00:30:52,359
just not measuring it. Yeah,
that's kind of mine. Yeah, yeah,

443
00:30:52,480 --> 00:30:56,160
I have two examples where I think
that's a really like a lot of

444
00:30:56,160 --> 00:31:03,160
effort has gone into that. Coralogix
is a logging and monitoring platform, and

445
00:31:03,200 --> 00:31:07,559
they have anomaly detection, and so
if you have a service that

446
00:31:08,200 --> 00:31:12,400
normally spits out, you know,
one thousand log entries per hour, if

447
00:31:12,440 --> 00:31:18,839
it stops, it identifies that as
an anomaly, and will trigger on that and

448
00:31:18,839 --> 00:31:23,319
say, hey, nothing is like
technically alerted, but something's changed here.

449
00:31:25,000 --> 00:31:29,920
And then another one that we just
recently started using is a tool for monitoring

450
00:31:29,960 --> 00:31:36,559
our infrastructure spend called CloudZero,
and it has a really cool anomaly detection

451
00:31:36,720 --> 00:31:40,640
as well. It says, hey,
this project, their current spend rate has

452
00:31:40,680 --> 00:31:45,039
been this, but today it changed
to this, which is you know,

453
00:31:45,079 --> 00:31:49,000
not necessarily a problem, but cool
that it acknowledges that and you're like,

454
00:31:49,079 --> 00:31:56,440
yeah, why did that change?
I think the correlation analysis like that is

455
00:31:56,480 --> 00:32:00,400
really good for not necessarily paging someone
at 2 a.m., but for reports and debugging, like,

456
00:32:00,440 --> 00:32:06,039
hey, I have a problem that started
two days ago, what else changed two

457
00:32:06,119 --> 00:32:07,720
days ago? Right, Hey,
you also started spending less money on this.

458
00:32:07,839 --> 00:32:10,960
It's like, yeah, correlation doesn't
mean causation, but I'm certainly going

459
00:32:12,000 --> 00:32:15,039
to look there first. Now.
I think people underestimate that really though,

460
00:32:15,079 --> 00:32:19,119
like just looking at, when did
the problem start? What else happened there?

461
00:32:19,160 --> 00:32:22,200
And I feel like it's so obvious
to say that, but it's usually

462
00:32:22,200 --> 00:32:24,480
one of the first things that's missed. Yeah, for sure. It's been

463
00:32:24,519 --> 00:32:28,240
like a pet project of mine for
a very long time. I have a

464
00:32:28,279 --> 00:32:31,640
super stale, busted repo trying to
do this. Back when, uh remember

465
00:32:31,720 --> 00:32:36,720
CEP, complex event processing? It was
all the hype and like maybe it wasn't

466
00:32:36,720 --> 00:32:39,519
all the hype, but it was
somewhat popular in the late two thousands,

467
00:32:39,839 --> 00:32:45,720
I was trying to do the uh
Holt-Winters forecasting, like the thing that's

468
00:32:45,720 --> 00:32:47,480
built in, you know,
RRDtool added that thing, again, ages ago,

469
00:32:47,680 --> 00:32:52,519
dating myself. You could, you
know, put the prediction line in front

470
00:32:52,559 --> 00:32:54,039
of your graph, And I always
thought like, yeah, let's just do

471
00:32:54,079 --> 00:32:58,039
that, like let's predict every time
series. Like it may not be great

472
00:32:58,039 --> 00:33:00,079
signal and I'm not going to look
at it unless something's broke, but hey,

473
00:33:00,160 --> 00:33:02,880
I've got twenty thousand metrics. I'm
not going to look at all of

474
00:33:02,920 --> 00:33:07,319
them, right, So when the
site car starts getting slow, it'd be

475
00:33:07,359 --> 00:33:10,000
really interesting if I could also see, hey, I/O wait time on

476
00:33:10,039 --> 00:33:14,240
the database server went up. It's
okay, are we sending queries? Did a

477
00:33:14,319 --> 00:33:16,839
disk go bad? Oh, hey, look, errors on this disk spindle started

478
00:33:16,880 --> 00:33:22,000
going up too, Like maybe there's
something here. And I think it solves

479
00:33:22,200 --> 00:33:29,160
another monitoring problem that people still make
is people monitor root causes and not what

480
00:33:29,200 --> 00:33:31,799
they actually want out of a system, right, like sending me an alert

481
00:33:31,799 --> 00:33:35,599
that my CPU is high at two
in the morning, Like, cool,

482
00:33:35,599 --> 00:33:38,039
we're using the CPU we paid for, like, thumbs up, back to bed, right?

483
00:33:38,640 --> 00:33:43,240
awesome capacity planning. But if it's
making the site slow, page me

484
00:33:43,279 --> 00:33:45,960
and say the site is slow,
right? Don't page me and say... don't

485
00:33:45,000 --> 00:33:47,640
page me and say Java is doing
GC pauses? Well, yeah, you're

486
00:33:47,680 --> 00:33:52,759
running Java. That's what it does all
day. If they're too long, you're

487
00:33:52,759 --> 00:33:55,440
going to fail a latency monitor and
page me on the thing that matters.

488
00:33:55,440 --> 00:33:59,880
So like, I want these root
causes to be the correlation engine to kind

489
00:33:59,880 --> 00:34:02,119
of help me figure out. Okay, well eighty thousand things go wrong,

490
00:34:02,160 --> 00:34:05,839
and if you put an alert on
all of those. You're back at that

491
00:34:05,880 --> 00:34:10,519
team that notices they don't get paged
for ninety minutes. Right, Yeah,

492
00:34:10,519 --> 00:34:14,559
if you look at every team that
has page fatigue, this is exactly the

493
00:34:14,559 --> 00:34:16,880
problem. Yeah, most people don't
even know why they get paged for ninety

494
00:34:16,880 --> 00:34:20,559
percent of the things. It's organically
grown over the life of the team.

495
00:34:20,880 --> 00:34:23,920
Hey, this one time, you
know, the database backups did this thing

496
00:34:24,039 --> 00:34:27,199
terribly, and now we page on
it. And then you have so many

497
00:34:27,239 --> 00:34:30,159
of those it's like, well,
there's eight thousand alerts and most of them

498
00:34:30,199 --> 00:34:34,760
resolve themselves, and we just shrug
our shoulders. And I don't think it

499
00:34:34,800 --> 00:34:37,880
starts from a good place. Like
I think there's this idea that just having

500
00:34:37,000 --> 00:34:42,480
dashboards is valuable in itself, and
it sort of propagates from there. It's

501
00:34:42,480 --> 00:34:44,280
like, oh, you know,
what are all the metrics we could be

502
00:34:44,320 --> 00:34:49,079
collecting, and let's show so.
And sometimes there is an organization that thinks

503
00:34:49,119 --> 00:34:52,239
that there is some inherent value,
like oh, this is how many requests

504
00:34:52,280 --> 00:34:55,199
we're getting, Let's use it to
get more headcount because we have to support

505
00:34:55,239 --> 00:35:00,159
such complex things. Or I remember
a previous company that thought it was really

506
00:35:00,159 --> 00:35:05,119
cool to have a geolocation dashboard of
where things were happening all over the world

507
00:35:05,360 --> 00:35:07,320
related to their thing, And I'm
like, you know what, I didn't

508
00:35:07,440 --> 00:35:09,519
need to drive that with real data. I could have just randomly pinged a

509
00:35:09,559 --> 00:35:13,960
spot on the world on a flat, you know, two dimensional map and

510
00:35:14,000 --> 00:35:15,559
be like, look it's happening.
I didn't need to know the lights coming

511
00:35:15,679 --> 00:35:20,559
up all over Yeah, for sure. It's just so totally unnecessary. And

512
00:35:20,599 --> 00:35:24,480
so I've been on this maybe personal
vendetta here to make them be actionable.

513
00:35:24,760 --> 00:35:29,119
So you know, what is the
business impact? But more than that,

514
00:35:29,199 --> 00:35:31,400
you know what will you do with
that information? Is the site being slow?

515
00:35:31,440 --> 00:35:36,280
Will you actually take an action here
to make it faster? Or is

516
00:35:36,280 --> 00:35:38,760
there like a run book or something
that we can go and actually execute on.

517
00:35:38,880 --> 00:35:42,960
Well that's another interesting part of it
all is like the term run book

518
00:35:42,960 --> 00:35:46,440
has been so ruined by so many
places. Like if your run book is

519
00:35:46,480 --> 00:35:52,280
log into the server and run the
script, like yeah, it's just not

520
00:35:52,320 --> 00:35:55,039
automated, and like I think our
run book should be these are the places

521
00:35:55,039 --> 00:35:58,960
you should look, like, if
it's ever a thing we know that breaks,

522
00:35:59,239 --> 00:36:01,840
fix it. Like, yeah,
I talked to someone once that was

523
00:36:01,840 --> 00:36:05,760
saying they were very proud. I
mean I felt that or they built this

524
00:36:05,800 --> 00:36:07,960
great thing and like we had this
service the crashes all the time, and

525
00:36:07,960 --> 00:36:13,039
we built this great thing that automatically
restarts it. I was like, okay,

526
00:36:14,320 --> 00:36:16,679
I see right. How about collecting
stack traces and sending it to developers

527
00:36:16,719 --> 00:36:21,679
and fixing the reason the service crashes at all? Like, yes, you should

528
00:36:21,719 --> 00:36:24,119
restart the service when it crashes,
no argument, But like, that is

529
00:36:24,119 --> 00:36:27,840
not the end of your journey.
That is the first ten percent of your

530
00:36:27,920 --> 00:36:30,519
journey, right, you know,
I you know, I love that because

531
00:36:30,519 --> 00:36:32,880
it actually happened at one of the
previous companies I was at. The

532
00:36:32,960 --> 00:36:38,000
name of the service was the Service
Monitor Monitor, and it did actually do

533
00:36:38,079 --> 00:36:43,039
this. The root cause, though, and maybe you've got some infinite wisdom

534
00:36:43,079 --> 00:36:49,039
here, is they were using a
library for math operations that had a memory

535
00:36:49,079 --> 00:36:52,920
leak in it, and so this
would have required actually contacting a third party

536
00:36:52,920 --> 00:36:57,960
company to get their open source software
actually fixed, which I think was under

537
00:36:57,960 --> 00:37:04,519
a proprietary use license. So sometimes
you have to do some ridiculous things.

538
00:37:05,119 --> 00:37:07,320
Yeah, I mean, I guess
if you're in a world where you have

539
00:37:07,400 --> 00:37:10,800
to use a vendor library you can't
fix, you may just have to you

540
00:37:10,840 --> 00:37:15,519
know, live with the workarounds and
acknowledge the terror. Yeah, no,

541
00:37:15,639 --> 00:37:20,119
for sure. The problem is when
you start extending that to every problem that

542
00:37:20,199 --> 00:37:23,880
looks similar, rather than knowing that
it's the right answer. Right, the nightly

543
00:37:23,920 --> 00:37:31,239
Jenkins restart. Yeah, well go
ahead. I would say, like,

544
00:37:31,280 --> 00:37:36,519
you know, a lot of some
bank places I've worked, we're like,

545
00:37:36,559 --> 00:37:39,039
hey, we should just you know, we're twenty four by five point five,

546
00:37:39,159 --> 00:37:42,920
right, you know the Japan and
US markets and they're closed, you

547
00:37:42,920 --> 00:37:47,559
know, Friday night. Let's just
reboot everything every Saturday. I'm like why,

548
00:37:47,679 --> 00:37:51,320
Like why not? And I was
like, I think I asked the

549
00:37:51,400 --> 00:37:55,599
question first, like like yeah,
we should. We should reboot things when

550
00:37:55,639 --> 00:38:00,000
we upgrade the kernel, and we
should use that time to apply upgrades.

551
00:38:00,159 --> 00:38:02,599
But just doing a bunch of stuff
every Saturday because we can because there's no

552
00:38:02,679 --> 00:38:07,840
services. It's like it feels like
we're just making busy work and you know,

553
00:38:08,079 --> 00:38:10,800
finding things to break on a Saturday
and ruin our weekends. Yeah,

554
00:38:10,800 --> 00:38:15,320
but those are also really hard arguments to
argue against, even though you know they're

555
00:38:15,360 --> 00:38:23,239
somehow fundamentally wrong. Yeah, we
didn't do it. Regress, Yeah,

556
00:38:23,239 --> 00:38:27,239
someone you know, there was a
change set out to put a reboot in

557
00:38:27,239 --> 00:38:31,480
a cron job, and I was
like, that is no, that's something

558
00:38:31,559 --> 00:38:35,480
is going to be terrible there,
and I don't want to that's how to

559
00:38:35,559 --> 00:38:38,800
ruin a weekend 101.
Yeah, totally. And it was already

560
00:38:38,840 --> 00:38:43,800
like that because I don't know.
I think twenty four by seven is a

561
00:38:43,880 --> 00:38:47,280
way better environment to have fundamentally good
operations than twenty four by five point five.

562
00:38:47,280 --> 00:38:51,360
There's just too many bad habits in
twenty four by five point five. Oh,

563
00:38:51,599 --> 00:38:53,920
we can always restart this by turning
off the database and running, or turning off

564
00:38:53,960 --> 00:38:57,599
the clients and running an alter.
Eventually you're going to have to run an

565
00:38:57,599 --> 00:39:00,880
alter mid week during trading. Yeah, you're not gonna like it, but

566
00:39:00,880 --> 00:39:02,320
it's gonna happen, and you're gonna
have to take an out us to do

567
00:39:02,320 --> 00:39:05,599
it because you haven't figured out how
to do it. I mean, there's

568
00:39:05,639 --> 00:39:08,000
two obvious failure modes from now,
which are, well, what happens if

569
00:39:08,039 --> 00:39:13,320
something changes on the restart, like
you know, a new upgrade or something

570
00:39:13,400 --> 00:39:17,280
right now you're triggering that at an
unpredictable time, or just straight database replication

571
00:39:17,440 --> 00:39:21,760
crashing and losing out on whatever
was in the journal at that moment,

572
00:39:22,639 --> 00:39:25,079
that's not the thing you want to
actually have happen. Yeah, do it

573
00:39:25,119 --> 00:39:28,719
in your lab to figure out how
to deal with it, but maybe not

574
00:39:28,760 --> 00:39:37,000
prod. Yeah. So you've mentioned
Jenkins and Gerrit, and so I feel

575
00:39:37,039 --> 00:39:40,519
like I'm picking up on a trend
here that you're running a lot of services

576
00:39:40,679 --> 00:39:45,239
in house, whereas other companies may
choose to use SaaS providers for those Do

577
00:39:45,280 --> 00:39:50,159
you have a particular opinion on that? Well, I mean I feel like

578
00:39:50,360 --> 00:39:54,360
my whole career, I've been at
all angles of the build versus buy debate.

579
00:39:54,800 --> 00:39:59,719
It was funny some of the banks
I've worked at, really any of

580
00:39:59,719 --> 00:40:04,440
the big orgs you see, like
the historical CIOs, CTOs, whoever they are,

581
00:40:04,760 --> 00:40:08,039
or one will get hired and
they're like, we have to build everything. Then

582
00:40:08,239 --> 00:40:10,760
then you can you know, you
have ten years of this right, and

583
00:40:10,760 --> 00:40:14,400
then you look back and you have
all these disparate things of like, oh,

584
00:40:14,440 --> 00:40:17,239
this must have been built during the
build era of two thousand and nine.

585
00:40:17,679 --> 00:40:22,880
I mean, I think the answer
is, really it depends on the

586
00:40:22,920 --> 00:40:25,920
staff and what their expertise is,
and should you run Jenkins in house?

587
00:40:25,920 --> 00:40:29,599
Well, can you run Jenkins in house? Right? Like, do you have, are

588
00:40:29,639 --> 00:40:34,480
you going to dedicate the resources to
do it? Right? But just outsourcing

589
00:40:34,519 --> 00:40:37,639
something isn't always as easy as just
paying somebody, right we mentioned before,

590
00:40:37,639 --> 00:40:43,360
it's porting stuff there, it's operationalizing
it, and actually, like taking taking

591
00:40:43,360 --> 00:40:47,800
advantage of a SaaS service to do
provide value is sometimes just as hard.

592
00:40:49,119 --> 00:40:52,559
The challenge isn't running the infrastructure,
it's using the infrastructure effectively. Right,

593
00:40:52,599 --> 00:40:58,800
Having Jenkins doesn't help. Having jobs
that do meaningful things are important, and

594
00:40:58,880 --> 00:41:01,280
you have that problem in any CI
infrastructure. So I think some of the

595
00:41:01,320 --> 00:41:07,400
things that doesn't matter, you know, observability, right, it's really hard

596
00:41:07,480 --> 00:41:12,920
to run high availability observability. It's
also a mouthful to say. So, I

597
00:41:12,960 --> 00:41:15,039
mean, like I'm a fan of
I mean worker reservability company, but I

598
00:41:15,039 --> 00:41:17,719
mean I'm a fan of outsourcing some
of that stuff. I think there's a

599
00:41:17,719 --> 00:41:22,239
certain scale where you run it internally, I think very few people are at

600
00:41:22,239 --> 00:41:29,760
that scale. Yeah, very few
people are actually running four nines proper,

601
00:41:30,360 --> 00:41:32,719
Like that's I don't know if it's
a triple digit number, but it's

602
00:41:32,719 --> 00:41:36,119
a small amount of companies and actually
a lot of people think they're doing it,

603
00:41:36,159 --> 00:41:38,320
but not a lot of people actually
do it, right, It's interesting.

604
00:41:38,320 --> 00:41:42,360
I mean, we're we're five nines
on our core competency service, which

605
00:41:42,400 --> 00:41:45,440
is like identifying stuff, but yeah, it's huge. We're not we're not

606
00:41:45,599 --> 00:41:50,719
running the monitoring observability stuff ourselves like
we are, like we've found we're using

607
00:41:50,719 --> 00:41:52,760
our cloud provider or actually we're still
in the process of trying to find a

608
00:41:52,840 --> 00:41:57,079
vendor that actually works with us.
And I think that's going to fuel my

609
00:41:57,199 --> 00:42:00,280
next question. I'm curious whether or
not you see companies get the build versus

610
00:42:00,280 --> 00:42:05,719
buy decision, right. I know
that they put a lot of effort into

611
00:42:05,840 --> 00:42:09,079
comparing vendors, but then I feel
like it's sort of there's this gap on

612
00:42:09,679 --> 00:42:15,920
actually being able to correctly identify what
the total cost of ownership is if they

613
00:42:15,920 --> 00:42:19,840
do actually go and build or run
something themselves. You know that that is

614
00:42:19,840 --> 00:42:22,360
a tough one, right. TCO
is so hard because it's easy to do

615
00:42:22,440 --> 00:42:25,360
the this is my AWS bill,
this is the vendor bill. How do

616
00:42:25,360 --> 00:42:30,599
you quantify the outages and the people
and the stress and the upgrades? And

617
00:42:30,159 --> 00:42:32,199
yeah, I haven't seen a good
answer for that. I mean, of

618
00:42:32,239 --> 00:42:36,360
course many you know, any vendor
will try to do that for you because

619
00:42:36,400 --> 00:42:39,079
they of course want to help.
Yeah, good, but that one is

620
00:42:39,119 --> 00:42:43,639
tough, I mean, but seeing
it done right. I think the most

621
00:42:43,679 --> 00:42:47,320
common mistake people make when they're doing
the evaluation: writing a requirements doc is

622
00:42:47,320 --> 00:42:52,639
surprisingly hard. Yeah, like writing
a requirements doc for CI and don't use

623
00:42:52,679 --> 00:42:58,239
a single product name right right.
People say, like, my requirement is

624
00:42:58,280 --> 00:43:00,199
an Envoy proxy. It's like,
there's absolutely no way. If your requirement

625
00:43:00,239 --> 00:43:04,760
is an Envoy proxy, what
have you done? Right? If your

626
00:43:04,760 --> 00:43:07,760
requirement is a thing that speaks xDS, maybe Envoy's the only answer. But

627
00:43:07,840 --> 00:43:12,679
write down the thing you want,
the outcome you want. This is the

628
00:43:12,719 --> 00:43:16,159
same root cause versus slow thing,
Like it's the XY problem

629
00:43:16,199 --> 00:43:19,679
as well. I don't know if
yeah yeah, people say I want to

630
00:43:19,679 --> 00:43:22,559
do this thing is like, what
are you really trying to accomplish? And

631
00:43:22,599 --> 00:43:25,320
then let's figure out this is what
we could build and accomplish that. This

632
00:43:25,360 --> 00:43:29,360
is what we can buy and accomplish
that or not accomplish that, and maybe

633
00:43:29,400 --> 00:43:31,440
the decision gets more obvious, but
it's just so hard to frame it in

634
00:43:31,519 --> 00:43:36,840
terms of completely agnostic to the tool, the thing you want to do.

635
00:43:37,400 --> 00:43:38,760
You get to be the bad guy. You you have to say, well,

636
00:43:39,119 --> 00:43:42,199
you were going to do an easy
job, which is just pick a

637
00:43:42,239 --> 00:43:45,000
tool. Uh, and now you're
forcing them to go back to the drawing

638
00:43:45,039 --> 00:43:49,840
board like well, why you know, really really look right? Why do

639
00:43:49,880 --> 00:43:54,000
you want your builds to succeed?
Yeah? Exactly, But it's a good

640
00:43:54,079 --> 00:43:57,239
question, like in the CI case, like why do you want your builds

641
00:43:57,280 --> 00:44:00,400
to succeed? Because I want to
merge code faster and ship code to production

642
00:44:00,480 --> 00:44:04,599
faster. Okay, so your metric
is time from commit to development, time

643
00:44:04,599 --> 00:44:07,199
from commit to production. Okay,
we came up with really great metrics from

644
00:44:07,199 --> 00:44:14,480
asking why. I think it usually
it seems kind of like, I don't

645
00:44:14,480 --> 00:44:17,119
know, weird at first, and
you know, contrived, but I think

646
00:44:17,159 --> 00:44:20,840
it actually leads to like, oh, okay, yeah that makes sense.

647
00:44:21,079 --> 00:44:23,639
No, I'm totally what do Yeah, it makes me think that there's like

648
00:44:23,719 --> 00:44:28,000
some Freudian stuff here where I just
want to sit here with a pipe and

649
00:44:28,039 --> 00:44:34,039
go. But why do you want
your build to succeed? Well, is

650
00:44:34,039 --> 00:44:37,719
it because of unresolved issues with your
mother that you feel your build must succeed?

651
00:44:38,360 --> 00:44:44,400
I mean when you pull individuals into
an organization, their personal values do

652
00:44:44,679 --> 00:44:50,679
impact what that organization drives as important
and sometimes I've known software engineers to be

653
00:44:51,039 --> 00:44:55,159
uh, quite illogical and driven by
their emotional state to you know, it

654
00:44:55,199 --> 00:44:59,400
has to be like this, it's
so much better, and you know,

655
00:44:59,440 --> 00:45:02,760
you joke, but there is something
there. No, that's definitely I mean,

656
00:45:02,800 --> 00:45:13,719
I'm probably guilty of that. Yeah. So you work for an observability

657
00:45:13,760 --> 00:45:17,360
company and you're observing your internal tools. Do you use your own product for

658
00:45:17,480 --> 00:45:22,119
that? We do? Yeah,
Yeah, trying to dog food all the

659
00:45:22,159 --> 00:45:29,400
time and say we measure SLOs internally. One cool thing we've done is we

660
00:45:29,480 --> 00:45:31,159
have a bunch of you know,
it's not great, like you know,

661
00:45:31,239 --> 00:45:36,639
some some shell scripts stuff that maybe
shouldn't be shell scripts, but everyone's got

662
00:45:36,760 --> 00:45:40,119
a pile of that somewhere and it's
pretty important in the normal workflow. And

663
00:45:40,159 --> 00:45:45,039
it was having a lot of problems, and uh, you know, we

664
00:45:45,119 --> 00:45:47,239
just found a better way to measure
like what is the success of people that

665
00:45:47,360 --> 00:45:51,159
say I want to run a local
cluster? How often does that fail?

666
00:45:52,039 --> 00:45:57,840
Turns out it was failing a lot
more than we thought. And then

667
00:45:57,840 --> 00:45:59,960
debugging it is really hard because you're
like, okay, well could you go

668
00:46:00,159 --> 00:46:02,320
put a set -x in this
file and run it again in the

669
00:46:02,360 --> 00:46:07,280
mouth, and no one wants to
like, no one wants to do that,

670
00:46:07,400 --> 00:46:08,239
right. People just want to say, hey, the thing's broken,

671
00:46:08,360 --> 00:46:12,519
can't you just fix it?
Right? So we built all of this

672
00:46:12,639 --> 00:46:15,559
in even to our shell right,
so like we have like a tracing view

673
00:46:15,599 --> 00:46:20,079
think of microservice tracing.
Like there's a request ID and you

674
00:46:20,159 --> 00:46:22,280
know it hits tny services, Like
you run this one shell script that actually

675
00:46:22,320 --> 00:46:27,000
is running like you know, ninety
three shell scripts underneath it for better or

676
00:46:27,079 --> 00:46:30,239
worse or worse, but it's happening, and we have like a trace graph

677
00:46:30,280 --> 00:46:34,360
of you ran this and like so
the people come and say, man,

678
00:46:34,400 --> 00:46:37,280
it took nine minutes to start this
thing and it normally takes six minutes.

679
00:46:38,159 --> 00:46:42,800
This sucks. Why is this?
So we can go in, look up

680
00:46:42,840 --> 00:46:45,480
their username, find the
nine minute run, click it, look

681
00:46:45,559 --> 00:46:49,000
at the trace, and go,
oh, yeah, there was this bug

682
00:46:49,079 --> 00:46:53,280
pulling from ECR or whatever it is, right, and just doing that helped

683
00:46:53,320 --> 00:46:57,800
us pin down. It turns out
all these problems with the tool were like

684
00:46:57,840 --> 00:47:00,599
systemic to one or two things being
wrong, you know, poor assumptions being

685
00:47:00,599 --> 00:47:04,880
made, and we fix those and
we're kind of off to the races on

686
00:47:04,920 --> 00:47:08,079
it, and now we're building a
prober for it, so we'll have like

687
00:47:08,119 --> 00:47:10,039
a graph. You know, again, everything has a graph, right,

688
00:47:10,079 --> 00:47:14,559
So the same way we have an
SLO for ingesting an observation in a certain

689
00:47:14,599 --> 00:47:17,079
amount of time, we'll have an
SLO for, hey, this thing can create

690
00:47:17,119 --> 00:47:22,000
an environment locally and it happens in
less than this amount of time. And

691
00:47:22,000 --> 00:47:24,960
then we'll have variants of it,
like does it work for people that run

692
00:47:24,960 --> 00:47:28,519
it over and over again on their
box? What about a new engineer that

693
00:47:28,519 --> 00:47:30,320
logs into a fresh box and runs
it, because that's always a different you

694
00:47:30,400 --> 00:47:35,719
know, there's some you know,
that's where always the goblins live in these

695
00:47:35,719 --> 00:47:37,760
things. Well, my Terraform works
fine. Oh man, I destroyed it.

696
00:47:37,800 --> 00:47:40,239
Have to run it from scratch,
you know, does that work?

697
00:47:40,480 --> 00:47:45,760
So kind of trying to measure all
those different things from that using our tool.

698
00:47:46,760 --> 00:47:51,559
Is there like an eBPF integration here
that you plan on utilizing in the

699
00:47:51,599 --> 00:47:55,679
future to understand what the requests are
or how the script is running fundamentally on

700
00:47:55,719 --> 00:47:59,039
the machine. Yeah, I think
I think we're looking at that, and

701
00:47:59,119 --> 00:48:01,039
I haven't dealt with it too much
myself, but I think that that is

702
00:48:01,119 --> 00:48:05,679
kind of the ultimate for this,
I would I think marry the two right

703
00:48:06,119 --> 00:48:08,039
eBPF for the raw, like, just show
me everything that's running and help me, and

704
00:48:08,079 --> 00:48:15,119
then my injected data. So yeah, I think that is in the future

705
00:48:15,639 --> 00:48:20,159
for your internal tools, Like once
you identify them, you're like, Okay,

706
00:48:20,440 --> 00:48:23,639
we need to bring this up to
being like a part of our we

707
00:48:23,639 --> 00:48:28,199
need to treat it like it's part
of our core infrastructure. How do you

708
00:48:29,719 --> 00:48:35,800
socialize that across engineering so that everyone
knows that this is a supported tool.

709
00:48:35,840 --> 00:48:38,519
This is a preferred path if you
were thinking about going and building something on

710
00:48:38,559 --> 00:48:42,559
your own for this one use case, we already have you covered here.

711
00:48:42,760 --> 00:48:49,079
How do you communicate that? Email to
the whole list? Pin a Slack message?

712
00:48:49,079 --> 00:48:52,800
What could go wrong, right? @channel, baby. You can

713
00:48:52,880 --> 00:48:55,920
write documentation all day, but like
I think we all know how much our

714
00:48:55,920 --> 00:49:01,239
documentation gets read, right, I
think I think the way to do it

715
00:49:01,320 --> 00:49:06,119
is to, this, you know, goes
back to the layer 8 problem we were talking about

716
00:49:06,119 --> 00:49:08,400
earlier, but, like, you have to build
a good relationship with all these teams and

717
00:49:08,440 --> 00:49:13,639
have them come to you sooner in
the process, like the sooner infrastructure and

718
00:49:13,800 --> 00:49:17,599
SRE folks can be involved
in anything, like before code is written

719
00:49:17,639 --> 00:49:21,239
would be super ideal, so that
them come to you and say, hey,

720
00:49:21,280 --> 00:49:23,079
I wrote this really cool thing,
but it uses MongoDB. It's

721
00:49:23,079 --> 00:49:27,719
like, we don't do that here, right, Like maybe there's a great

722
00:49:27,760 --> 00:49:30,000
reason that does that, but you
know, but it's much harder once they've

723
00:49:30,000 --> 00:49:32,719
written code, right, Like,
you know, it's the poker pot committed

724
00:49:32,760 --> 00:49:36,320
thing, like, well, I called for the
flush draw on the flop, I'm putting all

725
00:49:36,320 --> 00:49:37,519
my money in on the turn no matter
what. It's like, let's do the

726
00:49:37,559 --> 00:49:42,639
math. Not great, get to
them, you know, pre flop,

727
00:49:42,880 --> 00:49:45,719
right, Like should you even be
in this pot with MongoDB? Right?

728
00:49:45,760 --> 00:49:50,920
Maybe? Yeah, but that's not
easy to do, right, You

729
00:49:51,000 --> 00:49:54,360
have to There's not a technical answer
there. That's just build relationships, make

730
00:49:54,400 --> 00:50:00,159
your team available, make people you
know, had interactions people like you know.

731
00:50:00,599 --> 00:50:02,199
I think it was back to the
sort of product management thing earlier that

732
00:50:02,239 --> 00:50:07,199
we were talking about, where if
you are at the point where you need

733
00:50:07,239 --> 00:50:09,440
to tell people about the thing that
you're working on, like maybe you didn't

734
00:50:09,440 --> 00:50:15,000
approach the situation necessarily in the best
way rather than driving it from how you

735
00:50:15,039 --> 00:50:16,920
would a startup, which is okay, you know, what do our customers

736
00:50:17,000 --> 00:50:20,840
pain points look like? And you
know, as users, what do they

737
00:50:20,840 --> 00:50:22,360
want? And they're coming to us
and saying, hey, when are you

738
00:50:22,400 --> 00:50:27,280
done with this thing that we asked
for? And the implementation details are of

739
00:50:27,320 --> 00:50:30,880
course what you're picking because you know
that best. But fundamentally it's just a

740
00:50:30,920 --> 00:50:35,760
matter of pinging them on whatever RSS
feed that they're looking at. Yeah.

741
00:50:35,880 --> 00:50:40,079
I worked at a large company before
and we had an infrastructure PM and it

742
00:50:40,159 --> 00:50:45,360
was amazing. It was just so
it felt like it was easy mode,

743
00:50:45,400 --> 00:50:47,679
right, Like someone else is going
to go gather requirements I don't have to.

744
00:50:49,320 --> 00:50:52,039
Yeah, I just I'll go do
some work. That's that's fine. And

745
00:50:52,039 --> 00:50:54,960
then they show up, you know, prioritize on a list. It's like,

746
00:50:55,000 --> 00:50:59,480
well this is this is awesome,
this is what it's like on the

747
00:50:59,480 --> 00:51:01,760
other side, right, you just
need to plug that into ChatGPT and

748
00:51:01,880 --> 00:51:04,920
get the answer out as well,
and then you can just you know,

749
00:51:04,960 --> 00:51:10,239
stop doing all the work altogether.
What do my engineers want? That's interesting?

750
00:51:10,960 --> 00:51:15,239
Get a Slack bot that uses
ChatGPT to act as a PM role.

751
00:51:16,920 --> 00:51:20,360
I was thinking the other way around. Have the PMs just answer,

752
00:51:20,480 --> 00:51:22,079
like answer the question of what they
want to have built, and then it

753
00:51:22,119 --> 00:51:27,960
will automatically build it for them.
This is what the hive mind Internet wants

754
00:51:28,000 --> 00:51:31,920
to build. Probably not so bad
if everyone wants to. I don't think

755
00:51:31,920 --> 00:51:37,679
you'll get to five nines. I
think it's more like five two's will be

756
00:51:37,719 --> 00:51:45,079
in the front in there somewhere if
you carry it out to enough decimal places.

757
00:51:45,119 --> 00:51:51,639
There's some nines. You know,
non repeating imager and doesn't really good

758
00:51:54,480 --> 00:51:59,079
cool, So what else should we
be thinking about? For internal tools?

759
00:52:00,320 --> 00:52:05,280
What's your big takeaway piece of advice? Apply all the same rigor. Like, you

760
00:52:05,280 --> 00:52:10,079
know, when PROD breaks, you
go and declare an incident using your incident

761
00:52:10,119 --> 00:52:15,039
management tool. You have a communications
role, you have you have the whole

762
00:52:15,039 --> 00:52:16,519
thing or you got to run,
but hopefully if you don't, you should,

763
00:52:17,360 --> 00:52:21,119
And then when you're done, you
write a post mortem. You might

764
00:52:21,159 --> 00:52:23,119
even publish it to customers. Publish it
internally, right, Like why shouldn't an

765
00:52:23,119 --> 00:52:27,559
engineer be able to read about why
Jenkins broke? If production breaks, you're

766
00:52:27,559 --> 00:52:30,119
going to file a bunch of follow
ups. You're going to prioritize it over

767
00:52:30,199 --> 00:52:32,159
other work because we don't want PROD
to break again. Same thing for internal

768
00:52:32,199 --> 00:52:37,039
tools, Like it's really I think
that's the big takeaway is there's really not

769
00:52:37,159 --> 00:52:40,639
much of a difference. I mean, and people, well, if PROD

770
00:52:40,719 --> 00:52:44,639
is down, our customers that pay
us can't do work. Okay, great,

771
00:52:45,039 --> 00:52:47,760
if CI is down, how are
you going to ship a fix when

772
00:52:47,880 --> 00:52:51,880
PROD is down? Right? Like
you're like, oh, I brought this,

773
00:52:51,960 --> 00:52:54,760
I wrote this really great infrastructure where
changes can only go through CI and

774
00:52:54,800 --> 00:52:59,760
not made by hand. That's a great thing. But if your CI isn't as

775
00:53:00,039 --> 00:53:02,559
available as production, three nines CI,
four nines production, and you can only

776
00:53:02,599 --> 00:53:07,079
make changes through CI. You know, not a math major. But we're

777
00:53:07,079 --> 00:53:10,559
going to have a problem here forty
minutes out of the year. You know

778
00:53:10,559 --> 00:53:15,119
it's going to be an issue.
So I think that treating it, I

779
00:53:15,159 --> 00:53:19,960
think people just underestimate how important that
stuff really is and the impact it can

780
00:53:19,960 --> 00:53:22,239
have. You don't have to wait
till it. You know. The worst

781
00:53:22,239 --> 00:53:28,840
case is production is having a problem. Your monitoring is broken and you can't

782
00:53:28,840 --> 00:53:30,719
see it. Your CI is broken
and you can't ship a fix for it,

783
00:53:31,079 --> 00:53:35,519
whether that fix is config or code, I mean, and then that's

784
00:53:35,519 --> 00:53:38,119
a really terrible postmortem to have to
send out to a customer of well,

785
00:53:38,159 --> 00:53:42,760
we knew what was wrong, but
we had the fix. We couldn't build

786
00:53:42,800 --> 00:53:46,119
the Docker image that had the fix
because you know our two thousand and four

787
00:53:46,159 --> 00:53:51,280
era Jenkins decided to crash and
not start again. Yeah,

788
00:53:51,400 --> 00:53:53,079
that's not It was not a good
look for anybody. I mean, you

789
00:53:53,119 --> 00:53:58,000
identified it. If there's value in
doing this activity for some of your services

790
00:53:58,079 --> 00:54:00,039
because of what the users look like, then there's probably value in doing it

791
00:54:00,079 --> 00:54:04,400
for other services that just happen to
be internal. I think that's a big

792
00:54:04,440 --> 00:54:07,280
part of post mortems, right,
Like if you find a production problem where

793
00:54:07,440 --> 00:54:10,440
oh, hey, we had this
bug in our database connection pool and we

794
00:54:10,480 --> 00:54:14,440
had this reconnection issue and this thing
happened. Where else do you have connection

795
00:54:14,480 --> 00:54:16,519
pools? Right? That should be
the logical question. You ask the same

796
00:54:16,519 --> 00:54:21,440
thing internally, Right, we had
this neglected service. Where else what else

797
00:54:21,599 --> 00:54:24,880
is flying under the radar that's going
to bite us? And those are the

798
00:54:24,880 --> 00:54:28,599
tough postmortem, you know, some of
the postmortem actions are like, fix this bug

799
00:54:28,639 --> 00:54:30,960
with a reproduced case. Okay,
that's like a day of work, right,

800
00:54:30,239 --> 00:54:32,360
you know when you're done because
the test passes, it's easy.

801
00:54:34,000 --> 00:54:38,199
These are much more ominous, like
project the follow ups. But if you

802
00:54:38,199 --> 00:54:43,199
don't do them, you're going to
pay the price. Is there like some

803
00:54:43,360 --> 00:54:47,599
obvious pitfall that a lot of companies
or maybe even everyone seems to get wrong

804
00:54:47,800 --> 00:54:52,920
in this area sort of besides the
stuff we've been talking about, well in

805
00:54:52,960 --> 00:54:59,840
post mortems, people are extraordinarily bad
at distinguishing between root causes and triggers.

806
00:55:00,920 --> 00:55:04,440
Right, Pete typed this command and
took the site down? Right, root

807
00:55:04,480 --> 00:55:09,320
cause is not Pete sucks? Right, that's right? Like the root cause

808
00:55:09,440 --> 00:55:13,599
is why was there a command to
where are the where are the seatbelts?

809
00:55:13,639 --> 00:55:15,440
Where are the gates? Where is
the code review? Where's all that stuff?

810
00:55:15,480 --> 00:55:19,920
Or you have to really really I
think the five whys thing is interesting.

811
00:55:19,920 --> 00:55:22,360
I don't think you actually have to
write down why five times in a

812
00:55:22,440 --> 00:55:27,159
document and fill it out. I think that's a
little bit, you know, demeaning and

813
00:55:27,239 --> 00:55:30,000
not great, but the philosophy of
like, really come to the root cause

814
00:55:30,039 --> 00:55:36,440
of the problem, right, Like
root causes aren't, well, AWS had an

815
00:55:36,480 --> 00:55:39,239
availability zone die? Like, what
can we do? You can run more

816
00:55:39,280 --> 00:55:42,679
than one availability zone? You can
do this, You can do that right,

817
00:55:42,760 --> 00:55:45,599
like, and maybe you choose not
to at this point, but you

818
00:55:45,599 --> 00:55:50,239
should at least identify it and say
Okay, like we know the root cause,

819
00:55:50,320 --> 00:55:53,000
and we've chosen that this is a
risk, and this is why we're

820
00:55:53,000 --> 00:55:57,599
a three nine service or a four
nine service. And maybe someday will make

821
00:55:57,599 --> 00:55:59,920
it better, maybe we won't,
but at least being honest with yourself about

822
00:55:59,960 --> 00:56:04,800
it. Ah, that's huge right
there. I want to highlight that because

823
00:56:06,760 --> 00:56:08,800
like, just because you identify the
root cause doesn't mean you have to do

824
00:56:08,840 --> 00:56:16,119
anything about it. Because I've seen
multiple instances where companies build infrastructure that is

825
00:56:16,280 --> 00:56:22,960
far beyond their budget and their actual
requirements because they're focused on that. And

826
00:56:22,360 --> 00:56:29,719
the analogy I like to use is
like, whenever I go to work every

827
00:56:29,800 --> 00:56:35,480
day, the fastest, most efficient
way for me to get there is buying

828
00:56:35,519 --> 00:56:39,800
my own jet copter, But my
budget really says I should stick with my

829
00:56:39,840 --> 00:56:44,320
eighty seven Toyota Corolla, you know, and so you have to, like, balance those

830
00:56:44,320 --> 00:56:49,159
two things, right, right,
Maybe the fix is leave ten minutes earlier

831
00:56:49,239 --> 00:56:52,920
instead instead of mind Yeah exactly.
Yeah. Well I think also it comes

832
00:56:52,920 --> 00:56:55,920
back to SLOs of like, Okay, we had this big problem, but

833
00:56:57,119 --> 00:57:00,360
hey, was it a big problem?
You know? Like it's hard back to

834
00:57:00,360 --> 00:57:02,440
the emotion thing, right, Like
if a big customer is impacted, it

835
00:57:02,760 --> 00:57:07,239
gets a lot more priority, but
you have to quantify at the end of

836
00:57:07,239 --> 00:57:08,960
the day, like, hey,
this is how many nines we have.

837
00:57:09,079 --> 00:57:14,320
This is our error budget. We
used it during this incident. Maybe that's

838
00:57:14,360 --> 00:57:17,239
not good, but that's this is
par for the course, and we don't

839
00:57:17,320 --> 00:57:22,440
need to suddenly become multi cloud,
multi region, you know, all the

840
00:57:22,480 --> 00:57:25,840
things load balancing, complexity, because
that's the other thing is looking at.

841
00:57:25,960 --> 00:57:30,920
You know, you can add nines
with complexity, but complexity can also reduce

842
00:57:30,000 --> 00:57:35,920
nines. And you have to be
careful about over-engineering in response to things.

843
00:57:35,960 --> 00:57:38,760
And I hate to use the term
like you know this is always the escalator.

844
00:57:38,880 --> 00:57:42,119
Well, if there's an act of
God that you know, this managed

845
00:57:42,119 --> 00:57:44,360
service goes away, it's like,
well, what's an act of God?

846
00:57:44,440 --> 00:57:47,079
Is that an AZ going down? I
don't know, that's just expected, right,

847
00:57:49,239 --> 00:57:52,320
Is it a tornado hit the Virginia
area and all of us-east-1

848
00:57:52,320 --> 00:57:53,639
went away? Okay, there's
an act of God we don't have to

849
00:57:53,639 --> 00:57:58,480
plan for. But if you're trying
to be a five nine service, you're

850
00:57:58,480 --> 00:58:00,480
not running in one region anyway,
So you know, it kind of all

851
00:58:00,559 --> 00:58:04,440
has to that has to make sense
together. You know, the whole story

852
00:58:04,440 --> 00:58:07,360
has to just kind of flow.
And that's where I think some people get

853
00:58:07,400 --> 00:58:10,320
off. They they write an SLO, but they don't have a story to

854
00:58:10,360 --> 00:58:15,920
back it, or they write a
really complex story where they don't need for

855
00:58:15,000 --> 00:58:22,719
an SLO that's simpler. Yeah,
point for sure. But it's an art

856
00:58:22,840 --> 00:58:25,840
so it's you know, there's no
right or wrong answer. I always say

857
00:58:25,960 --> 00:58:30,880
so, everything is so technical and SLOs
are this, like, wafty, what is right?

858
00:58:30,920 --> 00:58:35,360
What is wrong? Who really knows? You just kind of have to

859
00:58:35,360 --> 00:58:37,679
do it. People always we were
rolling out SLOs in a company. How

860
00:58:37,679 --> 00:58:39,960
do I know my SLO's right?
I was like, well, you know,

861
00:58:40,000 --> 00:58:44,400
it's not. It's your first SLO. So you build it and you

862
00:58:44,519 --> 00:58:47,440
have post mortems and you adjust it
as you go. You know, were

863
00:58:47,480 --> 00:58:51,159
you at you know, did you
have an outage? Yes? Did your

864
00:58:51,280 --> 00:58:54,519
SLO show it? No? Cool. Make
it more aggressive? Did you not have

865
00:58:54,559 --> 00:58:58,239
an outage? And your SLO says
you had an outage? Make it less

866
00:58:58,239 --> 00:59:00,800
aggressive? Right? Like it sounds
simple, but that's just the feedback loop.

867
00:59:00,800 --> 00:59:05,079
And if you're doing it twelve months
later, you probably have a pretty

868
00:59:05,079 --> 00:59:09,599
decent setup and have a much better
idea of what your customers consider an outage

869
00:59:09,679 --> 00:59:14,000
or not. You hit on something
really interesting there. Actually, So if

870
00:59:14,159 --> 00:59:17,559
SLA is your, you know, contracted
amount and the SLI is whatever

871
00:59:17,559 --> 00:59:22,239
your indicator is, the SLO is always just an objective. So it does seem like it's

872
00:59:22,280 --> 00:59:25,039
it must be subjective in every way. How do you sort of pick that?

873
00:59:25,119 --> 00:59:28,599
How do you know what? First
off, if you have an A,

874
00:59:29,079 --> 00:59:31,320
the SLO must be more aggressive,
let's hope, right, right,

875
00:59:32,679 --> 00:59:37,960
But let's assume for a moment you
don't have a contractual you know, I

876
00:59:37,239 --> 00:59:40,079
say, you know, SLAs are like
SLOs with lawyers, right, it's really

877
00:59:40,400 --> 00:59:45,800
kind of a yeah, you have
to guess, Like I mean, I

878
00:59:45,800 --> 00:59:49,320
think that's the unfortunate of it.
And that's where you know, hire somebody

879
00:59:49,320 --> 00:59:52,440
with experience. They'll be able to
guess more accurately maybe, but they'll at

880
00:59:52,519 --> 00:59:55,559
least also know when they're wrong.
So I don't know. It's a process

881
00:59:55,559 --> 01:00:00,119
where you just have to iterate and
sometimes you have a major miss You're like,

882
01:00:00,159 --> 01:00:02,360
oh man, we really thought we
were measuring this service and we had

883
01:00:02,360 --> 01:00:07,239
this massive outage and our dashboard
was like, everything's green, nothing's wrong,

884
01:00:07,280 --> 01:00:10,159
and users are entre It's like,
okay, well we missed this super critical

885
01:00:10,199 --> 01:00:15,599
part of the picture, and then
you do the postmortem thing and you figure

886
01:00:15,599 --> 01:00:17,440
out, Okay, well I made
this common mistake. What other of my

887
01:00:17,599 --> 01:00:22,199
SLOs have this common, you know, that
I make this mistake more than once.

888
01:00:22,360 --> 01:00:27,679
Would it be fair to say that
it should be meaningful so that if you're

889
01:00:27,760 --> 01:00:30,639
violating it, then you're taking some
action as a result, and if you're

890
01:00:30,679 --> 01:00:34,960
not, then you don't do anything. So maybe it's about finding that sweet

891
01:00:34,960 --> 01:00:37,519
spot where it causes the right thing
to happen in your organization. Absolutely,

892
01:00:37,599 --> 01:00:42,000
I mean I think that. I
think to get there right, you need

893
01:00:42,039 --> 01:00:45,639
to have some kind of alerting and
reporting around it. I think alerting on

894
01:00:45,800 --> 01:00:50,119
SLOs is like a very hard problem. Like the Google book will talk about

895
01:00:50,119 --> 01:00:52,599
burn rate alerting. Have fun implementing
that, right, that's very hard.

896
01:00:52,840 --> 01:00:55,920
But if you can get there or
have some approximation of it and report on

897
01:00:55,960 --> 01:01:00,639
it, like I'm a big fan
of getting you know, it doesn't work

898
01:01:00,719 --> 01:01:02,840
right away, but eventually maturing to
a point where every alert has a ticket,

899
01:01:04,239 --> 01:01:08,440
right, and the ticket is either fix
the thing that caused the alert.

900
01:01:08,719 --> 01:01:13,199
You know, this needs to be
more resilient, This needs more replicas or

901
01:01:13,360 --> 01:01:15,599
this alert paged me and nothing was actually
wrong. We should fix the alert and

902
01:01:15,639 --> 01:01:19,199
if you do that over time,
like that's how you can get SLOs to

903
01:01:19,280 --> 01:01:22,000
cause change, and when they really
are broken, the post mortem loop is

904
01:01:22,000 --> 01:01:27,480
the real fix. Like the trick
with SLOs is you kind of I don't

905
01:01:27,480 --> 01:01:30,239
know. There's that perpetual battle with
infrastructure and product right You're like, hey,

906
01:01:30,280 --> 01:01:32,719
we need you guys to write more
stable code, and product's like, we

907
01:01:32,760 --> 01:01:36,079
need features, we need this needs
to be a different color, right,

908
01:01:36,239 --> 01:01:38,480
which is fine, like that needs
there's a balance there. But if you

909
01:01:38,519 --> 01:01:42,800
get everyone to agree on SLOs and
you're really user focused, like, hey,

910
01:01:43,840 --> 01:01:46,719
users are happy when the latency you
know p ninety nine latency is this,

911
01:01:46,800 --> 01:01:50,519
and users are sad when it's
over that, and everyone everyone,

912
01:01:50,599 --> 01:01:55,159
like business engineering management, agrees on
that. Your argument is a lot easier

913
01:01:55,199 --> 01:01:59,960
when you have an outage to say, our users weren't happy, Like objectively,

914
01:02:00,599 --> 01:02:05,079
our users weren't happy, So we
either fix the thing or we decide

915
01:02:05,119 --> 01:02:07,639
that our threshold was incorrect on what
is a happy user. But one of

916
01:02:07,639 --> 01:02:12,480
the two has to We can't do
nothing. We can't just write features and

917
01:02:12,559 --> 01:02:15,159
say, well, users might be
unhappy again, because we know this thing

918
01:02:15,199 --> 01:02:19,039
will blow up, and I find
that that is a good you know,

919
01:02:19,039 --> 01:02:21,599
you almost trick product into
it, and they're like, oh yeah,

920
01:02:21,599 --> 01:02:22,280
SLOs, these are great, and then
later it's like, oh, yeah,

921
01:02:22,280 --> 01:02:25,119
I guess we have to fix this
now, so you know, we'll give

922
01:02:25,239 --> 01:02:30,880
we'll give you a sprint or you
know, whatever it is. You know,

923
01:02:30,039 --> 01:02:35,000
I did the same trick a long
time ago with OKRs. Very similar

924
01:02:35,079 --> 01:02:37,840
thing, where you know, once
they agree to the OKRs, you

925
01:02:37,880 --> 01:02:39,599
try to set the mindset up what
does this actually mean? You know,

926
01:02:39,599 --> 01:02:43,760
why are we setting it? And
you set the so we can know what

927
01:02:43,800 --> 01:02:45,599
to do when we know what to
do the right thing, and then later

928
01:02:45,840 --> 01:02:47,559
you can just point back to it
and be like, hey, you know,

929
01:02:47,599 --> 01:02:50,960
we decided what the right thing was
going to be in this situation.

930
01:02:51,320 --> 01:02:52,960
Now it's time to execute on it. Or you know, you have to

931
01:02:53,000 --> 01:02:58,639
make a trade off. As you
said, Yeah, it's a tough.

932
01:02:58,840 --> 01:03:04,400
You know, you can't solve people
problems with tech, but you try.

933
01:03:07,320 --> 01:03:10,519
That'll be the million dollars million dollars
you know thing. But I don't know,

934
01:03:10,599 --> 01:03:14,239
I feel like data is always helpful. You know, emotions are bad,

935
01:03:14,239 --> 01:03:17,840
you know, everyone has emotional arguments, but data is hard to People

936
01:03:17,880 --> 01:03:22,039
still argue with data, but if
I feel like if you're on the side

937
01:03:22,039 --> 01:03:27,400
of data, you at least have
a fighting chance of pushing for change,

938
01:03:27,519 --> 01:03:31,199
or how do you know, like
you ever feel like you're in a situation

939
01:03:31,280 --> 01:03:37,400
where you get scared of survivorship bias, where you even if you're doing root

940
01:03:37,440 --> 01:03:40,599
cause analysis and you're finding really what
the underlying problem is that even though you're

941
01:03:40,639 --> 01:03:45,280
going on and fixing it, that
you're missing some other side of the iceberg

942
01:03:45,440 --> 01:03:51,039
that is waiting out there to come
and crush you totally. But I mean

943
01:03:51,079 --> 01:03:54,519
I think that I don't know.
I just accepted it, right. It's

944
01:03:54,519 --> 01:04:00,800
a pessimistic SRE brain. Like
everything will break. Everything I fix will

945
01:04:00,800 --> 01:04:03,800
also break when I leave a company, everybody will blame me when it breaks.

946
01:04:03,840 --> 01:04:08,199
And that's whatever, you know,
Like I've just I've internalized it and

947
01:04:08,239 --> 01:04:11,000
accepted it. And it used to
stress me out a lot more and now

948
01:04:11,039 --> 01:04:15,719
it's just like, Yep, this
thing I'm rolling out might break, and

949
01:04:15,840 --> 01:04:17,760
you know, at least you know
that if it breaks, you're going to

950
01:04:17,800 --> 01:04:20,960
write a good postmortem and learn from
it. And that's kind of the consolation

951
01:04:21,079 --> 01:04:25,039
prize of like, yeah, it
sucks to write a big postmortem after a

952
01:04:25,039 --> 01:04:29,239
big outage. But also you know, that's that's the gig, and like

953
01:04:29,440 --> 01:04:32,320
we shouldn't have hired Pete, that's
that's the RCA. Well, you know,

954
01:04:32,400 --> 01:04:34,679
like the first place I ever worked was
FedEx, right, there was this

955
01:04:34,840 --> 01:04:40,280
meeting every week, the R squared
Meeting, the Redundancy and Reliability Meeting interesting

956
01:04:40,519 --> 01:04:44,440
and it was like how good of
a week is the VP having? Right?

957
01:04:45,760 --> 01:04:47,599
And there was always this rumor that
like, oh, somebody was fired

958
01:04:47,639 --> 01:04:50,239
once at this meeting because they had
an outage, and like no one can

959
01:04:50,320 --> 01:04:56,719
actually tell you that person's name,
what year it was. I'm pretty sure

960
01:04:56,800 --> 01:04:59,679
it was, you know, it
was trying to hype you up to prepare

961
01:04:59,760 --> 01:05:01,519
for it. I think it was. I don't know if they intentionally did

962
01:05:01,519 --> 01:05:05,320
this or it was just a grow
but like, that's such a terrible you

963
01:05:05,360 --> 01:05:10,320
know. I think the blameless culture
is big. People can be a problem,

964
01:05:10,440 --> 01:05:15,000
right, But I think often it's
not the person that made the change that

965
01:05:15,119 --> 01:05:17,679
is actually the problem. It is
someone's reviewing code, right, Like code

966
01:05:17,679 --> 01:05:21,679
review should be first class, not
checking a box. I feel I

967
01:05:21,719 --> 01:05:26,159
feel really bad when a change I
reviewed causes an outage. That happened recently.

968
01:05:26,280 --> 01:05:30,679
I was like, who approved this
change? Oh, never mind, I

969
01:05:30,719 --> 01:05:33,760
should have done better due diligence,
And yeah, I mean it goes back

970
01:05:33,800 --> 01:05:36,280
even further than that though, because
if you do think that it is one

971
01:05:36,320 --> 01:05:40,800
person's responsibility that caused the problem,
you can look at, well, you

972
01:05:40,840 --> 01:05:43,559
know, what was the culture we
had that allowed them to make the mistake?

973
01:05:43,679 --> 01:05:45,519
Or you know why were they even
hired? Right? Were they a

974
01:05:45,559 --> 01:05:47,119
good fit for the role in the
first place? And you can definitely go

975
01:05:47,199 --> 01:05:51,320
back up somewhere else or a different
chain to really dive in there some of

976
01:05:51,320 --> 01:05:55,199
those things you might not write.
It's like a meta postmortem of the Yeah,

977
01:05:55,199 --> 01:05:57,599
for sure, how do we get
here? Right? How do we

978
01:05:57,679 --> 01:06:01,760
end up hiring people that understand networking
for a networking product or whatever? You

979
01:06:01,840 --> 01:06:05,159
know it? Right? Yeah,
I think that stuff's important. And then

980
01:06:05,199 --> 01:06:09,039
then change the process, right,
Okay, these are a new criteria for

981
01:06:09,119 --> 01:06:13,960
this for sure. Yeah. Yeah. When I was in the Navy,

982
01:06:14,039 --> 01:06:18,519
we had this process for getting your
certification for whatever job you were doing.

983
01:06:18,599 --> 01:06:25,960
That I've tried to bring into pull
request reviews that it takes a cultural shift

984
01:06:26,000 --> 01:06:28,960
to get fully implemented. But in
the Navy. It was set up so

985
01:06:29,000 --> 01:06:30,800
that, like I was a nuclear
engineer, so you had to learn all

986
01:06:30,800 --> 01:06:35,599
these different skills to operate the power
plant. And so you would go around

987
01:06:35,599 --> 01:06:43,559
the power plant and work with the
existing engineers and show them that you knew

988
01:06:43,679 --> 01:06:47,960
how to do something, and if
they felt like you understood it, they

989
01:06:47,960 --> 01:06:50,960
would sign it off in your book. And this was way back pre computer

990
01:06:51,079 --> 01:06:55,639
stuff, so you actually had a
physical book that you carried around, but

991
01:06:55,760 --> 01:06:58,400
you signed it off in that book. And then if at any point in

992
01:06:58,440 --> 01:07:02,079
your career you ever screwed that task
up to a point where your skills were

993
01:07:02,079 --> 01:07:06,000
called into question, they would open
up the book to see who signed it

994
01:07:06,199 --> 01:07:10,519
and go back to that person and
say, hey, why, why did Will

995
01:07:10,559 --> 01:07:15,320
screw this up? And and so
it was that like that. It did

996
01:07:15,320 --> 01:07:18,800
put that sense of pressure on you
so that before you would sign off on

997
01:07:18,840 --> 01:07:25,960
anyone's book on anything, you wanted
to make sure that you were reasonably confident

998
01:07:26,000 --> 01:07:29,840
that they actually knew what they were
doing. And so I treat pull requests

999
01:07:29,840 --> 01:07:33,079
the same way like if I approve
a pull request and it breaks something,

1000
01:07:33,679 --> 01:07:38,360
I don't consider that to be a
fault with the person who submitted the pull

1001
01:07:38,400 --> 01:07:41,800
request. I consider it to be
my fault for not catching it in the

1002
01:07:41,840 --> 01:07:47,840
review. This is like the Erdős
number corollary, the uh, there's the

1003
01:07:47,960 --> 01:07:50,239
you know, you chase it back
all the way up the chain, like,

1004
01:07:50,280 --> 01:07:51,920
well, you know, who did
that person? You know, what

1005
01:07:51,920 --> 01:08:01,920
does that person's book look like?
The reviewer? Right? No, I

1006
01:08:01,920 --> 01:08:04,440
think that's a good, good way
to look at it, and it's just

1007
01:08:04,480 --> 01:08:12,239
good for accountability and you know,
yeah the meta issues. Yeah, and

1008
01:08:12,599 --> 01:08:15,279
like a one off instance, you
know, is not that big a deal.

1009
01:08:15,319 --> 01:08:23,279
But over time, you know,
if like everyone that I signed off

1010
01:08:23,319 --> 01:08:27,880
on this particular skill is having problems, that's going to point back to the

1011
01:08:27,960 --> 01:08:33,800
root cause being me not actually understanding
either what the skill is or how to

1012
01:08:33,840 --> 01:08:41,119
evaluate that skill. Yeah. I've
been places where the root cause has just

1013
01:08:41,199 --> 01:08:45,640
been fatigue. It's like the on-call
rotations are too insane. This was at

1014
01:08:45,640 --> 01:08:48,199
the end of this person covered someone
else's on call. They were on two

1015
01:08:48,199 --> 01:08:51,000
weeks of twenty four by seven on
call. They had had a bunch of

1016
01:08:51,039 --> 01:08:55,079
major incidents overnight, they were running
on no sleep, and they just simply

1017
01:08:55,079 --> 01:08:58,600
did the wrong thing. You can't
fault the human for that. Why do

1018
01:08:58,720 --> 01:09:01,520
we put them in this grinder like
every other industry. Airline industry is a

1019
01:09:01,560 --> 01:09:04,199
great one, right, you can
only have limit limits on how much you

1020
01:09:04,239 --> 01:09:08,039
can fly and be responsible over people's
lives. I mean, obviously when lives

1021
01:09:08,039 --> 01:09:10,680
are at stake, like you said,
you have to be more rigorous. But

1022
01:09:10,880 --> 01:09:13,439
it doesn't have to be that rigorous
for you know, being on call.

1023
01:09:13,479 --> 01:09:15,840
But you know, we have an
informal policy on our team is if we

1024
01:09:15,880 --> 01:09:20,319
take overnight pages, someone will offer
to cover the next day, the next

1025
01:09:20,439 --> 01:09:26,319
night so that person can catch up. Yeah, there's nothing worse than having

1026
01:09:26,359 --> 01:09:29,960
a week of terror where you're losing
sleep every night like you're just you're dead

1027
01:09:30,119 --> 01:09:32,279
by then, and then a major
incident. You know, the worst timing

1028
01:09:32,319 --> 01:09:35,079
always happens in these things, right, Like the worst outages are never one

1029
01:09:35,119 --> 01:09:39,279
thing. It's a confluence of events. Right, So that terrible outage is

1030
01:09:39,279 --> 01:09:43,840
going to come Friday when you're running
on fumes and your brain isn't isn't there,

1031
01:09:44,399 --> 01:09:46,359
and you're going to make mistakes,
you're not going to see problems and

1032
01:09:46,399 --> 01:09:51,920
that's not your fault. But that's
tough to in a startup world like oh

1033
01:09:53,000 --> 01:09:56,760
yes, you're meant to grind,
but there has to be some reasonableness of

1034
01:09:57,600 --> 01:10:00,880
Okay, the people with responsibility are
keeping the site up need to be aware

1035
01:10:00,960 --> 01:10:04,199
and awake because we're not just running
run books where we copy and paste off.

1036
01:10:04,319 --> 01:10:10,000
It's the run book is use your
brain. Yeah, I think we

1037
01:10:10,079 --> 01:10:14,439
really realized over the at least the
most recent decade that the grind is not

1038
01:10:14,479 --> 01:10:18,159
helpful. Even like doing more hours
of work, especially in knowledge work industry,

1039
01:10:18,760 --> 01:10:23,479
does not translate to an additional value. And so if you do have

1040
01:10:23,880 --> 01:10:27,680
outages, incidents, every day that
people are on, I like, it

1041
01:10:27,720 --> 01:10:31,520
seems like fundamental that you would intentionally
rotate them off so that someone else is

1042
01:10:31,520 --> 01:10:35,760
there because and you know, I
think there's like a pride issue here where

1043
01:10:35,800 --> 01:10:40,560
the engineer just wants to stay on
because you know, it's their rotation and

1044
01:10:40,920 --> 01:10:44,680
they don't realize that it's actually harming
the company. Like they should speak up

1045
01:10:44,680 --> 01:10:47,239
for the benefit of the company.
It's not about them necessarily. Right.

1046
01:10:47,439 --> 01:10:51,479
Heroics shouldn't be I mean, it's
great sometimes heroics just have to happen,

1047
01:10:51,520 --> 01:10:55,079
and they happen and it's good,
but that shouldn't be the goal of like,

1048
01:10:55,119 --> 01:10:57,399
oh man, I want to be
a hero of this outage. Please,

1049
01:10:57,479 --> 01:11:00,439
no. Like, it was really fun
early in my career, and now it's

1050
01:11:00,479 --> 01:11:03,720
like, oh, there's heroics happening. This is awful. Yeah, it's

1051
01:11:03,760 --> 01:11:09,880
it's the cult of the hero because
then you you fulfill that hero role and

1052
01:11:09,920 --> 01:11:14,439
then you know, you get @mentioned
in Slack and you know, everyone's

1053
01:11:14,479 --> 01:11:15,520
like, oh wow, that was
such a tough effort, you know,

1054
01:11:15,720 --> 01:11:20,840
and the worst yeah, the worst
word to use in HR: rockstar. Yeah.

1055
01:11:20,880 --> 01:11:24,319
Absolutely. I don't want to be
a rock star at work, man,

1056
01:11:24,479 --> 01:11:27,159
Like, please, no. I want
to do I'm going to be a

1057
01:11:27,239 --> 01:11:30,439
rock star. I want it to
be on Mötley Crüe, fueled by cocaine

1058
01:11:30,479 --> 01:11:32,680
and hookers. Let me be a
rock star, like being a rock star

1059
01:11:32,760 --> 01:11:38,279
in infrastructure is like, yeah,
where's my hotel room to tear up?

1060
01:11:38,439 --> 01:11:41,399
Yeah, I mean at the beginning
it was oh they described me as a

1061
01:11:41,479 --> 01:11:43,800
rockstar, that must mean I'm doing
great. And now it's like, oh,

1062
01:11:45,239 --> 01:11:47,640
now it's like, watch your mouth.
Yeah, don't call it that.

1063
01:11:50,079 --> 01:11:54,159
But I think I think the culture
is Yeah, I don't know. It's

1064
01:11:54,920 --> 01:11:58,039
it's a tough one to just you
don't want to actively discourage you. You

1065
01:11:58,039 --> 01:12:00,039
don't want to say log off and
go home, like if you want to

1066
01:12:00,079 --> 01:12:02,079
work. I don't know. My
thing is like I'm a workaholic, and

1067
01:12:02,119 --> 01:12:04,840
I always say I'm happy to work
long hours if it's what I want to

1068
01:12:04,880 --> 01:12:09,680
do. If I want to work
on Saturday, awesome. If other people

1069
01:12:09,680 --> 01:12:13,039
want me to work on Saturday,
that kind of falls apart for me.

1070
01:12:14,840 --> 01:12:16,680
And the social engineering trick is that, well, just give Pete work that

1071
01:12:16,720 --> 01:12:19,720
he really likes and is passionate about, and he'll work on Saturday. That's

1072
01:12:19,720 --> 01:12:23,680
a fine. Okay, that's a
good trick. It works on me,

1073
01:12:24,079 --> 01:12:29,279
but I don't think people have used
that. And some weekends I game all

1074
01:12:29,319 --> 01:12:31,960
weekend. Some weekends I work all
weekend. But usually it's it's my choice.

1075
01:12:33,079 --> 01:12:38,760
The point about that, I think
part of that is that we are

1076
01:12:39,319 --> 01:12:43,479
creative in the work that we do, you know, using our creativity to

1077
01:12:43,560 --> 01:12:49,600
solve problems. And creativity doesn't doesn't
show up at nine am when you hit

1078
01:12:49,680 --> 01:12:54,359
the time clock, you know,
and if it's something that you're excited about,

1079
01:12:54,479 --> 01:12:58,039
you get this, you know, that creative push like, oh,

1080
01:12:58,079 --> 01:13:00,119
I got to go do this,
which is what leads you to you

1081
01:13:00,960 --> 01:13:04,399
go in and work on Saturday.
Most of my best work is done yeah,

1082
01:13:04,439 --> 01:13:08,560
after midnight or on the weekends correct, Like it's exactly that is.

1083
01:13:08,600 --> 01:13:11,399
You get the idea and you're like, man, you're laying in bed and

1084
01:13:11,439 --> 01:13:13,039
you're like, I can build it
this way, that way. It's like

1085
01:13:13,079 --> 01:13:16,119
past draft to build it right,
Like I know. It always fueled me

1086
01:13:16,359 --> 01:13:20,520
was there was some really annoying problem
that I did, like some other problem

1087
01:13:20,560 --> 01:13:25,159
I didn't want to have to solve, Like someone was asking ridiculous things on

1088
01:13:25,199 --> 01:13:29,680
how to fix Jenkins and the solution
was something I didn't want them to do,

1089
01:13:30,199 --> 01:13:34,399
and so I felt the need motivated
to just have this problem completely go

1090
01:13:34,479 --> 01:13:39,479
away. And that's when I would
really work on things like non stop wast

1091
01:13:39,479 --> 01:13:45,680
nerd sniping. Right, someone can't
be done see you Monday. Right here

1092
01:13:45,720 --> 01:13:50,319
it is, but I don't think. I think the other problem with that

1093
01:13:50,520 --> 01:13:54,359
is it has to be clear on
the team that that's happening. Like some

1094
01:13:54,439 --> 01:13:59,199
people just don't ever work weekends.
I think that's great, Like you know,

1095
01:13:59,720 --> 01:14:02,199
that's that should shouldn't be a problem. So it shouldn't be a peer

1096
01:14:02,199 --> 01:14:04,560
pressure of like, oh, Pete worked
over the weekend, everyone else should.

1097
01:14:04,560 --> 01:14:08,760
Like I think that is a terrible
message to send, so you have to

1098
01:14:08,760 --> 01:14:13,119
be careful with it, like I've
learned not to send too many emails.

1099
01:14:13,199 --> 01:14:15,600
Well, I don't. Luckily the
company I work at it's not an email

1100
01:14:15,640 --> 01:14:19,039
company, but I like big companies
where email is life. I'm always very

1101
01:14:19,039 --> 01:14:24,680
careful about sending emails on the weekend
because you're implicitly setting expectations for other people

1102
01:14:24,760 --> 01:14:29,399
that are watching, and I feel
like that is trouble. Yeah, I

1103
01:14:29,439 --> 01:14:33,680
mean that's another avenue realistically, Like, I think there is a thing about

1104
01:14:34,000 --> 01:14:36,640
just like you want the load on
your systems to be constant. You don't

1105
01:14:36,640 --> 01:14:41,159
want to see spikes because they're incredibly
hard to deal with. The same goes for

1106
01:14:41,680 --> 01:14:46,159
teams that are putting out work.
Right, If some engineers are incredibly spiky

1107
01:14:46,720 --> 01:14:50,399
on load, then you're unpredictable in
what you can deliver and how much,

1108
01:14:50,439 --> 01:14:55,239
and the reliability or quality of that. So it's you're not necessarily doing anyone

1109
01:14:55,279 --> 01:15:00,840
a favor by one day going out
and solving a problem. If that's your

1110
01:15:00,880 --> 01:15:02,560
pattern. Yeah, I think it's
a difficult lesson for a lot of people

1111
01:15:02,600 --> 01:15:08,359
to learn. I struggle with it
a lot. I often find myself like

1112
01:15:08,439 --> 01:15:11,199
I'll work a weekend because I want
to, and then Monday, I'm like,

1113
01:15:11,239 --> 01:15:15,079
oh, man, I don't want
to work today, Like, yeah,

1114
01:15:15,159 --> 01:15:16,520
well that's fine, right, because
then you're then you're still having the

1115
01:15:16,520 --> 01:15:20,159
same amount of work that you're sort
of putting out and productivity for the team,

1116
01:15:20,159 --> 01:15:24,159
but you're not overburdening them because someone
has to review that, right,

1117
01:15:24,199 --> 01:15:27,640
someone that's still creating followup work down
the road, and depending on the

1118
01:15:27,640 --> 01:15:30,920
statement, maybe that creates incidents as
well. Yeah, now, you're right,

1119
01:15:31,000 --> 01:15:35,359
it's a very tough balance to figure
out that. Yeah, I'm still

1120
01:15:35,399 --> 01:15:39,760
trying to figure out what that is. I would say my whole career is

1121
01:15:39,760 --> 01:15:44,600
pretty spiky over There are some places
I work at where I I'm always you

1122
01:15:44,600 --> 01:15:45,439
know, I can't help the workaholic
in me. But there are some

1123
01:15:45,439 --> 01:15:48,199
places where I chill a little more
in some places where I'm like eighty hour

1124
01:15:48,279 --> 01:15:51,479
weeks insanity, And I don't know. For me, I've learned it's a

1125
01:15:51,479 --> 01:15:56,479
healthy cycle. Like every couple jobs, I do the insane grind because that's

1126
01:15:56,520 --> 01:15:59,359
just where I'm at, and then
I have a couple of years of like

1127
01:15:59,680 --> 01:16:02,640
less insane grind, kind of relax
and find other things to do, and

1128
01:16:02,680 --> 01:16:05,000
then right back to it because I
miss it, you know. I mean,

1129
01:16:05,279 --> 01:16:08,560
it doesn't matter what your pattern is. I mean, if you are

1130
01:16:08,600 --> 01:16:11,279
spiky, it's fine as long as
it's somehow consistent. Right. You know,

1131
01:16:11,399 --> 01:16:14,399
if every couple of weekend, you
know, every other weekend, you

1132
01:16:14,439 --> 01:16:17,000
do extra work, then you sort
of expect that into the realm of things

1133
01:16:17,000 --> 01:16:20,600
and how that's going to play out. But if it happens and it's unpredictable,

1134
01:16:20,680 --> 01:16:24,600
then you don't know what the impact
is on the team. You may

1135
01:16:24,680 --> 01:16:28,720
think that the team can have more
work done than is reasonable, and so

1136
01:16:28,760 --> 01:16:32,560
a new big project comes out and
now it's taking even longer or unexpected because

1137
01:16:32,640 --> 01:16:40,279
he's not pulling those weekends anymore and
doing the a real job that has a

1138
01:16:40,279 --> 01:16:45,279
lot of value, but messed up
some sort of prediction or timelines or deadlines.

1139
01:16:45,880 --> 01:16:48,399
I think engineering time prediction is like
the hard I just throw that out

1140
01:16:48,439 --> 01:16:55,640
the window, especially with infra.
Right, well, it's two weeks of

1141
01:16:55,640 --> 01:16:59,479
infra work. Oh so it'll be
done in two weeks, maybe like two

1142
01:16:59,520 --> 01:17:02,720
weeks, And for work sometimes takes
two months with outages and interrupts, and

1143
01:17:03,279 --> 01:17:08,279
it's hard to explain sometimes. Yeah, it's like Einstein's theory of relativity.

1144
01:17:08,399 --> 01:17:14,479
This is real the fast where we
go this is different where we go no.

1145
01:17:14,600 --> 01:17:16,479
But I think going back to your
you're talking about working in spikes.

1146
01:17:16,520 --> 01:17:24,640
I think that's like, that's how
we've worked as humans for you know,

1147
01:17:24,960 --> 01:17:28,640
forty thousand years. Like you go
out and you do the big grind to

1148
01:17:30,199 --> 01:17:31,880
you know, to to hunt the
animals, and then you go back and

1149
01:17:31,920 --> 01:17:35,399
you just your rest and you relax
for a while. Or you go out

1150
01:17:35,439 --> 01:17:40,399
and you work all summer to plant
the crops and harvest the crops and then

1151
01:17:40,439 --> 01:17:42,800
store them for the winter, and
then you ride the winter out. So

1152
01:17:42,840 --> 01:17:49,720
I think that behavior is actually something
that has been native to us for a

1153
01:17:49,800 --> 01:17:54,279
long long time, and to try
to break that in the course of a

1154
01:17:54,399 --> 01:17:59,920
three decade career is going to be
difficult. I mean you're on as a

1155
01:18:00,119 --> 01:18:02,840
there definitely what are you called it? Art? Right? So you look

1156
01:18:02,880 --> 01:18:08,039
at the Renaissance artists, famous ones
like you know, see what they did?

1157
01:18:08,079 --> 01:18:11,560
What are other artists doing before?
And even today? You know,

1158
01:18:11,560 --> 01:18:14,319
how are they working? Because that's
the same of expectation you can have for

1159
01:18:14,359 --> 01:18:18,079
any knowledge work, which is very
similar to a creative process. You have

1160
01:18:18,159 --> 01:18:21,640
to have the right motivation and sometimes
that's really hard to figure out. I

1161
01:18:21,640 --> 01:18:26,000
don't know what that is sometimes,
right, some weekends, I'm just I

1162
01:18:26,000 --> 01:18:29,640
want to be at and stare at
a screen. And some weeks I wake

1163
01:18:29,720 --> 01:18:30,960
up early on a Saturday and I'm
like, let's write some code, let's

1164
01:18:30,960 --> 01:18:35,159
do stuff, And yeah, I
have no idea what what drives it?

1165
01:18:36,000 --> 01:18:39,359
Sometimes I know, but often it's
just I don't know. It's hard to

1166
01:18:39,359 --> 01:18:43,920
say how I'm feeling or will feel
until I actually wake up. And that's

1167
01:18:44,079 --> 01:18:48,880
what to do for sure. I
think if we figure that out and can

1168
01:18:49,159 --> 01:18:54,279
define it and reproduce it, we've
got our next multi billion dollar startup.

1169
01:18:55,039 --> 01:19:01,640
Seriously, awesome. Is there anything
else we should talk about for internal platforms

1170
01:19:01,960 --> 01:19:06,159
of infrastructure? I think we covered
a lot. I think that Yeah,

1171
01:19:06,199 --> 01:19:11,199
all the things I want to talk
about awesome. Cool. Let's do some

1172
01:19:11,279 --> 01:19:15,640
picks. Warren, I've been picking on
you for picks the last couple of episodes,

1173
01:19:15,760 --> 01:19:20,039
but I gave Pete the heads up
before we started recording, so I

1174
01:19:20,119 --> 01:19:23,560
know he's prepped. So I'm gonna
put Pete on the spot. What'd you

1175
01:19:23,560 --> 01:19:29,399
bring for us? Pete? I'm
a gamer, and I like, uh,

1176
01:19:29,880 --> 01:19:32,279
I don't know. I like games
that are really hard. I like

1177
01:19:32,319 --> 01:19:35,439
games that are kind of like a
second job, and grindy and I don't

1178
01:19:35,439 --> 01:19:39,560
know, I'm a glutton for punishment. I guess this is the summary.

1179
01:19:40,279 --> 01:19:43,840
Some people play games just to have
casual. I have some casual you know,

1180
01:19:44,000 --> 01:19:45,840
hang out with the boys and play
some games kind of thing. But

1181
01:19:45,880 --> 01:19:48,880
I like spreadsheet on the second monitor
and gaming on the first monitor kind of

1182
01:19:48,880 --> 01:19:53,479
thing. So I really like aarpg's
you know, hack and slash anything where

1183
01:19:53,479 --> 01:19:58,119
you can mind max. It's just
interesting. Up my alley. A new

1184
01:19:58,159 --> 01:20:00,560
game came out. I don't know. I think it went full release ten

1185
01:20:00,640 --> 01:20:04,479
days ago. It's been around for
years, and like betas and Alpha,

1186
01:20:04,479 --> 01:20:10,359
It's called Last Epoch. It's like
a you know, Diablo ish ARPG thing.

1187
01:20:11,399 --> 01:20:14,760
And there's Diablo. I played a
lot of Diablo. There's PoE,

1188
01:20:14,840 --> 01:20:17,720
which is like the ultimate. Like
I don't have enough hours to I'm also

1189
01:20:17,840 --> 01:20:20,520
like a very addicted personality, so
I have to choose my games carefully. I don't

1190
01:20:20,520 --> 01:20:25,399
play a factorio because I know that
I would just stop working. Like Last

1191
01:20:25,399 --> 01:20:30,039
Epoch is this nice middle of It's
a lot more complex than Diablo. There's

1192
01:20:30,039 --> 01:20:32,680
a lot more things that can contribute
to your final build that you can nerd

1193
01:20:32,720 --> 01:20:36,600
out on trying this and trying that
there's a lot of RNG, so there's

1194
01:20:36,640 --> 01:20:43,039
the grind aspect of it. It
just checks all the boxes. And for

1195
01:20:43,159 --> 01:20:45,399
me, I know a game is
good when I lose track of time constantly.

1196
01:20:45,640 --> 01:20:49,079
Oh it's a very am I should
go to bed now. That's been

1197
01:20:49,199 --> 01:20:53,680
like the last week and a half
as the game came out, And so

1198
01:20:53,800 --> 01:20:57,960
that is my pick. If you
like ARPGs or min-maxing games or things

1199
01:20:58,000 --> 01:21:01,399
like that, this game just checks
all the boxes for me. Written by

1200
01:21:01,520 --> 01:21:04,800
gamers. Sometimes you play a game
when it's pretty clear that the people that

1201
01:21:04,840 --> 01:21:09,279
wrote it don't actually play the game
they've written. It's actually way more common

1202
01:21:09,520 --> 01:21:13,239
than you think. And you realize
because there's no quality of life features and

1203
01:21:13,239 --> 01:21:15,640
it's like, well this is awkward
and I have to do it every two

1204
01:21:15,720 --> 01:21:18,079
raids and this sucks this game just
like oh I have to do this thing.

1205
01:21:18,119 --> 01:21:20,800
Oh wait, this is really easy
to do, Like okay, just

1206
01:21:20,960 --> 01:21:26,399
like someone plays the game wrote the
game. So yeah, I enjoy a

1207
01:21:26,439 --> 01:21:30,560
game like that. It's a smaller
the company. So is there an online

1208
01:21:30,600 --> 01:21:34,680
multiplayer mode for it? Yeah?
So it's like a whole way do you

1209
01:21:34,680 --> 01:21:38,319
go to the common areas there's a
bunch of other people there, and there's

1210
01:21:38,399 --> 01:21:42,159
trading and stuff, and there's you
know, public chat where half trolling,

1211
01:21:42,239 --> 01:21:46,399
half people asking questions. But you
can party up with your buddies and tackle

1212
01:21:46,479 --> 01:21:49,319
content together. And so I have
a small group I play with and we've

1213
01:21:49,359 --> 01:21:54,079
always played ARPGs, so this is
our current one that we're all just kind

1214
01:21:54,119 --> 01:21:57,640
of grinding up characters on and figuring
out what the best builds are and what

1215
01:21:57,720 --> 01:22:00,520
synergizes with each other and things like
that. All right, So the follow

1216
01:22:00,600 --> 01:22:03,760
up question is do you want to
share your gamer tag so that you're listening

1217
01:22:03,800 --> 01:22:09,760
for the show can jump on and
talk trash. Yeah, I'm pdf backwards

1218
01:22:09,760 --> 01:22:15,479
the TEP all right, someone out
there has p F places. So I'm

1219
01:22:15,479 --> 01:22:18,279
a yeah, you said difficulty.
I thought for sure you were going to

1220
01:22:18,319 --> 01:22:23,960
bring up Lost Souls or Dark Souls, and so I that kind of game

1221
01:22:24,600 --> 01:22:28,800
I if I was good at I
would play. I don't think anyone's good

1222
01:22:28,800 --> 01:22:31,159
at it the whole, like the
the you know those kind of games where

1223
01:22:31,159 --> 01:22:33,880
it's like jumping mechanics and stuff.
I don't know, I'm that's stuff.

1224
01:22:33,920 --> 01:22:38,079
He's not a platformer. That that's
it? Yeah, I mean really twitchy

1225
01:22:38,159 --> 01:22:41,520
games. Yeah, I'm totally with
you. I played Ninja Gaiden a

1226
01:22:41,560 --> 01:22:44,600
lot in the past, and like
that was very there are some parts that

1227
01:22:44,600 --> 01:22:48,279
were incredibly challenging. Blessed Podcast.
Enough of that, Like some of the

1228
01:22:48,279 --> 01:22:51,760
boss fights are legitimately challenging. You
know, there's just one dude we fought

1229
01:22:51,880 --> 01:22:55,239
probably forty times in a row before we
beat him, but now we're really good

1230
01:22:55,239 --> 01:23:00,840
at it because we yeah, we
post-mortemed each death. Okay,

1231
01:23:01,600 --> 01:23:04,960
you can't stand in the middle if
really where games are worth right, Like

1232
01:23:05,000 --> 01:23:09,600
it's I think that's a real lesson. You know, you really have to

1233
01:23:09,640 --> 01:23:14,239
take root cause analysis and post mortems
to your private life, and you

1234
01:23:14,279 --> 01:23:17,039
know in your friend group when something
goes wrong, you really need to investigate.

1235
01:23:17,399 --> 01:23:20,000
I often joke that, like many
things about me are really good for

1236
01:23:20,039 --> 01:23:24,479
work and really terrible for personal relationships, but they work out really well in

1237
01:23:24,520 --> 01:23:30,600
gaming. So root-causing relationship problems
is usually a bad idea. Root-causing

1238
01:23:30,600 --> 01:23:38,039
why you died to a boss?
Apply it where you can. Fair enough,

1239
01:23:39,680 --> 01:23:42,920
all right, Warren, what'd you
bring this week? Yeah, so a

1240
01:23:42,960 --> 01:23:45,479
couple weeks ago I mentioned this already, but I'm gonna plug it again.

1241
01:23:45,880 --> 01:23:51,359
On Friday, there's the DecompileD conference
in Dresden, Germany, which I'm

1242
01:23:51,359 --> 01:23:56,239
actually giving a talk at, about our
journey at Authress and adding security. But

1243
01:23:56,279 --> 01:23:58,760
there are some interesting talks there,
like there's one that I really want to

1244
01:23:58,800 --> 01:24:03,159
go to that's about migrating from Kubernetes
to serverless, and I think there

1245
01:24:03,239 --> 01:24:06,039
are a bunch of other ones that
are really interesting. I'm looking forward to it.

1246
01:24:06,560 --> 01:24:13,600
Nice, excellent, cool. So
my pick is going to be no surprise

1247
01:24:13,680 --> 01:24:16,800
to anyone who's been listening to the
last few episodes. I am picking PlatformCon,

1248
01:24:16,880 --> 01:24:23,439
coming up in June. It's
a five day virtual conference about platform engineering,

1249
01:24:24,199 --> 01:24:27,720
so check it out totally free,
and there's going to be tons of

1250
01:24:27,760 --> 01:24:32,279
great talks there. And specific to me, at the end of the conference,

1251
01:24:32,319 --> 01:24:36,239
I will be doing a live Q
and A session with some of the speakers,

1252
01:24:36,760 --> 01:24:40,720
so they're finalizing who the speakers are
going to be, and then once

1253
01:24:40,800 --> 01:24:44,880
that's done, I am going to
try and turn this into a Q and

1254
01:24:44,920 --> 01:24:47,319
a session that you actually want to
listen to. So I'm going to go

1255
01:24:47,359 --> 01:24:53,800
out on X and start asking people
which speakers do you want me to interview

1256
01:24:54,199 --> 01:24:58,279
and what questions do you want me
to ask them, so that it becomes

1257
01:24:58,279 --> 01:25:00,800
the interview that you want to hear, and yes, I know that by

1258
01:25:00,840 --> 01:25:04,840
going out on X and asking you
what questions to ask, some of you

1259
01:25:04,880 --> 01:25:08,640
are going to ask some of the
wrong questions. So I'm just gonna be

1260
01:25:08,680 --> 01:25:12,760
honest here with you. I'm not
going to ask that, but thank you

1261
01:25:12,760 --> 01:25:17,640
for listening to the show, you
sick little pervert. And so whenever the

1262
01:25:17,760 --> 01:25:23,119
Q and A session comes up,
I will ask the questions, some of

1263
01:25:23,159 --> 01:25:26,600
them because we know what you're going to
want to ask, but I will ask

1264
01:25:26,600 --> 01:25:29,800
some of them and then you will
be able to say, hey, I

1265
01:25:29,880 --> 01:25:32,760
heard about this on the X,
which takes it full circle because I did

1266
01:25:32,840 --> 01:25:36,439
go through that whole setup just to
walk all the way through and plug this

1267
01:25:36,560 --> 01:25:43,039
ZZ Top song in my pick for
today. So all of you listeners who

1268
01:25:43,039 --> 01:25:45,520
are ZZ Top fans out there, that
was all done for you. Thanks for

1269
01:25:45,600 --> 01:25:56,319
listening. Cool, So I think
we got an episode here. Awesome.

1270
01:25:56,840 --> 01:26:00,159
Yeah, Pete, thanks for joining
us. This has been a great talk.

1271
01:26:00,159 --> 01:26:01,720
Fun to have you on the show. Yeah, thank you for having

1272
01:26:01,760 --> 01:26:05,399
me again. This was great.
Yeah, anytime. Warren, thanks again

1273
01:26:05,479 --> 01:26:09,239
for joining me as a co-host
here. Of course, I love

1274
01:26:09,279 --> 01:26:12,239
having you on the show. It's
been a lot of fun. Look forward

1275
01:26:12,239 --> 01:26:15,560
to seeing you next week. Yeah, and for all you listeners out there,

1276
01:26:15,640 --> 01:26:18,359
we will see you all next
week too. Thanks everyone,
