1
00:00:05,040 --> 00:00:06,320
Speaker 1: What's up, Warren? How you doing?

2
00:00:07,160 --> 00:00:10,599
Speaker 2: I'm doing I'm doing pretty well. Thanks, Thanks, well cool.

3
00:00:10,800 --> 00:00:14,039
Speaker 1: I cut my intro a little shorter this week because

4
00:00:14,759 --> 00:00:17,079
I just kind of assume, and maybe this is a

5
00:00:17,079 --> 00:00:20,760
bad assumption on my part, that people know which podcasts

6
00:00:20,800 --> 00:00:22,800
they're listening to because they had to click on it,

7
00:00:23,079 --> 00:00:25,800
and so I felt like it was kind of redundant

8
00:00:25,800 --> 00:00:29,600
to say, welcome to the Adventures in DevOps podcast.

9
00:00:30,280 --> 00:00:32,240
Speaker 2: You just you just did it right.

10
00:00:32,320 --> 00:00:35,359
Speaker 1: That was subtle though, right, that was clever. I'm gonna

11
00:00:35,479 --> 00:00:37,439
just pat myself on the back for that one.

12
00:00:37,600 --> 00:00:41,119
Speaker 3: Yeah, I've got I've got actually an interesting fact that

13
00:00:41,159 --> 00:00:42,679
I can share, so you know, we can jump into

14
00:00:42,679 --> 00:00:47,159
that there was an OTP provider that actually changed hands

15
00:00:47,200 --> 00:00:51,240
similar to the XZ vulnerability in compression in Linux not

16
00:00:51,280 --> 00:00:54,520
too long ago. And for those of you that did OTPs,

17
00:00:54,520 --> 00:00:56,240
that's one-time passwords, so you can think

18
00:00:56,119 --> 00:00:57,759
Speaker 2: of like an auth app that you've got installed on

19
00:00:57,799 --> 00:00:58,240
your phone.

20
00:00:58,880 --> 00:01:02,119
Speaker 3: And it's sort of ridiculous that this even happened, because

21
00:01:02,200 --> 00:01:04,519
if you think about how bad it is for a

22
00:01:04,640 --> 00:01:08,200
open source library to get co-opted by a malicious attacker,

23
00:01:08,239 --> 00:01:11,319
to have an application on your phone that is also

24
00:01:11,400 --> 00:01:15,200
responsible for security for two-factor auth codes to change hands.

25
00:01:15,480 --> 00:01:19,560
That provider now has access to every single one of

26
00:01:19,599 --> 00:01:22,519
those users' two-factor auth, and it could be even

27
00:01:22,560 --> 00:01:25,799
primary factor if it comes to password resets and whatnot.

28
00:01:25,959 --> 00:01:29,480
So that's a great way of stealing credentials. And I

29
00:01:29,480 --> 00:01:31,000
don't think it's an attack vector that a lot of

30
00:01:31,040 --> 00:01:33,040
people think about. I think they, you know, it's like whatever,

31
00:01:33,079 --> 00:01:35,359
it just stores my two factor codes, doesn't really matter.

32
00:01:35,959 --> 00:01:38,879
Speaker 2: But now there's actually a huge problem that could come

33
00:01:38,920 --> 00:01:39,599
up because of that.

34
00:01:40,359 --> 00:01:43,879
Speaker 1: Nice I look forward to it. I'm just excited about that.

35
00:01:46,439 --> 00:01:48,359
Speaker 3: I think everyone really has to switch over to WebAuthn

36
00:01:48,400 --> 00:01:50,680
and that's the true secret. And if you don't know

37
00:01:50,719 --> 00:01:52,359
what that is, come and talk to me after the show.

38
00:01:52,879 --> 00:01:55,439
I'm happy to give everyone an earful about that.

39
00:01:56,359 --> 00:01:57,840
Speaker 1: Or can we just give up on it and just

40
00:01:57,879 --> 00:02:01,280
everyone uses "password", all lowercase letters, for their password.

41
00:02:01,640 --> 00:02:06,200
I mean there's some credibility to that approach.

42
00:02:05,879 --> 00:02:10,080
Speaker 2: Right, there's a whole episode there.

43
00:02:11,840 --> 00:02:15,479
Speaker 1: Maybe maybe, But today's episode, we're talking about one of

44
00:02:15,520 --> 00:02:19,960
my favorite topics, incident response and on call management. And

45
00:02:20,560 --> 00:02:25,800
to chat through that topic with us, we've got Felipe Jane,

46
00:02:25,960 --> 00:02:29,919
the CEO of Pagerly, joining us today. Felipe, welcome.

47
00:02:31,560 --> 00:02:31,800
Speaker 4: Thanks.

48
00:02:31,800 --> 00:02:35,080
Speaker 5: So i'mon guys, happy to join on this and you

49
00:02:35,280 --> 00:02:41,120
have a big discussion on incident response and on-call. Right on.

50
00:02:41,319 --> 00:02:41,599
Speaker 4: Cool.

51
00:02:41,680 --> 00:02:47,280
Speaker 1: I feel like incident response is a learned skill, you know,

52
00:02:47,439 --> 00:02:51,800
and it's learned on the job under pressure when everything's

53
00:02:51,840 --> 00:02:56,599
going to hell, And prior to your first incident, you

54
00:02:56,719 --> 00:02:59,680
never even thought through that this is where your life

55
00:02:59,719 --> 00:03:02,400
was going to lead. So how did you end up

56
00:03:03,039 --> 00:03:07,759
starting a company dedicated to this incident response?

57
00:03:09,639 --> 00:03:12,560
Speaker 5: It began with like when I started out my first

58
00:03:13,240 --> 00:03:16,639
job basically, and I was part of Amazon and in

59
00:03:16,680 --> 00:03:19,360
the detail page team, which is frankly one of the

60
00:03:19,479 --> 00:03:25,520
highest sort of pages in terms of traffic, if you consider. And

61
00:03:26,879 --> 00:03:29,199
that was like the first hand experience. I like, my

62
00:03:29,199 --> 00:03:32,520
manager put me on on-call within sort of months of joining,

63
00:03:33,000 --> 00:03:34,719
and he told me like, hey, this is like the

64
00:03:34,719 --> 00:03:38,199
best way to learn about things. And, I kid you not,

65
00:03:38,319 --> 00:03:42,120
Speaker 4: That was some serious pressure, say in.

66
00:03:42,120 --> 00:03:48,000
Speaker 5: The initial initial stuff and I and since then I've

67
00:03:48,039 --> 00:03:49,919
always known, yeah, this is the best way to sort

68
00:03:49,960 --> 00:03:54,599
of learn things because you are absolutely learning each and

69
00:03:54,759 --> 00:03:55,960
every bit of things.

70
00:03:55,719 --> 00:03:57,599
Speaker 4: In the shortest possible amount of time.

71
00:03:58,120 --> 00:04:01,599
Speaker 5: So I think that that was my first I would

72
00:04:01,599 --> 00:04:04,439
say interaction with the incidents and the on-call world.

73
00:04:05,159 --> 00:04:09,039
Speaker 1: Well, let's be realistic there. Your manager's thought process was, actually,

74
00:04:09,560 --> 00:04:11,319
if this guy's going to quit, I'm going to make

75
00:04:11,400 --> 00:04:13,719
him quit sooner rather than later, So I'm putting him

76
00:04:13,759 --> 00:04:14,199
on call.

77
00:04:14,319 --> 00:04:18,040
Speaker 2: Right at the beginning, it could be.

78
00:04:18,000 --> 00:04:20,240
Speaker 4: I think it was probably, you know, just testing.

79
00:04:20,759 --> 00:04:24,600
Speaker 5: This is how you test the waters. And yeah, it was

80
00:04:24,680 --> 00:04:27,439
pretty pretty sort of a new world. Before that, I

81
00:04:27,519 --> 00:04:31,480
always thought like, yeah, software development is more about building

82
00:04:31,480 --> 00:04:36,000
stuff and you know, maybe designing it. This part is

83
00:04:36,240 --> 00:04:40,000
truly what you'd call management, or managing or maintaining

84
00:04:40,040 --> 00:04:40,879
your product.

85
00:04:41,240 --> 00:04:44,399
Speaker 4: And that was like the first kind of interaction I had.

86
00:04:47,199 --> 00:04:50,519
Speaker 1: Right and cool. So then you went from Amazon. After

87
00:04:50,639 --> 00:04:53,839
that you went to Disney, right yeah.

88
00:04:54,000 --> 00:04:56,959
Speaker 4: Yeah, so in Amazon it was pretty interesting.

89
00:04:57,279 --> 00:05:00,240
Speaker 5: The detail page, and like, there are, it was

90
00:05:00,279 --> 00:05:03,600
sort of quite a few days, especially during Prime Days

91
00:05:03,600 --> 00:05:06,600
and cyber Mondays and the Christmas week is always like

92
00:05:06,839 --> 00:05:11,160
pretty high pressure stuff and I remember each and every

93
00:05:11,279 --> 00:05:14,360
one of them, like in terms of the events, like

94
00:05:14,519 --> 00:05:17,800
even if there's like a small blip, there's like I

95
00:05:17,839 --> 00:05:22,519
think so many teams on a single bridge in even

96
00:05:22,560 --> 00:05:25,160
in different locations, and everyone just giving their status

97
00:05:25,240 --> 00:05:28,360
updates each and every time, and it was like it

98
00:05:28,480 --> 00:05:30,399
was almost like, I'd say, like a war room.

99
00:05:30,879 --> 00:05:31,000
Speaker 4: Uh.

100
00:05:31,240 --> 00:05:35,319
Speaker 5: People, I think, have now started to tag

101
00:05:35,439 --> 00:05:38,680
on-call rooms as war rooms. Like, everyone giving the status updates

102
00:05:38,720 --> 00:05:40,959
and see update and go.

103
00:05:40,959 --> 00:05:44,759
Speaker 4: On for quite a few nights. So that was I

104
00:05:44,759 --> 00:05:47,800
think like pretty interesting in the Disney.

105
00:05:47,920 --> 00:05:51,120
Speaker 5: It was the other way around, like our major

106
00:05:51,160 --> 00:05:55,360
events in Disney were the live streams. So in India

107
00:05:55,399 --> 00:05:58,879
we have cricket as the major sport and we

108
00:05:59,000 --> 00:06:02,040
used to live stream cricket and we had around twenty

109
00:06:02,160 --> 00:06:03,920
million concurrent

110
00:06:03,680 --> 00:06:04,839
Speaker 4: Viewers also at some point.

111
00:06:05,759 --> 00:06:10,040
Speaker 5: And with that scale, each and every bit of system

112
00:06:10,279 --> 00:06:14,839
you know, from starting from where to CDN even to

113
00:06:14,879 --> 00:06:19,040
a load balancer, even to a small Kubernetes pod, to

114
00:06:19,079 --> 00:06:22,959
even to maybe our caching systems, everything gets tested a lot.

115
00:06:23,680 --> 00:06:28,720
So for us in Disney, uh, that was our major priority.

116
00:06:28,959 --> 00:06:33,120
How to, you know, manage our on-call response, instead

117
00:06:33,160 --> 00:06:37,160
of responding on calls during the live streams because we

118
00:06:37,199 --> 00:06:40,639
cannot even afford to go for a minute down because

119
00:06:40,680 --> 00:06:43,240
we know how high the stakes of the live events

120
00:06:42,839 --> 00:06:44,439
Speaker 4: Are for the company.

121
00:06:44,839 --> 00:06:48,759
Speaker 1: Well, especially when you're talking about like live streaming cricket

122
00:06:48,800 --> 00:06:51,279
to Indians, because y'all take that seriously.

123
00:06:52,120 --> 00:06:55,720
Speaker 5: Yeah, yeah, yeah, we you know the you know, the

124
00:06:55,759 --> 00:06:59,319
broadcast contracts are for you know, billions of dollars and

125
00:06:59,639 --> 00:07:02,120
let's say we go down for a couple of minutes.

126
00:07:02,319 --> 00:07:04,920
We're really losing money at every point in time,

127
00:07:05,519 --> 00:07:07,600
and we can see the Twitter. You know, we're just

128
00:07:07,639 --> 00:07:09,639
trending on Twitter, like, your app is down, and

129
00:07:09,759 --> 00:07:12,600
what's happening and what's not. So you need to be

130
00:07:12,759 --> 00:07:17,959
very very careful you know, how to respond publicly also,

131
00:07:18,040 --> 00:07:20,920
and how to you know, quickly bring things up in

132
00:07:20,959 --> 00:07:23,680
a way that it can last at least during the live.

133
00:07:24,839 --> 00:07:29,240
So so those were like I'd say, the most you know,

134
00:07:29,360 --> 00:07:33,279
the closest I can be to a customer at that point

135
00:07:33,319 --> 00:07:37,040
of time, and the most an engineer can be at the

136
00:07:37,160 --> 00:07:42,759
highest pressure point. So yeah, so that was my Disney

137
00:07:42,800 --> 00:07:47,079
stint, and post that I've started out with Pagerly.

138
00:07:47,680 --> 00:07:50,680
Speaker 1: Right. Yeah, So I think after that you're kind of

139
00:07:50,800 --> 00:07:55,279
committed at this point, you yeah, you're just your career

140
00:07:55,319 --> 00:07:59,600
path is now incident response after those two stints.

141
00:07:59,800 --> 00:08:00,600
Speaker 4: Yeah yeah, yeah.

142
00:08:00,959 --> 00:08:06,199
Speaker 5: So interestingly, I think, uh, the two companies had slightly

143
00:08:06,240 --> 00:08:10,720
a different way of handling response, maybe because.

144
00:08:10,480 --> 00:08:13,399
Speaker 4: Of the company size or team sizes.

145
00:08:13,920 --> 00:08:19,079
Speaker 5: But overall, I think that the concept was, like, everyone

146
00:08:19,160 --> 00:08:22,000
was leading to a single sort of a goal where

147
00:08:22,040 --> 00:08:26,600
we wanted to reduce the same incidents again and at

148
00:08:26,680 --> 00:08:27,879
least that that's.

149
00:08:27,720 --> 00:08:28,839
Speaker 4: Our primary goal.

150
00:08:29,279 --> 00:08:32,840
Speaker 5: And I see at what I saw in the two things,

151
00:08:32,879 --> 00:08:36,639
like the primary part of the incident response is the process,

152
00:08:36,759 --> 00:08:38,519
Like how do you sort of you know, set up

153
00:08:38,559 --> 00:08:43,000
the processes, how do you enable your engineers to you know,

154
00:08:43,120 --> 00:08:44,399
follow these processes?

155
00:08:44,799 --> 00:08:45,240
Speaker 4: Uh?

156
00:08:45,320 --> 00:08:48,759
Speaker 5: And and assistant I think like as an engineer, as

157
00:08:48,759 --> 00:08:53,639
a developer, ah, that's why we call on call as

158
00:08:53,679 --> 00:08:57,039
an operational part. Nobody wants to you know, spend a

159
00:08:57,080 --> 00:08:59,759
lot of time on it. Like everyone wants to maybe code,

160
00:09:00,159 --> 00:09:05,399
develop features, maybe even design, architect even blog nowadays, but

161
00:09:05,879 --> 00:09:08,679
on call is the last part, Like everyone wants to

162
00:09:08,720 --> 00:09:11,720
spend time, and it's mostly grunt work, especially if there's

163
00:09:11,799 --> 00:09:16,039
like, you know, work after work related to bugs or incidents.

164
00:09:16,440 --> 00:09:20,919
So that's where I saw a lot of common patterns

165
00:09:21,840 --> 00:09:24,799
across you know, incident response, even.

166
00:09:24,879 --> 00:09:25,799
Speaker 4: On call management.

167
00:09:26,200 --> 00:09:29,320
Speaker 5: UH is like like we there can be a lot

168
00:09:29,360 --> 00:09:33,759
of tools or automations or even.

169
00:09:33,600 --> 00:09:36,000
Speaker 4: Assistant agents which can help the engineers.

170
00:09:36,120 --> 00:09:39,919
Speaker 5: So that's why I kind of you know started started

171
00:09:39,919 --> 00:09:42,799
with Pagerly, which is helping the teams to assist the

172
00:09:42,840 --> 00:09:44,200
incident and management.

173
00:09:44,559 --> 00:09:47,360
Speaker 1: Yeah, I think that's a solid point that's often overlooked.

174
00:09:47,840 --> 00:09:51,279
I work a lot with early stage startups, and it's

175
00:09:51,320 --> 00:09:54,360
a pattern I've seen over my career, like the biggest

176
00:09:54,440 --> 00:09:59,000
part of incident response happens before you ever have your

177
00:09:59,039 --> 00:10:01,679
first incident, but because you have to talk through like

178
00:10:01,879 --> 00:10:03,919
what are we going to do when this actually happens,

179
00:10:03,960 --> 00:10:06,639
who are we gonna bring on, how are we going

180
00:10:06,720 --> 00:10:11,159
to carry out communications? And so Yeah, I think that's

181
00:10:11,159 --> 00:10:13,080
a good solid point. I like the fact that you

182
00:10:13,159 --> 00:10:16,519
mentioned that there's multiple ways of doing that. You know,

183
00:10:16,559 --> 00:10:18,080
there's not one right way.

184
00:10:18,559 --> 00:10:22,039
Speaker 5: Right right, So there, I would say, like the first

185
00:10:22,039 --> 00:10:26,080
part is like you need to sort of realize, yeah,

186
00:10:26,320 --> 00:10:29,159
the time has come in my organization that we need

187
00:10:29,200 --> 00:10:33,120
to have this set up. I what I've usually seen

188
00:10:33,159 --> 00:10:37,240
it all ninety percent times it comes down from the

189
00:10:37,320 --> 00:10:41,279
top leadership with if you have like cetos or epes

190
00:10:41,480 --> 00:10:45,240
depending on the organization size, If if those books have

191
00:10:45,320 --> 00:10:50,200
come from a place where incident response or oncal processes had.

192
00:10:50,039 --> 00:10:52,559
Speaker 4: Been in place, they bring that culture into.

193
00:10:52,360 --> 00:10:55,080
Speaker 5: The company because they they have realized value over the

194
00:10:55,120 --> 00:10:58,320
time of these processes in companies where they have not

195
00:10:58,559 --> 00:11:02,559
like usually or they take a longer time to realize, yeah,

196
00:11:02,600 --> 00:11:06,120
we need such processes. So that's the first part is

197
00:11:06,159 --> 00:11:09,879
to realize like, hey, these are the processes we need

198
00:11:09,960 --> 00:11:12,679
so that we can at least, you know, reduce our

199
00:11:12,799 --> 00:11:15,159
on-call issues in a longer

200
00:11:14,879 --> 00:11:15,519
Speaker 4: Frame of time.

201
00:11:16,200 --> 00:11:18,720
Speaker 5: So that's the first part. The second part is then

202
00:11:18,799 --> 00:11:21,759
to set up that on-call roster. So on-call

203
00:11:21,919 --> 00:11:25,480
rosters is something like uh, now that is something that

204
00:11:25,519 --> 00:11:29,279
which is very dependent on org to org. Some organizations

205
00:11:29,399 --> 00:11:33,120
want to have a centralized on-call team which kind

206
00:11:33,159 --> 00:11:36,840
of handles everything like let's say, even if it's like

207
00:11:36,879 --> 00:11:39,000
a re issue, they do it, even if it's like

208
00:11:39,039 --> 00:11:43,279
an infra situation, they handle it. And people have you know,

209
00:11:43,320 --> 00:11:45,879
different ways of setting up maybe like one one person

210
00:11:46,480 --> 00:11:49,480
from each team or just one person every week, and

211
00:11:49,600 --> 00:11:54,440
they do. Some organizations have a dedicated on-call team

212
00:11:54,519 --> 00:11:58,559
for each of their separate teams, so uh, and they

213
00:11:58,679 --> 00:12:01,000
kind of rotate it weekly, biweekly, monthly.

214
00:12:01,480 --> 00:12:02,519
Speaker 4: So that's the other part.

215
00:12:02,639 --> 00:12:05,480
Speaker 5: The second part is to set the on-call roster

216
00:12:06,080 --> 00:12:11,080
and I think then the organization takes some time until the

217
00:12:11,360 --> 00:12:13,519
on-call rosters get settled and

218
00:12:13,440 --> 00:12:15,960
Speaker 4: They start you know, debugging tickets, and.

219
00:12:16,039 --> 00:12:18,559
Speaker 5: After some certain amount of time then they go into

220
00:12:18,600 --> 00:12:22,559
the setting up the incident response part, which is the

221
00:12:22,639 --> 00:12:25,600
post mortems as well as you know, figuring out Hey, like,

222
00:12:25,720 --> 00:12:29,919
these are the sort of our general kind of you know,

223
00:12:29,960 --> 00:12:32,399
steps we take to solve an issue. These are the

224
00:12:32,559 --> 00:12:36,320
certain workflows that we do. Now let's try to streamline

225
00:12:36,320 --> 00:12:38,879
this both in the response as well as in the

226
00:12:38,960 --> 00:12:41,679
postmortem process as well as the postmortem analysis.

227
00:12:42,080 --> 00:12:44,279
Speaker 4: So that's how that's what.

228
00:12:44,120 --> 00:12:49,320
Speaker 5: We have seen organizations go from step A to step

229
00:12:49,679 --> 00:12:51,399
B, to the last part, which is postmortems.

230
00:12:51,600 --> 00:12:55,039
Speaker 3: You said something really interesting I think, which is I

231
00:12:55,039 --> 00:12:57,519
haven't worked at any company so far, and even my

232
00:12:57,639 --> 00:13:01,879
own, Authress here, we have process. It isn't like we

233
00:13:01,919 --> 00:13:04,279
don't do anything when that happens. Maybe it's because we're

234
00:13:04,320 --> 00:13:06,639
a tech focused or a software focused company, and that's

235
00:13:06,639 --> 00:13:10,039
pretty much where I've worked. But the part that you

236
00:13:10,039 --> 00:13:12,080
said that was really interesting for me is that software

237
00:13:12,080 --> 00:13:15,879
engineers don't like on call and you know, I have

238
00:13:17,200 --> 00:13:19,200
I want to challenge that or like, you know, I want.

239
00:13:19,080 --> 00:13:20,559
Speaker 2: To live in the world where it's not a problem.

240
00:13:20,600 --> 00:13:22,720
Speaker 3: It's like, why why do people not like it so much,

241
00:13:22,720 --> 00:13:23,799
any thoughts about that.

242
00:13:24,240 --> 00:13:28,519
Speaker 5: Because people as engineers or developers, they don't consider as

243
00:13:28,559 --> 00:13:32,320
part of the building process. We love to build, we

244
00:13:32,440 --> 00:13:35,480
love to you know, architect things, but once we have

245
00:13:35,559 --> 00:13:37,600
done that, then we don want to sort of you know,

246
00:13:37,720 --> 00:13:41,039
going and you know, fix out just one tiny part

247
00:13:41,080 --> 00:13:43,440
of it which is actually causing the major issue, but

248
00:13:43,679 --> 00:13:47,120
going there fixing part of it and probably you know,

249
00:13:47,279 --> 00:13:50,679
just taking a blame or so people kind of have

250
00:13:50,840 --> 00:13:54,440
that kind of bias. Also, Hey, my my product is

251
00:13:54,480 --> 00:13:55,360
like, bug-free.

252
00:13:55,559 --> 00:13:58,399
Speaker 4: My product is, you know, super good. So, so.

253
00:13:58,840 --> 00:14:01,600
Speaker 5: And going there fixing this as well as you know,

254
00:14:01,679 --> 00:14:04,639
you already have a lot of other work going on

255
00:14:04,759 --> 00:14:08,559
the sprints in this agile world. So that's where people

256
00:14:08,799 --> 00:14:11,120
don't, definitely, sort of, don't want to spend a

257
00:14:11,120 --> 00:14:11,879
lot of time on it.

258
00:14:12,399 --> 00:14:15,799
Speaker 4: So that's what you think.

259
00:14:15,799 --> 00:14:18,840
Speaker 3: It's not like well prioritized or rewarded. Like if you

260
00:14:18,879 --> 00:14:20,879
do on call work, you're not rewarded for it. If

261
00:14:20,919 --> 00:14:23,360
you write bugless code, you're not rewarded for it. So

262
00:14:23,480 --> 00:14:26,559
you know, whatever I don't want to do it, it's

263
00:14:26,559 --> 00:14:28,120
going to happen, and then I have to pay the

264
00:14:28,279 --> 00:14:29,919
I have to pay the fine because of it, and

265
00:14:30,000 --> 00:14:31,240
I don't I don't get the benefit.

266
00:14:32,440 --> 00:14:37,279
Speaker 5: Yeah, I think, like, benefits and all, like, can probably

267
00:14:37,279 --> 00:14:40,559
be sort of defined by the engineering managers or

268
00:14:40,600 --> 00:14:42,919
team leads if they want to sort of reward or

269
00:14:42,960 --> 00:14:45,799
they want to highlight maybe you know, like if the

270
00:14:45,840 --> 00:14:48,759
person has solved these many tickets and these incidents, or

271
00:14:49,120 --> 00:14:54,279
maybe find a better way of you know, rewarding rewarding uh,

272
00:14:54,600 --> 00:14:57,320
developers who actually solve a lot of incidents.

273
00:14:58,080 --> 00:14:59,120
Speaker 4: But yeah, I think like in.

274
00:14:59,159 --> 00:15:02,000
Speaker 5: General sense, like it's not part of the building, it's

275
00:15:02,159 --> 00:15:05,320
only maintain, but that's the major.

276
00:15:08,000 --> 00:15:09,919
Speaker 2: Like, no, no, I totally got it.

277
00:15:10,399 --> 00:15:12,720
Speaker 1: Yeah, I don't like on call because it's never my code,

278
00:15:12,759 --> 00:15:13,879
it's always something else.

279
00:15:14,159 --> 00:15:18,039
Speaker 3: Yeah, but I mean that makes me think there's something Yeah, no,

280
00:15:18,080 --> 00:15:19,600
I totally get I mean I feel like there's something

281
00:15:19,600 --> 00:15:20,919
fundamentally broken there.

282
00:15:20,960 --> 00:15:23,679
Speaker 2: Like I've seen that where I worked at one.

283
00:15:23,600 --> 00:15:27,039
Speaker 3: Of the previous previous jobs, that was twenty fifty engineers

284
00:15:27,080 --> 00:15:30,519
that were all rotating through all the same on call schedule,

285
00:15:30,960 --> 00:15:33,720
as if somehow code just because it was all in

286
00:15:33,720 --> 00:15:37,039
the monolith, if if something was broken, I somehow would

287
00:15:37,039 --> 00:15:39,320
magically know what was going on and say I don't

288
00:15:39,360 --> 00:15:42,919
know products or logistics code when I had nothing to

289
00:15:42,960 --> 00:15:45,440
do with the development of it, Like I like it

290
00:15:45,519 --> 00:15:47,600
might as well be like some for it, like I

291
00:15:47,639 --> 00:15:51,120
don't know, Aramaic or, you know, cuneiform to me, like

292
00:15:51,159 --> 00:15:53,759
I have no idea what that was going on there

293
00:15:53,840 --> 00:15:56,240
at all, and yet somehow I have to come in

294
00:15:56,320 --> 00:15:59,080
and debug or find out what the problem.

295
00:15:58,879 --> 00:16:02,279
Speaker 5: Was, right, Yeah, I think in Amazon this was a

296
00:16:02,360 --> 00:16:06,480
case, like the entire, the detail page was

297
00:16:07,000 --> 00:16:10,720
kind of at that point of time like that, and

298
00:16:10,240 --> 00:16:14,240
and that particular sort of service the page is like

299
00:16:14,320 --> 00:16:18,679
maintained by more than one fifty software engineers. So like

300
00:16:19,000 --> 00:16:21,080
most of the times you are debugging something that you

301
00:16:21,120 --> 00:16:24,320
have not written, so, and it's evening, it's three

302
00:16:24,480 --> 00:16:27,799
in the night, and you're already frustrated, like you don't

303
00:16:27,799 --> 00:16:31,000
know what's happening. So that's why I think some bit

304
00:16:31,039 --> 00:16:33,919
of frustration does come from. And that's why I think

305
00:16:34,000 --> 00:16:37,519
like good processes can sort of somehow mitigate some of

306
00:16:37,600 --> 00:16:41,600
the pain points, especially in such situations.

307
00:16:41,039 --> 00:16:45,399
Speaker 3: Like was the expectation at three am that software engineers

308
00:16:45,399 --> 00:16:47,679
should be able to log on and identify the problem

309
00:16:47,720 --> 00:16:50,399
and push out a fix like that seems like there's

310
00:16:50,440 --> 00:16:54,080
something that would never actually have like actually worked out

311
00:16:54,120 --> 00:16:55,679
in practice.

312
00:16:55,200 --> 00:16:58,360
Speaker 5: It did like something. So like it kind of depends

313
00:16:58,440 --> 00:17:01,440
on what kind of process you have in place. So

314
00:17:01,600 --> 00:17:04,440
one part is to maybe mitigate. Sometimes mitigation

315
00:17:04,519 --> 00:17:07,079
can be done just by reverting to the last running version.

316
00:17:07,279 --> 00:17:10,359
That's one mitigation people don't have. But if even that

317
00:17:10,519 --> 00:17:14,400
is not done. You probably need to sort of page

318
00:17:14,759 --> 00:17:17,880
the person who has probably added that line of code

319
00:17:17,960 --> 00:17:20,839
and take help from him or her, or you maybe

320
00:17:20,920 --> 00:17:23,000
have certain like a time of team or something like

321
00:17:23,079 --> 00:17:26,160
that which can sort of, you know, uh, orchestrate

322
00:17:26,240 --> 00:17:29,880
and collaborate with a lot of different on calls for

323
00:17:30,000 --> 00:17:32,279
different developers.

324
00:17:31,839 --> 00:17:34,559
Speaker 4: To sort of mitigate and put up put a patch

325
00:17:34,640 --> 00:17:35,359
or fix the issues.

326
00:17:35,759 --> 00:17:39,079
Speaker 5: Right, So this kind of like depends on you know,

327
00:17:39,200 --> 00:17:43,240
how you have set up that incident response process itself.

328
00:17:43,519 --> 00:17:46,759
Speaker 1: Yeah, I think that's a really good distinction to make

329
00:17:46,839 --> 00:17:52,279
there that like during an incident, oftentimes the primary goal

330
00:17:52,519 --> 00:17:56,440
is to mitigate the problem, which is different than solving

331
00:17:56,559 --> 00:17:59,799
the root cause of the problem. So like if if

332
00:17:59,799 --> 00:18:03,000
we're run an outage or an incident, like we might

333
00:18:03,160 --> 00:18:09,680
mitigate the problem by launching you know, fifteen more Kubernetes

334
00:18:09,720 --> 00:18:13,400
pods with just insane amounts of memory, just so we

335
00:18:13,480 --> 00:18:16,000
can ride through the problem till we're able to figure

336
00:18:16,039 --> 00:18:18,960
out the root cause and test that theory and then

337
00:18:19,000 --> 00:18:20,039
deploy a fix for it.

338
00:18:20,480 --> 00:18:23,000
Speaker 3: So just so I got you right, Will, your strategy

339
00:18:23,119 --> 00:18:25,519
for incident management is turning it off and back on.

340
00:18:25,599 --> 00:18:29,640
Speaker 1: Again absolutely three times. Always reboot three times.

341
00:18:31,559 --> 00:18:34,000
Speaker 4: Yeah, I think, like, in the live stream,

342
00:18:34,720 --> 00:18:38,480
Speaker 5: You know, one of the major kind of our last

343
00:18:38,559 --> 00:18:41,160
resort was to just put the live stream on and

344
00:18:41,720 --> 00:18:44,480
maybe don't have a paywall or something like that.

345
00:18:44,680 --> 00:18:46,920
Speaker 4: So that can be one even in.

346
00:18:48,839 --> 00:18:51,920
Speaker 5: Different scenarios that you can sit like, okay, even if

347
00:18:52,000 --> 00:18:53,160
fixing is taking time, even

348
00:18:53,039 --> 00:18:58,279
Speaker 4: Immedi maybe do something. Maybe just put that I took there.

349
00:18:58,839 --> 00:19:02,519
Speaker 5: There was a certain, uh, process you already had in place, so

350
00:19:02,839 --> 00:19:06,160
I just said, like, just increase the Kubernetes pods and

351
00:19:06,799 --> 00:19:09,680
at least that your customer is not facing that issue

352
00:19:09,680 --> 00:19:12,799
for the moment, then use that time to

353
00:19:13,759 --> 00:19:14,880
to actually fix the issue.

354
00:19:15,200 --> 00:19:18,640
Speaker 3: How do you decide what mitigation strategy makes the most sense, Like,

355
00:19:19,000 --> 00:19:20,960
if you like, I feel like we're in the case

356
00:19:21,000 --> 00:19:23,279
of the world now where we're going to automate whatever

357
00:19:23,400 --> 00:19:26,279
it is. So if we have some number of failures,

358
00:19:26,359 --> 00:19:28,799
Do we just immediately start deploying extra pods? Do we

359
00:19:28,960 --> 00:19:32,160
immediately try to roll back to a previous code version?

360
00:19:32,240 --> 00:19:35,240
Like can we even know upfront what the right approach

361
00:19:35,400 --> 00:19:37,200
is there to automate? Because the last thing I want,

362
00:19:37,240 --> 00:19:39,519
I feel like, is someone to get online and after

363
00:19:39,680 --> 00:19:41,799
half an hour be like, you know what, maybe we

364
00:19:41,920 --> 00:19:45,160
should deploy some pods with an insane amount of memory

365
00:19:45,200 --> 00:19:46,480
that will solve all of our problems.

366
00:19:47,359 --> 00:19:50,240
Speaker 4: Yeah, yeah, it's interesting.

367
00:19:50,319 --> 00:19:55,319
Speaker 5: It was like like as an engineer, like even if

368
00:19:55,400 --> 00:19:59,319
you have some familiarity, you would know this is the issue.

369
00:19:59,519 --> 00:20:02,079
Speaker 4: Let's say, if you're seeing a memory issue, then a pod increase

370
00:20:02,359 --> 00:20:03,000
makes a sense.

371
00:20:03,440 --> 00:20:07,119
Speaker 5: But let's say if there's a something else you're seeing

372
00:20:07,119 --> 00:20:10,319
a lot of 5xx's, probably the pod increase might

373
00:20:10,359 --> 00:20:11,000
not make sense.

374
00:20:11,000 --> 00:20:14,400
Speaker 4: There, probably reverting to a previous version might make sense.

375
00:20:15,079 --> 00:20:18,799
Speaker 5: So I think you need to have some bit of humanity,

376
00:20:18,960 --> 00:20:21,680
you know, with whatever system that you are hiding it

377
00:20:22,279 --> 00:20:23,759
in case you're not, then it becomes.

378
00:20:23,559 --> 00:20:27,079
Speaker 4: A huge challenge. Then you probably don't know, you know,

379
00:20:27,200 --> 00:20:31,279
what to do. So I think, like I think some

380
00:20:31,839 --> 00:20:35,440
humanity is needed, at least to have that first mitigation.

381
00:20:35,559 --> 00:20:41,799
Speaker 1: Still, I think that alludes to an entire skill set

382
00:20:42,000 --> 00:20:44,480
in software engineering of how to troubleshoot.

383
00:20:45,039 --> 00:20:45,160
Speaker 4: You know.

384
00:20:45,240 --> 00:20:48,200
Speaker 1: It's because like debugging when you're writing code is completely

385
00:20:48,319 --> 00:20:53,839
different I think than troubleshooting a live system and all

386
00:20:53,880 --> 00:20:56,839
of its different dependencies and trying to figure out where

387
00:20:56,920 --> 00:20:59,599
the potential problem might be and how do you how

388
00:20:59,640 --> 00:21:02,519
do you get some faith in that theory and then

389
00:21:03,319 --> 00:21:04,440
do something to mitigate it.

390
00:21:04,799 --> 00:21:05,000
Speaker 4: Yeah.

391
00:21:05,200 --> 00:21:07,599
Speaker 5: Yeah, I think like a couple of techniques that we

392
00:21:07,799 --> 00:21:10,079
have seen and we have kind of deployed. One

393
00:21:10,240 --> 00:21:14,119
is maybe do like a shadow on-call, like you

394
00:21:14,279 --> 00:21:17,000
can do on-calls, like someone is handling the

395
00:21:17,079 --> 00:21:19,279
on-call, but let's say if they have major incidents,

396
00:21:19,720 --> 00:21:22,759
you shadow them. So you know, these are tools like

397
00:21:22,920 --> 00:21:25,839
even even I've seen sometimes people don't even know like

398
00:21:26,000 --> 00:21:28,240
these are dashboards that you can refer to and which

399
00:21:28,240 --> 00:21:31,720
will probably help you more so, uh, reaching out to

400
00:21:32,079 --> 00:21:35,920
people, shadowing them definitely helps, especially whenever there's an

401
00:21:36,000 --> 00:21:38,640
incident or an issue in a system which you have

402
00:21:38,880 --> 00:21:42,599
not yet touched. So that is one other bit is

403
00:21:43,079 --> 00:21:45,559
what we had done was we had done Chaos Monkey

404
00:21:45,680 --> 00:21:50,200
a lot. So Chaos Monkey, like, a concept I

405
00:21:50,279 --> 00:21:53,559
think probably originated in Netflix engineering.

406
00:21:53,519 --> 00:21:57,920
Speaker 4: That uh you kind of have like game days, uh

407
00:21:58,519 --> 00:22:02,559
where what you do is, like, certain infrastructure, you

408
00:22:02,680 --> 00:22:03,359
just put it down.

409
00:22:03,880 --> 00:22:06,359
Speaker 5: Let's say you put it down like a replica of

410
00:22:06,680 --> 00:22:10,119
let's say PostgreSQL, and see how your

411
00:22:10,160 --> 00:22:11,599
engineering team is performing after that.

412
00:22:11,759 --> 00:22:13,240
Speaker 4: Like, how much time they have

413
00:22:13,279 --> 00:22:17,160
Speaker 5: Taken to mitigate, or at least mitigating, what's

414
00:22:17,240 --> 00:22:20,359
their MTTR, how they're fixing it, how they're communicating,

415
00:22:20,400 --> 00:22:22,759
how they're collaborating, whether they're putting

416
00:22:22,519 --> 00:22:25,319
Speaker 4: The right communication to the right stakeholders at the right time.

417
00:22:25,839 --> 00:22:31,319
Speaker 5: So those kind of events, those kinds of practices have helped,

418
00:22:31,440 --> 00:22:33,559
especially when you have not done it

419
00:22:33,400 --> 00:22:36,960
Speaker 4: For a long time. So Chaos Monkey helps a

420
00:22:37,000 --> 00:22:43,119
Speaker 5: Lot, especially prepping for large events, and especially having

421
00:22:43,279 --> 00:22:47,720
like a proper collaboration sync with other other teams, because

422
00:22:47,759 --> 00:22:50,559
that's also what is needed. You do it within your team,

423
00:22:50,920 --> 00:22:53,039
but you also have to do it with let's say

424
00:22:53,519 --> 00:22:56,160
your maybe front-end team or maybe your DevOps team.

425
00:22:56,240 --> 00:22:58,519
You need to be in sync to mitigate an issue. So

426
00:23:00,359 --> 00:23:02,640
there are techniques to do that, so that, like, the right

427
00:23:03,000 --> 00:23:06,680
takes and right information also is like like the right

428
00:23:06,839 --> 00:23:10,640
education is also done for whoever is coming on.

429
00:23:12,279 --> 00:23:16,079
Speaker 3: I often wonder how how much these things provide value.

430
00:23:16,160 --> 00:23:19,039
Like where along the spectrum is the right time to

431
00:23:19,079 --> 00:23:21,839
start implementing, say a game day where you're taking your

432
00:23:21,880 --> 00:23:24,599
own stuff down, or the Simian Army to inject faults

433
00:23:24,599 --> 00:23:27,640
into your architecture or infrastructure. Like I see a lot

434
00:23:27,680 --> 00:23:30,160
of companies where I could say, hey, you know what,

435
00:23:30,440 --> 00:23:34,279
that's probably not the highest value day. Like they're like,

436
00:23:34,799 --> 00:23:37,160
they have so many other problems that I think they

437
00:23:37,160 --> 00:23:41,119
should tackle first before they're ready to do that. But

438
00:23:41,279 --> 00:23:44,359
then on the opposite side, I'm thinking, wait, like if

439
00:23:44,400 --> 00:23:47,960
they did this, they may actually identify critical problems within

440
00:23:48,000 --> 00:23:51,480
their infrastructure that could cause them multi day downtimes or

441
00:23:51,599 --> 00:23:55,240
multi week downtimes, which would have more catastrophic impacts

442
00:23:55,279 --> 00:23:58,440
in the long term. Uh, Like, I don't know, is

443
00:23:58,480 --> 00:24:00,279
that interesting? Do you have any thoughts on that?

444
00:24:01,240 --> 00:24:04,880
Speaker 5: And, like, of course, if your company is at a

445
00:24:04,920 --> 00:24:07,240
startup stage or initial stages.

446
00:24:06,839 --> 00:24:09,480
Speaker 4: Where they maybe don't have a lot of customers or.

447
00:24:09,799 --> 00:24:12,839
Speaker 5: That's uh, they don't, they won't be doing this even

448
00:24:12,920 --> 00:24:17,839
if I'll say, I think once you have started having

449
00:24:17,960 --> 00:24:22,160
multiple teams, multiple engineering teams with say different different powers,

450
00:24:22,240 --> 00:24:25,400
kind of a system where sometimes the information is scattered

451
00:24:26,359 --> 00:24:29,279
between teams and you don't know, you know, like when

452
00:24:29,319 --> 00:24:31,119
a when a fire is there, you don't know who

453
00:24:31,200 --> 00:24:33,200
to who to say, like who put that?

454
00:24:33,759 --> 00:24:35,200
Speaker 4: And that's the beginning of it.

455
00:24:35,599 --> 00:24:38,559
Speaker 5: And as slowly, I think, the teams start to mature,

456
00:24:38,640 --> 00:24:41,720
and they mature, I think, I think that's the right

457
00:24:41,799 --> 00:24:45,039
time to sort of sort of start these processes.

458
00:24:45,400 --> 00:24:48,400
Speaker 1: Yeah. I think maturity is really the key word there

459
00:24:48,720 --> 00:24:52,039
because it takes you know, you have to have multiple

460
00:24:52,119 --> 00:24:54,279
layers of maturity there. You have to have a product

461
00:24:54,319 --> 00:24:59,359
that's mature enough to be tested, but you also

462
00:24:59,440 --> 00:25:02,799
have to have maturity in your leadership team where they

463
00:25:03,000 --> 00:25:06,319
recognize and understand the value of saying, hey, we're not

464
00:25:06,440 --> 00:25:10,440
shipping new features this week, We're not shipping shiny new buttons.

465
00:25:10,480 --> 00:25:12,519
We're actually going to take the time and effort to

466
00:25:12,559 --> 00:25:14,359
see what it takes to break our system.

467
00:25:14,720 --> 00:25:18,839
Speaker 5: Yeah, I think probably. But a company having a launch,

468
00:25:19,440 --> 00:25:22,200
launching new things or launching a new product, and maybe

469
00:25:22,240 --> 00:25:25,720
a week or so, I think people do dogfooding. They

470
00:25:25,759 --> 00:25:28,960
can add this, maybe incident response, or as a

471
00:25:29,039 --> 00:25:31,839
part of it, so that you know how your team

472
00:25:31,839 --> 00:25:34,400
would be reacting on day one, day zero, one and two.

473
00:25:34,960 --> 00:25:40,720
So I think I think it like generally sometimes even

474
00:25:40,799 --> 00:25:45,000
the managers or the management kind of start realizing,

475
00:25:45,400 --> 00:25:47,119
now we are spending a lot of time on these

476
00:25:47,279 --> 00:25:51,319
incidents itself, like our delivery for other important stuff is

477
00:25:51,400 --> 00:25:53,519
also getting impacted, so now we.

478
00:25:53,519 --> 00:25:57,519
Speaker 4: Should find time to set some process time for.

479
00:25:57,599 --> 00:26:00,480
Speaker 5: This or that, like we get you know, these incident

480
00:26:00,759 --> 00:26:04,200
it is so we can have a longer time for

481
00:26:04,279 --> 00:26:05,160
our whole features.

482
00:26:05,400 --> 00:26:08,720
Speaker 4: So it's always, you know about finding that right balance.

483
00:26:09,279 --> 00:26:12,200
Speaker 5: Even engineering managers, I believe, have a tough time

484
00:26:12,240 --> 00:26:15,079
to sort of justify spending a

485
00:26:15,119 --> 00:26:16,680
lot of time for these kind of things.

486
00:26:16,799 --> 00:26:17,920
Speaker 4: Like it's always I think.

487
00:26:17,799 --> 00:26:22,000
Speaker 5: That's, I think, probably the conundrum they are always in,

488
00:26:22,559 --> 00:26:24,440
you know, which part to spend time on.

489
00:26:24,839 --> 00:26:30,000
Speaker 3: So uh, they have to take you I'm like stifling

490
00:26:30,160 --> 00:26:32,160
maybe laughing here because I feel like I have so

491
00:26:32,240 --> 00:26:36,079
many previous traumatic experiences of some sort of on call event.

492
00:26:36,480 --> 00:26:37,759
You know, that's on one side, and on the

493
00:26:37,799 --> 00:26:39,400
other side. You said, it's like, oh, well, you know,

494
00:26:39,519 --> 00:26:43,119
the product manager needs to prioritize the factor, but like

495
00:26:43,319 --> 00:26:45,440
I want to hire that PM that actually is like

496
00:26:45,519 --> 00:26:48,279
you know what, our incidents are impacting

497
00:26:48,400 --> 00:26:52,039
our you know, future profitability, so we should actually take

498
00:26:52,039 --> 00:26:53,079
a look at improving ourselves.

499
00:26:53,079 --> 00:26:54,599
Speaker 2: Like I've never heard that. I've never heard.

500
00:26:54,440 --> 00:26:57,079
Speaker 3: Anything like anyone on that side say that, you know,

501
00:26:57,400 --> 00:26:59,400
like it's always the other way, like, oh, we don't

502
00:26:59,400 --> 00:27:00,839
need to worry about that, it's done, right?

503
00:27:00,880 --> 00:27:02,400
Speaker 2: We didn't we finish that coverage to push it.

504
00:27:02,559 --> 00:27:02,839
Speaker 4: We don't.

505
00:27:02,920 --> 00:27:06,119
Speaker 2: We don't need to improve it anymore, and think like

506
00:27:06,519 --> 00:27:06,839
I think the.

507
00:27:06,839 --> 00:27:10,079
Speaker 5: First part is always about you know, having that right

508
00:27:10,200 --> 00:27:13,799
report or having some sort of information so that you

509
00:27:13,880 --> 00:27:16,400
could add like maybe you know if these are the

510
00:27:16,799 --> 00:27:20,079
these are incidents, there are recreatable incidents, these are the

511
00:27:20,400 --> 00:27:23,559
probably if you have some sort of a business impact

512
00:27:23,599 --> 00:27:25,759
to it, we show them the numbers and say, like,

513
00:27:26,240 --> 00:27:27,839
this is an impact and if you.

514
00:27:27,920 --> 00:27:30,000
Speaker 4: Want to sort of reduce.

515
00:27:29,720 --> 00:27:32,240
Speaker 5: Those numbers of business impact, then we need to do this.

516
00:27:32,960 --> 00:27:35,079
I like, I think, I just think it's always a

517
00:27:35,079 --> 00:27:37,920
hard time to justify spending time on the incidents.

518
00:27:38,039 --> 00:27:40,240
Speaker 4: But if you have that data, that data would be

519
00:27:40,720 --> 00:27:41,279
of use.

520
00:27:41,960 --> 00:27:44,279
Speaker 3: I mean, this is where like DORA is super successful,

521
00:27:44,319 --> 00:27:47,000
where we come in with mean time to resolution and change

522
00:27:47,039 --> 00:27:49,640
failure rate and so falling back on those statistics can

523
00:27:50,000 --> 00:27:53,279
be really helpful in the conversation to convince people that

524
00:27:53,400 --> 00:27:56,640
these aren't industry standard, that we have every single pull

525
00:27:56,720 --> 00:27:59,240
request we push out results in a bug in production.

526
00:28:01,759 --> 00:28:06,640
Speaker 4: That's right, right right.

527
00:28:07,920 --> 00:28:14,039
Speaker 1: I would imagine that most companies' onboarding experience to incident

528
00:28:14,119 --> 00:28:18,880
response is a result of hitting a breaking point where

529
00:28:18,880 --> 00:28:21,799
they've had just outage after outage after outage, and finally

530
00:28:21,839 --> 00:28:25,480
they're like, Okay, we have to do something different, which

531
00:28:25,519 --> 00:28:29,599
is probably what leads them to you. Would that be

532
00:28:29,880 --> 00:28:30,720
a fair statement.

533
00:28:31,920 --> 00:28:35,400
Speaker 5: Yeah, yeah, I think like one is what you say,

534
00:28:35,559 --> 00:28:40,119
like having outages outages, and the other part is even

535
00:28:40,200 --> 00:28:43,960
if let's say, if they want to sort of streamline

536
00:28:44,039 --> 00:28:47,519
some process, usually they see like maybe on-call is

537
00:28:47,559 --> 00:28:51,039
confused about what to do, or maybe the on-call

538
00:28:51,160 --> 00:28:55,160
is need to react, or the manager doesn't know what's happening,

539
00:28:55,400 --> 00:28:56,920
or someone doesn't even know how

540
00:28:56,839 --> 00:29:00,160
Speaker 4: To report an incident, for example. So there are different,

541
00:29:00,039 --> 00:29:04,640
Speaker 5: Different aspects to it. I obviously like like the entire

542
00:29:04,720 --> 00:29:08,960
incident response has two parts: what is, you know,

543
00:29:09,079 --> 00:29:12,079
the trigger or you know, how are you creating the incident?

544
00:29:12,240 --> 00:29:13,880
Like what's the trigger for that? And then how are

545
00:29:13,880 --> 00:29:17,519
you responding to that, which is like debugging, communicating, and then.

546
00:29:17,480 --> 00:29:18,480
Speaker 4: The postmortem.

547
00:29:19,039 --> 00:29:22,039
Speaker 5: So so that's where we kind of try to come in,

548
00:29:22,160 --> 00:29:24,920
like sort of you can streamline the entire pipeline

549
00:29:24,960 --> 00:29:27,799
of it, like make it as quick as possible, make

550
00:29:27,880 --> 00:29:34,119
it visible across maybe stakeholders, maybe support across engineering teams

551
00:29:34,160 --> 00:29:36,640
and having the postmortem analysis process in place.

552
00:29:37,160 --> 00:29:40,680
So it like I think, like like we come in

553
00:29:41,039 --> 00:29:45,960
when people, when teams recognize too many repeated incidents or

554
00:29:46,000 --> 00:29:47,000
too many of these.

555
00:29:46,880 --> 00:29:51,279
Speaker 4: Stuff, and whoever is the on call is kind of feeling.

556
00:29:51,079 --> 00:29:55,200
Speaker 5: Very confused about the state of things. So that's where

557
00:29:55,480 --> 00:29:57,720
we have seen a lot of competition onto this.

558
00:29:58,319 --> 00:30:00,640
Speaker 1: Yeah, you mentioned the stakeholders there, and I think

559
00:30:00,640 --> 00:30:04,160
that's a really cool thing to dive into for a minute,

560
00:30:04,200 --> 00:30:09,440
because communication is one of the key things of incident response,

561
00:30:09,480 --> 00:30:12,039
and it's the one I always hated the most early

562
00:30:12,119 --> 00:30:14,960
on in my career because I would be in an

563
00:30:15,079 --> 00:30:18,839
incident and then everyone wants to know what's going on. Well,

564
00:30:18,920 --> 00:30:21,680
I'm working on it, damn it. But I can't work

565
00:30:21,759 --> 00:30:23,880
on it if you're sitting here hounding me with questions.

566
00:30:23,920 --> 00:30:25,720
And so I think a key part of a solid

567
00:30:26,519 --> 00:30:31,000
incident response plan is having a communication plan so that

568
00:30:31,839 --> 00:30:34,720
you can relay that information out and free up the

569
00:30:34,799 --> 00:30:38,240
people actually working on the incident to continue working on it.

570
00:30:38,359 --> 00:30:40,559
How do you recommend addressing that?

571
00:30:41,240 --> 00:30:43,799
Speaker 5: Like I say, like an on call is a person

572
00:30:43,880 --> 00:30:48,440
who's always on fire who has to you know, mitigate

573
00:30:48,519 --> 00:30:49,000
the issue.

574
00:30:49,039 --> 00:30:54,079
Speaker 4: I think that's the number one priority, but because of the

575
00:30:54,559 --> 00:30:57,720
Speaker 5: Environment, he needs to do a lot of things also

576
00:30:58,319 --> 00:31:01,759
communicate to the support also, you know, what message,

577
00:31:02,160 --> 00:31:06,920
what's the estimated time of resolution, communicate to the manager, hey,

578
00:31:06,960 --> 00:31:09,839
this is probably the impact: this many users, this many

579
00:31:09,960 --> 00:31:15,359
subscriptions are being impacted right now. So that's

580
00:31:15,400 --> 00:31:18,519
a major pain point the on call person has, Uh,

581
00:31:19,000 --> 00:31:21,160
the best way is to, you know,

582
00:31:21,920 --> 00:31:26,039
delegate a lot of this stuff or maybe have a

583
00:31:26,519 --> 00:31:29,279
have a system which is you know, like which is

584
00:31:29,359 --> 00:31:32,319
visible to the stakeholders so that they don't ping.

585
00:31:32,279 --> 00:31:33,960
Speaker 4: The on call or they don't kind of you know,

586
00:31:34,160 --> 00:31:35,039
ask them again and again.

587
00:31:35,200 --> 00:31:38,559
Speaker 5: One of the ways we do it via Pagerly

588
00:31:38,839 --> 00:31:42,000
is we live with Slack as a major part of

589
00:31:42,079 --> 00:31:46,400
the incident response. So let's say we create a channel, a Slack

590
00:31:46,480 --> 00:31:50,839
channel for each incident and in the Slack channel, you

591
00:31:50,920 --> 00:31:55,119
can see, you know, what's the ETA, uh, what's

592
00:31:54,960 --> 00:31:55,839
Speaker 4: The business impact?

593
00:31:56,240 --> 00:31:58,920
Speaker 5: Or maybe some bit of information is like something some

594
00:31:59,039 --> 00:32:01,640
bit of it is added by the on-call, but

595
00:32:01,960 --> 00:32:04,160
nobody is, like, asking again and

596
00:32:04,200 --> 00:32:07,200
Speaker 4: Again, you know, what's the ETA? What's the impact? Let's

597
00:32:07,200 --> 00:32:09,000
say a manager wants to see, they can go to

598
00:32:09,039 --> 00:32:09,839
the channel and see it.

599
00:32:10,279 --> 00:32:14,599
Speaker 5: Customer support can see it. So, like, whatever.

600
00:32:14,960 --> 00:32:17,200
Let's say if someone wants to send an email, in

601
00:32:17,240 --> 00:32:21,279
one click they can just send all that information to emails,

602
00:32:21,519 --> 00:32:26,279
to whatever stakeholders the company has. So whatever

603
00:32:27,359 --> 00:32:31,160
kind of, you know, actions the on-call has to

604
00:32:31,240 --> 00:32:35,880
do apart from mitigation is an additional effort. And whatever

605
00:32:36,079 --> 00:32:38,519
tools and resources they can utilize to sort of you know,

606
00:32:38,599 --> 00:32:42,200
delegate and automate would be much more helpful. Uh, and

607
00:32:42,799 --> 00:32:46,200
so that they so that his major sort of brain

608
00:32:46,279 --> 00:32:49,759
focuses always on mitigating the issue as quickly as possible.

609
00:32:50,440 --> 00:32:52,480
Speaker 3: Yeah, I mean, I think having those additional things in

610
00:32:52,599 --> 00:32:55,480
place once you identify them to help streamline the process

611
00:32:55,519 --> 00:32:59,759
are super important. Like we've got uh status dashboards that

612
00:32:59,880 --> 00:33:02,240
we can point customers to immediately, so rather than trying

613
00:33:02,319 --> 00:33:05,200
to explain where the updates are going to be or

614
00:33:05,240 --> 00:33:06,599
how they're going to happen, and you just go to

615
00:33:06,680 --> 00:33:08,839
the URL and stuff is there. But I mean, I

616
00:33:08,839 --> 00:33:12,359
think also as customers of SaaS solutions, we have like

617
00:33:12,559 --> 00:33:16,839
an opportunity to even be nicer to companies that are

618
00:33:16,880 --> 00:33:19,559
having incidents. I mean, I think there's an emoji dedicated

619
00:33:19,559 --> 00:33:22,359
to this, HugOps, right? You know, when something's happening,

620
00:33:22,519 --> 00:33:25,720
you know, pass pass on the empathy a little bit.

621
00:33:25,839 --> 00:33:28,200
Like I care way more about as a customer that

622
00:33:28,319 --> 00:33:30,680
you tell me that you know that there's a problem

623
00:33:30,759 --> 00:33:33,240
that someone's looking at rather than being like, oh, we

624
00:33:33,359 --> 00:33:35,039
don't know, I don't I don't know what's going on,

625
00:33:35,359 --> 00:33:37,319
or you know, even someone's looking at it, Like I

626
00:33:37,440 --> 00:33:39,480
much prefer to be told oh, yeah, like we'll have

627
00:33:39,599 --> 00:33:42,519
an update in an hour, than, oh, this is exactly

628
00:33:42,519 --> 00:33:43,200
what's happening at.

629
00:33:43,160 --> 00:33:44,400
Speaker 2: This moment, Like I don't care about that.

630
00:33:44,519 --> 00:33:46,559
Speaker 3: I want to know, you know, when's the next update

631
00:33:46,640 --> 00:33:50,319
going to be happening, more so than play-by-play.

632
00:33:50,839 --> 00:33:54,640
Speaker 5: Right, like, there also, always there are two types

633
00:33:54,680 --> 00:33:56,920
of communication, one internal and one external.

634
00:33:57,160 --> 00:33:58,480
Speaker 4: Both has to be I think that.

635
00:34:00,000 --> 00:34:02,640
Speaker 5: Suddenly more because you have the state like you have

636
00:34:02,759 --> 00:34:07,160
the ultimate stakeholders, but, like, both need to be

637
00:34:07,559 --> 00:34:09,119
you know, always updated.

638
00:34:09,239 --> 00:34:11,400
Speaker 4: Both need to be, you know, always to the point

639
00:34:11,800 --> 00:34:13,679
so that like because.

640
00:34:13,480 --> 00:34:16,920
Speaker 5: In any of these conditions, if there's any miscommunication happening, then

641
00:34:16,960 --> 00:34:20,360
it will, you know, just prolong the incident much longer.

642
00:34:21,000 --> 00:34:27,800
Speaker 1: So it reminds me of the AWS status pages early

643
00:34:27,960 --> 00:34:31,880
in the days of AWS, like was always green, Like

644
00:34:32,360 --> 00:34:35,039
I would I would have put money that it was

645
00:34:35,280 --> 00:34:37,400
just a green icon there and there were no other

646
00:34:37,519 --> 00:34:40,119
options available because it was always green.

647
00:34:41,239 --> 00:34:44,079
Speaker 5: Right, I remember at that time, I think like, uh,

648
00:34:45,039 --> 00:34:48,320
I usually didn't sort of have a lot of confidence

649
00:34:48,400 --> 00:34:51,480
on that. Like, Downdetector or some other, or even

650
00:34:51,639 --> 00:34:54,440
Twitter was much more sort of a better way to

651
00:34:54,559 --> 00:34:58,440
you know, know that there's actually a major outage, and those status

652
00:34:58,519 --> 00:34:59,920
pages were like not at all.

653
00:35:01,800 --> 00:35:03,639
Speaker 4: I think, I think things have changed.

654
00:35:03,679 --> 00:35:07,239
Speaker 1: But you bring up Twitter though, and that's a really

655
00:35:07,320 --> 00:35:09,480
good point. I mean, I think for a lot of

656
00:35:09,920 --> 00:35:15,280
tech oriented companies that's a primary communication channel, you know,

657
00:35:15,519 --> 00:35:21,280
sending out notifications on Twitter or X and relaying information

658
00:35:21,360 --> 00:35:25,639
that way. And also like it's kind of sad to say,

659
00:35:25,679 --> 00:35:29,360
but that's also a good notification method of whenever your

660
00:35:29,400 --> 00:35:31,599
customers think something's going wrong.

661
00:35:32,320 --> 00:35:37,800
Speaker 3: Was I mean enough enough that I saw some products

662
00:35:37,840 --> 00:35:40,400
that specifically, like, go around to social media and

663
00:35:40,519 --> 00:35:44,599
get the real-time status from potential users complaining

664
00:35:44,639 --> 00:35:47,800
because it's another source that you're not tapping into to

665
00:35:47,880 --> 00:35:50,199
actually let you know if customers have a problem, you know,

666
00:35:50,280 --> 00:35:52,000
they're not necessarily reporting it back to you.

667
00:35:52,199 --> 00:35:54,360
Speaker 2: This is the report mechanism, right.

668
00:35:55,159 --> 00:35:57,800
Speaker 4: I think these two kind of work.

669
00:35:59,239 --> 00:35:59,800
Speaker 2: That's a good point.

670
00:36:00,400 --> 00:36:03,400
Speaker 5: Yeah, and I think the companies, I think, have started

671
00:36:03,440 --> 00:36:06,639
to put RSS feeds also, like, for a long time,

672
00:36:07,199 --> 00:36:11,519
and they have integrations with those feeds to their Twitter

673
00:36:11,679 --> 00:36:16,960
accounts or maybe some of their community Discord if they're

674
00:36:17,000 --> 00:36:18,920
doing a SaaS kind of a product or something like that,

675
00:36:19,079 --> 00:36:22,840
so that their customers are also updated by these platforms.

676
00:36:23,800 --> 00:36:25,880
Speaker 3: I mean, accuracy, though, is what you're bringing up, Will.

677
00:36:25,920 --> 00:36:28,719
And I feel like there's a huge challenge there realistically

678
00:36:28,880 --> 00:36:30,719
to like what do you what like what makes sense

679
00:36:30,800 --> 00:36:34,360
to even talk about and what should be intermittent hidden

680
00:36:34,440 --> 00:36:37,400
failures from an internal company standpoint, Like I don't want

681
00:36:37,440 --> 00:36:40,400
to see Amazon just being red all the time because

682
00:36:41,000 --> 00:36:44,599
some node in CloudFront failed one request because the

683
00:36:44,760 --> 00:36:46,320
connection didn't go through.

684
00:36:46,280 --> 00:36:47,519
Speaker 2: Like how does that help?

685
00:36:47,599 --> 00:36:49,880
Speaker 3: So I mean I feel like or yellow all the

686
00:36:49,960 --> 00:36:53,360
time because there's always something that's probably, possibly a problem. I

687
00:36:53,440 --> 00:36:56,239
think a single color there is is always wrong.

688
00:36:57,119 --> 00:36:59,480
Speaker 4: Right, and and that's why I didn't.

689
00:37:00,000 --> 00:37:02,679
Speaker 5: I think if you see AWS, they have, over the

690
00:37:03,280 --> 00:37:06,679
period of time they have evolved their status page earlier

691
00:37:07,039 --> 00:37:10,079
like now, they have it actually region-wise. Also, I think

692
00:37:10,119 --> 00:37:12,519
they have also started to do for so for some

693
00:37:12,559 --> 00:37:15,480
of the services, they have started to do more granular

694
00:37:15,559 --> 00:37:20,800
scoping as much as possible, so like, uh, that's even

695
00:37:20,840 --> 00:37:23,719
for Slack. Also, like earlier they used to do only

696
00:37:23,840 --> 00:37:27,400
for messages, you know, if you have workspaces working fine.

697
00:37:27,480 --> 00:37:30,159
They have now started to do it for APIs. And you know,

698
00:37:30,519 --> 00:37:34,079
like even logging has been different, so every bit of

699
00:37:34,199 --> 00:37:36,440
different they have started to do so that like you

700
00:37:36,559 --> 00:37:39,239
don't have like a yellow for maybe just a small

701
00:37:39,320 --> 00:37:41,559
issue in or maybe just a small service in a

702
00:37:41,639 --> 00:37:44,920
small region. So if that page is more,

703
00:37:45,440 --> 00:37:48,000
if that status page is more sort of detailed, then

704
00:37:48,159 --> 00:37:50,519
I think it probably helps to sort of give the

705
00:37:50,599 --> 00:37:51,280
right information.

706
00:37:52,960 --> 00:37:55,519
Speaker 3: I mean, I actually think AWS went a lot further here.

707
00:37:55,599 --> 00:37:58,760
They have something called the Health Dashboard, which figures out

708
00:37:58,800 --> 00:38:00,480
what services you're actually using, I think, and how that

709
00:38:00,519 --> 00:38:03,320
could be impacting you, and then actually has messages there,

710
00:38:03,360 --> 00:38:07,039
which I mean is really what we all actually care about, right,

711
00:38:07,079 --> 00:38:09,360
you know, is there something happening at this moment which

712
00:38:09,440 --> 00:38:13,480
actually affects us that could be interesting realistically if we

713
00:38:13,559 --> 00:38:15,760
saw a problem, does this explain it?

714
00:38:16,719 --> 00:38:18,159
Speaker 4: Right? Right? Absolutely?

715
00:38:18,679 --> 00:38:20,440
Speaker 1: So. One thing we haven't talked about a lot is

716
00:38:20,559 --> 00:38:24,880
the post mortem, and I feel like that's all just

717
00:38:25,000 --> 00:38:29,880
like that is as much work as doing the incident

718
00:38:30,519 --> 00:38:34,960
response itself, but sometimes it gets overlooked because it's no

719
00:38:35,079 --> 00:38:37,400
longer a priority. Like once the incident is no longer

720
00:38:37,480 --> 00:38:40,480
an incident, you have to just be disciplined enough to

721
00:38:40,559 --> 00:38:43,519
run through the post mortem process. How do you how

722
00:38:43,559 --> 00:38:44,239
do you approach that?

723
00:38:44,599 --> 00:38:47,880
Speaker 4: I think the post mortem is like I'll say, like

724
00:38:48,000 --> 00:38:50,800
a chain your top in terms of the look like

725
00:38:50,960 --> 00:38:51,239
you have.

726
00:38:51,760 --> 00:38:53,840
Speaker 5: You know, you're doing us keep less right, You're going

727
00:38:53,920 --> 00:38:56,840
to the incident and maybe fix it also, but now

728
00:38:57,280 --> 00:39:01,760
you need, like, at Disney we used to have

729
00:39:01,920 --> 00:39:04,599
like a day Maximat's idea or even there's a need

730
00:39:04,760 --> 00:39:08,119
to, you know, come up with that postmortem document because

731
00:39:08,199 --> 00:39:10,639
they were kind of very bullish on that, like we

732
00:39:10,920 --> 00:39:14,320
want to know what caused the issue, so that we can fix

733
00:39:14,360 --> 00:39:18,719
it tomorrow itself. So that, like, like I would say,

734
00:39:18,800 --> 00:39:22,199
like that's where that's where the gap is, Like, that's

735
00:39:22,199 --> 00:39:24,880
where a lot of people drop where they don't want

736
00:39:24,920 --> 00:39:29,800
to do that work, especially after a grueling period

737
00:39:29,880 --> 00:39:32,400
of you know, incident resolving process.

738
00:39:33,039 --> 00:39:36,599
Speaker 4: So but I think it's just about.

739
00:39:38,199 --> 00:39:40,800
Speaker 5: More of an education part or more of you know,

740
00:39:41,679 --> 00:39:47,199
realizing what you have learned from your incident resolving partly,

741
00:39:47,880 --> 00:39:51,239
you have probably a team that has resolved a lot

742
00:39:51,280 --> 00:39:54,039
of incidents, but if they have not learned anything from them,

743
00:39:54,440 --> 00:39:58,760
then it's pretty much wasteful because tomorrow a similar or probably the

744
00:39:58,840 --> 00:40:01,199
same incident would occur, probably in a different team or,

745
00:40:01,400 --> 00:40:04,760
Speaker 4: You know, in your team itself. So like, I think

746
00:40:04,960 --> 00:40:08,840
the value of the postmortem needs to be told pretty

747
00:40:09,000 --> 00:40:10,599
you know, clearly, and it's a very clear proposition.

748
00:40:10,679 --> 00:40:13,039
Speaker 5: I always feel like if you tell the engineer if

749
00:40:13,039 --> 00:40:15,159
you don't like, you know, hey, what we don't want

750
00:40:15,280 --> 00:40:18,239
is to you know, you spending this much amount of

751
00:40:18,280 --> 00:40:21,239
time again on a similar thing, you know, next week.

752
00:40:21,719 --> 00:40:25,400
So that's where the postmortem can help. So I think

753
00:40:25,480 --> 00:40:28,400
that value is pretty much it's I think it's it's important.

754
00:40:28,639 --> 00:40:31,159
Speaker 1: Yeah, And I think that's one of the really big

755
00:40:31,360 --> 00:40:35,000
values of using an incident response tool is it it

756
00:40:35,119 --> 00:40:37,599
will collect all of those data points and help you

757
00:40:38,400 --> 00:40:42,719
more easily see that you're having this common failure.

758
00:40:43,840 --> 00:40:45,280
Speaker 2: Yeah, over and over again.

759
00:40:45,519 --> 00:40:47,519
Speaker 1: That otherwise, if you're just tracking this in like Google

760
00:40:47,599 --> 00:40:50,239
docs or whatever, you wouldn't actually see that correlation.

761
00:40:51,039 --> 00:40:54,280
Speaker 5: Yeah, I think, like I think it needs to start

762
00:40:54,360 --> 00:40:58,679
with what information you are filling in. So like, like even visually,

763
00:40:58,719 --> 00:41:01,079
what we kind of help is to do the five whys.

764
00:41:01,239 --> 00:41:04,599
You know, what what went well? You know what's first

765
00:41:04,599 --> 00:41:06,679
of all, what happened, then what went well?

766
00:41:06,920 --> 00:41:07,920
Speaker 4: What can go with?

767
00:41:08,400 --> 00:41:11,880
Speaker 5: What we can do to, you know, remediate it in future.

768
00:41:12,440 --> 00:41:16,679
So, like, having that information in place pretty much is

769
00:41:16,760 --> 00:41:19,599
the first step. So do you have the right way

770
00:41:19,679 --> 00:41:23,679
to analyze stuff in the timeline part, so you know

771
00:41:23,760 --> 00:41:25,440
if you have you know, if you want to do

772
00:41:25,559 --> 00:41:28,000
the Slack conversations or if you want to do you know,

773
00:41:28,760 --> 00:41:32,639
want to see what happened from when the incident was

774
00:41:32,639 --> 00:41:36,719
triggered until it was resolved, that timeline also helps you a lot,

775
00:41:37,119 --> 00:41:39,480
so that you know where a lot of time was

776
00:41:39,519 --> 00:41:42,320
being spent or if there's a miss there is like

777
00:41:42,440 --> 00:41:45,480
a gap in the communication process. That is also that

778
00:41:45,639 --> 00:41:48,599
is also kind of visible from them or you know,

779
00:41:48,719 --> 00:41:51,559
what are the tickets or the action items that you

780
00:41:51,639 --> 00:41:54,480
have created out of it. So there are like I'm

781
00:41:54,519 --> 00:41:59,119
sell a lot of information in that postmortem document that can.

782
00:41:59,159 --> 00:42:01,519
Speaker 4: Help you to you know, analyze a lot of things.

783
00:42:02,039 --> 00:42:04,960
Speaker 5: And most of the times we have seen it are

784
00:42:05,079 --> 00:42:10,280
usually a communication you know, uh communication error that is happening. Generally,

785
00:42:10,400 --> 00:42:13,599
let's say you didn't sort of you didn't tell the

786
00:42:13,679 --> 00:42:16,119
team that, you know, hey, the API version is deprecated, so you

787
00:42:16,199 --> 00:42:18,960
need to update. Things like that are the most common issues.

788
00:42:19,400 --> 00:42:22,400
So from you set up a process around that too.

789
00:42:22,880 --> 00:42:25,079
You know, next time, if you upgrade a version, that

790
00:42:25,559 --> 00:42:30,000
that notification is sent to different teams. So, but every

791
00:42:30,119 --> 00:42:33,880
bit of this is always and always, you know, you

792
00:42:34,079 --> 00:42:37,400
get these results or these jwills only after you have.

793
00:42:37,519 --> 00:42:40,679
Speaker 4: That document which has that entire information in place.

794
00:42:41,280 --> 00:42:43,519
Speaker 5: So so and you can sort of you know, add

795
00:42:43,639 --> 00:42:46,679
those action items as well, maybe like a short-term action item,

796
00:42:46,719 --> 00:42:48,719
a long term action item, and that.

797
00:42:49,039 --> 00:42:52,079
Speaker 4: That really helps. And the other part is to follow

798
00:42:52,119 --> 00:42:52,440
this up.

799
00:42:52,800 --> 00:42:58,039
Speaker 5: So like you create these documents, you probably have meetings also,

800
00:42:58,599 --> 00:43:00,960
but what after that, we need to sort of follow

801
00:43:01,079 --> 00:43:05,239
those action items to the last brick until those tickets

802
00:43:05,239 --> 00:43:08,320
are closed. You need to follow that up because otherwise

803
00:43:08,800 --> 00:43:14,119
this entire process becomes useless. So following that part to

804
00:43:14,199 --> 00:43:15,880
the very end is also pretty much important.

805
00:43:16,199 --> 00:43:19,079
Speaker 3: I heard a spicy take recently, and I want to

806
00:43:19,159 --> 00:43:24,280
I want to lay this on you. Every incident could

807
00:43:24,320 --> 00:43:27,199
have been prevented if you just had the right test.

808
00:43:30,360 --> 00:43:35,480
Speaker 5: Like I think every instance, as I'll say, like most

809
00:43:35,519 --> 00:43:38,039
of the cases that we have seen is usually the

810
00:43:38,119 --> 00:43:41,920
communication part, Like that's the most common thing that we

811
00:43:42,000 --> 00:43:46,559
are also, uh like like I think like like I

812
00:43:46,719 --> 00:43:50,400
remember a case, like Terraform has a lot of issues. Everyone

813
00:43:51,280 --> 00:43:53,079
kind of has a different story.

814
00:43:52,880 --> 00:43:55,239
Speaker 2: To it, but no argument here.

815
00:43:56,800 --> 00:43:59,280
Speaker 5: So we like one time what we saw is like

816
00:44:00,119 --> 00:44:03,639
if we update the security groups via Terraform, what

817
00:44:04,039 --> 00:44:04,960
AWS was.

818
00:44:05,039 --> 00:44:08,039
Speaker 4: Doing was like it removes the security groups first. Like

819
00:44:08,159 --> 00:44:10,239
let's say, if I want to add a security group, it.

820
00:44:10,320 --> 00:44:14,000
Speaker 5: Removes the security group first, the existing one, and then

821
00:44:14,039 --> 00:44:16,719
it adds the list even though I'm adding a just

822
00:44:16,920 --> 00:44:20,960
one extra. Now, what happened was in the meantime, while

823
00:44:21,039 --> 00:44:23,880
it's removing those security groups, so we were down, like

824
00:44:24,119 --> 00:44:26,599
the service was down for let's say two three minutes.

825
00:44:27,079 --> 00:44:29,159
Speaker 4: Now, this is something that we kind of.

826
00:44:30,800 --> 00:44:34,280
Speaker 5: Like that happened to our two and that's a potentially

827
00:44:35,079 --> 00:44:38,199
like you know, ticking bomb, which can actually you know,

828
00:44:38,679 --> 00:44:42,920
happen any time across any teams. So even if

829
00:44:42,960 --> 00:44:46,000
we just communicate like, hey, this is what we have seen,

830
00:44:46,159 --> 00:44:48,679
this is what we had in the experience, and if

831
00:44:48,719 --> 00:44:51,920
we just relay it to all the engineering teams, that

832
00:44:52,079 --> 00:44:56,079
issue would not occur. So that's usually the case since

833
00:44:56,199 --> 00:44:59,599
we have seen Like if even if the communication is proper,

834
00:45:00,840 --> 00:45:04,239
i'd stay like excu s differicental times of incidents wander.

835
00:45:05,800 --> 00:45:08,079
Speaker 3: I still I still can't get over the fact that

836
00:45:08,199 --> 00:45:11,079
Terraform does that by default, Like it seems like something

837
00:45:11,159 --> 00:45:14,519
that no one in their right mind would have designed

838
00:45:14,800 --> 00:45:18,119
to have the default be first delete all of the

839
00:45:18,199 --> 00:45:21,920
resources and then recreate them. I mean that just seems

840
00:45:21,960 --> 00:45:24,079
like it just backwards to be like, isn't isn't the

841
00:45:24,159 --> 00:45:27,519
common wisdom in in operations to okay, first we'll create

842
00:45:27,599 --> 00:45:29,719
the new things, make sure that it works, and then

843
00:45:29,800 --> 00:45:32,239
switch over to it. Why is the default delete?

844
00:45:32,320 --> 00:45:32,679
Speaker 4: I don't know.

845
00:45:32,840 --> 00:45:35,119
Speaker 2: I I maybe maybe I just need enough.

846
00:45:35,119 --> 00:45:37,639
Speaker 3: Coffee or something and someone and it will just magically

847
00:45:37,800 --> 00:45:38,760
insight will come to me.

848
00:45:40,320 --> 00:45:42,239
Speaker 4: Oh, maybe it's, that's an AWS thing. We

849
00:45:42,320 --> 00:45:45,840
don't know. I don't remember, no, because.

850
00:45:45,639 --> 00:45:48,800
Speaker 3: Like CloudFormation and CDK and everything like that,

851
00:45:49,239 --> 00:45:51,159
it's not, it's just the order in which the SDK

852
00:45:51,320 --> 00:45:53,719
is being executed. There's no fundamental reason why it

853
00:45:53,800 --> 00:45:54,440
has to be that way.

854
00:45:55,559 --> 00:45:57,880
Speaker 1: I want to come back to your your spicy take Warren.

855
00:45:59,679 --> 00:46:03,800
Every outage could be eliminated or avoided with the right test.

856
00:46:03,960 --> 00:46:06,599
I mean, I think in theory that's true, but the

857
00:46:07,199 --> 00:46:12,239
like the practical steps of executing that make it not

858
00:46:13,400 --> 00:46:14,960
the right answer for a lot of people. But I

859
00:46:15,000 --> 00:46:17,079
think it does highlight something that I don't think I've

860
00:46:17,119 --> 00:46:21,199
ever talked about in terms of incident response with anyone before,

861
00:46:21,599 --> 00:46:25,360
and that's identifying what your risk tolerance is. Because for

862
00:46:25,440 --> 00:46:29,159
a lot of companies, having some downtime is really not

863
00:46:29,360 --> 00:46:32,599
a big deal. In other companies, it is a big deal.

864
00:46:32,719 --> 00:46:36,840
Like I worked for a while in a medical company

865
00:46:37,320 --> 00:46:40,679
where downtime for us meant that patients could potentially die,

866
00:46:40,920 --> 00:46:44,440
So we were kind of risk averse there. But in

867
00:46:44,519 --> 00:46:46,880
other places, you know, I worked for a company that

868
00:46:46,920 --> 00:46:48,760
built a fitness app. You know, if we were down,

869
00:46:49,360 --> 00:46:51,400
somebody had to figure out how to use the treadmill

870
00:46:51,480 --> 00:46:53,880
on their own, I think they're going to be okay, But.

871
00:46:54,079 --> 00:46:55,159
Speaker 4: Like in those.

872
00:47:00,320 --> 00:47:02,719
Speaker 1: Yeah, but in those two extremes, you know, there's like

873
00:47:03,400 --> 00:47:05,719
there's a different risk tolerance for how much downtime you're

874
00:47:05,760 --> 00:47:08,679
willing to take. And I think that is probably something

875
00:47:08,760 --> 00:47:11,159
that maybe needs to be talked about more by companies

876
00:47:11,239 --> 00:47:16,239
when deciding how much downtime we want, Yeah, is down

877
00:47:16,280 --> 00:47:16,880
every weekend?

878
00:47:19,320 --> 00:47:23,400
Speaker 5: Right, I think both I think even in the downtime

879
00:47:23,559 --> 00:47:25,559
as well as I think it brings to a point

880
00:47:25,719 --> 00:47:31,320
also about alert fatigue or on-call fatigue. Like people

881
00:47:31,480 --> 00:47:35,519
kind of have very lower thresholds for a lot of

882
00:47:35,599 --> 00:47:37,800
things and over time they realize, like, probably we

883
00:47:37,880 --> 00:47:41,760
don't need a lot of lower thresholds, so uh, Like,

884
00:47:42,079 --> 00:47:45,880
I think alert generally happens when people, when teams are

885
00:47:45,920 --> 00:47:48,719
starting to have their on call process in place, they

886
00:47:48,800 --> 00:47:51,599
put alerts on a bunch of things and over the

887
00:47:51,639 --> 00:47:54,599
time for we don't need this alert or that is

888
00:47:54,679 --> 00:47:57,239
probably what we or we can raise the thresholds, so

889
00:47:57,880 --> 00:48:01,880
like, like, like at Pagerly we also provide these values

890
00:48:02,000 --> 00:48:05,679
like you can annotate alerts to sort of analyze and

891
00:48:05,920 --> 00:48:08,800
probably reduce some of the alerts that you don't even.

892
00:48:08,719 --> 00:48:10,840
Speaker 4: Need or you can probably increase thresholds.

893
00:48:11,320 --> 00:48:15,199
Speaker 5: So similarly for incidents also, you can define or

894
00:48:15,360 --> 00:48:17,760
change your, you know, threshold

895
00:48:17,599 --> 00:48:19,480
Speaker 4: Values over the period of time.

896
00:48:20,000 --> 00:48:23,679
Speaker 5: Uh some some some companies can afford to have incident

897
00:48:24,360 --> 00:48:28,480
response only during let's say business hours, they don't probably

898
00:48:28,760 --> 00:48:31,559
they can afford to maybe don't do it during weekends

899
00:48:31,639 --> 00:48:34,079
or night times. But some companies can't afford for even

900
00:48:34,079 --> 00:48:38,639
a few minutes. So it absolutely depends completely on the company, type

901
00:48:38,639 --> 00:48:40,360
of product or type of service.

902
00:48:41,199 --> 00:48:42,840
Speaker 3: Yeah, I mean, I think the same thing with like

903
00:48:42,880 --> 00:48:46,320
the dependent on alerts from a security standpoint, which in

904
00:48:46,480 --> 00:48:47,280
my domain we.

905
00:48:47,400 --> 00:48:49,840
Speaker 2: Talk about a lot like how how much do you want?

906
00:48:49,960 --> 00:48:52,559
Right like how much is important? How much is relevant

907
00:48:52,559 --> 00:48:52,719
for you?

908
00:48:53,119 --> 00:48:56,039
Speaker 3: And and maybe the, you know, Pagerly and all

909
00:48:56,400 --> 00:48:58,639
you know, help you actually identify after the fact how

910
00:48:58,719 --> 00:49:01,920
much you have. And then point the ROI is super

911
00:49:02,000 --> 00:49:05,440
critical to actually evaluate because you know, trying to actually

912
00:49:05,800 --> 00:49:08,880
sort of duplicate production in a way to actually test

913
00:49:08,920 --> 00:49:11,360
to see what happens at that scale at that moment,

914
00:49:11,440 --> 00:49:15,440
and there's no way with cloud providers to uh, well

915
00:49:15,679 --> 00:49:19,440
practice what does capacity constrained look like? And then if

916
00:49:19,480 --> 00:49:21,800
I mean your capacity constrained because there isn't another bare

917
00:49:21,840 --> 00:49:25,639
metal device available. There's no there's no alternative. Oh well,

918
00:49:25,679 --> 00:49:29,440
you know, well we should be you know, multi cloud provider.

919
00:49:29,559 --> 00:49:30,639
Like it is never the answer.

920
00:49:33,719 --> 00:49:37,159
Speaker 5: I mean you can have you know, backup das also,

921
00:49:37,280 --> 00:49:40,559
you can have as much as possible, but like there's

922
00:49:40,599 --> 00:49:41,599
no sort of answer.

923
00:49:41,880 --> 00:49:46,000
Speaker 3: H Yeah, there's some things like actually pick what your

924
00:49:46,280 --> 00:49:49,000
SLO is going to be, what your objective is going

925
00:49:49,039 --> 00:49:52,320
to be for uptime or incidents, and then make sure

926
00:49:52,400 --> 00:49:55,440
your strategy actually includes that and handles it, and then

927
00:49:55,519 --> 00:49:57,679
measure it based off the number of incidents you get

928
00:49:57,760 --> 00:49:59,639
rather than saying, oh, yeah, we should know when the

929
00:49:59,679 --> 00:50:02,840
memory goes above ninety because then it's it's bad.
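
[Editor's sketch] The idea of picking an SLO and measuring incidents against it, rather than alerting on raw thresholds, can be illustrated with a small error-budget calculation. This is a minimal sketch under stated assumptions: the `Slo` class, the 99.9% target, and the 30-day window are hypothetical values chosen for illustration and do not come from the conversation.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical availability objective; names and values are illustrative."""
    target: float        # e.g. 0.999 means 99.9% of the window must be "up"
    window_minutes: int  # length of the measurement window, in minutes


def error_budget_minutes(slo: Slo) -> float:
    """Total downtime the SLO tolerates over the whole window."""
    return (1.0 - slo.target) * slo.window_minutes


def budget_remaining(slo: Slo, incident_downtimes: list[float]) -> float:
    """Budget left after subtracting each incident's downtime (minutes).

    A negative result means the objective was missed for this window.
    """
    return error_budget_minutes(slo) - sum(incident_downtimes)


if __name__ == "__main__":
    # Assumed example: a 30-day window with a 99.9% target.
    monthly = Slo(target=0.999, window_minutes=30 * 24 * 60)
    downtimes = [3.0, 10.0]  # minutes lost in two hypothetical incidents
    print(f"budget:    {error_budget_minutes(monthly):.1f} min")
    print(f"remaining: {budget_remaining(monthly, downtimes):.1f} min")
```

Tracking remaining budget ties alerting to user-visible impact instead of to a resource threshold such as memory usage.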

930
00:50:03,039 --> 00:50:03,559
Speaker 2: Apparently.

931
00:50:04,480 --> 00:50:09,599
Speaker 5: Yeah, I mean it always you know, gets updated, it

932
00:50:10,119 --> 00:50:13,400
always gets you know, with the time it's probably from

933
00:50:13,440 --> 00:50:15,760
your your port is growing, your customers are doing to

934
00:50:16,119 --> 00:50:18,360
kind of get every passage.

935
00:50:17,960 --> 00:50:23,519
Speaker 1: What are your top tips for someone who is not

936
00:50:23,760 --> 00:50:29,199
satisfied with their current incident response uh program or or software,

937
00:50:29,880 --> 00:50:30,400
I would.

938
00:50:30,159 --> 00:50:33,480
Speaker 5: Say, like, like I think the entire fighting I think

939
00:50:33,679 --> 00:50:37,400
like you can always see different parts to it. The

940
00:50:37,559 --> 00:50:41,119
first part is the you know how easily your team

941
00:50:41,480 --> 00:50:44,639
or anyone is able to report the incident. So, yes,

942
00:50:44,920 --> 00:50:50,800
you have automated alerts on TV's or on from easy

943
00:50:50,880 --> 00:50:55,000
tools and all those parts, but I'd say not everything

944
00:50:55,079 --> 00:50:55,880
can be automated.

945
00:50:56,960 --> 00:50:59,639
Speaker 4: You need to have you know, correct way of identify

946
00:50:59,800 --> 00:51:01,360
of you know, reporting issues.

947
00:51:01,400 --> 00:51:04,840
Speaker 5: So if you have customer support or if you have

948
00:51:05,000 --> 00:51:07,719
a product team of let's say someone wants even if

949
00:51:07,840 --> 00:51:10,599
even if let's say, if you have, you know, dev

950
00:51:10,760 --> 00:51:13,960
environments or pre-prod environments, you want to report issues there,

951
00:51:14,639 --> 00:51:17,280
you need to have a good, good way of reporting

952
00:51:17,360 --> 00:51:20,920
that UH and hope and have a process that the

953
00:51:21,119 --> 00:51:25,000
correct uh the issue is reported to the correct team

954
00:51:25,159 --> 00:51:29,920
as quickly as possible. So time to trigger the incident

955
00:51:30,079 --> 00:51:33,639
or time to you know, you know you call that

956
00:51:33,840 --> 00:51:38,480
on call should be as quickly as possible. So identify

957
00:51:38,760 --> 00:51:42,400
the blockages in that. There can be blockages

958
00:51:42,639 --> 00:51:45,039
in, you know, in setting up this process itself.

959
00:51:45,840 --> 00:51:48,599
Speaker 4: And the other part is let's say, once the incident

960
00:51:48,679 --> 00:51:52,679
has been triggered or created, what to do? So if

961
00:51:52,719 --> 00:51:54,199
you if if if you.

962
00:51:54,280 --> 00:51:56,559
Speaker 5: Feel like if the calls feel like a lot of

963
00:51:56,800 --> 00:52:01,800
work outside mitigation, uh, if they like you know, if

964
00:52:02,119 --> 00:52:05,039
every time you are having an incident, you if you're

965
00:52:05,119 --> 00:52:08,639
just running around and probably adding you know, calling people

966
00:52:09,639 --> 00:52:13,079
and just figuring out, you know, what's the status,

967
00:52:12,719 --> 00:52:16,239
Speaker 4: Or adding other team on calls all the time.

968
00:52:17,039 --> 00:52:21,440
Speaker 5: Figure out these kind of blockages in your processes and

969
00:52:22,239 --> 00:52:25,719
try to streamline as as much as possible so that

970
00:52:25,960 --> 00:52:30,400
like on call or whoever is other stakeholders, can you know,

971
00:52:30,559 --> 00:52:33,400
focus on solving as much as possible if you want

972
00:52:33,440 --> 00:52:33,639
to have.

973
00:52:33,800 --> 00:52:36,159
Speaker 4: Like ah, if that is still like.

974
00:52:36,320 --> 00:52:38,320
Speaker 5: Taking a lot of time, maybe set up like a

975
00:52:38,400 --> 00:52:42,400
team of an eyework person who is actually handling all

976
00:52:42,480 --> 00:52:45,760
the incidents and he's he or she's actually dispatching the

977
00:52:45,840 --> 00:52:47,239
incidents to a correct team.

978
00:52:47,039 --> 00:52:50,000
Speaker 4: And doing like a supervision of the entire process.

979
00:52:50,679 --> 00:52:54,679
Speaker 5: And the last one is to see whether are you

980
00:52:54,840 --> 00:52:57,800
doing the postmortems correctly, like uh, you know, are

981
00:52:57,840 --> 00:53:00,800
you doing it all or not? Are you actually learning

982
00:53:00,920 --> 00:53:04,920
the you know, uh learning from your incidents? Have your

983
00:53:05,079 --> 00:53:10,320
repeated incidents reduced over time? I'd say,

984
00:53:10,400 --> 00:53:12,559
like that's the most between a lot of a lot

985
00:53:12,639 --> 00:53:15,599
of companies kind of focus on MTTR. I think that's

986
00:53:15,800 --> 00:53:18,679
not probably the right metric. The right metric is to

987
00:53:18,840 --> 00:53:22,639
see how many unique incidents you're getting. I think if

988
00:53:22,760 --> 00:53:25,920
if if if that's if that is fine, that's fine.

989
00:53:25,960 --> 00:53:29,239
But if you're getting the repeated incidents time after time,

990
00:53:29,840 --> 00:53:33,440
something you could would have over then your incident responded

991
00:53:33,440 --> 00:53:35,360
process is like the first model processes.

992
00:53:36,119 --> 00:53:38,480
Speaker 4: I think that that's interesting.

993
00:53:39,760 --> 00:53:40,440
Speaker 1: That's a good point.

994
00:53:41,719 --> 00:53:45,199
Speaker 3: I mean, I really wonder how many, like are people

995
00:53:45,559 --> 00:53:48,840
hitting the same incident over and over again? Like I

996
00:53:49,280 --> 00:53:53,000
my my guess would be probably not exactly, but maybe

997
00:53:53,119 --> 00:53:56,679
correct categorization would really help spill it. Like, you know,

998
00:53:56,920 --> 00:53:59,159
is it is it the same part of your framework,

999
00:53:59,239 --> 00:54:02,159
codebase or component? You know, if you have

1000
00:54:02,400 --> 00:54:04,880
even a monolith and not micro services, you still have

1001
00:54:04,960 --> 00:54:06,880
broken out components. You can at least target it down

1002
00:54:06,920 --> 00:54:08,960
to is it the same component that's causing the problem

1003
00:54:09,320 --> 00:54:11,559
all the time? Uh, as far as a place to

1004
00:54:11,639 --> 00:54:14,159
look and invest in rather than oh, you know, it's

1005
00:54:14,239 --> 00:54:16,480
just something happening within our whole system.

1006
00:54:17,360 --> 00:54:20,440
Speaker 5: Right, Like I think, so maybe you know that the

1007
00:54:20,519 --> 00:54:23,239
on call have the run book for when he was

1008
00:54:23,960 --> 00:54:27,079
solving an incident? Did they have the right tools

1009
00:54:27,159 --> 00:54:31,360
for you know, just seeing or for the service the

1010
00:54:31,400 --> 00:54:34,599
particular service doesn't even have a dashboard, or doesn't

1011
00:54:34,639 --> 00:54:39,800
have any playbook, uh, which can help on So there

1012
00:54:39,840 --> 00:54:42,960
are like I think a lot of learnings, uh, which

1013
00:54:43,679 --> 00:54:47,440
any engineering team can do. Uh and see,

1014
00:54:47,599 --> 00:54:50,400
you know how much like I think at the end,

1015
00:54:50,480 --> 00:54:52,480
it's all about how much we can help the

1016
00:54:52,639 --> 00:54:54,119
on-calls as much as possible.

1017
00:54:54,639 --> 00:54:59,800
Speaker 1: So yeah, I mean, if we're seeing the same incident

1018
00:55:00,079 --> 00:55:02,000
over and over again, we should at least be able

1019
00:55:02,079 --> 00:55:06,079
to brag about our mean time to resolution decreasing because everybody

1020
00:55:06,079 --> 00:55:08,119
knows which service to restart.

1021
00:55:08,559 --> 00:55:10,239
Speaker 2: But maybe that's a good point though, right maybe that

1022
00:55:10,280 --> 00:55:11,280
maybe that's the whole point.

1023
00:55:11,159 --> 00:55:14,239
Speaker 3: Right, like you you don't want to have that actually

1024
00:55:14,320 --> 00:55:17,440
decreasing because right then then there's like, you know, it

1025
00:55:17,480 --> 00:55:19,800
really points to a different problem. It's like if you

1026
00:55:19,880 --> 00:55:22,559
have run books, you must be because you hit the

1027
00:55:22,599 --> 00:55:25,239
same problem over and over again. And so rather than

1028
00:55:25,320 --> 00:55:27,480
having the run book, it'd be better to eliminate where

1029
00:55:27,599 --> 00:55:29,519
the source of the problem is coming from.

1030
00:55:30,800 --> 00:55:32,159
Speaker 4: And that's why I didn't.

1031
00:55:34,480 --> 00:55:37,320
Speaker 5: Like not like I think it's always divided for you

1032
00:55:37,440 --> 00:55:40,679
on, you know, how good is an MTTR metric,

1033
00:55:41,639 --> 00:55:46,159
actually want to like trust on it? Like, because it's

1034
00:55:46,280 --> 00:55:49,840
usually counter intuitive if you see, like as as you

1035
00:55:50,000 --> 00:55:53,119
rightly say, say, let's say let's say a company has

1036
00:55:54,039 --> 00:55:56,800
resolved most of the incidents, like they have resolved it

1037
00:55:56,840 --> 00:56:00,760
to the correct points in six months of time. In

1038
00:56:00,840 --> 00:56:04,159
the seventh and in the seventh month, the engineering team

1039
00:56:04,239 --> 00:56:07,000
doesn't have the same incidents or similar incidents, but they

1040
00:56:07,880 --> 00:56:10,159
have only one new incident which kind of took a

1041
00:56:10,239 --> 00:56:12,840
long time because that was like a unique case. So

1042
00:56:13,079 --> 00:56:17,239
in that case, the MTTR is too big because a

1043
00:56:17,320 --> 00:56:21,840
new incident came with a lower frequency, lower number of times,

1044
00:56:22,079 --> 00:56:25,000
but it took a longer time. But can we say

1045
00:56:25,039 --> 00:56:29,079
that that the engineering team had a you know, bad

1046
00:56:29,159 --> 00:56:32,400
state of incident hygiene? No, because they had kind of

1047
00:56:32,519 --> 00:56:35,320
resolved most of the incidents that have occurred in the

1048
00:56:35,360 --> 00:56:37,719
past and those are not occurring now this is like

1049
00:56:37,800 --> 00:56:41,079
a new one. So that's why I think like MTTR

1050
00:56:41,199 --> 00:56:45,920
is not always the right metric to see in terms

1051
00:56:45,960 --> 00:56:47,039
of incident hygiene.
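
[Editor's sketch] The scenario described above, where a team eliminates its repeat incidents and then sees MTTR jump because of one slow, novel incident, can be worked through with a few lines of arithmetic. This is a minimal sketch: the incident records, category names, and durations are hypothetical, and counting repeats by category is one assumed way to approximate "unique incidents".

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (month, category, minutes_to_resolve).
INCIDENTS = [
    (6, "disk-full", 10),
    (6, "disk-full", 8),
    (6, "bad-deploy", 12),
    (7, "novel-dns-failure", 180),  # a single new, slow-to-resolve incident
]


def mttr(records):
    """Mean time to resolution, in minutes, over the given records."""
    return mean(minutes for _, _, minutes in records)


def repeat_count(records):
    """Incidents beyond the first occurrence of each category."""
    counts = Counter(category for _, category, _ in records)
    return sum(n - 1 for n in counts.values())


def monthly_summary(records):
    """Per-month MTTR next to the number of repeated incidents."""
    summary = {}
    for month in sorted({m for m, _, _ in records}):
        rows = [r for r in records if r[0] == month]
        summary[month] = {"mttr": mttr(rows), "repeats": repeat_count(rows)}
    return summary


if __name__ == "__main__":
    for month, stats in monthly_summary(INCIDENTS).items():
        print(month, stats)
```

In this made-up data, month 7 has a far worse MTTR than month 6 but no repeated incidents, which is the pattern the speaker argues MTTR alone would misread.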

1052
00:56:47,599 --> 00:56:49,920
Speaker 3: Yeah, I think the error budget, according to your SLO,

1053
00:56:50,159 --> 00:56:53,920
is a much better one in this regard unfortunately. Yeah,

1054
00:56:53,960 --> 00:56:55,360
but I'm I'm totally I mean, I think that all

1055
00:56:55,400 --> 00:56:57,199
the DORA metrics sort of have that problem in a

1056
00:56:57,239 --> 00:57:00,760
way if you measure them purely or just people in general,

1057
00:57:00,960 --> 00:57:03,599
rather than how they're actually relevant. Like I I remember

1058
00:57:03,639 --> 00:57:05,679
working with one company that they were measuring even in

1059
00:57:06,639 --> 00:57:09,719
cycle time, but they were using feature flags and not

1060
00:57:09,920 --> 00:57:12,239
including that in the cycle time. So I'm like, yeah,

1061
00:57:12,280 --> 00:57:14,239
your code is going to production, but no one's using it,

1062
00:57:14,480 --> 00:57:15,760
So what's the point.

1063
00:57:15,800 --> 00:57:16,440
Speaker 2: What's the point?

1064
00:57:16,880 --> 00:57:19,559
Speaker 3: I mean, yeah, I mean measure then also measure the

1065
00:57:19,639 --> 00:57:22,400
cycle like the cycle time on feature flag removal. That's

1066
00:57:22,480 --> 00:57:25,000
going to tell you a lot more about your success.

1067
00:57:25,199 --> 00:57:27,119
Speaker 5: Right, I think, like you know, we have seen a

1068
00:57:27,199 --> 00:57:30,239
lot of tools on so we have seen a lot

1069
00:57:30,280 --> 00:57:33,199
of tools maybe just you know how many commits you

1070
00:57:33,239 --> 00:57:36,760
have pushed, So I think everything has to be you know,

1071
00:57:37,639 --> 00:57:40,679
read with a lot of context, with a lot of corns.

1072
00:57:40,800 --> 00:57:44,840
Is also not just because it can change based on

1073
00:57:44,880 --> 00:57:47,880
a different kind of uh things that are happening.

1074
00:57:48,119 --> 00:57:52,280
Speaker 3: Yeah, I know we're getting close to the limit on

1075
00:57:52,360 --> 00:57:55,199
the time that you've got with us today. Uh maybe

1076
00:57:55,280 --> 00:57:57,400
there's some last words and then we can move over

1077
00:57:57,480 --> 00:57:59,880
to picks. Anything you want to share.

1078
00:58:01,079 --> 00:58:03,800
Speaker 4: No, I think like this was pretty cool. Uh.

1079
00:58:04,559 --> 00:58:06,840
Speaker 5: Like I think like with with Pagerly also we

1080
00:58:07,000 --> 00:58:10,920
have we have seen a lot of different and unique

1081
00:58:10,960 --> 00:58:14,480
cases h and it's good, like this is something that

1082
00:58:14,599 --> 00:58:17,199
which is very close to you know what we have

1083
00:58:17,400 --> 00:58:21,920
seen like what we have failed, and like mean, we're

1084
00:58:22,000 --> 00:58:25,400
like to help companies to sort of streamline this

1085
00:58:25,760 --> 00:58:27,679
entire process as much as possible.

1086
00:58:28,360 --> 00:58:30,719
Speaker 3: It's got to be super interesting too to see how

1087
00:58:30,880 --> 00:58:36,199
companies incidents actually look like Oh for sure, Yeah, I.

1088
00:58:36,239 --> 00:58:38,760
Speaker 5: Think like what we have realized is like every company

1089
00:58:38,880 --> 00:58:42,360
needs their kind of like every company has different processes,

1090
00:58:42,519 --> 00:58:44,719
probably because of the state of their product, the state

1091
00:58:44,800 --> 00:58:49,360
of all size. And what's what we have always ensured

1092
00:58:49,480 --> 00:58:52,239
is like whatever your process or whatever you feel like

1093
00:58:52,360 --> 00:58:55,480
is the most app will not force a tool of

1094
00:58:55,559 --> 00:58:58,639
that will adapt to your kind of processes. We'll just

1095
00:58:58,800 --> 00:59:02,840
make it more automated, mostly because we know, like

1096
00:59:03,079 --> 00:59:08,239
you have set up some big even your set and

1097
00:59:08,360 --> 00:59:10,880
we know that you know that uh sort of some

1098
00:59:11,000 --> 00:59:14,559
of the big spieces of it. So there's no one

1099
00:59:14,679 --> 00:59:18,440
particular way of doing things, but whatever it may will.

1100
00:59:18,360 --> 00:59:25,480
Speaker 2: Hit well said, well said, So what do you think, Well,

1101
00:59:25,519 --> 00:59:27,239
should we should we do the picks?

1102
00:59:27,639 --> 00:59:28,519
Speaker 1: Let's do some picks.

1103
00:59:29,000 --> 00:59:31,320
Speaker 2: Okay, I know you put me on the spot anyway,

1104
00:59:31,320 --> 00:59:32,119
so I'll just go first.

1105
00:59:32,320 --> 00:59:36,840
Speaker 3: Uh my, My my pick for today's session is a

1106
00:59:36,920 --> 00:59:40,039
book called Radical Focus by I think it's a Christina

1107
00:59:40,960 --> 00:59:44,719
Wodtke. It's a, it's actually fantastic. It's a hypothetical story

1108
00:59:44,800 --> 00:59:49,119
about how to actually, uh, set priorities using OKRs

1109
00:59:49,239 --> 00:59:52,719
or KPIs or whatever, MBI, whatever you want to

1110
00:59:52,760 --> 00:59:55,280
call them, honestly, and how not to do it and

1111
00:59:55,440 --> 00:59:58,559
lessons learned from that. It's it's super relevant no matter

1112
00:59:58,639 --> 01:00:01,719
what level you're at, realistically, like even at the team level,

1113
01:00:01,800 --> 01:00:03,960
it's super interesting to think about, like how many priorities

1114
01:00:04,000 --> 01:00:06,039
and what should our focus be on? How to think

1115
01:00:06,039 --> 01:00:08,320
about that because I've seen so many teams, so many

1116
01:00:08,400 --> 01:00:11,159
companies have like, oh, yeah, we have ten initiatives for

1117
01:00:11,239 --> 01:00:13,360
this quarter, and I'm like, you can't. I bet your

1118
01:00:13,400 --> 01:00:16,840
engineers couldn't even tell you five of them, Like it's

1119
01:00:16,920 --> 01:00:19,239
just too many. And I think it's a great story

1120
01:00:19,239 --> 01:00:21,199
about how to actually think about this and what's relevant.

1121
01:00:21,800 --> 01:00:22,719
Speaker 2: So highly recommended.

1122
01:00:23,360 --> 01:00:26,119
Speaker 1: Dude, your picks are always so relevant. I feel like

1123
01:00:26,199 --> 01:00:29,320
mine are just like better, but yours are like, oh wow,

1124
01:00:29,400 --> 01:00:31,519
that could actually work and be helpful.

1125
01:00:33,519 --> 01:00:36,760
Speaker 3: I mean, I don't know, maybe I'm being lazy

1126
01:00:36,840 --> 01:00:40,559
by picking easy things. Well, you know, I'm when I'm

1127
01:00:40,639 --> 01:00:42,880
year two of a host here, maybe I'll have run

1128
01:00:42,960 --> 01:00:43,199
out of.

1129
01:00:43,239 --> 01:00:46,280
Speaker 2: Things and then I'll be onto the I don't know

1130
01:00:46,719 --> 01:00:48,559
my weights that I've got in my other room that

1131
01:00:48,599 --> 01:00:49,079
I'm using.

1132
01:00:54,360 --> 01:00:56,119
Speaker 1: All right, Fali, what you bring for us for a

1133
01:00:56,159 --> 01:00:56,679
pick today?

1134
01:00:58,400 --> 01:01:01,719
Speaker 4: I that's a little thing is one. I think.

1135
01:01:02,719 --> 01:01:07,119
Speaker 5: One I've seen like a documentary recently which was how

1136
01:01:07,480 --> 01:01:12,719
Toyota builds stuff, and a lot of things were very interesting.

1137
01:01:12,920 --> 01:01:15,840
I'm forgetting what what they call it, but essentially what

1138
01:01:16,000 --> 01:01:20,480
they they is. And the third is like known for

1139
01:01:20,639 --> 01:01:23,480
it's like you know, building bug.

1140
01:01:23,320 --> 01:01:28,280
Speaker 4: Free products manufacturer. Yeah for sure, yeah. Yeah.

1141
01:01:28,760 --> 01:01:30,639
Speaker 5: And one of the one of the things that they

1142
01:01:31,599 --> 01:01:35,880
have always is like no matter where their manufacturing units are,

1143
01:01:36,199 --> 01:01:40,559
no matter where the what they if there's an issue,

1144
01:01:40,840 --> 01:01:44,960
it gets reported to the topmost tier, like, immediately, like

1145
01:01:45,280 --> 01:01:49,159
with with proper clarity and and and that's how I think,

1146
01:01:49,199 --> 01:01:52,920
like communication becomes so much important and like that kind

1147
01:01:52,920 --> 01:01:54,119
of solves a lot of things.

1148
01:01:54,599 --> 01:01:58,920
Speaker 4: So like it, I think, and it's just very fascinating

1149
01:01:59,000 --> 01:01:59,719
how kind of.

1150
01:01:59,760 --> 01:02:03,719
Speaker 3: They actually they actually have these like cords on the

1151
01:02:03,840 --> 01:02:07,440
manufacturing floor called andon lines that help. Yeah, they stop

1152
01:02:07,599 --> 01:02:10,679
the whole manufacturing line at once. You know, hey, you know,

1153
01:02:11,000 --> 01:02:12,840
no more pull requests at this moment. For the whole

1154
01:02:12,880 --> 01:02:15,639
company because there's something critically wrong going on, Like could

1155
01:02:15,679 --> 01:02:16,239
you imagine.

1156
01:02:17,119 --> 01:02:20,199
Speaker 4: Yeah, yeah, I think, I think andon line was, I

1157
01:02:20,360 --> 01:02:23,280
was different to him. That's super super.

1158
01:02:23,840 --> 01:02:27,239
Speaker 5: I think that's such a simple kind of technique that

1159
01:02:27,519 --> 01:02:32,119
any company can sort of have, Like I make such

1160
01:02:32,119 --> 01:02:33,800
a simple process but very.

1161
01:02:36,360 --> 01:02:40,480
Speaker 1: Yeah. One of my customers was a company that provided

1162
01:02:41,880 --> 01:02:46,679
seats to Toyota, and it was wild because they would

1163
01:02:46,719 --> 01:02:49,719
get orders like Okay, we need three hundred and seventeen

1164
01:02:50,000 --> 01:02:53,840
seats that are beige delivered at ten twenty seven am.

1165
01:02:54,320 --> 01:02:56,920
You know, like the level of specificity because they have

1166
01:02:57,039 --> 01:02:59,800
that that just in time manufacturing, like we need these

1167
01:02:59,840 --> 01:03:03,199
at ten twenty seven am. And and this was a

1168
01:03:03,280 --> 01:03:05,920
smaller company, so like the level of pressure for them

1169
01:03:06,079 --> 01:03:09,679
to meet those requirements was just through the roof.

1170
01:03:10,440 --> 01:03:11,920
Speaker 3: I mean it makes a lot of sense too if

1171
01:03:11,960 --> 01:03:15,000
you think about it, because they see inventory storage as

1172
01:03:15,039 --> 01:03:17,800
a waste, as a cost to them, and so they

1173
01:03:17,880 --> 01:03:20,599
don't want to have it stored. You're the inventory

1174
01:03:20,719 --> 01:03:23,119
for them. You know, they're they're going to the shelf

1175
01:03:23,199 --> 01:03:25,679
and they're pulling it and you are that shelf for them.

1176
01:03:26,119 --> 01:03:27,199
Speaker 2: Yeah. No, it's awesome.

1177
01:03:27,840 --> 01:03:30,320
Speaker 1: Yeah, I feel like, you know, we were talking about

1178
01:03:30,360 --> 01:03:33,199
this before we started recording, about how my picks are

1179
01:03:33,559 --> 01:03:36,039
just kind of out there. I feel like this one

1180
01:03:36,159 --> 01:03:40,519
is going to be, on, like, the crazy scale, this

1181
01:03:40,639 --> 01:03:47,639
one's going to be hard to top. And yeah, I'll

1182
01:03:47,679 --> 01:03:49,079
just get to it. So I read this book. I

1183
01:03:49,199 --> 01:03:51,559
just finished it up a couple of days ago, called

1184
01:03:51,880 --> 01:03:55,519
The Sacred Mushroom and the Cross by a guy named

1185
01:03:55,599 --> 01:04:00,320
John Marco Allegro, and I would be tempted to call

1186
01:04:00,440 --> 01:04:04,079
bullshit on the book right away, except for the fact

1187
01:04:04,159 --> 01:04:08,599
that this guy spent fifteen years deciphering the Dead Sea Scrolls.

1188
01:04:09,480 --> 01:04:11,559
And so if you're not familiar with the Dead Sea Scrolls,

1189
01:04:12,079 --> 01:04:15,960
it's a set of scrolls that were found in Egypt,

1190
01:04:16,039 --> 01:04:19,480
I believe in the nineteen forties that were thousands of

1191
01:04:19,599 --> 01:04:24,119
years old, and they contained some parts of the New

1192
01:04:24,199 --> 01:04:27,440
Testament Bible, but they also had other stories in there

1193
01:04:27,440 --> 01:04:30,480
as well that weren't included in there, and so he

1194
01:04:30,639 --> 01:04:33,440
deciphered them. But this book, The Sacred Mushroom and the Cross,

1195
01:04:35,920 --> 01:04:39,320
he basically goes through this book showing or arguing that

1196
01:04:40,760 --> 01:04:44,880
a lot of the stuff written in the Old Testament

1197
01:04:44,960 --> 01:04:48,320
and the New Testament and some other religious books as

1198
01:04:48,400 --> 01:04:54,280
well were not factually based, but they were actually like

1199
01:04:54,840 --> 01:04:59,800
a play on words referencing psychedelic mushrooms, and that the

1200
01:05:00,079 --> 01:05:04,400
whole religion is based on an ancient cult or ancient

1201
01:05:04,519 --> 01:05:11,199
culture that worshiped psychedelic mushrooms. And it's a wild read. Man,

1202
01:05:11,440 --> 01:05:14,639
It's very hard to read because of all the references

1203
01:05:14,719 --> 01:05:17,880
he makes to like the Aramaic and the Semitic languages.

1204
01:05:18,440 --> 01:05:22,159
But the big takeaway for me was, you know, you're

1205
01:05:22,159 --> 01:05:24,400
reading through this and he's like, oh, so, well, they

1206
01:05:24,480 --> 01:05:28,719
said this thing and that's actually the you know, the

1207
01:05:28,760 --> 01:05:33,480
ancient Sumerian word for this psychedelic mushroom, and like everything

1208
01:05:33,559 --> 01:05:36,679
points back to being the name of a psychedelic mushroom.

1209
01:05:36,679 --> 01:05:38,960
And I was like, dude, how is it that we

1210
01:05:39,239 --> 01:05:42,880
know so little about the ancient Sumerians, but you know

1211
01:05:43,079 --> 01:05:47,440
the four hundred different words they had for psychedelic mushrooms.

1212
01:05:48,800 --> 01:05:50,920
But then at the end of this book there's like

1213
01:05:50,960 --> 01:05:53,280
a chapter. I can't figure out who wrote this last

1214
01:05:53,400 --> 01:05:58,400
chapter because it wasn't John Allegro, it was someone else. But

1215
01:05:58,599 --> 01:06:03,840
the guy was talking with his wife, she was from Russia,

1216
01:06:04,719 --> 01:06:07,360
and they came across a field with a bunch of mushrooms,

1217
01:06:07,400 --> 01:06:09,960
you know, and he's like he was an American and

1218
01:06:10,000 --> 01:06:12,159
he's like, no, don't eat those, they're all poisonous and stuff.

1219
01:06:12,199 --> 01:06:13,840
And she's like, no, this one is, this, this and this,

1220
01:06:14,000 --> 01:06:16,079
and so it turns out in Russian they have like

1221
01:06:16,159 --> 01:06:19,760
an endless number of words for mushroom, but in the

1222
01:06:20,000 --> 01:06:23,400
US and in Western cultures, we have, you know, like

1223
01:06:23,840 --> 01:06:28,519
three like toadstools, mushrooms and and whatever else.

1224
01:06:28,760 --> 01:06:31,199
Speaker 3: So it's a yeah, that's a European thing because people

1225
01:06:31,239 --> 01:06:33,880
actually go out and pick mushrooms here, and so knowing

1226
01:06:34,000 --> 01:06:35,480
which ones are poisonous the same thing.

1227
01:06:35,519 --> 01:06:37,559
Speaker 2: You know, once I've moved here, I learned all about that.

1228
01:06:37,679 --> 01:06:39,880
Speaker 3: But I think you've memed yourself, Will, because now I

1229
01:06:40,000 --> 01:06:43,599
see instead of you know, just this, but instead of aliens,

1230
01:06:43,679 --> 01:06:45,599
now it's just it's mushrooms.

1231
01:06:45,159 --> 01:06:54,800
Speaker 4: Right, aliens? The last one, well, no, it's the the

1232
01:06:54,880 --> 01:06:55,440
guy from.

1233
01:06:55,519 --> 01:06:58,400
Speaker 1: The Ancient Aliens TV show with the big hair. If

1234
01:06:58,440 --> 01:06:59,480
you've ever seen that meme.

1235
01:07:00,199 --> 01:07:02,440
Speaker 3: It was the History Channel, there was Ancient Aliens and yeah,

1236
01:07:02,480 --> 01:07:05,280
it's like the pyramids are landing platforms for aliens, and

1237
01:07:05,519 --> 01:07:07,000
you know, Will's here, you know, trying to.

1238
01:07:07,000 --> 01:07:11,079
Speaker 2: Sell us on the fact that the religious cult is

1239
01:07:11,159 --> 01:07:12,000
of mushrooms.

1240
01:07:12,320 --> 01:07:13,800
Speaker 1: I mean, yeah, good reading.

1241
01:07:14,239 --> 01:07:16,000
Speaker 2: I added to my list, So thank you for that.

1242
01:07:17,159 --> 01:07:19,400
Speaker 1: Yeah, let me know when you when you get through it,

1243
01:07:19,440 --> 01:07:21,599
I'd be interested to talk through that with you, because

1244
01:07:22,199 --> 01:07:24,440
there's like some parts of which you're like, okay, I

1245
01:07:24,559 --> 01:07:26,400
see how you can get to that conclusion, and there's

1246
01:07:26,480 --> 01:07:28,320
other parts where you're like, come.

1247
01:07:28,199 --> 01:07:36,199
Speaker 4: On, is it like pretty famous kind of book it

1248
01:07:36,320 --> 01:07:36,840
was written?

1249
01:07:37,800 --> 01:07:41,599
Speaker 1: I think it's recently. It was recently. It was published

1250
01:07:41,599 --> 01:07:46,760
in nineteen seventy, so it's an older book. And then

1251
01:07:46,880 --> 01:07:49,639
he got a lot of hatred and supposedly it was

1252
01:07:49,800 --> 01:07:55,800
very detrimental to his career. Go figure, who would have

1253
01:07:55,920 --> 01:07:59,960
thought that, you know, claiming Jesus was a psychedelic mushroom

1254
01:08:00,039 --> 01:08:03,320
would be detrimental to your career. But anyway, but I think

1255
01:08:03,360 --> 01:08:06,480
it's gained in popularity over the last couple of years

1256
01:08:06,599 --> 01:08:10,840
just because of the shift in the things that we're seeing,

1257
01:08:11,519 --> 01:08:14,360
at least here in the Western world, where people are

1258
01:08:16,159 --> 01:08:18,760
kind of changing their opinion and approach to things like

1259
01:08:19,000 --> 01:08:22,680
to natural medicines like psychedelic mushrooms, and you know, the

1260
01:08:22,800 --> 01:08:26,680
legalization of pot, and now in Oregon and Colorado there's

1261
01:08:26,760 --> 01:08:34,319
actually decriminalized centers for using mushrooms to treat like PTSD

1262
01:08:34,800 --> 01:08:36,600
and memory issues and things like that.

1263
01:08:37,439 --> 01:08:39,920
Speaker 3: You're really close to Canada where that's been a huge

1264
01:08:40,039 --> 01:08:42,279
topic in the last years.

1265
01:08:42,800 --> 01:08:43,039
Speaker 4: Yeah.

1266
01:08:43,840 --> 01:08:46,960
Speaker 1: Yeah, so I think that's been a key to the

1267
01:08:47,039 --> 01:08:53,600
book gaining new popularity. Yeah all right, so there you go.

1268
01:08:53,840 --> 01:08:56,760
So now the challenge is on next week? What am

1269
01:08:56,800 --> 01:08:58,560
I going to come with a pick that tops Jesus

1270
01:08:58,600 --> 01:08:59,319
as a mushroom?

1271
01:09:02,479 --> 01:09:05,920
Speaker 4: I actually saw one movie which is on my mind

1272
01:09:06,079 --> 01:09:08,279
is Inside Out 2. I saw, I think.

1273
01:09:08,279 --> 01:09:13,840
Speaker 5: Last night, and I think it, Uh, if someone wants

1274
01:09:13,880 --> 01:09:17,560
to make, highly recommend. I think you can sort of

1275
01:09:17,600 --> 01:09:20,560
feel a lot of emotions, uh for for.

1276
01:09:23,119 --> 01:09:25,199
Speaker 4: That's, that is something which is like just on my

1277
01:09:25,319 --> 01:09:29,239
mind and what was the name of that one? Inside

1278
01:09:29,279 --> 01:09:30,479
Out the second part?

1279
01:09:31,600 --> 01:09:35,039
Speaker 1: Yeah, awesome, And with that done, I think we have

1280
01:09:35,199 --> 01:09:37,359
an episode. Thanks for joining us, man. This has

1281
01:09:37,399 --> 01:09:39,760
been a blast. Really appreciate having you on the show.

1282
01:09:40,359 --> 01:09:40,840
Speaker 4: Great, thank you.

1283
01:09:41,439 --> 01:09:45,600
Speaker 5: Yeah, I think I think, uh, it's been really great,

1284
01:09:46,399 --> 01:09:49,880
had you know, wonderful time just to chat about incidents

1285
01:09:50,319 --> 01:09:53,560
and a lot of other things and sharing each other's

1286
01:09:53,560 --> 01:09:56,239
you know, personal experiences. I think this is something like

1287
01:09:56,920 --> 01:10:00,680
every on-call and every, or even every developer has their

1288
01:10:00,760 --> 01:10:03,239
own personal experience what they want to share.

1289
01:10:03,399 --> 01:10:05,000
Speaker 4: So it's been a really good.

1290
01:10:04,880 --> 01:10:09,119
Speaker 1: Chat. Awesome, cool. Thank you again, and to all the listeners,

1291
01:10:09,199 --> 01:10:12,079
thank you for listening. Appreciate y'all and be sure and

1292
01:10:12,279 --> 01:10:14,920
hit us up if there's anything we can do for you,

1293
01:10:15,319 --> 01:10:16,439
and we'll see y'all next week.

