1
00:00:14,599 --> 00:00:19,679
What's going on everybody. I'm your
host today, Will Button, for Adventures in

2
00:00:19,719 --> 00:00:23,679
DevOps. And before we get started,
I do want to remind everyone, or

3
00:00:23,760 --> 00:00:29,440
tell you for the first time maybe
that we are now doing these shows live

4
00:00:29,679 --> 00:00:32,679
in addition to the podcast, so
if you do want to catch it live,

5
00:00:32,719 --> 00:00:39,000
we're recording on Tuesdays at nine
thirty Central Time. That's GMT minus

6
00:00:39,079 --> 00:00:44,039
six, and actually it's like nine
thirty ish because we start at nine thirty,

7
00:00:44,039 --> 00:00:46,679
but then usually have a little bit
of a pre chat to get our

8
00:00:46,719 --> 00:00:50,560
guests up to speed on how the
process works, and then click the

9
00:00:50,600 --> 00:00:54,840
go live button. So shortly after
nine thirty Central time, you can catch

10
00:00:54,920 --> 00:01:00,119
us live on Facebook, LinkedIn,
and YouTube. And speaking of our guests,

11
00:01:00,119 --> 00:01:06,959
today, I have Sagi Brody,
Chief Technology Officer of Opti9,

12
00:01:07,400 --> 00:01:11,760
consultant, former software developer who, according to
his own words, still has code in

13
00:01:11,799 --> 00:01:18,920
production that probably shouldn't be, and I
can definitely relate to that. Today we're

14
00:01:18,920 --> 00:01:23,599
going to be talking about resilience and
disaster recovery in the cloud, how it's

15
00:01:23,640 --> 00:01:26,879
relevant, why you still need it, and then dig into that. Sagi,

16
00:01:26,959 --> 00:01:30,719
welcome to the show. Thank you,
great to be here with you, Will.

17
00:01:30,159 --> 00:01:34,959
I'm excited to speak to a technical
audience, which is not always the case.

18
00:01:37,359 --> 00:01:38,799
So, you know, we'll see if I
get called out on anything. But

19
00:01:40,000 --> 00:01:42,879
it's great to be able to get
as deep as I want to be

20
00:01:42,959 --> 00:01:49,959
without having to pull myself back.
Right. That's always my biggest fear in

21
00:01:49,799 --> 00:01:53,840
doing talks and live shows like this. It's the part of the show that

22
00:01:53,879 --> 00:01:57,280
I call stump the chump where you
say something wrong and somebody just calls you

23
00:01:57,319 --> 00:02:04,519
out on it. That pressure,
that pressure can be very useful when harnessed

24
00:02:04,560 --> 00:02:08,560
the right way. I used to
force myself to volunteer to speak on highly

25
00:02:08,599 --> 00:02:14,080
technical topics at conventions that I knew
nothing about, and I had like maybe

26
00:02:14,080 --> 00:02:16,520
two months and so these topics that
I was putting on the bottom of my

27
00:02:16,560 --> 00:02:21,199
list that I knew I needed to
learn, but I was just dragging my

28
00:02:21,240 --> 00:02:23,759
feet on, you know, now I have a time and day

29
00:02:23,759 --> 00:02:27,560
where I'm the expert and I'm going
to potentially get stumped, and so,

30
00:02:27,960 --> 00:02:32,039
you know, fear of embarrassment is
a very great motivator. Oh for sure,

31
00:02:32,280 --> 00:02:37,680
I like the approach. That's bold, but that's definitely going to be

32
00:02:37,680 --> 00:02:40,520
effective. I like it. I
may steal that from you. Yeah,

33
00:02:40,599 --> 00:02:45,240
cool. So tell our viewers
a little bit about your background and how

34
00:02:45,280 --> 00:02:50,960
you got to be the CTO of
Opti9. Sure. Yeah, so when

35
00:02:51,000 --> 00:02:54,599
I was, you know, like
probably many who are listening, you know,

36
00:02:54,800 --> 00:02:59,319
I got into sort of this industry
as a teenager, just you know,

37
00:03:00,080 --> 00:03:04,680
screwing around with computers and the Internet
and having fun and that you know,

38
00:03:04,719 --> 00:03:08,439
that sort of curiosity somehow turned into
a job, which is great.

39
00:03:08,919 --> 00:03:15,759
So it was the late nineties. I
co-founded a company called Webair.

40
00:03:15,840 --> 00:03:16,680
We were kind of in the right
place at the right time, and we

41
00:03:16,800 --> 00:03:20,719
just started hosting websites for our friends, so you can kind of think of

42
00:03:20,719 --> 00:03:24,680
it as a hosting company. So
we were working with technologies like mi FreeBSD

43
00:03:25,080 --> 00:03:30,039
and Apache and you know, the
typical sort of web stack. And this

44
00:03:30,159 --> 00:03:36,039
was great because it was before
Google and before things like PHP, and

45
00:03:36,080 --> 00:03:38,719
before things like customer service or support, and it was kind of like sink or

46
00:03:38,800 --> 00:03:43,400
swim, and so we just built
everything ourselves and just sort of scaled up

47
00:03:43,439 --> 00:03:49,719
with our customers, grew that business, sort of pivoted towards enterprise about maybe

48
00:03:49,800 --> 00:03:54,080
ten years after that, and started
to focus on management of private cloud deployments.

49
00:03:54,280 --> 00:04:00,719
Management of public clouds, orchestration and
sort of owning the glue in between

50
00:04:00,759 --> 00:04:03,960
these hybrid cloud environments. So a
lot of networking, which is

51
00:04:04,000 --> 00:04:10,199
always fun, and then got into
BCDR, so disaster recovery as

52
00:04:10,240 --> 00:04:14,319
a service, backups. Networking was always
our secret sauce, which is fun,

53
00:04:14,400 --> 00:04:16,959
you know, saying things like it's
great that you're copying your data somewhere.

54
00:04:17,560 --> 00:04:19,879
How are you going to consume it? You know, what does consumption

55
00:04:20,000 --> 00:04:24,439
look like? You know, these
are networking problems. So I've always been

56
00:04:24,480 --> 00:04:28,160
a big network guy. Eventually
we sold that business to private equity.

57
00:04:29,279 --> 00:04:32,000
I stayed on as the CTO and
we rebranded to Opti9 after we

58
00:04:32,079 --> 00:04:38,120
bought or merged with two other companies. And I'm still here mostly in a

59
00:04:38,279 --> 00:04:42,879
sort of a more of a you
know, chief Technology Product officer, which

60
00:04:42,920 --> 00:04:46,839
is starting to become a thing now, you know, which, you know,

61
00:04:46,360 --> 00:04:50,480
CTO is so vague, there's different
personas. So really my role is

62
00:04:50,519 --> 00:04:55,360
sort of product focused company, you
know, sort of customer focused. What

63
00:04:55,399 --> 00:04:58,839
are we building for customers, how
are we helping them? How are we

64
00:04:58,879 --> 00:05:01,519
working within the bounds of the third
party technologies that we use? From

65
00:05:01,560 --> 00:05:06,480
an integration perspective, how do we
push down bloat and the multi doing some

66
00:05:06,519 --> 00:05:12,759
consulting on the side and just trying
to stay busy. Right on. I think

67
00:05:12,800 --> 00:05:16,199
that's a cool perspective or a cool journey, because for a lot of us,

68
00:05:17,240 --> 00:05:21,759
we end up spending a few years
at a company and then jump to another

69
00:05:21,800 --> 00:05:26,759
company, and so we end up
going from company to company. I've done

70
00:05:26,759 --> 00:05:30,800
it myself just to well, let's
be honest, I've done it for salary

71
00:05:30,800 --> 00:05:36,160
increases, but also because of the
opportunity to work on different technologies that I

72
00:05:36,240 --> 00:05:41,879
wanted to go deeper into. But
I think that's a cool path and an

73
00:05:41,959 --> 00:05:46,399
unusual one in the fact that you've
been with the same company even though there

74
00:05:46,439 --> 00:05:51,199
have been mergers and acquisitions along the
way. So you've really built your skill

75
00:05:51,240 --> 00:06:00,279
set in driving the company as it
matures versus driving your skill set

76
00:06:00,360 --> 00:06:05,920
as you mature. Yeah, you
know, I would say we were

77
00:06:06,000 --> 00:06:12,560
a service provider, and a service
provider is interesting. It's very different from

78
00:06:12,600 --> 00:06:16,279
an enterprise environment, and a lot
of people don't realize the nuances and differences.

79
00:06:16,399 --> 00:06:19,000
It's funny whenever a vendor used to
call us and try to sell us

80
00:06:19,000 --> 00:06:21,279
something, we'd be like, all right, cool, do you have multi-tenant

81
00:06:21,319 --> 00:06:24,680
capabilities? No? I'm like,
okay, do you know what a service

82
00:06:24,720 --> 00:06:29,160
provider is? Like? You know, realize like and then we'd say,

83
00:06:29,160 --> 00:06:30,279
well, listen, if you can make a service provider

84
00:06:30,319 --> 00:06:33,560
happy, you can probably make anybody
happy. So you know, we'll tell

85
00:06:33,560 --> 00:06:38,000
you what you need to do.
But but you're absolutely right. What was

86
00:06:38,040 --> 00:06:42,879
great for us was that our customers
were, you know, sort of in

87
00:06:42,959 --> 00:06:46,959
multiple industries and verticals, trying to
run different applications and solve different problems.

88
00:06:47,079 --> 00:06:51,800
I'd say the market is more sort
of segmented now and matured now. But

89
00:06:53,160 --> 00:06:57,160
what's great about being a service provider
is your customers are coming to you

90
00:06:57,319 --> 00:07:00,920
with the problems that need
to be solved, with the use cases,

91
00:07:00,480 --> 00:07:04,879
as the industry changes and
grows and as there's new shiny objects,

92
00:07:05,399 --> 00:07:08,759
they're coming back to you and pushing
you and saying, hey, we

93
00:07:08,800 --> 00:07:11,199
heard about this really cool thing.
We want to use it. Can we

94
00:07:11,319 --> 00:07:15,000
use it? And it's like,
uh, you know it's it's like yes,

95
00:07:16,279 --> 00:07:18,199
of course, and then we'll
figure it out on the back end.

96
00:07:18,319 --> 00:07:21,600
Or no? It's like, do you want
to lose a customer? And listen, if there's

97
00:07:21,759 --> 00:07:26,879
value in what they're saying, and
you've heard it more than once

98
00:07:26,959 --> 00:07:29,839
in the last four weeks, and
then you listen to it. And if

99
00:07:29,879 --> 00:07:30,920
you think someone else can benefit from
it, and you listen to it.

100
00:07:30,959 --> 00:07:35,480
So it's our customers have been pushing
us always and that's sort of driven innovation,

101
00:07:36,160 --> 00:07:40,079
and so we never had to.
Okay, maybe you're not like going

102
00:07:40,199 --> 00:07:44,120
way outside of your
target zone, but you're constantly pivoting.

103
00:07:44,480 --> 00:07:46,920
You're constantly trying to keep a leg
up on your competitors, because if your

104
00:07:46,920 --> 00:07:50,439
customers are asking you for that,
then competitors are hearing the same thing.

105
00:07:50,920 --> 00:07:55,600
And so I think where we've done
well is we've owned our own you know,

106
00:07:55,879 --> 00:07:59,759
we've never owned our own IP,
but we've owned our own glue and

107
00:08:00,319 --> 00:08:05,160
that's empowered us to be able to mix
and match best of breed and

108
00:08:05,240 --> 00:08:07,920
just innovate and be at the
forefront. So I agree with you.

109
00:08:07,920 --> 00:08:13,040
I think service providers are a great
place to be and even for you know,

110
00:08:13,160 --> 00:08:15,519
listen, I was a founder,
so maybe it's a little different,

111
00:08:15,560 --> 00:08:18,720
but within our environment, over the
years, you know, a new problem

112
00:08:18,759 --> 00:08:22,279
to be solved would come in and
maybe one employee would sort of just jump

113
00:08:22,319 --> 00:08:26,279
on it and just be like,
hey, you know that's cool, I

114
00:08:26,480 --> 00:08:28,720
can do that. And as a
smaller company we'd be like all right,

115
00:08:28,720 --> 00:08:31,480
you know, like, I'd say
the silly example is when, you know,

116
00:08:31,639 --> 00:08:35,519
like, uh, you know,
NoSQL platforms got big and people

117
00:08:35,559 --> 00:08:39,039
wanted us to manage Mongo and
we had one gentleman who was like,

118
00:08:39,120 --> 00:08:41,039
yeah, I'd love to do that. It's like all right, you

119
00:08:41,080 --> 00:08:45,720
know, great, next week.
You know, that guy is the Mongo expert.

120
00:08:45,799 --> 00:08:52,639
Everything, any issue, goes
to him. And so there's just

121
00:08:52,679 --> 00:08:56,240
a lot of opportunity for self growth
there if you can recognize and take it.

122
00:08:58,799 --> 00:09:01,679
Yeah, for sure. And I
think that's the key to longevity in

123
00:09:01,720 --> 00:09:09,440
this space is a desire to continually
grow and learn new skills. Yeah,

124
00:09:09,600 --> 00:09:15,000
but there's also some old skills that
we can't let go. One of those

125
00:09:15,080 --> 00:09:20,720
being disaster recovery and backups. And
you mentioned it before we started recording.

126
00:09:20,840 --> 00:09:24,759
It's one of those that seems to
have been pushed on the back burner over

127
00:09:24,799 --> 00:09:31,799
the last ten years or so.
But doing so, it has some definite,

128
00:09:33,399 --> 00:09:37,240
some definite impacts to your business,
So talk to us a little bit

129
00:09:37,279 --> 00:09:43,480
about resilience and DR in the cloud. Yeah, I'd love to. So,

130
00:09:45,080 --> 00:09:50,120
you know, we have
traditionally provided a disaster recovery as a

131
00:09:50,159 --> 00:09:54,360
service offering for you know, I
don't want to say legacy, you know,

132
00:09:54,440 --> 00:09:58,159
sort of non-cloud
native applications, so things that are

133
00:09:58,159 --> 00:10:03,600
not necessarily running on AWS or Azure, and so over the years we would

134
00:10:03,159 --> 00:10:09,440
look at maybe a deployment running
on VMs, running on VMware or

135
00:10:09,559 --> 00:10:18,000
Hyper-V or KVM. We would basically
provide an entire ecosystem needed so that your

136
00:10:18,000 --> 00:10:24,679
applications would continue to operate despite some
sort of outage at the production site or

137
00:10:24,799 --> 00:10:31,799
cybersecurity event or somebody fat-fingering a database.
And so, you know, what that looks like

138
00:10:31,919 --> 00:10:37,080
is obviously replicating the data, but
more important than that is sort of understanding

139
00:10:37,240 --> 00:10:41,320
what is consumption. What does consumption
look like? How are your users going

140
00:10:41,360 --> 00:10:45,279
to consume the application from the DR
site as they did in production. And

141
00:10:45,279 --> 00:10:48,879
that's a big sort of networking sort
of task or challenge to deal with,

142
00:10:50,440 --> 00:10:54,120
and then also dealing with dependencies,
like what about all these shared services that

143
00:10:54,360 --> 00:11:01,559
they're relying upon, authentication or networking, DHCP, IPAM, stuff like that.

144
00:11:01,519 --> 00:11:07,399
And so we take ownership of authoring
the runbooks not only for fail

145
00:11:07,440 --> 00:11:09,960
over fail back, but what if
it's just one application that you want to

146
00:11:09,960 --> 00:11:13,879
fail over and what do you do
with shared resources? You know, if

147
00:11:13,879 --> 00:11:18,000
you have a legacy database server,
which is a weird thing to say,

148
00:11:18,320 --> 00:11:22,120
and it's running, you know, it
is hosting databases for ten different applications and

149
00:11:22,159 --> 00:11:24,639
you want to fail over one,
do you bring the database server with you?

150
00:11:24,720 --> 00:11:28,559
And so always interesting situations and challenges. And then you know, when

151
00:11:28,600 --> 00:11:33,879
when public clouds started getting popular,
you know, I had a pretty pessimistic

152
00:11:33,919 --> 00:11:37,879
look on disaster recovery in general,
and I think I think the entire sort

153
00:11:37,919 --> 00:11:41,559
of industry was excited about the fact
that, like, we won't have to

154
00:11:41,559 --> 00:11:46,320
deal with that anymore. We have
ability now to just build applications that are

155
00:11:46,320 --> 00:11:50,879
inherently resilient, you know, from
the bottom up, and you know,

156
00:11:52,039 --> 00:11:56,080
we'll deploy them on the cloud.
They'll be self healing, and then we

157
00:11:56,159 --> 00:11:58,399
won't have to deal with this,
you know. And I think you know

158
00:11:58,440 --> 00:12:03,399
what's happened is that people have tried
that, and people have tried to build

159
00:12:03,440 --> 00:12:09,639
these applications that will run, let's say, in multiple AWS regions,

160
00:12:11,159 --> 00:12:18,120
and they realize the complexity involved in
building the applications from the start with that

161
00:12:18,279 --> 00:12:24,279
thought in mind is just it is
just far beyond the bounds of what they

162
00:12:24,320 --> 00:12:28,440
want to deal with. And we
see that even when you invest time and

163
00:12:28,519 --> 00:12:31,320
resources into that, it doesn't necessarily
mean that it's going to work. You

164
00:12:31,360 --> 00:12:35,120
know, every time like AWS East
has an outage or goes down, you

165
00:12:35,120 --> 00:12:41,080
know how many very large popular sites, you know, household name sites go

166
00:12:41,159 --> 00:12:46,159
down that are technology companies that we
know are deploying to multiple regions. So

167
00:12:46,399 --> 00:12:50,679
why are they down? Because it's
almost like it's like this impossible thing to

168
00:12:50,720 --> 00:12:54,399
build, and it's not always their
fault, right, Like the interdependence between

169
00:12:54,919 --> 00:13:01,039
their applications and even third party you
know, SaaS or third-party PaaS, mean that

170
00:13:01,639 --> 00:13:05,200
can they actually test this thing,
can they actually test their resilience plan without, you

171
00:13:05,240 --> 00:13:11,679
know, without actually affecting production.
So what I've seen is sort of like

172
00:13:11,919 --> 00:13:16,639
the industry going towards the middle ground
where where some people don't even realize you

173
00:13:16,679 --> 00:13:20,279
can do this, but you can
basically deploy an application in a

174
00:13:20,320 --> 00:13:26,879
single region not have to sort of
build this whole resilience concept into your application

175
00:13:26,960 --> 00:13:35,039
from day one, and then employ
traditional disaster recovery strategies towards you know,

176
00:13:35,120 --> 00:13:37,159
sort of gaining you know, resilience
of your app. So deploy in

177
00:13:37,159 --> 00:13:43,919
one region and maybe now we can
use replication tools that are more cloud native

178
00:13:43,919 --> 00:13:46,879
focused, and then we could still
take all of those things that we learned

179
00:13:46,879 --> 00:13:52,759
over the years from traditional disaster
recovery, things like dependency mapping, building

180
00:13:52,840 --> 00:13:58,840
run books to deal with different situations, building sort of network strategies so that

181
00:13:58,919 --> 00:14:03,120
I can test at the DR site
without poisoning my production data. You know,

182
00:14:03,159 --> 00:14:07,600
if your production app is connected with
the Salesforce API and you bring

183
00:14:07,679 --> 00:14:11,279
up your app in DR and you
start playing with records like oops, we're

184
00:14:11,279 --> 00:14:16,240
modifying production data. So you know, all of these sort of you know,

185
00:14:16,320 --> 00:14:22,080
sort of core disaster recovery strategies.
Give them a modern data mover that

186
00:14:22,279 --> 00:14:28,799
knows how to replicate or rewrite
resources, let's say using

187
00:14:28,919 --> 00:14:35,600
Terraform or CloudFormation. Give me
some modern data mover and then apply everything

188
00:14:35,600 --> 00:14:39,480
from traditional DR and you can actually
achieve resilience without having to go crazy from

189
00:14:39,480 --> 00:14:46,120
the development. Yeah, one thing
that you mentioned there a couple of times

190
00:14:46,159 --> 00:14:50,559
that I think is really key
is testing that, And it reminds me

191
00:14:50,639 --> 00:14:54,720
every time I think about that.
It reminds me way back early in my

192
00:14:54,799 --> 00:15:01,440
career decades ago, my boss asking
our team, Hey, are you guys

193
00:15:01,440 --> 00:15:03,799
ready for a disaster? And
we're like, oh yeah, we're all

194
00:15:03,840 --> 00:15:09,200
set and he's like, okay,
great, everybody show up on Saturday.

195
00:15:09,279 --> 00:15:13,639
And so we showed up on this
Saturday and we went out to, he'd rented

196
00:15:13,639 --> 00:15:18,399
a conference room in a hotel, had some servers sitting there, and

197
00:15:18,440 --> 00:15:22,600
he had our backup tapes. He's
got great restore everything, you know.

198
00:15:22,639 --> 00:15:26,879
And we didn't even make it five
minutes into the process before we realized,

199
00:15:26,320 --> 00:15:31,759
oh wait, we don't have the
floppy disk to update our bios or we

200
00:15:31,799 --> 00:15:35,720
don't have the boot disk to reinstall
the operating system. And it was a

201
00:15:35,759 --> 00:15:41,279
really, really long and painful day. But the lessons have stuck for a

202
00:15:41,320 --> 00:15:46,600
couple decades now. Yeah, you
know, that's a It's great when people

203
00:15:46,720 --> 00:15:52,120
are sort of overly focused on some
sort of data replication when it comes to

204
00:15:52,159 --> 00:15:56,559
disaster recovery or even where some people
just think that their backup strategy is also

205
00:15:56,720 --> 00:16:00,679
their sort of resilience or disaster recovery
strategy, and I won't get too much into

206
00:16:00,679 --> 00:16:03,320
that, but you know, you
have two separate goals with you know,

207
00:16:03,399 --> 00:16:10,080
sort of two separate strategies to be
employed. So yeah, you really need

208
00:16:10,120 --> 00:16:12,960
to sort of pre author the run
book. And I think today what's interesting

209
00:16:14,000 --> 00:16:17,080
now too that we're seeing is that
if you look at an event like a

210
00:16:17,240 --> 00:16:22,879
like a ransomware attack or a cybersecurity
event, you know it's the incident response

211
00:16:23,000 --> 00:16:27,360
plan and/or
the disaster recovery runbook or something

212
00:16:27,360 --> 00:16:30,720
like that, it's
not something that a single team would be

213
00:16:30,759 --> 00:16:37,639
dealing with. Right, Like a
DevOps team is responsible for sort of the

214
00:16:37,639 --> 00:16:41,559
the uptime and resilience of an
application, and presumably they own sort of

215
00:16:41,600 --> 00:16:47,159
all this orchestration for production to DR,
multi-region, and failover. That's great.

216
00:16:47,200 --> 00:16:49,039
But now if you bring in this
sort of you know, this sort

217
00:16:49,039 --> 00:16:55,639
of security aspect, that this
need to fail over was in relation to

218
00:16:55,720 --> 00:16:59,840
a security event, now you have
a completely new team. Maybe it's an

219
00:17:00,080 --> 00:17:03,759
internal SOC or security team or an external
MSSP. And now you have these two

220
00:17:03,799 --> 00:17:10,359
teams that unfortunately in many organizations don't speak
that much, and now they need to

221
00:17:10,400 --> 00:17:14,880
be in lockstep as part of the incident. And you know, if you think

222
00:17:14,880 --> 00:17:18,680
about a CTO or CIO at a
higher level, you know, they kind

223
00:17:18,680 --> 00:17:22,920
of become the quarterback between these teams
during an incident. And it's not something

224
00:17:22,960 --> 00:17:26,160
that I think they even realized that
they were ever going to have to deal

225
00:17:26,200 --> 00:17:32,519
with. And so the incident response
plans, the disaster recovery runbooks need

226
00:17:32,559 --> 00:17:37,000
to be inclusive of who is who
owns what during a you know, sort

227
00:17:37,000 --> 00:17:41,440
of a security incident. You know, can you even bring up the application?

228
00:17:41,680 --> 00:17:51,440
You know, at the DR site? Do
you want to? So how does

229
00:17:51,440 --> 00:17:56,200
a team that maybe they recognize that
their DR or resilience plan isn't where

230
00:17:56,240 --> 00:18:03,839
it should be, what
are the first steps like because to get

231
00:18:03,839 --> 00:18:07,759
this done you need to devote time
and resources and it has to be prioritized

232
00:18:08,359 --> 00:18:14,039
and sometimes you have to prioritize
it above like day to day operations.

233
00:18:14,519 --> 00:18:17,839
And I think specifically it comes down
to what are you going to say no

234
00:18:18,039 --> 00:18:22,480
to so that you can, so
that you do have the bandwidth to say

235
00:18:22,720 --> 00:18:26,960
yes to this. So what are
some good early steps for people once they

236
00:18:26,000 --> 00:18:32,839
recognize that they're not where they
should be. Yeah, so I think

237
00:18:32,960 --> 00:18:34,359
that's a good question. I think the
first thing that they need to do,

238
00:18:34,440 --> 00:18:40,519
and I think this is I think
that the market has matured a bit here

239
00:18:40,559 --> 00:18:42,640
and this is fairly obvious now,
but you know, the teams need to

240
00:18:42,680 --> 00:18:48,200
kind of sit down and figure out, you know, what they have an

241
00:18:48,240 --> 00:18:52,799
appetite to take ownership and responsibility for
in this realm. And so if you

242
00:18:52,839 --> 00:19:00,400
look at a traditional you know,
DevOps sort of how this goes in general

243
00:19:00,440 --> 00:19:03,559
for like a DevOps conversation is,
you know, are we application

244
00:19:03,640 --> 00:19:08,480
developers? Are we SREs? You
know, who is responsible for ongoing

245
00:19:08,759 --> 00:19:15,720
management? You know sort of metrics
collection, efficiency? And obviously

246
00:19:15,480 --> 00:19:18,839
there's no right or wrong with any
of these things. And a lot of

247
00:19:18,880 --> 00:19:22,759
it has to do with sort of
the DNA of the company and what

248
00:19:22,799 --> 00:19:25,119
they kind of want to be when
they grow up and do they want their

249
00:19:25,200 --> 00:19:29,039
you know, certain IT teams adding
value to the business or managing infrastructure,

250
00:19:29,839 --> 00:19:33,920
and so, you know,
I'll see smaller organizations that are like,

251
00:19:34,160 --> 00:19:37,000
you know, we're a small team. We own everything, so we're going

252
00:19:37,079 --> 00:19:41,559
to just internalize it, and I also
see very large organizations that have, you

253
00:19:41,599 --> 00:19:48,119
know, an abundance of resources,
and they basically make

254
00:19:48,200 --> 00:19:52,240
the decision that we don't want to
be in the business of managing disaster recovery.

255
00:19:52,400 --> 00:19:57,359
We don't want to be responsible for
it. We'd rather outsource it.

256
00:19:57,480 --> 00:20:03,480
And an interesting thing to think about
here is, you know, the

257
00:20:03,559 --> 00:20:08,240
complexity of all of our
applications and our deployments, they're not

258
00:20:08,319 --> 00:20:14,519
getting simpler, They're getting more complex. In fact, I think you can

259
00:20:14,720 --> 00:20:18,920
argue that part of the goal of
of DevOps these days, part of one

260
00:20:18,960 --> 00:20:22,920
of the things they should be striving
for, and maybe even a key metric

261
00:20:22,000 --> 00:20:26,880
to focus on is to what extent
am I making my you know, the

262
00:20:26,920 --> 00:20:30,200
deployment that I'm managing, to what
extent am I making it

263
00:20:30,240 --> 00:20:36,079
simpler and less complex? And obviously
the more complex, the harder it is

264
00:20:36,279 --> 00:20:40,079
to manage, to monitor, to
scale, to secure, and to make

265
00:20:40,200 --> 00:20:44,440
resilient. So I
think people need to acknowledge that. And

266
00:20:44,480 --> 00:20:48,839
when you have that conversation, you
know, one of the answers that comes

267
00:20:48,839 --> 00:20:51,799
out of that conversation could be,
Hey, we want to make it simpler

268
00:20:51,839 --> 00:20:53,000
how do we make it simple?
Well, why don't we outsource

269
00:20:53,400 --> 00:21:00,519
certain layers and certain responsibilities and disaster
recovery and resilience is an easy one to

270
00:21:00,559 --> 00:21:03,880
outsource. It's low hanging fruit,
you know, typically it does not affect

271
00:21:03,920 --> 00:21:08,240
your production too much. If you
can use sort of that middle ground strategy

272
00:21:08,279 --> 00:21:12,039
that I mentioned at the beginning,
you don't have to modify your application,

273
00:21:12,400 --> 00:21:18,200
you know, much at all in
order to be able to achieve resilience.

274
00:21:19,000 --> 00:21:22,240
So that would be my answer. The first thing I'd do is sit

275
00:21:22,279 --> 00:21:25,680
down and figure out, you know, what is your appetite to manage and

276
00:21:25,720 --> 00:21:30,279
own that internally? Yeah, for
sure. Yeah, And I think that's

277
00:21:30,680 --> 00:21:33,240
a huge selling point. If you
have a strategy where you don't have to

278
00:21:34,160 --> 00:21:38,400
modify your existing infrastructure or application a whole
lot, that's always going to be a

279
00:21:38,400 --> 00:21:44,519
big selling point. Let's do
this. Take a step back and help

280
00:21:44,559 --> 00:21:49,359
me understand why moving to the cloud
or using cloud providers like AWS

281
00:21:49,559 --> 00:21:55,319
is not a DR strategy in itself. Yeah. Well, you know,

282
00:21:55,359 --> 00:22:00,200
they give you the right, as everybody
knows, right, it's like going to a

283
00:22:00,200 --> 00:22:02,640
Home Depot, and they're giving you
the right tools and you've got to, you

284
00:22:02,640 --> 00:22:06,799
know, make what you want.
So you have to look at it on

285
00:22:06,839 --> 00:22:11,799
a per sort of platform, you
know, uh, per platform sort of

286
00:22:12,599 --> 00:22:17,759
environment. So if you look at
something like S three, which obviously is

287
00:22:17,799 --> 00:22:23,160
being you know, is being stored
in multiple local zones within a region,

288
00:22:23,319 --> 00:22:26,480
or even has the ability to sort
of have its own inherent built in

289
00:22:26,519 --> 00:22:32,720
sort of cross region replication, you're
probably good there from a you know,

290
00:22:32,720 --> 00:22:37,160
if you wanted to, if you
wanted to build a disaster recovery strategy between

291
00:22:37,240 --> 00:22:41,000
let's say East and West, from
an S3 perspective, it

292
00:22:41,079 --> 00:22:44,640
is fairly straightforward. You can kind
of put a check next to that layer.

293
00:22:44,960 --> 00:22:48,720
As far as your data being available
at the DR site,

294
00:22:49,960 --> 00:22:56,319
you know, recovering from a cyber
attack or sort of a manipulation of the

295
00:22:56,400 --> 00:23:00,319
data, that's another story. But
if, you know, if

296
00:23:00,640 --> 00:23:03,720
the entire region goes down and you
want your application back up and running within

297
00:23:04,119 --> 00:23:07,480
a set RTO, you
know, you can kind of put a

298
00:23:07,559 --> 00:23:15,160
check there. For other sort
of you know, other platforms. It's

299
00:23:15,200 --> 00:23:18,200
not always the case that that
is done. Typically it's not,

300
00:23:18,359 --> 00:23:22,440
you know, and so there are
snapshot capabilities that exist, but then there's

301
00:23:22,480 --> 00:23:26,920
this entire orchestration task that sits on
top of all that. So you have

302
00:23:27,039 --> 00:23:32,200
all of your all of your configurations
and resources, maybe have another site,

303
00:23:32,279 --> 00:23:37,400
but now your applications are not necessarily
written to be able to reference those at

304
00:23:37,519 --> 00:23:41,240
those reference IDs at the at the
DR site. And so now we're

305
00:23:41,359 --> 00:23:48,240
so so it's really a replication orchestration
strategy, right. And so what we'll

306
00:23:48,279 --> 00:23:52,279
do is we'll look at your various
applications and then we'll look at the

307
00:23:52,519 --> 00:23:56,680
AWS, and we're doing this
mostly for AWS today in addition

308
00:23:56,720 --> 00:24:00,000
to the legacy environments which I mentioned
before, but for a public cloud.

309
00:24:00,519 --> 00:24:04,240
We'll look at the various platforms that
your application is using, and we will

310
00:24:04,279 --> 00:24:11,039
employ underlying AWS technologies to ensure that
data is up to date at the DR

311
00:24:11,160 --> 00:24:15,599
site. And so maybe that's maybe
that is cross-region snapshots, or maybe

312
00:24:15,599 --> 00:24:19,480
that is AWS DRS,
which works very well for certain platforms

313
00:24:19,519 --> 00:24:25,839
but can be expensive. So now
we get into the application criticality question of

314
00:24:26,440 --> 00:24:30,160
you know, how critical is each
application to be up and running and sort

315
00:24:30,160 --> 00:24:36,880
of match the right replication technologies to
the cost and to the application criticality.

316
00:24:37,640 --> 00:24:41,240
Beyond that, you know, we're
using orchestration tools and one of them is

317
00:24:41,279 --> 00:24:44,880
called Arpio that we'll use that
will orchestrate some of this back and forth.

318
00:24:45,680 --> 00:24:48,359
And Arpio might be something that's
great for a team that wants to

319
00:24:48,440 --> 00:24:52,640
internalize all this and just say we
got a tool, let's use it.

320
00:24:52,519 --> 00:24:56,920
Where Opti9 comes in is
it's not just about the tool, it's

321
00:24:56,960 --> 00:25:00,359
you know, who is you know, do you want to take ownership

322
00:25:00,359 --> 00:25:03,319
of the failover process and the failback
process? Do you want the ownership of

323
00:25:03,400 --> 00:25:11,319
the testing, building the network integration
strategy, building the automations into let's say,

324
00:25:11,359 --> 00:25:15,200
you know, DNS,
maybe SD-WAN policies, so on

325
00:25:15,240 --> 00:25:19,160
and so forth. So we kind
of sit on top and own the entire

326
00:25:19,240 --> 00:25:23,000
process, you know, soup to
nuts, so that DevOps teams and IT

327
00:25:23,119 --> 00:25:30,559
teams can just wash their hands of
it and focus on building applications. Right on.

328
00:25:30,720 --> 00:25:34,119
Yeah, I'm actually an Arpio customer
and it's a it's a great tool.

329
00:25:34,359 --> 00:25:38,480
It's just it's it's one of the
few tools I've seen that just does

330
00:25:40,119 --> 00:25:44,559
what it says it's going to do
at an exceptional level. But just like

331
00:25:44,640 --> 00:25:47,880
you mentioned, you know, that's
only part of it that handles the infrastructure.

332
00:25:47,880 --> 00:25:53,240
There's still the whole human aspect of
it of verifying what you've replicated and

333
00:25:53,279 --> 00:25:56,559
doing a failover to it and testing
it and making sure it works. And

334
00:25:56,880 --> 00:26:02,319
that's another full time job in itself. It is. And what the funny

335
00:26:02,319 --> 00:26:06,599
thing is for us is, again,
being a company that has been doing and

336
00:26:06,759 --> 00:26:11,759
providing disaster recovery as a service for
you know, uh VMware platforms, physical

337
00:26:11,799 --> 00:26:18,200
servers, IBM iSeries, you
know, Xen, KVM-based applications. The funny

338
00:26:18,200 --> 00:26:22,559
thing is, you know we are
we're not. You can say we're a

339
00:26:22,640 --> 00:26:26,079
technology company, but it's really that
glue that we're owning. But we have

340
00:26:26,720 --> 00:26:32,880
we have brought in best-of-breed data
movers and sort of replication tools to to

341
00:26:33,039 --> 00:26:37,359
you know, to focus on specific
platforms. And so when we brought

342
00:26:37,400 --> 00:26:41,079
in Arpio, it's like, hey, here's the best-of-breed tool for

343
00:26:41,279 --> 00:26:45,680
cloud-native AWS apps. But everything
else that we're doing, all the value

344
00:26:45,720 --> 00:26:49,759
we're providing, and all the wrappers
around around the replication tool like they're all

345
00:26:49,799 --> 00:26:53,640
the same as we were doing five
ten years ago, which is actually pretty

346
00:26:53,640 --> 00:26:56,599
cool. It's like if you can
stay up with the tech, and you

347
00:26:56,640 --> 00:27:02,400
can build a platform that can support
multiple integrations in a modular way like you

348
00:27:02,440 --> 00:27:06,519
can, you can stay relevant through
all of these crazy cloud changes. For sure.

349
00:27:07,160 --> 00:27:11,559
I should. We had Doug from
Arpio on the podcast a few weeks

350
00:27:11,599 --> 00:27:15,079
ago. I should do another episode
with both you and him and just go

351
00:27:15,359 --> 00:27:19,279
into a deep dive on this week, so him and I and I've known

352
00:27:19,319 --> 00:27:22,480
him for a while and I really
am super bullish on their platform. I

353
00:27:22,519 --> 00:27:29,119
think it's amazing. Him and I
are doing a webinar tomorrow actually about all

354
00:27:29,119 --> 00:27:33,519
this in detail. Oh right,
I will get that from you and make

355
00:27:33,559 --> 00:27:37,920
sure that that's in our show notes
when this episode goes live. That will

356
00:27:37,960 --> 00:27:41,480
be a cool talk. When it comes to
DR in the cloud, you mentioned that

357
00:27:41,720 --> 00:27:45,119
providers like AWS have a lot of
the tools built in. You just have

358
00:27:45,240 --> 00:27:49,759
to look at them on a case
by case basis, see what those tools

359
00:27:51,039 --> 00:27:55,119
are, and make sure that they're
enabled and that they're working properly for you.

360
00:27:55,519 --> 00:28:00,720
How often do you see the need
or do you recommend cross provider DR

361
00:28:00,799 --> 00:28:07,039
strategies like backing up our AWS or
replicating our AWS environment in Azure or

362
00:28:07,200 --> 00:28:11,960
GCP, because that brings with it
a whole, like an exponential increase in

363
00:28:12,160 --> 00:28:18,319
overhead as well as costs. Yeah, that's a great question. You know.

364
00:28:18,359 --> 00:28:21,160
I think that you kind of have
to look at at three buckets here

365
00:28:21,200 --> 00:28:23,759
in general. You know, you
have your high available you know, the

366
00:28:25,079 --> 00:28:29,279
ability to achieve high availability, right, which which maybe is sort of you

367
00:28:29,319 --> 00:28:33,759
know, I think in order to
build high availability for your application cross region

368
00:28:34,400 --> 00:28:40,319
or cross cloud, you're really not
going to be able to get away from

369
00:28:40,680 --> 00:28:45,759
sort of building your application with that
intent from day one and having to apply

370
00:28:47,000 --> 00:28:51,119
so much more complexity to your application, to your CI/CD process,

371
00:28:51,200 --> 00:28:56,440
and really the the level of expertise
that you need from your developers. Just

372
00:28:56,559 --> 00:29:00,519
I think it's it's on another level, right. And so if if you're

373
00:29:00,599 --> 00:29:03,599
just starting the process of building an
application now and that is your goal,

374
00:29:04,079 --> 00:29:07,039
you can't you can't go back.
You can't go back later and just be

375
00:29:07,079 --> 00:29:08,839
like, oh, we'll just do
that later. No, it has to

376
00:29:08,880 --> 00:29:12,440
be. It has to be in
the DNA of your application. This is

377
00:29:12,480 --> 00:29:18,960
also an interesting point when you start
to think about integrations with third parties.

378
00:29:18,279 --> 00:29:22,599
You start to think about all of
the third party providers that you're going to

379
00:29:22,799 --> 00:29:27,039
utilize. From an API perspective or
from a data perspective. You know,

380
00:29:27,079 --> 00:29:33,839
if you if you have this mandate
to have resilience and high availability as part

381
00:29:33,839 --> 00:29:37,960
of your application or security, and
you build a framework or a requirement around

382
00:29:38,000 --> 00:29:41,680
that you need to have, you
need to have those conversations with those third

383
00:29:41,680 --> 00:29:45,400
parties, you know, before you
start using them and not after, because

384
00:29:45,880 --> 00:29:49,559
if they're the if they're the weakest
link in the chain from that perspective,

385
00:29:49,559 --> 00:29:52,279
if they don't have great resiliency to
provide you with the options you need,

386
00:29:52,480 --> 00:29:56,200
then then you're stuck. I think
too many companies go and they'll you know,

387
00:29:56,440 --> 00:30:00,599
they have the SaaS sprawl or they
just start using them and then you

388
00:30:00,599 --> 00:30:04,119
know, you might spend I don't
know, years building an HA application that

389
00:30:04,119 --> 00:30:07,400
that works cross-cloud, but
one of your vendors you know, is

390
00:30:07,440 --> 00:30:12,960
not locked up, and boom,
you know you achieve nothing. But so

391
00:30:14,119 --> 00:30:19,440
understand the differences: HA, backups,
and sort of traditional DR, applied here, and

392
00:30:19,720 --> 00:30:22,559
really sort of figure out, I
would say, figure out where do you

393
00:30:22,599 --> 00:30:26,480
want to where do you want your
sort of vendor lock in to be right

394
00:30:26,920 --> 00:30:30,240
if it's if it's data, If
you're okay with with vendor lock in with

395
00:30:30,279 --> 00:30:33,839
one cloud, that's fine. I
don't think there's anything wrong with that,

396
00:30:33,000 --> 00:30:37,359
especially again if you're if you're building
your application with forethought into that, and

397
00:30:37,440 --> 00:30:41,119
maybe you know, we see people, I'm sure you've seen it many times,

398
00:30:41,119 --> 00:30:42,920
people that are like, I'm going
to use AWS multiple regions,

399
00:30:42,920 --> 00:30:47,839
but I'm purposely not going to use
any any platform services. I'm going to

400
00:30:47,880 --> 00:30:52,680
run my own SQL instances and kind
of go backwards in that way. Fine,

401
00:30:53,880 --> 00:30:56,960
you know, if you're using
Arpio, as far as I'm

402
00:30:56,960 --> 00:31:02,440
aware, it isn't today, it
does not have any cross cloud replication capabilities,

403
00:31:02,720 --> 00:31:06,039
but let's say it did. Great
now your vendor lockin is on that

404
00:31:06,559 --> 00:31:10,240
level. So I would say a
lot of this is sort of risk,

405
00:31:10,559 --> 00:31:15,279
you know, risk aversion, risk
mitigation. I think the likelihood of all

406
00:31:15,359 --> 00:31:18,960
of AWS going down and
having a need for sort of cross cloud

407
00:31:19,680 --> 00:31:23,440
is, you know, hopefully very
little to none. But I think a

408
00:31:23,680 --> 00:31:29,759
single region outage as we've seen is
you know, fairly, it's definitely in

409
00:31:29,799 --> 00:31:33,440
realm possibility and happens. But I
do think what you're saying makes sense from

410
00:31:33,480 --> 00:31:38,000
a backup perspective, right, Maybe
we don't need, you know, an

411
00:31:38,200 --> 00:31:44,079
RTO of being able to fail over
from AWS to Azure within four

412
00:31:44,119 --> 00:31:48,119
hours or twenty four hours. But
if we're copying our data, if we're

413
00:31:48,160 --> 00:31:51,119
having a copy of our data there
and we and we understand what the path

414
00:31:51,240 --> 00:31:55,519
to sort of bringing it back up
looks like, I think that you're in

415
00:31:55,599 --> 00:32:00,599
better shape than most are today.
Yeah, I agree with that one hundred

416
00:32:00,640 --> 00:32:04,279
percent. I've had as a consultant. I've had multiple companies come to me

417
00:32:04,400 --> 00:32:08,200
over the years and say, hey, we need to implement DR, so we

418
00:32:08,319 --> 00:32:13,039
want to we can't trust AWS,
or we don't want to trust AWS,

419
00:32:13,119 --> 00:32:16,640
so we want to use multiple cloud
providers. And my approach with them has

420
00:32:16,759 --> 00:32:21,640
always been, you know, I
don't think AWS is and not picking on

421
00:32:21,720 --> 00:32:23,160
AWS here, but I don't think
that's the weak point. And then we

422
00:32:23,279 --> 00:32:28,480
go through and look at their stack, and it always comes down to the

423
00:32:28,559 --> 00:32:32,440
fact that you know that hasn't been
the weak point. You know, they've

424
00:32:32,759 --> 00:32:39,160
chosen to use a managed database provider
and so all of their data is not

425
00:32:39,240 --> 00:32:43,920
even in their AWS environment, or
they have all of these external dependencies like

426
00:32:44,440 --> 00:32:47,680
Salesforce or different things like that,
and it's like, okay, if you

427
00:32:47,839 --> 00:32:54,000
can replicate all your infrastructure over to
another provider, but this third-party dependency

428
00:32:54,200 --> 00:33:00,680
is still a single point of failure
and much more painful if that goes down.

429
00:33:01,359 --> 00:33:05,599
Which makes me think along those lines, since you work with a lot of

430
00:33:05,640 --> 00:33:12,680
companies in this how willing are third
party vendors to talk about what their own

431
00:33:12,799 --> 00:33:20,880
internal DR and high availability strategy
is? They all have they all have

432
00:33:21,319 --> 00:33:24,279
their boiler plate off the cuff answer
that they have to provide, you know,

433
00:33:24,480 --> 00:33:28,960
and it's always going to be pretty
vague, and you're probably going to

434
00:33:29,000 --> 00:33:32,519
have to go back two or three
more times. And sometimes they'll just refer

435
00:33:32,559 --> 00:33:37,400
you to the SLA, and
obviously their SLA credit mechanisms,

436
00:33:37,440 --> 00:33:40,000
like most are going to be just
a joke, right, And so it

437
00:33:40,119 --> 00:33:44,119
is it is a risk. I
mean, I will say on the on

438
00:33:44,200 --> 00:33:47,759
the compute side, I do think
that you know, Kubernetes has has democratized

439
00:33:49,039 --> 00:33:52,599
sort of the the compute layer and
has made it very easy to sort of

440
00:33:52,960 --> 00:33:57,920
deploy you know, your your code
where you want when you want to.

441
00:33:58,319 --> 00:34:00,680
But but you're right, it is
it is the database layer, uh,

442
00:34:00,960 --> 00:34:06,799
and sort of the rest of the
shared services layers, and that's kind of

443
00:34:06,839 --> 00:34:09,159
as it is kind of a hard
pill to swallow because you know, again,

444
00:34:09,239 --> 00:34:13,119
what if if you kind of want
to manage and run and operate your

445
00:34:13,119 --> 00:34:16,320
own databases, that's fine. It'll
be less expensive that way, you'll save

446
00:34:16,440 --> 00:34:20,639
you'll save money, and you'll have
more control and you will be able to

447
00:34:20,880 --> 00:34:24,119
to sort of make good on this
sort of cross cross cloud resilience if you

448
00:34:24,199 --> 00:34:30,000
want to. But now the operational
overhead has has increased, and so you

449
00:34:30,079 --> 00:34:31,880
know, part of what we've done
and sort of what I've sort of been

450
00:34:32,000 --> 00:34:37,480
dabbling in, you know, with
some consulting is just doing that dependency mapping,

451
00:34:37,599 --> 00:34:42,119
application mapping and figuring out what we
what we want to do. And

452
00:34:42,360 --> 00:34:45,559
by the way, just because you're
using PaaS in production, doesn't mean that

453
00:34:45,679 --> 00:34:50,039
you can't have sort of a single
database deployment in DR with some sort of

454
00:34:50,480 --> 00:34:53,639
you know, sort of sort of
you know, snapshot or replication mechanism in

455
00:34:53,719 --> 00:34:57,559
place as a as a backup.
And look, it takes you two days

456
00:34:58,000 --> 00:35:00,719
to get that to get all the
tweaks worked out, you know, post

457
00:35:00,159 --> 00:35:02,960
event, you know, most people
will say that's not the end of the

458
00:35:04,000 --> 00:35:07,079
world, and they will accept that
as a solution because, to be honest,

459
00:35:07,639 --> 00:35:12,280
a lot of folks are looking unfortunately, they're looking to check the box

460
00:35:12,400 --> 00:35:16,760
on a DR strategy, or having
one in place for compliance and having the

461
00:35:16,840 --> 00:35:21,559
DR strategy does not necessarily mean that
you have you have a run book or

462
00:35:21,639 --> 00:35:23,239
you have super low RTOs,
RPOs. It just means that

463
00:35:23,320 --> 00:35:28,599
you have sat down and written what
you would do during an event, even

464
00:35:28,639 --> 00:35:30,400
if it hasn't been fully tested.
And so if that's what you're after,

465
00:35:30,639 --> 00:35:35,159
that's what your goal is. Because
maybe you are not a you know,

466
00:35:35,320 --> 00:35:39,000
a fully technology based platform, as
a as a as a business, as

467
00:35:39,000 --> 00:35:44,960
a revenue generation oftentimes that isn't a
Yeah. I think the having the conversation

468
00:35:45,000 --> 00:35:51,920
about RTO, the recovery time objective,
is really important to have because all my

469
00:35:52,119 --> 00:35:57,599
entire career, you know, I've
never worked for companies like Google. Well,

470
00:35:57,639 --> 00:36:01,840
there's been one exception where I had
one of my employers. We were

471
00:36:01,920 --> 00:36:07,199
doing health care for trauma patients,
so we had to had to move quickly

472
00:36:07,280 --> 00:36:15,239
there. But for most businesses,
having having that RTO conversation is very helpful

473
00:36:15,280 --> 00:36:20,079
because while ideally you would like to
say, oh, yeah we can,

474
00:36:21,039 --> 00:36:24,239
we can fail over in two hours. That's cool, but it comes with

475
00:36:24,920 --> 00:36:30,519
a set of costs and acknowledging the
fact that you know it would be embarrassing

476
00:36:30,679 --> 00:36:32,760
to tell your customers will be up
in two days. Maybe that is the

477
00:36:32,880 --> 00:36:40,360
right strategy based on your your business. Yeah, you got to start somewhere,

478
00:36:40,440 --> 00:36:45,480
right when when when I've worked with
companies to build a disaster recovery strategy

479
00:36:45,719 --> 00:36:50,400
and actually roll it out, you
know, the first thing we'll ask is

480
00:36:50,440 --> 00:36:52,280
what what are the what are the
business goals you're trying to achieve? And

481
00:36:52,440 --> 00:36:55,079
and some of the questions might be, you know, do you do you

482
00:36:55,199 --> 00:37:00,440
need, and are you only looking to protect against
a sort of a full failure at the

483
00:37:00,440 --> 00:37:05,920
production site where all the applications need
to be failed over concurrently, or are

484
00:37:05,920 --> 00:37:10,519
you looking to protect against situations where
you might need to fail over individual applications.

485
00:37:12,880 --> 00:37:15,039
And then there might be other questions
like do you want to fail over

486
00:37:15,159 --> 00:37:17,920
if there's like one server that is
sort of ransomwared And you know, of

487
00:37:19,039 --> 00:37:22,559
course everybody says yes to everything,
yes, we want all that. The

488
00:37:22,719 --> 00:37:28,199
problem is the sort of the more
situations, sort of the increased complexity,

489
00:37:28,280 --> 00:37:32,239
and you know, ironically enough,
the full failover event, where everything needs to come

490
00:37:32,320 --> 00:37:38,519
over at the same time is actually
much easier to build for and to achieve

491
00:37:38,920 --> 00:37:44,639
than all the others because typically you
have the sort of interdependence between applications that are

492
00:37:44,679 --> 00:37:47,599
sitting maybe behind the same firewall and
the same VPC, the same network,

493
00:37:47,679 --> 00:37:52,559
and so if you can keep them
on all the same IP addresses and keep

494
00:37:52,599 --> 00:37:58,360
references intact, then it is much
easier. And so typically we'll employ a

495
00:37:58,440 --> 00:38:00,719
phased approach. We'll, let's, let's be
able to achieve that, and prove that,

496
00:38:00,920 --> 00:38:04,480
show that it works, and then
we'll sort of peel back the rest of

497
00:38:04,519 --> 00:38:07,360
the layers of the onion and strive
for more. Yeah. It reminds me

498
00:38:07,400 --> 00:38:14,239
of an analogy from drag car, drag
car racing: speed costs money.

499
00:38:14,320 --> 00:38:17,239
How fast can you afford to go? Yeah? Yeah, And I really

500
00:38:17,280 --> 00:38:22,159
think it's interesting when you think about
these things and you think about the burdens

501
00:38:22,760 --> 00:38:27,760
if you're looking for complete HA,
multi region or even multi cloud, the

502
00:38:27,840 --> 00:38:31,039
burden, the extra burden that you're
putting on your you know, DevOps or

503
00:38:31,159 --> 00:38:36,079
app dev teams, you know,
and what is what does that translate into

504
00:38:37,639 --> 00:38:40,880
just sort of the business impact?
You know, how much longer are your

505
00:38:40,920 --> 00:38:45,960
development cycles because of that? And
what are you not being what? What

506
00:38:45,360 --> 00:38:50,000
features are you not able to work
on because of all this extra time put into

507
00:38:50,119 --> 00:38:53,800
the forethought of this high availability.
That's why I like. I like the

508
00:38:53,880 --> 00:39:00,559
middle ground approach where let's have our
developers focus on developing an application that runs

509
00:39:00,639 --> 00:39:06,800
on a single, let's say, AWS
region and you know, hands head down,

510
00:39:06,880 --> 00:39:08,920
hands to keyboard, focus on building
applications, which they probably have,

511
00:39:09,239 --> 00:39:13,559
probably have a lot of experience doing
that. You know, this whole multi

512
00:39:13,719 --> 00:39:16,800
region thing is typically fairly new to
someone and they're going to go off on

513
00:39:16,840 --> 00:39:21,320
a tangent. So app devs, you
focus on building, you know, an

514
00:39:21,320 --> 00:39:27,159
application that is resilient within a region. AWS makes it fairly straightforward to do

515
00:39:27,320 --> 00:39:30,440
that. And then maybe a separate
team or SRE team or a company like

516
00:39:30,519 --> 00:39:35,559
Opti9 kind of comes in over the
top and says we are going to employ

517
00:39:35,800 --> 00:39:40,559
a disaster recovery as a service to
that single region deployment and achieve resilience using

518
00:39:40,599 --> 00:39:45,719
tools like Arpio and using proven strategies, and that way the app devs can

519
00:39:45,840 --> 00:39:49,559
just stay highly focused. I think that's
such a win win, and honestly,

520
00:39:49,599 --> 00:39:51,880
I don't even know that there's a
ton of developers out there that can even

521
00:39:51,920 --> 00:39:54,400
achieve HA with a high degree
of success. Yeah. I think one of

522
00:39:54,440 --> 00:40:00,639
the other benefits of that approach is
discovering tribal knowledge, because in a lot

523
00:40:00,719 --> 00:40:06,400
of the scenarios I've been involved with, we do things and we take certain

524
00:40:06,440 --> 00:40:10,559
steps or actions because of this tribal
knowledge that we happen to know. And

525
00:40:12,679 --> 00:40:16,480
in many cases we don't even know
that we're making decisions based on tribal knowledge.

526
00:40:16,840 --> 00:40:22,920
But when you bring in a third
party like Opti9, then

527
00:40:22,960 --> 00:40:25,639
you're you're coming at it from a
fresh perspective without the tribal knowledge, and

528
00:40:27,000 --> 00:40:30,039
it works really well to expose that. It's like, oh, okay,

529
00:40:30,239 --> 00:40:36,239
now we have this piece of information
that has to be documented and formalized.

530
00:40:37,519 --> 00:40:39,880
Absolutely, And like I said at
the beginning, you know, when I

531
00:40:39,960 --> 00:40:44,840
hear tribal knowledge, you know,
I hear complexity. And again,

532
00:40:45,679 --> 00:40:51,159
I think there's this whole idea of
managing complexity, managing complexity sprawl,

533
00:40:51,599 --> 00:40:55,800
you know, fighting to reduce complexity. It's not it is not being pushed

534
00:40:55,880 --> 00:40:59,239
enough, you know, from an
industry perspective. In fact, I think

535
00:40:59,280 --> 00:41:01,840
we have the opposite problem. I
think we have a lot of folks out

536
00:41:01,880 --> 00:41:06,159
there and I'll even you know,
different times in my career, I definitely

537
00:41:06,159 --> 00:41:08,679
have been guilty of this. You
know, we have we have shiny objects

538
00:41:08,679 --> 00:41:13,679
syndrome, and we want to be
able to be exposed to all the latest

539
00:41:13,719 --> 00:41:17,320
and greatest tools. You know,
I think we're all curious people in the

540
00:41:17,480 --> 00:41:21,559
industry and we like playing with new
things. I think I think part of

541
00:41:21,599 --> 00:41:23,480
it also is just maybe a little
bit of fear and ensuring that we have

542
00:41:24,119 --> 00:41:30,599
the latest and greatest acronyms on our
resumes. Sure, but shiny objects syndrome

543
00:41:30,880 --> 00:41:35,840
is you know, is I think
the complete opposite of I want to keep

544
00:41:35,880 --> 00:41:38,039
my environment simple so that it's manageable, so that I can reduce the need

545
00:41:38,119 --> 00:41:43,679
for tribal knowledge. And this kind
of goes into you know, like other

546
00:41:43,760 --> 00:41:45,480
soft skills, right like, you
know, if I want the person at

547
00:41:45,480 --> 00:41:47,800
four a m to be able to
fix what I built. You know,

548
00:41:47,880 --> 00:41:52,519
to what extent am I a good
technical documenter? And to what extent do

549
00:41:52,599 --> 00:41:57,159
I take pride in that as a
standalone skill that I'm good at, you

550
00:41:57,239 --> 00:42:01,039
know, as as a developer or
an SRE or, you know, or DevOps person.

551
00:42:01,639 --> 00:42:06,519
Yeah, agreed, And I just
speaking from personal experience, I'm not

552
00:42:06,639 --> 00:42:12,320
good at documenting. I'll write something
that just seems to be as clear as

553
00:42:12,400 --> 00:42:16,679
it can be, and then usually
me six months later looks at it and

554
00:42:16,800 --> 00:42:21,800
was like, who's the moron that
wrote it? Oh wait, never mind?

555
00:42:23,000 --> 00:42:24,960
Yeah, I think we've all been
there, right. I mean it's

556
00:42:25,239 --> 00:42:29,400
when I manage a lot of technical
teams, and it's always that last ten

557
00:42:29,440 --> 00:42:31,639
percent when you see the documentation.
How are we monitoring it? How do

558
00:42:31,719 --> 00:42:35,719
we know if it goes down?
How are you backing it up? I

559
00:42:35,840 --> 00:42:37,360
mean, we want to build cool
things, right, and then we just

560
00:42:37,400 --> 00:42:40,920
want to pass it off. But
I do think that that us as sort

561
00:42:40,960 --> 00:42:45,519
of you know, DevOps engineers,
we need to start taking pride in in

562
00:42:45,679 --> 00:42:51,280
sort of skills that are outside of
the hands to keyboard technical documentation. Taking

563
00:42:51,400 --> 00:42:53,519
pride and being able to walk away, go on vacation and people knowing what's

564
00:42:53,559 --> 00:42:59,360
going on by reading my documentation without
calling me. You know, I think

565
00:42:59,400 --> 00:43:02,360
also being good troubleshooter and this kind
this kind of kind of goes back into

566
00:43:02,480 --> 00:43:07,599
the complexity and sort of disaster
recovery conversation. But to what extent,

567
00:43:07,159 --> 00:43:12,440
you know, is my troubleshooting skills
set high? And I think unfortunately a

568
00:43:12,519 --> 00:43:15,679
lot of the soft skills don't have
great KPIs or metrics that you can kind of

569
00:43:15,719 --> 00:43:19,039
throw on a resume that can show
how well you do with those things.

570
00:43:19,079 --> 00:43:23,360
But but I love I love honing
the troubleshooting skill and being brought into a

571
00:43:23,440 --> 00:43:28,000
problem that I know nothing about and
you know, figuring it out, you

572
00:43:28,039 --> 00:43:31,199
know, quickly compared to maybe folks
that wrote it or have been dealing with

573
00:43:31,320 --> 00:43:34,239
it. It's you know, that's
fun. It's a great little challenge.

574
00:43:34,639 --> 00:43:39,440
Yeah, that's one thing I've advocated
for for years now is my role as

575
00:43:39,519 --> 00:43:44,280
a DevOps engineer is to work myself
out of a job, you know,

576
00:43:44,400 --> 00:43:47,519
to set everything up so that it
runs and when it doesn't, it's clearly

577
00:43:47,599 --> 00:43:52,960
documented what steps to do,
and someone new can come on board and

578
00:43:54,280 --> 00:44:00,159
get their app to production without having
to rely on me, and do so in a

579
00:44:00,199 --> 00:44:06,639
way that makes sure that they honor
the constraints of the business. And if

580
00:44:06,679 --> 00:44:08,840
I can do that, then there's
no reason for me to be at that

581
00:44:09,000 --> 00:44:14,119
company anymore. And I think that's
my own personal metric for job success.

582
00:44:14,760 --> 00:44:19,039
Yeah. I think, actually,
you know, not not to not to

583
00:44:19,119 --> 00:44:22,239
pull more shiny objects into the conversation, but I think GenAI, I

584
00:44:22,280 --> 00:44:24,159
think has a huge potential to help
in this regard. In fact, I'm

585
00:44:24,239 --> 00:44:30,480
talking to some startups that are already
starting to do this where you will plug

586
00:44:30,559 --> 00:44:36,840
them into all of your internal documentation
and they will basically just give you a

587
00:44:36,960 --> 00:44:40,480
chat bot where you can just ask
questions and so you know, having service

588
00:44:40,559 --> 00:44:45,320
provider experience. This is this is
really interesting because you know, if we're

589
00:44:45,440 --> 00:44:51,559
managing multiple customer deployments. You know, part of what Opti9 does is pro

590
00:44:51,800 --> 00:44:54,719
we're doing managed cloud ops for managing
AWS deployments on behalf of our customers.

591
00:44:57,079 --> 00:45:00,360
But you know, not to say
we don't want every customer

592
00:45:00,360 --> 00:45:02,280
to be their own science project,
but there is always going to be this

593
00:45:02,440 --> 00:45:10,639
balance of standardization and customization. And
so we have very detailed documentation on each

594
00:45:10,760 --> 00:45:15,239
customer's deployment and diagrams and all that. But it's very hard to scale that,

595
00:45:15,360 --> 00:45:19,639
especially for the person at four am
that gets the phone call that something

596
00:45:19,800 --> 00:45:22,719
is down and having to sit through
and read all that documentation and catch up.

597
00:45:22,880 --> 00:45:28,119
You know, it's like it's an
impossible task to do when you need

598
00:45:28,199 --> 00:45:30,719
to spend hours catching up before you
can even begin to troubleshoot. And this,

599
00:45:30,840 --> 00:45:34,480
I think, which is really cool, is just where GenAI can

600
00:45:34,559 --> 00:45:37,320
help where if you have this you
know LLM that's constantly looking at this data,

601
00:45:37,599 --> 00:45:39,800
and you can have a bot where
you say, hey, where's this

602
00:45:39,880 --> 00:45:44,159
customer's stuff deployed? When was the
last time something was deployed? When was

603
00:45:44,199 --> 00:45:46,440
there a change? And you can
just quickly get those answers. To me, as

604
00:45:46,679 --> 00:45:50,920
someone who's managed 24/7 teams, I mean, that's just super exciting

605
00:45:51,320 --> 00:45:53,599
and that really helps us scale,
you know, the NOC and the SOC

606
00:45:53,719 --> 00:45:58,519
organization. Yeah, for sure,
because context switching is huge, and that's

607
00:45:58,559 --> 00:46:06,599
where it seems to really raise its
visibility of how painful and expensive context switching

608
00:46:06,800 --> 00:46:09,000
is. And I think you probably
are very familiar with it from your experience

609
00:46:09,039 --> 00:46:13,960
at Opti9 when you switch from
not only project to project, but customer

610
00:46:14,039 --> 00:46:19,400
to customer and so you are working
on one customer's environment that's built this way,

611
00:46:19,920 --> 00:46:22,000
and then you know, the pager
goes off and you have to switch

612
00:46:22,039 --> 00:46:29,039
to a completely different environment. And
so how do you minimize that amount of

613
00:46:29,119 --> 00:46:32,079
time where you're just sitting there with
a blank stare trying to figure out where

614
00:46:32,159 --> 00:46:38,079
to begin in this environment that could
have an infinite number of combinations? Yeah, and

615
00:46:39,199 --> 00:46:43,440
I'd say, like now based on
you know, you know, saying like

616
00:46:43,480 --> 00:46:47,000
the complexity, it's almost impossible,
to be honest, it really is.

617
00:46:47,159 --> 00:46:52,559
And having you know, the tribal
knowledge and the experience working on a specific

618
00:46:52,800 --> 00:46:59,000
customer's environment, you know, helps
greatly. So what we do is we

619
00:46:59,119 --> 00:47:02,039
obviously try to have as many
standardized tools as we can, standardized monitoring.

620
00:47:02,519 --> 00:47:08,800
I like looking at different monitoring strategies
where we have we build monitoring again

621
00:47:08,880 --> 00:47:15,039
into the CI/CD workflows as far
as what we're going to monitor. But

622
00:47:15,400 --> 00:47:20,320
what I'd like to do is to
really have sort of macro level alerts go

623
00:47:20,480 --> 00:47:23,119
off at the same time as sort
of micro level alerts go off. So

624
00:47:23,199 --> 00:47:28,239
if my application is down, if
we're monitoring a specific query and we want

625
00:47:28,320 --> 00:47:31,039
to see that it's returning you know, greater than twenty five results from the

626
00:47:31,079 --> 00:47:35,519
customer perspective, if that goes down, I would like to see you know,

627
00:47:35,639 --> 00:47:38,559
four or five different monitors you know, they're monitoring specific layers of the

628
00:47:38,639 --> 00:47:43,039
back end or specific API endpoints
also going down at the same time.

629
00:47:43,559 --> 00:47:47,039
So the poor technician at four am, we're kind of spoon feeding them,

630
00:47:47,119 --> 00:47:51,719
Hey something, you know, there's
a serious problem. But at the same

631
00:47:51,760 --> 00:47:53,559
time, hey, we also noticed
these four things that are out of whack,

632
00:47:53,679 --> 00:47:57,840
and so instead of having to
start from scratch, they can kind of

633
00:47:57,880 --> 00:48:02,320
work backwards from the lowest-hanging fruit.
Yeah, just giving them a series of

634
00:48:02,360 --> 00:48:06,599
breadcrumbs to follow. Yeah,
and again, I mean I think that's

635
00:48:06,679 --> 00:48:08,960
that's a strategy. I mean, to
me, is that a technical... Is

636
00:48:09,039 --> 00:48:13,280
that a is that a sort of
technical skill or is that sort of a

637
00:48:14,719 --> 00:48:19,000
quasi non technical strategy that you need
to employ, you know, with this

638
00:48:19,360 --> 00:48:24,559
resilience or SRE, HA? You know, for
sure, sort of DevOps, right there in the

639
00:48:24,599 --> 00:48:29,559
middle of DevOps, I think.
Yeah. One of the things I like

640
00:48:29,639 --> 00:48:32,400
to do is in all of my
alerts, I like to include like,

641
00:48:32,800 --> 00:48:37,920
hey, here's the alert obviously,
here's why it went off, and then

642
00:48:37,159 --> 00:48:43,639
here is a link to the application
dashboard and the runbook for that,

643
00:48:44,239 --> 00:48:47,679
just you know, to leave those
breadcrumbs and help minimize that context switching time.

644
00:48:49,960 --> 00:48:53,480
Absolutely. Absolutely. Documentation. You've
mentioned that multiple times, and it's

645
00:48:53,719 --> 00:48:59,320
a pet peeve of mine because I
don't like Confluence, I don't like Notion,

646
00:49:00,199 --> 00:49:02,840
I don't like ReadMe. Pretty much, I don't like any of the

647
00:49:02,920 --> 00:49:08,800
documentation tools. But you mentioned standardizing on
tools. Do you have a preferred documentation

648
00:49:08,960 --> 00:49:15,400
tool? I don't. I've used, I've used sort of all of the

649
00:49:15,480 --> 00:49:17,960
above. You know, I would
I would say the answer, I don't

650
00:49:17,960 --> 00:49:21,320
think that there's one tool that's better
than the other. Right, And this

651
00:49:21,519 --> 00:49:23,039
is this is a cliche, but
right, it's more about the use.

652
00:49:23,679 --> 00:49:28,159
It's like talking about it's like talking
about the best diet. Right, it's

653
00:49:28,199 --> 00:49:31,039
the best. It's the one that
you can do consistently over time. Yeah,

654
00:49:31,360 --> 00:49:35,000
I say one of the and I
think so as long as you as

655
00:49:35,039 --> 00:49:38,079
long as it's simple and you can
build it into your workflow fairly

656
00:49:38,159 --> 00:49:43,199
easily, that that is the best
tool. I'll tell you one one win

657
00:49:43,519 --> 00:49:47,599
related to that that I that I
experienced years ago. It happened to be

658
00:49:49,000 --> 00:49:52,480
with Confluence, But the same example
I know is the same sort of capability,

659
00:49:52,519 --> 00:49:57,679
and it's available in almost all
documentation tools now. Years ago we used

660
00:49:57,679 --> 00:50:01,559
to use. We used to use
Visio to create, like, you know, diagrams,

661
00:50:01,559 --> 00:50:06,760
and then we'd upload them into the
documentation tool. And that whole process

662
00:50:06,800 --> 00:50:12,079
of sort of you know, bringing
the bringing the work or the output from

663
00:50:12,119 --> 00:50:14,960
one tool into the other. That
process, like people don't want to do

664
00:50:15,079 --> 00:50:17,320
that. They'll end up just sort
of keeping the diagram, let's say in

665
00:50:17,360 --> 00:50:22,159
their own let's say they're using you
know, Lucidchart or Gliffy or something

666
00:50:22,199 --> 00:50:24,079
like that, they'll end up just
keeping it in that account. So a

667
00:50:24,159 --> 00:50:30,119
big win for me was was when
Confluence started adding in these plugins where you

668
00:50:30,239 --> 00:50:37,519
can actually create the diagram without having
to go out of the documentation system and

669
00:50:37,719 --> 00:50:42,920
have the diagram embedded right into the
documents right there, instead of having to

670
00:50:42,960 --> 00:50:45,760
build it in a separate tool and
import and copy and all that, and so,

671
00:50:45,920 --> 00:50:49,480
you know, I think that was
great because now I mean it,

672
00:50:49,840 --> 00:50:52,960
I'm authoring a document, I want
to show a visual representation. I'm a

673
00:50:53,199 --> 00:50:57,400
I'm a big visual person, and
I can just create the diagram right there

674
00:50:57,440 --> 00:51:00,440
without having to leave the page.
It's saved, and now the actual the

675
00:51:00,519 --> 00:51:05,800
actual IP of that diagram is embedded
into the document. It can never be

676
00:51:05,920 --> 00:51:07,760
pulled apart. Nobody can ever tell
me, oh, yeah, I never

677
00:51:07,880 --> 00:51:12,440
upload I never uploaded the latest version
of the diagram into the document. So

678
00:51:12,519 --> 00:51:16,639
there's that whole concept of, you
know, working the updating of documentation

679
00:51:16,679 --> 00:51:20,800
and diagrams into your workflow. I
think it's a really good example of how

680
00:51:21,400 --> 00:51:24,960
you can do that. Obviously with
them, I think with Jira and GitHub,

681
00:51:25,000 --> 00:51:29,320
you can do that.
But I don't think that that capability exists

682
00:51:29,480 --> 00:51:36,599
enough for more of an infrastructure operations,
SRE perspective. Right. Hmm,

683
00:51:38,400 --> 00:51:45,239
when it comes to like making sure
things are up to date, whether

684
00:51:45,320 --> 00:51:52,440
that's documentation or runbooks or your
failover strategy, what's the minimum frequency you

685
00:51:52,440 --> 00:51:59,599
would recommend someone reviewing that? Well, I'd say twice a year is probably

686
00:52:00,360 --> 00:52:02,960
the minimum. But then you also
you need to add you need to add

687
00:52:04,000 --> 00:52:08,440
hooks into your change control. Right, anytime you maybe deploy a new service,

688
00:52:08,840 --> 00:52:14,000
you know that should be a hook
to whoever is responsible for resilience,

689
00:52:14,079 --> 00:52:17,239
maybe an outside vendor like but it's
not if it's an outside vendor like Opti9.

690
00:52:17,760 --> 00:52:21,320
You know. So if I'm sitting
in the customer seat, I'm going

691
00:52:21,360 --> 00:52:23,280
to add as much as many hooks
as possible, and I'm going to say

692
00:52:23,320 --> 00:52:25,440
to my vendor like, hey,
we just change this, we just change

693
00:52:25,480 --> 00:52:30,039
that, make sure our, make sure
our resilience still works. Now, on the

694
00:52:30,079 --> 00:52:32,480
flip side, if I'm in Opti9's
seat, I might say,

695
00:52:32,559 --> 00:52:36,920
yeah, no problem, We've updated
it, which which we'll do in earnest

696
00:52:36,960 --> 00:52:38,719
and we have to but hey,
you know, we did what we had

697
00:52:38,800 --> 00:52:43,159
to do, but we got to
retest now, right, So you got

698
00:52:43,239 --> 00:52:45,119
you do have to find that
balance. But it doesn't mean you can't,

699
00:52:45,320 --> 00:52:49,320
you know, update these things in
an ongoing basis and then kind of

700
00:52:49,360 --> 00:52:52,360
have a list of what you want
to ensure functions during the next test.

701
00:52:52,840 --> 00:52:57,360
I will say, though, one
of the important things with testing is you

702
00:52:57,480 --> 00:53:00,960
don't want to just have the IT
teams doing the application testing. You

703
00:53:00,079 --> 00:53:06,320
really need to have users testing.
Maybe it's QA, maybe you have internal

704
00:53:06,360 --> 00:53:08,679
staff that are using the system.
You need people that can smell out a

705
00:53:08,760 --> 00:53:13,519
problem with the application, can smell
out the fact that it's maybe a little

706
00:53:13,519 --> 00:53:15,800
bit more sluggish, or that certain
functionality doesn't work as good. And this

707
00:53:15,960 --> 00:53:19,679
is a big miss. A lot
of our teams try to internalize it because

708
00:53:19,679 --> 00:53:22,280
they want to just move past it. For sure. Yeah, just as

709
00:53:22,320 --> 00:53:29,639
an it background, my my overall
objective is to avoid as many conversations with

710
00:53:29,760 --> 00:53:31,760
other humans as possible. But this
is one of those areas where you just

711
00:53:31,920 --> 00:53:35,840
kind of can't do that. And
I'm guilty of doing it too, of

712
00:53:36,239 --> 00:53:40,440
performing a failover looking, Yeah,
all the health checks pass, no alarms,

713
00:53:40,760 --> 00:53:45,559
that must be good and then moving
on. Yeah, and I like

714
00:53:45,639 --> 00:53:51,159
the idea of almost making product managers
responsible for some of this. You know,

715
00:53:51,280 --> 00:53:54,800
if if resilience and high availability is
you know, is a feature,

716
00:53:54,880 --> 00:54:00,440
a component you know of sort of
the outward product, then I do think

717
00:54:00,519 --> 00:54:07,920
that they can be the liaison between
the developers, third parties or whoever is

718
00:54:08,000 --> 00:54:12,960
whoever's owning the resilience. You do
need a quarterback there. And if and

719
00:54:13,039 --> 00:54:16,239
if there is a product product management
function, I think this is a great

720
00:54:16,280 --> 00:54:22,039
aspect for them to ensure continuity
long term. Yeah, agreed. Like, a

721
00:54:22,199 --> 00:54:27,800
seasoned product manager is just worth their
weight in gold because they understand all of

722
00:54:27,880 --> 00:54:32,280
these different layers of complexity and interactions
between the teams and and just by job

723
00:54:32,360 --> 00:54:37,440
definition, they're really good at at
orchestrating and pulling in the right resources at

724
00:54:37,480 --> 00:54:40,760
the time that they're needed. Yeah, and with third parties, right,

725
00:54:40,840 --> 00:54:45,519
if they get, let's say, Salesforce, you
know, what they're going to want to

726
00:54:45,559 --> 00:54:51,840
do is you know, potentially pull
in the you know, whatever positive capabilities

727
00:54:51,880 --> 00:54:54,159
are being pulled through, maybe
they're pulling it into a product feature set,

728
00:54:54,239 --> 00:54:59,000
they also need to better understand that, you know, what it means for

729
00:54:59,239 --> 00:55:02,199
the outward messaging on the resilience or
if they can still make good on that

730
00:55:02,320 --> 00:55:07,840
promise. All right, well we
are coming up on an hour here,

731
00:55:07,960 --> 00:55:13,000
is there anything else that you feel
like we should be covering when it comes

732
00:55:13,039 --> 00:55:19,639
to resilience, DR, and managing complexity? So I'd say, like the most

733
00:55:19,679 --> 00:55:23,440
important thing, and this might be
a little cliche these days, but you

734
00:55:23,480 --> 00:55:29,320
know it's just make no assumptions,
Make no assumptions on any of the platforms

735
00:55:29,360 --> 00:55:36,960
that you're using in regards to what
built in resilience or redundancy exists. And

736
00:55:37,119 --> 00:55:43,679
also keep in mind that high availability
and resilience does not always equate to your ability

737
00:55:43,760 --> 00:55:46,559
to recover from specific types of events. You know, if you're hit with

738
00:55:46,719 --> 00:55:53,039
a cyber attack and your data is
corrupted in production systems, you know,

739
00:55:53,199 --> 00:55:59,039
having a replica or having high availability
even with multiple regions does not mean you

740
00:55:59,119 --> 00:56:01,280
can recover from that. There are
other sort of strategies that you need to

741
00:56:01,360 --> 00:56:05,960
employ. Obviously, you know how
far back is your is your you know,

742
00:56:06,159 --> 00:56:08,159
is your snapshot history, your journal, and you know you'll need to

743
00:56:08,199 --> 00:56:14,119
have separate runbooks for that type
of situation than sort of the high availability

744
00:56:14,199 --> 00:56:17,559
type of situations. So just understand
there's sort of you know, those are

745
00:56:17,599 --> 00:56:24,559
completely separate and again make no assumptions. Yeah, it's almost like this would

746
00:56:24,599 --> 00:56:30,400
make a really good board game.
Mm, that would make a good board game.

747
00:56:30,800 --> 00:56:37,559
Yeah, we should do like a
Jump to Conclusions. Uh, yeah,

748
00:56:38,480 --> 00:56:43,760
well done on the Office Space reference. Yeah, nice to see that.

749
00:56:45,159 --> 00:56:50,599
Well played. Cool. So if
folks, if our listeners want to talk

750
00:56:50,719 --> 00:56:54,519
more about this or reach out to
you directly with additional questions, what's the

751
00:56:54,599 --> 00:56:59,159
best way for them to do it? Uh, find me on LinkedIn.

752
00:56:59,360 --> 00:57:04,320
That's probably the platform that I'm most
active on, you know, or or

753
00:57:05,320 --> 00:57:08,960
write to me on there or at
opti9tech.com, and you'll find me.

754
00:57:10,280 --> 00:57:14,039
My name is kind of unique,
so I have no doubts that that

755
00:57:14,159 --> 00:57:15,960
anyone who's listening to the show will
be able to

756
00:57:15,960 --> 00:57:21,079
find me. So your
name is unique? Is that short for

757
00:57:21,199 --> 00:57:25,559
something? Sagi is a is a
Hebrew name. Like other Hebrew names,

758
00:57:27,800 --> 00:57:30,559
they can kind of get butchered,
you know, in these parts. But

759
00:57:30,239 --> 00:57:36,039
there are much worse Hebrew names that
I, you know, so I don't

760
00:57:36,079 --> 00:57:38,599
have it that bad, but it
is. I mean, it is nice

761
00:57:38,639 --> 00:57:43,400
because when I get cold calls,
I immediately know that this person never spoke

762
00:57:43,440 --> 00:57:49,840
to me before. The built-in
screening feature. Yeah, yeah, that's

763
00:57:49,880 --> 00:57:53,639
good. Awesome. Well, thank
you so much for joining me today.

764
00:57:53,719 --> 00:57:59,360
This has been a cool conversation and
I think it's one that we we need

765
00:57:59,400 --> 00:58:02,719
to spend more time talking about
because it often gets overlooked or assumed,

766
00:58:02,880 --> 00:58:08,119
Like you said, make no assumptions. Absolutely cool. Well, and thank

767
00:58:08,159 --> 00:58:12,000
you, thank you. I mean, this has been fun. You're a

768
00:58:12,039 --> 00:58:14,920
great presenter, and it's nice to
talk to someone who's kind of also lived

769
00:58:14,920 --> 00:58:17,840
it and been through it as well. Right, us old guys have to group

770
00:58:17,880 --> 00:58:23,079
together and tell war stories once in
a while. That's right. All

771
00:58:23,159 --> 00:58:27,199
right, thanks again, Sagi,
thank you for listening, and we will

772
00:58:27,239 --> 00:58:28,239
see y'all next week.
