1
00:00:02,279 --> 00:00:05,919
Speaker 1: Welcome back everyone to another episode of Adventures in DevOps.

2
00:00:06,000 --> 00:00:09,240
And I am really excited today because we're going to

3
00:00:09,320 --> 00:00:12,199
jump into one of the areas that I find personally

4
00:00:12,199 --> 00:00:15,800
really interesting, but also our guest has worked at a

5
00:00:15,880 --> 00:00:19,120
number of companies in areas that I feel like lots

6
00:00:19,160 --> 00:00:21,120
of companies get wrong. So I just want to welcome

7
00:00:21,120 --> 00:00:24,480
to the show Sylvain from Rootly, who is the

8
00:00:24,519 --> 00:00:26,800
Head of Developer Relations. Hey, how are you doing?

9
00:00:27,760 --> 00:00:29,640
Speaker 2: Thank you for having me. I'm all good.

10
00:00:30,839 --> 00:00:33,840
Speaker 1: Good to hear. So, I said Head of Developer Relations,

11
00:00:33,880 --> 00:00:35,560
And I got to be honest, I feel like a

12
00:00:35,560 --> 00:00:38,759
lot of companies have started to overutilize this term

13
00:00:38,799 --> 00:00:43,200
to mean a wide variety of different roles and responsibilities.

14
00:00:43,560 --> 00:00:45,200
Can you give me like a breakdown like what it

15
00:00:45,240 --> 00:00:46,479
means for you today?

16
00:00:47,359 --> 00:00:51,759
Speaker 2: Yeah, absolutely. Actually, it's a great question because I have

17
00:00:52,560 --> 00:00:56,479
just posted about it on LinkedIn two days ago when I

18
00:00:56,640 --> 00:01:02,520
was hired, and historically, the role of developer relations is

19
00:01:02,600 --> 00:01:11,319
to empower developers to use a product, a developer tool, by

20
00:01:11,359 --> 00:01:16,560
providing educational resources, by answering any questions they may have,

21
00:01:17,439 --> 00:01:24,120
and then just overall marketing, you know, the product in

22
00:01:24,159 --> 00:01:28,079
a way that fits engineers, which is not product marketing, right,

23
00:01:28,120 --> 00:01:31,359
it's all about what you get out of it. Don't

24
00:01:31,439 --> 00:01:36,640
tell, show me. So it's tutorials, talks, you know,

25
00:01:36,640 --> 00:01:39,560
in the form of articles, YouTube videos, and so on and

26
00:01:39,560 --> 00:01:44,000
so forth. But as I joined Rootly, it was clear that

27
00:01:45,200 --> 00:01:50,200
AI, and more specifically here LLMs, are the future of

28
00:01:50,719 --> 00:01:55,040
incident management. For those who don't know, Rootly is

29
00:01:55,480 --> 00:01:59,719
an on-call and incident management platform. So when something

30
00:01:59,760 --> 00:02:03,040
breaks and you have people on call who need to

31
00:02:03,079 --> 00:02:06,920
respond to the incident, that's where they go to manage

32
00:02:06,959 --> 00:02:11,560
the incident up to it being solved. So AI plays

33
00:02:11,560 --> 00:02:13,280
a big role in that, and we'll speak about it

34
00:02:13,360 --> 00:02:17,360
during this episode and most of my time actually at

35
00:02:17,400 --> 00:02:20,520
Rootly, most of my energy, I would say like maybe

36
00:02:20,759 --> 00:02:25,639
seventy-five percent, has been dedicated to agent, AI agent

37
00:02:25,719 --> 00:02:34,319
relations, because we can see an AI agent as just another

38
00:02:34,479 --> 00:02:38,599
member of the team, and this agent also needs to

39
00:02:38,639 --> 00:02:42,000
be taught and onboarded, just in a way that's different

40
00:02:42,039 --> 00:02:45,759
from humans. So I would say, while I'm the head

41
00:02:45,759 --> 00:02:49,479
of developer relations, I can also say I'm the head of

42
00:02:50,080 --> 00:02:51,400
AI agent relations.

43
00:02:52,560 --> 00:02:54,240
Speaker 1: Well, that's definitely a wide area.

44
00:02:54,280 --> 00:02:54,439
Speaker 2: You know.

45
00:02:54,599 --> 00:02:57,840
Speaker 1: I'm also interested because I found, you mentioned this, that

46
00:02:57,879 --> 00:03:00,719
you're not doing product marketing, you're still marketing in some

47
00:03:00,800 --> 00:03:06,120
way to engineers. Uh, they're notoriously the biggest challenge to

48
00:03:06,319 --> 00:03:08,840
get engineers on board with whatever you're trying to sell.

49
00:03:08,879 --> 00:03:11,800
I mean I found of all groups of people, uh

50
00:03:12,479 --> 00:03:15,599
even ones in the technology space, I feel like engineers

51
00:03:15,599 --> 00:03:19,120
always want to do things themselves, right?

52
00:03:19,000 --> 00:03:23,240
Speaker 2: It is. And with Rootly we are targeting SREs, site

53
00:03:23,240 --> 00:03:28,319
reliability engineers, who are even more skeptical and hard

54
00:03:28,360 --> 00:03:30,919
to convince, and for good reason, right? Their job

55
00:03:30,960 --> 00:03:36,120
is to ensure that the infrastructure is rolling smoothly and

56
00:03:36,159 --> 00:03:39,039
in an optimized fashion. And so you want to be careful

57
00:03:39,439 --> 00:03:42,680
with the tool and the framework, a new framework or

58
00:03:42,800 --> 00:03:47,479
new tool or methodology that might bring chaos

59
00:03:47,560 --> 00:03:52,599
or instability. And uh, you know, so I used to

60
00:03:52,639 --> 00:03:58,159
be an SRE myself. Back then SRE was not truly

61
00:03:58,199 --> 00:04:00,680
a thing yet. I was working for SlideShare as

62
00:04:00,719 --> 00:04:04,840
a DevOps engineer. We were in the top fifteen most

63
00:04:04,919 --> 00:04:10,400
visited websites in the world back then, displaying about one

64
00:04:10,439 --> 00:04:14,039
point five billion slides a day, which, you know, was

65
00:04:14,080 --> 00:04:17,839
definitely a large volume. And we got acquired by LinkedIn, where

66
00:04:17,879 --> 00:04:20,439
I worked as a senior SRE, this time for three

67
00:04:20,519 --> 00:04:23,120
years, and you can imagine the volume. So I've been

68
00:04:23,279 --> 00:04:26,040
on the side, on the engineering side, and so I

69
00:04:26,079 --> 00:04:29,920
completely, you know, get the persona. And I think at

70
00:04:29,959 --> 00:04:32,000
the end of the day, it's just that these people

71
00:04:32,040 --> 00:04:35,240
they don't want to waste time with marketing copy. They

72
00:04:35,240 --> 00:04:39,240
want to understand what's in it for them and their job. Right,

73
00:04:40,079 --> 00:04:43,279
So for me, it comes, you know, very naturally. I

74
00:04:43,279 --> 00:04:46,000
don't think it's a challenge, but I think for someone

75
00:04:46,040 --> 00:04:49,360
who does not come from an engineering background and doesn't

76
00:04:49,399 --> 00:04:53,759
have as good of an understanding as you know, an

77
00:04:53,759 --> 00:04:56,120
engineer may have, it may be hard to communicate to

78
00:04:56,160 --> 00:04:59,560
this audience where you know, I think this thing might

79
00:04:59,560 --> 00:05:01,560
come from.

80
00:05:01,639 --> 00:05:05,560
Speaker 1: Yeah, no, I totally got it. I know I'm gonna

81
00:05:05,639 --> 00:05:07,800
I really want to dive into the core topic, which

82
00:05:07,839 --> 00:05:10,639
is self-healing systems. And we were talking a little

83
00:05:10,639 --> 00:05:12,680
bit before the episode started about like this has been

84
00:05:12,720 --> 00:05:16,800
your like a lifelong project area. How did you get

85
00:05:16,839 --> 00:05:18,800
into this? I would say it's the first thing, like

86
00:05:18,879 --> 00:05:21,560
was this, Like you always knew this is this was

87
00:05:21,600 --> 00:05:23,959
the thing you're going to go into, and like how

88
00:05:23,959 --> 00:05:25,519
long you've been doing this and what does that really

89
00:05:25,560 --> 00:05:27,600
mean to work in self-healing systems?

90
00:05:28,360 --> 00:05:31,399
Speaker 2: Yeah, so it goes back to actually I started the

91
00:05:31,439 --> 00:05:34,920
project when I was working for SlideShare as

92
00:05:34,959 --> 00:05:40,120
a DevOps engineer slash site reliability engineer, and

93
00:05:40,319 --> 00:05:42,759
I was on call and I had to you know,

94
00:05:42,839 --> 00:05:47,199
manage outages. Back then we were using Puppet as a

95
00:05:47,240 --> 00:05:51,800
way to ensure that infrastructure was as it should be. And

96
00:05:51,839 --> 00:05:55,879
you know, ultimately there were a lot of repeat incidents

97
00:05:56,040 --> 00:06:00,920
or a lot of outages or, you know, issues that

98
00:06:01,959 --> 00:06:05,720
were coming from the same type of problems. I think.

99
00:06:06,560 --> 00:06:09,639
You know, as engineers grow in their careers,

100
00:06:09,639 --> 00:06:13,439
they kind of know what are the main failure types

101
00:06:14,639 --> 00:06:18,360
and so, you know, like, I think

102
00:06:18,360 --> 00:06:21,000
it gets repetitive, and I think a great engineer wants

103
00:06:21,079 --> 00:06:25,040
to automate itself and not do the same thing over

104
00:06:25,079 --> 00:06:30,000
and over again. And so yeah, that's where the idea

105
00:06:29,279 --> 00:06:35,879
of building a self-healing system came about. So maybe just.

106
00:06:35,839 --> 00:06:37,839
Speaker 1: To start off, like you said, you would see the same

107
00:06:37,879 --> 00:06:40,399
sort of regressions over and over again. Was there like

108
00:06:40,920 --> 00:06:43,360
one in your mind that was just like the most

109
00:06:43,360 --> 00:06:45,839
common where it was like every single time it happened. It

110
00:06:45,879 --> 00:06:47,680
was like the driver for you to make a real

111
00:06:47,800 --> 00:06:49,040
change in an organization.

112
00:06:50,160 --> 00:06:52,480
Speaker 2: Yeah, you know, I think. I mean, I know it's

113
00:06:52,519 --> 00:06:55,959
been like more than a decade, so you know, I

114
00:06:56,000 --> 00:06:59,199
won't have like a super sharp example, but I think the

115
00:06:59,519 --> 00:07:03,600
classic ones are, you know, issues with lack of resources,

116
00:07:03,600 --> 00:07:08,160
whether it's you know, storage or CPU or memory, and

117
00:07:08,759 --> 00:07:13,319
you need either to increase, decrease the load somehow, or

118
00:07:13,879 --> 00:07:18,360
distribute or scale. Could be like a service that's misbehaving

119
00:07:18,439 --> 00:07:22,800
and you need to restart. Could be a lot of

120
00:07:22,839 --> 00:07:27,079
things. So I think the industry took

121
00:07:27,160 --> 00:07:30,480
a different route. I think now with Kubernetes, what we do

122
00:07:30,519 --> 00:07:35,079
is that if this is misbehaving, we just shut

123
00:07:35,079 --> 00:07:37,160
it down, right, we get rid of it and we

124
00:07:37,480 --> 00:07:41,839
start a new one. And obviously Kubernetes is great at scaling,

125
00:07:43,000 --> 00:07:48,040
so I think this too, like, the self-healing system,

126
00:07:48,160 --> 00:07:50,680
it still works in this way, right? Like by

127
00:07:50,759 --> 00:07:54,480
nuking things or scaling things, you can heal a system.

128
00:07:55,199 --> 00:07:57,680
I think in my mind I wanted to take a

129
00:07:57,680 --> 00:08:03,040
different approach where I was trying, I was envisioning a

130
00:08:03,079 --> 00:08:08,480
system that would actually address the root cause, I think,

131
00:08:08,519 --> 00:08:10,639
in some of the cases, you know, instead of just

132
00:08:10,720 --> 00:08:14,120
nuking this thing, like, really try to mimic what

133
00:08:15,360 --> 00:08:19,240
a human engineer would do. So, Yeah, that's that's kind

134
00:08:19,240 --> 00:08:21,759
of the philosophy that I had back then.

135
00:08:23,360 --> 00:08:26,279
Speaker 1: I mean, there's definitely a huge population of engineers who

136
00:08:26,360 --> 00:08:29,360
think that what they would do in those examples

137
00:08:29,399 --> 00:08:32,360
would be for sure to restart the machine or the

138
00:08:32,399 --> 00:08:34,639
container or the node if it started to run out

139
00:08:34,679 --> 00:08:38,600
of memory or processing power. And I feel like that's

140
00:08:38,639 --> 00:08:40,200
sort of the crux of one of the issues that

141
00:08:40,240 --> 00:08:42,840
I've seen over and over again is that we do

142
00:08:42,919 --> 00:08:47,039
build those systems, and I say we as collective humanity

143
00:08:47,159 --> 00:08:52,399
and not at my current company, that automatically restart or

144
00:08:52,600 --> 00:08:55,320
you know, allocate more memory or processing power. And I

145
00:08:55,320 --> 00:08:58,840
feel like the automatic scale-out or scale-up

146
00:09:00,080 --> 00:09:02,799
for resources can make sense if it doesn't create a

147
00:09:02,840 --> 00:09:06,279
negative impact on the feedback loop that you have to

148
00:09:06,320 --> 00:09:07,639
solve the problem. And I feel like this is one

149
00:09:07,679 --> 00:09:10,120
of the problems with automatic restarts is that it doesn't

150
00:09:10,159 --> 00:09:13,559
really solve the problem. It's still persistent there, it's

151
00:09:13,559 --> 00:09:17,360
going to keep happening, and also you're delaying actually doing

152
00:09:17,399 --> 00:09:20,240
the investigation, and you're also eliminating some of the evidence

153
00:09:20,559 --> 00:09:23,200
that would allow you to identify the problem there. So

154
00:09:23,240 --> 00:09:25,720
it's really great to hear that you know, you thought

155
00:09:25,799 --> 00:09:28,480
that the appropriate process was, you know, go out like

156
00:09:28,519 --> 00:09:31,279
why is there extra memory usage? Why is you know,

157
00:09:31,279 --> 00:09:33,879
the machine getting stuck, et cetera, et cetera. And I

158
00:09:33,879 --> 00:09:36,840
feel like that's sort of the thing that sets apart

159
00:09:36,919 --> 00:09:40,480
the best SREs from the ones that are just coming

160
00:09:40,480 --> 00:09:43,320
in to quote unquote do the job.

161
00:09:43,519 --> 00:09:45,879
Speaker 2: It is. I think then you need to strike the right balance

162
00:09:45,919 --> 00:09:50,559
between achieving the end result, which is capability, stability, and

163
00:09:50,679 --> 00:09:53,279
if restarting is the way to go, and you know,

164
00:09:53,320 --> 00:09:56,879
you don't need to spend engineering resources. And obviously, I

165
00:09:56,919 --> 00:10:01,840
mean it's working, so, you know, I don't think it's

166
00:10:01,879 --> 00:10:04,080
it's a valid issue, but yeah, as you said, you

167
00:10:04,120 --> 00:10:08,200
need to strike the right balance between just

168
00:10:09,639 --> 00:10:13,639
doing this over and over and, if it's a repeat, like,

169
00:10:13,759 --> 00:10:19,559
outage or issue, you know, it might need investigation. And so

170
00:10:21,000 --> 00:10:25,879
the idea that I had back in two thousand... was,

171
00:10:26,039 --> 00:10:31,120
I think I started in twelve, thirteen, was to really

172
00:10:31,919 --> 00:10:37,879
ingest as much data as you can from a distributed system,

173
00:10:38,120 --> 00:10:45,320
you know, whether it's any logs, any metrics, application logs, traces,

174
00:10:46,159 --> 00:10:48,600
and ingest all of this in the database. Back then

175
00:10:48,639 --> 00:10:52,879
we were using Fluentd, which is an open source message,

176
00:10:52,879 --> 00:10:55,559
but it still exists and actually it's very popular. Back

177
00:10:55,559 --> 00:10:58,120
then we were, with SlideShare, one of the

178
00:10:58,240 --> 00:11:02,519
first main users, big users there. Actually, shout-out to

179
00:11:02,519 --> 00:11:04,879
the team if they are listening. And now they went

180
00:11:05,000 --> 00:11:07,840
very far with this technology. And store all of this

181
00:11:07,960 --> 00:11:13,240
in an unstructured database. Back then I bet on

182
00:11:13,360 --> 00:11:15,679
MongoDB. It doesn't really matter the technology, but that's

183
00:11:15,799 --> 00:11:18,679
what we used for the prototype. And then based on

184
00:11:18,759 --> 00:11:24,120
that come up with like a state of a system

185
00:11:24,320 --> 00:11:29,919
and then, like, try to resolve this

186
00:11:30,080 --> 00:11:34,799
issue by throwing at the system a bunch of actions

187
00:11:34,799 --> 00:11:38,279
that would be safe. You know, I'm speaking like any

188
00:11:38,320 --> 00:11:41,360
action, like an rm action or something like, you know,

189
00:11:41,440 --> 00:11:45,200
drop database, like, you need to be careful. But

190
00:11:45,200 --> 00:11:50,519
a set of safe actions, and then use machine

191
00:11:50,600 --> 00:11:55,720
learning as an engine to learn basically what, depending on

192
00:11:55,799 --> 00:11:58,960
the state of the system, what could solve the issue.

193
00:12:00,679 --> 00:12:05,320
And so we designed this for a web distributed infrastructure.

194
00:12:05,440 --> 00:12:10,200
I actually continued this work at LinkedIn, and they asked me

195
00:12:10,279 --> 00:12:16,480
to write a patent, which eventually was accepted. But yeah,

196
00:12:16,480 --> 00:12:19,759
I never got a chance to build it, unfortunately, because

197
00:12:19,759 --> 00:12:23,039
then I left to become an entrepreneur. But that's why,

198
00:12:23,360 --> 00:12:26,559
you know, I was telling you I have been swimming in

199
00:12:26,639 --> 00:12:28,360
this topic for a little while.

200
00:12:28,799 --> 00:12:31,000
Speaker 1: Yeah, I mean I still want to go further into that.

201
00:12:31,120 --> 00:12:34,759
So just to summarize a little bit, the strategy is,

202
00:12:35,320 --> 00:12:38,519
we're collecting tons of logs, maybe metrics, et cetera. We

203
00:12:38,639 --> 00:12:41,919
maybe have access to the source code, and we train

204
00:12:42,080 --> 00:12:45,240
on that data to identify based off what sort of

205
00:12:45,320 --> 00:12:49,039
errors we're actually seeing, how to pinpoint potentially is it

206
00:12:49,080 --> 00:12:51,799
a part of the source code or the infrastructure which

207
00:12:51,799 --> 00:12:55,559
could be problematic, and then utilizing that on errors that

208
00:12:55,600 --> 00:12:58,679
actually do come out of the system to help dive

209
00:12:58,720 --> 00:13:01,159
in to identify the cause or does it go further

210
00:13:01,200 --> 00:13:01,440
than that?

211
00:13:02,000 --> 00:13:05,039
Speaker 2: Yes, I think that's a good point if we speak

212
00:13:05,039 --> 00:13:09,000
about... I think back then I

213
00:13:09,120 --> 00:13:15,519
directly dove into resolution. I was a young engineer, you know,

214
00:13:15,559 --> 00:13:19,200
I was like twenty five, twenty six, maybe even younger

215
00:13:19,240 --> 00:13:23,399
than this, so I was not really mature. But I

216
00:13:23,759 --> 00:13:29,519
think I think starting with the root cause analysis is

217
00:13:29,559 --> 00:13:33,679
the right approach. Obviously, you know, now with hindsight it

218
00:13:33,759 --> 00:13:37,360
makes sense, but but my goal was really resolution, which

219
00:13:37,440 --> 00:13:40,919
ultimately, you know, is where you want to go. So yeah,

220
00:13:40,960 --> 00:13:44,960
I was really, really focusing on direct resolution, and

221
00:13:45,159 --> 00:13:49,320
it would be a mix of kind of runbooks

222
00:13:49,840 --> 00:13:52,799
that, you know, we could feed, and then, but

223
00:13:52,960 --> 00:14:02,039
then more interestingly a set of safe actions, safe commands, that

224
00:14:02,080 --> 00:14:06,080
the system could run and could see if this solved

225
00:14:06,080 --> 00:14:10,320
the issue, and then kind of do like a learn

226
00:14:10,440 --> 00:14:13,039
from it, you know, maybe try something. It doesn't work,

227
00:14:13,080 --> 00:14:16,559
it's fine, you know, we just ditch this kind of

228
00:14:17,039 --> 00:14:20,200
path, instruction set, as an option, but sometimes it

229
00:14:20,240 --> 00:14:23,720
will succeed and it will use this for the next incident.

230
00:14:24,120 --> 00:14:27,200
And all of this would be based on machine learning obviously,

231
00:14:27,240 --> 00:14:30,399
like you know, like the more success you have with

232
00:14:30,440 --> 00:14:32,840
an instruction, the more you are likely to use it

233
00:14:32,919 --> 00:14:35,879
next time. Back then, machine learning was not nearly as

234
00:14:35,960 --> 00:14:41,519
advanced as it is today, you know, so it

235
00:14:41,600 --> 00:14:44,039
was hard like to achieve this goal. And I think

236
00:14:45,080 --> 00:14:48,799
the industry, I don't think anyone built this type of

237
00:14:48,879 --> 00:14:55,159
system until, like, now. You had one player that did

238
00:14:55,200 --> 00:14:59,639
something similar, which is Facebook. They built a system that's

239
00:14:59,679 --> 00:15:05,679
called FBAR, F-B-A-R, that they

240
00:15:05,759 --> 00:15:10,519
define as self-healing. It was to manage Facebook data center racks,

241
00:15:10,759 --> 00:15:14,840
so it was not systems but racks, where it would

242
00:15:15,519 --> 00:15:20,960
automatically perform actions to solve some production issues. But

243
00:15:21,039 --> 00:15:25,720
it was deterministic, so it was not... well,

244
00:15:25,720 --> 00:15:30,240
no machine learning was used in it. And then Dropbox

245
00:15:30,320 --> 00:15:38,679
in twenty sixteen presented Naoru at SREcon, and this was

246
00:15:38,799 --> 00:15:44,200
a self-healing system, but for distributed systems, this

247
00:15:44,279 --> 00:15:48,039
time for web infrastructure, but same here, it was deterministic.

248
00:15:48,120 --> 00:15:50,679
So, you know, I think these systems have been

249
00:15:50,720 --> 00:15:53,919
around for like more than a decade, and have been

250
00:15:53,960 --> 00:15:58,120
producing value. I know they've been in production, and

251
00:15:58,480 --> 00:16:03,120
I think SREs are generally not happy with having mechanisms

252
00:16:03,759 --> 00:16:06,480
or systems working on their behalf and kind of in

253
00:16:06,519 --> 00:16:08,360
the shadow. But the truth is that they've been around

254
00:16:08,399 --> 00:16:11,399
for a while. I think. Now the main difference, the

255
00:16:11,440 --> 00:16:14,320
big difference, which is a huge difference, is that we

256
00:16:14,480 --> 00:16:20,279
are including this machine learning, LLM part, which is non-deterministic.

257
00:16:20,399 --> 00:16:21,600
And that's a big deal.

258
00:16:22,200 --> 00:16:24,159
Speaker 1: So let's just dive into that for a second. When

259
00:16:24,159 --> 00:16:27,360
we say deterministic in the history of self healing systems,

260
00:16:27,360 --> 00:16:31,600
we're talking about like auto scaling groups or identifying specifically

261
00:16:31,639 --> 00:16:34,919
based off of rules that some engineer wrote, what they're

262
00:16:34,960 --> 00:16:39,360
seeing and then how to handle the situation very concretely.

263
00:16:39,480 --> 00:16:40,080
Is that accurate?

264
00:16:40,480 --> 00:16:43,559
Speaker 2: It's accurate. And also, one that was called

265
00:16:43,639 --> 00:16:46,720
earths and it's exactly how you describe it, like a

266
00:16:46,759 --> 00:16:49,759
runbook that's written by a human, and then this

267
00:16:49,919 --> 00:16:53,000
runbook is triggered based on a specific signal.

268
00:16:53,080 --> 00:16:54,919
But it's absolutely deterministic.

269
00:16:55,320 --> 00:16:58,679
Speaker 1: Yeah, And I think the interesting thing with the deterministic

270
00:16:58,759 --> 00:17:01,559
systems is that they really require you to do the root

271
00:17:01,559 --> 00:17:04,119
cause analysis so that you could write like if no

272
00:17:04,240 --> 00:17:06,160
run book applied or the run or a run book

273
00:17:06,160 --> 00:17:09,720
applied either by a human or through automation, didn't have

274
00:17:10,519 --> 00:17:13,920
an effect that actually resolved whatever incident you had, you

275
00:17:13,960 --> 00:17:15,839
had to actually still do the root cause analysis. And

276
00:17:15,920 --> 00:17:18,440
now I feel like we're getting into you know, I

277
00:17:18,440 --> 00:17:20,680
think everyone's waiting for us to talk about this how

278
00:17:20,680 --> 00:17:25,480
to apply Uh, I hate to say AI to this concept.

279
00:17:25,599 --> 00:17:28,640
So now that we're now that it's twenty twenty five,

280
00:17:29,839 --> 00:17:33,880
what does it mean to deploy an LLM to be able

281
00:17:33,920 --> 00:17:36,359
to self heal a system? What does that actually look

282
00:17:36,400 --> 00:17:37,039
like in practice?

283
00:17:38,279 --> 00:17:41,440
Speaker 2: Yeah? So, you know, I think you hit the nail

284
00:17:41,440 --> 00:17:44,079
on the head with speaking about the root cause analysis.

285
00:17:44,079 --> 00:17:47,920
That's obviously the first step that this system needs to

286
00:17:47,960 --> 00:17:53,119
do is to understand what's up. I think in the

287
00:17:53,160 --> 00:17:55,839
past it was a mix for these self-healing

288
00:17:55,880 --> 00:18:02,440
systems. I think part of these runbooks were

289
00:18:03,160 --> 00:18:09,519
automatic, based on a very specific, maybe, log or, you know,

290
00:18:09,559 --> 00:18:13,519
something like very trivial, and then it would be automatically applied.

291
00:18:13,599 --> 00:18:16,400
In other cases, a human would need to do the

292
00:18:16,480 --> 00:18:19,599
root cause analysis and then you would push this runbook,

293
00:18:19,720 --> 00:18:22,000
which would still save a ton of time, right because

294
00:18:23,000 --> 00:18:25,680
this system would orchestrate all the actions that need to

295
00:18:25,680 --> 00:18:29,720
be done. And when we are speaking about very complicated

296
00:18:29,920 --> 00:18:33,960
infrastructure, like at Google, Facebook or LinkedIn, you know, it

297
00:18:33,960 --> 00:18:37,240
can be a lot of work. But here the idea

298
00:18:37,240 --> 00:18:41,640
is that we can throw whatever broken system to this

299
00:18:41,839 --> 00:18:47,400
LLM and it should understand what's up or at least

300
00:18:48,000 --> 00:18:54,160
come up with a hypothesis. And I think the hypothesis is

301
00:18:54,960 --> 00:18:58,759
very trivial, very important. Sorry, I don't think we need

302
00:18:58,880 --> 00:19:02,240
We should not consider these LLMs to be God or

303
00:19:02,279 --> 00:19:06,319
to be the silver bullet. We should consider them just

304
00:19:06,359 --> 00:19:10,480
as another human which can make mistakes. We make a

305
00:19:10,519 --> 00:19:13,799
lot of mistakes, and so I think one key element

306
00:19:13,960 --> 00:19:18,039
is to understand that this hypothesis also has kind of

307
00:19:18,200 --> 00:19:22,480
a degree of certainty, and so I think a great

308
00:19:22,559 --> 00:19:28,880
AI SRE will provide, as part of the diagnosis, what

309
00:19:29,079 --> 00:19:33,240
the degree of certainty that it has about the diagnosis,

310
00:19:33,400 --> 00:19:36,960
so a human can say, hey, if it's fifty percent, maybe

311
00:19:37,000 --> 00:19:39,000
I should not pay too much attention to it.

312
00:19:39,000 --> 00:19:41,759
If it's ninety five percent, okay, maybe I should really

313
00:19:41,839 --> 00:19:42,440
look into it.

314
00:19:43,680 --> 00:19:48,880
Speaker 1: So what sorts of source data are you utilizing in

315
00:19:49,000 --> 00:19:51,680
order to feed into the LLM? I'm going to ask

316
00:19:51,720 --> 00:19:55,559
you questions about that afterwards, but specifically right now,

317
00:19:55,640 --> 00:19:58,240
like is it you know a list of like source

318
00:19:58,319 --> 00:20:01,000
code and some other things? Like, what does the source

319
00:20:01,000 --> 00:20:01,640
set look like?

320
00:20:02,079 --> 00:20:07,240
Speaker 2: Mm. So I think, as with many tools in

321
00:20:07,279 --> 00:20:13,519
the LLM space today, context is king, and I'm

322
00:20:13,519 --> 00:20:16,440
going to speak about what we are building at Rootly,

323
00:20:17,480 --> 00:20:21,279
which again is an incident management platform. And why I

324
00:20:21,319 --> 00:20:24,680
repeat this because it really matters in the sense that

325
00:20:26,519 --> 00:20:30,000
for engineering teams, all the signals that are associated with

326
00:20:30,039 --> 00:20:34,079
an incident will flow through their incident management platform. It

327
00:20:34,480 --> 00:20:38,200
just makes sense, right, And so these platforms such as

328
00:20:38,319 --> 00:20:43,559
Rootly have pretty much all the context that is available

329
00:20:43,640 --> 00:20:49,720
generally to solve this incident. So it can be monitoring,

330
00:20:50,720 --> 00:20:55,079
logging, traces, it can be more than this. It can

331
00:20:55,160 --> 00:20:59,680
be Slack conversations, right? It can be a Zoom call

332
00:21:00,079 --> 00:21:01,960
where, you know, you can do a transcript of a

333
00:21:02,039 --> 00:21:03,640
Zoom call, so you can feed this in the

334
00:21:03,720 --> 00:21:07,200
LLM. But I think there are also very important

335
00:21:07,279 --> 00:21:12,079
data such as the history of postmortems, incident resolution

336
00:21:12,319 --> 00:21:17,720
reports, right, where everything is documented from what happened, how

337
00:21:17,759 --> 00:21:21,960
it happened, how did we solve the incident, how this

338
00:21:22,079 --> 00:21:25,400
incident, like, how it was solved, who solved it.

339
00:21:26,440 --> 00:21:31,079
And all this data is like super important for

340
00:21:31,240 --> 00:21:35,599
the AI agent to find a root cause.

341
00:21:37,519 --> 00:21:39,759
And the last one that I forgot, which you mentioned,

342
00:21:39,960 --> 00:21:44,559
is obviously anything that's linked to changes, which you know

343
00:21:44,599 --> 00:21:47,400
takes the form of code. So the list of commits

344
00:21:48,119 --> 00:21:51,920
is often you know where you'll find the issue.

345
00:21:53,480 --> 00:21:55,319
Speaker 1: So you've actually got a system that ingests all

346
00:21:55,319 --> 00:21:58,960
this information and spits out, you know, here's

347
00:21:59,160 --> 00:22:01,440
some action items to take. And then I imagine some companies

348
00:22:01,440 --> 00:22:05,279
are actually automating based off of that to remediate the

349
00:22:05,319 --> 00:22:07,599
problems in production. Or is this like you need a

350
00:22:07,640 --> 00:22:10,880
human to review this before you do anything else.

351
00:22:13,079 --> 00:22:18,880
Speaker 2: Yeah, I think this field is extremely, like,

352
00:22:19,039 --> 00:22:23,440
the auto-healing mechanism around LLMs is extremely young.

353
00:22:24,440 --> 00:22:27,279
You know, the oldest company in the field is maybe two

354
00:22:27,359 --> 00:22:30,720
years old, perhaps even less. There is a lot of

355
00:22:30,720 --> 00:22:36,559
competition in the space. I've seen at least twenty-five products,

356
00:22:37,319 --> 00:22:40,480
and I've spoken to a lot of engineers who are

357
00:22:40,480 --> 00:22:46,160
building this internally at large companies, so you know, I

358
00:22:46,200 --> 00:22:50,000
think everybody's doing it differently. The maturity of the product is

359
00:22:50,279 --> 00:22:54,039
also very different. But I think for SREs,

360
00:22:54,079 --> 00:23:00,799
obviously, reliability is ultimately the goal, and not experimentation,

361
00:23:01,000 --> 00:23:05,960
right, it's a secondary goal, definitely. Right, starting with just investigation

362
00:23:06,920 --> 00:23:10,680
is the right way to go. And then I think,

363
00:23:10,759 --> 00:23:16,000
as the space matures, and perhaps the models, what's

364
00:23:16,039 --> 00:23:19,119
great with this technology is you can teach it, right? It

365
00:23:19,200 --> 00:23:23,680
becomes better over time, so you can train models on

366
00:23:24,480 --> 00:23:28,160
your data, you can tune it right. So as the

367
00:23:28,200 --> 00:23:31,599
technology matures and the model is learning, and perhaps we

368
00:23:31,640 --> 00:23:38,839
are learning as humans, we can go more towards allowing

369
00:23:38,960 --> 00:23:40,720
this tool to do the resolution. But I think the

370
00:23:40,759 --> 00:23:44,240
first step is just, for now, root cause analysis.

371
00:23:45,079 --> 00:23:48,279
Speaker 1: Yeah, I mean, at least from my personal standpoint, I'm

372
00:23:48,359 --> 00:23:53,200
scared to hand over the tools to make changes to

373
00:23:53,240 --> 00:23:59,599
production infrastructure automatically without involving some sort of review process.

374
00:23:59,799 --> 00:24:01,799
And I mean, I guess, I guess it's fine to

375
00:24:01,839 --> 00:24:05,200
have like another LLM review the first LLM's work in

376
00:24:05,240 --> 00:24:07,960
some way, but I don't know if the direction matters.

377
00:24:08,000 --> 00:24:10,759
Like I think you need someone to review the context

378
00:24:10,759 --> 00:24:13,359
of what's happening, just like you probably want multiple engineers

379
00:24:13,920 --> 00:24:17,519
on call to actually validate any sort of code changes

380
00:24:17,559 --> 00:24:19,880
that would have to go into production, because I mean,

381
00:24:19,920 --> 00:24:22,240
otherwise you're in a situation where there's a critical event.

382
00:24:22,599 --> 00:24:25,720
It's 3 a.m., something's going wrong with the database. You log

383
00:24:25,759 --> 00:24:30,039
in and accidentally drop the production DB. I mean, I'll

384
00:24:30,039 --> 00:24:32,119
pause there because that this has actually happened to more

385
00:24:32,119 --> 00:24:35,400
than one company, But I think there's one in our history.

386
00:24:36,000 --> 00:24:38,839
I think it's almost ten years old now, a major

387
00:24:39,440 --> 00:24:44,359
source code Git server company had a production incident, very

388
00:24:44,359 --> 00:24:46,440
famous, with their, I think it was Postgres at the time.

389
00:24:47,799 --> 00:24:51,640
So engineers are definitely not infallible when it comes to remediation.

390
00:24:52,480 --> 00:24:54,759
But I guess my question is going to be do

391
00:24:54,799 --> 00:24:58,039
you find with all the data that you're collecting that

392
00:24:58,119 --> 00:25:02,519
the set of incidents all point back to some like

393
00:25:02,559 --> 00:25:06,000
as far as you're concerned, repeatable or already seen problems

394
00:25:06,039 --> 00:25:10,680
like oh yeah, this sort of software development issue or

395
00:25:10,680 --> 00:25:15,559
some syntax problem like null reference exception or dynamic module

396
00:25:15,640 --> 00:25:18,440
loading or you know, memory exhaustion or something like that,

397
00:25:19,160 --> 00:25:22,359
or are there like minor differences as time goes on,

398
00:25:22,440 --> 00:25:24,559
like oh well it used to be this set, but

399
00:25:24,960 --> 00:25:27,119
the next thing is sort of something that we haven't

400
00:25:27,119 --> 00:25:30,440
discovered yet, and so you're still discovering sort of new

401
00:25:30,440 --> 00:25:31,160
failure modes.

402
00:25:32,759 --> 00:25:36,079
Speaker 2: Yes, just picking up, like, very briefly on what you said before.

403
00:25:36,079 --> 00:25:40,000
I totally agree with you that LLMs should be

404
00:25:40,559 --> 00:25:45,119
considered as another human. So code review, doing canary deploys,

405
00:25:46,240 --> 00:25:50,319
you know, passing the change through the CI/CD, basically making

406
00:25:50,400 --> 00:25:53,519
sure that the change is safe is just a must do, right.

407
00:25:53,599 --> 00:25:56,079
I don't think we should treat what the AI

408
00:25:56,240 --> 00:25:59,720
says as the resolution path as any different from what a

409
00:25:59,799 --> 00:26:01,200
human would say.

410
00:26:01,960 --> 00:26:04,079
Speaker 1: I mean it goes further than that though, right, Because

411
00:26:04,519 --> 00:26:08,279
if we were able to confidently take the output from

412
00:26:08,440 --> 00:26:11,119
LLMs and feed it back in, LLMs should be able

413
00:26:11,160 --> 00:26:16,440
to develop increasingly large solutions of any size. And we

414
00:26:16,519 --> 00:26:20,720
see that no company has a software an automated software

415
00:26:20,720 --> 00:26:24,720
development or agent engineer that can just continually push out code.

416
00:26:24,839 --> 00:26:28,400
Even ones operating on very small scopes have utterly failed

417
00:26:28,559 --> 00:26:31,480
in their release and their push out of their products,

418
00:26:31,920 --> 00:26:34,039
let alone larger companies that have been trying to build

419
00:26:34,079 --> 00:26:40,119
stuff up. And the recent craze on vibe coding. Yeah,

420
00:26:40,160 --> 00:26:42,400
I mean, and for anyone who's not aware, it's this

421
00:26:42,440 --> 00:26:44,359
idea where you just you don't even look at the code.

422
00:26:44,400 --> 00:26:46,359
You just have the LLM produce all the output, and

423
00:26:46,359 --> 00:26:48,880
whenever there's a problem, you just say, hey, here's the issue,

424
00:26:48,960 --> 00:26:51,519
try to fix it. The problem is that the context

425
00:26:51,519 --> 00:26:54,400
window will have to keep growing indefinitely. Every new feature

426
00:26:54,440 --> 00:26:56,559
you add will continue to grow. And so as long

427
00:26:56,599 --> 00:26:59,200
as we have these two failure modes, A, the LLM's

428
00:26:59,240 --> 00:27:02,880
finite context window, and B, companies who have made it

429
00:27:02,880 --> 00:27:05,680
their sole goal to make money off of automated software

430
00:27:05,680 --> 00:27:09,240
development aren't making money off of that, you know, aren't

431
00:27:09,279 --> 00:27:12,119
like wildly successful. The likelihood of you being able to

432
00:27:12,160 --> 00:27:15,400
do it, let alone being able to trust them, fundamentally tells

433
00:27:15,480 --> 00:27:17,400
us that, you know, we're not at that point yet.

434
00:27:18,880 --> 00:27:24,359
Speaker 2: For sure. I think this vibe coding is valuable

435
00:27:24,440 --> 00:27:29,920
in many situations. If you want to prototype, maybe if

436
00:27:29,920 --> 00:27:32,880
you are a very young startup, you know, I think it

437
00:27:32,920 --> 00:27:35,319
makes a lot of sense. But when you get to

438
00:27:36,200 --> 00:27:39,319
the stage where you hire an SRE, or you need

439
00:27:39,319 --> 00:27:42,680
stability in your product, or you are pushing a product

440
00:27:43,200 --> 00:27:46,759
that is crucial for your customer system. You know, I

441
00:27:46,759 --> 00:27:52,079
don't think this type of engineering practice, if we can

442
00:27:52,119 --> 00:27:57,279
call it this way, makes sense, but I think this

443
00:27:57,359 --> 00:28:00,400
technology can bring a lot of value. You mentioned do

444
00:28:00,440 --> 00:28:05,000
we find patterns in the type of incidents that we

445
00:28:05,119 --> 00:28:07,920
see through the system? And that's a really great question.

446
00:28:08,039 --> 00:28:12,680
So one of the initiatives I'm leading at Rootly is the

447
00:28:13,160 --> 00:28:18,599
Rootly AI Labs. It's a community-driven initiative where we

448
00:28:19,359 --> 00:28:23,960
hire software engineers. We have the head of platform engineering

449
00:28:24,000 --> 00:28:27,240
at Venmo and the former head of AI at Video

450
00:28:27,440 --> 00:28:33,079
and other very smart student PhDs from Stanford and whatnot,

451
00:28:34,119 --> 00:28:40,920
and we pay them to create open source prototypes leveraging

452
00:28:41,319 --> 00:28:45,599
the latest AI innovation to see how this can be

453
00:28:45,680 --> 00:28:49,519
applied to the world of reliability and system operations.

454
00:28:49,640 --> 00:28:53,720
And one of the projects that we're working on is

455
00:28:53,759 --> 00:28:57,319
exactly what you mentioned, is to create a graph of

456
00:28:58,720 --> 00:29:03,640
the incidents and see if we can find patterns. So

457
00:29:03,640 --> 00:29:08,039
it could be an area of your infrastructure, or a

458
00:29:08,079 --> 00:29:11,880
part of your application, or perhaps a type of failure.

459
00:29:12,160 --> 00:29:14,680
You know, let's let's say we speak about we spoke

460
00:29:14,720 --> 00:29:18,839
about resources. Are resources often something, you know, that's

461
00:29:18,880 --> 00:29:21,920
that's failing our system, and maybe because our scaling rules

462
00:29:21,960 --> 00:29:26,920
are not aggressive enough? And LLMs are helping

463
00:29:27,000 --> 00:29:32,319
us to to create this graph because they are you know,

464
00:29:32,359 --> 00:29:36,200
they are great at ingesting unstructured data and making sense of it.

465
00:29:37,240 --> 00:29:41,000
And so then we can create this graph that can

466
00:29:42,039 --> 00:29:45,920
empower an SRE team to understand where instability

467
00:29:46,279 --> 00:29:46,680
comes from.

468
00:29:47,759 --> 00:29:50,640
Speaker 1: I mean, that's something I'd be super interested to find out,

469
00:29:50,640 --> 00:29:54,400
like where statistically are the most problems coming from, and

470
00:29:54,440 --> 00:29:57,880
how that maps are, or like what the confounding variables

471
00:29:57,920 --> 00:30:00,440
are between maybe the culture of the company or software

472
00:30:00,519 --> 00:30:04,160
languages that they're utilizing, or the frameworks or the industries. Right,

473
00:30:04,200 --> 00:30:07,039
you know, maybe these industries have these common incidents. Like

474
00:30:07,079 --> 00:30:09,680
I think that'd be super interesting to see.

475
00:30:10,440 --> 00:30:13,720
Speaker 2: Well, yeah, so we we're building it. You can check

476
00:30:13,759 --> 00:30:19,279
it out. We have a GitHub space if

477
00:30:19,319 --> 00:30:22,440
you look for Rootly AI Labs. Everything is open source

478
00:30:23,559 --> 00:30:28,000
and we're always welcoming people to to join, just giving

479
00:30:28,039 --> 00:30:32,319
ideas or or contributing. Again, we're paying people to do that,

480
00:30:32,559 --> 00:30:35,559
and so it's kind of a side job. But yeah,

481
00:30:35,559 --> 00:30:38,920
I think AI is breaking like and and you know,

482
00:30:39,160 --> 00:30:41,440
I would say it's kind of a side thing. You know,

483
00:30:41,799 --> 00:30:47,359
it's it's not it's not as an ambitious goal as

484
00:30:47,440 --> 00:30:51,880
like self-healing systems. But I do think that's where

485
00:30:51,880 --> 00:30:54,000
you see that LLMs can allow you to

486
00:30:54,079 --> 00:30:56,079
do other things that are interesting. I know there are

487
00:30:56,319 --> 00:30:59,400
prototypes that I think might be interesting, two

488
00:30:59,440 --> 00:31:02,720
other prototypes, but I'll speak about them very briefly. One of

489
00:31:02,720 --> 00:31:08,960
them is to create a diagram out of a post

490
00:31:09,039 --> 00:31:15,839
mortem showing where things went wrong. And post

491
00:31:15,880 --> 00:31:19,559
mortems are actually kind of painful for engineers to write

492
00:31:19,559 --> 00:31:22,079
like you know, no one wants to do that. You

493
00:31:22,119 --> 00:31:26,039
need to remember what happened and bring all of this together. Actually,

494
00:31:26,200 --> 00:31:28,960
that's what's good with LLMs and that's something we have

495
00:31:28,960 --> 00:31:33,720
in Rootly. Rootly will draft a postmortem for you

496
00:31:33,839 --> 00:31:36,359
and then you just have to review it and chances

497
00:31:36,400 --> 00:31:38,920
are that the post mortem is going to be great.

498
00:31:39,359 --> 00:31:42,440
And then the next step that we tried with

499
00:31:42,519 --> 00:31:47,440
the Rootly AI Labs is, how about trying to offer

500
00:31:47,480 --> 00:31:52,480
another way to consume a postmortem, and a visual

501
00:31:52,559 --> 00:31:58,599
way may help, especially I think non engineering audiences to

502
00:31:58,799 --> 00:32:03,440
understand where the failure happened and why the other service

503
00:32:03,960 --> 00:32:06,680
that may seem totally unrelated was down as well. So

504
00:32:06,920 --> 00:32:10,519
the way it works is that it will ingest the

505
00:32:10,559 --> 00:32:13,880
post mortem, make sense of it as JSON, and

506
00:32:13,920 --> 00:32:18,119
then ingest, you know, your code base, infrastructure as

507
00:32:18,200 --> 00:32:20,359
code and make a JSON out of it and then

508
00:32:20,599 --> 00:32:27,160
merge these two and create a Markdown graph. So yeah,

509
00:32:27,200 --> 00:32:30,640
that's, you know, another way to leverage LLMs, which

510
00:32:30,759 --> 00:32:34,000
can ultimately help an SRE team to do their job

511
00:32:34,240 --> 00:32:34,960
more efficiently.

512
00:32:35,519 --> 00:32:38,359
Speaker 1: I like how you called out that after you are

513
00:32:39,400 --> 00:32:44,319
you've pushed out a post mortem that someone actually has

514
00:32:44,319 --> 00:32:48,039
to review what you've created, Like, no, don't just

515
00:32:48,079 --> 00:32:49,960
don't just take that and you know, start sending it

516
00:32:50,000 --> 00:32:52,960
to people as the official thing. Like, if you

517
00:32:53,079 --> 00:32:55,839
take an LLM-generated post mortem and you put that

518
00:32:55,960 --> 00:33:00,400
up publicly, you will for sure get harassed on Blue

519
00:33:00,440 --> 00:33:04,000
Sky and Mastodon very quickly about how you spent

520
00:33:04,079 --> 00:33:06,720
zero effort in making sure that that was accurate.

521
00:33:07,119 --> 00:33:11,319
It's very easy to identify LLM-generated stuff like that.

522
00:33:12,880 --> 00:33:15,119
Speaker 2: And the second thing that we've built that may be of

523
00:33:15,240 --> 00:33:19,559
interest to the audience is an on-call burnout detector.

524
00:33:20,599 --> 00:33:24,039
I think that's particularly interesting for companies that are distributed

525
00:33:25,039 --> 00:33:30,440
where managers may not be in touch as much with

526
00:33:30,640 --> 00:33:33,440
what the team is doing, and especially for large companies.

527
00:33:33,839 --> 00:33:37,440
So what we do is that we feed all the

528
00:33:37,559 --> 00:33:43,880
associated data about incident responders. It can be how long was

529
00:33:43,920 --> 00:33:46,279
their shift over the last week, how many incidents they

530
00:33:46,319 --> 00:33:49,599
had to troubleshoot, what was the severity of these incidents,

531
00:33:51,079 --> 00:33:55,119
how long were they working during the night, and so

532
00:33:55,200 --> 00:33:58,440
many things, you know, that are all unstructured data. Right,

533
00:33:59,640 --> 00:34:02,119
So again, like, LLMs are great at this. And

534
00:34:02,160 --> 00:34:05,039
then from this an LLM can come up with kind

535
00:34:05,079 --> 00:34:08,920
of a burnout level, you know, and see, hey, like

536
00:34:09,079 --> 00:34:12,239
you know, this person was like smashed very hard with

537
00:34:12,360 --> 00:34:16,159
a bunch of hard incidents, like, you may consider giving

538
00:34:16,199 --> 00:34:21,840
them a break. I'm sorry.

539
00:34:21,679 --> 00:34:25,719
Speaker 1: So as long as it doesn't also suggest the therapy that should

540
00:34:25,719 --> 00:34:28,320
be necessary and try to provide that. I think you're

541
00:34:28,360 --> 00:34:31,719
on the right track there, Yeah, I mean, yeah, it

542
00:34:31,760 --> 00:34:35,519
can be. It can be difficult to see the differences

543
00:34:35,559 --> 00:34:38,840
between individuals. Like some of them are way more interested

544
00:34:38,880 --> 00:34:41,480
in actually jumping in and you know, diving in and

545
00:34:41,519 --> 00:34:44,960
trying to identify those problems and solve them, and others

546
00:34:45,000 --> 00:34:48,400
are you know, care more about the routine. But I

547
00:34:49,280 --> 00:34:52,840
don't think in my in the history of my engineering career,

548
00:34:52,880 --> 00:34:55,119
I ever saw someone jump up and down and say yes,

549
00:34:55,239 --> 00:34:57,159
I would love to be woken up at 3 a.m.

550
00:34:57,559 --> 00:35:00,440
And jump on a call with other people and try

551
00:35:00,480 --> 00:35:03,039
to justify what was what was happening. So, you know,

552
00:35:03,039 --> 00:35:04,840
I think you're definitely onto something interesting there.

553
00:35:05,840 --> 00:35:09,480
Speaker 2: Yeah. I was when I was young because I wanted

554
00:35:09,519 --> 00:35:15,760
to learn. But I'm definitely not into that, but go

555
00:35:15,840 --> 00:35:16,079
for it.

556
00:35:16,519 --> 00:35:18,960
Speaker 1: Yeah, I mean, I know. I think that's an interesting

557
00:35:19,000 --> 00:35:22,480
point because you know your career, things change for you

558
00:35:22,519 --> 00:35:24,480
over time, and maybe at some point you are willing

559
00:35:24,519 --> 00:35:27,760
to make some sacrifices, you know, But I don't know

560
00:35:27,760 --> 00:35:29,280
if it was the case for me, Like I remember

561
00:35:29,360 --> 00:35:32,760
my first job out of university, there would be incidents

562
00:35:32,760 --> 00:35:35,119
in the middle of the night, and I never had

563
00:35:35,119 --> 00:35:37,400
to deal with that sort of thing in my life

564
00:35:37,480 --> 00:35:39,199
up until that point. Like I didn't run my own

565
00:35:39,280 --> 00:35:41,000
data center in my home, and even if I did,

566
00:35:41,039 --> 00:35:42,760
I don't. It was not at the point where you'd

567
00:35:42,760 --> 00:35:44,760
be like getting alerts to be woken up to deal

568
00:35:44,800 --> 00:35:49,159
with one of your virtual machines failing. And at university it

569
00:35:49,280 --> 00:35:53,280
wasn't a thing. You're you're definitely awake while you're causing problems, right,

570
00:35:53,559 --> 00:35:56,280
things aren't happening while you're sleeping. And so my first

571
00:35:56,360 --> 00:35:58,719
job like this would happen, and I definitely came away

572
00:35:58,760 --> 00:36:01,320
from that with it, with the idea this is wrong,

573
00:36:01,519 --> 00:36:04,000
Like I don't ever want to be woken up in

574
00:36:04,039 --> 00:36:05,760
the middle of the night, Like you don't have to

575
00:36:05,800 --> 00:36:08,480
be it's not a requirement. And since then, like I've

576
00:36:08,559 --> 00:36:11,599
really been on the path of highly reliable systems. And

577
00:36:11,639 --> 00:36:14,639
I think the part that really stumps a lot of

578
00:36:14,639 --> 00:36:17,599
people is they focus a lot on the preventative nature

579
00:36:17,880 --> 00:36:20,199
that they can try to prevent every problem. Oh, get

580
00:36:20,239 --> 00:36:22,639
one hundred percent test coverage, or you know, have a

581
00:36:22,719 --> 00:36:26,760
highly reliable solution by duplicating the infrastructure in multiple regions.

582
00:36:26,800 --> 00:36:29,440
And I mean the thing I think you said at

583
00:36:29,480 --> 00:36:32,639
the beginning of the episode, which is that it will

584
00:36:32,679 --> 00:36:36,079
go down, Like you cannot have one hundred percent reliable

585
00:36:36,079 --> 00:36:38,440
system and so at some point you have to optimize

586
00:36:38,480 --> 00:36:42,159
for recovery and not just prevention. And this is where

587
00:36:42,199 --> 00:36:44,280
I think a lot of people get stuck because like

588
00:36:44,440 --> 00:36:47,519
at our company, we have a five nines reliability SLA,

589
00:36:47,920 --> 00:36:49,760
and that means that by the time someone gets

590
00:36:49,800 --> 00:36:54,760
alerted and they get online, we've already violated the SLA,

591
00:36:54,880 --> 00:36:57,920
let alone identified and fixed the problem.

592
00:36:58,480 --> 00:37:00,960
Speaker 2: That's a great point you bring that I think this

593
00:37:01,119 --> 00:37:05,159
system well. First, first of all, getting woken up at

594
00:37:05,239 --> 00:37:09,239
three am is never a pleasant experience, and it takes

595
00:37:09,280 --> 00:37:12,280
time for your brain to get into it. And you know,

596
00:37:12,320 --> 00:37:15,519
maybe you were in some deep sleep and you are

597
00:37:15,559 --> 00:37:18,639
waking up and kind of having like little panic attack

598
00:37:18,800 --> 00:37:21,280
or you know, something like tough on your body, and

599
00:37:21,320 --> 00:37:26,320
then you need to you know, get time to ingest

600
00:37:26,400 --> 00:37:27,840
the data and so on and so forth. So we

601
00:37:27,880 --> 00:37:30,800
know how hard it is for your body and your mind.

602
00:37:31,239 --> 00:37:35,079
I think that's where AI SREs, you know, which are

603
00:37:35,159 --> 00:37:39,920
like self-healing systems or tools that can lead to

604
00:37:39,960 --> 00:37:43,639
that can help. Is like, hey, this tool can ingest

605
00:37:43,920 --> 00:37:47,360
so much data in such a small amount of time

606
00:37:47,920 --> 00:37:50,400
and give you something to get started, like an initial

607
00:37:50,519 --> 00:37:53,239
root cause analysis. Then by the time you get to

608
00:37:53,280 --> 00:37:57,440
your computer, you already have something ready to look at.

609
00:37:57,480 --> 00:38:01,280
Hopefully it's ninety five percent confidence and you know you

610
00:38:01,440 --> 00:38:03,719
just have to push the fix that it suggests. I

611
00:38:03,719 --> 00:38:06,719
think it's such a great tool. I think that we'll

612
00:38:06,760 --> 00:38:10,639
have a great positive impact on the on the health

613
00:38:10,800 --> 00:38:17,079
of our people. The second thing I think that's that's

614
00:38:18,079 --> 00:38:21,400
interesting with this tool is that you mentioned you know

615
00:38:21,480 --> 00:38:25,280
five nines, and you know, we know that it's possible

616
00:38:25,320 --> 00:38:28,760
to get five nines or six nines. But the companies

617
00:38:28,760 --> 00:38:31,559
that are achieving that, like the Googles of the world,

618
00:38:32,159 --> 00:38:36,440
are investing huge amounts of resources, you know, human

619
00:38:36,760 --> 00:38:40,599
and financial to reach this level, and for the rest

620
00:38:40,599 --> 00:38:43,480
of us, the rest of the businesses, it's simply not

621
00:38:43,599 --> 00:38:49,119
possible until today. I believe that these self healing tools

622
00:38:49,440 --> 00:38:54,480
will allow companies to reach this type of you know

623
00:38:54,639 --> 00:38:59,320
SLA without spending the budget that that Google does. And

624
00:38:59,360 --> 00:39:02,679
I think that's truly, I think that's going to

625
00:39:02,719 --> 00:39:04,239
redefine the SRE space.

626
00:39:05,519 --> 00:39:08,000
Speaker 1: Yeah, I mean, I will say that one of the

627
00:39:08,000 --> 00:39:14,039
biggest struggles we have is actually customer perspective alignment. Like

628
00:39:14,599 --> 00:39:17,280
it's a challenge for us to know what the status

629
00:39:17,280 --> 00:39:19,480
of our system is like it's subjective. Is it up

630
00:39:19,599 --> 00:39:21,719
or down? Is not like you can look at some

631
00:39:22,000 --> 00:39:25,280
chart and have the answer there. And what's even more

632
00:39:25,320 --> 00:39:27,760
important is that if we believe that our system is up,

633
00:39:27,920 --> 00:39:30,239
that our customers also believe that our system is up,

634
00:39:30,800 --> 00:39:33,039
because this mismatch is really what you're trying to solve

635
00:39:33,079 --> 00:39:37,239
for. If customers always, like, one hundred percent reliability is

636
00:39:37,239 --> 00:39:39,599
not whether you think it's up, it's whether or

637
00:39:39,639 --> 00:39:41,880
not you know the people that are paying you money

638
00:39:41,920 --> 00:39:45,119
to you know, run some system believe it is and

639
00:39:45,559 --> 00:39:49,159
the customer expectation alignment, like that's actually a really, that's

640
00:39:49,199 --> 00:39:54,440
a huge challenge and I'm not sure you can fundamentally

641
00:39:54,559 --> 00:39:57,400
solve that problem. But yeah, I do think there's a

642
00:39:57,440 --> 00:39:59,639
huge gap with a lot of companies being able to

643
00:39:59,639 --> 00:40:03,320
get far from where they're at, which is like their

644
00:40:03,320 --> 00:40:05,599
software is going down like at least once a week,

645
00:40:05,800 --> 00:40:07,920
to something much further than that.

646
00:40:09,360 --> 00:40:12,360
Speaker 2: Yeah, yeah, so, I you know, I do think the

647
00:40:12,519 --> 00:40:16,320
LLMs can help with that in some capacity. Maybe

648
00:40:16,360 --> 00:40:19,000
I can jump also and share a little bit about

649
00:40:19,599 --> 00:40:22,719
what the challenges are of building these types of tools.

650
00:40:22,800 --> 00:40:25,559
Speaker 1: Yeah, please, I'm dying to know.

651
00:40:25,599 --> 00:40:29,159
Speaker 2: Right. So, I think one of the hardest things, which you know,

652
00:40:29,400 --> 00:40:32,519
I think is a big weakness, is obviously the

653
00:40:32,679 --> 00:40:37,960
non deterministic part of the system. And here I think,

654
00:40:38,480 --> 00:40:42,880
you know, the old adage you cannot improve or fix

655
00:40:43,679 --> 00:40:47,400
what you cannot measure is you know, works very well,

656
00:40:47,519 --> 00:40:52,199
right? Like for LLMs, even if you provide the

657
00:40:52,239 --> 00:40:58,079
same input, the output will be different. And so it's

658
00:40:58,159 --> 00:41:03,960
very hard for an engineering team to ensure that one, you know,

659
00:41:04,119 --> 00:41:08,440
is my system running well as you say, it's subjective,

660
00:41:08,480 --> 00:41:11,639
and I think here it's even more subjective because it's

661
00:41:11,639 --> 00:41:14,239
not a matter of just hey, am I getting a

662
00:41:14,800 --> 00:41:17,360
two hundred or five hundred or maybe it's a two hundred,

663
00:41:17,440 --> 00:41:22,639
but with too much, you know, latency. We're speaking

664
00:41:22,679 --> 00:41:28,719
about an output which is natural language. And second is

665
00:41:29,559 --> 00:41:35,800
my output better or worse? So that's that's like a

666
00:41:35,920 --> 00:41:41,599
big challenge in in in building this system. And and

667
00:41:41,639 --> 00:41:45,039
another point is that these systems don't have skin in

668
00:41:45,079 --> 00:41:48,559
the game. And LLMs are like dream machines. They are

669
00:41:48,679 --> 00:41:54,360
designed to put together chains of tokens that are, you know,

670
00:41:54,519 --> 00:41:57,920
using statistics, the more likely to be pleasing, you know,

671
00:41:59,239 --> 00:42:03,199
and sometimes what they assemble is not rooted in reality.

672
00:42:03,320 --> 00:42:07,320
But they still did their job as they should. And

673
00:42:07,400 --> 00:42:10,159
so if we compare this to a human when you know,

674
00:42:10,840 --> 00:42:14,480
if, let's say, you're my manager and I'm working

675
00:42:14,519 --> 00:42:16,599
on trouble shooting this incident, and I'm like, hey, I

676
00:42:16,639 --> 00:42:20,239
think that's the issue. I think this is where we

677
00:42:20,280 --> 00:42:24,239
should you know, Look, I have skin in the game, right, like,

678
00:42:24,239 --> 00:42:30,519
like I'm putting my skills on the line, and so

679
00:42:30,719 --> 00:42:33,440
you know, when I share this with you, I have

680
00:42:33,519 --> 00:42:37,199
a certain degree of certainty that this is a probable

681
00:42:37,239 --> 00:42:42,039
cause. For LLMs, there is none of that, right? So here,

682
00:42:42,679 --> 00:42:47,159
what we've done at Rootly is that we have

683
00:42:47,159 --> 00:42:51,599
two types of agents. We have the master agent, which

684
00:42:51,679 --> 00:42:56,559
is orchestrating sub agents which are in charge of doing

685
00:42:57,639 --> 00:43:01,960
the work of gathering data, trying to understand, like doing

686
00:43:02,000 --> 00:43:04,599
the grunt work and then coming up with an answer.

687
00:43:05,039 --> 00:43:09,119
And the master agent will make sure that the overall

688
00:43:09,199 --> 00:43:11,760
narrative makes sense, and that there is no agent that's, like,

689
00:43:12,320 --> 00:43:14,400
you know, coming up with something that doesn't make sense,

690
00:43:14,519 --> 00:43:18,280
like a manager would do. So what's funny with with

691
00:43:18,559 --> 00:43:21,679
LLMs is that it kind of mimics a human narrative,

692
00:43:21,760 --> 00:43:24,679
a human dynamic. Yeah.

693
00:43:24,760 --> 00:43:27,039
Speaker 1: No, I mean I feel like the most common questions

694
00:43:27,119 --> 00:43:29,800
I end up asking are how do you know? And

695
00:43:29,880 --> 00:43:35,719
why now? And LLMs are not so good at solving that one,

696
00:43:36,440 --> 00:43:39,519
especially when like a bunch of changes all stack together

697
00:43:39,719 --> 00:43:43,159
to then cause the problem. Right, you know, you look

698
00:43:43,199 --> 00:43:45,599
at individual changes and they all seem fine, and then

699
00:43:46,360 --> 00:43:48,280
only together do they cause the issue. So I mean,

700
00:43:48,400 --> 00:43:51,360
I do see this sort of interaction is necessary. I

701
00:43:51,400 --> 00:43:54,320
do want to ask you about your models, though, so

702
00:43:54,760 --> 00:43:57,800
are you taking some fundamental like some foundational model out

703
00:43:57,800 --> 00:44:00,440
there that's available open source and fine-tuning it? Are

704
00:44:00,440 --> 00:44:03,920
you building it up from scratch? Is there like one

705
00:44:03,960 --> 00:44:06,400
particular company's models that you like more than others? What

706
00:44:06,440 --> 00:44:07,400
does this look like for you?

707
00:44:09,639 --> 00:44:13,400
Speaker 2: Yeah, So I think the assumption that I think the

708
00:44:13,800 --> 00:44:18,519
you know, anyone not, like, deep in the space

709
00:44:18,559 --> 00:44:21,519
would assume is that you need you need to train models,

710
00:44:22,159 --> 00:44:25,800
you need to tune it using in our case, you know,

711
00:44:26,719 --> 00:44:30,119
like your customer data or like you know, if you

712
00:44:30,159 --> 00:44:33,800
are building this internally, your specific data. What we found

713
00:44:34,000 --> 00:44:37,199
is that this is actually not needed for most of

714
00:44:37,239 --> 00:44:43,559
the incidents. Like, off-the-shelf

715
00:44:43,800 --> 00:44:48,320
models, like, work perfectly fine and will find most of

716
00:44:48,360 --> 00:44:53,199
the issues. Training models is actually really hard, really costly,

717
00:44:54,079 --> 00:44:58,239
and we haven't found so far. You know, we're still

718
00:44:58,280 --> 00:45:00,519
early you know in the space. So I think we'll

719
00:45:00,559 --> 00:45:02,960
get to this eventually. But I think for now we're

720
00:45:03,440 --> 00:45:08,639
finding the most value by not doing it. I think again, like,

721
00:45:08,920 --> 00:45:12,239
it's it's difficult, it's expensive. And then there is a

722
00:45:12,239 --> 00:45:15,519
lot of skepticism and I think issue with privacy and

723
00:45:15,559 --> 00:45:19,800
security. Companies don't want their data going into LLMs, even,

724
00:45:19,880 --> 00:45:22,280
you know, if we would do this only for their LLM.

725
00:45:23,880 --> 00:45:28,239
So what we found matters the most is really

726
00:45:29,280 --> 00:45:33,480
the context that you provide. And I think what we've

727
00:45:33,480 --> 00:45:39,400
found is the most valuable is the non technical stuff.

728
00:45:39,719 --> 00:45:41,760
But what I mean by this is the human generating

729
00:45:42,239 --> 00:45:48,000
generated context. And when we link this to Rootly, it's

730
00:45:48,039 --> 00:45:53,000
two things. The first one is the former postmortems.

731
00:45:53,039 --> 00:45:56,760
like this is a gold mine of information. Most of

732
00:45:56,800 --> 00:46:02,760
the time your system is unstable and it's gonna you know,

733
00:46:03,000 --> 00:46:05,880
this area will remain unstable at least for some period

734
00:46:05,920 --> 00:46:08,760
of time. Generally you have action items that your team

735
00:46:08,800 --> 00:46:12,920
is supposed to implement. Sometimes these action items are done,

736
00:46:12,960 --> 00:46:17,440
sometimes not. You know, there is always a priority issue

737
00:46:17,599 --> 00:46:21,039
with, we need to release this feature versus fix this

738
00:46:21,079 --> 00:46:28,599
potential bug. And the second thing is all the communications

739
00:46:28,639 --> 00:46:33,639
that's happening on Slack or Teams or Zoom or Google Meet,

740
00:46:33,920 --> 00:46:40,519
you know, and whatnot, where that's how incidents are solved, right,

741
00:46:40,559 --> 00:46:43,320
it's humans communicating with each other and sharing so much

742
00:46:43,360 --> 00:46:49,119
information that's business specific, right? Like, LLMs are trained on

743
00:46:49,199 --> 00:46:54,760
a ton of data that's online, but it's not specific

744
00:46:54,920 --> 00:46:58,320
to a company obviously, right, And and and so we

745
00:46:58,440 --> 00:47:02,440
found that this data really boosts the results

746
00:47:02,440 --> 00:47:04,039
that we get out of these tools.

747
00:47:04,199 --> 00:47:06,280
Speaker 1: Yeah, I mean, I think you said it a

748
00:47:06,320 --> 00:47:08,920
different way. I like the context of you have to

749
00:47:08,960 --> 00:47:13,039
pull in the business criteria, understanding and context in order

750
00:47:13,079 --> 00:47:16,039
to have a valuable output. And I think it's you know,

751
00:47:16,039 --> 00:47:18,760
it can even be more than that. It's the fundamental

752
00:47:18,840 --> 00:47:20,960
nature of LLMs that we have today. Like, it's

753
00:47:21,000 --> 00:47:24,960
not a it's transformer architecture, which you know, is fundamentally

754
00:47:25,039 --> 00:47:28,199
lacking the reasoning piece, Like they'll never be able to reason,

755
00:47:28,239 --> 00:47:30,000
which means they'll never be able to make a decision

756
00:47:30,400 --> 00:47:33,559
based off of the business context. But they'll be able

757
00:47:33,559 --> 00:47:35,880
to do a little bit better of pulling that in

758
00:47:35,920 --> 00:47:40,000
and combining with the output that it would normally get. So,

759
00:47:40,559 --> 00:47:42,719
you know, my one of my questions is here, Okay,

760
00:47:42,719 --> 00:47:44,719
so building up a foundational model, and I think we've

761
00:47:44,719 --> 00:47:49,159
heard this before on Adventures in DevOps, and that's that it's

762
00:47:49,199 --> 00:47:53,960
incredibly expensive. Also, the industry is moving so quickly that the

763
00:47:54,000 --> 00:47:58,280
new foundational models are are just as good, So spending

764
00:47:58,320 --> 00:48:00,679
money on building a new one doesn't make sense. Actually,

765
00:48:00,840 --> 00:48:03,400
I think we heard one time that even fine tuning

766
00:48:03,400 --> 00:48:07,079
models doesn't make sense because the next generation while like

767
00:48:07,159 --> 00:48:09,920
say, Anthropic's Claude 3.7 versus 3.5,

768
00:48:10,159 --> 00:48:13,000
it's not it's not really that much of an improvement,

769
00:48:13,400 --> 00:48:16,559
but you are getting up to date data. If anything. Rather,

770
00:48:16,639 --> 00:48:19,159
you know the time stamp has changed, and if you

771
00:48:19,199 --> 00:48:23,000
spend time training it, or fine tuning it, rather, then by

772
00:48:23,039 --> 00:48:25,119
the time the next one comes out, all your fine

773
00:48:25,119 --> 00:48:27,239
tuning, well, first of all is a waste, second of all

774
00:48:27,360 --> 00:48:29,719
is expensive, and third of all, like you may be

775
00:48:29,719 --> 00:48:32,320
able to throw your queries, or your prompts, at the

776
00:48:32,360 --> 00:48:34,800
new model and get the right answer out anyway, So

777
00:48:35,039 --> 00:48:36,760
it's good to hear that. Does that mean you're using

778
00:48:36,760 --> 00:48:42,320
some sort, you're using, like, something from Ollama or

779
00:48:42,360 --> 00:48:43,239
DeepSeek or something like that.

780
00:48:44,599 --> 00:48:48,880
Speaker 2: We found that what Anthropic provides is generally, you know,

781
00:48:48,920 --> 00:48:55,239
the best performing. Ultimately, we are integrating a number of

782
00:48:55,280 --> 00:49:01,079
different model providers and we use different models at different

783
00:49:01,119 --> 00:49:07,239
steps of the process. You know, I cannot explain in

784
00:49:07,280 --> 00:49:10,000
detail because it would be too long, but you know,

785
00:49:10,039 --> 00:49:14,360
when we're basically, like, the agent will come up

786
00:49:14,400 --> 00:49:18,440
with an initial probe that you know, we compose and

787
00:49:19,119 --> 00:49:22,280
I would say, like different model will for instance, coming

788
00:49:22,360 --> 00:49:25,519
up with the let's say the master thesis of what

789
00:49:26,400 --> 00:49:32,000
you know we need to look for might be better created

790
00:49:32,039 --> 00:49:35,480
by some model, and then you know, the actual technical

791
00:49:35,920 --> 00:49:39,559
part may be better done by another model. So it's, and

792
00:49:39,599 --> 00:49:41,960
it's a moving target, as you said, like the industry

793
00:49:42,000 --> 00:49:44,960
is moving fast. There is a constant flow of new

794
00:49:45,000 --> 00:49:49,159
models and so so you know, I don't think it's

795
00:49:49,159 --> 00:49:52,719
something that's really set in stone.

796
00:49:53,280 --> 00:49:56,599
Speaker 1: Do you do something to validate model changes? So for instance,

797
00:49:56,599 --> 00:49:58,840
when three point seven came out, are you still using

798
00:49:58,840 --> 00:50:00,800
three point five before? Or you can, like you have

799
00:50:00,840 --> 00:50:03,880
some sort of ender templates or system prompts that you

800
00:50:03,880 --> 00:50:06,960
can throw out and validate that the answers still make sense,

801
00:50:07,039 --> 00:50:11,079
that the RCAs and post mortems that you're doing still

802
00:50:11,400 --> 00:50:15,199
are understandable and match, and somehow validating the outputs. Like,

803
00:50:15,320 --> 00:50:16,559
how does this process work for you?

804
00:50:17,239 --> 00:50:21,079
Speaker 2: Yeah, LangChain has a bunch of open source tools

805
00:50:21,119 --> 00:50:23,800
that can allow you to do this. So we are

806
00:50:23,840 --> 00:50:31,000
constantly you know, tracing every like all the different you know,

807
00:50:31,000 --> 00:50:33,480
it's like kind of a tree, right, with different nodes

808
00:50:33,519 --> 00:50:38,239
and paths, and we keep track of everything that's being done,

809
00:50:38,239 --> 00:50:41,960
the reasoning, the output, and we constantly measure you know,

810
00:50:42,360 --> 00:50:47,519
the performance. So that's definitely something that we do. That

811
00:50:47,639 --> 00:50:52,679
being said, I think it's a challenge. It's still a challenge,

812
00:50:52,679 --> 00:50:58,320
you know to really understand the the quality, how the

813
00:50:58,400 --> 00:51:03,679
quality is shifting. There was a talk at SREcon in

814
00:51:03,760 --> 00:51:07,000
Santa Clara a few weeks ago. I think it was

815
00:51:07,199 --> 00:51:13,719
the AI director at ASIA who were speaking about one

816
00:51:13,719 --> 00:51:18,440
of the model-based products that they are using, and

817
00:51:18,599 --> 00:51:21,039
it was saying that it's very hard for them to

818
00:51:21,119 --> 00:51:30,119
understand how to measure that, and they are relying on NPS.

819
00:51:30,639 --> 00:51:36,679
So it's a Net Promoter Score, which is

820
00:51:36,719 --> 00:51:42,320
basically an industry standard rating, which is like would you

821
00:51:42,360 --> 00:51:45,440
recommend this to your friends and family, to assess how

822
00:51:45,480 --> 00:51:47,280
their models are doing. And I think that was like

823
00:51:47,360 --> 00:51:48,800
really shocking to the audience.

824
00:51:49,719 --> 00:51:52,119
Speaker 1: I mean, we do know from experience

825
00:51:52,159 --> 00:51:56,320
that like NPS is like totally wrong from the net standpoint,

826
00:51:56,519 --> 00:51:59,280
because you should never ask from a human psychology standpoint,

827
00:51:59,280 --> 00:52:01,880
you should never ask someone what they would do, but

828
00:52:01,960 --> 00:52:05,679
metrics on what they have done. And I think that's

829
00:52:05,719 --> 00:52:07,079
often the problem. But I mean, I think it really

830
00:52:07,079 --> 00:52:09,320
goes to show that there is no good way of

831
00:52:09,800 --> 00:52:12,599
adequately measuring these things. You have to do it within

832
00:52:12,639 --> 00:52:14,559
context of like what your business is doing, you know,

833
00:52:14,639 --> 00:52:17,760
for instance, really being able to do the incident management.

834
00:52:18,119 --> 00:52:20,880
And I do think that at least I know I

835
00:52:20,880 --> 00:52:23,159
have this question, so I'm sure someone else does that.

836
00:52:23,199 --> 00:52:25,199
There you're getting to the point where you don't want

837
00:52:25,239 --> 00:52:27,599
to have to make the code changes to like go

838
00:52:27,719 --> 00:52:31,840
into GitHub or GitLab or, heaven forbid, Bitbucket

839
00:52:31,920 --> 00:52:34,519
or one of the other ones to actually put up

840
00:52:34,559 --> 00:52:36,599
a pull request to fix the problem. Wouldn't it be

841
00:52:36,639 --> 00:52:39,159
great if there was another LLM out there that had

842
00:52:39,159 --> 00:52:41,280
the context of the source code and everything, and you

843
00:52:41,320 --> 00:52:43,719
could just give it the output from Rootly and have

844
00:52:43,800 --> 00:52:46,039
a different agent do that. And that for me means

845
00:52:46,360 --> 00:52:49,440
you need to somehow integrate with other agents. And I

846
00:52:49,440 --> 00:52:54,719
can't believe I'm saying this, but MCP, Model Context Protocol. Like,

847
00:52:55,239 --> 00:52:56,360
how do you feel about that?

848
00:52:57,360 --> 00:52:59,760
Speaker 2: I'm a huge fan of this. I think you are, Like,

849
00:53:00,119 --> 00:53:03,079
that's exactly the architecture that you need to have in mind.

850
00:53:04,119 --> 00:53:09,360
It's not an agent, it's a collection of agents. And

851
00:53:09,480 --> 00:53:11,119
it can go as deep as like, let's say, like

852
00:53:12,440 --> 00:53:15,800
you are doing work with GitHub. You can have an

853
00:53:15,800 --> 00:53:20,639
agent working on commits, one on pull requests, one, you know,

854
00:53:20,719 --> 00:53:24,239
like on each type of different resources that you may

855
00:53:24,280 --> 00:53:26,920
have with GitHub. You really need to tailor the agent

856
00:53:27,000 --> 00:53:31,800
for them to do their best job. Because again,

857
00:53:31,880 --> 00:53:36,360
like they don't always have the business logic and understanding

858
00:53:36,440 --> 00:53:40,480
that we do as humans, and bringing this context

859
00:53:41,639 --> 00:53:44,280
into each of the small, like, subsets of agents

860
00:53:44,880 --> 00:53:48,559
is critical. For MCP, you know, I'm a big

861
00:53:48,599 --> 00:53:53,159
fan of MCP. I've been wearing an MCP badge at

862
00:53:53,199 --> 00:53:57,039
SREcon and KubeCon because I truly think it's, it's the

863
00:53:57,280 --> 00:54:02,199
technology itself is nothing, you know, amazing, Like at the

864
00:54:02,280 --> 00:54:03,920
end of the day, it's just a gateway to

865
00:54:03,960 --> 00:54:04,559
an API.

866
00:54:05,159 --> 00:54:06,960
Speaker 1: I mean, I really want, I really want to stress

867
00:54:07,000 --> 00:54:09,039
that enough. Like, anyone who's not caught up on this,

868
00:54:09,199 --> 00:54:12,320
like it's nothing special, Like just imagine you deploy a

869
00:54:12,480 --> 00:54:15,719
new API or reverse proxy or CDN in front of your

870
00:54:15,719 --> 00:54:19,559
existing software and you're mapping from one protocol to another

871
00:54:19,599 --> 00:54:22,519
one, like from TCP to UDP, from HTTP to

872
00:54:23,760 --> 00:54:27,440
or from REST to gRPC. It's really just another one.

873
00:54:27,519 --> 00:54:29,840
And I think the joke right now is that the

874
00:54:30,000 --> 00:54:32,519
S in MCP stands for security.

875
00:54:33,119 --> 00:54:35,960
Speaker 2: Yeah, you know, I think I mean that. Yeah, I

876
00:54:36,000 --> 00:54:37,679
think we go back to the you know what we

877
00:54:37,760 --> 00:54:40,199
discussed at the beginning of the conversation. You have the engineer,

878
00:54:40,280 --> 00:54:42,840
who will be a naysayer, and obviously there is a

879
00:54:42,920 --> 00:54:46,440
lot of things that are wrong with this, with this protocol,

880
00:54:46,880 --> 00:54:50,440
it's not stable, it's full of bugs. Security is absolutely not

881
00:54:50,559 --> 00:54:53,679
to concern. But I think what's interesting is the concept

882
00:54:53,760 --> 00:55:00,119
of bridging this world between, uh, this AI agent and

883
00:55:00,880 --> 00:55:04,679
all the resources that are out there, whether it's data

884
00:55:04,800 --> 00:55:10,119
or systems. And MCP, just think of it as USB, right,

885
00:55:10,239 --> 00:55:16,840
it's facilitating the conversation between between these two entities. It's

886
00:55:16,840 --> 00:55:23,440
it's open source and it really unleashes a lot

887
00:55:23,440 --> 00:55:29,079
of power for AI agents and data sources, which

888
00:55:29,159 --> 00:55:30,639
as you say, most of the time is going to

889
00:55:30,639 --> 00:55:33,639
be REST APIs, to communicate in a way that's very

890
00:55:33,679 --> 00:55:38,480
optimized, because, one, there are many issues with agents trying to

891
00:55:38,559 --> 00:55:42,440
consume, most of the time, REST APIs. As

892
00:55:42,440 --> 00:55:44,639
we said, like, they may not have the business logic.

893
00:55:45,159 --> 00:55:48,920
So getting information from an API may require you

894
00:55:48,960 --> 00:55:54,519
to do multiple calls to multiple routes and you know,

895
00:55:54,679 --> 00:55:57,880
like, the LLM may not know if it's feasible,

896
00:55:58,360 --> 00:56:00,320
they may not use the best way to do that

897
00:56:00,400 --> 00:56:04,639
may get lost. And so MCP is really enabling this,

898
00:56:04,840 --> 00:56:09,599
removing all of all of this complexity. And for instance,

899
00:56:09,639 --> 00:56:12,639
I built an MCP server for Rootly, and what

900
00:56:13,199 --> 00:56:16,519
it allows the developer to do is to when they

901
00:56:16,519 --> 00:56:20,320
get paged, instead of opening, you know, the web app,

902
00:56:20,320 --> 00:56:23,400
going to Rootly, looking at the incident, and, you know, it

903
00:56:23,480 --> 00:56:27,159
takes time, context switching, which we know is bad for developers.

904
00:56:27,599 --> 00:56:32,639
They can just ask into their favorite power, I get

905
00:56:32,679 --> 00:56:35,039
me the latest incident, and it's going to pop up

906
00:56:35,039 --> 00:56:40,599
in their chat, and assuming it's simple enough and

907
00:56:40,719 --> 00:56:44,639
there is enough data in the payload of the incident,

908
00:56:44,719 --> 00:56:47,559
you can ask in this case, I use yourself to

909
00:56:47,599 --> 00:56:51,760
fix the incident. And so you go from production incident

910
00:56:51,880 --> 00:56:56,159
to resolution in a matter of a minute. Again, it's

911
00:56:56,280 --> 00:56:58,760
as you said, it's you know, some people were like, yeah,

912
00:56:58,800 --> 00:57:02,039
it's a joke, it's it's not truly, it's not revolutionary,

913
00:57:02,039 --> 00:57:04,440
it's not. But I think what's great is that it

914
00:57:04,840 --> 00:57:09,800
allows workflows to be done and it reduces a lot

915
00:57:09,840 --> 00:57:14,119
of friction. And we see a lot of companies in

916
00:57:14,119 --> 00:57:19,480
customers like Canva and Bricks. They are huge engineering organizations

917
00:57:19,599 --> 00:57:23,920
and they're, like, investing so much into MCP because they

918
00:57:24,000 --> 00:57:27,320
want their developers to remain where they produce the most value,

919
00:57:27,679 --> 00:57:30,119
which is in their IDE, and so they are trying

920
00:57:30,159 --> 00:57:34,280
to bring as many, you know, MCP servers. And then

921
00:57:34,320 --> 00:57:38,000
it doesn't matter if it's MCP. Actually, IBM released

922
00:57:38,039 --> 00:57:43,519
a competing protocol which is called ACP, which does the

923
00:57:43,559 --> 00:57:46,480
same thing. But you know, they're trying to bring all

924
00:57:46,519 --> 00:57:50,159
the contexts and the context that engineers need to do

925
00:57:50,239 --> 00:57:54,079
their work into the IDE, and MCP is allowing just this.

926
00:57:55,000 --> 00:57:57,480
Speaker 1: I think I'd be remiss if I didn't bring up Randall

927
00:57:57,480 --> 00:58:02,039
Munroe's comic on the, we have fourteen competing standards for this,

928
00:58:02,480 --> 00:58:05,800
you know what, we need one universal standard to do this,

929
00:58:05,960 --> 00:58:08,440
you know, and then time later we have fifteen competing

930
00:58:08,480 --> 00:58:11,239
standards for this. I mean, because there really are. There's

931
00:58:11,360 --> 00:58:13,920
like, AWS came out with, not long ago, Smithy

932
00:58:14,079 --> 00:58:20,320
for, uh, HTTP services, a design pattern for documenting their APIs.

933
00:58:20,599 --> 00:58:24,119
We had the OpenAPI Specification. It's on version three point

934
00:58:24,119 --> 00:58:26,679
one right now, so that's you know, three versions later,

935
00:58:27,119 --> 00:58:29,480
and there's a whole bunch of these that different companies use,

936
00:58:29,639 --> 00:58:32,199
and I think the biggest trouble a lot of them

937
00:58:32,239 --> 00:58:36,079
have is that, like, we have an OpenAPI specification for authors,

938
00:58:36,400 --> 00:58:39,719
is that even if getting a human to understand what

939
00:58:39,880 --> 00:58:43,519
was written there is quite challenging, and so like feeding

940
00:58:43,559 --> 00:58:46,239
that into a model is you know, nonsensical, Like it's

941
00:58:46,280 --> 00:58:47,480
just not going to get you the word. As you

942
00:58:47,559 --> 00:58:50,280
pointed out, often the pattern is multiple things. I mean,

943
00:58:50,320 --> 00:58:51,840
we have things like GraphQL, which you know has

944
00:58:51,880 --> 00:58:54,719
its own problems and whatnot. So I think we're just

945
00:58:54,719 --> 00:58:56,719
going to keep seeing more of these and I don't

946
00:58:56,760 --> 00:58:58,239
think we're ever going to really be able to settle

947
00:58:58,280 --> 00:58:59,920
on one. It would be nice if we could have one.

948
00:59:00,199 --> 00:59:03,239
The thing about MCP is, even if we pretend for

949
00:59:03,280 --> 00:59:05,159
one moment it's the worst thing in the world, as

950
00:59:05,199 --> 00:59:08,639
you pointed out, like, I think Azure, GCP, and AWS

951
00:59:08,679 --> 00:59:16,880
all released MCP servers for their built-in, like, AI products,

952
00:59:17,480 --> 00:59:19,639
so you can interact with AWS Bedrock through an

953
00:59:19,719 --> 00:59:22,800
MCP server and like, so irrelevant what you think about that.

954
00:59:23,039 --> 00:59:26,719
It now exists and large companies have put some effort

955
00:59:26,800 --> 00:59:28,960
behind it, and maybe they're just trying to capture some

956
00:59:29,000 --> 00:59:32,159
of the market share and later things can evolve. I

957
00:59:32,199 --> 00:59:35,039
do think that, especially if a lot of companies are

958
00:59:35,079 --> 00:59:38,960
going for speed over quality, that we may not get

959
00:59:39,079 --> 00:59:43,360
for like that many more iterations of a protocol to work.

960
00:59:43,400 --> 00:59:47,599
I mean, any I'll take this over the using sound

961
00:59:48,119 --> 00:59:52,039
high frequencies to communicate between devices that have, you know, LLMs,

962
00:59:52,039 --> 00:59:54,480
like I don't need that. It can go over the internet, please, Like,

963
00:59:54,800 --> 00:59:57,920
that's where I'm comfortable with my security. I'm not comfortable

964
00:59:57,960 --> 01:00:00,800
with things going through the airwaves because otherwise it's going

965
01:00:00,880 --> 01:00:04,280
to be the Alexa. Please order me, you know, another

966
01:00:04,599 --> 01:00:07,599
twenty four roll of toilet paper from an advertisement running

967
01:00:07,599 --> 01:00:10,599
on my television and actually have it happen and like

968
01:00:10,639 --> 01:00:12,719
this is recorded, that doesn't happen. Like, so I I

969
01:00:13,719 --> 01:00:16,119
don't need that to happen. People will have this happening.

970
01:00:16,199 --> 01:00:18,159
So I think MCP is still a little bit more

971
01:00:18,199 --> 01:00:20,400
secure than some of these other protocols that are out there.

972
01:00:20,800 --> 01:00:23,599
Speaker 2: It is. Yeah, you know again, I'm not you know,

973
01:00:24,239 --> 01:00:28,320
I'm not an MCP evangelist. I think I'm not vouching

974
01:00:28,360 --> 01:00:31,920
for the technology, but for the concept. I think there's

975
01:00:32,199 --> 01:00:36,039
some serious limitations, a lot of issues with it. I

976
01:00:36,079 --> 01:00:38,199
think one is security, which I think we've already discussed.

977
01:00:38,280 --> 01:00:40,400
We won't let that. But I think one issue is,

978
01:00:40,440 --> 01:00:44,519
for instance, you spoke about OpenAPI, so you can

979
01:00:44,599 --> 01:00:48,920
feed it, actually, your OpenAPI spec, and MCP can use this

980
01:00:48,960 --> 01:00:51,599
as a reference, which is great because you know, if

981
01:00:51,639 --> 01:00:55,519
you if your API is constantly updated with the latest

982
01:00:56,639 --> 01:00:59,000
you know, state and translated into OpenAPI, then you

983
01:00:59,000 --> 01:01:01,000
make sure your MCP server is always up to date.

984
01:01:01,400 --> 01:01:05,519
What we found out at Rootly is that because we

985
01:01:05,559 --> 01:01:10,320
work with large corporations like LinkedIn, Canva and Cisco and

986
01:01:10,360 --> 01:01:13,119
so on and so forth, they have, like, very specific

987
01:01:14,079 --> 01:01:17,599
requests in how they want to run their incident management.

988
01:01:17,639 --> 01:01:20,400
So our API is very verbose. We have a lot

989
01:01:20,440 --> 01:01:24,840
of routes to please our customers, and if you expose

990
01:01:24,880 --> 01:01:27,519
all of this to MCP, it's going to get lost

991
01:01:27,559 --> 01:01:30,800
in it, even though it's supposed you know to do this,

992
01:01:30,880 --> 01:01:34,159
So you need to restrict the number of routes that

993
01:01:34,519 --> 01:01:39,719
you expose. And the second thing is even at the

994
01:01:39,800 --> 01:01:43,480
next level in the MCP server chain is within the

995
01:01:43,559 --> 01:01:49,440
client in the editor like what people recommend, you can

996
01:01:49,519 --> 01:01:53,800
have up to five to ten MCP servers. After that

997
01:01:54,519 --> 01:01:57,119
your local agent is going to get lost because again

998
01:01:57,199 --> 01:02:01,199
too much context. So you know that this technology is

999
01:02:01,840 --> 01:02:03,400
I don't know if it's going to mature or something

1000
01:02:03,519 --> 01:02:06,280
is going to replace it. You know, then you need

1001
01:02:06,320 --> 01:02:10,639
to envision maybe something that centralizes these MCP servers into

1002
01:02:10,840 --> 01:02:12,920
a central hub so you don't have to configure like

1003
01:02:12,960 --> 01:02:17,320
fifty of them. But I think it's on the right

1004
01:02:17,400 --> 01:02:21,920
track and I think we see adoption and but but yeah,

1005
01:02:22,320 --> 01:02:26,760
we will see where this moves. OpenAI recently

1006
01:02:26,880 --> 01:02:29,800
announced that they are supporting MCP, which you know is

1007
01:02:30,360 --> 01:02:35,239
is interesting because they're competing with Anthropic. So yeah, I

1008
01:02:35,280 --> 01:02:37,840
think there will be more of this for sure.

1009
01:02:38,159 --> 01:02:40,639
Speaker 1: But I, actually, my pick at the end of the

1010
01:02:40,639 --> 01:02:42,679
episode will actually be related to that. So I think it's

1011
01:02:42,719 --> 01:02:45,360
really interesting that you brought that up. Yeah, I mean

1012
01:02:45,400 --> 01:02:48,639
there's a lot. There's a lot there realistically, and I

1013
01:02:48,880 --> 01:02:50,719
don't like, unless you need it, you probably don't need

1014
01:02:50,719 --> 01:02:53,559
to spend any time looking at MCP, uh. You know,

1015
01:02:53,679 --> 01:02:57,000
it's highly specific here for for agents talking communicating with

1016
01:02:57,039 --> 01:02:59,440
each other. I think the hard problem that we'll get

1017
01:02:59,440 --> 01:03:04,239
to very quickly is, at scale, being concise and meaningful

1018
01:03:04,440 --> 01:03:07,840
and focused on what the business value is is going

1019
01:03:07,880 --> 01:03:10,719
to be even more important. And arguably it has always

1020
01:03:10,719 --> 01:03:13,480
been important, but it's very easy to add another route

1021
01:03:13,559 --> 01:03:17,719
to your open API specification or your you know, your

1022
01:03:17,760 --> 01:03:20,719
your web service or whatever you're running and having users

1023
01:03:20,719 --> 01:03:23,119
should be like, oh, they'll deal with it, right, they'll

1024
01:03:23,119 --> 01:03:25,719
deal with the problem. And I think realistically, you know,

1025
01:03:25,800 --> 01:03:28,880
you want to be as clear and concise about what

1026
01:03:28,920 --> 01:03:31,079
you're offering and what your business is and what the

1027
01:03:31,079 --> 01:03:34,559
product is offering, but still give your customers freedom to

1028
01:03:34,760 --> 01:03:38,239
utilize your product how they want. And now you are

1029
01:03:38,239 --> 01:03:40,960
almost required to make it happen because of limited context

1030
01:03:40,960 --> 01:03:45,199
windows for LLMs, for agents. For MCP is going

1031
01:03:45,239 --> 01:03:46,960
to be even more a problem. I mean, you scared

1032
01:03:47,000 --> 01:03:49,159
me by saying two to five. I feel like if

1033
01:03:49,159 --> 01:03:51,000
you have any more than one, I think you really

1034
01:03:51,079 --> 01:03:52,679
have to question, you know, what the thing is that

1035
01:03:52,719 --> 01:03:55,440
you're fundamentally offering. I mean I do see platforms like

1036
01:03:56,400 --> 01:03:58,440
Atlassian's, where, like, you may have one

1037
01:03:58,480 --> 01:04:01,239
for Jira and one for Confluence, you know, because that's

1038
01:04:01,239 --> 01:04:03,599
like a knowledge base, and there's like the day issues

1039
01:04:03,719 --> 01:04:06,719
and one for maybe the Git server. Each one of

1040
01:04:06,760 --> 01:04:10,119
those could potentially be a different server. You say you're

1041
01:04:10,159 --> 01:04:12,199
not an evangelist, but you are the first person on

1042
01:04:12,239 --> 01:04:15,760
this episode, on this podcast to come on and say MCP,

1043
01:04:16,239 --> 01:04:20,599
so that I think, by definition makes you the evangelist.

1044
01:04:22,159 --> 01:04:25,079
And I think there may be a good a good

1045
01:04:25,159 --> 01:04:27,880
moment to switch over to picks. But before we do that,

1046
01:04:27,880 --> 01:04:29,559
I'll ask you, you know, is there any one last thing

1047
01:04:29,599 --> 01:04:30,440
that you want to share?

1048
01:04:30,960 --> 01:04:35,159
Speaker 2: Yeah, if you are curious about MCP, and you know

1049
01:04:35,159 --> 01:04:37,599
I've been to KubeCon, SREcon, and the vast majority

1050
01:04:37,599 --> 01:04:41,000
of people still don't know about it. We are organizing

1051
01:04:41,119 --> 01:04:46,280
an event on April twenty fourth at GitHub in San Francisco.

1052
01:04:46,360 --> 01:04:53,079
We'll have speakers from Browserbase, Anthropic, OpenAI, GitHub,

1053
01:04:53,880 --> 01:04:56,719
Factory AI, and a lot of other companies. We have

1054
01:04:56,920 --> 01:05:00,360
demos and a panel where we'll go over what the

1055
01:05:00,400 --> 01:05:03,920
heck is MCP and, I think, more broadly, you know,

1056
01:05:03,960 --> 01:05:06,400
as we would chart it, like where what does this

1057
01:05:06,480 --> 01:05:10,679
mean for the industry and where is this going? So yeah,

1058
01:05:10,719 --> 01:05:13,719
if you type MCP Rootly event GitHub on

1059
01:05:13,800 --> 01:05:16,400
Google or perhaps we can share this in the description

1060
01:05:16,480 --> 01:05:17,159
of the episode.

1061
01:05:17,440 --> 01:05:20,840
Speaker 1: Yeah, for sure, there'll be a link. Okay, then I

1062
01:05:20,840 --> 01:05:22,880
think it's a great point to move on to our

1063
01:05:23,079 --> 01:05:27,719
our picks. Uh. So I'll go first. My pick is

1064
01:05:27,760 --> 01:05:32,360
this short article online by Ed Zitron. He has a

1065
01:05:32,400 --> 01:05:36,960
blog and it's called where the Money. Uh. It's he's

1066
01:05:37,039 --> 01:05:40,039
arguing that there's no AI revolution. Uh. You know, if

1067
01:05:40,079 --> 01:05:44,760
you look at companies like Anthropic and Open AI, they're

1068
01:05:45,119 --> 01:05:47,760
funneling tons of money into it and they're not

1069
01:05:47,840 --> 01:05:50,920
getting the value out and so in a way they're

1070
01:05:50,920 --> 01:05:53,760
doing the nice thing of subsidizing all our great AI usage,

1071
01:05:53,800 --> 01:05:55,559
you know, so get it. While the fountain is going.

1072
01:05:56,199 --> 01:05:58,840
Rootly's got a great one, it seems. You know, there

1073
01:05:58,880 --> 01:06:02,039
are ones out there. Uh, it's just it's a really

1074
01:06:02,039 --> 01:06:05,199
great breakdown of you know, how companies are supposed to work,

1075
01:06:05,559 --> 01:06:07,719
how where the money is coming from, you know, where

1076
01:06:07,719 --> 01:06:10,119
it's being spent, and challenging some of those assumptions. So

1077
01:06:10,519 --> 01:06:14,119
if you are only optimistic about everything related to AI,

1078
01:06:14,239 --> 01:06:17,519
I highly recommend reading the article because there's there's a

1079
01:06:17,519 --> 01:06:19,840
bunch of really good points that are made that are

1080
01:06:19,880 --> 01:06:21,199
are hard to argue against.

1081
01:06:23,079 --> 01:06:28,159
Speaker 2: Love it. Yeah, that's an interesting question. Yeah, I think

1082
01:06:28,199 --> 01:06:32,679
the you know a gi and and you know the

1083
01:06:32,760 --> 01:06:35,960
goal of great getting to this great intelligence, so you

1084
01:06:36,119 --> 01:06:38,800
see that, you know, that's why the money is just bringing.

1085
01:06:39,559 --> 01:06:42,920
Speaker 1: Yeah, I mean there is this theory that basically we

1086
01:06:43,079 --> 01:06:46,880
can spend literally all of humanity's resources to achieve this

1087
01:06:46,960 --> 01:06:49,480
because once we have it, it will produce so much value.

1088
01:06:50,239 --> 01:06:53,639
That's you know, that theory hasn't been proven yet, but

1089
01:06:54,119 --> 01:06:56,199
I'll leave it to people to read the article. Who

1090
01:06:56,800 --> 01:06:59,639
he's articulated this much better than I have. Okay, so

1091
01:07:00,079 --> 01:07:01,320
what have you got for us today?

1092
01:07:01,440 --> 01:07:04,480
Speaker 2: Well, it's going to be my pick, an article that

1093
01:07:04,559 --> 01:07:07,480
I wrote, and I know it's going to be controversial,

1094
01:07:07,519 --> 01:07:10,320
which is why I want to share it even better.

1095
01:07:12,320 --> 01:07:14,440
You know, we spoke a lot about this. We didn't

1096
01:07:14,480 --> 01:07:17,000
speak a lot about this in this episode, but online everybody is

1097
01:07:17,000 --> 01:07:22,159
speaking about vibe coding, and so I think what's coming

1098
01:07:22,199 --> 01:07:28,639
for us SREs is incident vibing, because the amount

1099
01:07:28,920 --> 01:07:31,960
of incidents that is going to come our way is

1100
01:07:32,000 --> 01:07:35,039
probably going to increase. And more importantly, I

1101
01:07:35,039 --> 01:07:38,960
think a lot of the fundamentals that make an engineering

1102
01:07:39,039 --> 01:07:43,519
organization solid are going away. A few things. For instance,

1103
01:07:43,800 --> 01:07:47,239
I think a team that knows their code base very well,

1104
01:07:47,920 --> 01:07:50,800
it's kind of going away because humans are not doing

1105
01:07:51,199 --> 01:07:55,079
the coding anymore, right? They are merely, like, reading it,

1106
01:07:55,159 --> 01:07:58,519
doing code review. Perhaps they will, you know, use another LLM,

1107
01:07:58,920 --> 01:08:01,679
another model to do the code review of another model.

1108
01:08:02,519 --> 01:08:05,280
But anyway, I think in general we know that the

1109
01:08:05,320 --> 01:08:06,920
knowledge of the code base is going to go down.

1110
01:08:07,400 --> 01:08:12,000
The other one is having subject matter experts in some fields,

1111
01:08:12,599 --> 01:08:15,480
especially as your company grows. You know, let's say maybe

1112
01:08:16,000 --> 01:08:18,520
you want someone who's, like, very sharp on databases or

1113
01:08:18,600 --> 01:08:21,840
website or whatever it is. And this again is going

1114
01:08:21,840 --> 01:08:25,319
away because of what I've just mentioned, but also because

1115
01:08:26,239 --> 01:08:28,600
I think it's going to be increasingly hard for young

1116
01:08:29,039 --> 01:08:34,840
professionals to gain this experience and this flair that senior

1117
01:08:34,880 --> 01:08:38,520
engineers have. And so what's the solution? I think it's

1118
01:08:38,560 --> 01:08:42,359
incident vibing. And I think it's one of those stories

1119
01:08:42,439 --> 01:08:45,039
where if you cannot beat them, you should join them.

1120
01:08:46,800 --> 01:08:49,039
And so in this article I speak about some

1121
01:08:49,159 --> 01:08:51,840
of the ways that companies can get ready

1122
01:08:51,920 --> 01:08:53,359
with incident vibing.

1123
01:08:54,319 --> 01:08:56,760
Speaker 1: I love it. Well, we'll share that in the picks

1124
01:08:56,840 --> 01:09:00,880
section of the episode. I mean, I both love

1125
01:09:00,920 --> 01:09:03,399
and hate your pick, honestly, because, like, I'm

1126
01:09:03,439 --> 01:09:06,199
so with you that vibe coding is terrible. And if

1127
01:09:06,239 --> 01:09:08,319
we look at the DORA Report, or the episode we

1128
01:09:08,319 --> 01:09:09,960
did on the DORA Report from 2024, we

1129
01:09:10,000 --> 01:09:14,399
see that LLMs sacrifice speed for quality. We also

1130
01:09:14,439 --> 01:09:16,640
know that there's a huge problem coming and companies are

1131
01:09:16,680 --> 01:09:19,079
still adopting it. So you have to live with the outcome,

1132
01:09:19,119 --> 01:09:21,239
Like, even if you are using LLMs as best as

1133
01:09:21,279 --> 01:09:23,920
you can, that means you're gonna get more incidents.

1134
01:09:24,039 --> 01:09:28,600
And so I'm totally with you. I hate that this

1135
01:09:28,680 --> 01:09:31,680
is happening, but, uh, there's no avoiding it. And so

1136
01:09:31,720 --> 01:09:36,000
the next level is also vibing the incident resolution. Okay,

1137
01:09:36,319 --> 01:09:36,680
it is.

1138
01:09:36,720 --> 01:09:42,000
Speaker 2: And we've seen companies, you know, hiring engineers

1139
01:09:42,239 --> 01:09:45,119
and they cannot code, they can

1140
01:09:45,159 --> 01:09:48,680
only prompt. And yeah, whether you like it or not,

1141
01:09:48,760 --> 01:09:51,279
it's happening, it's coming. It's the future of software engineering

1142
01:09:51,359 --> 01:09:54,199
in some capacity. And so I, you know, I just

1143
01:09:54,279 --> 01:09:56,279
think we need to get ready for it. That's the

1144
01:09:56,399 --> 01:09:57,399
only thing you can do.

1145
01:09:58,359 --> 01:10:00,960
Speaker 1: I mean, I love the perspective. You know, it

1146
01:10:01,199 --> 01:10:04,399
doesn't matter if you agree or disagree with utilizing it.

1147
01:10:04,399 --> 01:10:08,159
It's happening. And with that I'll say, thank you

1148
01:10:08,159 --> 01:10:10,920
Sylvain, so much for coming on this episode and sharing

1149
01:10:10,960 --> 01:10:13,319
your perspective and what Rootly has been doing.

1150
01:10:13,720 --> 01:10:14,840
Speaker 2: Thank you for inviting me.

1151
01:10:15,359 --> 01:10:18,000
Speaker 1: Yeah, and thanks to all the listeners and viewers of

1152
01:10:18,039 --> 01:10:18,600
this podcast.

