WEBVTT

1
00:00:05.320 --> 00:00:09.519
Hey everybody, and welcome to another
episode of Ruby Rogues. This week on our

2
00:00:09.599 --> 00:00:13.480
panel, we have Nate Hopkins.
Hello, everybody. Andrew Mason. Hello.

3
00:00:13.839 --> 00:00:16.920
I'm Charles Max Wood from
devchat.tv. And this week we have a

4
00:00:16.920 --> 00:00:20.079
special guest, and that's Kir Shatrov. Kir, do you want to say hi?

5
00:00:20.239 --> 00:00:23.440
Let us know who you are.
Hi, my name is Kir. I'm

6
00:00:23.480 --> 00:00:28.079
a production engineer at Shopify,
where I work on the scalability of the

7
00:00:28.120 --> 00:00:32.600
platform, and I'm based in London. Okay, nice. Now, Shopify

8
00:00:32.679 --> 00:00:36.159
doesn't have to deal with any scalability, right? I mean, they only run

9
00:00:36.240 --> 00:00:39.280
like half the shopping carts on the
web and things like that. Right.

10
00:00:39.600 --> 00:00:44.119
Oh yeah, So I'm curious as
we dive into this. You know,

11
00:00:44.159 --> 00:00:47.359
you gave us a couple of articles. One was on the state of background

12
00:00:47.439 --> 00:00:50.960
jobs. The other one was on
like capacity planning for web apps. I

13
00:00:51.039 --> 00:00:56.359
kind of want to start with this
and dive mostly into when should I start

14
00:00:56.399 --> 00:00:59.600
caring about this? Right? Because
if I have a small app, it

15
00:01:00.600 --> 00:01:04.040
matters a lot less for a while, and then eventually I'll get enough users

16
00:01:04.120 --> 00:01:10.400
or enough people using the capacity to
actually go, all right, now I really

17
00:01:10.400 --> 00:01:12.599
need to start thinking about this.
So, yeah, where do you find

18
00:01:12.640 --> 00:01:17.400
that the cutoff point is for this
kind of thing? Definitely, there is

19
00:01:17.480 --> 00:01:23.799
a lot of talk and technologies that
it's natural for engineers to be super interested

20
00:01:23.799 --> 00:01:30.239
in. But the price of
overengineering things and choosing some solutions that are

21
00:01:30.560 --> 00:01:36.000
maybe too complex at the stage where
your project is right now, that price

22
00:01:36.000 --> 00:01:40.359
can be too high, and often
the most resourceful thing you can do is

23
00:01:40.400 --> 00:01:44.719
just deploy it on Heroku and let it
run, and it will cost a few

24
00:01:44.799 --> 00:01:49.920
hundred dollars for your Heroku bill.
For me, I think the cutoff point is

25
00:01:51.079 --> 00:01:57.879
around the time when you start losing
control of maybe your hosting costs, or

26
00:01:59.079 --> 00:02:07.480
you notice that whatever scalability problems you
have start hurting your customers and you start

27
00:02:07.599 --> 00:02:12.360
losing money, either as a result
of your customers being unhappy or as a

28
00:02:12.400 --> 00:02:17.159
result of the thing costing to run
a lot more than a company kind of

29
00:02:17.280 --> 00:02:22.400
work to run the business in a
reliable way. Yeah, that makes sense.

30
00:02:22.759 --> 00:02:25.800
It's interesting too that you've kind of
tied it to those two practical breakpoints,

31
00:02:25.879 --> 00:02:29.240
right. A lot of people they
try and tie it to well,

32
00:02:29.240 --> 00:02:30.479
I have a certain number of users, or I have a certain size of

33
00:02:30.479 --> 00:02:35.879
an app or I have you know, a certain amount of server capacity or

34
00:02:36.520 --> 00:02:38.919
you know, stuff like that,
and it's it's interesting to me that a

35
00:02:38.960 --> 00:02:42.360
lot of this, you know,
you've tied it back to oh, it's

36
00:02:42.400 --> 00:02:46.039
impacting the customers or oh, you
know, it's it's impacting my bottom line,

37
00:02:46.080 --> 00:02:47.599
and then it's like, oh,
okay, how do I deal with

38
00:02:47.639 --> 00:02:51.560
this? I also think it's interesting
that you mentioned that, you know,

39
00:02:51.639 --> 00:02:53.879
it's easy to do if you just
hand it off to Heroku and let them

40
00:02:53.919 --> 00:02:58.919
handle it. And I know that
I haven't heard it as much from Nate,

41
00:02:58.960 --> 00:03:01.840
but I've definitely heard it from Eric
over at CodeFund that that's kind

42
00:03:01.840 --> 00:03:05.039
of his approach. He doesn't want
to deal with DevOps. He just wants

43
00:03:05.080 --> 00:03:07.960
to push it to the cloud and
then, you know, let them handle

44
00:03:07.000 --> 00:03:09.400
it, and he's willing to pay
for Heroku to do it. Yeah,

45
00:03:09.439 --> 00:03:13.840
that's our philosophy right now.
But, I mean, we're also short-staffed,

46
00:03:14.000 --> 00:03:16.479
right? Yeah, so we've got two,
well really, we're just one and a

47
00:03:16.520 --> 00:03:23.800
half developers on the project. Other
than we've got plenty of contributors that help

48
00:03:23.879 --> 00:03:27.319
us fix bugs and things like that, but there's only two of us that

49
00:03:27.360 --> 00:03:30.759
are full time. You know,
looking at code, and Eric's really only

50
00:03:30.759 --> 00:03:35.240
about halftime looking at code, if
that right, So we don't have the

51
00:03:35.280 --> 00:03:40.560
time or the bandwidth to really delve
deep into, you know, the ops

52
00:03:40.560 --> 00:03:45.800
story. That makes a lot of
sense. So I'm curious, Nate,

53
00:03:45.840 --> 00:03:49.840
at what point would you guys consider
moving off of Heroku? I mean,

54
00:03:49.919 --> 00:03:52.479
would it be a cost thing or
would it be something else? You know,

55
00:03:52.520 --> 00:03:57.879
we've found product-market fit
and we are trying to scale it

56
00:03:57.919 --> 00:04:00.599
now. We're trying to scale on
the sales side. So as soon as

57
00:04:00.639 --> 00:04:05.800
we have enough customers and enough consistent
revenue flowing in to allow us to kind

58
00:04:05.800 --> 00:04:11.319
of back off and look at our
operations story, that's probably the time.

59
00:04:11.439 --> 00:04:15.160
So I would say we're probably maybe
six months away from you know, having

60
00:04:15.199 --> 00:04:17.920
the luxury of being able to look at
that. Yeah, that makes sense.

61
00:04:18.519 --> 00:04:21.879
So Kir, as somebody gets to that
point, you know, and I think

62
00:04:21.959 --> 00:04:26.879
this might be a relevant conversation then
for Nate. But you know, when

63
00:04:26.879 --> 00:04:29.480
they get to that point and they're
thinking, Okay, we're going to scale

64
00:04:29.560 --> 00:04:32.639
this, maybe they move it off
of Heroku and onto you know, a

65
00:04:32.720 --> 00:04:36.879
Kubernetes cluster, or they move it
on to you know, a virtual private

66
00:04:36.920 --> 00:04:41.800
server, something like DigitalOcean or
something. What things should they be looking

67
00:04:41.800 --> 00:04:46.639
at then to scale their stuff
up. For any hosted services, like

68
00:04:46.680 --> 00:04:50.360
for instance, it's common to use
hosted database as a service, I think

69
00:04:50.399 --> 00:04:57.040
it's important to look at whatever limitation
that service provides, because any hosted service

70
00:04:57.079 --> 00:05:01.639
would have some of those. I
remember reading a blog post where an app

71
00:05:01.680 --> 00:05:09.680
had a very specific requirement for some
Postgres extension that they'd been using, and

72
00:05:09.759 --> 00:05:15.279
they switched, I think, three providers
that gave them Postgres as a service, and they'd

73
00:05:15.319 --> 00:05:19.079
been unhappy with each and they obviously
spent a lot of effort, and finally

74
00:05:19.120 --> 00:05:25.759
they got to run Postgres on their
own because having that very extension and requirement

75
00:05:26.199 --> 00:05:30.240
that was a huge point for them
when choosing a provider like that. It's

76
00:05:30.240 --> 00:05:36.279
important to understand any limitations. And
from another angle, I think there

77
00:05:36.959 --> 00:05:44.480
are so many scalability-related problems
that you can run into that usually

78
00:05:44.759 --> 00:05:48.120
you start looking at the one that's
most critical right now. Like, I've

79
00:05:48.160 --> 00:05:56.480
been part of projects where they've run
into scalability issues with the database layer with

80
00:05:56.600 --> 00:06:01.399
MySQL or with Postgres, and as
they fixed it and iterated on it

81
00:06:01.560 --> 00:06:08.800
and their database could accept a lot
more load. They came to another bottleneck,

82
00:06:09.240 --> 00:06:14.000
and that bottleneck is different every time, depending on the business, depending

83
00:06:14.040 --> 00:06:19.680
on your patterns of the usage that's
coming from your customers. So it's fixing

84
00:06:20.079 --> 00:06:24.959
one thing at a time, one
by one, and sometimes that's a never

85
00:06:25.279 --> 00:06:30.160
ending story, especially if the company
grows large and there is a team that works

86
00:06:30.279 --> 00:06:33.759
just on scalability, which is currently
the case for my team at Shopify.

87
00:06:34.240 --> 00:06:38.920
Yeah, that's a terrific point in
terms of really, this is not a

88
00:06:39.000 --> 00:06:43.639
job that ever completes, right,
It's something that you're always having to stay

89
00:06:43.680 --> 00:06:46.920
on top of, especially if
the company is enjoying any level of success.

90
00:06:46.319 --> 00:06:49.759
One cool thing about CodeFund is
that even though we're on Heroku,

91
00:06:49.800 --> 00:06:54.000
we're able to leverage some of the
more advanced Postgres features like

92
00:06:54.040 --> 00:06:58.160
table partitioning and things like that,
which has enabled us to continue to scale

93
00:06:58.199 --> 00:07:02.240
on that platform. We're hosted on
one hundred and sixty plus sites right now,

94
00:07:02.720 --> 00:07:08.519
and so we're seeing between two and
a half million and three million requests

95
00:07:08.560 --> 00:07:12.240
a day pipe through the server.
Now. We are paying a premium for

96
00:07:12.360 --> 00:07:16.839
Heroku, but we're still I think
we're under eight hundred a month on our

97
00:07:17.079 --> 00:07:23.319
on our production setup, and we're
probably a little over provisioned in anticipation of

98
00:07:23.319 --> 00:07:26.759
spikes and things like that, and
so we don't quite have the fine tuned

99
00:07:26.839 --> 00:07:30.759
control that we would like to have. To your point on Postgres, if you

100
00:07:31.040 --> 00:07:35.160
want to customize that and install your
own plugins and things like that into the

101
00:07:35.240 --> 00:07:40.720
database layer, that would be something
that would be fantastic because since we are

102
00:07:40.800 --> 00:07:45.720
using table partitioning, I know there's
some plugins that just are not broadly available

103
00:07:45.720 --> 00:07:48.160
on the Heroku platform that would be
kind of a luxury to use for us

104
00:07:48.639 --> 00:07:54.079
that we've kind of had to work
our way around some of those things.

105
00:07:54.639 --> 00:07:59.519
I'm curious about your experience and time
with Shopify. How long have you been

106
00:07:59.560 --> 00:08:03.839
with the team and what types of
changes have happened since you've been at the

107
00:08:03.879 --> 00:08:09.680
company. I've been at Shopify for
almost four years, and I've always been

108
00:08:09.759 --> 00:08:16.959
part of the production engineering department,
which deals with the infrastructure and is less

109
00:08:16.959 --> 00:08:24.000
exposed to the product. And just
that department grew so much from maybe while

110
00:08:24.319 --> 00:08:28.240
I've been here, from maybe thirty
people to now more than one hundred,

111
00:08:28.639 --> 00:08:35.440
and all of those people are working
on the infrastructure and reliability, and with

112
00:08:35.559 --> 00:08:39.919
the motto that our job
is to keep the site up. There's

113
00:08:39.919 --> 00:08:43.360
another aspect of scaling here, going
from forty to one hundred people, Like

114
00:08:43.840 --> 00:08:48.120
how has the team scaled? Like
what's the dynamic been? Like, Yeah,

115
00:08:48.240 --> 00:08:54.720
it's interesting to follow dynamics in terms
of team scaling in every organization,

116
00:08:54.919 --> 00:09:01.120
and I imagine it's a different story. It affected so many things. Like

117
00:09:01.159 --> 00:09:05.600
for instance, at the time when
I joined, and Shopify is based in

118
00:09:05.639 --> 00:09:13.000
Canada, most of the infrastructure engineers were
in just one office. Now people who work

119
00:09:13.039 --> 00:09:16.639
on the infrastructure are based in three
offices, and there is also a lot

120
00:09:16.679 --> 00:09:22.600
of remote people like me. And
then as you grow, you end up

121
00:09:22.639 --> 00:09:28.039
investing into some of the things that
you would never have invested in before, and have teams

122
00:09:28.360 --> 00:09:35.360
who work just on one part of
development environment for instance, or just on

123
00:09:35.879 --> 00:09:41.440
background jobs infrastructure, something that I
wouldn't have imagined three years ago. So

124
00:09:41.639 --> 00:09:46.120
what is the technical portfolio for Shopify
around and like how has it changed since

125
00:09:46.159 --> 00:09:50.960
you joined? Obviously that's a great
question. There's been a lot of new

126
00:09:52.000 --> 00:09:54.679
tools and techniques and stuff that have
come out, but you know, just

127
00:09:54.720 --> 00:09:58.000
over the last four years, and
so I'm curious what the evolution of tooling

128
00:09:58.200 --> 00:10:03.679
has looked like, Yeah, that's
a great point of discussion. So I

129
00:10:03.720 --> 00:10:07.519
think first there is something I wanted
to give the context to our listeners.

130
00:10:07.559 --> 00:10:13.600
First is that when Shopify was founded
about twelve years ago by Tobi Lütke,

131
00:10:15.000 --> 00:10:22.039
Tobi was one of the first contributors
to Rails, and he knew David, DHH, and

132
00:10:22.159 --> 00:10:26.200
they exchanged some emails and around the
time when he started company, when he

133
00:10:26.240 --> 00:10:31.519
started Shopify on Rails, Rails was
just a zip file that they exchanged over

134
00:10:31.720 --> 00:10:39.120
an email. It wasn't even some
specific version published on a gem server, because

135
00:10:39.320 --> 00:10:43.039
I'm not even sure if
there were any gem servers at that point.

136
00:10:43.519 --> 00:10:50.840
So from that day when he started
on rails, that app still exists.

137
00:10:50.879 --> 00:10:54.440
It was never rewritten. It's a
monolith that has been around for more

138
00:10:54.480 --> 00:10:58.879
than a decade. We tend to
put a lot of love into it to

139
00:10:58.960 --> 00:11:05.080
make sure that the developer experience stays great. It often happens that a monolith

140
00:11:05.200 --> 00:11:09.759
is just too slow and too hard
to work with that developers get so much

141
00:11:09.799 --> 00:11:18.159
friction and decide to go splitting it or
calling the monolith legacy. That never

142
00:11:18.279 --> 00:11:22.039
happened for us. I've got to
interject and just ask a question on your

143
00:11:22.039 --> 00:11:24.879
monolith in terms of, like I
know Shopify is a very large company,

144
00:11:24.919 --> 00:11:30.960
how many developers have their hands in
the monolithic code base? My rough guess would

145
00:11:31.000 --> 00:11:37.679
be from one hundred to two hundred
people, given that R and D in

146
00:11:37.720 --> 00:11:43.320
total is a lot more because there
would always be people working on other parts

147
00:11:43.360 --> 00:11:46.039
of the stack, also mobile developers and
so on as you can imagine. So

148
00:11:46.159 --> 00:11:50.960
back to your point about how has
the stack changed in terms of tools that

149
00:11:52.000 --> 00:11:58.720
are familiar to listeners of our podcast, it's still pretty much a classic Rails

150
00:11:58.840 --> 00:12:03.000
app with all the things that come
with it. In terms of the infrastructure,

151
00:12:03.120 --> 00:12:07.679
I think the biggest shift that I
have observed at the company was the move

152
00:12:07.759 --> 00:12:15.440
from physical data centers to the cloud
to Kubernetes. And that's another interesting

153
00:12:15.440 --> 00:12:20.879
story because we were able to move
to Kubernetes in the cloud one shop at a

154
00:12:20.919 --> 00:12:24.679
time. Given that we have millions
of them, we wanted to make this

155
00:12:24.080 --> 00:12:30.480
process as continuous and finely controlled as
possible, so we just took one shop,

156
00:12:30.559 --> 00:12:33.720
moved it to cloud and progressed and
we were able to control that.

157
00:12:35.519 --> 00:12:41.759
It's fascinating to me that you have
upwards of two hundred developers working on a

158
00:12:41.799 --> 00:12:46.639
monolithic Rails code base. Like, some
conventional wisdom that I've heard in other circles

159
00:12:46.639 --> 00:12:50.279
and certainly bumped into in my career
has been that if you're going to scale

160
00:12:50.519 --> 00:12:56.799
your organization, you apply Conway's law and
break out into microservices. And the

161
00:12:56.799 --> 00:13:01.159
conventional wisdom seems to be that that's
really the only way to do it,

162
00:13:01.519 --> 00:13:05.440
and you guys are a terrific
counterpoint to that. What are some techniques

163
00:13:05.480 --> 00:13:11.200
you've used to facilitate it. I
think one of the biggest has been adopting

164
00:13:13.000 --> 00:13:18.679
domain-driven development and splitting that
monolith into, I would not call them

165
00:13:18.720 --> 00:13:22.360
namespaces, but kind of components,
at least that's what we call them.

166
00:13:22.600 --> 00:13:26.600
There is nothing very secret or special
about it. It's basically just a way

167
00:13:26.639 --> 00:13:33.720
to structure your app directory so that
each team, each component gets their part.

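NOTE Editor's sketch, not Shopify's tooling: if each component owns a directory under components/, a file path alone is enough to find the owning team. The OWNERS table, the channel names, and the component_of and owner_for helpers are all hypothetical.
```ruby
# Hypothetical ownership lookup keyed by component directory (illustration only).
OWNERS = {
  "support" => { team: "Support", slack: "#support-alerts" },
  "checkout" => { team: "Checkout", slack: "#checkout-alerts" }
}.freeze
# Extract the component name from a path like components/<name>/app/models/x.rb.
def component_of(path)
  match = path.match(%r{(?:\A|/)components/([^/]+)/})
  match && match[1]
end
# Return the owner record for a path, or nil when the path is outside any known component.
def owner_for(path)
  OWNERS[component_of(path)]
end
p owner_for("components/support/app/models/ticket.rb")
p owner_for("lib/tasks/cleanup.rake")
```
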
168
00:13:35.200 --> 00:13:39.440
Therefore, it helps a lot to
establish the ownership because, for instance,

169
00:13:39.480 --> 00:13:43.639
as soon as you see an exception
in production in whichever exception

170
00:13:43.720 --> 00:13:48.639
tracking service you use, you
see that the exception is coming from components slash

171
00:13:48.039 --> 00:13:52.080
support slash app slash models slash something. You immediately know that

172
00:13:52.600 --> 00:13:58.000
it's the support component, and you have all
the metadata to find people who can help

173
00:13:58.039 --> 00:14:03.559
with that, even an on-call
escalation or a Slack channel where you can

174
00:14:05.240 --> 00:14:09.639
chat and point it out. And we
started leveraging that to

175
00:14:09.720 --> 00:14:15.480
automate some other things like, for
instance, if an exception within the app happened

176
00:14:15.559 --> 00:14:20.440
in that component, we'll send a
notification to their Slack channel, not to

177
00:14:20.519 --> 00:14:26.279
some generic Slack channel with tons of
exceptions from all over the company. Establishing

178
00:14:26.320 --> 00:14:30.759
that ownership is, I would say,
the main technique. Okay, so domain

179
00:14:31.000 --> 00:14:33.720
kind of a domain driven design,
and then you give a team like full

180
00:14:33.759 --> 00:14:39.120
stack responsibility or at least all the
areas of the stack that that particular domain

181
00:14:39.159 --> 00:14:43.240
piece may touch, right, so
that could slice all the way through front

182
00:14:43.279 --> 00:14:46.600
end, all the way down into
the model layer. Yeah, it's not

183
00:14:48.080 --> 00:14:54.080
as strict as you can imagine,
and there would always be cases of reaching

184
00:14:54.120 --> 00:15:00.159
out directly from one active record model
to another through components, through different domains,

185
00:15:00.600 --> 00:15:03.799
and that's not great. We try
to build tools to discourage people from

186
00:15:05.360 --> 00:15:09.440
doing that and for them to know
what are the right patterns. Like for

187
00:15:09.600 --> 00:15:15.840
us, it's mostly entry points that
are well that are typed and declared and

188
00:15:16.200 --> 00:15:18.320
documented. So this is kind of
shifting gears a little bit. I'm really

189
00:15:18.440 --> 00:15:26.960
curious about the database infrastructure because I
know on Shopify, essentially you've sharded the

190
00:15:28.039 --> 00:15:31.480
database, or maybe not sharded, but
there's multiple instances of the database, right

191
00:15:31.519 --> 00:15:35.759
that all back this.
How is that structured? And how do

192
00:15:35.799 --> 00:15:39.080
you manage that from an ops perspective? Oh yeah, that's also a great

193
00:15:39.480 --> 00:15:43.519
discussion point. So also to give
some of the context to the listeners.

194
00:15:45.159 --> 00:15:52.799
For all well-known Rails companies like
Shopify, GitHub, Basecamp, to name a

195
00:15:52.840 --> 00:15:58.480
few, that were founded around ten years
ago, at that time MySQL

196
00:15:58.679 --> 00:16:04.600
was the best-known database that everyone
knew how to run and operate. People

197
00:16:04.679 --> 00:16:10.960
were the most familiar with it, and some
others, like Postgres, were maybe not as good

198
00:16:11.080 --> 00:16:18.120
or as established at that point.
So that's one huge reason why this subset

199
00:16:18.159 --> 00:16:22.320
of companies, including us, are
all based on MySQL. And yeah,

200
00:16:22.559 --> 00:16:29.600
I think it was around twenty
fourteen or twenty fifteen when we realized we

201
00:16:30.240 --> 00:16:34.600
could no longer fit everything into one
DB. We figured out we had to

202
00:16:34.679 --> 00:16:41.240
find a way to scale horizontally,
and for a multi tenant SaaS application,

203
00:16:41.120 --> 00:16:48.120
there is a great way to do
that. Since your tenants are always isolated,

204
00:16:48.519 --> 00:16:52.879
you don't
have any joins between multiple tenants,

205
00:16:52.000 --> 00:17:00.679
so you can put tenants on different
shards, on different partitions, and manage those

206
00:17:00.720 --> 00:17:06.119
independently, which also reduces the blast
radius. If you have a hundred shards and

207
00:17:06.400 --> 00:17:11.880
one is down for whatever reason,
only one percent of your customers are getting

208
00:17:12.160 --> 00:17:17.839
some negative experience, and you go
and fix that as soon as possible.

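NOTE Editor's sketch, not Shopify's actual code: a minimal Ruby illustration of the tenant-to-shard idea described above, assuming a plain in-memory lookup table. The names ShardRouter, assign, shard_for and blast_radius are hypothetical. With a thousand shops spread evenly over a hundred shards, losing one shard affects about one percent of shops.
```ruby
# Hypothetical tenant-to-shard router (illustration only).
class ShardRouter
  def initialize(shard_count)
    @shard_count = shard_count
    @lookup = {} # shop_id => shard index; stands in for a global lookup table
  end
  # Pin a shop to a shard. Tenants never join across shops, so a shop lives on exactly one shard.
  def assign(shop_id, shard)
    raise ArgumentError, "unknown shard #{shard}" unless (0...@shard_count).cover?(shard)
    @lookup[shop_id] = shard
  end
  # Resolve the shard for a shop, for example at the start of a request.
  def shard_for(shop_id)
    @lookup.fetch(shop_id) { raise KeyError, "shop #{shop_id} has no shard" }
  end
  # Fraction of shops affected when the given shards are down.
  def blast_radius(down_shards)
    return 0.0 if @lookup.empty?
    affected = @lookup.count { |_shop, shard| down_shards.include?(shard) }
    affected.to_f / @lookup.size
  end
end
router = ShardRouter.new(100)
1000.times { |shop_id| router.assign(shop_id, shop_id % 100) }
puts router.shard_for(42)
puts router.blast_radius([7])
```
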
209
00:17:17.920 --> 00:17:22.119
But it's not all of the platform. So we invested a lot into

210
00:17:22.240 --> 00:17:27.839
sharding. In terms of application logic, it's mostly done on the Rails layer.

211
00:17:29.119 --> 00:17:36.160
We have a Rails team at Shopify
that helps to steer that into

212
00:17:36.319 --> 00:17:41.000
the best direction possible, at least
from the Rails point of view. And from

213
00:17:41.039 --> 00:17:47.799
the ops point of view, it's
just a lot of shards that

214
00:17:47.880 --> 00:17:53.279
can be located even in different regions, which also allows us to isolate

215
00:17:53.440 --> 00:18:00.039
some tenants geographically. So let me
just recap to see if I've got the

216
00:18:00.079 --> 00:18:04.319
picture in my mind correct. So
we've got a rails monolith that's kind of

217
00:18:04.400 --> 00:18:10.880
structured with kind of these domain areas
of responsibility. That's how you structure your

218
00:18:10.920 --> 00:18:15.039
teams and the way you've scaled this
at least up to this point in the

219
00:18:15.039 --> 00:18:18.440
conversation is, you're just dealing with,
like just mountains and mountains of data,

220
00:18:18.480 --> 00:18:25.799
So you've sharded your multi tenancy across
different database nodes. For the developer,

221
00:18:26.319 --> 00:18:30.400
it can just look like a typical
rails application, correct, And something to

222
00:18:30.400 --> 00:18:37.519
add is that our goal is
to make all that sharding complexity hidden

223
00:18:37.599 --> 00:18:41.680
away from developers who write product features.
for them. It may feel like there

224
00:18:41.759 --> 00:18:48.359
is just a database with a lot
of tables that represent the business model,

225
00:18:48.799 --> 00:18:53.960
but underneath there would be some smart
shard selection that would happen at the beginning

226
00:18:53.960 --> 00:18:59.759
of the request, for instance,
that would select the right database. And

227
00:19:00.119 --> 00:19:07.519
I mentioned this just for MySQL,
for the relational database, but we've realized that

228
00:19:08.000 --> 00:19:14.480
it makes no sense to have sharded
MySQL but just one global

229
00:19:14.519 --> 00:19:19.039
Redis, because regardless of how well you
shard, that one global Redis or that one

230
00:19:19.079 --> 00:19:23.599
global Memcached would still be a single
point of failure. And as you can

231
00:19:23.680 --> 00:19:30.759
imagine, we learned that lesson by
experiencing those single points of failure. So

232
00:19:30.519 --> 00:19:37.880
our philosophy is that every resource would
be sharded, so there would be a

233
00:19:37.000 --> 00:19:41.960
smaller instance of Shopify that has its
own MySQL, that has its own Redis,

234
00:19:41.160 --> 00:19:48.839
that has its own Memcached, and that helps
with this isolation. So with each web

235
00:19:48.880 --> 00:19:53.319
server essentially or maybe partition of web
servers that scale horizontally, all of those

236
00:19:53.359 --> 00:19:57.799
would not necessarily have a local copy
of Memcached and Redis, but

237
00:19:57.960 --> 00:20:03.240
maybe just a shared one for that cluster
of web servers. One thing I should

238
00:20:03.240 --> 00:20:07.279
note is that stuff like web servers, it's still all shared capacity, and

239
00:20:07.759 --> 00:20:15.119
it's mostly it's only resources that are
isolated. So any web server can talk

240
00:20:15.200 --> 00:20:22.039
to any partition or any
like smaller instance of Shopify, it's mostly

241
00:20:22.920 --> 00:20:30.039
the matter of selecting the right path
depending on who the customer is. So now

242
00:20:30.079 --> 00:20:37.599
I'm a little curious in terms of
because there's obviously a pretty significant coordination piece

243
00:20:37.599 --> 00:20:42.119
there. You know, when the
request initially comes in and then you assign

244
00:20:42.279 --> 00:20:47.759
the correct Memcached server, the
correct Redis server, and the correct

245
00:20:47.880 --> 00:20:52.200
MySQL server. How much of that
infrastructure did you guys have to build at Shopify,

246
00:20:52.240 --> 00:20:56.759
and how much are you leaning on
the database providers for those things? Honestly,

247
00:20:56.759 --> 00:21:00.400
I think it's mostly all built
in-house. And to give a bit

248
00:21:00.400 --> 00:21:06.799
of context about that, it's mainly
a component called Sorting Hat. I like the

249
00:21:06.920 --> 00:21:15.160
name. Sorting Hat
is using a global lookup table to

250
00:21:15.640 --> 00:21:19.960
find which domain, which shop
is on which partition. It gets the

251
00:21:21.000 --> 00:21:26.640
partition, and then it goes to the
location of that partition, which can be US West,

252
00:21:26.000 --> 00:21:32.599
Central, US East, somewhere else, and then it just hits the

253
00:21:32.680 --> 00:21:37.519
right database located in that region,
and that's all routed through Rails and

254
00:21:37.720 --> 00:21:45.880
mostly through HTTP headers. And
what I find very interesting is that we

255
00:21:45.880 --> 00:21:49.240
were able to build all of that
on top of NGINX, since NGINX

256
00:21:49.279 --> 00:21:56.039
allows you to write scriptable Lua modules
where you can implement any kind of logic

257
00:21:56.319 --> 00:22:00.200
in those Lua modules. In NGINX,
you can query your database to

258
00:22:00.240 --> 00:22:07.240
look up where that tenant lives, and then you just proxy that through

259
00:22:07.279 --> 00:22:12.799
NGINX and you manipulate the headers
and just make this work. So it's

260
00:22:14.240 --> 00:22:18.680
quite a lot of infrastructure that we
had to write. But at the same

261
00:22:18.720 --> 00:22:22.880
time, as I talked to
different companies, it's all custom-tailored and

262
00:22:22.920 --> 00:22:27.519
there is rarely the
same stack, same use case. So

263
00:22:27.640 --> 00:22:32.839
that would be a bit
hard to share

264
00:22:32.880 --> 00:22:37.759
and abstract. So yeah, how
much of that infrastructure tooling is open source? Is

265
00:22:37.799 --> 00:22:41.440
that all secret sauce internal stuff,
or have you open sourced some of it?

266
00:22:41.880 --> 00:22:48.599
We try to open source quite a
few things. There is also a lot

267
00:22:48.640 --> 00:22:55.000
of conference talks that we'll link in the
show notes that give a way better overview of

268
00:22:55.119 --> 00:23:02.119
the architecture than I just explained.
The routing layer itself, I wouldn't say

269
00:23:02.119 --> 00:23:07.480
it's open sourced, but there is
lots of information out there for someone who

270
00:23:07.680 --> 00:23:11.279
would want to build and use
the same techniques. So that's probably a good

271
00:23:11.279 --> 00:23:18.759
segue into you know, additional scaling
aspects. So you've addressed a lot

272
00:23:18.799 --> 00:23:23.000
of the persistence layer pretty much the
entire persistence layer horizontal scalability, but you

273
00:23:23.119 --> 00:23:26.960
still have response times to deal with, right, And so one way to

274
00:23:26.960 --> 00:23:33.119
make response times fast is through background
jobs. And I know you've got quite

275
00:23:33.119 --> 00:23:38.759
a bit of expertise there. What
is the approach and architecture of Shopify's background

276
00:23:38.920 --> 00:23:42.960
job system. Well, and just
to pile on here real quick, it

277
00:23:44.039 --> 00:23:48.240
seems like when people start talking about
scaling Ruby or Rails apps or Sinatra apps

278
00:23:48.319 --> 00:23:52.079
or whatever, this is one of
the first things people reach for, right

279
00:23:52.599 --> 00:23:56.000
because any long running task they just
you know, shunt it off to a background

280
00:23:56.039 --> 00:24:00.200
job and you know, report errors
back to the user if they have to,

281
00:24:02.079 --> 00:24:06.400
and it shortens the response time because
then it's hey, go do this

282
00:24:06.519 --> 00:24:08.680
job instead of I'm going to grind
through the work of doing this job.

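NOTE Editor's sketch, assuming nothing beyond Ruby's standard library: the request path enqueues the slow work and returns immediately, while a background thread drains the queue and records errors so they can be reported later. BackgroundWorker and its methods are hypothetical names, not a Rails, Resque, or Sidekiq API.
```ruby
# Hypothetical in-process job runner built on Thread and Queue (illustration only).
class BackgroundWorker
  attr_reader :results
  def initialize
    @queue = Queue.new
    @results = Queue.new
    @thread = Thread.new do
      # Queue#pop blocks until a job arrives and returns nil once the queue is closed and empty.
      while (job = @queue.pop)
        begin
          @results << [:ok, job.call]
        rescue StandardError => e
          @results << [:error, e.message] # report failures instead of killing the thread
        end
      end
    end
  end
  # Called from the request path: cheap, returns right away.
  def enqueue(&job)
    @queue << job
    :enqueued
  end
  # Stop accepting jobs and wait for the remaining ones to finish.
  def shutdown
    @queue.close
    @thread.join
  end
end
worker = BackgroundWorker.new
status = worker.enqueue { sleep 0.1; "report built" } # slow work leaves the request path
puts status
worker.shutdown
p worker.results.pop
```
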
283
00:24:10.079 --> 00:24:12.240
Yeah, and before you jump in
with an answer too, I mean one

284
00:24:12.240 --> 00:24:17.640
thing to bear in mind is like
some of the stuff is just it's baked

285
00:24:17.680 --> 00:24:21.279
into Rails with Active Job. You
don't even have to set up Redis

286
00:24:21.319 --> 00:24:22.960
or anything like that to support
it, right? It'll run it on

287
00:24:23.000 --> 00:24:29.079
a background thread out of the box. So what is the path for a developer,

288
00:24:29.400 --> 00:24:33.480
to kind of Chuck's lead-in question?
You start on a small project that's maybe

289
00:24:33.559 --> 00:24:36.039
a little hobby thing, and it
starts to get some traction and then maybe

290
00:24:36.039 --> 00:24:40.279
it turns into a business. What
does the evolution of kind of evolving that

291
00:24:40.599 --> 00:24:45.519
background job handling look like over time? Oh yeah. And to note,

292
00:24:45.119 --> 00:24:52.000
myself, for some of the side projects, I run background jobs exactly in the

293
00:24:52.119 --> 00:24:56.119
background thread in those Puma processes.
Yeah, just because it makes no sense

294
00:24:56.160 --> 00:25:02.039
to pay extra for, for instance,
Sidekiq dynos on Heroku for those pet

295
00:25:02.079 --> 00:25:07.960
projects. And exactly as you pointed
out, it makes sense to start with

296
00:25:07.119 --> 00:25:15.079
something as brutal as a background thread, and then I'm really happy that Ruby

297
00:25:15.119 --> 00:25:22.160
community has a project like Sidekiq and
Mike Perham who is behind that project,

298
00:25:22.599 --> 00:25:30.960
who has pushed the community to adopt
some best practices around background jobs, and

299
00:25:30.480 --> 00:25:37.440
also offers ninety-nine percent of what the
community needs as an open source project,

300
00:25:37.880 --> 00:25:41.319
and for the remaining one percent, when you get to that point,

301
00:25:41.839 --> 00:25:47.720
you can buy a Pro or an
Enterprise edition, and I'm pretty sure that

302
00:25:47.839 --> 00:25:52.480
when anyone is at that point,
that's actually quite an affordable software to buy.

303
00:25:52.680 --> 00:25:59.480
As a company, and just like
most of the community who is using

304
00:26:00.400 --> 00:26:07.359
Sidekiq, Shopify is very similar in
terms of setup. Because we've been around

305
00:26:07.440 --> 00:26:12.440
for such a
long time, we started with Resque, if

306
00:26:12.519 --> 00:26:21.039
anyone remembers. That was a
pre-Sidekiq-era library to basically achieve the same thing.

307
00:26:21.640 --> 00:26:27.400
So we still run Resque, we
run Redis. We got to rewrite

308
00:26:27.519 --> 00:26:34.279
most of Resque's internals because we're
multi-tenant and we want to share some of

309
00:26:34.319 --> 00:26:41.359
the capacity and reuse that between tenants, which we can dive into if

310
00:26:41.359 --> 00:26:45.119
you'd like later. I guess the
first question from you and from some of

311
00:26:45.160 --> 00:26:51.200
the listeners could be why we're not
on Sidekiq. And the answer I would

312
00:26:51.240 --> 00:26:56.839
say is mostly the legacy part and
also how much we know the stack and

313
00:26:56.039 --> 00:27:00.240
how much we've customized it for us
at this point. But we're also

314
00:27:00.480 --> 00:27:04.319
starting some smaller apps at the company, some smaller Rails apps. In fact,

315
00:27:04.720 --> 00:27:08.599
in addition to the monolith, we
probably have a couple hundred other smaller

316
00:27:08.720 --> 00:27:15.920
Rails services for something very specific or
maybe something just employee facing, and all

317
00:27:15.960 --> 00:27:22.079
of that would use the recommended set
of libraries that includes Sidekiq. Yeah,

318
00:27:22.200 --> 00:27:26.599
that makes sense. I'm also working
on a software as a service. I'm

319
00:27:26.640 --> 00:27:30.960
sponsoring one of the bigger conferences that
serves that niche podcasting in August, and

320
00:27:32.000 --> 00:27:34.680
so I anticipate that things are going
to grow. And yeah, I have

321
00:27:34.720 --> 00:27:37.599
a lot of things that I am
pushing into the background jobs right now just

322
00:27:37.640 --> 00:27:41.799
because you know, I want to
get the response times down. But one

323
00:27:41.799 --> 00:27:45.400
thing that I'm wondering about, and
I'm kind of tempted to go with Heroku,

324
00:27:45.519 --> 00:27:48.160
but part of me, I don't
know, I have this mental block

325
00:27:48.200 --> 00:27:53.240
about paying for something that I could
probably figure out the scaling on myself or

326
00:27:53.279 --> 00:27:57.160
at least do some you know,
a couple of minor things to help with

327
00:27:57.200 --> 00:28:02.200
the performance and scaling that way.
So what should I be looking at next.

328
00:28:02.200 --> 00:28:03.880
It seems like you all have kind
of gone toward the cloud, and

329
00:28:03.920 --> 00:28:07.920
I'm wondering if that's the right answer, or you know, beyond background jobs,

330
00:28:07.960 --> 00:28:14.279
what's the next step? A step
to reduce response time? No,

331
00:28:14.400 --> 00:28:15.880
it's more a step to just get
it to scale, you know,

332
00:28:15.960 --> 00:28:19.680
get that you know, be able
to handle more traffic without having the site

333
00:28:19.720 --> 00:28:25.920
slow down. Right, there would
always be some kind of bottleneck, which

334
00:28:26.200 --> 00:28:32.400
depending on whether you have a
good setup of tools, should be possible

335
00:28:32.519 --> 00:28:37.599
to find. And for us,
that bottleneck has changed over time,

336
00:28:37.480 --> 00:28:45.160
And I would guess there is no
single answer because maybe there is something in

337
00:28:45.160 --> 00:28:48.519
a web server, in a controller
still spending quite a lot of time which

338
00:28:48.039 --> 00:28:55.039
slows down the response time.
Or maybe it's the database that's the bottleneck,

339
00:28:55.160 --> 00:29:02.039
or maybe it's Redis, or maybe
the Rails app reaches out to some external

340
00:29:02.119 --> 00:29:07.759
service that is not located too close
to it, which increases latency and also

341
00:29:07.839 --> 00:29:12.920
impacts response time. Yeah, that
makes sense. I'm curious what criteria you

342
00:29:14.039 --> 00:29:18.319
use to determine what should move into
a background job. Obviously you may hit

343
00:29:18.400 --> 00:29:23.440
some latency on a particular request and
see something that is kind of low hanging

344
00:29:23.480 --> 00:29:26.000
fruit to move to a background job. But just because you moved it to

345
00:29:26.039 --> 00:29:30.640
a background job doesn't mean you've actually
addressed the root of the problem. You've

346
00:29:30.680 --> 00:29:33.880
just moved it out of the request
flow, right. Oh yeah. And

347
00:29:34.960 --> 00:29:42.079
a very common pattern that I see
people do with jobs is, for

348
00:29:42.119 --> 00:29:48.680
instance, you want to iterate over
all users in your app and do something

349
00:29:48.799 --> 00:29:53.039
about each of them, maybe remind
them that they need to add a credit

350
00:29:53.079 --> 00:29:59.160
card or maybe something expired, or
you want to send them an engagement email.

351
00:29:59.519 --> 00:30:02.640
When you start, you have just
one hundred users, so that job

352
00:30:02.960 --> 00:30:07.039
works off pretty quickly under a minute, maybe depending on what kind of work

353
00:30:07.079 --> 00:30:12.480
that is. You grow to thousands, hundreds of thousands, to millions,

354
00:30:12.759 --> 00:30:19.240
and a job to iterate over a
million users and to check balance of each

355
00:30:19.279 --> 00:30:26.319
of them, that job starts taking
days or weeks. And how do you

356
00:30:26.359 --> 00:30:30.880
solve that? And it's just so
easy to introduce that problem. You just

357
00:30:30.920 --> 00:30:34.599
do User.find_each in a
job and it works, but only until the

358
00:30:34.599 --> 00:30:40.359
point when it stops. So the
way how we solved it, and that's

359
00:30:40.400 --> 00:30:45.240
actually all open source. We'll also
link it in the show notes. We've solved that

360
00:30:45.599 --> 00:30:55.119
by making every job interruptible and preserving
a cursor so that a job would progress

361
00:30:55.200 --> 00:31:00.920
for a bit and then maybe it
would get restarted for some reason. Basically, this

362
00:31:00.119 --> 00:31:06.880
allows us to iterate over really long
collections and do some work with them and

363
00:31:06.920 --> 00:31:11.279
never lose the work that has been
done. Nice. Yeah, that's really

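NOTE
A minimal plain-Ruby sketch of the interruptible, cursor-preserving iteration described here.
CursorJob, its budget parameter, and the integer cursor are hypothetical illustration choices;
this is not the API of Shopify's open source library, only the general shape of the idea.
```ruby
# Hypothetical sketch: a job that can stop partway and resume from a saved cursor.
class CursorJob
  attr_reader :cursor, :processed
  def initialize(records, cursor: 0)
    @records = records   # stand-in for an ordered collection such as user rows
    @cursor = cursor     # index of the next record to process
    @processed = []
  end
  # Processes records until the collection ends or the budget runs out.
  # The budget simulates an interruption such as a deploy or a restart.
  # Returns true only when the whole collection has been handled.
  def run(budget:)
    while @cursor < @records.size
      return false if budget <= 0   # interrupted: @cursor already marks the resume point
      @processed << @records[@cursor]
      @cursor += 1                  # advance only after the record's work is done
      budget -= 1
    end
    true
  end
end
users = (1..10).to_a
first = CursorJob.new(users)
finished = first.run(budget: 4)                        # stops after four records
resumed = CursorJob.new(users, cursor: first.cursor)   # a new run picks up the saved cursor
resumed_finished = resumed.run(budget: 100)            # continues at record 5, repeats nothing
```
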
364
00:31:11.279 --> 00:31:15.279
cool. I'm gonna check out the
Shopify job-iteration. That sounds really, really

365
00:31:15.279 --> 00:31:18.640
interesting. One of the things that
we've done at CodeFund is, when we're

366
00:31:18.640 --> 00:31:22.519
iterating, of course, we'll do
like a find_in_batches, and then

367
00:31:22.599 --> 00:31:26.559
we will just enqueue the smaller
work, so when the large job fails,

368
00:31:26.680 --> 00:31:33.680
it's essentially idempotent and can be
just rerun again without impacting things

369
00:31:33.720 --> 00:31:37.839
that may have been half processed or
halfway chunked through. Yeah, that's the

370
00:31:37.880 --> 00:31:41.640
approach that I take as well.
An interesting side effect of that could be

371
00:31:41.720 --> 00:31:48.000
that, again, if this leads
to a fan-out of a million jobs,

372
00:31:48.240 --> 00:31:52.960
because if you have ten million users
and each batch is a size of ten

373
00:31:53.039 --> 00:31:56.319
for instance, like the numbers don't
really matter, but the point is that

374
00:31:56.720 --> 00:32:01.119
if we fan out so many
jobs, we need to remember that

375
00:32:01.400 --> 00:32:08.079
something like Redis is always limited in
memory, and there's been so many times

376
00:32:08.240 --> 00:32:15.720
across every I would say across every
organization where I worked, that people would

377
00:32:15.359 --> 00:32:21.920
push Redis into an out-of-memory state, and unfortunately there is no... I would

378
00:32:21.960 --> 00:32:27.519
love to have a great solution for
that. But every time we want to

379
00:32:27.559 --> 00:32:31.240
do something like you describe, iterate
in batches, enqueue something, we

380
00:32:31.400 --> 00:32:37.039
have to be mindful about what's behind
that. And yeah, I've seen that

381
00:32:37.079 --> 00:32:44.319
as well. You start dropping jobs
because there's no memory left. Certainly happens

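NOTE
A rough sketch of the fan-out arithmetic from this part of the conversation: ten million
users in batches of ten become one million queued jobs. The helper names are hypothetical,
and the 500 bytes per queued job is an assumed figure, not a measured Redis or Sidekiq number.
```ruby
# Hypothetical helper: split ids into batches, one payload per small job.
def fan_out_batches(ids, batch_size)
  ids.each_slice(batch_size).map { |batch| { "ids" => batch } }
end
# Number of jobs a fan-out creates, rounding the last partial batch up.
def job_count(user_count, batch_size)
  (user_count + batch_size - 1) / batch_size
end
# Rough queue footprint, given an assumed size per serialized job.
def estimated_queue_bytes(user_count, batch_size, bytes_per_job)
  job_count(user_count, batch_size) * bytes_per_job
end
payloads = fan_out_batches((1..25).to_a, 10)           # batches of 10, 10, and 5 ids
jobs = job_count(10_000_000, 10)                       # the million jobs from the example
bytes = estimated_queue_bytes(10_000_000, 10, 500)     # 500 bytes per job is an assumption
```
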
382
00:32:44.359 --> 00:32:47.160
at times when jobs might
be failing, right? Sidekiq gives

383
00:32:47.200 --> 00:32:53.640
you some pretty nice failsafe capability where
it will reattempt those jobs. But if

384
00:32:53.640 --> 00:32:58.799
you've got a bug and not a
lot of memory dedicated to your Redis

385
00:32:58.839 --> 00:33:02.319
instance, then of course you may
start losing work that may be critical to

386
00:33:02.359 --> 00:33:06.519
the business. Yeah I could see
that. I haven't run into that myself,

387
00:33:06.559 --> 00:33:09.400
but I could definitely see that happening. This is a great reminder about

388
00:33:09.880 --> 00:33:15.119
all sorts of databases that exist
out there, and maybe push someone to

389
00:33:15.759 --> 00:33:19.640
learn about that, because at the
end, Redis is an in-memory

390
00:33:19.720 --> 00:33:23.920
database which is bound by the RAM
that you give it. It can be a gigabyte,

391
00:33:24.000 --> 00:33:29.599
can be four, can be sixteen, and that backlog of jobs would

392
00:33:29.599 --> 00:33:36.279
not be backed by something that
can be written to storage that's bigger than

393
00:33:36.359 --> 00:33:40.119
RAM, which would be disk
if it's, for instance, MySQL

394
00:33:40.160 --> 00:33:46.480
or Postgres. So something that we would
really like to find is a store that

395
00:33:46.559 --> 00:33:54.720
could persist those things on disk with
a performance not too far and features not

396
00:33:54.839 --> 00:34:00.720
too far from Redis. Redis does
have the capability to push writes to disk,

397
00:34:00.799 --> 00:34:05.079
right, to flush itself out to disk? Yeah, so that only helps

398
00:34:05.279 --> 00:34:12.159
to have a snapshot in case the
computer where Redis is running reboots, but it

399
00:34:12.199 --> 00:34:16.559
still doesn't allow you to store more
than the RAM that you

400
00:34:16.639 --> 00:34:20.440
have. Yeah. I mean,
that's probably a great argument to move to

401
00:34:20.480 --> 00:34:23.840
cloud, right because on Heroku,
it's just one button click when you see

402
00:34:24.239 --> 00:34:30.159
the memory filling up to scale out
or scale up your Redis storage capacity.

403
00:34:30.800 --> 00:34:36.280
Yeah. And a lot of cloud
databases or cloud instances, they have methods for

404
00:34:36.320 --> 00:34:39.840
compensating for that, and so they
will just migrate you to a bigger instance

405
00:34:40.079 --> 00:34:45.559
or, you know, basically
allocate it new memory without you even having

406
00:34:45.599 --> 00:34:52.280
to click it. As long as
that workflow is validated and people are certain

407
00:34:52.360 --> 00:34:57.840
that it will work. That's a
great feature of cloud providers. One of

408
00:34:57.840 --> 00:35:00.559
the thoughts that I've had architecturally,
which would be kind of neat on the

409
00:35:00.599 --> 00:35:07.239
background processing would be some jobs obviously
are a bit more ephemeral and less critical,

410
00:35:07.599 --> 00:35:12.719
and they could be handled in a
little bit more localized fashion, So

411
00:35:12.880 --> 00:35:15.840
it'd be neat to build a routing
layer that was intelligent where you maybe had

412
00:35:15.880 --> 00:35:22.199
three stages of Redis or just background
job storage. Right. One could be

413
00:35:22.599 --> 00:35:27.519
this is very ephemeral and not very
important, so we'll just let it be

414
00:35:27.599 --> 00:35:30.800
handled in process on a separate thread, so we'll route that job over there.

415
00:35:31.360 --> 00:35:35.280
Or it may be that this web
server the job is still kind of

416
00:35:35.320 --> 00:35:37.119
ephemeral, but a little bit more
important, So we could have a dedicated

417
00:35:37.159 --> 00:35:43.239
Redis instance sitting on the web server
that has just a small set of dedicated

418
00:35:43.280 --> 00:35:45.119
memory for that, and you could
push those jobs there to handle some of

419
00:35:45.119 --> 00:35:47.719
that back pressure, and then for
the really important stuff, you could hand

420
00:35:47.760 --> 00:35:52.880
those off to, like, your appliance
tier of Redis storage that gives you the

421
00:35:52.880 --> 00:36:00.239
full capacity across the entire application.
Oh yeah, we haven't done something like

422
00:36:00.280 --> 00:36:04.679
this for jobs, though I think
it could help a lot. But in

423
00:36:04.719 --> 00:36:07.280
general, like in terms of building
systems, I think This is a common

424
00:36:07.320 --> 00:36:15.639
case of defining priority for different workloads, which also allows you to shed some

425
00:36:15.719 --> 00:36:17.840
of the load. So, for
instance, you would have it doesn't have

426
00:36:17.880 --> 00:36:23.519
to be jobs. It could be
something as basic as web requests. And

427
00:36:24.800 --> 00:36:30.559
there are requests that go to something
that's very important to the business, maybe

428
00:36:30.639 --> 00:36:37.360
checkouts, which has the highest priority. Then you have something medium priority that

429
00:36:37.960 --> 00:36:44.079
may be browsing just the admin,
and then you have something low priority like

430
00:36:45.039 --> 00:36:50.840
checking out robots.txt or checking out
the sitemap or hitting an API. And

431
00:36:51.199 --> 00:36:55.880
by declaring priorities to those requests when
you're under load, you can shed

432
00:36:55.960 --> 00:37:02.000
some of those that you don't need. And this idea comes mostly from the

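NOTE
A small sketch of priority-based load shedding as described here: checkout traffic is kept,
while lower-priority traffic is dropped first as load rises. The priority numbers, the load
thresholds, and the function names are all hypothetical values picked for illustration.
```ruby
# Hypothetical priorities: higher numbers matter more to the business.
PRIORITY = { "checkout" => 3, "admin" => 2, "sitemap" => 1 }.freeze
# Lowest priority still served at a given load, where load is utilisation from 0.0 to 1.0.
def min_priority_for(load)
  if load >= 0.9 then 3      # overloaded: only the most important traffic
  elsif load >= 0.7 then 2   # busy: shed low-priority traffic
  else 1                     # normal: serve everything
  end
end
# Unknown request kinds are treated as lowest priority.
def serve?(kind, load)
  PRIORITY.fetch(kind, 1) >= min_priority_for(load)
end
decisions = {
  checkout_overloaded: serve?("checkout", 0.95),
  admin_overloaded: serve?("admin", 0.95),
  admin_busy: serve?("admin", 0.75),
  sitemap_busy: serve?("sitemap", 0.75),
  sitemap_normal: serve?("sitemap", 0.2)
}
```
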
433
00:37:02.079 --> 00:37:07.679
largest companies in the industry, Like
Google has lots of papers and books how

434
00:37:07.719 --> 00:37:13.320
they do it, and as you
can imagine, every request to Google service

435
00:37:13.480 --> 00:37:17.559
would have some kind of priority and
they actually shed those. Like, I'm pretty sure

436
00:37:17.599 --> 00:37:24.400
that mail is higher priority than watching
videos on YouTube. It's really interesting and

437
00:37:24.480 --> 00:37:28.519
one of the neat things about Sidekiq
is it provides like in terms of if

438
00:37:28.519 --> 00:37:32.480
you couch that in terms of background
jobs, Sidekiq provides some of that facility

439
00:37:32.519 --> 00:37:36.480
just out of the box, even
for a simple deploy, right? Because you

440
00:37:36.559 --> 00:37:38.360
can prioritize. You can
say this is in the critical queue,

441
00:37:38.360 --> 00:37:42.880
this is in the default queue,
this is in the low-priority queue, and

442
00:37:43.039 --> 00:37:47.199
Sidekiq will drain the higher priority queues
first. Now you could start there and

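NOTE
A plain-Ruby sketch of strict queue ordering, where higher-priority queues are drained before
lower ones, as described here. The queue names follow the conversation; next_job and the job
names are hypothetical, and this is an illustration rather than Sidekiq's actual implementation.
```ruby
# Queues listed from highest to lowest priority.
QUEUE_ORDER = %w[critical default low].freeze
# Returns [queue_name, job] from the first non-empty queue, or nil when all are empty.
def next_job(queues)
  QUEUE_ORDER.each do |name|
    job = queues.fetch(name, []).shift
    return [name, job] if job
  end
  nil
end
queues = {
  "low" => ["sitemap"],
  "critical" => ["checkout"],
  "default" => ["email", "report"]
}
drained = []
while (entry = next_job(queues))
  drained << entry   # records the order in which jobs would be worked off
end
```
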
443
00:37:47.280 --> 00:37:51.519
then eventually expand out and
say, well, I'm going to give

444
00:37:51.880 --> 00:37:58.559
a set of dedicated worker virtual machines
or dynos or whatever to process a particular

445
00:37:58.679 --> 00:38:02.880
queue. And I may even give
it a dedicated Redis instance or tier for

446
00:38:02.920 --> 00:38:07.079
that particular queue. But you can
start with just a simple Redis instance and

447
00:38:07.679 --> 00:38:14.039
the default Sidekiq configuration. I'd say, just
for anyone listening, because when we're talking

448
00:38:14.079 --> 00:38:17.480
about like scaling large systems, right
like Shopify. But if you're starting a

449
00:38:17.559 --> 00:38:22.280
rails app, for me, the
go to is pretty much I always reach

450
00:38:22.360 --> 00:38:28.519
for Redis, Postgres, and Sidekiq,
along with everything else that comes out of

451
00:38:28.519 --> 00:38:30.480
the box with Rails. That's pretty
much what I always go for when I

452
00:38:30.519 --> 00:38:35.000
start a new project. Yeah,
I mean, I've used Resque

453
00:38:35.039 --> 00:38:37.519
in the past for a lot of
projects, and then yeah, I've moved

454
00:38:37.519 --> 00:38:40.960
into Sidekiq for my newer stuff.
But yeah, when is it too much

455
00:38:42.000 --> 00:38:45.440
to background something? Right? So
I wrote a gem that allows me to

456
00:38:45.559 --> 00:38:50.119
essentially background every or any method that
hangs off of an Active Record model,

457
00:38:50.440 --> 00:38:54.280
which is really convenient, but what
I've found is it makes it almost too

458
00:38:54.320 --> 00:38:59.480
convenient, where if something seems to
be slowing down a request, you can

459
00:38:59.559 --> 00:39:02.559
just do it dot defer to the
method name and it would stick it into

460
00:39:02.719 --> 00:39:07.360
the background, which is great,
but it got abused and we ended up

461
00:39:07.360 --> 00:39:12.159
with far too much running in the
background, hitting those problems you're talking about,

462
00:39:12.199 --> 00:39:15.960
like exhausting memory and stuff. So
how do you how do you determine

463
00:39:15.960 --> 00:39:20.840
what should be backgrounded? That's a
good question, and frankly, as someone

464
00:39:20.840 --> 00:39:23.960
who's spent quite a lot of time
on that part of stack, I'm not

465
00:39:24.000 --> 00:39:30.760
sure there is a single answer,
and I think it's somewhat related to how

466
00:39:31.440 --> 00:39:36.880
For instance, if it's Active Record
and SQL queries, how heavy are those

467
00:39:37.159 --> 00:39:45.760
queries? If your request timeout
is thirty seconds and just one SQL query

468
00:39:45.039 --> 00:39:50.400
that's for some reason heavy, some kind
of aggregation, takes ten, and you need

469
00:39:50.559 --> 00:39:52.440
maybe to run a few of those. There is no way to fit that

470
00:39:52.679 --> 00:39:58.559
into a web request. And of
course it might not make a lot of

471
00:39:58.639 --> 00:40:02.159
sense to do the premature optimization,
and it can be fine to just start

472
00:40:02.199 --> 00:40:07.920
with everything in a web request in
a controller, and then you find out

473
00:40:07.039 --> 00:40:10.800
that's the thing where your app spends
most of the time in a web request,

474
00:40:10.920 --> 00:40:15.119
and you just move that to a
job. Because for simple apps,

475
00:40:15.440 --> 00:40:20.440
maybe that will
never be a job, and it will

476
00:40:20.440 --> 00:40:23.920
scale fine for the next few years. Yeah, I wonder if a good

477
00:40:23.960 --> 00:40:29.880
approach would be to first This probably
very much depends on if you've got paying

478
00:40:29.960 --> 00:40:32.960
customers that are being impacted, right, So, if paying customers are being

479
00:40:34.119 --> 00:40:37.639
impacted and you've got just some inefficiency
in a query or some aspect of a

480
00:40:37.639 --> 00:40:42.800
web request, maybe you background that, but you also put it

481
00:40:42.840 --> 00:40:47.039
in some type of planning process where
you revisit that job and try to actually

482
00:40:47.079 --> 00:40:52.079
optimize the real root of the problem. Yeah. I tend to use the

483
00:40:52.079 --> 00:40:57.639
background jobs when I have a performance
issue in the request pipeline, like we've

484
00:40:57.679 --> 00:41:00.639
talked about before, and then if
there's a problem with running it in a background

485
00:41:00.760 --> 00:41:05.400
job, you know it's timing out
or you know something's breaking or something like

486
00:41:05.440 --> 00:41:07.880
that, you know, then I
revisit it from there. I don't know

487
00:41:07.920 --> 00:41:10.480
if there's a silver bullet. I
think a lot of times it's context specific

488
00:41:10.519 --> 00:41:15.000
and you just have to Okay,
I'm moving this out of the request pipeline.

489
00:41:15.599 --> 00:41:16.880
Okay, now it's having a problem
here, So now I've got to

490
00:41:16.880 --> 00:41:21.960
address the issue there. And yeah, you know, eventually it kind of

491
00:41:21.960 --> 00:41:27.079
bubbles itself up to the top of
your tech debt queue and you address it.

492
00:41:27.480 --> 00:41:30.840
So one thing before we wrap up, do you have like some favorite

493
00:41:30.840 --> 00:41:35.440
tips or tricks or approaches that you
do at Shopify or have done at other

494
00:41:36.039 --> 00:41:40.719
employers that make this easier, or
you know something that you just feel like

495
00:41:40.880 --> 00:41:46.719
is something that you did that you're
proud of. Yes, For someone who

496
00:41:49.119 --> 00:41:54.079
is curious about performance and fixing those
kind of bottlenecks, my best advice would

497
00:41:54.119 --> 00:42:00.239
be to study all the set and
variety of tools that you can use.

498
00:42:00.719 --> 00:42:07.400
These tools can be as high level
and web-based and simple as New Relic and

499
00:42:07.480 --> 00:42:14.400
some of the similar services that you
can connect to your app and see insights.

500
00:42:14.840 --> 00:42:22.039
to more system-level tools like, for
instance, strace. The amount

501
00:42:22.320 --> 00:42:28.440
of times where strace saved me
and some of my colleagues in the

502
00:42:28.440 --> 00:42:34.440
middle of a service disruption,
it's so hard to count those. And

503
00:42:35.199 --> 00:42:39.800
my advice is not necessarily about
strace, but knowing the wide variety of

504
00:42:39.920 --> 00:42:45.719
tools that you can use. Some
of those tools are very Linux specific and

505
00:42:45.800 --> 00:42:52.599
system level. Some of them are
Ruby level, like rbspy, a great

506
00:42:52.599 --> 00:42:58.280
tool by Julia Evans, or rbtrace, and then there are some services that

507
00:42:58.400 --> 00:43:04.159
offer those kinds of things.
So if you know that range of tools

508
00:43:04.440 --> 00:43:07.360
and you know which one is the
best for something that you're looking for,

509
00:43:08.159 --> 00:43:13.679
you pick it up and fix the
thing. Anyway, You've got to wrap

510
00:43:13.719 --> 00:43:15.199
up soon. I've got a couple, just a couple of questions to put

511
00:43:15.199 --> 00:43:19.360
you on the spot here. One
is, do you know what the request

512
00:43:19.400 --> 00:43:23.159
volume that Shopify does per
second? The public number that I can

513
00:43:23.199 --> 00:43:30.000
say is about eighty thousand requests per
minute. And what about background jobs?

514
00:43:30.039 --> 00:43:35.360
How many background jobs are
being processed per minute? That's a great

515
00:43:35.440 --> 00:43:40.880
question, and to be honest,
I don't remember those numbers just out of

516
00:43:40.920 --> 00:43:45.079
my head. Yeah, yeah,
I have probably suff It's a lot,

517
00:43:45.280 --> 00:43:50.719
right, Yeah, it's a lot, and it can be very spiky.

518
00:43:50.960 --> 00:43:58.039
And there is a huge difference from
steady state and spiky state. Because shopify

519
00:43:58.119 --> 00:44:04.199
is also hosting some of the world's
largest sales, sometimes for celebrities, sometimes

520
00:44:04.320 --> 00:44:13.920
it's worldwide cups and some special sales
that where millions of people try to crash

521
00:44:14.079 --> 00:44:19.480
Superfest stores. Yeah, I can
imagine. CodeFund is tiny in comparison.

522
00:44:19.920 --> 00:44:23.800
Since January we've done over three hundred
million. Wow, that still feels like

523
00:44:23.840 --> 00:44:29.519
a lot to me. Yeah.
We keep changing what's in the background,

524
00:44:29.559 --> 00:44:32.199
what's not in the background, so
that we've had that number kind of artificially

525
00:44:32.199 --> 00:44:37.800
inflated at times. But still,
yeah, that's a lot of background work.

526
00:44:37.119 --> 00:44:39.559
Yeah, makes sense. All right, Well, I'm going to push

527
00:44:39.599 --> 00:44:43.440
us to picks, Nate, do
you want to start us off with the

528
00:44:43.480 --> 00:44:47.639
picks? Sure? So I guess
one pick for me today is open source.

529
00:44:47.880 --> 00:44:52.159
How fantastic open source is. I've got
a thing on the side that I'm

530
00:44:52.159 --> 00:44:57.639
doing for my brother in law and
it's basically a CRM. So I went

531
00:44:57.960 --> 00:45:02.000
kind of diving around for open source
tools that I might be able to use

532
00:45:02.079 --> 00:45:06.440
to set up for him, and
I found Fat Free CRM, which is

533
00:45:06.480 --> 00:45:12.800
a Rails-based CRM. It's a
bit antiquated on the uh you know,

534
00:45:12.920 --> 00:45:15.079
the way it looks in terms of
the UI and UX, but it's pretty

535
00:45:15.079 --> 00:45:21.000
fantastic. The data model's solid and it
meets all of his needs. Which is

536
00:45:21.119 --> 00:45:27.039
terrific. The other pick I've got
is cats. So we've got a Maine

537
00:45:27.119 --> 00:45:34.480
Coon and a Russian Blue, and they
just provide so much joy for my girls

538
00:45:34.519 --> 00:45:38.719
and for the family in general.
So highly recommend getting a pet, and

539
00:45:38.920 --> 00:45:44.519
especially a cat. Nice. I'm
gonna step in here with a couple of

540
00:45:44.559 --> 00:45:49.880
picks. The first one that I
have is a challenge that I've been doing.

541
00:45:50.360 --> 00:45:52.039
This is a challenge that has been
less fun with a broken arm,

542
00:45:52.519 --> 00:45:57.800
but it you know, I started
it because I just I really want to

543
00:45:57.840 --> 00:46:00.239
prove to myself that I can do
this, and yeah, doing it with

544
00:46:00.280 --> 00:46:05.400
a broken arm, it just I
wasn't gonna wait to heal because it's several

545
00:46:05.440 --> 00:46:08.239
weeks to heal a broken arm.
Anyway, the challenge is called seventy five

546
00:46:08.280 --> 00:46:13.559
Hard. It comes off of the
MFCEO Project podcast with Andy Frisella,

547
00:46:14.079 --> 00:46:17.159
and I've picked that on the show
before his podcast, but anyway, it's

548
00:46:17.159 --> 00:46:23.199
basically a challenge that he made up. But it essentially is a challenge to

549
00:46:23.280 --> 00:46:27.440
prove that you can, you know, do what you've got to do for

550
00:46:27.480 --> 00:46:30.159
seventy five days. So there are
five rules and if you violate any of

551
00:46:30.199 --> 00:46:34.920
the rules, then you have to
start over the seventy five days. And

552
00:46:35.400 --> 00:46:37.199
the first rule is you have to
work out twice a day for at least

553
00:46:37.280 --> 00:46:40.960
forty five minutes each time, and
one of the workouts has to be outside.

554
00:46:42.119 --> 00:46:45.079
So if it's raining, if it's
cold, if it's hot, if

555
00:46:45.079 --> 00:46:49.280
there's a hurricane, you know whatever, you're going to work out outside.

556
00:46:49.719 --> 00:46:54.239
And basically the he says that that's
just a you push you through the you

557
00:46:54.280 --> 00:46:58.400
know what. Sometimes you have to
do stuff when the conditions aren't ideal.

558
00:46:58.880 --> 00:47:00.320
The other rule, you have to
drink a gallon of water every day.

559
00:47:01.000 --> 00:47:06.400
You have to read ten pages of
a book every day. You have to

560
00:47:07.519 --> 00:47:09.960
choose a diet and stick to it. No cheating every day for seventy five

561
00:47:10.039 --> 00:47:14.000
days, So a lot of diets. You know, people are like,

562
00:47:14.159 --> 00:47:15.639
well, I take a cheat day
every week, no cheat days, no

563
00:47:15.760 --> 00:47:19.559
cheat days on seventy five hard.
And then the last one is you have

564
00:47:19.599 --> 00:47:24.639
to post a status photo to social
media. And so yeah, I've restarted

565
00:47:24.679 --> 00:47:29.239
twice so far. The first time
I forgot to read the ten pages,

566
00:47:29.920 --> 00:47:31.400
which was dumb. It was the
one thing I kind of took for granted

567
00:47:31.400 --> 00:47:37.039
that I do and I didn't do
it. The other one, I got

568
00:47:37.039 --> 00:47:43.159
a salad from Costa Vida and I
didn't realize that I hadn't told them to

569
00:47:43.159 --> 00:47:45.199
take the rice out of it.
And I've been doing a keto diet,

570
00:47:45.079 --> 00:47:49.519
so yeah, so I started over. I felt really dumb about that.

571
00:47:49.599 --> 00:47:51.679
I was like, I know they
put rice in it. I don't know

572
00:47:51.679 --> 00:47:53.639
why I didn't ask them to take
it out. So yeah, So it's

573
00:47:53.679 --> 00:47:58.920
just kind of learning to adapt to
some of this stuff. But I'm definitely

574
00:47:59.119 --> 00:48:02.480
enjoying the process. And incidentally,
just to throw it out there, so

575
00:48:02.519 --> 00:48:07.440
I've I've been doing the the challenge
for about a week and a half and

576
00:48:07.519 --> 00:48:12.559
you know, and I'm currently on
day two. Just to throw that in

577
00:48:12.599 --> 00:48:15.840
the right because I had to restart. The flip side is is that I've

578
00:48:15.840 --> 00:48:22.199
lost ten pounds and know we can
that's a serious program, like you're gonna

579
00:48:22.239 --> 00:48:27.199
be committed. Yeah, but he
says it's a mental toughness challenge. Right,

580
00:48:27.559 --> 00:48:30.159
You're going to go and some days
you're just gonna have to push through

581
00:48:30.840 --> 00:48:34.719
do some stuff that you really don't
feel like doing. Yeah, like the

582
00:48:35.039 --> 00:48:37.639
run that I have scheduled today, it'll probably be both of my forty

583
00:48:37.639 --> 00:48:44.079
five minute workouts together. Because it's
it's one of my longer training runs for

584
00:48:44.159 --> 00:48:47.559
the marathon I'm gonna run in October. And yeah, I'm really feeling it

585
00:48:47.599 --> 00:48:52.679
today, especially with my arm and
everything else. I do not want to

586
00:48:52.679 --> 00:48:54.480
go out there and do it,
but you know, I've got to suck

587
00:48:54.519 --> 00:48:58.320
it up and go do it,
so anyway, but yeah, you know,

588
00:48:58.360 --> 00:49:00.960
I've got to go do two workouts
tomorrow, and tomorrow's a holiday,

589
00:49:00.639 --> 00:49:05.880
so yeah. Anyway, so that
that's my pick. If you want to

590
00:49:05.880 --> 00:49:07.840
go follow me on Instagram, I
think my handle is Charles max Wood.

591
00:49:08.360 --> 00:49:13.679
Then I've been posting my uh my
social media posts there. I tend to

592
00:49:13.679 --> 00:49:15.519
try and post them to Twitter and
Facebook as well, but I'm not always

593
00:49:15.519 --> 00:49:22.119
great about that. I'm pretty consistent
on Instagram. So anyway, Kir,

594
00:49:22.159 --> 00:49:24.079
do you have some picks for us? To be honest, I'm not I

595
00:49:24.079 --> 00:49:30.519
don't know the like the format very
well. If you can, just are

596
00:49:30.519 --> 00:49:36.320
there one or two things that you
think everybody in the world should know about

597
00:49:36.400 --> 00:49:42.360
that way, right this one?
I think it would be interesting for the

598
00:49:42.400 --> 00:49:46.440
main audience like Ruby developers. A
couple of weeks ago, I followed a

599
00:49:46.519 --> 00:49:52.719
hacking guide from MRI committers that shows
you how to build Ruby, how to

600
00:49:53.440 --> 00:49:59.760
change some simple source and see how
to rebuild it again and see how it

601
00:49:59.800 --> 00:50:05.599
works. Which also allows you to
try all the new features that are coming

602
00:50:05.639 --> 00:50:08.320
with Ruby two point seven because you
build it from the master branch, so

603
00:50:08.360 --> 00:50:14.199
you can go and try stuff like
pattern matching, if it's something that you're

604
00:50:14.239 --> 00:50:20.280
excited about. And the reason why
it can be interesting for any Ruby developer

605
00:50:20.320 --> 00:50:27.440
to try is because you get to
see all the magic behind it, just

606
00:50:27.519 --> 00:50:32.480
all the C code, and it
becomes no longer just a thing that some

607
00:50:34.079 --> 00:50:38.559
Ruby committers that I have no idea
about build, and it becomes something that

608
00:50:38.719 --> 00:50:45.440
you can understand a little bit better
maybe. And I think that the hacking

609
00:50:45.440 --> 00:50:52.880
guide was also made to reduce the
barrier to start doing open source.

610
00:50:53.199 --> 00:51:00.280
So I think this point falls back
to the pick that Nate brought up about

611
00:51:00.280 --> 00:51:05.000
open source being awesome. We'll link
to that. Very cool. Yep, cool.

612
00:51:05.719 --> 00:51:07.559
One more question. If people want
to find you online and see what you're working

613
00:51:07.599 --> 00:51:12.440
on these days, how do they
find you? Yeah, it's @kirshatrov

614
00:51:12.920 --> 00:51:17.880
on Twitter or kirs on GitHub.
Awesome. All right, well, thank

615
00:51:17.920 --> 00:51:22.280
you for coming. This is really
interesting. I want to ask like a

616
00:51:22.320 --> 00:51:24.400
dozen more questions, but we just
don't have time, so maybe we'll have

617
00:51:24.440 --> 00:51:28.800
you come back. Thanks for inviting me. I'll be happy to come back.

618
00:51:29.280 --> 00:51:31.000
all right. Well, let's go
ahead and wrap this one up, folks,

619
00:51:31.039 --> 00:51:36.800
and we'll come back next week with
another episode. Thanks a lot. Bye bye.