WEBVTT

1
00:00:00.120 --> 00:00:04.839
<v Speaker 1>Welcome back to the deep dive. Today, we are wrestling with, uh,

2
00:00:05.160 --> 00:00:09.039
<v Speaker 1>probably the defining architectural challenge of the last few years,

3
00:00:09.039 --> 00:00:09.480
<v Speaker 1>maybe the decade.

4
00:00:09.439 --> 00:00:10.919
<v Speaker 2>Definitely feels like it.

5
00:00:11.240 --> 00:00:15.080
<v Speaker 1>How do you reliably route traffic when the ground beneath

6
00:00:15.080 --> 00:00:17.600
<v Speaker 1>your feet is constantly shifting? We're talking about the shift

7
00:00:17.640 --> 00:00:20.440
<v Speaker 1>from stable, predictable monoliths.

8
00:00:19.920 --> 00:00:22.239
<v Speaker 2>Right, the quarterly release cycle kind of thing.

9
00:00:22.160 --> 00:00:25.879
<v Speaker 1>Exactly, to this, well, sometimes chaotic world of

10
00:00:26.000 --> 00:00:28.600
<v Speaker 1>microservices that scale up and down constantly.

11
00:00:28.800 --> 00:00:31.640
<v Speaker 2>Yeah, it's like comparing, I don't know, a printed map

12
00:00:31.679 --> 00:00:34.119
<v Speaker 2>from the nineties to Google Maps in a city where

13
00:00:34.200 --> 00:00:38.359
<v Speaker 2>roads just appear and disappear every few minutes and buildings

14
00:00:38.399 --> 00:00:39.479
<v Speaker 2>resize themselves.

15
00:00:39.560 --> 00:00:42.640
<v Speaker 1>That's a great analogy. And those older load balancers, the

16
00:00:42.679 --> 00:00:45.159
<v Speaker 1>ones built for the static map, they just can't cope.

17
00:00:45.240 --> 00:00:48.159
<v Speaker 2>They're stuck in that static config mindset. They pretty much

18
00:00:48.200 --> 00:00:52.079
<v Speaker 2>melt down when faced with how dynamic a modern cloud

19
00:00:52.159 --> 00:00:53.119
<v Speaker 2>environment really is.

20
00:00:52.920 --> 00:00:56.320
<v Speaker 1>And that operational headache, that's why we're diving deep

21
00:00:56.320 --> 00:00:59.240
<v Speaker 1>into Traefik today. It's an open source API gateway and

22
00:00:59.240 --> 00:01:02.000
<v Speaker 1>it's built specifically to handle that dynamic complexity.

23
00:01:02.280 --> 00:01:05.480
<v Speaker 2>Right. The idea is to simplify deploying microservices, especially

24
00:01:05.480 --> 00:01:08.519
<v Speaker 2>if you're in the Kubernetes world, which, let's face it, many are.

25
00:01:08.280 --> 00:01:11.000
<v Speaker 1>So, our mission today?

26
00:01:10.840 --> 00:01:14.319
<v Speaker 2>Our mission is to unpack how Traefik acts as this

27
00:01:14.400 --> 00:01:17.319
<v Speaker 2>crucial link. Think of it as the intelligent gateway tier.

28
00:01:18.079 --> 00:01:23.439
<v Speaker 2>It connects that volatile ecosystem of services to the outside world.

29
00:01:23.599 --> 00:01:26.359
<v Speaker 1>And by the end, you, listening, should have a pretty

30
00:01:26.359 --> 00:01:28.799
<v Speaker 1>good handle on the cutting edge of network routing.

31
00:01:29.120 --> 00:01:32.079
<v Speaker 2>A shortcut maybe, yeah, a shortcut to being well informed

32
00:01:32.120 --> 00:01:34.319
<v Speaker 2>about this stuff. Resilience patterns too.

33
00:01:34.400 --> 00:01:37.120
<v Speaker 1>Okay, let's start with the big picture, the monolith problem.

34
00:01:37.200 --> 00:01:39.920
<v Speaker 1>We probably all remember it, right? Tight coupling.

35
00:01:39.640 --> 00:01:42.560
<v Speaker 2>Slow releases. Oh, the pain.

36
00:01:42.439 --> 00:01:46.159
<v Speaker 1>And that really expensive all-or-nothing scaling. Need

37
00:01:46.239 --> 00:01:48.640
<v Speaker 1>more horsepower for login? Scale the whole thing.

38
00:01:48.239 --> 00:01:51.400
<v Speaker 2>Huge waste of resources. That worked okay,

39
00:01:51.480 --> 00:01:56.480
<v Speaker 2>I guess, with the classic three-tier model: presentation, application, data. Simple enough.

40
00:01:56.319 --> 00:01:59.120
<v Speaker 1>But microservices break that model completely.

41
00:01:59.280 --> 00:02:01.719
<v Speaker 2>When you shatter that app into, I don't know, dozens,

42
00:02:01.840 --> 00:02:04.519
<v Speaker 2>hundreds of tiny services, you need a different architecture.

43
00:02:04.560 --> 00:02:06.640
<v Speaker 1>You have to evolve. The four-tier model.

44
00:02:06.439 --> 00:02:08.360
<v Speaker 2>Exactly, build for distributed systems.

45
00:02:08.639 --> 00:02:11.199
<v Speaker 1>And that fourth tier is where Traefik lives, right?

46
00:02:11.039 --> 00:02:13.319
<v Speaker 2>That's its home turf. Right, so the four tiers you really

47
00:02:13.360 --> 00:02:16.960
<v Speaker 2>need are first, content delivery, the UI, the client stuff.

48
00:02:17.039 --> 00:02:17.360
<v Speaker 1>Okay.

49
00:02:17.560 --> 00:02:23.280
<v Speaker 2>Second, the gateway tier. That's Traefik: discovery, routing, correlating requests,

50
00:02:23.400 --> 00:02:26.840
<v Speaker 2>aggregating responses sometimes, all that happens here. The traffic hub,

51
00:02:27.159 --> 00:02:30.680
<v Speaker 2>sort of. Yeah. Then third is the services tier, your

52
00:02:30.719 --> 00:02:35.719
<v Speaker 2>actual decoupled business logic units, high cohesion, loose coupling, all

53
00:02:35.759 --> 00:02:40.240
<v Speaker 2>that good stuff. And finally, the data tier databases, message queues,

54
00:02:40.479 --> 00:02:43.759
<v Speaker 2>but now ideally exclusive to the services that own that data.

55
00:02:43.800 --> 00:02:46.560
<v Speaker 1>Okay, So the gateway tier is critical, it's the front door.

56
00:02:47.000 --> 00:02:50.439
<v Speaker 1>What does a modern gateway like Traefik absolutely have to

57
00:02:50.439 --> 00:02:52.199
<v Speaker 1>do to handle that chaos in tier three?

58
00:02:52.400 --> 00:02:54.000
<v Speaker 2>Right, It's got to be more than just a simple

59
00:02:54.039 --> 00:02:57.000
<v Speaker 2>port forwarder. Layer seven routing is non negotiable.

60
00:02:56.479 --> 00:02:59.159
<v Speaker 1>Layer seven meaning application layer?

61
00:02:59.039 --> 00:03:03.240
<v Speaker 2>Exactly. Routing based on HTTP headers, hostnames, paths, maybe

62
00:03:03.240 --> 00:03:05.360
<v Speaker 2>even stuff in the request body, not just layer four

63
00:03:05.639 --> 00:03:08.199
<v Speaker 2>like TCP or UDP ports. And it needs to speak

64
00:03:08.199 --> 00:03:11.879
<v Speaker 2>different languages, essentially: HTTP/1, HTTP/2, gRPC, REST. It

65
00:03:11.919 --> 00:03:12.479
<v Speaker 2>shouldn't care.

66
00:03:12.719 --> 00:03:15.680
<v Speaker 1>And security that feels like a huge piece, especially with

67
00:03:15.719 --> 00:03:18.680
<v Speaker 1>all those services chattering away behind the gateway.

68
00:03:18.719 --> 00:03:23.639
<v Speaker 2>Oh, massive. Absolutely. The gateway must handle TLS termination, you know,

69
00:03:23.960 --> 00:03:25.560
<v Speaker 2>decrypting the incoming public traffic.

70
00:03:25.280 --> 00:03:27.719
<v Speaker 1>Standard stuff, right.

71
00:03:27.639 --> 00:03:31.159
<v Speaker 2>But then inside the cluster, for service-to-service chat, you

72
00:03:31.240 --> 00:03:34.319
<v Speaker 2>need mutual TLS, mTLS.

73
00:03:33.759 --> 00:03:35.960
<v Speaker 1>So both sides prove who they are?

74
00:03:36.159 --> 00:03:39.080
<v Speaker 2>Precisely. It's not just the client showing ID. The server demands

75
00:03:39.120 --> 00:03:42.400
<v Speaker 2>ID back, show me your papers too. It's essential for

76
00:03:42.479 --> 00:03:45.240
<v Speaker 2>locking things down inside your perimeter. If something goes wrong,

77
00:03:45.719 --> 00:03:46.960
<v Speaker 2>it limits the blast radius.

78
00:03:47.039 --> 00:03:49.759
<v Speaker 1>Okay, that makes sense, which leads us right to maybe

79
00:03:49.759 --> 00:03:53.919
<v Speaker 1>the killer feature, auto-configuration. Because, like you said, hundreds of services,

80
00:03:53.960 --> 00:03:59.520
<v Speaker 1>maybe thousands of instances, updating config files by hand? Impossible.

81
00:03:59.560 --> 00:04:02.719
<v Speaker 2>Right there, it just doesn't scale. That's where Traefik fundamentally solves the service

82
00:04:02.759 --> 00:04:05.479
<v Speaker 2>discovery problem. Instead of a human editing.

83
00:04:05.240 --> 00:04:07.759
<v Speaker 1>A file, which is always error prone.

84
00:04:07.479 --> 00:04:11.759
<v Speaker 2>Always. Instead, Traefik talks directly to a service registry. Think Consul,

85
00:04:11.759 --> 00:04:15.599
<v Speaker 2>etcd, or Kubernetes itself. These things are like near real

86
00:04:15.639 --> 00:04:18.759
<v Speaker 2>time databases of where every active service instance lives on

87
00:04:18.800 --> 00:04:19.279
<v Speaker 2>the network.

88
00:04:19.399 --> 00:04:21.680
<v Speaker 1>Ah, so Traefik doesn't need its own map. It just

89
00:04:21.839 --> 00:04:23.959
<v Speaker 1>asks the map maker constantly.

90
00:04:23.600 --> 00:04:27.800
<v Speaker 2>Exactly. Perfect analogy. Traefik calls these mapmakers providers. It

91
00:04:27.839 --> 00:04:29.920
<v Speaker 2>has first-class support baked in. It just sits there and

92
00:04:29.959 --> 00:04:33.519
<v Speaker 2>watches the provider. A new service instance spins up? Traefik

93
00:04:33.560 --> 00:04:35.879
<v Speaker 2>sees it. Yep, an old one dies? Traefik sees

94
00:04:36.000 --> 00:04:39.519
<v Speaker 2>that too, and it automatically reconfigures its own routing

95
00:04:39.560 --> 00:04:44.399
<v Speaker 2>tables, crucially, without needing a restart or dropping existing connections.

96
00:04:44.600 --> 00:04:47.600
<v Speaker 1>Hot reloads, zero downtime. That's the dream.

97
00:04:47.800 --> 00:04:51.959
<v Speaker 2>That's critical. Dynamic configuration and hot reloads are absolutely key.

98
00:04:52.199 --> 00:04:55.000
<v Speaker 1>How tricky is it if you're running say, Docker and

99
00:04:55.079 --> 00:04:59.759
<v Speaker 1>Kubernetes and maybe Consul. Can one Traefik instance watch all

100
00:04:59.800 --> 00:05:00.319
<v Speaker 1>of them?

101
00:05:00.439 --> 00:05:03.600
<v Speaker 2>Yeah, surprisingly easily. That's the beauty of the provider concept.

102
00:05:04.199 --> 00:05:08.360
<v Speaker 2>Traefik kind of abstracts away the specific details of talking

103
00:05:08.360 --> 00:05:11.519
<v Speaker 2>to Kubernetes versus talking to Consul, so you can centralize

104
00:05:11.600 --> 00:05:13.319
<v Speaker 2>routing even in a mixed environment.

105
00:05:13.399 --> 00:05:15.800
<v Speaker 1>So developers just deploy to whatever platform they use,

106
00:05:15.399 --> 00:05:17.480
<v Speaker 2>and Traefik figures out how to find it

107
00:05:17.519 --> 00:05:21.399
<v Speaker 2>and send traffic there. Developers focus on code. Traefik handles

108
00:05:21.439 --> 00:05:22.439
<v Speaker 2>the routing complexity.

109
00:05:22.759 --> 00:05:26.879
<v Speaker 1>Okay, so Traefik knows where everything is. Now let's talk

110
00:05:26.920 --> 00:05:30.279
<v Speaker 1>about actually sending the traffic efficiently. We all know basic

111
00:05:30.360 --> 00:05:34.199
<v Speaker 1>round robin, right? Just deal them out equally. Fine for

112
00:05:34.240 --> 00:05:35.000
<v Speaker 1>stateless stuff.

113
00:05:35.079 --> 00:05:38.240
<v Speaker 2>Yeah, simple, effective, if all your servers are identical.

114
00:05:38.240 --> 00:05:42.079
<v Speaker 1>But they rarely are. So, weighted round robin, WRR. How does

115
00:05:42.120 --> 00:05:42.519
<v Speaker 1>that work?

116
00:05:42.680 --> 00:05:46.120
<v Speaker 2>Right. WRR is about being smarter with resources. Maybe you

117
00:05:46.199 --> 00:05:49.519
<v Speaker 2>have an older, cheaper server with less CPU. You don't

118
00:05:49.560 --> 00:05:51.720
<v Speaker 2>want it getting the same traffic as your brand new,

119
00:05:52.240 --> 00:05:56.279
<v Speaker 2>beefy cloud instance. Makes sense. So WRR lets you assign weights.

120
00:05:56.600 --> 00:05:59.199
<v Speaker 2>You could say, send three requests to the powerful guest

121
00:05:59.279 --> 00:06:01.399
<v Speaker 2>v1 group for every one request you send to

122
00:06:01.439 --> 00:06:04.199
<v Speaker 2>the older guest v2, a three-to-one ratio,

123
00:06:04.319 --> 00:06:04.839
<v Speaker 2>for example.
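
NOTE
Editor's sketch of the three-to-one weighting just described. This is an
illustration, not Traefik's implementation: it uses the smooth weighted
round-robin scheme, and the backend names "guest-v1" and "guest-v2" and the
function weighted_round_robin are assumptions made for the example.
```python
# Illustrative sketch only; backend names and weights are hypothetical.
from itertools import islice
def weighted_round_robin(weights):
    """Yield backend names forever using smooth weighted round robin."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        # Every backend earns credit in proportion to its weight.
        for name, weight in weights.items():
            current[name] += weight
        # The backend with the most credit serves the request and pays it back.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen
picks = list(islice(weighted_round_robin({"guest-v1": 3, "guest-v2": 1}), 8))
```
Over any eight consecutive picks, six go to the weight-3 backend and two to
the weight-1 backend, and the lighter backend's turns are spread out.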

124
00:06:04.839 --> 00:06:08.040
<v Speaker 1>So it's not just load balancing, it's cost optimization too.

125
00:06:08.240 --> 00:06:12.040
<v Speaker 2>Definitely. In the cloud especially, WRR helps you squeeze maximum

126
00:06:12.120 --> 00:06:15.439
<v Speaker 2>value out of cheaper or older instances alongside the new ones.

127
00:06:15.639 --> 00:06:19.600
<v Speaker 2>Keeps everything utilized efficiently, saves money, no resource just sitting

128
00:06:19.600 --> 00:06:20.959
<v Speaker 2>idle or getting totally slammed.

129
00:06:21.000 --> 00:06:23.639
<v Speaker 1>Okay, let's flip that. What about apps where the user's

130
00:06:23.720 --> 00:06:26.839
<v Speaker 1>state matters, like a shopping cart stored in memory on

131
00:06:26.839 --> 00:06:29.759
<v Speaker 1>one specific server instance? Round robin would break that.

132
00:06:30.079 --> 00:06:33.360
<v Speaker 2>Yeah, that needs sticky sessions. If a user's second request

133
00:06:33.439 --> 00:06:36.199
<v Speaker 2>hits a different server, poof, their cart is gone or

134
00:06:36.240 --> 00:06:38.680
<v Speaker 2>they get logged out. Bad experience.

135
00:06:39.000 --> 00:06:41.720
<v Speaker 1>So how does Traefik handle that?

136
00:06:41.959 --> 00:06:44.800
<v Speaker 2>It uses cookies, typically. When the first request hits a backend instance,

137
00:06:45.000 --> 00:06:48.560
<v Speaker 2>Traefik sets a cookie in the response. For subsequent requests

138
00:06:48.560 --> 00:06:51.040
<v Speaker 2>from that same user, Traefik reads the cookie and makes

139
00:06:51.079 --> 00:06:53.720
<v Speaker 2>sure to send the request back to that same original instance.

140
00:06:54.240 --> 00:06:55.199
<v Speaker 2>Keeps the session alive.
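
NOTE
Editor's sketch of the cookie-based stickiness described above. It is a
minimal illustration under stated assumptions, not Traefik's code: the class
StickyBalancer, the cookie name "sticky_backend", and the backend addresses
are all hypothetical.
```python
# Illustrative sketch only; names and addresses are hypothetical.
from itertools import cycle
class StickyBalancer:
    def __init__(self, backends):
        self.backends = set(backends)
        self._rotation = cycle(backends)
    def route(self, cookies):
        """Return (backend, cookies_to_set) for one request."""
        pinned = cookies.get("sticky_backend")
        if pinned in self.backends:
            # Known, still-registered backend: keep the user where they were.
            return pinned, {}
        # No usable cookie: fall back to round robin and pin the choice.
        backend = next(self._rotation)
        return backend, {"sticky_backend": backend}
lb = StickyBalancer(["10.0.0.1:80", "10.0.0.2:80"])
first, set_cookie = lb.route({})
second, _ = lb.route(set_cookie)
third, _ = lb.route({})
```
A request carrying the cookie returns to its original backend, while a new
visitor without one continues the normal rotation.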

141
00:06:55.480 --> 00:06:59.120
<v Speaker 1>Okay, sticky sessions makes sense, But underlying all this balancing,

142
00:06:59.439 --> 00:07:02.560
<v Speaker 1>you need health right. Making sure you're not sending traffic

143
00:07:02.600 --> 00:07:03.279
<v Speaker 1>to a dead.

144
00:07:03.160 --> 00:07:06.560
<v Speaker 2>Server absolutely fundamental. You only want to route traffic to

145
00:07:06.720 --> 00:07:10.399
<v Speaker 2>instances that are actually healthy, usually meaning they return a

146
00:07:10.439 --> 00:07:14.079
<v Speaker 2>2xx or a 3xx HTTP status code. Anything

147
00:07:14.079 --> 00:07:14.920
<v Speaker 2>else is an error.

148
00:07:15.079 --> 00:07:19.480
<v Speaker 1>Doesn't constantly poking every instance add overhead, though? A performance tax?

149
00:07:19.639 --> 00:07:22.639
<v Speaker 2>That's a fair question. It's a trade-off. Traefik does

150
00:07:22.720 --> 00:07:25.279
<v Speaker 2>use active checks where it sends a probe and passive

151
00:07:25.319 --> 00:07:28.800
<v Speaker 2>checks, watching responses. But you configure the interval. You

152
00:07:28.879 --> 00:07:32.399
<v Speaker 2>tune it so you find a balance. Right, you set it so

153
00:07:32.439 --> 00:07:35.519
<v Speaker 2>the monitoring overhead isn't painful, but it's frequent enough to

154
00:07:35.560 --> 00:07:38.759
<v Speaker 2>pull an unhealthy instance out of the pool quickly when

155
00:07:38.800 --> 00:07:41.560
<v Speaker 2>it does fail. It's crucial for the stability of things

156
00:07:41.560 --> 00:07:42.600
<v Speaker 2>like round robin.

157
00:07:42.920 --> 00:07:45.839
<v Speaker 1>Let's shift gears a bit to more advanced resilience patterns.

158
00:07:46.240 --> 00:07:50.199
<v Speaker 1>Traffic mirroring, sometimes called shadowing, sounds useful for testing.

159
00:07:49.879 --> 00:07:52.879
<v Speaker 2>Oh, it's fantastic for canary deployments, really safe testing. The

160
00:07:52.959 --> 00:07:56.319
<v Speaker 2>idea is you take your live production traffic, the real stuff,

161
00:07:56.360 --> 00:07:59.319
<v Speaker 2>the real stuff, and you copy a small percentage of it,

162
00:07:59.319 --> 00:08:02.720
<v Speaker 2>say ten percent, and send that copy asynchronously to a

163
00:08:02.720 --> 00:08:05.120
<v Speaker 2>new test environment, maybe your guest v2 version.

164
00:08:05.040 --> 00:08:09.199
<v Speaker 1>Asynchronously, so the original user isn't waiting?

165
00:08:09.040 --> 00:08:13.279
<v Speaker 2>Exactly. And critically, Traefik ignores the response from that mirrored request.

166
00:08:13.279 --> 00:08:15.959
<v Speaker 2>It just fires it off and forgets about it. It lets

167
00:08:16.000 --> 00:08:18.800
<v Speaker 2>you see how your new code behaves under real load, its stability,

168
00:08:19.079 --> 00:08:22.839
<v Speaker 2>resource use without any risk to the actual user experience.
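
NOTE
Editor's sketch of the fire-and-forget mirroring described above. It is an
illustration under stated assumptions, not Traefik's implementation: the
function handle, its parameters, and the send_primary and send_mirror
callables are hypothetical. The thread is returned only so a caller or test
can wait for the mirror to finish.
```python
# Illustrative sketch only; all names here are hypothetical.
import random
import threading
def handle(request, send_primary, send_mirror, percent=10, rng=random.random):
    """Serve from the primary; copy about `percent`% of requests to the mirror."""
    thread = None
    if rng() * 100 < percent:
        # Fire and forget: the mirror's result or error never reaches the caller.
        thread = threading.Thread(
            target=_ignore_errors, args=(send_mirror, request), daemon=True
        )
        thread.start()
    return send_primary(request), thread
def _ignore_errors(fn, request):
    try:
        fn(request)
    except Exception:
        pass  # a failing mirror must not affect live traffic
```
The user-facing response always comes from the primary, so a crashing
mirror target costs nothing but the copied request.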

169
00:08:23.120 --> 00:08:26.040
<v Speaker 1>That's clever. Okay, so we've handled load and safe testing.

170
00:08:26.360 --> 00:08:29.839
<v Speaker 1>But what about when things actually fail, not just one instance,

171
00:08:29.839 --> 00:08:33.440
<v Speaker 1>but maybe a whole downstream database or API becomes slow

172
00:08:33.600 --> 00:08:37.000
<v Speaker 1>or unresponsive? In a microservices world, that seems like

173
00:08:37.039 --> 00:08:38.240
<v Speaker 1>it could cause chaos.

174
00:08:38.559 --> 00:08:42.799
<v Speaker 2>It absolutely can. That's the dreaded cascading failure scenario. One

175
00:08:43.120 --> 00:08:45.480
<v Speaker 2>slow dependency makes its callers wait.

176
00:08:45.519 --> 00:08:47.639
<v Speaker 1>They run out of threads or connections.

177
00:08:47.159 --> 00:08:50.080
<v Speaker 2>Exactly, and then they fail, taking down the services that

178
00:08:50.120 --> 00:08:51.799
<v Speaker 2>call them. It ripples outwards.

179
00:08:52.080 --> 00:08:54.519
<v Speaker 1>So how does Traefik act as a bulkhead and prevent

180
00:08:54.600 --> 00:08:55.279
<v Speaker 1>that ripple.

181
00:08:55.480 --> 00:08:58.840
<v Speaker 2>That's the job of the circuit breaker pattern. Traefik middleware

182
00:08:58.879 --> 00:09:01.919
<v Speaker 2>can implement this. It watches for failures going to a

183
00:09:01.919 --> 00:09:03.399
<v Speaker 2>particular back end service.

184
00:09:03.639 --> 00:09:06.759
<v Speaker 1>Failures meaning errors or timeouts?

185
00:09:06.240 --> 00:09:09.919
<v Speaker 2>Both, typically. Yeah, if the failure rate or maybe latency

186
00:09:10.200 --> 00:09:11.200
<v Speaker 2>crosses a threshold you define.

187
00:09:11.240 --> 00:09:13.519
<v Speaker 1>Like too many errors in the last minute?

188
00:09:13.320 --> 00:09:16.639
<v Speaker 2>Right, or responses are taking too long. If that happens,

189
00:09:16.879 --> 00:09:20.080
<v Speaker 2>Traefik trips the breaker. It stops sending requests to that

190
00:09:20.080 --> 00:09:21.559
<v Speaker 2>struggling service altogether for a period.

191
00:09:21.519 --> 00:09:24.440
<v Speaker 1>And just returns an error immediately?

192
00:09:23.960 --> 00:09:26.960
<v Speaker 2>Yep, usually a 503 Service Unavailable. It does

193
00:09:26.960 --> 00:09:30.600
<v Speaker 2>this instantly without even trying the failing service. This protects

194
00:09:30.600 --> 00:09:33.960
<v Speaker 2>the calling services from getting bogged down and saves resources

195
00:09:34.000 --> 00:09:36.519
<v Speaker 2>across the system. It's like the system saying, nope, that

196
00:09:36.600 --> 00:09:38.559
<v Speaker 2>door's closed for now, try again later.

197
00:09:38.480 --> 00:09:41.320
<v Speaker 1>And the conditions for tripping, they can be quite sophisticated, I saw.

198
00:09:41.360 --> 00:09:44.679
<v Speaker 2>Yeah. Traefik's implementation is pretty powerful. It's not

199
00:09:44.720 --> 00:09:48.639
<v Speaker 2>just simple failure counts. You could use expressions like: trip

200
00:09:48.679 --> 00:09:52.360
<v Speaker 2>if LatencyAtQuantileMS(50.0) > 100, meaning

201
00:09:52.600 --> 00:09:54.679
<v Speaker 2>the median response time is over one hundred milliseconds.

202
00:09:54.399 --> 00:09:57.000
<v Speaker 1>Or based on error ratio?

203
00:09:57.000 --> 00:09:59.320
<v Speaker 2>Exactly. ResponseCodeRatio(500, 600, 0, 600) > 0.25:

204
00:09:59.320 --> 00:10:01.600
<v Speaker 2>trip if more than twenty five percent of recent

205
00:10:01.679 --> 00:10:05.039
<v Speaker 2>responses were 5xx errors. Gives you fine-grained control.
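
NOTE
Editor's sketch of the error-ratio breaker described above. It is a
simplified illustration, not Traefik's implementation: the class
CircuitBreaker, the sliding window of recent results, min_samples, and the
cooldown are assumptions chosen for the example, and there is no separate
half-open recovery state.
```python
# Illustrative sketch only; thresholds and names are hypothetical.
import time
from collections import deque
class CircuitBreaker:
    def __init__(self, max_error_ratio=0.25, window=20, min_samples=4,
                 cooldown=10.0, clock=time.monotonic):
        self.max_error_ratio = max_error_ratio
        self.results = deque(maxlen=window)  # True marks a 5xx response
        self.min_samples = min_samples
        self.cooldown = cooldown
        self.clock = clock
        self.opened_at = None
    def call(self, backend):
        """Return a status code, answering 503 at once while the breaker is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return 503  # fail fast without touching the backend
            self.opened_at = None  # cooldown over: let traffic through again
            self.results.clear()
        status = backend()
        self.results.append(500 <= status < 600)
        ratio = sum(self.results) / len(self.results)
        if len(self.results) >= self.min_samples and ratio > self.max_error_ratio:
            self.opened_at = self.clock()
        return status
```
While open, callers get an immediate 503 and the struggling backend gets
no traffic at all, which is what stops the failure from rippling outward.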

206
00:10:05.200 --> 00:10:08.159
<v Speaker 1>Okay, circuit breakers handle the big failures. What about those

207
00:10:08.159 --> 00:10:12.240
<v Speaker 1>little annoying transient glitches like a brief network hiccup that

208
00:10:12.360 --> 00:10:13.840
<v Speaker 1>just needs a quick retry.

209
00:10:14.279 --> 00:10:17.159
<v Speaker 2>Perfect use case for the retry middleware. Just like hitting refresh

210
00:10:17.240 --> 00:10:19.399
<v Speaker 2>in your browser when a page times out, right? Traefik

211
00:10:19.440 --> 00:10:22.559
<v Speaker 2>can be configured to automatically retry a request, maybe once

212
00:10:22.639 --> 00:10:25.159
<v Speaker 2>or twice if it fails with specific errors like a

213
00:10:25.159 --> 00:10:27.679
<v Speaker 2>connection timeout or maybe a 502 Bad Gateway.

214
00:10:28.200 --> 00:10:30.559
<v Speaker 2>It provides a basic level of self healing for those

215
00:10:30.600 --> 00:10:31.799
<v Speaker 2>intermittent network blips.
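
NOTE
Editor's sketch of the retry idea described above. It is a generic
illustration, not Traefik's retry middleware: the function with_retries,
the attempt count, and the doubling delay are assumptions for the example,
and only connection-level exceptions are retried here.
```python
# Illustrative sketch only; attempt counts and delays are hypothetical.
import time
def with_retries(send, attempts=3, initial_delay=0.1, sleep=time.sleep):
    """Call send() until it succeeds, retrying connection-level failures."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2  # back off so a struggling backend gets breathing room
```
Keeping the attempt count small matters: retries multiply load, so they
suit brief blips and pair naturally with a circuit breaker for real outages.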

216
00:10:31.960 --> 00:10:35.919
<v Speaker 1>Makes sense. So we've got routing, balancing, resilience. But when

217
00:10:35.960 --> 00:10:38.039
<v Speaker 1>things do go wrong despite all this, we need to

218
00:10:38.039 --> 00:10:40.759
<v Speaker 1>figure out why. Let's talk observability.

219
00:10:41.080 --> 00:10:45.159
<v Speaker 2>Crucial. Observability isn't just knowing that something is wrong, but

220
00:10:45.279 --> 00:10:49.799
<v Speaker 2>having the data to understand why. And Traefik, sitting at

221
00:10:49.799 --> 00:10:53.120
<v Speaker 2>the entry point, is perfectly placed to collect that data.

222
00:10:53.279 --> 00:10:57.440
<v Speaker 1>Across the three pillars, right? Logs, traces, metrics.

223
00:10:57.559 --> 00:10:58.519
<v Speaker 2>Exactly. Let's start with logs.

224
00:10:59.159 --> 00:11:02.120
<v Speaker 1>Now people often say application logs alone aren't enough

225
00:11:02.120 --> 00:11:05.919
<v Speaker 1>in microservices. What makes Traefik's logs actually useful here?

226
00:11:06.200 --> 00:11:09.360
<v Speaker 2>Well, it generates standard error logs, of course, but the

227
00:11:09.399 --> 00:11:12.320
<v Speaker 2>real value is often in the access logs. The trick

228
00:11:12.480 --> 00:11:16.480
<v Speaker 2>is logging everything for every request can be really resource intensive.

229
00:11:16.600 --> 00:11:18.840
<v Speaker 1>Yeah, generates huge amounts of data.

230
00:11:18.600 --> 00:11:21.360
<v Speaker 2>So Traefik lets you filter them intelligently. You might say,

231
00:11:21.559 --> 00:11:24.519
<v Speaker 2>only log requests that resulted in a redirect, status codes

232
00:11:24.559 --> 00:11:27.919
<v Speaker 2>300 to 302, or only log requests

233
00:11:27.919 --> 00:11:31.279
<v Speaker 2>that took longer than, say, five seconds to complete, using

234
00:11:31.279 --> 00:11:32.440
<v Speaker 2>a minDuration filter.
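
NOTE
Editor's sketch of the access-log filtering described above. It is an
illustration, not Traefik's code: the function should_log and the record
fields "status" and "duration_s" are hypothetical, and the sketch assumes
the two filters are combined with OR, as the speaker's wording suggests.
```python
# Illustrative sketch only; field names on the record are hypothetical.
def should_log(record, status_ranges=((300, 302),), min_duration_s=5.0):
    """Keep a request if its status is in a listed range or it was slow."""
    in_range = any(lo <= record["status"] <= hi for lo, hi in status_ranges)
    slow = record["duration_s"] >= min_duration_s
    return in_range or slow
```
With these defaults a fast 200 is dropped, while any 301 or any request
taking five seconds or more is written to the access log.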

235
00:11:32.600 --> 00:11:35.600
<v Speaker 1>Ah, so you capture the interesting or problematic events without

236
00:11:35.679 --> 00:11:37.399
<v Speaker 1>drowning in routine data.

237
00:11:37.120 --> 00:11:40.600
<v Speaker 2>Precisely. Optimizes performance, gets you the diagnostic data you actually need.

238
00:11:40.679 --> 00:11:42.559
<v Speaker 1>Okay, logs tell us what happened at the edge, But

239
00:11:42.759 --> 00:11:46.440
<v Speaker 1>to follow a request through multiple services, we need tracing.

240
00:11:46.360 --> 00:11:50.799
<v Speaker 2>Right. Request tracing stitches the whole journey together. Each piece

241
00:11:50.840 --> 00:11:53.840
<v Speaker 2>of work done by a service is a span. All

242
00:11:53.879 --> 00:11:57.919
<v Speaker 2>the spans for one user request combine into a single trace.

243
00:11:57.879 --> 00:11:59.840
<v Speaker 1>Like a timeline of the request's life?

244
00:11:59.600 --> 00:12:03.399
<v Speaker 2>Exactly, and Traefik, being the first point of contact, can

245
00:12:03.440 --> 00:12:08.080
<v Speaker 2>generate standardized trace headers, often B3 propagation headers, things

246
00:12:08.120 --> 00:12:11.000
<v Speaker 2>like X-B3-TraceId. Think of them like a digital passport.

247
00:12:10.720 --> 00:12:12.919
<v Speaker 1>And it passes that passport along?

248
00:12:13.240 --> 00:12:16.000
<v Speaker 2>It injects those headers into the request before forwarding it

249
00:12:16.039 --> 00:12:19.000
<v Speaker 2>to the first back end service. That service, if it's

250
00:12:19.039 --> 00:12:21.840
<v Speaker 2>trace aware, adds its own span and passes the headers on.

251
00:12:22.399 --> 00:12:25.200
<v Speaker 2>So even if the request hits five different microservices,

252
00:12:25.320 --> 00:12:25.639
<v Speaker 2>you can.

253
00:12:25.519 --> 00:12:27.879
<v Speaker 1>See the whole chain in a system like Zipkin or

254
00:12:28.000 --> 00:12:28.960
<v Speaker 1>Jaeger?

255
00:12:29.080 --> 00:12:32.879
<v Speaker 2>Exactly. End-to-end visibility, invaluable for debugging distributed systems.
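
NOTE
Editor's sketch of the header "passport" described above. It is an
illustration, not Traefik's tracing code: the function propagate_b3 is
hypothetical, headers are modeled as a plain case-sensitive dict, and the
header names follow the B3 multi-header convention (32 or 16 hex characters
for the trace ID, 16 for span IDs).
```python
# Illustrative sketch only; propagate_b3 is a hypothetical helper.
import secrets
def propagate_b3(incoming):
    """Return outgoing headers: reuse the trace ID if present, start a new span."""
    outgoing = dict(incoming)
    # The edge mints a trace ID; every later hop must carry it unchanged.
    trace_id = incoming.get("X-B3-TraceId") or secrets.token_hex(16)
    parent = incoming.get("X-B3-SpanId")
    outgoing["X-B3-TraceId"] = trace_id
    outgoing["X-B3-SpanId"] = secrets.token_hex(8)
    if parent:
        # The caller's span becomes this hop's parent, which links the chain.
        outgoing["X-B3-ParentSpanId"] = parent
    return outgoing
```
Calling it once at the edge and again at each service yields spans that
share one trace ID, which is what lets a tracing backend draw the timeline.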

256
00:12:32.519 --> 00:12:34.960
<v Speaker 1>And the third pillar, metrics, the numbers?

257
00:12:35.320 --> 00:12:39.679
<v Speaker 2>Yep. Traefik exposes key application-level metrics, things like total request counts,

258
00:12:39.720 --> 00:12:43.559
<v Speaker 2>request latencies, averages, quantiles, error rates, information about the backend servers.

259
00:12:43.559 --> 00:12:45.639
<v Speaker 1>And you feed that into...

260
00:12:45.679 --> 00:12:49.919
<v Speaker 2>Standard monitoring systems, typically Prometheus. Prometheus scrapes these metrics from

261
00:12:50.000 --> 00:12:53.399
<v Speaker 2>Traefik periodically. Then you can use tools like Grafana to

262
00:12:53.720 --> 00:12:58.360
<v Speaker 2>visualize trends, plan capacity, and set up automated alerts if say,

263
00:12:58.399 --> 00:13:00.600
<v Speaker 2>error rates spike or latency degrades.

264
00:13:01.039 --> 00:13:03.399
<v Speaker 1>Got it. Okay, let's bring this home to the place

265
00:13:03.399 --> 00:13:07.840
<v Speaker 1>where Traefik seems most popular: Kubernetes. You mentioned earlier that

266
00:13:07.919 --> 00:13:10.919
<v Speaker 1>the original Kubernetes ingress API wasn't great.

267
00:13:11.159 --> 00:13:14.519
<v Speaker 2>Yeah, it was, let's say, a bit underspecified, vague,

268
00:13:14.879 --> 00:13:18.440
<v Speaker 2>which forced vendors like Traefik, NGINX, and others to

269
00:13:18.519 --> 00:13:20.200
<v Speaker 2>rely heavily on custom annotations.

270
00:13:19.879 --> 00:13:23.120
<v Speaker 1>Annotations being those kind of messy text strings in

271
00:13:23.159 --> 00:13:23.759
<v Speaker 1>the YAML?

272
00:13:23.720 --> 00:13:27.799
<v Speaker 2>Exactly. You'd have dozens of vendor-specific annotations to configure

273
00:13:27.840 --> 00:13:31.519
<v Speaker 2>basic things like timeouts or retries or sticky sessions. It

274
00:13:31.559 --> 00:13:33.120
<v Speaker 2>wasn't clean, wasn't standardized.

275
00:13:33.279 --> 00:13:35.240
<v Speaker 1>So how did Traefik improve on that? They gave up

276
00:13:35.240 --> 00:13:36.879
<v Speaker 1>on Ingress?

277
00:13:36.919 --> 00:13:41.440
<v Speaker 2>In Traefik v2, they shifted strategy. They embraced Kubernetes custom resource definitions, or CRDs.

278
00:13:41.799 --> 00:13:45.679
<v Speaker 2>They introduced their own resources like IngressRoute, Middleware, TLSOption.

279
00:13:45.240 --> 00:13:48.360
<v Speaker 1>So instead of annotations, you define routing rules using

280
00:13:48.440 --> 00:13:51.279
<v Speaker 1>these custom but still native-feeling Kubernetes objects.

281
00:13:51.559 --> 00:13:55.639
<v Speaker 2>Precisely. It's a much nicer experience. As they say, configuration

282
00:13:55.720 --> 00:14:00.840
<v Speaker 2>becomes structured, version-controllable Kubernetes YAML, just like your deployments and services.

283
00:14:01.240 --> 00:14:04.759
<v Speaker 2>Any Kubernetes engineer can understand it. It follows familiar patterns,

284
00:14:05.159 --> 00:14:08.600
<v Speaker 2>no more digging through annotation documentation for different vendors.

285
00:14:08.639 --> 00:14:10.879
<v Speaker 1>That sounds like a huge improvement. And you also touched

286
00:14:10.919 --> 00:14:15.080
<v Speaker 1>on TLS simplification. Getting certificates is often a real pain.

287
00:14:15.159 --> 00:14:19.600
<v Speaker 2>Oh, historically it was awful. Manual requests, validation hoops, remembering

288
00:14:19.639 --> 00:14:22.799
<v Speaker 2>to renew. High chance of error, high risk.

289
00:14:23.000 --> 00:14:24.840
<v Speaker 1>So how does Traefik fix that?

290
00:14:25.200 --> 00:14:28.159
<v Speaker 2>It integrates directly with the ACME protocol, which is the

291
00:14:28.279 --> 00:14:32.679
<v Speaker 2>standard Let's Encrypt uses for automating certificate issuance for public domains.

292
00:14:32.840 --> 00:14:35.639
<v Speaker 1>Let's Encrypt, the free certificate authority?

293
00:14:35.639 --> 00:14:38.720
<v Speaker 2>Right. In Traefik, you basically just configure a cert resolver

294
00:14:38.840 --> 00:14:42.120
<v Speaker 2>pointing to Let's Encrypt. Then when you define an

295
00:14:42.200 --> 00:14:43.360
<v Speaker 2>IngressRoute for a public hostname,

296
00:14:43.240 --> 00:14:44.840
<v Speaker 1>Traefik just handles it?

297
00:14:44.840 --> 00:14:49.039
<v Speaker 2>It handles the entire lifecycle automatically. It requests the certificate,

298
00:14:49.320 --> 00:14:52.440
<v Speaker 2>handles the domain validation challenge, often using something called the

299
00:14:52.480 --> 00:14:57.799
<v Speaker 2>TLS-ALPN-01 challenge. It's quite neat. It retrieves the certificate,

300
00:14:58.000 --> 00:15:01.320
<v Speaker 2>installs it, and even handles renewal before it expires.

301
00:15:01.799 --> 00:15:04.720
<v Speaker 1>Wow. So the developer just defines the route, asks for

302
00:15:04.840 --> 00:15:07.960
<v Speaker 1>TLS, and Traefik and Let's Encrypt do the rest?

303
00:15:08.039 --> 00:15:12.639
<v Speaker 2>Pretty much. Focus on the application logic. The complicated, error

304
00:15:12.679 --> 00:15:15.279
<v Speaker 2>prone task of certificate management just happens.

305
00:15:15.440 --> 00:15:18.159
<v Speaker 1>So, wrapping it up, Traefik's core value seems to be

306
00:15:18.200 --> 00:15:22.039
<v Speaker 1>replacing that old, rigid manual configuration world.

307
00:15:21.840 --> 00:15:24.639
<v Speaker 2>Which just breaks under microservice dynamism.

308
00:15:24.440 --> 00:15:28.120
<v Speaker 1>with a dynamic, self-configuring system built for that reality.

309
00:15:28.159 --> 00:15:31.080
<v Speaker 1>It's the traffic cop that learns the roads automatically as

310
00:15:31.120 --> 00:15:33.279
<v Speaker 1>they get built or torn down.

311
00:15:33.360 --> 00:15:35.759
<v Speaker 2>Well put. And there's a final thought, maybe a provocative one, tied

312
00:15:35.759 --> 00:15:39.120
<v Speaker 2>to that certificate automation we just discussed. Why? Traditionally, certificate

313
00:15:39.159 --> 00:15:42.440
<v Speaker 2>management was so painful and manual, people did it infrequently,

314
00:15:42.919 --> 00:15:45.480
<v Speaker 2>maybe once a year. This meant certificates were valid for

315
00:15:45.519 --> 00:15:48.679
<v Speaker 2>a long time. If one got compromised somehow, an attacker

316
00:15:48.679 --> 00:15:50.039
<v Speaker 2>had a year long window.

317
00:15:50.279 --> 00:15:52.799
<v Speaker 1>Right. Long lived credentials are risky.

318
00:15:52.720 --> 00:15:56.799
<v Speaker 2>Very, yeah. But because Traefik's integration with Let's Encrypt automates

319
00:15:56.840 --> 00:16:01.120
<v Speaker 2>the renewal process, certificates typically only live for ninety days now,

320
00:16:01.679 --> 00:16:05.399
<v Speaker 2>and the renewal's automatic, often no human touch needed.

321
00:16:05.639 --> 00:16:08.840
<v Speaker 1>So it drastically shrinks the window of opportunity for an

322
00:16:08.879 --> 00:16:11.559
<v Speaker 1>attacker using a compromised certificate.

323
00:16:11.120 --> 00:16:15.519
<v Speaker 2>Exactly. It removes a tedious, error-prone operational task and

324
00:16:15.639 --> 00:16:19.960
<v Speaker 2>significantly improves your security posture by enforcing short certificate lifetimes.

325
00:16:20.639 --> 00:16:23.799
<v Speaker 2>That whole category of operational security risk just kind of

326
00:16:23.840 --> 00:16:25.440
<v Speaker 2>melts away thanks to automation.

327
00:16:25.679 --> 00:16:29.240
<v Speaker 1>That's a really powerful side effect of adopting modern tooling.

328
00:16:29.320 --> 00:16:31.759
<v Speaker 1>A fantastic insight to end on. Thank you for taking

329
00:16:31.840 --> 00:16:33.679
<v Speaker 1>us through this deep dive into Traefik.

330
00:16:33.399 --> 00:16:35.039
<v Speaker 2>My pleasure. It's fascinating technology.

331
00:16:35.120 --> 00:16:37.519
<v Speaker 1>And thank you, our listeners, for joining us. We'll catch

332
00:16:37.519 --> 00:16:38.519
<v Speaker 1>you on the next deep dive.
