WEBVTT 1 00:00:00.080 --> 00:00:03.160 Welcome back to the Deep Dive. Today we're pulling back 2 00:00:03.200 --> 00:00:07.080 the curtain on something, well, fundamental yet often kind of unseen: 3 00:00:07.919 --> 00:00:10.919 the intricate world of cloud networking on AWS. 4 00:00:11.119 --> 00:00:12.119 That's right. You've sent 5 00:00:12.160 --> 00:00:14.279 us a stack of sources, looks like excerpts from an 6 00:00:14.279 --> 00:00:19.879 AWS certification book, and our mission is pretty clear: distill 7 00:00:19.920 --> 00:00:23.120 those crucial insights. We give you a shortcut, really, to 8 00:00:23.199 --> 00:00:27.879 being genuinely well informed about the hidden physics of the cloud. 9 00:00:28.320 --> 00:00:29.000 Let's unpack this. 10 00:00:29.199 --> 00:00:32.439 It is fascinating, isn't it? Because, you know, cloud services 11 00:00:32.560 --> 00:00:36.600 can feel almost like magic, but underneath they're built on 12 00:00:36.679 --> 00:00:41.719 these deeply rooted networking principles, just with a dynamic global twist. 13 00:00:42.079 --> 00:00:44.159 So today, yeah, we're going to try and uncover those 14 00:00:44.200 --> 00:00:47.320 aha moments. We'll go from the virtual network interfaces powering 15 00:00:47.399 --> 00:00:50.159 your instances all the way to how global traffic is managed, 16 00:00:50.159 --> 00:00:53.399 how it's secured, revealing the key bits and, importantly, the 17 00:00:53.479 --> 00:00:56.399 unseen challenges that keep everything running. 18 00:00:56.520 --> 00:00:58.679 Okay, so where do we even begin with something as 19 00:00:58.679 --> 00:01:01.200 big as cloud networking? Maybe start right at the foundation? 20 00:01:01.560 --> 00:01:02.039 Makes sense. 21 00:01:02.119 --> 00:01:04.359 Let's talk about the elastic network interface, the ENI. 22 00:01:04.560 --> 00:01:05.400 The ENI.
23 00:01:05.319 --> 00:01:07.959 Right. So think of it like the cloud's virtual network 24 00:01:08.000 --> 00:01:11.400 card for your EC2 instances. Every single bit of 25 00:01:11.480 --> 00:01:13.640 network traffic, in and out, flows through an ENI. 26 00:01:13.719 --> 00:01:16.519 Exactly. But the key thing about ENIs, 27 00:01:16.560 --> 00:01:20.799 it seems, isn't just that they're virtual NICs. It's their flexibility, right? Absolutely, 28 00:01:20.840 --> 00:01:21.799 the flexibility is huge. 29 00:01:21.799 --> 00:01:24.959 You can attach them, detach them, in different states too. 30 00:01:25.280 --> 00:01:25.560 Like, you 31 00:01:25.480 --> 00:01:29.480 can do a hot attachment while the instance is actually running? 32 00:01:29.280 --> 00:01:32.120 Yep, hot attachment while running, warm attachment if it's stopped, 33 00:01:32.200 --> 00:01:34.120 or even cold attachment right when you launch it. 34 00:01:34.239 --> 00:01:37.319 Wow. And an instance can have more than one, like 35 00:01:37.439 --> 00:01:39.079 connected to different parts of your network? 36 00:01:39.200 --> 00:01:41.079 Yeah, definitely. You can have multiple ENIs on 37 00:01:41.120 --> 00:01:44.040 a single EC2 instance, each connected to a different 38 00:01:44.120 --> 00:01:47.680 VPC subnet, maybe for different security zones or traffic types. 39 00:01:48.120 --> 00:01:51.920 That detachment flexibility really changes how you think about high availability. 40 00:01:52.079 --> 00:01:55.120 Okay, so that's the interface. What about the addresses themselves 41 00:01:55.560 --> 00:01:56.760 inside AWS? 42 00:01:56.840 --> 00:02:01.159 Good question. Let's dive into IP addressing. So mostly you'll 43 00:02:01.200 --> 00:02:04.680 see VPCs using private IP ranges, you know, the standard 44 00:02:04.840 --> 00:02:06.680 RFC 1918 stuff. 45 00:02:06.400 --> 00:02:10.960 Right, like 10.x or 172.16.x. Exactly.
46 00:02:11.599 --> 00:02:14.919 But subnets can also allow for the auto-assignment of 47 00:02:15.039 --> 00:02:18.159 public IPv4 addresses when you launch an instance. 48 00:02:18.360 --> 00:02:22.360 Ah, okay. But wait, if those auto-assigned public IPs 49 00:02:22.400 --> 00:02:26.039 can just change, say if you stop and start the instance, yeah, 50 00:02:26.120 --> 00:02:29.879 how do you deal with services that absolutely need a fixed, 51 00:02:30.120 --> 00:02:33.840 unchanging external address, like a web server or something? 52 00:02:34.199 --> 00:02:37.639 That's a fantastic question, and that's precisely where Elastic IP 53 00:02:37.759 --> 00:02:41.560 addresses, or EIPs, come in. They're indispensable for that. EIPs, 54 00:02:41.800 --> 00:02:45.240 they are static public IPv4 addresses. You basically allocate 55 00:02:45.240 --> 00:02:48.879 them to your AWS account, not directly to an instance initially. 56 00:02:49.000 --> 00:02:51.639 Ah, okay. So they belong to the account. 57 00:02:51.280 --> 00:02:53.680 Right, and then you can associate an EIP with an 58 00:02:53.680 --> 00:02:56.199 ENI or directly with an EC2 instance. 59 00:02:56.560 --> 00:02:59.199 The crucial flexibility here is that the EIP isn't permanently 60 00:02:59.240 --> 00:03:02.080 tied to that specific piece of hardware, or virtual 61 00:03:01.759 --> 00:03:03.719 hardware. So you can move it around. Exactly. 62 00:03:03.759 --> 00:03:06.000 If an instance fails or you need to swap something out, 63 00:03:06.360 --> 00:03:09.560 you just reassociate that same EIP with a different instance 64 00:03:09.639 --> 00:03:11.680 or ENI. Gives you that stable public face 65 00:03:11.680 --> 00:03:14.919 for your applications, regardless of the underlying instance churn. 66 00:03:15.120 --> 00:03:18.879 That makes a lot of sense. Okay, building on managing IPs efficiently, 67 00:03:20.560 --> 00:03:24.120 the sources mentioned something called prefix lists.
What are those about? 68 00:03:24.159 --> 00:03:25.439 How do they make life simpler? 69 00:03:25.520 --> 00:03:29.759 Prefix lists are actually quite clever. They're basically custom, managed 70 00:03:29.759 --> 00:03:32.960 lists of IP address ranges, or prefixes. You maintain these 71 00:03:33.000 --> 00:03:36.319 lists and then you can reference them consistently in your network 72 00:03:36.000 --> 00:03:39.960 configs. Like in security groups or route tables? Precisely. 73 00:03:40.400 --> 00:03:43.960 Instead of typing out or copying and pasting potentially huge 74 00:03:44.080 --> 00:03:46.479 lists of IP addresses over and over again, you just 75 00:03:46.520 --> 00:03:49.520 refer to the prefix list by its name. It simplifies 76 00:03:49.599 --> 00:03:50.919 policy creation immensely. 77 00:03:51.039 --> 00:03:53.800 Okay, so you define it once, use it many times. Exactly. 78 00:03:53.960 --> 00:03:56.840 And there are two types. You've got AWS-managed prefix lists, 79 00:03:56.919 --> 00:03:59.599 which AWS maintains for their own services. Makes it super 80 00:03:59.599 --> 00:04:03.360 easy to allow traffic to S3 or DynamoDB, for example. 81 00:04:03.520 --> 00:04:03.960 Oh, nice. 82 00:04:04.000 --> 00:04:06.439 And then you have customer-managed prefix lists, where you 83 00:04:06.479 --> 00:04:08.639 define your own groups of IPs. Maybe you create one 84 00:04:08.680 --> 00:04:12.319 called all-dev-resources that includes all the CIDR blocks 85 00:04:12.319 --> 00:04:16.120 for your development VPCs. Makes managing access way easier. 86 00:04:16.040 --> 00:04:18.120 Right, I can see how that would tidy things up. 87 00:04:18.600 --> 00:04:23.399 Now for something that sounds a bit more mysterious: the Hyperplane. 88 00:04:23.560 --> 00:04:24.680 What on earth is that? 89 00:04:24.839 --> 00:04:26.399 Yeah, it does sound a bit sci-fi, doesn't it?
90 00:04:26.600 --> 00:04:30.360 Think of the Hyperplane as, like, the virtual network engine 91 00:04:30.360 --> 00:04:35.680 of AWS. It's the massively distributed underlying infrastructure that takes 92 00:04:35.680 --> 00:04:39.360 the physical network and slices it up virtually for every customer. 93 00:04:39.560 --> 00:04:42.399 That's what makes VPCs and all these services actually work. 94 00:04:42.720 --> 00:04:45.800 Okay, the engine behind the scenes. But what's surprising about it? 95 00:04:46.160 --> 00:04:48.879 What's surprising, or at least really important to understand, is 96 00:04:48.920 --> 00:04:52.680 how it operates with these artificial limits. AWS puts these 97 00:04:52.680 --> 00:04:55.959 in place to ensure fair resource allocation across all tenants. 98 00:04:56.120 --> 00:04:57.040 Limits like bandwidth? 99 00:04:57.160 --> 00:04:59.399 Yeah, bandwidth or throughput limits are part of it, and 100 00:04:59.439 --> 00:05:01.959 they vary hugely depending on the service. You know, a 101 00:05:02.000 --> 00:05:05.800 Transit Gateway VPC attachment might go up to 50 gigabits 102 00:05:05.800 --> 00:05:09.639 per second, whereas a single VPN tunnel might top out 103 00:05:09.680 --> 00:05:13.439 at, say, 1.25 Gbps, and Direct Connect, 104 00:05:13.439 --> 00:05:15.360 depending on the port, maybe 1, 10, or even 105 00:05:15.399 --> 00:05:18.000 100 Gbps. Now, these numbers kind of show the different 106 00:05:18.000 --> 00:05:18.959 scales you're working at. 107 00:05:19.680 --> 00:05:24.399 That huge difference between TGW and a VPN tunnel is striking. 108 00:05:25.079 --> 00:05:28.879 But you mentioned something else, something trickier than just bandwidth. 109 00:05:29.040 --> 00:05:31.279 Yes, and this is the one that catches people out, 110 00:05:31.319 --> 00:05:36.120 even experienced folks. It's the packets-per-second, or PPS, limitations. 111 00:05:35.639 --> 00:05:37.800 Packets per second. Why is that trickier?
112 00:05:37.959 --> 00:05:40.680 Because you can often hit the PPS limit before you 113 00:05:40.759 --> 00:05:43.560 hit the bandwidth limit, especially with lots of small packets, 114 00:05:43.600 --> 00:05:46.160 like certain types of application traffic, or maybe even a 115 00:05:46.240 --> 00:05:48.240 DDoS attack using small packets. 116 00:05:48.240 --> 00:05:48.800 And what happens then? 117 00:05:48.800 --> 00:05:52.639 You start dropping packets silently. Your bandwidth monitors might 118 00:05:52.680 --> 00:05:55.800 look totally fine, nowhere near saturated, but packets are just 119 00:05:55.879 --> 00:05:59.480 vanishing into the ether because the Hyperplane component handling your 120 00:05:59.480 --> 00:06:01.839 traffic can't process them fast enough. 121 00:06:01.879 --> 00:06:04.560 Ouch. So how do you even spot that if there are, 122 00:06:04.680 --> 00:06:06.879 as you said, no obvious signs? 123 00:06:06.959 --> 00:06:09.240 That is the challenge. It feels like a ghost in the machine. 124 00:06:09.720 --> 00:06:13.639 Diagnosing it usually means you need to look beyond just throughput. 125 00:06:13.839 --> 00:06:16.879 You need metrics on packet counts, maybe packet drop counters 126 00:06:16.920 --> 00:06:19.879 if the service exposes them, or you have to use 127 00:06:19.920 --> 00:06:24.120 tools like VPC Flow Logs or even VPC Traffic Mirroring, 128 00:06:24.160 --> 00:06:26.079 which we can get into later, to try and see 129 00:06:26.079 --> 00:06:28.720 what's actually happening at the packet level. It definitely defies 130 00:06:28.759 --> 00:06:30.360 traditional bandwidth troubleshooting. 131 00:06:30.439 --> 00:06:33.600 Okay, that's a really crucial, subtle point. So we've got 132 00:06:33.600 --> 00:06:37.560 the building blocks: ENIs, IPs, limits. But clouds 133 00:06:37.560 --> 00:06:40.680 aren't usually isolated islands, right? You need them to talk 134 00:06:40.720 --> 00:06:41.279 to each other.
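To make the packets-per-second point concrete, here is a minimal arithmetic sketch of why small packets can exhaust a PPS budget long before a bandwidth budget. The limits used (1,000,000 PPS and 10 Gbps) are made-up illustrative values, not AWS quotas, and the function names are hypothetical.

```python
def throughput_gbps(pps: int, packet_bytes: int) -> float:
    """Bandwidth, in Gbps, consumed by `pps` packets of `packet_bytes` bytes each."""
    return pps * packet_bytes * 8 / 1e9


def limiting_factor(pps_limit: int, bandwidth_limit_gbps: float, packet_bytes: int) -> str:
    """Return which limit is reached first for a given packet size."""
    # Bandwidth in use at the moment the PPS limit is hit.
    gbps_at_pps_limit = throughput_gbps(pps_limit, packet_bytes)
    return "pps" if gbps_at_pps_limit < bandwidth_limit_gbps else "bandwidth"


# Hypothetical limits, for illustration only (not real AWS numbers).
PPS_LIMIT = 1_000_000
BANDWIDTH_LIMIT_GBPS = 10.0

for size in (64, 1500):
    print(size, limiting_factor(PPS_LIMIT, BANDWIDTH_LIMIT_GBPS, size))
```

Under these assumed limits, 64-byte packets reach the PPS ceiling at roughly 0.5 Gbps, so a bandwidth graph would look nearly idle while packets are being dropped; 1500-byte packets would hit the bandwidth ceiling first.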
135 00:06:41.439 --> 00:06:43.040 Absolutely. Connecting things is key. 136 00:06:43.160 --> 00:06:46.639 So moving beyond individual instances, how do we connect these 137 00:06:46.720 --> 00:06:50.199 cloud components? What's the simplest way? VPC peering? 138 00:06:50.639 --> 00:06:53.639 Yeah, VPC peering is often the starting point. It creates 139 00:06:53.680 --> 00:06:58.439 a direct private connection between two VPCs using AWS's backbone. 140 00:06:58.600 --> 00:07:01.360 It's pretty straightforward to set up. And actually there aren't 141 00:07:01.439 --> 00:07:05.600 explicit throughput limits imposed by the peering connection itself, beyond 142 00:07:05.720 --> 00:07:07.120 instance or other limits. 143 00:07:07.319 --> 00:07:09.560 Sounds good, but there have to be catches, 144 00:07:09.319 --> 00:07:13.720 right? Oh, definitely. Key considerations. First, it's non-transitive, meaning if 145 00:07:13.800 --> 00:07:18.279 VPC A is peered with VPC B, and VPC B is peered with VPC C, 146 00:07:19.240 --> 00:07:23.160 VPC A cannot automatically talk to VPC C just by going through B. 147 00:07:23.600 --> 00:07:26.480 There's no implicit routing pass-through. You'd need a separate 148 00:07:26.480 --> 00:07:29.000 peering connection directly between A and C. Ah. 149 00:07:29.040 --> 00:07:31.600 Okay, so no hub-and-spoke just using peering. Exactly. 150 00:07:31.720 --> 00:07:34.920 And the second big one, maybe even bigger: it absolutely 151 00:07:34.920 --> 00:07:38.279 cannot be used if the VPCs have overlapping CIDR ranges. 152 00:07:38.360 --> 00:07:41.120 Right, if both VPCs use 10.0.0.0/16, 153 00:07:41.120 --> 00:07:42.759 for example. 154 00:07:42.519 --> 00:07:46.199 Yep, peering just won't work. Routing across peered VPCs relies 155 00:07:46.240 --> 00:07:48.319 on static routes.
You have to manually add them to the 156 00:07:48.360 --> 00:07:51.399 route tables in both VPCs for traffic to flow back 157 00:07:51.399 --> 00:07:51.839 and forth. 158 00:07:51.920 --> 00:07:55.240 That IP overlap thing, that sounds like a potential nightmare. 159 00:07:55.480 --> 00:07:58.959 You mentioned company mergers earlier. Imagine trying to connect two 160 00:07:59.000 --> 00:08:02.120 company networks that both picked, say, 10.100.0.0/16 161 00:08:02.120 --> 00:08:05.160 independently. Peering is out. 162 00:08:05.360 --> 00:08:08.720 Is there any way to make workloads with overlapping IPs talk? 163 00:08:08.800 --> 00:08:11.040 You're right, it's a huge challenge in mergers or large 164 00:08:11.120 --> 00:08:15.040 organizations. VPC peering, it's a wall there. There is a solution, 165 00:08:15.480 --> 00:08:18.839 though it's not perfect. Yeah, using private NAT gateways. 166 00:08:18.399 --> 00:08:20.879 NAT gateways? But usually those are for getting out to the internet. 167 00:08:21.040 --> 00:08:23.800 Correct, those are public NAT gateways, but you can also 168 00:08:23.839 --> 00:08:27.000 set up private NAT gateways. They allow workloads in one 169 00:08:27.079 --> 00:08:31.040 VPC to initiate connections to workloads in another VPC, even 170 00:08:31.040 --> 00:08:34.360 if they have overlapping IPs, because the NAT gateway handles 171 00:08:34.399 --> 00:08:36.320 the address translation on the way out. 172 00:08:36.399 --> 00:08:39.320 Ah, clever. But you said initiate. 173 00:08:39.039 --> 00:08:41.879 Yeah, that's the caveat. The communication generally has to be 174 00:08:41.919 --> 00:08:45.080 initiated from the side using the NAT gateway. It's not 175 00:08:45.159 --> 00:08:48.679 a truly transparent bidirectional connection like peering would be if 176 00:08:48.679 --> 00:08:52.200 the IPs didn't overlap.
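The overlap problem described above can be checked before any peering is attempted. The following is a minimal sketch using Python's standard `ipaddress` module; the VPC names and CIDR assignments are invented for illustration, and `cidrs_overlap` and `overlapping_pairs` are hypothetical helper names, not part of any AWS tooling.

```python
import ipaddress
from itertools import combinations


def cidrs_overlap(a: str, b: str) -> bool:
    """True if the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


def overlapping_pairs(vpc_cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of VPC names whose CIDR blocks overlap."""
    return [
        (name_a, name_b)
        for (name_a, cidr_a), (name_b, cidr_b) in combinations(vpc_cidrs.items(), 2)
        if cidrs_overlap(cidr_a, cidr_b)
    ]


# Made-up example: two merged companies that both picked 10.100.0.0/16.
VPCS = {
    "company-a": "10.100.0.0/16",
    "company-b": "10.100.0.0/16",
    "shared-services": "10.200.0.0/16",
}

print(overlapping_pairs(VPCS))
```

Any pair this reports could not be peered directly and would need a workaround such as the private NAT gateway approach the hosts describe.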
Solves a specific problem, but it's 177 00:08:52.240 --> 00:08:54.840 not a universal fix for overlapping CIDRs. 178 00:08:55.159 --> 00:08:58.279 Okay, so peering is simple but limited, especially by non-transitivity 179 00:08:58.320 --> 00:09:01.720 and IP overlap. How did AWS address the need for larger, 180 00:09:01.879 --> 00:09:04.399 more complex, maybe hub-and-spoke networks in the cloud? 181 00:09:04.720 --> 00:09:07.039 Well, the community first came up with solutions like the 182 00:09:07.080 --> 00:09:10.759 Transit VPC. This usually involves setting up dedicated EC2 183 00:09:10.840 --> 00:09:15.440 instances running routing software, network virtual appliances or NVAs, in 184 00:09:15.519 --> 00:09:17.360 the central VPC to act as a hub. 185 00:09:17.559 --> 00:09:19.879 So building your own router in the cloud, basically. 186 00:09:19.600 --> 00:09:23.159 Pretty much. It worked, but managing those NVAs, worrying about 187 00:09:23.159 --> 00:09:28.039 their scaling, high availability, it's complex. So AWS eventually released 188 00:09:28.039 --> 00:09:30.759 a managed service to solve this much more elegantly: the 189 00:09:30.840 --> 00:09:34.519 AWS Transit Gateway, or TGW. Transit Gateway. 190 00:09:34.559 --> 00:09:35.440 Okay, how's that different? 191 00:09:35.519 --> 00:09:39.600 TGW acts as a fully managed, highly scalable central cloud 192 00:09:39.679 --> 00:09:43.000 router or hub. You attach your VPCs, your VPN connections, 193 00:09:43.000 --> 00:09:47.559 your Direct Connect connections, all to the TGW. It simplifies inter-VPC 194 00:09:47.639 --> 00:09:51.440 connectivity massively and also makes hybrid networking, connecting to on 195 00:09:51.519 --> 00:09:54.720 premises, much cleaner. It takes the routing burden off you 196 00:09:54.960 --> 00:09:57.799 and puts it into a managed AWS service. A true 197 00:09:57.879 --> 00:09:59.399 hub-and-spoke model becomes easy. 198 00:09:59.480 --> 00:10:04.360 Got it.
So TGW is the modern way for complex connectivity. Now, 199 00:10:04.440 --> 00:10:07.320 speaking of hybrid, what about that dedicated link you mentioned, 200 00:10:07.360 --> 00:10:10.600 Direct Connect, or DX? Why would someone go for DX 201 00:10:10.600 --> 00:10:12.360 instead of just setting up a VPN over the Internet? 202 00:10:12.399 --> 00:10:13.840 It seems like VPNs are pretty common. 203 00:10:13.960 --> 00:10:17.080 They are common and often sufficient, but Direct Connect offers 204 00:10:17.120 --> 00:10:21.960 several really critical advantages, especially for larger enterprises or sensitive workloads. 205 00:10:22.159 --> 00:10:26.320 Like what? First, privacy and security. DX provides a dedicated 206 00:10:26.399 --> 00:10:29.519 private circuit. Your traffic isn't going over the public Internet, 207 00:10:29.600 --> 00:10:34.120 so it can't be snooped on easily. Second, reliability. DX 208 00:10:34.159 --> 00:10:38.200 comes with service level agreements, SLAs, promising certain levels of uptime. 209 00:10:38.440 --> 00:10:41.080 The public Internet is inherently best effort. 210 00:10:40.840 --> 00:10:43.679 Okay, so more secure, more reliable, and? 211 00:10:43.639 --> 00:10:48.519 Third, performance. Significantly higher bandwidth is possible. DX connections 212 00:10:48.559 --> 00:10:51.480 come in 1 Gbps, 10 Gbps, and now even 213 00:10:51.600 --> 00:10:55.559 100 Gbps flavors. Plus you generally get lower and more 214 00:10:55.600 --> 00:10:57.320 consistent latency compared 215 00:10:56.960 --> 00:10:59.759 to the Internet. 100 gigs? Wow. And it's literally 216 00:10:59.799 --> 00:11:01.840 a physical connection, right? Like a cable?
217 00:11:02.000 --> 00:11:05.360 Yes. Fundamentally, you work with AWS or a partner to 218 00:11:05.360 --> 00:11:07.519 get a physical cross-connect cable run in a shared 219 00:11:07.559 --> 00:11:10.559 data center, a Direct Connect location, between your networking equipment 220 00:11:10.639 --> 00:11:13.799 and AWS's equipment. There's even a document involved, the Letter 221 00:11:13.840 --> 00:11:17.799 of Authorization and Connecting Facility Assignment, or LOA-CFA, that you 222 00:11:17.879 --> 00:11:20.159 use to authorize the data center technicians to make that 223 00:11:20.200 --> 00:11:23.279 physical link. It's a tangible piece of your cloud connection. 224 00:11:23.320 --> 00:11:26.320 A physical manifestation of the cloud. Okay, that's cool. So 225 00:11:26.480 --> 00:11:30.120 DX sounds robust. What if one link isn't enough bandwidth 226 00:11:30.279 --> 00:11:32.919 or you need more redundancy? And how do you actually 227 00:11:33.360 --> 00:11:36.440 get that physical pipe connected into your virtual network, your 228 00:11:36.399 --> 00:11:40.559 VPCs? Great questions. For more bandwidth or redundancy, AWS offers 229 00:11:40.600 --> 00:11:44.399 link aggregation groups, or LAGs. This is pretty neat. It 230 00:11:44.480 --> 00:11:47.799 lets you bundle multiple physical DX connections together, say four 231 00:11:48.039 --> 00:11:51.240 1 Gbps links, and treat them as a single logical 232 00:11:51.279 --> 00:11:55.799 connection with combined bandwidth, like 4 Gbps. It simplifies management too. 233 00:11:55.840 --> 00:11:58.240 Ah, like bonding network interfaces. 234 00:11:57.679 --> 00:12:00.720 Exactly like that. And to extend that physical connectivity into 235 00:12:00.759 --> 00:12:05.679 your actual AWS resources, you use virtual interfaces, or VIFs.
236 00:12:05.799 --> 00:12:09.200 VIFs essentially carve up that physical DX connection or LAG 237 00:12:09.240 --> 00:12:12.639 into logical pathways using VLAN tagging, standard 238 00:12:12.679 --> 00:12:15.720 802.1Q VLAN tags. This lets you run different types 239 00:12:15.759 --> 00:12:17.639 of network traffic over the same physical 240 00:12:17.320 --> 00:12:18.200 link. Different types? 241 00:12:18.279 --> 00:12:21.879 Yeah, we mainly distinguish between three types. Private VIFs, which 242 00:12:21.879 --> 00:12:24.759 are used to connect directly to your VPCs, usually via 243 00:12:24.799 --> 00:12:27.759 a component called a virtual private gateway, VGW, or more 244 00:12:27.799 --> 00:12:31.720 commonly now a transit gateway. Then there are public VIFs, 245 00:12:31.879 --> 00:12:34.960 which let you access public AWS services like S3 246 00:12:35.080 --> 00:12:39.240 or EC2 APIs over your private DX link, bypassing 247 00:12:39.279 --> 00:12:39.759 the Internet. 248 00:12:39.840 --> 00:12:43.480 Okay, private for VPCs, public for AWS services. What's the third? 249 00:12:43.600 --> 00:12:46.480 The third is the transit VIF. This one is specifically 250 00:12:46.480 --> 00:12:49.919 designed to connect your Direct Connect link to a transit gateway. 251 00:12:49.799 --> 00:12:53.240 Right, for that hub-and-spoke model with TGW. Precisely. 252 00:12:53.480 --> 00:12:56.480 And there's a crucial point here about transit VIFs and TGW. 253 00:12:56.720 --> 00:12:59.639 I've heard about something called hairpinning that can get really 254 00:12:59.639 --> 00:13:01.080 expensive if you're not careful. 255 00:13:01.159 --> 00:13:03.919 Ah, yes, hairpinning. You're absolutely right to bring that up. 256 00:13:03.919 --> 00:13:07.360 It's a potentially very costly mistake.
If you're using Transit 257 00:13:07.399 --> 00:13:09.919 Gateway with Direct Connect, it's critical that you use a 258 00:13:09.960 --> 00:13:14.039 single transit VIF per TGW connection to your on-premises site. 259 00:13:14.120 --> 00:13:17.600 Why just one? Because if you have multiple, or misconfigure routing, 260 00:13:17.840 --> 00:13:20.080 you can end up with hairpinning. This is where 261 00:13:20.120 --> 00:13:22.799 traffic comes in from your on-prem network via DX, 262 00:13:23.000 --> 00:13:25.480 goes to the TGW, maybe needs to get to another VPC, 263 00:13:25.919 --> 00:13:29.159 but instead of routing directly, the TGW routes it back 264 00:13:29.159 --> 00:13:31.720 out the same DX connection towards your on-prem router, 265 00:13:32.080 --> 00:13:34.480 only for your router to send it immediately back into 266 00:13:34.519 --> 00:13:37.519 AWS over DX again to reach the intended VPC. 267 00:13:37.799 --> 00:13:41.399 So it makes a U-turn back through your own network. 268 00:13:41.159 --> 00:13:44.600 Exactly, a totally unnecessary round trip out of AWS and 269 00:13:44.639 --> 00:13:47.519 back in. And since you pay for data egress 270 00:13:47.519 --> 00:13:51.799 from AWS, that double egress gets incredibly expensive really fast. 271 00:13:52.240 --> 00:13:55.480 A single transit VIF per TGW connection point, along with 272 00:13:55.559 --> 00:13:59.080 proper route propagation and filtering, prevents this costly detour. 273 00:13:59.279 --> 00:14:03.360 Wow.
Okay, definitely noted: avoid the hairpin. So last piece 274 00:14:03.399 --> 00:14:06.120 on connectivity. What if you need your VPCs to talk 275 00:14:06.279 --> 00:14:09.840 privately to services, could be AWS's own services or maybe 276 00:14:09.879 --> 00:14:13.000 third-party SaaS providers you use, but you don't want 277 00:14:13.000 --> 00:14:14.440 to go out to the Internet, and you don't want 278 00:14:14.440 --> 00:14:16.519 to route through NAT gateways if you can avoid it? 279 00:14:16.559 --> 00:14:17.399 What's the play there? 280 00:14:17.480 --> 00:14:20.320 That's the perfect use case for AWS PrivateLink. PrivateLink? 281 00:14:20.360 --> 00:14:24.080 PrivateLink uses a component called a VPC endpoint. 282 00:14:24.759 --> 00:14:28.399 It essentially creates a secure private connection directly from your 283 00:14:28.480 --> 00:14:32.240 VPC to the service. The service endpoint effectively gets a 284 00:14:32.279 --> 00:14:36.399 private IP address within your VPC's address range, making the 285 00:14:36.440 --> 00:14:39.639 external service appear as if it's running right there inside 286 00:14:39.639 --> 00:14:40.200 your network. 287 00:14:40.240 --> 00:14:43.679 Ah, so no Internet gateway, no NAT, no public IPs 288 00:14:43.720 --> 00:14:45.360 involved for that service connection. 289 00:14:45.519 --> 00:14:49.879 Correct. Traffic stays entirely within the AWS network backbone. It 290 00:14:50.000 --> 00:14:53.600 massively improves security by keeping sensitive data off the public Internet, 291 00:14:53.840 --> 00:14:57.039 and it simplifies your network architecture because you don't need 292 00:14:57.159 --> 00:15:00.399 complex firewall rules or NAT setups just to reach those 293 00:15:00.399 --> 00:15:03.279 services privately. It's very powerful for secure service consumption.
294 00:15:03.440 --> 00:15:05.399 Okay, that covers a lot of ground on how to 295 00:15:05.440 --> 00:15:08.440 connect things, but even with the best designs, things go wrong. 296 00:15:08.559 --> 00:15:11.159 Cloud networks, maybe even more than traditional ones, can have 297 00:15:11.200 --> 00:15:14.519 these elusive problems because so much is abstracted. How do 298 00:15:14.559 --> 00:15:18.360 you start peeling back those layers when, inevitably, something breaks? 299 00:15:18.399 --> 00:15:21.279 Let's talk about potential problems first. What kind of things 300 00:15:21.320 --> 00:15:23.320 typically bite you in cloud networking? 301 00:15:23.600 --> 00:15:26.519 Oh, there's a whole list. We definitely see IP address 302 00:15:26.519 --> 00:15:29.600 allocation issues pretty often, like a subnet just runs out 303 00:15:29.639 --> 00:15:33.840 of available IPs. IP exhaustion. Yeah, or worse, those overlapping 304 00:15:33.879 --> 00:15:37.639 CIDR ranges we talked about causing weird routing conflicts if 305 00:15:37.639 --> 00:15:40.759 someone tries to connect things that shouldn't be connected. Then 306 00:15:40.799 --> 00:15:44.519 there are route scale limitations. AWS services have limits on 307 00:15:44.519 --> 00:15:47.240 the number of routes they can handle. Exceed those and 308 00:15:47.320 --> 00:15:50.159 routes might just disappear, or BGP sessions with your on 309 00:15:50.279 --> 00:15:51.679 prem gear might tear down. 310 00:15:51.840 --> 00:15:53.480 Okay, limits again. What else? 311 00:15:53.759 --> 00:15:57.799 Packet size mismatches. This is a subtle one. Issues with 312 00:15:57.840 --> 00:16:02.600 maximum transmission unit, MTU, or maximum segment size, MSS, can 313 00:16:02.679 --> 00:16:06.799 cause fragmentation. This often doesn't look like a network-down problem, 314 00:16:06.879 --> 00:16:09.279 but it hits applications.
You might see really slow file 315 00:16:09.320 --> 00:16:13.399 transfers or some web apps timing out without obvious network errors. 316 00:16:13.240 --> 00:16:16.159 Right, because the network itself is passing packets, just fragmented 317 00:16:16.200 --> 00:16:18.240 ones the application struggles with. Exactly. 318 00:16:18.919 --> 00:16:21.240 Then we have the hard limits we discussed: bandwidth 319 00:16:21.320 --> 00:16:24.759 throughput limitations, which are usually pretty clear quotas, and those 320 00:16:24.799 --> 00:16:28.600 tricky PPS limitations causing those silent packet drops that are 321 00:16:28.600 --> 00:16:29.679 so hard to diagnose. 322 00:16:29.960 --> 00:16:30.639 Still scary. 323 00:16:30.759 --> 00:16:34.159 Yeah, and related to that, just general packet loss, maybe 324 00:16:34.279 --> 00:16:38.159 due to unreliable transit somewhere between regions, or maybe the 325 00:16:38.279 --> 00:16:42.039 end hosts themselves are just overwhelmed and dropping packets. And finally, 326 00:16:42.519 --> 00:16:47.240 never underestimate plain old security misconfiguration. A wrong rule in 327 00:16:47.279 --> 00:16:51.200 a security group or, more often, a network access control 328 00:16:51.240 --> 00:16:54.879 list, NACL, is a super frequent cause of "it just 329 00:16:54.960 --> 00:16:56.320 doesn't connect" problems. 330 00:16:56.440 --> 00:16:59.240 That's quite a list. Sounds like troubleshooting could be finding 331 00:16:59.279 --> 00:17:03.200 a needle in a haystack. What tools does AWS actually 332 00:17:03.200 --> 00:17:07.240 give you to get visibility, to see inside this sometimes 333 00:17:07.240 --> 00:17:08.160 opaque window? 334 00:17:08.519 --> 00:17:11.920 Well, the cornerstone of observability in AWS is definitely Amazon 335 00:17:11.960 --> 00:17:12.759 CloudWatch. 336 00:17:12.599 --> 00:17:15.000 CloudWatch, right. That's for metrics and logs for pretty 337 00:17:15.079 --> 00:17:16.079 much everything. Exactly.
338 00:17:16.079 --> 00:17:18.640 It's the central hub. You need to understand its core components. 339 00:17:18.640 --> 00:17:21.880 There are namespaces, which are basically containers for metrics 340 00:17:21.920 --> 00:17:25.039 from a specific service like EC2 or ELB. Then 341 00:17:25.079 --> 00:17:27.640 the metrics themselves. Those are the actual time series data 342 00:17:27.640 --> 00:17:31.359 points, like CPUUtilization or NetworkIn. Then you have dimensions, 343 00:17:31.519 --> 00:17:34.160 which are key-value pairs that help you filter and 344 00:17:34.200 --> 00:17:38.839 group metrics, like InstanceId or AutoScalingGroupName, and 345 00:17:38.920 --> 00:17:41.960 finally periods, which define the time interval over which the 346 00:17:42.039 --> 00:17:45.319 data is aggregated, like one minute or five minutes. 347 00:17:45.319 --> 00:17:49.160 CloudWatch is your main dashboard for performance, health, and setting alarms. 348 00:17:49.359 --> 00:17:52.079 So CloudWatch gives you the high-level metrics. But 349 00:17:52.160 --> 00:17:56.319 what about seeing the actual traffic flows, like which connections 350 00:17:56.319 --> 00:17:59.279 are being allowed or denied? That sounds more like VPC 351 00:17:59.400 --> 00:18:00.799 Flow Logs. Precisely. 352 00:18:01.119 --> 00:18:04.559 VPC Flow Logs give you metadata about the IP traffic flowing 353 00:18:04.599 --> 00:18:08.480 through your VPC. They capture information for each flow, like 354 00:18:08.599 --> 00:18:12.680 source and destination IP, ports, protocol, the number of packets 355 00:18:12.680 --> 00:18:15.640 and bytes, and crucially, the forwarding decision made by the 356 00:18:15.720 --> 00:18:18.839 VPC router: whether the traffic was accepted or rejected. 357 00:18:19.440 --> 00:18:22.200 That accept/reject status seems key for troubleshooting. 358 00:18:22.279 --> 00:18:25.079 It is, but remember, flow logs are not full packet captures.
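The flow-log fields just described can be sketched as a small parser. This assumes the default (version 2) space-separated record layout; the sample records, ENI ID, account number, and helper names below are invented for illustration and are not taken from the episode's sources.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line: str) -> dict:
    """Split one default-format flow log line into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines: list) -> list:
    """Return only the parsed records whose action is REJECT."""
    return [r for r in map(parse_flow_record, lines) if r["action"] == "REJECT"]


# Fabricated sample records, for illustration only.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.10 10.0.2.20 49152 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.10 10.0.2.20 49153 443 6 12 4200 1700000000 1700000060 ACCEPT OK",
]

for record in rejected(SAMPLE):
    print(record["srcaddr"], "->", record["dstaddr"], "port", record["dstport"])
```

Filtering on the action field like this is the kind of first pass that surfaces rejected flows; it says nothing about payloads, which flow logs do not record.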
359 00:18:25.119 --> 00:18:27.000 They don't show you the payload, but they give you 360 00:18:27.079 --> 00:18:30.079 really valuable insight into network-level decisions, and you can 361 00:18:30.119 --> 00:18:33.319 even set up custom formats for flow logs now. Custom formats? 362 00:18:33.319 --> 00:18:34.039 How would you use that? 363 00:18:34.200 --> 00:18:37.400 Well, for instance, you could include TCP flags in your logs. 364 00:18:37.880 --> 00:18:43.079 That might help you troubleshoot specific issues like TCP handshake problems. 365 00:18:43.119 --> 00:18:46.880 Are you seeing SYN packets but no SYN-ACKs back? 366 00:18:47.119 --> 00:18:49.799 Things like that. It lets you tailor the logs to 367 00:18:49.880 --> 00:18:51.079 the problem you're investigating. 368 00:18:51.440 --> 00:18:55.279 That's handy. Now, to make this concrete, the source material 369 00:18:55.359 --> 00:18:59.319 had this Trailcats troubleshooting example. Can you walk us through that? 370 00:18:59.359 --> 00:19:02.440 It seemed like a good illustration of using these tools systematically. 371 00:19:02.720 --> 00:19:05.640 Yeah, the Trailcats example is a classic. They had a website and 372 00:19:05.720 --> 00:19:09.440 it was having these mysterious connectivity problems between two of 373 00:19:09.480 --> 00:19:12.039 its back-end servers. So the first thing they did 374 00:19:12.400 --> 00:19:15.279 was enable VPC Flow Logs, but they did it at 375 00:19:15.279 --> 00:19:17.480 the ENI level for the servers involved. 376 00:19:17.519 --> 00:19:20.440 Okay, looking right at the servers' network interfaces. Right, and 377 00:19:20.400 --> 00:19:24.799 those logs showed nothing rejected, all ACCEPT. So the initial 378 00:19:24.839 --> 00:19:26.599 thought might be, okay, the network's fine, must be an 379 00:19:26.640 --> 00:19:27.839 application problem. 380 00:19:27.519 --> 00:19:29.480 A dead end, potentially. Exactly. 381 00:19:29.720 --> 00:19:31.759 But they didn't stop there.
They widened the scope. They 382 00:19:31.880 --> 00:19:34.279 enabled flow logs, but this time at the subnet 383 00:19:34.000 --> 00:19:37.519 level. Ah, one level up from the instance ENI. Yep, 384 00:19:37.680 --> 00:19:38.799 and boom. 385 00:19:39.039 --> 00:19:42.519 The subnet-level logs immediately showed rejected traffic between those 386 00:19:42.559 --> 00:19:43.240 two servers. 387 00:19:43.319 --> 00:19:44.400 So what did that point to? 388 00:19:44.759 --> 00:19:48.960 It pointed directly to a network access control list, or NACL. 389 00:19:49.680 --> 00:19:52.960 Because NACLs operate at the subnet boundary, they were blocking 390 00:19:52.960 --> 00:19:55.720 the traffic before it even got to the instance's 391 00:19:55.839 --> 00:19:58.720 ENI. The ENI-level logs never saw the rejected 392 00:19:58.759 --> 00:20:00.920 packets because they never reached the ENI. 393 00:20:01.480 --> 00:20:04.680 That's a brilliant example of how changing your observation point, 394 00:20:04.759 --> 00:20:07.759 widening the scope, is critical in cloud troubleshooting. 395 00:20:07.960 --> 00:20:09.880 Absolutely, you have to look at the different layers. 396 00:20:09.960 --> 00:20:13.599 Okay, so flow logs give metadata, accept/reject. But what 397 00:20:13.720 --> 00:20:16.400 if you do need to see the actual packet contents, 398 00:20:16.519 --> 00:20:19.200 like you suspect something weird in the payload or you 399 00:20:19.240 --> 00:20:22.480 need deep protocol analysis? Is there an equivalent to plugging 400 00:20:22.480 --> 00:20:25.279 in Wireshark via a SPAN port, like in a 401 00:20:25.279 --> 00:20:26.279