WEBVTT 1 00:00:00.040 --> 00:00:03.759 Okay, let's unpack this. You ever feel like you're just drowning 2 00:00:03.359 --> 00:00:07.360 in numbers? Definitely. Spreadsheets everywhere, complex reports. 3 00:00:07.000 --> 00:00:10.599 Exactly, graphs that maybe look cool but don't actually tell 4 00:00:10.599 --> 00:00:12.560 you much. It's so easy to get lost in all 5 00:00:12.599 --> 00:00:15.000 that data. Right. But imagine if you could just see 6 00:00:15.039 --> 00:00:18.160 through it all, if the numbers could instantly, like, paint 7 00:00:18.199 --> 00:00:19.039 a clear picture. 8 00:00:19.679 --> 00:00:22.399 That's the idea, isn't it? That's the real power of 9 00:00:22.480 --> 00:00:27.600 data visualization: turning that overwhelming information into something you can grasp, 10 00:00:27.879 --> 00:00:29.320 you know, quickly and really thoroughly. 11 00:00:29.359 --> 00:00:32.479 Yeah, moving beyond just tables and stats. Exactly. 12 00:00:32.560 --> 00:00:35.399 We're so used to looking at tables, maybe hearing about models, 13 00:00:35.759 --> 00:00:38.439 and those have their place for sure, but a good 14 00:00:38.560 --> 00:00:41.920 visualization, it can give you that immediate kind of gut 15 00:00:42.039 --> 00:00:45.679 level understanding, seeing the patterns, the relationships hiding in there. 16 00:00:45.759 --> 00:00:48.719 It's like reading a recipe versus actually seeing the finished dish. 17 00:00:48.759 --> 00:00:51.840 Like you said, yeah, precisely. And that brings us nicely 18 00:00:51.880 --> 00:00:55.399 to our source for this deep dive, the book Data 19 00:00:55.479 --> 00:00:58.000 Visualization: A Practical 20 00:00:57.520 --> 00:00:59.960 Introduction. Ah, yes, good one. 21 00:01:00.200 --> 00:01:02.759 And this isn't just, you know, a gallery of pretty charts. 22 00:01:02.799 --> 00:01:05.239 It's a really practical, hands-on guide. Mm-hmm. 
23 00:01:05.599 --> 00:01:10.040 It walks you through using R, the programming language, and 24 00:01:10.159 --> 00:01:13.959 this really flexible tool called ggplot2. Right. 25 00:01:14.000 --> 00:01:16.200 And what I found really insightful is how it focuses 26 00:01:16.239 --> 00:01:18.640 not just on the aesthetics, like, does it look 27 00:01:18.599 --> 00:01:20.599 nice? Although that matters. 28 00:01:20.200 --> 00:01:25.159 True, but more on how our brains actually process visual 29 00:01:25.200 --> 00:01:28.879 information, and designing charts that work with that process. 30 00:01:29.079 --> 00:01:32.480 That connection is key, isn't it, between how we see 31 00:01:32.920 --> 00:01:36.120 and what we understand. The book really emphasizes that the 32 00:01:36.200 --> 00:01:39.439 best visualizations are the ones that kind of tap into 33 00:01:39.439 --> 00:01:43.879 how we intuitively interpret things like size, color, position. Makes sense. 34 00:01:43.680 --> 00:01:46.680 Like making the data speak directly through what we see. Exactly. 35 00:01:46.760 --> 00:01:48.760 So our mission today really is to pull out the 36 00:01:48.840 --> 00:01:52.560 key insights from this book to help you, listening, become 37 00:01:52.560 --> 00:01:53.480 more data savvy. 38 00:01:53.599 --> 00:01:55.359 Yeah, give you the tools to not just make your 39 00:01:55.359 --> 00:01:57.480 own effective charts, but also to look at any graph 40 00:01:57.599 --> 00:02:00.400 you see and really understand what it's telling you, or 41 00:02:00.400 --> 00:02:01.519 maybe what it isn't telling you. 42 00:02:01.719 --> 00:02:04.959 Good point. We want to help you avoid those common 43 00:02:05.000 --> 00:02:08.639 pitfalls and just feel more confident navigating all this data. 44 00:02:09.080 --> 00:02:12.000 Okay, so where should we start? Maybe the big question: 45 00:02:13.360 --> 00:02:17.800 why even bother visualizing data? 
Why not just stick with 46 00:02:17.879 --> 00:02:18.439 the tables? 47 00:02:18.680 --> 00:02:21.479 Right. The book makes a really strong case for moving 48 00:02:21.520 --> 00:02:22.759 beyond just the numbers. 49 00:02:22.919 --> 00:02:26.960 It does. There's a great example from Jackman back in 50 00:02:27.039 --> 00:02:27.680 nineteen eighty. 51 00:02:27.759 --> 00:02:29.240 Oh yeah, the voter turnout one. 52 00:02:29.360 --> 00:02:31.879 That's the one. He was looking at voter turnout and 53 00:02:32.080 --> 00:02:36.879 income inequality across different countries. Okay. And the initial analysis, 54 00:02:36.960 --> 00:02:40.400 just crunching the numbers for eighteen countries, suggested a pretty 55 00:02:40.439 --> 00:02:41.080 strong link. 56 00:02:41.520 --> 00:02:42.719 Seems straightforward enough. 57 00:02:42.719 --> 00:02:45.000 But then he just plotted the data, a simple scatterplot, 58 00:02:45.680 --> 00:02:49.039 and bam, it was instantly clear that whole relationship is 59 00:02:49.039 --> 00:02:51.319 basically being driven by one single data 60 00:02:51.080 --> 00:02:52.159 point: South Africa. 61 00:02:52.479 --> 00:02:56.439 Wow, so one outlier was creating the entire trend? Pretty much. 62 00:02:56.639 --> 00:03:00.000 Now, you could find that eventually with more stats, sensitivity analysis, 63 00:03:00.159 --> 00:03:00.719 stuff like that. 64 00:03:00.800 --> 00:03:02.719 Sure, dig deep enough. But the visual, 65 00:03:02.520 --> 00:03:04.159 it made it obvious immediately. 66 00:03:04.400 --> 00:03:07.159 That's a powerful demonstration right there. It really is. 67 00:03:07.280 --> 00:03:07.520 Yeah. 68 00:03:07.560 --> 00:03:11.280 And there's another great illustration in the book, inspired by Anscombe's 69 00:03:10.879 --> 00:03:12.439 quartet. Ah, the classic. 70 00:03:12.719 --> 00:03:16.719 Yeah. So another researcher, Vanhove, created sixteen different data 71 00:03:16.719 --> 00:03:20.960 sets, and here's the kicker. 
Huh. Every single one had 72 00:03:20.960 --> 00:03:25.280 the exact same statistical correlation between x and y: r 73 00:03:25.800 --> 00:03:26.919 equals point 74 00:03:26.680 --> 00:03:31.319 six. Okay, point six seems like a decent positive relationship. 75 00:03:30.759 --> 00:03:33.159 Right, that's what the number tells you. But then you 76 00:03:33.280 --> 00:03:36.680 visualize them. You plot those sixteen data sets, and they 77 00:03:36.680 --> 00:03:39.800 look different, totally different. Some look like a nice 78 00:03:39.800 --> 00:03:42.840 cloud of points like you'd expect. Others have a crazy 79 00:03:42.919 --> 00:03:45.840 outlier point off the line. Some are clearly curved, some 80 00:03:45.879 --> 00:03:47.919 are just, like, two separate groups of dots. 81 00:03:48.039 --> 00:03:51.479 So the single number, point six, completely hid all that 82 00:03:51.599 --> 00:03:52.840 variation. Completely. 83 00:03:52.879 --> 00:03:56.240 The core insight there is just critical. Yeah, always, always 84 00:03:56.240 --> 00:03:58.919 look at a scatterplot of your correlations. Don't just trust 85 00:03:59.000 --> 00:03:59.400 the number. 86 00:03:59.520 --> 00:04:01.919 It really hammers home that point, doesn't it? A single 87 00:04:01.960 --> 00:04:05.240 statistic can mask wildly different realities in the data. You 88 00:04:05.280 --> 00:04:07.800 just have to see the shape, spot the weird 89 00:04:07.599 --> 00:04:08.960 stuff, understand the distribution. 90 00:04:09.120 --> 00:04:12.159 Yeah, but it's also important, as the book notes, not 91 00:04:12.199 --> 00:04:14.120 to just blindly trust the visual either. 92 00:04:14.319 --> 00:04:18.399 Right. Absolutely, visualizations have their own, what the book calls, 93 00:04:18.480 --> 00:04:21.360 rhetorical plausibility. They suggest things. They 94 00:04:21.199 --> 00:04:23.560 frame the data in a certain way. Exactly. 
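The "same number, different picture" point can be tried out with Anscombe's original quartet, which ships with base R as the data frame `anscombe`. This is a minimal sketch using that built-in data, not the sixteen Vanhove data sets described in the episode (those share r = 0.6, whereas Anscombe's four pairs share a correlation of roughly 0.82).

```r
# Anscombe's quartet is built into base R: columns x1..x4 and y1..y4.
data(anscombe)

# Correlation of each x/y pair; the four values are nearly identical.
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(cors, 3))

# Plot the four pairs side by side; the shapes differ sharply
# even though the correlations agree.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)
```

Only base R is used here, so it runs without installing anything.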
95 00:04:24.199 --> 00:04:25.959 Just because it's in a chart doesn't make it the 96 00:04:26.000 --> 00:04:28.480 absolute truth. We still need to think critically. 97 00:04:28.759 --> 00:04:32.279 Okay, so we're sold on why visualization is powerful. But 98 00:04:32.360 --> 00:04:34.639 what makes one good versus bad? Right. 99 00:04:34.959 --> 00:04:38.160 The book kind of breaks down the problems into three buckets. Okay. 100 00:04:38.439 --> 00:04:43.079 There's issues of just plain bad taste, aesthetics. Then there 101 00:04:43.120 --> 00:04:46.879 are substantive problems, like how the data itself is shown, 102 00:04:47.279 --> 00:04:51.240 and finally perceptual problems, how our brains interpret the visual. 103 00:04:51.439 --> 00:04:54.639 Let's start with bad taste. What falls into that category? 104 00:04:54.800 --> 00:04:56.839 This is where aesthetics really come in. Things that make 105 00:04:56.879 --> 00:05:01.639 a graph distracting, cluttered, hard to read, inconsistent design 106 00:05:01.399 --> 00:05:02.519 choices. What, just too much going on? 107 00:05:02.639 --> 00:05:05.279 Yeah, exactly. The book uses an example, figure one point 108 00:05:05.279 --> 00:05:08.399 four, of what it calls chartjunk. Classic term, chartjunk. 109 00:05:08.560 --> 00:05:09.720 Love it. What's an example? 110 00:05:09.959 --> 00:05:13.439 Oh, think, like, bars that are hard to distinguish, labels 111 00:05:13.439 --> 00:05:17.360 repeated everywhere pointlessly, maybe those fake 3D effects that 112 00:05:17.399 --> 00:05:22.160 add nothing. Oh, the worst. Or drop shadows? Yes, drop shadows, 113 00:05:22.199 --> 00:05:25.879 pointless textures. It's just visual clutter getting in the way 114 00:05:25.879 --> 00:05:26.759 of the actual data. 115 00:05:26.920 --> 00:05:29.519 So the idea is keep it clean? Pretty much. 116 00:05:29.759 --> 00:05:33.720 Less is often more. 
Every little line, every color should 117 00:05:33.800 --> 00:05:36.240 be there for a reason, helping the data speak. If 118 00:05:36.279 --> 00:05:38.319 it's not adding understanding, maybe it shouldn't be there. 119 00:05:38.600 --> 00:05:41.399 It reminds me of Edward Tufte's work. The book mentions him, right? 120 00:05:41.319 --> 00:05:44.519 Oh, yeah, Tufte's foundational. His concept of the data-to- 121 00:05:44.600 --> 00:05:45.319 ink ratio. 122 00:05:45.920 --> 00:05:49.000 Right, maximize the ink that shows data, minimize the rest. 123 00:05:48.839 --> 00:05:50.800 Exactly, get rid of the chartjunk. He also talked 124 00:05:50.839 --> 00:05:55.480 about graphical excellence: showing interesting data clearly, efficiently, telling the 125 00:05:55.480 --> 00:05:58.480 truth about it, getting the most ideas across with the 126 00:05:58.560 --> 00:05:59.480 least visual noise. 127 00:05:59.600 --> 00:06:04.759 Makes sense. Simplify, remove extra gridlines, pointless colors. 128 00:06:04.600 --> 00:06:08.040 Usually, yes. But the book throws in a really interesting 129 00:06:08.079 --> 00:06:13.560 curveball here. Some research, Bateman, Borkin, they found that sometimes 130 00:06:13.639 --> 00:06:17.560 those more visually embellished graphs, almost like mini infographics, yeah, 131 00:06:17.759 --> 00:06:20.399 they can actually be more memorable than the super simple, 132 00:06:20.439 --> 00:06:21.079 clean ones. 133 00:06:21.199 --> 00:06:25.319 Really? That's counterintuitive. More memorable even if they're harder to 134 00:06:25.360 --> 00:06:26.680 read initially? 135 00:06:26.199 --> 00:06:29.199 It seems so. Yeah, people might recall something visually unique 136 00:06:29.279 --> 00:06:30.920 or novel more easily later on. 137 00:06:31.199 --> 00:06:33.360 Huh. So there's a bit of a trade-off, maybe, 138 00:06:33.680 --> 00:06:36.160 between immediate clarity and long-term 
139 00:06:35.959 --> 00:06:40.720 recall? Potentially. But the key is, memorable doesn't automatically mean 140 00:06:41.199 --> 00:06:42.800 easy to interpret 141 00:06:42.480 --> 00:06:46.519 accurately. Right, which brings us to that third category of problems, 142 00:06:46.800 --> 00:06:48.759 perceptual issues. Exactly. 143 00:06:48.879 --> 00:06:51.720 This is where it gets really fascinating. Even a clean, 144 00:06:51.959 --> 00:06:56.879 well-designed graph can unintentionally mislead people just because of 145 00:06:56.879 --> 00:06:59.560 how our brains work. How so? Well, the book shows 146 00:06:59.560 --> 00:07:02.879 an example with stacked bar charts, trying to compare the 147 00:07:02.920 --> 00:07:05.800 size of the, say, middle segment across several different bars. 148 00:07:06.439 --> 00:07:08.439 It's surprisingly difficult for our eyes. 149 00:07:08.720 --> 00:07:11.560 Yeah, I can picture that. Your baseline keeps changing. Right. 150 00:07:11.800 --> 00:07:13.920 And there's another example with lines that look like they're 151 00:07:13.920 --> 00:07:17.399 converging, getting closer, just because of the aspect ratio, the 152 00:07:17.439 --> 00:07:20.279 shape of the plot, even if the underlying data shows 153 00:07:20.279 --> 00:07:21.279 they're staying parallel. 154 00:07:21.519 --> 00:07:24.560 Wow. So good taste isn't enough. You really need to 155 00:07:24.639 --> 00:07:25.639 understand perception. 156 00:07:25.959 --> 00:07:28.639 You absolutely do. And our perception isn't uniform, right? 157 00:07:28.759 --> 00:07:31.959 Like how we see color. Our ability to distinguish shades 158 00:07:32.600 --> 00:07:34.000 changes across the spectrum. 159 00:07:34.079 --> 00:07:36.000 And it depends on lightness too, doesn't it? 160 00:07:36.279 --> 00:07:39.519 Yeah, chroma depends on luminance. It gets complex. That's why 161 00:07:39.759 --> 00:07:43.839 the book really pushes for using perceptually uniform color palettes. 
162 00:07:44.000 --> 00:07:46.079 Perceptually uniform. Okay, what does that mean, 163 00:07:46.120 --> 00:07:49.839 exactly? Imagine a color ramp where each step up represents 164 00:07:49.879 --> 00:07:53.439 an equal increase in the data value. A perceptually uniform 165 00:07:53.519 --> 00:07:57.199 palette makes those steps look equally spaced in color intensity. 166 00:07:57.360 --> 00:08:00.360 Ah, so a non-uniform one might make some small 167 00:08:00.480 --> 00:08:03.680 data changes look huge visually, or vice versa. 168 00:08:03.920 --> 00:08:08.480 Exactly. It avoids accidentally emphasizing or de-emphasizing parts of 169 00:08:08.519 --> 00:08:11.360 the data just because of quirks in the color scale. 170 00:08:11.560 --> 00:08:14.199 Okay, so the book talks about different types of these palettes. 171 00:08:14.279 --> 00:08:18.480 Yeah, three main ones. First, sequential scales. Think low-to- 172 00:08:18.600 --> 00:08:22.720 high data, like income, or maybe temperature if it's all positive. 173 00:08:22.560 --> 00:08:25.120 Makes sense, like light blue to dark blue. 174 00:08:25.240 --> 00:08:28.120 Right. Then you have diverging scales. These are for data 175 00:08:28.160 --> 00:08:33.000 with a meaningful midpoint, like zero: temperature changes, maybe deviations 176 00:08:33.000 --> 00:08:34.200 from an average. 177 00:08:33.919 --> 00:08:36.360 Like that blue-to-red scale example, figure one point 178 00:08:36.399 --> 00:08:37.240 seventeen? That's a 179 00:08:37.240 --> 00:08:41.120 classic. Zero, or the midpoint, is usually a neutral color 180 00:08:41.320 --> 00:08:44.120 like white or light gray, and the extremes diverge to 181 00:08:44.159 --> 00:08:45.159 two different hues. 182 00:08:45.240 --> 00:08:45.600 Okay. 183 00:08:45.879 --> 00:08:50.080 And third type, qualitative palettes. These are for categorical data 184 00:08:50.080 --> 00:08:53.720 where there's no inherent order. 
Think countries, types of products, 185 00:08:53.879 --> 00:08:54.759 political parties. 186 00:08:54.799 --> 00:08:58.480 So the goal there is just distinct colors? Distinct, 187 00:08:58.120 --> 00:09:01.960 but also ideally with similar visual weight, so one category 188 00:09:02.000 --> 00:09:05.240 doesn't just pop out unintentionally. The bottom palette in that same 189 00:09:05.279 --> 00:09:07.320 figure, one point seventeen, is a good example. 190 00:09:07.399 --> 00:09:10.120 It's really about making sure the visual differences match the 191 00:09:10.200 --> 00:09:12.279 data differences accurately. Precisely. 192 00:09:12.600 --> 00:09:14.960 Using the wrong palette can really mess with interpretation. 193 00:09:15.320 --> 00:09:18.559 The book also mentions complexity overload, trying to map too 194 00:09:18.600 --> 00:09:19.399 many things at once. 195 00:09:19.679 --> 00:09:22.960 Yeah, like using size and shape and color and position 196 00:09:23.360 --> 00:09:26.000 all in one go. Unless the data has a really, 197 00:09:26.039 --> 00:09:29.840 really clear structure, it just becomes noise. Figure one point 198 00:09:29.919 --> 00:09:32.759 nineteen shows that well. Hard to track everything. 199 00:09:32.399 --> 00:09:34.919 Too much happening. And what about gestalt rules? 200 00:09:35.039 --> 00:09:38.759 Ah, yeah, that's about how our brains naturally look for patterns. 201 00:09:39.200 --> 00:09:42.360 We group things, we connect things. We see shapes even 202 00:09:42.200 --> 00:09:45.559 if they aren't really there sometimes. Like seeing faces in clouds? 203 00:09:45.759 --> 00:09:46.320 Kind of like that. 204 00:09:46.440 --> 00:09:49.960 Yeah. Figure one point eighteen shows seemingly random dots, 205 00:09:50.320 --> 00:09:52.559 but you can't help trying to see clusters or lines. 
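The three palette types described earlier (sequential, diverging, qualitative) can be sketched with base R's `hcl.colors()`, which builds palettes in a perceptually based color space. The palette names chosen here ("Blues 3", "Blue-Red", "Dark 3") and the helper `show_palette()` are illustrative assumptions, not choices named in the episode.

```r
# hcl.colors() is part of base R's grDevices (R >= 3.6.0).
sequential  <- hcl.colors(7, palette = "Blues 3")   # ordered, low-to-high data
diverging   <- hcl.colors(7, palette = "Blue-Red")  # neutral midpoint, two hues
qualitative <- hcl.colors(5, palette = "Dark 3")    # unordered categories

# Hypothetical helper: draw a palette as a strip of equal-width swatches.
show_palette <- function(cols, title) {
  barplot(rep(1, length(cols)), col = cols, border = NA,
          space = 0, axes = FALSE, main = title)
}

op <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
show_palette(sequential,  "Sequential")
show_palette(diverging,   "Diverging")
show_palette(qualitative, "Qualitative")
par(op)
```

In the qualitative strip, no single swatch should stand out much more than the others, which is the "similar visual weight" property mentioned above.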
206 00:09:53.360 --> 00:09:56.000 This is powerful if you use it right in visualization design, 207 00:09:56.080 --> 00:09:58.440 but it can also trick people into seeing patterns that 208 00:09:58.480 --> 00:09:59.519 are just random chance. 209 00:10:00.039 --> 00:10:03.120 So understanding perception is crucial, which leads to how we 210 00:10:03.159 --> 00:10:06.639 actually encode data visually. The book talks about Cleveland and 211 00:10:06.720 --> 00:10:08.600 McGill's research. Foundational stuff. 212 00:10:08.799 --> 00:10:11.519 Figure one point twenty-three summarizes it. They basically 213 00:10:11.559 --> 00:10:13.639 figured out what visual tasks we're best at 214 00:10:13.840 --> 00:10:16.360 perceptually. Okay, what's at the top? What are we best at? 215 00:10:16.440 --> 00:10:20.720 Judging position along a common scale. Think comparing bar heights 216 00:10:20.720 --> 00:10:23.559 in a standard bar chart. We're really accurate 217 00:10:23.200 --> 00:10:25.559 at that. Makes sense, they all start from zero, right? 218 00:10:25.960 --> 00:10:29.919 Then comes position on aligned but separate scales. Still 219 00:10:29.960 --> 00:10:34.279 pretty good. Then judging lengths, like line segments, but only 220 00:10:34.360 --> 00:10:35.919 if they share a common baseline. 221 00:10:36.080 --> 00:10:38.399 Hmm. Okay, and what are we worse at? 222 00:10:38.679 --> 00:10:42.320 Our accuracy drops off for judging lengths without a common baseline. 223 00:10:43.200 --> 00:10:46.360 Then things like angles, which is why pie charts can 224 00:10:46.440 --> 00:10:50.360 be problematic for comparison. And area and volume and color 225 00:10:50.399 --> 00:10:53.000 saturation or hue are further down the list. 226 00:10:53.120 --> 00:10:56.320 So this hierarchy should guide our choices. If you want 227 00:10:56.360 --> 00:10:59.039 people to compare values accurately, use 228 00:10:58.919 --> 00:11:02.080 position along the common scale. Bar charts are often great for that. 
229 00:11:02.120 --> 00:11:04.720 If you're showing trends, maybe line charts work well for 230 00:11:04.799 --> 00:11:07.559 judging slope or angle, though even that's not top tier. 231 00:11:07.759 --> 00:11:10.720 It really highlights why choosing the right chart type matters 232 00:11:10.720 --> 00:11:13.200 so much for effective communication. It's about how easily the 233 00:11:13.279 --> 00:11:16.399 viewer can decode the information. Exactly, and 234 00:11:16.320 --> 00:11:18.919 the book also stresses it's not just which channel you 235 00:11:19.000 --> 00:11:22.440 choose, like color or position, but how you implement 236 00:11:22.000 --> 00:11:24.960 it. Like using a good sequential palette for ordered data 237 00:11:25.120 --> 00:11:27.039 or distinct hues for categories. 238 00:11:27.120 --> 00:11:30.240 Precisely. The details of the implementation matter hugely. 239 00:11:30.440 --> 00:11:32.480 Okay, this is great theory, but the book is also 240 00:11:32.720 --> 00:11:36.159 very practical, right? It dives into using R and 241 00:11:36.279 --> 00:11:36.720 ggplot2. 242 00:11:36.960 --> 00:11:39.440 It does. It shifts gears into how you actually make 243 00:11:39.480 --> 00:11:41.240 these visualizations using code. 244 00:11:41.320 --> 00:11:44.159 Now, programming can sound a bit scary. The book suggests 245 00:11:44.159 --> 00:11:47.279 starting with something called R Markdown. Why is that helpful? 246 00:11:47.480 --> 00:11:51.279 R Markdown is fantastic for reproducibility. It lets you combine your code, 247 00:11:51.399 --> 00:11:54.600 your notes, and your output, the plots, the tables, all 248 00:11:54.639 --> 00:11:55.919 in one document. 249 00:11:55.600 --> 00:11:57.879 So you can see exactly how you got a result? Exactly. 250 00:11:58.039 --> 00:12:01.159 You write in plain text, embed chunks of R code. 
When you 251 00:12:01.200 --> 00:12:04.000 process the document, the code runs and the results get 252 00:12:04.000 --> 00:12:07.120 inserted right there. It's great for keeping track, sharing work, 253 00:12:07.519 --> 00:12:09.679 and avoiding that "how do I make this chart again?" 254 00:12:09.879 --> 00:12:13.240 problem. That sounds incredibly useful. And R itself? 255 00:12:13.480 --> 00:12:16.360 R is a super powerful language widely used in statistics 256 00:12:16.360 --> 00:12:19.440 and data science, and ggplot2 is this amazing 257 00:12:19.480 --> 00:12:22.159 package within R for visualization. 258 00:12:21.600 --> 00:12:24.320 Built on the grammar of graphics. What's that about? 259 00:12:24.559 --> 00:12:26.759 Think of it like a system for building graphs piece 260 00:12:26.799 --> 00:12:30.200 by piece. You start with your data, then you define 261 00:12:30.360 --> 00:12:34.440 aesthetic mappings, linking data variables to visual properties like x position, 262 00:12:34.840 --> 00:12:37.720 y position, color, size. 263 00:12:37.240 --> 00:12:38.919 Okay, mapping data to visuals. 264 00:12:39.080 --> 00:12:42.039 Then you choose geoms, the geometric objects like points, lines, 265 00:12:42.120 --> 00:12:44.720 bars that actually represent the data, and you layer these 266 00:12:44.720 --> 00:12:45.279 things together. 267 00:12:45.399 --> 00:12:47.440 So it's a structured way to think about building any 268 00:12:47.519 --> 00:12:47.759 kind of 269 00:12:47.759 --> 00:12:51.480 plot? Exactly. Very flexible, very powerful once you grasp the 270 00:12:51.480 --> 00:12:55.639 core ideas, developed by Leland Wilkinson, implemented in ggplot2 271 00:12:55.639 --> 00:12:56.919 by Hadley Wickham. 272 00:12:57.200 --> 00:13:00.799 And the book mentions the ecology of assistance is better 
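The grammar-of-graphics steps described above (data, then aesthetic mappings, then a geom layered on top) can be sketched in a few lines. This assumes the `ggplot2` and `gapminder` packages are installed; the `gapminder` data frame with its `gdpPercap` and `lifeExp` columns comes from the latter, and the object names `p` and `p_points` are illustrative.

```r
library(ggplot2)
library(gapminder)  # assumption: supplies the `gapminder` data frame

# Step 1: data plus aesthetic mappings. This says which variable drives
# which visual property, but draws nothing yet.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Step 2: add a geom, the geometric object that represents each row.
p_points <- p + geom_point()

# Printing the object renders the plot.
print(p_points)
```

Keeping the base object `p` separate makes it easy to try different geoms on the same mappings.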
273 00:13:00.399 --> 00:13:03.080 now. Yeah, basically meaning there's just so much help available 274 00:13:03.120 --> 00:13:07.919 online now. Websites like Stack Overflow, R communities, tutorials, blogs. 275 00:13:08.480 --> 00:13:10.559 It's much easier to get started and find answers when you 276 00:13:10.559 --> 00:13:11.559 get stuck than it used to 277 00:13:11.480 --> 00:13:14.879 be. That's encouraging. So to get started, the book says, 278 00:13:14.919 --> 00:13:17.120 install the tidyverse. Right. 279 00:13:17.399 --> 00:13:20.240 The tidyverse is a collection of R packages including 280 00:13:20.279 --> 00:13:23.279 ggplot2, dplyr for data manipulation, and others, 281 00:13:23.639 --> 00:13:26.600 all designed to work together really well. You install it 282 00:13:26.600 --> 00:13:29.840 in RStudio, usually with just install.packages 283 00:13:29.639 --> 00:13:32.879 ("tidyverse"). And the book suggests typing out the examples. 284 00:13:33.080 --> 00:13:35.720 Yeah, it's good advice. Actually typing the code helps it 285 00:13:35.759 --> 00:13:37.679 sink in much better than just copy-pasting. 286 00:13:37.799 --> 00:13:42.000 Good tip. And reassuringly, ggplot's defaults are pretty 287 00:13:41.679 --> 00:13:45.919 good? Generally, yes. The default settings for colors, themes, et 288 00:13:46.000 --> 00:13:49.240 cetera are thoughtfully chosen. You can often get a decent- 289 00:13:49.279 --> 00:13:52.840 looking, informative plot without much tweaking, which is great for beginners. 290 00:13:53.000 --> 00:13:56.519 Okay, let's get into those core ggplot concepts. First, 291 00:13:56.639 --> 00:13:59.679 aesthetic mappings using aes(). Break that down again. 292 00:13:59.799 --> 00:14:02.519 Right. aes() is where you tell ggplot which variables 293 00:14:02.519 --> 00:14:05.840 in your data control which visual property. 
So aes(x = 294 00:14:05.960 --> 00:14:10.159 gdpPercap, y = lifeExp, color = continent) means gdpPercap controls the 295 00:14:10.360 --> 00:14:13.679 x axis, lifeExp controls the y axis, and the 296 00:14:13.720 --> 00:14:15.279 continent column controls the color. 297 00:14:15.399 --> 00:14:17.799 Crucially, you're not saying which color, just what controls the 298 00:14:17.799 --> 00:14:18.639 color. Exactly. 299 00:14:18.720 --> 00:14:22.200 ggplot handles assigning the actual colors, positions, et cetera 300 00:14:22.519 --> 00:14:23.639 based on the data values. 301 00:14:23.720 --> 00:14:25.600 Okay, then geoms. Geoms 302 00:14:25.240 --> 00:14:29.279 are the visual markers. geom_point() makes a scatterplot, geom_line() draws lines, 303 00:14:29.600 --> 00:14:33.320 geom_bar() makes bar charts, geom_smooth() adds a smooth trend line. 304 00:14:33.399 --> 00:14:35.159 You add them to your plot with a plus sign. 305 00:14:35.399 --> 00:14:38.279 So ggplot() sets up the canvas and mappings. Then you 306 00:14:38.360 --> 00:14:40.960 add + geom_point() or + geom_bar(). 307 00:14:41.080 --> 00:14:43.240 You got it. You build plots layer by layer. 308 00:14:43.399 --> 00:14:45.679 And the importance of tidy data? Ah, 309 00:14:45.759 --> 00:14:48.039 yes. Tidy data is a way of structuring your data 310 00:14:48.080 --> 00:14:51.600 set that ggplot and the tidyverse really prefer. Basically, 311 00:14:52.200 --> 00:14:55.399 each variable gets its own column, each observation gets its 312 00:14:55.279 --> 00:14:58.840 own row. Like a long format, not wide? Exactly. 313 00:14:58.919 --> 00:15:01.279 It might seem like a small detail, but organizing your 314 00:15:01.320 --> 00:15:04.360 data this way makes working with ggplot much, much smoother 315 00:15:04.440 --> 00:15:05.200 and more intuitive. 316 00:15:05.240 --> 00:15:08.080 Got it. And this idea of inheritance of mappings? 
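The wide-versus-long distinction behind tidy data can be illustrated with `tidyr::pivot_longer()`. This is a sketch assuming the `tidyr` package is installed; the tiny `wide` table and its values are made up purely for illustration and do not come from the book or the gapminder data.

```r
library(tidyr)

# A "wide" table: one column per year. Values are invented for illustration.
wide <- data.frame(
  country = c("A", "B"),
  `2002`  = c(70.1, 65.3),
  `2007`  = c(71.4, 66.8),
  check.names = FALSE  # keep the year column names as-is
)

# Reshape to "long" (tidy) form: each variable gets its own column
# (country, year, lifeExp) and each observation gets its own row.
long <- pivot_longer(wide,
                     cols      = c(`2002`, `2007`),
                     names_to  = "year",
                     values_to = "lifeExp")
print(long)
```

In the long form, `year` is an ordinary column, so it can be mapped inside `aes()` like any other variable.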
317 00:15:08.279 --> 00:15:10.519 That just means if you define mappings in the main 318 00:15:10.639 --> 00:15:15.039 ggplot() call, like ggplot(data = gapminder, aes(x = 319 00:15:15.240 --> 00:15:18.000 gdpPercap, y = lifeExp)), any geoms you add 320 00:15:18.039 --> 00:15:20.799 later, like + geom_point() or + geom_smooth(), will 321 00:15:20.840 --> 00:15:23.279 automatically use those x and y mappings. 322 00:15:22.960 --> 00:15:25.600 Unless you override them specifically in the geom? Right. 323 00:15:25.720 --> 00:15:28.240 You can give a geom its own aes() mapping if needed, 324 00:15:28.480 --> 00:15:31.080 but inheritance saves a lot of typing for common mappings. 325 00:15:31.200 --> 00:15:33.559 Okay, let's run through some practical plot examples from the 326 00:15:33.600 --> 00:15:37.919 book. Basic scatterplot: life expectancy versus GDP per capita, using 327 00:15:37.919 --> 00:15:39.480 the gapminder data. Yep. 328 00:15:39.799 --> 00:15:43.440 That would be ggplot(data = gapminder, mapping = aes(x = 329 00:15:43.639 --> 00:15:46.360 gdpPercap, y = lifeExp)). That sets it up. Plus 330 00:15:46.360 --> 00:15:50.080 geom_point(). Boom, scatterplot. Simple enough. Add a smoother? 331 00:15:50.399 --> 00:15:52.840 Just add + geom_smooth() on the next line. ggplot 332 00:15:52.919 --> 00:15:56.279 adds a default trend line, usually with a confidence 333 00:15:56.320 --> 00:15:56.840 band around it. 334 00:15:56.960 --> 00:16:00.159 Nice. Now that GDP data is probably skewed, right? Lots of 335 00:16:00.159 --> 00:16:02.039 lower values, a few very high ones. 336 00:16:01.919 --> 00:16:05.399 Usually is. Makes the scatterplot bunch up on one side. 337 00:16:05.480 --> 00:16:08.519 So transforming the scale, like a log scale for the 338 00:16:08.720 --> 00:16:09.840 x axis? Good idea. 
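The scatterplot-plus-smoother example and the inheritance rule described above can be sketched as follows, again assuming the `ggplot2` and `gapminder` packages are installed. The object names are illustrative; the second plot shows a geom carrying its own mapping, which is one way to read "override them specifically in the geom".

```r
library(ggplot2)
library(gapminder)  # assumption: supplies the `gapminder` data frame

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Both geoms inherit x and y from the ggplot() call.
p_inherit <- p + geom_point() + geom_smooth()

# A geom can add its own mapping: only the points are colored by continent,
# so the smoother still fits a single line to all the data.
p_override <- p +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth()

print(p_inherit)
print(p_override)
```

Printing either plot makes `geom_smooth()` report which smoothing method it picked by default.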
339 00:16:09.960 --> 00:16:13.120 Yes, add another layer: + scale_x_log10(). That 340 00:16:13.200 --> 00:16:16.039 transforms the x axis to a base ten log scale, 341 00:16:16.200 --> 00:16:17.840 spreading the data out much better 342 00:16:17.720 --> 00:16:20.639 visually. Okay, and making it look more professional? Titles, axis 343 00:16:20.720 --> 00:16:21.440 labels? Use 344 00:16:21.360 --> 00:16:24.879 the labs() function. Add + labs(title = "My Plot Title", 345 00:16:25.320 --> 00:16:28.360 x = "GDP per capita", y = "Life expectancy"). 346 00:16:27.919 --> 00:16:30.240 Simple. And what if you want to format the axis 347 00:16:30.320 --> 00:16:32.639 labels, like showing dollars on the x axis? 348 00:16:32.759 --> 00:16:35.360 That's where the scales package comes in handy. You modify 349 00:16:35.399 --> 00:16:38.120 the scale function, maybe like + scale_x_log10( 350 00:16:38.360 --> 00:16:40.759 labels = scales::dollar). Gives you nice 351 00:16:40.759 --> 00:16:42.080 dollar formatting. Cool. 352 00:16:42.159 --> 00:16:45.879 Now, what about mapping categories, like coloring the points by continent? 353 00:16:46.000 --> 00:16:49.879 You add color = continent inside the aes() function, so aes( 354 00:16:50.159 --> 00:16:53.519 x = gdpPercap, y = lifeExp, color = 355 00:16:53.480 --> 00:16:56.120 continent). And ggplot handles the rest? Yep. 356 00:16:56.240 --> 00:16:58.559 It assigns a color to each continent and automatically adds 357 00:16:58.559 --> 00:17:02.200 a legend explaining the colors. If you also have geom_smooth(), 358 00:17:02.600 --> 00:17:05.640 you'll likely get a separate smooth line for each continent 359 00:17:05.839 --> 00:17:07.079 in its corresponding color. 360 00:17:07.279 --> 00:17:10.839 Okay, this brings up that crucial difference, mapping versus setting. 361 00:17:11.279 --> 00:17:12.759 Making all points purple, 
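Putting the pieces from this stretch together (log-scaled x axis, dollar labels, titles, and color mapped to continent) gives something like the sketch below. It assumes `ggplot2`, `gapminder`, and `scales` are installed; the title text is the placeholder used in the conversation, and the object name `p_full` is illustrative.

```r
library(ggplot2)
library(gapminder)  # assumption: supplies the `gapminder` data frame

p_full <- ggplot(data = gapminder,
                 mapping = aes(x = gdpPercap, y = lifeExp,
                               color = continent)) +
  geom_point() +
  geom_smooth() +                           # one smoother per continent
  scale_x_log10(labels = scales::dollar) +  # log10 axis with dollar labels
  labs(title = "My Plot Title",
       x = "GDP per capita",
       y = "Life expectancy")

print(p_full)
```

Because `color = continent` sits in the top-level `aes()`, both the points and the smoothers inherit it, which is why each continent gets its own line and the legend appears automatically.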
362 00:17:12.839 --> 00:17:16.160 for instance. Right. If you put color = "purple" inside aes(), 363 00:17:16.240 --> 00:17:19.599 ggplot treats "purple" as a data value. It 364 00:17:19.640 --> 00:17:22.240 gives all points the same default color and makes a 365 00:17:22.359 --> 00:17:23.640 useless legend entry 366 00:17:23.400 --> 00:17:25.119 for "purple". Because you mapped it to data. 367 00:17:25.359 --> 00:17:27.480 Exactly. If you just want to set all points to be purple, 368 00:17:27.519 --> 00:17:30.720 you put color = "purple" outside aes(), inside the geom function 369 00:17:30.759 --> 00:17:33.240 itself, like geom_point(color = "purple"). 370 00:17:33.359 --> 00:17:37.200 No mapping, just setting a fixed visual property, no legend needed. 371 00:17:37.160 --> 00:17:41.000 Precisely. Huge difference. Common point of confusion. Makes sense. 372 00:17:41.079 --> 00:17:43.799 Then there's faceting, splitting the plot into panels. 373 00:17:44.039 --> 00:17:47.720 Yes, super useful. facet_wrap() lets you split by one 374 00:17:47.759 --> 00:17:51.640 categorical variable, arranging panels in a grid. facet_grid() 375 00:17:51.880 --> 00:17:54.480 lets you split by two variables, creating a 2D 376 00:17:54.599 --> 00:17:55.960 grid of plots. Like 377 00:17:55.920 --> 00:17:59.160 that age versus children example, faceted by sex and 378 00:17:59.160 --> 00:18:03.039 race? Exactly. Lets you compare relationships across different groups really effectively. 379 00:18:03.200 --> 00:18:06.319 What about visualizing just one continuous variable? 380 00:18:06.640 --> 00:18:10.559 Histograms: geom_histogram(). You map your variable to x, like 381 00:18:10.680 --> 00:18:13.720 aes(x = area). It bins the data and shows 382 00:18:13.720 --> 00:18:16.359 counts as bars. You might need to adjust the binwidth 383 00:18:16.440 --> 00:18:18.720 argument to get a good view. 
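The mapping-versus-setting distinction, faceting, and the histogram described above can be sketched side by side. This assumes `ggplot2` and `gapminder` are installed; it uses the gapminder data rather than the data sets behind the book's own faceting and histogram examples, and the object names and the `binwidth` value are illustrative.

```r
library(ggplot2)
library(gapminder)  # assumption: supplies the `gapminder` data frame

base <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Setting: a fixed color goes outside aes(), inside the geom. No legend.
p_set <- base + geom_point(color = "purple")

# Mapping by mistake: inside aes(), "purple" is treated as a data value,
# so the points get a default palette color plus a one-entry legend.
p_mapped <- base + geom_point(aes(color = "purple"))

# Faceting: one panel per continent.
p_facet <- base + geom_point() + facet_wrap(~ continent)

# Histogram of a single continuous variable; binwidth is in data units
# (years of life expectancy here).
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 2)

print(p_set)
print(p_mapped)
print(p_facet)
print(p_hist)
```

Comparing `p_set` and `p_mapped` on screen makes the legend difference obvious.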
Or density plots: geom_density(). 384 00:18:18.759 --> 00:18:21.920 Similar idea, but gives you a smooth curve, an estimated 385 00:18:21.920 --> 00:18:25.559 distribution. Often a nice alternative or complement to histograms. 386 00:18:25.200 --> 00:18:27.039 And briefly, graph tables. 387 00:18:27.599 --> 00:18:30.640 geom_table allows embedding a small table right onto the plot. 388 00:18:30.720 --> 00:18:33.279 Can be handy for showing summary stats alongside the visual. 389 00:18:33.319 --> 00:18:37.000 Okay, crucial step: saving your masterpiece. How do we save plots? 390 00:18:37.440 --> 00:18:40.279 Easiest way is ggsave(). After you display your plot, 391 00:18:40.359 --> 00:18:43.119 just type ggsave("myplot.pdf") or ggsave 392 00:18:43.240 --> 00:18:46.920 ("myplot.png"). It saves the last plot by default. 393 00:18:46.680 --> 00:18:49.519 PDF versus PNG, you mentioned. Vector versus raster? 394 00:18:49.880 --> 00:18:53.519 Yeah. Vector formats like PDF or SVG are usually best. 395 00:18:53.720 --> 00:18:56.680 They store the plot as lines and shapes, so you 396 00:18:56.720 --> 00:19:00.720 can resize them infinitely without getting blurry. Good for publications. 397 00:19:00.880 --> 00:19:04.160 Raster formats like PNG or JPEG are pixel-based, so 398 00:19:04.200 --> 00:19:06.519 they can get blocky if you enlarge them too much. 399 00:19:06.640 --> 00:19:09.240 Right, use vector when you can, especially for line art 400 00:19:09.279 --> 00:19:10.519 like most plots, and 401 00:19:10.480 --> 00:19:12.880