WEBVTT 1 00:00:00.080 --> 00:00:04.719 Have you ever felt just completely buried in information? You know, 2 00:00:04.759 --> 00:00:07.679 you've got articles, research notes piling up, and you just 3 00:00:07.719 --> 00:00:08.919 want to get to the point. Oh, 4 00:00:08.960 --> 00:00:12.560 absolutely, it's that feeling of being swamped trying to find 5 00:00:12.599 --> 00:00:16.399 the real gems in, well, just a mountain of data. 6 00:00:15.919 --> 00:00:19.079 Exactly, finding those surprising facts, the stuff that really matters, 7 00:00:19.239 --> 00:00:20.519 without wading through everything. 8 00:00:20.920 --> 00:00:23.879 That's tough, it really is, and the sheer volume can 9 00:00:23.920 --> 00:00:26.120 actually hide the insights you're looking for. 10 00:00:26.359 --> 00:00:29.079 Right. Well, that's why today this deep dive is kind 11 00:00:29.079 --> 00:00:31.719 of your shortcut. We want to help you get genuinely 12 00:00:31.719 --> 00:00:35.640 well informed on a topic that honestly is fundamental to 13 00:00:35.679 --> 00:00:38.200 any good analysis: Python data cleaning. 14 00:00:39.039 --> 00:00:41.439 And our guide for this is the Python Data Cleaning 15 00:00:41.479 --> 00:00:44.479 Cookbook by Michael Walker. It came out from Packt Publishing 16 00:00:44.520 --> 00:00:47.840 back in twenty twenty. It's a really comprehensive resource. 17 00:00:47.479 --> 00:00:50.320 It is, so our mission today is basically to pull 18 00:00:50.359 --> 00:00:54.520 out the most important nuggets of knowledge from this cookbook. 19 00:00:54.600 --> 00:00:57.119 Yeah, we want to help you understand the modern techniques, 20 00:00:57.159 --> 00:00:59.719 the Python tools you need to spot and, you know, 21 00:01:00.079 --> 00:01:02.560 fix dirty data. Think of it like transforming that raw, 22 00:01:02.600 --> 00:01:03.679 messy stuff 23 00:01:03.640 --> 00:01:06.000 into something clear, something you can actually
24 00:01:05.760 --> 00:01:10.319 use. Precisely: clear, actionable insights. We'll try to surface some 25 00:01:10.400 --> 00:01:14.120 surprising facts too, maybe keep it, hopefully, entertaining along the way. 26 00:01:14.560 --> 00:01:18.840 Okay, let's jump in then. So data cleaning: the very 27 00:01:18.879 --> 00:01:22.560 first step often is just getting the data into Python, 28 00:01:22.959 --> 00:01:27.159 and that simple step, surprisingly, can be, well, tricky. It 29 00:01:27.200 --> 00:01:27.760 really can. 30 00:01:27.879 --> 00:01:30.560 It's fascinating how much variety there is right at the start. 31 00:01:30.599 --> 00:01:34.000 You think data is data, but how it's structured, or 32 00:01:34.040 --> 00:01:37.159 maybe not structured, sets you up for different challenges right away. 33 00:01:37.239 --> 00:01:40.519 Okay, so let's start with maybe the most common one, CSV 34 00:01:40.159 --> 00:01:45.640 files. Right, comma-separated values: basically raw text, commas splitting columns, 35 00:01:45.680 --> 00:01:48.280 new lines for rows. Simple concept. 36 00:01:47.920 --> 00:01:50.079 And pandas' read_csv is the tool for that. 37 00:01:50.319 --> 00:01:52.599 It is. But here's the first little catch: 38 00:01:52.680 --> 00:01:55.280 read_csv tries to be smart. It makes an educated guess 39 00:01:55.280 --> 00:01:56.200 about your data types. 40 00:01:56.400 --> 00:01:58.040 Oh, so it might guess wrong. 41 00:01:58.200 --> 00:02:00.719 It often does, or it might not get exactly what 42 00:02:00.760 --> 00:02:02.680 you need. You usually have to step in, maybe tell 43 00:02:02.680 --> 00:02:05.680 it the column names explicitly, or make sure it understands your 44 00:02:05.760 --> 00:02:07.959 dates are actually dates, not just strings of text. 45 00:02:08.240 --> 00:02:10.120 Gotcha. So you need to be specific.
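To make the point about being specific with read_csv concrete, here is a minimal sketch. The inline sample data and the column names (measuredate, stationid, avgtemp) are assumptions made up for illustration, not the cookbook's actual file; the idea is only to show explicit column names, date parsing, and a declared dtype instead of relying on the guess.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for a real CSV file on disk.
raw = io.StringIO(
    "2000-01-01,USS0010K01S,5.27\n"
    "2000-02-01,USS0010K01S,\n"
    "2000-03-01,CI000085406,18.04\n"
)

df = pd.read_csv(
    raw,
    header=None,                                    # the sample has no header row
    names=["measuredate", "stationid", "avgtemp"],  # explicit column names
    parse_dates=["measuredate"],                    # parse as datetimes, not text
    dtype={"stationid": "string"},                  # state the type instead of guessing
)

# Inspect what pandas ended up with for each column.
print(df.dtypes)
```

Passing the names and types up front tends to be less surprising than fixing a wrong guess afterwards, though which arguments you need depends on the file.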
46 00:02:10.240 --> 00:02:13.319 Yeah. Take the landtemps dataset, for example. It's like a 47 00:02:13.319 --> 00:02:16.439 one hundred thousand row sample from this big climate network. 48 00:02:16.719 --> 00:02:19.919 You load it and maybe you run isnull().sum(). 49 00:02:19.599 --> 00:02:22.479 Okay, and that shows you missing values? Exactly. 50 00:02:22.520 --> 00:02:25.080 It'll quickly show you, hey, you've got gaps in 51 00:02:25.199 --> 00:02:28.199 avgtemp or maybe the country column. And for something critical 52 00:02:28.280 --> 00:02:31.599 like average temperature, missing values aren't just counts, they're like 53 00:02:32.199 --> 00:02:33.800 potential analysis killers. 54 00:02:34.039 --> 00:02:36.199 You might drop those rows using dropna. 55 00:02:36.360 --> 00:02:40.280 You could, yeah. dropna(subset=['avgtemp'], inplace=True) would 56 00:02:40.280 --> 00:02:43.639 remove rows missing that crucial temperature. But, you know, dropping 57 00:02:43.680 --> 00:02:46.439 isn't always the best move. Sometimes filling those gaps, imputation, 58 00:02:47.039 --> 00:02:48.319 is better. Depends on the goal. 59 00:02:48.439 --> 00:02:49.439 Right, it's a judgment call. 60 00:02:49.599 --> 00:02:52.159 It is. Oh, and a neat little thing about read_csv: 61 00:02:52.400 --> 00:02:57.319 it can often handle zipped CSV files directly. Saves you 62 00:02:57.360 --> 00:02:59.000 an unzip step. Handy. 63 00:02:59.240 --> 00:03:03.439 Okay. So CSVs have their quirks. What about Excel files? 64 00:03:03.479 --> 00:03:06.599 I feel like everyone has horror stories about messy spreadsheets. 65 00:03:06.639 --> 00:03:09.280 Oh, Excel. Yeah, they bring a whole different set of, 66 00:03:09.439 --> 00:03:14.599 let's call them, features. People use Excel in very flexible ways. 67 00:03:14.840 --> 00:03:16.319 Flexible is a polite word for it. 68 00:03:16.560 --> 00:03:19.599 Huh.
Right. So you often find extra rows at the top, 69 00:03:19.800 --> 00:03:22.840 like report titles, or maybe summary rows at the bottom, 70 00:03:23.120 --> 00:03:26.240 blank columns used for spacing. It looks fine to 71 00:03:26.240 --> 00:03:27.840 a human but confuses 72 00:03:27.400 --> 00:03:28.439 the code. Exactly. 73 00:03:28.719 --> 00:03:31.400 So with pandas' read_excel, you use arguments like 74 00:03:31.479 --> 00:03:35.639 skiprows, skipfooter, usecols to basically tell pandas, okay, 75 00:03:35.759 --> 00:03:38.599 ignore that stuff. Just grab this block of cells. You're 76 00:03:38.639 --> 00:03:40.199 targeting the actual data 77 00:03:40.000 --> 00:03:42.840 table. Makes sense, you're zeroing in. And 78 00:03:42.800 --> 00:03:46.479 another common Excel thing: people use symbols like A or 79 00:03:46.479 --> 00:03:48.680 maybe NA to show missing data. 80 00:03:48.759 --> 00:03:51.240 Right, not blank cells, but actual text symbols. 81 00:03:51.439 --> 00:03:54.560 Yeah, Python reads those as text, as object type. So 82 00:03:54.560 --> 00:03:56.719 if you try to do math, it breaks. The key move 83 00:03:56.759 --> 00:03:59.120 here is pd.to_numeric with the errors 84 00:03:58.919 --> 00:04:01.439 coerce argument. errors='coerce'? What does that do? 85 00:04:01.560 --> 00:04:04.000 It tells pandas: try to make this column numeric. If 86 00:04:04.039 --> 00:04:06.400 you find anything you can't convert, like, wow, just turn 87 00:04:06.439 --> 00:04:08.360 it into NaN. Ah, NaN. 88 00:04:08.759 --> 00:04:11.759 Not a number, pandas' way of saying missing numeric value. 89 00:04:11.879 --> 00:04:15.199 Precisely. Without that step, your numbers are stuck as text. 90 00:04:15.479 --> 00:04:18.000 And oh, watch out for extra spaces too. Like in 91 00:04:18.040 --> 00:04:21.959 that OECD GDP data example, you might have spaces before 92 00:04:22.160 --> 00:04:25.879 or after values.
Always use .str.strip() to 93 00:04:25.959 --> 00:04:28.160 clean those up before you analyze or merge. 94 00:04:28.319 --> 00:04:32.000 So many little traps. Okay, CSVs, Excel. What about 95 00:04:32.120 --> 00:04:36.720 pulling data from proper databases, like SQL databases? Surely that's cleaner. 96 00:04:36.920 --> 00:04:40.759 Generally, yes. Data from enterprise systems, SQL databases, tends to 97 00:04:40.800 --> 00:04:44.480 be more structured, but the logic isn't always obvious from 98 00:04:44.480 --> 00:04:45.120 the data alone. 99 00:04:45.199 --> 00:04:45.639 What do you mean? 100 00:04:45.759 --> 00:04:48.920 Well, you might find really complex coding schemes, like three 101 00:04:49.079 --> 00:04:52.199 means mother has secondary education, or they might use special 102 00:04:52.279 --> 00:04:54.839 numbers like 99999 to mean missing 103 00:04:54.959 --> 00:04:57.240 or not applicable. It makes sense in the database, but 104 00:04:57.279 --> 00:04:57.879 not when you 105 00:04:57.800 --> 00:04:59.560 just pull the raw number. Right, so the context is 106 00:04:59.600 --> 00:05:00.319 missing. Exactly. 107 00:05:00.360 --> 00:05:03.759 So you use tools like pymssql or the MySQL APIs with 108 00:05:03.920 --> 00:05:06.360 pd.read_sql to pull the data, and you can 109 00:05:06.399 --> 00:05:09.360 actually do some initial cleanup in the SQL query itself, like 110 00:05:09.360 --> 00:05:12.120 renaming columns or filtering rows right at the source. 111 00:05:12.439 --> 00:05:16.360 Yeah, using the SELECT statement. Then once it's in pandas, 112 00:05:16.439 --> 00:05:19.279 a really good technique is to replace those codes like 113 00:05:19.360 --> 00:05:22.199 three with meaningful labels like secondary ed. 114 00:05:22.360 --> 00:05:24.480 Makes the data much easier to understand. Totally.
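The cleanup moves mentioned so far (stripping stray spaces, coercing text to numbers, and swapping codes for labels) can be sketched together. The tiny data frame, its column names, and the label dictionary below are invented for illustration; only the pandas calls themselves are standard.

```python
import pandas as pd

# Made-up rows: padded text, a text marker for a missing number,
# a coded education column with a 99999 "missing" sentinel.
df = pd.DataFrame({
    "region": [" North", "South ", " East "],
    "gdp": ["41200", "NA", "38750"],
    "mothereducation": [3, 1, 99999],
})

# Remove leading and trailing whitespace before analysis or merging.
df["region"] = df["region"].str.strip()

# Anything that cannot be converted to a number becomes NaN.
df["gdp"] = pd.to_numeric(df["gdp"], errors="coerce")

# Hypothetical code-to-label mapping; codes not listed here
# (including the 99999 sentinel) come back as NaN from map().
edu_labels = {1: "primary ed", 3: "secondary ed"}
df["mothereducation"] = df["mothereducation"].map(edu_labels)

print(df)
```

Using map() this way quietly turns every unlisted code into a missing value, which is convenient for a sentinel but worth checking with a frequency count so that a legitimate code is not lost by accident.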
115 00:05:24.879 --> 00:05:28.360 And then, this is key for efficiency, convert that column 116 00:05:28.399 --> 00:05:29.839 to a category data type. 117 00:05:29.920 --> 00:05:30.600 Why category? 118 00:05:30.639 --> 00:05:32.480 It saves a ton of memory, especially if you have 119 00:05:32.560 --> 00:05:35.920 text labels repeated many times. The student math dataset 120 00:05:35.920 --> 00:05:39.959 example shows this clearly with memory_usage(index=False). It's 121 00:05:40.000 --> 00:05:41.959 not just about clarity, it's about performance 122 00:05:42.040 --> 00:05:45.079 with bigger data. Memory savings are always good. Okay. What 123 00:05:45.120 --> 00:05:50.519 about data from statistical software: SPSS, Stata, SAS, R? 124 00:05:50.519 --> 00:05:53.160 Yeah, those have their own formats too. Libraries like 125 00:05:53.240 --> 00:05:56.800 pyreadstat for SPSS, Stata, SAS and pyreadr for R 126 00:05:56.879 --> 00:05:57.800 data files are 127 00:05:57.720 --> 00:05:59.680 what you'd use. And their own quirks, I assume. 128 00:06:00.519 --> 00:06:03.639 One thing is these files often have metadata with column 129 00:06:03.720 --> 00:06:06.600 labels that are way more descriptive than the short, sometimes 130 00:06:06.639 --> 00:06:08.040 cryptic variable names. 131 00:06:08.120 --> 00:06:09.319 So you want to use the labels. 132 00:06:09.439 --> 00:06:11.839 Definitely prefer the labels, but then you still need to 133 00:06:11.839 --> 00:06:15.240 clean those up: make them lowercase, replace spaces with underscores, 134 00:06:15.600 --> 00:06:20.240 remove weird characters, make them usable as variable names in Python. 135 00:06:19.959 --> 00:06:22.079 Okay, standardizing the names. Right. 136 00:06:22.560 --> 00:06:27.959 And another big one for stats packages: logical missing values.
Stata, 137 00:06:28.040 --> 00:06:30.480 for instance, might use codes like negative five point zero, 138 00:06:30.519 --> 00:06:32.199 negative four point zero, and negative one 139 00:06:32.199 --> 00:06:36.519 point zero. These aren't errors, they're codes meaning refused, don't know, 140 00:06:37.120 --> 00:06:37.879 not applicable. 141 00:06:38.160 --> 00:06:40.279 Ah. So they look like numbers, but they aren't really 142 00:06:40.319 --> 00:06:41.319 data points 143 00:06:40.959 --> 00:06:45.000 for analysis. Exactly. You need to tell pandas to treat 144 00:06:45.000 --> 00:06:48.079 those specific numbers as missing, otherwise they'll mess up your 145 00:06:48.079 --> 00:06:49.480 calculations, like your averages. 146 00:06:49.600 --> 00:06:52.120 Got it. So much depends on understanding the source of 147 00:06:52.160 --> 00:06:52.600 the data. 148 00:06:52.639 --> 00:06:54.800 Absolutely. Context is everything in cleaning. 149 00:06:55.120 --> 00:06:57.000 Okay, so we've wrangled the data in, we've done some 150 00:06:57.040 --> 00:06:59.759 initial cleanup. Now the book talks about saving this clean 151 00:06:59.839 --> 00:07:03.160 data. Why is picking the right format important, and 152 00:07:03.199 --> 00:07:05.879 what are the trade-offs? It seems simple, but maybe 153 00:07:05.879 --> 00:07:06.160 it's not. 154 00:07:06.519 --> 00:07:10.160 That's a really important point. Persisting the data: why bother 155 00:07:10.279 --> 00:07:12.639 saving it in a new format? Well, maybe you want 156 00:07:12.639 --> 00:07:15.920 a clean snapshot before doing more complex stuff, or the 157 00:07:16.000 --> 00:07:18.639 data doesn't change much so you work off the clean version, 158 00:07:19.040 --> 00:07:20.959 or maybe you want the flexibility of JSON. 159 00:07:21.240 --> 00:07:26.040 Okay, those are good reasons, but what's overlooked? 160 00:07:25.680 --> 00:07:29.279 The trade-offs.
CSV is memory light, sure, but it's 161 00:07:29.319 --> 00:07:32.120 slow to write for big files, and crucially, it forgets 162 00:07:32.120 --> 00:07:35.680 your data types. All that work converting numbers, poof, they 163 00:07:35.759 --> 00:07:37.399 might become strings again when you reload. 164 00:07:37.439 --> 00:07:38.879 Oh, that's annoying. What about pickle? 165 00:07:39.079 --> 00:07:42.360 Pickle does remember data types, which is great, but creating 166 00:07:42.399 --> 00:07:45.800 those pickle files, the serialization process, can be heavy on memory 167 00:07:45.839 --> 00:07:49.800 and CPU. Might not be ideal on a resource-constrained system. 168 00:07:49.920 --> 00:07:50.879 Okay, and feather? 169 00:07:51.240 --> 00:07:54.000 Feather is generally faster and lighter than pickle, and it 170 00:07:54.040 --> 00:07:56.959 plays nice with R, which is cool for teams using both, 171 00:07:57.360 --> 00:07:59.959 but you often lose the data frame index, and its 172 00:08:00.079 --> 00:08:03.720 long-term support is maybe less certain than other formats. 173 00:08:03.759 --> 00:08:05.399 So there's no single perfect format. 174 00:08:05.680 --> 00:08:08.839 Not really, depends on the needs. But here's the big warning: 175 00:08:09.279 --> 00:08:11.720 when you save data, you separate it from the code 176 00:08:11.759 --> 00:08:14.519 that created it. It's super easy to forget later how 177 00:08:14.519 --> 00:08:17.079 a variable was calculated or cleaned. Right. 178 00:08:17.240 --> 00:08:18.879 The logic gets lost. Exactly. 179 00:08:19.000 --> 00:08:23.879 So the advice is: only persist your data at significant milestones, 180 00:08:23.920 --> 00:08:27.120 when you've reached a stable, well-understood point in your 181 00:08:27.120 --> 00:08:30.240 cleaning process. Treat it like saving a major version. 182 00:08:30.759 --> 00:08:34.000 That makes a lot of sense. Okay, milestone reached, data 183 00:08:34.080 --> 00:08:36.360 is imported.
What's the very, very first thing you do? 184 00:08:36.440 --> 00:08:39.200 That question everyone asks: so, how does it look? 185 00:08:39.360 --> 00:08:41.559 Yeah, that's the immediate next step, and you need a 186 00:08:41.679 --> 00:08:44.960 routine, a system. Even if you think you know the data, 187 00:08:45.039 --> 00:08:46.600 a new batch can always 188 00:08:46.240 --> 00:08:48.919 have surprises. So a quick diagnostic checkup? Exactly. 189 00:08:49.000 --> 00:08:51.480 You want to quickly grasp: what's the unit of analysis? 190 00:08:51.519 --> 00:08:54.720 How many rows? How many columns? What are the common categories? 191 00:08:54.840 --> 00:08:58.720 How are the numbers distributed? And critically, where are the 192 00:08:58.759 --> 00:09:02.960 missing values and potential outliers? It's about building that initial intuition. 193 00:09:03.159 --> 00:09:05.840 Building intuition. I like that. So what commands give you 194 00:09:05.879 --> 00:09:06.639 that first glance? 195 00:09:06.879 --> 00:09:11.840 DataFrame.shape for rows and columns, simple but fundamental. 196 00:09:11.840 --> 00:09:15.039 DataFrame.info() is gold. It shows data types and counts 197 00:09:15.159 --> 00:09:19.080 non-missing values per column, instant red flags for missing data. 198 00:09:19.440 --> 00:09:22.320 Okay, shape and info. What about seeing the actual data? 199 00:09:22.440 --> 00:09:25.320 DataFrame.head() for the first few rows, 200 00:09:25.360 --> 00:09:27.799 DataFrame.tail() for the last few, and DataFrame.sample() is 201 00:09:27.840 --> 00:09:30.679 great for a random peek. Use sample(random_state=1) 202 00:09:30.720 --> 00:09:34.120 if you want the same random sample each time, for reproducibility. 203 00:09:34.200 --> 00:09:36.120 Reproducibility is good. Always. 204 00:09:36.039 --> 00:09:38.919 And a key tip here: set a meaningful index if 205 00:09:38.919 --> 00:09:42.120 you have one, like a unique person ID.
It makes selecting 206 00:09:42.159 --> 00:09:45.360 specific rows later so much easier. It anchors your data. 207 00:09:45.399 --> 00:09:48.320 Good point. Okay, we've got the overview. How about focusing 208 00:09:48.320 --> 00:09:50.559 on columns, selecting and organizing them? 209 00:09:50.600 --> 00:09:53.480 Standard selection is easy with square brackets or using 210 00:09:53.600 --> 00:09:55.759 .loc and .iloc. But a real time saver is 211 00:09:55.799 --> 00:09:56.279 .filter. 212 00:09:56.440 --> 00:09:57.440 Like, how does filter work? 213 00:09:57.480 --> 00:09:59.480 It lets you select columns based on patterns in their 214 00:09:59.519 --> 00:10:02.039 names. Say you have columns like weeksworked00, 215 00:10:02.120 --> 00:10:04.120 weeksworked01, weeksworked02. You can grab them 216 00:10:04.159 --> 00:10:07.200 all with df.filter(like='weeksworked'). Super useful. 217 00:10:07.279 --> 00:10:10.080 Oh nice, saves typing them all out. Exactly. 218 00:10:10.440 --> 00:10:13.240 You can also select by data type. 219 00:10:13.240 --> 00:10:17.399 df.select_dtypes(include=['number']) gets all numeric columns, or include 220 00:10:17.399 --> 00:10:22.120 category for categoricals. And for just keeping things sane, group 221 00:10:22.200 --> 00:10:29.559 related columns together. Create lists of column names, like demographics (age, gender, location) and workforce (occupation, income), 222 00:10:30.000 --> 00:10:32.440 and then you can easily work with those logical groups. 223 00:10:32.600 --> 00:10:36.000 Keeps the analysis tidy. Now, what about selecting specific rows? 224 00:10:36.120 --> 00:10:38.759 You mentioned that issues often pop up when you look 225 00:10:38.799 --> 00:10:39.399 at subsets. 226 00:10:39.440 --> 00:10:42.440 Absolutely, this is where Boolean indexing comes in. You filter 227 00:10:42.559 --> 00:10:45.360 rows based on conditions.
For example, in that NLS dataset, 228 00:10:45.360 --> 00:10:48.639 nls97.nightlyhrssleep <= 4 would 229 00:10:48.679 --> 00:10:50.759 pull out everyone reporting very little 230 00:10:50.480 --> 00:10:53.240 sleep. So you can isolate specific groups easily. Yep. 231 00:10:53.720 --> 00:10:56.759 And you can combine conditions using & for AND and 232 00:10:56.879 --> 00:10:59.480 | for OR. Like sleep <= 4 and children >= 3 233 00:10:59.519 --> 00:11:02.360 finds people with little sleep and three or more kids. Lets 234 00:11:02.360 --> 00:11:03.240 you really zoom in. 235 00:11:03.440 --> 00:11:05.320 You can select rows and columns together? 236 00:11:05.480 --> 00:11:07.919 Yes, using .loc you can give it row conditions 237 00:11:07.960 --> 00:11:10.559 and column names in one go. Very powerful for getting 238 00:11:10.559 --> 00:11:12.120 exactly the slice of data you need. 239 00:11:12.399 --> 00:11:15.919 Okay, slicing and dicing. Now, I remember a researcher telling 240 00:11:15.960 --> 00:11:18.840 me once, ninety percent of what you'll find is in 241 00:11:18.879 --> 00:11:22.399 the frequency distributions. Why are frequencies so revealing? 242 00:11:22.480 --> 00:11:25.320 Yeah, that's a great quote, and it's often true, especially 243 00:11:25.360 --> 00:11:29.200 for categorical data. Frequencies, using Series.value_counts(), are 244 00:11:29.240 --> 00:11:32.000 your best friend. They immediately show you: what are the 245 00:11:32.039 --> 00:11:35.919 actual categories present? Are there typos, weird values, too many 246 00:11:35.960 --> 00:11:36.919 other responses? 247 00:11:37.159 --> 00:11:39.960 So it's like a reality check for your categories. 248 00:11:39.399 --> 00:11:43.320 Totally, and adding normalize=True gives you percentages, which helps 249 00:11:43.399 --> 00:11:46.360 understand the proportions.
You can even apply it to multiple 250 00:11:46.360 --> 00:11:49.360 columns at once using .apply, like checking all those 251 00:11:49.399 --> 00:11:53.720 government responsibility questions in the NLS data together. Efficient. Very. 252 00:11:54.000 --> 00:11:57.360 And remember that tip about converting text columns, object type, 253 00:11:57.360 --> 00:12:00.480 to category? It pays off here too. Makes these 254 00:12:00.480 --> 00:12:02.919 value_counts operations faster and more memory efficient. 255 00:12:02.960 --> 00:12:06.879 Good reminder. Okay, frequencies for categories. What about summarizing our 256 00:12:06.919 --> 00:12:08.480 continuous numeric variables? 257 00:12:08.600 --> 00:12:11.639 Before you analyze numbers, you need to understand their basic properties: 258 00:12:12.080 --> 00:12:19.559 central tendency, like mean or median; spread, standard deviation; and shape, skewness. 259 00:12:18.840 --> 00:12:21.480 And DataFrame.describe() is the go-to 260 00:12:21.559 --> 00:12:24.480 for that. It is. It gives you count, mean, standard deviation, 261 00:12:24.759 --> 00:12:28.000 min, max, and the quartiles: twenty-fifth, fiftieth, which is 262 00:12:28.039 --> 00:12:31.919 the median, and seventy-fifth percentile. A fantastic quick summary. 263 00:12:32.000 --> 00:12:34.559 It also mentions skewness and kurtosis. What do those tell us? 264 00:12:34.799 --> 00:12:38.000 Skewness tells you if the distribution is symmetric or lopsided. 265 00:12:38.600 --> 00:12:41.559 Kurtosis tells you about the tails. Are they fat, lots 266 00:12:41.559 --> 00:12:45.559 of extreme values, or thin? For instance, in the COVID data, 267 00:12:46.039 --> 00:12:49.919 total cases and total deaths were heavily skewed right, 268 00:12:50.200 --> 00:12:52.799 meaning the mean was much higher than the median.
That's 269 00:12:52.840 --> 00:12:55.279 a classic sign of outliers pulling the average up, a 270 00:12:55.279 --> 00:12:58.600 few countries with extremely high numbers. It immediately tells you 271 00:12:58.639 --> 00:13:00.279 the simple average might be misleading. 272 00:13:00.080 --> 00:13:04.080 Right, the mean is sensitive to extremes. And how 273 00:13:04.120 --> 00:13:06.039 do we visualize these distributions? 274 00:13:06.240 --> 00:13:08.799 Histograms are the first stop. plt.hist gives you a 275 00:13:08.840 --> 00:13:11.519 quick picture of the shape. Are there multiple peaks? Is 276 00:13:11.559 --> 00:13:14.039 it skewed? Are there values way off on their own? 277 00:13:14.159 --> 00:13:16.559 Okay. And QQ plots, what are they for? 278 00:13:17.039 --> 00:13:20.080 QQ plots, usually using statsmodels.api.qqplot, 279 00:13:20.200 --> 00:13:23.279 are more technical. They compare your data's distribution directly 280 00:13:23.279 --> 00:13:25.919 against a theoretical one, usually the normal distribution. 281 00:13:26.279 --> 00:13:27.240 How does that help? You 282 00:13:27.240 --> 00:13:31.320 plot your data's quantiles against the normal distribution's quantiles. If 283 00:13:31.320 --> 00:13:34.080 your data is normally distributed, the points will fall roughly 284 00:13:34.159 --> 00:13:37.519 on a straight diagonal line. Deviations from that line show 285 00:13:37.519 --> 00:13:41.559 you exactly how your data differs from normal. Maybe fatter tails, 286 00:13:41.600 --> 00:13:44.000 maybe skewness. It's a great diagnostic. Got it. 287 00:13:44.240 --> 00:13:47.840 And detecting outliers, that one point five times IQR rule? 288 00:13:47.679 --> 00:13:50.720 Yeah, that's a common rule of thumb. Calculate the interquartile 289 00:13:50.879 --> 00:13:53.679 range, IQR, which is the distance between the seventy-fifth 290 00:13:53.720 --> 00:13:57.559 and twenty-fifth percentiles.
Anything below Q1 minus one point 291 00:13:57.600 --> 00:14:01.039 five IQR or above Q3 plus one point five 292 00:14:01.120 --> 00:14:03.679 IQR is flagged as a potential outlier. 293 00:14:03.840 --> 00:14:07.440 Potential outlier, so not definitely wrong, but worth investigating. Exactly. 294 00:14:07.879 --> 00:14:11.440 A good practice is to output these potential outliers, maybe 295 00:14:11.440 --> 00:14:13.559 save them to a separate Excel file along with some 296 00:14:13.639 --> 00:14:16.320 related data, so you can examine them more closely. Are 297 00:14:16.320 --> 00:14:19.919 they data errors, or are they genuinely unusual but valid cases? 298 00:14:20.000 --> 00:14:22.799 Right, context matters again. Okay, we've got the basic checks done. 299 00:14:22.840 --> 00:14:26.320 But data isn't just about individual variables, it's about relationships. 300 00:14:26.320 --> 00:14:27.600 How do we start digging into those? 301 00:14:27.799 --> 00:14:30.919 This is where it gets really interesting, because sometimes issues, 302 00:14:31.039 --> 00:14:33.720 especially outliers, only really jump out when you look at 303 00:14:33.759 --> 00:14:35.600 two or more variables together. 304 00:14:35.440 --> 00:14:38.440 Like your example, a ten-year-old earning fifty million dollars. 305 00:14:38.720 --> 00:14:40.679 Each number might be okay on its own, but 306 00:14:40.720 --> 00:14:44.840 together? Exactly, that combination flags a huge issue. So how 307 00:14:44.879 --> 00:14:47.960 do we spot these? One way is using cross tabulations, 308 00:14:48.039 --> 00:14:51.200 but smartly. You can use pd.qcut to 309 00:14:51.320 --> 00:14:55.039 bin your continuous variables into quantiles, say very low to 310 00:14:55.279 --> 00:14:57.000 very high, based on ranges. 311 00:14:57.080 --> 00:14:59.679 Okay, so you categorize the continuous data. Right.
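The 1.5 times IQR rule of thumb described a moment ago can be written in a few lines. This is a sketch on invented numbers, not the cookbook's data; the series name total_cases is only a placeholder.

```python
import pandas as pd

# Invented values with one obviously extreme observation.
cases = pd.Series([12, 15, 14, 13, 16, 15, 14, 120], name="total_cases")

# Quartiles and the interquartile range.
q1 = cases.quantile(0.25)
q3 = cases.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside them is a potential outlier, not a proven error.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = cases[(cases < lower) | (cases > upper)]

print(lower, upper)
print(outliers)
```

The flagged rows are candidates to inspect alongside related columns; whether they are errors or valid extremes still needs the context discussed above.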
312 00:15:00.120 --> 00:15:02.519 Then you use pd.crosstab to see how 313 00:15:02.519 --> 00:15:06.159 the quantiles of two variables relate. In the COVID data example, 314 00:15:06.240 --> 00:15:09.440 you could cross-tab the total cases quantiles and the total deaths quantiles. 315 00:15:10.120 --> 00:15:12.600 That might show you countries like Qatar and Singapore in 316 00:15:12.639 --> 00:15:15.720 the very high cases bin, but only the medium deaths bin. 317 00:15:16.159 --> 00:15:17.879 That discrepancy jumps right out. 318 00:15:17.720 --> 00:15:19.600 A pattern that doesn't fit the general trend. 319 00:15:19.399 --> 00:15:23.080 Precisely. And visually, scatterplots are key. seaborn.regplot 320 00:15:23.120 --> 00:15:25.000 is great because it shows the points and fits a 321 00:15:25.000 --> 00:15:27.720 regression line. You can immediately see points that fall far 322 00:15:27.759 --> 00:15:30.080 from the line, potential bivariate outliers. 323 00:15:30.159 --> 00:15:32.919 What about identifying points that have a really big influence 324 00:15:32.919 --> 00:15:33.960 on that regression line? 325 00:15:34.279 --> 00:15:37.200 Ah, that's where statistical measures like Cook's distance come in. 326 00:15:37.559 --> 00:15:40.799 It basically measures how much the entire regression model changes 327 00:15:41.200 --> 00:15:43.639 if you remove a single specific data point. 328 00:15:43.840 --> 00:15:46.919 So a high Cook's distance means that point is really pulling 329 00:15:46.679 --> 00:15:50.600 the line. Exactly. It has high leverage or influence. Removing 330 00:15:50.639 --> 00:15:53.559 an outlier like Qatar in the COVID analysis, for instance, 331 00:15:53.639 --> 00:15:58.559 could significantly change the calculated relationship between, say, median age 332 00:15:58.600 --> 00:16:01.639 and cases per million. It tells you that single point 333 00:16:01.759 --> 00:16:03.440 is heavily impacting your conclusion.
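The quantile-binning and cross-tab idea from this stretch can be sketched as follows. The ten locations and their numbers are invented, and the column names (total_cases, total_deaths, cases_q, deaths_q) are placeholders rather than the cookbook's exact identifiers.

```python
import pandas as pd

# Invented data: location J has very high cases but only middling deaths.
df = pd.DataFrame({
    "location": list("ABCDEFGHIJ"),
    "total_cases": [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000],
    "total_deaths": [1, 2, 3, 4, 5, 6, 7, 8, 9, 5],
})

labels = ["very low", "low", "medium", "high", "very high"]

# Bin each continuous variable into five quantile-based categories.
df["cases_q"] = pd.qcut(df["total_cases"], q=5, labels=labels)
df["deaths_q"] = pd.qcut(df["total_deaths"], q=5, labels=labels)

# Cross-tabulate the two sets of bins; off-diagonal cells are worth a look.
table = pd.crosstab(df["cases_q"], df["deaths_q"])
print(table)

# Pull the rows behind one suspicious cell.
mismatch = df[(df["cases_q"] == "very high") & (df["deaths_q"] == "medium")]
print(mismatch)
```

Most rows land on the diagonal of the table, so the single row that combines a very high cases bin with a medium deaths bin stands out immediately.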
334 00:16:03.559 --> 00:16:06.600 That's powerful. What if we suspect outliers based on many 335 00:16:06.679 --> 00:16:08.480 variables at once, not just two? 336 00:16:08.840 --> 00:16:13.440 For that multivariate perspective, k-nearest neighbors, kNN, is a good approach, 337 00:16:13.519 --> 00:16:16.679 often using the PyOD library. PyOD? Yeah, it's a Python 338 00:16:16.720 --> 00:16:20.600 library specifically for outlier detection. It wraps algorithms like kNN 339 00:16:20.679 --> 00:16:23.759 from scikit-learn. The idea with kNN is to find 340 00:16:23.799 --> 00:16:27.519 points that are far away from their neighbors in multidimensional space. 341 00:16:27.320 --> 00:16:29.200 So points that don't fit in with any cluster. 342 00:16:29.519 --> 00:16:33.440 Kind of, yeah. But remember, for distance-based methods like kNN, 343 00:16:33.720 --> 00:16:37.720 you must standardize your data first, usually using z-scores. 344 00:16:38.159 --> 00:16:42.240 Otherwise variables with larger ranges will dominate the distance calculation. 345 00:16:42.399 --> 00:16:44.480 Right, put everything on the same scale. Exactly. 346 00:16:44.759 --> 00:16:48.120 Applying this to the COVID data might flag Singapore, Qatar, 347 00:16:48.360 --> 00:16:51.960 Hong Kong as outliers when considering both cases and deaths 348 00:16:51.960 --> 00:16:55.159 per million together, revealing multifaceted anomalies. 349 00:16:55.320 --> 00:16:58.559 Okay, outlier detection covered. Now let's get into the real 350 00:16:58.600 --> 00:17:01.120 workhorse stuff, manipulating the data itself. 351 00:17:01.279 --> 00:17:04.759 Series operations. Right, pandas Series, think of them as the 352 00:17:04.839 --> 00:17:08.039 columns in your data frame, are where most column-wise action happens. 353 00:17:08.279 --> 00:17:12.440 Basic access uses slicing, like my series, start to five, or .loc 354 00:17:12.559 --> 00:17:15.640 for label-based access, .iloc for position
355 00:17:15.359 --> 00:17:18.359 based. Standard indexing. What about changing values based on 356 00:17:18.319 --> 00:17:22.000 conditions? NumPy's where is incredibly useful for that. Basic if, 357 00:17:22.000 --> 00:17:24.759 then, else logic on a column: 358 00:17:24.920 --> 00:17:27.279 np.where(condition, value_if_true, value_if_false), like assigning high or low 359 00:17:27.319 --> 00:17:29.960 elevation based on a threshold. Simple and fast. 360 00:17:30.160 --> 00:17:33.079 But what if the logic is really complicated, involving multiple 361 00:17:33.119 --> 00:17:35.119 columns for each row? That's where 362 00:17:34.920 --> 00:17:38.200 you often need apply(axis=1) with a custom function, 363 00:17:38.359 --> 00:17:41.400 a UDF, user-defined function. You write a function that 364 00:17:41.440 --> 00:17:44.480 takes a row of data as input, applies your complex 365 00:17:44.519 --> 00:17:47.359 logic using values from different columns in that row, and 366 00:17:47.440 --> 00:17:49.359 returns a result for that row. Like 367 00:17:49.279 --> 00:17:53.400 the sleep-deprived reasons example, checking kids, wages, work hours 368 00:17:53.400 --> 00:17:54.960 for each person. Exactly. 369 00:17:55.400 --> 00:17:58.519 apply(axis=1) lets you handle that row-by-row 370 00:17:58.599 --> 00:18:02.440 custom logic, which is sometimes unavoidable for really specific business 371 00:18:02.480 --> 00:18:04.440 rules or derived variables. 372 00:18:04.519 --> 00:18:07.680 Okay, and text data? String cleaning must be common. 373 00:18:07.759 --> 00:18:11.279 Oh yeah, huge part of cleaning. pandas' .str accessor 374 00:18:11.319 --> 00:18:14.920 is your friend here: str.contains to find substrings, 375 00:18:14.960 --> 00:18:18.279 str.strip to remove leading and trailing white space, 376 00:18:18.319 --> 00:18:21.160 str.lower or str.upper for case changes, 377 00:18:21.160 --> 00:18:22.720 str.replace for substitutions.
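The three techniques just covered (np.where for simple if/else, apply with axis=1 for row-wise rules, and the .str accessor for text) can be seen side by side in a small sketch. The data frame, the 1000 threshold, and the sleep_reason function with its rules are assumptions invented for illustration, loosely echoing the examples mentioned in the conversation.

```python
import numpy as np
import pandas as pd

# Invented rows; column names are placeholders.
df = pd.DataFrame({
    "station": [" Alpine Ridge", "harbor point ", "Mesa Verde"],
    "elevation": [2100, 15, 980],
    "nightlyhrssleep": [4, 7, 5],
    "childathome": [3, 0, 1],
})

# Simple vectorized if/else on one column.
df["elevation_group"] = np.where(df["elevation"] > 1000, "High", "Low")


def sleep_reason(row):
    """Hypothetical row-wise rule that looks at several columns at once."""
    if row["nightlyhrssleep"] >= 6:
        return "enough sleep"
    if row["childathome"] >= 3:
        return "child rearing"
    return "other reasons"


# axis=1 passes each row to the function as a Series.
df["sleepdeprivedreason"] = df.apply(sleep_reason, axis=1)

# String cleanup through the .str accessor.
df["station"] = df["station"].str.strip().str.lower()

print(df)
```

np.where stays fast because it works on whole columns, while apply with axis=1 calls the Python function once per row, so it is usually kept for logic that cannot be expressed column-wise.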
378 00:18:22.839 --> 00:18:24.680 What about more complex patterns? That's 379 00:18:24.519 --> 00:18:27.400 where regular expressions come in, used with methods like 380 00:18:27.440 --> 00:18:30.079 str.findall or str.extract. If you need to 381 00:18:30.079 --> 00:18:32.920 pull out specific patterns like codes or numbers embedded in 382 00:18:33.000 --> 00:18:35.640 text, regex is the way to go. Cleaning up something 383 00:18:35.640 --> 00:18:37.599 like marital status, where you might have married with a 384 00:18:37.640 --> 00:18:40.200 trailing space, is a classic str.strip job. 385 00:18:40.440 --> 00:18:42.960 Dates too? Calculations like age? 386 00:18:42.680 --> 00:18:46.559 Definitely. First step is always converting date columns to proper 387 00:18:46.640 --> 00:18:50.079 datetime objects using pd.to_datetime. Once they 388 00:18:50.079 --> 00:18:53.160 are datetimes, you can easily fill missing dates with fillna, 389 00:18:53.519 --> 00:18:57.039 calculate time differences, like subtracting birth date from today's date 390 00:18:57.079 --> 00:19:00.119 to get age, or find intervals, like days since a specific 391 00:18:59.839 --> 00:19:04.359 event. And filling missing values, imputation? We mentioned dropping or 392 00:19:04.480 --> 00:19:07.039 using a simple fill. Are there smarter ways? 393 00:19:06.920 --> 00:19:09.359 Beyond fillna with a constant, you can impute 394 00:19:09.440 --> 00:19:12.799 with the overall mean or median. Better yet, use a 395 00:19:12.839 --> 00:19:15.799 group mean with groupby and transform('mean'). 396 00:19:15.880 --> 00:19:17.119 How does transform work there? 397 00:19:17.200 --> 00:19:20.160 It calculates the mean for each group, say the mean 398 00:19:20.200 --> 00:19:23.720 income for each occupation, and then broadcasts that group-specific 399 00:19:23.799 --> 00:19:26.319 mean back to fill the missing values within that group. 400 00:19:26.759 --> 00:19:30.039 More targeted than a global mean.
You can also use 401 00:19:30.160 --> 00:19:33.759