WEBVTT

1
00:00:00.080 --> 00:00:04.719
<v Speaker 1>Have you ever felt just completely buried in information? You know,

2
00:00:04.759 --> 00:00:07.679
<v Speaker 1>you've got articles, research notes piling up, and you just

3
00:00:07.719 --> 00:00:08.919
<v Speaker 1>want to get to the point.

4
00:00:08.960 --> 00:00:12.560
<v Speaker 2>Oh, absolutely. It's that feeling of being swamped, trying to find

5
00:00:12.599 --> 00:00:16.399
<v Speaker 2>the real gems in, well, just a mountain of data.

6
00:00:15.919 --> 00:00:19.079
<v Speaker 1>Exactly, finding those surprising facts, the stuff that really matters,

7
00:00:19.239 --> 00:00:20.519
<v Speaker 1>without wading through everything.

8
00:00:20.920 --> 00:00:23.879
<v Speaker 2>That's tough, it really is, and the sheer volume can

9
00:00:23.920 --> 00:00:26.120
<v Speaker 2>actually hide the insights you're looking for.

10
00:00:26.359 --> 00:00:29.079
<v Speaker 1>Right. Well, that's why today this deep dive is kind

11
00:00:29.079 --> 00:00:31.719
<v Speaker 1>of your shortcut. We want to help you get genuinely

12
00:00:31.719 --> 00:00:35.640
<v Speaker 1>well informed on a topic that honestly is fundamental to

13
00:00:35.679 --> 00:00:38.200
<v Speaker 1>any good analysis: Python data cleaning.

14
00:00:39.039 --> 00:00:41.439
<v Speaker 2>And our guide for this is the Python Data Cleaning

15
00:00:41.479 --> 00:00:44.479
<v Speaker 2>Cookbook by Michael Walker. It came out from Packt Publishing

16
00:00:44.520 --> 00:00:47.840
<v Speaker 2>back in twenty twenty. It's a really comprehensive resource.

17
00:00:47.479 --> 00:00:50.320
<v Speaker 1>It is, so our mission today is basically to pull

18
00:00:50.359 --> 00:00:54.520
<v Speaker 1>out the most important nuggets of knowledge from this cookbook.

19
00:00:54.600 --> 00:00:57.119
<v Speaker 2>Yeah, we want to help you understand the modern techniques,

20
00:00:57.159 --> 00:00:59.719
<v Speaker 2>the Python tools you need to spot and, you know,

21
00:01:00.079 --> 00:01:02.560
<v Speaker 2>fix dirty data. Think of it like transforming that raw,

22
00:01:02.600 --> 00:01:03.679
<v Speaker 2>messy stuff.

23
00:01:03.640 --> 00:01:06.000
<v Speaker 1>Into something clear, something you can actually use.

24
00:01:05.760 --> 00:01:10.319
<v Speaker 2>Precisely. Clear, actionable insights. We'll try to surface some

25
00:01:10.400 --> 00:01:14.120
<v Speaker 2>surprising facts too, maybe keep it hopefully entertaining along the way.

26
00:01:14.560 --> 00:01:18.840
<v Speaker 1>Okay, let's jump in then. So data cleaning, the very

27
00:01:18.879 --> 00:01:22.560
<v Speaker 1>first step often is just getting the data into Python,

28
00:01:22.959 --> 00:01:27.159
<v Speaker 1>and that simple step, surprisingly, can be, well, tricky. It

29
00:01:27.200 --> 00:01:27.760
<v Speaker 1>really can.

30
00:01:27.879 --> 00:01:30.560
<v Speaker 2>It's fascinating how much variety there is right at the start.

31
00:01:30.599 --> 00:01:34.000
<v Speaker 2>You think data is data, but how it's structured or

32
00:01:34.040 --> 00:01:37.159
<v Speaker 2>maybe not structured, sets you up for different challenges right away.

33
00:01:37.239 --> 00:01:40.519
<v Speaker 1>Okay, so let's start with maybe the most common one, CSV files.

34
00:01:40.159 --> 00:01:45.640
<v Speaker 2>Right, comma-separated values. Basically raw text, commas splitting columns,

35
00:01:45.680 --> 00:01:48.280
<v Speaker 2>newlines for rows. Simple concept.

36
00:01:47.920 --> 00:01:50.079
<v Speaker 1>And pandas read_csv is the tool for that.

37
00:01:50.319 --> 00:01:52.599
<v Speaker 2>It is. But here's the first little catch: read_csv

38
00:01:52.680 --> 00:01:55.280
<v Speaker 2>tries to be smart. It makes an educated guess

39
00:01:55.280 --> 00:01:56.200
<v Speaker 2>about your data types.

40
00:01:56.400 --> 00:01:58.040
<v Speaker 1>Oh, so it might guess wrong?

41
00:01:58.200 --> 00:02:00.719
<v Speaker 2>It often does, or it might not get exactly what

42
00:02:00.760 --> 00:02:02.680
<v Speaker 2>you need. You usually have to step in, maybe tell

43
00:02:02.680 --> 00:02:05.680
<v Speaker 2>it the column names explicitly, or make sure it understands your

44
00:02:05.760 --> 00:02:07.959
<v Speaker 2>dates are actually dates, not just strings of text.

45
00:02:08.240 --> 00:02:10.120
<v Speaker 1>Gotcha. So you need to be specific.

46
00:02:10.240 --> 00:02:13.319
<v Speaker 2>Yeah, take the landtemps dataset, for example. It's like

47
00:02:13.319 --> 00:02:16.439
<v Speaker 2>a 100,000-row sample from this big climate network.

48
00:02:16.719 --> 00:02:19.919
<v Speaker 2>You load it and maybe you run isnull().sum().

49
00:02:19.599 --> 00:02:22.479
<v Speaker 1>Okay, and that shows you missing values?

50
00:02:22.520 --> 00:02:25.080
<v Speaker 2>Exactly. It'll quickly show you, hey, you've got gaps in avgtemp

51
00:02:25.199 --> 00:02:28.199
<v Speaker 2>or maybe the country column. And for something critical

52
00:02:28.280 --> 00:02:31.599
<v Speaker 2>like average temperature, missing values aren't just counts, they're like

53
00:02:32.199 --> 00:02:33.800
<v Speaker 2>potential analysis killers.

54
00:02:34.039 --> 00:02:36.199
<v Speaker 1>You might drop those rows using dropna?

55
00:02:36.360 --> 00:02:40.280
<v Speaker 2>You could, yeah. dropna(subset=["avgtemp"], inplace=True) would

56
00:02:40.280 --> 00:02:43.639
<v Speaker 2>remove rows missing that crucial temperature. But you know, dropping

57
00:02:43.680 --> 00:02:46.439
<v Speaker 2>isn't always the best move. Sometimes filling those gaps, imputation,

58
00:02:47.039 --> 00:02:48.319
<v Speaker 2>is better. Depends on the goal.
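
NOTE
Editor's sketch of the CSV steps discussed above, not code from the episode or the book.
It assumes a tiny in-memory stand-in for the land temperatures file. The column names
(month, year, station, country, avgtemp) and all values are hypothetical.
```python
import io
import pandas as pd
# Hypothetical stand-in for the CSV file: no header row, two rows with gaps.
raw = io.StringIO(
    "1,2000,Cairo,Egypt,22.5\n"
    "2,2001,Oslo,Norway,\n"
    "3,2002,Lima,,18.1\n"
)
# Name the columns explicitly and pin the temperature column to float.
landtemps = pd.read_csv(
    raw,
    names=["month", "year", "station", "country", "avgtemp"],
    dtype={"avgtemp": "float64"},
)
# Build a real datetime column instead of leaving dates as separate numbers.
landtemps["measuredate"] = pd.to_datetime(landtemps[["year", "month"]].assign(day=1))
# Count missing values per column, then drop rows lacking the key measurement.
missing = landtemps.isnull().sum()
print(missing)
cleaned = landtemps.dropna(subset=["avgtemp"])
print(cleaned.shape)
```
Dropping is only one option here. Imputing avgtemp instead would keep the row count unchanged.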

59
00:02:48.439 --> 00:02:49.439
<v Speaker 1>Right, it's a judgment call.

60
00:02:49.599 --> 00:02:52.159
<v Speaker 2>It is. Oh, and a neat little thing about read_csv:

61
00:02:52.400 --> 00:02:57.319
<v Speaker 2>it can often handle zipped CSV files directly. Saves you

62
00:02:57.360 --> 00:02:59.000
<v Speaker 2>an unzip step. Handy.

63
00:02:59.240 --> 00:03:03.439
<v Speaker 1>Okay, so CSVs have their quirks. What about Excel files?

64
00:03:03.479 --> 00:03:06.599
<v Speaker 1>I feel like everyone has horror stories about messy spreadsheets.

65
00:03:06.639 --> 00:03:09.280
<v Speaker 2>Oh, Excel. Yeah, those bring a whole different set of

66
00:03:09.439 --> 00:03:14.599
<v Speaker 2>let's call them features. People use Excel in very flexible ways.

67
00:03:14.840 --> 00:03:16.319
<v Speaker 1>Flexible is a polite word for it.

68
00:03:16.560 --> 00:03:19.599
<v Speaker 2>Ha, right. So you often find extra rows at the top,

69
00:03:19.800 --> 00:03:22.840
<v Speaker 2>like report titles or maybe summary rows at the bottom,

70
00:03:23.120 --> 00:03:26.240
<v Speaker 2>blank columns used for spacing. It looks fine to

71
00:03:26.240 --> 00:03:27.840
<v Speaker 2>a human but confuses

72
00:03:27.400 --> 00:03:28.439
<v Speaker 1>the code.

73
00:03:28.719 --> 00:03:31.400
<v Speaker 2>Exactly. So with pandas.read_excel, you use arguments like

74
00:03:31.479 --> 00:03:35.639
<v Speaker 2>skiprows, skipfooter, and usecols to basically tell pandas, okay,

75
00:03:35.759 --> 00:03:38.599
<v Speaker 2>ignore that stuff, just grab this block of cells. You're

76
00:03:38.639 --> 00:03:40.199
<v Speaker 2>targeting the actual data table.

77
00:03:40.000 --> 00:03:42.840
<v Speaker 1>Makes sense, you're zeroing in.

78
00:03:42.800 --> 00:03:46.479
<v Speaker 2>Another common Excel thing: people use symbols like ".." or

79
00:03:46.479 --> 00:03:48.680
<v Speaker 2>maybe NA to show missing data.

80
00:03:48.759 --> 00:03:51.240
<v Speaker 1>Right, not blank cells, but actual text symbols.

81
00:03:51.439 --> 00:03:54.560
<v Speaker 2>Yeah, Python reads those as text, as object types. So

82
00:03:54.560 --> 00:03:56.719
<v Speaker 2>if you try to do math, it breaks. The key move

83
00:03:56.759 --> 00:03:59.120
<v Speaker 2>here is pd.to_numeric with the errors="coerce" argument.

84
00:03:58.919 --> 00:04:01.439
<v Speaker 1>errors="coerce"? What does that do?

85
00:04:01.560 --> 00:04:04.000
<v Speaker 2>It tells pandas: try to make this column numeric. If

86
00:04:04.039 --> 00:04:06.400
<v Speaker 2>you find anything you can't convert, like those symbols, just turn

87
00:04:06.439 --> 00:04:08.360
<v Speaker 2>it into NaN.

88
00:04:08.759 --> 00:04:11.759
<v Speaker 1>Ah, NaN. Not a number, pandas' way of saying missing numeric value.

89
00:04:11.879 --> 00:04:15.199
<v Speaker 2>Precisely. Without that step, your numbers are stuck as text.

90
00:04:15.479 --> 00:04:18.000
<v Speaker 2>And oh, watch out for extra spaces too. Like in

91
00:04:18.040 --> 00:04:21.959
<v Speaker 2>that OECD GDP data example, you might have spaces before

92
00:04:22.160 --> 00:04:25.879
<v Speaker 2>or after values. Always use .str.strip() to

93
00:04:25.959 --> 00:04:28.160
<v Speaker 2>clean those up before you analyze or merge.
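
NOTE
Editor's sketch of the two spreadsheet fixes just described: stripping stray spaces and
coercing text to numbers. The frame below is hypothetical, and ".." is only an assumed
placeholder for missing cells, so swap in whatever marker your file actually uses.
```python
import pandas as pd
# Hypothetical spreadsheet-style data: padded labels and a text marker for missing.
gdp = pd.DataFrame({
    "metro": ["  Vienna ", "Graz", " Linz"],
    "gdp2001": ["37783", "..", "42157"],
})
# Remove leading and trailing spaces so later merges and lookups match.
gdp["metro"] = gdp["metro"].str.strip()
# Convert to numeric; anything that cannot be parsed becomes NaN.
gdp["gdp2001"] = pd.to_numeric(gdp["gdp2001"], errors="coerce")
print(gdp)
print(gdp.dtypes)
```
After the conversion the column supports arithmetic, and the placeholder shows up as a proper missing value.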

94
00:04:28.319 --> 00:04:32.000
<v Speaker 1>So many little traps. Okay, CSVs, Excel. What about

95
00:04:32.120 --> 00:04:36.720
<v Speaker 1>pulling data from proper databases, like SQL databases? Surely that's cleaner.

96
00:04:36.920 --> 00:04:40.759
<v Speaker 2>Generally, yes. Data from enterprise systems, SQL databases, tends to

97
00:04:40.800 --> 00:04:44.480
<v Speaker 2>be more structured, but the logic isn't always obvious from

98
00:04:44.480 --> 00:04:45.120
<v Speaker 2>the data alone.

99
00:04:45.199 --> 00:04:45.639
<v Speaker 1>What do you mean?

100
00:04:45.759 --> 00:04:48.920
<v Speaker 2>Well, you might find really complex coding schemes like three

101
00:04:49.079 --> 00:04:52.199
<v Speaker 2>means mother has secondary education, or they might use special

102
00:04:52.279 --> 00:04:54.839
<v Speaker 2>numbers like 99999 to mean missing

103
00:04:54.959 --> 00:04:57.240
<v Speaker 2>or not applicable. It makes sense in the database, but

104
00:04:57.279 --> 00:04:57.879
<v Speaker 2>not when you just pull the raw number.

105
00:04:57.800 --> 00:04:59.560
<v Speaker 1>Right, so the context is

106
00:04:59.600 --> 00:05:00.319
<v Speaker 1>missing.

107
00:05:00.360 --> 00:05:03.759
<v Speaker 2>Exactly. So you use tools like pymssql or the MySQL APIs with

108
00:05:03.920 --> 00:05:06.360
<v Speaker 2>pd.read_sql to pull the data, and you can

109
00:05:06.399 --> 00:05:09.360
<v Speaker 2>actually do some initial cleanup in the SQL query itself.

110
00:05:09.360 --> 00:05:12.120
<v Speaker 1>Like renaming columns or filtering rows right at the source?

111
00:05:12.439 --> 00:05:16.360
<v Speaker 2>Yeah, using the SELECT statement. Then once it's in pandas,

112
00:05:16.439 --> 00:05:19.279
<v Speaker 2>a really good technique is to replace those codes like

113
00:05:19.360 --> 00:05:22.199
<v Speaker 2>three with meaningful labels like secondary ed.

114
00:05:22.360 --> 00:05:24.480
<v Speaker 1>Makes the data much easier to understand.

115
00:05:24.879 --> 00:05:28.360
<v Speaker 2>Totally. And then, and this is key for efficiency, convert that column

116
00:05:28.399 --> 00:05:29.839
<v Speaker 2>to a category data type.

117
00:05:29.920 --> 00:05:30.600
<v Speaker 1>Why category?

118
00:05:30.639 --> 00:05:32.480
<v Speaker 2>It saves a ton of memory, especially if you have

119
00:05:32.560 --> 00:05:35.920
<v Speaker 2>text labels repeated many times. The student math data set

120
00:05:35.920 --> 00:05:39.959
<v Speaker 2>example shows this clearly with memory_usage(index=False). It's

121
00:05:40.000 --> 00:05:41.959
<v Speaker 2>not just about clarity, it's about performance.
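
NOTE
Editor's sketch of the database workflow above: rename in the SELECT, map codes to labels,
then convert to category. Python's built-in sqlite3 stands in for a SQL Server or MySQL
connection, and the table, column names, and code meanings are all hypothetical.
```python
import sqlite3
import pandas as pd
# sqlite3 replaces the server database here so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE studentmath (school TEXT, medu INTEGER)")
conn.executemany(
    "INSERT INTO studentmath VALUES (?, ?)",
    [("GP", 3), ("GP", 1), ("MS", 3), ("MS", 2)] * 250,
)
# Do some cleanup at the source: give the cryptic column a readable name.
students = pd.read_sql("SELECT school, medu AS mothereducation FROM studentmath", conn)
conn.close()
# Swap numeric codes for labels (map is used here; replace would also work).
labels = {1: "primary", 2: "middle", 3: "secondary"}
students["mothereducation"] = students["mothereducation"].map(labels)
as_text = students["mothereducation"].memory_usage(index=False, deep=True)
# Category stores each distinct label once plus small integer codes per row.
students["mothereducation"] = students["mothereducation"].astype("category")
as_category = students["mothereducation"].memory_usage(index=False, deep=True)
print(as_text, as_category)
```
The gap between the two byte counts grows with row count and label length, which is why the conversion matters more on bigger data.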

122
00:05:42.040 --> 00:05:45.079
<v Speaker 1>With bigger data, memory savings are always good. Okay. What

123
00:05:45.120 --> 00:05:50.519
<v Speaker 1>about data from statistical software: SPSS, Stata, SAS, R?

124
00:05:50.519 --> 00:05:53.160
<v Speaker 2>Yeah, those have their own formats too. Libraries like

125
00:05:53.240 --> 00:05:56.800
<v Speaker 2>pyreadstat for SPSS, Stata, and SAS, and pyreadr for R

126
00:05:56.879 --> 00:05:57.800
<v Speaker 2>data files are what you'd use.

127
00:05:57.720 --> 00:05:59.680
<v Speaker 1>And they have their own quirks, I assume?

128
00:06:00.519 --> 00:06:03.639
<v Speaker 2>One thing is these files often have metadata with column

129
00:06:03.720 --> 00:06:06.600
<v Speaker 2>labels that are way more descriptive than the short, sometimes

130
00:06:06.639 --> 00:06:08.040
<v Speaker 2>cryptic variable names.

131
00:06:08.120 --> 00:06:09.319
<v Speaker 1>So you want to use the labels?

132
00:06:09.439 --> 00:06:11.839
<v Speaker 2>Definitely prefer the labels, but then you still need to

133
00:06:11.839 --> 00:06:15.240
<v Speaker 2>clean those up: make them lowercase, replace spaces with underscores,

134
00:06:15.600 --> 00:06:20.240
<v Speaker 2>remove weird characters, make them usable as variable names in Python.

135
00:06:19.959 --> 00:06:22.079
<v Speaker 1>Okay, standardizing the names.

136
00:06:22.560 --> 00:06:27.959
<v Speaker 2>Right. And another big one for stats packages: logical missing values. Stata,

137
00:06:28.040 --> 00:06:30.480
<v Speaker 2>for instance, might use codes like -5.0,

138
00:06:30.519 --> 00:06:32.199
<v Speaker 2>-4.0, and -1.0.

139
00:06:32.199 --> 00:06:36.519
<v Speaker 2>These aren't errors; they're codes meaning refused, don't know,

140
00:06:37.120 --> 00:06:37.879
<v Speaker 2>not applicable.

141
00:06:38.160 --> 00:06:40.279
<v Speaker 1>Ah. So they look like numbers, but they aren't really

142
00:06:40.319 --> 00:06:41.319
<v Speaker 1>data points for analysis.

143
00:06:40.959 --> 00:06:45.000
<v Speaker 2>Exactly. You need to tell pandas to treat

144
00:06:45.000 --> 00:06:48.079
<v Speaker 2>those specific numbers as missing, otherwise they'll mess up your

145
00:06:48.079 --> 00:06:49.480
<v Speaker 2>calculations like your averages.
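
NOTE
Editor's sketch of why coded missing values matter. The series and the three negative
codes are hypothetical. Check the codebook for your own file before replacing anything.
```python
import numpy as np
import pandas as pd
# Hypothetical survey column where negative codes mean refused, don't know, and so on.
hours = pd.Series([40.0, -5.0, 35.0, -4.0, 45.0, -1.0], name="hoursworked")
# The naive mean treats the codes as real measurements.
naive_mean = hours.mean()
# Replace the codes with NaN so pandas skips them in calculations.
cleaned = hours.replace([-5.0, -4.0, -1.0], np.nan)
clean_mean = cleaned.mean()
print(naive_mean, clean_mean)
```
Here the codes drag the average down to less than half of the value computed from the three real responses.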

146
00:06:49.600 --> 00:06:52.120
<v Speaker 1>Got it. So much depends on understanding the source of

147
00:06:52.160 --> 00:06:52.600
<v Speaker 1>the data.

148
00:06:52.639 --> 00:06:54.800
<v Speaker 2>Absolutely. Context is everything in cleaning.

149
00:06:55.120 --> 00:06:57.000
<v Speaker 1>Okay, so we've wrangled the data in, we've done some

150
00:06:57.040 --> 00:06:59.759
<v Speaker 1>initial cleanup. Now the book talks about saving this cleaned

151
00:06:59.839 --> 00:07:03.160
<v Speaker 1>data. Why is picking the right format important, and

152
00:07:03.199 --> 00:07:05.879
<v Speaker 1>what are the trade-offs? It seems simple, but maybe

153
00:07:05.879 --> 00:07:06.160
<v Speaker 1>it's not.

154
00:07:06.519 --> 00:07:10.160
<v Speaker 2>That's a really important point: persisting the data. Why bother

155
00:07:10.279 --> 00:07:12.639
<v Speaker 2>saving it in a new format? Well, maybe you want

156
00:07:12.639 --> 00:07:15.920
<v Speaker 2>a clean snapshot before doing more complex stuff, or the

157
00:07:16.000 --> 00:07:18.639
<v Speaker 2>data doesn't change much so you work off the clean version,

158
00:07:19.040 --> 00:07:20.959
<v Speaker 2>or maybe you want the flexibility of JSON.

159
00:07:21.240 --> 00:07:26.040
<v Speaker 1>Okay, those are good reasons, but what's overlooked?

160
00:07:25.680 --> 00:07:29.279
<v Speaker 2>The trade-offs. CSV is memory-light, sure, but it's

161
00:07:29.319 --> 00:07:32.120
<v Speaker 2>slow to write for big files, and crucially, it forgets

162
00:07:32.120 --> 00:07:35.680
<v Speaker 2>your data types. All that work converting numbers? Poof, they

163
00:07:35.759 --> 00:07:37.399
<v Speaker 2>might become strings again when you reload.

164
00:07:37.439 --> 00:07:38.879
<v Speaker 1>Oh, that's annoying. What about pickle?

165
00:07:39.079 --> 00:07:42.360
<v Speaker 2>Pickle does remember data types, which is great, but creating

166
00:07:42.399 --> 00:07:45.800
<v Speaker 2>those pickle files, the serialization process, can be heavy on memory

167
00:07:45.839 --> 00:07:49.800
<v Speaker 2>and CPU. Might not be ideal on a resource-constrained system.

168
00:07:49.920 --> 00:07:50.879
<v Speaker 1>Okay, and Feather?

169
00:07:51.240 --> 00:07:54.000
<v Speaker 2>Feather is generally faster and lighter than pickle, and it

170
00:07:54.040 --> 00:07:56.959
<v Speaker 2>plays nice with R, which is cool for teams using both,

171
00:07:57.360 --> 00:07:59.959
<v Speaker 2>but you often lose the DataFrame index, and its

172
00:08:00.079 --> 00:08:03.720
<v Speaker 2>long-term support is maybe less certain than other formats.
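
NOTE
Editor's sketch of the data type trade-off between CSV and pickle described above.
The frame and file names are hypothetical, and Feather is left out because it needs
the optional pyarrow package.
```python
import os
import tempfile
import pandas as pd
# Hypothetical cleaned data with a category column and a datetime column.
df = pd.DataFrame({
    "station": pd.Series(["a", "b", "a"], dtype="category"),
    "measured": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "clean.csv")
    pkl_path = os.path.join(tmp, "clean.pkl")
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    # Reload both files with default settings.
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)
# CSV comes back as plain text columns; pickle keeps the original dtypes.
print(from_csv.dtypes)
print(from_pkl.dtypes)
```
You can recover the types from CSV by passing dtype and parse_dates to read_csv, but then the cleaning logic has to travel with the file. Only unpickle files you trust.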

173
00:08:03.759 --> 00:08:05.399
<v Speaker 1>So there's no single perfect format.

174
00:08:05.680 --> 00:08:08.839
<v Speaker 2>Not really. Depends on the needs. But here's the big warning.

175
00:08:09.279 --> 00:08:11.720
<v Speaker 2>When you save data, you separate it from the code

176
00:08:11.759 --> 00:08:14.519
<v Speaker 2>that created it. It's super easy to forget later how

177
00:08:14.519 --> 00:08:17.079
<v Speaker 2>a variable was calculated or cleaned.

178
00:08:17.240 --> 00:08:18.879
<v Speaker 1>Right. The logic gets lost.

179
00:08:19.000 --> 00:08:23.879
<v Speaker 2>Exactly. So the advice is: only persist your data at significant milestones,

180
00:08:23.920 --> 00:08:27.120
<v Speaker 2>when you've reached a stable, well understood point in your

181
00:08:27.120 --> 00:08:30.240
<v Speaker 2>cleaning process. Treat it like saving a major version.

182
00:08:30.759 --> 00:08:34.000
<v Speaker 1>That makes a lot of sense. Okay, milestone reached, data

183
00:08:34.080 --> 00:08:36.360
<v Speaker 1>is imported. What's the very, very first thing you do?

184
00:08:36.440 --> 00:08:39.200
<v Speaker 1>That question everyone asks: so, how does it look?

185
00:08:39.360 --> 00:08:41.559
<v Speaker 2>Yeah, that's the immediate next step, and you need a

186
00:08:41.679 --> 00:08:44.960
<v Speaker 2>routine, a system. Even if you think you know the data,

187
00:08:45.039 --> 00:08:46.600
<v Speaker 2>a new batch can always have surprises.

188
00:08:46.240 --> 00:08:48.919
<v Speaker 1>So a quick diagnostic checkup?

189
00:08:49.000 --> 00:08:51.480
<v Speaker 2>Exactly. You want to quickly grasp: what's the unit of analysis?

190
00:08:51.519 --> 00:08:54.720
<v Speaker 2>How many rows, how many columns? What are the common categories,

191
00:08:54.840 --> 00:08:58.720
<v Speaker 2>how are the numbers distributed? And critically, where are the

192
00:08:58.759 --> 00:09:02.960
<v Speaker 2>missing values and potential outliers. It's about building that initial intuition.

193
00:09:03.159 --> 00:09:05.840
<v Speaker 1>Building intuition. I like that. So what commands give you

194
00:09:05.879 --> 00:09:06.639
<v Speaker 1>that first glance?

195
00:09:06.879 --> 00:09:11.840
<v Speaker 2>DataFrame.shape for rows and columns. Simple but fundamental.

196
00:09:11.840 --> 00:09:15.039
<v Speaker 2>DataFrame.info() is gold. It shows data types and counts

197
00:09:15.159 --> 00:09:19.080
<v Speaker 2>non-missing values per column. Instant red flags for missing data.

198
00:09:19.440 --> 00:09:22.320
<v Speaker 1>Okay, shape and info. What about seeing the actual data?

199
00:09:22.440 --> 00:09:25.320
<v Speaker 2>DataFrame.head() for the first few rows, DataFrame.tail()

200
00:09:25.360 --> 00:09:27.799
<v Speaker 2>for the last few, and DataFrame.sample() is

201
00:09:27.840 --> 00:09:30.679
<v Speaker 2>great for a random peek. Use sample(random_state=1)

202
00:09:30.720 --> 00:09:34.120
<v Speaker 2>if you want the same random sample each time, for reproducibility.

203
00:09:34.200 --> 00:09:36.120
<v Speaker 1>Reproducibility is good.

204
00:09:36.039 --> 00:09:38.919
<v Speaker 2>Always. And a key tip here: set a meaningful index if

205
00:09:38.919 --> 00:09:42.120
<v Speaker 2>you have one, like a unique person ID. It makes selecting

206
00:09:42.159 --> 00:09:45.360
<v Speaker 2>specific rows later so much easier. It anchors your data.
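
NOTE
Editor's sketch of the first-look routine described above. The small frame and the
personid values are hypothetical stand-ins for a real survey extract.
```python
import pandas as pd
# Hypothetical survey extract; personid plays the role of the unique identifier.
nls = pd.DataFrame({
    "personid": [100061, 100139, 100284, 100292, 100583],
    "gender": ["Female", "Male", "Male", "Male", None],
    "nightlyhrssleep": [5.0, 6.0, None, 7.0, 4.0],
})
# A meaningful index makes later row lookups direct.
nls = nls.set_index("personid")
print(nls.shape)    # rows and columns
nls.info()          # dtypes and non-missing counts per column
print(nls.head(2))  # first rows
# A fixed random_state returns the same sample on every run.
peek_a = nls.sample(2, random_state=1)
peek_b = nls.sample(2, random_state=1)
print(peek_a)
# Select one respondent by identifier through the index.
print(nls.loc[100292])
```
In the info() output, any column whose non-null count is below the row count has missing values.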

207
00:09:45.399 --> 00:09:48.320
<v Speaker 1>Good point. Okay, we've got the overview. How about focusing

208
00:09:48.320 --> 00:09:50.559
<v Speaker 1>on columns, selecting and organizing them.

209
00:09:50.600 --> 00:09:53.480
<v Speaker 2>Standard selection is easy with square brackets or using

210
00:09:53.600 --> 00:09:55.759
<v Speaker 2>.loc and .iloc. But a real time-saver is

211
00:09:55.799 --> 00:09:56.279
<v Speaker 2>.filter().

212
00:09:56.440 --> 00:09:57.440
<v Speaker 1>Like, how does filter work?

213
00:09:57.480 --> 00:09:59.480
<v Speaker 2>It lets you select columns based on patterns in their

214
00:09:59.519 --> 00:10:02.039
<v Speaker 2>names. Say you have columns like weeksworked01,

215
00:10:02.120 --> 00:10:04.120
<v Speaker 2>weeksworked02. You can grab them

216
00:10:04.159 --> 00:10:07.200
<v Speaker 2>all with df.filter(like="weeksworked"). Super useful.

217
00:10:07.279 --> 00:10:10.080
<v Speaker 1>Oh nice, beats typing them all out.

218
00:10:10.440 --> 00:10:13.240
<v Speaker 2>Exactly. You can also select by data type:

219
00:10:13.240 --> 00:10:17.399
<v Speaker 2>df.select_dtypes(include=["number"]) gets all numeric columns, or

220
00:10:17.399 --> 00:10:22.120
<v Speaker 2>include=["category"] for categoricals. And for just keeping things sane, group

221
00:10:22.200 --> 00:10:29.559
<v Speaker 2>related columns together. Create lists of column names, like demographics (age, gender, location) or workforce (occupation, income),

222
00:10:30.000 --> 00:10:32.440
<v Speaker 2>and then you can easily work with those logical groups.
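
NOTE
Editor's sketch of the three column-selection ideas above: name patterns, data types,
and named lists. All column names and values here are hypothetical.
```python
import pandas as pd
df = pd.DataFrame({
    "gender": pd.Series(["F", "M"], dtype="category"),
    "birthyear": [1980, 1982],
    "weeksworked01": [48, 0],
    "weeksworked02": [52, 12],
    "wageincome": [52000.0, 8000.0],
})
# Pattern match on column names.
weeks = df.filter(like="weeksworked")
# Select by data type.
numeric = df.select_dtypes(include=["number"])
categorical = df.select_dtypes(include=["category"])
# Keep related columns in named lists and reuse them.
demographics = ["gender", "birthyear"]
workforce = ["weeksworked01", "weeksworked02", "wageincome"]
print(list(weeks.columns), list(numeric.columns), list(categorical.columns))
print(df[demographics + workforce].head())
```
The lists are plain Python, so they can be concatenated or passed to .loc as needed.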

223
00:10:32.600 --> 00:10:36.000
<v Speaker 1>Keeps the analysis tidy. Now, what about selecting specific rows?

224
00:10:36.120 --> 00:10:38.759
<v Speaker 1>You mentioned that issues often pop up when you look

225
00:10:38.799 --> 00:10:39.399
<v Speaker 1>at subsets.

226
00:10:39.440 --> 00:10:42.440
<v Speaker 2>Absolutely, this is where Boolean indexing comes in. You filter

227
00:10:42.559 --> 00:10:45.360
<v Speaker 2>rows based on conditions. For example, in that NLS data

228
00:10:45.360 --> 00:10:48.639
<v Speaker 2>set, nls97.nightlyhrssleep &lt;= 4 would

229
00:10:48.679 --> 00:10:50.759
<v Speaker 2>pull out everyone reporting very little sleep.

230
00:10:50.480 --> 00:10:53.240
<v Speaker 1>So you can isolate specific groups easily?

231
00:10:53.720 --> 00:10:56.759
<v Speaker 2>Yep. And you can combine conditions using &amp; for and, and

232
00:10:56.879 --> 00:10:59.480
<v Speaker 2>| for or. Like (sleep &lt;= 4) &amp; (children >= 3)

233
00:10:59.519 --> 00:11:02.360
<v Speaker 2>finds people with little sleep and three or more kids. Lets

234
00:11:02.360 --> 00:11:03.240
<v Speaker 2>you really zoom in.

235
00:11:03.440 --> 00:11:05.320
<v Speaker 1>You can select rows and columns together?

236
00:11:05.480 --> 00:11:07.919
<v Speaker 2>Yes, using .loc you can give it row conditions

237
00:11:07.960 --> 00:11:10.559
<v Speaker 2>and column names in one go. Very powerful for getting

238
00:11:10.559 --> 00:11:12.120
<v Speaker 2>exactly the slice of data you need.
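
NOTE
Editor's sketch of Boolean indexing and .loc as described above. The frame is
hypothetical, and childathome is an assumed name for the number-of-children column.
```python
import pandas as pd
nls97 = pd.DataFrame({
    "nightlyhrssleep": [4, 7, 3, 8, 4],
    "childathome": [3, 1, 0, 2, 4],
    "wageincome": [30000, 52000, 18000, 61000, 25000],
})
# One condition: respondents reporting four hours of sleep or less.
lowsleep = nls97[nls97.nightlyhrssleep <= 4]
# Two conditions joined with & (each wrapped in parentheses), plus a column list.
lowsleep3pluschildren = nls97.loc[
    (nls97.nightlyhrssleep <= 4) & (nls97.childathome >= 3),
    ["nightlyhrssleep", "childathome"],
]
print(len(lowsleep))
print(lowsleep3pluschildren)
```
The parentheses matter because & and | bind more tightly than comparison operators in Python.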

239
00:11:12.399 --> 00:11:15.919
<v Speaker 1>Okay, slicing and dicing. Now, I remember a researcher telling

240
00:11:15.960 --> 00:11:18.840
<v Speaker 1>me once, "90 percent of what you'll find is in

241
00:11:18.879 --> 00:11:22.399
<v Speaker 1>the frequency distributions." Why are frequencies so revealing?

242
00:11:22.480 --> 00:11:25.320
<v Speaker 2>Yeah, that's a great quote, and it's often true, especially

243
00:11:25.360 --> 00:11:29.200
<v Speaker 2>for categorical data. Frequencies, using Series.value_counts(), are

244
00:11:29.240 --> 00:11:32.000
<v Speaker 2>your best friend. They immediately show you: what are the

245
00:11:32.039 --> 00:11:35.919
<v Speaker 2>actual categories present? Are there typos, weird values, too many

246
00:11:35.960 --> 00:11:36.919
<v Speaker 2>"other" responses?

247
00:11:37.159 --> 00:11:39.960
<v Speaker 1>So it's like a reality check for your categories.

248
00:11:39.399 --> 00:11:43.320
<v Speaker 2>Totally. And adding normalize=True gives you percentages, which helps

249
00:11:43.399 --> 00:11:46.360
<v Speaker 2>understand the proportions. You can even apply it to multiple

250
00:11:46.360 --> 00:11:49.360
<v Speaker 2>columns at once using dot apply, like checking all those

251
00:11:49.399 --> 00:11:53.720
<v Speaker 2>government responsibility questions in the NLS data together. Very efficient.

252
00:11:54.000 --> 00:11:57.360
<v Speaker 2>And remember that tip about converting text columns, object type,

253
00:11:57.360 --> 00:12:00.480
<v Speaker 2>to category. It pays off here too, makes these value

254
00:12:00.480 --> 00:12:02.919
<v Speaker 2>counts operations faster and more memory efficient.
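
NOTE
Editor's sketch of frequency checks with value_counts. The two columns and their
responses are hypothetical, with one deliberately messy entry.
```python
import pandas as pd
df = pd.DataFrame({
    "govprovidejobs": ["Probably", "Definitely", "Probably", "probably ", None],
    "govpricecontrols": ["Definitely", "Definitely", "Probably", "Probably", "Probably"],
})
# Raw counts expose the stray lowercase entry with a trailing space.
counts = df["govprovidejobs"].value_counts()
# normalize=True returns proportions instead of counts.
shares = df["govpricecontrols"].value_counts(normalize=True)
# apply runs value_counts on every column and aligns the results by label.
both = df.apply(pd.Series.value_counts, normalize=True)
print(counts)
print(shares)
print(both)
```
By default value_counts skips missing values. Pass dropna=False if you want them counted too.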

255
00:12:02.960 --> 00:12:06.879
<v Speaker 1>Good reminder. Okay, frequencies for categories. What about summarizing our

256
00:12:06.919 --> 00:12:08.480
<v Speaker 1>continuous numeric variables?

257
00:12:08.600 --> 00:12:11.639
<v Speaker 2>Before you analyze numbers, you need to understand their basic properties.

258
00:12:12.080 --> 00:12:19.559
<v Speaker 2>Central tendency, like mean or median; spread, like standard deviation; and shape, like skewness.

259
00:12:18.840 --> 00:12:21.480
<v Speaker 1>And DataFrame.describe() is the go-to for that?

260
00:12:21.559 --> 00:12:24.480
<v Speaker 2>It is. It gives you count, mean, standard deviation,

261
00:12:24.759 --> 00:12:28.000
<v Speaker 2>min, max, and the quartiles: 25th, 50th, which is

262
00:12:28.039 --> 00:12:31.919
<v Speaker 2>the median, and 75th percentile. A fantastic quick summary.

263
00:12:32.000 --> 00:12:34.559
<v Speaker 1>It also mentions skewness and kurtosis. What do those tell us?

264
00:12:34.799 --> 00:12:38.000
<v Speaker 2>Skewness tells you if the distribution is symmetric or lopsided.

265
00:12:38.600 --> 00:12:41.559
<v Speaker 2>Kurtosis tells you about the tails. Are they fat, with lots

266
00:12:41.559 --> 00:12:45.559
<v Speaker 2>of extreme values, or thin? For instance, in the COVID data,

267
00:12:46.039 --> 00:12:49.919
<v Speaker 2>total cases and total deaths were heavily skewed right,

268
00:12:50.200 --> 00:12:52.799
<v Speaker 2>meaning the mean was much higher than the median. That's

269
00:12:52.840 --> 00:12:55.279
<v Speaker 2>a classic sign of outliers pulling the average up: a

270
00:12:55.279 --> 00:12:58.600
<v Speaker 2>few countries with extremely high numbers. It immediately tells you

271
00:12:58.639 --> 00:13:00.279
<v Speaker 2>the simple average might be misleading.
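
NOTE
Editor's sketch of the summary statistics discussed above. The numbers are made up
to mimic a right-skewed variable and are not from the COVID dataset.
```python
import pandas as pd
# Hypothetical right-skewed totals: most values are small, one is enormous.
cases = pd.Series([120, 340, 560, 800, 950, 1200, 250000], name="total_cases")
# count, mean, std, min, quartiles, and max in one call.
summary = cases.describe()
print(summary)
# A mean far above the median is the signature of right skew.
print(cases.mean(), cases.median())
# Positive skewness means a long right tail; high kurtosis means heavy tails.
print(cases.skew(), cases.kurtosis())
```
One extreme value is enough to pull the mean far from the median, which describe() shows in its 50% row.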

272
00:13:00.080 --> 00:13:04.080
<v Speaker 1>Right, the mean is sensitive to extremes. And how

273
00:13:04.120 --> 00:13:06.039
<v Speaker 1>do we visualize these distributions?

274
00:13:06.240 --> 00:13:08.799
<v Speaker 2>Histograms are the first stop. plt.hist() gives you a

275
00:13:08.840 --> 00:13:11.519
<v Speaker 2>quick picture of the shape. Are there multiple peaks? Is

276
00:13:11.559 --> 00:13:14.039
<v Speaker 2>it skewed? Are there values way off on their own?

277
00:13:14.159 --> 00:13:16.559
<v Speaker 1>Okay. And Q-Q plots, what are they for?

278
00:13:17.039 --> 00:13:20.080
<v Speaker 2>Q-Q plots, usually using statsmodels.api.qqplot,

279
00:13:20.200 --> 00:13:23.279
<v Speaker 2>are more technical. They compare your data's distribution directly

280
00:13:23.279 --> 00:13:25.919
<v Speaker 2>against a theoretical one, usually the normal distribution.

281
00:13:26.279 --> 00:13:27.240
<v Speaker 1>How does that help?

282
00:13:27.240 --> 00:13:31.320
<v Speaker 2>You plot your data's quantiles against the normal distribution's quantiles. If

283
00:13:31.320 --> 00:13:34.080
<v Speaker 2>your data is normally distributed, the points will fall roughly

284
00:13:34.159 --> 00:13:37.519
<v Speaker 2>on a straight diagonal line. Deviations from that line show

285
00:13:37.519 --> 00:13:41.559
<v Speaker 2>you exactly how your data differs from normal. Maybe fatter tails,

286
00:13:41.600 --> 00:13:44.000
<v Speaker 2>maybe skewness. It's a great diagnostic.

287
00:13:44.240 --> 00:13:47.840
<v Speaker 1>Got it. And detecting outliers, that 1.5 times IQR rule?

288
00:13:47.679 --> 00:13:50.720
<v Speaker 2>Yeah, that's a common rule of thumb. Calculate the interquartile

289
00:13:50.879 --> 00:13:53.679
<v Speaker 2>range, IQR, which is the distance between the 75th

290
00:13:53.720 --> 00:13:57.559
<v Speaker 2>and 25th percentiles. Anything below Q1 minus 1.5

291
00:13:57.600 --> 00:14:01.039
<v Speaker 2>times IQR or above Q3 plus 1.5

292
00:14:01.120 --> 00:14:03.679
<v Speaker 2>times IQR is flagged as a potential outlier.

293
00:14:03.840 --> 00:14:07.440
<v Speaker 1>Potential outlier, so not definitely wrong, but worth investigating?

294
00:14:07.879 --> 00:14:11.440
<v Speaker 2>Exactly. A good practice is to output these potential outliers, maybe

295
00:14:11.440 --> 00:14:13.559
<v Speaker 2>save them to a separate Excel file along with some

296
00:14:13.639 --> 00:14:16.320
<v Speaker 2>related data, so you can examine them more closely. Are

297
00:14:16.320 --> 00:14:19.919
<v Speaker 2>they data errors or are they genuinely unusual but valid cases?
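
NOTE
Editor's sketch of the 1.5 times IQR rule described above. The locations and counts
are hypothetical.
```python
import pandas as pd
cases = pd.DataFrame({
    "location": ["A", "B", "C", "D", "E", "F", "G"],
    "total_cases": [120, 340, 560, 800, 950, 1200, 250000],
})
# Quartiles and the interquartile range.
q1 = cases["total_cases"].quantile(0.25)
q3 = cases["total_cases"].quantile(0.75)
iqr = q3 - q1
# Fences: anything outside them is flagged for review, not automatically removed.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = cases[(cases["total_cases"] < low) | (cases["total_cases"] > high)]
print(low, high)
print(outliers)
# outliers.to_excel("outliers.xlsx") would save them for review (needs openpyxl).
```
Keeping the location column alongside the flagged value gives you the context needed to decide whether it is an error or a valid extreme.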

298
00:14:20.000 --> 00:14:22.799
<v Speaker 1>Right, context matters again. Okay, we've got the basic checks done.

299
00:14:22.840 --> 00:14:26.320
<v Speaker 1>But data isn't just about individual variables, it's about relationships.

300
00:14:26.320 --> 00:14:27.600
<v Speaker 1>How do we start digging into those?

301
00:14:27.799 --> 00:14:30.919
<v Speaker 2>This is where it gets really interesting because sometimes issues,

302
00:14:31.039 --> 00:14:33.720
<v Speaker 2>especially outliers, only really jump out when you look at

303
00:14:33.759 --> 00:14:35.600
<v Speaker 2>two or more variables together.

304
00:14:35.440 --> 00:14:38.440
<v Speaker 1>Like your example, a ten-year-old earning fifty million dollars.

305
00:14:38.720 --> 00:14:40.679
<v Speaker 1>Each number might be okay on its own, but together?

306
00:14:40.720 --> 00:14:44.840
<v Speaker 2>Exactly. That combination flags a huge issue. So how

307
00:14:44.879 --> 00:14:47.960
<v Speaker 2>do we spot these? One way is using cross-tabulations,

308
00:14:48.039 --> 00:14:51.200
<v Speaker 2>but smartly. You can use pd.qcut to

309
00:14:51.320 --> 00:14:55.039
<v Speaker 2>bin your continuous variables into quantiles, say "very low" to

310
00:14:55.279 --> 00:14:57.000
<v Speaker 2>"very high", based on ranges.

311
00:14:57.080 --> 00:14:59.679
<v Speaker 1>Okay, so you categorize the continuous data.

312
00:15:00.120 --> 00:15:02.519
<v Speaker 2>Right. Then you use pd.crosstab to see how

313
00:15:02.519 --> 00:15:06.159
<v Speaker 2>the quantiles of two variables relate. In the COVID data example,

314
00:15:06.240 --> 00:15:09.440
<v Speaker 2>you could crosstab total_cases_q and total_deaths_q.

315
00:15:10.120 --> 00:15:12.600
<v Speaker 2>That might show you countries like Qatar and Singapore in

316
00:15:12.639 --> 00:15:15.720
<v Speaker 2>the very high cases bin, but only the medium deaths bin.

317
00:15:16.159 --> 00:15:17.879
<v Speaker 2>That discrepancy jumps right out.
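
NOTE
Editor's sketch of binning with qcut and comparing bins with crosstab, as described
above. The eight locations, their values, and the four bin labels are hypothetical.
```python
import pandas as pd
covid = pd.DataFrame({
    "location": list("ABCDEFGH"),
    "total_cases": [10, 20, 30, 40, 50, 60, 70, 80],
    "total_deaths": [1, 2, 3, 4, 5, 6, 3.5, 8],
})
labels = ["low", "medium", "high", "very high"]
# Cut each variable into four equal-sized quantile bins.
covid["total_cases_q"] = pd.qcut(covid["total_cases"], q=4, labels=labels)
covid["total_deaths_q"] = pd.qcut(covid["total_deaths"], q=4, labels=labels)
# Rows are case bins, columns are death bins; off-diagonal cells deserve a look.
table = pd.crosstab(covid["total_cases_q"], covid["total_deaths_q"])
print(table)
# Pull out the rows behind one surprising cell.
mismatch = covid[(covid["total_cases_q"] == "very high") & (covid["total_deaths_q"] == "medium")]
print(mismatch["location"].tolist())
```
Most locations land on or near the diagonal, so the one that sits two bins away stands out immediately.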

318
00:15:17.720 --> 00:15:19.600
<v Speaker 1>A pattern that doesn't fit the general tread.

319
00:15:19.399 --> 00:15:23.080
<v Speaker 2>Precisely and visually. Scatterplots are key seaborne dot reg plot

320
00:15:23.120 --> 00:15:25.000
<v Speaker 2>is great because it shows the points and fits a

321
00:15:25.000 --> 00:15:27.720
<v Speaker 2>regression line. You can immediately see points that fall far

322
00:15:27.759 --> 00:15:30.080
<v Speaker 2>from the line: potential bivariate outliers.

323
00:15:30.159 --> 00:15:32.919
<v Speaker 1>What about identifying points that have a really big influence

324
00:15:32.919 --> 00:15:33.960
<v Speaker 1>on that regression line?

325
00:15:34.279 --> 00:15:37.200
<v Speaker 2>Ah, that's where statistical measures like Cook's distance come in.

326
00:15:37.559 --> 00:15:40.799
<v Speaker 2>It basically measures how much the entire regression model changes

327
00:15:41.200 --> 00:15:43.639
<v Speaker 2>if you remove a single specific data point.

328
00:15:43.840 --> 00:15:46.919
<v Speaker 1>So a high Cook's distance means that point is really pulling

329
00:15:46.679 --> 00:15:50.600
<v Speaker 2>the line. Exactly. It has high leverage or influence. Removing

330
00:15:50.639 --> 00:15:53.559
<v Speaker 2>an outlier like Qatar in the COVID analysis, for instance,

331
00:15:53.639 --> 00:15:58.559
<v Speaker 2>could significantly change the calculated relationship between say median age

332
00:15:58.600 --> 00:16:01.639
<v Speaker 2>and cases per million. It tells you that single point

333
00:16:01.759 --> 00:16:03.440
<v Speaker 2>is heavily impacting your conclusion.
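
NOTE
A rough sketch of Cook's distance computed by hand with NumPy, assuming a simple
one-predictor regression on made-up data. This is not the book's code; in statsmodels
the same quantity is available from a fitted OLS result through get_influence().
```python
import numpy as np
# Hypothetical data: points on a straight line, except the last one.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] = 60.0  # this point breaks the pattern
X = np.column_stack([np.ones_like(x), x])  # intercept column plus the predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
n, p = X.shape
s2 = resid @ resid / (n - p)  # residual variance estimate
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
h = np.diag(H)  # leverage of each observation
# Cook's distance combines the size of the residual with the leverage.
cooks_d = resid**2 / (p * s2) * h / (1.0 - h) ** 2
most_influential = int(np.argmax(cooks_d))
print(most_influential, cooks_d.round(3))
```
The last observation has both a large residual and high leverage, so it dominates.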

334
00:16:03.559 --> 00:16:06.600
<v Speaker 1>That's powerful. What if we suspect outliers based on many

335
00:16:06.679 --> 00:16:08.480
<v Speaker 1>variables at once, not just two?

336
00:16:08.840 --> 00:16:13.440
<v Speaker 2>For that multivariate perspective, k-nearest neighbors, kNN, is a good approach,

337
00:16:13.519 --> 00:16:16.679
<v Speaker 2>often using the PyOD library. PyOD? Yeah, it's a Python

338
00:16:16.720 --> 00:16:20.600
<v Speaker 2>library specifically for outlier detection. It wraps algorithms like kNN

339
00:16:20.679 --> 00:16:23.759
<v Speaker 2>from scikit-learn. The idea with kNN is to find

340
00:16:23.799 --> 00:16:27.519
<v Speaker 2>points that are far away from their neighbors in multidimensional space.

341
00:16:27.320 --> 00:16:29.200
<v Speaker 1>So points that don't fit in with any cluster.

342
00:16:29.519 --> 00:16:33.440
<v Speaker 2>Kind of yeah, but remember, for distance based methods like kNN,

343
00:16:33.720 --> 00:16:37.720
<v Speaker 2>you must standardize your data first, usually using z-scores.

344
00:16:38.159 --> 00:16:42.240
<v Speaker 2>Otherwise variables with larger ranges will dominate the distance calculation.

345
00:16:42.399 --> 00:16:44.480
<v Speaker 1>Right, put everything on the same scale. Exactly.

346
00:16:44.759 --> 00:16:48.120
<v Speaker 2>Applying this to the COVID data might flag Singapore, Qatar,

347
00:16:48.360 --> 00:16:51.960
<v Speaker 2>Hong Kong as outliers when considering both cases and deaths

348
00:16:51.960 --> 00:16:55.159
<v Speaker 2>per million together, revealing multifaceted anomalies.

349
00:16:55.320 --> 00:16:58.559
<v Speaker 1>Okay, outlier detection covered. Now let's get into the real

350
00:16:58.600 --> 00:17:01.120
<v Speaker 1>workhorse stuff, manipulating the data itself.

351
00:17:01.279 --> 00:17:04.759
<v Speaker 2>Series operations, right. pandas Series, think of them as the

352
00:17:04.839 --> 00:17:08.039
<v Speaker 2>columns in your DataFrame, are where most column-wise action happens.

353
00:17:08.279 --> 00:17:12.440
<v Speaker 2>Basic access uses slicing, like my_series[start:5], or .loc

354
00:17:12.559 --> 00:17:15.640
<v Speaker 2>for label-based access, .iloc for position

355
00:17:15.359 --> 00:17:18.359
<v Speaker 1>based. Standard indexing. What about changing values based on

356
00:17:18.319 --> 00:17:22.000
<v Speaker 2>conditions? NumPy's where is incredibly useful for that. Basic if-

357
00:17:22.000 --> 00:17:24.759
<v Speaker 2>then-else logic on a column: np.where(condition,

358
00:17:24.920 --> 00:17:27.279
<v Speaker 2>value_if_true, value_if_false), like assigning high or low

359
00:17:27.319 --> 00:17:29.960
<v Speaker 2>elevation based on a threshold, simple and fast.
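
NOTE
A small example of the np.where pattern mentioned above. The DataFrame, the
column names, and the 1000 m threshold are assumptions made for illustration.
```python
import numpy as np
import pandas as pd
# Hypothetical station elevations in meters.
stations = pd.DataFrame({"elevation": [120.0, 2500.0, 980.0, 1500.0]})
# np.where(condition, value_if_true, value_if_false), evaluated for the whole column at once.
stations["elevation_group"] = np.where(stations["elevation"] > 1000, "High", "Low")
print(stations)
```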

360
00:17:30.160 --> 00:17:33.079
<v Speaker 1>But what if the logic is really complicated, involving multiple

361
00:17:33.119 --> 00:17:35.119
<v Speaker 1>columns for each row? That's where

362
00:17:34.920 --> 00:17:38.200
<v Speaker 2>you often need apply with axis=1 and a custom function,

363
00:17:38.359 --> 00:17:41.400
<v Speaker 2>a UDF, a user-defined function. You write a function that

364
00:17:41.440 --> 00:17:44.480
<v Speaker 2>takes a row of data as input, applies your complex

365
00:17:44.519 --> 00:17:47.359
<v Speaker 2>logic using values from different columns in that row, and

366
00:17:47.440 --> 00:17:49.359
<v Speaker 2>returns a result for that row. Like

367
00:17:49.279 --> 00:17:53.400
<v Speaker 1>the sleep-deprived reasons example, checking kids, wages, work hours

368
00:17:53.400 --> 00:17:54.960
<v Speaker 1>for each person. Exactly.

369
00:17:55.400 --> 00:17:58.519
<v Speaker 2>apply with axis=1 lets you handle that row-by-row

370
00:17:58.599 --> 00:18:02.440
<v Speaker 2>custom logic, which is sometimes unavoidable for really specific business

371
00:18:02.480 --> 00:18:04.440
<v Speaker 2>rules or derived variables.
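
NOTE
A sketch of the apply(axis=1) pattern with a user-defined function. The column
names, thresholds, and labels are hypothetical and only loosely modeled on the
sleep-deprivation example discussed here.
```python
import pandas as pd
# Hypothetical respondent data.
people = pd.DataFrame({
    "nightly_sleep_hrs": [5, 5, 8, 4],
    "children_at_home": [3, 0, 1, 0],
    "weekly_work_hrs": [40, 60, 45, 20],
})
def sleep_deprived_reason(row):
    # Receives one row and combines several of its columns into a single label.
    if row["nightly_sleep_hrs"] >= 6:
        return "Not sleep deprived"
    if row["children_at_home"] > 2:
        return "Child rearing"
    if row["weekly_work_hrs"] >= 50:
        return "Work schedule"
    return "Other reasons"
# axis=1 passes each row, rather than each column, to the function.
people["reason"] = people.apply(sleep_deprived_reason, axis=1)
print(people)
```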

372
00:18:04.519 --> 00:18:07.680
<v Speaker 1>Okay, and text data? String cleaning must be common.

373
00:18:07.759 --> 00:18:11.279
<v Speaker 2>Oh yeah, huge part of cleaning. The pandas .str accessor

374
00:18:11.319 --> 00:18:14.920
<v Speaker 2>is your friend here: str.contains to find substrings,

375
00:18:14.960 --> 00:18:18.279
<v Speaker 2>str.strip to remove leading and trailing whitespace,

376
00:18:18.319 --> 00:18:21.160
<v Speaker 2>str.lower or str.upper for case changes,

377
00:18:21.160 --> 00:18:22.720
<v Speaker 2>str.replace for substitutions.

378
00:18:22.839 --> 00:18:24.680
<v Speaker 1>What about more complex patterns? That's

379
00:18:24.519 --> 00:18:27.400
<v Speaker 2>where regular expressions come in, used with methods like

380
00:18:27.440 --> 00:18:30.079
<v Speaker 2>str.findall or str.extract. If you need to

381
00:18:30.079 --> 00:18:32.920
<v Speaker 2>pull out specific patterns, like codes or numbers embedded in

382
00:18:33.000 --> 00:18:35.640
<v Speaker 2>text, regex is the way to go. Cleaning up something

383
00:18:35.640 --> 00:18:37.599
<v Speaker 2>like marital status, where you might have married with a

384
00:18:37.640 --> 00:18:40.200
<v Speaker 2>trailing space is a classic str.strip job.
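
NOTE
A short sketch of the .str accessor methods listed above, assuming made-up
survey values. The "code 12" pattern is a hypothetical example of an embedded code.
```python
import pandas as pd
# Hypothetical marital status values with stray whitespace, mixed case, and an embedded code.
s = pd.Series(["Married ", " never married", "DIVORCED", "Married (code 12)"])
clean = s.str.strip().str.lower()  # trim whitespace, then normalize case
is_married = clean.str.contains("married", regex=False) & ~clean.str.contains("never", regex=False)
# str.extract pulls out the first capture group; rows without a match become NaN.
codes = clean.str.extract(r"code (\d+)", expand=False)
print(clean.tolist(), is_married.tolist(), codes.tolist())
```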

385
00:18:40.440 --> 00:18:42.960
<v Speaker 1>Dates too? Calculations like age?

386
00:18:42.680 --> 00:18:46.559
<v Speaker 2>Definitely. First step is always converting date columns to proper

387
00:18:46.640 --> 00:18:50.079
<v Speaker 2>datetime objects using pd.to_datetime. Once they

388
00:18:50.079 --> 00:18:53.160
<v Speaker 2>are datetimes, you can easily fill missing dates with fillna,

389
00:18:53.519 --> 00:18:57.039
<v Speaker 2>calculate time differences like subtracting birth date from today's date

390
00:18:57.079 --> 00:19:00.119
<v Speaker 2>to get age, or find intervals like days since a specific.

391
00:18:59.839 --> 00:19:04.359
<v Speaker 1>And filling missing values, imputation? We mentioned dropping or

392
00:19:04.480 --> 00:19:07.039
<v Speaker 1>using a simple fill. Are there smarter ways?

393
00:19:06.920 --> 00:19:09.359
<v Speaker 2>Beyond fillna with a constant, you can impute

394
00:19:09.440 --> 00:19:12.799
<v Speaker 2>with the overall mean or median. Better yet, use a

395
00:19:12.839 --> 00:19:15.799
<v Speaker 2>group mean with groupby and transform('mean').

396
00:19:15.880 --> 00:19:17.119
<v Speaker 1>How does transform work there?

397
00:19:17.200 --> 00:19:20.160
<v Speaker 2>It calculates the mean for each group, say the mean

398
00:19:20.200 --> 00:19:23.720
<v Speaker 2>income for each occupation, and then broadcasts that group specific

399
00:19:23.799 --> 00:19:26.319
<v Speaker 2>mean back to fill the missing values within that group,

400
00:19:26.759 --> 00:19:30.039
<v Speaker 2>more targeted than a global mean. You can also use

401
00:19:30.160 --> 00:19:33.759
<v Speaker 2>ffill, forward fill, or bfill, backward fill, to propagate

402
00:19:33.799 --> 00:19:36.279
<v Speaker 2>the last or next known value. And for a machine

403
00:19:36.319 --> 00:19:39.240
<v Speaker 2>learning approach, there's KNNImputer from scikit-learn.

404
00:19:39.480 --> 00:19:40.799
<v Speaker 1>Using kNN again?

405
00:19:40.799 --> 00:19:44.000
<v Speaker 2>Yeah, KNNImputer finds the k most similar rows based

406
00:19:44.039 --> 00:19:47.279
<v Speaker 2>on the other columns and uses their values to estimate

407
00:19:47.319 --> 00:19:50.599
<v Speaker 2>the missing value. It leverages the relationships in your data.

408
00:19:51.160 --> 00:19:53.960
<v Speaker 2>Can be very effective, but computationally more expensive.
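
NOTE
A minimal sketch of group-mean imputation and forward fill, assuming made-up
incomes by occupation. The kNN route (KNNImputer in sklearn.impute) is not
shown here; the names and numbers below are hypothetical.
```python
import numpy as np
import pandas as pd
# Hypothetical incomes with one missing value per occupation.
df = pd.DataFrame({
    "occupation": ["nurse", "nurse", "nurse", "pilot", "pilot", "pilot"],
    "income": [50.0, 70.0, np.nan, 100.0, np.nan, 140.0],
})
# transform returns one value per original row, so it lines up with the column being filled.
group_mean = df.groupby("occupation")["income"].transform("mean")
df["income_filled"] = df["income"].fillna(group_mean)
# Forward fill copies the previous row's value; it only makes sense when row order is meaningful.
df["income_ffill"] = df["income"].ffill()
print(df)
```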

409
00:19:54.519 --> 00:19:58.079
<v Speaker 1>Lots of options for missing data, Okay, shifting gears a bit.

410
00:19:58.519 --> 00:20:01.039
<v Speaker 1>Analysts spend tons of time on what Hadley Wickham

411
00:20:01.039 --> 00:20:05.279
<v Speaker 1>famously called split, apply, combine. Could you unpack that?

412
00:20:05.599 --> 00:20:09.119
<v Speaker 2>It's a fundamental pattern in data analysis. You split your

413
00:20:09.200 --> 00:20:13.000
<v Speaker 2>data into groups based on some criteria. You apply some

414
00:20:13.119 --> 00:20:17.039
<v Speaker 2>function, like calculating a mean, sum, or count, to each group independently,

415
00:20:17.440 --> 00:20:20.759
<v Speaker 2>and then you combine the results back into a useful summary

416
00:20:20.400 --> 00:20:22.759
<v Speaker 1>structure. And in pandas, groupby is the main tool for this?

417
00:20:22.799 --> 00:20:26.160
<v Speaker 2>Absolutely. While you could technically loop through your data, maybe

418
00:20:26.240 --> 00:20:29.359
<v Speaker 2>using itertuples or NumPy arrays for speed, groupby

419
00:20:29.559 --> 00:20:32.119
<v Speaker 2>is almost always vastly more efficient and concise.

420
00:20:32.200 --> 00:20:33.279
<v Speaker 1>So how does groupby work?

421
00:20:33.599 --> 00:20:36.960
<v Speaker 2>You call df.groupby with the column to group by.

422
00:20:37.759 --> 00:20:40.559
<v Speaker 2>This creates a special GroupBy object. Then you chain

423
00:20:40.599 --> 00:20:43.680
<v Speaker 2>an aggregation method like .mean or .describe to

424
00:20:43.839 --> 00:20:45.359
<v Speaker 2>calculate results for each group.

425
00:20:45.480 --> 00:20:47.279
<v Speaker 1>Can you do multiple calculations at once?

426
00:20:47.599 --> 00:20:50.759
<v Speaker 2>Yes, using .agg. This is super flexible. You can

427
00:20:50.799 --> 00:20:54.119
<v Speaker 2>pass it a list of functions, like mean and std, to apply

428
00:20:54.160 --> 00:20:57.079
<v Speaker 2>to all columns, or a dictionary to apply different functions

429
00:20:57.079 --> 00:20:59.759
<v Speaker 2>to different columns, like col_a: sum, col_b: mean.

430
00:21:00.319 --> 00:21:03.599
<v Speaker 1>So you can get complex summaries easily, and this changes

431
00:21:03.640 --> 00:21:05.359
<v Speaker 1>the unit of analysis often?

432
00:21:05.519 --> 00:21:08.680
<v Speaker 2>Yes, if you group daily sales by month and calculate

433
00:21:08.720 --> 00:21:12.640
<v Speaker 2>the sum, your result is now monthly sales. You've aggregated.

434
00:21:13.000 --> 00:21:15.599
<v Speaker 2>The output might need reset_index to turn the

435
00:21:15.640 --> 00:21:19.000
<v Speaker 2>group keys back into columns or unstack to reshape it.

436
00:21:19.000 --> 00:21:20.240
<v Speaker 2>It's how you roll up data.
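
NOTE
A sketch of the split-apply-combine pattern with groupby and agg, assuming
made-up daily sales. Column names and values are hypothetical.
```python
import pandas as pd
# Hypothetical daily sales; grouping by month changes the unit of analysis from day to month.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-17"]),
    "amount": [100, 150, 200, 50],
    "units": [1, 3, 2, 2],
})
sales["month"] = sales["date"].dt.to_period("M")
# A dictionary applies a different aggregation to each column; reset_index turns the group key back into a column.
monthly = sales.groupby("month").agg({"amount": "sum", "units": "mean"}).reset_index()
print(monthly)
```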

437
00:21:20.680 --> 00:21:25.440
<v Speaker 1>Very powerful. Okay, what about sticking data sets together? Vertically first.

438
00:21:25.839 --> 00:21:28.759
<v Speaker 2>Using concat. It's for stacking data frames on top of

439
00:21:28.799 --> 00:21:32.119
<v Speaker 2>each other, appending rows. Think adding this month's data to

440
00:21:32.200 --> 00:21:34.599
<v Speaker 2>last month's, or combining data from different regions that have

441
00:21:34.680 --> 00:21:35.480
<v Speaker 2>the same columns.

442
00:21:35.640 --> 00:21:36.839
<v Speaker 1>What's the main thing to watch out for?

443
00:21:37.400 --> 00:21:40.359
<v Speaker 2>Column names and data types. If the columns aren't exactly

444
00:21:40.400 --> 00:21:43.319
<v Speaker 2>the same or the types differ, concat can get confused

445
00:21:43.400 --> 00:21:46.960
<v Speaker 2>or produce weird results, like turning numbers into generic object types.

446
00:21:47.240 --> 00:21:49.359
<v Speaker 2>Always check alignment before concatenating.
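
NOTE
A small illustration of the column-alignment check before concatenating,
assuming two made-up monthly extracts where one column name differs only by case.
```python
import pandas as pd
# Hypothetical extracts; feb uses "Sales" where jan uses "sales".
jan = pd.DataFrame({"region": ["north", "south"], "sales": [10, 20]})
feb = pd.DataFrame({"region": ["north", "south"], "Sales": [30, 40]})
same_columns = set(jan.columns) == set(feb.columns)  # check before stacking
# Without a fix, concat keeps both columns and fills the gaps with NaN.
stacked_raw = pd.concat([jan, feb], ignore_index=True)
feb_fixed = feb.rename(columns={"Sales": "sales"})
stacked = pd.concat([jan, feb_fixed], ignore_index=True)
print(stacked_raw)
print(stacked)
```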

447
00:21:49.640 --> 00:21:55.799
<v Speaker 1>Good heads up. Now the trickier one horizontal combination merge

448
00:21:56.000 --> 00:21:57.880
<v Speaker 1>you mentioned. This is where accidents happen.

449
00:21:58.000 --> 00:22:02.599
<v Speaker 2>Hah. Yes, merging or joining tables is fundamental, but it's

450
00:22:02.599 --> 00:22:05.400
<v Speaker 2>so easy to get wrong, especially with different join types.

451
00:22:05.720 --> 00:22:07.720
<v Speaker 2>I often joke you need a buddy to double check

452
00:22:07.759 --> 00:22:08.359
<v Speaker 2>your merges.

453
00:22:08.519 --> 00:22:11.240
<v Speaker 1>Okay, so one to one merges are easy, unique keys

454
00:22:11.279 --> 00:22:15.119
<v Speaker 1>in both tables, and multiple keys are possible too. Where

455
00:22:15.119 --> 00:22:16.400
<v Speaker 1>does the danger zone start?

456
00:22:17.000 --> 00:22:20.400
<v Speaker 2>It starts with one-to-many merges. One table has

457
00:22:20.519 --> 00:22:23.359
<v Speaker 2>unique keys like customer ID in a customer table. The

458
00:22:23.400 --> 00:22:26.960
<v Speaker 2>other has duplicates, like the same customer ID appearing multiple

459
00:22:26.960 --> 00:22:28.160
<v Speaker 2>times in an orders table.

460
00:22:28.319 --> 00:22:31.640
<v Speaker 1>Right, and the key here is inner versus left join

461
00:22:31.960 --> 00:22:32.839
<v Speaker 1>critical difference.

462
00:22:33.079 --> 00:22:36.119
<v Speaker 2>An inner join only keeps rows where the key exists

463
00:22:36.119 --> 00:22:38.480
<v Speaker 2>in both tables. If a customer hasn't placed an order,

464
00:22:38.519 --> 00:22:40.759
<v Speaker 2>they disappear. A left join keeps all rows from the

465
00:22:40.839 --> 00:22:43.680
<v Speaker 2>left table customers and brings in matching data from the

466
00:22:43.759 --> 00:22:46.839
<v Speaker 2>right orders. Customers with no orders will still be there,

467
00:22:46.839 --> 00:22:50.000
<v Speaker 2>but order details will be NaN. Choosing the wrong join

468
00:22:50.079 --> 00:22:52.160
<v Speaker 2>type can silently drop data you needed.
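
NOTE
A sketch contrasting inner and left joins on a one-to-many key, assuming
made-up customer and order tables. The indicator column is one way to see
which rows found no match.
```python
import pandas as pd
# Hypothetical tables: customer 3 has placed no orders.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "total": [20.0, 35.0, 15.0]})
inner = customers.merge(orders, on="customer_id", how="inner")  # customer 3 disappears
# indicator=True adds a _merge column saying where each row came from.
left = customers.merge(orders, on="customer_id", how="left", indicator=True)
no_orders = left.loc[left["_merge"] == "left_only", "name"].tolist()
print(len(inner), len(left), no_orders)
```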

469
00:22:52.400 --> 00:22:54.640
<v Speaker 1>That sounds like a very easy mistake to make. It is.

470
00:22:55.000 --> 00:22:57.799
<v Speaker 2>And then there's the dreaded many-to-many merge: keys

471
00:22:57.839 --> 00:22:59.440
<v Speaker 2>are duplicated in both tables.

472
00:22:59.440 --> 00:23:02.359
<v Speaker 1>The book said this should rarely.

473
00:23:02.000 --> 00:23:05.359
<v Speaker 2>be needed, because it usually means something is wrong conceptually.

474
00:23:05.960 --> 00:23:09.359
<v Speaker 2>A direct many-to-many merge creates a Cartesian product within each key:

475
00:23:09.480 --> 00:23:12.039
<v Speaker 2>every matching row on the left joins with every matching

476
00:23:12.119 --> 00:23:15.160
<v Speaker 2>row on the right. If you merge students and courses

477
00:23:15.160 --> 00:23:17.759
<v Speaker 2>where both have duplicate IDs, you can end up matching

478
00:23:17.799 --> 00:23:20.799
<v Speaker 2>every student to every course multiple times, inflating counts and

479
00:23:20.839 --> 00:23:21.839
<v Speaker 2>sums nonsensically.

480
00:23:22.039 --> 00:23:23.000
<v Speaker 1>So what's the fix?

481
00:23:23.440 --> 00:23:26.319
<v Speaker 2>Usually you need to rethink your data structure before the merge,

482
00:23:26.839 --> 00:23:29.720
<v Speaker 2>find or create a unique identifier for the relationship you

483
00:23:29.759 --> 00:23:33.000
<v Speaker 2>actually want to capture, recover the implied one to many

484
00:23:33.079 --> 00:23:36.240
<v Speaker 2>link first, rather than doing a raw many-to-many.
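
NOTE
A sketch of how a many-to-many merge multiplies rows, and how the validate
argument of merge can catch it early. The tables are made up; using validate
is an assumption about a reasonable safeguard, not something stated in the episode.
```python
import pandas as pd
# Hypothetical tables where the key repeats on both sides.
left = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "b"], "y": [10, 20, 30]})
blown_up = left.merge(right, on="key")  # 2 x 2 rows for "a" plus 1 x 1 for "b"
try:
    left.merge(right, on="key", validate="one_to_many")
    caught = False
except pd.errors.MergeError:
    caught = True  # pandas refuses because "key" is not unique in the left table
print(len(blown_up), caught)
```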

485
00:23:36.400 --> 00:23:39.440
<v Speaker 1>Okay, avoid many to many if possible. What about just

486
00:23:39.519 --> 00:23:41.319
<v Speaker 1>tidying up duplicates within a table?

487
00:23:41.559 --> 00:23:44.920
<v Speaker 2>drop_duplicates is your friend there. You can specify subset

488
00:23:44.920 --> 00:23:47.759
<v Speaker 2>to check for duplicates based only on certain columns, and

489
00:23:47.839 --> 00:23:49.960
<v Speaker 2>keep lets you choose whether to keep the first or

490
00:23:50.079 --> 00:23:54.599
<v Speaker 2>last duplicate encountered. Like drop_duplicates(subset='location', keep='last')

491
00:23:54.920 --> 00:23:58.160
<v Speaker 2>to get just the latest COVID data row for each country.
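
NOTE
A sketch of keeping the latest row per country with drop_duplicates. The rows
and case counts are invented; sorting by date first is what makes "last" mean "latest".
```python
import pandas as pd
# Hypothetical daily rows, deliberately out of date order for the first country.
daily = pd.DataFrame({
    "location": ["Qatar", "Qatar", "Singapore", "Singapore"],
    "date": pd.to_datetime(["2020-06-02", "2020-06-01", "2020-06-01", "2020-06-02"]),
    "total_cases": [110, 100, 200, 210],
})
latest = (
    daily.sort_values(["location", "date"])
    .drop_duplicates(subset="location", keep="last")
)
print(latest)
```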

492
00:23:58.640 --> 00:24:01.119
<v Speaker 1>And this relates back to that tidy data idea?

493
00:24:01.359 --> 00:24:05.359
<v Speaker 2>Absolutely, fixing many to many often involves making your data tidier.

494
00:24:05.720 --> 00:24:08.680
<v Speaker 2>Instead of one giant messy table, break it down into

495
00:24:08.720 --> 00:24:11.680
<v Speaker 2>smaller related tables where each piece of information lives in

496
00:24:11.720 --> 00:24:15.640
<v Speaker 2>only one place: a table for museum items, another for creators,

497
00:24:15.680 --> 00:24:20.880
<v Speaker 2>another for citations, all linked by IDs, cleaner, less error prone.

498
00:24:20.920 --> 00:24:23.920
<v Speaker 1>Makes sense. Okay. Sometimes we need to reshape data, moving

499
00:24:24.000 --> 00:24:27.160
<v Speaker 1>between wide and long formats. Why do we do that?

500
00:24:27.359 --> 00:24:31.240
<v Speaker 2>Often analytical tools or models prefer data in a long format,

501
00:24:31.279 --> 00:24:34.680
<v Speaker 2>where you have one row, one observation, per time point or category,

502
00:24:34.920 --> 00:24:38.039
<v Speaker 2>rather than having time points or categories spread across columns

503
00:24:38.240 --> 00:24:38.880
<v Speaker 2>wide format.

504
00:24:38.960 --> 00:24:40.279
<v Speaker 1>So how do we go from wide to long?

505
00:24:40.559 --> 00:24:43.440
<v Speaker 2>stack is a basic way. It pivots columns into the index.

506
00:24:43.519 --> 00:24:47.000
<v Speaker 2>More flexible is melt. You tell it which columns are identifiers,

507
00:24:47.079 --> 00:24:50.240
<v Speaker 2>id_vars, and which columns you want to melt down, value_vars.

508
00:24:50.319 --> 00:24:52.920
<v Speaker 2>You can name the new variable and value columns too.

509
00:24:52.880 --> 00:24:55.519
<v Speaker 1>Like turning weeksworked00, weeksworked01 into

510
00:24:55.559 --> 00:24:58.319
<v Speaker 1>a single year column and a single value column. Exactly.

511
00:24:58.359 --> 00:25:01.400
<v Speaker 2>melt is great for that. If you have multiple sets

512
00:25:01.440 --> 00:25:04.680
<v Speaker 2>of columns following a pattern like weeks worked by year

513
00:25:05.119 --> 00:25:08.880
<v Speaker 2>and college enrollment by year, then

514
00:25:08.880 --> 00:25:12.200
<v Speaker 2>pd.wide_to_long is often more powerful. It's designed to

515
00:25:12.240 --> 00:25:16.720
<v Speaker 2>handle those structured wide formats efficiently, reshaping multiple variable groups

516
00:25:16.759 --> 00:25:19.079
<v Speaker 2>simultaneously based on naming conventions.
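
NOTE
A sketch of melt versus pd.wide_to_long, assuming a made-up wide table whose
column names follow a stub-plus-year pattern. The stub names are hypothetical.
```python
import pandas as pd
# Hypothetical wide table: two variable groups, each spread across year columns.
wide = pd.DataFrame({
    "person_id": [1, 2],
    "weeksworked00": [40, 52], "weeksworked01": [45, 50],
    "colenr00": [1, 0], "colenr01": [0, 0],
})
# melt handles one group of columns at a time.
melted = wide.melt(
    id_vars="person_id",
    value_vars=["weeksworked00", "weeksworked01"],
    var_name="year",
    value_name="weeksworked",
)
# wide_to_long reshapes both groups at once, using the shared stub names.
long = pd.wide_to_long(
    wide, stubnames=["weeksworked", "colenr"], i="person_id", j="year"
).reset_index()
print(melted)
print(long)
```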

517
00:25:19.240 --> 00:25:21.559
<v Speaker 1>Very cool. And the reverse, long back to

518
00:25:21.519 --> 00:25:26.160
<v Speaker 2>wide? That uses unstack or pivot_table. Sometimes you need data

519
00:25:26.200 --> 00:25:28.480
<v Speaker 2>in a wide format for reporting or for software that

520
00:25:28.519 --> 00:25:31.279
<v Speaker 2>expects it that way, even if it's technically untidy.
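
NOTE
A sketch of going from long back to wide, assuming a made-up long table. It
shows pivot and unstack side by side; pivot_table works similarly but also
aggregates when a cell has more than one value.
```python
import pandas as pd
# Hypothetical long table: one row per person per year.
long = pd.DataFrame({
    "person_id": [1, 1, 2, 2],
    "year": [2000, 2001, 2000, 2001],
    "weeksworked": [40, 45, 52, 50],
})
wide = long.pivot(index="person_id", columns="year", values="weeksworked")
# Same result by moving the year level of the index into the columns.
wide_again = long.set_index(["person_id", "year"])["weeksworked"].unstack()
print(wide)
```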

521
00:25:31.519 --> 00:25:35.559
<v Speaker 1>Okay, fantastic overview of the core manipulation tools. But doing

522
00:25:35.599 --> 00:25:39.440
<v Speaker 1>all this manually every time new data arrives sounds exhausting.

523
00:25:39.759 --> 00:25:40.960
<v Speaker 1>How do we automate? Right?

524
00:25:41.000 --> 00:25:43.480
<v Speaker 2>That's the ultimate goal, isn't it? Making the cleaning process

525
00:25:43.519 --> 00:25:46.799
<v Speaker 2>repeatable and reliable. This is where reusable code, functions and

526
00:25:46.839 --> 00:25:50.319
<v Speaker 2>classes, becomes essential. It separates the logic of your cleaning

527
00:25:50.319 --> 00:25:52.640
<v Speaker 2>from the specific data set you're working on at that moment.

528
00:25:52.839 --> 00:25:55.279
<v Speaker 1>So user-defined functions, UDFs, are a big part of this,

529
00:25:55.319 --> 00:25:57.160
<v Speaker 1>building our own little toolkit. Exactly.

530
00:25:57.279 --> 00:26:00.400
<v Speaker 2>You can create functions for common tasks, like a first-look

531
00:26:00.440 --> 00:26:03.559
<v Speaker 2>function in a separate Python file, basicdescriptives.py,

532
00:26:03.559 --> 00:26:06.519
<v Speaker 2>that you import. Run it on any new data frame,

533
00:26:06.640 --> 00:26:10.160
<v Speaker 2>and it instantly gives you the shape, info, head, unique

534
00:26:10.200 --> 00:26:11.640
<v Speaker 2>IDs, your standard first

535
00:26:11.519 --> 00:26:13.839
<v Speaker 1>checks. Automating the diagnostics.

536
00:26:13.400 --> 00:26:17.319
<v Speaker 2>Yep, or functions like gettots to calculate custom stats,

537
00:26:17.359 --> 00:26:21.440
<v Speaker 2>maybe including specific percentiles like fifteenth and eighty fifth, or

538
00:26:21.559 --> 00:26:25.559
<v Speaker 2>getmissings to summarize missing data cleanly by row and column,

539
00:26:25.640 --> 00:26:27.160
<v Speaker 2>maybe as percentages.
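
NOTE
A sketch of a reusable missing-data helper in the spirit of the functions
described here. The function name, signature, and return values are
hypothetical, not the book's implementation.
```python
import numpy as np
import pandas as pd
def summarize_missing(df, as_percent=False):
    # Hypothetical helper: missing values per column, plus a tally of rows by how many values they lack.
    by_column = df.isnull().sum()
    by_row = df.isnull().sum(axis=1).value_counts().sort_index()
    if as_percent:
        return by_column / len(df), by_row / len(df)
    return by_column, by_row
# Made-up data to exercise the helper.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": ["x", "y", None, None]})
cols, rows = summarize_missing(demo)
print(cols)
print(rows)
```
Keeping helpers like this in one importable module means every new data set gets the same first checks.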

540
00:26:26.720 --> 00:26:29.359
<v Speaker 1>And for visualization or outliers?

541
00:26:28.920 --> 00:26:31.319
<v Speaker 2>You could have a getdistprops function that returns skew,

542
00:26:31.440 --> 00:26:34.880
<v Speaker 2>kurtosis, maybe runs a normality test on a column, or

543
00:26:35.079 --> 00:26:37.640
<v Speaker 2>a makeplot function that wraps Matplotlib or

544
00:26:37.680 --> 00:26:42.480
<v Speaker 2>seaborn to quickly generate standard histograms or box plots without

545
00:26:42.559 --> 00:26:44.880
<v Speaker 2>rewriting the plotting code every time. So

546
00:26:44.839 --> 00:26:48.119
<v Speaker 1>You build up a library of your common cleaning steps precisely.

547
00:26:48.400 --> 00:26:50.960
<v Speaker 2>You could even have more complex functions, like that adjmeans

548
00:26:51.039 --> 00:26:53.880
<v Speaker 2>example, for calculating running totals, but with logic to

549
00:26:53.960 --> 00:26:57.759
<v Speaker 2>exclude sudden huge jumps, like maybe in daily COVID case

550
00:26:57.799 --> 00:27:00.880
<v Speaker 2>reporting, using NumPy arrays internally for efficiency.

551
00:27:00.960 --> 00:27:03.680
<v Speaker 1>Okay, functions are great for reusable steps. What about classes?

552
00:27:03.720 --> 00:27:05.680
<v Speaker 1>How do they fit into data cleaning automation?

553
00:27:06.000 --> 00:27:08.720
<v Speaker 2>Classes offer a different way to structure your logic, especially

554
00:27:08.759 --> 00:27:11.599
<v Speaker 2>when you think about the unit of analysis. You could

555
00:27:11.640 --> 00:27:14.000
<v Speaker 2>define a Respondent class for the NLS

556
00:27:13.720 --> 00:27:17.440
<v Speaker 1>data. So each object of the class represents one person? Exactly.

557
00:27:17.519 --> 00:27:20.279
<v Speaker 2>You'd initialize it with a row of data, maybe as

558
00:27:20.319 --> 00:27:23.519
<v Speaker 2>a dictionary. Then you can define methods on that class

559
00:27:23.559 --> 00:27:26.599
<v Speaker 2>to calculate things specific to that respondent, like a childnum

560
00:27:26.680 --> 00:27:30.640
<v Speaker 2>method, or avgweeksworked, or ageby a given date. The logic

561
00:27:30.680 --> 00:27:33.960
<v Speaker 2>related to a respondent lives with the respondent's data inside

562
00:27:34.000 --> 00:27:34.559
<v Speaker 2>the object.
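
NOTE
A sketch of the one-object-per-respondent idea, assuming each row arrives as a
dictionary. The class layout, method names, and field names are hypothetical
and only loosely follow the methods mentioned in the conversation.
```python
class Respondent:
    # Hypothetical sketch: one instance wraps one survey row passed in as a dictionary.
    def __init__(self, row):
        self.row = row
    def avg_weeks_worked(self):
        # Average every weeks-worked field, skipping years with no value.
        weeks = [v for k, v in self.row.items() if k.startswith("weeksworked") and v is not None]
        return sum(weeks) / len(weeks) if weeks else None
    def child_count(self):
        # Children living at home plus children living elsewhere.
        return self.row.get("childathome", 0) + self.row.get("childnotathome", 0)
person = Respondent({
    "weeksworked00": 40, "weeksworked01": None, "weeksworked02": 50,
    "childathome": 1, "childnotathome": 2,
})
print(person.avg_weeks_worked(), person.child_count())
```
With a DataFrame, one way to build these objects is to loop over df.to_dict("records").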

563
00:27:34.640 --> 00:27:38.480
<v Speaker 1>That sounds quite intuitive. Where do classes really become powerful? Though?

564
00:27:38.680 --> 00:27:42.759
<v Speaker 2>They truly shine with complex non tabular data like nested

565
00:27:42.839 --> 00:27:47.000
<v Speaker 2>JSON or XML. Think about that museum data example. One

566
00:27:47.079 --> 00:27:50.000
<v Speaker 2>JSON record for an artwork might contain a list of creators,

567
00:27:50.039 --> 00:27:52.400
<v Speaker 2>each with their own details and a list of citations.

568
00:27:52.519 --> 00:27:55.400
<v Speaker 1>Right, nested structures hard to flatten cleanly.

569
00:27:55.359 --> 00:27:57.680
<v Speaker 2>Very. But with a collection item class you could have

570
00:27:57.720 --> 00:28:00.799
<v Speaker 2>methods like birthyearcreator1 that directly accesses the

571
00:28:00.799 --> 00:28:04.119
<v Speaker 2>first creator's birth year within the nested structure, or birthyearsall

572
00:28:04.160 --> 00:28:06.599
<v Speaker 2>that loops through the creators list inside the

573
00:28:06.599 --> 00:28:08.640
<v Speaker 2>object and returns all their birth years.
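
NOTE
A sketch of a class that works with a nested JSON record directly instead of
flattening it. The field names ("creators", "birth_year") are assumptions about
the JSON layout, and the method names are illustrative rather than the book's.
```python
class CollectionItem:
    # Hypothetical sketch: wraps one nested record, for example a parsed JSON object.
    def __init__(self, record):
        self.record = record
    def first_creator_birth_year(self):
        # Reach into the nested list without flattening it first.
        creators = self.record.get("creators", [])
        return creators[0].get("birth_year") if creators else None
    def all_creator_birth_years(self):
        return [c.get("birth_year") for c in self.record.get("creators", [])]
item = CollectionItem({
    "title": "Untitled",
    "creators": [{"name": "A", "birth_year": 1850}, {"name": "B", "birth_year": 1872}],
    "citations": [],
})
print(item.first_creator_birth_year(), item.all_creator_birth_years())
```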

574
00:28:08.759 --> 00:28:11.279
<v Speaker 1>So you're working with the data's natural hierarchy instead of

575
00:28:11.319 --> 00:28:12.960
<v Speaker 1>forcing it flat. Exactly.

576
00:28:13.240 --> 00:28:15.759
<v Speaker 2>It avoids many of the errors that come from complex

577
00:28:15.799 --> 00:28:19.839
<v Speaker 2>flattening and merging. You model the data conceptually. Now, iterating

578
00:28:19.880 --> 00:28:22.920
<v Speaker 2>and creating class instances can use more memory and time

579
00:28:23.200 --> 00:28:26.519
<v Speaker 2>than pure vectorized pandas for simple tasks.

580
00:28:26.599 --> 00:28:30.519
<v Speaker 1>But the tradeoff is clarity and correctness for complex data.

581
00:28:30.559 --> 00:28:34.759
<v Speaker 2>Often, yes. The conceptual clarity, the intuitive modeling, and potentially

582
00:28:34.839 --> 00:28:37.440
<v Speaker 2>fewer passes over the raw data can make it a

583
00:28:37.480 --> 00:28:41.440
<v Speaker 2>winner for intricate cleaning tasks, especially with nested or irregular

584
00:28:41.519 --> 00:28:44.119
<v Speaker 2>data structures. It helps you focus on the meaning of

585
00:28:44.160 --> 00:28:44.880
<v Speaker 2>the data unit.

586
00:28:45.160 --> 00:28:48.720
<v Speaker 1>Wow, that's quite a comprehensive journey we've been on from

587
00:28:48.839 --> 00:28:52.759
<v Speaker 1>just getting data in, wrestling with formats, doing those initial checks,

588
00:28:52.960 --> 00:28:57.160
<v Speaker 1>spotting outliers, reshaping, all the way to building automated functions

589
00:28:57.160 --> 00:28:57.759
<v Speaker 1>and classes.

590
00:28:57.839 --> 00:29:00.480
<v Speaker 2>It really covers the spectrum, and the key takeaway I

591
00:29:00.519 --> 00:29:03.559
<v Speaker 2>think isn't just the specific Python commands, but developing that

592
00:29:03.640 --> 00:29:07.599
<v Speaker 2>critical eye for data quality, building confidence that the information

593
00:29:07.640 --> 00:29:08.680
<v Speaker 2>you're using is sound.

594
00:29:08.839 --> 00:29:10.960
<v Speaker 1>It feels like this is the shortcut to actually using

595
00:29:11.000 --> 00:29:13.759
<v Speaker 1>information effectively, getting real meaning out of the noise.

596
00:29:14.000 --> 00:29:17.279
<v Speaker 2>It truly is. Clean data is the foundation for any

597
00:29:17.319 --> 00:29:18.279
<v Speaker 2>reliable insight.

598
00:29:18.720 --> 00:29:21.200
<v Speaker 1>So maybe a final thought for everyone listening: as you

599
00:29:21.279 --> 00:29:25.000
<v Speaker 1>look at your own data, consider this, if knowledge really

600
00:29:25.079 --> 00:29:29.000
<v Speaker 1>is most valuable when it's understood and applied, what patterns,

601
00:29:29.039 --> 00:29:32.119
<v Speaker 1>what crucial insights might still be hidden in your data sets,

602
00:29:32.559 --> 00:29:34.920
<v Speaker 1>just waiting for the right cleaning technique to reveal them?

603
00:29:35.039 --> 00:29:37.960
<v Speaker 2>That's a great question. We really encourage you to explore

604
00:29:38.000 --> 00:29:43.960
<v Speaker 2>these libraries: pandas, NumPy, PyOD, statsmodels. Experiment with these techniques.

605
00:29:44.119 --> 00:29:46.319
<v Speaker 2>See the power of clean data for yourself.

606
00:29:46.480 --> 00:29:49.880
<v Speaker 1>Absolutely. The path to being well informed, to getting real insights,

607
00:29:50.039 --> 00:29:53.359
<v Speaker 1>it's an ongoing process, and solid data cleaning is always

608
00:29:53.400 --> 00:29:54.640
<v Speaker 1>that essential first step.
