WEBVTT

1
00:00:00.000 --> 00:00:03.879
<v Speaker 1>All right, so we're diving into some excerpts from Mining

2
00:00:03.960 --> 00:00:07.160
<v Speaker 1>Social Media: Finding Stories in Internet Data today.

3
00:00:07.240 --> 00:00:08.160
<v Speaker 2>Yeah, this should be fun.

4
00:00:08.359 --> 00:00:11.240
<v Speaker 1>It looks like we've got a lot to unpack, from

5
00:00:11.919 --> 00:00:16.719
<v Speaker 1>bot detection to ethical data scraping to even like real

6
00:00:16.800 --> 00:00:19.480
<v Speaker 1>Reddit data on vaccinations.

7
00:00:18.719 --> 00:00:19.320
<v Speaker 2>That's a lot to cover.

8
00:00:19.559 --> 00:00:23.679
<v Speaker 1>Yeah, so where do we want to begin? I guess

9
00:00:23.719 --> 00:00:26.199
<v Speaker 1>like bots on Twitter. Sure, they kind of give me

10
00:00:26.239 --> 00:00:27.160
<v Speaker 1>the creeps honestly.

11
00:00:27.280 --> 00:00:28.640
<v Speaker 2>Yeah, they're definitely concerning.

12
00:00:28.719 --> 00:00:30.600
<v Speaker 1>How big of a problem are they really? Like, are

13
00:00:30.679 --> 00:00:32.719
<v Speaker 1>they actually swaying public opinion?

14
00:00:32.960 --> 00:00:37.000
<v Speaker 2>Well, they can definitely amplify certain narratives and manipulate trends,

15
00:00:37.240 --> 00:00:40.759
<v Speaker 2>you know, and even, like, sow discord among real users.

16
00:00:40.479 --> 00:00:41.920
<v Speaker 1>Like a digital echo chamber.

17
00:00:42.159 --> 00:00:45.840
<v Speaker 2>Yeah, exactly. It creates that illusion of like widespread support

18
00:00:45.920 --> 00:00:48.280
<v Speaker 2>or opposition to an idea when it might not really

19
00:00:48.320 --> 00:00:48.719
<v Speaker 2>be there.

20
00:00:49.000 --> 00:00:51.000
<v Speaker 1>So how can we spot these bots?

21
00:00:51.079 --> 00:00:52.799
<v Speaker 2>Well, one of the easiest ways is to look at

22
00:00:52.799 --> 00:00:54.520
<v Speaker 2>their tweeting frequency.

23
00:00:54.159 --> 00:00:55.799
<v Speaker 1>Like how often they tweet? Exactly.

24
00:00:56.079 --> 00:00:59.039
<v Speaker 2>Bots can tweet way more than any human possibly could.

25
00:00:59.280 --> 00:01:01.880
<v Speaker 1>Ah, they're like tweeting machines. Pretty much.

26
00:01:02.399 --> 00:01:04.760
<v Speaker 2>There's an example in the book about this account that's

27
00:01:04.760 --> 00:01:05.920
<v Speaker 2>called sunneversets100.

28
00:01:06.040 --> 00:01:07.079
<v Speaker 1>Okay, what about it?

29
00:01:07.079 --> 00:01:10.200
<v Speaker 2>It was flagged for tweeting over seventy times in a

30
00:01:10.239 --> 00:01:10.719
<v Speaker 2>single day.

31
00:01:10.879 --> 00:01:12.439
<v Speaker 1>Wow, that's a lot.

32
00:01:12.680 --> 00:01:14.239
<v Speaker 2>Yeah, no human could keep up with that.

33
00:01:14.400 --> 00:01:16.439
<v Speaker 1>So once you've spotted a potential bot,

34
00:01:16.439 --> 00:01:19.359
<v Speaker 2>What then? Well, believe it or not, you can actually

35
00:01:19.359 --> 00:01:20.920
<v Speaker 2>do a lot with Google Sheets.

36
00:01:21.120 --> 00:01:22.519
<v Speaker 1>Really? Google Sheets?

37
00:01:22.640 --> 00:01:24.959
<v Speaker 2>Yeah, I know it sounds basic, but you can get

38
00:01:24.959 --> 00:01:27.239
<v Speaker 2>some pretty interesting insights if you know how to use it.

39
00:01:27.359 --> 00:01:30.239
<v Speaker 1>Hmm, okay, what's the catch?

40
00:01:30.680 --> 00:01:33.799
<v Speaker 2>Well, you have to make sure your data is formatted correctly,

41
00:01:34.599 --> 00:01:37.159
<v Speaker 2>you know, like, so Google Sheets can tell the difference

42
00:01:37.200 --> 00:01:40.599
<v Speaker 2>between, like, text, numbers, and dates. Right, right. Then you

43
00:01:40.599 --> 00:01:43.560
<v Speaker 2>can use pivot tables to summarize your data. Like, say

44
00:01:43.640 --> 00:01:45.400
<v Speaker 2>you want to see the daily tweet counts for a

45
00:01:45.439 --> 00:01:48.159
<v Speaker 2>suspected bot account, a pivot table can do that.
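
NOTE
Editor's sketch, not from the episode or the book: the daily tweet count described above, done in pandas rather than a Sheets pivot table. The timestamps, the column name created_at, and the threshold are all made up for illustration.
```python
import pandas as pd
# Hypothetical tweet timestamps for one account (made-up data).
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2018-03-01 08:00", "2018-03-01 08:05", "2018-03-01 09:30",
        "2018-03-02 10:00", "2018-03-02 10:01",
    ]),
})
# Count tweets per calendar day, like a pivot table with dates as rows.
daily_counts = tweets.groupby(tweets["created_at"].dt.date).size()
# Flag days above an assumed threshold (2 here only because the sample is tiny).
THRESHOLD = 2
suspicious_days = daily_counts[daily_counts > THRESHOLD]
print(daily_counts)
print(suspicious_days)
```
With real data the threshold would be much higher, in the spirit of the "over seventy times in a single day" example mentioned earlier.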

46
00:01:48.719 --> 00:01:50.680
<v Speaker 1>Interesting. So you're telling me I can use the same

47
00:01:50.760 --> 00:01:53.640
<v Speaker 1>tool I use to track my grocery budget to analyze

48
00:01:53.680 --> 00:01:55.159
<v Speaker 1>bot activity? Exactly.

49
00:01:55.680 --> 00:01:57.920
<v Speaker 2>And you can go even further with formulas. They're like

50
00:01:57.959 --> 00:01:59.519
<v Speaker 2>Google Sheets' version of code.

51
00:01:59.640 --> 00:02:03.400
<v Speaker 1>Okay, so it's like you're giving Google Sheets instructions to

52
00:02:03.439 --> 00:02:05.280
<v Speaker 1>manipulate the data? Exactly.

53
00:02:06.319 --> 00:02:10.479
<v Speaker 2>There's one formula, VLOOKUP, that lets you combine data

54
00:02:10.520 --> 00:02:14.759
<v Speaker 2>from different sets by finding matching values. And then there's

55
00:02:14.840 --> 00:02:17.360
<v Speaker 2>IFERROR, which helps you handle errors.
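
NOTE
Editor's sketch, not from the episode: a rough pandas analogue of the lookup-plus-error-handling pattern just described. A left merge stands in for the lookup formula and fillna stands in for the error fallback. The tables, column names, and labels are hypothetical.
```python
import pandas as pd
# Hypothetical tables: tweets per account, and a separate lookup table of labels.
counts = pd.DataFrame({"account": ["a", "b", "c"], "tweets": [120, 8, 15]})
labels = pd.DataFrame({"account": ["a", "b"], "label": ["suspected bot", "human"]})
# A left merge matches rows on the shared "account" column and keeps every row of counts.
combined = counts.merge(labels, on="account", how="left")
# Where no match was found the label is missing, so supply a fallback value.
combined["label"] = combined["label"].fillna("unknown")
print(combined)
```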

56
00:02:17.439 --> 00:02:21.800
<v Speaker 1>So it's like basic coding, but in a spreadsheet. I'm intrigued.

57
00:02:21.919 --> 00:02:23.400
<v Speaker 2>Yeah, it's pretty powerful stuff.

58
00:02:23.439 --> 00:02:25.400
<v Speaker 1>But what if you want to get even more advanced.

59
00:02:25.400 --> 00:02:27.360
<v Speaker 1>I've heard a lot about Python being the go-to

60
00:02:27.520 --> 00:02:28.919
<v Speaker 1>language for data analysis.

61
00:02:29.080 --> 00:02:32.400
<v Speaker 2>Yeah, Python is definitely the next level, especially for handling

62
00:02:32.479 --> 00:02:33.400
<v Speaker 2>large data sets.

63
00:02:33.599 --> 00:02:37.199
<v Speaker 1>All right, so Python, large data sets. It sounds intimidating.

64
00:02:37.000 --> 00:02:39.199
<v Speaker 2>Not as bad as it sounds. One important thing to

65
00:02:39.280 --> 00:02:42.280
<v Speaker 2>understand is the concept of virtual environments.

66
00:02:42.479 --> 00:02:44.199
<v Speaker 1>Virtual environments, like, what are those?

67
00:02:44.319 --> 00:02:48.080
<v Speaker 2>They basically help you manage different libraries without causing conflicts.

68
00:02:48.199 --> 00:02:48.759
<v Speaker 1>Libraries.

69
00:02:48.840 --> 00:02:51.840
<v Speaker 2>Yeah, libraries are basically collections of pre written code for

70
00:02:51.879 --> 00:02:54.639
<v Speaker 2>specific tasks, kind of like specialized toolkits.

71
00:02:54.800 --> 00:02:59.039
<v Speaker 1>Virtual environments are like separate workspaces for your Python projects.

72
00:02:59.080 --> 00:03:00.759
<v Speaker 1>Exactly. Makes sense.
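
NOTE
Editor's sketch, not from the episode: one way to see the "separate workspace" idea, using Python's standard venv module to create an environment in a temporary folder. The folder name is arbitrary, and in practice most people create environments from the command line instead.
```python
import tempfile
import venv
from pathlib import Path
# Create a throwaway environment in a temporary folder (the folder name is arbitrary).
env_dir = Path(tempfile.mkdtemp()) / "social-media-env"
# with_pip=False keeps the sketch fast; a real project would usually want pip.
venv.create(env_dir, with_pip=False)
# Each environment gets its own config file and its own place for libraries.
print((env_dir / "pyvenv.cfg").exists())
```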

73
00:03:00.960 --> 00:03:03.199
<v Speaker 2>Then what? Well, once you've got your environment set up,

74
00:03:03.400 --> 00:03:05.960
<v Speaker 2>you can use Jupyter Notebook to write and run your

75
00:03:06.000 --> 00:03:06.759
<v Speaker 2>Python code.

76
00:03:06.879 --> 00:03:10.639
<v Speaker 1>Jupyter Notebook, got it. And what about pandas? I've heard

77
00:03:10.639 --> 00:03:13.159
<v Speaker 1>that name thrown around a lot in data analysis circles.

78
00:03:13.319 --> 00:03:16.360
<v Speaker 2>Pandas is a game changer, especially for social media data.

79
00:03:16.439 --> 00:03:20.400
<v Speaker 2>It's a library that's specifically designed for handling those massive

80
00:03:20.439 --> 00:03:21.000
<v Speaker 2>data sets.

81
00:03:21.080 --> 00:03:22.759
<v Speaker 1>So it helps you make sense of all that data.

82
00:03:22.919 --> 00:03:26.599
<v Speaker 2>Yeah, you can clean it, manipulate it, analyze it. It's

83
00:03:26.639 --> 00:03:29.159
<v Speaker 2>a must have for any serious data analyst.

84
00:03:29.560 --> 00:03:34.719
<v Speaker 1>Okay, so we've talked bots, Google Sheets, Python, pandas. What

85
00:03:34.800 --> 00:03:38.039
<v Speaker 1>about that Reddit data you mentioned, the stuff about vaccinations.

86
00:03:38.199 --> 00:03:39.919
<v Speaker 2>Yeah, we can dive into that next. We'll use that

87
00:03:39.960 --> 00:03:42.439
<v Speaker 2>as a case study. There's tons of data from Reddit

88
00:03:42.520 --> 00:03:46.120
<v Speaker 2>thanks to this guy, Jason Baumgartner, who's like a data archivist.

89
00:03:46.479 --> 00:03:48.639
<v Speaker 1>Cool. So what specifically are we going to look at?

90
00:03:48.680 --> 00:03:50.919
<v Speaker 2>We'll focus on the r/askscience subreddit. We can see

91
00:03:50.919 --> 00:03:53.000
<v Speaker 2>how people are talking about vaccinations online.

92
00:03:53.120 --> 00:03:56.199
<v Speaker 1>Sounds fascinating. And how do we even begin to analyze

93
00:03:56.199 --> 00:03:57.120
<v Speaker 1>all of that data?

94
00:03:57.159 --> 00:04:01.039
<v Speaker 2>We'll use pandas in Jupyter Notebook. There are these handy

95
00:04:01.120 --> 00:04:04.639
<v Speaker 2>methods and attributes like .head(), .columns, and .iloc that

96
00:04:04.719 --> 00:04:06.199
<v Speaker 2>help you get a feel for the data.
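
NOTE
Editor's sketch, not from the episode: the three pandas tools just named, applied to a tiny made-up table. The column names title, ups, and num_comments are assumptions and may not match the real Reddit data set.
```python
import pandas as pd
# Hypothetical stand-in for a table of Reddit submissions (made-up rows).
posts = pd.DataFrame({
    "title": ["Do vaccines expire?", "How do tides work?", "Why is the sky blue?"],
    "ups": [52, 310, 7],
    "num_comments": [14, 88, 2],
})
print(posts.head(2))          # first two rows
print(list(posts.columns))    # column names
print(posts.iloc[1])          # the second row, selected by position
print(posts.iloc[0, 1])       # row 0, column 1
```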

97
00:04:06.280 --> 00:04:08.560
<v Speaker 1>So like what's in there, the names of the columns,

98
00:04:08.599 --> 00:04:11.479
<v Speaker 1>how to select specific data points? Exactly.

99
00:04:12.120 --> 00:04:14.639
<v Speaker 2>But fair warning, we're probably going to run into some

100
00:04:14.719 --> 00:04:16.439
<v Speaker 2>missing values in the data.

101
00:04:16.560 --> 00:04:18.879
<v Speaker 1>Missing values? Like what? Yeah,

102
00:04:18.720 --> 00:04:21.879
<v Speaker 2>Like null or NaN entries. They can mess things up

103
00:04:21.879 --> 00:04:22.720
<v Speaker 2>if you're not careful.

104
00:04:22.839 --> 00:04:23.959
<v Speaker 1>So what do you do about them?

105
00:04:24.199 --> 00:04:27.480
<v Speaker 2>You can either remove those rows entirely with .dropna()

106
00:04:27.959 --> 00:04:30.800
<v Speaker 2>or replace them with something else using .fillna(). It

107
00:04:30.839 --> 00:04:32.519
<v Speaker 2>really depends on what you're trying to find out.
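
NOTE
Editor's sketch, not from the episode: the two options for missing values side by side, on made-up data. The placeholder values used with fillna are an arbitrary choice for illustration.
```python
import numpy as np
import pandas as pd
# Hypothetical posts; two entries are missing on purpose.
posts = pd.DataFrame({
    "title": ["Vaccine question", None, "Flu shot timing"],
    "ups": [10, 5, np.nan],
})
# Option 1: drop every row that has any missing value.
dropped = posts.dropna()
# Option 2: keep all rows and substitute a placeholder per column.
filled = posts.fillna({"title": "", "ups": 0})
print(len(dropped), len(filled))
```
Dropping shrinks the data set, while filling keeps every row but can distort averages, which is why the choice depends on the question being asked.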

108
00:04:33.000 --> 00:04:36.839
<v Speaker 1>So no one-size-fits-all solution, got it. So

109
00:04:37.160 --> 00:04:41.000
<v Speaker 1>how do we actually go about analyzing these Reddit conversations

110
00:04:41.040 --> 00:04:42.199
<v Speaker 1>about vaccinations?

111
00:04:42.439 --> 00:04:44.079
<v Speaker 2>Well, first we need to figure out what we're trying

112
00:04:44.120 --> 00:04:46.519
<v Speaker 2>to understand, you know, like are we trying to gauge

113
00:04:46.560 --> 00:04:50.199
<v Speaker 2>overall sentiment, or are we looking for specific themes or patterns?

114
00:04:51.319 --> 00:04:54.759
<v Speaker 1>I'm really interested in how engaged people are in these discussions,

115
00:04:54.839 --> 00:04:58.160
<v Speaker 1>like are they generally supportive or hesitant? Are there any

116
00:04:58.199 --> 00:05:01.839
<v Speaker 1>common arguments or concerns that keep popping up? Perfect. We

117
00:05:01.800 --> 00:05:03.839
<v Speaker 2>can definitely look into that. We'll need to look at

118
00:05:03.839 --> 00:05:06.160
<v Speaker 2>both the content of the posts and things like upvotes

119
00:05:05.959 --> 00:05:07.759
<v Speaker 1>And comments. Right, so we're not just looking at what

120
00:05:07.800 --> 00:05:11.279
<v Speaker 1>people are saying, but also how others are reacting.

121
00:05:10.839 --> 00:05:13.480
<v Speaker 2>To it. Exactly. And that's where the idea of central

122
00:05:13.519 --> 00:05:16.680
<v Speaker 2>tendency comes in. We can use statistical measures like the

123
00:05:16.759 --> 00:05:19.319
<v Speaker 2>mean and median to get a sense of the average engagement.

124
00:05:19.639 --> 00:05:22.600
<v Speaker 1>So like if a lot of pro vaccination posts have

125
00:05:22.720 --> 00:05:25.279
<v Speaker 1>a ton of upvotes, that might suggest there's a

126
00:05:25.319 --> 00:05:27.000
<v Speaker 1>lot of support for that viewpoint.

127
00:05:27.240 --> 00:05:29.959
<v Speaker 2>Yeah, it could, but we have to be careful about

128
00:05:30.040 --> 00:05:33.439
<v Speaker 2>jumping to conclusions. There might be other things at play. Right.

129
00:05:33.600 --> 00:05:37.279
<v Speaker 1>Correlation doesn't equal causation, so we can't just assume that

130
00:05:37.399 --> 00:05:40.040
<v Speaker 1>upvotes equal agreement, right? Exactly.

131
00:05:40.240 --> 00:05:42.560
<v Speaker 2>That's why it's so important to look at multiple factors

132
00:05:42.800 --> 00:05:44.160
<v Speaker 2>and to consider the context.

133
00:05:44.399 --> 00:05:47.680
<v Speaker 1>Okay, makes sense. But before we get too deep into analysis,

134
00:05:47.720 --> 00:05:50.759
<v Speaker 1>I'm guessing we need to narrow down this massive Reddit

135
00:05:50.839 --> 00:05:53.639
<v Speaker 1>data set to just the stuff about vaccinations.

136
00:05:53.720 --> 00:05:56.079
<v Speaker 2>Right. Yeah, we don't want to waste time sifting through

137
00:05:56.160 --> 00:05:57.279
<v Speaker 2>irrelevant posts.

138
00:05:57.439 --> 00:05:58.319
<v Speaker 1>So how do we do that?

139
00:05:58.639 --> 00:06:01.720
<v Speaker 2>Well, pandas is great for filtering data. We can create

140
00:06:01.759 --> 00:06:05.040
<v Speaker 2>a new DataFrame that only includes posts with certain

141
00:06:05.120 --> 00:06:07.600
<v Speaker 2>keywords related to vaccinations.
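
NOTE
Editor's sketch, not from the episode: keyword filtering into a new DataFrame with pandas string matching. The keyword pattern and the sample titles are assumptions chosen for illustration.
```python
import pandas as pd
posts = pd.DataFrame({
    "title": ["Are vaccines tested?", "How do magnets work?", "Vaccination schedules", None],
})
# Assumed keyword list; a real analysis would choose these more carefully.
pattern = "vaccin|immuniz"
# case=False ignores capitalization; na=False treats missing titles as non-matches.
mask = posts["title"].str.contains(pattern, case=False, na=False)
vaccine_posts = posts[mask]
print(vaccine_posts)
```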

142
00:06:07.079 --> 00:06:09.399
<v Speaker 1>Like a supercharged search function.

143
00:06:09.240 --> 00:06:11.759
<v Speaker 2>Basically pretty much. And then we can start looking at

144
00:06:11.800 --> 00:06:13.240
<v Speaker 2>those engagement metrics.

145
00:06:13.040 --> 00:06:15.959
<v Speaker 1>Right, all those upvotes and comments. How do we make

146
00:06:16.000 --> 00:06:16.839
<v Speaker 1>sense of all of that?

147
00:06:17.160 --> 00:06:20.519
<v Speaker 2>We can combine those columns into a new metric like

148
00:06:20.600 --> 00:06:25.079
<v Speaker 2>combined engagement, and then calculate the average using .mean().

149
00:06:25.600 --> 00:06:28.000
<v Speaker 2>We can also use .describe() to get a better

150
00:06:28.120 --> 00:06:31.079
<v Speaker 2>understanding of the distribution of that engagement.
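
NOTE
Editor's sketch, not from the episode: a combined engagement column with its mean, median, and summary statistics. The column names ups and num_comments and all of the numbers are made up; the last row is a deliberate outlier.
```python
import pandas as pd
posts = pd.DataFrame({"ups": [10, 20, 30, 940], "num_comments": [0, 5, 10, 60]})
# Add upvotes and comments into a single engagement score per post.
posts["combined_engagement"] = posts["ups"] + posts["num_comments"]
print(posts["combined_engagement"].mean())
print(posts["combined_engagement"].median())
# describe() reports count, mean, standard deviation, min, quartiles, and max.
print(posts["combined_engagement"].describe())
```
Here one very popular post pulls the mean far above the median, which is the kind of gap that hints at outliers.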

151
00:06:30.800 --> 00:06:33.199
<v Speaker 1>So we might see that some posts get way more

152
00:06:33.199 --> 00:06:36.720
<v Speaker 1>engagement than others, even if the average is relatively consistent.

153
00:06:36.959 --> 00:06:40.000
<v Speaker 2>Exactly, and those outliers can tell us a lot about

154
00:06:40.000 --> 00:06:41.480
<v Speaker 2>what's really driving the conversation.

155
00:06:41.600 --> 00:06:43.879
<v Speaker 1>Okay, I'm starting to see how this all comes together. Now,

156
00:06:44.240 --> 00:06:46.480
<v Speaker 1>can we switch gears and talk about Facebook for a second.

157
00:06:46.480 --> 00:06:48.759
<v Speaker 1>I know, a little off topic, but... Sure. I get

158
00:06:48.759 --> 00:06:50.959
<v Speaker 1>a little creeped out by how much data Facebook collects

159
00:06:51.000 --> 00:06:51.279
<v Speaker 1>on us.

160
00:06:51.399 --> 00:06:52.279
<v Speaker 2>Yeah, it's a lot.

161
00:06:52.360 --> 00:06:55.879
<v Speaker 1>But I did hear you can download an archive of

162
00:06:56.199 --> 00:06:57.399
<v Speaker 1>your user data.

163
00:06:57.600 --> 00:06:58.040
<v Speaker 2>That's right.

164
00:06:58.079 --> 00:06:59.279
<v Speaker 1>What kind of stuff is in there? Oh,

165
00:06:59.319 --> 00:07:02.240
<v Speaker 2>Pretty much everything: your posts, your interactions, the ads

166
00:07:02.240 --> 00:07:02.839
<v Speaker 2>you've clicked on.

167
00:07:02.920 --> 00:07:04.639
<v Speaker 1>Wait, they track what ads I click on? Yep.

168
00:07:04.879 --> 00:07:05.319
<v Speaker 2>Everything.

169
00:07:05.639 --> 00:07:08.959
<v Speaker 1>That's kind of creepy but also kind of fascinating. Yeah,

170
00:07:09.040 --> 00:07:10.759
<v Speaker 1>I'd love to know what kind of insights I could

171
00:07:10.759 --> 00:07:11.680
<v Speaker 1>get from all that data.

172
00:07:11.759 --> 00:07:13.319
<v Speaker 2>Well, that's where web scraping comes in.

173
00:07:13.399 --> 00:07:14.839
<v Speaker 1>Web scraping? What is that?

174
00:07:15.240 --> 00:07:19.560
<v Speaker 2>It's basically a way to extract specific information from websites

175
00:07:20.160 --> 00:07:20.879
<v Speaker 2>using code.

176
00:07:21.240 --> 00:07:24.079
<v Speaker 1>So you're telling me I can use code to dig

177
00:07:24.120 --> 00:07:26.560
<v Speaker 1>through my Facebook data and find out what they know

178
00:07:26.639 --> 00:07:27.000
<v Speaker 1>about me.

179
00:07:27.319 --> 00:07:28.319
<v Speaker 2>Yeah, pretty much.

180
00:07:28.399 --> 00:07:30.720
<v Speaker 1>That's both amazing and terrifying.

181
00:07:30.800 --> 00:07:31.120
<v Speaker 2>It is.

182
00:07:31.439 --> 00:07:34.040
<v Speaker 1>But before I go on a Facebook data mining spree.

183
00:07:34.079 --> 00:07:36.959
<v Speaker 1>I imagine there are some ethical considerations here, right? Oh,

184
00:07:36.959 --> 00:07:40.800
<v Speaker 2>Absolutely. One big one is the Robots Exclusion Protocol.

185
00:07:40.639 --> 00:07:42.680
<v Speaker 1>Robots Exclusion Protocol? What's that?

186
00:07:42.920 --> 00:07:45.800
<v Speaker 2>It's basically a set of rules that websites use to

187
00:07:45.839 --> 00:07:49.199
<v Speaker 2>tell web robots, which are like automated programs that browse

188
00:07:49.240 --> 00:07:51.360
<v Speaker 2>the web, which parts of the site they can and

189
00:07:51.399 --> 00:07:52.079
<v Speaker 2>can't access.

190
00:07:52.480 --> 00:07:54.920
<v Speaker 1>Okay, so it's like a digital do not enter sign

191
00:07:55.040 --> 00:07:56.600
<v Speaker 1>for bots? Pretty much.

192
00:07:57.040 --> 00:07:59.800
<v Speaker 2>And these rules are outlined in a file called

193
00:07:59.800 --> 00:08:02.319
<v Speaker 2>robots.txt. Most websites have one.

194
00:08:02.240 --> 00:08:04.120
<v Speaker 1>So if I want to scrape data from a website

195
00:08:04.120 --> 00:08:06.800
<v Speaker 1>I need to check their robots.txt file first

196
00:08:06.800 --> 00:08:08.839
<v Speaker 1>to make sure I'm not breaking any rules? Exactly.

197
00:08:09.160 --> 00:08:11.120
<v Speaker 2>It's about being respectful of those boundaries.
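
NOTE
Editor's sketch, not from the episode: checking robots.txt rules with Python's standard urllib.robotparser. The rules, the user agent name my-scraper, and the example.com URLs are made up, and the rules are parsed from text so nothing is fetched over the network.
```python
from urllib.robotparser import RobotFileParser
# A made-up set of robots.txt lines, supplied directly instead of downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = RobotFileParser()
parser.parse(rules)
# Ask whether a given user agent may fetch a given URL under these rules.
print(parser.can_fetch("my-scraper", "https://example.com/public/page"))
print(parser.can_fetch("my-scraper", "https://example.com/private/page"))
```
For a live site, the same class can load the file itself via set_url() followed by read().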

198
00:08:11.160 --> 00:08:13.319
<v Speaker 1>Makes sense. So how does this apply to scraping my

199
00:08:13.399 --> 00:08:14.319
<v Speaker 1>Facebook data?

200
00:08:14.439 --> 00:08:19.120
<v Speaker 2>Well, Facebook's robots.txt file will likely restrict scraping

201
00:08:19.240 --> 00:08:22.759
<v Speaker 2>certain types of data like user profiles or private messages.

202
00:08:22.920 --> 00:08:26.279
<v Speaker 1>So I can't just go willy nilly scraping everything on Facebook.

203
00:08:26.439 --> 00:08:28.199
<v Speaker 2>Nope. You gotta play by the rules.

204
00:08:28.199 --> 00:08:31.639
<v Speaker 1>Okay, I get it. Ethics first. But assuming I am

205
00:08:31.759 --> 00:08:34.759
<v Speaker 1>following the rules, how do I actually do this web

206
00:08:34.799 --> 00:08:36.720
<v Speaker 1>scraping thing? What kind of tools do I need?

207
00:08:36.799 --> 00:08:39.559
<v Speaker 2>Python is great for web scraping, especially when you combine it

208
00:08:39.600 --> 00:08:41.399
<v Speaker 2>with a library called Beautiful Soup.

209
00:08:41.519 --> 00:08:44.240
<v Speaker 1>Beautiful Soup, huh? Interesting name. What does it do?

210
00:08:44.519 --> 00:08:47.759
<v Speaker 2>Beautiful Soup helps you parse HTML content, which is the

211
00:08:47.799 --> 00:08:51.440
<v Speaker 2>code that structures web pages, and extract the data you need.
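
NOTE
Editor's sketch, not from the episode: parsing a small piece of HTML with Beautiful Soup and pulling out text. The HTML is a made-up stand-in, not the actual layout of a Facebook archive, and the third-party bs4 package is assumed to be installed.
```python
from bs4 import BeautifulSoup
# A tiny made-up page standing in for real HTML.
html = "<html><body><h1>Ads</h1><p>First ad</p><p>Second ad</p></body></html>"
# "html.parser" is the parser that ships with Python, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching tag; find_all() returns every match.
heading = soup.find("h1").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(heading, paragraphs)
```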

212
00:08:51.639 --> 00:08:54.519
<v Speaker 1>So it's like a digital detective, sifting through all that

213
00:08:54.600 --> 00:08:56.240
<v Speaker 1>code and finding the clues I'm looking for.

214
00:08:56.440 --> 00:08:59.279
<v Speaker 2>Exactly. It helps you make sense of the messy world

215
00:08:59.320 --> 00:08:59.919
<v Speaker 2>of web data.

216
00:09:00.480 --> 00:09:03.360
<v Speaker 1>So I could use Beautiful Soup to, say, extract all

217
00:09:03.399 --> 00:09:05.519
<v Speaker 1>the ads I've clicked on from my Facebook archive.

218
00:09:05.399 --> 00:09:07.720
<v Speaker 2>Exactly, and then you can analyze that data to see

219
00:09:07.720 --> 00:09:10.919
<v Speaker 2>what kind of patterns emerge. You might be surprised by

220
00:09:11.000 --> 00:09:11.519
<v Speaker 2>what you find.

221
00:09:11.799 --> 00:09:14.840
<v Speaker 1>This is really opening up a whole new world of possibilities.

222
00:09:15.080 --> 00:09:18.639
<v Speaker 1>But I'm realizing the data itself is only half the story, right,

223
00:09:19.159 --> 00:09:21.320
<v Speaker 1>It's what we do with it that really matters.

224
00:09:21.440 --> 00:09:24.879
<v Speaker 2>Absolutely, The real magic happens when we start interpreting that data,

225
00:09:25.360 --> 00:09:28.120
<v Speaker 2>drawing conclusions, and telling stories with it.

226
00:09:28.279 --> 00:09:31.080
<v Speaker 1>So data analysis is more than just a technical skill.

227
00:09:31.080 --> 00:09:33.200
<v Speaker 1>It's a form of storytelling.

228
00:09:32.639 --> 00:09:36.440
<v Speaker 2>Exactly, and those stories have the power to inform, inspire,

229
00:09:37.000 --> 00:09:38.320
<v Speaker 2>and even change the world.

230
00:09:38.559 --> 00:09:41.240
<v Speaker 1>I'm sold. This is way more exciting than I ever imagined.

231
00:09:41.320 --> 00:09:44.399
<v Speaker 2>Yeah, it's pretty cool stuff. So where to next? We've

232
00:09:44.399 --> 00:09:45.799
<v Speaker 2>got a lot of ground to cover still.

233
00:09:45.919 --> 00:09:49.279
<v Speaker 1>Hmm, well before we move on to something completely different,

234
00:09:49.320 --> 00:09:50.919
<v Speaker 1>I'm kind of curious about Wikipedia.

235
00:09:51.200 --> 00:09:52.559
<v Speaker 2>Okay, what about it?

236
00:09:52.559 --> 00:09:55.080
<v Speaker 1>It's such a massive source of information, right, like a

237
00:09:55.240 --> 00:09:58.679
<v Speaker 1>giant online encyclopedia. It is. I bet there are some

238
00:09:58.919 --> 00:10:01.840
<v Speaker 1>amazing stories in all that data. You're right, there are.

239
00:10:02.000 --> 00:10:04.919
<v Speaker 1>But I imagine it's also quite challenging to scrape data

240
00:10:04.919 --> 00:10:05.879
<v Speaker 1>from Wikipedia.

241
00:10:06.039 --> 00:10:08.759
<v Speaker 2>It can be. It's a very dynamic website, constantly being

242
00:10:08.840 --> 00:10:11.519
<v Speaker 2>updated and edited by volunteers all over the world.

243
00:10:11.759 --> 00:10:15.120
<v Speaker 1>So how do you even approach scraping data from something

244
00:10:15.200 --> 00:10:16.039
<v Speaker 1>like Wikipedia?

245
00:10:16.600 --> 00:10:20.000
<v Speaker 2>Patience and the right tools are key, and a good

246
00:10:20.120 --> 00:10:22.759
<v Speaker 2>understanding of Wikipedia's structure and how it works.

247
00:10:23.000 --> 00:10:26.639
<v Speaker 1>Okay, so what if let's say I wanted to compile

248
00:10:26.720 --> 00:10:30.159
<v Speaker 1>a list of all the women computer scientists listed on Wikipedia.

249
00:10:30.480 --> 00:10:34.360
<v Speaker 2>That's a great example. Wikipedia has category pages dedicated to

250
00:10:34.399 --> 00:10:38.519
<v Speaker 2>specific topics. You could start with the women computer scientists

251
00:10:38.559 --> 00:10:42.080
<v Speaker 2>category page and use web scraping to extract the names

252
00:10:42.120 --> 00:10:43.759
<v Speaker 2>of all the individuals listed there.

253
00:10:43.759 --> 00:10:45.799
<v Speaker 1>Cool, and I could even grab the links to their

254
00:10:45.799 --> 00:10:47.600
<v Speaker 1>individual Wikipedia pages.

255
00:10:47.600 --> 00:10:50.279
<v Speaker 2>Exactly. And then you could delve deeper into those pages

256
00:10:50.320 --> 00:10:54.240
<v Speaker 2>and extract even more information like their birth dates, nationalities,

257
00:10:54.519 --> 00:10:57.759
<v Speaker 2>areas of expertise. The possibilities are endless.

258
00:10:58.000 --> 00:11:00.200
<v Speaker 1>This is blowing my mind, but I imagine there

259
00:11:00.200 --> 00:11:02.480
<v Speaker 1>are some specific considerations we need to keep in mind

260
00:11:02.519 --> 00:11:03.720
<v Speaker 1>when scraping Wikipedia.

261
00:11:03.799 --> 00:11:06.080
<v Speaker 2>Oh absolutely. One of the most important is to be

262
00:11:06.120 --> 00:11:07.919
<v Speaker 2>respectful of their terms of service.

263
00:11:07.679 --> 00:11:10.360
<v Speaker 1>Right, we don't want to crash Wikipedia or anything. Exactly.

264
00:11:10.919 --> 00:11:14.120
<v Speaker 2>They have guidelines in place to prevent abuse and ensure

265
00:11:14.120 --> 00:11:16.600
<v Speaker 2>that scraping activities don't overload their servers.

266
00:11:16.919 --> 00:11:18.440
<v Speaker 1>So how can we be mindful of that?

267
00:11:18.960 --> 00:11:22.080
<v Speaker 2>Well, one simple technique is to incorporate pauses into your

268
00:11:22.120 --> 00:11:23.039
<v Speaker 2>scraping code.

269
00:11:23.240 --> 00:11:23.720
<v Speaker 1>Pauses.

270
00:11:23.960 --> 00:11:27.679
<v Speaker 2>Yeah, Python's time module has a function called sleep() that allows you

271
00:11:27.759 --> 00:11:30.679
<v Speaker 2>to pause the execution of your code for a specified

272
00:11:30.679 --> 00:11:31.320
<v Speaker 2>amount of time.
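
NOTE
Editor's sketch, not from the episode: pausing between requests with time.sleep. The URL list, the fetch function, and the delay value are hypothetical stand-ins, and no real requests are made.
```python
import time
# Hypothetical page list and a stand-in fetch function; no real requests are made.
urls = ["https://example.org/page1", "https://example.org/page2", "https://example.org/page3"]
DELAY_SECONDS = 0.1  # assumed value for the demo; real scrapers often wait a second or more
def fetch(url):
    return f"contents of {url}"
results = []
for url in urls:
    results.append(fetch(url))
    time.sleep(DELAY_SECONDS)  # pause before the next request
print(len(results))
```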

273
00:11:31.559 --> 00:11:35.039
<v Speaker 1>Ah, so we're giving Wikipedia's servers a little breather between

274
00:11:35.080 --> 00:11:36.519
<v Speaker 1>requests? Exactly.

275
00:11:37.080 --> 00:11:39.639
<v Speaker 2>This can help prevent you from sending too many requests

276
00:11:39.639 --> 00:11:42.600
<v Speaker 2>in quick succession, which could trigger their defenses and get

277
00:11:42.639 --> 00:11:43.840
<v Speaker 2>your IP address blocked.

278
00:11:44.080 --> 00:11:47.440
<v Speaker 1>Okay, so be polite, be patient, and don't overload the system.

279
00:11:47.799 --> 00:11:51.320
<v Speaker 1>Got it. But how do we actually translate our scraping

280
00:11:51.360 --> 00:11:54.759
<v Speaker 1>intentions into Python code that Beautiful Soup can understand?

281
00:11:55.399 --> 00:11:58.639
<v Speaker 2>Beautiful Soup makes it pretty intuitive to target specific HTML

282
00:11:58.679 --> 00:12:01.399
<v Speaker 2>elements on a web page. We can use those classes

283
00:12:01.399 --> 00:12:04.320
<v Speaker 2>and IDs we talked about earlier to pinpoint the exact

284
00:12:04.320 --> 00:12:07.480
<v Speaker 2>parts of a Wikipedia page that contain the information we need.
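
NOTE
Editor's sketch, not from the episode: targeting elements by id and class with Beautiful Soup to collect names and links. The markup, the id mw-pages, and the class person are made up for illustration and should not be read as Wikipedia's real page structure.
```python
from bs4 import BeautifulSoup
# Made-up markup: one section holding the list we want, one footer we want to ignore.
html = """<div id="mw-pages"><ul>
<li class="person"><a href="/wiki/Ada_Lovelace">Ada Lovelace</a></li>
<li class="person"><a href="/wiki/Grace_Hopper">Grace Hopper</a></li>
</ul></div><div id="footer"><a href="/about">About</a></div>"""
soup = BeautifulSoup(html, "html.parser")
# Narrow the search to one section by id, then pick list items by class.
section = soup.find(id="mw-pages")
rows = [(li.a.get_text(), li.a["href"]) for li in section.find_all("li", class_="person")]
print(rows)
```
Scoping the search to one section first keeps unrelated links, such as the footer link here, out of the results.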

285
00:12:07.600 --> 00:12:10.799
<v Speaker 1>So it's like giving Beautiful Soup a treasure map, guiding

286
00:12:10.879 --> 00:12:13.799
<v Speaker 1>it to the exact spot where the digital gold is buried.

287
00:12:14.039 --> 00:12:16.960
<v Speaker 2>I like that analogy. And once you've identified those key

288
00:12:17.120 --> 00:12:21.279
<v Speaker 2>HTML elements, Beautiful Soup makes it very easy to extract

289
00:12:21.279 --> 00:12:24.600
<v Speaker 2>the content within them. You can then organize that data

290
00:12:25.159 --> 00:12:27.559
<v Speaker 2>neatly into a spreadsheet for further analysis.

291
00:12:27.639 --> 00:12:30.919
<v Speaker 1>This is amazing. It's like having a superpower that allows

292
00:12:30.960 --> 00:12:33.519
<v Speaker 1>you to unlock the hidden knowledge of the Internet.

293
00:12:33.600 --> 00:12:36.799
<v Speaker 2>It's pretty powerful stuff. But remember the data itself is

294
00:12:36.799 --> 00:12:38.240
<v Speaker 2>only part of the story, right.

295
00:12:38.279 --> 00:12:41.759
<v Speaker 1>We need to analyze it, interpret it, and ultimately tell

296
00:12:41.759 --> 00:12:42.519
<v Speaker 1>a story.

297
00:12:42.240 --> 00:12:46.080
<v Speaker 2>With it. Exactly. Yeah, and those stories can be incredibly impactful.

298
00:12:46.200 --> 00:12:48.559
<v Speaker 1>This whole deep dive has been eye opening. It's like

299
00:12:48.559 --> 00:12:50.360
<v Speaker 1>I'm seeing the Internet in a whole new light.

300
00:12:50.519 --> 00:12:52.799
<v Speaker 2>I know what you mean. There's so much more to

301
00:12:52.840 --> 00:12:55.120
<v Speaker 2>it than meets the eye. But before we get too

302
00:12:55.159 --> 00:12:58.240
<v Speaker 2>carried away with web scraping, let's circle back to Google Sheets

303
00:12:58.279 --> 00:12:58.720
<v Speaker 2>for a moment.

304
00:12:58.840 --> 00:13:01.399
<v Speaker 1>Google Sheets? I thought we were moving on to more

305
00:13:01.440 --> 00:13:02.320
<v Speaker 1>advanced tools.

306
00:13:02.840 --> 00:13:06.679
<v Speaker 2>Google Sheets might seem basic, but it's actually quite powerful

307
00:13:06.759 --> 00:13:09.879
<v Speaker 2>for data analysis, especially if you're just starting out.

308
00:13:10.120 --> 00:13:13.320
<v Speaker 1>Hmm, okay, I'm intrigued. What can you do with it?

309
00:13:13.759 --> 00:13:16.320
<v Speaker 2>Well, remember those Twitter bots we talked about earlier? Google

310
00:13:16.360 --> 00:13:18.480
<v Speaker 2>Sheets is great for analyzing their activity.

311
00:13:18.600 --> 00:13:21.879
<v Speaker 1>Really? You can track bot behavior in a spreadsheet?

312
00:13:22.000 --> 00:13:25.159
<v Speaker 2>You can. You can even use it to visualize their

313
00:13:25.200 --> 00:13:28.120
<v Speaker 2>tweeting patterns and see if there are any suspicious spikes

314
00:13:28.159 --> 00:13:28.679
<v Speaker 2>or trends.

315
00:13:29.039 --> 00:13:32.000
<v Speaker 1>Wow. I never would have thought of that. So Google

316
00:13:32.000 --> 00:13:35.679
<v Speaker 1>Sheets is like a gateway drug to data analysis. It

317
00:13:35.720 --> 00:13:38.320
<v Speaker 1>helps you get a taste of what's possible before diving

318
00:13:38.360 --> 00:13:40.159
<v Speaker 1>into the more advanced tools.

319
00:13:40.320 --> 00:13:43.799
<v Speaker 2>You could say that, but even experienced data analysts often

320
00:13:43.919 --> 00:13:48.000
<v Speaker 2>use Google Sheets for quick explorations or for creating simple visualizations.

321
00:13:48.200 --> 00:13:50.279
<v Speaker 1>Right, so it's not just for beginners. Exactly.

322
00:13:50.320 --> 00:13:52.399
<v Speaker 2>It's all about choosing the right tool for the job.

323
00:13:53.039 --> 00:13:54.960
<v Speaker 2>Sometimes a simple spreadsheet is all you need.

324
00:13:55.120 --> 00:13:57.600
<v Speaker 1>Okay, I'm starting to see the appeal. What else can

325
00:13:57.639 --> 00:14:00.320
<v Speaker 1>we do with Google Sheets for data analysis?

326
00:14:00.840 --> 00:14:04.000
<v Speaker 2>Well, we could, for instance, analyze the sentiment of those

327
00:14:04.320 --> 00:14:06.480
<v Speaker 2>Reddit posts about vaccinations.

328
00:14:06.759 --> 00:14:09.840
<v Speaker 1>Sentiment? You mean, like whether people are generally positive or

329
00:14:09.840 --> 00:14:11.879
<v Speaker 1>negative about vaccinations? Exactly.

330
00:14:12.120 --> 00:14:14.279
<v Speaker 2>We can look at the words and phrases people use in

331
00:14:14.320 --> 00:14:18.440
<v Speaker 2>their posts and use Google Sheets to categorize them as positive, negative,

332
00:14:18.559 --> 00:14:19.120
<v Speaker 2>or neutral.
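
NOTE
Editor's sketch, not from the episode: the speakers describe doing this categorization in Google Sheets; below is the same idea in Python using simple keyword matching. The word lists and sample posts are assumptions, and this is far cruder than a real sentiment model.
```python
import re
# Assumed, deliberately tiny word lists; real sentiment analysis needs far more care.
POSITIVE = {"safe", "effective", "grateful"}
NEGATIVE = {"worried", "dangerous", "afraid"}
def label_sentiment(text):
    # Lowercase the text and split it into a set of distinct words.
    words = set(re.findall(r"[a-z']+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
posts = ["Vaccines are safe and effective", "I'm worried about side effects", "When is flu season?"]
labels = [label_sentiment(p) for p in posts]
print(labels)
```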

333
00:14:19.360 --> 00:14:22.840
<v Speaker 1>That sounds incredibly useful. So Google Sheets can help us

334
00:14:22.840 --> 00:14:24.799
<v Speaker 1>go beyond just the numbers and get a sense of

335
00:14:24.799 --> 00:14:27.399
<v Speaker 1>the emotional tone of the conversation. Precisely.

336
00:14:27.840 --> 00:14:30.519
<v Speaker 2>And we can even use Google Sheets to visualize those

337
00:14:30.559 --> 00:14:33.879
<v Speaker 2>sentiment trends over time, to see if there are any shifts or patterns.

338
00:14:34.159 --> 00:14:37.279
<v Speaker 1>I'm starting to see the potential here. Google Sheets might

339
00:14:37.320 --> 00:14:39.799
<v Speaker 1>not be as flashy as Python, but it's definitely a

340
00:14:39.879 --> 00:14:41.879
<v Speaker 1>versatile tool for data analysis.

341
00:14:41.960 --> 00:14:43.960
<v Speaker 2>I agree. It's a great place to start for anyone

342
00:14:44.000 --> 00:14:46.240
<v Speaker 2>who's new to data analysis, and it can be a

343
00:14:46.360 --> 00:14:49.159
<v Speaker 2>powerful tool even for experienced analysts.

344
00:14:49.200 --> 00:14:51.679
<v Speaker 1>Okay, I'm officially a Google Sheets convert.

345
00:14:51.759 --> 00:14:54.360
<v Speaker 2>Great. Now, are you ready for something a little more advanced?

346
00:14:54.519 --> 00:14:55.120
<v Speaker 1>Hit me with it.

347
00:14:55.279 --> 00:14:57.559
<v Speaker 2>Let's talk about data analysis for journalists.

348
00:14:57.919 --> 00:15:01.399
<v Speaker 1>Oooh, now this is getting interesting. I've always been fascinated

349
00:15:01.440 --> 00:15:03.799
<v Speaker 1>by the intersection of data and storytelling.

350
00:15:03.879 --> 00:15:07.919
<v Speaker 2>It's a powerful combination. Data journalism is all about using

351
00:15:08.000 --> 00:15:12.679
<v Speaker 2>data to uncover hidden truths, hold the powerful accountable, and

352
00:15:12.720 --> 00:15:13.919
<v Speaker 2>tell stories that matter.

353
00:15:14.320 --> 00:15:18.000
<v Speaker 1>So data journalists are like digital detectives, using data as

354
00:15:18.080 --> 00:15:20.639
<v Speaker 1>clues to solve mysteries and expose wrongdoing.

355
00:15:20.919 --> 00:15:24.120
<v Speaker 2>Exactly, they're using data to dig deeper, to go beyond

356
00:15:24.159 --> 00:15:25.919
<v Speaker 2>the surface, and to find the real story.

357
00:15:26.120 --> 00:15:29.799
<v Speaker 1>This is incredible. Are there any specific examples of how

358
00:15:29.960 --> 00:15:32.639
<v Speaker 1>data journalism is being used to make a difference in

359
00:15:32.679 --> 00:15:33.080
<v Speaker 1>the world?

360
00:15:33.200 --> 00:15:36.679
<v Speaker 2>Oh, there are tons. Data journalists have exposed everything from

361
00:15:36.720 --> 00:15:41.279
<v Speaker 2>corruption and fraud to environmental abuses and human rights violations.

362
00:15:41.320 --> 00:15:44.919
<v Speaker 1>Wow. So data journalism is like a superpower for journalists,

363
00:15:45.039 --> 00:15:48.480
<v Speaker 1>giving them the ability to see things that others can't.

364
00:15:48.600 --> 00:15:51.120
<v Speaker 2>You could say that. And the best part is data

365
00:15:51.159 --> 00:15:55.000
<v Speaker 2>journalism is not limited to large news organizations. Anyone with

366
00:15:55.120 --> 00:15:58.320
<v Speaker 2>access to data and the willingness to learn can use

367
00:15:58.360 --> 00:15:59.000
<v Speaker 2>these techniques.

368
00:15:59.080 --> 00:16:02.399
<v Speaker 1>So it's like a democratizing force, empowering anyone to become

369
00:16:02.399 --> 00:16:04.519
<v Speaker 1>a watchdog and hold the powerful accountable.

370
00:16:04.639 --> 00:16:07.960
<v Speaker 2>Precisely, data journalism is giving a voice to the voiceless

371
00:16:08.279 --> 00:16:10.960
<v Speaker 2>and helping to create a more informed and just society.

372
00:16:11.320 --> 00:16:14.279
<v Speaker 1>This is so inspiring. I'm starting to see the incredible

373
00:16:14.320 --> 00:16:17.120
<v Speaker 1>potential of data analysis to make a real impact in

374
00:16:17.120 --> 00:16:17.519
<v Speaker 1>the world.

375
00:16:17.639 --> 00:16:20.039
<v Speaker 2>I agree. It's a powerful tool for change, and I

376
00:16:20.080 --> 00:16:22.240
<v Speaker 2>think we're only just beginning to scratch the surface of

377
00:16:22.279 --> 00:16:22.960
<v Speaker 2>what's possible.

378
00:16:23.240 --> 00:16:25.240
<v Speaker 1>I can't wait to see what the future holds for

379
00:16:25.320 --> 00:16:28.320
<v Speaker 1>data analysis. It feels like we're on the cusp of

380
00:16:28.320 --> 00:16:30.399
<v Speaker 1>something truly transformative.

381
00:16:30.679 --> 00:16:33.159
<v Speaker 2>I think you're right. The world of data is vast

382
00:16:33.240 --> 00:16:38.559
<v Speaker 2>and ever evolving, and there are endless possibilities for exploration, discovery,

383
00:16:38.639 --> 00:16:39.360
<v Speaker 2>and impact.

384
00:16:39.519 --> 00:16:41.799
<v Speaker 1>I'm ready to dive in. This whole deep dive has

385
00:16:41.799 --> 00:16:45.000
<v Speaker 1>been a revelation. I'm feeling energized and inspired to learn

386
00:16:45.039 --> 00:16:48.039
<v Speaker 1>more and to see what stories I can uncover with data.

387
00:16:48.120 --> 00:16:51.960
<v Speaker 2>That's the spirit, and remember the journey of data analysis

388
00:16:52.080 --> 00:16:56.200
<v Speaker 2>is just as important as the destination. Embrace the challenges,

389
00:16:56.480 --> 00:16:59.919
<v Speaker 2>celebrate the victories, and never stop asking questions.

390
00:17:00.159 --> 00:17:02.159
<v Speaker 1>Wise words. Okay, I think we've covered a lot of

391
00:17:02.200 --> 00:17:04.279
<v Speaker 1>ground for this first part of our deep dive. I'm

392
00:17:04.279 --> 00:17:05.720
<v Speaker 1>excited to see where we go next.

393
00:17:05.799 --> 00:17:08.640
<v Speaker 2>Me too. Let's take a break and come back refreshed

394
00:17:08.720 --> 00:17:10.039
<v Speaker 2>for the next part of our exploration.

395
00:17:10.319 --> 00:17:12.279
<v Speaker 1>All right, so before we jump to this part, we

396
00:17:12.279 --> 00:17:14.880
<v Speaker 1>were getting into how AI is becoming a bigger and

397
00:17:14.880 --> 00:17:18.359
<v Speaker 1>bigger part of data analysis. That's super interesting and all,

398
00:17:18.400 --> 00:17:20.200
<v Speaker 1>but I also want to know more about the human

399
00:17:20.319 --> 00:17:23.319
<v Speaker 1>side of this field. Like, what kind of person thrives

400
00:17:23.359 --> 00:17:25.920
<v Speaker 1>in a data analysis role?

401
00:17:26.599 --> 00:17:29.599
<v Speaker 2>That's a great question. You know, the really successful data analysts tend to have

402
00:17:29.640 --> 00:17:33.119
<v Speaker 2>a few things in common.</v> <v Speaker 1>Okay, like what?</v> <v Speaker 2>Well, first</v>

403
00:17:33.119 --> 00:17:35.440
<v Speaker 2>of all, they're super curious.

404
00:17:35.680 --> 00:17:37.920
<v Speaker 1>Makes sense. Gotta love digging into data.

405
00:17:37.599 --> 00:17:39.880
<v Speaker 2>Right. They're always trying to figure out how things work,

406
00:17:40.200 --> 00:17:43.839
<v Speaker 2>uncover hidden patterns, find those answers to those tough questions.

407
00:17:43.960 --> 00:17:46.359
<v Speaker 1>So it's not just about the numbers. It's about asking

408
00:17:46.400 --> 00:17:48.200
<v Speaker 1>the right questions.

409
00:17:47.920 --> 00:17:50.359
<v Speaker 2>Exactly. Yeah, and figuring out how to use the data to

410
00:17:50.519 --> 00:17:53.799
<v Speaker 2>get those answers. It's a whole process of discovery.

411
00:17:53.319 --> 00:17:55.720
<v Speaker 1>That sounds way more exciting than just staring at a

412
00:17:55.720 --> 00:17:56.640
<v Speaker 1>spreadsheet all day.

413
00:17:56.759 --> 00:17:58.839
<v Speaker 2>Oh, it's definitely not just spreadsheets. You got to be

414
00:17:58.880 --> 00:18:00.039
<v Speaker 2>a creative thinker too.

415
00:18:00.160 --> 00:18:03.680
<v Speaker 1>Really? So there's an artistic side to data analysis?

416
00:18:03.759 --> 00:18:06.680
<v Speaker 2>You could say that. You need to be able to

417
00:18:06.759 --> 00:18:10.559
<v Speaker 2>see those connections that others might miss, come up with

418
00:18:10.640 --> 00:18:13.920
<v Speaker 2>new ways to solve problems, and then of course present

419
00:18:13.960 --> 00:18:15.640
<v Speaker 2>your findings in a way that makes sense.

420
00:18:15.920 --> 00:18:18.279
<v Speaker 1>Right. You can't just drown people in data. You got

421
00:18:18.279 --> 00:18:19.160
<v Speaker 1>to tell a story with it.

422
00:18:19.519 --> 00:18:23.400
<v Speaker 2>You got it. Data can be powerful, but it only

423
00:18:23.440 --> 00:18:26.759
<v Speaker 2>really matters if people understand it, and that means turning

424
00:18:26.839 --> 00:18:30.480
<v Speaker 2>it into a story that resonates with them. The best

425
00:18:30.559 --> 00:18:32.880
<v Speaker 2>data analysts they're great storytellers too.

426
00:18:33.039 --> 00:18:35.880
<v Speaker 1>Okay, that makes total sense. Take those raw numbers and

427
00:18:35.920 --> 00:18:38.920
<v Speaker 1>weave them into something that captures people's attention, something that

428
00:18:38.960 --> 00:18:39.599
<v Speaker 1>makes them care.

429
00:18:40.119 --> 00:18:43.240
<v Speaker 2>Right. It's all about finding the human connection within the data.

430
00:18:44.920 --> 00:18:47.599
<v Speaker 2>Oh, and that brings me to another important trait: empathy.

431
00:18:48.000 --> 00:18:50.839
<v Speaker 1>Empathy? That seems a little unexpected for a field that's

432
00:18:50.839 --> 00:18:51.640
<v Speaker 1>so data-driven.

433
00:18:51.920 --> 00:18:55.480
<v Speaker 2>Yeah, it might seem surprising, but it's super important. Remember,

434
00:18:55.599 --> 00:18:58.759
<v Speaker 2>data analysis isn't just about the numbers themselves. It's about

435
00:18:58.839 --> 00:19:03.519
<v Speaker 2>understanding people. Whether you're looking at customer behavior, social media trends,

436
00:19:03.640 --> 00:19:07.680
<v Speaker 2>or even healthcare data, you're ultimately dealing with human experiences.

437
00:19:08.240 --> 00:19:10.240
<v Speaker 2>And if you can put yourself in other people's shoes,

438
00:19:10.359 --> 00:19:14.440
<v Speaker 2>understand their perspectives, then you can ask better questions, you know,

439
00:19:14.759 --> 00:19:17.079
<v Speaker 2>and draw more meaningful conclusions from that data.

440
00:19:17.160 --> 00:19:18.960
<v Speaker 1>Okay, that's a really good point. So it's not just

441
00:19:19.000 --> 00:19:22.039
<v Speaker 1>a technical field. It's one where you need to understand people,

442
00:19:22.319 --> 00:19:25.599
<v Speaker 1>connect with them on a human level.</v> <v Speaker 2>Exactly. It's all</v>

443
00:19:25.640 --> 00:19:29.880
<v Speaker 2>about blending that analytical rigor with empathy, with that human

444
00:19:29.920 --> 00:19:33.440
<v Speaker 2>capacity for understanding, and when you combine those elements you

445
00:19:33.480 --> 00:19:35.200
<v Speaker 2>can achieve some truly amazing things.

446
00:19:35.279 --> 00:19:37.599
<v Speaker 1>Wow. Okay, so we talked about how Python is a

447
00:19:37.640 --> 00:19:42.079
<v Speaker 1>super powerful tool for data analysis, especially when you're dealing

448
00:19:42.079 --> 00:19:44.599
<v Speaker 1>with these massive data sets. Why don't we talk a

449
00:19:44.640 --> 00:19:47.799
<v Speaker 1>little bit more about what makes Python so great for

450
00:19:47.839 --> 00:19:48.640
<v Speaker 1>this kind of work.

451
00:19:48.920 --> 00:19:50.839
<v Speaker 2>Yeah, sure. Python is super versatile.

452
00:19:51.119 --> 00:19:54.119
<v Speaker 1>So what makes it so popular for data analysis?

453
00:19:54.240 --> 00:19:57.359
<v Speaker 2>Well, first of all, it's got this really clear, readable syntax,

454
00:19:57.880 --> 00:20:01.440
<v Speaker 2>which basically means it's relatively easy to learn and use,

455
00:20:01.839 --> 00:20:02.799
<v Speaker 2>even if you're a beginner.

456
00:20:03.000 --> 00:20:05.119
<v Speaker 1>Okay, so it's not like some super secret code that

457
00:20:05.160 --> 00:20:07.039
<v Speaker 1>only experts can understand.

458
00:20:06.640 --> 00:20:08.640
<v Speaker 2>Not at all. It almost reads like plain English, so

459
00:20:08.640 --> 00:20:10.319
<v Speaker 2>you don't spend all your time trying to figure out

460
00:20:10.319 --> 00:20:10.759
<v Speaker 2>what the code is even saying.

461
00:20:10.799 --> 00:20:12.880
<v Speaker 1>That's definitely a plus. What else?

462
00:20:13.119 --> 00:20:16.599
<v Speaker 2>Another reason Python is so great for data analysis is

463
00:20:16.640 --> 00:20:20.279
<v Speaker 2>that it's got this huge, active community of developers who

464
00:20:20.359 --> 00:20:24.960
<v Speaker 2>are always creating new libraries and tools specifically for that purpose.

465
00:20:25.240 --> 00:20:28.119
<v Speaker 1>So it's like having a global support system, a whole

466
00:20:28.160 --> 00:20:31.319
<v Speaker 1>team of data enthusiasts ready to help you out.

467
00:20:31.279 --> 00:20:34.440
<v Speaker 2>Exactly. You're not alone in this. There are tons of

468
00:20:34.480 --> 00:20:37.039
<v Speaker 2>resources out there to get you started, to help you

469
00:20:37.119 --> 00:20:39.119
<v Speaker 2>tackle any challenge you come across.

470
00:20:39.359 --> 00:20:42.599
<v Speaker 1>That's pretty awesome. And you mentioned libraries earlier.

471
00:20:42.359 --> 00:20:46.480
<v Speaker 2>Right. So libraries, they're basically like specialized toolkits for different

472
00:20:46.559 --> 00:20:47.920
<v Speaker 2>data analysis tasks.

473
00:20:48.079 --> 00:20:49.279
<v Speaker 1>So what are some examples?

474
00:20:49.359 --> 00:20:51.519
<v Speaker 2>Okay, so you've got pandas, which we've already talked about

475
00:20:51.519 --> 00:20:56.240
<v Speaker 2>a bit, for manipulating and analyzing data. NumPy is

476
00:20:56.279 --> 00:20:59.200
<v Speaker 2>great for working with numerical data, and then there are

477
00:20:59.440 --> 00:21:03.039
<v Speaker 2>libraries like Matplotlib and Seaborn, which are all about

478
00:21:03.079 --> 00:21:06.599
<v Speaker 2>creating visualizations. And of course, if you're getting into machine learning,

479
00:21:06.680 --> 00:21:07.680
<v Speaker 2>you've got scikit-learn.

480
00:21:07.920 --> 00:21:10.079
<v Speaker 1>Okay, so it's like having a whole arsenal of tools

481
00:21:10.119 --> 00:21:13.440
<v Speaker 1>at your disposal, each one designed for a specific purpose.

482
00:21:13.480 --> 00:21:15.920
<v Speaker 2>Exactly. And Python makes it super easy to use these libraries,

483
00:21:16.160 --> 00:21:18.000
<v Speaker 2>so you don't have to start from scratch every time.

484
00:21:18.200 --> 00:21:20.759
<v Speaker 1>That sounds incredibly efficient. But going back to something we

485
00:21:20.799 --> 00:21:24.599
<v Speaker 1>touched on earlier, the whole data cleaning and preparation part

486
00:21:24.720 --> 00:21:28.240
<v Speaker 1>of web scraping. Why is that so important? I mean, why

487
00:21:28.279 --> 00:21:30.720
<v Speaker 1>not just dive right into the analysis.

488
00:21:30.079 --> 00:21:34.880
<v Speaker 2>Because real world data it's messy, it's inconsistent, it's not

489
00:21:34.920 --> 00:21:37.880
<v Speaker 2>always perfect. Think of it like cooking. Before you can

490
00:21:37.920 --> 00:21:41.119
<v Speaker 2>make a delicious meal, you got to wash, chop, and

491
00:21:41.240 --> 00:21:44.240
<v Speaker 2>prep those ingredients. It's the same with data. Before you

492
00:21:44.240 --> 00:21:47.119
<v Speaker 2>can draw meaningful insights from it, you got to clean

493
00:21:47.160 --> 00:21:49.279
<v Speaker 2>it up, make sure it's accurate and consistent.

494
00:21:49.519 --> 00:21:52.200
<v Speaker 1>So data cleaning is like laying the groundwork for a

495
00:21:52.359 --> 00:21:53.559
<v Speaker 1>solid analysis.

496
00:21:53.759 --> 00:21:56.400
<v Speaker 2>Exactly. If you start with messy data, you're going to

497
00:21:56.480 --> 00:21:59.759
<v Speaker 2>get messy results. Garbage in, garbage out, as they say.

498
00:21:59.599 --> 00:22:02.799
<v Speaker 1>So, what are some common data cleaning tasks?

499
00:22:03.400 --> 00:22:07.319
<v Speaker 2>Well, one common task is handling missing values. Like let's

500
00:22:07.319 --> 00:22:10.160
<v Speaker 2>say you're working with a data set of survey responses

501
00:22:10.839 --> 00:22:13.319
<v Speaker 2>and some people just skipped certain questions.

502
00:22:13.039 --> 00:22:14.880
<v Speaker 1>Right, So there are gaps in the data.

503
00:22:14.839 --> 00:22:17.519
<v Speaker 2>Exactly, and those gaps can really skew your analysis if

504
00:22:17.519 --> 00:22:19.400
<v Speaker 2>you're not careful. So you got to figure out how

505
00:22:19.440 --> 00:22:21.759
<v Speaker 2>to deal with them. Sometimes you can just remove those rows.

506
00:22:22.079 --> 00:22:24.480
<v Speaker 2>Other times you might replace them with a default value

507
00:22:24.920 --> 00:22:27.920
<v Speaker 2>or use some statistical techniques to fill in the missing data.
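
NOTE
Editor's sketch, not part of the conversation. The three options mentioned here (remove the rows, substitute a default value, or fill with a statistic) can be illustrated in plain Python. The survey rows and the helper names drop_incomplete, fill_default, and fill_mean are hypothetical; mean imputation stands in for the unspecified "statistical techniques" and is only one possible choice.
```python
from statistics import mean
# Made-up survey rows; None marks a skipped question.
responses = [
    {"id": 1, "age": 34, "rating": 4},
    {"id": 2, "age": None, "rating": 5},
    {"id": 3, "age": 29, "rating": None},
    {"id": 4, "age": 41, "rating": 3},
]
def drop_incomplete(rows):
    """Option 1: remove any row that has a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]
def fill_default(rows, column, default):
    """Option 2: replace missing values in one column with a fixed default."""
    return [{**r, column: default if r[column] is None else r[column]} for r in rows]
def fill_mean(rows, column):
    """Option 3: fill missing values with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    return fill_default(rows, column, mean(observed))
complete = drop_incomplete(responses)
with_default = fill_default(responses, "rating", 0)
with_mean = fill_mean(responses, "age")
print(len(complete), with_default[2]["rating"], with_mean[1]["age"])
```
Each helper returns new rows and leaves the original list untouched, which makes it easy to compare strategies side by side. In pandas, DataFrame.dropna and DataFrame.fillna cover the same choices.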

508
00:22:27.960 --> 00:22:29.759
<v Speaker 1>So it's not a one-size-fits-all approach. You

509
00:22:29.799 --> 00:22:31.519
<v Speaker 1>got to be strategic about it.

510
00:22:31.599 --> 00:22:34.920
<v Speaker 2>Right. Data cleaning is all about being thoughtful and understanding your

511
00:22:35.000 --> 00:22:37.799
<v Speaker 2>data and what you're trying to achieve with your analysis.

512
00:22:38.079 --> 00:22:38.599
<v Speaker 1>Makes sense.

513
00:22:39.599 --> 00:22:42.480
<v Speaker 1>So we've talked about cleaning and preparing the data, but

514
00:22:42.640 --> 00:22:45.440
<v Speaker 1>what about actually making sense of it. I mean, we

515
00:22:45.480 --> 00:22:47.960
<v Speaker 1>can look at rows and columns of numbers all day,

516
00:22:48.079 --> 00:22:50.599
<v Speaker 1>but that doesn't necessarily tell us anything useful.

517
00:22:50.839 --> 00:22:52.519
<v Speaker 2>That's where data visualization comes in.

518
00:22:52.680 --> 00:22:56.599
<v Speaker 1>Data visualization, huh? So pretty graphs and charts?

519
00:22:56.279 --> 00:22:58.880
<v Speaker 2>Well, it's more than just making things look pretty. It's

520
00:22:58.880 --> 00:23:02.599
<v Speaker 2>about transforming that raw data into a visual format that's

521
00:23:02.640 --> 00:23:04.200
<v Speaker 2>easy to understand and interpret.

522
00:23:04.319 --> 00:23:06.440
<v Speaker 1>Okay, so, like, instead of just seeing a bunch of numbers,

523
00:23:06.519 --> 00:23:09.240
<v Speaker 1>you can actually see patterns and trends.

524
00:23:09.799 --> 00:23:13.559
<v Speaker 2>Exactly. A good visualization can help you identify trends, spot outliers,

525
00:23:13.880 --> 00:23:16.920
<v Speaker 2>and see relationships between variables that might not be obvious

526
00:23:17.000 --> 00:23:18.359
<v Speaker 2>just from looking at the raw data.

527
00:23:18.400 --> 00:23:20.799
<v Speaker 1>So it's about bringing the data to life, making it

528
00:23:20.839 --> 00:23:22.400
<v Speaker 1>more engaging, more intuitive.

529
00:23:22.680 --> 00:23:24.759
<v Speaker 2>Right, it helps you tell a story with your data,

530
00:23:24.920 --> 00:23:26.119
<v Speaker 2>you know, make it more impactful.

531
00:23:26.440 --> 00:23:29.799
<v Speaker 1>Okay, I'm sold on the power of visualization. But what

532
00:23:30.039 --> 00:23:33.200
<v Speaker 1>makes a good visualization versus a bad one? What should

533
00:23:33.240 --> 00:23:35.160
<v Speaker 1>we keep in mind when creating them?

534
00:23:35.519 --> 00:23:40.079
<v Speaker 2>A good visualization should be clear, concise, and informative. It

535
00:23:40.119 --> 00:23:44.039
<v Speaker 2>should accurately represent the data without distorting or manipulating it

536
00:23:44.079 --> 00:23:46.640
<v Speaker 2>in any way, and it should be easy to understand

537
00:23:46.720 --> 00:23:48.839
<v Speaker 2>even for someone who's not familiar with the data.

538
00:23:48.920 --> 00:23:51.319
<v Speaker 1>So no misleading graphs or anything like that. Got it.

539
00:23:51.400 --> 00:23:54.559
<v Speaker 1>But there are so many different types of visualizations out there.

540
00:23:54.599 --> 00:23:56.039
<v Speaker 1>How do you know which one to use?

541
00:23:56.359 --> 00:23:58.519
<v Speaker 2>It depends on the type of data you have and

542
00:23:58.559 --> 00:24:01.200
<v Speaker 2>what you're trying to show. Bar charts and line graphs

543
00:24:01.240 --> 00:24:04.400
<v Speaker 2>are great for showing trends over time, scatterplots are good

544
00:24:04.400 --> 00:24:08.160
<v Speaker 2>for exploring relationships between variables, pie charts are useful for

545
00:24:08.160 --> 00:24:11.119
<v Speaker 2>showing proportions, and heat maps can be great for displaying

546
00:24:11.119 --> 00:24:13.799
<v Speaker 2>more complex data in a visually intuitive way.

547
00:24:14.039 --> 00:24:16.200
<v Speaker 1>Okay, so it's all about choosing the right tool for

548
00:24:16.279 --> 00:24:19.160
<v Speaker 1>the job. But what about the audience? Do you have

549
00:24:19.240 --> 00:24:22.039
<v Speaker 1>to consider who you're creating the visualization for.

550
00:24:22.440 --> 00:24:25.160
<v Speaker 2>Absolutely, you need to think about who's going to be

551
00:24:25.160 --> 00:24:28.000
<v Speaker 2>looking at this visualization and what they need to get

552
00:24:28.039 --> 00:24:30.559
<v Speaker 2>out of it, What will resonate with them, what will

553
00:24:30.599 --> 00:24:31.880
<v Speaker 2>help them understand the data.

554
00:24:32.359 --> 00:24:34.720
<v Speaker 1>So data visualization is a bit of an art form,

555
00:24:34.759 --> 00:24:35.359
<v Speaker 1>then huh?

556
00:24:35.400 --> 00:24:40.599
<v Speaker 2>You could say that it requires a blend of technical skills, creativity,

557
00:24:40.839 --> 00:24:43.039
<v Speaker 2>and a good understanding of communication principles.

558
00:24:43.240 --> 00:24:45.400
<v Speaker 1>Okay, so now that I'm convinced of the importance of

559
00:24:45.480 --> 00:24:48.240
<v Speaker 1>data visualization, what are some of the tools that are

560
00:24:48.279 --> 00:24:51.960
<v Speaker 1>used to create these visualizations? Are there any specific Python

561
00:24:52.079 --> 00:24:53.480
<v Speaker 1>libraries that are good for this?

562
00:24:54.160 --> 00:25:00.000
<v Speaker 2>Absolutely. Python has several powerful libraries for creating amazing visualizations.

563
00:25:00.599 --> 00:25:02.839
<v Speaker 2>Matplotlib is one of the most popular and versatile, and it gives

564
00:25:02.880 --> 00:25:06.200
<v Speaker 2>you tons of options. Seaborn is another one. It builds

565
00:25:06.200 --> 00:25:09.759
<v Speaker 2>on Matplotlib and provides a higher-level interface and

566
00:25:09.839 --> 00:25:12.440
<v Speaker 2>some pre-built themes that make it really easy to

567
00:25:12.480 --> 00:25:15.200
<v Speaker 2>create beautiful, professional-looking visualizations.

568
00:25:15.279 --> 00:25:18.079
<v Speaker 1>Okay, so Matplotlib is like the foundation and Seaborn

569
00:25:18.160 --> 00:25:20.519
<v Speaker 1>is like adding those finishing touches, making it all look

570
00:25:20.599 --> 00:25:21.480
<v Speaker 1>polished and pretty.

571
00:25:21.640 --> 00:25:23.559
<v Speaker 2>Yeah, that's a good way to think about it. And

572
00:25:23.599 --> 00:25:26.720
<v Speaker 2>then there are other libraries for creating interactive visualizations,

573
00:25:26.799 --> 00:25:30.279
<v Speaker 2>3D plots, even animated charts. It's really incredible what you

574
00:25:30.319 --> 00:25:31.680
<v Speaker 2>can do with Python these days.

575
00:25:31.799 --> 00:25:33.880
<v Speaker 1>It sounds like it. Okay, so we've talked about the

576
00:25:33.880 --> 00:25:36.480
<v Speaker 1>power of data visualization, but let's shift gears for a

577
00:25:36.480 --> 00:25:39.839
<v Speaker 1>moment and talk about bias in data analysis. I know

578
00:25:39.920 --> 00:25:42.680
<v Speaker 1>we've mentioned ethical considerations before, but I'd like to dig

579
00:25:42.680 --> 00:25:46.039
<v Speaker 1>into this a little bit more. How can bias sneak

580
00:25:46.119 --> 00:25:49.039
<v Speaker 1>into data analysis? And what can we do to prevent it?

581
00:25:50.039 --> 00:25:53.359
<v Speaker 2>That's a super important question. Bias is a huge issue

582
00:25:53.359 --> 00:25:55.079
<v Speaker 2>in data analysis, and it can show up in many

583
00:25:55.079 --> 00:25:58.920
<v Speaker 2>different ways. Sometimes the data itself is biased, reflecting existing

584
00:25:58.960 --> 00:26:03.039
<v Speaker 2>societal prejudices or inequalities. Other times, the bias is introduced

585
00:26:03.119 --> 00:26:06.759
<v Speaker 2>during the data collection, the processing, or the analysis itself.

586
00:26:07.079 --> 00:26:09.279
<v Speaker 1>So the bias can either be built into the data

587
00:26:09.319 --> 00:26:12.400
<v Speaker 1>from the start, or we can accidentally introduce it ourselves.

588
00:26:12.440 --> 00:26:14.440
<v Speaker 1>That's a little scary.

589
00:26:14.680 --> 00:26:17.519
<v Speaker 2>It definitely is. It's something we need to be constantly aware of. Let's say,

590
00:26:17.559 --> 00:26:21.319
<v Speaker 2>for example, you're analyzing data on hiring practices and you

591
00:26:21.359 --> 00:26:23.519
<v Speaker 2>find that men are being hired at a much higher

592
00:26:23.599 --> 00:26:27.559
<v Speaker 2>rate than women. That discrepancy, well, it could be due

593
00:26:27.599 --> 00:26:30.720
<v Speaker 2>to actual gender discrimination, but it could also be because

594
00:26:30.720 --> 00:26:35.000
<v Speaker 2>of other factors like differences in qualifications or experience. It's

595
00:26:35.039 --> 00:26:38.640
<v Speaker 2>really important to consider all the possible explanations and not

596
00:26:38.759 --> 00:26:39.799
<v Speaker 2>jump to conclusions.
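
NOTE
Editor's sketch, not part of the conversation. One way to check an alternative explanation like the one in the hiring example is to compare hire rates overall and then again within each experience level. The applicant counts and the helper name hire_rates below are entirely made up; they are constructed so that the overall gap disappears once experience is taken into account, which real data may or may not do.
```python
from collections import defaultdict
# Made-up applicant records: (gender, experience level, hired?).
applicants = (
    [("men", "senior", True)] * 30 + [("men", "senior", False)] * 30
    + [("men", "junior", True)] * 4 + [("men", "junior", False)] * 16
    + [("women", "senior", True)] * 10 + [("women", "senior", False)] * 10
    + [("women", "junior", True)] * 12 + [("women", "junior", False)] * 48
)
def hire_rates(records, key):
    """Return the hire rate for each group produced by key(record)."""
    hired, total = defaultdict(int), defaultdict(int)
    for record in records:
        group = key(record)
        total[group] += 1
        hired[group] += record[2]  # True counts as 1, False as 0
    return {g: hired[g] / total[g] for g in total}
# Overall rates differ between the two groups in this toy data...
overall = hire_rates(applicants, key=lambda r: r[0])
# ...but within each experience level they are the same.
by_level = hire_rates(applicants, key=lambda r: (r[0], r[1]))
print(overall)
print(by_level)
```
Even when a gap vanishes within groups, the uneven split of senior and junior applicants may itself reflect bias upstream, so a breakdown like this is a prompt for more questions rather than a verdict.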

597
00:26:40.000 --> 00:26:42.960
<v Speaker 1>So we need to be critical thinkers, questioning our assumptions,

598
00:26:43.000 --> 00:26:45.039
<v Speaker 1>even when the data seems to point in a certain

599
00:26:45.079 --> 00:26:46.079
<v Speaker 1>direction.

600
00:26:46.119 --> 00:26:48.559
<v Speaker 2>Exactly. We have to be aware of our own biases, and

601
00:26:48.599 --> 00:26:50.519
<v Speaker 2>we have to be aware of the biases that might

602
00:26:50.559 --> 00:26:53.400
<v Speaker 2>already be embedded in the data, and we should always

603
00:26:53.400 --> 00:26:56.400
<v Speaker 2>try to use data from different sources, involve people from

604
00:26:56.400 --> 00:26:59.799
<v Speaker 2>different backgrounds in the analysis process. This can help mitigate

605
00:26:59.799 --> 00:27:00.720
<v Speaker 2>potential bias.

606
00:27:01.119 --> 00:27:03.519
<v Speaker 1>Makes sense. It's all about bringing a more diverse and

607
00:27:03.559 --> 00:27:07.160
<v Speaker 1>inclusive perspective to the table. Speaking of perspectives, let's talk

608
00:27:07.160 --> 00:27:10.240
<v Speaker 1>about storytelling again. You said earlier that the best data

609
00:27:10.279 --> 00:27:13.519
<v Speaker 1>analysts are also good storytellers. Can you unpack that a

610
00:27:13.519 --> 00:27:16.079
<v Speaker 1>little more for me? What does it actually mean to

611
00:27:16.119 --> 00:27:17.519
<v Speaker 1>tell a story with data?

612
00:27:17.720 --> 00:27:21.000
<v Speaker 2>It means going beyond just presenting the numbers and stats.

613
00:27:21.759 --> 00:27:25.400
<v Speaker 2>It's about weaving those numbers into a narrative that grabs

614
00:27:25.440 --> 00:27:28.799
<v Speaker 2>your audience's attention, helps them connect with the insights, and

615
00:27:29.000 --> 00:27:31.160
<v Speaker 2>inspires them to maybe even take action.

616
00:27:31.440 --> 00:27:33.759
<v Speaker 1>So you're not just presenting the facts. You're creating an

617
00:27:33.799 --> 00:27:35.279
<v Speaker 1>experience for the audience.

618
00:27:35.839 --> 00:27:39.480
<v Speaker 2>Exactly. A good data story should have a beginning, a middle,

619
00:27:39.480 --> 00:27:43.599
<v Speaker 2>and an end. It should have characters, conflict, resolution. It

620
00:27:43.640 --> 00:27:46.440
<v Speaker 2>should make people feel something, make them want to do something.

621
00:27:46.680 --> 00:27:49.519
<v Speaker 1>Wow. So it's about turning data into a form of art.

622
00:27:49.759 --> 00:27:52.039
<v Speaker 2>You could say that. It's about taking something that can

623
00:27:52.079 --> 00:27:55.480
<v Speaker 2>be dry and technical and infusing it with human emotion,

624
00:27:55.680 --> 00:27:57.279
<v Speaker 2>with meaning, with purpose.

625
00:27:57.599 --> 00:28:00.200
<v Speaker 1>That's a really powerful way to think about it. It

626
00:28:00.200 --> 00:28:03.359
<v Speaker 1>makes me appreciate the potential of data analysis even more.

627
00:28:03.480 --> 00:28:05.599
<v Speaker 2>It's pretty amazing, right, And we're seeing more and more

628
00:28:05.599 --> 00:28:08.720
<v Speaker 2>examples of data storytelling in all sorts of fields, from

629
00:28:09.119 --> 00:28:12.839
<v Speaker 2>journalism to science to even personal narratives. It's really becoming

630
00:28:12.839 --> 00:28:14.720
<v Speaker 2>a pervasive part of our culture.

631
00:28:14.920 --> 00:28:17.480
<v Speaker 1>It's like data is giving us this new language, this

632
00:28:17.559 --> 00:28:20.000
<v Speaker 1>new way to connect with each other and understand the

633
00:28:20.039 --> 00:28:21.519
<v Speaker 1>world around us.

634
00:28:21.799 --> 00:28:25.240
<v Speaker 2>Exactly. And the more we embrace this language, the more powerful

635
00:28:25.359 --> 00:28:27.240
<v Speaker 2>and impactful our stories will become.

636
00:28:27.519 --> 00:28:30.799
<v Speaker 1>So what you're saying is anyone can be a data storyteller.

637
00:28:30.960 --> 00:28:34.640
<v Speaker 2>Absolutely, It's not just for tech experts or statisticians. If

638
00:28:34.680 --> 00:28:36.960
<v Speaker 2>you have a story to tell and you're willing to

639
00:28:37.039 --> 00:28:39.960
<v Speaker 2>learn the tools, you can use data to make your

640
00:28:39.960 --> 00:28:43.839
<v Speaker 2>story more compelling, more persuasive, and more impactful.

641
00:28:44.319 --> 00:28:48.519
<v Speaker 1>Okay, I am officially inspired. Data analysis is more than

642
00:28:48.599 --> 00:28:51.000
<v Speaker 1>just a skill. It's a way to make a difference

643
00:28:51.000 --> 00:28:51.359
<v Speaker 1>in the world.

644
00:28:51.480 --> 00:28:52.279
<v Speaker 2>That's exactly it.

645
00:28:52.359 --> 00:28:54.559
<v Speaker 1>So before we wrap up this part of our deep dive,

646
00:28:54.599 --> 00:28:56.279
<v Speaker 1>I just want to touch on one last thing that

647
00:28:56.319 --> 00:28:58.799
<v Speaker 1>has really stuck with me throughout this conversation, and that's

648
00:28:59.079 --> 00:29:01.160
<v Speaker 1>the importance of human curiosity.

649
00:29:01.279 --> 00:29:04.160
<v Speaker 2>Ah. Yes, that's really the foundation of it all.

650
00:29:04.240 --> 00:29:07.279
<v Speaker 1>It's what drives us to ask those questions, to explore

651
00:29:07.359 --> 00:29:09.599
<v Speaker 1>new ideas, to seek out knowledge.

652
00:29:09.759 --> 00:29:13.960
<v Speaker 2>Curiosity is the engine that fuels the entire data analysis process.

653
00:29:14.559 --> 00:29:16.480
<v Speaker 2>Without it, it would just be a bunch of numbers

654
00:29:16.480 --> 00:29:19.119
<v Speaker 2>and formulas, devoid of any real meaning or purpose.

655
00:29:19.240 --> 00:29:22.200
<v Speaker 1>I love that. Curiosity is what transforms data into something

656
00:29:22.240 --> 00:29:24.079
<v Speaker 1>truly meaningful.

657
00:29:24.200 --> 00:29:27.359
<v Speaker 2>Exactly. It's that spark that ignites the fire, that fuels the

658
00:29:27.440 --> 00:29:31.119
<v Speaker 2>passion for discovery, and that's what makes data analysis such

659
00:29:31.119 --> 00:29:32.839
<v Speaker 2>a rewarding and exciting field.

660
00:29:33.160 --> 00:29:35.359
<v Speaker 1>Okay, I think we've covered a lot of ground in

661
00:29:35.400 --> 00:29:37.839
<v Speaker 1>this part of our deep dive. We've talked about the

662
00:29:37.920 --> 00:29:41.759
<v Speaker 1>human side of data analysis, the importance of empathy, the

663
00:29:41.799 --> 00:29:45.559
<v Speaker 1>power of storytelling, the nuances of data cleaning and preparation,

664
00:29:46.039 --> 00:29:49.599
<v Speaker 1>and of course, the enduring importance of human curiosity. But

665
00:29:49.680 --> 00:29:51.400
<v Speaker 1>before we move on to the final part, I think

666
00:29:51.400 --> 00:29:55.480
<v Speaker 1>it's also important to acknowledge that data analysis isn't perfect.

667
00:29:55.279 --> 00:29:57.279
<v Speaker 2>Right. It's important to be aware of its limitations.

668
00:29:57.519 --> 00:29:58.799
<v Speaker 1>What are some of the things we need to be

669
00:29:58.839 --> 00:29:59.559
<v Speaker 1>cautious about?

670
00:29:59.759 --> 00:30:03.240
<v Speaker 2>Well, for one, data can only tell us about the past.

671
00:30:03.599 --> 00:30:06.480
<v Speaker 2>It can't predict the future with absolute certainty.

672
00:30:06.240 --> 00:30:08.039
<v Speaker 1>So we can't treat it like a crystal ball.

673
00:30:08.480 --> 00:30:12.000
<v Speaker 2>Exactly. We can use data to identify trends and make

674
00:30:12.119 --> 00:30:15.279
<v Speaker 2>educated guesses about what might happen, but we should always

675
00:30:15.279 --> 00:30:18.519
<v Speaker 2>be careful about extrapolating too far beyond the data we have.

676
00:30:18.759 --> 00:30:21.799
<v Speaker 1>Okay, so don't get too carried away with predictions. What else?

677
00:30:22.119 --> 00:30:26.000
<v Speaker 2>Another limitation is that data is often incomplete or imperfect.

678
00:30:26.440 --> 00:30:29.400
<v Speaker 2>There might be gaps in the data, errors in data entry,

679
00:30:29.799 --> 00:30:32.440
<v Speaker 2>or biases in the way the data was collected.

680
00:30:32.400 --> 00:30:34.599
<v Speaker 1>So we can't just blindly trust the data. We need

681
00:30:34.640 --> 00:30:36.440
<v Speaker 1>to be skeptical and critical thinkers.

682
00:30:36.559 --> 00:30:38.960
<v Speaker 2>Exactly. You always have to question the data: where it

683
00:30:39.000 --> 00:30:42.000
<v Speaker 2>came from, how it was collected, and what limitations it

684
00:30:42.119 --> 00:30:46.000
<v Speaker 2>might have. And finally, it's crucial to remember that data

685
00:30:46.079 --> 00:30:49.000
<v Speaker 2>is just one piece of the puzzle. We should always

686
00:30:49.039 --> 00:30:54.519
<v Speaker 2>consider other sources of information like qualitative research, expert opinions,

687
00:30:54.799 --> 00:30:58.319
<v Speaker 2>and even our own intuition and experience. Data can be

688
00:30:58.359 --> 00:31:01.599
<v Speaker 2>a powerful tool for informing our decisions, but it shouldn't

689
00:31:01.599 --> 00:31:02.480
<v Speaker 2>be the only tool.

690
00:31:02.640 --> 00:31:05.359
<v Speaker 1>That's a really good point. We can't let data analysis

691
00:31:05.440 --> 00:31:08.480
<v Speaker 1>become a substitute for good judgment and critical thinking. It's

692
00:31:08.519 --> 00:31:11.920
<v Speaker 1>about using data to enhance our understanding, not to replace it.

693
00:31:12.039 --> 00:31:15.559
<v Speaker 2>Absolutely. Data analysis is a tool, and like any tool,

694
00:31:15.599 --> 00:31:17.799
<v Speaker 2>it needs to be used wisely and responsibly.

695
00:31:18.079 --> 00:31:20.440
<v Speaker 1>So with those limitations in mind, I'm curious to hear

696
00:31:20.480 --> 00:31:22.960
<v Speaker 1>your thoughts on the future of data analysis. Where do

697
00:31:23.000 --> 00:31:24.279
<v Speaker 1>you see this field heading?

698
00:31:24.559 --> 00:31:27.880
<v Speaker 2>That's a tough question. It's such a rapidly evolving field.

699
00:31:28.400 --> 00:31:30.400
<v Speaker 2>It's hard to say for sure what the future holds,

700
00:31:31.039 --> 00:31:34.039
<v Speaker 2>but there are definitely a few trends I'm excited about.</v> <v Speaker 1>Okay,</v>

701
00:31:34.240 --> 00:31:37.039
<v Speaker 1>like what?</v> <v Speaker 2>One trend that's already having a huge impact</v>

702
00:31:37.359 --> 00:31:41.000
<v Speaker 2>is the rise of artificial intelligence and machine learning. I

703
00:31:41.000 --> 00:31:43.599
<v Speaker 2>think we're just scratching the surface of what's possible with

704
00:31:43.640 --> 00:31:44.559
<v Speaker 2>these technologies.

705
00:31:44.759 --> 00:31:47.759
<v Speaker 1>So AI is more than just a buzzword. It's really

706
00:31:47.839 --> 00:31:50.319
<v Speaker 1>changing the game when it comes to data analysis.

707
00:31:50.359 --> 00:31:54.000
<v Speaker 2>Absolutely. AI can help us analyze these massive data sets,

708
00:31:54.440 --> 00:31:57.759
<v Speaker 2>identify patterns that we might not even see, and make predictions

709
00:31:57.759 --> 00:32:01.519
<v Speaker 2>with much greater accuracy than before. It's pretty mind-blowing.

710
00:32:01.759 --> 00:32:04.079
<v Speaker 2>But we have to remember AI is a tool, and

711
00:32:04.240 --> 00:32:06.599
<v Speaker 2>like any tool, it can be used for good or bad.

712
00:32:06.559 --> 00:32:09.680
<v Speaker 1>So it's important to think about the ethical implications

713
00:32:09.720 --> 00:32:12.680
<v Speaker 1>of AI, especially as it becomes more integrated into data

714
00:32:12.680 --> 00:32:13.839
<v Speaker 1>analysis.

715
00:32:13.920 --> 00:32:16.720
<v Speaker 2>Exactly. Another trend I'm really excited about is the growing emphasis

716
00:32:16.720 --> 00:32:17.720
<v Speaker 2>on data literacy.

717
00:32:17.839 --> 00:32:20.480
<v Speaker 1>Data literacy? So you mean, like, everyone needs to become

718
00:32:20.480 --> 00:32:21.359
<v Speaker 1>a data expert?

719
00:32:21.599 --> 00:32:25.400
<v Speaker 2>Well, not necessarily an expert, but as data becomes more

720
00:32:25.440 --> 00:32:28.079
<v Speaker 2>and more a part of our lives, it's crucial that

721
00:32:28.119 --> 00:32:32.119
<v Speaker 2>everyone has at least a basic understanding of how to

722
00:32:32.160 --> 00:32:33.920
<v Speaker 2>interpret and critically evaluate it.

723
00:32:34.160 --> 00:32:36.519
<v Speaker 1>Okay, so we need to be able to spot misinformation,

724
00:32:36.839 --> 00:32:39.599
<v Speaker 1>to understand what the data is really telling us.

725
00:32:39.640 --> 00:32:42.920
<v Speaker 2>Exactly. That's data literacy. It's about being able to think critically

726
00:32:42.920 --> 00:32:46.319
<v Speaker 2>about data and make informed decisions based on that data.

727
00:32:47.000 --> 00:32:48.759
<v Speaker 2>And lastly, I think there's going to be a huge

728
00:32:48.759 --> 00:32:51.960
<v Speaker 2>demand for data storytellers in the future, people who can

729
00:32:52.039 --> 00:32:55.200
<v Speaker 2>bridge that gap between the technical world of data and

730
00:32:55.240 --> 00:32:56.839
<v Speaker 2>the human need for meaning and connection.

731
00:32:57.079 --> 00:32:59.240
<v Speaker 1>It's not enough to just crunch the numbers. You need

732
00:32:59.279 --> 00:33:01.359
<v Speaker 1>to be able to communicate those insights in a

733
00:33:01.359 --> 00:33:02.640
<v Speaker 1>way that resonates with people.

734
00:33:02.799 --> 00:33:06.039
<v Speaker 2>You got it. Data storytelling is becoming an essential skill

735
00:33:06.039 --> 00:33:07.240
<v Speaker 2>in pretty much every field.

736
00:33:07.440 --> 00:33:10.480
<v Speaker 1>Wow, these are some exciting trends. So to wrap up

737
00:33:10.480 --> 00:33:12.240
<v Speaker 1>this part of our deep dive, I guess what you're

738
00:33:12.279 --> 00:33:15.359
<v Speaker 1>saying is that the future of data analysis is bright,

739
00:33:15.799 --> 00:33:18.519
<v Speaker 1>but it also comes with a lot of responsibility. We

740
00:33:18.599 --> 00:33:21.400
<v Speaker 1>need to be mindful of the ethical considerations, we need

741
00:33:21.400 --> 00:33:24.519
<v Speaker 1>to promote data literacy, and we need to never lose

742
00:33:24.559 --> 00:33:27.440
<v Speaker 1>sight of the human element, the power of storytelling.

743
00:33:27.640 --> 00:33:31.440
<v Speaker 2>Well said. Data analysis is a powerful tool, but it's

744
00:33:31.519 --> 00:33:35.160
<v Speaker 2>ultimately up to us as humans to use it wisely

745
00:33:35.319 --> 00:33:36.640
<v Speaker 2>to make the world a better place.

746
00:33:37.000 --> 00:33:38.839
<v Speaker 1>I love that. It's not just about the data, it's

747
00:33:38.880 --> 00:33:41.000
<v Speaker 1>about what we do with it, how we use it

748
00:33:41.039 --> 00:33:43.880
<v Speaker 1>to make a positive impact. Okay, I think that's a

749
00:33:43.880 --> 00:33:45.920
<v Speaker 1>great place to pause for now. This has been an

750
00:33:45.920 --> 00:33:47.920
<v Speaker 1>incredibly insightful conversation.

751
00:33:48.200 --> 00:33:50.480
<v Speaker 2>I agree. We've covered so much ground, and I'm really

752
00:33:50.519 --> 00:33:53.200
<v Speaker 2>looking forward to continuing our exploration in the final part

753
00:33:53.240 --> 00:33:53.880
<v Speaker 2>of our deep dive.

754
00:33:54.000 --> 00:33:55.599
<v Speaker 1>Okay, so we're back for the final part of our

755
00:33:55.680 --> 00:33:58.200
<v Speaker 1>data analysis deep dive. We've talked about so much already,

756
00:33:58.200 --> 00:34:00.960
<v Speaker 1>from the technical stuff like Python and web scraping, to

757
00:34:01.160 --> 00:34:04.079
<v Speaker 1>the ethical side of things and the art of data storytelling.

758
00:34:04.319 --> 00:34:05.599
<v Speaker 1>What else is there to explore?

759
00:34:05.799 --> 00:34:08.360
<v Speaker 2>Well, you know, we talk about data as this tool

760
00:34:08.480 --> 00:34:13.119
<v Speaker 2>for uncovering truths and telling stories, but we shouldn't forget

761
00:34:13.199 --> 00:34:16.320
<v Speaker 2>about its potential for actually solving problems, making a real

762
00:34:16.360 --> 00:34:17.119
<v Speaker 2>difference in the world.

763
00:34:17.199 --> 00:34:19.239
<v Speaker 1>That's a great point. I think we often get caught

764
00:34:19.280 --> 00:34:21.559
<v Speaker 1>up in the analytical side of things, you know, just

765
00:34:21.599 --> 00:34:24.079
<v Speaker 1>trying to understand the data. But you're right, it can

766
00:34:24.119 --> 00:34:27.880
<v Speaker 1>be used to address real world issues and actually create change.

767
00:34:28.039 --> 00:34:31.039
<v Speaker 2>Absolutely. I mean, think about the big challenges facing our

768
00:34:31.039 --> 00:34:36.800
<v Speaker 2>world today, climate change, poverty, disease. Data analysis is already

769
00:34:36.800 --> 00:34:40.880
<v Speaker 2>being used to develop solutions, track progress, and hold people accountable.

770
00:34:40.960 --> 00:34:44.679
<v Speaker 1>It's like data analysis is the bridge between information and action.

771
00:34:44.800 --> 00:34:47.159
<v Speaker 1>We can use it to understand the problems and then

772
00:34:47.239 --> 00:34:49.480
<v Speaker 1>actually do something about them.

773
00:34:49.920 --> 00:34:53.159
<v Speaker 2>Exactly. It's pretty amazing what's happening in so many fields. Organizations

774
00:34:53.199 --> 00:34:56.719
<v Speaker 2>are using data analysis to optimize energy consumption, reduce waste,

775
00:34:56.800 --> 00:34:58.559
<v Speaker 2>and develop more sustainable practices.

776
00:34:58.760 --> 00:35:01.079
<v Speaker 1>So data can help us tackle climate change head on.

777
00:35:01.199 --> 00:35:02.760
<v Speaker 1>What about other areas?

778
00:35:02.760 --> 00:35:06.679
<v Speaker 2>Well, in healthcare, data analysis is being used to identify disease

779
00:35:06.719 --> 00:35:11.079
<v Speaker 2>outbreaks early on, track the effectiveness of treatments, and even

780
00:35:11.159 --> 00:35:12.360
<v Speaker 2>personalize medicine.

781
00:35:12.360 --> 00:35:15.599
<v Speaker 1>Wow, that's incredible. It's really inspiring to see how data

782
00:35:15.599 --> 00:35:18.119
<v Speaker 1>analysis is being used to make a tangible impact in

783
00:35:18.159 --> 00:35:20.800
<v Speaker 1>the world. But I'm also curious about the future of

784
00:35:20.920 --> 00:35:24.719
<v Speaker 1>data analysis itself. What skills and knowledge do you think

785
00:35:24.719 --> 00:35:27.920
<v Speaker 1>will be most valuable in this field as it keeps evolving?

786
00:35:28.079 --> 00:35:30.480
<v Speaker 2>Hmm, that's a tough one to predict. I mean, the

787
00:35:30.519 --> 00:35:33.480
<v Speaker 2>field is constantly changing. Technical skills are always going to

788
00:35:33.480 --> 00:35:37.119
<v Speaker 2>be important, obviously, but I think those critical thinking skills,

789
00:35:37.199 --> 00:35:40.639
<v Speaker 2>problem solving and communication, those are going to become even

790
00:35:40.639 --> 00:35:41.360
<v Speaker 2>more valuable.

791
00:35:41.679 --> 00:35:43.559
<v Speaker 1>Right. It's not enough to just crunch the numbers. You've

792
00:35:43.519 --> 00:35:45.320
<v Speaker 1>got to be able to explain what they mean, what

793
00:35:45.360 --> 00:35:47.239
<v Speaker 1>the implications are.

794
00:35:47.559 --> 00:35:49.639
<v Speaker 2>Exactly. Data analysts of the future, they need to be able

795
00:35:49.719 --> 00:35:53.000
<v Speaker 2>to not only analyze data, but also interpret it, put

796
00:35:53.039 --> 00:35:55.679
<v Speaker 2>it in context, and then communicate those findings in a

797
00:35:55.679 --> 00:35:57.920
<v Speaker 2>way that's clear, compelling, and actionable.

798
00:35:58.199 --> 00:36:00.639
<v Speaker 1>So it's about being a well-rounded thinker, not just

799
00:36:00.679 --> 00:36:02.639
<v Speaker 1>a technical whiz.

800
00:36:02.960 --> 00:36:05.199
<v Speaker 2>Exactly. And I also think we're going to see a greater

801
00:36:05.280 --> 00:36:08.760
<v Speaker 2>need for data analysts who really understand ethics, who take

802
00:36:08.840 --> 00:36:12.719
<v Speaker 2>social responsibility seriously. As data becomes more and more powerful,

803
00:36:12.840 --> 00:36:14.760
<v Speaker 2>we need to make sure it is being used ethically

804
00:36:14.800 --> 00:36:15.760
<v Speaker 2>for the good of everyone.

805
00:36:15.840 --> 00:36:18.840
<v Speaker 1>Absolutely, data can be a powerful tool for good, but

806
00:36:18.920 --> 00:36:21.960
<v Speaker 1>it can also be misused. We need to make sure it's

807
00:36:22.079 --> 00:36:25.679
<v Speaker 1>being used to create a more just and equitable world,

808
00:36:25.880 --> 00:36:28.280
<v Speaker 1>not to reinforce existing inequalities.

809
00:36:28.320 --> 00:36:30.440
<v Speaker 2>I couldn't agree more. And on that note, I think

810
00:36:30.480 --> 00:36:32.760
<v Speaker 2>it's fitting that we circle back to something you mentioned earlier,

811
00:36:33.320 --> 00:36:35.320
<v Speaker 2>the importance of human curiosity.

812
00:36:35.599 --> 00:36:39.440
<v Speaker 1>Yes, curiosity is what has driven this whole conversation. Really,

813
00:36:39.719 --> 00:36:43.039
<v Speaker 1>it's what drives us to explore and learn and understand.

814
00:36:43.199 --> 00:36:46.760
<v Speaker 2>Curiosity is the fuel that powers the engine of discovery.

815
00:36:46.800 --> 00:36:50.519
<v Speaker 2>Without it, data analysis would just be a dry, technical exercise.

816
00:36:50.559 --> 00:36:52.559
<v Speaker 2>It wouldn't have that spark, that sense of wonder.

817
00:36:52.679 --> 00:36:55.360
<v Speaker 1>It's like curiosity is the secret ingredient that makes data

818
00:36:55.400 --> 00:36:57.719
<v Speaker 1>analysis so engaging, so rewarding.

819
00:36:58.039 --> 00:37:00.599
<v Speaker 2>Exactly. So, my final thought for anyone out there who's

820
00:37:00.639 --> 00:37:04.239
<v Speaker 2>interested in data analysis is this: never lose that sense

821
00:37:04.239 --> 00:37:09.079
<v Speaker 2>of wonder, never stop asking questions, and never be afraid

822
00:37:09.119 --> 00:37:10.519
<v Speaker 2>to challenge the status quo.

823
00:37:10.880 --> 00:37:13.199
<v Speaker 1>I love that, and I think it's the perfect message

824
00:37:13.199 --> 00:37:15.400
<v Speaker 1>to end on. This whole deep dive has been an

825
00:37:15.400 --> 00:37:17.880
<v Speaker 1>incredible journey. It's opened my eyes to the power of

826
00:37:17.960 --> 00:37:22.800
<v Speaker 1>data analysis, the complexities, the challenges, but also the immense possibilities.

827
00:37:23.119 --> 00:37:26.400
<v Speaker 1>I'm walking away feeling inspired and ready to dive even

828
00:37:26.440 --> 00:37:28.000
<v Speaker 1>deeper into this fascinating world.

829
00:37:28.239 --> 00:37:30.440
<v Speaker 2>It's been a pleasure exploring these ideas with you. I

830
00:37:30.440 --> 00:37:32.360
<v Speaker 2>hope you all out there listening feel the same way.

831
00:37:32.559 --> 00:37:34.440
<v Speaker 1>Thank you so much for joining us on this deep

832
00:37:34.480 --> 00:37:37.400
<v Speaker 1>dive into the world of social media data analysis. We

833
00:37:37.440 --> 00:37:40.320
<v Speaker 1>hope you've enjoyed the ride, learned something new, and maybe

834
00:37:40.440 --> 00:37:42.920
<v Speaker 1>even sparked your own curiosity about the power of data.

835
00:37:43.800 --> 00:37:46.360
<v Speaker 1>And remember, this is just the beginning. The world of

836
00:37:46.440 --> 00:37:49.039
<v Speaker 1>data is vast and ever-changing, and there are endless

837
00:37:49.079 --> 00:37:52.760
<v Speaker 1>stories waiting to be uncovered. So keep exploring, keep learning,

838
00:37:52.880 --> 00:37:57.639
<v Speaker 1>and keep asking those questions. Until next time, happy analyzing.
