WEBVTT

1
00:00:00.160 --> 00:00:02.799
<v Speaker 1>Welcome to the Deep Dive. We're the show that helps

2
00:00:02.799 --> 00:00:05.080
<v Speaker 1>you cut through the noise, taking stacks of sources and

3
00:00:05.120 --> 00:00:07.719
<v Speaker 1>finding those key insights so you can get genuinely well

4
00:00:07.719 --> 00:00:12.400
<v Speaker 1>informed fast. Today we're diving into something that feels well,

5
00:00:12.640 --> 00:00:16.039
<v Speaker 1>part magic, part engineering, maybe even a little bit detective work.

6
00:00:16.480 --> 00:00:19.280
<v Speaker 2>It's web scraping, right, the ability to basically write a

7
00:00:19.320 --> 00:00:21.760
<v Speaker 2>little program that goes out onto the Internet and gathers

8
00:00:21.839 --> 00:00:22.920
<v Speaker 2>data all by itself.

9
00:00:23.000 --> 00:00:25.359
<v Speaker 1>Yeah, and seeing it work the first time, there's this

10
00:00:25.440 --> 00:00:28.480
<v Speaker 1>real rush, like you've unlocked some secret level of the

11
00:00:28.480 --> 00:00:29.280
<v Speaker 1>web or something.

12
00:00:29.359 --> 00:00:29.920
<v Speaker 2>Definitely.

13
00:00:29.960 --> 00:00:33.200
<v Speaker 1>Our main guide today is the book Practical Web Scraping

14
00:00:33.280 --> 00:00:36.280
<v Speaker 1>for Data Science by Seppe vanden Broucke and Bart Baesens,

15
00:00:36.479 --> 00:00:38.200
<v Speaker 1>really comprehensive stuff.

16
00:00:37.520 --> 00:00:39.200
<v Speaker 2>It is. It covers a lot of ground.

17
00:00:39.399 --> 00:00:42.320
<v Speaker 1>So our mission today: to really get you up to

18
00:00:42.359 --> 00:00:45.359
<v Speaker 1>speed on what web scraping is, why it's so

19
00:00:45.479 --> 00:00:49.079
<v Speaker 1>important for data science and crucially the things you absolutely

20
00:00:49.119 --> 00:00:52.600
<v Speaker 1>need to think about technically and, maybe even more importantly, ethically.

21
00:00:52.799 --> 00:00:54.520
<v Speaker 2>Yeah, the how, but also the should you.

22
00:00:54.880 --> 00:00:57.840
<v Speaker 1>Exactly. Get ready for some aha moments, because we're going

23
00:00:57.880 --> 00:00:59.799
<v Speaker 1>to unpack how you can get a solid handle on

24
00:00:59.799 --> 00:01:04.040
<v Speaker 1>this pretty powerful skill. Okay, so let's start at the beginning.

25
00:01:04.319 --> 00:01:07.359
<v Speaker 1>What actually happens under the hood when you just type

26
00:01:07.400 --> 00:01:11.000
<v Speaker 1>say, www.google.com into your browser? Most

27
00:01:11.079 --> 00:01:13.000
<v Speaker 1>of us just hit enter, right?

28
00:01:12.879 --> 00:01:15.480
<v Speaker 2>Right, and we take it for granted. But there's this incredible

29
00:01:16.560 --> 00:01:19.959
<v Speaker 2>coordination happening invisibly, like you said, under the hood before

30
00:01:19.959 --> 00:01:22.239
<v Speaker 2>you even see anything. All these protocols are firing off.

31
00:01:22.519 --> 00:01:25.680
<v Speaker 2>DNS is translating that name into an IP address.

32
00:01:25.359 --> 00:01:27.400
<v Speaker 1>The computer's actual address.

33
00:01:27.439 --> 00:01:30.400
<v Speaker 2>Exactly. Then TCP makes sure the data gets there reliably. But

34
00:01:30.439 --> 00:01:33.239
<v Speaker 2>the layer we really care about for scraping, the actual

35
00:01:33.319 --> 00:01:35.480
<v Speaker 2>sort of language the web speaks.

36
00:01:35.799 --> 00:01:38.319
<v Speaker 1>That's HTTP, Hypertext Transfer Protocol.

37
00:01:38.400 --> 00:01:41.040
<v Speaker 2>That's the one. It's basically a plain-text conversation: the browser

38
00:01:41.079 --> 00:01:43.159
<v Speaker 2>sends a request, the server sends a response with the

39
00:01:43.159 --> 00:01:47.200
<v Speaker 2>web page content. Understanding that back and forth is fundamental.

40
00:01:47.439 --> 00:01:52.159
<v Speaker 1>Okay, so HTTP is the conversation. How do we get

41
00:01:52.159 --> 00:01:54.560
<v Speaker 1>our program to join that conversation? How do we make

42
00:01:54.599 --> 00:01:55.359
<v Speaker 1>those requests?

43
00:01:55.400 --> 00:01:58.719
<v Speaker 2>Well, that's where Python's requests library is just fantastic. You

44
00:01:58.719 --> 00:02:00.519
<v Speaker 2>can use Python's built-in stuff,

45
00:02:00.000 --> 00:02:02.640
<v Speaker 1>like urllib, but requests is nicer?

46
00:02:02.680 --> 00:02:05.719
<v Speaker 2>Oh, much nicer, way more user friendly. Think of it like

47
00:02:05.719 --> 00:02:08.039
<v Speaker 2>a really efficient messenger. You just tell it, hey, go

48
00:02:08.120 --> 00:02:11.360
<v Speaker 2>get this page using requests.get or send this

49
00:02:11.479 --> 00:02:14.599
<v Speaker 2>data with requests.post. It handles a lot of

50
00:02:14.599 --> 00:02:17.879
<v Speaker 2>the fiddly bits for you automatically, like setting standard headers

51
00:02:17.919 --> 00:02:20.520
<v Speaker 2>like User-Agent, which tells the server what kind of

52
00:02:20.520 --> 00:02:22.120
<v Speaker 2>browser you are, or are pretending to be.

53
00:02:22.080 --> 00:02:24.560
<v Speaker 1>Ah, so you can look like a normal

54
00:02:24.680 --> 00:02:25.240
<v Speaker 1>browser?

55
00:02:25.800 --> 00:02:28.560
<v Speaker 2>Pretty much. And crucially, it also lets you change those headers if

56
00:02:28.599 --> 00:02:31.479
<v Speaker 2>you need to. Sometimes servers are a bit picky about

57
00:02:31.479 --> 00:02:34.000
<v Speaker 2>who they talk to, so that flexibility is key.

58
00:02:34.159 --> 00:02:37.479
<v Speaker 1>Right, that makes sense. So requests fetches the page content.

59
00:02:37.520 --> 00:02:40.520
<v Speaker 1>But then you've got this big blob of, well, usually HTML, right?

60
00:02:40.879 --> 00:02:44.039
<v Speaker 1>And looking at raw HTML, it can be pretty intimidating,

61
00:02:44.120 --> 00:02:45.319
<v Speaker 1>all those angle brackets.

62
00:02:45.319 --> 00:02:48.919
<v Speaker 2>Oh yeah, it looks like tag soup sometimes, just a jumble.

63
00:02:49.080 --> 00:02:51.360
<v Speaker 1>So how do we find the actual data we want

64
00:02:51.439 --> 00:02:52.439
<v Speaker 1>inside that jumble?

65
00:02:52.639 --> 00:02:56.360
<v Speaker 2>That's the next piece of the puzzle. HTML, Hypertext Markup Language.

66
00:02:56.719 --> 00:02:59.680
<v Speaker 2>It looks messy, but it actually has structure. It uses

67
00:02:59.719 --> 00:03:01.599
<v Speaker 2>tags, like a for a link or div for

68
00:03:01.639 --> 00:03:05.039
<v Speaker 2>a section. Then CSS styles it. To navigate that structure

69
00:03:05.080 --> 00:03:08.000
<v Speaker 2>and pull out specific bits. We use another great library,

70
00:03:08.479 --> 00:03:12.280
<v Speaker 2>Beautiful Soup. Okay, it takes that messy HTML string and

71
00:03:12.360 --> 00:03:16.680
<v Speaker 2>turns it into this navigable Python object, like a tree

72
00:03:16.680 --> 00:03:18.520
<v Speaker 2>structure you can walk through.

73
00:03:18.520 --> 00:03:21.120
<v Speaker 1>Ah, like a family tree for the web page elements.

74
00:03:20.840 --> 00:03:23.039
<v Speaker 2>Kind of, yeah. And then you can easily say, find

75
00:03:23.039 --> 00:03:25.039
<v Speaker 2>me all the a tags or find the div with

76
00:03:25.080 --> 00:03:28.240
<v Speaker 2>this specific ID, or even use these really powerful CSS

77
00:03:28.240 --> 00:03:32.080
<v Speaker 2>selectors to pinpoint exactly the element you need based on

78
00:03:32.159 --> 00:03:33.919
<v Speaker 2>its styling or position.

79
00:03:33.840 --> 00:03:37.240
<v Speaker 1>And just building on that for you listening, your browser's developer tools

80
00:03:37.319 --> 00:03:41.960
<v Speaker 1>are like your secret weapon here. Seriously, invaluable, absolutely cannot

81
00:03:42.000 --> 00:03:44.719
<v Speaker 1>stress that enough. You hit F12 usually, and the

82
00:03:44.759 --> 00:03:47.560
<v Speaker 1>elements tab shows you that nice structured tree view of

83
00:03:47.599 --> 00:03:50.280
<v Speaker 1>the HTML that Beautiful Soup will see. You can hover

84
00:03:50.360 --> 00:03:52.240
<v Speaker 1>over stuff on the page, see the code light up.

85
00:03:52.599 --> 00:03:55.159
<v Speaker 2>Yeah, it's brilliant for figuring out what tags or what

86
00:03:55.280 --> 00:03:57.840
<v Speaker 2>CSS selectors you need to target. You can often just

87
00:03:57.879 --> 00:04:00.639
<v Speaker 2>right-click an element and copy its selector directly.

88
00:04:00.879 --> 00:04:03.599
<v Speaker 1>Just one quick tip though, remember view source shows the

89
00:04:03.680 --> 00:04:07.159
<v Speaker 1>raw HTML the server sent. The elements tab shows what

90
00:04:07.199 --> 00:04:10.280
<v Speaker 1>the browser has processed, which might include changes made by

91
00:04:10.400 --> 00:04:13.000
<v Speaker 1>JavaScript after the page loaded.

92
00:04:13.080 --> 00:04:15.360
<v Speaker 2>That's a really key distinction. Yeah, what you see in

93
00:04:15.400 --> 00:04:17.920
<v Speaker 2>elements is often closer to what you need if the

94
00:04:17.959 --> 00:04:18.839
<v Speaker 2>page is dynamic.

95
00:04:18.920 --> 00:04:21.959
<v Speaker 1>Okay, perfect segue. That covers static pages really well. But

96
00:04:22.079 --> 00:04:25.360
<v Speaker 1>what about those more complex sites, the ones that are

97
00:04:25.720 --> 00:04:29.680
<v Speaker 1>heavy on JavaScript where content loads dynamically as you scroll,

98
00:04:29.839 --> 00:04:33.199
<v Speaker 1>or maybe they set cookies using JavaScript. Our requests and

99
00:04:33.279 --> 00:04:37.120
<v Speaker 1>Beautiful Soup approach might just stop working there because they

100
00:04:37.120 --> 00:04:39.720
<v Speaker 1>aren't actually running a browser. They're just fetching the initial

101
00:04:39.839 --> 00:04:40.720
<v Speaker 1>HTML source.

102
00:04:40.959 --> 00:04:43.240
<v Speaker 2>You hit the nail on the head. That's a huge

103
00:04:43.360 --> 00:04:47.480
<v Speaker 2>challenge with modern web development. So many sites are JavaScript heavy.

104
00:04:47.879 --> 00:04:50.680
<v Speaker 2>The initial HTML might be almost empty, just a shell.

105
00:04:51.160 --> 00:04:54.480
<v Speaker 2>The actual content gets fetched and rendered by JavaScript running

106
00:04:54.480 --> 00:04:58.399
<v Speaker 2>in your browser, and sometimes that JavaScript is deliberately obfuscated,

107
00:04:58.680 --> 00:05:02.959
<v Speaker 2>made hard to read to make reverse engineering it almost impossible.

108
00:05:02.639 --> 00:05:04.959
<v Speaker 1>So you can't easily figure out where it's getting the

109
00:05:05.040 --> 00:05:06.360
<v Speaker 1>data from.

110
00:05:06.920 --> 00:05:09.120
<v Speaker 2>Exactly. Or maybe it sets a special cookie, like a nonce,

111
00:05:09.319 --> 00:05:13.000
<v Speaker 2>a security token, using JavaScript, and without that cookie you

112
00:05:13.040 --> 00:05:17.040
<v Speaker 2>can't make further requests. So if requests can't run JavaScript,

113
00:05:17.319 --> 00:05:17.920
<v Speaker 2>what do you do?

114
00:05:18.040 --> 00:05:21.360
<v Speaker 1>And that's where I guess Selenium comes into the picture.

115
00:05:21.399 --> 00:05:23.319
<v Speaker 1>It's more than just a scraper, isn't it? It's about

116
00:05:23.399 --> 00:05:24.920
<v Speaker 1>browser automation.

117
00:05:25.319 --> 00:05:28.759
<v Speaker 2>Precisely. Selenium's original purpose was actually automated testing of websites. Yeah,

118
00:05:28.839 --> 00:05:31.360
<v Speaker 2>making sure buttons work, forms submit, et cetera. But that

119
00:05:31.399 --> 00:05:34.519
<v Speaker 2>makes it incredibly powerful for scraping, because it literally drives

120
00:05:34.720 --> 00:05:37.879
<v Speaker 2>a real web browser: Chrome, Firefox, whatever you configure.

121
00:05:37.959 --> 00:05:39.639
<v Speaker 1>So it can run the JavaScript?

122
00:05:39.879 --> 00:05:43.680
<v Speaker 2>Yes, it loads the page, waits for things to appear,

123
00:05:44.279 --> 00:05:47.720
<v Speaker 2>clicks buttons, fills in forms, scrolls down the page. Anything

124
00:05:47.759 --> 00:05:50.439
<v Speaker 2>a human user can do, Selenium can automate.

125
00:05:50.800 --> 00:05:53.959
<v Speaker 1>Ah. Okay, So for those sites where content loads as

126
00:05:53.959 --> 00:05:57.279
<v Speaker 1>you scroll, like maybe infinite scrolling on social media or news sites?

127
00:05:57.079 --> 00:06:00.639
<v Speaker 2>Perfect example. requests would only get the first batch.

128
00:06:00.920 --> 00:06:04.160
<v Speaker 2>Selenium can actually perform the scroll action, wait for the

129
00:06:04.199 --> 00:06:06.839
<v Speaker 2>new content to load because the JavaScript runs, and then

130
00:06:06.879 --> 00:06:07.240
<v Speaker 2>grab it.

131
00:06:07.439 --> 00:06:10.920
<v Speaker 1>That's clever. What about waiting? Pages don't always load instantly, right?

132
00:06:10.959 --> 00:06:13.319
<v Speaker 2>Selenium has tools for that too. You can use waits,

133
00:06:13.399 --> 00:06:16.600
<v Speaker 2>telling your script, hey, wait until this specific button is clickable,

134
00:06:17.040 --> 00:06:19.759
<v Speaker 2>or wait until this piece of text appears before you

135
00:06:19.800 --> 00:06:22.399
<v Speaker 2>try to interact with it. It makes your scraper much

136
00:06:22.439 --> 00:06:25.959
<v Speaker 2>more robust against slow loading pages or dynamic elements.

137
00:06:26.040 --> 00:06:29.360
<v Speaker 1>That sounds incredibly capable, but I imagine driving a full

138
00:06:29.399 --> 00:06:32.720
<v Speaker 1>browser isn't as lightweight as just making a simple HTTP request.

139
00:06:33.120 --> 00:06:34.120
<v Speaker 1>Is there a downside?

140
00:06:34.160 --> 00:06:37.879
<v Speaker 2>Absolutely, there's a trade-off. Selenium is significantly slower and

141
00:06:38.000 --> 00:06:41.600
<v Speaker 2>uses way more memory and CPU resources than requests and

142
00:06:41.639 --> 00:06:42.639
<v Speaker 2>Beautiful Soup.

143
00:06:42.480 --> 00:06:44.480
<v Speaker 1>Because it's literally running Chrome in the background or something.

144
00:06:44.439 --> 00:06:48.720
<v Speaker 2>Exactly. You're paying for that full browser emulation.

145
00:06:49.120 --> 00:06:53.439
<v Speaker 2>So it's powerful, essential for those tricky dynamic sites. But

146
00:06:53.519 --> 00:06:56.319
<v Speaker 2>you always want to check first, can I get this

147
00:06:56.439 --> 00:07:00.600
<v Speaker 2>data with the simpler, faster requests approach? Use Selenium when you have to.

148
00:07:00.600 --> 00:07:03.480
<v Speaker 1>Okay, makes sense. Choose the right tool for the job.

149
00:07:04.600 --> 00:07:08.399
<v Speaker 1>So let's say we've figured out how to scrape one page,

150
00:07:08.399 --> 00:07:11.279
<v Speaker 1>maybe even a dynamic one with Selenium. How do we

151
00:07:11.360 --> 00:07:13.519
<v Speaker 1>scale that up? How do we go from scraping a

152
00:07:13.560 --> 00:07:17.399
<v Speaker 1>page to, well, crawling hundreds or thousands across a whole website?

153
00:07:17.399 --> 00:07:18.600
<v Speaker 1>That feels like a different beast.

154
00:07:18.720 --> 00:07:22.319
<v Speaker 2>It is. And that distinction between scraping, grabbing data from

155
00:07:22.319 --> 00:07:26.120
<v Speaker 2>a specific page, and crawling, navigating link by link to

156
00:07:26.199 --> 00:07:28.759
<v Speaker 2>discover and scrape many pages, is really important.

157
00:07:28.560 --> 00:07:31.680
<v Speaker 1>Like what search engines do, but on a smaller scale, maybe?

158
00:07:31.399 --> 00:07:34.360
<v Speaker 2>Exactly. They crawl the web constantly. For data science,

159
00:07:34.360 --> 00:07:36.199
<v Speaker 2>if you need to crawl a site, you need a

160
00:07:36.199 --> 00:07:39.759
<v Speaker 2>more structured approach. Best practices become vital. You'll almost certainly

161
00:07:39.800 --> 00:07:43.160
<v Speaker 2>want a database. Something simple like SQLite is often fine,

162
00:07:43.160 --> 00:07:45.399
<v Speaker 2>maybe using a helper library like records, to keep track

163
00:07:45.439 --> 00:07:47.519
<v Speaker 2>of everything. What kind of thing? Well, you need a

164
00:07:47.519 --> 00:07:50.000
<v Speaker 2>list of URLs you plan to visit, the crawl frontier.

165
00:07:50.399 --> 00:07:52.920
<v Speaker 2>You need a list of URLs you've already visited so

166
00:07:52.959 --> 00:07:55.240
<v Speaker 2>you don't get stuck in loops or scrape the same

167
00:07:55.279 --> 00:07:58.399
<v Speaker 2>page multiple times. And of course you need to store

168
00:07:58.439 --> 00:08:01.519
<v Speaker 2>the data you extract. It's also really good practice to

169
00:08:01.560 --> 00:08:05.240
<v Speaker 2>separate the logic. Have one part of your code responsible

170
00:08:05.279 --> 00:08:08.879
<v Speaker 2>for finding new links, the crawler, and another part responsible

171
00:08:08.879 --> 00:08:12.199
<v Speaker 2>for extracting data from a page, the scraper. That makes it

172
00:08:12.199 --> 00:08:13.279
<v Speaker 2>easier to manage.

173
00:08:13.240 --> 00:08:15.319
<v Speaker 1>And you have to be careful not to hammer the website?

174
00:08:15.560 --> 00:08:16.360
<v Speaker 2>Absolutely critical.

175
00:08:16.439 --> 00:08:18.879
<v Speaker 2>You need to build in delays or cool down periods

176
00:08:18.879 --> 00:08:21.360
<v Speaker 2>between your requests. Don't just fire them off as fast

177
00:08:21.360 --> 00:08:23.839
<v Speaker 2>as possible. You also need error handling. What if a

178
00:08:24.040 --> 00:08:27.959
<v Speaker 2>page is temporarily down? You need logic to retry later.

179
00:08:28.279 --> 00:08:30.920
<v Speaker 2>And doing things in parallel can speed it up,

180
00:08:31.079 --> 00:08:32.759
<v Speaker 2>but you have to be even more careful not to

181
00:08:32.799 --> 00:08:35.080
<v Speaker 2>overload the server then. It's a balancing act.
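
NOTE
A minimal sketch of the crawl loop described above (crawl frontier, visited list, cool-down delays, simple retries), assuming Python 3.9+ and only the standard library. The function name crawl, its parameters, and the fetch_links callback are hypothetical names chosen for illustration; a real crawler would plug in its own page-fetching and link-extraction code and would likely persist the frontier in a database rather than in memory.
```python
import time
from collections import deque
from typing import Callable, Iterable
def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    delay_seconds: float = 1.0,
    max_pages: int = 100,
    max_retries: int = 2,
) -> list[str]:
    """Breadth-first crawl; returns the URLs fetched successfully, in order."""
    frontier = deque([start_url])  # URLs we still plan to visit
    visited: set[str] = set()  # URLs already handled, so we never loop
    fetched: list[str] = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = None
        for _attempt in range(max_retries + 1):
            try:
                links = list(fetch_links(url))
                break
            except OSError:
                time.sleep(delay_seconds)  # back off before retrying
        if links is None:
            continue  # give up on this page after repeated failures
        fetched.append(url)
        frontier.extend(link for link in links if link not in visited)
        time.sleep(delay_seconds)  # cool-down between requests
    return fetched
```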

182
00:08:35.720 --> 00:08:38.759
<v Speaker 1>And you mentioned some specific tools for handling URLs.

183
00:08:39.039 --> 00:08:43.039
<v Speaker 2>Yeah, little things become important when crawling, like

184
00:08:43.080 --> 00:08:46.840
<v Speaker 2>urllib.parse.urljoin. Websites often use relative links like

185
00:08:46.919 --> 00:08:49.919
<v Speaker 2>about us. Your crawler needs to correctly combine that with

186
00:08:50.000 --> 00:08:52.600
<v Speaker 2>the base URL to get the full address. urljoin

187
00:08:52.679 --> 00:08:56.279
<v Speaker 2>handles that reliably, and urldefrag helps remove those fragment

188
00:08:56.360 --> 00:08:59.440
<v Speaker 2>identifiers, the bit after the hash, so you don't accidentally crawl

189
00:09:00.600 --> 00:09:04.320
<v Speaker 2>page.html#section1 and page.html#section2

190
00:09:04.600 --> 00:09:05.799
<v Speaker 2>as if they were different pages.
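
NOTE
A minimal sketch of the URL handling described above, using urljoin and urldefrag from Python's standard urllib.parse module. The helper name normalize_link, the example.com base URL, and the sample links are assumptions made up for illustration, not values from the episode or the book.
```python
from urllib.parse import urldefrag, urljoin
def normalize_link(base_url: str, href: str) -> str:
    """Resolve href against base_url, then drop any fragment identifier."""
    absolute = urljoin(base_url, href)
    url_without_fragment, _fragment = urldefrag(absolute)
    return url_without_fragment
# Hypothetical base page and links, for illustration only.
base = "https://example.com/docs/page.html"
print(normalize_link(base, "/about-us"))
print(normalize_link(base, "page.html#section1"))
print(normalize_link(base, "page.html#section2"))
```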

191
00:09:05.960 --> 00:09:08.879
<v Speaker 1>So why is this scaling up, this crawling capability so

192
00:09:08.960 --> 00:09:12.440
<v Speaker 1>important for you, our listeners doing data science? What doors

193
00:09:12.480 --> 00:09:13.039
<v Speaker 1>does it open?

194
00:09:13.279 --> 00:09:15.039
<v Speaker 2>It opens the door to data sets that just don't

195
00:09:15.080 --> 00:09:18.519
<v Speaker 2>exist anywhere else or aren't available in a neat packaged format.

196
00:09:19.120 --> 00:09:23.320
<v Speaker 2>The web is this enormous, constantly updated, incredibly rich source

197
00:09:23.320 --> 00:09:25.519
<v Speaker 2>of, well, mostly unstructured data.

198
00:09:25.000 --> 00:09:27.799
<v Speaker 1>A real treasure trove, if you can access it.

199
00:09:28.039 --> 00:09:30.600
<v Speaker 2>Exactly. Imagine you want to build a sentiment analysis model for

200
00:09:30.639 --> 00:09:33.600
<v Speaker 2>product reviews. You need thousands, tens of thousands of reviews.

201
00:09:33.600 --> 00:09:35.559
<v Speaker 2>Where do you get them? You crawl e-commerce sites.

202
00:09:35.559 --> 00:09:38.080
<v Speaker 1>Well, maybe tracking housing prices?

203
00:09:37.639 --> 00:09:40.960
<v Speaker 2>Perfect example. Collect real estate listings across a whole city or

204
00:09:40.960 --> 00:09:45.360
<v Speaker 2>region for analysis or visualization. We've seen amazing projects born

205
00:09:45.440 --> 00:09:48.879
<v Speaker 2>from this. Google Translate got massively better by using scraped

206
00:09:48.919 --> 00:09:51.879
<v Speaker 2>text from across the web. There was the Billion Prices

207
00:09:51.879 --> 00:09:55.480
<v Speaker 2>Project at MIT, which scraped online retailers daily to create

208
00:09:55.559 --> 00:09:59.519
<v Speaker 2>near-real-time inflation tracking, way faster than official government stats.

209
00:09:59.600 --> 00:09:59.840
<v Speaker 1>Wow.

210
00:10:00.120 --> 00:10:03.320
<v Speaker 2>Yeah. Or think about monitoring social media for mentions of

211
00:10:03.320 --> 00:10:07.080
<v Speaker 2>Bitcoin to gauge public sentiment, or analyzing job postings to

212
00:10:07.080 --> 00:10:10.039
<v Speaker 2>see which data science skills are currently in demand. All

213
00:10:10.080 --> 00:10:14.279
<v Speaker 2>rely on robust crawling. It's about turning the messy, sprawling

214
00:10:14.320 --> 00:10:19.120
<v Speaker 2>web into structured, valuable information for your data science pipeline.

215
00:10:19.200 --> 00:10:21.039
<v Speaker 1>Okay, so let's pull back a bit. Thinking about that

216
00:10:21.120 --> 00:10:24.720
<v Speaker 1>data science pipeline, maybe using a framework like CRISP-DM, where

217
00:10:24.720 --> 00:10:27.320
<v Speaker 1>does web scraping fit into the bigger picture?

218
00:10:27.600 --> 00:10:30.679
<v Speaker 2>Good question. It primarily slots into the early phases: data

219
00:10:30.840 --> 00:10:32.720
<v Speaker 2>understanding and data preparation.

220
00:10:32.480 --> 00:10:34.679
<v Speaker 1>Finding and getting the data, right?

221
00:10:34.879 --> 00:10:40.159
<v Speaker 2>Specifically, it's often part of identifying data sources, realizing the

222
00:10:40.200 --> 00:10:42.879
<v Speaker 2>web is a potential source, and then selecting the data

223
00:10:42.919 --> 00:10:46.440
<v Speaker 2>and actually collecting it. It's usually about enriching data sets

224
00:10:46.440 --> 00:10:49.080
<v Speaker 2>you already have, or maybe creating a totally new data

225
00:10:49.080 --> 00:10:51.399
<v Speaker 2>set from scratch using web data.

226
00:10:51.440 --> 00:10:53.320
<v Speaker 1>But it's not just a technical task, is it? You

227
00:10:53.399 --> 00:10:55.000
<v Speaker 1>mentioned managerial concerns?

228
00:10:55.120 --> 00:10:58.759
<v Speaker 2>Yes, and this is often underestimated. There's this crucial gap

229
00:10:58.840 --> 00:11:03.480
<v Speaker 2>between building a model using scraped data, the model train

230
00:11:03.559 --> 00:11:06.720
<v Speaker 2>phase, and actually deploying that model where it needs ongoing

231
00:11:06.759 --> 00:11:09.080
<v Speaker 2>scraped data to work, the model run phase.

232
00:11:09.519 --> 00:11:11.639
<v Speaker 1>Ah, because the website might change.

233
00:11:11.679 --> 00:11:15.600
<v Speaker 2>Exactly. Websites change all the time: layouts change, HTML structure changes,

234
00:11:15.799 --> 00:11:19.559
<v Speaker 2>login processes change. A scraper that works perfectly today might

235
00:11:19.600 --> 00:11:20.240
<v Speaker 2>break tomorrow.

236
00:11:20.360 --> 00:11:23.799
<v Speaker 1>Without warning. So your production model suddenly stops working because

237
00:11:23.960 --> 00:11:24.720
<v Speaker 1>its data feed broke.

238
00:11:24.519 --> 00:11:29.399
<v Speaker 2>Precisely, which means web scrapers require ongoing maintenance.

239
00:11:29.720 --> 00:11:31.639
<v Speaker 2>Someone has to monitor them, fix them when they break.

240
00:11:32.000 --> 00:11:34.000
<v Speaker 2>That's a real cost. And that's why the golden rule, the

241
00:11:34.000 --> 00:11:37.120
<v Speaker 2>first piece of advice, is always look for an official API first.

242
00:11:36.840 --> 00:11:41.320
<v Speaker 1>Application programming interface, a structured way for programs to

243
00:11:41.320 --> 00:11:42.039
<v Speaker 1>get data.

244
00:11:42.159 --> 00:11:46.200
<v Speaker 2>Right. If the website offers an API and it provides

245
00:11:46.240 --> 00:11:48.600
<v Speaker 2>the data you need and the terms are acceptable, maybe

246
00:11:48.639 --> 00:11:52.679
<v Speaker 2>it's free or reasonably priced, use the API. It's almost

247
00:11:52.720 --> 00:11:56.720
<v Speaker 2>always going to be more stable, more reliable, and less

248
00:11:56.799 --> 00:11:59.799
<v Speaker 2>likely to break than a custom scraper you build yourself.

249
00:12:00.159 --> 00:12:03.360
<v Speaker 1>Really solid advice. But what if there isn't an API,

250
00:12:03.600 --> 00:12:06.080
<v Speaker 1>or maybe the API exists but it's, I don't know,

251
00:12:06.159 --> 00:12:08.519
<v Speaker 1>super limited in how many requests you can make, or

252
00:12:08.559 --> 00:12:10.879
<v Speaker 1>it just doesn't have that one specific piece of data

253
00:12:10.960 --> 00:12:14.759
<v Speaker 1>you absolutely need. When does building the scraper become worth

254
00:12:14.799 --> 00:12:15.399
<v Speaker 1>the hassle?

255
00:12:15.759 --> 00:12:17.720
<v Speaker 2>That's the judgment call, isn't it? It's a trade-off.

256
00:12:17.759 --> 00:12:20.559
<v Speaker 2>If the API doesn't cut it for whatever reason, cost,

257
00:12:20.720 --> 00:12:24.639
<v Speaker 2>rate limits, missing data fields, then yeah, building and maintaining

258
00:12:24.639 --> 00:12:25.960
<v Speaker 2>a scraper might be your only option.

259
00:12:26.120 --> 00:12:29.120
<v Speaker 1>So you weigh the development and maintenance effort against the

260
00:12:29.240 --> 00:12:31.240
<v Speaker 1>value of the data.

261
00:12:31.200 --> 00:12:33.879
<v Speaker 2>Exactly. But you go into it with your eyes open, knowing

262
00:12:34.000 --> 00:12:36.759
<v Speaker 2>it's likely going to require ongoing work. It's that cat

263
00:12:36.759 --> 00:12:39.919
<v Speaker 2>and mouse game people talk about. Websites might actively try

264
00:12:39.960 --> 00:12:41.919
<v Speaker 2>to block scrapers, so you might need to adapt your

265
00:12:41.960 --> 00:12:42.960
<v Speaker 2>techniques constantly.

266
00:12:43.240 --> 00:12:48.000
<v Speaker 1>And speaking of blocking and, well, potential conflicts, the legal

267
00:12:48.080 --> 00:12:50.200
<v Speaker 1>side of this, you mentioned it's complex. It sounds like

268
00:12:50.200 --> 00:12:52.679
<v Speaker 1>it's not just a technical decision, but a legal and

269
00:12:52.720 --> 00:12:53.480
<v Speaker 1>ethical one too.

270
00:12:53.600 --> 00:12:58.360
<v Speaker 2>Absolutely, it's murky waters, legally speaking. There isn't one single

271
00:12:58.480 --> 00:13:02.039
<v Speaker 2>law that says web scraping is legal or web scraping

272
00:13:02.080 --> 00:13:06.600
<v Speaker 2>is illegal. It depends. Several legal arguments tend to pop

273
00:13:06.679 --> 00:13:09.559
<v Speaker 2>up in court cases, at least in the US. Like what? Well,

274
00:13:09.600 --> 00:13:13.120
<v Speaker 2>there's breach of terms and conditions. If a website's terms

275
00:13:13.120 --> 00:13:17.240
<v Speaker 2>of service explicitly forbid scraping and you clicked I accept somewhere,

276
00:13:17.600 --> 00:13:19.960
<v Speaker 2>they might have a case. We saw that with Ryanair

277
00:13:19.960 --> 00:13:21.519
<v Speaker 2>winning against a flight data scraper.

278
00:13:21.559 --> 00:13:23.360
<v Speaker 1>Okay, so read the terms.

279
00:13:23.679 --> 00:13:27.440
<v Speaker 2>Definitely. Then there's copyright infringement. Is the data itself copyrighted? Usually

280
00:13:27.480 --> 00:13:30.279
<v Speaker 2>facts aren't, but the presentation might be. The fair use

281
00:13:30.320 --> 00:13:33.120
<v Speaker 2>doctrine often gets debated here. Think about Google Books scanning

282
00:13:33.159 --> 00:13:37.120
<v Speaker 2>millions of books, lots of legal wrangling there. There's also

283
00:13:37.159 --> 00:13:40.879
<v Speaker 2>the CFAA, the Computer Fraud and Abuse Act. It's meant

284
00:13:40.919 --> 00:13:44.639
<v Speaker 2>to target hacking, unauthorized access. Sometimes companies try to argue

285
00:13:44.639 --> 00:13:49.000
<v Speaker 2>that scraping constitutes unauthorized access, especially if you bypass technical barriers.

286
00:13:48.720 --> 00:13:52.039
<v Speaker 1>Hmm, that seems like a stretch for public data.

287
00:13:52.240 --> 00:13:55.240
<v Speaker 2>Courts have struggled with it. There are also older concepts like

288
00:13:55.440 --> 00:13:59.759
<v Speaker 2>trespass to chattels, basically arguing your scraper is interfering with

289
00:13:59.799 --> 00:14:03.240
<v Speaker 2>their server resources, especially if you overload it. And then

290
00:14:03.279 --> 00:14:05.360
<v Speaker 2>there's the robots.txt file.

291
00:14:05.480 --> 00:14:07.600
<v Speaker 1>Right, the file that tells bots where they shouldn't go.

292
00:14:07.840 --> 00:14:11.320
<v Speaker 2>Yeah, it's not strictly legally binding in most cases, but

293
00:14:11.440 --> 00:14:15.039
<v Speaker 2>ignoring it is definitely not playing nice and signals you're

294
00:14:15.080 --> 00:14:17.759
<v Speaker 2>disregarding the site owner's wishes. It could be used as

295
00:14:17.799 --> 00:14:19.399
<v Speaker 2>evidence against you.
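
NOTE
A minimal sketch of checking robots.txt rules before fetching, using RobotFileParser from Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the user agent string are hypothetical, inlined so the sketch runs without network access; a real crawler would more likely call set_url and read on the target site's own robots.txt.
```python
from urllib.robotparser import RobotFileParser
# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
USER_AGENT = "example-research-bot"  # hypothetical identifier
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")
allowed = parser.can_fetch(USER_AGENT, "https://example.com/public/index.html")
delay = parser.crawl_delay(USER_AGENT)  # seconds requested between requests
print(blocked, allowed, delay)
```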

296
00:14:19.399 --> 00:14:23.120
<v Speaker 1>You mentioned a specific case earlier, hiQ Labs versus LinkedIn.

297
00:14:23.480 --> 00:14:26.080
<v Speaker 1>That seemed pretty important for this whole public data question.

298
00:14:26.200 --> 00:14:29.799
<v Speaker 2>It was, yeah, a really significant case. hiQ Labs

299
00:14:29.919 --> 00:14:33.600
<v Speaker 2>was scraping data from public LinkedIn profiles. LinkedIn tried to

300
00:14:33.639 --> 00:14:37.320
<v Speaker 2>stop them technologically and legally, invoking the CFAA.

301
00:14:37.440 --> 00:14:38.519
<v Speaker 1>And what did the courts say?

302
00:14:38.879 --> 00:14:42.480
<v Speaker 2>The courts, particularly the Ninth Circuit, basically ruled that scraping

303
00:14:42.519 --> 00:14:45.559
<v Speaker 2>publicly accessible data, even if the site tries to block

304
00:14:45.600 --> 00:14:48.320
<v Speaker 2>you with technical measures or says not to in its terms,

305
00:14:48.519 --> 00:14:52.919
<v Speaker 2>doesn't necessarily violate the CFAA's "without authorization" clause. The key

306
00:14:53.039 --> 00:14:54.559
<v Speaker 2>was that the data was already open to the public.

307
00:14:54.320 --> 00:14:57.480
<v Speaker 1>So if it's public, maybe it's fair game?

308
00:14:57.639 --> 00:15:01.000
<v Speaker 2>It leans that way, but it's not a blanket permission slip.

309
00:15:01.600 --> 00:15:05.440
<v Speaker 2>It highlighted just how blurry the lines are around public

310
00:15:05.480 --> 00:15:09.039
<v Speaker 2>information and unauthorized access when it comes to the web.

311
00:15:09.320 --> 00:15:11.440
<v Speaker 2>The legal landscape is definitely still evolving.

312
00:15:11.679 --> 00:15:16.360
<v Speaker 1>Okay. So with all that complexity, technical, ethical, legal, what's

313
00:15:16.399 --> 00:15:19.120
<v Speaker 1>the takeaway for you, our listener? What are your core

314
00:15:19.159 --> 00:15:22.799
<v Speaker 1>responsibilities when you decide to scrape data. What's the baseline

315
00:15:22.799 --> 00:15:24.559
<v Speaker 1>for being a good digital citizen.

316
00:15:24.799 --> 00:15:29.120
<v Speaker 2>The absolute number one rule is play nice, be respectful,

317
00:15:29.519 --> 00:15:32.679
<v Speaker 2>don't bombard a website with requests. Think about the impact

318
00:15:32.720 --> 00:15:33.120
<v Speaker 2>you're having.

319
00:15:33.159 --> 00:15:35.440
<v Speaker 1>Don't be the reason their site goes down.

320
00:15:35.559 --> 00:15:38.120
<v Speaker 2>Exactly. That can cause real financial damage, and that's when legal

321
00:15:38.120 --> 00:15:41.080
<v Speaker 2>action becomes much more likely. We saw QVC sue a

322
00:15:41.120 --> 00:15:45.240
<v Speaker 2>company called Resultly, claiming excessive scraping caused outages costing millions.

323
00:15:45.399 --> 00:15:48.240
<v Speaker 2>You don't want to be that person, so throttle your requests,

324
00:15:48.480 --> 00:15:51.639
<v Speaker 2>put delays in. Identify yourself with a proper User-Agent

325
00:15:51.679 --> 00:15:54.960
<v Speaker 2>header if you can, maybe even include contact info,

326
00:15:55.200 --> 00:16:00.000
<v Speaker 2>if appropriate. Yes, always, always check the robots.txt

327
00:16:00.080 --> 00:16:02.840
<v Speaker 2>file and the terms of service first. See what

328
00:16:02.879 --> 00:16:06.320
<v Speaker 2>the site owner explicitly asks for or forbids, and if

329
00:16:06.360 --> 00:16:09.519
<v Speaker 2>you can, the absolute safest route is to get written

330
00:16:09.519 --> 00:16:10.799
<v Speaker 2>permission from the website owner.

331
00:16:10.960 --> 00:16:13.360
<v Speaker 1>That might not always be practical, but it's the ideal.

332
00:16:13.399 --> 00:16:16.279
<v Speaker 2>It's the gold standard. Yeah, and just pause and think:

333
00:16:17.519 --> 00:16:20.480
<v Speaker 2>is this data truly intended to be public and consumed

334
00:16:20.480 --> 00:16:23.639
<v Speaker 2>in this way? Or am I accessing something private or

335
00:16:23.639 --> 00:16:26.600
<v Speaker 2>trying to circumvent a system? Using common sense and acting

336
00:16:26.639 --> 00:16:29.440
<v Speaker 2>ethically is just as important as writing clever code.

337
00:16:29.639 --> 00:16:32.200
<v Speaker 1>Wow, okay, that was definitely a deep dive. We've gone

338
00:16:32.240 --> 00:16:34.639
<v Speaker 1>from the basics of HTTP.

339
00:16:34.399 --> 00:16:36.759
<v Speaker 2>To parsing HTML with Beautiful Soup.

340
00:16:36.679 --> 00:16:38.679
<v Speaker 1>Tackling tricky JavaScript with Selenium.

341
00:16:38.720 --> 00:16:40.279
<v Speaker 2>Scaling up with crawling techniques.

342
00:16:40.360 --> 00:16:43.600
<v Speaker 1>And wrestling with those really crucial management and legal questions.

343
00:16:44.240 --> 00:16:46.639
<v Speaker 1>I think you listening should now have a much clearer

344
00:16:46.679 --> 00:16:49.480
<v Speaker 1>picture not just of the how of web scraping, but the

345
00:16:49.559 --> 00:16:52.639
<v Speaker 1>really important why and when and when not to.

346
00:16:53.120 --> 00:16:57.519
<v Speaker 2>Yeah, it's about using this powerful tool effectively but also responsibly.

347
00:16:57.679 --> 00:17:00.399
<v Speaker 1>Absolutely. So here's a final thought to leave you with,

348
00:17:00.759 --> 00:17:03.720
<v Speaker 1>something to chew on. We've talked about this cat and

349
00:17:03.799 --> 00:17:07.799
<v Speaker 1>mouse game between scrapers and websites and the shifting legal sands.

350
00:17:08.640 --> 00:17:12.240
<v Speaker 1>Considering how fast things like AI are evolving and maybe

351
00:17:12.359 --> 00:17:16.480
<v Speaker 1>new techniques for hiding or protecting data online, how might

352
00:17:16.519 --> 00:17:20.039
<v Speaker 1>our very definition of publicly available data change in the

353
00:17:20.079 --> 00:17:20.799
<v Speaker 1>next few years?

354
00:17:21.240 --> 00:17:22.920
<v Speaker 2>That's a big question.

355
00:17:23.039 --> 00:17:25.480
<v Speaker 1>Right. And what could that changing definition mean for how we

356
00:17:25.519 --> 00:17:28.160
<v Speaker 1>gather data, how we do analysis, and maybe even the

357
00:17:28.240 --> 00:17:30.960
<v Speaker 1>kinds of innovation that are possible across, well, pretty much

358
00:17:30.960 --> 00:17:31.480
<v Speaker 1>every field?

359
00:17:31.599 --> 00:17:34.880
<v Speaker 2>Yeah, what does public mean when data is generated by

360
00:17:34.920 --> 00:17:38.880
<v Speaker 2>AI or locked behind complex interactions? Lots to think about there.

361
00:17:38.920 --> 00:17:40.960
<v Speaker 1>Definitely something to ponder. We'll leave it there for

362
00:17:40.960 --> 00:17:41.599
<v Speaker 1>this deep dive.
