WEBVTT 1 00:00:00.160 --> 00:00:02.799 Welcome to the Deep Dive. We're the show that helps 2 00:00:02.799 --> 00:00:05.080 you cut through the noise, taking stacks of sources and 3 00:00:05.120 --> 00:00:07.719 finding those key insights so you can get genuinely well 4 00:00:07.719 --> 00:00:12.400 informed fast. Today we're diving into something that feels, well, 5 00:00:12.640 --> 00:00:16.039 part magic, part engineering, maybe even a little bit detective work. 6 00:00:16.480 --> 00:00:19.280 It's web scraping. Right, the ability to basically write a 7 00:00:19.320 --> 00:00:21.760 little program that goes out onto the Internet and gathers 8 00:00:21.839 --> 00:00:22.920 data all by itself. 9 00:00:23.000 --> 00:00:25.359 Yeah, and seeing it work the first time, there's this 10 00:00:25.440 --> 00:00:28.480 real rush, like you've unlocked some secret level of the 11 00:00:28.480 --> 00:00:29.280 web or something. 12 00:00:29.359 --> 00:00:29.920 Definitely. 13 00:00:29.960 --> 00:00:33.200 Our main guide today is the book Practical Web Scraping 14 00:00:33.280 --> 00:00:36.280 for Data Science by Seppe vanden Broucke and Bart Baesens. 15 00:00:36.479 --> 00:00:38.200 Really comprehensive 16 00:00:37.520 --> 00:00:39.200 stuff. It is. It covers a lot of ground. 17 00:00:39.399 --> 00:00:42.320 So our mission today: to really get you up to 18 00:00:42.359 --> 00:00:45.359 speed on what web scraping is, why it's so 19 00:00:45.479 --> 00:00:49.079 important for data science, and crucially the things you absolutely 20 00:00:49.119 --> 00:00:52.600 need to think about technically and, maybe even more importantly, ethically. 21 00:00:52.799 --> 00:00:54.520 Yeah, the how, but also the should you. 22 00:00:54.880 --> 00:00:57.840 Exactly. Get ready for some aha moments, because we're going 23 00:00:57.880 --> 00:00:59.799 to unpack how you can get a solid handle on 24 00:00:59.799 --> 00:01:04.040 this pretty powerful skill. Okay, so let's start at the beginning.
25 00:01:04.319 --> 00:01:07.359 What actually happens under the hood when you just type, 26 00:01:07.400 --> 00:01:11.000 say, www.google.com into your browser? Most 27 00:01:11.079 --> 00:01:13.000 of us just hit enter, right? Right, 28 00:01:12.879 --> 00:01:15.480 and we take it for granted. But there's this incredible 29 00:01:16.560 --> 00:01:19.959 coordination happening invisibly, like you said, under the hood. Before 30 00:01:19.959 --> 00:01:22.239 you even see anything, all these protocols are firing off. 31 00:01:22.519 --> 00:01:25.680 DNS is translating that name into an IP address. 32 00:01:25.359 --> 00:01:27.400 The computer's actual address. Exactly. 33 00:01:27.439 --> 00:01:30.400 Then TCP makes sure the data gets there reliably. But 34 00:01:30.439 --> 00:01:33.239 the layer we really care about for scraping, the actual 35 00:01:33.319 --> 00:01:35.480 sort of language the web speaks? 36 00:01:35.799 --> 00:01:38.319 That's HTTP, Hypertext Transfer Protocol. 37 00:01:38.400 --> 00:01:41.040 That's the one. It's basically a plaintext conversation: the browser 38 00:01:41.079 --> 00:01:43.159 sends a request, the server sends a response with the 39 00:01:43.159 --> 00:01:47.200 web page content. Understanding that back and forth is fundamental. 40 00:01:47.439 --> 00:01:52.159 Okay, so HTTP is the conversation. How do we get 41 00:01:52.159 --> 00:01:54.560 our program to join that conversation? How do we make 42 00:01:54.599 --> 00:01:55.359 those requests? 43 00:01:55.400 --> 00:01:58.719 Well, that's where Python's requests library is just fantastic. You 44 00:01:58.719 --> 00:02:00.519 can use Python's built-in stuff, 45 00:02:00.000 --> 00:02:02.640 urllib. But requests is nicer? Oh, 46 00:02:02.680 --> 00:02:05.719 much nicer, way more user friendly. Think of it like 47 00:02:05.719 --> 00:02:08.039 a really efficient messenger.
You just tell it, hey, go 48 00:02:08.120 --> 00:02:11.360 get this page using requests.get, or send this 49 00:02:11.479 --> 00:02:14.599 data with requests.post. It handles a lot of 50 00:02:14.599 --> 00:02:17.879 the fiddly bits for you automatically, like setting standard headers 51 00:02:17.919 --> 00:02:20.520 like User-Agent, which tells the server what kind of 52 00:02:20.520 --> 00:02:22.120 browser you are, or are pretending to 53 00:02:22.080 --> 00:02:24.560 be. Ah, so you can look like a normal browser? 54 00:02:24.680 --> 00:02:25.240 Pretty much. 55 00:02:25.800 --> 00:02:28.560 And crucially, it also lets you change those headers if 56 00:02:28.599 --> 00:02:31.479 you need to. Sometimes servers are a bit picky about 57 00:02:31.479 --> 00:02:34.000 who they talk to, so that flexibility is key. 58 00:02:34.159 --> 00:02:37.479 Right, that makes sense. So requests fetches the page content. 59 00:02:37.520 --> 00:02:40.520 But then you've got this big blob of, well, usually HTML, right? 60 00:02:40.879 --> 00:02:44.039 And looking at raw HTML, it can be pretty intimidating, 61 00:02:44.120 --> 00:02:45.319 all those angle brackets. 62 00:02:45.319 --> 00:02:48.919 Oh yeah, it looks like tag soup sometimes, just a jumble. 63 00:02:49.080 --> 00:02:51.360 So how do we find the actual data we want 64 00:02:51.439 --> 00:02:52.439 inside that jumble? 65 00:02:52.639 --> 00:02:56.360 That's the next piece of the puzzle. HTML, HyperText Markup Language. 66 00:02:56.719 --> 00:02:59.680 It looks messy, but it actually has structure. It uses 67 00:02:59.719 --> 00:03:01.599 tags, like a for a link or div for 68 00:03:01.639 --> 00:03:05.039 a section. Then CSS styles it. To navigate that structure 69 00:03:05.080 --> 00:03:08.000 and pull out specific bits, we use another great library, 70 00:03:08.479 --> 00:03:12.280 Beautiful Soup.
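The request-and-headers idea from a moment ago can be sketched in a few lines. This is a minimal illustration, not the hosts' code: it uses the standard library's `urllib.request` instead of requests so it runs with nothing installed, it only builds the request and never sends it, and the URL and the `my-scraper/0.1` User-Agent string are made-up placeholders.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build (but do not send) a GET request with a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent}, method="GET")


def as_http_text(req):
    """Render roughly the plain-text request line and headers HTTP/1.1 would carry."""
    path = req.selector or "/"
    lines = [f"{req.get_method()} {path} HTTP/1.1", f"Host: {req.host}"]
    # urllib stores header names capitalised as "User-agent";
    # HTTP header names are case-insensitive, so servers treat it the same.
    lines += [f"{name}: {value}" for name, value in req.header_items()]
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    # Hypothetical URL and User-Agent, for illustration only.
    print(as_http_text(build_request("http://www.example.com/", "my-scraper/0.1")))
```

With the requests library the same override is a keyword argument: `requests.get(url, headers={"User-Agent": "my-scraper/0.1"})`.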
Okay. It takes that messy HTML string and 71 00:03:12.360 --> 00:03:16.680 turns it into this navigable Python object, like a tree 72 00:03:16.680 --> 00:03:18.520 structure you can walk through. Ah, 73 00:03:18.520 --> 00:03:21.120 like a family tree for the web page elements. 74 00:03:20.840 --> 00:03:23.039 Kind of, yeah. And then you can easily say, find 75 00:03:23.039 --> 00:03:25.039 me all the a tags, or find the div with 76 00:03:25.080 --> 00:03:28.240 this specific ID, or even use these really powerful CSS 77 00:03:28.240 --> 00:03:32.080 selectors to pinpoint exactly the element you need based on 78 00:03:32.159 --> 00:03:33.919 its class, ID, or position. And just 79 00:03:33.840 --> 00:03:37.240 building on that, for you listening, your browser's developer tools 80 00:03:37.319 --> 00:03:41.960 are like your secret weapon here. Seriously invaluable. Absolutely, cannot 81 00:03:42.000 --> 00:03:44.719 stress that enough. You hit F12 usually, and the 82 00:03:44.759 --> 00:03:47.560 Elements tab shows you that nice structured tree view of 83 00:03:47.599 --> 00:03:50.280 the HTML that Beautiful Soup will see. You can hover 84 00:03:50.360 --> 00:03:52.240 over stuff on the page, see the code light up. 85 00:03:52.599 --> 00:03:55.159 Yeah, it's brilliant for figuring out what tags or what 86 00:03:55.280 --> 00:03:57.840 CSS selectors you need to target. You can often just 87 00:03:57.879 --> 00:04:00.639 right-click an element and copy its selector directly. 88 00:04:00.879 --> 00:04:03.599 Just one quick tip though: remember, View Source shows the 89 00:04:03.680 --> 00:04:07.159 raw HTML the server sent. The Elements tab shows what 90 00:04:07.199 --> 00:04:10.280 the browser has processed, which might include changes made by 91 00:04:10.400 --> 00:04:13.000 JavaScript after the page loaded. 92 00:04:13.080 --> 00:04:15.360 That's a really key distinction.
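To make "find me all the a tags, or the div with this specific ID" concrete, here is a rough sketch. It is a stand-in, not Beautiful Soup: it uses the standard library's `html.parser` so it runs without installs, and the `LinkCollector` class, the `extract` helper and the sample markup are all invented for illustration. Beautiful Soup itself covers the same ground in a call or two, for example `soup.find_all("a")` or `soup.select("div#main a")`.

```python
from html.parser import HTMLParser

# Made-up sample page, standing in for HTML a scraper has already fetched.
SAMPLE = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <div id="main">
    <p>Hello, <a href="/docs#intro">docs</a>!</p>
    <div class="note">Nested</div>
  </div>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect every <a href>, plus the text found inside one target <div id=...>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.links = []
        self.target_text = []
        self._depth = 0  # > 0 while we are somewhere inside the target div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag == "div":
            if self._depth:
                self._depth += 1          # nested div inside the target
            elif attrs.get("id") == self.target_id:
                self._depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.target_text.append(data.strip())


def extract(html, target_id):
    """Return (all link hrefs, text inside the div with the given id)."""
    parser = LinkCollector(target_id)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.target_text)


if __name__ == "__main__":
    print(extract(SAMPLE, "main"))
```

The depth counter is what lets the parser tell "inside the main div" from "inside some other div", which is the tree structure the hosts describe, tracked by hand.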
Yeah, what you see in 93 00:04:15.400 --> 00:04:17.920 Elements is often closer to what you need if the 94 00:04:17.959 --> 00:04:18.839 page is dynamic. 95 00:04:18.920 --> 00:04:21.959 Okay, perfect segue. That covers static pages really well. But 96 00:04:22.079 --> 00:04:25.360 what about those more complex sites, the ones that are 97 00:04:25.720 --> 00:04:29.680 heavy on JavaScript, where content loads dynamically as you scroll, 98 00:04:29.839 --> 00:04:33.199 or maybe they set cookies using JavaScript? Our requests and 99 00:04:33.279 --> 00:04:37.120 Beautiful Soup approach might just stop working there, because they 100 00:04:37.120 --> 00:04:39.720 aren't actually running a browser. They're just fetching the initial 101 00:04:39.839 --> 00:04:40.720 HTML source. 102 00:04:40.959 --> 00:04:43.240 You hit the nail on the head. That's a huge 103 00:04:43.360 --> 00:04:47.480 challenge with modern web development. So many sites are JavaScript heavy. 104 00:04:47.879 --> 00:04:50.680 The initial HTML might be almost empty, just a shell. 105 00:04:51.160 --> 00:04:54.480 The actual content gets fetched and rendered by JavaScript running 106 00:04:54.480 --> 00:04:58.399 in your browser, and sometimes that JavaScript is deliberately obfuscated, 107 00:04:58.680 --> 00:05:02.959 made hard to read, to make reverse engineering it almost impossible. 108 00:05:02.639 --> 00:05:04.959 So you can't easily figure out where it's getting the 109 00:05:05.040 --> 00:05:06.360 data from. Exactly. 110 00:05:06.920 --> 00:05:09.120 Or maybe it sets a special cookie, like a nonce, 111 00:05:09.319 --> 00:05:13.000 a security token, using JavaScript, and without that cookie you 112 00:05:13.040 --> 00:05:17.040 can't make further requests. So if requests can't run JavaScript, 113 00:05:17.319 --> 00:05:17.920 what do you do? 114 00:05:18.040 --> 00:05:21.360 And that's where, I guess, Selenium comes into the picture.
115 00:05:21.399 --> 00:05:23.319 It's more than just a scraper, isn't it? It's about 116 00:05:23.399 --> 00:05:24.920 browser automation. Precisely. 117 00:05:25.319 --> 00:05:28.759 Selenium's original purpose was actually automated testing of websites. Yeah, 118 00:05:28.839 --> 00:05:31.360 making sure buttons work, forms submit, et cetera. But that 119 00:05:31.399 --> 00:05:34.519 makes it incredibly powerful for scraping, because it literally drives 120 00:05:34.720 --> 00:05:37.879 a real web browser: Chrome, Firefox, whatever you configure. 121 00:05:37.959 --> 00:05:39.639 So it can run the JavaScript? 122 00:05:39.879 --> 00:05:43.680 Yes, it loads the page, waits for things to appear, 123 00:05:44.279 --> 00:05:47.720 clicks buttons, fills in forms, scrolls down the page. Anything 124 00:05:47.759 --> 00:05:50.439 a human user can do, Selenium can automate. 125 00:05:50.800 --> 00:05:53.959 Ah, okay. So for those sites where content loads as 126 00:05:53.959 --> 00:05:57.279 you scroll, like maybe infinite scrolling on social media or news 127 00:05:57.079 --> 00:06:00.639 sites? Perfect example. Requests would only get the first batch. 128 00:06:00.920 --> 00:06:04.160 Selenium can actually perform the scroll action, wait for the 129 00:06:04.199 --> 00:06:06.839 new content to load because the JavaScript runs, and then 130 00:06:06.879 --> 00:06:07.240 grab it. 131 00:06:07.439 --> 00:06:10.920 That's clever. What about waiting? Pages don't always load instantly. Right, 132 00:06:10.959 --> 00:06:13.319 Selenium has tools for that too. You can use waits, 133 00:06:13.399 --> 00:06:16.600 telling your script, hey, wait until this specific button is clickable, 134 00:06:17.040 --> 00:06:19.759 or wait until this piece of text appears before you 135 00:06:19.800 --> 00:06:22.399 try to interact with it. It makes your scraper much 136 00:06:22.439 --> 00:06:25.959 more robust against slow loading pages or dynamic elements.
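The "wait until this button is clickable" idea boils down to polling a condition with a timeout. The sketch below shows only that underlying pattern and is not Selenium's API: `wait_until` and its parameters are hypothetical names, and no browser is involved. Selenium ships its own version of this as `WebDriverWait` together with the `expected_conditions` helpers.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds have
    passed. `clock` and `sleep` are injectable so the helper can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for "has the element appeared yet?": ready on the third check.
    state = {"checks": 0}

    def element_ready():
        state["checks"] += 1
        return "element" if state["checks"] >= 3 else None

    print(wait_until(element_ready, timeout=5, poll_interval=0.1))
```

Polling with a deadline, rather than a fixed `time.sleep(5)`, is what makes a scraper tolerate both fast and slow page loads.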
137 00:06:26.040 --> 00:06:29.360 That sounds incredibly capable, but I imagine driving a full 138 00:06:29.399 --> 00:06:32.720 browser isn't as lightweight as just making a simple HTTP request. 139 00:06:33.120 --> 00:06:34.120 Is there a downside? 140 00:06:34.160 --> 00:06:37.879 Absolutely, there's a trade-off. Selenium is significantly slower and 141 00:06:38.000 --> 00:06:41.600 uses way more memory and CPU resources than requests and 142 00:06:41.639 --> 00:06:42.639 Beautiful Soup. 143 00:06:42.480 --> 00:06:44.480 Because it's literally running Chrome in the 144 00:06:44.439 --> 00:06:48.720 background or something? Exactly. You're paying for running that full browser. 145 00:06:49.120 --> 00:06:53.439 So it's powerful, essential for those tricky dynamic sites. But 146 00:06:53.519 --> 00:06:56.319 you always want to check first: can I get this 147 00:06:56.439 --> 00:07:00.600 data with the simpler, faster requests approach? Use Selenium when you have 148 00:07:00.600 --> 00:07:03.480 to. Okay, makes sense, choose the right tool for the job. 149 00:07:04.600 --> 00:07:08.399 So let's say we've figured out how to scrape one page, 150 00:07:08.399 --> 00:07:11.279 maybe even a dynamic one with Selenium. How do we 151 00:07:11.360 --> 00:07:13.519 scale that up? How do we go from scraping a 152 00:07:13.560 --> 00:07:17.399 page to, well, crawling hundreds or thousands across a whole website? 153 00:07:17.399 --> 00:07:18.600 That feels like a different beast. 154 00:07:18.720 --> 00:07:22.319 It is. And that distinction between scraping, grabbing data from 155 00:07:22.319 --> 00:07:26.120 a specific page, and crawling, navigating link by link to 156 00:07:26.199 --> 00:07:28.759 discover and scrape many pages, is really 157 00:07:28.560 --> 00:07:31.680 important. Like what search engines do, but on a smaller scale, 158 00:07:31.399 --> 00:07:34.360 maybe? Exactly, they crawl the web constantly.
For data science, 159 00:07:34.360 --> 00:07:36.199 if you need to crawl a site, you need a 160 00:07:36.199 --> 00:07:39.759 more structured approach. Best practices become vital. You'll almost certainly 161 00:07:39.800 --> 00:07:43.160 want a database, something simple like SQLite is often fine, 162 00:07:43.160 --> 00:07:45.399 maybe using a helper library like records, to keep track 163 00:07:45.439 --> 00:07:47.519 of everything. What kind of thing? Well, you need a 164 00:07:47.519 --> 00:07:50.000 list of URLs you plan to visit, the crawl frontier. 165 00:07:50.399 --> 00:07:52.920 You need a list of URLs you've already visited so 166 00:07:52.959 --> 00:07:55.240 you don't get stuck in loops or scrape the same 167 00:07:55.279 --> 00:07:58.399 page multiple times. And of course you need to store 168 00:07:58.439 --> 00:08:01.519 the data you extract. It's also really good practice to 169 00:08:01.560 --> 00:08:05.240 separate the logic. Have one part of your code responsible 170 00:08:05.279 --> 00:08:08.879 for finding new links, the crawler, and another part responsible 171 00:08:08.879 --> 00:08:12.199 for extracting data from a page, the scraper. Makes it 172 00:08:12.199 --> 00:08:13.279 easier to manage. And 173 00:08:13.240 --> 00:08:15.319 you have to be careful not to hammer the website? 174 00:08:15.560 --> 00:08:16.360 Absolutely critical. 175 00:08:16.439 --> 00:08:18.879 You need to build in delays or cool-down periods 176 00:08:18.879 --> 00:08:21.360 between your requests. Don't just fire them off as fast 177 00:08:21.360 --> 00:08:23.839 as possible. You also need error handling. What if a 178 00:08:24.040 --> 00:08:27.959 page is temporarily down? You need logic to retry later. 179 00:08:28.279 --> 00:08:30.920 And thinking about doing things in parallel can speed it up, 180 00:08:31.079 --> 00:08:32.759 but you have to be even more careful not to 181 00:08:32.799 --> 00:08:35.080 overload the server then. It's a balancing act.
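The pieces just listed (a frontier, a visited list, stored results, crawler and scraper kept separate, delays, retries) fit together roughly as follows. This is a sketch under stated assumptions, not a production crawler: the `crawl` function and its `fetch`, `extract_links` and `extract_data` callables are hypothetical names supplied by the caller, the database is an in-memory SQLite table, retries are a simple linear back-off, and the same-host check is an added safety choice. Links are normalised with the standard library's `urljoin` and `urldefrag`.

```python
import sqlite3
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl(start_url, fetch, extract_links, extract_data,
          max_pages=50, delay=1.0, retries=2, sleep=time.sleep):
    """Breadth-first crawl of one site; returns a SQLite connection with the results.

    fetch(url) -> html            (assumed to raise OSError on a failed download)
    extract_links(html) -> hrefs  (the "crawler" half)
    extract_data(html) -> text    (the "scraper" half)
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, data TEXT)")
    site = urlsplit(start_url).netloc
    frontier = deque([start_url])   # URLs we still plan to visit
    visited = set()                 # URLs already handled, to avoid loops

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        html = None
        for attempt in range(retries + 1):
            try:
                html = fetch(url)
                break
            except OSError:
                sleep(delay * (attempt + 1))  # back off before retrying
        if html is None:
            continue                          # give up on this page, keep crawling

        db.execute("INSERT INTO pages VALUES (?, ?)", (url, extract_data(html)))

        for href in extract_links(html):
            # Resolve relative links and drop "#fragment" so one page is one URL.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(absolute).netloc == site and absolute not in visited:
                frontier.append(absolute)

        sleep(delay)                          # cool-down between requests

    db.commit()
    return db
```

Passing `fetch` in as a parameter keeps the download step swappable (requests, Selenium, or a fake for testing) without touching the crawl logic.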
182 00:08:35.720 --> 00:08:38.759 And you mentioned some specific tools for handling URLs. 183 00:08:39.039 --> 00:08:43.039 Yeah, little things become important when crawling, like urllib.parse.urljoin. 184 00:08:43.080 --> 00:08:46.840 Websites often use relative links like 185 00:08:46.919 --> 00:08:49.919 about-us. Your crawler needs to correctly combine that with 186 00:08:50.000 --> 00:08:52.600 the base URL to get the full address. urljoin 187 00:08:52.679 --> 00:08:56.279 handles that reliably, and urldefrag helps remove those fragment 188 00:08:56.360 --> 00:08:59.440 identifiers, the bit after the hash, so you don't accidentally crawl 189 00:09:00.600 --> 00:09:04.320 page.html#section1 and page.html#section2 190 00:09:04.600 --> 00:09:05.799 as if they were different pages. 191 00:09:05.960 --> 00:09:08.879 So why is this scaling up, this crawling capability, so 192 00:09:08.960 --> 00:09:12.440 important for you, our listeners doing data science? What doors 193 00:09:12.480 --> 00:09:13.039 does it open? 194 00:09:13.279 --> 00:09:15.039 It opens the door to data sets that just don't 195 00:09:15.080 --> 00:09:18.519 exist anywhere else, or aren't available in a neat packaged format. 196 00:09:19.120 --> 00:09:23.320 The web is this enormous, constantly updated, incredibly rich source 197 00:09:23.320 --> 00:09:25.519 of, well, mostly unstructured 198 00:09:25.000 --> 00:09:27.799 data. A real treasure trove if you can access it. Exactly. 199 00:09:28.039 --> 00:09:30.600 Imagine you want to build a sentiment analysis model for 200 00:09:30.639 --> 00:09:33.600 product reviews. You'd need thousands, tens of thousands of reviews. 201 00:09:33.600 --> 00:09:35.559 Where do you get them? You crawl e-commerce sites. 202 00:09:35.559 --> 00:09:38.080 Or maybe tracking housing prices? Perfect
203 00:09:37.639 --> 00:09:40.960 example. Collect real estate listings across a whole city or 204 00:09:40.960 --> 00:09:45.360 region for analysis or visualization. We've seen amazing projects born 205 00:09:45.440 --> 00:09:48.879 from this. Google Translate got massively better by using scraped 206 00:09:48.919 --> 00:09:51.879 text from across the web. There was the Billion Prices 207 00:09:51.879 --> 00:09:55.480 Project at MIT, which scraped online retailers daily to create 208 00:09:55.559 --> 00:09:59.519 near real-time inflation tracking, way faster than official government stats. 209 00:09:59.600 --> 00:09:59.840 Wow. 210 00:10:00.120 --> 00:10:03.320 Yeah. Or think about monitoring social media for mentions of 211 00:10:03.320 --> 00:10:07.080 Bitcoin to gauge public sentiment, or analyzing job postings to 212 00:10:07.080 --> 00:10:10.039 see which data science skills are currently in demand. All 213 00:10:10.080 --> 00:10:14.279 rely on robust crawling. It's about turning the messy, sprawling 214 00:10:14.320 --> 00:10:19.120 web into structured, valuable information for your data science pipeline. 215 00:10:19.200 --> 00:10:21.039 Okay, so let's pull back a bit. Thinking about that 216 00:10:21.120 --> 00:10:24.720 data science pipeline, maybe using a framework like CRISP-DM, where 217 00:10:24.720 --> 00:10:27.320 does web scraping fit into the bigger picture? 218 00:10:27.600 --> 00:10:30.679 Good question. It primarily slots into the early phases, data 219 00:10:30.840 --> 00:10:32.720 understanding and data 220 00:10:32.480 --> 00:10:34.679 preparation. Finding and getting the data. Right. 221 00:10:34.879 --> 00:10:40.159 Specifically, it's often part of identifying data sources, realizing the 222 00:10:40.200 --> 00:10:42.879 web is a potential source, and then selecting the data 223 00:10:42.919 --> 00:10:46.440 and actually collecting it.
It's usually about enriching data sets 224 00:10:46.440 --> 00:10:49.080 you already have, or maybe creating a totally new data 225 00:10:49.080 --> 00:10:51.399 set from scratch using web data. 226 00:10:51.440 --> 00:10:53.320 But it's not just a technical task, is it? You 227 00:10:53.399 --> 00:10:55.000 mentioned managerial concerns. 228 00:10:55.120 --> 00:10:58.759 Yes, and this is often underestimated. There's this crucial gap 229 00:10:58.840 --> 00:11:03.480 between building a model using scraped data, the model train 230 00:11:03.559 --> 00:11:06.720 phase, and actually deploying that model where it needs ongoing 231 00:11:06.759 --> 00:11:09.080 scraped data to work, the model run phase. 232 00:11:09.519 --> 00:11:11.639 Ah, because the website might change. Exactly. 233 00:11:11.679 --> 00:11:15.600 Websites change all the time. Layouts change, HTML structure changes, 234 00:11:15.799 --> 00:11:19.559 login processes change. A scraper that works perfectly today might 235 00:11:19.600 --> 00:11:20.240 break tomorrow. 236 00:11:20.360 --> 00:11:23.799 Without warning. So your production model suddenly stops working because 237 00:11:23.960 --> 00:11:24.720 its data 238 00:11:24.519 --> 00:11:29.399 feed broke. Precisely, which means web scrapers require ongoing maintenance. 239 00:11:29.720 --> 00:11:31.639 Someone has to monitor them, fix them when they break. 240 00:11:32.000 --> 00:11:34.000 That's a real cost. And that's why the golden rule, the 241 00:11:34.000 --> 00:11:37.120 first piece of advice, is always: look for an official API 242 00:11:36.840 --> 00:11:41.320 first. Application programming interface, a structured way for programs to 243 00:11:41.320 --> 00:11:42.039 get data. 244 00:11:42.159 --> 00:11:46.200 Right. If the website offers an API and it provides 245 00:11:46.240 --> 00:11:48.600 the data you need and the terms are acceptable, maybe 246 00:11:48.639 --> 00:11:52.679 it's free or reasonably priced, use the API.
It's almost 247 00:11:52.720 --> 00:11:56.720 always going to be more stable, more reliable, and less 248 00:11:56.799 --> 00:11:59.799 likely to break than a custom scraper you build yourself. 249 00:12:00.159 --> 00:12:03.360 Really solid advice. But what if there isn't an API, 250 00:12:03.600 --> 00:12:06.080 or maybe the API exists but it's, I don't know, 251 00:12:06.159 --> 00:12:08.519 super limited in how many requests you can make, or 252 00:12:08.559 --> 00:12:10.879 it just doesn't have that one specific piece of data 253 00:12:10.960 --> 00:12:14.759 you absolutely need? When does building the scraper become worth 254 00:12:14.799 --> 00:12:15.399 the hassle? 255 00:12:15.759 --> 00:12:17.720 That's the judgment call, isn't it? It's a trade-off. 256 00:12:17.759 --> 00:12:20.559 If the API doesn't cut it for whatever reason, cost, 257 00:12:20.720 --> 00:12:24.639 rate limits, missing data fields, then yeah, building and maintaining 258 00:12:24.639 --> 00:12:25.960 a scraper might be your only option. 259 00:12:26.120 --> 00:12:29.120 So you weigh the development and maintenance effort against the 260 00:12:29.240 --> 00:12:31.240 value of the data. Exactly. 261 00:12:31.200 --> 00:12:33.879 But you go into it with your eyes open, knowing 262 00:12:34.000 --> 00:12:36.759 it's likely going to require ongoing work. It's that cat 263 00:12:36.759 --> 00:12:39.919 and mouse game people talk about. Websites might actively try 264 00:12:39.960 --> 00:12:41.919 to block scrapers, so you might need to adapt your 265 00:12:41.960 --> 00:12:42.960 techniques constantly. 266 00:12:43.240 --> 00:12:48.000 And speaking of blocking and, well, potential conflicts: the legal 267 00:12:48.080 --> 00:12:50.200 side of this. You mentioned it's complex. It sounds like 268 00:12:50.200 --> 00:12:52.679 it's not just a technical decision, but a legal and 269 00:12:52.720 --> 00:12:53.480 ethical one too. 270 00:12:53.600 --> 00:12:58.360 Absolutely, it's murky waters.
Legally speaking, there isn't one single 271 00:12:58.480 --> 00:13:02.039 law that says web scraping is legal or web scraping 272 00:13:02.080 --> 00:13:06.600 is illegal. It depends. Several legal arguments tend to pop 273 00:13:06.679 --> 00:13:09.559 up in court cases, at least in the US. Like what? Well, 274 00:13:09.600 --> 00:13:13.120 there's breach of terms and conditions. If a website's terms 275 00:13:13.120 --> 00:13:17.240 of service explicitly forbid scraping and you clicked I accept somewhere, 276 00:13:17.600 --> 00:13:19.960 they might have a case. We saw that with Ryanair 277 00:13:19.960 --> 00:13:21.519 winning against a flight data scraper. 278 00:13:21.559 --> 00:13:23.360 Okay, so read the terms. Definitely. 279 00:13:23.679 --> 00:13:27.440 Then there's copyright infringement. Is the data itself copyrighted? Usually 280 00:13:27.480 --> 00:13:30.279 facts aren't, but the presentation might be. The fair use 281 00:13:30.320 --> 00:13:33.120 doctrine often gets debated here. Think about Google Books scanning 282 00:13:33.159 --> 00:13:37.120 millions of books, lots of legal wrangling there. There's also 283 00:13:37.159 --> 00:13:40.879 the CFAA, the Computer Fraud and Abuse Act. It's meant 284 00:13:40.919 --> 00:13:44.639 to target hacking, unauthorized access. Sometimes companies try to argue 285 00:13:44.639 --> 00:13:49.000 that scraping constitutes unauthorized access, especially if you bypass technical 286 00:13:48.720 --> 00:13:52.039 barriers. Hmm, that seems like a stretch for public data. 287 00:13:52.240 --> 00:13:55.240 Courts have struggled with it. There are also older concepts like 288 00:13:55.440 --> 00:13:59.759 trespass to chattels, basically arguing your scraper is interfering with 289 00:13:59.799 --> 00:14:03.240 their server resources, especially if you overload it. And then 290 00:14:03.279 --> 00:14:05.360 there's the robots.txt file.
291 00:14:05.480 --> 00:14:07.600 Right, the file that tells bots where they shouldn't go. 292 00:14:07.840 --> 00:14:11.320 Yeah, it's not strictly legally binding in most cases, but 293 00:14:11.440 --> 00:14:15.039 ignoring it is definitely not playing nice and signals you're 294 00:14:15.080 --> 00:14:17.759 disregarding the site owner's wishes. It could be used as 295 00:14:17.799 --> 00:14:19.399 evidence against you. 296 00:14:19.399 --> 00:14:23.120 You mentioned a specific case earlier, hiQ Labs versus LinkedIn. 297 00:14:23.480 --> 00:14:26.080 That seemed pretty important for this whole public data question. 298 00:14:26.200 --> 00:14:29.799 It was, yeah, a really significant case. hiQ Labs 299 00:14:29.919 --> 00:14:33.600 was scraping data from public LinkedIn profiles. LinkedIn tried to 300 00:14:33.639 --> 00:14:37.320 stop them technologically and legally, invoking the CFAA. 301 00:14:37.440 --> 00:14:38.519 And what did the courts say? 302 00:14:38.879 --> 00:14:42.480 The courts, particularly the Ninth Circuit, basically ruled that scraping 303 00:14:42.519 --> 00:14:45.559 publicly accessible data, even if the site tries to block 304 00:14:45.600 --> 00:14:48.320 you with technical measures or says not to in its terms, 305 00:14:48.519 --> 00:14:52.919 doesn't necessarily violate the CFAA's without authorization clause. The key 306 00:14:53.039 --> 00:14:54.559 was that the data was already open 307 00:14:54.320 --> 00:14:57.480 to the public. So if it's public, maybe it's fair game? 308 00:14:57.639 --> 00:15:01.000 It leans that way, but it's not a blanket permission slip. 309 00:15:01.600 --> 00:15:05.440 It highlighted just how blurry the lines are around public 310 00:15:05.480 --> 00:15:09.039 information and unauthorized access when it comes to the web. 311 00:15:09.320 --> 00:15:11.440 The legal landscape is definitely still evolving.
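Checking robots.txt before fetching a page, as raised a little earlier, can be done with Python's standard `urllib.robotparser`. A minimal sketch, assuming a made-up robots.txt and a made-up `my-scraper/0.1` User-Agent; the rules are parsed from a string here so nothing touches the network, whereas a real scraper would point the parser at the site's own file with `set_url` and `read`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS)

if __name__ == "__main__":
    agent = "my-scraper/0.1"  # hypothetical User-Agent string
    for url in ("http://example.test/public/page.html",
                "http://example.test/private/page.html"):
        print(url, "allowed" if rules.can_fetch(agent, url) else "disallowed")
    print("requested delay between requests:", rules.crawl_delay(agent))
```

A permissive robots.txt says nothing about the site's terms of service, so this check sits alongside reading the terms and does not replace it.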
312 00:15:11.679 --> 00:15:16.360 Okay. So with all that complexity, technical, ethical, legal, what's 313 00:15:16.399 --> 00:15:19.120 the takeaway for you, our listener? What are your core 314 00:15:19.159 --> 00:15:22.799 responsibilities when you decide to scrape data? What's the baseline 315 00:15:22.799 --> 00:15:24.559 for being a good digital citizen? 316 00:15:24.799 --> 00:15:29.120 The absolute number one rule is play nice. Be respectful, 317 00:15:29.519 --> 00:15:32.679 don't bombard a website with requests. Think about the impact 318 00:15:32.720 --> 00:15:33.120 you're having. 319 00:15:33.159 --> 00:15:35.440 Don't be the reason their site goes down. Exactly. 320 00:15:35.559 --> 00:15:38.120 That can cause real financial damage, and that's when legal 321 00:15:38.120 --> 00:15:41.080 action becomes much more likely. We saw QVC sue a 322 00:15:41.120 --> 00:15:45.240 company called Resultly, claiming excessive scraping caused outages costing millions. 323 00:15:45.399 --> 00:15:48.240 You don't want to be that person. So throttle your requests, 324 00:15:48.480 --> 00:15:51.639 put delays in. Identify yourself with a proper User-Agent 325 00:15:51.679 --> 00:15:54.960 header if you can, maybe even include contact info. Transparency, 326 00:15:55.200 --> 00:16:00.000 if appropriate. Yes. Always, always check the robots.txt 327 00:16:00.080 --> 00:16:02.840 file and the terms of service first. See what 328 00:16:02.879 --> 00:16:06.320 the site owner explicitly asks for or forbids. And if 329 00:16:06.360 --> 00:16:09.519 you can, the absolute safest route is to get written 330 00:16:09.519 --> 00:16:10.799 permission from the website owner. 331 00:16:10.960 --> 00:16:13.360 That might not always be practical, but it's the ideal. 332 00:16:13.399 --> 00:16:16.279 It's the gold standard. Yeah, and just pause and think: 333 00:16:17.519 --> 00:16:20.480 is this data truly intended to be public and consumed 334 00:16:20.480 --> 00:16:23.639 in this way?
Or am I accessing something private or 335 00:16:23.639 --> 00:16:26.600 trying to circumvent a system? Using common sense and acting 336 00:16:26.639 --> 00:16:29.440 ethically is just as important as writing clever code. 337 00:16:29.639 --> 00:16:32.200 Wow, okay, that was definitely a deep dive. We've gone 338 00:16:32.240 --> 00:16:34.639 from the basics of HTTP 339 00:16:34.399 --> 00:16:36.759 to parsing HTML with Beautiful Soup, 340 00:16:36.679 --> 00:16:38.679 tackling tricky JavaScript with Selenium, 341 00:16:38.720 --> 00:16:40.279 scaling up with crawling techniques, 342 00:16:40.360 --> 00:16:43.600 and wrestling with those really crucial management and legal questions. 343 00:16:44.240 --> 00:16:46.639 I think you listening should now have a much clearer 344 00:16:46.679 --> 00:16:49.480 picture not just of the how of web scraping, but the 345 00:16:49.559 --> 00:16:52.639 really important why and when, and when not to. 346 00:16:53.120 --> 00:16:57.519 Yeah, it's about using this powerful tool effectively but also responsibly. 347 00:16:57.679 --> 00:17:00.399 Absolutely. So here's a final thought to leave you with, 348 00:17:00.759 --> 00:17:03.720 something to chew on. We've talked about this cat and 349 00:17:03.799 --> 00:17:07.799 mouse game between scrapers and websites and the shifting legal sands. 350 00:17:08.640 --> 00:17:12.240 Considering how fast things like AI are evolving, and maybe 351 00:17:12.359 --> 00:17:16.480 new techniques for hiding or protecting data online, how might 352 00:17:16.519 --> 00:17:20.039 our very definition of publicly available data change in the 353 00:17:20.079 --> 00:17:20.799 next few years? 354 00:17:21.240 --> 00:17:22.920 That's a big question. Right.
355 00:17:23.039 --> 00:17:25.480 And what could that changing definition mean for how we 356 00:17:25.519 --> 00:17:28.160 gather data, how we do analysis, and maybe even the 357 00:17:28.240 --> 00:17:30.960 kinds of innovation that are possible across, well, pretty much 358 00:17:30.960 --> 00:17:31.480 every field? 359 00:17:31.599 --> 00:17:34.880 Yeah, what does public mean when data is generated by 360 00:17:34.920 --> 00:17:38.880 AI or locked behind complex interactions? Lots to think about 361 00:17:38.920 --> 00:17:40.960 there. Definitely something to ponder. We'll leave it there for 362 00:17:40.960 --> 00:17:41.599 this deep dive