WEBVTT 1 00:00:00.080 --> 00:00:03.879 Okay, welcome to the deep dive. Today we're jumping into web scraping. 2 00:00:04.440 --> 00:00:09.679 That's right, specifically using PHP, based on some key ideas 3 00:00:09.679 --> 00:00:12.679 from our source, Instant PHP Web Scraping. 4 00:00:12.759 --> 00:00:15.599 Yeah, it's from 2013, by Jacob Ward. So things 5 00:00:15.640 --> 00:00:19.519 have moved on, obviously, but the core ideas, they're often 6 00:00:19.559 --> 00:00:20.800 still relevant, aren't 7 00:00:20.559 --> 00:00:23.280 they? They really are. The book was aimed at beginners, 8 00:00:23.399 --> 00:00:26.760 you know, showing how to programmatically crawl websites, download content, 9 00:00:27.280 --> 00:00:31.280 and basically turn unstructured web stuff into structured data using, 10 00:00:32.119 --> 00:00:32.799 using PHP. 11 00:00:33.159 --> 00:00:35.640 So our mission here is to pull out those fundamental 12 00:00:35.679 --> 00:00:38.880 techniques from these excerpts, give you a solid grounding in 13 00:00:38.920 --> 00:00:41.200 how PHP web scraping works at its core. 14 00:00:41.439 --> 00:00:44.439 Even if you're maybe adapting these ideas for more modern 15 00:00:44.479 --> 00:00:47.280 sites later on, the basics often carry 16 00:00:47.039 --> 00:00:50.159 through. Exactly. The source assumes maybe not a ton of 17 00:00:50.200 --> 00:00:54.280 programming experience, though knowing some PHP and HTML definitely 18 00:00:53.840 --> 00:00:56.200 helps. Sure, but the focus is really on the scraping 19 00:00:56.240 --> 00:00:57.119 concepts themselves. 20 00:00:57.200 --> 00:00:59.640 All right, let's kick things off. Before you scrape, you 21 00:00:59.679 --> 00:01:03.039 need your tools. What's the basic toolkit according to these sources?
22 00:01:03.320 --> 00:01:07.760 Okay, so first, obviously, you need PHP itself. That's the language, right? 23 00:01:08.040 --> 00:01:10.840 Then a good place to write your code, an IDE, 24 00:01:11.640 --> 00:01:16.760 integrated development environment. The source mentions Eclipse PDT. 25 00:01:16.400 --> 00:01:19.439 PDT being the PHP Development Tools for Eclipse, so a 26 00:01:19.480 --> 00:01:20.680 specialized code editor. 27 00:01:20.840 --> 00:01:24.480 Yeah. Basically makes coding easier, keeps things organized. And then 28 00:01:24.519 --> 00:01:26.680 you need a way to run the PHP, and 29 00:01:26.640 --> 00:01:29.680 probably a database too. Like a local server setup? Exactly. 30 00:01:30.000 --> 00:01:34.040 The source recommends XAMPP. It bundles Apache, which 31 00:01:34.079 --> 00:01:37.599 is the web server, PHP, and MySQL, the database, all 32 00:01:37.599 --> 00:01:38.400 in one package. 33 00:01:38.480 --> 00:01:41.239 Ah, convenient. Avoids installing everything separately. 34 00:01:41.400 --> 00:01:44.159 Yeah, and it even includes phpMyAdmin. Oh, 35 00:01:44.120 --> 00:01:48.359 right, for managing the MySQL database visually. Useful later. Definitely. Okay, 36 00:01:48.359 --> 00:01:53.439 so you install XAMPP, maybe Eclipse. Any specific setup tweaks needed? 37 00:01:53.599 --> 00:01:55.959 A couple of key things the source points out. Setting 38 00:01:56.000 --> 00:01:58.599 your PHP path variable is good practice. Lets you run 39 00:01:58.640 --> 00:02:01.920 PHP scripts easily from the command line for testing and stuff. Right. 40 00:02:01.959 --> 00:02:06.280 But the really critical one for scraping is enabling the cURL 41 00:02:05.680 --> 00:02:07.920 extension. cURL, okay. What is that exactly? 42 00:02:08.039 --> 00:02:10.719 It's a PHP library. You need it enabled in your 43 00:02:10.759 --> 00:02:13.759 main PHP config file, the php.ini. 44 00:02:14.560 --> 00:02:17.879 Without it, your PHP script can't really make web requests easily.
45 00:02:18.000 --> 00:02:21.080 Ah, so it's essential for fetching pages programmatically. 46 00:02:21.240 --> 00:02:23.199 Absolutely. And then, you know, just test the setup, make 47 00:02:23.240 --> 00:02:26.120 sure Apache runs. Maybe run a simple phpinfo script 48 00:02:26.280 --> 00:02:28.479 to see if cURL is listed as enabled. 49 00:02:28.560 --> 00:02:32.280 Got it. So toolkit ready, cURL enabled. Now the 50 00:02:32.360 --> 00:02:36.520 first actual step in scraping: getting the web page. 51 00:02:36.199 --> 00:02:39.520 Fetching the content. Yeah, this is where cURL comes into 52 00:02:39.520 --> 00:02:40.560 play directly. 53 00:02:40.199 --> 00:02:42.800 Because it handles HTTP requests. 54 00:02:42.520 --> 00:02:46.080 Exactly. It lets your script act like a browser, essentially sending 55 00:02:46.120 --> 00:02:48.599 a request to a URL and getting back the HTML 56 00:02:48.680 --> 00:02:49.439 source code. 57 00:02:49.360 --> 00:02:51.960 And the source provides a function example, curlGet. What's 58 00:02:52.000 --> 00:02:53.800 the basic flow there? It's pretty logical. 59 00:02:53.840 --> 00:02:56.280 You initialize a cURL session, think of it as opening 60 00:02:56.280 --> 00:02:59.120 a connection channel. Okay. Then you set options for that session, 61 00:02:59.360 --> 00:03:00.280 tell it what you want to 62 00:03:00.280 --> 00:03:01.560 do. Like the URL you want to fetch? 63 00:03:01.840 --> 00:03:04.800 CURLOPT_URL is the main one, and critically 64 00:03:04.919 --> 00:03:07.400 CURLOPT_RETURNTRANSFER. You usually want that set to true. Why 65 00:03:07.439 --> 00:03:10.759 is that? So that cURL returns the page content as 66 00:03:10.800 --> 00:03:13.919 a string variable in your PHP script instead of just, 67 00:03:13.960 --> 00:03:16.400 like, printing it straight to the screen. You need it 68 00:03:16.439 --> 00:03:17.800 as a variable to work with 69 00:03:17.759 --> 00:03:20.800 it. Right, makes sense. Any other key options?
70 00:03:20.840 --> 00:03:23.439 Oh yeah, CURLOPT_FOLLOWLOCATION is super 71 00:03:23.280 --> 00:03:25.840 useful. For redirects, like 301s? Exactly. 72 00:03:26.039 --> 00:03:29.159 Websites often redirect you. This option tells cURL to 73 00:03:29.199 --> 00:03:32.639 automatically follow those redirects to the final page. Saves you 74 00:03:32.680 --> 00:03:34.120 a lot of hassle. Nice. 75 00:03:34.240 --> 00:03:36.199 What about CURLOPT_USERAGENT? The source gives 76 00:03:36.159 --> 00:03:38.319 an example string. Ah, 77 00:03:38.039 --> 00:03:41.319 the user agent. It's basically a string that identifies your 78 00:03:41.319 --> 00:03:45.400 client, your script, to the web server. Why? Well, partly politeness, 79 00:03:45.439 --> 00:03:48.560 partly necessity. Some servers block requests that don't have a 80 00:03:48.639 --> 00:03:50.479 user agent string that looks like it's from a normal 81 00:03:50.479 --> 00:03:53.360 web browser, so sending one makes your script look less 82 00:03:53.360 --> 00:03:54.199 like a basic 83 00:03:53.919 --> 00:03:56.840 bot. Okay, helps avoid immediate blocks, potentially. 84 00:03:56.960 --> 00:03:59.479 Yeah. Then there's CURLOPT_HTTPHEADER if you need to 85 00:03:59.520 --> 00:04:02.919 send custom headers, sometimes needed for specific sites, and 86 00:04:02.960 --> 00:04:04.039 CURLOPT_FAILONERROR. 87 00:04:04.080 --> 00:04:04.680 What does that do? 88 00:04:05.000 --> 00:04:08.439 It tells cURL to treat HTTP error codes like 404 89 00:04:08.439 --> 00:04:11.280 not found or 500 server error as, 90 00:04:11.479 --> 00:04:12.560 well, actual 91 00:04:12.199 --> 00:04:15.000 script errors. Instead of just returning an empty page or 92 00:04:15.039 --> 00:04:15.719 an error page? 93 00:04:15.919 --> 00:04:18.000 Right, it can be a simple way to detect if 94 00:04:18.040 --> 00:04:20.000 the request failed badly. Okay.
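For readers following along, the curlGet flow described above might look roughly like the sketch below. It is not the book's listing: the timeouts, the redirect cap, the user agent string, and the array return shape (body, HTTP status, error text) are assumptions made for illustration.

```php
<?php
// Sketch of a curlGet-style helper built from the options discussed above.
// Assumptions (not from the source): the timeouts, the redirect cap, the
// user agent string, and returning an array instead of a bare string.
function curlGet(string $url, array $headers = []): array
{
    $ch = curl_init();                          // open a cURL session
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,         // the page to fetch
        CURLOPT_RETURNTRANSFER => true,         // hand the body back as a string
        CURLOPT_FOLLOWLOCATION => true,         // follow redirects such as 301s
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
        CURLOPT_HTTPHEADER     => $headers,     // optional custom request headers
        CURLOPT_FAILONERROR    => true,         // make curl_exec fail on HTTP codes >= 400
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
    ]);

    $body = curl_exec($ch);                     // string on success, false on failure

    // The handle is released when $ch goes out of scope.
    return [
        'body'   => $body === false ? null : $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
        'error'  => curl_error($ch),
    ];
}
```

A call such as curlGet('http://example.com/') (a placeholder URL) would then give the page source in the body entry, or null plus an error message if the request failed.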
95 00:04:20.319 --> 00:04:23.800 And the source mentions checking the HTTP response code itself 96 00:04:24.040 --> 00:04:25.120 using curl_getinfo. 97 00:04:25.439 --> 00:04:29.040 Why bother? Because knowing the code tells you exactly what happened. 98 00:04:29.120 --> 00:04:31.319 200 OK means success, you have the page. 99 00:04:31.439 --> 00:04:34.839 404 means it doesn't exist. Right, crucial info. 100 00:04:35.480 --> 00:04:37.879 A 403 forbidden means you don't have permission. 101 00:04:38.279 --> 00:04:41.040 Maybe you need to log in, or your IP is blocked. 102 00:04:40.879 --> 00:04:44.600 301 moved permanently, which FOLLOWLOCATION handles, but 103 00:04:44.839 --> 00:04:45.319 good to know. 104 00:04:45.639 --> 00:04:49.399 Yeah, checking the status code is fundamental for robust error handling. 105 00:04:49.519 --> 00:04:50.759 You know why something might have failed. 106 00:04:50.800 --> 00:04:54.120 Okay, so cURL gets you the raw HTML, maybe a 107 00:04:54.160 --> 00:04:57.439 massive string of code. Now the real challenge: finding the 108 00:04:57.439 --> 00:04:59.639 specific bit of data you want inside all that. 109 00:05:00.360 --> 00:05:03.160 Extraction time. And the main tool the source introduces here 110 00:05:03.279 --> 00:05:04.759 is XPath. XPath? 111 00:05:04.800 --> 00:05:06.519 I've heard of it with XML. How does it apply 112 00:05:06.560 --> 00:05:07.199 to HTML? 113 00:05:07.439 --> 00:05:10.639 Well, HTML isn't always perfect XML, but it's structured, right, 114 00:05:10.680 --> 00:05:13.959 with tags and attributes. You can parse that downloaded HTML 115 00:05:14.000 --> 00:05:17.720 string into something called a DOM, a document object model. 116 00:05:17.519 --> 00:05:18.879 A tree structure of the page.
117 00:05:19.199 --> 00:05:22.920 Precisely. And XPath is a language specifically for navigating that 118 00:05:22.959 --> 00:05:27.639 tree and selecting nodes, elements, attributes, text, based on their 119 00:05:27.639 --> 00:05:28.959 path or characteristics. 120 00:05:29.079 --> 00:05:32.040 So it's more structured than just, like, searching for keywords 121 00:05:32.079 --> 00:05:33.120 in the string? Much more. 122 00:05:33.439 --> 00:05:36.839 The source shows a function, returnXPathObject. It basically 123 00:05:36.839 --> 00:05:37.959 takes the HTML string. 124 00:05:38.000 --> 00:05:39.759 The one you got from cURL? Right. It 125 00:05:39.800 --> 00:05:43.120 uses PHP's built-in DOMDocument class to load that HTML, 126 00:05:43.399 --> 00:05:44.600 even if it's a bit messy. 127 00:05:44.800 --> 00:05:46.920 I see the source uses an @ symbol before 128 00:05:46.959 --> 00:05:49.680 loadHTML. Is that related to messy HTML? 129 00:05:49.800 --> 00:05:53.279 It is. Real-world HTML often has minor errors. The 130 00:05:53.319 --> 00:05:56.920 @ symbol in PHP suppresses warnings that loadHTML might 131 00:05:56.959 --> 00:06:00.000 generate because of that imperfect markup. It stops your script 132 00:06:00.040 --> 00:06:01.600 potentially halting on minor issues. 133 00:06:01.639 --> 00:06:03.959 Ah, a practical trick for scraping. Okay, so DOMDocument 134 00:06:04.000 --> 00:06:05.279 loads the HTML, then what? 135 00:06:05.360 --> 00:06:08.120 Then you create a DOMXPath object from that DOMDocument, 136 00:06:08.439 --> 00:06:10.519 and that XPath object is what you use to run 137 00:06:10.519 --> 00:06:11.040 your queries. 138 00:06:11.120 --> 00:06:14.040 Okay, queries. The source has examples like h1 or 139 00:06:14.079 --> 00:06:16.000 span @class some-class. Exactly. 140 00:06:16.040 --> 00:06:19.279 Those are XPath expressions. h1 means find any h1 141 00:06:19.360 --> 00:06:21.399 element anywhere in the document.
142 00:06:21.120 --> 00:06:22.199 And the span @class? 143 00:06:22.279 --> 00:06:25.439 That's more specific: find any span element that has an 144 00:06:25.439 --> 00:06:29.279 attribute named class with the exact value some-class. 145 00:06:29.399 --> 00:06:31.160 What about @href at 146 00:06:31.120 --> 00:06:33.800 the end of one example? That's selecting an attribute, so 147 00:06:33.839 --> 00:06:36.439 maybe it found a specific link, an a tag, and adding 148 00:06:36.439 --> 00:06:39.040 @href says get the value of its href attribute, 149 00:06:39.240 --> 00:06:40.199 the URL itself. 150 00:06:40.279 --> 00:06:43.519 So you run these queries against the XPath object and it 151 00:06:43.480 --> 00:06:46.639 gives you back a list of matching nodes, elements 152 00:06:46.240 --> 00:06:49.439 or attributes. And then the source shows 153 00:06:49.519 --> 00:06:51.240 item(0)->nodeValue to get the actual text. 154 00:06:51.439 --> 00:06:54.240 Right, the query might find multiple matches, so item(0) 155 00:06:54.399 --> 00:06:56.600 usually gets the first one in the list. Then 156 00:06:56.720 --> 00:07:00.000 nodeValue extracts the text content from inside that element. 157 00:07:00.079 --> 00:07:03.120 Okay, so XPath is powerful for navigating that structure. 158 00:07:03.399 --> 00:07:06.199 Very. The source has a table with common expressions, eight 159 00:07:06.319 --> 00:07:10.439 headed taro apro using brackets for conditions. That's your vocabulary 160 00:07:10.480 --> 00:07:11.519 for building these queries. 161 00:07:11.600 --> 00:07:14.439 What if the data isn't neatly inside a tag, or 162 00:07:14.480 --> 00:07:17.439 the structure's just chaotic? XPath might not work then. 163 00:07:17.680 --> 00:07:22.800 Exactly. Sometimes XPath is overkill or just plain impossible. That's 164 00:07:22.800 --> 00:07:24.959 where, as the source shows, you might need a more direct 165 00:07:25.000 --> 00:07:26.920 approach: custom functions.
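The DOMDocument and DOMXPath steps just described can be sketched as follows. The sample markup, the class name some-class, and the id on the link are made up for illustration, and the function body is a guess at the shape of returnXPathObject rather than the book's code.

```php
<?php
// Sketch of a returnXPathObject-style helper plus a few queries.
// Assumptions: the sample HTML, the "some-class" value, and the "more" id
// are hypothetical and exist only to make the example runnable.
function returnXPathObject(string $html): DOMXPath
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);        // @ hides warnings about imperfect markup
    return new DOMXPath($dom);     // the object you run queries against
}

$html = '<html><body><h1>Sample Title</h1>'
      . '<span class="some-class">Hello</span>'
      . '<a id="more" href="/page2">More</a></body></html>';

$xpath = returnXPathObject($html);

// Any h1 element anywhere in the document; item(0) is the first match.
$title = $xpath->query('//h1')->item(0)->nodeValue;

// A span whose class attribute is exactly "some-class".
$greeting = $xpath->query('//span[@class="some-class"]')->item(0)->nodeValue;

// The href attribute of one specific link.
$href = $xpath->query('//a[@id="more"]/@href')->item(0)->nodeValue;

echo $title, "\n";     // prints "Sample Title"
echo $greeting, "\n";  // prints "Hello"
echo $href, "\n";      // prints "/page2"
```

In real use it is worth checking that the query returned at least one node before calling item(0), since a page without a match would otherwise trigger an error on nodeValue.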
166 00:07:26.800 --> 00:07:29.040 Like the scrapeBetween function mentioned? 167 00:07:28.759 --> 00:07:33.000 Perfect example. It's much simpler conceptually. Its whole job is 168 00:07:33.040 --> 00:07:36.120 to find a chunk of text that sits between two 169 00:07:36.279 --> 00:07:39.000 other known, unique strings. So you 170 00:07:38.920 --> 00:07:42.199 don't care about HTML tags, just find the text after the 171 00:07:42.240 --> 00:07:44.399 start marker and before the end marker. 172 00:07:44.319 --> 00:07:46.519 Precisely. You give it the whole chunk of text, like 173 00:07:46.560 --> 00:07:49.199 the page source, the starting string, and the ending string. 174 00:07:49.279 --> 00:07:50.040 How does it work? 175 00:07:50.240 --> 00:07:54.839 It uses basic PHP string functions: stripos to find the 176 00:07:54.879 --> 00:07:58.759 position of the start and end markers, then substr to 177 00:07:58.759 --> 00:08:00.959 cut out the piece of the string between those positions. 178 00:08:01.160 --> 00:08:04.240 Simple but effective if the markers are reliable. The source 179 00:08:04.360 --> 00:08:06.879 uses scraping a Google Analytics ID as an example. 180 00:08:07.000 --> 00:08:10.759 Yeah, that ID is often embedded in JavaScript between specific 181 00:08:10.839 --> 00:08:14.279 quote marks or function calls. XPath wouldn't easily grab that, 182 00:08:14.399 --> 00:08:16.319 but scrapeBetween works perfectly. 183 00:08:16.399 --> 00:08:19.120 Okay, so we have text extraction covered with XPath and 184 00:08:19.160 --> 00:08:22.399 custom functions. What about non-text content? Images? 185 00:08:22.600 --> 00:08:25.319 Good question. You often need to grab images too. The 186 00:08:25.319 --> 00:08:28.800 process combines things we've discussed. How so? First you usually 187 00:08:28.800 --> 00:08:32.000 find the image's URL using XPath. You look for an 188 00:08:32.080 --> 00:08:35.240 img tag and grab its src attribute.
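Stepping back to the scrapeBetween helper discussed a moment ago, a minimal version might look like this. It is a sketch in the spirit of the description (stripos plus substr), not the book's exact code; the sample JavaScript snippet and the tracking ID are placeholders.

```php
<?php
// Sketch of a scrapeBetween-style helper: return the text that sits between
// two known marker strings, or an empty string if either marker is missing.
// Assumption: returning '' on a miss is a choice made for this example.
function scrapeBetween(string $data, string $start, string $end): string
{
    $startPos = stripos($data, $start);            // case-insensitive search
    if ($startPos === false) {
        return '';
    }
    $startPos += strlen($start);                   // skip past the start marker
    $endPos = stripos($data, $end, $startPos);     // look for the end marker after it
    if ($endPos === false) {
        return '';
    }
    return substr($data, $startPos, $endPos - $startPos);
}

// Hypothetical page fragment with a placeholder analytics ID inside JavaScript.
$page = "<script>_gaq.push(['_setAccount', 'UA-0000000-0']);</script>";

echo scrapeBetween($page, "_setAccount', '", "'"), "\n";  // prints "UA-0000000-0"
```

The markers need to be specific enough to occur once; if the start marker appears earlier in the page than expected, the function will happily return the wrong slice.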
189 00:08:34.799 --> 00:08:37.360 So like img @src? Exactly. 190 00:08:37.559 --> 00:08:39.799 That gives you the URL of the image file. 191 00:08:39.960 --> 00:08:41.399 Then you use cURL again? 192 00:08:41.600 --> 00:08:44.960 Yep, you use your curlGet function or similar to 193 00:08:45.080 --> 00:08:47.240 download the content at that image URL. 194 00:08:47.279 --> 00:08:49.879 But this time you're expecting image data, not HTML. 195 00:08:50.120 --> 00:08:53.879 Right, binary data. And the source suggests a good practice: 196 00:08:54.480 --> 00:08:58.200 verify it actually is an image before saving it. How? 197 00:08:58.519 --> 00:09:01.279 PHP has a function, getimagesize. You can pass it 198 00:09:01.320 --> 00:09:04.519 the downloaded data or the file path, if you save 199 00:09:04.519 --> 00:09:07.879 it temporarily. It'll return image dimensions if it's valid, or 200 00:09:07.919 --> 00:09:09.879 false if it's not a recognized image type. 201 00:09:10.000 --> 00:09:12.440 Smart. So you verify it, and then? Then you 202 00:09:12.399 --> 00:09:15.679 just use standard PHP file functions: fopen to open 203 00:09:15.720 --> 00:09:18.440 a local file for writing, fwrite to write the image 204 00:09:18.480 --> 00:09:20.840 data you got from cURL into it, and fclose 205 00:09:20.919 --> 00:09:21.559 to close 206 00:09:21.279 --> 00:09:23.200 the file. And you've saved the image locally. 207 00:09:23.360 --> 00:09:27.120 You have. And that basic method, find URL, download with cURL, 208 00:09:27.200 --> 00:09:29.480 save with file functions, works for other file types too, 209 00:09:29.519 --> 00:09:30.440 like PDFs or whatever. 210 00:09:30.559 --> 00:09:33.159 Okay, fetching static stuff and images is one thing, but 211 00:09:33.240 --> 00:09:36.440 lots of data is behind logins or search forms. Yeah. 212 00:09:36.519 --> 00:09:38.039 How do you interact with sites like that? 213 00:09:38.240 --> 00:09:42.399 Yeah, this requires simulating form submissions.
Forms often use the 214 00:09:42.519 --> 00:09:45.600 HTTP POST method to send data. 215 00:09:45.279 --> 00:09:48.879 So you need to make POST requests with cURL? Exactly. 216 00:09:49.159 --> 00:09:51.840 The source shows a curlPost function example for this. 217 00:09:52.080 --> 00:09:54.039 What do you need to know to make that POST 218 00:09:54.080 --> 00:09:54.960 request work? 219 00:09:55.360 --> 00:09:58.080 You have to inspect the HTML form on the actual 220 00:09:58.080 --> 00:10:01.120 web page first. Look for the form tag itself. 221 00:10:01.720 --> 00:10:04.480 You need its action attribute. That's the URL you send 222 00:10:04.480 --> 00:10:05.879 the POST request to. 223 00:10:06.000 --> 00:10:07.679 Okay, the destination URL. Yep. 224 00:10:07.879 --> 00:10:10.039 Then you need to find all the input elements inside 225 00:10:10.039 --> 00:10:13.279 that form, and select or textarea too, potentially. What 226 00:10:13.360 --> 00:10:16.639 about them? You need their name attributes. Those names become 227 00:10:16.679 --> 00:10:19.039 the keys in the data you send, and you need 228 00:10:19.080 --> 00:10:20.720 the value you want to send for each name. 229 00:10:20.919 --> 00:10:24.720 So if there's an input named username, you send your username. 230 00:10:24.440 --> 00:10:27.919 Right. And crucially, don't forget hidden input fields. They often 231 00:10:27.960 --> 00:10:31.080 contain important stuff like session tokens or form IDs that 232 00:10:31.120 --> 00:10:34.519 the server expects back. The source's login example mentions needing 233 00:10:34.559 --> 00:10:38.000 email and password, but also destination and format, which might be 234 00:10:38.320 --> 00:10:39.080 hidden fields. 235 00:10:39.360 --> 00:10:41.840 Ah, got to check the source carefully. What about 236 00:10:41.840 --> 00:10:43.919 logins specifically? Don't they involve cookies?
237 00:10:44.240 --> 00:10:47.960 Absolutely vital. When you log in successfully, the server usually 238 00:10:48.000 --> 00:10:51.080 sends back cookies to track your session. For subsequent requests 239 00:10:51.159 --> 00:10:53.720 to restricted pages, you need to send those cookies back. 240 00:10:53.759 --> 00:10:55.480 How does cURL handle that? 241 00:10:55.720 --> 00:10:58.679 It has options for it. CURLOPT_COOKIEJAR tells 242 00:10:58.799 --> 00:11:03.200 cURL to save cookies it receives into a specified file. 243 00:11:03.240 --> 00:11:06.159 CURLOPT_COOKIEFILE tells cURL to read cookies from a file and 244 00:11:06.240 --> 00:11:07.799 send them with the request. 245 00:11:07.639 --> 00:11:10.000 So you log in, save the cookies, and then use 246 00:11:10.000 --> 00:11:12.679 those cookies for future requests to stay logged in. 247 00:11:12.879 --> 00:11:15.679 That's the basic idea. It maintains your session state. 248 00:11:15.919 --> 00:11:19.639 The source also mentions posting files, like simulating an upload. 249 00:11:19.840 --> 00:11:23.039 Yeah. If a form has an input of type file, you 250 00:11:23.080 --> 00:11:26.600 can simulate uploading a file using CURLOPT_POSTFIELDS. 251 00:11:27.279 --> 00:11:29.240 You set the value for that field name to the 252 00:11:29.279 --> 00:11:32.080 path of your local file, but you prefix the path 253 00:11:32.120 --> 00:11:33.080 with an @ symbol. 254 00:11:33.240 --> 00:11:36.639 cURL understands the @ means upload this file? Correct. 255 00:11:36.679 --> 00:11:39.279 It handles reading the file content and sending it appropriately. 256 00:11:39.360 --> 00:11:42.120 Okay, so you've sent the POST request, maybe logged in. 257 00:11:42.159 --> 00:11:43.519 How do you know if it actually worked?
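As a rough illustration of the POST and cookie handling covered above, here is one way a curlPost helper could be put together. The field names email, password, destination, and format come from the discussion; the values, the helper buildPostFields, and the function signatures are assumptions. One caveat on file uploads: the @ prefix was the convention when the book was written, but as far as I know it was deprecated in PHP 5.5 and no longer works in PHP 7 and later, so this sketch uses the CURLFile class instead.

```php
<?php
// Sketch of a curlPost-style helper with cookie handling.
// Assumptions: buildPostFields(), both signatures, and all field values are
// hypothetical; only the field names are taken from the discussion.

// Plain fields become a URL-encoded string. If file paths are given, an array
// containing CURLFile objects is returned so cURL sends multipart/form-data.
function buildPostFields(array $fields, array $files = [])
{
    if ($files === []) {
        return http_build_query($fields);
    }
    foreach ($files as $name => $path) {
        $fields[$name] = new CURLFile($path);   // modern replacement for "@path"
    }
    return $fields;
}

function curlPost(string $url, array $fields, string $cookieFile, array $files = []): array
{
    $ch = curl_init($url);                      // $url is the form's action attribute
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildPostFields($fields, $files),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile,  // save cookies the server sends
        CURLOPT_COOKIEFILE     => $cookieFile,  // send previously saved cookies
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    unset($ch);                                 // releasing the handle flushes the cookie jar
    return ['body' => $body === false ? null : $body, 'status' => $status];
}

$loginFields = [
    'email'       => 'user@example.com',
    'password'    => 'secret',
    'destination' => '/account',
    'format'      => 'html',
];

// Prints: email=user%40example.com&password=secret&destination=%2Faccount&format=html
echo buildPostFields($loginFields), "\n";
```

Passing the same cookie file to later requests (through CURLOPT_COOKIEFILE) is what keeps the session alive after the login call.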
258 00:11:43.679 --> 00:11:46.440 The simplest check shown in the source is just to 259 00:11:46.480 --> 00:11:48.799 look for a specific piece of text in the HTML 260 00:11:48.879 --> 00:11:52.840 response that you know only appears after a successful submission. 261 00:11:53.080 --> 00:11:57.080 Like login successful, or welcome back, user? Exactly. 262 00:11:57.240 --> 00:12:00.360 You get the response page source from cURL and search 263 00:12:00.399 --> 00:12:03.480 that string for your success message. If it's there, it probably 264 00:12:03.200 --> 00:12:09.039 worked. Right. Okay, single pages, forms. But the real power 265 00:12:09.080 --> 00:12:12.679 comes from scraping lots of pages, like product listings or 266 00:12:12.720 --> 00:12:16.399 search results that span multiple pages. How do you handle 267 00:12:16.440 --> 00:12:17.080 that pagination? 268 00:12:17.360 --> 00:12:20.039 Yeah, traversing multiple pages. You start by scraping the first 269 00:12:20.080 --> 00:12:20.639 page as 270 00:12:20.600 --> 00:12:23.600 usual. Get the data, find the image, whatever. Right. But 271 00:12:23.639 --> 00:12:26.559 while you're doing that, you also use XPath to look 272 00:12:26.559 --> 00:12:28.320 for the link to the next page. 273 00:12:28.039 --> 00:12:30.759 Like in a next button or page number link? Exactly. 274 00:12:30.799 --> 00:12:35.080 Common places are ul elements with class pagination or pager. You'd 275 00:12:35.080 --> 00:12:37.840 write an XPath query to find the a tag inside that, 276 00:12:37.919 --> 00:12:40.679 maybe specifically the one with text next, and grab its 277 00:12:40.799 --> 00:12:41.799 href attribute. 278 00:12:41.840 --> 00:12:44.120 So you get the URL for page two. Yep. 279 00:12:44.360 --> 00:12:47.120 Then you scrape page two, and on page two you 280 00:12:47.159 --> 00:12:49.159 look for the link to page three, and so on.
281 00:12:49.240 --> 00:12:51.720 You basically collect a list of all the page URLs 282 00:12:51.720 --> 00:12:52.360 you need to visit. 283 00:12:52.600 --> 00:12:55.080 Make sure they're full URLs, not relative ones? 284 00:12:55.200 --> 00:12:58.159 Good point. If they're relative links like page two, you 285 00:12:58.200 --> 00:13:01.720 need to prepend the base website URL to make them 286 00:13:01.759 --> 00:13:04.279 absolute before fetching with cURL. 287 00:13:04.440 --> 00:13:07.000 Then you just loop through your list of URLs, scraping 288 00:13:07.000 --> 00:13:07.519 each one. 289 00:13:07.639 --> 00:13:10.399 Pretty much. Fetch page, extract data, fetch next page, 290 00:13:10.399 --> 00:13:12.000 extract data, repeat. 291 00:13:12.120 --> 00:13:14.879 Now this sounds like it could hit the server pretty 292 00:13:14.879 --> 00:13:16.840 fast if you have hundreds of pages. 293 00:13:17.120 --> 00:13:20.039 It absolutely can, and that brings up a really critical point 294 00:13:20.039 --> 00:13:22.120 the source emphasizes: politeness. 295 00:13:22.279 --> 00:13:24.559 Right, don't be a nuisance. Or worse, 296 00:13:24.399 --> 00:13:27.919 get yourself blocked. Hammering a server with rapid-fire requests 297 00:13:28.000 --> 00:13:30.519 is bad form and often triggers automated defenses. 298 00:13:30.600 --> 00:13:31.600 So how do you be polite? 299 00:13:31.679 --> 00:13:34.080 The simplest, most common way shown is to just pause 300 00:13:34.120 --> 00:13:39.120 between requests. Use PHP's sleep function. The source suggests 301 00:13:39.480 --> 00:13:41.159 sleep(rand(1, 3)). 302 00:13:40.919 --> 00:13:44.039 Wait a random one to three seconds between fetching each page? 303 00:13:44.279 --> 00:13:47.600 Yeah, it slows your script down, mimics human browsing speed 304 00:13:47.639 --> 00:13:50.960 a bit more, and drastically reduces the load on their server. 305 00:13:51.480 --> 00:13:53.559 It's essential for any non-trivial scraping.
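The pagination loop and the polite pause described above could be sketched like this. Everything specific here is an assumption: the pagination markup (a ul with class pagination and a link whose text is Next), the page cap, and the idea of passing the fetcher in as a callable so the example can run against in-memory pages instead of a live site. The URL helper is deliberately simplified and does not resolve paths containing "../".

```php
<?php
// Sketch of following "Next" links with a pause between requests.
// Assumptions: the markup pattern, the $fetch callable, and the page cap
// are hypothetical choices for this example.

// Make a link absolute. Handles absolute, root-relative, and same-directory links.
function toAbsoluteUrl(string $href, string $pageUrl): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts = parse_url($pageUrl);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($href !== '' && $href[0] === '/') {
        return $root . $href;
    }
    $dir = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';
    return $root . $dir . $href;
}

// Return the absolute URL of the "Next" link, or null if there is none.
function findNextUrl(string $html, string $pageUrl): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query(
        '//ul[contains(@class, "pagination")]//a[normalize-space(.) = "Next"]/@href'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return toAbsoluteUrl($nodes->item(0)->nodeValue, $pageUrl);
}

// Fetch pages one after another, sleeping a random interval between requests.
function crawlPages(string $startUrl, callable $fetch, int $maxPages = 50,
                    int $minDelay = 1, int $maxDelay = 3): array
{
    $pages = [];
    $url   = $startUrl;
    while ($url !== null && !isset($pages[$url]) && count($pages) < $maxPages) {
        $html        = (string) $fetch($url);
        $pages[$url] = $html;                     // extract data from $html here
        $url         = findNextUrl($html, $url);
        if ($url !== null) {
            sleep(rand($minDelay, $maxDelay));    // be polite to the server
        }
    }
    return $pages;
}

// Demo against two in-memory pages, with the delay set to zero.
$fakeSite = [
    'http://example.com/list' =>
        '<html><body><ul class="pagination"><li><a href="/list?page=2">Next</a></li></ul></body></html>',
    'http://example.com/list?page=2' => '<html><body><p>Last page</p></body></html>',
];
$visited = crawlPages('http://example.com/list', function ($url) use ($fakeSite) {
    return $fakeSite[$url] ?? '';
}, 10, 0, 0);

echo implode("\n", array_keys($visited)), "\n";
```

Against a real site the callable would wrap a cURL fetch, and the default one to three second delay would stay in place.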
306 00:13:53.720 --> 00:13:57.519 Okay, vital tip. So you've scraped politely across many pages, 307 00:13:58.000 --> 00:14:02.000 extracted tons of data. Where does it all go? Printing 308 00:14:02.000 --> 00:14:03.000 to screen is useless 309 00:14:03.000 --> 00:14:06.639 now. Right, you need persistent storage. The obvious choice demonstrated 310 00:14:06.759 --> 00:14:10.960 is a database. Since XAMPP includes MySQL, that's the 311 00:14:11.039 --> 00:14:11.840 example used. 312 00:14:11.960 --> 00:14:13.600 First up is setting up the database table? 313 00:14:13.720 --> 00:14:16.600 Yep, you need to design your table structure. Now define 314 00:14:16.639 --> 00:14:19.799 columns that match the data points you're scraping, like book title, author, 315 00:14:20.320 --> 00:14:22.000 release date, ISBN, etc. 316 00:14:22.360 --> 00:14:24.279 And you can use phpMyAdmin for that? 317 00:14:24.320 --> 00:14:26.919 It's a handy graphical tool for creating the database and tables, 318 00:14:27.000 --> 00:14:28.519 setting data types, all that stuff. 319 00:14:28.600 --> 00:14:31.919 Okay, table's ready. How does the PHP script connect and 320 00:14:32.039 --> 00:14:33.519 insert the scraped data? 321 00:14:33.559 --> 00:14:37.440 The source uses PDO, PHP Data Objects. It's a standard, flexible 322 00:14:37.480 --> 00:14:40.480 way in PHP to talk to databases, including MySQL. 323 00:14:40.600 --> 00:14:42.639 You establish a connection using your database name, 324 00:14:42.639 --> 00:14:44.440 username, password, then insert the data. 325 00:14:44.480 --> 00:14:46.759 For inserting lots of items, the best practice shown is 326 00:14:46.840 --> 00:14:48.120 using prepared statements. 327 00:14:48.360 --> 00:14:50.679 Why is that better than just building insert strings? 328 00:14:50.759 --> 00:14:55.440 Two main reasons.
Security: it prevents SQL injection vulnerabilities. And 329 00:14:55.559 --> 00:14:58.440 often better performance when you're inserting many rows with the 330 00:14:58.440 --> 00:14:59.200 same structure. 331 00:14:59.320 --> 00:15:00.200 How do they work? 332 00:15:00.399 --> 00:15:03.919 You write the insert query once, but use placeholders like 333 00:15:04.000 --> 00:15:07.519 question marks or named parameters for the actual values. You 334 00:15:07.639 --> 00:15:11.399 prepare this query structure with the database. Okay. Then you 335 00:15:11.519 --> 00:15:14.639 loop through your array of scraped data items, like all 336 00:15:14.639 --> 00:15:17.600 the books you found. Inside the loop, for each book, 337 00:15:18.039 --> 00:15:21.960 you bind its specific title, author, etc. to the placeholders 338 00:15:22.000 --> 00:15:24.440 in the prepared statement, and then you execute it. 339 00:15:24.519 --> 00:15:26.879 So the query structure is set once and then just 340 00:15:26.919 --> 00:15:29.279 the data changes for each execution? Exactly. 341 00:15:29.519 --> 00:15:32.480 It's cleaner and safer. One row gets inserted into your 342 00:15:32.559 --> 00:15:35.159 database table for each item in your scraped data 343 00:15:35.000 --> 00:15:37.360 array. And the source also shows getting data back out? 344 00:15:37.440 --> 00:15:41.000 Yeah, completes the picture. Using a select query, again often 345 00:15:41.080 --> 00:15:44.679 via PDO, to fetch the data you saved, perhaps looping 346 00:15:44.720 --> 00:15:47.159 through the results to display them in an HTML table 347 00:15:47.200 --> 00:15:48.080 on a web page. 348 00:15:48.200 --> 00:15:52.480 Okay, this is getting quite sophisticated: fetching, parsing, interacting, storing. 349 00:15:52.759 --> 00:15:55.519 As scripts get bigger, the code could get messy, right? 350 00:15:55.960 --> 00:15:59.440 Repeating the same cURL setup or XPath 351 00:15:59.159 --> 00:16:03.759 creation? Definitely can.
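Before the conversation moves on, the prepared-statement loop described above can be sketched as follows. The table name books, the column names, the database name, and the credentials are all placeholders; only the idea (prepare once, execute per item with bound values) comes from the discussion.

```php
<?php
// Sketch of inserting scraped items with one prepared statement.
// Assumptions: the "books" table, its columns, and the shape of each $book
// array are hypothetical.
function insertBooks(PDO $pdo, array $books): int
{
    // The query is prepared once; named placeholders stand in for the values.
    $stmt = $pdo->prepare(
        'INSERT INTO books (title, author, release_date, isbn)
         VALUES (:title, :author, :release_date, :isbn)'
    );

    $inserted = 0;
    foreach ($books as $book) {
        // Each execute() binds this book's values to the placeholders.
        $ok = $stmt->execute([
            ':title'        => $book['title'],
            ':author'       => $book['author'],
            ':release_date' => $book['release_date'],
            ':isbn'         => $book['isbn'],
        ]);
        if ($ok) {
            $inserted++;
        }
    }
    return $inserted;
}

// Connecting to the MySQL server bundled with XAMPP might look like this
// (database name and credentials are placeholders):
//
//   $pdo = new PDO('mysql:host=localhost;dbname=scraping;charset=utf8',
//                  'root', '', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
//   echo insertBooks($pdo, $scrapedBooks), " rows inserted\n";
```

Because the values never become part of the SQL text, a scraped title containing quotes or SQL keywords is stored as plain data rather than being interpreted by the database.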
That's where the source introduces object-oriented 352 00:16:03.799 --> 00:16:08.039 programming, or OOP, principles as a way to organize things better. 353 00:16:07.919 --> 00:16:10.000 Making the code reusable and tidier? 354 00:16:10.080 --> 00:16:13.159 That's the goal. The core idea is creating a class, 355 00:16:13.519 --> 00:16:16.240 which is like a blueprint for an object. It bundles 356 00:16:16.240 --> 00:16:19.519 together related data, properties, and functions that operate on that 357 00:16:19.639 --> 00:16:20.399 data, methods. 358 00:16:20.559 --> 00:16:22.799 The book uses a human class analogy. 359 00:16:22.919 --> 00:16:25.720 Yeah, like a human blueprint might define properties like name 360 00:16:25.759 --> 00:16:29.519 and age, and methods like speak or walk. An object 361 00:16:29.639 --> 00:16:32.320 is a specific instance created from that blueprint, like $bob 362 00:16:32.360 --> 00:16:35.679 = new Human(), where Bob has his own specific name 363 00:16:35.720 --> 00:16:36.120 and age. 364 00:16:36.240 --> 00:16:39.200 So how does the example Scrape class in the source 365 00:16:39.320 --> 00:16:39.919 apply this? 366 00:16:40.240 --> 00:16:43.679 It takes the common scraping tasks, fetching a URL, creating 367 00:16:43.759 --> 00:16:46.519 the XPath object, and puts them inside the class definition. 368 00:16:46.600 --> 00:16:49.360 As methods? Exactly. And it often uses a special method 369 00:16:49.360 --> 00:16:52.879 called __construct. What's that do? The constructor runs automatically whenever 370 00:16:52.919 --> 00:16:54.679 you create a new object from the class. So in 371 00:16:54.759 --> 00:16:57.320 the Scrape class example, when you write $pageScraper = new 372 00:16:57.399 --> 00:17:01.120 Scrape('http://example.com'), the __construct method immediately 373 00:17:01.120 --> 00:17:01.519 takes that 374 00:17:01.679 --> 00:17:03.799 URL, the one you just passed in, calls
375 00:17:03.559 --> 00:17:06.599 an internal method, maybe curlGet, to fetch the source 376 00:17:06.640 --> 00:17:09.799 code for that specific URL, and calls another internal method 377 00:17:09.920 --> 00:17:12.960 like returnXPathObject to create the XPath object for 378 00:17:13.000 --> 00:17:15.160 that source. It does the initial setup 379 00:17:14.799 --> 00:17:17.519 work. Ah, so the object is immediately ready with the 380 00:17:17.559 --> 00:17:21.160 source and the XPath tool for its specific URL? Precisely. 381 00:17:21.319 --> 00:17:24.039 The pageScraper object now holds its own source code 382 00:17:24.039 --> 00:17:26.960 property and its XPath object property, ready for you to use. 383 00:17:27.200 --> 00:17:30.559 You'd access them like pageScraper, xPathObject, query. 384 00:17:30.279 --> 00:17:32.519 And you could add more methods to the Scrape class, 385 00:17:32.559 --> 00:17:34.160 like save image or submit form? 386 00:17:34.240 --> 00:17:37.599 Absolutely. You build up a reusable toolkit within the class. 387 00:17:38.119 --> 00:17:40.720 Create a Scrape object for any URL, and you have 388 00:17:40.799 --> 00:17:43.359 all your scraping tools ready to work on that specific 389 00:17:43.400 --> 00:17:47.359 page's content. Makes the main part of your script much cleaner. Very 390 00:17:47.200 --> 00:17:51.519 neat. Okay, final piece: automation. You've built this great script, 391 00:17:52.319 --> 00:17:56.000 maybe using a class. It saves to a database. How 392 00:17:56.000 --> 00:17:58.799 do you make it run automatically, say every night? 393 00:17:58.960 --> 00:18:00.799 Yeah, you don't want to manually run it every time. 394 00:18:00.920 --> 00:18:03.400 This is where scheduling comes in. The source gives an 395 00:18:03.400 --> 00:18:05.440 example using Windows Task Scheduler. 396 00:18:05.880 --> 00:18:08.799 So it's not code within the PHP script itself?
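Returning briefly to the Scrape class described above, a minimal sketch of that structure might look like the following. The property names source and xPathObject are assumptions, and the optional second constructor argument is an addition of this example (not something the source describes) so the class can be tried on an HTML string without making a network request.

```php
<?php
// Sketch of a Scrape-style class: the constructor fetches the page and
// builds the XPath object so the instance is ready to query.
// Assumptions: property names, method visibility, and the optional $source
// parameter are hypothetical.
class Scrape
{
    /** @var string Raw HTML of the page */
    public $source;

    /** @var DOMXPath Query object for the page */
    public $xPathObject;

    public function __construct(string $url, ?string $source = null)
    {
        // Use the supplied HTML if given, otherwise download it.
        $this->source      = $source ?? $this->curlGet($url);
        $this->xPathObject = $this->returnXPathObject($this->source);
    }

    private function curlGet(string $url): string
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $result = curl_exec($ch);
        return $result === false ? '' : $result;   // empty string on failure
    }

    private function returnXPathObject(string $html): DOMXPath
    {
        $dom = new DOMDocument();
        if (trim($html) !== '') {
            @$dom->loadHTML($html);                // tolerate imperfect markup
        }
        return new DOMXPath($dom);
    }
}

// Offline demo: the HTML is passed in directly, so no request is made.
$pageScraper = new Scrape('http://example.com/', '<html><body><h1>Hello</h1></body></html>');
echo $pageScraper->xPathObject->query('//h1')->item(0)->nodeValue, "\n";  // prints "Hello"
```

Further methods, such as one for saving an image or submitting a form, would sit alongside curlGet and reuse the same stored source and XPath object.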
397 00:18:09.000 --> 00:18:12.839 No. Usually you leverage the operating system's scheduling tools: Task 398 00:18:12.880 --> 00:18:16.079 Scheduler on Windows, or cron jobs on Linux or macOS. 399 00:18:16.759 --> 00:18:18.480 They are built for exactly this purpose. 400 00:18:18.559 --> 00:18:19.240 How does it work? 401 00:18:19.279 --> 00:18:21.920