WEBVTT

1
00:00:00.080 --> 00:00:03.879
<v Speaker 1>Okay, welcome to the deep dive. Today we're jumping into web scraping.

2
00:00:04.440 --> 00:00:09.679
<v Speaker 2>That's right, specifically using PHP, based on some key ideas

3
00:00:09.679 --> 00:00:12.679
<v Speaker 2>from our source, Instant PHP Web Scraping.

4
00:00:12.759 --> 00:00:15.599
<v Speaker 1>Yeah, it's from 2013, by Jacob Ward. So things

5
00:00:15.640 --> 00:00:19.519
<v Speaker 1>have moved on obviously, but the core ideas they're often

6
00:00:19.559 --> 00:00:20.800
<v Speaker 1>still relevant, aren't they?

7
00:00:20.559 --> 00:00:23.280
<v Speaker 2>They really are. The book was aimed at beginners,

8
00:00:23.399 --> 00:00:26.760
<v Speaker 2>you know, showing how to programmatically crawl websites, download content,

9
00:00:27.280 --> 00:00:31.280
<v Speaker 2>and basically turn unstructured web stuff into structured data using

10
00:00:32.119 --> 00:00:32.799
<v Speaker 2>using PHP.

11
00:00:33.159 --> 00:00:35.640
<v Speaker 1>So our mission here is to pull out those fundamental

12
00:00:35.679 --> 00:00:38.880
<v Speaker 1>techniques from these excerpts, to give you a solid grounding in

13
00:00:38.920 --> 00:00:41.200
<v Speaker 1>how PHP web scraping works at its core.

14
00:00:41.439 --> 00:00:44.439
<v Speaker 2>Even if you're maybe adapting these ideas for more modern

15
00:00:44.479 --> 00:00:47.280
<v Speaker 2>sites later on, the basics often carry through.

16
00:00:47.039 --> 00:00:50.159
<v Speaker 1>Exactly. The source assumes maybe not a ton of

17
00:00:50.200 --> 00:00:54.280
<v Speaker 1>programming experience, though knowing some PHP and HTML definitely helps.

18
00:00:53.840 --> 00:00:56.200
<v Speaker 2>Sure, but the focus is really on the scraping

19
00:00:56.240 --> 00:00:57.119
<v Speaker 2>concepts themselves.

20
00:00:57.200 --> 00:00:59.640
<v Speaker 1>All right, let's kick things off. Before you scrape, you

21
00:00:59.679 --> 00:01:03.039
<v Speaker 1>need your tools. What's the basic toolkit according to these sources?

22
00:01:03.320 --> 00:01:07.760
<v Speaker 2>Okay, so first, obviously, you need PHP itself. That's the language, right?

23
00:01:08.040 --> 00:01:10.840
<v Speaker 2>Then a good place to write your code: an IDE, an

24
00:01:11.640 --> 00:01:16.760
<v Speaker 2>integrated development environment. The source mentions Eclipse PDT.

25
00:01:16.400 --> 00:01:19.439
<v Speaker 1>PDT being the PHP development tools for Eclipse, so a

26
00:01:19.480 --> 00:01:20.680
<v Speaker 1>specialized code editor.

27
00:01:20.840 --> 00:01:24.480
<v Speaker 2>Yeah. Basically makes coding easier, keeps things organized. And then

28
00:01:24.519 --> 00:01:26.680
<v Speaker 2>you need a way to run the PHP.

29
00:01:26.640 --> 00:01:29.680
<v Speaker 1>And probably a database too, like a local server setup?

30
00:01:30.000 --> 00:01:34.040
<v Speaker 2>Exactly. The source recommends XAMPP. It bundles Apache, which

31
00:01:34.079 --> 00:01:37.599
<v Speaker 2>is the web server, PHP, and MySQL, the database, all

32
00:01:37.599 --> 00:01:38.400
<v Speaker 2>in one package.

33
00:01:38.480 --> 00:01:41.239
<v Speaker 1>Ah, convenient. Avoids installing everything separately.

34
00:01:41.400 --> 00:01:44.159
<v Speaker 2>Yeah, and it even includes phpMyAdmin.

35
00:01:44.120 --> 00:01:48.359
<v Speaker 1>Oh, right, for managing the MySQL database visually. Useful later, definitely. Okay,

36
00:01:48.359 --> 00:01:53.439
<v Speaker 1>so you install XAMPP, maybe Eclipse. Any specific setup tweaks needed?

37
00:01:53.599 --> 00:01:55.959
<v Speaker 2>A couple of key things the source points out. Setting

38
00:01:56.000 --> 00:01:58.599
<v Speaker 2>your PHP path variable is good practice. Lets you run

39
00:01:58.640 --> 00:02:01.920
<v Speaker 2>PHP scripts easily from the command line for testing and stuff, right,

40
00:02:01.959 --> 00:02:06.280
<v Speaker 2>But the really critical one for scraping is enabling the cURL extension.

41
00:02:05.680 --> 00:02:07.920
<v Speaker 1>cURL, okay. What is that, exactly?

42
00:02:08.039 --> 00:02:10.719
<v Speaker 2>It's a PHP library. You need it enabled in your

43
00:02:10.759 --> 00:02:13.759
<v Speaker 2>main PHP config file, php.ini.

44
00:02:14.560 --> 00:02:17.879
<v Speaker 2>Without it, your PHP script can't really make web requests easily.

45
00:02:18.000 --> 00:02:21.080
<v Speaker 1>Ah, so it's essential for fetching pages programmatically.

46
00:02:21.240 --> 00:02:23.199
<v Speaker 2>Absolutely. And then, you know, just test the setup, make

47
00:02:23.240 --> 00:02:26.120
<v Speaker 2>sure Apache runs. Maybe run a simple phpinfo() script

48
00:02:26.280 --> 00:02:28.479
<v Speaker 2>to see if curl is listed as enabled.
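
NOTE
A minimal sketch of that check from the command line, assuming PHP is on your PATH.
The file name check_curl.php and the helper name curlStatus are hypothetical, not from the source.
In php.ini the relevant line is usually extension=curl (extension=php_curl.dll on older Windows builds).
```php
<?php
// check_curl.php (hypothetical file name): run with `php check_curl.php`.
// extension_loaded() reports whether a PHP extension is active.
function curlStatus()
{
    return extension_loaded('curl')
        ? 'cURL is enabled'
        : 'cURL is NOT enabled; check php.ini';
}
echo curlStatus(), PHP_EOL;
```
phpinfo() in a browser shows the same information under a "curl" heading.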

49
00:02:28.560 --> 00:02:32.280
<v Speaker 1>Got it. So, toolkit ready, cURL enabled. Now the

50
00:02:32.360 --> 00:02:36.520
<v Speaker 1>first actual step in scraping: getting the web page.

51
00:02:36.199 --> 00:02:39.520
<v Speaker 2>Fetching the content. Yeah, this is where cURL comes into

52
00:02:39.520 --> 00:02:40.560
<v Speaker 2>play directly.

53
00:02:40.199 --> 00:02:42.800
<v Speaker 1>Because it handles HTTP requests.

54
00:02:42.520 --> 00:02:46.080
<v Speaker 2>Exactly. It lets your script act like a browser, essentially sending

55
00:02:46.120 --> 00:02:48.599
<v Speaker 2>a request to a URL and getting back the HTML

56
00:02:48.680 --> 00:02:49.439
<v Speaker 2>source code.

57
00:02:49.360 --> 00:02:51.960
<v Speaker 1>And the source provides a function example, curlGet. What's

58
00:02:52.000 --> 00:02:53.800
<v Speaker 1>the basic flow there?

59
00:02:53.840 --> 00:02:56.280
<v Speaker 2>It's pretty logical. You initialize a cURL session, think of it as opening

60
00:02:56.280 --> 00:02:59.120
<v Speaker 2>a connection channel. Then you set options for that session,

61
00:02:59.360 --> 00:03:00.280
<v Speaker 2>tell it what you want to do.

62
00:03:00.280 --> 00:03:01.560
<v Speaker 1>Like the URL you want to fetch?

63
00:03:01.840 --> 00:03:04.800
<v Speaker 2>CURLOPT_URL is the main one, and critically, CURLOPT_RETURNTRANSFER.

64
00:03:04.919 --> 00:03:07.400
<v Speaker 2>You usually want that set to true. Why

65
00:03:07.439 --> 00:03:10.759
<v Speaker 2>is that? So that cURL returns the page content as

66
00:03:10.800 --> 00:03:13.919
<v Speaker 2>a string variable in your PHP script instead of just

67
00:03:13.960 --> 00:03:16.400
<v Speaker 2>like printing it straight to the screen. You need it

68
00:03:16.439 --> 00:03:17.800
<v Speaker 2>as a variable to work with it.

69
00:03:17.759 --> 00:03:20.800
<v Speaker 1>Right, makes sense. Any other key options?

70
00:03:20.840 --> 00:03:23.439
<v Speaker 2>Oh yeah, CURLOPT_FOLLOWLOCATION is super useful.

71
00:03:23.280 --> 00:03:25.840
<v Speaker 1>For redirects, like 301s? Exactly.

72
00:03:26.039 --> 00:03:29.159
<v Speaker 2>Websites often redirect you. This option tells cURL to

73
00:03:29.199 --> 00:03:32.639
<v Speaker 2>automatically follow those redirects to the final page. Saves you

74
00:03:32.680 --> 00:03:34.120
<v Speaker 2>a lot of hassle. Nice.

75
00:03:34.240 --> 00:03:36.199
<v Speaker 1>What about CURLOPT_USERAGENT? The source gives

76
00:03:36.159 --> 00:03:38.319
<v Speaker 1>an example string.

77
00:03:38.039 --> 00:03:41.319
<v Speaker 2>Ah, the user agent. It's basically a string that identifies your

78
00:03:41.319 --> 00:03:45.400
<v Speaker 2>client, your script, to the web server. Why? Well, partly politeness,

79
00:03:45.439 --> 00:03:48.560
<v Speaker 2>partly necessity. Some servers block requests that don't have a

80
00:03:48.639 --> 00:03:50.479
<v Speaker 2>user agent string that looks like it's from a normal

81
00:03:50.479 --> 00:03:53.360
<v Speaker 2>web browser, so sending one makes your script look less

82
00:03:53.360 --> 00:03:54.199
<v Speaker 2>like a basic bot.

83
00:03:53.919 --> 00:03:56.840
<v Speaker 1>Okay, helps avoid immediate blocks, potentially.
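
NOTE
The options discussed so far can be sketched as one small helper. This is an illustration, not the book's exact code.
The user agent string below is a made-up placeholder, and the function body is an assumption built from the options named in the conversation.
```php
<?php
// Sketch of a GET helper using the options discussed above.
function curlGet($url)
{
    $ch = curl_init($url);                            // open a cURL session for this URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects such as 301 and 302
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'); // placeholder value
    $html = curl_exec($ch);                           // string on success, false on failure
    return $html;
}
// Usage: $html = curlGet('http://example.com/');
```
curl_exec() returns false when the transfer fails, so callers should check for that before parsing.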

84
00:03:56.960 --> 00:03:59.479
<v Speaker 2>Yeah. Then there's CURLOPT_HTTPHEADER if you need to

85
00:03:59.520 --> 00:04:02.919
<v Speaker 2>send custom headers, sometimes needed for specific sites, and

86
00:04:02.960 --> 00:04:04.039
<v Speaker 2>CURLOPT_FAILONERROR.

87
00:04:04.080 --> 00:04:04.680
<v Speaker 1>What does that do?

88
00:04:05.000 --> 00:04:08.439
<v Speaker 2>It tells cURL to treat HTTP error codes like 404

89
00:04:08.439 --> 00:04:11.280
<v Speaker 2>Not Found or 500 server error as,

90
00:04:11.479 --> 00:04:12.560
<v Speaker 2>well, actual script errors.

91
00:04:12.199 --> 00:04:15.000
<v Speaker 1>Instead of just returning an empty page or

92
00:04:15.039 --> 00:04:15.719
<v Speaker 1>an error page?

93
00:04:15.919 --> 00:04:18.000
<v Speaker 2>Right, it can be a simple way to detect if

94
00:04:18.040 --> 00:04:20.000
<v Speaker 2>the request failed badly.

95
00:04:20.319 --> 00:04:23.800
<v Speaker 1>Okay. And the source mentions checking the HTTP response code itself,

96
00:04:24.040 --> 00:04:25.120
<v Speaker 1>using curl_getinfo.

97
00:04:25.439 --> 00:04:29.040
<v Speaker 2>Why bother? Because knowing the code tells you exactly what happened.

98
00:04:29.120 --> 00:04:31.319
<v Speaker 2>200 OK means success; you have the page.

99
00:04:31.439 --> 00:04:34.839
<v Speaker 1>404 means it doesn't exist. Right, crucial info.

100
00:04:35.480 --> 00:04:37.879
<v Speaker 1>A 403 Forbidden means you don't have permission.

101
00:04:38.279 --> 00:04:41.040
<v Speaker 2>Maybe you need to log in or your IP is blocked.

102
00:04:40.879 --> 00:04:44.600
<v Speaker 1>301 Moved Permanently, which CURLOPT_FOLLOWLOCATION handles, but

103
00:04:44.839 --> 00:04:45.319
<v Speaker 1>good to know.

104
00:04:45.639 --> 00:04:49.399
<v Speaker 2>Yeah, checking the status code is fundamental for robust error handling.

105
00:04:49.519 --> 00:04:50.759
<v Speaker 2>You know why something might have failed.
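
NOTE
A hedged sketch of the status check described above. The helper names curlGetWithStatus and describeStatus
are hypothetical; only curl_getinfo() and the CURLINFO_HTTP_CODE constant come from PHP itself.
```php
<?php
// Fetch a URL and report the HTTP status alongside the body.
function curlGetWithStatus($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response arrived
    return ['status' => $status, 'body' => $body];
}
// Map a status code to a short explanation, using the codes from the conversation.
function describeStatus($status)
{
    switch ($status) {
        case 200: return 'OK: page fetched';
        case 301: return 'Moved Permanently';
        case 403: return 'Forbidden: no permission (login needed or IP blocked?)';
        case 404: return 'Not Found: page does not exist';
        default:  return $status >= 500 ? 'Server error' : 'Unhandled status ' . $status;
    }
}
```
Logging the described status next to each URL makes failed fetches much easier to diagnose later.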

106
00:04:50.800 --> 00:04:54.120
<v Speaker 1>Okay, so cURL gets you the raw HTML, maybe a

107
00:04:54.160 --> 00:04:57.439
<v Speaker 1>massive string of code. Now the real challenge: finding the

108
00:04:57.439 --> 00:04:59.639
<v Speaker 1>specific bit of data you want inside all that.

109
00:05:00.360 --> 00:05:03.160
<v Speaker 2>Extraction time. And the main tool the source introduces here

110
00:05:03.279 --> 00:05:04.759
<v Speaker 2>is XPath.

111
00:05:04.800 --> 00:05:06.519
<v Speaker 1>XPath. I've heard of it with XML. How does it apply

112
00:05:06.560 --> 00:05:07.199
<v Speaker 1>to HTML?

113
00:05:07.439 --> 00:05:10.639
<v Speaker 2>Well, HTML isn't always perfect XML, but it's structured, right,

114
00:05:10.680 --> 00:05:13.959
<v Speaker 2>with tags and attributes. You can parse that downloaded HTML

115
00:05:14.000 --> 00:05:17.720
<v Speaker 2>string into something called a DOM, a Document Object Model.

116
00:05:17.519 --> 00:05:18.879
<v Speaker 1>A tree structure of the page.

117
00:05:19.199 --> 00:05:22.920
<v Speaker 2>Precisely. And XPath is a language specifically for navigating that

118
00:05:22.959 --> 00:05:27.639
<v Speaker 2>tree and selecting nodes (elements, attributes, text) based on their

119
00:05:27.639 --> 00:05:28.959
<v Speaker 2>path or characteristics.

120
00:05:29.079 --> 00:05:32.040
<v Speaker 1>So it's more structured than just like searching for keywords

121
00:05:32.079 --> 00:05:33.120
<v Speaker 1>in the string?

122
00:05:33.439 --> 00:05:36.839
<v Speaker 2>Much more. The source shows a function, returnXPathObject. It basically

123
00:05:36.839 --> 00:05:37.959
<v Speaker 2>takes the HTML string.

124
00:05:38.000 --> 00:05:39.759
<v Speaker 1>The one you got from cURL. Right.

125
00:05:39.800 --> 00:05:43.120
<v Speaker 2>It uses PHP's built-in DOMDocument class to load that HTML,

126
00:05:43.399 --> 00:05:44.600
<v Speaker 2>even if it's a bit messy.

127
00:05:44.800 --> 00:05:46.920
<v Speaker 1>I see the source uses an @ symbol before

128
00:05:46.959 --> 00:05:49.680
<v Speaker 1>loadHTML. Is that related to messy HTML?

129
00:05:49.800 --> 00:05:53.279
<v Speaker 2>It is. Real-world HTML often has minor errors. The

130
00:05:53.319 --> 00:05:56.920
<v Speaker 2>@ symbol in PHP suppresses warnings that loadHTML might

131
00:05:56.959 --> 00:06:00.000
<v Speaker 2>generate because of that imperfect markup. It stops your script

132
00:06:00.040 --> 00:06:01.600
<v Speaker 2>potentially halting on minor issues.

133
00:06:01.639 --> 00:06:03.959
<v Speaker 1>Ah, a practical trick for scraping. Okay, so DOMDocument

134
00:06:04.000 --> 00:06:05.279
<v Speaker 1>loads the HTML. Then what?

135
00:06:05.360 --> 00:06:08.120
<v Speaker 2>Then you create a DOMXPath object from that DOMDocument,

136
00:06:08.439 --> 00:06:10.519
<v Speaker 2>and that XPath object is what you use to run

137
00:06:10.519 --> 00:06:11.040
<v Speaker 2>your queries.

138
00:06:11.120 --> 00:06:14.040
<v Speaker 1>Okay, queries. The source has examples like //h1 or

139
00:06:14.079 --> 00:06:16.000
<v Speaker 1>//span[@class='some class'].

140
00:06:16.040 --> 00:06:19.279
<v Speaker 2>Exactly. Those are XPath expressions. //h1 means find any h1

141
00:06:19.360 --> 00:06:21.399
<v Speaker 2>element anywhere in the document.

142
00:06:21.120 --> 00:06:22.199
<v Speaker 1>And the span with @class?

143
00:06:22.279 --> 00:06:25.439
<v Speaker 2>That's more specific, find any span element that has an

144
00:06:25.439 --> 00:06:29.279
<v Speaker 2>attribute named class with the exact value some class.

145
00:06:29.399 --> 00:06:31.160
<v Speaker 1>What about @href at the end of one example?

146
00:06:31.120 --> 00:06:33.800
<v Speaker 2>That's selecting an attribute. So

147
00:06:33.839 --> 00:06:36.439
<v Speaker 2>maybe it found a specific link, an a tag, and adding

148
00:06:36.439 --> 00:06:39.040
<v Speaker 2>@href says get the value of its href attribute,

149
00:06:39.240 --> 00:06:40.199
<v Speaker 2>the URL itself.

150
00:06:40.279 --> 00:06:43.519
<v Speaker 1>So you run these queries against the XPath object and it

151
00:06:43.480 --> 00:06:46.639
<v Speaker 2>gives you back a list of matching nodes, elements or attributes.

152
00:06:46.240 --> 00:06:49.439
<v Speaker 1>And then the source shows item(0)->nodeValue

153
00:06:49.519 --> 00:06:51.240
<v Speaker 1>to get the actual text.

154
00:06:51.439 --> 00:06:54.240
<v Speaker 2>Right, the query might find multiple matches, so item(0)

155
00:06:54.399 --> 00:06:56.600
<v Speaker 2>gets the first one in the list. Then nodeValue

156
00:06:56.720 --> 00:07:00.000
<v Speaker 2>extracts the text content from inside that element.
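
NOTE
The DOM and XPath steps above can be sketched as follows. The function name follows the one mentioned in the
conversation; its body, the sample HTML and the class name "price" are assumptions made for illustration.
```php
<?php
// Build a DOMXPath object from an HTML string.
function returnXPathObject($html)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);      // @ hides warnings caused by imperfect markup
    return new DOMXPath($dom);
}
// Made-up sample page so the queries can run without a network request.
$html = '<html><body><h1>Book Title</h1><span class="price">9.99</span><a href="/next">Next</a></body></html>';
$xpath = returnXPathObject($html);
$title = $xpath->query('//h1')->item(0)->nodeValue;                     // first h1 anywhere in the document
$price = $xpath->query('//span[@class="price"]')->item(0)->nodeValue;   // span whose class is exactly "price"
$link  = $xpath->query('//a/@href')->item(0)->nodeValue;                // value of the first link's href attribute
echo $title, ' | ', $price, ' | ', $link, PHP_EOL;
```
In real code, check that query() returned at least one node before calling item(0), since item(0) is null when nothing matches.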

157
00:07:00.079 --> 00:07:03.120
<v Speaker 1>Okay, so XPath is powerful for navigating that structure.

158
00:07:03.399 --> 00:07:06.199
<v Speaker 2>Very. The source has a table with common expressions [unclear],

159
00:07:06.319 --> 00:07:10.439
<v Speaker 2>using brackets for conditions. That's your vocabulary

160
00:07:10.480 --> 00:07:11.519
<v Speaker 2>for building these queries.

161
00:07:11.600 --> 00:07:14.439
<v Speaker 1>What if the data isn't neatly inside a tag or

162
00:07:14.480 --> 00:07:17.439
<v Speaker 1>the structure's just chaotic? XPath might not work then.

163
00:07:17.680 --> 00:07:22.800
<v Speaker 2>Exactly. Sometimes XPath is overkill or just plain impossible. That's

164
00:07:22.800 --> 00:07:24.959
<v Speaker 2>where, as the source shows, you might need a more direct

165
00:07:25.000 --> 00:07:26.920
<v Speaker 2>approach: custom functions.

166
00:07:26.800 --> 00:07:29.040
<v Speaker 1>Like the scrapeBetween function mentioned?

167
00:07:28.759 --> 00:07:33.000
<v Speaker 2>Perfect example. It's much simpler conceptually. Its whole job is

168
00:07:33.040 --> 00:07:36.120
<v Speaker 2>to find a chunk of text that sits between two

169
00:07:36.279 --> 00:07:39.000
<v Speaker 2>other known, unique strings.

170
00:07:38.920 --> 00:07:42.199
<v Speaker 1>So you don't care about HTML tags, just find the text after

171
00:07:42.240 --> 00:07:44.399
<v Speaker 1>the start marker and before the end marker?

172
00:07:44.319 --> 00:07:46.519
<v Speaker 2>Precisely. You give it the whole chunk of text, like

173
00:07:46.560 --> 00:07:49.199
<v Speaker 2>the page source, the starting string and the ending string.

174
00:07:49.279 --> 00:07:50.040
<v Speaker 1>How does it work?

175
00:07:50.240 --> 00:07:54.839
<v Speaker 2>It uses basic PHP string functions: stripos to find the

176
00:07:54.879 --> 00:07:58.759
<v Speaker 2>position of the start and end markers, then substr to

177
00:07:58.759 --> 00:08:00.959
<v Speaker 2>cut out the piece of the string between those positions.

178
00:08:01.160 --> 00:08:04.240
<v Speaker 1>Simple but effective if the markers are reliable. The source

179
00:08:04.360 --> 00:08:06.879
<v Speaker 1>uses scraping a Google Analytics ID as an example.

180
00:08:07.000 --> 00:08:10.759
<v Speaker 2>Yeah, that ID is often embedded in JavaScript between specific

181
00:08:10.839 --> 00:08:14.279
<v Speaker 2>quote marks or function calls. XPath wouldn't easily grab that,

182
00:08:14.399 --> 00:08:16.319
<v Speaker 2>but scrapeBetween works perfectly.
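
NOTE
One way the marker-based extraction could look, using stripos and substr as described. This is a sketch,
not necessarily the book's implementation, and the analytics ID in the sample is a made-up placeholder.
```php
<?php
// Return the text between $start and $end, or false if either marker is missing.
function scrapeBetween($data, $start, $end)
{
    $from = stripos($data, $start);            // case-insensitive search for the start marker
    if ($from === false) { return false; }
    $from += strlen($start);                   // skip past the start marker itself
    $to = stripos($data, $end, $from);         // look for the end marker after that point
    if ($to === false) { return false; }
    return substr($data, $from, $to - $from);  // cut out the piece in between
}
// Made-up page fragment with a placeholder tracking ID.
$page = "<script>ga('create', 'UA-0000000-1', 'auto');</script>";
echo scrapeBetween($page, "ga('create', '", "'"), PHP_EOL;
```
The approach is only as reliable as the markers: if the site changes the surrounding text, the function returns false.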

183
00:08:16.399 --> 00:08:19.120
<v Speaker 1>Okay, so we have text extraction covered with XPath and

184
00:08:19.160 --> 00:08:22.399
<v Speaker 1>custom functions. What about non-text content? Images?

185
00:08:22.600 --> 00:08:25.319
<v Speaker 2>Good question. You often need to grab images too. The

186
00:08:25.319 --> 00:08:28.800
<v Speaker 2>process combines things we've discussed. How so? First, you usually

187
00:08:28.800 --> 00:08:32.000
<v Speaker 2>find the image's URL using XPath. You look for an

188
00:08:32.080 --> 00:08:35.240
<v Speaker 2>img tag and grab its src attribute.

189
00:08:34.799 --> 00:08:37.360
<v Speaker 1>So, like //img/@src? Exactly.

190
00:08:37.559 --> 00:08:39.799
<v Speaker 2>That gives you the URL of the image file.

191
00:08:39.960 --> 00:08:41.399
<v Speaker 1>Then you use cURL again?

192
00:08:41.600 --> 00:08:44.960
<v Speaker 2>Yep, you use your curlGet function or similar to

193
00:08:45.080 --> 00:08:47.240
<v Speaker 2>download the content at that image URL.

194
00:08:47.279 --> 00:08:49.879
<v Speaker 1>But this time you're expecting image data, not HTML.

195
00:08:50.120 --> 00:08:53.879
<v Speaker 2>Right, binary data. And the source suggests a good practice:

196
00:08:54.480 --> 00:08:58.200
<v Speaker 2>verify it actually is an image before saving it. How?

197
00:08:58.519 --> 00:09:01.279
<v Speaker 2>PHP has a function, getimagesize. You can pass it

198
00:09:01.320 --> 00:09:04.519
<v Speaker 2>the downloaded data or the file path, if you save

199
00:09:04.519 --> 00:09:07.879
<v Speaker 2>it temporarily. It'll return the image dimensions if it's valid, or

200
00:09:07.919 --> 00:09:09.879
<v Speaker 2>false if it's not a recognized image type.

201
00:09:10.000 --> 00:09:12.440
<v Speaker 1>Smart. So you verify it, and then?

202
00:09:12.399 --> 00:09:15.679
<v Speaker 2>Then you just use standard PHP file functions: fopen to open

203
00:09:15.720 --> 00:09:18.440
<v Speaker 2>a local file for writing, fwrite to write the image

204
00:09:18.480 --> 00:09:20.840
<v Speaker 2>data you got from cURL into it, and fclose

205
00:09:20.919 --> 00:09:21.559
<v Speaker 2>to close the file.

206
00:09:21.279 --> 00:09:23.200
<v Speaker 1>And you've saved the image locally.

207
00:09:23.360 --> 00:09:27.120
<v Speaker 2>You have. And that basic method, find the URL, download with cURL,

208
00:09:27.200 --> 00:09:29.480
<v Speaker 2>save with file functions works for other file types too,

209
00:09:29.519 --> 00:09:30.440
<v Speaker 2>like PDFs or whatever.
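
NOTE
A sketch of the verify-then-save step. Note that getimagesize() expects a file name or URL, while
getimagesizefromstring() (PHP 5.4 and later) accepts the raw downloaded bytes, so the sketch uses the latter.
The helper names isImage and saveFile and the file name cover.jpg are hypothetical.
```php
<?php
// True if the raw bytes are a recognized image type.
function isImage($data)
{
    return @getimagesizefromstring($data) !== false;
}
// Write binary data to a local file; returns true if every byte was written.
function saveFile($data, $path)
{
    $fh = fopen($path, 'wb');     // 'b' keeps binary data intact on Windows
    if ($fh === false) { return false; }
    $written = fwrite($fh, $data);
    fclose($fh);
    return $written === strlen($data);
}
// Usage, assuming $imageData was downloaded with a cURL helper like the one discussed:
// if (isImage($imageData)) { saveFile($imageData, 'cover.jpg'); }
```
The same saveFile helper works for PDFs and other downloads; only the verification step is image-specific.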

210
00:09:30.559 --> 00:09:33.159
<v Speaker 1>Okay, fetching static stuff and images is one thing, but

211
00:09:33.240 --> 00:09:36.440
<v Speaker 1>lots of data is behind logins or search forms.

212
00:09:36.519 --> 00:09:38.039
<v Speaker 1>How do you interact with sites like that?

213
00:09:38.240 --> 00:09:42.399
<v Speaker 2>Yeah. This requires simulating form submissions. Forms often use the

214
00:09:42.519 --> 00:09:45.600
<v Speaker 2>HTTP POST method to send data.

215
00:09:45.279 --> 00:09:48.879
<v Speaker 1>So you need to make POST requests with cURL? Exactly.

216
00:09:49.159 --> 00:09:51.840
<v Speaker 2>The source shows a curlPost function example for this.

217
00:09:52.080 --> 00:09:54.039
<v Speaker 1>What do you need to know to make that POST

218
00:09:54.080 --> 00:09:54.960
<v Speaker 1>request work?

219
00:09:55.360 --> 00:09:58.080
<v Speaker 2>You have to inspect the HTML form on the actual

220
00:09:58.080 --> 00:10:01.120
<v Speaker 2>web page first. Look for the form tag itself.

221
00:10:01.720 --> 00:10:04.480
<v Speaker 2>You need its action attribute. That's the URL you send

222
00:10:04.480 --> 00:10:05.879
<v Speaker 2>the post request to.

223
00:10:06.000 --> 00:10:07.679
<v Speaker 1>Okay, the destination URL. Yep.

224
00:10:07.879 --> 00:10:10.039
<v Speaker 2>Then you need to find all the input elements inside

225
00:10:10.039 --> 00:10:13.279
<v Speaker 2>that form, and select or textarea too, potentially. What

226
00:10:13.360 --> 00:10:16.639
<v Speaker 2>about them? You need their name attributes. Those names become

227
00:10:16.679 --> 00:10:19.039
<v Speaker 2>the keys in the data you send, and you need

228
00:10:19.080 --> 00:10:20.720
<v Speaker 2>the value you want to send for each name.

229
00:10:20.919 --> 00:10:24.720
<v Speaker 1>So if there's an input named username, you send your username?

230
00:10:24.440 --> 00:10:27.919
<v Speaker 2>Right. And crucially, don't forget hidden input fields. They often

231
00:10:27.960 --> 00:10:31.080
<v Speaker 2>contain important stuff like session tokens or form IDs that

232
00:10:31.120 --> 00:10:34.519
<v Speaker 2>the server expects back. The source login example mentions needing

233
00:10:34.559 --> 00:10:38.000
<v Speaker 2>email password, but also destination and format which might be

234
00:10:38.320 --> 00:10:39.080
<v Speaker 2>hidden fields.

235
00:10:39.360 --> 00:10:41.840
<v Speaker 1>Ah, you've got to check the source carefully. What about

236
00:10:41.840 --> 00:10:43.919
<v Speaker 1>logins specifically? Don't they involve cookies?

237
00:10:44.240 --> 00:10:47.960
<v Speaker 2>Absolutely vital. When you log in successfully, the server usually

238
00:10:48.000 --> 00:10:51.080
<v Speaker 2>sends back cookies to track your session. For subsequent requests

239
00:10:51.159 --> 00:10:53.720
<v Speaker 2>or restricted pages, you need to send those cookies back.

240
00:10:53.759 --> 00:10:55.480
<v Speaker 1>How does cURL handle that?

241
00:10:55.720 --> 00:10:58.679
<v Speaker 2>It has options for it. CURLOPT_COOKIEJAR tells

242
00:10:58.799 --> 00:11:03.200
<v Speaker 2>cURL to save the cookies it receives into a specified file. CURLOPT_COOKIEFILE

243
00:11:03.240 --> 00:11:06.159
<v Speaker 2>tells cURL to read cookies from a file and

244
00:11:06.240 --> 00:11:07.799
<v Speaker 2>send them with the request.

245
00:11:07.639 --> 00:11:10.000
<v Speaker 1>So you log in, save the cookies, and then use

246
00:11:10.000 --> 00:11:12.679
<v Speaker 1>those cookies for future requests to stay logged in.

247
00:11:12.879 --> 00:11:15.679
<v Speaker 2>That's the basic idea. It maintains your session state.
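
NOTE
A hedged sketch of a POST helper with cookie handling, combining the form and session ideas above.
The function body is an assumption, and the field values below are placeholders. The field names email,
password, destination and format are the ones mentioned for the source's login example.
```php
<?php
// Send a form as an HTTP POST and keep session cookies in $cookieFile.
function curlPost($url, array $fields, $cookieFile)
{
    $ch = curl_init($url);                               // $url is the form's action attribute
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields)); // name=value pairs, URL-encoded
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);    // received cookies are written here when the handle is released
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);   // stored cookies are sent with the request
    return curl_exec($ch);
}
// Placeholder form data: keys are the inputs' name attributes, hidden fields included.
$fields = ['email' => 'user@example.com', 'password' => 'secret', 'destination' => '/account', 'format' => 'html'];
echo http_build_query($fields), PHP_EOL;   // shows the encoded body that would be sent
```
Reusing the same cookie file in later GET requests is what keeps the session logged in.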

248
00:11:15.919 --> 00:11:19.639
<v Speaker 1>The source also mentions posting files, like simulating an upload.

249
00:11:19.840 --> 00:11:23.039
<v Speaker 2>Yeah. If a form has an input type file, you

250
00:11:23.080 --> 00:11:26.600
<v Speaker 2>can simulate uploading a file using CURLOPT_POSTFIELDS.

251
00:11:27.279 --> 00:11:29.240
<v Speaker 2>You set the value for that field name to the

252
00:11:29.279 --> 00:11:32.080
<v Speaker 2>path of your local file, but you prefix the path

253
00:11:32.120 --> 00:11:33.080
<v Speaker 2>with an @ symbol.

254
00:11:33.240 --> 00:11:36.639
<v Speaker 1>cURL understands the @ means upload this file? Correct.

255
00:11:36.679 --> 00:11:39.279
<v Speaker 2>It handles reading the file content and sending it appropriately.
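
NOTE
The @ prefix matches the PHP versions current when the 2013 source was written. Since PHP 5.5 the CURLFile
class is the supported way to upload, and from PHP 7 the @ prefix no longer triggers an upload.
A minimal sketch for newer PHP, with a hypothetical helper name, field name and file path:
```php
<?php
// Build a POST field array that uploads a local file.
function buildUploadFields($fieldName, $localPath, $mimeType)
{
    // CURLFile wraps the path; cURL reads the file and sends it as multipart/form-data.
    return [$fieldName => new CURLFile($localPath, $mimeType, basename($localPath))];
}
// Usage: pass the array itself (not http_build_query) so cURL sends multipart data.
// curl_setopt($ch, CURLOPT_POSTFIELDS, buildUploadFields('upload', '/tmp/report.pdf', 'application/pdf'));
```
The field name must match the name attribute of the form's file input, just like any other field.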

256
00:11:39.360 --> 00:11:42.120
<v Speaker 1>Okay, so you've sent the POST request, maybe logged in.

257
00:11:42.159 --> 00:11:43.519
<v Speaker 1>How do you know if it actually worked?

258
00:11:43.679 --> 00:11:46.440
<v Speaker 2>The simplest check shown in the source is just to

259
00:11:46.480 --> 00:11:48.799
<v Speaker 2>look for a specific piece of text in the HTML

260
00:11:48.879 --> 00:11:52.840
<v Speaker 2>response that you know only appears after a successful submission.

261
00:11:53.080 --> 00:11:57.080
<v Speaker 1>Like "Login successful" or "Welcome back, user"? Exactly.

262
00:11:57.240 --> 00:12:00.360
<v Speaker 2>You get the response page source from curl and search

263
00:12:00.399 --> 00:12:03.480
<v Speaker 2>that string for your success message. If it's there, it probably worked.

264
00:12:03.200 --> 00:12:09.039
<v Speaker 1>Right. Okay, single pages, forms. But the real power

265
00:12:09.080 --> 00:12:12.679
<v Speaker 1>comes from scraping lots of pages, like product listings or

266
00:12:12.720 --> 00:12:16.399
<v Speaker 1>search results that span multiple pages. How do you handle

267
00:12:16.440 --> 00:12:17.080
<v Speaker 1>that pagination?

268
00:12:17.360 --> 00:12:20.039
<v Speaker 2>Yeah, traversing multiple pages. You start by scraping the first

269
00:12:20.080 --> 00:12:20.639
<v Speaker 2>page as usual.

270
00:12:20.600 --> 00:12:23.600
<v Speaker 1>Get the data, find the image, whatever. Right.

271
00:12:23.639 --> 00:12:26.559
<v Speaker 2>But while you're doing that, you also use XPath to look

272
00:12:26.559 --> 00:12:28.320
<v Speaker 2>for the link to the next page.

273
00:12:28.039 --> 00:12:30.759
<v Speaker 1>Like in a next button or page number link? Exactly.

274
00:12:30.799 --> 00:12:35.080
<v Speaker 2>Common places are ul elements with class pagination or pager. You'd

275
00:12:35.080 --> 00:12:37.840
<v Speaker 2>write an XPath query to find an a tag inside that,

276
00:12:37.919 --> 00:12:40.679
<v Speaker 2>maybe specifically the one with text next, and grab its

277
00:12:40.799 --> 00:12:41.799
<v Speaker 2>href attribute.

278
00:12:41.840 --> 00:12:44.120
<v Speaker 1>So you get the URL for page two? Yep.

279
00:12:44.360 --> 00:12:47.120
<v Speaker 2>Then you scrape page two, and on page two you

280
00:12:47.159 --> 00:12:49.159
<v Speaker 2>look for the link to page three, and so on.

281
00:12:49.240 --> 00:12:51.720
<v Speaker 2>You basically collect a list of all the page URLs

282
00:12:51.720 --> 00:12:52.360
<v Speaker 2>you need to visit.

283
00:12:52.600 --> 00:12:55.080
<v Speaker 1>Make sure they're full URLs, not relative ones.

284
00:12:55.200 --> 00:12:58.159
<v Speaker 2>Good point. If they're relative links like page two, you

285
00:12:58.200 --> 00:13:01.720
<v Speaker 2>need to prepend the base website url to make them

286
00:13:01.759 --> 00:13:04.279
<v Speaker 2>absolute before fetching with curl.
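
NOTE
A sketch of finding the next-page link and making it absolute. The markup (a link with the text "Next"
inside a ul with class pagination), the helper name nextPageUrl and the example.com URLs are assumptions.
```php
<?php
// Return the absolute URL of the next page, or null when there is no next link.
function nextPageUrl($html, $baseUrl)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // Assumed markup: <ul class="pagination"> containing <a href="...">Next</a>.
    $nodes = $xpath->query('//ul[@class="pagination"]//a[normalize-space(text())="Next"]/@href');
    if ($nodes->length === 0) { return null; }                   // no next link: last page reached
    $href = $nodes->item(0)->nodeValue;
    if (preg_match('#^https?://#i', $href)) { return $href; }    // already absolute
    return rtrim($baseUrl, '/') . '/' . ltrim($href, '/');       // simple prepend; assumes root-relative paths
}
$html = '<ul class="pagination"><li><a href="/books?page=2">Next</a></li></ul>';
echo nextPageUrl($html, 'http://example.com'), PHP_EOL;
```
Calling this on each fetched page until it returns null yields the full list of page URLs to visit.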

287
00:13:04.440 --> 00:13:07.000
<v Speaker 1>Then you just loop through your list of URLs, scraping

288
00:13:07.000 --> 00:13:07.519
<v Speaker 1>each one.

289
00:13:07.639 --> 00:13:10.399
<v Speaker 2>Pretty much. Fetch page, extract data, fetch next page,

290
00:13:10.399 --> 00:13:12.000
<v Speaker 2>extract data, repeat.

291
00:13:12.120 --> 00:13:14.879
<v Speaker 1>Now this sounds like it could hit the server pretty

292
00:13:14.879 --> 00:13:16.840
<v Speaker 1>fast if you have hundreds of pages.

293
00:13:17.120 --> 00:13:20.039
<v Speaker 2>It absolutely can, and that brings up a really critical point.

294
00:13:20.039 --> 00:13:22.120
<v Speaker 2>The source emphasizes politeness.

295
00:13:22.279 --> 00:13:24.559
<v Speaker 1>Right, don't be a nuisance or worse.

296
00:13:24.399 --> 00:13:27.919
<v Speaker 2>Get yourself blocked. Hammering a server with rapid fire requests

297
00:13:28.000 --> 00:13:30.519
<v Speaker 2>is bad form and often triggers automated defenses.

298
00:13:30.600 --> 00:13:31.600
<v Speaker 1>So how do you be polite?

299
00:13:31.679 --> 00:13:34.080
<v Speaker 2>The simplest, most common way shown is to just pause

300
00:13:34.120 --> 00:13:39.120
<v Speaker 2>between requests. Use PHP's sleep function. The source suggests

301
00:13:39.480 --> 00:13:41.159
<v Speaker 2>sleep(rand(1, 3)).

302
00:13:40.919 --> 00:13:44.039
<v Speaker 1>Wait a random one to three seconds between fetching each page?

303
00:13:44.279 --> 00:13:47.600
<v Speaker 2>Yeah, it slows your script down, mimics human browsing speed

304
00:13:47.639 --> 00:13:50.960
<v Speaker 2>a bit more, and drastically reduces the load on their server.

305
00:13:51.480 --> 00:13:53.559
<v Speaker 2>It's essential for any non trivial scraping.
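
NOTE
The polite loop can be sketched like this. The sleep(rand(1, 3)) call is the one mentioned above; the
helper name politeCrawl and the idea of passing the fetch function as a parameter are assumptions.
```php
<?php
// Fetch each URL in turn, pausing a random number of seconds between requests.
function politeCrawl(array $urls, callable $fetch, $minDelay = 1, $maxDelay = 3)
{
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first) { sleep(rand($minDelay, $maxDelay)); }  // pause between requests, not before the first
        $first = false;
        $pages[$url] = $fetch($url);                         // store the result keyed by URL
    }
    return $pages;
}
// Usage with a cURL helper such as the curlGet function discussed earlier:
// $pages = politeCrawl($urls, 'curlGet');
```
Raising the delay range is a cheap way to be gentler on small sites.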

306
00:13:53.720 --> 00:13:57.519
<v Speaker 1>Okay, vital tip. So you've scraped politely across many pages,

307
00:13:58.000 --> 00:14:02.000
<v Speaker 1>extracted tons of data. Where does it all go? Printing

308
00:14:02.000 --> 00:14:03.000
<v Speaker 1>to screen is useless now.

309
00:14:03.000 --> 00:14:06.639
<v Speaker 2>Right, you need persistent storage. The obvious choice demonstrated

310
00:14:06.759 --> 00:14:10.960
<v Speaker 2>is a database. Since XAMPP includes MySQL, that's the

311
00:14:11.039 --> 00:14:11.840
<v Speaker 2>example used.

312
00:14:11.960 --> 00:14:13.600
<v Speaker 1>First up is setting up the database table.

313
00:14:13.720 --> 00:14:16.600
<v Speaker 2>Yep, you need to design your table structure. Define

314
00:14:16.639 --> 00:14:19.799
<v Speaker 2>columns that match the data points you're scraping, like book title, author,

315
00:14:20.320 --> 00:14:22.000
<v Speaker 2>release date, ISBN, etc.

316
00:14:22.360 --> 00:14:24.279
<v Speaker 1>And you can use phpMyAdmin for that?

317
00:14:24.320 --> 00:14:26.919
<v Speaker 2>It's a handy graphical tool for creating the database and tables,

318
00:14:27.000 --> 00:14:28.519
<v Speaker 2>setting data types, all that stuff.

319
00:14:28.600 --> 00:14:31.919
<v Speaker 1>Okay, table's ready. How does the PHP script connect and

320
00:14:32.039 --> 00:14:33.519
<v Speaker 1>insert the scraped data?

321
00:14:33.559 --> 00:14:37.440
<v Speaker 2>The source uses PDO, PHP Data Objects. It's a standard, flexible

322
00:14:37.480 --> 00:14:40.480
<v Speaker 2>way in PHP to talk to databases, including MySQL.

323
00:14:40.600 --> 00:14:42.639
<v Speaker 2>You establish a connection using your database name,

324
00:14:42.639 --> 00:14:44.440
<v Speaker 1>username, password. Then insert the data?

325
00:14:44.480 --> 00:14:46.759
<v Speaker 2>For inserting lots of items, the best practice shown is

326
00:14:46.840 --> 00:14:48.120
<v Speaker 2>using prepared statements.

327
00:14:48.360 --> 00:14:50.679
<v Speaker 1>Why is that better than just building insert strings?

328
00:14:50.759 --> 00:14:55.440
<v Speaker 2>Two main reasons: security, since it prevents SQL injection vulnerabilities, and

329
00:14:55.559 --> 00:14:58.440
<v Speaker 2>often better performance when you're inserting many rows with the

330
00:14:58.440 --> 00:14:59.200
<v Speaker 2>same structure.

331
00:14:59.320 --> 00:15:00.200
<v Speaker 1>How do they work.

332
00:15:00.399 --> 00:15:03.919
<v Speaker 2>You write the insert query once, but use placeholders like

333
00:15:04.000 --> 00:15:07.519
<v Speaker 2>question marks are named parameters for the actual values. You

334
00:15:07.639 --> 00:15:11.399
<v Speaker 2>prepare this query structure with the database okay, Then you

335
00:15:11.519 --> 00:15:14.639
<v Speaker 2>loop through your array of scraped data items like all

336
00:15:14.639 --> 00:15:17.600
<v Speaker 2>the books you found. Inside the loop, for each book,

337
00:15:18.039 --> 00:15:21.960
<v Speaker 2>you bind its specific title, author, etc. to the placeholders

338
00:15:22.000 --> 00:15:24.440
<v Speaker 2>in the prepared statement, and then you execute it.
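
NOTE
Editor's sketch of the prepare-once, execute-per-item loop just described. It is
not the book's code. Assumptions: an in-memory SQLite database stands in for
MySQL so the snippet runs without a server, and the table, column names, and
sample books are hypothetical.
```php
<?php
// Assumption: SQLite in memory replaces the MySQL database from the episode.
// For MySQL the DSN would look like 'mysql:host=localhost;dbname=scraping'
// plus a username and password (all hypothetical values).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE books (title TEXT, author TEXT, isbn TEXT)');
// Hypothetical scraped items; a real scraper would build this array itself.
$books = [
    ['title' => 'Book One', 'author' => "Ann O'Neil", 'isbn' => '111'],
    ['title' => 'Book Two', 'author' => 'Bob Ray', 'isbn' => '222'],
];
// Prepare the query structure once, with named placeholders for the values.
$insert = $pdo->prepare(
    'INSERT INTO books (title, author, isbn) VALUES (:title, :author, :isbn)'
);
// Bind each book's values to the placeholders and execute: one row per item.
foreach ($books as $book) {
    $insert->execute([
        ':title'  => $book['title'],
        ':author' => $book['author'],
        ':isbn'   => $book['isbn'],
    ]);
}
$rowCount = (int) $pdo->query('SELECT COUNT(*) FROM books')->fetchColumn();
echo "Inserted {$rowCount} rows\n";
```
The apostrophe in the first author's name is stored as plain data because the
value never becomes part of the SQL text, which is the injection protection
mentioned above.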

339
00:15:24.519 --> 00:15:26.879
<v Speaker 1>So the query structure is set once and then just

340
00:15:26.919 --> 00:15:29.279
<v Speaker 1>the data changes for each execution?

341
00:15:29.519 --> 00:15:32.480
<v Speaker 2>Exactly. It's cleaner and safer. One row gets inserted into your

342
00:15:32.559 --> 00:15:35.159
<v Speaker 2>database table for each item in your scraped data array.

343
00:15:35.000 --> 00:15:37.360
<v Speaker 1>And the source also shows getting data back out?

344
00:15:37.440 --> 00:15:41.000
<v Speaker 2>Yeah, that completes the picture: using a SELECT query, again often

345
00:15:41.080 --> 00:15:44.679
<v Speaker 2>via PDO, to fetch the data you saved, perhaps looping

346
00:15:44.720 --> 00:15:47.159
<v Speaker 2>through the results to display them in an HTML table

347
00:15:47.200 --> 00:15:48.080
<v Speaker 2>on a web page.
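
NOTE
Editor's sketch of reading the saved rows back with a SELECT and rendering them
as an HTML table. Assumptions: in-memory SQLite again stands in for MySQL, and
the function name renderBooksTable, the table layout, and the sample row are
hypothetical.
```php
<?php
// Assumption: SQLite in memory replaces MySQL; the data is made up.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE books (title TEXT, author TEXT)');
$pdo->exec("INSERT INTO books VALUES ('PHP & You', 'Ann Author')");
// Fetch the saved rows and build an HTML table, escaping every value so that
// scraped text cannot inject markup into the page.
function renderBooksTable(PDO $pdo): string
{
    $html = "<table>\n<tr><th>Title</th><th>Author</th></tr>\n";
    $rows = $pdo->query('SELECT title, author FROM books ORDER BY title');
    foreach ($rows->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $html .= '<tr><td>' . htmlspecialchars($row['title']) . '</td><td>'
            . htmlspecialchars($row['author']) . "</td></tr>\n";
    }
    return $html . "</table>\n";
}
echo renderBooksTable($pdo);
```
Escaping with htmlspecialchars is an addition of this sketch, chosen because
scraped text is untrusted input.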

348
00:15:48.200 --> 00:15:52.480
<v Speaker 1>Okay, this is getting quite sophisticated: fetching, parsing, interacting, storing.

349
00:15:52.759 --> 00:15:55.519
<v Speaker 1>As scripts get bigger, the code could get messy, right?

350
00:15:55.960 --> 00:15:59.440
<v Speaker 1>Repeating the same cURL setup or XPath creation.

351
00:15:59.159 --> 00:16:03.759
<v Speaker 2>It definitely can. That's where the source introduces object-oriented

352
00:16:03.799 --> 00:16:08.039
<v Speaker 2>programming, or OOP, principles as a way to organize things better.

353
00:16:07.919 --> 00:16:10.000
<v Speaker 1>Making the code reusable and tidier.

354
00:16:10.080 --> 00:16:13.159
<v Speaker 2>That's the goal. The core idea is creating a class,

355
00:16:13.519 --> 00:16:16.240
<v Speaker 2>which is like a blueprint for an object. It bundles

356
00:16:16.240 --> 00:16:19.519
<v Speaker 2>together related data (properties) and functions that operate on that

357
00:16:19.639 --> 00:16:20.399
<v Speaker 2>data (methods).

358
00:16:20.559 --> 00:16:22.799
<v Speaker 1>The book uses a human class analogy.

359
00:16:22.919 --> 00:16:25.720
<v Speaker 2>Yeah, like a human blueprint might define properties like name

360
00:16:25.759 --> 00:16:29.519
<v Speaker 2>and age, and methods like speak or walk. An object

361
00:16:29.639 --> 00:16:32.320
<v Speaker 2>is a specific instance created from that blueprint, like

362
00:16:32.360 --> 00:16:35.679
<v Speaker 2>$bob = new Human, where Bob has his own specific name

363
00:16:35.720 --> 00:16:36.120
<v Speaker 2>and age.
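
NOTE
Editor's sketch of the Human analogy: the class is the blueprint, and Bob is one
object built from it. The constructor arguments, the age value, and the wording
of speak() are assumptions made for illustration.
```php
<?php
// The class is the blueprint: properties hold data, methods act on it.
class Human
{
    public string $name;
    public int $age;
    public function __construct(string $name, int $age)
    {
        $this->name = $name;
        $this->age = $age;
    }
    public function speak(): string
    {
        return "Hi, I'm {$this->name} and I'm {$this->age}.";
    }
}
// An object is one specific instance built from that blueprint.
$bob = new Human('Bob', 42);
echo $bob->speak(), "\n";
```
A second object, say new Human('Alice', 30), would carry its own name and age
while sharing the same methods.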

364
00:16:36.240 --> 00:16:39.200
<v Speaker 1>So how does the example scrape class in the source

365
00:16:39.320 --> 00:16:39.919
<v Speaker 1>apply this?

366
00:16:40.240 --> 00:16:43.679
<v Speaker 2>It takes the common scraping tasks, fetching a URL, creating

367
00:16:43.759 --> 00:16:46.519
<v Speaker 2>the XPath object, and puts them inside the class definition

368
00:16:46.600 --> 00:16:49.360
<v Speaker 2>as methods. Exactly, and it often uses a special method

369
00:16:49.360 --> 00:16:52.879
<v Speaker 2>called __construct. What's that do? The constructor runs automatically whenever

370
00:16:52.919 --> 00:16:54.679
<v Speaker 2>you create a new object from the class. So in

371
00:16:54.759 --> 00:16:57.320
<v Speaker 2>the Scrape class example, when you write $pageScraper = new

372
00:16:57.399 --> 00:17:01.120
<v Speaker 2>Scrape('http://example.com'), the __construct method immediately

373
00:17:01.120 --> 00:17:01.519
<v Speaker 2>takes that

374
00:17:01.679 --> 00:17:03.799
<v Speaker 1>URL, the one you just passed in?

375
00:17:03.559 --> 00:17:06.599
<v Speaker 2>Calls an internal method, maybe curlGet, to fetch the source

376
00:17:06.640 --> 00:17:09.799
<v Speaker 2>code for that specific URL, and calls another internal method

377
00:17:09.920 --> 00:17:12.960
<v Speaker 2>like returnXPathObject to create the XPath object for

378
00:17:13.000 --> 00:17:15.160
<v Speaker 2>that source. It does the initial setup work.

379
00:17:14.799 --> 00:17:17.519
<v Speaker 1>Ah, so the object is immediately ready with the

380
00:17:17.559 --> 00:17:21.160
<v Speaker 1>source and the XPath tool for its specific URL?

381
00:17:21.319 --> 00:17:24.039
<v Speaker 2>Precisely. The $pageScraper object now holds its own source code

382
00:17:24.039 --> 00:17:26.960
<v Speaker 2>property and XPath object property, ready for you to use.

383
00:17:27.200 --> 00:17:30.559
<v Speaker 2>You'd access them like $pageScraper->xPathObject->query().
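
NOTE
Editor's sketch of a Scrape-style class whose constructor does the setup work
described above. It is not the book's listing: the method names curlGet and
returnXPathObject follow the names mentioned in the episode, while the property
names, cURL options, and error handling are assumptions.
```php
<?php
class Scrape
{
    public string $source;
    public DOMXPath $xPathObject;
    // Runs automatically on `new Scrape($url)`: fetch, then build the XPath object.
    public function __construct(string $url)
    {
        $this->source = $this->curlGet($url);
        $this->xPathObject = $this->returnXPathObject($this->source);
    }
    // Fetch the page source for the URL with cURL (empty string on failure).
    protected function curlGet(string $url): string
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $result = curl_exec($ch);
        return $result === false ? '' : $result;
    }
    // Parse the HTML and wrap it in a DOMXPath object for querying.
    protected function returnXPathObject(string $html): DOMXPath
    {
        $dom = new DOMDocument();
        // Collect parser warnings quietly; real-world HTML is rarely valid.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html === '' ? '<html></html>' : $html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        return new DOMXPath($dom);
    }
}
// Usage (needs network access and the cURL extension):
// $pageScraper = new Scrape('http://example.com');
// $titles = $pageScraper->xPathObject->query('//title');
```
Keeping the helpers protected lets a subclass swap in a different fetcher, for
example one that returns canned HTML while testing.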

384
00:17:30.279 --> 00:17:32.519
<v Speaker 1>And you could add more methods to the scrape class

385
00:17:32.559 --> 00:17:34.160
<v Speaker 1>like save image or submit form.

386
00:17:34.240 --> 00:17:37.599
<v Speaker 2>Absolutely. You build up a reusable toolkit within the class.

387
00:17:38.119 --> 00:17:40.720
<v Speaker 2>Create a scrape object for any URL, and you have

388
00:17:40.799 --> 00:17:43.359
<v Speaker 2>all your scraping tools ready to work on that specific

389
00:17:43.400 --> 00:17:47.359
<v Speaker 2>page's content. Makes the main part of your script much cleaner.

390
00:17:47.200 --> 00:17:51.519
<v Speaker 1>Very neat. Okay, final piece: automation. You've built this great script,

391
00:17:52.319 --> 00:17:56.000
<v Speaker 1>maybe using a class, it saves to a database. How

392
00:17:56.000 --> 00:17:58.799
<v Speaker 1>do you make it run automatically? Say every night?

393
00:17:58.960 --> 00:18:00.799
<v Speaker 2>Yeah, you don't want to manually run it every time.

394
00:18:00.920 --> 00:18:03.400
<v Speaker 2>This is where scheduling comes in. The source gives an

395
00:18:03.400 --> 00:18:05.440
<v Speaker 2>example using Windows Task Scheduler.

396
00:18:05.880 --> 00:18:08.799
<v Speaker 1>So it's not code within the PHP script itself.

397
00:18:09.000 --> 00:18:12.839
<v Speaker 2>No. Usually you leverage the operating system's scheduling tools: Task

398
00:18:12.880 --> 00:18:16.079
<v Speaker 2>Scheduler on Windows, or cron jobs on Linux or macOS.

399
00:18:16.759 --> 00:18:18.480
<v Speaker 2>They are built for exactly this purpose.

400
00:18:18.559 --> 00:18:19.240
<v Speaker 1>How does it work?

401
00:18:19.279 --> 00:18:21.920
<v Speaker 2>Basically, you configure a task in the scheduler. You

402
00:18:22.000 --> 00:18:24.720
<v Speaker 2>tell it the schedule, daily at 3 a.m. for instance. Okay,

403
00:18:24.839 --> 00:18:27.599
<v Speaker 2>you tell it the action to perform. The action is essentially

404
00:18:28.200 --> 00:18:30.880
<v Speaker 2>run the PHP program and tell it to execute your

405
00:18:30.920 --> 00:18:32.640
<v Speaker 2>specific scraping script file.

406
00:18:32.839 --> 00:18:36.079
<v Speaker 1>So you point it to php.exe and then

407
00:18:36.119 --> 00:18:38.359
<v Speaker 1>your my_scraper.php file?

408
00:18:38.440 --> 00:18:41.160
<v Speaker 2>Exactly. You provide the full paths. When the scheduled time arrives,

409
00:18:41.200 --> 00:18:45.240
<v Speaker 2>the OS runs php.exe, PHP executes your script, and hopefully

410
00:18:45.480 --> 00:18:47.920
<v Speaker 2>your database gets updated with fresh data.

411
00:18:47.960 --> 00:18:49.799
<v Speaker 1>And it just runs in the background without you doing anything?

412
00:18:49.920 --> 00:18:52.680
<v Speaker 2>That's the beauty of it. True automation. Wow.
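
NOTE
Editor's sketch of a script entry point meant to be started by the operating
system's scheduler. The scheduler entries in the comments use hypothetical
paths and a hypothetical task name; the log file and the runScheduledJob helper
are additions of this sketch, not something the source describes.
```php
<?php
// my_scraper.php (hypothetical file name): the file the scheduler runs.
// Assumed scheduler entries, with hypothetical paths:
//   cron (Linux/macOS), daily at 03:00:
//     0 3 * * * /usr/bin/php /home/user/my_scraper.php
//   Windows Task Scheduler, created from the command line:
//     schtasks /Create /SC DAILY /ST 03:00 /TN "NightlyScraper" /TR "C:\xampp\php\php.exe C:\scrapers\my_scraper.php"
// Run the job and append a timestamped line to a log, since nobody is
// watching the screen when the scheduler starts the script.
function runScheduledJob(callable $job, string $logFile): bool
{
    try {
        $job();
        $status = 'ok';
    } catch (Throwable $e) {
        $status = 'failed: ' . $e->getMessage();
    }
    file_put_contents($logFile, date('c') . " scrape {$status}\n", FILE_APPEND);
    return $status === 'ok';
}
// Placeholder job; the real one would fetch, parse, and store the data.
$logFile = sys_get_temp_dir() . '/my_scraper.log';
$ok = runScheduledJob(function (): void {
    // scraping work goes here
}, $logFile);
echo $ok ? "done\n" : "failed\n";
```
Both scheduler entries pass the full path of the PHP executable followed by the
full path of the script, matching the advice to provide full paths.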

413
00:18:52.839 --> 00:18:57.000
<v Speaker 1>Okay, so we've really gone from setup, getting PHP, XAMPP,

414
00:18:57.640 --> 00:19:01.720
<v Speaker 1>enabling cURL, all the way through fetching pages, dealing with redirects,

415
00:19:01.880 --> 00:19:03.079
<v Speaker 1>user agents.

416
00:19:02.839 --> 00:19:07.039
<v Speaker 2>Extracting data with XPath for structure, or custom functions like

417
00:19:07.079 --> 00:19:09.680
<v Speaker 2>scrapeBetween for trickier bits.

418
00:19:09.279 --> 00:19:15.079
<v Speaker 1>Handling images, simulating form posts, managing cookies for logins, traversing pagination

419
00:19:15.200 --> 00:19:16.920
<v Speaker 1>politely with delays.

420
00:19:16.559 --> 00:19:19.000
<v Speaker 2>Saving the results into a MySQL database using PDO

421
00:19:19.119 --> 00:19:23.759
<v Speaker 2>and prepared statements, organizing the code using OOP with classes.

422
00:19:23.319 --> 00:19:26.400
<v Speaker 1>And finally automating the whole thing with Task Scheduler or cron.

423
00:19:26.640 --> 00:19:27.640
<v Speaker 1>That's quite the journey.

424
00:19:27.759 --> 00:19:29.960
<v Speaker 2>It really covers the fundamental life cycle of a web

425
00:19:30.039 --> 00:19:33.559
<v Speaker 2>scraping task, based on these source excerpts. Even though the

426
00:19:33.559 --> 00:19:36.920
<v Speaker 2>web is way more complex now, especially with JavaScript rendering content.

427
00:19:36.720 --> 00:19:38.960
<v Speaker 1>Right, that's a whole other challenge.

428
00:19:38.759 --> 00:19:43.319
<v Speaker 2>It is, but these core concepts, fetching HTTP content, parsing structured or semi-structured data,

429
00:19:43.400 --> 00:19:47.359
<v Speaker 2>handling sessions, storing results, they still form the foundation. You

430
00:19:47.440 --> 00:19:50.559
<v Speaker 2>might add tools like headless browsers to handle JavaScript on top,

431
00:19:50.839 --> 00:19:52.720
<v Speaker 2>but you still need to understand these basics.

432
00:19:52.839 --> 00:19:55.920
<v Speaker 1>That's a great takeaway. The principles are enduring even if

433
00:19:55.920 --> 00:19:59.640
<v Speaker 1>the tools evolve. So, thinking about these building blocks, how

434
00:19:59.720 --> 00:20:02.119
<v Speaker 1>might you combine them for something more complex? Or maybe,

435
00:20:02.279 --> 00:20:05.559
<v Speaker 1>how does that politeness principle, the sleep delay, become even

436
00:20:05.599 --> 00:20:08.680
<v Speaker 1>more important, maybe even an ethical consideration, when you think

437
00:20:08.680 --> 00:20:11.279
<v Speaker 1>about scraping at a really large scale? What's the most

438
00:20:11.319 --> 00:20:14.079
<v Speaker 1>interesting challenge you could tackle just starting with these foundational

439
00:20:14.119 --> 00:20:14.599
<v Speaker 1>ideas?
