Using wget to Crawl Web Pages and Images: A Successful Attempt
A Strange Requirement
The company wants the server's web pages cached on a router, so that when users visit a page they are served the cached copy straight from the router. I'm not sure what the point of this requirement is, but I'll do my best to implement it.
wget Overview
wget is a web retrieval tool for Unix and Unix-like systems; once I got familiar with it, I found it can do far more than that. This post, however, covers only one task: fetching a given URL together with the content under it (HTML, JS, CSS, and images) and rewriting the absolute paths inside the downloaded files into relative ones. A web search turns up plenty of articles on using wget to grab a page and its related image resources, but I never found one that actually worked; every recipe I tried ended in failure. A working invocation is sketched after the option listing below.
Here is the content of the file produced by wget -h > ./help_wget.txt:
GNU Wget 1.16, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version           display the version of Wget and exit.
  -h,  --help              print this help.
  -b,  --background        go to background after startup.
  -e,  --execute=COMMAND   execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE    log messages to FILE.
  -a,  --append-output=FILE  append messages to FILE.
  -q,  --quiet               quiet (no output).
  -v,  --verbose             be verbose (this is the default).
  -nv, --no-verbose          turn off verboseness, without being quiet.
       --report-speed=TYPE   Output bandwidth as TYPE.  TYPE can be bits.
  -i,  --input-file=FILE     download URLs found in local or external FILE.
  -B,  --base=URL            resolves HTML input-file links (-i -F)
                             relative to URL.
       --config=FILE         Specify config file to use.
       --no-config           Do not read any config file.

Download:
  -t,  --tries=NUMBER            set number of retries to NUMBER (0 unlimits).
       --retry-connrefused       retry even if connection is refused.
  -O,  --output-document=FILE    write documents to FILE.
  -nc, --no-clobber              skip downloads that would download to
                                 existing files (overwriting them).
  -c,  --continue                resume getting a partially-downloaded file.
       --start-pos=OFFSET        start downloading from zero-based position OFFSET.
       --progress=TYPE           select progress gauge type.
       --show-progress           display the progress bar in any verbosity mode.
  -N,  --timestamping            don't re-retrieve files unless newer than local.
       --no-use-server-timestamps  don't set the local file's timestamp by
                                 the one on the server.
  -S,  --server-response         print server response.
       --spider                  don't download anything.
  -T,  --timeout=SECONDS         set all timeout values to SECONDS.
       --dns-timeout=SECS        set the DNS lookup timeout to SECS.
       --connect-timeout=SECS    set the connect timeout to SECS.
       --read-timeout=SECS       set the read timeout to SECS.
  -w,  --wait=SECONDS            wait SECONDS between retrievals.
       --waitretry=SECONDS       wait 1..SECONDS between retries of a retrieval.
       --random-wait             wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                explicitly turn off proxy.
  -Q,  --quota=NUMBER            set retrieval quota to NUMBER.
       --bind-address=ADDRESS    bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE         limit download rate to RATE.
       --no-dns-cache            disable caching DNS lookups.
       --restrict-file-names=OS  restrict chars in file names to ones OS allows.
       --ignore-case             ignore case when matching files/directories.
  -4,  --inet4-only              connect only to IPv4 addresses.
  -6,  --inet6-only              connect only to IPv6 addresses.
       --prefer-family=FAMILY    connect first to addresses of specified family,
                                 one of IPv6, IPv4, or none.
       --user=USER               set both ftp and http user to USER.
       --password=PASS           set both ftp and http password to PASS.
       --ask-password            prompt for passwords.
       --no-iri                  turn off IRI support.
       --local-encoding=ENC      use ENC as the local encoding for IRIs.
       --remote-encoding=ENC     use ENC as the default remote encoding.
       --unlink                  remove file before clobber.

Directories:
  -nd, --no-directories           don't create directories.
  -x,  --force-directories        force creation of directories.
  -nH, --no-host-directories      don't create host directories.
       --protocol-directories     use protocol name in directories.
  -P,  --directory-prefix=PREFIX  save files to PREFIX/...
       --cut-dirs=NUMBER          ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER        set http user to USER.
       --http-password=PASS    set http password to PASS.
       --no-cache              disallow server-cached data.
       --default-page=NAME     Change the default page name (normally
                               this is `index.html'.).
  -E,  --adjust-extension      save HTML/CSS documents with proper extensions.
       --ignore-length         ignore `Content-Length' header field.
       --header=STRING         insert STRING among the headers.
       --max-redirect          maximum redirections allowed per page.
       --proxy-user=USER       set USER as proxy username.
       --proxy-password=PASS   set PASS as proxy password.
       --referer=URL           include `Referer: URL' header in HTTP request.
       --save-headers          save the HTTP headers to file.
  -U,  --user-agent=AGENT      identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive    disable HTTP keep-alive (persistent connections).
       --no-cookies            don't use cookies.
       --load-cookies=FILE     load cookies from FILE before session.
       --save-cookies=FILE     save cookies to FILE after session.
       --keep-session-cookies  load and save session (non-permanent) cookies.
       --post-data=STRING      use the POST method; send STRING as the data.
       --post-file=FILE        use the POST method; send contents of FILE.
       --method=HTTPMethod     use method "HTTPMethod" in the request.
       --body-data=STRING      Send STRING as data. --method MUST be set.
       --body-file=FILE        Send contents of FILE. --method MUST be set.
       --content-disposition   honor the Content-Disposition header when
                               choosing local file names (EXPERIMENTAL).
       --content-on-error      output the received content on server errors.
       --auth-no-challenge     send Basic HTTP authentication information
                               without first waiting for the server's challenge.

HTTPS (SSL/TLS) options:
       --secure-protocol=PR     choose secure protocol, one of auto, SSLv2,
                                SSLv3, TLSv1 and PFS.
       --https-only             only follow secure HTTPS links
       --no-check-certificate   don't validate the server's certificate.
       --certificate=FILE       client certificate file.
       --certificate-type=TYPE  client certificate type, PEM or DER.
       --private-key=FILE       private key file.
       --private-key-type=TYPE  private key type, PEM or DER.
       --ca-certificate=FILE    file with the bundle of CA's.
       --ca-directory=DIR       directory where hash list of CA's is stored.
       --random-file=FILE       file with random data for seeding the SSL PRNG.
       --egd-file=FILE          file naming the EGD socket with random data.

FTP options:
       --ftp-user=USER         set ftp user to USER.
       --ftp-password=PASS     set ftp password to PASS.
       --no-remove-listing     don't remove `.listing' files.
       --no-glob               turn off FTP file name globbing.
       --no-passive-ftp        disable the "passive" transfer mode.
       --preserve-permissions  preserve remote file permissions.
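The dump above stops before the recursive-retrieval section of the help, which is exactly where the options for this task live. As a minimal sketch of the kind of invocation that matches the goal described earlier (the URL and the ./site_cache output directory are placeholders, not values from my actual setup; all flags shown are standard wget options):

  # --page-requisites (-p)   also fetch the CSS, JS and images the page needs
  # --convert-links (-k)     rewrite links in the saved files to relative local paths
  # --adjust-extension (-E)  save HTML/CSS documents with proper extensions
  # --span-hosts (-H)        allow page requisites hosted on other domains
  # --no-directories (-nd)   save every file into one flat directory
  # --directory-prefix (-P)  directory to save the result under
  wget --page-requisites --convert-links --adjust-extension \
       --span-hosts --no-directories --directory-prefix=./site_cache \
       http://example.com/index.html

--convert-links runs after the download finishes, and it is what turns the absolute URLs inside the saved HTML/CSS into relative references to the local copies. --span-hosts matters when the page pulls resources from a CDN on another domain; without it, wget only downloads requisites served by the original host.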