Grabbing Web Pages and Images with wget: A Successful Attempt



Posted 2014-12-17 11:29 · 2471 views · 14 comments

 

A strange requirement

The company wants web pages from our server cached on a router, so that when users visit a page they get it straight from the router's cache. I don't know what purpose this serves, but I'll do my best to implement it.

wget overview

wget is a web-retrieval tool for Unix and Unix-like systems; once I became familiar with it, I found its capabilities go far beyond that. This post, however, covers only one thing: how to grab a given URL together with the content under it (HTML, JS, CSS, and images) and convert the absolute paths inside that content into relative ones. The web is full of articles about fetching pages and their images with wget, but I never found one that actually worked; every recipe I tried ended in failure.
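To make the goal concrete before diving into the option list, here is a minimal sketch of the kind of command this post builds toward: fetch a single page plus everything it needs, and rewrite absolute links as relative ones. The URL is a placeholder, not the actual server from the requirement.

```shell
#!/bin/sh
# Placeholder target page; substitute the real server URL.
URL="http://example.com/"

# -p  (--page-requisites)  : also fetch the images, CSS and JS the page references
# -k  (--convert-links)    : after download, rewrite absolute links to relative local ones
# -E  (--adjust-extension) : save HTML/CSS files with proper .html/.css suffixes
# -H  (--span-hosts) could be added if the page's assets live on other domains.
wget -p -k -E "$URL"
```

With these three flags the saved page opens offline with its images and styles intact, which is exactly what a router-side cache needs.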

This is the file produced by wget -h > ./help_wget.txt:

 

 

GNU Wget 1.16, a non-interactive network retriever.

Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit.
  -h,  --help                      print this help.
  -b,  --background                go to background after startup.
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE.
  -a,  --append-output=FILE        append messages to FILE.
  -q,  --quiet                     quiet (no output).
  -v,  --verbose                   be verbose (this is the default).
  -nv, --no-verbose                turn off verboseness, without being quiet.
       --report-speed=TYPE         Output bandwidth as TYPE.  TYPE can be bits.
  -i,  --input-file=FILE           download URLs found in local or external FILE.
  -F,  --force-html                treat input file as HTML.
  -B,  --base=URL                  resolves HTML input-file links (-i -F)
                                   relative to URL.
       --config=FILE               Specify config file to use.
       --no-config                 Do not read any config file.

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits).
       --retry-connrefused         retry even if connection is refused.
  -O,  --output-document=FILE      write documents to FILE.
  -nc, --no-clobber                skip downloads that would download to
                                   existing files (overwriting them).
  -c,  --continue                  resume getting a partially-downloaded file.
       --start-pos=OFFSET          start downloading from zero-based position OFFSET.
       --progress=TYPE             select progress gauge type.
       --show-progress             display the progress bar in any verbosity mode.
  -N,  --timestamping              don't re-retrieve files unless newer than
                                   local.
  --no-use-server-timestamps       don't set the local file's timestamp by
                                   the one on the server.
  -S,  --server-response           print server response.
       --spider                    don't download anything.
  -T,  --timeout=SECONDS           set all timeout values to SECONDS.
       --dns-timeout=SECS          set the DNS lookup timeout to SECS.
       --connect-timeout=SECS      set the connect timeout to SECS.
       --read-timeout=SECS         set the read timeout to SECS.
  -w,  --wait=SECONDS              wait SECONDS between retrievals.
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval.
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                  explicitly turn off proxy.
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER.
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE           limit download rate to RATE.
       --no-dns-cache              disable caching DNS lookups.
       --restrict-file-names=OS    restrict chars in file names to ones OS allows.
       --ignore-case               ignore case when matching files/directories.
  -4,  --inet4-only                connect only to IPv4 addresses.
  -6,  --inet6-only                connect only to IPv6 addresses.
       --prefer-family=FAMILY      connect first to addresses of specified family,
                                   one of IPv6, IPv4, or none.
       --user=USER                 set both ftp and http user to USER.
       --password=PASS             set both ftp and http password to PASS.
       --ask-password              prompt for passwords.
       --no-iri                    turn off IRI support.
       --local-encoding=ENC        use ENC as the local encoding for IRIs.
       --remote-encoding=ENC       use ENC as the default remote encoding.
       --unlink                    remove file before clobber.

Directories:
  -nd, --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
       --protocol-directories      use protocol name in directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER            set http user to USER.
       --http-password=PASS        set http password to PASS.
       --no-cache                  disallow server-cached data.
       --default-page=NAME         Change the default page name (normally
                                   this is `index.html'.).
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions.
       --ignore-length             ignore `Content-Length' header field.
       --header=STRING             insert STRING among the headers.
       --max-redirect              maximum redirections allowed per page.
       --proxy-user=USER           set USER as proxy username.
       --proxy-password=PASS       set PASS as proxy password.
       --referer=URL               include `Referer: URL' header in HTTP request.
       --save-headers              save the HTTP headers to file.
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections).
       --no-cookies                don't use cookies.
       --load-cookies=FILE         load cookies from FILE before session.
       --save-cookies=FILE         save cookies to FILE after session.
       --keep-session-cookies      load and save session (non-permanent) cookies.
       --post-data=STRING          use the POST method; send STRING as the data.
       --post-file=FILE            use the POST method; send contents of FILE.
       --method=HTTPMethod         use method "HTTPMethod" in the request.
       --body-data=STRING          Send STRING as data. --method MUST be set.
       --body-file=FILE            Send contents of FILE. --method MUST be set.
       --content-disposition       honor the Content-Disposition header when
                                   choosing local file names (EXPERIMENTAL).
       --content-on-error          output the received content on server errors.
       --auth-no-challenge         send Basic HTTP authentication information
                                   without first waiting for the server's
                                   challenge.

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2,
                                   SSLv3, TLSv1 and PFS.
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate.
       --certificate=FILE          client certificate file.
       --certificate-type=TYPE     client certificate type, PEM or DER.
       --private-key=FILE          private key file.
       --private-key-type=TYPE     private key type, PEM or DER.
       --ca-certificate=FILE       file with the bundle of CA's.
       --ca-directory=DIR          directory where hash list of CA's is stored.
       --random-file=FILE          file with random data for seeding the SSL PRNG.
       --egd-file=FILE             file naming the EGD socket with random data.

FTP options:
       --ftp-user=USER             set ftp user to USER.
       --ftp-password=PASS         set ftp password to PASS.
       --no-remove-listing         don't remove `.listing' files.
       --no-glob                   turn off FTP file name globbing.
       --no-passive-ftp            disable the "passive" transfer mode.
       --preserve-permissions      preserve remote file permissions.

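One practical note on the HTTP options listed above: some servers refuse requests carrying wget's default Wget/VERSION identity, and -U plus --header can work around that. A hedged sketch, with a placeholder URL and an example browser User-Agent string:

```shell
#!/bin/sh
# Placeholder URL; the User-Agent string is just an example browser identity.
# -U sends a custom User-Agent; --header adds arbitrary request headers.
wget -U "Mozilla/5.0 (X11; Linux x86_64)" \
     --header="Accept-Language: zh-CN,zh;q=0.9" \
     -p -k "http://example.com/"
```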