Scraping Book Information from Dangdang: Conclusion
Posted by 王起帆
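Dangdang lists a very large number of books, and scraping all of them would be a lot of work, so this post scrapes only one category.
In the Main() method
First, the user enters the seed URL:
    string starturl = Console.ReadLine();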
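Since everything downstream depends on the seed URL, it may be worth checking the input before using it. The following is a minimal sketch, not part of the original program: the `SeedUrl` class and its `TryParse` method are hypothetical names introduced here, and the only framework call relied on is `Uri.TryCreate`.

```csharp
using System;

// Hypothetical helper (not in the original post): validates the seed URL
// typed by the user before it is handed to the crawler.
public static class SeedUrl
{
    // Returns true and the parsed Uri when the input is an absolute http/https URL.
    public static bool TryParse(string input, out Uri seed)
    {
        seed = null;
        if (string.IsNullOrWhiteSpace(input))
        {
            return false;
        }

        Uri candidate;
        if (!Uri.TryCreate(input.Trim(), UriKind.Absolute, out candidate))
        {
            return false;
        }

        // Reject schemes such as file:// or ftp:// that the crawler cannot fetch.
        if (candidate.Scheme != Uri.UriSchemeHttp && candidate.Scheme != Uri.UriSchemeHttps)
        {
            return false;
        }

        seed = candidate;
        return true;
    }

    public static void Main()
    {
        Console.Write("Seed URL: ");
        Uri seed;
        if (TryParse(Console.ReadLine(), out seed))
        {
            Console.WriteLine("Starting from " + seed.AbsoluteUri);
        }
        else
        {
            Console.WriteLine("Please enter an absolute http or https URL.");
        }
    }
}
```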
Create the database context object:
    BookStoreEntities storeDB = new BookStoreEntities();
Get the URLs of the book categories and save them:
    string html = Tool.GetHtml(starturl);
    ArrayList list = Tool.GetList(html);
    foreach (var item in list)
    {
        BookClass bookclass = new BookClass();
        bookclass.Url = item.ToString();
        storeDB.BookClass.Add(bookclass);
    }
    storeDB.SaveChanges();
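Scraping book information with multiple threads
Each book category gets its own thread for scraping.
Wrap the work in a process class:
    public class process
    {
        // Each process instance creates its own database context.
        BookStoreEntities storeDB = new BookStoreEntities();
        public BookClass BookClass;

        public process(int BookClassId)
        {
            BookClass = storeDB.BookClass.Find(BookClassId);
        }
    }
The scraping logic is implemented in this class, starting from an empty method:
    public void threads()
    {
    }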
Implementing pagination
The URLs of a category's listing pages follow a regular pattern:
http://category.dangdang.com/cp01.54.06.00.00.00.html
http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html
http://category.dangdang.com/pg3-cp01.54.06.00.00.00.html
Split the first page's URL into two parts: the front part, http://category.dangdang.com/, and the rear part, cp01.54.06.00.00.00.html.
Pages 2 through 100 all have the form: front part + "pg" + page number + "-" + rear part.
    for (int i = 1; i <= BookClass.Pages; i++)
    {
        string url = "";
        // http://category.dangdang.com/pg100-cp01.54.06.00.00.00.html
        // http://book.dangdang.com/01.54.htm?ref=book-01-A
        // http://category.dangdang.com/cp01.54.06.00.00.00.html
        // http://category.dangdang.com/pg2-cp01.54.13.00.00.00.html
        string tempurl = BookClass.Url;
        int p1 = tempurl.IndexOf("cp");
        string fast = "";
        string rear = "";
        if (p1 > 0)
        {
            fast = tempurl.Substring(0, p1);   // front part, up to "cp"
            rear = tempurl.Substring(p1);      // rear part, from "cp" onward
            url = fast + "pg" + i.ToString() + "-" + rear;
        }
        if (url == "")
        {
            // The category URL contains no "cp", so it cannot be paged.
            return;
        }
        if (i == 1)
        {
            // The first page is the category URL itself.
            url = BookClass.Url;
        }
    }
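The pagination rule can also be isolated in a small pure function, which makes it easy to check against the three example URLs. This is only a sketch: `PageUrls.Build` is a hypothetical name, and unlike the loop in this post it searches for "cp" only after the last slash, an assumption made here so that a "cp" appearing earlier in the URL would not be matched by accident.

```csharp
using System;

// Hypothetical helper (not in the original post): builds the URL of the
// n-th listing page from the URL of the first page.
public static class PageUrls
{
    // Returns the URL for the given 1-based page, or null when the last
    // path segment contains no "cp" and the URL therefore cannot be paged.
    public static string Build(string firstPageUrl, int page)
    {
        if (page < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(page));
        }
        if (page == 1)
        {
            return firstPageUrl; // the first page is the category URL itself
        }

        int slash = firstPageUrl.LastIndexOf('/');
        int cp = firstPageUrl.IndexOf("cp", slash + 1, StringComparison.Ordinal);
        if (cp < 0)
        {
            return null;
        }

        string front = firstPageUrl.Substring(0, cp);
        string rear = firstPageUrl.Substring(cp);
        return front + "pg" + page + "-" + rear;
    }

    public static void Main()
    {
        string first = "http://category.dangdang.com/cp01.54.06.00.00.00.html";
        for (int page = 1; page <= 3; page++)
        {
            Console.WriteLine(Build(first, page));
        }
    }
}
```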
Continue filling in the same method so that it downloads each listing page, visits every product on it, and saves the parsed book:
    public void threads()
    {
        ArrayList L = new ArrayList();
        for (int i = 1; i <= BookClass.Pages; i++)
        {
            // Build the URL of page i (see the pagination rule above).
            string url = "";
            string tempurl = BookClass.Url;
            int p1 = tempurl.IndexOf("cp");
            if (p1 > 0)
            {
                string fast = tempurl.Substring(0, p1);
                string rear = tempurl.Substring(p1);
                url = fast + "pg" + i.ToString() + "-" + rear;
            }
            if (url == "")
            {
                return;
            }
            if (i == 1)
            {
                url = BookClass.Url;
            }

            // Download the listing page and extract the product URLs.
            string internet = Tool.GetHtml(url);
            L = Tool.GetProduct(internet);
            foreach (var item in L)
            {
                Console.WriteLine(item.ToString());
                // Download the product page and parse its fields.
                string html = Tool.GetHtml(item.ToString());
                Dictionary<int, string> dict = Tool.analysis(html);
                Book book = new Book
                {
                    AuthorName = dict[3],
                    BookName = dict[1],
                    Price = Convert.ToDecimal(dict[2]),
                    Publisher = dict[4],
                    PictureUrl = dict[5],
                    BookContent = dict[6]
                };
                BookClass.Books.Add(book);
                storeDB.SaveChanges();
            }
        }
    }
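One fragile spot in the method is `Convert.ToDecimal(dict[2])`, which throws if the scraped price text is empty or not a plain number, and that would end the whole thread. Below is a hedged sketch of a more forgiving conversion. `PriceParser` is a hypothetical helper, and the idea that the price string might carry a leading yen sign is an assumption, since the post does not show what `Tool.analysis` returns.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not in the original post): converts scraped price
// text to a decimal without throwing on unexpected input.
public static class PriceParser
{
    public static bool TryParse(string raw, out decimal price)
    {
        price = 0m;
        if (string.IsNullOrWhiteSpace(raw))
        {
            return false;
        }

        // Assumption: the text may start with a half-width (U+00A5) or
        // full-width (U+FFE5) yen sign, which is stripped before parsing.
        string cleaned = raw.Trim().TrimStart('\u00A5', '\uFFE5').Trim();

        // InvariantCulture keeps "." as the decimal separator on every machine.
        return decimal.TryParse(cleaned, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out price);
    }

    public static void Main()
    {
        decimal price;
        if (TryParse("\u00A525.80", out price))
        {
            Console.WriteLine(price.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

In the loop, a book whose price fails to parse could then be skipped or logged instead of aborting the remaining pages of the category.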
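Back in the Main function, start one background thread per saved category. The final Console.ReadLine() keeps the program alive while the background threads work.
    var items = storeDB.BookClass;
    foreach (var bookclass in items)
    {
        process p = new process(bookclass.BookClassId);
        Thread th = new Thread(p.threads);
        th.IsBackground = true;
        th.Start();
        Thread.Sleep(1000);   // stagger thread start-up by one second
    }
    storeDB.SaveChanges();
    Console.ReadLine();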
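Because the threads are background threads, the program exits as soon as Main returns, so in the code above the crawl continues only until the user presses Enter. An alternative is to keep the threads and call `Join` on each one so that Main waits for all categories to finish. The sketch below illustrates that pattern with placeholder work; `CrawlRunner` and `RunAll` are hypothetical names, and the delegate passed in stands in for creating a `process` and calling its `threads()` method.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper (not in the original post): starts one thread per
// category and waits for all of them instead of relying on Console.ReadLine().
public static class CrawlRunner
{
    // Returns the number of categories whose work ran to completion.
    public static int RunAll(IEnumerable<int> categoryIds, Action<int> crawlCategory)
    {
        var threads = new List<Thread>();
        int completed = 0;

        foreach (int id in categoryIds)
        {
            int captured = id; // explicit copy for the closure
            var th = new Thread(() =>
            {
                crawlCategory(captured);
                Interlocked.Increment(ref completed);
            });
            threads.Add(th);
            th.Start();
        }

        // Block until every category thread has finished.
        foreach (var th in threads)
        {
            th.Join();
        }
        return completed;
    }

    public static void Main()
    {
        // Placeholder work; in the crawler this would be
        // new process(id).threads() for each BookClassId.
        int done = RunAll(new[] { 1, 2, 3 }, id =>
        {
            Thread.Sleep(100);
            Console.WriteLine("Finished category " + id);
        });
        Console.WriteLine(done + " categories finished");
    }
}
```

Note that an unhandled exception inside a thread still terminates the process, so the real crawl method would need its own try/catch around network and parsing calls.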