Writing a Web Crawler in C#, Version V1.0

Posted 2020-07-24 张杨

I was recently reading about the basic data types in SQL Server and noticed that the image type is a rather special one.

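So I wrote a program that stores pictures as binary streams (http://www.cnblogs.com/JsonZhangAA/p/5568575.html). Now suppose I want to store all the pictures from a website in bulk: do I really have to type in n addresses by hand? Clearly that is not workable. To handle this case I wrote a simple crawler in C#. The site we crawl is Tianwenwang (天文网), http://www.tianwenwang.cn/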

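The program works by using WebRequest and WebResponse to request the site and read its response (I don't fully understand it, so that's the best I can say 0.0), and then using StreamWriter to save the page source to a txt file. At that point a pattern shows up: the picture addresses all look like http://p.tianwenwang.cn/upload/150318/68181426648163.jpg!list.jpg or http://p.tianwenwang.cn/upload/150312/58341426094069.jpg!list.jpg. So we can use a regular expression to pull every http: address out of the source and put them in a string array. The last step is to check whether each address contains a typical image suffix such as jpg or gif (the biggest flaw of V1.0), and if it does, store the picture in the database.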
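To make the extraction step concrete, here is a minimal console sketch of the regex-plus-suffix idea applied to a tiny hand-written HTML string. The class name LinkExtractionDemo, the methods GetLinks and LooksLikeImage, and the sample HTML are hypothetical stand-ins for illustration, and the pattern is assumed to be the post's URL pattern with regex escapes written as single backslashes inside a verbatim string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkExtractionDemo
{
    // Assumed form of the post's URL pattern (single backslashes in a verbatim string).
    private const string Pattern = @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Return every substring of the page source that matches the URL pattern.
    public static string[] GetLinks(string html)
    {
        return Regex.Matches(html, Pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Same naive suffix test the post describes: a plain substring check.
    public static bool LooksLikeImage(string url)
    {
        return url.Contains(".jpg") || url.Contains(".gif") || url.Contains(".png");
    }

    public static void Main()
    {
        // Hand-written sample markup using the addresses quoted in the post.
        string html =
            "<a href=\"http://www.tianwenwang.cn/\">home</a>" +
            "<img src=\"http://p.tianwenwang.cn/upload/150318/68181426648163.jpg!list.jpg\">";

        foreach (string link in GetLinks(html))
        {
            Console.WriteLine("{0} image={1}", link, LooksLikeImage(link));
        }
    }
}
```

Because `!` is not in the pattern's path character class, the match for the picture stops at `68181426648163.jpg` and drops the `!list.jpg` tail. The home page link is matched too, which is why the suffix check is needed afterwards.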

The code-behind is as follows:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace 网络爬虫
{
    public partial class Form1 : Form
    {
        private static string[] getLinks(string html)
        {
            // Verbatim string: regex escapes such as \w and \. take a single backslash.
            const string pattern = @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
            Regex r = new Regex(pattern, RegexOptions.IgnoreCase); // build the regex
            MatchCollection m = r.Matches(html); // collect every match
            string[] links = new string[m.Count];

            for (int i = 0; i < m.Count; i++)
            {
                links[i] = m[i].ToString(); // copy out the matched URL
            }
            return links;
        }
        private static bool isValiable(string url)
        {
            if (url.Contains(".jpg") || url.Contains(".gif")||url.Contains(".png"))
            {
                return true; // the URL looks like an image resource
            }
            return false;
        }
        private static void savePicture(string path)
        {
            // Check the suffix first so that non-image links are never downloaded.
            if (!isValiable(path))
            {
                return;
            }

            WebRequest webRequest = WebRequest.Create(new Uri(path));
            // The using blocks release the response, bitmap, stream and data context.
            using (WebResponse webResponse = webRequest.GetResponse())
            using (Bitmap myImage = new Bitmap(webResponse.GetResponseStream()))
            using (MemoryStream ms = new MemoryStream())
            using (DataClasses1DataContext db = new DataClasses1DataContext())
            {
                // Re-encode the picture as JPEG and store its bytes in the database.
                myImage.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg);
                var p = new pictureUrl
                {
                    pictureUrl1 = ms.ToArray()
                };
                db.pictureUrl.InsertOnSubmit(p);
                db.SubmitChanges();
            }
        }
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            string path = this.textBox1.Text;
            WebRequest webRequest = WebRequest.Create(new Uri(path));
            string html;
            // Read the whole page as UTF-8; the using blocks release the connection.
            using (WebResponse webResponse = webRequest.GetResponse())
            using (StreamReader sr = new StreamReader(webResponse.GetResponseStream(), Encoding.UTF8))
            {
                html = sr.ReadToEnd();
            }
            // Save the page source to a txt file, overwriting any earlier copy.
            File.WriteAllText("../../txt.txt", html);
            // Hand every extracted link to savePicture, which stores only the images.
            foreach (string sl in getLinks(html))
            {
                savePicture(sl);
            }
        }
    }
}
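The substring check in isValiable accepts any address that merely contains ".jpg", ".gif" or ".png" somewhere, for example http://example.com/page.jpg.html. One possible way to tighten it, sketched here under my own assumptions and not taken from the post, is to parse the address and compare only the extension of its path. The class ImageUrlCheck, the method HasImageExtension and the extension list are hypothetical.

```csharp
using System;
using System.IO;

public static class ImageUrlCheck
{
    // Hypothetical list of extensions treated as images in this sketch.
    private static readonly string[] ImageExtensions = { ".jpg", ".jpeg", ".gif", ".png" };

    // True only when the path part of an absolute URL ends in a known image extension.
    public static bool HasImageExtension(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return false; // not a well-formed absolute address
        }

        // AbsolutePath excludes the query string, so "?file=a.png" is ignored.
        string extension = Path.GetExtension(uri.AbsolutePath);
        return Array.Exists(
            ImageExtensions,
            e => string.Equals(e, extension, StringComparison.OrdinalIgnoreCase));
    }
}
```

With this check, addresses of the form quoted earlier (ending in .jpg!list.jpg) still pass because the path's final extension is .jpg, while page.jpg.html is rejected. It is still only a guess based on the address; it does not verify that the server actually returns an image.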

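This version can only crawl sites that are structured like Tianwenwang. I will keep upgrading the crawler and try to turn it into a general-purpose one O(∩_∩)O~!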
