Crawling One Million Zhihu Users: UserTask

Posted by 王起帆


UserTask is the crawler module that fetches user information.

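using System;
using System.Diagnostics;
using System.Text;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

public class UserManage
{
    // Raw HTML of the user's profile page
    private string html;

    // The user's identifier as it appears in Zhihu URLs
    private string url_token;
}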

Constructor

A user's profile URL has the form "https://www.zhihu.com/people/" + url_token + "/following", so the constructor only needs to store the url_token:

    public UserManage(string urltoken)
    {
        url_token = urltoken;
    }

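First, wrap the page download in a method. HttpHelp is a helper class that is not shown in this post.

    // Downloads the profile page into the html field; returns false if nothing came back.
    private bool GetHtml()
    {
        string url = "https://www.zhihu.com/people/" + url_token + "/following";
        html = HttpHelp.DownLoadString(url);
        return !string.IsNullOrEmpty(html);
    }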
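The post does not show HttpHelp, so the following is only a minimal sketch of what such a helper might look like, assuming it wraps HttpClient and signals failure by returning null. The class body, the timeout, and the User-Agent value are all assumptions made for illustration, not the author's code.

```csharp
using System;
using System.Net.Http;

// Hypothetical stand-in for the HttpHelp helper, which the post does not show.
public static class HttpHelp
{
    // One shared HttpClient for the whole process; creating a new client
    // for every request can exhaust sockets under heavy crawling.
    private static readonly HttpClient Client = CreateClient();

    private static HttpClient CreateClient()
    {
        var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
        // Assumption: a browser-like User-Agent header; the value is a placeholder.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent", "Mozilla/5.0 (compatible; example-crawler)");
        return client;
    }

    // Returns the response body, or null when the request fails, so that
    // GetHtml() can keep using string.IsNullOrEmpty as its success check.
    public static string DownLoadString(string url)
    {
        try
        {
            return Client.GetStringAsync(url).GetAwaiter().GetResult();
        }
        catch (Exception ex)
        {
            Console.WriteLine("Download failed for {0}: {1}", url, ex.Message);
            return null;
        }
    }
}
```

Returning null instead of throwing keeps the caller simple: one failed profile is logged and skipped rather than stopping the crawl.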

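With the HTML in hand, the next step is to extract the JSON embedded in the page, using HtmlAgilityPack. The JSON is stored, HTML-escaped, in the data-state attribute of the element whose id is "data".

    public void analyse()
    {
        if (GetHtml())
        {
            try
            {
                Stopwatch watch = new Stopwatch();
                watch.Start();
                HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
                doc.LoadHtml(html);
                HtmlNode node = doc.GetElementbyId("data");
                // Undo the HTML escaping so the attribute value is valid JSON again
                StringBuilder stringbuilder = new StringBuilder(node.GetAttributeValue("data-state", ""));
                stringbuilder.Replace("&quot;", "\"");
                stringbuilder.Replace("&lt;", "<");
                stringbuilder.Replace("&gt;", ">");

                watch.Stop();
                Console.WriteLine("Parsing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
            }
            catch (Exception ex)
            {
                // Also reached when the "data" element is missing and node is null
                Console.WriteLine(ex.ToString());
            }
        }
    }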
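The chained Replace calls above handle three entities only. As an alternative sketch, and assuming the attribute value really does arrive entity-encoded (as those Replace calls imply), the framework's own decoder can handle every entity in one pass. The class name DataStateDecoder and the sample input are made up for illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper; not part of the original post.
public static class DataStateDecoder
{
    // Decodes all HTML entities (&quot;, &lt;, &gt;, &amp;, numeric entities, ...)
    // in a single pass. A null input is treated as an empty string.
    public static string Decode(string rawAttribute)
    {
        return WebUtility.HtmlDecode(rawAttribute ?? string.Empty);
    }
}

public static class DataStateDecoderDemo
{
    public static void Run()
    {
        // Made-up sample shaped like an escaped data-state value.
        string raw = "{&quot;headline&quot;:&quot;a &lt;b&gt; c &amp; d&quot;}";
        Console.WriteLine(DataStateDecoder.Decode(raw));
    }
}
```

Inside analyse, this would replace the StringBuilder and its three Replace calls with a single call on the value returned by GetAttributeValue.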

Queue the links to the user's follower and followee lists. RedisCore is a helper class that is not shown in this post.

    // Pushes the first page of this user's followee and follower API URLs onto "nexturl".
    // The json parameter is not used in this method.
    private void GetUserFlowerandNext(string json)
    {
        string foollowed = "https://www.zhihu.com/api/v4/members/" + url_token + "/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";
        string following = "https://www.zhihu.com/api/v4/members/" + url_token + "/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0";
        RedisCore.PushIntoList(1, "nexturl", following);
        RedisCore.PushIntoList(1, "nexturl", foollowed);
    }
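The two hard-coded URLs above differ only in the relation segment and in how much of the include parameter is percent-encoded. A small sketch of building them from the unescaped parameter follows. FollowUrlBuilder is a hypothetical name, and the idea that later pages are reached by raising offset in steps of limit is an assumption based on the offset and limit query parameters, not something the post states.

```csharp
using System;

// Hypothetical helper; not part of the original post.
public static class FollowUrlBuilder
{
    // The include parameter from the URLs above, written without percent-encoding.
    private const string Include =
        "data[*].answer_count,articles_count,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics";

    // relation is "followers" or "followees"; offset and limit select one page of the list.
    public static string Build(string urlToken, string relation, int offset, int limit = 20)
    {
        return "https://www.zhihu.com/api/v4/members/" + Uri.EscapeDataString(urlToken)
             + "/" + relation
             + "?include=" + Uri.EscapeDataString(Include)
             + "&limit=" + limit
             + "&offset=" + offset;
    }
}

public static class FollowUrlBuilderDemo
{
    public static void Run()
    {
        // First and third page of a made-up user's followee list.
        Console.WriteLine(FollowUrlBuilder.Build("some-token", "followees", 0));
        Console.WriteLine(FollowUrlBuilder.Build("some-token", "followees", 40));
    }
}
```

Keeping the parameter in readable form and escaping it in one place makes it easier to see which fields are being requested.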

 

Next, narrow the JSON down to just this user's record, using the JSON library Newtonsoft.Json.

    private void GetUserInformation(string json)
    {
        JObject obj = JObject.Parse(json);
        // The user's record lives at entities -> users -> <url_token>
        string xpath = "['" + url_token + "']";
        JToken tocken = obj.SelectToken("['entities']").SelectToken("['users']").SelectToken(xpath);
        RedisCore.PushIntoList(2, "User", tocken.ToString());
    }
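The chained SelectToken calls throw a NullReferenceException as soon as one level of the path is missing. The sketch below is one way to make the lookup null-safe, using JToken indexers and the null-conditional operator. UserExtractor and the sample JSON are made up for illustration; only the entities -> users -> url_token path comes from the code above.

```csharp
using System;
using Newtonsoft.Json.Linq;

// Hypothetical helper; not part of the original post.
public static class UserExtractor
{
    // Returns the user's JSON object as a string, or null when any
    // property along entities -> users -> urlToken is absent.
    public static string Extract(string json, string urlToken)
    {
        JObject obj = JObject.Parse(json);
        // A JObject indexer returns null for a missing property,
        // so each ?[ ] stops the chain instead of throwing.
        JToken user = obj["entities"]?["users"]?[urlToken];
        return user?.ToString();
    }
}

public static class UserExtractorDemo
{
    public static void Run()
    {
        // Made-up sample with the same nesting as the page's JSON.
        string json = "{\"entities\":{\"users\":{\"some-token\":{\"name\":\"Example\"}}}}";
        Console.WriteLine(UserExtractor.Extract(json, "some-token"));
        Console.WriteLine(UserExtractor.Extract(json, "missing-token") == null);
    }
}
```

With a helper like this, GetUserInformation could skip the Redis push when the result is null rather than relying on the outer catch block.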

Now the analyse method can be completed:

    public void analyse()
    {
        if (GetHtml())
        {
            try
            {
                Stopwatch watch = new Stopwatch();
                watch.Start();
                HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
                doc.LoadHtml(html);
                HtmlNode node = doc.GetElementbyId("data");
                // Undo the HTML escaping so the attribute value is valid JSON again
                StringBuilder stringbuilder = new StringBuilder(node.GetAttributeValue("data-state", ""));
                stringbuilder.Replace("&quot;", "\"");
                stringbuilder.Replace("&lt;", "<");
                stringbuilder.Replace("&gt;", ">");
                // Store this user's record, then queue the follower and followee list URLs
                GetUserInformation(stringbuilder.ToString());
                GetUserFlowerandNext(stringbuilder.ToString());
                watch.Stop();
                Console.WriteLine("Parsing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
            }
            catch (Exception ex)
            {
                // Also reached when the "data" element or the user's JSON entry is missing
                Console.WriteLine(ex.ToString());
            }
        }
    }
}
    

 

 

 
