Working with Data Sources 2
Posted by 阿难的机器学习计划
Web Scraping:
1. We can use requests.get to download the HTML of a webpage.
2. To extract content from that HTML, we can use the BeautifulSoup library.
from bs4 import BeautifulSoup
# content holds the page's HTML, e.g. the .content of the response returned by requests.get
parser = BeautifulSoup(content, 'html.parser') # create the parser from the downloaded HTML
body = parser.body # get the <body> tag
p = body.p # get the first <p> tag inside <body>
head = parser.head # get the <head> tag
title_text = head.title.text # get the text inside <title></title>
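As a self-contained sketch of the navigation above, the snippet below parses a small inline HTML string instead of a downloaded page. The `html` sample and its text are made up for illustration; in the notes, `content` would come from the response of requests.get.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for HTML fetched with requests.get
html = """
<html>
  <head><title>A simple example page</title></head>
  <body>
    <p>Here is some simple content for this page.</p>
    <p>A second paragraph.</p>
  </body>
</html>
"""

parser = BeautifulSoup(html, "html.parser")

# Attribute access (parser.body, body.p) returns only the FIRST matching tag
first_p = parser.body.p
title_text = parser.head.title.text

print(first_p.text)  # Here is some simple content for this page.
print(title_text)    # A simple example page
```

Note that `body.p` never reaches the second paragraph; getting every `<p>` requires find_all.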
3. We can use the find_all function to find every matching element in the webpage. It is only available on bs4 objects, such as the parser or a tag.
head = parser.find_all("head") # find every <head> tag and store the results as a list in head
title = head[0].find_all("title")
title_text = title[0].text
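Because find_all returns a list, indexing with `[0]` fails when nothing matches. The sketch below, using a made-up one-line HTML string, shows collecting every match and guarding against an empty result; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML for illustration
html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
parser = BeautifulSoup(html, "html.parser")

# Every <p> tag in the document, in document order
paragraphs = parser.find_all("p")
texts = [p.text for p in paragraphs]

# No <table> exists here, so the result is empty; check before indexing
tables = parser.find_all("table")
first_table = tables[0] if tables else None

print(len(paragraphs))  # 2
print(texts)            # ['one', 'two']
print(first_table)      # None
```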
4. The find_all function can also find content by its id. find_all always returns a list, even when only one element matches.
second_paragraph_text = parser.find_all("p", id="second")[0].text
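5. The find_all function can also find content by class. The keyword is class_ (with a trailing underscore) because class is a reserved word in Python.
second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text # "p" restricts the match to <p> tags that have this class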
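The id and class filters can be tried on a small sample. The HTML below is invented for illustration, reusing the id and class names from the notes; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML; only the id/class names come from the notes above
html = """
<body>
  <p id="first" class="outer-text">First outer paragraph.</p>
  <p class="inner-text">First inner paragraph.</p>
  <p id="second" class="outer-text">Second outer paragraph.</p>
  <p class="inner-text">Second inner paragraph.</p>
</body>
"""
parser = BeautifulSoup(html, "html.parser")

# Filter by id: ids should be unique, but the result is still a list
second = parser.find_all("p", id="second")[0].text

# Filter by class: several tags can share a class
inner = [p.text for p in parser.find_all("p", class_="inner-text")]

print(second)    # Second outer paragraph.
print(inner[1])  # Second inner paragraph.
```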
6. We can also use CSS selectors to find specific content. Like find_all, the select method is called on bs4 objects and returns a list.
first_outer_text = parser.select(".outer-text")[0].text
second_text = parser.select("#second")[0].text
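A short sketch of the selector syntax, again on invented sample HTML: `.name` matches a class, `#name` matches an id, and a tag name can be prefixed to narrow the match. select_one, which returns the first match or None, is shown as an alternative to indexing with `[0]`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML; only the id/class names come from the notes above
html = """
<body>
  <p id="first" class="outer-text">First outer paragraph.</p>
  <p class="inner-text">First inner paragraph.</p>
  <p id="second" class="outer-text">Second outer paragraph.</p>
  <p class="inner-text">Second inner paragraph.</p>
</body>
"""
parser = BeautifulSoup(html, "html.parser")

# ".outer-text" selects by class, "#second" selects by id
outer = [tag.text for tag in parser.select(".outer-text")]
second_text = parser.select("#second")[0].text

# "p.inner-text" means: <p> tags with class inner-text; select_one gives the first
first_inner = parser.select_one("p.inner-text").text

print(outer[0])     # First outer paragraph.
print(second_text)  # Second outer paragraph.
print(first_inner)  # First inner paragraph.
```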