What Is a B/S System?

Posted 2023-05-11


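A few days ago I heard about something called a "BS system" and have no idea what it is. If you know, please explain in detail. Thanks!

B/S architecture, short for Browser/Server architecture, emerged with the rise of Internet technology as a variation on, or improvement over, C/S (Client/Server) architecture. In this model the user interface is delivered through a WWW browser. A small part of the business logic runs on the front end (the browser), while most of it runs on the server, forming the so-called three-tier (3-tier) structure. B/S is the network architecture that came with the rise of the Web, and the Web browser is the client's most important piece of application software.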

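Characteristics and basic structure of B/S:

In a B/S structure every node is distributed across the network. The nodes can be divided into the browser side, the server side and middleware, and the system's functions are carried out through the links and interactions among them. The three tiers are a logical division; in practice the physical division usually follows the actual network layout.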
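The three logical tiers can be illustrated with a minimal sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: the `users` table, the `app` function and the `render` helper are hypothetical names, and an in-memory SQLite database stands in for a real database server.

```python
import sqlite3
from wsgiref.util import setup_testing_defaults

# Data tier: an in-memory SQLite database stands in for a real database server.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT)")
db.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])


def app(environ, start_response):
    """Logic tier: runs on the server and produces HTML for the browser."""
    names = [row[0] for row in db.execute("SELECT name FROM users ORDER BY name")]
    body = "<ul>" + "".join("<li>%s</li>" % n for n in names) + "</ul>"
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body.encode("utf-8")]


def render(path="/"):
    """Call the app the way a WSGI server would, without opening a socket."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], b"".join(chunks).decode("utf-8")


# Presentation tier: a browser would display this HTML.
status, page = render()
print(status, page)
```

To serve the same `app` to a real browser, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` would be one way to do it; the client still needs nothing but a browser.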

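Reference answer A: It is software that runs in the browser.

BS is short for Browser/Server.
B/S systems are built on top of a wide area network (WAN).

The main point is that the system can be operated from anywhere without installing any special software. Any computer with Internet access can use it, and the client needs zero maintenance. The system is also easy to extend: once online, a user only needs a username and password assigned by the system administrator. Users can even apply online; after passing the company's internal security authentication (such as a CA certificate), the system can assign an account automatically, with no human involvement.

Reference answer B: An online operating system.

Reference answer C: BS is short for Browser/Server. The client machine only needs a browser, such as Netscape Navigator or Internet Explorer, while the server runs a database such as Oracle, Sybase, Informix or SQL Server. The browser exchanges data with the database through a Web server.
The biggest advantage of B/S is that the system can be operated from anywhere without installing any special software. Any computer with Internet access can use it, and the client needs zero maintenance. The system is easy to extend: once online, a user only needs a username and password assigned by the system administrator. Users can even apply online; after passing the company's internal security authentication (such as a CA certificate), the system can assign an account automatically, with no human involvement.
B/S systems are built on top of a wide area network (WAN). (This answer was accepted by the asker.)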

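Web Crawler Basics: Crawl Strategies, with XPath and bs4 Implementations (Part 5)

In a crawler system, the queue of URLs waiting to be crawled is a key component. The order of the URLs in that queue matters just as much, because it decides which page is fetched first and which later. The method that determines this order is called the crawl strategy. The most common strategies are described below.

1 Depth-first traversal strategy:

With depth-first traversal, the crawler starts from the start page and follows links one after another along a single path. Only when that path is exhausted does it move on to the next start page and follow its links. Taking the figure (not reproduced here) as an example, the traversal path is: A-F-G E-H-I B C D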

# Depth-first crawl of URLs, recursive approach
import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

def getURL(url):
    html = getHTML(url)
    # Matches anchors such as: <a class="x" href="www/s?wd=..." target="_blank">link text</a>
    urlre = r'<a .*?href="(.*?)".*?>'
    urlList = re.findall(urlre, html)
    return urlList

def getHTML(url):
    html = requests.get(url, headers=headers).text
    return html

def getEmail():
    # Placeholder for the page-processing logic
    pass

def deepSpider(url, depth):
    print("\t\t\t" * depthDict[url], "Crawled page at level %d: %s" % (depthDict[url], url))
    # Stop once the depth limit is reached
    if depthDict[url] >= depth:
        return

    # Child URLs
    sonUrlList = getURL(url)
    for newUrl in sonUrlList:
        # Skip non-http links and URLs that were already seen
        if newUrl.find("http") != -1:
            if newUrl not in depthDict:
                # Level + 1
                depthDict[newUrl] = depthDict[url] + 1
                # Recurse into the child page
                deepSpider(newUrl, depth)

if __name__ == '__main__':
    # Start URL
    startUrl = "https://www.baidu.com/s?wd=岛国邮箱"
    # Depth bookkeeping
    depthDict = {}
    depthDict[startUrl] = 1  # {url: level}
    deepSpider(startUrl, 4)  # call deepSpider
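The regular expression in getURL is fragile: it depends on attribute order and quoting, and it ignores relative links. As an alternative sketch (not the method used above), the standard library's `html.parser` can collect `href` values and `urllib.parse.urljoin` can resolve them against the page URL. The names `LinkParser` and `extract_links` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links become absolute; absolute links are kept as-is.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


sample = '<p><a class="x" href="/s?wd=1">one</a> <a href="http://example.com/b">two</a></p>'
links = extract_links(sample, "http://example.com/index.html")
print(links)
```

Because every link comes back absolute, the crawler's "skip non-http links" check becomes a simple prefix test.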

Depth-first traversal, stack approach

import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

def getURL(url):
    html = getHTML(url)
    # Matches anchors such as: <a class="x" href="www/s?wd=..." target="_blank">link text</a>
    urlre = r'<a .*?href="(.*?)".*?>'
    urlList = re.findall(urlre, html)
    return urlList

def getHTML(url):
    html = requests.get(url, headers=headers).text
    return html

def vastSpider(depth):
    '''
    Depth-first crawl, second approach: stack-based (last in, first out)
    :param depth: maximum depth
    :return:
    '''
    # Loop until the stack is empty
    while len(urlList) > 0:  # equivalent to: while urlList:
        url = urlList.pop()  # key step: take the last item (last in, first out)
        print('\t\t\t' * depthDict[url], "Crawled level %d: %s" % (depthDict[url], url))

        # Depth control
        if depthDict[url] < depth:
            # Collect new URLs
            sonUrlList = getURL(url)
            for newUrl in sonUrlList:
                # Skip non-http links and URLs that were already seen
                if newUrl.find("http") != -1:
                    if newUrl not in depthDict:
                        depthDict[newUrl] = depthDict[url] + 1
                        # Push onto the to-crawl stack
                        urlList.append(newUrl)

if __name__ == '__main__':
    # Start URL
    startUrl = "https://www.baidu.com/s?wd=岛国邮箱"
    # Depth bookkeeping
    depthDict = {}
    depthDict[startUrl] = 1
    # To-crawl stack (a stack is simply a list here)
    urlList = []
    urlList.append(startUrl)
    vastSpider(4)

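2 Breadth-first traversal strategy

The basic idea of breadth-first traversal is to append the links found in a newly downloaded page directly to the end of the to-crawl URL queue. In other words, the crawler first fetches every page linked from the start page, then picks one of those pages and fetches every page linked from it, and so on. Using the same figure as above, the traversal path is: A-B-C-D-E-F-G-H-I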

# Queue approach: first in, first out
import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

def getURL(url):
    '''
    Get the new URLs found on a page
    :param url:
    :return: urlList
    '''
    html = getHTML(url)
    # Matches anchors such as: <a class="x" href="www/s?wd=..." target="_blank">link text</a>
    urlre = r'<a .*?href="(.*?)".*?>'
    urlList = re.findall(urlre, html)
    return urlList

def getHTML(url):
    html = requests.get(url, headers=headers).text
    return html

def vastSpider(depth):
    '''
    Breadth-first crawl
    :param depth: maximum depth
    :return:
    '''
    # Loop until the queue is empty
    while len(urlList) > 0:
        url = urlList.pop(0)  # take the first item (first in, first out)
        print('\t\t\t' * depthDict[url], "Crawled level %d: %s" % (depthDict[url], url))

        # Depth control
        if depthDict[url] < depth:
            # Collect new URLs
            sonUrlList = getURL(url)
            for newUrl in sonUrlList:
                # Skip non-http links and URLs that were already seen
                if newUrl.find("http") != -1:
                    if newUrl not in depthDict:
                        depthDict[newUrl] = depthDict[url] + 1
                        # Append to the to-crawl queue
                        urlList.append(newUrl)

if __name__ == '__main__':
    # Start URL
    startUrl = "https://www.baidu.com/s?wd=岛国邮箱"
    # Depth bookkeeping
    depthDict = {}
    depthDict[startUrl] = 1
    # To-crawl queue
    urlList = []
    urlList.append(startUrl)
    vastSpider(4)
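The only difference between the stack-based and queue-based crawlers is which end of the list the next URL is taken from. The offline sketch below makes that visible without any network access. The link graph `LINKS` is hypothetical (it is not the figure referred to in the text), and `crawl_order` is an illustrative helper, not part of the code above.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}


def crawl_order(start, max_depth, breadth_first):
    """Return the visiting order for a queue (BFS) or a stack (DFS)."""
    depth = {start: 1}          # doubles as the "already seen" set
    frontier = deque([start])
    order = []
    while frontier:
        # Queue: take from the front. Stack: take from the back.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        if depth[url] < max_depth:
            for child in LINKS[url]:
                if child not in depth:
                    depth[child] = depth[url] + 1
                    frontier.append(child)
    return order


bfs = crawl_order("A", 3, breadth_first=True)
dfs = crawl_order("A", 3, breadth_first=False)
print(bfs)  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs)  # ['A', 'C', 'F', 'B', 'E', 'D']
```

`collections.deque` is used because `popleft()` runs in constant time, whereas `list.pop(0)` shifts every remaining element.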

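3 Page parsing and data extraction

Generally, what we want to crawl is the content of a particular website or application, from which we extract the valuable information. That content falls into two categories: unstructured data and structured data.

Unstructured data: the data comes first, the structure afterwards.
Structured data: the structure comes first, the data afterwards.

Different kinds of data call for different processing methods.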

Processing unstructured data

Regular expressions
HTML files
    Regular expressions
    XPath
    CSS selectors
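As a small illustration of the regular-expression route for unstructured text, the sketch below pulls e-mail addresses out of a page string. The pattern is a deliberately simple assumption and does not cover every valid address; `find_emails` and the sample text are hypothetical.

```python
import re

# Simplified pattern: local part, "@", domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique addresses in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


page = "<p>Contact: alice@example.com, bob@example.org or alice@example.com</p>"
emails = find_emails(page)
print(emails)
```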

Processing structured data

JSON files
    JSONPath
    Convert to Python types and work with them (the json module)
XML files
    Convert to Python types (xmltodict)
    XPath
    CSS selectors
    Regular expressions
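For structured data such as JSON, converting to Python types is often all that is needed. A minimal sketch with the standard `json` module follows; the payload is made up for illustration.

```python
import json

# Hypothetical API response.
raw = (
    '{"city": "Hangzhou", "jobs": ['
    '{"title": "Python developer", "salary": 15000}, '
    '{"title": "Crawler engineer", "salary": 18000}]}'
)

data = json.loads(raw)                      # str -> dict / list / str / int
titles = [job["title"] for job in data["jobs"]]
best = max(data["jobs"], key=lambda job: job["salary"])

print(data["city"], titles, best["title"])
```

Once the data is a plain dict or list, ordinary indexing and comprehensions replace any special query language.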

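4 Beautiful Soup

(1) Beautiful Soup overview

Official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Compared with other HTML parsing approaches, Beautiful Soup has one very important advantage: it converts a complex HTML document into a tree structure in which every part of the HTML becomes an object, so the whole document can be navigated much like nested dictionaries and lists.

Compared with a regex-based crawler, it spares you the high cost of learning regular expressions.

Compared with XPath-based parsing, it likewise saves learning time.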

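Installation

# Install on Linux (Debian/Ubuntu, Python 3)
apt-get install python3-bs4
# Install as a Python package
pip install beautifulsoup4

Every node is a Python object, so we only need to query by node. The objects fall into four main types:

Tag  # an element node
NavigableString  # the text inside a tag
BeautifulSoup  # the root node (the whole document)
Comment  # an HTML comment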

(1) Creating the object:

From a string downloaded from the web

soup = BeautifulSoup('string downloaded from the web', 'lxml')

From a local file

soup = BeautifulSoup(open('1.html'), 'lxml')

(2) Tags

Pretty-printing

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # html_doc is a string holding the HTML document
print(soup.prettify())  # pretty-print the HTML

Getting a specific tag

soup.p.b  # the b tag inside the first p tag
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.title  # the title tag
# <title>The Dormouse's story</title>

soup.title.name  # the name of the title tag
# 'title'

soup.title.string   # the text inside the title tag
# "The Dormouse's story"

soup.title.parent.name  # the name of the title tag's parent
# 'head'

# Reading a tag attribute: soup.tag['attribute name']
# <a href="http://blog.csdn.net/watsy">watsy's blog</a>
soup.a['href']
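The snippets above never define `html_doc`. The self-contained sketch below assumes it is the "three sisters" sample from the Beautiful Soup documentation, which matches the outputs shown, and uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Assumed sample document (the "three sisters" example from the bs4 docs).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string           # text of <title>
parent_name = soup.title.parent.name     # name of the enclosing tag
bold = soup.p.b                          # first <b> inside the first <p>
first_href = soup.a["href"]              # attribute access on the first <a>

print(title_text, parent_name, bold, first_href)
```

Dotted access such as `soup.p.b` always returns the first matching tag; for every match, `find_all` is the tool to use.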

(3) find and find_all

find_all (returns a list)

find_all('a')  finds all a tags
find_all(['a', 'span'])  returns all a and span tags
find_all('a', limit=2)  only the first two a tags

find (returns a single object)

find('a'): only the first a tag
find('a', title='name')
find('a', class_='name')
# Notes
1. The HTML name attribute cannot be searched with a keyword argument, because name is already the tag-name parameter; use attrs={'name': ...} instead
2. class_ has a trailing underscore
3. Custom attributes such as age can be used

def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):
name is the tag name, attrs the attributes, recursive whether to search all descendants, text filters on the text content, limit caps the number of results, and **kwargs takes attribute filters as keyword arguments

print(soup.find('p'))  # the first p
print(soup.find_all('p'))  # all p tags, as a list

print(soup.find_all(['b', 'i']))  # tags matching any name in the list
print(soup.find_all('a', attrs={'id': 'link2'}))  # restrict to id: link2
print(soup.find_all('a', id='link2'))  # keyword form, id=link2
print(soup.find_all('a', limit=2))

# The class_ argument
print(soup.find_all('a', class_="sister"))
print(soup.find_all('a', text=re.compile('^L')))

By tag name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

With a regular expression
import re
for tag in soup.find_all(re.compile("^b")):  # all tags whose name starts with b
    print(tag.name)
# body
# b
for tag in soup.find_all(re.compile("t")):  # all tags whose name contains t
    print(tag.name)
# html
# title

With a function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

By tag name and class
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

Filtering by tag
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Filtering by attribute
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Filtering text with a regular expression
import re
soup.find(text=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

(4) select: pick node objects with CSS selectors

element   p
.class    .firstname
#id       #firstname
Attribute selectors
    [attribute]        [target]
    [attribute=value]  [target=blank]
Hierarchy selectors
    element element   div p
    element>element   div>p
    element,element   div,p
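Each selector form in the list above can be tried against a small document. The markup below is hypothetical, and the sketch assumes a standard beautifulsoup4 installation, in which `select` always returns a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that exercises each selector form.
html = """
<div id="main">
  <p class="intro">Hello <a href="/a" target="blank">first</a></p>
  <ul><li><a href="/b">second</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_element = [a.get_text() for a in soup.select("a")]            # element
by_class = [p.get_text() for p in soup.select(".intro")]         # .class
by_id = soup.select("#main")[0].name                             # #id
by_attr = [a["href"] for a in soup.select("a[target=blank]")]    # [attribute=value]
descendants = [a["href"] for a in soup.select("div a")]          # element element
children = [a["href"] for a in soup.select("p > a")]             # element>element

print(by_element, by_class, by_id, by_attr, descendants, children)
```

Note the difference between the last two: `div a` matches links at any depth under the div, while `p > a` matches only direct children of a p.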

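(5) Node information

Getting a node's content
    obj.string
    obj.get_text() (recommended)
Node properties
    tag.name  returns the tag name
    tag.attrs  returns the attributes as a dictionary
Getting a node attribute
    obj.attrs.get('title')
    obj.get('title')
    obj['title']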

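5 XPath syntax

XPath uses path expressions to select nodes or node sets in an XML document.

Importing

import lxml
from lxml import etree

Adding a browser plugin

Chrome plugin site: http://www.cnplugins.com/

Ctrl + Shift + X opens or closes the plugin

(1) Installing XPath support

# Install the lxml library
pip install lxml
# Import lxml.etree
from lxml import etree

etree.parse()  parses a local HTML file: html_tree = etree.parse('XX.html', etree.HTMLParser())
etree.HTML()   parses an HTML string fetched from the network: html_tree = etree.HTML(response.read().decode('utf-8'))
html_tree.xpath()    runs an XPath query and returns a list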

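(2) Selecting nodes

Expression	Description
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, no matter where they sit beneath it.
//@lang	Selects all attributes named lang.
bookstore	Selects all bookstore elements that are children of the current (context) node.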

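(3) Selecting elements

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.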
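The predicates in the table can be checked against a small document with lxml. The bookstore data below is made up for illustration.

```python
from lxml import etree

# Hypothetical bookstore document.
xml = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>45.00</price></book>
</bookstore>"""
root = etree.fromstring(xml)

first = root.xpath("/bookstore/book[1]/title/text()")            # XPath indexes start at 1
last = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
english = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, first_two, english, expensive)
```

`xpath()` returns a list even when a single node matches, which is why crawler code so often ends a query with `[0]`.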

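(4) Selecting several paths

Using the "|" operator in a path expression lets you select several paths at once.

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.

print(myetree.xpath('//*[@class="item-0"]'))  # * matches any element
print(myetree.xpath('//li[@class="item-0"] | //div[@class="item-0"]'))  # union (or)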

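(5) XPath syntax summary

Node queries
    element
Path queries
    //  finds all descendant nodes, regardless of level
    /  finds direct children
Predicate queries
    //div[@id]
    //div[@id="maincontent"]
Attribute queries
    //@class
Logical operations
    //div[@id="head" and @class="s_down"]
    //title | //price
Fuzzy queries
    //div[contains(@id, "he")]
    //div[starts-with(@id, "he")]
    //div[ends-with(@id, "he")]  (XPath 2.0 only; lxml implements XPath 1.0 and does not support it)
Content queries
    //div/h1/text()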
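`contains()` and `starts-with()` are XPath 1.0 functions and work in lxml; `ends-with()` belongs to XPath 2.0. A common XPath 1.0 workaround combines `substring()` and `string-length()`, sketched below on hypothetical markup. The `$s` variable is supplied through lxml's keyword-argument support for XPath variables.

```python
from lxml import etree

# Hypothetical markup with three ids to match against.
doc = etree.HTML('<div id="head">a</div><div id="the">b</div><div id="foot">c</div>')

contains = doc.xpath('//div[contains(@id, "he")]/@id')
starts = doc.xpath('//div[starts-with(@id, "he")]/@id')

# XPath 1.0 equivalent of ends-with(@id, $s): compare the tail of @id with $s.
ends = doc.xpath(
    '//div[substring(@id, string-length(@id) - string-length($s) + 1) = $s]/@id',
    s="he",
)

print(contains, starts, ends)

# Calling ends-with() directly is expected to fail, since lxml only knows XPath 1.0.
try:
    doc.xpath('//div[ends-with(@id, "he")]')
except etree.XPathEvalError as exc:
    print("ends-with is not available:", exc)
```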

(6) XPath examples

import lxml.etree

htmlFile = '''
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    '''

# To read from a file instead, pass a file path:
# html = lxml.etree.parse("filename.html", lxml.etree.HTMLParser())
myetree = lxml.etree.HTML(htmlFile)  # parse the HTML string directly
html = myetree  # the queries below use the name html

ul = myetree.xpath('/html/body/ul')  # start from the root node
ul = myetree.xpath('//ul')  # every ul at any depth
ul = myetree.xpath('/html//ul')   # every ul under html; returns a list

print(html.xpath("//li/@class"))  # class values of all li nodes
print(html.xpath("//li/@text"))  # empty here; returns values only if li has a text attribute
print(html.xpath("//li/a"))  # the a elements under li, one element per node (5 here)
print(html.xpath("//li/a/@href"))  # href values of the a tags under li
print(html.xpath('//li/a/@href="link3.html"'))  # True if any href equals link3.html

print(html.xpath("//li//span"))  # all span elements under li
print(html.xpath("//li//span/@class"))  # class values of all span elements under li
print(html.xpath("//li/a//@class"))  # class values on a (and its descendants) under li
print(html.xpath("//li"))  # all li nodes
print(html.xpath("//li[1]"))  # the first one; li[index] starts at 1
print(html.xpath("//li[last()]"))  # the last one
print(html.xpath("//li[last()-1]"))  # the second-to-last one
print(html.xpath("//li[last()-1]/a/@href"))  # href of the a under the second-to-last li

print(html.xpath('//*[@text="3"]'))  # elements whose text attribute equals 3
print(html.xpath('//*[@text="3"]/@class'))  # class values of elements whose text attribute equals 3
print(html.xpath('//*[@class="nimei"]'))  # elements whose class equals nimei
print(html.xpath("//li/a/text()"))  # text of the a tags
print(html.xpath("//li[3]/a/span/text()"))  # text of the span inside the third li's a tag

Examples

Downloading images

from lxml import etree
import urllib.request
import urllib.parse
import os
url = 'http://sc.chinaz.com/tupian/shuaigetupian.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)

# Read the returned HTML as a string
html_string = response.read().decode('utf-8')

# Convert the HTML string into an etree structure
html_tree = etree.HTML(html_string)

# Extract image URLs (lazy-load attribute src2) and names
src_list = html_tree.xpath('//div[@id="container"]//div[starts-with(@class,"box")]/div/a/img/@src2')
src_name = html_tree.xpath('//div[@id="container"]//div[starts-with(@class,"box")]/div/a/img/@alt')

# Download to the local images directory
os.makedirs('images', exist_ok=True)  # urlretrieve does not create the directory
for index in range(len(src_list)):
    pic_url = src_list[index]
    suffix = os.path.splitext(pic_url)[-1]
    file_name = 'images/' + src_name[index] + suffix
    urllib.request.urlretrieve(pic_url, file_name)  # save the image locally

bs4, XPath and regular expressions

import requests, re
from bs4 import BeautifulSoup
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}
url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%9D%AD%E5%B7%9E&kw=python&sm=0&p=1'

# Regular expression
response = requests.get(url, headers=headers).text
print(re.findall(r'<em>(\d+)</em>', response))

# bs4
soup = BeautifulSoup(response, 'lxml')
print(soup.find_all('span', class_='search_yx_tj')[0].em.text)
print(soup.select('span.search_yx_tj > em')[0].get_text())

# XPath
myetree = etree.HTML(response)
print(myetree.xpath('//span[@class="search_yx_tj"]/em/text()'))

Putting XPath to work

import re
import lxml
from lxml import etree
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

def getCity(url):
    '''
    Get the list of cities
    :param url:
    :return: list of cities
    '''
    response = requests.get(url, headers=headers).content.decode('gbk')
    mytree = lxml.etree.HTML(response)
    # City list
    cityList = mytree.xpath('//div[@class="maincenter"]/div[2]/div[2]//a')
    for city in cityList:
        # City name
        cityName = city.xpath('./text()')[0]  # relative to the current node
        # URL
        cityurl = city.xpath('./@href')[0]
        print(cityName, cityurl)
        # Fetch the number of pages for this city
        getPageNum(cityurl)

def getJobInfo(url):
    '''
    Get job postings
    :param url:
    :return:
    '''
    response = requests.get(url, headers=headers).content.decode('gbk')
    mytree = lxml.etree.HTML(response)
    jobList = mytree.xpath('//div[@class="detlist gbox"]/div')
    # Make sure there is data
    if len(jobList) != 0:
        for job in jobList:
            # Job title
            jobName = job.xpath('.//p[@class="info"]/span[1]/a/@title')[0]
            # URL
            joburl = job.xpath('.//p[@class="info"]/span[1]/a/@href')[0]
            # Company name
            company = job.xpath('.//p[@class="info"]/a/@title')[0]
            # Job location
            jobAddr = job.xpath('.//p[@class="info"]/span[2]/text()')[0]
            # Salary
            jobMoney = job.xpath('.//p[@class="info"]/span[@class="location"]/text()')
            if len(jobMoney) == 0:
                jobMoney = "negotiable"
            else:
                jobMoney = jobMoney[0]
            print(jobName, joburl, company, jobAddr, jobMoney)
            # Responsibilities
            jobResponsibility = job.xpath('.//p[@class="text"]/@title')[0]
            print(jobResponsibility)

def getPageNum(url):
    '''
    Get the number of pages
    :param url: city URL
    :return:
    '''
    response = requests.get(url, headers=headers).content.decode('gbk')
    mytree = lxml.etree.HTML(response)
    pageNum = mytree.xpath('//*[@id="cppageno"]/span[1]/text()')[0]
    numre = r".*?(\d+).*"
    pageNum = re.findall(numre, pageNum)[0]
    for i in range(1, int(pageNum) + 1):
        newurl = url + 'p%d' % i
        # Fetch the job postings on this page
        getJobInfo(newurl)

if __name__ == '__main__':
    starurl = "https://jobs.51job.com/"
    getCity(starurl)
