Is there a way to properly convert data from lists to a CSV file using BeautifulSoup?

Posted: 2019-08-19 20:00:00

Question:

I'm trying to build a web scraper for a website. The problem is that once the collected data is stored in lists, I can't write it to a csv file properly. I've been stuck on this for a while and hope someone knows how to fix it!

The loop that fetches the data from the web page:

import csv
from htmlrequest import simple_get
from htmlrequest import BeautifulSoup

# Define variables
listData = ['Companies', 'Locations', 'Descriptions']

plus = 15
max = 30
count = 0

# while loop to repeat process till max is reached
while count <= max:
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start=' + str(count) + '&s=h&t=SicCodeSearch&location=&sicCode=93120'
    raw_html = simple_get(start)
    soup = BeautifulSoup(raw_html, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text

    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text

    # This is extra information
    # for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
    #     listData[2] = listData[2].strip() + div3.text

    count = count + plus

An example of the printed output:

Companies
(AMG) AGILITY MANAGEMENT GROUP LTD
(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)
1 SPORT ORGANISATION LIMITED
100UK LTD
1066 GYMNASTICS
1066 SPECIALS
10COACHING LIMITED
147 LOUNGE LTD
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED

Locations
ENGLAND, BH8 9PS
LONDON, EC2M 2PL
ENGLAND, LS7 3JB
ENGLAND, LE2 8FN
UNITED KINGDOM, N18 2QX
AVON, BS5 0JH
UNITED KINGDOM, WC2H 9JQ
UNITED KINGDOM, SE18 5SZ
UNITED KINGDOM, EC1V 2NX

I tried to convert it to a CSV file with this code, but I don't know how to format my output properly! Any suggestions are welcome.

# writing to csv
with open('test.csv', 'w') as csvfile:
    write = csv.writer(csvfile, delimiter=',')
    write.writerow(['Name', 'Location'])
    write.writerow([listData[0], listData[1]])

print("Writing has been done!")

I'd like the code to format the output correctly in the csv file, so that the two columns can be imported into a database.

(The original post attached screenshots here showing the raw contents of 'test.csv', how it looked when opened, and the expected outcome; they are not reproduced in this copy.)

Comments:

- What exactly is your question? What is the stack trace or the unexpected behaviour?
- I'm not sure how to write the output data to a csv file properly so it's organised by company name / location. Everything I've tried has been in vain.
- But my question is: what is your problem? Is the file not written at all? Is it only partially written? Does it fail to run with an error message? That's important information to include in your post.
- There's actually no error! The code works and writes everything to the 'test.csv' file, but when I open the CSV file in a text editor the formatting is wrong. So my question is how to format the data correctly when writing it to a CSV file.
- How is the formatting wrong? What do you want it to look like compared to what it actually looks like? It would be great if you could edit your post to include an example. From the print statements you posted I can't tell how it fails to match your expectations.

Answer 1:

I'm not sure how its formatting is incorrect, but perhaps you just need to replace with open('test.csv', 'w') with with open('test.csv', 'w+', newline='').
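The effect of newline='' can be shown with a minimal sketch (the sample rows here are made up from the data in the question): the csv module writes its own "\r\n" line endings, and opening the file without newline='' lets Python's text mode translate them again, which on Windows produces an extra blank line after every row.

```python
import csv

rows = [["Name", "Location"],
        ["1PUTT LTD", "UNITED KINGDOM, WC2H 9JQ"]]

# newline='' hands line-ending control to the csv module; without it,
# on Windows each written "\r\n" is translated to "\r\r\n", which shows
# up as a blank line between rows when the file is opened.
with open("test.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Note that the field containing a comma ("UNITED KINGDOM, WC2H 9JQ") is quoted automatically on write and unquoted on read, so round-tripping through csv.reader gives back the original lists.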

I've merged in your code (swapping your htmlrequest module for the requests and bs4 modules, and instead of using listData I created my own lists. I've left your lists in, but they do nothing):

import csv
import bs4
import requests

# Define variables
listData = ['Companies', 'Locations', 'Descriptions']
company_list = []
locations_list = []
plus = 15
max = 30
count = 0

# while loop to repeat process till max is reached

while count <= max:
    start = 'https://www.companiesintheuk.co.uk/find?q=Activities+of+sport+clubs&start={}&s=h&t=SicCodeSearch&location=&sicCode=93120'.format(count)
    res = requests.get(start)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for i, div in enumerate(soup.find_all('div', class_="search_result_title")):
        listData[0] = listData[0].strip() + div.text
        company_list.append(div.text.strip())
    for i, div2 in enumerate(soup.find_all('div', class_="searchAddress")):
        listData[1] = listData[1].strip() + div2.text
        locations_list.append(div2.text.strip())
# This is extra information
#        for i, div3 in enumerate(soup.find_all('div', class_="searchSicCode")):
#            listData[2] = listData[2].strip() + div3.text
    count = count + plus

if len(company_list) == len(locations_list):
    with open('test.csv', 'w+', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(['Name', 'Location'])
        for i in range(len(company_list)):
            writer.writerow([company_list[i], locations_list[i]])

This produces a csv file like:

Name,Location
(AMG) AGILITY MANAGEMENT GROUP LTD,"UNITED KINGDOM, M6 6DE"
"(KLA) LIONS/LIONESS FOOTBALL TEAMS WORLD CUP LTD
(Dissolved)","ENGLAND, BD1 2PX"
0161 STUDios LTD,"UNITED KINGDOM, HD6 3AX"
1 CLICK SPORTS MANAGEMENT LIMITED,"ENGLAND, E10 5PW"
1 SPORT ORGANISATION LIMITED,"UNITED KINGDOM, CR2 6NF"
100UK LTD,"UNITED KINGDOM, BN14 9EJ"
1066 GYMNASTICS,"EAST SUSSEX, BN21 4PT"
1066 SPECIALS,"EAST SUSSEX, TN40 1HE"
10COACHING LIMITED,"UNITED KINGDOM, SW6 6LR"
10IS ACADEMY LIMITED,"ENGLAND, PE15 9PS"
"10TH MAN LIMITED
(Dissolved)","GLASGOW, G3 6AN"
12 GAUGE EAST MANCHESTER COMMUNITY MMA LTD,"ENGLAND, OL9 8DQ"
121 MAKING WAVES LIMITED,"TYNE AND WEAR, NE30 1AR"
121 WAVES LTD,"TYNE AND WEAR, NE30 1AR"
1-2-KICK LTD,"ENGLAND, BH8 9PS"
"147 HAVANA LIMITED
(Liquidation)","LONDON, EC2M 2PL"
147 LOUNGE LTD,"ENGLAND, LS7 3JB"
147 SNOOKER AND POOL CLUB (LEICESTER) LIMITED,"ENGLAND, LE2 8FN"
1ACTIVE LTD,"UNITED KINGDOM, N18 2QX"
1ON1 KING LTD,"AVON, BS5 0JH"
1PUTT LTD,"UNITED KINGDOM, WC2H 9JQ"
1ST SPORTS LTD,"UNITED KINGDOM, SE18 5SZ"
2 BRO PRO EVENTS LTD,"UNITED KINGDOM, EC1V 2NX"
2 SPLASH SWIM SCHOOL LTD,"ENGLAND, B36 0EY"
2 STEPPERS C.I.C.,"SURREY, CR0 6BX"
2017 MOTO LIMITED,"UNITED KINGDOM, ME2 4NW"
2020 ARCHERY LTD,"LONDON, SE16 6SS"
21 LEISURE LIMITED,"LONDON, EC4M 7WS"
261 FEARLESS CLUB UNITED KINGDOM CIC,"LANCASHIRE, LA2 8RF"
2AIM4 LIMITED,"HERTFORDSHIRE, SG2 0JD"
2POINT4 FM LTD,"LONDON, NW10 8LW"
3 LIONS SCHOOL OF SPORT LTD,"BRISTOL, BS20 8BU"
3 PT LTD,"ANTRIM, BT40 2FB"
3 PUTT LIFE LTD,"UNITED KINGDOM, LU3 2DP"
3 THIRTY SEVEN LTD,"KENT, DA9 9RS"
3:30 SOCCER SCHOOL LTD,"UNITED KINGDOM, EH6 7JB"
30 MINUTE WORKOUT (LLANISHEN) LTD,"PONTYCLUN, CF72 9UA"
321 RELAX LTD,"MID GLAMORGAN, CF83 3HL"
360 MOTOR RACING CLUB LTD,"HALSTEAD, CO9 2ET"
3LIONSATHLETICS LIMITED,"ENGLAND, S3 8DB"
3S SWIM ROMFORD LTD,"UNITED KINGDOM, DA9 9DR"
3XL EVENT MANAGEMENT LIMITED,"KENT, BR3 4NW"
3XL MOTORSPORT MANAGEMENT LIMITED,"KENT, BR3 4NW"
4 CORNER FOOTBALL LTD,"BROMLEY, BR1 5DD"
4 PRO LTD,"UNITED KINGDOM, FY5 5HT"

This looks fine to me, but your post is quite unclear about how you expect it to be formatted, so I can't really tell.
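A shorter way to pair the two lists row by row is zip(), which also makes the explicit length check unnecessary (the sample lists below are hypothetical stand-ins for the scraped data; note that zip silently stops at the shorter list rather than raising an error):

```python
import csv

# Hypothetical stand-ins for the scraped company_list / locations_list
company_list = ["(AMG) AGILITY MANAGEMENT GROUP LTD", "1 SPORT ORGANISATION LIMITED"]
locations_list = ["UNITED KINGDOM, M6 6DE", "UNITED KINGDOM, CR2 6NF"]

with open("test.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Name", "Location"])
    # zip pairs company i with location i into one (name, location) row each
    writer.writerows(zip(company_list, locations_list))
```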

Comments:

- This is actually what I was looking for! I've been struggling for a while and this is the solution. Thank you, you're a lifesaver! @hoyeung
- Nice! If you're interested in learning the csv module (or run into trouble handling csvs in the future), I recommend reading this page: docs.python.org/3/library/csv.html — there are other useful features, such as csv.DictReader(), and specifying dialects and quote characters!
