How to parse a KML file to CSV using Python/BeautifulSoup or similar?

Posted 2023-02-23

【Title】: How to parse a kml file to csv using python/BeautifulSoup or similar? 【Posted】: 2013-09-19 23:31:24 【Question】:

I have been trying to convert a Google Earth KML file to a GIS shapefile (or another GIS file format, such as a PostgreSQL/PostGIS table); see the related GIS.stackexchange question. Essentially, I want to convert the KML file to CSV.

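My problem is that the KML file contains some data stored in HTML tables, so the parsed KML file ends up with a field in my resulting data table that contains HTML, like this: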

    "<br><br><br>
<table border="1" padding="0">
<tr><td>ID_INT</td><td>NGA0104001</td></tr>
<tr><td>N_sd</td><td>Igbere</td></tr>
<tr><td>Skm2</td><td>3.34</td></tr>
<tr><td>PT2010</td><td>13000</td></tr>"

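When I use the GDAL library, I end up with a CSV file in which one field contains a large block of HTML. I would like to use BeautifulSoup (or a similar Python library) to parse the HTML elements of the KML file into four separate fields in my CSV file. I seem to be able to pass the KML to BeautifulSoup, but I am not sure what to do from there, or whether there is another way to achieve the same result.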

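I have read many similar questions on this topic here and elsewhere, but I really don't know where to start with parsing this file. Has anyone done this successfully? Many thanks in advance...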

Oh, and here is a snippet from my KML file as an example:

 <?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
    <Document>
    <name>AFNGA_SWAC.kml</name>
    <open>1</open>
    <Style id="s_ylw-pushpin1">
        <IconStyle>
            <scale>1.1</scale>
            <Icon>
                <href>http://maps.google.com/mapfiles/kml/pushpin/ylw-pushpin.png</href>
            </Icon>
            <hotSpot x="20" y="2" xunits="pixels" yunits="pixels"/>
        </IconStyle>
        <LineStyle>
            <color>ff00ffff</color>
            <width>3</width>
        </LineStyle>
        <PolyStyle>
            <color>3300ffff</color>
        </PolyStyle>
    </Style>
    <StyleMap id="m_ylw-pushpin1">
        <Pair>
            <key>normal</key>
            <styleUrl>#s_ylw-pushpin1</styleUrl>
        </Pair>
        <Pair>
            <key>highlight</key>
            <styleUrl>#s_ylw-pushpin_hl1</styleUrl>
        </Pair>
    </StyleMap>
    <Style id="s_ylw-pushpin_hl1">
        <IconStyle>
            <scale>1.3</scale>
            <Icon>
                <href>http://maps.google.com/mapfiles/kml/pushpin/ylw-pushpin.png</href>
            </Icon>
            <hotSpot x="20" y="2" xunits="pixels" yunits="pixels"/>
        </IconStyle>
        <LineStyle>
            <color>ff00ffff</color>
            <width>3</width>
        </LineStyle>
        <PolyStyle>
            <color>3300ffff</color>
        </PolyStyle>
    </Style>
    <Folder>
        <name>AFNGA_SWAC</name>
        <open>1</open>
        <description>1027 Éléments de la couche Afnga_swac</description>
        <Placemark>
            <name>Aba</name>
            <description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>ID_INT</td><td>NGA0101001</td></tr>
    <tr><td>N_sd</td><td>Aba</td></tr>
    <tr><td>Skm2</td><td>384.07</td></tr>
    <tr><td>PT2010</td><td>1010000</td></tr>]]></description>
            <styleUrl>#m_ylw-pushpin1</styleUrl>
            <Polygon>
                <extrude>1</extrude>
                <tessellate>1</tessellate>
                <outerBoundaryIs>
                    <LinearRing>
                        <coordinates>
                            7.294567000000001,5.00267,0 7.294408999999999,5.002552,0 7.294211,5.002394,0

【Comments】:

What data are you trying to retrieve with BeautifulSoup?

I want to parse these HTML tables from the code above:

<description><![CDATA[<br><br><br>
    <table border="1" padding="0">
    <tr><td>ID_INT</td><td>NGA0101001</td></tr>
    <tr><td>N_sd</td><td>Aba</td></tr>
    <tr><td>Skm2</td><td>384.07</td></tr>
    <tr><td>PT2010</td><td>1010000</td></tr>]]></description>

【Answer 1】:

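Beautiful Soup is usually very good at getting straight to what you want (assuming you can easily identify a pattern in the XML/HTML that contains the data you are looking for). I don't know exactly how you want your output formatted, but if you are after the data in the <description> tags, that is actually quite easy (the examples below are for Python 3):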

from bs4 import BeautifulSoup

inputfile = "whateveryourfileiscalled.xml"
with open(inputfile, 'r') as f:
    # No parser is named here, so bs4 picks whichever one is installed.
    soup = BeautifulSoup(f)

# After you have a soup object, you can access tags very easily.
# For instance, you can iterate over and get <description> like so:

for node in soup.select('description'):
    print(node)

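On its own that is usually not very useful, so going a little deeper, we can also access the elements inside the nodes we found in <description>. And if we want, we can isolate just the text (using the "string" attribute):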

for node in soup.select('description'):
    for item in node.select('td'):
        print(item.string)

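I always print to check that I am getting what I want. If there is nothing there, you will get a lot of Nones. In any case, this should get you close, and obviously you can do whatever you like with the result instead of printing it (store it in a container, write it to CSV, and so on). This will probably work for the block you pasted in the comments, but it may not work for the block in your original question, because that one has multiple description tags.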
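As a rough illustration of the "write it to CSV" option, the sketch below pairs up the cell strings and writes them with the standard csv module. The `cells_per_description` list is a hypothetical stand-in for whatever the loops above collect (its values are copied from the two samples in the question), `cells_to_row` is a made-up helper name, and the sketch assumes the cells always alternate label, value.

```python
import csv
import io

# Hypothetical input: the <td> strings collected for each <description>,
# in the order the loops above would print them.
cells_per_description = [
    ["ID_INT", "NGA0101001", "N_sd", "Aba", "Skm2", "384.07", "PT2010", "1010000"],
    ["ID_INT", "NGA0104001", "N_sd", "Igbere", "Skm2", "3.34", "PT2010", "13000"],
]


def cells_to_row(cells):
    """Pair alternating label/value cells into one {label: value} row."""
    return dict(zip(cells[0::2], cells[1::2]))


rows = [cells_to_row(cells) for cells in cells_per_description]

fieldnames = ["ID_INT", "N_sd", "Skm2", "PT2010"]

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use open("out.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each label becomes a column header and each description becomes one row, which matches the four separate fields asked for in the question.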

In your question you have multiple <description> tags, and not all of them contain nodes; in that case you need to use find_all instead of select:

for node in soup.find_all('description'):
    for item in node.find_all('td'):
        print(item.string)
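For comparison, here is a minimal sketch of the same cell extraction using only Python's standard library, applied directly to the HTML that sits inside one <description>. It is not part of the original answer: `TableCellCollector` and `description_to_fields` are hypothetical names, and the sketch assumes every table row holds exactly one label cell followed by one value cell, as in the sample from the question.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._in_td:
            self.cells[-1] += data


def description_to_fields(html):
    """Turn a two-column HTML table into a {label: value} dict."""
    collector = TableCellCollector()
    collector.feed(html)
    collector.close()
    cells = [cell.strip() for cell in collector.cells]
    # Assumption: cells alternate label, value, label, value, ...
    return dict(zip(cells[0::2], cells[1::2]))


# The HTML fragment quoted in the question.
sample = """<br><br><br>
<table border="1" padding="0">
<tr><td>ID_INT</td><td>NGA0104001</td></tr>
<tr><td>N_sd</td><td>Igbere</td></tr>
<tr><td>Skm2</td><td>3.34</td></tr>
<tr><td>PT2010</td><td>13000</td></tr>"""

fields = description_to_fields(sample)
print(fields)
```

Working on the description text itself, rather than on the whole KML file, avoids depending on how a particular HTML parser treats the CDATA wrapper around the table.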

【Discussion】:

Thank you. I'm not sure what is going wrong, but I get the following error when I try any of the last three code blocks above:

In [9]: for node in soup.select('description'):
            print node
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/marty/<ipython-input-9-94997d85f5f4> in <module>()
----> 1 for node in soup.select('description'):
      2     print node

TypeError: 'NoneType' object is not callable

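Help?!

Hmm. It isn't finding the <description> tags. It works for me with the XML you posted. Maybe your indentation is off? Have you tried soup.find_all('description')?

Yes, I tried that, but I get the same TypeError: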

In [16]: for node in soup.find_all('description'):
             for item in node.find_all('td'):
                 print(item.string)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/marty/<ipython-input-16-0b68b4e99228> in <module>()
----> 1 for node in soup.find_all('description'):
      2       for item in node.find_all('td'):
      3           print(item.string)

TypeError: 'NoneType' object is not callable

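Hmm, OK, it works now. The problem was that the bs4 library had not been installed correctly. I now get output like this:

ID_INT NGA3714003 N_sd Magare Skm2 1.29 PT2010 10500

This is the data I am looking for! What I really need is this output as separate fields/columns in a data frame (CSV), with "ID_INT", "N_sd", "Skm2" and "PT2010" as the column headers. I also need to parse the rest of the data before and after the HTML block, because it contains the key spatial data (i.e. the coordinates) that I need in order to plot this CSV in a GIS. I don't know whether that makes sense!?

By the way, parsing this in BeautifulSoup is killing my machine. Is that normal?! To be fair, my laptop is quite old and its Ubuntu 12.04 install is a bit flaky...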
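One possible way to get one CSV row per placemark, including the coordinates, is sketched below using only the standard library. This is an alternative to the BeautifulSoup approach, not the answerer's method. It reads the KML as XML with xml.etree.ElementTree.iterparse, which handles one element at a time and may be lighter on an old machine, then pulls the table cells out of each description with a simple regular expression. Several things here are assumptions: the helper names are hypothetical, `SAMPLE_KML` is a shortened stand-in for the question's file that was closed by hand (the ring repeats its first point to close it), the cells are plain `<td>` tags that alternate label and value, and writing the outer ring as WKT presumes the target GIS can import WKT from a CSV column.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

# Hand-closed, shortened stand-in for the question's file (an assumption:
# the sample in the question is cut off before its closing tags).
SAMPLE_KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder>
<Placemark>
<name>Aba</name>
<description><![CDATA[<br><br><br>
<table border="1" padding="0">
<tr><td>ID_INT</td><td>NGA0101001</td></tr>
<tr><td>N_sd</td><td>Aba</td></tr>
<tr><td>Skm2</td><td>384.07</td></tr>
<tr><td>PT2010</td><td>1010000</td></tr>]]></description>
<Polygon><outerBoundaryIs><LinearRing><coordinates>
7.294567000000001,5.00267,0 7.294408999999999,5.002552,0 7.294211,5.002394,0 7.294567000000001,5.00267,0
</coordinates></LinearRing></outerBoundaryIs></Polygon>
</Placemark>
</Folder></Document>
</kml>"""


def description_fields(html):
    """Pair up plain <td> cells as {label: value} (assumes label, value order)."""
    cells = [c.strip() for c in re.findall(r"<td>(.*?)</td>", html, flags=re.S)]
    return dict(zip(cells[0::2], cells[1::2]))


def ring_to_wkt(coord_text):
    """Convert KML 'lon,lat,alt' triples to a 2D WKT polygon string."""
    points = []
    for triple in coord_text.split():
        lon, lat = triple.split(",")[:2]
        points.append(f"{lon} {lat}")
    return "POLYGON((" + ", ".join(points) + "))"


def placemark_rows(source):
    """Yield one dict per <Placemark>, streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != KML_NS + "Placemark":
            continue
        row = {"name": elem.findtext(KML_NS + "name", default="")}
        row.update(description_fields(elem.findtext(KML_NS + "description", default="")))
        # First <coordinates> only; inner rings and multi-geometries
        # are ignored in this sketch.
        coords = elem.find(f".//{KML_NS}coordinates")
        has_coords = coords is not None and coords.text
        row["wkt"] = ring_to_wkt(coords.text) if has_coords else ""
        yield row
        elem.clear()  # release this placemark's children before moving on


fieldnames = ["name", "ID_INT", "N_sd", "Skm2", "PT2010", "wkt"]
rows = list(placemark_rows(io.BytesIO(SAMPLE_KML.encode("utf-8"))))

out = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a file
writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

For a real file, a path such as "AFNGA_SWAC.kml" can be passed to `placemark_rows` in place of the in-memory buffer.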
