Faster way to read Excel files to pandas dataframe

Posted 2023-03-11


【Title】: Faster way to read Excel files to pandas dataframe 【Posted】: 2015-04-30 05:28:06 【Question】:

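I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!

Does anyone have suggestions for speeding it up?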

import pandas as pd

def OTT_read(xl,site_name):
    df = pd.read_excel(xl.io,site_name,skiprows=2,parse_dates=0,index_col=0,
                       usecols=[0,1,2],header=None,
                       names=['date_time','%s_depth'%site_name,'%s_temp'%site_name])
    return df

def make_OTT_df(FILEDIR,OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl,site_name) for site_name in site_names]
    return site_names,df_list

FILEDIR='c:/downloads/'
OTT_FILE='OTT_Data_All_stations.xlsx'
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)

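【Comments】:

You could try saving as csv and loading that; the Excel reader may not be as fast as the csv reader.

It has multiple worksheets, though, so wouldn't that not work?

You should still be able to save each sheet. Unfortunately, the pain here is that you have to save each sheet separately. 14MB is not a large size, and the csv reader will eat it up quickly. One more thing to try is ExcelFile.parse.

【Answer 1】: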

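As others have suggested, csv reading is faster. So if you are on Windows and have Excel, you can call a vbscript to convert the Excel file to csv and then read the csv. I tried the script below and it took about 30 seconds.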

import pandas as pd
from subprocess import call

# create a list with the sheet numbers you want to process
sheets = map(str,range(1,6))

# convert each sheet to csv and then read it using read_csv;
# df maps each sheet number (as a string) to its DataFrame
df = {}
excel='C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv'
    call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet]=pd.read_csv(csv)

Here's a little Python snippet to create the ExcelToCsv.vbs script:

#write vbscript to file
vbscript="""if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit
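"""

# vbscript.encode() returns bytes, so open the file in binary mode;
# text mode ('w') raises a TypeError on Python 3
f = open('ExcelToCsv.vbs','wb')
f.write(vbscript.encode('utf-8'))
f.close()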

This answer benefited from "Convert XLS to CSV on command line" and "csv & xlsx files import to pandas data frame: speed issue".

【Comments】:

What is the solution on Linux?

It gave the error "TypeError: write() argument must be str, not bytes", so I changed it to f = open('ExcelToCsv.vbs','wb'). Thanks.

【Answer 2】:

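If you have fewer than 65536 rows (in each sheet), you can try xls (instead of xlsx). In my experience, xls is faster than xlsx. It is hard to compare with csv because that depends on the number of sheets.

Although this is not an ideal solution (xls is an old proprietary binary format), I have found it useful if you are working with a lot of sheets, with internal formulas whose values are updated often, or if for any reason you really want to keep the Excel multi-sheet functionality (instead of separate csv files).

【Comments】:

【Answer 3】: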

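I know this is old, but in case anyone else is looking for an answer that doesn't involve VB: pandas read_csv() is faster, but you don't need a VB script to get a csv file.

Open your Excel file and save it in *.csv (comma separated values) format.

Under Tools you can select Web Options, and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows, Western European, because Windows UTF encoding is "special", but there are lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.

The encoding options are listed here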
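
As a minimal sketch of the last step, the snippet below writes a small file as cp1252 (the Python codec name for Windows Western European) and reads it back by passing the same name to the encoding argument. The sample data and the file name are made up for illustration; they are not from the answer.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data containing non-ASCII characters.
sample = "site,note\nA,café\nB,Zürich\n"

# Write the file as cp1252 bytes, standing in for a csv saved from Excel
# with the "Western European (Windows)" encoding option (an assumption).
csv_path = os.path.join(tempfile.mkdtemp(), "export_cp1252.csv")
with open(csv_path, "w", encoding="cp1252", newline="") as fh:
    fh.write(sample)

# Tell pandas which encoding to decode the bytes with.
df = pd.read_csv(csv_path, encoding="cp1252")
print(df)
```

If the encoding passed to read_csv does not match the one used when saving, accented characters either come out garbled or raise a UnicodeDecodeError, depending on the bytes involved.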

【Comments】:

【Answer 4】:

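There is no reason to open Excel if you are willing to deal with a slow conversion once.

df = pd.read_excel(xlsx_path)   # slow, but only done once

df.to_csv(csv_path)             # later runs can use pd.read_csv(csv_path)

This avoids Excel and Windows-specific calls. In my case, the one-time hit was worth it. I had a ☕.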
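
One way to sketch this "convert once, read fast afterwards" idea is a small caching helper. The function name, the cache file naming scheme, and the excel_reader parameter are all assumptions made for this illustration, not part of the answer; sheet names containing characters that are invalid in file names would need extra handling.

```python
from pathlib import Path

import pandas as pd


def read_with_csv_cache(xlsx_path, sheet_name, cache_dir, excel_reader=pd.read_excel):
    """Read one sheet, converting it to csv on the first call only.

    excel_reader is a hypothetical hook (defaulting to pd.read_excel) so the
    slow step can be swapped out or counted.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    csv_path = cache_dir / f"{Path(xlsx_path).stem}_{sheet_name}.csv"

    if not csv_path.exists():
        # Slow path: parse the Excel sheet once and store it as csv.
        df = excel_reader(xlsx_path, sheet_name=sheet_name)
        df.to_csv(csv_path, index=False)

    # Fast path: every call (including the first) returns the csv contents,
    # so the result has the same dtypes no matter which path was taken.
    return pd.read_csv(csv_path)


# Example call (file and sheet names are placeholders):
# df = read_with_csv_cache("OTT_Data_All_stations.xlsx", "Site1", "csv_cache")
```

The cache is never invalidated here, so the csv files have to be deleted by hand when the workbook changes.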

【Comments】:

【Answer 5】:

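In my experience, pandas read_excel() works fine with Excel files that contain multiple sheets. As suggested in "Using Pandas to read multiple worksheets", if you set sheet_name to None it automatically puts every sheet into a DataFrame and returns a dictionary of DataFrames keyed by sheet name.

But the reason it takes so long is where you parse the text in your code. A 14MB Excel file with 5 sheets is not that much. I have a 20.1MB Excel file with 46 sheets, each with more than 6000 rows and 17 columns, and reading it with read_excel took the time shown below: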

import time
import datetime as dt
import pandas as pd

t0 = time.time()

def parse(datestr):
    y,m,d = datestr.split("/")
    return dt.date(int(y),int(m),int(d))

data = pd.read_excel("DATA (1).xlsx", sheet_name=None, encoding="utf-8", skiprows=1, header=0, parse_dates=[1], date_parser=parse)

t1 = time.time()

print(t1 - t0)
## result: 37.54169297218323 seconds

In the code above, data is a dictionary containing 46 DataFrames.
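
A short sketch of working with such a dictionary follows. The two small frames are hypothetical stand-ins for what read_excel(..., sheet_name=None) returns, so the example runs without a workbook; the sheet and column names are invented.

```python
import pandas as pd

# Stand-in for: sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
# ("workbook.xlsx" and the sheet names below are hypothetical).
sheets = {
    "SiteA": pd.DataFrame({"depth": [1.0, 1.2], "temp": [10.5, 10.7]}),
    "SiteB": pd.DataFrame({"depth": [0.8], "temp": [9.9]}),
}

# Each key is a sheet name, each value is that sheet's DataFrame.
for name, frame in sheets.items():
    print(name, frame.shape)

# pd.concat accepts the dict directly; the keys become the outer index level.
combined = pd.concat(sheets, names=["sheet", "row"])
print(combined.loc["SiteB"])
```

Concatenating only makes sense when the sheets share the same columns; otherwise it is usually simpler to keep the dictionary and index it by sheet name.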

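As others have suggested, using read_csv() can help, because reading .csv files is faster. But considering that .xlsx files use compression, the .csv files may be larger and therefore slower to read. However, if you want to convert your file to comma-separated values using Python (the VB code is provided by Rich Signel), you can use: Convert xlsx to csv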
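
An .xlsx file is a ZIP archive of XML parts, so the compression can be inspected with the standard zipfile module. The helper below is only a sketch (its name is made up), and the uncompressed XML size it reports is not the size of the equivalent csv; it merely shows how much the archive shrinks on disk.

```python
import zipfile


def zip_sizes(path_or_file):
    """Return (compressed_bytes, uncompressed_bytes) summed over all members."""
    with zipfile.ZipFile(path_or_file) as zf:
        infos = zf.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


# Example call on a real workbook (path is a placeholder):
# compressed, uncompressed = zip_sizes("DATA (1).xlsx")
# print(compressed, uncompressed, uncompressed / compressed)
```

Whether the larger csv actually reads more slowly depends on disk speed and parsing cost, so it is worth timing both on the real file.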

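【Comments】:

【Answer 6】:

I wrote a quick and dirty script that just reads the values from .xlsx files. It does not update values (such as dates) and only works for the files I use. There may still be some bugs, because I just wrote it down without studying the xlsx definition thoroughly :-). But it is about five to ten times faster than the default pd.read_excel.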

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 16:42:08 2020

@author: FelixKling
"""

import re
import zipfile
import pandas as pd
import datetime as dt
import numpy as np
import html

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def read(path, sheet_name=None, header=True, index_col=False, skiprows=[], skipcolumns=[]):
    """
    Reads an .xlsx or .xlsm file and returns a Pandas DataFrame. Is much faster than pandas.read_excel().

    Parameters
    ----------
    path : str
        The path to the .xlsx or .xlsm file.
    sheet_name : str, optional
        Name of the sheet to read. If none, the first (not the active!) sheet is read. The default is None.
    header : bool, optional
        Whether to use the first line as column headers. The default is True.
    index_col : bool, optional
        Whether to use the first column as index. The default is False.
    skiprows : list of int, optional.
        The row numbers to skip ([0, 1] skips the first two rows). The default is [].
    skipcolumns : list of int, optional.
        The column numbers to skip ([0, 1] skips the first two columns). The default is [].

    Raises
    ------
    TypeError
        If the file is no .xlsx or .xlsm file.
    FileNotFoundError
        If the sheet name is not found.

    Returns
    -------
    Pandas DataFrame
        The input file as DataFrame.

    """
    # check extension
    if "." not in path:
        raise TypeError("This is no .xlsx or .xlsm file!")
    if path.rsplit(".", 1)[1] not in ["xlsx", "xlsm"]:
        raise TypeError("This is no .xlsx or .xlsm file!")

    path = path.replace("\\","/")

    tempfiles = dict()
    with zipfile.ZipFile(path, 'r') as zipObj:
        for name in zipObj.namelist():
            if name.startswith("xl/worksheets/") or name in [
                    "xl/_rels/workbook.xml.rels",
                    "xl/styles.xml",
                    "xl/workbook.xml",
                    "xl/sharedStrings.xml",                    
                    ]:
                try:
                    tempfiles[name] = zipObj.read(name).decode("utf-8")
                except UnicodeDecodeError:
                    tempfiles[name] = zipObj.read(name).decode("utf-16")

    # read rels (paths to sheets)
    
    text = tempfiles["xl/_rels/workbook.xml.rels"]
    rels = {}
    
    relids = re.findall(r'<Relationship Id="([^"]+)"', text)
    relpaths = re.findall(r'<Relationship .*?Target="([^"]+)"', text)
    rels = dict(zip(relids, relpaths))

    # read sheet names and relation ids

    if sheet_name:
        text = tempfiles["xl/workbook.xml"]
        workbooks = {}
       
        workbookids = re.findall(r'<sheet.*? r:id="([^"]+)"', text)
        workbooknames = re.findall(r'<sheet.*? name="([^"]+)"', text)
        workbooks = dict(zip(workbooknames, workbookids))
        if sheet_name in workbooks:
            sheet = rels[workbooks[sheet_name]].rsplit("/", 1)[1]
        else:
            raise FileNotFoundError("Sheet " + str(sheet_name) + " not found in Excel file! Available sheets: " + "; ".join(workbooks.keys()))

    else:
        sheet="sheet1.xml"

    # read strings, they are numbered
    string_items = []
    if "xl/sharedStrings.xml" in tempfiles:
        text = tempfiles["xl/sharedStrings.xml"]
        
        string_items = re.split(r"<si.*?><t.*?>", text.replace("<t/>", "<t></t>").replace("</t></si>","").replace("</sst>",""))[1:]
        string_items = [html.unescape(str(i).split("</t>")[0]) if i != "" else np.nan for i in string_items]
    
    # read styles, they are numbered

    text = tempfiles["xl/styles.xml"]
    styles = re.split(r"<[/]?cellXfs.*?>", text)[1]
    styles = styles.split('numFmtId="')[1:]
    styles = [int(s.split('"', 1)[0]) for s in styles]

    numfmts = text.split("<numFmt ")[1:]
    numfmts = [n.split("/>", 1)[0] for n in numfmts]
    for i, n in enumerate(numfmts):
        n = re.sub(r"\[[^\]]*\]", "", n)
        n = re.sub(r'"[^"]*"', "", n)
        if any([x in n for x in ["y", "d", "w", "q"]]):
            numfmts[i] = "date"
        elif any([x in n for x in ["h", "s", "A", "P"]]):
            numfmts[i] = "time"
        else:
            numfmts[i] = "number"

    def style_type(x):
        if 14 <= x <= 22:
            return "date"
        if 45 <= x <= 47:
            return "time"
        if x >= 165:
            return numfmts[x - 165]
        else:
            return "number"

    styles = list(map(style_type, styles))


    text = tempfiles["xl/worksheets/" + sheet]


    def code2nr(x):
        nr = 0
        d = 1
        for c in x[::-1]:
            nr += (ord(c)-64) * d
            d *= 26
        return nr - 1

    table = []
    max_row_len = 0

    rows = [r.replace("</row>", "") for r in re.split(r"<row .*?>", text)[1:]]
    for r in rows:            
        # c><c r="AT2" s="1" t="n"><v></v></c><c r="AU2" s="115" t="inlineStr"><is><t>bla (Namensk&#252;rzel)</t></is></c>

        r = re.sub(r"</?r.*?>","", r)        
        r = re.sub(r"<(is|si).*?><t.*?>", "<v>", r)
        r = re.sub(r"</t></(is|si)>", "</v>", r)
        r = re.sub(r"</t><t.*?>","", r)

        values = r.split("</v>")[:-1]
        add = []
        colnr = 0
        for v in values:
            value = re.split("<v.*?>", v)[1]
            
            v = v.rsplit("<c", 1)[1]
            # get column number of the field
            nr = v.split(' r="')[1].split('"')[0]
            nr = code2nr("".join([n for n in nr if n.isalpha()]))
            if nr > colnr:
                for i in range(nr - colnr):
                    add.append(np.nan)
            colnr = nr + 1

            sty = "number"
            if ' s="' in v:
                sty = int(v.split(' s="', 1)[1].split('"', 1)[0])
                sty = styles[sty]
         
            # inline strings
            if 't="inlineStr"' in v:
                add.append(html.unescape(value) if value != "" else np.nan)
            # string from list
            elif 't="s"' in v:
                add.append(string_items[int(value)])
            # boolean
            elif 't="b"' in v:
                add.append(bool(int(value)))
            # date
            elif sty == "date":
                if len(value) == 0:
                    add.append(pd.NaT)
                # Texts like errors
                elif not is_number(value):
                    add.append(html.unescape(value))
                else:
                    add.append(dt.datetime(1900,1,1) + dt.timedelta(days=float(value) - 2))
            # time
            elif sty == "time":
                if len(value) == 0:
                    add.append(pd.NaT)
                # Texts like errors
                elif not is_number(value):
                    add.append(html.unescape(value))
                else:
                    add.append((dt.datetime(1900,1,1) + dt.timedelta(days=float(value) - 2)).time())
            # Null
            elif len(value) == 0:
                add.append(np.nan)
            # Texts like errors
            elif not is_number(value):
                add.append(html.unescape(value))
            # numbers
            else:
                add.append(round(float(value), 16))
        table.append(add)
        if len(add) > max_row_len:
            max_row_len = len(add)

    df = pd.DataFrame(table)

    # skip rows or columns
    df = df.iloc[[i for i in range(len(df)) if i not in skiprows], [i for i in range(len(df.columns)) if i not in skipcolumns]]
    
    if index_col:
        df = df.set_index(df.columns[0])
    if header:
        df.columns = df.iloc[0].values
        df = df.iloc[1:]

    return df
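
Two conversions in the script are easy to miss: cell references such as "AT2" are turned into zero-based column numbers (code2nr), and date cells are stored as serial day counts that get added to 1900-01-01 with an offset of 2. The standalone sketch below restates both with hypothetical helper names. The offset of 2 accounts for serial 1 being 1900-01-01 plus the non-existent 1900-02-29 that Excel counts, so it holds for dates from March 1900 onward.

```python
import datetime as dt


def column_index(cell_ref):
    """Zero-based column number of a cell reference such as 'AT2'."""
    letters = "".join(ch for ch in cell_ref if ch.isalpha())
    index = 0
    for ch in letters.upper():
        # Column letters behave like a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def serial_to_datetime(serial):
    """Convert an Excel serial date (1900 date system) to a datetime."""
    return dt.datetime(1900, 1, 1) + dt.timedelta(days=float(serial) - 2)


print(column_index("A1"))          # 0
print(column_index("AT2"))         # 45
print(serial_to_datetime(43831))   # 2020-01-01 00:00:00
```

The fractional part of a serial value is the time of day, which is why the script takes .time() of the same expression for cells with a time format.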

【Comments】:

【Answer 7】:

I use xlsx2csv to convert the Excel file to csv in memory, which helped cut the read time to about half.

from xlsx2csv import Xlsx2csv
from io import StringIO
import pandas as pd


def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df

【Comments】: