Only read Excel sheet_names containing a certain word into a pandas dataframe

【Posted】: 2021-10-11 16:20:15 【Question】:

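I have many reports to compile into a single dataframe in Python.

This code loops through my directory and reads every report file in which the sheet name is the same. Each workbook has many sheets, but I only want the sheet_names that contain a specific string, 'Report'.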

import pandas as pd
from pathlib import Path
import os
import glob

path_string = 'path/to/working/directory'
rootdir = Path(path_string)
# names of the subdirectories directly under rootdir
onlydirs = [f for f in os.listdir(rootdir) if os.path.isdir(os.path.join(rootdir, f))]

df0 = pd.DataFrame()
for direct in onlydirs:
    print(direct)
    # join with pathlib instead of a hard-coded '\\' separator
    dirpath = rootdir / direct
    for f in dirpath.glob("*Report.xlsm"):
        print(f.name)
        temp = pd.read_excel(f, sheet_name='Report')
        df0 = pd.concat([df0, temp])
display(df0)  # display() is available in IPython/Jupyter; use print(df0) in a plain script

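Now suppose the report format changes over time, so that sheet_name='Report' becomes sheet_name='XYZ Report'. I have many reports, and the name has changed several times. I don't want to hard-code every possible report name in multiple separate loops.

I can use glob to read all files ending in "Report.xlsm", but is there a similar way to read the sheet_names that contain the text "Report" rather than match the exact string?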

【Comments】:

【Answer 1】:

Try:

import glob
import os
import re

import pandas as pd

path = r'./files'  # use your path
all_files = glob.glob(os.path.join(path, "*.xlsm"))

# case-insensitive pattern matching names like blahReportblah or fooreportingss. Modify as required.
pattern = re.compile(r'report', re.IGNORECASE)

# list to hold the dataframes read from matching sheets
dfs = []

# for each file in the path above ending in .xlsm
for file in all_files:
    # test only the file name, so a directory called 'reports' does not match every file
    if not pattern.search(os.path.basename(file)):
        continue
    # open the workbook once; the context manager closes it afterwards
    with pd.ExcelFile(file) as ex_file:
        for sheet in ex_file.sheet_names:
            # keep only sheets whose name contains 'report' in any casing, e.g. 'RePORting'
            if pattern.search(sheet):
                # parse_dates may not be required; tweak as needed
                dfs.append(ex_file.parse(sheet, parse_dates=True))

#handle a list that is empty
if not dfs:
    print('No files or sheets found.')
    #create an empty placeholder frame
    df = pd.DataFrame()
#otherwise concatenate; this also works for a single frame, and ignore_index=True renumbers the rows from 0
else:
    df = pd.concat(dfs, ignore_index=True)

#check what you've got
print(df.head())
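
Since the question asks for something glob-like, here is a minimal sketch of the same sheet-name filter using glob-style wildcards instead of a regular expression. The helper name `glob_sheet_names` and the sample sheet names are hypothetical, chosen only for illustration; the matching itself relies on `fnmatch.fnmatchcase` from the standard library, with both sides lowercased so the result does not depend on the operating system.

```python
from fnmatch import fnmatchcase


def glob_sheet_names(sheet_names, pattern="*report*"):
    """Return the sheet names that match a glob-style pattern, ignoring case.

    Hypothetical helper: not part of pandas or of the answer above.
    """
    lowered = pattern.lower()
    return [name for name in sheet_names if fnmatchcase(name.lower(), lowered)]


# Illustrative sheet names; in practice these would come from pd.ExcelFile(file).sheet_names
names = ["Summary", "Report", "XYZ Report", "Notes"]
print(glob_sheet_names(names))  # ['Report', 'XYZ Report']
```

The filtered list can then be passed as `sheet_name` to `pd.read_excel`, which returns a dict of dataframes keyed by sheet name when it is given a list.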

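【Comments】:

Thank you for the detailed reply. This seems reasonable. I will try this approach.

【Answer 2】:

You need to write a function that reads the sheet names and checks whether each one contains the word "Report". This may help you get the sheet names: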

How to obtain sheet names from XLS files without loading the whole file?
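
One possible shape for such a function, as a sketch rather than a definitive implementation: an .xlsx or .xlsm workbook is a zip archive, so the sheet names can be read from its workbook part without parsing any worksheet data. The function names `list_sheet_names` and `sheets_containing` are hypothetical. The sketch assumes the workbook part is stored at the conventional location `xl/workbook.xml` and uses the usual (transitional) SpreadsheetML namespace; it does not cover legacy .xls files.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the main SpreadsheetML part (assumption: transitional OOXML, not "strict")
_MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def list_sheet_names(workbook):
    """Return the sheet names of an .xlsx/.xlsm file, reading only xl/workbook.xml.

    `workbook` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(workbook) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(_MAIN_NS + "sheet")]


def sheets_containing(workbook, word="report"):
    """Return the sheet names that contain `word`, ignoring case."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [name for name in list_sheet_names(workbook) if pattern.search(name)]
```

Each name returned by `sheets_containing(path)` can be passed to `pd.read_excel(path, sheet_name=name)`. If opening the workbook with pandas is acceptable, `pd.ExcelFile(path).sheet_names` gives the same list with less code.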

【Comments】:
