Python recursive folder read

Posted 2023-02-18

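【Title】: Python recursive folder read 【Published】: 2011-01-13 20:01:27 【Question】:

I have a C++/Obj-C background and I am just discovering Python (I have been writing it for about an hour). I am writing a script to recursively read the contents of text files in a folder structure.

The problem is that the code I have written only works one folder deep. I can see why in the code (see # hardcoded path); I just don't know how to move forward in Python, since I am brand new to it.

Python code: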

import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):

    for folder in subFolders:
        outfileName = rootdir + "/" + folder + "/py-outfile.txt" # hardcoded path
        folderOut = open( outfileName, 'w' )
        print "outfileName is " + outfileName

        for file in files:
            filePath = rootdir + '/' + file
            f = open( filePath, 'r' )
            toWrite = f.read()
            print "Writing '" + toWrite + "' to" + filePath
            folderOut.write( toWrite )
            f.close()

        folderOut.close()

【Question comments】:

【Answer 1】:

Make sure you understand the three return values of os.walk:

for root, subdirs, files in os.walk(rootdir):

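They have the following meaning:

root: the current path being "walked"
subdirs: the entries in root that are directories
files: the entries in root (not in subdirs) that are not directories

And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file - you must join the currently "walked" folder instead of the topmost folder. So that must be filePath = os.path.join(root, file). BTW, "file" is a builtin (in Python 2), so you normally don't use it as a variable name.

Another problem is your loops, which should look like this, for example: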

import os
import sys

walk_dir = sys.argv[1]

print('walk_dir = ' + walk_dir)

# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))

for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)

    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)

        for filename in files:
            file_path = os.path.join(root, filename)

            print('\t- file %s (full path: %s)' % (filename, file_path))

            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')

In case you didn't know, the with statement for files is a shorthand:

with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()

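【Comments】:

Superb, lots of prints to understand what's going on, and it works perfectly. Thanks! +1

Heads up to anyone as dumb/forgetful as me... this code sample writes a txt file to each directory. Glad I tested it in a version-controlled folder, though everything I need to write a cleanup script is here too :)

That second (longest) code snippet worked very well and saved me a lot of boring work.

Since speed is obviously the most important aspect: os.walk is not bad, though I came up with an even faster way via os.scandir. All glob solutions are a lot slower than walk and scandir. My function, as well as a complete speed analysis, can be found here: ***.com/a/59803793/2441026

【Answer 2】:

If you are using Python 3.5 or above, you can get this done in 1 line.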

import glob

# root_dir needs a trailing slash (i.e. /root/dir/)
for filename in glob.iglob(root_dir + '**/*.txt', recursive=True):
     print(filename)

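As mentioned in the documentation:

If recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories.

If you want every file, you can use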

import glob

for filename in glob.iglob(root_dir + '**/**', recursive=True):
     print(filename)
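The trailing-slash requirement noted in the code comment above can probably be sidestepped by building the pattern with os.path.join, which inserts the separator itself. The following is only a minimal sketch under that assumption; iter_text_files is a hypothetical helper name, not part of the glob module.

```python
import glob
import os


def iter_text_files(root_dir):
    """Return an iterator over every .txt path under root_dir (Python 3.5+).

    iter_text_files is a hypothetical helper name used for illustration.
    """
    # os.path.join adds the separator, so root_dir may be passed with or
    # without a trailing slash.
    pattern = os.path.join(root_dir, '**', '*.txt')
    return glob.iglob(pattern, recursive=True)
```

Calling iter_text_files('/root/dir') and iter_text_files('/root/dir/') should then list the same files.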

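【Comments】:

As said at the start, this only works with Python 3.5+.

root_dir must have a trailing slash (otherwise you get something like 'folder**/*' instead of 'folder/**/*' as the first argument). You can use os.path.join(root_dir, '*/'), but I don't know whether it is acceptable to use os.path.join with wildcard paths (it works for my application though).

@ChillarAnand Can you please add a comment to the code in this answer saying that root_dir needs a trailing slash? This will save people time (or at least it would have saved me time). Thanks.

If I run it as in the answer, it does not work recursively. To make it work recursively I had to change it to: glob.iglob(root_dir + '**/**', recursive=True). I'm using Python 3.8.2.

Be aware that glob.glob does not match dotfiles. You can use pathlib.glob instead.

【Answer 3】:

Agree with Dave Webb, os.walk will yield an item for each directory in the tree. In fact, you just don't have to care about subFolders.

Code like this should work: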

import os
import sys

rootdir = sys.argv[1]

for folder, subs, files in os.walk(rootdir):
    with open(os.path.join(folder, 'python-outfile.txt'), 'w') as dest:
        for filename in files:
            with open(os.path.join(folder, filename), 'r') as src:
                dest.write(src.read())

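【Comments】:

Nice one. This works as well. I do however prefer AndiDog's version even though it's longer, because it is easier to understand as a beginner to Python. +1

【Answer 4】:

TL;DR: This is the equivalent of find -type f, going over all files in all folders below and including the current one: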

for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))

As already mentioned in other answers, os.walk() is the answer, but it could be explained better. It's quite simple! Let's walk through this tree:

docs/
└── doc1.odt
pics/
todo.txt

With this code:

for currentpath, folders, files in os.walk('.'):
    print(currentpath)

currentpath is the current folder it is looking at. This will output:

.
./docs
./pics

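So it loops three times, because there are three folders: the current one, docs, and pics. In every loop, it fills the variables folders and files with the folders and files found directly in currentpath. Let's show them: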

for currentpath, folders, files in os.walk('.'):
    print(currentpath, folders, files)

This shows us:

# currentpath  folders           files
.              ['pics', 'docs']  ['todo.txt']
./pics         []                []
./docs         []                ['doc1.odt']

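So in the first line, we see that we are in folder ., that it contains two folders, namely pics and docs, and that there is one file, namely todo.txt. You don't have to do anything to recurse into those folders, because as you can see, it recurses automatically and gives you the files in any subfolders, and in any subfolders of those (though we don't have any in the example).

If you just want to loop through all files, the equivalent of find -type f, you can do this: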

for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))

This outputs:

./todo.txt
./docs/doc1.odt
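To try the walkthrough without touching real data, the example tree can be recreated in a temporary directory. This is a minimal sketch; build_example_tree and list_files are hypothetical helper names introduced only for this illustration, and paths are reported relative to the top folder rather than with a leading ./.

```python
import os
import tempfile


def build_example_tree(top):
    """Create docs/doc1.odt, an empty pics/ folder and todo.txt under top."""
    os.makedirs(os.path.join(top, 'docs'))
    os.makedirs(os.path.join(top, 'pics'))
    for relative in (os.path.join('docs', 'doc1.odt'), 'todo.txt'):
        with open(os.path.join(top, relative), 'w') as handle:
            handle.write('placeholder\n')


def list_files(top):
    """Return all non-directory entries below top, relative to top, sorted."""
    found = []
    for currentpath, folders, files in os.walk(top):
        for name in files:
            full_path = os.path.join(currentpath, name)
            found.append(os.path.relpath(full_path, top))
    return sorted(found)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as top:
        build_example_tree(top)
        for path in list_files(top):
            print(path)
```

The empty pics folder is visited by os.walk but contributes nothing to the result, which matches the output listed above.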

【Comments】:

【Answer 5】:

The pathlib library is really great for working with files. You can do a recursive glob on a Path object like so.

from pathlib import Path

for elem in Path('/path/to/my/files').rglob('*.*'):
    print(elem)
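One caveat with the pattern '*.*' is that it only matches names containing a dot, so a file without an extension is skipped while a directory with a dot in its name is included. A minimal sketch of a variant that returns files only, assuming that is what is wanted; all_files is a hypothetical helper name:

```python
from pathlib import Path


def all_files(top):
    """Return every file below top, including names without an extension."""
    # rglob('*') behaves like glob('**/*'); is_file() drops the directories.
    return sorted(p for p in Path(top).rglob('*') if p.is_file())
```

Each element is a Path object; wrap it in str() if plain strings are needed.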

【Comments】:

【Answer 6】:

import glob
import os

root_dir = 'path/to/root/'  # placeholder: put your root directory here, with a trailing slash

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    if os.path.isfile(filename):
        with open(filename,'r') as file:
            print(file.read())

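**/** is used to get all entries recursively, including directories.

if os.path.isfile(filename) checks whether the filename variable refers to a file or a directory; if it is a file, we can read it. Here I am printing the file's contents.

【Comments】:

【Answer 7】:

If you want a flat list of all paths under a given directory (like find . in the shell):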

   files = [ 
       os.path.join(parent, name)
       for (parent, subdirs, files) in os.walk(YOUR_DIRECTORY)
       for name in files + subdirs
   ]

To include only full paths to files under the base directory, leave out + subdirs.
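Both variants can be folded into one function. This is only a sketch; flat_listing and its include_dirs parameter are hypothetical names chosen for illustration.

```python
import os


def flat_listing(top, include_dirs=True):
    """Return a flat list of paths below top.

    With include_dirs=False only non-directory entries are returned.
    """
    return [
        os.path.join(parent, name)
        for parent, subdirs, files in os.walk(top)
        for name in (files + subdirs if include_dirs else files)
    ]
```

Unlike find ., the result does not contain the top directory itself.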

【Comments】:

【Answer 8】:

I found the following to be the easiest:

from glob import glob
import os

files = [f for f in glob('rootdir/**', recursive=True) if os.path.isfile(f)]

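Using glob('some/path/**', recursive=True) gets all files, but also includes directory names. Adding the if os.path.isfile(f) condition filters this list down to existing files only.

【Comments】:

【Answer 9】:

Use os.path.join() to construct your paths - it's neater: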

import os
import sys
rootdir = sys.argv[1]
for root, subFolders, files in os.walk(rootdir):
    for folder in subFolders:
        outfileName = os.path.join(root,folder,"py-outfile.txt")
        folderOut = open( outfileName, 'w' )
        print "outfileName is " + outfileName
        for file in files:
            filePath = os.path.join(root,file)
            toWrite = open( filePath).read()
            print "Writing '" + toWrite + "' to" + filePath
            folderOut.write( toWrite )
        folderOut.close()

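【Comments】:

It looks like this code only works for folders 2 levels deep (or deeper). Still, it does get me closer.

【Answer 10】:

os.walk does a recursive walk by default. For each directory, starting from the root, it yields a 3-tuple (dirpath, dirnames, filenames).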

from os import walk
from os.path import splitext, join

def select_files(root, files):
    """
    simple logic here to filter out interesting files
    .py files in this example
    """

    selected_files = []

    for file in files:
        #do concatenation here to get full path 
        full_path = join(root, file)
        ext = splitext(file)[1]

        if ext == ".py":
            selected_files.append(full_path)

    return selected_files

def build_recursive_dir_tree(path):
    """
    path    -    where to begin folder scan
    """
    selected_files = []

    for root, dirs, files in walk(path):
        selected_files += select_files(root, files)

    return selected_files

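【Comments】:

In Python 2.6 walk() does return a recursive list. I tried your code and got a list with many repeats... If you just remove the lines under the comment "# recursive calls on subfolders" - it works fine.

【Answer 11】:

In my opinion, os.walk() is a bit too complicated and verbose. You can clean up the accepted answer with: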

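import pathlib

# dir_path (the folder to scan) and outfile (the file to write) must be defined by the caller.
all_files = [str(f) for f in pathlib.Path(dir_path).glob("**/*") if f.is_file()]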

with open(outfile, 'wb') as fout:
    for f in all_files:
        with open(f, 'rb') as fin:
            fout.write(fin.read())
            fout.write(b'\n')

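【Comments】:

【Answer 12】:

I think the problem is that you are not processing the output of os.walk correctly.

First, change:

filePath = rootdir + '/' + file

to:

filePath = root + '/' + file

rootdir is your fixed starting directory; root is a directory returned by os.walk.

Second, you don't need to indent your file-processing loop, as it makes no sense to run it once per subdirectory. You will get root set to each subdirectory in turn. You don't need to process the subdirectories by hand unless you want to do something with the directories themselves.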
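Putting both changes together, the script from the question might look roughly like the sketch below. It assumes the files are plain text in the default encoding and that one output file per directory is wanted; concatenate_per_directory and OUTPUT_NAME are hypothetical names, and skipping the output file on re-runs is an addition not discussed in the answer.

```python
import os

OUTPUT_NAME = 'py-outfile.txt'  # same output file name as in the question


def concatenate_per_directory(rootdir):
    """Write one OUTPUT_NAME per directory, holding that directory's files."""
    for root, subdirs, files in os.walk(rootdir):
        out_path = os.path.join(root, OUTPUT_NAME)
        with open(out_path, 'w') as folder_out:
            for name in files:
                if name == OUTPUT_NAME:
                    # Skip output left over from an earlier run (assumption).
                    continue
                # Join with root (the directory being walked), not rootdir.
                with open(os.path.join(root, name), 'r') as source:
                    folder_out.write(source.read())
```

The file loop sits directly under the os.walk loop, so it runs once per visited directory instead of once per subfolder.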

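【Comments】:

I have data in each subdirectory, so I need a separate text file for the contents of each directory.

@Brock: the files part is the list of files in the current directory. So the indentation is indeed wrong. You are writing to filePath = rootdir + '/' + file, which doesn't sound right: file comes from the list of current files, so you are writing to a lot of existing files?

【Answer 13】:

Try this: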

import os
import sys

for root, subdirs, files in os.walk(path):

    for file in os.listdir(root):

        filePath = os.path.join(root, file)

        if os.path.isdir(filePath):
            pass

        else:
            f = open (filePath, 'r')
            # Do Stuff

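【Comments】:

Why would you do another listdir() and then isdir() when walk() has already split the directory listing into files and directories? This looks like it would be rather slow in large trees (three system calls instead of one: 1=walk, 2=listdir, 3=isdir, instead of just walking and looping through 'subdirs' and 'files').

【Answer 14】:

If you prefer an (almost) one-liner: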

from pathlib import Path

lookuppath = '.' #use your path
filelist = [str(item) for item in Path(lookuppath).glob("**/*") if Path(item).is_file()]

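In this case you get a list containing only the paths of all files located recursively under lookuppath. Without str(), each entry would be a PosixPath() object instead of a plain string.

【Comments】:

【Answer 15】:

If just the file names are not enough, it's easy to implement a depth-first search on top of os.scandir():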

stack = ['.']
files = []
total_size = 0
while stack:
    dirname = stack.pop()
    with os.scandir(dirname) as it:
        for e in it:
            if e.is_dir(): 
                stack.append(e.path)
            else:
                size = e.stat().st_size
                files.append((e.path, size))
                total_size += size

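The docs have this to say:

The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.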
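The loop above can be wrapped in a reusable function. The sketch below is one possible variant, not the answer's exact code: files_with_sizes is a hypothetical name, and passing follow_symlinks=False is an added assumption so that a symlink pointing back up the tree is recorded as an entry rather than followed.

```python
import os


def files_with_sizes(top='.'):
    """Return ([(path, size), ...], total_size) for all files below top."""
    stack = [top]
    files = []
    total_size = 0
    while stack:
        dirname = stack.pop()
        with os.scandir(dirname) as entries:
            for entry in entries:
                # Do not descend into symlinked directories (assumption).
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
                    files.append((entry.path, size))
                    total_size += size
    return files, total_size
```

Using os.scandir as a context manager requires Python 3.6 or later.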

【Comments】:

【Answer 16】:

This worked for me:

import glob

root_dir = "C:\\Users\\Scott\\" # Don't forget trailing (last) slashes    
for filename in glob.iglob(root_dir + '**/*.jpg', recursive=True):
     print(filename)
     # do stuff

【Comments】:
