glob exclude pattern

Posted 2023-02-21


【Title】glob exclude pattern 【Posted】2014-01-05 10:40:12 【Question】:

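I have a directory with a bunch of files in it: eee2314, asd3442 ... and eph.

I want to use the glob function to exclude all files whose names start with eph.

How can I do that?

【Question comments】: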

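【Answer 1】:

To exclude an exact word, you may want to implement a custom regex directive, which you strip from the pattern (replace with an empty string) before handing it to glob.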

#!/usr/bin/env python3
import glob
import re

# glob (or fnmatch) does not support exact word matching. This custom directive works around that.
glob_exact_match_regex = r"\[\^.*\]"
path = "[^exclude.py]*py"  # [^...] is a custom directive that excludes paths containing the exact string

# Process custom directive
try:  # Try to parse the exact-match directive
    exact_match = re.findall(glob_exact_match_regex, path)[0].replace('[^', '').replace(']', '')
except IndexError:
    exact_match = None
else:  # Remove custom directive
    path = re.sub(glob_exact_match_regex, "", path)
paths = glob.glob(path)
# Implement custom directive
if exact_match is not None:  # Exclude all paths with specified string
    paths = [p for p in paths if exact_match not in p]

print(paths)

【Comments】:

【Answer 2】:

Suppose you have this directory structure:

.
├── asd3442
├── eee2314
├── eph334
├── eph_dir
│   ├── asd330
│   ├── eph_file2
│   ├── exy123
│   └── file_with_eph
├── eph_file
├── not_eph_dir
│   ├── ephXXX
│   └── with_eph
└── not_eph_rest

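You can filter full-path glob results with pathlib and a generator over the top-level directory:

from pathlib import Path  # path_to is the directory to list, /tmp/test in the output below
i_want=(fn for fn in Path(path_to).glob('*') if not fn.match('**/*/eph*'))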

>>> list(i_want)
[PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/not_eph_dir')]

The pathlib method match uses glob syntax to match a path object; the glob '**/*/eph*' matches any full path that leads to a file whose name starts with 'eph'.
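As a minimal sketch of how match behaves (the sample paths are taken from the tree above, and PurePosixPath is used here only so that nothing has to exist on disk): a relative pattern is matched from the right, so a bare 'eph*' already tests just the final path component.

```python
from pathlib import PurePosixPath

samples = [
    "/tmp/test/eee2314",
    "/tmp/test/eph334",
    "/tmp/test/eph_dir/asd330",
    "/tmp/test/not_eph_dir/ephXXX",
]

# A relative pattern is matched against the end of the path, so 'eph*'
# only looks at the last component (the file or directory name).
starts_with_eph = [s for s in samples if PurePosixPath(s).match("eph*")]
```

Note that eph_dir/asd330 is not matched, because only its parent directory starts with 'eph'.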

Alternatively, you can use the .name attribute with name.startswith('eph'):

i_want=(fn for fn in Path(path_to).glob('*') if not fn.name.startswith('eph'))

If you want only files, not directories:

i_want=(fn for fn in Path(path_to).glob('*') if fn.is_file() and not fn.match('**/*/eph*'))
# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest')]

The same approach works for recursive globs:

i_want=(fn for fn in Path(path_to).glob('**/*') 
           if fn.is_file() and not fn.match('**/*/eph*'))

# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'),
#  PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/eph_dir/asd330'),
#  PosixPath('/tmp/test/eph_dir/file_with_eph'), PosixPath('/tmp/test/eph_dir/exy123'),
#  PosixPath('/tmp/test/not_eph_dir/with_eph')]

【Comments】:

【Answer 3】:

I recommend pathlib over glob. Filtering out a single pattern is very simple.

from pathlib import Path

p = Path(YOUR_PATH)
filtered = [x for x in p.glob("**/*") if not x.name.startswith("eph")]

If you want to filter on a more complex pattern, you can define a function for it, like this:

def not_in_pattern(x):
    return (not x.name.startswith("eph")) and not x.name.startswith("epi")


filtered = [x for x in p.glob("**/*") if not_in_pattern(x)]

With that code, you filter out every file whose name starts with eph or with epi.
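If the list of prefixes grows, one possible variation (a sketch only; EXCLUDED_PREFIXES and not_excluded are names made up for this example) relies on the fact that str.startswith also accepts a tuple of prefixes:

```python
from pathlib import PurePosixPath

# Hypothetical prefixes to exclude; adjust as needed.
EXCLUDED_PREFIXES = ("eph", "epi")


def not_excluded(path):
    """Return True when the final path component has none of the prefixes."""
    return not path.name.startswith(EXCLUDED_PREFIXES)


# Pure paths are used so the example runs without touching the file system.
paths = [PurePosixPath(p) for p in ("a/eph1", "a/epi2", "a/eee3", "eph_dir/ok")]
kept = [p for p in paths if not_excluded(p)]
```

The same predicate can be passed to the list comprehension over p.glob("**/*") shown above.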

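【Comments】:

【Answer 4】:

The pattern rules for glob are not regular expressions. Instead, they follow standard Unix path expansion rules. There are only a few special characters: two different wildcards, and character ranges are supported [from pymotw: glob – Filename pattern matching].

So you can exclude some files with a pattern. For example, to exclude manifest files (files whose names start with _) with glob, you can use: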

files = glob.glob('files_path/[!_]*')
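To see what a negated character class can and cannot do, here is a small sketch using fnmatch.fnmatchcase, which applies the same pattern syntax to plain strings (the file names are illustrative, not taken from a real directory). A class such as [!e] constrains exactly one character position, so chaining three of them does not mean "does not start with eph":

```python
import fnmatch

names = ["eee2314", "asd3442", "eph334", "_manifest"]

# [!_] matches any single character except '_', so this keeps every
# name that does not start with an underscore.
no_underscore = [n for n in names if fnmatch.fnmatchcase(n, "[!_]*")]

# Each class is checked independently, position by position, so
# 'eee2314' is rejected as soon as its first character is 'e'.
naive_not_eph = [n for n in names if fnmatch.fnmatchcase(n, "[!e][!p][!h]*")]
```

Here naive_not_eph drops eee2314 along with eph334, which is why a whole prefix is usually excluded by filtering the glob result instead.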

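【Comments】:

This has to be in the official documentation; could somebody please add it to docs.python.org/3.5/library/glob.html#glob.glob

Note that glob patterns cannot directly satisfy what the OP asked for: excluding only files that start with eph while allowing any other start. [!e][!p][!h] will also filter out files that start with eee, for example.

Note that if you are used to writing shell glob exclusions as [^_], this does not work in Python's glob. You must use !

@VitalyZdanevich It is in the documentation for fnmatch: docs.python.org/3/library/fnmatch.html#module-fnmatch

【Answer 5】:

If the position of the character does not matter, for example to exclude manifest files wherever the _ appears in the name, you can combine glob with re (regular expression operations):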

import glob
import re
for file in glob.glob('*.txt'):
    if re.match(r'.*\_.*', file):
        continue
    else:
        print(file)

Or, more elegantly, with a list comprehension:

filtered = [f for f in glob.glob('*.txt') if not re.match(r'.*\_.*', f)]

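for f in filtered:
    print(f)

【Comments】:

【Answer 6】:

How about skipping the particular files while iterating over all the files in the folder? The code below skips every Excel file whose name starts with 'eph':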

import glob
import re
for file in glob.glob('*.xlsx'):
    if re.match(r'eph.*\.xlsx', file):  # raw string, so \. is passed to re unchanged
        continue
    else:
        #do your stuff here
        print(file)

This way you can use more complex regex patterns to include or exclude a particular set of files in a folder.

【Comments】:

【Answer 7】:

You can subtract sets:

set(glob("*")) - set(glob("eph*"))
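A set difference loses the order of the results, so one possible wrapper (a sketch; glob_excluding and its parameters are names invented for this example) sorts the outcome and joins both patterns onto the same directory so that the paths compare equal:

```python
import glob
import os
import tempfile


def glob_excluding(directory, include="*", exclude="eph*"):
    """Return sorted paths matching `include` but not `exclude` in `directory`."""
    base = glob.escape(directory)  # protect special characters in the directory name
    included = set(glob.glob(os.path.join(base, include)))
    excluded = set(glob.glob(os.path.join(base, exclude)))
    return sorted(included - excluded)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("eee2314", "asd3442", "eph334"):
        open(os.path.join(tmp, name), "w").close()
    result = [os.path.basename(p) for p in glob_excluding(tmp)]
```

This still lists the directory twice, once per pattern, which is the cost raised in the comments below.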

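【Comments】:

Really interesting solution! But in my case it is going to be extremely slow, since it reads the directory twice. And if the folder is large and sits on a network drive, it will be slow again. Still, really handy.

Your operating system should cache file system requests, so it is not that bad :)

Tried this myself, and I just got TypeError: unsupported operand type(s) for -: 'list' and 'list'

@TomBusby Try converting them to sets: set(glob("*")) - set(glob("eph*")) (and note the * at the end of "eph*")

As a side note, glob returns lists, not sets, but this kind of operation only works on sets, which is why neutrinus cast it. If you need the result to remain a list, simply wrap the whole operation in a cast: list(set(glob("*")) - set(glob("eph*")))

【Answer 8】:

You cannot exclude patterns with the glob function; globs only allow inclusion patterns. Globbing syntax is very limited (even a [!..] character class must match a character, so it is an inclusion pattern for every character that is not in the class).

You have to do the filtering yourself; a list comprehension usually works nicely here: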

import os
from glob import glob

files = [fn for fn in glob('somepath/*.txt')
         if not os.path.basename(fn).startswith('eph')]
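If you would rather do the include test and the exclude test in a single pass over one directory, a rough sketch with os.scandir could look like the following (iter_files and its parameters are hypothetical names, and this covers only a single, non-recursive directory):

```python
import fnmatch
import os
import tempfile


def iter_files(directory, include="*.txt", excluded_prefix="eph"):
    """Yield paths in `directory` matching `include` whose names lack the prefix."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if (entry.is_file()
                    and fnmatch.fnmatch(entry.name, include)
                    and not entry.name.startswith(excluded_prefix)):
                yield entry.path


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "eph1.txt", "b.log"):
        open(os.path.join(tmp, name), "w").close()
    found = sorted(os.path.basename(p) for p in iter_files(tmp))
```

Each directory entry is examined once and matching paths are yielded lazily, so no intermediate list of all matches is built.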

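【Comments】:

Use iglob here to avoid storing the full list in memory.

@Hardex: Internally, iglob produces lists anyway; all you do is lazily evaluate the filter. It will not help to reduce the memory footprint.

@Hardex: If you use a glob in the directory name then you would have a point; at most one os.listdir() result is then kept in memory as you iterate. But somepath/*.txt has to read all filenames of one directory into memory and then reduce that list to only the ones that match.

You are right, it is not that important, but in stock CPython, glob.glob(x) = list(glob.iglob(x)). Not much of an overhead, but still good to know.

Doesn't this iterate twice? Once through the files to get the list, and a second time through the list itself? If so, is it not possible to do it in one iteration?

【Answer 9】:

As the accepted answer says, you cannot exclude patterns with glob, so the following is a way to filter your glob result.

The accepted answer is probably the best Pythonic way to do things, but if you think list comprehensions look a bit ugly and want to make your code maximally numpythonic anyway (like I did), then you can do this (but note that it is probably less efficient than the list comprehension approach):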

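import glob

import numpy as np

data_files = glob.glob("path_to_files/*.fits")

# Use the same directory prefix so the paths compare equal.
light_files = np.setdiff1d( data_files, glob.glob("path_to_files/*BIAS*"))
light_files = np.setdiff1d(light_files, glob.glob("path_to_files/*FLAT*"))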

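(In my case, I had image frames, bias frames, and flat frames all in one directory, and I only wanted the image frames.)

【Comments】:

【Answer 10】:

Late to the game, but you could also just apply a Python filter to the result of a glob: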

files = glob.iglob('your_path_here')
files_i_care_about = filter(lambda x: not x.startswith("eph"), files)

Or replace the lambda with an appropriate regex search, etc...

EDIT: I just realized that if you are using full paths, startswith will not work, so you need a regex:

In [10]: a
Out[10]: ['/some/path/foo', 'some/path/bar', 'some/path/eph_thing']

In [11]: list(filter(lambda x: not re.search('/eph', x), a))
Out[11]: ['/some/path/foo', 'some/path/bar']
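Another way to handle full paths, sketched here with made-up sample paths, is to test only the final path component with os.path.basename. Unlike a search for '/eph', this does not drop files that merely live inside a directory whose name starts with eph:

```python
import os

paths = [
    "/some/path/foo",
    "some/path/bar",
    "some/path/eph_thing",
    "/data/eph_dir/keep",  # the directory starts with 'eph', the file does not
]

# Test the file name only, so parent directories cannot trigger the exclusion.
kept = [p for p in paths if not os.path.basename(p).startswith("eph")]
```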

【Comments】:

【Answer 11】:

More generally, to exclude files that do not comply with some shell-style pattern, you can use the fnmatch module:

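import fnmatch
from glob import glob

# Keep only the paths that match the shell-style pattern.
# (Popping from file_list while enumerating it would skip entries.)
file_list = [ii for ii in glob('somepath')
             if fnmatch.fnmatch(ii, 'bash_regexp_with_exclude')]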

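The above first generates a list from the given path and then drops the files that do not satisfy the shell-style pattern carrying the desired constraint.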
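The opposite direction, dropping names that do match one of several exclusion patterns, can be sketched along the same lines (exclude_patterns is a hypothetical helper, and the names are the ones from the question):

```python
import fnmatch


def exclude_patterns(names, patterns):
    """Return the names that match none of the shell-style patterns."""
    return [
        name for name in names
        if not any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    ]


names = ["eee2314", "asd3442", "eph334", "eph_file"]
kept = exclude_patterns(names, ["eph*"])
```

fnmatchcase is used so the comparison is case-sensitive on every platform; fnmatch.fnmatch would follow the operating system's convention instead.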

【Comments】:
