Ignore case in glob() on Linux

Posted 2023-02-23


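【Title】: Ignore case in glob() on Linux 【Published】: 2022-01-08 18:13:04 【Question】:

I'm writing a script that has to work on directories that are modified by hand by both Windows and Linux users. The Windows users tend not to care about case at all when naming files.

Is there a way to handle this on the Linux side in Python, i.e. can I get case-insensitive, glob-like behaviour?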

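【Comments on the question】:

【Answer 1】:

I just wanted a variant of this that is case-insensitive only when I specify a file extension; for example, I wanted ".jpg" and ".JPG" to be picked up the same way. Here is my variant: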

import re
import glob
import os
from fnmatch import translate as regexGlob
from platform import system as getOS

def linuxGlob(globPattern:str) -> frozenset:
    """
    Glob with a case-insensitive file extension
    """
    base = set(glob.glob(globPattern, recursive= True))
    maybeExt = os.path.splitext(os.path.basename(globPattern))[1][1:]
    caseChange = set()
    # Now only try the extended insensitivity if we've got a file extension
    if len(maybeExt) > 0 and getOS() != "Windows":
        rule = re.compile(regexGlob(globPattern), re.IGNORECASE)
        endIndex = globPattern.find("*")
        if endIndex == -1:
            endIndex = len(globPattern)
        crawl = os.path.join(os.path.dirname(globPattern[:endIndex]), "**", "*")
        checkSet = set(glob.glob(crawl, recursive= True)) - base
        caseChange = set([x for x in checkSet if rule.match(x)])
    return frozenset(base.union(caseChange))

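I didn't actually restrict the insensitivity to just the extension, because I'm lazy, but the room for confusion is pretty small (e.g. you would have to want to capture FOO.jpg and FOO.JPG but not foo.JPG or foo.jpg; if your paths are that pathological, you have other problems).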
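For readers who do want the insensitivity limited to the extension, the following is a minimal sketch rather than the answer author's code. It assumes the pattern ends in a plain extension with no bracket expressions of its own, and the names `loosen_extension` and `glob_ext_insensitive` are hypothetical. It rewrites each letter of the extension as a two-character class and leaves the rest of the pattern to ordinary glob matching.

```python
import glob
import os


def loosen_extension(pattern: str) -> str:
    """Rewrite only the extension of `pattern` so its letters ignore case."""
    root, ext = os.path.splitext(pattern)
    # ".jpg" becomes ".[jJ][pP][gG]"; non-letters (the dot, wildcards) are kept.
    loose_ext = "".join(
        "[%s%s]" % (c.lower(), c.upper()) if c.isalpha() else c for c in ext
    )
    return root + loose_ext


def glob_ext_insensitive(pattern: str, recursive: bool = False) -> list:
    """Glob with a case-insensitive extension and a case-sensitive stem."""
    return glob.glob(loosen_extension(pattern), recursive=recursive)
```

With this sketch, `loosen_extension("photos/*.jpg")` yields `photos/*.[jJ][pP][gG]`, so the directory and file stem are still matched case-sensitively on Linux.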

【Comments】:

【Answer 2】:

Here is a working example using fnmatch.translate():

from glob import glob
from pathlib import Path
import fnmatch, re


mask_str = '"*_*_yyww.TXT" | "*_yyww.TXT" | "*_*_yyww_*.TXT" | "*_yyww_*.TXT"'
masks_list = ["yyyy", "yy", "mmmmm", "mmm", "mm", "#d", "#w", "#m", "ww"]

for mask_item in masks_list:
    mask_str = mask_str.replace(mask_item, "*")

clean_quotes_and_spaces = mask_str.replace(" ", "").replace('"', '')
remove_double_star = clean_quotes_and_spaces.replace("**", "*")
masks = remove_double_star.split("|")

cwd = Path.cwd()

files = list(cwd.glob('*'))
print(files)

files_found = set()

for mask in masks:
    mask = re.compile(fnmatch.translate(mask), re.IGNORECASE)
    print(mask)

    for file in files:
        # Match only the file name, so underscores in parent directories cannot cause false hits
        if mask.match(file.name):
            files_found.add(file)

print(files_found)

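【Comments】:

【Answer 3】:

Borrowing from @Timothy C. Quinn's answer, this modification allows wildcards anywhere in the path. Admittedly, it is case-insensitive only for the glob_pat argument.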

import re
import os
import fnmatch
import glob

def find_files(path: str, glob_pat: str, ignore_case: bool = False):
    rule = re.compile(fnmatch.translate(glob_pat), re.IGNORECASE) if ignore_case \
            else re.compile(fnmatch.translate(glob_pat))
    return [n for n in glob.glob(os.path.join(path, '*')) if rule.match(n)]

【Comments】:

【Answer 4】:

Here is my non-recursive file search with glob-like behaviour, for Python 3.5+:

import fnmatch
import os
import re

# Eg: find_files('~/Downloads', '*.Xls', ignore_case=True)
def find_files(path: str, glob_pat: str, ignore_case: bool = False):
    rule = re.compile(fnmatch.translate(glob_pat), re.IGNORECASE) if ignore_case \
            else re.compile(fnmatch.translate(glob_pat))
    return [n for n in os.listdir(os.path.expanduser(path)) if rule.match(n)]

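Note: this version handles home directory expansion.

【Comments】:

This approach works, but wildcards can't be used anywhere in the path, only in the file name.

@MatthewSnyder - Thanks. When I have time, I'll update it to handle wildcards in the path.

【Answer 5】:

You can replace each alphabetic character c with [cC], via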

import glob
def insensitive_glob(pattern):
    def either(c):
        return '[%s%s]' % (c.lower(), c.upper()) if c.isalpha() else c
    return glob.glob(''.join(map(either, pattern)))

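【Comments】:

To be a bit more pythonic... or at least to keep pylint happy: return glob.glob(''.join(either(char) for char in pattern))

shao.lo: Yes, that does have the advantage of being longer.

This solution has serious drawbacks, so be careful. First, glob() fails with such a pattern on Windows drive letters. The same applies to "magic" folders such as the "sysnative" folder.

This looks hacky, but it gets the job done, doesn't it. It works fine for me.

@GeoffreyIrving Ha, and slower... (shown for a pattern of size 1k)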

In[10] %timeit ''.join(either(char) for char in pattern)
392 µs ± 5.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In[11] %timeit ''.join(map(either, pattern))
358 µs ± 7.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
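One possible way around the drive-letter concern raised in the comments above is to leave the drive part of the pattern untouched. This is only a sketch and has not been verified on Windows: the helper name `insensitive_pattern` and its `splitdrive` parameter are assumptions introduced for illustration, and it does nothing about the "magic" folders mentioned above or about patterns that already contain bracket expressions.

```python
import glob
import os


def insensitive_pattern(pattern, splitdrive=os.path.splitdrive):
    """Return `pattern` with each letter c outside the drive replaced by [cC]."""
    # On POSIX the drive is always '', so the whole pattern is rewritten.
    drive, rest = splitdrive(pattern)

    def either(c):
        return "[%s%s]" % (c.lower(), c.upper()) if c.isalpha() else c

    return drive + "".join(map(either, rest))


def insensitive_glob(pattern):
    """Glob using the rewritten, case-insensitive pattern."""
    return glob.glob(insensitive_pattern(pattern))
```

The `splitdrive` argument exists only so the rewriting can be exercised with `ntpath.splitdrive` on a non-Windows machine; by default the platform's own `os.path.splitdrive` is used.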

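【Answer 6】:

Depending on your situation, you can apply .lower() to both the file pattern and the directory listing results, and only then compare the pattern against the file names.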
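The idea above could look roughly like the sketch below. The function name `listdir_lower` is hypothetical, and the choice of `fnmatch.fnmatchcase` for the comparison is an assumption, since the answer does not say which matcher to use.

```python
import fnmatch
import os


def listdir_lower(path, pattern):
    """Names in `path` matching `pattern`, compared after lowercasing both."""
    lowered = pattern.lower()
    # fnmatchcase never normalizes case on its own, so the comparison is
    # controlled entirely by the explicit .lower() calls.
    return [
        name for name in os.listdir(path)
        if fnmatch.fnmatchcase(name.lower(), lowered)
    ]
```

The returned names keep their original case; only the comparison is done on lowercased copies.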

【Comments】:

【Answer 7】:

Non-recursive

To retrieve the files (and only the files) of a directory "path", using "globexpression":

list_path = [i for i in os.listdir(path) if os.path.isfile(os.path.join(path, i))]
result = [os.path.join(path, j) for j in list_path if re.match(fnmatch.translate(globexpression), j, re.IGNORECASE)]

Recursive

With os.walk:

result = []
for root, dirs, files in os.walk(path, topdown=True):
  result += [os.path.join(root, j) for j in files \
             if re.match(fnmatch.translate(globexpression), j, re.IGNORECASE)]

It is also better to compile the regular expression, so instead of

re.match(fnmatch.translate(globexpression)

do (before the loop):

reg_expr = re.compile(fnmatch.translate(globexpression), re.IGNORECASE)

and then replace in the loop:

  result += [os.path.join(root, j) for j in files if re.match(reg_expr, j)]
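Put together, the recursive variant with a precompiled pattern might look like the sketch below. The generator name `walk_files` is an assumption; `path` and `globexpression` follow the names used in the answer.

```python
import fnmatch
import os
import re


def walk_files(path, globexpression):
    """Yield files under `path` whose name matches the glob, ignoring case."""
    # Compile once, before the loop, instead of translating for every file.
    reg_expr = re.compile(fnmatch.translate(globexpression), re.IGNORECASE)
    for root, dirs, files in os.walk(path, topdown=True):
        for name in files:
            if reg_expr.match(name):
                yield os.path.join(root, name)
```

Because the pattern is matched against the bare file name, wildcards apply only to the last path component, as in the snippets above.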

【Comments】:

【Answer 8】:

Use case-insensitive regexes instead of glob patterns. fnmatch.translate generates a regex from a glob pattern, so

re.compile(fnmatch.translate(pattern), re.IGNORECASE)

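gives you a case-insensitive version of the glob pattern as a compiled RE.

Keep in mind that if the filesystem is hosted by a Linux box on a Unix-like filesystem, users will be able to create the files foo, Foo and FOO in the same directory.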
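As a small illustration of both points, the sketch below filters a plain list of names with the compiled pattern. The helper name `filter_ignore_case` and the sample names are made up for the example.

```python
import fnmatch
import re


def filter_ignore_case(names, pattern):
    """Return the names that match the glob `pattern`, ignoring case."""
    rule = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return [name for name in names if rule.match(name)]


# A single pattern can match several distinct files on a case-sensitive
# filesystem, so callers should be prepared for more than one hit.
print(filter_ignore_case(["foo", "Foo", "FOO", "bar"], "foo"))
```

The call above prints `['foo', 'Foo', 'FOO']`, which is exactly the situation the caveat describes.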

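【Comments】:

cool 8) Is there also a function that returns a list of matching file names, or do I have to cascade through os.listdir() by hand?

After fiddling with os.walk for 2 hours, I'm at a loss. Could you give some more advice? I'm having a hard time figuring out the loop around dirs, matching the re, and breaking appropriately. Probably not my day :(

@andreash: os.walk returns triples (basepath, dirs, files) s.t. you can get the relative path of a directory or file by joining it (os.path.join) with basepath. You can then try to match the result against your pattern.

I'll accept this answer since it gives a valid response. However, for speed reasons, I decided to go with a more tailored combination of os.walk and os.listdir.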
