Python: remove square brackets and the extraneous information between them

Posted 2023-03-11

【Title】Python: remove square brackets and the extraneous information between them 【Posted】2020-08-21 14:38:57 【Question】:

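I am trying to process a file and need to remove extraneous information from it. Specifically, I want to remove the square brackets [], including the text inside each pair and everything between the [ ] blocks. In other words, everything from the first block to the last, the blocks themselves included, should be dropped, while everything outside them is printed.

Below is my text file with sample data: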

$ cat smb
Hi this is my config file.
Please dont delete it

[homes]
  browseable                     = No
  comment                        = Your Home
  create mode                    = 0640
  csc policy                     = disable
  directory mask                 = 0750
  public                         = No
  writeable                      = Yes

[proj]
  browseable                     = Yes
  comment                        = Project directories
  csc policy                     = disable
  path                           = /proj
  public                         = No
  writeable                      = Yes

[]

This last second line.
End of the line.

Expected output:

Hi this is my config file.
Please dont delete it
This last second line.
End of the line.

What I have tried, based on my understanding and research:

$ cat test.py
with open("smb", "r") as file:
  for line in file:
    start = line.find( '[' )
    end = line.find( ']' )
    if start != -1 and end != -1:
      result = line[start+1:end]
      print(result)

Output:

$ ./test.py
   homes
   proj

【Discussion】:

【Answer 1】:

With a single regular expression:

import re

with open("smb", "r") as f: 
    txt = f.read()
    txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '', txt, flags=re.DOTALL)

print(txt)

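Regex explanation:

(\n\[) matches a newline followed by a [

(\[]\n) matches [] followed by a newline

(.*?) lazily matches everything between (\n\[) and (\[]\n), so that re.sub removes it

re.DOTALL makes . match newline characters as well, which lets (.*?) span multiple lines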
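
To see what re.DOTALL contributes, here is a minimal sketch that applies the same pattern to an in-memory string, once without the flag and once with it. The sample string is a shortened stand-in for the smb file and is an assumption made for illustration.

```python
import re

# Shortened stand-in for the smb file (an assumption for illustration).
sample = (
    "Hi this is my config file.\n"
    "Please dont delete it\n"
    "\n"
    "[homes]\n"
    "  public = No\n"
    "\n"
    "[]\n"
    "\n"
    "This last second line.\n"
)

pattern = r'(\n\[)(.*?)(\[]\n)'

# Without DOTALL, '.' stops at newlines, so the multi-line block never matches.
without_flag = re.sub(pattern, '', sample)

# With DOTALL, '.' also matches '\n', so the match spans the whole block.
with_flag = re.sub(pattern, '', sample, flags=re.DOTALL)

print(without_flag == sample)
print(with_flag)
```

Note that the blank line that follows the [] marker is left in place, because the pattern stops right after the newline that ends the [] line.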

Update: pandas

The same logic and the same solution can be applied with pandas:

import re
import pandas as pd

# read each line in the file (one row -> one line)
# note: newer pandas releases may reject sep='\n' with a ValueError
txt = pd.read_csv('smb',  sep = '\n', header=None)
# join all the lines in the file, separating them with '\n'
txt = '\n'.join(txt[0].to_list())
# apply the regex to clean the text (the same as above)
txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '\n', txt, flags=re.DOTALL)

print(txt)

【Discussion】:

【Answer 2】:

Read the file into a string,

extract = '''Hi this is my config file.
Please dont delete it

[homes]
  browseable                     = No
  comment                        = Your Home
  create mode                    = 0640
  csc policy                     = disable
  directory mask                 = 0750
  public                         = No
  writeable                      = Yes

[proj]
  browseable                     = Yes
  comment                        = Project directories
  csc policy                     = disable
  path                           = /proj
  public                         = No
  writeable                      = Yes

[]

This last second line.
End of the line.
'''.split('\n[')[0][:-1]

which gives you,

Hi this is my config file.
Please dont delete it

.split('\n[') splits the string at each occurrence of the character sequence '\n[', and [0] selects the description lines above the first block.

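with open("smb", "r") as f:
    extract = f.read()
    # everything after the last ']\n' is the trailing text
    tail = extract.split(']\n')
    # text before the first '\n[' (minus its trailing newline) plus the tail
    extract = extract.split('\n[')[0][:-1] + tail[-1]
    print(extract)

will read the file and print,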

Hi this is my config file.
Please dont delete it
This last second line.
End of the line.

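【Discussion】:

Thanks Akash, but how do I read the file and then process it? Could you show that?

with open("smb", "r") as f: extract = f.read().split('\n[')[0][:-1] does the trick; I added the [:-1] to remove the trailing whitespace. smb is the file containing the input. If this answers your question, please consider accepting my answer. Thanks!

@kulfi, please read the updated code above carefully! Thanks!

@kulfi, could you give me an example?

I have edited my smb file in the post, thanks for your help.

【Answer 3】:

Since you tagged pandas, let's try it: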

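import numpy as np
import pandas as pd

# a separator that never occurs in the file, so each line becomes one row
df = pd.read_csv('smb', sep='----', header=None, engine='python')

# mark rows that start with `[`
s = df[0].str.startswith('[')

# drop the rows from the first `[` line through the last one
df = df.drop(np.arange(s.idxmax(),s[::-1].idxmax()+1))

# write to file if needed
df.to_csv('clean.txt', header=None, index=None)

Output (df):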

                             0
0   Hi this is my config file.
1        Please dont delete it
18      This last second line.
19            End of the line.

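【Discussion】:

【Answer 4】:

You can iterate over the lines of the file and collect them in a list until you reach a line enclosed in square brackets, then join the collected lines back together: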

with open("smb", "r") as f:
    result = []
    for line in f:
        if line.startswith("[") and line.endswith("]"):
            break
        result.append(line)
    result = "\n".join(result)
    print(result)
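
The loop above compares raw lines, which still end in a newline, so line.endswith("]") is never true and nothing is skipped. Below is a minimal sketch of the same line-by-line idea, assuming the span to drop starts at the first bracketed line and ends at a line containing only []. The function name drop_bracket_block and the shortened sample are hypothetical, and blank lines are dropped as well to match the expected output.

```python
def drop_bracket_block(lines):
    """Return the lines outside the span from the first [..] line to the [] line."""
    kept = []
    inside = False  # True while we are between a "[...]" line and the "[]" marker
    for raw in lines:
        line = raw.strip()
        if inside:
            if line == "[]":
                inside = False  # end marker reached; resume keeping lines
            continue
        if line.startswith("[") and line.endswith("]"):
            # A bare [] with no open block is simply dropped.
            inside = line != "[]"
            continue
        if line:  # skip blank lines, as in the expected output
            kept.append(raw.rstrip("\n"))
    return kept


# Shortened stand-in for the smb file (an assumption for illustration).
sample = """Hi this is my config file.
Please dont delete it

[homes]
  public = No

[proj]
  path = /proj

[]

This last second line.
End of the line.
""".splitlines()

print("\n".join(drop_bracket_block(sample)))
```

With a real file, the open file object can be passed directly, since iterating over it yields lines.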

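【Discussion】:

Thanks Andrei, but it prints the same file contents, including the parts that are not wanted.

【Answer 5】:

If I understand you correctly, you want everything before the first [ and after the last ]. If that is not the case, let me know and I will change the answer.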

with open("smb", "r") as f: 
    s = f.read()
    head = s[:s.find('[')]
    tail = s[s.rfind(']') + 1:]
    return head.strip("\n") + "\n" + tail.strip("\n") # removing \n

This gives you the desired output.
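
Because return is only valid inside a function, one way to run the snippet above is to wrap it in one. This is a sketch under that assumption: the function name outside_brackets is hypothetical, and it takes the text as a string so it can be tried without the smb file.

```python
def outside_brackets(text):
    """Keep what precedes the first '[' and what follows the last ']'."""
    first = text.find("[")
    last = text.rfind("]")
    if first == -1 or last == -1:
        return text  # no brackets found: nothing to remove
    head = text[:first]
    tail = text[last + 1:]
    # strip the surrounding newlines and join the two parts with a single one
    return head.strip("\n") + "\n" + tail.strip("\n")


# Shortened stand-in for the smb file (an assumption for illustration).
sample = (
    "Hi this is my config file.\n"
    "Please dont delete it\n"
    "\n"
    "[homes]\n"
    "  public = No\n"
    "\n"
    "[]\n"
    "\n"
    "This last second line.\n"
    "End of the line.\n"
)

print(outside_brackets(sample))
```

For the real file, the call would be outside_brackets(f.read()) inside the with open("smb", "r") as f: block.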

【Discussion】:

Traceback (most recent call last):
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
    start(fakepyfile,mainpyfile)
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
    exec(open(mainpyfile).read(),  __main__.__dict__)
  File "<string>", line 5
SyntaxError: 'return' outside function

[Program finished]

It works well when used inside a function.

@Subham Yes, "return" only works inside a function. If you want to use it outside a function, you can replace return with print.

【Answer 6】:

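Another option is to first match a line with square brackets, such as [homes], and then match every following line that does not consist solely of [], since that is the end marker.

You can get the match without using (?s) or re.DOTALL, which avoids unnecessary backtracking, and then replace the match with an empty string.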

^\s*\[[^][]*\](?:\r?\n(?![^\S\r\n]*\[]$).*)*\r?\n[^\S\r\n]*\[]$\s*

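Explanation

^ start of line
\s* match 0+ whitespace characters
\[[^][]*\] match [, then 0+ characters other than [ or ], then ]
(?: non-capturing group
  \r?\n match a newline
  (?! negative lookahead, assert that what is to the right is not
    [^\S\r\n]*\[]$ 0+ whitespace characters other than newlines, then [] at the end of the line
  ) close the lookahead
  .* match any character except a newline 0+ times
)* close the non-capturing group and repeat it 0+ times
\r?\n match a newline
[^\S\r\n]* match 0+ whitespace characters other than newlines
\[]$ match [] and assert the end of the line
\s* match 0+ whitespace characters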

Code example

import re

regex = r"^\s*\[[^][]*\](?:\r?\n(?![^\S\r\n]*\[]$).*)*\r?\n[^\S\r\n]*\[]$\s*"

with open("smb", "r") as file:
    data = file.read()
    result = re.sub(regex, "", data, flags=re.MULTILINE)
    print(result)

Output

Hi this is my config file.
Please dont delete it
This last second line.
End of the line.

【Discussion】:

【Answer 7】:

You can test this on Regex101:

(^\W)+?\[[\w\W]+?\[\](\W)+?(\w)

The code would look like this:

import re

result = re.sub(
    r"(^\W)+?\[[\w\W]+?\[\](\W)+?(\w)",  # the regex
    r"\3",         # substitution: the final (\w) consumes the first kept character, so put it back
    input_string,  # the string in which to replace
    flags=re.MULTILINE,
)

Cheers

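【Discussion】:

【Answer 8】:

Since you tagged pandas and stated that the text sits before and after the square brackets, we can use str.contains and a boolean filter to drop the rows between the first and the last square bracket.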

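import pandas as pd

# your_file is the path to the input file, e.g. 'smb'
df = pd.read_csv(your_file,sep='\t',header=None)

# index labels of the rows that contain a `[`
idx = df[df[0].str.contains(r'\[')].index

df1 = df.loc[~df.index.isin(range(idx[0],idx[-1] + 1))]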

                             0
0   Hi this is my config file.
1        Please dont delete it
18      This last second line.
19            End of the line.

【Discussion】:

【Answer 9】:

Using pandas:

import pandas as pd

df = pd.read_csv('smb.txt', sep='----', header=None, engine='python',names=["text"])

res = df.loc[~df.text.str.contains(r"=|\[.*\]")]
print(res)
text
0   Hi this is my config file.
1   Please dont delete it
18  This last second line.
19  End of the line.

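Explanation: exclude lines that contain = or that contain an opening bracket, optionally followed by any characters (.*), and then a closing bracket (]). The backslash (\) tells the regex engine not to treat the bracket as a special character.

In plain Python, using the same regex pattern, with one extra line to handle empty entries: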

import re
with open('smb.txt') as myfile:
    content = myfile.readlines()
    pattern = re.compile(r"=|\[.*\]")
    res = [ent.strip() for ent in content if not pattern.search(ent) ]
    res = [ent for ent in res if ent != ""]
    print(res)
['Hi this is my config file.',
 'Please dont delete it',
 'This last second line.', 
 'End of the line.']

【Discussion】:

【Answer 10】:

Your indexing is wrong. Other than that, the code looks fine.

Try:

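start = None   # offset of the first "[" in the accumulated text
end = None     # offset of the last "]" in the accumulated text
targ = ""
with open("smb", "r") as file:
    for line in file:
        # line.index is relative to the line, so add the length read so far
        if start is None and "[" in line:
            start = len(targ) + line.index("[")
        if "]" in line:
            end = len(targ) + line.rindex("]")
        targ = targ + line

# cut out everything from the first "[" through the last "]"
if start is not None and end is not None:
    targ = targ[:start] + targ[end + 1:]

This should work. Let me know if there are any problems. :)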

【Discussion】:

Where is targ defined?

【Answer 11】:

This is probably one of the cleanest ways to do it.

import re
from pathlib import Path
res = '\n'.join(re.findall(r'^\w.*', Path('smb').read_text(), flags=re.M))

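Explanation:

Path creates a Path object for the file. Path.read_text() opens the file, reads its text, and closes the file. The file contents are passed to re.findall, which uses the re.M flag to check every line in the file against the pattern '^\w.*', which only accepts lines that start with a word character. This eliminates lines that start with whitespace or a bracket.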
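
One consequence of keeping only lines that start with a word character is that a line outside the bracket block is dropped too if it begins with whitespace or punctuation. A small sketch on an in-memory string shows this; the sample text, including the indented note, is made up for illustration.

```python
import re

# Made-up sample: the second line is outside any block but is indented.
sample = (
    "Hi this is my config file.\n"
    "  indented note outside any block\n"
    "\n"
    "[homes]\n"
    "  public = No\n"
    "\n"
    "[]\n"
    "\n"
    "End of the line.\n"
)

# Same pattern as above: keep lines whose first character is a word character.
kept = re.findall(r'^\w.*', sample, flags=re.M)
print(kept)
```

The reverse also holds: an unindented key = value line inside a block would be kept, so this approach relies on the block contents being indented.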

【Discussion】:

【Answer 12】:

Try replacing r"(?s)\s*\[[^\[\]]*\](?:(?:(?!\[[^\[\]]*\]).)+\[[^\[\]]*\])*\s*" with r"\n"
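
As a sketch of how this pattern could be applied with re.sub, the following runs it on a shortened, in-memory stand-in for the smb file (the sample is an assumption for illustration). The pattern matches the first bracketed header, then repeatedly consumes text up to and including the next bracketed token, so the whole run of blocks is replaced by a single newline.

```python
import re

pattern = r"(?s)\s*\[[^\[\]]*\](?:(?:(?!\[[^\[\]]*\]).)+\[[^\[\]]*\])*\s*"

# Shortened stand-in for the smb file (an assumption for illustration).
sample = (
    "Hi this is my config file.\n"
    "Please dont delete it\n"
    "\n"
    "[homes]\n"
    "  public = No\n"
    "\n"
    "[proj]\n"
    "  path = /proj\n"
    "\n"
    "[]\n"
    "\n"
    "This last second line.\n"
    "End of the line.\n"
)

# r"\n" in the replacement is interpreted by re.sub as a newline character.
result = re.sub(pattern, r"\n", sample)
print(result)
```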

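【Discussion】:

In this case, taking what you want is easier than leaving out what you don't want: '\n'.join(re.findall(r'^\w.*', Path('smb').read_text(), flags=re.M))

.findall(r'^\w.* would apparently also pick up the body of the tags. But the OP gave no definition of the tags or of their opening and closing delimiters, so most answers are guesses, and the accepted one is the worst guess.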
