Regex to delete Python comments [duplicate]

Posted: 2020-09-30 14:52:48

[Question]:

I want to delete all the comments in a Python file. The file looks like this:

--------------- comment.py ---------------

# this is comment line.
age = 18  # comment in line
msg1 = "I'm #1."  # comment. there's a # in code.
msg2 = 'you are #2. ' + 'He is #3'  # strange sign ' # ' in comment. 
print('Waiting your answer')

I have written several regular expressions to extract all of the comments, for example:

(?(?<=['"])(?<=['"])\s*#.*$|\s*#.*$)
get:  #1."  # comment. there's a # in code.

(?<=('|")[^\1]*\1)\s*#.*$|\s*#.*$
wrong: the lookbehind (?<=...) must be fixed-width, and this one is not

But they do not work correctly. What is the right regular expression? Can you help me?
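
To show what goes wrong concretely, here is a quick check with Python's re module (a minimal illustration, added here, using the msg1 line from comment.py above); a naive pattern such as #.*$ starts matching inside the string literal rather than at the real comment:

import re

line = 'msg1 = "I\'m #1."  # comment. there\'s a # in code.'
print(re.search(r'#.*$', line).group())
# prints:  #1."  # comment. there's a # in code.
# i.e. the match starts inside the string, not at the trailing comment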

[Comments]:

You are probably not going to write a parser that handles all of these edge cases correctly.

Parsing code with regular expressions is a bad idea; you will end up with a huge expression, and it will be really slow.

Thanks for the advice. I was about to give up today and just write \s*#[^'"]*$ for the general case. But Python IDLE handles every case; I wonder whether IDLE uses a regular expression?

[Answer 1]:

You can try tokenize instead of regex. As @OlvinRoght said, parsing code with a regular expression is probably a bad idea in this case. As you can see here, you can try something like this to detect the comments:

import tokenize
fileObj = open(r'yourpath\comment.py', 'r')  # raw string so the backslash in the (placeholder) path is not treated as an escape
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module 
    if toktype == tokenize.COMMENT:
        print('COMMENT' + " " + tok)

Output:

COMMENT # -*- coding: utf-8 -*-
COMMENT # this is comment line.
COMMENT # comment in line
COMMENT # comment. there's a # in code.
COMMENT # strange sign ' # ' in comment.

Then, to get the expected result, a Python file without the comments, you can try this:

nocomments = []
fileObj.seek(0)  # rewind: the handle was consumed by the comment-printing loop above
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    if toktype != tokenize.COMMENT:
        nocomments.append(tok)

print(' '.join(nocomments))

Output:

 age = 18 
 msg1 = "I'm #1." 
 msg2 = 'you are #2. ' + 'He is #3' 
 print ( 'Waiting your answer' )  
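
If you also want to keep the original layout rather than joining the tokens with spaces, a possible variant (a sketch that is not part of the original answer) is to filter the token stream and rebuild the source with tokenize.untokenize:

import tokenize

with open('comment.py') as f:  # the sample file from the question
    tokens = list(tokenize.generate_tokens(f.readline))
stripped = tokenize.untokenize(t for t in tokens if t.type != tokenize.COMMENT)
print(stripped)  # the same code, with the comments replaced by whitespace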

[Discussion]:

In this case tokenize works better than re.

[Answer 2]:

Credit: https://gist.github.com/BroHui/aca2b8e6e6bdf3cb4af4b246c9837fa3

This will do the job. It uses tokenization; you can adapt the code to your own use case.

""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.
    """
    source = open(fname)
    mod = open(fname + ",strip", "w")

    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0

    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

    # close both handles so the stripped output is fully written to disk
    source.close()
    mod.close()

if __name__ == '__main__':
    do_file("text.txt")

text.txt:

# this is comment line.
age = 18  # comment in line
msg1 = "I'm #1."  # comment. there's a # in code.
msg2 = 'you are #2. ' + 'He is #3'  # strange sign ' # ' in comment. 
print('Waiting your answer')

Output:

age = 18  

msg1 = "I'm #1."  

msg2 = 'you are #2. ' + 'He is #3'  

print('Waiting your answer')

Input:

msg1 = "I'm #1."  # comment. there's a # in code.  the regex#.*$ will match #1."  # comment. there's a # in code. . Right match shoud be # comment. there's a # in code.

Output:

msg1 = "I'm #1."  

[Discussion]:
