Regex for deleting Python comments [duplicate]

Posted: 2020-09-30 14:52:48

Question: I want to delete all the comments in a Python file. A file like this:

--------------- comment.py ---------------
# this is comment line.
age = 18 # comment in line
msg1 = "I'm #1." # comment. there's a # in code.
msg2 = 'you are #2. ' + 'He is #3' # strange sign ' # ' in comment.
print('Waiting your answer')
I wrote a number of regular expressions to extract all the comments, some like this:
(?(?<=['"])(?<=['"])\s*#.*$|\s*#.*$)
matches: #1." # comment. there's a # in code.
(?<=('|")[^\1]*\1)\s*#.*$|\s*#.*$
Wrong: the lookbehind (?<=..) here is not fixed-width, which Python's re module does not allow.
But none of them work correctly. What is the right regex? Can you help me?
Comments:
You're probably not going to handle all of these edge cases correctly without writing a parser. Parsing code with regular expressions is a bad idea: you end up with huge expressions that are really slow.

Thanks for the advice. I'm about ready to give up; for the common cases I write \s*#[^'"]*$. But Python's IDLE handles all of these cases. Does IDLE use regular expressions?
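The commenters' point can be shown concretely: a naive pattern like the one above strips from the first '#' on the line, which may sit inside a string literal. A minimal demonstration (the variable names are illustrative):

```python
import re

line = 'msg1 = "I\'m #1." # comment. there\'s a # in code.'

# The naive pattern matches starting at the FIRST '#', which here is
# inside the string literal, so it destroys part of the code.
naive = re.sub(r'\s*#.*$', '', line)
print(naive)  # → msg1 = "I'm
```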
Answer 1:

You can try tokenize instead of regex. As @OlvinRoght said, parsing code with a regex is probably a bad idea in this case. As you can see here, you can try something like this to detect comments:
import tokenize

fileObj = open(r'yourpath\comment.py', 'r')  # raw string, so '\c' is not treated as an escape
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module
    if toktype == tokenize.COMMENT:
        print('COMMENT' + " " + tok)
Output:
COMMENT # -*- coding: utf-8 -*-
COMMENT # this is comment line.
COMMENT # comment in line
COMMENT # comment. there's a # in code.
COMMENT # strange sign ' # ' in comment.
Then, to get the expected result (the Python file without comments), you can try this:
nocomments = []
fileObj = open(r'yourpath\comment.py', 'r')  # reopen: the first loop exhausted the file object
for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    if toktype != tokenize.COMMENT:
        nocomments.append(tok)
print(' '.join(nocomments))
Output:
age = 18
msg1 = "I'm #1."
msg2 = 'you are #2. ' + 'He is #3'
print ( 'Waiting your answer' )
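Note that ' '.join(tokens) loses the original spacing, as the output above shows. A variant (a sketch, not part of the original answer) is to feed the surviving tokens to tokenize.untokenize; with (type, string) pairs it runs in "compatibility mode", so exact spacing may still differ, but the result is valid, comment-free Python:

```python
import io
import tokenize

source = (
    'age = 18  # comment in line\n'
    "msg2 = 'you are #2. ' + 'He is #3'  # strange sign ' # ' in comment.\n"
)

# Keep every token except COMMENT, then let untokenize rebuild the source.
kept = [(toktype, tok)
        for toktype, tok, _start, _end, _line
        in tokenize.generate_tokens(io.StringIO(source).readline)
        if toktype != tokenize.COMMENT]
stripped = tokenize.untokenize(kept)
print(stripped)
```

String literals containing '#' survive intact, and the result still compiles.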
Comments:

In this case tokenize is better than re.

Answer 2:

Credit: https://gist.github.com/BroHui/aca2b8e6e6bdf3cb4af4b246c9837fa3
This will do it. It uses tokenization, and you can modify the code for your own use.
""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.
    """
    source = open(fname)
    mod = open(fname + ",strip", "w")
    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0
    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file("text.txt")
text.txt:
# this is comment line.
age = 18 # comment in line
msg1 = "I'm #1." # comment. there's a # in code.
msg2 = 'you are #2. ' + 'He is #3' # strange sign ' # ' in comment.
print('Waiting your answer')
Output:
age = 18
msg1 = "I'm #1."
msg2 = 'you are #2. ' + 'He is #3'
print('Waiting your answer')
Input:

msg1 = "I'm #1." # comment. there's a # in code.

(Here the regex #.*$ would match: #1." # comment. there's a # in code. The right match should be only: # comment. there's a # in code.)
Output:
msg1 = "I'm #1."
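The same position-tracking logic can be wrapped into a function that works on a string instead of a file. A sketch (the helper name strip_comments is my own, and unlike the answer's version it keeps docstrings intact):

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Rebuild source text with COMMENT tokens dropped, preserving layout."""
    out = []
    last_lineno, last_col = -1, 0
    tokgen = tokenize.generate_tokens(io.StringIO(source).readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), _ in tokgen:
        if slineno > last_lineno:   # new physical line: reset the column tracker
            last_col = 0
        if scol > last_col:         # re-create the gap between tokens
            out.append(' ' * (scol - last_col))
        if toktype != tokenize.COMMENT:
            out.append(ttext)
        last_lineno, last_col = elineno, ecol
    return ''.join(out)

print(strip_comments('msg1 = "I\'m #1."  # comment. there\'s a # in code.\n'))
```

The '#' inside the string literal survives; only the real comment disappears (leaving trailing spaces where it was).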