Regex to split string excluding delimiters between escapable quotes
【Posted】: 2021-04-23 00:17:02
【Question】: I need to split a string on whitespace, except where the whitespace sits between escapable quotes. In other words, spam spam spam "and \"eggs" should return spam, spam, spam and and "eggs. I plan to do this in Python with the re.split method, which lets you use a regular expression to identify the characters to split on.
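I found this pattern, which matches everything between escapable quotes:
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
From: https://www.metaltoad.com/blog/regex-quoted-string-escapable-quotes
And this one, which splits on a character unless it is between quotes:
\s(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
From: https://stackabuse.com/regex-splitting-by-character-unless-in-quotes/. It matches every whitespace character that has an even number of double quotes between it and the end of the line.
I am struggling to combine these two solutions.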
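To make the gap concrete, here is a minimal sketch that runs both patterns from the question on the sample string. The variable names (quoted, naive_split) are illustrative only and do not come from either linked article.

```python
import re

test_str = r'spam spam spam "and \"eggs"'

# First pattern: captures the text between a pair of unescaped quotes.
quoted = re.compile(r'''((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1''')
match = quoted.search(test_str)
print(match.group(2))  # and \"eggs

# Second pattern: splits on whitespace followed by an even number of
# double quotes, counting escaped quotes like any other quote.
naive_split = re.compile(r'\s(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
print(naive_split.split(test_str))  # ['spam spam spam "and', '\\"eggs"']
```

The first pattern handles the escaped quote but only finds the quoted part; the second splits, but the escaped quote throws off its count, so it splits at the wrong place.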
For reference, I found this very useful regex cheat sheet: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
I also found https://regex101.com/ very useful: it lets you test regular expressions.
【Answer 1】: Finally got it working:
\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)
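This combines the two solutions from the question to match whitespace that has an even number of unescaped double quotes to its right. Explanation: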
\s # space
(?= # followed by (not included in match though)
(?: # match pattern (but don't capture)
(?:
\\\" # match escaped double quotes
| # OR
[^\"] # any character that is not double quotes
)* # 0 or more times
(?<!\\)\" # followed by unescaped quotes
(?:\\\"|[^\"])* # as above match escaped double quotes OR any character that is not double quotes
(?<!\\)\" # as above - followed by unescaped quotes
# the above pairs of unescaped quotes
)* # repeated 0 or more times (acting on pairs of quotes given an even number of quotes returned)
(?:\\\"|[^\"])* # as above
$ # end of the line
)
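The annotated breakdown can also be kept in the code itself. The following is a sketch, assuming Python's re.VERBOSE flag (which ignores whitespace and # comments inside a pattern); the name SPLIT_OUTSIDE_QUOTES is made up for illustration.

```python
import re

# Same pattern as the breakdown, compiled with re.VERBOSE so the layout
# and the comments can stay inside the pattern string.
SPLIT_OUTSIDE_QUOTES = re.compile(r'''
    \s                          # a whitespace character
    (?=                         # followed by (lookahead, not part of the match)
        (?:                     # one pair of unescaped quotes:
            (?:\\\"|[^\"])*     #   escaped quotes or non-quote characters
            (?<!\\)\"           #   an unescaped quote
            (?:\\\"|[^\"])*     #   escaped quotes or non-quote characters
            (?<!\\)\"           #   another unescaped quote
        )*                      # zero or more pairs, so an even number of quotes
        (?:\\\"|[^\"])*         # then only escaped quotes or non-quote characters
        $                       # up to the end of the string
    )
''', re.VERBOSE)

parts = SPLIT_OUTSIDE_QUOTES.split(r'spam spam spam "and \"eggs"')
print(parts)  # ['spam', 'spam', 'spam', '"and \\"eggs"']
```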
So the final Python is:
import re
test_str = r'spam spam spam "and \"eggs"'
regex = r'\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)'
test_list = re.split(regex, test_str)
print(test_list)
>>> ['spam', 'spam', 'spam', '"and \\"eggs"']
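The only drawback of this approach is that it leaves the leading and trailing quotes in place, but I can easily identify and remove them with the following Python: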
# remove every unescaped quote (here, the leading and trailing ones)
test_list = [re.sub(r'(?<!\\)"', '', x) for x in test_list]
# unescape the remaining quotes - the escape characters are no longer required
test_list = [x.replace(r'\"', '"') for x in test_list]
print(test_list)
>>> ['spam', 'spam', 'spam', 'and "eggs']
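The split and the cleanup can be wrapped into one function. This is only a sketch of the steps above: the helper name split_outside_quotes and the second test input are hypothetical, and a backslash that is itself escaped before a quote (\\") is not handled.

```python
import re

# Pattern from the answer: whitespace followed by an even number of
# unescaped double quotes up to the end of the string.
SPLIT_RE = re.compile(
    r'\s(?=(?:(?:\\\"|[^\"])*(?<!\\)\"(?:\\\"|[^\"])*(?<!\\)\")*(?:\\\"|[^\"])*$)'
)
UNESCAPED_QUOTE_RE = re.compile(r'(?<!\\)"')


def split_outside_quotes(text):
    """Hypothetical helper: split on whitespace outside quotes, then unquote."""
    tokens = SPLIT_RE.split(text)
    # Drop every unescaped quote, then turn \" back into a plain quote.
    return [UNESCAPED_QUOTE_RE.sub('', t).replace(r'\"', '"') for t in tokens]


print(split_outside_quotes(r'spam spam spam "and \"eggs"'))
# ['spam', 'spam', 'spam', 'and "eggs']
print(split_outside_quotes(r'"a b" c'))
# ['a b', 'c']
```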