Python Regex / Regular Expressions - How Do You Get AROUND The Target Text While Leaving Target Text Intact?

Posted: 2021-12-11 08:31:09

Question:

Here is an example of the target text:

"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7'10\"x10'2\"", "ebay ": "\"_id\": \"6175ee6eb7f86b42582b4667\", \"rawColor\": \"Gray\", \"rawSize\": \"7'10\\\"x10'2\\\" \"", "overstock": "\"_id\": \"6175eef7b7f86b42582b4678\", \"rawColor\": \"Brown/Red\", \"rawSize\": \"7'10\\ \"x10'2\\\"\""', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7'10 \"x10'2\"", "ebay": "\"_id\": \"6175ee72b7f86b42582b466c\", \"rawColor\": \"棕色/红色\", \"rawSize\": \"7 '10\\\"x10'2\\\"\"", "overstock": "\"_id\": \"6175eef7b7f86b42582b4679\", \"rawColor\": \"Gray\", \" rawSize\": \"7'10\\\"x10'2\\\"\""', '"feature1": "color", "feature2": "size", "name_color": " Gray", "name_size": "7'10\"x10'2\"", "ebay": "\"_id\": \"6175ee72b7f86b42582b466c\", \"rawColor\": \"棕色/红色\ ", \"rawSize\": \"7'10\\\"x10'2\\\"\"", "overstock": "\"_id\": \"6175eef7b7f86b42582b4678\", \"rawColor \": \"棕色/红色\", \"rawSize\": \"7'10\\\"x10'2\\\"\""', ' "feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7'10\"x10'2\"", "ebay": "\"_id \": \"6175ee6eb7f86b42582b4667\", \"rawColor\": \"Gray\", \"rawSize\": \"7'10\\\"x10'2\\\"\"", "overstock ": "\"_id\": \"6175eef7b7f86b42582b4679\", \"rawColor\": \"Gray\", \"rawSize\": \"7'10\\\"x10'2\\\" \""

Unfortunately, I need json.loads to accept this, but it fails with JSONDecodeError: Expecting value: line 1 column 1 (char 0).

Here is what I have tried so far:

import re 
import json

problem = "'\"feature1\": \"color\", \"feature2\": \"size\", \"name_color\": \"Gray\", \"name_size\": \"7\\'10\\\\\"x10\\'2\\\\\"\", \"ebay\": \"\\\\\"_id\\\\\": \\\\\"6175ee6eb7f86b42582b4667\\\\\", \\\\\"rawColor\\\\\": \\\\\"Gray\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\", \"overstock\": \"\\\\\"_id\\\\\": \\\\\"6175eef7b7f86b42582b4678\\\\\", \\\\\"rawColor\\\\\": \\\\\"Brown/Red\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\"', '\"feature1\": \"color\", \"feature2\": \"size\", \"name_color\": \"Gray\", \"name_size\": \"7\\'10\\\\\"x10\\'2\\\\\"\", \"ebay\": \"\\\\\"_id\\\\\": \\\\\"6175ee72b7f86b42582b466c\\\\\", \\\\\"rawColor\\\\\": \\\\\"Brown/Red\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\", \"overstock\": \"\\\\\"_id\\\\\": \\\\\"6175eef7b7f86b42582b4679\\\\\", \\\\\"rawColor\\\\\": \\\\\"Gray\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\"', '\"feature1\": \"color\", \"feature2\": \"size\", \"name_color\": \"Gray\", \"name_size\": \"7\\'10\\\\\"x10\\'2\\\\\"\", \"ebay\": \"\\\\\"_id\\\\\": \\\\\"6175ee72b7f86b42582b466c\\\\\", \\\\\"rawColor\\\\\": \\\\\"Brown/Red\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\", \"overstock\": \"\\\\\"_id\\\\\": \\\\\"6175eef7b7f86b42582b4678\\\\\", \\\\\"rawColor\\\\\": \\\\\"Brown/Red\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\"', '\"feature1\": \"color\", \"feature2\": \"size\", \"name_color\": \"Gray\", \"name_size\": \"7\\'10\\\\\"x10\\'2\\\\\"\", \"ebay\": \"\\\\\"_id\\\\\": \\\\\"6175ee6eb7f86b42582b4667\\\\\", \\\\\"rawColor\\\\\": \\\\\"Gray\\\\\", \\\\\"rawSize\\\\\": \\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\", \"overstock\": \"\\\\\"_id\\\\\": \\\\\"6175eef7b7f86b42582b4679\\\\\", \\\\\"rawColor\\\\\": \\\\\"Gray\\\\\", \\\\\"rawSize\\\\\": 
\\\\\"7\\'10\\\\\\\\\\\\\"x10\\'2\\\\\\\\\\\\\"\\\\\"\"'"
b = problem
b = re.sub(r'\s\\\\"', ' "', b)
b = re.sub(r'\\\\"_id\\\\', '"_id', b) # cleans up area around _id
b = re.sub(r'\\\\":', '":', b) # cleans up post property and colon
b = re.sub(r'\\\\",', '",', b) # cleans up post property and comma
b = re.sub(r'\\\\""', '', b) # cleans up ending of string 
b = re.sub(r'\\\\\\\\\\\\"', '\\\\\\"', b) # fixes inches backslashes
b = re.sub(r'\\\\"', '\\"', b) # clears up escaping inches
b = re.sub(r'"",', '",', b) # clears up extra quotation marks
b = re.sub(r'""', '"', b)
finally_b = b[1:-1]  # removes the extra ' and ' from the ends
print('b...')
print(b)
print()
print('finally_b...')
print(finally_b)
json.loads( finally_b )

Output:

b...
'"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee6eb7f86b42582b4667", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4678", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee72b7f86b42582b466c", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4679", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee72b7f86b42582b466c", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4678", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee6eb7f86b42582b4667", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4679", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"'

finally_b...
'"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee6eb7f86b42582b4667", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4678", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee72b7f86b42582b466c", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4679", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee72b7f86b42582b466c", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4678", "rawColor": "Brown/Red", "rawSize": "7\'10\"x10\'2\"', '"feature1": "color", "feature2": "size", "name_color": "Gray", "name_size": "7\'10\"x10\'2\", "ebay": "_id": "6175ee6eb7f86b42582b4667", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"\"", "overstock": "_id": "6175eef7b7f86b42582b4679", "rawColor": "Gray", "rawSize": "7\'10\"x10\'2\"'
---------------------------------------------------------------------------

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

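Is there a better way to handle parts like \\\\\"rawSize\\\\\" and turn them into "rawSize"? That is what I mean by getting AROUND the word rawSize: leave the word itself alone and clean up whatever surrounds it.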

Comments:

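This looks a bit messy. Could you clean up your question, strip it down to its core, and provide a minimal reproducible example?

@mnist OK, done.

What generates that string? Can it be changed to output valid JSON?

"I need to get this accepted": no, you tell the provider that it is not JSON and have them fix their side. This non-JSON string is garbage as it stands, and while you might be able to hack or fudge it so that JSON loading works for this one string, what about the next string, or the one after that? Will your same code work on those?

Answer 1: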

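The data looks corrupted to me. Look at this part: "name_size": "7'10\"x10'2\"". Neither the " nor the ' around the 7 has a backslash in front of it. That alone is a problem when the string is interpreted.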
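To see why the decoder gives up at character 0, a small check along these lines may help. It uses a shortened, hypothetical stand-in for the data (`sample`), not the full string from the question. The assumption being illustrated is that a JSON document cannot start with a single quote, and that a bare run of key/value pairs needs enclosing braces before it is an object.

```python
import json

# Hypothetical, shortened stand-in for the question's data:
# key/value pairs wrapped in single quotes, with no enclosing braces.
sample = "'\"feature1\": \"color\", \"feature2\": \"size\"'"

first_error = None
try:
    json.loads(sample)
except json.JSONDecodeError as err:
    # The leading single quote cannot start any JSON value,
    # so the decoder stops at the very first character.
    first_error = str(err)
    print(first_error)

# Dropping the outer single quotes and adding braces gives a JSON object.
wrapped = "{" + sample.strip("'") + "}"
parsed = json.loads(wrapped)
print(parsed)
```

This only shows where the first failure comes from. The nested, multiply escaped values in the real data would still need their own clean-up.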

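Personally, I would suggest cleaning the string. You could convert it to a raw string, perhaps by encoding it with test_string.encode('unicode_escape'), then make sure every " and ' has a backslash in front of it, and then load it with json?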
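As a rough illustration of the clean-up idea, here is a minimal sketch that takes a different route from `unicode_escape`: a regex with a capture group keeps the word between the quotes and drops the backslash runs around it. The `fragment` value and the names `pattern`, `cleaned` and `parsed` are made up for this example. It only covers values made of word characters, so the inch marks in the real data would need separate handling.

```python
import json
import re

# Hypothetical fragment shaped like the question's data:
# a run of backslashes sits in front of every double quote.
fragment = r'\\"rawColor\\": \\"Gray\\", \\"rawSize\\": \\"7x10\\"'

# \\+ matches one or more literal backslashes.
# (\w+) captures the word that should be left intact.
pattern = re.compile(r'\\+"(\w+)\\+"')

# \1 re-inserts the captured word between plain double quotes.
cleaned = pattern.sub(r'"\1"', fragment)
print(cleaned)  # "rawColor": "Gray", "rawSize": "7x10"

# With braces added, this small fragment parses as a JSON object.
parsed = json.loads("{" + cleaned + "}")
print(parsed)
```

The capture group is what lets the pattern match around the target text: everything outside the parentheses is discarded, and whatever was inside them is written back unchanged.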
