How to cut part of a text and substitute it into every line with Python and RegEx

Posted 2021-03-26


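Hello, I'm a Python beginner who has just started learning the language and using RegEx for text manipulation. If I've broken any StackOverflow rules, I apologize.

I'm writing a Python script that should take (cut) the date and the two times from the first line and substitute them for "Date", "TimeWindowStart" and "TimeWindowEnd" on every following line.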

ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000

I know how to match the date with a regular expression:

([0-9][0-9]|2[0-9])/[0-9][0-9](/[0-9][0-9][0-9][0-9])?

and how to match the time:

([0-9][0-9]|2[0-9]):[0-9][0-9](:[0-9][0-9])?

But I'm stuck on how to select part of the text, copy it, and then find the text I want to replace using the re.sub function.

The final output should look like this:

ReportDate=, TimeWindowStart=, TimeWindowEnd=

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000 
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000 
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000 
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

Answer

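First, you can specify a quantifier in a regular expression, so if you want four digits you don't need [0-9][0-9][0-9][0-9]; you can write [0-9]{4} instead. To capture part of an expression, wrap it in parentheses: value=([0-9]{4}) will give you only the digits.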
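As a small illustration of the quantifier and capture-group points, here is a sketch run against the header line from the question. The variable names are chosen only for this example.

```python
import re

header = "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59"

# The long-hand form and the {4} quantifier describe the same four-digit run.
long_form = re.search(r"[0-9][0-9][0-9][0-9]", header)
short_form = re.search(r"[0-9]{4}", header)
print(long_form.group(0) == short_form.group(0))

# Parentheses capture only the date, not the "ReportDate=" prefix.
match = re.search(r"ReportDate=([0-9]{2}/[0-9]{2}/[0-9]{4})", header)
report_date = match.group(1)
print(match.group(0))  # whole match, prefix included
print(report_date)     # captured group only
```

group(0) is always the full match, while group(1) is the first parenthesised part, which is what you want to reuse later.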

If you want to use re.sub, you just need to give it a pattern, a replacement string and your input string, e.g. re.sub(pattern, replacement, string).
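A minimal sketch of that call on one of the body lines from the question. The second call uses the optional count argument of re.sub on a made-up string, just to show that replacements can be limited.

```python
import re

line = "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"

# re.sub returns a new string; `line` itself is left unchanged.
replaced = re.sub(r"TimeWindowStart", "18:00:00", line)
print(replaced)

# count=1 replaces only the first match.
first_only = re.sub(r"[0-9]", "#", "a1b2c3", count=1)
print(first_only)
```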

Putting this together:

import re

txt = """ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
"""

pattern_date = 'ReportDate=([0-9]{2}/[0-9]{2}/[0-9]{4})'
report_date = re.findall(pattern_date, txt)[0]

pattern_time_start = 'TimeWindowStart=([0-9]{2}:[0-9]{2}:[0-9]{2})'
start_time = re.findall(pattern_time_start, txt)[0]

pattern_time_end = 'TimeWindowEnd=([0-9]{2}:[0-9]{2}:[0-9]{2})'
end_time = re.findall(pattern_time_end, txt)[0]

splitted = txt.split('\n')  # split the text so that the first line can be skipped

txt2 = '\n'.join(splitted[1:])  # the part of the text to run the substitutions on

# substitute the values captured from the first line
txt2 = re.sub('Date', report_date, txt2)
txt2 = re.sub('TimeWindowStart', start_time, txt2)
txt2 = re.sub('TimeWindowEnd', end_time, txt2)

txt_final = splitted[0] + '\n' + txt2
print(txt_final)

Output:

ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

Another answer

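This is a partial answer, because I don't know Python's APIs for working with text files particularly well. You can read the first line of the file and extract the report date and the start/end window times.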

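import re

first = "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59"
# The replacement must be a raw string: a plain '\1' is the control character \x01,
# not a reference to the first capture group.
ReportDate = re.sub(r'ReportDate=([^,]+),.*', r'\1', first)
TimeWindowStart = re.sub(r'.*TimeWindowStart=([^,]+),.*', r'\1', first)
TimeWindowEnd = re.sub(r'.*TimeWindowEnd=(.*)', r'\1', first)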

Write out the first line with the values of the three variables removed.
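One possible way to blank the values, sketched under the assumption that every value runs from the "=" up to the next comma or the end of the line (true for the sample header, not necessarily for other files):

```python
import re

first = "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59"

# Replace "=" plus everything up to the next comma with a bare "=".
blank_header = re.sub(r"=[^,]*", "=", first)
print(blank_header)
```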

Then all you need to do is read each subsequent line and perform the following substitutions:

line = "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"
line = re.sub(r'Date', ReportDate, line)
# No leading space in the patterns, so the space after each comma is kept.
line = re.sub(r'TimeWindowStart', TimeWindowStart, line)
line = re.sub(r'TimeWindowEnd', TimeWindowEnd, line)

Once a line has been processed this way, you can write it to the output file.
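The file handling this answer leaves open could look roughly like the sketch below. The function name rewrite_report and the file names are hypothetical, and the demonstration runs on a temporary file so that nothing real is overwritten. It assumes the header always has the name=value layout shown in the question.

```python
import re
import tempfile
from pathlib import Path


def rewrite_report(src, dst):
    """Copy src to dst, moving the header values onto every later line."""
    with open(src) as fin, open(dst, "w") as fout:
        first = fin.readline().rstrip("\n")
        # Collect name -> value pairs such as "ReportDate" -> "03/24/2019".
        values = dict(re.findall(r"(\w+)=([^,]+)", first))
        # Write the header with its values blanked out.
        fout.write(re.sub(r"=[^,]*", "=", first) + "\n")
        for line in fin:
            line = re.sub(r"\bDate\b", values["ReportDate"], line)
            line = re.sub(r"\bTimeWindowStart\b", values["TimeWindowStart"], line)
            line = re.sub(r"\bTimeWindowEnd\b", values["TimeWindowEnd"], line)
            fout.write(line)


# Demonstration on a throwaway file (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "report.txt"
    dst = Path(tmp) / "report_out.txt"
    src.write_text(
        "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59\n"
        "\n"
        "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000\n"
    )
    rewrite_report(src, dst)
    result = dst.read_text()

print(result)
```

Reading and writing line by line keeps memory use flat even for large reports, which is the main reason to prefer it over loading the whole file into one string.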

Another answer

Here is my code:

import re

s = """ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59

Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"""

datereg = r'(\d{2}/\d{2}/\d{4})'
timereg = r'(\d{2}:\d{2}:\d{2})'

dates = re.findall(datereg, s)
times = re.findall(timereg, s)

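# Replace one thing at a time, from the innermost call outwards. The trailing
# commas keep the header names (ReportDate=, TimeWindowStart=, ...) untouched.
result = re.sub(r'Date,', dates[0] + ',',
            re.sub(r'TimeWindowEnd,', times[1] + ',',
                re.sub(r'TimeWindowStart,', times[0] + ',',
                    re.sub(timereg, '',
                        re.sub(datereg, '', s)))))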

print(result)

Output:

ReportDate=, TimeWindowStart=, TimeWindowEnd=

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

Another answer

Try this:

import re

# Open the file and read it line by line
with open("a") as file:
    # Get and process the first line
    first_line = file.readline()
    m = re.search(r"ReportDate=(?P<ReportDate>[0-9/]+), TimeWindowStart=(?P<TimeWindowStart>[0-9:]+), TimeWindowEnd=(?P<TimeWindowEnd>[0-9:]+)", first_line)
    # re.escape makes sure the captured values are matched as literal text
    first_line = re.sub(re.escape(m.group('ReportDate')), "", first_line)
    first_line = re.sub(re.escape(m.group('TimeWindowStart')), "", first_line)
    first_line = re.sub(re.escape(m.group('TimeWindowEnd')), "", first_line)
    print(first_line.rstrip())  # rstrip avoids printing an extra blank line

    # Process the rest of the lines
    for line in file:
        line = re.sub(r'Date', m.group('ReportDate'), line)
        line = re.sub(r'TimeWindowStart', m.group('TimeWindowStart'), line)
        line = re.sub(r'TimeWindowEnd', m.group('TimeWindowEnd'), line)
        print(line.rstrip())

Output:

ReportDate=, TimeWindowStart=, TimeWindowEnd=

03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000
03/24/2019, 18:00:00, 20:59:59, Report-20190323_210000

Another answer

I found a straightforward solution, as follows:

import re

input_str = """
ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000
"""

# Divide the input string into two parts: header and body
lines = input_str.split('\n')
header = lines[1]  # lines[0] is empty because input_str starts with a newline
body = '\n'.join(lines[2:])

# Find the elements to be replaced
ri = re.findall(r'\d{2}/\d{2}/\d{4}', header)
ri.extend(re.findall(r'\d{2}:\d{2}:\d{2}', header))

# Replace the elements (the parentheses let the chained calls span several lines)
new_header = (header.replace(ri[0], '')
                    .replace(ri[1], '')
                    .replace(ri[2], ''))

new_body = (body.replace('Date', ri[0])
                .replace('TimeWindowStart', ri[1])
                .replace('TimeWindowEnd', ri[2]))

# Construct the result string
full_string = new_header + '\n\n' + new_body

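Just find the items to be replaced with a regular expression and then perform ordinary string replacement. I think this works well only when there are just a few elements.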
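If there were many placeholders, chaining one replace call per name would get unwieldy. A possible alternative, offered as a sketch and not as part of the answer above, is a single re.sub pass whose replacement is a function that looks each match up in a dictionary. The mapping and the variable names here are assumptions made for the example.

```python
import re

header = "ReportDate=03/24/2019, TimeWindowStart=18:00:00, TimeWindowEnd=20:59:59"
body = "Date, TimeWindowStart, TimeWindowEnd, Report-20190323_210000"

# Map each body placeholder to the value taken from the header.
found = dict(re.findall(r"(\w+)=([^,]+)", header))
mapping = {
    "Date": found["ReportDate"],
    "TimeWindowStart": found["TimeWindowStart"],
    "TimeWindowEnd": found["TimeWindowEnd"],
}

# One alternation built from the keys; re.escape guards against special characters.
placeholder = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")

# A function used as the replacement looks each match up in the mapping.
new_body = placeholder.sub(lambda m: mapping[m.group(1)], body)
print(new_body)
```

Adding another placeholder then only means adding one entry to the dictionary.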
