Python: go to the next line and save/edit content

Posted 2023-02-25


【Original title】Python go to next line and save/edit content 【Posted】2011-10-20 05:14:39 【Question】:

This code was put together in an earlier post. I am trying to adapt it to our data, but it does not work. Here is an example of our file:

read:1424:2165 TGACCA/1:2165 TGACCA/2 
1..100  +chr1:3033296..3033395 #just this line
1..100  -chr1:3127494..3127395  
1..100  +chr1:3740372..3740471  

1 concordant    read:1483:2172 TGACCA/1:2172 TGACCA/2 
1..100  -chr7:94887644..94887545 #and just this line

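The code should do the following:

Search each line for the string 'read:', go to the next line and extract something like '+chr:number..number', only once! Then search for the next 'read:', and so on.

So if a 'read:' line is followed by several '-chr:no..no' lines, only the first one should be taken.

Unfortunately, I can't figure out how to make it work...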

    import re

    infile='myfile.txt'
    outfile='outfile.txt'

    pat1 = re.compile(r'read:')
    pat2 = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')

    with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
        for line in in_f.readlines():
            if '\t' not in line.rstrip():
                continue
            a = pat1.search(line)
            if a:
            m = pat2.search(line)
            out_f.write(' '.join(m.groups()) + '\n')
            if not a:
                continue

The output should look like this:

  1 3033296 3033395 
  7 94887644 94887545

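Please throw me a bone.

Update, following the answer below

OK, I am posting the slightly modified version of Tim McNamara's code that I used. It runs fine, but the output does not pick up two-digit numbers after 'chr', and it prints a string after the last number.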

with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    lines = [line for line in in_f.readlines()]
    for i, line in enumerate(lines):
        if 'read' in line:
            data = lines[i+1].replace(':', '..').split('..')
            try:
                out_f.write('{} {} {}\n'.format(data[1][-1], data[2], data[3])) #Here I tried to remove data[3] to avoid having "start" in the output file, but it didn't work
            except IndexError:
                continue

This is the output I get with this code:

6 140302505 140302604 start  # 'start' is a string in our data after this number
5 46605561 46605462 start    # I don't understand why it grabs it thou...
5 46605423 46605522 start    # I tried to modify the code to avoid this, but ... didn't work out
6 29908310 29908409 start
6 29908462 29908363 start
4 12712132 12712231 start

How can I fix these two errors?

【Comments on the question】:

Why on earth are you checking '\t' not in line.rstrip()?

【Answer 1】:

Your big error is that you need to include readlines to iterate over in_f:

with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    for line in in_f.readlines():
        ...

That said, the whole piece of code could probably be tidied up a bit.

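with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    lines = in_f.readlines()
    for i, line in enumerate(lines):
        if 'read' in line:
            try:
                # The line right after a 'read' line holds the first hit;
                # the lookup sits inside the try in case 'read' is on the last line
                data = lines[i+1].replace(':', '..').split('..')
                a = data[1].split('chr')[-1]  # everything after 'chr', e.g. '1' or '12'
                b = data[2]
                c = data[3].split()[0]  # keep the number, drop any trailing text
                out_f.write('{} {} {}\n'.format(a, b, c))
            except IndexError:
                pass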
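
The two symptoms reported in the question's update can be reproduced on a single line. The sketch below uses a made-up sample line (the two-digit chromosome and the tab-separated trailing `start` field are assumptions modelled on the output shown in the question) and compares the expression from the update with the one from this answer.

```python
# Hypothetical sample line: two-digit chromosome plus a trailing "start" field
line = "1..100\t+chr12:140302505..140302604\tstart\n"

data = line.replace(':', '..').split('..')
# data is now ['1', '100\t+chr12', '140302505', '140302604\tstart\n']

# Expression from the question's update:
# data[1][-1] keeps only the LAST character of '100\t+chr12',
# and data[3] still carries everything after the last number.
from_update = '{} {} {}'.format(data[1][-1], data[2], data[3].rstrip('\n'))

# Expression from the answer:
# split('chr')[-1] keeps everything after 'chr',
# and split()[0] keeps only the first whitespace-separated token.
from_answer = '{} {} {}'.format(data[1].split('chr')[-1], data[2], data[3].split()[0])

print(repr(from_update))
print(repr(from_answer))
```

Indexing with `[-1]` selects one character, which is why only single-digit chromosomes come out right, while `split()[0]` discards whatever follows the end coordinate on the same line.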

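【Comments】:

Thanks, I edited the code and added readlines(), but it still shows an error message.

Thanks, I tried the tidied-up code, which looks better to me. But it returns the error "index out of range". I guess the part out_f.write('{} {} {}\n'.format(data[1][-1], data[2], data[3])) should stay as pat2 = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)'). What do you think?

There are some errors in your code. For example, the indentation is incorrect. I used string operations rather than regular expressions because I find them simpler. My code matches all of the data you posted. If your data is irregular, then regular expressions may be useful.

I have posted an update to the question; please take a look if you have time :) Thanks!!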
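
The comments mention regular expressions as an option for irregular data. Here is a minimal sketch of that route, reusing the pattern from the question. The helper name `first_hits` and the `sample` list are illustrative and not part of the original posts; the sample lines are copied from the question's example file without the trailing notes.

```python
import re

# Pattern taken from the question: strand, chromosome, start, end
PAT = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')

def first_hits(lines):
    """Yield (chrom, start, end) from the line right after each 'read:' line."""
    expect_hit = False
    for line in lines:
        if 'read:' in line:
            # The very next line is the only one we want for this read
            expect_hit = True
            continue
        if expect_hit:
            m = PAT.search(line)
            if m:
                _strand, chrom, start, end = m.groups()
                yield chrom, start, end
            # Ignore further hits until the next 'read:' line
            expect_hit = False

sample = [
    "read:1424:2165 TGACCA/1:2165 TGACCA/2",
    "1..100  +chr1:3033296..3033395",
    "1..100  -chr1:3127494..3127395",
    "1..100  +chr1:3740372..3740471",
    "",
    "1 concordant    read:1483:2172 TGACCA/1:2172 TGACCA/2",
    "1..100  -chr7:94887644..94887545",
]

for chrom, start, end in first_hits(sample):
    print(chrom, start, end)
```

Because `first_hits` only needs an iterable of lines, it should also accept an open file object, so the results could be written with `out_f.write('{} {} {}\n'.format(chrom, start, end))` inside the same `with open(...)` block used above. The `[^:]+` group captures everything between `chr` and the colon, so multi-digit chromosome names are kept whole, and `(\d+)` stops at the first non-digit, so trailing text such as `start` is not captured.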
