How to get the pivot lines from two tab-separated files?

Posted 2023-02-24

【Title】: How to get the pivot lines from two tab-separated files? 【Asked】: 2021-05-13 05:23:45 【Question】:

Given two files, file1.txt:

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432

and file2.txt:

foo bar \t hello world
abc def \t good morning
xyz \t 456

the task is to extract the lines whose first column appears in both files and produce:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do this in Python:

from io import StringIO

file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""

file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

# Map each first column (the key) to the rest of its line.
map1, map2 = {}, {}

with StringIO(file1) as fin1:
    for line in fin1:
        one, two = line.strip().split('\t')
        map1[one] = two

with StringIO(file2) as fin2:
    for line in fin2:
        one, two = line.strip().split('\t')
        map2[one] = two

# Print the lines whose key appears in both files.
for k in set(map1).intersection(map2):
    print('\t'.join([k, map1[k], map2[k]]))

The actual files have billions of lines. Is there a faster solution that does not load everything and keep it in a hash map/dictionary?

Maybe using unix/bash commands? Would pre-sorting the files help?

【Answer 1】:

You can try this awk:

awk '{key = $1 FS $2} FNR==NR {sub(/^([^[:blank:]]+[[:blank:]]+){2}/, ""); map[key] = $0; next} key in map {print $0, map[key]}' file2.txt file1.txt

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

A more readable version:

awk '
{
   key = $1 FS $2
}
FNR == NR {
   sub(/^([^[:blank:]]+[[:blank:]]+){2}/, "")
   map[key] = $0
   next
}
key in map {
   print $0, map[key]
}' file2.txt file1.txt

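It loads only file2's data into memory and processes file1's records line by line. The key is taken to be the first two whitespace-separated words of each line, as in the sample data.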
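
The same idea (keep one file in a lookup table, stream the other) can be sketched in Python. This is a minimal sketch, not the answer's code: the function name hash_join is hypothetical, and it assumes real tab characters as separators, with the whole first column as the key.

```python
import io

def hash_join(small, large, sep="\t"):
    """Yield joined lines for keys present in both inputs.

    `small` is read fully into a dict; `large` is streamed line by line,
    so memory use is bounded by the size of `small` only.
    """
    lookup = {}
    for line in small:
        key, _, rest = line.rstrip("\n").partition(sep)
        lookup[key] = rest
    for line in large:
        key, _, rest = line.rstrip("\n").partition(sep)
        if key in lookup:
            yield sep.join([key, rest, lookup[key]])

# In-memory stand-ins for file1.txt and file2.txt (assumed: real tabs).
file1 = io.StringIO("abc def\t123 456\njkl mno\t987 654\nfoo bar\t789 123\nbar bar\t432\n")
file2 = io.StringIO("foo bar\thello world\nabc def\tgood morning\nxyz\t456\n")

for row in hash_join(file2, file1):
    print(row)
```

With real files, open file objects can be passed in place of the StringIO stand-ins. This only helps when the smaller file fits in memory; if both files have billions of lines, a sort-based approach is likely the better fit.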

【Answer 2】:

The join command is sometimes hard to use, but here it is simple:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

This uses bash's ANSI-C quoting to specify the tab delimiter, and process substitutions to treat program output as files.

To inspect the output, pipe the above into cat -A, which shows tabs as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$
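
To illustrate why pre-sorting helps, here is a minimal Python sketch of a merge join over two inputs already sorted by key. It is not how join itself is implemented: the function name merge_join is hypothetical, keys are assumed unique within each file, and both inputs are assumed to be sorted in the plain string order Python uses for comparison.

```python
import io

def merge_join(left, right, sep="\t"):
    """Join two key-sorted inputs, holding one line of each in memory.

    Assumes unique keys per input and the same sort order as Python's
    string comparison (hypothetical helper, for illustration only).
    """
    def records(lines):
        for line in lines:
            key, _, rest = line.rstrip("\n").partition(sep)
            yield key, rest

    it_l, it_r = records(left), records(right)
    l, r = next(it_l, None), next(it_r, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            # Keys match: emit the joined line and advance both sides.
            yield sep.join([l[0], l[1], r[1]])
            l, r = next(it_l, None), next(it_r, None)
        elif l[0] < r[0]:
            l = next(it_l, None)  # left key is behind, advance left
        else:
            r = next(it_r, None)  # right key is behind, advance right

# Sorted in-memory stand-ins for the two files (assumed: real tabs).
sorted1 = io.StringIO("abc def\t123 456\nbar bar\t432\nfoo bar\t789 123\njkl mno\t987 654\n")
sorted2 = io.StringIO("abc def\tgood morning\nfoo bar\thello world\nxyz\t456\n")

for row in merge_join(sorted1, sorted2):
    print(row)
```

Once both files are sorted, each is read exactly once and neither needs to fit in memory, which is the property that matters for inputs with billions of lines.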

【Answer 3】:

Using Miller (https://github.com/johnkerl/miller) and its join verb:

mlr --tsv --implicit-csv-header --headerless-csv-output join -j 1 --rp 2 -f file1.txt file2.txt >output.tsv

The output will be (this is only a preview; the actual output is tab-separated):

| foo bar | 789 123 | hello world  |
| abc def | 123 456 | good morning |
