Bash: Find and replace lines in a file using the lines of another file

Posted 2023-03-15

Asked: 2021-04-27 22:45:30

Question:

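I have two files: masterlist.txt contains hundreds of lines of URLs, and toupdate.txt contains a small number of updated versions of lines that need to be replaced in masterlist.txt.

I would like to automate this with Bash, because these lists are already created and used inside a bash script.

The server part of the URL is what changes, so we can match on the unique part: /whatever/whatever_user.xml. But how do I find and replace those lines in masterlist.txt? That is, how do I iterate over each line of toupdate.txt and, where it ends with /f_SomeName/f_SomeName_user.xml, find the line in masterlist.txt with that same ending and replace the whole line with the new one?

For example, https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml becomes https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml.

The rest of masterlist.txt needs to stay as it is, so we must only find and replace lines that have the same line ending (ID) but a different server.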

Structure

masterlist.txt looks like this:

https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://101112url.domain.com/1/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

toupdate.txt looks like this:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml

Desired result

masterlist.txt should end up looking like this:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

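Initial attempt

I have looked at sed, but I can't work out how to do a find and replace using lines from both files.

This is what I have so far, which at least walks through the file: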

#!/bin/bash

#...

while read -r line; do
    # there's a new link on each line
    link="$line"
    # extract the unique part from the end of each line
    grabXML="${link##*/}"
    grabID="${grabXML%_user.xml}"
    # if we cannot grab the ID, then just set it to use the full link so we don't have an empty string
    if [ -n "$grabID" ]; then
        identifier=$grabID
    else
        identifier="$line"
    fi
    
    ## the find and replace here? ##    

# we're done when we've reached the end of the file
done < "masterlist.txt"
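
As a quick check of the two parameter expansions in the loop, here is a minimal sketch that applies them to one sample URL from the question. It assumes the expansions are written with braces, `${link##*/}` and `${grabXML%_user.xml}`; without the braces bash expands `$link` and then appends the rest literally.

```shell
#!/bin/bash
# Sample URL taken from the masterlist.txt excerpt in the question.
link="https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml"

# "##*/" removes the longest prefix ending in a slash, leaving the file name.
grabXML="${link##*/}"
# "%_user.xml" removes that suffix from the end, leaving the ID.
grabID="${grabXML%_user.xml}"

echo "$grabXML"   # f_SomeName_user.xml
echo "$grabID"    # f_SomeName
```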

Comments:

Related: ***.com/q/399078, ***.com/q/15783701, ***.com/q/55343129

Answer 1:

Would you please try the following:

#!/bin/bash

declare -A map    # associative array (needs bash 4 or later)

# First pass: index every updated URL by its unique tail
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        map[$uniq_part]=$line
    fi
done < "toupdate.txt"

# Second pass: print each master line, swapping in the update if one exists
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        if [[ -n ${map[$uniq_part]} ]]; then
            line=${map[$uniq_part]}
        fi
    fi
    echo "$line"
done < "masterlist.txt" > "masterlist_tmp.txt"

# if the result of "masterlist_tmp.txt" is good enough, uncomment the line below
# mv -f -- "masterlist_tmp.txt" "masterlist.txt"

Result:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml

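[Explanation]

The associative array map maps a "unique part" such as /f_SomeName/f_SomeName_user.xml to a "full path" such as https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml. If the regex (/[^/]+/[^/]*\.xml)$ matches, the shell variable BASH_REMATCH[1] is set to the substring that starts at the second slash from the right and runs to the ".xml" extension at the end of the string. The first loop, over "toupdate.txt", stores each "unique part" and "full path" pair as a key-value pair of the associative array. The second loop, over "masterlist.txt", extracts the "unique part" of each line and tests whether an associated value exists. If it does, the line is replaced with the associated value, that is, the line from "toupdate.txt".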
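
To see the key extraction and lookup in isolation, here is a minimal sketch on two URLs from the question. The helper name unique_part is hypothetical (it is not part of the answer), and the snippet assumes bash 4 or later for `declare -A`.

```shell
#!/bin/bash
# unique_part is a hypothetical helper: print the "/dir/file.xml" tail of a URL.
unique_part() {
    local re='(/[^/]+/[^/]*\.xml)$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

declare -A map
new_url="https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml"
old_url="https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml"

# Index the updated URL by its unique tail, as the first loop does.
k=$(unique_part "$new_url")
map[$k]=$new_url

# Look up the old URL by the same tail, as the second loop does.
key=$(unique_part "$old_url")
replacement=${map[$key]}

echo "$key"           # /f_SomeName/f_SomeName_user.xml
echo "$replacement"   # the new-123 URL
```

Both URLs share the same tail, so the old line resolves to the new one even though the server and the numeric path segment differ.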

[Alternative] If the text files are large, bash may not be fast enough. In that case an awk script will work more efficiently:

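awk 'NR==FNR {
    # first file (toupdate.txt): remember each line under its unique tail
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        map[substr($0, RSTART, RLENGTH)] = $0
    }
    next
}
{
    # second file (masterlist.txt): replace the line if an update exists
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        full_path = map[substr($0, RSTART, RLENGTH)]
        if (full_path != "") {
            $0 = full_path
        }
    }
    print
}
' "toupdate.txt" "masterlist.txt" > "masterlist_tmp.txt"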

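[Explanation]

The NR==FNR { BLOCK1; next } { BLOCK2 } syntax is a common idiom for switching the processing per file. Because the NR==FNR condition holds only for the first file in the argument list, and the next statement skips the following block, BLOCK1 processes only "toupdate.txt". Likewise, BLOCK2 processes only "masterlist.txt". If the function match($0, pattern) succeeds, it sets the awk variable RSTART to the start position of the matched substring within $0 (the current record read from the file) and sets RLENGTH to the length of the matched substring. We can then extract the matched substring, such as /f_SomeName/f_SomeName_user.xml, with the substr() function. We then assign to the array map so that the substring (the unique part) maps to the whole URL in "toupdate.txt". The second block works much like the first: if a value for the key is found in the array map, the record ($0) is replaced with the array value indexed by that key.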
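
The two building blocks can be tried separately. The sketch below first runs the NR==FNR idiom on two tiny files, then uses match() with substr() on one URL from the question. The file names first.txt and second.txt are hypothetical and created only for this demo.

```shell
#!/bin/bash
# first.txt and second.txt are hypothetical sample files for this demo only.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\n'    > "$tmpdir/second.txt"

# NR counts records across all files; FNR restarts at 1 for each file,
# so NR==FNR is true only while the first file is being read.
idiom_out=$(awk 'NR==FNR { print "first:", $0; next } { print "second:", $0 }' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$idiom_out"

# match() sets RSTART and RLENGTH; substr() then cuts out the matched tail.
url="https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml"
key=$(printf '%s\n' "$url" |
    awk '{ if (match($0, "/[^/]+/[^/]*\\.xml$")) print substr($0, RSTART, RLENGTH) }')
printf '%s\n' "$key"

rm -rf "$tmpdir"
```

One caveat with the idiom: if the first file is empty, NR==FNR also holds while the second file is read, so an empty "toupdate.txt" would be worth guarding against before running the script.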

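Comments:

I like the awk answer! Thanks. Could you explain in more detail how the awk answer works?

@nooblag See ***.com/q/32481877/3220113 for an explanation of a similar awk command.

Thanks for the feedback. I have added a short explanation of the awk script to my answer.

Answer 2:

Why not let sed write its own script, which then generates the desired output: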

sed -e "$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<' toupdate.txt)" masterlist.txt

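where

the inner sed command contains an outer and an inner substitution command;
the outer s (s<...<...<) captures scheme://domain/N/ as \1 and the rest of the path (.*) as \2, and inserts them into the script for the outer sed command;
the outer sed script (\|\2$| s|.*|\1\2|) looks for URLs in masterlist.txt that end with the rest of the path and substitutes (inner s) the new URL from toupdate.txt;
to avoid a lot of backslash escaping, < and | are used as the delimiters of the two s commands, and \|...| is used in place of /.../.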
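
To make the generated script visible, here is a minimal sketch that runs the inner sed on one line from toupdate.txt and then feeds the result to an outer sed over two sample lines. It reads the sample data from printf instead of the real files; the variable names generated and result are illustrative only.

```shell
#!/bin/bash
# Inner sed: turn one updated URL into an "address + substitute" sed command.
generated=$(printf '%s\n' \
    "https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml" |
    sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<')
printf '%s\n' "$generated"
# \|path/f_SomeName/f_SomeName_user.xml$| s|.*|https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml|

# Outer sed: run the generated script over two lines from the master list.
result=$(printf '%s\n' \
    "https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml" \
    "https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml" |
    sed -e "$generated")
printf '%s\n' "$result"
```

Two things to keep in mind: the generated address is a regular expression, so the dots in the path match any character, and the command as given writes to standard output rather than changing masterlist.txt in place.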

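Comments:

Thanks. If I understand correctly, I should point out that the ... is not literal; I used an ellipsis in the excerpt of masterlist.txt to mean "and so on". Shall I add some brackets to make that clearer?

@nooblag Yes. In my explanation I used ... in place of the ellipsis character (U+2026), and it is not meant literally; I understood yours to be an ellipsis as well.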
