Dynamically extract pattern unique to each string in a list of strings in bash

Posted 2023-03-24


[Title]: Dynamically extract pattern unique to each string in a list of strings in bash [Asked]: 2021-09-17 08:05:59 [Question]:

I am trying to dynamically extract the unique pattern from a list of filenames in bash.

The input list of filenames looks like this

Exp1_ML_Rep1.txt,Exp1_ML_Rep2.txt,Exp1_ML_Rep3.txt

and I would like to dynamically extract the strings

Rep1,Rep2,Rep3

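(The original post illustrated this with a diagram.)

Note: the input pattern changes every time. For example, another use case could be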

Exp2_DT_10ng_55C_1_User1.png,Exp2_DT_10ng_55C_2_User1.png,Exp2_DT_10ng_55C_3_User1.png

In this case, I would like to extract

1,2,3

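(The original post illustrated this with a diagram as well.)

What is the best way to achieve this in bash?

As suggested in the comments, I tried the following: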

declare -p string1 string2

declare -- string1="ER_Rep1"
declare -- string2="ER_Rep2"

diff <(echo "$string1" ) <(echo "$string2") returns

1c1 
< ER_Rep1 
--- 
> ER_Rep2

What I want to extract is Rep1,Rep2.

[Comments]:

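Yes, I have tried several tools such as diff, sed, perl, awk, etc., but none of them gave an elegant solution.

If Exp1_ML_Rep1.txt becomes Rep1 but Exp2_DT_10ng_55C_1_User1.png becomes 1, it is not at all clear what the transformation algorithm is. Please edit your question.

@edmorton, sorry for being unclear. The incoming set of strings can change pattern and does not conform to a fixed pattern. That is why I gave two examples and need a dynamic solution. All I want to get is the difference between the input strings.

@EdMorton, I added illustrations to make it clearer. Perhaps a better title would be "unique substring within a list of strings".

Actually, your comment is what clarified your requirement, thanks.

[Answer 1]:

You can use GNU awk together with sort and uniq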

echo 'Exp1_ML_Rep1.txt,Exp1_ML_Rep2.txt,Exp1_ML_Rep3.txt' | awk -v RS='[_.,]' '1' | sort | uniq -u

or

tr together with sort and uniq

echo 'Exp1_ML_Rep1.txt,Exp1_ML_Rep2.txt,Exp1_ML_Rep3.txt' | tr '_.,' '\n' | sort | uniq -u

which produces the output

Rep1
Rep2
Rep3
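Since the question asks for a comma-separated result (Rep1,Rep2,Rep3), the tr pipeline above could be followed by paste to join the lines. This is only a minimal sketch and not part of the original answer: the function name join_unique_tokens is hypothetical, and it assumes the standard paste utility is available.

```shell
#!/usr/bin/env bash

# join_unique_tokens is a hypothetical helper name (an assumption, not from the answer).
# It splits the argument on "_", "." and ",", keeps the tokens that occur
# exactly once, and joins them with commas.
join_unique_tokens() {
    printf '%s\n' "$1" | tr '_.,' '\n\n\n' | sort | uniq -u | paste -sd, -
}

join_unique_tokens 'Exp1_ML_Rep1.txt,Exp1_ML_Rep2.txt,Exp1_ML_Rep3.txt'
# prints: Rep1,Rep2,Rep3
```

Note that the tokens come out in sort order rather than input order, because uniq -u needs sorted input.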

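[Comments]:

Very clever solution. You should mention that a multi-character RS requires GNU awk. A POSIX awk would treat RS='[_.,]' as if you had just written RS='['. It also outputs an unwanted extra blank line; you should make it awk -v RS='[_.,]' 'RT'

@glennjackman and @BarathVutukuri, yes, a really neat solution indeed. I ended up using the tr solution :).

[Answer 2]:

You may consider this awk solution: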

declare -- string1="ER_Rep1"
declare -- string2="ER_Rep2"

awk -F '[_.]+' '{for (i=1; i<=NF; ++i) ++fq[$i]}
END {for (w in fq) if (fq[w] == 1) print w}' <(echo "$string1") <(echo "$string2")

Rep1
Rep2

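This awk solution uses _ or . as the field separator and stores each field in an associative array fq, whose value is a number representing how often that word occurs.

In the END block we iterate over each word in the fq array and print the word when its frequency equals 1, which indicates a unique occurrence of that word.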
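The same frequency-count idea can be applied to the comma-separated list from the question in a single awk call. The following is a hedged sketch rather than the answerer's code: the function name unique_fields and the order array are assumptions introduced here so that words are printed in first-seen order (the order of for (w in fq) is unspecified in awk).

```shell
#!/usr/bin/env bash

# unique_fields is a hypothetical helper name (an assumption for illustration).
# It splits the argument on runs of "_", "." or "," and prints every field
# that occurs exactly once, in the order the fields were first seen.
unique_fields() {
    printf '%s\n' "$1" | awk -F '[_.,]+' '
        {
            for (i = 1; i <= NF; i++) {
                if (!($i in fq)) order[++n] = $i   # remember first-seen order
                fq[$i]++                           # count occurrences
            }
        }
        END {
            for (k = 1; k <= n; k++)
                if (fq[order[k]] == 1) print order[k]
        }'
}

unique_fields 'Exp1_ML_Rep1.txt,Exp1_ML_Rep2.txt,Exp1_ML_Rep3.txt'
unique_fields 'x_foo13,x_bar13,x_baz13,x_qux13'
```

The second call prints foo13, bar13, baz13 and qux13 rather than foo, bar, baz and qux: a token-based approach can only report whole delimiter-separated fields, so a shared part that is not set off by a delimiter (here the trailing 13) stays attached.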

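[Comments]:

[Answer 3]:

I'm sure there are better ways to code this, but what I've done here is a general solution for any number of input strings.

Find the longest common prefix made of underscore-delimited substrings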

longestCommonPrefix() {
    local i prefix file found
    local -a pieces
    # split the first string into its underscore-delimited pieces
    IFS=_ read -ra pieces <<<"$1"
    # try the longest candidate prefix first, then drop one piece at a time
    for ((i = ${#pieces[@]} - 1; i > 0; i--)); do
        prefix=$(IFS=_; echo "${pieces[*]:0:i}_")
        found=true
        for file in "${@:2}"; do
            if [[ $file != "$prefix"* ]]; then
                found=false
                break
            fi
        done
        if $found; then
            echo "$prefix"
            return
        fi
    done
}

Find the longest common suffix (purely character by character)

longestCommonSuffix() {
    local i suffix file found
    # try the whole first string as the suffix, then shorten it one character at a time
    for ((i = ${#1}; i > 0; i--)); do
        suffix=${1: -i}
        found=true
        for file in "${@:2}"; do
            if [[ $file != *"$suffix" ]]; then
                found=false
                break
            fi
        done
        if $found; then
            echo "$suffix"
            return
        fi
    done
}

Putting them together

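uniqueStrings() {
    local prefix=$(longestCommonPrefix "$@")
    # strip the common prefix from every argument
    set -- "${@/#"$prefix"/}"
    local suffix=$(longestCommonSuffix "$@")
    # strip the common suffix and print one result per line
    printf '%s\n' "${@/%"$suffix"/}"
}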

Then

$ uniqueStrings Exp1_ML_Rep1.txt Exp1_ML_Rep2.txt Exp1_ML_Rep3.txt
Rep1
Rep2
Rep3

and

$ uniqueStrings Exp2_DT_10ng_55C_1_User1.png Exp2_DT_10ng_55C_2_User1.png Exp2_DT_10ng_55C_3_User1.png
1
2
3

A few other examples:

# nothing in common, should return the input strings
$ uniqueStrings foo bar baz
foo
bar
baz

$ uniqueStrings x_foo13 x_bar13 x_baz13 x_qux13
foo
bar
baz
qux

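Works with bash v3.2+

[Comments]:

This seems closer to what the OP showed; I was also working on a "remove the common prefix/suffix" solution ... but was thinking of doing it at the individual character level (i.e., no delimiters) ... and probably in awk (for performance reasons)

I encourage you to submit a solution, I'd love to see how you would do it.

Done; a bit verbose but, hey, it's a first pass ...

[Answer 4]:

Looking at a solution similar to the one proposed by @glennjackman: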

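- find the common prefix
- find the common suffix
- remove the common prefix/suffix; whatever is left is the difference

Assumptions:

- the list of filenames is provided as a comma-delimited string
- variable number of filenames
- character-by-character comparison
- no delimiters
- assume a single "difference" made up of consecutive characters, e.g., when comparing aBcDe and aXcYe we consider the c as not common, so the differences would be reported as BcD and XcY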
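To make the last assumption concrete, here is a minimal pure-bash sketch of the "single contiguous difference" idea. It is an illustration only and not the answerer's code: the function name contiguous_diff is hypothetical, and it takes the strings as separate arguments rather than as a comma-delimited list.

```shell
#!/usr/bin/env bash

# contiguous_diff is a hypothetical name (an assumption for illustration).
# It removes the longest character-level prefix and suffix shared by ALL
# arguments and prints what remains of each argument, one per line.
contiguous_diff() {
    local first=$1 s
    local min=${#1} pfx=0 sfx=0 same

    # The shortest string bounds how long a shared prefix/suffix can be.
    for s in "$@"; do
        if (( ${#s} < min )); then min=${#s}; fi
    done

    # Grow the shared prefix one character at a time.
    while (( pfx < min )); do
        same=1
        for s in "$@"; do
            if [[ ${s:pfx:1} != "${first:pfx:1}" ]]; then same=0; break; fi
        done
        (( same )) || break
        pfx=$(( pfx + 1 ))
    done

    # Grow the shared suffix, never overlapping the prefix already claimed.
    while (( sfx < min - pfx )); do
        same=1
        for s in "$@"; do
            if [[ ${s:${#s}-sfx-1:1} != "${first:${#first}-sfx-1:1}" ]]; then same=0; break; fi
        done
        (( same )) || break
        sfx=$(( sfx + 1 ))
    done

    # Whatever lies between the shared prefix and suffix is the difference.
    for s in "$@"; do
        printf '%s\n' "${s:pfx:${#s}-pfx-sfx}"
    done
}

contiguous_diff aBcDe aXcYe
# prints:
# BcD
# XcY
```

The shared c in the middle is kept in both results because only the outermost common prefix (a) and suffix (e) are removed.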

One idea using awk, which should give some performance improvement over bash-level looping:

awk '

# function to return an absolute value of a number

function abs(v) { return v < 0 ? -v : v }

# function to determine if each string has the same character at a given offset;
# return 0 if "no", return 1 if "yes"

function equal() {

    for ( i=1; i<=n; i++ ) {
        pos = offset <= 0 ? length(fname[i]) + offset : offset
        x   = substr(fname[i],pos,1)
        if ( i == 1 )    curr = x
        if ( x != curr ) return 0
    }
    return 1
}

# for now assume strings input using a here-string, and strings are delimited by a comma

FNR==1 { n=split($0,fname,",")
         exit                              # skip to END processing
       }

END {
    # twice through the outer "for" loop:
    #    op =  1 => prefix processing
    #    op = -1 => suffix processing
    # "op" will be used to increment/decrement our offset pointer to
    # perform the character-by-character comparison

    for ( op=1; op>=-1; op=op-2 ) {
        offset = op == 1 ? 1 : 0           # determine initial offset based on op (prefix vs suffix)

        # if all strings have the same character @ a given offset then update our pfx/sfx pointers

        while ( equal() && abs(offset) <= length(fname[1]) ) {
            if ( op == 1 ) pfx = offset
            else           sfx = offset

            offset = offset + op           # go to next offset
        }
    }

    if ( pfx == "" ) pfx=0                 # if no common prefix, default to 0
    if ( sfx == "" ) sfx=1                 # if no common suffix, default to 1

    # use substr() and our pfx/sfx offsets to display the difference

    for ( i=1; i<=n; i++ )
        print substr(fname[i], pfx+1, length(fname[i]) - pfx - 1 + sfx )
}
' <<< "$in"

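Notes:

- a bit verbose at this point; it could probably be trimmed down a bit ...
- the code could be modified to work directly with a "normal" list of files (e.g., pipe the output of find into awk); one idea would be to process just the first record (FNR==1) and stuff FILENAME into the array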
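As a rough illustration of the second note, the sketch below reads one name per line from standard input instead of a comma-delimited here-string. It is an assumption-laden variant and not the answerer's code: the function name strip_common is hypothetical, the prefix/suffix search is written differently from the answer's, and names that contain newlines are not handled.

```shell
#!/usr/bin/env bash

# strip_common is a hypothetical name (an assumption for illustration).
# It reads one string per line on stdin, removes the character-level prefix
# and suffix common to all lines, and prints the remainder of each line.
strip_common() {
    awk '
        { name[++n] = $0 }                      # collect every input line
        END {
            if (n == 0) exit

            # the shortest length bounds both the prefix and the suffix
            min = length(name[1])
            for (i = 2; i <= n; i++)
                if (length(name[i]) < min) min = length(name[i])

            # pfx = number of leading characters shared by all names
            for (pfx = 0; pfx < min; pfx++) {
                c = substr(name[1], pfx + 1, 1)
                for (i = 2; i <= n; i++)
                    if (substr(name[i], pfx + 1, 1) != c) break
                if (i <= n) break               # some name differed here
            }

            # sfx = number of trailing characters shared by all names,
            # never overlapping the prefix already claimed
            for (sfx = 0; sfx < min - pfx; sfx++) {
                c = substr(name[1], length(name[1]) - sfx, 1)
                for (i = 2; i <= n; i++)
                    if (substr(name[i], length(name[i]) - sfx, 1) != c) break
                if (i <= n) break
            }

            for (i = 1; i <= n; i++)
                print substr(name[i], pfx + 1, length(name[i]) - pfx - sfx)
        }'
}

printf '%s\n' Exp2_DT_10ng_55C_1_User1.png Exp2_DT_10ng_55C_2_User1.png Exp2_DT_10ng_55C_3_User1.png | strip_common
```

Any command that emits one name per line could feed the function in place of printf.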

Test results:

# in='Exp1_ML_Rep1.txt,Exp1_ML_Rep2.txt,Exp1_ML_Rep3.txt'
1
2
3

# in='Exp2_DT_10ng_55C_1_User1.png,Exp2_DT_10ng_55C_2_User1.png,Exp2_DT_10ng_55C_3_User1.png'
1
2
3

# in='x_foo13,x_bar13,x_baz13,x_qux13'
foo
bar
baz
qux

# in='x_foo13,x_bar13,x_baz13,x_abcde23'
foo1
bar1
baz1
abcde2

# in='abcde.123,abcde.123,abcde.123'    # identical
                  # three
                  # blank
                  # lines

# in='abc,def,123456,xyz$$'             # nothing in common
abc
def
123456
xyz$$

[Comments]:

The "problem" is that the OP wants "Rep1,Rep2,Rep3" for the first input.
