找出与其他单词有最多相同字母的单词

Posted 2023-02-21

技术标签:

【中文标题】找出与其他单词有最多相同字母的单词【英文标题】：Find the word with most letters in common with other words 【发布时间】：2011-10-02 03:07:33 【问题描述】：

我希望 Perl (5.8.8) 找出与数组中的其他单词有最多共同字母的单词 - 但只有在同一位置的字母。（最好不使用库。）

以这个单词列表为例：

贝克销售员打包机照顾者 RUFFR

她的 BALER 是与其他单词有最多共同字母的单词。它与 BAKER 中的 BAxER、SALER 中的 xALER、CARER 中的 xAxER 和 RUFFR 中的 xxxxR 匹配。

我希望 Perl 在具有相同长度和大小写的任意单词列表中为我找到这个单词。看来我在这里碰壁了，非常感谢您的帮助！

我到现在为止的尝试

目前并没有太多的脚本：

use strict;
use warnings; 
my @wordlist = qw(BAKER SALER MALER BARER RUFFR);
foreach my $word (@wordlist) 
    my @letters = split(//, $word);
    # now trip trough each iteration and work magic...

注释在哪里，我尝试了几种代码，其中包含大量的 for 循环和 ++ 变量。到目前为止，我的任何尝试都没有完成我需要它做的事情。

所以，为了更好地解释：我需要的是逐字逐句地测试列表中的每个字母位置，以找到与列表中其他字母最多的单词，在该字母的位置。

一种可能的方法是首先检查哪些单词在字母位置 0 处最常见，然后测试字母位置 1，依此类推，直到找到总和具有最多字母的单词与列表中的其他词相同。然后我想像一个矩阵一样打印列表，每个字母位置的分数加上每个单词的总分，与 DavidO 的建议不同。

你最终得到的是每个单词的矩阵，每个字母位置的分数，以及矩阵中每个单词的总分。

计划的目的

嘿嘿，我还是这么说吧：这个程序是用来破解《辐射 3》游戏中的终端的。:D 我的想法是，这是学习 Perl 的好方法，同时还能玩得开心。

这是我用于研究的辐射 3 终端黑客教程之一：FALLOUT 3: Hacking FAQ v1.2，我已经制作了一个程序来缩短单词列表，如下所示：

#!/usr/bin/perl
# See if one word has equal letters as the other, and how many of them are equal
use strict;
use warnings; 

my $checkword = "APPRECIATION"; # the word to be checked
my $match = 4; # equal to the match you got from testing your checkword
my @checkletters = split(//, $checkword); #/

my @wordlist = qw(
    PARTNERSHIPS
    REPRIMANDING
    CIVILIZATION
    APPRECIATION
    CONVERSATION
    CIRCUMSTANCE
    PURIFICATION
    SECLUSIONIST
    CONSTRUCTION
    DISAPPEARING
    TRANSMISSION
    APPREHENSIVE
    ENCOUNTERING
);

print "$checkword has $match letters in common with:\n";

foreach my $word (@wordlist) 
    next if $word eq $checkword;
    my @letters = split(//, $word);
    my $length = @letters; # determine length of array (how many letters to check)

    my $eq_letters = 0; # reset to 0 for every new word to be tested
    for (my $i = 0; $i < $length; $i++) 
        if ($letters[$i] eq $checkletters[$i]) 
            $eq_letters++;
        
    
    if ($eq_letters == $match) 
        print "$word\n";
    

# Now to make a script on to find the best word to check in the first place...

该脚本将产生CONSTRUCTION 和TRANSMISSION 作为其结果，就像在游戏常见问题解答中一样。不过，原始问题的诀窍（以及我自己没能找到的东西）是如何找到最好的词来尝试，即APPRECIATION。

好的，我现在已经根据您的帮助提供了自己的解决方案，并认为此线程已关闭。非常感谢所有的贡献者。你帮了很大的忙，在这个过程中我也学到了很多东西。 :D

【问题讨论】：

你能把你的剧本给我们看一下吗，我们还有事要忙吗？可能，解决这个问题的最简单方法是计算单词之间的Hamming Distance。但是我想知道它是否只能比较两个和两个单词... 我查看了黑客常见问题解答，这对我来说似乎是一个 prolog 任务，适用于 perl thoug。 【参考方案1】：

这是一种方法。重读了几次你的规范，我认为这就是你要找的。p>

值得一提的是，可能会有多个单词的最高分相同。从您的列表中只有一个获胜者，但在较长的列表中，可能会有几个同样获胜的单词。该解决方案解决了这个问题。另外，据我了解，只有当它们出现在每个单词的同一列中时，您才会计算字母匹配。如果是这种情况，这里有一个可行的解决方案：

use 5.012;
use strict;
use warnings;
use List::Util 'max';

my @words = qw/
    BAKER
    SALER
    BALER
    CARER
    RUFFR
/;

my @scores;
foreach my $word ( @words ) 
    my $score;
    foreach my $comp_word ( @words ) 
        next if $comp_word eq $word;
        foreach my $pos ( 0 .. ( length $word ) - 1 ) 
            $score++ if substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
        
    
    push @scores, $score;

my $max = max( @scores );
my ( @max_ixs ) = grep  $scores[$_] == $max  0 .. $#scores;

say "Words with most matches:";
say for @words[@max_ixs];

此解决方案计算每个字母列中每个单词的字母与其他单词匹配的次数。比如：

Words:     Scores:       Because:
ABC        1, 2, 1 = 4   A matched once,  B matched twice, C matched once.
ABD        1, 2, 1 = 4   A matched once,  B matched twice, D matched once.
CBD        0, 2, 1 = 3   C never matched, B matched twice, D matched once.
BAC        0, 0, 1 = 1   B never matched, A never matched, C matched once.

这为您提供了 ABC 和 ABD 的获胜者，每人都有四场位置比赛的得分。即，第一列、第一行匹配第一列第二行、第三行和第四行的累积次数，依此类推，以用于后续列。它可能可以进一步优化，并重新措辞更短，但我试图保持逻辑相当容易阅读。享受吧！

更新/编辑 我想了想，意识到虽然我现有的方法完全按照你原来的问题要求做，但它在 O(n^2) 时间内完成了，这是比较慢的。但是如果我们对每一列的字母使用哈希键（每个键一个字母），并计算每个字母在列中出现的次数（作为哈希元素的值），我们可以在 O(1 ) 时间，以及我们在 O(n*c) 时间内遍历列表（其中 c 是列数，n 是单词数）。还有一些设置时间（创建哈希）。但是我们仍然有很大的进步。这是每种技术的新版本，以及每种技术的基准比较。

use strict;
use warnings;
use List::Util qw/ max sum /;
use Benchmark qw/ cmpthese /;

my @words = qw/
    PARTNERSHIPS
    REPRIMANDING
    CIVILIZATION
    APPRECIATION
    CONVERSATION
    CIRCUMSTANCE
    PURIFICATION
    SECLUSIONIST
    CONSTRUCTION
    DISAPPEARING
    TRANSMISSION
    APPREHENSIVE
    ENCOUNTERING
/;


# Just a test run for each solution.
my( $top, $indexes_ref );

($top, $indexes_ref ) = find_top_matches_force( \@words );
print "Testing force method: $top matches.\n";
print "@words[@$indexes_ref]\n";

( $top, $indexes_ref ) = find_top_matches_hash( \@words );
print "Testing hash  method: $top matches.\n";
print "@words[@$indexes_ref]\n";



my $count = 20000;
cmpthese( $count, 
    'Hash'  => sub find_top_matches_hash( \@words ); ,
    'Force' => sub find_top_matches_force( \@words ); ,
 );


sub find_top_matches_hash 
    my $words = shift;
    my @scores;
    my $columns;
    my $max_col = max( map  length $_  @$words ) - 1;
    foreach my $col_idx ( 0 .. $max_col ) 
        $columns->[$col_idx] substr $_, $col_idx, 1 ++ 
            for @$words;
    
    foreach my $word ( @$words ) 
        my $score = sum( 
            map 
                $columns->[$_] substr $word, $_, 1  - 1
             0 .. $max_col
        );
        push @scores, $score;
    
    my $max = max( @scores );
    my ( @max_ixs ) = grep  $scores[$_] == $max  0 .. $#scores;
    return(  $max, \@max_ixs );



sub find_top_matches_force 
    my $words = shift;
    my @scores;
    foreach my $word ( @$words ) 
        my $score;
        foreach my $comp_word ( @$words ) 
            next if $comp_word eq $word;
            foreach my $pos ( 0 .. ( length $word ) - 1 ) 
                $score++ if 
                    substr( $word, $pos, 1 ) eq substr( $comp_word, $pos, 1);
            
        
        push @scores, $score;
    
    my $max = max( @scores );
    my ( @max_ixs ) = grep  $scores[$_] == $max  0 .. $#scores;
    return( $max, \@max_ixs );

输出是：

Testing force method: 39 matches.
APPRECIATION
Testing hash  method: 39 matches.
APPRECIATION
        Rate Force  Hash
Force 2358/s    --  -74%
Hash  9132/s  287%    --

在您看到提供的其他一些选项后，我意识到您的原始规格发生了变化，这在某种程度上是创新的本质，但这个谜题仍然在我脑海中浮现。如您所见，我的哈希方法比原始方法快 287%。在更短的时间内获得更多乐趣！

【讨论】：

你完全正确！这些是我所追求的比赛。不过，我真的很想要一个可以与 Perl 5.8.8 版一起使用的版本。取出写着use 5.012; 的那一行将两个“say”语句替换为“print”语句，并放入一个\n 换行符，如下所示：print "Words with most matches:\n"; print "$_\n" for @words[@max_ixs];。现在你有了一个适用于 5.8.8 的版本！我希望你能找到一个有趣的用途。我还没弄清楚你用它解决了什么问题，但弄清楚逻辑是一个有趣的转移。非常感谢！ :D 这是我要解决的问题：gamefaqs.com/pc/918428-fallout-3/faqs/54644 现在我只是想知道，你能取消 lib 吗？或者它是大多数 Perl 安装的默认设置？ List::Utils 唯一能做的就是找到最大值。您可以通过在代码中添加以下几行（并删除 my $max = max(... 行）来做同样的事情：就在“my @scores;”之前，把“my $max = 0;”放在“push @scores...”之前，把$max = ( $score > $max ) ? $score : $max; .最后，删除use List::Utils.. 行。在我的回答之前还有什么可以满足您的需求吗？玩得开心。不，伙计。这太棒了！ :D 现在我只需要修改不同的方法来了解它是如何工作的。我想我不会在这里发帖，否则我还有很多东西要学。再次感谢你！ :)【参考方案2】：

作为起点，您可以有效地检查它们共有多少个字母：

$count = ($word1 ^ $word2) =~ y/\0//;

但这只有在你遍历所有可能的词对时才有用，在这种情况下是不必要的：

use strict;
use warnings;
my @words = qw/
    BAKER
    SALER
    BALER
    CARER
    RUFFR
/;

# you want a hash to indicate which letters are present how many times in each position:

my %count;
for my $word (@words) 
    my @letters = split //, $word;
    $count$_ $letters[$_] ++ for 0..$#letters;


# then for any given word, you get the count for each of its letters minus one (because the word itself is included in the count), and see if it is a maximum (so far) for any position or for the total:

my %max_common_letters_count;
my %max_common_letters_words;
for my $word (@words) 
    my @letters = split //, $word;
    my $total;
    for my $position (0..$#letters, 'total') 
        my $count;
        if ( $position eq 'total' ) 
            $count = $total;
        
        else 
            $count = $count$position $letters[$position]  - 1;
            $total += $count;
        
        if ( ! $max_common_letters_count$position || $count >= $max_common_letters_count$position ) 
            if ( $max_common_letters_count$position && $count == $max_common_letters_count$position ) 
                push @ $max_common_letters_words$position , $word;
            
            else 
                $max_common_letters_count$position = $count;
                $max_common_letters_words$position = [ $word ];
            
        
    


# then show the maximum words for each position and in total: 

for my $position ( sort  $a <=> $b  grep $_ ne 'total', keys %max_common_letters_count ) 
    printf( "Position %s had a maximum of common letters of %s in words: %s\n",
        $position,
        $max_common_letters_count$position,
        join(', ', @ $max_common_letters_words$position )
    );

printf( "The maximum total common letters was %s in words(s): %s\n",
    $max_common_letters_count'total',
    join(', ', @ $max_common_letters_words'total' )
);

【讨论】：

我真的很喜欢筛选逻辑，并看到 Lingua::EN::Inflect 的一个工作示例。但我确实有一个问题。既然您知道每个单词有多少个常用字母，那么您如何确定哪个单词与每个列位置中的行数最多匹配？您不需要保留每列匹配多少行的累积分数吗？（也许我让规范太难了）。 Lingua::EN::Inflect 使复数变得容易；一个更复杂的例子：print inflect("NUM($_) PL_N(nation) PL_V(endorses) but PL_V(isn't) endorsed") for 0..2 我真的很喜欢这个，尤其是第一行，因为它是比较两个单词的一种优雅方式。现在逐词比较列表中的单词，以找到在同一字母位置具有最多共同字母的单词，即在字母 pos 0 处与其他单词最相似的单词，然后是字母 pos 1，依此类推在。该库在我的 Mac 上不起作用，虽然... @Kebman：一次看一遍；用例如转储数据结构print Data::Dumper::Dumper(\%max_common_letters_words) 看看它正在收集什么数据；在这里询问是否有任何特定的问题困扰您 @ysth：抱歉，在声称 xor 技巧不适用于非 ASCII 数据之前，我确实对其进行了测试。不幸的是，我的测试使用'\x101' 而不是"\x101" - 哎呀。【参考方案3】：

这是一个完整的脚本。它使用了与 ysth 提到的相同的想法（尽管我独立拥有它）。使用按位异或组合字符串，然后计算结果中的 NUL 数量。只要你的字符串是 ASCII，它就会告诉你有多少匹配的字母。（这种比较是区分大小写的，我不确定如果字符串是 UTF-8 会发生什么。可能没什么好。）

use strict;
use warnings;
use 5.010;

use List::Util qw(max);

sub findMatches

  my ($words) = @_;

  # Compare each word to every other word:
  my @matches = (0) x @$words;

  for my $i (0 .. $#$words-1) 
    for my $j ($i+1 .. $#$words) 
      my $m = ($words->[$i] ^ $words->[$j]) =~ tr/\0//;

      $matches[$i] += $m;
      $matches[$j] += $m;
    
  

  # Find how many matches in the best word:
  my $max = max(@matches);

  # Find the words with that many matches:
  my @wanted = grep  $matches[$_] == $max  0 .. $#matches;

  wantarray ? @$words[@wanted] : $words->[$wanted[0]];
 # end findMatches

my @words = qw(
    BAKER
    SALER
    BALER
    CARER
    RUFFR
);

say for findMatches(\@words);

【讨论】：

【参考方案4】：

好久没接触perl了，所以是伪代码。这不是最快的算法，但它适用于少量单词。

totals = new map #e.g. an object to map :key => :value

for each word a
  for each word b
    next if a equals b

    totals[a] = 0
    for i from 1 to a.length
      if a[i] == b[i]
        totals[a] += 1
      end
    end
  end
end

return totals.sort_by_key.last

很抱歉没有 perl，但如果你把它编码成 perl，它应该会像一个魅力一样工作。

关于运行时间的快速说明：这将及时运行 number_of_words^2 * length_of_words，因此在 100 个单词的列表中，每个长度为 10 个字符，这将运行 100,000 个周期，这对于大多数应用程序来说已经足够了。

【讨论】：

酷！我认为这是我自己尝试回答问题时使用的方法。但是，在阅读了该主题后，我现在想知道您如何将模式变成三叉树样式搜索？我确信有很多方法可以解决这个问题——我很乐意探索它。作为一个简单的问题：您正在查看的输入量是多少？您希望达到什么水平的效率？我意识到这个项目的范围不值得，但是为了好玩，让我们说很多！【参考方案5】：

这是一个依赖转置单词来计算相同字符的版本。我使用了您原始比较中的词语，而不是代码。

这应该适用于任何长度的单词和任何长度的列表。输出是：

Word    score
----    -----
BALER   12
SALER   11
BAKER   11
CARER   10
RUFFR   4

代码：

use warnings;
use strict;

my @w = qw(BAKER SALER BALER CARER RUFFR);
my @tword = t_word(@w);

my @score;
push @score, str_count($_) for @tword;
@score = t_score(@score);

my %total;

for (0 .. $#w) 
    $total$w[$_] = $score[$_];


print "Word\tscore\n";
print "----\t-----\n";
print "$_\t$total$_\n" for (sort  $total$b <=> $total$a  keys %total);

# transpose the words
sub t_word 
    my @w = @_;
    my @tword;
    for my $word (@w) 
        my $i = 0;
        while ($word =~ s/(.)//) 
            $tword[$i++] .= $1;
        
    
    return @tword;


# turn each character into a count
sub str_count 
    my $str = uc(shift);
    while ( $str =~ /([A-Z])/ ) 
        my $chr = $1;
        my $num = () = $str =~ /$chr/g;
        $num--;
        $str =~ s/$chr/$num /g;
    
    return $str;


# sum up the character counts
# while reversing the transpose
sub t_score 
    my @count = @_;
    my @score;
    for my $num (@count) 
        my $i = 0;
        while( $num =~ s/(\d+) //) 
            $score[$i++] += $1;
        
    
    return @score;

【讨论】：

【参考方案6】：

这是我试图回答的问题。如果需要，这也将允许您查看每个单独的匹配项。（即 BALER 匹配 BAKER 中的 4 个字符）。编辑：如果单词之间存在平局，它现在会捕获所有匹配项（我将“CAKER”添加到列表中进行测试）。

#! usr/bin/perl

use strict;
use warnings;

my @wordlist = qw( BAKER SALER BALER CARER RUFFR CAKER);

my %wordcomparison;

#foreach word, break it into letters, then compare it against all other words
#break all other words into letters and loop through the letters (both words have same amount), adding to the count of matched characters each time there's a match
foreach my $word (@wordlist) 
    my @letters = split(//, $word);
    foreach my $otherword (@wordlist) 
        my $count;
        next if $otherword eq $word;
        my @otherwordletters = split (//, $otherword);
        foreach my $i (0..$#letters) 
            $count++ if ( $letters[$i] eq $otherwordletters[$i] );
        
        $wordcomparison"$word""$otherword" = $count;
    


# sort (unnecessary) and loop through the keys of the hash (words in your list)
# foreach key, loop through the other words it compares with
#Add a new key: total, and sum up all the matched characters.
foreach my $word (sort keys %wordcomparison) 
    foreach ( sort keys % $wordcomparison$word ) 
        $wordcomparison$wordtotal += $wordcomparison$word$_;
    


#Want $word with highest total

my @max_match = (sort  $wordcomparison$btotal <=> $wordcomparison$atotal  keys %wordcomparison );

#This is to get all if there is a tie:
my $maximum = $max_match[0];
foreach (@max_match) 
print "$_\n" if ($wordcomparison$_total >= $wordcomparison$maximumtotal )

输出很简单：CAKER BALER 和 BAKER。

哈希%wordcomparison 看起来像：

'SALER'
        
          'RUFFR' => 1,
          'BALER' => 4,
          'BAKER' => 3,
          'total' => 11,
          'CARER' => 3
        ;

【讨论】：

【参考方案7】：

你可以这样做，如果一个字母在它的位置匹配，你可以使用一个肮脏的正则表达式来执行代码，但不是其他的，谢天谢地，你可以很容易地构建正则表达式：

一个示例正则表达式是：

(?:(C(? $c++ ))|.)(?:(A(? $c++ ))|.)(?:(R(? $c++ ))|.)(?:(E(? $c++ ))|.)(?:(R(? $c++ ))|.)

这可能会也可能不会很快。

use 5.12.0;
use warnings;
use re 'eval';

my @words = qw(BAKER SALER BALER CARER RUFFR);

my ($best, $count) = ('', 0);
foreach my $word (@words) 
    our $c = 0;
    foreach my $candidate (@words) 
    next if $word eq $candidate;

    my $regex_str = join('', map "(?:($_(? \$c++ ))|.)" split '', $word);
    my $regex = qr/^$regex_str$/;

    $candidate =~ $regex or die "did not match!";
    
    say "$word $c";
    if ($c > $count) 
    $best = $word;
    $count = $c;
    


say "Matching: first best: $best";

使用异或技巧会很快，但会假设您可能遇到的字符范围很多。 utf-8 可以通过多种方式打破这种情况。

【讨论】：

这里不需要 O(N**2) 解决方案（xor 也应该是）；我认为，如果您单词中的一个字母与多个候选人匹配，那将是双重计算。此外，^ 应该可以在 utf8 上正常工作。重读问题，如果单词中的一个字母匹配多个其他单词，它应该计算多次，对不起。是的，无论如何，这是一个不错的小谜题。我喜欢更好的哈希方法，真的应该先尝试一下。【参考方案8】：

非常感谢所有贡献者！您确实向我展示了我还有很多东西要学，但是您也极大地帮助了我找出自己的答案。我只是把它放在这里作为参考和可能的反馈，因为可能有更好的方法来做这件事。对我来说，这是我自己能找到的最简单、最直接的方法。享受！ :)

#!/usr/bin/perl
use strict;
use warnings; 

# a list of words for testing
my @list = qw( 
BAKER
SALER
BALER
CARER
RUFFR
);

# populate two dimensional array with the list, 
# so we can compare each letter with the other letters on the same row more easily 
my $list_length = @list;
my @words;

for (my $i = 0; $i < $list_length; $i++) 
    my @letters = split(//, $list[$i]);
    my $letters_length = @letters;
    for (my $j = 0; $j < $letters_length; $j++) 
        $words[$i][$j] = $letters[$j];
    

# this gives a two-dimensionla array:
#
# @words = (    ["B", "A", "K", "E", "R"],
#               ["S", "A", "L", "E", "R"],
#               ["B", "A", "L", "E", "R"],
#               ["C", "A", "R", "E", "R"],
#               ["R", "U", "F", "F", "R"],
# );

# now, on to find the word with most letters in common with the other on the same row

# add up the score for each letter in each word
my $word_length = @words;
my @letter_score;
for my $i (0 .. $#words) 
    for my $j (0 .. $#$words[$i]) 
        for (my $k = 0; $k < $word_length; $k++) 
            if ($words[$i][$j] eq $words[$k][$j]) 
                $letter_score[$i][$j] += 1; 
            
        
        # we only want to add in matches outside the one we're testing, therefore
        $letter_score[$i][$j] -= 1;
    


# sum each score up
my @scores;
for my $i (0 .. $#letter_score ) 
    for my $j (0 .. $#$letter_score[$i]) 
        $scores[$i] += $letter_score[$i][$j];
    


# find the highest score
my $max = $scores[0];
foreach my $i (@scores[1 .. $#scores]) 
    if ($i > $max) 
        $max = $i;
    


# and print it all out :D
for my $i (0 .. $#letter_score ) 
    print "$list[$i]: $scores[$i]";
    if ($scores[$i] == $max) 
        print " <- best";
       
    print "\n";

运行时，脚本会产生以下结果：

BAKER: 11
SALER: 11
BALER: 12 <- best
CARER: 10
RUFFR: 4

【讨论】：

以上是关于找出与其他单词有最多相同字母的单词的主要内容，如果未能解决你的问题，请参考以下文章