BFS five letter word chain

Posted 2023-02-22


[Title]: BFS five letter word chain [Asked]: 2020-10-26 10:03:53 [Question]:

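I need some help with a BFS word chain assignment. The word chain is based on five-letter words: two words are connected when the last four letters of word x are in word y. For example, climb and blimp are connected because l, i, m, and b from climb are all in blimp.

The recommendation is to use the directed BFS from Sedgewick's Algorithms, 4th edition, or a modification of it. The code can be found here: https://algs4.cs.princeton.edu/40graphs/ and the following code reads the data file into the word list: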

BufferedReader r =
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
ArrayList<String> words = new ArrayList<String>();
while (true) {
    String word = r.readLine();
    if (word == null) { break; }
    assert word.length() == 5;  // input check, if you run with assertions
    words.add(word);
}

and the following code reads the test cases from a file:

BufferedReader r =
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
while (true) {
    String line = r.readLine();
    if (line == null) { break; }
    assert line.length() == 11; // input check, if you run with assertions
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    // ... search path from start to goal here
}

The words in the data file are:

their
moist
other
blimp
limps
about
there
pismo
abcde
bcdez
zcdea
bcdef
fzcde

When using the test case file...

other there
other their
their other
blimp moist
limps limps
moist limps
abcde zcdea

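...the output should be the number of edges between each word pair, or -1 if there is no path between the words.

I'm new to working with graphs, and I'm not sure how to use Sedgewick's BFS and modify it to read the test case file. Any help is appreciated.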

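[Comments]:

Is Java the preferred language, or does only the algorithm matter?

@DanielHao I'm not that familiar with other languages, so I'd prefer Java, but other languages might still help.

Can you post the code you have so far?

Just to understand: you already have the BFS algorithm, but you don't know how to pass the file to it. Is that right?

[Answer 1]: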

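Let n be the number of words in the dataset.

First, we need to build an adjacency list for all the words according to the given condition, i.e. there is an edge from x to y if and only if the last four letters of x appear in y. Building this adjacency list is an O(n^2 * w) operation, where w is the average length of a word in the dataset.

Second, all we have to do is run a traditional BFS on the test data.

Here is the main function: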

    public static void main(String[] args) throws IOException {
        // get words from dataset
        List<String> words = readData();
        // get the word pairs to test
        List<List<String>> testData = getTestData();
        // form an adjacency list
        Map<String, List<String>> adj = getAdjacencyList(words);

        // for each test, do a traditional BFS
        for (List<String> test : testData) {
            System.out.println(bfs(adj, test));
        }
    }
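The helpers readData and getTestData are called above but not shown. Below is a minimal sketch of what they could look like, assuming one word per line in the data file and one "start goal" pair per line in the test file. The class name WordChainInput, the method names readWords and readTestData, and the choice to take a Reader instead of a file name are assumptions made for illustration, not part of the answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordChainInput {
    // Reads one five-letter word per line; blank lines are skipped.
    static List<String> readWords(Reader in) throws IOException {
        List<String> words = new ArrayList<>();
        BufferedReader r = new BufferedReader(in);
        String line;
        while ((line = r.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.length() != 5) {
                throw new IllegalArgumentException("expected a five-letter word: " + line);
            }
            words.add(line);
        }
        return words;
    }

    // Reads "start goal" pairs, one pair per line, into two-element lists.
    static List<List<String>> readTestData(Reader in) throws IOException {
        List<List<String>> pairs = new ArrayList<>();
        BufferedReader r = new BufferedReader(in);
        String line;
        while ((line = r.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected two words: " + line);
            }
            pairs.add(Arrays.asList(parts[0], parts[1]));
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the test case file here.
        List<List<String>> pairs = readTestData(new StringReader("other there\nblimp moist\n"));
        System.out.println(pairs); // prints [[other, there], [blimp, moist]]
    }
}
```

To read from disk instead, a java.io.FileReader for the file can be passed in place of the StringReader.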

Here is the function that builds the adjacency list according to the given condition:

    public static Map<String, List<String>> getAdjacencyList(List<String> words) {
        Map<String, List<String>> adj = new HashMap<>();
        for (int i = 0; i < words.size(); ++i) {
            String word = words.get(i);
            adj.put(word, adj.getOrDefault(word, new ArrayList<>()));
            for (int j = 0; j < words.size(); ++j) {
                if (i == j) continue;
                int count = 0;
                String other = words.get(j);
                // k starts at 1 so that only the last four letters of `word` are checked
                for (int k = 1; k < 5; ++k) {
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                }
                // if the condition is satisfied, there exists an edge from `word` to `other`
                if (count >= 4)
                    adj.get(word).add(other);
            }
        }

        return adj;
    }

Here is the BFS:

    public static int bfs(Map<String, List<String>> adj, List<String> test) {
        Queue<String> q = new LinkedList<>();
        Set<String> visited = new HashSet<>(); // to keep track of the visited words, since the graph is not necessarily a DAG
        String start = test.get(0);
        String end = test.get(1);
        // if `start` and `end` words are equal
        if (start.equals(end))
            return 0;

        q.add(start);
        visited.add(start);
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            int size = q.size();
            for (int i = 0; i < size; ++i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end))
                        return count; // return the number of edges
                    if (!visited.contains(val)) { // only add the words which aren't visited yet
                        visited.add(val); // mark it, so a cycle cannot enqueue it again
                        q.add(val);
                    }
                }
            }
        }
        return -1; // if there isn't any path
    }
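The same search can also be written with a distance map, closer to the distTo bookkeeping in Sedgewick's BreadthFirstPaths: each word's distance is recorded when it is first reached, and that map doubles as the visited set. This is only a sketch and not the answer's code. The class name DistanceBfs and the method shortestPath are hypothetical, and the small graph in main is the abcde cluster from the question's data file, written out by hand.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class DistanceBfs {
    // Returns the number of edges on a shortest path from start to goal, or -1 if none exists.
    static int shortestPath(Map<String, List<String>> adj, String start, String goal) {
        Map<String, Integer> dist = new HashMap<>(); // also serves as the visited set
        Queue<String> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String word = queue.poll();
            if (word.equals(goal)) return dist.get(word);
            for (String next : adj.getOrDefault(word, Collections.emptyList())) {
                if (!dist.containsKey(next)) { // first time reached means shortest distance
                    dist.put(next, dist.get(word) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("abcde", Arrays.asList("bcdez", "bcdef"));
        adj.put("bcdez", Arrays.asList("zcdea", "fzcde"));
        adj.put("bcdef", Arrays.asList("fzcde"));
        adj.put("fzcde", Arrays.asList("bcdez", "zcdea"));
        adj.put("zcdea", Arrays.asList("abcde"));

        System.out.println(shortestPath(adj, "abcde", "zcdea")); // prints 2
        System.out.println(shortestPath(adj, "abcde", "their")); // prints -1, despite the cycle
    }
}
```

Because every word is enqueued at most once, the search terminates even when the goal is unreachable and the graph contains a cycle.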

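[Comments]:

Thanks for the suggestion, although I'm not sure how to read the test data so it can be added to List<List<String>> testData.

This doesn't give the shortest path for abcde zcdea; see my answer below for how to do it.

[Answer 2]: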

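@The Room provided a good answer, but I'd like to suggest a simple modification to the adjacency list construction, because the approach given has O(n^2) complexity, which will perform poorly on large input files.

You can simply insert every sorted 4-character combination of each word into a hash map, together with the word's id (for example, its index).

C++ code example: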

map<string, vector<int>> mappings;

for (int i = 0; i < words.size(); i++) {
    // key for the first four letters
    string word = words[i].substr(0, 4);
    sort(word.begin(), word.end());
    mappings[word].push_back(i);
    // keys for the remaining four-letter subsets: put the fifth letter in each position
    for (int j = 0; j < 4; j++) {
        word = words[i].substr(0, 4);
        word[j] = words[i][4];
        sort(word.begin(), word.end());
        mappings[word].push_back(i);
    }
}

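Now, for each key, you have a vector of word indices, and you know there must be an edge to each of those words from any word that ends with the same 4 characters as the key.

Then you can simply build the graph, taking care not to create self-loops (avoid creating an edge from a node to itself).

Code example: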

// Building the graph with complexity of O(n * log(no. of edges))
const int N = 100000; // Just an example
vector<int> graph[N];
for (int i = 0; i < words.size(); i++) {
    string tmp = words[i].substr(1, 4);
    sort(tmp.begin(), tmp.end());
    for (int j = 0; j < mappings[tmp].size(); j++) {
        if (i == mappings[tmp][j])  // skip self-loops: compare word indices, not the position j
            continue;

        graph[i].push_back(mappings[tmp][j]);
    }
}
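Since the question asks for Java, here is a rough Java sketch of the same sorted-key idea. The names SortedKeyGraph, sortedKey, and buildGraph are hypothetical. It differs from the C++ above in two ways that are assumptions on my part: the five four-letter subsets are produced by dropping one letter at a time, and the indices are kept in a set so that a word with a repeated letter (such as there) is not listed twice under the same key. Note that sorted keys match repeated letters one-for-one, so this can give different edges than an indexOf-based check for such words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SortedKeyGraph {
    // Returns the letters of s in sorted order, e.g. "ther" -> "ehrt".
    static String sortedKey(String s) {
        char[] c = s.toCharArray();
        Arrays.sort(c);
        return new String(c);
    }

    // graph.get(i) lists the indices j such that there is an edge words[i] -> words[j].
    static List<List<Integer>> buildGraph(List<String> words) {
        // key: four sorted letters; value: indices of the words containing those letters
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            for (int drop = 0; drop < w.length(); drop++) {
                String four = w.substring(0, drop) + w.substring(drop + 1);
                index.computeIfAbsent(sortedKey(four), k -> new LinkedHashSet<>()).add(i);
            }
        }

        List<List<Integer>> graph = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            List<Integer> out = new ArrayList<>();
            String key = sortedKey(words.get(i).substring(1)); // last four letters
            for (int j : index.getOrDefault(key, Collections.emptySet())) {
                if (j != i) out.add(j); // no self-loops
            }
            graph.add(out);
        }
        return graph;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("other", "their", "there");
        System.out.println(buildGraph(words)); // prints [[1, 2], [], []]
    }
}
```

In the small example, other links to both their and there, while there links to nothing because its last four letters contain two e's.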

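Finally, you can iterate over your test file, look up the start and goal indices (when reading the file, store each word as a key with its index as the value), and then apply the BFS function to count the edges as described in @The Room's answer.

I just wanted to suggest this answer for anyone who needs to solve a similar problem with large inputs; it reduces the complexity of building the graph from O(N^2) to O(N * log(no. of edges)), where N is the number of words.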

[Answer 3]:

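My approach is slightly different, and there is also a subtle nuance to the problem, which I'll discuss below.

First we create an adjacency list (@Volpe95 has a nice optimization for this). A map of nodes is used, keyed by word.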

Map<String, Node> nodes = new HashMap<>();

List<String> words = new DataHelper().loadWords("src/main/wordsInput.dat");
System.out.println(words);

for (int i = 0; i < words.size(); i++) {
    String l = words.get(i);
    nodes.put(l, new Node(l));
}

for (Map.Entry<String, Node> l : nodes.entrySet()) {
    for (Map.Entry<String, Node> r : nodes.entrySet()) {
        if (l.equals(r)) continue;
        if (isLinkPair(l.getKey(), r.getKey())) {
            Node t = nodes.get(l.getKey());
            System.out.println(t);
            t.addChild(nodes.get(r.getKey()));
        }
    }
}

isLinkPair checks whether the last four letters of a word can be found in a possible child word.

private static boolean isLinkPair(String l, String r) {
    // last 4 chars only
    for (int i = 1; i < l.length(); i++) {
        if (r.indexOf(l.charAt(i)) == -1)
            return false;
    }
    return true;
}
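One thing the indexOf check does not capture is repeated letters: the last four letters of there are h, e, r, e, and indexOf finds the single e in their twice. The question does not say whether repeated letters must be matched one-for-one. If they must, a counting variant along the following lines could be used. This is a sketch under that assumption, with the hypothetical names LinkCheck and isLinked, and it assumes all words are lowercase a to z.

```java
public class LinkCheck {
    // True when each of the last four letters of x can be matched to a distinct letter of y.
    // Assumes both words contain only the lowercase letters 'a' to 'z'.
    static boolean isLinked(String x, String y) {
        int[] available = new int[26];
        for (char c : y.toCharArray()) {
            available[c - 'a']++; // how many copies of each letter y offers
        }
        for (int i = 1; i < x.length(); i++) { // skip the first letter of x
            if (--available[x.charAt(i) - 'a'] < 0) {
                return false; // y ran out of this letter
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLinked("climb", "blimp")); // prints true
        System.out.println(isLinked("other", "there")); // prints true
        System.out.println(isLinked("there", "their")); // prints false: their has only one e
    }
}
```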

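A Node stores its word, its children, and edgeTo. edgeTo is used to compute the shortest path: each node stores its parent, and this child-to-parent link will always be on a shortest path. (Sedgewick stores this data in separate arrays, but it's often easier to group it in a class, as it makes the code easier to understand.)

(Getters, setters, etc. omitted for clarity.)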

public class Node {
    private Set<Node> children;
    private String word;

    private Node edgeTo;

    private int visited;

    public Node(String word) {
        children = new HashSet<>();
        this.word = word;
        edgeTo = null;
    }
}
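For readers who want to run the code, the omitted accessors might look roughly like this. The method names are taken from the calls that appear elsewhere in this answer (addChild, getChildren, getWord, getEdgeTo, setEdgeTo, getVisited, setVisited); the class is called WordNode here only to mark it as a sketch, and reset() is an addition of mine, on the assumption that the same nodes are reused for several test cases and their search state has to be cleared in between.

```java
import java.util.HashSet;
import java.util.Set;

public class WordNode {
    private final Set<WordNode> children = new HashSet<>();
    private final String word;
    private WordNode edgeTo; // parent in the BFS tree, null until the node is reached
    private int visited;     // 0 = not visited yet

    public WordNode(String word) { this.word = word; }

    public void addChild(WordNode child) { children.add(child); }
    public Set<WordNode> getChildren() { return children; }
    public String getWord() { return word; }
    public WordNode getEdgeTo() { return edgeTo; }
    public void setEdgeTo(WordNode parent) { edgeTo = parent; }
    public int getVisited() { return visited; }
    public void setVisited() { visited = 1; }

    // Hypothetical helper: clears the per-search state before the next test case.
    public void reset() {
        edgeTo = null;
        visited = 0;
    }

    @Override
    public String toString() { return word; }

    public static void main(String[] args) {
        WordNode parent = new WordNode("abcde");
        WordNode child = new WordNode("bcdez");
        parent.addChild(child);
        child.setVisited();
        child.setEdgeTo(parent);
        System.out.println(child.getEdgeTo()); // prints abcde
        child.reset();
        System.out.println(child.getVisited()); // prints 0
    }
}
```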

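The BFS, based on Sedgewick's algorithm, searches each node, then its immediate children, then their children, and so on, visiting all nodes at a given distance from the origin before moving further out. Note that a queue is used, implemented by LinkedList in Java.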

private boolean bfs(Map<String, Node> map, Node source, Node target) {
    if (source == null || target == null) return false;
    if (source.equals(target)) return true;
    Queue<Node> queue = new LinkedList<>();
    source.setVisited();
    queue.add(source);
    while (!queue.isEmpty()) {
        Node v = queue.poll();
        for (Node c : v.getChildren()) {
            if (c.getVisited() == 0) {
                System.out.println("visiting " + c);
                c.setVisited();
                c.setEdgeTo(v);
                if (c.equals(target)) {
                    return true;
                }
                queue.add(c);
            }
        }
    }

    return false;
}

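Note that v is the parent and c is its child. setEdgeTo is used to set the child's parent.

Finally we check the result, where source and target are the source and target words respectively: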

BreadthFirstPaths bfs = new BreadthFirstPaths(nodes,source,target);
int shortestPath = bfs.getShortestPath(nodes,source,target);

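So what about the nuance I mentioned above? The shortest path computation is necessary because zcdea has two parents, fzcde and bcdez, and you need the parent that is on the shortest path. To get it, use the child's edgeTo to find its parent and repeat until the path has been walked, as shown below. Because of the way BFS searches outward from the origin, that child-to-parent link will always be on a shortest path.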

// get edgeTo on target (the parent), find this node and get its parent
// continue until the shortest path is walked or no path is found
public int getShortestPath(Map<String, Node> map, String source, String target) {
    Node node = map.get(target);
    int pathLength = 0;
    do {
        if (node == null || pathLength > map.size()) return NOPATH;
        if (node.equals(map.get(source))) return pathLength;
        node = map.get(node.getWord()).getEdgeTo();
        pathLength++;
    } while (true);
}

The trade-off between time and space complexity always needs to be considered and optimized.
