Some String-Related Algorithms...

Posted 2020-08-02


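A while ago a friend asked me whether I had read alg4. I told him that my approach to algorithms is to learn them when I need them and skip the ones I don't. That really is how it goes: because I have stuck to this rule, I haven't touched algorithms in a long time. For one thing, I had studied a few before, and they have been enough for the problems I usually run into. For another, I once worked through roughly 70 problems on leetcode and gradually lost interest... Recently I have been learning C++ and just finished Part I of C++ Primer. I had no good project to practice on, so I decided to do a round of algorithms for fun. This round won't cover anything complicated; it is just a rough pass over some common algorithms...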

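Strings first. There are quite a few kinds of string problems; today I'll mainly cover these parts:

1. Permutations and combinations of a string

2. Moving a string

3. String matching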

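String permutations: given a string of any length, print all of its permutations... For example, "abc" gives "abc", "acb", "bac", "bca", "cab", "cba"...

The idea here is fairly easy to grasp; it is usually recursion. Fix the first character (for this input, "a", "b" or "c"), then permute the remaining characters recursively... The code: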

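    #include <cstdio>    // printf
    #include <utility>   // std::swap

    // Prints every permutation of str; cnt is the index of the position being fixed.
    void allPermutation(char *str, int cnt = 0){
        if(str[cnt] == '\0'){
            printf("%s\n", str);
            return;
        }
        for(int i = cnt; str[i] != '\0'; ++i){
            std::swap(str[cnt], str[i]);   // put str[i] at position cnt
            allPermutation(str, cnt + 1);
            std::swap(str[cnt], str[i]);   // restore the original order
        }
    }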
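To make the recursion easier to check, here is a small sketch of the same swap-based idea that collects the results instead of printing them. The names `collectPermutations` and `permutationsOf` are made up for this illustration, and it uses `std::string` rather than a raw `char*`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: same recursion as allPermutation, but each finished
// permutation is appended to `out` instead of being printed.
void collectPermutations(std::string& s, std::size_t pos,
                         std::vector<std::string>& out) {
    if (pos == s.size()) {
        out.push_back(s);
        return;
    }
    for (std::size_t i = pos; i < s.size(); ++i) {
        std::swap(s[pos], s[i]);               // choose s[i] for position pos
        collectPermutations(s, pos + 1, out);  // permute the rest
        std::swap(s[pos], s[i]);               // undo the choice
    }
}

// Hypothetical wrapper: returns all permutations of s in recursion order.
std::vector<std::string> permutationsOf(std::string s) {
    std::vector<std::string> out;
    collectPermutations(s, 0, out);
    return out;
}
```

Note that if the input has repeated characters, this sketch (like the printing version) produces duplicate permutations.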

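String combinations: given any string, print all of its combinations... For "abc" these are "a", "b", "c", "ab", "ac", "bc", "abc" and ""...

The idea is much the same as above: for each character there are only two cases, keep it or drop it... The code: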

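    #include <cstdio>    // printf
    #include <cstdlib>   // malloc, free
    #include <cstring>   // strlen

    // Prints every combination of str; cnt is the index of the character being decided.
    void allCombination(char *str, int cnt = 0){
        if(str[cnt] == '\0'){
            printf("%s\n", str);
            return;
        }
        // C-style allocation out of habit; new[] would work as well.
        // Dropping one character leaves len - 1 characters plus '\0': len bytes.
        int len = (int)strlen(str);
        char* temp = static_cast<char *>(malloc(sizeof(char) * len));
        for(int i = 0, j = 0; i <= len; ++i){
            if(i != cnt){
                temp[j++] = str[i];   // copy everything except str[cnt], including '\0'
            }
        }
        allCombination(temp, cnt);    // branch 1: str[cnt] dropped
        free(temp);
        allCombination(str, cnt+1);   // branch 2: str[cnt] kept
    }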
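The keep-or-drop choice per character can also be written without recursion by letting each bit of a counter stand for one character. This is only a sketch of that alternative, not the method above; `combinationsOf` is a made-up name, and it assumes the string is shorter than 64 characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: bit i of `mask` says whether s[i] is kept.
// Assumes s.size() < 64 so that the shift below stays in range.
std::vector<std::string> combinationsOf(const std::string& s) {
    std::vector<std::string> out;
    const unsigned long long total = 1ULL << s.size();
    for (unsigned long long mask = 0; mask < total; ++mask) {
        std::string pick;
        for (std::size_t i = 0; i < s.size(); ++i) {
            if (mask & (1ULL << i)) {
                pick += s[i];   // bit set: keep this character
            }
        }
        out.push_back(pick);
    }
    return out;
}
```

A string of length n yields 2^n entries, the empty string included, which is the same count the recursive version prints.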

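Moving a string: you are usually given two pointers and asked to copy the contents of one (here, a string) to the other...

The problem is simple but easy to get wrong; I fell for it the first time too. The catch is the case where the two regions overlap: when the source region ends after the start of the destination, a plain front-to-back copy is wrong, because the tail of the source gets overwritten with new content before it has been read... So the code should look like this: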

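    #include <cstddef>   // size_t

    // Overlap-safe copy of n bytes from src to dest.
    // Named myMemmove so it does not clash with the standard memmove in <cstring>.
    void myMemmove(void *dest, const void *src, size_t n){
        const char *from = static_cast<const char *>(src);
        char *to = static_cast<char*>(dest);
        if(from < to){
            // dest starts after src: copy back to front, so the tail of src
            // is read before it can be overwritten.
            to += n;
            from += n;
            while(n-- != 0){
                *--to = *--from;
            }
        }else{
            // dest starts at or before src: front to back is safe.
            while(n-- != 0){
                *to++ = *from++;
            }
        }
    }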
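A tiny experiment shows what goes wrong with overlapping regions. This sketch shifts the first four bytes of "abcdef" two positions to the right, once with a plain forward loop and once with the standard `std::memmove`. The helper names `naiveCopy`, `shiftNaive` and `shiftSafe` are made up for the illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: plain front-to-back byte copy that ignores overlap.
void naiveCopy(char* dest, const char* src, std::size_t n) {
    while (n-- != 0) {
        *dest++ = *src++;
    }
}

// Copies buf[0..3] to buf[2..5] with the naive loop.
std::string shiftNaive() {
    char buf[] = "abcdef";
    naiveCopy(buf + 2, buf, 4);
    return buf;
}

// Same move, but with the overlap-aware standard function.
std::string shiftSafe() {
    char buf[] = "abcdef";
    std::memmove(buf + 2, buf, 4);
    return buf;
}
```

With the naive loop, the bytes "cd" are overwritten by "ab" before they are read, so the result is "ababab"; the overlap-aware copy gives "ababcd".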

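Finally, string matching: given a parent string and a substring, find the position where the substring first fully matches within the parent string. For example, "abcdef" and "cde" should return 2; if there is no match, return -1...

There are many solutions, from simple to complex. I have only come across three so far, so those are the three covered here...

1. BF (brute force): the simplest idea and the slowest; it is what most people come up with before studying algorithms. Starting from the first character of the parent string, check whether the substring matches there; if not, advance one character in the parent string and try again... The code: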

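    int strMatchBF(const char* str, const char* target){
        int cnt = 0;   // candidate start position in str
        while(str[cnt] != 0) {
            // Compare target against str starting at cnt.
            for (int i = cnt, j = 0; str[i] == target[j]; ++i, ++j) {
                if (!target[j + 1]) {   // reached the end of target: full match
                    return cnt;
                }
            }
            ++cnt;
        }
        return -1;
    }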

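2. KR (Karp-Rabin): first compute a hash for the substring and for the first window of the parent string, then update the parent string's hash incrementally as the window slides. The two strings are compared character by character only when the hashes are equal (the comparison is needed because hashes like this generally do not rule out collisions)... My hash here is very simple. Honestly, for long strings the value can outgrow the integer type, and nothing is done to avoid collisions, so treat the code as a way to get the idea rather than a polished implementation... One point does matter, though: the hash must be updated incrementally, otherwise this is no different from BF. Many blog posts online recompute the hash from scratch for the shifted window without reusing the previous hash, which is effectively BF. The key to understanding this algorithm is that the time it saves comes from reusing information that has already been computed. The code: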

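    #include <cstring>   // strlen, strncmp

    // Hash of the first len characters; each position's weight is a power of 2.
    // Unsigned arithmetic wraps around on overflow instead of being undefined.
    unsigned long hash(const char* str, int len){
        unsigned long sum = 0;
        for(int i = 0; i != len; ++i){
            sum = sum * 2 + (unsigned char)str[i];
        }
        return sum;
    }

    // Slides the window one character to the right in O(1): remove start[0]
    // (its weight is high = 2^(len-1)), double the rest, then add start[len].
    unsigned long update(unsigned long old, const char* start, int len, unsigned long high){
        old -= (unsigned char)*start * high;
        old *= 2;
        old += (unsigned char)*(start+len);
        return old;
    }

    int strMatchKR(const char* str, const char* target){
        int len = (int)strlen(target);
        int guard = (int)strlen(str) - len;     // last possible start position
        if(len == 0 || guard < 0)  return -1;   // empty pattern, or pattern longer than text
        unsigned long high = 1;                 // 2^(len-1), wrapping like the hash itself
        for(int i = 1; i < len; ++i)  high *= 2;
        unsigned long targetHash = hash(target, len);
        unsigned long strHash = hash(str, len);
        for (int i = 0;; ++i) {
            // Equal hashes only suggest a match; confirm it character by character.
            if(strHash == targetHash && !strncmp(str+i, target, len)){
                return i;
            }
            // Quit here, before update, so the window never slides past the end of str.
            if(i == guard)  return -1;
            strHash = update(strHash, str+i, len, high);
        }
    }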
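A common way to keep the hash bounded is to reduce it modulo a prime at every step. The sketch below does that with base 256 and modulus 1000000007; both constants and the name `rabinKarpFind` are choices made for this illustration, not something from the code above. Collisions are still possible, so the character-by-character check stays.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical variant: rolling hash kept in range with a prime modulus.
// Returns the first match position, or -1 if there is none.
int rabinKarpFind(const std::string& text, const std::string& pat) {
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return -1;
    const std::uint64_t kBase = 256, kMod = 1000000007ULL;  // assumed constants

    std::uint64_t high = 1;  // kBase^(m-1) mod kMod: weight of the leading char
    for (std::size_t i = 1; i < m; ++i) high = high * kBase % kMod;

    std::uint64_t hp = 0, ht = 0;  // hash of pattern, hash of current window
    for (std::size_t i = 0; i < m; ++i) {
        hp = (hp * kBase + static_cast<unsigned char>(pat[i])) % kMod;
        ht = (ht * kBase + static_cast<unsigned char>(text[i])) % kMod;
    }

    for (std::size_t i = 0;; ++i) {
        // Hashes agree: verify, because different strings can share a hash.
        if (hp == ht && text.compare(i, m, pat) == 0) return static_cast<int>(i);
        if (i + m == n) return -1;  // no room for another window
        // Incremental update: drop text[i], shift, append text[i + m].
        std::uint64_t lead = static_cast<unsigned char>(text[i]) * high % kMod;
        ht = (ht + kMod - lead) % kMod;
        ht = (ht * kBase + static_cast<unsigned char>(text[i + m])) % kMod;
    }
}
```

Every intermediate value stays far below 2^64, so nothing overflows however long the strings are.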

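3. KMP (which also has a jokey nickname in Chinese built from its initials): I don't think text alone can describe this algorithm clearly, so you may want to read this blog post first: http://www.cnblogs.com/c-cloud/p/3224788.html ... I suggest skipping the explanation of the implementation at first; think it through yourself and sort out the idea... Roughly, it comes down to two questions:

1. The basic framework is still BF's. The improvement is that i does not always advance by 1; it advances by the number of matched characters minus the partial match value at that position. Why?

2. How is the partial match value computed?

To answer the first question, start with what the partial match value is: it is the length of the longest overlap between the end and the beginning of the matched part of the substring. Take "ababcababd" and "ababd". The first attempt fails at the last letter, since "c" and "d" differ, and the number of matched characters is 4. Now imagine sliding "ababd" to the right. You can move it straight to the point where its leading "ab" lines up with the "ab" in the parent string that was just matched by the second "ab" of the substring. There is no need to check that leading "ab" against the parent string again: the matched part ends with the same characters it begins with, and that ending has already been matched against the parent string, so the beginning must match once it is moved to where the ending was. If the matched part had no such overlap, you could simply shift by the number of matched characters. Here, though, the match reached "d" and the overlap is "ab" with length 2, so the shift is 4 - 2 = 2, which keeps the one alignment where the beginning of the substring still has a chance to match...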
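The shift rule can be checked with a few lines of code. This sketch finds the overlap by brute force, which is slow but easy to verify by eye; the names `borderLength` and `shiftAfterMismatch` are made up for the illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest proper prefix of s that is
// also a suffix of s, found by trying every length from long to short.
std::size_t borderLength(const std::string& s) {
    for (std::size_t len = s.empty() ? 0 : s.size() - 1; len > 0; --len) {
        if (s.compare(0, len, s, s.size() - len, len) == 0) {
            return len;
        }
    }
    return 0;
}

// Hypothetical helper: how far to slide the pattern after `matched`
// characters matched and the next one did not.
std::size_t shiftAfterMismatch(const std::string& pat, std::size_t matched) {
    if (matched == 0) return 1;  // nothing matched: move by one
    return matched - borderLength(pat.substr(0, matched));
}
```

For the example in the text, `shiftAfterMismatch("ababd", 4)` looks at the matched part "abab", finds the overlap "ab" of length 2, and gives a shift of 4 - 2 = 2.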

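As for the second question, here is the approach I use: keep the length k of the current overlap, extend it when the next character matches, and on a mismatch fall back to the next shorter overlap rather than straight to zero (restarting from zero would give "aabaaa" a final value of 1, while the correct value is 2, from the overlap "aa"):

    // Builds the partial match table for target.
    // The caller owns the returned array and must delete[] it.
    int* analyse(const char* target){
        int len = (int)strlen(target);
        int* table = new int[len > 0 ? len : 1];
        table[0] = 0;
        // k is the overlap length of target[0..i-1].
        for(int i = 1, k = 0; i < len; ++i){
            // Mismatch: fall back to the next shorter overlap.
            while(k > 0 && target[i] != target[k]){
                k = table[k-1];
            }
            if(target[i] == target[k]){
                ++k;
            }
            table[i] = k;
        }
        return table;
    }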

The main function initially looked like this:

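    int strMatchKMP(const char* str, const char* target){
        if(*target == 0)  return -1;   // empty pattern: nothing to match
        int* table = analyse(target);
        int len = (int)strlen(target);
        for (int i = 0, guard = (int)strlen(str) - len; i <= guard;) {
            int k = 0;   // number of characters matched in this attempt
            for(int j = i; str[j] == target[k]; ++j, ++k){
                if(target[k+1] == 0){
                    delete[] table;   // release the table before returning the match
                    return i;
                }
            }
            if(k != 0) {
                i += k - table[k-1];   // shift by matched length minus overlap
            }else{
                i++;
            }
        }
        delete[] table;
        return -1;
    }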

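At first I hadn't realized that, after one attempt, the characters already matched (the overlapping part) don't need to be compared again. It occurred to me while writing this post, so I improved the code; all it takes is a variable pass that records how many characters can be skipped before the next attempt: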

    int strMatchKMP(const char* str, const char* target){
        if(*target == 0)  return -1;   // empty pattern: nothing to match
        int* table = analyse(target);
        int len = (int)strlen(target);
        int pass = 0;   // characters known to match at the start of the next attempt
        for (int i = 0, k = 0, guard = (int)strlen(str) - len; i <= guard; k = pass) {
            for(int j = i + pass; str[j] == target[k]; ++j, ++k){
                if(target[k+1] == 0){
                    delete[] table;   // release the table before returning the match
                    return i;
                }
            }
            if(k != 0) {
                i += k - table[k-1];
                pass = table[k-1];
            }else{
                ++i;
            }
        }
        delete[] table;
        return -1;
    }

Finally, I renamed the variables and added comments so it reads a bit better:

    int strMatchKMP(const char* str, const char* target){
        if(*target == 0)  return -1;   // empty pattern: nothing to match
        int* table = analyse(target);
        int guard = (int)strlen(str) - (int)strlen(target);
        /*
         *  cur is the start of the current attempt in the parent string
         *  son is the current position in the substring
         *  parent is the current position in the parent string
         *  ignore is the number of chars that can be skipped in the next attempt
         */
        for (int cur = 0, son = 0, ignore = 0; cur <= guard; son = ignore) {
            for(int parent = cur + ignore; str[parent] == target[son]; ++parent, ++son){
                if(target[son+1] == 0){
                    delete[] table;   // release the table before returning the match
                    return cur;
                }
            }

            if(son != 0) {
                ignore = table[son-1];
                cur += son - ignore;
            }else{
                ++cur;
            }
        }
        delete[] table;
        return -1;
    }
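For comparison, the formulation usually found in textbooks walks the parent string with a single index that never moves backwards, instead of tracking a start position plus a skip count. The sketch below is that variant, not the code above; `prefixTable` and `kmpFind` are made-up names, and it uses `std::string` and `std::vector` so nothing has to be freed by hand.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: table[i] is the longest overlap between the start
// and the end of pat[0..i].
std::vector<std::size_t> prefixTable(const std::string& pat) {
    std::vector<std::size_t> table(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = table[k - 1];  // shorter overlap
        if (pat[i] == pat[k]) ++k;
        table[i] = k;
    }
    return table;
}

// Hypothetical helper: first match position of pat in text, or -1.
int kmpFind(const std::string& text, const std::string& pat) {
    if (pat.empty()) return -1;  // assumed convention: empty pattern never matches
    const std::vector<std::size_t> table = prefixTable(pat);
    std::size_t k = 0;  // characters of pat currently matched
    for (std::size_t i = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pat[k]) k = table[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == pat.size()) return static_cast<int>(i + 1 - k);
    }
    return -1;
}
```

Both formulations rely on the same table; the difference is only in the bookkeeping, so comparing their results on a few inputs is a handy sanity check.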

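My own takeaway from this algorithm: try not to read other people's implementations. Read the idea, then implement it yourself; you will understand it faster and more thoroughly...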
