How to store words from the sentence

Posted 2023-02-22

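【Title】: How to store words from the sentence 【Posted】: 2021-11-10 13:05:43 【Question】:

I have the following C++ code, which counts the total number of words and stores the value in the count variable. The question is how to store the individual words of the sentence in a variable, so that later, when a word to match is passed in, I can compare it against the words of the sentence.

Thanks for your help.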

#include <iostream>
#include <cstring>
#include <new>
#include <cctype>

int wordsInString(const char* );

int main()
{
    wordsInString("My name is Donnie");
    return 0;
}

int wordsInString(const char* s)
{
    int count = 0;
    int len = strlen(s);
    int i;
    for(i=0;i<len;i++)
    {
        while(i<len && (s[i] == ' ' || s[i] == '\t' || s[i] == '\n'))
        {
            i++;
        }
        if(i<len)
        {
            count++;
            while(i<len && (s[i] != ' ' && s[i] != '\t' && s[i] != '\n'))
            {
                i++;
            }
        }
    }
    std::cout << "The total count: " << count << std::endl;
    return count;
}

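【Comments】:

std::map is what you are looking for.

The normal "input" operator >> separates on white space. So you can read a line, put it into an input string stream, and then use input stream iterators to put the words into a vector (using the iterator overload of the vector constructor). Or, if you want to map "words" to a counter of the words (i.e. build a histogram of word occurrences), then perhaps use std::unordered_map with the word as key and the counter as data. Use e.g. while (input_stream >> word) to read all words one by one.

@Someprogrammerdude Can you explain that in simple language? I am a beginner.

@Federico I cannot use any header files other than the ones mentioned. This is an assignment from a psychology professor.

【Answer 1】: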

I can offer another way to solve this problem: using the depth-first search algorithm.

#include <iostream>
#include <string>
#include <cctype>
#include <vector>
using namespace std;
const int maximumSize=40; // the input string must not be longer than this
string wordsInString="My name is Donnie";
vector<string> words;
string temporary;
vector<int> visitedWords(maximumSize, 0);
template<class Type>
void showContent(Type input)
{
    for(int i=0; i<input.size(); ++i)
    {
        cout<<input[i]<<", ";
    }
    return;
}
void dfsWords(int current, int previous)
{
    // Each character position is processed only once
    if(visitedWords[current]==1)
    {
        return;
    }
    visitedWords[current]=1;
    if(!isspace(static_cast<unsigned char>(wordsInString[current])))
    {
        temporary.push_back(wordsInString[current]);
    }
    else
    {
        // White space ends a word; skip empty words caused by repeated spaces
        if(!temporary.empty()) words.push_back(temporary);
        temporary.clear();
    }
    if(current==(wordsInString.size()-1))
    {
        // End of the string: store the last word, if there is one
        if(!temporary.empty()) words.push_back(temporary);
        temporary.clear();
    }
    for(int next=current; next<wordsInString.size(); ++next)
    {
        dfsWords(next, current);
    }
    return;
}
void solve()
{
    dfsWords(0, -1);
    cout<<"wordsInString <- ";
    showContent(wordsInString);
    cout<<endl<<"words <- ";
    showContent(words);
    return;
}
int main()
{
    solve();
    return 0;
}

The result is as follows:

wordsInString <- M, y,  , n, a, m, e,  , i, s,  , D, o, n, n, i, e, 
words <- My, name, is, Donnie,

【Comments】:

【Answer 2】:

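The author of the question shows working code that counts the words. That is completely fine. The question is how to store the words now, and not only count them.

In a late comment, the author says that they may only use the headers mentioned. Unfortunately, that invalidates all other answers.

The list of headers is not only a restriction, it is also a hint at the intended solution.

Obviously, the OP is taking a C++ class and is still building up expertise. But, of course, there is a long way to go before mastering one of the most complex programming languages.

But if we start to ignore the requirements, then the problem can always be solved with a one-liner. No complicated recursive stuff and nonsense is needed.

So, first I will also ignore the requirements and show the one-liner approach.

Later, I will show what the professor wants to see.

All of this relies on the fact that in C++ we have containers, and iterators to iterate over the elements in those containers. You will find this everywhere, for example in the begin() and end() functions. And there are many iterators for many different purposes.

For example, if you have a string (which is a container of characters), then you can simply use iterators to iterate over all characters in the string.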
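As a minimal sketch of iterating over a string with iterators (countNonSpace is a hypothetical helper name chosen for illustration, it is not part of the original answer):

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: counts the non-whitespace characters of a string
// by walking it with a pair of iterators, from begin() to end().
int countNonSpace(const std::string& text)
{
    int count = 0;
    for (std::string::const_iterator it = text.begin(); it != text.end(); ++it)
    {
        // The cast avoids undefined behaviour for negative char values
        if (!std::isspace(static_cast<unsigned char>(*it)))
            ++count;
    }
    return count;
}
```

For the test sentence "My name is Donnie", this helper would return 14, the number of letters without the three spaces.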

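But, for many years now, there has been a special iterator that iterates over patterns, for example over the words in a string. It is a dedicated iterator, designed for exactly this purpose.

And because it is designed for this use case, is simple to use and is very flexible, it should be used.

We are talking about std::sregex_token_iterator (see its reference documentation for a description). You simply define a pattern, such as a word, in the simple regex language, and then iterate over all words. Very simple.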
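To make the idea concrete, here is a small sketch that visits every match with an explicit loop. The function name printWords and the variable names are assumptions made for illustration; only std::regex and std::sregex_token_iterator come from the standard library.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Hypothetical helper: prints every \w+ match in text, one per line,
// and returns how many matches were printed.
int printWords(const std::string& text)
{
    const std::regex wordPattern{ R"(\w+)" };
    int printed = 0;

    // A default-constructed iterator acts as the end-of-sequence marker
    const std::sregex_token_iterator end{};
    for (std::sregex_token_iterator it(text.begin(), text.end(), wordPattern); it != end; ++it)
    {
        std::cout << it->str() << '\n';   // *it is a sub_match; str() copies it into a std::string
        ++printed;
    }
    return printed;
}
```

Note that the regex object has to outlive the iterator, which is why it is a named local variable here and not a temporary.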

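OK, now that we know there is a dedicated iterator designed for our purpose, we can use existing iterator functions to get the result we need.

First, counting. There is a function called std::distance. As its name says, it calculates the distance between 2 iterators. For example, the distance between begin() and end() of any container is of course equal to the number of elements in that container.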
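A short sketch of that property, using a std::list to show that std::distance does not need random access. The helper name countElements is hypothetical.

```cpp
#include <iterator>
#include <list>
#include <string>

// Hypothetical helper: the distance from begin() to end() is the element count.
// For a std::list, std::distance has to walk the list, so this is O(n).
int countElements(const std::list<std::string>& items)
{
    return static_cast<int>(std::distance(items.begin(), items.end()));
}
```

In real code you would call items.size() for a container; std::distance becomes useful when all you have is a pair of iterators, as with the regex iterator discussed here.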

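And, like any iterator, std::sregex_token_iterator has constructors. According to its reference documentation, the "begin" iterator is initialized with an iterator range over the string and the regex. The "end" iterator is the default-constructed one, without any arguments. So, you can simply write {} to get the end iterator.

With this know-how, we can write: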

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <iterator>

// --------------------------------------------------------------------------------
// Some alias names to save typing work and get a more readable text
using Words = std::vector<std::string>;
using WordIter = std::sregex_token_iterator;

// A regex to specify a pattern for words consisting of alpha and digit characters (and underscores)
const std::regex re{ R"(\w+)" };

// Test string with some example data
const std::string test{ " My name is Donnie " };
// --------------------------------------------------------------------------------

int main()
{
    // ---------------------------------------------------------------------------------------
    // One-liner to get the count
    unsigned int count = std::distance(WordIter(test.begin(), test.end(), re), {});
    // ---------------------------------------------------------------------------------------

    std::cout << "\nWe found '" << count<< "' words\n\n";
}

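This is very simple, has low complexity and is easy to understand.

I appreciate that regexes have some overhead, but I do not think it matters here.

Next step: storing the words.

If we want to store the words, we would normally use a dynamic container such as std::vector.

And, very nicely, std::vector has a constructor that takes a pair of iterators and copies all elements in that range. See the range constructor (listed as constructor number 5 in the reference documentation).

So, everything that the iterators point to is put into the vector. In addition, after the words have been added to the std::vector, we also have the "count" of words, because the vector knows how many words it contains.

See the next piece of code: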

#include <iostream>
#include <string>
#include <vector>
#include <regex>


// --------------------------------------------------------------------------------
// Some alias names to save typing work and get a more readable text
using Words = std::vector<std::string>;
using WordIter = std::sregex_token_iterator;

// A regex to specify a pattern for words consisting of alpha and digit characters (and underscores)
const std::regex re{ R"(\w+)" };

// Test string with some example data
const std::string test{ " My name is Donnie " };
// --------------------------------------------------------------------------------

int main()
{
    // ---------------------------------------------------------------------------------------
    // One-liner to store all words in a vector and get its count
    // Define a container that will contain all words using the range constructor of the vector
    
    Words words(WordIter(test.begin(), test.end(), re), {});

    // ---------------------------------------------------------------------------------------

    // Output part: Now we have all words and their count. Show result to user
    std::cout << "\nWe found '" << words.size() << "' words\n\nAnd these are:\n\n";
    for (const std::string& w : words) std::cout << w << '\n';
}

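Again, a super simple and efficient solution.

By using the right containers and iterators, we get a compact solution that is very readable and understandable.

But now, back to reality. The OP is just starting to learn the language. Copying and pasting the advanced stuff above would be rejected by the professor.

So, we need to follow their requirements.

They only want to see the following functionality:

iostream --> only std::cout

cstring --> only C-style strings (char *) and C-style string functions

new --> handwritten dynamic memory management. No std::vector

cctype --> we can use std::isspace to search for white space.

What is especially difficult for newcomers is dynamic memory management. But that is what is being taught here.

First piece of information: a C string consists of all its characters plus a trailing 0 that terminates the string. This means that, if we need to allocate dynamic memory for a string, we need "number of characters + 1" bytes.

So, if we have the string "ABC", then it has 3 characters and occupies 4 bytes including the terminator. If we want to copy it into dynamically allocated memory, we need to write: char* text = new char[3+1];. This needs to be remembered.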
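The "+ 1" rule can be sketched as a small helper that stays within the allowed headers. The name duplicateString is hypothetical and not part of the original answer.

```cpp
#include <cstring>

// Hypothetical helper: returns a heap-allocated copy of a C string.
// The caller owns the result and must release it with delete[].
char* duplicateString(const char* source)
{
    const std::size_t length = std::strlen(source);   // characters without the terminator
    char* copy = new char[length + 1];                // + 1 for the terminating '\0'
    std::memcpy(copy, source, length + 1);            // copies the terminator as well
    return copy;
}
```

A call such as char* text = duplicateString("ABC"); allocates 4 bytes, and the matching cleanup is delete[] text;.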

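Then, if we want to store several strings, we need a 2-dimensional array. One dimension holds the characters that make up a word, the other dimension holds the words themselves.

So, char**

There are 2 possibilities to allocate such an array dynamically. In the first, we count all words up front and then use that count to allocate the array: char** wordList = new char*[count];.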
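This first, two-pass possibility could look roughly like the sketch below. All function names (isSpace, countWords, splitWords, freeWords) are assumptions made for illustration; the code uses only cstring and cctype from the allowed headers.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper: true if c is a white space character
static bool isSpace(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Pass 1: count the words in s
int countWords(const char* s)
{
    int count = 0;
    for (std::size_t i = 0; s[i] != '\0'; )
    {
        while (s[i] != '\0' && isSpace(s[i])) ++i;     // skip white space
        if (s[i] == '\0') break;                       // only trailing white space was left
        ++count;
        while (s[i] != '\0' && !isSpace(s[i])) ++i;    // skip the word itself
    }
    return count;
}

// Pass 2: allocate exactly count pointers, then copy each word.
// The caller owns the result and releases it with freeWords.
char** splitWords(const char* s, int& count)
{
    count = countWords(s);
    char** wordList = new char*[count];
    std::size_t i = 0;
    for (int stored = 0; stored < count; ++stored)
    {
        while (isSpace(s[i])) ++i;                     // stops at '\0' too, it is not white space
        const std::size_t begin = i;
        while (s[i] != '\0' && !isSpace(s[i])) ++i;
        const std::size_t length = i - begin;
        wordList[stored] = new char[length + 1];       // + 1 for the terminator
        std::memcpy(wordList[stored], s + begin, length);
        wordList[stored][length] = '\0';
    }
    return wordList;
}

// Releases every word first, then the pointer array
void freeWords(char** wordList, int count)
{
    for (int k = 0; k < count; ++k)
        delete[] wordList[k];
    delete[] wordList;
}
```

The price of this variant is that the sentence is scanned twice; in return, the array never has to be reallocated.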

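Or, we follow a fully dynamic approach, where we define some default capacity for such an array, e.g. 1 element. Then, whenever the array is full, we double the capacity (thus getting 1, 2, 4, 8, 16, ...) and allocate a new array. Like this: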

if (count >= capacityOfArray) {
    // Not enough capacity. Get more. Double capacity
    capacityOfArray = capacityOfArray * 2;

    // Create new array with bigger capacity
    char** temp = new char* [capacityOfArray] {};

    // ... copy the old pointers and release the old array, as described next
}

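Then all data from the old array is copied into the new array. After all elements have been copied, never forget to release the memory of the old array.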
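The copy-and-release step on its own can be sketched as a separate helper. The name growWordList and its signature are assumptions made for illustration.

```cpp
// Hypothetical helper: doubles the capacity of an array of word pointers.
// Only the pointers are copied; the words they point to stay where they are.
void growWordList(char**& wordList, int count, int& capacity)
{
    const int newCapacity = capacity * 2;
    char** bigger = new char*[newCapacity]{};   // every slot starts out as nullptr

    for (int k = 0; k < count; ++k)
        bigger[k] = wordList[k];                // copy the existing pointers

    delete[] wordList;                          // release the old array, not the words
    wordList = bigger;
    capacity = newCapacity;
}
```

Because only pointers move, growing the list costs one pointer copy per stored word, regardless of how long the words are.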

In any case: all dynamically allocated memory must be released!

With that, we get the following code, which fulfils all requirements:

#include <iostream>
#include <cstring>
#include <new>
#include <cctype>

int wordsInString(const char * s, char **& wordList)
{
    int count = 0;
    int beginOfWord = 0;
    int endOfWord = 0;

    // Create an array that can hold a number of strings. Initial capacity will be 1
    int capacityOfArray = 1;
    // Create the array. Allocate memory, initialize with nullptr
    wordList = new char *[capacityOfArray] {};

    // How many characters do we have in the string.
    int len = static_cast<int>(std::strlen(s));

    // Check each character in the string
    for (int i = 0; i < len; i++)
    {
        // Search the beginning of a word, so the first non white space character
        while (i < len && std::isspace(static_cast<unsigned char>(s[i])))
        {
            i++;
        }
        // We remember the start position of the word
        beginOfWord = i;
        if (i < len)
        {
            // Look for the end of the word, either a white space or end of string
            while (i < len && !std::isspace(static_cast<unsigned char>(s[i])))
            {
                i++;
            }
            // Here we have the end position of the word
            endOfWord = i;

            // What is the length of the word that we just found 
            int wordLength = endOfWord - beginOfWord;
            if (wordLength) {

                // Allocate new space for the newWord variable, filled with '\0'
                char* newWord = new char[wordLength + 1] {};

                // Copy the current evaluated word into our variable
                // The pragma is MSVC specific; it silences the strncpy deprecation warning
#pragma warning(suppress : 4996)
                std::strncpy(newWord, &s[beginOfWord], wordLength);

                // Check, if we have enough capacity in our array for this new word
                if (count >= capacityOfArray) {
                    // Not enough capacity. Get more. Double capacity
                    capacityOfArray = capacityOfArray * 2;

                    // Create new array with bigger capacity
                    char** temp = new char* [capacityOfArray] {};

                    // Copy all pointers from old array to new array
                    for (int k = 0; k < count; k++) {
                        temp[k] = wordList[k];
                    }
                    // Release old memory for words 
                    delete[] wordList;

                    // And assign new array to wordlist
                    wordList = temp;
                }
                // Store the latest word in our wordlist
                wordList[count] = newWord;

                // Now, we have one word more
                count++;
            }
        }
    }
    return count;
}


int main()
{
    // This array will hold all words
    char** wordList = nullptr;
    int count = 0;

    // Get all words in string and their count
    count = wordsInString(" My name is Donnie ", wordList);

    // Show result on the screen
    std::cout << "\nThe number of words is: " << count << "\n\n";

    // Now show all extracted words on the screen
    for (int i = 0; i < count; ++i)
       std::cout << wordList[i] << '\n';

    // Release all dynamically allocated memory
    for (int k = 0; k < count; ++k)
        // First delete word by word
        delete[] wordList[k];

    // Delete word list
    delete[] wordList;

    return 0;
}

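Please note: code like this can be used for academic purposes. But in real C++ we would never use raw pointers for owned memory, nor C-style strings and arrays.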
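For comparison, here is a sketch of what the same task might look like once std::string and std::vector are allowed, including the later "match a word" step that the question asks about. The names splitIntoWords and containsWord are hypothetical, and this variant deliberately ignores the header restriction of the assignment.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits a sentence on white space.
// The vector owns its strings, so no manual delete[] is needed.
std::vector<std::string> splitIntoWords(const std::string& sentence)
{
    std::vector<std::string> words;
    std::string current;
    for (char c : sentence)
    {
        if (std::isspace(static_cast<unsigned char>(c)))
        {
            if (!current.empty())          // white space ends the current word
            {
                words.push_back(current);
                current.clear();
            }
        }
        else
        {
            current.push_back(c);
        }
    }
    if (!current.empty())                  // last word, if the sentence has no trailing space
        words.push_back(current);
    return words;
}

// Hypothetical helper: exact, case-sensitive match against the stored words
bool containsWord(const std::vector<std::string>& words, const std::string& word)
{
    return std::find(words.begin(), words.end(), word) != words.end();
}
```

With the words stored like this, a check such as containsWord(words, "Donnie") answers the matching question directly; the comparison is case-sensitive, so "donnie" would not match.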

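【Comments】:

How can I thank you, sir? Thank you for taking the time.

This looks like a good answer, but it is less tactful than I would have expected. Words matter! A good way to resolve technical disagreements is to offer (appropriate) criticism to other authors in the comments under their posts; sometimes this produces the change that the critic was looking for.

Markdown note: block code needs only three backticks, and inline code needs only one backtick. When authors use the minimum number of symbols to get the formatting they need, the material becomes easier to edit.

【Answer 3】: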

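Maybe I misunderstood the question in my initial comments, but if you just want to count the number of space-delimited "words" in the string, then I suggest using a std::istringstream for the string, and then using a loop that extracts the space-delimited "words" one by one into a dummy string variable with the normal input operator >>, incrementing a counter in the loop.

Perhaps something like this: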

#include <sstream>
#include <string>

unsigned wordsInString(std::string const& string)
{
    // A stream we can read words from
    std::istringstream stream(string);

    // The word counter
    unsigned word_counter;

    // A dummy string, the contents of the string will never be used
    std::string dummy;

    // While we can read words from the stream, increase counter
    for (word_counter = 0; stream >> dummy; ++word_counter)
    {
        // Empty
    }

    // Return the counter
    return word_counter;
}

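This works because:

The result of the >> stream extraction operator is the stream itself, which, when converted to a boolean, is false once an extraction fails (here, when we reach the end of the stream), thereby breaking the loop; and

The stream extraction operator separates on white space (space, newline, tab etc.); and

For each word successfully extracted from the stream, we increment the counter, so the loop counts the "words" in the stream.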
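The same extraction mechanism can also store the words, not just count them, along the lines suggested in the comments under the question. The sketch below is an assumption about how that could look; the function name wordsFromStream is hypothetical.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects all white-space-separated words of a sentence.
std::vector<std::string> wordsFromStream(const std::string& sentence)
{
    std::istringstream stream(sentence);

    // first reads words with >>; the default-constructed iterator marks the end
    std::istream_iterator<std::string> first(stream);
    std::istream_iterator<std::string> last;

    // The range constructor of the vector copies every extracted word
    return std::vector<std::string>(first, last);
}
```

The size() of the returned vector then gives the same count as the counting loop, and the individual words remain available for later matching.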

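【Comments】:

Thank you very much. This was helpful.

Er, why not use a while loop?

@justANewbie No need. The for loop works the same, and it is more compact (though perhaps not in the particularly verbose way I wrote it) and just as easy to understand.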
