C++ scanner.h: scan content between double quotes as a token, without skipping spaces inside the quotes

【Posted】: 2011-12-05 16:11:40 【Question】:

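I am trying to count the content between double quotes as one token for an assignment.

For example: "hello world" = 1 token; "hello" "world" = 3 tokens (because the space counts as 1 token).

I created main.cpp and added the ScanQuotesAsStrings code to the 3 given modules:

scanner.cpp scanner.h scanpriv.h

Right now, "hello world" scans as 2 tokens rather than skipping the space. If I add space skipping, then regular input without quotes (e.g. |hello world|) skips its spaces as well.

I think my problem is in scanner.cpp, where the last few functions are: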

/*
* Private method: scanToEndOfIdentifier
* Usage: finish = scanToEndOfIdentifier();
* ----------------------------------------
* This function advances the position of the scanner until it
* reaches the end of a sequence of letters or digits that make
* up an identifier. The return value is the index of the last
* character in the identifier; the value of the stored index
* cp is the first character after that.
*/
int Scanner::scanToEndOfIdentifier() {
    while (cp < len && isalnum(buffer[cp])) {
        if ((stringOption == ScanQuotesAsStrings) && (buffer[cp] == '"'))
            break;
        cp++;
    }
    return cp - 1;
}


/* Private functions */
/*
* Private method: scanQuotedString
* Usage: scanQuotedString();
* -------------------
* This function advances the position of the scanner until the
* current character is a double quotation mark
*/
void Scanner::scanQuotedString() {
    while ((cp < len && (buffer[cp] == '"')) || (cp < len && (buffer[cp] == '"'))) {
        cp++;
    }
}

Here is main.cc

#include "genlib.h"
#include "simpio.h"
#include "scanner.h"
#include <iostream>

/* Private function prototypes */

int CountTokens(string str);

int main() {
    cout << "Please enter a sentence: ";
    string str = GetLine();

    int num = CountTokens(str);
    cout << "You entered " << num << " tokens." << endl;
    return 0;
}

int CountTokens(string str) {
    int count = 0;
    Scanner scanner;        // create new scanner object
    scanner.setInput(str);  // initialize the input to be scanned

    //scanner.setSpaceOption(Scanner::PreserveSpaces);
    scanner.setStringOption(Scanner::ScanQuotesAsStrings);

    while (scanner.hasMoreTokens()) {  // read tokens from the scanner
        scanner.nextToken();
        count++;
    }
    return count;
}

Here is scanner.cpp

/*
* File: scanner.cpp
* -----------------
* Implementation for the simplified Scanner class.
*/
#include "genlib.h"
#include "scanner.h"
#include <cctype>
#include <iostream>
/*
* The details of the representation are inaccessible to the client,
* but consist of the following fields:
*
* buffer -- String passed to setInput
* len -- Length of buffer, saved for efficiency
* cp -- Current character position in the buffer
* spaceOption -- Setting of the space option extension
*/
Scanner::Scanner() {
    buffer = "";
    spaceOption = PreserveSpaces;
}

Scanner::~Scanner() {
    /* Empty */
}

void Scanner::setInput(string str) {
    buffer = str;
    len = buffer.length();
    cp = 0;
}

/*
* Implementation notes: nextToken
* -------------------------------
* The code for nextToken follows from the definition of a token.
*/
string Scanner::nextToken() {
    if (cp == -1) {
        Error("setInput has not been called");
    }
    if (stringOption == ScanQuotesAsStrings) scanQuotedString();
    if (spaceOption == IgnoreSpaces) skipSpaces();
    int start = cp;
    if (start >= len) return "";
    if (isalnum(buffer[cp])) {
        int finish = scanToEndOfIdentifier();
        return buffer.substr(start, finish - start + 1);
    }
    cp++;
    return buffer.substr(start, 1);
}

bool Scanner::hasMoreTokens() {
    if (cp == -1) {
        Error("setInput has not been called");
    }
    if (stringOption == ScanQuotesAsStrings) scanQuotedString();
    if (spaceOption == IgnoreSpaces) skipSpaces();
    return (cp < len);
}

void Scanner::setSpaceOption(spaceOptionT option) {
    spaceOption = option;
}

Scanner::spaceOptionT Scanner::getSpaceOption() {
    return spaceOption;
}

void Scanner::setStringOption(stringOptionT option) {
    stringOption = option;
}

Scanner::stringOptionT Scanner::getStringOption() {
    return stringOption;
}


/* Private functions */
/*
* Private method: skipSpaces
* Usage: skipSpaces();
* -------------------
* This function advances the position of the scanner until the
* current character is not a whitespace character.
*/
void Scanner::skipSpaces() {
    while (cp < len && isspace(buffer[cp])) {
        cp++;
    }
}

/*
* Private method: scanToEndOfIdentifier
* Usage: finish = scanToEndOfIdentifier();
* ----------------------------------------
* This function advances the position of the scanner until it
* reaches the end of a sequence of letters or digits that make
* up an identifier. The return value is the index of the last
* character in the identifier; the value of the stored index
* cp is the first character after that.
*/
int Scanner::scanToEndOfIdentifier() {
    while (cp < len && isalnum(buffer[cp])) {
        if ((stringOption == ScanQuotesAsStrings) && (buffer[cp] == '"'))
            break;
        cp++;
    }
    return cp - 1;
}

/* Private functions */
/*
* Private method: scanQuotedString
* Usage: scanQuotedString();
* -------------------
* This function advances the position of the scanner until the
* current character is a double quotation mark
*/
void Scanner::scanQuotedString() {
    while ((cp < len && (buffer[cp] == '"')) || (cp < len && (buffer[cp] == '"'))) {
        cp++;
    }
}

scanner.h

/*
* File: scanner.h
* ---------------
* This file is the interface for a class that facilitates dividing
* a string into logical units called "tokens", which are either
*
* 1. Strings of consecutive letters and digits representing words
* 2. One-character strings representing punctuation or separators
*
* To use this class, you must first create an instance of a
* Scanner object by declaring
*
* Scanner scanner;
*
* You initialize the scanner's input stream by calling
*
* scanner.setInput(str);
*
* where str is the string from which tokens should be read.
* Once you have done so, you can then retrieve the next token
* by making the following call:
*
* token = scanner.nextToken();
*
* To determine whether any tokens remain to be read, you can call
* the predicate method scanner.hasMoreTokens(). The nextToken
* method returns the empty string after the last token is read.
*
* The following code fragment serves as an idiom for processing
* each token in the string inputString:
*
* Scanner scanner;
* scanner.setInput(inputString);
* while (scanner.hasMoreTokens()) {
*     string token = scanner.nextToken();
*     . . . process the token . . .
* }
*
* This version of the Scanner class includes an option for skipping
* whitespace characters, which is described in the comments for the
* setSpaceOption method.
*/
#ifndef _scanner_h
#define _scanner_h
#include "genlib.h"
/*
* Class: Scanner
* --------------
* This class is used to represent a single instance of a scanner.
*/
class Scanner {
public:
/*
* Constructor: Scanner
* Usage: Scanner scanner;
* -----------------------
* The constructor initializes a new scanner object. The scanner
* starts empty, with no input to scan.
*/
    Scanner();
/*
* Destructor: ~Scanner
* Usage: usually implicit
* -----------------------
* The destructor deallocates any memory associated with this scanner.
*/
    ~Scanner();
/*
* Method: setInput
* Usage: scanner.setInput(str);
* -----------------------------
* This method configures this scanner to start extracting
* tokens from the input string str. Any previous input string is
* discarded.
*/
    void setInput(string str);
/*
* Method: nextToken
* Usage: token = scanner.nextToken();
* -----------------------------------
* This method returns the next token from this scanner. If
* nextToken is called when no tokens are available, it returns the
* empty string.
*/
    string nextToken();
/*
* Method: hasMoreTokens
* Usage: if (scanner.hasMoreTokens()) . . .
* ------------------------------------------
* This method returns true as long as there are additional
* tokens for this scanner to read.
*/
    bool hasMoreTokens();
/*
* Methods: setSpaceOption, getSpaceOption
* Usage: scanner.setSpaceOption(option);
* option = scanner.getSpaceOption();
* ------------------------------------------
* This method controls whether this scanner
* ignores whitespace characters or treats them as valid tokens.
* By default, the nextToken function treats whitespace characters,
* such as spaces and tabs, just like any other punctuation mark.
* If, however, you call
*
* scanner.setSpaceOption(Scanner::IgnoreSpaces);
*
* the scanner will skip over any white space before reading a
* token. You can restore the original behavior by calling
*
* scanner.setSpaceOption(Scanner::PreserveSpaces);
*
* The getSpaceOption function returns the current setting
* of this option.
*/
    enum spaceOptionT { PreserveSpaces, IgnoreSpaces };
    void setSpaceOption(spaceOptionT option);
    spaceOptionT getSpaceOption();

/*
 * Methods: setStringOption, getStringOption
 * Usage: scanner.setStringOption(option);
 *        option = scanner.getStringOption();
 * --------------------------------------------------
 * This method controls how the scanner reads double quotation marks 
 * as input.  The default is set to treat quotes just like any other 
 * punctuation character: 
 *    scanner.setStringOption(Scanner::ScanQuotesAsPunctuation);
 * 
 * Otherwise, the option:
 *    scanner.setStringOption(Scanner::ScanQuotesAsStrings);
 *
 * the token starting with a quotation mark will be scanned until
 * another quotation mark is found (closing quotation). Therefore
 * the entire string within the quotation, including both quotation
 * marks counts as 1 token.
 */
    enum stringOptionT { ScanQuotesAsPunctuation, ScanQuotesAsStrings };

    void setStringOption(stringOptionT option);
    stringOptionT getStringOption();


private:

#include "scanpriv.h"
};
#endif

** And finally, scanpriv.h **

/*
* File: scanpriv.h
* ----------------
* This file contains the private data for the simplified version
* of the Scanner class.
*/

/* Instance variables */
string buffer; /* The string containing the tokens */
int len; /* The buffer length, for efficiency */
int cp; /* The current index in the buffer */
spaceOptionT spaceOption; /* Setting of the space option */
stringOptionT stringOption;

/* Private method prototypes */
void skipSpaces();
int scanToEndOfIdentifier();
void scanQuotedString();

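【Comments】:

【Answer 1】:

Too long to read.

Two ways to parse quoted text:

0) State

A simple switch that tells you whether you are currently inside quotes and activates some special quote handling. This is basically equivalent to #1), just inlined.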
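As a rough illustration of the state approach, here is a minimal sketch of a token counter driven by an `in_quotes` flag. The function `count_tokens` and its token rules (a run of letters or digits is one token, any other character is one token, a quoted string is one token) are assumptions taken from the question's description, not code from the original Scanner class.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of the question's Scanner class.
// Counts tokens in `text`: a run of letters/digits is one token, any other
// character (including a space) is one token, and everything from an opening
// '"' through its closing '"' is one token. An unterminated quote still
// counts as one token running to the end of the input.
int count_tokens(const std::string& text) {
    int count = 0;
    bool in_quotes = false;  // the "switch" described above
    bool in_word = false;    // true while inside a run of letters/digits

    for (char ch : text) {
        if (in_quotes) {
            // Everything, spaces included, belongs to the current token.
            if (ch == '"') in_quotes = false;  // closing quote ends it
            continue;
        }
        if (ch == '"') {
            in_quotes = true;  // opening quote starts a new token
            in_word = false;
            ++count;
        } else if (std::isalnum(static_cast<unsigned char>(ch))) {
            if (!in_word) ++count;  // first character of a new word
            in_word = true;
        } else {
            in_word = false;
            ++count;  // punctuation or whitespace: one token per character
        }
    }
    return count;
}
```

Under these assumed rules, `count_tokens("\"hello world\"")` yields 1 and `count_tokens("\"hello\" \"world\"")` yields 3, matching the two examples in the question.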

1) Sub-rule in a recursive descent scanner

Put state aside and write a separate rule that scans quoted text. The code is actually quite simple (C++-inspired p-code):

// assume we are one behind the opening quotation mark
for (c : text) {
    if (is_escape (*c)) {  // to support stuff like "foo's name is \"bar\""
        p = peek(c);
        if (!is_valid_escape_character (peek (c))) error;
        else {
            make the peeked character (*p) part of the result;
            ++c;
        }
    }
    else if (is_quotation_mark (*c))
    {
        return the result; // we approached the end of the string
    }
    else if (!is_valid_character (*c))
    {
        error; // maybe you want to forbid literal control characters
    }
    else
    {
        make *c part of the result
    }
}
error; // reached end of input before closing quotation mark
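The p-code above could be made runnable roughly as follows. This is a sketch under stated assumptions: the name `read_quoted`, the choice to accept only `\"` and `\\` as escapes, and the rule that characters below 0x20 are invalid are all illustrative and not specified by the answer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sub-rule for a quoted string. `pos` must index the first
// character after the opening quotation mark. On success, stores the
// unescaped contents in `result`, leaves `pos` just past the closing
// quotation mark, and returns true. Returns false on an invalid escape,
// a literal control character, or a missing closing quotation mark.
bool read_quoted(const std::string& text, std::size_t& pos, std::string& result) {
    result.clear();
    while (pos < text.size()) {
        char ch = text[pos++];
        if (ch == '\\') {
            if (pos >= text.size()) return false;  // dangling backslash
            char escaped = text[pos++];
            // Assumption: only \" and \\ are valid escape sequences.
            if (escaped != '"' && escaped != '\\') return false;
            result += escaped;
        } else if (ch == '"') {
            return true;  // reached the closing quotation mark
        } else if (static_cast<unsigned char>(ch) < 0x20) {
            return false;  // forbid literal control characters
        } else {
            result += ch;
        }
    }
    return false;  // end of input before the closing quotation mark
}
```

A caller would detect the opening `"`, set `pos` to the index after it, and call `read_quoted`; on success `pos` is where ordinary scanning resumes.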

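If you don't want to support escape characters, the code gets even simpler:

// assume we are one behind the opening quotation mark
for (c : text) {
    if (is_quotation_mark (*c))
        return the result;
    else if (!is_valid_character (*c))
        error;
    else
        make *c part of the result
}
error; // reached end of input before closing quotation mark

You should not skip the check for invalid characters, because doing so invites users to exploit your code, and possibly undefined behaviour in your program.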
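One concrete way undefined behaviour can creep into a character check, offered as an illustration rather than as what the answer necessarily had in mind: the `<cctype>` functions such as `isalnum` and `iscntrl` have undefined behaviour when given a negative `char` value, which can happen for non-ASCII input on platforms where `char` is signed. A minimal sketch of a validity check that avoids this follows; the policy of rejecting only control characters is an assumption.

```cpp
#include <cctype>

// Hypothetical validity check for characters inside a quoted string:
// reject control characters, accept everything else.
bool is_valid_character(char ch) {
    // Passing a negative char to the <cctype> functions is undefined
    // behaviour, so convert to unsigned char first.
    return !std::iscntrl(static_cast<unsigned char>(ch));
}
```

The same cast would apply to the `isalnum(buffer[cp])` and `isspace(buffer[cp])` calls in the question's scanner.cpp.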

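【Comments】:

【Answer 2】:

From a quick skim of the code: in ScanQuotesAsStrings mode you don't want to look for quoted strings and nothing else; rather, the difference should be that when you see a token starting with '"', you switch to a separate sub-scanner.

In pseudocode (using the C++ "end iterator is one-past-the-end" idiom):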

current_token.begin = cursor;
current_token.end = current_token.begin + 1;
if(scan_quotes_as_strings && *current_token.begin == '"') {
    while(*current_token.end && *current_token.end != '"')
        ++current_token.end;
    return;
}
while(*current_token.end && *current_token.end != ' ')
    ++current_token.end;

You can combine these two loops into one by introducing a state variable, instead of representing the scanner state with different code paths.
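A runnable take on the two-path pseudocode might look like the sketch below. It deviates from the pseudocode in two labeled ways: the closing quotation mark is included in the token (as the comment in the question's scanner.h describes), and a space is returned as its own one-character token (as the question expects). The free function `next_token` is hypothetical; the question's Scanner keeps the position in its `cp` member instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free function: returns the token starting at `cursor` and
// advances `cursor` past it. Returns "" when no input is left.
std::string next_token(const std::string& input, std::size_t& cursor,
                       bool scan_quotes_as_strings) {
    if (cursor >= input.size()) return "";
    const std::size_t begin = cursor;
    std::size_t end = begin + 1;  // one past the end of the token

    if (scan_quotes_as_strings && input[begin] == '"') {
        // Sub-scanner: run to the closing quote, keeping any spaces.
        while (end < input.size() && input[end] != '"') ++end;
        if (end < input.size()) ++end;  // include the closing quote
    } else if (input[begin] != ' ') {
        // Ordinary token: run to the next space.
        while (end < input.size() && input[end] != ' ') ++end;
    }
    // A space falls through as a one-character token.
    cursor = end;
    return input.substr(begin, end - begin);
}
```

Calling it repeatedly on `"hello world" foo` with the option enabled produces the quoted string as one token, then a space, then `foo`. With the option disabled, the same input splits at the space inside the quotes.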

Also,

while ((cp < len && (buffer[cp] == '"')) || (cp < len && (buffer[cp] == '"'))) ...

just looks suspicious.
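To spell out why that loop looks suspicious: both operands of the `||` are the same test, so the condition reduces to a single check, and the loop only steps over quotation marks themselves, never over the text between them. A sketch of the equivalent logic as a standalone function (`skip_quote_marks` is a hypothetical name) makes this visible:

```cpp
#include <cstddef>
#include <string>

// Equivalent of the question's scanQuotedString loop, written as a free
// function for illustration. Returns the position reached after skipping
// any run of consecutive '"' characters starting at `cp`.
std::size_t skip_quote_marks(const std::string& buffer, std::size_t cp) {
    while (cp < buffer.size() && buffer[cp] == '"') {
        ++cp;
    }
    return cp;
}
```

For the input `"hello world"` starting at position 0, this advances only to position 1, so the words and the space inside the quotes are still scanned as ordinary tokens afterwards. That is consistent with the behaviour the question reports.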

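【Comments】:

I think you should check the validity of the characters read.

The validity of the characters read is defined by the language being parsed, about which I know nothing. The requirement so far is that tokens are whitespace-separated unless quoted; maybe his/her language takes Codepage 437 input and accepts smileys in strings or identifiers.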
