[Title]: While doing url encoding, the std::regex_replace doesn't work properly for character "+"
[Posted]: 2018-05-16 06:46:11
[Question]: Below is a code snippet in which regex_replace does not handle the character "+" properly. I should not need any special handling for this character; it should simply work.
/* All header files are available. */
std::string charToHex(unsigned char c, bool bUpperCase);
std::string urlEncode(const std::string& toEncode, bool bEncodeForwardSlash);

std::string getEncodedUrl(const std::string& url)
{
    std::string bktObjKey = "";
    std::string urlEnc = url;
    boost::regex expression("^(([^:/?#]+):)?(//([^/?#:]*)(:\\d+)?)?([^?#]*)((\\?[^#]*))?(#(.*))?");
    std::string::const_iterator start = url.begin(), end = url.end();
    boost::match_results<std::string::const_iterator> what;
    boost::match_flag_type flags = boost::match_default;
    if (regex_search(url.begin(), url.end(), what, expression, flags))
    {
        std::cout<<"Matched"<<std::endl;
        bktObjKey.insert(bktObjKey.begin(), what[6].first, what[6].second);
        std::regex fobj(bktObjKey);
        /*std::string fobj = bktObjKey;*/
        /*auto pos = url.find(bktObjKey);*/
        bktObjKey = urlEncode(bktObjKey, false);
        std::cout<<"bktObjKey :"<<bktObjKey.c_str()<<" urlEnc: "<<urlEnc.c_str()<<std::endl;
        urlEnc = std::regex_replace(url, fobj, bktObjKey);
        std::cout<<" urlEnc: "<<urlEnc.c_str()<<std::endl;
    }
    return urlEnc;
}

std::string urlEncode(const std::string& toEncode, bool bEncodeForwardSlash)
{
    std::ostringstream out;
    std::cout<<"inside encode"<<std::endl;
    for(std::string::size_type i=0; i < toEncode.length(); ++i)
    {
        char ch = toEncode.at(i);
        if ((ch >= 'A' && ch <= 'Z') ||
            (ch >= 'a' && ch <= 'z') ||
            (ch >= '0' && ch <= '9') ||
            (ch == '_' || ch == '-' || ch == '~' || ch == '.') ||
            (ch == '/' && !bEncodeForwardSlash))
        {
            out << ch;
            std::cout<<out.str()<<" Is not coded to HEX"<<std::endl;
        }
        else
        {
            out << "%" << charToHex(ch, true);
            std::cout<<out.str()<<" Is coded to HEX"<<std::endl;
        }
    }
    std::cout<<"Return :"<<out.str()<<std::endl;
    return out.str();
}

std::string charToHex(unsigned char c, bool bUpperCase)
{
    short i = c;
    std::stringstream s;
    s << std::setw(2) << std::setfill('0') << std::hex << i;
    return s.str();
}

int main()
{
    std::string url1 ="http://10.130.0.36/rbkt10/+";
    std::string out1 = getEncodedUrl(url1);
    std::cout<<"Encoded URL1=:"<<out1<<std::endl;
    return 0;
}
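Output: Encoded URL1=:http://10.130.0.36/rbkt10/%2b+
So the output becomes "++". It should be only "+". How can I make this work correctly?
[Comments]:
Code is for humans to read. Formatting matters.
[Answer 1]: You are interpreting the raw string as a regular expression, and + is special in regular expressions¹.
You should simply use std::string::replace, because you don't need any regex replacement functionality: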
boost::smatch what;
if (regex_search(url.cbegin(), url.cend(), what, expression))
{
    boost::ssub_match query = what[6];
    url.replace(query.first, query.second, urlEncode(query.str(), false));
}
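To make the difference concrete, here is a minimal sketch that uses only the standard library. The helper names replaceAsRegex, replaceLiteral and escapeRegex are hypothetical and not part of the original code; they only contrast treating the key as a pattern with treating it as literal text, plus escaping the key as a fallback if a regex really has to be used.

```cpp
#include <regex>
#include <string>

// Treats `key` as a regular expression, as the question's code does.
// With key "/rbkt10/+" the trailing "/+" means "one or more '/'",
// so the literal '+' in the text is never part of the match.
std::string replaceAsRegex(const std::string& text, const std::string& key,
                           const std::string& replacement)
{
    return std::regex_replace(text, std::regex(key), replacement);
}

// Replaces the first literal occurrence of `key`; no regex involved.
std::string replaceLiteral(std::string text, const std::string& key,
                           const std::string& replacement)
{
    const std::string::size_type pos = text.find(key);
    if (pos != std::string::npos)
        text.replace(pos, key.size(), replacement);
    return text;
}

// Backslash-escapes the ECMAScript metacharacters so that the result
// matches `literal` character for character when used as a pattern.
std::string escapeRegex(const std::string& literal)
{
    static const std::string special = R"(\^$.|?*+()[]{})";
    std::string out;
    for (char ch : literal) {
        if (special.find(ch) != std::string::npos)
            out += '\\';
        out += ch;
    }
    return out;
}
```

With the URL from the question, replaceAsRegex leaves the trailing "+" behind, while replaceLiteral (or replaceAsRegex with an escaped key) replaces the whole path.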
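Complicated, scattered code like the bktObjKey.insert(...) call in the question could simply be:
std::string bktObjKey = what[6].str();
A complicated loop such as
for (std::string::size_type i = 0; i < toEncode.length(); ++i)
    char ch = toEncode.at(i);
could be
for (char ch : toEncode)
charToHex creates a new 2-character string every time, uses yet another stringstream every time, copies the result out of that stringstream, and so on. Instead, just write to the stringstream you already have and avoid all that inefficiency: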
void writeHex(std::ostream& os, unsigned char c, bool uppercase)
{
    os << std::setfill('0') << std::hex;
    if (uppercase)
        os << std::uppercase;
    os << std::setw(2) << static_cast<int>(c);
}
Note that this also fixes the fact that you forgot to use bUpperCase.
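One thing to keep in mind when writing into a shared stream: std::hex, std::uppercase and the fill character stay set on the stream after the call (only std::setw resets itself), so a later integer written to the same stream would also come out in hex. A minimal sketch, assuming a hypothetical variant named writeHexRestoring that saves and restores the formatting state:

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Writes `c` as two hex digits, then puts the stream's flags and fill
// character back the way they were before the call.
void writeHexRestoring(std::ostream& os, unsigned char c, bool uppercase)
{
    const std::ios_base::fmtflags savedFlags = os.flags();
    const char savedFill = os.fill('0');   // fill() returns the old fill

    os << std::hex << (uppercase ? std::uppercase : std::nouppercase)
       << std::setw(2) << static_cast<int>(c);

    os.flags(savedFlags);
    os.fill(savedFill);
}

// Small demonstration: the decimal number written afterwards is unaffected.
std::string hexThenDecimal()
{
    std::ostringstream out;
    writeHexRestoring(out, '+', true);
    out << ' ' << 255;
    return out.str();
}
```

Whether restoring matters depends on what else gets written to the stream; for a stream that only ever receives characters and hex bytes, the simpler version above is enough.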
Look at <cctype> for help classifying characters.
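As a rough illustration of what <cctype> buys you, the character test from urlEncode can be sketched as below. The names isUnreserved and countNeedingEscape are hypothetical. The sketch assumes the default "C" locale (std::isalnum is locale dependent) and casts to unsigned char first, because passing a negative char value to the <cctype> functions is undefined behavior.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True for letters, digits and the four marks '-', '.', '_', '~',
// i.e. the characters the question's urlEncode passes through unchanged
// (the optional '/' exception is left out here).
bool isUnreserved(char ch)
{
    const unsigned char uc = static_cast<unsigned char>(ch);
    return std::isalnum(uc) != 0 ||
           ch == '-' || ch == '.' || ch == '_' || ch == '~';
}

// Counts how many characters of `s` would need percent-encoding.
std::size_t countNeedingEscape(const std::string& s)
{
    std::size_t n = 0;
    for (char ch : s)
        if (!isUnreserved(ch))
            ++n;
    return n;
}
```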
Use a raw string literal to write the regex. Instead of
boost::regex expression("^(([^:/?#]+):)?(//([^/?#:]*)(:\\d+)?)?([^?#]*)((\\?[^#]*))?(#(.*))?");
write:
boost::regex expression(R"(^(([^:/?#]+):)?(//([^/?#:]*)(:\d+)?)?([^?#]*)((\?[^#]*))?(#(.*))?)");
(no need to double-escape \d and \?)
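A quick way to convince yourself that the two spellings are the same pattern is to compare the strings. The sketch below uses a shortened fragment of the expression; the function names are hypothetical.

```cpp
#include <string>

// Ordinary literal: every backslash meant for the regex has to be doubled.
std::string escapedPattern()
{
    return "(:\\d+)?(\\?[^#]*)?";
}

// Raw literal: the text between R"( and )" is taken verbatim.
std::string rawPattern()
{
    return R"((:\d+)?(\?[^#]*)?)";
}
```

Both functions return the same sequence of characters, so the regex engine sees identical patterns either way.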
Either drop all the redundant subgroups:
boost::regex expression(R"(^([^:/?#]+:)?(//[^/?#:]*(:\d+)?)?[^?#]*(\?[^#]*)?(#.*)?)");
or make them maintainable and useful²:
boost::regex uri_regex(
R"(^((?<scheme>[^:/?#]+):)?)"
R"((?<authority>//(\?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
R"((?<path>[^?#]*))"
R"((\?(?<query>([^#]*)))?)"
R"((#(?<fragment>.*))?)");
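The named groups rely on Boost.Regex's Perl syntax. The standard library's std::regex (default ECMAScript grammar) does not accept (?<name>...), so if you are limited to std::regex, a similar effect can be sketched by giving the group indices names. The pattern below is the one from RFC 3986, Appendix B; UriGroup, UriParts and splitUri are hypothetical names chosen for this illustration.

```cpp
#include <regex>
#include <string>

// Group numbers of the RFC 3986 Appendix B expression.
enum UriGroup { Scheme = 2, Authority = 4, Path = 5, Query = 7, Fragment = 9 };

struct UriParts {
    std::string scheme, authority, path, query, fragment;
    bool hasQuery = false;     // distinguishes "no query" from "empty query"
    bool hasFragment = false;
};

// Splits `uri` into its generic components; returns false if it does not
// match (for example because it contains a line break).
bool splitUri(const std::string& uri, UriParts& parts)
{
    static const std::regex pattern(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (!std::regex_match(uri, m, pattern))
        return false;

    parts.scheme      = m[Scheme].str();
    parts.authority   = m[Authority].str();
    parts.path        = m[Path].str();
    parts.query       = m[Query].str();
    parts.fragment    = m[Fragment].str();
    parts.hasQuery    = m[Query].matched;
    parts.hasFragment = m[Fragment].matched;
    return true;
}
```

Unlike the Boost pattern, this one does not split the authority into host and port; that would need an extra group.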
Now that you have access to the logical components of the URI, use them to get a better sense of when and where to encode:
std::string escaped =
    what["scheme"].str() +
    what["authority"].str() +
    urlEncode(what["path"].str(), false);
if (query.matched)
{
    escaped += '?';
    escaped.append(urlEncode(query, true));
}
if (fragment.matched)
{
    escaped += '#';
    escaped.append(urlEncode(fragment, true));
}
Overload urlEncode so that it takes an existing ostream reference instead of always creating its own:
std::ostringstream out;
out << what["scheme"] << what["authority"];
urlEncode(out, what["path"], false);
if (query.matched)
    urlEncode(out << '?', query, true);
if (fragment.matched)
    urlEncode(out << '#', fragment, true);
Reviewed Code
Live On Coliru
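#include <boost/regex.hpp>
#include <cctype>   // std::isalnum
#include <cstdint>  // uint8_t
#include <cstring>  // std::strchr
#include <iomanip>
#include <iostream>
#include <sstream>  // std::ostringstream
#include <string>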

void writeHex(std::ostream& os, unsigned char c, bool uppercase)
{
    os << std::setfill('0') << std::hex;
    if (uppercase)
        os << std::uppercase;
    os << '%' << std::setw(2) << static_cast<int>(c);
}

void urlEncode(std::ostream& os, const std::string &toEncode, bool bEncodeForwardSlash)
{
    auto is_safe = [=](uint8_t ch) {
        return std::isalnum(ch) ||
            (ch == '/' && !bEncodeForwardSlash) ||
            std::strchr("_-~.", ch);
    };

    for (char ch : toEncode) {
        if (is_safe(ch))
            os << ch;
        else
            writeHex(os, ch, true);
    }
}

std::string urlEncode(const std::string &toEncode, bool bEncodeForwardSlash)
{
    std::ostringstream out;
    urlEncode(out, toEncode, bEncodeForwardSlash);
    return out.str();
}

std::string getEncodedUrl(std::string url)
{
    boost::regex uri_regex(
        R"(^((?<scheme>[^:/?#]+):)?)"
        R"((?<authority>//(\?<host>[^/?#:]*)(:(?<port>\d+))?)?)"
        R"((?<path>[^?#]*))"
        R"((\?(?<query>([^#]*)))?)"
        R"((#(?<fragment>.*))?)");

    boost::match_results<std::string::iterator> what;
    //boost::smatch what;
    if (regex_search(url.begin(), url.end(), what, uri_regex)) {
        auto& full = what[0];
        auto& query = what["query"];
        auto& fragment = what["fragment"];

        std::ostringstream out;
        out << what["scheme"] << what["authority"];
        urlEncode(out, what["path"], false);
        if (query.matched)
            urlEncode(out << '?', query, true);
        if (fragment.matched)
            urlEncode(out << '#', fragment, true);

        url.replace(full.begin(), full.end(), out.str());
    }
    return url;
}

int main()
{
    for (std::string url : {
            "http://10.130.0.36/rbkt10/+",
            "//10.130.0.36/rbkt10/+",
            "//localhost:443/rbkt10/+",
            "https:/rbkt10/+",
            "https:/rbkt10/+?in_params='please do escape / (forward slash)'&more#also=in/fragment",
            "match inside text http://10.130.0.36/rbkt10/+ is a bit fuzzy",
        })
    {
        std::cout << "Encoded URL: " << getEncodedUrl(url) << std::endl;
    }
}
Prints
Encoded URL: http//10.130.0.36/rbkt10/%2B
Encoded URL: //10.130.0.36/rbkt10/%2B
Encoded URL: //localhost%3A443/rbkt10/%2B
Encoded URL: https/rbkt10/%2B
Encoded URL: https/rbkt10/%2B?in_params%3D%27please%20do%20escape%20%2F%20%28forward%20slash%29%27%26more#also%3Din%2Ffragment
Encoded URL: match inside text http//10.130.0.36/rbkt10/%2B%20is%20a%20bit%20fuzzy
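Caveat
Note that the code still doesn't adhere to the specs. For example, in the output above the ':' after the scheme is dropped and the ':' before the port is percent-encoded. The (\?<host> in the pattern matches a literal "?<host>" rather than opening a named group, so the authority group never matches and "//host:port" ends up being treated as part of the path.
This is why you use a library.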
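To illustrate one of the gaps visible in the printed output (the ':' after the scheme disappears and the ':' before the port gets encoded), here is a small sketch that keeps the scheme and authority verbatim, including their delimiters, and percent-encodes only the path. It uses std::regex with the RFC 3986 Appendix B pattern rather than Boost, leaves the query and fragment untouched as a simplifying assumption, and is still not a spec-compliant encoder. The names percentEncode and encodeUrlPath are hypothetical.

```cpp
#include <cctype>
#include <cstring>
#include <regex>
#include <string>

// Percent-encodes every byte that is neither alphanumeric nor listed
// in `extraSafe`, using uppercase hex digits.
std::string percentEncode(const std::string& in, const char* extraSafe)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char ch : in) {
        if (std::isalnum(ch) || (ch != 0 && std::strchr(extraSafe, ch))) {
            out += static_cast<char>(ch);
        } else {
            out += '%';
            out += digits[ch >> 4];
            out += digits[ch & 0x0F];
        }
    }
    return out;
}

// Re-assembles the URL from its parts, encoding only the path.
// Groups 1, 3, 6 and 8 include the "scheme:", "//authority",
// "?query" and "#fragment" delimiters, so nothing gets lost.
std::string encodeUrlPath(const std::string& url)
{
    static const std::regex uri(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (!std::regex_match(url, m, uri))
        return url;  // leave input unchanged if it cannot be split

    return m[1].str() + m[3].str() +
           percentEncode(m[5].str(), "-._~/") +
           m[6].str() + m[8].str();
}
```

With this sketch, "//localhost:443/rbkt10/+" becomes "//localhost:443/rbkt10/%2B". Input that already contains percent-escapes would be encoded a second time, which is one more reason to prefer a dedicated URL library.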
¹ (This is what leaves the + in the input. It is not "duplicated"; it simply was not replaced, because /+ means one or more /.)
² See https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
[Comments]: