C++ 新特性学习（三） — Regex库

C++ STL终于会放点实用的东西了。可喜可贺。

这个，显然是正则表达式库，作为一个强大而又NB的库，我表示对其理解甚少，只能先研究下基本用法，更具体的用法要等实际应用中用到的时候在细看了。 PS：正则表达式的资料见 http://www.regexlab.com/ 更多资料见 https://www.owent.net/2011/264.html

就这样吧，开始。正则表达式这玩意是用自动机搞出来的，效率当然就是自动机的效率了。当然不同的实现效率是不一样的，至于STL的效率。我就不清楚了，不过姑且相信STL吧。

**第一个注意：**使用正则表达式的转义的时候，不要忘了C/C++的斜杠也是要转义的正则表达式主要函数有三 std::regex_search std::regex_match std::regex_replace 第三个好说，看函数名就知道什么意思，但是前两个呢？直接报答案吧，第一个是不完全匹配，第二个是完全匹配。

同时，在正则表达式库里还有两个重要的类 enum std::regex_constants::match_flag_type 这个看名字就能知道是设置匹配选项的，具体选项看内容就很容易看懂，也不用多解释了。另一个是类模版std::match_results，传进去的类型是类的迭代器如以下从VC里抄来的

typedef basic_regex regex;
typedef basic_regex wregex;
typedef match_results cmatch;
typedef match_results wcmatch;
typedef match_results smatch;
typedef match_results wsmatch;

这都是默认定义这个用于记录匹配结果，匹配如果成功，它里面会有多个std::sub_match对象，分别指向匹配的结果 std::sub_match里有matched成员表示该项是否匹配成功，还有first和second成员分别指向匹配的目标的起始位置和结束位置，str()函数可以获取匹配的值而同时std::match_results的prefix()和suffix()函数分别指向整个匹配式的头和尾。返回的类型也是std::sub_match，内容和上面的类似

**这里有第二个注意：匹配结果里的数据是共享的，只是指针不同，所以要注意不要随意释放资源。另外有第三个注意：**匹配返回真的时候才会对传入的匹配项的变量修改，如果返回false，传入的std::match_results是不会变化的

接下来就是std::regex_replace了，说到这个还涉及到std::match_results的format函数，这是一个表示筛选匹配项的的东东具体的嘛，看下面（只是把BOOST里的东西简单翻译以下，没有boost扩展的部分，并且只留下了VC++里tr1包含的功能，他说是Perl风格的）

      占位符 |           含义 |

—————-|—————| $& | 整个匹配值 | $MATCH | 和 $& 一样 | ${^MATCH} | 和 $& 一样 | $``` | 被匹配字符串去除匹配目标后的结果（即） | `$PREMATCH` | 和$``` 一样 | `${^PREMATCH}| 和 ``$``` 一样 |$'` | 当前匹配位置之后的全部文本（不包括匹配的字符串） | `$POSTMATCH| 和$'` 一样 | `${^POSTMATCH}| 和$'` 一样 | `$$` | 字符 `'$’|$n` | 第n和被匹配项的值 |

我表示boost的功能更强大不过这些已经够了。另外转义字符如下

     Escape |           Meaning |

—————-|——————-| \a | Outputs the bell character: ‘\a’. | \e | Outputs the ANSI escape character (code point 27). | \f | Outputs a form feed character: ‘\f’ | \n | Outputs a newline character: ‘\n’. | \r | Outputs a carriage return character: ‘\r’. | \t | Outputs a tab character: ‘\t’. | \v | Outputs a vertical tab character: ‘\v’. | \xDD | Outputs the character whose hexadecimal code point is 0xDD | \x{DDDD} | Outputs the character whose hexadecimal code point is 0xDDDDD | \cX | Outputs the ANSI escape sequence “escape-X”. | \D | If D is a decimal digit in the range 1-9, then outputs the text that matched sub-expression D. | \l | Causes the next character to be outputted, to be output in lower case. | \u | Causes the next character to be outputted, to be output in upper case. | \L | Causes all subsequent characters to be output in lower case, until a \E is found. | \U | Causes all subsequent characters to be output in upper case, until a \E is found. | \E | Terminates a \L or \U sequence. |

这个就懒得翻译和测试了，都是很简单的东西。

接下来std::regex_replace里的format也是传入这种东西，返回的就是替换后的字符串了。

另外正则表达式错误，会抛出异常，当然你也可以配合std::regex_constants::match_flag_type做一些变化。

最后，贴出代码和结

#include 
#include 
#include 
#include 
#include 



int main() {
    using namespace std;

    regex reg("(http|https)://([\\w\\./]*)");
    string strIn;
    std::smatch res;
    bool isUrl;

    // 查找
    getline(cin, strIn);
    isUrl = std::regex_search(strIn, res, reg, std::regex_constants::match_not_null);
    cout<< (isUrl? "It's a url": "It's not a url")<< endl;
    // 输入 MyBlog is http://www.owent.net/ 匹配成功
    // 匹配结果里有三项，分别是整个匹配表达式和两个子表达式
    // 以下代码输出
    // 这个时候千万不能执行类似strIn = "" 改变strIn内容的操作，
    // 因为其和res指针指向的内存是共享的，如果对其进行就该会出现RE
    for (std::smatch::size_type i = 0; i < res.size(); i ++) {
        cout<< "第"<< i + 1<< "条匹配项first地址 => "<< &(res[i].first)<< endl;
        cout<< "第"<< i + 1<< "条匹配项second地址 => "<< &(res[i].second)<< endl;
        cout<< "第"<< i + 1<< "条匹配值为 => "<< res[i].str()<< endl<< endl;
    }

  
    // 匹配
    isUrl = std::regex_match(strIn, res, reg);
    cout<< isUrl<< " <= Matched? ,Size =>"<$&
\nScheme is $1\nAddress is $2";
    string strOut = std::regex_replace(strIn, reg, strRule);
    cout<< strOut<< endl;
    return 0;
}

//以下是输入“MyBlog is http://www.owent.net/ ”的输出结果：
//It's a url
//第1条匹配项first地址 => 0032EB70
//第1条匹配项second地址 => 0032EB7C
//第1条匹配值为 => http://www.owent.net/
//
//第2条匹配项first地址 => 0032EB8C
//第2条匹配项second地址 => 0032EB98
//第2条匹配值为 => http
//
//第3条匹配项first地址 => 0032EBA8
//第3条匹配项second地址 => 0032EBB4
//第3条匹配值为 => www.owent.net/
//
//0 <= Matched? ,Size =>3
//MyBlog is http://www.owent.net/

//Scheme is http
//Address is www.owent.net/

I'm OWenT

Challenge Everything

C++ 新特性学习（三） — Regex库