

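Java, JavaScript, and PCRE match only ASCII characters with \w. Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b: a word boundary only exists between a word character and a non-word character, and +, # and . are not word characters, so there is no boundary between the last + of C++ and the space after it. Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence.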
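To make the \b problem concrete, here is a minimal sketch in Python (not Java, so treat it as an illustration of the general behavior rather than of any one engine). The helper name find_term and the sample sentence are made up for this example, and the lookaround-based workaround is one possible approach, not something the original write-up prescribes.

```python
import re

text = "We use C++, C# and .NET at work."

# Naive approach: escape the term and wrap it in \b on both sides.
# This fails because +, # and . are not word characters, so no word
# boundary exists where the pattern demands one.
naive_hits = {
    term: re.findall(r"\b" + re.escape(term) + r"\b", text)
    for term in ("C++", "C#", ".NET")
}
print(naive_hits)


def find_term(term, text):
    """Hypothetical helper: require that the term is not glued to a word
    character on either side, instead of demanding a \\b boundary."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.findall(pattern, text)


for term in ("C++", "C#", ".NET"):
    print(term, find_term(term, text))
```

One caveat with this sketch: a search for plain C would still hit the C inside C++ and C#, because + and # count as non-word characters. Excluding those would need a stricter lookahead.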

In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w.
Removing the non-matched characters is the same as keeping the matched ones. If it is possible that the expression matches multiple times in your text, you can use this (C#): var result = Regex.Matches(text, regExpression).Cast<Match>().Aggregate("", (s, e) => s + e.Value, s => s); Thanks to "Replace chars if not".
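The idea of keeping only the matched text can be sketched in Python as well. This is a rough translation of the concatenate-all-matches approach, not the original snippet. The function name keep_matches and the sample string are assumptions made for illustration.

```python
import re


def keep_matches(pattern, text):
    """Concatenate every match of `pattern` in `text`, in order.
    Hypothetical helper, named here for illustration only."""
    return "".join(m.group(0) for m in re.finditer(pattern, text))


text = "a1-b2_c3!"

# Keep what matches \w+ ...
kept = keep_matches(r"\w+", text)

# ... which gives the same result as deleting what matches the complement \W+.
stripped = re.sub(r"\W+", "", text)

print(kept, stripped)
```

Joining the matches is handy when the thing you want is easy to describe but its complement is not. When both are simple character classes, as here, the two approaches are interchangeable.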

I ran into an even worse problem when searching text for words like C++, C# and .NET. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for. Anyway, this is what I found out (summarized mostly from another site, which is a great resource): In most flavors of regex, the characters matched by the short-hand character class \w are the same characters that word boundaries (\b) treat as word characters. Java is an exception: it supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.) In every flavor, \w matches at least the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!).
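A quick way to see what \w covers, and how \b follows it, is to run the same pattern in ASCII-only and Unicode-aware modes. The sketch below uses Python, where \w is Unicode-aware by default for str patterns and the re.ASCII flag restricts it to [A-Za-z0-9_]. Other engines differ, so take this as an illustration of the principle, not a statement about Java or PCRE. The sample string is invented for the example.

```python
import re

# "na\u00efve" is "naive" with a precomposed i-diaeresis (U+00EF).
sample = "snake_case kebab-case x86 na\u00efve"

# Unicode-aware \w: underscore and digits are word characters, dash is not,
# and the accented letter counts as a word character.
unicode_words = re.findall(r"\w+", sample)

# ASCII-only \w: the accented letter is no longer a word character,
# so the last word is split in two.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# \b follows \w: in ASCII mode a boundary appears in the middle of the word.
ve_unicode = re.findall(r"\bve\b", sample)
ve_ascii = re.findall(r"\bve\b", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
print(ve_unicode, ve_ascii)
```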
