The char_separator class breaks a sequence of characters into tokens based on character delimiters much in the same way that strtok() does (but without all the evils of non-reentrancy and destruction of the input sequence).
The char_separator class is used in conjunction with the token_iterator or tokenizer to perform tokenizing.
The strtok() function does not include matches with the character delimiters in the output sequence of tokens. However, it is sometimes useful to have the delimiters show up in the output sequence, so char_separator provides this as an option. We refer to delimiters that show up as output tokens as kept delimiters and delimiters that do not show up as output tokens as dropped delimiters.
When two delimiters appear next to each other in the input sequence, there is the question of whether to output an empty token or to skip ahead. The behaviour of strtok() is to skip ahead. The char_separator class provides both options.
The first example shows how to use char_separator as a replacement for the strtok() function. We've specified three character delimiters, and they will not show up as output tokens. We have not specified any kept delimiters, and by default any empty tokens will be ignored.

    // char_sep_example_1.cpp
    #include <cstdlib>   // EXIT_SUCCESS
    #include <iostream>
    #include <boost/tokenizer.hpp>
    #include <string>

    int main()
    {
      std::string str = ";;Hello|world||-foo--bar;yow;baz|";
      typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
      boost::char_separator<char> sep("-;|");  // three dropped delimiters
      tokenizer tokens(str, sep);
      for (tokenizer::iterator tok_iter = tokens.begin();
           tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
      std::cout << "\n";
      return EXIT_SUCCESS;
    }

The output is:

    <Hello> <world> <foo> <bar> <yow> <baz>
The next example shows tokenizing with two dropped delimiters '-' and ';' and a single kept delimiter '|'. We also specify that empty tokens should show up in the output when two delimiters are next to each other.
    // char_sep_example_2.cpp
    #include <cstdlib>   // EXIT_SUCCESS
    #include <iostream>
    #include <boost/tokenizer.hpp>
    #include <string>

    int main()
    {
      std::string str = ";;Hello|world||-foo--bar;yow;baz|";
      typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
      // '-' and ';' are dropped, '|' is kept, empty tokens are reported
      boost::char_separator<char> sep("-;", "|", boost::keep_empty_tokens);
      tokenizer tokens(str, sep);
      for (tokenizer::iterator tok_iter = tokens.begin();
           tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
      std::cout << "\n";
      return EXIT_SUCCESS;
    }

The output is:

    <> <> <Hello> <|> <world> <|> <> <|> <> <foo> <> <bar> <yow> <baz> <|> <>
The final example shows tokenizing on punctuation and whitespace characters using the default constructor of the char_separator.
    // char_sep_example_3.cpp
    #include <cstdlib>   // EXIT_SUCCESS
    #include <iostream>
    #include <boost/tokenizer.hpp>
    #include <string>

    int main()
    {
      std::string str = "This is, a test";
      typedef boost::tokenizer<boost::char_separator<char> > Tok;
      boost::char_separator<char> sep; // default constructed
      Tok tok(str, sep);
      for (Tok::iterator tok_iter = tok.begin();
           tok_iter != tok.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
      std::cout << "\n";
      return EXIT_SUCCESS;
    }

The output is:

    <This> <is> <,> <a> <test>
Parameter | Description | Default |
---|---|---|
Char | The type of elements within a token, typically char. | |
Traits | The char_traits for the character type. | char_traits<char> |
explicit char_separator(const Char* dropped_delims, const Char* kept_delims = "", empty_token_policy empty_tokens = drop_empty_tokens)
This creates a char_separator object, which can then be used to create a token_iterator or tokenizer to perform tokenizing. The dropped_delims and kept_delims are strings of characters where each character is used as a delimiter during tokenizing. Whenever a delimiter is seen in the input sequence, the current token is finished, and a new token begins. The delimiters in dropped_delims do not show up as tokens in the output, whereas the delimiters in kept_delims do show up as tokens. If empty_tokens is drop_empty_tokens, then empty tokens will not show up in the output. If empty_tokens is keep_empty_tokens, then empty tokens will show up in the output.
explicit char_separator()
The function std::isspace() is used to identify dropped delimiters and std::ispunct() is used to identify kept delimiters. In addition, empty tokens are dropped.
template <typename InputIterator, typename Token> bool operator()(InputIterator& next, InputIterator end, Token& tok)
This function is called by the token_iterator to perform tokenizing. The user typically does not call this function directly.
© Copyright Jeremy Siek and John R. Bandela 2001-2002. Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.