cgv
cgv::utils::tokenizer Class Reference

The tokenizer allows splitting text into tokens in a convenient way. More...

#include <tokenizer.h>

Inheritance diagram for cgv::utils::tokenizer:
cgv::utils::token

Public Member Functions

 tokenizer ()
 construct empty tokenizer
 
 tokenizer (const token &)
 construct from token
 
 tokenizer (const char *)
 construct from character string
 
 tokenizer (const std::string &)
 construct from string
 
tokenizer & set_ws (const std::string &ws)
 set the list of whitespace characters that separate tokens and are skipped
 
tokenizer & set_skip (const std::string &open, const std::string &close)
 set several character pairs that enclose tokens that are not split
 
tokenizer & set_skip (const std::string &open, const std::string &close, const std::string &escape)
 set several character pairs that enclose tokens that are not split and one escape character for each pair
 
tokenizer & set_sep (const std::string &sep, bool merge)
 set the list of separators and specify whether successive separators are merged into single tokens
 
tokenizer & set_sep (const std::string &sep)
 set the list of separators
 
tokenizer & set_sep_merge (bool merge)
 specify whether successive separators are merged into single tokens
 
token bite ()
 bite away a single token from the front
 
token reverse_bite ()
 bite away a single token from the back
 
void reverse_skip_whitespaces ()
 skip whitespaces at the back
 
void skip_whitespaces ()
 skip whitespaces at the front
 
bool skip_ws_check_empty ()
 skip whitespaces at the front and return whether the complete text has been processed
 
bool reverse_skip_ws_check_empty ()
 skip whitespaces at the back and return whether the complete text has been processed
 
bool balanced_bite (token &result, const std::string &open_parenthesis, const std::string &close_parenthesis, bool wait_for_sep=false)
 bite one token until all potentially nested opened parentheses have been closed again
 
void bite_all (std::vector< token > &result)
 
- Public Member Functions inherited from cgv::utils::token
 token ()
 construct with both pointers set to 0
 
 token (const char *_str)
 construct from c-string
 
 token (const char *_b, const char *_e)
 construct from character range
 
 token (const std::string &s)
 construct from string
 
size_t get_length () const
 return the length of the token in number of characters
 
size_t size () const
 return the length of the token in number of characters
 
bool empty () const
 return whether the token is empty
 
void skip (const std::string &skip_chars)
 set begin by skipping all instances of the given character set
 
void reverse_skip (const std::string &skip_chars)
 set end by skipping all instances of the given character set
 
char operator[] (unsigned int i) const
 return the i-th character of the token
 
bool operator== (const char *s) const
 compare to const char*
 
bool operator== (const std::string &s) const
 compare to string
 
bool operator!= (const char *s) const
 compare to const char*
 
bool operator!= (const std::string &s) const
 compare to string
 

Protected Member Functions

void init ()
 
bool handle_skip (token &result)
 
bool handle_separators (token &result, bool check_skip=true)
 
bool reverse_handle_skip (token &result)
 
bool reverse_handle_separators (token &result, bool check_skip=true)
 

Protected Attributes

std::string separators
 
bool merge_separators
 
std::string begin_skip
 
std::string end_skip
 
std::string escape_skip
 
std::string whitespaces
 

Additional Inherited Members

- Public Attributes inherited from cgv::utils::token
const char * begin
 pointers that define the range of characters
 
const char * end
 

Detailed Description

The tokenizer allows splitting text into tokens in a convenient way.

It supports splitting at whitespace and at single- or multi-character separators. Furthermore, it supports enclosing character pairs, such as parentheses or string delimiters, between which whitespace and separators do not split the text.

By default, the whitespace characters are space, tab, and newline. The lists of separators and skip character pairs are empty by default.

A tokenizer can be constructed from a string, a const char*, or a token. The resulting tokens are stored as two pointers, one to the beginning of the token and one past its end. No new memory is allocated, and the tokens remain valid only as long as the string or const char* from which the tokenizer has been constructed is valid.

In the simplest usage, the tokenizer generates a vector of tokens through the bite_all function. Suppose you want to split the string str="Hello tokenizer." at the white spaces into two tokens <Hello> and <tokenizer.>. Notice that no token contains the white space separating the tokens. The following code performs this task:

std::vector<token> toks;
bite_all(tokenizer(str), toks);

If you also want the dot cut into a separate token, just set the list of separators with the set_sep method:

std::vector<token> toks;
bite_all(tokenizer(str).set_sep("."), toks);

The result is three tokens: <Hello>, <tokenizer> and <.>. If you want to split a semicolon-separated list whose tokens can contain whitespace, ignoring the semicolons, you can set the semicolon character as the only whitespace:

std::vector<token> toks;
bite_all(tokenizer(str).set_ws(";"), toks);

The previous code would split the string "a and b;c and d" into the two tokens <a and b> and <c and d>.

If you do not want to split into tokens inside strings enclosed by <'> or inside parentheses, you can set several skip character pairs:

std::vector<token> toks;
bite_all(tokenizer(str).set_sep("[]").set_skip("'({", "')}"), toks);

The previous code example would split the string "'a b'[{c d}]" into four tokens: <'a b'>, <[>, <{c d}> and <]>. Note that you can apply several setter methods to the tokenizer in a sequence as each setter returns a reference to the tokenizer itself similar to the stream operators.

Definition at line 67 of file tokenizer.h.

Constructor & Destructor Documentation

◆ tokenizer() [1/4]

cgv::utils::tokenizer::tokenizer ( )

construct empty tokenizer

Definition at line 16 of file tokenizer.cxx.

◆ tokenizer() [2/4]

cgv::utils::tokenizer::tokenizer ( const token &  t)

construct from token

Definition at line 21 of file tokenizer.cxx.

◆ tokenizer() [3/4]

cgv::utils::tokenizer::tokenizer ( const char *  p)

construct from character string

Definition at line 26 of file tokenizer.cxx.

References cgv::utils::token::begin.

◆ tokenizer() [4/4]

cgv::utils::tokenizer::tokenizer ( const std::string &  s)

construct from string

Definition at line 33 of file tokenizer.cxx.

Member Function Documentation

◆ balanced_bite()

bool cgv::utils::tokenizer::balanced_bite ( token &  result,
const std::string &  open_parenthesis,
const std::string &  close_parenthesis,
bool  wait_for_sep = false 
)

bite one token until all potentially nested opened parentheses have been closed again

Definition at line 236 of file tokenizer.cxx.

References cgv::utils::token::begin, cgv::utils::is_element(), skip_whitespaces(), and cgv::utils::token::token().

◆ bite()

token cgv::utils::tokenizer::bite ( )

bite away a single token from the front

Definition at line 179 of file tokenizer.cxx.

References cgv::utils::token::begin, cgv::utils::is_element(), and skip_whitespaces().

Referenced by cgv::base::analyze_command(), and cgv::utils::bite_all().

◆ bite_all()

void cgv::utils::tokenizer::bite_all ( std::vector< token > &  result)

Definition at line 313 of file tokenizer.cxx.

◆ handle_separators()

bool cgv::utils::tokenizer::handle_separators ( token &  result,
bool  check_skip = true 
)
protected

Definition at line 142 of file tokenizer.cxx.

◆ handle_skip()

bool cgv::utils::tokenizer::handle_skip ( token &  result)
protected

Definition at line 78 of file tokenizer.cxx.

◆ init()

void cgv::utils::tokenizer::init ( )
protected

Definition at line 7 of file tokenizer.cxx.

◆ reverse_bite()

token cgv::utils::tokenizer::reverse_bite ( )

bite away a single token from the back

Definition at line 210 of file tokenizer.cxx.

References cgv::utils::token::begin, cgv::utils::is_element(), and reverse_skip_whitespaces().

◆ reverse_handle_separators()

bool cgv::utils::tokenizer::reverse_handle_separators ( token &  result,
bool  check_skip = true 
)
protected

Definition at line 160 of file tokenizer.cxx.

◆ reverse_handle_skip()

bool cgv::utils::tokenizer::reverse_handle_skip ( token &  result)
protected

Definition at line 110 of file tokenizer.cxx.

◆ reverse_skip_whitespaces()

void cgv::utils::tokenizer::reverse_skip_whitespaces ( )

skip whitespaces at the back

Definition at line 307 of file tokenizer.cxx.

References cgv::utils::token::reverse_skip().

Referenced by reverse_bite().

◆ reverse_skip_ws_check_empty()

bool cgv::utils::tokenizer::reverse_skip_ws_check_empty ( )
inline

skip whitespaces at the back and return whether the complete text has been processed

Definition at line 113 of file tokenizer.h.

◆ set_sep() [1/2]

tokenizer & cgv::utils::tokenizer::set_sep ( const std::string &  sep)

set the list of separators

Definition at line 66 of file tokenizer.cxx.

◆ set_sep() [2/2]

tokenizer & cgv::utils::tokenizer::set_sep ( const std::string &  sep,
bool  merge 
)

set the list of separators and specify whether successive separators are merged into single tokens

Definition at line 59 of file tokenizer.cxx.

◆ set_sep_merge()

tokenizer & cgv::utils::tokenizer::set_sep_merge ( bool  merge)

specify whether successive separators are merged into single tokens

Definition at line 72 of file tokenizer.cxx.

◆ set_skip() [1/2]

tokenizer & cgv::utils::tokenizer::set_skip ( const std::string &  open,
const std::string &  close 
)

set several character pairs that enclose tokens that are not split

Definition at line 44 of file tokenizer.cxx.

◆ set_skip() [2/2]

tokenizer & cgv::utils::tokenizer::set_skip ( const std::string &  open,
const std::string &  close,
const std::string &  escape 
)

set several character pairs that enclose tokens that are not split and one escape character for each pair

Definition at line 51 of file tokenizer.cxx.

◆ set_ws()

tokenizer & cgv::utils::tokenizer::set_ws ( const std::string &  ws)

set the list of whitespace characters that separate tokens and are skipped

Definition at line 38 of file tokenizer.cxx.

◆ skip_whitespaces()

void cgv::utils::tokenizer::skip_whitespaces ( )

skip whitespaces at the front

Definition at line 302 of file tokenizer.cxx.

References cgv::utils::token::skip().

Referenced by balanced_bite(), and bite().

◆ skip_ws_check_empty()

bool cgv::utils::tokenizer::skip_ws_check_empty ( )
inline

skip whitespaces at the front and return whether the complete text has been processed

Definition at line 111 of file tokenizer.h.

Referenced by cgv::utils::bite_all().

Member Data Documentation

◆ begin_skip

std::string cgv::utils::tokenizer::begin_skip
protected

Definition at line 72 of file tokenizer.h.

◆ end_skip

std::string cgv::utils::tokenizer::end_skip
protected

Definition at line 73 of file tokenizer.h.

◆ escape_skip

std::string cgv::utils::tokenizer::escape_skip
protected

Definition at line 74 of file tokenizer.h.

◆ merge_separators

bool cgv::utils::tokenizer::merge_separators
protected

Definition at line 71 of file tokenizer.h.

◆ separators

std::string cgv::utils::tokenizer::separators
protected

Definition at line 70 of file tokenizer.h.

◆ whitespaces

std::string cgv::utils::tokenizer::whitespaces
protected

Definition at line 75 of file tokenizer.h.


The documentation for this class was generated from the following files:

tokenizer.h
tokenizer.cxx