nltk.tokenize package¶
Submodules¶
nltk.tokenize.api module¶
Tokenizer Interface
class nltk.tokenize.api.StringTokenizer
Bases: nltk.tokenize.api.TokenizerI
A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).
class nltk.tokenize.api.TokenizerI
Bases: object
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
span_tokenize(s)
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Return type: iter(tuple(int, int))
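The interface is simple to extend. Below is a minimal sketch (not part of NLTK) of a hypothetical CommaTokenizer that only defines tokenize(); list-level helpers such as tokenize_sents() are then provided by TokenizerI.

from nltk.tokenize.api import TokenizerI

class CommaTokenizer(TokenizerI):
    """Hypothetical tokenizer: split a string on commas."""

    def tokenize(self, s):
        # Return the comma-separated fields, stripped of surrounding spaces.
        return [tok.strip() for tok in s.split(",")]

tok = CommaTokenizer()
print(tok.tokenize("a, b, c"))               # ['a', 'b', 'c']
# tokenize_sents() maps tokenize() over a list of strings.
print(tok.tokenize_sents(["a, b", "c, d"]))  # [['a', 'b'], ['c', 'd']]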
nltk.tokenize.casual module¶
Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:
- The tuple regex_strings defines a list of regular expression strings.
- The regex_strings strings are put, in order, into a compiled regular expression object called word_re.
- The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class TweetTokenizer.
- When instantiating TweetTokenizer objects, the main option is preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will downcase everything except for emoticons.
class nltk.tokenize.casual.TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)
Bases: object
Tokenizer for tweets.
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
Examples using strip_handles and reduce_len parameters:
>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
nltk.tokenize.casual.casual_tokenize(text, preserve_case=True, reduce_len=False, strip_handles=False)
Convenience function for wrapping the tokenizer.
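A brief usage sketch of the convenience function, assuming its keyword arguments behave exactly like the TweetTokenizer options shown above (the tweet is the same example string):

from nltk.tokenize.casual import casual_tokenize

tweet = '@remy: This is waaaaayyyy too much for you!!!!!!'
# casual_tokenize() wraps TweetTokenizer, so this should mirror the example above:
# [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
print(casual_tokenize(tweet, reduce_len=True, strip_handles=True))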
nltk.tokenize.casual.int2byte()
S.pack(v1, v2, …) -> bytes
Return a bytes object containing values v1, v2, … packed according to the format string S.format. See help(struct) for more on format strings.
nltk.tokenize.mwe module¶
Multi-Word Expression Tokenizer
A MWETokenizer
takes a string which has already been divided into tokens and
retokenizes it, merging multi-word expressions into single tokens, using a lexicon
of MWEs:
>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
>>> tokenizer.add_mwe(('in', 'spite', 'of'))
>>> tokenizer.tokenize('Testing testing testing one two three'.split())
['Testing', 'testing', 'testing', 'one', 'two', 'three']
>>> tokenizer.tokenize('This is a test in spite'.split())
['This', 'is', 'a', 'test', 'in', 'spite']
>>> tokenizer.tokenize('In a little or a little bit or a lot in spite of'.split())
['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']
class nltk.tokenize.mwe.MWETokenizer(mwes=None, separator='_')
Bases: nltk.tokenize.api.TokenizerI
A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
add_mwe(mwe)
Add a multi-word expression to the lexicon (stored as a word trie).
We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.
Parameters: mwe (tuple(str) or list(str)) – The multi-word expression we're adding into the word trie.
Example:

>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
tokenize(text)
Parameters: text (list(str)) – A list containing tokenized text
Returns: A list of the tokenized text with multi-words merged together
Return type: list(str)
Example:

>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
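A short pipeline sketch, assuming the Punkt models required by word_tokenize() are installed: tokenize at the word level first, then merge multi-word expressions.

from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('New', 'York')], separator='_')
tokens = word_tokenize("Good muffins cost $3.88 in New York.")
# Expected: ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.']
print(mwe.tokenize(tokens))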
nltk.tokenize.nist module¶
nltk.tokenize.punkt module¶
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English.
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
(Note that whitespace from the original text, including newlines, is retained in the output.)
Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag.
>>> text = '''
... (How does it deal with this parenthesis?) "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
... sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
... sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
) "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]
However, Punkt is designed to learn parameters (a list of abbreviations, etc.)
unsupervised from a corpus similar to the target domain. The pre-packaged models
may therefore be unsuitable: use PunktSentenceTokenizer(text)
to learn
parameters from the given text.
PunktTrainer
learns parameters such as a list of abbreviations
(without supervision) from portions of text. Using a PunktTrainer
directly
allows for incremental training and modification of the hyper-parameters used
to decide what is considered an abbreviation, etc.
The algorithm for this tokenizer is described in:
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
Boundary Detection. Computational Linguistics 32: 485-525.
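A minimal sketch of that workflow; my_corpus.txt is a hypothetical path standing in for a large plain-text corpus from the target domain.

from nltk.tokenize.punkt import PunktSentenceTokenizer

domain_text = open("my_corpus.txt", encoding="utf8").read()  # hypothetical corpus file
sent_detector = PunktSentenceTokenizer(domain_text)          # parameters are learned here
for sent in sent_detector.tokenize("Dr. Smith arrived. He was late."):
    print(sent)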
class nltk.tokenize.punkt.PunktBaseClass(lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>, params=None)
Bases: object
Includes common components of PunktTrainer and PunktSentenceTokenizer.
class nltk.tokenize.punkt.PunktLanguageVars
Bases: object
Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to the PunktSentenceTokenizer and PunktTrainer constructors.

internal_punctuation = ',:;'
Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.

period_context_re()
Compiles and returns a regular expression to find contexts including possible sentence boundaries.

re_boundary_realignment = re.compile('["\\\')\\]}]+?(?:\\s+|(?=--)|$)', re.MULTILINE)
Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

sent_end_chars = ('.', '?', '!')
Characters which are candidates for sentence boundaries.
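A minimal subclassing sketch; the class name and the extra Armenian full stop are illustrative assumptions, and depending on the NLTK version the regexes derived from sent_end_chars may be cached, so verify that the override takes effect.

from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class MyLanguageVars(PunktLanguageVars):
    # Add an extra sentence-boundary candidate character.
    sent_end_chars = ('.', '?', '!', '։')

tokenizer = PunktSentenceTokenizer(lang_vars=MyLanguageVars())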
class nltk.tokenize.punkt.PunktParameters
Bases: object
Stores data used to perform sentence boundary detection with Punkt.

abbrev_types = None
A set of word types for known abbreviations.

collocations = None
A set of word type tuples for known common collocations where the first word ends in a period. E.g., ('S.', 'Bach') is a common collocation in a text that discusses 'Johann S. Bach'. These count as negative evidence for sentence boundaries.

ortho_context = None
A dictionary mapping word types to the set of orthographic contexts that word type appears in. Contexts are represented by adding orthographic context flags: …

sent_starters = None
A set of word types for words that often appear at the beginning of sentences.
class nltk.tokenize.punkt.PunktSentenceTokenizer(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)
Bases: nltk.tokenize.punkt.PunktBaseClass, nltk.tokenize.api.TokenizerI
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

PUNCTUATION = (';', ':', ',', '.', '!', '?')

debug_decisions(text)
Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
See format_debug_decision() to help make this output readable.

sentences_from_text(text, realign_boundaries=True)
Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period.

sentences_from_text_legacy(text)
Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

sentences_from_tokens(tokens)
Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.

span_tokenize(text, realign_boundaries=True)
Given a text, generates (start, end) spans of sentences in the text.
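A short sketch of span_tokenize() using the pre-trained English model loaded earlier in this section; the text is made up and no particular output is asserted.

import nltk.data

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
text = "Punkt knows that Mr. Smith is not a sentence break. This one is."
for start, end in sent_detector.span_tokenize(text):
    print((start, end), repr(text[start:end]))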
class nltk.tokenize.punkt.PunktToken(tok, **params)
Bases: object
Stores a token of text with annotations produced during sentence boundary detection.

abbr

ellipsis

first_case

first_lower
True if the token's first character is lowercase.

first_upper
True if the token's first character is uppercase.

is_alpha
True if the token text is all alphabetic.

is_ellipsis
True if the token text is that of an ellipsis.

is_initial
True if the token text is that of an initial.

is_non_punct
True if the token is either a number or is alphabetic.

is_number
True if the token text is that of a number.

linestart

parastart

period_final

sentbreak

tok

type

type_no_period
The type with its final period removed if it has one.

type_no_sentperiod
The type with its final period removed if it is marked as a sentence break.

unicode_repr()
A string representation of the token that can reproduce it with eval(), which lists all the token's non-default annotations.
class nltk.tokenize.punkt.PunktTrainer(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)
Bases: nltk.tokenize.punkt.PunktBaseClass
Learns parameters used in Punkt sentence boundary detection.
ABBREV = 0.3
Cut-off value whether a 'token' is an abbreviation.

ABBREV_BACKOFF = 5
Upper cut-off for Mikheev's (2002) abbreviation detection algorithm.

COLLOCATION = 7.88
Minimal log-likelihood value that two tokens need to be considered as a collocation.

IGNORE_ABBREV_PENALTY = False
Allows the disabling of the abbreviation penalty heuristic, which exponentially disadvantages words that are found at times without a final period.

INCLUDE_ABBREV_COLLOCS = False
This includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are false, only collocations with initials and ordinals are considered.

INCLUDE_ALL_COLLOCS = False
This includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.

MIN_COLLOC_FREQ = 1
This sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True.

SENT_STARTER = 30
Minimal log-likelihood value that a token requires to be considered as a frequent sentence starter.

finalize_training(verbose=False)
Uses data that has been gathered in training to determine likely collocations and sentence starters.

find_abbrev_types()
Recalculates abbreviations given type frequencies, despite no prior determination of abbreviations. This fails to include abbreviations otherwise found as "rare".

freq_threshold(ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)
Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring above the given thresholds will be retained.

get_params()
Calculates and returns parameters for sentence boundary detection as derived from training.

train(text, verbose=False, finalize=True)
Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.
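A sketch of incremental training with deferred finalization, as described above. The training strings are tiny placeholders, and the last step assumes PunktSentenceTokenizer accepts a PunktParameters object in place of raw training text.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Placeholder text; in practice these would be large raw-text corpora.
corpus_part1 = "Mr. Smith moved to St. Petersburg. He arrived on Monday."
corpus_part2 = "Dr. Jones stayed in Mt. Pleasant. She left on Friday."

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True            # consider every period-final word pair
trainer.train(corpus_part1, finalize=False)   # defer finalization between batches
trainer.train(corpus_part2, finalize=False)
trainer.finalize_training(verbose=True)

# Build a sentence tokenizer from the learned parameters.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Mr. Smith is here. So is Dr. Jones."))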
nltk.tokenize.regexp module¶
Regular-Expression Tokenizers
A RegexpTokenizer
splits a string into substrings using a regular expression.
For example, the following tokenizer forms tokens out of alphabetic sequences,
money expressions, and any other non-whitespace sequences:
>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
A RegexpTokenizer
can use its regexp to match delimiters instead:
>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
Note that empty tokens are not returned when the delimiter appears at the start or end of the string.
The material between the tokens is discarded. For example, the following tokenizer selects just the capitalized words:
>>> capword_tokenizer = RegexpTokenizer('[A-Z]\w+')
>>> capword_tokenizer.tokenize(s)
['Good', 'New', 'York', 'Please', 'Thanks']
This module contains several subclasses of RegexpTokenizer
that use pre-defined regular expressions.
>>> from nltk.tokenize import BlanklineTokenizer
>>> # Uses '\s*\n\s*\n\s*':
>>> BlanklineTokenizer().tokenize(s)
['Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.',
'Thanks.']
All of the regular expression tokenizers are also available as functions:
>>> from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
>>> regexp_tokenize(s, pattern='\w+|\$[\d\.]+|\S+')
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> blankline_tokenize(s)
['Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.', 'Thanks.']
Caution: The function regexp_tokenize()
takes the text as its
first argument, and the regular expression pattern as its second
argument. This differs from the conventions used by Python’s
re
functions, where the pattern is always the first argument.
(This is for consistency with the other NLTK tokenizers.)
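A small sketch contrasting the two calling conventions; both calls are expected to produce the same token list for this pattern.

import re
from nltk.tokenize import regexp_tokenize

s = "Good muffins cost $3.88 in New York."
print(regexp_tokenize(s, pattern=r'\w+'))  # text first, pattern second
print(re.findall(r'\w+', s))               # pattern first, text second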
class nltk.tokenize.regexp.BlanklineTokenizer
Bases: nltk.tokenize.regexp.RegexpTokenizer
Tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.
class nltk.tokenize.regexp.RegexpTokenizer(pattern, gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>)
Bases: nltk.tokenize.api.TokenizerI
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
Parameters: - pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:…), instead)
- gaps (bool) – True if this tokenizer’s pattern should be used to find separators between tokens; False if this tokenizer’s pattern should be used to find the tokens themselves.
- discard_empty (bool) – True if any empty tokens ‘’ generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
- flags (int) – The regexp flags used to compile this tokenizer’s pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
span_tokenize(text)
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Return type: iter(tuple(int, int))
unicode_repr()
Return repr(self).
class nltk.tokenize.regexp.WhitespaceTokenizer
Bases: nltk.tokenize.regexp.RegexpTokenizer
Tokenize a string on whitespace (space, tab, newline). In general, users should use the string split() method instead.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> WhitespaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
class nltk.tokenize.regexp.WordPunctTokenizer
Bases: nltk.tokenize.regexp.RegexpTokenizer
Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
nltk.tokenize.regexp.regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>)
Return a tokenized copy of text. See RegexpTokenizer for descriptions of the arguments.
nltk.tokenize.repp module¶
class nltk.tokenize.repp.ReppTokenizer(repp_dir, encoding='utf8')
Bases: nltk.tokenize.api.TokenizerI
A class for word tokenization using the REPP parser described in Rebecca Dridan and Stephan Oepen (2012) Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit. In ACL. http://anthology.aclweb.org/P/P12/P12-2.pdf#page=406
>>> sents = ['Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve.' , ... 'But rule-based tokenizers are hard to maintain and their rules language specific.' , ... 'We evaluated our method on three languages and obtained error rates of 0.27% (English), 0.35% (Dutch) and 0.76% (Italian) for our best models.' ... ] >>> tokenizer = ReppTokenizer('/home/alvas/repp/') >>> for sent in sents: ... tokenizer.tokenize(sent) ... (u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.') (u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.') (u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents): ... print sent ... (u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.') (u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.') (u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.') >>> for sent in tokenizer.tokenize_sents(sents, keep_token_positions=True): ... print sent ... [(u'Tokenization', 0, 12), (u'is', 13, 15), (u'widely', 16, 22), (u'regarded', 23, 31), (u'as', 32, 34), (u'a', 35, 36), (u'solved', 37, 43), (u'problem', 44, 51), (u'due', 52, 55), (u'to', 56, 58), (u'the', 59, 62), (u'high', 63, 67), (u'accuracy', 68, 76), (u'that', 77, 81), (u'rulebased', 82, 91), (u'tokenizers', 92, 102), (u'achieve', 103, 110), (u'.', 110, 111)] [(u'But', 0, 3), (u'rule-based', 4, 14), (u'tokenizers', 15, 25), (u'are', 26, 29), (u'hard', 30, 34), (u'to', 35, 37), (u'maintain', 38, 46), (u'and', 47, 50), (u'their', 51, 56), (u'rules', 57, 62), (u'language', 63, 71), (u'specific', 72, 80), (u'.', 80, 81)] [(u'We', 0, 2), (u'evaluated', 3, 12), (u'our', 13, 16), (u'method', 17, 23), (u'on', 24, 26), (u'three', 27, 32), (u'languages', 33, 42), (u'and', 43, 46), (u'obtained', 47, 55), (u'error', 56, 61), (u'rates', 62, 67), (u'of', 68, 70), (u'0.27', 71, 75), (u'%', 75, 76), (u'(', 77, 78), (u'English', 78, 85), (u')', 85, 86), (u',', 86, 87), (u'0.35', 88, 92), (u'%', 92, 93), (u'(', 94, 95), (u'Dutch', 95, 100), (u')', 100, 101), (u'and', 102, 105), (u'0.76', 106, 110), (u'%', 110, 111), (u'(', 112, 113), (u'Italian', 113, 120), (u')', 120, 121), (u'for', 122, 125), (u'our', 126, 129), (u'best', 130, 134), (u'models', 135, 141), (u'.', 141, 142)]
find_repptokenizer(repp_dirname)
Searches for the REPP tokenizer binary and its repp.set config file.

generate_repp_command(inputfilename)
Generates the REPP command to be used at the terminal.
Parameters: inputfilename (str) – path to the input file

static parse_repp_outputs(repp_output)
Parses the tri-tuple format that REPP outputs using the "--format triple" option and returns a generator of tuples of string tokens.
Parameters: repp_output (type) –
Returns: an iterable of the tokenized sentences as tuples of strings
Return type: iter(tuple)
nltk.tokenize.sexpr module¶
S-Expression Tokenizer
SExprTokenizer
is used to find parenthesized expressions in a
string. In particular, it divides a string into a sequence of
substrings that are either parenthesized expressions (including any
nested parenthesized expressions), or other whitespace-separated
tokens.
>>> from nltk.tokenize import SExprTokenizer
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
By default, SExprTokenizer will raise a ValueError
exception if
used to tokenize an expression with non-matching parentheses:
>>> SExprTokenizer().tokenize('c) d) e (f (g')
Traceback (most recent call last):
...
ValueError: Un-matched close paren at char 1
The strict
argument can be set to False to allow for
non-matching parentheses. Any unmatched close parentheses will be
listed as their own s-expression; and the last partial sexpr with
unmatched open parentheses will be listed as its own sexpr:
>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
The characters used for open and close parentheses may be customized
using the parens
argument to the SExprTokenizer constructor:
>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']
The s-expression tokenizer is also available as a function:
>>> from nltk.tokenize import sexpr_tokenize
>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
class nltk.tokenize.sexpr.SExprTokenizer(parens='()', strict=True)
Bases: nltk.tokenize.api.TokenizerI
A tokenizer that divides strings into s-expressions. An s-expression can be either:
- a parenthesized expression, including any nested parenthesized expressions, or
- a sequence of non-whitespace non-parenthesis characters.
For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f).
By default, the characters ( and ) are treated as open and close parentheses, but alternative strings may be specified.
Parameters:
- parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
- strict – If true, then raise an exception when tokenizing an ill-formed sexpr.
tokenize(text)
Return a list of s-expressions extracted from text. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)
If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, then raise a ValueError. If strict is False, then any unmatched close parentheses will be listed as their own s-expression; and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']

Parameters: text (str or iter(str)) – the string to be tokenized
Return type: iter(str)
nltk.tokenize.simple module¶
Simple Tokenizers
These tokenizers divide strings into substrings using the string
split()
method.
When tokenizing using a particular delimiter string, use
the string split()
method directly, as this is more efficient.
The simple tokenizers are not available as separate functions;
instead, you should just use the string split()
method directly:
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York. Please buy me',
'two of them.', '', 'Thanks.']
The simple tokenizers are mainly useful because they follow the
standard TokenizerI
interface, and so can be used with any code
that expects a tokenizer. For example, these tokenizers can be used
to specify the tokenization conventions when building a CorpusReader.
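A sketch of that usage; the corpus directory and file name are hypothetical, and word_tokenizer is assumed to be the relevant PlaintextCorpusReader parameter.

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import WhitespaceTokenizer

reader = PlaintextCorpusReader(
    '/path/to/corpus',   # hypothetical directory containing .txt files
    r'.*\.txt',          # fileid pattern
    word_tokenizer=WhitespaceTokenizer(),
)
print(reader.words('example.txt')[:10])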
class nltk.tokenize.simple.CharTokenizer
Bases: nltk.tokenize.api.StringTokenizer
Tokenize a string into individual characters. If this functionality is ever required directly, use for char in string.
class nltk.tokenize.simple.LineTokenizer(blanklines='discard')
Bases: nltk.tokenize.api.TokenizerI
Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', 'Thanks.']

Parameters: blanklines – Indicates how blank lines should be handled. Valid values are:
- discard: strip blank lines out of the token list before returning it. A line is considered blank if it contains only whitespace characters.
- keep: leave all blank lines in the token list.
- discard-eof: if the string ends with a newline, then do not generate a corresponding token '' after that newline.
class nltk.tokenize.simple.SpaceTokenizer
Bases: nltk.tokenize.api.StringTokenizer
Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

>>> from nltk.tokenize import SpaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> SpaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '', 'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
class nltk.tokenize.simple.TabTokenizer
Bases: nltk.tokenize.api.StringTokenizer
Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

>>> from nltk.tokenize import TabTokenizer
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
nltk.tokenize.stanford module¶
class nltk.tokenize.stanford.StanfordTokenizer(path_to_jar=None, encoding='utf8', options=None, verbose=False, java_options='-mx1000m')
Bases: nltk.tokenize.api.TokenizerI
Interface to the Stanford Tokenizer

>>> from nltk.tokenize.stanford import StanfordTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks."
>>> StanfordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
nltk.tokenize.stanford_segmenter module¶
class nltk.tokenize.stanford_segmenter.StanfordSegmenter(path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g')
Bases: nltk.tokenize.api.TokenizerI
Interface to the Stanford Segmenter
If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:
seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')

>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试
>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
nltk.tokenize.texttiling module¶
class nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
Bases: nltk.tokenize.api.TokenizerI
Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.
The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. The boundaries are normalized to the closest paragraph break and the segmented text is returned.
Parameters: - w (int) – Pseudosentence size
- k (int) – Size (in sentences) of the block used in the block comparison method
- similarity_method (constant) – The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION.
- stopwords (list(str)) – A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus)
- smoothing_method (constant) – The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
- smoothing_width (int) – The width of the window used by the smoothing method
- smoothing_rounds (int) – The number of smoothing passes
- cutoff_policy (constant) – The policy used to determine the number of boundaries: HC (default) or LC
>>> from nltk.corpus import brown
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
class nltk.tokenize.texttiling.TokenSequence(index, wrdindex_list, original_length=None)
Bases: object
A token list with its original length and its index
class nltk.tokenize.texttiling.TokenTableField(first_pos, ts_occurences, total_count=1, par_count=1, last_par=0, last_tok_seq=None)
Bases: object
A field in the token table holding parameters for each token, used later in the process
nltk.tokenize.texttiling.smooth(x, window_len=11, window='flat')
Smooth the data using a window of the requested size.
This method is based on the convolution of a scaled window with the signal. The signal is prepared by introducing reflected copies of the signal (with the window size) in both ends so that transient parts are minimized in the beginning and end part of the output signal.
Parameters: - x – the input signal
- window_len – the dimension of the smoothing window; should be an odd integer
- window – the type of window from ‘flat’, ‘hanning’, ‘hamming’, ‘bartlett’, ‘blackman’ flat window will produce a moving average smoothing.
Returns: the smoothed signal
Example:

>>> import numpy as np
>>> t = np.arange(-2, 2, 0.1)
>>> x = np.sin(t) + np.random.randn(len(t)) * 0.1
>>> y = smooth(x)

See also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve, scipy.signal.lfilter
TODO: the window parameter could be the window itself if given as an array instead of a string.
nltk.tokenize.toktok module¶
The tok-tok tokenizer is a simple, general tokenizer, where the input has one sentence per line; thus only the final period is tokenized.
Tok-tok has been tested on, and gives reasonably good results for English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others. The input should be in UTF-8 encoding.
Reference: Jon Dehdari. 2014. A Neurophysiologically-Inspired Statistical Language Model (Doctoral dissertation). Columbus, OH, USA: The Ohio State University.
class nltk.tokenize.toktok.ToktokTokenizer
Bases: nltk.tokenize.api.TokenizerI
This is a Python port of the tok-tok.pl from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl
>>> toktok = ToktokTokenizer()
>>> text = u'Is 9.5 or 525,600 my favorite number?'
>>> print(toktok.tokenize(text, return_str=True))
Is 9.5 or 525,600 my favorite number ?
>>> text = u'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things'
>>> print(toktok.tokenize(text, return_str=True))
The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
>>> text = u'¡This, is a sentence with weird» symbols… appearing everywhere¿'
>>> expected = u'¡ This , is a sentence with weird » symbols … appearing everywhere ¿'
>>> assert toktok.tokenize(text, return_str=True) == expected
>>> toktok.tokenize(text) == [u'¡', u'This', u',', u'is', u'a', u'sentence', u'with', u'weird', u'»', u'symbols', u'…', u'appearing', u'everywhere', u'¿']
True
-
AMPERCENT
= (re.compile('& '), '& ')¶
-
CLOSE_PUNCT
= ')]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」'¶
-
CLOSE_PUNCT_RE
= (re.compile('([)]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」])'), '\\1 ')¶
-
COMMA_IN_NUM
= (re.compile('(?<!,)([,،])(?![,\\d])'), ' \\1 ')¶
-
CURRENCY_SYM
= '$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩'¶
-
CURRENCY_SYM_RE
= (re.compile('([$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩])'), '\\1 ')¶
-
EN_EM_DASHES
= (re.compile('([–—])'), ' \\1 ')¶
-
FINAL_PERIOD_1
= (re.compile('(?<!\\.)\\.$'), ' .')¶
-
FINAL_PERIOD_2
= (re.compile('(?<!\\.)\\.\\s*(["\'’»›”]) *$'), ' . \\1')¶
-
FUNKY_PUNCT_1
= (re.compile('([،;؛¿!"\\])}»›”؟¡%٪°±©®।॥…])'), ' \\1 ')¶
-
FUNKY_PUNCT_2
= (re.compile('([({\\[“‘„‚«‹「『])'), ' \\1 ')¶
-
LSTRIP
= (re.compile('^ +'), '')¶
-
MULTI_COMMAS
= (re.compile('(,{2,})'), ' \\1 ')¶
-
MULTI_DASHES
= (re.compile('(-{2,})'), ' \\1 ')¶
-
MULTI_DOTS
= (re.compile('(\\.{2,})'), ' \\1 ')¶
-
NON_BREAKING
= (re.compile('\xa0'), ' ')¶
-
ONE_SPACE
= (re.compile(' {2,}'), ' ')¶
-
OPEN_PUNCT
= '([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「'¶
-
OPEN_PUNCT_RE
= (re.compile('([([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「])'), '\\1 ')¶
-
PIPE
= (re.compile('\\|'), ' | ')¶
-
PROB_SINGLE_QUOTES
= (re.compile("(['’`])"), ' \\1 ')¶
-
RSTRIP
= (re.compile('\\s+$'), '\n')¶
-
STUPID_QUOTES_1
= (re.compile(' ` ` '), ' `` ')¶
-
STUPID_QUOTES_2
= (re.compile(" ' ' "), " '' ")¶
-
TAB
= (re.compile('\t'), ' 	 ')¶
-
TOKTOK_REGEXES
= [(re.compile('\xa0'), ' '), (re.compile('([،;؛¿!"\\])}»›”؟¡%٪°±©®।॥…])'), ' \\1 '), (re.compile(':(?!//)'), ' : '), (re.compile('\\?(?!\\S)'), ' ? '), (re.compile('(:\\/\\/)[\\S+\\.\\S+\\/\\S+][\\/]'), ' / '), (re.compile(' /'), ' / '), (re.compile('& '), '& '), (re.compile('\t'), ' 	 '), (re.compile('\\|'), ' | '), (re.compile('([([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「])'), '\\1 '), (re.compile('([)]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」])'), '\\1 '), (re.compile('(,{2,})'), ' \\1 '), (re.compile('(?<!,)([,،])(?![,\\d])'), ' \\1 '), (re.compile('(?<!\\.)\\.\\s*(["\'’»›”]) *$'), ' . \\1'), (re.compile("(['’`])"), ' \\1 '), (re.compile(' ` ` '), ' `` '), (re.compile(" ' ' "), " '' "), (re.compile('([$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩])'), '\\1 '), (re.compile('([–—])'), ' \\1 '), (re.compile('(-{2,})'), ' \\1 '), (re.compile('(\\.{2,})'), ' \\1 '), (re.compile('(?<!\\.)\\.$'), ' .'), (re.compile('(?<!\\.)\\.\\s*(["\'’»›”]) *$'), ' . \\1'), (re.compile(' {2,}'), ' ')]¶
-
URL_FOE_1
= (re.compile(':(?!//)'), ' : ')¶
-
URL_FOE_2
= (re.compile('\\?(?!\\S)'), ' ? ')¶
-
URL_FOE_3
= (re.compile('(:\\/\\/)[\\S+\\.\\S+\\/\\S+][\\/]'), ' / ')¶
-
URL_FOE_4
= (re.compile(' /'), ' / ')¶
-
nltk.tokenize.treebank module¶
Penn Treebank Tokenizer
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre and available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.
class nltk.tokenize.treebank.MacIntyreContractions
Bases: object
List of contractions adapted from Robert MacIntyre's tokenizer.
CONTRACTIONS2 = ['(?i)\\b(can)(?#X)(not)\\b', "(?i)\\b(d)(?#X)('ye)\\b", '(?i)\\b(gim)(?#X)(me)\\b', '(?i)\\b(gon)(?#X)(na)\\b', '(?i)\\b(got)(?#X)(ta)\\b', '(?i)\\b(lem)(?#X)(me)\\b', "(?i)\\b(mor)(?#X)('n)\\b", '(?i)\\b(wan)(?#X)(na)\\s']
CONTRACTIONS3 = ["(?i) ('t)(?#X)(is)\\b", "(?i) ('t)(?#X)(was)\\b"]
CONTRACTIONS4 = ['(?i)\\b(whad)(dd)(ya)\\b', '(?i)\\b(wha)(t)(cha)\\b']
class
nltk.tokenize.treebank.
TreebankWordDetokenizer
[source]¶ Bases:
nltk.tokenize.api.TokenizerI
The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer’s regexes.
Note: - There’re additional assumption mades when undoing the padding of [;@#$%&]
punctuation symbols that isn’t presupposed in the TreebankTokenizer.- There’re additional regexes added in reversing the parentheses tokenization,
- the r’([])}>])s([:;,.])’ removes the additional right padding added to the closing parentheses precedding [:;,.].
- It’s not possible to return the original whitespaces as they were because there wasn’t explicit records of where ‘
- ‘, ‘ ‘ or ‘s’ were removed at
the text.split() operation.
>>> from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> d = TreebankWordDetokenizer()
>>> t = TreebankWordTokenizer()
>>> toks = t.tokenize(s)
>>> d.detokenize(toks)
'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
The MXPOST parentheses substitution can be undone using the convert_parentheses parameter:
>>> s = '''Good muffins cost $3.88\nin New (York). Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected_tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '-LRB-', 'York', '-RRB-', '.', 'Please', '-LRB-', 'buy',
... '-RRB-', 'me', 'two', 'of', 'them.', '-LRB-', 'Thanks', '-RRB-', '.']
>>> expected_tokens == t.tokenize(s, convert_parentheses=True)
True
>>> expected_detoken = 'Good muffins cost $3.88 in New (York). Please (buy) me two of them. (Thanks).'
>>> expected_detoken == d.detokenize(t.tokenize(s, convert_parentheses=True), convert_parentheses=True)
True
During tokenization it’s safe to add more spaces but during detokenization, simply undoing the padding doesn’t really help.
- During tokenization, left and right pad is added to [!?], when detokenizing, only left shift the [!?] is needed. Thus (re.compile(r’s([?!])’), r’g<1>’)
- During tokenization [:,] are left and right padded but when detokenizing, only left shift is necessary and we keep right pad after comma/colon if the string after is a non-digit. Thus (re.compile(r’s([:,])s([^d])’), r’ ’)
>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
>>> twd = TreebankWordDetokenizer()
>>> twd.detokenize(toks)
"hello, i can't feel my feet! Help!!"
>>> toks = ['hello', ',', 'i', "can't", 'feel', ';', 'my', 'feet', '!',
... 'Help', '!', '!', 'He', 'said', ':', 'Help', ',', 'help', '?', '!']
>>> twd.detokenize(toks)
"hello, i can't feel; my feet! Help!! He said: Help, help?!"
-
CONTRACTIONS2
= [re.compile('(?i)\\b(can)\\s(not)\\b', re.IGNORECASE), re.compile("(?i)\\b(d)\\s('ye)\\b", re.IGNORECASE), re.compile('(?i)\\b(gim)\\s(me)\\b', re.IGNORECASE), re.compile('(?i)\\b(gon)\\s(na)\\b', re.IGNORECASE), re.compile('(?i)\\b(got)\\s(ta)\\b', re.IGNORECASE), re.compile('(?i)\\b(lem)\\s(me)\\b', re.IGNORECASE), re.compile("(?i)\\b(mor)\\s('n)\\b", re.IGNORECASE), re.compile('(?i)\\b(wan)\\s(na)\\s', re.IGNORECASE)]¶
-
CONTRACTIONS3
= [re.compile("(?i) ('t)\\s(is)\\b", re.IGNORECASE), re.compile("(?i) ('t)\\s(was)\\b", re.IGNORECASE)]¶
-
CONVERT_PARENTHESES
= [(re.compile('-LRB-'), '('), (re.compile('-RRB-'), ')'), (re.compile('-LSB-'), '['), (re.compile('-RSB-'), ']'), (re.compile('-LCB-'), '{'), (re.compile('-RCB-'), '}')]¶
-
DOUBLE_DASHES
= (re.compile(' -- '), '--')¶
-
ENDING_QUOTES
= [(re.compile("([^' ])\\s('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), '\\1\\2 '), (re.compile("([^' ])\\s('[sS]|'[mM]|'[dD]|') "), '\\1\\2 '), (re.compile("(\\S)(\\'\\')"), '\\1\\2 '), (re.compile(" '' "), '"')]¶
-
PARENS_BRACKETS
= [(re.compile('\\s([\\[\\(\\{\\<])\\s'), ' \\g<1>'), (re.compile('\\s([\\]\\)\\}\\>])\\s'), '\\g<1> '), (re.compile('([\\]\\)\\}\\>])\\s([:;,.])'), '\\1\\2')]¶
-
PUNCTUATION
= [(re.compile("([^'])\\s'\\s"), "\\1' "), (re.compile('\\s([?!])'), '\\g<1>'), (re.compile('([^\\.])\\s(\\.)([\\]\\)}>"\\\']*)\\s*$'), '\\1\\2\\3'), (re.compile('\\s([#$])\\s'), ' \\g<1>'), (re.compile('\\s([;%])\\s'), '\\g<1> '), (re.compile('\\s([&])\\s'), ' \\g<1> '), (re.compile('\\s\\.\\.\\.\\s'), '...'), (re.compile('\\s([:,])\\s$'), '\\1'), (re.compile('\\s([:,])\\s([^\\d])'), '\\1 \\2')]¶
-
STARTING_QUOTES
= [(re.compile('([ (\\[{<])\\s``'), '\\1"'), (re.compile('\\s(``)\\s'), '\\1'), (re.compile('^``'), '\\"')]¶
-
class
nltk.tokenize.treebank.
TreebankWordTokenizer
[source]¶ Bases:
nltk.tokenize.api.TokenizerI
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
This tokenizer performs the following steps:
- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of line
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
>>> s = "hi, my name can't hello,"
>>> TreebankWordTokenizer().tokenize(s)
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
-
CONTRACTIONS2
= [re.compile('(?i)\\b(can)(?#X)(not)\\b', re.IGNORECASE), re.compile("(?i)\\b(d)(?#X)('ye)\\b", re.IGNORECASE), re.compile('(?i)\\b(gim)(?#X)(me)\\b', re.IGNORECASE), re.compile('(?i)\\b(gon)(?#X)(na)\\b', re.IGNORECASE), re.compile('(?i)\\b(got)(?#X)(ta)\\b', re.IGNORECASE), re.compile('(?i)\\b(lem)(?#X)(me)\\b', re.IGNORECASE), re.compile("(?i)\\b(mor)(?#X)('n)\\b", re.IGNORECASE), re.compile('(?i)\\b(wan)(?#X)(na)\\s', re.IGNORECASE)]¶
-
CONTRACTIONS3
= [re.compile("(?i) ('t)(?#X)(is)\\b", re.IGNORECASE), re.compile("(?i) ('t)(?#X)(was)\\b", re.IGNORECASE)]¶
-
CONVERT_PARENTHESES
= [(re.compile('\\('), '-LRB-'), (re.compile('\\)'), '-RRB-'), (re.compile('\\['), '-LSB-'), (re.compile('\\]'), '-RSB-'), (re.compile('\\{'), '-LCB-'), (re.compile('\\}'), '-RCB-')]¶
-
DOUBLE_DASHES
= (re.compile('--'), ' -- ')¶
-
ENDING_QUOTES
= [(re.compile('([»”’])'), ' \\1 '), (re.compile('"'), " '' "), (re.compile("(\\S)(\\'\\')"), '\\1 \\2 '), (re.compile("([^' ])('[sS]|'[mM]|'[dD]|') "), '\\1 \\2 '), (re.compile("([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), '\\1 \\2 ')]¶
-
PARENS_BRACKETS
= (re.compile('[\\]\\[\\(\\)\\{\\}\\<\\>]'), ' \\g<0> ')¶
-
PUNCTUATION
= [(re.compile('([^\\.])(\\.)([\\]\\)}>"\\\'»”’ ]*)\\s*$'), '\\1 \\2 \\3 '), (re.compile('([:,])([^\\d])'), ' \\1 \\2'), (re.compile('([:,])$'), ' \\1 '), (re.compile('\\.\\.\\.'), ' ... '), (re.compile('[;@#$%&]'), ' \\g<0> '), (re.compile('([^\\.])(\\.)([\\]\\)}>"\\\']*)\\s*$'), '\\1 \\2\\3 '), (re.compile('[?!]'), ' \\g<0> '), (re.compile("([^'])' "), "\\1 ' ")]¶
-
STARTING_QUOTES
= [(re.compile('([«“‘„]|[`]+)'), ' \\1 '), (re.compile('^\\"'), '``'), (re.compile('(``)'), ' \\1 '), (re.compile('([ \\(\\[{<])(\\"|\\\'{2})'), '\\1 `` '), (re.compile("(?i)(\\')(?!re|ve|ll|m|t|s|d)(\\w)\\b", re.IGNORECASE), '\\1 \\2')]¶
span_tokenize(text)
Uses the post-hoc nltk.tokens.align_tokens to return the offset spans.

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New (York). Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
... (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
... (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
... (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
... 'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True
Additional example:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''I said, "I'd like to buy some ''good muffins" which cost $3.88\n each in New (York)."'''
>>> expected = [(0, 1), (2, 6), (6, 7), (8, 9), (9, 10), (10, 12),
... (13, 17), (18, 20), (21, 24), (25, 29), (30, 32), (32, 36),
... (37, 44), (44, 45), (46, 51), (52, 56), (57, 58), (58, 62),
... (64, 68), (69, 71), (72, 75), (76, 77), (77, 81), (81, 82),
... (82, 83), (83, 84)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['I', 'said', ',', '"', 'I', "'d", 'like', 'to',
... 'buy', 'some', "''", "good", 'muffins', '"', 'which', 'cost',
... '$', '3.88', 'each', 'in', 'New', '(', 'York', ')', '.', '"']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True
nltk.tokenize.util module¶
class nltk.tokenize.util.CJKChars
Bases: object
An object that enumerates the code points of the CJK characters as listed on http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane
This is a Python port of the CJK code point enumerations of Moses tokenizer: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
CJK_Compatibility_Forms = (65072, 65103)
CJK_Compatibility_Ideographs = (63744, 64255)
CJK_Radicals = (11904, 42191)
Hangul_Jamo = (4352, 4607)
Hangul_Syllables = (44032, 55215)
Katakana_Hangul_Halfwidth = (65381, 65500)
Phags_Pa = (43072, 43135)
Supplementary_Ideographic_Plane = (131072, 196607)
ranges = [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), (63744, 64255), (65072, 65103), (65381, 65500), (131072, 196607)]
nltk.tokenize.util.
align_tokens
(tokens, sentence)[source]¶ This module attempt to find the offsets of the tokens in s, as a sequence of
(start, end)
tuples, given the tokens and also the source string.>>> from nltk.tokenize import TreebankWordTokenizer >>> from nltk.tokenize.util import align_tokens >>> s = str("The plane, bound for St Petersburg, crashed in Egypt's " ... "Sinai desert just 23 minutes after take-off from Sharm el-Sheikh " ... "on Saturday.") >>> tokens = TreebankWordTokenizer().tokenize(s) >>> expected = [(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), ... (24, 34), (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), ... (55, 60), (61, 67), (68, 72), (73, 75), (76, 83), (84, 89), ... (90, 98), (99, 103), (104, 109), (110, 119), (120, 122), ... (123, 131), (131, 132)] >>> output = list(align_tokens(tokens, s)) >>> len(tokens) == len(expected) == len(output) # Check that length of tokens and tuples are the same. True >>> expected == list(align_tokens(tokens, s)) # Check that the output is as expected. True >>> tokens == [s[start:end] for start, end in output] # Check that the slices of the string corresponds to the tokens. True
Parameters: - tokens (list(str)) – The list of strings that are the result of tokenization
- sentence (str) – The original string
Return type: list(tuple(int,int))
nltk.tokenize.util.is_cjk(character)
Python port of Moses' code to check for a CJK character.
>>> CJKChars().ranges
[(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), (63744, 64255), (65072, 65103), (65381, 65500), (131072, 196607)]
>>> is_cjk(u'㏾')
True
>>> is_cjk(u'﹟')
False
Parameters: character (char) – The character that needs to be checked. Returns: bool
nltk.tokenize.util.regexp_span_tokenize(s, regexp)
Return the offsets of the tokens in s, as a sequence of (start, end)
tuples, by splitting the string at each successive match of regexp.>>> from nltk.tokenize.util import regexp_span_tokenize >>> s = '''Good muffins cost $3.88\nin New York. Please buy me ... two of them.\n\nThanks.''' >>> list(regexp_span_tokenize(s, r'\s')) [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
Parameters: - s (str) – the string to be tokenized
- regexp (str) – regular expression that matches token separators (must not be empty)
Return type: iter(tuple(int, int))
nltk.tokenize.util.spans_to_relative(spans)
Return a sequence of relative spans, given a sequence of spans.
>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(s)))
[(0, 4), (1, 7), (1, 4), (1, 5), (1, 2), (1, 3), (1, 5), (2, 6), (1, 3), (1, 2), (1, 3), (1, 2), (1, 5), (2, 7)]
Parameters: spans (iter(tuple(int, int))) – a sequence of (start, end) offsets of the tokens Return type: iter(tuple(int, int))
nltk.tokenize.util.string_span_tokenize(s, sep)
Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.

>>> from nltk.tokenize.util import string_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> list(string_span_tokenize(s, " "))
[(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37), (38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]
Parameters: - s (str) – the string to be tokenized
- sep (str) – the token separator
Return type: iter(tuple(int, int))
nltk.tokenize.util.xml_escape(text)
This function transforms the input text into an "escaped" version suitable for well-formed XML formatting.
Note that the default xml.sax.saxutils.escape() function doesn't escape some characters that Moses does, so we have to add them manually to the entities dictionary.
>>> input_str = ''')| & < > ' " ] [''' >>> expected_output = ''')| & < > ' " ] [''' >>> escape(input_str) == expected_output True >>> xml_escape(input_str) ')| & < > ' " ] ['
Parameters: text (str) – The text that needs to be escaped. Return type: str
nltk.tokenize.util.xml_unescape(text)
This function transforms the "escaped" version suitable for well-formed XML formatting back into a human-readable string.
Note that the default xml.sax.saxutils.unescape() function doesn't unescape some characters that Moses does, so we have to add them manually to the entities dictionary.
>>> from xml.sax.saxutils import unescape >>> s = ')| & < > ' " ] [' >>> expected = ''')| & < > ' " ] [''' >>> xml_unescape(s) == expected True
Parameters: text (str) – The text that needs to be unescaped. Return type: str
Module contents¶
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
For further information, please see Chapter 3 of the NLTK book.
nltk.tokenize.sent_tokenize(text, language='english')
Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
Parameters:
- text – text to split into sentences
- language – the model name in the Punkt corpus
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
Parameters:
- text (str) – text to split into words
- language (str) – the model name in the Punkt corpus
- preserve_line (bool) – a flag to decide whether to sentence tokenize the text or not; if True, the text is not split into sentences first.
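A short sketch of the preserve_line option (requires the Punkt models); no particular output is asserted.

from nltk.tokenize import word_tokenize

text = "Good muffins cost $3.88\nin New York. Please buy me two of them."
print(word_tokenize(text))                      # sentence-split first, then word-tokenized
print(word_tokenize(text, preserve_line=True))  # the whole string is treated as one line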