nltk.tokenize package

Submodules

nltk.tokenize.api module

Tokenizer Interface

class nltk.tokenize.api.StringTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:iter(tuple(int, int))
tokenize(s)[source]

Return a tokenized copy of s.

Return type:list of str
class nltk.tokenize.api.TokenizerI[source]

Bases: object

A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
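
For instance, a minimal custom tokenizer only needs to override tokenize(); tokenize_sents() then comes for free, as documented below. (The CommaTokenizer class here is a hypothetical illustration, not part of NLTK.)

>>> from nltk.tokenize.api import TokenizerI
>>> class CommaTokenizer(TokenizerI):  # hypothetical example class
...     def tokenize(self, s):
...         return s.split(',')
>>> CommaTokenizer().tokenize('a,b,c')
['a', 'b', 'c']
>>> CommaTokenizer().tokenize_sents(['a,b', 'c,d'])  # inherited default implementation
[['a', 'b'], ['c', 'd']]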

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:iter(tuple(int, int))
span_tokenize_sents(strings)[source]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]
Return type:iter(list(tuple(int, int)))
tokenize(s)[source]

Return a tokenized copy of s.

Return type:list of str
tokenize_sents(strings)[source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]
Return type:list(list(str))

nltk.tokenize.casual module

Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:

  1. The tuple regex_strings defines a list of regular expression strings.
  2. The regex_strings strings are put, in order, into a compiled regular expression object called word_re.
  3. The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class TweetTokenizer.
  4. When instantiating TweetTokenizer objects, the preserve_case option defaults to True. If it is set to False, then the tokenizer will downcase everything except for emoticons.
class nltk.tokenize.casual.TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)[source]

Bases: object

Tokenizer for tweets.

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

Examples using strip_handles and reduce_len parameters:

>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
tokenize(text)[source]
Parameters:text – str
Return type:list(str)
Returns:a tokenized list of strings; concatenating this list returns the original string if preserve_case=False
nltk.tokenize.casual.casual_tokenize(text, preserve_case=True, reduce_len=False, strip_handles=False)[source]

Convenience function for wrapping the tokenizer.
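
For example, the following sketch should mirror the TweetTokenizer example above, since the wrapped tokenizer is constructed with the given options (output assumed to match that example):

>>> from nltk.tokenize.casual import casual_tokenize
>>> casual_tokenize('@remy: This is waaaaayyyy too much for you!!!!!!',
...                 strip_handles=True, reduce_len=True)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']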

nltk.tokenize.casual.int2byte()

S.pack(v1, v2, …) -> bytes

Return a bytes object containing values v1, v2, … packed according to the format string S.format. See help(struct) for more on format strings.

nltk.tokenize.casual.reduce_lengthening(text)[source]

Replace repeated character sequences of length 3 or greater with sequences of length 3.
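
For example (the expected output follows from the rule above and from the TweetTokenizer doctest, where 'waaaaayyyy' becomes 'waaayyy'):

>>> from nltk.tokenize.casual import reduce_lengthening
>>> reduce_lengthening('waaaaayyyy')
'waaayyy'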

nltk.tokenize.casual.remove_handles(text)[source]

Remove Twitter username handles from text.

nltk.tokenize.mwe module

Multi-Word Expression Tokenizer

A MWETokenizer takes a string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs:

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
>>> tokenizer.add_mwe(('in', 'spite', 'of'))
>>> tokenizer.tokenize('Testing testing testing one two three'.split())
['Testing', 'testing', 'testing', 'one', 'two', 'three']
>>> tokenizer.tokenize('This is a test in spite'.split())
['This', 'is', 'a', 'test', 'in', 'spite']
>>> tokenizer.tokenize('In a little or a little bit or a lot in spite of'.split())
['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']
class nltk.tokenize.mwe.MWETokenizer(mwes=None, separator='_')[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.

add_mwe(mwe)[source]

Add a multi-word expression to the lexicon (stored as a word trie)

We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.

Parameters:mwe (tuple(str) or list(str)) – The multi-word expression we’re adding into the word trie
Example:
>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
tokenize(text)[source]
Parameters:text (list(str)) – A list containing tokenized text
Returns:A list of the tokenized text with multi-words merged together
Return type:list(str)
Example:
>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']

nltk.tokenize.nist module

nltk.tokenize.punkt module

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

(Note that whitespace from the original text, including newlines, is retained in the output.)

Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag.

>>> text = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
)  "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.

PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.
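
A minimal sketch of such incremental training (text_a and text_b stand for large plaintext strings in the target language; passing the learned parameters to the tokenizer in place of raw training text is an assumption based on the train() description below):

>>> from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer
>>> trainer = PunktTrainer()
>>> trainer.train(text_a, finalize=False, verbose=False)
>>> trainer.train(text_b, finalize=False, verbose=False)
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())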

The algorithm for this tokenizer is described in:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.
class nltk.tokenize.punkt.PunktBaseClass(lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>, params=None)[source]

Bases: object

Includes common components of PunktTrainer and PunktSentenceTokenizer.

class nltk.tokenize.punkt.PunktLanguageVars[source]

Bases: object

Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors.
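
A hypothetical sketch of such an extension, adding the semicolon as an extra sentence-boundary candidate (the subclass name and the choice of character are illustrative only):

>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class SemicolonLangVars(PunktLanguageVars):  # hypothetical subclass
...     sent_end_chars = ('.', '?', '!', ';')    # extend the default candidates
>>> tokenizer = PunktSentenceTokenizer(lang_vars=SemicolonLangVars())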

internal_punctuation = ',:;'

Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.

period_context_re()[source]

Compiles and returns a regular expression to find contexts including possible sentence boundaries.

re_boundary_realignment = re.compile('["\\\')\\]}]+?(?:\\s+|(?=--)|$)', re.MULTILINE)

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

sent_end_chars = ('.', '?', '!')

Characters which are candidates for sentence boundaries

word_tokenize(s)[source]

Tokenize a string to split off punctuation other than periods

class nltk.tokenize.punkt.PunktParameters[source]

Bases: object

Stores data used to perform sentence boundary detection with Punkt.

abbrev_types = None

A set of word types for known abbreviations.

add_ortho_context(typ, flag)[source]
clear_abbrevs()[source]
clear_collocations()[source]
clear_ortho_context()[source]
clear_sent_starters()[source]
collocations = None

A set of word type tuples for known common collocations where the first word ends in a period. E.g., (‘S.’, ‘Bach’) is a common collocation in a text that discusses ‘Johann S. Bach’. These count as negative evidence for sentence boundaries.

ortho_context = None

A dictionary mapping word types to the set of orthographic contexts that word type appears in. Contexts are represented by adding orthographic context flags: …

sent_starters = None

A set of word types for words that often appear at the beginning of sentences.

class nltk.tokenize.punkt.PunktSentenceTokenizer(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

Bases: nltk.tokenize.punkt.PunktBaseClass, nltk.tokenize.api.TokenizerI

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

PUNCTUATION = (';', ':', ',', '.', '!', '?')
debug_decisions(text)[source]

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.

dump(tokens)[source]
sentences_from_text(text, realign_boundaries=True)[source]

Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period.

sentences_from_text_legacy(text)[source]

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

sentences_from_tokens(tokens)[source]

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.

span_tokenize(text, realign_boundaries=True)[source]

Given a text, generates (start, end) spans of sentences in the text.

text_contains_sentbreak(text)[source]

Returns True if the given text includes a sentence break.

tokenize(text, realign_boundaries=True)[source]

Given a text, returns a list of the sentences in that text.

train(train_text, verbose=False)[source]

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.

class nltk.tokenize.punkt.PunktToken(tok, **params)[source]

Bases: object

Stores a token of text with annotations produced during sentence boundary detection.

abbr
ellipsis
first_case
first_lower

True if the token’s first character is lowercase.

first_upper

True if the token’s first character is uppercase.

is_alpha

True if the token text is all alphabetic.

is_ellipsis

True if the token text is that of an ellipsis.

is_initial

True if the token text is that of an initial.

is_non_punct

True if the token is either a number or is alphabetic.

is_number

True if the token text is that of a number.

linestart
parastart
period_final
sentbreak
tok
type
type_no_period

The type with its final period removed if it has one.

type_no_sentperiod

The type with its final period removed if it is marked as a sentence break.

unicode_repr()

A string representation of the token that can reproduce it with eval(), which lists all the token’s non-default annotations.

class nltk.tokenize.punkt.PunktTrainer(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

Bases: nltk.tokenize.punkt.PunktBaseClass

Learns parameters used in Punkt sentence boundary detection.

ABBREV = 0.3

Cut-off value used to decide whether a ‘token’ is an abbreviation.

ABBREV_BACKOFF = 5

Upper cut-off for Mikheev’s (2002) abbreviation detection algorithm.

COLLOCATION = 7.88

Minimum log-likelihood value required for two tokens to be considered a collocation.

IGNORE_ABBREV_PENALTY = False

Allows disabling the abbreviation penalty heuristic, which exponentially disadvantages words that sometimes occur without a final period.

INCLUDE_ABBREV_COLLOCS = False

Includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence-starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are False, only collocations with initials and ordinals are considered.

INCLUDE_ALL_COLLOCS = False

Includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.

MIN_COLLOC_FREQ = 1

Sets a minimum bound on the number of times a bigram must appear before it can be considered a collocation, in addition to the log-likelihood statistic. This is useful when INCLUDE_ALL_COLLOCS is True.

SENT_STARTER = 30

Minimum log-likelihood value required for a token to be considered a frequent sentence starter.

finalize_training(verbose=False)[source]

Uses data that has been gathered in training to determine likely collocations and sentence starters.

find_abbrev_types()[source]

Recalculates abbreviations given type frequencies, despite no prior determination of abbreviations. This fails to include abbreviations otherwise found as “rare”.

freq_threshold(ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)[source]

Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring above the given thresholds will be retained.

get_params()[source]

Calculates and returns parameters for sentence boundary detection as derived from training.

train(text, verbose=False, finalize=True)[source]

Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.

train_tokens(tokens, verbose=False, finalize=True)[source]

Collects training data from a given list of tokens.

nltk.tokenize.punkt.demo(text, tok_cls=<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>, train_cls=<class 'nltk.tokenize.punkt.PunktTrainer'>)[source]

Builds a punkt model and applies it to the same text

nltk.tokenize.punkt.format_debug_decision(d)[source]

nltk.tokenize.regexp module

Regular-Expression Tokenizers

A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

A RegexpTokenizer can use its regexp to match delimiters instead:

>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']

Note that empty tokens are not returned when the delimiter appears at the start or end of the string.

The material between the tokens is discarded. For example, the following tokenizer selects just the capitalized words:

>>> capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')
>>> capword_tokenizer.tokenize(s)
['Good', 'New', 'York', 'Please', 'Thanks']

This module contains several subclasses of RegexpTokenizer that use pre-defined regular expressions.

>>> from nltk.tokenize import BlanklineTokenizer
>>> # Uses '\s*\n\s*\n\s*':
>>> BlanklineTokenizer().tokenize(s)
['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.',
'Thanks.']

All of the regular expression tokenizers are also available as functions:

>>> from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
>>> regexp_tokenize(s, pattern=r'\w+|\$[\d\.]+|\S+')
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
 '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> blankline_tokenize(s)
['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.', 'Thanks.']

Caution: The function regexp_tokenize() takes the text as its first argument, and the regular expression pattern as its second argument. This differs from the conventions used by Python’s re functions, where the pattern is always the first argument. (This is for consistency with the other NLTK tokenizers.)
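
A quick illustration of the argument order:

>>> import re
>>> from nltk.tokenize import regexp_tokenize
>>> text = "Good muffins cost $3.88"
>>> re.findall(r'\w+', text)        # re: pattern first, then text
['Good', 'muffins', 'cost', '3', '88']
>>> regexp_tokenize(text, r'\w+')   # NLTK: text first, then pattern
['Good', 'muffins', 'cost', '3', '88']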

class nltk.tokenize.regexp.BlanklineTokenizer[source]

Bases: nltk.tokenize.regexp.RegexpTokenizer

Tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.

class nltk.tokenize.regexp.RegexpTokenizer(pattern, gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>)[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
Parameters:
  • pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:…), instead)
  • gaps (bool) – True if this tokenizer’s pattern should be used to find separators between tokens; False if this tokenizer’s pattern should be used to find the tokens themselves.
  • discard_empty (bool) – True if any empty tokens ‘’ generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
  • flags (int) – The regexp flags used to compile this tokenizer’s pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
span_tokenize(text)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:iter(tuple(int, int))
tokenize(text)[source]

Return a tokenized copy of s.

Return type:list of str
unicode_repr()

Return repr(self).

class nltk.tokenize.regexp.WhitespaceTokenizer[source]

Bases: nltk.tokenize.regexp.RegexpTokenizer

Tokenize a string on whitespace (space, tab, newline). In general, users should use the string split() method instead.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> WhitespaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
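
Like all TokenizerI subclasses, WhitespaceTokenizer also provides span_tokenize(); the offsets below match those shown in the package overview at the end of this page:

>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
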
class nltk.tokenize.regexp.WordPunctTokenizer[source]

Bases: nltk.tokenize.regexp.RegexpTokenizer

Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
nltk.tokenize.regexp.regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=<RegexFlag.UNICODE|DOTALL|MULTILINE: 56>)[source]

Return a tokenized copy of text. See RegexpTokenizer for descriptions of the arguments.

nltk.tokenize.repp module

class nltk.tokenize.repp.ReppTokenizer(repp_dir, encoding='utf8')[source]

Bases: nltk.tokenize.api.TokenizerI

A class for word tokenization using the REPP parser described in Rebecca Dridan and Stephan Oepen (2012) Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit. In ACL. http://anthology.aclweb.org/P/P12/P12-2.pdf#page=406

>>> sents = ['Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve.' ,
... 'But rule-based tokenizers are hard to maintain and their rules language specific.' ,
... 'We evaluated our method on three languages and obtained error rates of 0.27% (English), 0.35% (Dutch) and 0.76% (Italian) for our best models.'
... ]
>>> tokenizer = ReppTokenizer('/home/alvas/repp/') 
>>> for sent in sents:                             
...     tokenizer.tokenize(sent)                   
...
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents): 
...     print(sent)
...
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents, keep_token_positions=True): 
...     print(sent)
...
[(u'Tokenization', 0, 12), (u'is', 13, 15), (u'widely', 16, 22), (u'regarded', 23, 31), (u'as', 32, 34), (u'a', 35, 36), (u'solved', 37, 43), (u'problem', 44, 51), (u'due', 52, 55), (u'to', 56, 58), (u'the', 59, 62), (u'high', 63, 67), (u'accuracy', 68, 76), (u'that', 77, 81), (u'rulebased', 82, 91), (u'tokenizers', 92, 102), (u'achieve', 103, 110), (u'.', 110, 111)]
[(u'But', 0, 3), (u'rule-based', 4, 14), (u'tokenizers', 15, 25), (u'are', 26, 29), (u'hard', 30, 34), (u'to', 35, 37), (u'maintain', 38, 46), (u'and', 47, 50), (u'their', 51, 56), (u'rules', 57, 62), (u'language', 63, 71), (u'specific', 72, 80), (u'.', 80, 81)]
[(u'We', 0, 2), (u'evaluated', 3, 12), (u'our', 13, 16), (u'method', 17, 23), (u'on', 24, 26), (u'three', 27, 32), (u'languages', 33, 42), (u'and', 43, 46), (u'obtained', 47, 55), (u'error', 56, 61), (u'rates', 62, 67), (u'of', 68, 70), (u'0.27', 71, 75), (u'%', 75, 76), (u'(', 77, 78), (u'English', 78, 85), (u')', 85, 86), (u',', 86, 87), (u'0.35', 88, 92), (u'%', 92, 93), (u'(', 94, 95), (u'Dutch', 95, 100), (u')', 100, 101), (u'and', 102, 105), (u'0.76', 106, 110), (u'%', 110, 111), (u'(', 112, 113), (u'Italian', 113, 120), (u')', 120, 121), (u'for', 122, 125), (u'our', 126, 129), (u'best', 130, 134), (u'models', 135, 141), (u'.', 141, 142)]
find_repptokenizer(repp_dirname)[source]

Finds the REPP tokenizer binary and its repp.set config file.

generate_repp_command(inputfilename)[source]

Generates the REPP command to be run at the terminal.

Parameters:inputfilename (str) – path to the input file
static parse_repp_outputs(repp_output)[source]

Parses the tri-tuple format that REPP outputs when the "--format triple" option is used, and returns a generator of tuples of string tokens.

Parameters:repp_output (type) –
Returns:an iterable of the tokenized sentences as tuples of strings
Return type:iter(tuple)
tokenize(sentence)[source]

Use Repp to tokenize a single sentence.

Parameters:sentence (str) – A single sentence string.
Returns:A tuple of tokens.
Return type:tuple(str)
tokenize_sents(sentences, keep_token_positions=False)[source]

Tokenize multiple sentences using Repp.

Parameters:sentences (list(str)) – A list of sentence strings.
Returns:A list of tuples of tokens
Return type:iter(tuple(str))

nltk.tokenize.sexpr module

S-Expression Tokenizer

SExprTokenizer is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.

>>> from nltk.tokenize import SExprTokenizer
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

By default, SExprTokenizer will raise a ValueError exception if used to tokenize an expression with non-matching parentheses:

>>> SExprTokenizer().tokenize('c) d) e (f (g')
Traceback (most recent call last):
  ...
ValueError: Un-matched close paren at char 1

The strict argument can be set to False to allow for non-matching parentheses. Any unmatched close parentheses will be listed as their own s-expression; and the last partial sexpr with unmatched open parentheses will be listed as its own sexpr:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']

The characters used for open and close parentheses may be customized using the parens argument to the SExprTokenizer constructor:

>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']

The s-expression tokenizer is also available as a function:

>>> from nltk.tokenize import sexpr_tokenize
>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
class nltk.tokenize.sexpr.SExprTokenizer(parens='()', strict=True)[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that divides strings into s-expressions. An s-expression can be either:

  • a parenthesized expression, including any nested parenthesized expressions, or
  • a sequence of non-whitespace non-parenthesis characters.

For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f).

By default, the characters ( and ) are treated as open and close parentheses, but alternative strings may be specified.

Parameters:
  • parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
  • strict – If true, then raise an exception when tokenizing an ill-formed sexpr.
tokenize(text)[source]

Return a list of s-expressions extracted from text. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)

If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, then raise a ValueError. If strict is False, then any unmatched close parentheses will be listed as their own s-expression; and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters:text (str or iter(str)) – the string to be tokenized
Return type:iter(str)

nltk.tokenize.simple module

Simple Tokenizers

These tokenizers divide strings into substrings using the string split() method. When tokenizing using a particular delimiter string, use the string split() method directly, as this is more efficient.

The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
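
For instance, a corpus reader might be told to split words on single spaces (a hedged sketch: the corpus path is hypothetical, and word_tokenizer is assumed to be the relevant PlaintextCorpusReader parameter):

>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> from nltk.tokenize.simple import SpaceTokenizer
>>> reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt',
...                                word_tokenizer=SpaceTokenizer())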

class nltk.tokenize.simple.CharTokenizer[source]

Bases: nltk.tokenize.api.StringTokenizer

Tokenize a string into individual characters. If this functionality is ever required directly, use for char in string.
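
For example (equivalent to list(s)):

>>> from nltk.tokenize.simple import CharTokenizer
>>> CharTokenizer().tokenize('abc')
['a', 'b', 'c']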

span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:iter(tuple(int, int))
tokenize(s)[source]

Return a tokenized copy of s.

Return type:list of str
class nltk.tokenize.simple.LineTokenizer(blanklines='discard')[source]

Bases: nltk.tokenize.api.TokenizerI

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']
Parameters:blanklines

Indicates how blank lines should be handled. Valid values are:

  • discard: strip blank lines out of the token list before returning it.
    A line is considered blank if it contains only whitespace characters.
  • keep: leave all blank lines in the token list.
  • discard-eof: if the string ends with a newline, then do not generate
    a corresponding token '' after that newline.
span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:iter(tuple(int, int))
tokenize(s)[source]

Return a tokenized copy of s.

Return type:list of str
class nltk.tokenize.simple.SpaceTokenizer[source]

Bases: nltk.tokenize.api.StringTokenizer

Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

>>> from nltk.tokenize import SpaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> SpaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
class nltk.tokenize.simple.TabTokenizer[source]

Bases: nltk.tokenize.api.StringTokenizer

Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

>>> from nltk.tokenize import TabTokenizer
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
nltk.tokenize.simple.line_tokenize(text, blanklines='discard')[source]

nltk.tokenize.stanford module

class nltk.tokenize.stanford.StanfordTokenizer(path_to_jar=None, encoding='utf8', options=None, verbose=False, java_options='-mx1000m')[source]

Bases: nltk.tokenize.api.TokenizerI

Interface to the Stanford Tokenizer

>>> from nltk.tokenize.stanford import StanfordTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
>>> StanfordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
tokenize(s)[source]

Use stanford tokenizer’s PTBTokenizer to tokenize multiple sentences.

nltk.tokenize.stanford.setup_module(module)[source]

nltk.tokenize.stanford_segmenter module

class nltk.tokenize.stanford_segmenter.StanfordSegmenter(path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g')[source]

Bases: nltk.tokenize.api.TokenizerI

Interface to the Stanford Segmenter

If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:

seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试

>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
default_config(lang)[source]

Attempt to initialize the Stanford Word Segmenter for the specified language, using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables.

segment(tokens)[source]
segment_file(input_file_path)[source]
segment_sents(sentences)[source]
tokenize(s)[source]

Return a tokenized copy of s.

Return type:list of str
nltk.tokenize.stanford_segmenter.setup_module(module)[source]

nltk.tokenize.texttiling module

class nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)[source]

Bases: nltk.tokenize.api.TokenizerI

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. The boundaries are normalized to the closest paragraph break and the segmented text is returned.

Parameters:
  • w (int) – Pseudosentence size
  • k (int) – Size (in sentences) of the block used in the block comparison method
  • similarity_method (constant) – The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION.
  • stopwords (list(str)) – A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus)
  • smoothing_method (constant) – The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
  • smoothing_width (int) – The width of the window used by the smoothing method
  • smoothing_rounds (int) – The number of smoothing passes
  • cutoff_policy (constant) – The policy used to determine the number of boundaries: HC (default) or LC
>>> from nltk.corpus import brown
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
tokenize(text)[source]

Return a tokenized copy of text, where each “token” represents a separate topic.

class nltk.tokenize.texttiling.TokenSequence(index, wrdindex_list, original_length=None)[source]

Bases: object

A token list with its original length and its index

class nltk.tokenize.texttiling.TokenTableField(first_pos, ts_occurences, total_count=1, par_count=1, last_par=0, last_tok_seq=None)[source]

Bases: object

A field in the token table holding parameters for each token, used later in the process

nltk.tokenize.texttiling.demo(text=None)[source]
nltk.tokenize.texttiling.smooth(x, window_len=11, window='flat')[source]

Smooth the data using a window of the requested size.

This method is based on the convolution of a scaled window with the signal. The signal is prepared by introducing reflected copies of the signal (with the window size) in both ends so that transient parts are minimized in the beginning and end part of the output signal.

Parameters:
  • x – the input signal
  • window_len – the dimension of the smoothing window; should be an odd integer
  • window – the type of window from ‘flat’, ‘hanning’, ‘hamming’, ‘bartlett’, ‘blackman’ flat window will produce a moving average smoothing.
Returns:

the smoothed signal

example:

from numpy import linspace, sin
from numpy.random import randn

t = linspace(-2, 2, 50)           # 50 evenly spaced sample points on [-2, 2]
x = sin(t) + randn(len(t)) * 0.1  # noisy sine signal
y = smooth(x)
See also:numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve, scipy.signal.lfilter

TODO: the window parameter could be the window itself if an array instead of a string

nltk.tokenize.toktok module

The tok-tok tokenizer is a simple, general tokenizer that expects one sentence per line of input; thus only the final period is tokenized.

Tok-tok has been tested on, and gives reasonably good results for English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others. The input should be in UTF-8 encoding.

Reference: Jon Dehdari. 2014. A Neurophysiologically-Inspired Statistical Language Model (Doctoral dissertation). Columbus, OH, USA: The Ohio State University.

class nltk.tokenize.toktok.ToktokTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

This is a Python port of the tok-tok.pl from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl

>>> toktok = ToktokTokenizer()
>>> text = u'Is 9.5 or 525,600 my favorite number?'
>>> print (toktok.tokenize(text, return_str=True))
Is 9.5 or 525,600 my favorite number ?
>>> text = u'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things'
>>> print (toktok.tokenize(text, return_str=True))
The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
>>> text = u'¡This, is a sentence with weird» symbols… appearing everywhere¿'
>>> expected = u'¡ This , is a sentence with weird » symbols … appearing everywhere ¿'
>>> assert toktok.tokenize(text, return_str=True) == expected
>>> toktok.tokenize(text) == [u'¡', u'This', u',', u'is', u'a', u'sentence', u'with', u'weird', u'»', u'symbols', u'…', u'appearing', u'everywhere', u'¿']
True
AMPERCENT = (re.compile('& '), '&amp; ')
CLOSE_PUNCT = ')]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」'
CLOSE_PUNCT_RE = (re.compile('([)]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」])'), '\\1 ')
COMMA_IN_NUM = (re.compile('(?<!,)([,،])(?![,\\d])'), ' \\1 ')
CURRENCY_SYM = '$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩'
CURRENCY_SYM_RE = (re.compile('([$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩])'), '\\1 ')
EN_EM_DASHES = (re.compile('([–—])'), ' \\1 ')
FINAL_PERIOD_1 = (re.compile('(?<!\\.)\\.$'), ' .')
FINAL_PERIOD_2 = (re.compile('(?<!\\.)\\.\\s*(["\'’»›”]) *$'), ' . \\1')
FUNKY_PUNCT_1 = (re.compile('([،;؛¿!"\\])}»›”؟¡%٪°±©®।॥…])'), ' \\1 ')
FUNKY_PUNCT_2 = (re.compile('([({\\[“‘„‚«‹「『])'), ' \\1 ')
LSTRIP = (re.compile('^ +'), '')
MULTI_COMMAS = (re.compile('(,{2,})'), ' \\1 ')
MULTI_DASHES = (re.compile('(-{2,})'), ' \\1 ')
MULTI_DOTS = (re.compile('(\\.{2,})'), ' \\1 ')
NON_BREAKING = (re.compile('\xa0'), ' ')
ONE_SPACE = (re.compile(' {2,}'), ' ')
OPEN_PUNCT = '([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「'
OPEN_PUNCT_RE = (re.compile('([([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「])'), '\\1 ')
PIPE = (re.compile('\\|'), ' &#124; ')
PROB_SINGLE_QUOTES = (re.compile("(['’`])"), ' \\1 ')
RSTRIP = (re.compile('\\s+$'), '\n')
STUPID_QUOTES_1 = (re.compile(' ` ` '), ' `` ')
STUPID_QUOTES_2 = (re.compile(" ' ' "), " '' ")
TAB = (re.compile('\t'), ' &#9; ')
TOKTOK_REGEXES = [(re.compile('\xa0'), ' '), (re.compile('([،;؛¿!"\\])}»›”؟¡%٪°±©®।॥…])'), ' \\1 '), (re.compile(':(?!//)'), ' : '), (re.compile('\\?(?!\\S)'), ' ? '), (re.compile('(:\\/\\/)[\\S+\\.\\S+\\/\\S+][\\/]'), ' / '), (re.compile(' /'), ' / '), (re.compile('& '), '&amp; '), (re.compile('\t'), ' &#9; '), (re.compile('\\|'), ' &#124; '), (re.compile('([([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「])'), '\\1 '), (re.compile('([)]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」])'), '\\1 '), (re.compile('(,{2,})'), ' \\1 '), (re.compile('(?<!,)([,،])(?![,\\d])'), ' \\1 '), (re.compile('(?<!\\.)\\.\\s*(["\'’»›”]) *$'), ' . \\1'), (re.compile("(['’`])"), ' \\1 '), (re.compile(' ` ` '), ' `` '), (re.compile(" ' ' "), " '' "), (re.compile('([$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩])'), '\\1 '), (re.compile('([–—])'), ' \\1 '), (re.compile('(-{2,})'), ' \\1 '), (re.compile('(\\.{2,})'), ' \\1 '), (re.compile('(?<!\\.)\\.$'), ' .'), (re.compile('(?<!\\.)\\.\\s*(["\'’»›”]) *$'), ' . \\1'), (re.compile(' {2,}'), ' ')]
URL_FOE_1 = (re.compile(':(?!//)'), ' : ')
URL_FOE_2 = (re.compile('\\?(?!\\S)'), ' ? ')
URL_FOE_3 = (re.compile('(:\\/\\/)[\\S+\\.\\S+\\/\\S+][\\/]'), ' / ')
URL_FOE_4 = (re.compile(' /'), ' / ')
tokenize(text, return_str=False)[source]

Return a tokenized copy of s.

Return type:list of str

nltk.tokenize.treebank module

Penn Treebank Tokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre and available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.

class nltk.tokenize.treebank.MacIntyreContractions[source]

Bases: object

List of contractions adapted from Robert MacIntyre’s tokenizer.

CONTRACTIONS2 = ['(?i)\\b(can)(?#X)(not)\\b', "(?i)\\b(d)(?#X)('ye)\\b", '(?i)\\b(gim)(?#X)(me)\\b', '(?i)\\b(gon)(?#X)(na)\\b', '(?i)\\b(got)(?#X)(ta)\\b', '(?i)\\b(lem)(?#X)(me)\\b', "(?i)\\b(mor)(?#X)('n)\\b", '(?i)\\b(wan)(?#X)(na)\\s']
CONTRACTIONS3 = ["(?i) ('t)(?#X)(is)\\b", "(?i) ('t)(?#X)(was)\\b"]
CONTRACTIONS4 = ['(?i)\\b(whad)(dd)(ya)\\b', '(?i)\\b(wha)(t)(cha)\\b']
class nltk.tokenize.treebank.TreebankWordDetokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer’s regexes.

Note:

  • Additional assumptions are made when undoing the padding of the [;@#$%&] punctuation symbols that are not presupposed by the TreebankWordTokenizer.
  • Additional regexes are used when reversing the parentheses tokenization; e.g. the pattern r'([\]\)\}\>])\s([:;,.])' removes the extra right padding added to a closing parenthesis preceding [:;,.].
  • It is not possible to restore the original whitespace exactly, because there is no explicit record of which newlines, tabs or spaces were removed by the text.split() operation.

>>> from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
>>> d = TreebankWordDetokenizer()
>>> t = TreebankWordTokenizer()
>>> toks = t.tokenize(s)
>>> d.detokenize(toks)
'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'

The MXPOST parentheses substitution can be undone using the convert_parentheses parameter:

>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected_tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '-LRB-', 'York', '-RRB-', '.', 'Please', '-LRB-', 'buy',
... '-RRB-', 'me', 'two', 'of', 'them.', '-LRB-', 'Thanks', '-RRB-', '.']
>>> expected_tokens == t.tokenize(s, convert_parentheses=True)
True
>>> expected_detoken = 'Good muffins cost $3.88 in New (York). Please (buy) me two of them. (Thanks).'
>>> expected_detoken == d.detokenize(t.tokenize(s, convert_parentheses=True), convert_parentheses=True)
True

During tokenization it’s safe to add more spaces but during detokenization, simply undoing the padding doesn’t really help.

  • During tokenization, left and right padding is added to [!?]; when detokenizing, only a left shift of the [!?] is needed. Thus (re.compile(r'\s([?!])'), r'\g<1>').
  • During tokenization, [:,] are left and right padded, but when detokenizing only a left shift is necessary, and the right pad after a comma/colon is kept if the following string is a non-digit. Thus (re.compile(r'\s([:,])\s([^\d])'), r'\1 \2').
>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
>>> twd = TreebankWordDetokenizer()
>>> twd.detokenize(toks)
"hello, i can't feel my feet! Help!!"
>>> toks = ['hello', ',', 'i', "can't", 'feel', ';', 'my', 'feet', '!',
... 'Help', '!', '!', 'He', 'said', ':', 'Help', ',', 'help', '?', '!']
>>> twd.detokenize(toks)
"hello, i can't feel; my feet! Help!! He said: Help, help?!"
CONTRACTIONS2 = [re.compile('(?i)\\b(can)\\s(not)\\b', re.IGNORECASE), re.compile("(?i)\\b(d)\\s('ye)\\b", re.IGNORECASE), re.compile('(?i)\\b(gim)\\s(me)\\b', re.IGNORECASE), re.compile('(?i)\\b(gon)\\s(na)\\b', re.IGNORECASE), re.compile('(?i)\\b(got)\\s(ta)\\b', re.IGNORECASE), re.compile('(?i)\\b(lem)\\s(me)\\b', re.IGNORECASE), re.compile("(?i)\\b(mor)\\s('n)\\b", re.IGNORECASE), re.compile('(?i)\\b(wan)\\s(na)\\s', re.IGNORECASE)]
CONTRACTIONS3 = [re.compile("(?i) ('t)\\s(is)\\b", re.IGNORECASE), re.compile("(?i) ('t)\\s(was)\\b", re.IGNORECASE)]
CONVERT_PARENTHESES = [(re.compile('-LRB-'), '('), (re.compile('-RRB-'), ')'), (re.compile('-LSB-'), '['), (re.compile('-RSB-'), ']'), (re.compile('-LCB-'), '{'), (re.compile('-RCB-'), '}')]
DOUBLE_DASHES = (re.compile(' -- '), '--')
ENDING_QUOTES = [(re.compile("([^' ])\\s('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), '\\1\\2 '), (re.compile("([^' ])\\s('[sS]|'[mM]|'[dD]|') "), '\\1\\2 '), (re.compile("(\\S)(\\'\\')"), '\\1\\2 '), (re.compile(" '' "), '"')]
PARENS_BRACKETS = [(re.compile('\\s([\\[\\(\\{\\<])\\s'), ' \\g<1>'), (re.compile('\\s([\\]\\)\\}\\>])\\s'), '\\g<1> '), (re.compile('([\\]\\)\\}\\>])\\s([:;,.])'), '\\1\\2')]
PUNCTUATION = [(re.compile("([^'])\\s'\\s"), "\\1' "), (re.compile('\\s([?!])'), '\\g<1>'), (re.compile('([^\\.])\\s(\\.)([\\]\\)}>"\\\']*)\\s*$'), '\\1\\2\\3'), (re.compile('\\s([#$])\\s'), ' \\g<1>'), (re.compile('\\s([;%])\\s'), '\\g<1> '), (re.compile('\\s([&])\\s'), ' \\g<1> '), (re.compile('\\s\\.\\.\\.\\s'), '...'), (re.compile('\\s([:,])\\s$'), '\\1'), (re.compile('\\s([:,])\\s([^\\d])'), '\\1 \\2')]
STARTING_QUOTES = [(re.compile('([ (\\[{<])\\s``'), '\\1"'), (re.compile('\\s(``)\\s'), '\\1'), (re.compile('^``'), '\\"')]
detokenize(tokens, convert_parentheses=False)[source]

Duck-typing the abstract tokenize().

tokenize(tokens, convert_parentheses=False)[source]

Detokenize a list of tokens back into a string by undoing the Treebank tokenization regexes (see the class description above).

Parameters:tokens (list(str)) – A list of strings, i.e. tokenized text.
Returns:str
class nltk.tokenize.treebank.TreebankWordTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:

  • split standard contractions, e.g. don't -> do n't and they'll -> they 'll

  • treat most punctuation characters as separate tokens

  • split off commas and single quotes, when followed by whitespace

  • separate periods that appear at the end of line

    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
    >>> TreebankWordTokenizer().tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
    >>> s = "They'll save and invest more."
    >>> TreebankWordTokenizer().tokenize(s)
    ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
    >>> s = "hi, my name can't hello,"
    >>> TreebankWordTokenizer().tokenize(s)
    ['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
    
CONTRACTIONS2 = [re.compile('(?i)\\b(can)(?#X)(not)\\b', re.IGNORECASE), re.compile("(?i)\\b(d)(?#X)('ye)\\b", re.IGNORECASE), re.compile('(?i)\\b(gim)(?#X)(me)\\b', re.IGNORECASE), re.compile('(?i)\\b(gon)(?#X)(na)\\b', re.IGNORECASE), re.compile('(?i)\\b(got)(?#X)(ta)\\b', re.IGNORECASE), re.compile('(?i)\\b(lem)(?#X)(me)\\b', re.IGNORECASE), re.compile("(?i)\\b(mor)(?#X)('n)\\b", re.IGNORECASE), re.compile('(?i)\\b(wan)(?#X)(na)\\s', re.IGNORECASE)]
CONTRACTIONS3 = [re.compile("(?i) ('t)(?#X)(is)\\b", re.IGNORECASE), re.compile("(?i) ('t)(?#X)(was)\\b", re.IGNORECASE)]
CONVERT_PARENTHESES = [(re.compile('\\('), '-LRB-'), (re.compile('\\)'), '-RRB-'), (re.compile('\\['), '-LSB-'), (re.compile('\\]'), '-RSB-'), (re.compile('\\{'), '-LCB-'), (re.compile('\\}'), '-RCB-')]
DOUBLE_DASHES = (re.compile('--'), ' -- ')
ENDING_QUOTES = [(re.compile('([»”’])'), ' \\1 '), (re.compile('"'), " '' "), (re.compile("(\\S)(\\'\\')"), '\\1 \\2 '), (re.compile("([^' ])('[sS]|'[mM]|'[dD]|') "), '\\1 \\2 '), (re.compile("([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), '\\1 \\2 ')]
PARENS_BRACKETS = (re.compile('[\\]\\[\\(\\)\\{\\}\\<\\>]'), ' \\g<0> ')
PUNCTUATION = [(re.compile('([^\\.])(\\.)([\\]\\)}>"\\\'»”’ ]*)\\s*$'), '\\1 \\2 \\3 '), (re.compile('([:,])([^\\d])'), ' \\1 \\2'), (re.compile('([:,])$'), ' \\1 '), (re.compile('\\.\\.\\.'), ' ... '), (re.compile('[;@#$%&]'), ' \\g<0> '), (re.compile('([^\\.])(\\.)([\\]\\)}>"\\\']*)\\s*$'), '\\1 \\2\\3 '), (re.compile('[?!]'), ' \\g<0> '), (re.compile("([^'])' "), "\\1 ' ")]
STARTING_QUOTES = [(re.compile('([«“‘„]|[`]+)'), ' \\1 '), (re.compile('^\\"'), '``'), (re.compile('(``)'), ' \\1 '), (re.compile('([ \\(\\[{<])(\\"|\\\'{2})'), '\\1 `` '), (re.compile("(?i)(\\')(?!re|ve|ll|m|t|s|d)(\\w)\\b", re.IGNORECASE), '\\1 \\2')]
span_tokenize(text)[source]

Uses the post-hoc nltk.tokenize.util.align_tokens() to return the offset spans.

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
... (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
... (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
... (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
... 'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True

Additional example:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''I said, "I'd like to buy some ''good muffins" which cost $3.88\n each in New (York)."'''
>>> expected = [(0, 1), (2, 6), (6, 7), (8, 9), (9, 10), (10, 12),
... (13, 17), (18, 20), (21, 24), (25, 29), (30, 32), (32, 36),
... (37, 44), (44, 45), (46, 51), (52, 56), (57, 58), (58, 62),
... (64, 68), (69, 71), (72, 75), (76, 77), (77, 81), (81, 82),
... (82, 83), (83, 84)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['I', 'said', ',', '"', 'I', "'d", 'like', 'to',
... 'buy', 'some', "''", "good", 'muffins', '"', 'which', 'cost',
... '$', '3.88', 'each', 'in', 'New', '(', 'York', ')', '.', '"']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True

tokenize(text, convert_parentheses=False, return_str=False)[source]

Return a tokenized copy of s.

Return type:list of str

nltk.tokenize.util module

class nltk.tokenize.util.CJKChars[source]

Bases: object

An object that enumerates the code points of the CJK characters as listed on http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

This is a Python port of the CJK code point enumerations of Moses tokenizer: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309

CJK_Compatibility_Forms = (65072, 65103)
CJK_Compatibility_Ideographs = (63744, 64255)
CJK_Radicals = (11904, 42191)
Hangul_Jamo = (4352, 4607)
Hangul_Syllables = (44032, 55215)
Katakana_Hangul_Halfwidth = (65381, 65500)
Phags_Pa = (43072, 43135)
Supplementary_Ideographic_Plane = (131072, 196607)
ranges = [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), (63744, 64255), (65072, 65103), (65381, 65500), (131072, 196607)]
nltk.tokenize.util.align_tokens(tokens, sentence)[source]

Attempts to find the offsets of the tokens in the source sentence, as a sequence of (start, end) tuples, given both the tokens and the source string.

>>> from nltk.tokenize import TreebankWordTokenizer
>>> from nltk.tokenize.util import align_tokens
>>> s = str("The plane, bound for St Petersburg, crashed in Egypt's "
... "Sinai desert just 23 minutes after take-off from Sharm el-Sheikh "
... "on Saturday.")
>>> tokens = TreebankWordTokenizer().tokenize(s)
>>> expected = [(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23),
... (24, 34), (34, 35), (36, 43), (44, 46), (47, 52), (52, 54),
... (55, 60), (61, 67), (68, 72), (73, 75), (76, 83), (84, 89),
... (90, 98), (99, 103), (104, 109), (110, 119), (120, 122),
... (123, 131), (131, 132)]
>>> output = list(align_tokens(tokens, s))
>>> len(tokens) == len(expected) == len(output)  # Check that length of tokens and tuples are the same.
True
>>> expected == list(align_tokens(tokens, s))  # Check that the output is as expected.
True
>>> tokens == [s[start:end] for start, end in output]  # Check that the slices of the string corresponds to the tokens.
True
Parameters:
  • tokens (list(str)) – The list of strings that are the result of tokenization
  • sentence (str) – The original string
Return type:

list(tuple(int,int))

nltk.tokenize.util.is_cjk(character)[source]

Python port of Moses’ code to check for CJK character.

>>> from nltk.tokenize.util import CJKChars, is_cjk
>>> CJKChars().ranges
[(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), (63744, 64255), (65072, 65103), (65381, 65500), (131072, 196607)]
>>> is_cjk(u'㏾')
True
>>> is_cjk(u'﹟')
False
Parameters:character (char) – The character that needs to be checked.
Returns:bool
nltk.tokenize.util.regexp_span_tokenize(s, regexp)[source]

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.

>>> from nltk.tokenize.util import regexp_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(regexp_span_tokenize(s, r'\s'))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
(38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
Parameters:
  • s (str) – the string to be tokenized
  • regexp (str) – regular expression that matches token separators (must not be empty)
Return type:

iter(tuple(int, int))

nltk.tokenize.util.spans_to_relative(spans)[source]

Return a sequence of relative spans, given a sequence of spans.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(s)))
[(0, 4), (1, 7), (1, 4), (1, 5), (1, 2), (1, 3), (1, 5), (2, 6),
(1, 3), (1, 2), (1, 3), (1, 2), (1, 5), (2, 7)]
Parameters:spans (iter(tuple(int, int))) – a sequence of (start, end) offsets of the tokens
Return type:iter(tuple(int, int))
nltk.tokenize.util.string_span_tokenize(s, sep)[source]

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.

>>> from nltk.tokenize.util import string_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(string_span_tokenize(s, " "))
[(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
(38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]
Parameters:
  • s (str) – the string to be tokenized
  • sep (str) – the token separator
Return type:

iter(tuple(int, int))

nltk.tokenize.util.xml_escape(text)[source]

This function transforms the input text into an “escaped” version suitable for well-formed XML formatting.

Note that the default xml.sax.saxutils.escape() function does not escape some characters that Moses does, so they have to be added manually to the entities dictionary.

>>> from xml.sax.saxutils import escape
>>> from nltk.tokenize.util import xml_escape
>>> input_str = ''')| & < > ' " ] ['''
>>> expected_output =  ''')| &amp; &lt; &gt; ' " ] ['''
>>> escape(input_str) == expected_output
True
>>> xml_escape(input_str)
')&#124; &amp; &lt; &gt; &apos; &quot; &#93; &#91;'
Parameters:text (str) – The text that needs to be escaped.
Return type:str
nltk.tokenize.util.xml_unescape(text)[source]

This function transforms an "escaped" string suitable for well-formed XML formatting back into a human-readable string.

Note that the default xml.sax.saxutils.unescape() function does not unescape some characters that Moses does, so they have to be added manually to the entities dictionary.

>>> from xml.sax.saxutils import unescape
>>> from nltk.tokenize.util import xml_unescape
>>> s = ')&#124; &amp; &lt; &gt; &apos; &quot; &#93; &#91;'
>>> expected = ''')| & < > ' " ] ['''
>>> xml_unescape(s) == expected
True
Parameters:text (str) – The text that needs to be unescaped.
Return type:str

Module contents

NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:

>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

We can also operate at the level of sentences, using the sentence tokenizer directly as follows:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).

NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)

>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.

For further information, please see Chapter 3 of the NLTK book.

nltk.tokenize.sent_tokenize(text, language='english')[source]

Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

Parameters:
  • text – text to split into sentences
  • language – the model name in the Punkt corpus
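
For example (output assumed, using the pre-trained English Punkt model described above):

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello world. How are you? Fine, thanks.")
['Hello world.', 'How are you?', 'Fine, thanks.']
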
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)[source]

Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameters:
  • text (str) – text to split into words
  • language (str) – the model name in the Punkt corpus
  • preserve_line (bool) – If True, skip the sentence tokenization step and tokenize the text as a single line.
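
A brief sketch of the preserve_line option (outputs assumed): with the default, the text is first split into sentences, so every sentence-final period is separated, whereas with preserve_line=True only the period at the very end of the line is split off, following the Treebank behaviour shown earlier.

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("Hi there. Thanks.")
['Hi', 'there', '.', 'Thanks', '.']
>>> word_tokenize("Hi there. Thanks.", preserve_line=True)
['Hi', 'there.', 'Thanks', '.']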