Package nltk :: Module util

Module util

Classes
  MinimalSet
Find contexts where more than one possible target value can appear.
  HTMLCleaner
  OrderedDict
  AbstractLazySequence
An abstract base class for read-only sequences whose values are computed as needed.
  LazySubsequence
A subsequence produced by slicing a lazy sequence.
  LazyConcatenation
A lazy sequence formed by concatenating a list of lists.
  LazyMap
A lazy sequence whose elements are formed by applying a given function to each element in one or more underlying lists.
  LazyZip
A lazy sequence whose elements are tuples, each containing the i-th element from each of the argument sequences.
  LazyEnumerate
A lazy sequence whose elements are tuples, each containing a count (from zero) and a value yielded by the underlying sequence.
  LazyMappedList
Use LazyMap instead.
  LazyMappedChain
Use LazyConcatenation(LazyMap(func, lists)) instead.
Functions
usage(obj, selfname='self')

pr(data, start=0, end=None)
Pretty print a sequence of data items

print_string(s, width=70)
Pretty print a string, breaking lines on whitespace

re_show(regexp, string, left='{', right='}')
Search string for substrings matching regexp and wrap the matches with braces.
Returns: string

filestring(f)

breadth_first(tree, children=<built-in function iter>, depth=-1, queue=None)
Traverse the nodes of a tree in breadth-first order.

guess_encoding(data)
Given a byte string, attempt to decode it.

invert_dict(d)

clean_html(html)
Remove HTML markup from the given string.
Returns: string

clean_url(url)

ngram(sequence, n)
A utility that produces a sequence of ngrams from a sequence of items.
Returns: list of tuples

ingram(sequence, n)
A utility that produces an iterator over ngrams generated from a sequence of items.
Returns: iterator of tuples
Variables
  skip = ['script', 'style']
Function Details

pr(data, start=0, end=None)

Pretty print a sequence of data items

Parameters:
  • data (sequence or iterator) - the data stream to print
  • start (int) - the start position
  • end (int) - the end position
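
The start and end parameters can be read as slice bounds over the data stream. The following is a minimal sketch of that idea, not the module's own code; select_items and pr_sketch are hypothetical names, and treating end as an exclusive bound is an assumption.

```python
from itertools import islice
from pprint import pprint


def select_items(data, start=0, end=None):
    """Return the items of `data` from position `start` up to (not including) `end`.

    islice works on iterators as well as sequences, so the whole
    stream never has to be held in memory before slicing.
    """
    return list(islice(data, start, end))


def pr_sketch(data, start=0, end=None):
    """Pretty print the selected items (hypothetical stand-in for pr)."""
    pprint(select_items(data, start, end))
```

Calling pr_sketch(iter(range(10)), 2, 5) would pretty print the items at positions 2, 3 and 4.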

print_string(s, width=70)

Pretty print a string, breaking lines on whitespace

Parameters:
  • s (string) - the string to print, consisting of words and spaces
  • width (int) - the display width
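
Breaking a string on whitespace at a given display width can be sketched with the standard textwrap module. This is an illustration under that assumption, not necessarily how print_string is implemented; wrap_lines and print_string_sketch are hypothetical names.

```python
import textwrap


def wrap_lines(s, width=70):
    """Split `s` into lines no longer than `width`, breaking only on whitespace."""
    return textwrap.wrap(s, width=width)


def print_string_sketch(s, width=70):
    """Print the wrapped lines, one per output line."""
    print("\n".join(wrap_lines(s, width)))
```

Returning the lines from a separate helper keeps the wrapping logic easy to check without capturing printed output.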

re_show(regexp, string, left='{', right='}')

Search string for substrings matching regexp and wrap the matches with braces. This is convenient for learning about regular expressions.

Parameters:
  • regexp (string) - The regular expression.
  • string (string) - The string being matched.
  • left (string) - The left delimiter (printed before the matched substring)
  • right (string) - The right delimiter (printed after the matched substring)
Returns: string
A string with markers surrounding the matched substrings.
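
The wrapping behaviour described above can be approximated with re.sub. This is a minimal sketch rather than the module's implementation; wrap_matches is a hypothetical name, and no regular expression flags are assumed.

```python
import re


def wrap_matches(regexp, string, left="{", right="}"):
    """Return `string` with every match of `regexp` wrapped in `left` and `right`.

    A function is used as the replacement so that the delimiters are
    inserted literally, even if they contain backslashes.
    """
    return re.sub(regexp, lambda m: left + m.group(0) + right, string)
```

For instance, wrap_matches("[aeiou]", "banana") marks each vowel, which makes it easy to see exactly what a pattern matches.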

breadth_first(tree, children=<built-in function iter>, depth=-1, queue=None)

Traverse the nodes of a tree in breadth-first order. (No need to check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.
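
A breadth-first traversal of this kind can be sketched with a FIFO queue of (node, level) pairs. The code below is an illustration, not the module's source: breadth_first_sketch is a hypothetical name, the queue parameter from the signature is omitted, and reading depth=-1 as "no depth limit" is an assumption.

```python
from collections import deque


def breadth_first_sketch(tree, children=iter, depth=-1):
    """Yield the nodes of `tree` level by level, starting with the root.

    `children` maps a node to an iterator over its children. Nodes for
    which it raises TypeError (for example, iter(3)) are treated as leaves.
    Nodes at level `depth` are yielded but not expanded; -1 never matches
    a level, so the whole tree is visited.
    """
    queue = deque([(tree, 0)])
    while queue:
        node, level = queue.popleft()
        yield node
        if level != depth:
            try:
                queue.extend((child, level + 1) for child in children(node))
            except TypeError:
                pass  # leaf node: nothing to enqueue
```

With nested lists as the tree and the default children=iter, the root list is yielded first, then its elements, then their elements. Plain strings are best avoided as leaves here, because iterating a one-character string yields the same string again.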

guess_encoding(data)

Given a byte string, attempt to decode it. Tries the standard 'UTF8' and 'latin-1' encodings, plus several gathered from locale information.

The calling program *must* first call:

   locale.setlocale(locale.LC_ALL, '')

If successful it returns (decoded_unicode, successful_encoding). If unsuccessful it raises a UnicodeError.
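
The try-each-encoding strategy can be sketched as follows. This is a simplified illustration, not the module's code: guess_encoding_sketch is a hypothetical name, and it consults only locale.getpreferredencoding, whereas the text above says several locale-derived encodings are gathered. Trying 'latin-1' last is a deliberate choice in this sketch, because latin-1 can decode any byte string and would otherwise mask the other candidates.

```python
import locale


def guess_encoding_sketch(data):
    """Return (decoded_text, encoding_name) for the byte string `data`.

    Candidates are tried in order: UTF-8, the locale's preferred
    encoding, then latin-1. For locale information to reflect the user's
    environment, the caller is expected to have run
    locale.setlocale(locale.LC_ALL, '') beforehand.
    """
    candidates = ["utf-8"]
    preferred = locale.getpreferredencoding(False)
    if preferred:
        candidates.append(preferred)
    candidates.append("latin-1")

    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    raise UnicodeError("unable to decode input with any candidate encoding")
```

Since valid UTF-8 input is decoded by the first candidate, pure ASCII data is reported as 'utf-8' by this sketch.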

clean_html(html)

Remove HTML markup from the given string.

Parameters:
  • html (string) - the HTML string to be cleaned
Returns: string
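
One way to remove markup is to run the HTML through a parser and keep only the text nodes. The sketch below uses the standard html.parser module and is not the module's HTMLCleaner; TextExtractor and strip_markup are hypothetical names. Dropping the contents of script and style elements is an assumption suggested by the module-level skip variable, and collapsing runs of whitespace is a choice made for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside skipped elements."""

    SKIP = ("script", "style")  # assumed to mirror the module's `skip` list

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_markup(html):
    """Return the text of `html` with tags, scripts and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text buffered after the last tag
    return parser.text()
```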

ngram(sequence, n)

A utility that produces a sequence of ngrams from a sequence of items. For example:

>>> ngram([1,2,3,4,5], 3)
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Use ingram for an iterator version of this function.

Parameters:
  • sequence (sequence or iterator) - the source data to be converted into ngrams
  • n (int) - the degree of the ngram
Returns: list of tuples
The ngrams
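
The list-building behaviour shown in the example can be sketched with slicing. This is an illustration, not the module's code; ngram_sketch is a hypothetical name and n is assumed to be a positive integer.

```python
def ngram_sketch(sequence, n):
    """Return a list of all length-`n` tuples of consecutive items.

    The input is copied into a list first so that iterators are
    accepted as well as sequences.
    """
    items = list(sequence)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]
```

An input shorter than n produces an empty list, since no full window fits.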

ingram(sequence, n)

A utility that produces an iterator over ngrams generated from a sequence of items.

For example:

>>> list(ingram([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Use ngram for a list version of this function.

Parameters:
  • sequence (sequence or iterator) - the source data to be converted into ngrams
  • n (int) - the degree of the ngram
Returns: iterator of tuples
The ngrams
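
An iterator version can avoid materialising the input by keeping a sliding window of the last n items. The following is a minimal sketch using a bounded deque, not the module's implementation; ingram_sketch is a hypothetical name and n is assumed to be a positive integer.

```python
from collections import deque


def ingram_sketch(sequence, n):
    """Lazily yield each length-`n` tuple of consecutive items.

    A deque with maxlen=n drops the oldest item automatically, so only
    n items are held in memory at any time.
    """
    window = deque(maxlen=n)
    for item in sequence:
        window.append(item)
        if len(window) == n:
            yield tuple(window)
```

Because nothing is computed until the result is iterated, this form suits long or unbounded input streams.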