nltk.align package

Submodules

nltk.align.api module

class nltk.align.api.AlignedSent(words=[], mots=[], alignment='', encoding='utf8')[source]

Bases: builtins.object

Return an aligned sentence object, which encapsulates two sentences along with an Alignment between them.

>>> from nltk.align import AlignedSent
>>> algnsent = AlignedSent(['klein', 'ist', 'das', 'Haus'],
...     ['the', 'house', 'is', 'small'], '0-2 1-3 2-1 3-0')
>>> algnsent.words
['klein', 'ist', 'das', 'Haus']
>>> algnsent.mots
['the', 'house', 'is', 'small']
>>> algnsent.alignment
Alignment([(0, 2), (1, 3), (2, 1), (3, 0)])
>>> algnsent.precision('0-2 1-3 2-1 3-3')
0.75
>>> from nltk.corpus import comtrans
>>> print(comtrans.aligned_sents()[54])
<AlignedSent: 'Weshalb also sollten...' -> 'So why should EU arm...'>
>>> print(comtrans.aligned_sents()[54].alignment)
0-0 0-1 1-0 2-2 3-4 3-5 4-7 5-8 6-3 7-9 8-9 9-10 9-11 10-12 11-6 12-6 13-13
Parameters:
  • words (list(str)) – source language words
  • mots (list(str)) – target language words
  • alignment (Alignment) – the word-level alignments between the source and target language
alignment
alignment_error_rate(reference, possible=None)[source]

Return the Alignment Error Rate (AER) of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Return an error rate between 0.0 (perfect alignment) and 1.0 (no alignment).

>>> from nltk.align import AlignedSent
>>> s = AlignedSent(["the", "cat"], ["le", "chat"], [(0, 0), (1, 1)])
>>> s.alignment_error_rate(s)
0.0
Parameters:
  • reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
  • possible (AlignedSent or Alignment or None) – A “gold standard” reference of possible alignments (defaults to reference if None)
Return type:

float or None
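The AER follows the standard definition of Och and Ney: AER = 1 − (|A∩S| + |A∩P|) / (|A| + |S|), where A is the hypothesis, S the sure gold links and P the possible gold links. A minimal self-contained sketch of that formula (illustrative code, not the NLTK implementation):

```python
def alignment_error_rate(hypothesis, sure, possible=None):
    # hypothesis, sure, possible: sets of (i, j) alignment links;
    # possible defaults to sure, as in the method above
    if possible is None:
        possible = sure
    return 1.0 - (len(hypothesis & sure) + len(hypothesis & possible)) \
        / float(len(hypothesis) + len(sure))

a = {(0, 0), (1, 1)}
alignment_error_rate(a, a)  # 0.0: a perfect alignment
```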

invert()[source]

Return the aligned sentence pair, reversing the directionality.

Return type:AlignedSent
mots[source]
precision(reference)[source]

Return the precision of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Parameters:reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
Return type:float or None
recall(reference)[source]

Return the recall of an aligned sentence with respect to a “gold standard” reference AlignedSent.

Parameters:reference (AlignedSent or Alignment) – A “gold standard” reference aligned sentence.
Return type:float or None
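Over sets of alignment links, the two measures reduce to the usual definitions; a sketch (illustrative code, not the library's implementation), using the alignments from the AlignedSent doctest above:

```python
def precision(hypothesis, possible):
    # fraction of hypothesised links that appear in the gold links
    return len(hypothesis & possible) / float(len(hypothesis)) if hypothesis else None

def recall(hypothesis, sure):
    # fraction of gold links recovered by the hypothesis
    return len(hypothesis & sure) / float(len(sure)) if sure else None

hyp = {(0, 2), (1, 3), (2, 1), (3, 0)}
ref = {(0, 2), (1, 3), (2, 1), (3, 3)}
precision(hyp, ref)  # 0.75, matching the doctest above
```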
unicode_repr()

Return a string representation for this AlignedSent.

Return type:str
words[source]
class nltk.align.api.Alignment[source]

Bases: builtins.frozenset

A storage class for representing alignment between two sequences, s1, s2. In general, an alignment is a set of tuples of the form (i, j, ...) representing an alignment between the i-th element of s1 and the j-th element of s2. Tuples are extensible (they might contain additional data, such as a boolean to indicate sure vs possible alignments).

>>> from nltk.align import Alignment
>>> a = Alignment([(0, 0), (0, 1), (1, 2), (2, 2)])
>>> a.invert()
Alignment([(0, 0), (1, 0), (2, 1), (2, 2)])
>>> print(a.invert())
0-0 1-0 2-1 2-2
>>> a[0]
[(0, 1), (0, 0)]
>>> a.invert()[2]
[(2, 1), (2, 2)]
>>> b = Alignment([(0, 0), (0, 1)])
>>> b.issubset(a)
True
>>> c = Alignment('0-0 0-1')
>>> b == c
True
invert()[source]

Return an Alignment object, being the inverted mapping.

range(positions=None)[source]

Work out the range of the mapping from the given positions. If no positions are specified, compute the range of the entire mapping.

unicode_repr()

Produce a Giza-formatted string representing the alignment.
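The Giza/Pharaoh format is a whitespace-separated list of `i-j` pairs, as in the doctests above. Converting between that string form and a set of index pairs can be sketched with two small helpers (illustrative names, not the module's code):

```python
def parse_pharaoh(s):
    # "0-0 1-2" -> {(0, 0), (1, 2)}
    return {tuple(int(i) for i in pair.split('-')) for pair in s.split()}

def to_pharaoh(links):
    # {(1, 2), (0, 0)} -> "0-0 1-2" (sorted for a stable rendering)
    return ' '.join('{}-{}'.format(i, j) for i, j in sorted(links))
```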

nltk.align.bleu_score module

BLEU score implementation.

nltk.align.bleu_score.bleu(candidate, references, weights)[source]

Calculate BLEU score (Bilingual Evaluation Understudy)

Parameters:
  • candidate (list(str)) – a candidate sentence
  • references (list(list(str))) – reference sentences
  • weights (list(float)) – weights for unigrams, bigrams, trigrams and so on
>>> weights = [0.25, 0.25, 0.25, 0.25]
>>> candidate1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> candidate2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> bleu(candidate1, [reference1, reference2, reference3], weights)
0.504...
>>> bleu(candidate2, [reference1, reference2, reference3], weights)
0

Papineni, Kishore, et al. “BLEU: A method for automatic evaluation of machine translation.” Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002. http://www.aclweb.org/anthology/P02-1040.pdf
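The core of BLEU is the clipped ("modified") n-gram precision from the paper above: each candidate n-gram count is capped at the maximum number of times the n-gram occurs in any single reference. A minimal sketch of that step (illustrative code, not this module's implementation):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    if not cand:
        return 0.0
    # clip each candidate count at its maximum count in any one reference
    max_ref = Counter()
    for ref in references:
        ref_counts = ngrams(ref)
        for ng in cand:
            max_ref[ng] = max(max_ref[ng], ref_counts[ng])
    clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
    return clipped / float(sum(cand.values()))

# Papineni et al.'s classic example: "the" * 7 against two references
modified_precision(['the'] * 7,
                   ['the cat is on the mat'.split(),
                    'there is a cat on the mat'.split()], 1)  # 2/7
```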

nltk.align.gale_church module

A port of the Gale-Church Aligner.

Gale & Church (1993), A Program for Aligning Sentences in Bilingual Corpora. http://aclweb.org/anthology/J93-1004.pdf

class nltk.align.gale_church.LanguageIndependent[source]

Bases: builtins.object

AVERAGE_CHARACTERS = 1
PRIORS = {(0, 1): 0.0099, (1, 2): 0.089, (2, 2): 0.011, (1, 0): 0.0099, (1, 1): 0.89, (2, 1): 0.089}
VARIANCE_CHARACTERS = 6.8
nltk.align.gale_church.align_blocks(source_sents, target_sents, params=<class 'nltk.align.gale_church.LanguageIndependent'>)[source]

Return the sentence alignment of two text blocks (usually paragraphs).

>>> align_blocks([5,5,5], [7,7,7])
[(0, 0), (1, 1), (2, 2)]
>>> align_blocks([10,5,5], [12,20])
[(0, 0), (1, 1), (2, 1)]
>>> align_blocks([12,20], [10,5,5])
[(0, 0), (1, 1), (1, 2)]
>>> align_blocks([10,2,10,10,2,10], [12,3,20,3,12])
[(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4)]

Parameters:
  • source_sents – The list of source sentence lengths.
  • target_sents – The list of target sentence lengths.
  • params – The sentence alignment parameters.
Returns:

The sentence alignments, a list of index pairs.

nltk.align.gale_church.align_log_prob(i, j, source_sents, target_sents, alignment, params)[source]

Returns the log probability of the two sentences source_sents[i] and target_sents[j] being aligned with a specific alignment.

Parameters:
  • i – The offset of the source sentence.
  • j – The offset of the target sentence.
  • source_sents – The list of source sentence lengths.
  • target_sents – The list of target sentence lengths.
  • alignment – The alignment type, a tuple of two integers.
  • params – The sentence alignment parameters.
Returns:

The log probability of a specific alignment between the two sentences, given the parameters.
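The length component of this score is based on Gale &amp; Church's standardised length difference: the deviation of the target length from the length predicted by the source length, scaled by the modelled variance. A rough sketch of one common formulation, using the LanguageIndependent constants above (illustrative code, not this module's exact computation):

```python
import math

# parameters from LanguageIndependent above
AVERAGE_CHARACTERS = 1.0
VARIANCE_CHARACTERS = 6.8

def length_delta(l1, l2):
    # standardised difference between the observed target length l2 and
    # the length predicted from the source length l1
    return (l2 - l1 * AVERAGE_CHARACTERS) / math.sqrt(l1 * VARIANCE_CHARACTERS)

length_delta(10, 10)  # 0.0: lengths match the expected ratio exactly
```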

nltk.align.gale_church.align_texts(source_blocks, target_blocks, params=<class 'nltk.align.gale_church.LanguageIndependent'>)[source]

Creates the sentence alignment of two texts.

Texts can consist of several blocks. Block boundaries cannot be crossed by sentence alignment links.

Each block consists of a list that contains the lengths (in characters) of the sentences in this block.

Parameters:
  • source_blocks – The list of blocks in the source text.
  • target_blocks – The list of blocks in the target text.
  • params – The sentence alignment parameters.
Returns:

A list of sentence alignment lists.

nltk.align.gale_church.erfcc(x)[source]

Complementary error function.

nltk.align.gale_church.norm_cdf(x)[source]

Return the area under the normal distribution from M{-∞..x}.
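This quantity can be written in terms of the complementary error function (the role erfcc plays above): Φ(x) = erfc(−x/√2) / 2. A sketch using the standard library's math.erfc instead of the module's rational approximation:

```python
import math

def norm_cdf(x):
    # area under the standard normal from -infinity to x,
    # via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

norm_cdf(0.0)  # 0.5: half the mass lies below the mean
```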

nltk.align.gale_church.norm_logsf(x)[source]
nltk.align.gale_church.parse_token_stream(stream, soft_delimiter, hard_delimiter)[source]

Parses a stream of tokens and splits it into sentences (using soft_delimiter tokens) and blocks (using hard_delimiter tokens) for use with the align_texts function.

nltk.align.gale_church.split_at(it, split_value)[source]

Splits an iterator it at values of split_value.

Each instance of split_value is swallowed. The iterator produces subiterators, which must be consumed fully before the next subiterator can be used.
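A simplified sketch of this behaviour that yields lists rather than subiterators (illustrative code, not the module's implementation):

```python
def split_at(iterable, split_value):
    # each occurrence of split_value ends the current chunk and is dropped
    chunk = []
    for item in iterable:
        if item == split_value:
            yield chunk
            chunk = []
        else:
            chunk.append(item)
    yield chunk

list(split_at([1, 2, 0, 3, 0, 4], 0))  # [[1, 2], [3], [4]]
```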

nltk.align.gale_church.trace(backlinks, source, target)[source]

nltk.align.gdfa module

nltk.align.gdfa.grow_diag_final_and(srclen, trglen, e2f, f2e)[source]

This function symmetrizes the source-to-target and target-to-source word alignment outputs using the grow-diag-final-and (GDFA) algorithm (Koehn, 2005).

Step 1: Find the intersection of the bidirectional alignment.

Step 2: Search for additional neighboring alignment points to add, subject to
these criteria: (i) the neighboring points are not in the intersection and (ii) they are in the union.
Step 3: Add all remaining alignment points that are not in the intersection and
did not qualify as neighbors in Step 2, but appear in the original forward/backward alignment outputs.
>>> forw = ('0-0 2-1 9-2 21-3 10-4 7-5 11-6 9-7 12-8 1-9 3-10 '
...         '4-11 17-12 17-13 25-14 13-15 24-16 11-17 28-18')
>>> back = ('0-0 1-9 2-9 3-10 4-11 5-12 6-6 7-5 8-6 9-7 10-4 '
...         '11-6 12-8 13-12 15-12 17-13 18-13 19-12 20-13 '
...         '21-3 22-12 23-14 24-17 25-15 26-17 27-18 28-18')
>>> srctext = ("この よう な ハロー 白色 わい 星 の L 関数 "
...            "は L と 共 に 不連続 に 増加 する こと が "
...            "期待 さ れる こと を 示し た 。")
>>> trgtext = ("Therefore , we expect that the luminosity function "
...            "of such halo white dwarfs increases discontinuously "
...            "with the luminosity .")
>>> srclen = len(srctext.split())
>>> trglen = len(trgtext.split())
>>>
>>> gdfa = grow_diag_final_and(srclen, trglen, forw, back)
>>> gdfa == set([(28, 18), (6, 6), (24, 17), (2, 1), (15, 12), (13, 12),
...         (2, 9), (3, 10), (26, 17), (25, 15), (8, 6), (9, 7), (20,
...         13), (18, 13), (0, 0), (10, 4), (13, 15), (23, 14), (7, 5),
...         (25, 14), (1, 9), (17, 13), (4, 11), (11, 17), (9, 2), (22,
...         12), (27, 18), (24, 16), (21, 3), (19, 12), (17, 12), (5,
...         12), (11, 6), (12, 8)])
True

References: Koehn, P., A. Axelrod, A. Birch, C. Callison-Burch, M. Osborne, and D. Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. In MT Eval Workshop.

Parameters:
  • srclen (int) – the number of tokens in the source language
  • trglen (int) – the number of tokens in the target language
  • e2f (str) – the forward word alignment outputs from source-to-target language (in pharaoh output format)
  • f2e (str) – the backward word alignment outputs from target-to-source language (in pharaoh output format)
Return type:

set(tuple(int))

Returns:

the symmetrized alignment points from the GDFA algorithm
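Step 1 of the algorithm can be sketched by parsing the two Pharaoh-format strings into sets of index pairs and intersecting them; Steps 2 and 3 only ever add points drawn from the union (illustrative helper names, not the module's code):

```python
def parse_pharaoh(s):
    # "0-0 2-1" -> {(0, 0), (2, 1)}
    return {tuple(int(i) for i in pair.split('-')) for pair in s.split()}

e2f = parse_pharaoh('0-0 2-1 3-2 4-2')
f2e = parse_pharaoh('0-0 1-1 3-2 4-3')

alignment = e2f & f2e   # Step 1: start from the intersection
union = e2f | f2e       # Steps 2 and 3 only ever add points from the union
```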

nltk.align.ibm1 module

class nltk.align.ibm1.IBMModel1(align_sents, num_iter)[source]

Bases: builtins.object

This class implements the algorithm of Expectation Maximization for the IBM Model 1.

Step 1 - Collect the evidence of an English word being translated by a
foreign language word.
Step 2 - Estimate the probability of translation according to the
evidence from Step 1.
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> aligned_sent = ibm.align(bitexts[6])
>>> aligned_sent.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, 7), (4, 7), (5, 8)])
>>> print('{0:.3f}'.format(bitexts[6].precision(aligned_sent)))
0.556
>>> print('{0:.3f}'.format(bitexts[6].recall(aligned_sent)))
0.833
>>> print('{0:.3f}'.format(bitexts[6].alignment_error_rate(aligned_sent)))
0.333
align(align_sent)[source]

Returns the alignment result for one sentence pair.

train(align_sents, num_iter)[source]

Return the translation probability model trained by IBM model 1.

Arguments: align_sents – A list of AlignedSent instances containing the sentence pairs. num_iter – The number of iterations.

Returns: t_ef – A dictionary of translation probabilities.
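The EM loop for Model 1 fits in a few lines of self-contained Python; the following is an illustrative sketch (not the NLTK implementation; names are ours) trained on the toy corpus used elsewhere in this module's docs:

```python
from collections import defaultdict

def ibm1_train(bitext, num_iter=10):
    # bitext: list of (source_tokens, target_tokens) pairs
    target_vocab = {e for _, trg in bitext for e in trg}
    # t[(e, f)] approximates t(e | f), initialised uniformly
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(num_iter):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # expected counts c(f)
        for src, trg in bitext:      # E step: collect evidence
            for e in trg:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        for (e, f), c in count.items():  # M step: re-estimate t
            t[(e, f)] = c / total[f]
    return t

bitext = [(['das', 'Haus'], ['the', 'house']),
          (['das', 'Buch'], ['the', 'book']),
          (['ein', 'Buch'], ['a', 'book'])]
t = ibm1_train(bitext)
# after a few iterations, 'Buch' prefers 'book' over 'the'
```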

nltk.align.ibm2 module

class nltk.align.ibm2.IBMModel2(align_sents, num_iter)[source]

Bases: builtins.object

This class implements the algorithm of Expectation Maximization for the IBM Model 2.

Step 1 - Run a number of iterations of IBM Model 1 and get the initial
distribution of translation probability.
Step 2 - Collect the evidence of an English word being translated by a
foreign language word.
Step 3 - Estimate the probability of translation and alignment according
to the evidence from Step 2.
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel2(bitexts, 5)
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
>>> aligned_sent.mots
['Resumption', 'of', 'the', 'session']
>>> aligned_sent.alignment
Alignment([(0, 0), (1, 2), (2, 3)])
>>> bitexts[0].precision(aligned_sent)
0.75
>>> bitexts[0].recall(aligned_sent)
1.0
>>> bitexts[0].alignment_error_rate(aligned_sent)
0.1428571428571429
align(align_sent)[source]

Returns the alignment result for one sentence pair.

train(align_sents, num_iter)[source]

Return the translation and alignment probability distributions trained by the Expectation Maximization algorithm for IBM Model 2.

Arguments: align_sents – A list of AlignedSent instances containing the sentence pairs. num_iter – The number of iterations.

Returns: t_ef – A distribution of translation probabilities. align – A distribution of alignment probabilities.

nltk.align.ibm3 module

class nltk.align.ibm3.HashableDict[source]

Bases: builtins.dict

This class implements a hashable dict, which can be put into a set.

class nltk.align.ibm3.IBMModel3(align_sents, num_iter)[source]

Bases: builtins.object

This class implements the algorithm of Expectation Maximization for the IBM Model 3.

Step 1 - Run a number of iterations of IBM Model 2 and get the initial
distribution of translation probability.

Step 2 - Sample the alignment spaces by using the hillclimb approach.

Step 3 - Collect the evidence of translation probabilities, distortion,
the probability of null insertion, and fertility.
Step 4 - Estimate the new probabilities according to the evidence from
Step 3.
>>> align_sents = []
>>> align_sents.append(AlignedSent(['klein', 'ist', 'das', 'Haus'], ['the', 'house', 'is', 'small']))
>>> align_sents.append(AlignedSent(['das', 'Haus'], ['the', 'house']))
>>> align_sents.append(AlignedSent(['das', 'Buch'], ['the', 'book']))
>>> align_sents.append(AlignedSent(['ein', 'Buch'], ['a', 'book']))
>>> ibm3 = IBMModel3(align_sents, 5)
>>> print('{0:.1f}'.format(ibm3.probabilities['Buch']['book']))
1.0
>>> print('{0:.1f}'.format(ibm3.probabilities['das']['book']))
0.0
>>> print('{0:.1f}'.format(ibm3.probabilities[None]['book']))
0.0
>>> aligned_sent = ibm3.align(align_sents[0])
>>> aligned_sent.words
['klein', 'ist', 'das', 'Haus']
>>> aligned_sent.mots
['the', 'house', 'is', 'small']
>>> aligned_sent.alignment
Alignment([(0, 2), (1, 3), (2, 0), (3, 1)])
align(align_sent)[source]

Returns the alignment result for one sentence pair.

hillclimb(a, j_pegged, es, fs, fert)[source]

This function returns a locally optimal alignment. It generates neighboring alignments and moves to the one with the highest probability, repeating the search until the current alignment scores at least as high as all of its neighbors.

neighboring(a, j_pegged, es, fs, fert)[source]

This function returns the neighboring alignments from the given alignment by moving or swapping one distance.

probability(a, es, fs, Fert)[source]

This function returns the probability of a given alignment. The Fert argument corresponds to the fertility Φ in the model's formula: it records, under the current alignment, how many output words each input word generates.

sample(e, f)[source]

This function returns a sample from the entire alignment space. First, it pegs one alignment point and finds the best alignment using IBM Model 2. Then, using the hill-climbing approach, it finds a locally optimal alignment and returns all of its neighbors, obtained by moving or swapping one alignment point.

train(align_sents, num_iter)[source]

This function is the main training routine: it initializes all the probability distributions and runs a specified number of iterations.

nltk.align.phrase_based module

nltk.align.phrase_based.extract(f_start, f_end, e_start, e_end, alignment, e_aligned, f_aligned, srctext, trgtext, srclen, trglen)[source]

This function checks alignment points for consistency and extracts phrases from the consistent chunks.

A phrase pair (ē, f̄) is consistent with an alignment A if and only if:

  1. No English words in the phrase pair are aligned to words outside it.

    ∀e_i ∈ ē: (e_i, f_j) ∈ A ⇒ f_j ∈ f̄

  2. No Foreign words in the phrase pair are aligned to words outside it.

    ∀f_j ∈ f̄: (e_i, f_j) ∈ A ⇒ e_i ∈ ē

  3. The phrase pair contains at least one alignment point.

    ∃e_i ∈ ē, f_j ∈ f̄ such that (e_i, f_j) ∈ A
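The three conditions can be sketched as a single check over a set of (e, f) index pairs (illustrative code with inclusive phrase spans, not the module's implementation):

```python
def is_consistent(e_start, e_end, f_start, f_end, alignment):
    # alignment: set of (e, f) index pairs; spans are inclusive
    inside = [(e, f) for (e, f) in alignment
              if e_start <= e <= e_end and f_start <= f <= f_end]
    if not inside:
        return False  # condition 3: needs at least one alignment point
    for e, f in alignment:
        if e_start <= e <= e_end and not f_start <= f <= f_end:
            return False  # condition 1: English word aligned outside
        if f_start <= f <= f_end and not e_start <= e <= e_end:
            return False  # condition 2: foreign word aligned outside
    return True
```

For example, with the alignment from the phrase_extraction doctest below, ('he', 'er') is consistent while 'will' alone is not, because 'bleibt' is also aligned to 'stay'.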

Parameters:
  • f_start (int) – Starting index of the possible foreign language phrase
  • f_end (int) – Ending index of the possible foreign language phrase
  • e_start (int) – Starting index of the possible source language phrase
  • e_end (int) – Ending index of the possible source language phrase
  • srctext (list) – The source language tokens, a list of strings.
  • trgtext (list) – The target language tokens, a list of strings.
  • srclen (int) – The number of tokens in the source sentence.
  • trglen (int) – The number of tokens in the target sentence.
nltk.align.phrase_based.phrase_extraction(srctext, trgtext, alignment)[source]

Phrase extraction algorithm extracts all consistent phrase pairs from a word-aligned sentence pair.

The idea is to loop over all possible source language (e) phrases and find the minimal foreign phrase (f) that matches each of them. Matching is done by identifying all alignment points for the source phrase and finding the shortest foreign phrase that includes all the foreign counterparts for the source words.

In short, a phrase alignment has to (a) contain all alignment points for all covered words and (b) contain at least one alignment point.

>>> srctext = "michael assumes that he will stay in the house"
>>> trgtext = "michael geht davon aus , dass er im haus bleibt"
>>> alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), 
... (5,9), (6,7), (7,7), (8,8)]
>>> phrases = phrase_extraction(srctext, trgtext, alignment)
>>> for i in sorted(phrases):
...    print(i)
...
((0, 1), (0, 1), 'michael', 'michael')
((0, 2), (0, 4), 'michael assumes', 'michael geht davon aus')
((0, 2), (0, 4), 'michael assumes', 'michael geht davon aus ,')
((0, 3), (0, 6), 'michael assumes that', 'michael geht davon aus , dass')
((0, 4), (0, 7), 'michael assumes that he', 'michael geht davon aus , dass er')
((0, 9), (0, 10), 'michael assumes that he will stay in the house', 'michael geht davon aus , dass er im haus bleibt')
((1, 2), (1, 4), 'assumes', 'geht davon aus')
((1, 2), (1, 4), 'assumes', 'geht davon aus ,')
((1, 3), (1, 6), 'assumes that', 'geht davon aus , dass')
((1, 4), (1, 7), 'assumes that he', 'geht davon aus , dass er')
((1, 9), (1, 10), 'assumes that he will stay in the house', 'geht davon aus , dass er im haus bleibt')
((2, 3), (5, 6), 'that', ', dass')
((2, 3), (5, 6), 'that', 'dass')
((2, 4), (5, 7), 'that he', ', dass er')
((2, 4), (5, 7), 'that he', 'dass er')
((2, 9), (5, 10), 'that he will stay in the house', ', dass er im haus bleibt')
((2, 9), (5, 10), 'that he will stay in the house', 'dass er im haus bleibt')
((3, 4), (6, 7), 'he', 'er')
((3, 9), (6, 10), 'he will stay in the house', 'er im haus bleibt')
((4, 6), (9, 10), 'will stay', 'bleibt')
((4, 9), (7, 10), 'will stay in the house', 'im haus bleibt')
((6, 8), (7, 8), 'in the', 'im')
((6, 9), (7, 9), 'in the house', 'im haus')
((8, 9), (8, 9), 'house', 'haus')
Parameters:
  • srctext (str) – The sentence string from the source language.
  • trgtext (str) – The sentence string from the target language.
  • alignment (list(tuple)) – The word alignment outputs as a list of tuples, where the first element of each tuple is the index of a source word and the second element is the index of a target word. This is also the output format of nltk/align/ibm1.py

Return type:list(tuple)
Returns:A list of tuples; each element is a phrase, and each phrase is a tuple of (i) its source location, (ii) its target location, (iii) the source phrase and (iv) the target phrase. Together, the tuples represent all the phrases that can be extracted from the word alignments.

nltk.align.util module

Module contents

Experimental functionality for bitext alignment. These interfaces are prone to change.