org.apache.nutch.analysis.lang
Class NGramProfile

java.lang.Object
  extended by org.apache.nutch.analysis.lang.NGramProfile

public class NGramProfile
extends Object

This class represents an ngram profile. An ngram profile is the set of the most frequently used character sequences in a text or set of texts. This class can be used to run an ngram analysis over submitted text and to build new NGramProfiles from the result. A profile can be serialized to a text file, or initialized from an existing ngram profile file (an ngp file).

Author:
Sami Siren, Jérôme Charron

Constructor Summary
NGramProfile(String name, int minlen, int maxlen)
          Construct a new ngram profile
 
Method Summary
 void add(StringBuffer word)
          Add ngrams from a single word to this profile
 void add(Token t)
          Add ngrams from a token to this profile
 void analyze(StringBuffer text)
          Analyze a piece of text.
static NGramProfile create(String name, InputStream is, String encoding)
          Create a new ngram profile from an input stream.
 String getName()
          Returns the profile name.
 float getSimilarity(NGramProfile another)
          Calculate a score indicating how well two NGramProfiles match each other. The similarity calculation is experimental.
 List getSorted()
          Return a sorted list of ngrams.
 void load(InputStream is)
          Loads a ngram profile from an InputStream.
static void main(String[] args)
          Main method used for command line process.
protected  void normalize()
          Normalize the profile (calculates the ngram frequencies)
 void save(OutputStream os)
          Writes NGramProfile content into OutputStream.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

NGramProfile

public NGramProfile(String name,
                    int minlen,
                    int maxlen)
Construct a new ngram profile

Parameters:
name - is the name of the profile
minlen - is the min length of ngram sequences
maxlen - is the max length of ngram sequences
Method Detail

getName

public String getName()
Returns the profile name.

Returns:
the profile name.

add

public void add(Token t)
Add ngrams from a token to this profile

Parameters:
t - is the Token to be added

add

public void add(StringBuffer word)
Add ngrams from a single word to this profile

Parameters:
word - is the word to add
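
To make the role of minlen and maxlen concrete, here is a minimal sketch of ngram extraction from a single word. It is not the NGramProfile implementation: the class name NGramSketch and the method ngrams are hypothetical, and this page does not say whether the real class pads words with a boundary marker before splitting them, so none is applied here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of org.apache.nutch.analysis.lang.
public class NGramSketch {

    // Returns every substring of word whose length lies in [minLen, maxLen].
    static List<String> ngrams(String word, int minLen, int maxLen) {
        List<String> result = new ArrayList<>();
        for (int n = minLen; n <= maxLen; n++) {
            // Slide a window of width n across the word.
            for (int start = 0; start + n <= word.length(); start++) {
                result.add(word.substring(start, start + n));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [t, e, x, t, te, ex, xt]
        System.out.println(ngrams("text", 1, 2));
    }
}
```

A word shorter than minLen simply yields no ngrams in this sketch.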

analyze

public void analyze(StringBuffer text)
Analyze a piece of text.

Parameters:
text - is the text to be analyzed

normalize

protected void normalize()
Normalize the profile (calculates the ngram frequencies)
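
The exact normalization is not described on this page. One common reading of "calculates the ngram frequencies" is to divide each raw count by the total number of counted ngrams; the sketch below shows that under this assumption. NormalizeSketch and its normalize method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; assumes frequency = count / total count.
public class NormalizeSketch {

    // Converts raw ngram counts into relative frequencies.
    static Map<String, Float> normalize(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Float> frequencies = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Guard against an empty profile to avoid dividing by zero.
            float f = (total == 0) ? 0f : (float) e.getValue() / total;
            frequencies.put(e.getKey(), f);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("th", 3);
        counts.put("he", 1);
        // prints {th=0.75, he=0.25}
        System.out.println(normalize(counts));
    }
}
```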


getSorted

public List getSorted()
Return a sorted list of ngrams. The list is sorted by:
  1. frequency
  2. sequence

Returns:
A sorted list of ngrams
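
The two-level ordering can be sketched with a comparator. The sort directions are not stated above, so this example assumes descending frequency with ties broken by ascending sequence; the Entry class is a hypothetical stand-in for whatever element type the returned List actually holds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a frequency-then-sequence ordering.
public class SortSketch {

    static final class Entry {
        final String seq;
        final int count;

        Entry(String seq, int count) {
            this.seq = seq;
            this.count = count;
        }
    }

    // Returns a sorted copy: higher counts first, ties ordered by sequence.
    static List<Entry> sorted(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort((a, b) -> {
            if (a.count != b.count) {
                return Integer.compare(b.count, a.count); // assumed: descending
            }
            return a.seq.compareTo(b.seq); // assumed: ascending
        });
        return copy;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("he", 2));
        entries.add(new Entry("an", 2));
        entries.add(new Entry("th", 5));
        for (Entry e : sorted(entries)) {
            System.out.println(e.seq + " " + e.count);
        }
    }
}
```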

toString

public String toString()

getSimilarity

public float getSimilarity(NGramProfile another)
Calculate a score indicating how well two NGramProfiles match each other. The similarity calculation is experimental, so treat the result with caution.

Parameters:
another - is the ngram profile to compare against
Returns:
a similarity indicator, where 0 stands for an exact match.
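
Because 0 stands for an exact match, the returned value behaves like a distance rather than a similarity. The formula used by getSimilarity is not documented here; the sketch below shows one plausible distance with that property, the sum of absolute frequency differences over all ngrams seen in either profile. SimilaritySketch and distance are hypothetical names, and the profiles are modeled as plain maps from ngram to normalized frequency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the formula used by NGramProfile.
public class SimilaritySketch {

    // Sums |freqA - freqB| over the union of ngrams; 0 means identical profiles.
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        float sum = 0f;
        for (String k : keys) {
            // An ngram missing from one profile counts as frequency 0 there.
            sum += Math.abs(a.getOrDefault(k, 0f) - b.getOrDefault(k, 0f));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> p = new HashMap<>();
        p.put("th", 0.5f);
        p.put("he", 0.5f);
        Map<String, Float> q = new HashMap<>();
        q.put("th", 0.5f);
        q.put("an", 0.5f);
        System.out.println(distance(p, p)); // prints 0.0
        System.out.println(distance(p, q)); // prints 1.0
    }
}
```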

load

public void load(InputStream is)
          throws IOException
Loads an ngram profile from an InputStream. Note that this method assumes the stream is UTF-8 encoded.

Parameters:
is - is the InputStream to read
Throws:
IOException

create

public static NGramProfile create(String name,
                                  InputStream is,
                                  String encoding)
                           throws UnsupportedEncodingException
Create a new ngram profile from an input stream. Note that the submitted content must be fairly large to produce a good profile.

Parameters:
name - is the name of the profile.
is - is the stream to read.
encoding - is the encoding of the stream.
Throws:
UnsupportedEncodingException
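
The encoding argument and the declared exception match how the JDK decodes a byte stream by charset name. The sketch below uses only java.io to show that pattern; it does not call create, and EncodingSketch and readAll are hypothetical names. Whether create decodes its stream this way internally is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of decoding a stream with a named encoding.
public class EncodingSketch {

    // Reads the whole stream as text, decoding bytes with the given charset name.
    static String readAll(InputStream is, String encoding) throws IOException {
        // Throws UnsupportedEncodingException if the charset name is unknown.
        Reader reader = new InputStreamReader(is, encoding);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "sample text".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8"));
        try {
            readAll(new ByteArrayInputStream(bytes), "no-such-encoding");
        } catch (UnsupportedEncodingException e) {
            System.out.println("unsupported: " + e.getMessage());
        }
    }
}
```

Passing the encoding the text was actually written in matters: decoding with the wrong charset does not fail, it silently produces different characters and therefore different ngrams.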

save

public void save(OutputStream os)
          throws IOException
Writes the NGramProfile content to an OutputStream. The content is written using UTF-8 encoding.

Parameters:
os - is the stream to output to.
Throws:
IOException - if an error occurs while writing to the output stream.

main

public static void main(String[] args)
Main method used for command line process.
Usage is:
 NGramProfile [-create profilename filename encoding]
              [-similarity file1 file2]
              [-score profile-name filename encoding]
 

Parameters:
args - arguments.


Copyright © 2006 The Apache Software Foundation