org.apache.nutch.mapReduce
Interface InputFormat

All Known Implementing Classes:
InputFormatBase

public interface InputFormat

An input data format. Input files are stored in a NutchFileSystem, and the processing of a single input file may be split across multiple machines. Each file is processed as a sequence of records, read through a RecordReader, so files must be split on record boundaries.


Method Summary
 String getName()
          The name of this input format.
 RecordReader getRecordReader(NutchFileSystem fs, FileSplit split, JobConf job)
          Construct a RecordReader for a FileSplit.
 FileSplit[] getSplits(NutchFileSystem fs, JobConf job, int numSplits)
          Splits a set of input files.
 

Method Detail

getName

public String getName()
The name of this input format.

See Also:
InputFormats

getSplits

public FileSplit[] getSplits(NutchFileSystem fs,
                             JobConf job,
                             int numSplits)
                      throws IOException
Splits a set of input files. One split is created per map task.

Parameters:
fs - the filesystem containing the files to be split
job - the job whose input files are to be split
numSplits - the desired number of splits
Returns:
the splits
Throws:
IOException
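
The following is a minimal sketch of one way an implementation might divide input files into splits. It is not the Nutch implementation. The class SplitSketch, its nested Split type (a stand-in for FileSplit) and the method computeSplits are hypothetical names, and file sizes are passed in as a map instead of being read from a NutchFileSystem. The sketch treats numSplits as a hint, so the number of splits returned can differ from the number requested, and it cuts at raw byte offsets, leaving alignment to record boundaries to the reader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitSketch {
    /** Hypothetical stand-in for FileSplit: a byte range within one file. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Divides the given files into byte ranges of roughly
     * totalBytes / numSplits bytes each. A range never spans two files,
     * so the result may hold more than numSplits entries.
     */
    static List<Split> computeSplits(Map<String, Long> fileSizes, int numSplits) {
        long total = 0;
        for (long size : fileSizes.values()) {
            total += size;
        }
        // Target size per split; guard against zero for tiny inputs.
        long target = Math.max(1, total / Math.max(1, numSplits));

        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            long offset = 0;
            long remaining = entry.getValue();
            while (remaining > 0) {
                long length = Math.min(target, remaining);
                splits.add(new Split(entry.getKey(), offset, length));
                offset += length;
                remaining -= length;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("part-0", 100L); // assumed file names and sizes
        sizes.put("part-1", 60L);
        for (Split s : computeSplits(sizes, 4)) {
            System.out.println(s.file + " start=" + s.start + " length=" + s.length);
        }
    }
}
```

Because each range stays within one file, every byte of input belongs to exactly one split, which is the property a map task needs in order to process its share independently.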

getRecordReader

public RecordReader getRecordReader(NutchFileSystem fs,
                                    FileSplit split,
                                    JobConf job)
                             throws IOException
Construct a RecordReader for a FileSplit.

Parameters:
fs - the NutchFileSystem containing the file to be read
split - the FileSplit to read records from
job - the job that this split belongs to
Returns:
a RecordReader
Throws:
IOException
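
Since files must be split on record boundaries, a reader handed an arbitrary byte range has to decide which records it owns. The sketch below shows one common convention for newline-terminated records; the source does not say that Nutch uses it. The class LineSplitReaderSketch and the method readRecords are hypothetical, and an in-memory byte array stands in for a file in a NutchFileSystem. Under this convention, a split that starts mid-file skips forward to the first record boundary, and every split reads through to the end of the last record that begins inside its range.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineSplitReaderSketch {
    /**
     * Returns the newline-terminated records owned by the byte range
     * [start, start + length). A record belongs to the split in which
     * its first byte falls.
     */
    static List<String> readRecords(byte[] data, long start, long length) {
        int pos = (int) start;
        int end = (int) Math.min(data.length, start + length);

        if (pos > 0) {
            // Back up one byte so that a split starting exactly on a
            // record boundary keeps its first record, then skip to just
            // past the next newline.
            pos--;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }

        List<String> records = new ArrayList<>();
        // Any record that begins before 'end' is read in full, even if
        // its tail extends into the next split's range.
        while (pos < end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // The cut at byte 8 falls in the middle of "beta".
        System.out.println(readRecords(data, 0, 8));
        System.out.println(readRecords(data, 8, 9));
    }
}
```

With this rule, two adjacent splits never return the same record and never drop one, regardless of where the byte offsets fall.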


Copyright © 2006 The Apache Software Foundation