org.apache.hadoop.mapred
Class FileInputFormat<K extends WritableComparable,V extends Writable>

java.lang.Object
  extended by org.apache.hadoop.mapred.FileInputFormat<K,V>
All Implemented Interfaces:
InputFormat<K,V>
Direct Known Subclasses:
InputFormatBase, KeyValueTextInputFormat, MultiFileInputFormat, SequenceFileInputFormat, TextInputFormat

public abstract class FileInputFormat<K extends WritableComparable,V extends Writable>
extends Object
implements InputFormat<K,V>

A base class for file-based InputFormats.

FileInputFormat is the base class for all file-based InputFormats. It provides generic implementations of validateInput(JobConf) and getSplits(JobConf, int). Implementations of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure that input files are not split up and are processed as a whole by Mappers.


Field Summary
static org.apache.commons.logging.Log LOG
           
 
Constructor Summary
FileInputFormat()
           
 
Method Summary
protected  long computeSplitSize(long goalSize, long minSize, long blockSize)
           
abstract  RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
          Get the RecordReader for the given InputSplit.
 InputSplit[] getSplits(JobConf job, int numSplits)
          Splits files returned by listPaths(JobConf) when they're too big.
protected  boolean isSplitable(FileSystem fs, Path filename)
          Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be.
protected  Path[] listPaths(JobConf job)
          List input directories.
protected  void setMinSplitSize(long minSplitSize)
           
 void validateInput(JobConf job)
          Check for validity of the input-specification for the job.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

FileInputFormat

public FileInputFormat()
Method Detail

setMinSplitSize

protected void setMinSplitSize(long minSplitSize)

isSplitable

protected boolean isSplitable(FileSystem fs,
                              Path filename)
Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that each Mapper processes an entire file.

Parameters:
fs - the file system that the file is on
filename - the file name to check
Returns:
is this file splitable?
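
The decision an override might make can be sketched in plain Java. This is a stand-in that does not use the Hadoop API: the class name SplitableSketch, the String parameter (in place of FileSystem and Path), and the suffix list are all hypothetical, chosen only to illustrate the "stream compressed files are not splitable" rule described above.

```java
import java.util.Locale;

// Plain-Java stand-in, NOT the Hadoop API: models the kind of check an
// isSplitable override might perform, using only the file name.
final class SplitableSketch {
    // Hypothetical suffixes treated as stream compressed (assumption for illustration).
    private static final String[] STREAM_COMPRESSED_SUFFIXES = {".gz", ".deflate"};

    // Returns false when the name ends in one of the suffixes above, true otherwise.
    static boolean isSplitable(String filename) {
        String lower = filename.toLowerCase(Locale.ROOT);
        for (String suffix : STREAM_COMPRESSED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false; // the file must be read from the start, so keep it whole
            }
        }
        return true;
    }
}
```

A subclass that wants every file processed whole, regardless of name, would simply return false from its override.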

getRecordReader

public abstract RecordReader<K,V> getRecordReader(InputSplit split,
                                                  JobConf job,
                                                  Reporter reporter)
                                           throws IOException
Description copied from interface: InputFormat
Get the RecordReader for the given InputSplit.

It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.

Specified by:
getRecordReader in interface InputFormat<K extends WritableComparable,V extends Writable>
Parameters:
split - the InputSplit
job - the job that this split belongs to
Returns:
a RecordReader
Throws:
IOException

listPaths

protected Path[] listPaths(JobConf job)
                    throws IOException
List input directories. Subclasses may override to, e.g., select only files matching a regular expression.

Parameters:
job - the job to list input paths for
Returns:
array of Path objects
Throws:
IOException - if there are no input paths to list.
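
The regular-expression selection mentioned above can be sketched without the Hadoop classes. The example below is a minimal illustration under stated assumptions: paths are modeled as plain strings rather than Path objects, and the class PathFilterSketch and method selectMatching are hypothetical names, not part of this API. An actual override would apply the same filter to the Path[] it is about to return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java stand-in, NOT the Hadoop API: filters path strings by matching
// the final path component against a regular expression.
final class PathFilterSketch {
    static List<String> selectMatching(List<String> paths, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<String>();
        for (String path : paths) {
            // Take the text after the last '/', or the whole string if there is none.
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (pattern.matcher(name).matches()) {
                selected.add(path);
            }
        }
        return selected;
    }
}
```

For instance, the pattern `part-\d+` keeps files named like part-00000 and drops everything else in the same directory.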

validateInput

public void validateInput(JobConf job)
                   throws IOException
Description copied from interface: InputFormat
Check for validity of the input-specification for the job.

This method is used to validate the input directories when a job is submitted, so that the JobClient can fail early with a useful error message in case of errors, for example when an input directory does not exist.

Specified by:
validateInput in interface InputFormat<K extends WritableComparable,V extends Writable>
Parameters:
job - job configuration.
Throws:
InvalidInputException - if the job does not have valid input
IOException

getSplits

public InputSplit[] getSplits(JobConf job,
                              int numSplits)
                       throws IOException
Splits files returned by listPaths(JobConf) when they're too big.

Specified by:
getSplits in interface InputFormat<K extends WritableComparable,V extends Writable>
Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
Returns:
an array of InputSplits for the job.
Throws:
IOException
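
To make "splits files ... when they're too big" concrete, here is a minimal sketch of cutting one file into byte ranges of at most a given size. It is an illustration only, not the library's implementation: the class FileSplitSketch, the method split, and the {offset, length} array representation are hypothetical, and the sketch ignores block locations, isSplitable, and any tolerance the real method may apply to a small trailing range.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in, NOT the Hadoop API: cuts the byte range [0, fileLength)
// into consecutive pieces no longer than splitSize.
final class FileSplitSketch {
    // Each element of the result is a two-element array: {offset, length}.
    static List<long[]> split(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<long[]>();
        long offset = 0;
        while (offset < fileLength) {
            // The last piece may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
            offset += length;
        }
        return splits;
    }
}
```

With a 250-byte file and a split size of 100, this yields three ranges: 0 to 99, 100 to 199, and 200 to 249.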

computeSplitSize

protected long computeSplitSize(long goalSize,
                                long minSize,
                                long blockSize)
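
This page gives no description for computeSplitSize. One plausible reading of the parameter names, offered here as an assumption rather than documented behavior, is that the split size is the goal size capped at the block size, but never smaller than the minimum size. The class name SplitSizeSketch is hypothetical.

```java
// Plain-Java sketch, NOT taken from this page: one plausible way to combine
// the three parameters (assumption: cap the goal at the block size, then
// enforce the minimum).
final class SplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
```

Under this reading, a goal larger than the block size produces block-sized splits, a smaller goal produces goal-sized splits, and a minimum set through setMinSplitSize(long) overrides both when it is larger.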


Copyright © 2008 The Apache Software Foundation