FlatFileReader

CloverETL 4.7.0

Short Description

FlatFileReader reads data from flat files such as CSV (comma-separated values) files and delimited, fixed-length, or mixed text files.

The component can read a single file as well as a collection of files placed on a local disk or remotely. Remote files are accessible via the HTTP, HTTPS, FTP, or SFTP protocols. ZIP and TAR archives of flat files can be read as well. Reading data from an input port or a dictionary is also supported.

FlatFileReader has an alias: UniversalDataReader.

Component	Data source	Input ports	Output ports	Each to all outputs	Different to different outputs	Transformation	Transf. req.	Java	CTL	Auto-propagated metadata
FlatFileReader	flat file	0-1	1-2

Ports

Port type	Number	Description	Metadata
Input	0	for Input Port Reading	includes a specific `byte`/`cbyte`/`string` field
Output	0	for correct data records	any
Output	1	for incorrect data records	specific structure, see table below

Metadata

FlatFileReader does not propagate metadata.

This component has Metadata Templates available.

The optional logging port for incorrect records has to define the following metadata structure: the record contains exactly five fields (named arbitrarily) of the given types in the following order:

Table 53.5. Error Metadata for FlatFileReader

Field number	Field name	Data type	Description
0	recordNo	long	position of the erroneous record in the dataset (record numbering starts at 1)
1	fieldNo	integer	position of the erroneous field in the record (1 stands for the first field, i.e., that of index 0)
2	originalData	string \| byte \| cbyte	erroneous record in raw form (including all field and record delimiters)
3	errorMessage	string \| byte \| cbyte	error message - detailed information about this error
4	fileURL	string	source file in which the error occurred

Metadata on output port 0 can use Autofilling Functions.

The source_timestamp and source_size functions work only when reading from a file directly. If the file is an archive or is stored in a remote location, the timestamp will be empty and the size will be 0.

FlatFileReader Attributes

Attribute	Req	Description	Possible values
Basic
File URL		Path to the data source (flat file, input port, or dictionary) to be read; see Supported File URL Formats for Readers.
Charset		Character encoding of input records (character encoding does not apply to byte fields if the record type is `fixed`). The default encoding depends on DEFAULT_CHARSET_DECODER in defaultProperties.	UTF-8 \| <other encodings>
Data policy		specifies how to handle misformatted or incorrect data, see Data Policy	strict (default) \| controlled \| lenient
Trim strings		specifies whether leading and trailing whitespace should be removed from strings before setting them to data fields, see Trimming Data below	default \| true \| false
Quoted strings		Fields containing a special character (comma, newline, or double quote) have to be enclosed in quotes. Only a single or double quote is accepted as the quote character. If `true`, the quote characters are removed when the component reads the data, and special characters inside the quotes are not treated as delimiters. Example: To read input data `"25"\|"John"`, switch Quoted strings to `true` and set Quote character to `"`. This will produce two fields: `25\|John`. By default, the value of this attribute is inherited from metadata on output port 0. See also Record Details.	false \| true
Quote character		Specifies which kind of quotes will be permitted in Quoted strings. By default, the value of this attribute is inherited from metadata on output port 0. See also Record Details.	both \| " \| '
Advanced
Skip leading blanks		Specifies whether to skip leading whitespace (e.g., blanks) before setting input strings to data fields. If not explicitly set (i.e., left at the default value), the value of the Trim strings attribute is used. See Trimming Data.	default \| true \| false
Skip trailing blanks		Specifies whether to skip trailing whitespace (e.g., blanks) before setting input strings to data fields. If not explicitly set (i.e., left at the default value), the value of the Trim strings attribute is used. See Trimming Data.	default \| true \| false
Number of skipped records		how many records/rows to be skipped from the source file(s); see Selecting Input Records.	0 (default) - N
Max number of records		Maximum number of records to be read from the source file(s) in turn; all records are read by default. See Selecting Input Records.	1 - N
Number of skipped records per source		Number of records/rows to be skipped from each source file. By default, the value of the Skip source rows record property in output port 0 metadata is used. If the value in metadata differs from the value of this attribute, the Number of skipped records per source value takes priority. See Selecting Input Records.	0 (default) - N
Max number of records per source		Maximum number of records/rows to be read from each source file; all records from each file are read by default. See Selecting Input Records.	1 - N
Max error count		maximum number of tolerated error records in input file(s); applicable only if `Controlled` Data Policy is set	0 (default) - N
Treat multiple delimiters as one		If set to `true`, a delimiter character repeated several times in a row between two fields is interpreted as a single delimiter.	false (default) \| true
Incremental file	[1]	Name of the file storing the incremental key, including path. See Incremental Reading.
Incremental key	[1]	Variable storing the position of the last read record. See Incremental Reading.
Verbose		By default, less comprehensive error notification is provided and the performance is slightly higher. If switched to `true`, more detailed information is provided at the cost of lower performance.	false (default) \| true
Parser		By default, the most appropriate parser is applied. The parser for processing data can also be set explicitly. If an improper one is set, an exception is thrown and the graph fails. See Data Parsers.	auto (default) \| `<other>`
[1] Either both or neither of these attributes must be specified.
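
The Quoted strings example from the attribute table can be reproduced outside CloverETL to see what the setting means. The following is only an illustrative sketch that uses Python's standard csv module as a stand-in for the component's parser; the function name `read_quoted` is hypothetical and is not part of CloverETL.

```python
import csv
import io


def read_quoted(line, delimiter="|", quote_char='"'):
    """Parse one delimited line, removing the enclosing quote characters."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter, quotechar=quote_char)
    return next(reader)


# The quotes are stripped and two fields are produced.
print(read_quoted('"25"|"John"'))    # ['25', 'John']

# A delimiter inside quotes is kept as data, not treated as a field separator.
print(read_quoted('"a|b"|"c"'))      # ['a|b', 'c']
```

The second call shows why quoting matters: without it, the pipe inside the first field would split the record into three fields.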

Details

Parsed data records are sent to the first output port. The component has an optional output logging port for getting detailed information about incorrect records. Incorrect records, together with information about the incorrect value, its location, and the error message, are sent out through this error port only if Data Policy is set to controlled and a proper Writer (Trash or FlatFileWriter) is connected to port 1.
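
The routing described above can be pictured with a small model. This is a sketch under stated assumptions, not the component's implementation: the two-field record layout (integer id, string name), the semicolon delimiter, and all names (`ErrorRecord`, `FieldError`, `parse_record`, `read_controlled`) are hypothetical. Only the five-field shape of the error record follows the Error Metadata table.

```python
from dataclasses import dataclass


@dataclass
class ErrorRecord:
    """Mirrors the five-field error metadata (field names are arbitrary)."""
    record_no: int       # 1-based position of the record in the dataset
    field_no: int        # 1-based position of the erroneous field
    original_data: str   # raw record, delimiters included
    error_message: str
    file_url: str


class FieldError(Exception):
    """Parsing failure that remembers which field caused it."""

    def __init__(self, field_no, message):
        super().__init__(message)
        self.field_no = field_no


def parse_record(raw, delimiter=";"):
    """Parse a hypothetical two-field record: integer id, string name."""
    parts = raw.rstrip("\r\n").split(delimiter)
    if len(parts) < 2:
        raise FieldError(2, "missing field")
    try:
        ident = int(parts[0])
    except ValueError:
        raise FieldError(1, "cannot parse %r as integer" % parts[0]) from None
    return ident, parts[1]


def read_controlled(lines, file_url):
    """Return (records for output port 0, error records for output port 1)."""
    good, errors = [], []
    for record_no, raw in enumerate(lines, start=1):
        try:
            good.append(parse_record(raw))
        except FieldError as exc:
            errors.append(ErrorRecord(record_no, exc.field_no, raw, str(exc), file_url))
    return good, errors


good, errors = read_controlled(["1;Ann\n", "x;Bob\n", "3;Cy\n"], "data/people.csv")
print(good)                                      # [(1, 'Ann'), (3, 'Cy')]
print(errors[0].record_no, errors[0].field_no)   # 2 1
```

Reading continues after the bad record, which is the point of the controlled policy: correct records still reach port 0 while the faulty one is reported with its position and raw content.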

Trimming Data

By default (i.e., with the Trim strings attribute kept at its default value), input strings are processed as follows before being converted to a value of the field's data type:
- Whitespace is removed from both the start and the end in case of boolean, date, decimal, integer, long, or number.
- Input string is set to a field including leading and trailing whitespace in case of byte, cbyte, or string.
If the Trim strings attribute is set to true, all leading and trailing whitespace characters are removed. A field composed of only whitespace is transformed to null (zero-length string). The false value preserves all leading and trailing whitespace characters. Remember that an input string representing a numerical data type or boolean cannot be parsed if it includes whitespace, so use the false value carefully.
Both the Skip leading blanks and Skip trailing blanks attributes have higher priority than Trim strings. When either is set to true or false, it determines the trimming on its side of the input string regardless of the Trim strings value.
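
The precedence rules above can be summarized in a few lines of code. This is a minimal sketch of the described behavior, not CloverETL code: the function names are hypothetical, `None` stands for the attribute's default value, and the empty string stands in for the null that a whitespace-only field becomes.

```python
# Field types whose input is trimmed on both sides by default.
TRIMMED_BY_DEFAULT = {"boolean", "date", "decimal", "integer", "long", "number"}


def _resolve(side_flag, trim_strings, field_type):
    """Skip leading/trailing blanks wins, then Trim strings, then the type default."""
    if side_flag is not None:
        return side_flag
    if trim_strings is not None:
        return trim_strings
    return field_type in TRIMMED_BY_DEFAULT


def trim_input(raw, field_type, trim_strings=None, skip_leading=None, skip_trailing=None):
    """Return the string that would be handed to the field's type conversion."""
    value = raw
    if _resolve(skip_leading, trim_strings, field_type):
        value = value.lstrip()
    if _resolve(skip_trailing, trim_strings, field_type):
        value = value.rstrip()
    return value


print(repr(trim_input("  42 ", "integer")))                    # '42'
print(repr(trim_input("  a ", "string")))                      # '  a '
print(repr(trim_input("  a ", "string", trim_strings=True)))   # 'a'
print(repr(trim_input("  a ", "string", trim_strings=True, skip_leading=False)))  # '  a'
```

The last call shows the override: Trim strings asks for trimming on both sides, but the explicit Skip leading blanks value keeps the leading whitespace.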

Data Parsers

org.jetel.data.parser.SimpleDataParser - a very simple but fast parser with limited validation, error handling, and functionality. The following attributes are not supported:
- Trim strings
- Skip leading blanks
- Skip trailing blanks
- Incremental reading
- Number of skipped records
- Max number of records
- Quoted strings
- Treat multiple delimiters as one
- Skip rows
- Verbose
In addition, you cannot use it with metadata containing any field that has one of these properties:
- the field is fixed-length
- the field has no delimiter or has more than one delimiter
- Shift is not null (see Details Pane)
- Autofilling set to true
- the field is byte-based
org.jetel.data.parser.DataParser - an all-round parser working with any reader settings
org.jetel.data.parser.CharByteDataParser - can be used whenever metadata contain byte-based fields mixed with char-based ones. A byte-based field is a field of one of these types: byte, cbyte or any other field whose format property starts with the "BINARY:" prefix. See Binary Formats.
org.jetel.data.parser.FixLenByteDataParser - used for metadata with byte-based fields only. It parses sequences of records consisting of a fixed number of bytes.

	Note
	Choosing `org.jetel.data.parser.SimpleDataParser` while using Quoted strings will cause the Quoted strings attribute to be ignored.
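
The field-level restrictions listed for SimpleDataParser amount to a checklist that can be applied to each metadata field. The sketch below is one hypothetical way to express that checklist; `FieldSpec` and `simple_parser_blockers` are invented for illustration and do not correspond to CloverETL classes. It also assumes that "Shift is not null" means a non-zero shift value.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """Hypothetical, simplified description of one metadata field."""
    name: str
    fixed_length: bool = False
    delimiters: tuple = (";",)
    shift: int = 0            # assumption: "not null" is modelled as non-zero
    autofilling: bool = False
    byte_based: bool = False


def simple_parser_blockers(fields):
    """List the reasons why the given metadata rule out SimpleDataParser."""
    problems = []
    for f in fields:
        if f.fixed_length:
            problems.append(f.name + ": fixed-length")
        if len(f.delimiters) != 1:
            problems.append(f.name + ": needs exactly one delimiter")
        if f.shift != 0:
            problems.append(f.name + ": shift is set")
        if f.autofilling:
            problems.append(f.name + ": autofilling")
        if f.byte_based:
            problems.append(f.name + ": byte-based")
    return problems


print(simple_parser_blockers([FieldSpec("id"), FieldSpec("name")]))
# []
print(simple_parser_blockers([FieldSpec("id"), FieldSpec("raw", byte_based=True)]))
# ['raw: byte-based']
```

An empty result only means the metadata pass these field-level checks; the unsupported component attributes listed above still apply.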

Tips & Tricks

Handling records with large data fields: FlatFileReader can process input strings of even hundreds or thousands of characters when you adjust the field and record buffer sizes. Increase the following properties according to your needs: Record.MAX_RECORD_SIZE for record serialization, DataParser.FIELD_BUFFER_LENGTH for parsing, and DataFormatter.FIELD_BUFFER_LENGTH for formatting. Finally, increase the DEFAULT_INTERNAL_IO_BUFFER_SIZE variable to at least 2*MAX_RECORD_SIZE. See Chapter 18, Engine Configuration to learn how to change these properties.
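
The relation between the record size and the I/O buffer is simple arithmetic, and a quick check can catch a mismatch before a graph is run. The helper below is a hypothetical sketch: the property names are taken from the paragraph above, while the numeric values are made-up examples and not CloverETL defaults.

```python
def min_io_buffer_size(max_record_size):
    """Smallest DEFAULT_INTERNAL_IO_BUFFER_SIZE allowed for a given record size."""
    return 2 * max_record_size


def io_buffer_is_large_enough(props):
    """Check the 'at least 2*MAX_RECORD_SIZE' rule on a dict of property values."""
    required = min_io_buffer_size(props["Record.MAX_RECORD_SIZE"])
    return props["DEFAULT_INTERNAL_IO_BUFFER_SIZE"] >= required


# Example values only (hypothetical).
props = {
    "Record.MAX_RECORD_SIZE": 65536,
    "DEFAULT_INTERNAL_IO_BUFFER_SIZE": 98304,
}
print(min_io_buffer_size(props["Record.MAX_RECORD_SIZE"]))   # 131072
print(io_buffer_is_large_enough(props))                      # False
```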

Examples

Processing files with headers

If the first rows of your input file do not represent real data but field labels instead, set the Number of skipped records attribute. If a collection of input files with headers is read, set the Number of skipped records per source attribute.
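
The difference between the two attributes shows up as soon as more than one file is read. The sketch below models it with plain lists of lines; the function name `read_sources` is hypothetical, and applying the per-source skip before the global skip is an assumption made for this illustration, since the combination order is not described here.

```python
from itertools import chain, islice


def read_sources(sources, skip_total=0, skip_per_source=0):
    """Concatenate the sources, skipping rows per source and then globally."""
    # Assumption: per-source skipping happens first, the global skip afterwards.
    per_source = (islice(lines, skip_per_source, None) for lines in sources)
    return list(islice(chain.from_iterable(per_source), skip_total, None))


files = [
    ["id;name", "1;Ann"],   # first file, with a header row
    ["id;name", "2;Bob"],   # second file, with a header row
]

# Skipping one record globally removes only the first file's header.
print(read_sources(files, skip_total=1))
# ['1;Ann', 'id;name', '2;Bob']

# Skipping one record per source removes every header.
print(read_sources(files, skip_per_source=1))
# ['1;Ann', '2;Bob']
```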

Handling typist's error when creating the input file manually

If you wish to ignore accidental errors in delimiters (such as two semicolons instead of a single one as defined in metadata when the input file is typed manually), set the Treat multiple delimiters as one attribute to true. All redundant delimiter chars will be replaced by the proper one.
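
The effect of the attribute can be shown on a single record. The following is an illustrative sketch only, with a hypothetical `split_fields` function that mimics the described behavior using a regular expression; it is not how the component is implemented.

```python
import re


def split_fields(record, delimiter=";", treat_multiple_as_one=False):
    """Split a record, optionally collapsing runs of the delimiter into one."""
    if treat_multiple_as_one:
        pattern = "(?:" + re.escape(delimiter) + ")+"
        return re.split(pattern, record)
    return record.split(delimiter)


# The typo ';;' produces a spurious empty field by default.
print(split_fields("1;;Ann;London"))
# ['1', '', 'Ann', 'London']

# With the attribute enabled, the doubled delimiter counts as one.
print(split_fields("1;;Ann;London", treat_multiple_as_one=True))
# ['1', 'Ann', 'London']
```

One consequence worth keeping in mind: once repeated delimiters are collapsed, an intentionally empty field between two delimiters can no longer be expressed.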

Best Practices

We recommend explicitly specifying the encoding of the input file (with the Charset attribute). This ensures better portability of the graph across systems with different default encodings.

The recommended encoding is UTF-8.

Compatibility

FlatFileReader is available since CloverETL 4.2.0-M1. In 4.2.0-M1, the UniversalDataReader was renamed to FlatFileReader.

In 4.4.0-M2, the default encoding was changed from ISO-8859-1 to UTF-8.

Troubleshooting

With the default charset (UTF-8), FlatFileReader cannot parse CSV files with binary data. To parse CSV files with binary data, change the Charset attribute.
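
The underlying reason is that arbitrary bytes are often not valid UTF-8, while a single-byte encoding maps every byte to a character. The sketch below demonstrates this in plain Python; ISO-8859-1 is used only as an example of such an encoding (it is an assumption here, as the text above does not name a replacement charset), and `decodes_losslessly` is a hypothetical helper.

```python
def decodes_losslessly(data, charset):
    """True if the bytes survive a decode/encode round trip in the charset."""
    try:
        return data.decode(charset).encode(charset) == data
    except UnicodeDecodeError:
        return False


# A record with a text field followed by two raw bytes that are invalid in UTF-8.
record = b"1;" + bytes([0xFF, 0xFE]) + b"\n"

print(decodes_losslessly(record, "utf-8"))        # False
print(decodes_losslessly(record, "iso-8859-1"))   # True
```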
