Short Description |
Ports |
Metadata |
ParallelReader Attributes |
Details |
Examples |
Best Practices |
See also |
ParallelReader reads data from flat files using multiple threads.
Component | Data source | Input ports | Output ports | Each to all outputs | Different to different outputs | Transformation | Transf. req. | Java | CTL | Auto-propagated metadata |
---|---|---|---|---|---|---|---|---|---|---|
ParallelReader | flat file | 0 | 1-2 |
Port type | Number | Required | Description | Metadata |
---|---|---|---|---|
Output | 0 | for correct data records | any | |
1 | for incorrect data records | specific structure, see table below |
Parsed data records are sent to the first output port.
The component has an optional output logging port for getting detailed information about incorrect records. To receive all incorrect records on the error port, together with the incorrect value, its location, and the error message, set Data policy to Controlled and connect an edge to the error port.
ParallelReader has a metadata template on the second output port.
Metadata on the output port can use Autofilling Functions.
Table 53.11. Error Metadata for Parallel Reader
Field Number | Field Content | Data Type | Description |
---|---|---|---|
0 | record number | long | position of the erroneous record in the dataset (record numbering starts at 1) |
1 | field number | integer | position of the erroneous field in the record (1 stands for the first field, i.e., that of index 0) |
2 | original data | string | erroneous record in raw form (including delimiters) |
3 | error message | string | error message - detailed information about this error |
4 | reading thread offset | long | indicates the initial file offset of the parsing thread (optional field) |
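The error metadata above can be pictured with a small Python sketch. It is only an illustration of the Controlled data policy, not the component's implementation: the names `ErrorRecord` and `parse_controlled` and the two-field `id;name` layout are hypothetical, and the optional reading thread offset field is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ErrorRecord:
    record_number: int   # 1-based position of the record in the dataset
    field_number: int    # 1-based position of the erroneous field
    original_data: str   # raw record, delimiters included
    error_message: str   # details about the error


def parse_controlled(lines: List[str]) -> Tuple[List[Tuple[int, str]], List[ErrorRecord]]:
    """Parse hypothetical 'id;name' records; collect bad ones instead of failing."""
    good, bad = [], []
    for record_number, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\n").split(";")
        try:
            # Field 1 must be an integer; a failure here is reported for field 1.
            good.append((int(fields[0]), fields[1]))
        except ValueError as exc:
            bad.append(ErrorRecord(record_number, 1, raw, str(exc)))
    return good, bad


good, bad = parse_controlled(["1;Alice\n", "x;Bob\n", "3;Carol\n"])
```

Here `good` plays the role of output port 0 and `bad` the role of the error port: the second record is reported with record number 2 and field number 1, and its raw form keeps the delimiters.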
Attribute | Req | Description | Possible values |
---|---|---|---|
Basic | |||
File URL | Specifies the data source(s) to be read. See Supported File URL Formats for Readers. | ||
Charset | Encoding of records that are read in. The default encoding depends on DEFAULT_CHARSET_DECODER in defaultProperties. | UTF-8 | <other encodings> | |
Data policy | Determines what should be done when an error occurs. See Data Policy for more information. | Strict (default) | Controlled | Lenient | |
Trim strings | Specifies whether leading and trailing whitespace should be removed from strings before setting them to data fields, see Trimming Data. If true, the use of the robust parser is forced. | false (default) | true | |
Quoted strings | Fields containing a special character (comma, newline, or double quote) have to be enclosed in quotes. Only a single or double quote is accepted as the quote character. If true, field delimiters inside quoted strings are ignored and the quote characters are removed from the field. By default, the value of this attribute is inherited from metadata on output port 0. See also Record Details. | false | true | |
Quote character | Specifies which kind of quotes will be permitted in Quoted strings. By default, the value of this attribute is inherited from metadata on output port 0. See also Record Details. | both | " | ' | |
Advanced | |||
Skip leading blanks | Specifies whether to skip leading whitespace (e.g. blanks) before setting input strings to data fields. If not explicitly set (i.e., having the default value), the value of the Trim strings attribute is used. See Trimming Data. If true, the use of the robust parser is enforced. | false (default) | true | |
Skip trailing blanks | Specifies whether to skip trailing whitespace (e.g. blanks) before setting input strings to data fields. If not explicitly set (i.e., having the default value), the value of the Trim strings attribute is used. See Trimming Data. If true, the use of the robust parser is enforced. | false (default) | true | |
Max error count | Maximum number of tolerated error records in input file(s); applicable only if Controlled Data Policy is set. | 0 (default) - N | |
Treat multiple delimiters as one | If set to true, a run of repeated delimiter characters between fields is interpreted as a single delimiter. | false (default) | true | |
Verbose | By default, less comprehensive error notification is provided and the performance is slightly higher. If switched to true, more detailed error information is provided at the cost of lower performance. | false (default) | true | |
Level of parallelism | Number of threads used to read input data files. The order of records is not preserved if it is 2 or higher. If the file is too small, this value will be switched to 1 automatically. | 2 (default) | 1-n | |
Distributed file segment reading | If the component runs in a CloverETL Server Cluster environment and a shared file is read, each instance of the component processes its own part of the file. The whole file is divided into segments by CloverETL Server and each cluster worker processes only its own segment. By default, this option is turned off. This attribute is ignored for partitioned files. | false (default) | true | |
Parser | By default, the most appropriate parser is applied. The parser for processing data may also be set explicitly. If an improper one is set, an exception is thrown and the graph fails. See Data Parsers. | auto (default) | <other> |
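As an illustration of the Treat multiple delimiters as one attribute, the following Python sketch contrasts the two settings on a semicolon-delimited line. It only mimics the described behaviour; the sample line is made up and the component's own parsers are not involved.

```python
import re

line = "a;;b;;;c"

# false (default): every delimiter ends a field, so empty fields appear.
separate = line.split(";")

# true: a run of delimiters counts as a single delimiter.
merged = re.split(r";+", line)

print(separate)  # ['a', '', 'b', '', '', 'c']
print(merged)    # ['a', 'b', 'c']
```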
ParallelReader reads delimited flat files (CSV, tab-delimited, etc.), fixed-length files, and mixed text files. The component can read a single file as well as a collection of files, located on a local disk or remotely. Remote files are accessible via the FTP and S3 protocols.
Reading runs in several parallel threads, which improves the reading speed. The input file is divided into a set of chunks and each reading thread parses only the records from its own chunk.
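The chunking idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's actual algorithm: the helper names (`chunk_bounds`, `read_chunk`, `parallel_read`), the newline record delimiter, and the sample data are all hypothetical. Each thread handles the records that begin inside its byte range, so a record straddling a chunk boundary is read exactly once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(path, parts):
    """Split the file into byte ranges (start, end), one per thread."""
    size = os.path.getsize(path)
    step = max(1, size // parts)
    starts = list(range(0, size, step))[:parts]
    return list(zip(starts, starts[1:] + [size]))


def read_chunk(path, start, end):
    """Return the records that begin inside [start, end)."""
    records = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline, so a
            # record starting exactly at `start` is kept and a partial
            # record belonging to the previous chunk is skipped.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            records.append(line.rstrip(b"\n").decode("utf-8"))
    return records


def parallel_read(path, threads=2):
    bounds = chunk_bounds(path, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        chunks = pool.map(lambda b: read_chunk(path, *b), bounds)
        return [rec for chunk in chunks for rec in chunk]


# Demo on a temporary file with 100 made-up records.
with tempfile.NamedTemporaryFile("w", newline="\n", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"{i};name{i}\n" for i in range(1, 101)))
records = parallel_read(tmp.name, threads=4)
os.remove(tmp.name)
```

Unlike the component, this sketch concatenates the chunks in file order, so the record order is preserved. It shows only how a file can be split on record boundaries and says nothing about the achievable speedup.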
The component can use either the fast simplistic parser (SimpleDataParser) or the robust one (CharByteDataParser).
Which parser is used depends on the component settings and data structure.
If you use ParallelReader instead of FlatFileReader, the speedup is more significant when the metadata contains many date fields.
The Quoted strings attribute considerably changes the way your data is parsed.
If it is set to true, all field delimiters inside quoted strings are ignored (after the first Quote character is actually read).
Quote characters are removed from the field.
Example input:
1;"lastname;firstname";gender
Output with Quoted strings == true:
{1}, {lastname;firstname}, {gender}
Output with Quoted strings == false:
{1}, {"lastname}, {firstname";gender}
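The same behaviour can be reproduced with a short Python sketch. It uses Python's csv module as a stand-in for the component's parser, and it assumes three-field metadata in which the last field runs to the end of the record (hence `maxsplit=2` in the unquoted case).

```python
import csv

line = '1;"lastname;firstname";gender'

# Quoted strings == true: delimiters inside quotes are ignored
# and the quote characters are removed.
quoted = next(csv.reader([line], delimiter=";", quotechar='"'))

# Quoted strings == false: quotes are ordinary characters. The third (last)
# field is assumed to extend to the end of the record.
unquoted = line.split(";", 2)

print(quoted)    # ['1', 'lastname;firstname', 'gender']
print(unquoted)  # ['1', '"lastname', 'firstname";gender']
```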
This example shows the basic use of ParallelReader.
Read the file file.txt using ParallelReader.
In ParallelReader, specify File URL and connect an edge to the first output port.
ParallelReader will read the file using two threads.
We recommend specifying Charset explicitly.
ParallelReader is included in CloverETL Designer version 2.8.1 and higher.
Since 4.4.0-M1, ParallelReader supports reading files from S3.