Supported File URL Formats for Readers

The File URL attribute may be defined using the URL File Dialog.

Important

To ensure graph portability, forward slashes must be used when defining the path in URLs (even on Microsoft Windows).
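As a small illustration of the rule above, a Windows-style path can be converted to the forward-slash form before it is used in a File URL. The following Python sketch is not part of CloverETL; the helper name to_portable_path is hypothetical and the sample path is made up.

```python
from pathlib import PureWindowsPath


def to_portable_path(windows_path: str) -> str:
    """Return the path written with forward slashes only."""
    # PureWindowsPath accepts both separators and as_posix() emits "/".
    return PureWindowsPath(windows_path).as_posix()


print(to_portable_path(r"C:\data\input\filename.txt"))  # C:/data/input/filename.txt
```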

Below are examples of possible URLs for Readers:

Reading of Local Files

  • /path/filename.txt

    Reads a specified file.

  • /path1/filename1.txt;/path2/filename2.txt

    Reads two specified files.

  • /path/filename?.txt

    Reads all files satisfying the mask.

  • /path/*

    Reads all files in a specified directory.

  • zip:(/path/file.zip)

    Reads the first file compressed in the file.zip file.

  • zip:(/path/file.zip)#innerfolder/filename.txt

    Reads a specified file compressed in the file.zip file.

  • gzip:(/path/file.gz)

    Reads the first file compressed in the file.gz file.

  • tar:(/path/file.tar)#innerfolder/filename.txt

    Reads a specified file archived in the file.tar file.

  • zip:(/path/file??.zip)#innerfolder?/filename.*

    Reads all files from the compressed zip file(s) that satisfy the specified mask. Wild cards (? and *) may be used in the compressed file names, inner folder and inner file names.

  • tar:(/path/file????.tar)#innerfolder??/filename*.txt

    Reads all files from the archive file(s) that satisfy the specified mask. Wild cards (? and *) may be used in the compressed file names, inner folder and inner file names.

  • gzip:(/path/file*.gz)

    Reads all gzipped files that satisfy the specified mask. Wild cards may be used in the compressed file names.

  • tar:(gzip:/path/file.tar.gz)#innerfolder/filename.txt

    Reads a specified file compressed in the file.tar.gz file.

    Note

    Although CloverETL can read data from a .tar file, writing to .tar files is not supported.

  • tar:(gzip:/path/file??.tar.gz)#innerfolder?/filename*.txt

    Reads all files from the gzipped tar archive file(s) that satisfy the specified mask. Wild cards (? and *) may be used in the compressed file names, inner folder and inner file names.

  • zip:(zip:(/path/name?.zip)#innerfolder/file.zip)#innermostfolder?/filename*.txt

    Reads all files satisfying the file mask from all paths satisfying the path mask from all compressed files satisfying the specified zip mask. Wild cards (? and *) may be used in the outer compressed files, innermost folder and innermost file names. They cannot be used in the inner folder and inner zip file names.
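The archive URLs above wrap one location inside another: the part in parentheses names the archive file and the part after # names the entry (or entry mask) inside it. The following Python sketch shows one way such a URL could be taken apart. It is not CloverETL's parser; the function name split_archive_url and the returned structure are assumptions made for illustration, and entry names containing a closing parenthesis are not handled.

```python
import re
from typing import List, Optional, Tuple

# One wrapper level: type:(inner)#entry, where "#entry" is optional.
ARCHIVE = re.compile(r"^(zip|gzip|tar):\((.*)\)(?:#(.*))?$")


def split_archive_url(url: str) -> Tuple[str, List[Tuple[str, Optional[str]]]]:
    """Return (base, layers).

    base is the innermost location (a local path or a remote URL).
    layers lists (archive_type, entry_mask) pairs in the order in which
    the archives would be opened; entry_mask is None when no "#..." part
    was given.
    """
    layers = []
    while True:
        match = ARCHIVE.match(url)
        if match is None:
            break
        archive_type, inner, entry = match.groups()
        layers.append((archive_type, entry))
        url = inner
    # The tar:(gzip:/path/file.tar.gz) examples use gzip: without parentheses.
    if url.startswith("gzip:"):
        layers.append(("gzip", None))
        url = url[len("gzip:"):]
    layers.reverse()  # wrappers were collected outermost first
    return url, layers


base, layers = split_archive_url(
    "zip:(zip:(/path/name?.zip)#innerfolder/file.zip)#innermostfolder?/filename*.txt"
)
print(base)    # /path/name?.zip
print(layers)  # [('zip', 'innerfolder/file.zip'), ('zip', 'innermostfolder?/filename*.txt')]
```

A plain path such as /path/filename.txt comes back unchanged with an empty list of layers.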

Reading of Remote Files

  • ftp://username:password@server/path/filename.txt

    Reads a specified filename.txt file on a remote server connected via an ftp protocol using username and password.

  • sftp://username:password@server/path/filename.txt

    Reads a specified filename.txt file on a remote server connected via an sftp protocol using username and password.

    If certificate-based authentication is used, certificates are placed in the ${PROJECT}/ssh-keys/ directory and each private key file name has the .key suffix. Only certificates without a password are currently supported. With certificate-based authentication, the URL contains no password:

    sftp://username@server/path/filename.txt

  • http://server/path/filename.txt

    Reads a specified filename.txt file on a remote server connected via an http protocol.

  • https://server/path/filename.txt

    Reads a specified filename.txt file on a remote server connected via an https protocol.

  • zip:(ftp://username:password@server/path/file.zip)#innerfolder/filename.txt

    Reads a specified filename.txt file compressed in the file.zip file on a remote server connected via an ftp protocol using username and password.

  • zip:(http://server/path/file.zip)#innerfolder/filename.txt

    Reads a specified filename.txt file compressed in the file.zip file on a remote server connected via an http protocol.

  • tar:(ftp://username:password@server/path/file.tar)#innerfolder/filename.txt

    Reads a specified filename.txt file archived in the file.tar file on a remote server connected via an ftp protocol using username and password.

  • zip:(zip:(ftp://username:password@server/path/name.zip)#innerfolder/file.zip)#innermostfolder/filename.txt

    Reads a specified filename.txt file compressed in the file.zip file that is also compressed in the name.zip file on a remote server connected via an ftp protocol using username and password.

  • gzip:(http://server/path/file.gz)

    Reads the first file compressed in the file.gz file on a remote server connected via an http protocol.

  • http://server/filename*.dat

    Reads all files from a WebDAV server that satisfy the specified mask (only * is supported).

  • s3://access_key_id:secret_access_key@s3.amazonaws.com/bucketname/filename*.out

    Reads all objects that satisfy the specified mask from the given bucket of the Amazon S3 web storage service, using the access key ID and secret access key.

    It is recommended to connect to S3 via a region-specific S3 URL: s3://s3.eu-central-1.amazonaws.com/bucket.name/. The region-specific URL has much better performance than the generic one (s3://s3.amazonaws.com/bucket.name/).

    See recommendation on Amazon S3 URL.

    Note

    The s3:// URL protocol has been available since CloverETL 4.1. More information about the deprecated http:// S3 protocol can be found in the CloverETL 4.0 User Guide.

  • hdfs://CONN_ID/path/filename.dat

    Reads a file from the Hadoop distributed file system (HDFS). The HDFS NameNode to connect to is defined in a Hadoop connection with the ID CONN_ID. This example file URL reads the file with the absolute HDFS path /path/filename.dat.

  • smb://domain%3Buser:password@server/path/filename.txt

    Reads files from a Windows share (Microsoft SMB/CIFS protocol). The URL path may contain wildcards (both * and ? are supported). The server part may be a DNS name, an IP address or a NetBIOS name. The userinfo part of the URL (domain%3Buser:password) is optional. Any URL reserved character it contains should be escaped using %-encoding, as the semicolon ; is escaped with %3B in the example (the semicolon must be escaped because it collides with the default Clover file URL separator).

    The SMB protocol is implemented in the JCIFS library which may be configured using Java system properties. See Setting Client Properties in the JCIFS documentation for the list of all configurable properties.
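The escaping rule for the userinfo part can be tried out with a short Python sketch. It percent-encodes the domain;user name and the password, and then shows that a naive split on the semicolon separator leaves the escaped URL in one piece. The helper names smb_url and split_file_urls are hypothetical, and the plain split on ";" is only a stand-in for how a multi-URL value might be separated, not CloverETL's actual logic.

```python
from typing import List, Optional
from urllib.parse import quote


def smb_url(server: str, path: str, user: Optional[str] = None,
            password: Optional[str] = None, domain: Optional[str] = None) -> str:
    """Build an smb:// URL with a percent-encoded userinfo part."""
    userinfo = ""
    if user is not None:
        name = f"{domain};{user}" if domain else user
        # safe="" makes quote() escape every reserved character, including ";".
        userinfo = quote(name, safe="")
        if password is not None:
            userinfo += ":" + quote(password, safe="")
        userinfo += "@"
    return f"smb://{userinfo}{server}/{path.lstrip('/')}"


def split_file_urls(file_url: str) -> List[str]:
    """Naive split of a multi-URL value on the ';' separator (assumption)."""
    return [part for part in file_url.split(";") if part]


url = smb_url("server", "/path/filename.txt", user="user",
              password="password", domain="domain")
print(url)                   # smb://domain%3Buser:password@server/path/filename.txt
print(split_file_urls(url))  # one element: the escaped URL is not split
```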

Reading from Input Port

  • port:$0.FieldName:discrete

    Data from each record field selected for input port reading are read as a single input file.

  • port:$0.FieldName:source

    URL addresses, i.e., values of the field selected for input port reading, are loaded and parsed.

  • port:$0.FieldName:stream

    Input port field values are concatenated and processed as input file(s); null values are replaced by EOF.
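The difference between the discrete and stream processing types can be pictured with a small simulation. The Python sketch below is only a reading of the descriptions above, not CloverETL code: in discrete mode each field value becomes its own input, while in stream mode values are concatenated until a null value (None here) ends the current input. The function names are hypothetical, and the handling of trailing values without a final null is an assumption.

```python
from typing import Iterable, List, Optional


def split_discrete(values: Iterable[str]) -> List[str]:
    """discrete: every field value is treated as one complete input."""
    return list(values)


def split_stream(values: Iterable[Optional[str]]) -> List[str]:
    """stream: concatenate values; a None value ends the current input."""
    inputs: List[str] = []
    current: List[str] = []
    for value in values:
        if value is None:
            inputs.append("".join(current))
            current = []
        else:
            current.append(value)
    if current:
        # Assumption: data not followed by a null still forms a last input.
        inputs.append("".join(current))
    return inputs


field_values = ["a;b\n", "c;d\n", None, "e;f\n", None]
print(split_stream(field_values))  # ['a;b\nc;d\n', 'e;f\n']
```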

Using Proxy in Readers

  • http:(direct:)//seznam.cz

    Without proxy.

  • http:(proxy://user:[email protected]:443)//seznam.cz

    Proxy setting for http protocol.

  • ftp:(proxy://user:password@proxyserver:1234)//seznam.cz

    Proxy setting for ftp protocol.

  • sftp:(proxy://66.11.122.193:443)//user:password@server/path/file.dat

    Proxy setting for sftp protocol.
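In the proxy examples above, the proxy definition sits in parentheses between the protocol and the // of the target address. The following Python sketch separates the two parts. It is an illustration only; the function name split_proxy_url and the shape of the returned dictionary are assumptions, not a CloverETL API.

```python
import re
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit

# protocol:(direct:)//target  or  protocol:(proxy://...)//target
PROXIED = re.compile(r"^(\w+):\((direct:|proxy://[^)]+)\)//(.+)$")


def split_proxy_url(url: str) -> Tuple[str, Optional[Dict[str, object]]]:
    """Return (target_url, proxy); proxy is None for direct or plain URLs."""
    match = PROXIED.match(url)
    if match is None:
        return url, None  # no proxy part present
    protocol, proxy_spec, rest = match.groups()
    target = f"{protocol}://{rest}"
    if proxy_spec == "direct:":
        return target, None
    parts = urlsplit(proxy_spec)
    proxy = {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "password": parts.password,
    }
    return target, proxy


target, proxy = split_proxy_url("ftp:(proxy://user:password@proxyserver:1234)//seznam.cz")
print(target)  # ftp://seznam.cz
print(proxy)   # {'host': 'proxyserver', 'port': 1234, 'user': 'user', 'password': 'password'}
```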

Reading from Dictionary

  • dict:keyName:discrete[1]

    Reads data from dictionary.

  • dict:keyName:source[1]

    Reads data from dictionary in the same way as the discrete processing type, but expects the dictionary values to be input file URLs. The data from this input is passed to the Reader.

Sandbox Resource as Data Source

A sandbox resource, whether it is in a shared, local or partitioned sandbox, is specified in the graph in the fileURL attribute as a so-called sandbox URL, like this:

sandbox://data/path/to/file/file.dat

where "data" is the sandbox code and "path/to/file/file.dat" is the path to the resource from the sandbox root. The URL is evaluated by CloverETL Server during graph execution and a component (reader or writer) obtains the opened stream from the server. This may be a stream to a local file or to some other remote resource. Thus, a graph does not have to run on the node which has local access to the resource. There may be more sandbox resources used in the graph and each of them may be on a different node. In such cases, CloverETL Server chooses the node with the most local resources to minimize remote streams.
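The two parts of a sandbox URL can be separated with a standard URL parser, as in the following Python sketch. It only illustrates the structure described above; the function name parse_sandbox_url is hypothetical, and the actual evaluation is done by CloverETL Server.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_sandbox_url(url: str) -> Tuple[str, str]:
    """Return (sandbox_code, path_from_sandbox_root)."""
    parts = urlsplit(url)
    if parts.scheme != "sandbox" or not parts.netloc:
        raise ValueError(f"not a sandbox URL: {url!r}")
    # The part after sandbox:// up to the next slash is the sandbox code;
    # the rest is the path relative to the sandbox root.
    return parts.netloc, parts.path.lstrip("/")


print(parse_sandbox_url("sandbox://data/path/to/file/file.dat"))
# ('data', 'path/to/file/file.dat')
```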

The sandbox URL has a specific use for parallel data processing. When the sandbox URL with the resource in a partitioned sandbox is used, that part of the graph/phase runs in parallel, according to the node allocation specified by the list of partitioned sandbox locations. Thus, each worker has its own local sandbox resource. CloverETL Server evaluates the sandbox URL on each worker and provides an open stream to a local resource to the component.

See also

Supported File URL Formats for Writers
URL File Dialog


[1]  The Reader determines the type of the source value from the dictionary and creates a readable channel for the parser. The Reader supports the following source types: InputStream, byte[], ReadableByteChannel, CharSequence, CharSequence[], List<CharSequence>, List<byte[]>, ByteArrayOutputStream.