GT 4.0 GridFTP Glossary

C

client

FTP is a command/response protocol. The defining characteristic of a client is that it is the process sending the commands and receiving the responses. It may or may not take part in the actual movement of data.

See Also client/server transfer, third party transfers.

client/server transfer

In a client/server transfer, there are only two entities involved in the transfer, the client entity and the server entity. We use the term entity here rather than process because in the implementation provided in GT4, the server entity may actually run as two or more separate processes.

The client will either move data from or to his local host. The client will decide whether or not he wishes to connect to the server to establish the data channel or the server should connect to him (MODE E dictates who must connect).

If the client wishes to connect to the server, he will send the PASV (passive) command. The server will start listening on an ephemeral (random, non-privileged) port and will return the IP and port as a response to the command. The client will then connect to that IP/Port.

If the client wishes to have the server connect to him, the client would start listening on an ephemeral port, and would then send the PORT command which includes the IP/Port as part of the command to the server and the server would initiate the TCP connect. Note that this decision has an impact on traversing firewalls. For instance, the client's host may be behind a firewall and the server may not be able to connect.

Finally, now that the data channel is established, the client will send either the RETR “filename” command to transfer a file from the server to the client (GET), or the STOR “filename” command to transfer a file from the client to the server (PUT).

See Also extended block mode (MODE E).

command/response

Both FTP and GridFTP are command/response protocols. What this means is that once a client sends a command to the server, it can only accept responses from the server until it receives a response indicating that the server is finished with that command. For most commands this is not a big deal. For instance, setting the type of the file transfer to binary (called "I" for "image in the protocol"), simply consists of the client sending TYPE I and the server responding with 220 OK. Type set to I. However, the SEND and RETR commands (which actually initiate the movement of data) can run for a long time. Once the command is sent, the client’s only options are to wait until it receives the completion reply, or kill the transfer.

concurrency

When speaking of GridFTP transfers, concurrency refers to having multiple files in transit at the same time. They may all be on the same host or across multiple hosts. This is equivalent to starting up “n” different clients for “n” different files, and having them all running at the same time. This can be effective if you have many small files to move. The Reliable File Transfer (RFT) service utilizes concurrency to improve its performance.

D

dual channel protocol

GridFTP uses two channels:

  • One of the channels, called the control channel, is used for sending commands and responses. It is low bandwidth and is encrypted for security reasons.
  • The second channel is known as the data channel. Its sole purpose is to transfer the data. It is high bandwidth and uses an efficient protocol.

By default, the data channel is authenticated at connection time, but no integrity checking or encryption is performed due to performance reasons. Integrity checking and encryption are both available via the client and libraries.

Note that in GridFTP (not FTP) the data channel may actually consist of several TCP streams from multiple hosts.

See Also parallelism, striping.

E

extended block mode (MODE E)

MODE E is a critical GridFTP components because it allows for out of order reception of data. This in turn, means we can send the data down multiple paths and do not need to worry if one of the paths is slower than the others and the data arrives out of order. This enables parallelism and striping within GridFTP. In MODE E, a series of “blocks” are sent over the data channel. Each block consists of:

  • an 8 bit flag field,
  • a 64 bit field indicating the offset in the transfer,
  • and a 64 bit field indicating the length of the payload,
  • followed by length bytes of payload.

Note that since the offset and length are included in the block, out of order reception is possible, as long as the receiving side can handle it, either via something like a seek on a file, or via some application level buffering and ordering logic that will wait for the out of order blocks. [TODO: LINK TO GRAPHIC]

See Also parallelism, striping.

I

improved extended block mode (MODE X)

This protocol is still under development. It is intended to address a number of the deficiencies found in MODE E. For instance, it will have explicit negotiation for use of a data channel, thus removing the race condition and the requirement for the sender to be the connector. This will help with firewall traversal. A method will be added to allow the filename to be provided prior to the data channel connection being established to help large data farms better allocate resources. Other additions under consideration include block checksumming, resends of blocks that fail checksums, and inclusion of a transfer ID to allow pipelining and de-multiplexing of commands.

See Also extended block mode (MODE E).

M

MODE command

In reality, GridFTP is not one protocol, but a collection of several protocols. There is a protocol used on the control channel, but there is a range of protocols available for use on the data channel. Which protocol is used is selected by the MODE command. Four modes are defined: STREAM (S), BLOCK (B), COMPRESSED (C) in RFC 959 for FTP, and EXTENDED BLOCK (E) in GFD.020 for GridFTP. There is also a new data channel protocol, or mode, being defined in the GGF GridFTP Working group which, for lack of a better name at this point, is called MODE X.

See Also extended block mode (MODE E), improved extended block mode (MODE X), stream mode (MODE S).

N

network end points

A network endpoint is generally something that has an IP address (a network interface card). It is a point of access to the network for transmission or reception of data. Note that a single host could have multiple network end points if it haS multiple NICs installed (multi-homed). This definition is necessary to differentiate between parallelism and striping.

See Also parallelism, striping.

P

parallelism

When speaking about GridFTP transfers, parallelism refers to having multiple TCP connections between a single pair of network endpoints. This is used to improve performance of transfers on connections with light to moderate packet loss.

S

server

The compliment to the client is the server. Its defining characteristic is that it receives commands and sends responses to those commands. Since it is a server or service, and it receives commands, it must be listening on a port somewhere to receive the commands. Both FTP and GridFTP have IANA registered ports. For FTP it is port 21, for GridFTP it is port 2811. This is normally handled via inetd or xinetd on Unix variants. However, it is also possible to implement a daemon that listens on the specified port. This is described more fully in http://www.globus.org/toolkit/docs/4.0/data/gridftp/developer-index.htmls-gridftp-developer-archdes in the GridFTP Developer's Guide.

See Also client.

stream mode (MODE S)

The only mode normally implemented for FTP is MODE S. This is simply sending each byte, one after another over the socket in order, with no application level framing of any kind. This is the default and is what a standard FTP server will use. This is also the default for GridFTP.

striping

When speaking about GridFTP transfers, striping refers to having multiple network endpoints at the source, destination, or both participating in the transfer of the same file. This is normally accomplished by having a cluster with a parallel shared file system. Each node in the cluster reads a section of the file and sends it over the network. This mode of transfer is necessary if you wish to transfer a single file faster than a single host is capable of. This also tends to only be effective for large files, though how large depends on how many hosts and how fast the end-to-end transfer is. Note that while it is theoretically possible to use NFS for the shared file system, your performance will be poor, and would make using striping pointless.

T

third party transfers

In the simplest terms, a third party transfer moves a file between two GridFTP servers.

The following is a more detailed, programmatic description.

In a third party transfer, there are three entities involved. The client, who will only orchestrate, but not actually take place in the data transfer, and two servers one of which will be sending data to the other. This scenario is common in Grid applications where you may wish to stage data from a data store somewhere to a supercomputer you have reserved. The commands are quite similar to the client/server transfer. However, now the client must establish two control channels, one to each server. He will then choose one to listen, and send it the PASV command. When it responds with the IP/port it is listening on, the client will send that IP/port as part of the PORT command to the other server. This will cause the second server to connect to the first server, rather than the client. To initiate the actual movement of the data, the client then sends the RETR “filename” command to the server that will read from disk and write to the network (the “sending” server) and will send the STOR “filename” command to the other server which will read from the network and write to the disk (the “receiving” server).

See Also client/server transfer.