GT 4.0 GridFTP: Developer's Guide

1. Introduction

This guide contains information of interest to developers working with GridFTP. It provides reference information for application developers, including APIs, architecture, procedures for using the APIs and code samples.

2. Before you begin

2.1. Feature summary

Features new in GT 4.0

  • A new, complete reimplementation of the server.
  • Support for striping.
  • The new implementation makes the server easier to maintain and to extend (new commands, new data sources such as mass storage devices, etc.), and it resolves a licensing issue that had been discovered.

Features that continue to be supported from previous versions

  • GSI security: This is the PKI-based, de facto standard security system used in Grid applications. Kerberos is also possible but is not supported and can be difficult to use because the capabilities of GSI and Kerberos have diverged.
  • Third-party transfers: Very common in Grid applications, this is where a client mediates a transfer between two servers (both likely at remote sites) rather than between the server and itself (called a client/server transfer).
  • Partial file access: Regions of a file may be accessed by specifying an offset into the file and the length of the block desired.
  • Reliability/restart: The receiving server periodically (every 5 seconds by default, but this can be changed) sends “restart markers” to the client. Each marker is a message specifying which bytes have been successfully written to disk. If the transfer fails, the client may restart the transfer and provide these markers (or an aggregated equivalent marker), and the transfer will pick up where it left off. The bytes already written need not be contiguous, so a restart can cover “holes” in the file.
  • Large file support: All file sizes, lengths, and offsets are 64 bits in length.
  • Data channel reuse: The data channel can be held open and reused if the next transfer has the same source, destination, and credentials. This saves the time of connection establishment, authentication, and delegation, which can make a large performance difference when moving many small files.
  • Integrated instrumentation (Performance Markers).
  • Logging/audit trail (Extensive Logging in the server).
  • Parallel transfers (Multiple TCP streams between a pair of hosts).
  • TCP Buffer size control (Protocol supports Manual and Automatic; Only Manual Implemented).
  • Server-side computation (Extended Retrieve (ERET) / Extended Store (ESTO) commands).
  • Based on Standards: RFC 959, RFC 2228, RFC 2389, IETF Draft MLST-16, GGF GFD.020.
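The restart-marker behavior listed above (aggregating markers and resuming around "holes") can be illustrated with a short sketch. This is not GridFTP client code: the function names and the representation of a marker as a half-open `(start, end)` byte range are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple

# Assumption: each restart marker is modeled as a half-open byte range
# [start, end) that the receiver reports as safely written to disk.
Range = Tuple[int, int]


def aggregate_markers(markers: Iterable[Range]) -> List[Range]:
    """Merge overlapping or adjacent byte ranges into a minimal sorted list."""
    merged: List[Range] = []
    for start, end in sorted(markers):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def missing_ranges(received: Iterable[Range], file_size: int) -> List[Range]:
    """Return the byte ranges still to be transferred (the 'holes')."""
    holes: List[Range] = []
    cursor = 0
    for start, end in aggregate_markers(received):
        if start > cursor:
            holes.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < file_size:
        holes.append((cursor, file_size))
    return holes
```

For example, if a 500-byte transfer failed after markers for bytes 0 to 30, 30 to 90, and 200 to 300 were received, `missing_ranges` would report that only bytes 90 to 200 and 300 to 500 still need to be sent.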

Other Supported Features

  • On the client side we provide a scriptable tool called globus-url-copy. This tool can take advantage of all the GridFTP protocol features and can also do protocol translation between FTP, HTTP, HTTPS, and POSIX file IO on the client machine.
  • We also provide a set of development libraries and APIs for developers wishing to add GridFTP functionality to their application.

Deprecated Features

  • None

2.2. Tested platforms

Tested platforms for GridFTP

  • i386 Linux
  • ia64 Linux (TeraGrid)
  • AIX 5.2
  • Solaris 9
  • PA-RISC HP/UX 11.11
  • ia64 HP/UX 11.22
  • Tru64 Unix
  • Mac OS X

While the above list includes platforms on which we have tested GridFTP, it does not imply support for a specific platform. However, we are interested in hearing reports of success or bug reports on any platform.

2.3. Backward compatibility summary

Protocol changes since GT 3.2

  • None

API changes since GT 3.2

  • None

Exception changes since GT 3.2

  • Not Applicable (GridFTP is not Java-based)

Schema changes since GT 3.2

  • Not Applicable (GridFTP is not SOAP-based)

2.4. Technology dependencies

GridFTP depends on the following GT components:

  • Pre-WS Authentication / Authorization
  • C Common Libraries
  • XIO

GridFTP depends on the following 3rd party software:

  • OpenSSL (version included in release)

2.5. Security considerations

The following are points to consider relative to security:

2.5.1. Two ways to configure your server

We now provide two ways to configure your server:

  • The classic installation. This is equivalent to any FTP server you would normally install. It is run as a root setuid process. Once the user is authenticated, the process does a setuid to the appropriate non-privileged user account.
  • A new split process installation. In this configuration, the server consists of two processes:

    • The control channel (the process the external user connects to) runs as a non-privileged user (typically the globus user).
    • The data channel (the process that accesses the file system and moves the data) runs as a root setuid program as before but is only contacted by the control channel process from a local machine. This means an external user is never connected to a process running as root, which minimizes the impact of an exploit. This does, however, require that a copy of the host cert and host key be owned by the non-privileged user. If you use this configuration, the non-privileged user should not have write permission to executables, configuration files, etc.

2.5.2. New authentication options

There are new authentication options available for the server in GT4.0.0:

  • Anonymous: The server now supports anonymous access. In order for this to work, a configuration switch must explicitly enable it, a list of acceptable usernames must be defined, and an account under which the anonymous user should run must be defined. If the necessary configurations are in place, and the client presents a username that is in the list of acceptable anonymous users, then the session will be accepted and the process will setuid to the anonymous user account. We do not support chroot in this version of the server.
  • Username / Password: This is standard FTP authentication. It uses a separate password file, used only by the GridFTP server, *NOT* the system password file.
Warning

WE HIGHLY RECOMMEND YOU NOT USE THIS. YOU WILL BE SENDING YOUR PASSWORD IN CLEAR TEXT OVER THE NETWORK.

We do, however, have some user communities who run only on internal networks for testing purposes and who do not wish to deal with obtaining GSI credentials. If you are considering this, we recommend that you look at Simple CA and set up your own testbed CA. This can be done in less than an hour and provides you with full GSI security.
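The three conditions for anonymous access described above (an explicit enabling switch, a list of acceptable usernames, and a designated account) can be summarized as a small decision function. This is only a sketch of the logic: the class, field, and function names are hypothetical and do not correspond to actual GridFTP server options.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AnonymousConfig:
    # All field names are hypothetical; they are not real server options.
    enabled: bool = False
    allowed_names: FrozenSet[str] = field(default_factory=frozenset)
    run_as_account: Optional[str] = None


def anonymous_account(config: AnonymousConfig, username: str) -> Optional[str]:
    """Return the local account to run as, or None if the session is refused."""
    if not config.enabled:
        return None  # anonymous access must be switched on explicitly
    if not config.run_as_account:
        return None  # no account defined for anonymous sessions
    if username not in config.allowed_names:
        return None  # username is not an accepted anonymous name
    return config.run_as_account
```

With a configuration that enables anonymous access for the names `anonymous` and `ftp` under an account such as `nobody`, a client presenting `ftp` would be mapped to `nobody`, while any other username would be refused by this check.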

2.5.3. Firewall requirements

If the GridFTP server is behind a firewall:

  1. Contact your network administrator to open up port 2811 (for GridFTP control channel connection) and a range of ports (for GridFTP data channel connections) for the incoming connections. If the firewall blocks the outgoing connections, open up a range of ports for outgoing connections as well.

  2. Set the environment variable GLOBUS_TCP_PORT_RANGE:

    export GLOBUS_TCP_PORT_RANGE=min,max 

    where min,max specify the port range that you have opened for the incoming connections on the firewall. This restricts the listening ports of the GridFTP server to this range. A range of about 1000 ports is recommended (e.g., 50000-51000), but the right size depends on how much use you expect.

  3. If you have a firewall blocking the outgoing connections and you have opened a range of ports, set the environment variable GLOBUS_TCP_SOURCE_RANGE:

    export GLOBUS_TCP_SOURCE_RANGE=min,max 

    where min,max specify the port range that you have opened for the outgoing connections on the firewall. This restricts the outbound ports of the GridFTP server to this range. Recommended range is twice the range used for GLOBUS_TCP_PORT_RANGE, because if parallel TCP streams are used for transfers, the listening port would remain the same for each connection but the connecting port would be different for each connection.

Note

If the server is behind NAT, the --data-interface <real ip/hostname> option needs to be used on the server.

If the GridFTP client is behind a firewall:

  1. Contact your network administrator to open up a range of ports (for GridFTP data channel connections) for the incoming connections. If the firewall blocks the outgoing connections, open up a range of ports for outgoing connections as well.

  2. Set the environment variable GLOBUS_TCP_PORT_RANGE:

    export GLOBUS_TCP_PORT_RANGE=min,max 

    where min,max specify the port range that you have opened for the incoming connections on the firewall. This restricts the listening ports of the GridFTP client to this range. A range of about 1000 ports is recommended (e.g., 50000-51000), but the right size depends on how much use you expect.

  3. If you have a firewall blocking the outgoing connections and you have opened a range of ports, set the environment variable GLOBUS_TCP_SOURCE_RANGE:

    export GLOBUS_TCP_SOURCE_RANGE=min,max 

    where min,max specify the port range that you have opened for the outgoing connections on the firewall. This restricts the outbound ports of the GridFTP client to this range. Recommended range is twice the range used for GLOBUS_TCP_PORT_RANGE, because if parallel TCP streams are used for transfers, the listening port would remain the same for each connection but the connecting port would be different for each connection.
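The sizing guidance above can be checked with a short script before the environment variables are exported. This is a sketch under stated assumptions: the helper names are hypothetical, and only the `min,max` value format and the "twice the listening range" recommendation come from the text.

```python
import os
from typing import Mapping, Optional, Tuple

PortRange = Tuple[int, int]


def parse_port_range(value: str) -> PortRange:
    """Parse a 'min,max' string in the format used by GLOBUS_TCP_PORT_RANGE."""
    low_text, high_text = value.split(",")
    low, high = int(low_text), int(high_text)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return low, high


def suggested_source_range(listen: PortRange, start: int) -> PortRange:
    """Suggest an outbound range twice the size of the listening range."""
    size = listen[1] - listen[0] + 1
    end = start + 2 * size - 1
    if end > 65535:
        raise ValueError("source range would exceed port 65535")
    return start, end


def configured_range(name: str,
                     env: Optional[Mapping[str, str]] = None) -> Optional[PortRange]:
    """Read a range from the environment; return None if the variable is unset."""
    env = os.environ if env is None else env
    value = env.get(name)
    return parse_port_range(value) if value else None
```

For a listening range of 50000,51000 (1001 ports), `suggested_source_range((50000, 51000), 52000)` yields 52000 through 54001, which could then be exported as GLOBUS_TCP_SOURCE_RANGE=52000,54001 if the firewall has been opened accordingly.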

Additional information on Globus Toolkit Firewall Requirements is available here.

3. Architecture and design overview

GridFTP represents a service that a host is providing. Therefore, the service must be listening on a port, waiting for a client to request access to that service. This is generally handled in one of two ways:

  • Either an application daemon is running listening for connections, or
  • inetd/xinetd is used.

3.1. GridFTP Listening

The following list describes the steps between the service listening for a connection and an exchange of data taking place:

  1. These services (application daemon or inetd/xinetd) listen for connections.
  2. When a connection is received on a “well known” port such as 2811 for GridFTP, inetd does a fork/exec to start up a GridFTP server process and then does a Switch User (SU) so that the server is running in a user account rather than as root for security reasons. At this point, the client has established a control channel to the server.
  3. The client will then send a series of commands to configure or describe the transfer that it wants to take place.
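The listening step above, in its daemon form, can be sketched with plain sockets. This is not the GridFTP server: it performs no authentication, no fork/exec, and no user switching. It only shows a listener accepting a control connection and answering with standard FTP reply codes from RFC 959 (220, 221, 500). All function names are hypothetical, and the sketch binds to an OS-chosen port on localhost rather than 2811.

```python
import socket
import threading


def start_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind and listen; port 0 lets the OS pick a free port for experiments."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(5)
    return listener


def serve_once(listener: socket.socket,
               banner: bytes = b"220 Sketch server ready.\r\n") -> None:
    """Accept one control connection, greet it, and answer a single command."""
    conn, _peer = listener.accept()
    with conn, conn.makefile("rb") as reader:
        conn.sendall(banner)
        # A real server would authenticate the client here and then process
        # commands in a loop; this sketch handles exactly one line.
        line = reader.readline()
        if line.strip().upper() == b"QUIT":
            conn.sendall(b"221 Goodbye.\r\n")
        else:
            conn.sendall(b"500 Command not understood.\r\n")


def serve_in_background(listener: socket.socket) -> threading.Thread:
    """Run serve_once in a daemon thread so a client can connect from the same process."""
    worker = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    worker.start()
    return worker
```

A client connecting to the address returned by `listener.getsockname()` would first read the 220 greeting; at that point the control channel exists and the command exchange can begin.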

3.2. GridFTP Transfer

There are basically four important components of the exchange:

  1. The first is security. You must authenticate, and for GridFTP, you must establish encryption on the control channel. The control channel is encrypted by default, though it can be switched off (see the security section for more detail).
  2. The second is setup and informational exchanges. The client may specify the type of the file (binary or ASCII) and the MODE of the transfer, request the size of a file before transferring it, and so on.
  3. Third, the information and negotiation for the data channel must be done. How this is handled depends on whether you are doing a client/server transfer or a third-party transfer.
  4. Finally, the client sends a store (STOR), retrieve (RETR), extended store (ESTO), or extended retrieve (ERET) command to indicate the direction of the transfer and to start the data moving.
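For the third-party case, the data channel negotiation in step 3 can be sketched using the passive/active exchange from RFC 959: one server is asked to listen (PASV), and the address in its 227 reply is handed to the other server in a PORT command. The sketch below is illustrative only. It omits the security and setup steps, the function names and file paths are hypothetical, and it assumes the destination is the listening side.

```python
import re
from typing import List, Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an RFC 959 '227' passive-mode reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not reply.startswith("227") or match is None:
        raise ValueError(f"not a valid 227 reply: {reply!r}")
    h1, h2, h3, h4, p1, p2 = (int(group) for group in match.groups())
    # The port is sent as two bytes: high byte first, then low byte.
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2


def port_command(host: str, port: int) -> str:
    """Build the RFC 959 PORT command for the given address."""
    return "PORT {},{},{}".format(host.replace(".", ","), port // 256, port % 256)


def third_party_plan(pasv_reply: str, src_path: str,
                     dst_path: str) -> List[Tuple[str, str]]:
    """List the (server, command) pairs sent after the destination's PASV reply."""
    host, port = parse_pasv_reply(pasv_reply)
    return [
        ("source", port_command(host, port)),   # tell the sender where to connect
        ("destination", f"STOR {dst_path}"),    # receiver gets ready to store
        ("source", f"RETR {src_path}"),         # sender starts moving data
    ]
```

Given a destination reply of `227 Entering Passive Mode (192,168,1,10,195,80)`, the plan would send `PORT 192,168,1,10,195,80` to the source, since 195 * 256 + 80 is port 50000. In a client/server transfer the client itself would take one of the two roles instead of relaying the address between servers.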

3.3. GridFTP Server

Architecturally, the Globus GridFTP server can be divided into 3 modules:

  • the GridFTP protocol module,
  • the (optional) data transform module, and
  • the Data Storage Interface (DSI).

In the GT4.0.x implementation, the data transform module and the DSI have been merged, although we plan to have separate, chainable, data transform modules in the future.

Note

This architecture does NOT apply to the WU-FTPD implementation (GT3.2.1 and lower).

3.3.1. GridFTP Protocol Module

The GridFTP protocol module reads from and writes to the network and implements the GridFTP protocol. This module should not need to be modified; doing so would make the server non-compliant with the protocol and unable to communicate with other servers.

3.3.2. Data Transform Functionality

The data transform functionality is invoked by using the ERET (extended retrieve) and ESTO (extended store) commands. It is seldom used and deserves careful consideration before it is implemented, but in the right circumstances it can be very useful. In theory, any computation could be invoked this way, but it was primarily intended for cases where some simple pre-processing (such as a partial get or sub-sampling) can greatly reduce the network load. The disadvantage is that you remove any real option for planning, brokering, etc., and any significant computation could adversely affect the data transfer performance. Note that the client must also support the ESTO/ERET functionality.
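To make the two pre-processing examples concrete, the sketch below shows what a partial get and a simple sub-sampling step compute on the server side before any bytes reach the network. It is a local illustration only: the function names are hypothetical, the fixed-size record layout is an assumption, and nothing here reflects how the server actually implements ERET or ESTO.

```python
import io
from typing import BinaryIO


def partial_get(source: BinaryIO, offset: int, length: int) -> bytes:
    """Return up to `length` bytes of `source` starting at byte `offset`."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    source.seek(offset)
    return source.read(length)


def subsample(data: bytes, record_size: int, step: int) -> bytes:
    """Keep every `step`-th fixed-size record of `data` (assumed record layout)."""
    if record_size <= 0 or step <= 0:
        raise ValueError("record_size and step must be positive")
    records = [data[i:i + record_size] for i in range(0, len(data), record_size)]
    return b"".join(records[::step])


if __name__ == "__main__":
    stored = io.BytesIO(b"0123456789abcdef")
    print(partial_get(stored, 4, 6))          # b'456789'
    print(subsample(b"AAbbCCddEE", 2, 2))     # b'AACCEE'
```

In both cases fewer bytes leave the server than are stored, which is the network saving the text describes. The cost is that this work runs on the server during the transfer.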

3.3.3. Data Storage Interface (DSI) / Data Transform module

The Data Storage Interface (DSI) / Data Transform module knows how to read and write to the “local” storage system and can optionally transform the data. We put local in quotes because in a complicated storage system, the storage may not be directly attached, but for performance reasons, it should be relatively close (for instance on the same LAN).

The interface consists of functions to be implemented, such as send (get), receive (put), and command (simple commands that just succeed or fail, like mkdir), among others.

Once these functions have been implemented for a specific storage system, a client should not need to know or care what is actually providing the data. The server can be configured with a specific DSI, in which case it knows how to interact with a single class of storage system, or it can load and configure a DSI on the fly, which is one particularly useful application of the ESTO/ERET functionality mentioned above.
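The idea of a storage interface with send, receive, and command functions can be sketched as follows. This is a Python analogy, not the DSI API itself: the real interface is implemented in C, and every class, method, and registry name below is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Set


class StorageInterface(ABC):
    """Hypothetical analogy of a DSI: what the server needs from a storage system."""

    @abstractmethod
    def send(self, path: str) -> bytes:
        """Read data from storage so the server can send it (a 'get')."""

    @abstractmethod
    def recv(self, path: str, data: bytes) -> None:
        """Write data received by the server into storage (a 'put')."""

    @abstractmethod
    def command(self, name: str, path: str) -> bool:
        """Run a simple command such as mkdir; return True on success."""


class InMemoryStorage(StorageInterface):
    """Toy backend that keeps files in a dictionary instead of on disk."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.dirs: Set[str] = {"/"}

    def send(self, path: str) -> bytes:
        return self.files[path]

    def recv(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def command(self, name: str, path: str) -> bool:
        if name == "mkdir" and path not in self.dirs:
            self.dirs.add(path)
            return True
        return False  # unknown command, or directory already exists


def load_dsi(name: str) -> StorageInterface:
    """Pick a backend by name, loosely mirroring loading a DSI on the fly."""
    registry = {"memory": InMemoryStorage}
    return registry[name]()
```

Code that calls `load_dsi("memory")` and then uses only `send`, `recv`, and `command` does not depend on which backend is behind the interface, which is the property that lets a client ignore what is actually providing the data.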

3.3.4. Latest information about HPSS

Last Update: August 2005

Working with Los Alamos National Laboratory and the High Performance Storage System (HPSS) collaboration (http://www.hpss-collaboration.org), we have written a Data Storage Interface (DSI) for read/write access to HPSS. This DSI allows an existing application that uses a GridFTP-compliant client to use HPSS data resources.

This DSI is currently in testing. Due to changes in the HPSS security mechanisms, it requires HPSS 6.2 or later, which is due to be released in Q4 2005. Distribution for the DSI has not been worked out yet, but it will *probably* be available from both Globus and the HPSS collaboration. While this code will be open source, it requires underlying HPSS libraries which are NOT open source (proprietary).

Note

This is purely a server-side change: the client does not know which DSI is running. Only a site that is already running HPSS and wants to allow GridFTP access needs to worry about access to these proprietary libraries.

3.3.5. Latest information about SRB

Last Update: August 2005

Working with the SRB team at the San Diego Supercomputer Center, we have written a Data Storage Interface (DSI) for read/write access to data in the Storage Resource Broker (SRB) (http://www.npaci.edu/DICE/SRB). This DSI will enable GridFTP-compliant clients to read and write data to an SRB server, similar in functionality to the sput/sget commands.

This DSI is currently in testing and is not yet publicly available, but will be available from both the SRB web site (here) and the Globus web site (here). It will also be included in the next stable release of the toolkit. We are working on performance tests, but early results indicate that for wide area network (WAN) transfers, the performance is comparable.

You might want to use this functionality when:

  • You have existing tools that use GridFTP clients and you want to access data that is in SRB.
  • You have distributed data sets that have some of the data in SRB and some of the data available from GridFTP servers.

4. Public interface

The semantics and syntax of the APIs and WSDL for the component, along with descriptions of domain-specific structured interface data, can be found in GT 4.0 Component Guide to Public Interfaces: GridFTP.

5. Usage scenarios

There is no content available at this time.

6. Tutorials

There is no content available at this time.

7. Debugging

There is no content available at this time.

8. Troubleshooting

There is no content available at this time.