GT 4.0 GridFTP : User's Guide

1. Introduction

The GridFTP User's Guide provides general, end-user-oriented information about using GridFTP.

2. Usage scenarios

2.1. Basic procedure for using GridFTP (globus-url-copy)

If you just want the "rules of thumb" on getting started (without all the details), the following options using globus-url-copy will normally give acceptable performance:

globus-url-copy -vb -tcp-bs 2097152 -p 4 source_url destination_url

The source/destination URLs will normally be one of the following:

  • file:///path/to/my/file if you are accessing a file on a file system accessible by the host on which you are running your client.
  • gsiftp://hostname/path/to/remote/file if you are accessing a file from a GridFTP server.
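As a hedged illustration of the rule-of-thumb command above, the following shell sketch only prints the command line for a given source and destination; it does not run a transfer. The function name guc_cmd and the TCP_BS and STREAMS environment variables are hypothetical names introduced for this example. Only the globus-url-copy options come from this guide.

```shell
#!/bin/sh
# guc_cmd SRC DST: print (do not run) the rule-of-thumb command line.
# guc_cmd, TCP_BS and STREAMS are illustrative names, not part of GridFTP.
guc_cmd() {
    src=$1
    dst=$2
    tcp_bs=${TCP_BS:-$((2 * 1024 * 1024))}   # 2 MiB = 2097152 bytes
    streams=${STREAMS:-4}                     # value passed to -p
    printf 'globus-url-copy -vb -tcp-bs %s -p %s %s %s\n' \
        "$tcp_bs" "$streams" "$src" "$dst"
}

guc_cmd file:///tmp/foo gsiftp://remote.machine.my.edu/tmp/bar
```

Printing the command first makes it easy to check the URLs before pasting the line into a shell where globus-url-copy is installed.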

2.1.1. Putting files

One of the most basic tasks in GridFTP is to "put" a file, i.e., to move a file from your file system to the server. For example, to move the file /tmp/foo from a file system accessible to the host on which you are running your client to a file named /tmp/bar on a host named remote.machine.my.edu running a GridFTP server, you would use this command:

globus-url-copy -vb -tcp-bs 2097152 -p 4 file:///tmp/foo gsiftp://remote.machine.my.edu/tmp/bar

Note: In theory, remote.machine.my.edu could be the same host as the one on which you are running your client, but that is normally only done in testing situations.

2.1.2. Getting files

A get, i.e., moving a file from a server to your file system, simply reverses the source and destination URLs:

globus-url-copy -vb -tcp-bs 2097152 -p 4 gsiftp://remote.machine.my.edu/tmp/bar file:///tmp/foo

Tip: Remember that file: always refers to your file system.

2.1.3. Third party transfers

Finally, if you want to move a file between two GridFTP servers (a third party transfer), both URLs would use gsiftp: as the protocol:

globus-url-copy -vb -tcp-bs 2097152 -p 4 gsiftp://other.machine.my.edu/tmp/foo gsiftp://remote.machine.my.edu/tmp/bar
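The three cases above (put, get, third party transfer) differ only in which URL scheme appears on each side. A minimal sketch of that rule, assuming only the file: and gsiftp: schemes discussed in this section; the helper names scheme_of and transfer_kind are hypothetical:

```shell
#!/bin/sh
# scheme_of URL: print the part of the URL before the first ':'.
scheme_of() {
    printf '%s\n' "${1%%:*}"
}

# transfer_kind SRC DST: classify a transfer by the schemes of its two URLs.
transfer_kind() {
    case "$(scheme_of "$1")-$(scheme_of "$2")" in
        file-gsiftp)   echo put ;;          # local file to server
        gsiftp-file)   echo get ;;          # server to local file
        gsiftp-gsiftp) echo third-party ;;  # server to server
        *)             echo other ;;        # anything not covered here
    esac
}

transfer_kind file:///tmp/foo gsiftp://remote.machine.my.edu/tmp/bar
```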

2.1.4. For more information

If you want more information and details on URLs and the command line options, the Key Concepts Guide gives basic definitions and an overview of the GridFTP protocol as well as our implementation of it.

2.2. Accessing data in...

2.2.1. Accessing data in a non-POSIX file data source that has a POSIX interface

If you want to access data in a non-POSIX file data source that has a POSIX interface, the standard server will work. Just make sure the interface really behaves like POSIX (for example, that it handles out-of-order writes, contiguous byte writes, etc.).

2.2.2. Accessing data in HPSS

The following information is helpful if you want to use GridFTP to access data in HPSS.

Architecturally, the Globus GridFTP server can be divided into 3 modules:

  • the GridFTP protocol module,
  • the (optional) data transform module, and
  • the Data Storage Interface (DSI).

In the GT4.0.x implementation, the data transform module and the DSI have been merged, although we plan to have separate, chainable, data transform modules in the future.

Note: This architecture does NOT apply to the WU-FTPD implementation (GT3.2.1 and lower).

2.2.2.1. GridFTP Protocol Module

The GridFTP protocol module is the module that reads from and writes to the network and implements the GridFTP protocol. This module should not need to be modified, since doing so would make the server non-compliant with the protocol and unable to communicate with other servers.

2.2.2.2. Data Transform Functionality

The data transform functionality is invoked by using the ERET (extended retrieve) and ESTO (extended store) commands. It is seldom used and bears careful consideration before it is implemented, but in the right circumstances it can be very useful. In theory, any computation could be invoked this way, but it was primarily intended for cases where some simple pre-processing (such as a partial get or sub-sampling) can greatly reduce the network load. The disadvantage is that you remove any real option for planning, brokering, etc., and any significant computation could adversely affect the data transfer performance. Note that the client must also support the ESTO/ERET functionality.

2.2.2.3. Data Storage Interface (DSI) / Data Transform module

The Data Storage Interface (DSI) / Data Transform module knows how to read and write to the “local” storage system and can optionally transform the data. We put local in quotes because in a complicated storage system, the storage may not be directly attached, but for performance reasons, it should be relatively close (for instance on the same LAN).

The interface consists of functions to be implemented, such as send (get), receive (put), command (simple commands that simply succeed or fail, like mkdir), etc.

Once these functions have been implemented for a specific storage system, a client should not need to know or care what is actually providing the data. The server can be configured with one specific DSI, in which case it knows how to interact with a single class of storage system. Alternatively, a DSI can be loaded and configured on the fly, which is one particularly useful application of the ESTO/ERET functionality mentioned above.
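The division of labor described above can be pictured with a toy model. This is only a conceptual sketch in shell, not the server's real interface, which lives inside the server and is not shown in this guide. Every name here (dsi_call, dsi_posix_send, dsi_posix_recv, dsi_posix_command) is hypothetical. It shows one back end implementing send, receive and command, and a dispatcher that picks the back end by name at call time, so the caller never needs to know what actually stores the data.

```shell
#!/bin/sh
# Toy model only: all function names below are invented for illustration.

# A "posix" back end implementing the three kinds of operations.
dsi_posix_send() { cat "$1"; }       # get: stream stored data out
dsi_posix_recv() { cat > "$1"; }     # put: store incoming data
dsi_posix_command() {                # simple succeed-or-fail commands
    case "$1" in
        mkdir) mkdir "$2" ;;
        *)     return 1 ;;           # anything else is unsupported here
    esac
}

# dsi_call BACKEND OP ARGS...: dispatch to the back end chosen at call time.
dsi_call() {
    backend=$1
    op=$2
    shift 2
    "dsi_${backend}_${op}" "$@"
}
```

For example, dsi_call posix command mkdir /tmp/example would create a directory through the posix back end; adding another back end means defining three more functions with a different prefix.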

2.2.2.4. HPSS info

Last Update: August 2005

Working with Los Alamos National Laboratory and the High Performance Storage System (HPSS) collaboration (http://www.hpss-collaboration.org), we have written a Data Storage Interface (DSI) for read/write access to HPSS. This DSI would allow an existing application that uses a GridFTP-compliant client to utilize HPSS data resources.

This DSI is currently in testing. Due to changes in the HPSS security mechanisms, it requires HPSS 6.2 or later, which is due to be released in Q4 2005. Distribution for the DSI has not been worked out yet, but it will *probably* be available from both Globus and the HPSS collaboration. While this code will be open source, it requires underlying HPSS libraries which are NOT open source (proprietary).

Note: This is purely a server-side change. The client does not know which DSI is running, so only a site that is already running HPSS and wants to allow GridFTP access needs to worry about access to these proprietary libraries.

2.2.3. Accessing data in SRB

The following information is helpful if you want to use GridFTP to access data in SRB.

2.2.3.4. SRB info

Last Update: August 2005

Working with the SRB team at the San Diego Supercomputer Center, we have written a Data Storage Interface (DSI) for read/write access to data in the Storage Resource Broker (SRB) (http://www.npaci.edu/DICE/SRB). This DSI will enable GridFTP-compliant clients to read and write data to an SRB server, similar in functionality to the sput/sget commands.

This DSI is currently in testing and is not yet publicly available, but it will be available from both the SRB web site and the Globus web site. It will also be included in the next stable release of the toolkit. We are working on performance tests, but early results indicate that for wide area network (WAN) transfers, the performance is comparable.

You might want to use this functionality when:

  • You have existing tools that use GridFTP clients and you want to access data that is in SRB.
  • You have distributed data sets that have some of the data in SRB and some of the data available from GridFTP servers.

2.2.4. Accessing data in some other non-POSIX data source

To use GridFTP to access data in some other non-POSIX data source, a Data Storage Interface (DSI) has to be implemented for that storage system. The server architecture that makes this possible (the GridFTP protocol module, the data transform functionality, and the DSI) is the same as described for HPSS in Sections 2.2.2.1 through 2.2.2.3.

3. Command line tools

Please see the GridFTP Command Reference.

4. Graphical user interfaces

Globus does not provide any interactive client for GridFTP, either GUI or text-based. However, NCSA, as part of their TeraGrid activity, produces a text-based interactive client called UberFTP, which you may want to check out. See Interactive Clients for more information.

5. Security Considerations

The following are points to consider relative to security:

5.1. Two ways to configure your server

We now provide two ways to configure your server:

  • The classic installation. This is equivalent to any FTP server you would normally install. It is run as a root setuid process. Once the user is authenticated, the process does a setuid to the appropriate non-privileged user account.
  • A new split process installation. In this configuration, the server consists of two processes:

    • The control channel (the process the external user connects to) runs as a non-privileged user (typically the globus user).
    • The data channel (the process that accesses the file system and moves the data) runs as a root setuid program as before, but is only contacted by the control channel process from a local machine. This means an external user is never connected to a process running as root, which minimizes the impact of an exploit. This does, however, require that a copy of the host cert and host key be owned by the non-privileged user. If you use this configuration, the non-privileged user should not have write permission to executables, configuration files, etc.

5.2. New authentication options

There are new authentication options available for the server in GT4.0.0:

  • Anonymous: The server now supports anonymous access. In order for this to work, a configuration switch must explicitly enable it, a list of acceptable usernames must be defined, and an account under which the anonymous user should run must be defined. If the necessary configurations are in place, and the client presents a username that is in the list of acceptable anonymous users, then the session will be accepted and the process will setuid to the anonymous user account. We do not support chroot in this version of the server.
  • Username / Password: This is standard FTP authentication. It uses a separate password file, used only by the GridFTP server, *NOT* the system password file.

Warning: WE HIGHLY RECOMMEND YOU NOT USE THIS. YOU WILL BE SENDING YOUR PASSWORD IN CLEAR TEXT OVER THE NETWORK.

We do, however, have some user communities who run only on internal networks for testing purposes and who do not wish to deal with obtaining GSI credentials. If you are considering this, we would recommend that you look at Simple CA and set up your own testbed CA. This can be done in less than an hour and then provides you full GSI security.

5.3. Firewall requirements

If the GridFTP server is behind a firewall:

  1. Contact your network administrator to open up port 2811 (for GridFTP control channel connection) and a range of ports (for GridFTP data channel connections) for the incoming connections. If the firewall blocks the outgoing connections, open up a range of ports for outgoing connections as well.

  2. Set the environment variable GLOBUS_TCP_PORT_RANGE:

    export GLOBUS_TCP_PORT_RANGE=min,max 

    where min,max specify the port range that you have opened for the incoming connections on the firewall. This restricts the listening ports of the GridFTP server to this range. A range of about 1000 ports is recommended (e.g., 50000,51000), but the right size really depends on how much use you expect.

  3. If you have a firewall blocking the outgoing connections and you have opened a range of ports, set the environment variable GLOBUS_TCP_SOURCE_RANGE:

    export GLOBUS_TCP_SOURCE_RANGE=min,max 

    where min,max specify the port range that you have opened for the outgoing connections on the firewall. This restricts the outbound ports of the GridFTP server to this range. Recommended range is twice the range used for GLOBUS_TCP_PORT_RANGE, because if parallel TCP streams are used for transfers, the listening port would remain the same for each connection but the connecting port would be different for each connection.
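The sizing rule in the steps above (about 1000 listening ports, and a source range twice that size) can be worked through with a small helper. This is a sketch under stated assumptions: the function name port_range_exports is hypothetical, and the starting port 52000 for the source range is just an example value, not a recommendation from this guide.

```shell
#!/bin/sh
# port_range_exports LISTEN_MIN COUNT [SOURCE_MIN]
# Print export lines for a listening range of COUNT ports starting at
# LISTEN_MIN and, if SOURCE_MIN is given, a source range twice as large.
port_range_exports() {
    listen_min=$1
    count=$2
    source_min=${3:-}
    printf 'export GLOBUS_TCP_PORT_RANGE=%s,%s\n' \
        "$listen_min" "$((listen_min + count))"
    if [ -n "$source_min" ]; then
        printf 'export GLOBUS_TCP_SOURCE_RANGE=%s,%s\n' \
            "$source_min" "$((source_min + 2 * count))"
    fi
}

# Example: 50000,51000 for listening; 52000 is an assumed starting port.
port_range_exports 50000 1000 52000
```

The printed lines can be added to the environment of the process once the same ranges have been opened on the firewall.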

Note: If the server is behind NAT, the --data-interface <real ip/hostname> option needs to be used on the server.

If the GridFTP client is behind a firewall:

  1. Contact your network administrator to open up a range of ports (for GridFTP data channel connections) for the incoming connections. If the firewall blocks the outgoing connections, open up a range of ports for outgoing connections as well.

  2. Set the environment variable GLOBUS_TCP_PORT_RANGE:

    export GLOBUS_TCP_PORT_RANGE=min,max 

    where min,max specify the port range that you have opened for the incoming connections on the firewall. This restricts the listening ports of the GridFTP client to this range. A range of about 1000 ports is recommended (e.g., 50000,51000), but the right size really depends on how much use you expect.

  3. If you have a firewall blocking the outgoing connections and you have opened a range of ports, set the environment variable GLOBUS_TCP_SOURCE_RANGE:

    export GLOBUS_TCP_SOURCE_RANGE=min,max

    where min,max specify the port range that you have opened for the outgoing connections on the firewall. This restricts the outbound ports of the GridFTP client to this range. Recommended range is twice the range used for GLOBUS_TCP_PORT_RANGE, because if parallel TCP streams are used for transfers, the listening port would remain the same for each connection but the connecting port would be different for each connection.
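Because both variables take the form min,max, a typo such as a hyphen instead of a comma is easy to make. The following is a minimal sanity check, assuming only that a valid value is two decimal port numbers separated by a single comma, with min no greater than max and both within the TCP port range 1 to 65535. The function name valid_port_range is hypothetical.

```shell
#!/bin/sh
# valid_port_range VALUE: succeed if VALUE looks like "min,max".
valid_port_range() {
    case "$1" in
        *[!0-9,]* | *,*,* | ,* | *, | "") return 1 ;;  # bad characters or shape
        *,*) ;;                                        # exactly one comma
        *)   return 1 ;;                               # no comma at all
    esac
    min=${1%,*}
    max=${1#*,}
    [ "$min" -ge 1 ] && [ "$max" -le 65535 ] && [ "$min" -le "$max" ]
}

# Example check on a candidate value before exporting it.
if valid_port_range "50000,51000"; then
    echo "range looks well formed"
fi
```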

Additional information is available in the Globus Toolkit Firewall Requirements document.

6. Troubleshooting

If you are having problems using the GridFTP server, try the steps listed below. If you have an error, try checking the server logs if you have access to them. By default, the server logs to stderr, unless it is running from inetd, or its execution mode is detached, in which case logging is disabled by default.

The command line options -d, -log-level, -L and -logdir can affect where logs will be written, as can the configuration file options log_single and log_unique. See the Configuration information for more details on these and other configuration options.

6.1. Establish control channel connection

Verify that you can establish a control channel connection and that the server has started successfully by telnetting to the port on which the server is running:

% telnet localhost 2811
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 GridFTP Server mldev.mcs.anl.gov 2.0 (gcc32dbg, 1113865414-1) ready.

If you see anything other than a 220 banner such as the one above, the server has not started correctly.
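The banner check above comes down to whether the first line the server sends begins with the reply code 220. A small sketch of that test, operating on a line you have already captured; the function name is_ready_banner is hypothetical:

```shell
#!/bin/sh
# is_ready_banner LINE: succeed if LINE starts with the 220 reply code.
is_ready_banner() {
    case "$1" in
        "220 "*) return 0 ;;
        *)       return 1 ;;
    esac
}

banner='220 GridFTP Server mldev.mcs.anl.gov 2.0 (gcc32dbg, 1113865414-1) ready.'
if is_ready_banner "$banner"; then
    echo "server greeted us with a 220 banner"
else
    echo "unexpected greeting: $banner"
fi
```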

Verify that there are no configuration files being unexpectedly loaded from /etc/grid-security/gridftp.conf or $GLOBUS_LOCATION/etc/gridftp.conf. If those files exist, and you did not intend for them to be used, rename them to .save, or specify -c none on the command line and try again.
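To see quickly which of those configuration files are present, a helper along these lines may be useful. It is a sketch only: the function name existing_confs is hypothetical, and it simply reports which of the paths passed to it exist.

```shell
#!/bin/sh
# existing_confs PATH...: print each given path that exists, one per line.
existing_confs() {
    for f in "$@"; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Typical use, with GLOBUS_LOCATION set in your environment:
#   existing_confs /etc/grid-security/gridftp.conf \
#                  "$GLOBUS_LOCATION/etc/gridftp.conf"
```

Any path printed is a file the server may be picking up, so it is a candidate for renaming to .save or for bypassing with -c none.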

If you can log into the machine where the server is, try running the server from the command line with only the -s option:

$GLOBUS_LOCATION/sbin/globus-gridftp-server -s

The server will print the port it is listening on:

Server listening at gridftp.mcs.anl.gov:57764

Now try to telnet to that port. If you still do not get the banner listed above, something is preventing the socket connection. Check firewalls, tcp-wrapper, etc.
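If you want to script this step, the port can be pulled out of the "Server listening at" line with plain parameter expansion. This is a sketch that assumes the line has the host:port form shown above; the function name listening_port is hypothetical.

```shell
#!/bin/sh
# listening_port LINE: print the number after the last ':' in LINE,
# or fail if what follows is not a plain decimal number.
listening_port() {
    line=$1
    port=${line##*:}               # keep what follows the last ':'
    case "$port" in
        '' | *[!0-9]*) return 1 ;; # empty or not purely digits
    esac
    printf '%s\n' "$port"
}

listening_port 'Server listening at gridftp.mcs.anl.gov:57764'
```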

If you now get a correct banner, add -p 2811 (you will have to disable (x)inetd on port 2811 if you are using them or you will get port already in use):

$GLOBUS_LOCATION/sbin/globus-gridftp-server -s -p 2811

Now telnet to port 2811. If this does not work, something is blocking port 2811. Check firewalls, tcp-wrapper, etc.

If this works correctly then re-enable your normal server, but remove all options but -i, -s, or -S.

Now telnet to port 2811. If this does not work, something is wrong with your service configuration. Check /etc/services and (x)inetd config, have (x)inetd restarted, etc.

If this works, begin adding options back one at a time, verifying that you can telnet to the server after each option is added. Continue until you find the problem or have restored all the options you want.

At this point, you can establish a control connection. Now try running globus-url-copy.

6.2. Try running globus-url-copy

Once you've verified that you can establish a control connection, try to make a transfer using globus-url-copy.

If you are doing a client/server transfer (one of your URLs has file: in it) then try:

globus-url-copy -vb -dbg gsiftp://host.server.running.on/dev/zero file:///dev/null

This will run until you control-c the transfer. If that works, reverse the direction:

globus-url-copy -vb -dbg file:///dev/zero gsiftp://host.server.running.on/dev/null

Again, this will run until you control-c the transfer.

If you are doing a third party transfer, run this command:

globus-url-copy -vb -dbg gsiftp://host.server1.on/dev/zero gsiftp://host.server2.on/dev/null

Again, this will run until you control-c the transfer.

If the above transfers work, try your transfer again. If it fails, you likely have some sort of file permissions problem, typo in a file name, etc.
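The test transfers in this section follow a fixed pattern, so they can be generated from the host names. The sketch below only prints the commands; it does not run them. The function name debug_cmds is hypothetical, and the host names in the example call are the placeholders used above.

```shell
#!/bin/sh
# debug_cmds HOST         : print both client/server test transfers.
# debug_cmds HOST1 HOST2  : print the third party test transfer.
debug_cmds() {
    if [ $# -eq 1 ]; then
        printf 'globus-url-copy -vb -dbg gsiftp://%s/dev/zero file:///dev/null\n' "$1"
        printf 'globus-url-copy -vb -dbg file:///dev/zero gsiftp://%s/dev/null\n' "$1"
    else
        printf 'globus-url-copy -vb -dbg gsiftp://%s/dev/zero gsiftp://%s/dev/null\n' "$1" "$2"
    fi
}

debug_cmds host.server.running.on
```

Remember that each of the printed transfers runs until you interrupt it with control-c.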

6.3. If your server starts...

If the server has started correctly, and your problem is with a security failure or gridmap lookup failure, verify that you have security configured properly.

If the server is running and your client successfully authenticates but has a problem at some other time during the session, please ask for help on [email protected]. When you send mail or submit bugs, please always include as much of the following information as possible:

  • Specs on all hosts involved (OS, processor, RAM, etc).
  • globus-url-copy -version
  • globus-url-copy -versions
  • Output from the telnet test above.
  • The actual command line you ran with -dbg added. Don't worry if the output gets long.
  • Confirmation that each host reports a fully qualified domain name (FQDN) and that its /etc/hosts is sane.
  • The server configuration and setup (/etc/services entries, (x)inetd configs, etc.).
  • Any relevant lines from the server logs (not the entire log please).
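Part of that list can be gathered automatically. The following is a minimal sketch, not an official reporting tool: the function name collect_report and the section markers are hypothetical. It assumes uname is available, falls back gracefully if hostname -f is not supported, and only calls globus-url-copy if it is found in PATH.

```shell
#!/bin/sh
# collect_report: print basic host details and globus-url-copy versions.
collect_report() {
    echo "== host =="
    uname -a
    echo "== fqdn =="
    hostname -f 2>/dev/null || hostname 2>/dev/null || echo "unknown"
    for opt in -version -versions; do
        echo "== globus-url-copy $opt =="
        if command -v globus-url-copy >/dev/null 2>&1; then
            globus-url-copy "$opt" 2>&1
        else
            echo "globus-url-copy not found in PATH"
        fi
    done
}

collect_report
```

The telnet output, the -dbg output of the failing command, the server configuration and the relevant log lines still have to be added by hand.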

7. Usage statistics collection by the Globus Alliance

The following GridFTP-specific usage statistics are sent in a UDP packet at the end of each transfer, in addition to the standard header information described in the Usage Stats section.

  • Start time of the transfer
  • End time of the transfer
  • Version string of the server
  • TCP buffer size used for the transfer
  • Block size used for the transfer
  • Total number of bytes transferred
  • Number of parallel streams used for the transfer
  • Number of stripes used for the transfer
  • Type of transfer (STOR, RETR, LIST)
  • FTP response code -- Success or failure of the transfer

Note: The client (globus-url-copy) does NOT send any data. It is the servers that send the usage statistics.

We have made a concerted effort to collect only data that is not too intrusive or private and yet still provides us with information that will help improve and gauge the usage of the GridFTP server. Nevertheless, if you wish to disable this feature for GridFTP only, see the Logging section of Section 3.2, “GridFTP server configuration options”. Note that you can disable transmission of usage statistics globally for all C components by setting "GLOBUS_USAGE_OPTOUT=1" in your environment.
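In a Bourne-style shell, the global opt-out mentioned above could look like the following. Where you put the line (for example, in the startup script for the server) is an assumption about your setup; the variable must be set in the environment of the process before it starts.

```shell
# Opt out of usage statistics for all C components started from this shell.
export GLOBUS_USAGE_OPTOUT=1
```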

Also, please see our policy statement on the collection of usage statistics.