GT 4.0 WS GRAM: Developer's Guide

1. Introduction

This guide is intended to help a developer create compatible WS GRAM clients and alternate service implementations.

The key concepts for the GRAM component have not changed. Its purpose is still to provide the mechanisms to execute remote applications for a user. Given an RSL (Resource Specification Language) job description, GRAM submits the job to a scheduling system such as PBS or Condor, or to a simple fork-based process starter, and monitors it until completion. More details can be found here:

http://www.globus.org/toolkit/docs/3.2/gram/key
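
For example, a minimal GT 4.0 job description (the XML form of the RSL) that runs /bin/date and captures its output might look like the following sketch; all paths are illustrative:

<job>
    <executable>/bin/date</executable>
    <directory>/tmp</directory>
    <stdout>/tmp/date.out</stdout>
    <stderr>/tmp/date.err</stderr>
</job>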

2. Before you begin

2.1. Feature summary

New Features since 3.2

  • Support for MPICH-G2 jobs:

    • multi-job submission capabilities
    • ability to coordinate processes in a job
    • ability to coordinate subjobs in a multi-job

  • Publishing of the job's exit code
  • The ability to select the account under which the remote job will run: if a user's grid credential is mapped to multiple accounts, the user can specify in the RSL which account the job should run under.
  • An optional client-specified hold on a job state, released with the new "release" operation (see the sketch after this list).
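
As a sketch of the last two features, and assuming the GT 4.0 job description elements localUserId (account selection) and holdState (client-specified hold), a job description could combine them as follows; the account name and held state are illustrative:

<job>
    <executable>/bin/hostname</executable>
    <!-- run under one specific account mapped to the user's grid credential -->
    <localUserId>guest2</localUserId>
    <!-- hold the job when it reaches StageOut until the "release" operation is invoked -->
    <holdState>StageOut</holdState>
</job>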

Other Supported Features

  • Remote job execution and management
  • Uniform and flexible interface to batch scheduling systems
  • File staging before and after job execution
  • File / directory clean up after job execution (after file stage out)

Deprecated Features

  • managed-job-globusrun has been replaced by globusrun-ws.
  • Service-managed data streaming of the job's stdout/stderr during execution
  • File staging using the GASS protocol
  • File caching of staged files, e.g. the GASS cache

2.2. Tested platforms

Tested platforms for WS GRAM:

  • Linux

    • Fedora Core 1 i686
    • Fedora Core 3 i686
    • Fedora Core 3 Xeon
    • RedHat 7.3 i686
    • RedHat 9 x86
    • Debian Sarge x86
    • Debian 3.1 i686

Tested containers for WS GRAM:

  • Java WS Core container
  • Tomcat 4.1.31

2.3. Backward compatibility summary

Protocol changes since GT version 3.2:

  • The protocol has been changed to be WSRF compliant. There is no backward compatibility between this version and any previous versions.

2.4. Technology dependencies

GRAM depends on the following GT components:

  • Java WS Core
  • Transport-Level Security
  • Delegation Service
  • RFT
  • GridFTP
  • MDS - internal libraries

GRAM also depends on 3rd party batch scheduling software. The dependency exists only for the schedulers that are actually configured: configuring a scheduler adapter is what makes job submission to that batch scheduling system possible.

Scheduler adapters for several schedulers (for example, PBS and Condor) are included in the GT 4.0.x releases; adapters for other schedulers are available separately for GT 4.0.x.

2.5. Security considerations

No special security considerations exist at this time.

3. Architecture and design overview

The GRAM services in GT 4.0 are WSRF compliant. One of the key concepts in the WSRF specifications is the decoupling of a service from the public "state" it exposes through its interface, via the implied resource pattern. Following this concept, the data of GT 4.0 GRAM jobs is published as part of WSRF resources, while a single service is used to start jobs and to query and monitor their state. This differs from the OGSI model of GT3, in which each job was represented as a separate service.

There is still a job factory service that can be called in order to create job instances (represented as WSRF resources). Each scheduling system that GRAM is interfaced with is represented as a separate factory resource. By making a call to the factory service while associating the call with the appropriate factory resource, the job-submitting actor can create a job resource mapping to a job in the chosen scheduling system.
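
For example, the globusrun-ws client selects the scheduling system by targeting the corresponding factory resource at submission time. In the following sketch (the host name and job description file are illustrative), the same container is asked to create a job resource first in its Fork back end and then in its PBS back end:

globusrun-ws -submit -F https://my.machine.com:8443/wsrf/services/ManagedJobFactoryService -Ft Fork -f job.xml
globusrun-ws -submit -F https://my.machine.com:8443/wsrf/services/ManagedJobFactoryService -Ft PBS -f job.xml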

3.1. Job States

3.1.1. Overview

The Managed Executable Job Service (MEJS) relies on a state machine to handle state transitions. There are two sets of states: external and internal. The external states are those that the user receives in notifications and can query as a resource property. The internal states are used strictly by the state machine to step through all the internal tasks that must be performed for a particular job.

The Managed Multi-Job Service (MMJS) does not rely on a state machine; instead it decides, based on the notifications it receives from its sub-jobs, which external state it should be in. The external states of the MMJS are identical to those of the MEJS.

3.1.2. External and Internal States of the Managed Job Services

3.1.2.1. External States of the Managed Job Services
  • Unsubmitted
  • StageIn
  • Pending
  • Active
  • Suspended
  • StageOut
  • CleanUp
  • Done
  • Failed
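
For example, a successful run of a job that stages files in and out typically progresses through the external states:

Unsubmitted -> StageIn -> Pending -> Active -> StageOut -> CleanUp -> Done

A job that encounters an unrecoverable error moves to Failed instead of Done.
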
3.1.2.2. Internal States of the Managed Executable Job Service
  • None
  • Restart
  • Start
  • StageIn
  • StageInResponse
  • Submit
  • OpenStdout
  • OpenStderr
  • WaitingForStateChanges
  • MergeStdout
  • StageOut
  • StageOutResponse
  • UserCancel
  • UserCancelResponse
  • SystemCancel
  • CleanUp
  • FileCleanUp
  • FileCleanUpResponse
  • CacheCleanUp
  • CacheCleanUpResponse
  • ScratchCleanUp
  • ScratchCleanUpResponse
  • Suspend
  • Resume
  • Done
  • FailureFileCleanUp
  • FailureFileCleanUpResponse
  • FailureCacheCleanUp
  • FailureCacheCleanUpResponse
  • FailureScratchCleanUp
  • FailureScratchCleanUpResponse
  • Failed
  • UnsubmittedHold
  • StageInHold
  • PendingHold
  • ActiveHold
  • SuspendedHold
  • StageOutHold
  • CleanUpHold
  • DoneHold
  • FailedHold

3.1.3. Managed Executable Job Service Internal State Diagram

Here is a diagram illustrating the internal state transitions of the Managed Executable Job Service and how the external states are triggered within this progression: Managed Executable Job Service Internal State Transition Diagram.
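
One concrete way to observe the external states is to submit a job in batch mode and then query or monitor it with globusrun-ws (the EPR file name is illustrative):

globusrun-ws -submit -batch -o job.epr -f job.xml
globusrun-ws -status -j job.epr
globusrun-ws -monitor -j job.epr

The -status invocation prints the current external state (e.g. "Current job state: Active"), while -monitor keeps printing state changes until the job reaches Done or Failed.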

4. Public interface

The semantics and syntax of the APIs and WSDL for the component, along with descriptions of domain-specific structured interface data, can be found in GT 4.0 Component Guide to Public Interfaces: WS GRAM.

5. Usage scenarios

5.2. Submitting a job in C

The following is a general scenario for submitting a job using the C stubs and APIs. Please consult the C WS Core and WS GRAM API documentation for details on the declarations referenced in the code excerpts.

6. Tutorials

The following tutorials are available for WS GRAM developers:

7. Debugging

7.1. Enabling debug logging for GRAM classes

For starters, consult the Debugging section of the Java WS Core Developer's Guide for details about what files to edit and other general log4j configuration information.

To turn on debug logging for the Managed Executable Job Service (MEJS), add the following entry to the container-log4j.properties file:

log4j.category.org.globus.exec.service.exec=DEBUG

To turn on debug logging for the delegated proxy management code, add the following entry to the container-log4j.properties file:

log4j.category.org.globus.exec.service.utils=DEBUG

To turn on debug logging for the Managed Multi Job Service (MMJS), add the following entry to the container-log4j.properties file:

log4j.category.org.globus.exec.service.multi=DEBUG

To turn on debug logging for the Managed Job Factory Service (MJFS), add the following entry to the container-log4j.properties file:

log4j.category.org.globus.exec.service.factory=DEBUG

To turn on debug logging for all GRAM code, add the following entry to the container-log4j.properties file:

log4j.category.org.globus.exec=DEBUG

Follow the pattern to turn on logging for other specific packages or classes.

7.2. Instrumented timings logging

Both the service code and the Java client API contain special debugging statements that output timing data to help in determining performance bottlenecks.

The service code uses the PerformanceLog class to output the timings information. To turn on service timings logging without triggering full debug logging for the service code, add the following lines to the container-log4j.properties file:

log4j.category.org.globus.exec.service.factory.ManagedJobFactoryService.performance=DEBUG
log4j.category.org.globus.exec.service.exec.ManagedExecutableJobResource.performance=DEBUG
log4j.category.org.globus.exec.service.exec.StateMachine.performance=DEBUG

The Java client API has not been converted to use the PerformanceLog class, so its debug statements are emitted at the INFO level to avoid having to turn on full debug logging. To turn on client timings logging without triggering full debug logging for the client code, add the following line to the container-log4j.properties file:

log4j.category.org.globus.exec.client=INFO

The source distribution contains two parsing scripts, not distributed in any GPT package, for summarizing the service and client timings data. They are located in ws-gram/service/java/test/throughput/ and are named parse-service-timings.pl and parse-client-timings.pl. Each simply takes the path of the log file that contains the timing data; the scripts work fine with log files in which other logging statements are mixed in with the timing data.
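
For example, assuming the container log is written to $GLOBUS_LOCATION/var/container.log (the actual path depends on your log4j configuration), the scripts could be invoked as:

perl ws-gram/service/java/test/throughput/parse-service-timings.pl $GLOBUS_LOCATION/var/container.log
perl ws-gram/service/java/test/throughput/parse-client-timings.pl /path/to/client.log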

7.3. Debugging script execution

It may be necessary to debug the scheduler scripts if jobs aren't being submitted correctly and either no fault or a less-than-helpful fault is generated. Ideally this should not be necessary, so if you find that you must resort to it, please file a bug report or let us know on the discuss e-mail list.

By turning on debug logging for the MEJS (see above), you should be able to search for "Perl Job Description" in the logging output to find the Perl form of the job description that is sent to the scheduler scripts.

Also by turning on debug logging for the MEJS, you should be able to search for "Executing command" in the logging output to find the specific commands that are executed when the scheduler scripts are invoked from the service code. If you saved the Perl job description from the previous paragraph, you can use it to run these commands manually.

The Perl job description supports an attribute named logfile, not currently available in the XML job description, that can be used to print debugging information about the execution of the Perl scripts. Its value is the path of a file that will be created. You can add this attribute to the Perl job description file you saved from the service debug logging before manually running the script commands; see the sketch below.
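
As a sketch, and assuming the hash-style file format used by the Perl job descriptions (copy the real attribute set from the "Perl Job Description" logging output; all values here are illustrative), the logfile attribute is added like any other attribute:

$description = {
    executable => [ '/bin/date' ],
    directory  => [ '/tmp' ],
    stdout     => [ '/tmp/date.out' ],
    # not available in the XML job description; the scheduler scripts
    # write their debugging output to this file
    logfile    => [ '/tmp/gram-script-debug.log' ],
};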

Beyond the above advice, you may want to edit the Perl scripts themselves to print more detailed information. For more information on the location and composition of the scheduler scripts, please consult the WS-GRAM Scheduler Interface Tutorial.

8. Troubleshooting

When I submit a streaming or staging job, I get the following error: ERROR service.TransferWork Terminal transfer error: [Caused by: Authentication failed [Caused by: Operation unauthorized (Mechanism level: Authorization failed. Expected "/CN=host/localhost.localdomain" target but received "/O=Grid/OU=GlobusTest/OU=simpleCA-my.machine.com/CN=host/my.machine.com")]]

  • Check $GLOBUS_LOCATION/etc/gram-service/globus_gram_fs_map_config.xml for the use of "localhost" or "127.0.0.1" instead of the public hostname (in the example above, "my.machine.com"). Change these uses of the loopback hostname or IP to the public hostname as necessary.

Fork jobs work fine, but submitting PBS jobs with globusrun-ws hangs at "Current job state: Unsubmitted"

  • Make sure the log_path in $GLOBUS_LOCATION/etc/globus-pbs.conf points to locally accessible scheduler logs that are readable by the user running the container. The Scheduler Event Generator (SEG) will not work without local scheduler logs to monitor. This can also apply to other resource managers, but is most commonly seen with PBS.
  • If the SEG configuration looks sane, try running the SEG tests. They are located in $GLOBUS_LOCATION/test/globus_scheduler_event_generator_*_test/. If Fork jobs work, you only need to run the PBS test. Run each test by going to the associated directory and running ./TESTS.pl. If any tests fail, report this to the [email protected] mailing list.
  • If the SEG tests succeed, the next step is to figure out the ID assigned by PBS to the queued job. Enable GRAM debug logging by uncommenting the appropriate line in the $GLOBUS_LOCATION/container-log4j.properties configuration file. Restart the container, run a PBS job, and search the container log for a line that contains "Received local job ID" to obtain the local job ID.
  • Once you have the local job ID, you can check the latest PBS logs pointed to by the value of "log_path" in $GLOBUS_LOCATION/etc/globus-pbs.conf to make sure the job's status is being logged. If the status is not being logged, check the documentation for your flavor of PBS to see if there's any further configuration that needs to be done to enable job status logging. For example, PBS Pro requires a sufficient -e <bitmask> option added to the pbs_server command line to enable enough logging to satisfy the SEG.
  • If the correct status is being logged, try running the SEG manually to see if it is reading the log file properly. The general form of the SEG command line is as follows: $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s pbs -t <timestamp> The timestamp is in seconds since the epoch and dictates how far back in the log history the SEG should scan for job status events. The command should hang after dumping some status data to stdout. If no data appears, change the timestamp to an earlier time. If nothing ever appears, report this to the [email protected] mailing list.
  • If running the SEG manually succeeds, try running another job and make sure the job process actually finishes and PBS has logged the correct status before giving up and cancelling globusrun-ws. If things are still not working, report your problem and exactly what you have tried to remedy the situation to the [email protected] mailing list.

The job manager detected an invalid script response

  • Check for a restrictive umask. When the service writes the native scheduler job description to a file, an overly restrictive umask will cause the permissions on the file to be such that the submission script run through sudo as the user cannot read the file (bug #2655).

When restarting the container, I get the following error: Error getting delegation resource

  • Most likely this is simply a case of the delegated credential expiring. Either refresh it for the affected job or destroy the job resource.

The user's home directory has not been determined correctly

  • This occurs when the administrator changed the location of the user's home directory and did not restart the GT4 container afterwards. Beginning with version 4.0.3 of the GT, WS-GRAM determines a user's home directory only once in the lifetime of a container (when the user submits the first job). Subsequently submitted jobs will use the cached home directory during job execution.

9. Related Documentation

No related documentation links have been determined at this time.

10. Internal Components
