WS-GRAM Scheduler Interface Tutorial (Perl Module)

The Perl-language scheduler module provides the job submission and cancelling interface between the Managed Job Service and the underlying scheduler. Very little has been added to this part of the scheduler interface since GT 2, so if you have a version of this module for an older Globus Toolkit release, you can skim most of this tutorial and jump to the sections at the end describing the changes in GT 3.2 and GT 4.0.

Perl Scheduler Module

The scheduler interface is implemented as a Perl module which is a subclass of the Globus::GRAM::JobManager module. Its name must match the scheduler type string used when the service is installed, but in all lower case: for the LSF scheduler, the module name is Globus::GRAM::JobManager::lsf and it is stored in the file lsf.pm. Though there are several methods in the Globus::GRAM::JobManager interface, the only ones which absolutely need to be implemented in a scheduler module are submit and cancel.

We'll begin by looking at the start of the lsf source module, lsf.in (the transformation to lsf.pm happens when a setup script is run). To begin the script, we import the GRAM support modules into the scheduler module's namespace, declare the module's namespace, and declare this module as a subclass of the Globus::GRAM::JobManager module. All scheduler packages will need to do this, substituting the name of the scheduler type being implemented where we see lsf below.

use Globus::GRAM::Error;
use Globus::GRAM::JobState;
use Globus::GRAM::JobManager;
use Globus::Core::Paths;

...

package Globus::GRAM::JobManager::lsf;

@ISA = qw(Globus::GRAM::JobManager);

Next, we declare any system-specific values which will be substituted when the setup package scripts are run. In the LSF case, we need to know the paths to a few programs which interact with the scheduler:

my ($mpirun, $bsub, $bjobs, $bkill);

BEGIN
{
    $mpirun = '@MPIRUN@';
    $bsub   = '@BSUB@';
    $bjobs  = '@BJOBS@';
    $bkill  = '@BKILL@';
}

The values surrounded by @-signs (such as @MPIRUN@) will be replaced with the paths to the named programs by the find-lsf-tools script described below.
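
For example, after the setup package runs find-lsf-tools, the substituted assignments in the generated lsf.pm might look like the following (the paths are purely illustrative and depend on where LSF is installed at your site):

    $mpirun = '/usr/local/lsf/bin/mpirun';   # example path only
    $bsub   = '/usr/local/lsf/bin/bsub';
    $bjobs  = '/usr/local/lsf/bin/bjobs';
    $bkill  = '/usr/local/lsf/bin/bkill';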

Writing a Constructor

Scheduler interfaces which need to set up some data before their other methods are called can overload the new method, which acts as a constructor. Scheduler scripts which don't need any per-instance initialization won't need to provide a constructor; the default Globus::GRAM::JobManager::new constructor will do the job.

If you do need to overload this method, be sure to call the parent module's constructor to allow it to do its initialization, as in this example:

# Example-only implementation of the module's constructor. You won't find
# this code in lsf.in
sub new
{
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self = $class->SUPER::new(@_);

    ## Insert scheduler-specific startup code here
    $self->{foo} = 1;
    $self->{bar} = 'baz';
    ## End of scheduler-specific startup code

    return $self;
}
    

The job interface methods are called with only one argument: the scheduler object itself. That object contains a Globus::GRAM::JobDescription object ($self->{JobDescription}) which includes the values from the RSL associated with the request, as well as a few extra values:

job_id
The string returned as the value of JOB_ID in the return hash from submit. This won't be present for methods called before the job is submitted.
uniq_id
A string associated with this job request by the job manager program. It will be unique for all jobs on a host for all time and might be useful in creating temporary files or scheduler-specific processing.
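As an illustration only (this subroutine does not appear in lsf.in), a scheduler method can read both the standard RSL attributes and these extra values through accessor methods on the JobDescription object; the accessor names correspond to the attribute names, as with the jobid() call in the cancel method shown later:

sub example_method
{
    my $self        = shift;
    my $description = $self->{JobDescription};

    # Standard RSL attributes are available as accessor methods...
    my $executable = $description->executable();

    # ...and so are the extra values added by the job manager.
    my $job_id  = $description->jobid();
    my $uniq_id = $description->uniq_id();

    $self->log("handling job $job_id (uniq_id $uniq_id)");

    return {};
}
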
Now, let's look at the methods which will interface to the scheduler.

Submitting Jobs

All scheduler modules must implement the submit method. This method is called when the job manager wishes to submit the job to the scheduler. The information in the original job request RSL string is available to the scheduler interface through the JobDescription data member of its hash.

For most schedulers, this is the longest method to be implemented, as it must decide what to do with the job description, and convert RSL elements to something which the scheduler can understand.

We'll look at some of the steps in the LSF manager code to see how the scheduler interface is implemented.

In the beginning of the submit method, we'll get our parameters and look up the job description in the manager-specific object:

sub submit
{
    my $self = shift;
    my $description = $self->{JobDescription};

    

Then we will check for values of the job parameters that we will be handling. For example, this is how we check for a valid job type in the LSF scheduler interface:

if(defined($description->jobtype()))
{
    if($description->jobtype() !~ /^(mpi|single|multiple)$/)
    {
        return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
    }
    elsif($description->jobtype() eq 'mpi' && $mpirun eq "no")
    {
        return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
    }
}
    

The lsf module supports most of the core RSL elements, so it does quite a bit more processing to determine what to do with the values in the job description.
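
For example, before the job script is written, the submit method reads several other RSL attributes into local variables. The fragment below is a simplified sketch of that kind of processing (it is not copied verbatim from lsf.in); among other things, it shows where the $queue variable used a little later comes from:

    my ($queue, $wall_time);

    # Pick up the queue requested in the RSL, if any.
    if(defined($description->queue()))
    {
        $queue = $description->queue();
    }

    # A wall-clock limit (in minutes) can later be turned into an LSF
    # run-limit directive in the job script.
    if(defined($description->max_wall_time()))
    {
        $wall_time = $description->max_wall_time();
    }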

Once we've inspected the JobDescription we'll know what we need to tell the scheduler about so that it'll start the job properly. For LSF, we will construct a job description script and pass that to the bsub command. This script is a Bourne Shell script with some special comments which LSF uses to decide what constraints to use when scheduling the job.

First, we'll open the new file, and write the file header:

    local(*JOB);
    open(JOB, '>' . $lsf_job_script_name);
    print JOB<<"EOF";
#! /bin/sh
#
# LSF batch job script built by Globus Job Manager
#
EOF
    

Then, we'll add some comments to pass job constraints to LSF, such as the queue and project names:

if(defined($queue))
{
    print JOB "#BSUB -q $queue\n";
}
if(defined($description->project()))
{
    print JOB "#BSUB -P " . $description->project() . "\n";
}
    

At the end of the job description script, we actually run the executable named in the JobDescription. For LSF, we support a few different job types which require different startup commands. Here, we quote and escape the strings in the argument list so that the values of the arguments will be identical to those in the initial job request string. For this Bourne shell script, we double-quote each argument and escape the backslash (\), dollar-sign ($), double-quote ("), and backtick (`) characters. We will use this new string later in the script.

    @arguments = $description->arguments();

    foreach(@arguments)
    {
        if(ref($_))
        {
            return Globus::GRAM::Error::RSL_ARGUMENTS;
        }
    }
    if($#arguments >= 0)
    {
        foreach(@arguments)
        {
             $_ =~ s/\\/\\\\/g;
             $_ =~ s/\$/\\\$/g;
             $_ =~ s/"/\\\"/g;
             $_ =~ s/`/\\\`/g;

             $args .= '"' . $_ . '" ';
        }
    }
    else
    {
        $args = "";
    }
    

To end the LSF job description script, we write the command line of the executable to the script. Depending on the job type of this submission, we will either start one or more instances of the executable directly, or start the mpirun program, which launches the number of processes given by the count attribute in the JobDescription:

    if($description->jobtype() eq 'mpi')
    {
        print JOB "$mpirun -np ", $description->count(), ' ';
        print JOB $description->executable(), " $args \n";
    }
    elsif($description->jobtype() eq 'multiple')
    {
        for(my $i = 0; $i < $description->count(); $i++)
        {
            print JOB $description->executable(), " $args &\n";
        }
        print JOB "wait\n";
    }
    else
    {
        print JOB $description->executable(), " $args\n";
    }
    

Next, we submit the job to the scheduler. Be sure to close the script file before trying to redirect it into the submit command, or some of the script file may be buffered and things will fail in strange ways!

When the submission command returns, we check its output for the scheduler-specific job identifier. We will use this value to poll or cancel the job.

The return value of the method should be either a GRAM error object or a reference to a hash of values. The Globus::GRAM::JobManager documentation lists the valid keys for that hash. For the submit method, we'll return the job identifier as the value of JOB_ID in the hash. If the scheduler returned a job status result, we could return that as well. LSF does not, so we'll return the PENDING state along with the job ID; if the submission fails, we'll return an error object instead. In the failure case we also pass the standard error output from the scheduler submission program to the job manager as the GT3_FAILURE_MESSAGE hash value, which gives the user some context about the failure.

    close(JOB);
    chmod 0755, $lsf_job_script_name;

    $lsf_job_err_name = $self->job_dir() . '/scheduler_bsub_stderr';
    $self->log("job err is at $lsf_job_err_name");

    $self->log("about to submit job");
    $self->nfssync( $lsf_job_script_name );
    $self->nfssync( $lsf_job_err_name );
    $job_id = (grep(/is submitted/,
                   split(/\n/, `$bsub < $lsf_job_script_name 2> $lsf_job_err_name`)))[0];

    if($? == 0)
    {
        $job_id =~ m/<([^>]*)>/;
        $job_id = $1;

        return {
                   JOB_ID => $job_id,
                   JOB_STATE => Globus::GRAM::JobState::PENDING
                };
    }
    else
    {
        $self->log("job submission failed, checking $lsf_job_err_name");

        my $stderr;
        local(*ERR);
        $self->nfssync( $lsf_job_err_name );
        open(ERR, "<$lsf_job_err_name");
        local $/;
        $stderr = <ERR>;
        close(ERR);

        open(ERR, '>' . $description->stderr());
        print ERR $stderr;
        close(ERR);

        $stderr =~ s/\n/ /g;

        $self->respond({ GT3_FAILURE_MESSAGE => $stderr });

        # Report the failed submission to the caller as an error object,
        # as described above.
        return Globus::GRAM::Error::JOB_EXECUTION_FAILED;
    }
    

That finishes the submit method. Most of the functionality for the scheduler interface is now written. We just have one more (much shorter) method to implement.

Cancelling Jobs

All scheduler modules must also implement the cancel method. The purpose of this method is to cancel a scheduled job, whether it's already running or waiting in a queue.

This method will be given the job ID as part of the JobDescription object in the manager object. If the scheduler interface provides feedback that the job was cancelled successfully, then we can return a JOB_STATE change to the FAILED state. Otherwise we can return an empty hash reference and let the scheduler event generator record the state change when it next occurs.

To process a cancel in the LSF case, we will run the bkill command with the job ID.

sub cancel
{
    my $self = shift;
    my $description = $self->{JobDescription};
    my $job_id = $description->jobid();

    $self->log("cancel job $job_id");

    system("$bkill $job_id >/dev/null 2>/dev/null");

    if($? == 0)
    {
        return { JOB_STATE => Globus::GRAM::JobState::FAILED }
    }
    return Globus::GRAM::Error::JOB_CANCEL_FAILED;

}
    

End of the script

All Perl modules must return a true (non-zero) value when they are loaded. To satisfy this, make sure the last line of your module consists of:

1;
    

Scheduler Setup Package

Once we've written the job manager script, we need to get it installed so that the Managed Job Service will be able to access the scheduler. We do this by writing a setup script. For LSF, we will write the script setup-globus-job-manager-lsf.pl, which we will list in the LSF package as the Post_Install_Program.

To set up the Gatekeeper service, our LSF setup script does the following:

  1. Perform system-specific configuration.
  2. Install the GRAM scheduler Perl module (and register as a gatekeeper service for pre-WS GRAM compatibility).
  3. (Optional) Install an RSL validation file defining extra scheduler-specific RSL attributes which the scheduler interface will support (pre-WS GRAM only).
  4. Update the GPT metadata to indicate that the job manager service has been set up.

LSF Setup Script

The LSF setup script begins by checking the environment for the location where GPT and the Globus Toolkit are installed. Both of these are needed to successfully set up the LSF scheduler module.

my $gpath = $ENV{GPT_LOCATION};

if (!defined($gpath))
{
    $gpath = $ENV{GLOBUS_LOCATION};
}

if (!defined($gpath))
{
    die "GPT_LOCATION or GLOBUS_LOCATION needs to be set before running this script";
}

@INC = (@INC, "$gpath/lib/perl");
    

After this comes some option-parsing code, along with default values for the options, which handles system-specific changes to the service name and controls whether the pre-WS GRAM service will validate the list of queues at RSL parsing time or defer to the error handling in the scheduler script's submit method.

require Grid::GPT::Setup;
use Getopt::Long;

my $name                = 'jobmanager-lsf';
my $manager_type        = 'lsf';
my $cmd;
my $validate_queues     = 1;
my $help;

GetOptions('service-name|s=s' => \$name,
           'validate-queues=s' => \$validate_queues,
           'help|h' => \$help);

&usage if $help;
    

Next, the script constructs a metadata object for the setup package. This object will be used to write a GPT file indicating that the setup dependency declared by the GRAM service package is met. The meaning of the setup package dependency is: a scheduler script has been deployed into the Perl path, and a pre-WS GRAM service entry has been created for the scheduler. Note that the second part of this is not relevant to the WS-GRAM service, but it is still required to keep the metadata meaningful. The package_name hash value must match the name of the setup package in its GPT metadata.

my $metadata =
    new Grid::GPT::Setup(package_name => "globus_gram_job_manager_setup_lsf");
    

System-specific Configuration

The system-specific configuration for LSF involves finding the paths to the LSF tools. This is done by the find-lsf-tools configure script described below. Note that this script is invoked with the argument --cache-file=/dev/null: this prevents the script from caching values between invocations, so that if something in the system environment changes, rerunning the script will behave as expected.

print `./find-lsf-tools --cache-file=/dev/null`;
if($? != 0)
{
    print STDERR "Error locating LSF commands, aborting!\n";
    exit 2;
}
    

Registering as a Gatekeeper Service

Next, the setup script installs its perl module into the perl library directory and registers an entry in the Globus Gatekeeper's service directory. The program globus-job-manager-service (distributed in the job manager program setup package) performs both of these tasks. When run, it expects the scheduler perl module to be located in the $GLOBUS_LOCATION/setup/globus directory. If this is successful, we can write the setup package metadata to let GPT know that the package is configured properly.

# Create service entry; $libexecdir is set earlier in the setup script to the
# toolkit's libexec directory, where globus-job-manager-service is installed.
$cmd = "$libexecdir/globus-job-manager-service -add -m lsf -s \"$name\"";
system("$cmd >/dev/null 2>/dev/null");
    

Installing an RSL Validation File (pre-WS GRAM only)

If the scheduler script implements RSL attributes which are not part of the core set supported by the job manager, it must publish them in the job manager's data directory. If the scheduler script wants to set some default values of RSL attributes, it may also set those as the default values in the validation file. This is not used by the ws-GRAM service, and is only applicable to scripts which are used in both pre-WS GRAM and WS-GRAM contexts.

The format of the validation file is described in the RSL Validation Section of the pre-WS GRAM documentation. The validation file must be named scheduler-type.rvf and installed in the $GLOBUS_LOCATION/share/globus_gram_job_manager directory.

In the LSF setup script, we check the list of queues supported by the local LSF installation, and add a section of acceptable values for the queue RSL attribute:

open(VALIDATION_FILE,
     ">$ENV{GLOBUS_LOCATION}/share/globus_gram_job_manager/lsf.rvf");

# Customize validation file with queue info
open(BQUEUES, "bqueues -w |");

# discard header
$_ = <BQUEUES>;
my @queues = ();

while(<BQUEUES>)
{
    chomp;

    $_ =~ m/^(\S+)/;

    push(@queues, $1);
}
close(BQUEUES);

if(@queues)
{
    print VALIDATION_FILE "Attribute: queue\n";
    print VALIDATION_FILE join(" ", "Values:", @queues);
    print VALIDATION_FILE "\n";
}
close VALIDATION_FILE;
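
On a host whose LSF installation defines, for example, queues named normal and night (these names are only illustrative), the generated section of lsf.rvf would read:

Attribute: queue
Values: normal night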
    

Updating GPT Metadata

Finally, the setup package should finalize the Grid::GPT::Setup object. If the finish() method fails, it is considered good practice to clean up any files created by the setup script. From setup-globus-job-manager-lsf.pl:

$metadata->finish();

System-Specific Configuration Script

First, our scheduler setup script probes for any system-specific information needed to interface with the local scheduler. For example, the LSF scheduler uses the mpirun, bsub, bqueues, bjobs, and bkill commands to submit, poll, and cancel jobs. We'll assume that the administrator who is installing the package has these commands in their path. We'll use an autoconf script to locate the executable paths for these commands and substitute them into our scheduler Perl module. In the LSF package, we have the find-lsf-tools script, which is generated during bootstrap by autoconf from the find-lsf-tools.in file:

## Required Prolog

AC_REVISION($Revision: 1.8 $)
AC_INIT(lsf.in)

# checking for the GLOBUS_LOCATION

if test "x$GLOBUS_LOCATION" = "x"; then
    echo "ERROR Please specify GLOBUS_LOCATION" >&2
    exit 1
fi

...

## Check for optional tools, warn if not found

AC_PATH_PROG(MPIRUN, mpirun, no)
if test "$MPIRUN" = "no" ; then
    AC_MSG_WARN([Cannot locate mpirun])
fi

...

## Check for required tools, error if not found

AC_PATH_PROG(BSUB, bsub, no)
if test "$BSUB" = "no" ; then
    AC_MSG_ERROR([Cannot locate bsub])
fi

...

## Required epilog - update scheduler specific module

prefix='$(GLOBUS_LOCATION)'
exec_prefix='$(GLOBUS_LOCATION)'
libexecdir=${prefix}/libexec

AC_OUTPUT(
    lsf.pm:lsf.in
)
    

If this script exits with a non-zero error code, then the setup script propagates the error to the caller and exits without installing the service.

Packaging

Now that we've written a job manager scheduler interface, we'll package it using GPT to make it easy for our users to build and install. We'll start by gathering the different files we've written above into a single directory: lsf.

  • lsf.in
  • find-lsf-tools.in
  • setup-globus-job-manager-lsf.pl

Package Documentation

If there are any scheduler-specific options defined for this scheduler module, or if there are any optional setup items, then it is good to provide a documentation page which describes these. For LSF, we describe the changes since the last version of this package in the file globus_gram_job_manager_lsf.dox. This file consists of a doxygen mainpage. See http://www.doxygen.org for information on how to write documentation with that tool.
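
A minimal skeleton for such a file might look like the following; the section name and text are only illustrative, not the contents of the actual globus_gram_job_manager_lsf.dox:

/**
@mainpage LSF Job Manager Setup

This package configures the LSF scheduler interface for GRAM.

@section lsf_changes Changes since the last release

- Describe new or changed scheduler-specific RSL attributes here.
- Describe new setup script options here.
*/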

configure.in

Now, we'll write our configure.in script. This file is converted to the configure shell script by the bootstrap script below. Since we don't probe for any compile-time tools or system characteristics, we just call the various initialization macros used by GPT, declare that we may provide doxygen documentation, and then output the files which need substitutions done on them.

AC_REVISION($Revision: 1.8 $)
AC_INIT(Makefile.am)

GLOBUS_INIT
AM_PROG_LIBTOOL

dnl Initialize the automake rules the last argument
AM_INIT_AUTOMAKE($GPT_NAME, $GPT_VERSION)

LAC_DOXYGEN("../", "*.dox")

GLOBUS_FINALIZE

AC_OUTPUT(
        Makefile
        pkgdata/Makefile
        pkgdata/pkg_data_src.gpt
        doxygen/Doxyfile
        doxygen/Doxyfile-internal
        doxygen/Makefile
)
    

Package Metadata

Now we'll write our metadata file and put it in the pkgdata subdirectory of our package. The important things to note in this file are the package name and version, the post_install_program, and the setup sections. These define how the package distribution will be named, what command will be run by gpt-postinstall when this package is installed, and what setup dependency will be recorded when the Grid::GPT::Setup object is finalized.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gpt_package_metadata SYSTEM "package.dtd">

<gpt_package_metadata Format_Version="0.02" Name="globus_gram_job_manager_setup_lsf" >

  <Aging_Version Age="0" Major="1" Minor="0" />
  <Description >LSF Job Manager Setup</Description>
  <Functional_Group >ResourceManagement</Functional_Group>
  <Version_Stability Release="Beta" />
  <src_pkg >

    <With_Flavors build="no" />
    <Source_Setup_Dependency PkgType="pgm" >
      <Setup_Dependency Name="globus_gram_job_manager_setup" >
        <Version >
          <Simple_Version Major="3" />
        </Version>
      </Setup_Dependency>
      <Setup_Dependency Name="globus_common_setup" >
        <Version >
          <Simple_Version Major="2" />
        </Version>
      </Setup_Dependency>
    </Source_Setup_Dependency>

    <Build_Environment >
      <cflags >@GPT_CFLAGS@</cflags>
      <external_includes >@GPT_EXTERNAL_INCLUDES@</external_includes>
      <pkg_libs > </pkg_libs>
      <external_libs >@GPT_EXTERNAL_LIBS@</external_libs>
    </Build_Environment>

    <Post_Install_Message >
      Run the setup-globus-job-manager-lsf setup script to configure an
      lsf job manager.
    </Post_Install_Message>

    <Post_Install_Program >
      setup-globus-job-manager-lsf
    </Post_Install_Program>

    <Setup Name="globus_gram_job_manager_service_setup" >
      <Aging_Version Age="0" Major="1" Minor="0" />
    </Setup>

  </src_pkg>

</gpt_package_metadata>
    

Automake Makefile.am

The automake file Makefile.am for this package is short because there isn't any compilation needed for this package. We just need to define what needs to be installed into which directory, and what source files need to be put into our source distribution. For the LSF package, we need to list the lsf.in, find-lsf-tools, and setup-globus-job-manager-lsf.pl scripts as files to be installed into the setup directory. We need to add those files plus our documentation source file to the EXTRA_DIST variable so that they will be included in source distributions. The rest of the lines in the file are needed for proper interaction with GPT.

include $(top_srcdir)/globus_automake_pre
include $(top_srcdir)/globus_automake_pre_top

SUBDIRS = pkgdata doxygen

setup_SCRIPTS = \
    lsf.in \
    find-lsf-tools \
    setup-globus-job-manager-lsf.pl

EXTRA_DIST = $(setup_SCRIPTS) globus_gram_job_manager_lsf.dox

include $(top_srcdir)/globus_automake_post
include $(top_srcdir)/globus_automake_post_top
    

Bootstrap

The final piece we need to write for our package is the bootstrap script. This script is the standard bootstrap script for a Globus package, with an extra line to generate the find-lsf-tools script using autoconf.

#!/bin/sh

# checking for the GLOBUS_LOCATION

if test "x$GLOBUS_LOCATION" = "x"; then
    echo "ERROR Please specify GLOBUS_LOCATION" >&2
    exit 1
fi

if [ ! -f ${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh ]; then
    echo "ERROR: Unable to locate \${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh"
    echo "       Please ensure that you have installed the globus-core package and"
    echo "       that GLOBUS_LOCATION is set to the proper directory"
    exit 1
fi

. ${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh

autoconf find-lsf-tools.in > find-lsf-tools
chmod 755 find-lsf-tools

exit 0
    

Changes In GT 4.0

Module Methods

The GT 4.0 WS-GRAM service only calls a subset of the Perl methods which were used by the pre-WS GRAM services. Most importantly for script implementers, the polling method is no longer used. Instead, the scheduler event generator monitors jobs and signals the service when job state changes occur. Staging is now done via the Reliable File Transfer service, so the file_stage_in and file_stage_out methods are no longer called. Schedulers typically did not implement the staging methods, so this shouldn't affect most scheduler modules.

That being said, scheduler implementers who would like their scheduler to work with both pre-WS GRAM and WS-GRAM should definitely implement the poll() method described in the pre-WS version of this tutorial.
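
For reference, a minimal poll method for LSF might look something like the sketch below. This is not the code from lsf.in; in particular, the bjobs output parsing is a simplified assumption, and a real implementation would distinguish more LSF states:

sub poll
{
    my $self = shift;
    my $description = $self->{JobDescription};
    my $job_id = $description->jobid();
    my $state;

    $self->log("polling job $job_id");

    # Ask LSF about the job; take the line of bjobs output describing it.
    my ($status_line) = grep(/^$job_id\s/, `$bjobs $job_id 2>/dev/null`);

    if(!defined($status_line))
    {
        # LSF no longer knows about the job; assume it has completed.
        $state = Globus::GRAM::JobState::DONE;
    }
    elsif($status_line =~ /PEND/)
    {
        $state = Globus::GRAM::JobState::PENDING;
    }
    elsif($status_line =~ /RUN/)
    {
        $state = Globus::GRAM::JobState::ACTIVE;
    }
    else
    {
        $state = Globus::GRAM::JobState::FAILED;
    }

    return { JOB_STATE => $state };
}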

GASS Cache

The GT-4.0 ws-GRAM service does not use the GASS cache for storing temporary files or for staging files.

Changes in GT 3.2

In GT 3.2, additional error-message context information was added. Scripts can optionally add one or more of these fields to the return hash from an operation to provide extra error information to the client (an illustrative example follows this list):

GT3_FAILURE_MESSAGE
Error message from underlying script processing indicating what caused a job request to fail
GT3_FAILURE_TYPE
One of filestagein, filestageout, filestageinshared, executable, or stdin, indicating which job request element caused a staging fault.
GT3_FAILURE_SOURCE
Source URL or file for a failed staging operation
GT3_FAILURE_DESTINATION
Destination URL or file for a failed staging operation
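
As an illustration only (this fragment does not appear in lsf.in, and the variables are hypothetical), a staging-related failure might be reported from a script operation like this:

    return {
        GT3_FAILURE_TYPE        => 'filestagein',
        GT3_FAILURE_SOURCE      => $source_url,
        GT3_FAILURE_DESTINATION => $destination_path,
        GT3_FAILURE_MESSAGE     => 'error staging in file: connection refused'
    };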