WS-GRAM Scheduler Interface Tutorial (Perl Module)
The Perl-language scheduler module provides the job submission and cancellation interface between the Managed Job Service and the underlying scheduler. Very little has been added to this part of the scheduler interface since GT 2. If you have a version for an older Globus Toolkit release, you can skip most of this tutorial and jump to the sections on changes since GT 2 at the end.
Perl Scheduler Module
The scheduler interface is implemented as a Perl module which is a subclass of the Globus::GRAM::JobManager module. Its name must match the scheduler type string used when the service is installed, but in all lower case: for the LSF scheduler, the module name is Globus::GRAM::JobManager::lsf and it is stored in the file lsf.pm. Though there are several methods in the Globus::GRAM::JobManager interface, the only ones which absolutely need to be implemented in a scheduler module are submit and cancel.
We'll begin by looking at the start of the lsf source module, lsf.in (the transformation to lsf.pm happens when a setup script is run). To begin the script, we import the GRAM support modules into the scheduler module's namespace, declare the module's namespace, and declare this module as a subclass of the Globus::GRAM::JobManager module. All scheduler packages will need to do this, substituting the name of the scheduler type being implemented where we see lsf below.
    use Globus::GRAM::Error;
    use Globus::GRAM::JobState;
    use Globus::GRAM::JobManager;
    use Globus::Core::Paths;
    ...
    package Globus::GRAM::JobManager::lsf;

    @ISA = qw(Globus::GRAM::JobManager);
Next, we declare any system-specific values which will be substituted when the setup package scripts are run. In the LSF case, we need to know the paths to a few programs which interact with the scheduler:
    my ($mpirun, $bsub, $bjobs, $bkill);

    BEGIN
    {
        $mpirun = '@MPIRUN@';
        $bsub = '@BSUB@';
        $bjobs = '@BJOBS@';
        $bkill = '@BKILL@';
    }
The values surrounded by @-signs (such as @MPIRUN@) will be replaced with the paths to the named programs by the find-lsf-tools script described below.
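To make the substitution step concrete, here is a minimal Perl sketch of @NAME@ replacement. In the real package the substitution is done by the autoconf-generated find-lsf-tools script; the substitute_placeholders helper and the sample path below are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each @NAME@ token with a value from %$values.
# Unknown tokens are left untouched so missing settings stay visible.
sub substitute_placeholders {
    my ($template, $values) = @_;
    $template =~ s/\@([A-Z_][A-Z0-9_]*)\@/exists $values->{$1} ? $values->{$1} : "\@$1\@"/ge;
    return $template;
}

# q{} does not interpolate, so the @ and $ characters stay literal.
my $template = q{$bsub = '@BSUB@'; $bkill = '@BKILL@';};
my %paths = (BSUB => '/usr/local/lsf/bin/bsub');   # example path only
print substitute_placeholders($template, \%paths), "\n";
```

Leaving unknown tokens in place is a design choice of this sketch: a leftover @BKILL@ in the generated module is easier to spot than an empty string.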
Writing a Constructor
Scheduler interfaces which need to set up some data before their other methods are called can override the new method, which acts as a constructor. Scheduler scripts which don't need any per-instance initialization do not need to provide a constructor; the default Globus::GRAM::JobManager::new constructor will do the job.
If you do need to override this method, be sure to call the parent module's constructor to allow it to do its initialization, as in this example:
    # Example-only implementation of the module's constructor. You won't find
    # this code in lsf.in
    sub new
    {
        my $proto = shift;
        my $class = ref($proto) || $proto;
        my $self = $class->SUPER::new(@_);

        ## Insert scheduler-specific startup code here
        $self->{foo} = 1;
        $self->{bar} = 'baz';
        ## End of scheduler-specific startup code

        return $self;
    }
The job interface methods are called with only one argument: the scheduler object itself. That object contains a Globus::GRAM::JobDescription object ($self->{JobDescription}) which includes the values from the RSL associated with the request, as well as a few extra values:
- job_id
- The string returned as the value of JOB_ID in the return hash from submit. This won't be present for methods called before the job is submitted.
- uniq_id
- A string associated with this job request by the job manager program. It will be unique for all jobs on a host for all time and might be useful in creating temporary files or scheduler-specific processing.
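The constructor pattern and the JobDescription member described above can be tried without the toolkit installed. The sketch below uses a stand-in parent class, My::FakeJobManager, in place of Globus::GRAM::JobManager, and a plain hash in place of the real Globus::GRAM::JobDescription object; both names and the submit_count field are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for Globus::GRAM::JobManager, defined here only so the sketch
# runs without the toolkit installed.
package My::FakeJobManager;
sub new {
    my ($class, $description) = @_;
    return bless { JobDescription => $description }, $class;
}

# Hypothetical scheduler module that adds per-instance state.
package My::FakeJobManager::demo;
our @ISA = ('My::FakeJobManager');
sub new {
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self  = $class->SUPER::new(@_);   # parent initialization first
    $self->{submit_count} = 0;            # scheduler-specific startup data
    return $self;
}

package main;
# A plain hash stands in for the job description object.
my $manager = My::FakeJobManager::demo->new({ uniq_id => 'abc123' });
print ref($manager), " uniq_id=", $manager->{JobDescription}{uniq_id}, "\n";
```

Calling the parent constructor first means the object already carries its JobDescription member by the time the scheduler-specific fields are added.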
Submitting Jobs
All scheduler modules must implement the submit method. This method is called when the job manager wishes to submit the job to the scheduler. The information in the original job request RSL string is available to the scheduler interface through the JobDescription data member of its hash.
For most schedulers, this is the longest method to be implemented, as it must decide what to do with the job description, and convert RSL elements to something which the scheduler can understand.
We'll look at some of the steps in the LSF manager code to see how the scheduler interface is implemented.
In the beginning of the submit method, we'll get our parameters and look up the job description in the manager-specific object:
    sub submit
    {
        my $self = shift;
        my $description = $self->{JobDescription};
Then we will check for values of the job parameters that we will be handling. For example, this is how we check for a valid job type in the LSF scheduler interface:
    if(defined($description->jobtype()))
    {
        if($description->jobtype() !~ /^(mpi|single|multiple)$/)
        {
            return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
        }
        elsif($description->jobtype() eq 'mpi' && $mpirun eq "no")
        {
            return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
        }
    }
The lsf module supports most of the core RSL elements, so it does quite a bit more processing to determine what to do with the values in the job description.
Once we've inspected the JobDescription we'll know what we need to tell the scheduler about so that it'll start the job properly. For LSF, we will construct a job description script and pass that to the bsub command. This script is a Bourne Shell script with some special comments which LSF uses to decide what constraints to use when scheduling the job.
First, we'll open the new file, and write the file header:
    local(*JOB);
    open(JOB, '>' . $lsf_job_script_name);
    print JOB <<"EOF";
    #! /bin/sh
    #
    # LSF batch job script built by Globus Job Manager
    #
    EOF
Then, we'll add some comments to pass job constraints to LSF, such as the queue and project names:
    if(defined($queue))
    {
        print JOB "#BSUB -q $queue\n";
    }
    if(defined($description->project()))
    {
        print JOB "#BSUB -P " . $description->project() . "\n";
    }
At the end of the job description script, we actually run the executable named in the JobDescription. For LSF, we support a few different job types which require different startup commands. Here, we quote and escape the strings in the argument list so that the values of the arguments will be identical to those in the initial job request string. For this Bourne-shell syntax script, we double-quote each argument and escape the backslash (\), dollar-sign ($), double-quote ("), and backquote (`) characters. We will use this new string later in the script.
    @arguments = $description->arguments();

    foreach(@arguments)
    {
        # Each argument must be a plain string, not a nested RSL list
        if(ref($_))
        {
            return Globus::GRAM::Error::RSL_ARGUMENTS;
        }
    }
    # $#arguments is the last index, so >= 0 means "at least one argument"
    if($#arguments >= 0)
    {
        foreach(@arguments)
        {
            $_ =~ s/\\/\\\\/g;
            $_ =~ s/\$/\\\$/g;
            $_ =~ s/"/\\\"/g;
            $_ =~ s/`/\\\`/g;

            $args .= '"' . $_ . '" ';
        }
    }
    else
    {
        $args = "";
    }
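The quoting rule can be checked in isolation. The following is a small standalone sketch, not code from lsf.in; the quote_for_sh helper name and the sample arguments are hypothetical. It applies the same four escapes in a single substitution.

```perl
use strict;
use warnings;

# Hypothetical helper: double-quote one argument for a Bourne shell script,
# escaping the four characters that stay special inside double quotes
# (backslash, dollar sign, double quote, backquote).
sub quote_for_sh {
    my ($arg) = @_;
    $arg =~ s/([\\\$"`])/\\$1/g;
    return '"' . $arg . '"';
}

my @arguments = ('plain', 'cost: $5', 'say "hi"');
my $args = join(' ', map { quote_for_sh($_) } @arguments);
print "$args\n";
```

Doing all four escapes in one pass avoids any question about ordering, since a backslash added for one character is never re-escaped by a later substitution.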
To end the LSF job description script, we will write the command line of the executable to the script. Depending on the job type of this submission, we will need to start either one or more instances of the executable, or the mpirun program which will start the job with the executable count in the JobDescription:
    if($description->jobtype() eq 'mpi')
    {
        print JOB "$mpirun -np ", $description->count(), ' ';
        print JOB $description->executable(), " $args \n";
    }
    elsif($description->jobtype() eq 'multiple')
    {
        for(my $i = 0; $i < $description->count(); $i++)
        {
            print JOB $description->executable(), " $args &\n";
        }
        print JOB "wait\n";
    }
    else
    {
        print JOB $description->executable(), " $args\n";
    }
Next, we submit the job to the scheduler. Be sure to close the script file before trying to redirect it into the submit command, or some of the script file may be buffered and things will fail in strange ways!
When the submission command returns, we check its output for the scheduler-specific job identifier. We will use this value to poll or cancel the job.
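As a rough illustration of that check, the sketch below extracts the job identifier from a sample of bsub output using the same two patterns the LSF module relies on. The parse_bsub_job_id helper is hypothetical, and the sample text is an assumed example of what bsub prints rather than captured output.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the job ID out of bsub's standard output.
# Returns undef when no "is submitted" line is present.
sub parse_bsub_job_id {
    my ($output) = @_;
    my ($line) = grep { /is submitted/ } split(/\n/, $output);
    return undef unless defined $line;
    return $line =~ /<([^>]*)>/ ? $1 : undef;
}

my $sample = "Job <1234> is submitted to queue <normal>.\n";   # sample text
my $job_id = parse_bsub_job_id($sample);
print "job id: $job_id\n";
```

Returning undef for unrecognized output lets the caller treat a missing identifier as a submission failure instead of carrying an empty job ID forward.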
The return value of the script should be either a GRAM error object or a reference to a hash of values. The Globus::GRAM::JobManager documentation lists the valid keys to that hash. For the submit method, we'll return the job identifier as the value of JOB_ID in the hash. If the scheduler returned a job status result, we could return that as well. LSF does not, so we'll return the PENDING state along with the job ID, or if the job fails, we'll return an error object. We'll include the standard error output from the scheduler submission program as the GT3_FAILURE_MESSAGE hash value. This failure message gives the user context about the failure.
    close(JOB);
    chmod 0755, $lsf_job_script_name;

    $lsf_job_err_name = $self->job_dir() . '/scheduler_bsub_stderr';
    $self->log("job err is at $lsf_job_err_name");
    $self->log("about to submit job");
    $self->nfssync( $lsf_job_script_name );
    $self->nfssync( $lsf_job_err_name );
    $job_id = (grep(/is submitted/,
                    split(/\n/, `$bsub < $lsf_job_script_name 2> $lsf_job_err_name`)))[0];

    if($? == 0)
    {
        $job_id =~ m/<([^>]*)>/;
        $job_id = $1;

        return {
            JOB_ID => $job_id,
            JOB_STATE => Globus::GRAM::JobState::PENDING
        };
    }
    else
    {
        $self->log("job submission failed, checking $lsf_job_err_name");
        my $stderr;
        local(*ERR);
        $self->nfssync( $lsf_job_err_name );
        open(ERR, "<$lsf_job_err_name");
        local $/;
        $stderr = <ERR>;
        close(ERR);

        open(ERR, '>' . $description->stderr());
        print ERR $stderr;
        close(ERR);

        $stderr =~ s/\n/ /g;

        $self->respond({ GT3_FAILURE_MESSAGE => $stderr });
    }
That finishes the submit method. Most of the functionality for the scheduler interface is now written. We just have one more (much shorter) method to implement.
Cancelling Jobs
All scheduler modules must also implement the cancel method. The purpose of this method is to cancel a scheduled job, whether it's already running or waiting in a queue.
This method will be given the job ID as part of the JobDescription object in the manager object. If the scheduler interface provides feedback that the job was cancelled successfully, then we can return a JOB_STATE change to the FAILED state. Otherwise we can return an empty hash reference, and let the scheduler event generator record the state change when it next occurs.
To process a cancel in the LSF case, we will run the bkill command with the job ID.
    sub cancel
    {
        my $self = shift;
        my $description = $self->{JobDescription};
        my $job_id = $description->jobid();

        $self->log("cancel job $job_id");

        system("$bkill $job_id >/dev/null 2>/dev/null");

        if($? == 0)
        {
            return { JOB_STATE => Globus::GRAM::JobState::FAILED };
        }
        return Globus::GRAM::Error::JOB_CANCEL_FAILED;
    }
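The decision in cancel rests entirely on the exit status of the external command. A minimal standalone sketch of that check follows; the command_succeeded helper is hypothetical, and the running perl binary ($^X) stands in for bkill so the example works on a machine without LSF.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command (list form, so no shell is involved)
# and report whether it exited with status 0.
sub command_succeeded {
    my @command = @_;
    system(@command);
    return $? == 0 ? 1 : 0;
}

# $^X is the running perl binary; it stands in for bkill here.
print "exit 0 -> ", command_succeeded($^X, '-e', 'exit 0'), "\n";
print "exit 3 -> ", command_succeeded($^X, '-e', 'exit 3'), "\n";
```

The list form of system differs from the tutorial's code, which uses a single string so the shell can redirect output to /dev/null. The list form is used in this sketch only because it avoids shell interpretation of the arguments.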
End of the script
All Perl modules must return a true value when they are loaded. To do this, make sure the last line of your module consists of:
1;
Scheduler Setup Package
Once we've written the job manager script, we need to get it installed so that the Managed Job Service will be able to access the scheduler. We do this by writing a setup script. For LSF, we will write the script setup-globus-job-manager-lsf.pl, which we will list in the LSF package as the Post_Install_Program.
To set up the Gatekeeper service, our LSF setup script does the following:
- Perform system-specific configuration.
- Install the GRAM scheduler Perl module (and register as a gatekeeper service for pre-WS GRAM compatibility).
- (Optional) Install an RSL validation file defining extra scheduler-specific RSL attributes which the scheduler interface will support (pre-WS GRAM only).
- Update the GPT metadata to indicate that the job manager service has been set up.
LSF Setup Script
The LSF setup script begins by checking the environment for the location where GPT and the Globus Toolkit are installed. Both of these are needed to successfully set up the LSF scheduler module.
    my $gpath = $ENV{GPT_LOCATION};

    if (!defined($gpath))
    {
        $gpath = $ENV{GLOBUS_LOCATION};
    }
    if (!defined($gpath))
    {
        die "GPT_LOCATION or GLOBUS_LOCATION needs to be set before running this script";
    }

    @INC = (@INC, "$gpath/lib/perl");
After this is some option-parsing code and default values of the options to handle system-specific changes to the scheduler name and whether the pre-WS GRAM service will validate the list of queues at RSL parsing time, or defer to the scheduler script's submit method's error handling.
    require Grid::GPT::Setup;
    use Getopt::Long;

    my $name = 'jobmanager-lsf';
    my $manager_type = 'lsf';
    my $cmd;
    my $validate_queues = 1;

    GetOptions('service-name|s=s' => \$name,
               'validate-queues=s' => \$validate_queues,
               'help|h' => \$help);

    &usage if $help;
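The option handling can be exercised on its own. The sketch below wraps the same option specification in a hypothetical parse_setup_options function and uses GetOptionsFromArray from Getopt::Long so that it reads an explicit argument list instead of @ARGV. The wrapper and the sample service name are assumptions for illustration and are not part of the LSF setup script.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical wrapper around the same option specification, taking the
# argument list explicitly so it can be exercised without touching @ARGV.
sub parse_setup_options {
    my @args = @_;
    my %opt = (name => 'jobmanager-lsf', validate_queues => 1, help => 0);
    GetOptionsFromArray(\@args,
        'service-name|s=s'  => \$opt{name},
        'validate-queues=s' => \$opt{validate_queues},
        'help|h'            => \$opt{help},
    ) or die "invalid options\n";
    return \%opt;
}

my $opt = parse_setup_options('--service-name', 'jobmanager-lsf-test');
print "service name: $opt->{name}\n";
```

Keeping the defaults in one hash makes it easy to see which values apply when the administrator passes no options at all.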
Next, the script constructs a metadata object for the setup package. This object will be used to write a GPT file which will indicate that the dependency described in the GRAM service is met. The meaning of the setup package dependency is: a scheduler script has been deployed into the perl path and a pre-WS GRAM service entry has been created for the scheduler. Note that the second part of this is not relevant to the WS-GRAM service, but is still required to keep the metadata meaningful. The package_name hash value must match the name of the setup package in its GPT metadata.
my $metadata = new Grid::GPT::Setup(package_name => "globus_gram_job_manager_setup_lsf");
System-specific Configuration
The system-specific configuration for lsf involves finding the paths to the LSF tools. This is done in the find-lsf-tools configure script described below. Note that this script is invoked with the argument --cache-file=/dev/null: this makes the script not cache values between invocations, so that if something in the system environment changes, rerunning the script will behave as expected.
    print `./find-lsf-tools --cache-file=/dev/null`;
    if($? != 0)
    {
        print STDERR "Error locating LSF commands, aborting!\n";
        exit 2;
    }
Registering as a Gatekeeper Service
Next, the setup script installs its perl module into the perl library directory and registers an entry in the Globus Gatekeeper's service directory. The program globus-job-manager-service (distributed in the job manager program setup package) performs both of these tasks. When run, it expects the scheduler perl module to be located in the $GLOBUS_LOCATION/setup/globus directory. If this is successful, we can write the setup package metadata to let GPT know that the package is configured properly.
    # Create service
    $cmd = "$libexecdir/globus-job-manager-service -add -m lsf -s \"$name\"";
    system("$cmd >/dev/null 2>/dev/null");
Installing an RSL Validation File (pre-WS GRAM only)
If the scheduler script implements RSL attributes which are not part of the core set supported by the job manager, it must publish them in the job manager's data directory. If the scheduler script wants to set default values for some RSL attributes, it may also set those in the validation file. The validation file is not used by the WS-GRAM service, and is only applicable to scripts which are used in both pre-WS GRAM and WS-GRAM contexts.
The format of the validation file is described in the RSL Validation Section of the pre-WS GRAM documentation. The validation file must be named scheduler-type.rvf and installed in the $GLOBUS_LOCATION/share/globus_gram_job_manager directory.
In the LSF setup script, we check the list of queues supported by the local LSF installation, and add a section of acceptable values for the queue RSL attribute:
    open(VALIDATION_FILE,
         ">$ENV{GLOBUS_LOCATION}/share/globus_gram_job_manager/lsf.rvf");

    # Customize validation file with queue info
    open(BQUEUES, "bqueues -w |");

    # discard header
    $_ = <BQUEUES>;
    my @queues = ();

    while(<BQUEUES>)
    {
        chomp;

        $_ =~ m/^(\S+)/;

        push(@queues, $1);
    }
    close(BQUEUES);

    if(@queues)
    {
        print VALIDATION_FILE "Attribute: queue\n";
        print VALIDATION_FILE join(" ", "Values:", @queues);
        print VALIDATION_FILE "\n";
    }
    close VALIDATION_FILE;
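The queue-parsing logic can be tested without an LSF installation by feeding it text instead of a pipe. The sketch below does that; the queue_validation_section helper is hypothetical, and the sample listing is an illustrative approximation of bqueues output, not captured from a real cluster.

```perl
use strict;
use warnings;

# Hypothetical helper: turn `bqueues -w` style output into the text of a
# validation-file section. Returns '' when no queues are listed.
sub queue_validation_section {
    my ($bqueues_output) = @_;
    my @lines = split /\n/, $bqueues_output;
    shift @lines;                                   # discard header line
    my @queues = map { /^(\S+)/ ? $1 : () } @lines; # first column only
    return '' unless @queues;
    return "Attribute: queue\n" . join(' ', 'Values:', @queues) . "\n";
}

# Illustrative sample, not captured from a real cluster.
my $sample = join "\n",
    'QUEUE_NAME  PRIO STATUS       MAX JL/U',
    'normal       30  Open:Active    -    -',
    'night        20  Open:Inact     -    -';
print queue_validation_section($sample);
```

Unlike the loop in the setup script, this version skips lines that do not start with a non-blank token, so a blank line cannot push a stale match onto the queue list.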
Updating GPT Metadata
Finally, the setup package should finalize the Grid::GPT::Setup object. If the finish() method fails, it is considered good practice to clean up any files created by the setup script. From setup-globus-job-manager-lsf.pl:
    $metadata->finish();
System-Specific Configuration Script
First, our scheduler setup script probes for any system-specific information needed to interface with the local scheduler. For example, the LSF scheduler uses the mpirun, bsub, bqueues, bjobs, and bkill commands to submit, poll, and cancel jobs. We'll assume that the administrator who is installing the package has these commands in their path. We'll use an autoconf script to locate the executable paths for these commands and substitute them into our scheduler Perl module. In the LSF package, we have the find-lsf-tools script, which is generated during bootstrap by autoconf from the find-lsf-tools.in file:
    ## Required Prolog
    AC_REVISION($Revision: 1.8 $)
    AC_INIT(lsf.in)

    # checking for the GLOBUS_LOCATION
    if test "x$GLOBUS_LOCATION" = "x"; then
        echo "ERROR Please specify GLOBUS_LOCATION" >&2
        exit 1
    fi
    ...
    ## Check for optional tools, warn if not found
    AC_PATH_PROG(MPIRUN, mpirun, no)
    if test "$MPIRUN" = "no" ; then
        AC_MSG_WARN([Cannot locate mpirun])
    fi
    ...
    ## Check for required tools, error if not found
    AC_PATH_PROG(BSUB, bsub, no)
    if test "$BSUB" = "no" ; then
        AC_MSG_ERROR([Cannot locate bsub])
    fi
    ...
    ## Required epilog - update scheduler specific module
    prefix='$(GLOBUS_LOCATION)'
    exec_prefix='$(GLOBUS_LOCATION)'
    libexecdir=${prefix}/libexec

    AC_OUTPUT(
        lsf.pm:lsf.in
    )
If this script exits with a non-zero error code, then the setup script propagates the error to the caller and exits without installing the service.
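For readers less familiar with autoconf, the path probe performed by AC_PATH_PROG can be approximated in Perl. The sketch below is an analogue for illustration only and is not part of the LSF package; the find_program helper is hypothetical, and it returns the string 'no' on failure to mirror the third macro argument used above.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper mirroring AC_PATH_PROG(VAR, program, no): return the
# first executable match on the given search path, or 'no' if none exists.
sub find_program {
    my ($program, @dirs) = @_;
    @dirs = File::Spec->path() unless @dirs;   # default to $PATH
    for my $dir (@dirs) {
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x _;
    }
    return 'no';
}

# Prints the full path on a host with LSF installed, otherwise 'no'.
print "bsub: ", find_program('bsub'), "\n";
```

A caller can then treat 'no' for a required tool as a fatal error and 'no' for an optional tool such as mpirun as a warning, which is the distinction the configure script draws.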
Packaging
Now that we've written a job manager scheduler interface, we'll package it using GPT to make it easy for our users to build and install. We'll start by gathering the different files we've written above into a single directory, lsf:
- lsf.in
- find-lsf-tools.in
- setup-globus-job-manager-lsf.pl
Package Documentation
If there are any scheduler-specific options defined for this scheduler module, or if there are any optional setup items, then it is good to provide a documentation page which describes these. For LSF, we describe the changes since the last version of this package in the file globus_gram_job_manager_lsf.dox. This file consists of a doxygen mainpage. See http://www.doxygen.org for information on how to write documentation with that tool.
configure.in
Now, we'll write our configure.in script. This file is converted to the configure shell script by the bootstrap script below. Since we don't do any probes for compile-time tools or system characteristics, we just call the various initialization macros used by GPT, declare that we may provide doxygen documentation, and then output the files we need substitutions done on.
    AC_REVISION($Revision: 1.8 $)
    AC_INIT(Makefile.am)

    GLOBUS_INIT
    AM_PROG_LIBTOOL

    dnl Initialize the automake rules the last argument
    AM_INIT_AUTOMAKE($GPT_NAME, $GPT_VERSION)

    LAC_DOXYGEN("../", "*.dox")

    GLOBUS_FINALIZE

    AC_OUTPUT(
        Makefile
        pkgdata/Makefile
        pkgdata/pkg_data_src.gpt
        doxygen/Doxyfile
        doxygen/Doxyfile-internal
        doxygen/Makefile
    )
Package Metadata
Now we'll write our metadata file, and put it in the pkgdata subdirectory of our package. The important things to note in this file are the package name and version, the post_install_program, and the setup sections. These define how the package distribution will be named, what command will be run by gpt-postinstall when this package is installed, and which setup dependencies will be written when the Grid::GPT::Setup object is finalized.
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE gpt_package_metadata SYSTEM "package.dtd">

    <gpt_package_metadata Format_Version="0.02" Name="globus_gram_job_manager_setup_lsf" >
      <Aging_Version Age="0" Major="1" Minor="0" />
      <Description >LSF Job Manager Setup</Description>
      <Functional_Group >ResourceManagement</Functional_Group>
      <Version_Stability Release="Beta" />
      <src_pkg >
        <With_Flavors build="no" />
        <Source_Setup_Dependency PkgType="pgm" >
          <Setup_Dependency Name="globus_gram_job_manager_setup" >
            <Version >
              <Simple_Version Major="3" />
            </Version>
          </Setup_Dependency>
          <Setup_Dependency Name="globus_common_setup" >
            <Version >
              <Simple_Version Major="2" />
            </Version>
          </Setup_Dependency>
        </Source_Setup_Dependency>
        <Build_Environment >
          <cflags >@GPT_CFLAGS@</cflags>
          <external_includes >@GPT_EXTERNAL_INCLUDES@</external_includes>
          <pkg_libs > </pkg_libs>
          <external_libs >@GPT_EXTERNAL_LIBS@</external_libs>
        </Build_Environment>
        <Post_Install_Message >
          Run the setup-globus-job-manager-lsf setup script to configure an
          lsf job manager.
        </Post_Install_Message>
        <Post_Install_Program >
          setup-globus-job-manager-lsf
        </Post_Install_Program>
        <Setup Name="globus_gram_job_manager_service_setup" >
          <Aging_Version Age="0" Major="1" Minor="0" />
        </Setup>
      </src_pkg>
    </gpt_package_metadata>
Automake Makefile.am
The automake file Makefile.am for this package is short because no compilation is needed. We just need to define what needs to be installed into which directory, and what source files need to be put into our source distribution. For the LSF package, we need to list the lsf.in, find-lsf-tools, and setup-globus-job-manager-lsf.pl scripts as files to be installed into the setup directory. We need to add those files plus our documentation source file to the EXTRA_DIST variable so that they will be included in source distributions. The rest of the lines in the file are needed for proper interaction with GPT.
    include $(top_srcdir)/globus_automake_pre
    include $(top_srcdir)/globus_automake_pre_top

    SUBDIRS = pkgdata doxygen

    setup_SCRIPTS = \
        lsf.in \
        find-lsf-tools \
        setup-globus-job-manager-lsf.pl

    EXTRA_DIST = $(setup_SCRIPTS) globus_gram_job_manager_lsf.dox

    include $(top_srcdir)/globus_automake_post
    include $(top_srcdir)/globus_automake_post_top
Bootstrap
The final piece we need to write for our package is the bootstrap script. This script is the standard bootstrap script for a globus package, with an extra line to generate the find-lsf-tools script using autoconf.
    #!/bin/sh

    # checking for the GLOBUS_LOCATION
    if test "x$GLOBUS_LOCATION" = "x"; then
        echo "ERROR Please specify GLOBUS_LOCATION" >&2
        exit 1
    fi

    if [ ! -f ${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh ]; then
        echo "ERROR: Unable to locate \${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh"
        echo "       Please ensure that you have installed the globus-core package and"
        echo "       that GLOBUS_LOCATION is set to the proper directory"
        # exit with a failure status so callers can detect the error
        exit 1
    fi

    . ${GLOBUS_LOCATION}/libexec/globus-bootstrap.sh

    autoconf find-lsf-tools.in > find-lsf-tools
    chmod 755 find-lsf-tools

    exit 0
Changes In GT 4.0
Module Methods
The GT 4.0 WS-GRAM service only calls a subset of the Perl methods which were used by the pre-WS GRAM services. Most importantly for script implementors, the polling method is no longer used. Instead, the scheduler-event-generator monitors jobs and signals the service when job state changes occur. Staging is now done via the Reliable File Transfer service, so the file_stage_in and file_stage_out methods are no longer called. Schedulers typically did not implement the staging methods, so this shouldn't affect most scheduler modules.
That being said, scheduler implementers who would like their scheduler to work with both pre-WS GRAM and WS-GRAM should still implement the poll() method described in the pre-WS version of this tutorial.
GASS Cache
The GT 4.0 WS-GRAM service does not use the GASS cache for storing temporary files or for staging files.
Changes in GT 3.2
In GT 3.2, additional error message context info was added. Scripts can optionally add one of these fields to the return hash from an operation to provide extra error information to the client:
- GT3_FAILURE_MESSAGE
- Error message from underlying script processing indicating what caused a job request to fail
- GT3_FAILURE_TYPE
- One of filestagein, filestageout, filestageinshared, executable, or stdin, indicating what job request element caused a staging fault.
- GT3_FAILURE_SOURCE
- Source URL or file for a failed staging operation
- GT3_FAILURE_DESTINATION
- Destination URL or file for a failed staging operation
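As a minimal sketch of how these fields might be assembled, the helper below builds the error-context part of a return hash and omits any field the caller did not supply. The failure_context function, its argument names, and the sample URL are hypothetical; only the four GT3_FAILURE_* key names come from the list above.

```perl
use strict;
use warnings;

# Maps the hypothetical argument names used below to the GT 3.2 hash keys.
my %field_for = (
    message     => 'GT3_FAILURE_MESSAGE',
    type        => 'GT3_FAILURE_TYPE',
    source      => 'GT3_FAILURE_SOURCE',
    destination => 'GT3_FAILURE_DESTINATION',
);

# Hypothetical helper: build the optional error-context part of a return
# hash, skipping any field the caller did not supply.
sub failure_context {
    my (%info) = @_;
    my %reply;
    for my $name (keys %field_for) {
        $reply{ $field_for{$name} } = $info{$name} if defined $info{$name};
    }
    return \%reply;
}

my $reply = failure_context(
    message => 'stage in failed',
    type    => 'filestagein',
    source  => 'gsiftp://example.org/input.dat',   # illustrative URL
);
print "$_ => $reply->{$_}\n" for sort keys %$reply;
```

Since the fields are optional, leaving out unknown values keeps the reply free of empty entries that a client might misread as real context.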