GT 4.0 Pre WS GRAM Approach

1. Introduction

The Globus Resource Allocation Manager (GRAM) is the lowest level of the Globus resource management architecture. GRAM allows you to run jobs remotely, providing an API for submitting, monitoring, and terminating jobs.

When a job is submitted, the request is sent to the gatekeeper of the remote computer. The gatekeeper handles the request and creates a job manager for the job. The job manager starts and monitors the remote program, communicating state changes back to the user on the local machine. When the remote application terminates, normally or by failing, the job manager terminates as well.

GRAM is responsible for

  • Parsing and processing the Resource Specification Language (RSL) specifications that outline job requests. The request specifies resource selection, job process creation, and job control. GRAM either denies the request or creates one or more processes (jobs) to satisfy it.

  • Enabling remote monitoring and managing of jobs already created.

The Resource Specification Language (RSL) is a structured language by which resource requirements and parameters can be outlined by a user.

To run a job remotely, a GRAM gatekeeper (server) must be running on the remote computer and listening on a port, and the application must be compiled on that remote machine. Execution begins when a GRAM user application runs on the local machine and sends a job request to the remote computer. The executable, stdin, and stdout, as well as the name and port of the remote computer, are specified as part of the job request. The gatekeeper handles the job request and creates a job manager for the new job. The job manager handles the execution of the job, as well as any communication with the user.
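As an illustration of what such a job request carries, the following Python sketch assembles an RSL string and a gatekeeper contact string. The helper names `build_rsl` and `make_contact` are hypothetical and are not part of any Globus API. The sketch assumes the RSL conjunction form `&(attribute=value)...`, quote-doubling for embedded quotes, and the conventional gatekeeper port 2119; check these against the RSL specification and your site configuration before relying on them.

```python
def quote(value):
    """Wrap a value in double quotes, doubling any embedded quote (assumed RSL rule)."""
    text = str(value)
    return '"' + text.replace('"', '""') + '"'


def build_rsl(attributes):
    """Render a dict of job attributes as one RSL conjunction string.

    A list or tuple value becomes a space-separated sequence of quoted literals.
    """
    relations = []
    for name, value in attributes.items():
        if isinstance(value, (list, tuple)):
            rendered = " ".join(quote(item) for item in value)
        else:
            rendered = quote(value)
        relations.append(f"({name}={rendered})")
    return "&" + "".join(relations)


def make_contact(host, port=2119, service="jobmanager"):
    """Build a 'host:port/service' contact string (port 2119 is an assumed default)."""
    return f"{host}:{port}/{service}"


# Hypothetical request: run /bin/echo remotely and capture its standard output.
request = build_rsl({
    "executable": "/bin/echo",
    "arguments": ["hello", "world"],
    "stdout": "out.txt",
})
contact = make_contact("remote.example.org")
print(contact)
print(request)
```

The sketch only builds strings; actually submitting the request requires a GRAM client and valid credentials.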

The architecture of GRAM is diagrammed below:

Resource

An entity capable of running one or more processes on behalf of a user.

Client

The process that is using the resource allocation client-side API.

Job

A process or set of processes resulting from a job request. Jobs are grouped, so any error in one job results in the mutual termination of all others in the group. If the job is killed by the client, all processes are terminated, and the job itself is finally terminated as well.

Job Request

A request to the gatekeeper to create one or more job processes, expressed in the supplied Resource Specification Language. This request guides

  • resource selection (when and where to create the job processes)

  • job process creation (what job processes to create)

  • job control (how the processes should execute)

2. Components

2.1. Gatekeeper

A process, running as root, which begins the process of handling allocation requests. It exists on the remote computer before any request is submitted. When the gatekeeper receives an allocation request from a client, it

  • mutually authenticates with the client,

  • maps the requestor to a local user,

  • starts a job manager on the local host as the local user, and

  • passes the allocation arguments to the newly created job manager.
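The mapping and hand-off steps above can be sketched in Python as follows. This is a simplified model, not the gatekeeper's implementation: it assumes the requestor's certificate subject has already been verified by mutual authentication, and it assumes a grid-mapfile layout of one quoted subject followed by a local account name per line. The function names `parse_gridmap` and `handle_request` are hypothetical.

```python
import shlex


def parse_gridmap(text):
    """Parse grid-mapfile style lines of the form: "certificate subject" localuser."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # honours the quotes around the subject
        if len(fields) < 2:
            continue  # malformed entry: ignore it in this sketch
        subject, accounts = fields[0], fields[1]
        # If several accounts are listed (comma-separated), take the first.
        mapping[subject] = accounts.split(",")[0]
    return mapping


def handle_request(subject, rsl, gridmap):
    """Map an (already authenticated) requestor to a local user and hand off the request."""
    local_user = gridmap.get(subject)
    if local_user is None:
        return {"accepted": False, "reason": "no local mapping for requestor"}
    # A real gatekeeper would now start a job manager as local_user;
    # here we only record what would be passed along.
    return {"accepted": True, "run_as": local_user, "job_manager_args": rsl}


gridmap = parse_gridmap('# example entry\n"/O=Grid/CN=Jane Doe" jdoe\n')
print(handle_request("/O=Grid/CN=Jane Doe", "&(executable=/bin/date)", gridmap))
```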

2.2. Job Manager

One job manager is created by the gatekeeper to fulfill every request submitted to the gatekeeper. It starts the job on the local system, and handles all further communication with the client. It is made up of two components:

  • Common Component - translates messages received from the gatekeeper and client into an internal API that is implemented by the machine-specific component. It also translates callback requests from the machine-specific component through the internal API into messages to the application manager.

  • Machine-Specific Component - implements the internal API in the local environment. This includes calls to the local system, messages to the resource monitor, and inquiries to the MDS.
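One way to picture this split is an abstract interface with a pluggable backend. The Python sketch below is purely illustrative: the class names, the message dictionary layout, and the three operations (`submit`, `poll`, `cancel`) are assumptions chosen for the example, and the in-memory backend stands in for a real local scheduler.

```python
from abc import ABC, abstractmethod


class MachineSpecific(ABC):
    """Internal API that each local environment must implement (hypothetical)."""

    @abstractmethod
    def submit(self, request):
        """Start the job locally and return a job identifier."""

    @abstractmethod
    def poll(self, job_id):
        """Return the current state name of the job."""

    @abstractmethod
    def cancel(self, job_id):
        """Stop the job."""


class InMemoryScheduler(MachineSpecific):
    """Stand-in backend that only records job states in a dict."""

    def __init__(self):
        self._jobs = {}

    def submit(self, request):
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "Pending"
        return job_id

    def poll(self, job_id):
        return self._jobs[job_id]

    def cancel(self, job_id):
        self._jobs[job_id] = "Failed"  # a cancelled job ends in the Failed state


class CommonComponent:
    """Translates incoming messages into calls on the internal API."""

    def __init__(self, backend):
        self._backend = backend

    def handle(self, message):
        kind = message["type"]
        if kind == "submit":
            return self._backend.submit(message["request"])
        if kind == "status":
            return self._backend.poll(message["job_id"])
        if kind == "cancel":
            self._backend.cancel(message["job_id"])
            return self._backend.poll(message["job_id"])
        raise ValueError(f"unknown message type: {kind}")


manager = CommonComponent(InMemoryScheduler())
job_id = manager.handle({"type": "submit", "request": {"executable": "/bin/date"}})
print(job_id, manager.handle({"type": "status", "job_id": job_id}))
```

Keeping the message translation separate from the backend means a different scheduler can be supported by writing only a new `MachineSpecific` subclass.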

3. Job States

GRAM supports the following scheduling model. A user or resource broker submits a job request, which initially registers as a pending job. The job then undergoes state changes according to this state diagram:

Unsubmitted

The job has not yet been submitted to the scheduler. A job state callback for this state is never sent; rather it was introduced for the case when the job manager is stopped and restarted before the job is submitted. This state was introduced in GRAM 1.5 (Globus 2.0).

StageIn

The job manager is staging executable, input, or data files to the job. Jobs which do not involve any staging will not enter this state. This state was introduced in GRAM 1.6.

Pending

The job has been submitted to the scheduler, but resources have not yet been allocated for the job.

Active

The job has received all of its resources, and the application is executing.

Suspended

The job has been stopped temporarily by the scheduler. Only some schedulers will cause a job to enter the Suspended state. This state was introduced in GRAM 1.5 (Globus 2.0).

StageOut

The job manager is staging output files from the job manager host to remote storage. Jobs which do not involve any staging will not enter this state. This state was introduced in GRAM 1.6.

Done

The job completed successfully.

Failed

The job terminated before completion, either because of an error or because it was cancelled by the user or the system.
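The states described above can be modelled as a small state machine. In the Python sketch below, the state names come from this section, but the transition table is an assumption inferred from the state descriptions rather than a copy of the state diagram, and the helper names are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    UNSUBMITTED = "Unsubmitted"
    STAGE_IN = "StageIn"
    PENDING = "Pending"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    STAGE_OUT = "StageOut"
    DONE = "Done"
    FAILED = "Failed"


# Assumed forward transitions, inferred from the state descriptions.
TRANSITIONS = {
    JobState.UNSUBMITTED: {JobState.STAGE_IN, JobState.PENDING},
    JobState.STAGE_IN: {JobState.PENDING},
    JobState.PENDING: {JobState.ACTIVE},
    JobState.ACTIVE: {JobState.SUSPENDED, JobState.STAGE_OUT, JobState.DONE},
    JobState.SUSPENDED: {JobState.ACTIVE},
    JobState.STAGE_OUT: {JobState.DONE},
    JobState.DONE: set(),
    JobState.FAILED: set(),
}


def is_terminal(state):
    """Done and Failed are the only states a job never leaves."""
    return state in (JobState.DONE, JobState.FAILED)


def can_transition(current, target):
    """Return True if the assumed model allows moving from current to target."""
    if is_terminal(current):
        return False
    if target is JobState.FAILED:
        return True  # an error or cancel can occur in any non-terminal state
    return target in TRANSITIONS[current]


print(can_transition(JobState.PENDING, JobState.ACTIVE))
print(can_transition(JobState.DONE, JobState.ACTIVE))
```

A client that receives state callbacks could use `is_terminal` to decide when to stop waiting for further updates.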

4. Audit

Table 1. Audit Logging Support

GRAM job auditing direct to DB

GRAM can be configured to write a job audit record to a file that is ready for uploading into a database. This can be useful for exposing and integrating GRAM job information with a Grid's existing accounting infrastructure. A case study for TeraGrid is available.

Local scheduler logging

For systems using a local batch scheduler, all of the accounting and logging facilities of that scheduler remain available for the administrator to track jobs whether submitted through GRAM or directly to the scheduler by local users.