Overview and Status of Current GT Performance Studies

Each entry below gives a high-level description of the study, a high-level summary of current findings, a summary of planned near-term work, the date the entry was last updated, and a link to detailed current and historical data.

WS GRAM, Study #1: MEJS throughput fork JT1 - details
  Description: Sustained job throughput achieved by the MEJS for the Fork scheduler, submitting a simple job (/bin/date): no delegation, no staging, no cleanup, no stdout/err streaming.
  Findings: Maximum throughput achieved was 77 jobs per minute.
  Planned work: Bottleneck investigation; delegation reuse improvement.
  Updated: 01/21/05
  Data: 3.9.4 results

WS GRAM, Study #2: pre-WS JM throughput fork JT1 - details
  Description: Sustained job throughput achieved by the pre-WS job manager for the Fork scheduler, submitting a simple job (/bin/date): no delegation, no staging, no cleanup, no stdout/err streaming.
  Findings: Maximum throughput achieved was nn jobs per minute.
  Updated: 12/15/04
  Data: need results

WS GRAM, Study #3: MEJS burst fork JT1 - details
  Description: Simultaneous job submissions to the same MEJS for the Fork scheduler, submitting a simple job: no delegation, no staging, no cleanup, no stdout/err streaming.
  Findings: n jobs of n total jobs processed successfully.
  Updated: 12/15/04
  Data: need results

WS GRAM, Study #4: MEJS max concurrency fork JT1 - details
  Description: Maximum job submissions to the same MEJS for the Fork scheduler, submitting a long-running sleep job: no delegation, no staging, no cleanup, no stdout/err streaming.
  Findings: Maximum concurrency achieved was 8,000 jobs with no failures, so the limit is still not known.
  Updated: 01/21/05
  Data: find current limit

WS GRAM, Study #4.1: MEJS max concurrency condor JT1 - details
  Description: Maximum job submissions to the same MEJS for the Condor scheduler, submitting a long-running sleep job: no delegation, no staging, no cleanup, no stdout/err streaming.
  Findings: Maximum concurrency achieved was 32,000 jobs, due to a system limitation (mkdir: cannot create directory `bar': Too many links).
  Updated: 04/06/05
  Data: 32,000 job limit

WS GRAM, Study #5: MEJS long run fork JT2 - details
  Description: Keep a moderate load (10 jobs?) on the service for a one-month duration. The job should perform all tasks (delegation, stage in, stage out, cleanup) in order to get the most GT service code coverage. Relevant to bug 2479.
  Findings: Number of jobs submitted; test duration of n hours or n days.
  Updated: 04/06/05
  Data: 23 days, 500,000+ sequential jobs submitted without container crash

WS GRAM, Study #n: no entry yet.

WS MDS, Long Running Index
  Description: Long-running Index Server (4.0.0), zero-entry index.
  Findings: Ran for 8,338,435 secs (96 days) until a server reboot; processed 623,395,877 requests.
  Updated: Oct 10, 2005
  Data: Perf slides available in a talk here

WS MDS, Long Running Index
  Description: Long-running Index Server (4.0.1), 100-entry index.
  Findings: Running for 2,860,028 seconds (33 days) as of Sept 30 and continuing; processed 14,686,638 queries.
  Planned work: Test ongoing.
  Updated: Oct 10, 2005
  Data: Perf slides available in a talk here

WS MDS, Index Scalability
  Description: Index Server (4.0.1) scalability with respect to the number of users.
  Findings: Throughput is stable at ~2,000 queries per minute for 10-800 users.
  Planned work: Ongoing.
  Updated: Oct 10, 2005
  Data: Perf slides available in a talk here

Java WS Core, Study #1: Core Messaging Performance
  Description: Timing a round trip of a WS GRAM job-creation-like message with and without resource dispatch. Timing is measured as a function of the number of typical sub-job descriptions in the input message.
  Updated: 02/21/2005
  Data: Results (no resource dispatch)

Java WS Core, Study #2: Core WSRF/WSN Operation Performance
  Description: Timing of various WSRF and WSN operations on a simple service.
  Updated: 02/21/2005
  Data: Details & Results

Java WS Core, Study #3: no entry yet.

C WS Core, Study #1: no entry yet.

GridFTP, Study #1: TeraGrid Bandwidth Study
  Findings: 90% utilization (27 Gb/s on a 30 Gb/s link) memory to memory with 32 nodes; 17.5 Gb/s disk to disk with 64 nodes, limited by the SAN.
  Planned work: Work with the SAN folks to improve disk-to-disk performance, but at low priority; our bandwidth performance is good enough for now.
  Updated: 2004-12-06
  Data: Excel Spreadsheet

GridFTP, Study #2: Long-running (stability) Test
  Description: Single client instance and single server instance with cached connections.
  Findings: Ran for about a week, slowly increasing memory usage and losing bandwidth (BW), then crashed. After a restart it has now been running for about two weeks; performance was still lost, but the BW appears to have stabilized. Memory usage still needs to be checked.
  Planned work: Looking into this.
  Updated: 2005-01-04
  Data: BW MRTG Graph (shows bandwidth)

GridFTP, Study #3: no entry yet.

RFT, Study #1: no entry yet.

General Description of the Tests

  • In all cases, be clear about which process you intend to test (for instance, the client or the service), and make sure that that process is the bottleneck and not something else. For example, if your intent is to test the load a client host can handle and you submit all the client requests against the same service, the service may well fail before the client; you may therefore need multiple services over which to distribute the load.
  • All tests should record the input parameters to the test.
  • We also need a way to either profile the process while it is running or do a postmortem.
  • Things such as CPU load and memory usage should be recorded if at all possible (see the sketch after this list).
  • On the C side, the time command can provide a whole range of statistics.
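
For the Java-side components, one possible way to record CPU load and memory usage alongside a test is the JDK's management beans (getSystemLoadAverage requires a Java 6 or later JVM). The sketch below is illustrative only; the one-minute sampling interval, the class name, and the plain stdout log format are assumptions, not part of any existing GT test harness.

    // Periodically record system load and JVM heap usage while a test runs,
    // so results can be correlated with host load afterwards.
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.OperatingSystemMXBean;

    public class ResourceLogger implements Runnable {
        private final long intervalMillis;

        public ResourceLogger(long intervalMillis) {
            this.intervalMillis = intervalMillis;
        }

        public void run() {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // One-minute system load average; -1.0 if the platform
                    // does not provide it.
                    double load = os.getSystemLoadAverage();
                    long heapUsed = mem.getHeapMemoryUsage().getUsed();
                    System.out.println(System.currentTimeMillis()
                            + " load=" + load + " heapUsed=" + heapUsed);
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException e) {
                // The test harness interrupted us; stop logging.
            }
        }

        public static void main(String[] args) {
            Thread logger = new Thread(new ResourceLogger(60000L));
            logger.setDaemon(true);   // do not keep the JVM alive just for logging
            logger.start();
            // ... run the actual test here; the logger samples once a minute ...
        }
    }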

Test 1: Max Concurrent Test

This test is designed to determine how many instances of a particular process a host can handle. The idea is to run long jobs that will not end during the duration of the test and to keep increasing their number until failure occurs. This test can be conducted using the throughput tester as long as the run time of a job is longer than the duration of the test. The results of this test should be the number at which failure occurred and the mode of the failure (container out of memory, connection refused, etc.).
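
A minimal sketch of the ramp-until-failure loop, using local /bin/sleep processes as a stand-in for real job submissions; in a real run, the hypothetical startLongJob() would call a GRAM client against the process under test, and the 86400-second sleep and 100-job reporting step are arbitrary choices for illustration.

    import java.util.ArrayList;
    import java.util.List;

    public class MaxConcurrentTest {

        // Stand-in "long job": sleeps far longer than the test will run.
        // Replace with a real submission against the process under test.
        static Process startLongJob() throws Exception {
            return new ProcessBuilder("/bin/sleep", "86400").start();
        }

        public static void main(String[] args) {
            List<Process> jobs = new ArrayList<Process>();
            try {
                while (true) {
                    jobs.add(startLongJob());     // keep increasing the number
                    if (jobs.size() % 100 == 0) {
                        System.out.println(jobs.size() + " jobs running");
                    }
                }
            } catch (Exception failure) {
                // Record both results: the number at which failure occurred
                // and the mode of the failure.
                System.out.println("Failed after " + jobs.size() + " jobs: " + failure);
            } finally {
                for (Process p : jobs) {
                    p.destroy();                  // clean up the sleepers
                }
            }
        }
    }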

Test 2: Max Load Test

This test is designed to simulate a heavily loaded client or server host. The difference between this and Test #1 is that jobs should complete and be re-submitted so that there is turnover. This is essentially the test that the GRAM threaded throughput tester and the Mats C client accomplish. The test ramps the number of jobs up to the specified load and then holds it there, starting a new job for every job that completes, for the duration of the test. This is iterated, increasing the load, until the tested process fails. The results from this test are the load at which the tested process fails, the actual time the test ran, and the desired test parameters.
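
The sketch below shows only the ramp-and-hold structure, assuming one submit-and-wait cycle can be wrapped in a runOneJob() call (faked here with a local /bin/date process). The load steps, the ten-minute hold, and the failure handling are assumptions for illustration, not the behavior of the actual GRAM throughput tester or the Mats C client.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class MaxLoadTest {

        // Stand-in for one submit-and-wait cycle against the tested process.
        static void runOneJob() throws Exception {
            new ProcessBuilder("/bin/date").start().waitFor();
        }

        // Hold `load` jobs in flight for `minutes`; return jobs completed.
        static long holdLoad(final int load, final long minutes) throws Exception {
            final long deadline = System.currentTimeMillis()
                    + TimeUnit.MINUTES.toMillis(minutes);
            final AtomicLong completed = new AtomicLong();
            ExecutorService workers = Executors.newFixedThreadPool(load);
            for (int i = 0; i < load; i++) {
                workers.execute(new Runnable() {
                    public void run() {
                        try {
                            while (System.currentTimeMillis() < deadline) {
                                runOneJob();       // start a new job per completion
                                completed.incrementAndGet();
                            }
                        } catch (Exception e) {
                            // The load at which this happens is the result.
                            System.out.println("Failure at load " + load + ": " + e);
                        }
                    }
                });
            }
            workers.shutdown();
            workers.awaitTermination(minutes + 5, TimeUnit.MINUTES);
            return completed.get();
        }

        public static void main(String[] args) throws Exception {
            for (int load = 10; ; load += 10) {    // iterate, increasing the load
                long done = holdLoad(load, 10);
                System.out.println("load=" + load + " jobs completed=" + done);
            }
        }
    }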

Test 3: Burst Test

This test is identical to Test #2 except that all of the job submissions should be synchronized so that they start as close to the same time as possible. This simulates a "job storm": a sudden spike where many jobs hit the service simultaneously. It may give results identical to Test #2; if so, we will discontinue one or the other of the two tests.
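
A minimal sketch of the synchronization only, assuming a hypothetical submitJob() stand-in (again a local /bin/date process) and an arbitrary burst size of 100; a CountDownLatch releases all submitters at once.

    import java.util.concurrent.CountDownLatch;

    public class BurstTest {

        // Stand-in for one job submission against the tested process.
        static void submitJob() throws Exception {
            new ProcessBuilder("/bin/date").start().waitFor();
        }

        public static void main(String[] args) throws Exception {
            final int burstSize = 100;             // assumed burst size
            final CountDownLatch startGun = new CountDownLatch(1);
            final CountDownLatch finished = new CountDownLatch(burstSize);

            for (int i = 0; i < burstSize; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            startGun.await();      // all submitters wait here
                            submitJob();           // then hit the service at once
                        } catch (Exception e) {
                            System.out.println("Burst submission failed: " + e);
                        } finally {
                            finished.countDown();
                        }
                    }
                }).start();
            }

            startGun.countDown();                  // release the "job storm"
            finished.await();
            System.out.println("Burst of " + burstSize + " jobs finished");
        }
    }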

Test 4: Robustness Test

This test is intended to keep the tested process alive and moderately busy for a long period of time. How long will vary from component to component, but initially we are thinking one week for a GRAM client and one month for a service. The throughput tester can be used for this as well, but the load is held constant at some moderate level rather than ramped until failure. Failure in this case would be caused by things such as memory leaks or other issues that slowly build up over time. The results of this test are the total hours the process stayed up and the number of jobs completed.
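
The sketch below illustrates only the bookkeeping for this test, with a single sequential submitter standing in for a constant moderate load (runOneJob() is the same hypothetical /bin/date stand-in used above, and the 10,000-job reporting interval is arbitrary).

    public class RobustnessTest {

        // Stand-in for one submit-and-wait cycle against the tested process.
        static void runOneJob() throws Exception {
            new ProcessBuilder("/bin/date").start().waitFor();
        }

        public static void main(String[] args) throws Exception {
            long start = System.currentTimeMillis();
            long completed = 0;
            while (true) {                         // run for weeks; stop by hand
                runOneJob();
                completed++;
                if (completed % 10000 == 0) {
                    long hours = (System.currentTimeMillis() - start) / 3600000L;
                    // The reportable results are exactly these two numbers:
                    // total hours up and total jobs completed.
                    System.out.println(hours + " hours up, "
                            + completed + " jobs completed");
                }
            }
        }
    }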