Base Roll - An installation scalability fix. When multiple compute nodes are reinstalled simultaneously, there is a chance that some of them will not complete their install. If you plug a monitor into a compute node in this state (or connect to the compute node's virtual console with rocks-console), you'll see a message indicating that the compute node cannot download a specific package.
The fix is to aggressively retry when there is a package download failure.
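As an illustration only, the retry idea can be sketched as below. This is not the installer's actual code: the function name download_with_retry, the fetch callable, and the attempts and delay defaults are all assumptions made for the example.

```python
import time


def download_with_retry(fetch, url, attempts=10, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or the attempts run out.

    fetch is any callable that returns the package data or raises OSError
    on a failed download (hypothetical; the real installer is not shown
    in these notes).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

Passing sleep as a parameter keeps the sketch easy to exercise without real waiting.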
SGE Roll - A fix for the 'Job Queue' web page. The output of 'qstat -f' changed between SGE 6.0 and 6.1u4, which caused no queued jobs to be reported on the 'Job Queue' ganglia web page.
The fix is to change the way the ganglia metric gets job queue info from SGE (the old method was 'qstat -f -xml' and the new method is 'qstat -f -u \* -xml').
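For illustration, the two command lines can be built as argument lists as in the sketch below. How the ganglia metric actually invokes qstat is not shown in these notes, so the helper names qstat_argv and queue_xml are hypothetical. Note that the backslash in '\*' only protects the asterisk from the shell; when no shell is involved, a plain '*' is passed.

```python
import subprocess


def qstat_argv(all_users=True):
    """Return the qstat argument list: new method by default, old if False."""
    argv = ["qstat", "-f"]
    if all_users:
        # No shell is involved here, so the asterisk needs no backslash.
        argv += ["-u", "*"]
    argv.append("-xml")
    return argv


def queue_xml():
    """Run qstat and return its XML output (requires SGE to be installed)."""
    result = subprocess.run(
        qstat_argv(), capture_output=True, text=True, check=True
    )
    return result.stdout
```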
Base Roll - A 411 scalability fix. In larger clusters, 'rocks sync users' was not pushing all the user-related files (e.g., /etc/passwd and /etc/shadow) out to all the nodes. This happened because the 411 listener on each compute node was rebroadcasting the 411 alert that it received from the frontend. On a vanilla Rocks cluster, 9 files are pushed from the frontend to the cluster nodes when one executes 'rocks sync users'. On a 100-node cluster without this fix, that means each compute node would receive 900 411 alert messages!
This fix removes the rebroadcasting of the 411 alert messages, so now each compute node only receives 9 alerts, regardless of the cluster size.
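The arithmetic behind those numbers can be checked with a small sketch. The function name alerts_per_node is made up for this example, and the model simply follows the figures above (one alert per file, repeated once per node when rebroadcasting is on).

```python
def alerts_per_node(files, nodes, rebroadcast):
    """Estimate how many 411 alerts a single compute node receives."""
    if rebroadcast:
        # Every node repeats each alert, so each file's alert is heard
        # once per node in the cluster.
        return files * nodes
    # Without rebroadcasting, only the frontend's alerts arrive.
    return files


before = alerts_per_node(files=9, nodes=100, rebroadcast=True)   # 900
after = alerts_per_node(files=9, nodes=100, rebroadcast=False)   # 9
```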