PPCL Beowulf Cluster Test Report

Liao Qi

Institute of Chemistry, Chinese Academy of Sciences

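We currently have two Beowulf clusters for molecular simulation. The first, code-named "ding", is a teaching cluster built in October 2002 from 8 AMD Athlon MP 2000+ CPUs (clock frequency 1666 MHz) connected by 100 Mbit Ethernet. The second, code-named "m", was built last month for simulating multiphase, multicomponent polymer systems; it has 46 Intel Xeon 2.4 GHz CPUs connected by 1000 Mbit Ethernet.

To make performance comparisons with other Beowulf clusters easier, we recently tested both systems with the widely used Linpack package. The complete test procedure and the results are reported below.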

1. Installing HPL

I downloaded the HPL package from

http://www.netlib.org/benchmark/hpl/

The HPL FAQ is at:

http://www.netlib.org/benchmark/hpl/faqs.html

The Linpack FAQ is at: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html#_Toc27885730

This HPL release is version 1.0. I have seen others use version 1.1, but I could not find it online.

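Compiling HPL correctly requires an MPI library and a BLAS library. Because RedHat Linux ships with LAM-MPI, I did not use any other MPI library. For BLAS I used ATLAS, which can be downloaded from http://www.netlib.org/atlas/ (the current stable version is 3.4.2). I chose it because it is free and tunes itself to the machine it is built on.

Installing ATLAS and HPL does not require root privileges; an ordinary user can install both.

Install the ATLAS library first. The procedure is described in the README and INSTALL files found in the unpacked archive. I had no difficulty with this step. The installation is interactive and mostly asks questions about the machine configuration. When it finishes, it creates a machine-specific directory under lib that holds the optimized libraries. Finally, a test command lets you check that the installation succeeded.

HPL can then be installed. After unpacking, you have to edit a make file that matches your own machine configuration. A suitable file can be obtained by modifying one of the samples in the setup directory.

This is where I ran into some trouble, mainly on cluster "m". Because OSCAR replaces RedHat's LAM-MPI version 6.5 with version 7.0, the path to the MPI library had to be changed. In addition, because "m" was built with multithreading, the pthread library had to be linked.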

There were no other major problems. I only had to customize the remaining paths and add the following compiler options:

-march=pentium4 -mfpmath=sse -msse2     ("m")

-march=athlon-mp -mfpmath=sse -msse     ("ding")

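After a successful build, a machine-specific directory is created under bin, and the compiled executable is placed there. The same directory contains a sample HPL.dat file that can be used for a simple test. This is also the file that has to be tuned carefully in the next step, when the machine is tested in full.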

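2. Tuning the test parameters

The meaning of the parameters in HPL.dat and the guidelines for setting them are described at: http://www.netlib.org/benchmark/hpl/tuning.html

The problem size N needs some explanation.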

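In principle, N should be the largest value that memory allows. If the number of CPUs is W and each CPU has M bytes of memory, then

$N \le \sqrt{W M / 8}$, where 8 is the number of bytes occupied by a double-precision floating-point number (64 bit).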

The upper limit of N is therefore 21166 on ding and 75828 on m.

In practice we used 19000 and 75000 respectively.
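The bound on N can be checked with a few lines of Python. This is only a sketch: the report does not state the memory sizes, so the values below (7 CPUs with 512×10^6 bytes each on ding, 46 CPUs with 10^9 bytes each on m) are assumptions chosen because they reproduce the limits quoted above. The helper names are illustrative.

```python
import math

BYTES_PER_DOUBLE = 8  # one 64-bit double-precision matrix element


def max_problem_size(n_cpus, mem_per_cpu_bytes):
    """Largest N such that an N x N matrix of doubles fits in total memory."""
    return math.isqrt(n_cpus * mem_per_cpu_bytes // BYTES_PER_DOUBLE)


def memory_fraction(n, n_cpus, mem_per_cpu_bytes):
    """Fraction of total memory taken by the N x N coefficient matrix."""
    return BYTES_PER_DOUBLE * n * n / (n_cpus * mem_per_cpu_bytes)


# Assumed (not stated in the report): memory per CPU in bytes.
DING = (7, 512 * 10**6)
M = (46, 10**9)

print(max_problem_size(*DING))   # upper limit of N on ding
print(max_problem_size(*M))      # upper limit of N on m
print(f"{memory_fraction(19000, *DING):.2f}")
print(f"{memory_fraction(75000, *M):.2f}")
```

Under these assumptions the chosen N = 19000 fills roughly 80% of the memory on ding, and N = 75000 fills roughly 98% on m, which leaves little room for the operating system on the latter.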

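Regarding the process grid (P x Q):

In principle, P and Q should be as close to each other as possible. However, 46 can only be factored as 1×46 or 2×23, so system m is in an awkward position for this parameter, and the test results suffer somewhat as a consequence.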
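The factorization argument is easy to reproduce. The sketch below lists every P x Q grid with P <= Q for a given process count and picks the one closest to square; the function names are illustrative and not part of HPL.

```python
import math


def factor_pairs(n):
    """All (P, Q) with P * Q == n and P <= Q, in order of increasing P."""
    return [(p, n // p) for p in range(1, math.isqrt(n) + 1) if n % p == 0]


def most_square_grid(n):
    """The pair with the largest P, i.e. with P and Q as close as possible."""
    return factor_pairs(n)[-1]


print(factor_pairs(46))      # the only choices for m
print(most_square_grid(46))
print(most_square_grid(48))  # a nearby process count, for comparison
```

For comparison, 48 processes would allow a 6 x 8 grid, whereas 46 processes allow nothing squarer than 2 x 23.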

The HPL.dat file we actually used on ding is as follows:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
19000        Ns
1            # of NBs
84           NBs
1            # of process grids (P x Q)
1            Ps
7            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
84           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The HPL.dat file we actually used on "m" is as follows:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
75000        Ns
1            # of NBs
64           NBs
1            # of process grids (P x Q)
2            Ps
23           Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

3. Test results

The meaning of the test output is explained at: http://www.netlib.org/benchmark/hpl/tuning.html

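The theoretical peak Gflops of a 32-bit x86 processor is computed as

2 × clock frequency × number of CPUs, because two floating-point operations can be executed per clock cycle.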
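As a small worked example, the peak formula can be written as a one-line function. The function name and the default of two floating-point operations per cycle are illustrative; the inputs for ding (7 CPUs at 1.666 GHz) are taken from this report.

```python
def peak_gflops(clock_ghz, n_cpus, flops_per_cycle=2):
    """Theoretical peak in Gflops: flops per cycle x clock (GHz) x CPU count."""
    return flops_per_cycle * clock_ghz * n_cpus


# ding: 7 CPUs at 1.666 GHz were used in the test.
print(f"{peak_gflops(1.666, 7):.1f}")  # prints 23.3
```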

The test results are summarized in Table 1:

Table 1. Linpack Performance of Cluster ding and m

Year	Computer	Number of Processors	Measured Gflops	Size of Problem	Size of 1/2 Perf	Theoretical Peak Gflops
2002	Ding	7	7.768	19000		23.3
2003	M	46	96.04	75000		220.4

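For comparison, Tables 2 and 3 list the Linpack results of the current top 5 machines in the world and in China. The new China top 50 list is due to be published tomorrow at http://www.samss.org.cn/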

Table 2. Linpack Performance of Top 5 (June 2003, http://www.top500.org/dlist/2003/06/)

Rank	Computer	Number of Processors	Measured Gflops	Size of Problem	Size of 1/2 Perf	Theoretical Peak Gflops
1	NEC earth-simulator	5120	35860	1075200	266240	40960
2	HP ASCI-Q AlphaServer	8192	13880	633000	225000	20480
3	Linux Networx Xeon 2.4G	2304	7634	350000	75000	12288
4	IBM SP Power3 375	8192	7304			12288
5	IBM SP Power3 375/16way	6656	7304			9984

Table 3. Linpack Performance of Top 5 in China (June 2003, http://www.redcluster.net/modules.php?name=Sections&op=viewarticle&artid=9)

Rank	Computer	Number of Processors	Measured Gflops	Size of Problem	Size of 1/2 Perf	Theoretical Peak Gflops
1	Legend Xeon 2.0GHz Myrinet	512	1046	153600	49920	2048
2	Legend Xeon 2.4GHz Myrinet	256	711.7	106304	29952	1228
3	HP SuperDome 875MHz	192	408			672
4	IBM Xeon 2.4GHz Gig-E	256	402.5			1228
5	Legend Xeon 2.4GHz Giganet	256	385.4	154856	60000	1228

The raw test output for ding is as follows:

============================================================================
HPLinpack 1.0 -- High-Performance Linpack benchmark -- September 27, 2000
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 19000
NB     : 84
P      : 1
Q      : 7
PFACT  : Crout
NBMIN  : 4
NDIV   : 2
RFACT  : Right
BCAST  : 1ringM
DEPTH  : 1
SWAP   : Mix (threshold = 84)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
   2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V          N    NB    P    Q       Time      Gflops
----------------------------------------------------------------------------
W11R2C4  19000    84    1    7     598.40   7.642e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0354046 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0148153 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0028482 ...... PASSED
============================================================================

The raw test output for "m" is as follows:

============================================================================
HPLinpack 1.0 -- High-Performance Linpack benchmark -- September 27, 2000
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 75000
NB     : 64
P      : 2
Q      : 23
PFACT  : Crout
NBMIN  : 4
NDIV   : 2
RFACT  : Right
BCAST  : 1ring
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
   2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V          N    NB    P    Q       Time      Gflops
----------------------------------------------------------------------------
W00R2C4  75000    64    2   23    2984.28   9.425e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0033867 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0069262 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0012201 ...... PASSED
============================================================================
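The Gflops column in the two raw outputs can be cross-checked from N and Time. The sketch below assumes the usual operation count for solving a dense linear system by LU factorization, (2/3)N^3 + (3/2)N^2, which is the count HPL uses when reporting its rate; the function name is illustrative.

```python
def hpl_gflops(n, seconds):
    """Rate in Gflops for solving an order-n dense system in `seconds`.

    Assumes the operation count (2/3) n^3 + (3/2) n^2.
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# N and Time copied from the two result lines above.
print(f"ding: {hpl_gflops(19000, 598.40):.3f}")
print(f"m:    {hpl_gflops(75000, 2984.28):.2f}")
```

Both values agree with the Gflops figures printed in the raw outputs (7.642 and 94.25).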

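4. Open questions and conclusions

1. The measured speed of "m" is less than half of its peak. Apart from the unfavorable process grid (2x23), are there other factors?

The measured speed of "m" has already reached 96 billion floating-point operations per second (96 Gflops). Because of the inherent limitations of cluster architectures, measured LINPACK figures rarely exceed half of the theoretical peak. There is still room for improvement, however, for example by choosing more suitable P, Q and NB parameters, which should bring the system to 100 billion floating-point operations per second (100 Gflops).

Judged by Rmax/Rpeak alone, this system reaches 0.44 and is far better than the Legend and IBM systems that are likewise built on a gigabit interconnect (Giganet and Gig-E in Table 3), both of which have an Rmax/Rpeak below 1/3. This ratio does of course depend on the number of interconnected CPUs, but "m" is clearly well balanced in terms of system size and performance. It is a good choice for us at present and gives our future molecular simulation work a cost-effective platform.

2. The measured speed of "ding" is only 1/3 of its peak. Is this caused by network latency or by some other deficiency of the system?

The LINPACK performance of this system is disappointing. It is a low-cost experimental platform built from the hardware level up, and we will rebuild it thoroughly in the future.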
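The Rmax/Rpeak ratios discussed in this section can be recomputed directly from the Measured Gflops and Theoretical Peak Gflops columns of Tables 1 and 3. The following is a minimal sketch; the dictionary layout and function name are illustrative.

```python
# (Rmax, Rpeak) in Gflops, copied from Tables 1 and 3.
SYSTEMS = {
    "m": (96.04, 220.4),
    "ding": (7.768, 23.3),
    "IBM Xeon 2.4GHz Gig-E": (402.5, 1228.0),
    "Legend Xeon 2.4GHz Giganet": (385.4, 1228.0),
}


def efficiency(rmax, rpeak):
    """Fraction of the theoretical peak achieved in the Linpack run."""
    return rmax / rpeak


for name, (rmax, rpeak) in SYSTEMS.items():
    print(f"{name:28s} {efficiency(rmax, rpeak):.2f}")
```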

2003.11.7

(Please credit the source when reposting.)
