# GPU vs Xeon Phi: Performance of Bandwidth Bound Applications with a Lattice QCD Case Study



### Mathias Wagner

GTC 2015 | Mathias Wagner | Indiana University |

### **INDIANA UNIVERSITY**



Y

# Lattice Quantum ChromoDynamics

and Deep Learning ...

GTC 2015 | Mathias Wagner | Indiana University |



### ... sorry, not (yet?) here.







# Lattice QCD: Some Basics Formulating Lattice QCD

- QCD partition function
- 4 dimensional grid (=Lattice)
  - quarks live on lattice sites
  - gluons live on the links
- •typical sizes:  $24^3 \times 6$  to  $256^4$ 
  - parallelization over lattice sites ( $10^5$  to  $10^9$ )







- Krylov space inversion of fermion matrix dominates runtime
- within inversion application of sparse Matrix (Dslash) dominates (>80%)





- Krylov space inversion of fermion matrix dominates runtime
- within inversion application of sparse Matrix (Dslash) dominates (>80%)
- Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil  $w_x = D_{x,x'}v_{x'} = \sum_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + \right]$

$$+\left\{N_{x,\mu}v_{x+3\hat{\mu}} - N_{x-3\hat{\mu},\mu}^{\dagger}v_{x-3\hat{\mu}}\right\}\right]$$







- Krylov space inversion of fermion matrix dominates runtime
- within inversion application of sparse Matrix (Dslash) dominates (>80%)
- Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil  $w_x = D_{x,x'}v_{x'} = \sum_{\mu=0} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu}^{\dagger} + U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu}^{\dagger} + U_{\mu}^{\dagger}v_{x-\hat{\mu}} + U_{\mu}^{\dagger}v_{x$ complex 3x3 matrix complex 3-dim vector 72 byte for fp32 56 byte for fp32 24 byte for fp32

$$+\left\{N_{x,\mu}v_{x+3\hat{\mu}} - N_{x-3\hat{\mu},\mu}^{\dagger}v_{x-3\hat{\mu}}\right\}\right]$$

complex 3x3 matrix + U(3) symmetry







- Krylov space inversion of fermion matrix dominates runtime
- within inversion application of sparse Matrix (Dslash) dominates (>80%)
- Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil  $w_{x} = D_{x,x'}v_{x'} = \sum_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \right] + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} + U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu}v_{x+\hat{\mu}}v_{x+\hat{\mu}} + U_{\mu}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}} + U_{\mu}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{\mu}}v_{x+\hat{$ complex 3x3 matrix complex 3-dim vector 56 byte for fp32 72 byte for fp32 24 byte for fp32
- performs 1146 flop: arithmetic intensity: 0.8 flop/byte

$$+\left\{N_{x,\mu}v_{x+3\hat{\mu}} - N_{x-3\hat{\mu},\mu}^{\dagger}v_{x-3\hat{\mu}}\right\}$$



complex 3x3 matrix + U(3) symmetry

• each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site





- Krylov space inversion of fermion matrix dominates runtime
- within inversion application of sparse Matrix (Dslash) dominates (>80%)
- Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil  $w_x = D_{x,x'}v_{x'} = \sum_{\mu=0}^{3} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \right] + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{x-\hat{\mu},\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}^{\dagger}v_{x-\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} - U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu=0}^{\dagger} \left[ \left\{ U_{x,\mu}v_{x+\hat{\mu}} + U_{\mu}v_{x+\hat{\mu}} \right\} + U_{\mu}v_{x+\hat{\mu}} + U_{\mu}v_{x+\hat{\mu}} \right] + U_{\mu}v_{\mu}v_{x+\hat{\mu}} + U_{\mu}v_{x+\hat{\mu}} + U_{\mu}v_{$ complex 3x3 matrix complex 3-dim vector 72 byte for fp32 56 byte for fp32 24 byte for fp32
- performs 1146 flop: arithmetic intensity: 0.8 flop/byte

$$+\left\{N_{x,\mu}v_{x+3\hat{\mu}} - N_{x-3\hat{\mu},\mu}^{\dagger}v_{x-3\hat{\mu}}\right\}$$



complex 3x3 matrix + U(3) symmetry

• each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site

sensitive to memory bandwidth





## Accelerators

Sorry, not the ones with liquid helium cooling and TDP > 300W.









Cores / SMX

Vector instructions

CUDA cores / SMX

Clock Speed [MHz]

peak fp32 [TFlop/s]

peak fp64 [TFlop/s]

Memory [GB]

Memory Bandwidth [GB/s]

L1 Cache [kB] / (Core/SM)

L2 Cache [MB]

TDP [W]





|     | 5110          | 7120            | K20  | K20X          | K40     |
|-----|---------------|-----------------|------|---------------|---------|
|     | 60            | 61              | 13   | 14            | 15      |
|     | 512 bit       | (16 fp32)       |      |               |         |
|     |               |                 |      | 192           |         |
|     | 1053          | 1238 - 1333     | 705  | 732           | 745-875 |
|     | 2.02          | 2.42            | 3.52 | 3.91          | 4.29    |
|     | 1.01          | 1.21            | 1.27 | 1.31          | 1.43    |
|     | 8             | 8               | 5    | 6             | 12      |
| \$] | 320           | 352             | 208  | 250           | 288     |
| X)  | C             | 32              | 16-  | 48 + 48 (Text | ure)    |
|     | 30 (60 x 0.5) | 30.5 (61 x 0.5) |      | 1.5           |         |
|     | 225           | 300             | 225  | 235           | 235     |







### How can we achieve this performance?



L2 Cache [MB]

TDP [W]





|     | 5110          | 7120            | K20  | K20X          | K40     |
|-----|---------------|-----------------|------|---------------|---------|
|     | 60            | 61              | 13   | 14            | 15      |
|     | 512 bit       | (16 fp32)       |      |               |         |
|     |               |                 |      | 192           |         |
|     | 1053          | 1238 - 1333     | 705  | 732           | 745-875 |
|     | 2.02          | 2.42            | 3.52 | 3.91          | 4.29    |
|     | 1.01          | 1.21            | 1.27 | 1.31          | 1.43    |
|     | 8             | 8               | 5    | 6             | 12      |
| \$] | 320           | 352             | 208  | 250           | 288     |
| X)  | C             | 32              | 16-  | 48 + 48 (Text | ure)    |
|     | 30 (60 x 0.5) | 30.5 (61 x 0.5) |      | 1.5           |         |
|     | 225           | 300             | 225  | 235           | 235     |













|     | 5110          | 7120            | K20  | K20X          | K40     |
|-----|---------------|-----------------|------|---------------|---------|
|     | 60            | 61              | 13   | 14            | 15      |
|     | 512 bit       | (16 fp32)       |      |               |         |
|     |               |                 |      | 192           |         |
|     | 1053          | 1238 - 1333     | 705  | 732           | 745-875 |
|     | 2.02          | 2.42            | 3.52 | 3.91          | 4.29    |
|     | 1.01          | 1.21            | 1.27 | 1.31          | 1.43    |
|     | 8             | 8               | 5    | 6             | 12      |
| \$] | 320           | 352             | 208  | 250           | 288     |
| X)  | C             | 32              | 16-  | 48 + 48 (Text | ure)    |
|     | 30 (60 x 0.5) | 30.5 (61 x 0.5) |      | 1.5           |         |
|     | 225           | 300             | 225  | 235           | 235     |













|     | 5110          | 7120            | K20  | K20X          | K40     |
|-----|---------------|-----------------|------|---------------|---------|
|     | 60            | 61              | 13   | 14            | 15      |
|     | 512 bit       | (16 fp32)       |      |               |         |
|     |               |                 |      | 192           |         |
|     | 1053          | 1238 - 1333     | 705  | 732           | 745-875 |
|     | 2.02          | 2.42            | 3.52 | 3.91          | 4.29    |
|     | 1.01          | 1.21            | 1.27 | 1.31          | 1.43    |
|     | 8             | 8               | 5    | 6             | 12      |
| \$] | 320           | 352             | 208  | 250           | 288     |
| X)  | C             | 32              | 16-  | 48 + 48 (Text | ure)    |
|     | 30 (60 x 0.5) | 30.5 (61 x 0.5) |      | 1.5           |         |
|     | 225           | 300             | 225  | 235           | 235     |







## Setting the bar

What performance can we expect on the different accelerators? Is our code optimized?







• naive model: bandwidth times arithmetic intensity







- naive model: bandwidth times arithmetic intensity
- better use STREAM triad bandwidth



GTC 2015 | Mathias Wagner | Indiana University |







- naive model: bandwidth times arithmetic intensity
- better use STREAM triad bandwidth



GTC 2015 | Mathias Wagner | Indiana University |









- naive model: bandwidth times arithmetic intensity
- better use STREAM triad bandwidth
- faster than estimate from triad bandwidth









- naive model: bandwidth times arithmetic intensity
- better use STREAM triad bandwidth
- faster than estimate from triad bandwidth

account for existence of cache in estimate of performance









• for upper limit: assume cache hits are free bytes / site: 1024 x (1-hitrate) 384 + 24





• for upper limit: assume cache hits are free *bytes / site: 1024 x (1-hitrate) 384 + 24* 





![](_page_20_Picture_6.jpeg)

![](_page_20_Figure_7.jpeg)

![](_page_20_Picture_8.jpeg)

• for upper limit: assume cache hits are free *bytes / site: 1024 x (1-hitrate) 384 + 24* 

![](_page_21_Figure_2.jpeg)

 $\rightarrow$  arithmetic intensity 1.07 (w/o cache 0.80)

![](_page_21_Figure_6.jpeg)

![](_page_21_Picture_7.jpeg)

• for upper limit: assume cache hits are free *bytes / site: 1024 x (1-hitrate) 384 + 24* 

![](_page_22_Figure_2.jpeg)

 $\rightarrow$  arithmetic intensity 1.07 (w/o cache 0.80)

- •Kepler: 1.5MB L2+ (16-48) kB L1 / SMX [15 SMX]

![](_page_22_Picture_8.jpeg)

![](_page_22_Figure_9.jpeg)

![](_page_22_Picture_10.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)

![](_page_23_Figure_4.jpeg)

![](_page_23_Picture_5.jpeg)

![](_page_23_Picture_6.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

![](_page_24_Picture_9.jpeg)

![](_page_24_Picture_10.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

![](_page_25_Figure_7.jpeg)

![](_page_25_Picture_8.jpeg)

![](_page_25_Picture_9.jpeg)

![](_page_25_Picture_10.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

| hit rate             | 0/16 | 15/16 | 3/16 | 5/16 | 7 |
|----------------------|------|-------|------|------|---|
| arithmetic intensity | 0.8  | 1.07  | 0.84 | 0.87 | ( |

![](_page_26_Figure_8.jpeg)

![](_page_26_Picture_9.jpeg)

![](_page_26_Picture_10.jpeg)

![](_page_26_Picture_11.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

| hit rate             | 0/16 | 15/16 | 3/16 | 5/16 | 7 |
|----------------------|------|-------|------|------|---|
| arithmetic intensity | 0.8  | 1.07  | 0.84 | 0.87 | ( |

![](_page_27_Figure_8.jpeg)

![](_page_27_Picture_9.jpeg)

![](_page_27_Picture_10.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

| hit rate             | 0/16 | 15/16 | 3/16 | 5/16 | 7 |
|----------------------|------|-------|------|------|---|
| arithmetic intensity | 0.8  | 1.07  | 0.84 | 0.87 | ( |

![](_page_28_Figure_8.jpeg)

![](_page_28_Figure_9.jpeg)

![](_page_28_Picture_10.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

| hit rate             | 0/16 | 15/16 | 3/16 | 5/16 | 7 |
|----------------------|------|-------|------|------|---|
| arithmetic intensity | 0.8  | 1.07  | 0.84 | 0.87 | ( |

![](_page_29_Figure_8.jpeg)

![](_page_29_Figure_9.jpeg)

![](_page_29_Picture_10.jpeg)

- Empirical: vectors through L1, links through texture
- ignore L2: also loads gauge field (128MB 1024MB)
- •48 kB L1 can hold 2048 24-byte vector elements
  - for 64<sup>3</sup>x16: 1 xy-plane (even-odd precondition) hit 7 out of 16 (43% hit rate)
  - for  $32^3x8$ : xy plane has 512 elements  $\rightarrow$  4 xy-planes in z direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

| hit rate             | 0/16 | 15/16 | 3/16 | 5/16 | 7 |
|----------------------|------|-------|------|------|---|
| arithmetic intensity | 0.8  | 1.07  | 0.84 | 0.87 | ( |

![](_page_30_Figure_9.jpeg)

![](_page_30_Picture_10.jpeg)

# Increasing the Intensity

Focus on the arithmetic intensity now ... push ups later.

Cache effects for vectors but remember they are only ~25% of the memory traffic.

What can we do about the gauge links?

![](_page_31_Picture_5.jpeg)

![](_page_31_Picture_7.jpeg)

![](_page_31_Picture_8.jpeg)

## HISQ Inverter for multiple right hand sides (rhs)

• combine multiple inversions with constant gauge field (constant sparse matrix)

$$\left(w_x^{(1)}, w_x^{(2)}, \dots, w_x^{(n)}\right) = D_{x,x'}\left(v_{x'}^{(1)}, v_{x'}^{(2)}, \dots, v_x^{(n)}\right)$$

• reuse links (input for the sparse matrix) in the matrix-vector multiplication (Dslash)

![](_page_32_Picture_5.jpeg)

![](_page_32_Picture_6.jpeg)

# HISQ Inverter for multiple right hand sides (rhs)

combine multiple inversions with constant gauge field (constant s

$$\left(w_x^{(1)},w_x^{(2)},\ldots,w_x^{(n)}
ight)$$

reuse links (input for the sparse matrix) in the matrix-vector multip

| #rhs      | 1    | 2    | 3    | 4    |
|-----------|------|------|------|------|
| Flop/byte | 0.80 | 1.25 | 1.53 | 1.73 |

![](_page_33_Figure_6.jpeg)

![](_page_33_Picture_7.jpeg)

![](_page_33_Picture_8.jpeg)

![](_page_33_Picture_9.jpeg)

# HISQ Inverter for multiple right hand sides (rhs)

combine multiple inversions with constant gauge field (constant sparse matrix)

$$\left(w_x^{(1)}, w_x^{(2)}, \dots, w_x^{(n)}\right) = D_{x,x'}\left(v_{x'}^{(1)}, v_{x'}^{(2)}, \dots, v_x^{(n)}\right)$$

• reuse links (input for the sparse matrix) in the matrix-vector multiplication (Dslash)

| #rhs      | 1    | 2    | 3    | 4    |
|-----------|------|------|------|------|
| Flop/byte | 0.80 | 1.25 | 1.53 | 1.73 |

- ignored cache effects for vectors here
  - caching will be much harder now as cache needs to be shared by vectors for #rhs
- memory traffic from gauge links decreases from 70% (1 rhs) to 30% (4 rhs)

![](_page_34_Figure_9.jpeg)

![](_page_34_Picture_12.jpeg)

## GPU Implementation: Texture Cache and Registers

• obvious solution: store matrix in registers

 possible issue: more registers / thread  $\rightarrow$  occupancy / spilling

![](_page_35_Picture_4.jpeg)

![](_page_35_Picture_5.jpeg)
## GPU Implementation: Texture Cache and Registers

• obvious solution: store matrix in registers

 possible issue: more registers / thread  $\rightarrow$  occupancy / spilling







## GPU Implementation: Texture Cache and Registers

• obvious solution: store matrix in registers

- possible issue: more registers / thread  $\rightarrow$  occupancy / spilling
- exploit texture cache → reduce register pressure
  - links should hit in texture cache  $\rightarrow$  only one global load
  - one block is executed by one SMX







## GPU Implementation: Texture Cache and Registers

• obvious solution: store matrix in registers

- possible issue: more registers / thread  $\rightarrow$  occupancy / spilling
- exploit texture cache  $\rightarrow$  reduce register pressure
  - links should hit in texture cache  $\rightarrow$  only one global load
  - one block is executed by one SMX
- combine both and explore best possible combinations

```
__global__ Dslashregcache (w1, w2, w3, v1, v2, v3){
offset = threadIdx.y;
for(xp=...){
    w1(x, offset) = D(x,xp) * v1(xp, offset);
    w2(x, offset) = D(x,xp) * v2(xp, offset);
    w3(x, offset) = D(x,xp) * v3(xp, offset);
```







### Does it work?

• use only memory bandwidth and arithmetic intensity

• estimate with bandwidth from triad benchmark









### Does it work?

- use only memory bandwidth and arithmetic intensity
- estimate with bandwidth from triad benchmark
- works even better than expected
  - expectation speedup for 4 rhs / 1 rhs:1.73/0.8 ~ 2.16
  - observed speedup: ~ 2.5
    - makes more efficient use of GPU (why ?)









### Does it work?

- use only memory bandwidth and arithmetic intensity
- estimate with bandwidth from triad benchmark
- works even better than expected
  - expectation speedup for 4 rhs / 1 rhs:1.73/0.8 ~ 2.16
  - observed speedup: ~ 2.5
    - makes more efficient use of GPU (why ?)
- pure loading through texture cache always wins
  - but 48kB texture cache can only hold links for 48 sites (each sites need 8x72 bytes + 8x56 bytes)









profile for 4 rhs to see whether caching strategy works:

| Block    | regs | occup. | eligibl.<br>warps | IPC  |
|----------|------|--------|-------------------|------|
| [16,4]   | 63   | 0.49   | 2.45              | 1.92 |
| [128,4]  | 63   | 0.47   | 2.92              | 1.92 |
| [256,4]  | 63   | 0.48   | 3.08              | 1.87 |
| [1024,1] | 62   | 0.48   | 0.87              | 0.77 |





profile for 4 rhs to see whether caching strategy works:

| Block    | regs | occup. | eligibl.<br>warps | IPC  | TC<br>Hits % | L2 (TC<br>Hits % |
|----------|------|--------|-------------------|------|--------------|------------------|
| [16,4]   | 63   | 0.49   | 2.45              | 1.92 | 51.9         | 50.0             |
| [128,4]  | 63   | 0.47   | 2.92              | 1.92 | 74.3         | 5.6              |
| [256,4]  | 63   | 0.48   | 3.08              | 1.87 | 75.9         | 0.0              |
| [1024,1] | 62   | 0.48   | 0.87              | 0.77 | 3.8          | 0.0              |

• each gauge link loaded once / rhs  $\rightarrow$  best case 75% texture cache hit









profile for 4 rhs to see whether caching strategy works:

| Block    | regs | occup. | eligibl.<br>warps | IPC  | TC<br>Hits % | L2 (TC)<br>Hits % | L1<br>Hits % | L2 (L1)<br>Hits % |
|----------|------|--------|-------------------|------|--------------|-------------------|--------------|-------------------|
| [16,4]   | 63   | 0.49   | 2.45              | 1.92 | 51.9         | 50.0              | 18.2         | 48.4              |
| [128,4]  | 63   | 0.47   | 2.92              | 1.92 | 74.3         | 5.6               | 31.2         | 37.1              |
| [256,4]  | 63   | 0.48   | 3.08              | 1.87 | 75.9         | 0.0               | 33.9         | 28.9              |
| [1024,1] | 62   | 0.48   | 0.87              | 0.77 | 3.8          | 0.0               | 44.3         | 7.1               |

• each gauge link loaded once / rhs  $\rightarrow$  best case 75% texture cache hit





profile for 4 rhs to see whether caching strategy works:

| Block    | regs | occup. | eligibl.<br>warps | IPC  | TC<br>Hits % | L2 (TC)<br>Hits % | L1<br>Hits % | L2 (L1)<br>Hits % | Tex+L2<br>Hits % | L1+L2<br>Hits % |
|----------|------|--------|-------------------|------|--------------|-------------------|--------------|-------------------|------------------|-----------------|
| [16,4]   | 63   | 0.49   | 2.45              | 1.92 | 51.9         | 50.0              | 18.2         | 48.4              | 75.9             | 57.8            |
| [128,4]  | 63   | 0.47   | 2.92              | 1.92 | 74.3         | 5.6               | 31.2         | 37.1              | 75.7             | 56.7            |
| [256,4]  | 63   | 0.48   | 3.08              | 1.87 | 75.9         | 0.0               | 33.9         | 28.9              | 75.9             | 53.0            |
| [1024,1] | 62   | 0.48   | 0.87              | 0.77 | 3.8          | 0.0               | 44.3         | 7.1               | 3.8              | 48.3            |

- each gauge link loaded once / rhs  $\rightarrow$  best case 75% texture cache hit
- better speedup than expected for 4 rhs compared to 1 rhs:
  - better utilization of GPU and better use of L2 cache





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler







- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data









- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - $[128,4] \rightarrow 16$  warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |



Tex Cache

Tex Cache





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - $[128,4] \rightarrow 16$  warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |

| (015,0)<br>(015,1) | (015,2)<br>(015,3) | (1631,0)<br>(1631,1) | (1631,2)<br>(1631,3) |
|--------------------|--------------------|----------------------|----------------------|
|                    |                    |                      |                      |





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - $[16,4] \rightarrow 2$  warps
  - $|128,4| \rightarrow 16$  warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - $[128,4] \rightarrow 16$  warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |



Tex Cache

Tex Cache





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the d
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - •[128,4] → 16 warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |

| data | (031,0) | (3264,0) | (6495,0) | (96127,0) |
|------|---------|----------|----------|-----------|
|      | (031,1) | (3264,1) | (6495,1) | (96127,1) |
|      | (031,2) | (3264,2) | (6495,2) | (96127,2) |
|      | (031,3) | (3264,3) | (6495,3) | (96127,3) |



- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - •[128,4] → 16 warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - •[128,4] → 16 warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - •[128,4] → 16 warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |





- focus on pure texture cache solution [1,4]
- each thread needs  $(8 \times 72 + 8 \times 56) = 1024$  bytes
- warps (32 threads) assigned to one scheduler
- switching between threads: need only some of the data
- block sizes and warps
  - •[16,4]  $\rightarrow$  2 warps
  - •[128,4] → 16 warps

| Block   | TC<br>Hits % | L2 (TC)<br>Hits % | Tex+L2<br>Hits % |
|---------|--------------|-------------------|------------------|
| [16,4]  | 51.9         | 50.0              | 75.9             |
| [128,4] | 74.3         | 5.6               | 75.7             |





## Some Details of the Phi Implementation

- effort lead by Patrick Steinbrecher (Universität Bielefeld → Brookhaven National Lab)
- single accelerator
- optimized for performance with multiple rhs





## Some Details of the Phi Implementation

- effort lead by Patrick Steinbrecher (Universität Bielefeld → Brookhaven National Lab)
- single accelerator
- optimized for performance with multiple rhs
- parallelized using OpenMP
- vectorized using intrinsics:
  - fuse lattice sites into 512bit vectors
  - 16 sites with SoA-layout







## Impact of Memory Layout and Prefetch

register pressure limits scaling with #rhs





### Impact of Memory Layout and Prefetch 16-fold + prefetch 16-fold ▲ 8-fold + prefetch 8-fold 300 hardware prefetching not effective for access pattern 225 Gflop/s 150 75 \_\_\_\_ 4 2 3 # rhs

- register pressure limits scaling with #rhs
- software prefetching improves by about 2x





### Impact of Memory Layout and Prefetch 16-fold + prefetch 16-fold ▲ 8-fold + prefetch 8-fold 300 hardware prefetching not effective for access pattern 225 Gflop/s 150 • reduces register pressure 75 **----** harder to implement • small gain for 1 rhs 4 3 2 # rhs

- register pressure limits scaling with #rhs
- software prefetching improves by about 2x
- •8-fold site fusion





## Impact of Memory Layout and Prefetch

- register pressure limits scaling with #rhs
- software prefetching improves by about 2x
  - hardware prefetching not effective for access pattern
- •8-fold site fusion
  - reduces register pressure
  - harder to implement
  - small gain for 1 rhs







## Let's get ready to rumble

Results for the full conjugate gradient inverter on Xeon Phi and Tesla









64,16



































### Green or blue computing



How energy efficient are the two architectures?

GTC 2015 | Mathias Wagner | Indiana University |



Oh, does anyone wonder about Maxwell in this respect?





### Energy consumption

- bandwidth-bound applications are unlikely to hit TDP
- What is the relevant observable?
  - energy consumed by the node?
  - energy consumed by the accelerator?
  - include infrastructure (cooling, ...) ?





### Energy consumption

- bandwidth-bound applications are unlikely to hit TDP
- What is the relevant observable?
  - energy consumed by the node?
  - energy consumed by the accelerator?
  - include infrastructure (cooling, ...) ?
- Take what we can get
  - software reported power consumption (nvprof)

• Xeon Phi is a bit more tricky: **estimate only** 
















## preliminary: code only optimized for Kepler









# Finish

GTC 2015 | Mathias Wagner | Indiana University |







# Summary

- Lattice QCD applications reflects triad bandwidth
  - equally well performing implementations for GPU / Phi
- multiple rhs achieve can easily speedup solver by 2.5
- •Xeon Phi requires vectorization and software prefetches
- GPU uses texture cache
- Caching of vectors likely improved with multiple rhs







| ۲. |   |
|----|---|
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
| ۰. | 1 |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
| •  |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
| ۰. |   |
|    |   |
|    |   |
|    |   |
|    |   |
| L  | _ |
| -  | - |

## Summary

- Lattice QCD applications reflects triad bandwidth
  - equally well performing implementations for GPU / Phi
- multiple rhs achieve can easily speedup solver by 2.5
- Xeon Phi requires vectorization and software prefetches
- GPU uses texture cache
- Caching of vectors likely improved with multiple rhs.



GTC 2015 | Mathias Wagner | Indiana University |



- GK110 about 1.5 times more efficient than KNL
- Maxwell promises another factor 1.5
- multiple rhs about twice as energy efficient



| ۲. |   |
|----|---|
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
| ۰. | 1 |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
| •  |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
|    |   |
| ۰. |   |
|    |   |
|    |   |
|    |   |
|    |   |
| L  | _ |
| -  | - |



# GPU vs Xeon Phi: Performance of Bandwidth Bound Applications with a Lattice QCD Case Study







### **Collaborators:**

- P. Steinbrecher (Bielefeld U  $\rightarrow$  Brookhaven National Lab)
- C. Schmidt (Bielefeld U)
- O. Kaczmarek (Bielefeld U)

## **Contact:**

mathwagn@indiana.edu http://linked.in/mathwagn @mathwagn

Thanks to: Jeongnim Kim (Intel) Mike Clark (Nvidia)

**References:** arXiv:1411.4439 [physics.comp-ph] arXiv:1409.1510 [cs.DC]



