Header And Logo

PostgreSQL
| The world's most advanced open source database.

checksum.c

Go to the documentation of this file.
00001 /*-------------------------------------------------------------------------
00002  *
00003  * checksum.c
00004  *    Checksum implementation for data pages.
00005  *
00006  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
00007  * Portions Copyright (c) 1994, Regents of the University of California
00008  *
00009  *
00010  * IDENTIFICATION
00011  *    src/backend/storage/page/checksum.c
00012  *
00013  *-------------------------------------------------------------------------
00014  *
00015  * Checksum algorithm
00016  *
00017  * The algorithm used to checksum pages is chosen for very fast calculation.
00018  * Workloads where the database working set fits into OS file cache but not
00019  * into shared buffers can read in pages at a very fast pace and the checksum
00020  * algorithm itself can become the largest bottleneck.
00021  *
00022  * The checksum algorithm itself is based on the FNV-1a hash (FNV is shorthand
00023  * for Fowler/Noll/Vo) The primitive of a plain FNV-1a hash folds in data 1
00024  * byte at a time according to the formula:
00025  *
00026  *     hash = (hash ^ value) * FNV_PRIME
00027  *
00028  * FNV-1a algorithm is described at http://www.isthe.com/chongo/tech/comp/fnv/
00029  *
00030  * PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
00031  * high bits - high order bits in input data only affect high order bits in
00032  * output data. To resolve this we xor in the value prior to multiplication
00033  * shifted right by 17 bits. The number 17 was chosen because it doesn't
00034  * have common denominator with set bit positions in FNV_PRIME and empirically
00035  * provides the fastest mixing for high order bits of final iterations quickly
00036  * avalanche into lower positions. For performance reasons we choose to combine
00037  * 4 bytes at a time. The actual hash formula used as the basis is:
00038  *
00039  *     hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17)
00040  *
00041  * The main bottleneck in this calculation is the multiplication latency. To
00042  * hide the latency and to make use of SIMD parallelism multiple hash values
00043  * are calculated in parallel. The page is treated as a 32 column two
00044  * dimensional array of 32 bit values. Each column is aggregated separately
00045  * into a partial checksum. Each partial checksum uses a different initial
00046  * value (offset basis in FNV terminology). The initial values actually used
00047  * were chosen randomly, as the values themselves don't matter as much as that
00048  * they are different and don't match anything in real data. After initializing
00049  * partial checksums each value in the column is aggregated according to the
00050  * above formula. Finally two more iterations of the formula are performed with
00051  * value 0 to mix the bits of the last value added.
00052  *
00053  * The partial checksums are then folded together using xor to form a single
00054  * 32-bit checksum. The caller can safely reduce the value to 16 bits
00055  * using modulo 2^16-1. That will cause a very slight bias towards lower
00056  * values but this is not significant for the performance of the
00057  * checksum.
00058  *
00059  * The algorithm choice was based on what instructions are available in SIMD
00060  * instruction sets. This meant that a fast and good algorithm needed to use
00061  * multiplication as the main mixing operator. The simplest multiplication
00062  * based checksum primitive is the one used by FNV. The prime used is chosen
00063  * for good dispersion of values. It has no known simple patterns that result
00064  * in collisions. Test of 5-bit differentials of the primitive over 64bit keys
00065  * reveals no differentials with 3 or more values out of 100000 random keys
00066  * colliding. Avalanche test shows that only high order bits of the last word
00067  * have a bias. Tests of 1-4 uncorrelated bit errors, stray 0 and 0xFF bytes,
00068  * overwriting page from random position to end with 0 bytes, and overwriting
00069  * random segments of page with 0x00, 0xFF and random data all show optimal
00070  * 2e-16 false positive rate within margin of error.
00071  *
00072  * Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer
00073  * multiplication instruction. As of 2013 the corresponding instruction is
00074  * available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32).
00075  * Vectorization requires a compiler to do the vectorization for us. For recent
00076  * GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough
00077  * to achieve vectorization.
00078  *
00079  * The optimal amount of parallelism to use depends on CPU specific instruction
00080  * latency, SIMD instruction width, throughput and the amount of registers
00081  * available to hold intermediate state. Generally, more parallelism is better
00082  * up to the point that state doesn't fit in registers and extra load-store
00083  * instructions are needed to swap values in/out. The number chosen is a fixed
00084  * part of the algorithm because changing the parallelism changes the checksum
00085  * result.
00086  *
00087  * The parallelism number 32 was chosen based on the fact that it is the
00088  * largest state that fits into architecturally visible x86 SSE registers while
00089  * leaving some free registers for intermediate values. For future processors
00090  * with 256bit vector registers this will leave some performance on the table.
00091  * When vectorization is not available it might be beneficial to restructure
00092  * the computation to calculate a subset of the columns at a time and perform
00093  * multiple passes to avoid register spilling. This optimization opportunity
00094  * is not used. Current coding also assumes that the compiler has the ability
00095  * to unroll the inner loop to avoid loop overhead and minimize register
00096  * spilling. For less sophisticated compilers it might be beneficial to manually
00097  * unroll the inner loop.
00098  */
00099 #include "postgres.h"
00100 
00101 #include "storage/checksum.h"
00102 
00103 /* number of checksums to calculate in parallel */
00104 #define N_SUMS 32
00105 /* prime multiplier of FNV-1a hash */
00106 #define FNV_PRIME 16777619
00107 
00108 /*
00109  * Base offsets to initialize each of the parallel FNV hashes into a
00110  * different initial state.
00111  */
00112 static const uint32 checksumBaseOffsets[N_SUMS] = {
00113     0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A,
00114     0x79FF467A, 0x9BB9F8A3, 0x217E7CD2, 0x83E13D2C,
00115     0xF8D4474F, 0xE39EB970, 0x42C6AE16, 0x993216FA,
00116     0x7B093B5D, 0x98DAFF3C, 0xF718902A, 0x0B1C9CDB,
00117     0xE58F764B, 0x187636BC, 0x5D7B3BB1, 0xE73DE7DE,
00118     0x92BEC979, 0xCCA6C0B2, 0x304A0979, 0x85AA43D4,
00119     0x783125BB, 0x6CA8EAA2, 0xE407EAC6, 0x4B5CFC3E,
00120     0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756
00121 };
00122 
00123 /*
00124  * Calculate one round of the checksum.
00125  */
00126 #define CHECKSUM_COMP(checksum, value) do {\
00127     uint32 __tmp = (checksum) ^ (value);\
00128     (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17);\
00129 } while (0)
00130 
00131 uint32
00132 checksum_block(char *data, uint32 size)
00133 {
00134     uint32 sums[N_SUMS];
00135     uint32 (*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;
00136     uint32 result = 0;
00137     int i, j;
00138 
00139     /* ensure that the size is compatible with the algorithm */
00140     Assert((size % (sizeof(uint32)*N_SUMS)) == 0);
00141 
00142     /* initialize partial checksums to their corresponding offsets */
00143     memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
00144 
00145     /* main checksum calculation */
00146     for (i = 0; i < size/sizeof(uint32)/N_SUMS; i++)
00147         for (j = 0; j < N_SUMS; j++)
00148             CHECKSUM_COMP(sums[j], dataArr[i][j]);
00149 
00150     /* finally add in two rounds of zeroes for additional mixing */
00151     for (i = 0; i < 2; i++)
00152         for (j = 0; j < N_SUMS; j++)
00153             CHECKSUM_COMP(sums[j], 0);
00154 
00155     /* xor fold partial checksums together */
00156     for (i = 0; i < N_SUMS; i++)
00157         result ^= sums[i];
00158 
00159     return result;
00160 }