CONTENTS Line Subject 18 Variations of Harley's algorithm for computing nlz(x) 315 Web site about representing sets with bit strings, Gray codes, population count, the Hilbert curve, and mask generation 355 Count trailing zeros (ntz(x)) 396 Multibyte (or SIMD) arithmetic 529 Test for the presence of a 0-byte in a word 581 Loop unrolling, counting 1-bits in a word and in a sparse array 694 Some tricks useful for the Secure Hash Algorithm 783 Population count and bit reversal 856 Average of two integers, various forms 942 Max and min for x86 machines ------------------------------------------------------------------------ From Julius Goryavsky, 7/25/2003: Hello, I apologize for spend your time, however I continued searches of "magic" numbers and improvements for algorithms placed in your book! I have found not only new magic constants and method's modifications for reducing the table size, but I also have found some of performance optimizations for original Robert Harley's algorithm. If you interests, I can send you the program for the analysis of all possible constants for all modifications of algorithms known to me and search among those constants which have the minimal or maximal numerical value, giving the shortest tables, having the minimal or maximal value of population count. I offer the following modifications and improvements for Robert Harley's and David Seal's algorithms: 1). I add new transformation step after conversion "x" to "111...111"-like mask: ... x = x | (x >> 8); x = x | (x >> 16); /* New step: */ x = x ^ (x >> 16); To reduce performance impact, this additional step may be combined with previous one: ... x = x | (x >> 8); /* x = x | (x >> 16); */ /* x = x ^ (x >> 16); */ x = x & ~(x >> 16); If the processor has a command "and with complement", then this construction uses the same number of processor cycles and instructions. Using this method we can reduce number of multipliers in the magic constant from four (in original algorithm) to three. We can use a constant: 0xFD7049FF = 511 * 2047 * 16383 This change allows us to reduce number of instructions in algorithm nlz1b and to speed up its work by two instructions (or by one instruction, if the processor do not have the "and with complement" command). 2). If we use Robert Harley's algorithm without replacement of multiplication by shifts and subtractions, then the offered modifications allows us to reduce the size of the auxiliary table from 64 to 48 elements. For this purpose it is necessary to use the magic constant = 0x3EF5D037. Reduction of the size of the table from 64 to 48 elements allows us to saving one line of cache for processors having 16-byte cache lines. 3). Also it is possible to use other modification (similar to first): ... x = x | (x >> 8); x = x | (x >> 16); /* New step: */ x = ~(x ^ (x >> 16)); This additional step may be combined with previous one and simplified by using of De-Morgan's law: ... x = x | (x >> 8); /* x = x | (x >> 16); */ /* x = ~(x ^ (x >> 16))= ~(x & ~(x >> 16)) */ x = ~x | (x >> 16); This construction is like previous, but it does not demand additional time, even at one cycle of the processor, because operation "~x" can be executed in parallel with operation "(x > > 16)". This modification of algorithm allows us to reduce number of factors in the magic constant from four to three if the appropriate constant is used: 0xF8877FF1 = 15 * 32767 * 0x800001 Interesting feature of this modification is it never produces a table index equal to zero for the given magic constant (and the first element can be excluded from the table, see program "ntz1e"). 4). This modification also allows us to reduce the size of the table from 64 to 48 elements if the magic constant = 0x353B8761 is used. Thus the operating time of algorithm is not increased in comparison with original version, even for those processors, which do not have "and with complement" instruction, because operations "~x" and " (x > > 16) " are calculated in parallel. If we use of this constant, then first three elements of the original table (with 64 elements) are not used (also they are reduced by my program used for construction of tables - to minimize the effective size of the table to 48 elements). 5). The previous algorithm has an interesting magic constant 0x02293001 which has population count = 7. Due to this it is possible to replace multiplication by a sequence of 6 shifts and additions: 0x02293001 = 10 0010 1001 0011 0000 0000 0001 binary: x = x + (x << 12) + (x << 13) + (x << 16) + (x << 19) + (x << 21) + (x << 25); Thus all shifts can be executed in parallel, and six additions can be made in 3 cycles (in pairs) if the processor has infinity parallelism. Other interesting constant is: 0xFAD7BFFF, it may be used in first modification of Harley's algorithm offered by me (x ^ (x > > 16)). It has maximal population count = 27, that gives the following sequence: 0xFAD7BFFF = 1111 1010 1011 0111 1011 1111 1111 1111 binary: x = -x - (x << 26) - (x << 24) - (x << 21) - (x << 19) - (x << 15); 6). If to keep original Robert Harley's and David Seal's algorithms, magic constants minimize the size of the tables for them with 64/63 to 53 elements it 0x2E9BBECD (Harley's) and 0x0432A68B (Seal's). 7). The minimally possible constant for Harley's algorithm is 0x0218A58B which is intended for second modification of the algorithm offered by me. From constants of a kind "sum(2 ^ k/-1)" it is interesting 0x0D322585 = 5 * 127 * 2049 * 131071 * 0x100001. This constant minimizes the size of the table in Seal's algorithm (up to 53 elements), but it consists of five factors instead of three.:( Also it's possible the modification of Seal's algorithm, using expression "-x | x" for which there is smallest of constants that I know: 0x0218A393. Further included the full code for the modification of algorithms offered by me: 1). Method: "x ^ (x >> 16)", shift/add implementation: int nlz1c(unsigned x) { static unsigned char table [64] = { 32, 20, 19, u, u, 18, u, 7, 10, 17, u, u, 14, u, 6, u, u, 9, u, 16, u, u, 1, 26, u, 13, u, u, 24, 5, u, u, u, 21, u, 8, 11, u, 15, u, u, u, u, 2, 27, 0, 25, u, 22, u, 12, u, u, 3, 28, u, 23, u, 4, 29, u, u, 30, 31 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x & ~(x >> 16); /* x = x * 0xFD7049FF; */ x = (x << 9) - x; x = (x << 11) - x; x = (x << 14) - x; return table[x >> 26]; } 2). Method: "x ^ (x >> 16)", multiply-based implementation (table length is 48 elements!): int nlz1d(unsigned x) { static unsigned char table [48] = { 32, 6, 5, u, 4, 12, u, 20, 15, 3, 11, 0, u, 18, 25, 31, 8, 14, 2, u, 10, u, u, u, u, u, u, 21, u, u, 19, 26, 7, u, 13, u, 16, 1, 22, 27, 9, u, 17, 23, 28, 24, 29, 30 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x & ~(x >> 16); x = x * 0x3EF5D037; return table[x >> 26]; } 3). Method: "~(x ^ (x >> 16))", shift/add implementation: int nlz1e(unsigned x) { unsigned char table [62] = { 32, u, 31, 19, u, 11, 30, 18, u, u, 10, u, 7, 29, u, 17, 1, u, u, 9, u, u, u, 6, u, u, 4, u, 28, 24, u, 16, 0, 20, u, 12, u, u, 8, u, 2, u, u, u, u, 5, 25, u, 21, u, 13, u, 3, u, 26, 22, u, 14, 27, 23, u, 15 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = ~x | (x >> 16); /* x = x * 0xF8877FF1; */ x = (x << 4) - x; x = (x << 15) - x; x = (x << 23) + x; return (table - 1)[x >> 26]; // Zero index is impossible! } 4). Method: "~(x ^ (x >> 16))", multiply-based implementation (table length is 48 elements!): int nlz1f(unsigned x) { unsigned char table [48] = { 1, 22, u, 3, 18, 21, 15, 30, u, u, 6, 13, 17, 10, 20, u, u, 0, 29, 27, u, 25, u, u, 2, 4, u, 16, 7, 14, 11, 23, 19, u, 31, u, 5, u, 8, 12, 28, 26, u, 9, u, u, 24, 32 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = ~x | (x >> 16); x = x * 0x353B8761; return (table - 3)[x >> 26]; // Index in range [0...2] is impossible! } 5). Method: "~(x ^ (x >> 16))", using magic constants with population count = 7 and 27: int nlz1g(unsigned x) { unsigned char table [64] = { 3, u, 2, u, 9, 1, u, u, 12, 8, u, 0, u, u, u, 5, 11, 14, u, 7, u, u, 22, u, u, 17, u, 20, u, 26, u, u, 4, 10, u, 13, u, u, u, 6, 15, u, u, 23, 18, 21, 27, u, u, u, u, 16, u, 24, 19, 28, u, u, 25, 29, u, 30, 31, 32 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = ~x | (x >> 16); /* x = x * 0x02293001; */ x = x + (x << 12) + (x << 13) + (x << 16) + (x << 19) + (x << 21) + (x << 25); return table[x >> 26]; } int nlz1h(unsigned x) { unsigned char table [64] = { 32, u, u, u, u, 14, u, u, u, 8, 13, u, u, u, 5, u, 0, u, 10, 7, 12, u, u, u, 27, 22, u, u, 25, 4, u, u, 20, u, 15, u, 9, u, u, 6, 1, 11, u, u, 28, 23, 26, u, 21, 16, u, u, 2, u, 29, 24, u, 17, 3, u, 30, 18, 31, 19 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x & ~(x >> 16); /* x = x * 0xFAD7BFFF; */ x = -x - (x << 26) - (x << 24) - (x << 21) - (x << 19) - (x << 15); return table[x >> 26]; } 6). Constants for minimization of the size of tables up to 53 elements (using the original algorithms): Robert Harley's algorithm, magic constant 0x2E9BBECD: unsigned char table [53] = { 32, 6, 23, u, 2, u, u, 25, u, u, u, 31, u, 9, 5, u, 22, 29, 19, 15, 1, u, u, u, u, u, 7, 24, 3, u, 26, u, u, 10, 30, 20, 16, u, u, 8, 4, 27, 11, 21, 17, u, 28, 12, 18, 13, u, 14, 0 }; David Seal's algorithm, magic constant 0x0432A68B: unsigned char table [53] = { 32, 0, 1, 6, 2, 25, 7, u, 3, u, 12, 26, 8, 19, u, u, 4, 23, u, 17, 15, 13, 27, u, 29, 9, 20, u, u, u, u, u, 31, 5, 24, u, u, 11, 18, u, 22, 16, 14, u, 28, u, u, u, 30, u, 10, u, 21 }; Sincerely, Julius Goryavsky. ------------------------------------------------------------------------ From Doug Moore, 5/6/2003: If you're interested, I have a few bit twiddling items at a web site that persists at my former employer: http://www.caam.rice.edu/~dougm/twiddle/ Of particular interest, I think: Two concern integers with a fixed number of bits set: 1. given an integer with k bits set, find the next larger one 2. given an integer with k bits set, find another one that differs from this one in only two bit positions, and has k bits set; by repeating the operation, you enumerate the k-bits-set integers. One concerns Hilbert curves - code in n-dimensions, and for comparison and range queries, not just conversions from int to point and back. I'd have to do some work to make it understandable though. Related to that is some work on transposing rectangular bit matrices. It's an important part of n-dimensional Hilbert encoding. One not on the web page that has interested me is the representation of rational numbers by their position in the Sturm-Brochot tree - it's not a different base for numbers, but a different binary representation that includes rationals. Very briefly: if x is a number, then 1x = 1+x, and 0x = x/(1+x) so 0101 = 2/5. There's a HAKMEM item about this representation, though not focusing on a bit-representation. I'm looking forward to reading the book more carefully. Doug Moore ------------------------------------------------------------------------ From Dean Gaudet, 10/23/2003: here's yet another way of counting trailing zeros which i don't think i saw on your pages / in your book. int ntz(unsigned src) { unsigned x; unsigned bit4, bit3, bit2, bit1, bit0; if (src == 0) return 32; x = src & -src; bit4 = (src & 0xffff) ? 0 : 16; bit3 = (x & 0xff00ff00) ? 8 : 0; bit2 = (x & 0xf0f0f0f0) ? 4 : 0; bit1 = (x & 0xcccccccc) ? 2 : 0; bit0 = (x & 0xaaaaaaaa) ? 1 : 0; return bit4 + bit3 + bit2 + bit1 + bit0; } the bit4 test is deliberately different because it can go in parallel with the (src & -src) calculation. notice how all those ?: tests are independent of each other. if the target machine is really wide they could go entirely in parallel. it looks like it'd work well on your "full risc" model using movne for the ?:. or cmpne could set bitN to 0/1 and ins could be used in place of all the additions. those 32-bit immediates could be replaced with 16-bit immediates if x>>16 is used when bit4 != 0. take care -dean ------------------------------------------------------------------------ From Falk Hueffner, 1/23/2004: Okay. Here's what I have [on "multibyte," or SIMD arithmetic operations]. It seems this method is worthwhile for add/sub, I already have a patch for gcc to do this for the new autovectorization framework (http://gcc.gnu.org/projects/tree-ssa/vectorization.html), but probably not for saturated sub/add, unless somebody can come up with a better method. "v8qi" is gcc speak for a vector of 4 "quarter ints", i.e., bytes. typedef unsigned long long uint64_t; // 6 insns, 3 cycles (without constant construction) inline uint64_t addv8qi(uint64_t x, uint64_t y) { uint64_t signmask = 0x8080808080808080ULL; uint64_t signs = (x ^ y) & signmask; x &= ~signmask; y &= ~signmask; x += y; x ^= signs; return x; } // 6 insns, 3 cycles (without constant construction) inline uint64_t subv8qi(uint64_t x, uint64_t y) { uint64_t signmask = 0x8080808080808080ULL; uint64_t signs = (x ^ ~y) & signmask; x |= signmask; y &= ~signmask; x -= y; return x ^ signs; } // 4 insns, 3 cycles inline uint64_t negv8qi(uint64_t y) { return subv8qi(0, y); } // unsigned saturated add, saturate to 0xff inline uint64_t usaddv8qi(uint64_t x, uint64_t y) { uint64_t t0, t1; uint64_t signmask = 0x8080808080808080ULL; t0 = (y ^ x) & signmask; t1 = (y & x) & signmask; x &= ~signmask; y &= ~signmask; x += y; t1 |= t0 & x; t1 = (t1 << 1) - (t1 >> 7); return (x ^ t0) | t1; } // unsigned saturated sub, saturate to 0x00 or 0xff // 14 insns, 7 cycles (without constant construction) inline uint64_t ussubv8qi(uint64_t x, uint64_t y) { uint64_t t0, t1; uint64_t signmask = 0x8080808080808080ULL; t0 = (y ^ ~x) & signmask; t1 = (y & ~x) & signmask; x |= signmask; y &= ~signmask; x -= y; t1 |= t0 & ~x; t1 = (t1 << 1) - (t1 >> 7); return (x ^ t0) & ~t1; } // 1 insn, 1 cycle inline uint64_t usnegv8qi(uint64_t x) { return 0; } // signed saturated add, saturate to 0x80 or 0x7f // 16 insns, 8 cycles uint64_t ssaddv8qi(uint64_t x, uint64_t y) { uint64_t eq, xv, yv, satmask, satbits, satadd, t0, t1; uint64_t signmask = 0x8080808080808080ULL; eq = (x ^ ~y) & signmask; xv = x & ~signmask; yv = y & ~signmask; xv += yv; satbits = (xv ^ y) & eq; satadd = satbits >> 7; satmask = (satbits << 1) - satadd; xv ^= eq; t0 = (xv & ~satmask) ^ signmask; t1 = satadd & ~(xv >> 7); return t0 - t1; } // Average. This is valuable for MPEG codecs. inline uint64_t avgv8qi(uint64_t a, uint64_t b) { return (a & b) + (((a ^ b) & 0xfefefefefefefefeULL) >> 1); } // Rounded average. inline uint64_t avg_roundv8qi(uint64_t a, uint64_t b) { return (a | b) - (((a ^ b) & 0xfefefefefefefefeULL) >> 1); } void dumpsb(uint64_t v) { signed char c[8]; memcpy(c, &v, 8); printf("%016lx %4d %4d %4d %4d %4d %4d %4d %4d\n", v, c[7], c[6], c[5], c[4], c[3], c[2], c[1], c[0]); } // some example int main() { uint64_t x = 0x01ff807fff897908; uint64_t y = 0x08010712fe8870ff; dumpsb(x); dumpsb(y); dumpsb(ssaddv8qi(x, y)); } As I mentioned, the ideas are not from me but mostly stolen somewhere. I've also not thoroughly tested them... -- Falk ------------------------------------------------------------------------ From Paul Curtis, 12/15/2004: Hi, P. 92 of Hacker's Delight indicates that after the second assignment to y, you can do a jump-if-zero to determine if there's any zero byte in a word. But... If that's what you want to do, then why not use: (fz == find-if-zero) int fz(int x) { return ((x-0x01010101) ^ x) & ~x & 0x80808080; } (X-0x01010101) ^ x indicates whether there was a change of sign in bit 7 of each byte of the word after subtraction. Anding with ~x separates the 80-1=0x7f and 00-=0xff case. Anding with 80 gives us each bit we're after. But with De Morgan, bit twiddling can be reduced somewhat. Assume v = x-0x01010101, then: (v ^ x) & ~x => (v & ~x | ~v & x) & ~x => (v & ~x & ~x | ~v & x & ~x) => (v & ~x | 0) => (v & ~x) int fz(int x) { return (x-0x01010101) & ~x & 0x80808080; } If this returns a non-zero result, then there's a zero in the word. It looks like it delivers along the same as the first two lines of Figure 6-3. This is a significant result. It can accelerate many zero-terminated string functions if the inputs are word aligned... Regards, -- Paul Curtis, Rowley Associates Ltd http://www.rowley.co.uk CrossWorks for MSP430, ARM, and now AVR processors ------------------------------------------------------------------------ From Oleg Khovayko, 3/29/05: Dear Henry! Let I introduce myself. My name Oleg Khovayko. Detailed info about myself you can found at: http://olegh.spedia.net (sorry, not updated for some years) I want to share to you some my old tricks, not reflected in the your book now. [Below is an efficient way to unroll a loop by a factor of 2, when it is a counting-loop (not a while-loop), and the machine has a way to shift right 1 and then branch on the bit shifted out. First is Intel x86 code, followed by PDP-11 code. Part of the wisdom here is that if the loop is to be executed an odd number of times, then it is best to take care of the odd iteration at the start, rather than at the end, to avoid a compare and branch in the middle of the unrolled loop. The presence of the label $2 will, however, inhibit some optimizations, such as commoning and some scheduling. Here CMD is the body of the loop being unrolled.] 1 ------- unloop code: mov count, %cx inc %cx shr %cx ; Shift right 1, setting carry (CF) jnc $2 ; Jump if no carry. $1: CMD $2: CMD loop $1 ; Decrement count in %cx & jump if nonzero. [Another loop transformation that might be considered is to change: do i = 1 to n; CMD end; to: do i = 1 to (n & -2) by 2; CMD i = i + 1; CMD end; if n > 0 && (n & 1) != 0 then do; i = n; CMD end; This puts the two occurrences of CMD in the loop with no labels or branches between them, giving more opportunities for optimizations, particularly scheduling. Of course, there's a code blow-up.] ------- Same code on the DEC PDP-11 Assembly language (more familiar to me): mov #count, r0 inc r0 asr r0 bcc 2$ 1$: CMD 2$: CMD sob r0, 1$ [Below is a method for counting the 1-bits in a word by shifting left 1, adding the bit shifted out to an accumulator (r0 below), and continuing until the word being shifted becomes 0. This is PDP-11 code. The bne branches if the register result of the asl (arithmetic shift left) instruction is nonzero.] 2 --- compact code for bit counting in the word: clr r0 ; accumulator mov #WORD, r1 1$: adc r0 ; Add carry to accumulator asl r1 ; shift left, and put bit 15 to "C" bne 1$ ; not zero adc r0 ; r0 - contains result 3 --- Bit counting in a sparse array For bit counting in the array, approx 4 years ago I've written following procedure (not copied from code, restored from memory): int ar_bit(int *ar, int len) { int ac = *ar++, rc = 0; while(--len > 0) { register x = *ar++; register n = x & ac; // conflict bits ac |= x; // Or next value to accumulator if(n) { // was conflict rc += wbc(ac); ac = n; } } // while return rc + wbc(ac); } // ar_bit It is very fast for sparsely populated arrays. We use an "accumulator," where we accumulate (by ORs) bits from the array. We call the wbc (WordBitCount) function only when the accumulator "overflows" (two bits want to go into the same bit position). In this way, we do not count bits in each word; we count bits only in the "rich" accumulator from time to time. This code is good for superscalar CPUs, since during regular control flow, the result of the current command uses in "overnext" command only. ------------------------------------------------------------------------ From a reader, 7/12/05: [Ed. note: SHA = Secure Hash Algorithm. See http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf] Looking at section 2-19, I was reminded of some tricks that arise in implementing SHA. One operation required in SHA (and some monochrome graphics) is a bit-wise x ? y : z. If your machine has a 1-cycle and-not, that can be simply implemented as ch(x,y,z) = (x & y) | (~x & z) However, another useful implementation is ch(x,y,z) = z ^ (x & (y ^ z)) This is particularly good on 2-operand machines when you want to leave the original operands around. The same trick can be used to generate a 2n-bit rotate on a machine with only n-bit rotates. Using a suitable mask M, you can do an r-bit rotate using: hi = ROTL(hi, r%n) lo = ROTL(lo, r%n) t = (hi ^ lo) & M hi ^= t lo ^= t Particularly useful for the 64-bit fixed rotates in SHA-512. In fact, similar tricks can be used to rotate an n*32-bit word by up to +/-32 bits. Just rotate each word by the appropriate amount, and then shuffle the corresponding portions of each word. For example, a 17-bit left rotate of a 128-bit word, stored little-endian on a 32-bit machine: x[0] = ROTL(x[0],17) x[1] = ROTL(x[1],17) x[2] = ROTL(x[2],17) x[3] = ROTL(x[3],17) Followed by either three swap operations: t = (x[3] ^ x[2]) & 0x1ffff; x[3] ^= t; x[2] ^= t; t = (x[2] ^ x[1]) & 0x1ffff; x[2] ^= t; x[1] ^= t; t = (x[1] ^ x[0]) & 0x1ffff; x[1] ^= t; x[0] ^= t; Or, more efficiently: x3 = x[3]; x[3] ^= (x3 ^ x[2]) & 0x1ffff; x[2] ^= (x[2] ^ x[1]) & 0x1ffff; x[1] ^= (x[1] ^ x[0]) & 0x1ffff; x[0] ^= (x[0] ^ x3) & 0x1ffff; Another trick that is useful in SHA is to replace an OR with an XOR or + when the arguments can be proven to have no bits in common. With the simple ch() function, ch(x,y,z) = (x & y) | (~x & z) = (x & y) ^ (~x & z) = (x & y) + (~x & z) is useful, because some rounds of SHA-1 compute: ROTL(a,5) + ch(b,c,d) + k + w[i] Having ch() broken into two, summed sub-functions gives you more scheduling flexibility. The same trick can be applied to the bit-wise majority function, also used in the SHA family: maj(x,y,z) = (x&y) | (y&z) | (z&x) = x&(y|z) | (y&z) = x&(y^z) | (y&z) = x&(y^z) ^ (y&z) = x&(y^z) + (y&z) Again, the latter form helps find parallelism when summing the result. ------------------------------------------------------------------------ From Norbert Juffa, 10/01/05: Just a quick note to pass on a couple observations I made with regard to the population count and bit reversal sections of "Hacker's Delight". The first observation is in regard to the HAKMEM version of population count. The original version is quite slow since it uses a modulo operation (% 63) to sum the 6-bit fields. However, the summing can be accelerated significantly by using an integer multiply. In fact, on my Athlon (which has a very fast integer multiplier) this puts this version among the top performers. int popcnt32 (unsigned x) { unsigned s = 033333333333; unsigned t = 030707070707; unsigned n; n = (x >> 1) & s; x = x - n; n = (n >> 1) & s; x = x - n; n = x; x = x >> 3; n = (n + x) & t; x = n; t = t & (t << 2); /* t = 0x04104104 */ x = (n >> 30) + ((x * t) >> 26); return x; } Note that the magic multiplier needed for summing the 6-bit fields can be derived easily from one of the existing masks. This is advantageous on processors with poor support for constructing constants, such as ARM (the ARM can execute t = t & (t << 2) in a single instruction, on the other hand). The second observation is with regard to the many masks needed in the 32-bit bit-reversal. In fact, except for one mask which needs to be loaded, all these masks can be easily generated on the fly. To enable this, one has to reverse the order of the processing steps such that the bigger groups are exchanged prior to smaller groups. On-the-fly construction of large masks is very advantageous on the ARM, for example. Having a CPU with an ANDN instruction certainly helps here, since half the masks are just inverted versions of the other half. unsigned bitreverse (unsigned x) { unsigned m = 0xffff0000; x = ((x & m) >> 16) | ((x & ~m) << 16); m = m ^ (m >> 8); x = ((x & m) >> 8) | ((x & ~m) << 8); m = m ^ (m >> 4); x = ((x & m) >> 4) | ((x & ~m) << 4); m = m ^ (m >> 2); x = ((x & m) >> 2) | ((x & ~m) << 2); m = m ^ (m >> 1); x = ((x & m) >> 1) | ((x & ~m) << 1); return x; } One could also start with m = 0x0000ffff, if that is more efficient on some platform. Also, on a 32-bit machine there is no need to mask in the first step when bit reversing 32-bit operands. It is sufficient to do: x = (x >> 16) | (x << 16); -- Norbert ------------------------------------------------------------------------ From Peter Luschny, 6/17/06: Dear Mr. Warren, I read with much delight the sample sections of your Hacker's Delight. After hearing about the much discussed bug in the Java implementation of the binary search http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html http://lambda-the-ultimate.org/node/1549 I considered the question of finding the average of two signed integers (a+b)/2 and tried to solve the problem with 6 rounding-modes: Up, Down, Ceiling, Floor, Even, Odd. Maybe my solutions, which I include below, are of interest to you. With many regards Peter Luschny // ---------------------------------------------------------------- // Written in C#, code is in the public domain. // Specifies a rounding behavior for averaging two signed integers. public enum Rounding { Up, Down, Ceiling, Floor, Even, Odd }; public class AveragesOfTwoSignedIntegers { // Computes the average of two signed integers according to // the rounding mode requested. I deliberately used a tiny // framework for simple comparison. public static int Average(int a, int b, Rounding mode) { int v = (a >> 1) + (b >> 1); bool cond; switch (mode) { // Round the average towards negative infinity. case Rounding.Floor: cond = true; break; // Round the average towards positive infinity. case Rounding.Ceiling: cond = false; break; // Round the average away from zero. case Rounding.Up: cond = (a ^ b) < 0; break; // Round the average towards zero. case Rounding.Down: cond = (a ^ b) > 0; break; // Round the average towards the even neighbor. case Rounding.Even: cond = ((v & 1) + ((a & b) & 1)) == 0; break; // Round the average towards the odd neighbor. case Rounding.Odd: cond = ((v & 1) + ((a & b) & 1)) != 0; break; } return v + ((cond ? a & b : a | b) & 1); } // --- I run some tests against the following function. public static int dAverage(int a, int b, Rounding mode) { if (a == b) return a; double h = a / 2.0 + b / 2.0; int r = 0; switch (mode) { case Rounding.Floor: r = (int) Math.Floor(h); break; case Rounding.Ceiling: r = (int) Math.Ceiling(h); break; case Rounding.Down: r = ((a < 0 && b < 0) || (a >= 0 && b >= 0)) ? (int) Math.Floor(h) : (int) Math.Ceiling(h); break; case Rounding.Up: r = ((a < 0 && b < 0) || (a >= 0 && b >= 0)) ? (int) Math.Ceiling(h) : (int)Math.Floor(h); break; case Rounding.Even: r = (int) Math.Floor(h); if ((r & 1) != 0) r = (int) Math.Ceiling(h); break; case Rounding.Odd: r = (int) Math.Floor(h); if ((r & 1) != 1) r = (int) Math.Ceiling(h); break; } return r; } } // ---------------------------------------------------------------- ------------------------------------------------------------------------ From Tim Shaporev, 10/30/06: Hello! I just wrote [bought] the "Hacker's Delight" book (well, Russian edition to be exact) and I think I have couple of additions (I found nothing similar on the site from the first glance). for #2.18 Doz, Min and Max There is quite effective implementation of min and max functions for Intel x86 processor (unsigned integers only). I wrote the sample for 16-bits mode though it works for 32 bits as well. Arguments are in AX, BX sub ax,bx sbb dx,dx ; either 0 or all 1's and ax,dx ; either 0 or a-b add ax,bx ; min The max function is one command longer: sub ax,bx sbb dx,dx ; either 0 or all 1's not dx and ax,dx ; either 0 or a-b add ax,bx ; max These were found in some old disassembled code so I do not know who is the real author of the trick. for #5.3 nlz() function. The function in the book is quite interesting for it demonstrates three different tricks, but using just two of them gives bit more effective implementation: int nlz32(unsigned x) { int m, y; int n = 32; y = -(int)(x >> 16); /* test left half of x */ m = (y >> 16) & 16; /* either 0 or 16 */ n -= m; /* remember the count */ x >>= m; /* significant 16 bits to lower */ y = -(int)(x >> 8); m = (y >> 8) & 8; /* either 0 or 8 */ n -= m; /* remeber the result */ x >>= m; /* significant byte to lower */ y = -(int)(x >> 4); m = (y >> 4) & 4; /* either 0 or 4 */ n -= m; /* remember the result */ x >>= m; y = -(int)(x >> 2); m = (y >> 2) & 2; n -= m; x >>= m; /* x = 0, 1, 2 or 3 */ m = (int)(x & ~(x >> 1)); /* m = 0, 1 or 2 */ return n - m; } Depending on the processor and/or compiler it could be advantageous to minimize the number of constants: int nlz32(unsigned x) { int m, s, y; int n = 32; s = n >> 1; y = -(int)(x >> s); /* test left half of x */ m = (y >> s) & s; /* either 0 or 16 */ n -= m; /* remember the count */ x >>= m; /* significant 16 bits to lower */ s >>= 1; y = -(int)(x >> s); m = (y >> s) & s; /* either 0 or 8 */ n -= m; /* remeber the result */ x >>= m; /* significant byte to lower */ s >>= 1; y = -(int)(x >> s); m = (y >> s) & s; /* either 0 or 4 */ n -= m; /* remember the result */ x >>= m; s >>= 1; y = -(int)(x >> s); m = (y >> s) & s; n -= m; x >>= m; /* x = 0, 1, 2 or 3 */ m = (int)(x & ~(x >> 1)); /* m = 0, 1 or 2 */ return n - m; } Hope this could be helpful. My apologies for your time if this is known. Tim