What is the fastest way to get an Inverse Square Root? (Part 2)

Part 1: What is the fastest way to get an Inverse Square Root?

Last time I covered the fastest way for single precision floating point, but what about double precision floating point numbers? Do they behave the same, or will we run into even more issues on older hardware?

Since the market is starting to look towards 128-bit integers and quad precision floating point numbers, it’s time to test double precision as well while it’s still ‘fresh’.

The Different Ways

This time I’m doing it slightly differently: I will include SSE right away, since every CPU that supports 64-bit x86 also supports SSE2 instructions. The algorithm selection stays the same: Standard Math, Quake III Fast InvSqrt (with an updated magic constant), SSE Math, and Quake III Fast InvSqrt with SSE.

Standard Math and SSE Math

#include <math.h>      // sqrt
#include <emmintrin.h> // SSE2 intrinsics

// Math
double invsqrt(double v) {
    return 1.0 / sqrt(v);
}

// SSE Math
double invsqrt_sse(double v) {
    const __m128d dividend = _mm_set_sd(1.0); // 1.0 in the low lane
    __m128d value = _mm_set_sd(v);            // v in the low lane
    value = _mm_sqrt_pd(value);
    value = _mm_div_pd(dividend, value); // unused high lane computes 0/0, which is ignored
    double res;
    _mm_storel_pd(&res, value); // only the low lane holds the result
    return res;
}
void invsqrt_sse2(double* v) {
    const __m128d dividend = _mm_set_pd(1.0, 1.0);
    __m128d value = _mm_loadu_pd(v); // v[0] in the low lane, v[1] in the high lane
    value = _mm_sqrt_pd(value);
    value = _mm_div_pd(dividend, value);
    _mm_storeu_pd(v, value);
}
void invsqrt_sse4(double* v) {
    invsqrt_sse2(v);
    invsqrt_sse2(v + 2);
}
void invsqrt_sse8(double* v) {
    invsqrt_sse4(v);
    invsqrt_sse4(v + 4);
}
void invsqrt_sse16(double* v) {
    invsqrt_sse8(v);
    invsqrt_sse8(v + 8);
}
void invsqrt_sse32(double* v) {
    invsqrt_sse16(v);
    invsqrt_sse16(v + 16);
}
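
If you want to sanity check the batched variants, a small test driver like the one below does the job. This is just an illustrative sketch I’m adding, not part of the original post; it assumes the functions and includes above are in the same file.

// Sketch: fill an array, run the widest batched variant and spot-check
// the results against the plain 1.0 / sqrt() reference.
#include <stdio.h>

int main() {
    double data[32];
    for (int i = 0; i < 32; ++i)
        data[i] = (double)(i + 1);

    invsqrt_sse32(data); // overwrites each element with 1 / sqrt(element)

    for (int i = 0; i < 32; ++i)
        printf("%2d: %.15f (reference: %.15f)\n", i + 1, data[i], 1.0 / sqrt((double)(i + 1)));
    return 0;
}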

Quake III and Quake III with SSE

// Quake III
#define Q3_INVSQRT_CONSTANT 0x5fe6eb50c7b537a9ll
double invsqrt_q3(double v) {
    union {
        double f;
        long long u;
    } y = { v };
    double x2 = y.f * 0.5;
    y.u = Q3_INVSQRT_CONSTANT - (y.u >> 1);
    y.f = y.f * (1.5 - (x2 * y.f * y.f)); // one Newton-Raphson iteration
    return y.f;
}

// Quake III SSE
double invsqrt_q3_sse(double v) {
    const __m128i magic_constant = _mm_set_epi64x(0, Q3_INVSQRT_CONSTANT); // constant in the low lane
    const __m128d zero_point_five = _mm_set_sd(0.5);
    const __m128d one_point_five = _mm_set_sd(1.5);

    __m128d value = _mm_set_sd(v); // v in the low lane
    __m128d halfvalue = _mm_mul_pd(value, zero_point_five); // x2 = v * 0.5
    __m128i ivalue = _mm_castpd_si128(value);
    //__m128i ivalue2 = _mm_srli_epi64(ivalue, 1);
    ivalue.m128i_i64[0] = ivalue.m128i_i64[0] >> 1; // No arithmetic shift right available for 64-bit ints (MSVC-specific element access).
    ivalue = _mm_sub_epi64(magic_constant, ivalue);
    __m128d estimate = _mm_castsi128_pd(ivalue);

    // y.f = y.f * (1.5 - (x2 * y.f * y.f)) part
    value = _mm_mul_pd(estimate, estimate);    // y.f * y.f
    value = _mm_mul_pd(value, halfvalue);      // x2 * y.f * y.f
    value = _mm_sub_pd(one_point_five, value); // 1.5 - (x2 * y.f * y.f)
    value = _mm_mul_pd(estimate, value);       // y.f * (1.5 - ...)

    double res;
    _mm_storel_pd(&res, value);
    return res;
}
void invsqrt_q3_sse2(double* v) {
    const __m128i magic_constant = _mm_set_epi64x(Q3_INVSQRT_CONSTANT, Q3_INVSQRT_CONSTANT);
    const __m128d zero_point_five = _mm_set_pd(0.5, 0.5);
    const __m128d one_point_five = _mm_set_pd(1.5, 1.5);

    __m128d value = _mm_loadu_pd(v); // v[0] in the low lane, v[1] in the high lane
    __m128d halfvalue = _mm_mul_pd(value, zero_point_five); // x2 = v * 0.5
    __m128i ivalue = _mm_castpd_si128(value);
    //__m128i ivalue2 = _mm_srli_epi64(ivalue, 1);
    ivalue.m128i_i64[0] = ivalue.m128i_i64[0] >> 1; // No arithmetic shift right available for 64-bit ints (MSVC-specific element access).
    ivalue.m128i_i64[1] = ivalue.m128i_i64[1] >> 1;
    ivalue = _mm_sub_epi64(magic_constant, ivalue);
    __m128d estimate = _mm_castsi128_pd(ivalue);

    // y.f = y.f * (1.5 - (x2 * y.f * y.f)) part
    value = _mm_mul_pd(estimate, estimate);    // y.f * y.f
    value = _mm_mul_pd(value, halfvalue);      // x2 * y.f * y.f
    value = _mm_sub_pd(one_point_five, value); // 1.5 - (x2 * y.f * y.f)
    value = _mm_mul_pd(estimate, value);       // y.f * (1.5 - ...)

    _mm_storeu_pd(v, value);
}
void invsqrt_q3_sse4(double* v) {
    invsqrt_q3_sse2(v);
    invsqrt_q3_sse2(v + 2);
}
void invsqrt_q3_sse8(double* v) {
    invsqrt_q3_sse4(v);
    invsqrt_q3_sse4(v + 4);
}
void invsqrt_q3_sse16(double* v) {
    invsqrt_q3_sse8(v);
    invsqrt_q3_sse8(v + 8);
}
void invsqrt_q3_sse32(double* v) {
    invsqrt_q3_sse16(v);
    invsqrt_q3_sse16(v + 16);
}
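
To get a feel for how well the updated magic constant plus the single Newton-Raphson step approximates the real result, a quick error check along these lines can be used. This is an illustrative sketch I’m adding, not part of the original benchmark; it assumes invsqrt_q3 from above is in scope.

// Sketch: measure the worst relative error of the double-precision
// Quake III approximation against the exact 1.0 / sqrt() value.
#include <math.h>
#include <stdio.h>

int main() {
    double worst = 0.0;
    for (double v = 0.001; v < 1000000.0; v *= 1.1) {
        double approx = invsqrt_q3(v);
        double exact = 1.0 / sqrt(v);
        double error = fabs(approx - exact) / exact;
        if (error > worst) worst = error;
    }
    printf("Worst relative error: %f%%\n", worst * 100.0);
    return 0;
}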

Some may ask why there are plain scalar bit shifts in there, and the answer is quite simple: there is no _mm_srai_epi64 in SSE; a 64-bit arithmetic right shift intrinsic only arrives with AVX-512. A possible SSE-only workaround is sketched below.
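
For completeness: an arithmetic shift right by one can be emulated with SSE2 alone by saving the sign bit and ORing it back after a logical shift. And since the inputs to an inverse square root are positive anyway, their sign bit is zero, so the plain _mm_srli_epi64 from the commented-out line would already give the same bits. The helper below is a sketch of my own (hypothetical name, not what was benchmarked):

// Sketch: emulate a 64-bit arithmetic shift right by 1 using SSE2 only.
// A logical shift drops the sign bit, so save it first and OR it back in.
static __m128i srai1_epi64_sse2(__m128i x) {
    const __m128i sign = _mm_and_si128(x, _mm_set1_epi64x((long long)0x8000000000000000ULL));
    return _mm_or_si128(_mm_srli_epi64(x, 1), sign);
}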

Test Environment

Like before, the test environment consists of a typical gaming workload running side-by-side with the actual test to simulate a real-world situation. Timing is done over a large number of iterations and the average is taken as an estimate for a single call. I’m still using an i5-4690. The timing code was taken from the previous part and adjusted to use doubles instead of floats.
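
The exact timing code isn’t reproduced here; as a rough illustration of the approach (my own sketch, assuming std::chrono rather than whatever the original harness used), it boils down to something like this:

// Sketch: average many iterations into a per-call time estimate.
#include <chrono>
#include <stdio.h>

template <typename Fn>
double time_per_call_ns(Fn&& fn, long long iterations) {
    auto start = std::chrono::high_resolution_clock::now();
    for (long long i = 0; i < iterations; ++i)
        fn();
    auto end = std::chrono::high_resolution_clock::now();
    long long total_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    return (double)total_ns / (double)iterations;
}

int main() {
    volatile double sink = 0.0; // keeps the compiler from optimizing the call away
    double per_call = time_per_call_ns([&] { sink = sink + invsqrt(1234.5678); }, 10000000);
    printf("InvSqrt: %f ns per call\n", per_call);
    return 0;
}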

The Timing Results

Iterations: 10000000      32-Bit Single   % Diff      64-Bit Single   % Diff
InvSqrt                   43.943808       Reference   14.266584       Reference
InvSqrt SSE               41.512343       -5.60%      14.060391       -1.45%
InvSqrt SSE (2 Ops)       19.545172       -62.35%     13.363418       -6.33%
InvSqrt SSE (4 Ops)        9.870345       -77.55%      6.974453       -51.11%
InvSqrt SSE (8 Ops)        5.371453       -87.78%      3.857693       -72.96%
InvSqrt SSE (16 Ops)       3.126460       -92.89%      2.394183       -83.22%
InvSqrt SSE (32 Ops)       2.385877       -94.57%      1.771858       -87.58%
Quake III                 41.432505       Reference   13.484680       Reference
Quake III SSE             41.964349       -1.28%      14.153586       -4.96%
Quake III SSE (2 Ops)     21.259890       -48.69%     15.219657       -12.87%
Quake III SSE (4 Ops)     12.600375       -69.59%      7.760635       -42.45%
Quake III SSE (8 Ops)      7.929093       -80.86%      4.499441       -66.63%
Quake III SSE (16 Ops)     6.631565       -83.99%      3.006890       -77.70%
Quake III SSE (32 Ops)     5.980866       -85.56%      3.552134       -73.66%

We can immediately see that, on both architectures, the first real performance boost once again happens at 4 SSE calculations. We can also see that the Quake III SSE code scales worse the more data we send it, especially on 64-bit. The reason is that there is no 64-bit arithmetic right shift intrinsic in SSE, so the values have to go back through normal registers for that step.

If we compare this with the results from the single precision test, we can tell which one is faster: single precision. But on 64-bit the difference is almost zero, and once compilers take advantage of the XMM8-15 registers in addition to XMM0-7, we will likely see a big jump in double precision performance, and possibly in single precision performance as well.

Until then, use whatever floating point precision fits your data. The performance difference between single and double precision is so small that it barely matters anymore. “Single precision floating point is faster than double precision” will soon no longer hold true, especially when CPUs need to switch into another state to handle single precision math.

Xaymar out.
