What is the fastest way to get an Inverse Square Root?

Everyone who works at a 2D or 3D development studio knows that eventually you will run into hardware that is too slow. Whether that hardware is what you currently have or what your target demographic has doesn't matter – you'll hit this limit, even if you try not to.

That's why some clever people came up with different ways to make things faster: separating the FPU from the integer unit, SIMD extensions like SSE and AVX, and the famous Quake III floating-point hack. But which one performs the fastest, and can we make that one even faster?

Inverse Square Root Functions

Everyone knows some of them, but this post focuses only on the following few:

  • Standard Math
  • Quake III Fast Inverse Square Root (aka the one that Carmack supposedly wrote)

Each of them is pretty easy to implement, so here is the code:

// Standard Math
float invsqrt(float v) {
    return 1.0f / sqrtf(v);
}

// Quake III / Carmack Version
#define USE_LOMONT_CONSTANT
#ifndef USE_LOMONT_CONSTANT
#define FASTINVSQRT 0x5F3759DF // Carmack
#else
#define FASTINVSQRT 0x5F375A86 // Chris Lomont
#endif

float invsqrt_q3(float v) {
    union {
        float f;
        int32_t u; // 32 bits on every platform, unlike long
    } y = { v };
    float x2 = v * 0.5f;
    y.u = FASTINVSQRT - (y.u >> 1);        // initial approximation via the bit hack
    y.f = y.f * (1.5f - (x2 * y.f * y.f)); // one Newton-Raphson iteration to refine it
    return y.f;
}
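
For reference, the approximation can be tightened further by running the Newton-Raphson step a second time – the original Quake III source even carries that second step as a commented-out line – at the cost of a few more multiplications. A sketch (the function name here is mine):

float invsqrt_q3_precise(float v) {
    union {
        float f;
        int32_t u;
    } y = { v };
    float x2 = v * 0.5f;
    y.u = FASTINVSQRT - (y.u >> 1);
    y.f = y.f * (1.5f - (x2 * y.f * y.f)); // 1st Newton-Raphson iteration
    y.f = y.f * (1.5f - (x2 * y.f * y.f)); // 2nd iteration, noticeably more accurate
    return y.f;
}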

Measuring Speed or Time Taken to Execute

Most people make the mistake of timing a function's execution only once and taking that number at face value. But with today's multi-core, heavily threaded systems that single sample tells you very little: the OS can interrupt your function at literally any point, and then your measured time is blown out by a context switch that has nothing to do with your code.

So the correct way is to test as many times as possible. For fast functions that means more than a million iterations; for slower ones somewhat less. Ideally you would also test across many machines at the same time, but we don't all have the ability to do that.

My test setup consisted of a single machine with an Intel i5-4690, a normal workload running next to the test to simulate an average gaming environment, and the same optimisations I have active in the target project (Optimise for Speed, No Runtime Checks, No Guards, No Security Checks, No Edit and Continue).

The code I used to test for time is below.

typedef std::tuple<std::chrono::high_resolution_clock::duration, float, float> test_data;
typedef float(*sqrt_func_t)(float);
test_data test(float testValue, uint64_t testSize, sqrt_func_t func) {
    float x = 0;
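    // x accumulates every result so the computed values are actually consumed
    // and the call cannot be optimised away.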

    std::chrono::high_resolution_clock::duration t_total = std::chrono::nanoseconds(0);
    for (uint64_t run = 0; run < testSize; run++) {
        auto t_start = std::chrono::high_resolution_clock::now();
        float y = func(testValue);
        auto t_time = std::chrono::high_resolution_clock::now() - t_start;
        x += y;
        t_total += t_time;
    }

    return std::make_tuple(t_total, testValue, x);
}

void printLog(const char* format, ...) {
    va_list args;
    va_start(args, format);
    std::vector<char> buf(_vscprintf(format, args) + 1);
    vsnprintf(buf.data(), buf.size(), format, args);
    va_end(args);
    std::cout << buf.data() << std::endl;
}

void printScore(const char* name, uint64_t timeNanoSeconds, uint64_t testSize) {
    uint64_t time_ns = timeNanoSeconds % 1000000000;
    uint64_t time_s = std::chrono::duration_cast<std::chrono::seconds>(std::chrono::nanoseconds(timeNanoSeconds)).count() % 60;
    uint64_t time_m = std::chrono::duration_cast<std::chrono::minutes>(std::chrono::nanoseconds(timeNanoSeconds)).count() % 60;
    uint64_t time_h = std::chrono::duration_cast<std::chrono::hours>(std::chrono::nanoseconds(timeNanoSeconds)).count();
    double time_single = (double)std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::nanoseconds(timeNanoSeconds)).count() / (double)testSize;
    uint64_t pffloatfix = (uint64_t)round(time_single * 1000000);
    printLog("| %-30s | %2llu:%02llu:%02llu.%09llu | %3lld.%06lld ns | %14llu |",
        name,
        time_h, time_m, time_s, time_ns,
        pffloatfix / 1000000, pffloatfix % 1000000, // Because fuck you %3.6f, you broken piece of shit.
        (uint64_t)floor(1000000.0 / time_single)
    );
}

int main(int argc, const char** argv) {
    float testValue = 1234.56789f;
    size_t testSize = 100000000;

    #ifdef _WIN32
    timeBeginPeriod(1);
    #endif

    std::cout << "InvSqrt Single Test" << std::endl;
    std::cout << " - Iterations: " << testSize << std::endl;
    std::cout << " - Tested Value: " << testValue << std::endl;

    printLog("| Test Name                      | Time (Total)       | Time (Single) | Score (ops/ms) |");
    printLog("|:-------------------------------|-------------------:|--------------:|---------------:|");

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test(testValue, testSize, invsqrt);
        const char* name = "InvSqrt";
        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test(testValue, testSize, invsqrt_q3);
        const char* name = "Quake III InvSqrt";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    #ifdef _WIN32
    timeEndPeriod(1);
    #endif

    getchar();
    return 0;
}

Results

Iterations: 10000000        32-Bit Single     64-Bit Single
InvSqrt                     44.891717 ns      14.055867 ns
Quake III                   42.461385 ns      13.487772 ns

These results are to be expected, since current- and last-generation CPUs have strong FPUs that can keep up with the integer units. And we don't hit any cache misses either, since the whole benchmark fits into the CPU cache.

So this tells us that standard floating-point math is only slightly slower than the Quake III version while also being more accurate. That "slightly slower" only grows the older the hardware gets, so it's a matter of picking the right tool for the right job here.
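
If you want to put a number on that accuracy difference, here is a quick sketch that compares the two functions above for a single value (it assumes <cstdio> and <cmath> are included; the error bound in the comment is the commonly cited figure for one Newton-Raphson iteration):

float v = 1234.56789f;
float exact = 1.0f / sqrtf(v);
float fast = invsqrt_q3(v);
// With one Newton-Raphson iteration the relative error of the Quake III hack
// typically stays below roughly 0.2%.
printf("exact: %f, fast: %f, relative error: %f%%\n",
    exact, fast, 100.0f * fabsf(fast - exact) / exact);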

But there is another CPU feature that has existed since 1999: Streaming SIMD Extensions, or SSE. It lets you do many calculations side by side instead of one after another. Is it worth it? Hell yes. Are we going to abuse it? You know it.

SSE Specific Performance

To understand why SSE is a big deal, you need to know what it can do and why you would use it. Imagine you have a set of four real numbers (single-precision floating-point values) and want to multiply them with another set of four real numbers. Normally you have to calculate each result one after another, but with SSE you can calculate all four at the same time. It's basically like having four of you working on the same math test.
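
As a minimal illustration of that idea (intrinsics from <xmmintrin.h>; the function name is mine):

void mul4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);    // load a[0..3]
    __m128 vb = _mm_loadu_ps(b);    // load b[0..3]
    __m128 vr = _mm_mul_ps(va, vb); // all four products computed at once
    _mm_storeu_ps(out, vr);         // out[i] = a[i] * b[i]
}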

But with that said, which of the functions can actually be given an SSE version? Turns out, all of them can – and all of them benefit from the performance boost.

InvSqrt with SSE

float invsqrt_sse(float v) {
    __m128 mv = _mm_set_ps(0, 0, 0, v); // v goes into the lowest lane, the rest stays unused
    const __m128 dv = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    auto sv = _mm_sqrt_ps(mv);
    sv = _mm_div_ps(dv, sv);
    float result;
    _mm_store_ss(&result, sv); // store only the lowest lane
    return result;
}
void invsqrt_sse2(float* v) {
    const __m128 dv = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    __m128 mv = _mm_set_ps(0, 0, v[1], v[0]); // v[0] and v[1] go into the two lowest lanes
    mv = _mm_sqrt_ps(mv);
    mv = _mm_div_ps(dv, mv);
    float result[4];
    _mm_storeu_ps(result, mv);
    v[0] = result[0];
    v[1] = result[1];
}
void invsqrt_sse4(float* v) {
    const __m128 dv = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f);
    __m128 mv = _mm_loadu_ps(v); // load all four values, keeping their order
    mv = _mm_sqrt_ps(mv);
    mv = _mm_div_ps(dv, mv);
    _mm_storeu_ps(v, mv);
}
void invsqrt_sse8(float* v) {
    invsqrt_sse4(v);
    invsqrt_sse4(v + 4);
}
void invsqrt_sse16(float* v) {
    invsqrt_sse8(v);
    invsqrt_sse8(v + 8);
}
void invsqrt_sse32(float* v) {
    invsqrt_sse16(v);
    invsqrt_sse16(v + 16);
}
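
A quick usage sketch for the batched variants – the caller hands over a pointer to at least as many floats as the function processes and gets the results back in place:

float values[4] = { 1.0f, 4.0f, 9.0f, 16.0f };
invsqrt_sse4(values); // values now hold 1.0, 0.5, 0.333..., 0.25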

Quake III with SSE

float invsqrt_q3_sse(float v) {
    const __m128i magic_constant = _mm_set_epi32(0, 0, 0, FASTINVSQRT);
    const __m128 zero_point_five = _mm_set_ps(0, 0, 0, 0.5f);
    const __m128 one_point_five = _mm_set_ps(0, 0, 0, 1.5f);

    __m128 value = _mm_set_ps(0, 0, 0, v); // y.f = v (lowest lane)
    __m128 halfvalue = _mm_mul_ps(value, zero_point_five); // x2 = v * 0.5f
    __m128i ivalue = _mm_castps_si128(value); // y.u (union) y.f
    ivalue = _mm_srai_epi32(ivalue, 1); // y.u >> 1
    ivalue = _mm_sub_epi32(magic_constant, ivalue); // FASTINVSQRT - (y.u >> 1)
    __m128 estimate = _mm_castsi128_ps(ivalue); // y.f (union) y.u

    // y.f = y.f * (1.5f - x2 * y.f * y.f) part
    value = _mm_mul_ps(estimate, estimate); // y.f * y.f
    value = _mm_mul_ps(value, halfvalue); // x2 * y.f * y.f
    value = _mm_sub_ps(one_point_five, value); // 1.5f - (x2 * y.f * y.f)
    value = _mm_mul_ps(estimate, value); // y.f * (1.5f - x2 * y.f * y.f)

    float result;
    _mm_store_ss(&result, value); // store only the lowest lane
    return result;
}
void invsqrt_q3_sse2(float* v) {
    const __m128i magic_constant = _mm_set_epi32(0, 0, FASTINVSQRT, FASTINVSQRT);
    const __m128 zero_point_five = _mm_set_ps(0, 0, 0.5f, 0.5f);
    const __m128 one_point_five = _mm_set_ps(0, 0, 1.5f, 1.5f);

    __m128 value = _mm_set_ps(0, 0, v[1], v[0]); // y.f = v (two lowest lanes)
    __m128 halfvalue = _mm_mul_ps(value, zero_point_five); // x2 = v * 0.5f
    __m128i ivalue = _mm_castps_si128(value); // y.u (union) y.f
    ivalue = _mm_srai_epi32(ivalue, 1); // y.u >> 1
    ivalue = _mm_sub_epi32(magic_constant, ivalue); // FASTINVSQRT - (y.u >> 1)
    __m128 estimate = _mm_castsi128_ps(ivalue); // y.f (union) y.u

    // y.f = y.f * (1.5f - x2 * y.f * y.f) part
    value = _mm_mul_ps(estimate, estimate); // y.f * y.f
    value = _mm_mul_ps(value, halfvalue); // x2 * y.f * y.f
    value = _mm_sub_ps(one_point_five, value); // 1.5f - (x2 * y.f * y.f)
    value = _mm_mul_ps(estimate, value); // y.f * (1.5f - x2 * y.f * y.f)

    // result
    float result[4];
    _mm_storeu_ps(result, value);
    v[0] = result[0];
    v[1] = result[1];
}
void invsqrt_q3_sse4(float* v) {
    const __m128i magic_constant = _mm_set_epi32(FASTINVSQRT, FASTINVSQRT, FASTINVSQRT, FASTINVSQRT);
    const __m128 zero_point_five = _mm_set_ps(0.5f, 0.5f, 0.5f, 0.5f);
    const __m128 one_point_five = _mm_set_ps(1.5f, 1.5f, 1.5f, 1.5f);

    __m128 value = _mm_loadu_ps(v); // y.f = v, all four values in order
    __m128 halfvalue = _mm_mul_ps(value, zero_point_five); // x2 = v * 0.5f
    __m128i ivalue = _mm_castps_si128(value); // y.u (union) y.f
    ivalue = _mm_srai_epi32(ivalue, 1); // y.u >> 1
    ivalue = _mm_sub_epi32(magic_constant, ivalue); // FASTINVSQRT - (y.u >> 1)
    __m128 estimate = _mm_castsi128_ps(ivalue); // y.f (union) y.u

    // y.f = y.f * (1.5f - x2 * y.f * y.f) part
    value = _mm_mul_ps(estimate, estimate); // y.f * y.f
    value = _mm_mul_ps(value, halfvalue); // x2 * y.f * y.f
    value = _mm_sub_ps(one_point_five, value); // 1.5f - (x2 * y.f * y.f)
    value = _mm_mul_ps(estimate, value); // y.f * (1.5f - x2 * y.f * y.f)

    // result
    _mm_storeu_ps(v, value);
}
void invsqrt_q3_sse8(float* v) {
    invsqrt_q3_sse4(v);
    invsqrt_q3_sse4(v + 4);
}
void invsqrt_q3_sse16(float* v) {
    invsqrt_q3_sse8(v);
    invsqrt_q3_sse8(v + 8);
}
void invsqrt_q3_sse32(float* v) {
    invsqrt_q3_sse16(v);
    invsqrt_q3_sse16(v + 16);
}

Technically this is identical to the normal Quake III version, except that it now relies on SSE2 integer instructions, which are thankfully part of every 64-bit x86 CPU. If you look at the assembly for this one, you will notice that the Visual Studio compiler avoids using XMM0 for passing and retrieving the float and instead goes through memory via ebp, most likely because it has not been taught to optimise this pattern for SSE2 yet.
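
Every x86-64 CPU is guaranteed to have SSE2, but if you ship a 32-bit build that may land on really old hardware, checking at runtime before calling into these functions only takes a few lines with MSVC's __cpuid. A sketch (the function name is mine):

#include <intrin.h>

bool cpu_has_sse2() {
    int info[4];
    __cpuid(info, 1);                  // CPUID leaf 1: processor feature flags
    return (info[3] & (1 << 26)) != 0; // EDX bit 26 = SSE2
}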

New Timing Code

Obviously we have to make some adjustments to test the new code, mostly so the benefits of SSE actually show up in the numbers. We'll also test up to 32 SSE operations back to back to see the true speed boost.

typedef std::tuple<std::chrono::high_resolution_clock::duration, float, float> test_data;
typedef float(*sqrt_func_t)(float);
test_data test(float testValue, uint64_t testSize, sqrt_func_t func) {
    float x = 0;

    std::chrono::high_resolution_clock::duration t_total = std::chrono::nanoseconds(0);
    for (uint64_t run = 0; run < testSize; run++) {
        auto t_start = std::chrono::high_resolution_clock::now();
        float y = func(testValue);
        auto t_time = std::chrono::high_resolution_clock::now() - t_start;
        x += y;
        t_total += t_time;
    }

    return std::make_tuple(t_total, testValue, x);
}

typedef void(*sqrt_func_sse_t)(float*);
test_data test_sse_var(float testValue, uint64_t testSize, size_t comboSize, sqrt_func_sse_t func) {
    // Version for testing SSE functions that process comboSize floats at once.

    float x = 0;
    std::vector<float> y(comboSize);
    uint64_t tmpTestSizeLoop = testSize / comboSize; // Full batches of comboSize values.
    uint64_t tmpTestSizeRem = testSize % comboSize; // Remaining values.

    std::chrono::high_resolution_clock::duration t_total = std::chrono::nanoseconds(0);
    for (uint64_t run = 0; run < tmpTestSizeLoop; run++) {
        for (size_t vi = 0; vi < comboSize; vi++)
            y[vi] = testValue + (float)vi;

        auto t_start = std::chrono::high_resolution_clock::now();
        func(y.data());
        auto t_time = std::chrono::high_resolution_clock::now() - t_start;
        for (size_t vi = 0; vi < comboSize; vi++)
            x += y[vi];
        t_total += t_time;
    }
    for (size_t vi = 0; vi < comboSize; vi++)
        y[vi] = testValue + (float)vi;
    auto t_start = std::chrono::high_resolution_clock::now();
    func(y.data());
    auto t_time = std::chrono::high_resolution_clock::now() - t_start;
    for (size_t run = 0; run < tmpTestSizeRem; run++) {
        x += y[run];
    }
    t_total += t_time;

    return std::make_tuple(t_total, testValue, x);
}

void printLog(const char* format, ...) {
    va_list args;
    va_start(args, format);
    std::vector<char> buf(_vscprintf(format, args) + 1);
    vsnprintf(buf.data(), buf.size(), format, args);
    va_end(args);
    std::cout << buf.data() << std::endl;
}

void printScore(const char* name, uint64_t timeNanoSeconds, uint64_t testSize) {
    uint64_t time_ns = timeNanoSeconds % 1000000000;
    uint64_t time_s = std::chrono::duration_cast<std::chrono::seconds>(std::chrono::nanoseconds(timeNanoSeconds)).count() % 60;
    uint64_t time_m = std::chrono::duration_cast<std::chrono::minutes>(std::chrono::nanoseconds(timeNanoSeconds)).count() % 60;
    uint64_t time_h = std::chrono::duration_cast<std::chrono::hours>(std::chrono::nanoseconds(timeNanoSeconds)).count();
    double time_single = (double)std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::nanoseconds(timeNanoSeconds)).count() / (double)testSize;
    uint64_t pffloatfix = (uint64_t)round(time_single * 1000000);
    printLog("| %-30s | %2llu:%02llu:%02llu.%09llu | %3lld.%06lld ns | %14llu |",
        name,
        time_h, time_m, time_s, time_ns,
        pffloatfix / 1000000, pffloatfix % 1000000, // Because fuck you %3.6f, you broken piece of shit.
        (uint64_t)floor(1000000.0 / time_single)
    );
}

int main(int argc, const char** argv) {
    float testValue = 1234.56789f;
    size_t testSize = 100000000;

    #ifdef _WIN32
    timeBeginPeriod(1);
    #endif

    std::cout << "InvSqrt Single Test" << std::endl;
    std::cout << " - Iterations: " << testSize << std::endl;
    std::cout << " - Tested Value: " << testValue << std::endl;

    printLog("| Test Name                      | Time (Total)       | Time (Single) | Score (ops/ms) |");
    printLog("|:-------------------------------|-------------------:|--------------:|---------------:|");

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test(testValue, testSize, invsqrt);
        const char* name = "InvSqrt";
        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test(testValue, testSize, invsqrt_q3);
        const char* name = "Quake III InvSqrt";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test(testValue, testSize, invsqrt_sse);
        const char* name = "SSE InvSqrt";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test(testValue, testSize, invsqrt_q3_sse);
        const char* name = "Quake III SSE InvSqrt";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 2, invsqrt_sse2);
        const char* name = "SSE InvSqrt (2 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 2, invsqrt_q3_sse2);
        const char* name = "Quake III SSE InvSqrt (2 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 4, invsqrt_sse4);
        const char* name = "SSE InvSqrt (4 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 4, invsqrt_q3_sse4);
        const char* name = "Quake III SSE InvSqrt (4 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 8, invsqrt_sse8);
        const char* name = "SSE InvSqrt (8 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 8, invsqrt_q3_sse8);
        const char* name = "Quake III SSE InvSqrt (8 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 16, invsqrt_sse16);
        const char* name = "SSE InvSqrt (16 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 16, invsqrt_q3_sse16);
        const char* name = "Quake III SSE InvSqrt (16 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 32, invsqrt_sse32);
        const char* name = "SSE InvSqrt (32 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto tv = test_sse_var(testValue, testSize, 32, invsqrt_q3_sse32);
        const char* name = "Quake III SSE InvSqrt (32 Ops)";

        auto tvc = std::get<0>(tv);
        printScore(name, std::chrono::duration_cast<std::chrono::nanoseconds>(tvc).count(), testSize);
    }

    #ifdef _WIN32
    timeEndPeriod(1);
    #endif

    getchar();
    return 0;
}

Results

Iterations: 10000000          32-Bit Single    % Diff       64-Bit Single    % Diff
InvSqrt                       44.891717 ns     Reference    14.055867 ns     Reference
InvSqrt SSE                   41.324776 ns     -7.95%       12.696187 ns     -9.67%
InvSqrt SSE (2 Ops)           20.076773 ns     -55.28%      13.450678 ns     -4.18%
InvSqrt SSE (4 Ops)            9.538691 ns     -78.75%       6.673227 ns     -52.52%
InvSqrt SSE (8 Ops)            4.857515 ns     -89.18%       3.906945 ns     -72.20%
InvSqrt SSE (16 Ops)           2.727815 ns     -93.92%       2.033558 ns     -85.53%
InvSqrt SSE (32 Ops)           1.627955 ns     -96.37%       1.183631 ns     -91.58%
Quake III                     42.461385 ns     Reference    13.487772 ns     Reference
Quake III SSE                 41.038958 ns     -3.35%       13.459031 ns     -0.21%
Quake III SSE (2 Ops)         20.222290 ns     -52.37%      14.1913031 ns    +5.22%
Quake III SSE (4 Ops)          9.570546 ns     -77.46%       6.694998 ns     -50.36%
Quake III SSE (8 Ops)          5.180970 ns     -87.80%       3.605875 ns     -73.27%
Quake III SSE (16 Ops)         2.994105 ns     -92.95%       2.044841 ns     -84.84%
Quake III SSE (32 Ops)         1.724123 ns     -95.94%       1.194048 ns     -91.15%

By optimising our standard math code to use SSE we've beaten the plain Quake III hack and almost matched the SSE-enhanced Quake III code. Double the number of simultaneous SSE calculations and we beat the Quake III hack outright. It seems the Quake III hack doesn't scale as well as the standard approach once SSE is involved.

On 64-bit the performance boost isn't as large. This is because the CPU has more registers readily available and the compiler actually optimises for them, improving overall performance across the board. Additionally, we now have to take special care to use SSE to its full potential, since it can actually hurt performance slightly, as it does with the Quake III hack.

What is actually faster now?

That really depends on what sort of system you target. If you target older systems or notebooks/netbooks, you will find that the Quake III version offers better performance; if you target anything from the last five years, the SSE-enhanced versions win.

Things will probably look a lot different when it comes to double precision floating point math, which is what I’ll be testing next. If the results are much different, I’ll make a new post.
