TrueNAS: Passive file corruption with ZStd compression

Earlier this week I attempted to make the maintainers of FreeNAS/TrueNAS aware that their ZStd compression scheme in ZFS causes data corruption over time. I've verified that this happens on physical hardware as well as in virtual machines, in several configurations. Despite that, iXsystems simply closed the report because the affected live machine was a virtual machine. That leaves me with only one option: putting it on my blog for everyone to see.

I first found out about this corruption issue months ago, but wasn't able to figure out why or how it happened. The host and guest never lost power, everything has backup power in case the mains go out, and storage is even accessed uncached with a write-through policy. Yet, despite all efforts to mitigate the issue, data on TrueNAS kept corrupting. Data that was rarely written to or read from had the most corruption.

The Testing Phase

In any other circumstance I'd assume that the problem is related to the controller of the drives, but not in this case. I never had a single corrupted file with LZ4 compression, so why did the corruption start with ZStd? It shouldn't corrupt anything if implemented correctly, yet here I am with thousands of corrupted files. So I began testing, setting up a basic virtual machine that I could instantly reset once I had results.

And results are what I got:

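CPU                     Compression  Pool Type  Status
AMD Zen1 (Passthrough)  None         RAID0
AMD Zen1 (Passthrough)  None         RAID1
AMD Zen1 (Passthrough)  None         RAID5
AMD Zen1 (Passthrough)  LZ4          RAID0
AMD Zen1 (Passthrough)  LZ4          RAID1
AMD Zen1 (Passthrough)  LZ4          RAID5
AMD Zen1 (Passthrough)  ZStd         RAID0
AMD Zen1 (Passthrough)  ZStd         RAID1
AMD Zen1 (Passthrough)  ZStd         RAID5
AMD Zen2-like           None         RAID0
AMD Zen2-like           None         RAID1
AMD Zen2-like           None         RAID5
AMD Zen2-like           LZ4          RAID0
AMD Zen2-like           LZ4          RAID1
AMD Zen2-like           LZ4          RAID5
AMD Zen2-like           ZStd         RAID0
AMD Zen2-like           ZStd         RAID1
AMD Zen2-like           ZStd         RAID5
AMD Bulldozer-like      None         RAID0
AMD Bulldozer-like      None         RAID1
AMD Bulldozer-like      None         RAID5
AMD Bulldozer-like      LZ4          RAID0
AMD Bulldozer-like      LZ4          RAID1
AMD Bulldozer-like      LZ4          RAID5
AMD Bulldozer-like      ZStd         RAID0
AMD Bulldozer-like      ZStd         RAID1
AMD Bulldozer-like      ZStd         RAID5
Intel 7th-gen-like      None         RAID0
Intel 7th-gen-like      None         RAID1
Intel 7th-gen-like      None         RAID5
Intel 7th-gen-like      LZ4          RAID0
Intel 7th-gen-like      LZ4          RAID1
Intel 7th-gen-like      LZ4          RAID5
Intel 7th-gen-like      ZStd         RAID0
Intel 7th-gen-like      ZStd         RAID1
Intel 7th-gen-like      ZStd         RAID5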

As the host system itself is Zen1 based (AMD Threadripper 1950X), reproducing the Zen1 VM result was easy. The test directly on physical hardware had the exact same results, pointing at the common factors: AMD Zen and ZStd. But why was ZStd even corrupting files? I have been using it for real-time compression of networked data for a year now, and never had any corruption happen. That meant I had to do more tests, and since I had already confirmed that this happens on physical hardware, I continued the tests in virtual machines.

So next up was figuring out how the corruption happened, as I had already seen that it never touched regularly used files on the affected live system. I tested various things, but the one that stood out the most was: Copy a set of known good data, then write a few thousand tiny files in partial sections over and over - just like what a torrent client does. After downloading an entire 1GiB file using qBittorrent, 0.2% of the known good data had been corrupted, even though nothing had written to it directly.
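The write pattern described above can be roughly imitated without a torrent client. The following is a minimal sketch, not the actual test setup: `write_in_random_pieces` and `piece_payload` are hypothetical names, the 64 KiB piece size is an assumption taken from the write size mentioned later in this post, and the payload is a simple repeating pattern rather than real download data.

```python
import hashlib
import os
import random


def piece_payload(index, length):
    """Deterministic payload for piece `index`, so the file can be re-checked later."""
    pattern = hashlib.sha256(str(index).encode()).digest()  # 32 bytes
    repeats = length // len(pattern) + 1
    return (pattern * repeats)[:length]


def write_in_random_pieces(path, total_size, piece_size=64 * 1024, seed=0):
    """Fill `path` with `total_size` bytes by writing fixed-size pieces in random order.

    Returns the number of piece writes issued.
    """
    piece_count = (total_size + piece_size - 1) // piece_size
    order = list(range(piece_count))
    random.Random(seed).shuffle(order)

    # Pre-size the file so every piece lands at its final offset.
    with open(path, "wb") as f:
        f.truncate(total_size)

    with open(path, "r+b") as f:
        for index in order:
            offset = index * piece_size
            length = min(piece_size, total_size - offset)
            f.seek(offset)
            f.write(piece_payload(index, length))
        # Push the data out of the OS cache before returning.
        f.flush()
        os.fsync(f.fileno())
    return piece_count
```

Calling this repeatedly against a dataset that sits next to an untouched set of known good files would approximate the "download the same file over and over" step.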

The corruption always appears at offsets that are 16kB aligned relative to the start of the file, which still does not make sense. Repeating the same download several times saw the number of corrupted files shoot up. After just 50 downloads of the same file, 3.7% of data had been corrupted - sometimes even including the file we had just written to. Quite an alarming failure rate for just 819200 partial 64kB sized writes, though being a synthetic test, it likely exaggerates the actual corruption rate.
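Both the alignment observation and the write count can be checked with a few lines. This is a sketch under assumptions: `corrupted_run_starts` is a hypothetical helper that compares a known good copy against a suspect copy, and the arithmetic treats "16kB" and "64kB" as binary units (16 KiB and 64 KiB), which is what makes the 819200 figure come out exactly.

```python
BLOCK = 16 * 1024  # assumed meaning of "16kB aligned"


def corrupted_run_starts(good_path, suspect_path, chunk_size=1 << 20):
    """Return the byte offset at which each run of differing bytes begins."""
    starts = []
    in_run = False
    offset = 0
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        while True:
            a = good.read(chunk_size)
            b = suspect.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Whole chunk matches, so any open run ends here.
                in_run = False
                offset += len(a)
                continue
            longest = max(len(a), len(b))
            for i in range(longest):
                same = i < len(a) and i < len(b) and a[i] == b[i]
                if same:
                    in_run = False
                elif not in_run:
                    starts.append(offset + i)
                    in_run = True
            offset += longest
    return starts


def all_aligned(offsets, block=BLOCK):
    """True if every offset is a multiple of `block`."""
    return all(o % block == 0 for o in offsets)


# 50 downloads of a 1 GiB file, written in 64 KiB pieces.
writes = 50 * (1 << 30) // (64 * 1024)
assert writes == 819200
```

One caveat: if the first bytes of a corrupted block happen to match the original, the reported start lands slightly after the true block boundary, so an unaligned result does not by itself rule out block-sized corruption.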

While this was already alarming, the real alarm bells started ringing when I noticed that the virtual machine I had set up as a control also started corrupting. It had been running idle, only holding the known good data, with no write tests or similar happening. It was on entirely different physical disks, on a different NUMA node, and on an entirely different disk controller. Without any external or internal input, data had corrupted - 0.037% of it to be exact.
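A corruption figure like this can be obtained by recording checksums of the known good data once and re-checking them later. The sketch below is one way to do it, not the tooling used for these tests: `build_manifest` and `corrupted_files` are hypothetical names, and it counts corrupted files, whereas the percentage above may just as well have been measured in bytes.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def corrupted_files(root, manifest):
    """Return the relative paths whose current hash no longer matches the manifest."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            bad.append(rel)
    return bad
```

The manifest should be built from the source copy before it is placed on the pool under test and stored somewhere else, so that it cannot be affected by the same problem. The corrupted share is then `len(corrupted_files(root, manifest)) / len(manifest)`.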

The Reporting Phase

With this information in hand, I attempted to inform iXsystems - the company behind TrueNAS - of this problem. Using the built-in bug report form, I included all the information needed to reproduce the bug, and waited. After just one day, the bug report was closed as "Not Applicable". Apparently iXsystems just closes any bug report that was made with TrueNAS running in a virtual machine, despite virtual machines being one of the best ways to mass test your software.

So since iXsystems does not appear to care enough, only wants reproduction on physical hardware, and also claims to not have enough resources to test every possible configuration (if only there was a way they could do that...), I'll instead publish it on my blog. Perhaps it'll raise some eyebrows somewhere, or make people aware of the problem. In the best case, it'll prevent someone else from encountering the same problem and wondering what is going on.

Anyway, this experience with iXsystems has woken me up to the possibility of just not using TrueNAS for my NAS, so my next personal project is migrating away from this terribly maintained software package.
