Oh, what's your use-case for ECC ram?

vbezhenar · on Jan 31, 2018

A better question would be what the use-case for non-ECC RAM is. I wouldn't want unreliable hardware for any task, even as simple as gaming. It's a shame that ECC RAM is available only to enthusiasts and at an unnecessary premium. ECC really should be the baseline. Computers are meant to be reliable.

srcmap · on Jan 31, 2018

Is there any Windows/Linux utility that shows the number of ECC errors corrected since boot?

If not, AMD should write and promote one.

I'm very curious how often this happens in normal home/office usage.

I used to work for a silicon company that took an embedded network switch system with ECC logic to a nuclear lab for testing, to verify and showcase the ECC functionality.

speleo_engr · on Feb 1, 2018

In Linux, yes, there is a service called mcelog and a utility from the edac-utils package called edac-util.

You will see correctable ECC errors on systems. How frequently honestly seems to depend on the workload and the system itself. My suspicion is that they are often caused by poor PCB layout, and ECC saves you. I spent literally weeks (nights, weekends) chasing down an issue I thought was a software bug but that turned out to be a board layout issue on an embedded system. If the system had had ECC, the error would either have been corrected or we would have gotten the uncorrectable ECC error trap. Since then, every workstation/server/desktop I spec has ECC. I wish more laptops had it.
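For anyone who wants the raw numbers without installing anything, here is a minimal sketch that reads the per-memory-controller counters the Linux EDAC subsystem exposes under sysfs. It assumes an EDAC driver for your memory controller is loaded; the function name `read_edac_counts` is made up for illustration, and the counters cover the time since the driver was loaded or last reset, which is usually (not always) the same as since boot.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumption: an EDAC driver is loaded. If the directory is missing
    (no driver, or not Linux) an empty dict is returned.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: {ce} corrected, {ue} uncorrected")
```

An empty result only means no counters were found; it does not prove the machine has working ECC.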

srcmap · on Feb 2, 2018

Thanks,

Tried edac-util on 5 of our 30 or so servers.

   "Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz" with 128G RAM each. 

   edac-util: No errors to report.

System uptime is ~30+ days. We recently had to move those servers. I wish I had run this command before the move; the uptime would have been 600+ days for some of the servers.

mjevans · on Jan 31, 2018

Even if you never personally encounter an error, knowing that single-bit correction and likely detection of multi-bit errors exist will save you in terms of peace of mind and when troubleshooting other issues.

Plus, if you're scrubbing your storage the last thing you want is a memory error killing your data.
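To make the "correct one bit, detect more" behaviour concrete, here is a toy sketch of an extended Hamming (8,4) code, the same single-error-correct, double-error-detect idea at a tiny scale. Real ECC DIMMs typically protect 64 data bits with 8 check bits, and the exact code used by any given memory controller is not something this sketch claims to reproduce; the function names are made up for illustration.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold Hamming parity, 0 holds overall parity


def encode(nibble):
    """Encode a 4-bit value into an 8-bit SECDED codeword (list of bits)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1  # overall parity over the other seven bits
    return bits


def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(bits) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1  # odd parity: treated as a single flip and repaired
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity, nonzero syndrome: two flips
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                 # one flipped bit
print(decode(word))          # value is recovered and flagged as corrected
word[2] ^= 1                 # a second flipped bit in the same word
print(decode(word))          # detected, but not repairable
```

Three or more flips in one word can be miscorrected or missed by a code like this, which is why detection beyond two bits is only "likely".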

imtringued · on Feb 1, 2018

The problem with bit flips is that they accumulate in high-uptime systems.

If you reboot your PC at least once every week it's not going to be a problem.

JoeAltmaier · on Feb 1, 2018

Early ECC days, you 'washed' memory to fix this. On a read, a single-bit ECC error is actually repaired by the hardware. To get the most benefit from this you would want to read every allocated memory location periodically, 'washing it clean' so the accumulated errors wouldn't become double-bit errors (unrecoverable).

I'd put a wash routine in the background process, where it would string-move a block of memory to nowhere in a round-robin way. Not a terrible hit on the cache; we're idle when in the background task so not impacting the most used code. Some latency issue with interrupts and the like.
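A rough simulation of that round-robin wash, purely as a sketch: memory is modelled as a count of accumulated flipped bits per word, and, following the description above, a read of a word with one bad bit repairs it. The class and method names are hypothetical, and the model assumes each injected flip hits a different bit of the word.

```python
class WashedMemory:
    """Toy model: flips[a] is the number of flipped bits accumulated in word a."""

    def __init__(self, n_words, block_size):
        self.flips = [0] * n_words
        self.block_size = block_size
        self.cursor = 0  # start of the next block to wash

    def inject_flip(self, addr):
        """Simulate one more bit flipping in the word at addr."""
        self.flips[addr] += 1

    def read(self, addr):
        """Read a word the way the comment describes ECC hardware behaving."""
        if self.flips[addr] == 1:
            self.flips[addr] = 0  # single-bit error repaired on read
            return "corrected"
        if self.flips[addr] >= 2:
            return "uncorrectable"  # too late, the errors have piled up
        return "ok"

    def wash_step(self):
        """Read one block, then advance round-robin. Call from the idle task."""
        end = min(self.cursor + self.block_size, len(self.flips))
        results = [self.read(a) for a in range(self.cursor, end)]
        self.cursor = end % len(self.flips)  # wrap to the start after the last block
        return results


mem = WashedMemory(n_words=8, block_size=4)
mem.inject_flip(5)
mem.wash_step()          # washes words 0-3, nothing to fix
print(mem.wash_step())   # washes words 4-7 and repairs word 5
```

The point of the small block size is the same as in the original routine: each idle pass does a bounded amount of work, so latency stays predictable while every word still gets visited eventually.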

shawnz · on Feb 1, 2018

> I wouldn't want unreliable hardware for any task, even as simple as gaming.

Would you prefer frame-perfect rendering to increased performance?

koffiezet · on Feb 8, 2018

We just set up a Kubernetes cluster for building C++ and Java software. Thing is, you want to be absolutely sure that your software is built correctly and that no bit-flips end up in your final builds, so ECC is an absolute must. Threadripper supporting this allowed us to create that cluster with cheap commodity hardware and made self-hosting the build farm the clear financial winner, especially since we have tons of free rack space (it came with the building that was bought), already host servers locally (which means a lot of support infrastructure is already in place), and have a solar surplus throughout the year in this building.

simcop2387 · on Feb 1, 2018

In my case I've got a large amount of RAM (128GB) and run many virtual machines for various purposes. Along with that, I've got 24TB of drives hooked up with ZFS to back up those virtual machines, family photos and movies, etc. Being able to know that an error has happened and that it's been corrected or handled appropriately (even killing the system and needing a reboot is appropriate), so that data isn't destroyed, is a good thing for me.

ihsw2 · on Jan 31, 2018

I'm not the parent commenter but it's probably finance or CAD modeling, environments where soft bit-errors (ones that silently and unpredictably cause data corruption rather than hard-crashes that are reproducible) can lead to nightmares.

usefulcat · on Jan 31, 2018

Or could be using ZFS

loeg · on Jan 31, 2018

"There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem." - Matt Ahrens[0]

[0]: https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...

usefulcat · on Feb 1, 2018

I didn't say it was required. It's not uncommon for people to want to use ZFS specifically because they highly value data integrity. If that's the case, then it probably makes sense to use ECC.

godzillabrennus · on Feb 1, 2018

Except there is:

https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-...