On Mon, Oct 21, 2019 at 03:21:34PM -0400, Paul Winalski wrote:
> I'm much more alarmed by the lack of memory error detection and
> correction on a lot of modern computers. This is one of my big
> concerns with the use of GPUs for heavy-duty computation. GPUs
> typically don't have memory with error detection because the worst
> that happens if there's a memory error in the GPU is you get a bad
> pixel or two displayed. I'd not like to cross a bridge whose design
> software used CUDA.

This might be true of gaming cards and low-end workstation cards, but
the higher-end Quadro cards and all the dedicated GPGPUs have had ECC
since at least the Maxwell era. Of course, nobody does a
single run and stamps the drawings, so the process itself should catch
these problems, but any correctly-configured CAD workstation or compute
cluster has error-correcting memory, both on the system and in the
accelerators. AMD's Radeon Pro and Intel's Xeon Phi accelerators do
ECC as well.
khm