On Sat, Sep 1, 2018 at 3:19 PM, Theodore Y. Ts'o <tytso(a)mit.edu> wrote:
On Sat, Sep 01, 2018 at 01:17:59PM -0400, Arthur Krewat wrote:
On 9/1/2018 12:27 PM, Kevin Bowling wrote:
I find it's about equal, and even exceeds Linux in terms of its NUMA
support and multi-processor support. I need to move some systems away
from Solaris and off to Linux, and I find its NUMA support lacking in
certain ways.
This is pure fantasy. To understand Linux performance on high core
count and multi-socket machines is to have at least passing knowledge
of Paul McKenney's genius work on RCU [1] and NUMA [2] at Sequent [3]
and on Linux. IBM bought Sequent, made a favorable patent grant of
RCU for Linux, and the rest is history.
Thanks :) - I'm basing this on Oracle database performance, for the most
part, and its weird way of supporting NUMA on Linux in a bass-ackwards sort
of way. Nothing I see in the latest RedHat/CentOS tells me it even cares
about NUMA, but maybe that's more of their "we know better than you"
mentality and it's all hidden under the covers somewhere.
It wouldn't surprise me if Linux's NUMA performance is pretty weak
compared to Solaris. There was an attempt to try to make NUMA work
well on Linux, with a lot of the effort coming from IBM and SGI, but
that effort was overtaken by events. Back in Sequent's day, the
remote to local memory latency was ten to one, so making the system
NUMA aware was critical. But by early 2000's, the remote to local
ratio was under 3:1 (or 2:1) for 4 socket systems, and with AMD's
"Sufficiently Uniform Memory Organization" (SUMO), the ratio was under
1.5:1 or less.
Sorry, the claim of being weak compared to Solaris is just bogus. Are
you looking back with rosy glasses, or have you scanned the code in
the past couple of years? I have, and there is nothing particularly
special about Solaris internals here or elsewhere. In fact, there are
pessimizations all over the place. As Larry said, a lot of folks in
the Linux community clearly cared about performance. Although the
Solaris code is fairly clean, it's not clear Sun valued performance at
all. A stroll through arch/*/include/asm/ was enough to convince me
of Larry's claims. I'm not a Linux fanboy, but credit goes where it's
due.
Solaris has lgroups, which are a clean design, but that is the extent
of its NUMA support: one shot at placement and scheduling. Linux has
a NUMA allocator, a NUMA-aware scheduler, NUMA-optimized spinlocks and
mutexes, various subsystems that correctly use those primitives, and
cgroups that can contain or gang things. There is a userland policy
tool called numad that tries to add some additional runtime affinity
and movement policy decisions.
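
To make that concrete, here is a minimal sketch of the userland side
of those interfaces using libnuma (build with -lnuma); the node number
and allocation size are arbitrary, just for illustration:

/* allocate memory on a specific node and keep the thread nearby */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "kernel has no NUMA support\n");
        return 1;
    }

    size_t len = 1 << 20;                  /* 1 MiB */
    void *buf = numa_alloc_onnode(len, 0); /* pages backed by node 0 */
    if (!buf)
        return 1;

    numa_run_on_node(0);  /* bind this thread to node 0's CPUs */

    /* ... work on buf with local-node latency ... */

    numa_free(buf, len);
    return 0;
}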
I agree that architecturally Linux NUMA is nowhere near where Sequent
and especially SGI were, and the reasons you cite are valid; the Linux
implementation is good for maybe 8-16 sockets of modern core count,
with a much tighter off-chip network than the big dogs were building.
Keep in mind IBM wants to sell Rockhoppers and E980s (4 drawers, 16
sockets, 768 threads) for dedicated Linux use, which have similar
north/south and east/west off-chip networks. They have a lot of very
talented people on the firmware, kernel, and compilers to make these
things work fast, including Paul.
The main reason for this was that Windows was (and as far as I know,
still is) NUMA oblivious. So x86 chip and motherboard designers
solved the problem, by brute force, in hardware. So by 2003 or 2004,
the Linux Scalability Effort had more or less petered out. (You can
see the leftover remnants at http://lse.sourceforge.net)
Windows' NUMA support is on par with Solaris insofar as there is a
domain-aware memory allocator and a scheduler hierarchy that takes the
domain (and SMT etc) into account. What Windows lacks is the finely
tuned concurrency primitives and everything else Linux has done...
which Solaris lacks as well. I'm not even talking about RCU (or
epoch-based reclamation or proxy collection or hazard pointers, at
least one of which is not patent encumbered), I'm just talking about
the quality of primitives like spinlocks, mutexes, and rwlocks. There
are big tradeoffs in the implementations of these in terms of
fairness, progress guarantees, and thread scalability. Linux leads
the pack by a long shot in this department.
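
For reference, this is the shape of the RCU pattern I mean, written
with userspace RCU (liburcu, built with -lurcu; the in-kernel API is
analogous). The config struct and the update logic are made up for
illustration:

#define _LGPL_SOURCE
#include <urcu.h>
#include <stdlib.h>

struct config { int threads; };

static struct config *cur_config;

/* Reader side: cheap, never blocks the updater. */
static int get_threads(void)
{
    rcu_read_lock();
    int n = rcu_dereference(cur_config)->threads;
    rcu_read_unlock();
    return n;
}

/* Updater: publish a new version, then free the old one only after
 * every reader that might still see it has finished. */
static void set_threads(int threads)
{
    struct config *newc = malloc(sizeof(*newc));
    struct config *oldc = cur_config;

    *newc = *oldc;
    newc->threads = threads;
    rcu_assign_pointer(cur_config, newc);
    synchronize_rcu();          /* wait out pre-existing readers */
    free(oldc);
}

int main(void)
{
    rcu_register_thread();      /* every thread using RCU registers */
    cur_config = calloc(1, sizeof(*cur_config));
    set_threads(8);
    (void)get_threads();
    rcu_unregister_thread();
    return 0;
}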
Where you start going beyond Linux-like NUMA IMO is when you get
Irix-like features of page copying, migration, and multiple advanced
placement policies.
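
For what it's worth, Linux does expose one-shot explicit migration via
the move_pages(2) syscall; what's missing is Irix's automatic policy
machinery around it. A rough sketch, with the target node picked
arbitrarily (link with -lnuma for the wrapper):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page;

    if (posix_memalign(&page, pagesize, pagesize))
        return 1;
    ((char *)page)[0] = 1;              /* fault the page in */

    void *pages[1] = { page };
    int nodes[1] = { 1 };               /* migrate to node 1 */
    int status[1];

    /* pid 0 == this process; MPOL_MF_MOVE only moves pages mapped
     * exclusively by us. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");
    else
        printf("page now on node %d\n", status[0]);
    return 0;
}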
Fundamentally, the economics of 4-socket and higher machines was such
that for many workloads, scale-out was much cheaper than scale-up. So
why buy super-expensive IBM x440, x450, and x460 servers, which were
huge cabinets connected by one or more "scalability cables" (sometimes
referred to as the "scalability bottleneck"), when most of the time
you could just buy a rack of 2U x86 servers which would be much, much
cheaper?
Agreed, this is why x86 has dominated the server market for a long time.
There were certainly workloads where this wasn't applicable, of
course. But when Sun was selling Sun 10k's to web startups during the
dot-com boom, and they were using them to serve web traffic, they
probably had too much VC money to burn, because that was *not* the
most cost-effective way to do things.
Agreed. Those big margins must have caused them to take their eye off
what mattered right at the time Linux was getting some momentum from
the big HW vendors.
Don't get me wrong; the Read-Copy Update (RCU) technique was certainly
very important, and is responsible for much of Linux's SMP scalability
today. But these days, when you can get up to 28 cores (56 threads)
on a single socket, the need for more than 2-socket systems is already
somewhat niche, and by the time you get to more than 4 sockets, it's
positively microscopic. As a result, NUMA support on Linux is
certainly not as strong as it could be, and it wouldn't surprise me if
Solaris has developed much better ways of handling behemoths such as
the Sun Enterprise 10k.
The E10k was only a 64-core machine on a tight backplane compared to
other large systems. It didn't have any of the pressing needs that
Sequent and SGI did with multi-drawer interconnects to drive
excellence in NUMA.
These are strange times. Intel's been putting out some real doozy
chips. The mesh in Skylake is a partial improvement over the dual
rings of Haswell (though they did some goofy things that increased
latency in undesirable ways), and they aren't going to continue to
brute-force it like IBM did with their 17-metal-layer process node...
many SKUs in Cascade Lake will be a dual-die design and cost a
hilarious amount of money. AMD's EPYC is really bad in this
department too; one EPYC behaves identically to a 4-socket system,
with extremely poor inter-die latency [1]. I think POWER9 is
universally better, and the high-bin chips (22 cores, 88 threads,
mega cache) are only around $2500 compared to Skylake's absurd
$12,000. POWER9 is a single on-chip NUMA domain for 24 cores. Google
publicly stated they are using it for GPU servers, and that their
whole monorepo is built for multiple ISAs. Through the grapevine I've
heard Gmail is running on POWER9 as well now. That is pretty
competent; the reason Intel is sucking so badly is that people allowed
themselves such lock-in. A hyperscaler should be able to change
between a couple of ISAs as needed between purchasing cycles.
- Ted
P.S. IBM made the RCU patent available for any GPL code, well before
Sun decided on the CDDL for Solaris. So if Sun management had chosen
GPL, they could have used RCU...
True. There is also at least one unencumbered strategy, epoch-based
reclamation, which was known around that time [2].
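
For the curious, the core idea of EBR fits in a page. The toy C11
sketch below is mine, not Fraser's code: fixed thread count, shared
limbo lists, and none of the fencing care a production version needs.
Nodes must be retired from inside a critical section, and get freed
only after two epoch advances, once no reader can still hold a
reference:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NTHREADS 4
#define EPOCHS   3

struct node { struct node *next; /* ... payload ... */ };

static _Atomic unsigned global_epoch;
static struct {
    _Atomic bool     active;
    _Atomic unsigned epoch;
} reader[NTHREADS];

/* Per-epoch limbo lists of retired nodes (shared here for brevity; a
 * real EBR keeps these per-thread to avoid contention). */
static _Atomic(struct node *) limbo[EPOCHS];

static void ebr_enter(int tid)
{
    atomic_store(&reader[tid].epoch, atomic_load(&global_epoch));
    atomic_store(&reader[tid].active, true);
}

static void ebr_exit(int tid)
{
    atomic_store(&reader[tid].active, false);
}

/* Called from inside a critical section, after unlinking n. */
static void ebr_retire(struct node *n)
{
    unsigned e = atomic_load(&global_epoch) % EPOCHS;

    n->next = atomic_load(&limbo[e]);
    while (!atomic_compare_exchange_weak(&limbo[e], &n->next, n))
        ;
}

/* Advance the epoch if every active reader has observed the current
 * one; then the list from two epochs ago is unreachable and can go. */
static void ebr_try_advance(void)
{
    unsigned g = atomic_load(&global_epoch);

    for (int i = 0; i < NTHREADS; i++)
        if (atomic_load(&reader[i].active) &&
            atomic_load(&reader[i].epoch) != g)
            return;

    if (!atomic_compare_exchange_strong(&global_epoch, &g, g + 1))
        return;

    struct node *n = atomic_exchange(&limbo[(g + 2) % EPOCHS],
                                     (struct node *)NULL);
    while (n) {
        struct node *next = n->next;
        free(n);
        n = next;
    }
}

int main(void)
{
    ebr_enter(0);
    ebr_retire(calloc(1, sizeof(struct node)));
    ebr_exit(0);
    ebr_try_advance();  /* epoch 0 -> 1 */
    ebr_try_advance();  /* epoch 1 -> 2: the epoch-0 limbo is freed */
    return 0;
}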
[1] https://www.servethehome.com/amd-epyc-infinity-fabric-latency-ddr4-2400-v-2…
[2] https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf
Regards,
Kevin