On Thu, Jan 19, 2023 at 09:58:48AM -0700, Warner Losh wrote:
> On Thu, Jan 19, 2023 at 9:41 AM Dan Cross <crossd@gmail.com> wrote:
> > But it's interesting the way the "Gods of BSD vs the rebel alliance"
> > thing seems to have inverted itself. Getting stuff done in Linux these
> > days is pretty hard; oh sure, I suppose if you whip off a patch fixing
> > a typo in a comment or something, someone will just apply it. But if
> > you want to do something substantial, you have to be willing to invest
> > a lot of time and effort in shepherding it through the upstreaming
> > process, which implies you've got to have resources backing that
> > effort.....
> This matches my experience: lots of gatekeepers, any one of which can take
> a disliking to your submission for what, at times, seems like arbitrary and
> capricious reasons. If you make it to the 'in crowd' it becomes more of a
> rubber stamp at times... I have recently been successful at submitting an
> obvious fix to a tiny backwater area of the kernel without undue stress,
> though... But I submitted it to the maintainer of the area, who then
> submitted it to the 'greater maintainer', who then submitted it to the
> tree. So my tiny fix has more Signed-off-by: lines than lines of change,
> and it took about two weeks to make its way all the way up into the
> repo... For me, it took about 2x as long to prep and send the change as
> it does with the direct commit access I have for FreeBSD, but I've spent
> more than 10x on other submissions that ultimately didn't make it in.
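(The chain described above shows up directly in the commit trailers; a
one-line fix forwarded up through two maintainers ends up looking something
like this, with the names, addresses, and file path all hypothetical:)

```
foo: fix typo in comment

Trivial one-line comment fix.

Signed-off-by: Drive-by Contributor <contributor@example.org>
Signed-off-by: Area Maintainer <area@example.org>
Signed-off-by: Subsystem Maintainer <subsys@example.org>
Signed-off-by: Greater Maintainer <greater@example.org>
---
 drivers/foo/foo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
```

Four Signed-off-by: lines for a one-line change, just as described.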
I'll note that a lot of this is a matter of scale. There are roughly
15,000 commits added to the Linux kernel per nine-week release cycle.
That translates to roughly 10 commits per hour, 7 days a week, 24
hours a day.
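Those rate figures are easy to sanity-check; a quick back-of-the-envelope
sketch, using only the numbers from the paragraph above:

```python
# Commit rate over a Linux release cycle: ~15,000 commits in 9 weeks.
commits = 15_000
weeks = 9
hours_in_cycle = weeks * 7 * 24      # 1512 hours, 24x7
rate = commits / hours_in_cycle      # commits landing per hour
print(f"{rate:.1f} commits/hour")    # -> 9.9, i.e. roughly 10
```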
From an upstream maintainer's perspective, what matters is whether a
commit is handled in time for it to make the next release train. So
whether it takes two days or two weeks to get a patch into the repo is
not considered important. What *is* important is that commits which are
posted for review at least, say, two weeks before the opening of the
merge window are reviewed in time for them to be pulled into Linus's
tree when the merge window next opens. If it is an urgent bug fix, it
will be acted upon more expeditiously, but the reality is most users
won't see it until the fix is backported into the Long-Term Stable
kernel that their distribution or product kernel is derived from.
For more substantial contributions, one of the reasons why there are so
many gatekeepers is that we've had far too many cases of "drive-by
contributions", where a company tries to dump a whole new file system,
or massive changes to a large number of subsystems --- and then once
the code lands in the tree, the developers vanish. And now we're left
with the question of whether we just drop those subsystems and screw
over the users who have started to depend on the new functionality.
This may be one of the places where the Linux culture of "thou shalt
never cause user-space visible regressions" has its downside --- it
means that it's a lot harder to accept new functionality unless there
is strong confidence that the contributor is there for the long haul,
and admittedly that's a lot easier if you are a member of the
"in-crowd".
> I looked into Linux's processes to improve FreeBSD's, and came to the
> conclusion that in large part they succeed despite their processes, not
> because of them. They have too much overhead and rely too much on magic
> bots that are hard to replicate; I'm astonished that things work as
> well as they do.
I think that if you divide the total overhead by the number of commits
that land in each release, the overhead isn't really that bad. Sure,
a tractor-trailer truck has a lot more overhead than say, a bicycle;
but you can haul an awful lot more in a semi-trailer truck than you
can in a bicycle rack. Hence, the overhead of a semi is arguably much
less, once you take into account how much you can fit in a
tractor-trailer.
Also, it's actually not *that* expensive, even in absolute terms. For
example, I run about 26 hours worth of regression tests (using a dozen
VMs, so the wall clock time is about 2 hours) using gce-xfstests[1],
and the retail cost, if I didn't have corporate sponsorship, is less
than $2 USD for a complete set of ext4 tests. And I made a point of
making it well documented[2] and easy for others to stand up, so they
can test their own file systems if they want. The reason why I did all
of this packaging work was to try to get graduate students to
understand how much work is left to get a publishable file system, such
as, say, BetrFS[3], into a production-ready state for real-world
use. :-)
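The parallelism and cost figures above hang together arithmetically; here
is a rough sketch, where the 26 VM-hours, dozen VMs, and under-$2 total
come from the text, but the per-VM-hour price is an assumed, hypothetical
figure chosen purely for illustration:

```python
# Rough cost model for a parallel cloud test fleet like the one above.
vm_hours = 26                 # total machine time consumed (from the text)
num_vms = 12                  # "a dozen VMs" running tests in parallel
cost_per_vm_hour = 0.07       # ASSUMED $/VM-hour; actual pricing varies

wall_clock = vm_hours / num_vms            # elapsed time, ~2.2 hours
total_cost = vm_hours * cost_per_vm_hour   # ~$1.82, i.e. under $2
print(f"wall clock ~{wall_clock:.1f} h, cost ~${total_cost:.2f}")
```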
[1] https://thunk.org/gce-xfstests
[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstest…
[3] https://www.usenix.org/conference/fast15/technical-sessions/presentation/ja…
> It's a grown culture / process that relies on old tools mixed with
> new in weird ways you'd never think of standing up today. Things can be
> learned from it, but it seems to be a unique snowflake relative to all
> the other projects I looked at...
That's fair enough. What is needed at one scale may be massive overkill
at another. All projects will probably need to adopt different
processes or tools as they grow.
Similar statements have been made about whether startups or other small
projects should use Kubernetes[4]. Kubernetes was designed by Google
based on lessons learned from its in-house cluster management system,
Borg. But if you aren't running at that scale, it may have more
overhead and complexity than what really makes sense.

[4] https://blog.porter.run/when-to-use-kubernetes-as-a-startup/
I'm reminded of a talk given by an engineer from Alibaba at LinuxCon
China 2017 where he was bragging about how they had finally achieved
the scaling necessary to support over a thousand servers in a data
center, and how this was an incredible achievement. That was especially
funny to me, since right about that time, Google had just finished an
engineering effort to scale our cluster management software *down* to
O(thousand) servers, in order to efficiently support "mini-clusters"
that could be deployed in smaller data centers in various countries in
Europe, Asia, etc. :-)

And later, even *more* engineering work was needed to efficiently
support O(hundreds of) servers for Stadia.
What works well at one scale, may not work well at others, either when
scaling up or scaling down.
Cheers,
- Ted