[pups] Strange problems on an uPDP 11/53+ - TUHS

5 Feb 2001

After 52 days, my uPDP 11/53+ has suddenly been acting rather strange.
/usr/include got 'replaced' by /usr/new, to be precise. At the time,
I was the only user. Seeing this, I immediately halted the system,
expecting a load of file system errors upon boot. None showed up, and
/usr/include is back to itself again. However, programs which *used*
to be running perfectly (like my work-in-progress ps) suddenly fail,
with a "not enough memory for saving info".
Any hints?
--
    Martijn van Buul -  Pino(a)dohd.org - http://www.stack.nl/~martijnb/
         Geek code: G--  - Visit OuterSpace: mud.stack.nl 3333
   Kees J. Bot: The sum of CPU power and user brain power is a constant.
Received: (from major@localhost)
        by minnie.cs.adfa.edu.au (8.9.3/8.9.3) id LAA70139
        for pups-liszt; Tue, 6 Feb 2001 11:46:34 +1100 (EST)
        (envelope-from owner-pups(a)minnie.cs.adfa.edu.au)
...
 From "Steven M. Schultz"
&lt;sms(a)moe.2bsd.com&gt; Tue Feb  6 10:35:17 2001 
Received: from moe.2bsd.com
(MOE.2BSD.COM [206.139.202.200])
        by minnie.cs.adfa.edu.au (8.9.3/8.9.3) with ESMTP id LAA70135
        for &lt;pups(a)minnie.cs.adfa.edu.au&gt;; Tue, 6 Feb 2001 11:46:30 +1100 (EST)
        (envelope-from sms(a)moe.2bsd.com)
Received: (from sms@localhost)
        by moe.2bsd.com (8.10.1/8.10.1) id f160ZHg18114
        for pups(a)minnie.cs.adfa.edu.au; Mon, 5 Feb 2001 16:35:17 -0800 (PST)
Date: Mon, 5 Feb 2001 16:35:17 -0800 (PST)
From: "Steven M. Schultz" &lt;sms(a)moe.2bsd.com&gt;
Message-Id: &lt;200102060035.f160ZHg18114(a)moe.2bsd.com&gt;
To: pups(a)minnie.cs.adfa.edu.au
Subject: Re: [pups] Strange problems on an uPDP 11/53+
Sender: owner-pups(a)minnie.cs.adfa.edu.au
Precedence: bulk
Hi -
...
  From: Martijn van Buul &lt;pino(a)dohd.org&gt;
 After 52 days, my uPDP 11/53+ has suddenly been acting rather strange.
 /usr/include got 'replaced' by /usr/new, to be precise. At the time, 
        Oops!
...
  I was the only user. Seeing this, I immediately halted
the system,
 expecting a load of file system errors upon boot. None showed up, and
 /usr/include is back to itself again. However, programs which *used*
 to be running perfectly (like my work-in-progress ps) suddenly fail,
 with a "not enough memory for saving info". 
...
  Any hints? 
        How much memory is on the system now after the reboot.   The only
        thing that pops into mind is that the system is running without
        enough memory.    If part of the memory on the system dropped out
        earlier that would (possibly) explain the strange behaviour was
        seen.   Rebooting/reseting the system would cause the system to
        recount memory.
        A program can get 'ENOMEM' as an error two ways: 1) exceeding the
        maximum 64KB dataspace (stack + data) or 2) the system has run out
        of swap or the maps ('coremap' and/or 'swapmap') have become
too
        fragmented.
        Two commands that can be useful in obtaining more information are
                sysctl hw
        and
                pstat -s
        "sysctl hw" will give several lines of output - the two you'd be
        interested in are
        hw.physmem = 2097152
        hw.usermem = 415744
        'physmem' is the amount of memory physically present and
'usermem' is
        the amount current free and available for user programs.
        "pstat -s" will give a swap space usage summary.
        Steven Schultz
        sms(a)Moe.2bsd.com
Received: (from major@localhost)
        by minnie.cs.adfa.edu.au (8.9.3/8.9.3) id SAA71926
        for pups-liszt; Tue, 6 Feb 2001 18:41:36 +1100 (EST)
        (envelope-from owner-pups(a)minnie.cs.adfa.edu.au)
...
 From Martijn van Buul &lt;pino(a)dohd.org&gt; Tue Feb  6
17:39:28 2001 Received: from mud.stack.nl (mud.stack.nl [131.155.141.98])
        by minnie.cs.adfa.edu.au (8.9.3/8.9.3) with ESMTP id SAA71922
        for &lt;pups(a)minnie.cs.adfa.edu.au&gt;; Tue, 6 Feb 2001 18:41:32 +1100 (EST)
        (envelope-from martijnb(a)stack.nl)
Received: by mud.stack.nl (Postfix, from userid 587)
        id D00657F08; Tue,  6 Feb 2001 08:39:28 +0100 (CET)
Date: Tue, 6 Feb 2001 08:39:28 +0100
From: Martijn van Buul &lt;pino(a)dohd.org&gt;
To: "Steven M. Schultz" &lt;sms(a)moe.2bsd.com&gt;
Cc: pups(a)minnie.cs.adfa.edu.au
Subject: Re: [pups] Strange problems on an uPDP 11/53+
Message-ID: &lt;20010206083928.A15141(a)mud.stack.nl&gt;
Reply-To: Martijn van Buul &lt;pino(a)dohd.org&gt;
References: &lt;200102060035.f160ZHg18114(a)moe.2bsd.com&gt;
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.3.3i
In-Reply-To: &lt;200102060035.f160ZHg18114(a)moe.2bsd.com&gt;; from sms(a)moe.2bsd.com on Mon,
Feb 05, 2001 at 04:35:17PM -0800
Sender: owner-pups(a)minnie.cs.adfa.edu.au
Precedence: bulk
Steven M. Schultz wrote:
...
  Hi -
  From: Martijn van Buul &lt;pino(a)dohd.org&gt;
 After 52 days, my uPDP 11/53+ has suddenly been acting rather strange.
 /usr/include got 'replaced' by /usr/new, to be precise. At the time, 
        Oops!    
Well, strange things are afoot indeed. About the same time, 1 machine
crashed (A DEC Alpha running OpenBSD), 2 started acting very strangely,
and had to be rebooted (My PDP, and a Wintel box running Windows 2000),
and a 4th machine (A Wintel box running Minix-VMD) suddenly had some
problems reading his harddisk and using its network (but recovered). The
strange thing is that these machines aren't related in any way but one:
they're standing quite near to eachother. Do I hear EMC somewhere?
> ...
  Any hints? >
>    How much memory is on the system now after the reboot.
1.5 MB. 798 Kilowords.
...
        The only thing that pops into mind is that the
system is running
       without enough memory.    If part of the memory on the system dropped
       out earlier that would (possibly) explain the strange behaviour was
        seen.   Rebooting/reseting the system would cause the system to
        recount memory. 
Well, the machine had 1.5 MB before it crashed.. It's doubtlessly some
memory fault, but it *seems* to be a temporal one.
...

        "sysctl hw" will give several lines of output - the two you'd be
        interested in are
        hw.physmem = 2097152 
hw.physmem = 1572864
...
          hw.usermem = 415744 
hw.usermem = 313472
...
          'physmem' is the amount of memory
physically present and 'usermem' is
        the amount current free and available for user programs. 
Should be enough. 'cc' works without problems - only my ps with debug
info seems to be affected; it might not be a memory issue, but a "ps can't
determine the right amount of processes"-issue..
I've checked it, and this seems to be the case. Ps thinks that there are
0 processes running, and does a
 outargs = (struct psout *)calloc(nproc, sizeof(struct psout));
on that. With 'nproc' being 0, this returns a NULL pointer, but doesn't
mean that the process is out of memory.
Having no ps is very annoying; finding back those 4 children spawned
by a httpd can be a nuisance then. pstat -p works, but it isn't comfortable:)
...
          "pstat -s" will give a swap space
usage summary. 
15/59 swapmap entries
910 kbytes swap used, 6263 kbytes free
--
    Martijn van Buul -  Pino(a)dohd.org - http://www.stack.nl/~martijnb/
         Geek code: G--  - Visit OuterSpace: mud.stack.nl 3333
   Kees J. Bot: The sum of CPU power and user brain power is a constant.
Received: (from major@localhost)
        by minnie.cs.adfa.edu.au (8.9.3/8.9.3) id DAA75027
        for pups-liszt; Wed, 7 Feb 2001 03:47:05 +1100 (EST)
        (envelope-from owner-pups(a)minnie.cs.adfa.edu.au)
...
 From "Steven M. Schultz"
&lt;sms(a)moe.2bsd.com&gt; Wed Feb  7 02:36:03 2001 
Received: from moe.2bsd.com
(MOE.2BSD.COM [206.139.202.200])
        by minnie.cs.adfa.edu.au (8.9.3/8.9.3) with ESMTP id DAA75023
        for &lt;pups(a)minnie.cs.adfa.edu.au&gt;; Wed, 7 Feb 2001 03:46:56 +1100 (EST)
        (envelope-from sms(a)moe.2bsd.com)
Received: (from sms@localhost)
        by moe.2bsd.com (8.10.1/8.10.1) id f16Ga3301595;
        Tue, 6 Feb 2001 08:36:03 -0800 (PST)
Date: Tue, 6 Feb 2001 08:36:03 -0800 (PST)
From: "Steven M. Schultz" &lt;sms(a)moe.2bsd.com&gt;
Message-Id: &lt;200102061636.f16Ga3301595(a)moe.2bsd.com&gt;
To: pino(a)dohd.org, sms(a)moe.2bsd.com
Subject: Re: [pups] Strange problems on an uPDP 11/53+
Cc: pups(a)minnie.cs.adfa.edu.au
Sender: owner-pups(a)minnie.cs.adfa.edu.au
Precedence: bulk
Hi --
...
  Well, strange things are afoot indeed. About the same
time, 1 machine...
 they're standing quite near to eachother. Do I hear EMC somewhere? 
        Time to increase the shielding around the computer room, eh? ;-)
...
  Well, the machine had 1.5 MB before it crashed..
It's doubtlessly some
 memory fault, but it *seems* to be a temporal one. 
        I do not think it is a memory/hardware problem - that was just a
        guess (not a very good one at that ;)).
...
  hw.usermem = 313472 
        That's fine.
...
  Should be enough. 'cc' works without
problems - only my ps with debug 
        What about the standard 'ps' that came with the system?
...
  info seems to be affected; it might not be a memory
issue, but a "ps can't
 determine the right amount of processes"-issue..
 I've checked it, and this seems to be the case. Ps thinks that there are
 0 processes running, and does a
  outargs = (struct psout *)calloc(nproc, sizeof(struct psout)); 
        Ah, ok - malloc() used to actually return a non-NULL pointer when
        presented with a size request of 0.  That was an error and was changed
        (I forget the exact update/patch number).  There were a couple programs
        in the system that relied on the old behaviour and those had to be
        fixed.
...
  on that. With 'nproc' being 0, this returns
a NULL pointer, but doesn't
 mean that the process is out of memory. 
        Right, the ENOMEM error was overloaded by malloc().  An argument
        can be made that EINVAL should have been returned instead by malloc()
        if 0 was passed in.
...
  Having no ps is very annoying; finding back those 4
children spawned
 by a httpd can be a nuisance then. pstat -p works, but it isn't comfortable:) 
        Are you are using the traditional 'nlist()' method of reading
        the kernel symbol table to look for 'nproc' and '_proc'?  If
so
        is there a permissions problem?  /dev/*mem needs to be group=kmem, mode
        640, the /unix image should be mode 644 and the 'ps' program setgid
        to kmem.   If there is a problem reading the kernel symbol table
        then 'nproc' will remain 0 which is what you're seeing.
        Another way of examining some kernel variables (proc table, file table,
        etc) is with the "sysctl" call.  It's much faster since it
doesn't
        have to do a sequential scan of the /unix symbol table.   You can
        look in /usr/src/ucb/w.c at the function 'readpr()' to see how to
        examine the proc table using sysctl.
        Steve