TUHS June 2017

tuhs@tuhs.org

38 participants
21 discussions

Re: [TUHS] 211bsd: kernel panic after a 'here document' in tcsh

by Walter F.J. Mueller

Hi,

the kernel panic after tcsh here documents is understood. And fixed, at least on my system.

The essential hint was Johnny's observation that on his system he gets an "Illegal instruction - core dumped" and no kernel panic.

I'm using a self-built PDP 11/70 on an FPGA, see

    https://github.com/wfjm/w11/
    https://wfjm.github.io/home/w11/

which doesn't have a floating point unit. Therefore the kernel is built with floating point emulation, thus with

    FPSIM YES # floating point simulator

In a kernel with FPSIM activated, the trap handler trap(), see

    http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/trap.c.html

calls fpsim() for each user mode illegal instruction trap. If it was a floating point instruction, fpsim() emulates it, returns 0, and trap() simply returns. If not, fpsim() returns the abort signal type, and trap() calls psignal() with this signal type, which in general will terminate the offending process.

The kernel panic is due to a coding error in mch_fpsim.s. Look in

    http://www.retro11.de/ouxr/211bsd/usr/src/sys/pdp/mch_fpsim.s.html

at the code after label badins:

    badins:                     / Illegal Instruction
            mov $SIGILL.,r0
            br 2b

The constant SIGILL is defined in assym.h as

    #define SIGILL 4.

Thus after substitution the mov instruction is

    mov $4..,r0

with *two dots* !!! The 'as' assembler generates from this

    mov #160750,r0

So r0 will contain an invalid signal number, which is returned by fpsim() to trap(). This signal number is passed to psignal(), which starts with

    mask = sigmask(sig);
    prop = sigprop[sig];

The access to sigprop[sig] results in an address in I/O space, causes a UNIBUS timeout, and in consequence the kernel panic.

After fixing the "$SIGILL." to "$SIGILL" (removing the extraneous '.') and three similar cases, the kernel doesn't panic anymore; tcsh crashes with an illegal instruction trap.

The question remains why tcsh runs into an illegal instruction. Now that I get a tcsh core dump, adb gives the answer

    adb tcsh tcsh.core
    $c
    0172774: _rscan(0176024,0174434) from ~heredoc+0246
    0176040: _heredoc(067676) from ~execute+0234
    0176126: _execute(067040,01512,0,0) from ~execute+03410
    0176222: _execute(066754,01512,0,0) from ~process+01224
    0176274: _process(01) from ~main+06030
    0177414: _main() from start+0104

heredoc(), which is located in OV1, calls rscan(), which is in OV6, with

    rscan(Dv, Dtestq);

where Dtestq is a function pointer to Dtestq(), which is, like heredoc(), in OV1. rscan(), which has the signature

    rscan(t, f)
        register Char **t;
        void (*f) ();

uses 'f' in the statement

    (*f) (*p++);

The problem is that
  - heredoc() and Dtestq() are in OV1
  - that's why in the end ~Dtestq is used as the function pointer, as for all overlay internal function invocations
  - rscan() is in OV6; when it's called, the overlay is switched OV1 -> OV6
  - this invalidates the function pointer, which points to some random code location, which happens to hold '000045', causing a trap.

It is clear that in this context _Dtestq, the forwarder in the base, must be used and not ~Dtestq, the entry point in the overlay.

The generated code for 'rscan(Dv, Dtestq)' is

    ~heredoc+0230: mov $0174434,(sp)   # arg Dtestq: uses ~Dtestq
    ~heredoc+0234: mov r5,-(sp)
    ~heredoc+0236: add $0177764,(sp)   # arg Dv
    ~heredoc+0242: jsr pc,*$_rscan

Since rscan() is very small and only used by heredoc(), I simply moved the code of rscan() from sh.glob.c (OV6) to sh.dol.c, where heredoc() and Dtestq() are also defined. After that tcsh works fine with here documents

    ./tcsh
    cat >x.x <<EOF
    1 $TERM $PWD
    EOF
    cat x.x
    1 vt100-long /usr/src/bin/tcsh

Bottom line
  - fpsim was broken all the time
  - tcsh was broken all the time

I'll convert this into proper patches and send them to Steven, but this will take some time because I have to tidy up my system to be again in the position to provide proper and clean patch sets.

With best regards, Walter

P.S.: debugging the kernel issue was quite easy because the w11a CPU has three essential 'built into the CPU' debug tools:
  - a 'cpu monitor', which records 144 bits of processor state for the last 256 instructions or vector fetches, see https://github.com/wfjm/w11/blob/master/rtl/w11a/pdp11_dmcmon.vhd
  - a 'breakpoint unit', which allows setting instruction or data breakpoints
  - an 'ibus monitor', which records the last 512 ibus transactions

After setting a breakpoint on the trap 004/010 handler, an inspection of the instruction trace gave the essential information. Below is a very condensed and annotated excerpt

    nc ....pc cprptnzvc ..dsrc ..ddst ..dres      vmaddr vmdata
    #
    # the "(*f) (*p++)" in tcsh, running onto an illegal instruction
    #
    15 145210 uu00-.... 000105 173052 000105 w d  173052 000105 mov r0,(sp)
    25 145212 uu00-.... 173050 174434 174434 w d  173050 145216 jsr pc,@n(r5)
    19 174434 uu00-.... 000010 173064 000010 r i  174434 000045 ?000045?
     1 174434 uu00-.... 000012 173064 000012 r d  000010 000045 !VFETCH 010 RIT
    #
    # the "mov $SIGILL.,r0" in fpsim(), load 160750 instead of 000004
    #
    17 160744 ku00-n..c 160750 000045 160750 r i  160746 160750 mov #n,r0
    14 160750 ku00-n..c 160752 160750 160732 r i  160750 000770 br .-14
    #
    # the "sigprop[sig]" access in psignal(), which accesses 174036
    # which leads to an external bus (or UNIBUS) time out and IIT trap
    #
    23 161314 ku00-.z.. 000000 147500 000000 w d  147500 000000 mov r1,n(r5)
     9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
     3 161320 ku00-.z.. 000006 000000 000006 r d  000004 013066 !VFETCH 004 IIT
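One plausible reading of why "$4.." assembles to 160750 can be checked with a little arithmetic. The sketch below assumes the classic UNIX 'as' expression rules: digits followed by a dot form a decimal constant, a lone dot is the current location counter, and two operands with no operator between them are added. The message does not state these rules, so treat them as an assumption; the function name is hypothetical. The location counter value 0160744 is the pc of the mov shown in the trace excerpt.

```c
#include <stdint.h>

/*
 * Hypothetical helper: value of an operand written as "N.." under the
 * ASSUMED rules that "N." is a decimal constant, "." is the location
 * counter, and adjacent operands are added. PDP-11 words are 16 bits
 * wide, so the sum wraps modulo 2^16.
 */
uint16_t eval_const_dot(uint16_t decimal_const, uint16_t location_counter)
{
    return (uint16_t)(decimal_const + location_counter);
}
```

Under these assumptions eval_const_dot(4, 0160744) yields 0160750, the value that ends up in r0, where the intended "$4." would simply have been 4.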

8 years, 1 month

Re: [TUHS] Array index history

by David

Arnold gets it right on the Pascal indexing. In UCSD Pascal you could specify any array bounds you liked, and the compiler would zero-base them for you by always subtracting your minimum array index (or adding, if your minimum was negative). So there is a little run-time cost for non-zero-based arrays. I’m not sure how other Pascal compilers did this.

I find it interesting that there are now a slew of testing programs (Valgrind, Address Sanitizer, Purify, etc.) that will add the ‘missing’ array bounds checking for C.

David

> On Jun 7, 2017, at 10:01 AM, tuhs-request(a)minnie.tuhs.org wrote:
>
> Date: Wed, 07 Jun 2017 07:20:43 -0600
> From: arnold(a)skeeve.com
> To: tuhs(a)tuhs.org, ag4ve.us(a)gmail.com
> Subject: Re: [TUHS] Array index history
> Message-ID: <201706071320.v57DKhmJ026303(a)freefriends.org>
> Content-Type: text/plain; charset=us-ascii
>
> Pascal (IIRC) allowed you to specify upper and lower bounds, something
> like
>
> foo : array[5..10] of integer;
>
> with runtime bounds checking on array accesses. (I could be wrong ---
> it's been a LLLLOOONNNGGG time.)
>
> HTH,
>
> Arnold
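The zero-basing described above can be sketched in C. This is only an illustration of the idea, not the UCSD Pascal implementation; the struct and function names are hypothetical. Each access costs one subtraction of the declared lower bound, and the assert stands in for the run-time bounds check Arnold mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for an array declared as array[lo..hi] of int. */
struct ranged_array {
    int lo;     /* declared lower bound, may be negative */
    int hi;     /* declared upper bound, hi >= lo */
    int *data;  /* zero-based storage holding hi - lo + 1 elements */
};

/* Map a source-level index to a zero-based offset: one subtraction per access. */
size_t zero_based_offset(const struct ranged_array *a, int index)
{
    assert(index >= a->lo && index <= a->hi);  /* run-time bounds check */
    return (size_t)(index - a->lo);
}

/* Read element 'index' using the declared bounds. */
int ranged_get(const struct ranged_array *a, int index)
{
    return a->data[zero_based_offset(a, index)];
}
```

For foo : array[5..10], index 5 maps to offset 0 and index 10 to offset 5; with a lower bound of -3, subtracting the bound amounts to adding 3.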

8 years, 1 month

Re: [TUHS] Array index history

by Johnny Billquist

On 2017-06-07 19:01, "Ron Natalie"<ron(a)ronnatalie.com> wrote:
> The original FORTRAN and BASIC arrays started indexing at one because everybody other than computer scientists start counting at 1.

FORTRAN, yes. BASIC (which dialect might we be talking about?) normally actually starts with 0. However, BASIC is weird, in that the DIM statement actually specifies the highest usable index, and not the size of the array.

Thus:

    DIM X(10)

means you get an array with 11 elements. So people who wanted to use arrays starting at 1 would still be happy, and if you wanted to start at 0, that also worked. You might unintentionally have a bit of wasted memory, though.

> These languages were for scientists and the beginner, so you wanted to make things compatible with their normal concepts.

True.

> PASCAL on the other hand required you to give the minimum and maximum index for the array.

In a way, PASCAL makes the most sense. You specify what range you want, and you get that. Anything works, and it's up to you. That said, PASCAL could get a bit ugly when passing arrays as arguments to functions because of this.

> Of course, C’s half-assed implementation of arrays kind of depends on zero-indexing to work.

:-)

Johnny

--
Johnny Billquist        || "I'm on a bus
                        ||  on a psychedelic trip
email: bqt(a)softjar.se   ||  Reading murder books
pdp is alive!           ||  tryin' to stay hip" - B. Idol
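The DIM behaviour described above, where the argument is the highest usable index, can be mimicked in C as follows. This is a minimal sketch with hypothetical function names; it models only the "normal" case from the message, and BASIC dialects may differ.

```c
#include <stdlib.h>

/* Number of elements behind DIM X(highest): indices 0..highest inclusive. */
size_t basic_dim_count(size_t highest)
{
    return highest + 1;
}

/* Hypothetical helper: zeroed storage for DIM X(highest); caller frees it. */
double *basic_dim(size_t highest)
{
    return calloc(basic_dim_count(highest), sizeof(double));
}
```

With basic_dim(10) both x[0] and x[10] are valid, so code that counts from 1 simply leaves element 0 unused, which is the bit of wasted memory mentioned above.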

8 years, 1 month

Re: [TUHS] 211bsd: kernel panic after a 'here document' in tcsh

by Johnny Billquist

On 2017-06-07 22:14, "Walter F.J. Mueller"<w.f.j.mueller(a)retro11.de> wrote:
> Hi,
>
> a few remarks on the feedback on the kernel panic after a 'here document' in tcsh.
>
> To Michael Kjörling's question:
> > I'm curious whether the same thing happens if you try that in some
> > other shell? (Not sure how widely here documents were supported back
> > then, but I'm asking anyway.)
> And Johnny Billquist's remark
> > Not sure if any of the other shells have this.
>
> 'here documents' are available and work fine in sh and csh.
> And are in fact used, examples

Ah. Thanks. Too lazy to check.

> To Michael Kjörling's remark
> > The PC value in the panic report ("pc 161324") strikes me as high
> and Johnny Billquist's remark
> > This is in kernel mode, and that is in the I/O page.
>
> 211bsd uses split I/D space and uses all 64 kB I space for code.

D'oh! Color me stupid. I should have thought of that.

> The top 8 kB are in fact the overlay area, and the crash happened
> in overlay 4 (as indicated by ov 4). With a simple
>
> nm /unix | sort | grep " 4"
>
> one gets
>
> 161254 t ~psignal 4
> 162302 t ~issignal 4
>
> so the crash is just 050 bytes after the entry point of psignal. So the
> PC address is fine and not the problem. For psignal look at
>
> http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal
>
> the crash must be one of the first lines. psignal is an internal kernel
> function, called from
>
> http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_p…
>
> and has nothing to do with the libc function psignal
>
> http://www.retro11.de/ouxr/211bsd/usr/man/cat3/psignal.0.html
> http://www.retro11.de/ouxr/211bsd/usr/src/lib/libc/gen/psignal.c.html

The libc function would be in user mode, so that one was pretty clear.

Ok. Digging through this a little for real then. psignal gets called with a signal from the trap handler. The actual signal is weird. It would appear to be 0160750, which would be -7704 if I'm counting right. That does not make sense as a signal.

The psignal code pulls a value based on the signal number, which is the line:

    prop = sigprop[sig];

which uses the signal number as an index. With a random, weird signal number, this accesses wherever that might end up. Which is when you get the crash. On my system, sigprop is at address 0012172, which, with a signal of -7704, ends up at address 0173142, which by (un)luck happens to be in the middle of the diagnostics bootstrap ROM space. So I don't get a Unibus timeout error, while you do. Probably because sigprop is at a slightly different address in your kernel.

So, the real question is how trap can be calling psignal with such a broken signal number. I might dig further into that question another day. But unless you already got this far, I might have saved you a few minutes of digging.

I did start looking into the trap code, which is in pdp/trap.c, but this is not entirely straightforward. It goes through a bunch of things trying to decide what signal to send, before actually calling psignal.

> To Johnny Billquist's remark
> > Could you (Walter) try the latest version of 2.11BSD and see if you
> > still get that crash?
>
> very interesting that you see a core dump of tcsh rather than a kernel panic.

Indeed.

> Whatever tcsh does, it should not lead to a kernel panic, and if it does,
> it is primarily a bug of the kernel. It looks like there are two issues,
> one in tcsh, and one in the kernel. I've a hunch where this might come from,
> but that will take a weekend or two to check on.

Agree that the kernel should not crash on this. Also, tcsh should not really crash either, but it's a separate issue, even though one might have triggered the other here. But yes, there are two bugs in here.

If you can recreate the kernel crash on the latest version, that would be good. But it smells like trap.c has some path where it does not even set what signal to deliver, and then calls psignal with whatever the variable i held at the start of the function. Which would be some random stuff on the stack.

Johnny

--
Johnny Billquist        || "I'm on a bus
                        ||  on a psychedelic trip
email: bqt(a)softjar.se   ||  Reading murder books
pdp is alive!           ||  tryin' to stay hip" - B. Idol
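The address arithmetic in this message can be reproduced with a short C sketch. The function names are hypothetical, and the sketch assumes one byte per sigprop entry (suggested by the movb in the instruction trace elsewhere in the thread) and 16-bit wrap-around of the computed address.

```c
#include <stdint.h>

/* Reinterpret a 16-bit word as the signed value a 16-bit int would hold. */
int as_signed16(uint16_t word)
{
    return word >= 0x8000 ? (int)(word - 0x8000) - 0x8000 : (int)word;
}

/*
 * Address touched by sigprop[sig], ASSUMING one byte per entry.
 * Adding the unsigned word and letting the sum wrap modulo 2^16 gives
 * the same address as adding the negative signed index.
 */
uint16_t sigprop_entry_addr(uint16_t sigprop_base, uint16_t sig)
{
    return (uint16_t)(sigprop_base + sig);
}

/* True for the top 8 kB of a 16-bit address space, 0160000..0177777. */
int in_top_8k(uint16_t addr)
{
    return addr >= 0160000;
}
```

as_signed16(0160750) gives -7704, and sigprop_entry_addr(0012172, 0160750) gives 0173142, matching the figures above. Treating the 013066 displacement from the trace as the base in Walter's kernel (an inference, not something either message states) gives 0174036, the address that timed out.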

8 years, 1 month

Re: [TUHS] Array index history

by Johnny Billquist

On 2017-06-08 22:17, Dave Horsfall<dave(a)horsfall.org> wrote:
>
> Just to diverge from this thread a little, it probably isn't all that
> remarkable that programming languages tend to reflect the hardware for
> which they were designed.
>
> Thus, for example, we have the C construct:
>
> do { ... } while (--i);
>
> which translated right into the PDP-11's "SOB" instruction (and
> reminiscent of FORTRAN's insistence that DO loops are run at least once
> (there was a CACM article about that once; anyone have a pointer to it?)).
>
> And of course the afore-mentioned FORTRAN, which really reflects the
> underlying IBM 70x architecture (shudder).

FORTRAN stopped running loops at least once as of FORTRAN 77. The last version that insisted on running loops at least once was FORTRAN IV.

Johnny

--
Johnny Billquist        || "I'm on a bus
                        ||  on a psychedelic trip
email: bqt(a)softjar.se   ||  Reading murder books
pdp is alive!           ||  tryin' to stay hip" - B. Idol
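The difference between the two loop styles can be made concrete by counting iterations. This is a sketch with hypothetical function names; the trip-count formula is the FORTRAN 77 rule as recalled here, max((last - first + step) / step, 0) with truncating division, and should be checked against the standard.

```c
/*
 * Iterations of: do { ... } while (--i);
 * The body runs before the first test, so it executes at least once.
 * Call with i >= 1: starting from 0, the unsigned counter would wrap.
 */
unsigned do_while_trips(unsigned i)
{
    unsigned trips = 0;
    do {
        trips++;
    } while (--i);
    return trips;
}

/*
 * Trip count of a FORTRAN 77 style "DO var = first, last, step" loop.
 * The count is computed up front and may be zero. step must be non-zero.
 */
int f77_trips(int first, int last, int step)
{
    int n = (last - first + step) / step;  /* C division truncates toward zero */
    return n > 0 ? n : 0;
}
```

do_while_trips(1) is 1, the minimum possible, while f77_trips(5, 1, 1) is 0: the loop body is skipped entirely, which an "at least once" loop cannot do.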

8 years, 1 month

Array index history

by shawn wilson

I learned the other day that array indexes in some languages start at 1 instead of 0. This seems to be an old trend that changed around the 70s? Who started this? Why was the change made? It seems to have come about around the same time as C, but interestingly enough Lua is kind of in between (you can start an array at 0 or 1). Smalltalk can probably have a 0-based index just by its nature, but I wonder whether that would work in a 40-year-old interpreter.

8 years, 1 month

Re: [TUHS] Array index history

by Richard Tobin

> Basically, until C came along, the standard practice was for indices
> to start at 1. Certainly Fortran and Pascal did it that way.

Mercury Autocode used 0.

http://www.homepages.ed.ac.uk/jwp/history/mercury/manual/autocode/4.jpg

-- Richard

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

8 years, 1 month

Re: [TUHS] 211bsd: kernel panic after a 'here document' in tcsh

by Walter F.J. Mueller

Hi,

a few remarks on the feedback on the kernel panic after a 'here document' in tcsh.

To Michael Kjörling's question:
> I'm curious whether the same thing happens if you try that in some
> other shell? (Not sure how widely here documents were supported back
> then, but I'm asking anyway.)
And Johnny Billquist's remark
> Not sure if any of the other shells have this.

'here documents' are available and work fine in sh and csh. And are in fact used, examples

  /usr/adm/daily (a /bin/sh script)
      su uucp << EOF
      /etc/uucp/clean.daily
      EOF

  /usr/crash/why (a /bin/csh script)
      adb -k {unix,core}.$1 << 'EOF'
      version/sn"Backtrace:"n
      $c
      'EOF'

To Michael Kjörling's remark
> The PC value in the panic report ("pc 161324") strikes me as high
and Johnny Billquist's remark
> This is in kernel mode, and that is in the I/O page.

211bsd uses split I/D space and uses all 64 kB I space for code. The top 8 kB are in fact the overlay area, and the crash happened in overlay 4 (as indicated by ov 4). With a simple

    nm /unix | sort | grep " 4"

one gets

    161254 t ~psignal 4
    162302 t ~issignal 4

so the crash is just 050 bytes after the entry point of psignal. So the PC address is fine and not the problem. For psignal look at

    http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal

the crash must be in one of the first lines. psignal is an internal kernel function, called from

    http://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_p…

and has nothing to do with the libc function psignal

    http://www.retro11.de/ouxr/211bsd/usr/man/cat3/psignal.0.html
    http://www.retro11.de/ouxr/211bsd/usr/src/lib/libc/gen/psignal.c.html

To Johnny Billquist's remark
> Could you (Walter) try the latest version of 2.11BSD and see if you
> still get that crash?

very interesting that you see a core dump of tcsh rather than a kernel panic. Whatever tcsh does, it should not lead to a kernel panic, and if it does, it is primarily a bug of the kernel. It looks like there are two issues, one in tcsh, and one in the kernel. I've a hunch where this might come from, but that will take a weekend or two to check on.

With best regards, Walter
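The symbol lookup done by hand above (take the sorted nm output and find the entry point just below the crash PC) can be sketched in C. The struct and function names are hypothetical; in practice the table would hold the nm lines for the overlay in question, of which the message shows only two.

```c
#include <stddef.h>
#include <stdint.h>

struct sym {
    uint16_t addr;     /* entry point address as printed by nm (octal) */
    const char *name;
};

/* Return the symbol with the highest address <= pc, or NULL if there is none. */
const struct sym *nearest_sym(const struct sym *tab, size_t n, uint16_t pc)
{
    const struct sym *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].addr <= pc && (best == NULL || tab[i].addr > best->addr))
            best = &tab[i];
    }
    return best;
}
```

With the two overlay 4 entries shown above and pc 0161324, the lookup returns ~psignal, and 0161324 - 0161254 is 050 octal (40 decimal) bytes past its entry point.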

8 years, 1 month

Re: [TUHS] 211bsd: kernel panic after a 'here document' in tcsh

by Johnny Billquist

On 2017-06-06 04:00, Michael Kjörling <michael(a)kjorling.se> wrote:
>
> On 5 Jun 2017 16:12 +0200, from w.f.j.mueller(a)retro11.de (Walter F.J. Mueller):
>> I'm using 211bsd (Version 447) and found that a 'here document' in tcsh
>> leads to a kernel panic. It's absolutely reproducible on my system, both
>> when I run it on my FPGA PDP-11 and in simh. Just doing
>>
>> tcsh
>> cat << EOF
> I'm curious whether the same thing happens if you try that in some
> other shell? (Not sure how widely here documents were supported back
> then, but I'm asking anyway.)

Not sure if any of the other shells have this. We're basically talking csh, sh and ksh, unless I remember wrong. But it's a good question. If no one else has tried it by tomorrow, I could check.

>> is enough, and I get
>>
>> ka6 31333 aps 147472
>> pc 161324 ps 30004
>> ov 4
>> cpuerr 20
>> trap type 0
>> panic: trap
>> syncing disks... done
>>
>> looking at the crash dump gives
>>
>> cd /etc/crash
>> ./why 4
>> Backtrace:
>> 0147372: _boot(05000,0100) from ~panic+072
>> 0147414: _etext(011350) from ~trap+0350
>> 0147450: ~trap() from call+040
>> 0147516: _psignal(0101520,0160750) from ~trap+0364
>> 0147554: ~trap() from call+040
>>
>> so the crash is in psignal, which is afaik the kernel internal
>> mechanism to dispatch signals.
> The PC value in the panic report ("pc 161324") strikes me as high, but
> 161324 octal is 58068 decimal, so it's not excessively so, and perhaps
> in line with what one might expect to see with a kernel pinned near
> top of memory. Are the offsets in the backtrace constant, i.e. does it
> always crash on the same code?

161324 is way high. This is in kernel mode, and that is in the I/O page. Basically no code lives in the I/O page (some boot ROMs and hardware diagnostics excepted). This smells like corrupted memory (pointer or stack), or something else very funny.

> Not knowing what cpuerr 20 is specifically doesn't help, and at least
> http://www.retro11.de/ouxr/29bsd/usr/src/sys/sys/trap.c.html#n:112
> (which doesn't seem to be too far from what you are running) isn't
> terribly enlightening; CPUERR is simply a pointer into a memory-mapped
> register of some kind, as seen at
> http://www.retro11.de/ouxr/29bsd/usr/include/sys/iopage.h.html#m:CPUERR,
> and at least pdp11_cpumod.c from the simh source code at
> http://simh.trailing-edge.com/interim/pdp11_cpumod.c wasn't terribly
> enlightening, though of course I could be looking in entirely the
> wrong place.

Like others said, the CPU error register is documented in the processor handbook. 020 means Unibus timeout, which is consistent with trying to access something in the I/O page where there is no device configured to respond to that address.

I just tried the same thing on a simh system here, and I do not get a crash. This is on 2.11BSD at patch level 449, running on an emulated 11/94. I do however get tcsh to crash.

    simh:/home/bqt> su -
    Password:
    erase, kill ^U, intr ^C
    # tcsh
    simh:/# cat << EOF
    Illegal instruction - core dumped
    #
    Suspended (tty input)
    simh:/home/bqt>
    simh:/home/bqt> cat /VERSION
    Current Patch Level: 448
    Date: January 5, 2010

Yes, it says patch level 448, but it really is 449. This was the system where I worked together with Steven when doing the 449 patch set, but I never got around to actually updating the VERSION file itself. Also, this was while running on the console.

Could you (Walter) try the latest version of 2.11BSD and see if you still get that crash?

Johnny

--
Johnny Billquist        || "I'm on a bus
                        ||  on a psychedelic trip
email: bqt(a)softjar.se   ||  Reading murder books
pdp is alive!           ||  tryin' to stay hip" - B. Idol
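Decoding the "cpuerr 20" value from the panic report can be sketched as a bit test. Only 020 (Unibus timeout) is confirmed by the message above; the other bit assignments are recalled from the PDP-11/70 processor handbook and are assumptions here, as are all constant and function names.

```c
#include <stdint.h>

/*
 * CPU error register bits (octal). Only CPUERR_UNIBUS_TMO is confirmed
 * by the message; the remaining values are ASSUMED from the PDP-11/70
 * processor handbook and should be verified against it.
 */
enum {
    CPUERR_RED_ZONE    = 0004,  /* red zone stack limit */
    CPUERR_YELLOW_ZONE = 0010,  /* yellow zone stack limit */
    CPUERR_UNIBUS_TMO  = 0020,  /* Unibus timeout */
    CPUERR_NXM         = 0040,  /* non-existent memory */
    CPUERR_ODD_ADDR    = 0100,  /* odd address error */
    CPUERR_ILL_HALT    = 0200   /* illegal HALT */
};

/* Describe the highest-numbered error bit that is set. */
const char *cpuerr_describe(uint16_t cpuerr)
{
    if (cpuerr & CPUERR_ILL_HALT)    return "illegal HALT";
    if (cpuerr & CPUERR_ODD_ADDR)    return "odd address error";
    if (cpuerr & CPUERR_NXM)         return "non-existent memory";
    if (cpuerr & CPUERR_UNIBUS_TMO)  return "Unibus timeout";
    if (cpuerr & CPUERR_YELLOW_ZONE) return "yellow zone stack limit";
    if (cpuerr & CPUERR_RED_ZONE)    return "red zone stack limit";
    return "no error bit set";
}
```

The panic report prints the register in octal, so the reported value corresponds to cpuerr_describe(020), which returns "Unibus timeout".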

8 years, 1 month

211bsd: kernel panic after a 'here document' in tcsh

by Walter F.J. Mueller

Hi,

I'm using 211bsd (Version 447) and found that a 'here document' in tcsh leads to a kernel panic. It's absolutely reproducible on my system, both when I run it on my FPGA PDP-11 and in simh. Just doing

    tcsh
    cat << EOF

is enough, and I get

    ka6 31333 aps 147472
    pc 161324 ps 30004
    ov 4
    cpuerr 20
    trap type 0
    panic: trap
    syncing disks... done

looking at the crash dump gives

    cd /etc/crash
    ./why 4
    Backtrace:
    0147372: _boot(05000,0100) from ~panic+072
    0147414: _etext(011350) from ~trap+0350
    0147450: ~trap() from call+040
    0147516: _psignal(0101520,0160750) from ~trap+0364
    0147554: ~trap() from call+040

so the crash is in psignal, which is afaik the kernel internal mechanism to dispatch signals.

Questions:
  1. has anybody seen this before?
  2. any idea what the reason could be?

With best regards, Walter

8 years, 1 month
