of 4.1 bad traps

a while ago, i solicited input from people who were seeing bad trap

panics in SunOS 4.1. several months ago (end of june, i think),

i started collecting examples. now i have a summary and some

explanations for general consumption.

a bad trap is exactly what it sounds like - the kernel got a trap

(hardware or software interrupt) that it can't handle. there are

two major flavors: timeouts and data faults. timeouts aren't that

interesting, since they're usually hardware related (out-of-revision

memory boards failing to latch signals for the proper setup and

hold times, for example). data faults are more interesting, because

they are caused by data corruption of one sort or another:

a non-reentrant routine was re-entered, or a null pointer got dereferenced.

in the case of a null pointer, the traceback should show you an address

down in the first two pages of memory.
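
to see why the address lands so low: dereferencing a field through a null
pointer touches address 0 plus the field's offset in the structure. a minimal
sketch, where struct demo_queue and null_fault_address() are made-up names
for illustration and not real kernel code:

```c
#include <stddef.h>

/* hypothetical kernel-style structure, for illustration only */
struct demo_queue {
        struct demo_queue *q_next;
        char               q_pad[200];
        int                q_count;
};

/*
 * address the cpu would try to touch for p->q_count when p is NULL:
 * base address 0 plus the offset of the field within the structure.
 */
unsigned long null_fault_address(void)
{
        return (unsigned long)offsetof(struct demo_queue, q_count);
}
```

the result is a few hundred bytes at most for a structure like this one,
which is why a small fault address in the panic output points at a null
pointer.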

there are 4 classes of bad traps that fell out of the mail i collected.

1. flush_windows()

this is peculiar to sparc machines and is caused by a race condition

in the fork/context switch code. in general, it looks like you either

see this one *a lot* or you don't ever see it. processes that have

huge stacks (eg, lots of local variables in procedure calls or deeply

nested calls) tend to be affected as well as shells that do an explicit

setting of the stack size limit as part of their initialization.

a patch is available.

2. streams (tty) read

if you are seeing "zs?: parity error ignored" messages around the time

of the panic, and panicking in the streams tty code (strq, strread, etc)

then you may be getting stung by the message itself. logging the message

actually drops the kernel priority for a small window of time; if you

have to handle more tty input during that window there is danger of

damaging streams data structures.

a patch should be available soon, although the best fix is to determine

the cause of the parity errors (line noise, poor grounding, device that

leaves half-frames between connect/disconnects, etc).

3. streams (tty) ioctl

i've only seen this in one place, and the user was doing

        while (1) {
                ioctl(0, FIONREAD, &r);
                read(0, buf, r);
        }

with a fire-hose-like input stream containing control characters.

the problem looks like the canonical input processing was nuking

characters as fast as the FIONREAD ioctl() was trying to count

them. the kernel panics somewhere in msgdsize().

fix: if you're reading raw (unprocessed) input from a tty device, turn

off canonical processing or pop the tty modules off of the stream.

use select() instead of FIONREAD unless you absolutely need to know

how many characters are in the stream.
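
the fix above might be sketched as follows. this is an illustration, not a
tested sunos 4.1 program: it assumes posix termios and modern header names
(<sys/select.h> in particular), and set_noncanonical() and read_when_ready()
are names invented here.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <termios.h>
#include <unistd.h>

/*
 * turn off canonical (line-at-a-time) processing on a tty.
 * returns 0 on success, -1 if fd is not a tty or the call fails.
 */
int set_noncanonical(int fd)
{
        struct termios t;

        if (tcgetattr(fd, &t) < 0)
                return -1;
        t.c_lflag &= ~ICANON;
        t.c_cc[VMIN] = 1;       /* read() returns once one byte is available */
        t.c_cc[VTIME] = 0;      /* no inter-byte timer */
        return tcsetattr(fd, TCSANOW, &t);
}

/*
 * wait up to timeout_sec seconds for input, then read whatever is there.
 * returns bytes read, 0 on timeout or end of file, -1 on error.
 */
ssize_t read_when_ready(int fd, char *buf, size_t len, int timeout_sec)
{
        fd_set rfds;
        struct timeval tv;
        int n;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        n = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (n <= 0)
                return n;       /* 0: timed out, -1: select() failed */
        return read(fd, buf, len);
}
```

a caller would run set_noncanonical(0) once at startup and then loop on
read_when_ready(0, buf, sizeof buf, 5). read() hands back however many bytes
are waiting, up to the buffer size, so there is no need to count them first.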

4. ifconfig on non-ethernet device

the sunos 4.1 ifconfig has a very neat feature: it will display

the ethernet (MAC) address of a network interface. if you ifconfig

a non-ethernet device (eg, sync serial line), this may panic the

system *if* you have the NIT device present (ifconfig uses the NIT

interface to glean the address info).

fix: remove /dev/nit or take "options NIT" out of the kernel configuration.

if you're booting diskless clients, you can't do this: rarpd requires

the NIT device to be present. just say no to "ifconfig ifd0"

--hal stern

  sun microsystems

  northeast area consulting group

  halstern@sun.com
