Low-level SCSI debugging
Nov. 19th, 2011 11:12 pm
The hardware engineer on the project rented a SCSI analyzer, which also insisted the drive was misbehaving. It clearly showed the read command going out, and the "OK" status coming back. So we ordered another drive, and when it arrived, it did exactly the same thing. Monitoring other devices showed the expected behaviour: send a read command, get data, then get the OK status.
I, however, did not trust the SCSI analyzer. It assumed that everything on the bus was behaving according to specification, and was designed to show what data was going back and forth, not to investigate weird protocol violations.
Accordingly, I went and rounded up a logic analyzer, which just shows the raw signals, and does not interpret them at all. It is more effort to figure out what's going on from the raw logic levels, but the logic analyzer doesn't hide anything either. And sure enough, when I puzzled out what the logic analyzer was telling me, it became clear what was happening. The computer would put the read command on the bus, one byte at a time: assert the strobe signal to indicate that the command byte was ready to be read, take away the byte, and wait for the "ack" (acknowledge) signal back from the target device. And this is wrong. What it should do is leave the byte on the bus until it gets the ack back.
The SCSI control chip in the computer was very simple, and did not do the signal sequencing itself, depending on its device driver to do so. And the device driver source code (fortunately, we had access to it) showed the same sequence of events: put data on bus, assert strobe, take data away, wait for ack. So I swapped two lines in the driver, so it would put the data on the bus, assert strobe, wait for ack, and then take the data away.
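The difference between the two orderings can be sketched as a toy simulation. This is not the actual driver code, which the post does not show. The class `BusWithSlowTarget`, the functions `send_buggy` and `send_fixed`, and the sample command bytes are all hypothetical, and the model assumes a target that only samples the data lines at the moment it acks.

```python
IDLE = 0x00  # per the story, released data lines read as zero


class BusWithSlowTarget:
    """Hypothetical bus plus a target that samples the data lines only when it acks."""

    def __init__(self):
        self.data = IDLE      # value currently driven on the data lines
        self.strobe = False   # strobe line state
        self.received = []    # bytes the target believes it was sent

    def wait_for_ack(self):
        # The target reads the bus some time after the strobe, then acks.
        assert self.strobe, "ack without a strobe"
        self.received.append(self.data)
        self.strobe = False


def send_buggy(bus, command):
    """Original order: put data, strobe, take data away, wait for ack."""
    for byte in command:
        bus.data = byte
        bus.strobe = True
        bus.data = IDLE       # released too early
        bus.wait_for_ack()


def send_fixed(bus, command):
    """Fixed order: put data, strobe, wait for ack, take data away."""
    for byte in command:
        bus.data = byte
        bus.strobe = True
        bus.wait_for_ack()    # data is still valid when the target samples
        bus.data = IDLE


# Illustrative 6-byte read command (opcode 0x08); the address and length are made up.
READ_CMD = bytes([0x08, 0x00, 0x00, 0x10, 0x01, 0x00])

buggy_bus, fixed_bus = BusWithSlowTarget(), BusWithSlowTarget()
send_buggy(buggy_bus, READ_CMD)
send_fixed(fixed_bus, READ_CMD)

print("buggy driver delivered:", bytes(buggy_bus.received).hex())
print("fixed driver delivered:", bytes(fixed_bus.received).hex())
```

The only difference between `send_buggy` and `send_fixed` is the order of the last two statements in the loop, which mirrors the two-line swap described above.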
And lo, the SCSI floppy disk drive started to work perfectly! The remaining question was, why did the other devices work? My theory is that the other, fancier, devices had hardware SCSI interfaces that latched the incoming command bytes immediately upon strobe, so they didn't care that the data went away immediately afterward. Whereas the floppy drive implemented its SCSI interface with a microcontroller. The strobe signal would send an interrupt to the microcontroller, which would then go read the data byte off the SCSI bus. Unfortunately, by the time it got around to it, the data was gone, and the bus terminators had pulled the data lines back to their idle state of zero. And, sure enough, a SCSI command block of all zeroes is a valid command: "test unit ready", for which the correct response is simply "OK".
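The theory above can be illustrated with a small model of the two kinds of target. This is a sketch under stated assumptions, not a description of any real device's firmware. The names `buggy_driver_events`, `latching_target`, `interrupt_target`, and `decode` are hypothetical, and the sample read command is made up apart from its opcode.

```python
IDLE = 0x00  # per the story, the released data lines read as zero

# Only the two opcodes relevant to the story.
OPCODES = {0x00: "TEST UNIT READY", 0x08: "READ(6)"}


def buggy_driver_events(command):
    """Yield (moment, data_lines) pairs in the order the original driver produced them."""
    for byte in command:
        yield ("strobe", byte)  # data is valid at the strobe edge
        yield ("later", IDLE)   # driver has already released the lines


def latching_target(events):
    """Hardware interface: captures the data lines at the strobe edge."""
    return bytes(data for moment, data in events if moment == "strobe")


def interrupt_target(events):
    """Microcontroller interface: reads the data lines some time after the strobe."""
    return bytes(data for moment, data in events if moment == "later")


def decode(cdb):
    """Name the command from its first byte (the opcode)."""
    return OPCODES.get(cdb[0], "unknown")


# Illustrative 6-byte read command; the address and length are made up.
READ_CMD = bytes([0x08, 0x00, 0x00, 0x10, 0x01, 0x00])

seen_by_latch = latching_target(buggy_driver_events(READ_CMD))
seen_by_micro = interrupt_target(buggy_driver_events(READ_CMD))

print("latching target saw:", seen_by_latch.hex(), "->", decode(seen_by_latch))
print("interrupt target saw:", seen_by_micro.hex(), "->", decode(seen_by_micro))
```

In this model the latching target decodes the command as READ(6), while the interrupt-driven target sees six zero bytes and decodes TEST UNIT READY, which matches the "OK" status with no data that the analyzers showed.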