Discussed here is a bug which affected IBM System 370 Models 158AP and 158MP. These configurations had two processors. (The bug is not relevant to the single processor 158UP.)
The complex interacting bug described here is fascinating because it involves all of the following: interacting hardware from two competing vendors, a forced memory-refresh delay, buffer (cache) inconsistency, a multiprocessor race condition, and new operating-system software which finally exposed the flaw.
The bug did not manifest itself until IBM released the new operating systems, MVS/SE and VM370/SE, in late 1977, although these systems ``did nothing wrong.'' When the bug first surfaced, some called it the ``compare and swap'' bug, but this was a gross misnomer since compare-swap was the one instruction which could not provoke the bug.
During the late 1970's I was frequently involved in troubleshooting IBM mainframes, or rather ``plug-compatible'' main memory attachments for those mainframes. I encountered a variety of problems, ranging from simple misplaced wires to outright design errors.
Often the misplaced wire was more challenging to diagnose than a design error, but the most interesting bug of all involved design errors. I'll tell that story in great detail, going on tangents which, though irrelevant to ``The Unusual Bug (tm),'' are instructive in their own right.
Something that made engineering IBM ``plug-compatible'' systems particularly challenging was that modifications were made to essentially ``hostile'' circuits. Thus a problem which IBM might solve with a simple redesign of a CPU circuit board could not easily be solved by us the same way -- we would then need to redesign the entire IBM circuit, an expensive and time-consuming undertaking. Instead we would try to design a parallel circuit which could ``wire-or'' to and override the IBM logic. When no way could be found to achieve the desired logic with a ``wired-or'' we might use an Exacto knife to cut an IBM printed-circuit trace, though many project managers refused to countenance that.
IBM mainframes used static memory, but the Brand X companies saved cost with cheap high-density dynamic memories. These memories needed to be refreshed periodically; if the refresh cycles could not be hidden during system operation, the CPU would have to be held off (stopped or prevented from accessing memory) while the dynamic memory did a forced refresh. AMS/Intersil used hidden refreshes, so its dynamic memory caused only slight degradation of system performance. NatSemi, on the other hand, used only forced refresh on some models and needed to signal the CPU to hold off once every 16 microseconds.
Eventually, holdoff circuitry was designed for the IBM 370 models 138, 148, 158 and 168 (and later for 3031, 3032, 3033 -- all relatives of the 168). The 148 design was more challenging and interesting than that of the 138, and the 158 design was more challenging still -- the 158 was a complicated beast, with non-obvious issues. Perhaps paradoxically, the top-of-the-line 168 was the easiest of all to modify -- because of its large number of components, IBM designed its hardware to have a very careful modular structure so it was easy to manipulate.
The larger IBM models are complicated by a memory hierarchy: recently fetched data is saved in a high-speed buffer (cache) to avoid redundant fetches to the slower backing store; when the backing store is written the high-speed buffer must be checked for that address, and the buffer entry also rewritten, or at least invalidated. (Even less expensive models like the 148 may have this issue, although the ``buffer'' may just be a few bytes representing the next instruction(s) to be executed.)
When the 158AP or 158MP performs a memory write, three actions have to happen: the local high-speed buffer is updated, any matching entry in the remote CPU's buffer is invalidated, and the backing store itself is written.
(The remote buffer is invalidated, not overwritten. Although the CPU has a data bus to the opposite backing storage, backing and buffer store operate independently and do not share the same input data bus.)
The three actions are initiated concurrently, but only the first two need take place quickly. The backing store write may be delayed by a previous write from this CPU which has not yet completed or any operation from the other CPU, if it is accepted first. It can be further delayed when both local write and remote access are pending. On machines with non-IBM add-on ``dynamic'' memory, access can be delayed still further by a forced refresh cycle.
The 158 AP/MP systems we discuss have some bug-like features which are not related to the Unusual Bug that developed, but I mention them just so this page can be a compendium of interesting design issues.
For example, although the two CPU's in these systems operate at identical 8.69565 MHz clock speeds, each CPU has its own independent oscillator, so control signals from the remote CPU occur at completely arbitrary times relative to the local oscillator. A single clocked latch cannot reliably synchronize such a signal to the local clock; this is because the asynchronous signal might switch at exactly the same time as the clocking edge -- in that case the latch output may be poorly behaved. (In some logic families it might oscillate. With the ECL circuits used in IBM mainframes in the 1970's, a more likely failure mode would be for the signal to remain near threshold level for a while, neither at a high (`true') nor low (`false') level.) Such a latch is said to enter a ``metastable'' state.
The output of the metastable latch can be clocked into a second latch, say 45 nanoseconds after the first latch is clocked. This is likely to produce a clean signal, but is still not guaranteed: the faulty level output by the first (metastable) latch may put the second latch also into a metastable mode (``Garbage in, garbage out'').
The usual solution is to provide yet more clocked latching in series: the more such latches, the more unlikely failure becomes. In the circuit diagram that follows, `-RMT SELECT' is seen to pass through three such synchronizing latches, but my memory is hazy -- in the IBM circuit there might have been four latches in this series.
(An article in an engineering magazine about this problem had an interesting table showing, for different logic families, the probability a latch output could still be metastable X nanoseconds after the clock. I remember CMOS scored well, and ECL scored poorly, but the surprising thing was that popular 74S-series TTL parts scored all over the chart: those manufactured by Texas Instruments outperformed CMOS, while parts from National Semi with ``identical form, fit and function'' to the T.I. parts were the worst in the chart.)
We begin our look at details by depicting the circuitry which initiates a backing store cycle on the 370 Model 158. This is from my 30-year-old memory, but I think it is substantially correct.
The relevant nets are labeled with the names as they appeared in the IBM logic diagrams: ``- Local Select'' and ``+Remote Select'' are the signals from the two respective CPUs which request access to a backing-store memory cabinet.
``Fet Control Busy'' is active when a backing store memory cycle is in progress (the 158 backing store used Field Effect (MOS) Transistors, hence the ``FET'' mnemonic); ``Remote Cycle'' indicates to which CPU's request the backing store is responding, and ``End Reset'' is a pulse generated a fixed delay after the falling edge of ``- FET CTRL BZ.''
Ignore the peculiar part of the circuitry in green; it isn't part of the IBM circuit and is discussed in the next section.
Special System/370 op-codes (Compare-Swap and Test-Set) allow one CPU to deny access to the other; this is indicated in the diagram (``cs/ts'') but without details.
Note that Remote Select has priority over Local Select, but with contemporaneous selects, Local goes first because Remote is delayed by the synchronization latches discussed in the previous section. The ``+RMT CYCLE'' signal must not change states while ``-FET CTRL BZ'' is active; note that the logic ensures this.
The ``Fet Control Busy'' signal exists in two polarities; the minus-active is connected, unbuffered, to the backplane (``mother-board'') so it can be used by other circuit boards. The diagram does not depict those details, though they happen to be quite fortuitous as implied in the next section.
One engineer (John Bizjak) designed a circuit to holdoff the 158 from accessing its backing store, and supplied it to several manufacturers.
The relevant circuit is so simple that it may seem amazing it does the job: delaying the memory of a million-dollar computer, and letting it start up again right on schedule. ``Want Holdoff'' requests a forced refresh, and ``Holdoff'' acknowledges that it may begin. ``Refresh complete'' terminates the holdoff; it must be synchronized to the appropriate 158 clock phase.
I've shown the add-on circuits in a different color from the IBM circuits in the previous diagram, but ``-FET CTRL BZ'' is the IBM net. It is both an input and an output on an AND-gate: this is called a one-gate latch. Since the IBM 158 computer used NPN emitter-coupled logic, the output of this circuit is OR'ed (``wire-or'ed'') with the output of IBM's ``-FET CTRL BZ'' circuit. While the one-gate latch is set, a backing-store cycle cannot be initiated, but the latch will not set if such a cycle is already in progress.
With the Bizjak circuit added on, IBM's ``FET CTRL BZ'' latch will still set as usual but, when the add-on memory is doing forced refresh, the backing-store cycle will be delayed.
Note a curious feature of the IBM circuit when this add-on is installed: During Holdoff, the Remote Cycle latch will be allowed to change from local to remote selection even though a Local select has previously set the Fet-Control-Busy latch! (This leads directly to the Unusual Bug which is the subject of this web page.)
Although it did not cause trouble here, I will mention an unusual logic hazard that can arise in a wire-or like that used in the Bizjak circuit.
The Bizjak circuit is installed in an empty card slot in IBM's card cage. Because there are no empty slots near the IBM circuit board that generates ``-FET CTRL BZ'', the add-on ``one-gate latch'' will be at least 60 centimeters away (``as the electron flies'') from IBM's latch; in other words the drivers that are wire-or'ed together are two feet apart.
When both inputs to an OR-gate are active (high) and one goes low, you expect the output to remain high, but that is not what happens here! The add-on circuitry used different power-supply voltages and had a slightly lower high level, so the IBM driver would supply all the current to the wired-or net (See Note). When that transistor turns off, its emitter voltage will immediately begin dropping, pulled down by a nearby (ca 390 ohm) resistor. This falling voltage cannot be reversed until it is ``seen'' by the other drive transistor, which will then turn on, and until that turn-on is ``seen'' back at the first source. In other words there is a delay of eight nanoseconds minimum before the falling voltage can be reversed! (Light travels about one foot per nanosecond in vacuum, but electric signals move at only about half that speed in copper, and the signal has a round trip of 4 feet through the 2-foot wire.)
Placing an oscilloscope on IBM's Fet-Control-Busy net, one did indeed see a clear negative-going glitch whenever the CPU attempted to initiate a backing-store cycle during a forced refresh. However the glitchy excursion only went barely below the threshold level and was not quite enough to cause trouble. Installers took care to use a short wire, keeping the Fet-Control-Busy connection to two feet: the system would fail when a 3-foot wire was substituted!
Note: The industry standard ECL 10K family used a 5.2 volt power supply differential and was not fully compatible with IBM's MST logic family and its three voltages -- +1.25, 0, and 3.0 volts. (Like other add-on designs, the Bizjak circuit used an add-on -4.0 volt power supply to achieve a 5.25 volt differential.) However, because of minor device variations, the described logic hazard would exist even if the two drivers had been compatible.
The add-on manufacturers began attaching dynamic memory to IBM 370 Model 158 in 1975 and despite the peculiarities mentioned, things were OK at first, even on the 158MP and 158AP systems.
It may be somewhat uncommon for one CPU to be caching data from a memory section into which the other CPU is writing, so the software might have suffered no ill effects even if the buffer invalidation mechanism were defective!
When both CPUs are executing kernel code, they may each have copies of shared variables in their caches, but the kernel will often use Test-Set or Compare-Swap when updating such a shared variable. On the 158, these instructions are handled with special microcode. The sequence begins with Read-and-Lock, which bypasses the buffer, sets a latch to prevent the other CPU from accessing the backing store, and reads backing store. If a Compare operation leads to a Write, that Write will invalidate the remote buffer entry, update backing store, and only then reset the Lock latch. Caches and backing store thus have ensured consistency.
CPUs are not required to use the Read-and-Lock/Compare/Write mechanism on all updates to shared variables, however, and as we will see the caches could become inconsistent. The IBM design which allowed this might be called ``precarious'' but could not fairly be described as ``buggy'' -- failures arose only when non-IBM (``Brand X'') memories were attached.
IBM began shipping new operating systems (MVS/SE and VM370/SE) shortly before I went to work for NatSemi. The 158 was regarded as a ``done deal'' so I was asked to design circuitry for the 3031, 3032 and 3033 attachments. But soon strange reports began drifting in about 158AP and 158MP systems.
The symptoms varied from site to site. In several cases, the only problem was that a terminated job survived in a ``zombie'' status: the operators could ``kill'' that job explicitly and continue normally. At other installations, CPU B (the ``AP side'') would go into a tight infinite loop. Those customers had the choice of continuing with half their CPU power out of action, or doing a re-IPL (and then hanging again, randomly, after several hours). At still other installations, the operating system would crash completely.
On one particular machine, a kernel variable (SRBCNT) would go negative. Since this variable starts at zero, increments when an SRB begins, and decrements when it exits, it should never become negative ... but it did. The problem began as an inconsistency between the two buffers of the two CPUs, but it could be quickly disguised: one CPU fetches its inconsistent value, updates it, and writes it to backing store. The buffered and backing-store data are all then identical -- but wrong.
IBM's diagnostic strategy may have been simplified once they noticed that all the failing machines had NatSemi add-on memory installed!
IBM's top software specialists developed special diagnostic patches to detect the buffer inconsistencies before they ``healed'' themselves. Because the Compare-Swap instruction fetches from the backing-store on Model 158, software can detect buffer inconsistency by simply doing an ordinary fetch and a Compare-Swap fetch and comparing the results.
The telephone communication that develops among engineers in such situations is rather like a game of ``gossip.'' IBM had little incentive to put its best engineers in contact with NatSemi, and NatSemi let managerial types communicate at its end. For some reason, all these people were calling the syndrome the ``Compare-and-Swap'' problem. I had a hunch what the real problem was, but knew it couldn't be Compare-Swap. When I finally got an IBM engineer on the phone, I asked him what instruction was used to write SRBCNT. He thought it was CS (Compare-Swap) but I asked him to please check the source code. When he came back on the line with the answer, he seemed surprised: STH (Store Halfword).
I have no knowledge of IBM's operating systems -- my programs were standalone diagnostics -- but I suspect that MVS/SE was more aggressive than its predecessor in allowing both CPUs to execute dispatcher code in the kernel concurrently. They took care to keep the data consistent, but this needn't require that every write instruction be ``locked'' (use the CS or TS opcode): a single locked write can suffice to grant one processor exclusive, software-controlled write authority.
The following table shows what I think was happening to SRBCNT. Note that it wasn't necessary that CPU B read SRBCNT while CPU A held the kernel lock; the bug is provoked by a read of any datum (XYZ) in the same quadword (the 158 memory operates in 16-byte chunks). The table is not ``to scale''! The actual STH R5,SRBCNT instruction is broken into parts to show cache and backing-store changes separately.
Kernel |         CPU actions            |      Value of SRBCNT          |
 Lock  |    CPU A      |    CPU B       | Buf A | Buf B | Backing store | Comment
-------+---------------+----------------+-------+-------+---------------+----------------------------------------
   B   | -             | dispatch SRB   |   0   |   0   |       0       | B issues an SRB,
   B   | -             | (begin STH)    |  Inv  |   1   |       0       | updating SRBCNT
   -   | -             | -              |   1   |   1   |       1       | (time passes, A's cache gets reloaded)
   A   | dispatch SRB  | -              |   1   |   1   |       1       | A issues an SRB,
   A   | (begin STH)   | -              |   2   |  Inv  |       1       | but before the Store-Halfword
   A   | ( ( holdoff for forced refresh in A's memory ) )               | makes it to backing store
   A   | -             | Read XYZ       |   2   |   1   |       1       | an unrelated fetch fills
   A   | (end STH)     | -              |   2   |   1   |       2       | B's cache with obsolete data
   B   | -             | SRB exit       |   2   |   1   |       2       | B needs to decrement SRBCNT;
   B   | -             | (begin STH)    |  Inv  |   0   |       2       | this time the hardware
   B   | -             | (end STH)      |  Inv  |   0   |       0       | operation proceeds correctly
   -   | -             | -              |  Inv  |   0   |       0       | but the resultant value is wrong
   -   | -             | -              |   0   |   0   |       0       | (time passes, A's cache gets reloaded)
   A   | SRB exit      | -              |   0   |   0   |       0       | Now A decrements SRBCNT;
   A   | (begin STH)   | -              |  -1   |  Inv  |       0       | again hardware action is correct
   A   | (end STH)     | -              |  -1   |  Inv  |      -1       | but an ``impossible'' negative results

( ( with an erroneous SRBCNT the MVS system does not behave properly ) )
The problem was caused by a coincidence of three independent events: one CPU updating a shared variable with an ordinary (unlocked) store, a forced refresh delaying that store's arrival at the backing store, and a fetch by the other CPU, from the same quadword, during that delay.
The normal sequence of events would be: update the local buffer, invalidate the remote buffer entry, and write the backing store -- so that any subsequent remote fetch refills the remote buffer with up-to-date data.
With the forced refresh, the operations may be reordered: the local buffer is updated and the remote buffer invalidated, but the backing-store write is held off, and a remote fetch during the holdoff refills the remote buffer with stale data.
(It may seem wrong to invalidate B's buffer before updating the backing store, but unmodified IBM machines did not fail.)
IBM modified the kernels on each failing machine, not to fix the problem, but to detect it at the earliest stage. Since machines had different symptoms, the patches were different, but usually checked explicitly for a mismatch between cache and backing store.
In those days, IBM operated under a consent decree ordered by a federal judge, stating that IBM was a monopoly and needed to cooperate to some degree with competing manufacturers and maintenance services. However, although IBM had figured out the details of the problem, it did not feel obliged to educate NatSemi. (It did submit a large bill for the many hours IBM had spent troubleshooting the problem.)
(Eventually, other manufacturers of dynamic add-on memory for the 158 had to address this problem. NatSemi had the first opportunity since, lacking any mechanism to hide refresh, they were holding off the CPU once every 16 microseconds and thus provoked a high failure rate.)
It fell to NatSemi to rediagnose and solve the problem. Although a customer machine might only fail once a day or so, I wrote a simple program which would cause and detect a buffer mismatch after a few seconds. This enabled us to test design fixes, and we developed circuitry both to hide some refreshes and to preserve the Local/Remote ordering during forced refreshes. The fixes were installed on customer machines, and all began acting correctly.
Or should I say, all but one ...
The electric utility in Columbus, Ohio had an IBM 158AP with NatSemi memory attached. Their machine had suffered problems and IBM patched the kernel to detect buffer inconsistencies. NatSemi installed its fixes, but the kernel patch continued to report errors.
The 158AP has backing store only on A's side so, as explained above, the Bug can lead to a mismatch between cache and backing store only on B's side. However, the mismatches now reported by the modified kernel were occurring on the A side. It seemed to me that the defect was probably in the ad-hoc kernel patch rather than the hardware but the customer in Columbus refused to revert to the virgin operating system.
I examined the kernel patch. It was in 370 Assembler language, of course, but was equivalent to the following C code:
backstore_data = backstore_fetch(&shared_variable);   /* CS opcode: bypasses the cache */
/* Window 1 */
cache_data = cache_fetch(&shared_variable);           /* ordinary fetch */
/* Window 2 */
if (cache_data != backstore_data) {
    /* the other CPU may have updated shared_variable during Window 1; retry */
    backstore_data = backstore_fetch(&shared_variable);
    if (cache_data != backstore_data)
        report_inconsistency();   /* the patch's error action (this name is mine) */
}
(As explained earlier, backstore_fetch() is accomplished with the CS opcode.)
The bug-detection patch does two comparisons, to allow for the possibility that the other CPU will update shared_variable exactly at the point marked ``Window 1.''
Is there a bug in this bug-detector? Could the other CPU update shared_variable both during ``Window 1'' and during ``Window 2''? Perhaps we'll never know for sure now, but I assume that was happening.
NatSemi was ordered to remove its memory attachment, but the kernel patch continued to report ``cache inconsistencies'' even when there was no dynamic memory installed.
The bug in the bug-detector at Columbus, Ohio should teach us a healthy respect for Murphy's Law as it applies to multi-processor races.
I hoped to describe an interesting bug from the past, but I'm afraid it ended up long-winded and boring. Please e-mail me and tell me what you thought.