Sometime in the 80s, I implemented the core of the Mandelbrot Set calculation using assembly on an 8087. As the article mentions, the compilers did math very inefficiently on this stack architecture. For example, if you multiplied two numbers together and then added a third, they would push the first two numbers, multiply, pop the result, push the result back onto the stack (perhaps clearing the stack? after 40 years I don't remember), push the third number, add, pop the result. For the Mandelbrot loop this was even worse, as it never kept the results of the loop. My assembly kept all the intermediate results on the stack for a 100x speed up.
Running this code, the 8087 emitted a high-pitched whine. I could tell when my code was broken and it had gone into an infinite loop by the sound. Which was convenient because, of course, there was no debugger.
Thanks for bringing back this memory.
Ah, lots of supposedly solid state computer stuff, including CPUs, did that. I, too, used it for debugging. This wasn't very conscious on my part, but if some whine became unusual and constant, it was often a sign of something hanging.
As I got older, not only did computers stop doing that, my hearing also got worse (entirely normal for my age, but still), so that's mostly a thing of the past.
I used to hear the 16KHz whistle of CRT monitors. Of course, there is no whistle with LED monitors, but I stopped hearing the CRT whistle before they went obsolete. It was my first sign my hearing was declining.
I thought I was protecting my ears from loud noises like rock concerts and gunshots. But I didn't know that driving with the window down damages the hearing. I crossed the country many times with the window down. I'm pretty sure that was the cause as my left ear is much worse off than my right.
I don't need a hearing aid yet, but I'm pretty careful in wearing ear plugs whenever there are loud noises.
16 kHz is very high on the spectrum. Just the normal age-related decline of hearing makes that inaudible pretty quickly, you don’t need to drive with the window down for that.
You're right, but it was coincident with my realizing I had trouble hearing my watch tick with my left ear.
The sound usually comes from inductors and capacitors in the power-supply circuitry rather than from the ICs themselves, as the chips draw pulses of power in patterns at audible frequencies. Modern CPUs and GPUs will still whine audibly if given a suitable load; the amount of current they consume is amazingly high, dozens to hundreds of amps, and also changing extremely quickly.
I had a Radeon 5850 that did it. I ran someone's simple Unity test project with vsync disabled, was getting around 3000 fps, and heard a tone that was probably around 3000 Hz. Supposedly the 5090 FEs are pretty bad too.
The compilers available at the time that the 8087 was commonplace were overall horrible and easily beaten anyway.
On the other hand, skilled humans can do very very well with the x87; this 256-byte demo makes use of it excellently: https://www.pouet.net/prod.php?which=53816
Oh boy... more memories. About a decade later, at work, I identified a bottleneck in our line-drawing code. The final step was to cast two floats (a point) to integers, which the compiler turned into ftol() calls. Unfortunately, ftol changed and restored the floating-point control word in order to set the rounding mode (the Intel default rounding did not match the truncation the language spec requires). Even more unfortunately, this stalled the Pentium's instruction pipeline. Replacing the casts with a simple fld/fist pair was another 100x speedup. A few years later I noticed the compilers started adding optimization flags controlling this behavior.
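For anyone curious what the portable version of that fix looks like today, a minimal C99 sketch (an illustration of the trade-off, not the original code):

    #include <math.h>
    #include <stdio.h>

    /* Minimal C99 sketch of the trade-off described above.  (long)x must
     * truncate toward zero, so on an x87 target the compiler historically
     * had to switch the FPU control word to truncation and back around
     * every conversion.  lrint() rounds using the current rounding mode,
     * so it can compile down to a bare fld/fistp pair -- the same idea as
     * the hand-written fix.  (Modern x86-64 compilers use SSE conversions
     * instead, so the issue only bites x87 code generation.) */
    static long slow_trunc(double x) { return (long)x; }   /* control-word dance on x87 */
    static long fast_round(double x) { return lrint(x); }  /* plain fld/fistp on x87    */

    int main(void) {
        printf("%ld %ld\n", slow_trunc(2.7), fast_round(2.7));  /* prints: 2 3 */
        return 0;
    }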
Yah, I never did a very good job with x87 code generation. I'm a bit embarrassed by that.
The idea of listening to hardware running a program to tell what it is doing is surprisingly old. On the EDSAC computer[0] a little speaker was connected across one of the serial data lines, which allowed the progress of a program to be listened to. Skilled operators could immediately tell when a program had triggered a bug and either gone off into the weeds or entered a tight loop.
[0]: https://en.wikipedia.org/wiki/EDSAC
- You can do the Mandelbrot set with integers. In Forth it's 6 lines. (A fixed-point sketch in C follows after this list.)
- Coincidentally, Forth promotes a fixed point philosophy.
- Forth people defined the IEEE754 standard on floating point, because they knew how to do that well in software.
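Since the Forth source isn't shown here, a rough fixed-point rendering of the first point in C (the 16.16 scaling, window, and iteration cap are my own choices, not the Forth code's):

    #include <stdio.h>

    /* 16.16 fixed point: 1.0 == 1 << 16.  MUL multiplies two fixed-point
     * values using a 64-bit intermediate, then rescales. */
    #define ONE 65536L
    #define MUL(a, b) ((long)(((long long)(a) * (b)) >> 16))

    int main(void) {
        for (int py = -12; py <= 12; py++) {
            for (int px = -39; px <= 20; px++) {
                long cr = (long)px * ONE / 20;   /* real axis roughly -1.95 .. 1.0 */
                long ci = (long)py * ONE / 10;   /* imag axis roughly -1.2 .. 1.2  */
                long zr = 0, zi = 0;
                int i;
                for (i = 0; i < 32; i++) {
                    long zr2 = MUL(zr, zr), zi2 = MUL(zi, zi);
                    if (zr2 + zi2 > 4 * ONE)     /* |z|^2 > 4: escaped */
                        break;
                    zi = 2 * MUL(zr, zi) + ci;   /* use the saved squares */
                    zr = zr2 - zi2 + cr;
                }
                putchar(i == 32 ? '*' : ' ');    /* '*' = did not escape */
            }
            putchar('\n');
        }
        return 0;
    }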
> - Forth people defined the IEEE754 standard on floating point, because they knew how to do that well in software.
IEEE 754 was principally developed by Kahan (in collaboration with his grad student, Coonen, and a visiting professor, Stone, whence the name KCS draft), none of whom were involved with Forth in any way that I am aware. And the history is pretty clear that the greatest influence on IEEE 754 before its release was Kahan's work with Intel developing the 8087.
I'm a big fan of Kahan's work. I am just sad that the NaN remains terribly misunderstood.
The signalling NaN, however, turned out to be quite useless and I abandoned it.
I think the Zortech C++ compiler was the first one to fully support NaN with the Standard library.
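For readers who haven't run into it, a tiny standard-C illustration of the quiet-NaN behavior in question (nothing compiler specific):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double zero = 0.0;
        double x = zero / zero;                /* produces a quiet NaN */
        printf("%d\n", x == x);                /* 0: NaN is not equal even to itself */
        printf("%d\n", isnan(x * 2.0 + 1.0));  /* 1: NaN propagates quietly through arithmetic */
        printf("%d\n", x < 1.0 || x >= 1.0);   /* 0: every ordered comparison with NaN is false */
        return 0;
    }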
I think the 1985 standard/proposal from the Forth Vendor Group set a precedent.
Citation?
There were a couple interesting points about the market for 8087 chips -- Intel designed the motherboard for the IBM PC, and they included an 8086 slot and a slot for either an 8087 or 8089. IBM didn't populate the slot for the coprocessor chip as it would compete with their mainframes, but Intel went around marketing the chips to research labs. One of them ended up with Stephen Fried who founded Microway in 1981 to create software for the 8087 and sell the chips, and the company is still in business after 44 years of chasing high performance computing. That's how I first got started with computing - a Microway Number Smasher (TM) card in an IBM PC.
The 80287 (AKA 287) and 80387 (AKA 387) floating point microprocessors started to pick up some competition from Weitek 1167 and 4167 chips and Inmos Transputer chips, so Intel integrated the FPU into the CPU with the 80486 processor (I question whether this was a monopoly move on Intel's part). This was also the first time that Intel made multiple versions of a CPU - there was a 486DX and a 486SX (colloquially referred to as the "sucks" model at the time) which disabled the FPU.
The 486 was also interesting because it was the first Intel x86-series chip able to operate at a multiple of the base frequency, with the release of the DX2 and DX4 variants, which allowed clock rates of 50 MHz, 66 MHz, 75 MHz, and 100 MHz from the 25 MHz and 33 MHz base clocks. I had a DX2-66MHz for a while and a DX4-100. The magic of these higher clock rates came from the introduction of the cache memory. The 486 was the first Intel CPU to utilize a cache.
Even though Intel had superseded the 8087/287/387 floating point coprocessors by integrating the latest version into the 80486, they also introduced the 80860 (AKA i860), a VLIW RISC 64-bit processor whose FPU was significantly faster; it was also the first microprocessor to exceed 1 million transistors.
The dedicated, special-purpose FPU was eventually superseded by the GPU. Some of the first powerful graphics systems from companies like Silicon Graphics used a number of i860 chips on a card in a structure very similar to modern GPUs. You can think of each of the 12 i860 chips on an SGI Onyx / RealityEngine2 as something like a Streaming Multiprocessor in an NVIDIA GPU.
Obviously, modern computers run at significantly faster clock speeds with significantly more cache and many kinds of cache, but it's good to look at the history of where these devices started to appreciate where we are now.
> The 486 was the first Intel CPU to utilize a cache.
The 486 was the first Intel CPU to integrate a cache on its die (following the competing Motorola CPUs MC68020 and MC68030).
Previous Intel CPUs already utilized caches, otherwise they could not achieve 0-wait state memory access cycles.
The cheaper 80286 and 80386SX motherboards usually omitted the cache to minimize cost, but any decent higher-end 80386DX motherboard included an external write-through cache, typically between 32 kB and 64 kB, significantly bigger than the internal 8 kB write-through cache of the 80486. An 80386DX without cache could never approach its advertised speed.
Because of the small internal cache of the 80486, all good 486 motherboards implemented an external L2 cache, usually between 128 kB and 256 kB, as by that time the cost of cache memory chips had fallen considerably compared with the 80386 years.
In the beginning, write-through caches were used, as they were much easier to implement externally.
Pentium (1993) was the first Intel CPU with a write-back cache (of 16 kB), which then was also added to the Intel 486DX4 CPU (100 MHz). Then AMD made 2 kinds of 486DX4 @ 100 MHz CPUs, an early model with an 8 kB write-through cache and a late model with an 8 kB write-back cache (which had also taken the CPUID instruction from Intel Pentium). AMD's DX4 @ 133 MHz had the write-back cache extended to 16 kB, like that of Pentium (and it was rebranded as 5x86, to confuse the buyers).
> The 80287 (AKA 287) and 80387 (AKA 387) floating point microprocessors started to pick up some competition from Weitek 1167 and 4167 chips and Inmos Transputer chips, so Intel integrated the FPU into the CPU with the 80486 processor (I question whether this was a monopoly move on Intel's part).
I don't think it was; transistor density had simply become sufficient to integrate such a hefty chunk of circuitry on-die. Remember that earlier CPUs had even things like MMUs as separate chips, such as the Motorola 68851.
> I question whether this was a monopoly move on Intel's part
Well, I was happy about that because I no longer had to deal with switches to generate x87 code or emulate it.
Story of how the Intel-derived proposal was standardized as IEEE 754: https://people.eecs.berkeley.edu/~wkahan/ieee754status/754st...
The 2-bit-per-transistor ROM using four transistor sizes is wild. Were there other chips from this era experimenting with semi-analog storage, or was the 8087 unusually aggressive here?
Intel also used the 2-bit-per-transistor ROM in the iAPX 432, their unsuccessful "micro-mainframe" chip.
Nowadays, flash uses multiple voltage levels to store four bits per cell (QLC, Quad Level Cell), which is a similar concept.
I wrote a whole blog post about the 2-bit-per-transistor technique, back in 2018: https://www.righto.com/2018/09/two-bits-per-transistor-high-...
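A conceptual sketch of how four levels per storage element turn back into two bits at read time (thresholds and numbers are purely illustrative, not the 8087's actual sense circuit):

    #include <stdio.h>

    /* Each storage element is built to produce one of four distinct
     * analog levels (four transistor sizes in the 8087 ROM, four charge
     * levels in MLC flash).  The sense circuitry compares the level
     * against three references and so recovers two bits per element. */
    static unsigned two_bits(double sensed) {
        const double ref[3] = {0.25, 0.50, 0.75};   /* three comparator thresholds */
        unsigned bits = 0;
        for (int i = 0; i < 3; i++)
            if (sensed > ref[i])
                bits++;                             /* result is 0..3, i.e. 2 bits */
        return bits;
    }

    int main(void) {
        const double levels[4] = {0.1, 0.4, 0.6, 0.9};   /* four nominal cell levels */
        for (int i = 0; i < 4; i++)
            printf("level %.1f -> bits %u%u\n", levels[i],
                   (two_bits(levels[i]) >> 1) & 1, two_bits(levels[i]) & 1);
        return 0;
    }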
Author here for your 8087 questions...
If you happen to know... what was the reasoning behind the oddball stack architecture? It feels like Intel must have had this already designed for some other purpose so they tossed it in. I can't imagine why anyone would think this arch was a good idea.
Then again... they did try to force VLIW and APX on us so Intel has a history of "interesting" ideas about processor design.
edit: You addressed it in the article and I guess that's probably the reason but for real... what a ridiculous hand-wavy thing to do. Just assume it will be fine? If the anecdotes about Itanium/VLIW are true they committed the same sin on that project: some simulations with 50 instructions were the (claimed) basis for that fiasco. Methinks cutting AMD out of the market might have been the real reason but I have no proof for that.
Stack-based architectures have an appeal, especially for mathematics. (Think of the HP calculator.) And the explanation that they didn't have enough instruction bits also makes sense. (The co-processor uses 8086 "ESCAPE" instructions, but 5 bits get used up by the ESCAPE itself.) I think that the 8087's stack could have been implemented a lot better, but even so, there's probably a reason that hardly any other systems use a stack-based architecture. And the introduction of out-of-order execution made stacks even less practical.
To expand on this a little bit more:
x86 has a general pattern of encoding operands, the ModR/M byte(s), which gives you either two register operands, or a register and a memory operand. Intel also did this trick that uses one of the register operands for extra opcode bits, at the cost of sacrificing one of the operands.
There are 8 escape opcodes, and each of them is followed by a ModR/M byte. If you use two-address instructions, that gives you just 8 instructions you can implement... not enough to do anything useful! But if you're happy with one-address instructions, you get 64 instructions with a register operand and 64 instructions with a memory operand.
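To make the arithmetic concrete, a toy decoder of those two bytes (just an illustration of the bit fields, not a real disassembler):

    #include <stdio.h>

    /* An x87 instruction is an ESC opcode byte (0xD8-0xDF) followed by a
     * ModR/M byte laid out as mod(2):reg(3):rm(3).  The reg bits are
     * reused as extra opcode bits, so 8 escape bytes * 8 reg values = 64
     * memory-operand instructions (mod != 3) plus 64 register/stack forms
     * (mod == 3, where rm selects ST(i)). */
    static void decode(unsigned esc, unsigned modrm) {
        unsigned mod = modrm >> 6, reg = (modrm >> 3) & 7, rm = modrm & 7;
        unsigned index = (esc - 0xD8) * 8 + reg;   /* which of the 64 opcodes */
        if (mod == 3)
            printf("ESC %02X: stack form, opcode #%u, operand ST(%u)\n", esc, index, rm);
        else
            printf("ESC %02X: memory form, opcode #%u (mod=%u, rm=%u)\n", esc, index, mod, rm);
    }

    int main(void) {
        decode(0xD8, 0xC1);   /* D8 C1 = FADD ST(0), ST(1) */
        decode(0xD8, 0x06);   /* D8 /0 with a memory operand = FADD m32real */
        return 0;
    }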
A stack itself is pretty easy to compile for, until you have to spill a register because there are too many live variables on the stack. Then the spill logic becomes a nightmare. My guess is that the designers were thinking along these lines--organizing the registers as a stack is an efficient way to use the encoding space, and a fairly natural way to write expressions--and didn't have the expertise or the communication to realize that the design came with some edge cases that were painfully sharp to deal with.
> there's probably a reason that hardly any other systems use a stack-based architecture
I don't know about other backend guys, but I disliked the stack architecture because it's just incompatible with enregistering variables, register allocation by live range analysis, common subexpression elimination, etc.
There are software workarounds for some of those and very simple hardware workarounds for the others. In a stack-based architecture there should also be some directly-addressable registers for storing long-lived temporary variables. Most stack-based architectures included some set of stack shuffling operations that solved the problem of common subexpression elimination.
The real disadvantage is that the stack operations share the output operand, which introduces a resource dependency between otherwise independent operations, which prevents their concurrent execution.
There are hardware workarounds even for this, but the hardware would become much more complex, which is unlikely to be worthwhile.
The main influencer of the 8087 architecture, William Kahan, had previously worked on the firmware of the HP scientific calculators, so he was well experienced in implementing numeric algorithms by using stacks.
When writing in assembly language, the stack architecture is very convenient and it minimizes the program size. That is why most virtual machines used for implementing interpreters for programming languages have been stack-based.
The only real disadvantage of the stack architecture is that it prevents the concurrent execution of operations, because all operations have a resource dependency by sharing the stack as output location.
At the time when 8087 was designed, the possibility of implementing parallel execution of instructions in hardware was still very far in the future, so this disadvantage was dismissed.
Replacing the stack by individually addressable registers is not the only possible method for enabling concurrent execution of instructions. There are 2 alternatives that can continue to use a stack architecture.
One can have multiple operand stacks and each instruction must contain a stack number. Then the compiler assigns each chain of dependent operations to one stack and the CPU can execute in parallel as many independent chains of dependent instructions as there are stacks.
The other variant is to also have multiple operand stacks but to keep the same instruction set with only one implicit stack, while implementing simultaneous multi-threading (SMT). Each hardware thread then uses its own stack while sharing the parallel execution units, so one can execute in parallel as many instructions as there are threads. For this variant one would need many more threads than in a CPU with registers that combines superscalar execution with SMT, so one would need 8 or more SMT threads to be competitive.
insightful footnote, thanks: https://web.archive.org/web/20190301193516/http://www.drdobb...
Make no mistake, this article is of extraordinary historical significance, even the list of constants being hardwired...
Looking at the complexity and area of hardware floating point, I often wonder why we don't see more unified, combined integer+floating point units, as done in the R4200 [1], which reused most of the integer datapath while just adding a small extra 12-bit datapath for the exponent.
[1] https://en.wikipedia.org/wiki/R4200
The integer pipeline is often needed for address calculation near the same time as the floating point pipeline.
The R4200 FPU performance suffered for this reason.
I didn't expect the microcode to be at the center of the chip. I'd expect it on the side and only talking to the microcode engine, making more room for data traffic between chip halves. Also, the microcode is huge.
The microcode was so huge that they had to use a semi-analog ROM that held two bits per transistor by using four transistor sizes.
As far as the layout, the outputs from the microcode ROM are the control signals that go to all parts of the chip, so it makes sense to give it a central location. There's not a lot of communication between the upper half of the chip (the bus interface to the 8086 and memory) and the lower half of the chip (the 80-bit datapath), so it doesn't get in the way too much. That said, I've been tracing out the chip and there is a surprising amount of wiring to move signals around. The wiring in the 8087 is optimized to be as dense as possible: things like running some parallel signals in silicon and some in polysilicon because the lines can get squeezed together just a bit more that way.
I wonder, if C used Reverse-Polish notation for math operations, would compilers have been able to target the 8087 better than they did?
Nah. As others have said, translating infix to RPN is pretty easy to do. The nasty part was keeping values within registers on the stack, especially within loops. The 8087 couldn't do binary ops between two arbitrary locations on the stack; one had to be the top of the stack. This meant that if you needed to add two non-top locations, for example, you had to exchange (FXCH) one of them to the top of the stack first. As a result, optimized x87 code tended to be a mess of FXCH instructions.
Complicating this further, doing this in a loop requires that the stack state match between the start and end of the loop. This can be challenging to do with minimal FXCH instructions. I've seen compilers emit 3+ FXCH instructions in a row at the end of a loop to match the stack state, where with some hairy rearrangement it was possible to get it down to 2 or 1.
Finally, the performance characteristics of different x87 implementations varied in annoying ways. The Intel Pentium, for instance, required very heavy use of FXCH to keep the add and multiply pipelines busy. Other x87 FPUs at the time, however, were non-pipelined, some taking 4 cycles for an FADD and another 4 cycles for FXCH. This meant that rearranging x87 code for Pentium could _halve_ the speed on other CPUs.
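A C-level illustration of the scheduling issue (the C itself is ordinary; the x87 behavior described in the comments is what a Pentium-targeting compiler would do with it):

    #include <stdio.h>

    /* The two partial sums below are independent dependency chains.  A
     * Pentium-targeting compiler (or hand-written asm) would interleave
     * them on the x87 stack, using FXCH to rotate whichever chain is
     * needed into ST(0) so the pipelined FADD never sits idle.  On a
     * non-pipelined x87, those extra FXCHs are pure overhead -- which is
     * the portability problem described above. */
    double sum(const double *a, int n) {
        double s0 = 0.0, s1 = 0.0;      /* two independent accumulators */
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];                 /* chain 0 */
            s1 += a[i + 1];             /* chain 1 */
        }
        if (i < n)
            s0 += a[i];                 /* odd element, if any */
        return s0 + s1;
    }

    int main(void) {
        double a[5] = {1, 2, 3, 4, 5};
        printf("%g\n", sum(a, 5));      /* prints: 15 */
        return 0;
    }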
To the last point, I would see it the other way around. Rearranging code for the pipelined, 0-cycle-FXCH Pentium FPU sped up floating point by probably way more than 2x compared to heavily optimized code running on a K5/K6. I'm not even sure if the K6/K6-2 ever got 0-cycle FXCH; the K6-3 did, but still no FPU pipelining until the Athlon.
Quake wouldn't have happened until the Pentium II if Intel hadn't pipelined the FPU.
You're not wrong, the performance gain from proper FPU instruction scheduling on a Pentium was immense. But applications written before Quake and the Pentium gained prominence, or ones that weren't game oriented, would have needed more blended code generation. Optimizing for the highest-end CPU at the time at the cost of the lowest-end CPU wouldn't necessarily have been a good idea, unless your lowest CPU was a Pentium. (Which it was for Quake, which was a slideshow on a 486.)
K6 did have the advantage of being OOO, which reduced the importance of instruction scheduling a lot, and having good integer performance. It also had some advantage with 3DNow! starting with K6-2, for the limited software that could use it.
My compiler knowledge is limited, but I think that you end up with the same parse tree at a very early stage of processing, whether you use Reverse Polish notation or infix notation. So I don't think a language change would make a difference.
Converting to RPN is, roughly speaking, the easiest way to generate code for either register or stack architectures.
Once you have a parse tree, visiting it in post order (left tree, right tree, operation) produces the RPN.
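A tiny C sketch of exactly that post-order walk, with a made-up node type just to show the shape of it:

    #include <stdio.h>

    /* Emitting a parse tree in post order yields RPN, which maps directly
     * onto push/operate instructions for a stack machine like the 8087. */
    struct node {
        char op;                      /* '+', '*', or 0 for a leaf */
        double value;                 /* used when op == 0 */
        struct node *left, *right;
    };

    static void emit_rpn(const struct node *n) {
        if (!n) return;
        if (n->op == 0) {             /* leaf: push the operand (x87: FLD) */
            printf("%g ", n->value);
            return;
        }
        emit_rpn(n->left);            /* left subtree  */
        emit_rpn(n->right);           /* right subtree */
        printf("%c ", n->op);         /* then the operation (x87: FADD, FMUL, ...) */
    }

    int main(void) {
        /* (2 + 3) * 4 */
        struct node two = {0, 2}, three = {0, 3}, four = {0, 4};
        struct node add = {'+', 0, &two, &three};
        struct node mul = {'*', 0, &add, &four};
        emit_rpn(&mul);               /* prints: 2 3 + 4 * */
        printf("\n");
        return 0;
    }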
I don't know what the GRX field is.
The field of the instruction that selects the stack offset.
Looks like a log multiply-adder ... maybe 5 clock cycles? Also, on the microcode ... them FP divide algorithms are pretty intense.
Would be cool to hear a real designer compare it to the Weitek 1064.
I remember failing an interview with the optimization team of a large fruit-trademarked computer maker because I couldn't explain why the x87 stack was a bad design. TBF, they were looking for someone with a master's, not someone just graduating with a BS. But now I know... honestly, I'm still not 100% sure what they were looking for in an answer. I assume something about register renaming, memory, and cycle efficiency.
Having given a zillion interviews, I expect that they weren't looking for the One True Answer, but were interested in seeing if you discussed plausible reasons in an informed way, as well as seeing what areas you focused on (e.g., do you discuss compiler issues or architecture issues). Saying "I dunno" is bad, especially after hints like "what about ..." and spouting complete nonsense is also bad.
(I'm just commenting on interviews in general, and this is in no way a criticism of your response.)
I think I said something about the stack efficiency. I was a kid who barely understood out-of-order execution. Register renaming and the rest were well beyond me. It was also a long time ago, so recollections are fuzzy. But I do recall they didn't prompt anything. I suspect the only reason I got the interview is that I had done some SSE programming (AVX didn't exist yet, and to give timing context, AltiVec was discussed), and they figured if I was curious enough to do that I might not be garbage.
Edit: Jogging my memory, I believe they were explicit at the end of the interview that they were looking for a Master's candidate. They did say I was on a good path, IIRC. It wasn't a bad interview, but I was very clearly not what they were looking for.
Very cool.
It's all about that 80-bit/82-bit floating point format with the explicit mantissa bit just to be extra different. ;) Not only is it a 1:15:1:63, it's (2(tag)):1:15:1:63, whereas binary64 is 1:11:0:52. (sign : exponent [biased] : explicit leading mantissa bit stored? : mantissa remaining)
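If you want to poke at that layout from C, a quick sketch, assuming an x86 target where "long double" is the 80-bit x87 extended format (a platform detail, not something the C standard guarantees):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        long double x = 1.0L;
        unsigned char b[sizeof x];
        memcpy(b, &x, sizeof x);
        /* Little-endian layout: bytes 0-7 hold the 64-bit significand,
         * bytes 8-9 hold the sign and 15-bit biased exponent.  Bit 63 of
         * the significand is the explicit leading 1 that binary64 leaves
         * implicit. */
        printf("explicit integer bit: %d\n", (b[7] >> 7) & 1);  /* 1 for 1.0    */
        printf("sign+exponent: %02x%02x\n", b[9], b[8]);        /* 3fff for 1.0 */
        return 0;
    }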
Other pre-P5 ISA idiosyncrasies: Only the 8087 has FDISI/FNDISI, FENI/FNENI. Only the plain 287 has a functional FSETPM. Most everything else looks like a 387 ISA-wise, more or less until MMX arrived. That's all I know.
I'm curious what the CX-83D87 and Weiteks look like.
Keep up the good work!
PS: Perhaps sometime in the (near) future we might get almost 1:1 silicon "OCR" transcription of die scans to FPGA RTL with bugs and all?
> I'm curious what the CX-83D87 and Weiteks look like.
The Weiteks were memory mapped (at least those built for x86 machines).
This essentially increased bandwidth by using the address bus as a source for floating point instructions. It was really a very cool idea, although I don't know what the performance realities were when using one.
http://www.bitsavers.org/components/weitek/dataSheets/WTL-31...
This is nuts, in the best way.
The operand fields of a WTL 3167 address have been specifically designed so that a WTL 3167 address can be given as either the source or the destination to a REP MOVSD instruction.
Single-precision vector arithmetic is accomplished by applying the 80386 block move instruction REP MOVSD to a WTL 3167 address involving arithmetic instead of loading or storing.
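To make "instructions on the address bus" concrete, a purely conceptual sketch; the base address, bit-field positions, and opcode numbers are placeholders, not the actual WTL 3167 map:

    #include <stdint.h>
    #include <stdio.h>

    /* The operation and register numbers are encoded in the *address*,
     * while the data bus carries the operand, so a REP MOVSD aimed at
     * such addresses streams one FP operation per bus cycle.  All field
     * positions below are made up for illustration. */
    #define FPU_BASE 0xC0000000u   /* placeholder memory-mapped window */

    static uint32_t fpu_address(unsigned op, unsigned dst, unsigned src) {
        return FPU_BASE | (uint32_t)op << 10 | (uint32_t)dst << 5 | (uint32_t)src;
    }

    int main(void) {
        /* Hypothetical "multiply reg 3 into reg 7" cell: a 32-bit store to
         * this address would carry the operand and trigger the operation. */
        printf("MUL r7, r3 lives at 0x%08x\n", (unsigned)fpu_address(2, 7, 3));
        return 0;
    }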
haha - took me a while to figure out that's Mauro Bonomi's signature
iirc the 3167 was a single clocked, full barrel shift mac pipeline with a bunch (64?) of registers, so the FPU could be driven with a RISC-style opcode on every address bus clock (given the right driver on the CPU) ... the core registers were enough to run inner loops (think LINPACK) very fast with some housekeeping on context switch of course
this window sat between full PCB minicomputer FPUs made from TTL and the decoupling of microcomputer internal clocks & cache from address bus rates ...
Weitek tried to convert their FPU base into an integrated FPU/CPU play during the RISC wars, but lost
A significant difference in 80387 versus 80287 & 8087 was that in 387 you could no longer select the "projective" behavior for infinities (where positive and negative infinities are identical).
This feature had not been included in the IEEE standard, so it was no longer implemented.
Testing whether this feature worked or not was used by programs running on an 80386 CPU to detect whether the attached FP coprocessor was a 287 or a 387 (the hardware allowed both; the 387 was launched later than the 386, so initially a 386 had to be paired with a 287 if a hardware FPU was needed).
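In C terms the test looked roughly like this (the real detection code was a handful of x87 instructions run right after FNINIT; this portable rendering always takes the IEEE branch on anything modern):

    #include <stdio.h>

    /* Generate +infinity and -infinity and compare: under the old
     * projective model (the 8087/287 default) they are the same point and
     * compare equal; under the affine/IEEE model of the 387 and everything
     * since, they differ. */
    int main(void) {
        volatile double zero = 0.0;
        double pos_inf = 1.0 / zero;
        double neg_inf = -1.0 / zero;
        if (pos_inf == neg_inf)
            printf("infinities compare equal: projective, 287-style\n");
        else
            printf("distinct signed infinities: affine, 387/IEEE-style\n");
        return 0;
    }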
I made my Dad buy me a 387 math coprocessor when I was in college because I was taking math and physics courses but I bet none of the software I used ever even accessed that chip. It was more about the empty socket on the mobo looking out of place.
This is cool, but the renormalization and the (programmable and bidirectional) barrel shifter are of much more interest.
I had a 10 MHz XT, and ran an 8087-8 at a bit higher clock rate. I used it both for Lotus 1-2-3 and Turbo Pascal-87. It made Turbo Pascal significantly faster.
You're in luck, I wrote about the 8087's shifter back in 2020 :-) https://www.righto.com/2020/05/die-analysis-of-8087-math-cop...