Modern processors are clocked: Every operation takes some integral number of clock cycles. The designers of the processor determine the length of a clock cycle. There are two considerations here: One, the speed of the hardware, for example measured as the delay of a single NAND-gate. This depends on the technology used and on tradeoffs like speed vs. power usage, and it is independent of the processor design. Two, the designers decide that the length of a clock cycle equals n delays of a single NAND-gate, where n might be 10, or 30, or any other value.
This choice of n limits how complex an operation can be and still be processed in one cycle. There will be operations that can be done in 16 but not in 15 NAND delays. So choosing n = 16 means such an operation can be done in one cycle, while choosing n = 15 means it can't.
The designers will choose n so that many important operations can just about be done in one, or maybe two or three, cycles. n will be chosen to be locally optimal: If you replaced n with n-1, most operations would be a bit faster, but some (those that really need the full n NAND delays) would suddenly need an extra cycle and become slower. If only a few operations slowed down, so that overall program execution got faster on average, then you would have picked n-1 in the first place. You could also have picked n+1: that makes most operations a bit slower, but if there are many operations that can't be done within n delays yet can be done within n+1 delays, it would make the processor faster overall.
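To make that tradeoff concrete, here is a small Python sketch. The latencies and instruction frequencies are made-up numbers chosen purely for illustration, not data from any real processor; an operation that doesn't fit into one cycle is simply charged a whole second cycle.

```python
import math

# (name, raw latency in NAND delays, relative frequency in a typical program)
# All numbers below are invented for illustration.
ops = [
    ("AND/OR/XOR", 6,  0.15),
    ("add/sub",    16, 0.40),
    ("shift",      10, 0.20),
    ("load path",  30, 0.25),
]

def avg_time(n):
    """Average time per operation, in NAND delays, for cycle length n."""
    total = 0.0
    for name, delay, freq in ops:
        cycles = math.ceil(delay / n)   # an operation always occupies whole cycles
        total += freq * cycles * n
    return total

for n in (14, 15, 16, 17):
    print(f"n = {n}: average {avg_time(n):.2f} NAND delays per operation")
```

With these invented numbers, n = 16 gives the lowest average (20.00 NAND delays per operation), while both n = 15 (24.75, because add/sub now needs two cycles) and n = 17 (21.25, because everything else got slower) come out worse: exactly the "locally optimal" situation described above.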
Now your question: Add and subtract are such common operations that you want to be able to execute them in a single cycle. As a result, it doesn't matter that AND, OR etc. could execute faster: they still need that one cycle. Of course the unit "calculating" AND, OR etc. has a lot of time to twiddle its thumbs, but that can't be helped.
Note that it's not just a question of whether an operation can be done within n NAND delays or not: An addition, for example, can be made faster by being a bit clever, faster still by being very clever, and a bit faster again by investing extraordinary amounts of hardware; finally, a processor can use a mixture of very fast, very expensive circuits and somewhat slower, cheaper ones. So there is the possibility of making one operation just about fast enough by spending more money on it.
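The comments below link to carry-lookahead adders; to see where the delay of an addition comes from in the first place, here is a bit-level Python model of the naive ripple-carry approach (a sketch for illustration only, not how real hardware is specified). The carry has to travel through every bit position, so the gate depth grows with the word width, whereas AND/OR treat every bit independently at constant depth.

```python
def ripple_carry_add(a, b, width=32):
    """Bit-level model of a ripple-carry adder (illustration only)."""
    result, carry = 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i            # sum bit i uses the incoming carry
        carry = (ai & bi) | (carry & (ai ^ bi))     # carry into bit position i+1
    return result                                   # carry out of the top bit is dropped

assert ripple_carry_add(12345, 67890) == 12345 + 67890
```

A carry-lookahead adder replaces that carry chain with a tree of generate/propagate terms, reducing the depth to roughly the logarithm of the width at the cost of many more gates: exactly the "make one operation just about fast enough by spending more on it" tradeoff.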
Now you could make the clock speed so high / the cycle so short that only the simple bit operations execute in one cycle and everything else in two or more. That would most likely slow the processor down. For operations that take two cycles, there is usually overhead for moving an incomplete instruction from one cycle to the next, so two cycles don't give you twice as much time for execution. So to do the addition in two cycles, you couldn't double the clock speed.
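As a rough back-of-the-envelope sketch (all numbers here are assumptions, not measurements of any real design): if an addition needs 16 NAND delays and handing a partial result across a cycle boundary costs 3 more, then splitting it over two cycles does not let you halve the cycle length.

```python
# Illustrative numbers only: ADD_DELAY and LATCH_OVERHEAD are assumptions.
ADD_DELAY = 16      # raw latency of an addition, in NAND delays
LATCH_OVERHEAD = 3  # cost of storing the partial result at the cycle boundary

one_cycle = ADD_DELAY                         # cycle must be >= 16 NAND delays
two_cycle = (ADD_DELAY + LATCH_OVERHEAD) / 2  # each cycle must be >= 9.5, not 8

print(f"single-cycle add: cycle length >= {one_cycle} NAND delays")
print(f"two-cycle add:    cycle length >= {two_cycle} NAND delays, "
      f"total {2 * two_cycle} NAND delays per add")
```

So with these numbers the clock could only be raised by about 16/9.5 ≈ 1.7x, not 2x, and every addition would now occupy two cycles, 19 NAND delays in total instead of 16.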
https://en.wikipedia.org/wiki/Carry-lookahead_adder also https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/lookahead.html – slebetman – 2017-05-24T05:22:26.530

fwiw, multiplication isn't much slower – Octopus – 2017-05-27T20:29:08.970

Yep, multiplication was pretty fast in my tests too. It was only around 2x slower than addition, while division was around 30x (!) slower. – Teodor Dyakov – 2017-05-27T20:45:35.870

Compact overview of state-of-the-art parallel prefix tree adders: "A Taxonomy of Parallel Prefix Networks" by David Harris: http://pages.hmc.edu/harris/research/taxonomy.pdf – Franki – 2018-07-18T12:47:33.807

More elaborate: Jun Chen's doctoral thesis "Parallel-prefix structures for binary and modulo {2^n−1, 2^n, 2^n+1} adders": http://digital.library.okstate.edu/etd/Chen_okstate_0664D_10070.pdf – Franki – 2018-07-18T12:51:49.907