Branchless programming — Why your CPU will thank you
Intro
‘Performance’… we hear it everywhere in software development nowadays. But what does it mean? We can all have different definitions of ‘performance’ in our daily workflows. At a very low level, performance is tied to the underlying computer components that we use.
Performance optimization plays a crucial role in extracting maximum efficiency from contemporary processors (CPUs). An often overlooked technique that may hold the key to substantial performance gains is branchless programming. In this article we aim to provide a solid understanding of branchless programming, exploring the techniques it uses, its practical applications, its drawbacks, and the impact it has on contemporary CPUs.
Branchless programming is a programming approach that, as its name suggests, seeks to minimize or eliminate branches, such as conditional statements (if-else) and loop conditions, which can incur performance penalties in modern processors through pipeline stalls and mispredicted branches. As CPU architectures have advanced, their instruction pipelines have grown deeper, making a mispredicted branch even more costly. The penalty occurs when a processor incorrectly predicts the outcome of a branch, which forces it to discard the work it has already done speculatively and start again on the correct path, wasting valuable processing cycles. The primary idea behind branchless programming is to mitigate these penalties by replacing hard-to-predict branches with arithmetic and bitwise operations, which modern processors execute very efficiently. By reducing the number of branches, this technique can lead to significant improvements in performance-critical code, such as in high-performance computing, real-time systems, or graphics and ray tracing.
There are several techniques used in branchless programming, including replacing conditional assignments with bitwise operations, using conditional move instructions, employing lookup tables, and utilizing arithmetic operations. Each technique has its own advantages and suits different kinds of code. By understanding and applying these techniques, software developers can extract more performance from their code without changing its behavior. It is important to note that while branchless programming can produce substantial performance benefits, it may also introduce some drawbacks. One such drawback is reduced code readability and maintainability, which can make it difficult for other programmers to understand, debug, and change the existing code. Additionally, some branchless programming techniques can introduce security risks; a lookup table indexed by secret data, for example, is exposed to cache timing attacks. As a result, it is essential for programmers to balance the performance gains against the potential drawbacks before implementing branchless programming techniques.
In this article, we will explore the various techniques used in branchless programming, providing examples in the C programming language to illustrate each concept. We will also mention some practical applications of branchless programming, focusing on the types of systems and applications where it is most beneficial. Finally, we will examine the drawbacks and limitations of branchless programming, offering some insights on when to apply these techniques and when not to. Many of the optimization techniques we are going to mention are covered in the optimization manuals published by CPU manufacturers.
The C programming language has been chosen for the given examples as it is more closely associated with performance-critical systems and low-level programming. Using C for the examples will provide a clearer illustration of the performance benefits and implications of branchless programming.
By the end of the article, we should know enough about the subject to make informed decisions about when and where to incorporate branchless programming techniques into our own code, aiming for a sensible balance between performance optimization and code readability and maintainability.
Let’s try and unlock the full performance potential of modern CPUs.
Branchless techniques
Conditional Assignments vs. Bitwise Operations
One of the most frequently-used techniques in branchless programming is replacing conditional assignments with bitwise operations.
Bitwise operations directly manipulate the bits that represent the data in a computer’s memory, making them very efficient and suitable for performance-critical code. The most common bitwise operations used in branchless programming are AND, OR, and XOR.
AND, OR, and XOR operations work on the individual bits of the operands, producing result values based on the corresponding bits of the input values. The AND operation results in a 1 for each bit position where both input bits are 1, the OR operation results in a 1 for each bit position where at least one input bit is 1, and the XOR operation results in a 1 for each bit position where the two input bits differ.
In branchless programming, bitwise operations can replace conditional assignments in order to eliminate branches and improve performance. We can look at the following example for a conditional assignment:
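A minimal sketch of such a conditional assignment, assuming two int inputs (the function name max_if_else is hypothetical and used only for illustration):

```c
/* Hypothetical helper: returns the larger of a and b using an if-else. */
int max_if_else(int a, int b) {
    int max;
    if (a > b) {
        max = a;   /* taken when a is strictly greater than b */
    } else {
        max = b;   /* taken when a <= b */
    }
    return max;
}
```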
In the example code above, we use an if-else statement to determine the maximum of two integer values, a and b. This introduces a branch that may lead to performance penalties on some CPUs when its outcome is hard to predict. Using bitwise operations we can achieve the same result without introducing any branches:
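One way to write this is sketched below under a few assumptions: int uses two's complement, right-shifting a negative int is an arithmetic shift (implementation-defined in C, although mainstream compilers behave this way), and a - b does not overflow. The name max_branchless is hypothetical:

```c
#include <limits.h>  /* CHAR_BIT */

/* Hypothetical helper: larger of a and b without an if-else.
   Assumes an arithmetic right shift and that a - b does not overflow. */
int max_branchless(int a, int b) {
    int diff = a - b;                                 /* negative when a < b */
    int mask = diff >> (sizeof(int) * CHAR_BIT - 1);  /* all 1s if diff < 0, else all 0s */
    return a - (diff & mask);                         /* a - (a - b) = b, or a - 0 = a */
}
```

Inputs whose difference overflows int (for example INT_MAX and a negative number) fall outside these assumptions and would need a wider intermediate type.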
In this branchless version, we first calculate the difference between a and b, then create a mask by shifting the difference to the right by the number of bits in an integer minus 1, so that its sign bit fills every position. The mask will have all bits set to 1 if the difference is negative (i.e., a < b), and all bits set to 0 if the difference is zero or positive (i.e., a >= b). We then AND the mask with the difference and subtract the result from a to obtain the greater value. Note that this relies on the right shift of a negative value being an arithmetic shift, which C leaves implementation-defined, and on the subtraction not overflowing.
This approach is particularly useful when the cost of branches is significant, such as in tight loops or other performance-critical code whose conditions are hard to predict. Bitwise operations are cheap and, unlike a mispredicted branch, never force the processor to flush its pipeline, so the code runs at a steadier pace without wasting cycles on discarded work.
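To illustrate the tight-loop case, here is a hedged sketch that sums only the array elements at or above a threshold. The function names and the threshold parameter are hypothetical, and two's complement integers are assumed. Whether the branchless form is actually faster depends on how predictable the data is and on what the compiler already does with the branching form, so it should be measured:

```c
#include <stddef.h>

/* Branching version: the inner if is hard to predict when the data is random. */
long sum_at_least_branchy(const int *data, size_t n, int threshold) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold) {
            sum += data[i];
        }
    }
    return sum;
}

/* Branchless version: the comparison yields 0 or 1, and negating it gives
   a mask of all 0s or all 1s. The loop condition is still a branch, but it
   is taken on every iteration except the last and is easy to predict. */
long sum_at_least_branchless(const int *data, size_t n, int threshold) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        int mask = -(data[i] >= threshold);  /* -1 (all 1s) or 0 */
        sum += data[i] & mask;               /* keeps data[i] or contributes 0 */
    }
    return sum;
}
```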
As we mentioned earlier, replacing conditional assignments with bitwise operations can sometimes result in less readable and maintainable code. The branchless version of the code may be more difficult to understand at first glance, making it harder for other developers to debug or modify the code. As such, it is essential to make clear decisions when and where to pick performance gains over code readability, and where to skip branchless programming techniques altogether.
In addition to the AND, OR, and XOR operations, other bitwise operations, such as NOT and bit shifts, can also be used in branchless programming to achieve various results. The key is to understand the underlying principles of these operations and how they can be used to eliminate or minimize branches in our code.
Conditional Move Instructions
Conditional move instructions are another powerful technique used in branchless programming. These instructions allow data to be moved conditionally without creating a branch, which can lead to significant performance improvements in modern processors.
While some assembly languages, such as x86 assembly, provide specific conditional move instructions like CMOV, the C programming language does not have a direct equivalent. However, we can mimic the behavior of conditional move instructions using a combination of bitwise operations and arithmetic.
Consider again the task of finding the maximum of two integer values a and b, which a plain if-else statement solves with a branch.
We can replace the if-else statement with a branchless equivalent that simulates a conditional move:
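A minimal sketch of both versions, assuming two's complement integers so that negating 1 yields a value with every bit set (the names max_with_branch and max_select are hypothetical):

```c
/* Reference version with a branch. */
int max_with_branch(int a, int b) {
    if (a > b) {
        return a;
    }
    return b;
}

/* Branchless select that mimics a conditional move. */
int max_select(int a, int b) {
    int mask = -(a > b);              /* all 1s if a > b, otherwise all 0s */
    return (a & mask) | (b & ~mask);  /* keeps a under the mask, b otherwise */
}
```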
In this example, we first create a mask based on the result of the comparison (a > b). If a is greater than b, the mask will be all 1s (0xFFFFFFFF), and if a is less than or equal to b, the mask will be all 0s (0x00000000). We then AND the mask with a and the negation of the mask with b, combining the results with an OR operation to obtain the maximum value.
This branchless version avoids the performance penalties associated with branching, allowing the processor to execute the code more efficiently. As with other branchless programming techniques, it can also result in less readable and maintainable code. Can we spot a pattern here? It is essential to weigh the performance benefits against these potential drawbacks and apply the technique carefully.
At this point, it’s worth mentioning that modern compilers may automatically turn certain conditional assignments into conditional move instructions when the target architecture supports them, which can produce more efficient machine code than a branch. This optimization is not guaranteed, and it can vary with the compiler, its version, and its flags. Writing the branchless form explicitly in C makes the intent clear and can help make performance more consistent, but the compiler remains free to rewrite the expression, so the generated assembly should be checked and the result measured.
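As a hedged illustration, the plain ternary form below is often enough. On x86-64, optimizing builds of GCC and Clang frequently compile a function like this to a compare followed by a cmov, but that is the compiler's decision and not a guarantee; inspecting the generated assembly (for example with the compiler's -S option) is the way to confirm it. The name max_ternary is hypothetical:

```c
/* Readable source that leaves the branch-or-cmov decision to the compiler. */
int max_ternary(int a, int b) {
    return (a > b) ? a : b;
}
```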
Lookup Tables
A lookup table is a precomputed array of values that can be indexed directly to retrieve a result, avoiding the need for any runtime execution of complex calculations or conditional logic.
Lookup tables are particularly useful for replacing switch statements or series of if-else statements, which can introduce multiple branches and negatively impact performance on modern CPUs. Replacing these branches with direct array indexing into a lookup table can significantly speed up performance-critical code.
We can look at a simple example that calculates the factorial of a small integer value, n:
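A minimal sketch of such a switch-based version, assuming inputs from 0 to 5 are supported and -1 signals anything else (the name factorial_switch is hypothetical):

```c
/* Hypothetical helper: factorial of n for 0 <= n <= 5, or -1 otherwise. */
int factorial_switch(int n) {
    switch (n) {
        case 0:  return 1;
        case 1:  return 1;
        case 2:  return 2;
        case 3:  return 6;
        case 4:  return 24;
        case 5:  return 120;
        default: return -1;  /* outside the supported range */
    }
}
```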
The above code uses a switch statement to calculate the factorial, introducing multiple branches. We can replace the switch statement with a branchless lookup table:
    int n = 5;
    int factorial;
    const int lookupTable[] = {1, 1, 2, 6, 24, 120};
    /* A negative n converts to a very large size_t, so it also fails the bounds check. */
    factorial = ((size_t)n < sizeof(lookupTable) / sizeof(lookupTable[0])) ? lookupTable[n] : -1;
In this example, we first create a lookup table as a constant array containing precomputed factorial values. We then directly index the lookup table using the input value n and assign the result to factorial. To handle invalid input, we use the ternary conditional operator to check if n is within the bounds of the lookup table before indexing it. If the input is out of bounds, the result is set to -1. Note that the bounds check is itself a conditional, which the compiler may turn into either a branch or a conditional move; only the table lookup is inherently branchless.
The branchless version eliminates the multiple branches introduced by the switch statement, allowing the processor to execute the code more efficiently. It is important to note that lookup tables trade memory for performance, as the precomputed values must be stored in memory, and a large table that does not stay in the cache can erode the gain. In some cases the trade-off is worthwhile, particularly when memory space is not a limiting factor and the performance gains are significant.
Lookup tables can be used in a wide range of scenarios, from simple value mappings to more complex functions. They are particularly useful for replacing branches in performance-critical code, such as in tight loops, real-time systems, or graphics processing. With this technique, it’s important to take memory usage into account before committing to an implementation.
Arithmetic Operations
Arithmetic operations provide another powerful tool in the branchless programming world. Cheap operations such as addition, subtraction, and multiplication (division is usually far more expensive and rarely helps here) let us write code that sidesteps the pitfalls of branching. The idea is to replace conditional logic with a mathematical expression that yields the right result for every input. These expressions are often quite ingenious, leaving even the most jaded of programmers with a sense of wonder at their elegance.
We can look at the following example where we have to determine the absolute value of an integer x:
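A minimal sketch of the branching version (the name abs_with_branch is hypothetical; like the standard abs, it is not meaningful for INT_MIN):

```c
/* Hypothetical helper: absolute value of x using a branch. */
int abs_with_branch(int x) {
    if (x < 0) {
        return -x;  /* flip the sign of negative inputs */
    }
    return x;
}
```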
The code above, while functional, suffers from the branching issue we are trying to avoid. We will now try to remove the branch by invoking the powers of arithmetic operations:
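One possible form is sketched below, assuming two's complement integers, an arithmetic right shift for negative values (implementation-defined in C), and an input other than INT_MIN, whose absolute value does not fit in an int. The name abs_branchless is hypothetical:

```c
#include <limits.h>  /* CHAR_BIT */

/* Hypothetical helper: absolute value of x without a branch.
   Assumes an arithmetic right shift and x != INT_MIN. */
int abs_branchless(int x) {
    int mask = x >> (sizeof(int) * CHAR_BIT - 1);  /* all 1s if x < 0, else all 0s */
    return (x + mask) ^ mask;                      /* for x < 0: ~(x - 1) == -x */
}
```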
In this branchless code block, we create a mask by shifting the sign bit of x to the right by the number of bits in an integer minus one. The mask will be a row of 1s (0xFFFFFFFF) if x is negative, and a row of 0s (0x00000000) if x is positive or zero. We then add the mask to x and XOR the result with the mask, producing the absolute value of x.
This transformation produces code that not only avoids the branching penalty but also offers, dare we say, a certain mathematical elegance.
Arithmetic operations can be used in a variety of scenarios, limited only by the bounds of one’s imagination and mathematical ability. The author of this article claims no particular mastery of the latter. These operations provide an alternative to branching, letting our code work with the processor’s pipeline rather than against it.
The branchless programming techniques we have examined, including replacing conditional assignments with bitwise operations, conditional move instructions, lookup tables, and arithmetic operations, can lift our code to new levels of performance. When used carefully, these techniques can help us tap the full potential of modern processors while adding a touch of coding extravaganza.
As we have seen, branchless programming is much like an art form, blending mathematics, logic, and creativity to overcome the limitations of branching.
Practical applications
Branchless programming techniques can be applied to a wide range of applications, particularly those where performance is critical. In this section, we will explore several practical applications of branchless programming, examining the benefits and challenges introduced to each use case. While the previous examples were intentionally designed to be more introductory, the following will be more focused on providing concrete, real-world examples of how branchless programming can be employed.
Performance-critical systems
In performance-critical systems, such as real-time systems and high-performance computing applications, branchless programming can provide significant performance gains by minimizing the overhead associated with branching. Examples of performance-critical systems include automotive control systems, real-time gaming, medical devices, and high-frequency trading platforms.
In these systems, minimizing latency and maximizing throughput are often top priorities, and branchless programming techniques can help achieve these goals in ways we mentioned before. Using techniques such as conditional move instructions, lookup tables, and arithmetic operations, developers can create code that is more efficient and better suited for high-performance environments.
Graphics processing
Graphics processing is another area where branchless programming can provide substantial benefits. Modern graphics processing units (GPUs) are designed for massive parallelism and are particularly sensitive to branching: groups of threads execute in lockstep, so when threads in the same group take different sides of a branch, the hardware has to run both sides one after the other. By reducing such divergent branches in graphics processing code, developers can take fuller advantage of the GPU’s parallel processing capabilities, resulting in improved performance and reduced latency.
Examples of GPU workloads where branchless programming can be applied include shader programs, geometry processing, texture filtering, procedural environment generation, and even AI/ML model training. In these applications, employing techniques such as bitwise operations and lookup tables can help minimize branching and improve overall performance.
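Shader code often replaces an if with a blend between two values. The sketch below shows that idea in plain C rather than in a shading language; step_f and mix_f are hypothetical helpers modeled on the step and mix built-ins found in shading languages, and the inputs are assumed to be finite (multiplying an infinity or NaN by zero would not give a clean selection):

```c
/* Returns 0.0f if x < edge, otherwise 1.0f. */
float step_f(float edge, float x) {
    return (float)(x >= edge);
}

/* Linear blend: returns a when t is 0 and b when t is 1. */
float mix_f(float a, float b, float t) {
    return a * (1.0f - t) + b * t;
}

/* Branchless form of: result = (x >= edge) ? b : a; */
float select_by_threshold(float a, float b, float edge, float x) {
    return mix_f(a, b, step_f(edge, x));
}
```

Both candidate values are always computed, which is the usual price of this style: it pays off when the two sides are cheap and the condition varies from one thread to the next.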
Cryptographic algorithms
Always a popular topic, cryptographic algorithms provide another practical application of branchless programming techniques. Many cryptographic algorithms require constant-time execution to prevent timing attacks, which can leak sensitive information through variations in execution time. Branchless programming can help achieve constant-time execution by eliminating branches whose execution time depends on secret data. The stakes are high: at the time of writing this article (April 2023), there is an influx of data breaches and scraping of all types of sensitive information throughout the world, some of it stored in the expectation that computers will eventually become powerful enough to crack the encryption protecting it, so that it can be used in potentially malicious ways.
By applying branchless programming techniques in cryptographic algorithms, developers can create implementations that are less susceptible to timing attacks. Some examples of cryptographic algorithms that can benefit from branchless programming include hash functions, block ciphers, and public-key cryptography algorithms.
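A common illustration is comparing two secret byte strings, such as authentication tags, without returning early at the first mismatch. The sketch below is hypothetical (ct_equal is not taken from any particular library) and is meant to show the idea only; a compiler may still transform such code, so production systems should rely on a vetted cryptographic library:

```c
#include <stddef.h>

/* Returns 1 if the two buffers hold the same len bytes, 0 otherwise.
   The loop always visits every byte, so its running time does not depend
   on where the first difference occurs. */
int ct_equal(const unsigned char *a, const unsigned char *b, size_t len) {
    unsigned char diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (unsigned char)(a[i] ^ b[i]);  /* stays 0 only if all bytes match */
    }
    /* Map diff == 0 to 1 and any nonzero diff to 0 without a branch:
       0 - 1 wraps to all 1s, while 1..255 minus 1 stays below 256. */
    return (int)((((unsigned int)diff - 1u) >> 8) & 1u);
}
```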
Compression algorithms
Compression algorithms are another area where branchless programming can provide performance improvements. Compression algorithms usually involve complex operations and decision-making processes, which can lead to numerous branches and potential performance bottlenecks. With branchless programming, we can create more efficient compression algorithms that could yield faster compression and decompression times.
Compression algorithms where branchless programming can be applied include data compression algorithms like LZ77, LZ78, and Huffman coding, as well as image compression algorithms like JPEG and PNG. In these applications, techniques such as conditional move instructions, lookup tables, and arithmetic operations can help reduce branching and improve overall performance.
Networking and communication protocols
Networking and communication protocols often require high throughput and low latency to ensure smooth and efficient data transfer between devices. Protocol implementations that use branchless programming to minimize the overhead associated with branching can achieve overall performance gains.
Networking transport layer protocols like TCP and UDP are just some of the examples where branchless programming can help. In these applications, using techniques such as bitwise operations, lookup tables, and arithmetic operations can help reduce branching and improve the efficiency of protocol processing.
Database Systems
Database systems such as SQLite can use branchless programming techniques to optimize query processing, sorting, and indexing operations. These techniques help reduce pipeline stalls caused by mispredicted branches, enabling faster access to and retrieval of data.
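Index lookups are a typical case, because the comparison inside a binary search is close to a coin flip for the branch predictor. The following is a hedged sketch of a branchless lower-bound search over a sorted int array; the name lower_bound_branchless is hypothetical, it is not taken from any particular database, and whether it beats a conventional binary search depends on the data size and the hardware:

```c
#include <stddef.h>

/* Returns the index of the first element that is >= key in a sorted array,
   or n if every element is smaller than key. */
size_t lower_bound_branchless(const int *data, size_t n, int key) {
    const int *base = data;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* The comparison yields 0 or 1, so base advances by 0 or by half. */
        base += (size_t)(base[half - 1] < key) * half;
        len -= half;
    }
    /* One final comparison decides between the last candidate and the slot after it. */
    return (size_t)(base - data) + (size_t)(n > 0 && *base < key);
}
```

The loop runs a fixed number of iterations for a given n, and the only data-dependent decision is folded into a multiplication.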
Embedded Systems
Branchless programming is often employed in the development of embedded systems, where power consumption and resource constraints are critical factors. With the reduction of the number of branch instructions, developers can create more efficient code that reduces power consumption and improves battery life in devices like smartphones and IoT sensors.
Branchless programming has a broad range of practical applications, particularly in performance-critical environments where it can help mitigate the cost of branching. With it, developers can create more efficient code that takes fuller advantage of modern processor architectures and delivers improved performance in various applications. Like everything, it is essential to balance the performance gains with the potential drawbacks. When applied appropriately, branchless programming can be a powerful tool for optimizing code and unlocking the full potential of contemporary CPUs.
Drawbacks and Limitations
While we covered some of the performance benefits of what branchless programming could offer, it is worth noting that this comes with a set of potential drawbacks and limitations. In this section, we will explore some of the potential issues that can come when using branchless programming techniques, as well as situations where these techniques may not be the optimal choice.
Code readability and maintainability
OK, we have already mentioned this one a few times throughout this article, but the simple fact is that this is one of the most notable drawbacks of branchless programming: it can lead to code that is more difficult to read and maintain. Techniques such as bitwise operations, lookup tables, and arithmetic operations can create code that is less intuitive and harder to understand compared to traditional branching code. This can make the code more challenging to debug, modify, or extend, potentially leading to a higher risk of introducing bugs or making it harder for other developers to work with the code.
When using branchless programming techniques, it is essential to balance the performance gains with the potential impact on code readability and maintainability. In some cases, it may be more appropriate to use traditional branching code for the sake of clarity, particularly when the performance gains are not significant, when the code is not performance-critical, or when working in teams whose members have very different levels of experience.
Compiler optimizations
Modern compilers are more than capable of applying various optimizations to the generated code, including automatically converting some branches to branchless equivalents. Relying on compiler optimizations can sometimes result in more efficient code without the need for manual branchless programming. However, these optimizations are not always consistent or guaranteed, and explicitly implementing branchless techniques can help where the compiler does not apply them on its own.
In some cases, branchless programming techniques may conflict with compiler optimizations or even result in less efficient code. It is essential to test and profile the code to ensure that the branchless techniques used are genuinely providing the expected performance improvements. Remember, when performance is involved, always measure. Don’t optimize code before measuring it.
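A rough way to follow that advice is sketched below using only the standard clock function. The names sum_fn, sum_all, and time_runs are hypothetical, and sum_all merely stands in for whichever branching or branchless variant is being compared. The figures are only indicative: a serious measurement would use realistic data, many repetitions, optimized builds, and ideally a profiler or hardware performance counters:

```c
#include <stddef.h>
#include <time.h>

/* Signature shared by the variants being compared. */
typedef long (*sum_fn)(const int *data, size_t n);

/* Stand-in workload: replace with the branching or branchless variant under test. */
long sum_all(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Calls fn repeatedly and returns the elapsed CPU time in seconds.
   The accumulated result is handed back through result_out, which helps
   keep the compiler from discarding the work being timed. */
double time_runs(sum_fn fn, const int *data, size_t n, int repeats, long *result_out) {
    long total = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        total += fn(data, n);
    }
    clock_t end = clock();
    *result_out = total;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Timing each variant with the same data and comparing both the returned results and the elapsed times gives a first indication of whether the branchless version is worth keeping.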
Processor-specific performance characteristics
This matters most in embedded programming, where each piece of hardware calls for its own performance tuning to squeeze out as much as possible. Branchless programming techniques are often designed to take advantage of specific performance characteristics of modern processors, such as pipelining and branch prediction, as we mentioned. Not all processors share these characteristics, and a branchless technique that works well on one processor may be less effective on another, or may even be detrimental.
When using branchless programming techniques, it is essential to consider the target processor’s performance characteristics and ensure that the techniques used are appropriate for that processor. In some cases, it may be necessary to adapt or modify the branchless techniques to suit the specific performance characteristics of the target processor. Many processor manufacturers provide manuals for their products for exactly these scenarios, where every processor cycle matters. One example is the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Trade-offs between performance and resource usage
Some branchless programming techniques involve trade-offs between performance and resource usage, such as memory or extra computation. For example, lookup tables can provide significant performance improvements by eliminating branches but require additional memory to store the precomputed values. Similarly, arithmetic operations can reduce branching but add instructions, since the work for both outcomes is performed every time.
When using branchless programming techniques, it is essential to weigh the performance gains against the potential resource usage implications. In some cases, the trade-offs may not be worth the performance benefits, particularly when resources are limited or the performance gains are not substantial.
Security Concerns
While branchless programming techniques can offer significant performance improvements, they may also introduce security risks if not implemented correctly, particularly when used in cryptographic algorithms. One of the primary concerns is the potential for side-channel attacks. Side-channel attacks exploit information leaked through the physical properties of a system, such as timing, power consumption, or electromagnetic radiation, and focus on specific ways in which algorithms are implemented. In cryptographic implementations, a constant-time execution is important for minimizing the risk of such attacks. Although some branchless techniques can help achieve constant-time execution, incorrect implementation can still lead to information leakage.
To address the security risks associated with branchless programming techniques, developers can take the following steps:
- Thoroughly understand the potential side-channel vulnerabilities of their specific use case and the branchless techniques they plan to use
- Follow best practices and guidelines for implementing constant-time cryptographic algorithms, ensuring that execution time is independent of secret input data
- Test and validate the implementation for potential side-channel vulnerabilities using tools like static code analyzers, dynamic analysis, and fuzz testing
- Keep up to date with the latest research, security vulnerabilities, and recommended mitigation strategies in the field
By being aware of these security concerns and following the best practices above, developers can reduce the risks associated with branchless programming techniques while still benefiting from their performance improvements.
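As one illustration of the constant-time guidance above: a lookup table indexed directly by a secret value can reveal that value through cache timing, because only the accessed entry is pulled into the cache. A common mitigation, sketched here with the hypothetical name ct_lookup, reads every entry and keeps only the wanted one with a mask. It assumes size_t is at least as wide as int, and it trades speed for a memory access pattern that does not depend on the secret:

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>
#include <stdint.h>

/* Returns table[secret_index] while touching every entry of the table.
   An index outside the table yields 0. */
uint32_t ct_lookup(const uint32_t *table, size_t len, size_t secret_index) {
    uint32_t result = 0;
    for (size_t i = 0; i < len; i++) {
        size_t x = i ^ secret_index;  /* 0 only when i == secret_index */
        /* The top bit of (x | -x) is set exactly when x is nonzero. */
        size_t nonzero = (x | ((size_t)0 - x)) >> (sizeof(size_t) * CHAR_BIT - 1);
        uint32_t mask = (uint32_t)nonzero - 1u;  /* all 1s on a match, else 0 */
        result |= table[i] & mask;
    }
    return result;
}
```

The cost grows with the table size, so this approach suits small tables; it is a sketch of the idea, not a substitute for a reviewed cryptographic implementation.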
Limited applicability
While branchless programming techniques can be applied to a wide range of scenarios, there are some situations where these techniques may not be suitable or may provide limited benefits. For example, some algorithms or code structures may inherently require complex branching logic that cannot be easily replaced with branchless equivalents.
In such cases, it is essential to recognize the limitations of branchless programming and consider alternative optimization strategies, such as algorithmic improvements, parallelization, or hardware acceleration.
Branchless programming offers numerous performance benefits, but it is essential to be aware of the potential drawbacks and limitations associated with these techniques. By carefully considering the trade-offs and applicability of branchless programming, we as developers can make informed decisions about when and how to use these techniques to optimize our code effectively.
Final Note
Throughout this article, we have delved into the world of branchless programming, exploring its underlying principles, techniques, some practical applications, and potential drawbacks. Alongside branchless programming, it’s also important to consider other optimization techniques, such as micro-optimizations, which focus on making small, incremental improvements to the code’s performance.
Micro-optimizations can complement branchless programming techniques, helping developers to further refine their code and achieve even better performance gains. Combining branchless programming with micro-optimizations can enable developers to take a more holistic approach to code optimization, addressing various aspects of performance improvement.
At this point, it is worth looking back on some of the key takeaways and considering how we, as developers, can apply this knowledge to our own projects and regular workflows as we continue to explore the benefits of branchless programming. Branchless programming offers a powerful set of techniques that can help us unlock the full potential of modern processors, leading to significant performance improvements in various applications. From performance-critical systems and graphics processing to cryptographic algorithms and networking protocols, the practical applications of branchless programming are vast and varied. As with everything, it is crucial to recognize that this type of programming is not a one-size-fits-all solution. We have already seen that there are potential drawbacks and limitations to consider, such as the impact on code readability and maintainability, conflicts with compiler optimizations, processor-specific performance characteristics, and trade-offs between performance and resource usage. It is essential to approach branchless programming with a critical eye, carefully considering the specific context and requirements of each project to determine whether the benefits of these techniques outweigh the potential drawbacks.
As developers, our journey with branchless programming should not end with this article. There’s no shortage of processor architectures and optimization techniques to explore, the whole landscape is constantly evolving, and developers who work on performance-critical code need to stay up to date with the latest developments in the field. By continuing to learn about new techniques, profiling tools, and best practices, we can ensure that we are well equipped to use the power of branchless programming in our own projects. In addition, it is crucial to cultivate a culture of collaboration and knowledge-sharing within the developer community itself. Only by discussing our experiences with branchless programming, sharing insights, and providing constructive feedback can we collectively see what is possible with these techniques and continue to drive innovation in the field.
If this article has sparked your interest in branchless programming and inspired you to delve deeper into the subject matter, feel free to leave a comment and feedback. With the insights gained from this exploration, you are now a little better equipped to tackle performance challenges in your projects and to write optimized, efficient, and high-performing code.