Welcome to Coders' Corner

(A little collection of optimisation/coding techniques, hints and tricks
plus a few free-to-use C/ARM routines that I've written over the years)


Coding techniques for improved performance

It should already be understood by anybody interested enough to be reading this page, that the greatest performance gains are obtained by using a suitable algorithm. This is not a discussion of algorithms, but rather matters of implementation and coding style that can impact upon the performance of any algorithm/program.

Many of the optimisation techniques described here are applicable to all block-structured, sequential programming languages. ISO C is used as the example language, and where language-specific optimisations are listed, they apply to ISO C, although analogous optimisations may apply to code in similar languages.

Avoiding redudant computations

This should sound obvious, but the first rule of optimisation is don't do any avoidable work, yet a recurring observation over the years is that code is written as comes naturally to the programmer, but then not read and re-expressed for efficient execution. In my experience a lot of programmers working in high-level languages place too much trust in the compiler to perform optimisations that it cannot legally apply without breaking the rules of the language, and potentially changing the behaviour of the program.

A common example would be the repeated invocation of a function in the C language when the C compiler cannot see the definition of the function. Since C has no way of declaring a true function, ie. a sequence of instructions whose result depends only upon the inputs to the function and which has no side-effects visible externally, the compiler must assume that any function whose definition cannot see may modify or depend upon any global variable, and return a different result each time it is presented with the same inputs. Furthermore, any global variable referenced within the calling function before the function invocation must be assumed to have changed after its invocation.

What this means is that compilers often have a lot less scope for optimisation than many programmers appear to think, and consideration should always be given to where within the body of your function you compute the intermediates that your code requires.

  int caller(int n, int a, int b)
  {
    int ret, i;
    for(i = 0; i < n; i++)
    {
      int c = called_func(a, b);
      ..
      c += ...
      ret ^= c;
    }
    return ret;
  }

In the above code sketch, if the definition (body) of 'called_func' is invisible to the compiler then it must - in the interests of safety - assume that 'c' varies on each iteration and must invoke the function 'n' times. If in fact the programmer knows that the function will always return the same result for the given input arguments 'a' and 'b', the value of 'c' should be computed just once, outside the loop. The computation is being performed in the wrong scope.

Naturally, it is preferable to consider the most efficient scope for each computation when first writing the code, rather than later returning to the code to re-read, modify and re-test it, especially since this may never happen.

Strength reduction

Continuing on from the above discussion about avoiding repeated computation of loop-invariant values, sometimes it is possible to reduce the cost of even those variables that do vary from one iteration to the next, by deriving its value for the current iteration from its previous value. This technique is called 'strength reduction' because it is replacing one expensive operation per iteration with a less expensive one. Note, however, that if the compiler is capable of parallelising code for a vector processor, for example, that re-expressing the code in this manner could make it harder/impossible for the compiler to parallelise.

Here is a simple illustration of this idea is the replacement of multiplication with addition:

    uint64_t *dp = ...
    int b = ...
    int a = ...

    for(i = 0; i < n; i++)
    {
      int c = a * i + b;
      ....

      dp[i] = c;
    }

becomes:

    uint64_t *dp = ...
    int b = ...
    int a = ...
    int c = b;
    for(i = 0; i < n; i++)
    {
      ...
      *dp++ = c;
      c += a;
    }

The above example also illustrates a change to the writing out of results via the pointer 'dp' which on some compilers/architectures can be beneficial, and employs the same idea, of re-using pointer arithmetic from the previous iteration. In truth for most modern architectures there will be little observable difference, if any, for these two examples, and many modern compilers will be capable of re-expressing such simple code.

A more realistic, more complex example is the replacement of division - typically still an expensive operation - with either multiplication or addition, but care must be taken to ensure that either (i) sufficient precision is available in the intermediates to ensure that numerically identical results are obtained, or that (ii) the precise results is not critical to the algorithm and small deviations may be tolerated:

   int a = ;  // loop-invariant, unknown
   for(i = 0; i < n; i++)
   {
     int c = i / a;  // numerator is known
     int b = ...;    // loop-varying, unknown
     int d = b / a;
   }

becomes:

   int a = ;  // loop-invariant, unknown
   int recip_a = (1 << SH) / a;  // loop-invariant, unknown
   int tc = 0;
   for(i = 0; i < n; i++)
   {
     int c = tc >> SH;  // division has been replaced by addition and shift
     int b = ...;       // loop-varying, unknown
     ...
     int d = (b * recip_a) >> SH;  // div replaced by multiplication and shift

     tc += recip_a;
   }

   Note: this example, in the interests of clarity, ignores rounding issues,
         but in practise these should be considered both in the initial
         calculation of 'recip_a' and in the shift right operations calculating
         'c' and 'd'

If/else ladders

A common consequence of many engineers working on a single codebase is that a lot of copy-paste editing can occur because it is often quicker and easier to mimic than to write from scratch, which is fair enough. However, it can result in very inefficient code. Take the case of a function which accepts a string input and is required to interpret this string as a command and perform the appropriate action.

   int process_command(const char *cmd, command_args_t *args)
   {
     int ret = -1;  // invalid command

     if (!strcmp(cmd, "open"))
       ret = exec_open(args);
     if (!strcmp(cmd, "close"))
       ret = exec_open(args);
     ...

     return ret;
   }

I have encountered commercial code like this where the input string is matched against each command in turn and the appropriate action performed. Engineers arriving at this code and needing to implement another command will often copy an existing example, which over time results in a long 'ladder' of 'if statements. However, notice what happens when a match has been found; execution proceeds to test the input string against all of the recognised commands too. None will match, since a match has been found, but this is just wasting processing time. Giving consideration to whether the conditions are mutually exclusive will yield more efficient code very easily:

     if (!strcmp(cmd, "open"))
       ret = exec_open(args);
     else if (!strcmp(cmd, "close"))
       ret = exec_close(args);
     else if (!strcmp(cmd, ...

Whilst this is a simplified version of occurrences that I have encountered, for the purposes of illustration, it is realistic. When the number of cases increases beyond a certain point, and all of the cases are mutually exclusive, it becomes worth considering an altogether different approach, such as using a hash table in the case of the above example.

Use memcpy instead of strcpy

Another example of redundant calculations which is particularly common in ISO C source relates to the handling of strings. ISO C does not handle strings terribly well, treating them as NUL-terminated arrays of characters, with no information stored about the length of the string.

Poorly-written code often performs redundant traversals of such strings, operations which are themselves inherently inefficient since the processor cannot read far ahead of the character currently being considered for fear of incorrectly triggering a memory exception. It is therefore forced to sequentially execute a short sequence of each instruction, operating on a very small unit (very often just a single character) each iteration; very inefficient.

However, often the impact of this treatment of strings upon performance may be mitigated simply by using the available information:

  Before:   int len = strlen(src);
            dest = malloc(len + 1);  // include NUL terminator
            if (dest)
              strcpy(dest, src);

  After:    int len = strlen(src);
            dest = malloc(len + 1);
            if (dest)
              memcpy(dest, src, len + 1);

In the second case, memcpy is permitted to assume that there is no overlap between the destination and source buffers and furthermore it knows how much work it is being asked to do, and may thus perform larger transfers, reading data well ahead of the data currently being written into the destination buffer. For strings of non-negligible length this will almost certainly prove to be considerably faster, and it really doesn't take any longer to write the code in the second format in the first place.

Loop unrolling

When an operation must be performed repeatedly within a loop, but the operation itself is comparatively trivial, the overhead of performing the 'completion test' of the loop becomes more significant and it may become possible to achieve an appreciable speed increase by reducing the number of such tests and the number of branches from loop end to loop start. Take the case of a very simple pixel plotting function.

  void plot_line_8_32(uint32_t *dp, const uint8_t *sp, int n)
  {
    while (n-- > 0)
      *dp++ = palette[*sp++];
  }

Note that the above code exhibits a similar performance problem to the traversal of NUL-terminated ISO C strings; it becomes impossible to do much work before we must again test for completion. In this case, however, the constraint is artificial, because we have been furnished with a pixel count 'n', ie. we are implementing a function more analogous to 'memcpy' than 'strcpy.' By simply testing up-front whether we may process a larger number of pixels together as a block, we arrive at something like:

  void plot_line_8_32(uint32_t *dp, const uint8_t *sp, int n)
  {
    // process the first 4 * floor(n/4) pixels
    while (n >= 4)
    {
      n -= 4;
      dp[0] = palette[sp[0]];
      dp[1] = palette[sp[1]];
      dp[2] = palette[sp[2]];
      dp[3] = palette[sp[3]];
      sp += 4;
      dp += 4;
    }
    // complete the final 0..3 pixels
    while (n-- > 0)
      *dp++ = palette[*sp++];
  }

For a simple case like that illustrated above, it is perfectly possible that a modern compiler will assist the programmer by performing loop unrolling automatically, in which case it may be desirable to avoid polluting the source code; after all, we now have five locations instead of one that any change to the pixel processing must be made. However, for more complex code it may be impossible for the compiler to know/safely assume that unrolling can be performed, in which case manual unrolling will be the only way to achieve the performance gain.

There is a tradeoff here, of course. As the code becomes more complex, and the compiler's chances of unrolling it automatically reduce, so does the likely performance gain from unrolling it in the first place, as the overhead of completion testing and branching becomes less significant.

Use of sentinels to speed up inner loops

In the above example, the performance cost of loop completion testing was discussed and unrolling was described as a way to reduce this overhead. Another useful technique for reducing the cost of multi-clause completion tests is to subsume one clause under another by the introduction of a 'sentinel' which is a special value/condition that ensures the loop does not perform too many iterations ('run on too far').

...TBD...


Coding tricks


Bit counting

Extracting the least significant set bit

Given an arbitary number 'x' we can find the least significant bit that is set (has a value of '1') by using a simply AND operation and noting the relationship of 'x-1' to 'x' in binary representation. We note that in 'x' all of the bits to the right of the desired bit will be '0' by definition, and that in 'x-1' they will be '1':

       x: abcdefg100000000
     x-1: abcdefg011111111
  ~(x-1): ABCDEFG100000000

  using A to denote ~a

Further, as illustrated above, we may invert only the bits to the left of this desired result bit by computing ~(x-1). A simple AND operation with the original input 'x' then yields the result.

         x: abcdefg10000000
       x-1: abcdefg01111111
    ~(x-1): ABCDEFG10000000
  x&~(x-1): 000000010000000

  uint32_t lsb_set(uint32_t x)
  {
    return x & ~(x-1);
  }

Thus we arrive at the above very simple function, and we may process all of the set bits within a variable by iteratively invoking it and then masking out the bit just processed:

   uint32_t n;
   while (n)
   {
     uint32_t lsb = lsb_set(n);
     ...
     n &= ~lsb;
   }

Testing whether a number is a power of two

Employing a very similar technique to that used above, we can test whether a non-zero number is a power of two by testing the bitwise AND of a number 'x' with its immediate predecessor, 'x-1'.

  bool power_of_two(uint32_t x)
  {
    return !(x & (x-1));
  }

We note that for all of the bits to the right of the least significant bit that is set, the result will be zero, and similarly for that bit itself. To the left, ie. the more significant bits, the bits of 'x-1' are unchanged from the corresponding bits of 'x', and thus from the bits of the result.

       x: abcdefg100000000
     x-1: abcdefg011111111
 x&(x-1): abcdefg000000000

So if, and only if, all of those bits are zero, the entire result will be zero. Thus we may detect any input 'x' having at most a single non-zero bit, ie. powers-of-two and zero. The latter may be excluded by employing a simple test.

  bool power_of_two(uint32_t x)
  {
    return x && !(x & (x-1));
  }

Free-for-use routines

All of the code below is placed in the public domain and may be used freely for any purpose, including commercial uses. Please retain the copyright message within the source, but there is no requirement to display any message within the program interface/output. It is only hoped that they may be of use to a few people.

ARM assembler routines

Implementations of ISO C memory/string functions, optimised for the XScale

XScale-optimised implementations of the standard ISO C memory and string functions (memcpy, memmove, memset, strcpy and strlen). These were written for use within Aemulor, developed from scratch for better performance and because Aemulor is unable, for technical reasons to use the standard implementations presented in the SharedCLibrary module.


Copyright © Adrian Lees 2015