The UTCI algorithm (in Python) is shown below:
def calculateUTCI(Ta, Pa): depth = 6 for n in range(0, depth + 1): for k in range(depth - n + 1, 0, -1): yield (depth - n - k + 1) * Ta**n * Pa**k
and is equivalent to the polynomial for calculating UTCI.
In a previous post I profiled the running time of various implementations, and found the the algorithm above was massively faster than the current implementation, but significantly slower than simply evaluating the polynomial. However, in the interests of readability and maintainability, I feel that the algorithm still has some merit. So the question is, is it possible to optimise this algorithm to perform closer to the speed of the polynomial calculation?
I hypothesise that the nested FOR loops are the bottleneck, due to the interpreter overhead during a FOR loop. Python has a number of possibilities to explore here, including list comprehensions, generators and map. As reference, I point to the Python Wiki and an interesting article written by Python’s creator, Guido von Rossum.
Alternatives to the for loop
Case 1: Baseline
The algorithm above will be used as the baseline in this experiment.
Case 2: List Comprehension
A list comprehension in Python is really just syntactic sugar. It is a more concise way to express a loop across a list. List comprehensions are prevalent in functional programming languages and have been adopted (and specifically optimised) in Python.
Case 3: Generator Expression
The beauty of a generator function is that you avoid the need to evaluate all of the terms at once, instead iterating through the sequence and only doing work when you absolutely need to. I would point out that the above algorithm is in fact a generator function. Case 3 (as well as Case 4) leverage the knowledge that the sequence of the variables n, k is known and constant.
Case 4: Map
Map applies a function to all elements in a list or sequence. It avoids the interpreter overhead of the for loop and moves the looping construct right into C code. Surely this must be faster?! We shall see.
Case 5: Numpy array
Numpy is a numerical package for Python. This Wiki Page suggests that Numpy is used for intensive numerical work. In this experiment we compare the standard library’s for loop to Numpy’s ability to apply an operation across all elements in an array.
Case 6: Polynomial Calculation
In our previous experiment, this was the fastest method for calculating UTCI. Here, it will be the benchmark from which to compare the other run times.
Run time comparisons
The UTCI algorithm was modified accordingly for each case (for, list comprehension, generator, map and a numpy array) and the running times compared to the baseline (for loop algorithm shown above) and the benchmark (evaluation of the polynomial). The results are shown below:
Observation of Case 1 and Case 2 show almost identical running times, which agree with the previously observed running time for the UTCI algorithm (with for loop). This seems to confirm that a list comprehension is simply a more concise syntax for a for loop in Python. A list comprehension is particularly powerful when the resulting list is required elsewhere in your program, and is faster than an equivalent for loop which grows a list by appending to it. However in our case, we do not require persistent in-memory storage of the calculations.
Case 3 (generator) is interesting. The generator expression is significantly faster than the baseline, and has the advantage of not building a full list (c.f. list comprehension). Thus, the running time and memory cost are both favourable.
Case 4 (map) is surprising. While faster than the baseline, it did not perform as well as Case 3. This would seem to highlight the cost of function calls in Python. This is explained in von Rossum’s article: “Try to use map(), filter() or reduce() to replace an explicit for loop, but only if you can use a built-in function: map with a built-in function beats for loop, but a for loop with in-line code beats map with a lambda function!”. I had hoped this to be faster than observed.
Case 5 (numpy array). This result is truly incredible, performing twice as fast as the benchmark! Conceptually, this approach is very different to the other cases and the performance benefits are enormous. The previous cases iterate over the inputs individually. In this example, all the inputs are stored in an array, and numpy performs the UTCI calculation across the array. The results here seem to indicate that Numpy can perform this step much much faster than any equivalent in python’s standard library. However, this does come at the cost of holding all inputs in memory. There may be potential here, but careful planning would be needed to aggregate the climate data sensibly to maximise the performance benefits of numpy’s arrays, while keeping the memory cost acceptable.
Case 6 (benchmark): interestingly, this performed much much faster in this experiment compared to the previous experiment. In the previous experiment, I spread the 21 terms across 21 lines to maintain clarity – however in this one, I wanted to save some lines so I only spread them across 7 lines. This seems to have a significant impact on the interpreter’s speed and must be an observation unique to Python (c.f. C, Java, Haskell, or any other compiled language which would reduce this down to a single machine instruction). So again, we have a trade off between readability / maintainability and performance.
This experiment highlighted some of the performance considerations when looping in Python. The benefits of generators were shown and the power of Numpy for intensive numerical operations gave a massive performance boost. Numpy seems to be the obvious choice, but more design and research will have to go into the best way of implementing this to avoid excessive memory consumption.