Programmers have to implement many different types of algorithms throughout their careers. In many cases, we use algorithms that were made by other programmers, because of the complexity of the algorithm or time restraints or other compounding factors. But in many cases, the implementation of the algorithms we use is not the most optimized version. Sometimes these implementations are too complex for the constraints of the algorithm.
A simple algorithm that many programmers use is the Luhn algorithm. This algorithm is a simple checksum used to validate identification numbers, where the last digit in the number is the "check digit". For example, we use this algorithm to check the validity of credit card numbers to reduce processing fees at credit card gateways. This algorithm proceeds as follows:
There is an example implementation of this algorithm on Wikipedia [Accessed 7/1/2015] in the python programming language. This implementation, while a good demonstration of the algorithm for the purpose of a teaching tool on Wikipedia, is much too complicated. Let's take a look at this implementation:
def luhn_checksum(card_number):
def digits_of(n):
return [int(d) for d in str(n)]
digits = digits_of(card_number)
odd_digits = digits[-1::-2]
even_digits = digits[-2::-2]
checksum = 0
checksum += sum(odd_digits)
for d in even_digits:
checksum += sum(digits_of(d*2))
return checksum % 10
def is_luhn_valid(card_number):
return luhn_checksum(card_number) == 0
First lets take a look at the digits_of function. This function iterates through each character in a string of numbers and turns it into a list of integers. If this function was called with an input of '79927398713', the output would be [7, 9, 9, 2, 7, 3, 9, 8, 7, 1, 3]. This function introduces an immense amount of complexity for the implicit constraints of this algorithm. Since we know we are only ever going to handle numbers from 0 - 9, using the int cast function to change a string representation of an integer into an integer type is excessive. The int function is used to deal with a multitude of situations that we know implicitly will never occur. But we do need a way to change a string representation of a digit from 0 - 9 to an integer. For this we use a simple map:
INT_MAP = {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6,
'7': 7, '8': 8, '9': 9}
With this map we can simple do INT_MAP['0'] and get the value 0 without having to worry about the safety checks and complex interpretations that the int casting function has to implement.
You may have also noticed that the digits_of function includes a str cast function as well, which changes its input into its string representation. This function uses this exclusively for the edge case introduced in line 10, where an integer is used as an input into the function. On this line we see digits_of(d*2). The purpose of this code is to multiply an even digit by 2 and if the product is greater than 9, then subtract 9. This accomplishes this by casting the product to a string, then summing the individual digits. This only works for the numbers from 5 - 9, so it's a clever way of subtracting 9. For example, for d=8 we get an input of 16 which will get casted into its string representation of '16'. Then each digit is casted to an int and summed for an output of 7.
If this process of multiplying by 2 and subtracting by 9 for only the numbers from 5 - 9 using a summation trick while creating an edge case in the digits_of function sounds complex, it's because it is. Once again the implicit limitations of this algorithm makes this process excessive. We know we will only get numbers from 0 - 9, so we can just pre-calculate all of this and store the solutions in a map we will call LUHN_EVEN:
LUHN_EVEN = {'0': 0, '1': 2, '2': 4, '3': 6, '4': 8, '5': 1, '6': 3,
'7': 5, '8': 7, '9': 9}
Lets move on from the digits_of function and look at how this implementation separates odd from even digits. On line 5-6 we see odd_digits = digits[-1::-2] and even_digits = digits[-2::-2]. These lines use a feature of the python programming language called extended slices that generates a new list given a start index, an end index, and a step vector. In the case of odd_digits we have a start index of -1, which means we start at the last digit. The end index is empty, which means we have no limit. And the step is -2 which will sample every 2nd digit iterating towards the first digit. The even_digits slice has the same slicing step but starts iterating at the second to the last digit. This is a clever way to get the odd and even digits, but it has the downside of using time and memory by generating two new lists. Another solution would be to iterate through all digits and check if the digit is the even or odd digit and perform a different action based on this condition. In the case of the Luhn algorithm we have to start from the rightmost digit and assign it as an odd digit:
for i, digit in enumerate(reversed(card_number)):
if i & 1 == 0:
# i is even, but in luhn our even digits are considered odd
# perform odd digit action
else:
# perform even digit action
But we aren't going to do this either, because it is not necessary to use a conditional to get the even and odd digits. The preferred solution for this case is to create a generator and manually iterate through it, two digits at a time. A generator is simply a function that allows iteration through an iterable data structure, like a list, by using a next function to step through the structure. The last example included two generators: enumerate which provided the i value as a count of the number of iterations performed so far and reversed which iterates through its input in the reversed order. The for value in iterable syntax takes care of calling the next functions transparently on each iteration. When we iterate manually two digits at a time we get the following:
gen = reversed(card_number)
while True:
odd = next(gen, None)
if odd is None:
break
# perform odd digit action
even = next(gen, None)
if even is None:
break
# perform even digit action
The next function allows us to set a sentinel value for the purpose of signalling the end of the iterable structure. The value of None, which we know to not exist in the card_number string, is used for this purpose. (A good sentinel value in python for data structures with unknown members is Ellipsis.) We can check for this sentinel value in order to break out of the while loop. This is a simple and easy to understand solution to getting the even and odd digits.
When we put all of these simplifications together we get the following implementation of a Luhn checksum:
LUHN_EVEN = {'0': 0, '1': 2, '2': 4, '3': 6, '4': 8, '5': 1,
'6': 3, '7': 5, '8': 7, '9': 9}
LUHN_ODD = {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5,
'6': 6, '7': 7, '8': 8, '9': 9}
def luhn_checksum(card_number):
checksum = 0
gen = reversed(card_number)
while True:
odd = next(gen, None)
if odd is None:
break
checksum += LUHN_ODD[odd]
even = next(gen, None)
if even is None:
break
checksum += LUHN_EVEN[even]
return checksum % 10
def luhn_valid(card_number):
return luhn_checksum(card_number) == 0
This implementation is not only easier to read and understand but it is also more efficient in both memory usage and run time. Using the python timeit module, which is the standard module for measuring run time of python functions, at the default settings of one million iterations we see that our simplified implementation is ten times faster than the implementation we found on Wikipedia. Which, for an algorithm as simple and small as the Luhn checksum, is an impressive speed up.
If the maximum length of the string input is known, this could be further optimized by pre-calculating the modulus step checksum % 10. This can be done by creating a mapping of all values from 0 to n * 9, where n is the maximum string length allowed, to the modulus ten of that value. For an n of 2 this would lead to the map:
LUHN_MOD = {0:0, 1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:0,
11:1, 12:2, 13:3, 14:4, 15:5, 16:6, 17:7, 18:8}
This optimization leads to a small speed up in the run time in exchange for a non-trivial increase in static memory usage. So this optimization may not be worth it, but it depends on the conditions and constraints for your specific case.
While we can't always ensure we are using the best implementation of an algorithm, due to time constraints or lack of expertise in the algorithm. If you have the time, and the expertise, make sure to keep it simple.