Regex Performance in Python


<- back | tags: /dev/journal python performance

A while back, I got into a discussion with my friends about the performance implications of picking different libraries. We spoke generally about how, just because something's in a standard library, it doesn't mean it's particularly great. We also agreed that just because there's a third-party alternative it doesn't mean that it's the bee's knees either.

We spoke of off-the-cuff ways they might be able to make decisions about what to use based on performance. Not the sort of performance testing that you give to a customer. No, we're talking about the curious sort you do when you're interested in sweeping statements and generalities.

Although contrived, I came up with the following discussion. What follows is a summarized anecdote about Python, Regex, and some poor man's performance testing.

Regex Performance in Python

Regex in Python isn't particularly fast. To be fair, it's a difficult problem and Python isn't helped from the fact that it's original implementation used recursive backtracking instead of finite automata-based techniques like we find in C. That said, this academic sideshow isn't really what we're here to talk about.

Example

Let's get to the code. The following block contains four regular expression methods that perform the same function: determining if a sub-string is contained within another string. Additionally, we use timeit as a cheap and easy way of benchmarking the performance of each method:

import re
import regex
from timeit import timeit

def re_search(string, text):
  '''
  Utilize 're' from the standard library; performs a 'search'
  '''
  if re.search(text, string):
    pass

def match_search(string, text):
  '''
  Utilize 're' from the standard library; performs a 'match'
  '''
  if re.match(text, string):
    pass

def in_search(string, text):
  '''
  Utilize the standard library operator 'in'
  '''
  if text in string:
    pass

def regex_search(string, text):
  '''
  Utilize 'regex' from a third party library; Extends 're'; performs a 'search'
  '''
  if regex.search(text, string):
    pass

# Execute each function independently to analyze performance
print(timeit("re_search(string, text)", "from __main__ import re_search; string='win'; text='windows'"))
print(timeit("match_search(string, text)", "from __main__ import match_search; string='win'; text='windows'"))
print(timeit("in_search(string, text)", "from __main__ import in_search; string='windows'; text='win'"))
print(timeit("regex_search(string, text)", "from __main__ import regex_search; string='win'; text='windows'")

Executing the code above you should observe similar to the following output:

2.6346885659986583
2.7640133929999138
0.42010939700048766
11.700926834000711

What have we learned? First, apparently the third-party regex library is a solid way to start your day disappointed. Second, the Python standard library re implementation has a lot of ways for us to get the same information (acknowledging this is a manufactured test). Although they're all reasonably quick, the timeit test leads us to believe that in is the way to approach this particular problem in Python.

timeit is a module that gives us a quick way to time things in Python. It's flexible and configurable enough to provide us with a warm and fuzzy impression of what things might perform better than other things. We can wrap function calls with it, configure the timer and any looping behavior, and pop out a quantifiable on the other side that helps us pick appropriate implementations.

Resources

If you'd like to read more about regex then there's a really fascinating essay by Russ Cox titled: Regular Expression Matching Can Be Simple And Fast.