A while back, I got into a discussion with my friends about the performance implications of picking different libraries. We spoke generally about how, just because something's in a standard library, it doesn't mean it's particularly great. We also agreed that just because there's a third-party alternative it doesn't mean that it's the bee's knees either.
We spoke of off-the-cuff ways we might make decisions about what to use based on performance. Not the sort of performance testing that you give to a customer. No, we're talking about the curious sort you do when you're interested in sweeping statements and generalities.
Although contrived, I came up with the following example. What follows is a summarized anecdote about regex and some poor man's performance testing.
Regex Performance in Python
Python isn't particularly fast at regular expressions. To be fair, it's a difficult problem, and Python isn't helped by the fact that its implementation uses recursive backtracking instead of the finite automata-based techniques we find in classic C tools like grep and awk. That said, this academic sideshow isn't really what we're here to talk about.
Let's get to the code. The following block contains four methods that perform the same function: determining whether a substring is contained within another string. Additionally, we use timeit as a cheap and easy way of benchmarking the performance of each method:
```python
import re
import regex
from timeit import timeit


def re_search(string, text):
    '''Utilize 're' from the standard library; performs a 'search'.'''
    if re.search(text, string):
        pass


def match_search(string, text):
    '''Utilize 're' from the standard library; performs a 'match'.'''
    if re.match(text, string):
        pass


def in_search(string, text):
    '''Utilize the standard library operator 'in'.'''
    if text in string:
        pass


def regex_search(string, text):
    '''Utilize 'regex' from a third-party library; extends 're'; performs a 'search'.'''
    if regex.search(text, string):
        pass


# Execute each function independently to analyze performance.
# Each searches for the substring 'win' inside the string 'windows'.
print(timeit("re_search(string, text)",
             "from __main__ import re_search; string='windows'; text='win'"))
print(timeit("match_search(string, text)",
             "from __main__ import match_search; string='windows'; text='win'"))
print(timeit("in_search(string, text)",
             "from __main__ import in_search; string='windows'; text='win'"))
print(timeit("regex_search(string, text)",
             "from __main__ import regex_search; string='windows'; text='win'"))
```
Executing the code above, you should observe output similar to the following:
```
2.6346885659986583
2.7640133929999138
0.42010939700048766
11.700926834000711
```
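One refinement worth trying before drawing conclusions (my addition, not part of the original comparison): pre-compile the pattern with re.compile, so each call skips the per-call pattern lookup that re.search performs. A minimal sketch:

```python
import re
from timeit import timeit

# Compile the pattern once, up front, instead of passing a string
# to re.search() on every call
PATTERN = re.compile('win')


def compiled_search(string):
    '''Search using a pre-compiled pattern object.'''
    if PATTERN.search(string):
        pass


# globals=globals() lets timeit see the names defined in this snippet
print(timeit("compiled_search(string)",
             setup="string='windows'",
             globals=globals()))
```

Whether this meaningfully changes the standings is exactly the sort of question this kind of quick-and-dirty timing is good at answering.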
What have we learned? First, apparently the third-party regex library is a solid way to start your day disappointed. Second, the Python standard library's re implementation gives us a lot of ways to get the same information (acknowledging this is a manufactured test). Although they're all reasonably quick, the timeit test leads us to believe that in is the way to approach this particular problem in Python.

timeit is a module that gives us a quick way to time things in Python. It's flexible and configurable enough to provide us with a warm and fuzzy impression of which things might perform better than others. We can wrap function calls with it, configure the timer and any looping behavior, and pop out a quantifiable result on the other side that helps us pick an appropriate implementation.
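As a sketch of that flexibility (the snippet is illustrative, not part of the benchmark above), timeit accepts a statement string with an iteration count, a repeat helper for re-running the whole measurement, or even a plain callable:

```python
from timeit import timeit, repeat

# Time a statement a fixed number of times; timeit returns total seconds
elapsed = timeit("'win' in 'windows'", number=100_000)

# repeat() reruns the whole measurement several times; taking the
# minimum filters out noise from other processes on the machine
best = min(repeat("'win' in 'windows'", number=100_000, repeat=5))

# A zero-argument callable works too, with no setup string required
per_run = timeit(lambda: 'win' in 'windows', number=100_000)

print(elapsed, best, per_run)
```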
If you'd like to read more about regex, there's a really fascinating essay by Russ Cox titled "Regular Expression Matching Can Be Simple And Fast."