A while back, I got into a discussion with my friends about the performance implications of picking different libraries. We spoke generally about how, just because something's in a standard library, it doesn't mean it's particularly great. We also agreed that just because there's a third-party alternative it doesn't mean that it's the bee's knees either.
We spoke of off-the-cuff ways one might make decisions about what to use based on performance. Not the sort of performance testing that you give to a customer. No, we're talking about the curious sort you do when you're interested in sweeping statements and generalities.
Although contrived, the following example came out of that discussion. What follows is a summarized anecdote about Python, regex, and some poor man's performance testing.
Regex Performance in Python
Regex in Python isn't particularly fast. To be fair, it's a difficult problem, and Python isn't helped by the fact that its original implementation used recursive backtracking instead of the finite automata-based techniques we find in classic C implementations. That said, this academic sideshow isn't really what we're here to talk about.
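As a quick aside, here's a toy illustration of why backtracking matters. This is my own sketch, not part of the original discussion: the nested quantifiers in the hypothetical pattern below force the backtracking engine to try an exponential number of ways to split the run of 'a' characters before it can conclude there is no match.

```python
import re
import time

# A classic pathological pattern: '(a+)+$' can split the run of a's in
# exponentially many ways, and the trailing 'b' makes every split fail.
pattern = r"(a+)+$"
subject = "a" * 22 + "b"

start = time.perf_counter()
result = re.search(pattern, subject)
elapsed = time.perf_counter() - start

print(result)   # None: there is no match...
print(elapsed)  # ...but the engine backtracked through millions of paths to prove it
```

A finite automata-based engine would reject the same input in time linear in its length.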
Example
Let's get to the code. The following block contains four functions that perform the same task: determining whether a substring is contained within another string. Additionally, we use timeit as a cheap and easy way of benchmarking the performance of each method:
import re
import regex
from timeit import timeit

def re_search(string, text):
    '''
    Utilize 're' from the standard library; performs a 'search'
    '''
    if re.search(text, string):
        pass

def match_search(string, text):
    '''
    Utilize 're' from the standard library; performs a 'match'
    '''
    if re.match(text, string):
        pass

def in_search(string, text):
    '''
    Utilize the standard library operator 'in'
    '''
    if text in string:
        pass

def regex_search(string, text):
    '''
    Utilize 'regex' from a third-party library; extends 're'; performs a 'search'
    '''
    if regex.search(text, string):
        pass

# Execute each function independently to analyze performance; each call
# checks for the substring 'win' within the string 'windows'
print(timeit(
    "re_search(string, text)",
    "from __main__ import re_search; string='windows'; text='win'"
))
print(timeit(
    "match_search(string, text)",
    "from __main__ import match_search; string='windows'; text='win'"
))
print(timeit(
    "in_search(string, text)",
    "from __main__ import in_search; string='windows'; text='win'"
))
print(timeit(
    "regex_search(string, text)",
    "from __main__ import regex_search; string='windows'; text='win'"
))
Executing the code above, you should observe output similar to the following:
2.6346885659986583
2.7640133929999138
0.42010939700048766
11.700926834000711
What have we learned? First, apparently the third-party regex library is a solid way to start your day disappointed. Second, the Python standard library re implementation has a lot of ways for us to get the same information (acknowledging this is a manufactured test). Although they're all reasonably quick, the timeit test leads us to believe that in is the way to approach this particular problem in Python.
timeit is a module that gives us a quick way to time things in Python. It's flexible and configurable enough to provide a warm and fuzzy impression of which things might perform better than others. We can wrap function calls with it, configure the timer and any looping behavior, and pop out a quantifiable result on the other side that helps us pick appropriate implementations.
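To make that concrete, here's a short sketch of the knobs timeit exposes. The number argument controls how many times the statement runs (the default is one million, which is why the figures above are in whole seconds), and the Timer class adds repeat, where taking the minimum across runs is a common way to damp noise from other processes:

```python
from timeit import Timer, timeit

# 'number' overrides the default of 1,000,000 executions
elapsed = timeit("'win' in 'windows'", number=100_000)
print(elapsed)

# Timer.repeat() re-runs the whole measurement several times;
# the minimum is usually the least-disturbed figure
timer = Timer("'win' in 'windows'")
print(min(timer.repeat(repeat=5, number=100_000)))
```

The same module is also available from the command line via python -m timeit, which picks a sensible number for you.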
Resources
If you'd like to read more about regex, there's a really fascinating essay by Russ Cox titled "Regular Expression Matching Can Be Simple And Fast".