A while back, I got into a discussion with my friends about the performance implications of picking different libraries. We spoke generally about how, just because something's in a standard library, it doesn't mean it's particularly great. We also agreed that just because there's a third-party alternative it doesn't mean that it's the bee's knees either.
We spoke of off-the-cuff ways they might be able to make decisions about what to use based on performance. Not the sort of performance testing that you give to a customer. No, we're talking about the curious sort you do when you're interested in sweeping statements and generalities.
Although contrived, I came up with the following discussion. What follows is a summarized anecdote about regex, and some poor man's performance testing.
Regex Performance in Python
Python's regular expression support isn't particularly fast. To be fair, it's a difficult problem, and it isn't helped by the fact that its original implementation used backtracking instead of finite automata-based techniques like those we find in C. That said, this academic sideshow isn't really what we're here to talk about.
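Still, if you're curious, the backtracking cost is easy to see for yourself. The sketch below is my own illustration of the classic pathological case (the pattern and sizes are arbitrary examples), not part of the benchmark that follows:

```python
import re
from timeit import timeit

def pathological(n):
    '''Time a single match of 'a?'*n + 'a'*n against n 'a' characters.

    The string matches only when every 'a?' matches empty, so a
    backtracking engine explores on the order of 2**n combinations
    before finding that arrangement.
    '''
    pattern = 'a?' * n + 'a' * n
    return timeit(lambda: re.match(pattern, 'a' * n), number=1)

print(pathological(10))  # effectively instant
print(pathological(20))  # noticeably slower, despite the same tiny input
```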
Let's get to the code. The following block contains four functions that perform the same job: determining whether a substring is contained within another string. Additionally, we use timeit as a cheap and easy way of benchmarking the performance of each approach:
import re
import regex
from timeit import timeit


def re_search(string, text):
    '''Utilize 're' from the standard library; performs a 'search'.'''
    if re.search(text, string):
        pass


def match_search(string, text):
    '''Utilize 're' from the standard library; performs a 'match'.'''
    if re.match(text, string):
        pass


def in_search(string, text):
    '''Utilize the standard library operator 'in'.'''
    if text in string:
        pass


def regex_search(string, text):
    '''Utilize 'regex' from a third-party library; extends 're'; performs a 'search'.'''
    if regex.search(text, string):
        pass


# Execute each function independently to analyze performance
print(timeit(
    "re_search(string, text)",
    "from __main__ import re_search; string='windows'; text='win'"
))
print(timeit(
    "match_search(string, text)",
    "from __main__ import match_search; string='windows'; text='win'"
))
print(timeit(
    "in_search(string, text)",
    "from __main__ import in_search; string='windows'; text='win'"
))
print(timeit(
    "regex_search(string, text)",
    "from __main__ import regex_search; string='windows'; text='win'"
))
Executing the code above, you should observe output similar to the following:
2.6346885659986583
2.7640133929999138
0.42010939700048766
11.700926834000711
What have we learned? First, apparently the third-party regex library is a solid way to start your day disappointed. Second, the Python standard library re implementation gives us a lot of ways to get the same information (acknowledging this is a manufactured test). Although they're all reasonably quick, the timeit test leads us to believe that in is the way to approach this particular problem in Python.
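In fairness to re, the in operator only handles literal substrings; the moment the search involves actual pattern structure, a regular expression is the right tool. A quick sketch (the patterns here are my own examples):

```python
import re

haystack = 'windows'

# A literal substring check: 'in' is the simplest and fastest option.
print('win' in haystack)  # True

# Pattern structure is beyond 'in', but trivial for 're'.
print(bool(re.search(r'win(dows)?', haystack)))   # True: optional suffix
print(bool(re.fullmatch(r'w[a-z]+s', haystack)))  # True: character class
```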
timeit is a module that gives us a quick way to time things in Python. It's flexible and configurable enough to provide us with a warm and fuzzy impression of which things might perform better than others. We can wrap function calls with it, configure the timer and any looping behavior, and pop out a quantifiable number on the other side that helps us pick appropriate implementations.
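For example, both the loop count and the number of repetitions are configurable; this sketch (with my own illustrative numbers) shows the knobs I mean:

```python
from timeit import timeit, repeat

setup = "string, text = 'windows', 'win'"

# Run the statement 100,000 times instead of the default 1,000,000.
total = timeit('text in string', setup=setup, number=100_000)

# Repeat the whole measurement 5 times and keep the minimum,
# which filters out scheduler and warm-up noise.
best = min(repeat('text in string', setup=setup, number=100_000, repeat=5))

print(total, best)
```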
If you'd like to read more about regex internals, there's a really fascinating essay by Russ Cox titled Regular Expression Matching Can Be Simple And Fast.