<!DOCTYPE html>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<meta name="generator" content="hevea 2.09">
<link rel="stylesheet" type="text/css" href="thinkpython2.css">
<title>Analysis of Algorithms</title>
<td valign="top" width="100" bgcolor="#b6459a">
<td valign="top" width="600" style="padding: 20px 20px;">
<h1 class="chapter" id="sec251">Appendix&#XA0;B&#XA0;&#XA0;Analysis of Algorithms</h1>
<a id="algorithms"></a></p><blockquote class="quote">
This appendix is an edited excerpt from <span class="c009">Think Complexity</span>, by
Allen B. Downey, also published by O&#X2019;Reilly Media (2012). When you
are done with this book, you might want to move on to that one.
</blockquote><p><span class="c010">Analysis of algorithms</span> is a branch of computer science that
studies the performance of algorithms, especially their run time and
space requirements. See
<a href="http://en.wikipedia.org/wiki/Analysis_of_algorithms"><span class="c004">http://en.wikipedia.org/wiki/Analysis_of_algorithms</span></a>.
<a id="hevea_default1806"></a> <a id="hevea_default1807"></a></p><p>The practical goal of algorithm analysis is to predict the performance
of different algorithms in order to guide design decisions.</p><p>During the 2008 United States Presidential Campaign, candidate
Barack Obama was asked to perform an impromptu analysis when
he visited Google. Chief executive Eric Schmidt jokingly asked him
for &#X201C;the most efficient way to sort a million 32-bit integers.&#X201D;
Obama had apparently been tipped off, because he quickly
replied, &#X201C;I think the bubble sort would be the wrong way to go.&#X201D;
See <a href="http://www.youtube.com/watch?v=k4RRi_ntQc8"><span class="c004">http://www.youtube.com/watch?v=k4RRi_ntQc8</span></a>.
<a id="hevea_default1808"></a>
<a id="hevea_default1809"></a>
<a id="hevea_default1810"></a></p><p>This is true: bubble sort is conceptually simple but slow for
large datasets. The answer Schmidt was probably looking for is
&#X201C;radix sort&#X201D; (<a href="http://en.wikipedia.org/wiki/Radix_sort"><span class="c004">http://en.wikipedia.org/wiki/Radix_sort</span></a>)<sup><a id="text2" href="thinkpython2022.html#note2">1</a></sup>.
<a id="hevea_default1811"></a></p><p>The goal of algorithm analysis is to make meaningful
comparisons between algorithms, but there are some problems:
<a id="hevea_default1812"></a></p><ul class="itemize"><li class="li-itemize">The relative performance of the algorithms might
depend on characteristics of the hardware, so one algorithm
might be faster on Machine A, another on Machine B.
The general solution to this problem is to specify a
<span class="c010">machine model</span> and analyze the number of steps, or
operations, an algorithm requires under a given model.
<a id="hevea_default1813"></a></li><li class="li-itemize">Relative performance might depend on the details of
the dataset. For example, some sorting
algorithms run faster if the data are already partially sorted;
other algorithms run slower in this case.
A common way to avoid this problem is to analyze the
<span class="c010">worst case</span> scenario. It is sometimes useful to
analyze average case performance, but that&#X2019;s usually harder,
and it might not be obvious what set of cases to average over.
<a id="hevea_default1814"></a>
<a id="hevea_default1815"></a></li><li class="li-itemize">Relative performance also depends on the size of the
problem. A sorting algorithm that is fast for small lists
might be slow for long lists.
The usual solution to this problem is to express run time
(or number of operations) as a function of problem size,
and group functions into categories depending on how quickly
they grow as problem size increases.</li></ul><p>The good thing about this kind of comparison is that it lends
itself to simple classification of algorithms. For example,
if I know that the run time of Algorithm A tends to be
proportional to the size of the input, <span class="c009">n</span>, and Algorithm B
tends to be proportional to <span class="c009">n</span><sup>2</sup>, then I
expect A to be faster than B, at least for large values of <span class="c009">n</span>.</p><p>This kind of analysis comes with some caveats, but we&#X2019;ll get
to those later.</p>
<h2 class="section" id="sec252">B.1&#XA0;&#XA0;Order of growth</h2>
<p>Suppose you have analyzed two algorithms and expressed
their run times in terms of the size of the input:
Algorithm A takes 100<span class="c009">n</span>+1 steps to solve a problem with
size <span class="c009">n</span>; Algorithm B takes <span class="c009">n</span><sup>2</sup> + <span class="c009">n</span> + 1 steps.
<a id="hevea_default1816"></a></p><p>The following table shows the run time of these algorithms
for different problem sizes:</p><table class="c000 cellpadding1" border=1><tr><td class="c014">Input</td><td class="c014">Run time of</td><td class="c014">Run time of </td></tr>
<tr><td class="c014">size</td><td class="c014">Algorithm A</td><td class="c014">Algorithm B </td></tr>
<tr><td class="c014">10</td><td class="c014">1 001</td><td class="c014">111 </td></tr>
<tr><td class="c014">100</td><td class="c014">10 001</td><td class="c014">10 101 </td></tr>
<tr><td class="c014">1 000</td><td class="c014">100 001</td><td class="c014">1 001 001 </td></tr>
<tr><td class="c014">10 000</td><td class="c014">1 000 001</td><td class="c014">100 010 001 </td></tr>
</table><p>At <span class="c009">n</span>=10, Algorithm A looks pretty bad; it takes almost 10 times
longer than Algorithm B. But for <span class="c009">n</span>=100 they are about the same, and
for larger values A is much better.</p><p>The fundamental reason is that for large values of <span class="c009">n</span>, any function
that contains an <span class="c009">n</span><sup>2</sup> term will grow faster than a function whose
leading term is <span class="c009">n</span>. The <span class="c010">leading term</span> is the term with the
highest exponent.
<a id="hevea_default1817"></a>
<a id="hevea_default1818"></a></p><p>For Algorithm A, the leading term has a large coefficient, 100, which
is why B does better than A for small <span class="c009">n</span>. But regardless of the
coefficients, there will always be some value of <span class="c009">n</span> where
<span class="c009">a n</span><sup>2</sup> &gt; <span class="c009">b n</span>, for any positive values of <span class="c009">a</span> and <span class="c009">b</span>.
<a id="hevea_default1819"></a></p><p>The same argument applies to the non-leading terms. Even if the run
time of Algorithm A were <span class="c009">n</span>+1000000, it would still be better than
Algorithm B for sufficiently large <span class="c009">n</span>.</p><p>In general, we expect an algorithm with a smaller leading term to be a
better algorithm for large problems, but for smaller problems, there
may be a <span class="c010">crossover point</span> where another algorithm is better. The
location of the crossover point depends on the details of the
algorithms, the inputs, and the hardware, so it is usually ignored for
purposes of algorithmic analysis. But that doesn&#X2019;t mean you can forget about it.
<a id="hevea_default1820"></a></p><p>If two algorithms have the same leading order term, it is hard to say
which is better; again, the answer depends on the details. So for
algorithmic analysis, functions with the same leading term
are considered equivalent, even if they have different coefficients.</p><p>An <span class="c010">order of growth</span> is a set of functions whose growth
behavior is considered equivalent. For example, 2<span class="c009">n</span>, 100<span class="c009">n</span> and <span class="c009">n</span>+1 
belong to the same order of growth, which is written <span class="c009">O</span>(<span class="c009">n</span>) in
<span class="c010">Big-Oh notation</span> and often called <span class="c010">linear</span> because every function
in the set grows linearly with <span class="c009">n</span>.
<a id="hevea_default1821"></a>
<a id="hevea_default1822"></a></p><p>All functions with the leading term <span class="c009">n</span><sup>2</sup> belong to <span class="c009">O</span>(<span class="c009">n</span><sup>2</sup>); they are
called <span class="c010">quadratic</span>.
<a id="hevea_default1823"></a></p><p>The following table shows some of the orders of growth that
appear most commonly in algorithmic analysis,
in increasing order of badness.
<a id="hevea_default1824"></a></p><table class="c000 cellpadding1" border=1><tr><td class="c014">Order of</td><td class="c014">Name </td></tr>
<tr><td class="c014">growth</td><td class="c014">&nbsp;</td></tr>
<tr><td class="c014"><span class="c009">O</span>(1)</td><td class="c014">constant </td></tr>
<tr><td class="c014"><span class="c009">O</span>(log<sub><span class="c009">b</span></sub> <span class="c009">n</span>)</td><td class="c014">logarithmic (for any <span class="c009">b</span>) </td></tr>
<tr><td class="c014"><span class="c009">O</span>(<span class="c009">n</span>)</td><td class="c014">linear </td></tr>
<tr><td class="c014"><span class="c009">O</span>(<span class="c009">n</span> log<sub><span class="c009">b</span></sub> <span class="c009">n</span>)</td><td class="c014">linearithmic </td></tr>
<tr><td class="c014"><span class="c009">O</span>(<span class="c009">n</span><sup>2</sup>)</td><td class="c014">quadratic </td></tr>
<tr><td class="c014"><span class="c009">O</span>(<span class="c009">n</span><sup>3</sup>)</td><td class="c014">cubic </td></tr>
<tr><td class="c014"><span class="c009">O</span>(<span class="c009">c</span><sup><span class="c009">n</span></sup>)</td><td class="c014">exponential (for any <span class="c009">c</span>) </td></tr>
</table><p>For the logarithmic terms, the base of the logarithm doesn&#X2019;t matter;
changing bases is the equivalent of multiplying by a constant, which
doesn&#X2019;t change the order of growth. Exponential functions are
different: changing the base does change the order of growth (for
example, 3<sup><span class="c009">n</span></sup> is not in <span class="c009">O</span>(2<sup><span class="c009">n</span></sup>)), but all of them are
called exponential.
Exponential functions grow very quickly, so exponential algorithms are
only useful for small problems.
<a id="hevea_default1825"></a>
<a id="hevea_default1826"></a></p><div class="theorem"><span class="c010">Exercise&#XA0;1</span>&#XA0;&#XA0;<p><em>Read the Wikipedia page on Big-Oh notation at
</em><a href="http://en.wikipedia.org/wiki/Big_O_notation"><em><span class="c004">http://en.wikipedia.org/wiki/Big_O_notation</span></em></a><em> and
answer the following questions:</em></p><ol class="enumerate" type=1><li class="li-enumerate">
<em>What is the order of growth of </em><span class="c009">n</span><sup>3</sup> + <span class="c009">n</span><sup>2</sup><em>?
What about </em>1000000 <span class="c009">n</span><sup>3</sup> + <span class="c009">n</span><sup>2</sup><em>?
What about </em><span class="c009">n</span><sup>3</sup> + 1000000 <span class="c009">n</span><sup>2</sup><em>?</em></li><li class="li-enumerate"><em>What is the order of growth of </em>(<span class="c009">n</span><sup>2</sup> + <span class="c009">n</span>) &#XB7; (<span class="c009">n</span> + 1)<em>? Before
you start multiplying, remember that you only need the leading term.</em></li><li class="li-enumerate"><em>If </em><span class="c009">f</span><em> is in </em><span class="c009">O</span>(<span class="c009">g</span>)<em>, for some unspecified function </em><span class="c009">g</span><em>, what can
we say about </em><span class="c009">af</span>+<span class="c009">b</span><em>?</em></li><li class="li-enumerate"><em>If </em><span class="c009">f</span><sub>1</sub><em> and </em><span class="c009">f</span><sub>2</sub><em> are in </em><span class="c009">O</span>(<span class="c009">g</span>)<em>, what can we say about </em><span class="c009">f</span><sub>1</sub> + <span class="c009">f</span><sub>2</sub><em>?</em></li><li class="li-enumerate"><em>If </em><span class="c009">f</span><sub>1</sub><em> is in </em><span class="c009">O</span>(<span class="c009">g</span>)<em>
and </em><span class="c009">f</span><sub>2</sub><em> is in </em><span class="c009">O</span>(<span class="c009">h</span>)<em>,
what can we say about </em><span class="c009">f</span><sub>1</sub> + <span class="c009">f</span><sub>2</sub><em>?</em></li><li class="li-enumerate"><em>If </em><span class="c009">f</span><sub>1</sub><em> is in </em><span class="c009">O</span>(<span class="c009">g</span>)<em> and </em><span class="c009">f</span><sub>2</sub><em> is </em><span class="c009">O</span>(<span class="c009">h</span>)<em>,
what can we say about </em><span class="c009">f</span><sub>1</sub> &#XB7; <span class="c009">f</span><sub>2</sub><em>?
</em></li></ol></div><p>Programmers who care about performance often find this kind of
analysis hard to swallow. They have a point: sometimes the
coefficients and the non-leading terms make a real difference.
Sometimes the details of the hardware, the programming language, and
the characteristics of the input make a big difference. And for small
problems asymptotic behavior is irrelevant.</p><p>But if you keep those caveats in mind, algorithmic analysis is a
useful tool. At least for large problems, the &#X201C;better&#X201D; algorithm
is usually better, and sometimes it is <em>much</em> better. The
difference between two algorithms with the same order of growth is
usually a constant factor, but the difference between a good algorithm
and a bad algorithm is unbounded!</p>
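<p>As a rough illustration, the run-time table for Algorithms A and B earlier in this section can be reproduced with a few lines of Python. The names <span class="c004">steps_a</span>, <span class="c004">steps_b</span> and <span class="c004">crossover</span> are made up for this sketch; they only evaluate the two step-count formulas and look for the smallest problem size at which B needs more steps than A.</p>

```python
def steps_a(n):
    """Step count assumed for Algorithm A: 100n + 1."""
    return 100 * n + 1


def steps_b(n):
    """Step count assumed for Algorithm B: n^2 + n + 1."""
    return n * n + n + 1


def crossover(limit=1_000_000):
    """Return the smallest n >= 1 where B needs more steps than A.

    Returns None if no such n exists up to limit.
    """
    for n in range(1, limit + 1):
        if steps_b(n) > steps_a(n):
            return n
    return None


# Reproduce the rows of the table.
for n in (10, 100, 1000, 10000):
    print(n, steps_a(n), steps_b(n))

print("B becomes slower than A at n =", crossover())
```

<p>Under these two formulas the step counts are equal at <span class="c009">n</span>=99, and B is slower from <span class="c009">n</span>=100 onward. With real programs the crossover point would also depend on the hardware and the inputs, as noted above.</p>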
<h2 class="section" id="sec253">B.2&#XA0;&#XA0;Analysis of basic Python operations</h2>
<p>In Python, most arithmetic operations are constant time;
multiplication usually takes longer than addition and subtraction, and
division takes even longer, but these run times don&#X2019;t depend on the
magnitude of the operands. Very large integers are an exception; in
that case the run time increases with the number of digits.
<a id="hevea_default1827"></a></p><p>Indexing operations&#X2014;reading or writing elements in a sequence
or dictionary&#X2014;are also constant time, regardless of the size
of the data structure.
<a id="hevea_default1828"></a></p><p>A <span class="c004">for</span> loop that traverses a sequence or dictionary is
usually linear, as long as all of the operations in the body
of the loop are constant time. For example, adding up the
elements of a list is linear:</p><pre class="verbatim">    total = 0
    for x in t:
        total += x
</pre><p>The built-in function <span class="c004">sum</span> is also linear because it does
the same thing, but it tends to be faster because it is a more
efficient implementation; in the language of algorithmic analysis,
it has a smaller leading coefficient.</p><p>As a rule of thumb, if the body of a loop is in <span class="c009">O</span>(<span class="c009">n</span><sup><span class="c009">a</span></sup>) then
the whole loop is in <span class="c009">O</span>(<span class="c009">n</span><sup><span class="c009">a</span>+1</sup>). The exception is if you can
show that the loop exits after a constant number of iterations.
If a loop runs <span class="c009">k</span> times regardless of <span class="c009">n</span>, then
the loop is in <span class="c009">O</span>(<span class="c009">n</span><sup><span class="c009">a</span></sup>), even for large <span class="c009">k</span>.</p><p>Multiplying by <span class="c009">k</span> doesn&#X2019;t change the order of growth, and neither
does dividing. So if the body of a loop is in <span class="c009">O</span>(<span class="c009">n</span><sup><span class="c009">a</span></sup>) and it runs
<span class="c009">n</span>/<span class="c009">k</span> times, the loop is in <span class="c009">O</span>(<span class="c009">n</span><sup><span class="c009">a</span>+1</sup>), even for large <span class="c009">k</span>.</p><p>Most string and tuple operations are linear, except indexing and <span class="c004">len</span>, which are constant time. The built-in functions <span class="c004">min</span> and
<span class="c004">max</span> are linear. The run time of a slice operation is
proportional to the length of the output, but independent of the size
of the input.
<a id="hevea_default1829"></a>
<a id="hevea_default1830"></a></p><p>String concatenation is linear; the run time depends on the sum
of the lengths of the operands.
<a id="hevea_default1831"></a></p><p>All string methods are linear, but if the lengths of
the strings are bounded by a constant&#X2014;for example, operations on single
characters&#X2014;they are considered constant time.
The string method <span class="c004">join</span> is linear; the run time depends on
the total length of the strings.
<a id="hevea_default1832"></a></p><p>Most list methods are linear, but there are some exceptions:
<a id="hevea_default1833"></a></p><ul class="itemize"><li class="li-itemize">Adding an element to the end of a list is constant time on
average; when it runs out of room it occasionally gets copied
to a bigger location, but the total time for <span class="c009">n</span> operations
is <span class="c009">O</span>(<span class="c009">n</span>), so the average time for each
operation is <span class="c009">O</span>(1).</li><li class="li-itemize">Removing an element from the end of a list is constant time.</li><li class="li-itemize">Sorting is <span class="c009">O</span>(<span class="c009">n</span> log<span class="c009">n</span>).
<a id="hevea_default1834"></a></li></ul><p>Most dictionary operations and methods are constant time, but
there are some exceptions:
<a id="hevea_default1835"></a></p><ul class="itemize"><li class="li-itemize">The run time of <span class="c004">update</span> is
proportional to the size of the dictionary passed as a parameter,
not the dictionary being updated.</li><li class="li-itemize"><span class="c004">keys</span>, <span class="c004">values</span> and <span class="c004">items</span> are constant time because 
they return view objects rather than copies. But
if you loop through a view, the loop will be linear.
<a id="hevea_default1836"></a></li></ul><p>The performance of dictionaries is one of the minor miracles of
computer science. We will see how they work in
Section&#XA0;<a href="thinkpython2022.html#hashtable">B.4</a>.</p><div class="theorem"><span class="c010">Exercise&#XA0;2</span>&#XA0;&#XA0;<p><em>Read the Wikipedia page on sorting algorithms at
</em><a href="http://en.wikipedia.org/wiki/Sorting_algorithm"><em><span class="c004">http://en.wikipedia.org/wiki/Sorting_algorithm</span></em></a><em> and answer
the following questions:
</em><a id="hevea_default1837"></a></p><ol class="enumerate" type=1><li class="li-enumerate"><em>What is a &#X201C;comparison sort?&#X201D; What is the best worst-case order
of growth for a comparison sort? What is the best worst-case order
of growth for any sort algorithm?
</em><a id="hevea_default1838"></a></li><li class="li-enumerate"><em>What is the order of growth of bubble sort, and why does Barack
Obama think it is &#X201C;the wrong way to go?&#X201D;</em></li><li class="li-enumerate"><em>What is the order of growth of radix sort? What preconditions
do we need to use it?</em></li><li class="li-enumerate"><em>What is a stable sort and why might it matter in practice?
</em><a id="hevea_default1839"></a></li><li class="li-enumerate"><em>What is the worst sorting algorithm (that has a name)?</em></li><li class="li-enumerate"><em>What sort algorithm does the C library use? What sort algorithm
does Python use? Are these algorithms stable? You might have to
Google around to find these answers.</em></li><li class="li-enumerate"><em>Many of the non-comparison sorts are linear, so why does
Python use an </em><span class="c009">O</span>(<span class="c009">n</span> log<span class="c009">n</span>)<em> comparison sort?</em></li></ol></div>
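<p>The loop rule of thumb from earlier in this section can be checked by counting iterations instead of measuring time. The following is only a sketch: the function names are hypothetical, and each pass through an innermost loop body is counted as one step.</p>

```python
def count_single(n):
    """Loop with a constant-time body: n steps, so O(n)."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps


def count_nested(n):
    """Loop whose body is itself a linear loop: n * n steps, so O(n^2)."""
    steps = 0
    for _ in range(n):
        steps += count_single(n)
    return steps


def count_fixed(n, k=1000):
    """Loop that runs k times regardless of n: k * n steps, still O(n)."""
    steps = 0
    for _ in range(k):
        steps += count_single(n)
    return steps


for n in (10, 20, 40):
    print(n, count_single(n), count_nested(n), count_fixed(n))
```

<p>Doubling <span class="c009">n</span> doubles the counts for <span class="c004">count_single</span> and <span class="c004">count_fixed</span> but quadruples the count for <span class="c004">count_nested</span>. The large constant <span class="c009">k</span> makes <span class="c004">count_fixed</span> slower without changing its order of growth.</p>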
<h2 class="section" id="sec254">B.3&#XA0;&#XA0;Analysis of search algorithms</h2>
<p>A <span class="c010">search</span> is an algorithm that takes a collection and a target
item and determines whether the target is in the collection, often
returning the index of the target.
<a id="hevea_default1840"></a></p><p>The simplest search algorithm is a &#X201C;linear search&#X201D;, which traverses
the items of the collection in order, stopping if it finds the target.
In the worst case it has to traverse the entire collection, so the run
time is linear.
<a id="hevea_default1841"></a></p><p>The <span class="c004">in</span> operator for sequences uses a linear search; so do string
methods like <span class="c004">find</span> and <span class="c004">count</span>.
<a id="hevea_default1842"></a></p><p>If the elements of the sequence are in order, you can use a <span class="c010">bisection search</span>, which is <span class="c009">O</span>(log<span class="c009">n</span>). Bisection search is
similar to the algorithm you might use to look a word up in a
dictionary (a paper dictionary, not the data structure). Instead of
starting at the beginning and checking each item in order, you start
with the item in the middle and check whether the word you are looking
for comes before or after. If it comes before, then you search the
first half of the sequence. Otherwise you search the second half.
Either way, you cut the number of remaining items in half.
<a id="hevea_default1843"></a></p><p>If the sequence has 1,000,000 items, it will take about 20 steps to
find the word or conclude that it&#X2019;s not there. So that&#X2019;s about 50,000
times faster than a linear search.</p><p>Bisection search can be much faster than linear search, but
it requires the sequence to be in order, which might require
extra work.</p><p>There is another data structure, called a <span class="c010">hashtable</span>, that
is even faster&#X2014;it can do a search in constant time&#X2014;and it
doesn&#X2019;t require the items to be sorted. Python dictionaries
are implemented using hashtables, which is why most dictionary
operations, including the <span class="c004">in</span> operator, are constant time.</p>
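<p>One way to sketch a bisection search in Python is with the standard library function <span class="c004">bisect.bisect_left</span>, which finds the insertion position of a value in a sorted sequence. The wrapper <span class="c004">bisection_search</span> and the helper <span class="c004">count_halvings</span> are names invented for this example, and the code assumes the sequence is already sorted.</p>

```python
from bisect import bisect_left


def bisection_search(t, target):
    """Return the index of target in the sorted sequence t, or -1 if absent."""
    i = bisect_left(t, target)
    if i != len(t) and t[i] == target:
        return i
    return -1


def count_halvings(n):
    """Count how many times n items can be cut in half until one is left."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


t = list(range(0, 2_000_000, 2))      # 1,000,000 even numbers, in order
print(bisection_search(t, 10))        # index of 10 in t
print(bisection_search(t, 11))        # -1, because 11 is not in t
print(count_halvings(len(t)))         # roughly 20 halvings for 1,000,000 items
```

<p>The halving count is what makes the search logarithmic: each comparison discards half of the remaining items.</p>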
<h2 class="section" id="sec255">B.4&#XA0;&#XA0;Hashtables</h2>
<a id="hashtable"></a></p><p>To explain how hashtables work and why their performance is so
good, I start with a simple implementation of a map and
gradually improve it until it&#X2019;s a hashtable.
<a id="hevea_default1844"></a></p><p>I use Python to demonstrate these implementations, but in real
life you wouldn&#X2019;t write code like this in Python; you would just use a
dictionary! So for the rest of this chapter, you have to imagine that
dictionaries don&#X2019;t exist and you want to implement a data structure
that maps from keys to values. The operations you have to
implement are:</p><dl class="description"><dt class="dt-description"><span class="c010"><span class="c004">add(k, v)</span>:</span></dt><dd class="dd-description"> Add a new item that maps from key <span class="c004">k</span>
to value <span class="c004">v</span>. With a Python dictionary, <span class="c004">d</span>, this operation
is written <span class="c004">d[k] = v</span>.</dd><dt class="dt-description"><span class="c010"><span class="c004">get(k)</span>:</span></dt><dd class="dd-description"> Look up and return the value that corresponds
to key <span class="c004">k</span>. With a Python dictionary, <span class="c004">d</span>, this operation
is written <span class="c004">d[k]</span> or <span class="c004">d.get(k)</span>.</dd></dl><p>For now, I assume that each key only appears once.
The simplest implementation of this interface uses a list of
tuples, where each tuple is a key-value pair.
<a id="hevea_default1845"></a></p><pre class="verbatim">class LinearMap:
    def __init__(self):
        self.items = []
    def add(self, k, v):
        self.items.append((k, v))
    def get(self, k):
        for key, val in self.items:
            if key == k:
                return val
        raise KeyError
</pre><p><span class="c004">add</span> appends a key-value tuple to the list of items, which
takes constant time.</p><p><span class="c004">get</span> uses a <span class="c004">for</span> loop to search the list:
if it finds the target key it returns the corresponding value;
otherwise it raises a <span class="c004">KeyError</span>.
So <span class="c004">get</span> is linear.
<a id="hevea_default1846"></a></p><p>An alternative is to keep the list sorted by key. Then <span class="c004">get</span>
could use a bisection search, which is <span class="c009">O</span>(log<span class="c009">n</span>). But inserting a
new item in the middle of a list is linear, so this might not be the
best option. There are other data structures that can implement <span class="c004">add</span> and <span class="c004">get</span> in log time, but that&#X2019;s still not as good as
constant time, so let&#X2019;s move on.
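</p><p>The sorted-list alternative might look something like the following sketch. <span class="c004">SortedMap</span> is a hypothetical class, not part of the original code; it assumes that each key appears only once and that keys can be compared with each other.</p>

```python
from bisect import bisect_left


class SortedMap:
    """Hypothetical map that keeps its keys in sorted order."""

    def __init__(self):
        self.keys = []
        self.vals = []

    def add(self, k, v):
        # Finding the position takes O(log n) comparisons, but
        # list.insert shifts the later elements, so add is linear.
        i = bisect_left(self.keys, k)
        self.keys.insert(i, k)
        self.vals.insert(i, v)

    def get(self, k):
        # Bisection search over the sorted keys: O(log n).
        i = bisect_left(self.keys, k)
        if i != len(self.keys) and self.keys[i] == k:
            return self.vals[i]
        raise KeyError(k)


m = SortedMap()
for word in ["pear", "apple", "fig"]:
    m.add(word, len(word))
print(m.keys)          # keys are kept in sorted order
print(m.get("fig"))
```

<p>Here <span class="c004">get</span> improves to log time, but <span class="c004">add</span> gets worse, which is the trade-off described above.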
<a id="hevea_default1847"></a></p><p>One way to improve <span class="c004">LinearMap</span> is to break the list of key-value
pairs into smaller lists. Here&#X2019;s an implementation called
<span class="c004">BetterMap</span>, which is a list of 100 LinearMaps. As we&#X2019;ll see
in a second, the order of growth for <span class="c004">get</span> is still linear,
but <span class="c004">BetterMap</span> is a step on the path toward hashtables:
<a id="hevea_default1848"></a></p><pre class="verbatim">class BetterMap:
    def __init__(self, n=100):
        self.maps = []
        for i in range(n):
            self.maps.append(LinearMap())
    def find_map(self, k):
        index = hash(k) % len(self.maps)
        return self.maps[index]
    def add(self, k, v):
        m = self.find_map(k)
        m.add(k, v)
    def get(self, k):
        m = self.find_map(k)
        return m.get(k)
</pre><p><code>__init__</code> makes a list of <span class="c004">n</span> <span class="c004">LinearMap</span>s.</p><p><code>find_map</code> is used by
<span class="c004">add</span> and <span class="c004">get</span>
to figure out which map to put the
new item in, or which map to search.</p><p><code>find_map</code> uses the built-in function <span class="c004">hash</span>, which takes
almost any Python object and returns an integer. A limitation of this
implementation is that it only works with hashable keys. Mutable
types like lists and dictionaries are unhashable.
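</p><p>A small sketch of these two points, using hypothetical helper names: <span class="c004">bucket_index</span> performs the same computation as <span class="c004">find_map</span>, and <span class="c004">is_hashable</span> checks whether the built-in <span class="c004">hash</span> accepts an object.</p>

```python
def bucket_index(key, n_buckets):
    """Map a hashable key to a list index in range(n_buckets)."""
    return hash(key) % n_buckets


def is_hashable(obj):
    """Return True if the built-in hash function accepts obj."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(bucket_index("spam", 100))   # string hashes can differ between runs
print(is_hashable((1, 2)))         # tuples of hashable items are hashable
print(is_hashable([1, 2]))         # lists are not
print(is_hashable({}))             # dictionaries are not
```

<p>Because Python&#X2019;s <span class="c004">%</span> operator returns a non-negative result for a positive right operand, the index is valid even when the hash value is negative.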
<a id="hevea_default1849"></a></p><p>Hashable objects that are considered equivalent return the same hash
value, but the converse is not necessarily true: two objects with
different values can return the same hash value.</p><p><code>find_map</code> uses the modulus operator to wrap the hash values
into the range from 0 to <span class="c004">len(self.maps) - 1</span>, so the result is a legal
index into the list. Of course, this means that many different
hash values will wrap onto the same index. But if the hash function
spreads things out pretty evenly (which is what hash functions
are designed to do), then we expect <span class="c009">n</span>/100 items per LinearMap.</p><p>Since the run time of <span class="c004">LinearMap.get</span> is proportional to the
number of items, we expect BetterMap to be about 100 times faster
than LinearMap. The order of growth is still linear, but the
leading coefficient is smaller. That&#X2019;s nice, but still not
as good as a hashtable.</p><p>Here (finally) is the crucial idea that makes hashtables fast: if you
can keep the maximum length of the LinearMaps bounded, <span class="c004">LinearMap.get</span> is constant time. All you have to do is keep track
of the number of items and when the number of
items per LinearMap exceeds a threshold, resize the hashtable by
adding more LinearMaps.
<a id="hevea_default1850"></a></p><p>Here is an implementation of a hashtable:
<a id="hevea_default1851"></a></p><pre class="verbatim">class HashMap:
    def __init__(self):
        self.maps = BetterMap(2)
        self.num = 0
    def get(self, k):
        return self.maps.get(k)
    def add(self, k, v):
        if self.num == len(self.maps.maps):
            self.resize()
        self.maps.add(k, v)
        self.num += 1
    def resize(self):
        new_maps = BetterMap(self.num * 2)
        for m in self.maps.maps:
            for k, v in m.items:
                new_maps.add(k, v)
        self.maps = new_maps
</pre><p>Each <span class="c004">HashMap</span> contains a <span class="c004">BetterMap</span>; <code>__init__</code> starts
with just 2 LinearMaps and initializes <span class="c004">num</span>, which keeps track of
the number of items.</p><p><span class="c004">get</span> just dispatches to <span class="c004">BetterMap</span>. The real work happens
in <span class="c004">add</span>, which checks the number of items and the size of the
<span class="c004">BetterMap</span>: if they are equal, the average number of items per
LinearMap is 1, so it calls <span class="c004">resize</span>.</p><p><span class="c004">resize</span> makes a new <span class="c004">BetterMap</span>, twice as big as the previous
one, and then &#X201C;rehashes&#X201D; the items from the old map to the new.</p><p>Rehashing is necessary because changing the number of LinearMaps
changes the divisor of the modulus operator in
<code>find_map</code>. That means that some objects that used
to hash into the same LinearMap will get split up (which is
what we wanted, right?).
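</p><p>To see the splitting effect on a tiny example, the sketch below groups a few integer keys by <span class="c004">hash(key) % n</span> for two different values of <span class="c004">n</span>. The function <span class="c004">group_by_index</span> is hypothetical, and the output shown in the comments relies on a CPython detail: small non-negative integers hash to themselves.</p>

```python
def group_by_index(keys, n):
    """Group keys into n lists using hash(key) % n, as find_map would."""
    groups = [[] for _ in range(n)]
    for key in keys:
        groups[hash(key) % n].append(key)
    return groups


keys = [2, 4, 6, 8]

# With 2 LinearMaps, every key lands at index 0 (assuming CPython int hashes).
print(group_by_index(keys, 2))    # [[2, 4, 6, 8], []]

# With 4 LinearMaps, the same keys are split between index 0 and index 2.
print(group_by_index(keys, 4))    # [[4, 8], [], [2, 6], []]
```

<p>The keys did not change, but their indices did, which is why every item has to be added again after a resize.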
<a id="hevea_default1852"></a></p><p>Rehashing is linear, so
<span class="c004">resize</span> is linear, which might seem bad, since I promised
that <span class="c004">add</span> would be constant time. But remember that
we don&#X2019;t have to resize every time, so <span class="c004">add</span> is usually
constant time and only occasionally linear. The total amount
of work to run <span class="c004">add</span> <span class="c009">n</span> times is proportional to <span class="c009">n</span>,
so the average time of each <span class="c004">add</span> is constant time!
<a id="hevea_default1853"></a></p><p>To see how this works, think about starting with an empty
HashMap and adding a sequence of items. We start with 2 LinearMaps,
so the first 2 adds are fast (no resizing required). Let&#X2019;s
say that they take one unit of work each. The next add
requires a resize, so we have to rehash the first two
items (let&#X2019;s call that 2 more units of work) and then
add the third item (one more unit). Adding the next item
costs 1 unit, so the total so far is
6 units of work for 4 items.</p><p>The next <span class="c004">add</span> costs 5 units, but the next three
are only one unit each, so the total is 14 units for the
first 8 adds.</p><p>The next <span class="c004">add</span> costs 9 units, but then we can add 7 more
before the next resize, so the total is 30 units for the
first 16 adds.</p><p>After 32 adds, the total cost is 62 units, and I hope you are starting
to see a pattern. After <span class="c009">n</span> adds, where <span class="c009">n</span> is a power of two, the
total cost is 2<span class="c009">n</span>&#X2212;2 units, so the average work per add is
a little less than 2 units. When <span class="c009">n</span> is a power of two, that&#X2019;s
the best case; for other values of <span class="c009">n</span> the average work is a little
higher, but that&#X2019;s not important. The important thing is that it
is <span class="c009">O</span>(1).
<a id="hevea_default1854"></a></p><p>Figure&#XA0;<a href="thinkpython2022.html#fig.hash">B.1</a> shows how this works graphically. Each
block represents a unit of work. The columns show the total
work for each add in order from left to right: the first two
<span class="c004">adds</span> cost 1 unit each, the third costs 3 units, etc.</p><blockquote class="figure"><div class="center"><hr class="c019"></div>
<div class="center"><img src="thinkpython2026.png"></div>
<div class="caption"><table class="c001 cellpading0"><tr><td class="c018">Figure B.1: The cost of a hashtable add.<a id="fig.hash"></a></td></tr>
</table></div>
<div class="center"><hr class="c019"></div></blockquote><p>The extra work of rehashing appears as a sequence of increasingly
tall towers with increasing space between them. Now if you knock
over the towers, spreading the cost of resizing over all
adds, you can see graphically that the total cost after <span class="c009">n</span>
adds is 2<span class="c009">n</span> &#X2212; 2.</p><p>An important feature of this algorithm is that when we resize the
HashTable it grows geometrically; that is, we multiply the size by a
constant. If you instead increase the size
arithmetically (adding a fixed number each time), the average time
per <span class="c004">add</span> is linear in the number of items.
<a id="hevea_default1855"></a></p><p>You can download my implementation of HashMap from
<a href="http://thinkpython2.com/code/Map.py"><span class="c004">http://thinkpython2.com/code/Map.py</span></a>, but remember that there
is no reason to use it; if you want a map, just use a Python dictionary.</p>
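<p>The difference between geometric and arithmetic growth can be seen with a small experiment. The sketch below is an illustration under the same cost model as the text (one unit per add, one unit per rehashed item); the names <span class="c004">average_work</span>, <span class="c004">double</span>, and <span class="c004">add_two</span> are hypothetical and do not come from <span class="c004">Map.py</span>.</p>

```python
def average_work(n, grow):
    """Average units of work per add over n adds.

    grow maps the current table size to the new size when the table
    is full. It is a hypothetical parameter used only to compare
    growth strategies.
    """
    size, count, work = 2, 0, 0
    for _ in range(n):
        if count == size:
            work += count        # rehash every existing item
            size = grow(size)    # pick the new size
        work += 1                # add the new item
        count += 1
    return work / n


def double(size):
    """Geometric growth: multiply the size by a constant."""
    return size * 2


def add_two(size):
    """Arithmetic growth: add a fixed number each time."""
    return size + 2


for n in [1000, 2000, 4000]:
    print(n, average_work(n, double), average_work(n, add_two))
```

<p>With doubling, the average stays below 3 units as <span class="c009">n</span> grows. With a fixed increment, the average roughly doubles each time <span class="c009">n</span> doubles, which is what linear time per <span class="c004">add</span> looks like.</p>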
<h2 class="section" id="sec256">B.5&#XA0;&#XA0;Glossary</h2>
<dl class="description"><dt class="dt-description"><span class="c010">analysis of algorithms:</span></dt><dd class="dd-description"> A way to compare algorithms in terms of
their run time and/or space requirements.
<a id="hevea_default1856"></a></dd><dt class="dt-description"><span class="c010">machine model:</span></dt><dd class="dd-description"> A simplified representation of a computer used
to describe algorithms.
<a id="hevea_default1857"></a></dd><dt class="dt-description"><span class="c010">worst case:</span></dt><dd class="dd-description"> The input that makes a given algorithm run slowest (or
require the most space).
<a id="hevea_default1858"></a></dd><dt class="dt-description"><span class="c010">leading term:</span></dt><dd class="dd-description"> In a polynomial, the term with the highest exponent.
<a id="hevea_default1859"></a></dd><dt class="dt-description"><span class="c010">crossover point:</span></dt><dd class="dd-description"> The problem size where two algorithms require
the same run time or space. 
<a id="hevea_default1860"></a></dd><dt class="dt-description"><span class="c010">order of growth:</span></dt><dd class="dd-description"> A set of functions that all grow in a way
considered equivalent for purposes of analysis of algorithms. 
For example, all functions that grow linearly belong to the same
order of growth.
<a id="hevea_default1861"></a></dd><dt class="dt-description"><span class="c010">Big-Oh notation:</span></dt><dd class="dd-description"> Notation for representing an order of growth;
for example, <span class="c009">O</span>(<span class="c009">n</span>) represents the set of functions that grow linearly.
<a id="hevea_default1862"></a></dd><dt class="dt-description"><span class="c010">linear:</span></dt><dd class="dd-description"> An algorithm whose run time is proportional to
problem size, at least for large problem sizes.
<a id="hevea_default1863"></a></dd><dt class="dt-description"><span class="c010">quadratic:</span></dt><dd class="dd-description"> An algorithm whose run time is proportional to
<span class="c009">n</span><sup>2</sup>, where <span class="c009">n</span> is a measure of problem size.
<a id="hevea_default1864"></a></dd><dt class="dt-description"><span class="c010">search:</span></dt><dd class="dd-description"> The problem of locating an element of a collection
(like a list or dictionary) or determining that it is not present.
<a id="hevea_default1865"></a></dd><dt class="dt-description"><span class="c010">hashtable:</span></dt><dd class="dd-description"> A data structure that represents a collection of
key-value pairs and performs search in constant time.
<a id="hevea_default1866"></a></dd></dl><hr class="footnoterule"><dl class="thefootnotes"><dt class="dt-thefootnotes">
<a id="note2" href="thinkpython2022.html#text2">1</a></dt><dd class="dd-thefootnotes"><div class="footnotetext">
But if you get a question like this in an interview, I think
a better answer is, &#X201C;The fastest way to sort a million integers
is to use whatever sort function is provided by the language
I&#X2019;m using. Its performance is good enough for the vast majority
of applications, but if it turned out that my application was too
slow, I would use a profiler to see where the time was being
spent. If it looked like a faster sort algorithm would have
a significant effect on performance, then I would look
around for a good implementation of radix sort.&#X201D;</div>
<td width=130 valign="top">