|
| 1 | +Sorting HOW TO |
| 2 | +************** |
| 3 | + |
| 4 | +:Author: Andrew Dalke and Raymond Hettinger |
| 5 | +:Release: 0.1 |
| 6 | + |
| 7 | + |
| 8 | +Python lists have a built-in :meth:`list.sort` method that modifies the list |
| 9 | +in-place and a :func:`sorted` built-in function that builds a new sorted list |
| 10 | +from an iterable. |
| 11 | + |
| 12 | +In this document, we explore the various techniques for sorting data using Python. |
| 13 | + |
| 14 | + |
| 15 | +Sorting Basics |
| 16 | +============== |
| 17 | + |
| 18 | +A simple ascending sort is very easy: just call the :func:`sorted` function. It |
| 19 | +returns a new sorted list:: |
| 20 | + |
| 21 | + >>> sorted([5, 2, 3, 1, 4]) |
| 22 | + [1, 2, 3, 4, 5] |
| 23 | + |
| 24 | +You can also use the :meth:`list.sort` method of a list. It modifies the list |
| 25 | +in-place (and returns *None* to avoid confusion). Usually it's less convenient |
| 26 | +than :func:`sorted` - but if you don't need the original list, it's slightly |
| 27 | +more efficient. |
| 28 | + |
| 29 | + >>> a = [5, 2, 3, 1, 4] |
| 30 | + >>> a.sort() |
| 31 | + >>> a |
| 32 | + [1, 2, 3, 4, 5] |
| 33 | + |
| 34 | +Another difference is that the :meth:`list.sort` method is only defined for |
| 35 | +lists. In contrast, the :func:`sorted` function accepts any iterable. |
| 36 | + |
| 37 | + >>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'}) |
| 38 | + [1, 2, 3, 4, 5] |
| 39 | + |
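As a small illustration of the two differences described above, the following sketch (all variable names are illustrative and not part of the original text) passes a tuple, a string, and a generator to `sorted`, and shows that `list.sort` returns `None`:

```python
# sorted() accepts any iterable and always returns a new list.
from_tuple = sorted((3, 1, 2))
from_string = sorted("cab")
from_generator = sorted(n * n for n in (3, -1, 2))

# list.sort() reorders the list in place and returns None, so its
# result should not be assigned back to the list variable.
numbers = [3, 1, 2]
result = numbers.sort()
```

After this runs, `numbers` holds the sorted values while `result` is `None`.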
| 40 | +Key Functions |
| 41 | +============= |
| 42 | + |
| 43 | +Both :meth:`list.sort` and :func:`sorted` have a *key* parameter to specify a |
| 44 | +function to be called on each list element prior to making comparisons. |
| 45 | + |
| 46 | +For example, here's a case-insensitive string comparison: |
| 47 | + |
| 48 | + >>> sorted("This is a test string from Andrew".split(), key=str.lower) |
| 49 | + ['a', 'Andrew', 'from', 'is', 'string', 'test', 'This'] |
| 50 | + |
| 51 | +The value of the *key* parameter should be a function that takes a single argument |
| 52 | +and returns a key to use for sorting purposes. This technique is fast because |
| 53 | +the key function is called exactly once for each input record. |
| 54 | + |
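One way to check the "called exactly once" claim is to wrap the key function so that it records its calls. This is only a sketch; `counting_key` and `calls` are hypothetical names introduced for the demonstration:

```python
calls = []

def counting_key(word):
    """Record each call, then return the case-insensitive sort key."""
    calls.append(word)
    return word.lower()

words = "This is a test string from Andrew".split()
result = sorted(words, key=counting_key)

# One key computation per input element, however many comparisons
# the sort itself needs.
calls_per_element = len(calls) / len(words)
```

Here `calls_per_element` comes out as 1.0 for the seven input words.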
| 55 | +A common pattern is to sort complex objects using some of the object's indices |
| 56 | +as keys. For example: |
| 57 | + |
| 58 | + >>> student_tuples = [ |
| 59 | + ...     ('john', 'A', 15), |
| 60 | + ...     ('jane', 'B', 12), |
| 61 | + ...     ('dave', 'B', 10), |
| 62 | + ... ] |
| 63 | + >>> sorted(student_tuples, key=lambda student: student[2]) # sort by age |
| 64 | + [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 65 | + |
| 66 | +The same technique works for objects with named attributes. For example: |
| 67 | + |
| 68 | + >>> class Student: |
| 69 | + ...     def __init__(self, name, grade, age): |
| 70 | + ...         self.name = name |
| 71 | + ...         self.grade = grade |
| 72 | + ...         self.age = age |
| 73 | + ...     def __repr__(self): |
| 74 | + ...         return repr((self.name, self.grade, self.age)) |
| 75 | + |
| 76 | + >>> student_objects = [ |
| 77 | + ...     Student('john', 'A', 15), |
| 78 | + ...     Student('jane', 'B', 12), |
| 79 | + ...     Student('dave', 'B', 10), |
| 80 | + ... ] |
| 81 | + >>> sorted(student_objects, key=lambda student: student.age) # sort by age |
| 82 | + [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 83 | + |
| 84 | +Operator Module Functions |
| 85 | +========================= |
| 86 | + |
| 87 | +The key-function patterns shown above are very common, so Python provides |
| 88 | +convenience functions to make accessor functions easier and faster. The |
| 89 | +:mod:`operator` module has :func:`operator.itemgetter`, |
| 90 | +:func:`operator.attrgetter`, and :func:`operator.methodcaller`. |
| 91 | + |
| 92 | +Using those functions, the above examples become simpler and faster: |
| 93 | + |
| 94 | + >>> from operator import itemgetter, attrgetter |
| 95 | + |
| 96 | + >>> sorted(student_tuples, key=itemgetter(2)) |
| 97 | + [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 98 | + |
| 99 | + >>> sorted(student_objects, key=attrgetter('age')) |
| 100 | + [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 101 | + |
| 102 | +The operator module functions allow multiple levels of sorting. For example, to |
| 103 | +sort by *grade* then by *age*: |
| 104 | + |
| 105 | + >>> sorted(student_tuples, key=itemgetter(1,2)) |
| 106 | + [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)] |
| 107 | + |
| 108 | + >>> sorted(student_objects, key=attrgetter('grade', 'age')) |
| 109 | + [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)] |
| 110 | + |
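The section names :func:`operator.methodcaller` but does not show it in use. A minimal sketch, with a made-up `messages` list, that sorts strings by the number of exclamation marks each one contains:

```python
from operator import methodcaller

# Hypothetical sample data, not taken from the student examples.
messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']

# methodcaller('count', '!') builds a callable equivalent to
# lambda s: s.count('!')
by_urgency = sorted(messages, key=methodcaller('count', '!'))
```

The result lists `'standby'` first (no exclamation marks) and `'critical!!!'` last (three).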
| 111 | +Ascending and Descending |
| 112 | +======================== |
| 113 | + |
| 114 | +Both :meth:`list.sort` and :func:`sorted` accept a *reverse* parameter with a |
| 115 | +boolean value. This is used to flag descending sorts. For example, to get the |
| 116 | +student data in reverse *age* order: |
| 117 | + |
| 118 | + >>> sorted(student_tuples, key=itemgetter(2), reverse=True) |
| 119 | + [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)] |
| 120 | + |
| 121 | + >>> sorted(student_objects, key=attrgetter('age'), reverse=True) |
| 122 | + [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)] |
| 123 | + |
| 124 | +Sort Stability and Complex Sorts |
| 125 | +================================ |
| 126 | + |
| 127 | +Sorts are guaranteed to be `stable |
| 128 | +<http://en.wikipedia.org/wiki/Sorting_algorithm#Stability>`_\. That means that |
| 129 | +when multiple records have the same key, their original order is preserved. |
| 130 | + |
| 131 | + >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)] |
| 132 | + >>> sorted(data, key=itemgetter(0)) |
| 133 | + [('blue', 1), ('blue', 2), ('red', 1), ('red', 2)] |
| 134 | + |
| 135 | +Notice how the two records for *blue* retain their original order so that |
| 136 | +``('blue', 1)`` is guaranteed to precede ``('blue', 2)``. |
| 137 | + |
| 138 | +This wonderful property lets you build complex sorts in a series of sorting |
| 139 | +steps. For example, to sort the student data by descending *grade* and then |
| 140 | +ascending *age*, do the *age* sort first and then sort again using *grade*: |
| 141 | + |
| 142 | + >>> s = sorted(student_objects, key=attrgetter('age')) # sort on secondary key |
| 143 | + >>> sorted(s, key=attrgetter('grade'), reverse=True) # now sort on primary key, descending |
| 144 | + [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 145 | + |
| 146 | +The `Timsort <http://en.wikipedia.org/wiki/Timsort>`_ algorithm used in Python |
| 147 | +does multiple sorts efficiently because it can take advantage of any ordering |
| 148 | +already present in a dataset. |
| 149 | + |
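The multi-pass approach can be wrapped in a small helper. The sketch below is one possible way to do it, not something defined by the text above: `multisort`, `StudentRecord`, and the `specs` argument are hypothetical names, and a namedtuple stands in for the `Student` class.

```python
from collections import namedtuple
from operator import attrgetter

# Stand-in record type for the demonstration.
StudentRecord = namedtuple('StudentRecord', ['name', 'grade', 'age'])

def multisort(records, specs):
    """Sort by several attributes.

    specs is a list of (attribute, reverse) pairs ordered from the
    primary key down to the least significant key.
    """
    result = list(records)
    # Sort on the least significant key first; stability keeps that
    # order intact wherever a later, more significant key ties.
    for attribute, reverse in reversed(specs):
        result.sort(key=attrgetter(attribute), reverse=reverse)
    return result

students = [
    StudentRecord('john', 'A', 15),
    StudentRecord('jane', 'B', 12),
    StudentRecord('dave', 'B', 10),
]
# Descending grade, then ascending age.
ordered = multisort(students, [('grade', True), ('age', False)])
```

With this data the names come out as dave, jane, john, matching the two-step example above.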
| 150 | +The Old Way Using Decorate-Sort-Undecorate |
| 151 | +========================================== |
| 152 | + |
| 153 | +This idiom is called Decorate-Sort-Undecorate after its three steps: |
| 154 | + |
| 155 | +* First, the initial list is decorated with new values that control the sort order. |
| 156 | + |
| 157 | +* Second, the decorated list is sorted. |
| 158 | + |
| 159 | +* Finally, the decorations are removed, creating a list that contains only the |
| 160 | + initial values in the new order. |
| 161 | + |
| 162 | +For example, to sort the student data by *grade* using the DSU approach: |
| 163 | + |
| 164 | + >>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)] |
| 165 | + >>> decorated.sort() |
| 166 | + >>> [student for grade, i, student in decorated] # undecorate |
| 167 | + [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)] |
| 168 | + |
| 169 | +This idiom works because tuples are compared lexicographically; the first items |
| 170 | +are compared; if they are the same then the second items are compared, and so |
| 171 | +on. |
| 172 | + |
| 173 | +It is not strictly necessary in all cases to include the index *i* in the |
| 174 | +decorated list, but including it gives two benefits: |
| 175 | + |
| 176 | +* The sort is stable -- if two items have the same key, their order will be |
| 177 | + preserved in the sorted list. |
| 178 | + |
| 179 | +* The original items do not have to be comparable because the ordering of the |
| 180 | + decorated tuples will be determined by at most the first two items. So for |
| 181 | + example the original list could contain complex numbers which cannot be sorted |
| 182 | + directly. |
| 183 | + |
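The second benefit can be seen with the complex-number case mentioned above. In this sketch the `values` list and the choice of magnitude as the sort key are assumptions made for the example:

```python
values = [3 + 4j, 1 + 0j, 0 + 1j, 2 + 0j]

# Decorate each value with (magnitude, original index, value).
decorated = [(abs(z), i, z) for i, z in enumerate(values)]
decorated.sort()  # ties on magnitude are settled by the index
by_magnitude = [z for _, _, z in decorated]  # undecorate

# Without the index, the tie between 1+0j and 1j falls through to
# comparing the complex numbers themselves, which raises TypeError.
try:
    sorted((abs(z), z) for z in values)
    tie_failed = False
except TypeError:
    tie_failed = True
```

The two values of magnitude 1 keep their original relative order, and `tie_failed` ends up `True`.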
| 184 | +Another name for this idiom is |
| 185 | +`Schwartzian transform <http://en.wikipedia.org/wiki/Schwartzian_transform>`_\, |
| 186 | +after Randal L. Schwartz, who popularized it among Perl programmers. |
| 187 | + |
| 188 | +Now that Python sorting provides key-functions, this technique is not often needed. |
| 189 | + |
| 190 | + |
| 191 | +The Old Way Using the *cmp* Parameter |
| 192 | +===================================== |
| 193 | + |
| 194 | +Many constructs given in this HOWTO assume Python 2.4 or later. Before that, |
| 195 | +there was no :func:`sorted` builtin and :meth:`list.sort` took no keyword |
| 196 | +arguments. Instead, all of the Py2.x versions supported a *cmp* parameter to |
| 197 | +handle user-specified comparison functions. |
| 198 | + |
| 199 | +In Py3.0, the *cmp* parameter was removed entirely (as part of a larger effort to |
| 200 | +simplify and unify the language, eliminating the conflict between rich |
| 201 | +comparisons and the :meth:`__cmp__` magic method). |
| 202 | + |
| 203 | +In Py2.x, sort accepted an optional function that it called to perform the |
| 204 | +comparisons. That function should take two arguments to be compared and then |
| 205 | +return a negative value for less-than, return zero if they are equal, or return |
| 206 | +a positive value for greater-than. For example, we can do: |
| 207 | + |
| 208 | + >>> def numeric_compare(x, y): |
| 209 | + ...     return x - y |
| 210 | + >>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare) |
| 211 | + [1, 2, 3, 4, 5] |
| 212 | + |
| 213 | +Or you can reverse the order of comparison with: |
| 214 | + |
| 215 | + >>> def reverse_numeric(x, y): |
| 216 | + ...     return y - x |
| 217 | + >>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric) |
| 218 | + [5, 4, 3, 2, 1] |
| 219 | + |
| 220 | +When porting code from Python 2.x to 3.x, you may have a user-supplied |
| 221 | +comparison function that needs to be converted to a key function. The |
| 222 | +following wrapper makes that easy to do:: |
| 223 | + |
| 224 | + def cmp_to_key(mycmp): |
| 225 | +     'Convert a cmp= function into a key= function' |
| 226 | +     class K(object): |
| 227 | +         def __init__(self, obj, *args): |
| 228 | +             self.obj = obj |
| 229 | +         def __lt__(self, other): |
| 230 | +             return mycmp(self.obj, other.obj) < 0 |
| 231 | +         def __gt__(self, other): |
| 232 | +             return mycmp(self.obj, other.obj) > 0 |
| 233 | +         def __eq__(self, other): |
| 234 | +             return mycmp(self.obj, other.obj) == 0 |
| 235 | +         def __le__(self, other): |
| 236 | +             return mycmp(self.obj, other.obj) <= 0 |
| 237 | +         def __ge__(self, other): |
| 238 | +             return mycmp(self.obj, other.obj) >= 0 |
| 239 | +         def __ne__(self, other): |
| 240 | +             return mycmp(self.obj, other.obj) != 0 |
| 241 | +     return K |
| 242 | + |
| 243 | +To convert to a key function, just wrap the old comparison function: |
| 244 | + |
| 245 | + >>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric)) |
| 246 | + [5, 4, 3, 2, 1] |
| 247 | + |
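Python 2.7 and Python 3.2 onward also ship an equivalent helper in the standard library, `functools.cmp_to_key`, so on those versions the wrapper does not need to be copied by hand. A brief sketch that redefines the `reverse_numeric` comparison from above so the block is self-contained:

```python
from functools import cmp_to_key

def reverse_numeric(x, y):
    """Old-style comparison: a positive result means x sorts after y."""
    return y - x

# cmp_to_key wraps the two-argument comparison in a key object.
result = sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
```

This gives the same descending order as the hand-written wrapper.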
| 248 | + |
| 249 | +Odds and Ends |
| 250 | +============= |
| 251 | + |
| 252 | +* For locale-aware sorting, use :func:`locale.strxfrm` for a key function or |
| 253 | + :func:`locale.strcoll` for a comparison function. |
| 254 | + |
| 255 | +* The *reverse* parameter still maintains sort stability (i.e. records with |
| 256 | + equal keys retain the original order). Interestingly, that effect can be |
| 257 | + simulated without the parameter by using the builtin :func:`reversed` function |
| 258 | + twice: |
| 259 | + |
| 260 | + >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)] |
| 261 | + >>> assert sorted(data, reverse=True) == list(reversed(sorted(reversed(data)))) |
| 262 | + |
| 263 | +* The sort routines are guaranteed to use :meth:`__lt__` when making comparisons |
| 264 | + between two objects. So, it is easy to add a standard sort order to a class by |
| 265 | + defining an :meth:`__lt__` method:: |
| 266 | + |
| 267 | + >>> Student.__lt__ = lambda self, other: self.age < other.age |
| 268 | + >>> sorted(student_objects) |
| 269 | + [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 270 | + |
| 271 | +* Key functions need not depend directly on the objects being sorted. A key |
| 272 | + function can also access external resources. For instance, if the student grades |
| 273 | + are stored in a dictionary, they can be used to sort a separate list of student |
| 274 | + names: |
| 275 | + |
| 276 | + >>> students = ['dave', 'john', 'jane'] |
| 277 | + >>> newgrades = {'john': 'F', 'jane': 'A', 'dave': 'C'} |
| 278 | + >>> sorted(students, key=newgrades.__getitem__) |
| 279 | + ['jane', 'dave', 'john'] |