Numpy Book 1
Release 2011
Pauli Virtanen
1 Advanced Numpy
1.1 The Plan
4.4 Integer indexing
4.5 Integer indexing, simple
4.6 Integer indexing + slices
4.7 Integer indexing + slices
4.8 Windows to data
6 Summary
7 Exercises
7.1 Setup
7.2 Warming up
7.3 Broadcasting
7.4 Fancy indexing
7.5 Structured data types
7.6 Advanced
Numpy tutorial, Release 2011
CHAPTER
ONE
ADVANCED NUMPY
Pauli Virtanen
CHAPTER
TWO
2.1 It’s...
ndarray =
• block of memory
• how to interpret an element
• how to locate an element
typedef struct PyArrayObject {
    /* Block of memory */
    char *data;
    /* Indexing scheme */
    int nd;
    npy_intp *dimensions;
    npy_intp *strides;
    /* + other stuff */
} PyArrayObject;
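The same three ingredients are visible from Python. The following is a minimal sketch, not taken from the original text: the array `x` and the chosen element `(i, j)` are illustrative assumptions. It reads the dtype (how to interpret an element) and the shape and strides (how to locate an element), then computes a byte offset by hand.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(x.dtype)    # how to interpret an element: int32, 4 bytes each
print(x.shape)    # (2, 3)
print(x.strides)  # (12, 4): bytes to the next row, bytes to the next column

# Byte offset of element (i, j), counted from the start of the memory block
i, j = 1, 2
offset = i * x.strides[0] + j * x.strides[1]
print(offset)     # 20

# Reading one int32 at that offset from a copy of the raw bytes gives x[i, j]
value = np.frombuffer(x.tobytes(), dtype=x.dtype, count=1, offset=offset)[0]
print(value)      # 6
```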
• Memory address
>>> x.__array_interface__['data'][0]
140507238089520
2.4 Flags
>>> x = '1'
>>> y = np.frombuffer(x, dtype=np.int8)
>>> y.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : False
ALIGNED : True
UPDATEIFCOPY : False
• The owndata and writeable flags indicate status of the memory block.
• Some flags can be changed.
>>> y.flags.writeable = True
• A mathematical detour.
• Byte order:
>>> np.frombuffer('\x01\x02', dtype='>i2') # Big-endian
array([258], dtype=int16)
>>> 1 * 2**8 + 2 * 2**0
258
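For comparison, the same two bytes can be read with either byte order. This is a small sketch that assumes Python 3, where `np.frombuffer` needs a bytes object rather than a text string; the variable names are illustrative.

```python
import numpy as np

raw = b"\x01\x02"

big = np.frombuffer(raw, dtype=">i2")     # 1 * 2**8 + 2 * 2**0
little = np.frombuffer(raw, dtype="<i2")  # 1 * 2**0 + 2 * 2**8

print(big[0], little[0])  # 258 513
```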
>>> z.astype(int)
array([1, 2, 3, 4])
>>> x = np.array([100]).astype('S2').astype(int)
>>> x
array([10])
>>> x[1] = 5
>>> y
array([ 1281, 1027], dtype=int16)
>>> y.base is x
True
• Multidimensional array
>>> x = np.array([[1, 2], [3, 4]], dtype=np.uint8)
>>> y = x.T.copy().T
>>> x
array([[1, 2],
       [3, 4]], dtype=uint8)
>>> y
array([[1, 2],
       [3, 4]], dtype=uint8)
>>> x.view(np.int16)
array([[ 513],
       [1027]], dtype=int16)
>>> y.view(np.int16)
array([[ 769, 1026]], dtype=int16)
???
• But:
>>> x
array([[1, 2],
       [3, 4]], dtype=uint8)
>>> y
array([[1, 2],
       [3, 4]], dtype=uint8)
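One way to see why two arrays that print identically can give different views is to compare their strides and the order of their bytes in memory. The sketch below only inspects the arrays; it is an illustration added here, not part of the original session.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.uint8)
y = x.T.copy().T  # same values, but stored column by column

print(x.strides)  # (2, 1): rows are contiguous in memory
print(y.strides)  # (1, 2): columns are contiguous in memory

# Bytes in the order they are laid out in each array's memory block
print(x.tobytes(order="C"))  # b'\x01\x02\x03\x04'
print(y.tobytes(order="F"))  # b'\x01\x03\x02\x04'
```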
2.11 Indexing?
The question
>>> x = np.array([[1, 2],
...               [3, 4],
...               [5, 6]], dtype=np.int8)
>>> str(x.data)
'\x01\x02\x03\x04\x05\x06'
• simple, flexible
>>> y = x[2:]
>>> y.__array_interface__['data'][0] - x.__array_interface__['data'][0]
8
Bad:
>>> x_diag = as_strided(x, shape=(3e6,), strides=((3+1)*x.itemsize,))
>>> x_diag += 9
Segmentation fault (core dumped)
Even worse:
>>> x_diag = as_strided(x, shape=(4,), strides=((3+1)*x.itemsize,))
>>> x_diag += 9
>>> # <-- No segmentation fault!
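For contrast, here is a sketch of the same diagonal trick with a shape that stays inside the memory block. It assumes a 3 x 3 array `x`, which is an illustrative choice; the last element touched is `x[2, 2]`, so nothing outside the array is read or written.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=np.int32)

n = x.shape[0]
# Stepping n + 1 elements at a time moves one row down and one column right
x_diag = as_strided(x, shape=(n,), strides=((n + 1) * x.itemsize,))
print(x_diag)  # [1 5 9]
```

Checking that `shape` and `strides` cannot reach past the end of the block is the caller's job: `as_strided` does not check it.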
CHAPTER
THREE
EVERYDAY FEATURES:
BROADCASTING
3.1 Scalars
3.2 Arrays?
>>> c = a + b
>>> c
array([[11, 22, 33],
       [41, 52, 63]])
c_ij = a_ij + b_j
Shape arithmetic:
(2, 3)         (2, 3)
(1, 3)         (3,)       <-- behaves as scalar for axis=0
--------       --------
(2, 3)         (2, 3)

(3, 4, 5)      (3, 4, 5)
(3, 1, 5)      (4, 5)
---------      ---------
(3, 4, 5)      (3, 4, 5)
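The shape arithmetic can be checked directly. This is a small sketch using arrays of zeros, chosen here only for their shapes.

```python
import numpy as np

a = np.zeros((2, 3))
print((a + np.zeros((1, 3))).shape)  # (2, 3)
print((a + np.zeros((3,))).shape)    # (2, 3)

c = np.zeros((3, 4, 5))
print((c + np.zeros((3, 1, 5))).shape)  # (3, 4, 5)
print((c + np.zeros((4, 5))).shape)     # (3, 4, 5)

# The result shape can also be queried without computing the sum
result_shape = np.broadcast(c, np.zeros((4, 5))).shape
print(result_shape)  # (3, 4, 5)
```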
• np.ix_
• Tensor operations
Example: many matrix products for small matrices
>>> R = np.random.rand(3, 3, 2000) # 2000 of 3x3 matrices
>>> Z = np.random.rand(3, 3, 2000)
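One possible way to form all 2000 products at once is sketched below, first with broadcasting and a sum, then with `np.einsum` as a cross-check. The result name `P` and the seeded random generator are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
R = rng.rand(3, 3, 2000)  # 2000 of 3x3 matrices
Z = rng.rand(3, 3, 2000)

# P[i, k, n] = sum over j of R[i, j, n] * Z[j, k, n]
# Axes after broadcasting are (i, j, k, n); summing over axis 1 removes j.
P = (R[:, :, np.newaxis, :] * Z[np.newaxis, :, :, :]).sum(axis=1)

# The same contraction written with einsum
P_einsum = np.einsum("ijn,jkn->ikn", R, Z)

print(P.shape)                   # (3, 3, 2000)
print(np.allclose(P, P_einsum))  # True
```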
• They’re views?
>>> x[0,0] = -1
>>> x2
array([[-1, 20, 30, 40],
       [-1, 20, 30, 40],
       [-1, 20, 30, 40]])
• Strides?
>>> ...
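A sketch of what the strides of broadcast arrays look like, assuming `x` and `y` are small 1-D int64 arrays (the original definitions are not shown here, so these are illustrative). A stride of 0 means that stepping along that axis stays on the same memory, which is how the data is repeated without being copied.

```python
import numpy as np

x = np.array([10, 20, 30, 40], dtype=np.int64)
y = np.array([100, 200, 300], dtype=np.int64)

x2, y2 = np.broadcast_arrays(x, y[:, np.newaxis])

print(x2.shape, x2.strides)  # (3, 4) (0, 8)
print(y2.shape, y2.strides)  # (3, 4) (8, 0)

# x2 is a view of x: changing x shows up in every "row" of x2
x[0] = -1
print(x2[:, 0])  # [-1 -1 -1]
```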
CHAPTER
FOUR
• Assignment works:
>>> a[a > 2] = -1
>>> a
array([ 1, 2, -1, -1])
• 1-D masks
>>> a = np.arange(4*5).reshape(4, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
• Extract rows
>>> a[np.array([True,False,False,True])]
array([[ 0,  1,  2,  3,  4],
       [15, 16, 17, 18, 19]])
• Extract columns
• Or this:
>>> b = np.zeros_like(a)
>>> b[mask] = a[mask]
>>> b
array([[ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0, 97, 98, 99]])
• In a nutshell:
a = 2-dim array
p = integer array of shape (M, N, K)
q = integer array of shape (M, N, K)
b = a[p, q]
produces b:
b.shape == (M, N, K)
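The rule can be tried out on a small case. In this sketch the sizes M, N, K, the array `a`, and the random index arrays are arbitrary choices for illustration.

```python
import numpy as np

M, N, K = 2, 3, 4
rng = np.random.RandomState(0)

a = np.arange(5 * 6).reshape(5, 6)      # a 2-dim array
p = rng.randint(0, 5, size=(M, N, K))   # row indices
q = rng.randint(0, 6, size=(M, N, K))   # column indices

b = a[p, q]
print(b.shape)  # (2, 3, 4)

# Element by element: b[m, n, k] == a[p[m, n, k], q[m, n, k]]
print(b[1, 2, 3] == a[p[1, 2, 3], q[1, 2, 3]])  # True
```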
• Pick diagonal:
>>> i = np.arange(3)
>>> a[i,i]
array([1, 5, 9])
• Pick columns:
• Higher dimensions...
>>> a = np.arange(3*4*5).reshape(3,4,5)
>>> i = np.array([0, 1])
>>> j = np.array([1, 2])
>>> a[:,i,j][:,0]
array([ 1, 21, 41])
>>> a[:,i[0],j[0]]
array([ 1, 21, 41])
OK...
>>> a[i,:,j][:,0]
array([ 1, 22])
>>> a[i[0],:,j[0]]
array([ 1, 6, 11, 16])
What?
• That is:
a = 4-dim array of shape (p, q, r, s)
II = integer array of shape (M, N, K)
JJ = integer array of shape (M, N, K)
produces b, c:
b.shape == (p, M, N, K, q)
c.shape == (M, N, K, p, q)
• Fancy indices are next to each other: fancy axes go to the same position
• Otherwise, fancy axes go first
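The two placement rules can be checked on concrete shapes. The indexing expressions below, with the fancy indices either adjacent or separated by a slice, are assumptions picked for this sketch; all sizes are arbitrary.

```python
import numpy as np

a = np.zeros((2, 3, 4, 5))            # a 4-dim array
II = np.zeros((6, 7, 8), dtype=int)   # integer index array
JJ = np.zeros((6, 7, 8), dtype=int)   # integer index array, same shape

# Fancy indices next to each other: their axes replace axes 1 and 2 in place
adjacent = a[:, II, JJ, :]
print(adjacent.shape)   # (2, 6, 7, 8, 5)

# Fancy indices separated by a slice: their axes go first
separated = a[II, :, JJ, :]
print(separated.shape)  # (6, 7, 8, 3, 5)
```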
Pick the largest value from each row of a 2-D array, together with its two neighbors. (Produce an N x 3 array of results,
marking ‘missing’ data with -1.)
Some “data”:
>>> a = np.random.zipf(1.3, size=(10, 5))
>>> a
array([[    1,  1339,   113,     1,     3],
       [    3,    27,    63,     6,     1],
       [    3,    14,     1,     1,     2],
       [ 1046,     1,     1,    66,     1],
       [   14,     2,     9,     1, 39633],
       [    4,   136,   258,    27,     1],
       [  661,    11,   313,     4,     1],
       [   55,    55,     1,    13,    72],
       [    1,     5,  1027,    12,   134],
       [  214,    11,     3,   274,     1]])
Locate maximum:
>>> j_max = np.argmax(a, axis=1)
>>> i, j = np.broadcast_arrays(i, j)
>>> i.shape
(10, 3)
>>> j.shape
(10, 3)
Result array:
Fancy stuff:
>>> b[mask] = a[i[mask], j[mask]]
Result:
>>> a
array([[    1,  1339,   113,     1,     3],
       [    3,    27,    63,     6,     1],
       [    3,    14,     1,     1,     2],
       [ 1046,     1,     1,    66,     1],
       [   14,     2,     9,     1, 39633],
       [    4,   136,   258,    27,     1],
       [  661,    11,   313,     4,     1],
       [   55,    55,     1,    13,    72],
       [    1,     5,  1027,    12,   134],
       [  214,    11,     3,   274,     1]])
>>> b
array([[    1,  1339,   113],
       [   27,    63,     6],
       [    3,    14,     1],
       [   -1,  1046,     1],
       [    1, 39633,    -1],
       [  136,   258,    27],
       [   -1,   661,    11],
       [   13,    72,    -1],
       [    5,  1027,    12],
       [    3,   274,     1]])
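The whole recipe can be put together in one piece. This is a sketch under assumptions: the way `i`, `j` and `mask` are built here is one plausible reading of the steps, and the three-row input is taken from rows 0, 3 and 4 of the data so the result can be compared with the table.

```python
import numpy as np

a = np.array([[1, 1339, 113, 1, 3],
              [1046, 1, 1, 66, 1],
              [14, 2, 9, 1, 39633]])

# Column of the maximum in each row
j_max = np.argmax(a, axis=1)

# Row indices (N, 1) and the columns left of, at, and right of the maximum (N, 3)
i = np.arange(a.shape[0])[:, np.newaxis]
j = j_max[:, np.newaxis] + np.array([-1, 0, 1])
i, j = np.broadcast_arrays(i, j)

# Neighbors that fall outside the array are marked as missing
mask = (j >= 0) & (j < a.shape[1])

# Result array, pre-filled with the 'missing' marker
b = np.full(i.shape, -1, dtype=a.dtype)
b[mask] = a[i[mask], j[mask]]
print(b)
```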
CHAPTER
FIVE
• Beware:
>>> assert y.flags.c_contiguous
CHAPTER
SIX
SUMMARY
• Internals
Indexing, slicing, strides, etc.
• Broadcasting
• Fancy indexing
• Structured arrays
CHAPTER
SEVEN
EXERCISES
7.1 Setup
7.2 Warming up
1. Create a 5x6 Numpy array containing random numbers in range [0, 1].
• Compute the mean of all the numbers in it
(To find the function to do this: np.lookfor("mean of array"))
• Compute the minimum value in each row, and maximum in each column
• Multiply each element by 10 and convert to an integer with the .astype() method.
What is the difference between a.astype(int) and np.around(a)?
2. Compare:
np.array([1, 2, 3, 4]) / 2
np.array([1.0, 2, 3, 4]) / 2
np.array([1, 2, 3, 4]) // 2
np.array([1.0, 2, 3, 4]) // 2
a[:,[0,1]]
a[:,0:2]
a[0]
a.T
a[[True, False]]
to change array:
a = np.array([1, 2, 3, 4, 5, 6])
to:
array([1, 7, 8, 9, 5, 6])
1. Consider:
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int16)
What do the following operations do, and what are the resulting strides:
a
a.T
a[::-1]
Which of the above are C-contiguous (and what does that mean)?
7.3 Broadcasting
Tips
• np.newaxis
Write a function f(a, b, c) that returns a^b − c. Generate a shape (24, 12, 6) array containing the values
f(a_i, b_j, c_k) at points a_i, b_j and c_k forming a grid in the unit cube [0, 1] x [0, 1] x [0, 1].
Approximate the 3-d integral
$\int_0^1 \int_0^1 \int_0^1 (a^b - c) \, da \, db \, dc$
over this volume with the mean of the values. The exact result is log(2) − 1/2 ≈ 0.1931. How close do you get?
Try also using np.mgrid instead of broadcasting. Is there a speed difference? How about ogrid with
broadcast_arrays?
Tips
• You can make np.ogrid give a number of points in given ranges with the syntax
np.ogrid[a:b:20j, c:d:10j].
• You can use %timeit in IPython to check timings
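As an illustration of the np.ogrid syntax, here is a sketch on a 2-d grid. The ranges and point counts are arbitrary, and the names `u`, `v`, `U`, `V` are made up for this example.

```python
import numpy as np

# A complex "step" means a number of points, endpoints included
u, v = np.ogrid[0:1:20j, 0:1:10j]
print(u.shape, v.shape)  # (20, 1) (1, 10)

# The open grid broadcasts to the full grid only when it is combined
print((u + v).shape)     # (20, 10)

# np.mgrid builds the full arrays up front
U, V = np.mgrid[0:1:20j, 0:1:10j]
print(U.shape, V.shape)  # (20, 10) (20, 10)
```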
2. Generate a 10 x 3 array of random numbers (in range [0,1]). From each row, pick the number closest to 0.75.
Tips
• Make use of np.abs and np.argmin to find the column j closest for each row.
• Use fancy integer indexing to extract the numbers. Remember that in a[i,j] the index array i must
correspond to j.
Design a structured data type suitable for the data (in words.txt):
% rank  lemma (10 letters max)  frequency  dispersion
21      they                    1865844    0.96
42      her                     969591     0.91
49      as                      829018     0.95
7       to                      6332195    0.98
63      take                    670745     0.97
14      you                     3085642    0.92
35      go                      1151045    0.93
56      think                   772787     0.91
28      not                     1638883    0.98
Load the data from the text file. Examine the data you got, for example: extract words only, extract the 3rd row, print
all words with rank < 30.
Sort the data according to frequency. Save the result to a Numpy data file sorted.npz with np.savez and load
back with np.load. Do you get back what you put in?
Save the result to a text file sorted.txt using np.savetxt. Here, you need to provide a fmt argument to
savetxt.
Tips
• See the documentation of the .sort() method: help(np.ndarray.sort)
• For structured arrays, savetxt needs a fmt argument that tells it what to do.
fmt is a string. For example "%s %d %g" tells that the first field is to be formatted as a string, the
second as an integer, and the third as a float.
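To show what such a fmt string does, here is a sketch with a small made-up structured array. The field names, their order, and the file name are hypothetical and are not the layout the exercise asks for.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: a string, an integer and a float field
dt = np.dtype([("name", "U10"), ("count", np.int64), ("score", np.float64)])
data = np.array([("they", 21, 0.96), ("her", 42, 0.91)], dtype=dt)

# One format specifier per field, in field order
path = os.path.join(tempfile.mkdtemp(), "example.txt")
np.savetxt(path, data, fmt="%s %d %g")

with open(path) as f:
    text = f.read()
print(text)
```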
The .wav audio files are binary files: they contain a fixed-size header followed by raw sound data.
Construct a Numpy structured data type describing the .wav file header, and use it to read the header. Print for
example the sample rate and number of channels. (A test.wav is provided so you can try things out on that.)
Tips
• You can read a binary structure described by some_dtype to a Numpy array with:
with open('test.wav', 'rb') as f:
    data = np.fromfile(f, dtype=some_dtype, count=1)
Byte #  Field            Type
0       chunk_id         4-byte string ("RIFF")
4       chunk_size       4-byte uint (little-endian)
8       format           4-byte string ("WAVE")
12      fmt_id           4-byte string ("fmt ")
16      fmt_size         4-byte uint (little-endian)
20      audio_fmt        2-byte uint (little-endian)
22      num_channels     2-byte uint (little-endian)
24      sample_rate      4-byte uint (little-endian)
28      byte_rate        4-byte uint (little-endian)
32      block_align      2-byte uint (little-endian)
34      bits_per_sample  2-byte uint (little-endian)
36      data_id          4-byte string ("data")
40      data_size        4-byte uint (little-endian)
• data_size bytes of actual sound data follow
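The reading step can be tried on a toy file first. In this sketch the two-field header, the file name, and the payload are all invented for illustration; it is deliberately not the .wav layout.

```python
import os
import tempfile

import numpy as np

# Hypothetical 8-byte header: a 4-byte tag, then a little-endian 4-byte uint
toy_dtype = np.dtype([("tag", "S4"), ("size", "<u4")])

# Write a small binary file: header followed by some payload bytes
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
with open(path, "wb") as f:
    f.write(b"TOY!" + (1234).to_bytes(4, "little") + b"payload")

# Read exactly one header record from the start of the file
with open(path, "rb") as f:
    header = np.fromfile(f, dtype=toy_dtype, count=1)

print(header["tag"][0], header["size"][0])  # b'TOY!' 1234
```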
7.6 Advanced
Reimplement array indexing (for 2-D, without using Numpy)! Write a function data_at_index(indices,
data, strides, dtype) that returns the data corresponding to a specified array element, as a string of bytes.
I.e.:
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int16)
b = a.T
without making copies (so that it is fast). The trick is a stride trick:
from numpy.lib.stride_tricks import as_strided
strides = ...
y = as_strided(x, shape=(8, 3), strides=strides)
2. Use the same trick to compute the 5 x 5 median filter of an image. For each pixel, compute the median of the
5 x 5 block of pixels surrounding it.
The median filter denoises an image much as a Gaussian blur does, but it preserves sharp edges better.
>>> import scipy
>>> import matplotlib.pyplot as plt
Noisy image
>>> img = scipy.lena() # A standard test image for image processing
>>> img += 0.8 * img.std() * np.random.rand(*img.shape)
>>> plt.imshow(img)
[Figure: the noisy image]
>>> window_size = 5
>>> shape = ... # Careful, no out-of-bounds access...
>>> strides = ...
Denoise!
>>> img_median = np.median(img_window.reshape(..., window_size*window_size), axis=...)
>>> plt.imshow(img_median)
>>> plt.gray()
>>> plt.imsave('sharpened.png', img_median)
Note:
• Above, the .reshape() makes a copy (why?).
• Scipy has an implementation for the median filter in scipy.ndimage, with more features.
Numpy does not yet have a rolling_window function that would make the above easier. There is, however,
a contributed implementation, discussed here:
https://github.com/numpy/numpy/pull/31
Can you extend the version posted by Warren to make N-dimensional windows, or think of any other features such a
function would need to have? (If yes, just ask me how to contribute your stuff.)
Generate an approximation to the Menger sponge by creating a 3-D Numpy array filled with 1, and drilling holes in it
with slicing.
Tips:
• Use dtype np.int8 so you don’t eat all memory
• Power-of-3 size cube works best, e.g., 81 x 81 x 81
• You need a function to recurse to drill many levels
• s = np.s_[i:j] creates a “free” slice object: a[s] == a[i:j].
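A short sketch of how free slice objects can be stored and combined into an index. The 3 x 3 x 3 array and the particular slices are arbitrary choices for illustration, not a solution to the exercise.

```python
import numpy as np

a = np.arange(27).reshape(3, 3, 3)

s = np.s_[1:2]  # the same object as slice(1, 2, None)
print(np.array_equal(a[s], a[1:2]))  # True

# Slice objects can be collected in a tuple to index several axes at once
idx = (s, s, slice(None))  # equivalent to writing [1:2, 1:2, :]
b = a.copy()
b[idx] = 0
print(b[1, 1])  # [0 0 0]
```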
Take a 2-D slice of the sponge diagonally through the center of the cube, with normal vector (1, 1, 1). What sort
of pattern do you get in the intersection?
Spoilers:
http://www.nytimes.com/2011/06/28/science/28math-menger.html