episodes/optimisation-numpy.md
82 additions & 1 deletion
@@ -280,6 +280,87 @@ python_map: 7.94ms
 numpy_vectorize: 7.80ms
 ```
 
+## Other libraries that use NumPy
+
+Across the scientific Python software ecosystem, [many domain-specific packages](https://numpy.org/#:~:text=ECOSYSTEM) are built on top of NumPy arrays.
+Similar to the demos above, we can often gain significant performance boosts by using these libraries well.
+
+::::::::::::::::::::::::::::::::::::: challenge
+
+Take a look at the [list of libraries on the NumPy website](https://numpy.org/#:~:text=ECOSYSTEM). Are you using any of them already?
+
+If you’ve brought a project you want to work on: Are there areas of the project where you might benefit from adapting one of these libraries instead of writing your own code from scratch?
+
+:::::::::::::::::::::::: hint
+
+These libraries could be specific to your area of research; but they could also include packages from other fields that provide tools you need (e.g. statistics or machine learning)!
+
+:::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::::::::
+
+
+Which libraries you may use will depend on your research domain; here, we’ll show two examples from our own experience.
+
+### Example: Image analysis with Shapely
+
+A colleague had a large data set of images of cells. She had already reconstructed the locations of cell walls and various points of interest and needed to identify which points were located in each cell.
+To do this, she used the [Shapely](https://github.com/shapely/shapely) geometry library.
+# manually loop over all points, check if polygon contains that point
+out_points = []
+for i in range(n_points):
+    current_point = points.iloc[i, :]
+    if current_polygon.contains(current_point["geometry"]):
+        out_points.append(current_point.name)
+
+points_per_polygon[polygon_idx] = out_points
+```
+
+For about 500k points and 1000 polygons, the initial version of the code took about 20 hours to run.
+
+Luckily, Shapely is built on top of NumPy, so she was able to apply functions to an array of points instead and wrote an improved version, which took just 20 minutes:
+TODO: add a bit more explanation for instructors here
+
+Maybe also add an example image for illustration?
+
+::::::::::::::::::::::::::::::::::::::::::::::::
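As a rough illustration of the loop-versus-array idea in the Shapely example, here is a minimal sketch. It is not the colleague's actual code: it assumes Shapely 2.0 or later (where `shapely.points` and `shapely.contains` accept NumPy arrays), and the names `cell` and `xy` and the toy coordinates are made up for this illustration.

```python
import numpy as np
import shapely

# Hypothetical data: one square "cell" and four points of interest.
cell = shapely.Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
xy = np.array([[1.0, 1.0], [3.0, 1.0], [0.5, 1.5], [-1.0, 0.5]])

# Loop version: create and test one point at a time in Python.
inside_loop = []
for i in range(len(xy)):
    if cell.contains(shapely.Point(xy[i, 0], xy[i, 1])):
        inside_loop.append(i)

# Array version: create all points at once, then test them in a single call.
points = shapely.points(xy)            # array of Point geometries
mask = shapely.contains(cell, points)  # boolean NumPy array, one entry per point
inside_vectorised = np.flatnonzero(mask).tolist()

print(inside_loop, inside_vectorised)
```

Both versions select the same points; the array version moves the per-point work out of the Python interpreter, which is where savings of the kind described above would be expected to come from.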
+
+<!--
+TODO: The following example needs more work to be used by instructors other than me.
+And since it’s not a very clean example (mixes np arrays and list comprehensions) and hard to extract a nice before/after snippet, maybe it’s better not to include this example in the general course materials? Or only in a callout or instructor note?
+-->
+<!--
+### Example: Interpolating astrophysical spectra with AstroPy
+
+This is from an open-source package I’m working on, so we can look at the actual pull request where I made this change: https://github.com/SNEWS2/snewpy/pull/310
+
+→ See the first table of benchmark results. Note that using a Python `for` loop to calculate the spectrum in 100 different time bins takes 100 times as long as for a single time bin. In the vectorized version, the computing time increases much more slowly.
+
+(Note that energies were already vectorized; that’s another factor of 100 we got “for free”!)
 [Pandas](https://pandas.pydata.org/) is the most common Python package used for scientific computing when working with tabular data akin to spreadsheets (DataFrames).
@@ -426,6 +507,6 @@ If you can filter your rows before processing, rather than after, you may signif
 
 - Python is an interpreted language, which adds overhead at runtime to the execution of Python code. Many core Python and NumPy functions are implemented in faster C/C++, free from this overhead.
 - NumPy can take advantage of vectorisation to process arrays, which can greatly improve performance.
-- Pandas' data tables store columns as arrays, so operations applied to columns can take advantage of NumPy's vectorisation.
+- Many domain-specific packages are built on top of NumPy and can offer similar performance boosts.