Commit c32af9d

mention wide range of other packages built on top of NumPy, which can similarly benefit from vectorisation
add real life example(s) for illustration
1 parent 600c63e commit c32af9d

1 file changed

Lines changed: 82 additions & 1 deletion

episodes/optimisation-numpy.md

@@ -280,6 +280,87 @@ python_map: 7.94ms
280280
numpy_vectorize: 7.80ms
281281
```
282282

283+
## Other libraries that use NumPy
284+
285+
Across the scientific Python software ecosystem, [many domain-specific packages](https://numpy.org/#:~:text=ECOSYSTEM) are built on top of NumPy arrays.
286+
As in the demos above, we can often gain significant performance boosts by calling these libraries' functions on whole arrays rather than looping over individual elements in Python.
287+
288+
::::::::::::::::::::::::::::::::::::: challenge
289+
290+
Take a look at the [list of libraries on the NumPy website](https://numpy.org/#:~:text=ECOSYSTEM). Are you using any of them already?
291+
292+
If you’ve brought a project you want to work on: Are there areas of the project where you might benefit from adopting one of these libraries instead of writing your own code from scratch?
293+
294+
:::::::::::::::::::::::: hint
295+
296+
These libraries could be specific to your area of research, but they could also include packages from other fields that provide tools you need (e.g. statistics or machine learning)!
297+
298+
:::::::::::::::::::::::::::::::::
299+
300+
:::::::::::::::::::::::::::::::::::::::::::::::
301+
302+
303+
Which libraries you may use will depend on your research domain; here, we’ll show an example from our own experience.
304+
305+
### Example: Image analysis with Shapely
306+
307+
A colleague had a large data set of images of cells. She had already reconstructed the locations of cell walls and various points of interest and needed to identify which points were located in each cell.
308+
To do this, she used the [Shapely](https://github.com/shapely/shapely) geometry library.
309+
```Python
points_per_polygon = {}
for polygon_idx in range(n_polygons):
    current_polygon = polygons.iloc[polygon_idx,:]["geometry"]

    # manually loop over all points, check if polygon contains that point
    out_points = []
    for i in range(n_points):
        current_point = points.iloc[i, :]
        if current_polygon.contains(current_point["geometry"]):
            out_points.append(current_point.name)

    points_per_polygon[polygon_idx] = out_points
```
324+
325+
For about 500,000 points and 1000 polygons (roughly 5 × 10^8 individual `contains` calls), the initial version of the code took about 20 hours to run.
326+
327+
Luckily, Shapely is built on top of NumPy, so she was able to pass a whole array of points to `contains` in a single call instead of looping over the points one by one. The improved version took just 20 minutes, roughly 60 times faster:
328+
```Python
# `points_list` and `point_names_list` are assumed to be NumPy arrays
# (point geometries and point names) built once, before the loop
points_per_polygon = {}
for polygon_idx in range(n_polygons):
    current_polygon = polygons.iloc[polygon_idx,:]["geometry"]

    # vectorized: apply `contains` to the array of all points at once;
    # the result is a boolean mask with one entry per point
    points_in_polygon_idx = current_polygon.contains(points_list)
    points_in_polygon = point_names_list[points_in_polygon_idx]

    points_per_polygon[polygon_idx] = points_in_polygon.tolist()
```
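
The same loop-versus-array contrast can be reproduced without the original data set. The following is a minimal sketch, not the colleague's code: it uses plain NumPy and a single circle as a stand-in for a cell polygon, and the names `points`, `point_names`, `centre`, `radius` and both functions are hypothetical. Shapely is not used here; the sketch only illustrates how one array-level operation can replace many per-point Python checks.

```python
import numpy as np

# Hypothetical stand-in data: random (x, y) points and one circular "cell"
rng = np.random.default_rng(seed=0)
points = rng.uniform(0.0, 10.0, size=(10_000, 2))
point_names = np.arange(len(points))
centre, radius = np.array([5.0, 5.0]), 2.0


def names_inside_loop(points, point_names, centre, radius):
    """Check each point individually in a Python loop."""
    out = []
    for i in range(len(points)):
        dx, dy = points[i] - centre
        if dx * dx + dy * dy <= radius * radius:
            out.append(int(point_names[i]))
    return out


def names_inside_vectorised(points, point_names, centre, radius):
    """Check all points at once and select names with a boolean mask."""
    offsets = points - centre                           # shape (n_points, 2)
    inside = (offsets ** 2).sum(axis=1) <= radius ** 2  # boolean mask
    return point_names[inside].tolist()


loop_result = names_inside_loop(points, point_names, centre, radius)
vectorised_result = names_inside_vectorised(points, point_names, centre, radius)
```

Both functions return the same list of point names. Timing them (for example with `timeit`) is a quick way to check how much the vectorised version gains on a given machine and data size.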
340+
::::::::::::::::::::::::::::::::::::: instructor
341+
342+
TODO: add a bit more explanation for instructors here
343+
344+
Maybe also add an example image for illustration?
345+
346+
::::::::::::::::::::::::::::::::::::::::::::::::
347+
348+
<!--
349+
TODO: The following example needs more work to be used by instructors other than me.
350+
And since it’s not a very clean example (mixes np arrays and list comprehensions) and hard to extract a nice before/after snippet, maybe it’s better not to include this example in the general course materials? Or only in a callout or instructor note?
351+
-->
352+
<!--
353+
### Example: Interpolating astrophysical spectra with AstroPy
354+
355+
This is from an open-source package I’m working on, so we can look at the actual pull request where I made this change: https://github.com/SNEWS2/snewpy/pull/310
356+
357+
&rightarrow; See the first table of benchmark results. Note that using a Python `for` loop to calculate the spectrum in 100 different time bins takes 100 times as long as for a single time bin. In the vectorized version, the computing time increases much more slowly.
358+
359+
(Note that energies were already vectorized—that’s another factor of 100 we got “for free”!)
360+
361+
Code diff: https://github.com/SNEWS2/snewpy/pull/310/commits/0320b384ff22233818d07913c55c30f5968ae330
362+
-->
363+
283364
## Using Pandas (Effectively)
284365

285366
[Pandas](https://pandas.pydata.org/) is the most common Python package used for scientific computing when working with tabular data akin to spreadsheets (DataFrames).
@@ -426,6 +507,6 @@ If you can filter your rows before processing, rather than after, you may signif
426507

427508
- Python is an interpreted language, which adds overhead at runtime to the execution of Python code. Many core Python and NumPy functions are implemented in faster C/C++, free from this overhead.
428509
- NumPy can take advantage of vectorisation to process arrays, which can greatly improve performance.
429-
- Pandas' data tables store columns as arrays, therefore operations applied to columns can take advantage of NumPys vectorisation.
510+
- Many domain-specific packages are built on top of NumPy and can offer similar performance boosts.
430511

431512
::::::::::::::::::::::::::::::::::::::::::::::::
