perf: parallelize save_manifest file hashing with ThreadPoolExecutor #1295

Merged
safishamsi merged 1 commit into safishamsi:v8 from sirphilliptubell:perf/fix5
Jun 13, 2026
Conversation

@sirphilliptubell

The stat + MD5 pass over all files now fans out over a thread pool instead of running sequentially. The work is I/O-bound and releases the GIL while it waits on file opens and reads, so it scales with available disk concurrency.

On my corporate repo this reduced `save_manifest` (profiler path: module > main > main > save_manifest) from 328 s to 49 s.
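For readers who want the shape of the technique without opening the diff, here is a minimal sketch of fanning a stat + MD5 pass out over a thread pool. It is not the code merged in this PR: `md5_file`, `stat_and_hash`, `build_manifest`, the manifest fields, and the default worker count are all hypothetical names and choices made for illustration.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter(callable, sentinel) keeps calling fh.read until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stat_and_hash(path: Path) -> tuple[str, dict]:
    """Per-file unit of work: one stat call plus one full read for the hash."""
    st = path.stat()
    entry = {"mtime": st.st_mtime, "size": st.st_size, "md5": md5_file(path)}
    return str(path), entry


def build_manifest(paths, max_workers=None):
    """Hash all paths concurrently and return {path: entry}.

    Assumption: the worker count below is an arbitrary illustrative default,
    not the value used by graphify.
    """
    paths = list(paths)
    if max_workers is None:
        max_workers = min(32, (os.cpu_count() or 1) * 4)
    # Threads (not processes) are enough here: blocking file I/O releases the
    # GIL, so other workers can run while one is waiting on the disk.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the manifest is
        # deterministic regardless of which file finishes first.
        return dict(pool.map(stat_and_hash, paths))
```

Calling `build_manifest(Path("some_dir").rglob("*.py"))` would return one entry per file. Leaving the `with` block waits for all workers to finish, which is consistent with the "After" profile attributing most of the remaining `save_manifest` time to `ThreadPoolExecutor.__exit__`.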

Before:

[graphify extract] scanning C:\myrepo
[graphify extract] found 36456 code, 0 docs, 0 papers, 0 images
[graphify extract] AST extraction on 36456 code files...
  AST extraction: 100/36456 uncached files (0%) [20 workers]
  AST extraction: 200/36456 uncached files (0%) [20 workers]
<trimmed>
  AST extraction: 36300/36456 uncached files (99%) [20 workers]
  AST extraction: 36400/36456 uncached files (99%) [20 workers]
  AST extraction: 36456/36456 files (100%) [20 workers]
[graphify] Deduplicated 125510 node(s) (124857 exact, 580 fuzzy).
[graphify extract] wrote C:\myrepo\graphify-out\graph.json: 510993 nodes, 970955 edges, 29755 communities
[graphify extract] wrote C:\myrepo\graphify-out\.graphify_analysis.json
[graphify extract] next: run `graphify cluster-only C:\myrepo` to generate GRAPH_REPORT.md and name communities

  _     ._   __/__   _ _  _  _ _/_   Recorded: 14:50:25  Samples:  913401
 /_//_/// /_\ / //_// / //_'/ //     Duration: 1368.850  CPU time: 987.984
/   _/                      v5.1.2

Program: graphify .

1368.865 <module>  ..\graphify\__main__.py:1
└─ 1368.834 main  ..\graphify\__main__.py:2070
   └─ 1368.566 main  ..\graphify\__main__.py:2070
      ├─ 420.841 cluster  ..\graphify\cluster.py:86
      │  ├─ 315.731 _partition  ..\graphify\cluster.py:22
      │  │  └─ 281.434 func  networkx\utils\decorators.py:783
      │  │     └─ 281.433 argmap_louvain_communities_5  <class 'networkx.utils.decorators.argmap'> compilation 9:1
      │  │        └─ 281.433 _dispatchable._call_if_no_backends_installed  networkx\utils\backends.py:541
      │  │           └─ 281.433 louvain_communities  networkx\algorithms\community\louvain.py:14
      │  │              └─ 281.367 louvain_partitions  networkx\algorithms\community\louvain.py:133
      │  │                 └─ 247.689 _one_level  networkx\algorithms\community\louvain.py:227
      │  │                    ├─ 124.306 [self]  networkx\algorithms\community\louvain.py
      │  │                    └─ 110.314 _neighbor_weights  networkx\algorithms\community\louvain.py:335
      │  └─ 96.767 _split_community  ..\graphify\cluster.py:191
      │     └─ 91.448 _partition  ..\graphify\cluster.py:22
      │        └─ 59.034 argmap_louvain_communities_5  <class 'networkx.utils.decorators.argmap'> compilation 9:1
      │           └─ 59.020 _dispatchable._call_if_no_backends_installed  networkx\utils\backends.py:541
      │              └─ 59.019 louvain_communities  networkx\algorithms\community\louvain.py:14
      │                 └─ 58.968 louvain_partitions  networkx\algorithms\community\louvain.py:133
      │                    ├─ 26.548 _one_level  networkx\algorithms\community\louvain.py:227
      │                    └─ 14.976 argmap_modularity_19  <class 'networkx.utils.decorators.argmap'> compilation 22:1
      │                       └─ 14.968 _dispatchable._call_if_no_backends_installed  networkx\utils\backends.py:541
      │                          └─ 14.920 modularity  networkx\algorithms\community\quality.py:144
      ├─ 328.208 save_manifest  ..\graphify\detect.py:1229
      │  └─ 323.129 _md5_file  ..\graphify\detect.py:1149
      │     └─ 320.476 WindowsPath.open  pathlib\__init__.py:768
      │        └─ 320.433 open  <built-in>
      ├─ 269.743 extract  ..\graphify\extract.py:11298
      │  ├─ 107.038 _extract_parallel  ..\graphify\extract.py:11178
      │  │  └─ 102.795 as_completed  concurrent\futures\_base.py:193
      │  │        [3 frames hidden]  threading, <built-in>
      │  ├─ 83.613 _augment_symbol_resolution_edges  ..\graphify\extract.py:7832
      │  │  ├─ 60.454 _collect_js_symbol_resolution_facts  ..\graphify\extract.py:7492
      │  │  │  └─ 43.993 _walk_js_tree  ..\graphify\extract.py:7194
      │  │  └─ 22.442 _apply_symbol_resolution_facts  ..\graphify\extract.py:6958
      │  ├─ 36.282 load_cached  ..\graphify\cache.py:226
      │  │  └─ 21.613 file_hash  ..\graphify\cache.py:97
      │  └─ 20.552 _disambiguate_colliding_node_ids  ..\graphify\extract.py:6759
      ├─ 168.004 detect  ..\graphify\detect.py:997
      │  └─ 149.078 _is_ignored  ..\graphify\detect.py:760
      │     └─ 139.281 _eval  ..\graphify\detect.py:783
      │        └─ 121.024 _matches  ..\graphify\detect.py:787
      │           └─ 107.143 fnmatch  fnmatch.py:22
      │                 [5 frames hidden]  <frozen ntpath>, <built-in>, fnmatch
      ├─ 84.587 to_json  ..\graphify\export.py:484
      │  └─ 70.835 dump  json\__init__.py:120
      │        [4 frames hidden]  json
      ├─ 50.783 build  ..\graphify\build.py:276
      │  └─ 40.163 build_from_json  ..\graphify\build.py:107
      ├─ 26.108 surprising_connections  ..\graphify\analyze.py:119
      │  └─ 25.544 _cross_file_surprises  ..\graphify\analyze.py:263
      │     └─ 16.311 _is_file_node  ..\graphify\analyze.py:50
      └─ 16.402 score_all  ..\graphify\cluster.py:220
         └─ 16.355 cohesion_score  ..\graphify\cluster.py:209
            └─ 14.254 Graph.number_of_edges  networkx\classes\graph.py:1961
               └─ 14.227 Graph.size  networkx\classes\graph.py:1918
                  └─ 13.712 <genexpr>  networkx\classes\graph.py:1954

To view this report with different options, run:
    pyinstrument --load-prev 2026-06-12T14-50-25 [options]

Elapsed: 1529.1s

After:

[graphify extract] scanning C:\myrepo
[graphify extract] found 36456 code, 0 docs, 0 papers, 0 images
[graphify extract] AST extraction on 36456 code files...
  AST extraction: 100/36456 uncached files (0%) [20 workers]
  AST extraction: 200/36456 uncached files (0%) [20 workers]
<trimmed>
  AST extraction: 36300/36456 uncached files (99%) [20 workers]
  AST extraction: 36400/36456 uncached files (99%) [20 workers]
  AST extraction: 36456/36456 files (100%) [20 workers]
[graphify] Deduplicated 125510 node(s) (124857 exact, 580 fuzzy).
[graphify extract] wrote C:\myrepo\graphify-out\graph.json: 510993 nodes, 970955 edges, 29825 communities
[graphify extract] wrote C:\myrepo\graphify-out\.graphify_analysis.json
[graphify extract] next: run `graphify cluster-only C:\myrepo` to generate GRAPH_REPORT.md and name communities

  _     ._   __/__   _ _  _  _ _/_   Recorded: 15:29:42  Samples:  941452
 /_//_/// /_\ / //_// / //_'/ //     Duration: 1187.473  CPU time: 1067.531
/   _/                      v5.1.2

Program: graphify .

1187.484 <module>  ..\graphify\__main__.py:1
└─ 1187.444 main  ..\graphify\__main__.py:2070
   └─ 1187.202 main  ..\graphify\__main__.py:2070
      ├─ 436.542 cluster  ..\graphify\cluster.py:86
      │  ├─ 357.453 _partition  ..\graphify\cluster.py:22
      │  │  └─ 323.762 func  networkx\utils\decorators.py:783
      │  │     └─ 323.762 argmap_louvain_communities_5  <class 'networkx.utils.decorators.argmap'> compilation 9:1
      │  │        └─ 323.762 _dispatchable._call_if_no_backends_installed  networkx\utils\backends.py:541
      │  │           └─ 323.762 louvain_communities  networkx\algorithms\community\louvain.py:14
      │  │              └─ 323.701 louvain_partitions  networkx\algorithms\community\louvain.py:133
      │  │                 └─ 289.285 _one_level  networkx\algorithms\community\louvain.py:227
      │  │                    ├─ 141.018 [self]  networkx\algorithms\community\louvain.py
      │  │                    └─ 133.831 _neighbor_weights  networkx\algorithms\community\louvain.py:335
      │  └─ 71.418 _split_community  ..\graphify\cluster.py:191
      │     └─ 67.236 _partition  ..\graphify\cluster.py:22
      │        └─ 38.923 argmap_louvain_communities_5  <class 'networkx.utils.decorators.argmap'> compilation 9:1
      │           └─ 38.913 _dispatchable._call_if_no_backends_installed  networkx\utils\backends.py:541
      │              └─ 38.912 louvain_communities  networkx\algorithms\community\louvain.py:14
      │                 └─ 38.879 louvain_partitions  networkx\algorithms\community\louvain.py:133
      │                    └─ 17.441 _one_level  networkx\algorithms\community\louvain.py:227
      ├─ 351.721 extract  ..\graphify\extract.py:11298
      │  ├─ 145.903 _extract_parallel  ..\graphify\extract.py:11178
      │  │  └─ 138.774 as_completed  concurrent\futures\_base.py:193
      │  │        [3 frames hidden]  threading, <built-in>
      │  ├─ 88.776 _augment_symbol_resolution_edges  ..\graphify\extract.py:7832
      │  │  ├─ 66.238 _collect_js_symbol_resolution_facts  ..\graphify\extract.py:7492
      │  │  │  └─ 47.669 _walk_js_tree  ..\graphify\extract.py:7194
      │  │  └─ 21.793 _apply_symbol_resolution_facts  ..\graphify\extract.py:6958
      │  ├─ 71.417 load_cached  ..\graphify\cache.py:226
      │  │  ├─ 41.545 file_hash  ..\graphify\cache.py:97
      │  │  │  └─ 22.650 WindowsPath.resolve  pathlib\__init__.py:937
      │  │  │        [2 frames hidden]  <frozen ntpath>, <built-in>
      │  │  └─ 17.706 cache_dir  ..\graphify\cache.py:213
      │  └─ 21.441 _disambiguate_colliding_node_ids  ..\graphify\extract.py:6759
      ├─ 176.468 detect  ..\graphify\detect.py:998
      │  └─ 154.788 _is_ignored  ..\graphify\detect.py:761
      │     └─ 144.614 _eval  ..\graphify\detect.py:784
      │        └─ 125.482 _matches  ..\graphify\detect.py:788
      │           └─ 110.947 fnmatch  fnmatch.py:22
      │                 [5 frames hidden]  <frozen ntpath>, <built-in>, fnmatch
      ├─ 90.644 to_json  ..\graphify\export.py:484
      │  └─ 77.492 dump  json\__init__.py:120
      │        [6 frames hidden]  json, <built-in>
      ├─ 61.295 build  ..\graphify\build.py:276
      │  └─ 50.016 build_from_json  ..\graphify\build.py:107
      └─ 49.081 save_manifest  ..\graphify\detect.py:1239
         └─ 39.726 ThreadPoolExecutor.__exit__  concurrent\futures\_base.py:666
               [3 frames hidden]  concurrent, threading, <built-in>

To view this report with different options, run:
    pyinstrument --load-prev 2026-06-12T15-29-42 [options]

Elapsed: 1409.3s
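
The arithmetic behind the headline numbers can be checked with a few lines. The timings are copied from the two pyinstrument reports above (the `save_manifest` rows and the `Duration` header); the variable and function names are made up for this sketch.

```python
# Timings in seconds, copied from the "Before" and "After" profiles.
before = {"save_manifest": 328.208, "duration": 1368.850}
after = {"save_manifest": 49.081, "duration": 1187.473}


def speedup(old: float, new: float) -> float:
    """Ratio of old to new wall time; values above 1 mean the new run is faster."""
    return old / new


stage_speedup = speedup(before["save_manifest"], after["save_manifest"])
stage_saved = before["save_manifest"] - after["save_manifest"]
total_saved = before["duration"] - after["duration"]

print(f"save_manifest speedup: {stage_speedup:.2f}x")
print(f"save_manifest time saved: {stage_saved:.1f} s")
print(f"profiled duration saved: {total_saved:.1f} s")
```

The whole-run saving comes out smaller than the `save_manifest` saving because other stages were slower in the second run (for example `extract` went from 269.743 s to 351.721 s). That is probably run-to-run variation rather than an effect of this change, but a single pair of runs cannot confirm it.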

@safishamsi safishamsi merged commit 813db19 into safishamsi:v8 Jun 13, 2026
3 checks passed
@sirphilliptubell sirphilliptubell deleted the perf/fix5 branch June 16, 2026 19:36