Optimizing PyDataStructs - Faster Prim's Algorithm and Smarter Graph Nodes
This week’s progress on PyDataStructs centered on two major improvements: enabling graph nodes to store arbitrary Python objects, and benchmarking our Prim’s algorithm implementation against NetworkX.
1. Flexible Graph Nodes with Arbitrary Data
The graph node structure in the C++ backend was previously limited to storing basic types like integers, doubles, and strings using a std::variant. This week, I expanded this capability to support arbitrary Python objects, making the nodes more flexible for real-world applications that require storing complex metadata or user-defined classes.
This was done by introducing a new DataType::PyObject enum value and adapting the variant and memory management accordingly. Users can now store any Python object (NumPy arrays, dictionaries, or even custom classes) directly inside the C++ graph node structure, while the backend manages Python’s reference counting so stored objects stay alive for as long as a node holds them.
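To make the user-facing effect concrete, here is a minimal usage sketch. It is not taken from the PyDataStructs test suite: the node constructor’s data argument, the Graph(*vertices, implementation=...) signature, and add_edge(source, target, cost) are assumptions about the Python-level API and may differ from the actual interface.

```python
# Minimal sketch (assumed API, not the actual PyDataStructs code).
import numpy as np
from pydatastructs import Graph, AdjacencyListGraphNode

class SensorMeta:
    """A user-defined class used as node metadata."""
    def __init__(self, location, unit):
        self.location = location
        self.unit = unit

# Any Python object can now be attached to a node: a NumPy array,
# a dictionary, or an instance of a custom class.
u = AdjacencyListGraphNode('u', np.arange(5))
v = AdjacencyListGraphNode('v', {'capacity': 10, 'tags': ['core']})
w = AdjacencyListGraphNode('w', SensorMeta('lab-3', 'kPa'))

# Assumed backend selector; the exact keyword may differ.
g = Graph(u, v, w, implementation='cpp_adjacency_list')
g.add_edge('u', 'v', 3)   # assumed: add_edge(source_name, target_name, cost)
g.add_edge('v', 'w', 7)

# The payloads round-trip unchanged because the backend holds a
# Python reference to each object rather than copying the data.
print(u.data.sum(), v.data['capacity'], w.data.location)
```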
2. Benchmarking Prim’s Algorithm and Beating NetworkX
I benchmarked our C++-based Prim’s algorithm (the cpp_adjacency_list implementation) against NetworkX’s version on randomly generated connected graphs, covering both sparse and dense configurations with up to 14,000 nodes.
We consistently outperformed NetworkX for graphs of up to 10,000 nodes, in both sparse and dense configurations, with speedups of up to 2.3× at certain sizes. Here are the results for graphs with edge probability 0.1 (roughly 5×10^6 edges at 10,000 nodes):
| Nodes | NetworkX time (s) | PyDataStructs (cpp_adjacency_list) time (s) | Speedup (NetworkX / PyDataStructs) |
|---|---|---|---|
| 500 | 0.00802446 | 0.00865229 | 0.927437 |
| 1000 | 0.0551664 | 0.0426897 | 1.29227 |
| 1500 | 0.0973237 | 0.08409 | 1.15737 |
| 2000 | 0.205225 | 0.146714 | 1.39882 |
| 2500 | 0.388858 | 0.298837 | 1.30124 |
| 3000 | 0.705743 | 0.400346 | 1.76283 |
| 3500 | 0.872945 | 0.518429 | 1.68383 |
| 4000 | 1.1357 | 0.71299 | 1.59287 |
| 4500 | 1.5433 | 0.848275 | 1.81934 |
| 5000 | 2.12281 | 1.16875 | 1.81631 |
| 5500 | 2.30357 | 1.44926 | 1.58948 |
| 6000 | 2.70319 | 1.7022 | 1.58806 |
| 6500 | 3.39261 | 2.06868 | 1.63998 |
| 7000 | 4.1325 | 2.60368 | 1.58718 |
| 7500 | 6.9018 | 2.98203 | 2.31447 |
| 8000 | 5.29952 | 3.64075 | 1.45561 |
| 8500 | 6.40807 | 4.69019 | 1.36627 |
| 9000 | 6.35269 | 5.23698 | 1.21304 |
| 9500 | 9.18198 | 6.05257 | 1.51704 |
| 10000 | 10.0267 | 8.04014 | 1.24708 |
| 10500 | 9.33773 | 15.0429 | 0.620741 |
| 11000 | 11.0163 | 17.4937 | 0.629731 |
| 11500 | 14.0998 | 44.3853 | 0.317668 |
| 12000 | 16.58 | 55.8765 | 0.296726 |
| 12500 | 13.7712 | 163.023 | 0.084474 |
| 13000 | 41.7863 | 147.44 | 0.283412 |
| 13500 | 35.6498 | 275.782 | 0.129268 |
| 14000 | 53.9506 | 254.128 | 0.212297 |
The PR containing the Prim’s algorithm changes can be checked out here (https://github.com/codezonediitj/pydatastructs/pull/685), and the benchmarking code is available here (https://gist.github.com/prex03/7cc12c2f96a71e8077fc25c443bcd6ca).
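For reference, below is a minimal sketch of how such a head-to-head comparison can be set up; the full script used for the numbers above is the gist linked earlier. The NetworkX calls (gnp_random_graph, minimum_spanning_tree) are real, while the PyDataStructs side (Graph, AdjacencyListGraphNode, minimum_spanning_tree, and the 'cpp_adjacency_list' selector) is written from the description in this post and its exact signatures are assumptions.

```python
# Minimal benchmark sketch (assumed PyDataStructs API; the real script
# that produced the table above lives in the linked gist).
import random
import time

import networkx as nx

def build_random_graph(n, p, seed=0):
    """Random G(n, p) graph with integer edge weights."""
    rng = random.Random(seed)
    g = nx.gnp_random_graph(n, p, seed=seed)
    for u, v in g.edges():
        g[u][v]['weight'] = rng.randint(1, 1000)
    return g

def time_networkx_prim(g):
    start = time.perf_counter()
    nx.minimum_spanning_tree(g, algorithm='prim')
    return time.perf_counter() - start

def time_pydatastructs_prim(g):
    # Assumed API: Graph(*vertices, implementation='cpp_adjacency_list')
    # and minimum_spanning_tree(graph, algorithm='prim'); names may differ.
    from pydatastructs import Graph, AdjacencyListGraphNode, minimum_spanning_tree
    vertices = [AdjacencyListGraphNode(str(v)) for v in g.nodes()]
    pds_graph = Graph(*vertices, implementation='cpp_adjacency_list')
    for u, v, data in g.edges(data=True):
        pds_graph.add_edge(str(u), str(v), data['weight'])
    start = time.perf_counter()
    minimum_spanning_tree(pds_graph, algorithm='prim')
    return time.perf_counter() - start

if __name__ == '__main__':
    for n in range(500, 10001, 500):
        g = build_random_graph(n, p=0.1, seed=n)
        t_nx = time_networkx_prim(g)
        t_pds = time_pydatastructs_prim(g)
        print(f"{n:6d}  networkx={t_nx:.4f}s  pydatastructs={t_pds:.4f}s  "
              f"ratio={t_nx / t_pds:.2f}")
```

Wrapping time.perf_counter around only the MST call keeps graph construction out of the measurement for both libraries, so the ratio reflects algorithm time rather than setup cost.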
What’s Next?
Next week, I’ll focus on:
a. Investigating performance bottlenecks for large graphs (>10K nodes)
b. Implementing and benchmarking other algorithms in the C++ backend