Post

Optimizing PyDataStructs - Faster Prim's Algorithm and Smarter Graph Nodes

This week’s progress on PyDataStructs centered around two major improvements: enabling arbitrary data storage in graph nodes using Python objects and benchmarking our Prim’s algorithm implementation against NetworkX.

1. Flexible Graph Nodes with Arbitrary Data

The graph node structure in the C++ backend was previously limited to storing basic types like integers, doubles, and strings using a std::variant. This week, I expanded this capability to support arbitrary Python objects, making the nodes more flexible for real-world applications that require storing complex metadata or user-defined classes.

This was done by introducing a new DataType::PyObject enum value and adapting the variant and memory management accordingly. It allows users to store any Python object (NumPy arrays, dictionaries, or even custom classes) directly inside the C++ graph node structure, while maintaining compatibility with Python’s reference counting.

2. Benchmarking Prim’s Algorithm and Beating NetworkX

I benchmarked our C++-based Prim’s algorithm (cpp_adjacency_list implementation) against NetworkX’s version on randomly generated connected graphs. The performance comparison was conducted across a range of sparse and dense graphs up to 14,000 nodes.

We outperformed NetworkX consistently for graphs up to 10,000 nodes in both sparse and dense configurations with speedups reaching 2.3× faster at certain points. Here’s the results for a graph with 0.1 edge probability (~5x10^6 edges in a graph with 10,000 nodes)

 NodesNetworkX_Time_sPyDataStructs_cpp_adjacency_list_Time_sratio
05000.008024460.008652290.927437
110000.05516640.04268971.29227
215000.09732370.084091.15737
320000.2052250.1467141.39882
425000.3888580.2988371.30124
530000.7057430.4003461.76283
635000.8729450.5184291.68383
740001.13570.712991.59287
845001.54330.8482751.81934
950002.122811.168751.81631
1055002.303571.449261.58948
1160002.703191.70221.58806
1265003.392612.068681.63998
1370004.13252.603681.58718
1475006.90182.982032.31447
1580005.299523.640751.45561
1685006.408074.690191.36627
1790006.352695.236981.21304
1895009.181986.052571.51704
191000010.02678.040141.24708
20105009.3377315.04290.620741
211100011.016317.49370.629731
221150014.099844.38530.317668
231200016.5855.87650.296726
241250013.7712163.0230.084474
251300041.7863147.440.283412
261350035.6498275.7820.129268
271400053.9506254.1280.212297

The PR which has the changes regarding Prim’s implementations can be checked out here (https://github.com/codezonediitj/pydatastructs/pull/685) and the code used to benchmark is availiable here (https://gist.github.com/prex03/7cc12c2f96a71e8077fc25c443bcd6ca)

What’s Next?

Next week, I’ll focus on:

a. Investigating performance bottlenecks for large graphs (>10K nodes)

b. Implementing and benchmarking other algorithms in the C++ backend

This post is licensed under CC BY 4.0 by the author.