You have a large dataset and want to see its structure as a 2D map. A Self-Organizing Map does this well at scale, but the neighbourhood function that drives training also shapes the final layout: prototypes land where the grid expects them, not exactly where the data places them. The topology is approximately right, but carries a grid-geometry imprint.
The Databionic Swarm (Thrun 2018) was designed to fix this. It lets each sample find its own position on a toroidal grid by minimising a stress function, with no neighbourhood bias at all. The layout then reflects the data’s own topology. The problem: it needs one bot per sample, and Thrun and Ultsch note that beyond roughly 4,000 samples, computation exceeds a day on a single core (Thrun and Ultsch 2021).
TopoSwarm sidesteps that limit. A SOM first compresses \(N\) samples into \(n \ll N\) prototypes; the swarm then projects only those \(n\) prototypes onto the toroidal grid. The dataset size \(N\) stops mattering to the projection stage. On the LSUN benchmark, four ground-truth clusters come out cleanly via U*F flood-fill, with a final stress of 0.014.
Here is what it looks like live: the swarm projecting 212 points from the FCPS hepta benchmark in real time.
A SOM gives you a map quickly. The catch is that the neighbourhood function couples compression and topology during training: the grid geometry influences where prototypes land, not just how close they sit to each other. Close things in feature space end up close on the grid, which is good, but the exact arrangement reflects the training grid as much as the data.
If you want the layout to reflect only the data, you need to decouple the two stages. The Databionic Swarm (Thrun 2018) does this by letting each sample find its position freely, minimising stress with no neighbourhood bias. But operating on all \(N\) raw samples is expensive; Thrun and Ultsch report that beyond roughly 4,000 samples, a single-core run exceeds a day (Thrun and Ultsch 2021).
TopoSwarm decouples compression from projection while keeping the swarm tractable:
Stage               Input                            Output                           Complexity
------------------------------------------------------------------------------------------------
SOM quantization    \(N \times d\) raw data          \(n\) prototype weight vectors   \(O(N \cdot n)\) per epoch
Swarm projection    \(n \times n\) distance matrix   grid position per prototype      \(O(n^2)\) per epoch
Sample assignment   \(N\) BMU indices                grid position per raw sample     \(O(N)\)
With \(n \ll N\) prototypes the swarm stage stays fast regardless of dataset size.
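The third row is worth spelling out, because it explains the \(O(N)\): every raw sample simply inherits the grid position of its best-matching prototype. A minimal sketch with dummy arrays (the names are illustrative, not the pipeline's actual variables):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 404                                # prototypes, raw samples
grid_pos = rng.integers(0, 32, size=(n, 2))   # (row, col) per prototype, from the swarm
bmu = rng.integers(0, n, size=N)              # BMU index per raw sample, from the SOM

sample_pos = grid_pos[bmu]                    # all of Stage 3: one O(N) index lookup
print(sample_pos.shape)                       # (404, 2)
```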
What happens inside
The algorithm in pictures
Figure 1 (Stage 1, SOM Compression): A SOM compresses \(N\) raw samples into \(n \ll N\) prototype weight vectors. Only the weights and hit counts are kept; the SOM grid topology is discarded.

Figure 2 (Stage 2, Polar Jump Positions): Grid positions are encoded as Complex(row, col). \(R_\text{min}\) is derived from the grid density and \(R_\text{max} = \text{Lines}/2\); annealing sweeps \(R_\text{max} \to R_\text{min}\) in integer steps.

Figure 3 (Stage 2, Pac-Man Torus): All four edges wrap, so a bot at column 1 and one at column \(N\) are neighbours. This removes the boundary artefacts that distort rectangular ESOM maps.

Figure 4 (Stage 2, Radius Annealing): At large \(R\), roughly 50% of bots explore widely; at small \(R\), roughly 5% make fine-grained adjustments. Each bot samples 4 random candidates and keeps the best; the loop exits at a weak Nash equilibrium.

Figure 5 (Stage 2, Stress Over Iterations): Stress drops steeply at large \(R\) (exploration), then plateaus. Each coloured band marks one radius phase; convergence is declared when the linear slope of recent stress falls below \(\varepsilon\).

Figure 6 (Stage 3, U-matrix & U*F): The U-matrix maps average neighbour distances onto the grid. U*F flood-fills from each local minimum, stopping at ridges, to assign cluster membership without a fixed \(k\).
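To tie Figures 2 through 5 together, here is a compressed Python sketch of the Stage 2 loop. It is an illustration, not TopoSwarm.jl itself: the stress function below is a simplified, unweighted stand-in for the objective defined later, and positions are kept continuous rather than snapped to grid cells.

```python
import numpy as np

rng = np.random.default_rng(42)

def torus_dist(pos, lines, cols):
    """Pairwise shortest-path distances on the wrapped (Pac-Man) grid."""
    d = np.abs(pos[:, None, :] - pos[None, :, :])           # (n, n, 2) offsets
    d = np.minimum(d, np.array([lines, cols]) - d)          # wrap all four edges
    return np.hypot(d[..., 0], d[..., 1])

def stress(pos, D_feat, lines, cols):
    """Simplified, unweighted stand-in for the stress objective."""
    D_grid = torus_dist(pos, lines, cols)
    return np.sum((D_feat / D_feat.max() - D_grid / D_grid.max()) ** 2)

def anneal(D_feat, lines=16, cols=16, r_min=1, n_candidates=4):
    n = D_feat.shape[0]
    pos = rng.uniform(0, [lines, cols], size=(n, 2))        # random initial layout
    for R in range(lines // 2, r_min - 1, -1):              # R_max = Lines/2 -> R_min
        for _ in range(100):                                # passes at this radius
            improved = False
            for i in rng.permutation(n):
                best = stress(pos, D_feat, lines, cols)
                for _ in range(n_candidates):               # 4 random jump candidates
                    theta = rng.uniform(0, 2 * np.pi)
                    step = rng.uniform(1, R) * np.array([np.cos(theta), np.sin(theta)])
                    old = pos[i].copy()
                    pos[i] = (old + step) % [lines, cols]
                    s = stress(pos, D_feat, lines, cols)
                    if s < best:
                        best, improved = s, True            # keep the improving jump
                    else:
                        pos[i] = old                        # revert
            if not improved:                                # weak Nash equilibrium
                break
    return pos

# Toy usage: project 30 random prototypes.
P = rng.normal(size=(30, 3))
D = np.linalg.norm(P[:, None] - P[None], axis=-1)
layout = anneal(D)
```

On a few dozen prototypes this sketch converges in seconds; the real swarm adds the density-derived \(R_\text{min}\) and the slope-based convergence test from Figure 5.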
Step 1: compress with a SOM
A SOM with \(n\) nodes is trained on the raw data \(\mathbf{X} \in \mathbb{R}^{N \times d}\) as a compressor only; its internal grid topology is discarded. We keep only the \(n\) prototype weight vectors \(\mathbf{W} \in \mathbb{R}^{n \times d}\) and the population counts \(\mathbf{p} \in \mathbb{N}^n\).
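As a stand-in for the pyesom/IntraSOM call (whose API is not shown here), a minimal numpy SOM used purely as a quantizer might look like this; the hyperparameters and cooling schedule are illustrative:

```python
import numpy as np

def som_quantize(X, rows=10, cols=10, epochs=20, lr=0.5, seed=0):
    """Train a SOM on X and keep only weights W, hit counts p, and BMU indices."""
    rng = np.random.default_rng(seed)
    n = rows * cols
    W = X[rng.choice(len(X), n, replace=False)].astype(float)   # init from data
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), -1).reshape(n, 2)
    for epoch in range(epochs):
        sigma = max(rows, cols) / 2 * (1 - epoch / epochs) + 1  # shrinking radius
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))      # best-matching unit
            h = np.exp(-np.linalg.norm(grid - grid[bmu], axis=1) ** 2
                       / (2 * sigma ** 2))                      # neighbourhood pull
            W += lr * (1 - epoch / epochs) * h[:, None] * (x - W)
    bmu = np.argmin(np.linalg.norm(X[:, None] - W[None], axis=2), axis=1)
    p = np.bincount(bmu, minlength=n)                           # population per node
    return W, p, bmu                                            # grid is discarded
```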
The quality of compression is the quantization error:

\[
\mathrm{QE} = \frac{1}{N} \sum_{j=1}^{N} \bigl\lVert \mathbf{x}_j - \mathbf{w}_{\mathrm{bmu}(j)} \bigr\rVert
\]

where \(\mathbf{w}_{\mathrm{bmu}(j)}\) is the best-matching prototype for sample \(\mathbf{x}_j\).

Step 2: project with the swarm

The swarm places the \(n\) prototypes on the toroidal grid by minimising the stress between feature-space and grid distances, weighted by node population:

\[
\mathcal{S} = \sum_{i<j} p_i \, p_j \left( \hat{d}^{\,\mathrm{in}}_{ij} - \hat{d}^{\,\mathrm{out}}_{ij} \right)^2
\]

where \(\hat{d}\) denotes distances normalised to \([0, 1]\) (\(\hat{d}^{\,\mathrm{in}}\) in feature space, \(\hat{d}^{\,\mathrm{out}}\) on the grid), and \(p_i\) is the normalised population of node \(i\). Dense nodes dominate the objective over sparse boundary nodes.
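Both quantities take only a few lines of numpy, reusing W, p, and bmu from the quantizer sketch above; D_grid would come from the swarm's toroidal layout:

```python
import numpy as np

def quantization_error(X, W, bmu):
    """Mean feature-space distance from each sample to its BMU prototype."""
    return np.mean(np.linalg.norm(X - W[bmu], axis=1))

def swarm_stress(D_feat, D_grid, p):
    """Population-weighted squared gap between normalised distance matrices."""
    d_in = D_feat / D_feat.max()          # \hat{d}^in: feature distances in [0, 1]
    d_out = D_grid / D_grid.max()         # \hat{d}^out: grid distances in [0, 1]
    w = np.outer(p, p) / p.sum() ** 2     # normalised populations: dense nodes dominate
    iu = np.triu_indices_from(d_in, k=1)  # each pair i < j once
    return np.sum(w[iu] * (d_in[iu] - d_out[iu]) ** 2)
```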
The compression approximation is controlled by QE; the projection approximation by the final stress \(\mathcal{S}\). Because both are measurable at runtime, the chain is auditable end to end: a low QE with a high stress points to the swarm stage; a high QE with a low stress points to the SOM.
Step 3: read the clusters from the U-matrix
Once the swarm has settled, each occupied grid cell gets a U-matrix value: the average feature-space distance to its immediately adjacent occupied neighbours (Ultsch 2003). Low values mean nearby prototypes are similar; the cell sits inside a cluster. High values mean a sharp boundary; the cell sits on a ridge between clusters. Unoccupied cells are left as \(\mathrm{NaN}\).
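Concretely, a minimal U-matrix construction might look like this; the 8-cell toroidal neighbourhood is an assumption, and the exact stencil in TopoSwarm may differ:

```python
import numpy as np

def u_matrix(positions, W, lines, cols):
    """positions: (n, 2) integer grid cell per prototype; W: (n, d) weights.
    Returns a (lines, cols) array; unoccupied cells stay NaN.
    Assumes an 8-cell toroidal neighbourhood (the exact stencil may differ)."""
    occ = {tuple(pos): i for i, pos in enumerate(positions)}
    U = np.full((lines, cols), np.nan)
    for (r, c), i in occ.items():
        dists = [np.linalg.norm(W[i] - W[occ[((r + dr) % lines, (c + dc) % cols)]])
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)
                 and ((r + dr) % lines, (c + dc) % cols) in occ]
        if dists:
            U[r, c] = np.mean(dists)   # average distance to adjacent occupied cells
    return U
```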
Most clustering algorithms ask you to pick \(k\) before you look at the data. U*F does not. The number of clusters is a property of the map, not a parameter you supply. U*F flood-fill reads \(k\) directly from the topology:
1. Seed one flood from every local minimum on the U-matrix.
2. Each flood expands outward to neighbouring cells in order of increasing U-value.
3. A flood stops when it hits a cell above a threshold derived from the upper tail of the U-matrix distribution (threshold_anchor="upper"); that cell becomes a ridge.
4. Any cell claimed by two different floods becomes a ridge too.
5. Floods that cover fewer than min_cluster_size nodes are discarded.
What comes out is a partition that follows the natural valleys of the map. If there are four valleys separated by ridges, you get four clusters. If two valleys merge into one broad basin, they come out as one. The data decides.
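A compact sketch of that flood-fill follows. It simplifies two details: the threshold is a plain upper-tail quantile standing in for the threshold_anchor="upper" logic, and contested cells are resolved first-come rather than explicitly marked as ridges:

```python
import heapq
import numpy as np

def uf_flood_fill(U, threshold_q=0.9, min_cluster_size=5):
    """Label clusters by flooding from local minima of the U-matrix U.
    NaN cells are unoccupied; ridges and discarded floods come back as -1."""
    lines, cols = U.shape
    thresh = np.nanquantile(U, threshold_q)     # simplified upper-tail threshold
    occupied = ~np.isnan(U)
    labels = np.full(U.shape, -1, dtype=int)

    def nbrs(r, c):                             # 4-neighbourhood on the torus
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            yield (r + dr) % lines, (c + dc) % cols

    # Seed one flood per local minimum (no smaller occupied neighbour).
    heap, k = [], 0
    for r in range(lines):
        for c in range(cols):
            if occupied[r, c] and all(U[r, c] <= U[nr, nc]
                                      for nr, nc in nbrs(r, c) if occupied[nr, nc]):
                heapq.heappush(heap, (U[r, c], r, c, k))
                k += 1

    # Expand all floods in order of increasing U-value.
    while heap:
        u, r, c, lab = heapq.heappop(heap)
        if labels[r, c] != -1 or u > thresh:    # already claimed, or a ridge cell
            continue
        labels[r, c] = lab
        for nr, nc in nbrs(r, c):
            if occupied[nr, nc] and labels[nr, nc] == -1:
                heapq.heappush(heap, (U[nr, nc], nr, nc, lab))

    # Discard floods smaller than min_cluster_size.
    for lab in range(k):
        if (labels == lab).sum() < min_cluster_size:
            labels[labels == lab] = -1
    return labels
```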
The code
The SOM quantization stage uses Python (pyesom, IntraSOM backend). The swarm projection stage runs in Julia (TopoSwarm.jl). The two stages communicate via a single .npz file.
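The hand-off itself is ordinary numpy. A sketch of the Python side, with illustrative (not the pipeline's actual) file and key names:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 3))            # (n, d) prototype weight vectors
p = rng.integers(1, 20, size=50)        # (n,) hit counts per prototype
bmu = rng.integers(0, 50, size=404)     # (N,) BMU index per raw sample

# Python writes the quantization results; Julia reads the same file.
np.savez("quantized.npz", weights=W, populations=p, bmu=bmu)

# Julia side, e.g. with NPZ.jl:
#   using NPZ; data = npzread("quantized.npz")
```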
Dataset          N         n_nodes   Expected runtime
------------------------------------------------------
lsun benchmark   404       50        seconds
medium dataset   50,000    500       minutes
large dataset    200,000   1,500     minutes
500K use case    500,000   2,000     tens of minutes
Wrapping up
If you want topology-honest 2D maps from large datasets, the bottleneck has always been the swarm stage. TopoSwarm removes it by running the swarm only on \(n\) prototypes, not on all \(N\) raw samples. The dataset size stops mattering to the projection.
Each stage is independently tunable. Swap the SOM backend, retrain, and check quantization error without touching the swarm. Run the swarm longer and check stress without retraining the SOM. When something looks off on the map, the metrics tell you which stage to adjust.
The output is a toroidal grid with a U-matrix and U*F cluster boundaries, readable the same way as the original Databionic Swarm, just no longer limited to small datasets.
Thrun, Michael C. 2018. Projection-Based Clustering Through Self-Organization and Swarm Intelligence. Springer Vieweg.
Thrun, Michael C., and Alfred Ultsch. 2021. “Swarm Intelligence for Self-Organized Clustering.” Artificial Intelligence 290: 103237.
Ultsch, Alfred. 2003. “U*-Matrix: A Tool to Examine High Dimensional Data for Clusters, Topology, Density and Outliers.” Proc. Workshop on Self-Organizing Maps (WSOM).