vMFMM: Hyperspherical Clustering for Production Vision

The geometric mismatch

At Datalogic, I was working with embeddings produced by ArcFace and CCE losses — both of which explicitly train encoders to distribute representations on the surface of a hypersphere. The existing pipeline then fed these embeddings into a standard Gaussian Mixture Model (GMM) for clustering.

The problem: a GMM assumes a Euclidean space in which each cluster is an ellipsoidal Gaussian around a mean. Unit-norm embeddings live on a curved surface, where similarity is measured by angle and the Euclidean mean of a cluster does not even lie on the sphere. Applying a GMM to them is a modelling mismatch, not just a performance gap.
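A small illustration of the mismatch, using made-up 2-D points rather than real embeddings: the arithmetic mean that a Gaussian component estimates falls inside the unit circle, so it has to be projected back to get a valid direction. The helper names here are hypothetical.

```python
import math

def unit(angle_deg):
    """2-D unit vector at the given angle (degrees)."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def euclidean_mean(points):
    """Component-wise arithmetic mean, the quantity a Gaussian mean estimates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

# Three directions spread over a 120-degree arc of the unit circle.
points = [unit(-60), unit(0), unit(60)]

centroid = euclidean_mean(points)
centroid_norm = norm(centroid)  # 2/3: the centroid is not on the circle

# Normalizing the centroid gives the mean direction that a vMF component uses.
mean_direction = tuple(c / centroid_norm for c in centroid)
print(centroid_norm, mean_direction)
```

The shortfall of the centroid's norm from 1 is not wasted information: it is the mean resultant length, which is what the vMF concentration estimate is built on.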

The fix: von Mises-Fisher Mixture Model

The von Mises-Fisher (vMF) distribution is the natural probability distribution on the unit hypersphere — the analogue of a Gaussian for directional data. A mixture of vMF distributions (vMFMM) gives you principled probabilistic clustering that respects the geometry of your embedding space.
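For reference, the standard vMF density on the unit sphere in d dimensions, and the mixture built from it, can be written as follows. The mixing weights π_k and the symbols d and K are notation introduced here, not taken from the original pipeline; I_v denotes the modified Bessel function of the first kind.

```latex
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad \lVert x \rVert = \lVert \mu \rVert = 1,\ \kappa > 0

C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)}

p(x) = \sum_{k=1}^{K} \pi_k\, f(x \mid \mu_k, \kappa_k),
\qquad \pi_k \ge 0,\ \sum_{k=1}^{K} \pi_k = 1
```

The density depends on x only through the cosine μᵀx, which is why the model measures similarity by angle rather than by Euclidean distance.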

I implemented the vMFMM from scratch using Expectation-Maximization:

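E-step: compute the posterior probability of each point belonging to each vMF component, given the current parameters (mean directions μₖ and concentration parameters κₖ)
M-step: update the component means and concentrations from the posterior-weighted points. The mean direction has a closed-form MLE (the normalized weighted sum of the points); the concentration MLE has no closed form, so κₖ is obtained from the mean resultant length, either by solving numerically or with a standard approximation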
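The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production implementation: it assumes NumPy and SciPy are available, adds mixing weights, initializes means from random data points, and uses the widely cited approximation κ ≈ r̄(d − r̄²)/(1 − r̄²) for the concentration. The function names and the toy 3-D data are hypothetical.

```python
import numpy as np
from scipy.special import ive, logsumexp

def log_vmf_normalizer(kappa, d):
    """log C_d(kappa) for the vMF density on the unit sphere in R^d."""
    v = d / 2.0 - 1.0
    # ive(v, k) = I_v(k) * exp(-k), so log I_v(k) = log(ive(v, k)) + k.
    log_bessel = np.log(ive(v, kappa)) + kappa
    return v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel

def fit_vmfmm(X, n_components, n_iter=50, seed=0):
    """Soft-assignment EM for a vMF mixture. X: (n, d) array with unit-norm rows."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=n_components, replace=False)].copy()
    kappa = np.full(n_components, 10.0)  # arbitrary starting concentration
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # E-step: log pi_k + log C_d(kappa_k) + kappa_k * mu_k^T x, then normalize.
        log_p = np.log(weights) + log_vmf_normalizer(kappa, d) + (X @ mu.T) * kappa
        resp = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))

        # M-step: responsibility-weighted resultant vector per component.
        nk = resp.sum(axis=0) + 1e-12                 # guard against empty components
        resultant = resp.T @ X                        # shape (K, d)
        r_norm = np.maximum(np.linalg.norm(resultant, axis=1), 1e-12)
        mu = resultant / r_norm[:, None]              # closed-form mean direction
        r_bar = np.clip(r_norm / nk, 1e-6, 1.0 - 1e-6)  # mean resultant length
        kappa = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)  # approximate MLE
        weights = nk / n
    return weights, mu, kappa, resp

# Toy data: two noisy directions in 3-D, projected onto the unit sphere.
rng = np.random.default_rng(1)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((200, 3)) for c in centers])
X /= np.linalg.norm(X, axis=1, keepdims=True)

weights, mu, kappa, resp = fit_vmfmm(X, n_components=2)
```

One caveat for real embeddings: evaluating the Bessel function directly, as done here, can underflow when d is in the hundreds, so a high-dimensional version would need a more robust log-Bessel computation.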

The concentration parameter κ plays the role that inverse variance plays for a Gaussian: higher κ means tighter clustering around the mean direction on the sphere, and as κ approaches 0 the distribution approaches uniform over the sphere. It captures spread in angular, not Euclidean, terms.
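To make the effect of κ concrete, here is a small sketch for the 3-D case only, where the expected cosine between a sample and the mean direction has the simple form coth(κ) − 1/κ. Real embeddings are higher-dimensional, where the same quantity is a ratio of Bessel functions; the function name is hypothetical.

```python
import math

def mean_cosine_3d(kappa):
    """Expected cosine between a vMF sample and its mean direction in R^3."""
    return 1.0 / math.tanh(kappa) - 1.0 / kappa

for kappa in (1.0, 10.0, 100.0):
    print(f"kappa={kappa:6.1f}  mean cosine={mean_cosine_3d(kappa):.3f}")
```

The mean cosine rises from roughly 0.31 at κ = 1 to about 0.90 at κ = 10 and 0.99 at κ = 100, so samples sit ever closer in angle to the mean direction.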

Results

Evaluated on 500K+ images across 6 encoder architectures:

Metric	vMFMM	GMM baseline
Top-4 Macro Recall	98%	~93–97%
Top-1 Macro Recall	85%	~80–84%

The improvement ranged from 1 to 5 percentage points depending on architecture, with the largest gains on encoders with the tightest hyperspherical distributions (ArcFace > CCE).
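The post does not spell out how the metric is computed. A minimal sketch under the usual reading of top-k macro recall, which is an assumption here: a sample counts as a hit if its true label is among the first k ranked predictions, recall is computed per class, and the classes are averaged with equal weight. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict

def top_k_macro_recall(y_true, ranked_preds, k):
    """Mean over classes of the fraction of that class's samples whose
    true label appears among the first k ranked predictions."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, ranking in zip(y_true, ranked_preds):
        totals[label] += 1
        if label in ranking[:k]:
            hits[label] += 1
    # Equal weight per class, regardless of how many samples it has.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy example: class "a" has three samples, class "b" has one.
y_true = ["a", "a", "a", "b"]
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["c", "b", "a"]]

top1 = top_k_macro_recall(y_true, ranked, k=1)  # (1/3 + 0) / 2
top2 = top_k_macro_recall(y_true, ranked, k=2)  # (2/3 + 1) / 2
print(top1, top2)
```

Averaging per class rather than per sample keeps rare categories from being drowned out by frequent ones.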

Ablation: optimal component count

I ran ablation studies on 380K+ training samples, varying the number of mixture components. The data exhibited long-tail class imbalance with many rare categories. A 5-component configuration provided the best bias-variance tradeoff: fewer components collapsed rare classes together, while more components overfit to noise in the tail.

Continual learning simulation

To validate deployment feasibility, I built a 200-day simulation of on-premise continual learning, with new data arriving daily and the model updating incrementally. I benchmarked 6 model families (XGBoost, SVM, Random Forest, vMFMM, GMM, k-NN) on the same stream.
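The shape of such a simulation can be sketched as a test-then-train loop: each day's batch is scored before the model learns from it. The protocol, the toy running-centroid model, and the synthetic two-class stream below are all assumptions for illustration; none of the benchmarked models or the real data are reproduced here.

```python
import random

class RunningCentroidClassifier:
    """Toy incremental model: one running mean per class, predicts the nearest mean."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def partial_fit(self, X, y):
        for x, label in zip(X, y):
            s = self.sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                s[i] += v
            self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        if not self.sums:
            return None  # nothing learned yet

        def sq_dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))

        return min(self.sums, key=sq_dist)

def daily_batch(rng, size=20):
    """Synthetic stand-in for one day of labelled data: two noisy 2-D classes."""
    centers = {"a": (0.0, 0.0), "b": (2.0, 2.0)}
    y = [rng.choice(["a", "b"]) for _ in range(size)]
    X = [[rng.gauss(centers[l][0], 1.0), rng.gauss(centers[l][1], 1.0)] for l in y]
    return X, y

def simulate(model, n_days, seed=0):
    """Score each day's batch with the current model, then update the model on it."""
    rng = random.Random(seed)
    daily_accuracy = []
    for _ in range(n_days):
        X, y = daily_batch(rng)
        correct = sum(model.predict(x) == label for x, label in zip(X, y))
        daily_accuracy.append(correct / len(y))
        model.partial_fit(X, y)
    return daily_accuracy

accuracy_by_day = simulate(RunningCentroidClassifier(), n_days=200)
```

Running every candidate model through the same loop with the same seed gives accuracy curves that are directly comparable day by day, which is what makes it possible to read off when each one plateaus.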

Key finding: peak predictive accuracy was achievable within a 50-day window, regardless of architecture. This directly informed the company’s deployment timeline and data collection strategy — they didn’t need to wait for a full year of data before deploying.