The geometric mismatch
At Datalogic, I was working with embeddings from encoders trained with ArcFace and categorical cross-entropy (CCE) losses, both of which explicitly train encoders to distribute representations on the surface of a hypersphere. The existing pipeline then fed these embeddings into a standard Gaussian Mixture Model (GMM) for clustering.
The problem: a GMM assumes a Euclidean space with Gaussian-shaped clusters. Hyperspherical embeddings live on a manifold — their geometry is fundamentally different. Applying a GMM to them is a modelling mismatch, not just a performance gap.
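One way to see the mismatch concretely: the Euclidean mean of a set of unit vectors, which is what a Gaussian component uses as its centre, lies strictly inside the sphere rather than on it. The sketch below uses synthetic data; the dimension, noise level, and sample count are arbitrary assumptions and not the actual embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for L2-normalized embeddings: noisy copies of one direction.
direction = np.zeros(64)
direction[0] = 1.0
points = direction + 0.05 * rng.standard_normal((1000, 64))
points /= np.linalg.norm(points, axis=1, keepdims=True)

# A Gaussian component's centre is the Euclidean mean, which falls inside the sphere.
euclidean_mean = points.mean(axis=0)
mean_norm = np.linalg.norm(euclidean_mean)

# A directional model keeps only the direction, renormalized back onto the sphere.
mean_direction = euclidean_mean / mean_norm

print(f"norm of Euclidean mean: {mean_norm:.3f}")
print(f"norm of mean direction: {np.linalg.norm(mean_direction):.3f}")
```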
The fix: von Mises-Fisher Mixture Model
The von Mises-Fisher (vMF) distribution is the natural probability distribution on the unit hypersphere — the analogue of a Gaussian for directional data. A mixture of vMF distributions (vMFMM) gives you principled probabilistic clustering that respects the geometry of your embedding space.
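For reference, the vMF density on the unit sphere in d dimensions is f(x; μ, κ) = C_d(κ) · exp(κ μᵀx), with C_d(κ) = κ^(d/2−1) / ((2π)^(d/2) · I_(d/2−1)(κ)), where I is the modified Bessel function of the first kind. A minimal sketch of the log-density follows; the function name `vmf_log_pdf` is illustrative, and the use of SciPy's `ive` is an assumption of this sketch rather than a description of the original implementation.

```python
import numpy as np
from scipy.special import ive


def vmf_log_pdf(x, mu, kappa):
    """Log-density of vMF(mu, kappa) at unit vectors x of shape (n, d) or (d,)."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    v = d / 2.0 - 1.0
    # ive(v, k) = I_v(k) * exp(-k) for k > 0, so log I_v(k) = log(ive(v, k)) + k.
    # Caveat: for high-dimensional embeddings ive can underflow to 0, in which
    # case an asymptotic or log-space Bessel approximation would be needed.
    log_bessel = np.log(ive(v, kappa)) + kappa
    log_norm = v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    return log_norm + kappa * (x @ mu)
```

The density is highest at x = μ and lowest at x = −μ, and it depends on x only through the cosine μᵀx.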
I implemented the vMFMM from scratch using Expectation-Maximization:
- E-step: compute posterior probability of each point belonging to each vMF component, given current parameters (mean directions μₖ and concentration parameters κₖ)
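- M-step: update component means and concentrations from the soft assignments. The mean direction has a closed-form MLE (the normalized weighted resultant vector); the concentration has no closed-form MLE, so it is estimated from the mean resultant length of the assigned points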
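The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the names `log_vmf` and `fit_vmfmm` are hypothetical, the initial mean directions are supplied by the caller, and the concentration update uses the common approximation κ ≈ r̄(d − r̄²)/(1 − r̄²) from Banerjee et al. (2005), which may differ from the estimator used in the project.

```python
import numpy as np
from scipy.special import ive, logsumexp


def log_vmf(X, mu, kappa):
    """Log-density of vMF(mu, kappa) at each row of X (unit vectors)."""
    d = X.shape[1]
    v = d / 2.0 - 1.0
    log_bessel = np.log(ive(v, kappa)) + kappa  # log I_v(kappa)
    log_norm = v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    return log_norm + kappa * (X @ mu)


def fit_vmfmm(X, init_mus, n_iter=50, init_kappa=10.0):
    """EM for a vMF mixture; init_mus has shape (K, d) with unit-norm rows."""
    n, d = X.shape
    mus = np.array(init_mus, dtype=float)
    n_components = mus.shape[0]
    kappas = np.full(n_components, init_kappa)
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point,
        # computed in log space for numerical stability.
        log_joint = np.stack(
            [np.log(weights[k]) + log_vmf(X, mus[k], kappas[k])
             for k in range(n_components)],
            axis=1,
        )
        resp = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

        # M-step: mixture weights, mean directions, and concentrations.
        nk = resp.sum(axis=0) + 1e-12            # soft counts, guarded against 0
        weights = nk / nk.sum()
        resultants = resp.T @ X                  # weighted resultant vectors, (K, d)
        r_norm = np.linalg.norm(resultants, axis=1)
        mus = resultants / np.maximum(r_norm, 1e-12)[:, None]
        r_bar = np.clip(r_norm / nk, 1e-6, 1.0 - 1e-6)  # mean resultant length
        kappas = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)

    return weights, mus, kappas, resp
```

The returned `resp` matrix holds the soft assignments from the final E-step; taking its row-wise argmax gives hard cluster labels.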
The concentration parameter κ plays the role that inverse variance plays for a Gaussian: higher κ means tighter clustering around the mean direction on the sphere, and κ = 0 recovers the uniform distribution on the sphere. It captures spread in angular, not Euclidean, terms.
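To make the link between κ and angular spread tangible, the sketch below estimates κ for a tight and a loose synthetic cluster. It assumes the approximation κ ≈ r̄(d − r̄²)/(1 − r̄²), where r̄ is the mean resultant length; the helper `noisy_directions` is hypothetical and generates normalized Gaussian perturbations, which are not exact vMF samples.

```python
import numpy as np


def estimate_kappa(X):
    """Approximate vMF concentration from unit vectors X of shape (n, d)."""
    n, d = X.shape
    r_bar = np.linalg.norm(X.sum(axis=0)) / n   # mean resultant length in [0, 1]
    r_bar = min(r_bar, 1.0 - 1e-9)              # avoid division by zero
    return r_bar * (d - r_bar**2) / (1.0 - r_bar**2)


def noisy_directions(center, noise, n, rng):
    """Hypothetical helper: unit vectors scattered around `center`."""
    pts = center + noise * rng.standard_normal((n, center.size))
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)


rng = np.random.default_rng(0)
center = np.array([0.0, 0.0, 1.0])
kappa_tight = estimate_kappa(noisy_directions(center, 0.05, 500, rng))
kappa_loose = estimate_kappa(noisy_directions(center, 0.5, 500, rng))
print(f"tight cluster kappa: {kappa_tight:.1f}")
print(f"loose cluster kappa: {kappa_loose:.1f}")
```

The tight cluster yields a much larger κ than the loose one, because its unit vectors nearly align and r̄ approaches 1.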
Results
Evaluated on 500K+ images across 6 encoder architectures:
| Metric | vMFMM | GMM baseline |
|---|---|---|
| Top-4 Macro Recall | 98% | ~93–97% |
| Top-1 Macro Recall | 85% | ~80–84% |
The improvement ranged from 1 to 5 percentage points depending on architecture, with the largest gains on encoders with the tightest hyperspherical distributions (ArcFace > CCE).
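For readers unfamiliar with the metric, here is a sketch of top-k macro recall under the usual definition: a sample counts as a hit if its true class is among the k highest-scoring classes, recall is computed per class, and the per-class values are averaged with equal weight. The function name and the `scores` layout (one row per sample, one column per class) are assumptions of this sketch, not the project's evaluation code.

```python
import numpy as np


def top_k_macro_recall(scores, y_true, k):
    """Mean over classes of the fraction of samples whose label is in the top k."""
    scores = np.asarray(scores)
    y_true = np.asarray(y_true)
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best classes
    hit = (top_k == y_true[:, None]).any(axis=1)     # per-sample top-k hit
    per_class = [hit[y_true == c].mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))
```

Because every class contributes equally to the average, rare classes in a long-tailed dataset weigh as much as frequent ones.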
Ablation: optimal component count
I ran ablation studies on 380K+ training samples, varying the number of mixture components. The data exhibited long-tail class imbalance with many rare categories. A 5-component configuration provided the best bias-variance tradeoff: fewer components collapsed rare classes together, while more components overfit to noise in the tail.
Continual learning simulation
To validate deployment feasibility, I built a 200-day simulation of on-premise continual learning — new data arriving daily, model updating incrementally. Benchmarked 6 ML architectures (XGBoost, SVM, Random Forest, vMFMM, GMM, k-NN) on the same stream.
Key finding: peak predictive accuracy was achievable within a 50-day window, regardless of architecture. This directly informed the company’s deployment timeline and data collection strategy — they didn’t need to wait for a full year of data before deploying.
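The shape of such a simulation can be sketched as a test-then-train loop: each day's batch is first scored with the model built from earlier days, then folded into the model. Everything below other than the 200-day horizon is an assumption for illustration: the data is synthetic, and the model is a toy nearest-mean-direction classifier rather than any of the six benchmarked architectures.

```python
import numpy as np


def make_day(rng, centers, n_per_day, noise):
    """Hypothetical daily batch: labelled unit vectors around class directions."""
    labels = rng.integers(0, len(centers), size=n_per_day)
    X = centers[labels] + noise * rng.standard_normal((n_per_day, centers.shape[1]))
    return X / np.linalg.norm(X, axis=1, keepdims=True), labels


def simulate(n_days=200, n_classes=4, dim=8, n_per_day=50, noise=0.2, seed=0):
    rng = np.random.default_rng(seed)
    centers = np.eye(dim)[:n_classes]          # toy, well-separated class directions
    resultants = np.zeros((n_classes, dim))    # running sum of unit vectors per class
    daily_accuracy = []
    for _ in range(n_days):
        X, y = make_day(rng, centers, n_per_day, noise)
        # Test first: predict with mean directions learned from earlier days.
        norms = np.maximum(np.linalg.norm(resultants, axis=1, keepdims=True), 1e-12)
        pred = np.argmax(X @ (resultants / norms).T, axis=1)
        daily_accuracy.append(float((pred == y).mean()))
        # Then train: incrementally fold today's labelled batch into the sums.
        np.add.at(resultants, y, X)
    return np.array(daily_accuracy)


acc = simulate()
print(f"day 1: {acc[0]:.2f}, mean of last 50 days: {acc[-50:].mean():.2f}")
```

Plotting `acc` against the day index gives the learning curve from which a plateau, such as the 50-day window reported above, can be read off.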
