Silhouette score is one of the most popular ways to judge clustering. It is also one of the easiest metrics to misread when your data is imbalanced. If one group is much bigger than the others, you can get a “great” silhouette score while your clusters are useless for decisions.
This post breaks down why that happens, what to look for, and what to do instead. I’ll keep it practical. If you use clustering for segmentation, anomaly detection, or “let’s see what patterns show up,” this will save you from false confidence.
Quick refresher: what silhouette score measures
For each data point, silhouette score compares two things:
- a: the average distance from the point to other points in its own cluster (how tight the cluster is)
- b: the average distance from the point to points in the nearest other cluster (how separated clusters are)
The silhouette value for a point is (b − a) / max(a, b). Values range from −1 to 1. Higher is better. People often treat “higher” as “correct.” That’s the trap.
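To make the formula concrete, here is a minimal sketch using scikit-learn's silhouette_score (the mean over all points) and silhouette_samples (the per-point values). The dataset and cluster count are placeholder choices, not a recommendation:

```python
# Minimal silhouette computation; the data and K here are placeholders.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))   # the single number people usually report
per_point = silhouette_samples(X, labels)                 # (b - a) / max(a, b) for each point
print("per-point range:", per_point.min(), "to", per_point.max())
```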
Silhouette score is a geometry-based metric. It rewards clusters that look like clean “blobs” with clear spacing between them. Real-world data, especially imbalanced data, rarely behaves like that.
What “imbalanced data” means in clustering
In clustering, imbalance usually shows up in one of these ways:
- Skewed cluster sizes: one cluster contains most points, the rest are small.
- Uneven densities: one cluster is tight, another is spread out.
- Rare segments: the patterns you care about are small by definition (fraud, edge cases, special customer types).
In real life, imbalance is normal. Customer behavior is not evenly distributed. Network traffic is not evenly distributed. Claims data is not evenly distributed. The question is not “can clustering handle imbalance?” The question is “can your validation metric tell you the truth under imbalance?”
Why silhouette scores fail on imbalanced data
1) The majority cluster dominates the average
Most teams report a single average silhouette score. If 90% of your points fall into one cluster, that cluster controls the average. If the big cluster is tight and well-separated, the score looks strong even if the small clusters are garbage.
That means you can ship a model that performs well for the majority, while misclassifying the rare segments you actually care about. The metric says “great.” The business says “why are we missing the high-value customers?”
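Here is a sketch of that effect on synthetic data. The sizes, centers, and spreads are arbitrary choices made to mimic a 90/9/1 split, and the exact numbers will vary, but the pattern to watch for is the same: the overall mean tracks the big cluster, not the small ones.

```python
# Sketch: an imbalanced dataset where the overall mean can hide the small clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(
    n_samples=[4500, 450, 50],            # roughly a 90% / 9% / 1% split
    centers=[[0, 0], [4, 4], [5, 5]],     # the two small groups sit close together
    cluster_std=[1.0, 0.6, 0.3],
    random_state=42,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_samples(X, labels)
print(f"overall mean silhouette: {silhouette_score(X, labels):.3f}")
for k in np.unique(labels):
    mask = labels == k
    print(f"cluster {k}: size={mask.sum():5d}  mean silhouette={sil[mask].mean():.3f}")
```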
If you only take one thing from this post, take this: a single silhouette number is not a report.
2) Small clusters get punished by geometry
Small clusters have fewer neighbors, which makes a (the within-cluster distance) noisy: a single outlier can inflate it sharply. Meanwhile, b (the distance to the nearest other cluster) may not grow much, especially if the small cluster sits near the boundary of the big one.
Result: points in small clusters get low or negative silhouette values, even when the cluster is meaningful. The metric is not asking “does this cluster represent a real segment?” It is asking “does this cluster look like a clean geometric blob?” Those are different questions.
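To see the noise concretely, here is a toy sketch: a small cluster near the edge of a big one, scored with and without a single outlier attached to it. The numbers are illustrative and depend entirely on the geometry, but they show how one point can drag the whole small cluster's silhouette down.

```python
# Toy sketch: one outlier in a six-point cluster inflates a for everyone in it.
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
big = rng.normal(loc=[0, 0], scale=1.0, size=(500, 2))                  # majority cluster
small = np.array([[3.5, 3.5], [3.6, 3.5], [3.5, 3.6], [3.6, 3.6], [3.4, 3.5]])
outlier = np.array([[7.0, 7.0]])                                        # assigned to the small cluster

for name, pts in [("small cluster alone", small),
                  ("small cluster + one outlier", np.vstack([small, outlier]))]:
    X = np.vstack([big, pts])
    labels = np.r_[np.zeros(len(big), dtype=int), np.ones(len(pts), dtype=int)]
    sil = silhouette_samples(X, labels)
    print(f"{name}: mean silhouette of small cluster = {sil[labels == 1].mean():.3f}")
```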
3) Different densities break “one-size” distance assumptions
Silhouette score relies on distances behaving consistently across clusters. If one cluster is dense and another is diffuse, the within-cluster distance a is not comparable across clusters. A diffuse cluster can be perfectly valid (think: “casual shoppers” with varied behavior), but silhouette penalizes it because its points are farther apart.
You see this a lot with K-Means, because K-Means prefers equal-sized spherical clusters. If your data is not shaped that way, K-Means will still return something — and silhouette score can still look decent — while the segmentation is misleading.
4) The metric can reward splitting the big cluster into pieces
On imbalanced data, it is common to get a better silhouette score by splitting the majority cluster into multiple clusters, even when the split has no business meaning. That can happen because separation increases (higher b) and within-cluster distances shrink (lower a) simply due to smaller groups.
So you can “optimize” the score and end up with three clusters that are all basically the same majority behavior, and still fail to isolate the rare segment.
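One cheap guard is to log cluster sizes next to the score while you sweep K. Whether the score actually climbs as you slice the majority depends on your data; the useful habit is checking whether a genuinely small cluster ever appears. A sketch, with placeholder data:

```python
# Sketch: sweep K and report the silhouette score next to the cluster sizes.
# A score that climbs while no small cluster emerges suggests you are just
# slicing up the majority rather than isolating the rare segment.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(
    n_samples=[4500, 450, 50],
    centers=[[0, 0], [3, 3], [3.8, 3.8]],
    cluster_std=[1.5, 0.5, 0.3],
    random_state=0,
)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sizes = sorted(np.bincount(labels), reverse=True)
    print(f"K={k}  silhouette={silhouette_score(X, labels):.3f}  sizes={sizes}")
```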
5) High-dimensional data makes distances lie
If you are clustering embeddings or wide feature sets, distance measures can suffer from the curse of dimensionality (distances become less informative as dimensions increase). When most points are similarly far apart, the silhouette math becomes unstable. You can get a smooth-looking score that does not reflect real separability.
This is why teams working with text embeddings can see “okay” silhouette scores while human review says “these clusters are nonsense.”
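You can check how much contrast your distances still have with a quick experiment. The sketch below uses random data purely to illustrate the effect; run the same ratio on your own feature matrix or embeddings.

```python
# Sketch: as dimensionality grows, pairwise distances concentrate, so the
# "nearest" and "farthest" neighbours stop being meaningfully different.
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))
    d = pairwise_distances(X)
    d = d[np.triu_indices_from(d, k=1)]   # keep each pairwise distance once
    print(f"dim={dim:4d}  relative spread (max - min) / mean = {(d.max() - d.min()) / d.mean():.2f}")
```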
A simple example you’ve probably seen
Imagine customer segmentation where:
- 90% are “regular buyers”
- 9% are “deal hunters”
- 1% are “high-value repeat buyers”
Silhouette score often favors the split between regular buyers and everyone else. The 1% group can get absorbed into the 9% group, because they are both “not regular.” Your score looks good. Your marketing team can’t find the high-value segment. That is not a modeling issue. That is a validation issue.
The same pattern shows up in:
- fraud detection clusters
- rare failure mode clustering
- healthcare risk segmentation
- operational incident grouping
If the minority group is the business goal, silhouette score will often under-report value.
What to check instead of trusting a single silhouette number
1) Look at the silhouette plot, not just the average
A silhouette plot shows the distribution of scores per cluster. On imbalanced data, this is where the truth lives. If the big cluster has strong scores and the small clusters have messy scores, the average is lying by omission.
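A basic version takes a dozen lines with matplotlib and silhouette_samples. In this sketch, X and labels are whatever your own pipeline produced; the layout choices (gaps, label placement) are just one reasonable style.

```python
# Sketch of a silhouette plot: per-point values, sorted within each cluster,
# with the overall mean as a dashed reference line.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_plot(X, labels):
    sil = silhouette_samples(X, labels)
    y_lower = 10
    for k in np.unique(labels):
        vals = np.sort(sil[labels == k])
        y_upper = y_lower + len(vals)
        plt.fill_betweenx(np.arange(y_lower, y_upper), 0, vals)
        plt.text(-0.05, y_lower + 0.5 * len(vals), str(k))   # cluster label on the left
        y_lower = y_upper + 10                                # gap between clusters
    plt.axvline(silhouette_score(X, labels), color="red", linestyle="--", label="overall mean")
    plt.xlabel("silhouette value")
    plt.ylabel("points, grouped by cluster")
    plt.legend()
    plt.show()
```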
2) Report per-cluster silhouette and size
Make it a rule: every clustering report includes cluster size and per-cluster silhouette. If the “rare” cluster is tiny and has poor silhouette, you do not automatically delete it. You investigate it. That rare cluster might be the entire reason you are clustering.
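One way to make the rule stick is a small helper that every clustering run passes through. A sketch with pandas; the function name and columns are my own choices:

```python
# Sketch: per-cluster size, share of the data, and mean/min silhouette.
import pandas as pd
from sklearn.metrics import silhouette_samples

def cluster_report(X, labels):
    sil = silhouette_samples(X, labels)
    df = pd.DataFrame({"cluster": labels, "silhouette": sil})
    report = df.groupby("cluster")["silhouette"].agg(["count", "mean", "min"])
    report["share"] = report["count"] / len(labels)
    return report.sort_values("count", ascending=False)

# print(cluster_report(X, labels))   # X, labels from your own pipeline
```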
3) Add at least one alternative metric
Silhouette score is not wrong. It is incomplete. Add at least one other view (a short sketch follows this list):
- Davies–Bouldin Index (DBI): lower is better. It can catch cluster overlap that silhouette hides.
- Calinski–Harabasz: helpful for comparing K values, but still shape-sensitive.
- Density-based validity (DBCV): better for DBSCAN/HDBSCAN-style clusters.
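The first two are one import each in scikit-learn. DBCV is not in scikit-learn; the hdbscan package's validity_index is one commonly used implementation, which I'm treating here as an assumption about your environment rather than calling directly.

```python
# Sketch: two complementary scores from scikit-learn, reported side by side.
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def complementary_metrics(X, labels):
    return {
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }

# print(complementary_metrics(X, labels))   # compare the verdicts, not just one number
```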
If metrics disagree, that is useful information. Disagreement usually means your clustering method and your data shape are not aligned.
4) Test cluster stability
If you resample the data and rerun clustering, do you get the same clusters? On imbalanced data, unstable clusters are common. A cluster that appears and disappears is not a segment. It is noise dressed up as structure.
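A workable sketch: refit on bootstrap resamples and compare each run's labels to the original labels on the shared points using adjusted Rand index (ARI). The helper name and defaults are my own, and KMeans stands in for whatever algorithm you actually use:

```python
# Sketch of a stability check: low or wildly varying ARI across resamples
# means the cluster structure is not stable under sampling noise.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_check(X, n_clusters, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=len(X), replace=True)     # bootstrap resample
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[idx])
        scores.append(adjusted_rand_score(base[idx], labels))   # agreement on the resampled points
    return float(np.mean(scores)), float(np.std(scores))

# mean_ari, std_ari = stability_check(X, n_clusters=3)
```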
5) Validate against downstream impact
The most honest validation is: does the clustering improve a real decision?
- Does it improve campaign lift when you target by cluster?
- Does it improve fraud review precision when you prioritize by cluster?
- Does it reduce support time when you route tickets by cluster?
If a clustering has a “great” silhouette score but no impact, it is not a win. It is a number.
How to cluster imbalanced data more safely
Use the right algorithm for the shape
- K-Means: fast, but expects equal-sized spherical clusters.
- Gaussian Mixture Models (GMM): better for elliptical clusters, still sensitive to imbalance.
- DBSCAN/HDBSCAN: better for uneven shapes and noise, and often better at preserving rare groups (see the sketch after this list).
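Here is a minimal side-by-side on deliberately non-spherical data. The dataset and DBSCAN parameters are illustrative, not tuned; HDBSCAN (in scikit-learn 1.3+ or the hdbscan package) is a drop-in alternative if you prefer it.

```python
# Sketch: same data, two algorithms. KMeans forces round, similar-sized
# clusters; DBSCAN follows density and can leave points unassigned (label -1).
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=10).fit_predict(X)

print("KMeans sizes:", np.bincount(km))
print("DBSCAN sizes (-1 = noise):",
      {int(k): int(v) for k, v in zip(*np.unique(db, return_counts=True))})
```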
Use features that reflect the decision
Many imbalance problems are feature problems. If your “rare group” is rare because the features do not capture what makes it unique, silhouette score will not save you. Start with a feature set that mirrors the decision you plan to make.
Don’t optimize K purely on silhouette
If you tune K by maximizing silhouette on imbalanced data, you often get clusters that look clean but ignore minority behavior. Use multi-signal selection: silhouette plus stability plus downstream impact.
How this connects to Clustering Algorithms and Silhouette Scores
If you want the broader context on silhouette score itself, start with Clustering Algorithms and Silhouette Scores. This post is the “watch out” version for imbalanced, real-world data.
Bottom line: silhouette score can look great even when clustering is wrong on imbalanced data. The fix is not to throw it away. The fix is to stop using it as a single-number verdict. Use per-cluster views, stability checks, and impact tests. That is how you avoid shipping clusters that impress dashboards and fail reality.
Taliferro helps teams validate clustering, segmentation, and ML outputs before they drive business decisions. Explore machine learning consulting.