Mastering Similarity Metrics for Lookalike Modelling
Introduction
Finding lookalike audiences is a core problem in most data science roles in marketing, sales, research and other fields that rely on consumer segmentation. If we look into our daily lives, it's used actively by most social media platforms and other interactive media. The idea is simple: given a set of objects (users, people, buyers, etc.), often called the audience, that exhibit a specific behaviour, how do we find new objects, often called candidates, that are most similar to them?
This is where nearest neighbours come into play as one of the most popular methods to address this. Unlike the popular KNN classifier, here we're not predicting a label; we're just retrieving the closest matches based on a similarity metric (we will talk about KNN and other methods later). But choosing the right metric is key, and not as simple as one might think. Different distance measures work better depending on the features: whether they are numerical, categorical or binary, whether they carry an ordinal meaning, how large their magnitude differences are, and so on. The data distribution and the way you organize the workflow also influence the results a lot.
This time, we’ll break down some of the most common similarity metrics — Euclidean, Cosine, Jaccard, Hamming, and Minkowski — explaining their mathematical foundations, when to use them, and how they can impact your projects.
Understanding Similarity Metrics in Lookalike Audiences
The foundation of any lookalike audience framework lies in defining similarity. Given a reference set of users, the goal is to identify candidates that are closest in some measurable way. The challenge is that different types of features require different ways to define “closeness.”
For numerical data, distance-based metrics like Euclidean or Minkowski are common, while categorical data often requires measures like Jaccard or Hamming. Another factor to consider is whether features are ordinal or encoded, as encoding can impact how distances are computed, and the scale on which features are defined can also influence the results.
Now, let’s break down the most commonly used similarity metrics. But before that, it’s important to clarify the distinction between similarity (denoted as S in the formulas) and distance (denoted as D). Similarity metrics measure how alike two objects are, while distance metrics quantify how far apart they are in a coordinate space. Both approaches serve the same goal — determining relationships between data points — but they do so from different perspectives.
Euclidean Distance
Euclidean distance is one of the most intuitive ways to measure similarity between two numerical points. It represents the straight-line distance between two vectors in an n-dimensional space and is defined as:

D(x, y) = √( Σᵢ (xᵢ − yᵢ)² ), where the sum runs over all n features.
This distance works well when all features are numerical and properly scaled. Since Euclidean distance is sensitive to magnitude differences, it’s important to standardize features before applying it. Without normalization, variables with larger ranges can dominate the distance calculation and generate biased results, making it ineffective when combining features of different scales (e.g. Dollar Sales and Engaged Activities for a Lead Generation analysis).
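As a minimal sketch (the feature names and values below are invented purely for illustration), this is one way to standardize features before computing Euclidean distances with NumPy and scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: [dollar_sales, engaged_activities]
audience = np.array([[1200.0, 8], [950.0, 5], [1800.0, 12]])
candidate = np.array([[1100.0, 7]])

# Fit the scaler on the audience and apply the same transform to the candidate,
# so the large-range feature (dollar sales) doesn't dominate the distance.
scaler = StandardScaler().fit(audience)
audience_scaled = scaler.transform(audience)
candidate_scaled = scaler.transform(candidate)

# Straight-line (Euclidean) distance from the candidate to each audience member
distances = np.linalg.norm(audience_scaled - candidate_scaled, axis=1)
print(distances)  # smaller = more similar
```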
Hamming Distance
Hamming Distance is a metric specifically designed for categorical data. It counts the number of positions at which two vectors differ:

D(x, y) = Σᵢ 1(xᵢ ≠ yᵢ), i.e. the count of positions i where the two vectors disagree.
This metric is often used when working with fixed-length categorical strings or encoded features, such as DNA sequences or error detection in digital communication, or, in a research context, demographic similarity between two groups. However, Hamming Distance can be misleading if categorical data is label-encoded rather than one-hot encoded, as ordinal labels introduce artificial numerical differences that do not reflect real similarity, so be careful when using it. The best way to think about it, in my opinion, is with supermarket lists: if you compare this month's list to last month's, item by item, at how many positions do they differ?
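A small sketch of that "supermarket list" idea, comparing two equal-length categorical vectors position by position (the lists themselves are made up):

```python
import numpy as np

# Two months of shopping, compared slot by slot
last_month = np.array(["milk", "bread", "eggs", "coffee", "apples"])
this_month = np.array(["milk", "rice",  "eggs", "tea",    "apples"])

# Hamming distance: the count of positions where the two vectors differ
hamming_count = np.sum(last_month != this_month)
# Normalized version (fraction of differing positions), which is what
# scipy.spatial.distance.hamming returns
hamming_fraction = hamming_count / len(last_month)

print(hamming_count, hamming_fraction)  # 2, 0.4
```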
Minkowski Distance
Minkowski Distance is a more general form of the Euclidean and Manhattan distances (the latter is not explained here, but it follows similar principles). It gives flexibility to adjust how distances are measured with a tunable exponent p:

D(x, y) = ( Σᵢ |xᵢ − yᵢ|^p )^(1/p)
When p=1, Minkowski simplifies to Manhattan Distance, which measures absolute differences between points. When p=2, it becomes Euclidean Distance, capturing straight-line distance. The ability to tweak p makes it useful for mixed data types, as different values change how feature magnitudes influence the final distance.
Think of it like adjusting a sensitivity dial — choosing different values of p lets you control whether you prioritize small, step-like movements (Manhattan) or direct, straight-line differences (Euclidean).
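A quick sketch of that dial in action, using SciPy on a made-up pair of points:

```python
from scipy.spatial.distance import minkowski

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

# p = 1 -> Manhattan distance (sum of absolute differences)
print(minkowski(x, y, p=1))  # 5.0
# p = 2 -> Euclidean distance (straight-line)
print(minkowski(x, y, p=2))  # ~3.606
# Larger p puts more weight on the single biggest coordinate difference
print(minkowski(x, y, p=4))
```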
Cosine Similarity
Unlike Euclidean distance, which measures absolute differences, Cosine Similarity focuses on the angle between two vectors rather than their magnitude. It is computed as:

S(x, y) = (x · y) / (‖x‖ ‖y‖)
Cosine similarity is especially useful in cases where the magnitude of values does not matter as much as their directional alignment. This makes it particularly effective for high-dimensional, sparse datasets such as text embeddings or user interaction vectors. Since it measures similarity rather than distance, its values range from -1 (completely opposite) to 1 (identical), with 0 indicating orthogonality.
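A minimal sketch with scikit-learn, where the two user vectors differ in magnitude but point in exactly the same direction (the values are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two users with the same interest profile but different activity levels
user_a = np.array([[2, 0, 4, 6]])
user_b = np.array([[1, 0, 2, 3]])  # exactly half of user_a's counts

# Cosine similarity ignores magnitude and compares direction only
print(cosine_similarity(user_a, user_b))  # [[1.0]] -> identical direction
```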
Jaccard Similarity
For categorical or binary data, Jaccard Similarity is often more appropriate. It measures how much overlap exists between two sets and is given by:

S(A, B) = |A ∩ B| / |A ∪ B|
Jaccard Similarity is commonly used for features that represent presence or absence, such as user preferences, purchase history, or binary-encoded attributes like demographics in consumer research. The best way to visualize it, in my opinion, is to imagine a Venn diagram for each audience record and candidate record pair. It works well when comparing entities based on shared elements but does not take magnitude into account, making it unsuitable for numerical data.
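A small sketch of the Venn-diagram intuition using plain Python sets (the category names are invented):

```python
# Purchase categories for an audience member and a candidate
audience_member = {"coffee", "books", "running shoes", "streaming"}
candidate = {"coffee", "books", "streaming", "board games"}

# Jaccard similarity = size of the intersection / size of the union
intersection = audience_member & candidate
union = audience_member | candidate
jaccard = len(intersection) / len(union)

print(jaccard)  # 3 shared items out of 5 total -> 0.6
```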
Choosing the Right Metric for Lookalike Audiences
The best similarity metric depends on the type of features in the dataset and the goal you have. If you are working with strictly numerical data, Euclidean or Minkowski distances are solid choices to explore, along with tuning their thresholds, indexes and the search algorithm you prefer to apply. When features are categorical, Jaccard and Hamming distances are better suited, especially if the data is binary or one-hot encoded.
Ordinal encoding can introduce misleading distance calculations, so it’s important to carefully choose an encoding strategy that aligns with the similarity metric being used. Additionally, in high-dimensional spaces, Euclidean distance often becomes less meaningful due to the “curse of dimensionality,” making Cosine Similarity a preferred alternative.
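As a rough, self-contained illustration of that effect (random data only, just to show the trend), you can watch how the relative gap between the nearest and farthest Euclidean distances shrinks as the number of dimensions grows:

```python
import numpy as np

rng = np.random.default_rng(42)

for dims in (2, 10, 100, 1000):
    points = rng.random((1000, dims))
    query = rng.random((1, dims))
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{dims:>4} dims -> relative contrast {contrast:.2f}")
```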
So in the end, it all depends on your methodology and your project objectives. A side-by-side visualization of these distances on your own data can also help build intuition for which one fits best.
Nearest Neighbour Search (KNN Distance-Based Lookalike Modelling)
This is the most intuitive and widely used approach for lookalike modelling. The idea is basically what was described earlier: given a set of reference objects, how do we compute the distance to every other object and classify the closest ones as lookalikes?
Each audience member is treated as a reference point, and the k-nearest neighbours are identified using a chosen similarity measure. It works best for structured datasets with well-defined features, which is why it's widely used in marketing, research segmentation, recommendation systems and so on. However, it is very sensitive to dataset size, because brute-force search is computationally expensive, so it sometimes needs to be optimized with Locality-Sensitive Hashing (LSH) or KD-Trees, depending on the data types you have.
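A minimal sketch with scikit-learn's NearestNeighbors (random data and an arbitrary feature count, purely for illustration), indexing the candidate pool and retrieving the closest candidates for each audience member:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
audience = rng.random((50, 5))      # known converters, 5 numeric features
candidates = rng.random((5000, 5))  # pool to search for lookalikes

# Index the candidate pool; 'auto' lets scikit-learn pick brute force,
# KD-Tree or Ball Tree depending on the data
nn = NearestNeighbors(n_neighbors=10, metric="euclidean", algorithm="auto")
nn.fit(candidates)

# For each audience member, get the 10 closest candidates and their distances
distances, indices = nn.kneighbors(audience)

# Collect the unique candidate rows that showed up as someone's neighbour
lookalike_ids = np.unique(indices)
print(len(lookalike_ids), "candidate lookalikes retrieved")
```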
Still, it is definitely the most common and most important approach to learn for building lookalike models that generate value.
Clustering-Based Lookalike Modelling (K-Means & DBSCAN)
Instead of retrieving the closest neighbours, clustering methods try to group similar audience members together. The idea is that if a candidate belongs to the same cluster as an existing audience member, they are likely a strong lookalike. K-Means Clustering is a popular method that partitions data into k clusters based on similarity, typically using Euclidean Distance, but this can be adapted to what works best for you. On the other hand, DBSCAN is a density-based clustering algorithm that is better suited for datasets with irregularly shaped clusters and outliers.
This approach is useful for large-scale lookalike modelling where audience segmentation is important, such as campaign optimization or customer persona clustering for targeted marketing. One of the biggest benefits is that, unlike KNN, clustering doesn't require a specific similarity threshold; similarity is inferred from the group. However, clustering requires careful tuning of hyperparameters, such as the number of clusters in K-Means, and can be sensitive to feature scaling: without that care, it can split your data into tons of small groups that don't actually make much sense in practical terms.
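A small sketch (random data, arbitrary hyperparameters) showing both algorithms used the same way: fit clusters over audience and candidates together, then flag candidates that land in a cluster containing audience members:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
audience = rng.random((200, 4))     # known audience members, 4 numeric features
candidates = rng.random((2000, 4))  # pool of potential lookalikes

# Scale audience and candidates together so distances are comparable
data = StandardScaler().fit_transform(np.vstack([audience, candidates]))
is_audience = np.arange(len(data)) < len(audience)

# K-Means: candidates sharing a cluster with audience members are lookalikes
kmeans_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(data)
audience_clusters = set(kmeans_labels[is_audience])
kmeans_lookalikes = (~is_audience) & np.isin(kmeans_labels, list(audience_clusters))

# DBSCAN: density-based, handles irregular cluster shapes and flags outliers as -1
dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(data)
audience_dense = set(dbscan_labels[is_audience]) - {-1}  # drop the noise label
dbscan_lookalikes = (~is_audience) & np.isin(dbscan_labels, list(audience_dense))

print(kmeans_lookalikes.sum(), dbscan_lookalikes.sum())
```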
Embedding-Based Lookalike Modelling
For unstructured or high-dimensional data, such as text or deep learning embeddings, an embedding-based approach can be more effective. In this method, each audience member and candidate is transformed into a high-dimensional vector representation (embedding). Cosine Similarity is the most common metric used in this case, and it measures how aligned these embeddings are in the vector space. The closer the cosine similarity score is to 1, the more similar two entities are.
One of the most common use cases I have for this in my day to day is Retrieval Augmented Generation workflows: by embedding the documents and content I want to use as reference, I can apply cosine similarity within an embedding-based model to find the piece of content most similar to my question, which most of the time means it contains the correct answer.
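A minimal sketch of that retrieval step; the document texts are invented and the embeddings are random placeholders standing in for a real embedding model, just so the ranking logic is runnable end to end:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Refund policy: customers can return items within 30 days.",
    "Shipping takes 3 to 5 business days within the country.",
    "Our loyalty program gives 1 point per dollar spent.",
]
question = "How long do I have to return a product?"

# Placeholder embeddings: in a real workflow these would come from an
# embedding model (e.g. a sentence encoder); random vectors are used here
# only to keep the example self-contained.
rng = np.random.default_rng(7)
doc_embeddings = rng.random((len(documents), 384))
question_embedding = rng.random((1, 384))

# Rank documents by how aligned their embeddings are with the question embedding
scores = cosine_similarity(question_embedding, doc_embeddings)[0]
ranked = np.argsort(scores)[::-1]

for idx in ranked:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```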
Conclusion
What we can say about lookalike modelling is that context matters a lot, both for choosing the distance or similarity metric and for the method used to find the matches. It's just a matter of examining your project, your needs and the best practices in your industry, and going forward with the best solution!
Thanks for reading! Follow me to continue with the series — Vinícius A. R. Z.
I'm a Data Scientist, and when people ask me what I work with, my answer is always "I work with Decision Intelligence", because I try not to limit myself to data! It's like they say, you have to be smart as a fox… 🦊