How we Designed an Entity Resolution Logic Engine
May 3, 2025
8 Min Read

Post By: Raj Gupta
“The art of Entity Resolution lies not in broad searches that yield many results, but in the precision to find the most valuable information.” — Anonymous Data Wrangler
As H. Dunn put it — record linkage is the process of assembling pieces of information that refer to the same individual.
Remark: Performing Entity Resolution (ER) on a single database is often referred to as Deduplication and forms part of the data cleaning pipeline.
This work stems from a use-case-driven graph Entity Resolution between individuals’ profiles, matching record to record, built in collaboration with Jayshankar Bera.
Problem Statement
Entity Resolution is often treated as a one-size-fits-all game: throw in your names, addresses, phone numbers, hit them with a fuzzy or cosine similarity algorithm, and voilà — matches! But that approach is like using a flamethrower to light a single candle.
Many records in our unstructured data lacked consistency: misspelled names, incomplete addresses, missing values, and no unique identifiers. We set out to enhance the traditional attribute-matching approach with custom-built weightage grounded in our domain study.
Semi-structured data can be messy when inconsistent. We had bet on a particular attribute to identify the same person mentioned in two different records so we could deduplicate the entity in our knowledge graph database. But upon study, 57% of records were missing the attribute we initially relied on for matching.
Customer records also have age varying by ‘date of entry’ without a date of birth, so we need an error margin of ± 1 yr (not string similarity, since 29 vs. 30 and 30 vs. 31 are not treated equally). The records might have “St” vs. “Street” or synonyms (“Apartment” vs. “Block”) that need custom handling. And door numbers that differ by just one digit: conventional string similarity scores will happily tell you “Door No. 123” and “Door No. 121” are 96% similar, and that’s a catastrophe, not an insight.
“Needle in the Haystack” Conundrum
What we need is an engine that knows where to generalize (so “Street” and “St” are close) and where to specialize (so “Block B” and “Block D” get penalized). It’s about deciding when to focus on the white board and when to shift attention to the black dot. You won’t care about 99.99 % of the board — you just want that one dot. You need to know:
When to blur (generalize across similar tokens)
When to sharpen (insist on exact numeric or categorical matches)
Why Should You Care?
Before customizing our ER logic we went through a couple of standard libraries (call it exploration), and that exploration is what pushed us to build our own:
Modern algorithms for blanket similarity include Cosine, Levenshtein distance, and Longest Common Substring, which are quite handy in general, and this is where we found we needed edge-case handling. The approach was valid but not reliable.
I was introduced to the much sought-after fuzzy matching and the BGE Reranker, and after testing a proof of concept I still felt uneasy: partial ratios fluctuate in usefulness, since they depend on token order and on the semi-structured data we had, neither of which I could control. It was not reliable, and testing for validity cost us time and resources for only decent results.
Today we still find developers applying this open-source library directly and glorifying it, which made us wonder whether anything new had been tried.
Preprocessing with phonetic encoding felt like digging up old tricks from the early 2000s, as many names in our databases were already spelt closely enough (“Sreekanth”, “Srikanth”). It was not valid.
Record-Linkage offers a Compare object that was a key component for data matching and comparison. We used a Llama model to split our sampled data into structured JSON, which we then stored in a DataFrame.
Remark: It was trial and error. At this point we felt (for instance) that Address was handled well by Damerau-Levenshtein while Jaro and Jaro-Winkler gave poor results, but later, when splitting Address down to its tokens, Jaro-Winkler was chosen for one factor (mentioned later).
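For illustration, here is a minimal sketch of that exploration phase with the recordlinkage library; the sample records, column names, blocking key, and gauss scale are assumptions, not our production configuration.

```python
import pandas as pd
import recordlinkage

# Hypothetical records standing in for the Llama-extracted JSON; column names are assumed.
df_a = pd.DataFrame([{"name": "Srikanth R", "gender": "M", "age": 30,
                      "address": "Apt A, Block B", "phone": "9000000001"}])
df_b = pd.DataFrame([{"name": "Sreekanth R", "gender": "M", "age": 31,
                      "address": "Apt A, Block B", "phone": "9000000001"}])

# Cheap blocking key to limit the number of candidate pairs
indexer = recordlinkage.Index()
indexer.block("gender")
candidate_pairs = indexer.index(df_a, df_b)

# One comparison rule per attribute, each producing a similarity column
compare = recordlinkage.Compare()
compare.string("name", "name", method="damerau_levenshtein", label="name")
compare.string("address", "address", method="jarowinkler", label="address")
compare.numeric("age", "age", method="gauss", scale=4.3, label="age")
compare.exact("phone", "phone", label="phone")

features = compare.compute(candidate_pairs, df_a, df_b)
print(features)
```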
We realized how each excelled in specific corners but failed in others.
Our Hybrid Approach: Entity Resolution with Record Linkage
Overview
We built a NormalizeEntity class to implement a two-stage approach to ER (linkage) for person nodes in a Neo4j graph. It:
Prefilters candidate entities based on name/alias similarity via an efficient database query using Levenshtein distance to reduce load.
Scores each remaining candidate on (a) weighted attribute similarity, (b) structural/relationship similarity using weighted, probabilistic measures, and (c) penalty scoring logic tied to the weights of (a) and (b).
Normalizes scores over the available data (missing data is excluded when averaging the total weighted score).
Returns those candidates whose total score exceeds a configurable threshold.
This design balances performance (by offloading coarse filtering to Neo4j) with fine-grained, explainable similarity scoring.
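As a rough sketch (not the production code), the class shape looks something like the following. The method names get_similar_entity, score_primary_node, get_candidate_graph, and score_relationships come from later in this post; the constructor signature, the normalize step, and the default threshold are assumptions.

```python
class NormalizeEntity:
    """Two-stage entity resolution (linkage) for person nodes in Neo4j (sketch)."""

    def __init__(self, neo4j_connector, attribute_weights, relationship_weights,
                 final_threshold=0.85):
        self.db = neo4j_connector                  # async Bolt driver wrapper
        self.attribute_weights = attribute_weights
        self.relationship_weights = relationship_weights
        self.final_threshold = final_threshold     # configurable cut-off

    async def resolve(self, primary_node, relationships):
        # Stage 1: coarse blocking inside Neo4j on name/alias similarity
        candidates = await self.get_similar_entity(primary_node)

        results = []
        for candidate in candidates:
            # Stage 2: fine-grained, explainable scoring
            attr_score = self.score_primary_node(primary_node, candidate)
            subgraph = await self.get_candidate_graph(candidate)
            rel_score = self.score_relationships(relationships, subgraph)

            # Normalize over the attributes/relationships actually present
            total = self.normalize(attr_score + rel_score, primary_node, candidate)
            if total >= self.final_threshold:
                results.append((candidate, total))
        return results
```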

Initialization & Configuration
Neo4j Connector: Manages an async Bolt driver for all Cypher queries, encapsulated in neo4jconnector.
Weights: Each attribute is assigned a domain-expert weight, based on what we consider gold-standard strong evidence and what we would only nudge indirectly when matching.
Attributes (name, age, address, etc.) receive domain-informed weights; sensitive/highly distinguishing properties (e.g. Passport) get larger weights.
Relationships (e.g. HAS_FATHER, INVESTIGATES) reflect the relative evidence provided by shared graph edges.
Max Scores: Precomputed to later convert raw scores into percentages for interpretability.
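For a sense of shape, here is an illustrative configuration; the numbers below are made up, as the actual weights in our engine are domain-tuned.

```python
# Illustrative only: attribute and relationship weights, plus precomputed maxima.
attribute_weights = {
    "name": 0.20, "alias": 0.10, "age": 0.10, "gender": 0.05,
    "phone": 0.15, "address": 0.15,
    "passport": 0.25,   # highly distinguishing identifier gets a larger weight
}
relationship_weights = {
    "HAS_FATHER": 0.6,
    "INVESTIGATES": 0.4,
}

# Max scores are precomputed so raw totals can be reported as percentages later.
max_attribute_score = sum(attribute_weights.values())
max_relationship_score = sum(relationship_weights.values())
```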
Pre-filtering candidates (Blocking or Indexing)
Why: A rules-based deterministic filter for record linkage can dramatically reduce the number of expensive pairwise comparisons while lifting precision.
How: Uses Neo4j’s APOC library to compute Levenshtein similarity, filtering to candidates with a ≥ 75% name/alias match (threshold chosen through empirical testing).
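A hedged sketch of that blocking step with the async Neo4j Python driver is below; the label and property names (Person, name), the connection details, and the exact query shape are assumptions, with apoc.text.levenshteinSimilarity providing the ≥ 0.75 cut-off.

```python
from neo4j import AsyncGraphDatabase

# Assumed schema: Person nodes with a 'name' property; adjust to your database.
PREFILTER_CYPHER = """
MATCH (p:Person)
WITH p, apoc.text.levenshteinSimilarity(toLower(p.name), toLower($name)) AS name_sim
WHERE name_sim >= 0.75
RETURN p, name_sim
ORDER BY name_sim DESC
"""

async def get_similar_entity(name, uri="bolt://localhost:7687", auth=("neo4j", "pass")):
    # Runs the coarse blocking query inside Neo4j and returns the shortlist
    async with AsyncGraphDatabase.driver(uri, auth=auth) as driver:
        async with driver.session() as session:
            result = await session.run(PREFILTER_CYPHER, name=name)
            return [(record["p"], record["name_sim"]) async for record in result]
```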
Attribute Scoring
Each attribute uses a similarity function tailored to its data type:
Strings (name, alias, other text): fuzz.token_sort_ratio / 100 for order-invariant, token-based matching.
Numeric (age): Gaussian kernel np.exp(-((age1 - age2)**2) / (2 * sigma**2))
Why: Smooth similarity that penalizes large age differences but recognizes small variations.
Benefit: Smooth penalty for off-by-one errors, but a steep drop-off for larger differences. This mimics Bayesian-style match likelihoods, which is a hallmark of probabilistic tuning for record linkage.
We tested difference-based scoring across absolute, relative, and Gaussian differences. Gaussian highlights age differences better, and 29 vs. 30 scores the same as 30 vs. 31. Setting sigma=4.3 (empirical testing) stresses age similarity and allows an easy filter at 0.90, or one may opt to add a penalty.
Enum / Exact (gender, phone): binary 0/1 match, with or without a penalty applied to the result.
Addresses: a hybrid approach combining:
Door-number Gaussian on extracted house numbers using regex.
Jaro-Winkler for overall string closeness. Jaro-Winkler is a good choice when you want to insist on near-exact address matches while remaining tolerant of minor variations.
Hierarchical token similarity on key segments (block, sector, apt), split by comma-separation to isolate the segment.
Why: We added a context-sensitive similarity function because addresses are multi-part beasts: building, block, sector, apartment, etc. We:
Parse out “keywords” (e.g., block, sector)
Normalize synonyms (“apt” ↔ “apartment”)
Compare only the relevant segments as binary 0/1, with no compromise. So “Apt A, Block B” and “Apt A, Block C” give 0.5, whereas plain text similarity would report about 0.98.
All component similarities are multiplied by their attribute weight and summed into an overall attribute score, e.g. 0.3*Gauss + 0.4*JW + 0.3*Token (see the sketch below). Now we can visualize how small weighted components fit into larger weighted components.
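To make these pieces concrete, here is a minimal sketch of the attribute scorers, assuming rapidfuzz and NumPy; the keyword list, the door-number sigma, and the crude synonym normalization are simplified assumptions, and 0.3/0.4/0.3 is the example blend above.

```python
import re
import numpy as np
from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler

def name_similarity(a, b):
    # Order-invariant, token-based string matching scaled to 0..1
    return fuzz.token_sort_ratio(a, b) / 100

def age_similarity(age1, age2, sigma=4.3):
    # Gaussian kernel: off-by-one ages stay near 1.0, larger gaps decay quickly
    return float(np.exp(-((age1 - age2) ** 2) / (2 * sigma ** 2)))

def door_number_similarity(addr1, addr2, sigma=1.0):
    # Compare the extracted door/house numbers numerically, not as strings
    n1, n2 = re.search(r"\d+", addr1), re.search(r"\d+", addr2)
    if not (n1 and n2):
        return 0.0
    diff = int(n1.group()) - int(n2.group())
    return float(np.exp(-(diff ** 2) / (2 * sigma ** 2)))

def segment_similarity(addr1, addr2):
    # Binary 0/1 comparison of keyword-bearing, comma-separated segments
    keywords = ("block", "sector", "apt")
    def segments(addr):
        addr = addr.lower().replace("apartment", "apt")   # crude synonym normalization
        out = {}
        for seg in addr.split(","):
            for kw in keywords:
                if kw in seg:
                    out[kw] = seg.strip()
        return out
    s1, s2 = segments(addr1), segments(addr2)
    shared = set(s1) & set(s2)
    if not shared:
        return 0.0
    return sum(s1[k] == s2[k] for k in shared) / len(shared)

def address_similarity(addr1, addr2):
    # Example blend from the post: 0.3*Gauss + 0.4*JW + 0.3*Token
    return (0.3 * door_number_similarity(addr1, addr2)
            + 0.4 * JaroWinkler.normalized_similarity(addr1.lower(), addr2.lower())
            + 0.3 * segment_similarity(addr1, addr2))
```

For example, segment_similarity("Apt A, Block B", "Apt A, Block C") returns 0.5, matching the no-compromise behaviour described above.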
Structural (Relationship) Scoring
Why: Shared connections (e.g. same father, same crime scene) provide independent evidence that two person nodes refer to the same real entity.
Parameters:
We set alpha=0.5 to equally prioritize the source node (e.g., a person’s name) and the target node (e.g., their father’s name) in relationship scoring.
Weights are inherited from relationship_weights.
Matching relational evidence (e.g., HAS_FATHER, INVESTIGATES) uses a penalty function so that plenty of weak positive evidence doesn’t swamp a single strong negative evidence.
Benefit: Guarantees that below-threshold signals decay rapidly, preventing “noise” edges from inflating our match score.
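A minimal sketch of how alpha could blend the two ends of a shared edge is shown below; the exact combination in our engine is richer, so treat the formula here as an assumption-laden illustration rather than the implementation.

```python
def relationship_evidence(src_sim, tgt_sim, rel_weight, alpha=0.5):
    """Blend source-node and target-node similarity for one shared edge (sketch).

    alpha=0.5 gives equal priority to the source (e.g. the person's own name)
    and the target (e.g. the father's name); rel_weight comes from relationship_weights.
    """
    blended = alpha * src_sim + (1 - alpha) * tgt_sim
    return rel_weight * blended

# e.g. a HAS_FATHER edge where the person names match strongly but the fathers differ
score = relationship_evidence(src_sim=0.95, tgt_sim=0.40, rel_weight=0.6)
```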
Relationship Penalties Logic
A penalty is applied when similarity drops: an exponential decay multiplies the predefined relationship weights, for example when father or spouse names differ.
Parameters
inst: The instance value, which represents the similarity between two nodes.
w: The weight of the relationship.
k: A scaling factor that controls the rate of decay.
k = math.log(2) / (w / 2)
When the deficit is 0 (i.e., inst is greater than or equal to the threshold), the decay factor is 1 and the penalized score equals the instance value. As the deficit grows, the decay factor shrinks and the penalized score decays exponentially.
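Putting those parameters together, a hedged sketch of the penalty is below; the 0.9 threshold and the way the decayed value feeds back into the weighted relationship score are assumptions.

```python
import math

def relationship_penalty(inst, w, threshold=0.9):
    # inst: similarity between the two related nodes (0..1)
    # w:    weight of the relationship type
    # k:    scaling factor controlling the rate of decay
    k = math.log(2) / (w / 2)
    deficit = max(0.0, threshold - inst)   # 0 when inst >= threshold
    decay = math.exp(-k * deficit)         # 1.0 at zero deficit, shrinks fast below it
    return inst * decay                    # penalized contribution of this edge

print(relationship_penalty(0.95, w=0.6))   # above threshold: unchanged
print(relationship_penalty(0.40, w=0.6))   # below threshold: decays exponentially
```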
End-to-End Normalization Flow
Extract the “primary” node and its relationships from input JSON.
Prefilter via get_similar_entity() to retrieve a shortlist from Neo4j.
For each candidate:
Compute attribute score (score_primary_node).
Fetch the candidate’s local subgraph (get_candidate_graph).
Compute relationship score (score_relationships).
Sum into a total score and convert it to a percentage of the theoretical maximum.
Filter candidates whose total exceeds final_threshold.
Return a list of (candidate_node, score) tuples.
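Tying the flow together, a hypothetical call might look like the following, reusing the NormalizeEntity sketch from earlier; the input payload shape and the connector instance are illustrative.

```python
import asyncio

async def main():
    # neo4jconnector, attribute_weights and relationship_weights as sketched above
    resolver = NormalizeEntity(neo4jconnector, attribute_weights,
                               relationship_weights, final_threshold=0.85)
    record = {
        "primary": {"name": "Srikanth R", "age": 31, "address": "Apt A, Block B"},
        "relationships": [{"type": "HAS_FATHER", "target": {"name": "Ramesh"}}],
    }
    matches = await resolver.resolve(record["primary"], record["relationships"])
    for node, score in matches:
        print(node, f"{score:.1%}")

asyncio.run(main())
```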
String Algorithms Brief Comparison
Jaro: Measures similarity from the number of matching characters and their positions, allowing for transpositions. Near-identical strings such as “123 Main St” and “123 Main St” score very high.
Jaro-Winkler: Similar to Jaro, but gives more weight to prefix matches (up to 4 characters). It rewards addresses with matching prefixes. e.g., “123 Main St” and “123 Main St Apt 101” would score higher than Jaro.
Levenshtein: Measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. It calculates the edit distance between two addresses.
Damerau-Levenshtein: Similar to Levenshtein, but also considers transpositions (swapping two adjacent characters). It’s more lenient than Levenshtein.
Cosine: Measures the cosine similarity between two vectors in a high-dimensional space, often used for comparing sparse string representations (like addresses). It computes the normalized dot product of the two vectors (e.g., of token or n-gram counts).
Jaccard: Token-based similarity that measures the overlap between two sets by dividing the size of their intersection by the size of their union. It treats each string as a set of tokens or n-grams and looks for common pieces.
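To see how some of these metrics diverge on the address example above, here is a quick comparison with rapidfuzz; the printed values depend on the library and version, so don’t read them as our scores.

```python
from rapidfuzz.distance import Jaro, JaroWinkler, Levenshtein, DamerauLevenshtein

a, b = "123 Main St", "123 Main St Apt 101"
metrics = [("Jaro", Jaro), ("Jaro-Winkler", JaroWinkler),
           ("Levenshtein", Levenshtein), ("Damerau-Levenshtein", DamerauLevenshtein)]
for name, metric in metrics:
    # normalized_similarity scales every metric to the 0..1 range
    print(f"{name:20s} {metric.normalized_similarity(a, b):.3f}")
```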
Key Takeaways
The benefits of this workflow are many: we slash manual workload and lift precision and recall. Recall can be boosted further by a post-processing ‘sort’ component that categorizes results by the data available (non-empty fields), from rich data (having most of the required fields/attributes) to poor data (lacking them). A 79% match on rich data means something different from a 93% match on poor data.
Limitations
Name disambiguation: a misspelt name may itself be a valid human name already in the database, or belong to a genuinely new person (e.g., “Shekhar” and “Sekhar” could be one misspelt name or two valid names). There is no direct way to resolve this.
Next Steps
ML-based Parameter Tuning with training data (classified record pairs): Exploring active learning as a valuable strategy for weight optimization and quantifying the error rate.
Collective ER: a relational clustering algorithm to map entities to communities and a probabilistic generative model to uncover hidden networks, inspired by Indrajit Bhattacharya.
References:
record-linkage documentation (Library)
m and u values in the Fellegi-Sunter model (Mathematics behind a probabilistic similarity scoring model)