ReLinker 1.1.0
dotnet add package ReLinker --version 1.1.0
NuGet\Install-Package ReLinker -Version 1.1.0
<PackageReference Include="ReLinker" Version="1.1.0" />
<PackageVersion Include="ReLinker" Version="1.1.0" />
<PackageReference Include="ReLinker" />
paket add ReLinker --version 1.1.0
#r "nuget: ReLinker, 1.1.0"
#addin nuget:?package=ReLinker&version=1.1.0
#tool nuget:?package=ReLinker&version=1.1.0
ReLinker
A Powerful C# Record Linkage (Entity Resolution) Framework
🚀 Overview
ReLinker is a .NET library for deduplicating and linking records within or across datasets. It provides string similarity functions (Levenshtein, Jaro, TF-IDF), blocking rules to limit the number of comparisons, Fellegi-Sunter probabilistic scoring with EM parameter estimation, and Union-Find clustering.
📚 Key Concepts
- Record Linkage: Identifying records that refer to the same entity within or across databases.
- Blocking: Reducing the number of candidate record pairs for comparison.
- Similarity Computation: Quantifying how similar two records are.
- Probabilistic Matching: Scoring and classifying pairs using statistical models.
- Clustering: Grouping matched records into entities.
🏗️ System Architecture
The ReLinker pipeline consists of:
- Data Loading
- Blocking
- Similarity Computation
- Probabilistic Scoring (Fellegi-Sunter + EM)
- Clustering (Union-Find)
🧩 Core Components
1. Record Structure
public class Record
{
public string Id { get; set; }
public Dictionary<string, string> Fields { get; set; } = new();
}
2. Data Loading System
Loader Interface:
public interface IDatabaseLoader
{
Task<List<Record>> LoadRecordsAsync();
IEnumerable<Record> LoadRecordsInBatches(int batchSize, int startOffset = 0);
}
Factory:
public static class DatabaseLoaderFactory
{
public static IDatabaseLoader CreateLoader(
string type, string connectionStringOrUrl, string queryOrCollection, string providerName = null)
{
// Compare case-insensitively without depending on the current culture.
return type.ToLowerInvariant() switch
{
"generic" => new GenericDbLoader(connectionStringOrUrl, queryOrCollection, providerName),
"duckdb" => new DuckDbLoader(connectionStringOrUrl, queryOrCollection),
"ravendb" => new RavenDbLoader(connectionStringOrUrl, queryOrCollection),
_ => throw new ArgumentException($"Unknown loader type '{type}'. Expected \"generic\", \"duckdb\", or \"ravendb\".", nameof(type))
};
}
}
3. Blocking
Blocking Rule:
public class BlockingRule
{
public string Name { get; set; }
public Func<Record, string> RuleFunc { get; set; }
}
Blocking Helper:
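public static class BlockingHelper
{
    public static IEnumerable<(Record, Record)> GenerateCandidatePairs(
        List<Record> records, List<BlockingRule> rules)
    {
        // Key blocks by (rule name, key) so equal keys from different rules never share a block.
        var blocks = new Dictionary<(string Rule, string Key), List<Record>>();
        foreach (var rule in rules)
        {
            foreach (var record in records)
            {
                var key = rule.RuleFunc(record);
                if (string.IsNullOrEmpty(key)) continue; // no usable key: skip this rule for the record
                if (!blocks.TryGetValue((rule.Name, key), out var bucket))
                    blocks[(rule.Name, key)] = bucket = new List<Record>();
                bucket.Add(record);
            }
        }
        // Store each pair once in canonical (ordinal) order so (a, b) and (b, a) deduplicate.
        var seen = new HashSet<(string, string)>();
        foreach (var block in blocks.Values)
            for (int i = 0; i < block.Count; i++)
                for (int j = i + 1; j < block.Count; j++)
                {
                    var a = block[i].Id;
                    var b = block[j].Id;
                    var pair = string.CompareOrdinal(a, b) <= 0 ? (a, b) : (b, a);
                    if (seen.Add(pair))
                        yield return (block[i], block[j]);
                }
    }
}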
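To see why blocking matters, it helps to count comparisons. The sketch below is illustrative arithmetic only and is not part of ReLinker; the names `PairCount` and `BlockedPairCount` and the sample keys are assumptions made for this example. It compares the number of pairs in an exhaustive comparison with the number produced when records are only compared inside their own block.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Number of unordered pairs among n records: n * (n - 1) / 2.
long PairCount(long n) => n * (n - 1) / 2;

// Pairs generated when records are only compared with others sharing their blocking key.
long BlockedPairCount(IEnumerable<string> blockingKeys) =>
    blockingKeys.GroupBy(k => k).Sum(g => PairCount(g.Count()));

// Hypothetical phone-prefix keys for six records.
var keys = new[] { "555123", "555123", "555123", "555987", "555987", "444000" };
Console.WriteLine(PairCount(keys.Length)); // prints 15 (every record against every other)
Console.WriteLine(BlockedPairCount(keys)); // prints 4  (3 + 1 + 0 pairs inside the three blocks)
```

The gap grows quadratically with dataset size, which is why a selective blocking key matters more than any other tuning choice on large inputs.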
4. Similarity Functions
Define similarity logic per field:
public class SimilarityFunction
{
public string FieldName { get; set; }
public Func<Record, Record, double> Compute { get; set; }
}
Example Implementations (see below for usage):
- Levenshtein, Jaro, and TF-IDF similarity
- Field-specific (e.g., exact match for phone/email)
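As an illustration of what a string similarity function computes, here is a minimal sketch of a normalized Levenshtein similarity. It is not ReLinker's implementation: the library's `Similarity.LevenshteinSimilarity` also takes an IDF dictionary in the usage example, while this version is plain edit distance scaled to the range 0 to 1. The function names are hypothetical.

```csharp
using System;

// Classic dynamic-programming edit distance (insert, delete, substitute each cost 1),
// keeping only two rows of the table in memory.
int LevenshteinDistance(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        (prev, curr) = (curr, prev); // the row just filled becomes the previous row
    }
    return prev[b.Length];
}

// Scale to [0, 1]: 1.0 means identical strings, 0.0 means the distance equals the longer length.
double NormalizedLevenshtein(string a, string b)
{
    int maxLen = Math.Max(a.Length, b.Length);
    return maxLen == 0 ? 1.0 : 1.0 - (double)LevenshteinDistance(a, b) / maxLen;
}
```

A function like this could be plugged into `SimilarityFunction.Compute`, for example `(r1, r2) => NormalizedLevenshtein(r1.Fields["address"], r2.Fields["address"])`.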
5. Probabilistic Matching (Fellegi-Sunter)
public static class MatchScorer
{
public static List<ScoredPair> Score(
IEnumerable<(Record, Record)> pairs,
List<SimilarityFunction> functions,
double[] mProbs, double[] uProbs)
{
var results = new List<ScoredPair>();
foreach (var (r1, r2) in pairs)
{
double score = 0;
for (int i = 0; i < functions.Count; i++)
{
double sim = functions[i].Compute(r1, r2);
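// The 1e-6 terms guard against division by zero and log(0). Note that sim cancels out of the
// ratio whenever sim * uProbs[i] is much larger than 1e-6, so each field contributes roughly
// log(m/u) for any non-zero similarity and log(1e-6) when sim is 0.
score += Math.Log((sim * mProbs[i]) / (sim * uProbs[i] + 1e-6) + 1e-6);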
}
results.Add(new ScoredPair { Record1 = r1, Record2 = r2, Score = score });
}
return results;
}
public static (double[], double[]) EstimateParametersWithEM(
List<ScoredPair> scoredPairs, List<SimilarityFunction> functions, int maxIterations = 10)
{
// (see full implementation in repo)
throw new NotImplementedException("EM implementation here...");
}
}
public class ScoredPair
{
public Record Record1 { get; set; }
public Record Record2 { get; set; }
public double Score { get; set; }
}
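For reference, the textbook Fellegi-Sunter model treats each field comparison as a binary outcome. With m_i the probability that field i agrees for a true match and u_i the probability that it agrees for a non-match, the field weight is ln(m_i / u_i) on agreement and ln((1 - m_i) / (1 - u_i)) on disagreement, and a pair's score is the sum over fields. The sketch below shows that binary form together with a basic EM loop for estimating m, u, and the match prevalence p. It is an assumption-laden illustration, not the repository's `EstimateParametersWithEM`: it works on agreement patterns rather than `ScoredPair` objects, assumes fields are conditionally independent, and expects initial probabilities strictly between 0 and 1. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Log-likelihood-ratio weight for one field with a binary agree/disagree outcome.
double FieldWeight(bool agrees, double m, double u) =>
    agrees ? Math.Log(m / u) : Math.Log((1 - m) / (1 - u));

// Total match weight: sum of per-field weights.
double MatchWeight(bool[] agreement, double[] m, double[] u)
{
    double total = 0;
    for (int i = 0; i < agreement.Length; i++)
        total += FieldWeight(agreement[i], m[i], u[i]);
    return total;
}

// EM over agreement patterns (one bool[] per candidate pair).
(double[] M, double[] U, double P) EstimateWithEm(
    IReadOnlyList<bool[]> patterns, double[] initialM, double[] initialU,
    double initialP = 0.1, int maxIterations = 20)
{
    const double eps = 1e-6; // keeps probabilities away from exactly 0 and 1
    int k = initialM.Length;
    var m = (double[])initialM.Clone();
    var u = (double[])initialU.Clone();
    double p = initialP;

    for (int iter = 0; iter < maxIterations; iter++)
    {
        // E-step: posterior probability that each pair is a match.
        var g = new double[patterns.Count];
        for (int n = 0; n < patterns.Count; n++)
        {
            double likeM = p, likeU = 1 - p;
            for (int i = 0; i < k; i++)
            {
                likeM *= patterns[n][i] ? m[i] : 1 - m[i];
                likeU *= patterns[n][i] ? u[i] : 1 - u[i];
            }
            g[n] = likeM / (likeM + likeU);
        }

        // M-step: re-estimate m, u, and p from the posteriors.
        double sumG = g.Sum(), sumNotG = patterns.Count - sumG;
        for (int i = 0; i < k; i++)
        {
            double agreeM = 0, agreeU = 0;
            for (int n = 0; n < patterns.Count; n++)
            {
                if (!patterns[n][i]) continue;
                agreeM += g[n];
                agreeU += 1 - g[n];
            }
            m[i] = Math.Clamp(agreeM / sumG, eps, 1 - eps);
            u[i] = Math.Clamp(agreeU / sumNotG, eps, 1 - eps);
        }
        p = Math.Clamp(sumG / patterns.Count, eps, 1 - eps);
    }
    return (m, u, p);
}
```

To use something like this with continuous similarity scores, one option is to threshold each score into agree/disagree first (for example, agree when the similarity is at least 0.85); the right cut-off depends on the field.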
6. Clustering (Union-Find)
public class UnionFind
{
    private readonly Dictionary<string, string> parent = new();

    // Returns the representative of x's set, registering x as a singleton if it is new.
    public string Find(string x)
    {
        if (!parent.TryGetValue(x, out var p))
            return parent[x] = x;
        return p == x ? x : (parent[x] = Find(p)); // path compression
    }

    public void Union(string x, string y)
    {
        string rootX = Find(x), rootY = Find(y);
        if (rootX != rootY) parent[rootY] = rootX;
    }

    public Dictionary<string, List<string>> GetClusters()
    {
        var clusters = new Dictionary<string, List<string>>();
        // Snapshot the keys: Find rewrites entries during path compression.
        foreach (var item in parent.Keys.ToList())
        {
            var root = Find(item);
            if (!clusters.TryGetValue(root, out var members))
                clusters[root] = members = new List<string>();
            members.Add(item);
        }
        return clusters;
    }
}
🧑‍💻 Full Working Example: Customer Deduplication
// 1. Load records from a database (SQLite example)
var loader = DatabaseLoaderFactory.CreateLoader(
"generic",
"Data Source=customers.db",
"SELECT id, first_name, last_name, phone, email, address FROM customers",
"System.Data.SQLite"
);
var records = await loader.LoadRecordsAsync();
// 2. Build IDF dictionary (see helper below)
var idf = BuildIdfDictionary(records);
// 3. Define blocking rules
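var blockingRules = new List<BlockingRule>
{
// Guard against short phone numbers and emails without '@', which would otherwise throw.
new BlockingRule { Name = "PhonePrefix", RuleFunc = r => r.Fields["phone"].Length >= 6 ? r.Fields["phone"][..6] : r.Fields["phone"] },
// Records with a malformed email get a unique key so they land in a block of their own.
new BlockingRule { Name = "EmailDomain", RuleFunc = r => r.Fields["email"].Contains('@') ? r.Fields["email"].Split('@')[1] : $"no-domain:{r.Id}" }
};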
// 4. Define similarity functions
var similarities = new List<SimilarityFunction>
{
new SimilarityFunction
{
FieldName = "FullName",
Compute = (r1, r2) =>
{
string name1 = $"{r1.Fields["first_name"]} {r1.Fields["last_name"]}";
string name2 = $"{r2.Fields["first_name"]} {r2.Fields["last_name"]}";
return Similarity.JaroSimilarity(name1, name2, idf);
}
},
new SimilarityFunction
{
FieldName = "Address",
Compute = (r1, r2) => Similarity.LevenshteinSimilarity(r1.Fields["address"], r2.Fields["address"], idf)
},
new SimilarityFunction
{
FieldName = "Phone",
Compute = (r1, r2) => r1.Fields["phone"] == r2.Fields["phone"] ? 1.0 : 0.0
}
};
// 5. Generate candidate pairs
var pairs = BlockingHelper.GenerateCandidatePairs(records, blockingRules).ToList();
// 6. Score pairs and estimate parameters
double[] mProbs = { 0.9, 0.8, 0.95 }, uProbs = { 0.1, 0.1, 0.05 };
var initialScores = MatchScorer.Score(pairs, similarities, mProbs, uProbs);
// Optional: var (refinedM, refinedU) = MatchScorer.EstimateParametersWithEM(initialScores, similarities);
// 7. Cluster matches
double threshold = 2.0;
var unionFind = new UnionFind();
foreach (var pair in initialScores.Where(p => p.Score > threshold))
unionFind.Union(pair.Record1.Id, pair.Record2.Id);
var clusters = unionFind.GetClusters();
foreach (var cluster in clusters.Values.Where(c => c.Count > 1))
Console.WriteLine($"Duplicate group: {string.Join(", ", cluster)}");
Helper: Build IDF Dictionary
public static Dictionary<string, double> BuildIdfDictionary(List<Record> records)
{
var termDocCounts = new Dictionary<string, int>();
int totalDocs = records.Count;
foreach (var record in records)
{
var allText = string.Join(" ", record.Fields.Values).ToLower();
var terms = allText.Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries).Distinct();
foreach (var term in terms)
termDocCounts[term] = termDocCounts.GetValueOrDefault(term, 0) + 1;
}
return termDocCounts.ToDictionary(
kvp => kvp.Key,
kvp => Math.Log((double)totalDocs / kvp.Value)
);
}
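The IDF dictionary above can feed a TF-IDF cosine similarity, in which rare tokens count more than common ones. The sketch below is a minimal illustration under stated assumptions: it reuses the same tokenization as `BuildIdfDictionary`, gives tokens missing from the dictionary a weight of 0, and uses hypothetical function names. It does not describe how ReLinker's own TF-IDF similarity is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Build a sparse vector: term frequency multiplied by the term's IDF weight.
Dictionary<string, double> TfIdfVector(string text, Dictionary<string, double> idf) =>
    text.ToLowerInvariant()
        .Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
        .GroupBy(t => t)
        .ToDictionary(g => g.Key, g => g.Count() * (idf.TryGetValue(g.Key, out var w) ? w : 0.0));

// Cosine of the angle between two sparse vectors; 0.0 when either vector has zero length.
double CosineSimilarity(Dictionary<string, double> a, Dictionary<string, double> b)
{
    double dot = a.Where(kv => b.ContainsKey(kv.Key)).Sum(kv => kv.Value * b[kv.Key]);
    double normA = Math.Sqrt(a.Values.Sum(v => v * v));
    double normB = Math.Sqrt(b.Values.Sum(v => v * v));
    return normA == 0 || normB == 0 ? 0.0 : dot / (normA * normB);
}
```

Because `Math.Log(totalDocs / df)` is 0 for a term that appears in every record, such terms contribute nothing to the similarity, which is usually the desired effect for boilerplate words.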
🏁 Getting Started Checklist
- Set up your database and install dependencies.
- Decide on blocking and similarity rules for your domain.
- Load your records and build an IDF dictionary.
- Run the pipeline as shown above.
- Tune thresholds and inspect clusters.
⚙️ Extending ReLinker
- Add new database loaders by implementing IDatabaseLoader.
- Add new similarity functions for domain-specific fields.
- Use advanced blocking (multiple rules, phonetic codes) for better recall.
- Integrate with logging/monitoring as needed.
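As one example of a phonetic blocking key, the sketch below implements American Soundex, which maps similar-sounding surnames to the same four-character code. ReLinker does not ship this; the `Soundex` function here is a hypothetical helper written for illustration, and it ignores any character outside a to z.

```csharp
using System;
using System.Text;

// American Soundex: first letter plus three digits, e.g. "Robert" -> "R163".
string Soundex(string name)
{
    const string codes = "01230120022455012623010202"; // digit for each letter a..z, 0 = not coded
    var sb = new StringBuilder();
    char lastCode = '0';
    foreach (char raw in name)
    {
        char c = char.ToLowerInvariant(raw);
        if (c < 'a' || c > 'z') continue;      // skip spaces, hyphens, digits, accents
        char code = codes[c - 'a'];
        if (sb.Length == 0)
        {
            sb.Append(char.ToUpperInvariant(c)); // keep the first letter as-is
            lastCode = code;
            continue;
        }
        if (c == 'h' || c == 'w') continue;     // h and w do not break a run of equal codes
        if (code != '0' && code != lastCode) sb.Append(code);
        lastCode = code;                         // vowels reset the run, so repeats are coded again
        if (sb.Length == 4) break;
    }
    return sb.ToString().PadRight(4, '0');
}

Console.WriteLine(Soundex("Robert")); // prints R163
Console.WriteLine(Soundex("Rupert")); // prints R163
```

A rule such as `new BlockingRule { Name = "SurnameSoundex", RuleFunc = r => Soundex(r.Fields["last_name"]) }` would then place "Robert" and "Rupert" in the same block, trading some extra comparisons for better recall on misspelled names.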
⚠️ Limitations & Considerations
- No phonetic similarity (Soundex/Metaphone) out of the box—PRs welcome!
- Only string fields are currently supported for similarity.
- Memory usage increases with dataset size; blocking is essential.
- Data cleaning/standardization before linkage is strongly recommended.
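Standardization can be as simple as normalizing each field before it reaches the blocking and similarity steps. The helpers below are a minimal sketch with hypothetical names; real data usually needs domain-specific rules as well (country codes, address abbreviations, and so on).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Keep digits only, so "(555) 123-4567" and "555.123.4567" compare equal.
string NormalizePhone(string phone) =>
    new string((phone ?? "").Where(char.IsDigit).ToArray());

// Trim, lower-case, and collapse runs of whitespace into a single space.
string NormalizeText(string text) =>
    Regex.Replace((text ?? "").Trim().ToLowerInvariant(), @"\s+", " ");

Console.WriteLine(NormalizePhone("(555) 123-4567"));   // prints 5551234567
Console.WriteLine(NormalizeText("  12  Main   St. ")); // prints 12 main st.
```

Applying these to `Record.Fields` right after loading keeps exact-match comparisons (such as the phone check in the example) from failing on formatting differences alone.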
📎 References
For more examples, advanced configuration, or to contribute, see the GitHub repo.
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies (net8.0):
- DuckDB.NET.Data (>= 1.3.0)
- FirebirdSql.Data.FirebirdClient (>= 10.3.3)
- Microsoft.Data.SqlClient (>= 6.0.2)
- MySql.Data (>= 9.3.0)
- Npgsql (>= 9.0.3)
- Oracle.ManagedDataAccess (>= 23.8.0)
- RavenDB.Client (>= 7.0.3)
- System.Data.SQLite (>= 1.0.119)