ReLinker 1.1.0

.NET 8.0

dotnet add package ReLinker --version 1.1.0

ReLinker

A Powerful C# Record Linkage (Entity Resolution) Framework

🚀 Overview

ReLinker is a .NET library for deduplicating and linking records within and across datasets. It provides string similarity algorithms, blocking strategies, probabilistic matching, and clustering, and is designed to scale to real-world data integration tasks.

📚 Key Concepts

- Record Linkage: Identifying records that refer to the same entity within or across databases.
- Blocking: Reducing the number of candidate record pairs for comparison.
- Similarity Computation: Quantifying how similar two records are.
- Probabilistic Matching: Scoring and classifying pairs using statistical models.
- Clustering: Grouping matched records into entities.

🏗️ System Architecture

The ReLinker pipeline consists of:

1. Data Loading
2. Blocking
3. Similarity Computation
4. Probabilistic Scoring (Fellegi-Sunter + EM)
5. Clustering (Union-Find)

🧩 Core Components

1. Record Structure

public class Record
{
    public string Id { get; set; }
    public Dictionary<string, string> Fields { get; set; } = new();
}

2. Data Loading System

Loader Interface:

public interface IDatabaseLoader
{
    Task<List<Record>> LoadRecordsAsync();
    IEnumerable<Record> LoadRecordsInBatches(int batchSize, int startOffset = 0);
}
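
To make the loader contract concrete, here is a minimal sketch of an in-memory implementation, which can be handy in unit tests. `InMemoryLoader` is a hypothetical name and is not a class shipped with ReLinker; the `Record` and `IDatabaseLoader` declarations are repeated from above only so that the snippet compiles on its own. The batching semantics (lazy enumeration, `batchSize` records at a time from `startOffset`) are an assumption, since the interface itself does not spell them out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class Record
{
    public string Id { get; set; } = "";
    public Dictionary<string, string> Fields { get; set; } = new();
}

public interface IDatabaseLoader
{
    Task<List<Record>> LoadRecordsAsync();
    IEnumerable<Record> LoadRecordsInBatches(int batchSize, int startOffset = 0);
}

// Hypothetical loader that serves records from a list already held in memory.
public class InMemoryLoader : IDatabaseLoader
{
    private readonly List<Record> _records;

    public InMemoryLoader(IEnumerable<Record> records) => _records = records.ToList();

    // Returns a copy so callers cannot mutate the loader's backing list.
    public Task<List<Record>> LoadRecordsAsync() => Task.FromResult(_records.ToList());

    // Assumed semantics: yield records lazily, batchSize at a time, starting at startOffset.
    public IEnumerable<Record> LoadRecordsInBatches(int batchSize, int startOffset = 0)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        for (int offset = Math.Max(0, startOffset); offset < _records.Count; offset += batchSize)
            foreach (var record in _records.Skip(offset).Take(batchSize))
                yield return record;
    }
}
```

A loader like this can be passed anywhere the pipeline expects an `IDatabaseLoader`, which keeps tests independent of a real database.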

Factory:

public static class DatabaseLoaderFactory
{
    public static IDatabaseLoader CreateLoader(
        string type, string connectionStringOrUrl, string queryOrCollection, string? providerName = null)
    {
        // Loader type names are matched case-insensitively.
        return type.ToLowerInvariant() switch
        {
            "generic" => new GenericDbLoader(connectionStringOrUrl, queryOrCollection, providerName),
            "duckdb" => new DuckDbLoader(connectionStringOrUrl, queryOrCollection),
            "ravendb" => new RavenDbLoader(connectionStringOrUrl, queryOrCollection),
            _ => throw new ArgumentException($"Unknown loader type: '{type}'.", nameof(type))
        };
    }
}

3. Blocking

Blocking Rule:

public class BlockingRule
{
    public string Name { get; set; }
    public Func<Record, string> RuleFunc { get; set; }
}

Blocking Helper:

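public static class BlockingHelper
{
    public static IEnumerable<(Record, Record)> GenerateCandidatePairs(
        List<Record> records, List<BlockingRule> rules)
    {
        // Group records by (rule name, key) so keys from different rules never share a block.
        var blocks = new Dictionary<(string Rule, string Key), List<Record>>();
        foreach (var rule in rules)
        {
            foreach (var record in records)
            {
                var key = rule.RuleFunc(record);
                if (string.IsNullOrEmpty(key)) continue; // no usable key under this rule
                if (!blocks.TryGetValue((rule.Name, key), out var bucket))
                    blocks[(rule.Name, key)] = bucket = new List<Record>();
                bucket.Add(record);
            }
        }

        // Emit each unordered pair once, even when several rules block it together.
        var seen = new HashSet<(string, string)>();
        foreach (var block in blocks.Values)
        {
            for (int i = 0; i < block.Count; i++)
            {
                for (int j = i + 1; j < block.Count; j++)
                {
                    var a = block[i].Id;
                    var b = block[j].Id;
                    var pairKey = string.CompareOrdinal(a, b) <= 0 ? (a, b) : (b, a);
                    if (seen.Add(pairKey))
                        yield return (block[i], block[j]);
                }
            }
        }
    }
}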
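
To get a feel for how much work blocking saves, it helps to count pairs. Comparing every record with every other record costs n(n-1)/2 comparisons, while blocking only compares records inside the same block. The sketch below does that arithmetic; `PairCounter` and its methods are hypothetical helpers for illustration and are not part of ReLinker.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairCounter
{
    // Number of unordered pairs among n items: n * (n - 1) / 2.
    public static long AllPairs(long n) => n * (n - 1) / 2;

    // Pairs compared when items are only compared within their own block.
    public static long BlockedPairs<T>(IEnumerable<T> items, Func<T, string> blockKey) =>
        items.GroupBy(blockKey).Sum(g => AllPairs(g.LongCount()));
}
```

For 10,000 records, `AllPairs` gives 49,995,000 comparisons. If a blocking key splits them evenly into 100 blocks of 100 records, `BlockedPairs` gives 100 × 4,950 = 495,000, roughly a hundredfold reduction. Real blocks are rarely that even, so the actual saving depends on the key distribution.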

4. Similarity Functions

Define similarity logic per field:

public class SimilarityFunction
{
    public string FieldName { get; set; }
    public Func<Record, Record, double> Compute { get; set; }
}

Example Implementations (see below for usage):

- Levenshtein, Jaro, and TF-IDF similarity
- Field-specific (e.g., exact match for phone/email)
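
As an illustration of what such functions compute, here is a minimal sketch of a normalized Levenshtein similarity and a case-insensitive exact match. `SimpleSimilarity` is a hypothetical class written for this example; it is not ReLinker's own `Similarity` class, whose methods in the example further down also take an IDF dictionary.

```csharp
using System;

public static class SimpleSimilarity
{
    // Classic edit distance using two rolling rows of the dynamic-programming table.
    public static int LevenshteinDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // the row just filled becomes the previous row
        }
        return prev[b.Length];
    }

    // Maps the distance to [0, 1]: 1.0 for identical strings, 0.0 when nothing matches.
    public static double LevenshteinSimilarity(string a, string b)
    {
        a ??= "";
        b ??= "";
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 1.0 : 1.0 - (double)LevenshteinDistance(a, b) / maxLen;
    }

    // Case-insensitive exact match, useful for identifiers such as email addresses.
    public static double ExactMatch(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase) ? 1.0 : 0.0;
}
```

For example, "kitten" and "sitting" are 3 edits apart, so their similarity is 1 - 3/7, about 0.57. Either method can be wrapped in a `SimilarityFunction` by reading the relevant field from each record inside `Compute`.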

5. Probabilistic Matching (Fellegi-Sunter)

public static class MatchScorer
{
    public static List<ScoredPair> Score(
        IEnumerable<(Record, Record)> pairs,
        List<SimilarityFunction> functions,
        double[] mProbs, double[] uProbs)
    {
        var results = new List<ScoredPair>();
        foreach (var (r1, r2) in pairs)
        {
            double score = 0;
            for (int i = 0; i < functions.Count; i++)
            {
                double sim = functions[i].Compute(r1, r2);
                score += Math.Log((sim * mProbs[i]) / (sim * uProbs[i] + 1e-6) + 1e-6);
            }
            results.Add(new ScoredPair { Record1 = r1, Record2 = r2, Score = score });
        }
        return results;
    }

    public static (double[], double[]) EstimateParametersWithEM(
        List<ScoredPair> scoredPairs, List<SimilarityFunction> functions, int maxIterations = 10)
    {
        // (see full implementation in repo)
        throw new NotImplementedException("EM implementation here...");
    }
}
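
The EM implementation is elided above. For orientation only, the following sketch shows textbook EM for the Fellegi-Sunter model on binary agreement patterns, assuming fields are conditionally independent given match status. `EmSketch.Estimate`, its parameters, and the binarized input are all assumptions made for this example; they do not reflect ReLinker's actual `EstimateParametersWithEM` code or signature.

```csharp
using System;
using System.Linq;

public static class EmSketch
{
    // patterns[j][i] is true when pair j agrees on field i.
    // Returns estimated m and u probabilities per field plus the match prevalence p.
    public static (double[] M, double[] U, double P) Estimate(
        bool[][] patterns, double[] initialM, double[] initialU,
        double initialP = 0.1, int maxIterations = 10)
    {
        if (patterns.Length == 0) throw new ArgumentException("At least one pair is required.");
        const double eps = 1e-6;
        int fields = initialM.Length;
        double[] m = (double[])initialM.Clone(), u = (double[])initialU.Clone();
        double p = initialP;
        var g = new double[patterns.Length]; // posterior P(match | pattern) per pair

        for (int iter = 0; iter < maxIterations; iter++)
        {
            // E-step: posterior match probability under the current parameters.
            for (int j = 0; j < patterns.Length; j++)
            {
                double likeM = p, likeU = 1 - p;
                for (int i = 0; i < fields; i++)
                {
                    likeM *= patterns[j][i] ? m[i] : 1 - m[i];
                    likeU *= patterns[j][i] ? u[i] : 1 - u[i];
                }
                g[j] = likeM / (likeM + likeU);
            }

            // M-step: re-estimate parameters as posterior-weighted agreement rates.
            double sumG = g.Sum(), sumNotG = patterns.Length - sumG;
            for (int i = 0; i < fields; i++)
            {
                double agreeM = 0, agreeU = 0;
                for (int j = 0; j < patterns.Length; j++)
                {
                    if (!patterns[j][i]) continue;
                    agreeM += g[j];
                    agreeU += 1 - g[j];
                }
                // Clamp away from 0 and 1 so later likelihoods never collapse to zero.
                m[i] = Math.Clamp(agreeM / Math.Max(sumG, eps), eps, 1 - eps);
                u[i] = Math.Clamp(agreeU / Math.Max(sumNotG, eps), eps, 1 - eps);
            }
            p = Math.Clamp(sumG / patterns.Length, eps, 1 - eps);
        }
        return (m, u, p);
    }
}
```

The initial values must lie strictly between 0 and 1. Binarizing similarities (for instance, treating a similarity above some cutoff as agreement) is one simple way to feed continuous scores into this kind of estimator, at the cost of discarding partial agreement.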

public class ScoredPair
{
    public Record Record1 { get; set; }
    public Record Record2 { get; set; }
    public double Score { get; set; }
}
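
For reference, the classic Fellegi-Sunter model gives each field an agreement weight of log(m/u) and a disagreement weight of log((1-m)/(1-u)), where m is the probability that the field agrees for a true match and u is the probability that it agrees for a non-match. The sketch below interpolates linearly between the two weights using a similarity in [0, 1]. That interpolation is one common way to use continuous similarities and is an assumption of this example; it is not a description of how `MatchScorer.Score` computes its score. `FellegiSunter` and its method names are hypothetical.

```csharp
using System;

public static class FellegiSunter
{
    // Weight added when a field agrees: log(m / u).
    public static double AgreementWeight(double m, double u) => Math.Log(m / u);

    // Weight added when a field disagrees: log((1 - m) / (1 - u)).
    public static double DisagreementWeight(double m, double u) => Math.Log((1 - m) / (1 - u));

    // Linear interpolation between the two weights for a similarity in [0, 1].
    public static double FieldWeight(double sim, double m, double u)
    {
        if (m <= 0 || m >= 1 || u <= 0 || u >= 1)
            throw new ArgumentOutOfRangeException(nameof(m), "m and u must lie strictly between 0 and 1.");
        sim = Math.Clamp(sim, 0.0, 1.0);
        return sim * AgreementWeight(m, u) + (1 - sim) * DisagreementWeight(m, u);
    }

    // Total score for a pair: the sum of its per-field weights.
    public static double Score(double[] sims, double[] mProbs, double[] uProbs)
    {
        if (sims.Length != mProbs.Length || sims.Length != uProbs.Length)
            throw new ArgumentException("sims, mProbs and uProbs must have the same length.");
        double total = 0;
        for (int i = 0; i < sims.Length; i++)
            total += FieldWeight(sims[i], mProbs[i], uProbs[i]);
        return total;
    }
}
```

With m = 0.9 and u = 0.1, full agreement contributes ln 9 (about 2.197), full disagreement contributes ln(1/9) (about -2.197), and a similarity of 0.5 contributes 0. Fields with a larger gap between m and u therefore move the total score more in either direction.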

6. Clustering (Union-Find)

public class UnionFind
{
    // Maps each record ID to its parent; a root is its own parent.
    private readonly Dictionary<string, string> parent = new();

    // Returns the root of x's set, compressing the path along the way.
    // x must already have been passed to Union; otherwise the lookup throws.
    public string Find(string x) => parent[x] == x ? x : parent[x] = Find(parent[x]);

    public void Union(string x, string y)
    {
        if (!parent.ContainsKey(x)) parent[x] = x;
        if (!parent.ContainsKey(y)) parent[y] = y;
        string rootX = Find(x), rootY = Find(y);
        if (rootX != rootY) parent[rootY] = rootX;
    }

    public Dictionary<string, List<string>> GetClusters()
    {
        var clusters = new Dictionary<string, List<string>>();
        // Iterate over a snapshot of the keys, since Find rewrites entries in parent.
        foreach (var item in new List<string>(parent.Keys))
        {
            var root = Find(item);
            if (!clusters.ContainsKey(root)) clusters[root] = new();
            clusters[root].Add(item);
        }
        return clusters;
    }
}
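
The `UnionFind` above throws a `KeyNotFoundException` if `Find` is called with an ID that was never passed to `Union`, and it always attaches the second root under the first. As a sketch of one possible variation, the class below registers unseen IDs on first use and attaches the smaller tree under the larger one (union by size), which keeps trees shallow. `SizedUnionFind` is a hypothetical name and is not part of ReLinker.

```csharp
using System.Collections.Generic;
using System.Linq;

public class SizedUnionFind
{
    private readonly Dictionary<string, string> _parent = new();
    private readonly Dictionary<string, int> _size = new();

    public string Find(string x)
    {
        // Register unseen IDs as singleton sets instead of throwing.
        if (!_parent.ContainsKey(x)) { _parent[x] = x; _size[x] = 1; }

        // Walk up to the root, then point every visited node directly at it.
        string root = x;
        while (_parent[root] != root) root = _parent[root];
        while (_parent[x] != root)
        {
            var next = _parent[x];
            _parent[x] = root;
            x = next;
        }
        return root;
    }

    public void Union(string x, string y)
    {
        string rootX = Find(x), rootY = Find(y);
        if (rootX == rootY) return;
        // Keep the larger tree as the root so paths stay short.
        if (_size[rootX] < _size[rootY]) (rootX, rootY) = (rootY, rootX);
        _parent[rootY] = rootX;
        _size[rootX] += _size[rootY];
    }

    // Groups every known ID under its root; singleton sets are included.
    public Dictionary<string, List<string>> GetClusters() =>
        _parent.Keys.ToList()
            .GroupBy(id => Find(id))
            .ToDictionary(g => g.Key, g => g.ToList());
}
```

Because matches are transitive under this clustering, calling `Union("a", "b")` and `Union("b", "c")` puts a, b, and c in one cluster even if the pair (a, c) never scored above the threshold. That is worth keeping in mind when choosing the threshold, since one weak link can merge two otherwise separate groups.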

🧑‍💻 Full Working Example: Customer Deduplication

// 1. Load records from a database (SQLite example)
var loader = DatabaseLoaderFactory.CreateLoader(
    "generic", 
    "Data Source=customers.db",
    "SELECT id, first_name, last_name, phone, email, address FROM customers",
    "System.Data.SQLite"
);
var records = await loader.LoadRecordsAsync();

// 2. Build IDF dictionary (see helper below)
var idf = BuildIdfDictionary(records);

// 3. Define blocking rules
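var blockingRules = new List<BlockingRule>
{
    // First six characters of the phone number, or the whole value if it is shorter.
    new BlockingRule { Name = "PhonePrefix", RuleFunc = r => { var p = r.Fields["phone"]; return p.Length >= 6 ? p[..6] : p; } },
    // Everything after the last '@', or the whole value if there is no '@'.
    new BlockingRule { Name = "EmailDomain", RuleFunc = r => { var e = r.Fields["email"]; return e[(e.LastIndexOf('@') + 1)..]; } }
};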

// 4. Define similarity functions
var similarities = new List<SimilarityFunction>
{
    new SimilarityFunction 
    { 
        FieldName = "FullName",
        Compute = (r1, r2) =>
        {
            string name1 = $"{r1.Fields["first_name"]} {r1.Fields["last_name"]}";
            string name2 = $"{r2.Fields["first_name"]} {r2.Fields["last_name"]}";
            return Similarity.JaroSimilarity(name1, name2, idf);
        }
    },
    new SimilarityFunction 
    { 
        FieldName = "Address",
        Compute = (r1, r2) => Similarity.LevenshteinSimilarity(r1.Fields["address"], r2.Fields["address"], idf)
    },
    new SimilarityFunction 
    { 
        FieldName = "Phone", 
        Compute = (r1, r2) => r1.Fields["phone"] == r2.Fields["phone"] ? 1.0 : 0.0
    }
};

// 5. Generate candidate pairs
var pairs = BlockingHelper.GenerateCandidatePairs(records, blockingRules).ToList();

// 6. Score pairs and estimate parameters
double[] mProbs = { 0.9, 0.8, 0.95 }, uProbs = { 0.1, 0.1, 0.05 };
var initialScores = MatchScorer.Score(pairs, similarities, mProbs, uProbs);
// Optional: var (refinedM, refinedU) = MatchScorer.EstimateParametersWithEM(initialScores, similarities);

// 7. Cluster matches
double threshold = 2.0;
var unionFind = new UnionFind();
foreach (var pair in initialScores.Where(p => p.Score > threshold))
    unionFind.Union(pair.Record1.Id, pair.Record2.Id);

var clusters = unionFind.GetClusters();
foreach (var cluster in clusters.Values.Where(c => c.Count > 1))
    Console.WriteLine($"Duplicate group: {string.Join(", ", cluster)}");

Helper: Build IDF Dictionary

public static Dictionary<string, double> BuildIdfDictionary(List<Record> records)
{
    var termDocCounts = new Dictionary<string, int>();
    int totalDocs = records.Count;

    foreach (var record in records)
    {
        var allText = string.Join(" ", record.Fields.Values).ToLower();
        var terms = allText.Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries).Distinct();
        foreach (var term in terms)
            termDocCounts[term] = termDocCounts.GetValueOrDefault(term, 0) + 1;
    }

    return termDocCounts.ToDictionary(
        kvp => kvp.Key,
        kvp => Math.Log((double)totalDocs / kvp.Value)
    );
}
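
To show how an IDF dictionary like this can feed a similarity measure, here is a minimal sketch of TF-IDF cosine similarity over two strings. `TfIdf`, `Vectorize`, and `CosineSimilarity` are hypothetical names for this example and are not ReLinker's own TF-IDF implementation; the tokenizer simply mirrors the separators used in `BuildIdfDictionary`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdf
{
    private static readonly char[] Separators = { ' ', ',', '.', '!', '?' };

    // Builds a sparse TF-IDF vector: term -> (count of term in text) * idf(term).
    // Terms missing from the idf dictionary are dropped.
    public static Dictionary<string, double> Vectorize(
        string text, IReadOnlyDictionary<string, double> idf) =>
        text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(t => t)
            .Where(g => idf.ContainsKey(g.Key))
            .ToDictionary(g => g.Key, g => g.Count() * idf[g.Key]);

    // Cosine of the angle between the two TF-IDF vectors; 0 when either vector is empty.
    public static double CosineSimilarity(
        string a, string b, IReadOnlyDictionary<string, double> idf)
    {
        var va = Vectorize(a, idf);
        var vb = Vectorize(b, idf);
        double dot = va.Where(kv => vb.ContainsKey(kv.Key)).Sum(kv => kv.Value * vb[kv.Key]);
        double normA = Math.Sqrt(va.Values.Sum(v => v * v));
        double normB = Math.Sqrt(vb.Values.Sum(v => v * v));
        return normA == 0 || normB == 0 ? 0.0 : dot / (normA * normB);
    }
}
```

Note that a term occurring in every record gets an IDF of log(1) = 0 from `BuildIdfDictionary`, so it contributes nothing to the cosine. That is usually the desired effect for filler tokens such as "st" or "inc", but it also means two strings made up only of such terms score 0.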

🏁 Getting Started Checklist

1. Set up your database and install dependencies.
2. Decide on blocking and similarity rules for your domain.
3. Load your records and build an IDF dictionary.
4. Run the pipeline as shown above.
5. Tune thresholds and inspect clusters.

⚙️ Extending ReLinker

- Add new database loaders by implementing IDatabaseLoader.
- Add new similarity functions for domain-specific fields.
- Use advanced blocking (multiple rules, phonetic codes) for better recall.
- Integrate with logging/monitoring as needed.
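
As an example of a phonetic blocking key, the sketch below implements American Soundex, which maps similar-sounding surnames to the same four-character code. `Phonetic.Soundex` is a hypothetical helper written for this example; ReLinker does not ship one.

```csharp
using System;
using System.Text;

public static class Phonetic
{
    // Soundex digit for each letter a..z; '0' marks letters that carry no code
    // (vowels, y, h, w).
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string name)
    {
        var result = new StringBuilder(4);
        char lastCode = '0';
        foreach (char raw in name ?? "")
        {
            char c = char.ToLowerInvariant(raw);
            if (c < 'a' || c > 'z') continue;            // skip digits, spaces, punctuation
            char code = Codes[c - 'a'];
            if (result.Length == 0)
                result.Append(char.ToUpperInvariant(c)); // the first letter is kept as a letter
            else if (code != '0' && code != lastCode)
                result.Append(code);                     // new consonant sound
            if (c != 'h' && c != 'w') lastCode = code;   // h and w do not break a run of equal codes
            if (result.Length == 4) break;
        }
        // Pad short codes with zeros; names without letters yield an empty key.
        return result.Length == 0 ? "" : result.ToString().PadRight(4, '0');
    }
}
```

"Robert" and "Rupert" both map to R163, so a rule such as `new BlockingRule { Name = "SurnameSoundex", RuleFunc = r => Phonetic.Soundex(r.Fields["last_name"]) }` would place them in the same block despite the spelling difference. Soundex was designed for English surnames, so other phonetic codes may suit other languages better.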

⚠️ Limitations & Considerations

- No phonetic similarity (Soundex/Metaphone) out of the box; PRs welcome!
- Only string fields are currently supported for similarity.
- Memory usage increases with dataset size; blocking is essential.
- Data cleaning/standardization before linkage is strongly recommended.

📎 References

For more examples, advanced configuration, or to contribute, see the GitHub repo.

Dependencies (net8.0)
- DuckDB.NET.Data (>= 1.3.0)
- FirebirdSql.Data.FirebirdClient (>= 10.3.3)
- Microsoft.Data.SqlClient (>= 6.0.2)
- MySql.Data (>= 9.3.0)
- Npgsql (>= 9.0.3)
- Oracle.ManagedDataAccess (>= 23.8.0)
- RavenDB.Client (>= 7.0.3)
- System.Data.SQLite (>= 1.0.119)

Version	Downloads	Last Updated
1.1.0	73	6/21/2025
1.0.111	99	6/20/2025
1.0.15	96	6/20/2025
1.0.13	97	6/20/2025
1.0.12	96	6/20/2025
1.0.0	103	6/20/2025