
CSV Deduplicator

Exact duplicates are easy. But what about 'Jon Smith' vs 'John Smith'? Our deduplicator catches near-duplicates using fuzzy matching and phonetic algorithms — so your data is clean even when humans weren't consistent.

CSV Fuzzy Deduplicator

Advanced fuzzy matching for near-duplicate detection with a configurable similarity threshold


How Fuzzy Deduplication Works:

  • Uses Levenshtein distance algorithm for string similarity
  • Compares selected columns across all row pairs
  • Calculates average similarity score across columns
  • Rows exceeding threshold are marked as duplicates
  • Later occurrences are removed, keeping the first
  • Useful for catching typos and minor variations

What This Tool Does

This tool finds near-duplicate rows in your CSV using fuzzy string matching. Unlike exact duplicate removal, it catches typos, slight variations, and similar entries like "Jon Smith" vs "John Smith" or "Microsft" vs "Microsoft". Adjust the similarity threshold to control how strict the matching is.

How Fuzzy Matching Works

The tool uses the Levenshtein distance algorithm to measure string similarity:

Levenshtein distance: Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to change one string into another.

Similarity score: Converts distance to a percentage. 100% means identical, 0% means completely different.

Row similarity: For multi-column comparison, similarity is averaged across selected columns.

Threshold: Rows with similarity above the threshold are considered duplicates.
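These definitions can be sketched in a few lines of Python (a minimal illustration of the algorithm, not the tool's actual source; the normalization shown — 1 minus distance over the longer length — is one common convention):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum single-character edits (insert, delete, substitute) from a to b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Convert edit distance to a 0-100% score; 100% means identical."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a, b) / longest)

print(levenshtein("John Smith", "Jon Smith"))        # 1 edit (the missing "h")
print(round(similarity("John Smith", "Jon Smith"), 1))  # 90.0
```

A pair of rows is then flagged as a duplicate when this score exceeds the chosen threshold.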

Example: Fuzzy Name Matching

Input CSV:

name,email,company
John Smith,[email protected],Microsoft
Jon Smith,[email protected],Microsft
Jane Doe,[email protected],Google
Jan Doe,[email protected],Googel

With 85% similarity threshold on name column:

Output CSV:

name,email,company
John Smith,[email protected],Microsoft
Jane Doe,[email protected],Google

"Jon Smith" matched "John Smith" (1 edit over 10 characters ≈ 90% similarity)

"Jan Doe" matched "Jane Doe" (1 edit over 8 characters ≈ 88% similarity), so the row containing "Googel" was removed along with it.
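This walkthrough can be reproduced with a short script (a sketch of the approach, assuming the common 1 − distance/max-length normalization; the hosted tool may differ in detail):

```python
import csv
import io

def levenshtein(a: str, b: str) -> int:
    """Minimum single-character edits (insert, delete, substitute) from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize distance to a 0-100% score."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a, b) / longest)

def fuzzy_dedupe(csv_text: str, column: str, threshold: float) -> str:
    """Keep the first occurrence in each fuzzy-duplicate group, drop the rest."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kept = []
    for row in rows:
        if all(similarity(row[column], k[column]) < threshold for k in kept):
            kept.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()

data = """name,email,company
John Smith,[email protected],Microsoft
Jon Smith,[email protected],Microsft
Jane Doe,[email protected],Google
Jan Doe,[email protected],Googel
"""
# Keeps only the first John Smith and Jane Doe rows, as in the output above.
print(fuzzy_dedupe(data, "name", 85))
```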

Similarity Threshold Guide

95-100%: Nearly identical. Catches only typos and minor variations.

85-94%: Close matches. Good for catching common typos and transpositions.

75-84%: Moderate similarity. Catches more variations but may have false positives.

50-74%: Loose matching. Use with caution — may match unrelated entries.

Recommendation: Start at 85% and adjust based on results. Review matches before finalizing.

When to Use Fuzzy Deduplication

Customer data cleanup: Merge entries like "IBM", "I.B.M.", and "International Business Machines".

Survey response cleaning: Catch variations in open-text responses like "USA", "U.S.A.", "United States".

Product catalog deduplication: Find similar product names from different suppliers.

Lead list cleanup: Remove duplicate leads with slight name variations from multiple sources.

Address matching: Catch "123 Main St" vs "123 Main Street" vs "123 Main St.".
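One caution for the address case: the "St"/"Street" abbreviation costs four whole edits, so such pairs can score lower than intuition suggests. A quick check, using a common 1 − distance/max-length normalization (an assumption; the tool's exact formula may differ):

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b), 1))

# Expanding "St" to "Street" takes 4 insertions, so the pair scores only ~73%:
print(round(similarity("123 Main St", "123 Main Street")))  # 73
```

So for address deduplication, a threshold in the 70s (or normalizing abbreviations before matching) may work better than the default 85%.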

Column Selection

Choose which columns to compare for similarity:

Single column: Compare only the most important identifier (email, ID, name).

Multiple columns: Average similarity across selected columns. More accurate but stricter.

Tip: For customer data, compare name + email together to avoid false matches on common names.
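The multi-column averaging might be sketched like this (using Python's difflib.SequenceMatcher as a stand-in similarity measure, since the tool's Levenshtein-based scores will differ slightly; the addresses below are made up for illustration):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in 0-100% score; not Levenshtein, but behaves comparably.
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def row_similarity(row_a: dict, row_b: dict, columns: list) -> float:
    """Average the per-column similarity across the selected columns."""
    return sum(similarity(row_a[c], row_b[c]) for c in columns) / len(columns)

# Two different people who happen to share a common name:
a = {"name": "John Smith", "email": "john@acme.example"}
b = {"name": "John Smith", "email": "jsmith@widgets.example"}

# Name alone scores a perfect match; averaging in the email column
# drags the score down, so the rows are no longer flagged as duplicates.
print(row_similarity(a, b, ["name"]))
print(row_similarity(a, b, ["name", "email"]))
```

This is why comparing name + email together reduces false matches: a disagreeing column lowers the average even when another column matches exactly.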

Visual Match Display

The tool shows potential duplicates with their similarity scores:

Match Found:
  Row 1: John Smith ([email protected])
  Row 2: Jon Smith ([email protected])
  Similarity: 90%
  
  [Keep Row 1] [Keep Row 2] [Keep Both]

Review each match and decide which to keep before finalizing deduplication.

Limitations

Performance: Fuzzy matching is O(n²) — comparing every row to every other row. Files over 10,000 rows may be very slow.
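The quadratic cost is easy to quantify: n rows require n·(n−1)/2 pairwise comparisons.

```python
def pair_count(n: int) -> int:
    """Unordered row pairs an all-pairs fuzzy comparison must score."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} rows -> {pair_count(n):>14,} comparisons")
# 10,000 rows already mean ~50 million comparisons, hence the slowdown.
```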

False positives: Low thresholds may match unrelated entries. "Apple" and "Apply" are 80% similar but different.

Language limitations: Levenshtein works best for Latin alphabets. Non-Latin scripts may have different similarity characteristics.

Frequently Asked Questions

How is this different from Duplicate Remover?

The Duplicate Remover finds exact matches only; the Fuzzy Deduplicator also finds near-matches using fuzzy string comparison, so it catches typos and minor variations.

What threshold should I use?

Start at 85% for most cases. Increase to 90-95% for stricter matching, decrease to 75-80% for more aggressive deduplication.

Can this handle large datasets?

Fuzzy matching is computationally intensive. For datasets over 10,000 rows, consider using a dedicated deduplication tool or script.