Levenshtein Distance & Pattern Matching

Imagine your marketing email requires interested clients to send in a form with details like name, address, etc. At a later date, another blast is sent out, and a lot of people who had responded to the earlier email sign up again. Obviously, accounting for human error, there will be a lot of typos in Names, addresses, etc. To go in and manually correct them would be time consuming. With this scenario in mind, I decided to write a script that would fix some of those issues using Levenshtein Distance (fuzzy string matching) on a dummy dataset.

The example used here is an excel file that contains two columns called list A and list B, both of which contain names of people. In context of the previous example, list A was names collected from the responses at time T0. List B is the names collected from the same people at T1. The names in list A and list B are both supposed to be the same. There can be more than name in each cell.

We need to make sure names in list A and list B match. The reason for any possible mis-match would be in-terms of one of the lists not having the middle name, last name etc. The names are not going to be incorrect, just incomplete. So, we need to make sure we replace each instance of the name with the complete name that we find in the list and replace all incomplete instances of the name. Some of the names are prefixed with Dr. and some with Prof. We want to remove the prefixes and suffixes. We also need to ensure names are in title case.

I use a package called fuzzywuzzy which has a wide variety of commands that use levenshtein distances based string pattern matching.

 

Packages required numpy , pandas , fuzzywuzzy . If you do not have fuzzywuzzy, pip install fuzzywuzzy is all you need to do.

In [1]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [2]:
# Load the excel
df=pd.read_excel('Levenshtein_Distance_Example.xlsx')

As seen below, the name could have been entered differently. Our objective is to ensure that List A names match those in List B.

In [3]:
df. head(15)
Out[3]:
  ListA ListB
0 Dr. Solara Sys Dr. Solara Sys
1 Dr. Europa Jupiter Prof. Jupiter
2 Dr. asteroid Dr. Asteroid
3 Dr. Pluto Heart Dr. Pluto Heart
4 Dr.Blue Dot Dr. Blue Dot
5 Prof. Irregular Asteroid Prof. Irregular Asteroid
6 DRplanet X Dr. Planet X
7 Prof. Neptune Prof. Neptune
8 Dr. Swift Tuttle Swift Tuttle/ Asteroid Miner
9 Dr.Sys Solara Sys, M.D,Ph.D
10 Dr. Irrgular Shaped Asteroid Dr. Irrgular Shaped Asteroid
11 Dr. Sun-Earth & Neptune Mars Dr. Solara Sun-Earth and Neptune Mars
12 Venus Mars NaN

Before getting into Levenshtein Distance bit, the data clearly needs some cleaning. The following blocks of code take care of replace NaN’s, prefixes, suffixes, etc. It also takes care separating multiple names in the same cell with a “,”.

In [5]:
# We remove the prefixes and suffixes to clean up the lists to have only the names. 
# The suffix and prefix here were fixed in our case (it was a list of scientists with either MDs or Ph.Ds who were also faculty
# in some cases)
df.ListA=df.ListA.str.replace(r'Dr. |Dr.|MD|M.d|M.D.|Ph.D.|PhD|Dr|M.D|Ph.D|DR|Prof.|Prof|','')
df.ListB=df.ListB.str.replace(r'Dr. |Dr.|MD|M.d|M.D.|Ph.D.|PhD|Dr|M.D|Ph.D|DR|Prof.|Prof|','')
# Convert the names to title case
df.ListA=[df.ListA[x].title() for x in range(len(df))]
df.ListB=[df.ListB[x].title() for x in range(len(df))]
# Remove spaces and commas at the beginning and end of the word that arise from removing the suffixes and prefixes.
df.ListA=df.ListA.str.lstrip().str.rstrip().str.rstrip(',')  #There might be a comma residual from the MD, PhD case.
df.ListB=df.ListB.str.lstrip().str.rstrip().str.rstrip(',')#There might be a comma residual from the MD, PhD case.  
# There are multiple names in some cells, some separated with ',' some with '/', some with 'and', and some with '&'.
#Replace all with comma.
df.ListA=df.ListA.str.replace(r'/|\\\\|\sand\s|\s&\s|\sAnd\s',', ')
df.ListB=df.ListB.str.replace(r'/|\\\\|\sand\s|\s&\s|\sAnd\s',', ')
df.head(15)
Out[5]:
  ListA ListB
0 Solara Sys Solara Sys
1 Europa Jupiter Jupiter
2 Asteroid Asteroid
3 Pluto Heart Pluto Heart
4 Blue Dot Blue Dot
5 Irregular Asteroid Irregular Asteroid
6 Planet X Planet X
7 Neptune Neptune
8 Swift Tuttle Swift Tuttle, Asteroid Miner
9 Sys Solara Sys
10 Irrgular Shaped Asteroid Irrgular Shaped Asteroid
11 Sun-Earth, Neptune Mars Solara Sun-Earth, Neptune Mars
12 Venus Mars Venus Mars

As visible in some records, either list might have two names, whereas the other list has only one. This is rectified by the following block:

In [6]:
# Get multi-named cells and make sure both columns are multi-named cells. If not, replace the cell with only one name with its
# adjacent cell.
multi_cell_position_A=[x for x in range(len(df.ListA)) if len(df.ListA[x].split(','))>1]
multi_cell_position_B=[x for x in range(len(df.ListB)) if len(df.ListB[x].split(','))>1]
missing_name_postion_A=[x for x in multi_cell_position_B if x not in multi_cell_position_A]
missing_name_postion_B=[x for x in multi_cell_position_A if x not in multi_cell_position_B]
df.ListA[missing_name_postion_A]=df.ListB[missing_name_postion_A]
df.ListB[missing_name_postion_B]=df.ListA[missing_name_postion_B]
df.head(15)
Out[6]:
  ListA ListB
0 Solara Sys Solara Sys
1 Europa Jupiter Jupiter
2 Asteroid Asteroid
3 Pluto Heart Pluto Heart
4 Blue Dot Blue Dot
5 Irregular Asteroid Irregular Asteroid
6 Planet X Planet X
7 Neptune Neptune
8 Swift Tuttle, Asteroid Miner Swift Tuttle, Asteroid Miner
9 Sys Solara Sys
10 Irrgular Shaped Asteroid Irrgular Shaped Asteroid
11 Sun-Earth, Neptune Mars Solara Sun-Earth, Neptune Mars
12 Venus Mars Venus Mars

 

Levenshtein Distance:

The Levenshtein distance measures the difference between two sequences. The distance represents the number of single character edits (insertions, deletions and replacements) that would be needed to make the two sequences equivalent. This is therefore used in string matching, where the target string could be compared against a dictionary of words to identify the closest matches (think spell check). The Levenshtein Distance can also be used to compare two long strings, making it useful in fuzzy string search (which is what we will be using it for).

Mathematically represented by :

 lev_{a,b}(i,j) = \begin{cases} max(i,j) & \text{if } min(i,j) = 0,\\ min \begin{cases} lev_{a,b}(i-1,j)+1 \\ lev_{a,b}(i,j-1)+1 & \text{otherwise}\\ lev_{a,b}(i-1,j-1)+1_{(a_{i}\neq b_{j})} \end{cases} \end{cases}

Where  lev_{a,b}(i,j) is the Levenshtein distance, for two strings  a, b of lengths  |a| and |b| respectively. The three minimum cases represent deletion, insertions and lastly mismatch.

 

String pattern matching

We now need to replace all instances of incomplete names. The incomplete names can be just the last name or first and last name without the middle name. Or, it can also so happen that only one of the two names in a cell is noted in some instances.
We use the process extract function from the fuzzy wuzzy module. It is an awesome fucntion which lets us compare a list of strings with a string and gives the location and the levesthein distance in descending order.

In [7]:
# Just to illustrate how we extract similat strings using fuzzy wuzzy.
print('\n This is the string used for illustarting the process.extract function: ',df.ListA[11])
a=process.extract(df.ListA[11],df.ListA)
print('\n Similar Strings and their corresponding distance to',df.ListA[11], ':')
[a[x] for x in range(len(a)) if a[x][1]>79]
 
Out[7]:
 This is the string used for illustrating the process.extract function:  Sun-Earth, Neptune Mars

 Similar Strings and their corresponding distance to Sun-Earth, Neptune Mars :

[('Sun-Earth, Neptune Mars', 100, 11),
 ('Neptune', 90, 7),
 ('Venus Mars', 86, 12)]

The next two blocks of code create lists out of List A and List B (from the dataframe), followed by identifying full names from both the lists, followed by completing the incomplete names using the full name list as a dictionary.

In [8]:
# We start with extracting names from multi-named cells since the pattern matching requires us to split the cells before
# we process the names.
multi_named_cells=[]
multi_named_cells.extend(multi_cell_position_A)
multi_named_cells.extend(multi_cell_position_B)
multi_named_cells=list(set(multi_named_cells))
In [9]:
# Create a list of all unique full names and use that to replace all incomplete names. Although this might not be the
# best way for data which are memory intensive, for my case of a few excel sheets, this works.
# I used this to ensure I can process cells containing many names can be easily processed. 
def flatten(foo): # I just switched to python 3 and miss flatten :-)
    for x in foo:
        if hasattr(x, '__iter__') and not isinstance(x, str):
            for y in flatten(x):
                yield y
        else:
            yield x
            
Unique_Names=pd.concat([df.ListA,df.ListB]).unique()
Temp_Unique_Names=[z.split(',') for z in Unique_Names]
Unique_Names=[z.lstrip().rstrip() for z in list(flatten(Temp_Unique_Names))]
# Loop through to isolate all unique full names
Unique_Names_Pointer=range(len(Unique_Names))
Unique_Full_Names=[]
while len(Unique_Names_Pointer)>0:

    temp1=process.extract(Unique_Names[Unique_Names_Pointer[0]],Unique_Names)
    
# We need to make only those which match the last names are considered. This is because some names are used as both last and first
# names and we don't want last names matching with first names of another person to be considered as one.

    temp2=[temp1[z][0] for z in range(len(temp1)) if( (temp1[z][1]>79)  & (temp1[z][0].split()[len(temp1[z][0].split())-1]==
           Unique_Names[Unique_Names_Pointer[0]].split()[len(Unique_Names[Unique_Names_Pointer[0]].split())-1]))]
    temp2.append(Unique_Names[Unique_Names_Pointer[0]])
    Unique_Full_Names.append(max(temp2, key=len))
    Unique_Names2=[x for x in Unique_Names if x not in temp2] 
    Unique_Names=Unique_Names2
    Unique_Names_Pointer=range(len(Unique_Names))
    
print('The unique full names are: ' ,Unique_Full_Names)
 
The unique full names are:  ['Solara Sys', 'Europa Jupiter', 'Irrgular Shaped Asteroid', 'Pluto Heart', 'Blue Dot', 'Planet X', 'Neptune', 'Swift Tuttle', 'Asteroid Miner', 'Solara Sun-Earth', 'Neptune Mars', 'Venus Mars']
In [10]:
# We start going through cell by cell and replacing the contents with full names from our uniuqe full name list
# We start with our multi-valued cells first.
for x in range(len(multi_named_cells)):
    name1=process.extract(df.ListA[multi_named_cells[x]].split(',')[0],Unique_Full_Names)[0][0]
    name2=process.extract(df.ListA[multi_named_cells[x]].split(',')[1],Unique_Full_Names)[0][0]
    df.ListA[multi_named_cells[x]]=name1+', '+name2    


list_pointer=[x for x in range(len(df)) if x not in multi_named_cells]
for x in list_pointer:
    df.ListA[x]=process.extract(df.ListA[x],Unique_Full_Names)[0][0]
df.ListB=df.ListA #The two lists are supossed to be the same.
In [11]:
df.to_csv("Levenshtein_Distance_Example_Corrected.csv", index=False)

You can find the code here

Leave a Reply

Your email address will not be published. Required fields are marked *