Match Text between two different tables using Python
To achieve this task, you will need Python's pyodbc library to connect to the SQL Server databases and the pandas library to manipulate the data. First, make sure you have both libraries installed:
pip install pyodbc pandas
Once installed, you can follow the steps below to loop over columns and match text from another dataset:
import pyodbc
import pandas as pd
# Replace with your SQL Server connection details
server_1 = 'your_server_1'
database_1 = 'your_database_1'
username_1 = 'your_username_1'
password_1 = 'your_password_1'
connection_string_1 = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server_1};DATABASE={database_1};UID={username_1};PWD={password_1}'
server_2 = 'your_server_2'
database_2 = 'your_database_2'
username_2 = 'your_username_2'
password_2 = 'your_password_2'
connection_string_2 = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server_2};DATABASE={database_2};UID={username_2};PWD={password_2}'
# Connect to the databases
conn_1 = pyodbc.connect(connection_string_1)
conn_2 = pyodbc.connect(connection_string_2)
# Define your queries to fetch the data from the databases
query_1 = 'SELECT * FROM your_table_1'
query_2 = 'SELECT * FROM your_table_2'
# Fetch data into pandas DataFrames
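# Note: newer pandas versions may warn that only SQLAlchemy connectables are supported; read_sql_query still works with a raw pyodbc connection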
df_1 = pd.read_sql_query(query_1, conn_1)
df_2 = pd.read_sql_query(query_2, conn_2)
# Close the connections
conn_1.close()
conn_2.close()
# Loop over the columns in the first DataFrame and match text with the second DataFrame
for col_name in df_1.columns:
    if col_name in df_2.columns:
        matched_rows = df_1[df_1[col_name].isin(df_2[col_name])]
        print(f'Matched rows for column "{col_name}":\n{matched_rows}\n')
Replace the server, database, username, and password placeholders with your actual SQL Server connection details. Also, replace the table names in the queries with your actual table names.
This code connects to both SQL Server databases, fetches the data into pandas DataFrames, and loops over the columns in the first DataFrame to match text with the second DataFrame. Matched rows are then printed to the console.
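If you also want to see the matching rows from both tables side by side, rather than only the rows from the first table, a pandas inner merge on the shared column is one option. The sketch below is a minimal illustration; 'customer_name' is a hypothetical column name, so substitute a column that actually exists in both tables:
# Minimal sketch, assuming both DataFrames share a column named 'customer_name' (hypothetical placeholder)
shared_col = 'customer_name'
merged = df_1.merge(df_2, on=shared_col, how='inner', suffixes=('_table1', '_table2'))
print(merged.head())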
To perform fuzzy matching between the columns of two datasets from different SQL Server databases, you can use the fuzzywuzzy library in Python. First, make sure to install the library:
pip install fuzzywuzzy python-Levenshtein
Once installed, you can follow the steps below to perform fuzzy matching:
import pyodbc
import pandas as pd
from fuzzywuzzy import fuzz, process
# Replace with your SQL Server connection details
# (same as in the previous example)
# Connect to the databases
conn_1 = pyodbc.connect(connection_string_1)
conn_2 = pyodbc.connect(connection_string_2)
# Define your queries to fetch the data from the databases
query_1 = 'SELECT * FROM your_table_1'
query_2 = 'SELECT * FROM your_table_2'
# Fetch data into pandas DataFrames
df_1 = pd.read_sql_query(query_1, conn_1)
df_2 = pd.read_sql_query(query_2, conn_2)
# Close the connections
conn_1.close()
conn_2.close()
# Define a threshold for the fuzzy matching score
threshold = 80
# Loop over the columns in the first DataFrame and match text with the second DataFrame using fuzzy matching
for col_name in df_1.columns:
    if col_name in df_2.columns:
        matched_rows = []
        # Convert the candidate values to strings and drop NaNs so the scorer does not fail on non-text cells
        choices = df_2[col_name].dropna().astype(str).tolist()
        for index, row in df_1.iterrows():
            value = row[col_name]
            if pd.isna(value):
                continue  # nothing to match for empty cells
            best_match = process.extractOne(str(value), choices, scorer=fuzz.token_set_ratio)
            if best_match and best_match[1] >= threshold:
                matched_row = df_2[df_2[col_name].astype(str) == best_match[0]]
                matched_rows.append(matched_row)
        if matched_rows:
            matched_rows_df = pd.concat(matched_rows)
            print(f'Matched rows for column "{col_name}":\n{matched_rows_df}\n')
This code connects to both SQL Server databases, fetches the data into pandas DataFrames, and loops over the columns in the first DataFrame to perform fuzzy matching with the second DataFrame. Matched rows are printed to the console.
The threshold variable defines the minimum fuzzy matching score (0 to 100) required for a match to be considered valid. You can adjust the value according to your needs.
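As a quick illustration of how the scorer behaves, the snippet below compares a few made-up strings with fuzz.token_set_ratio. The exact scores depend on the library version, but near-matches score close to 100 while unrelated text scores much lower, which is why a threshold around 80 is a reasonable starting point:
from fuzzywuzzy import fuzz

# Made-up example values, only to show the rough score ranges you can expect
print(fuzz.token_set_ratio('Acme Corporation Ltd', 'Acme Corporation'))  # near-match: close to 100
print(fuzz.token_set_ratio('Jon Smith', 'John Smith'))                   # minor typo: still a high score
print(fuzz.token_set_ratio('Acme Corporation', 'Globex Industries'))     # unrelated: low score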