Website migrations and redesigns often bog down SEOs in tedious redirect mapping. It is a critical task: just ask any SEO who has been brought in months after a migration to diagnose why traffic is down. The script below uses TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity to automate the process, matching old and new URLs by title. That speeds up the work and frees brain power for what matters most: driving strategy, creative, and messaging.
Getting Started
Open Google Colab
Open Google Colab and start a new notebook. Copy the code below, paste it directly into the new Colab notebook, and click the run icon on the left.
Uploading Data:
Using Google Colab’s file upload feature, the script accepts two CSV files: one for the old URLs and one for the new URLs. Each CSV must contain a “URL” column and a “Title” column. Name the file with the URLs you wish to redirect old.csv and the file with the new URLs new.csv.
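If you want to sanity-check the expected format before uploading, you can generate a pair of minimal sample files yourself. The rows below are hypothetical placeholders; your real files will come from a crawl or CMS export.

```python
import pandas as pd

# Hypothetical placeholder rows -- real files come from a crawl or CMS export
pd.DataFrame({
    'URL': ['https://example.com/blog/old-post'],
    'Title': ['Old Post Title'],
}).to_csv('old.csv', index=False)

pd.DataFrame({
    'URL': ['https://example.com/articles/new-post'],
    'Title': ['New Post Title'],
}).to_csv('new.csv', index=False)

# Both files expose the two columns the script expects
print(pd.read_csv('old.csv').columns.tolist())
```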
Preprocessing the Titles:
Titles are the primary data used to determine similarity. The script first handles missing values and combines all titles into a single dataset to ensure that the TF-IDF vectorizer is trained on the complete vocabulary.
Vectorizing with TF-IDF:
TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. By converting the titles into numerical vectors, we can quantify the importance of each word in context. This is particularly useful for SEO where keyword relevance is key.
Calculating Cosine Similarity:
Cosine similarity measures the cosine of the angle between two vectors. In our script, this means we can compare the similarity between the TF-IDF representations of old and new titles. The new URL that has the highest cosine similarity to an old URL is considered the best match for a redirect.
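As a minimal illustration of the matching step, here is the same idea on a one-row “old” set and a two-row “new” set, again with made-up titles:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles: one "old" page and two "new" candidates
old_titles = ["Best Running Shoes for 2024"]
new_titles = ["Best Running Shoes: 2024 Guide", "Trail Hiking Backpacks"]

# Fit on the combined set so both sides share one vocabulary
vec = TfidfVectorizer(stop_words='english')
tfidf = vec.fit_transform(old_titles + new_titles)

# Rows: old titles; columns: new titles
sims = cosine_similarity(tfidf[:1], tfidf[1:])

# The new title that shares words with the old one scores higher,
# so argmax picks it as the redirect target
best_idx = int(np.argmax(sims, axis=1)[0])
print(sims, best_idx)
```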
Generating the Redirect Mapping:
Finally, the script outputs a CSV file that maps each old URL to its best-matching new URL, along with the corresponding similarity score. This file can be directly used to update your website’s redirect rules.
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from google.colab import files
# Upload the CSV files
# When you run this cell, use the file upload dialog to select both "old.csv" and "new.csv"
uploaded = files.upload()
# Read in the CSV files into DataFrames
old_df = pd.read_csv('old.csv')
new_df = pd.read_csv('new.csv')
# (Optional) Display first few rows to verify they are loaded correctly.
print("Old CSV preview:")
print(old_df.head())
print("\nNew CSV preview:")
print(new_df.head())
# Combine the Title columns from both DataFrames, handling missing values
all_titles = pd.concat([old_df['Title'], new_df['Title']])
# Replace NaN values with empty strings
all_titles = all_titles.fillna('')
# Initialize the TfidfVectorizer with English stop words removed
vectorizer = TfidfVectorizer(stop_words='english')
# Fit the vectorizer on all titles and transform them into TF-IDF vectors
tfidf_all = vectorizer.fit_transform(all_titles)
# Split the combined TF-IDF matrix back into two matrices:
# one for the old titles and one for the new titles.
num_old = len(old_df)
tfidf_old = tfidf_all[:num_old]
tfidf_new = tfidf_all[num_old:]
# Compute cosine similarity between each old title and each new title.
# This produces a matrix with shape (number of old titles, number of new titles)
similarity_matrix = cosine_similarity(tfidf_old, tfidf_new)
# For each old title, find the index of the new title with the highest cosine similarity.
# This index corresponds to the best matching new URL.
best_match_indices = np.argmax(similarity_matrix, axis=1)
# (Optional) If you want to also capture the similarity score, you can do so:
best_scores = np.max(similarity_matrix, axis=1)
# Build a list of mappings (each a dictionary) from old URL to new URL.
# Each mapping uses the best matching new URL for that old URL.
mapping_results = []
for i, best_idx in enumerate(best_match_indices):
    mapping_results.append({
        'old_url': old_df.iloc[i]['URL'],
        'new_url': new_df.iloc[best_idx]['URL'],
        'similarity_score': best_scores[i],
    })
# Create a DataFrame from the mapping results.
mapping_df = pd.DataFrame(mapping_results)
# Display the mapping DataFrame
print("\nMapping (first 10 entries):")
print(mapping_df.head(10))
# Save the mapping DataFrame to a CSV file.
output_filename = 'redirect_mapping.csv'
mapping_df.to_csv(output_filename, index=False)
print(f"\nMapping file saved as {output_filename}")
# Download the CSV file (this will trigger a download in your browser)
files.download(output_filename)
The Output
The file it creates contains three columns: the old URL, the best-matching new URL, and the cosine similarity score.
The whole process takes only a few minutes, a fraction of what it would take to map these manually. From here, the job of an SEO or web manager is to review the redirects and ensure they are correct. In my own runs, the suggested redirects were correct roughly 95% of the time; the misses were usually outliers, or old or new content with no close counterpart. In all, I spent roughly one hour instead of ten, a massive time saving.
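Once reviewed, the mapping can be turned into actual redirect rules. The sketch below assumes Apache-style Redirect 301 directives and a hypothetical 0.3 score cutoff for flagging weak matches; it builds the DataFrame inline so the shape is visible, but in practice you would load redirect_mapping.csv.

```python
from urllib.parse import urlparse
import pandas as pd

# A few hypothetical rows standing in for redirect_mapping.csv;
# in practice, load it with pd.read_csv('redirect_mapping.csv')
mapping_df = pd.DataFrame({
    'old_url': ['https://example.com/old-shoes', 'https://example.com/old-misc'],
    'new_url': ['https://example.com/shoes', 'https://example.com/other'],
    'similarity_score': [0.91, 0.12],
})

THRESHOLD = 0.3  # arbitrary cutoff; tune it against your own review pass

# Anything below the cutoff goes into a manual-review bucket
needs_review = mapping_df[mapping_df['similarity_score'] < THRESHOLD]

# Apache's Redirect directive expects the old URL as a path, so strip the domain
rules = [
    f"Redirect 301 {urlparse(r.old_url).path} {r.new_url}"
    for r in mapping_df[mapping_df['similarity_score'] >= THRESHOLD].itertuples()
]
print("\n".join(rules))
print(f"{len(needs_review)} mapping(s) below threshold for manual review")
```

The threshold is a judgment call: too low and bad redirects slip through, too high and you hand-review matches the script already got right.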
Redirecting it All Together
A task that used to take tens of hours has been reduced to a few. Let the machines do the work, and focus instead on QA and other high-value work. The script above took me no more than five minutes to request from ChatGPT, plus a quick minor update I made using Google Gemini. Happy redirecting!
If you run into issues or have other SEO needs, connect with me on LinkedIn or use the form below.