Introduction
In this post, we'll walk through building a simple web-based movie recommendation engine using Flask, Pandas, and Scikit-learn. The project uses natural language processing techniques to recommend movies based on their text descriptions.
Overview
We'll develop a Flask web application that recommends movies similar to a user-provided title. Each movie's text is converted into a TF-IDF (Term Frequency-Inverse Document Frequency) vector, and the similarity between two movies is the cosine similarity of their vectors.
Prerequisites
Before diving in, make sure you have the following installed:
Python 3.x
Flask
Pandas
Scikit-learn
The Dataset
We'll use a dataset that contains metadata about various movies. The data is available at https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
Setting Up the Flask Application
First, we import the necessary libraries and set up a basic Flask app:
import flask
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
app = flask.Flask(__name__, template_folder='templates')
Here, flask.Flask initializes our Flask application, and template_folder specifies the directory where HTML templates are stored.
Loading and Processing the Data
Next, we load the movie data and prepare it for similarity calculations:
Loading Data: The dataset is loaded into a DataFrame (df2).
df2 = pd.read_csv('./model/tmdb.csv')
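The steps that follow read a soup column from this file. That column does not appear to be part of the raw Kaggle files, so ./model/tmdb.csv is presumably a preprocessed copy, and this post does not show how it was built. As a rough sketch only, assuming the soup is a plain concatenation of a few text fields, it could be produced like this. The toy rows, the choice of genres, keywords, and overview, and the helper name build_soup are all illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the movie metadata; real rows would come from the CSV.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Action Adventure", "Drama"],
    "keywords": ["space war", None],  # missing values are common in metadata
    "overview": ["A crew fights in space.", "A quiet family story."],
})

def build_soup(row: pd.Series) -> str:
    """Join the text fields of one row into a single lowercase string."""
    parts = [row["genres"], row["keywords"], row["overview"]]
    # Skip missing fields so the vectorizer never sees NaN
    return " ".join(str(p) for p in parts if pd.notna(p)).lower()

movies["soup"] = movies.apply(build_soup, axis=1)
print(movies[["title", "soup"]])
```

Dropping missing fields before joining matters because TfidfVectorizer expects strings and fails on NaN values.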
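TF-IDF Vectorization: We use TfidfVectorizer from Scikit-learn to transform the soup column (which contains text data) into a TF-IDF matrix. Each row of this matrix is a movie and each column is a word, weighted by how often the word appears in that movie's text and how rare it is across all movies. Common English stop words are removed.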
tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
# Construct the TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['soup'])
print(tfidf_matrix.shape)
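To make the weighting concrete, here is a small sketch that recomputes TF-IDF by hand on a toy corpus and compares it with Scikit-learn's result. It assumes the vectorizer's default settings (smoothed IDF and L2 row normalization); the three sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "space crew fights alien invasion",
    "alien invasion threatens earth",
    "family drama in a small town",
]

# Reference result from scikit-learn with default settings
sk_tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Manual computation over the same vocabulary
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # term frequency times IDF
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(sk_tfidf, manual_tfidf))
```

Words that occur in fewer documents get a larger IDF, so they contribute more to a movie's vector than words that appear everywhere.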
Cosine Similarity: Using the TF-IDF matrix, we calculate the cosine similarity between all movies. This results in a square matrix where each element represents the similarity between two movies.
# Construct cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim.shape)
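df2 = df2.reset_index()
# Map each title to its row position; keep the first row when a title repeats
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]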
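A quick way to sanity-check the similarity matrix is to build one for a handful of toy documents. The sketch below is an illustration with made-up sentences, not part of the app. It also shows that, because TfidfVectorizer L2-normalizes each row by default, the plain dot product from linear_kernel gives the same values as cosine_similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "space crew fights alien invasion",
    "alien invasion threatens earth",
    "family drama small town",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

cos = cosine_similarity(matrix, matrix)  # normalizes rows, then takes dot products
dot = linear_kernel(matrix, matrix)      # dot products only

print(cos.shape)                       # one row and one column per document
print(np.allclose(cos, dot))           # same values, since rows are unit length
print(np.allclose(np.diag(cos), 1.0))  # every document matches itself perfectly
```

Note that the full matrix has one entry per pair of movies, so its memory use grows with the square of the number of rows in the dataset.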
Building the Recommendation Function
We define a function to get movie recommendations based on cosine similarity:
def get_recommendations(title):
    global sim_scores
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return a DataFrame with similar movies
    return_df = pd.DataFrame(columns=['Title', 'Homepage'])
    return_df['Title'] = df2['title'].iloc[movie_indices]
    return_df['Homepage'] = df2['homepage'].iloc[movie_indices]
    return_df['ReleaseDate'] = df2['release_date'].iloc[movie_indices]
    return return_df
Finding the Movie: The function first locates the index of the provided movie title.
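Calculating Similarity: It then looks up the precomputed similarity scores between that movie and every other movie and sorts them in descending order. The first entry is skipped because it is the movie itself, which always has the highest score.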
Returning Recommendations: The function returns the top 10 most similar movies, along with their titles, homepages, and release dates.
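The ranking logic can be tried in isolation on a tiny hand-written similarity matrix. The following is a sketch of an alternative formulation, not the function used in the app: the name top_similar and the toy data are hypothetical, and it returns the scores directly instead of storing them in a global variable.

```python
import numpy as np

def top_similar(title, titles, sim, top_n=10):
    """Return (title, score) pairs for the top_n rows most similar to `title`."""
    idx = titles.index(title)                       # raises ValueError if unknown
    order = np.argsort(-sim[idx], kind="stable")    # row indices, highest score first
    order = [i for i in order if i != idx][:top_n]  # drop the query movie itself
    return [(titles[i], float(sim[idx, i])) for i in order]

titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_similar("A", titles, sim, top_n=2))
```

Filtering out the query index explicitly, rather than slicing off the first sorted entry, stays correct even when another movie ties with a score of 1.0, for example because two rows have identical text.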
Setting Up Flask Routes
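We define the main route that handles GET and POST requests. It relies on all_titles, a list of every movie title in the dataset, which is defined in the full listing at the end of this post: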
@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        return flask.render_template('index.html')
    if flask.request.method == 'POST':
        # Collapse repeated whitespace in the submitted title
        m_name = " ".join(flask.request.form['movie_name'].split())
        if m_name not in all_titles:
            return flask.render_template('notFound.html', name=m_name)
        else:
            result_final = get_recommendations(m_name)
            names = []
            homepage = []
            releaseDate = []
            for i in range(len(result_final)):
                # Columns by position: 0 = Title, 1 = Homepage, 2 = ReleaseDate
                names.append(result_final.iloc[i, 0])
                releaseDate.append(result_final.iloc[i, 2])
                # Missing homepages (e.g. NaN) fall back to a placeholder link
                if len(str(result_final.iloc[i, 1])) > 3:
                    homepage.append(result_final.iloc[i, 1])
                else:
                    homepage.append("#")
            return flask.render_template('found.html', movie_names=names, movie_homepage=homepage, search_name=m_name, movie_releaseDate=releaseDate, movie_simScore=sim_scores)
GET Request: Renders the main search page (index.html).
POST Request: Handles the form submission, checks if the movie exists, and returns the recommendations. If the movie isn't found, it renders a notFound.html template.
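The title check is an exact, case-sensitive match, so a small typo sends the user to the not-found page. The full listing contains a commented-out difflib call that hints at fuzzy matching. Here is a minimal sketch of how that could work outside Flask, using only the standard library. The helper name resolve_title, the cutoff of 0.6, and the sample titles are assumptions for illustration.

```python
import difflib

def resolve_title(raw, all_titles, cutoff=0.6):
    """Return the matching catalogue title, or None if nothing is close enough."""
    name = " ".join(raw.split())  # collapse runs of whitespace, as the route does
    if name in all_titles:
        return name
    # Case-insensitive fuzzy match against the known titles
    lowered = {t.lower(): t for t in all_titles}
    close = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None

all_titles = ["Avatar", "The Dark Knight", "Inception"]

print(resolve_title("  The   Dark  Knight ", all_titles))  # exact match after cleanup
print(resolve_title("incepton", all_titles))               # typo resolved by fuzzy match
print(resolve_title("zzzz", all_titles))                   # nothing similar, so None
```

A higher cutoff makes matching stricter. Whether to accept a fuzzy match silently or ask the user to confirm it is a design decision for the app.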
Running the Flask App
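Finally, to run the Flask application locally, add this block. Note that debug=True enables the interactive debugger and auto-reload, so it should only be used during development:
if __name__ == '__main__':
    app.run(host="127.0.0.1", port=8080, debug=True)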
Putting It All Together
import flask
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
app = flask.Flask(__name__, template_folder='templates')
df2 = pd.read_csv('./model/tmdb.csv')
tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['soup'])
print(tfidf_matrix.shape)
# Construct cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim.shape)
df2 = df2.reset_index()
# Map each title to its row position; keep the first row when a title repeats
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]
# Create a list with all movie titles
all_titles = df2['title'].tolist()
def get_recommendations(title):
    global sim_scores
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Print similarity scores
    print("\n movieId score")
    for i in sim_scores:
        print(i)
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return a DataFrame with similar movies
    return_df = pd.DataFrame(columns=['Title', 'Homepage'])
    return_df['Title'] = df2['title'].iloc[movie_indices]
    return_df['Homepage'] = df2['homepage'].iloc[movie_indices]
    return_df['ReleaseDate'] = df2['release_date'].iloc[movie_indices]
    return return_df
# Set up the main route
@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        return flask.render_template('index.html')
    if flask.request.method == 'POST':
        # Collapse repeated whitespace in the submitted title
        m_name = " ".join(flask.request.form['movie_name'].split())
        # check = difflib.get_close_matches(m_name, all_titles, cutoff=0.50, n=1)
        if m_name not in all_titles:
            return flask.render_template('notFound.html', name=m_name)
        else:
            result_final = get_recommendations(m_name)
            names = []
            homepage = []
            releaseDate = []
            for i in range(len(result_final)):
                # Columns by position: 0 = Title, 1 = Homepage, 2 = ReleaseDate
                names.append(result_final.iloc[i, 0])
                releaseDate.append(result_final.iloc[i, 2])
                # Missing homepages (e.g. NaN) fall back to a placeholder link
                if len(str(result_final.iloc[i, 1])) > 3:
                    homepage.append(result_final.iloc[i, 1])
                else:
                    homepage.append("#")
            return flask.render_template('found.html', movie_names=names, movie_homepage=homepage, search_name=m_name, movie_releaseDate=releaseDate, movie_simScore=sim_scores)
if __name__ == '__main__':
    app.run(host="127.0.0.1", port=8080, debug=True)
    # app.run()