“Reelly Good Productions”, a Pseudo Start-Up that Wants to Revolutionize the Film Industry with Machine Learning
My First End-to-End Text Mining and Analytics Project — from Scraping to Insights
As part of my one-year specialist diploma in big data analytics at Nanyang Polytechnic, where I learned various techniques for analyzing text to extract useful insights and patterns to support decision making, I worked on an NLP project which explored the possibility of reading a movie's plot like a data set to predict whether it will be a box-office hit.
My three course mates (Alex, Belinda and Liduina) and I formed a pseudo start-up company, Reelly Good Productions. As a filmmaker, the company's objective is to create original content that pleases the audience and keeps them coming back for more. The company believes that a specific combination of attributes, when put together, can determine the success of a movie. The measurable indicators of success are audience satisfaction (i.e., user ratings), respected critic reviews (i.e., Metascore) and profitability (i.e., worldwide gross). Further, the company believes that patterns lie in movie story plots and that there are common story elements which make a movie successful. Using data mining tools on past movies, Reelly Good Productions would like to determine whether the combination of these identified factors can predict the success of a future movie with a 70% accuracy rate.
My Problem Statement
Instead of analyzing the "structured attributes" which make a movie successful, such as budget, directors, actors, user reviews, genres and runtime, as one of the four "data scientists" I approached this from a different angle. Given the thousands of scripts Reelly Good Productions receives daily, how does the company determine which one to invest its limited funds in producing? Finding hidden lexical and syntactic structure in past popular films may be the answer. In this assignment, I tried to determine the most important words for each movie and come up with latent features that describe each film. Moreover, I explored how to use k-means clustering and Latent Dirichlet Allocation (LDA) to model the topics of a set of movies.
1. Web Scraping
To source data for Reelly Good Productions' data mining objectives, as the coding-savvy "data engineer" of the team, I scraped 1,000 movies per year from 2015 to 2019 (5,000 in total) from the IMDB website with Python, using BeautifulSoup and requests. I then performed some simple data clean-up, analysis and CSV export using numpy, re, pandas and matplotlib.
The team made the following assumptions when scraping IMDB movies:
- We extract only "feature film" types as defined by IMDB, which includes foreign films shown only on USA TV.
- IMDB sorts feature film search records by default based on film popularity, which indicates how often a title's page has been visited. We extract the records in this default order. However, we assume that popularity does not equate to movie rating or worldwide gross. We also assume this ordering helps to filter out "unknown"/outlier films, which tend to have no rating and many missing values.
- We retrieve data from the years 2015–2019, 1,000 titles per year. We believe this covers a good mix of both good and bad films in recent years.
- We exclude the year 2020, in which the film industry was greatly impacted by the COVID-19 pandemic; we consider 2020 an outlier year.
- We decided to convert the budget and gross values into USD based on current market rates, ignoring the effect of inflation.
I developed the IMDB scraper in a Jupyter notebook to carry out the actions below:
- Request the html content of a movie from IMDB website.
- Pre-process and parse the html content.
- Extract/scrape the 23 attributes shown below from the HTML content.
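The request-parse-extract loop can be sketched as below. This is a minimal, self-contained illustration: the HTML snippet and the CSS class names are stand-ins for IMDB's real search-result markup (which changes over time), and the live `requests.get` call is shown only in a comment.

```python
from bs4 import BeautifulSoup

# In the real notebook the HTML would come from a live request, e.g.:
# html = requests.get("https://www.imdb.com/search/title/?title_type=feature&year=2015").text
# Here a tiny hand-made snippet keeps the sketch self-contained; the class
# names below are illustrative, not IMDB's actual markup.
sample_html = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="/title/tt0000001/">Example Movie</a></h3>
  <span class="runtime">118 min</span>
  <span class="genre">Action, Drama</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
item = soup.find("div", class_="lister-item-content")
record = {
    "title": item.h3.a.get_text(strip=True),                    # movie title
    "runtime": item.find("span", class_="runtime").get_text(strip=True),
    "genre": item.find("span", class_="genre").get_text(strip=True),
}
print(record)
```

In the actual scraper this extraction runs inside a loop over all 5,000 result entries, building one record of 23 attributes per movie.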
2. Data Preparation
Within the same web scrape script, I also performed data clean-up, including:
- Fill in default values for missing attributes.
- Replace special characters.
- Trim the leading and trailing whitespace in the values.
- Remove suffix wording, for example, remove " user" from the "user_reviews" value and " min" from the "runtime" value.
- Extract and concatenate multiple values in a single string, for example, concatenate multiple genres into a single field. The concatenated values may be further transformed, such as one-hot encoding.
- Remove dollar signs / currency codes and commas from money values.
- Convert foreign currency to USD.
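The clean-up rules above can be sketched with pandas string methods. The column names and raw values here are illustrative stand-ins for the scraped attributes (currency conversion is omitted, as it depends on the exchange rates used).

```python
import pandas as pd

# Toy raw data mimicking the scraped string values.
raw = pd.DataFrame({
    "runtime": ["  142 min", "98 min "],
    "user_reviews": ["1,234 user", "56 user"],
    "gross": ["$102,455,000", "$7,230"],
    "genre": [" Action, Drama ", "Comedy"],
})

df = raw.copy()
# Trim whitespace and strip the " min" suffix, then cast to int.
df["runtime"] = df["runtime"].str.strip().str.replace(" min", "", regex=False).astype(int)
# Strip the " user" suffix and thousands separators.
df["user_reviews"] = (df["user_reviews"].str.replace(" user", "", regex=False)
                                        .str.replace(",", "", regex=False).astype(int))
# Remove dollar signs and commas from money values.
df["gross"] = df["gross"].str.replace(r"[$,]", "", regex=True).astype(int)
# Trim leading/trailing whitespace on the concatenated genre string.
df["genre"] = df["genre"].str.strip()
print(df.dtypes)
```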
3. Data Exploration
Using pandas and matplotlib libraries, I did some data exploration to gain insight into the data.
- Check for null values in the data frame.
- Generate descriptive statistics for categorical and continuous variables.
- Review the distribution of labelled data — user rating, meta score and worldwide gross.
- Compute correlation of variables.
- Generate word clouds of genres and plot keywords for visual representation.
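The first three checks can be sketched in a few lines of pandas (the word cloud step needs the `wordcloud` library and is omitted here). The toy frame stands in for the scraped IMDB data; the column names are illustrative.

```python
import pandas as pd

# Toy frame standing in for the scraped data, with a deliberate null in each
# labelled column.
df = pd.DataFrame({
    "user_rating": [7.1, 5.4, None, 8.0],
    "meta_score": [70, 45, 60, None],
    "worldwide_gross": [1.2e8, 3.0e6, 5.5e7, 9.0e8],
})

nulls = df.isnull().sum()   # null count per column
stats = df.describe()       # descriptive statistics for continuous variables
corr = df.corr()            # pairwise correlations
print(nulls)
print(corr.round(2))
```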
4. Data Transformation
During data exploration, we observed that most categorical variables have many distinct values and need to be converted so the computer can process them. We also discovered skewed data in the continuous variables. Label encoding, one-hot encoding, binary encoding and binning were proposed to handle these issues during our data exploration discussion. One-hot encoding movie genres is an example done within the IMDB scraper Jupyter notebook.
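One-hot encoding the multi-valued genre field can be sketched with pandas, assuming the genres were concatenated with ", " during clean-up:

```python
import pandas as pd

df = pd.DataFrame({"genre": ["Action, Drama", "Comedy", "Drama, Comedy"]})

# Split the concatenated genre string and expand into one 0/1 column per genre.
genre_dummies = df["genre"].str.get_dummies(sep=", ")
df = pd.concat([df, genre_dummies], axis=1)
print(df)
```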
Another transformation task was scraping the rankings of directors and stars from The Numbers and merging them with the IMDB data, in order to encode the categorical variables of director and primary/secondary/other cast.
5. Feature Selection
For my data mining objective of solving the "equation" of a hit film script with data, I concatenated the movie title and plot summary as the text feature. As the Metascore and worldwide gross columns had too many missing values, I selected user rating as the label. I encoded a user rating > 6 as a GOOD movie and a rating <= 6 as a BAD one.
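The feature and label construction can be sketched as below; the column names and toy rows are illustrative stand-ins for the scraped data.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "plot": ["A detective hunts a killer.", "A family goes on holiday."],
    "user_rating": [7.4, 5.1],
})

# Text feature: concatenated title and plot summary.
df["text"] = df["title"] + " " + df["plot"]
# Label: rating > 6 is GOOD, <= 6 is BAD.
df["label"] = (df["user_rating"] > 6).map({True: "GOOD", False: "BAD"})
print(df[["text", "label"]])
```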
6. Text Pre-processing
Pre-processing is the first step in the text mining process: it converts unstructured text into an analyzable form and removes noise. I conducted preprocessing steps including tokenization, lowercase conversion, punctuation removal, stop word removal and lemmatization.
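A dependency-free sketch of that pipeline is below. In the actual project a library such as NLTK or spaCy would supply the full stop word list and a proper lemmatizer; the tiny stop word set and the plural-stripping rule here are crude illustrative stand-ins.

```python
import string

# Illustrative stop word set; a real pipeline would use NLTK's or spaCy's list.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # strip punctuation
    tokens = text.split()                                              # naive tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]                # remove stop words
    # Crude stand-in for lemmatization: strip a trailing plural "s".
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

print(preprocess("The detectives searched the killer's hideouts."))
```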
7. Train Test Split
I did a 70:30 simple random split before word vectorizing. The vectorizers were then fitted on the training data and applied to the test data, ensuring that 30% of the data stays truly unseen by the model.
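The split-then-vectorize order can be sketched with scikit-learn: the vectorizer learns its vocabulary from the training portion only, then transforms the held-out 30%. The toy texts and labels are stand-ins for the movie data.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

texts = ["a detective hunts a killer", "a family holiday", "love and war",
         "a gang war", "a killer returns", "a mother and her family",
         "war and peace", "a holiday romance", "crime in the city", "city love"]
labels = ["GOOD", "BAD", "GOOD", "BAD", "GOOD", "BAD", "GOOD", "BAD", "GOOD", "BAD"]

# 70:30 simple random split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_vec = vectorizer.transform(X_test)        # reuse the learned vocabulary
print(X_train_vec.shape, X_test_vec.shape)
```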
8. Word Embeddings / Vectorization
Machines simply cannot process text data in raw form; the text must be broken down into a numerical format that is easily readable by the machine. This is the idea behind word embeddings in Natural Language Processing (NLP). In this assignment, I used 4 word embedding / vectorization techniques for text data: Count / Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF), Word2Vec (W2V) and an LDA topic model as classification model input (TOPICS).
The first 3 techniques, BoW, TFIDF and W2V, were done in Python. For the W2V technique, I used word vectors pretrained on the Google News corpus to process the movie titles and plot summaries.
The SAS Visual Text Analytics tool provided the last technique, TOPICS. Using the LDA model, SAS Visual Text Analytics generated 12 clusters/topics based on the text, each containing its top 5 keywords. The topic distributions were then used as feature vectors.
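The idea of LDA topic distributions as feature vectors can be sketched with scikit-learn standing in for SAS Visual Text Analytics, and 3 topics instead of 12 to keep the toy example small:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

plots = ["a detective hunts a serial killer", "a family holiday goes wrong",
         "love blooms in wartime", "a gang war erupts in the city",
         "a mother protects her family", "a killer stalks the city"]

# Bag-of-words counts feed the LDA model.
counts = CountVectorizer(stop_words="english").fit_transform(plots)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
# Each row is one document's distribution over the 3 topics (sums to 1),
# usable directly as a feature vector for a classifier.
topic_features = lda.fit_transform(counts)
print(topic_features.shape)
```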
9. Modelling and Validation
Logistic Regression (LR), Naïve Bayes (NB) and Support Vector Machine (SVM) are strong on text data and word-based classification. In this assignment, I built 9 models in Python using combinations of the different classifiers (NB, LR, SVM) with the word embeddings (BoW, TFIDF, W2V).
Then I used classification reports and confusion matrices to break down the number of correct and incorrect predictions for each class, and evaluated the precision, recall and F1-score of each model.
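One of the nine combinations (TFIDF + Naive Bayes) and its evaluation can be sketched as below; the toy texts and labels stand in for the real movie plots and GOOD/BAD labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

train_texts = ["a detective hunts a killer", "a dull family holiday",
               "an epic love story", "a boring gang squabble"]
train_labels = ["GOOD", "BAD", "GOOD", "BAD"]
test_texts = ["a killer love story", "a dull boring holiday"]
test_labels = ["GOOD", "BAD"]

# Fit the vectorizer on training data, then train the classifier.
vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
preds = clf.predict(vec.transform(test_texts))

# Break down correct/incorrect predictions per class, then report
# precision, recall and F1-score.
print(confusion_matrix(test_labels, preds, labels=["GOOD", "BAD"]))
print(classification_report(test_labels, preds, labels=["GOOD", "BAD"]))
```

Swapping `MultinomialNB` for `LogisticRegression` or `LinearSVC`, and `TfidfVectorizer` for `CountVectorizer` or averaged Word2Vec vectors, yields the other combinations.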
The last model, a classification model with LDA topics, was built using SAS Visual Text Analytics' Model Studio. SAS used the topic distributions as feature vectors and trained a supervised classifier to predict whether a movie plot summary is a GOOD or BAD choice for filming.
It is crucial that the company does not end up labelling an otherwise bad movie plot as a GOOD candidate. Besides the overall F1-scores, the F1-score of the BAD class will be the tiebreaker.
Upon comparison, the combination of SVM with the Word2vec embedding produced the best results, because the semantic meaning of the text was better preserved.
Although the models did not meet the 70% accuracy target, I believe the accuracy could be improved if we ran the models on complete movie scripts instead of just plot summaries; this would provide a larger corpus for more in-depth analysis.
10. Insights (Topic Modelling)
So what were the insights, or the most meaningful keywords, I identified from the movie titles and plot summaries? I divided the 5,000 past movies into GOOD and BAD categories. Then I used 2 approaches, W2V-KMeans clustering and LDA topic modeling, to cluster the GOOD and BAD movies' combined titles and plot summaries into 8 distinct clusters/topics, and created word clouds for each cluster for further analysis.
I found that there was no obvious difference in words or topics between GOOD and BAD movies. You can see that life, love, family, crime, murder, woman, mother, story, etc. are common features of both GOOD and BAD movies. I received similar results when using the LDA model. This led me to believe that a good movie plot goes beyond just hitting the most common keywords.
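The W2V-KMeans approach can be sketched as below: each document is represented by the average of its word vectors, then the document vectors are clustered. The tiny hand-made 3-dimensional embedding dict is an illustrative stand-in for the pretrained Google News vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 3-d "word vectors"; the real project used 300-d Google News embeddings.
embedding = {
    "murder": [1.0, 0.1, 0.0], "detective": [0.9, 0.2, 0.1],
    "love":   [0.0, 1.0, 0.1], "family":   [0.1, 0.9, 0.0],
}

def doc_vector(tokens):
    # Represent a document as the mean of its in-vocabulary word vectors.
    vecs = [embedding[t] for t in tokens if t in embedding]
    return np.mean(vecs, axis=0)

docs = [["murder", "detective"], ["love", "family"], ["detective", "murder"]]
X = np.vstack([doc_vector(d) for d in docs])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # crime-themed docs land in one cluster
```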
In fact, it is the least common keywords which matter. Surprise elements or unexpected endings are what capture the audience's attention. For example, if murder vs detective is the common story element in a crime-action movie, consider this: what if the murderer turns out to be the detective's angry, vengeful wife? Or, where it's gang vs police, we find out that the police force has a mole in the gang and a police officer is actually a gang member. These surprises blew the minds of audiences and proved highly successful.
Therefore, term frequency techniques such as BoW or TFIDF alone may not capture these plot twists, as the twist words are uncommon. We can overcome that by manually and carefully constructing a domain-specific set of stop words to use alongside the standard list. I believe this may help topic modelling surface the most meaningful words buried beneath the very frequent terms.
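Extending the standard stop word list with domain terms can be sketched with scikit-learn; the "domain" words chosen here (terms frequent in almost every plot summary) are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Domain-specific terms that appear in nearly every plot summary, GOOD or BAD.
domain_stop_words = {"life", "love", "family", "story", "man", "woman"}
all_stop_words = list(ENGLISH_STOP_WORDS.union(domain_stop_words))

vec = CountVectorizer(stop_words=all_stop_words)
vec.fit(["a story of love and a hidden mole in the police force"])
print(sorted(vec.vocabulary_))  # only the rarer, more distinctive words remain
```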
Reflecting on the Text Analytics Process
It was a challenging yet satisfying end-to-end process. I gained the full experience, from scraping unstructured data/text, through cleaning, tokenizing and encoding it, to eventually converting it into structured data and feeding it to the algorithms for insights.
I invested a lot of time in technical research, which included:
- Develop a web scraper application using the Python BeautifulSoup library to source data.
- Learn more data exploration and model evaluation techniques, like word clouds, histograms, confusion matrices and classification reports.
- Correct my misconceptions and deepen my understanding of word representations, especially the word2vec technique.
- Combine classification algorithms with different word embeddings.
- Explore the integration between the unsupervised and supervised learning.
After going through the modules for a year, big data, machine learning and even Python are no longer strangers to me; they help me look at data from another perspective.
Proposal of Skills Applications
In my current work area, although there is an enterprise asset management (EAM) system, the critical information is often buried in structured data, such as year of production, make, model and warranty details, as well as in unstructured data such as maintenance history and repair logs.
I believe text analytics and unsupervised machine learning techniques can help analyze the structured and unstructured data within past and present maintenance tickets and discover hidden insights, including identifying repetitive issues, forecasting potential issues and generating recommended actions.
With the idea above, and using the IMDB data as an example, I researched and developed a movie recommender for Reelly Good Productions based on the doc2vec algorithm to showcase the potential application. When a user enters a movie ID, the application immediately calculates the cosine similarity between the chosen movie and past movies' titles and plot summaries, then recommends a list of similar movies and provides their plot summaries for the user's reference.
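The recommender's similarity lookup can be sketched as below. The actual project used doc2vec document vectors; here TFIDF vectors serve as a lightweight stand-in so the cosine-similarity step stays self-contained, and the movie IDs and summaries are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue of movie IDs and plot summaries.
movies = {
    "tt001": "a detective hunts a serial killer in the city",
    "tt002": "a family holiday turns into a nightmare",
    "tt003": "a city detective chases a killer on the run",
}
ids = list(movies)
X = TfidfVectorizer().fit_transform(movies.values())

def recommend(movie_id, top_n=2):
    # Rank all other movies by cosine similarity to the chosen one.
    i = ids.index(movie_id)
    sims = cosine_similarity(X[i], X).ravel()
    ranked = sims.argsort()[::-1]
    return [ids[j] for j in ranked if j != i][:top_n]

print(recommend("tt001"))
```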
With a similar technique, it is possible to develop a machine learning powered recommendation engine that provides a list of similar resolved tickets and recommends relevant solutions to maintenance engineers for their open tickets. With such an application, engineers can also identify whether a ticket is a repetitive incident and take the necessary actions to prevent it from happening again.
I hope you find this helpful. You can leave any queries in the comments section. If you want the full code, you can access it from here.