The studios had an elegant price discrimination model for movies. They released a movie in theatres first, for people with a higher willingness to pay, then on DVD, and then on services like Netflix and Hulu, followed by home entertainment systems. In effect, they made time the discriminating factor: people who wanted to pay less had to wait longer. This model is no longer feasible because of piracy. This analysis focuses on DVD release strategies and their impact on piracy.
The dataset for this project was scraped with Python scripts from several movie sources: IMDb (movie characteristics), Box Office Mojo (Oscar nominations), Amazon (DVD release dates), MovieLabs (piracy) and Letterboxd (social media presence of movies). The dataset had many attributes; we chose variables based on our hypotheses and adjusted them as needed to get a better R-squared value. One important finding was that there were two very different sets of movies, and we needed to build a separate prediction model for each. They were:
i) Movies with a wide release
ii) Movies with a limited release
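As a rough sketch of what fitting one model per segment can look like, the snippet below fits a one-variable least-squares line separately for each release type and reports R-squared for each. The write-up does not say which model or predictors were actually used, so the simple linear fit, the field names (`release_type`, `screens`, `downloads`) and the toy numbers are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variance, slope is undefined")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot
    return a, b, r_squared


def fit_by_segment(rows, x_key, y_key, segment_key="release_type"):
    """Group rows by segment and fit a separate line for each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[segment_key], []).append(row)
    return {
        segment: fit_line([r[x_key] for r in grp], [r[y_key] for r in grp])
        for segment, grp in groups.items()
    }


# Made-up toy rows, not values from the real dataset.
toy_rows = [
    {"release_type": "wide", "screens": 1, "downloads": 2},
    {"release_type": "wide", "screens": 2, "downloads": 4},
    {"release_type": "wide", "screens": 3, "downloads": 6},
    {"release_type": "limited", "screens": 1, "downloads": 2},
    {"release_type": "limited", "screens": 2, "downloads": 3},
    {"release_type": "limited", "screens": 3, "downloads": 4},
]
models = fit_by_segment(toy_rows, "screens", "downloads")
```

Comparing the per-segment R-squared values against a single pooled fit is one way to check whether splitting the data is worthwhile.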
Initially, there was a list of 15,000 movies. I had to reduce the dataset to approximately 2,500 movies because many of them had no DVD release or piracy data.
There were many missing values for the budget, opening weekend collection, total collection and DVD release dates. We imputed the budget, opening weekend collection and total collection with their means, and removed the rows that had no DVD release date.
Some movies had their DVDs released before their theatrical release; these are not considered in this analysis. They are a small portion of the dataset, and I removed them to get better accuracy.
For movies with no recorded number of screens, we assumed a limited release.
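The cleaning step described above could be sketched roughly as follows, using only the standard library. The field names are hypothetical (the real column names are not given), the rows are made-up placeholders, and the order of operations (dropping rows before computing the means) is an assumption, since the text does not say which came first.

```python
from statistics import mean

# Hypothetical column names; the real dataset's names are not given.
NUMERIC_COLUMNS = ("budget", "opening_weekend", "total_gross")


def clean_movies(rows):
    """Drop rows with no DVD release date, then mean-impute numeric gaps."""
    # Copy each kept row so the caller's data is not modified.
    kept = [dict(r) for r in rows if r.get("dvd_release_date") is not None]
    for col in NUMERIC_COLUMNS:
        observed = [r[col] for r in kept if r.get(col) is not None]
        if not observed:
            continue  # no observed values to compute a mean from
        fill = mean(observed)
        for r in kept:
            if r.get(col) is None:
                r[col] = fill
    return kept


# Made-up placeholder rows, not values from the real dataset.
raw_rows = [
    {"title": "A", "budget": 100.0, "opening_weekend": 20.0,
     "total_gross": 300.0, "dvd_release_date": "2013-05-01"},
    {"title": "B", "budget": None, "opening_weekend": 40.0,
     "total_gross": None, "dvd_release_date": "2013-09-10"},
    {"title": "C", "budget": 50.0, "opening_weekend": None,
     "total_gross": 100.0, "dvd_release_date": None},
    {"title": "D", "budget": 200.0, "opening_weekend": 60.0,
     "total_gross": None, "dvd_release_date": "2014-01-07"},
]
cleaned = clean_movies(raw_rows)
```

Mean imputation keeps the row count up but shrinks the variance of the imputed columns, which is worth keeping in mind when reading the R-squared values later.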
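The two rules above (dropping titles whose DVD came out before the theatrical release, and treating a missing screen count as a limited release) might look like this in Python. The field names and sample rows are hypothetical, and the 600-screen cutoff is only an assumed convention for "wide release"; the write-up does not state the threshold that was used.

```python
from datetime import date

# Assumed cutoff for illustration only; the actual threshold is not stated.
WIDE_RELEASE_MIN_SCREENS = 600


def release_type(screens):
    """Return 'wide' or 'limited'; a missing screen count counts as limited."""
    if screens is None:
        return "limited"
    return "wide" if screens >= WIDE_RELEASE_MIN_SCREENS else "limited"


def prepare(movies):
    """Drop DVD-before-theatre titles and tag the rest with a release type."""
    prepared = []
    for movie in movies:
        if movie["dvd_date"] < movie["theatrical_date"]:
            continue  # DVD came out first, so the title is excluded
        tagged = {**movie, "release_type": release_type(movie.get("screens"))}
        prepared.append(tagged)
    return prepared


# Made-up placeholder rows, not values from the real dataset.
sample = [
    {"title": "A", "theatrical_date": date(2013, 3, 1),
     "dvd_date": date(2013, 7, 1), "screens": 3200},
    {"title": "B", "theatrical_date": date(2013, 6, 1),
     "dvd_date": date(2013, 4, 1), "screens": 40},
    {"title": "C", "theatrical_date": date(2013, 9, 1),
     "dvd_date": date(2014, 1, 15), "screens": None},
]
prepared = prepare(sample)
```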
SQL was used to explore the distribution of different movies. We then visualized the results to identify patterns and to check whether our intuitions matched what the data tells us.
From this scatterplot, we can clearly see a strong correlation between the number of screens a movie is released on and the number of downloads it has. This initial analysis is what suggested that I should split the movies into limited and wide releases, because the two groups are significantly different.
From this chart, we can see that there is no particular month in which piracy spikes. Piracy was on the rise in 2012, but from 2013 onwards it has been decreasing for movies. This could be because of the legal and illegal streaming options that have become available over the past couple of years.
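One way to put a number on the relationship seen in the scatterplot is the Pearson correlation coefficient between screens and downloads. The helper below is a minimal standard-library sketch; the write-up does not say which statistic, if any, was computed, and the sample values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up placeholder values, not figures from the real dataset.
screens = [40, 250, 1200, 2800, 3500]
downloads = [900, 4000, 21000, 52000, 60000]
r = pearson_r(screens, downloads)
```

A value of `r` close to 1 would support the visual impression of a strong positive relationship. Computing it separately for limited and wide releases shows whether the relationship holds within each group or only across them.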
From the donut chart, we can conclude that Action movies are the most pirated, followed by Comedy. It also shows us that the top 5 pirated genres are
Natural experiment on country-wise piracy
With this chart, we wanted to find out which country has the most piracy relative to its internet penetration. It was interesting to find that France has the highest piracy per internet penetration. This might be because of the HADOPI law, which forces studios to release DVDs only 4 months after a movie's theatrical release. This is very interesting because in France there is no legal way to watch a movie between the time it leaves the theatres and its DVD release, and this could explain the higher piracy rates in the country.
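The exact normalization behind "piracy per internet penetration" is not spelled out, so the sketch below assumes one plausible reading: downloads divided by the estimated number of internet users (population times penetration rate). The function names, country labels and numbers are placeholders, not the figures behind the chart.

```python
def downloads_per_internet_user(downloads, population, penetration):
    """Downloads per internet user; penetration is a share between 0 and 1."""
    internet_users = population * penetration
    if internet_users <= 0:
        raise ValueError("country has no internet users to normalize by")
    return downloads / internet_users


def rank_countries(stats):
    """Sort countries from highest to lowest downloads per internet user."""
    scored = {
        name: downloads_per_internet_user(**values)
        for name, values in stats.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Placeholder countries and numbers, for illustration only.
country_stats = {
    "Country A": {"downloads": 500, "population": 1000, "penetration": 0.5},
    "Country B": {"downloads": 600, "population": 2000, "penetration": 0.75},
}
ranking = rank_countries(country_stats)
```

Normalizing this way keeps large, well-connected countries from dominating the ranking purely because they have more people online.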