Using a TMDb movie data set containing properties of movies released between 1960 and 2015, I explored what are related to financial success of movies, with visualizations.

In assessing financial success, I used Return on Investment (ROI), a measure commonly used in describing profitability. It is calculated by: ROI = (gross revenue - cost)/cost. Of the properties, I was most interested in budget, genres, and vote scores. Specifically, I investigated:

  1. Did high budget movies achieve high ROI?
  2. Did specific genres lead to high ROI?
  3. Did high score movies achieve high ROI?

The work process was:

I will walk through the process here. You can see code and all details in my GitHub.

Data overview

After loading the data set, I looked at a couple of fundamental aspects.

df.head()
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You based on novel|revolution|dystopia|sequel|dyst... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. android|spaceship|jedi|space opera|3d Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home car race|speed|revenge|suspense|car Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

The data set had 10866 records with 21 columns. Some columns, like ‘cast’ and ‘genres’, contained multiple values separated by pipe characters. With regard to budget and revenue, the data set had both original and inflation adjusted value column. I chose to use the inflation adjusted ones.

Data wrangling

In this step, I dealt with:

  • Duplicates
  • Missing values
  • Unused columns
  • Incorrect values
  • Extraction of a variable of interest (i.e., ROI)
  • Data type
  • Outliers

Duplicates

First of all, I examined if there were any duplicate records by checking for (1) duplicate id values, and (2) records with the same ‘title’, ‘director’, ‘release_year’, and ‘runtime’.

Based on ‘id’

df[df.duplicated(['id'], keep = False)]
id imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
2089 42194 tt0411951 0.59643 30000000 967000 TEKKEN Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian... NaN Dwight H. Little Survival is no game martial arts|dystopia|based on video game|mart... In the year of 2039, after World Wars destroy ... 92 Crime|Drama|Action|Thriller|Science Fiction Namco|Light Song Films 3/20/10 110 5.0 2010 30000000.0 967000.0
2090 42194 tt0411951 0.59643 30000000 967000 TEKKEN Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian... NaN Dwight H. Little Survival is no game martial arts|dystopia|based on video game|mart... In the year of 2039, after World Wars destroy ... 92 Crime|Drama|Action|Thriller|Science Fiction Namco|Light Song Films 3/20/10 110 5.0 2010 30000000.0 967000.0

Two records were detected, identical for all the variables. So, an extra was deleted.

Based on ‘title’, ‘director’, ‘release_year’, and ‘runtime’

df[df.duplicated(['original_title', 'director', ], keep = False)]\
    [['original_title', 'director', 'release_year', 'runtime']].sort_values(by=['original_title'])
original_title director release_year runtime
1400 9 Shane Acker 2009 79
6514 9 Shane Acker 2005 11
4337 Bottle Rocket Wes Anderson 1994 13
8547 Bottle Rocket Wes Anderson 1996 91
4451 Frankenweenie Tim Burton 2012 87
7943 Frankenweenie Tim Burton 1984 29
4063 Madea's Family Reunion Tyler Perry 2002 0
6701 Madea's Family Reunion Tyler Perry 2006 110
5202 Saw James Wan 2003 9
7011 Saw James Wan 2004 103

The output shows that some movies have another movie with the same title/director/release_year. With online search, I learned that this happens sometimes; e.g., a director makes a ‘mini’ version before the full scale. So I kept all the records.

Missing values

The ‘genre’ had only 23 nulls, but there were many records with 0 value for the budget and/or revenue. I had to remove all the records that belonged to any of the following:

  • null for ‘genre’
  • 0 for ‘budget_adj’
  • 0 for ‘revenue_adj’

Incorrect values

As a minimum effort to ensure correctness of budget and revenue values, I inspected data points at both ends (the lowest and highest).

Top budget movie

df[df['budget_adj'] == df['budget_adj'].max()]
id budget revenue original_title genres vote_average release_year budget_adj revenue_adj
2244 46528 425000000 11087569 The Warrior's Way Adventure|Fantasy|Action|Western|Thriller 6.4 2010 425000000.0 11087569.0

Although the above says that ‘The Warrior’s Way’ is the highest budget movie by 425 million, online search revealed that the budget of it is 42.5 million. After this correction, the highest budget became $368 millon of ‘Pirates of the Caribbean: On Stranger Tides,’ which is correct.

Top revenue movie

‘Avartar’ came out as the highest revenue movie, and online search confirmed it.

Bottom budget movie

I displayed several lowest budget movies, and got puzzled with extremely low values like $1, $50, or ‘$100. I did online search and found that:

  • ‘Primer (2004)’ is considered as the lowest budget movie (https://en.wikipedia.org/wiki/Low-budget_film)
  • Many of the low budget values are wrong; for example, the budget of ‘Lost & Found (1999)’ is $30 million, not $1.3, ‘Joyful Noise (2012)’, $25 millon, not $23, ‘Weekend (2011)’, $0.14 millon, not $7755, etc. With the above findings, I inclined to believe that $8081, budget of ‘Primer (2004)’, should be the lowest. I removed movies whose budget was lower than it.

Bottom revenue movie

Again, online search made me distrust data points whose revenue was lower than the revenue of ‘Best Man Down’ ($1840.6). They were removed.

Data type

df.loc[:, 'genres'] = df['genres'].str.strip().str.split("|")

Values of ‘genres’ transformed from string to list, e.g., Action|Adventure|Science Fiction –> [Action, Adventure, Science Fiction], for convenience in later analysis.

Outliers

Fundamentally, I identified records that might inhibit finding a general direction by looking into each variable, and removed them.

Budget

Boxplot of budget

Several records on the top stand out, but I considered them still continuous.

Number of movies by genres

For each genre, I calculated the number of movies involving it.

# Generate a long format dataframe 
df_long = df.explode('genres')

# Count the number of inclusion of each genre.
df_long['genres'].value_counts()
Drama              1746
Comedy             1347
Thriller           1197
Action             1078
Adventure           746
Romance             660
Crime               649
Science Fiction     518
Horror              460
Family              423
Fantasy             393
Mystery             344
Animation           200
Music               134
History             129
War                 119
Western              52
Documentary          34
Foreign              13
TV Movie              1
Name: genres, dtype: int64

I noted that the bottom four genres are involved by a quite small number of movies, less than 100. The number of Western-involved movies is fewer than the half of War-involved movies. I decided to exclude these genres.

Vote scores

Boxplot of vote score

Judging that it is so far away from the nearest, I removed the data point with the lowest value.

ROIs

Boxplot of ROI

The top two data points were considered too far away that I removed them.

To sum up, the data wrangling step removed 7138 records leaving 3728 records in total. The largest removal, 7011 records, was due to value missingness.

Univariate Exploration

I surveyed the variables, budget, ROI, genres, and vote score, individually, before looking at relationships between them.

Budget

The budget is long right tailed, ranging from $8081 to $368 millon (redline for median).

Budget histogram (Cleaned Dataset)

For later analysis, I categorized values into four groups as follows.

qt = df['budget_adj'].quantile([0.25, 0.5, 0.75])

def binning4(x):
    if x < qt[0.25]:
        return "Low"
    elif (x >= qt[0.25]) & (x < qt[0.5]):
        return "Moderate"
    elif (x >= qt[0.5]) & (x < qt[0.75]):
        return 'Little High'
    elif x >= qt[0.75]:
        return 'High'
    
df['budget_level'] = df['budget_adj'].apply(binning4)

Median and Mean of Budget by Budget Level (Cleaned Dataset)

ROI

The ROI column is extremely long tailed, ranging from -1 to 699. More than 3500 out of 3728 fall into between -1 and 22.3 as shown below.

ROI histogram (Cleaned Dataset)

I categorized the ROI in the same way to the budget.

Median and Mean of ROI by ROI Level (Cleaned Dataset)

Genres

I plotted the percentage of movies involving each genre. It shows that almost the half of movies involved ‘Drama’, and ‘Comedy’ is the next most involved genre by about %35.

Percentage of movies by genre (Cleaned Dataset)

Vote score

The vote score seems normally distributed (mean = 6.2, SD = 0.8).

Histogram of vote scores (Cleaned Dataset)

Q1: Did high budget movies achieve high ROI?

I plotted a scatter plot (top), ROI distribution by budget level (bottom-left), and median value by budget level (bottom right).

Budget and ROI plots (Cleaned Dataset)

It is hard to find a relationship in the scatter plot. But, the boxplot by the budget level shows that ROI values of low budget movies span a wide range, up to 500, the largest ROI value in the dataset. The larger budget groups have the narrower spread of ROIs. As well, the comparison of median values shows that the low and high budget levels have a little higher median values than the moderate and little high budget levels.

Meanwhile, I created a contigency table and plotted it as below.

Number of movies by ROI level for each budget level

It is found that low budget movies achieve the low and high level ROIs more frequently than the moderate and little high level ROIs. As opposed to it, high budget movies obtain the moderate and little high level ROIs more frequenlty.

Q2: Did specific genres lead to high ROI?

To compare ROIs between the genres, I calculated two types of means:

  • Simple mean: Simply add up all ROIs and divide it with the total number of movies involving that genre.
  • Weighted mean: Assign weights based on the number genre types involved in a movie. A ROI is multiplied by 1/number_of_genres and the sum is divided by the sum of 1/number_of_genres values.

For example, let’s assume we have a long format dataframe of three movies as below:

id genre ROI number of genres
1 drama 3 2
1 comedy 3 2
2 comedy 5 1
3 drama 2 1

Simple means are:

  • drama, (3+2) / 2 = 2.5
  • comedy, (3+5) / 2 = 4

Weighted means are:

  • drama, (3x1/2+2x1/1) / (1/2+1) = (1.5+2) / 1.5 = 2.33
  • comedy, (3x1/2+5x1/1) / (1/2+1) = (1.5+5) / 1.5 = 4.33

The below shows results of the two methods. The results are not exactly the same (especailly, ‘Horror’ has a larger difference between the two results than other genrens), but order is the same.

Mean of ROIs by genre (Cleaned Dataset)

The results hints that ‘Horror’ tends to bring quite larger ROIs compared to the others. The next profitable genre is ‘Music’ by round 4 or 5 smaller ROI than Horror. ROI values gradually decrease after ‘Music, and thus the difference between ‘Music’ and ‘History’ is only about 3.5.

Although I believe that I obtained an acceptable finding with my approach, it is a shame that effect of combination of genres was not studied. This issue will be discussed a bit more in ‘Limitations’ section at the end.

Q3: Did high score movies achieve high ROI?

Vote score distribution by ROI level (Cleaned Dataset)

The chart shows that the high ROI level tends to bring a little higher median of vote scores (0.4 higher than the little high level, and 0.7 than the low level).

Conclusions

Summary

With the exploration, I found association between 1) budget size and ROI, 2) genres and ROI, and 3) vote average score and ROI. To sum up:

  • I found contrary behavior between the low and high budget groups, in terms of their ROIs. Low budget movies are more likely to be big fail or big success, and high budget movies tend to end up in the middle. This behavior seems a consequence. Considering different capabilities in reaching the audience and the definition of ROI, this behavior seems natural.
  • The genre type that resulted in the highest ROI on average is ‘Horror’, and the smallest ROI, ‘History’, among 17 genre types (excluding ‘Western’, ‘Documentary’, ‘Foreign’, and ‘TV Movie’).
  • It was found that the higher ROI level group tend to associate with higher mean of the vote average.

Based on these findings, it is considered that production of low budget, horror movies that receive high score from the audience is likely to lead to high ROI.

Limitations

The exploration has several limitations as follows:

  • Only about a third of the original datapoints were used in the analysis, since the rest had missing or outlying values. As well, there was no further work to check if any pattern exists in missing or outlying values.
  • In finding association between genres and ROIs, potential effect of genre combination was not explored, and it makes the result somewhat insecure. For instance, let’s assume the following dataset of four movies:
id genres ROI
1 Drama 2
2 Drama, Family 8
3 Drama, Horror 8
4 Family 2

Calculation of weighted means (Drama, 5; Family, 4; Horor, 8) indicates that the highest ROI genre is Horror. However, it is not sure if this is the power of the Horror genre or result of pairing with Drama, since it does not have movies that involve only Horror.