Using a TMDb movie data set containing properties of movies released between 1960 and 2015, I explored what are related to financial success of movies, with visualizations.

In assessing financial success, I used Return on Investment (ROI), a measure commonly used in describing profitability. It is calculated by: ROI = (gross revenue - cost)/cost. Of the properties, I was most interested in budget, genres, and vote scores. Specifically, I investigated:

Did high budget movies achieve high ROI?
Did specific genres lead to high ROI?
Did high score movies achieve high ROI?

The work process was:

Data overview
Data wrangling
Univariate exploration
Q1: Did high budget movies achieve high ROI?
Q2: Did specific genres lead to high ROI?
Q3: Did high score movies achieve high ROI?
Conclusions (Summary, Limitations)

I will walk through the process here. You can see code and all details in my GitHub.

Data overview

After loading the data set, I looked at a couple of fundamental aspects.

df.head()

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	keywords	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	future\|chase\|post-apocalyptic\|dystopia\|australia	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	based on novel\|revolution\|dystopia\|sequel\|dyst...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	android\|spaceship\|jedi\|space opera\|3d	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	car race\|speed\|revenge\|suspense\|car	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

The data set had 10866 records with 21 columns. Some columns, like ‘cast’ and ‘genres’, contained multiple values separated by pipe characters. With regard to budget and revenue, the data set had both original and inflation adjusted value column. I chose to use the inflation adjusted ones.

Data wrangling

In this step, I dealt with:

Duplicates
Missing values
Unused columns
Incorrect values
Extraction of a variable of interest (i.e., ROI)
Data type
Outliers

Duplicates

First of all, I examined if there were any duplicate records by checking for (1) duplicate id values, and (2) records with the same ‘title’, ‘director’, ‘release_year’, and ‘runtime’.

Based on ‘id’

df[df.duplicated(['id'], keep = False)]

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	keywords	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
2089	42194	tt0411951	0.59643	30000000	967000	TEKKEN	Jon Foo\|Kelly Overton\|Cary-Hiroyuki Tagawa\|Ian...	NaN	Dwight H. Little	Survival is no game	martial arts\|dystopia\|based on video game\|mart...	In the year of 2039, after World Wars destroy ...	92	Crime\|Drama\|Action\|Thriller\|Science Fiction	Namco\|Light Song Films	3/20/10	110	5.0	2010	30000000.0	967000.0
2090	42194	tt0411951	0.59643	30000000	967000	TEKKEN	Jon Foo\|Kelly Overton\|Cary-Hiroyuki Tagawa\|Ian...	NaN	Dwight H. Little	Survival is no game	martial arts\|dystopia\|based on video game\|mart...	In the year of 2039, after World Wars destroy ...	92	Crime\|Drama\|Action\|Thriller\|Science Fiction	Namco\|Light Song Films	3/20/10	110	5.0	2010	30000000.0	967000.0

Two records were detected, identical for all the variables. So, an extra was deleted.

Based on ‘title’, ‘director’, ‘release_year’, and ‘runtime’

df[df.duplicated(['original_title', 'director', ], keep = False)]\
    [['original_title', 'director', 'release_year', 'runtime']].sort_values(by=['original_title'])

	original_title	director	release_year	runtime
1400	9	Shane Acker	2009	79
6514	9	Shane Acker	2005	11
4337	Bottle Rocket	Wes Anderson	1994	13
8547	Bottle Rocket	Wes Anderson	1996	91
4451	Frankenweenie	Tim Burton	2012	87
7943	Frankenweenie	Tim Burton	1984	29
4063	Madea's Family Reunion	Tyler Perry	2002	0
6701	Madea's Family Reunion	Tyler Perry	2006	110
5202	Saw	James Wan	2003	9
7011	Saw	James Wan	2004	103

The output shows that some movies have another movie with the same title/director/release_year. With online search, I learned that this happens sometimes; e.g., a director makes a ‘mini’ version before the full scale. So I kept all the records.

Missing values

The ‘genre’ had only 23 nulls, but there were many records with 0 value for the budget and/or revenue. I had to remove all the records that belonged to any of the following:

null for ‘genre’
0 for ‘budget_adj’
0 for ‘revenue_adj’

Incorrect values

As a minimum effort to ensure correctness of budget and revenue values, I inspected data points at both ends (the lowest and highest).

Top budget movie

df[df['budget_adj'] == df['budget_adj'].max()]

	id	budget	revenue	original_title	genres	vote_average	release_year	budget_adj	revenue_adj
2244	46528	425000000	11087569	The Warrior's Way	Adventure\|Fantasy\|Action\|Western\|Thriller	6.4	2010	425000000.0	11087569.0

Although the above says that ‘The Warrior’s Way’ is the highest budget movie by 425 million, online search revealed that the budget of it is 42.5 million. After this correction, the highest budget became $368 millon of ‘Pirates of the Caribbean: On Stranger Tides,’ which is correct.

Top revenue movie

‘Avartar’ came out as the highest revenue movie, and online search confirmed it.

Bottom budget movie

I displayed several lowest budget movies, and got puzzled with extremely low values like $1, $50, or ‘$100. I did online search and found that:

‘Primer (2004)’ is considered as the lowest budget movie (https://en.wikipedia.org/wiki/Low-budget_film)
Many of the low budget values are wrong; for example, the budget of ‘Lost & Found (1999)’ is $30 million, not $1.3, ‘Joyful Noise (2012)’, $25 millon, not $23, ‘Weekend (2011)’, $0.14 millon, not $7755, etc. With the above findings, I inclined to believe that $8081, budget of ‘Primer (2004)’, should be the lowest. I removed movies whose budget was lower than it.

Bottom revenue movie

Again, online search made me distrust data points whose revenue was lower than the revenue of ‘Best Man Down’ ($1840.6). They were removed.

Data type

df.loc[:, 'genres'] = df['genres'].str.strip().str.split("|")

Values of ‘genres’ transformed from string to list, e.g., Action|Adventure|Science Fiction –> [Action, Adventure, Science Fiction], for convenience in later analysis.

Outliers

Fundamentally, I identified records that might inhibit finding a general direction by looking into each variable, and removed them.

Budget

Boxplot of budget

Several records on the top stand out, but I considered them still continuous.

Number of movies by genres

For each genre, I calculated the number of movies involving it.

# Generate a long format dataframe 
df_long = df.explode('genres')

# Count the number of inclusion of each genre.
df_long['genres'].value_counts()

Drama              1746
Comedy             1347
Thriller           1197
Action             1078
Adventure           746
Romance             660
Crime               649
Science Fiction     518
Horror              460
Family              423
Fantasy             393
Mystery             344
Animation           200
Music               134
History             129
War                 119
Western              52
Documentary          34
Foreign              13
TV Movie              1
Name: genres, dtype: int64

I noted that the bottom four genres are involved by a quite small number of movies, less than 100. The number of Western-involved movies is fewer than the half of War-involved movies. I decided to exclude these genres.

Vote scores

Boxplot of vote score

Judging that it is so far away from the nearest, I removed the data point with the lowest value.

ROIs

Boxplot of ROI

The top two data points were considered too far away that I removed them.

To sum up, the data wrangling step removed 7138 records leaving 3728 records in total. The largest removal, 7011 records, was due to value missingness.

Univariate Exploration

I surveyed the variables, budget, ROI, genres, and vote score, individually, before looking at relationships between them.

Budget

The budget is long right tailed, ranging from $8081 to $368 millon (redline for median).

Budget histogram (Cleaned Dataset)

For later analysis, I categorized values into four groups as follows.

qt = df['budget_adj'].quantile([0.25, 0.5, 0.75])

def binning4(x):
    if x < qt[0.25]:
        return "Low"
    elif (x >= qt[0.25]) & (x < qt[0.5]):
        return "Moderate"
    elif (x >= qt[0.5]) & (x < qt[0.75]):
        return 'Little High'
    elif x >= qt[0.75]:
        return 'High'
    
df['budget_level'] = df['budget_adj'].apply(binning4)

Median and Mean of Budget by Budget Level (Cleaned Dataset)

ROI

The ROI column is extremely long tailed, ranging from -1 to 699. More than 3500 out of 3728 fall into between -1 and 22.3 as shown below.

ROI histogram (Cleaned Dataset)

I categorized the ROI in the same way to the budget.

Median and Mean of ROI by ROI Level (Cleaned Dataset)

Genres

I plotted the percentage of movies involving each genre. It shows that almost the half of movies involved ‘Drama’, and ‘Comedy’ is the next most involved genre by about %35.

Percentage of movies by genre (Cleaned Dataset)

Vote score

The vote score seems normally distributed (mean = 6.2, SD = 0.8).

Histogram of vote scores (Cleaned Dataset)

Q1: Did high budget movies achieve high ROI?

I plotted a scatter plot (top), ROI distribution by budget level (bottom-left), and median value by budget level (bottom right).

Budget and ROI plots (Cleaned Dataset)

It is hard to find a relationship in the scatter plot. But, the boxplot by the budget level shows that ROI values of low budget movies span a wide range, up to 500, the largest ROI value in the dataset. The larger budget groups have the narrower spread of ROIs. As well, the comparison of median values shows that the low and high budget levels have a little higher median values than the moderate and little high budget levels.

Meanwhile, I created a contigency table and plotted it as below.

Number of movies by ROI level for each budget level

It is found that low budget movies achieve the low and high level ROIs more frequently than the moderate and little high level ROIs. As opposed to it, high budget movies obtain the moderate and little high level ROIs more frequenlty.

Q2: Did specific genres lead to high ROI?

To compare ROIs between the genres, I calculated two types of means:

Simple mean: Simply add up all ROIs and divide it with the total number of movies involving that genre.
Weighted mean: Assign weights based on the number genre types involved in a movie. A ROI is multiplied by 1/number_of_genres and the sum is divided by the sum of 1/number_of_genres values.

For example, let’s assume we have a long format dataframe of three movies as below:

id	genre	ROI	number of genres
1	drama	3	2
1	comedy	3	2
2	comedy	5	1
3	drama	2	1

Simple means are:

drama, (3+2) / 2 = 2.5
comedy, (3+5) / 2 = 4

Weighted means are:

drama, (3x1/2+2x1/1) / (1/2+1) = (1.5+2) / 1.5 = 2.33
comedy, (3x1/2+5x1/1) / (1/2+1) = (1.5+5) / 1.5 = 4.33

The below shows results of the two methods. The results are not exactly the same (especailly, ‘Horror’ has a larger difference between the two results than other genrens), but order is the same.

Mean of ROIs by genre (Cleaned Dataset)

The results hints that ‘Horror’ tends to bring quite larger ROIs compared to the others. The next profitable genre is ‘Music’ by round 4 or 5 smaller ROI than Horror. ROI values gradually decrease after ‘Music, and thus the difference between ‘Music’ and ‘History’ is only about 3.5.

Although I believe that I obtained an acceptable finding with my approach, it is a shame that effect of combination of genres was not studied. This issue will be discussed a bit more in ‘Limitations’ section at the end.

Q3: Did high score movies achieve high ROI?

Vote score distribution by ROI level (Cleaned Dataset)

The chart shows that the high ROI level tends to bring a little higher median of vote scores (0.4 higher than the little high level, and 0.7 than the low level).

Conclusions

Summary

With the exploration, I found association between 1) budget size and ROI, 2) genres and ROI, and 3) vote average score and ROI. To sum up:

I found contrary behavior between the low and high budget groups, in terms of their ROIs. Low budget movies are more likely to be big fail or big success, and high budget movies tend to end up in the middle. This behavior seems a consequence. Considering different capabilities in reaching the audience and the definition of ROI, this behavior seems natural.
The genre type that resulted in the highest ROI on average is ‘Horror’, and the smallest ROI, ‘History’, among 17 genre types (excluding ‘Western’, ‘Documentary’, ‘Foreign’, and ‘TV Movie’).
It was found that the higher ROI level group tend to associate with higher mean of the vote average.

Based on these findings, it is considered that production of low budget, horror movies that receive high score from the audience is likely to lead to high ROI.

Limitations

The exploration has several limitations as follows:

Only about a third of the original datapoints were used in the analysis, since the rest had missing or outlying values. As well, there was no further work to check if any pattern exists in missing or outlying values.
In finding association between genres and ROIs, potential effect of genre combination was not explored, and it makes the result somewhat insecure. For instance, let’s assume the following dataset of four movies:

id	genres	ROI
1	Drama	2
2	Drama, Family	8
3	Drama, Horror	8
4	Family	2

Calculation of weighted means (Drama, 5; Family, 4; Horor, 8) indicates that the highest ROI genre is Horror. However, it is not sure if this is the power of the Horror genre or result of pairing with Drama, since it does not have movies that involve only Horror.

Jisoo Lee

What are associated with profitable movies?

Data overview

Data wrangling

Duplicates

Based on ‘id’

Based on ‘title’, ‘director’, ‘release_year’, and ‘runtime’

Missing values

Incorrect values

Top budget movie

Top revenue movie

Bottom budget movie

Bottom revenue movie

Data type

Outliers

Budget

Number of movies by genres

Vote scores

ROIs

Univariate Exploration

Budget

ROI

Genres

Vote score

Q1: Did high budget movies achieve high ROI?

Q2: Did specific genres lead to high ROI?

Q3: Did high score movies achieve high ROI?

Conclusions

Summary

Limitations