So, Microsoft wants to start a movie studio: Exploring movie data with Python

Amanda Gaeta
21 min readJan 3, 2021

In my last post, I introduced my journey into learning data science and why it was important to me. Since then, I have been developing foundational skills in Python and SQL to be able to apply this knowledge. My first opportunity to apply those skills was our Phase 1 project.

The assignment put us in the mindset of a consultant tasked with providing recommendations to Microsoft. What could Microsoft possibly be needing consultation on? Well, they want to start a movie studio.

Provided with initial datasets from IMDB, Rotten Tomatoes, The Movie Database, Box Office Mojo, and The Numbers, as well as the opportunity to utilize additional datasets if needed, I was set to dive into the insights hidden within.

Knowing that this would be a very early business conversation, my goal was to provide an understanding of the movie landscape at a high level.

Data Discovery and Foundational Dataset for Analysis

With 10 different databases to get started, how do you tactically start?

My approach was to import and preview each one, taking note of each of the columns. Through this process, not only am I keeping track of what information is available to understand what questions can be answered, but I am looking for ways to tie the datasets together. In the perfect world, there would be some standardized ID in the movie industry that would be used across the datasets — but we all know we don’t live in a perfect world. Thus, I prioritized which data sources to look into (IMDB, Rotten Tomatoes, etc) first by the amount of data initially given.

IMDB represented not only 6 of the 10 files, but it represented the best breadth and depth of information available including: basic movie data points (title, year of release, movie genres), actors, writers, producers, directors, and even user ratings. Yes, these were all in separate files, but the dataset got even better — there were consistent IDs across all of the files for the movies and the individuals associated with them.

Knowing that the end goal was pitching the best initial investments for Microsoft’s movie studio, I knew I needed a base of financial data. This is where the skills of how to marry data without sacrificing too much of it as to keep the analysis worthwhile. If that initial IMDB dataset was cut down to 100 movies, it’s safe to say the analysis may not be safe to base millions of dollars of investment on. When looking at the financial data available, there were two options: The Numbers and Box Office Mojo. The difference was Box Office Mojo gave numbers on foreign and domestic gross on ~3k movies, while The Numbers gave worldwide gross and domestic gross on ~6k; but The Numbers data had one more important asset — production budget. With this extra value, I could map out ROI and use that as a key indicator for investment.

As a final note on the exploratory process, I will say that beyond IMDB ratings data, I dug into Rotten Tomatoes and The Movie Database to see if there was a better representation of viewer feedback to inform additional guidance on what movie studios should be producing. As a parallel to the comparison with The Numbers and Box Office Mojo, IMDB had ratings connected to the movie IDs so it was more easily obtainable on the largest dataset available as well as in a better format for analysis (float number versus fractions that needed cleansing). I also looked into the number of votes movies had to see if one data source offered higher engagement to get a truer sense of viewer feedback, but it was consistent across each dataset at ~15 votes as a median count. Thus, I decided to stick with the IMDB ratings data in my analysis.


Now that I had decided on the datasets I wanted to bring into my final analysis, it was time to cleanse and merge into a master dataset. This was also the point where my main challenge of the project came up and served as a theme end to end:

How do I take what I would do in Excel and translate it to Python?

What was the the equivalent to RIGHT(), LEFT() to breakout certain characters in a string? How do I reformat dollars to numbers? How can I make a more complex list of items into a quick Boolean datapoint for easier reference? If I would normally use a Pivot table, how could I make the exact one in a Jupyter notebook?


When running a .info() on my merged IMDB dataset, genres and runtime had null values.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73856 entries, 0 to 73855
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 73856 non-null object
1 primary_title 73856 non-null object
2 original_title 73856 non-null object
3 start_year 73856 non-null int64
4 runtime_minutes 66236 non-null float64
5 genres 73052 non-null object
6 averagerating 73856 non-null float64
7 numvotes 73856 non-null int64
dtypes: float64(2), int64(2), object(4)
memory usage: 5.1+ MB

Knowing that runtime may be part of my analysis, I filled the nulls with 0.0 so I could easily make table that excluded these rows

imdb_tb_tr[‘runtime_minutes’] = imdb_tb_tr[‘runtime_minutes’].fillna(0.0)

For the genres column, I previewed the values to understand formatting beyond simply knowing it was a string. There were comma separated values of genres. I filled in the nulls with “Unknown” initially and knew I could revisit if needed.

imdb_tb_tr['genres'] = imdb_tb_tr['genres'].fillna('Unknown')

Formatting financial data as float data types and in millions for charting

Since financial data also would play a large role in the analysis, especially regarding the idea of ROI, I spent time thinking through what would be needed end to end. At a minimum, the values needed to be numerical with the same formatting.

In The Numbers data, only worldwide gross already the correct datatype.

 3   production_budget       5782 non-null   object 
4 domestic_gross 5782 non-null object
5 worldwide_gross 5782 non-null float64

Each of these was approached in the same way: remove the commas and dollar signs, then convert it to a float object if it was not one already. The code for production_budget is below

tn_mb_df[‘production_budget’] = tn_mb_df[‘production_budget’].str.replace(‘,’,’’)
tn_mb_df[‘production_budget’] = tn_mb_df[‘production_budget’].str.replace(‘$’,’’).astype(float)

Genres conversion from string to list for Boolean breakout

I knew that genres were definitely going to be part of the base of my initial analysis, so I spent time in understanding how they should be formatted. The current string format of “Action,Adventure,Comedy” would not serve the analysis in the best way. Thus, I needed to break them out into their. own boolean columns. This would allow me to capture the individual genres that the movie was categorized under, especially because much of the dataset had multiple genres.

The cleansing portion of this was converting the string of comma separated values into a list. This would allow me to utilize list methods to further parse the data into the formatting I needed.

imdb_tb_tr_tc[‘genres’] = imdb_tb_tr_tc[‘genres’].str.split(‘,’)

Additional Custom Data Points

As I had more time with the data and got closer to developing my final questions, I realized I needed additional data points and translated versions of the data. Below are the custom data points I created in the process:

Genres Boolean breakout (IMDB) create new table then join

After I had converted the genres column values to lists, I assigned a variable to the Series to simplify the coding in the future steps in this process.

imdb_genres = imdb_tb_tr_tc[‘genres’]

Now that I had lists and a variable, I needed to create a list of the possible genre options, which was done using the for loop below.

imdb_genres_list = []# Start with rows in index
for row in imdb_genres.index:
# Access the list data type in each row, it will change with every row in the index
for item in imdb_genres[row]:
# append the genre that is taken as an item from the list within the row and add it to the genres_list
# Define a set of the genres_list from the above for loop; reassign the genres_list variable name to this set
imdb_genres_list = set(imdb_genres_list)

The imdb_genres_list looked like so:


Based on this I created a new DataFrame

imdb_genres = pd.DataFrame(imdb_tb_tr_tc[‘genres’])

Now I needed to establish new columns for each of the genres. The plan being to merge this into the IMDB dataset after the Boolean columns were created.

# Use for loop to create columns for each genre in the deduplicated set of genres for the genres_list
for genre in imdb_genres_list:
#create a new column in our new DataFrame
imdb_genres[genre] = 0

Now imdb_genres was an empty DataFrame with lists of the genres from the IMDB movies in the ~75k dataset with columns for each of the individual genres listed

The final step in this Boolean translation was to go through each row of lists of genres and put a 1 value in the column of the genre that was listed. This was done using the for loop below.

for row in imdb_genres.index:
# Using previous for loop, edit it to access our new DF’s column ‘genres’ THEN the row
# This will get us to the list of genres in the given row
for item in imdb_genres[‘genres’][row]:
# Then say access the column that matches single genre in that list of genres (item) in that row (row)
imdb_genres[item][row] = 1

With the table now filled in, it could be joined with the IMDB full dataset using the original index.

imdb_with_genre_cols = imdb_tb_tr_tc.merge(imdb_genres, left_index=True, right_index=True)

The IMDB dataset now looked like the below (notice the additional columns per genre):

Formatting financial data in millions for charting (The Numbers)

Know that I needed to chart my findings in a clean and easily absorbed format, I knew I needed to round my financial data to be in millions.

I created new “_in_mils” columns to hold the rounded values for gross and budget numbers. Example code for production_budget data below:

tn_mb_df[‘production_budget_in_mil’] = round((tn_mb_df[‘production_budget’]/1000000),2)

Foreign Gross (The Numbers)

I knew foreign gross as a datapoint was something that I had missed out on in not using Box Office Mojo dataset, but with The Numbers dataset it was simple to calculate. I created a new column for the new datapoint and subtracted the available domestic gross from the worldwide gross using my rounded values.

tn_mb_df[‘foreign_gross_in_mil’] = tn_mb_df[‘worldwide_gross_in_mil’] — tn_mb_df[‘domestic_gross_in_mil’]

Domestic vs Foreign Gross Percentage (The Numbers)

I did similar simple math in new columns for domestic and foreign gross percentages in case I needed the data, it was simple enough to calculate at this point.

tn_mb_df[‘domestic_gross_p’] = round((tn_mb_df[‘domestic_gross_in_mil’]/tn_mb_df[‘worldwide_gross_in_mil’]), 2)
tn_mb_df[‘foreign_gross_p’] = round((tn_mb_df[‘foreign_gross_in_mil’]/tn_mb_df[‘worldwide_gross_in_mil’]), 2)

Production ROI (The Numbers): division and cleansing (NaN or infinite)

With the financial data cleansed and calculated, I needed to now create my priority metric — ROI. I created the new column using simple division to calculate the numbers, then I excitedly ran .describe() to get an idea of what ROI looked like for the movies in this dataset. My first real insight!

It produced errors:

count    5780.000000
mean inf
std NaN
min 0.000000
25% 0.492245
50% 1.709144
75% 3.760000
max inf
Name: prod_budget_ROI, dtype: float64

I needed to go to the wider world of google to understand what inf meant in this instance. Thanks to StackOverflow, I was able to rectify the error caused by having nulls or 0s in my production budget columns.

tn_mb_df['prod_budget_ROI'] = tn_mb_df['prod_budget_ROI'].replace([np.inf, -np.inf], np.nan)

…and get true numbers!

count    5779.000000
mean 4.838506
std 34.340229
min 0.000000
25% 0.492183
50% 1.708889
75% 3.757857
max 2250.000000

Movie Release Year and Month (The Numbers)

At this point I also needed to think about how to join my IMDB dataset with The Numbers coming up with the unique ID of title and year concatenation; but The Numbers only had a full release date. I needed to create release year to use in concatenation for creation of this unique ID.

tn_mb_df[‘release_year’] = tn_mb_df[‘release_date’].map(lambda x: x.split(“, “)[1])

I also dove into the possibility of asking questions about release months. So I had to create new columns for each of those by breaking out the data from the full release dates available.

Using the same full release date available from The Numbers , I was able to pull out the values needed into this new columns:

tn_mb_df[‘release_month’] = tn_mb_df[‘release_date’].map(lambda x: x.split(“ “)[0])

Merging Datasets

This brings us into our next problem to solve — both of these datasets meant nothing if I couldn’t find a way to effectively marry them into my base IMDB dataset of ~75k movies.

My initial dataset previews showed me that both movies had the title of the movies; but from my previous experiences of doing extensive VLOOKUPs for Sales and Marketing data, I knew one datapoint was not enough. Duplicates were likely to exist, especially with film remakes. Thus, I created a custom unique ID and concatenated title and year I had thought of while creating additional data points. I created this field in both IMDB and The Numbers datasets. Example code the the Numbers below:

tn_mb_df[‘title_year’] = tn_mb_df[‘movie’] + ‘ ‘ + tn_mb_df[‘release_year’].astype(str)

In an effort to maintain as much data as possible, when joining the two datasets (my IMDB base of ~75k movies and the matching to The Numbers set of ~6k movies), I utilized a left join to pair The Numbers data where it was possible.

imdb_with_genre_cols = imdb_with_genre_cols.set_index(‘title_year’).join(tn_mb_df.set_index(‘title_year’))

There were ~1500 movies that were appended with financial data with this match. Since my goal was to provide a high-level initial analysis, I maintained that this was a good starting point. As conversations persisted, I would either seek out additional avenues to obtain budgets and grosses for movies whether that be via new datasets and/or more in-depth data matching processes.

After getting to this dataset as an acceptable base, I had to reintroduce myself to the universe I was going to be analyzing and making sure it made sense. Since I had utilize year as part of my unique ID and it was one of the most available pieces of data, I started there. In doing a quick run of .info() on my data, I was able to see that my min and max were 2010 and 2019. This gave me a translatable idea of the data being worked with from a business perspective whether stated as “movies from the 2010s” or “movies released in the last 10 years”.

Initial Questions

Now that I had my base dataset and some high level segments to dig into, I could start developing questions. I knew I wanted to approach it from a business perspective via the financial data; but I also knew that there was a limited match, so I continued to zoom out my approach. One of the major features the wider set of IMDB data contributed was a view on genre. Thus, I decided to start at high-level genre analysis, and use a subset of these top genres to dig further into the financials behind them.

The questions below are what I decided to analyze and present on:

  • Question 1: What were the top movie genres made in the 2010s?
  • Question 2: What is the best month to release a movie for highest worldwide gross?
  • Question 3: Of movies that breakeven (ROI >= 1), what genres are most represented?
  • Question 4: Based on production budget and average ratings, what genres are the best investments?
  • Question 5: For these breakeven movies that fall into these genres, what is the recommended runtime and who are the highest rated directors?

Answering Questions

Question 1: What were the top movie genres made in the 2010s?

This question was made far more simple by the pre-work of generating the Boolean breakout of the genres. I created a dictionary and got counts per genre using a for loop:

genre_total_dict = {}for genre in genre_name_list:
genre_total_dict[genre] = imdb_with_genre_cols[genre].sum()


Then I sorted it:

import operator
sorted_genre_count_dict = dict( sorted(genre_total_dict.items(), key=operator.itemgetter(1),reverse=True))
{'Drama': 30788,
'Documentary': 17753,
'Comedy': 17290,
'Thriller': 8217,
'Horror': 7674,
'Action': 6988,
'Romance': 6589,
'Crime': 4611,
'Adventure': 3817,
'Biography': 3809,
'Family': 3412,
'Mystery': 3039,
'History': 2825,
'Sci-Fi': 2206,
'Fantasy': 2126,
'Music': 1968,
'Animation': 1743,
'Sport': 1179,
'War': 853,
'Musical': 721,
'News': 579,
'Western': 280,
'Reality-TV': 17,
'Adult': 3,
'Game-Show': 2,
'Short': 1}

From here I was able to draft a bar chart:

#Plot, note movies with multiple genres counted once for each genre
plt.figure (figsize=(10,10)), sorted_genre_count_dict.values())
plt.title(‘Top Genres Of 2010s’, fontsize=20, fontweight=”bold”)
plt.xlabel(‘Genre’, fontsize=14)
plt.xticks(rotation=90, fontsize=14)
plt.ylabel(‘Count of Movies’, fontsize=14)

Seeing that it was cramped, some counts weren’t showing at this same, and with the intention of presenting this in a business context, I kept the idea in mind to create a subset of this list to “zoom in” on the top 10 genres.

Question 2: What is the best month to release a movie for highest worldwide gross?

I had broken out the month as a string for release dates, but not as a number. Thus I needed to create a new column to represent this.

imdb_all_prod_roi_genres[‘release_month_number’] = imdb_all_prod_roi_genres[‘release_date’].dt.month

To initially view the data, I created a pivot table to show the total worldwide gross per month — another insight!

Question 3: Of movies that breakeven (ROI >= 1), what genres are most represented?

To answer this question I needed to create a separate table for the movies that had an ROI greater than 1.

imdb_prodROI_breakeven = imdb_with_genre_cols[imdb_with_genre_cols[‘prod_budget_ROI’] >= 1]

To get into the genres further, I broke out another data set from my new imdb_prodROI_breakeven dataframe using my previously defined genre_name_list.

imdb_prodROI_breakeven_genres = imdb_prodROI_breakeven[genre_name_list]

From here I was able to use a for loop for counting and see what genres were represented in the movies that broke even — yet another insight I could deliver!

for genre_column in imdb_prodROI_breakeven_genres:
print(genre_column, imdb_prodROI_breakeven_genres[genre_column].sum())
Music 36
War 7
Reality-TV 0
Sport 22
Drama 471
Adventure 296
Game-Show 0
Animation 91
History 25
Romance 139
Musical 5
News 0
Mystery 92
Comedy 391
Documentary 22
Fantasy 95
Adult 0
Horror 115
Crime 151
Family 71
Thriller 179
Western 4
Action 326
Short 0
Sci-Fi 102
Biography 88

Question 4: Based on production budget and average ratings, what genres are the best investments?

This was the “adventurous” and unique plot. I wanted to see how I could bring the data from the beginnings of movie making (production budget) and tie it to user driven data (the ratings).

I started with my top genres and tested with Drama. I separated out the data related to movies with the Drama genre, and using .describe() to get an overview. Immediately seeing there were outliers via a large difference between the max and the 50% and 75% quartile, I made a judgement call that median production budget would be more representative in this plotting as to not be impacted by the massive budget numbers on a handful of films.

Then I looked at the average rating for Drama films.

With this test code already written up, I applied it to the rest of the top genres and created a dataframe for this new information. There was a clear winner in the Drama category — there was the insight again!

Question 5: For these breakeven movies that fall into these genres, what is the recommended runtime and who are the highest rated directors?

I came to this question as I thought of next tactical steps Microsoft could walk away with after this presentation — especially with understanding who to look into contacting not only for their best investment, but to gauge interest in being part of the new production studio.

For the data in this, I tested with Drama first and broke out the movies in the Drama genre into a different table making sure they were also filtered so that they contained values for Directors (no nulls) and available runtimes (>0). From there I was able to look at then looked at average ratings for these films. Since this was based off the initial ~75k record dataset, I pulled out the top 50% of movies to look at the highest rated directors by creating a pivot table and previewing the top 5 entries

top_rated_imdb_dramas.groupby([‘directors’]).agg(“mean”).sort_values(by=’averagerating’, ascending=False).head(5)

… but I still did not have names available on the directors.

I had to use these nm IDs to cross reference another IMDB table that was not joined to pull the names.

In order to do this for the top 5 genres, I defined a function to do this cross referencing. The function definition approach was in an efficient mindset; but this is another beginner instance were the code itself could be improved upon as this goes through the entire IMDB table with the directors names and tries to match the nm ID number. When it does match, it prints the name associated.

# Define function to get top 5 directors names
def top_5_directors_names(directors_list):
“””Return primary_name of director in a list of nm IDs”””
for director in directors_list:
for row in imdb_nb_df.index:
if imdb_nb_df[‘nconst’][row] == director:

I built upon this by creating a for loop to do the above for each of the top genres after testing the above with Drama. This would return median votes, average rating, average runtime, nm IDs for top directors and a list of the directors names.

#Define top genres
top_genres_list = [‘Drama’, ‘Comedy’, ‘Action’, ‘Adventure’, ‘Thriller’]
# Create for loop to do the above for each top genre and print results
for genre in top_genres_list:
genre_table = imdb_with_genre_cols[(imdb_with_genre_cols[genre] == 1) & (imdb_with_genre_cols[‘directors’] != “Unknown”)]
print(f”median votes = {genre_table[‘numvotes’].median()}”)
genre_table = genre_table[genre_table[‘numvotes’] >= (genre_table[‘numvotes’].median())]
print(f”average rating = {genre_table[‘averagerating’].median()}”)
genre_table = genre_table[genre_table[‘averagerating’] >= (genre_table[‘averagerating’].median())]
genre_table_runtime = genre_table[genre_table[‘runtime_minutes’] > 0]
print(f”average runtime = {round(genre_table_runtime[‘runtime_minutes’].mean(),2)}”)
top_5_genre_directors = genre_table.groupby([‘directors’]).agg(“mean”).sort_values(by=’averagerating’, ascending=False).head(5)
top_5_genre_directors = top_5_genre_directors.reset_index()
top_5_genre_directors_ns = list(top_5_genre_directors[‘directors’])

This by far took the longest to run; but I obtained the insight!

median votes = 71.0
average rating = 6.3
average runtime = 107.47
['nm10369569', 'nm9982663', 'nm10285722', 'nm1682596', 'nm10005127']
Nagaraja Uppunda
Arsel Arumugam
Sudheer Shanbhogue
Paul Michael Bloodgood
Colonelu Morteni
median votes = 95.0
average rating = 5.8
average runtime = 103.63
['nm0000233', 'nm10285722', 'nm10436203', 'nm9073819', 'nm8589213']
Quentin Tarantino
Sudheer Shanbhogue
Abhinav Thakur
Amr Gamal
Karan R Guliani
median votes = 170.0
average rating = 5.8
average runtime = 117.65
['nm10466690', 'nm0000233', 'nm9276879', 'nm6442107', 'nm3586222']
Quentin Tarantino
Ajay Andrews Nuthakki
Ram Kumar
Thiagarajan Kumararaja
median votes = 111.0
average rating = 6.0
average runtime = 105.66
['nm6748553', 'nm5139001', 'nm1957250,nm1601055', 'nm7186336', 'nm9762716']
Karzan Kardozi
Zolbayar Dorj
Christina Kyi
Matt Horton
median votes = 132.0
average rating = 5.5
average runtime = 107.38
['nm4891543', 'nm2755490', 'nm10079200', 'nm7464139', 'nm6442107']
Shivkumar Parthasarathy
Amitabh Reza Chowdhury
Gvr Vasu
Sushanth Reddy
Ram Kumar

Learnings Summary

  1. Drama, Documentary, Comedy, Thriller, and Horror were top movie genres made in the 2010s.
  2. June and November are best release months based on highest worldwide gross.
  3. Drama, Comedy, Action, Adventure, and Thriller are top genres represented in breakeven films.
  4. Dramas and Comedies are the best investment with a median production budget of $17MM and $25MM, respectively.
  5. Movies in these top genres should run approximately an hour and 45 minutes to 2 hours long. Top directors include global audiences.

Formatting non-technical answers

Pinpointing the answers to the questions is one thing, but understanding how to translate that to someone who isn’t spending hours in the data and presenting it in a clean and interesting way is another mindset entirely. Fortunately with a background in Sales and Marketing Operations and Technology, this is a skill I have consistently had to work at in my career. I decided to stay as simple as possible with bar graphs and scatterplots. In an attempt to switch it up and layer multiple pieces of important information, I utilized question 4 on production budgets versus average ratings to be creative.

  • Question 1: What were the top movie genres made in the 2010s?

I was able to use my “zoom in” idea and create a subset of the top 10 genres of the dictionary I had previously created:

genre_count_dict_zoom = {k: sorted_genre_count_dict[k] for k in list(sorted_genre_count_dict)[:10]

The was able to import seaborn, then reformat labels and colorings to create and save a cleaner representation of the learnings.

plt.figure (figsize=(20,10)), genre_count_dict_zoom.values(), color=’lightskyblue’)
plt.title(‘Top Genres Of 2010s’, fontsize=20, fontweight=”bold”)
plt.xlabel(‘Genre’, fontsize=14)
plt.xticks(rotation=90, fontsize=14)
plt.ylabel(‘Count of Movies’, fontsize=14)
N=73,856; Note: Movies with multiple genres counted once for each genre of which they were categorized
  • Question 2: What is the best month to release a movie for highest worldwide gross?

In order to plot this with the release months as labels instead of the numbers I had at this point, I edited the new release month number column with the month names as strings that I wanted to view on my final chart in a for loop. From here I was able to map the release month number to the release month name, then I renamed the column. This is definitely an inefficient beginners way to do this, but I also understand that perfect is sometime the enemy of good and done.

imdb_all_prod_roi_genres_mon = imdb_all_prod_roi_genres_mon.rename(columns={“release_month_number”: “release_month_name”, “sum”: “worldwide_gross_in_mil_sum”})

This allowed me a new version of the table I could use to create a chart.

From here I utilized the same formatting base to create consistency for my presentation and charted the months against the worldwide gross for the final product.

  • Question 3: Of movies that breakeven (ROI >= 1), what genres are most represented?

To chart this used a for loop to create two parallel lists to translated into tuples and place into a dataframe for easier plotting.

prodROI_genres = []
prodROI_counts = []
for genre_column in imdb_prodROI_breakeven_genres:
# Create DataFrame for plotting
prodROI_genre_counts = list(zip(prodROI_genres, prodROI_counts))

# Assign data to tuples.
# Create DataFrame
prodROI_genre_counts = pd.DataFrame(prodROI_genre_counts, columns = [‘genre’, ‘count’])
prodROI_genre_counts = prodROI_genre_counts.sort_values(by=’count’, ascending=False)

Once again for visual consistency I repurposed my previous plotting code to create the below for the presentation.

N=1,049; Note: Movies with multiple genres counted once for each genre of which they were categorized
  • Question 4: Based on production budget and average ratings, what genres are the best investments?

Knowing I had a dataframe with the data was a good base, but since this was a unique plotting exercise I had to experiment, especially with coloring.

I plotted production budget and rating separately at first to see how it looked before focusing on how to present it.

I also knew I had to play with color to distinguish the different genres while also tying together that the second plot was for user ratings. I did this by using a seaborn color palette for the genre/budget bars while keeping the average rating plot points a consistent blue.

In presenting this, knowing it was not something seen regularly, I called out that the thing to look for was a gap between the blue dot (average rating) and the bar. This gap was telling us, there was a lower amount of money spent for a higher rating — “ a bang for your buck” metric. Thus, Drama and Comedy were the standouts.

Note: Independent tables separated out by genre. Movies with multiple genres counted once for each genre of which they were categorized
  • Question 5: For these breakeven movies that fall into these genres, what is the recommended runtime and who are the highest rated directors?

In an effort to not have a bar chart for everything and have more tactical recommendations, I created a simple table. This showed the consistency across the run times and also the director representation via global names as well as repeated names across genres.

Next Steps and Recommendations

Since I approached this from the perspective of providing high-level initial insights, there is endless opportunity to dig deeper into these movie production and analysis conversation.

From the Microsoft and business perspective, it is advisable as this point not only to obtain feedback, but to get additional details on how they envision their approach to this production studio start up. In order to guide the business towards details that would be helpful for further analysis, I asked for consideration of:

  • Target audience definition including their location and language — especially since the top rated directors findings showed a global audience
  • Lifecycle definition as it comes to theater release, streaming, or both in a more end to end lifecycle
  • Budgeting and initial launches as it comes to investing heavily in a banner movie to draw large audiences and a goal movie portfolio to start out with.

In regards to data analysis, I followed up with:

  • If Microsoft decide to incorporate streaming, additional data should be gathered from top streaming services for analysis
  • Additional analysis should level-set genres since Drama was overly represented in the initial dataset
  • Discovering the top writers per top genre could be good for findings and to accompany the list of directors for outreach opportunities when ready to start investing in making movies
  • Uncovering the highest rated actors in top genres and their known for characters that could guide banner movie releases and more tactical script writing and character development


As a first opportunity to apply my newly developed skills in Python to a real-life business scenario, this was an interesting way to see how this new information was married into my past experience with data analysis. It was simple for me to derive endless questions to answer and come up with a data story to tell for this initial Microsoft conversation. Then when it came to the hands-on analysis, there was the aforementioned question of “I know how to do this in Excel, but what does that look like if translated to Python?” For a first attempt I don’t find this as a negative thing. It requires me taking my foundational experience, building upon it, and finding ways to relate my new tools to the older ones I am comfortable with. This also connects back to my initial post of Why Data Science? and learning data science as a complementary skill. Ultimately it is all a part of the growth process I intentionally signed up for and am working towards using to make a positive impact.



Amanda Gaeta

MarTech, CRM, automation and data nerd. Managing a small zoo of 3 cats and a dog in Austin, TX.