Data Analysis on IMDb Movies Dataset using Python



RESUME - Project overview


Jump to: SOLUTION - Project implementation

Problem Statement

Skills & Concepts demonstrated

Questions

Insights

A reporter is writing an article about 40 years of the movie industry for MovieMaker Magazine. A dataset containing IMDb movie details was downloaded from the Kaggle website. The task of the assigned Data Analyst is to organize and analyze the data and to deliver actionable insights to the customer as a basis for writing the article.

Problem Statement


The dataset contains details on movies released in the period 1980-2020. It is delivered as a CSV file, and since the customer is a reporter writing an article, a raw CSV file is not a useful source of insights for them.

  • The dataset has to be cleaned of duplicates and NULL values: each duplicate row and each row containing a NULL value in any column has to be removed.
  • The business insights the customer requires are overall statistics, movie score categorization, the most productive companies and directors, the highest scored movies, an overview of vote counts and movie runtimes, as well as the correlation between score and profit/ROI.

Skills & Concepts demonstrated


    The following Python features are used in this project:

  • Using libraries pandas, matplotlib and seaborn
  • Data assessing methods head(), tail(), info(), shape, describe()
  • Cleaning methods duplicated(), isnull(), drop_duplicates(), dropna()
  • Type conversion methods to_datetime(), astype()
  • Using data structures: list, dictionary, tuple
  • String methods split(), format(), join()
  • DataFrame methods insert(), rename(), set_index(), reset_index(), nlargest()
  • Series methods between(), keys() and quantile()
  • Aggregate functions on Series using agg() method
  • Using lambda function on Series with apply() method
  • Method cut() and qcut() for data binning
  • Method value_counts() and operator loc[]
  • Visualization by pie chart, bar chart, scatter plot, heat map, multi-plot grid

    Questions


    The customer wants to answer the following questions:

    1-Categorize movie Score using the following key, based on the range from lowest to highest movie score: Excellent (top 25%), Very good (65-75%), Good (55-65%), Average (45-55%), Poor (below 45%).

    2-Find top 10 companies by number of movies made for each decade and top 5 companies overall.

    3-Get the top 10 highest rated movies and visualize the insights, showing the director in the legend. Find the top 10 most productive directors overall (by movies directed).

    4-Find the overall correlation between movie score and net profit. Net profit is the difference between gross revenue and budget.

    5-Find the correlation between movie score and return on investment (ROI) for each decade. ROI is the ratio between net profit and budget.

    Insights


    1. Most movies (1,862) fall into the Good category (34.5%), followed by Very good with 1,309 movies (24.2%). The 3rd place (981) belongs to the Average category (18.2%), followed by Excellent with 671 movies (12.4%) and Poor with 578 movies (10.7%)


    2. The top company overall by movies made is Universal Pictures (330), followed by Columbia Pictures (301), Warner Bros. (298), Paramount Pictures (278) and Twentieth Century Fox (209)


    3. The highest rated (9.3) movie is The Shawshank Redemption, followed by The Dark Knight (9.0). The Lord of the Rings: The Return of The King (8.9), Pulp Fiction (8.9) and Schindler's List (8.9) share the 3rd, the 4th and the 5th place. The places from 6 to 9 are shared by Inception (8.8), Fight Club (8.8), The Lord of The Rings: The Fellowship of the Ring (8.8) and Forrest Gump (8.8). In the 10th place is the movie Star Wars: Episode V - The Empire Strikes Back (8.7)


    4. The correlation is positive and small (0.24). After removing high and low outliers (5% in total) from the Profit variable, the correlation factor even decreased (0.20), so it remained in the small-correlation category


    5. For each decade the correlation is positive and negligible (0.09) before removing outliers. After removing the top 3% of outliers from the ROI variable, the correlation factor increased significantly (0.26), but it remained small.



    SOLUTION - Project implementation


    Back to: RESUME - Project overview

    0.1-Gathering data

    0.2-Assessing data

    0.3-Data cleaning

    0.4-Data transformation

    1-Data binning

    2-Feature engineering

    3-Univariate analysis

    4-Bivariate analysis

    5-Multivariate analysis

    The raw data source is a CSV file:


    The following can be downloaded from the GitHub repository:

  • Dataset as CSV file
  • Jupyter notebook script

    Before processing the business questions, the data will be gathered from the CSV file and assessed. Based on the assessment outcome, the data will be cleaned and transformed, i.e. data types will be converted to appropriate types depending on the values they contain.


    0.1-Gathering Data


    The dataset is a single CSV file that will be imported into a DataFrame object using pandas. A DataFrame is a 2-dimensional structure that stores data in rows and columns (Series). The first step is importing the Python libraries that will be used - pandas, matplotlib and seaborn. The dataset is loaded directly from the GitHub repository into a pandas DataFrame.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    df=pd.read_csv("https://github.com/berislav-vidakovic/Portfolio/blob/main/IMDbMovies/ImdbMovies.csv?raw=true")
    pd.options.display.max_rows = 5 #limit row count to display
    df # show DataFrame

    The first and last 2 rows are shown (by default there would be 5+5 rows; this is controlled by the max_rows display option). Below the table there is some basic dataset information - row and column count. However, there is much more information about the data (such as column data types, NULL/non-NULL value counts etc.) that can be obtained using different pandas methods and attributes of the DataFrame and Series classes.


    0.2-Assessing Data


    In this section the data will be explored in general: showing the first and last few rows, column data types, max/min/mean values per column, and detecting null values and duplicates.

    The head() and tail() methods on a DataFrame show the first and last n rows, where n is the number passed. By default, if no parameter is passed, the first/last 5 rows are returned. Pretty formatting of floats - 2 decimals and a thousands separator - is accomplished by changing the float_format display option and passing in a lambda function.

    pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x)) # format float as 1,000.00
    df.head(3)
    df.tail(3)


    The shape attribute of a DataFrame returns the row and column count as a tuple. It can also be formatted by accessing each tuple member separately to present the information in a more user-friendly way. The info() method shows column data types and further details about the DataFrame.

    df.shape
    print("Rows: ", df.shape[0], "Columns:", df.shape[1])
    df.info()

    The result of the info() method already provides insight into null values for some columns, since their non-null value count is lower than the total row count. Null values will be investigated and visualized later.

    The next method is describe(), which generates descriptive statistics: measures that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

    pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x)) # format float as 1,000.00
    currentSetting = pd.options.display.max_rows 
    pd.options.display.max_rows = 15 # needed for describe() method only, transparently changed
    df.describe()
    df.describe(include='all')
    pd.options.display.max_rows = currentSetting # reset to previous value


    Checking for duplicates starts with the any() method to see whether there are any; if so, their count is obtained with sum(), and the duplicate rows themselves are shown using boolean indexing. This information will be used in the Data cleaning section, where duplicates will be removed.

    df.duplicated().any() #Check if there are duplicates at all
    df.duplicated().sum() #How many duplicate rows are there
    df[df.duplicated()==True] #Show duplicate rows


    Checking for missing values (NaN or null values) starts with extracting the columns that contain any null values. After that, the missing value count and percentage are shown for those columns. The results are visualized in a seaborn heatmap.

    dfMissing = df.isnull()  #bool for each value - True for missing value
    colsMissing = dfMissing.any() #are there any missing values
    colsMissing #show if there are for each column
    
    countMissing=df[df.columns[colsMissing]].isnull().sum() #show only columns with missing values
    countMissing #show missing value count
    countMissing * 100 / len(df) #show missing value percentage
    
    ax=sns.heatmap(dfMissing) #visualize missing values with seaborn heatmap
    ax.invert_yaxis() #show values on y-axis in bottom-up direction
    plt.title('Missing values')


    0.3-Data cleaning


    Within the data cleaning process, duplicates and missing values will be removed, using the information obtained in the previous Data assessment section to control the process. After dropping duplicates and missing values, the DataFrame will be re-indexed. Removing duplicates is the first step, using the drop_duplicates() method on the DataFrame.

    df.shape # check data shape before starting cleaning
    df.drop_duplicates(inplace=True) #removing duplicates
    df.duplicated().any() #double check for dupes after dropping
    df.shape # duplicates removed, lower row count


    Rows that contain any missing value are removed using the dropna() method on the DataFrame.

    df.info() # check for missing value count before dropping them
    df.dropna(inplace=True)
    df.info() # check for missing value count after dropping them
    df.isnull().any() # double check if there are any more missing values


    After deleting NULLs and duplicates, it is recommended to reset the index. Setting drop=True discards the old index instead of inserting it as a column, while inplace=True ensures the changes are applied directly to the DataFrame without creating a new one.

    df.reset_index(inplace=True, drop=True)


    0.4-Data transformation


    Based on the range of values stored, the kind of values and common sense, the current column data types will be assessed and the optimal data type determined for each particular column.

    Datetime Series conversion


    Although there are 2 columns with date information - [Year] and [Released] - neither of them has the datetime data type. We need 1 column of datetime type as the single source of truth for the movie date. The column [Released] is considered to contain the relevant date information; nevertheless, the column [Year] is easier to convert to datetime. To be on the safe side, the data consistency between the 2 columns has to be checked, and before the comparison both columns have to be converted to datetime. In order to convert the integer [Year] column to datetime, the format has to be specified, otherwise the conversion will not work properly.

    df.insert(4,'YearDt','') #insert helper column for conversion testing
    
    df['YearDt']=pd.to_datetime(df['Year']) #conversion with no format specified
    df[['Year','YearDt']] #bad outcome
    
    df['YearDt']=pd.to_datetime(df['Year'], format='%Y') #conversion with format specified
    df[['Year','YearDt']] #good outcome
    
    df.drop(columns='Year', inplace=True) #remove original integer column
    df.rename(columns={'YearDt': 'Year' }, inplace=True) #rename helper column
    df #show actual DataFrame
    df.info() #check data types


    Conversion of the [Released] column to datetime cannot be done directly, since the column also contains a country name. First the column has to be split on the opening parenthesis to extract the string portion that is a clean date; after that the column can be converted to datetime. The argument expand=True in the split() function means the result is returned as a DataFrame with one column per substring; otherwise (with expand=False) all substrings stay within a single Series as items of a list. After splitting, the year in both columns can be compared to check data consistency.

    df[['Released']] #original column
    dfSplit=df['Released'].str.split('(', expand=False) #split items to list in one Series 
    dfSplit=df['Released'].str.split('(', expand=True) #split items to more Series in DataFrame
    
    df.insert(5,'ReleasedDt','') #insert helper column for conversion testing
    df['ReleasedDt']=pd.to_datetime(dfSplit[0]) #conversion the date part to helper column
    df[['Released','ReleasedDt']] #check if conversion was successful
    
    df.drop(columns='Released', inplace=True) #remove original column
    df.rename(columns={'ReleasedDt': 'Released' }, inplace=True) #rename helper column
    
    df #show actual DataFrame
    df.info() #check data types

    After applying the split method with expand=True, the conversion is done on the first Series of the new DataFrame dfSplit, and the results are acceptable:


    The next step is a data consistency check, i.e. extracting the year from the dates in both columns and comparing them:

    df[df['Year'].dt.year != df['Released'].dt.year] #select rows with inconsistent data

    There are 703 rows with inconsistent dates. The [Released] column contains the complete date including month and day, so the [Year] column will be dropped and the single source of truth for the movie date remains the [Released] column.

    df.drop(columns='Year', inplace=True)


    String Series conversion


    There are columns containing string data whose data type is object after the import. Those columns will be converted to the string data type.

    lsStringColumns =['Title', 'Rating', 'Genre', 'Director', 'Writer', 'Star', 'Country', 'Company'] #string column list
    df[lsStringColumns] #show string columns before conversion
    
    for col in lsStringColumns:
        df[col] = df[col].astype('string') #convert each column to string type
    
    df[lsStringColumns] #show string columns after conversion
    df.info() #check data types after conversion


    Numerical Series conversion


    There are a few numerical columns that will be converted to a proper data type depending on their value range and the kind of information they keep. Using the aggregate functions min and max, the value range is shown for each numerical column:

    lsNumColumns =['Score', 'Votes', 'Budget', 'Gross', 'Runtime'] #numerical column list
    dfNum = df[lsNumColumns].copy() # get numerical columns only
    pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x)) # format float as 1,000.00
    dfNum['Budget']=dfNum['Budget'].apply(lambda x: '{:,}'.format(x)) # format integer as 1,000
    dfNum # show numerical columns
    
    dfRange=df[lsNumColumns].agg(['min','max']) # get value range for each numerical column
    dfRange['Budget']=dfRange['Budget'].apply(lambda x: '{:,}'.format(x)) # format integer as 1,000
    dfRange # show column ranges

    Taking data statistics into account, the following types would be the best fit for each column:

  • [Score] has small float values, it will be set to float16
  • [Votes] is integer number (not float), considering range it will be set to int32
  • [Budget] is integer number already, considering range (max. is less than 2 Billion) it will be set to int32
  • [Gross] is large integer, no need for decimals, considering range (max. greater than 2 Billion) it will be set to int64
  • [Runtime] is small integer (not float), it will be set to int16

    df['Score'] = df['Score'].astype('float16')
    df['Votes'] = df['Votes'].astype('int32')
    df['Budget'] = df['Budget'].astype('int32')
    df['Gross'] = df['Gross'].astype('int64')
    df['Runtime'] = df['Runtime'].astype('int16')
    df.info()
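
    The chosen types can be sanity-checked against their numeric limits using numpy (a small sketch; numpy is available as a pandas dependency, and the comments map the limits to the columns discussed above):

    import numpy as np
    print(np.iinfo(np.int16))   # -32,768 .. 32,767 - enough for [Runtime]
    print(np.iinfo(np.int32))   # about +/- 2.1 billion - enough for [Votes] and [Budget]
    print(np.iinfo(np.int64))   # needed for [Gross], whose maximum exceeds 2 billion
    print(np.finfo(np.float16)) # roughly 3 significant digits - enough for Score values like 8.7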


    Double check values and ranges after data type conversion:

    dfNum = df[lsNumColumns].copy() #get numeric columns only
    dfNum['Budget']=dfNum['Budget'].apply(lambda x: '{:,}'.format(x)) #format integer as 1,000
    dfNum # show numerical columns
    
    dfRange=df[lsNumColumns].agg(['min','max']) #get value range for each numerical column
    dfRange['Budget']=dfRange['Budget'].apply(lambda x: '{:,}'.format(x)) #format integer as 1,000
    dfRange['Votes']=dfRange['Votes'].apply(lambda x: '{:,}'.format(x)) #format integer as 1,000
    dfRange['Gross']=dfRange['Gross'].apply(lambda x: '{:,}'.format(x)) #format integer as 1,000
    dfRange #show column ranges


    1-Data binning - movie score categorization


    The first step in binning the 'Score' Series is to check the value range and define the expected results, i.e. the thresholds for each score category. Let's look at the column statistics:

    df['Score'].agg(['min','max', 'mean']) # get some statistics  for Score column

    Distribution of Score column values can be visualized as follows:

    def PlotNumDistribution(df1, seriesName, title): # accepts the DataFrame so it can be reused on filtered copies
        plt.figure(figsize=(10,4))
        plt.title(title)
        sns.set_style('darkgrid')
        df1[seriesName].plot(kind='hist')
        plt.xlabel(seriesName)
    
    PlotNumDistribution(df, 'Score','Score distribution')

    As the required binning is defined with percentages, it is necessary to calculate the actual values that will be used as the binning criteria, i.e. the lower and upper limits of the category intervals.

    pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x)) # format float as 1,000.00
    binLabels=['Poor', 'Average', 'Good', 'Very good', 'Excellent']
    binLimits=[0.45, 0.55, 0.65, 0.75]
    
    # extending bin limits to start with 0 and end with 1
    def ExtendLimits(lsLimits):
        lsExtended =[0]
        lsExtended.extend(lsLimits)
        lsExtended.extend([1])
        return lsExtended
    
    # mapping bin limits to actual values in numerical Series
    def MapLimits(lsLimits, low, high):
        lsExt = ExtendLimits(lsLimits) #add 0 to beginning and 1 to end    
        lsMapped =[]
        lenList = len(lsExt) # limit list length
        diff=high-low
        for i in range(lenList):       
            limit11 = low + lsExt[i] * diff
            lsMapped.append(limit11)        
        return lsMapped
    
    #function fills dictionary with label as key and range tuple as value
    def CalcLimits(dict1, binLab, binLim, SeriesNum):        
        # get range from DataFrame numerical Series passed 
        low = df[SeriesNum].min()
        high = df[SeriesNum].max()    
        # mapping bin limits to actual data 
        binLimLocal = MapLimits(binLim, low, high)        
        lenList = len(binLab) # label list length
        for i in range(lenList):       
            low1 = binLimLocal[i] 
            high1 = binLimLocal[i+1]
            dict1[binLab[i]] = (low1, high1) # dict key is label, value is tuple(low,high)           
        return
    
    # Formatting string from passed dict (label as key, range tuple as value) 
    def FormatDict(dict1):
        str=''
        for key, value in dict1.items():
            str += key + ': '
            str += '-'.join(format(f, '.2f') for f in value)
            str += '\n'
        return str
    
    # Calculate limit for each category interval
    dictLimits = {}
    CalcLimits(dictLimits, binLabels, binLimits, 'Score')
    
    # Print required categorization
    strRanges = FormatDict(dictLimits)
    print('Required binning of Score Series to categories:')
    print(strRanges)


    Now the exact intervals for each category are clearly defined. The first attempt will use the qcut method, since percentages are defined that can be passed to it (using the function that extends the initial list of limits with a leading 0 and a trailing 1).

    # The easiest way would be using qcut
    lsLimits = ExtendLimits(binLimits) #adding leading 0 and trailing 1 to the list
    df['ScoreCat']=pd.qcut(df['Score'], lsLimits, labels=binLabels)
    
    # Let's see the results
    #function fills dictionary from DataFrame with Category as key and range tuple as value
    def GetCategoryRange(dict1, SeriesCat, SeriesNum):
        for label in binLabels:        
            low=df[df[SeriesCat] == label][SeriesNum].min()
            high=df[df[SeriesCat] == label][SeriesNum].max()
            dict1[label] = (low, high)
        return 
    
    dict1={}
    GetCategoryRange(dict1, 'ScoreCat', 'Score')    
    # Print accomplished categorization
    strRanges = FormatDict(dict1)    
    print('Using qcut - Accomplished binning of Score Series to categories:')
    print(strRanges) # bad outcome

    It seems the results are not as expected. The pandas documentation describes qcut as a “Quantile-based discretization function.” This basically means that qcut tries to divide the underlying data into equal-sized bins: it defines the bins using percentiles based on the distribution of the data, not on the actual numeric edges of the bins. The distribution of the new category column can be visualized:

    # Show category distribution
    def PlotCatDistribution(seriesName, orderX, title):
        plt.figure(figsize=(10,4))
        plt.title(title)
        df[seriesName].value_counts().loc[orderX].plot.bar(width=0.9) #using loc to define x-axis order
        plt.xticks(rotation=0)
    
    PlotCatDistribution('ScoreCat', binLabels, 'Score category distribution - using qcut method')
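
    The difference between the two methods is easy to see on a small toy Series (a minimal sketch with made-up values, independent of the movie data): cut splits on fixed numeric edges, while qcut chooses edges so that each bin gets roughly the same number of rows.

    s = pd.Series([1, 2, 3, 4, 5, 6, 50, 100]) # small skewed sample
    print(pd.cut(s, bins=2).value_counts())    # equal-width edges (split near 50.5): counts 7 and 1
    print(pd.qcut(s, q=2).value_counts())      # median-based edge (4.5): counts 4 and 4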
    


    Finally, the expected categorization will be accomplished using the cut method.

    # The better choice will be using cut method
    low=df['Score'].min()
    high=df['Score'].max()
    lsLimits= MapLimits(binLimits,low,high) # extending and mapping limits to actual values
    df['ScoreCat']=pd.cut(df['Score'], lsLimits, labels=binLabels, include_lowest=True)
    
    # Let's see the results 
    dict1 ={}
    GetCategoryRange(dict1, 'ScoreCat', 'Score')  
    # Print accomplished categorization
    strRanges = FormatDict(dict1)
    print('Using cut - Accomplished binning of Score Series to categories:')
    print(strRanges) #good outcome
    # Plot accomplished categorization
    PlotCatDistribution('ScoreCat', binLabels, 'Score category distribution - using cut method')

    Nevertheless, compared to the required binning limits, there seem to be some gaps. This is probably because there are no Score values that fall within those intervals. To be on the safe side, it can easily be checked:

    df['Score'].between(5.2,5.3, inclusive='neither').any() or df['Score'].between(5.9,6.0, inclusive='neither').any() or \
    df['Score'].between(6.7,6.8, inclusive='neither').any() or df['Score'].between(7.4,7.5, inclusive='neither').any()

    The final categorization of the Score column, visualized in a pie chart:

    def PlotPieChart(seriesValues, lsBinLabels, sTitle):
          nCatCount=[]
          for label in lsBinLabels:
              nCatCount.append(len(df[df[seriesValues] == label]))
      
          lsLabels = []
          lenList = len(lsBinLabels) 
          for i in range(lenList):       
              lsLabels.append(lsBinLabels[i] + '(' + str(nCatCount[i]) + ')')
      
          fig, ax = plt.subplots()
          ax.pie(nCatCount, labels=lsLabels, autopct='%1.1f%%')
          plt.title(sTitle)
      
    PlotPieChart('ScoreCat', binLabels, 'Movie Score categorization' )



    2-Feature engineering - extending data model with decade, ROI and Profit


    Get the top 10 companies by number of movies made for each decade and the top 5 companies overall.

    In order to meet the requirements of the next business questions, 3 new columns should be created: Decade, ROI and Profit. To create the Decade category, the bins passed to the cut method have to be of the same type as the base column - here the [Released] column of type datetime:

    binLabelsD=["1980's", "1990's","2000's", "2010-2020"]
    binsD=[1980, 1990, 2000, 2010, 2021]
    # Convert bins to datetime, the same type as the column being binned
    binsDt =[]
    for bin in binsD:
        binsDt.append(pd.to_datetime(bin, format='%Y'))
    df['Decade']=pd.cut(df['Released'], bins=binsDt, labels=binLabelsD)
    # Call wrapper function to plot the pie chart
    PlotPieChart('Decade', binLabelsD, 'Released year categorized by decade' )


    Two more columns are created and checked in a formatted view:

    df['Profit']=df['Gross']-df['Budget']
    df['ROI']=(df['Gross']-df['Budget'])/df['Budget']
    
    def FormatIntegerSeries(df1, seriesName):
        df1[seriesName]=df1[seriesName].apply(lambda x: '{:,}'.format(x)) #format integer as 1,000    
    pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x)) # format float as 1,000.00
    
    dfFormatted = df[['Title', 'Released', 'Decade', 'Budget', 'Gross', 'Profit', 'ROI']].copy()
    FormatIntegerSeries(dfFormatted, 'Budget')
    FormatIntegerSeries(dfFormatted, 'Gross')
    FormatIntegerSeries(dfFormatted, 'Profit')
    dfFormatted

    The business question about the top 10 companies per decade will be answered using value_counts() and a barplot, as will the question about the top 5 companies overall.

    def GetTopCompaniesForDecade(sDecade):    
        dfDecade = df[df['Decade'] == sDecade] # filter by Decade passed
        seriesValueCounts =dfDecade['Company'].value_counts().head(10) # get top 10 of value_counts()
        dict1 = {'Company': seriesValueCounts.keys(), 'Movies': seriesValueCounts}
        df2=pd.DataFrame(dict1)
        df2=df2.set_index('Company')    
        return df2
    
    def ShowTopCompaniesForDecade(dframe,sDecade):
        plt.figure(figsize=(6,3))
        sns.barplot(x='Movies', y=dframe.index, data=dframe)
        plt.title('Top 10 companies for ' + sDecade)    
        return
    
    for decade in binLabelsD: # For each Decade prepare data and plot
        df2 = GetTopCompaniesForDecade(decade)
        ShowTopCompaniesForDecade( df2, decade)   








    Top 5 companies overall will be found the same way but without Decade filter:

    seriesValueCounts =df['Company'].value_counts().head(5) # get top 5 of value_counts()
    dict1 = {'Company': seriesValueCounts.keys(), 'Movies': seriesValueCounts}
    df2=pd.DataFrame(dict1)
    df2=df2.set_index('Company')
    plt.figure(figsize=(6,3))
    sns.barplot(x='Movies', y=df2.index, data=df2)
    plt.title('Top 5 companies 1980-2020')


    3-Univariate analysis - movie Runtime and Votes Analysis & Removing Outliers


    Get the top 10 highest rated movies and visualize the insights, showing the director in the legend. Find the top 10 most productive directors overall (by movies directed).

    The movie [Runtime] distribution shows a central tendency towards 90-100 minute movies.

    PlotNumDistribution(df, 'Runtime','Runtime column distribution')
    df[['Runtime']].describe()


    Removing outliers will enable more detailed insight. First let's check the outlier count/percentage

    #check how many outliers are there with Runtime lower than 84 and higher than 150 minutes
    nOutliersH = len(df[df['Runtime'] > 150])
    nPercentH = nOutliersH / len(df) * 100
    nOutliersL = len(df[df['Runtime'] < 84])
    nPercentL = nOutliersL / len(df) * 100
    print("Outliers with Runtime < 84 minutes: {:.0f} ({:.2f}%)\n\
    Outliers with Runtime > 150 minutes: {:.0f} ({:.2f}%)".format(nOutliersL,nPercentL,nOutliersH,nPercentH))


    Since they make up about 5% in total, they can be removed without significantly affecting the data quality.

    # Remove outliers
    df2 = df[df['Runtime'].between(84,150)]
    PlotNumDistribution(df2, 'Runtime','Runtime column distribution (removed outliers)')
    df2[['Runtime']].describe()
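
    The 84 and 150 minute thresholds are read off the histogram by eye. An equivalent quantile-based alternative (a sketch - the exact percentiles here are an assumption) derives the cut-offs from the data itself, keeping the central ~95% of values:

    low, high = df['Runtime'].quantile([0.025, 0.975]) # data-driven thresholds instead of hard-coded ones
    df2 = df[df['Runtime'].between(low, high)]         # keep the central ~95% of rows
    print(low, high, len(df2))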


    The distribution of [Votes] shows a central tendency around 50,000, but since there are outliers with much higher values it's difficult to tell before removing them.

    PlotNumDistribution(df, 'Votes','Votes column distribution')
    df[['Votes']].describe()


    Removing outliers will enable more detailed insight. First let's check the outlier count/percentage

    # Check how many  outliers over 450000
    nOutliersH = len(df[df['Votes'] > 450000])
    nPercentH = nOutliersH / len(df) * 100
    print("Outliers with Votes > 450000 {:.0f} ({:.2f}%)".format(nOutliersH,nPercentH))


    Since they make up almost 5% in total, they can be removed using the quantile method, passing the percentage of values we want to keep (95% in this case), without significantly affecting the data quality. After the outliers are removed, the central tendency can be estimated better - it is around 20,000.

    # Remove outliers for better distribution insight
    q=df['Votes'].quantile(0.95) 
    df95=df[df['Votes'] < q]
    PlotNumDistribution(df95, 'Votes','Votes column distribution (removed outliers)')
    df95[['Votes']].describe()


    The next business question to answer is about the top movies and directors.

    colsToShow =['Title','Score','Director','Released','Writer','Company', 'Votes', 'Runtime']
    top10_rated=df.nlargest(10,'Score')[colsToShow].set_index('Title')
    top10_rated


    Finally, the results are visualized:

    sns.barplot(x='Score', y=top10_rated.index, data=top10_rated,hue='Director', dodge=False)
    sns.set_style('darkgrid')
    plt.title('Top 10 highest rated movies')
    plt.legend(bbox_to_anchor=(1.05,1),loc=2)


    The most productive directors overall:

    seriesValueCounts=df.groupby('Director')['Title'].count().sort_values(ascending=False).head(10)
    dict1 = {'Director': seriesValueCounts.keys(), 'Movies made': seriesValueCounts}
    df2=pd.DataFrame(dict1)
    df2=df2.set_index('Director')
    #prepare data for bar plot
    Dirs = list(seriesValueCounts.keys())
    Titles = list(seriesValueCounts)
    fig = plt.figure(figsize = (9.5, 4))
    # creating the bar plot
    plt.bar(Dirs, Titles, color ='darkgreen', width = 0.8)
    plt.xlabel("Director")
    plt.ylabel("Movies made 1980-2020")
    plt.title("Top 10 the most productive Directors")
    plt.xticks(rotation=45)
    plt.show()



    4-Bivariate analysis - correlation of Score vs. Profit


    Find the overall correlation between movie score and net profit. Net profit is the difference between gross revenue and budget.

    A scatter graph and a heatmap are used to visualize the correlation between the 2 variables:

    fig = plt.figure(figsize = (5, 4))
    #plot scatter graph
    sns.set_style("darkgrid")
    sns.regplot(x='Score',y=df['Profit']/1000000, data=df, scatter_kws={"color": "red"}, line_kws={"color": "blue"})
    plt.xlabel('Movie score')
    plt.ylabel('Net profit (Mio. $)')
    plt.title('Correlation of Score vs. Profit')
    # plot heatmap
    sns.heatmap(df[['Score','Profit','ROI','Votes','Runtime']].corr(), annot=True)
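
    If only the single Score-Profit coefficient is needed (the overall factor quoted in the insights) rather than the full matrix, it can be read directly with Series.corr, for example:

    print(round(df['Score'].corr(df['Profit']), 2)) # Pearson correlation between the two Series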


    Since there are outliers on the Profit axis, removing 5% of them was expected to lead to a higher correlation. However, the assumption was wrong - the correlation even decreased!

    # Will removing 5% outliers increase correlation?
    q=df['Profit'].quantile(0.95) 
    df2=df[df['Profit'] < q]
    fig = plt.figure(figsize = (5, 4))
    sns.set_style("darkgrid")
    #plot scatter graph
    sns.regplot(x='Score',y=df2['Profit']/1000000, data=df2, scatter_kws={"color": "red"}, line_kws={"color": "blue"})
    plt.xlabel('Movie score')
    plt.ylabel('Net profit (Mio. $)')
    plt.title('Correlation of Score vs. Profit (removed outliers)')
    # plot heatmap
    sns.heatmap(df2[['Score','Profit','ROI','Votes','Runtime']].corr(), annot=True)



    5-Multivariate analysis - over each decade find correlation of score vs. ROI


    Find the correlation between movie score and return on investment (ROI) for each decade. ROI is the ratio between net profit and budget.

    Scatter plots show a very small correlation between ROI and Score for each of the 4 decades.

    g=sns.FacetGrid(data=df, col='Decade', col_wrap=4)
    g.map(plt.scatter, 'Score', 'ROI', alpha=0.5)
    g.fig.suptitle('ROI vs. Score')


    The correlation matrix shows the correlation factor:

    df.corr(numeric_only=True)
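
    df.corr() gives the overall factors across all decades; since the question asks for each decade, the same Score-ROI coefficient can also be computed per group, for example (a short sketch):

    # Score vs. ROI correlation computed separately for each decade
    for decade, grp in df.groupby('Decade', observed=True):
        print(decade, round(grp['Score'].corr(grp['ROI']), 2))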


    Removing only 3% of outliers on the ROI axis significantly increased the correlation!

    # Will removing of 3% outliers change anything?
    q=df['ROI'].quantile(0.97)
    df2=df[df['ROI'] < q]
    g=sns.FacetGrid(data=df2, col='Decade', col_wrap=4)
    g.map(plt.scatter, 'Score', 'ROI', alpha=0.5)
    g.fig.suptitle('ROI vs. Score \n(no outliers)')


    The increased correlation factor can be obtained from the correlation matrix as well:

    df2.corr(numeric_only=True) # removing 3% outliers increased the correlation factor significantly!


    About the author

    Barry The Analyst worked with a big company in Ireland from May 2022 until July 2023, and since July 2023 has been working with a big company in Germany. This blog started as a private collection of notes meant to (re)develop, refresh and strengthen essential skills in the Data Analysis field. The content was produced along that learning path, building on a degree in Electrical Engineering and experience in C++/SQL. The private notes turned into a public blog to share the knowledge and support anyone who finds it interesting...

     

    Who is this for?

    This is not yet another basic course explaining elementary concepts. Many basic features of the tools and technologies are skipped in the blog articles and portfolio projects presented here. If you are not an absolute beginner in SQL/databases, Excel, Python and Power BI, this might be useful material for advancing your career as a Data Analyst and/or Python Developer...
