Python Web Scraping of German Companies in the Fortune Global 500



RESUME - Project overview


Jump to: SOLUTION - Project implementation

Problem Statement

Skills & Concepts demonstrated

Questions

Insights

The data source is a page on Wikipedia with a table of the top 30 largest German companies, together with their ranks in the Fortune Global 500 list. The Fortune Global 500, also known as the Global 500, is an annual ranking of the top 500 corporations worldwide as measured by revenue; the list is compiled and published annually by Fortune magazine. The data will be loaded using a web scraping method, and after loading it will be analyzed to obtain useful insights.

Problem Statement


As a Data Analyst searching for a job in Germany, I am interested in some details about the top German companies. The information is available on the Wikipedia page, but I need some useful insights, such as the most profitable industries, the most successful companies by profit per employee, or the cities with the most company headquarters. I would like to have these insights visualized, but on Wikipedia there is only a table and some text, so I need a technique to load the Wiki table into a data model that can be processed.


Skills & Concepts demonstrated


The following Python and other features are used in this project:

  • Using libraries pandas, matplotlib and seaborn
  • Using bs4 library - BeautifulSoup constructor, find() and find_all() methods, element.Tag attributes
  • Using class_ argument of BeautifulSoup.find_all() method
  • Using requests library - method get() and attribute Response.text
  • Web page inspection for HTML tags <table>, <tr>, <th>, <td>
  • Plot visuals using Series.plot.bar(), seaborn.barplot(), matplotlib.pie(), seaborn.heatmap(), seaborn.regplot()
  • DataFrame construction from the column list and appending rows using loc[]
  • Convert string Series with negative numbers to numerical type
  • Function isdigit() for numerical character check
  • List comprehension
  • String functions strip(), replace()
  • Exporting DataFrame to CSV file using DataFrame.to_csv() method
  • Extracting largest values from DataFrame using nlargest() method
  • Rename DataFrame index using rename_axis() method
  • Data aggregation using groupby() and sum() methods
    The following are the insights I would like to get about the top 30 German companies:


    Questions


    1-Which are the top 5 industries with the highest Revenue?

    2-Find top 10 companies by profit per employee and visualize results along with company industries

    3-Which cities are headquarters for 2 or more companies?

    4-Categorize companies by Employee count (thresholds of 50,000, 100,000, 200,000 and 400,000) and find the 10 largest

    5-Find correlation between company size and Fortune 500 Rank

    Insights


    1. Among the top 30 largest German companies listed in the Fortune Global 500, the industry with the highest cumulative revenue in billions of USD is the Automotive industry (655), followed by Electric utility (509), Insurance (261), Automotive parts (179) and Transportation (159).



    2. The highest profit per employee in thousands of USD belongs to the Electric utility company RWE (156). The 2nd best is BMW (126), followed by Mercedes-Benz Group (90) and Munich Re (87). In 5th place is the Electric utility company EnBW (72), followed by Deutsche Bank (67) and Talanx (52). In 8th place is Allianz (45), followed by Bayer (43) and Deutsche Telekom (41).


    3. The city with the most listed company headquarters is Munich (4), followed by Essen (3), Hanover (2), Hamburg (2), Stuttgart (2), Bonn (2) and Duesseldorf (2).


    4. Most companies (9) have 100,000-200,000 employees, followed by the companies (7) with fewer than 50,000 employees. In 3rd place are the companies (6) with 50,000-100,000 employees, followed by the companies (4) with 200,000-400,000 employees and the companies (4) with more than 400,000 employees.

    The largest company by employee count (675,805) is Volkswagen, followed by Deutsche Post (554,975), Robert Bosch GmbH (421,338), Edeka Zentrale (408,900) and Deutsche Bahn (324,136). In 6th place is Siemens (311,000), followed by Fresenius (282,024), Deutsche Telekom (206,759), Continental AG (199,038) and Mercedes-Benz Group (168,797).


    5. The reasonable assumption was that there is a high positive correlation between company size and Fortune Global 500 rank. However, it turns out that the correlation is positive, but not as high as expected:
     



    SOLUTION - Project implementation


    Back to: RESUME - Project overview

    0.1-Inspecting Data source

    0.2-Importing libraries and loading HTML page

    0.3-Extracting table and loading from HTML to DataFrame

    0.4-Data Transformation

    0.5-Exporting Data to CSV file

    1-Top 5 industries with the highest Revenue

    2-Top 10 companies by profit per employee and their industries

    3-Which cities are headquarters for 2 or more companies

    4-Categorize companies by Employee count and find the 10 largest

    5-Find correlation between company size and Fortune 500 Rank


    Before processing the business questions, the web page has to be loaded into a data model using web scraping in Python.
    A Jupyter Notebook script is available on the GitHub repository.


    0.1-Inspecting Data source


    The dataset is a table on a Wikipedia page that also contains a Fortune Global 500 Rank column.



    Using the Inspect option in the browser's right-click menu opens an HTML view of the page on the right side. Especially useful for inspecting the web page is the option shown below with number 2, as it highlights the corresponding position within the HTML source (on the right side) when a particular element (a table or anything else) is selected on the web page (on the left side).



    The web page inspection yields the following insights (confirmed programmatically in the sketch after the list):

  • The required table has 30 rows + header and 8 columns
  • There is one table before the required one; the table we need is in the 2nd position
  • Column no. 2 differs between the tables: the one we need contains "Fortune 500 Rank"
  • Table class is 'wikitable sortable'
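
    The same insights can be double-checked programmatically. Below is a minimal sketch (using the same URL and libraries as in the next step) that counts the tables on the page and checks the class and row count of the one we need:

    from bs4 import BeautifulSoup
    import requests

    html = requests.get('https://en.wikipedia.org/wiki/List_of_largest_German_companies').text
    tables = BeautifulSoup(html, 'html.parser').find_all('table')
    print(len(tables)) # total number of tables on the page
    print(tables[1].get('class')) # expected: ['wikitable', 'sortable']
    print(len(tables[1].find_all('tr'))) # expected: 31 rows (header + 30 data rows)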


    0.2-Importing libraries and loading HTML page


    Beautiful Soup is a Python library for scraping data from websites. Requests is an HTTP client library for the Python programming language. We need to import both of them for this project, as well as the pandas library for the phase when the DataFrame comes into use. Loading the HTML page is done with the requests.get() method, which takes a URL as a parameter and returns a Response object. In this case, the return code is 200, the standard HTTP success status response code indicating that the request has succeeded.

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    url='https://en.wikipedia.org/wiki/List_of_largest_German_companies'
    page=requests.get(url)
    type(page), page


    The text attribute contains the content of the returned object, which is not prettily formatted HTML; it will look better after using Beautiful Soup.

    page.text


    The BeautifulSoup constructor takes the HTML text of the page (page.text) and a parser specification as arguments. A quick check shows that a BeautifulSoup object has been created.

    soup=BeautifulSoup(page.text, 'html.parser')
    type(soup)


    Printing the content of the BeautifulSoup object shows slightly better HTML formatting, although it is still far from human-readable.

    soup


    Using the text attribute of the element.Tag class, the title of the loaded HTML page can be extracted:

    type(soup.title), soup.title.text 



    0.3-Extracting table and loading from HTML to DataFrame


    For searching within the HTML there are 2 methods: find() returns the first object with the specified tag, and find_all() returns all objects with the specified tag as a list (accessible by list indexing).

  • Table in HTML is defined with <table> tag.
  • Table row within HTML table is defined with <tr> tag.
  • Table header within HTML row is defined with <th> tag.
  • Table data row within HTML row is defined with <td> tag.
    The result of the find() method shows the first table on the Wiki page, which is not the one that we need:

    soup.find('table') # returns the first table only, but we need the second table


    The table we need is accessed as the second element of the returned list:

    table = soup.find_all('table')[1] # returns 2nd list element - that's the table we need!
    table


    In case there are multiple table classes in the HTML page, a particular table class can be specified explicitly using the class_ parameter:

    table == soup.find_all('table', class_='wikitable sortable')[1] # filter by table class and compare with the table found above



    Once the table is found, loading it into a DataFrame is done in 2 steps: loading the table headers, which will define the DataFrame columns, and after that loading the table data rows and appending them to the DataFrame.


    Loading table headers


    To load the table headers, the find_all() method is used, which returns a list:

    #get table headers
    table_headers = table.find_all('th')
    print(table_headers)


    Using a list comprehension and the text attribute, the HTML tags are removed:

    table_titles = [title.text for title in table_headers] # cleaning from HTML tags 
    print(table_titles)


    Using the string method strip(), the newline characters are removed:

    table_titles = [title.text.strip() for title in table_headers] # cleaning from newline escape sequences 
    print(table_titles)


    The final clean list is loaded into the DataFrame as the column names:

    df= pd.DataFrame(columns=table_titles) #loading column headers
    df



    Loading table data rows


    Each table data row is built out of the pieces of data found within that particular table row, using the find_all() method:

    table_rows=table.find_all('tr')
    for row in table_rows:
        data_rows = row.find_all('td')
        table_data_row=[title.text.strip() for title in data_rows]
        print(table_data_row)


    Since the table header is also found as a row, looping through the list and loading data into the DataFrame starts from the 2nd list element (skipping the header):

    # insert the data rows into DataFrame
    for row in table_rows[1:]: # skip the first row that is header
        data_rows = row.find_all('td')
        table_data_row=[title.text.strip() for title in data_rows]
        length=len(df) # new row's index in the DataFrame will be equal to DataFrame row count before adding new 
        df.loc[length]=table_data_row
    df

    Now the table from the Wikipedia page is loaded into the DataFrame object!


    0.4-Data Transformation


    Although this is a very small dataset, it's good practice to check for NULL values and duplicates. Since there aren't any, there's no need for data cleaning either.

    df.isnull().sum().any(), df.duplicated().any()


    After successfully loading the data into the DataFrame, all the columns are of the 'object' type by default. There are 2 types of columns in this table, numerical (integer) and string, and all columns will be converted accordingly.

    df.info()



    Numeric column type conversion


    The first try at numeric column conversion failed!

    # Conversion to integer - Try No. 1
    lsNumCols = ['Rank', 'Fortune 500rank', 'Revenue(USD millions)', 'Profits(USD millions)', 'Employees']
    for column in lsNumCols:
        df[column] = df[column].astype('int32') # fails with ValueError on values that contain commas


    The cause of the error is the commas within the string numeric data.
    However, after removing the commas the conversion failed again! What's wrong?

    # Conversion to integer - Try No. 2 - Remove commas from string numbers
    for column in lsNumCols:
        df[column] = df[column].str.replace(',','')
        df[column] = df[column].astype('int32') 


    The reason for this error lies in the negative numbers. If negative numbers are stored with a Unicode minus (code point '\u2212', decimal 8722) or hyphen (code point '\u2010', decimal 8208), the conversion is not possible. The solution is to replace that character with the standard minus sign, and the conversion succeeds!
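
    A quick illustration of the failure mode, with made-up values tried one line at a time (this snippet is not part of the project code):

    int('-123')      # works: ASCII hyphen-minus (U+002D)
    int('1,234')     # raises ValueError: int() does not accept thousands separators
    int('\u2212123') # raises ValueError: int() does not accept the Unicode minus sign (U+2212)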

    # Conversion to integer - Try No. 3 - Remove commas and ensure a standard minus sign instead of a Unicode hyphen/minus sign
    def MakeProperStrNumSeries(sColumnName):
        df[sColumnName] = df[sColumnName].astype(str) # convert column to string
        df[sColumnName] = df[sColumnName].str.replace(',','') # remove commas to be able to convert to numbers
        lsItems = df[sColumnName].tolist() # convert Series to list
        lsNum = [] # list of numerical values to append after conversion

        for item in lsItems:
            item = item.strip() # remove leading and trailing blanks if there are any
            if not item[0].isdigit(): # if the first character is not a digit, it has to be a minus sign
                item = '-' + item[1:] # replace the Unicode minus/hyphen sign with the standard minus sign
            lsNum.append(int(item)) # validate and store the integer value

        df[sColumnName] = pd.Series(lsNum, index=df.index) # align with the DataFrame index
    
    # Convert string column to numeric column properly
    def StrColumnToNumber(sColumnName):
        df['colHelper'] = df[sColumnName] # helper column 
            
        # For negative numbers, ensure there is standard minus sign
        MakeProperStrNumSeries('colHelper')     
        
        df['colHelper'] = df['colHelper'].astype('int32') # convert to integer
        df[sColumnName] = df['colHelper']
        df.drop(columns='colHelper', inplace=True) #remove helper column
    
    # call the function in the column list comprehension
    [StrColumnToNumber(column) for column in lsNumCols]
    df.info()


    String column type conversion


    The string column conversion is done successfully.

    #Convert string columns
    lsStrCols = ['Name', 'Industry', 'Headquarters']
    for column in lsStrCols:
        df[column] = df[column].astype('string')



    0.5-Exporting Data to CSV file


    The DataFrame is exported to a CSV file using the to_csv() method. The index parameter is set to False since there's no need for an additional index column.

    df.to_csv('C:/Users/User/OneDrive/Documents/BLOG/Portfolio/webScrapeExport.csv', encoding='utf-8', index = False)



    1-Top 5 industries with the highest Revenue


    The required data subset is extracted from the DataFrame using the groupby() method and the sum() aggregate function on the Series object. Visualization is done using the Series.plot.bar() method.

    import seaborn as sns
    import matplotlib.pyplot as plt
    seriesSum = df.groupby('Industry')['Revenue(USD millions)']
    seriesSum = seriesSum.sum()/1000 # show in USD Billions
    seriesSum = seriesSum.sort_values(ascending=False).head(5) # sort values Descending and take top 5
    dfTop5 = pd.DataFrame(seriesSum)
    dfTop5.rename(columns={'Revenue(USD millions)': 'Revenue (USD Billions)'}, inplace=True)
    # plot bar chart
    plt.figure(figsize=(11,5))
    sns.set_style('darkgrid')
    seriesSum.plot.bar(color ='orange', width=0.8)
    plt.xlabel("")
    plt.ylabel("Revenue(USD Billions) of all companies in industry")
    plt.title("Top 5 industries with the highest Revenue (USD Billions)")
    plt.xticks(rotation=0)



    2-Top 10 companies by profit per employee and their industries


    The data subset is extracted using the nlargest() method and visualized using the seaborn.barplot() method.

    df['ProfitPerEmployee (USD Thousands)'] = df['Profits(USD millions)'] / df['Employees'] * 1000 # Thousand  USD
    pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x)) # format float as 1,000.00
    colNames=['Name','ProfitPerEmployee (USD Thousands)','Industry', 'Profits(USD millions)','Fortune 500rank', \
              'Employees', 'Headquarters']
    top10_rated=df.nlargest(10,'ProfitPerEmployee (USD Thousands)')[colNames].set_index('Name')
    top10_rated['Employees'] = top10_rated['Employees'].apply(lambda x: '{:,}'.format(x)) # format integer as 1,000
    top10_rated['Profits(USD millions)'] = \
                top10_rated['Profits(USD millions)'].apply(lambda x: '{:,}'.format(x)) # format integer as 1,000
    top10_rated

    plt.figure(figsize=(8,4))
    sns.barplot(x='ProfitPerEmployee (USD Thousands)', y=top10_rated.index, data=top10_rated, hue='Industry', dodge=False) # x must match the full column name
    plt.xlabel("Profit Per Employee (Thousand USD)")
    plt.ylabel("Company")
    plt.title("Top 10 companies with the highest Profit per employee and their industry")
    plt.legend(bbox_to_anchor=(1.02,1),loc=2)



    3-Which cities are headquarters for 2 or more companies


    The required dataset is extracted using the value_counts() method; the result is wrapped in a DataFrame, filtered with a boolean index, and the resulting Series is selected. Visualization is done using the Series.plot.bar() method.

    plt.figure(figsize=(12,3))
    plt.title('Headquarters with 2 or more companies')
          
    df2 = pd.DataFrame(df['Headquarters'].value_counts()) # note: in pandas >= 2.0 the resulting column is named 'count' instead of 'Headquarters'
    series2 = df2[df2['Headquarters'] > 1]['Headquarters'] # boolean index: keep cities with 2 or more companies
    series2.plot.bar(color ='purple', width=0.85) 
    
    plt.xticks(rotation=0)



    4-Categorize companies by Employee count and find the 10 largest


    Data categorization is done using the cut() method. For the categorization by company size, a new column is added and the result is visualized with a pie chart. For the pie chart, a list containing the item count for each category has to be prepared.

    lsLimits= [0,50000,100000,200000,400000,800000]
    lsBinLabels = ['Up to 50,000 Employees', 'From 50,000 to 100,000 Employees', 'From 100,000 to 200,000 Employees',\
                    'From 200,000 to 400,000 Employees', 'More than 400,000 Employees']
    df['Size']=pd.cut(df['Employees'], lsLimits, labels=lsBinLabels, include_lowest=True)
    
    nCatCount=[]
    for label in lsBinLabels:
        nCatCount.append(len(df[df['Size'] == label]))
        
    lsLabels = []
    lenList = len(lsBinLabels) 
    for i in range(lenList):       
        lsLabels.append(lsBinLabels[i] + ' (' + str(nCatCount[i]) + ')')
    
    fig, ax = plt.subplots()
    ax.pie(nCatCount, labels=lsLabels, autopct='%1.1f%%')
    plt.title('Company categories by size (Employee count)')


    The top 10 companies by size are extracted using the nlargest() method, and the index is renamed using the rename_axis() method. The data is visualized with a vertical bar chart by passing the parameter orient='v' to the seaborn.barplot() method:

    top10_emp=df.nlargest(10,'Employees')[['Name','Employees', 'Industry']].set_index('Name')
    top10_emp_show=top10_emp.copy()
    top10_emp_show['Employees'] = top10_emp_show['Employees'].apply(lambda x: '{:,}'.format(x)) # format integer as 1,000
    top10_emp_show.rename_axis('Company', inplace=True) # rename index
    top10_emp_show[['Employees']]
    # plot bar
    plt.figure(figsize=(9,3))
    sns.barplot(y=top10_emp['Employees']/1000, x=top10_emp.index, data=top10_emp, orient='v', hue='Industry', \
                width=0.9, dodge=False)
    plt.ylabel("Employees (in Thousands)")
    plt.xlabel("")
    plt.title("Top 10 largest German companies")
    plt.legend(bbox_to_anchor=(0.7,1),loc=2)
    plt.xticks(rotation=45)



    5-Find correlation between company size and Fortune 500 Rank


    The assumption is that, since the Fortune Global 500 rank list is based on company revenue, there will be a very high correlation between company size and Fortune 500 rank as well. As a lower Fortune 500 rank value in the data model means a higher rank, it makes sense to extend the data model with a Reversed Rank column, to get a better visual insight into the correlation. The correlations between the columns of the DataFrame are found using the corr() method, with the default Pearson methodology.

    df['F500rank Reversed'] = 500 - df['Fortune 500rank'] # extend DataFrame with reversed Rank column
    dfcorr = df.corr(numeric_only=True)


    The assumption was not completely right: the correlation (factor=0.39) is positive, but it still belongs to the low positive correlation category. Here is an interpretation of correlation factors (a small helper sketch follows the list):

  • 0.9 to 1.0 (-0.9 to -1.0) -Very high positive (negative) correlation
  • 0.7 to 0.9 (-0.7 to -0.9) -High positive (negative) correlation
  • 0.5 to 0.7 (-0.5 to -0.7) -Moderate positive (negative) correlation
  • 0.3 to 0.5 (-0.3 to -0.5) -Low positive (negative) correlation
  • 0.0 to 0.3 (0.0 to -0.3) -Negligible correlation
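
    As a small illustration (my own helper, not part of the project code), the categories above can be expressed as a function:

    def corr_category(r):
        a = abs(r) # the sign only determines positive vs. negative correlation
        if a >= 0.9: return 'Very high'
        if a >= 0.7: return 'High'
        if a >= 0.5: return 'Moderate'
        if a >= 0.3: return 'Low'
        return 'Negligible'

    corr_category(0.39) # -> 'Low'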

    Using a seaborn heatmap, the correlation factors can be visualized:

    plt.figure(figsize=(10,4))
    sns.heatmap(dfcorr, annot=True, cmap="Blues")
    plt.xticks(rotation=45)
    plt.title('Correlation Heatmap (Pearson method)')


    Finally, a seaborn regression plot (regplot) visualizes the bivariate analysis:

    fig = plt.figure(figsize = (10, 4))
    #plot scatter graph
    sns.set_style("darkgrid")
    ax=sns.regplot(x=df['Employees']/1000, y=df['Fortune 500rank'], data=df, scatter_kws={"color": "maroon"}, \
                    line_kws={"color": "green"})
    ax.invert_yaxis() # Rank values in Reversed order
    plt.xlabel('Number of Employees (in Thousands)')
    plt.ylabel('Fortune 500 Rank')
    plt.title('Correlation of Company size and Fortune 500 Rank')

    About the author

    Barry The Analyst worked from May 2022 until July 2023 with one big company in Ireland, and since July 2023 has been working with one big company in Germany. This blog was started as a private collection of notes in order to (re)develop, refresh and strengthen essential skills in the Data Analysis field. The content was generated along the learning path, building on a degree in Electrical Engineering and experience in C++/SQL. The private notes turned into a public blog to share the knowledge and provide support to anyone who finds this interesting...

     

    Who is this for?

    This is not yet another basic course explaining elementary concepts. Many basic features of the tools and technologies are skipped in the blog articles and portfolio projects presented here. If you are not an absolute beginner in SQL/databases, Excel, Python and Power BI, this might be useful material for advancing a career as a Data Analyst and/or Python Developer...
