Acquiring and Processing Information on the World's Largest Banks - Building an ETL Pipeline in Python



RESUME - Project overview


Jump to: SOLUTION - Project implementation

Problem Statement

Skills & Concepts demonstrated

Tasks & Questions

Insights

A multinational firm has hired me as a data engineer. My job is to access and process data as per requirements. Using Python ETL skills, real-world data about the world's largest banks will be extracted, transformed, and loaded into a database for further processing and querying.

Problem Statement


My boss asked me to compile the list of the top 10 largest banks in the world ranked by market capitalization in billion USD.

Further, I need to transform the data and store it in USD, GBP, EUR, and INR per the exchange rate information made available to me as a CSV file. I should save the processed information table locally in a CSV format and as a database table.

Managers from different countries will query the database table to extract the list and note the market capitalization value in their own currency.


Skills & Concepts demonstrated


This project demonstrates how to perform ETL operations on real-world data and make the processed information available for further use in different formats. The following skills are demonstrated:

  • Use web scraping techniques to extract information from any website as per requirement
  • Use Pandas data frames and dictionaries to transform data as per requirement
  • Load the processed information to CSV files and as database tables
  • Query the database tables using the SQLite3 and pandas libraries
  • Log the progress of the code properly

The following tasks are to be accomplished:


    Tasks & Questions


    1. Write a function log_progress() to log the progress of the code at different stages in a file code_log.txt. Use the list of log points provided to create log entries at every stage of the code.

    2. Extract the tabular information from the given URL under the heading 'By market capitalization' and save it to a dataframe.

  • Inspect the webpage and identify the position and pattern of the tabular information in the HTML code
  • Write the code for a function extract() to perform the required data extraction.
  • Execute a function call to extract() to verify the output.

    3. Transform the dataframe by adding columns for Market Capitalization in GBP, EUR and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.

  • Write the code for a function transform() to perform the said task.
  • Execute a function call to transform() and verify the output.

    4. Load the transformed dataframe to an output CSV file. Write a function load_to_csv(), execute a function call and verify the output.

    5. Load the transformed dataframe to an SQL database server as a table. Write a function load_to_db(), execute a function call and verify the output.

    6. Run queries on the database table. Write a function run_queries(), execute the given set of queries and verify the output:

  • Show the full list of the top 10 world banks with MC in USD
  • Extract the information for the London office: bank names and MC in GBP
  • Extract the information for the Berlin office: the average market capitalization of all the banks in EUR
  • Extract the information for the New Delhi office: only the names of the top 5 banks in INR

    7. Verify that the log entries have been completed at all stages by checking the contents of the file code_log.txt.


    Insights


    1. Full list of the world's 10 largest banks in USD


    2. London office - Top 10 banks list in GBP


    3. Berlin office - Average MC of all the banks in EUR
    The average MC is 196.55 Billion EUR, calculated for the world's 10 largest banks

    4. New Delhi office - Top 5 banks in INR


    5. Log file content



    SOLUTION - Project implementation


    Back to: RESUME - Project overview

    0.1-Installing libraries

    0.2-Importing modules and packages

    0.3-Initialization of known entities

    1-Write log function

    2-Extracting data using Webscraping and saving it to dataframe

    3-Transformation of data

    4-Loading to CSV

    5-Loading to database

    6-Running queries

    7-Invoking ETL functions & Checking log file content

    The following files can be downloaded from the GitHub repository:

  • CSV file that contains Exchange Rates
  • Jupyter notebook script
  • CSV file with loaded data
  • Database file with loaded data
  • Log file

    Before building the code, it's necessary to install the required libraries.


    0.1-Installing libraries


    The libraries needed for the code are:

  • requests - The library used for accessing the information from the URL.
  • bs4 - The library containing the BeautifulSoup function used for webscraping.
  • pandas - The library used for processing the extracted data, storing it in required formats, and communicating with the databases.
  • sqlite3 - The module required to create a database connection. It is part of the Python standard library, so it needs no installation.
  • numpy - The library listed for mathematical rounding operations. The final code uses Python's built-in round(), so it is optional here.
  • datetime - The module containing the function datetime used for extracting the timestamp for logging purposes. It is also part of the standard library.

    Only the third-party libraries have to be installed:

    !pip install requests
    !pip install bs4
    !pip install pandas
    !pip install numpy

    0.2-Importing modules and packages


    After installing the libraries, the required modules and packages are imported:

    import pandas as pd 
    import requests
    from bs4 import BeautifulSoup
    import sqlite3
    from datetime import datetime

    0.3-Initialization of known entities


    The data URL is the Wikipedia page from which data will be extracted using web scraping; its path is initialized as the variable url. The database name, table name, table attributes (upon extraction only), exchange rate CSV path, output CSV path and log file name are also initialized.

    url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
    db_name = 'Banks.db'
    table_name = 'Largest_banks' 
    table_attribs = ["Name", "MC_USD_Billion"]
    csv_path_exchange = 'https://github.com/berislav-vidakovic/Portfolio/blob/main/PythonETL/exchange_rate.csv'
    csv_path_load = 'Largest_banks_data.csv'
    log_file = 'code_log.txt'

    1-Write log function


    This function logs the progress of the code. It accepts the message to be logged and appends it to the text file code_log.txt.

    The format used for logging has the following syntax - time_stamp : message. Each log entry is written on a new line in the text file. The log entries are associated with each of the executed function calls.

    Task                        | Log message on completion
    ----------------------------|------------------------------------------------------------
    Declaring known values      | Preliminaries complete. Initiating ETL process
    Call extract() function     | Data extraction complete. Initiating Transformation process
    Call transform() function   | Data transformation complete. Initiating Loading process
    Call load_to_csv()          | Data saved to CSV file
    Initiate SQLite3 connection | SQL Connection initiated
    Call load_to_db()           | Data loaded to Database as a table, Executing queries
    Call run_queries()          | Process Complete
    Close SQLite3 connection    | Server Connection closed

    The table above shows the logging message at the end of each function call that follows.
    Implementation of the log function follows below.


    #LOGGING DATA ----------------------------------------------------------
    def log_progress(message): 
        ''' This function logs the mentioned message of a given stage of the
        code execution to a log file. Function returns nothing'''
        timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the portable directive for the abbreviated month name)
        now = datetime.now() # get current timestamp 
        timestamp = now.strftime(timestamp_format) 
        with open(log_file, "a") as f: # append, so earlier entries are kept
            f.write(timestamp + ' : ' + message + '\n') 
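
    A quick sanity check of the function (an illustrative call; the timestamp shown in the comment is a placeholder):

    log_progress("Test entry")
    # appends a line like the following to code_log.txt:
    # 2024-Jan-01-12:00:00 : Test entry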

    2-Extracting data using Webscraping and saving it to dataframe


    Data will be extracted from the webpage at the given URL:
    https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks.
    This is an archived snapshot of the Wikipedia page stored in the url variable; the snapshot guarantees a stable layout of the tables. The position of the required table is identified as the table under the heading "By market capitalization".



    The function extract() retrieves the information of the table to a Pandas data frame.

    def extract(url, table_attribs):
        ''' This function aims to extract the required information from the website and save it to a data frame. 
        The function returns the data frame for further processing. '''
    
        df = pd.DataFrame(columns=table_attribs)
        # load the entire web page as an HTML document in python
        html_page = requests.get(url).text
        # parse the text to enable extraction of relevant information
        data = BeautifulSoup(html_page, 'html.parser')
        # extract the rows of the required table (the first table on the page)
        tables = data.find_all('tbody')
        rows = tables[0].find_all('tr')
        # iterate over the rows, col->dict->df
        for row in rows:
            col = row.find_all('td')
            table_data_row = [item.text.strip() for item in col]
            if len(table_data_row) > 2:
                data_dict = { table_attribs[0]: table_data_row[1],
                              table_attribs[1]: table_data_row[2] }
                df1 = pd.DataFrame(data_dict, index=[0])
                df = pd.concat([df, df1], ignore_index=True)
    
        # typecast the market cap values to float for further calculations
        df['MC_USD_Billion'] = df['MC_USD_Billion'].astype('float')
        log_progress('Data extraction complete. Initiating Transformation process')
        return df

    Values are typecast to float format, and the relevant log entry is made.
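
    A short verification call confirms the extraction worked (the printed values depend on the page content at run time):

    # Verify the extraction step
    df = extract(url, table_attribs)
    print(df.head())    # first rows: Name, MC_USD_Billion
    print(df.dtypes)    # MC_USD_Billion should be float64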



    3-Transformation of data


    The Transform function needs to perform the following tasks:

  • Read the exchange rate CSV file (illustrative structure shown below) and convert the contents to a lookup so that the contents of the first column are the keys and the contents of the second column are the corresponding values
  • Add 3 different columns to the dataframe, viz. MC_GBP_Billion, MC_EUR_Billion and MC_INR_Billion, each containing the content of MC_USD_Billion scaled by the corresponding exchange rate factor. The resulting data have to be rounded to 2 decimal places.
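
    The exchange rate file is expected to have the following shape (the rates below are placeholders; the actual values come from the CSV file in the repository):

    Currency,Rate
    EUR,0.93
    GBP,0.80
    INR,82.95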
    The implementation of the transform() function follows below.

    def transform(df, csv_path):
        ''' This function accesses the CSV file for exchange rate information, and adds three columns to the data frame, 
        each containing the transformed version of the Market Cap column in the respective currency'''
    
        # read the exchange rates and index them by currency code
        dfExchange = pd.read_csv(csv_path)
        dfExchange.set_index('Currency', inplace=True)
    
        eur = dfExchange.at['EUR', 'Rate']
        gbp = dfExchange.at['GBP', 'Rate']
        inr = dfExchange.at['INR', 'Rate']
    
        # scale the USD market cap by each rate and round to 2 decimal places
        df['MC_EUR_Billion'] = round(df['MC_USD_Billion'] * eur, 2) 
        df['MC_GBP_Billion'] = round(df['MC_USD_Billion'] * gbp, 2)  
        df['MC_INR_Billion'] = round(df['MC_USD_Billion'] * inr, 2)
    
        log_progress('Data transformation complete. Initiating Loading process')
        return df # columns are added in place; df is returned for convenience

    The resulting dataframe contains three more columns.
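
    For reference, a plain dictionary can serve as the rate lookup, as the task statement suggests; this sketch behaves the same as the set_index()/.at lookup above (csv_path is the argument of transform()):

    # Equivalent transformation using a dictionary as the rate lookup
    exchange_rate = pd.read_csv(csv_path).set_index('Currency')['Rate'].to_dict()
    df['MC_GBP_Billion'] = [round(x * exchange_rate['GBP'], 2) for x in df['MC_USD_Billion']]
    df['MC_EUR_Billion'] = [round(x * exchange_rate['EUR'], 2) for x in df['MC_USD_Billion']]
    df['MC_INR_Billion'] = [round(x * exchange_rate['INR'], 2) for x in df['MC_USD_Billion']]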



    4-Loading to CSV


    The function load_to_csv() loads the transformed data frame to a CSV file at the path initialized previously. The relevant log entry is made.

    def load_to_csv(df, csv_path):
        ''' This function saves the final data frame as a CSV file in
        the provided path. Function returns nothing. '''
        df.to_csv(csv_path, index=False)
        log_progress('Data saved to CSV file')
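
    The written file can be read back as a quick check, using the output path initialized earlier:

    # Read the CSV back to verify the load step
    print(pd.read_csv(csv_path_load).head())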

    5-Loading to Database


    The function load_to_db() loads the transformed data frame to an SQL database as a table, using the database and table names initialized previously and the DataFrame.to_sql() function. Before calling this function, a connection to the SQLite3 database Banks.db is initiated. This connection object is passed, along with the required table name Largest_banks and the transformed data frame, to the load_to_db() function call. The relevant log entry is made.
    Upon a successful function call, the table contains the required data.

    def load_to_db(df, sql_connection, table_name):
        ''' This function saves the final data frame to a database
        table with the provided name. Function returns nothing.'''
        df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
        log_progress("Data loaded to Database as a table, Executing queries")
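
    An optional read-back check after loading (assumes conn is the open sqlite3 connection created in the main script below):

    # Confirm the table exists and the row count matches the dataframe
    check = pd.read_sql(f"SELECT COUNT(*) AS cnt FROM {table_name}", conn)
    print(check)   # expected: cnt equal to len(df), i.e. 10 rows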

    6-Running queries


    The function run_queries() accepts the SQLite3 connection object and the query statement, and generates the output of the query using the pandas.read_sql() function. The query statement is printed along with the query output.

    def run_queries(sql_connection, query_statement):
        ''' This function runs the passed query, prints the query statement
        along with its result, and adds both to the log file. Function returns nothing.'''
        query_output = pd.read_sql(query_statement, sql_connection)
        print(query_statement)
        print(query_output)
        log_progress(f"Running query... {query_statement}")
        log_progress(f"...Query output:\n{query_output}")

    7-Invoking ETL functions & checking log file


    Upon successful completion of execution, the code_log.txt file should contain all relevant entries as proof that the ETL process ran as expected.

    log_progress("Preliminaries complete. Initiating ETL process")
    
    # EXTRACT DATA
    df = extract(url, table_attribs)
    
    # TRANSFORM DATA
    # Adding "?raw=true" at the end of the GitHub URL to get the raw file link
    transform(df, csv_path_exchange + "?raw=true")
    
    # LOAD DATA
    load_to_csv(df, csv_path_load)
    
    # create a connection to the SQLite3 database Banks.db
    conn = sqlite3.connect(db_name)
    log_progress("SQL Connection initiated")
    
    load_to_db(df, conn, table_name)
    
    run_queries(conn, "SELECT Name, MC_USD_Billion FROM Largest_banks")
    run_queries(conn, "SELECT Name, MC_GBP_Billion FROM Largest_banks")
    run_queries(conn, "SELECT AVG(MC_EUR_Billion) AS Average_EUR FROM Largest_banks")
    run_queries(conn, "SELECT Name, MC_INR_Billion FROM Largest_banks LIMIT 5")
    
    # close DB connection
    conn.close()
    log_progress("Server Connection closed")

    Finally, all the relevant entries related to the stages of code execution should be present in the log file.
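
    The log file content then looks roughly as follows (timestamps are placeholders; the query output entries are abbreviated):

    2024-Jan-01-12:00:00 : Preliminaries complete. Initiating ETL process
    2024-Jan-01-12:00:01 : Data extraction complete. Initiating Transformation process
    2024-Jan-01-12:00:01 : Data transformation complete. Initiating Loading process
    2024-Jan-01-12:00:01 : Data saved to CSV file
    2024-Jan-01-12:00:01 : SQL Connection initiated
    2024-Jan-01-12:00:02 : Data loaded to Database as a table, Executing queries
    2024-Jan-01-12:00:02 : Running query... SELECT Name, MC_USD_Billion FROM Largest_banks
    ...
    2024-Jan-01-12:00:03 : Server Connection closed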


    About the author

    Barry The Analyst worked from May 2022 until July 2023 with a big company in Ireland, and since July 2023 has been working with a big company in Germany. This blog was started as a private note collection in order to (re)develop, refresh and strengthen essential skills in the field of Data Analysis. The content was created along the learning path, building on a degree in Electrical Engineering and experience in C++/SQL. The private notes turned into a public blog to share the knowledge and provide support to anyone who finds it interesting...


    Who is this for?

    This is not yet another basic course explaining elementary concepts. Many basic features of the tools and technologies are skipped in the blog articles and portfolio projects presented here. If you are not an absolute beginner in SQL/databases, Excel, Python and Power BI, this might be useful material for advancing a career as a Data Analyst and/or Python Developer...
