Acquiring and Processing Information on the World's Largest Banks - Building an ETL Pipeline in Python



RESUME - Project overview


Jump to: SOLUTION - Project implementation

Problem Statement

Skills & Concepts demonstrated

Tasks & Questions

Insights

A multinational firm has hired me as a data engineer. My job is to access and process data as per requirements. Using Python ETL skills, real-world data about the world's largest banks will be extracted, transformed, and loaded into a database for further processing and querying.

Problem Statement


My boss asked me to compile the list of the top 10 largest banks in the world ranked by market capitalization in billion USD.

Further, I need to transform the data and store it in USD, GBP, EUR, and INR per the exchange rate information made available to me as a CSV file. I should save the processed information table locally in a CSV format and as a database table.

Managers from different countries will query the database table to extract the list and note the market capitalization value in their own currency.


Skills & Concepts demonstrated


This project demonstrates how to perform ETL operations on real-world data and make the processed information available for further use in different formats. The following skills are demonstrated:

  • Use web scraping techniques to extract information from any website as per requirement
  • Use Pandas data frames and dictionaries to transform data as per requirement
  • Load the processed information to CSV files and as database tables
  • Query the database tables using the SQLite3 and pandas libraries
  • Log the progress of the code properly

The following tasks are to be accomplished:


    Tasks & Questions


    1. Write a function log_progress() to log the progress of the code at different stages in a file code_log.txt. Use the list of log points provided to create log entries at every stage of the code.

    2. Extract the tabular information from the given URL under the heading 'By market capitalization' and save it to a dataframe.

  • Inspect the webpage and identify the position and pattern of the tabular information in the HTML code
  • Write the code for a function extract() to perform the required data extraction.
  • Execute a function call to extract() to verify the output.

    3. Transform the dataframe by adding columns for Market Capitalization in GBP, EUR and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.

  • Write the code for a function transform() to perform the said task.
  • Execute a function call to transform() and verify the output.

    4. Load the transformed dataframe to an output CSV file. Write a function load_to_csv(), execute a function call and verify the output.

    5. Load the transformed dataframe to an SQL database server as a table. Write a function load_to_db(), execute a function call and verify the output.

    6. Run queries on the database table. Write a function run_queries(), execute the given set of queries and verify the output:

  • Show the full list of the top 10 world banks with MC in USD
  • Extract the information for the London office: bank names and MC in GBP
  • Extract the information for the Berlin office: the average market capitalization of all the banks in EUR
  • Extract the information for the New Delhi office: only the names of the top 5 banks in INR

    7. Verify that the log entries have been completed at all stages by checking the contents of the file code_log.txt.


    Insights


    1. Full list of the world's 10 largest banks in USD


    2. London office - Top 10 banks list in GBP


    3. Berlin office - Average MC of all the banks in EUR
    The average MC is 196.55 Billion EUR, calculated for the world's 10 largest banks

    4. New Delhi office - Top 5 banks in INR


    5. Log file content



    SOLUTION - Project implementation


    Back to: RESUME - Project overview

    0.1-Installing libraries

    0.2-Importing modules and packages

    0.3-Initialization of known entities

    1-Write log function

    2-Extracting data using Webscraping and saving it to dataframe

    3-Transformation of data

    4-Loading to CSV

    5-Loading to database

    6-Running queries

    7-Invoking ETL functions & Checking log file content

    The following files can be downloaded from the GitHub repository:

  • CSV file that contains Exchange Rates
  • Jupyter notebook script
  • CSV file with loaded data
  • Database file with loaded data
  • Log file

    Before building the code, it's necessary to install the required libraries.


    0.1-Installing libraries


    The libraries needed for the code are:

  • requests - The library used for accessing the information from the URL.
  • bs4 - The library containing the BeautifulSoup function used for webscraping.
  • pandas - The library used for processing the extracted data, storing it in required formats, and communicating with the databases.
  • sqlite3 - The module required to create a database connection. It is part of the Python standard library, so it needs no installation.
  • numpy - The library listed for mathematical rounding operations. The final code uses Python's built-in round(), so it is optional here.
  • datetime - The module containing the function datetime used for extracting the timestamp for logging purposes. It is also part of the standard library.

    Only the third-party libraries have to be installed:

    !pip install requests
    !pip install bs4
    !pip install pandas
    !pip install numpy

    0.2-Importing modules and packages


    After installing the libraries, the required modules and packages are imported:

    import pandas as pd 
    import requests
    from bs4 import BeautifulSoup
    import sqlite3
    from datetime import datetime

    0.3-Initialization of known entities


    The data URL is the Wikipedia page from which data will be extracted using web scraping; its path is initialized as the variable url. The database name, table name, table attributes (upon extraction only), exchange rate CSV path, output CSV path and log file name are also initialized.

    url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
    db_name = 'Banks.db'
    table_name = 'Largest_banks' 
    table_attribs = ["Name", "MC_USD_Billion"]
    csv_path_exchange = 'https://github.com/berislav-vidakovic/Portfolio/blob/main/PythonETL/exchange_rate.csv'
    csv_path_load = 'Largest_banks_data.csv'
    log_file = 'code_log.txt'

    1-Write log function


    This function logs the progress of the code. It accepts the message to be logged and appends it to the text file code_log.txt.

    The format used for logging has the following syntax - time_stamp : message. Each log entry is written on a new line in the text file. The log entries are associated with each of the executed function calls.

    Task                        | Log message on completion
    ----------------------------|------------------------------------------------------------
    Declaring known values      | Preliminaries complete. Initiating ETL process
    Call extract() function     | Data extraction complete. Initiating Transformation process
    Call transform() function   | Data transformation complete. Initiating Loading process
    Call load_to_csv()          | Data saved to CSV file
    Initiate SQLite3 connection | SQL Connection initiated
    Call load_to_db()           | Data loaded to Database as a table, Executing queries
    Call run_queries()          | Process Complete
    Close SQLite3 connection    | Server Connection closed

    The table above shows the logging message at the end of each function call that follows.
    Implementation of the log function follows below.


    #LOGGING DATA ----------------------------------------------------------
    def log_progress(message): 
        ''' This function logs the mentioned message of a given stage of the
        code execution to a log file. Function returns nothing'''
        timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the portable directive for the abbreviated month name)
        now = datetime.now() # get current timestamp 
        timestamp = now.strftime(timestamp_format) 
        with open(log_file, "a") as f: # append, so earlier entries are kept
            f.write(timestamp + ' : ' + message + '\n') 
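
    A quick sanity check of the function (an illustrative call; the timestamp shown in the comment is a placeholder):

    log_progress("Test entry")
    # appends a line like the following to code_log.txt:
    # 2024-Jan-01-12:00:00 : Test entry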

    2-Extracting data using Webscraping and saving it to dataframe


    Data will be extracted from the webpage at the given URL:
    https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks.
    This is an archived snapshot of the Wikipedia page stored in the url variable; the snapshot guarantees a stable layout of the tables. The position of the required table is identified as the table under the heading "By market capitalization".



    The function extract() retrieves the information of the table to a Pandas data frame.

    def extract(url, table_attribs):
        ''' This function aims to extract the required information from the website and save it to a data frame. 
        The function returns the data frame for further processing. '''
    
        df = pd.DataFrame(columns=table_attribs)
        # load the entire web page as an HTML document in python
        html_page = requests.get(url).text
        # parse the text to enable extraction of relevant information
        data = BeautifulSoup(html_page, 'html.parser')
        # extract the rows of the required table (the first table on the page)
        tables = data.find_all('tbody')
        rows = tables[0].find_all('tr')
        # iterate over the rows, col->dict->df
        for row in rows:
            col = row.find_all('td')
            table_data_row = [item.text.strip() for item in col]
            if len(table_data_row) > 2:
                data_dict = { table_attribs[0]: table_data_row[1],
                              table_attribs[1]: table_data_row[2] }
                df1 = pd.DataFrame(data_dict, index=[0])
                df = pd.concat([df, df1], ignore_index=True)
    
        # typecast the market cap values to float for further calculations
        df['MC_USD_Billion'] = df['MC_USD_Billion'].astype('float')
        log_progress('Data extraction complete. Initiating Transformation process')
        return df

    Values are typecast to float format, and the relevant log entry is made.
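
    A short verification call confirms the extraction worked (the printed values depend on the page content at run time):

    # Verify the extraction step
    df = extract(url, table_attribs)
    print(df.head())    # first rows: Name, MC_USD_Billion
    print(df.dtypes)    # MC_USD_Billion should be float64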



    3-Transformation of data


    The Transform function needs to perform the following tasks:

  • Read the exchange rate CSV file (illustrative structure shown below) and convert the contents to a lookup so that the contents of the first column are the keys and the contents of the second column are the corresponding values
  • Add 3 different columns to the dataframe, viz. MC_GBP_Billion, MC_EUR_Billion and MC_INR_Billion, each containing the content of MC_USD_Billion scaled by the corresponding exchange rate factor. The resulting data have to be rounded to 2 decimal places.
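
    The exchange rate file is expected to have the following shape (the rates below are placeholders; the actual values come from the CSV file in the repository):

    Currency,Rate
    EUR,0.93
    GBP,0.80
    INR,82.95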
    The implementation of the transform() function follows below.

    def transform(df, csv_path):
        ''' This function accesses the CSV file for exchange rate information, and adds three columns to the data frame, 
        each containing the transformed version of the Market Cap column in the respective currency'''
    
        # read the exchange rates and index them by currency code
        dfExchange = pd.read_csv(csv_path)
        dfExchange.set_index('Currency', inplace=True)
    
        eur = dfExchange.at['EUR', 'Rate']
        gbp = dfExchange.at['GBP', 'Rate']
        inr = dfExchange.at['INR', 'Rate']
    
        # scale the USD market cap by each rate and round to 2 decimal places
        df['MC_EUR_Billion'] = round(df['MC_USD_Billion'] * eur, 2) 
        df['MC_GBP_Billion'] = round(df['MC_USD_Billion'] * gbp, 2)  
        df['MC_INR_Billion'] = round(df['MC_USD_Billion'] * inr, 2)
    
        log_progress('Data transformation complete. Initiating Loading process')
        return df # columns are added in place; df is returned for convenience

    The resulting dataframe contains three more columns.
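
    For reference, a plain dictionary can serve as the rate lookup, as the task statement suggests; this sketch behaves the same as the set_index()/.at lookup above (csv_path is the argument of transform()):

    # Equivalent transformation using a dictionary as the rate lookup
    exchange_rate = pd.read_csv(csv_path).set_index('Currency')['Rate'].to_dict()
    df['MC_GBP_Billion'] = [round(x * exchange_rate['GBP'], 2) for x in df['MC_USD_Billion']]
    df['MC_EUR_Billion'] = [round(x * exchange_rate['EUR'], 2) for x in df['MC_USD_Billion']]
    df['MC_INR_Billion'] = [round(x * exchange_rate['INR'], 2) for x in df['MC_USD_Billion']]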



    4-Loading to CSV


    The function load_to_csv() loads the transformed data frame to a CSV file at the path initialized previously. The relevant log entry is made.

    def load_to_csv(df, csv_path):
        ''' This function saves the final data frame as a CSV file in
        the provided path. Function returns nothing. '''
        df.to_csv(csv_path, index=False)
        log_progress('Data saved to CSV file')
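
    The written file can be read back as a quick check, using the output path initialized earlier:

    # Read the CSV back to verify the load step
    print(pd.read_csv(csv_path_load).head())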

    5-Loading to Database


    The function load_to_db() loads the transformed data frame to an SQL database as a table, using the database and table names initialized previously and the DataFrame.to_sql() function. Before calling this function, a connection to the SQLite3 database Banks.db is initiated. This connection object is passed, along with the required table name Largest_banks and the transformed data frame, to the load_to_db() function call. The relevant log entry is made.
    Upon a successful function call, the table contains the required data.

    def load_to_db(df, sql_connection, table_name):
        ''' This function saves the final data frame to a database
        table with the provided name. Function returns nothing.'''
        df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
        log_progress("Data loaded to Database as a table, Executing queries")
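
    An optional read-back check after loading (assumes conn is the open sqlite3 connection created in the main script below):

    # Confirm the table exists and the row count matches the dataframe
    check = pd.read_sql(f"SELECT COUNT(*) AS cnt FROM {table_name}", conn)
    print(check)   # expected: cnt equal to len(df), i.e. 10 rows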

    6-Running queries


    The function run_queries() accepts the SQLite3 connection object and the query statement, and generates the output of the query using the pandas.read_sql() function. The query statement is printed along with the query output.

    def run_queries(sql_connection, query_statement):
        ''' This function runs the passed query, prints the query statement
        along with its result, and adds both to the log file. Function returns nothing.'''
        query_output = pd.read_sql(query_statement, sql_connection)
        print(query_statement)
        print(query_output)
        log_progress(f"Running query... {query_statement}")
        log_progress(f"...Query output:\n{query_output}")

    7-Invoking ETL functions & checking log file


    Upon successful completion of execution, the code_log.txt file should contain all relevant entries as proof that the ETL process ran as expected.

    log_progress("Preliminaries complete. Initiating ETL process")
    
    # EXTRACT DATA
    df = extract(url, table_attribs)
    
    # TRANSFORM DATA
    # Adding "?raw=true" at the end of the GitHub URL to get the raw file link
    transform(df, csv_path_exchange + "?raw=true")
    
    # LOAD DATA
    load_to_csv(df, csv_path_load)
    
    # create a connection to the SQLite3 database Banks.db
    conn = sqlite3.connect(db_name)
    log_progress("SQL Connection initiated")
    
    load_to_db(df, conn, table_name)
    
    run_queries(conn, "SELECT Name, MC_USD_Billion FROM Largest_banks")
    run_queries(conn, "SELECT Name, MC_GBP_Billion FROM Largest_banks")
    run_queries(conn, "SELECT AVG(MC_EUR_Billion) AS Average_EUR FROM Largest_banks")
    run_queries(conn, "SELECT Name, MC_INR_Billion FROM Largest_banks LIMIT 5")
    
    # close DB connection
    conn.close()
    log_progress("Server Connection closed")

    Finally, all the relevant entries related to the stages of code execution should be present in the log file.
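
    The log file content then looks roughly as follows (timestamps are placeholders; the query output entries are abbreviated):

    2024-Jan-01-12:00:00 : Preliminaries complete. Initiating ETL process
    2024-Jan-01-12:00:01 : Data extraction complete. Initiating Transformation process
    2024-Jan-01-12:00:01 : Data transformation complete. Initiating Loading process
    2024-Jan-01-12:00:01 : Data saved to CSV file
    2024-Jan-01-12:00:01 : SQL Connection initiated
    2024-Jan-01-12:00:02 : Data loaded to Database as a table, Executing queries
    2024-Jan-01-12:00:02 : Running query... SELECT Name, MC_USD_Billion FROM Largest_banks
    ...
    2024-Jan-01-12:00:03 : Server Connection closed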


    About the author

    Barry The Analyst worked from May 2022 until July 2023 with a big company in Ireland, and since July 2023 has been working with a big company in Germany. This blog was started as a private note collection in order to (re)develop, refresh and strengthen essential skills in the field of Data Analysis. The content was created along the learning path, building on a degree in Electrical Engineering and experience in C++/SQL. The private notes turned into a public blog to share the knowledge and provide support to anyone who finds it interesting...


    Who is this for?

    This is not yet another basic course explaining elementary concepts. Many basic features of the tools and technologies are skipped in the blog articles and portfolio projects presented here. If you are not an absolute beginner in SQL/databases, Excel, Python and Power BI, this might be useful material for advancing a career as a Data Analyst and/or Python Developer...
